Comp 7 / Bio 40 Lab 6

November 2, 2016

Purpose of this lab

To start working with gene expression data and interpretation. To extend students' programming expertise to more practical problems.

Lab problems

  1. Differential expression using GEO2R

    In this problem, we'll analyze differential expression in two different kinds of lymphoma. Go to the GEO web site, and look up data set (GEO "series") GSE68895. Scroll down through the resulting page and select "Analyze with GEO2R."

    The data set comes from the paper "Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning", by M. Shipp et al., Nature Medicine 8:68-74, 2002. You will use their data to analyse gene expression differences between diffuse large B-cell lymphomas and follicular lymphomas.

    The GEO data set contains normalized expression data for 58 diffuse large B-cell lymphoma samples and 19 follicular lymphoma samples.

    The first thing you'll need to do is define groups of samples. Click "Define groups", put your cursor in the white box, and enter "DLBCL" (for diffuse large B-cell lymphoma). Do the same and type "follicular" (the other type of lymphoma studied here). Click "Define groups" again to close the pull-down.

    Now, in the header line above the sample names, click on "Disease state" to sort the samples by disease classification. Highlight all the Diiffuse large B-cell Lymphoma samples. Click Define groups again and assign them to the DLBCL group by clicking the box to the left of the name DLBCL. Do the same for the follicular lymphoma samples.

    To do an initial analysis, click "Top 250". Wait for the results. Click on a row of results to see a bargraph of expression levels in all the samples for that gene.

    1. What are the three most significantly differentially expressed probes?
    2. To which gene symbols do they correspond (if this is known)?
    3. In in which group of samples are they more highly expressed?
    4. Do they have positive or negative t scores?
    5. Are there more or fewer than 250 genes with BH-adjusted p-values below 0.01?
    6. Are there more or fewer than 250 genes with Bonferroni-adjusted p-values below 0.01? (You will need to adjust the correction method on the Options page, and then click Recalculate on the main GEO2R page with the Top250 results, to see the Bonferroni-adjusted p-values.)

    Leaving the Bonferroni adjustment in place for now, download all the results from this analysis by clicking on the "save all results" link. When the text page loads, right click on it and save it as a text file. You can now load this tab-delimited file into Excel or the spreadsheet of your choice. Sort it by adjusted p-value in Excel. There should now be both positive and negative t scores among the top 10 or 20 entries, reflecting significant changes in either direction.

    1. Exactly how many probes have a Bonferroni adjusted p-value below 1e-03 (.001)?

    Next, let's analyze these genes in DAVID. Select just the gene symbols from those that have a Bonferroni adjusted p-value less than .001. Copy them from Excel and paste them into the "paste a list" window in DAVID, on the Upload tab. Select "official gene symbol" as the identifier in step 2, select gene list in step 3, and submit. You will need to indicate that you are using Homo sapiens in both the List tab and the Background tab. Then create a functional annotation chart in DAVID with the default annotation. Sort the results by "Benjamini" (adjusted p-value for enrichment of the terms)

    1. What is the most significantly enriched term?

    Further exploration in PubMed may help explain why genes involved in this process might be differentially expressed in different types of lymphomas.

  2. Python: Student ID lookup

    Save the file studentnames.txt from the labs web page. This is a file containing a list of exactly 50 student names and id numbers, where each id is a number from 1 to 50 (and each number is used exactly once).

    Write a python program that opens and reads the tab-delimited list of names from a file. Specifically:

    1. Open the file for reading.

    2. Read each line, one at a time. Split it up into fields using the split() command. Store the resulting names (both first and last, together in one string) in a list: the list index of the name should be the id number that you read on the same line as that name.

    Next, you will write the part of the program that allows users to query this "database" of names and id numbers. Specifically, you should write a loop that asks the user for a student id number, and then print the name that corresponds to that id number. Repeat the loop until the number the user enters is outside the range 1-50.

    Note that the id numbers range from 1-50 but Python list indices start counting at zero. It is up to you to decide how to handle this discrepancy; use comments to explain these sorts of choices in your code. (Recall that comments start with the number sign.)

    An example run of your program (with a different student list) might look like:

    Please enter an id number:  12
    Tom Brady
    Please enter an id number:  2
    Derek Jeter
    Please enter an id number:  33
    Larry Bird
    Please enter an id number:  34
    David Ortiz
    Please enter an id number:  87
    Bye! 
    


Going Further:

    If you have time during lab or after, we strongly encourage you to explore GenePattern as well, so you have experience with two different approaches to differential gene expression analysis before starting Project 2. To get started with GenePattern, complete the Tutorial below, then you can explore the same data set you did with GEO2R in the following problem.

    There are a couple of additional programming questions that follow. Those of you interested in extra programming practice should try some of these offline if there isn't time in lab.

  1. The GenePattern Tutorial

    Open a window with the Firefox browser. There are other browsers available on the lab machines too, but Firefox is the most hassle-free web browser for today's analyses. Type the following URL directly into the Firefox window, or follow the link in the online version of the lab:

    http://www.broadinstitute.org/cancer/software/genepattern/
    
    You will have entered the GenePattern web site.

    Click on the link "GenePattern Basics" 10-minute Tutorial in the center of the top part of the page. You will do the sections titled "Start GenePattern," "Run an Analysis," and "View the Analysis Results."

    In the "Start GenePattern" section, use the Broad public server with the link they give you. follow the instructions to create an account. Don't worry, they won't spam you or sell your information anywhere.

    Note that the version of GenePattern used to create the tutorial looks a little different from the current one. In particular, to find a module like "ComparativeMarkerSelection," just start typing that into the Search Modules & Pipelines slot on the upper left.

    Following the tutorial, run ComparativeMarkerSelection, then ComparativeMarkerSelectionViewer, as described in the tutorial.

    In ComparativeMarkerSelection, we suggest that you edit the number of permutations to 1000 rather than 10,000 (the default) to speed things up a little.

    For the dataset files, the ftp locations indicated by the tutorial won't work from the lab machines. But, you can get at the same files by clicking on Add Path or URL, then clicking on "workshop_files", and choosing all_aml_train.gct (different format than the .res file from the tutorial, but the same data), and for the "cls" file, all_aml_train.cls. After specifying the file, click "Select" to load it.

    Under "View the Analysis Results," note that the little downward-pointing blue arrow icon, which is supposed to display a menu of commands you can use to work with the output of the ComparativeMarkerSelection run, now looks like down-left pointing arrow.

    The ComparativeMarkerSelectionViewer shows you a plot of the score (in this case, the t-score for differential expression) for all the genes on the array, in sorted order. You can use this to filter the gene list to choose just the significantly differentially-expressed genes.

    To do so, select Edit/Filter Features. You will get a popup window whose first field is the criterion on which you will filter. Select FDR(BH) to choose the false-discovery rate, and choose <= 0.05. You will get approximately 333 genes that are upregulated in ALL and 284 in AML (these are abbreviations for two types of leukemia). The list at the bottom will contain only those genes or identifiers.

    Note that the "Features" in the table, with names like L07956_at, are technically called Affymetrix "qualifiers." Affymetrix is the company that makes the particular type of microarray used in this experiment, and a qualifier is the name given to a set of probes for a specific gene or transcript. The "Description" column in the table tells you the name of the corresponding gene.

    To export the list of genes (qualifiers), select all the rows remaining in the table at the bottom of the page (after you have filtered the list as above), and then choose File/Save Feature List to download the list. Click the button next to "Download," type the name you want the file to have, and click OK. The file that downloads will contain a list of the differentially-expressed qualifiers that you have selected. (We will do this again in project 2.)

    Upload this file with your lab submission. Note that the newline characters make this file hard to view properly in Notepad; that is okay.


  2. Gene expression patterns in B-cell lymphoma

    Save the data files from the lab page: dlbcl_vs_fscc.res and dlbcl_vs_fscc.cls.

    The data set comes from the paper "Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning", by M. Shipp et al., Nature Medicine 8:68-74, 2002. You will use their data to analyse gene expression differences between diffuse large B-cell lymphomas and follicular lymphomas.

    The first file contains the normalized expression data for 58 diffuse large B-cell lymphoma samples and 19 follicular lymphoma samples. The second one is the "class" file designating which samples correspond to which phenotype, or class. In this case, the zeros correspond to the diffuse large B-cell lymphomas (DLBCL) and ones correspond to follicular lymphomas (FSCC). Your job is to identify differentially expressed genes in this data set.

    Repeat the same sort of analysis that you did in the tutorial, but using this data set. Start from the front page of the GenePattern web site: (http://genepattern.broadinstitute.org ).

    Go back to the "Modules" tab and find ComparativeMarkerSelection again.

    Upload dlbcl_vs_fscc.res as your input file, and dlbcl_vs_fscc.cls as the class file (you can either browse to the files' locations, or drag and drop the file inside the dotted rectangles). For "number of permutations," again type 1000. (We are again using fewer permutations than we might want, choosing speed over accuracy because we have limited time.) Change "log transformed data" from "no" to "yes." (You can look at the file dlbcl_vs_fscc.res in a text editor to see what the values look like, and note that compared to the all_aml_train.res file, the values appear to be in log scale.) For the rest of the options, choose the default parameters. Click on "Run," and wait until the status indicates that the run has completed the computation needed to identify differential expression. When it is done, it adds a link to the resulting file below the dialog box, just as in the tutorial.

    When it is completed, click the little curved arrow to the left of the link to the "dlbcl_vs_fscc.comp.marker.odf" file. When a sliding menu appears, click "ComparativeMarkerSelectionViewer" under the heading "Send to Module" to send the .odf file to the viewer module. You will see the .odf file already uploaded to the module. For dataset filename, select the dlbcl_vs_fscc.res file (the same file that you uploaded to "ComparativeMarkerSelection"). Hit "Run". Click "open visualizer" when this link appears.

    You may see a small alert box at the top of the screen, just under the URL field on Firefox, asking you whether you want to allow genepattern.broadinstitute.org to run Java Platform SE 8 U. Click "Allow," and on the resulting popup box, hit "Allow Now" to confirm. If you are asked to give permission to run the application, do so. If you are asked to update Java, click "Later".

    The number upregulated in each of the two classes is shown on the right of the Red/Blue plot of the t-scores, above the table. Note that there are a total of 6299 genes, or "Features", on the array (the sum of the numbers listed as upregulated in each of the two classes).

    Now, filter the gene list. To do so, select Edit/Filter Features. Choose FDR(BH) and set a maximum value of 0.1. This will select only those genes with an adjusted p-value below 0.1.

    1. How many transcripts have an FDR(BH) adjusted p value that is less than or equal to 0.1? In other words, how many features / genes / Affymetrix qualifiers (assume they are all the same thing, for the sake of this problem) pass this filter?

    2. Of the genes meeting this criterion, how many are upregulated in DLBCL compared to FSCC?

    3. How many genes have a Bonferroni-adjusted p-value of below 0.1? This isn't shown in the table by default, but you can still filter on it.

    4. Recall that "The FDR represents the expected proportion of non-marker genes (false positives) within the set of genes declared to be differentially expressed" ( Using GenePattern for Gene Expression Analysis, Heidi Kuehn et al., Curr Protoc Bioinformatics, 2008). Based on this, what is the chance that the gene at the top of the list is not really differentially expressed between these two types of lymphoma samples, but appears on our list just by chance?

    Save the filtered list of genes. You should be able to do this by selecting all the genes that have passed your filter, and then choosing File/Save Feature List. Use the browser to navigate to the directory you're working in (either your home directory or a subdirectory of it). Click the button next to "Download," type the name you want the file to have (the computer should append the extension ".txt" automatically, so that viewers can recognize it as a text file), and click OK.

    Upload your filtered gene list with your lab writeup.

    Next, perform a functional analysis by uploading this list to DAVID. (See the link on the class Links page.) Choose "AFFYMETRIX_3PRIME_IVT_ID" as the type of gene identifier, and "KEGG pathways" as the only type of annotation.

    1. Which KEGG pathways are identified as enriched in the significantly differentially expressed genes?


  3. Python: GC content

    Write a program that computes the GC content of an input string containing a nucleotide sequence.

    To do this, first write a function that takes a string as an argument. Assume the string contains a nucleotide sequence. Keep a counter of how many C's and G's you've seen so far. (What should the value of this counter be when you start?)

    Write a for loop that steps through each character in the sequence. If the character is 'C' or 'G', increment your counter.

    Have the function compute, and then return, the GC percentage. (Hint: what other values do you need in order to calculate this number? How can you get them?)

    In the main part of the program, use raw_input() to read in a sequence. Have the program call the function on the string you read and print out the GC content value that the function returns.

    Test your program on some short test strings where you know the right answer.

    If you want to make your program more robust, you could

    Think about how you might want to do these things, and where in the program they would fit best.

  4. Find out whether coca-cola bottles have your name on them.

    Save the file cocacola_names.txt from the labs web page. This file contains all the first names that have appeared on coca-cola bottle labels. Read the names from the file and store them in a Python list.

    Then, have the program ask the user for their name (using the raw_input() function) and store the input string in a variable named myname. Finally, loop through the list with your choice of loop: for or while, and compare myname to each item in the list until you either find your name in the list or determine that it is not there. Print a suitable message saying what you found.