Comp 7 / Bio 40 - Fall 2019

Comp 7 / Bio 40 Lab 5:
Gene Expression and Functional Analysis

October 29, 2019

Purpose of this lab

To introduce gene expression databases and methods, and to (optionally) introduce dicts and while loops in Python.

Lab problems

Finding gene expression experiments in the GEO database
Go to the GEO web site. Under the "Search for studies at GEO DataSets" link, search for a data set related to “prolymphocytic leukemia.”
1. There should be only one record whose ID number starts with “GDS” (GEO Data Set) as opposed to GSE (GEO Series). You can find the single GDS data set by clicking the "DataSets" category under the filter menu on the left, under the heading "Entry type." What is the "Accession number" for that record (it starts with GDS)? What is its "Series" number (starts with GSE)?
Every GDS record is a curated version of some GSE record. The curation allows for additional analyses to be performed (see “Going Further,” below). GEO is no longer curating GDS records from GSEs (although there are some efforts to automate the process, so GDS records may start being created again soon).
Click on the name of that data set to get to a page that describes it more fully. Find the link to the “Reference Series” (the associated GSE record) and click on this.
Read the Summary and Overall Design entries to get a sense of what this data set is studying. View the full list of samples to better understand the experimental design.
1. What are the two types of samples that are being compared in this study, and how many samples are there of each type?
For GDS records, you can find differentially expressed genes using the tools at the bottom of the GDS page.
If you are still on the GSE page, go back to the GDS page. Under "Data Analysis Tools," click "Compare 2 sets of samples." In step 1, choose 2-tailed t test (the default), and a significance level of 0.05. In step 2, click on the T-PLL light green block, and then on the blinking arrows to put those samples into set A. Then click on the "normal" (bright green) block and then the right- pointing arrows to put those samples into set B. Click Ok to return, and then in Step 3 click on "Query group A vs. group B."
1. What are the top two transcripts shown? Do they appear to be differentially expressed between the two classes of samples?
The GenePattern Tutorial
Open a window with the Firefox browser. There are other browsers available on the lab machines too, but Firefox is the most hassle-free web browser for today's analyses. Type the following URL directly into the Firefox window, or follow the link in the online version of the lab:
```
http://www.broadinstitute.org/cancer/software/genepattern/
```
You will have entered the GenePattern web site.
Click on the link "GenePattern Basics" 10-minute Tutorial in the center of the top part of the page. You will do the sections titled "Start GenePattern," "Run an Analysis," and "View the Analysis Results."
In the "Start GenePattern" section, use the Broad public server with the link they give you. follow the instructions to create an account. Don't worry, they won't spam you or sell your information anywhere.
Note that the version of GenePattern used to create the tutorial looks a little different from the current one. In particular, to find a module like "ComparativeMarkerSelection," just start typing that into the Search Modules & Pipelines slot on the upper left.
Following the tutorial, run ComparativeMarkerSelection, then ComparativeMarkerSelectionViewer, as described in the tutorial.
In ComparativeMarkerSelection, we suggest that you edit the number of permutations to 1000 rather than 10,000 (the default) to speed things up a little.
For the dataset files, pasting the ftp locations indicated by the tutorial into the won't work from the lab machines. But, you can get at the same files by right-clicking on the links in the tutorial, saving them to your computer, and then uploading those files.
Under "View the Analysis Results," note that the little downward-pointing blue arrow icon, which is supposed to display a menu of commands you can use to work with the output of the ComparativeMarkerSelection run, now looks like .
The ComparativeMarkerSelectionViewer shows you a plot of the score (in this case, the t-score for differential expression) for all the genes on the array, in sorted order. You can use this to filter the gene list to choose just the significantly differentially-expressed genes.
To do so, select Edit/Filter Features. You will get a popup window whose first field is the criterion on which you will filter. Select FDR(BH) to choose the false-discovery rate, and choose <= 0.05. You will get approximately 333 genes that are upregulated in ALL and 284 in AML (these are abbreviations for two types of leukemia). The list at the bottom will contain only those genes or identifiers.
Note that the "Features" in the table, with names like L07956_at, are technically called Affymetrix "qualifiers." Affymetrix is the company that makes the particular type of microarray used in this experiment, and a qualifier is the name given to a set of probes for a specific gene or transcript. The "Description" column in the table tells you the name of the corresponding gene.
To export the list of genes (qualifiers), select all the rows remaining in the table at the bottom of the page (after you have filtered the list as above), and then choose File/Save Feature List to download the list. Click the button next to "Download," type the name you want the file to have, and click OK. The file that downloads will contain a list of the differentially-expressed qualifiers that you have selected. (We will do this again in project 2.)
Upload this file with your lab submission. Note that the newline characters make this file hard to view properly in Notepad; that is okay.
Differential expression in PLL

Go back to GEO and find GEO data set GSE5788, which is the GEO record you were working with in problem 1.
Your job now is to use GenePattern to identify differentially expressed genes in this data set. To do these, you need an expression data file (in GenePattern format this has an extension ".gct"), and a class or phenotype file (extension ".cls"). Normally you would create the cls file by hand in Excel, TextEdit, Notepad, or something similar, but in the interest of time, we're creating it for you.
So download and save the class file designating which samples correspond to which phenotype or class. Look at the cls file. It has 3 lines. The first lists the number of samples (14), the number of phenotypes (2), and the number 1 (always), separated by spaces or tabs. The second line starts with a '#' character and lists the phenotypes in order (e.g. 1 is PLL, 0 is controls). The third is a whitespace-separated list of phenotype numbers (0s or 1s in this case) corresponding to the samples in the order in which they appear in the data file. Information about file formats required by GenePattern can be found under the help tab along the top menu.
Next, we have to create a GenePattern verison of the data file. We will use the GEOimporter module in Genepattern. Browse modules to find GEOimporter, and type the GEO id (GSE5788) into the first parameter box. You don't need anything else. Run it. When it is done, save the .gct file it produces by clicking on the file name (GSE5788.gct) and choosing "Save file."
Now, you'll repeat the same sort of differential expression analysis that you did in the tutorial, but using this data set. By clicking on the little curved down arrow to the left of the .gct output file name, you get a list of possible modules to run using that file. Choose ComparativeMarkerSelection, and it will automatically populate the data set field with the .gct file you just created.
Upload the saved prolymphleuk.cls as the class file (you can either browse to the files' locations, or drag and drop the file inside the dotted rectangles). For "number of permutations," again type 1000. (We are again using fewer permutations than we might want, choosing speed over accuracy because we have limited time.) For the rest of the options, choose the default parameters. Click on "Run," and wait until the status indicates that the run has completed the computation needed to identify differential expression. When it is done, it adds a link to the resulting file below the dialog box, just as in the tutorial.

When it is completed, click the little curved arrow to the left of the link to the "GSE5788.comp.marker.odf" file. When a sliding menu appears, click "ComparativeMarkerSelectionViewer" under the heading "Send to Module" to send the .odf file to the viewer module. You will see the .odf file already uploaded to the module. For dataset filename, select the GSE5788.gct file (the same file that you created from GEOimporter; this is why we had you save it). Hit "Run". Click "open visualizer" when this link appears.
First, look at the heat map. Figure out what colors refer to higher or lower expression in what types of samples. You can scroll through the heatmap from top to bottom.
Filter to find the most significantly different genes. To do so, select Edit/Filter Features. Choose FDR (BH) and set a maximum value of 0.1. Since there are many genes with FDR < 0.1, modify the filter to require that the genes also have a fold-change above 3.0.
1. How many of significantly differentially expressed genes with a 3-fold change in expression levels do you see?
2. Recall that "The FDR represents the expected proportion of non-marker genes (false positives) within the set of genes declared to be differentially expressed" ( Using GenePattern for Gene Expression Analysis, Heidi Kuehn et al., Curr Protoc Bioinformatics, 2008). Based on this, what is the chance that the gene at the top of the list is not really differentially expressed between the PLL and control samples, but appears on our list just by chance?
3. Choose View / Upregulated features to see a graphic of the t-scores of the selected genes. According to the legend on the right, how many are upregulated in prolymphacytic leukemia?
4. Export the list of selected genes (probe set ids) by selecting all rows in the filtered table, and choosing File / Save Feature List, and then choosing "download."
5. Finally, do any genes have a Bonferroni-adjusted p-value of below 0.1? This isn't shown in the table by default, but you can still filter on it.
Functional enrichment in a list of genes
Find the file pllsymbols.txt on the labs web page. The file contains genes corresponding to most of the probe set id's in the list you just saved.
On the right of the Gene Ontology web page, you'll see a white box entitled "GO Enrichment Analysis." Copy the contents of the pllsymbols.txt file and paste them into the white box. Ensure that the species says "Homo sapiens" and the ontology is "Biological Process" (these are generally the defaults), and click 'Launch.'
At the top of the results page is a grayish box with your analysis choices. Below that is a small table titled "Results."
1. How many IDs were on your gene list? How many were successfully mapped to known gene identifiers? How many are ambiguously mapped? Were there any that GO couldn't figure out ("unmapped")?
Below that are the significantly enriched (FDR < 0.05) Gene Ontology Biological Process terms on the list. They are sorted by the GO hierarchy, so you can see which terms are subsets of which others.
1. Browse through this list to get a sense of what functions are enriched in the PLL gene list. Write a sentence or two summarizing broadly what types of functional processes appear to be involved. (This isn't meant to be a list of all the enriched functions. Do you see any coherent themes that keep re-occurring?)
You can also sort by significance of enrichment. Scroll to the top of the list of results. Clicking on any of the column headers will sort by values in that column. Clicking it again sorts in the other order.
Click on the header "Fold Enrichment." Note that this is not talking about the fold-change of a gene; it is describing how many times the expected number of genes in that category appear in the gene set. The first column after the biological process name, entitled "Homo sapiens (Ref) #," is the number of genes with that annotation in the whole H. sapiens reference genome; the next column, just titled "#," has the number of genes in the submitted gene set with that annotation.
1. What is the name of the most enriched Biological Process (the one at the top of the list)? What is the Fold Enrichment for this process? How many genes with that annotation are expected in a gene set of the size you submitted? How many were observed? Do you believe the "Fold Enrichment" computed is accurate? Do you believe this particular functional enrichment is meaningful?
Another option is to sort by significance. Try clicking on "FDR." You may have to do this twice to get the most significantly enriched terms (you want values much lower than .05 at the top of the list). Note also that clicking on the number of genes with that annotation that are on your list will give you a list of the specific genes.
1. What is the name of the most significantly enriched Biological Process (the one at the top of the list now)? What is the Fold Enrichment for this process? How many genes with that annotation are expected in a gene set of the size you submitted? How many were observed? Do you believe the "Fold Enrichment" computed is accurate? Do you believe this particular functional enrichment is meaningful?
A "GO-Slim" is a slim version of the Gene Ontology with much less repetition of concepts. Scroll up to the top and try re-running this analysis using the Panther GO-Slim Biological Process annotation data set and the same options for everything else. Click the orange "launch" button to run.
1. How does using a GO-slim change the results?
Going Further:
1. What's in a name?
  In this problem, you will create your own dict to translate the common names of various species to their Latin names.
  Write a Python script called name.py containing a dict whose keys are strings containing the common names of animals, and whose values are strings containing the associated Latin names. Pick any four species you would like, and in your program, add these four species to your dict.
  The program should then use the dict to obtain a list of the included species, and should print out a message telling the user what species are available. Then it should ask the user to choose one to look up. If the species that the user enters is in the dict, the program should print out the corresponding Latin name. If not, it should print a message saying so.
  A sample run of your code might look like this:
```
The available animals are 
house wren 
aardwolf 
striped bass
cheetah 

Which would you like to look up? 
>>>  cheetah
The Latin name for the cheetah is Acinonyx jubatus.
```
2. While I recall your nickname...
  Write a python program, nickname.py, that asks the user for pairs of names and nicknames, sequentially, until the user enters the name "done". (You should use a while loop for this.)
  The program should create a dict with keys equal to the input names, and values that are the nicknames. Then when done, it should print out all the name / nickname combinations. An example run might look like:
```
What is the next person's name?  Samuel
What is their nickname? Scooter
What is the next person's name? Frederick 
What is their nickname?  Fritz
What is the next person's name? Christopher 
What is their nickname?  Squi
What is the next person's name? done
Frederick:  Fritz
Christopher: Squi
Samuel: Scooter
```
3. Differences between functional annotation sources
  Use the Amigo browser to look for Gene Ontology terms associated with your favorite gene.
  1. What is the gene?
  2. With which GO terms is it annotated?
  3. Which of the three major ontologies (Molecular Function, Cellular Component, Biological Process) did you find most informative in this case, and why?
  Now look up the same gene in the Reactome and KEGG pathway databases. How does what you learned about your gene differ depending on the database you use?
4. Gene set enrichment analysis
  Try GSEA analysis in GenePattern on the GSE5788.gct file; you can use the "job results" tab to see all the previous jobs you ran, and click on the GSE5788.gct results file from GEOimporter to get GSEA as one of the options. Choose "c5.all..." for the "gene sets database" option. Reduce the number of permutations to 200. Choose the prolymphleuk.cls file you used before for the phenotype labels. Save HG-U133A.chip and designate that as the "chip" file, then press Run.
  When it is done, scroll down to find the output file "index.html" and view that to get a summary of, and way to navigate, your results.
  What are the differences between this and the previous analysis?

Comp 7 / Bio 40 Lab 5: Gene Expression and Functional Analysis

October 29, 2019

Purpose of this lab

Lab problems

Comp 7 / Bio 40 Lab 5:
Gene Expression and Functional Analysis