Comp 7 / Bio 40 - Fall 2019

Comp 7 / Bio 40 Lab 6

November 5th, 2019

Purpose of this lab

To try out gene set enrichment analysis and to give students more practice with Python dictionaries and code.

Lab problems

Running Gene Set Enrichment Analysis (GSEA) on diffuse large B-cell lymphoma data
We are now going to find the Reactome pathways that are over-represented for differential expression between follicular lymphomas and diffuse large B-cell lymphomas using the Gene Set Enrichment Analysis (GSEA) tool. We will again use GenePattern to run this analysis.
To analyze lymphoma expression in these pathways in GSEA, save the data files dlbcl_vs_fscc.raw.res, dlbcl_vs_fscc.cls, and hu6800.chip from the web page. The data set comes from the paper "Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning", by M. Shipp et al., Nature Medicine 8:68-74, 2002. You will use their data to analyze gene expression differences between diffuse large B-cell lymphomas and follicular lymphomas.
Log in to the GenePattern web site. Make sure you are on the "Modules" tab on the left panel. Click "Browse Modules" and find the "GSEA" module.
There are five parameters that we will change to run this module:
- Upload dlbcl_vs_fscc.raw.res as your expression dataset.
- Click on the pull-down menu for the gene sets database and select "c2.cp.reactome.v7.0.symbols.gmt [Curated]". This is a collection of gene sets representing the pathways in the Reactome database.
- For "number of permutations", enter 500.
- Upload the file dlbcl_vs_fscc.cls where it asks for phenotype labels.
- For the chip platform, upload the file hu6800.chip.
Leave the default settings for the rest of the parameters, scroll down to the bottom of the list of parameters (and over to the right if necessary to see the "Run" button), and hit "Run" to start the analysis.
You will see a blue dialog box showing the progress of the job that you just submitted. Your job may take a couple of minutes to run. While you are waiting, you can read about Reactome here.
When the run is complete, it will produce a long list of output files. Scroll down to nearly the bottom of the list, and click on the file called index.html (or search for it on the page). From the options that pop up, choose the Open Link option. A web page will open with a summary of the Gene Set Enrichment Analysis results. Read it to answer the following questions:
1. How many Reactome pathways show significant upregulation in diffuse large B-cell lymphomas (DLBCL) with an FDR < 0.25? How many Reactome pathways show significant upregulation in follicular lymphomas (FSCC) with an FDR < 0.25?
Click on the "enrichment results in html" link under the heading "Enrichment in phenotype DLBCL." At the top of the table, click on the link entitled "plain text format", and open the resulting tab-delimited text file in Excel or another spreadsheet program. Sort the list of enriched Reactome pathways by FDR q-value from lowest to highest.
1. What are the two most significantly enriched Reactome pathways for diffuse large B-cell lymphomas (DLBCL) based on FDR q-value?
2. Read further through the list of Reactome pathways upregulated in diffuse large B-cell lymphomas (DLBCL). How would you interpret the results?
Translating an APOE coding region
Save the file codondict.py from the labs web page. This file contains a dict including the mapping from codons to single letter amino acid codes. (Stop codons are represented by the character '-'.) You can write the rest of your code in this file and save it under this same name.
Also save the file apoe_cds.txt from the labs web page. This is a file that contains the nucleotide sequence of part of the coding region of the APOE gene.
Your goal for this problem is to translate the APOE coding region to its amino acid sequence.
Define a string variable called proteinseq. Read the APOE sequence from the file, and use the dict to convert each trio of letters in order into the appropriate amino acid. Add the translated amino acid to proteinseq.
Write the translated sequence to a file named "apoe_protein.txt". Submit this file with your writeup.
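The translation loop described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_codons` is a hypothetical four-entry stand-in for the full dict in codondict.py, the helper name `translate` is made up for this example, and the input is a short made-up string, not the contents of apoe_cds.txt.

```python
# Hypothetical stand-in for the dict in codondict.py; only four codons are listed.
toy_codons = {"ATG": "M", "AAG": "K", "GTT": "V", "TGA": "-"}

def translate(dna, codon_table):
    """Translate a DNA string three letters at a time using codon_table."""
    # Remove whitespace (such as newlines read from a file) and normalize case.
    dna = "".join(dna.split()).upper()
    proteinseq = ""
    # Step through the sequence in trios, ignoring any incomplete trailing codon.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        proteinseq += codon_table[codon]
    return proteinseq

print(translate("atgaag\ngtttga", toy_codons))  # prints MKV-
```

With the real data you would read the sequence with something like `open("apoe_cds.txt").read()`, pass in the dict from codondict.py, and write the result to "apoe_protein.txt".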
Converting gene names to aliases
In the NCBI Gene database, genes are known not only by their official gene symbols (e.g., “APOE”), but also by unique ID numbers. For example, APOE has the ID 348, TGFB1 is 7040, and KDR is 3791. In addition to its Gene ID number and official symbol, a gene may also be known by any of a list of aliases. For example, APOE has aliases AD2 and LPG. TGFB1 has aliases CED, LAP, and TGFB. KDR can also be known as FLK1 or VEGFR.
1. Create a script genealiases.py. First, have it contain a dictionary called geneid that maps gene names to their id numbers, and hard code the IDs for the three genes listed above.
2. Next, create another dict called aliases that maps the gene id numbers to a list containing all the aliases for that gene. (So 348 should map to the list ['AD2', 'LPG'].) Include the data for the gene IDs listed above. Add two more genes of your choice from the NCBI gene pages; make sure your chosen genes have at least one "alias" that you can use.
3. Finally, write code to create a list of all the canonical gene names, and then iterate through this list, printing out the gene name and all of its aliases. It should print out something like
```
APOE: AD2, LPG  
```
  for each gene in the list.
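One possible shape for the two dictionaries and the printing loop is sketched below, using only the three genes given in the text. The helper name `format_gene` is an assumption made for this example; the two additional genes of your choice still need to be added.

```python
# Map official gene symbols to NCBI Gene IDs (values taken from the lab text).
geneid = {"APOE": 348, "TGFB1": 7040, "KDR": 3791}

# Map each Gene ID to the list of aliases for that gene.
aliases = {
    348: ["AD2", "LPG"],
    7040: ["CED", "LAP", "TGFB"],
    3791: ["FLK1", "VEGFR"],
}

def format_gene(name):
    """Return one output line such as 'APOE: AD2, LPG' (hypothetical helper)."""
    return name + ": " + ", ".join(aliases[geneid[name]])

# Build the list of canonical gene names, then print each with its aliases.
gene_names = list(geneid.keys())
for name in gene_names:
    print(format_gene(name))
```

Looking up the ID first and then the aliases keeps the two dictionaries independent, so a new gene only needs one entry in each.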

Going Further:

Describe how GSEA is fundamentally different from the previous type of gene expression analysis we've performed in GenePattern. You may want to look at the original GSEA paper, with PubMed ID 16199517 (you can search for this directly in the PubMed search window), for help in answering this question.
Write code to count the frequency of each letter in a text file. Convert all the characters to lowercase first, and use a dictionary to keep your counts. To convert to a frequency, divide by the total number of letters. You may want to multiply by 100 to view these frequencies as percentages, which may make the numbers a bit more readable. You may also want to limit the number of digits you print.
Download the text files infile1 and infile2. You will use these as input, although to start out testing your methods, you might want to get a short input string from the keyboard.
Print out all the letters in alphabetical order and the frequencies of each.
The first input file, infile1, contains an excerpt from Shakespeare's "A Midsummer Night's Dream." The second, infile2, contains some paragraphs from "A Void," a translation of Georges Perec's novel "La Disparition."
Once you've run your code on both input files, visually compare the observed frequencies of the vowels. What is unusual about Perec's work, as compared to standard written English (as illustrated by Shakespeare)?
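The letter-counting task above could start from a sketch like the one below, which is tested on a short string and not on infile1 or infile2. The function name `letter_frequencies` is an assumption for illustration, and `str.isalpha` is used to decide what counts as a letter, so accented characters would be counted too.

```python
def letter_frequencies(text):
    """Return a dict mapping each letter in text to its frequency as a percentage."""
    counts = {}
    for ch in text.lower():
        # Count letters only; skip spaces, digits, and punctuation.
        if ch.isalpha():
            counts[ch] = counts.get(ch, 0) + 1
    total = sum(counts.values())
    # Divide each count by the total number of letters and scale to a percentage.
    return {letter: 100 * n / total for letter, n in counts.items()}

freqs = letter_frequencies("Hello, World!")
# Print the letters in alphabetical order, limited to one decimal place.
for letter in sorted(freqs):
    print(f"{letter}: {freqs[letter]:.1f}%")
```

To run it on a file, replace the test string with the file's contents, for example `open("infile1").read()` if the file is saved under that name.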