Your job in part a of this project is to:
In part b of the project, we will address the problem that your differential expression analysis in GenePattern only produced significant probe set IDs. These cannot be submitted directly to the Gene Ontology enrichment tool. Thus, we will do enrichment analysis in DAVID in part a. DAVID has a lot to recommend it - it includes many different annotation sources and is easy to use. However, the DAVID annotation data is about three years old; the GO changes much faster than that and their own enrichment tool is kept up to date.
To address this issue, you will write a Python script that converts the probe set ids from the GenePattern differential expression analysis to gene symbols, which can then be submitted to the GO enrichment tool. Details for writing this code appear below.
Next, you will use GO to perform functional analysis on your data using the output from this script and compare your results to those from part a and from the authors.
The study that you choose should contain samples from two different classes (e.g., treated and untreated cells, or disease samples and controls), and should have at least 7 samples in each class. Note that you can set filters in your GEO data search to look for only GEO data sets and series that have "GPL570" and that have at least 14 samples. (Look at the link "Show additional filters" in the left-hand column when you are searching.)
Note: If you are working in Windows lab, we suggest that you use Firefox to run GenePattern since it works well with the version of Java installed on the lab machines. (Some other browsers may complain; you do not have permission to update Java on these machines.) If you are working on your own machine and have a current version of Java or can update it, other browsers should be okay.
To start in GenePattern, you will need to create an expression data file. The easy way to do this is to use the GEOimporter module in GenePattern. Type the GEO identifier for the data set (e.g., GSE41662) in the first input box, and do not enter anything for the line that asks for the SOFT file. Use the default entry ("VALUE") for the "data column name."
The output of GEOimporter should include a file with a ".gct" extension.
You will also need a data file with a ".cls" extension. The file with the ".gct" extension contains expression data, and the file with the ".cls" extension contains information indicating the class to which each sample in the corresponding .gct file belongs. (A detailed description of all GenePattern file formats can be found at: http://www.broadinstitute.org/cancer/software/genepattern/gp_guides/file-formats.)
To make the ".cls" file, send the output ".gct" file output by GEOimporter through the "ClsFileCreator" module, creating the ".cls" file.
ClsFileCreator instructions: on the first screen, select all samples. On the second, type the name of the first class (e.g., "untreated") into the box on the upper left and click Add Class. Repeat for the second class. Click Next. You might have to look at the sample names in GEO to figure out which ones go in which class. Select the samples in one class (the "Sample Name" box at the top selects all of them, which can be helpful). Select the class name from the pull down on the right, and then click the right arrow to move the selected samples into that class. Repeat with the remaining samples and the other class. Click Next. Review the class breakdown, and save the .cls file in the default location. When you return to GenePattern it will be there under the Files tab.
You may alternatively create .gct and .cls files manually by downloading the Matrix (text) file from the appropriate GEO series page, and editing the format in Excel or a similar program. See the file format link above, or this video for instructions.
Please upload your .gct and cls files with your writeup. Use GenePattern comparative marker selection and filtering to identify a list of genes (really, Affymetrix probe set IDs) that are significantly differentially expressed between the two classes in your data set. In your writeup, describe the steps that you took to do so. How many genes do you consider to be differentially expressed in your data set, and what criteria are you using to determine this?
Note that if you have too many differentially expressed qualifiers, some annotation tools (e.g. DAVID) will not work. Try changing your filter criteria to get a list of fewer than 3000 qualifiers. Similarly, if you have too few differentially expressed genes, you can pick the top few hundred genes by rank. You could also relax your filter criteria substantially. In either case, describe what you did and your decision-making process.
Perform a functional analysis of this gene list in DAVID. This is a different tool than the one we used in lab, but one that allows you to upload probe set IDs from affymetrix arrays directly. You can also choose different annotation categories, although GO is among the defaults.
Go to the DAVID web site and choose Start Analysis from the top menu. In Step 1, choose your probe id text file to upload. For Step 2, use the default, "Affymetrix_3prime_ivt_id." For Step 3, click Gene List, then submit. Click Functional Annotation Chart from the presented tool list (you may explore others as well), and then on the next screen, verify that the Background is Homo sapiens (if not, choose Background from the menu at the left and ensure that it is). You can also expand the annotation categories and see what the options are. Finally, click the Functional Annotation Chart button and look at the resulting output file.
Include in your writeup a description of what you found, and what you learned about the original experiment. How does your analysis relate to what, if anything, the original authors describe learning from the data set (in either the associated publication, or the GEO series description)?
Details for part b)
The file you will use for converting probe set IDs from the arrays used in this study (HG-U133_2.0 arrays) is here.
This file is tab-delimited. The first column is the name of the probe set ID. The second is the gene symbol, or a number of gene symbols separated by the sequence " /// ". The third is a description of the gene.
Write a script that first reads in all the probe set IDs from your list of differentially expressed genes. Create a dictionary called "desired." The keys of this dictionary should be probe set IDs from your list; the values should all be 1. (So this is a dictionary that has a probe set ID as a key only if it is on your list of differentially expressed genes.)
Now read the HG-U133_Plus_2.annot.txt file, line by line. When you read a line, assign the probe set ID to one variable (with a suitable name) and the gene symbol to another. If that probe set ID is in the "desired" dict, print the value of the gene symbol corresponding to that probe set ID to an output file. After reading the entire file, you should have all the desired gene symbols, although some of them will not yet be usable because they include multiple symbols separated by "///", and others are blanks (denoted "---" in the file).
Next, revise your code to split up the gene symbol string by the sequence " /// " and print out only the first part. Finally, if the gene symbol is "---", don't print anything for that probe set ID.
The resulting output file should contain exactly and only the gene symbols corresponding to the differentially expressed genes in your data set.
Now upload this list to the Gene Ontology enrichment tool. Compare the annotation you found in this part of the project to that in part a. What similarities or differences are there?
Please upload the following files to Gradescope: