First, choose a data set to investigate, using the GEO database and the criteria specified below. If you do not want to choose your own, you may use one that we suggest (see Details).
Your job in part a of this project is to:
In part b of the project, you will write a Python script that reads in both the list of inflammatory qualifiers from this file and the list of differentially expressed qualifiers you identified, and then creates a list and count of the qualifiers shared by both (i.e., the inflammatory markers that are differentially expressed in your data set). One of the points of this project is to see how widespread inflammation is in disease. You might choose a data set that has no differentially expressed inflammatory markers, but that is not very likely.
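As a rough illustration of the comparison described above, here is a minimal sketch. It assumes each file holds one qualifier per line with no header row. The function names and the command-line usage are hypothetical and not part of the assignment, so adapt them to the actual format of your files.

```python
import sys


def read_qualifiers(lines):
    """Return the set of non-empty, whitespace-stripped qualifiers in `lines`.

    `lines` can be any iterable of strings, including an open text file.
    Assumes one qualifier per line (an assumption about the file layout).
    """
    return {line.strip() for line in lines if line.strip()}


def shared_qualifiers(inflammatory_lines, differential_lines):
    """Return a sorted list of the qualifiers present in both inputs."""
    inflammatory = read_qualifiers(inflammatory_lines)
    differential = read_qualifiers(differential_lines)
    return sorted(inflammatory & differential)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python shared.py inflammatory.txt differential.txt
    with open(sys.argv[1]) as infl, open(sys.argv[2]) as diff:
        shared = shared_qualifiers(infl, diff)
    print(len(shared), "shared qualifiers")
    for qualifier in shared:
        print(qualifier)
```

Using sets means that duplicate qualifiers within either file are counted only once in the result.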
You should also submit a paragraph explaining how the output of your script relates to what you saw in the functional enrichment analysis of the gene list in part a.
The study that you choose should contain samples from two different classes (e.g., treated and untreated cells, or disease samples and controls), and should have at least 7 samples in each class. Note that you can set filters in your GEO data search to look for only GEO data sets and series that have "GPL570" and that have at least 14 samples. (Look at the link "Show additional filters" in the left-hand column when you are searching.)
To analyze the data set using GEO2R, find the GSE (Gene Expression Series) page and scroll down to the GEO2R link. You can use this to identify differentially expressed genes.
If you would prefer to do more complex analyses than GEO2R allows, you may create a data set for GenePattern from GEO. The easy way to do this is to use the GEOimporter module in GenePattern. Type the GEO identifier for the data set (e.g., GSE41662) in the first input box, and do not enter anything for the line that asks for the SOFT file. Use the default entry ("VALUE") for the "data column name."
The output of GEOimporter should include a file with a ".gct" extension.
You will also need one with a ".cls" extension. The file with a ".gct" extension contains expression data, and the file with a ".cls" extension contains information indicating the class to which each sample in the corresponding .gct file belongs. (A detailed description of all GenePattern file formats can be found at: http://www.broadinstitute.org/cancer/software/genepattern/gp_guides/file-formats.)
To make the ".cls" file, run the ".gct" file produced by GEOimporter through the "ClsFileCreator" module.
ClsFileCreator instructions: on the first screen, select all samples. On the second, type the name of the first class (e.g., "untreated") into the box on the upper left and click Add Class. Repeat for the second class. Click Next. You might have to look at the sample names in GEO to figure out which ones go in which class. Select the samples in one class (the "Sample Name" box at the top selects all of them, which can be helpful). Select the class name from the pull down on the right, and then click the right arrow to move the selected samples into that class. Repeat with the remaining samples and the other class. Click Next. Review the class breakdown, and save the .cls file in the default location. When you return to GenePattern it will be there under the Files tab.
You may alternatively create .gct and .cls files manually by downloading the Matrix (text) file from the appropriate GEO series page, and editing the format in Excel or a similar program. See the file format link above, or this video for instructions.
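If you build the .cls file by hand, a short script can be less error-prone than editing in a spreadsheet. The sketch below assumes the categorical .cls layout described in the GenePattern file format guide: a first line giving the number of samples, the number of classes, and the number 1; a second line starting with "# " followed by the class names; and a third line with one label per sample, in the same order as the samples in the .gct file. The function name and the example class names are hypothetical. Check the result against the guide before using it.

```python
def format_cls(sample_classes):
    """Build the text of a categorical .cls file.

    sample_classes: the class name of each sample, in the same order as
    the sample columns of the matching .gct file. Class names must not
    contain spaces, because the .cls format is space-delimited.
    """
    # Number the classes in order of first appearance, so that the first
    # label on line 3 corresponds to the first class name on line 2.
    class_names = list(dict.fromkeys(sample_classes))
    index = {name: i for i, name in enumerate(class_names)}

    header = f"{len(sample_classes)} {len(class_names)} 1"
    names = "# " + " ".join(class_names)
    labels = " ".join(str(index[name]) for name in sample_classes)
    return "\n".join([header, names, labels]) + "\n"


if __name__ == "__main__":
    # Hypothetical three-sample example; a real study needs 7+ per class.
    print(format_cls(["lesional", "non_lesional", "lesional"]), end="")
```

To save the file, write the returned string to a path of your choice, for example with `open("my_data.cls", "w")`.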
Please upload your .gct and .cls files with your writeup.
Note: If you are working in the Windows lab, we suggest that you use Firefox to run GenePattern, since it works even with the older version of Java installed on the lab machines. (Some other browsers complain, and you do not have permission to update Java on these machines.) If you are working on your own machine and have a current version of Java or can update it, other browsers should be okay.
If you don't want to choose your own data set, you may use GSE41662, which examines gene expression in lesional and non-lesional skin biopsies from psoriasis patients.
Note that if you have too many differentially expressed qualifiers, some of the tools (e.g., DAVID) will not work. Try changing your filter criteria to get a list of fewer than 3000 qualifiers. Similarly, if you have too few differentially expressed genes, you might try a functional analysis in GSEA. That won't directly give you a gene list for part b of the project, but if there are significantly differentially expressed gene sets, you could pick the "leading edge genes" from these and create a list of differentially expressed genes that way. (Feel free to discuss this with us.) You could also relax your filter criteria substantially. In either case, describe what you did and your decision-making process.
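One way to experiment with filter criteria is to apply cutoffs to the table of results in a script. The sketch below is only an illustration: it assumes each row is a dictionary with the keys "ID", "adj.P.Val", and "logFC" (check these against the header of the table you actually downloaded), and the default cutoffs are arbitrary starting points, not recommended values.

```python
def filter_qualifiers(rows, max_adj_p=0.05, min_abs_logfc=1.0):
    """Return the IDs of rows that pass both cutoffs.

    rows: an iterable of dicts with the (assumed) keys "ID",
    "adj.P.Val", and "logFC"; values may be strings, as produced by
    csv.DictReader. Rows with non-numeric values are skipped.
    """
    kept = []
    for row in rows:
        try:
            adj_p = float(row["adj.P.Val"])
            logfc = float(row["logFC"])
        except ValueError:
            continue  # e.g. "NA" or an empty field
        if adj_p <= max_adj_p and abs(logfc) >= min_abs_logfc:
            kept.append(row["ID"])
    return kept


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    example = [
        {"ID": "probe_1", "adj.P.Val": "0.01", "logFC": "2.0"},
        {"ID": "probe_2", "adj.P.Val": "0.20", "logFC": "3.0"},
    ]
    print(filter_qualifiers(example))
```

For a tab-delimited file, the rows can be supplied with `csv.DictReader(handle, delimiter="\t")`. If the resulting list is still longer than 3000, tighten one of the cutoffs and record which values you settled on for your writeup.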
In the directory you want to work in, save a copy of the list of differentially-expressed qualifiers you uploaded in part a. Also, save this file of 1359 Affymetrix qualifiers corresponding to inflammatory genes.
This list of inflammatory qualifiers was made by going to the AmiGO web site for browsing the Gene Ontology (GO), searching for the Biological Process term "inflammatory response", limiting it to humans (the "Taxon" filter), and downloading the resulting list of genes. Some of these genes are annotated to the term "inflammatory response" itself, and others to more specific terms that are descendants of "inflammatory response" in the ontology. The gene symbols were then converted to Affymetrix U133 Plus 2.0 qualifiers using information from Affymetrix's web site.
Your job is then to write a program to do the following: