Comp 7 / Bio 40 Project 2: Gene Expression Analysis

Part a: Due November 9, 2016

Part b: Due November 16, 2016

In this project, you will investigate the role that inflammation plays in a gene expression data set of your choice.

First, choose a data set to investigate, using the GEO database and the criteria specified below. If you do not want to choose your own, you may use one that we suggest (see Details).

Your job in part a of this project is to:

  1. Pick a GEO data set satisfying the criteria below.
  2. Describe the data set you chose, the problem being studied, the two classes of samples compared, and the number of samples from each class.
  3. Identify a list of genes that are significantly differentially expressed between these two classes. Describe the steps that you took to do so.
  4. Export this list (of Affymetrix "qualifiers") to a text file.
  5. Perform a functional analysis of this gene list, describing the steps that you took to do so and summarizing what you found.
  6. Submit a writeup containing all the descriptions above, and the file containing the list of differentially expressed qualifiers.

In part b of the project, you will write a Python script that reads in both the list of inflammatory qualifiers from this file and the list of differentially expressed qualifiers you identified, and then creates a list and count of the qualifiers shared by both (i.e., the inflammatory markers that are differentially expressed in your data set). One of the points of this project is to see how widespread inflammation is in disease -- you might choose a data set that has no differentially expressed inflammatory markers, but it isn't all that likely.

You should also submit a paragraph explaining how the output of your script relates to what you saw in the functional enrichment analysis of the gene list in part a.


Details for part a:

1. Pick a GEO data set

Use GEO to search for a data set that examines some disease or problem of interest to you, and that uses the microarray platform GPL570 (the Affymetrix U133 Plus 2.0 arrays). This is a very commonly-used platform, so there will be thousands of data sets to choose from. It only assays expression of human genes, so your study will need to involve gene expression in human samples or cells.

The study that you choose should contain samples from two different classes (e.g., treated and untreated cells, or disease samples and controls), and should have at least 7 samples in each class. Note that you can set filters in your GEO data search to look for only GEO data sets and series that have "GPL570" and that have at least 14 samples. (Look at the link "Show additional filters" in the left-hand column when you are searching.)

To analyze the data set using GEO2R, find the GSE (Gene Expression Series) page and scroll down to the GEO2R link. You can use this to identify differentially expressed genes.

If you would prefer to do more complex analyses than GEO2R allows, you may create a data set for GenePattern from GEO. The easy way to do this is to use the GEOimporter module in GenePattern. Type the GEO identifier for the data set (e.g., GSE41662) in the first input box, and do not enter anything for the line that asks for the SOFT file. Use the default entry ("VALUE") for the "data column name."

The output of GEOimporter should include a file with a ".gct" extension.

You will also need one with a ".cls" extension. The file with a ".gct" extension contains expression data, and the file with a ".cls" extension contains information indicating the class to which each sample in the corresponding .gct file belongs. (A detailed description of all GenePattern file formats can be found at: http://www.broadinstitute.org/cancer/software/genepattern/gp_guides/file-formats.)

To make the ".cls" file, send the ".gct" file output by GEOimporter through the "ClsFileCreator" module.

ClsFileCreator instructions:

  1. On the first screen, select all samples.
  2. On the second screen, type the name of the first class (e.g., "untreated") into the box on the upper left and click Add Class. Repeat for the second class. Click Next.
  3. Select the samples in one class (the "Sample Name" box at the top selects all of them, which can be helpful). You might have to look at the sample names in GEO to figure out which ones go in which class.
  4. Select the class name from the pull-down on the right, and then click the right arrow to move the selected samples into that class. Repeat with the remaining samples and the other class. Click Next.
  5. Review the class breakdown, and save the .cls file in the default location. When you return to GenePattern, it will be there under the Files tab.

You may alternatively create .gct and .cls files manually by downloading the Matrix (text) file from the appropriate GEO series page, and editing the format in Excel or a similar program. See the file format link above, or this video for instructions.
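If you build the .cls file by hand, it may help to see how small it is. The sketch below assumes the categorical CLS layout described in the GenePattern file format guide: a first line giving the number of samples, the number of classes, and the literal 1; a second line starting with "#" that names the classes; and a third line with one numeric label per sample, in the same order as the sample columns of the .gct file. The function name format_cls and the sample labels are made up for illustration, and class names are assumed to contain no spaces.

```python
def format_cls(labels):
    """Return the text of a categorical .cls file.

    labels: one class name per sample, in the same order as the
    sample columns of the matching .gct file.
    """
    # Number the classes in order of first appearance, so the first
    # class named on line 2 is also the first label used on line 3.
    class_names = list(dict.fromkeys(labels))
    index = {name: i for i, name in enumerate(class_names)}

    header = f"{len(labels)} {len(class_names)} 1"
    names_line = "# " + " ".join(class_names)
    assignments = " ".join(str(index[label]) for label in labels)
    return "\n".join([header, names_line, assignments]) + "\n"


# Hypothetical labels for four samples (not taken from any real data set).
example_labels = ["lesional", "non-lesional", "lesional", "non-lesional"]
cls_text = format_cls(example_labels)
print(cls_text, end="")
```

Writing cls_text to a file whose name ends in ".cls" would give you something to compare against the output of ClsFileCreator for the same samples.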

Please upload your .gct and .cls files with your writeup.

Note: If you are working in the Windows lab, we suggest that you use Firefox to run GenePattern, since it works even with the older version of Java installed on the lab machines. (Some other browsers complain; you do not have permission to update Java on these machines.) If you are working on your own machine and have a current version of Java or can update it, other browsers should be okay.

If you don't want to choose your own data set, you may use GSE41662, which examines gene expression in lesional and non-lesional skin biopsies from psoriasis patients.

2. Describe what you chose

In your writeup, describe the data set you chose, the problem being studied, the two classes of samples compared, and the number of samples from each class. Note that most GEO data sets, somewhere in their GSE information page, include a link to any relevant publications associated with the data. Your data set may or may not have any associated publications. But if it does, they are a good source of additional information.

3. Find differentially expressed genes

Use GEO2R or GenePattern to identify a list of genes (really, Affymetrix qualifiers and their associated genes) that are significantly differentially expressed between the two classes in your data set. In your writeup, describe the steps that you took to do so. How many genes do you consider to be differentially expressed in your data set, and what criteria are you using to determine this?

Note that if you have too many differentially expressed qualifiers, some of the tools (e.g., DAVID) will not work. Try changing your filter criteria to get a list of fewer than 3000 qualifiers. Similarly, if you have too few differentially expressed genes, you might try a functional analysis in GSEA. That won't directly give you a gene list for part b of the project, but if there are significantly differentially expressed gene sets, you could pick the "leading edge genes" from these and create a list of differentially expressed genes that way. (Feel free to discuss this with us.) You could also relax your filter criteria substantially. In either case, describe what you did and your decision-making process.

4. Save this list of qualifiers to a file

Upload the list of differentially expressed qualifiers along with your submission.
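If you download a results table and want to apply your own cutoffs before saving the list, a short script is one option. This is only a sketch under stated assumptions: it assumes a tab-delimited table with columns named "ID", "adj.P.Val", and "logFC" (check the header of the table you actually downloaded), and the thresholds shown are arbitrary placeholders, not recommended values. The function name select_qualifiers and the demo rows are made up for illustration.

```python
import csv
import io


def select_qualifiers(table, max_adj_p=0.05, min_abs_logfc=1.0):
    """Return the IDs of rows that pass both thresholds.

    table: an open tab-delimited text file (or file-like object)
    whose header includes ID, adj.P.Val, and logFC (assumed names).
    """
    selected = []
    for row in csv.DictReader(table, delimiter="\t"):
        try:
            adj_p = float(row["adj.P.Val"])
            logfc = float(row["logFC"])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        if adj_p < max_adj_p and abs(logfc) >= min_abs_logfc:
            selected.append(row["ID"])
    return selected


# Made-up rows standing in for a downloaded table.
demo_table = io.StringIO(
    "ID\tadj.P.Val\tlogFC\n"
    "probe_A\t0.001\t2.5\n"
    "probe_B\t0.20\t3.0\n"
    "probe_C\t0.01\t-1.4\n"
    "probe_D\t0.01\t0.3\n"
    "probe_E\tNA\t1.9\n"
)
qualifiers = select_qualifiers(demo_table)
print(len(qualifiers), "qualifiers selected")
print("\n".join(qualifiers))
```

With a real table you would pass an open file instead of demo_table and write the selected IDs to a text file, one per line. Whatever cutoffs you settle on, report them in your writeup.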

5. Functional analysis

Perform a functional analysis of this gene list. Include in your writeup a description of how you analyzed the data, what you found, and what you learned about the original experiment. How does it relate to what, if anything, the original authors describe learning from the data set (in either the associated publication, or the GEO series description)?

6. Submit your results

In Trunk, upload your writeup with all the information listed above, and the file containing the list of differentially expressed qualifiers.



Details for part b:

In the directory you want to work in, save a copy of the list of differentially-expressed qualifiers you uploaded in part a. Also, save this file of 1359 Affymetrix qualifiers corresponding to inflammatory genes.

This list of inflammatory qualifiers was made by going to the AmiGO web site for browsing the Gene Ontology (GO), searching for the Biological Process term "inflammatory response", limiting it to humans (the "Taxon" filter), and downloading the resulting list of genes. Some of these genes are annotated to the term "inflammatory response" itself, others to more specific terms that are descendants of "inflammatory response" in the ontology. The gene symbols were then converted to Affymetrix U133 Plus 2.0 qualifiers using information from Affymetrix's web site.

Your job is then to write a program to do the following:

  1. Read the qualifiers from your list of differentially-expressed ones.
  2. Read the qualifiers from the list of inflammatory ones.
  3. Check whether each qualifier from one list is in the other list too. (You can do this inefficiently but acceptably using loops, or much more quickly using dictionaries, which we're just starting to learn about; either is fine for this assignment.)
  4. Count the number of qualifiers that are on both lists, and print them out, one per line.
  5. When done, print a message saying how many inflammatory qualifiers are on your list, and the total length of your list.
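To illustrate the two approaches mentioned in step 3, here is a minimal sketch on made-up data; it is one possible shape for the comparison, not the required solution. The qualifier names are invented, io.StringIO stands in for the two input files, and the function names are hypothetical. It assumes one qualifier per line and that each qualifier appears only once in your list.

```python
import io


def read_qualifiers(handle):
    """Return the non-empty lines of an open text file, stripped of whitespace."""
    return [line.strip() for line in handle if line.strip()]


def shared_by_loops(mine, inflammatory):
    """Nested-loop version: compare each of my qualifiers to every inflammatory one."""
    shared = []
    for qualifier in mine:
        for candidate in inflammatory:
            if qualifier == candidate:
                shared.append(qualifier)
                break  # found it; no need to keep scanning
    return shared


def shared_by_dict(mine, inflammatory):
    """Dictionary version: one fast lookup per qualifier instead of an inner loop."""
    is_inflammatory = {qualifier: True for qualifier in inflammatory}
    return [qualifier for qualifier in mine if qualifier in is_inflammatory]


# Invented qualifiers; real scripts would open the two files instead.
my_list = read_qualifiers(io.StringIO("q1_at\nq2_at\nq3_at\nq4_at\n"))
inflammatory_list = read_qualifiers(io.StringIO("q4_at\nq9_at\nq2_at\n"))

shared = shared_by_dict(my_list, inflammatory_list)
for qualifier in shared:
    print(qualifier)
print(len(shared), "of the", len(my_list), "qualifiers on my list are inflammatory")
```

Both functions return the same result here. The loop version does up to len(mine) × len(inflammatory) comparisons, while the dictionary version does one lookup per qualifier, which is why it is much faster on long lists.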

Submitting your work

Please upload the following files to Trunk:

  1. Your Python script
  2. The file you used as input to your Python script
  3. A paragraph explaining how the output of your script relates to what you saw in the functional enrichment analysis of the gene list in part a.