Comp 7 / Bio 40 Project 2: Gene Expression Analysis

Part a: Due November 8, 2019 (any time)

Part b: Due November 19, 2019 (before class)

In this project, you will investigate expression changes in a data set of your choice.

Your job in part a of this project is to:

  1. Pick a GEO data set satisfying the criteria below.
  2. Describe the data set you chose, the problem being studied, the two classes of samples compared, and the number of samples from each class.
  3. Identify a list of genes that are significantly differentially expressed between these two classes. Describe the steps that you took to do so.
  4. Export this list (of Affymetrix probe set IDs, corresponding to genes) to a text file.
  5. Perform a functional enrichment analysis of this gene list.
  6. Submit a writeup describing your work so far and what you found.

In part b of the project, we will address the problem that your differential expression analysis in GenePattern only produced significant probe set IDs. These cannot be submitted directly to the Gene Ontology enrichment tool. Thus, we will do enrichment analysis in DAVID in part a. DAVID has a lot to recommend it - it includes many different annotation sources and is easy to use. However, the DAVID annotation data is about three years old; the GO changes much faster than that and their own enrichment tool is kept up to date.

To address this issue, you will write a Python script that converts the probe set ids from the GenePattern differential expression analysis to gene symbols, which can then be submitted to the GO enrichment tool. Details for writing this code appear below.

Next, you will use GO to perform functional analysis on your data using the output from this script and compare your results to those from part a and from the authors.


Details for part a:

1. Pick a GEO data set

Use GEO to search for a data set that examines some disease or problem of interest to you, and that uses the microarray platform GPL570 (the Affymetrix U133 Plus 2.0 arrays). You can enter GPL570 as a search term in GEO. This is a very commonly-used platform, so there will be thousands of data sets to choose from. It only assays expression of human genes, so your study will need to involve gene expression in human samples or cells.

The study that you choose should contain samples from two different classes (e.g., treated and untreated cells, or disease samples and controls), and should have at least 7 samples in each class. Note that you can set filters in your GEO data search to look for only GEO data sets and series that have "GPL570" and that have at least 14 samples. (Look at the link "Show additional filters" in the left-hand column when you are searching.)

Note: If you are working in Windows lab, we suggest that you use Firefox to run GenePattern since it works well with the version of Java installed on the lab machines. (Some other browsers may complain; you do not have permission to update Java on these machines.) If you are working on your own machine and have a current version of Java or can update it, other browsers should be okay.

2. Describe what you chose

In your writeup, describe the data set you chose, the problem being studied, the two classes of samples compared, and the number of samples from each class. Note that most GEO data sets, somewhere in their GSE information page, include a link to any relevant publications associated with the data. Your data set may or may not have any associated publications. But if it does, these are a good source of additional information. Sometimes the comparison discussed in the paper may not be the one you will focus on here.

3. Find differentially expressed genes

Use GenePattern to find differentially expressed genes in your data set.

To start in GenePattern, you will need to create an expression data file. The easy way to do this is to use the GEOimporter module in GenePattern. Type the GEO identifier for the data set (e.g., GSE41662) in the first input box, and do not enter anything for the line that asks for the SOFT file. Use the default entry ("VALUE") for the "data column name."

The output of GEOimporter should include a file with a ".gct" extension.

You will also need a data file with a ".cls" extension. The file with the ".gct" extension contains expression data, and the file with the ".cls" extension contains information indicating the class to which each sample in the corresponding .gct file belongs. (A detailed description of all GenePattern file formats can be found at: http://www.broadinstitute.org/cancer/software/genepattern/gp_guides/file-formats.)

To make the ".cls" file, send the output ".gct" file output by GEOimporter through the "ClsFileCreator" module, creating the ".cls" file.

ClsFileCreator instructions: on the first screen, select all samples. On the second, type the name of the first class (e.g., "untreated") into the box on the upper left and click Add Class. Repeat for the second class. Click Next. You might have to look at the sample names in GEO to figure out which ones go in which class. Select the samples in one class (the "Sample Name" box at the top selects all of them, which can be helpful). Select the class name from the pull down on the right, and then click the right arrow to move the selected samples into that class. Repeat with the remaining samples and the other class. Click Next. Review the class breakdown, and save the .cls file in the default location. When you return to GenePattern it will be there under the Files tab.

You may alternatively create .gct and .cls files manually by downloading the Matrix (text) file from the appropriate GEO series page, and editing the format in Excel or a similar program. See the file format link above, or this video for instructions.

Please upload your .gct and cls files with your writeup. Use GenePattern comparative marker selection and filtering to identify a list of genes (really, Affymetrix probe set IDs) that are significantly differentially expressed between the two classes in your data set. In your writeup, describe the steps that you took to do so. How many genes do you consider to be differentially expressed in your data set, and what criteria are you using to determine this?

Note that if you have too many differentially expressed qualifiers, some annotation tools (e.g. DAVID) will not work. Try changing your filter criteria to get a list of fewer than 3000 qualifiers. Similarly, if you have too few differentially expressed genes, you can pick the top few hundred genes by rank. You could also relax your filter criteria substantially. In either case, describe what you did and your decision-making process.

4. Save your differentially expressed probe set ids to a text file. Upload the list with your submission.

5. Functional analysis

Perform a functional analysis of this gene list in DAVID. This is a different tool than the one we used in lab, but one that allows you to upload probe set IDs from affymetrix arrays directly. You can also choose different annotation categories, although GO is among the defaults.

Go to the DAVID web site and choose Start Analysis from the top menu. In Step 1, choose your probe id text file to upload. For Step 2, use the default, "Affymetrix_3prime_ivt_id." For Step 3, click Gene List, then submit. Click Functional Annotation Chart from the presented tool list (you may explore others as well), and then on the next screen, verify that the Background is Homo sapiens (if not, choose Background from the menu at the left and ensure that it is). You can also expand the annotation categories and see what the options are. Finally, click the Functional Annotation Chart button and look at the resulting output file.

Include in your writeup a description of what you found, and what you learned about the original experiment. How does your analysis relate to what, if anything, the original authors describe learning from the data set (in either the associated publication, or the GEO series description)?

6. Submit your results

In Gradescope, upload your writeup with all the information listed above, the file containing the list of differentially expressed qualifiers, and the two data files for genepattern.


Details for part b)

The file you will use for converting probe set IDs from the arrays used in this study (HG-U133_2.0 arrays) is here.

This file is tab-delimited. The first column is the name of the probe set ID. The second is the gene symbol, or a number of gene symbols separated by the sequence " /// ". The third is a description of the gene.

Write a script that first reads in all the probe set IDs from your list of differentially expressed genes. Create a dictionary called "desired." The keys of this dictionary should be probe set IDs from your list; the values should all be 1. (So this is a dictionary that has a probe set ID as a key only if it is on your list of differentially expressed genes.)

Now read the HG-U133_Plus_2.annot.txt file, line by line. When you read a line, assign the probe set ID to one variable (with a suitable name) and the gene symbol to another. If that probe set ID is in the "desired" dict, print the value of the gene symbol corresponding to that probe set ID to an output file. After reading the entire file, you should have all the desired gene symbols, although some of them will not yet be usable because they include multiple symbols separated by "///", and others are blanks (denoted "---" in the file).

Next, revise your code to split up the gene symbol string by the sequence " /// " and print out only the first part. Finally, if the gene symbol is "---", don't print anything for that probe set ID.

The resulting output file should contain exactly and only the gene symbols corresponding to the differentially expressed genes in your data set.

Now upload this list to the Gene Ontology enrichment tool. Compare the annotation you found in this part of the project to that in part a. What similarities or differences are there?

Please upload the following files to Gradescope:

  1. Your Python script
  2. The file you used as input to your Python script
  3. A paragraph summarizing your findings in the GO enrichment tool and explaining how the output of your script relates to what you saw in the functional enrichment analysis of the gene list in part a.