## Comp 7 / Bio 40 Lab 5: Gene Expression and Functional Analysis

### Purpose of this lab

To give students practice parsing data from files, working with some gene expression databases, and using functional enrichment tools.

### Lab problems

1. Loopy riddles

In this problem, you will write a script that reads lines from a file containing tab-delimited text storing people's names, ages, and ID numbers. Your job will be to print out the last names of the individuals whose ages are at most 25 and whose ID numbers are at least 500.

To make it more interesting, we have added a way to check your work. Each file includes a header line (i.e., the first line of the file) that contains five columns rather than four. This means, by the way, that you should deal with it before entering the loop, because it is different. (We suggest using a Boolean "flag" variable to tell whether you've read this line yet.) The first four columns in this header line tell you what information appears in their respective columns in the rest of the file (just like any header line typically would). The fifth column in the header line contains a silly joke or riddle. Have your script print this out.

Next, solve the problem above, and print out the last names of entries satisfying the specified criteria, one per line. If you have done this correctly, the first letters of the list of last names will spell out the answer to the riddle.

Recall that each line will be read in as a string. You will then need to separate on the tab character ("\t") to identify the "tokens" in the line. These will still be strings, so be sure to convert them to the appropriate type if you are concerned with numeric rather than string values (e.g., you want to compare 13 to 25, not "13" to 25).

A sample run of the program might look like this:

```What do you call an ancient reptile with an excellent vocabulary?
Agnew
Thweatt
Hopton
Edmonson
Stanback
Arboleda
Underhill
Ransdell
Unsworth
Schultheis
```

Finally, modify the program to print just the first letters of the last names it is now printing. So a sample run of the program would now look like this:

```What do you call an ancient reptile with an excellent vocabulary?
A
T
H
E
S
A
U
R
U
S
```

### Details

#### Getting input from a file

Save the files riddle1, riddle2, riddle3, riddle4, from the web site. You may hard-code the name of the file you want to open in your script if you like, or you may ask the user for it.

#### Reading and parsing the file

When you are reading from the file, you will need to remove the last charcter from each line. You will also want to split() the line up into a list to choose just the fields you want. Start counting at zero.

#### Testing and debugging your work

I would suggest starting with the very short input file called demo.txt - it only contains 5 lines and should print out names starting with the letters Y, E, and S.

2. Using DAVID to explore functional enrichment in a list of genes

Save the file BPD.txt from the labs web page. The file contains known disease genes for bronchopulmonary dysplasia.

Bronchopulmonary dysplasia is a serious lung condition that mostly affects babies born preterm. These babies do not have fully developed lungs at birth and, because they are born early, lack sufficient lung surfactant for the lungs to function properly. These babies are given surfactant and oxygen. Most develop the ability to transition to room air and to breathe normally by the time they reach their original due dates if not significantly earlier. However, some require supplemental oxygen past this point and are diagnosed as having bronchopulmonary dysplasia (BPD). While most eventually are able to breathe on their own, those with BPD may need treatment for many months or years and can have pulmonary difficulties throughout life.

In this problem, we are going to run functional annotation analysis on the set of genes linked to bronchopulmonary dysplasia to understand what functional processes have been implicated in infants with BPD.

Go to the DAVID web site: http://david.abcc.ncifcrf.gov/ or follow the link on the lab page. (Or Google "DAVID"; surprisingly, this site will usually be among the top 3 hits!). Click on Start Analysis.

• Step 1: Make sure you are on the "Upload" tab. Under "B: Choose From a File", browse to find the file that you just downloaded (BPD.txt).
• In Step 2, choose the Identifier type "OFFICIAL_GENE_SYMBOL". (Notice what other sorts of identifiers you can upload to this tool.)
• Select the default "Gene List" in Step 3 (you can also upload a list with a background distribution, but we'll just use their default, which is the whole human genome).
• Step 4, click on the Submit List button.
• Step 5: DAVID will present you with a popup window telling you that it's not sure what species your gene list comes from. Click OK.
• Step 6: you will be directed to the "List" tab. Select "Homo sapiens" to tell DAVID what species these genes come from.
• Step 7: Click on the "Background" tab. Select "Homo sapiens" to use their default background distribution for the human genome) and click on the Select Species button.

Next, use the default annotation terms (but expand some of the checkboxes so you can see what they are), and create a functional annotation Chart (click on the "Functional Annotation Chart" button, then do so again on the next screen). This list is sorted by (raw) P-value, lowest (most-significant) to highest.

1. What are the 3 most enriched terms?
2. Do they provide much useful information about what functions might be associated with the disease?
3. What are the three most-significant "GOTERM_BP_DIRECT" (i.e., GO Biological Process) term? What does this tell you about bronchopulmonary dysplasia?
4. How does significance of enrichment (e.g. p-value) differ from simply counting the number of genes in the BPD list with each annotation term? Hint: note that the list is initially sorted by p-value, but there is another column titled "count" that shows the number of genes in the list that are annotated with the indicated term. Clicking on the header of each column will sort by its value. You can find more information about what is in the Functional Annotation Chart results here. What is the p-value telling you that the count is not?

Now go back to the main DAVID page and click on Functional Annotation Clustering. This shows clusters of related annotation terms that are enriched in the gene list you uploaded. This can give you a different overview of enriched functions (focusing attention on related terms that appear to be significantly enriched, not just the few that are independently the most significant).

1. In your own words, how would you characterize the 3rd, 4th, and 5th functional clusters? (E.g., give a phrase summarizing the sorts of functions you see in each.) Does the functional annotation clustering change your view of the mechanisms underlying this disease?

Finally, go back to the page where you started running the analysis (the top of the page will say Annotation Summary Results). Click "clear all" to get rid of all of the annotation categories that are pre-selected. Now check the box next to "General Annotations" and select "cytoband". Create an annotation Chart again.

1. What is the most significantly enriched cytoband? How many genes from that band are on your list? (You can click on the blue bar in the "Genes" column to see what they are.)
2. Why do you think that cytoband is enriched for the bronchopulmonary dysplasia gene set? How would you relate the chromosomal band and the enriched functions for genes known to be responsible for bronchopulmonary dysplasia?

3. Finding gene expression experiments in the GEO database

Go to the GEO web site. Under the "Search for studies at GEO DataSets" link, search for a data set related to “prolymphocytic leukemia.”

1. There should be only one record whose ID number starts with “GDS” (GEO Data Set) as opposed to GSE (GEO Series). (You can find this by eye near the top of the list, or by clicking the "DataSets" category under the filter menu on the left, under the heading "Entry type.") What is that ID number?

Every GDS record is a curated version of some GSE record. The curation allows for additional analyses to be performed (see “Going Further,” below). GEO is no longer curating GDS records from GSEs (although there are some efforts to automate the process, so GDS records may start being created again soon).

Click on the name of that data set to get to a page that describes it more fully. Find the link to the “Reference Series” (the associated GSE record) and click on this.

Read the Summary and Overall Design entries to get a sense of what this data set is studying. View the full list of samples to better understand the experimental design.

1. What are the two types of samples that are being compared in this study, and how many samples are there of each type?

Note that at the bottom of the page, there is a link called “SOFT formatted family file(s).” If you wanted to download the expression data so that you could analyze it yourself (which we will do next week), you could do it by saving this SOFT file.

Going Further:

1. Gene expression in Unigene libraries

Suppose you are interested in studying expression in the hypothalamus and pituitary gland to help suggest novel drug targets for Parkinson’s disease. We will use the Unigene database to find genes that show differential expression between the hypothalamus and the pituitary.

Go to UniGene, and choose the “Library Browser” under the heading "Unigene Tools" in the center of the screen. Select human libraries with at least 1000 sequences. Using control-F, search for “hypothalamus.”

1. How many human hypothalamus libraries do you find?

Select a hypothalamus library, and identify it by its Unigene ID number and description.

1. What are the five most frequently expressed genes in this cDNA library?

Select the link next to the name of one of these most frequently expressed genes. Scroll down to the box labeled “Gene Expression,” and click the “EST Profile” link to find the distribution of each of these genes in other tissues.

We will say a gene is mostly specific to a given tissue if it is expressed at a level of at least 1000 TPM (transcripts per million) in that tissue, and it is not expressed at above 1000 TPM in any other tissue. There is a legend at the bottom of the page that will show you how to read the EST Profile report.

1. Looking only at the “normal” tissues (under the heading “Breakdown by Body Sites”), are any of the five most frequently expressed genes from your library “mostly specific” to brain? If so, list them. If not, list which one(s) come closest, and another tissue in which they are also frequently expressed.

Follow the DDD link on the Unigene page to find the Digital Differential Display link. You’d like to find genes differentially expressed between hypothalamus and hippocampus. Select human as the species of choice, and click “continue.” Do a search (control-F again) for “hypothalamus” to create your first sample set. Click on the id number for each possible sample to get a description of it; choose only those samples from normal adult hypothalamus.

1. List which samples you chose for this first class.

Name the pool “hypothalamus” and click “Continue.” Define the second class similarly. Search for “pituitary.” Select only those samples from normal adult pituitary. (You can click on the number next to each sample to get a better description of what is in it. If there is no developmental stage listed, assume it is not adult. If there is no obvious pathology listed in the description, you may assume it is normal.) Name this class "pituitary" and click "Continue."

1. List which samples you chose for this second class.
2. What are the two most differentially expressed genes between these groups?

Click on the id number of the top gene to go to its unigene site. Now scroll down to the GEO profiles link, and follow that to find data sets in which this gene is significantly differentially expressed. Look at the names of the top 10 GEO profiles. Can you learn anything about this protein's expression pattern during normal fetal development?

2. Copy and paste the BPD.txt gene list into the GO enrichment tool, choosing the biological process ontology and Homo sapiens for the options.

Sort the resulting list by P value by clicking on the "P value" column header until the most significantly enriched terms appear at the top of the list.

1. What are the 3 most enriched functions listed here?
2. Why do you think these results are different from those shown by DAVID?
3. Use the Amigo browser to look for Gene Ontology terms associated with your favorite gene.

1. What is the gene?
2. With which GO terms is it annotated?
3. Which of the three major ontologies (Molecular Function, Cellular Component, Biological Process) did you find most informative in this case, and why?

Now look up the same gene in the Reactome and KEGG pathway databases. How does what you learned about your gene differ depending on the database you use?