In this problem, you will write a script that reads lines from a file containing tab-delimited text storing people's names, ages, and ID numbers. Your job will be to print out the last names of the individuals whose ages are at most 25 and whose ID numbers are at least 500.
To make it more interesting, we have added a way to check your work. Each file includes a header line (i.e., the first line of the file) that contains five columns rather than four. Because this line differs from the rest, you should handle it separately from the data lines. (We suggest using a Boolean "flag" variable to tell whether you've read this line yet.) The first four columns in this header line tell you what information appears in their respective columns in the rest of the file (just like any header line typically would). The fifth column in the header line contains a silly joke or riddle. Have your script print this out.
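The flag idea can be sketched as follows. This is only an illustration, not the required solution: the function name split_header and the sample lines are made up, and a list of strings stands in for the lines of a real file.

```python
def split_header(lines):
    """Separate the header line from the data lines using a Boolean flag."""
    seen_header = False  # becomes True once the first line has been handled
    riddle = None
    data_lines = []
    for line in lines:
        if not seen_header:
            # The header's fifth column (index 4) holds the riddle.
            riddle = line.rstrip("\n").split("\t")[4]
            seen_header = True
        else:
            data_lines.append(line)
    return riddle, data_lines


# Made-up header and data line, for illustration only.
sample = [
    "First\tLast\tAge\tID\tWhy did the chicken cross the road?\n",
    "Ann\tYoung\t20\t600\n",
]
riddle, data_lines = split_header(sample)
print(riddle)
```

An alternative with the same effect is to read the first line once before the loop starts, in which case no flag is needed.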
Next, solve the problem above, and print out the last names of entries satisfying the specified criteria, one per line. If you have done this correctly, the first letters of the list of last names will spell out the answer to the riddle.
Recall that each line will be read in as a string. You will then need to separate on the tab character ("\t") to identify the "tokens" in the line. These will still be strings, so be sure to convert them to the appropriate type if you are concerned with numeric rather than string values (e.g., you want to compare 13 to 25, not "13" to 25).
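As a rough sketch of the tokenizing and conversion step, the function below splits one line on tabs and compares numbers rather than strings. The column order (first name, last name, age, ID) is an assumption; check the header line of your own file. The name qualifies and the sample lines are made up.

```python
def qualifies(line, max_age=25, min_id=500):
    """Return True if the person on this line meets both criteria."""
    tokens = line.split("\t")
    # Assumed layout: first name, last name, age, ID (indices 0 to 3).
    age = int(tokens[2])        # "13" becomes 13
    id_number = int(tokens[3])
    return age <= max_age and id_number >= min_id


print(qualifies("Ann\tYoung\t20\t600"))   # young enough, ID large enough
print(qualifies("Bob\tElder\t30\t600"))   # too old
```

Without the int() calls, Python 3 refuses to compare "13" with 25 and raises a TypeError, which is why the conversion matters.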
A sample run of the program might look like this:
What do you call an ancient reptile with an excellent vocabulary?
Agnew
Thweatt
Hopton
Edmonson
Stanback
Arboleda
Underhill
Ransdell
Unsworth
Schultheis
Finally, modify the program to print just the first letters of the last names it is now printing. So a sample run of the program would now look like this:
What do you call an ancient reptile with an excellent vocabulary?
A
T
H
E
S
A
U
R
U
S
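Taking the first letter of a name is a matter of string indexing. A small sketch, using three of the names from the sample run (the variable names are made up):

```python
last_names = ["Agnew", "Thweatt", "Hopton"]

# Index 0 is the first character of a string.
initials = [name[0] for name in last_names]

for letter in initials:
    print(letter)
```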
When you are reading from the file, you will need to remove the last character (the newline) from each line. You will also want to split() the line up into a list to choose just the fields you want. Start counting at zero.
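One caution, offered as a sketch rather than a requirement: slicing off the last character with line[:-1] works when every line ends in a newline, but the final line of a file sometimes does not. Using rstrip("\n") removes a newline only if one is present. The sample strings below are made up.

```python
with_newline = "Ann\tYoung\t20\t600\n"
without_newline = "Ann\tYoung\t20\t600"   # e.g. the last line of a file


def clean(line):
    """Remove a trailing newline if there is one."""
    return line.rstrip("\n")


# Slicing blindly chops a real character when no newline is present.
sliced_id = without_newline[:-1].split("\t")[3]
safe_id = clean(without_newline).split("\t")[3]
print(sliced_id, safe_id)
```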
I would suggest starting with the very short input file called demo.txt - it only contains 5 lines and should print out names starting with the letters Y, E, and S.
Save the file BPD.txt from the labs web page. The file contains known disease genes for bronchopulmonary dysplasia.
Bronchopulmonary dysplasia is a serious lung condition that mostly affects babies born preterm. These babies do not have fully developed lungs at birth and, because they are born early, lack sufficient lung surfactant for the lungs to function properly. These babies are given surfactant and oxygen. Most develop the ability to transition to room air and to breathe normally by the time they reach their original due dates if not significantly earlier. However, some require supplemental oxygen past this point and are diagnosed as having bronchopulmonary dysplasia (BPD). While most eventually are able to breathe on their own, those with BPD may need treatment for many months or years and can have pulmonary difficulties throughout life.
In this problem, we are going to run functional annotation analysis on the set of genes linked to bronchopulmonary dysplasia to understand what functional processes have been implicated in infants with BPD.
Go to the DAVID web site: http://david.abcc.ncifcrf.gov/ or follow the link on the lab page. (Or Google "DAVID"; surprisingly, this site will usually be among the top 3 hits!) Click on Start Analysis.
Next, use the default annotation terms (but expand some of the checkboxes so you can see what they are), and create a functional annotation Chart (click on the "Functional Annotation Chart" button, then do so again on the next screen). This list is sorted by (raw) P-value, lowest (most-significant) to highest.
Now go back to the main DAVID page and click on Functional Annotation Clustering. This shows clusters of related annotation terms that are enriched in the gene list you uploaded. This can give you a different overview of enriched functions (focusing attention on related terms that appear to be significantly enriched, not just the few that are independently the most significant).
Finally, go back to the page where you started running the analysis (the top of the page will say Annotation Summary Results). Click "clear all" to get rid of all of the annotation categories that are pre-selected. Now check the box next to "General Annotations" and select "cytoband". Create an annotation Chart again.
Go to the GEO web site. Under the "Search for studies at GEO DataSets" link, search for a data set related to “prolymphocytic leukemia.”
Every GDS record is a curated version of some GSE record. The curation allows for additional analyses to be performed (see “Going Further,” below). GEO is no longer curating GDS records from GSEs (although there are some efforts to automate the process, so GDS records may start being created again soon).
Click on the name of that data set to get to a page that describes it more fully. Find the link to the “Reference Series” (the associated GSE record) and click on this.
Read the Summary and Overall Design entries to get a sense of what this data set is studying. View the full list of samples to better understand the experimental design.
Note that at the bottom of the page, there is a link called “SOFT formatted family file(s).” If you wanted to download the expression data so that you could analyze it yourself (which we will do next week), you could do it by saving this SOFT file.
Suppose you are interested in studying expression in the hypothalamus and pituitary gland to help suggest novel drug targets for Parkinson’s disease. We will use the UniGene database to find genes that show differential expression between the hypothalamus and the pituitary.
Go to UniGene, and choose the “Library Browser” under the heading "Unigene Tools" in the center of the screen. Select human libraries with at least 1000 sequences. Using control-F, search for “hypothalamus.”
Select a hypothalamus library, and identify it by its UniGene ID number and description.
Select the link next to the name of one of these most frequently expressed genes. Scroll down to the box labeled “Gene Expression,” and click the “EST Profile” link to find the distribution of each of these genes in other tissues.
We will say a gene is mostly specific to a given tissue if it is expressed at a level of at least 1000 TPM (transcripts per million) in that tissue, and it is not expressed at above 1000 TPM in any other tissue. There is a legend at the bottom of the page that will show you how to read the EST Profile report.
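The rule above can be written as a small check. This is a sketch under stated assumptions: the function name is_mostly_specific and the TPM values are invented for illustration and do not come from any real EST profile.

```python
def is_mostly_specific(tpm_by_tissue, tissue, threshold=1000):
    """Apply the rule: at least `threshold` TPM in `tissue`,
    and not above `threshold` TPM in any other tissue."""
    if tpm_by_tissue.get(tissue, 0) < threshold:
        return False
    return all(
        tpm <= threshold
        for name, tpm in tpm_by_tissue.items()
        if name != tissue
    )


# Invented TPM values, for illustration only.
profile = {"hypothalamus": 2400, "pituitary": 310, "liver": 0}
print(is_mostly_specific(profile, "hypothalamus"))
print(is_mostly_specific(profile, "pituitary"))
```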
Follow the DDD link on the UniGene page to find the Digital Differential Display link. You’d like to find genes differentially expressed between hypothalamus and pituitary. Select human as the species of choice, and click “Continue.” Do a search (control-F again) for “hypothalamus” to create your first sample set. Click on the ID number for each possible sample to get a description of it; choose only those samples from normal adult hypothalamus.
Name the pool “hypothalamus” and click “Continue.” Define the second class similarly. Search for “pituitary.” Select only those samples from normal adult pituitary. (You can click on the number next to each sample to get a better description of what is in it. If there is no developmental stage listed, assume it is not adult. If there is no obvious pathology listed in the description, you may assume it is normal.) Name this class "pituitary" and click "Continue."
Click on the ID number of the top gene to go to its UniGene page. Now scroll down to the GEO profiles link, and follow that to find data sets in which this gene is significantly differentially expressed. Look at the names of the top 10 GEO profiles. Can you learn anything about this protein's expression pattern during normal fetal development?
Sort the resulting list by P value by clicking on the "P value" column header until the most significantly enriched terms appear at the top of the list.
Use the AmiGO browser to look for Gene Ontology terms associated with your favorite gene.
Now look up the same gene in the Reactome and KEGG pathway databases. How does what you learned about your gene differ depending on the database you use?