Read in the file presidents.txt that lists the names of our past few presidents and the years that they served.
Open this file for reading, and open a new file for writing. Read in each line of the input file. For the odd-numbered lines, print the president's full name to the screen, and ask the user to enter the president's more common nickname if one exists. (E.g., instead of 'James Earl Carter, Jr.', you might enter 'Jimmy Carter.') Print the nickname to the output file instead of the full name. If there is no commonly used nickname, you can just enter the full name. For the even-numbered lines, which have the dates that the president served, just copy the line over into the output file.
Submit your code with your lab writeup.
Recall that to read from a file in Python, you should:
To write to a file,
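As a reminder of the general pattern, here is a minimal sketch in Python 3 that reads one file line by line and writes to another. The function name `copy_with_line_numbers` and the file names in the usage note are hypothetical placeholders, and the task it performs is deliberately different from the exercise above.

```python
def copy_with_line_numbers(in_path, out_path):
    """Copy a text file, prefixing each line with its 1-based line number."""
    # "r" opens a file for reading; "w" opens a file for writing and
    # overwrites any existing file with the same name.
    with open(in_path, "r") as infile, open(out_path, "w") as outfile:
        for number, line in enumerate(infile, start=1):
            # Remove the trailing newline, then write our own.
            outfile.write(f"{number}: {line.rstrip(chr(10))}\n")
    # Leaving the "with" block closes both files automatically.
```

You might call it as `copy_with_line_numbers("input.txt", "output.txt")`. If you prefer explicit `open()` and `close()` calls, the same loop works as long as you remember to close both files at the end.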
Hint: for help while using MEGA, click on the "Tutorial" icon (the one with the image of a book on it) in the main MEGA window.
Load the file in MEGA. In the main window, select "Phylogeny" and then "Construct/Test Neighbor Joining Tree." MEGA should ask if you want to use the currently active data, which should be ebola.meg. Select "Yes." For options, change model to p-distance and make sure bootstrapping is off under test of phylogeny. Leave all other options as default, and click "Compute."
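To see what the p-distance model measures, here is a minimal sketch in Python: the p-distance between two aligned sequences is the proportion of compared sites at which they differ. The function name `p_distance` is hypothetical, and skipping any site with a gap in either sequence is an assumption of this sketch; MEGA's own gap handling depends on the options you select.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: ignore sites with a gap in either sequence
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Two of the eight sites differ, so the p-distance is 2/8.
print(p_distance("ACGTACGT", "ACGTTCGA"))
```

Neighbor joining builds the tree from the matrix of these pairwise distances, so the choice of distance model affects the branch lengths you see.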
Now take a look at the constructed tree. How are the different outbreaks grouped? What can you conclude about the current outbreak from the structure of the tree? Note that, with a subtree node selected, you can select the "Display in a window" option in the "Subtree" menu to get a better view of a subtree.
Try changing where the tree is rooted (you can do this by left-clicking on an interior node and selecting the Subtree/Root option). Is there any rooting of the tree that seems most correct? Why?
Explore the options under the View menu. Choose View/Tree/Branch Style/Radiation. Toggle the Topology only option. Show and hide the branch lengths. See what else you can do with the Tree Explorer options.
What is the weakest clade in the data set? Do you believe it is likely a real clade?
Show the tree with the bootstrapping results in your writeup.
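If you are curious what bootstrapping does behind the scenes, the resampling step can be sketched in Python as follows. Each replicate is built by drawing alignment columns at random, with replacement, until the replicate is as long as the original alignment. A tree is then built from each replicate, and the support for a clade is the fraction of replicate trees that contain it. The function name `bootstrap_alignment` and the toy alignment are hypothetical, and the tree-building step is left out; this is not MEGA's code.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment maps sequence names to aligned sequences of equal length.
    Columns are sampled with replacement, and the same sampled columns
    are used for every sequence so that sites stay aligned.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


# Toy alignment with made-up sequences, for illustration only.
toy = {"seq1": "ACGTAC", "seq2": "ACGTTC", "seq3": "TCGTAC"}
replicate = bootstrap_alignment(toy, random.Random(0))
for name, seq in replicate.items():
    print(name, seq)
```

Because columns are drawn with replacement, some sites appear more than once in a replicate and others not at all. A clade that is supported by only a few sites will therefore drop out of many replicate trees.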
The file ebola.np.fasta contains a multiple sequence alignment of part of the NP gene from the Ebola virus data set. Download this file and use it to determine whether this gene is under positive or negative selection using the SNAP web site and the instructions below.
Upload the alignment file. Check the XYPLOT option. Click "submit". After it runs for a while, it will show a large table comparing all pairs of sequences. Here is an explanation of the output columns:
Compare: Lists the two sequences compared, numbered starting from 0 (so 4 sequences would be numbered 0-3)
Sequences_names: The names of the two sequences being compared
Sd: The number of observed synonymous substitutions
Sn: The number of observed non-synonymous substitutions
S: The number of potential synonymous substitutions (the average for the two compared sequences)
N: The number of potential non-synonymous substitutions (the average for the two compared sequences)
ps: The proportion of observed synonymous substitutions: Sd/S
pn: The proportion of observed non-synonymous substitutions: Sn/N
ps/pn: The ratio of synonymous to non-synonymous substitutions
ds: ps, using the Jukes-Cantor correction for multiple hits
dn: pn, using the Jukes-Cantor correction for multiple hits
ds/dn: The ratio of synonymous to non-synonymous substitutions, using the Jukes-Cantor correction
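To make the relationship between these columns concrete, here is a minimal sketch in Python that derives ps, pn, ds, dn, and the two ratios from the four counts. It uses the standard Jukes-Cantor formula d = -(3/4) ln(1 - (4/3) p). The function names and the example counts are hypothetical, and SNAP's own handling of edge cases (such as pn = 0) may differ.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


def substitution_ratios(sd, sn, s, n):
    """Compute the SNAP-style columns from observed and potential counts."""
    ps = sd / s  # proportion of observed synonymous substitutions
    pn = sn / n  # proportion of observed non-synonymous substitutions
    ds = jukes_cantor(ps)
    dn = jukes_cantor(pn)
    return {
        "ps": ps,
        "pn": pn,
        # Assumption of this sketch: report None when the ratio is undefined.
        "ps/pn": ps / pn if pn > 0 else None,
        "ds": ds,
        "dn": dn,
        "ds/dn": ds / dn if dn > 0 else None,
    }


# Made-up counts, for illustration only.
for column, value in substitution_ratios(sd=10, sn=5, s=100, n=300).items():
    print(column, value)
```

The correction always makes a nonzero distance larger than the raw proportion, because some sites will have changed more than once and only the last change is visible.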
Scroll down to the bottom of the output screen, where it shows the average of all the pairwise comparisons. What is the average dS/dN ratio? What does this suggest about the selective pressure on the NP gene in Ebola?
There is a new Ebola genome browser available at UC Santa Cruz's Ebola genome portal. Go to this portal, and then click on the browser image below the phrase "Explore the Ebola genome with the UCSC Browser" to see the genome browser tool. Click the "zoom out" buttons (first 3x, then 1.5x) to see the entire genome. The navy blue track near the top of the browser shows the genes. Click on the NP gene (near the start of the genomic sequence) to see what is known about it. You can also look it up in the Gene database. Does this inform your answer to question 3 above?
Explore the browser and the portal further if you are interested.
The file four_aligned_globins.txt contains a multiple sequence alignment (from MAFFT) of four globin transcripts. Download this file and use it to determine whether the globins are under positive or negative selection using the SNAP web site and the instructions above.
Upload the alignment file (or cut and paste its contents into the SNAP window). Click "submit" and look at the statistics calculated for all pairs of these four sequences.
What is the dS/dN ratio for the human and chimp protein? What is the average dS/dN ratio? What does this suggest about the selective pressure on globins? How does this relate to what you saw in Ebola? Can you explain why you (might) see the results you do?
Look at the XYPlots, which show the normalized rates of accepted nonsynonymous and synonymous substitutions across the input sequences, for both the globins and Ebola NP. Do these plots make sense to you in light of your overall results?
The following exercises will give you more practice with Python lists and for loops. For all of the following problems, you can either generate a user-specified list from raw_input, or you may hard-code a list.
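As a warm-up, here is a minimal sketch of the two ways to get a list, followed by a for loop that processes it. It is written for Python 3, where `raw_input` is spelled `input`; the function names `parse_number_list` and `running_total` are hypothetical and not part of the exercises.

```python
def parse_number_list(text):
    """Turn a string such as '3 1 4' into the list [3.0, 1.0, 4.0]."""
    return [float(token) for token in text.split()]


def running_total(values):
    """Return a list whose i-th entry is the sum of values[0] through values[i]."""
    total = 0
    totals = []
    for value in values:
        total += value
        totals.append(total)
    return totals


# Option 1: a hard-coded list.
numbers = [3, 1, 4]

# Option 2: a user-specified list (uncomment to try it).
# numbers = parse_number_list(input("Enter numbers separated by spaces: "))

print(running_total(numbers))
```

Keeping the parsing in its own function means you can test the loop on a hard-coded list first and switch to user input afterwards.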
Do some further exploration in MEGA. What happens if you use different tree construction algorithms? Which methods do you believe are most accurate on your chosen data set, and why? You may want to refer to the text to see if you can find a good explanation for your observations.