Recall that to read from a file in Python, you should:
To write to a file,
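As a reminder, a minimal sketch of both operations might look like the following. The file name and the text written are placeholders chosen for this sketch, not part of the lab files.

```python
# Placeholder file name, chosen only for this sketch.
filename = "example_notes.txt"

# Write: opening in "w" mode creates the file (or overwrites it).
with open(filename, "w") as outfile:
    outfile.write("first line\n")
    outfile.write("second line\n")

# Read: opening in "r" mode lets you loop over the file line by line.
lines = []
with open(filename, "r") as infile:
    for line in infile:
        # Each line ends with a newline character; remove it.
        lines.append(line.rstrip("\n"))

print(lines)
```

Using `with` closes the file automatically when the indented block ends, so no explicit call to close() is needed.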
To split up a string into a list, use the string method split(). With no arguments, split() splits on whitespace. If there is an argument, that character or string is used to split up the input string into list entries. So for example:
    mystr = "These are the words in a sentence."
    mystr = mystr.strip(".")  # removes the period at the end
    words = mystr.split()     # creates a list where each element is a word
    print(words)

would print out:

    ['These', 'are', 'the', 'words', 'in', 'a', 'sentence']
However,
    commasep = "Alice,Bob,Charlie,Denzel,Eli,Frank"
    names = commasep.split(",")
    print(names[0])

would print Alice.
The file plosmedTC.txt is a short data file listing some articles from PLoS journals. The file contains three lines for each paper: the first is the title, the second is the author list, and the third is the citation (journal, volume, etc.).
The problem is that the current version of the file is hard to read, because the title and author lists are very long. You will write a script that reads this file line by line and creates a new version of the file with a shortened title (entered by the user) and a shortened author list that includes the first author's name and "et al." whenever there is more than one author.
For example, if the input file contains the following text:
    Systematic review of the predictors of statin adherence for the primary prevention of cardiovascular disease.
    David M. Diamond, Michel de Lorgeril, Malcolm Kendrick, Uffe Ravnskov, Paul J. Rosch
    PLoS One. 2019; 14(1): e0205138.
    Measures of possible allostatic load in comorbid cocaine and alcohol use disorder: Brain white matter integrity, telomere length, and anti-saccade performance
    Jonika Tannous, Benson Mwangi, Khader M. Hasan, Ponnada A. Narayana, Joel L. Steinberg, Consuelo Walss-Bass, F. Gerard Moeller, Joy M. Schmitz, Scott D. Lane
    PLoS One. 2019; 14(1): e0199729.

then a run of the program might display the following to the console (the text after each prompt is the user's input):

    Systematic review of the predictors of statin adherence for the primary prevention of cardiovascular disease.
    Please enter a short summary of this title: Statin adherence in cardiovascular disease
    Measures of possible allostatic load in comorbid cocaine and alcohol use disorder: Brain white matter integrity, telomere length, and anti-saccade performance
    Please enter a short summary of this title: Allostatic load in cocaine and alcohol use disorder

and the output file would then look like:

    Statin adherence in cardiovascular disease
    David M. Diamond, et al.
    PLoS One. 2019; 14(1): e0205138.
    Allostatic load in cocaine and alcohol use disorder
    Jonika Tannous, et al.
    PLoS One. 2019; 14(1): e0199729.
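The author-shortening step can be sketched on its own, separately from the file handling. The function name `shorten_authors` is made up for this illustration, and the sketch assumes that authors are separated by commas and that no individual name contains a comma.

```python
def shorten_authors(author_line):
    """Return the first author, followed by ', et al.' if there are more."""
    # Split on commas, then strip the spaces (and any trailing newline)
    # left around each name.
    authors = []
    for name in author_line.split(","):
        authors.append(name.strip())

    if len(authors) > 1:
        return authors[0] + ", et al."
    return authors[0]


many = "David M. Diamond, Michel de Lorgeril, Malcolm Kendrick, Uffe Ravnskov, Paul J. Rosch\n"
one = "Jane Doe\n"  # hypothetical single-author line

print(shorten_authors(many))
print(shorten_authors(one))
```

Note that split(",") leaves a leading space on every name after the first, which is why each piece is passed through strip() before use.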
If you are comfortable writing this script without further help, you may do so. But if not, here are some suggested steps that will get you there:
Name your file passthrough.py and submit your code with your lab writeup in Gradescope.
Hint: for help while using MEGA, click on the "Help Docs" icon (the one with the image of a lifesaver on it) in the main MEGA window.
Load the file in MEGA. In the main window, select "Phylogeny" and then "Construct/Test Neighbor Joining Tree." MEGA should ask if you want to use the currently active data, which should be ebola.meg. Select "Yes." For options, change model to p-distance and under test of phylogeny, choose "None". Leave all other options as default, and click "OK".
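For intuition about the p-distance model: the p-distance between two aligned sequences is the proportion of compared sites at which they differ. The sketch below is only an illustration, not MEGA's implementation; the function name is made up, and skipping any site that has a gap in either sequence is an assumption of this sketch.

```python
def p_distance(seq_a, seq_b):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        # Assumption of this sketch: ignore sites with a gap in either sequence.
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1

    if compared == 0:
        raise ValueError("no sites to compare")
    # float() keeps the division exact under Python 2 as well.
    return float(differences) / compared


# Toy sequences, not taken from ebola.meg: they differ at 2 of 8 sites.
print(p_distance("ACGTACGT", "ACGTTCGA"))
```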
Now take a look at the constructed tree. How are the different outbreaks grouped? What can you conclude about the 2014 outbreak from the structure of the tree? Note that, with a subtree node selected, you can select the "Display in a window" option in the "Subtree" menu to get a better view of a subtree.
What is the weakest clade in the data set? Do you believe it is a real clade?
Show the tree with the bootstrapping results in your writeup.
Selection
The file ebola.np.fasta contains a multiple sequence alignment of part of the NP gene from the Ebola virus data set. Download this file and use it to determine whether this gene is under positive or negative selection using the SNAP web site and the instructions below.
Upload the alignment file. Check the XYPLOT option. Click "submit". After it runs for a while, it will show a large table comparing all pairs of sequences. Most of what you want is at the top or bottom, but here is an explanation of the output columns:
Compare: Lists the two sequences compared, starting with 0 (so 4 sequences would be numbered 0-3)
Sequences_names: The names of the two sequences being compared
Sd: The number of observed synonymous substitutions
Sn: The number of observed non-synonymous substitutions
S: The number of potential synonymous substitutions (the average for the two compared sequences)
N: The number of potential non-synonymous substitutions (the average for the two compared sequences)
ps: The proportion of observed synonymous substitutions: Sd/S
pn: The proportion of observed non-synonymous substitutions: Sn/N
ps/pn: The ratio of synonymous to non-synonymous substitutions
ds: ps, using the Jukes-Cantor correction for multiple hits
dn: pn, using the Jukes-Cantor correction for multiple hits
ds/dn: The ratio of synonymous to non-synonymous substitutions, using the Jukes-Cantor correction
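To see how these columns relate to one another, here is a small sketch that computes them for one pair of sequences. The counts are invented for illustration and are not real SNAP output. The correction used is the standard Jukes-Cantor formula, d = -(3/4) ln(1 - (4/3) p), which is only defined for p below 0.75.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor correction for multiple hits: d = -(3/4) * ln(1 - (4/3) * p)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical values for one pairwise comparison (not real data).
Sd = 10.0   # observed synonymous substitutions
Sn = 5.0    # observed non-synonymous substitutions
S = 100.0   # potential synonymous substitutions
N = 300.0   # potential non-synonymous substitutions

ps = Sd / S            # proportion of observed synonymous substitutions
pn = Sn / N            # proportion of observed non-synonymous substitutions
ds = jukes_cantor(ps)  # corrected synonymous rate
dn = jukes_cantor(pn)  # corrected non-synonymous rate

print("ps/pn =", ps / pn)
print("ds/dn =", ds / dn)
```

The corrected values are always slightly larger than the raw proportions, because the correction estimates substitutions that were hidden by later changes at the same site.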
Scroll down to the bottom of the output screen, where it shows the average of all the pairwise comparisons. What is the average dS/dN ratio? What does this suggest about the selective pressure on the NP gene in Ebola?
At the top of the results screen it says "Click here to view XYPLOT". Do so. Focusing on just the green and red lines showing the relative rates of synonymous and nonsynonymous mutations per codon, do you see any positions in the sequence that seem more accepting of synonymous mutations compared to nonsynonymous ones? Or regions where the two lines are more distant? Can this plot help you refine your conclusions about selective pressure affecting the NP gene?
There is an Ebola genome browser available at UC Santa Cruz's Ebola genome portal. Go to this portal, and then click on the browser image below the phrase "Explore the Ebola genome with the UCSC Browser" to see the genome browser tool. Click the "zoom out" buttons (first 3x, then 1.5x) to see the entire genome. The navy blue track near the top of the browser shows the genes. Click on the NP gene (near the start of the genomic sequence) to see what is known about it. You can also look it up in the Gene database. Does this inform your answer to question 3 above?
Explore the browser and the portal further if you are interested.
The file four_aligned_globins.txt contains a multiple sequence alignment (from MAFFT) of four globin transcripts. Download this file and use it to determine whether the globins are under positive or negative selection using the SNAP web site and the instructions above.
Upload the alignment file (or cut and paste its contents into the SNAP window). Click "submit" and look at the statistics calculated for all pairs of these four sequences.
What is the dS/dN ratio for the human and chimp protein? What is the average dS/dN ratio? What does this suggest about the selective pressure on globins? How does this relate to what you saw in Ebola? Can you explain why you (might) see the results you do?
Look at the XYPlots, which show the normalized rates of accepted nonsynonymous and synonymous substitutions across the input sequences, for both the globins and Ebola NP. Do these plots make sense to you in light of your overall results?
The following exercises will give you more practice with Python lists and for loops. For all of the following problems, you can either generate a user-specified list from user input (input() in Python 3, raw_input() in Python 2), or you may hard-code a list.
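Both ways of getting a list can be sketched as follows. The helper name `parse_int_list` and the prompt text are made up for this illustration, and the sketch assumes the user types whole numbers separated by spaces.

```python
def parse_int_list(text):
    """Turn a string such as '3 1 4' into the list [3, 1, 4]."""
    numbers = []
    for token in text.split():
        numbers.append(int(token))
    return numbers


# Option 1: a hard-coded list.
numbers = [3, 1, 4, 1, 5]

# Option 2: a list typed by the user, for example:
#   numbers = parse_int_list(input("Enter numbers separated by spaces: "))
# (use raw_input instead of input in Python 2).

# A for loop that visits every element of the list.
total = 0
for n in numbers:
    total += n

print(total)
```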