Comp 7 / Bio 40 Lab 4: Phylogeny

October 10, 2019

Purpose of this lab

To use MEGA X for phylogenetic analysis. We will also try writing a pass-through script.

Recall that to read from a file in Python, you should:

  1. open the file for reading (the open function returns a 'file object' that you have to save)
  2. use a for loop to read each consecutive line from the file object ('for line in fi:')
  3. do whatever you need to with each line. Note that it will have a newline character at the end. The command line = line.strip() removes all leading and trailing whitespace from the string in variable line. To remove a particular trailing character, like a period, pass that character as an argument to strip (e.g., line = line.strip(".") will remove a trailing period from the line.)
  4. close the file using the file object.

To write to a file,

  1. open the file for writing (the open function returns a 'file object' that you have to save)
  2. use write to write each line to the file. Recall that the argument to write needs to be a string. If you want a newline character at the end of a line, you may need to add one in explicitly. (e.g. fileobj.write( mystr + "\n"))
  3. remember to close the file.

To split up a string into a list, use the string method split(). With no arguments, split() splits on whitespace. If there is an argument, that character or string is used to split up the input string into list entries. So for example:

mystr="These are the words in a sentence."
mystr=mystr.strip(".")   #removes the period at the end
words=mystr.split()      #creates a list where each element is a word
print(words)
would print out: ['These', 'are', 'the', 'words', 'in', 'a', 'sentence']

However,

 
commasep = "Alice,Bob,Charlie,Denzel,Eli,Frank"
names=commasep.split(",") 
print(names[0])
would print Alice.

Lab problems

  1. Python programming: writing a pass-through script for PubMed listings

    The file plosmedTC.txt is a short data file with a list of some articles from some PLoS journals. The file contains three lines for each paper: the first is the title, the second is the author list, and the third is the citation (journal, volume, etc.)

    The problem is that the current version of the file is hard to read, because the title and author lists are very long. You will write a script that reads this file line by line and creates a new version of the file with a shortened title (entered by the user) and a shortened author list that includes the first author's name and "et al." whenever there is more than one author.

    For example, if the input file contains the following text:

    Systematic review of the predictors of statin adherence for the primary prevention of cardiovascular disease.
    David M. Diamond, Michel de Lorgeril, Malcolm Kendrick, Uffe Ravnskov, Paul J. Rosch
    PLoS One. 2019; 14(1): e0205138. 
    Measures of possible allostatic load in comorbid cocaine and alcohol use disorder: Brain white matter integrity, telomere length, and anti-saccade performance
    Jonika Tannous, Benson Mwangi, Khader M. Hasan, Ponnada A. Narayana, Joel L. Steinberg, Consuelo Walss-Bass, F. Gerard Moeller, Joy M. Schmitz, Scott D. Lane
    PLoS One. 2019; 14(1): e0199729. 
    
    then a run of the program might display the following to the console (with user input shown in boldface):
    Systematic review of the predictors of statin adherence for the primary prevention of cardiovascular disease.
    
    Please enter a short summary of this title:  Statin adherence in cardiovascular disease
    
    Measures of possible allostatic load in comorbid cocaine and alcohol use disorder: Brain white matter integrity, telomere length, and anti-saccade performance
    
    Please enter a short summary of this title:  Allostatic load in cocaine and alcohol use disorder
    
    and the output file would then look like:
    Statin adherence in cardiovascular disease
    David M. Diamond, et al.
    PLoS One. 2019; 14(1): e0205138. 
    Allostatic load in cocaine and alcohol use disorder
    Jonika Tannous, et al.
    PLoS One. 2019; 14(1): e0199729. 
    

    If you are comfortable writing this script without further help, you may do so. But if not, here are some suggested steps that will get you there:

    1. To start, save the file plosmedTC.txt. Then write code that opens that file to a filehandle named fi. Open an output file to a filehandle named fo. Read each line of the input file using the for line in fi: construct, and write that same line to the output file. Close both files, and test your program until it works. Specifically, look at the output file and ensure that it looks just like your input file. This will ensure that you have the file input and output pieces of your 'pass through' script and that content actually does pass through.

    2. Now go back and modify your program to count the lines of the file as it reads them. Have it print out each line number to the output file at the front of each line it copies over. Test it again, and make sure that the output file lines look just like the input file lines except for having a line number before each. This will ensure that your line counter is working properly.

    3. You don't really want to see the line numbers in the output file; you just want to use them to figure out whether a line contains a title, an author list, or publication details. So, remove the code the writes the line to the output file. Instead, figure out how to tell from the counter if you've just read a line containing a title. If you have, print that title line to the screen. Continue to print all three lines to the output file in addition. Test your program again.

    4. Now, after printing the title line to the console, ask the user to input a shorter version of the title. Read the shorter title from the console, write the shorter title to the output file, and continue. If the line is not a title line, just copy it to the output file. Test.

    5. Finally, check to see if the line is an author line. If so, split it up into a list where each element is an author, using the split() command described above. If there is more than one author name, write just the first one to the output file and add "et al." after the author's name. If there is only one name, just copy it over to the output file. Copy the publication line as it is in all cases. Test your file a final time.

    Name your file passthrough.py and submit your code with your lab writeup in Gradescope.

Phylogeny

  1. For the next question, you will use the MEGA software to construct phylogenetic trees from a multiple sequence alignment. You can also use MEGA to build the MSA, but you have done that already using other tools, and it will be slow on a data set of this size, so here we are giving you an MSA that we have pre-computed.
    1. Follow these instructions to get started in MEGA:
      1. Start MEGA (from the Start Menu, choose All Programs/MEGA X/MEGA X/, or double-click the MEGA icon on your desktop if it is there).
      2. Open the file Drosophila_Adh.meg, which contains the DNA sequences for the Adh gene in 11 Drosophila species. To open it in MEGA, click on File / Open a File/Session /, browse to find the correct file and click on the "Open" button. Click on the alignment icon (the box with a big TA in it). The first sequence (that from Drosophila melanogaster) appears across the top; below it are shown any differences from that first sequence. Note for example that all the sequences start with the start codon ATG, but the second codon, TCG, appears in no species other than D. melanogaster. All the sequences are aligned, and there are no gaps.

        Hint: for help while using MEGA, click on the "Help Docs" icon one with the image of a lifesaver on it) in the main MEGA window.

      3. Now, you will compute a table of estimated evolutionary distances between all pairs of sequences in your file, under various distance models. To do this, go to the main MEGA launch bar (the icons at the top of your MEGA window) and select Distance / Compute Pairwise Distance. When the program asks if you would like to use the currently active data, click Yes. You should get a popup window titled Analysis Preferences.
      4. In this window, ensure that the Substitutions Type is set to Nucleotide. Click the pull-down for Model/Method, and select the p-distance method. For now, we will be using the defaults for the other options. Click "OK" to begin the computation. When it is done, a window will appear showing a table of pairwise distances between all pairs of sequences. In this table (i.e., using the nucleotide p-distance measure) is D. pseudoobscura closer to D. afinidisjuncta or D. nigra under this model? Recall that these are measures of distance, so smaller distances suggest two taxa are more closely related.
      5. How do the distances change when you recomput the distances using the Jukes-Cantor model?
      6. Now try computing pairwise distances using the amino acid p-distance option by choosing "Amino Acid" instead of "Nucleotide" and then choosing "p-distance" under "Model/Method". Is D. pseudoobscura closer to D. affinidisjuncta or D. nigra? Explain your results.

    2. Download the file ebola.meg. This file contains whole genome Ebola sequences from the current and several previous outbreaks. Three of the 2014 sequences - EBOV_2014_KJ660346,7, and 8 - are from Guinea, where the outbreak began in February. The remainder of the 2014 samples were collected from EBOV patients in Sierra Leone, where the virus spread starting May 2014. The remainder of the samples are from past EBOV outbreaks in Central Africa, denoted by year.

      Load the file in MEGA. In the main window, select "Phylogeny" and then "Construct/Test Neighbor Joining Tree." MEGA should ask if you want to use the currently active data, which should be ebola.meg. Select "Yes." For options, change model to p-distance and under test of phylogeny, choose "None". Leave all other options as default, and click "OK".

      Now take a look at the constructed tree. How are the different outbreaks grouped? What can you conclude about the 2014 outbreak from the structure of the tree? Note that, with a subtree node selected, you can select the "Display in a window" option in the "Subtree" menu to get a better view of a subtree.

    3. Try changing where the tree is rooted (you can do this by left clicking on an interior node and selecting Subtree/Root option). Is there any rooting of the tree that seems most correct? Why?

    4. Next, try using bootstrapping to assess confidence in the tree. Build a new phylogenetic tree using the same methods as before, except this time choose bootstrapping for test of phylogeny. Set the number of bootstrapping iterations to 100 instead of the default 500 (this will speed things up a little).

      What is the weakest clade in the data set? Do you believe it is a real clade?

      Show the tree with the bootstrapping results in your writeup.

    5. Rebuild the tree using maximum parsimony and bootstrapping. What has changed from part c? Does the weakest clade still exist? What do you think about the validity of the parsimony assumption for this data set?

    6. Interpretation: which do you think is more similar to the 1995 outbreak - that of 2007 or of 1976-77? How do you know.
  2. Selection

    The file ebola.np.fasta contains a multiple sequence alignment of part of the NP gene from the Ebola virus data set. Download this file and use it to determine whether this gene is under positive or negative selection using the SNAP web site and the instructions below.

    Upload the alignment file. Check the XYPLOT option. Click "submit". After it runs for a while, it will show a large table comparing all pairs of sequences. Most of what you want is at the top or bottom, but here is an explanation of the output columns:

    Compare:  Lists the two sequences compared, starting with 0 (so 4 sequences would be numbered 0-3)
    Sequences_names:  The names of the two sequences being compared. 
    Sd:  The number of observed synonymous substitutions 
    Sn:  The number of observed non-synonymous substitutions 
    S:  The number of potential synonymous substitutions (the average for the two compared sequences)  
    N:  The number of potential non-synonymous substitutions (the average for the two compared sequences)  
    ps:  The proportion of observed synonymous substitutions: Sd/S  
    pn:  The proportion of observed non-synonymous substitutions: Sn/N  
    ps/pn:  The ration of synonymous to non-synonymous substitutions 
    ds:  ps, using the Jukes-Cantor correction for multiple hits 
    dn:  pn, using the Jukes-Cantor correction for multiple hits 
    ds/dn: The ratio of synonymous to non-synonymous substitutions, using the Jukes-Cantor correction  
    

    Scroll down to the bottom of the output screen, where it shows the average of all the pairwise comparisons. What is the average dS/dN ratio? What does this suggest about the selective pressure on the NP gene in Ebola?

    At the top of the results screen it says "Click here to view XYPLOT". Do so. Focusing on just the green and red lines showing the relative rates of synonymous and nonsynonymous mutations per codon, do you see any positions in the sequence that seem more accepting of synonymous mutations compared to nonsynonymous ones? Or regions where the two lines are more distant? Can this plot help you refine your conclusions about selective pressure affecting the NP gene?


Going Further:

  1. Explore what MEGA can tell you about your data. Look at the options under the View menu. Choose View/Tree/Branch Style/Radiation. Toggle the Topology only option. Show and hide the branch lengths. See what else you can do with the Tree Explorer options. Next, see if you can find and perform Tajima's relative rate test on three taxa. Determine if the molecular clock hypothesis is a good one for the ebola data set.

  2. The Ebola genome

    There is an Ebola genome browser available at UC Santa Cruz's Ebola genome portal. Go to this portal, and then click on the browser image below the phrase "Explore the Ebola genome with the UCSC Browser" to see the genome browser tool. Click the "zoom out" buttons (first 3x, then 1.5x) to see the entire genome. The navy blue track near the top of the browser shows the genes. Click on the NP gene (near the start of the genomic sequence) to see what is known about it. You can also look it up in the Gene database. Does this inform your answer to question 3 above?

    Explore the browser and the portal further if you are interested.

  3. More about selection

    The file four_aligned_globins.txt contains a multiple sequence alignment (from MAFFT) of four globin transcripts. Download this file and use it to determine whether the globins are under positive or negative selection using the SNAP web site and the instructions above.

    Upload the alignment file (or cut and paste its contents into the SNAP window). Click "submit" and look at the statistics calculated for all pairs of these four sequences.

    What is the dS/dN ratio for the human and chimp protein? What is the average dS/dN ratio? What does this suggest about the selective pressure on globins? How does this relate to what you saw in Ebola? Can you explain why you (might) see the results you do?

    Look at the XYPlots, which show the normalized rates of accepted nonsynonymous and synonymous substitutions across the input sequences, for both the globins and Ebola NP. Do these plots make sense to you in light of your overall results?

  4. Python practice

    The following exercises will give you more practice with Python lists and for loops. For all of the following problems, you can either generate a user-specified list from raw_input, or you may hard-code a list.

    1. Create a list called artElectives, containing the strings "Photography", "Drawing", and "Pottery". Add two new classes (your choice) to the end of the list using append. Use a list slice to print out the last three elements of the list.

    2. Write a function that takes two lists as arguments, and outputs a new list containing the first element of the first list and the last element of the second.

    3. Count the number of strings in a list of strings that start with a capital letter, and print the result.

    4. Write a loop that prints out the multiplication table for some value x, showing the value of x multiplied by all values from 1 to 10. Then write another loop that allows x to vary from 1 to 10 too. Print out the full multiplication table this way. Note that you can next one for loop inside another to make this work.