Comp 7 / Bio 40 Lab 4: Phylogeny

October 17, 2016

Purpose of this lab

To introduce students to the issues involved in multiple sequence alignment methods.

Lab problems

  1. Python programming: writing a pass-through script.

    Read in the file presidents.txt that lists the names of our past few presidents and the years that they served.

    Open this file for reading, and open a new file for writing. Read in each line of the input file. For the odd numbered lines, print the president's full name to the keyboard, and ask the user to enter their more common nickname if it exists. (E.g., instead of 'James Earl Carter, Jr.', you might enter 'Jimmy Carter.') Print the nickname to the output file instead of the full name. If there is no commonly used nickname, you can just enter their full name. For the even numbered lines, which have the dates that they served, just copy the line over into the output file.

    Submit your code with your lab writeup.

    Recall that to read from a file in Python, you should:

    1. open the file for reading (the open function returns a 'file object' that you have to save)
    2. use a for loop to read each consecutive line from the file object ('for line in fo:')
    3. do whatever you need to with each line. Note that it will have a newline character at the end. line = line.strip() removes all leading and trailing whitespace from the string in variable line.
    4. close the file using the file object

    To write to a file,

    1. open the file for writing (the open function returns a 'file object' that you have to save)
    2. use write to write each line to the file. Recall that the argument to write needs to be a string. If you want a newline character at the end of a line, you may need to add one in explicitly. (e.g. fileobj.write( mystr + "\n"))
    3. remember to close the file
Phylogeny

  1. For the next question, you will use the MEGA software to construct phylogenetic trees from a multiple sequence alignment. You can also use MEGA to build the MSA, but you have done that already using other tools, and it will be slow on a data set of this size, so here we are giving you an MSA that we have pre-computed.
    1. Follow these instructions to get started in MEGA:
      1. Start MEGA (from the Start Menu, choose All Programs/MEGA 6/MEGA 6/, or double-click the MEGA icon on your desktop if it is there).
      2. Open the file Drosophila_Adh.meg, which contains the DNA sequences for the Adh gene in 11 Drosophila species. To open it in MEGA, click on File / Open a File/Session /, browse to find the correct file and click on the "Open" button. Click on the alignment icon (the box with a big TA in it). The first sequence (that from Drosophila melanogaster) appears across the top; below it are shown any differences from that first sequence. Note for example that all the sequences start with the start codon ATG, but the second codon, TCG, appears in no species other than D. melanogaster. All the sequences are aligned, and there are no gaps.

        Hint: for help while using MEGA, click on the "Tutorial" icon (the one with the image of a book on it) in the main MEGA window.

      3. Now, you will compute a table of estimated evolutionary distances between all pairs of sequences in your file, under various distance models. To do this, go to the main MEGA launch bar (the icons at the top of your MEGA window) and select Distance / Compute Pairwise Distance. When the program asks if you would like to use the currently active data, click Yes. You should get a popup window titled Analysis Preferences.
      4. In this window, ensure that the Substitutions Type is set to Nucleotide. Click the pull-down for Model/Method, and select the p-distance method. For now, we will be using the defaults for the other options. Click "Compute" to begin the computation. When it is done, a window will appear showing a table of pairwise distances between all pairs of sequences. In this table (i.e., using the nucleotide p-distance measure) is D. pseudoobscura closer to D. afinidisjuncta or D. nigra under this model?
      5. How do the distances change when you recomput the distances using the Jukes-Cantor model?
      6. Now try computing pairwise distances using the amino acid p-distance option. Is D. pseudoobscura closer to D. affinidisjuncta or D. nigra? Explain your results.

    2. Download the file ebola.meg. This file contains whole genome Ebola sequences from the current and several previous outbreaks. Three of the 2014 sequences - EBOV_2014_KJ660346,7, and 8 - are from Guinea, where the outbreak began in February. The remainder of the 2014 samples were collected from EBOV patients in Sierra Leone, where the virus spread starting May 2014. The remainder of the samples are from past EBOV outbreaks in Central Africa, denoted by year.

      Load the file in MEGA. In the main window, select "Phylogeny" and then "Construct/Test Neighbor Joining Tree." MEGA should ask if you want to use the currently active data, which should be ebola.meg. Select "Yes." For options, change model to p-distance and make sure bootstrapping is off under test of phylogeny. Leave all other options as default, and click "Compute."

      Now take a look at the constructed tree. How are the different outbreaks grouped? What can you conclude about the current outbreak from the structure of the tree? Note that, with a subtree node selected, you can select the "Display in a window" option in the "Subtree" menu to get a better view of a subtree.

      Try changing where the tree is rooted (you can do this by left clicking on an interior node and selecting Subtree/Root option). Is there any rooting of the tree that seems most correct? Why?

      Explore the options under the View menu. Choose View/Tree/Branch Style/Radiation. Toggle the Topology only option. Show and hide the branch lengths. See what else you can do with the Tree Explorer options.

    3. Now let's try using bootstrapping to assess confidence in our tree. Build a new phylogenetic tree using the same methods as before, except this time choose bootstrapping for test of phylogeny. Set the number of bootstrapping iterations to 100 instead of the default 500.

      What is the weakest clade in the data set? Do you believe it is likely a real clade?

      Show the tree with the bootstrapping results in your writeup.

    4. Rebuild the tree using maximum parsimony and bootstrapping. What has changed from part c? Does the weakest clade still exist? What do you think about the validity of the parsimony assumption for this data set?
  2. Selection

    The file ebola.np.fasta contains a multiple sequence alignment of part of the NP gene from the Ebola virus data set. Download this file and use it to determine whether this gene is under positive or negative selection using the SNAP web site and the instructions below.

    Upload the alignment file. Check the XYPLOT option. Click "submit". After it runs for a while, it will show a large table comparing all pairs of sequences. Here is an explanation of the output columns:

    Compare:  Lists the two sequences compared, starting with 0 (so 4 sequences would be numbered 0-3)
    Sequences_names:  The names of the two sequences being compared. 
    Sd:  The number of observed synonymous substitutions 
    Sn:  The number of observed non-synonymous substitutions 
    S:  The number of potential synonymous substitutions (the average for the two compared sequences)  
    N:  The number of potential non-synonymous substitutions (the average for the two compared sequences)  
    ps:  The proportion of observed synonymous substitutions: Sd/S  
    pn:  The proportion of observed non-synonymous substitutions: Sn/N  
    ps/pn:  The ration of synonymous to non-synonymous substitutions 
    ds:  ps, using the Jukes-Cantor correction for multiple hits 
    dn:  pn, using the Jukes-Cantor correction for multiple hits 
    ds/dn: The ratio of synonymous to non-synonymous substitutions, using the Jukes-Cantor correction  
    

    Scroll down to the bottom of the output screen, where it shows the average of all the pairwise comparisons. What is the average dS/dN ratio? What does this suggest about the selective pressure on the NP gene in Ebola?


Going Further:

  1. The Ebola genome

    There is a new Ebola genome browser available at UC Santa Cruz's Ebola genome portal. Go to this portal, and then click on the browser image below the phrase "Explore the Ebola genome with the UCSC Browser" to see the genome browser tool. Click the "zoom out" buttons (first 3x, then 1.5x) to see the entire genome. The navy blue track near the top of the browser shows the genes. Click on the NP gene (near the start of the genomic sequence) to see what is known about it. You can also look it up in the Gene database. Does this inform your answer to question 3 above?

    Explore the browser and the portal further if you are interested.

  2. More about selection

    The file four_aligned_globins.txt contains a multiple sequence alignment (from MAFFT) of four globin transcripts. Download this file and use it to determine whether the globins are under positive or negative selection using the SNAP web site and the instructions above.

    Upload the alignment file (or cut and paste its contents into the SNAP window). Click "submit" and look at the statistics calculated for all pairs of these four sequences.

    What is the dS/dN ratio for the human and chimp protein? What is the average dS/dN ratio? What does this suggest about the selective pressure on globins? How does this relate to what you saw in Ebola? Can you explain why you (might) see the results you do?

    Look at the XYPlots, which show the normalized rates of accepted nonsynonymous and synonymous substitutions across the input sequences, for both the globins and Ebola NP. Do these plots make sense to you in light of your overall results?

  3. Python practice

    The following exercises will give you more practice with Python lists and for loops. For all of the following problems, you can either generate a user-specified list from raw_input, or you may hard-code a list.

    1. Create a list called artElectives, containing the strings "Photography", "Drawing", and "Pottery". Add two new classes (your choice) to the end of the list using append. Use a list slice to print out the last three elements of the list.

    2. Write a function that takes two lists as arguments, and outputs a new list containing the first element of the first list and the last element of the second.

    3. Count the number of strings in a list of strings that start with a capital letter, and print the result.

    4. Write a loop that prints out the multiplication table for some value x, showing the value of x multiplied by all values from 1 to 10. Then write another loop that allows x to vary from 1 to 10 too. Print out the full multiplication table this way.

  4. Tree inference methods

    Do some further exploration in MEGA. What happens if you use different tree construction algorithms? Which methods do believe are most accurate on your chosen data set, and why? You may want to refer to the text to see if you can find a good explanation for your observations.