Comp 7 / Bio 40 - Fall 2019

Comp 7 / Bio 40 Lab 3: MSA and number theory

September 26, 2019

Purpose of this lab

To introduce students to the issues involved in multiple sequence alignment methods; to start to work with for loops and practice moving data in and out of Python functions

An example of a function definition and how to call it appears below:

def my_function (param):
  result = (param + 18) / 3.5  # arbitrary  math computation here 
  return result

foo = my_function(72)
print "The return value is " + str(foo)

An example of the syntax for a for loop that examines each element in a list appears below:

courselist=['Math','Spanish','Physics','English','History']
for course in courselist:
    print ("You are taking %s" % (course))

Lab problems

Optional starting question: remembering functions
If you aren't really sure about how functions work or how to call them, start here. If this sounds totally trivial to you, you can skip this problem.
Write a function called FtoC. The function should have one parameter, which can contain an integer or a float, representing a temperature in degrees Farenheit. The function should convert the temperature to degrees Celsius using the formula C = (F - 32.0) * (5.0/9.0). The function should then return the temperature in degrees Celsius.
For the main part of the function, call your function with various temperatures and print out the results.
Python: lists and loops
For this first problem, create a new Python script called animals.py. In it, create a list containing five strings, each of which is an animal name. Also, create an empty list (just open and close square brackets) named pets.
Next, write a for loop that goes through each animal in turn, storing its name in a variable called critter.
For each value of critter, ask the user (using raw_input) if they would like to have a pet critter. If the answer string starts with "y" or "Y", append the name of the animal to the list pets using pets.append("name").
At the end of the loop, print out the list pets.
For example, a sample run of the program might look like:
```
Would you like to have a pet sloth?  No
Would you like to have a pet fox?  Yes
Would you like to have a pet pangolin? Yes
Would you like to have a pet tiger? No
Would you like to have a pet capybara? Yup!
['fox','pangolin','capybara']
```
Python Programming: functions and loops
We will work more with the hypothesis we tested in class, that every odd perfect square (1, 9, 25, 49, …) is one more than a multiple of 8. To test this hypothesis empirically, we will write some Python code.
Create a program called oddsquares.py. Start it by doing what you did in the anti-quiz: Write a Python function that takes in one integer argument, oddnum.
The function should check to see that the argument is actually odd; if not, it can print an error message and return False.
Assuming the argument is odd, the function should return True if the square of oddnum is equal to one more than a multiple of 8, and False otherwise. Test your function to make sure that it returns a Boolean value on all inputs you try.
Next, write code to call your function on the odd numbers between 1 and 20. To do this, create a list named first_odds and set it equal to the list [1,3,5,...,17,19]. This shouldn't take that long to type.
(If you are really comfortable with Python and looking for a new challenge, you can try to create this list using the built-in function range() instead. Range takes up to three arguments: range(start, stop, step), and returns a list that contains the integers from start to stop-1, incremented by step. If you omit step it counts by 1's; if you provide only one argument, it is assumed to be stop and that you meant to start counting from 0.)
Once you have this list, use a for loop to iterate through the numbers in the list. Call your function on each number. If you find any numbers that return False, print them out. When the program finishes the loop, have it print "done."
Multiple sequence alignment: internal splicing.
There are twelve isoforms of the mouse protein doublecortin-like kinase 1 (Dclk1). We will look at how different alignment methods work to align the exons in these different isoforms.
First, look at them in a genome browser. There are several genome browsers available, but the easiest way to see all the isoforms is to look up the Dclk1 gene in the NCBI gene page. Specifically:
Go to the course Links page or otherwise find NCBI's Gene web site. Search for the mouse Dclk1 gene page. Make sure that the page you end up looking at is in Mus musculus!) Then scroll down to the section that is titled "Genomic regions, transcripts, and products." The Genome View is below the Genomic Sequence line and the phrase "Go to nucleotide: Graphics FASTA GenBank"
At the top of this Genome View is a horizontal light grey line featuring several buttons and controls including a "Tools" pull-down and a help menu. Below this is the "Graphical panel" which shows a close-up of a particular chromosome (you should see chromosomal positions in the 55,000s) and various genomic features to be found there. A label on the upper left of the graphical panel should say "Genes, NCBI Mus musculus Annotation Release..." In the middle of the panel below that section, the name Dclk1 in blue text should appear above a depiction of the transcripts and isoforms for this gene.
Below this title are many transcripts and their corresponding proteins. There are seven transcripts whose RefSeq identifiers start with "NM," corresponding to seven curated isoforms of the protein. (RefSeq identifiers starting with "X" such as XM_006500978.1, can be considered "X-perimental"; they are computational predictions that have not yet been reviewed.) Genes are represented in green on this map; introns as thin horizontal lines, exons as thick lines that appear to be vertical bars at this scale.
1. What is the RefSeq ID for the mRNA transcript corresponding to the shortest (in terms of genomic length) of the isoforms with "NM" at the start of their RefSeq IDs? Note that some of these are much shorter than the others and appear below the full-length ones. If you hold your mouse over one of the horizontal bars or refseq id's on this map, you will see the length of the corresponding coding sequence (CDS) and the protein product length in amino acids.
Download and save the text file mouse.Dclk1.fasta. This file contains the amino acid sequences corresponding to isoforms 1 to 7 of Dclk1 that you were looking at in the genome browser. Look at this file in Notepad++ to see the different lengths of the sequences. One of the exons common to this protein starts with the highly conserved sequence VNGTP. Search for this sequence in the file; look for all occurrences of it.
1. For each of the seven isoforms, list whether they have this exon and if so, approximately where in the sequence does this exon begin? (Near the N-terminus, the C-terminus, or in the middle of the protein?)
2. Another shared exon starts with the highly conserved sequence EESEE. As above, search for this sequence in the fasta file. Which isoforms contain this exon?
The variations in the locations of these exons make this a hard problem for MSA algorithms.
Go to www.phylogeny.fr. We'll use this interface to try out a few sequence alignment tools. Scroll down to find the third bulleted option, "Online phylogeny programs (direct access to the individual tools available on this server)." Choose Clustal-W under the heading of multiple sequence alignment (MSA) tools.
On the next page (Clustal-W input), browse to upload the file mouse.Dclk1.fasta, and click the Submit button. After the alignment is computed (this may take a few seconds), you will see the MSA in a scrollable window. You may use this window to view the alignment and answer the following questions.
1. Search for the VNGTP motif in the window containing the alignment results. How well did Clustal-W do at aligning this exon in the different sequences? What about the EESEE motif?
2. Starting from the main page at www.phylogeny.fr again, repeat this alignment using MUSCLE, starting again with the third bulleted option on the main page, "Online phylogeny programs (direct access to the individual tools available on this server)," and choosing MUSCLE under the heading of multiple sequence alignment tools. Use the default parameters (i.e., just upload the file and click Submit). How does it do at aligning the exon starting with VNGTP? What about the EESEE motif?
3. Which of these is a progressive alignment method? Which is an iterative one? List one drawback and one advantage of progressive alignment methods over iterative ones. How could this difference explain the results you have observed so far?
EBI MSA tools
Clustal-W was a very commonly used multiple sequence alignment program, but it's considered obsolete now because better programs exist. One such is Clustal-omega. It is not linked through the pipeline tools at www.phylogeny.fr, but you can reach it directly through the EBI's tools site at http://www.ebi.ac.uk/Tools/msa/. Click on "Launch Clustal Omega" to start.
Try using this to align the sequences you worked on in problem 2. Look for both of the motifs we specified in problem 2. How does this alignment differ in its identification of these conserved exons from that produced by Clustal-W? By MUSCLE? Explain how the different types of alignment methods give the observed results, whether you might want to edit any of the resulting alignments, and if so, how.

Going Further:

What do pigs' teeth tell you about their diet?
This exercise will give you a chance to practice building your own MSA data input file to answer a question about pigs. Tooth enamel in carnivores differs from that of herbivores. Pigs, however, notoriously eat everything. Is their tooth enamel more like that of herbivores or that of carnivores?
The gene ENAM (enamelin precursor) helps control the development of tooth enamel in many organisms. You will collect the ENAM sequences from several species and perform a multiple sequence alignment of the pig sequence with those of several carnivores and herbivores.
Start by doing a protein BLAST of the pig ENAM sequence, NP_999406.1, against the protein RefSeq database, restricted to mammals. Once you get the BLAST results, check the boxes to the left of the following BLAST hits:
- pig (Sus scrofa)
- dromedary (Camelus dromedarius)
- cat (Felis catus)
- Siberian tiger (Panthera tigris altaica)
- cow (Bos taurus)
- dog (Canis lupus familiaris)
- horse (Equus caballus)
- elephant (Loxodonta africana)
At the top of the list BLAST hits is a pulldown labeled "Download". Select this, choose "FASTA (complete sequence)" from among the options, and click Continue. Save the file someplace you can find it.
Before you build an MSA with the file, it would make sense to rename the sequences so you can see what they are. Open the file in Notepad or another text editor. Each sequence starts, in FASTA format, with a '>' character followed by an incomprehensible name of formal identifiers, such as
```
>gi|47523564|ref|NP_999406.1| enamelin precursor [Sus scrofa]
```
Edit each of these so that they have the ">" character followed by the name of the animal, such as
```
>pig
```
Save the resulting file.
Now build an MSA using this data file. Describe which methods you used. The herbivores are the dromedary, cow, horse, and elephant; the carnivores are the cat, tiger, and dog.
Based on the MSA, do you think the pig's teeth look more like those of herbivores or of carnivores? Look at parts of the alignment where there are differences that separate the herbivores and carnivores. You may want to look at a colored alignment ("show colors") to help you see these differences.
Justify your answer with an image of part of the MSA and an explanation of what you are looking at. Does your answer change with different MSA algorithms?
More Python Practice
Try doing some of the following problems to help your exercise with Python functions and conditionals.
1. Common to scientific names of species:
  Write a program that has a user enter the common name of an animal as a string. If the common name is "mouse", print the scientific name for mouse, "Mus musculus." If it is "dog", print "Canis lupus familiaris." You can add in other animal names if you like. If the user enters a name that is not explicitly mentioned in your program, print a message saying "I don't know that animal."
2. The Lake Pontchartrain Causeway is the longest continuous bridge over water in the world, at 23.83 miles long. It would probably also be a terrible place to run out of gas. Write a program to help a user determine whether or not to get gas before they start across the bridge. Ask for three items of input: tank size, miles per gallon, and what percent of the gas tank is full (this should be a value between 0 and 1; for example, a value of 0.5 indicates that the tank is half full). Then tell the user either "Get gas!" or "You're fine!," depending on whether or not there is enough gas for the trip.
  You might also try catching mistakes in user input. Tell the user there is a mistake if the entered tank size or mpg is a negative number, or if the entered value for percent-full is not between 0 and 1.
3. List slicing: Write a function that takes as input two lists, and outputs a new list containing the first element of the first list and the last element of the second.
4. Accumulator variables: Given a list of strings, count how many of the strings start with a capital letter, and print out the result.
Glycolysis in plants and animals
Create a multiple sequence alignment with these nucleotide sequences, containing the gene sequences for hexokinase (an enzyme that catalyzes the first reaction in the glycolysis pathway) from a few different organisms. Use Clustal-omega again, and be sure to select a nucleotide alignment instead of protein.
Look at the alignment. Are there many conserved nucleotides (these have stars underneath the alignment columns)? Why do you think this is?
Which sequences seem most closely related? Does this make sense, given what the organisms are?