## Comp 7 / Bio 40 Lab 3: MSA and number theory

### Purpose of this lab

To introduce students to the issues involved in multiple sequence alignment methods; to practice writing loops and moving data in and out of Python functions

### Lab problems

1. Python Programming: functions and loops

We will work more with the hypothesis we tested in class, that every odd perfect square (1, 9, 25, 49, …) is one more than a multiple of 8. To test this hypothesis empirically, we will write some Python code.

Create a program called oddsquares.py. Start it by doing what you did in the anti-quiz: Write a Python function that takes in one integer argument, oddnum.

The function should check to see that the argument is actually odd; if not, it can print an error message and return False. Assuming the argument is odd, the function should return True if the square of oddnum is equal to one more than a multiple of 8, and False otherwise. Test your function to make sure that it returns a Boolean value any time you give it an odd number.

Next, write code to call your function on the odd numbers between 1 and 100. To do this, use the range() function to create a list with these numbers, and then loop through the list, calling the function each time. (Recall that range can take up to three arguments: range(start, stop, step) ) If you find a counter-example to the hypothesis, print it out. Otherwise, print a message saying that the hypothesis holds for all the tested numbers.

2. Multiple sequence alignment: internal splicing.

There are seven isoforms of the mouse protein doublecortin-like kinase 1 (Dclk1). We will look at how different alignment methods work to align the exons in these different isoforms.

First, look at them in a genome browser. There are several genome browsers available, but the easiest way to see all the isoforms is to look up the Dclk1 gene in the NCBI gene page. Specifically:

Go to the course Links page or otherwise find NCBI's Gene web site. Search for the mouse Dclk1 gene page. Make sure that the page you end up looking at is in Mus musculus!) Then scroll down to the section that is titled "Genomic regions, transcripts, and products." The Genome View is below the Genomic Sequence line and the phrase "Go to nucleotide: Graphics FASTA GenBank"

At the top of this Genome View is a horizontal light grey line featuring several buttons and controls including a "Tools" pull-down and a help menu. Below this is the "Graphical panel" which shows a close-up of a particular chromosome (you should see chromosomal positions in the 55,000s) and various genomic features to be found there. A label on the upper left of the graphical panel should say "Genes, NCBI Mus musculus Annotation Release..." In the middle of the panel below that section, the name Dclk1 in blue text should appear above a depiction of the transcripts and isoforms for this gene.

Below this title are many transcripts and their corresponding proteins. There are seven transcripts whose RefSeq identifiers start with "NM," corresponding to seven curated isoforms of the protein. (RefSeq identifiers starting with "X" such as XM_006500978.1, can be considered "X-perimental"; they are computational predictions that have not yet been reviewed.) Genes are represented in green on this map; introns as thin horizontal lines, exons as thick lines that appear to be vertical bars at this scale.

1. What is the RefSeq ID for the mRNA transcript corresponding to the shortest of the seven isoforms with "NM" at the start of their RefSeq IDs?

Download and save the text file mouse.Dclk1.fasta. This file contains the amino acid sequences corresponding to those isoforms 1 to 7 of Dclk1 that you were looking at in the genome browser. Look at it in Notepad++ to see the different lengths of the sequences. One of the exons common to this protein starts with the highly conserved sequence VNGTP. Search for this sequence in the file; look for all occurrences of it.

1. For each of the seven isoforms, list whether they have this exon and if so, approximately where in the sequence does this exon begin? (Near the N-terminus, the C-terminus, or in the middle of the protein?)

2. Another shared exon starts with the highly conserved sequence EESEE. As above, search for this sequence in the fasta file. Which isoforms contain this exon?

The variations in the locations of these exons make this a hard problem for MSA algorithms.

Go to www.phylogeny.fr. We'll use this interface to try out a few sequence alignment tools. Scroll down to find the third bulleted option, "Online phylogeny programs (direct access to the individual tools available on this server)." Choose Clustal-W under the heading of multiple sequence alignment tools.

On the next page (Clustal-W input), browse to upload the file mouse.Dclk1.fasta, and click the Submit button. After the alignment is computed (this may take a few seconds), you will see the MSA in a scrollable window. You may use this window to view the alignment and answer the following questions, or you can click on the button "Start Jalview" button at the bottom of the page. Jalview is an MSA viewer that you can use to view – and even edit – the results of any multiple alignment program, and to color the alignments to make different features more easily visible.

1. Search for the VNGTP motif in the window containing the alignment results. How well did Clustal-W do at aligning this exon in the different sequences? What about the EESEE motif?

2. Starting from the main page at www.phylogeny.fr again, repeat this alignment using MUSCLE, starting again with the third bulleted option on the main page, "Online phylogeny programs (direct access to the individual tools available on this server)," and choosing MUSCLE under the heading of multiple sequence alignment tools. Use the default parameters (i.e., just upload the file and click Submit). How does it do at aligning the exon starting with VNGTP? What about the EESEE motif?

3. Which of these is a progressive alignment method? Which is an iterative one? List one drawback and one advantage of progressive alignment methods over iterative ones. How could this difference explain the results you have observed so far?

3. EBI MSA tools

Clustal-W was a very commonly used multiple sequence alignment program, but it's considered obsolete now because better programs exist. One such is Clustal-omega. It is not linked through the pipeline tools at www.phylogeny.fr, but you can reach it directly through the EBI's tools site at http://www.ebi.ac.uk/Tools/msa/. Click on "Launch Clustal Omega" to start.

Try using this to align the sequences you worked on in problem 2. Look for both of the motifs we specified in problem 2. How does this alignment differ in its identification of these conserved exons from that produced by Clustal-W? By MUSCLE? Explain how the different types of alignment methods give the observed results, whether you might want to edit any of the resulting alignments, and if so, how.

4. What do pigs' teeth tell you about their diet?

This exercise will give you a chance to practice building your own MSA data input file to answer a question about pigs. Tooth enamel in carnivores differs from that of herbivores. Pigs, however, notoriously eat everything. Is their tooth enamel more like that of herbivores or that of carnivores?

The gene ENAM (enamelin precursor) helps control the development of tooth enamel in many organisms. You will collect the ENAM sequences from several species and perform a multiple sequence alignment of the pig sequence with those of several carnivores and herbivores.

Start by doing a protein BLAST of the pig ENAM sequence, NP_999406.1, against the protein RefSeq database, restricted to mammals. Once you get the BLAST results, check the boxes to the left of the following BLAST hits:

• pig (Sus scrofa)
• dromedary (Camelus dromedarius)
• cat (Felis catus)
• Siberian tiger (Panthera tigris altaica)
• cow (Bos taurus)
• dog (Canis lupus familiaris)
• horse (Equus caballus)
• elephant (Loxodonta africana)

At the top of the list BLAST hits is a pulldown labeled "Download". Select this, choose "FASTA (complete sequence)" from among the options, and click Continue. Save the file someplace you can find it.

Before you build an MSA with the file, it would make sense to rename the sequences so you can see what they are. Open the file in Notepad or another text editor. Each sequence starts, in FASTA format, with a '>' character followed by an incomprehensible name of formal identifiers, such as

```>gi|47523564|ref|NP_999406.1| enamelin precursor [Sus scrofa]
```
Edit each of these so that they have the ">" character followed by the name of the animal, such as
```>pig
```
Save the resulting file.

Now build an MSA using this data file. Describe which methods you used. The herbivores are the dromedary, cow, horse, and elephant; the carnivores are the cat, tiger, and dog. Based on the MSA, do you think the pig's teeth look more like those of herbivores or of carnivores? Justify your answer with an image of part of the MSA and an explanation of what you are looking at.

Going Further:
1. More Python Practice

1. Common to scientific names of species:

Write a program that has a user enter the common name of an animal as a string. If the common name is "mouse", print the scientific name for mouse, "Mus musculus." If it is "dog", print "Canis lupus familiaris." You can add in other animal names if you like. If the user enters a name that is not explicitly mentioned in your program, print a message saying "I don't know that animal."

2. If you drop a brick from the top of a building, the distance it has fallen can be given in meters as: y = g t2 / 2 , where t is the time in seconds since is was dropped and g is the gravitational constant 9.8. Write a function to compute the distance the brick has fallen at a given time.

The tallest building in the world is the Burj Khalifa, at 829.8 meters tall. Modify your function so that the maximum distance it can return is 829.8 (the brick hits the ground and stops moving). You might also make this "maximum fall distance" a second parameter of your function.

3. The Lake Pontchartrain Causeway is the longest continuous bridge over water in the world, at 23.83 miles long. It would probably also be a terrible place to run out of gas. Write a program to help a user determine whether or not to get gas before they start across the bridge. Ask for three items of input: tank size, miles per gallon, and what percent of the gas tank is full (this should be a value between 0 and 1; for example, a value of 0.5 indicates that the tank is half full). Then tell the user either "Get gas!" or "You're fine!," depending on whether or not there is enough gas for the trip.

You might also try catching mistakes in user input. Tell the user there is a mistake if the entered tank size or mpg is a negative number, or if the entered value for percent-full is not between 0 and 1.

4. List slicing: Write a function that takes as input two lists, and outputs a new list containing the first element of the first list and the last element of the second.

5. Accumulator variables: Given a list of strings, count how many of the strings start with a capital letter, and print out the result.

6. Lists, and string manipulation: Write a program that takes a list of strings, and prints each string before the first "stop". If the user entered "this is a stop sign", you would output "this is a".

2. Glycolysis in plants and animals

Create a multiple sequence alignment with these nucleotide sequences, containing the gene sequences for hexokinase (an enzyme that catalyzes the first reaction in the glycolysis pathway) from a few different organisms. Use Clustal-omega again, and be sure to select a nucleotide alignment instead of protein.

Look at the alignment. Are there many conserved nucleotides (these have stars underneath the alignment columns)? Why do you think this is?

Which sequences seem most closely related? Does this make sense, given what the organisms are?