We will work more with the hypothesis we tested in class, that every odd perfect square (1, 9, 25, 49, …) is one more than a multiple of 8. To test this hypothesis empirically, we will write some Python code.
Create a program called oddsquares.py. Start it by doing what you did in the anti-quiz: Write a Python function that takes in one integer argument, oddnum.
The function should check to see that the argument is actually odd; if not, it can print an error message and return False. Assuming the argument is odd, the function should return True if the square of oddnum is equal to one more than a multiple of 8, and False otherwise. Test your function to make sure that it returns a Boolean value any time you give it an odd number.
Next, write code to call your function on the odd numbers between 1 and 100. To do this, use the range() function to create a list with these numbers, and then loop through the list, calling the function each time. (Recall that range can take up to three arguments: range(start, stop, step) ) If you find a counter-example to the hypothesis, print it out. Otherwise, print a message saying that the hypothesis holds for all the tested numbers.
There are seven isoforms of the mouse protein doublecortin-like kinase 1 (Dclk1). We will look at how different alignment methods work to align the exons in these different isoforms.
First, look at them in a genome browser. There are several genome browsers available, but the easiest way to see all the isoforms is to look up the Dclk1 gene in the NCBI gene page. Specifically:
Go to the course Links page or otherwise find NCBI's Gene web site. Search for the mouse Dclk1 gene page. Make sure that the page you end up looking at is in Mus musculus!) Then scroll down to the section that is titled "Genomic regions, transcripts, and products." The Genome View is below the Genomic Sequence line and the phrase "Go to nucleotide: Graphics FASTA GenBank"
At the top of this Genome View is a horizontal light grey line featuring several buttons and controls including a "Tools" pull-down and a help menu. Below this is the "Graphical panel" which shows a close-up of a particular chromosome (you should see chromosomal positions in the 55,000s) and various genomic features to be found there. A label on the upper left of the graphical panel should say "Genes, NCBI Mus musculus Annotation Release..." In the middle of the panel below that section, the name Dclk1 in blue text should appear above a depiction of the transcripts and isoforms for this gene.
Below this title are many transcripts and their corresponding proteins. There are seven transcripts whose RefSeq identifiers start with "NM," corresponding to seven curated isoforms of the protein. (RefSeq identifiers starting with "X" such as XM_006500978.1, can be considered "X-perimental"; they are computational predictions that have not yet been reviewed.) Genes are represented in green on this map; introns as thin horizontal lines, exons as thick lines that appear to be vertical bars at this scale.
Download and save the text file mouse.Dclk1.fasta. This file contains the amino acid sequences corresponding to those isoforms 1 to 7 of Dclk1 that you were looking at in the genome browser. Look at it in Notepad++ to see the different lengths of the sequences. One of the exons common to this protein starts with the highly conserved sequence VNGTP. Search for this sequence in the file; look for all occurrences of it.
The variations in the locations of these exons make this a hard problem for MSA algorithms.
Go to www.phylogeny.fr. We'll use this interface to try out a few sequence alignment tools. Scroll down to find the third bulleted option, "Online phylogeny programs (direct access to the individual tools available on this server)." Choose Clustal-W under the heading of multiple sequence alignment tools.
On the next page (Clustal-W input), browse to upload the file mouse.Dclk1.fasta, and click the Submit button. After the alignment is computed (this may take a few seconds), you will see the MSA in a scrollable window. You may use this window to view the alignment and answer the following questions, or you can click on the button "Start Jalview" button at the bottom of the page. Jalview is an MSA viewer that you can use to view – and even edit – the results of any multiple alignment program, and to color the alignments to make different features more easily visible.
Clustal-W was a very commonly used multiple sequence alignment program, but it's considered obsolete now because better programs exist. One such is Clustal-omega. It is not linked through the pipeline tools at www.phylogeny.fr, but you can reach it directly through the EBI's tools site at http://www.ebi.ac.uk/Tools/msa/. Click on "Launch Clustal Omega" to start.
Try using this to align the sequences you worked on in problem 2. Look for both of the motifs we specified in problem 2. How does this alignment differ in its identification of these conserved exons from that produced by Clustal-W? By MUSCLE? Explain how the different types of alignment methods give the observed results, whether you might want to edit any of the resulting alignments, and if so, how.
This exercise will give you a chance to practice building your own MSA data input file to answer a question about pigs. Tooth enamel in carnivores differs from that of herbivores. Pigs, however, notoriously eat everything. Is their tooth enamel more like that of herbivores or that of carnivores?
The gene ENAM (enamelin precursor) helps control the development of tooth enamel in many organisms. You will collect the ENAM sequences from several species and perform a multiple sequence alignment of the pig sequence with those of several carnivores and herbivores.
Start by doing a protein BLAST of the pig ENAM sequence, NP_999406.1, against the protein RefSeq database, restricted to mammals. Once you get the BLAST results, check the boxes to the left of the following BLAST hits:
At the top of the list BLAST hits is a pulldown labeled "Download". Select this, choose "FASTA (complete sequence)" from among the options, and click Continue. Save the file someplace you can find it.
Before you build an MSA with the file, it would make sense to rename the sequences so you can see what they are. Open the file in Notepad or another text editor. Each sequence starts, in FASTA format, with a '>' character followed by an incomprehensible name of formal identifiers, such as
>gi|47523564|ref|NP_999406.1| enamelin precursor [Sus scrofa]Edit each of these so that they have the ">" character followed by the name of the animal, such as
>pigSave the resulting file.
Now build an MSA using this data file. Describe which methods you used. The herbivores are the dromedary, cow, horse, and elephant; the carnivores are the cat, tiger, and dog. Based on the MSA, do you think the pig's teeth look more like those of herbivores or of carnivores? Justify your answer with an image of part of the MSA and an explanation of what you are looking at.
Try doing some of the following problems to help your exercise with Python functions and conditionals.
Write a program that has a user enter the common name of an animal as a string. If the common name is "mouse", print the scientific name for mouse, "Mus musculus." If it is "dog", print "Canis lupus familiaris." You can add in other animal names if you like. If the user enters a name that is not explicitly mentioned in your program, print a message saying "I don't know that animal."
The tallest building in the world is the Burj Khalifa, at 829.8 meters tall. Modify your function so that the maximum distance it can return is 829.8 (the brick hits the ground and stops moving). You might also make this "maximum fall distance" a second parameter of your function.
You might also try catching mistakes in user input. Tell the user there is a mistake if the entered tank size or mpg is a negative number, or if the entered value for percent-full is not between 0 and 1.
Create a multiple sequence alignment with these nucleotide sequences, containing the gene sequences for hexokinase (an enzyme that catalyzes the first reaction in the glycolysis pathway) from a few different organisms. Use Clustal-omega again, and be sure to select a nucleotide alignment instead of protein.
Look at the alignment. Are there many conserved nucleotides (these have stars underneath the alignment columns)? Why do you think this is?
Which sequences seem most closely related? Does this make sense, given what the organisms are?