This project makes use of the skills you have learned in finding sequences, creating multiple sequence alignments (MSAs), and building and analyzing evolutionary trees, to tell us more about your novel gene.
If you aren’t happy with the gene you discovered in project 1a, you may use my candidate novel globin, which has the following GenBank ID in the NCBI nucleotide database: JV170156.1. This nucleotide sequence was found by running a BLAST search with human beta globin (gene symbol HBB) against the TSA (Transcriptome Shotgun Assembly) database, which may include low-quality draft sequence (see the TSA db link). I limited the search to birds and excluded chickens and turkeys.
Please write up your work using full sentences (as if this were a paper rather than a lab notebook). Describe the choices you faced and the reasons behind your decisions. Use images where appropriate to support your claims.
Students often ask about how much detail to include in the project writeup. Think about two things: what would someone need to know in order to reproduce your work, and what would graders need to know so that we don't have to reproduce your work to figure out why you made the decisions you did. The former in particular is an important principle to keep in mind when writing about scientific experiments in any context.
Work on this on your own. Cite all sources of assistance.
This project has four main steps. First, you will choose sequences to use in your MSA and phylogenetic tree. Next, you will write a Python script that edits your downloaded sequence file so that the names of the sequences are more human-interpretable. Third, you will build a multiple sequence alignment of your sequences, which you will then use to infer a phylogenetic tree relating your novel sequence to others in its family. And finally, you will try to interpret and assess your results and place them in context.
Select sequences that are at least plausibly homologous to your new sequence based on their pairwise alignments. If you have a lot of good hits to choose from, try to find some that are at a range of evolutionary distances (i.e., don’t pick ten hits that are all 98% identical to the query sequence). If you don’t have many good hits, you may want to compare some that are of questionable homology (i.e., the E-values are borderline/high, and the alignments substantial in length, but distant), or use the methods you have learned to find more confident distant homologs. Download the sequences in fasta format from the BLAST results page. (You can select the checkboxes next to the sequences you want and then choose “FASTA (complete sequence)” from the “Download” pull-down.)
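Before moving on, it can help to confirm that the downloaded file contains the records you expect and that none of them is suspiciously short. The following is a minimal Python sketch of such a check, not a required part of the project. The function names `read_fasta` and `summarize_fasta` and the sample records are hypothetical, chosen only for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta(lines):
    """Return a list of (header, sequence_length) tuples, one per record."""
    return [(header, len(seq)) for header, seq in read_fasta(lines)]


if __name__ == "__main__":
    # Made-up records; in practice, pass an open file handle instead.
    example = [">seq1 example protein", "MKVLA", "GGT", ">seq2", "MK"]
    for header, length in summarize_fasta(example):
        print(f"{length:>6}  {header}")
```

To run it on a real download, open the file and pass the handle, for example `summarize_fasta(open("hits.fasta"))`, where `hits.fasta` is a placeholder file name.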
For example, if the input file looks like:

>NP_003241.1|gi 72164212|Hexokinase 1 mRNA [Canis lupus familiaris]
MSFGRDMELEHFDERDKAQRYSRGSRVNGLPSPTHSAHCSFYRTRTLQTLSSEKKAKKVRFYRNGDRYFK
GIVYAISPDRFRSFEALLADLTRTLSDNVNLPQGVRTIYTIDGLKKISSLDQLVEGESYVCGSIEPFKKL
>NP_001104521.1|gi 20618419|Hexokinase 1 like protein [Equus caballus]
MLELIEVNGTPGSQLSTPRSGKSPSPSPTSPGSLRKQRISQHGGSSTSLSSTKVCSSMDENDGPGEEESEEGFQIPATIT
ERYKVGRTIGDGNFAVVKECIERSTAREYALKIIKKSKCRGKEHMIQNEVSILRRVKHPNIVLLIEEMDVPTELYLVMEL

Then your interaction with the user might look like this (the user's input is the text after each "Enter new description:" prompt):

>NP_003241.1|gi 72164212|Hexokinase 1 mRNA [Canis lupus familiaris]
Enter new description: Dog
>NP_001104521.1|gi 20618419|Hexokinase 1 like protein [Equus caballus]
Enter new description: Horse

And the output fasta file would look like:

>Dog
MSFGRDMELEHFDERDKAQRYSRGSRVNGLPSPTHSAHCSFYRTRTLQTLSSEKKAKKVRFYRNGDRYFK
GIVYAISPDRFRSFEALLADLTRTLSDNVNLPQGVRTIYTIDGLKKISSLDQLVEGESYVCGSIEPFKKL
>Horse
MLELIEVNGTPGSQLSTPRSGKSPSPSPTSPGSLRKQRISQHGGSSTSLSSTKVCSSMDENDGPGEEESEEGFQIPATIT
ERYKVGRTIGDGNFAVVKECIERSTAREYALKIIKKSKCRGKEHMIQNEVSILRRVKHPNIVLLIEEMDVPTELYLVMEL

Your output is a fasta file that you can use to build MSAs and phylogenetic trees just like the original, but you will have a much easier time reading and interpreting the sequence labels if you use this new file.

Note: You will have written some similar programs in lab or class. You are welcome to start with one such program and modify it for this purpose rather than starting from scratch! Just document that you have done so in your writeup.
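One possible shape for the renaming logic is sketched below in Python. It is only an illustration under stated assumptions, not the expected solution: the function name `rename_headers`, the `ask` parameter, and the command-line usage are all hypothetical. The sketch copies sequence lines unchanged and replaces each header line with whatever the user types.

```python
import sys


def rename_headers(in_lines, ask=input):
    """Return FASTA lines with each header replaced by a user-supplied label.

    `ask` is called once per header line. It defaults to input(), so the
    user is prompted interactively, but any callable that takes the prompt
    text and returns a string will work (useful for testing).
    """
    out_lines = []
    for line in in_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            print(line)  # show the original header to the user
            new_name = ask("Enter new description: ").strip()
            out_lines.append(">" + new_name)
        else:
            out_lines.append(line)  # sequence lines pass through unchanged
    return out_lines


# Hypothetical usage: python rename_fasta.py input.fasta output.fasta
if __name__ == "__main__" and len(sys.argv) == 3:
    in_path, out_path = sys.argv[1], sys.argv[2]
    with open(in_path) as handle:
        renamed = rename_headers(handle)
    with open(out_path, "w") as handle:
        handle.write("\n".join(renamed) + "\n")
```

Passing the prompt function in as a parameter is a design choice that lets you test the logic without typing; a real script might also want to reject empty or duplicate labels, which this sketch does not do.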
To build the alignment in MEGA, you must first convert the sequences into MEGA format by loading them into the Alignment Explorer, as follows. Launch the Alignment Explorer by selecting Align | Edit/Build Alignment on the launch bar of the main MEGA window. Select Create New Alignment and click OK. A dialog will appear asking “Are you building a DNA or Protein sequence alignment?” Click the button that describes the data in your fasta file. From the Alignment Explorer main menu, select Data | Open | Retrieve sequences from File, and then select your fasta file.
To align sequences by ClustalW or Muscle in MEGA: From the Alignment Explorer window, select the Edit | Select All menu command to select all sites for every sequence in the data set. Select Alignment | Align by ClustalW (or Align by Muscle) from the main menu to align the selected sequences. Once the alignment is complete, save the current alignment session by selecting Data | Save Session from the main menu. Save as type "alignment session (*.mas)". This will allow the current alignment to be used for phylogenetic analysis.
You may alternatively build your alignment using phylogeny.fr. You can use fasta-formatted data for this.
Examine the multiple sequence alignment you built. Can you find highly conserved regions between your protein and its homologs? Do you need to remove any sequences and re-generate the alignment, especially if some of the sequences you chose in step 1 had questionable homology? Do you think you have the whole protein sequence for your novel protein? (If not, either try to find the rest of the sequence, or explain why this isn't feasible.)
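If you would like a rough numerical check to go alongside your visual inspection, per-column conservation can be estimated from the aligned sequences. The Python sketch below is one simple, identity-based way to do this; it is an assumption-laden illustration rather than a standard measure. The function names `column_conservation` and `conserved_runs`, the 0.9 threshold, and the minimum run length of 5 are all hypothetical choices, and the input is assumed to be a list of equal-length aligned sequences with "-" for gaps.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """For each alignment column, return the fraction of sequences that
    share the most common residue. Gaps ('-') never count as the conserved
    residue, so a mostly-gap column scores low."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for col in range(length):
        residues = [s[col].upper() for s in aligned_seqs if s[col] != "-"]
        top_count = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top_count / len(aligned_seqs))
    return scores


def conserved_runs(scores, threshold=0.9, min_len=5):
    """Return (start, end) column ranges (0-based, end exclusive) in which
    every column scores at least `threshold`, for runs of at least
    `min_len` columns."""
    runs, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_len:
        runs.append((start, len(scores)))
    return runs
```

Note that this counts only exact residue matches. It does not give credit for conservative substitutions (for example, leucine for isoleucine), so it may understate conservation compared with what you see in an alignment viewer that colors by residue class.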
In your writeup, describe the process you used. Explain what sequences you selected and why. Describe any sequences you discarded and how the alignment changed as a result. Describe your thoughts about the final alignment's quality.
Submit your input fasta file of the sequences along with your writeup (so that we can replicate your work). Save and submit your MSA. Provide a picture of at least part of your MSA in the writeup as well.
To load data for your tree in MEGA, select Phylogeny | Construct/Test Neighbor-Joining Tree (or another tree-building method) from the main MEGA launch bar. MEGA will ask you to load a file; choose the .mas file you saved. You can then customize the tree-building options as you have seen previously. If you want to import an MSA from another source into MEGA for phylogenetic analysis, load the fasta alignment into the Alignment Explorer as described above, and then choose Data | Export Alignment | MEGA format to convert the fasta data into MEGA format.
Include a picture of the tree in your writeup. Is your tree rooted, or can you root your tree? How? Discuss your confidence in the observed relationships. How is your gene related to other sequences? Does this analysis affect your conclusions about the gene’s functional role?
Submitting your work:
Submit your work using Trunk. This is vastly preferable to sending it to us via email, because it ensures that your work doesn't get lost in overflowing inboxes. We also need electronic versions of your sequence files and alignment file to replicate parts of your work during the grading process.
If you submit more than once, we will look only at the last submission unless you ask us for help on an earlier one. Even in that case, only the last submission will be graded. Please ensure that you include all the requested components in each submission that you wish to have graded; otherwise, we may not see them all.