Comp 7 / Bio 40 Project 1b: Find a Novel Gene

Due October 24, 2016

This project makes use of the skills you have learned in finding sequences, creating multiple sequence alignments (MSAs), and building and analyzing evolutionary trees, to tell us more about your novel gene.

If you aren’t happy with the gene you discovered in project 1a, you may use my candidate novel globin, with the following GenBank ID in the NCBI nucleotide database: JV170156.1. This is a nucleotide sequence that was found by blasting human beta globin (gene symbol HBB) against the TSA database, which may include low-quality draft sequence (see the TSA db link). I limited the search to birds and excluded chickens and turkeys.

Please write up your work using full sentences (as if this were a paper rather than a lab notebook). Describe the choices you faced and the reasons behind your decisions. Use images where appropriate to support your claims.

Students often ask about how much detail to include in the project writeup. Think about two things: what would someone need to know in order to reproduce your work, and what would graders need to know so that we don't have to reproduce your work to figure out why you made the decisions you did. The former in particular is an important principle to keep in mind when writing about scientific experiments in any context.

Work on this on your own. Cite all sources of assistance.


This project has five main steps. First, you will choose sequences to use in your MSA and phylogenetic tree. Next, you will write a Python script that edits your downloaded sequence file so that the names of the sequences are more human-interpretable. Third, you will build a multiple sequence alignment of your sequences. Fourth, you will use that alignment to infer a phylogenetic tree relating your novel sequence to others in its family. And finally, you will try to interpret and assess your results and place them in context.


  1. Choosing sequences for the MSA is your first step. The alignment should include your novel protein sequence, the original query protein, and a group of other members of this family. A typical number of proteins to use in a multiple sequence alignment of this sort is between 10 and 20, although the exact number is up to you and depends in part on what similar sequences you can find. You will then need to create a FASTA file containing these sequences. Do this using BLAST against a protein database (nr or RefSeq). You may have done a similar search for project 1a, but now you definitely want to look for hits in other organisms as well as the one from which the novel sequence is derived.

    Select sequences that are at least plausibly homologous to your new sequence based on their pairwise alignments. If you have a lot of good hits to choose from, try to find some that are at a range of evolutionary distances (i.e., don’t pick ten hits that are all 98% identical to the query sequence). If you don’t have many good hits, you may want to compare some that are of questionable homology (i.e., the E-values are borderline/high, and the alignments substantial in length, but distant), or use the methods you have learned to find more confident distant homologs. Download the sequences in fasta format from the BLAST results page. (You can select the checkboxes next to the sequences you want and then choose “FASTA (complete sequence)” from the “Download” pull-down.)
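    Before moving on, it can help to confirm that the downloaded file holds the sequences you think it does. The following is a minimal sketch, not a required part of the project: it parses FASTA text, counts the records, and reports sequences that are exact duplicates of one another. The function names and the toy records are hypothetical and are not real BLAST output.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # finish the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def summarize(records):
    """Return the record count and pairs of headers with identical sequences."""
    seen = {}
    duplicates = []
    for header, seq in records:
        if seq in seen:
            duplicates.append((seen[seq], header))
        else:
            seen[seq] = header
    return len(records), duplicates


# Hypothetical toy data standing in for a downloaded file.
example = """>seqA hypothetical example
MVHLTPEEK
SAVTALWGK
>seqB hypothetical example
MVHLTAEEK
>seqC hypothetical example
MVHLTPEEKSAVTALWGK
""".splitlines()

records = read_fasta(example)
count, duplicates = summarize(records)
print(count, "sequences;", len(duplicates), "exact duplicate(s)")
for first, second in duplicates:
    print("  duplicate:", first, "==", second)
```

    To run this on your own file, pass an open file object to read_fasta in place of the toy lines.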

  2. Write a Python script to create a new copy of your fasta file. This script should read each line from the input file. If the line starts with any character except '>', the script should just print the line to the output file as it is. But, if it finds a line starting with '>', print that line to the screen and ask the user to supply a shorter name. Create a new string starting with '>' and then containing the shorter name, and print that to the output file instead of the original long name.

    For example, if the input file looks like:

    >NP_003241.1|gi 72164212|Hexokinase 1 mRNA [Canis lupus familiaris]
    >NP_001104521.1|gi 20618419|Hexokinase 1 like protein [Equus caballus]

    Then your interaction with the user might look like (the user's input is the text after each prompt):

    >NP_003241.1|gi 72164212|Hexokinase 1 mRNA [Canis lupus familiaris]
    Enter new description:  Dog
    >NP_001104521.1|gi 20618419|Hexokinase 1 like protein [Equus caballus]
    Enter new description:  Horse

    And the output fasta file would look like:

    >Dog
    >Horse

    Your output is a fasta file that you can use to build MSAs and phylogenetic trees just like the original, but you will have a much easier time reading and interpreting the labels of the sequences if you use this new file. Note: You will have written some similar programs in lab or class. You are welcome to start with one such program and modify it for this purpose rather than starting from scratch! Just document that you have done so in your writeup.
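    One possible shape for such a script is sketched below. Treat it as a starting point rather than the expected solution: the function names, the `ask` parameter (which defaults to Python's built-in input() and exists only so the logic can be tried without typing), and the command-line handling are all assumptions, not requirements of the assignment.

```python
def rename_headers(lines, ask=input):
    """Yield the lines of a new FASTA file with user-supplied short names.

    `ask` is any function that takes a prompt string and returns a string;
    by default it is input(), so the user is prompted at the keyboard.
    """
    for line in lines:
        if line.startswith(">"):
            print(line.rstrip("\n"))                 # show the original long name
            short = ask("Enter new description:  ").strip()
            yield ">" + short + "\n"                 # replacement header line
        else:
            yield line                               # sequence lines pass through


def rename_file(in_path, out_path, ask=input):
    """Read in_path, prompt for each header, and write the result to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for new_line in rename_headers(src, ask):
            dst.write(new_line)


if __name__ == "__main__":
    import sys
    # Hypothetical usage: python rename_fasta.py input.fasta output.fasta
    if len(sys.argv) == 3:
        rename_file(sys.argv[1], sys.argv[2])
```

    Keeping the prompting function as a parameter means the renaming logic can be checked with canned answers before you run it on your real file. This sketch does not guard against empty or repeated names, which you may want to handle yourself.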

  3. Build a multiple sequence alignment of your sequences. Which method did you choose, and why? (You may do this in MEGA or using individual MSA web sites.)

    To do it in MEGA, you must convert the sequences into MEGA format. Specifically, open MEGA and load your sequence data by following the rest of this step: Launch the Alignment Explorer by selecting Align | Edit/Build Alignment on the launch bar of the main MEGA window. Select Create New Alignment and click OK. A dialog will appear asking “Are you building a DNA or Protein sequence alignment?” Click the button that describes the data in your fasta file. From the Alignment Explorer main menu, select Data | Open | Retrieve sequences from File. Select your fasta file.

    To align sequences by ClustalW or Muscle in MEGA: From the Alignment Explorer window, select the Edit | Select All menu command to select all sites for every sequence in the data set. Select Alignment | Align by ClustalW (or Align by Muscle) from the main menu to align the selected sequences. Once the alignment is complete, save the current alignment session by selecting Data | Save Session from the main menu. Save as type "alignment session (*.mas)". This will allow the current alignment to be used for phylogenetic analysis.

    You may alternatively build your alignment using one of the individual MSA web sites. You can use fasta-formatted data for this.

    Examine the multiple sequence alignment you built. Can you find highly conserved regions between your protein and its homologs? Do you need to remove any sequences and re-generate the alignment, especially if some of the sequences you chose in step 1 had questionable homology? Do you think you have the whole protein sequence for your novel protein? (If not, either try to find the rest of the sequence, or explain why this isn't feasible.)

    In your writeup, describe the process you used. Explain what sequences you selected and why. Describe any sequences you discarded and how the alignment changed as a result. Describe your thoughts about the final alignment's quality.

    Submit your input fasta file of the sequences along with your writeup (so that we can replicate your work). Save and submit your MSA. Provide a picture of at least part of your MSA in the writeup as well.
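    If you would like a rough numerical check to go with your visual inspection for conserved regions, the sketch below scores each alignment column by the fraction of sequences sharing the most common residue, then reports runs of high-scoring columns. This is one simple measure among many, not the method MEGA uses. The function names, the thresholds, and the toy alignment are all hypothetical.

```python
from collections import Counter


def column_conservation(aligned):
    """For each alignment column, return the fraction of sequences that
    share the most common residue. Gap characters ('-') are not counted
    as residues, but gapped sequences still count in the denominator."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for i in range(length):
        residues = [seq[i] for seq in aligned if seq[i] != "-"]
        if residues:
            top_count = Counter(residues).most_common(1)[0][1]
        else:
            top_count = 0
        scores.append(top_count / len(aligned))
    return scores


def conserved_runs(scores, threshold=0.8, min_len=3):
    """Return (start, end) column ranges, 0-based and end-exclusive, in which
    every column scores at least `threshold` for at least `min_len` columns."""
    runs, start = [], None
    # A final sentinel below any possible score closes a run at the end.
    for i, score in enumerate(scores + [float("-inf")]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Hypothetical toy alignment, not real globin data.
toy = ["MVHLT-EEK",
       "MVHLSPEDK",
       "MVHLTAEEK",
       "M-HLTPEEK"]
scores = column_conservation(toy)
runs = conserved_runs(scores, threshold=0.75, min_len=3)
print([round(s, 2) for s in scores])
print(runs)
```

    A score like this says nothing about whether substitutions are chemically similar, so use it to point you toward regions worth examining rather than as evidence on its own.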

  4. Use the alignment to generate a phylogenetic tree that describes the relationship between your protein and its homologs. Describe what tree inference options you selected and why.

    To load data for your tree in MEGA, from the main MEGA launch bar, select Phylogeny | Construct/Test Neighbor-Joining Tree (or another method of building the tree). It will ask that you load a file. Choose the .mas file you saved. It should then allow you to customize the phylogenetic tree building options as you have seen previously. If you want to import an MSA from another source into MEGA for phylogenetic analysis, load the fasta alignment into MEGA for alignment as described above, and then choose Data/Export Alignment/MEGA format to convert fasta data into MEGA format.

    Include a picture of the tree in your writeup. Is your tree rooted, or can you root your tree? How? Discuss your confidence in the observed relationships. How is your gene related to other sequences? Does this analysis affect your conclusions about the gene’s functional role?
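    Distance-based methods such as neighbor joining start from pairwise distances between the aligned sequences. To build intuition for what those numbers mean, the sketch below computes a simple p-distance (the proportion of differing sites) for every pair in a toy alignment. It only illustrates the idea: the distances MEGA uses depend on the model and gap options you select, and the function names and toy sequences here are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences,
    skipping any column where either sequence has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(aligned):
    """aligned maps name -> aligned sequence; returns a nested dict of
    pairwise p-distances keyed by name."""
    names = list(aligned)
    return {m: {n: p_distance(aligned[m], aligned[n]) for n in names}
            for m in names}


# Hypothetical toy alignment, not real data.
toy = {"Novel": "MVHLT-EEK", "Dog": "MVHLSPEDK", "Horse": "MVHLTAEEK"}
matrix = distance_matrix(toy)
for name, row in matrix.items():
    print(name.ljust(6), " ".join(f"{row[other]:.3f}" for other in matrix))
```

    Pairs with small distances should end up close together in the tree. If a grouping in your tree seems to contradict the pairwise distances, that is worth discussing when you assess your confidence in the result.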

  5. What else can you say about your novel gene? If you were not already sure about what species the sequence comes from, do the alignment and tree shed any light on this issue? Discuss what you have learned about this family of proteins, using any information you want from online databases such as PubMed, OMIM, Homologene, etc. How does the new information from your work inform your initial thoughts about this protein?

Submitting your work:

Submit your work using Trunk. This is vastly preferable to sending it to us via email; it ensures that it doesn't get lost in overflowing inboxes. We also need electronic versions of your sequence files and alignment file to replicate parts of your work during the grading process.

If you submit more than once, we will look only at the last submission unless you ask us for help with an earlier one; even then, only the last submission will be graded. Please ensure that each submission you wish to have graded includes all the requested components; otherwise, we may not see them all.