Comp 7 / Bio 40 - Fall 2019

Comp 7 / Bio 40 Project 1b: Find a Novel Gene

Due October 17, 2019

This project makes use of the skills you have learned in finding sequences, creating multiple sequence alignments (MSAs), writing Python scripts, and building and analyzing evolutionary trees, to tell us more about the novel gene you discovered in project 1a.

Please write up your work using full sentences (as if this were a paper rather than a lab notebook). Describe the choices you faced and the reasons behind your decisions. Use images where appropriate to support your claims.

Students often ask about how much detail to include in the project writeup. Think about two things: what would someone need to know in order to reproduce your work, and what would graders need to know so that we don't have to reproduce your work to figure out why you made the decisions you did. The former in particular is an important principle to keep in mind when writing about scientific experiments in any context.

Work on this on your own. Cite all sources of assistance.

Overview:

This project has four main steps. First, you will choose sequences to which you will compare your novel gene. Second, you will write a Python script that edits your downloaded sequence file so that the names of the sequences are more human-interpretable, and you will run the script to create a new input file for your MSA and phylogenetics tools. Third, you will use this new file to build a multiple sequence alignment of your sequences, which you will then use to infer a phylogenetic tree relating your novel sequence to others in its family. Finally, you will interpret and assess your results and place them in context.

Details:

Choosing sequences for the MSA is your first step.
The alignment should include:
- your novel protein sequence,
- the original query protein you started with in project 1a, and
- other homologous proteins in this family.
A typical number of proteins to use in a multiple sequence alignment of this sort is between 10 and 20, although the exact number is up to you and depends in part on what similar sequences you can find. Part of the process will involve getting a feel for choosing sequences that are neither too similar to nor too distant from your sequence so that they can provide useful information about the new gene.
You will then need to create a FASTA file containing these sequences. Do this using BLAST against a protein database (such as nr or RefSeq). You may have done a similar search for project 1a, but now you definitely want to look for hits in other organisms as well as the one from which the novel sequence is derived. You may want to select sequences from multiple BLAST searches, adjusting the ranges of organisms or distances you look for in different searches.
Select sequences you believe are homologous to your new sequence based on their pairwise alignments. If you have a lot of good hits to choose from, try to find some that are at a range of evolutionary distances (i.e., don’t pick ten hits that are all 98% identical to the query sequence, especially if your new gene is only 46% identical to it). If you don’t have many good hits, you may want to use the methods you have learned to find more compelling distant homologs.
Download the sequences in FASTA format from the BLAST results page. (You can select the checkboxes next to the sequences you want and then choose “FASTA (complete sequence)” from the “Download” pull-down.) Submit this FASTA file as part of your project upload.
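Before moving on, it can help to confirm what actually ended up in the downloaded file. The following is a minimal sketch, not a required part of the project: it lists the header lines of a FASTA file so you can count your sequences and check that the novel protein and the original query are both present. The function name, the toy records, and the file name in the comment are all made up for illustration.

```python
def fasta_headers(lines):
    """Return the header text (without the leading '>') for each FASTA record."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


# Toy records standing in for a downloaded file (made-up names and sequences).
example = [
    ">seq1 first example protein",
    "MSFGRDMELEHF",
    ">seq2 second example protein",
    "MLELIEVNGTPG",
]

headers = fasta_headers(example)
print(len(headers), "sequences found")
for header in headers:
    print("  " + header)

# For a real file (hypothetical file name), the same function works on an
# open file handle:
# with open("my_sequences.fasta") as handle:
#     headers = fasta_headers(handle)
```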
Write a Python pass-through script to create a new copy of your FASTA file. This script should read each line from the input file. If the line starts with any character except '>', the script should just print the line to the output file as it is. But if the line starts with '>', the script should print that line to the screen and ask the user to supply a shorter name. It should then create a new string starting with '>' followed by the shorter name, and print that to the output file instead of the original long name.
For example, if the input file looks like:
```
>NP_003241.1|gi 72164212|Hexokinase 1 mRNA [Canis lupus familiaris]
MSFGRDMELEHFDERDKAQRYSRGSRVNGLPSPTHSAHCSFYRTRTLQTLSSEKKAKKVRFYRNGDRYFK
GIVYAISPDRFRSFEALLADLTRTLSDNVNLPQGVRTIYTIDGLKKISSLDQLVEGESYVCGSIEPFKKL
>NP_001104521.1|gi 20618419|Hexokinase 1 like protein [Equus caballus]
MLELIEVNGTPGSQLSTPRSGKSPSPSPTSPGSLRKQRISQHGGSSTSLSSTKVCSSMDENDGPGEEESEEGFQIPATIT
ERYKVGRTIGDGNFAVVKECIERSTAREYALKIIKKSKCRGKEHMIQNEVSILRRVKHPNIVLLIEEMDVPTELYLVMEL
```
Then your interaction with the user might look like this (the user's input follows each "Enter new description:" prompt):
```
>NP_003241.1|gi 72164212|Hexokinase 1 mRNA [Canis lupus familiaris]
Enter new description:  Dog
>NP_001104521.1|gi 20618419|Hexokinase 1 like protein [Equus caballus]
Enter new description:  Horse
```
And the output FASTA file would look like:
```
>Dog
MSFGRDMELEHFDERDKAQRYSRGSRVNGLPSPTHSAHCSFYRTRTLQTLSSEKKAKKVRFYRNGDRYFK
GIVYAISPDRFRSFEALLADLTRTLSDNVNLPQGVRTIYTIDGLKKISSLDQLVEGESYVCGSIEPFKKL
>Horse
MLELIEVNGTPGSQLSTPRSGKSPSPSPTSPGSLRKQRISQHGGSSTSLSSTKVCSSMDENDGPGEEESEEGFQIPATIT
ERYKVGRTIGDGNFAVVKECIERSTAREYALKIIKKSKCRGKEHMIQNEVSILRRVKHPNIVLLIEEMDVPTELYLVMEL
```
Your output is a FASTA file that you can use to build MSAs and phylogenetic trees just like the original, but you will have a much easier time reading and interpreting the labels of the sequences if you use this new file.
Note: You will have written some similar programs in lab or class. You are welcome to start with one such program and modify it for this purpose rather than starting from scratch! Just document that you have done so in your writeup.
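One possible shape for such a pass-through script is sketched below. It is only an illustration under stated assumptions, not the expected solution: the function names (`rename_headers`, `ask_user`, `main`) are made up here, and the prompt text simply mirrors the example interaction above. Separating the renaming logic from the keyboard input makes it possible to try the logic with canned answers before running it on a real file.

```python
def rename_headers(lines, ask):
    """Yield FASTA lines, replacing each '>' header with a shorter name.

    `lines` is any iterable of text lines. `ask` is a function that takes the
    original header line and returns the replacement name.
    """
    for line in lines:
        if line.startswith(">"):
            short_name = ask(line.rstrip("\n")).strip()
            yield ">" + short_name + "\n"
        else:
            # Sequence lines pass through unchanged.
            yield line


def ask_user(header):
    """Show the original header and prompt for a shorter name."""
    print(header)
    return input("Enter new description:  ")


def main(in_path, out_path):
    """Copy in_path to out_path, asking the user to rename each header."""
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for line in rename_headers(infile, ask_user):
            outfile.write(line)


# Small demonstration with a canned answer instead of keyboard input.
# The accession and description below are made up for illustration.
demo_input = [">XX_000000.1 long description [Some species]\n", "MSFGRD\n"]
demo_output = list(rename_headers(demo_input, lambda header: "Dog"))
print("".join(demo_output), end="")
```

To use it interactively, you would call `main` with your own input and output file names.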
Build a multiple sequence alignment of your sequences. Which method did you choose, and why? (You may do this in MEGA or using individual MSA web sites. However, if you do it outside of MEGA you will need to import it into MEGA and save it in a MEGA-readable format, as below.)
To build your MSA in MEGA, choose File from the main MEGA menu, then select "Open a file/session", and choose the FASTA file containing your sequences. When the program asks you whether to Analyze or Align the file, choose Align. Then select Alignment from the Alignment window's menu and select a method, such as ClustalW or MUSCLE, and whatever options you wish.
Once the alignment is complete, save the current alignment session by selecting Data | Save Session from the main menu. Save as type "alignment session (*.mas)". This will allow the current alignment to be used for phylogenetic analysis.

Examine the multiple sequence alignment you built. Can you find highly conserved regions between your protein and its homologs? Do you need to remove any sequences and re-generate the alignment, especially if some of the sequences you chose in step 1 had questionable homology? Do you think you have the whole protein sequence for your novel protein? (If not, either try to find the rest of the sequence, or explain why this isn't feasible.)
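Looking for conserved regions is usually done by eye in the alignment viewer, but the idea can also be sketched in code. The example below is a rough illustration, assuming the aligned sequences are available as equal-length strings with '-' for gaps (for instance, after exporting the alignment in a text format). The function names, the 0.8 threshold, and the minimum run length of 3 are arbitrary choices for illustration, not values taken from MEGA or from this assignment.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue in each column.

    Gaps ('-') are not counted as residues, so an all-gap column scores 0.0.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for i in range(length):
        residues = [seq[i] for seq in aligned if seq[i] != "-"]
        if residues:
            top_count = Counter(residues).most_common(1)[0][1]
            scores.append(top_count / len(aligned))
        else:
            scores.append(0.0)
    return scores


def conserved_runs(scores, threshold=0.8, min_length=3):
    """Return (start, end) column ranges, 0-based with end exclusive, where
    every column scores at least `threshold` for at least `min_length` columns."""
    runs, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_length:
        runs.append((start, len(scores)))
    return runs


# Toy alignment (made-up sequences) with one conserved stretch at the start.
toy = ["MKTAYIA", "MKTGYLA", "MKT-YIA"]
print(conserved_runs(column_conservation(toy)))
```

A simple identity fraction like this ignores conservative substitutions (such as I versus L), so it understates similarity compared with what a substitution matrix would show.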
In your writeup, describe the process you used. Explain what sequences you selected and why. Describe any sequences you discarded and how the alignment changed as a result. Describe your thoughts about the final alignment's quality.
Submit your input FASTA file of the sequences along with your writeup (so that we can replicate your work). Save and submit your MSA's .mas file too, so we can see what you were working with. Provide a picture of at least part of your MSA (you can take a screenshot for this) in the writeup as well.
Use the alignment to generate a phylogenetic tree that describes the relationship between your protein and its homologs. Describe what tree inference options you selected and why.
To build a tree in MEGA, from the main MEGA launch bar, select Phylogeny | Construct/Test Neighbor-Joining Tree (or another method of building the tree). It will ask that you load a file. Choose the .mas file you saved in step 3. It should then allow you to customize the phylogenetic tree building options as you have seen previously.
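Neighbor-joining builds a tree from a matrix of pairwise distances between sequences, so one of the options you choose is how those distances are computed. As a rough illustration of the simplest choice, the sketch below computes the p-distance (the proportion of aligned positions that differ), skipping any column where either sequence has a gap. This is an assumption-laden toy, not a description of what MEGA does internally; the function names and the example sequences are made up.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing positions between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(aligned):
    """Return {(name_x, name_y): p-distance} for every pair of named sequences."""
    names = list(aligned)
    return {
        (x, y): p_distance(aligned[x], aligned[y])
        for x in names
        for y in names
    }


# Toy alignment with made-up sequences.
toy = {"Dog": "MKTAYIA", "Horse": "MKTGYLA", "Novel": "MKT-YIA"}
matrix = distance_matrix(toy)
for (x, y), d in matrix.items():
    if x < y:  # print each unordered pair once
        print(x, y, round(d, 3))
```

Corrected distances (for example, ones that account for multiple substitutions at the same site) grow faster than the p-distance as sequences diverge, which is one reason the choice of model can matter for distant homologs.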
Include a picture of the tree in your writeup. Is your tree rooted, or can you root your tree? How? Discuss your confidence in the observed relationships. How is your gene related to other sequences? Does this analysis affect your conclusions about the gene’s functional role?
What else can you say about your novel gene? If you were not already sure about what species the sequence comes from, do the alignment and tree shed any light on this issue? Discuss what you have learned about this family of proteins, using any information you want from online databases such as PubMed, OMIM, Homologene, etc. How does the new information from your work inform your initial thoughts about this protein?

Submitting your work:

Submit your work using Gradescope. Please include a writeup, a FASTA file of sequences, and a .mas file with the alignment.

If you make multiple submissions, we will look only at the last one unless you ask us for help on an earlier one. Even in that case, only the last submission will be graded. Please make sure that each submission you wish to have graded includes all the requested components; otherwise, we may not see them all.