This project makes use of the skills you have learned in finding sequences and in running and interpreting the various flavors of BLAST to identify a novel (i.e., previously unannotated) gene in an organism. This process is covered in detail in chapter 4 (pages 155-159) of the Pevsner text.
Write up your work and results a bit more carefully than you would in a lab assignment. For example, for this assignment, you have the time to write in complete and grammatical sentences. Please include whatever details we would need in order to replicate your work.
Work on this on your own. Cite all sources of assistance.
For this first part of the project, you will identify a novel protein, meaning that it has not yet been manually annotated with functional information. To do so, you will start with a known protein and try to find a distant homolog using BLAST.
It may take several attempts to demonstrate that the protein you have found is in fact a) novel (as defined in step 5 below), and b) homologous to your start protein (or to something else identifiable). We therefore suggest that you start by keeping casual lab notes documenting what you are doing (these can even be handwritten). Once you have a documented chain of events that you are going to use, polish the writing that describes that particular story before turning it in.
To obtain the sequence of your starting protein, you should use the RefSeq database. You can find RefSeq sequence links through the NCBI Gene or Protein sites as well as from other resources. Check Pevsner's description of RefSeq and the accession number prefixes to make sure you are getting the canonical protein sequence and not, for example, the entire chromosome. You can also find a RefSeq FAQ at http://www.ncbi.nlm.nih.gov/books/NBK50679/.
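As a quick sanity check on accession prefixes, a small lookup like the following may help. It is only a sketch: the table lists a subset of the common RefSeq prefixes, the helper names `classify_refseq` and `is_refseq_protein` are made up for this example, and the RefSeq FAQ remains the authoritative reference.

```python
# A subset of common RefSeq accession prefixes (see the RefSeq FAQ for the full list).
REFSEQ_PREFIXES = {
    "NP_": ("protein", "protein associated with an NM_ or NC_ record"),
    "XP_": ("protein", "predicted (model) protein"),
    "YP_": ("protein", "protein annotated on a genomic record"),
    "WP_": ("protein", "non-redundant protein shared across genomes"),
    "NM_": ("mRNA", "protein-coding transcript"),
    "XM_": ("mRNA", "predicted (model) protein-coding transcript"),
    "NR_": ("RNA", "non-coding transcript"),
    "XR_": ("RNA", "predicted (model) non-coding transcript"),
    "NC_": ("genomic", "complete genomic molecule, e.g. a chromosome"),
    "NG_": ("genomic", "genomic region"),
    "NT_": ("genomic", "contig or scaffold"),
    "NW_": ("genomic", "contig or scaffold, often from whole-genome shotgun data"),
}


def classify_refseq(accession):
    """Return (molecule_type, description) for a RefSeq accession, or None."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


def is_refseq_protein(accession):
    """True if the accession prefix denotes a RefSeq protein record."""
    info = classify_refseq(accession)
    return info is not None and info[0] == "protein"


# Example accessions, used here only to exercise the lookup.
for acc in ["NP_000509.1", "NC_000011.10", "XM_123456.1"]:
    print(acc, classify_refseq(acc))
```

If `is_refseq_protein` returns False for the accession you plan to start from, you have probably picked up a transcript or genomic record rather than a protein.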
Note that it is very unlikely that you will find a novel gene in a well-known or well-studied organism such as S. cerevisiae, human, or mouse, because those genomes have already been thoroughly annotated. In fact, most lab animals and common household/farm animals and plants are also well annotated by this point. It is much more likely that you will discover a new gene in a genome that is currently being sequenced.
Further, NCBI is now using an automatic annotation pipeline to do some of what we are doing in this project, which makes it harder to find completely unannotated sequence. However, you can see current lists of recently annotated and not-yet-annotated organisms at https://www.ncbi.nlm.nih.gov/genome/annotation_euk/status/. Organisms that have not yet been fully through the NCBI annotation pipeline, or bacterial or environmental samples that are totally unknown, are your best bet.
Include part of the output of that BLAST search in your document. If necessary, change the font to Courier size 10 or smaller and insert line breaks so that the results are displayed neatly. You can also screen capture the BLAST output (e.g. using the print-screen key or keys appropriate for your machine). Please don't print out many pages of BLAST results (most of which will be irrelevant to your novel gene), but please do include the parts you need to highlight next:
On the BLAST results, identify a match to a sequence that you believe is both novel and homologous to your query sequence. We need to be able to inspect the pairwise alignment you have selected, including the E value and score.
Pevsner points out: "In general, this step is the most difficult for students because it requires you to have a 'feel' for how to interpret BLAST results." You need to distinguish between sequences that are not novel; homologs that might be novel, depending on the results of step 5; and sequences that might be novel but are unlikely to be homologous.
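If you have many hits to sort through, a rough first pass over tabular BLAST output may help you decide which alignments to inspect by hand. The sketch below assumes the standard 12-column tab-separated layout produced by BLAST+ with `-outfmt 6` (check that the columns in any hit table you download match this layout). The thresholds, the function names, and the example rows are all made up for illustration; they are not rules from Pevsner, and the sketch does not replace looking at the alignments yourself.

```python
# Column names of the default BLAST+ tabular format (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}
TEXT_FIELDS = {"qseqid", "sseqid"}


def parse_hits(text):
    """Parse tab-separated BLAST hits into a list of dicts, skipping comments."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < len(FIELDS):
            continue  # not a hit row in the assumed layout
        hit = {}
        for name, value in zip(FIELDS, cols):
            if name in TEXT_FIELDS:
                hit[name] = value
            elif name in FLOAT_FIELDS:
                hit[name] = float(value)
            else:
                hit[name] = int(value)
        hits.append(hit)
    return hits


def triage(hit, max_evalue=1e-5, max_identity=90.0):
    """Rough label for a hit; both thresholds are arbitrary assumptions."""
    if hit["evalue"] > max_evalue:
        return "weak"       # may be novel, but homology is doubtful
    if hit["pident"] >= max_identity:
        return "known"      # near-identical, so probably not novel
    return "candidate"      # plausible distant homolog; check novelty next


# Made-up rows for illustration only.
example = "\n".join([
    "# hypothetical hits",
    "query1\tsubjA\t99.2\t250\t2\t0\t1\t250\t1\t250\t0.0\t510",
    "query1\tsubjB\t34.5\t230\t140\t6\t10\t235\t5021\t5710\t2e-30\t121",
    "query1\tsubjC\t27.0\t60\t40\t2\t100\t158\t77\t254\t4.1\t30.4",
])
for hit in parse_hits(example):
    print(hit["sseqid"], hit["evalue"], hit["bitscore"], triage(hit))
```

The three labels correspond loosely to the three kinds of sequence described above. A "candidate" is only a starting point: you still have to show that it is novel.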
Note that the BLAST results show which positions in the subject (database) sequence align with which positions in the query sequence, so you may find the relevant region that way, as well as at least part of the amino acid sequence (in the alignment). Which sequence is the "query" sequence in the alignment and which is the "subject"?
Be careful: you will prove novelty by BLASTing your sequence back against the protein RefSeq database. If you choose a large piece of genomic sequence (e.g. an assembled "contig") as your source, and BLAST that against RefSeq, you will hit all the proteins that sequence contains. Most will not be homologous to your query sequence.
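One way to avoid this problem is to cut out only the part of the contig that aligned to your query, plus a little surrounding sequence, before BLASTing it back. The following is a minimal sketch of that idea. It assumes 1-based, inclusive subject coordinates as reported in the BLAST alignment; the function name `extract_hit_region` and the default flank size are arbitrary choices made for this example.

```python
def extract_hit_region(contig, sstart, send, flank=300):
    """Return (subsequence, first, last) covering a BLAST hit plus flanking bases.

    sstart and send are 1-based inclusive subject coordinates. BLAST reports
    sstart > send for hits on the reverse strand, so the two are sorted first.
    The returned first and last are 1-based positions within the contig.
    """
    low, high = min(sstart, send), max(sstart, send)
    begin = max(low - 1 - flank, 0)          # convert to a 0-based slice start
    end = min(high + flank, len(contig))     # slice end is exclusive
    return contig[begin:end], begin + 1, end


# Toy contig for illustration: the "hit" is the CCCGGG at positions 11-16.
contig = "A" * 10 + "CCCGGG" + "T" * 10
region, first, last = extract_hit_region(contig, 11, 16, flank=2)
print(region, first, last)
```

The extracted region is returned in the orientation of the contig. For a reverse-strand hit you would still need to reverse-complement it (or translate the minus frames) to recover the protein.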
To find the amino acid sequence of your protein, you might also want to use the ExPASy Translate Tool (http://web.expasy.org/translate/), which will translate all six reading frames of a nucleotide sequence. Again, whether or not you need this may depend on what you did to find the original hit, and how long the target sequence is. If the target is an EST, your biggest problem is that you might not have all of the coding sequence. If it is a genomic contig, your issue is that you may have too much surrounding sequence (e.g. many megabases of it).
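To see what a six-frame translation involves, here is a minimal sketch in plain Python. It assumes the standard genetic code (NCBI translation table 1) and DNA input, and it marks any codon containing an ambiguous base as X. The frame labels (+1 to +3, -1 to -3) are a choice made for this example and may not match the labels used by the ExPASy tool.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")   # X for codons with ambiguous bases
        for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return a dict mapping frame labels to translations of all six frames."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for label, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(label, protein)
```

In a real hit, the frame you want is usually the one whose translation matches the amino acids shown in the BLAST alignment.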
Please describe what you did and how you reached your conclusions about the result.
Propose a name for your novel protein and the species from which it derives (to whatever extent it is possible to figure this out).
Submitting your work:
Submit your work using Gradescope. Please submit only a single document containing your writeup.
If you submit more than once, we will see only your last submission unless you ask us for help on an earlier one. Even in that case, only the last submission will be graded. Please make sure that each submission you wish to have graded contains all parts of the assignment in a single document (i.e. if you make a second submission, include your full writeup, not just the parts you have changed).