Comp 7 / Bio 40 - Fall 2019

Comp 7 / Bio 40 Project 1a: Find a Novel Gene

Due September 24, 2019

This project makes use of the skills you have learned in finding sequences and in running and interpreting the various flavors of BLAST to identify a novel (i.e., previously unannotated) gene in an organism. This process is covered in detail in chapter 4 (pages 155-159) of the Pevsner text.

Write up your work and results a bit more carefully than you would in a lab assignment. For example, for this assignment, you have the time to write in complete and grammatical sentences. Please include whatever details we would need in order to replicate your work.

Work on this on your own. Cite all sources of assistance.

Overview:

For this first part of the project, you will identify a novel protein, meaning that it has not yet been manually annotated with functional information. To do so, you will start with a known protein and try to find a distant homolog using BLAST.

It may take several attempts to demonstrate that the protein you have found is in fact a) novel (as defined in step 5 below), and b) homologous to your start protein (or to something else identifiable). Thus, we suggest you start out by just writing casual lab notes documenting what you're doing (could even be hand-written), and then, when you have a documented chain of events that you’re going to use, polish the writing describing that particular story before turning it in.

1. Pick a known protein that you would like to use as a starting point. You are going to compare your protein to a database of newly-obtained nucleotide sequences to try to identify distant homologs that have not yet been characterized. You may use the examples of RBP4 or beta globin, as in the text, or any other protein you would like as your starting point. Please indicate what protein you have chosen, including its RefSeq accession number, and from what organism. Also, optionally, indicate why this protein is of interest to you. Important Note: you may not want to fully write up this part until you have completed steps 2-6 successfully.
To obtain the sequence of your starting protein, you should use the RefSeq database. You can find RefSeq sequence links through the NCBI Gene or Protein sites as well as from other resources. Check Pevsner's description of RefSeq and the accession number prefixes to make sure you are getting the canonical protein sequence and not, for example, the entire chromosome. You can also find a RefSeq FAQ at http://www.ncbi.nlm.nih.gov/books/NBK50679/.
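As a quick sanity check on the accession you picked, you can look at its prefix. The sketch below lists some common RefSeq prefixes. The function names are illustrative only and are not part of any NCBI tool, and the table is not exhaustive, so consult the RefSeq documentation for the full list.

```python
# Some common RefSeq accession prefixes (not an exhaustive list).
REFSEQ_PREFIXES = {
    "NP_": "protein record",
    "XP_": "protein record, predicted model",
    "NM_": "mRNA record",
    "XM_": "mRNA record, predicted model",
    "NR_": "non-coding RNA record",
    "XR_": "non-coding RNA record, predicted model",
    "NC_": "complete genomic molecule, such as a chromosome",
    "NG_": "genomic region",
    "NT_": "genomic contig or scaffold",
    "NW_": "genomic contig or scaffold, often from WGS data",
}


def describe_refseq_accession(accession: str) -> str:
    """Return a short description of the record type implied by the prefix."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix, "not a recognized RefSeq prefix")


def is_curated_protein(accession: str) -> bool:
    """True only for NP_ accessions, the kind wanted as a starting protein."""
    return accession.strip().upper().startswith("NP_")


if __name__ == "__main__":
    # The accessions below are used only to illustrate the prefix check.
    for acc in ["NP_000509.1", "XP_000000.1", "NC_000011.10"]:
        print(acc, "->", describe_refseq_accession(acc))
```

If the accession you copied starts with NC_ or NT_, you have a genomic record rather than a single protein, and you should look for the linked NP_ record instead.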
2. As described in Pevsner, Chapter 4, pp. 155-159, perform the appropriate type of search against a database of potentially uncharacterized nucleotide sequences. Tell us what you did. Did you use tblastn? blastp? Why? What database(s) did you choose, and why? Options might include whole genome shotgun sequence (WGS), uncharacterized genomic survey sequences (GSS), the TSA nucleotide database, an EST database, the SRA database, or a database of environmental samples. To learn more about the possible databases to search, click on the “?” sign next to the database sequence pull-down menu, and then click on “more” to get a description of the different databases. Another possibility is to do a web search for ongoing genome projects that have not been published but for which sequences are being deposited in online resources. You don't have to use the NCBI BLAST server; you might find a BLAST-able database at EBI or elsewhere (e.g. a sequencing project's database) that has what you want.
Note that it is very unlikely that you will find a novel gene from a well-known or well-studied organism such as S. cerevisiae, human or mouse, because those genomes have already been thoroughly annotated. In fact, most lab animals and common household/farm animals and plants are also well annotated by this point. It is much more likely that you will discover a new gene in a genome that is currently being sequenced.
Further, NCBI is now using an automatic annotation pipeline to do some of what we are doing in this project, which makes it harder to find totally unannotated sequence. However, you can see current lists of recently annotated and not-yet-annotated organisms at https://www.ncbi.nlm.nih.gov/genome/annotation_euk/status/. Those that have not yet been fully through the NCBI annotation pipeline, or bacterial or environmental samples that are totally unknown, are your best bet.
Include part of the output of that BLAST search in your document. If necessary, change the font to Courier size 10 or smaller and insert line breaks so that the results are displayed neatly. You can also screen capture the BLAST output (e.g. using the print-screen key or keys appropriate for your machine). Please don't print out many pages of BLAST results (most of which will be irrelevant to your novel gene), but please do include the parts you need to highlight next:
3. On the BLAST results, identify a match to a sequence that you believe is both novel and homologous to your query sequence. We need to be able to inspect the pairwise alignment you have selected, including the E value and score.
Pevsner points out: "In general, this step is the most difficult for students because it requires you to have a 'feel' for how to interpret BLAST results." You need to distinguish between sequences that are not novel; homologs that might be novel, depending on the results of step 5; and sequences that might be novel but are unlikely to be homologous.
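If you save your results in BLAST's tab-separated format (the 12 default columns of "outfmt 6" in command-line BLAST+), a small script can help you sort hits into the three groups described above before you inspect the alignments by hand. This is only a sketch: the E-value and identity cutoffs are arbitrary assumptions chosen for illustration, the sequence names are hypothetical, and no threshold replaces looking at the actual alignments.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    evalue: float
    bit_score: float


def parse_tabular(text: str) -> list:
    """Parse default 12-column tabular BLAST output into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                        float(f[10]), float(f[11])))
    return hits


def triage(hit: Hit, max_evalue: float = 1e-5,
           near_identical: float = 95.0) -> str:
    """Rough first-pass label; both cutoffs are illustrative assumptions."""
    if hit.evalue > max_evalue:
        return "unlikely to be homologous"
    if hit.percent_identity >= near_identical:
        return "check first: may already be known"
    return "candidate: possibly novel homolog"


# Hypothetical example rows (made-up identifiers and values).
SAMPLE = "\n".join([
    "query1\tcontig_A\t100.00\t199\t0\t0\t1\t199\t5001\t5597\t1e-140\t400",
    "query1\tcontig_B\t38.50\t180\t105\t4\t12\t190\t9040\t8510\t2e-28\t112",
    "query1\tcontig_C\t27.00\t60\t40\t2\t70\t128\t300\t479\t3.1\t30.5",
])

if __name__ == "__main__":
    for hit in parse_tabular(SAMPLE):
        print(hit.subject_id, hit.evalue, triage(hit))
```

Hits in the middle group are the ones worth carrying forward to the novelty check; the other two groups are usually quicker to rule out.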
4. Identify the amino acid sequence of this novel protein. This is not always trivial, especially if you found only a small fraction of it, or if your query hits a very long piece of genomic sequence (so that it is not clear which part of the sequence you need to translate).
Note that the BLAST results show which positions in the subject (database) sequence align with which positions in the query sequence, so you may find the relevant region that way, as well as at least part of the amino acid sequence (in the alignment). Which sequence is the "query" sequence in the alignment and which is the "subject"?
Be careful: you will prove novelty by BLASTing your sequence back against the protein RefSeq database. If you choose a large piece of genomic sequence (e.g. an assembled "contig") as your source, and BLAST that against RefSeq, you will hit all the proteins that sequence contains. Most will not be homologous to your query sequence.
To find the amino acid sequence of your protein, you might also want to use the ExPASy Translate Tool http://web.expasy.org/translate/, which will translate all six reading frames of a nucleotide sequence. Again, whether or not you need this may depend on what you did to find the original hit, and how long the target sequence is. If it is an EST, your biggest problem is that you might not have all of it. If it is a genomic contig, your issue is that you may have too much surrounding sequence (e.g. many megabases of it).
Please describe what you did and how you reached your conclusions about the result.
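To make the idea of cutting out the aligned region and translating it concrete, here is a minimal sketch in plain Python. It assumes the standard genetic code and BLAST-style coordinates (1-based, inclusive, with the start larger than the end for a minus-strand hit). The function names and the optional flank parameter are illustrative choices, not something a particular tool provides.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate from the first base; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict:
    """Return translations of all three forward and three reverse frames."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def extract_hit_region(contig: str, sstart: int, send: int,
                       flank: int = 0) -> str:
    """Cut the aligned region (plus optional flank) out of a long subject.

    Coordinates are 1-based and inclusive; sstart > send means the hit
    is on the minus strand, so the region is reverse-complemented.
    """
    lo, hi = min(sstart, send), max(sstart, send)
    begin = max(lo - 1 - flank, 0)
    end = min(hi + flank, len(contig))
    region = contig[begin:end]
    return reverse_complement(region) if sstart > send else region


if __name__ == "__main__":
    contig = "CCATGGCCTAAGG"  # toy sequence for illustration
    region = extract_hit_region(contig, 3, 11)
    for frame, protein in six_frame_translation(region).items():
        print(frame, protein)
```

Working from the extracted region rather than the whole contig keeps the later blastp search focused on your candidate protein instead of on every gene the contig happens to contain.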
5. Prove that this gene, and its corresponding protein, are novel. For the purposes of this project, “novel” is defined as follows. Use the protein sequence that you obtained in step 4 as the query sequence in a blastp search of the RefSeq protein database at NCBI. Click "Models (XM/XP)" under "Exclude" to remove the automatic pipeline annotations that are not yet curated. (This hasn't been working well for some queries. You may exclude such hits if BLAST doesn't seem to do so properly.)
- If there is a match with 100% amino acid identity over the whole sequence length to a protein with an NP_ accession number (not an XP_ predicted one) in the database, from the same species, then your protein is NOT novel (even if the match is to a protein with a name such as “unknown”). Someone has already found and annotated this sequence, and assigned it an accession number.
- If there is a match with less than 100% identity, or only to a short fragment of your sequence, then it is likely that your protein is novel, and you have succeeded.
- If there is a match with 100% identity, but only to a protein from a different species than the one your sequence comes from, then you have succeeded in finding a novel gene, even if it is 100% identical to one in a closely related species.
- If there is no match to the original query sequence from step 1 (and if you are seeing hits of a similar quality), this indicates that you have partially succeeded: yes, you may have found a new gene, but no, it is probably not actually homologous to the original query. You should probably start over. However, if you picked a sequence (in step 2) that was quite a distant homolog of your original query sequence (the protein you selected in step 1), you could try to demonstrate that it is a distant homolog by using PSI-BLAST. If you do this, start with your newly discovered sequence, and blast it against just the RefSeq protein sequences from the species in which you started. Don’t run too many iterations, and carefully examine any hits you add in each iteration. Or try delta-BLAST to see if it can find information on your protein family.
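The rules above can be summarized as a small decision function. This is a simplified sketch, not a grading rubric: the record fields, function names, and accessions are hypothetical, "whole sequence length" is approximated as 100% query coverage, and the "hits of a similar quality" caveat is left to your own judgment.

```python
from dataclasses import dataclass


@dataclass
class RefSeqHit:
    accession: str           # e.g. an NP_ or XP_ accession
    species: str
    percent_identity: float  # identity over the aligned region
    query_coverage: float    # percent of your candidate that is aligned


def is_known_match(hit: RefSeqHit, candidate_species: str) -> bool:
    """Full-length, 100% identical NP_ protein from the same species."""
    return (hit.accession.startswith("NP_")
            and hit.percent_identity >= 100.0
            and hit.query_coverage >= 100.0
            and hit.species == candidate_species)


def classify_novelty(hits: list, candidate_species: str,
                     start_accession: str) -> str:
    # Drop predicted (XP_) models in case the "Exclude" option missed them.
    curated = [h for h in hits if not h.accession.startswith("XP_")]
    if any(is_known_match(h, candidate_species) for h in curated):
        return "not novel"
    # Compare accessions without their version suffix (the part after '.').
    start = start_accession.split(".")[0]
    if not any(h.accession.split(".")[0] == start for h in curated):
        return "possibly novel, but homology to the start protein is unconfirmed"
    return "novel"


if __name__ == "__main__":
    # Hypothetical hits, species, and accessions for illustration only.
    hits = [
        RefSeqHit("NP_000001.1", "Homo sapiens", 42.0, 90.0),
        RefSeqHit("XP_000002.1", "Examplea hypothetica", 100.0, 100.0),
    ]
    print(classify_novelty(hits, "Examplea hypothetica", "NP_000001"))
```

A result of "unconfirmed" corresponds to the last bullet: either start over, or try PSI-BLAST or delta-BLAST to look for a more distant relationship.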
6. Gather more information about this “novel” protein through additional BLAST searches. Do you think you have the entire coding sequence? If you use BLAST to confirm your result, can you extend the sequence further? Can you identify the species of origin if you chose something from an environmental sample where the species is not already known? Can you use delta-BLAST or PSI-BLAST to learn more about the protein family? Describe any additional searches you did, indicate what you were hoping to learn from each, and summarize the results you obtained.
7. Propose a name for your novel protein and the species from which it derives (to whatever extent it is possible to figure this out).

Submitting your work:

Submit your work using Gradescope. Please submit only a single document containing your writeup.

If you submit more than once, we will see only your last submission unless you ask us for help on an earlier one. Even then, only the last submission will be graded. Please ensure that you submit all parts of the assignment in a single document for each submission that you wish to have graded (i.e. if you make a second submission, make sure to include your full writeup, not just the parts you have changed).