Comp 7 / Bio 40 Project 1a: Find a Novel Gene
Due September 28, 2016
This project makes use of the skills you have learned in finding
sequences and in running and interpreting the various flavors of BLAST
to identify a novel (i.e., previously unannotated) gene in an
organism. This process is covered in detail in Pevsner, Chapter 4 (pages 155-159).
Write up your work and results a bit more carefully than you would in
a lab assignment. For example, for this assignment, you have the time
to write in complete sentences and to check your grammar. Please include
whatever details we would need in order to replicate your work.
Work on this on your own. Cite all sources of assistance.
For this first part of the project, you will identify a
novel protein, meaning that it has not yet been annotated with
information about its function.
To do so, you will start with a known
protein and try to find a distant homolog using BLAST. However, it
may take you a couple of tries to demonstrate that the protein you
have found is in fact a) novel (as defined in step 6 below),
and b) homologous to your start
protein (or to something else identifiable). So you may want to
start out just by writing casual lab notes about what
you are doing, and then, when you have a documented chain of events
that you’re going to use, polish the writing describing that
particular story before turning it in.
- Pick a known protein that you would like to use as a starting
point. You are going to compare your protein to a database of
newly obtained DNA sequences to try to identify distant homologs that have not
yet been characterized. You may use RBP4, beta globin, or any other
protein you would like as your starting point.
Please indicate what protein (including its
RefSeq accession number) you have chosen, and from what organism. Also,
optionally, indicate why this protein is of interest to you.
Important Note: you may not want to fully
write up this part until you have
completed steps 2-5 successfully.
To obtain the sequence of your starting protein, you should
use the RefSeq database.
You can find RefSeq sequence links through the NCBI Gene or Protein
sites as well as from other resources. Check Pevsner's description
of RefSeq and the accession number prefixes to make sure you are getting
the canonical protein sequence and not, for example, the entire chromosome.
You can also find a RefSeq FAQ on the NCBI website.
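As a quick sanity check on the accession you record, the prefix conventions can be captured in a few lines of Python. This is only a sketch: the function names are hypothetical, the table lists common RefSeq prefixes rather than all of them, and the two accession strings are used purely to illustrate prefixes.

```python
# Common RefSeq accession prefixes (not an exhaustive list).
REFSEQ_PREFIXES = {
    "NP_": "protein",
    "XP_": "protein",      # predicted by automated annotation
    "YP_": "protein",
    "WP_": "protein",
    "NM_": "mRNA",
    "XM_": "mRNA",         # predicted by automated annotation
    "NR_": "non-coding RNA",
    "XR_": "non-coding RNA",
    "NC_": "genomic",      # complete genomic molecule, e.g. a chromosome
    "NG_": "genomic",
    "NT_": "genomic",
    "NW_": "genomic",
}


def refseq_molecule_type(accession):
    """Return the molecule type implied by the accession prefix, or 'unknown'."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


def is_protein_accession(accession):
    """True if the prefix indicates a protein record (hypothetical helper)."""
    return refseq_molecule_type(accession) == "protein"


if __name__ == "__main__":
    # Example strings chosen only to show one protein-style and one
    # chromosome-style prefix.
    for acc in ("NP_000509.1", "NC_000011.10"):
        print(acc, "->", refseq_molecule_type(acc))
```

If the prefix of your accession maps to "genomic" or "mRNA", you have probably not retrieved the protein record you need for step 1.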
- As described in Pevsner, Chapter 4,
perform the appropriate type of search
against a database of uncharacterized DNA sequences. Tell us what you
did. Did you use tblastn? blastp? Why? What database(s) did you
choose, and why? Options might include whole genome shotgun sequence (WGS),
uncharacterized genomic survey
sequences (GSS), the TSA nucleotide database, an EST database, or a database
of environmental samples. To learn more about the possible databases
to search, click on the “?” sign next to the database sequence
pull-down menu, and then click on “more” to get a description of the databases.
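When deciding which flavor of BLAST fits your search, it can help to write down what kind of sequence the query is and what kind of sequences the database holds. The sketch below encodes the standard pairing of query and database types; the function name and the "protein"/"nucleotide" labels are hypothetical conveniences, and you still need to explain your choice in your own words.

```python
# Which BLAST program matches a given query type and database type.
# (tblastx, which translates both a nucleotide query and a nucleotide
# database, is left out of this simple lookup.)
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("protein", "nucleotide"): "tblastn",
    ("nucleotide", "protein"): "blastx",
    ("nucleotide", "nucleotide"): "blastn",
}


def choose_blast_program(query_type, database_type):
    """Return the BLAST program for the given query and database types."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            "types must each be 'protein' or 'nucleotide', got "
            f"{query_type!r} and {database_type!r}"
        )


if __name__ == "__main__":
    # A protein query searched against a database of DNA sequences.
    print(choose_blast_program("protein", "nucleotide"))
```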
Include part of the output of that BLAST search in your document. If
necessary, change the font to Courier size 10 or smaller and insert
line breaks so that the results are displayed neatly. You can also
screen capture the BLAST output (e.g. using the print-screen key or
keys appropriate for your machine). Please don't print out
many pages of BLAST results (most of which will be irrelevant to your
novel gene), but please do include the parts you need to highlight next:
- On the BLAST results, identify a match to a sequence
that you believe is both novel and homologous to your query sequence.
We need to
be able to inspect the pairwise alignment you have selected, including
the E value and score.
Pevsner points out:
"In general, this step is the most difficult for students because it
requires you to have a 'feel' for how to interpret BLAST results." You
need to distinguish between sequences that are not novel;
those that might be 'novel,' depending on the results of step 5;
and sequences that might be novel but also are unlikely to be homologous.
Note that it is very unlikely (though still possible) that you
will find a novel gene from an organism such as S. cerevisiae, human
or mouse, because those genomes have already been thoroughly
annotated. Most lab animals and common household/farm animals are also
well annotated by this point. It is much more likely that you will
discover a new gene in a genome that is currently being sequenced.
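If you save your hits as a tab-separated hit table, a short script can help you scan the E values and scores before you pick an alignment to present. The sketch below assumes the 12-column tabular layout produced by BLAST+ with -outfmt 6; the function names are hypothetical, the two sample rows are made up for illustration, and the E value cutoff in the example is an arbitrary choice, not a rule from this assignment.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")

# Made-up rows for illustration only; these are not real BLAST results.
SAMPLE = (
    "# made-up rows for illustration only\n"
    "query1\tsubject_A\t38.5\t130\t70\t4\t25\t150\t901\t1290\t2e-20\t95.1\n"
    "query1\tsubject_B\t27.0\t60\t40\t2\t30\t88\t400\t579\t3.1\t30.4\n"
)


def parse_tabular_hits(text):
    """Parse tab-separated hit lines into dictionaries, skipping comments."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(COLUMNS, line.split("\t")))
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        hits.append(hit)
    return hits


def filter_by_evalue(hits, max_evalue):
    """Keep only hits whose E value is at or below max_evalue."""
    return [hit for hit in hits if hit["evalue"] <= max_evalue]


if __name__ == "__main__":
    for hit in filter_by_evalue(parse_tabular_hits(SAMPLE), max_evalue=1e-5):
        print(hit["sseqid"], hit["pident"], hit["evalue"], hit["bitscore"])
```

A filter like this only narrows the list. You still need to inspect the pairwise alignment itself and decide whether the hit is plausibly homologous and plausibly novel.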
- Identify the amino acid sequence of this novel protein. This is
not always trivial, especially if you started out with only a small
fraction of it, or if your query hits a very long piece of genomic
sequence (so that it is not clear which part of the sequence you need to
translate). Note that the BLAST results show which positions in the
subject (database) sequence align with which positions in the query
sequence, so you may find the relevant region that way.
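One way to use those subject coordinates is to cut the aligned region out of the long subject sequence before translating it. The following is a minimal sketch under stated assumptions: coordinates are 1-based and inclusive, a start greater than the end means the hit is on the reverse strand, and the function names and the optional flank parameter are hypothetical.

```python
# Complement table for DNA; any other characters (e.g. N) pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def extract_subject_region(subject_seq, s_start, s_end, flank=0):
    """Return (region, strand) for 1-based inclusive subject coordinates.

    If s_start > s_end, the hit is assumed to lie on the reverse strand and
    the region is returned reverse-complemented. `flank` adds extra bases on
    each side, clipped to the ends of the sequence.
    """
    low, high = sorted((s_start, s_end))
    low = max(1, low - flank)
    high = min(len(subject_seq), high + flank)
    region = subject_seq[low - 1:high]
    if s_start <= s_end:
        return region, "+"
    return reverse_complement(region), "-"


if __name__ == "__main__":
    toy_subject = "AAATGGCCTAAGG"  # toy sequence, not real data
    print(extract_subject_region(toy_subject, 3, 11))
    print(extract_subject_region(toy_subject, 11, 3))
```

Adding some flanking sequence can be useful when you suspect the alignment covers only part of the coding region.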
You may want to try the ExPASy Translate Tool,
which will translate all six
reading frames of a nucleotide sequence. Please describe what you did
and how you reached your conclusions about the result.
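To see what a six-frame translation involves, here is a small self-contained sketch in Python. It assumes the standard genetic code, writes stop codons as "*" and any codon containing an ambiguous base as "X"; the function names are hypothetical, and this is not how any particular web tool is implemented.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA from its first base; a trailing partial codon is ignored."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # "X" for codons with ambiguous bases
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

The frame whose translation matches the subject line of your alignment, without stop codons in the aligned stretch, is the one to follow when you reconstruct the protein.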
- Gather more information about this “novel” protein through
additional BLAST searches. Do you think you have the entire coding
sequence? If you use BLAST to confirm your result, can you extend it
further? Can you identify the species of origin if that is not
already known? Describe any additional searches you did, indicate
what you were hoping to learn from each, and summarize the results you obtained.
Propose a name for your novel protein and the species from which it
derives (to whatever extent it is possible to figure this out).
- Prove that this gene, and its corresponding protein, are
novel. For the purposes of this project, “novel” is defined as
follows. Use the protein sequence that you obtained in step 4
as the query sequence in a blastp search of the nr protein database.
- If there is a match with 100% amino acid identity over the whole sequence length to a protein in
the database, from the same species, then your protein is NOT novel
(even if the match is to a protein with a name such as
“unknown”). Someone has already found and annotated this sequence, and
assigned it an accession number.
- If there is a match with less than 100% identity, or only to a short fragment of your sequence, then it is likely
that your protein is novel, and you have succeeded.
- If there is a match with 100% identity, but only to a protein from
a different species
than the one your sequence comes from, then you have succeeded in finding a
novel gene, even if it is 100% identical to one in a closely related species.
- If there is no match to the original query sequence from step 1
(and if you are seeing hits of a similar quality),
this indicates that you have partially succeeded: yes, you may have
found a new gene, but no, it is probably not actually homologous to
the original query. You should probably start over. However, if you
picked a sequence (in step 2) that was quite a distant homolog from
your original query sequence (the protein you selected in step 1), you
could try to demonstrate that it is a distant homolog by using PSI-BLAST.
If you do this, start
with your newly discovered sequence, and blast it
against just the nr protein sequences from the species in which you
started. Don’t run too many iterations, and carefully examine any
hits you add in each iteration.
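The first three cases above can be summarized as a small decision function applied to the best blastp hit. This is only a sketch: the function and parameter names are hypothetical, you must judge for yourself whether a match covers the whole sequence or only a short fragment, and the fourth case (no match to the original query from step 1) is not covered here.

```python
def classify_novelty(percent_identity, covers_full_length, same_species):
    """Classify the best blastp hit according to this project's definition.

    percent_identity   -- percent amino acid identity of the best hit
    covers_full_length -- True if the match spans your whole sequence
    same_species       -- True if the hit is from the species your sequence
                          comes from
    """
    if percent_identity == 100 and covers_full_length:
        if same_species:
            # Already found and annotated, even if it is named "unknown".
            return "not novel"
        # Identical protein, but only known from a different species.
        return "novel"
    # Less than 100% identity, or a match to only a short fragment.
    return "likely novel"


if __name__ == "__main__":
    print(classify_novelty(100, True, True))
    print(classify_novelty(100, True, False))
    print(classify_novelty(87.5, True, True))
    print(classify_novelty(100, False, True))
```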
Submitting your work:
Submit your work using Trunk.
Please submit only a single document containing your writeup. If there
are multiple submissions, we will only look at the last of them unless
you ask us for help on an earlier one. In that case, only the last
submission will be graded. Please ensure that you submit all parts of the assignment in a single document for each submission that you wish to have graded (i.e. if you make a second submission, make sure to include your full writeup, not just the parts you have changed).