Comp 7 / Bio 40 Project 1a: Find a Novel Gene

Due September 28, 2016

This project makes use of the skills you have learned in finding sequences and in running and interpreting the various flavors of BLAST to identify a novel (i.e., previously unannotated) gene in an organism. This process is covered in detail in Chapter 4 (pages 155-159) of the Pevsner text.

Write up your work and results a bit more carefully than you would in a lab assignment. For example, for this assignment, you have the time to write in complete sentences and to check your grammar. Please include whatever details we would need in order to replicate your work.

Work on this on your own. Cite all sources of assistance.


For this first part of the project, you will identify a novel protein, meaning that it has not yet been annotated with information about its function. To do so, you will start with a known protein and try to find a distant homolog using BLAST. However, it may take you a couple of tries to demonstrate that the protein you have found is in fact a) novel (as defined in step 6 below), and b) homologous to your start protein (or to something else identifiable). So you may want to start out just by writing casual lab notes about what you are doing, and then, when you have a documented chain of events that you’re going to use, polish the writing describing that particular story before turning it in.

  1. Pick a known protein that you would like to use as a starting point. You are going to compare your protein to a database of newly-obtained DNA sequences to try to identify distant homologs that have not yet been characterized. You may use RBP4, beta globin, or any other protein you would like as your starting point. Please indicate what protein (including its RefSeq accession number) you have chosen, and from what organism. Also, optionally, indicate why this protein is of interest to you. Important Note: you may not want to fully write up this part until you have completed steps 2-5 successfully.

    To obtain the sequence of your starting protein, you should use the RefSeq database. You can find RefSeq sequence links through the NCBI Gene or Protein sites as well as from other resources. Check Pevsner's description of RefSeq and the accession number prefixes to make sure you are getting the canonical protein sequence and not, for example, the entire chromosome. You can also find a RefSeq FAQ at

  2. As described in Pevsner, Chapter 4, perform the appropriate type of search against a database of uncharacterized DNA sequences. Tell us what you did. Did you use tblastn? blastp? Why? What database(s) did you choose, and why? Options might include whole genome shotgun sequence (WGS), uncharacterized genomic survey sequences (GSS), the TSA nucleotide database, an EST database, or a database of environmental samples. To learn more about the possible databases to search, click on the “?” sign next to the database sequence pull-down menu, and then click on “more” to get a description of the different databases.

    Include part of the output of that BLAST search in your document. If necessary, change the font to Courier size 10 or smaller and insert line breaks so that the results are displayed neatly. You can also screen capture the BLAST output (e.g. using the print-screen key or keys appropriate for your machine). Please don't print out many pages of BLAST results (most of which will be irrelevant to your novel gene), but please do include the parts you need to highlight next:

  3. On the BLAST results, identify a match to a sequence that you believe is both novel and homologous to your query sequence. We need to be able to inspect the pairwise alignment you have selected, including the E value and score.

    Pevsner points out: "In general, this step is the most difficult for students because it requires you to have a 'feel' for how to interpret BLAST results." You need to distinguish between sequences that are not novel; those that might be 'novel,' depending on the results of step 5; and sequences that might be novel but also are unlikely to be homologous.

    Note that it is very unlikely (though still possible) that you will find a novel gene from an organism such as S. cerevisiae, human or mouse, because those genomes have already been thoroughly annotated. Most lab animals and common household/farm animals are also well annotated by this point. It is much more likely that you will discover a new gene in a genome that is currently being sequenced.

  4. Identify the amino acid sequence of this novel protein. This is not always trivial, especially if you started out with only a small fraction of it, or if your query hits a very long piece of genomic sequence (so that it is not clear which part of the sequence you need to translate). Note that the BLAST results show which positions in the subject (database) sequence align with which positions in the query sequence, so you may find the relevant region that way. You may want to try the ExPASy Translate Tool, which will translate all six reading frames of a nucleotide sequence. Please describe what you did and how you reached your conclusions about the result.

  5. Gather more information about this “novel” protein through additional BLAST searches. Do you think you have the entire coding sequence? If you use BLAST to confirm your result, can you extend it further? Can you identify the species of origin if that is not already known? Describe any additional searches you did, indicate what you were hoping to learn from each, and summarize the results you obtained.

    Propose a name for your novel protein and the species from which it derives (to whatever extent it is possible to figure this out).

  6. Prove that this gene, and its corresponding protein, are novel. For the purposes of this project, “novel” is defined as follows. Use the protein sequence that you obtained in step 4 as the query sequence in a blastp search of the nr protein database at NCBI.

Submitting your work:

Submit your work using Trunk. Please submit only a single document containing your writeup.

If you submit more than once, we will look only at your last submission, unless you ask us for help with an earlier one. Even then, only the last submission will be graded. Please ensure that each submission you wish to have graded is a single document containing all parts of the assignment (i.e., if you make a second submission, include your full writeup, not just the parts you have changed).