Comp 7 / Bio 40 Lab 2: Advanced BLAST

September 17, 2019

Purpose of this lab

To help students develop greater familiarity with interpreting results of the larger family of BLAST programs; to offer practice with Python functions, Booleans, conditionals, control flow, keyboard input, and problem solving.

General information (Review from last time)

You can find most of the web links you need for this lab from the "Links" tab on the course web site.

Writing your lab notes

Recall that you should be documenting the queries you run in your writeup. Please use Word, or any another document editor that allows you to save your results in pdf format.

Using the Filesystem

Note that you need to be sure to save files in or below your home directory (drive Z:\ on the lab machines), not to the local desktop, to ensure that they will be saved once you log out and will be accessible from other lab machines. You may want to create folders and substructure (e.g., "comp7" or "lab1" folders) to keep the file contents manageable.

Starting IDLE and creating new files

To start IDLE on the lab machines, you should now find a shortcut to Python 2.7 (slightly newer and less buggy than Python 2.6) on the desktop on the lab machines. Click on that and it should start up an IDLE window. If you can't find the shortcut on the desktop, click on the Windows icon in the lower left corner of the screen, and then on the programs listing, scroll down to Python 2.6 and run "IDLE (Python GUI)" below that. This will start IDLE.

To create a new python file, click on the File menu and choose "New Window" or "New File" (depending on the version). This will pop up a new editor window with a blank python script. To save it, click on the File menu and select "Save". Type in the filename: e.g., call it "SomeName.py" - the ".py" extension is important. It tells you - and various computer programs - that this text file contains a Python script. Choose a location on the Z drive for your file.

The first line of your Python script should be a comment line that gives both of your names. E.g.,

# This is the solution to Lab 2, problem 1, by Joe Shmoe and Penny Python

Lab problems

  1. Simple function:

    For this problem, you will write a function that takes no input arguments and returns nothing, but that reads input from and prints output to the console. Specifically, define a function called simple() that uses the raw_input() function (as in the Pyg Latin assignment in Codecademy) to prompt the user to input a string. Save that string in a variable. Then have the function print out a message saying whether the string is all uppercase or not.

    (Hint: you already know a function to make a string uppercase, and you know how to compare strings to each other. How you use these to determine if the input string is in uppercase?)

    Write a "main" part of the program, below the function definition, that just calls this function twice.

    Save the program as simple.py, and submit it on Gradescope for problem 1.

  2. In Between?

    Locations of sequence mutations are often specified by a particular position in a defined region of genomic sequence. One important question is whether the mutation affects the coding sequence of a protein or not.

    For this problem, we will address a simplified version of this question. We will write a Python function called inExon that takes three integer arguments: start, end, and mutation.

    The first two should represent the start and end positions of an exon in a gene's genomic sequence. The third represents the position of the mutation. The function should use a single if/elif/else block to determine whether the third number is in between the other two. Recall that you can use and, or, and not in constructing your statements.

    If the mutation position is between the other two, have the function return the Boolean value True. Otherwise, have it return False.

    To make this question straightforward, you may assume that the input is correct and well-behaved: that the input values are all integers, and that you are getting forward strand position numbers so that the start position is strictly less than the end position.

    To complete the program, write a main program that uses the raw_input() function to read these three values from the console. Use the int() function to convert these values, which will be input as strings, to integers. Then call your function with these three values as arguments.

    Don't forget to save the value returned by the function to a variable in the main program. (What type does the value stored in this variable have?) The main program should then print out a message saying either that the mutation is within the exon or outside of it, depending on the value returned from the function.

    An example run might look like the following:

    > Please enter the exon start position:  14
    > Now enter the exon end position: 92
    > Now enter the position of the mutation:  73
    > The mutation is within the exon.
    

    Note that the first three lines are printed by the main program. Then the input strings "14", "92", and "73" are converted to integers, and the function isExon is called with arguments containing the values 14, 92, and 73. This function determines that 73 is between 14 and 92, so it returns True. The main program then uses that information, along with another if/else block, to determine that it should print a message saying the mutation is within the exon.

    Save this program with the name inExon.py and submit it on Gradescope for problem 2.

  3. Scoring matrices: Globins in plants

    The globins are a protein superfamily whose members are structurally similar, containing a "globin fold" made up of 8 alpha helices. Hemoglobin in vertebrate blood binds to and carries oxygen from the lungs to the rest of the body. Given this functional role, one might wonder whether there are globin proteins in more distant organisms, say, plants. We can use BLAST to find out.

    Use the human beta globin (HBB) protein, NP_000509.1, as the query in a BLASTp search of the nr protein database, restricted to plants, using the default BLASTp algorithm and default options.

    1. What is the (default) E-value cutoff you used for this query? (Hint: it is listed under Algorithm Parameters and is called the "Expect threshold.") How many hits are reported for this query? What is the E-value range of the reported hits? Approximately what fraction have a name containing "globin" and have therefore probably been determined to likely be homologous to the query protein? Look at the alignments to get a sense of how alignments with those scores might look like.

      Now, we'll vary the scoring matrix. To select a different scoring matrix, expand the "Algorithm Parameters" checkbox and select a different scoring matrix under scoring parameters.

    2. Run a BLASTp search on this same query protein against the nr database in plants, but this time select the PAM30 scoring matrix; check the box to show the results in a new window so you can compare your output to that from the previous search. How many hits do you see? What is the range of E-values of these hits? Approximately what fraction have a name containing "globin?" Look at the alignments again. Do you believe the proteins without "globin" in their names are nonetheless potentially homologous?

    3. Do this search one more time, using the BLOSUM45 matrix instead. How many hits do you see? What is the range of E-values? Approximately what fraction of the hits are called globins?

    4. Why do you think you got the results you did from varying the scoring matrix?

  4. Globins in plants: finding distant homology

    Let's continue the searches from problem 4 above.

    Using the HBB protein, NP_000509.1, again as your query protein, do a protein search of the nr protein database, restricted to plants, using PSI-BLAST instead of BLASTp. Under Algorithm Parameters, set the PSI-BLAST threshold (under algorithm parameters) to 0.075 and change "Max target sequences" to 100. Set the scoring matrix to BLOSUM45. Also, check the box "Show results in a new window" so you can keep the query screen active.

    1. How many hits are below this threshold in the first round?

    2. What is the best hit from the first round, and what is its E-value?

    3. Select all and only the hits below the threshold to use in PSI-BLAST (this is the default, so you don't really need to select anything), then click Run to run PSI-Blast iteration 2. How many hits are below the E-value threshold in this round?

    4. What is the best hit in this round, and what is its E-value?

    5. Approximately what fraction of the hits shown in round 2 (e.g., none, a few, half, most, all) are named as globins in plants?

    If you didn't get to problem 5 below, try to do it on your own later. You will want this before project 1b.


    Going Further

    1. Finding Sets of Sequences:

      Suppose you are interested in the protein rabaptin (Rabep1) in rats. Its RefSeq ID number is NP_061997. You would like to collect sets of homologous sequences from a number of organisms at different evolutionary distances. In this problem we will find three different ways to do this.

      1. For the first collection of protein sequences, we will use the HomoloGene database. HomoloGene is the NCBI collection of Eukaryotic homologs. Strict criteria are required for declaring homology, with the result being that this list is higher quality but much more restrictive than the lists of homologs you might find in other database (such as Ensembl from the EMBL group).

        While you can query HomoloGene directly from any NCBI database page, the easiest way to get there for a particular gene is from the NCBI Gene page. For this problem, find the Rabep1 gene in Rattus norvegicus in the NCBI Gene database. Search for HomoloGene on that page and you will find a link in the column of links on the right. Follow that link. Click on the word Downloads on the right hand side over the Genes and Proteins sections of the page. Choose to save the Protein sequences in fasta format (click the Download button on this page to save the file). Rename the file you saved; call it rabep1.homologene.fasta.

        How many homologs are listed? If you use the pairwise BLAST on this page (under the heading Protein Alignments) to find an alignment between the human and zebrafish sequences, what is the percent amino acid identity?

      2. For the next source of sequences, we'll try the PFAM database. Go back to the HomoloGene page for Rabep1. Under the heading Proteins, click on the RefSeq ID for the rat protein and copy its fasta-formatted amino acid sequence. Go to the pfam page. Select "sequence search" and paste in the query sequence of the rat Rabep1 protein. Click on the first family ("Rabaptin"), and then choose the Alignments page from the menu on the left hand side. You can see the multiple sequence alignment of the three seed family members by clicking on the checkmark in the row "HTML" under "Seed (3)".

        Alternatively, you can download all the fasta-formatted sequences from the full alignment by clicking on the "download" link at the very bottom of the page, below the Download options table.

        Rename this file as rabep1.pfam.fasta.

      3. Finally, you can download sequences selected from your BLAST hits. Go back to the BLASTp web site, BLAST your query protein sequence from rat against the RefSeq protein database using the default options but checking "Models (XM/XP)" in the Exclude row.

        There will be many good hits. Choose the first 10 of these that come from different organisms, and click the checkbox on the left under "Select" to select those hits. Choose the link "Download" on the upper right of that section to download these sequences. Save this file as rabep1.blastp.fasta.

      4. Upload these three saved files on Gradescope.

    2. In between, part 2:

      To make problem 2 above simple, you were allowed to assume that the input was correct and well-behaved: that the input values were all integers, and that they represented forward strand position numbers so that the start position is strictly less than the end position. However, what if they are reverse strand positions, so start is greater than end? What if someone enters a string or floating point number instead of an integer? If you found the straightforward version of the problem less challenging, rewrite your script to address these points as well.

    3. Another brick in the fall:

      If you drop a brick from the top of a building, the distance it has fallen can be given in meters as: y = g t2 / 2 , where t is the time in seconds since is was dropped and g is the gravitational constant 9.8. Write a function to compute and return the distance the brick has fallen at a given time. Call it several times from the main program, collect the return value, and print the time and distance for each call.

      Next, note that the tallest building in the world is the Burj Khalifa, at 829.8 meters tall. Modify your function so that the maximum distance it can return is 829.8 (after which, the brick hits the ground and stops moving). You might also make this "maximum fall distance" a second parameter of your function.

    4. Delta BLAST

      Try using human gamma globin, NP_000550.2, as your query protein, against the nr database in plants. Compare BLASTp and PSI-BLAST to Delta BLAST. How many hits do you get with an E-value below 0.1 using each method? How many hits are reported in total? Explain what Delta BLAST does differently that leads to these results.