Comp 7 / Bio 40 Lab 2: Advanced BLAST

September 21, 2016

Purpose of this lab

To help students develop greater familiarity with interpreting BLAST results; to offer practice with Python functions, Booleans, conditionals, and problem solving.

General information (Review from last time)

You can find most of the web links you need for this lab from the "links" tab on the course web site.

Writing your lab notes

You may record your lab notes in any way that you choose, so long as we are able to read them. A plain text file is fine; you can edit such files using Notepad++ on the lab machines. Microsoft Office is also available on the lab machines.

Using the Filesystem

Be sure to save files in or below your home directory (drive Z:\ on the lab machines), not on the local desktop. Files in your home directory are kept after you log out and are accessible from other lab machines.

Starting IDLE and creating new files

To start IDLE on the lab machines, go to Start -> All Programs -> Python 2.6, and run "IDLE (Python GUI)". This will start IDLE in the interpreter window.

To create a new Python file, click on the File menu and choose "New Window". This will pop up a new editor window with a blank Python script. To save it, click on the File menu and select "Save". Type in the filename: call it "SomeName.py" - the ".py" extension is important. It tells you - and various computer programs - that this text file contains a Python script.

The first line of your Python script should be a comment line that gives both of your names. E.g.,

# This is the solution to Lab 2, problem 1, by Joe Shmoe and Penny Python

Lab problems

  1. Case checking

    For this problem, you will create a Python program that asks the user to type an input string, and, if the first letter of that input string is uppercase, prints a message saying so. If not, it should print a message saying that the first letter is lowercase. You may assume that the input string contains only letters and is not empty.

    In the body of the program, first prompt for and obtain a single string of user input. Recall that the built-in raw_input() function can be used to obtain user input, as in the Pig Latin module in Codecademy. Store the resulting input string in a variable with a name that you choose.

    Next, you need to figure out how to get the first letter of the input string. (You might want to look at the Pig Latin module for the syntax here.) For convenience, you might want to save this letter in a variable of its own.
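
    As a reminder of the indexing syntax, here is a minimal sketch using a made-up string. The variable names are placeholders, not part of the assignment. Python numbers positions starting at 0, and print is written with parentheses around a single value so the lines run in the lab's Python 2 as well as in Python 3.

```python
# Hypothetical example string; any non-empty string behaves the same way.
word = "globin"

first_letter = word[0]   # position 0 is the first character
last_letter = word[-1]   # negative positions count from the end

print(first_letter)  # g
print(last_letter)   # n
```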

    Finally, you need to figure out how to tell if it is an uppercase letter or not. You already know at least one built-in function that deals with case, and you know how to test for equality. How could you use these to determine the case of the first letter in the input string?
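
    As a refresher on the pieces mentioned above (not a solution to the problem), the sketch below shows how the string case methods and the equality test behave. The string "Blast" and the variable names are chosen only for illustration.

```python
# Hypothetical example string, chosen only for illustration.
name = "Blast"

shouted = name.upper()    # a new string; name itself is unchanged
whispered = name.lower()

print(shouted)             # BLAST
print(whispered)           # blast
print(shouted == "BLAST")  # True
print(name == shouted)     # False, because "Blast" and "BLAST" differ
```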

    Write code to check if the letter is uppercase or not and print an appropriate message.

    Save your program with the name uppercase.py and upload it with your lab submission.

  2. Case checking with functions

    Next, create a new version of your Python program that solves the case-checking problem above. We'll modify this program to use a function to do the case-checking part. Specifically, you will define a Boolean function that takes one parameter (a string) as input and that returns the value True if the first letter of the input string is uppercase. You may still assume that the parameter string only contains letters and is not empty.

    In the main part of the program, you should prompt for and obtain a single string of user input. Then call the function that you defined. Use the value returned by this function to determine which message to print out (either "the first letter is uppercase" or "the first letter is lowercase"). Printing this message should happen in the main body of the program, not in the function.

    Remember that you can define and call functions using the following syntax:

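    # Return True if x equals 1, and False otherwise.
    def isOne(x):
        if x == 1:
            return True
        else:
            return False

    # Call the function and store the Boolean that it returns.
    aBoolean = isOne(10 - 9)
    print "Ten minus nine is one:  ", aBoolean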
    

    Save your program with the name uppercaseFunc.py and upload it with your lab submission.

  3. Finding Gene information

    Suppose you are interested in the human gene ACE. Look this up in the NCBI Gene database. Make sure you're dealing with the human gene.

    1. Write a one-sentence description of ACE's putative function.

    2. Look at the genome map and browser shown in the section titled "Genomic regions, transcripts, and products." How many different isoforms are shown on the map? (See if you can visually observe which exons distinguish each from the others.)

      Note that some of these are only predicted transcripts according to the RefSeq naming conventions here and here. They do not have curated RefSeq protein identifiers, which start with NP_. You can still count them for this problem, though.

    3. Find the curated RefSeq Protein record for the protein corresponding to the longest isoform. Save that protein sequence in FASTA text format as the file ACE.fasta. Include its RefSeq ID in your writeup, and upload this fasta file with your lab submission.
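
    If you would like to check that a saved file really is in FASTA format, you could read it back in Python. This is a minimal sketch and not a required part of the lab. The helper name read_fasta_text and the toy record are hypothetical (the sequence is a placeholder, not the ACE protein), and the sketch assumes the text holds a single record whose first line starts with ">".

```python
def read_fasta_text(text):
    """Split single-record FASTA text into (header, sequence)."""
    # Keep only non-empty lines, with surrounding whitespace removed.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must begin with a '>' header line")
    header = lines[0][1:]            # drop the leading '>'
    sequence = "".join(lines[1:])    # sequence lines are joined end to end
    return header, sequence


# Toy record with a made-up identifier and sequence, for illustration only.
toy_record = ">toy_id example protein\nMKTAYIAK\nQRQISFVK\n"

header, sequence = read_fasta_text(toy_record)
print(header)          # toy_id example protein
print(len(sequence))   # 16
```

    For your own file, the text could come from open("ACE.fasta").read(), assuming the file is in the directory where the script runs.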

  4. Scoring matrices: Globins in plants

    Use the human beta globin (HBB) protein, NP_000509, to perform a BLASTp search of the nr protein database, restricted to plants, using the default BLAST algorithm.

    1. How many hits are reported (using what E-value cutoff)? Approximately what fraction (you don't have to count carefully, just estimate) are described as globins and are therefore likely to be homologous to the query protein?

      For the remaining searches, we'll vary the scoring matrix. To select a different scoring matrix, expand the "Algorithm Parameters" section and choose a matrix under "Scoring Parameters".

    2. Run a BLASTp search on this same query protein using the PAM30 scoring matrix; check the box to show the results in a new window so you can compare your output to that from the previous search. How many hits do you get, at what E-value cutoff? How many are globins?

    3. Do this search one more time, using the BLOSUM45 matrix instead. How many hits do you get, at what E-value cutoff? Approximately how many are globins?

    4. Why do you think you got the results you did from varying the scoring matrix?

  5. Globins in plants: finding distant homology

    Let's continue the searches from problem 4 above.

      Using the HBB protein, NP_000509, again as your query protein, search the nr protein database, restricted to plants, using PSI-BLAST instead of BLASTp. Under "Algorithm Parameters", set the PSI-BLAST threshold to 0.075, change "Max target sequences" to 100, and set the scoring matrix to BLOSUM62 (the default). Also, check the box "Show results in a new window" so you can keep the query screen active.

      1. How many hits are below this threshold in the first round?
      2. Select all the hits below the threshold to use in PSI-BLAST, then click "GO" to run PSI-BLAST iteration 2. What is the best hit from the first round, and what is its E-value? What is its E-value in the second round?

      Go back to the query screen and start the same PSI-BLAST search over again. This time, before starting the second round, find the entry named "chitinase B [Sorghum halepense]" in the list of hits that are worse than the E-value threshold, and check its box so that this protein is included in the second PSI-BLAST round. Click "GO" to run the second round.

      1. The best hit from the first round should be the same as in question 2 of the first search, above. What is its E-value in this second round?
      2. Approximately what fraction of the proteins making the cutoff in this second round are not named as globins or likely globins? (Again, just scroll down and estimate, you don't have to count exactly.)

        Can you find some hits in this round that do not appear to have globin domains? (You can tell by clicking on the GenBank ID number for the sequence and, on the resulting page, clicking on the "Identify Conserved Domains" link in the upper right corner. This runs the domain search from the Conserved Domain Database, which compares the sequence against precomputed position-specific scoring matrices for known domains.)

  6. Comparing numbers

    Write a Python program that prompts the user to input two integers and stores them in two variables called firstnum and secondnum. Use the raw_input() function to read each value. Recall that raw_input() returns a string, so you will have to convert each input string to an integer using int().
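
    To see why the int() conversion matters, consider the minimal sketch below. The two strings are hypothetical stand-ins for what raw_input() would return, so that the example runs without typing anything.

```python
# Hypothetical strings standing in for what raw_input() would return.
first_text = "10"
second_text = "9"

# Strings are compared character by character, and "1" sorts before "9".
print(first_text < second_text)            # True

# After conversion with int(), the values compare as numbers.
print(int(first_text) < int(second_text))  # False
```

    Comparing the unconverted strings would report that 10 is less than 9, which is why the conversion has to happen before the comparison.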

    Your job is to determine whether the value of the first number is greater than, less than, or equal to the value of the second number.

    Specifically, if the first number is greater than the second, print a message saying so. If the first number is less than the second, print a message saying that. Otherwise, print a message saying that they are equal.

    You should be able to do this with a single if/elif/else block.
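
    As a reminder of the shape of an if/elif/else block, here is a sketch on an unrelated problem. The function describe_temperature and its labels are hypothetical. Exactly one of the three branches runs for any input.

```python
def describe_temperature(celsius):
    """Return a label for a water temperature (hypothetical example)."""
    if celsius <= 0:
        label = "frozen"
    elif celsius >= 100:
        label = "boiling"
    else:
        # Reached only when both tests above were False.
        label = "liquid"
    return label


print(describe_temperature(-5))   # frozen
print(describe_temperature(25))   # liquid
print(describe_temperature(100))  # boiling
```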

    Save this program with the name compare.py and upload it with your lab submission.

Going Further

  1. In Between

    Write a Python program that prompts the user to input three integers, one at a time.

    Your job is to determine whether the value of the first number is in between the values of the second and third numbers.

    Specifically, if the first number is greater than the second and less than the third, print an appropriate message. If the first number is greater than the third and less than the second, print a different message. Otherwise, print a message that says that the first number is not between the second and third numbers.

    You should be able to do this with a single if/elif/else block.
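
    Conditions that depend on two comparisons at once can be joined with "and". The sketch below shows the syntax on an unrelated check. The helper is_valid_percent is hypothetical and is not part of the assignment.

```python
def is_valid_percent(value):
    """Return True if value lies from 0 to 100 inclusive (hypothetical helper)."""
    # Both comparisons must be True for the whole condition to be True.
    if value >= 0 and value <= 100:
        return True
    else:
        return False


print(is_valid_percent(42))    # True
print(is_valid_percent(150))   # False
```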

    Save this program with the name inbetween.py and upload it with your lab submission.

  2. Odd or even?

    Write a function that takes an integer as input and returns a Boolean that is True if the integer is even and False otherwise. Recall that you know how to do integer division and use the modulo operator ("%"), so you should be able to determine evenness.
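
    As a refresher on the two operators, here is a small sketch with arbitrary numbers. It uses "//" for integer division because that operator behaves the same way in Python 2 and Python 3. In the lab's Python 2, "/" between two integers also performs integer division.

```python
# Integer division (//) keeps the whole-number quotient;
# modulo (%) keeps the remainder.
quotient = 17 // 5    # 5 goes into 17 three whole times
remainder = 17 % 5    # 17 - 3 * 5 leaves 2

print(quotient)       # 3
print(remainder)      # 2

# A remainder of 0 means the division is exact.
print(15 % 5 == 0)    # True
print(17 % 5 == 0)    # False
```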

    Now have the main program allow the user to input two strings. Compute the length of each of these strings. Without creating a concatenated string, and without adding their lengths together, write a single if/else statement (you may use "and" or "or" in its condition) to determine if the sum of the lengths of the two strings is even or odd.

    Hint: think about creating a truth table that determines the "parity" (evenness) of the sum of two numbers based on whether those two numbers are odd or even. Then implement this using an if statement.
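
    If you have not written a truth table before, the sketch below spells out the tables for Python's "and" and "or" operators. It does not give the parity table from the hint. The variable names are placeholders.

```python
# "and" is True only when both sides are True.
both_tt = True and True
both_tf = True and False
both_ff = False and False

# "or" is True when at least one side is True.
either_tt = True or True
either_tf = True or False
either_ff = False or False

print("and: %s %s %s" % (both_tt, both_tf, both_ff))       # and: True False False
print("or:  %s %s %s" % (either_tt, either_tf, either_ff))  # or:  True True False
```

    A parity table has the same layout: list every combination of the two inputs (odd or even) and write down the result for each one.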

  3. DELTA-BLAST

    Try using human gamma globin, NP_000550.2, as your query protein, against the nr database restricted to plants. Compare BLASTp to DELTA-BLAST. How many hits do you get with an E-value below 0.1 using each method? How many hits are reported in total? Explain why you are seeing the observed results.
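
    Counting hits below a cutoff by eye is easy to get wrong in a long result list. If you copy the E-values into a list, a short function can do the counting. This is only a sketch: count_hits_below is a hypothetical helper, and the hit names and E-values are invented placeholders, not real BLAST output.

```python
def count_hits_below(hits, cutoff):
    """Count (name, e_value) pairs whose E-value is strictly below cutoff."""
    count = 0
    for name, e_value in hits:
        if e_value < cutoff:
            count = count + 1
    return count


# Invented placeholder hits; real values come from your own BLAST results.
toy_hits = [
    ("hit A", 2e-30),
    ("hit B", 0.004),
    ("hit C", 0.1),
    ("hit D", 3.5),
]

print(count_hits_below(toy_hits, 0.1))   # 2
```

    The comparison is strict, so a hit with an E-value of exactly 0.1 is not counted.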