Comp 7 / Bio 40 Lab 1: BLASTing Off

September 10, 2019

Purpose of this lab

This lab has three goals. First, you will be introduced to the editor for writing Python scripts and will use it to write a short program using your understanding of variables and strings. Second, you will begin to use online sequence data resources and interpret BLAST alignment results, in part to investigate the origins of antibiotic resistance in human pathogens. Finally, you will become familiar with writing and submitting electronic lab notes about your computational work.


General Lab Information

Instructions on what to write and how to submit via Gradescope are available at General lab information.

Lab problems

  1. To start, you will create a Python script from scratch. You may want to review the Codecademy lesson on Strings and Console Output for reminders about syntax. The section on String Methods (screens 5-9) may help you figure out how to get string length or convert to uppercase, and screens 10-15 show some ways of printing both strings and variables. There are several ways of solving this problem; for now, any solution is fine.

    For this problem, you will work with the recognition sequences of two restriction enzymes, EcoRI and NotI. EcoRI (pronounced "eeko-are-one") is a restriction enzyme derived from E. coli. It recognizes and cuts DNA at the sequence "GAATTC." Specifically, consider that sequence and the complementary sequence on the reverse strand:

    G A A T T C
    C T T A A G
    
    Note that this sequence, like many restriction enzyme recognition sequences, is a DNA palindrome; the sequences read the same, going from the 5' to 3' end on both the forward and reverse strands. EcoRI produces DNA fragments with "sticky" ends, which means that the enzyme cuts the two DNA strands at different places, allowing the resulting DNA fragments to then bind to another fragment cut the same way. For EcoRI, the forward strand is cut after the "G", while the reverse strand is cut after the G on the reverse strand when read from 5' to 3', as shown by the slashes in the sequence below:
    G\A A T T C
    C T T A A\G
    

    Writing your program:

    1. Following the directions in the general lab info link under the heading "Creating and editing Python programs," create a new Python script called rest_enzymes.py. Start your script by assigning the string "gaattc" to a variable named EcoRI and the string "gcggccgc" to a variable named NotI. (The enzyme NotI, called "not one", is derived from the bacterium N. otitidis.)

      Print these two sequences to the console.

      Following the general lab instructions under the heading "Running your work", save your program, then test (run) your program until it runs and produces the correct results. Then continue to modify this same program as described in steps ii-iv, below.

    2. Next, we'll use the built-in string methods and functions we learned about in "Strings and Console Output" to create uppercase versions of the two restriction enzyme recognition sequences, and store each in the variable it came from. For example, when you are done, the command
      print(EcoRI)
      
      should print out the string "GAATTC".

      First, add a comment (using the "#" character) to indicate that the code below this point is addressing part ii of problem 1.

      Next, using the variables EcoRI and NotI, have your program additionally print the phrases "EcoRI recognizes the sequence GAATTC" and "NotI recognizes the sequence GCGGCCGC". (You can leave the original statements from the first part of the program intact and just add these on.)

    3. Now, add a comment saying that this is part iii.

      Create a new variable called total_re_length. Using the variables EcoRI and NotI, compute the sum of the length of the two restriction enzyme recognition sites, and store this value in the variable total_re_length. Print "The total length of the RE sequences is ", followed by the value of this variable.

      Hint: remember that your string printing methods may be expecting everything they print to be a string (depending on how you print this). What is the type of the value stored in total_re_length? You may want to use the str() function here (see Codecademy "Strings and Console Output" lesson 13).

    4. Finally, we are going to look at the sequences in the sticky ends of the cut fragments. Given that EcoRI cuts after the first letter of the sequence, create a variable called ecor1_end1 that contains just that letter (extracted from the EcoRI variable you already have). The NotI enzyme cuts after the first two letters of the sequence. Create a variable called not1_end1 that contains a string comprised of these first two letters, extracted from the sequence in NotI. (By that, we mean that if the value of NotI were changed to a different sequence, and you ran your code again, these sequences would change automatically as well.)

      Finally, have your code print the sentences: "The forward strand of the fragments produced by EcoRI ends in ", followed by the contents of ecor1_end1, and "The forward strand of the fragments produced by NotI ends in ", followed by the contents of not1_end1.

      At this point, your program output should look like the following:

      gaattc
      gcggccgc
      EcoRI recognizes the sequence GAATTC
      NotI recognizes the sequence GCGGCCGC
      The total length of the RE sequences is 14
      The forward strand of the fragments produced by EcoRI ends in G
      The forward strand of the fragments produced by NotI ends in GC
      

      Put your names in a comment line at the top of the program, just in case, and Submit this program under the Gradescope link "Lab 1 problem 1 code". Make sure it is named rest_enzymes.py, because the autograder is looking for a file with that name. In your written lab document for the rest of the assignment, document any issues you had with writing the code.

  2. Next, you will run some basic BLAST queries.

    Your written answers to each of the next three questions should include the equivalent of an electronic lab notebook documenting your bioinformatics analysis efforts. (See writeup instructions.) Download the file Lab1.unknownseq.fasta from the online version of this lab. This is a text file containing the sequence of a known and well-characterized protein in FASTA format. You will use BLAST to figure out what protein it is.

    Run a protein BLAST search against the "Reference proteins" (RefSeq) database to identify this sequence. Recall from your reading in Chapter 2 that RefSeq is a database that provides canonical representative sequences for each transcript and protein product, and that mRNA sequences have RefSeq IDs starting with "NM" (e.g., NM_000518.4), while protein sequences have RefSeq IDs starting with "NP" (e.g., NP_000509.1).

    Under Organism, exclude "Bacteria (taxid:2)" (start typing and it will allow you to choose completions from a pulldown menu). Be sure to check the "exclude" box. Doing this will make your search a little faster.

    Use the defaults for all other options.

    In your lab writeup, document any choices you needed to make to perform this search, and answer the following questions:

    1. What is the RefSeq ID of the unknown sequence?
    2. What is the name of the corresponding protein?
    3. What organism do you think it comes from, and why?
    4. How do you know that you have found the correct sequence?

  3. Next, we will look at a gene that confers resistance to erythromycin in Streptococcus agalactiae, a bacterial pathogen first identified in cattle. There are two possible routes by which bacterial antibiotic resistance might evolve: through vertical inheritance during the normal course of evolution (albeit perhaps under strong selective pressure due to antibiotic exposure), or through horizontal gene transfer (HGT). One possible way to determine if there is HGT of antibiotic resistance genes to human pathogens from agricultural use of antibiotics would be to perform a BLAST search, looking for distantly related bacteria that nonetheless have the same genes conferring drug resistance and that are human pathogens.

    Download the file Lab1.resistgene.fasta from the online version of this lab. This is the ermB gene, which confers erythromycin resistance in Streptococcus agalactiae.

    Start a new nucleotide BLAST search. Use the nr/nt database, limit the organism to bacteria, and check the Exclude box (the one on the line below the organism line) to exclude sequences from uncultured/environmental samples. (These are bacterial sequences from environmental samples where we don't necessarily know what organisms they came from; such hits won't help us address our question, so we are excluding them.) Use the defaults for all other options.

    Document your searches in your lab writeup, and use the results to answer the following questions (also in your writeup):

    1. Look at the identity and coverage for the BLAST results. Has BLAST found highly similar sequences to your query? Do the results suggest you have found the same gene in different organisms? Explain your reasoning.

    2. Are the similar sequences found in distantly related bacteria, or only in closely related bacteria? Are any of these bacteria known to be found in humans?

      You can find a recent phylogenetic tree of 17 bacterial phyla here (the tree in part b of the figure will suffice for giving you a sense of how the different phyla are related to each other). Escherichia are from the phylum Proteobacteria; Bacteroides are from the phylum Bacteroidetes, and both Staphylococcus and Streptococcus belong to the phylum Firmicutes.

      To answer these questions, you may want to refine your BLAST search. Try excluding Gram-positive bacteria (which include Streptococcus agalactiae). You could also limit your search to specific distantly-related bacterial genera that include human pathogens, such as Bacteroides or Escherichia.

      Describe, in your writeup, what you did to answer this question. Recall that you want to include enough detail about options you chose that you could replicate your work a month from now (or that someone else could).

    3. Do your results support the hypothesis that there is horizontal transfer of antibiotic resistance genes to from animal pathogens to human pathogens? Explain your reasoning in a couple of sentences. What else might you like to know to further investigate this possibility?

  4. What is the longest human protein? (This question, inspired in part by one of the lab exercises in your textbook, helps you gain familiarity with finding information in the NCBI databases.) Again, answer these questions by documenting your analysis in your writeup and then including a separate section with explicit answers.

    1. There are many ways to answer this question, but one way is to go to the NCBI Protein database (start from the class Links page) and search for a range of sequence lengths. For example, type 30000:40000[Sequence Length] into the search window. Then, either use the filter on the left sidebar to limit species to Homo sapiens, or click the Advanced link just under the search bar and build a query that does the same thing. How many entries are there in this length range, in human, and to what protein do they correspond?

    2. Go to the NCBI Gene page, and look up this protein by name. What is its official gene symbol? What are two other names it is known by? What chromosome is it on?

    Going Further:

    1. You can query a protein database with the same DNA sequence we used in problem 3. To do this, we need to do a BlastX search. Go back to the main BLAST page, and select blastx instead of protein or nucleotide blast. Use the same query sequence as in problem 3, and search the nr database, restricted to bacteria on the Organism line. This time, add a second Organism line excluding gram-positive bacteria, and again check the box below that to exclude uncultured/environmental sample sequences.

      Run the query, and look at some of the lower-scoring alignments shown. Are these proteins homologous to the query sequence? Do you think they are close enough to have been horizontally transferred?

    2. In problem 2, think about why we suggested that you BLAST against the RefSeq database to try to identify this sequence. Explain how you would expect the results to have been different if you chose to BLAST against the pdb database instead. (This database is described in Pevsner, but you can also find a brief description on the BLAST site.) Would it have made a difference in your ability to identify this particular sequence? Could you have known this before you identified the sequence?

    3. Remember that you can test out small amounts of code in the Python interpreter window instead of writing a script. Try the following:
      >>> i = 2
      >>> city = "Boston"
      >>> city
      >>> len(city)+i
      

      What output do you see if you type the same code into a new python program file, save it, and run it? What would you have to change about your program to see the same output that you saw in the interpreter window? Answer these questions in your lab writeup document.


      For the next few problems, you will write additional python programs. Save these separately from your solution to problem 1, and upload them separately under the appropriate question numbers (as optional uploads) in Gradescope.

    4. To read input from the screen, you can use the "raw_input" function. For example, type the following into the interpreter window:
      >>> dog = raw_input("Please type in your dog's name:  ")
      
      Type "Snoopy" at the prompt. If you then type "print dog" into the interpreter window, you will see that the variable dog now holds the string "Snoopy". (More documentation on the raw_input function appears here.)

      Make a new copy of your script rest_enzymes.py, and modify this copy to ask users to input the restriction enzyme recognition sequence. Specifically, have the program print a message like "please enter your sequence:", read the input from the user, and store it in a variable. Type in a lowercase sequence and have the program convert it to uppercase, print it out, and determine its length.

    5. There are several ways to print out a string with a variable at the end of it. Try editing your program rest_enzymes.py to use all of the following in the program from Problem 1: "+", ",", and string formatting with % (from screens 14 and 15 in CodeAcademy's Strings and Console Output section). Use these to print messages like "The total length of the RE sequences is 14" in different ways.

    6. Write a Python script that computes your age in days but is more accurate than what we did in class. Determine your age in years, months, and days, then use division and fractions to determine how many days old you are. (E.g., 19 years, 4 months, 3 days means you could compute (19.33 * (days per year)) + 3). Be careful about integer division in Python 2!

      How else could you make this more accurate? Could you account for leap years? For the numbers of days in given months? If you still have time left in lab, go to town on this!