This lab has three goals. First, you will be introduced to the editor for writing Python scripts and will use it to write a short program using your understanding of variables and strings. Second, you will begin to use online sequence data resources and interpret BLAST alignment results, in part to investigate the origins of antibiotic resistance in human pathogens. Finally, you will become familiar with writing and submitting electronic lab notes about your computational work.
To start, you will create a Python script from scratch. You may want to review the Codecademy lesson on Strings and Console Output for reminders about syntax. The section on String Methods (screens 5-9) may help you figure out how to get string length or convert to uppercase, and screens 10-15 show some ways of printing both strings and variables. There are several ways of solving this problem; for now, any solution is fine.
For this problem, you will work with the recognition sequences of two restriction enzymes, EcoRI and NotI. EcoRI (pronounced "eeko-are-one") is a restriction enzyme derived from E. coli. It recognizes and cuts DNA at the sequence "GAATTC." Specifically, consider that sequence and the complementary sequence on the reverse strand:
G A A T T C
C T T A A G
Note that this sequence, like many restriction enzyme recognition sequences, is a DNA palindrome: it reads the same from the 5' to the 3' end on both the forward and reverse strands. EcoRI produces DNA fragments with "sticky" ends, which means that the enzyme cuts the two DNA strands at different places, allowing each resulting fragment to bind to another fragment cut the same way. For EcoRI, the forward strand is cut after the "G", and the reverse strand is likewise cut after its "G" when read from 5' to 3', as shown by the backslashes in the sequence below:
G\A A T T C
C T T A A\G
Print these two sequences to the console.
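The palindrome property described above can be checked in a few lines. This is only an illustrative sketch, not part of the required solution: the names COMPLEMENT, reverse_complement, and is_dna_palindrome are made up for this example, and it assumes the sequence contains only uppercase A, C, G, and T.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse strand of seq, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_dna_palindrome(seq):
    """True if seq reads the same 5' to 3' on both strands."""
    return seq == reverse_complement(seq)


print(reverse_complement("GAATTC"))   # GAATTC
print(is_dna_palindrome("GAATTC"))    # True
print(is_dna_palindrome("GATTACA"))   # False
```

The same check returns True for the NotI site GCGGCCGC.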
Following the general lab instructions under the heading "Running your work", save your program, then test (run) your program until it runs and produces the correct results. Then continue to modify this same program as described in steps ii-iv, below.
print(EcoRI) should print out the string "GAATTC".
First, add a comment (using the "#" character) to indicate that the code below this point is addressing part ii of problem 1.
Next, using the variables EcoRI and NotI, have your program additionally print the phrases "EcoRI recognizes the sequence GAATTC" and "NotI recognizes the sequence GCGGCCGC". (You can leave the original statements from the first part of the program intact and just add these on.)
Create a new variable called total_re_length. Using the variables EcoRI and NotI, compute the sum of the lengths of the two restriction enzyme recognition sites, and store this value in total_re_length. Print "The total length of the RE sequences is ", followed by the value of this variable.
Hint: remember that your string printing methods may be expecting everything they print to be a string (depending on how you print this). What is the type of the value stored in total_re_length? You may want to use the str() function here (see Codecademy "Strings and Console Output" lesson 13).
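To illustrate the hint with unrelated data, here is a small sketch of what happens when a number is joined to a string. The variable names codon, codon_count, total, and message are invented for this example; they are not the variables the problem asks for.

```python
codon = "ATG"          # a string
codon_count = 3        # an integer

# len() returns an integer, so the sum below is also an integer.
total = len(codon) + codon_count
print(type(total))

# "The total is " + total would raise a TypeError; str() converts the integer first.
message = "The total is " + str(total)
print(message)         # The total is 6
```

Concatenation with + is only one option; other printing approaches covered in the Codecademy lesson would work as well.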
Finally, have your code print the sentences: "The forward strand of the fragments produced by EcoRI ends in ", followed by the contents of ecor1_end1, and "The forward strand of the fragments produced by NotI ends in ", followed by the contents of not1_end1.
At this point, your program output should look like the following:
gaattc
gcggccgc
EcoRI recognizes the sequence GAATTC
NotI recognizes the sequence GCGGCCGC
The total length of the RE sequences is 14
The forward strand of the fragments produced by EcoRI ends in G
The forward strand of the fragments produced by NotI ends in GC
Put your names in a comment line at the top of the program, just in case, and submit this program under the Gradescope link "Lab 1 problem 1 code". Make sure the file is named rest_enzymes.py, because the autograder looks for a file with that name. In your written lab document for the rest of the assignment, document any issues you had with writing the code.
Your written answers to each of the next three questions should include the equivalent of an electronic lab notebook documenting your bioinformatics analysis efforts. (See writeup instructions.) Download the file Lab1.unknownseq.fasta from the online version of this lab. This is a text file containing the sequence of a known and well-characterized protein in FASTA format. You will use BLAST to figure out what protein it is.
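If you are curious what a FASTA file looks like to a program, the sketch below reads one record: a header line beginning with ">", followed by one or more lines of sequence. It is not needed for the BLAST search, which accepts the file contents directly. The function name read_fasta and the example record are hypothetical; the record is made up and is not the contents of Lab1.unknownseq.fasta.

```python
def read_fasta(lines):
    """Return (header, sequence) for the first record in an iterable of lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:
                break                     # stop at the start of a second record
            header = line[1:]             # drop the leading ">"
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    return header, "".join(chunks)


# Hypothetical example record, for illustration only.
example = [">example_protein made-up description", "MKTAYIAK", "QRQISFVK"]
header, sequence = read_fasta(example)
print(header)
print(len(sequence))   # 16
```

Because an open file is also an iterable of lines, the same function could be called on a file handle, for example one opened with open("Lab1.unknownseq.fasta"), assuming the file is in your working directory.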
Run a protein BLAST search against the "Reference proteins" (RefSeq) database to identify this sequence. Recall from your reading in Chapter 2 that RefSeq is a database that provides canonical representative sequences for each transcript and protein product, and that mRNA sequences have RefSeq IDs starting with "NM" (e.g., NM_000518.4), while protein sequences have RefSeq IDs starting with "NP" (e.g., NP_000509.1).
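The prefix convention above is easy to express in code. This sketch covers only the two prefixes mentioned in the text; RefSeq defines additional prefixes that it deliberately ignores, and the names REFSEQ_PREFIXES and refseq_molecule_type are invented for illustration.

```python
# Prefixes named in the text above; other RefSeq prefixes are reported as unknown.
REFSEQ_PREFIXES = {"NM": "mRNA", "NP": "protein"}


def refseq_molecule_type(accession):
    """Return 'mRNA', 'protein', or 'unknown' based on the accession prefix."""
    prefix = accession.split("_")[0]
    return REFSEQ_PREFIXES.get(prefix, "unknown")


print(refseq_molecule_type("NM_000518.4"))   # mRNA
print(refseq_molecule_type("NP_000509.1"))   # protein
```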
Under Organism, exclude "Bacteria (taxid:2)" (start typing and it will allow you to choose completions from a pulldown menu). Be sure to check the "exclude" box. Doing this will make your search a little faster.
Use the defaults for all other options.
In your lab writeup, document any choices you needed to make to perform this search, and answer the following questions:
Download the file Lab1.resistgene.fasta from the online version of this lab. This is the ermB gene, which confers erythromycin resistance in Streptococcus agalactiae.
Start a new nucleotide BLAST search. Use the nr/nt database, limit the organism to bacteria, and check the Exclude box (the one on the line below the organism line) to exclude sequences from uncultured/environmental samples. (These are bacterial sequences from environmental samples where we don't necessarily know what organisms they came from; such hits won't help us address our question, so we are excluding them.) Use the defaults for all other options.
Document your searches in your lab writeup, and use the results to answer the following questions (also in your writeup):
You can find a recent phylogenetic tree of 17 bacterial phyla here (the tree in part b of the figure will suffice for giving you a sense of how the different phyla are related to each other). Escherichia are from the phylum Proteobacteria; Bacteroides are from the phylum Bacteroidetes, and both Staphylococcus and Streptococcus belong to the phylum Firmicutes.
To answer these questions, you may want to refine your BLAST search. Try excluding Gram-positive bacteria (which include Streptococcus agalactiae). You could also limit your search to specific distantly-related bacterial genera that include human pathogens, such as Bacteroides or Escherichia.
Describe, in your writeup, what you did to answer this question. Recall that you want to include enough detail about options you chose that you could replicate your work a month from now (or that someone else could).
Run the query, and look at some of the lower-scoring alignments shown. Are these proteins homologous to the query sequence? Do you think they are close enough to have been horizontally transferred?
>>> i = 2
>>> city = "Boston"
>>> city
>>> len(city)+i
What output do you see if you type the same code into a new Python program file, save it, and run it? What would you have to change about your program to see the same output that you saw in the interpreter window? Answer these questions in your lab writeup document.
For the next few problems, you will write additional Python programs. Save these separately from your solution to problem 1, and upload them separately under the appropriate question numbers (as optional uploads) in Gradescope.
>>> dog = raw_input("Please type in your dog's name: ")
Type "Snoopy" at the prompt. If you then type "print dog" into the interpreter window, you will see that the variable dog now holds the string "Snoopy". (More documentation on the raw_input function appears here.)
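The raw_input function exists only in Python 2; in Python 3 it was renamed to input. If you are unsure which version you are running, a sketch like the following works in both. The names read_line and ask_name are hypothetical helpers introduced for this example, and the stand-in reader simply imitates a user typing at the keyboard.

```python
try:
    read_line = raw_input      # Python 2
except NameError:
    read_line = input          # Python 3: raw_input was renamed to input


def ask_name(prompt, reader=read_line):
    """Prompt for a line of text and return it with surrounding spaces removed."""
    return reader(prompt).strip()


# Passing a stand-in reader shows the behavior without waiting for the keyboard.
dog = ask_name("Please type in your dog's name: ", reader=lambda prompt: " Snoopy ")
print(dog)   # Snoopy
```

Calling ask_name with only the prompt argument reads from the keyboard as usual.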
Make a new copy of your script rest_enzymes.py, and modify this copy to ask users to input the restriction enzyme recognition sequence. Specifically, have the program print a message like "please enter your sequence:", read the input from the user, and store it in a variable. Type in a lowercase sequence and have the program convert it to uppercase, print it out, and determine its length.
How else could you make this more accurate? Could you account for leap years? For the numbers of days in given months? If you still have time left in lab, go to town on this!
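As a starting point for the two refinements mentioned above, here is a minimal sketch, independent of how your own program is structured. The function names is_leap_year and days_in_month are made up for this example; calendar.monthrange is part of the Python standard library.

```python
import calendar


def is_leap_year(year):
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_month(year, month):
    """Number of days in the given month (1-12) of the given year."""
    # monthrange returns (weekday of the first day, number of days in the month).
    return calendar.monthrange(year, month)[1]


print(is_leap_year(2000))        # True
print(is_leap_year(1900))        # False
print(days_in_month(2024, 2))    # 29
print(days_in_month(2023, 2))    # 28
```

The standard library also provides calendar.isleap, which applies the same rule as the hand-written function.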