Comp 7 / Bio 40 Lab 1: BLASTing Off

September 14, 2016

Purpose of this lab

This lab has three goals. First, you will be introduced to the editor for writing Python scripts and will use it to write a short program using your understanding of variables and strings. Second, you will begin to use and interpret BLAST results, in part to investigate the origins of antibiotic resistance in human pathogens. Finally, you will become familiar with writing electronic lab notes about your computational work.


General Lab Information

You can find most of the web links you need for this lab under the "links" tab on the course web site.

What you will submit for lab assignments

In each lab, you will be asked to do some bioinformatics analysis and answer some specific questions. You may also write programs. At the end of the lab period, you should submit through Trunk, at a minimum, a document with the answers to these questions and a separate Python program file for each programming question. Make sure that both your name and your lab partner's name appear in all submitted documents.

In addition, you should create an electronic lab notebook documenting your bioinformatics analysis efforts and perhaps your program development efforts. Most people combine this notebook file with the file containing the answers to the lab questions, but you may choose to keep it separate. Discuss with your lab partner which you prefer.

You may record your lab notes and responses in any way that you choose so long as we are able to read them. A plain text file is fine: you may edit such files using Notepad++ on the lab machines. Microsoft Office is also available on the lab machines.

Students often ask how much detail to include in electronic lab notebooks. The answer is that you should include enough detail that someone else could reproduce your efforts. That said, it needn't be written expansively; sentence fragments and lists of options for online tools are fine for lab notebooks (although not for Project submissions).

When you are using online bioinformatics tools, you should document all options you chose and why you chose them. If you are programming in Python, your code may tell the whole story (if it is well written and includes comments), so you don't necessarily need to write anything in the notebook about programming. On the other hand, if you went through a process of trying different ideas in your program, you may want to document these design decisions or the process you followed.

Working in pairs

When working with a partner in lab, at any given time one person should be typing and the other should be closely observing and suggesting content. For programming in particular, it has been shown that working in pairs helps reduce frustration and identify errors more quickly. In a pedagogical setting, this works best when you swap roles frequently, so please remember to take turns. Counter-intuitively, it may also be more effective to have the person who is less confident about a particular part of the material be the one who types when you are working on that part.

Saving your files

Be sure to save your files in or below your home directory (drive Z:\ on the lab machines), not to the local desktop, to ensure that they will be saved once you log out and will be accessible from other lab machines.

When you are doing lab work with a partner, we strongly suggest that both of you save copies of your submissions. One simple way to do this would be to have the team member who is not logged in during lab be the one to submit the files via Trunk. Another would be to email a copy of the submission to that member of the team. Before leaving, or on another occasion, the recipient can log in to one of the lab machines, read the email or go to the Trunk site, and save the files to their own home directory.

Creating and editing Python programs

You will be writing and running your programs using a tool called IDLE. IDLE is simple to use, offers all the functionality we need for this course, and is available for both Mac and Windows if you want to work on your own machine.

To start IDLE on the lab machines, go to Start -> All Programs -> Python 2.7, and run "IDLE (Python GUI)". This will start IDLE in the interpreter window.

To create a new Python file, click on the File menu and choose "New File". This will pop up a new editor window with a blank Python script. To save it, click on the File menu and select "Save". Type in the filename. For example, your file might be called "SomeName.py". The ".py" extension is important. It tells you - and various computer programs - that this text file contains a Python script. We suggest that you save it under your Z: directory so that it will remain available after you log out.

The first line of each Python script should be a comment line that gives both of your names. E.g.,

# This is the solution to Lab 2, problem 3, by Joe Shmoe and Penny Python

Running and testing your Python code

For now, we will run our programs within IDLE. Make sure your cursor is in the editor window. Find the Run menu at the top of the screen, and choose "Run Module" from the Run menu pulldown. After asking if it's okay to save the program first (click OK), IDLE will run your entire script in the interpreter window, where it will ask for any input and print any output.

When you try to run your script, it may not work at all if the program's syntax is incorrect. In this case, you should see an error message with a line number, or red highlighting on the line where IDLE thinks your syntax error is located. Note that IDLE is not always accurate when reporting which line contains the error; if the reported line looks right to you, check the lines just before and after it as well.

Revise your program and continue to test it until you are satisfied with the results.


Lab problems

  1. To start, you will create a Python script from scratch. You may want to review the Codecademy lesson on Strings and Console Output for reminders about syntax. The section on String Methods may help you figure out how to get string length or convert to uppercase, and the section on Advanced Printing shows some ways of printing both strings and variables. There are several ways of solving this problem; for now, any solution is fine.

    1. Following the directions above ("Creating and editing Python programs"), create a new Python script called myname.py. In this script, you should start by assigning your first name and last name to separate variables called, respectively, firstname and lastname. For example:
      firstname = "Donna"
      lastname = "Slonim"
      

      Then, create a new variable called totalLen. Have the program use the variables you created to compute the sum of the length of your first name and the length of your last name, and store that value in the variable totalLen. For example, "Donna" has 5 letters, and "Slonim" has 6, so the value of totalLen should be 11.

      Print the value of all three variables. For example, the output in my case would look like:

      Donna
      Slonim
      11
      

      Following the instructions under "Running and testing your Python code" above, test your program until it runs and produces the correct results. Then continue to modify this same program as described in steps 2-4 below.

    2. Next, create a new variable called fullname. Have the program set its value to that of a string containing the values stored in firstname and lastname, concatenated together, with a space between them. You should use the variables you have already created to do this (rather than typing your name over again).

      Print the value of this new variable too. Run and test your code again.

    3. Finally, create a new variable called FULLNAME (note that this is not the same thing as fullname!). Set this variable to contain the contents of fullname, but converted to uppercase. Print the value of FULLNAME too.

      Again, run and test your code until it does the right thing. At this point, the output of your program (which now prints the values of all five variables) might look like the following:

      Donna
      Slonim
      11
      Donna Slonim
      DONNA SLONIM
      
    4. Recall that you've already learned about several data types in Python, including string, int, and float. Numbers written with a decimal point are considered floats (floating-point numbers), even if the value after the decimal point is zero. You can make an integer into a floating-point number using the built-in float() function. For this part of the problem, you will need to deal with the different data types explicitly to perform a mathematical operation.

      You want to compute what percentage of the total length of your first and last names is accounted for by your first name. Use variables that hold the length of each and figure out what mathematical operations you need to do to compute the percentage. For example, in my case, my 5-letter first name is approximately 45.45% of the total length of 11 letters. (Python will print more digits after the decimal point; we'll learn later how to get exactly the desired amount of precision.) Print this percentage out after the other variables. For example:

      Donna
      Slonim
      11
      Donna Slonim
      DONNA SLONIM
      45.454545454545453
      

      Submit this program with your writeup. In your writeup, document any issues you had with writing the code, and explain whether you got the answer you were expecting when computing the percentage. If not, why not? Explain what you had to do to get the correct result.
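The string and number operations that problem 1 relies on can be sketched on unrelated data. This is not a solution to the problem: the variable names and the organism name below are made up for illustration, and print is written with parentheses around a single value so that the lines run the same way under Python 2.7 and Python 3.

```python
# Illustrative values only; these names are not part of the lab.
genus = "Escherichia"
species = "coli"

# len() returns the number of characters in a string.
total_len = len(genus) + len(species)

# "+" concatenates strings; the " " puts a space between the two parts.
binomial = genus + " " + species

# upper() returns an uppercase copy and leaves the original unchanged.
shouted = binomial.upper()

# float() converts an int to a float before the division is carried out.
percent_genus = float(len(genus)) / total_len * 100

print(total_len)
print(binomial)
print(shouted)
print(percent_genus)
```

Try removing the float() call and running the lines again in IDLE to see whether the last value changes.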

  2. Next, you will run some basic BLAST queries.

    Download the file Lab1.unknownseq.fasta from the online version of this lab. This is a text file containing the sequence of a known and well-characterized protein in FASTA format. You will use BLAST to figure out what protein it is.

    Run a protein BLAST search against the "Reference proteins" (RefSeq) database to identify this sequence. Recall from your reading in Chapter 2 that RefSeq is a database that provides canonical representative sequences for each transcript and protein product, and that mRNA sequences have RefSeq IDs starting with "NM" (e.g., NM_000518.4), while protein sequences have RefSeq IDs starting with "NP" (e.g., NP_000509.1). Use the defaults for all other options.

    (In case NCBI BLAST is not working, choose the EBI BLAST server link from the Links page, select the Protein link under the header "NCBI BLAST," and compare to the "uniprotkb_reference_proteomes" database. To see the alignments, you will then need to look at the box marked "Apply to selection:" to the left of the summary table, and click the "Show" button under the heading "Alignments.")

    In your lab writeup, document any choices you needed to make, and answer the following questions:

    1. What is the RefSeq ID of the unknown sequence?
    2. What is the name of the corresponding protein?
    3. What organism do you think it comes from, and why?
    4. How do you know that you have found the correct sequence?
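The RefSeq naming convention recalled in problem 2 (IDs starting with "NM" for mRNA, "NP" for protein) can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and it recognizes only the two prefixes mentioned in this lab.

```python
def refseq_molecule_type(accession):
    """Return 'mRNA', 'protein', or 'other' based on the RefSeq ID prefix."""
    # The prefix is the part of the ID before the underscore.
    prefix = accession.split("_")[0]
    if prefix == "NM":
        return "mRNA"
    if prefix == "NP":
        return "protein"
    # Any prefix not discussed in this lab falls through to here.
    return "other"

# The two example IDs given in the problem text.
print(refseq_molecule_type("NM_000518.4"))
print(refseq_molecule_type("NP_000509.1"))
```

The two calls print "mRNA" and "protein", respectively.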

  3. Next, we will look at a gene that confers resistance to erythromycin in Streptococcus agalactiae, a bacterial pathogen first identified in cattle. There are two possible routes by which bacterial antibiotic resistance might evolve: through vertical inheritance during the normal course of evolution (albeit perhaps under strong selective pressure due to antibiotic exposure), or through horizontal gene transfer (HGT). One possible way to determine if there is HGT of antibiotic resistance genes to human pathogens from agricultural use of antibiotics would be to perform a BLAST search, looking for distantly related bacteria that nonetheless have the same genes conferring drug resistance and that are human pathogens.

    Download the file Lab1.resistgene.fasta from the online version of this lab. This is the ermB gene, which confers erythromycin resistance in Streptococcus agalactiae.

    Start a new nucleotide BLAST search. Use the nr/nt database, limit the organism to bacteria, and check the Exclude box (the one on the line below the organism line) to exclude sequences from uncultured/environmental samples. (These are bacterial sequences from environmental samples where we don't necessarily know what organisms they came from; such hits won't help us address our question, so we are excluding them.) Use the defaults for all other options.

    Answer the following questions:

    1. Look at the identity and coverage for the BLAST results. Has BLAST found highly similar sequences to your query? Do the results suggest you have found the same gene in different organisms? Explain your reasoning.

    2. Are the similar sequences found in distantly related bacteria, or only in closely related bacteria? Are any of these bacteria known to be found in humans? To answer these questions, you may want to refine your BLAST search. Try excluding Gram-positive bacteria (which include Streptococcus agalactiae). You could also limit your search to specific distantly-related bacterial genera that include human pathogens, such as Bacteroides or Escherichia. You can find a recent phylogenetic tree of 17 bacterial phyla here (the tree in part b of the figure will suffice for giving you a sense of how the different phyla are related to each other). Escherichia are from the phylum Proteobacteria; Bacteroides are from the phylum Bacteroidetes, and both Staphylococcus and Streptococcus belong to the phylum Firmicutes.

    3. Describe what you did to answer this question. Do your results support the hypothesis that there is horizontal transfer of antibiotic resistance genes from animal pathogens to human pathogens? Explain your reasoning in a couple of sentences. What else might you like to know to further investigate this possibility?
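When you interpret the identity and coverage values in problem 3, it can help to see how such percentages are computed for a single alignment. The sketch below is a simplification with hypothetical function names and made-up numbers; BLAST reports these values for you, and its coverage figure can combine more than one aligned segment.

```python
def percent_identity(identical_positions, alignment_length):
    """Percentage of aligned positions where query and hit are identical."""
    return 100.0 * identical_positions / alignment_length

def query_coverage(query_start, query_end, query_length):
    """Percentage of the query that falls inside one aligned segment."""
    # Positions are 1-based and inclusive, hence the + 1.
    aligned = query_end - query_start + 1
    return 100.0 * aligned / query_length

# Made-up alignment: 950 identical positions in a 1000-column alignment,
# covering query positions 101 through 900 of a 1000-base query.
print(percent_identity(950, 1000))
print(query_coverage(101, 900, 1000))
```

With these made-up numbers the identity is 95.0 and the coverage is 80.0, which shows that a hit can be highly similar while still aligning to only part of the query.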

Submitting your work:

First, if you are working with a partner, please be sure that both of your names are in all the files you intend to upload.

You will submit your work on Trunk.

Log in to Trunk if you haven't yet done so. Select the Assignments tab, and select Lab 1. You should be able to submit from this page. Simply attach all of your files, including your lab notes, and submit. You should receive email confirmation of your successful submission.


Going Further:

  1. This problem, inspired in part by one of the lab exercises in your textbook, helps you gain familiarity with finding information in the NCBI databases.
    1. What is the longest human protein? There are many ways to answer this question, but one way is to go to the NCBI Protein database (start from the class Links page) and search for a range of sequence lengths. For example, type 30000:40000[Sequence Length] into the search window. Then, either use the filter on the left sidebar to limit species to Homo sapiens, or click the Advanced link just under the search bar and build a query that does the same thing. How many entries are there in this length range, in human, and to what protein do they correspond?
    2. Go to the NCBI Gene page, and look up this protein by name. What is its official gene symbol? What are two other names it is known by? What chromosome is it on?
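If you want to try other length ranges in part 1, the search text can be assembled with string formatting. This is a sketch only: the function name is hypothetical, and the [Organism] field tag is an assumption about what the Advanced query builder produces, so compare the result with the query that NCBI actually shows you.

```python
def build_length_query(min_len, max_len, organism):
    """Build a search string for proteins in a length range from one organism."""
    # The range syntax and [Sequence Length] tag come from the problem text.
    length_part = "%d:%d[Sequence Length]" % (min_len, max_len)
    # Assumed field tag for restricting the search to one species.
    organism_part = '"%s"[Organism]' % organism
    return length_part + " AND " + organism_part

print(build_length_query(30000, 40000, "Homo sapiens"))
```

This prints 30000:40000[Sequence Length] AND "Homo sapiens"[Organism], which you can paste into the search window.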

  2. In problem 2, think about why we suggested that you BLAST against the RefSeq database to try to identify this sequence. Explain how you would expect the results to have been different if you chose to BLAST against the pdb database instead. Would it have made a difference in your ability to identify this particular sequence? Could you have known this before you identified the sequence?

  3. You can query a protein database with the same DNA sequence we used in problem 3. To do this, we need to run a blastx search. Go back to the main BLAST page, and select blastx instead of protein or nucleotide BLAST. Use the same query sequence as in problem 3, and search the nr database, restricted to bacteria and again checking the box to exclude uncultured/environmental sample sequences.

    Run the query, and look at some of the lower-scoring alignments shown. Are these proteins homologous to the query sequence? Do you think they are close enough to have been horizontally transferred?

  4. Remember that you can test out small amounts of code in the Python interpreter window instead of writing a script. Try the following:
    >>> i = 2
    >>> city = "Boston"
    >>> city
    >>> len(city)+i
    

    What output do you see if you type the same code into a new Python program file, save it, and run it? What would you have to change about your program to see the same output that you saw in the interpreter window?

  5. To read input from the screen, you can use the "raw_input" function. For example, type the following into the interpreter window:
    >>> dog = raw_input("Please type in your dog's name:  ")
    
    Type "Snoopy" at the prompt. If you then type "print dog" into the interpreter window, you will see that the variable dog now holds the string "Snoopy". (More documentation on the raw_input function appears here.)

    Modify your script myname.py to ask users to input their first and last names. Specifically, have the program print a message like "please enter your first name:", read the input from the user, and store it in the firstname variable. Do the same for the last name. Then print the same output: the sum of the string lengths, the concatenated name, and the uppercase name.

  6. Extend your program above to also print the initials corresponding to the name, e.g.:
    Initials are DS
    
    To do this, recall that you can access the letters of a string by putting the index of the position you want in square brackets. For example, if the variable dog holds the value "Snoopy", then
    print dog[1]
    
    prints the letter 'n', because computers count starting at zero.
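Zero-based indexing is easier to see with a few positions side by side. A short sketch using the same "Snoopy" example (the extra variable names are only for illustration):

```python
dog = "Snoopy"

first_letter = dog[0]             # index 0 is the first character
second_letter = dog[1]            # index 1 is the second character
last_letter = dog[len(dog) - 1]   # the last valid index is the length minus one

print(first_letter)
print(second_letter)
print(last_letter)
```

This prints S, n, and y on separate lines. Asking for dog[6] would raise an IndexError, because the six letters occupy positions 0 through 5.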

  7. There are several ways to print out a string with a variable at the end of it. Try editing your program from problem 1 to use each of the following: "+", ",", and string formatting with % (from screens 14 and 15 in Codecademy's Strings and Console Output section). Use these to print messages that look more like "The total length of the two names is 11", rather than just printing a single variable on each line as in the problem 1 sample output.
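The three printing styles named in problem 7 can be compared on a made-up example. The variable names and the message are hypothetical, and the code is written so that it runs under Python 3; the comments note the one place where Python 2.7, as installed on the lab machines, needs a different form.

```python
seq_length = 42  # hypothetical value, used only for illustration

# 1. "+" joins strings only, so the number is converted with str() first.
msg_plus = "The sequence length is " + str(seq_length)
print(msg_plus)

# 2. "," hands print two separate items; print puts a space between them.
#    Python 3 form shown. In Python 2.7, drop the parentheses:
#        print "The sequence length is", seq_length
print("The sequence length is", seq_length)

# 3. "%" substitutes the value for the %d placeholder inside the string.
msg_format = "The sequence length is %d" % seq_length
print(msg_format)
```

All three approaches produce the line "The sequence length is 42". The "+" form fails with a TypeError if you forget str(), which is a common first mistake.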

  8. Write a Python script that computes your age in days but is more accurate than what we did in class. Determine your age in years, months, and days, then use division and fractions to determine how many days old you are. (E.g., 19 years, 4 months, 3 days means you could compute (19.33 * (days per year)) + 3). How else could you make this more accurate? Could you account for leap years? For the numbers of days in given months? If you still have time left in lab, go to town on this!
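One possible way to handle leap years and month lengths, once you have tried the arithmetic yourself, is to let Python's standard datetime module do the calendar work. This is a sketch under assumptions: the function name and both dates are hypothetical, and it counts whole days only.

```python
from datetime import date

def age_in_days(birth_date, today):
    """Return the number of whole days between two calendar dates."""
    # Subtracting two date objects gives a timedelta; .days is its day count.
    return (today - birth_date).days

# Hypothetical dates chosen only for illustration.
birthday = date(1997, 5, 11)
lab_day = date(2016, 9, 14)

print(age_in_days(birthday, lab_day))
```

These two dates are 19 years, 4 months, and 3 days apart, and the script prints 7066. The estimate (19.33 * 365) + 3 gives about 7058, so the shortcut is off by roughly a week here.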