Comp 7 / Bio 40 Lab 9
The Zoo Inside You

November 30, 2016

Purpose of this lab

To help students become familiar with functional analysis of metagenomics sequence data and tree inference based on sample characteristics.


Metagenomics is the study of the community of microbes found in environmental samples. Metagenomics is often used to study communities of microbes living in different areas of the body. In the metagenomics project, we are investigating the community of microbes in the retroauricular crease and the hard palate.

There are two main types of metagenomics experiments that investigators run. These are 16S and WGS (whole genome shotgun sequencing). To better understand these types of experiments, imagine that you are a zookeeper, and you need to determine the composition of your zoo.

16S Sequencing

As zookeeper, you decide that the best way to catalogue the animals in your zoo is by going to each habitat and making a list of all the animals present. Each animal has a tag on its collar that indicates its exact species. You walk through your zoo and tabulate that you have 12 ostriches, 23 snakes, 4 rhinoceroses, 33 lions, 40 parrots, and an incredible 97 turtles. Now you know the composition of your zoo.

16S Sequencing works in a similar fashion. A unique region of each microbe's ribosomal RNA (rRNA) is sequenced. This segment of the rRNA is different in every microbe so it functions like the tag on an animal's collar. Each piece of rRNA sequenced came from a single microbe, and a computer program can be used to determine which microbe each piece of rRNA came from. This is the type of sequencing that was performed on your samples.

WGS Sequencing

As zookeeper, you decide that you aren't interested in just determining the species present in your zoo. In any case, you don't really care about which exact animals you have. What you really care about is the feel of your zoo. You want your customers to see bright colors, sharp teeth, and flippers. So, you walk around your zoo writing down that you saw 687 razor sharp teeth, 33 flowing manes, 4,000 magnificent azure feathers. After accounting for your zoo in this manner, you don't know which species are present, but you understand the types of things those animals can do. For example, you don't know which animals have the 687 razor sharp teeth, but you know that they are present in your community.

WGS sequencing works in a similar way. DNA isolated from the metagenomic community is sequenced and aligned to reference genomes (using a program similar to BLAST).The enzymes present in the community are determined from this analysis. For example, your community might contain the enzymes necesary to break down sugars. However, this type of experiment does not indicate which bacteria contributed this function, just that the ability to break down sugar exists in the community.

Lab Problems

  1. Classification in some individual samples

    In this section, we will look at the 16S reads in the traditional manner using copies of BaseSpace's metagenomic reports. If you were actually working on your own data, you'd use their software directly, but for this problem we'll just give you some screenshots to work with so you don't need to apply for accounts on their system.

    The sample names include subject identifiers, ranging from 01TS to 24TS, with a suffix of either RC or HP (retroauricular crease or hard palate). So for example, samples 04TS-HP and 04TS-RC are from two different sites but the same person.

    Answer the following questions:

    1. At this page (metagenomics.html), look at the breakdown by phylum of two HP samples (01TS-HP and 04TS-HP). What is the most common phylum in each? Are the samples consistent? Look at the breakdown for these samples at the genus level. Are the most abundant genera from the most common phyla? (You can look the classifications up online.)

    2. Do the same with the two RC samples (01TS-RC and 04TS-RC). What is the most common phylum and genus in each? Are the samples consistent?

    3. Do the bacterial populations in the mouth and on the skin of the same patients resemble each other?

Load balancing

Note: for the sake of minimizing the load on the Galaxy server, the people working in rows 2 and 4 should do problem 3 first and problem 2 afterwards; those in rows 1 and 3 should do problem 2 first and then problem 3.

  1. Functions performed by microbes in each community

    PICRUSt and LEfSe

    To fully understand a metagenomic community, it would be ideal to do both 16S and WGS sequencing. However, these experiments are expensive and time consuming. So what we would like to do instead is perform one sequencing experiment and try to infer the data that would have come from doing both.

    PICRUSt (Phylogenetic Investiagion of Communities by Reconstruction of Unobserved States) is a program that attempts to do this. Essentially, it tries to estimate the output of a WGS experiment from 16S read data. The program uses the 16S reads obtained from the sequencing to determine what the WGS reads would have looked like if the experiment had been run.

    LEfSe (LDA Effect Size) is a framework for performing pair-wise statistical tests. As input, LEfSe take in a set of samples and their classifications into different groups. LEfSe looks at all of the samples in each group and determines what features are overrepresented in each group. In this lab, we will use LEfSe to determine which functions are more abundant in the RC and which are more abundant in the HP.


    This input file (otutable.txt) contains the raw counts of the number of each type of microbe found in each sample taken from the entire set. Save this file to your computer.

    Running PICRUSt

    1. Open a browser and go to
    2. Go to the Get Data module and click Upload File.
    3. Select "picrust" as the file format and upload otutable.txt.
    4. Click Execute. You will see the job running on the right side of the browsrer.
    5. When the data upload finishes, the bar showing your data file will turn green. Looking back at the menu on the left of the screen, click on PICRUSt, and then on Normalize By Copy Number
    6. The website should automtically find the file you just uploaded as the input data. Change the input format from BIOM format to Legacy QIIME format (tab-delimited). Click Execute.
    7. Once this job finishes running (the bar will turn green again), click Predict Metagenome in the menu on the left (below PICRUSt). Leave the default settings as they are. As before, the website should find the output of the last step to use as input for this one. Click Execute.
    8. This may run for a little while, especially when we have ten jobs running at once. (It takes about a minute when just one person is doing it; if it is slow, you can work on another part of the lab in the meantime.) Once this job has finished running, click Categorize by Function again on the menu below PICRUSt) and change the Type of output to Legacy QIIME. Click Execute.
    9. When this task bar turns green, click on the bar. Click on the download icon (the little picture of a computer disk), which will open the file; right-click and choose Save As, and save it in your Downloads folder.

      When done, use the Back button on your browser to get back to the main Galaxy window. If it re-launches the previous Galaxy command, just cancel it.

    Output from PICRUSt

    The output from PICRUSt gives the KEGG pathway names corresponding to the types of functions present. These are a way of summarizing the WGS reads so that they are more easily understandable. A couple of things in these files need to be modified so that the file will be compatible with LEfSe.

    Sometimes, when we need to convert between file formats, we write a Python script. Here is one that you can use to create the LEfSe input file. Save your output file in the Downloads folder as picrust.txt, and run this script in the same directory. It will create a file called pythonlefseinput.txt that you can use for the next step of the analysis.

    Running LEfSe

    1. Upload the file pythonlefseinput.txt that you created with the python script, using the same Get Data module on the Galaxy website. This time, indicate that the file format is tabular. Click Execute.
    2. Click on LEfSe on the lefthand menu, then on Format Data for LEfSe and run this job with all of the default parameters.
    3. Run the LDA Effect Size with all the defualt parameters.
    4. Run "Plot LEfSe Results" with the default parameters. When it is done, click on the name of the job ("Plot LEfSe Results ...") in the taskbar that just turned green, and click on the disk icon to download the image. Upload it with your writeup.

    Examining the LEfSe Output

    LEfSe creates graphs that determine features that are significantly enriched between two classes of samples using a statistical test called LDA (linear discriminant analysis). The detalis of how this test works are beyond the scope of this class. However, a higher LDA (score) is more significant. The functions shown are significantly enriched in RC samples or HP samples (see the legend at the top for which color corresponds to which class). Longer bars indicate more significant enrichment in one group of samples or the other.

    For your lab writeup, please answer the following questions:

    1. What are some functions (at least two of each) that are enriched in the HP and RC at an LDA score with an absolute value of at least 3?
    2. What are some functions that are enriched in the HP and RC at an LDA score with an absolute value of at least 2?
    3. Select two functions enriched in either class at an LDA >= 2 (or <= -2), and speculate about why those functions might be useful to bacteria present in that class of samples.
    4. Explain how you are able to know about enriched biomedical functions present in each environment when you only have the sequences from a single gene.
  1. Finding Relationships Between Samples:

    In this section of the lab we will build a tree showing the relationships between all the samples obtained from the experiment. We will do so using a data file whose rows represent the samples and whose columns represent all the genera observed in the sample collection. The value shown in position (i,j) in this table are the percentage of all reads for sample i that are in genus j. We will use a clustering method to determine the relationships between the samples. Samples that are closer together are more similar. The algorithm for clustering the samples is almost exactly the same as the UPGMA algorithm for building phylogenetic trees, but in this case distances have to do with the bacterial populations in each sample. We will see if samples cluster by the person they came from or by the body site they were collected from. Samples that end with "HP" came from the hard palate and samples that end with "RC" came from the retroauricular crease.

    1. Download distMatrixGen.r and aggregateTableGenus.txt
    2. From the Start menu, click All Programs. Open the folder called R. Select R x64 2.15.1
    3. Once the program loads, click the Open Script icon (or choose File/Open Script) and select the distMatrixGet.r script that you just downloaded. Open this file.
    4. About a quarter of the way down the script you will see the constants INPUT and OUTPUT. In the file paths being loaded into these constants, replace YOUR_ID with the id you use to log into the computer lab machines. For example, if my ID were philb, I would change the input file to:
      "C:\\Users\\philb\\Downloads\\aggregateTableGenusNEW.txt". If you downloaded the files to your desktop, replace Downloads with Desktop, or otherwise edit these lines to show the location of the data file and where you want your output.
    5. Position your cursor on the line that starts with INPUT
    6. Click the third icon from the left just below the menu bar (right next to the save icon; this icon is called "Run line or selection"). Keep clicking this icon until a tree of clustered samples appears. (Each time you press this button, one line of code in your script is executed. If you know R, you can look at some of the variables to figure out what each step is doing.)

    For your lab writeup, please answer the following questions:

    1. Do samples cluster by body site or by the person the sample originated from? How can you tell?
    2. Based on these results, would you hypotethize that there is more variation in the microbiome between the same body sites on different individuals or different body sites on the same individual? Why?
    3. To do this analysis, do we need just 16s rRNA sequence data, the output of PICRUSt and LEfSe on 16s rRNA sequence data, or full WGS data? Why?

    Going Further

    1. List comprehensions in Python

      Look at the CodeAcademy unit Advanced Topics in Python, screens 4-6, on List Comprehensions. These are ways to write loops and tests more efficiently in a single line of code. After refreshing your memory of this unit, try the following list comprehension challenges.

      • Suppose you have a list saved in a variable: a = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]. Write one line of Python that takes this list a and makes a new list that contains only the even elements of the list.

      • Suppose you have a list b = [-20, 20, -16, 16, -12, 12, -8, 8, -4, 4]. Write a list comprehension to create, from b, a new list containing only the positive elements. Do not use list slices to do this; you should be able to use the same code to extract just the positive elements of any list of integers in any order.

      • Write a list comprehension that uses nested for-clauses to create a single list with all 36 different dice combinations from (1,1) to (6,6). Note that each tuple needs to be in parentheses.

      • Modify the nested list comprehension above to include only those dice combinations that are not doubles (e.g., not (1,1) or (6,6)).
    2. Reading data files in Python

      The file GSE6575_series_matrix.txt contains the expression data downloaded from GEO for data set GSE6575. You are welcome to look this up to see what it is about. However, the point of this experiment is to obtain the data part of a series data matrix file, which has in it both the expression data and a lot of header information.

      The data file is tab-delimited. First, view it in Excel or a text editor. Figure out what separates the header lines from those containing the "series matrix table" of expression data. The matrix table part contains one header line with sample names, followed by lines that start with Affymetrix qualifiers and continue with tab-delimited numbers representing expression in each sample. Figure out, from looking at the file, what you would have to do to read the file until you got to the series_matrix_table part of the file.

      1. Now write a script to read the input file, discard all the header lines, and copy the series matrix table into a new output file, without the quotation marks surrounding the Affymetrix qualifier names. Do this using the "with open ("filename.txt","r") as my_file" format.
      2. Upload your script and the output file.
      3. What is different about using "with open" instead of the format "my_file=open('filename.txt','r')"?
    3. Learn more about the bacteria in our metagenomics samples:

      Pick one or more bacterial genera from samples other than those you looked at for problem 1 above. Look them up online. What species might be representative of them in the samples we are exploring? Can you find different species in the genus you chose that might be beneficial, harmful, or neither for their human hosts? Discuss.