Comp 7 / Bio 40 Lab 8
The Zoo Inside You

November 19, 2019

Purpose of this lab

To help students become familiar with functional analysis of metagenomics sequence data and tree inference based on sample characteristics.

Background

Metagenomics is the study of the community of microbes found in environmental samples. Metagenomics is often used to study communities of microbes living in different areas of the body. In the metagenomics project, we are investigating the community of microbes in the retroauricular crease and the hard palate.

There are two main types of metagenomics experiments that investigators run. These are 16S and WGS (whole genome shotgun sequencing). To better understand these types of experiments, imagine that you are a zookeeper, and you need to determine the composition of your zoo.

16S Sequencing

As zookeeper, you decide that the best way to catalogue the animals in your zoo is by going to each habitat and making a list of all the animals present. Each animal has a tag on its collar that indicates its exact species. You walk through your zoo and tabulate that you have 12 ostriches, 23 snakes, 4 rhinoceroses, 33 lions, 40 parrots, and an incredible 97 turtles. Now you know the composition of your zoo.

16S Sequencing works in a similar fashion. A unique region of each microbe's ribosomal RNA (rRNA) is sequenced. This segment of the rRNA is different in every microbe so it functions like the tag on an animal's collar. Each piece of rRNA sequenced came from a single microbe, and a computer program can be used to determine which microbe each piece of rRNA came from. This is the type of sequencing that was performed on your samples.

WGS Sequencing

As zookeeper, you decide that you aren't interested in just determining the species present in your zoo. In any case, you don't really care about which exact animals you have. What you really care about is the feel of your zoo. You want your customers to see bright colors, sharp teeth, and flippers. So, you walk around your zoo writing down that you saw 687 razor sharp teeth, 33 flowing manes, 4,000 magnificent azure feathers. After accounting for your zoo in this manner, you don't know which species are present, but you understand the types of things those animals can do. For example, you don't know which animals have the 687 razor sharp teeth, but you know that they are present in your community.

WGS sequencing works in a similar way. DNA isolated from the metagenomic community is sequenced and aligned to reference genomes (using a program similar to BLAST).The enzymes present in the community are determined from this analysis. For example, your community might contain the enzymes necesary to break down sugars. However, this type of experiment does not indicate which bacteria contributed this function, just that the ability to break down sugar exists in the community.

Lab Problems

  1. Classification in some individual samples

    In this section, we will look at the 16S reads in the traditional manner that Illumina, the company that makes the sequencing machines, recommends on its own metagenomic analysis software, called BaseSpace. If you were actually working on your own data, you'd use their software directly, but for this problem we'll just give you some screenshots to work with so you don't need to apply for accounts on their system.

    You will be looking at data from an experiment in which we sampled the microbiomes at two sites from consenting students in an earlier version of comp 7. The sample names include subject identifiers, ranging from 01TS to 24TS, with a suffix of either RC or HP ("retroauricular crease" (i.e., behind the ear), or "hard palate" (i.e., roof of the mouth)). So for example, samples 04TS-HP and 04TS-RC are from two different sites but the same person.

    Answer the following questions:

    1. At this page (metagenomics.html), look at the breakdown by phylum of two HP samples (01TS-HP and 04TS-HP). What is the most common phylum in each? Are the samples consistent? Look at the breakdown for these samples at the genus level. Are the most abundant genera from the most common phyla? (You can look the classifications up online.)

    2. Do the same with the two RC samples (01TS-RC and 04TS-RC). What is the most common phylum and genus in each? Are the samples consistent?

    3. Do the bacterial populations in the mouth and on the skin of the same patients resemble each other?

Load balancing

Note: for the sake of minimizing the load on the Galaxy server, the people working in rows 2 and 4 should do problem 3 first and problem 2 afterwards; those in rows 1 and 3 should do problem 2 first and then problem 3.

  1. Functions performed by microbes in each community

    PICRUSt and LEfSe

    To fully understand a metagenomic community, it would be ideal to do both 16S and WGS sequencing. However, these experiments are expensive and time consuming. So what we would like to do instead is perform one sequencing experiment and try to infer the data that would have come from doing both.

    PICRUSt (Phylogenetic Investiagion of Communities by Reconstruction of Unobserved States) is a program that attempts to do this. A paper describing this computational approach is available here. Essentially, it tries to estimate the output of a WGS experiment from 16S read data. The program uses the 16S reads obtained from the sequencing to determine what the WGS reads would have looked like if the experiment had been run.

    LEfSe (LDA Effect Size) is a framework for performing pair-wise statistical tests. As input, LEfSe take in a set of samples and their classifications into different groups. LEfSe looks at all of the samples in each group and determines what features are overrepresented in each group. In this lab, we will use LEfSe to determine which functions are more abundant in the RC and which are more abundant in the HP.

    Input

    This input file (otutable.txt) contains the raw counts of the number of sequences that (with high probability) were mapped to each type of microbe found in each sample taken from the entire set. That is, it is a matrix whose rows correspond to individual bacterial genera, whose columns are samples, and whose values are the number of sequences thought to come from that genus in that sample. Save this file to your computer.

    Running PICRUSt

    1. Open a browser and go to http://huttenhower.sph.harvard.edu/galaxy/
    2. Go to the Get Data module and click Upload File.
    3. Add otutable.txt to the list of files to upload. Select "picrust" as the file Type (there is a large pulldown) and click Start to upload otutable.txt.
    4. When the data upload finishes, the bar on the right of the window showing your data file in the History column will turn green. Looking back at the menu on the left of the screen, click on PICRUSt, and then on Normalize By Copy Number
    5. The website should automtically find the file you just uploaded as the input data. Change the input format from BIOM format to Legacy QIIME format (tab-delimited). Click Execute.
    6. Once this job finishes running (the History bar will turn green again), click Predict Metagenome in the menu on the left (below PICRUSt). Leave the default settings as they are. As before, the website should find the output of the last step to use as input for this one. Click Execute.
    7. This may run for a little while, especially when we have ten jobs running at once. (It takes about a minute when just one person is doing it; if it is slow, you can work on another part of the lab in the meantime.) Once this job has finished running, click Categorize by Function again on the menu below PICRUSt) and change the Type of output to Legacy QIIME. Click Execute.
    8. When this task bar turns green, click on the bar. Click on the download icon (the little picture of a computer disk), which will open the file; right-click and choose Save Link As, and save it in your Downloads folder.

      When done, use the Back button on your browser to get back to the main Galaxy window. If it re-launches the previous Galaxy command, just cancel it.

    Output from PICRUSt

    The output from PICRUSt gives the KEGG pathway names corresponding to the types of functions present. These are a way of summarizing the WGS reads so that they are more easily understandable. A couple of things in these files need to be modified so that the file will be compatible with LEfSe.

    Sometimes, when we need to convert between file formats, we write a Python script. Here is one that you can use to create the LEfSe input file. Save your output file in the Downloads folder as picrust.txt, and run this script in the same directory. (To do this, open Python 2.7, open the convertpicrust.py script, and run it.) It will create a file called pythonlefseinput.txt that you can use for the next step of the analysis.

    Running LEfSe

    1. Upload (to Galaxy) the file pythonlefseinput.txt that you created with the python script, using the same Get Data module on the Galaxy website. This time, indicate that the file format is tabular. Click Execute.
    2. Click on LEfSe on the lefthand menu, then on Format Data for LEfSe and run this job with all of the default parameters.
    3. Run the LDA Effect Size with all the defualt parameters.
    4. Run "Plot LEfSe Results" with the default parameters. When it is done, click on the name of the job ("Plot LEfSe Results ...") in the taskbar that just turned green, and click on the disk icon to download the image. Upload it with your writeup.

    Examining the LEfSe Output

    LEfSe creates graphs that determine features that are significantly enriched between two classes of samples using a statistical test called LDA (linear discriminant analysis). The detalis of how this test works are beyond the scope of this class. However, a higher LDA (score) is more significant. The functions shown are significantly enriched in RC samples or HP samples (see the legend at the top for which color corresponds to which class). Longer bars indicate more significant enrichment in one group of samples or the other.

    For your lab writeup, please answer the following questions:

    1. What are some functions (at least two of each) that are enriched in the HP and RC at an LDA score with an absolute value of at least 3?
    2. What are some functions that are enriched in the HP and RC at an LDA score with an absolute value of at least 2?
    3. Select two functions enriched in either class at an LDA >= 2 (or <= -2), and speculate about why those functions might be useful to bacteria present in that class of samples.
    4. Explain how you are able to know about enriched biomedical functions present in each environment when you only have the sequences from a single gene.
  1. Finding Relationships Between Samples:

    In this section of the lab we will build a tree showing the relationships between all the samples obtained from the experiment. We will do so using a data file whose rows represent the samples and whose columns represent all the genera observed in the sample collection. The value shown in position (i,j) in this table are the percentage of all reads for sample i that are in genus j. (So this is essentially the transpose of the matrix from problem 2, but with the data values normalized to percentages for each row.)

    We will use a clustering method to determine the relationships between the samples. We will show the relationships by building a tree. Samples that are closer together in the tree are more similar. The algorithm for clustering the samples is almost exactly the same as the UPGMA algorithm for building phylogenetic trees, but in this case distances have to do with the bacterial populations in each sample. We will see if samples cluster by the person they came from or by the body site they were collected from. Recall that samples that end with "HP" came from the hard palate and samples that end with "RC" came from the retroauricular crease, and that the id numbers (e.g. the 01 in 01TS-RC) refer to an individual subject, so 01TS-RC and 01TS-HP come from the same person).

    1. Download distMatrixGen.r and aggregateTableGenus.txt
    2. From the Start menu, click All Programs. Open the folder called R. Select R x64 2.15.1 or R i386 2.15.1 (whatever version of R is present).
    3. Once the program loads, click the Open Script icon (or choose File/Open Script) and select the distMatrixGet.r script that you just downloaded. Open this file.
    4. About a quarter of the way down the script you will see the constants INPUT and OUTPUT. In the file paths being loaded into these constants, replace YOUR_ID with the id you use to log into the computer lab machines. For example, if my ID were philb, I would change the input file to:
      "C:\\Users\\philb\\Downloads\\aggregateTableGenusNEW.txt". If you downloaded the files to your desktop, replace Downloads with Desktop, or otherwise edit these lines to show the location of the data file and where you want your output.
    5. Position your cursor on the line that starts with INPUT
    6. Click the third icon from the left just below the menu bar (right next to the save icon; this icon is called "Run line or selection"). Keep clicking this icon until a tree of clustered samples appears. (Each time you press this button, one line of code in your script is executed. If you know R, you can look at some of the variables to figure out what each step is doing.)

    For your lab writeup, please answer the following questions:

    1. Do samples cluster by body site or by the person the sample originated from? How can you tell?
    2. Based on these results, would you hypothesize that there is more variation in the microbiome between the same body sites on different individuals or different body sites on the same individual? Why?
    3. To do this analysis, do we need just 16s rRNA sequence data, the output of PICRUSt and LEfSe on 16s rRNA sequence data, or full WGS data? Why?


    Going Further

    1. Reading data files in Python
      (practice reading data files, if you want more practice with these) The file GSE6575_series_matrix.txt contains the expression data downloaded from GEO for data set GSE6575. You are welcome to look this up to see what it is about. However, the point of this experiment is to obtain the data part of a series data matrix file, which has in it both the expression data and a lot of header information.

      The data file is tab-delimited. First, view it in Excel or a text editor. Figure out what separates the header lines from those containing the "series matrix table" of expression data. The matrix table part contains one header line with sample names, followed by lines that start with Affymetrix qualifiers and continue with tab-delimited numbers representing expression in each sample. Figure out, from looking at the file, what you would have to do to read the file until you got to the series_matrix_table part of the file.

      1. Now write a script to read the input file, discard all the header lines, and copy the series matrix table into a new output file, without the quotation marks surrounding the Affymetrix qualifier names. Do this using the "with open ("filename.txt","r") as my_file" format.
      2. Upload your script and the output file.
      3. What is different about using "with open" instead of the format "my_file=open('filename.txt','r')"?
    2. List comprehensions in Python
      (Advanced coding for those who want a challenge)

      Look at the CodeAcademy unit Advanced Topics in Python, screens 4-6, on List Comprehensions. These are ways to write loops and tests more efficiently in a single line of code. After refreshing your memory of this unit, try the following list comprehension challenges.

      • Suppose you have a list saved in a variable: a = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]. Write one line of Python that takes this list a and makes a new list that contains only the even elements of the list.

      • Suppose you have a list b = [-20, 20, -16, 16, -12, 12, -8, 8, -4, 4]. Write a list comprehension to create, from b, a new list containing only the positive elements. Do not use list slices to do this; you should be able to use the same code to extract just the positive elements of any list of integers in any order.

      • Write a list comprehension that prints out all the first letters of the words in the sentence, "Hi, Ellen! Let's log out." Use lower() to change the string to lowercase, and split() to split the sentence (string) into a list of words. Then use a list comprehension to get the first letters and print just those.
    3. Learn more about the bacteria in our metagenomics samples:

      Pick one or more bacterial genera from samples other than those you looked at for problem 1 above. Look them up online. What species might be representative of them in the samples we are exploring? Can you find different species in the genus you chose that might be beneficial, harmful, or neither for their human hosts? Discuss.