Metagenomics is the study of the community of microbes found in environmental samples. Metagenomics is often used to study communities of microbes living in different areas of the body. In the metagenomics project, we are investigating the community of microbes in the retroauricular crease and the hard palate.
There are two main types of metagenomics experiments that investigators run. These are 16S and WGS (whole genome shotgun sequencing). To better understand these types of experiments, imagine that you are a zookeeper, and you need to determine the composition of your zoo.
As zookeeper, you decide that the best way to catalogue the animals in your zoo is by going to each habitat and making a list of all the animals present. Each animal has a tag on its collar that indicates its exact species. You walk through your zoo and tabulate that you have 12 ostriches, 23 snakes, 4 rhinoceroses, 33 lions, 40 parrots, and an incredible 97 turtles. Now you know the composition of your zoo.
16S Sequencing works in a similar fashion. A unique region of each microbe's ribosomal RNA (rRNA) is sequenced. This segment of the rRNA is different in every microbe so it functions like the tag on an animal's collar. Each piece of rRNA sequenced came from a single microbe, and a computer program can be used to determine which microbe each piece of rRNA came from. This is the type of sequencing that was performed on your samples.
As zookeeper, you decide that you aren't interested in just determining the species present in your zoo. In any case, you don't really care about which exact animals you have. What you really care about is the feel of your zoo. You want your customers to see bright colors, sharp teeth, and flippers. So, you walk around your zoo writing down that you saw 687 razor sharp teeth, 33 flowing manes, 4,000 magnificent azure feathers. After accounting for your zoo in this manner, you don't know which species are present, but you understand the types of things those animals can do. For example, you don't know which animals have the 687 razor sharp teeth, but you know that they are present in your community.
WGS sequencing works in a similar way. DNA isolated from the metagenomic community is sequenced and aligned to reference genomes (using a program similar to BLAST).The enzymes present in the community are determined from this analysis. For example, your community might contain the enzymes necesary to break down sugars. However, this type of experiment does not indicate which bacteria contributed this function, just that the ability to break down sugar exists in the community.
Classification in some individual samples
In this section, we will look at the 16S reads in the traditional manner using copies of BaseSpace's metagenomic reports. If you were actually working on your own data, you'd use their software directly, but for this problem we'll just give you some screenshots to work with so you don't need to apply for accounts on their system.
The sample names include subject identifiers, ranging from 01TS to 24TS, with a suffix of either RC or HP (retroauricular crease or hard palate). So for example, samples 04TS-HP and 04TS-RC are from two different sites but the same person.
Answer the following questions:
Note: for the sake of minimizing the load on the Galaxy server, the people working in rows 2 and 4 should do problem 3 first and problem 2 afterwards; those in rows 1 and 3 should do problem 2 first and then problem 3.
Functions performed by microbes in each community
To fully understand a metagenomic community, it would be ideal to do both 16S and WGS sequencing. However, these experiments are expensive and time consuming. So what we would like to do instead is perform one sequencing experiment and try to infer the data that would have come from doing both.
PICRUSt (Phylogenetic Investiagion of Communities by Reconstruction of Unobserved States) is a program that attempts to do this. Essentially, it tries to estimate the output of a WGS experiment from 16S read data. The program uses the 16S reads obtained from the sequencing to determine what the WGS reads would have looked like if the experiment had been run.
LEfSe (LDA Effect Size) is a framework for performing pair-wise statistical tests. As input, LEfSe take in a set of samples and their classifications into different groups. LEfSe looks at all of the samples in each group and determines what features are overrepresented in each group. In this lab, we will use LEfSe to determine which functions are more abundant in the RC and which are more abundant in the HP.
This input file (otutable.txt) contains the raw counts of the number of each type of microbe found in each sample taken from the entire set. Save this file to your computer.
When done, use the Back button on your browser to get back to the main Galaxy window. If it re-launches the previous Galaxy command, just cancel it.
The output from PICRUSt gives the KEGG pathway names corresponding to the types of functions present. These are a way of summarizing the WGS reads so that they are more easily understandable. A couple of things in these files need to be modified so that the file will be compatible with LEfSe.
Sometimes, when we need to convert between file formats, we write a Python script. Here is one that you can use to create the LEfSe input file. Save your output file in the Downloads folder as picrust.txt, and run this script in the same directory. It will create a file called pythonlefseinput.txt that you can use for the next step of the analysis.
LEfSe creates graphs that determine features that are significantly enriched between two classes of samples using a statistical test called LDA (linear discriminant analysis). The detalis of how this test works are beyond the scope of this class. However, a higher LDA (score) is more significant. The functions shown are significantly enriched in RC samples or HP samples (see the legend at the top for which color corresponds to which class). Longer bars indicate more significant enrichment in one group of samples or the other.
For your lab writeup, please answer the following questions:
Finding Relationships Between Samples:
In this section of the lab we will build a tree showing the relationships between all the samples obtained from the experiment. We will do so using a data file whose rows represent the samples and whose columns represent all the genera observed in the sample collection. The value shown in position (i,j) in this table are the percentage of all reads for sample i that are in genus j. We will use a clustering method to determine the relationships between the samples. Samples that are closer together are more similar. The algorithm for clustering the samples is almost exactly the same as the UPGMA algorithm for building phylogenetic trees, but in this case distances have to do with the bacterial populations in each sample. We will see if samples cluster by the person they came from or by the body site they were collected from. Samples that end with "HP" came from the hard palate and samples that end with "RC" came from the retroauricular crease.
For your lab writeup, please answer the following questions:
List comprehensions in Python
Look at the CodeAcademy unit Advanced Topics in Python, screens 4-6, on List Comprehensions. These are ways to write loops and tests more efficiently in a single line of code. After refreshing your memory of this unit, try the following list comprehension challenges.
Reading data files in Python
The file GSE6575_series_matrix.txt contains the expression data downloaded from GEO for data set GSE6575. You are welcome to look this up to see what it is about. However, the point of this experiment is to obtain the data part of a series data matrix file, which has in it both the expression data and a lot of header information.
The data file is tab-delimited. First, view it in Excel or a text editor. Figure out what separates the header lines from those containing the "series matrix table" of expression data. The matrix table part contains one header line with sample names, followed by lines that start with Affymetrix qualifiers and continue with tab-delimited numbers representing expression in each sample. Figure out, from looking at the file, what you would have to do to read the file until you got to the series_matrix_table part of the file.
Learn more about the bacteria in our metagenomics samples:
Pick one or more bacterial genera from samples other than those you looked at for problem 1 above. Look them up online. What species might be representative of them in the samples we are exploring? Can you find different species in the genus you chose that might be beneficial, harmful, or neither for their human hosts? Discuss.