Metagenomics is the study of the community of microbes found in environmental samples. Metagenomics is often used to study communities of microbes living in different areas of the body. In the metagenomics project, we are investigating the community of microbes in the retroauricular crease and the hard palate.
There are two main types of metagenomics experiments that investigators run. These are 16S and WGS (whole genome shotgun sequencing). To better understand these types of experiments, imagine that you are a zookeeper, and you need to determine the composition of your zoo.
As zookeeper, you decide that the best way to catalogue the animals in your zoo is by going to each habitat and making a list of all the animals present. Each animal has a tag on its collar that indicates its exact species. You walk through your zoo and tabulate that you have 12 ostriches, 23 snakes, 4 rhinoceroses, 33 lions, 40 parrots, and an incredible 97 turtles. Now you know the composition of your zoo.
16S Sequencing works in a similar fashion. A unique region of each microbe's ribosomal RNA (rRNA) is sequenced. This segment of the rRNA is different in every microbe so it functions like the tag on an animal's collar. Each piece of rRNA sequenced came from a single microbe, and a computer program can be used to determine which microbe each piece of rRNA came from. This is the type of sequencing that was performed on your samples.
As zookeeper, you decide that you aren't interested in just determining the species present in your zoo. In any case, you don't really care about which exact animals you have. What you really care about is the feel of your zoo. You want your customers to see bright colors, sharp teeth, and flippers. So, you walk around your zoo writing down that you saw 687 razor sharp teeth, 33 flowing manes, 4,000 magnificent azure feathers. After accounting for your zoo in this manner, you don't know which species are present, but you understand the types of things those animals can do. For example, you don't know which animals have the 687 razor sharp teeth, but you know that they are present in your community.
WGS sequencing works in a similar way. DNA isolated from the metagenomic community is sequenced and aligned to reference genomes (using a program similar to BLAST).The enzymes present in the community are determined from this analysis. For example, your community might contain the enzymes necesary to break down sugars. However, this type of experiment does not indicate which bacteria contributed this function, just that the ability to break down sugar exists in the community.
Classification in some individual samples
In this section, we will look at the 16S reads in the traditional manner that Illumina, the company that makes the sequencing machines, recommends on its own metagenomic analysis software, called BaseSpace. If you were actually working on your own data, you'd use their software directly, but for this problem we'll just give you some screenshots to work with so you don't need to apply for accounts on their system.
You will be looking at data from an experiment in which we sampled the microbiomes at two sites from consenting students in an earlier version of comp 7. The sample names include subject identifiers, ranging from 01TS to 24TS, with a suffix of either RC or HP ("retroauricular crease" (i.e., behind the ear), or "hard palate" (i.e., roof of the mouth)). So for example, samples 04TS-HP and 04TS-RC are from two different sites but the same person.
Answer the following questions:
Note: for the sake of minimizing the load on the Galaxy server, the people working in rows 2 and 4 should do problem 3 first and problem 2 afterwards; those in rows 1 and 3 should do problem 2 first and then problem 3.
Functions performed by microbes in each community
To fully understand a metagenomic community, it would be ideal to do both 16S and WGS sequencing. However, these experiments are expensive and time consuming. So what we would like to do instead is perform one sequencing experiment and try to infer the data that would have come from doing both.
PICRUSt (Phylogenetic Investiagion of Communities by Reconstruction of Unobserved States) is a program that attempts to do this. A paper describing this computational approach is available here. Essentially, it tries to estimate the output of a WGS experiment from 16S read data. The program uses the 16S reads obtained from the sequencing to determine what the WGS reads would have looked like if the experiment had been run.
LEfSe (LDA Effect Size) is a framework for performing pair-wise statistical tests. As input, LEfSe take in a set of samples and their classifications into different groups. LEfSe looks at all of the samples in each group and determines what features are overrepresented in each group. In this lab, we will use LEfSe to determine which functions are more abundant in the RC and which are more abundant in the HP.
This input file (otutable.txt) contains the raw counts of the number of sequences that (with high probability) were mapped to each type of microbe found in each sample taken from the entire set. That is, it is a matrix whose rows correspond to individual bacterial genera, whose columns are samples, and whose values are the number of sequences thought to come from that genus in that sample. Save this file to your computer.
When done, use the Back button on your browser to get back to the main Galaxy window. If it re-launches the previous Galaxy command, just cancel it.
The output from PICRUSt gives the KEGG pathway names corresponding to the types of functions present. These are a way of summarizing the WGS reads so that they are more easily understandable. A couple of things in these files need to be modified so that the file will be compatible with LEfSe.
Sometimes, when we need to convert between file formats, we write a Python script. Here is one that you can use to create the LEfSe input file. Save your output file in the Downloads folder as picrust.txt, and run this script in the same directory. (To do this, open Python 2.7, open the convertpicrust.py script, and run it.) It will create a file called pythonlefseinput.txt that you can use for the next step of the analysis.
LEfSe creates graphs that determine features that are significantly enriched between two classes of samples using a statistical test called LDA (linear discriminant analysis). The detalis of how this test works are beyond the scope of this class. However, a higher LDA (score) is more significant. The functions shown are significantly enriched in RC samples or HP samples (see the legend at the top for which color corresponds to which class). Longer bars indicate more significant enrichment in one group of samples or the other.
For your lab writeup, please answer the following questions:
Finding Relationships Between Samples:
In this section of the lab we will build a tree showing the relationships between all the samples obtained from the experiment. We will do so using a data file whose rows represent the samples and whose columns represent all the genera observed in the sample collection. The value shown in position (i,j) in this table are the percentage of all reads for sample i that are in genus j. (So this is essentially the transpose of the matrix from problem 2, but with the data values normalized to percentages for each row.)
We will use a clustering method to determine the relationships between the samples. We will show the relationships by building a tree. Samples that are closer together in the tree are more similar. The algorithm for clustering the samples is almost exactly the same as the UPGMA algorithm for building phylogenetic trees, but in this case distances have to do with the bacterial populations in each sample. We will see if samples cluster by the person they came from or by the body site they were collected from. Recall that samples that end with "HP" came from the hard palate and samples that end with "RC" came from the retroauricular crease, and that the id numbers (e.g. the 01 in 01TS-RC) refer to an individual subject, so 01TS-RC and 01TS-HP come from the same person).
For your lab writeup, please answer the following questions:
Reading data files in Python
(practice reading data files, if you want more practice with these)
The
file GSE6575_series_matrix.txt
contains the expression data downloaded from GEO for data set GSE6575. You
are welcome to look this up to see what it is about. However, the point of
this experiment is to obtain the data part of a series data matrix file,
which has in it both the expression data and a lot of header
information.
The data file is tab-delimited. First, view it in Excel or a text editor. Figure out what separates the header lines from those containing the "series matrix table" of expression data. The matrix table part contains one header line with sample names, followed by lines that start with Affymetrix qualifiers and continue with tab-delimited numbers representing expression in each sample. Figure out, from looking at the file, what you would have to do to read the file until you got to the series_matrix_table part of the file.
List comprehensions in Python
(Advanced coding for those who want a challenge)
Look at the CodeAcademy unit Advanced Topics in Python, screens 4-6, on List Comprehensions. These are ways to write loops and tests more efficiently in a single line of code. After refreshing your memory of this unit, try the following list comprehension challenges.
Learn more about the bacteria in our metagenomics samples:
Pick one or more bacterial genera from samples other than those you looked at for problem 1 above. Look them up online. What species might be representative of them in the samples we are exploring? Can you find different species in the genus you chose that might be beneficial, harmful, or neither for their human hosts? Discuss.