Comp 7 / Bio 40 Project: Reliability of Metagenomics Reads

Due December 5, 2019

Background

In this class, we are studying 16S rRNA metagenomic sequencing, in which the sequences read come from one of the hypervariable regions of the 16S rRNA genes, which are generally well conserved across prokaryotes. Because of the variability in the region we are sequencing, it is possible to use computational means to try to identify the taxonomic classification of each sequence read. Of course, the sequence data can be noisy; noise may be represented by nucleotides reported as 'N' in the sequence data files. In addition, not every bacterial species has previously been sequenced, and some species may have sufficiently similar sequences even in these hypervariable regions, making it difficult to classify them exactly.

The MiSeq metagenomics pipeline uses the method of Wang et al. ( Assignment of rRNA Sequences into the New Bacterial Taxonomy. Q. Wang, G. M. Garrity, J. M. Tiedje, J. R. Cole. Appl. Environ. Microbiol. 73(16):5261, 2007) to taxonomically classify sequences. This is a probablistic method, so the classification at each level of the taxonomic hierarchy is associated with a confidence score, which corresponds roughly to an estimate of the probability that the classification is correct.

The software uses a cutoff of 80% confidence to report a result. If the classification software is less than 80% confident in its classification at any given taxonomic level, it reports that the sequence is "Unclassified" at that level. Some samples have many unclassified reads, while others have relatively few.

Hypothesis

We are going to focus on the most specific level of taxonomic classification reported by our software, the genus level. Our hypothesis is the reads that could not be classified with greater than 80% certainty at the genus level had a higher percentage of undetermined nucleotides (Ns). In this assignment, you will analyze next-generation sequencing data to confirm or refute this hypothesis.


Overview

You will write a python program to address this question by combining data from two data files produced by the MiSeq software. If you want, we will help you design the program and will even give you the outline of the code to start with, but you will fill in each of the pieces!

In order to test your hypothesis, you will need to extract information from a data file to determine which reads are "unclassified" at the genus level, and which are not. You will also need to get the actual nucleotide sequences reported for each of those reads. These are stored in separate data files, so you will have to match the data up between two files using the read identifiers.

Matching data between two files is a very common problem in bioinformatics, and it is one that is easily solved using the dictionary data structure that you have recently learned. First we will describe the two file formats you will encounter, and then we will describe the program you will need to complete.

FASTQ File Format

Most next-generation raw sequencing data comes in the form of FASTQ files. FASTQ files list each read that the sequencer captured. Each read in a FASTQ file consists of four lines. The first line starts with the "@" character and then a unique identifier naming the read, along with another identifier after some whitespace; the second line contains the nucleotide sequence of the read (with N representing reads that are too noisy for a confident nucleotide call); the third line starts with the "+" character and optionally includes the read name again, and the fourth line contains a set of symbols the same length as the read, representing the quality score for each nucleotide. A sample read from a FASTQ file is shown below.

@M01675:6:000000000-A6HK1:1:1101:14905:1750 1:N:0:39
ATTTACATTNNATCANATACATAAAGGNNNNNNN
+
ARSENKC*&^$^&*$NIARINKC()()($%)@($

For this analysis, you will not need to use the third or fourth lines of each set. Keep this in mind when you are writing a function to read in the FASTQ file.

Classification File

The classification file is a text file that contains the read ID and the phylogenetic classification for each read. Phylogeny classification can happen at each level of scientific classification. For example, consider the Horse and E. coli bacterium below.

These organisms can be classified at any of the following levels:

Domain
Kingdom
Phylum
Class
Order
Family
Genus
Species
Eukaryota
Animalia
Chordata
Mammalia
Perissodactyla
Equidae
Equus
E. ferus
Bacteria
Eubacteria
Proteobacteria
Gammaproteobacteria
Enterobacteriales
Enterobacteriaceae
Escherichia
E. coli

At the Domain level, we could say that the organism on the left is a eukaryote, and the organism on the right is a prokaryote. However, at the Species level, we can say that the organism on the left is a horse and the organism on the right is an E. coli bacterium.

When running a 16S experiment, the software tries to identify each read to the most fine-grained level of phylogeny that it can from the sequence information. Sometimes, the software will be able to classify a read to the genus level (which is what you are interested in), and sometimes the software won’t have enough information to perform the classification to that level.

Each read in the classification file consists of two lines. The first line includes the read name (similar to that from the fastq file, with the differences described below), and the second line lists the classification at each taxonomic level and the corresponding probability. The classifications are semicolon-delimited, and the columns correspond to the classification at the level of kingdom, phylum, class, order, family, and genus.

Note that there is no species-level classification reported; we don't have enough sequence variability nor depth of sequencing to make accurate claims at that level.

A sample two-row entry from the classfication file looks like the following:

>39 M01675:6:000000000-A6HK1:1:1101:14905:1750
Bacteria; 0.96; Firmicutes; 0.96; Bacilli; 0.96; Lactobacillales; 0.71; Enterococcaceae; 0.61; Melissococcus; 0.61; 

This suggests 96% confidence that read M01675:6:000000000-A6HK1:1:1101:14905:1750 comes from the domain of Bacteria, but only 61% confidence of the mapping to the genus Melissococcus. In the description of the code outline below, we refer to the second line in each two-row classification file entry (the one starting with "Bacteria" in our example above) as a "classification string").

Note also that some classification strings are shorter than this and don't include data for some of the finer levels of classification. The remaining levels can be inferred to have confidence scores below 0.70, so you will want to count them as sequences with below 80% confidence. For example, you might see the following lines in the classification file, which only classify that sequence down to the class level:

>39 M01675:6:000000000-A6HK1:1:1101:14975:1719
Bacteria; 0.96; Proteobacteria; 0.91; Gammaproteobacteria; 0.91;

Keep in mind that the read ID lines used by the classification file and the corresponding ID lines in the FASTQ file are slightly different and have to be modified to be compatible with one another. Specifically, a read-name line in the classification file looks like:

>39 M01675:6:000000000-A6HK1:1:1101:14064:1631
while the comparable line in the fastq file is:
@M01675:6:000000000-A6HK1:1:1101:14064:1631 1:N:0:39
We suggest you use just M01675:6:000000000-A6HK1:1:1101:14064:1631 as the read name. So in the first case, you will need to remove the '>39 ' part of the line, while in the second case you will need to remove the ' 1:N:0:39' part and the leading '@' character.

The full classification data file is called metagen.txt and the fastq file, metagen.fastq. Both are available for download from the projects page. There are also smaller sample files for use when you are testing your code (rand_10.txt and rand_10.fasta, rand_100.txt, etc.) The number reflects the number of sequences or records in the file. We strongly suggest testing on small data files first.

What to submit

Upload your code. Include comments at the top of your program explaining whether you believe the hypothesis is true for each pair of data files, including the full metagen ones.

Style matters. Specifically, you should comment your code so that a reader could figure out what you're trying to do. Break things up into functions if appropriate (we did a lot of that for you with the skeleton code, but you're welcome to write any helper functions you think you need). Use informative variable names.

Code Introduction

If you would like to write your own code from scratch, you are free to do so. In that case, you can ignore the rest of this page, and simply download the data files mentioned above. The skeleton code is initially set to use just the smallest of these (rand_10.txt and rand_10.fasta). Only replace these file names with the names of the larger test files and, eventually, the full sized files, after you get your code to work on the smaller data sets. We have given you most of the main part of the program, and described the missing part of the main program and several functions that you need to write for this assignment.

Functions that need to be implemented in the skeleton code simply say, “pass” under the function definition. For example:

def read_fasta(filename):
     pass

Skeleton Code Overview

At the top of the skeleton code, there are two constants: CLASS_FILE and READS_FILE. Add the names of the classification file and the FASTQ file in quotation marks to each of these constants, respectively. Make sure that these files are located in the same directory as the code.

Now take a look at the main function in the skeleton code. It performs the following steps:

  1. Initializes two accumulator variables (lists) for storing the numbers of Ns in sequences below and above the cutoff, respectively.
  2. Calls read_fastq(FASTQ) to read the sequences from the fastq-formatted input file, whose name is stored in the constant FASTQ. This returns a dict whose keys are the sequence identifier strings and whose values are sequences.
  3. Calls read_classifications(CLASSIF) to read the classification strings from the input file whose name is stored in the constant CLASSIF. This returns a dict whose keys are sequence identifier strings and whose values are classification strings.
  4. The part you should write: look for all sequences that are in both files and determine which list to add them to by calling the getGenusConfidence() function and comparing it to the value in the constant CUTOFF.
    Then, compute the number of Ns and add this to the appropriate list (either abovecutoffNs or belowcutoffNs).
  5. After all sequences have been processed, main computes and prints the average number of Ns per sequence for the sequences below the cutoff and for those above it.

Specifically, you will complete the code in main following the comments and the instructions in (4) above, and then you will write the contents of each of the other functions so that the input and output match those in the comments.

Using Modules

You've been using Python Modules from the first days of Codecademy this term (recall the Time and Date module?). But now you have the option of using two of them to make your code easier or more powerful.

Commented out at the top of the skeleton code is a line that imports the BioPython SeqIO module. If you use this, the command SeqIO.parse(filename, format) can be used to open, and loop through (iterate over), and close, the data file filename in any of a number of common sequence data formats. This includes "fastq" and "fasta" as possible format options. You can see the full set of options and some examples of reading files at the BioPython Wiki.

(There is a link at the left-hand menu for those who would like to try downloading and installing this on their own machines. Make sure to use "pip2" rather than "pip" to install to python 2 rather than python 3.) This module is also available on most of the lab machines; start at the back of the room and work forward if you're not finding it (they have to push it out one machine at a time, and they are short-staffed this week). I have tested and know it is working on machines a, b, and d, which are in the last row (they actually have labels).

The other thing you might want to do (although this is not required and skipping it won't hurt your grade) is use the statistics part of the scipy module to determine whether the differences you are seeing are significant.

Scipy is a standard module for doing scientific computation in Python, and it will be installed automatically along with BioPython. There are a huge number of statistical functions described here. Click on each test to ensure it does what you want and to find out about assumptions behind it.

For example, you could use a t-test (called ttest_ind in scipy) to tell the difference between the means of the two lists, but the values are definitely not normally distributed (use the Shapiro test to see this). The Kolmogorov Smirnov test measures the difference between two distributions, but it's not really accurate if the samples are not continuous values and have a lot of ties (e.g. most values here are zero). Thus, although you can try these, the resulting p-values aren't really valid.

So far as I can tell, your best bet is to try a Mann Whitney U test to compare the ranks of the values in the two lists (e.g., print (stats.mannwhitneyu(list1, list2)). Feel free to discuss if you have other thoughts about this. (A chi-squared test with a different column for each value (number of N's) is another option, but that gets unwieldy with too many distinct values, and that will differ by data set).

Implementation Strategy

We suggest that you start by getting some of the smaller and simpler functions to work first. Look at the function descriptions below. Helper functions like avgList and getPctNs should be easy to write and debug on their own. You must write some code to test these, although you won't need to run all the tests in your final version. Then try something like readInFastq. Test each of these out before you go on to another function.

Important note: For a project of this size, it is essential that you only write and test one function at a time! Or even just a part of each function at a time. Once one function is working and thoroughly tested, then go on to the next one. If you come to us having written the whole program but never tested any of it and claiming it doesn't work, we will not be able to help you without removing everything you wrote!

Function Details

We have listed and described each of the functions you need to implement below:

main(): Fill in the missing details as described in step 4, above.

read_fastq(filename):
Reads in the FASTQ file (filename). Creates a dictionary so that for each read, the read ID is the key and the nucleotide string is the value. Should return the created dictionary.

The first line of each fastq record starts with "@" and includes a second block of text that should be disregarded. Consider using the split() function to discard this second block of text. The 3rd and 4th lines of each fastq entry should be discarded. Consider using the modulo operator ("%") to determine which lines you are going to use and which lines you will skip.

read_classifications(filename):
This function reads the classification file (filename) and creates a dictionary so that for each read, the read ID is the key and the classification string is the value. (That is, this dict has the form {id: classification_string, ....}.) The function should return the created dictionary.

Note that you'll need to remove the ">39 " characters from the front of the id string to be able to match identifiers between files.

getGenusConfidence(clstring):
This function takes as input a classification string (e.g. "Bacteria; 0.96; Firmicutes; 0.96; Bacilli; 0.96;"), and returns a floating point number corresponding to the numerical value of the 12th element of the list (the significance of classification of the indicated sequence at the genus level) if it exists. If it does not exist, the function should return zero.

countNs(sequence):
This function takes as input a string containing a nucleotide sequence and returns the number of characters matching 'N' in the sequence.

printListAvg(label,data):
The inputs to this function are a string called 'label', and a list of numeric values called 'data'. The function prints the label and then prints the average of the numeric values in the data list. The function does not return anything.