Comp 7 / Bio 40 Lab 8
Sequence variation

November 21, 2016

Purpose of this lab

To test out gene set analysis of expression data, and to begin writing code to handle files with sequence data.

Lab problems

  1. Sequence variation in APOE

    Save the file apoe_sequences.txt from the labs web page. This is a file that contains the reference sequence of the APOE gene, along with the sequences of APOE genes from five different patients with hyperlipidemia, thought to be associated with some APOE variants. Your goal in this problem is to identify all variants in the patients' sequences compared to the reference sequence. (In this case, you may assume that all the sequences are perfectly aligned and there are no gaps, insertions, or deletions. All you need to do is identify single base substitutions.) Such substitutions might, in a larger data set, serve as candidate variants for association with hyperlipidemia.

    There are six lines in the file. The very first line is the APOE reference sequence. Read the first line and store it in a variable called reference. Then define a variable called sampleid and set its initial value to be 0. You should increment this value everytime you read a line from the file, and use this number as the id for each sample (patient). That is, sampleid should be an integer from 0-4, corresponding to each of our five patients.

    Read the file line by line and store each line in a variable called line. Once you have read a line, use another loop to loop through each character in line and compare it with the corresponding character in the same position of the reference sequence that you have already stored. In other words, you are comparing line[i] and reference[i] for each position i. If line[i] is NOT same as reference[i], it means that there is a subtitution at position i.

    Everytime you find a substitution, report it by printing out sampleid, the index (i) of the position with the substitution, the letter at that position in the reference sequence (reference), and the letter at that position in the sample sequence (line). Increment the sampleid at the end of the outer loop (in which you are reading and processing a line from the file), so that you will have the correct id for the next sample.

    The result of your program should look like:

    0     2062     G     A
    0     2902     T     C
    ... (more sequence variants in sample 0, 1, 2, 3 and 4) ...

    Answer the following questions:

    1. How many subtitutions does each sample have?
    2. Suppose that all the patients have very similar clinical profiles. Does this mean that all the patients have the same sequence variants in their APOE genes? That is, do you find that all five patients have the same variants in the same positions?
    3. Just by looking at the output, can you identify the most frequent sequence variant observed in the samples? What is it? List the index in the sequence, the reference allele, the variant allele, and in which samples this change occurs.

  2. Counting letter frequencies in text

    In this problem, you will write a script that reads some text from a file and computes the frequencies of each letter or character in that text. To do this, you will create a dict whose keys are the characters and whose values are the number of times each character has been seen so far.

    To start, initialize an empty dict. You will also need to recall the has_key() method for a dict object, as in:

    if mydict.has_key('Donna'):
        print "I already have an entry for Donna"
    This is useful because you cannot increment a dict entry if that entry does not yet exist, so you need some way to test whether to initialize it or to just add 1 to its value.

    Download the text files infile1 and infile2. You will use these as input, although to start out testing your methods, you might want to get a short input string from the keyboard.

    Since these input files are relatively short, you can read an entire input file into a single string, or use a single string of user input. Create a lowercase version of the string to work with, so that we don't need to worry about uppercase and lowercase letters. For now, don't worry about all the punctuation and space characters; your code can just treat them as letters.

    Now, write a loop that looks at each character in the string, and increments the dict's value for that character, making it a counter for the number of times that character has been read so far.

    Finally, when you have read and counted characters in the entire input sequence, print out the list of all the characters you've read, and the number of times you've seen each. To do this, here are some useful methods and functions you haven't officially seen yet. The keys() method (as in mydict.keys()) returns a list of all the keys of a dict. If you want them in order, you can use the sorted() function to sort a list. So sorted(mydict.keys()) returns an alphanumerically-sorted list of the keys of your dictionary.

Going Further:

Problem 5 below is substantial, but interesting. If you have a lot of time left, take a stab at that first. If not, work on problems 3 and 4, which are minor modifications of what you've already done. Looking at problem 5 may be interesting even if you don't have time to complete it in lab.

  1. Modify your code for problem 3 above to calculate not just the counts, but the frequencies, of each letter in the text. You can do this by dividing by the total number of characters, and for ease of reading you may want to multiply by 100 (making it a percentage). Print these out.

    The first input file, infile1, contains an excerpt from Shakespeare's "A Midsummer Night's Dream." The second, infile2, contains some paragraphs from "A Void," a translation of Georges Perec's novel "La Disparation."

    Once you've run your code on both input files, visually compare the observed frequencies of the vowels. What is unusual about Perec's work?

  2. Modify your code for problem 1 above to computationally identify and output the most frequent variant. Ensure that the program agrees with your results in question 1.iii.

  3. Distinguishing novel from known variants

    VCF format is a standard file format for identifying variants in sequence. The format is described at this link

    We have given you just the data portion (e.g., minus the header lines) from two vcf files: dbsnp.b37.vcf, and calls.b37.vcf. The first lists all the known SNPs in the dbSNP database of single nucleotide polymorphisms, limited to just human chromosomes 18-21. The second is a list of variant calls from a new sequence data set, restricted to human chromosome 20.

    We will explore some important metrics for evaluating a SNP set, but first we need to define several terms.

    Recall that a “novel” variant is one that is not present in the database of known variation (dbSNP), as opposed to a “known” variant. To call a variant "known," we generally do NOT require that the alternate allele be the same one present in the database, but rather just that the genomic location (i.e. chromosome and position) be the same as one in the database.

    A SNP is “filtered” out of the set by setting the value of the FILTER column in the VCF to anything but “PASS”, otherwise it is “passing” or “called.”

    Your job is to write a script or program to compute the following statistics:

    1. The total number of passing SNPs in the calls file. The expectation is that there is 1 SNP in a human genome every 1000 bases (approximately) and chromosome 20 is about 63M bases long.

    2. The % of passing SNPs in the set that are known. The expectation is that with older catalogs of variation (i.e. before the 1000 Genomes Project) we should see a rate of 90-95% in the database. This number is closer to 99% now that we have the 1000G data, but the sample being evaluated is from 1000G so we are using an older version of dbSNP.

    3. The % of filtered SNPs in the set that are known. If this number is too high it means that we have been too aggressive in our filtering.

    To do this, we suggest that you start by reading the known vcf file, dbsnp.b37.vcf. Create an empty dict. Read the header line first, using readline() or a Boolean flag. Then for each line that you read (after the header line), create a string containing the chromosome, a colon, and the position on the chromosome of that line's SNP. For example, "20:61098" represents a SNP at position 61098 on chromosome 21. Add that as the key to your dict. The value can be either the dbSNP ID (the string that starts with the letters "rs"), or a combination of the Reference and Alternative alleles, or even just the value True. (It won't matter, because you're not really going to use this value, just the fact that it does or doesn't exist.)

    Next, read the calls.b37.vcf file. For each SNP in this file, create the same key with the chromosome, colon, and position of the SNP, and then use the dict to figure out if it is known in dbSNP. Look at the FILTER column to determine whether it is passing or filtered. Use this information to compute the statistics you want.

    Submit your code and the values you obtained for each of the three statistics above.