Comp 7 / Bio 40 Lab 7
Biomart and coding practice

November 12, 2019

Purpose of this lab

To give students practice working with Python dictionaries and more complex control structures while exploring the BioMart interface to the Ensembl databases.

Lab problems

  1. About Ensembl

    The Ensembl databases include a vast amount of data on genomes, sequences, and variantion. Until now we've mostly been using NCBI IDs and data. Ensembl has slightly different criteria for listing transcripts and homology, for example.

    Start your queries at https://useast.ensembl.org/index.html. Choose "human" and type "BRCA2" in the query window and click "Go." You will get a long list of records related to human BRCA2, including the Gene entry (just one of these), a long list of transcripts, phenotypes, variants, etc. However, the first entry should say "BRCA2 (Human Gene)". Click on this.

    (Note: Ensembl Gene identifiers start with ENSG and some numbers; Transcript identifiers start with ENST. You can guess what ENSP stands for.)

    Look at the table of transcripts for BRCA2.

    1. How many protein coding transcripts are shown in Ensembl? How does this compare to what you see in NCBI Gene when you look for BRCA2 or the Ensembl Gene ID of this gene? (To find transcripts, scroll down to the map image of the mRNA transcripts under "Genomic regions, transcripts, and products" or look under the mRNA RefSeq identifiers.)

    Scroll down on the Ensembl Gene page to see the map overview of all the different transcripts, coding and otherwise.

    Click on the Ensembl ID of the first transcript in the table, ENST00000380152.7. Scroll down under summary to see the intron/exon structure of the transcript. To see the sequence of that transcript, look in the left-hand menu column headed by "Transcript-based displays" and under Sequence click on "cDNA." Click on the question mark in a circle next to "cDNA Sequence" to find a description of the three lines of the sequence data.

    1. What are the first five amino acids shown for the protein sequence encoded by this transcript?

  2. Building queries in BioMart

    Clearly, Ensembl includes a lot of sequence data in its databases, and if you just want to browse through a single sequence, you can view it as above. But we've been starting to talk about thinking about sets of genes or transcripts or proteins as a group. What if you wanted to get the sequences for a whole list of genes or transcripts? Would you have to view each in the Ensembl Transcript window, click on "cDNA," click on "Download this sequence," and cut and paste the sequences from each file created this way into one shared fasta file?

    Of course not.

    Ensembl BioMart is a web-based interface that allows you to perform fairly complex queries of the Ensembl databases without learning database query languages. You can find and download almost any data from the Ensembl databases using this tool.

    Go back to https://useast.ensembl.org/index.html and at the top of the page, in the box labeled "Tools," click on "BioMart." You will first have to select a database from the "CHOOSE DATABASE" pull-down. Pick Ensembl Genes 98 at the top of the list. You will then be shown a "CHOOSE DATASET" pull-down; pick the "Human genes (GRCh38.p13)" option. (GRCh38.p13 refers to the version of the human genome being used: Genome Reference Consortium Human Build 38.)

    For any BioMart query, you will need to specify both "filters" and "attributes" from a long list of possibilities for both. Filters refers to which ensembl entities, such as genes, you want to look at. You may pick using various identifiers, or in chromosomal regions, or by phenotype, or by uploading a list, among other options.

    For this lab, we will upload this list of Ensembl gene identifiers for our filter. (Note that you don't have to have Ensembl identifiers already; you can filter on NCBI Gene ID numbers, gene symbols, or many other identifiers as well.) Save the list from the web page to your computer. To load this list, click on "Filters" in the Menu on the left. Expand the Gene box by clicking on the "+" to the left of the filter type "GENE."

    Where it says "Input external references ID list," make sure that the first option, "Gene Stable IDs", is selected in the pull-down of identifier types (though feel free to browse to see how many other sorts of identifiers are accepted), and then click the Browse button to upload the file you saved of ensembl gene identifiers.

    Next, click "Attributes" in the left-hand menu window. You have to choose one of several attribute categories, corresponding to the radio buttons at the top of the central panel. For now, leave "Features" selected (the default).

    Expand the GENE box by clicking on the "+" and un-select Transcript Stable ID and Transcript Stable ID version (but leave the Gene Stable ID boxes checked). Instead, select "Gene Name" and "Chromosome/scaffold name."

    Above the left-hand menu column, click on the rectangle labeled "Results" to run the query. Scroll through the output. Look at the reformatting and export options if you like.

    1. On what chromosomes are the genes WNT16 and NFYA?

  3. Finding sequence data in BioMart

    Suppose that you want to obtain the genomic sequences and 200 bp of upstream flanking sequences of all genes in your gene lists. There are a couple of ways to do this in BioMart. One is to figure out what the chromosomal locations of the start and end positions are, and to use this to find the exact genomic coordinates you want. (Notice that under the Attributes list of Features are options "Gene start (bp)" and "Gene end (bp)".)

    Another approach is to run two separate sequence queries. A strength of BioMart is that it knows exactly where genes and transcripts are located for a particular version of the genome; you can just request the genomic sequence a specific gene, and you will get it. You can also request upstream or downstream flanking sequence. Unfortunately, you can't do all of these at once.

    Thus, we will run two queries: one to get the genomic sequence of each of the genes, and another to get the upstream flanking sequence. We'll save each of these to a file, and then in the next problem, you'll write a script to knit these together.

    To save you the problem that there are multiple transcripts for each gene, we also created a file containing a transcript list of transcripts for most of the genes on the prior list. We picked relatively short transcripts for all of them.

    Save this file first. Go back and edit the Filters tab to use "Transcript Stable IDs" as the identifier type under GENE/Input external references ID list, and have it upload this file.

    Click on "Attributes" in the left-hand menu again. In the box of radio buttons at the top of the center panel, where "Features" is currently selected, click on "Sequences" instead. Expand "SEQUENCES" and select "Unspliced (Transcript)." Click Results to run the query. At the top of the main panel, export all the results to a fasta file and save that file (without opening it, if possible) to your computer. You may want to name it something informative like transcriptseq.txt, rather than the default, "mart_export.txt", especially because the next file will be saved as mart_export.txt too and you'd like it to be different.

    Click on "Attributes" again. Under "Sequences" choose "Flank (Transcript)" instead of "Unspliced (Transcript)." Below the heading "Sequences" is another labeled "Upstream flank." Click on this and enter the value 200. Run the query and save the file. Rename it to something informative like upstream.txt.

    Upload both of these files.

  4. Integrating data files in Python

    Write a program, combineseqs.py, to combine the two pieces of the sequences into one. Note that while the two files should include transcript sequence and upstream sequence from the same transcripts, they will very likely not include them in the same order! So you will need to use dictionaries to match one to the other.

    First, have your program open the upstream sequence file. Read the file line by line. If the line is a header line, remove the ">" character from it and use the rest of the line as a dictionary key. (It has four names in it, separated by "|": the gene, the gene with version number, the transcript, and the transcript with version number. Don't worry about separating these for now - just use the whole thing as a key.) Also create (or reset) an empty string accumulator variable to use when you read the sequence.

    If it's not a header line, add your sequence to the accumulator variable.

    When you find the next header line, don't forget to save the sequence you've accumulated as the value corresponding to the previous key before resetting the accumulator.

    When you have finished reading the upstream sequence file, you should have a dictionary whose keys are the header lines (minus the ">" character) and whose values are the 200bp of upstream sequence for each.

    Next, open an output file for writing. Then, read the transcript sequence files. These sequences are longer. The good news is that you won't have to save these longer sequences in a dictionary. Instead, you can read the sequences the same way you did above (think about code re-use here). When you have accumulated the sequence for a particular transcript, print the header to the output file. Create a combined sequence using the dictionary with the saved upstream sequence and the newly-accumulated transcript sequence, and write to the output file (you can do this as one line for now; we'll format it nicely in the Going Further section).

    Close all the files when you are done. Test your code and make sure the first and last sequences in each file are handled properly!


Going Further:

  1. Nicer formatting for your integrated sequence files

    Create a new version of combineseqs.py, called prettyseqs.py, that prints the output more nicely.

    Specifically, write a function, printFasta(header,seq), that prints out a sequence seq in fasta format using the header content header. Each sequence line should be up to 80 characters long. (That is, each line is 80 characters except the last, which may be shorter.)

    Use this function to print the output of your combined sequences into a fasta-formatted output file containing all the sequences.

  2. Sequence variation in APOE

    Save the file apoe_sequences.txt from the labs web page. This is a file that contains the reference sequence of the APOE gene, along with the sequences of APOE genes from five different patients with hyperlipidemia, thought to be associated with some APOE variants. Your goal in this problem is to identify all variants in the patients' sequences compared to the reference sequence. (In this case, you may assume that all the sequences are perfectly aligned and there are no gaps, insertions, or deletions. All you need to do is identify single base substitutions.) Such substitutions might, in a larger data set, serve as candidate variants for association with hyperlipidemia.

    There are six lines in the file. The very first line is the APOE reference sequence. Read the first line and store it in a variable called reference. Then define a variable called sampleid and set its initial value to be 0. You should increment this value everytime you read a line from the file, and use this number as the id for each sample (patient). That is, sampleid should be an integer from 0-4, corresponding to each of our five patients.

    Read the file line by line and store each line in a variable called line. Once you have read a line, use another loop to loop through each character in line and compare it with the corresponding character in the same position of the reference sequence that you have already stored. In other words, you are comparing line[i] and reference[i] for each position i. If line[i] is NOT same as reference[i], it means that there is a subtitution at position i.

    Everytime you find a substitution, report it by printing out sampleid, the index (i) of the position with the substitution, the letter at that position in the reference sequence (reference), and the letter at that position in the sample sequence (line). Increment the sampleid at the end of the outer loop (in which you are reading and processing a line from the file), so that you will have the correct id for the next sample.

    The result of your program should look like:

    0     2062     G     A
    0     2902     T     C
    ... (more sequence variants in sample 0, 1, 2, 3 and 4) ...

    Answer the following questions:

    1. How many subtitutions does each sample have?
    2. Suppose that all the patients have very similar clinical profiles. Does this mean that all the patients have the same sequence variants in their APOE genes? That is, do you find that all five patients have the same variants in the same positions?
    3. Just by looking at the output, can you identify the most frequent sequence variant observed in the samples? What is it? List the index in the sequence, the reference allele, the variant allele, and in which samples this change occurs.