Comp 7 / Bio 40 Lab 9: the Incidentalome

December 3, 2019

Purpose of this lab

To reinforce the lessons of Bayesian probability and the Incidentalome by introducing basic simulation with random numbers in Python.

Reading for this week
Incidentalome paper and Python documentation on the random() function and random module

  1. Start simulating genetic test results

    Write a program that uses random() to simulate random test results. The following code snippet will generate a single random number:

    import random
    
    x = random.random() #x is between 0 and 1
    

    For this first program, you will simulate a population of only 100 people, and an amazingly-accurate test with a 0.01 (one in a hundred) false positive rate and a 0 false negative rate (so that everybody who has the disease tests positive). The prior probability of the disease in the population is also one in a hundred (0.01).

    Your job for this first part of the problem will be to simulate randomly assigning disease status (whether or not they have the disease) and test results (positive or negative) to each of the people in this population, and to print out the cause and the subject id for each positive test result.

    Structuring your program

    To start, create variables representing the values above describing the test and the population size. Name them in capital letters at the start of the program (recall that this is the convention for naming "constant" variables whose values don't change during the program). For example, you might start out with

    POP_SIZE = 100

    Now write a loop that goes through each person, and determines whether the i-th person has the disease with the appropriate probability, using random(). You also need to figure out if they test positive or negative, based on their disease status.

    Once you've determined whether person i has the disease or not, you may wish to update one of three lists as long as the population size (currently 100): one for disease status (so disease[i] == True if person i has the disease), one for false positive status (set it to True if the person does not have the disease but does test positive - this requires a different call to the random() function to generate a new random number to determine the chance of a false positive!), and one list that represents positive test results (set its i-th value to True if person i either has the disease and tests positive, or has a false positive result).

    Note that you can initialize a list mylist to contain listlen zeros through the shorthand

    mylist = [0] * listlen
    . Similarly, you can initialize lists to have listlen Boolean values by e.g. [False] * listlen .

    Whenever a patient is either given the disease or a false positive test result, print a message, such as "person i has the disease" or "person i has a false positive test".

    1. Run your program several times, with a new population of 100 each time. On average, in each run, about how many people have the disease? About how many have false positives?

    Submit this version of your code for problem 1.

  2. What does a positive test result mean?

    Next, make a new copy of the program. You're going to modify it to compute an empirical estimate of the probability that a patient has the disease given that they have a positive test. This probability is denoted Pr(D|P), where D is the predicate "the patient has the disease" and P is the predicate "the patient has a positive test result".

    Now you will see why we kept an array with total positives (both true and false positives). To estimate Pr(D|P), just compute the fraction of the positive-testing patients who actually have the disease.

    Print this estimate of Pr(D|P) for each of several runs of the current version of your program.

    Warning: dividing by zero will cause an error in any computer program! But you might not see this error every time you run your program, because your results depend on random numbers. So you should probably check that your denominator is not zero before performing any division. If it is zero, print a suitable message. Also, remember what you know about integer vs. floating point division in Python 2.

    1. Run the program several times for your 100-person population. What sort of range do you see for the probability of disease given a positive test result? How much does this probability change across multiple runs? (That is, is this measure highly variable?) What is the approximate average of this probability over multiple runs? (you can just estimate). How does this estimate compare to the computed Pr(D|P) (i.e., use Bayes' Theorem to calculate what it should be) for this same set of parameters? What happens if you change the population size to 10,000 from 100?

    Submit this version of your code for problem 2.

  3. Scaling up

    Once you are satisfied that your code is working, tweak the values of the variables to run the simulation described in the paper on page 213 column 1:
    A population size of ten million, a prior probability of disease of 1 in 100,000, and a false positive test rate of one in one thousand. The simulation in the paper also had a non-zero false negative rate, but you can ignore this for now - there won't be enough true positives for the false negatives to affect your results.

    Before you run this, comment out the print statements that print a message each time you assign someone a test result. Otherwise the printing will slow down your code, and there will be so many true and false positives reported in ten million trials that you won't be able to see them all anyway. Then, run the program with the new population size and probabilities.

    Even with this larger population, your program should take substantially less than a minute to run (our version takes about 12 seconds). But if you want reassurance, you can have it print a status message once every million patients.

    1. What is your empirical estimate of Pr(D|P) for these new problem parameters? How does it compare to the true Pr(D|P) for these parameters, as discussed in the paper?

    Submit this version of your code for problem 3.


    Going Further:

  4. The simulation in Figure 1

    Make one more copy of your program. In this copy, add another loop to simulate running multiple independent genetic tests. Remove all the code for computing Pr(D|P) at the end of your previous version of the program, and instead, keep track of how many patients have had at least one false positive test result. Develop your code and test it for just 100 simulated genetic tests. When you have it working, try emulating the simulation in the figure: Run 1000 tests, with a population of 100,000, no false negatives, a 1/10,000 false positive rate, and disease prevalence of 1/100,000 for each test. You can compare your results to those at the 1000-test mark in the figure. (I don't want you to wait long enough to simulate 10,000 tests.)

    1. In the figure, approximately what percentage of the population has had at least one false positive test result after performing 1000 independent genetic tests? How does this compare to your results?

      Enjoy tweaking the values of these and other simulations you write and seeing the outcomes. You can use this approach to simulate results for your own experiments. Have fun!

  5. Add in false negatives

    The simulation in the paper does have a false negative rate. Accounting for this may require changing your code, particularly how you compute Pr(D|P). Add in the false negative rate from the paper and try to adjust your code to handle it. Re-run the results for the larger simulation from Problem 3 using a false negative rate of one in one thousand.

  6. Practice with Dicts

    You will need to use dicts for project 3. This problem gives you some practice with them. Download the files geneids.txt and aliases.txt. Write a program that reads the first file and creates a dict mapping gene names (the first column in the file) to id numbers (the second column in the file). Now edit your program to create another dict that maps id numbers to gene aliases (the first and second columns respectively in the aliases.txt file). Finally, write a while loop that asks the user for a gene name and prints out its alias, looking it up in the two dicts that you have created.

  7. Course Evaluations

    Please be sure to fill out your online course evaluations. It really helps us to hear from everyone. Thanks!