Reading for this week
Incidentalome paper and
Python documentation on the random() function and random module
Write a program that uses random() to simulate random test results. The following code snippet will generate a single random number:
import random x = random.random() #x is between 0 and 1
For this first program, you will simulate a population of only 100 people, and an amazingly-accurate test with a 0.01 (one in a hundred) false positive rate and a 0 false negative rate (so that everybody who has the disease tests positive). The prior probability of the disease in the population is also one in a hundred (0.01).
Your job for this first part of the problem will be to simulate randomly assigning disease status (whether or not they have the disease) and test results (positive or negative) to each of the people in this population, and to print out the cause and the subject id for each positive test result.
Structuring your program
To start, create variables representing the values above describing the test and the population size. Name them in capital letters at the start of the program (recall that this is the convention for naming "constant" variables whose values don't change during the program). For example, you might start out with
POP_SIZE = 100
Now write a loop that goes through each person, and determines whether the i-th person has the disease with the appropriate probability, using random(). You also need to figure out if they test positive or negative, based on their disease status.
Once you've determined whether person i has the disease or not, you may wish to update one of three lists as long as the population size (currently 100): one for disease status (so disease[i] == True if person i has the disease), one for false positive status (set it to True if the person does not have the disease but does test positive - this requires a different call to the random() function to generate a new random number to determine the chance of a false positive!), and one list that represents positive test results (set its i-th value to True if person i either has the disease and tests positive, or has a false positive result).
Note that you can initialize a list mylist to contain listlen zeros through the shorthand
mylist = [0] * listlen. Similarly, you can initialize lists to have listlen Boolean values by e.g. [False] * listlen .
Whenever a patient is either given the disease or a false positive test result, print a message, such as "person i has the disease" or "person i has a false positive test".
Submit this version of your code for problem 1.
Next, make a new copy of the program. You're going to modify it to compute an empirical estimate of the probability that a patient has the disease given that they have a positive test. This probability is denoted Pr(D|P), where D is the predicate "the patient has the disease" and P is the predicate "the patient has a positive test result".
Now you will see why we kept an array with total positives (both true and false positives). To estimate Pr(D|P), just compute the fraction of the positive-testing patients who actually have the disease.
Print this estimate of Pr(D|P) for each of several runs of the current version of your program.
Warning: dividing by zero will cause an error in any computer program! But you might not see this error every time you run your program, because your results depend on random numbers. So you should probably check that your denominator is not zero before performing any division. If it is zero, print a suitable message. Also, remember what you know about integer vs. floating point division in Python 2.
Submit this version of your code for problem 2.
Once you are satisfied that your code is working, tweak the values of the variables
to run the simulation described in the paper on page 213 column 1:
A population size of ten million, a prior probability of disease of
1 in 100,000, and a false positive test rate of one in one thousand.
The simulation in the paper also had a non-zero false negative rate,
but you can ignore this for now - there won't be enough true positives
for the false negatives to affect your results.
Before you run this, comment out the print statements that print a message each time you assign someone a test result. Otherwise the printing will slow down your code, and there will be so many true and false positives reported in ten million trials that you won't be able to see them all anyway. Then, run the program with the new population size and probabilities.
Even with this larger population, your program should take substantially less than a minute to run (our version takes about 12 seconds). But if you want reassurance, you can have it print a status message once every million patients.
Submit this version of your code for problem 3.
Going Further:
The simulation in Figure 1
Make one more copy of your program. In this copy, add another loop to simulate running multiple independent genetic tests. Remove all the code for computing Pr(D|P) at the end of your previous version of the program, and instead, keep track of how many patients have had at least one false positive test result. Develop your code and test it for just 100 simulated genetic tests. When you have it working, try emulating the simulation in the figure: Run 1000 tests, with a population of 100,000, no false negatives, a 1/10,000 false positive rate, and disease prevalence of 1/100,000 for each test. You can compare your results to those at the 1000-test mark in the figure. (I don't want you to wait long enough to simulate 10,000 tests.)
Enjoy tweaking the values of these and other simulations you write and seeing the outcomes. You can use this approach to simulate results for your own experiments. Have fun!
Add in false negatives
The simulation in the paper does have a false negative rate. Accounting for this may require changing your code, particularly how you compute Pr(D|P). Add in the false negative rate from the paper and try to adjust your code to handle it. Re-run the results for the larger simulation from Problem 3 using a false negative rate of one in one thousand.
You will need to use dicts for project 3. This problem gives you some practice with them. Download the files geneids.txt and aliases.txt. Write a program that reads the first file and creates a dict mapping gene names (the first column in the file) to id numbers (the second column in the file). Now edit your program to create another dict that maps id numbers to gene aliases (the first and second columns respectively in the aliases.txt file). Finally, write a while loop that asks the user for a gene name and prints out its alias, looking it up in the two dicts that you have created.
Please be sure to fill out your online course evaluations. It really helps us to hear from everyone. Thanks!