## Comp 7 / Bio 40 Lab 10: the Incidentalome

### December 7, 2016

#### Purpose of this lab

To reinforce the lessons of Bayesian probability and the Incidentalome by introducing basic simulation with random numbers in Python.

Reading for this week
Incidentalome paper and Python documentation on the random() function and random module

1. Start simulating genetic test results

Write a program that uses random() to simulate random test results. The following code snippet will generate a single random number:

```import random

x = random.random() #x is between 0 and 1
```

For this first program, you will simulate a population of only 100 people, and an amazingly-accurate test with a 0.01 (one in a hundred) false positive rate and a 0 false negative rate (so that everybody who has the disease tests positive). The prior probability of the disease in the population is also one in a hundred (0.01). Your job will be to simulate randomly assigning disease status (whether or not they have the disease) and test results (positive or negative) to each of the people in this population, and to print out the cause and the subject id for each positive test result.

To start, create variables representing the values above describing the test and the population size. Name them in capital letters at the start of the program (recall that this is the convention for naming "constant" variables whose values don't change during the program). For example, you might start out with

`POP_SIZE = 100`

Now write a loop that goes through each person, and determines whether the i-th person has the disease with the appropriate probability, using random(). Once you've determined whether person i has the disease or not, you may wish to update one of three lists of length 100: one for disease status (so disease[i] == 1 if person i has the disease), one for false positive status (set it to 1 if the person does not have the disease but does test positive - this requires a different call to the random() function to generate a new random number to determine the chance of a false positive!), and one list that represents positive test results (set its i-th value to 1 if person i either has the disease or has a false positive result).

Recall that you can initialize the list mylist to contain listlen zeros, by using

`mylist = [0] * listlen`

Whenever a patient is either given the disease or a false positive test result, print a message, such as "person i has the disease" or "person i has a false positive test".

1. Run your program several times, with a new population of 100 each time. On average, in each run, about how many people have the disease? About how many have false positives?

2. What does a positive test result mean?

Next, make a new copy of the program. You're going to modify it to compute an empirical estimate of the probability that a patient has the disease given that they have a positive test. This probability is denoted Pr(D|P), where D is the predicate "the patient has the disease" and P is the predicate "the patient has a positive test result".

Now you will see why we kept an array with total positives (both true and false positives). To estimate Pr(D|P), just compute the fraction of the positive-testing patients who actually have the disease for the whole population. Recall that you can sum all the values in the list mylist and store the sum in the variable mysum using the command

`mysum = sum(mylist)`
Print this estimate of Pr(D|P) for each of several runs of the current version of your program.

Warning: dividing by zero will cause an error in any computer program! But you might not see this error every time you run your program, because your results depend on random numbers. So you should probably check that your denominator is not zero before performing any division. If it is zero, print a suitable message. Also, remember what you know about integer vs. floating point division in Python...

1. On average, over several runs of the program for your 100-person population, approximately what is the observed probability of disease given a positive test result? How does this estimate compare to the computed Pr(D|P) (i.e., use Bayes' Theorem to calculate what it should be) for this same set of parameters?

3. Scaling up

Once you are satisfied that your code is working, tweak the values of the variables to run the simulation described in the paper on page 213 column 1:
A population size of ten million, a prior probability of disease of 1 in 100,000, and a false positive test rate of one in one thousand. The simulation in the paper also had a non-zero false negative rate, but you can ignore this for now - there won't be enough true positives for the false negatives to affect your results.

Before you run this, comment out the print statements that print a message each time you assign someone a test result. Otherwise the printing will slow down your code, and there will be so many true and false positives reported in ten million trials that you won't be able to see them all anyway. Then, run the program with the new population size and probabilities. This is the version of this program that you should submit with your lab.

Even with this larger population, your program should take substantially less than a minute to run (our version takes about 12 seconds). But if you want reassurance, you can have it print a status message once every million patients.

1. What is your empirical estimate of Pr(D|P) for these new problem parameters? How does it compare to the true Pr(D|P) for these parameters, as discussed in the paper?

Going Further:

4. The simulation in Figure 1

Make one more copy of your program. In this copy, add another loop to simulate running multiple independent genetic tests. Remove all the code for computing Pr(D|P) at the end of your previous version of the program, and instead, keep track of how many patients have had at least one false positive test result. Develop your code and test it for just 100 simulated genetic tests. When you have it working, try emulating the simulation in the figure: Run 1000 tests, with a population of 100,000, no false negatives, a 1/10,000 false positive rate, and disease prevalence of 1/100,000 for each test. You can compare your results to those at the 1000-test mark in the figure. (I don't want you to wait long enough to simulate 10,000 tests.)

1. In the figure, approximately what percentage of the population has had at least one false positive test result after performing 1000 independent genetic tests? How does this compare to your results?

Enjoy tweaking the values of these and other simulations you write and seeing the outcomes. You can use this approach to simulate results for your own experiments. Have fun!

5. Practice with Dicts

You will need to use dicts for the project. This problem gives you some practice with them. Download the files geneids.txt and aliases.txt. Write a program that reads the first file and creates a dict mapping gene names (the first column in the file) to id numbers (the second column in the file). Now edit your program to create another dict that maps id numbers to gene aliases (the first and second columns respectively in the aliases.txt file). Finally, write a loop that asks the user for a gene name and prints out its alias, looking it up in the two dicts that you have created.

6. Course Evaluations

Please be sure to fill out your online course evaluations. It really helps us to hear from everyone. Thanks!