Idle Theory: Bayes' Theorem

Deriving Bayes' Theorem

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer? (An Intuitive Explanation of Bayesian Reasoning)

I was hunting around on the web trying to understand Bayes' Theorem when I came across the question above at the site linked, which promised an intuitive approach to it. It asked me first to make my own estimate of an answer to the question it posed. So I got out pen and paper, and started thinking. The answer looked like it would be somewhere up around 80%.

There are three probabilities here, which we can call P1, P2, and P3, where P1 = 0.01, P2 = 0.8, and P3 = 0.096.

Suppose some number of women, N = 1,000,000, are tested. Then the number of them with breast cancer, N1, will be given by

N1 = P1.N = 10,000 women (1)

Of these 10,000 women, some number, N2, will test rightly positive, and

N2 = P2.N1 = 8,000 women (2)

And then, of the remaining N - N1 = 990,000 women, who do not have breast cancer, some number, N3, will test falsely positive, and

N3 = P3.(N - N1) = 95,040 women. (3)

I was surprised so many women would wrongly test positive.

The probability P that someone who tests positive actually has breast cancer is

P = N2 / (N2 + N3) = 0.0776 or 7.76% (4)
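The counting argument in equations (1) to (4) can be replayed in a few lines of Python. This is only a sketch of the arithmetic above; the variable names mirror the ones in the text, and the choice of N is arbitrary, since any sample size gives the same ratio.

```python
# Counting argument from equations (1) to (4).
P1 = 0.01    # fraction of screened women who have breast cancer
P2 = 0.8     # fraction of women with cancer who test positive
P3 = 0.096   # fraction of women without cancer who test positive

N = 1_000_000          # women tested (arbitrary; it cancels out later)
N1 = P1 * N            # (1) women with breast cancer
N2 = P2 * N1           # (2) women who rightly test positive
N3 = P3 * (N - N1)     # (3) women who falsely test positive
P = N2 / (N2 + N3)     # (4) share of positive tests that are real cancers

print(f"N1 = {N1:,.0f}, N2 = {N2:,.0f}, N3 = {N3:,.0f}")
print(f"P = {P:.4f}")
```

Running it should reproduce the figures above: 10,000, 8,000, and 95,040 women, and P = 0.0776.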

I then skipped down the page to see if I'd got the right answer. I had.

I didn't read any more of the webpage, because it occurred to me that I could bundle my reasoning into a single equation for P by substituting the component terms of N2 and N3 into equation (4).

So P = P1.P2.N / ( (N - P1.N).P3 + P1.P2.N )

The N's above and below cancel out leaving

P = P1.P2 / ( (1 - P1).P3 + P1.P2 ) (5)

I then checked equation (5), and it gave P = 7.76%.
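Equation (5) is short enough to wrap in a function. The sketch below assumes nothing beyond the equation itself; the function name `posterior` and its parameter names are my own labels, not anything from the linked page.

```python
def posterior(p1: float, p2: float, p3: float) -> float:
    """Equation (5): P = P1.P2 / ((1 - P1).P3 + P1.P2)."""
    rightly_positive = p1 * p2          # has the condition and tests positive
    falsely_positive = (1 - p1) * p3    # lacks the condition but tests positive
    return rightly_positive / (falsely_positive + rightly_positive)


# Mammography figures from the question at the top.
p = posterior(0.01, 0.8, 0.096)
print(f"P = {p:.2%}")
```

Note that N never appears: the function works directly on the three probabilities, which is exactly what the cancellation bought.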

Was this Bayes' Theorem? Scrolling down to the bottom, I found Bayes' Theorem expressed as

p(A|X) = p(X|A)*p(A) / ( p(X|A)*p(A) + p(X|~A)*p(~A) ) (6)

This certainly seemed to be of the same form, but in a notation I found incomprehensible. I guessed p(A) to be my known probability P1, and p(X|A) my right positive probability P2, the product of which appeared twice. That would leave p(X|~A) as my false positive probability P3, and p(~A) as 1 - P1. It looked like I had the same thing, but with a different notation.

I tried it out on another problem, from Wikipedia.

Bayes' theorem is useful in evaluating the result of drug tests. Suppose a certain drug test is 99% accurate, that is, the test will correctly identify a drug user as testing positive 99% of the time, and will correctly identify a non-user as testing negative 99% of the time. This would seem to be a relatively accurate test, but Bayes' theorem will reveal a potential flaw. Let's assume a corporation decides to test its employees for opium use, and 0.5% of the employees use the drug. We want to know the probability that, given a positive drug test, an employee is actually a drug user. (Answer 0.3322)

OK, in this case P1, the proportion of employees who actually use the drug, is 0.5%. P2, the probability that a user rightly tests positive, is 99%. And P3, the probability that a non-user wrongly tests positive, is 1%. Using these values in equation (5), P came out at 0.3322. It worked again.
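The same check can be written out for the drug test. This is a sketch under the reading used above: the "99% accurate" figure is taken to apply separately to users and to non-users, so the false positive rate P3 is one minus the accuracy. The function and variable names are mine.

```python
def posterior(p1: float, p2: float, p3: float) -> float:
    """Equation (5): P = P1.P2 / ((1 - P1).P3 + P1.P2)."""
    return (p1 * p2) / ((1 - p1) * p3 + p1 * p2)


accuracy = 0.99        # stated accuracy for both users and non-users
P1 = 0.005             # proportion of employees who use the drug
P2 = accuracy          # users who rightly test positive
P3 = 1 - accuracy      # non-users who wrongly test positive

p = posterior(P1, P2, P3)
print(f"P = {p:.4f}")
```

Even with a test that is right 99% of the time, only about a third of the positive results belong to actual users, because non-users outnumber users by 199 to 1.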

Equation (5) is Bayes' Theorem, as I derived it.

Bayes' Theorem is shrouded in mystery, but having derived it myself, I couldn't see what the fuss was about. It's just a bit of probability theory that throws up some surprising results. I hadn't expected 7.8% to be the answer to that first question.

What puzzled me was where you'd empirically get P1 and P2 and P3 from. How do you know that 1% of women at age forty have breast cancer? How do you know that 80% of these test positive? How do you know that 9.6% of those without it wrongly test positive? There must be some way of ascertaining whether women have breast cancer independent of mammography to get these figures.

I guess that maybe what happens is that a lot of women are tested, and some test positive and some negative. Of these, some who test positive don't go on to develop full-blown breast cancer, and some who test negative do go on to develop full-blown breast cancer. So from this you can estimate P2 and P3, if records of tests are compared with outcomes. And maybe, given the overall numbers of women who develop breast cancer and those who don't, you can get an idea of actual breast cancer incidence, P1.
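As a sketch of that bookkeeping, suppose test records have been matched against later outcomes and sorted into four piles. The counts below are not real data: they are simply the numbers from the worked example above (with 2,000 missed cancers and 894,960 correct negatives filled in by subtraction), and the variable names are hypothetical.

```python
# Hypothetical follow-up table, built from the worked example's own numbers.
positive_with_cancer = 8_000        # tested positive, cancer confirmed later
negative_with_cancer = 2_000        # tested negative, cancer developed anyway
positive_without_cancer = 95_040    # tested positive, no cancer developed
negative_without_cancer = 894_960   # tested negative, no cancer developed

with_cancer = positive_with_cancer + negative_with_cancer
without_cancer = positive_without_cancer + negative_without_cancer
total = with_cancer + without_cancer

P1 = with_cancer / total                        # incidence among those tested
P2 = positive_with_cancer / with_cancer         # right positive rate
P3 = positive_without_cancer / without_cancer   # false positive rate

print(f"P1 = {P1:.3f}, P2 = {P2:.3f}, P3 = {P3:.3f}")
```

This only recovers the three probabilities if the later outcome really does settle who had cancer at the time of the test, which is the assumption the estimate rests on.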

The only problem with this is that while you might get P1, P2, and P3 on a trial basis, what happens if those women who test positive with mammography are then treated in some way, and their cancer cured? How do you know whether you've cured breast cancer or simply wrongly diagnosed it? Was the cancer cured, or did they maybe just never have cancer anyway?

The same sort of questions arise with the second puzzle, the drug test. How do they know how many employees use opium? Do they ask them whether they use it? How do they know that 99% of the time they can identify users and non-users? There has to be some independent way of finding out other than the drug test.

Author: Chris Davis
First created: 21 June 2007