John Derbyshire has posted an interesting problem/puzzle, discussed in some detail here. Briefly, the problem is this:
A man says: “I have two children. One is a boy born on a Tuesday. What is the probability I have two boys?”
The answer (according to Derb) seems to be 13/27. This is very odd. I explore it a bit below, in my own way, with a little Python.
Testing
When I first read this problem, I spent about 30s thinking about it, said “probably 1/3”, and immediately coded up some tests in Python. (As I may have mentioned before, I have little faith in the power of reason when divorced from experiment.)
# Child is (sex, birthday)
# Sex: 0->Boy, 1->Girl
# Birthday: 0->Sunday, 1->Monday, ..., 6->Saturday
import random
def makechild(): return (random.randint(0,1), random.randint(0,6))
def makefamily(): return (makechild(), makechild())
families = [makefamily() for i in range(1000000)]
def boyP(child): return child[0] == 0
def tuesdaysBoyP(child): return (child[0] == 0) and (child[1] == 2)
def hasTuesdaysBoyP(family): return tuesdaysBoyP(family[0]) or tuesdaysBoyP(family[1])
def hasTwoBoysP(family): return boyP(family[0]) and boyP(family[1])
familiesWithTuesdaysBoy = filter(hasTuesdaysBoyP, families)
print len(filter(hasTwoBoysP, familiesWithTuesdaysBoy))/float(len(familiesWithTuesdaysBoy))
This returns an answer of 0.481444362136 (in my test), which is pretty darn close to 13/27 (0.481481…). This result surprised me; as I mentioned, I was expecting something more along the lines of 1/3. What’s going on?
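How close is "pretty darn close"? Here is a rough check, assuming the simulated families behave like independent draws. About 27/196 of the million families should pass the filter, and the standard error of an observed proportion is sqrt(p(1-p)/n). The variable names are mine, and the snippet is written for Python 3 rather than the Python 2 used elsewhere in this post.

```python
import math

p = 13 / 27                      # exact answer the simulation is estimating
n = 1000000 * 27 / 196           # expected number of families passing the filter
se = math.sqrt(p * (1 - p) / n)  # standard error of the observed proportion

observed = 0.481444362136        # the value reported above
print("standard error:   %.5f" % se)
print("observed - exact: %.6f" % (observed - p))
print("standard errors away from 1/3: %.0f" % ((observed - 1 / 3) / se))
```

The standard error comes out at roughly 0.0013, so the observed value sits well inside one standard error of 13/27 and more than a hundred standard errors away from 1/3.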
1/3
First, a word about why I expected a result of 1/3. If we ignore the seemingly-irrelevant “Tuesday” datum, we can reason like this:
There is a 50% chance that the man’s first child (birth order) is a boy, and an independent 50% chance that the man’s second child is a boy. So, knowing only that the man has two children, his potential families are evenly distributed between these four options:
- Boy, Boy
- Boy, Girl
- Girl, Boy
- Girl, Girl
If the man tells us that he has at least one son, that eliminates the (Girl, Girl) option, and his potential families are now evenly distributed between these three options:
- Boy, Boy
- Boy, Girl
- Girl, Boy
Exactly one third of his potential families have two sons. Experiment agrees with this reasoning; this code:
def hasBoyP(family): return boyP(family[0]) or boyP(family[1])
familiesWithBoy = filter(hasBoyP, families)
print len(filter(hasTwoBoysP, familiesWithBoy))/float(len(familiesWithBoy))
returns an answer of 0.332656581599 when I run it, which strikes me as close enough.
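The four-option argument can also be checked exactly, without any randomness. This is a minimal sketch in Python 3 that ignores birthdays altogether; the names are mine and use the same 0/1 encoding for sex as the code above.

```python
from fractions import Fraction
from itertools import product

BOY, GIRL = 0, 1   # same encoding as the code above

# The four equally likely (first child, second child) combinations.
sex_pairs = list(product([BOY, GIRL], repeat=2))

# Keep families with at least one boy, then those where both are boys.
with_boy = [pair for pair in sex_pairs if BOY in pair]
two_boys = [pair for pair in with_boy if pair == (BOY, BOY)]

p_two_boys = Fraction(len(two_boys), len(with_boy))
print(p_two_boys)   # 1/3
```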
13/27
But, to return to the problem as stated: Why does the “Tuesday” datum affect the probabilities the way it does? Well, to be honest, I’m not sure. Cribbing heavily from Derb’s explanation, however, you can reason about it this way:
To begin with, although a large (pseudo-)random population is useful for testing, its very size can make it hard to see what’s going on. To simplify matters, let’s realize we’re dealing with four independent variables:
- First child’s sex
- First child’s birthday
- Second child’s sex
- Second child’s birthday
There are 2*7*2*7 = 196 equally likely families, so these are all we really need to think about. Indeed, when we restrict the set of families to an enumeration of the 196 possibilities, the numbers come out exactly right:
families = [((s0, b0), (s1, b1)) for s0 in range(2) for b0 in range(7) for s1 in range(2) for b1 in range(7)]
familiesWithTuesdaysBoy = filter(hasTuesdaysBoyP, families)
print len(filter(hasTwoBoysP, familiesWithTuesdaysBoy))/float(len(familiesWithTuesdaysBoy))
familiesWithBoy = filter(hasBoyP, families)
print len(filter(hasTwoBoysP, familiesWithBoy))/float(len(familiesWithBoy))
This code outputs results of 0.481481481481 and 0.333333333333, just as one would expect.
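If you would rather see the fractions themselves than their decimal expansions, the same enumeration can be done with exact arithmetic. A self-contained Python 3 sketch follows; the helper names (has_tuesday_boy, conditional and so on) are my own, not part of the code above.

```python
from fractions import Fraction
from itertools import product

BOY, TUESDAY = 0, 2   # same encoding as the code above

children = list(product(range(2), range(7)))       # 14 (sex, birthday) pairs
all_families = list(product(children, repeat=2))   # 196 ordered pairs of children

def has_tuesday_boy(family):
    return (BOY, TUESDAY) in family

def has_boy(family):
    return any(sex == BOY for sex, _ in family)

def has_two_boys(family):
    return all(sex == BOY for sex, _ in family)

def conditional(event, given):
    """P(event | given) over the equally likely families, as an exact fraction."""
    kept = [f for f in all_families if given(f)]
    return Fraction(sum(1 for f in kept if event(f)), len(kept))

print(conditional(has_two_boys, has_tuesday_boy))  # 13/27
print(conditional(has_two_boys, has_boy))          # 1/3
```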
We can actually see the set of families with a boy born on Tuesday:
sexes = ["Boy", "Girl"]
days = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
def printFamily(family):
    print "(%4s, %-9s), (%4s, %-9s)"%(sexes[family[0][0]],days[family[0][1]],sexes[family[1][0]],days[family[1][1]])
for f in familiesWithTuesdaysBoy: printFamily(f)
This outputs:
( Boy, Sunday   ), ( Boy, Tuesday  )
( Boy, Monday   ), ( Boy, Tuesday  )
( Boy, Tuesday  ), ( Boy, Sunday   )
( Boy, Tuesday  ), ( Boy, Monday   )
( Boy, Tuesday  ), ( Boy, Tuesday  )
( Boy, Tuesday  ), ( Boy, Wednesday)
( Boy, Tuesday  ), ( Boy, Thursday )
( Boy, Tuesday  ), ( Boy, Friday   )
( Boy, Tuesday  ), ( Boy, Saturday )
( Boy, Tuesday  ), (Girl, Sunday   )
( Boy, Tuesday  ), (Girl, Monday   )
( Boy, Tuesday  ), (Girl, Tuesday  )
( Boy, Tuesday  ), (Girl, Wednesday)
( Boy, Tuesday  ), (Girl, Thursday )
( Boy, Tuesday  ), (Girl, Friday   )
( Boy, Tuesday  ), (Girl, Saturday )
( Boy, Wednesday), ( Boy, Tuesday  )
( Boy, Thursday ), ( Boy, Tuesday  )
( Boy, Friday   ), ( Boy, Tuesday  )
( Boy, Saturday ), ( Boy, Tuesday  )
(Girl, Sunday   ), ( Boy, Tuesday  )
(Girl, Monday   ), ( Boy, Tuesday  )
(Girl, Tuesday  ), ( Boy, Tuesday  )
(Girl, Wednesday), ( Boy, Tuesday  )
(Girl, Thursday ), ( Boy, Tuesday  )
(Girl, Friday   ), ( Boy, Tuesday  )
(Girl, Saturday ), ( Boy, Tuesday  )
At this point, you can just count: 27 of the possible 196 families have at least one boy born on a Tuesday, and 13 of those have two boys.
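The count also has a simple closed form, which makes it easy to see what happens as the extra datum gets more or less specific. The sketch below (Python 3) uses a hypothetical helper, two_boys_given_marked_boy, and assumes the attribute takes n equally likely values independently of sex; n = 7 is the days-of-the-week case, and n = 1 is the case where the attribute tells you nothing.

```python
from fractions import Fraction

def two_boys_given_marked_boy(n):
    """P(two boys | at least one boy with a specific 1-in-n attribute).

    n is the number of equally likely attribute values (7 for days of the week).
    """
    # Families containing the marked boy: he is the first child (2n possible
    # siblings) or the second (2n more), minus the one family counted twice.
    with_marked_boy = 2 * n + 2 * n - 1
    # Same count, but restricted to siblings who are boys (n each way).
    two_boys = n + n - 1
    return Fraction(two_boys, with_marked_boy)

# 365 is purely illustrative (a day of the year, assumed uniform).
for n in (1, 7, 365):
    print(n, two_boys_given_marked_boy(n))
```

With n = 1 this gives 1/3, with n = 7 it gives 13/27, and as n grows the value creeps toward 1/2.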
But Why?
For me, however, these demonstrations don’t get to the why of this weird effect; why does the inclusion of an apparently irrelevant datum raise the odds of the man having two sons from 1/3 to 13/27? The best answer I can provide comes in the form of this piece of code:
familiesWithTuesdaysBoy == filter(hasTuesdaysBoyP, familiesWithBoy)
Note that this is a predicate, not an assignment; it returns True, demonstrating that the set of families with a boy born on Tuesday can be computed by filtering the set of all families with boys. Since we know that the former set is more densely populated with 2-boy families, we can conclude that the hasTuesdaysBoyP predicate disproportionately filters out boy-girl families. Which makes sense; boy-boy families have two independent chances to contain a boy born on Tuesday, while boy-girl families have only one. You’d expect the predicate to remove more of the latter families.
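To put numbers on "disproportionately", here is a small Python 3 sketch that splits the families with at least one boy into boy-boy and mixed groups and measures how much of each group survives the Tuesday filter. The function survival_rate and the list names are mine.

```python
from fractions import Fraction
from itertools import product

BOY, GIRL, TUESDAY = 0, 1, 2   # same encoding as the code above

children = list(product(range(2), range(7)))
families_with_boy = [f for f in product(children, repeat=2)
                     if f[0][0] == BOY or f[1][0] == BOY]

def survival_rate(families):
    """Fraction of the given families that contain a boy born on Tuesday."""
    kept = [f for f in families if (BOY, TUESDAY) in f]
    return Fraction(len(kept), len(families))

boy_boy = [f for f in families_with_boy if f[0][0] == BOY and f[1][0] == BOY]
mixed = [f for f in families_with_boy if f[0][0] != f[1][0]]

print(len(boy_boy), survival_rate(boy_boy))  # 49 families, 13/49 survive
print(len(mixed), survival_rate(mixed))      # 98 families, 1/7 survive
```

Two-boy families survive at a rate of 13/49 (a little under 2/7, because the family with two Tuesday boys is only counted once), while mixed families survive at exactly 1/7. That is why the surviving set is nearly half boy-boy.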
Bayesian
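You can also compute an answer with straightforward Bayesian analysis. Let H be the hypothesis that the man has two boys, and D the datum that at least one of his children is a boy born on a Tuesday:
P(D|H) = 13/49
P(H) = 1/4
P(D) = 27/196
P(H|D) = P(D|H) * P(H) / P(D) = (13/49) * (1/4) * (196/27) = 13/27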
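The same arithmetic can be checked with exact fractions. In this short Python 3 sketch the three inputs are simply the counts from the 196-family enumeration, and the variable names are mine.

```python
from fractions import Fraction

# H: the man has two boys.  D: at least one child is a boy born on a Tuesday.
p_d_given_h = Fraction(13, 49)   # 13 of the 49 two-boy families include a Tuesday boy
p_h = Fraction(1, 4)             # two boys, before any information
p_d = Fraction(27, 196)          # 27 of the 196 families include a Tuesday boy

p_h_given_d = p_d_given_h * p_h / p_d   # Bayes' theorem
print(p_h_given_d)   # 13/27
```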
Weird
The really counter-intuitive part of this, for me, is the implication for a question that wasn’t asked. Suppose the original problem was:
A man says: “I have two children. One is a boy. What is the probability I have two boys?”
Although it would be the long way ’round, it seems reasonable to try to solve this problem by saying:
We don’t know what day this boy was born on. But we do know that the probability of his birthday falling on any particular day of the week is 1/7. So, the probability of the man having two boys is 1/7 * the probability of him having two boys if his son was born on a Sunday, plus 1/7 * the probability of him having two boys if his son was born on a Monday, and so on. Since all the conditional probabilities are the same, and since we’re adding 7 terms, the probability of the man having two boys is the same as the probability of him having two boys given that his son was born on (say) a Tuesday, i.e. 13/27.
The difficulty with that line of argument is that it gives an answer which we know to be incorrect. I think this is the most interesting part of the puzzle.
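One way to see where the argument slips, sketched under the same 196-family model: averaging over the seven days only works if the seven cases are mutually exclusive, and "has a boy born on day d" is not. A two-boy family whose sons were born on different days falls into two of the cases at once. The Python 3 snippet below (names are mine) counts how many times each kind of family gets included.

```python
from itertools import product

BOY = 0   # same encoding as the code above
children = list(product(range(2), range(7)))
all_families = list(product(children, repeat=2))

families_with_boy = [f for f in all_families if f[0][0] == BOY or f[1][0] == BOY]
# Within families_with_boy, equal sexes can only mean two boys.
two_boy_families = [f for f in families_with_boy if f[0][0] == f[1][0]]

# For each day, find the families containing a boy born on that day,
# and add up the group sizes across all seven days.
memberships = 0
two_boy_memberships = 0
for day in range(7):
    group = [f for f in families_with_boy if (BOY, day) in f]
    memberships += len(group)
    two_boy_memberships += sum(1 for f in group if f in two_boy_families)

print(len(families_with_boy), memberships)          # 147 families, 189 memberships
print(len(two_boy_families), two_boy_memberships)   # 49 families, 91 memberships
```

The seven day-groups hold 189 memberships for only 147 families, and every one of the 42 extra memberships belongs to a two-boy family with sons born on different days. Averaging over the groups therefore gives 91/189 = 13/27 instead of 49/147 = 1/3: the two-boy families have been double-counted.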