PDA

View Full Version : More Complex Birthday Problem


gaming_mouse
02-23-2005, 12:04 AM
Inspired by this post (http://forumserver.twoplustwo.com/showflat.php?Cat=&Number=1784262&page=0&view=colla psed&sb=5&o=14&fpart=3#Post1787325) in OOT.

We ask random people for their birthday, and we want to know how long it will be before we get a representative for every day of the year. Obviously the minimum required is 365 people, but what is the distribution of the waiting time till we get a person for every bday?

I was thinking about it like this... number every day in the year 1,2,... 365. So the problem can be reduced to generating random numbers between 1 and 365. After asking N people their birthday (or generating N random numbers), the chance that a particular birthday (say Jan 1) has not yet been found is:

(364/365)^N

Letting E1 = event that birthday 1 is not yet found
E2 = same for bday 2
etc

The chance that we have not completed the full bday spectrum after N should be:

P(E1 or E2 or E3... or E365)

We should then be able to use inclusion-exlcusion. However, there must be an error in my thinking b/c I get

365*(364/365)^N - (365 choose 2)*(363/365)^N

for the first 2 terms, which gives you a negative number for N=400. What's wrong with this approach? What a better way to do it?

gm

reubenf
02-23-2005, 01:29 AM
NM /images/graemlins/frown.gif

pzhon
02-23-2005, 03:35 AM
This situation is called the coupon collector problem (http://rec-puzzles.org/new/sol.pl/probability/coupon).

Your approach is ok, though you need many more terms to get a good approximations when N is small. It's like e^-x = 1-x+x^2/2-x^3/6+x^4/24-... for x=10. If you stop after a couple of terms, your approximation will be horrible. You need to continue until the terms are smaller than your error tolerance.

The way I usually think of the problem is that there is an independent series of 365 waits for the next new birthday. Each of these has a geometric distribution (based at 1), so you want a convolution of a geometric distributions of mean 365/i for i=1 to 365. You may be able to get a good approximation by using a normal approximation for most of the terms, then convolving the last few together. However, you can also compute each term directly by inclusion-exclusion.

The probability for N=400 is extremely low. You do not want to trust the normal approximation that far from the mean (2365). I get a probability of 1.470387 x 10^-117 for the probability of collecting all 365 birthdays on the 400th person, and 1.771290 x 10^-117 for the probability of collecting all 365 birthdays on or before the 400th person.

gaming_mouse
02-23-2005, 01:56 PM
[ QUOTE ]
NM /images/graemlins/frown.gif

[/ QUOTE ]

What does NM mean?

gaming_mouse
02-23-2005, 02:04 PM
[ QUOTE ]
Your approach is ok, though you need many more terms to get a good approximations when N is small.

[/ QUOTE ]

pzhon,

thanks. the one thing i was also curious about is.... Will this approach yield an answer of 0 whenever N <= 365?

reubenf
02-23-2005, 02:32 PM
"Nevermind" -- I posted some dumb stuff at first /images/graemlins/laugh.gif

gaming_mouse
02-23-2005, 02:33 PM
[ QUOTE ]
Each of these has a geometric distribution (based at 1), so you want a convolution of a geometric distributions of mean 365/i for i=1 to 365.

[/ QUOTE ]

Can you explain this more? Maybe work it out a little. It's been a while since I did any convolutions...

pzhon
02-24-2005, 08:14 AM
[ QUOTE ]
Will this approach yield an answer of 0 whenever N <= 365?

[/ QUOTE ]
I think what you wrote was the probability not all birthdays would be covered by the nth person, and inclusion-exclusion correctly computes that the probability is 1 when N<365.

I think you are vulnerable to round-off errors if you try inclusion-exclusion directly, since you may get some very large terms in relation to the final answer. If you use a program that handles exact arithmetic, such as Mathematica, you should be ok.

Probability of getting all 365 birthdays among 400 people=

52483783422874597804160994139395052025933928631726 725852252
99768854194834360516158398330245464957400255876549 462098238
47228687271798613101893209981804672821136797825938 362553538
25049977758566210242062865782721137615697095321391 577013114
25676752994679811065943416002065242556195356480054 219759684
31210746116923468802817449397369613633099595400370 320464218
24872536041413773310765315008498940023780391424078 265768602
19987070046017307154967050540477087738245139006423 901519499
01389454812286268745961138241066130684236022645943 939185388
77700816709566362914353680068270916089700797886807 635786185
61899033640283175798640933254274115346564829172120 638441654
17041313354684823539241894869119994267587969671204 469096824
96880975959627142155839914533457990127206730956608 829624310
50981276975114189260121962841396152335583744120632 158434885
632

/

29630252322857330632860424262750594471896783621785 973743878
41241915001644078374133983027032394826074040462250 328944831
38752795141245056701059379755101369221623724578878 905323513
47169340119880781481602513373705371474348886379318 578519619
00275516891935071175600989306565258776254981567031 087846506
12907337087388163058618770889105501750377059626426 176323012
93726956610244222002778589678328490750986439279558 940328110
07993934014055469891109772226591317066138091345357 944630363
15195939166023536301499226617746210919458226962382 015628472
57533834343265983699483897086207754231171856141729 326450570
70588325353207133182276562122241667612147364191719 951173014
73621509459492954942517135016421871450666811790994 892406807
25957806076959867245679135355421611586685398429027 388875114
27694051624908740292708935377520917848398986895760 820807587
07346219272122014985846478290749386512588048286793 800129024
47174999221097998732335487837019627477275207638740 539550781
25

pzhon
02-24-2005, 08:32 AM
[ QUOTE ]
[ QUOTE ]
Each of these has a geometric distribution (based at 1), so you want a convolution of a geometric distributions of mean 365/i for i=1 to 365.

[/ QUOTE ]

Can you explain this more? Maybe work it out a little. It's been a while since I did any convolutions...

[/ QUOTE ]

A discrete convolution is what happens if you multiply two polynomials or power series.

A generating function for the waiting time between the first birthday and the second new birthday is

f_2 = 364/365 x + (364/365)(1/365) x^2 + (364/365)(1/365)^2 x^3 + ...

The coefficient of x^n is the probability that the wait is n.

A generating function for the waiting time between the second new birthday and the third new birthday is

f_3 = 363/365 x + (363/365)(2/365) x^2 + ... + (363/365)(2/365)^(i-1) x^i + ...

To get a generating function from the first birthday until the third, multiply f_2 by f_3.

f_n = x (366-n)/365 / (1 - (n-1) x / 365) (as a power series)

I'm not sure this helps much for computing any individual probability. I used it, but I still had to do some work to avoid losing numerical precision or relying on slow exact arithmetic. However, recognizing that the total waiting time is a sum of independent random variables lets you compute the mean and standard deviation easily from the means and standard deviations of the time between the n-1st and nth new birthday for n=1,...,365.

rivermeelmo
02-24-2005, 09:53 AM
isn't this basically a pvnp type deal

pzhon
02-24-2005, 10:53 AM
[ QUOTE ]
isn't this basically a pvnp type deal

[/ QUOTE ]
No.

d10
02-28-2005, 08:12 PM
Too many formulas.

If you want a statistical based answer, I threw together a simulation in Perl which randomly generates a number 0-364 (for 365 total possible outcomes) and kept on generating numbers until all 365 outcomes had hit at least once. I then recorded how many numbers I needed to pull from the RNG before all 365 outcomes had hit at least once. That process was looped 1000 times and each result was saved in a log file. I then took this file full of numbers and calculated the following from it:

Average = 2347
Minimum = 1412
Maximum = 5692
Std Dev = 461.37

pzhon
03-02-2005, 11:32 AM
[ QUOTE ]
Too many formulas.

[/ QUOTE ]
I don't see another good way to solve the problem as thoroughly. Perhaps you could suggest one?

[ QUOTE ]

If you want a statistical based answer, I threw together a simulation in Perl which randomly generates a number 0-364 (for 365 total possible outcomes) and kept on generating numbers until all 365 outcomes had hit at least once.

[/ QUOTE ]
That is a bad idea for many reasons.

/images/graemlins/diamond.gif It relies on the very high quality of your random number generator. A RNG that worked well for many other purposes could easily produce a very skewed result, on average.

/images/graemlins/diamond.gif You can't compute a probability that is lower than 10^-100 (like the probability all birthdays are represented by the 400th person) by observing an experiment like this.

/images/graemlins/diamond.gif It is far more complicated than 365(1 + 1/2 + ... + 1/365) = 2364.65 , one of those formulas you disliked.

gaming_mouse
03-02-2005, 02:40 PM
[ QUOTE ]
[ QUOTE ]
Too many formulas.

[/ QUOTE ]
I don't see another good way to solve the problem as thoroughly. Perhaps you could suggest one?

[ QUOTE ]

If you want a statistical based answer, I threw together a simulation in Perl which randomly generates a number 0-364 (for 365 total possible outcomes) and kept on generating numbers until all 365 outcomes had hit at least once.

[/ QUOTE ]
That is a bad idea for many reasons.

/images/graemlins/diamond.gif It relies on the very high quality of your random number generator. A RNG that worked well for many other purposes could easily produce a very skewed result, on average.

/images/graemlins/diamond.gif You can't compute a probability that is lower than 10^-100 (like the probability all birthdays are represented by the 400th person) by observing an experiment like this.

/images/graemlins/diamond.gif It is far more complicated than 365(1 + 1/2 + ... + 1/365) = 2364.65 , one of those formulas you disliked.

[/ QUOTE ]

Not to mention that the point wasn't to have an answer per se -- I know I, at least, didn't care about that. The point was to see how to solve the problem.

BillsChips
03-02-2005, 02:51 PM
I think there's something wrong with your calculations. How could the Standard Deviation be less then the minimum?

d10
03-02-2005, 08:58 PM
I guess it's harder to appreciate a simulation like that if you're only looking at the results. I understand that in hypothetical problems like these the point is to discover the means to an answer and not the answer itself, so I got much more out of writing the code for the simulation than anyone could get out of viewing the results of it.

I have to disagree with the earlier post. Statistical analyses of a computer simulation is often, in the practical sense, just as useful as deriving a formula. Look at the question posed again in the first post of this thread. Are you saying that the results of my simulation don't accurately reflect the correct answer to this question? If you want the answer simply because you're curious, then you might prefer the more exact answer offered by a formula. However if you were looking for a practical answer, say if you actually needed to find 365 people with different birthdays for whatever reason, and you wanted to know how many people you should expect to contact before you found your 365 representatives, my answer of 2347 is just as practical as 2364.65.

Yes, there are problems in the real world that can only be solved through formulas, there are also problems that can only be solved through statistics (Can you give me a formula that says how often you can expect to win with AA at a 10 player table? I'd prefer to look at the statistics after it's dealt to me 1,000 times and assume that the number I get is, in all practical senses, the number I can expect in the future). I think the question in this thread, as it was posed, can be effectively solved either way. And I think the fact that both of our methods produced similar answers is a good check which validates both of our methods and ensures that neither one of us missed anything as we thought through the problem. I'm sorry that all I could provide from my method was the results. I had thought about posting the code I used to create the simulation but it's been years since I wrote software for a living, and the code I put together for this problem is quite inefficient and embarassing, but it works.

d10
03-02-2005, 09:09 PM
I'm not sure what one has to do with the other. Given a sample of 19,10,1, the standard deviation will be greater than the minimum value. Given 5001,5000,4999, the standard deviation will be less than the minimum value. I think given the sample of this problem both the standard deviation and the minimum values I arrived at look OK.

fishfeet
03-04-2005, 04:05 AM
Cool! A discussion about the thread I made...

I never got to all 365 bdays... I ran out of forums to post on!!

Anyway, here is the data I collected so far.. maybe Ill pick it up again sometime...


254 Birthdays
427 Responces (173 repeats)

I tried reading this tread to understand how to solve for an expected value... but it went way over my head.
Ive always been a good at math.. just took Stats last semseter.. got an A.
But damn, This stuff is confusing.

Jonas Wa
03-04-2005, 09:32 PM
Your Problem is with union and interscetions (inclusion-exlcusion)
your:
365*(364/365)^N - (365 choose 2)*(363/365)^N
should be followed if done right (havnt look realy at it)
+(365 choose 3))*(362/365)^N -(365 choose 4))*(361/365)^N
and so on........
Smartest should be to take 1-P(all found) I think.

gaming_mouse
03-04-2005, 11:20 PM
[ QUOTE ]
Your Problem is with union and interscetions (inclusion-exlcusion)
your:
365*(364/365)^N - (365 choose 2)*(363/365)^N
should be followed if done right (havnt look realy at it)
+(365 choose 3))*(362/365)^N -(365 choose 4))*(361/365)^N
and so on........

[/ QUOTE ]


Yes, I know that. I was confused about why the first two terms were so far off, but, as pzhon pointed out, it just takes longer to converge in this problem.

[ QUOTE ]
Smartest should be to take 1-P(all found) I think.

[/ QUOTE ]

To do what? We're trying to estimate, eg, the chance that they have all been found after 400 trials, or 1000 trial, or whatever. The "1 - " trick doesn't apply here in any way I can see.