Baseball Probability Question

06-23-2003, 10:48 AM
A friend of mine asked me the following question:

"Suppose you have n 'bins' and r white balls and s black balls. Further, suppose the balls are distributed randomly in the bins such that no bin contains an equal number of white and black balls. What is the expected value of the number of bins that contain more white balls than black balls? (A bin can be thought of as a (baseball) game, all the bins as a (baseball) season, and the balls as points (runs).)"

Any comments?

(Cross posted to sci.stat.math)
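For anyone who wants to experiment with the question numerically, here is a minimal Monte Carlo sketch. It rests on an assumption the question does not spell out: "distributed randomly" is read as each ball landing in a uniformly chosen bin, independently, with any arrangement that has a tied bin (including an empty bin, which is 0 to 0) thrown away. The function name and its parameters are made up for illustration.

```python
import random

def expected_white_bins(n, r, s, attempts=100_000, seed=0):
    """Estimate the mean number of bins holding more white than black balls.

    Assumption: every ball picks a bin uniformly and independently, and
    arrangements with any tied bin (empty bins included) are rejected.
    Returns None if no arrangement was accepted.
    """
    rng = random.Random(seed)
    accepted = 0
    white_wins = 0
    for _ in range(attempts):
        white = [0] * n
        black = [0] * n
        for _ in range(r):
            white[rng.randrange(n)] += 1
        for _ in range(s):
            black[rng.randrange(n)] += 1
        # Reject the draw if any bin has equal counts (a "tied game").
        if any(w == b for w, b in zip(white, black)):
            continue
        accepted += 1
        white_wins += sum(w > b for w, b in zip(white, black))
    return white_wins / accepted if accepted else None

if __name__ == "__main__":
    # Hypothetical small season: 5 games, 12 runs scored, 9 allowed.
    print(expected_white_bins(5, 12, 9))
```

Rejection sampling gets slow when ties are likely (many bins, few balls), so this is only practical for small n.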

06-23-2003, 11:15 AM
There is a famous baseball sabermetric formula that covers this, called the Pythagorean Projection. However, the balls aren't distributed exactly "randomly", as baseball scoring is peculiar.

Anyways, someone (I believe Bill James) in the '80s found out that a good approximation was:

Team Winning Percentage = (runs scored^2) / (runs scored^2 + runs allowed^2).

Later on the formula got refined, with the powers being a number around 1.8 (I don't remember the exact number), and everything else the same. The small change doesn't change the formula's win% by much.
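As a rough illustration of the formula above, here is a small sketch with the exponent left as a parameter. The function name is made up, and the 1.8 used below is only the "around 1.8" figure mentioned in this post, not an exact published value.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Estimated winning percentage: RS^k / (RS^k + RA^k).

    Undefined (division by zero) when both run totals are zero.
    """
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

if __name__ == "__main__":
    # Hypothetical team: 800 runs scored, 700 allowed.
    print(round(pythagorean_win_pct(800, 700), 4))                # exponent 2
    print(round(pythagorean_win_pct(800, 700, exponent=1.8), 4))  # approx. refined exponent
```

For a team that outscores its opponents, a smaller exponent pulls the estimate slightly toward .500, which fits the remark that the refinement changes the result very little.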

06-23-2003, 12:21 PM
>Team Winning Percentage = (runs scored^2) / (runs scored^2 + runs allowed^2).

That is very cool! Can you find a reference for this formula in a book or article? I would be very interested.

I notice that the formula is not correct for the exact question that I posted if r, s, and n are small. For example, suppose there is one white ball (r = 1), n-1 black balls (s = n-1), and n bins. An empty bin would hold equal numbers (zero) of each color, so every bin must get at least one ball, and with only n balls the only possible way to distribute them is one ball per bin. The white balls therefore exceed the black balls in exactly one bin. The win percent is then exactly 1/n, whereas the Team Winning Percentage formula gives 1/((n-1)^2 + 1).
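The comparison for this one-white-ball case can be checked with exact fractions. This is just a sketch: the function name is made up, and the formula value is obtained by plugging r = 1, s = n-1 directly into r^2 / (r^2 + s^2).

```python
from fractions import Fraction

def exact_vs_pythagorean(n):
    """Compare the exact win fraction 1/n with r^2 / (r^2 + s^2)
    for r = 1 white ball and s = n - 1 black balls in n bins."""
    r, s = 1, n - 1
    exact = Fraction(1, n)
    approx = Fraction(r**2, r**2 + s**2)
    return exact, approx

if __name__ == "__main__":
    for n in range(2, 7):
        exact, approx = exact_vs_pythagorean(n)
        print(n, exact, approx)
```

The two values agree at n = 2 and drift apart from n = 3 onward, with the formula giving the smaller number.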


06-23-2003, 04:32 PM
I found a reference for the formula in case anyone is interested.

Pythagorean Baseball Win Formula (http://personal.bgsu.edu/~albert/tsub/essay.htm)

(This formula is not the answer to the original question, but it may be more interesting than the original question.)

06-23-2003, 05:03 PM
Ah yes, another example of totally useless statistics that can't be used to predict a damn thing!

I also HATE imposing artificial, albeit "natural", assumptions such as the fitted line passing through (0,0) because if R=RA the record is expected to be .500.

Pitching dominates baseball. It isn't uncommon for a team to have three relative aces, two dogs, and average hitting. Such a team is going to win a lot of 1-run games and get blown out in a lot of games, so its record may be far better than .500 despite R <= RA. I suspect something like that is what is going on in the Toronto numbers that year. (An alternative explanation is injuries at some point in the season that led to them fielding two "different teams".)

Baseball may be the sport where statistics are most valuable in decision making (hit and run, bunt, sacrifice etc.). As far as predicting results, though, statistics are by far most valuable in hockey, where results are dominated even more by the goalie than baseball is by pitchers (because the same goalie plays 75% of the games, but the same pitcher only 20-25% of the games). In fact betting on Stanley Cup futures just before they start has been profitable based on a single statistic for each team in 13 out of the last 15 years, including this one!