  #3  
Old 09-04-2005, 02:39 AM
elitegimp
Junior Member
 
Join Date: Apr 2004
Location: boulder, CO
Posts: 14
Follow-up question

First of all, thanks for the link to the article, Bruce. I got a PM from Lexander, and that reminded me that I had ignored this thread after reading Bruce's response. I had hoped to get more input, so here's some more info on what I'm thinking.

1) This thread is going to shift from a poker stat thread to a baseball stat thread. Sorry, I just tossed out the BB/session as an analogy because... well, I dunno why.

2) One of my friends asked me if there was a statistical way to track a hitter's consistency, on a scale from "a very consistent hitter" to "a very streaky hitter." Initially, I thought that a consistent hitter would be one who doesn't stray far from his season-long batting average in any given game... hence the relation between Hits / Game and BB / session.

3) I figured a good comparison to see if this statistic means anything would be between Albert Pujols (in my opinion, he is very consistent) and Brian Roberts (who started the season extremely hot, and has cooled off considerably since then).

Okay, so I did this analysis as laid out in the article linked by Bruce and found that the variance from game to game was very similar between the two hitters... meaning (to me, at least) that this was a bad method. It's probably why we (as poker players) measure SD in BB/100 and not BB/session :)
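For anyone who wants to try this on their own, the game-to-game comparison boils down to computing the mean and sample standard deviation of hits per game for each hitter. Here's a minimal sketch in Python; the game logs below are made-up numbers for illustration, not the real 2005 data for Pujols or Roberts.

```python
# Hypothetical per-game hit counts (NOT real game logs).
pujols_games = [1, 2, 0, 1, 2, 1, 3, 0, 1, 2]
roberts_games = [3, 2, 2, 1, 0, 1, 0, 2, 1, 0]

def mean_and_sd(hits_per_game):
    """Return (mean, sample standard deviation) of hits per game."""
    n = len(hits_per_game)
    mean = sum(hits_per_game) / n
    # Sample variance: divide by n - 1, not n.
    var = sum((h - mean) ** 2 for h in hits_per_game) / (n - 1)
    return mean, var ** 0.5

print(mean_and_sd(pujols_games))
print(mean_and_sd(roberts_games))
```

If the two SDs come out nearly equal (as they did for me), the per-game statistic isn't separating "consistent" from "streaky."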

I think it would be interesting to group the ABs in 10s or 20s and calculate the standard deviation in H/10 ABs or H/20 ABs, but that would mean an incredible amount of work for me to cull the data out of game logs (I'll probably end up doing this to satisfy my inner curiosity, but I'm not looking forward to it).
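The grouping step itself is easy once the at-bat data is culled; the hard part is getting the data. As a sketch, assuming each at-bat has been reduced to 1 (hit) or 0 (out):

```python
import statistics

def hits_per_chunk(ab_outcomes, chunk=10):
    """Split a hit/out sequence (1 = hit, 0 = out) into groups of
    `chunk` at-bats and count hits in each; a short tail is dropped."""
    full = len(ab_outcomes) // chunk * chunk
    return [sum(ab_outcomes[i:i + chunk]) for i in range(0, full, chunk)]

# A strictly alternating hit/out sequence: every 10-AB chunk has 5 hits,
# so the SD of H/10 ABs is 0 -- the "perfectly consistent" extreme.
chunks = hits_per_chunk([1, 0] * 50, chunk=10)
sd = statistics.stdev(chunks)
```

A streaky hitter would show chunks like [7, 6, 2, 1, 6, ...] and a much larger SD, which is exactly the separation the per-game numbers failed to give.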

The last thing I did, and I think it's really neat, was to use data from each game to plot the number of hits each hitter had on the season as a function of his total at-bats.

Example: Player A plays in three games and goes 0-2, 1-4, 4-4, so I plot the points (2,0), (6,1), and (10,5).

I then fit the least squares line to the data (I don't know the verb form of "linear regression"), and plotted the "actual hits - predicted hits" at each data point. Interestingly, Pujols's error data looked like random noise, varying between -4 hits and +4 hits, while Roberts's error data had a noticeable "parabolic" curve. It was positive early in the season, when he was running hot, then negative in the middle of the season, and positive again towards the end. He also had a much bigger range than Pujols, varying as much as ±8 hits.
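The fit-and-residuals step can be sketched directly from the three-game example above (the toy Player A who goes 0-2, 1-4, 4-4, giving points (2,0), (6,1), (10,5)):

```python
def least_squares(xs, ys):
    """Ordinary least squares fit: return (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def residuals(xs, ys):
    """Actual hits minus predicted hits at each data point."""
    m, b = least_squares(xs, ys)
    return [y - (m * x + b) for x, y in zip(xs, ys)]

# Cumulative (at-bats, hits) for the toy three-game example.
res = residuals([2, 6, 10], [0, 1, 5])  # [0.5, -1.0, 0.5]
```

Random-looking residuals suggest a steady hitter; a systematic positive-negative-positive pattern like the toy example's is the "hot, then cold, then hot" signature.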

So I think that is very interesting, but I don't really know how to interpret it. I also don't know how to turn that information into a single number. Both models had very high correlation coefficients (0.999 for Pujols, 0.995 for Roberts, I believe), and I don't like the idea of looking at a stat on a scale of 0.99 to 1 or something.
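For completeness, here's how the correlation coefficient itself falls out of the same sums used in the fit (again on the toy three-point example, where r is much lower than the 0.99+ values a full season of cumulative data produces):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

r = pearson_r([2, 6, 10], [0, 1, 5])  # about 0.945
```

Cumulative hits vs. cumulative at-bats is nearly a straight line for any regular hitter, which is exactly why r is squeezed into that uninformative 0.99-to-1 range; the residual pattern carries the streakiness information, not r itself.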

So anyway, I guess my next step is to calculate the standard deviation in H/10. Does anyone have any thoughts? Anything to point out that I may have missed? And does anyone have a good interpretation of the regression data? I'll post some plots tomorrow afternoon if I get a chance.