PDA

View Full Version : question about sample variance


naphinfitos
12-18-2005, 09:38 PM
When calculating this, you divide by n-1 rather than n. Why is this? Is there a rational explanation, or is it arbitrary? ty in advance.

12-18-2005, 09:48 PM
The simple answer is that n-1 "works" to make the estimator unbiased. The maximum-likelihood estimator divides by n, but it is biased: when you take the expectation of the MLE, you see that dividing by n-1 instead of n removes the bias. And people tend to like unbiased estimators. So it's not arbitrary, but there's no particularly deep reasoning behind it either.
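A quick sanity check of this claim (my own sketch, not from the post): simulate many small samples from a normal population with known variance, and compare the average of the divide-by-n estimates against the average of the divide-by-(n-1) estimates. The population parameters and variable names here are arbitrary choices for illustration.

```python
import random

random.seed(0)

TRUE_MEAN, TRUE_SD = 10.0, 2.0
TRUE_VAR = TRUE_SD ** 2          # true variance is 4.0
n, trials = 5, 200_000

biased_sum = 0.0                 # running total of divide-by-n estimates
unbiased_sum = 0.0               # running total of divide-by-(n-1) estimates

for _ in range(trials):
    xs = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    avg = sum(xs) / n
    ss = sum((x - avg) ** 2 for x in xs)   # sum of squared deviations
    biased_sum += ss / n
    unbiased_sum += ss / (n - 1)

print(biased_sum / trials)       # systematically below 4.0 (biased low)
print(unbiased_sum / trials)     # close to 4.0 (unbiased)
```

With n = 5, the divide-by-n average comes out near (n-1)/n * 4.0 = 3.2, while the divide-by-(n-1) average lands near the true 4.0.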

naphinfitos
12-18-2005, 09:53 PM
Thanks for the help. I was wondering why it's unbiased, though. My teacher refuses to tell us the reasoning for using n-1, and I was interested. ty.

AaronBrown
12-18-2005, 11:29 PM
It's pretty simple, but before I go through it I'll add that if this makes a difference to a real decision, you need more data. It's the kind of mathematical subtlety that is used to torture students, not something you should worry about for practical purposes.

Suppose you have a sample of N values from a population with true mean M, and the sample average happens to be A. To compute the variance, you begin by summing the squared deviations from the sample average:

(1) Sum from i = 1 to N of (Xi - A)^2

What you really want is the sum of the deviations from the true mean:

(2) Sum from i = 1 to N of (Xi - M)^2

If you subtract (2) from (1) you get the error introduced by using the sample average instead of the true mean:

(3) Sum from i = 1 to N of (Xi - A)^2 - (Xi - M)^2
= Sum from i = 1 to N of -2*A*Xi + A^2 + 2*M*Xi - M^2
= Sum from i = 1 to N of 2*Xi*(M - A) + A^2 - M^2

But the sum from i = 1 to N of Xi is N*A and all the other terms are constants so (3) is equal to

(4) 2*N*A*(M - A) + N*(A^2 - M^2)
=N*(M - A)*(2*A - A - M)
=-N*(M - A)^2

Note that this is always negative or zero, so the sum of the squared deviations from the sample average is always less than or equal to the sum of the squared deviations from the true mean (and it's equal only if the sample average happens to be exactly equal to the true mean). Looking at things another way, the sample average is the value that minimizes the sum of the squared deviations; if the true mean is any different, you will underestimate the squared deviations.
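The identity derived above, that the sum of (Xi - A)^2 equals the sum of (Xi - M)^2 minus N*(M - A)^2, can be checked numerically for any sample and any candidate mean M. A small sketch (my own illustration, with an arbitrary made-up sample):

```python
xs = [3.0, 7.0, 1.0, 9.0, 5.0]   # arbitrary sample
N = len(xs)
A = sum(xs) / N                   # sample average (5.0 here)
M = 4.2                           # any candidate "true mean"

ss_around_A = sum((x - A) ** 2 for x in xs)
ss_around_M = sum((x - M) ** 2 for x in xs)

# (1) = (2) + (3): deviations around A fall short by exactly N*(M - A)^2
assert abs(ss_around_A - (ss_around_M - N * (M - A) ** 2)) < 1e-9

# A minimizes the sum of squared deviations, so this holds for every M:
for M2 in [0.0, 2.5, A, 10.0]:
    assert sum((x - M2) ** 2 for x in xs) >= ss_around_A - 1e-9

print(ss_around_A, ss_around_M)   # 40.0 vs 43.2 for this sample
```

Swapping in any other value of M leaves the assertions passing, which is exactly the point: the shortfall is always N*(M - A)^2, never positive.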

Since A is the average of N independent draws, the variance of A is Variance/N, and so the expected value of N*(M - A)^2 is exactly the true variance. The expected value of (2) is N*Variance, and (1) equals (2) minus N*(M - A)^2, so we have:

E(Sum of squared deviations from sample average) + Variance = N*Variance

so:

E(Sum of squared deviations from sample average) = (N - 1)*Variance

So the expected value of the sum of the squared deviations from the sample average, divided by N - 1, is the variance; that is, dividing by N - 1 gives an unbiased estimator.
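The whole derivation boils down to E(sum of squared deviations from the sample average) = (N - 1)*Variance, and that can be checked by simulation. A hedged sketch, using a uniform population just to show the result isn't specific to the normal distribution (population and sizes are my own arbitrary choices):

```python
import random

random.seed(1)

N, trials = 4, 300_000
TRUE_VAR = 1.0 / 12.0            # variance of Uniform(0, 1)

total_ss = 0.0
for _ in range(trials):
    xs = [random.random() for _ in range(N)]
    A = sum(xs) / N
    total_ss += sum((x - A) ** 2 for x in xs)

expected_ss = total_ss / trials
print(expected_ss)               # close to (N - 1) * TRUE_VAR = 0.25
print((N - 1) * TRUE_VAR)
```

With N = 4 the average sum of squares comes out near 3 * (1/12) = 0.25 rather than 4 * (1/12), matching the (N - 1)*Variance result above.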