Re: question about sample variance
It's pretty simple, but before I go through it I would add that if it makes a difference to a real decision, you need more data. It's the kind of mathematical subtlety that is used to torture students, not something you should worry about in practical reasoning.
Suppose you have a sample of size N from a population with true mean M, and the sample average happens to be A. To compute the variance, you begin by summing the squared deviations from the sample average.
(1) Sum from i = 1 to N of (Xi - A)^2
What you really want is the sum of the deviations from the true mean:
(2) Sum from i = 1 to N of (Xi - M)^2
If you subtract (2) from (1) you get the error introduced by using the sample average instead of the true mean:
(3) Sum from i = 1 to N of [(Xi - A)^2 - (Xi - M)^2]
Expanding both squares, the Xi^2 terms cancel, leaving
= Sum from i = 1 to N of -2*A*Xi + A^2 + 2*M*Xi - M^2
= Sum from i = 1 to N of 2*Xi*(M - A) + A^2 - M^2
But the sum from i = 1 to N of Xi is N*A, and all the other terms are constants, so (3) is equal to
(4) 2*N*A*(M - A) + N*(A^2 - M^2)
Since A^2 - M^2 = -(M - A)*(A + M), we can factor out N*(M - A):
= N*(M - A)*(2*A - A - M)
= -N*(M - A)^2
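If you want to convince yourself that identity (4) holds, here is a quick numerical check (with a made-up population: normal with mean 10 and standard deviation 2; the specific numbers are just for illustration):

```python
import random

# Draw a sample from a population whose true mean M we know,
# then check identity (4):
#   sum (Xi - A)^2 - sum (Xi - M)^2 == -N*(M - A)^2
random.seed(1)
M = 10.0                                # true mean of the population
xs = [random.gauss(M, 2.0) for _ in range(50)]
N = len(xs)
A = sum(xs) / N                         # sample average

lhs = sum((x - A) ** 2 for x in xs) - sum((x - M) ** 2 for x in xs)
rhs = -N * (M - A) ** 2
assert abs(lhs - rhs) < 1e-9            # identity holds up to rounding
```

The identity is exact, not approximate, so the two sides agree to floating-point precision for any sample.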
Note that this is always negative or zero, so the sum of the squared deviations from the sample average is always less than or equal to the sum of the squared deviations from the true mean (and it's equal only if the sample average happens to be exactly equal to the true mean). Looking at it another way, the sample average is the value that minimizes the sum of the squared deviations, so if the true mean is any different from the sample average, you will underestimate the squared deviations.
Since A is the average of N independent draws, its variance is Variance/N, so the expected value of N*(M - A)^2 is exactly equal to the true variance. The expected sum of squared deviations from the true mean is N*Variance, so we have:
E(Sum of squared deviations from sample average) + Variance = N*Variance
so:
E(Sum of squared deviations from sample average) = (N - 1)*Variance
so dividing the sum of the squared deviations from the sample average by N - 1 gives you, in expectation, exactly the variance.
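You can see the N versus N - 1 difference directly by simulation. A sketch (the population here is an arbitrary choice, Normal(0, sd=2) with true variance 4, and small samples of size 5 where the bias is large):

```python
import random

# Average the two estimators over many simulated samples:
# dividing by N is biased low, dividing by N - 1 is unbiased.
random.seed(0)
true_var = 4.0                          # Normal(0, sd=2) has variance 4
N, trials = 5, 100_000

sum_n = sum_n1 = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(N)]
    A = sum(xs) / N
    ss = sum((x - A) ** 2 for x in xs)  # squared deviations from A
    sum_n += ss / N                     # biased: divides by N
    sum_n1 += ss / (N - 1)              # unbiased: divides by N - 1

est_n, est_n1 = sum_n / trials, sum_n1 / trials
# est_n settles near (N-1)/N * 4 = 3.2, while est_n1 settles near 4.
```

With N = 5 the biased estimator falls short by a factor of 4/5, which matches the (N - 1)/N factor from the derivation; for large N the two are nearly identical, which is the practical point made at the top.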