A Concise Derivation of the Chi-Squared Tests

On this page, I will derive the covariance matrix for the terms \frac{o_i-e_i}{\sqrt{e_i}} in a Chi-Squared Goodness-of-Fit test. It is not strictly necessary to digest this whole derivation to understand why Chi-Squared tests work; the main point of interest in this post is the final result. For a discussion of this result that avoids the derivation, but tries to build intuition and visualizations instead, see my other post here.

Preamble/Moan

I wrote this post (in fact, created this blog) out of frustration with the quality of materials on this topic on the web. Many derivations out there are outright wrong, or gloss over the most important facts. This page, for example, claims that we use E as the denominator in \frac{(O-E)^2}{E} because E is the variance of a Poisson distribution. But, in general, we do not have a Poisson distribution; if the probability p of a success on any given trial is nontrivial, then the variance differs (often considerably) from this, by a factor of (1-p). That derivation also cannot account for why we lose any degrees of freedom. This page recognises that the (apparent!) discrepancy of (1-p) does exist, but waves it away by saying “There are theoretical reasons, beyond the scope of this book, that make it preferable to omit the factors (1 – pi)”. Seriously?!?! Why would it be preferable to omit a factor that could be arbitrarily far from 1?? And according to that derivation, we still haven’t lost any degrees of freedom anywhere.

Then I found this page and started to realise that the reason there aren’t clear explanations is that the real explanation is quite complicated, and the other pages I referenced were trying to avoid falling down the hole of making sense of it! That page is absolutely correct, but quite terse, and takes some reading. Eventually, I stumbled across this wonderful, clear derivation. Finally, the penny dropped!

However, looking over the derivation, I felt that it would be possible to capture the underlying gist in far less space, using a little linear algebra. So, what follows is my as-concise-as-possible derivation of why the Chi-Squared test works. If any of it does not make sense to you, I refer you to the very clear derivation linked at the end of the last paragraph, on which this one is based.

The Derivation

To make this clean, we will need vector notation. I'll define \mathbf{o} to be the vector of m observed counts for each multinomial category, and \mathbf{e} to be the vector of expected values. In the goodness-of-fit test, we model the observed values as the counts from a multinomial distribution. So \mathbb{E}(\mathbf{o}) = \mathbf{e}, and the observed and expected values sum to the same total (\mathbf{o}^T\mathbf{1} = \mathbf{e}^T\mathbf{1}).

Our main obstacle is finding \mathbb{E}(\mathbf{oo}^T). For this part, I will use the analogy (and similar notation) from the derivation here. Our multinomial is like throwing n balls X_1,…,X_n into m buckets B_1,…,B_m, and our observed values o_i are given by the number of balls in each given bucket:

o_i = \sum_{l=1}^n I(X_l \in B_i).
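
To make the analogy concrete, here is a minimal NumPy sketch (the names n, p, X and o are my own, purely for illustration): it throws n balls one at a time and builds each observed count by summing the indicators, exactly as in the formula above.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, np.array([0.2, 0.3, 0.5])                   # n balls, bucket probabilities
m = len(p)

X = rng.choice(m, size=n, p=p)                          # X_l = the bucket that ball l lands in
o = np.array([(X == i).sum() for i in range(m)])        # o_i = sum_l I(X_l in B_i)

print(o, o.sum())                                       # the counts sum to n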

We can use this analogy to find \mathbb{E}(\mathbf{oo}^T):

\mathbb{E}(o_io_j)
= \mathbb{E}\left(\left[\sum\limits_{l=1}^n I(X_l \in B_i)\right]\left[\sum\limits_{l'=1}^n I(X_{l'} \in B_j)\right]\right)
= \mathbb{E}\left(\sum_{l,l'} I(X_l \in B_i)I(X_{l'} \in B_j)\right)
= \mathbb{E}\left(\sum_{l=l'} I(X_l \in B_i)I(X_{l'} \in B_j) + \sum_{l \neq l'} I(X_l \in B_i)I(X_{l'} \in B_j)\right)
= \sum_{l=l'} \underbrace{\mathbb{E}\left(I(X_l \in B_i)I(X_{l'} \in B_j)\right)}_{=I(i=j)p_i} + \sum_{l \neq l'} \underbrace{\mathbb{E}\left(I(X_l \in B_i)I(X_{l'} \in B_j)\right)}_{=p_ip_j}
= n \left[ I(i=j)p_i \right] + n(n-1) \left[ p_ip_j \right]
= I(i=j)e_i + \frac{n-1}{n}e_ie_j,

where the result in the first underbrace above comes from the fact that a ball cannot be in two different buckets at the same time (so the \mathbb{E} there is zero for i \neq j, and is just p_i when i = j), and the result in the second underbrace follows because distinct balls land in their buckets independently.

From this we have:

\mathbb{E}(\mathbf{oo}^T) = \mathrm{diag}(\mathbf{e})+\frac{n-1}{n}\mathbf{ee}^T,

where I have used \mathrm{diag} to denote the placing of the elements of the vector onto the diagonal of an otherwise zero matrix.
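
This formula is easy to check by simulation, if you want reassurance. The sketch below (NumPy assumed; none of this is part of the derivation itself) compares an empirical estimate of \mathbb{E}(\mathbf{oo}^T) with \mathrm{diag}(\mathbf{e})+\frac{n-1}{n}\mathbf{ee}^T.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, np.array([0.1, 0.3, 0.6])
e = n * p

samples = rng.multinomial(n, p, size=200_000)           # each row is one draw of o
empirical = np.einsum('ki,kj->ij', samples, samples) / len(samples)   # estimate of E(oo^T)
theory = np.diag(e) + (n - 1) / n * np.outer(e, e)

print(np.abs(empirical - theory).max())                 # small compared with the entries of theory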

From here, deriving the covariance matrix of \mathbf{o}-\mathbf{e} is pretty easy. Making use of the facts that \mathbb{E}(\mathbf{o})=\mathbf{e} and, therefore, \mathbb{E}(\mathbf{o}-\mathbf{e})=\mathbf{0}:

\mathrm{cov}(\mathbf{o}-\mathbf{e})
= \mathbb{E}\left[(\mathbf{o}-\mathbf{e})(\mathbf{o}-\mathbf{e})^T\right]
= \mathbb{E}(\mathbf{oo}^T) - \mathbb{E}(\mathbf{o})\mathbf{e}^T - \mathbf{e}\mathbb{E}(\mathbf{o})^T + \mathbf{ee}^T
= \mathbb{E}(\mathbf{oo}^T) - \mathbf{ee}^T
= \left[ \mathrm{diag}(\mathbf{e}) + \frac{n-1}{n}\mathbf{ee}^T \right] - \mathbf{ee}^T
= \mathrm{diag}(\mathbf{e}) -\frac{1}{n}\mathbf{ee}^T
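
If you would like the same kind of numerical reassurance here, the following sketch (again, purely illustrative) checks that the empirical covariance of \mathbf{o}-\mathbf{e} matches \mathrm{diag}(\mathbf{e})-\frac{1}{n}\mathbf{ee}^T; note that this is just the usual multinomial covariance n\left[\mathrm{diag}(\mathbf{p})-\mathbf{pp}^T\right] rewritten in terms of \mathbf{e}.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, np.array([0.1, 0.3, 0.6])
e = n * p

samples = rng.multinomial(n, p, size=200_000)
empirical = np.cov(samples - e, rowvar=False)           # empirical covariance of o - e
theory = np.diag(e) - np.outer(e, e) / n

print(np.abs(empirical - theory).max())                 # again, small relative to the entries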

Now, in the chi-squared formula, for each cell, we are calculating \left( \frac{o_i-e_i}{\sqrt{e_i}} \right)^2. What happens when we divide by \sqrt{e_i}? Here, for succinctness, I will use the slight abuse of notation whereby \sqrt{\mathbf{e}} denotes the element-wise square root of \mathbf{e}:

\mathrm{cov} \left(  \left[\mathrm{diag}(\sqrt{\mathbf{e}})\right]^{-1} (\mathbf{o}-\mathbf{e})  \right)
= \left[\mathrm{diag}(\sqrt{\mathbf{e}})\right]^{-1} \mathrm{cov} (\mathbf{o}-\mathbf{e}) \left[\mathrm{diag}(\sqrt{\mathbf{e}}) \right]^{-1}
= \left[\mathrm{diag}(\sqrt{\mathbf{e}})\right]^{-1} \left( \mathrm{diag}(\mathbf{e}) -\frac{1}{n}\mathbf{ee}^T \right) \left[\mathrm{diag}(\sqrt{\mathbf{e}}) \right]^{-1}
= I - \frac{1}{n}\sqrt{\mathbf{e}}\sqrt{\mathbf{e}}^T
= I - \sqrt{\mathbf{p}}\sqrt{\mathbf{p}}^T

where the vector \sqrt{\mathbf{p}} is the vector of the square roots of the probabilities of each multinomial outcome. Importantly, this vector is a unit vector (since the p_i sum to 1), so the covariance matrix is singular, having a zero eigenvalue in the direction \sqrt{\mathbf{p}}, and unit eigenvalues in the (m-1) directions orthogonal to it. So, our covariance matrix tells us that our \frac{o_i-e_i}{\sqrt{e_i}} terms are distributed according to a unit sphere that has been squashed flat in one dimension. Looking along the diagonal, we see that each variance in isolation is indeed (1-p_i); it is when we consider the shape of all the terms together that the \chi^2-distributed nature emerges: asymptotically, the vector of \frac{o_i-e_i}{\sqrt{e_i}} terms behaves like a standard normal vector confined to the (m-1)-dimensional subspace orthogonal to \sqrt{\mathbf{p}}, so its squared length is \chi^2-distributed with m-1 degrees of freedom, which is exactly where the lost degree of freedom goes. I build more intuition, and present some visualizations based on this, here.
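
To finish, here is one last sketch (it assumes SciPy is available, and is only a numerical illustration of the result, not part of the argument). It checks that I - \sqrt{\mathbf{p}}\sqrt{\mathbf{p}}^T has a single zero eigenvalue and (m-1) unit eigenvalues, and that the statistic \sum_i (o_i-e_i)^2/e_i really does behave like a \chi^2 variable with m-1 degrees of freedom.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, np.array([0.1, 0.2, 0.3, 0.4])
e, m = n * p, len(p)

cov = np.eye(m) - np.outer(np.sqrt(p), np.sqrt(p))
print(np.linalg.eigvalsh(cov))                          # approximately [0, 1, 1, 1]

o = rng.multinomial(n, p, size=100_000)
chi2_stat = np.sum((o - e) ** 2 / e, axis=1)            # the goodness-of-fit statistic for each draw
print(np.quantile(chi2_stat, [0.5, 0.9, 0.99]))         # empirical quantiles...
print(stats.chi2(df=m - 1).ppf([0.5, 0.9, 0.99]))       # ...match the chi-squared(m-1) quantiles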
