Pearson’s chi-squared test ($\chi^2$ test) is a fundamental statistical method for analyzing categorical data. It evaluates whether the differences between observed frequencies and the frequencies expected under a hypothesized distribution can plausibly be attributed to chance.
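Let $F_0(x)$ be the distribution function of a known discrete population, and let $X_1, X_2, \dots, X_n$ be independent and identically distributed sample observations.
Assume $X_1 \sim F(x)$. The hypothesis testing problem is:
$$H_0: F(x) = F_0(x) \quad \text{versus} \quad H_1: F(x) \neq F_0(x).$$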
For large sample sizes, Karl Pearson proposed the $\chi^2$ goodness-of-fit test in 1900. The following discussion addresses two scenarios, according to whether $F_0(x)$ contains unknown parameters or not.
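Case 1: Probability Function Does Not Contain Unknown Parameters
First, consider the case where $F_0(x)$ does not contain unknown parameters. Let the probability function of $F_0(x)$ be:
$$P(X = a_i) = p_i, \quad i = 1, 2, \dots, k.$$
Let $n_i$ be the number of observations among $X_1, \dots, X_n$ that equal $a_i$, $i = 1, 2, \dots, k$. The observed frequency $n_i$ and the expected frequency $np_i$ are shown in the table below.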
Value | $a_1$ | $a_2$ | $\cdots$ | $a_k$ |
---|---|---|---|---|
Expected Frequency | $np_1$ | $np_2$ | $\cdots$ | $np_k$ |
Observed Frequency | $n_1$ | $n_2$ | $\cdots$ | $n_k$ |
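By the law of large numbers, when $H_0$ holds, $n_i/n$ converges in probability to $p_i$, $i = 1, 2, \dots, k$. Therefore, the expected frequency $np_i$ is close to the observed frequency $n_i$. When $H_1$ holds, $n_i/n$ does not converge in probability to $p_i$ for at least one $i$, so the expected frequency $np_i$ is far from the observed frequency $n_i$. The test statistic is:
$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i}.$$
When $H_0$ holds, it can be proven (details omitted) that, as $n \to \infty$,
$$\chi^2 \xrightarrow{d} \chi^2(k-1).$$
Large values of the statistic are evidence against $H_0$, so $H_0$ is rejected at significance level $\alpha$ when the statistic exceeds the upper $\alpha$ quantile of $\chi^2(k-1)$. This test method is called the goodness-of-fit test.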
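The computation for the parameter-free case can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the die counts are hypothetical data invented for the example, the helper name `pearson_statistic` is an assumption, and the critical value is the tabulated upper 5% point of the chi-squared distribution with 5 degrees of freedom.

```python
# Pearson goodness-of-fit test for a fully specified distribution.
# The counts below are hypothetical illustration data.

def pearson_statistic(observed, probs):
    """Return sum_i (n_i - n*p_i)^2 / (n*p_i) for counts n_i and probabilities p_i."""
    n = sum(observed)
    return sum((n_i - n * p_i) ** 2 / (n * p_i)
               for n_i, p_i in zip(observed, probs))

observed = [16, 18, 16, 14, 12, 44]  # hypothetical counts for die faces 1..6 (n = 120)
probs = [1 / 6] * 6                  # H0: every face has probability 1/6

stat = pearson_statistic(observed, probs)
df = len(observed) - 1               # number of cells minus one
critical = 11.070                    # upper 5% point of chi2(5), from standard tables

print(f"chi2 = {stat:.3f}, df = {df}, reject H0 at 5%: {stat > critical}")
```

Here every expected frequency is 20, so the statistic is 712/20 = 35.6, far above 11.070, and the hypothesis of a fair die would be rejected at the 5% level. In practice, `scipy.stats.chisquare` computes the same statistic together with a p-value.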
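Case 2: Probability Function Contains Unknown Parameters
Let the probability function of the distribution function $F_0(x)$ be
$$P(X = a_i) = p_i(\theta_1, \dots, \theta_m), \quad i = 1, 2, \dots, k,$$
where $\theta_1, \dots, \theta_m$ are unknown, $m < k - 1$.
When $H_0$ holds, the maximum likelihood estimate of $(\theta_1, \dots, \theta_m)$ is $(\hat{\theta}_1, \dots, \hat{\theta}_m)$. According to the invariance of the maximum likelihood estimate, when $H_0$ holds, the maximum likelihood estimate of $p_i$ is
$$\hat{p}_i = p_i(\hat{\theta}_1, \dots, \hat{\theta}_m), \quad i = 1, 2, \dots, k.$$
Construct the test statistic
$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - n\hat{p}_i)^2}{n\hat{p}_i}.$$
When $H_0$ holds, $\chi^2 \xrightarrow{d} \chi^2(k - m - 1)$, where $m$ is the number of unknown parameters. The degrees of freedom can be understood as follows: there are $k$ degrees of freedom in total. To find the maximum likelihood estimates of $\theta_1, \dots, \theta_m$, $m$ equations are needed, i.e., there are $m$ constraints, reducing $m$ degrees of freedom. The constraint $\sum_{i=1}^{k} n_i = n$ further reduces one degree of freedom, leaving $k - m - 1$ degrees of freedom.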
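As an illustration of the case with estimated parameters, the sketch below uses a Hardy-Weinberg model, chosen here only as an example and not taken from the text. Three genotype cells have probabilities $\theta^2$, $2\theta(1-\theta)$, $(1-\theta)^2$ with one unknown parameter $\theta$, whose maximum likelihood estimate is $(2n_1 + n_2)/(2n)$. The counts and the helper name `hw_probs` are hypothetical.

```python
# Goodness-of-fit test with one estimated parameter.
# Model (an assumption for this example): Hardy-Weinberg genotype
# probabilities theta^2, 2*theta*(1-theta), (1-theta)^2.

def hw_probs(theta):
    """Cell probabilities p_i(theta) for the three genotype cells."""
    return [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]

observed = [30, 50, 20]  # hypothetical counts n_1, n_2, n_3
n = sum(observed)

# Maximum likelihood estimate of theta under H0.
theta_hat = (2 * observed[0] + observed[1]) / (2 * n)

# By invariance of the MLE, plug theta_hat into p_i(theta).
p_hat = hw_probs(theta_hat)

stat = sum((n_i - n * p) ** 2 / (n * p) for n_i, p in zip(observed, p_hat))
df = len(observed) - 1 - 1  # cells - estimated parameters - 1 = 3 - 1 - 1
critical = 3.841            # upper 5% point of chi2(1), from standard tables

print(f"theta_hat = {theta_hat:.3f}, chi2 = {stat:.4f}, df = {df}, "
      f"reject H0 at 5%: {stat > critical}")
```

With these counts the estimate is 0.55 and the statistic is about 0.0102, well below 3.841, so the model is not rejected. Note that the degrees of freedom drop from 2 to 1 because one parameter was estimated from the data.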
Contingency Table
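Let there be two attributes: $A$ and $B$. Attribute $A$ has $r$ levels: $A_1, A_2, \dots, A_r$, and attribute $B$ has $s$ levels: $B_1, B_2, \dots, B_s$. There is a sample of $n$ observations, where the frequency of attribute $A$ having level $A_i$ and attribute $B$ having level $B_j$ is $n_{ij}$, $i = 1, \dots, r$, $j = 1, \dots, s$. The data are presented in the following contingency table:
Level | $B_1$ | $B_2$ | $\cdots$ | $B_s$ | Row Sum |
---|---|---|---|---|---|
$A_1$ | $n_{11}$ | $n_{12}$ | $\cdots$ | $n_{1s}$ | $n_{1\cdot}$ |
$A_2$ | $n_{21}$ | $n_{22}$ | $\cdots$ | $n_{2s}$ | $n_{2\cdot}$ |
$\vdots$ | $\vdots$ | $\vdots$ | | $\vdots$ | $\vdots$ |
$A_r$ | $n_{r1}$ | $n_{r2}$ | $\cdots$ | $n_{rs}$ | $n_{r\cdot}$ |
Column Sum | $n_{\cdot 1}$ | $n_{\cdot 2}$ | $\cdots$ | $n_{\cdot s}$ | $n$ |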
We want to examine whether attributes $A$ and $B$ are independent. This is a problem of testing for independence in a contingency table (testing whether the random variables corresponding to the rows and columns are independent). The hypothesis testing problem is:
$H_0$: Attributes $A$ and $B$ are independent
$H_1$: Attributes $A$ and $B$ are not independent
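Let
$$p_{ij} = P(A_i B_j), \quad p_{i\cdot} = P(A_i), \quad p_{\cdot j} = P(B_j), \quad i = 1, \dots, r, \; j = 1, \dots, s.$$
When $H_0$ holds,
$$p_{ij} = p_{i\cdot}\, p_{\cdot j}, \quad i = 1, \dots, r, \; j = 1, \dots, s.$$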
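From the problem description, the maximum likelihood estimates of $p_{i\cdot}$ and $p_{\cdot j}$ are:
$$\hat{p}_{i\cdot} = \frac{n_{i\cdot}}{n}, \qquad \hat{p}_{\cdot j} = \frac{n_{\cdot j}}{n},$$
where $n_{i\cdot} = \sum_{j=1}^{s} n_{ij}$ is the sum of $n_{ij}$ over $j$ (i.e., the row sum), and $n_{\cdot j} = \sum_{i=1}^{r} n_{ij}$ is the sum of $n_{ij}$ over $i$ (i.e., the column sum).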
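The observed frequency of attribute $A$ having level $A_i$ and attribute $B$ having level $B_j$ is $n_{ij}$. The expected frequency when $H_0$ holds is $n\hat{p}_{i\cdot}\hat{p}_{\cdot j} = n_{i\cdot} n_{\cdot j} / n$. Therefore, the test statistic is defined as:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{s} \frac{\left(n_{ij} - n_{i\cdot} n_{\cdot j} / n\right)^2}{n_{i\cdot} n_{\cdot j} / n}.$$
When $H_0$ holds, it can be proven (details omitted here) that:
$$\chi^2 \xrightarrow{d} \chi^2\big((r-1)(s-1)\big).$$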
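The independence test can be sketched directly from the row and column sums. The 2 x 3 table below is hypothetical data made up for illustration, and the critical value is the tabulated upper 5% point of the chi-squared distribution with 2 degrees of freedom.

```python
# Chi-squared test of independence for a two-way contingency table.
# The table is hypothetical illustration data.

table = [
    [20, 30, 50],  # counts for the first level of attribute A
    [30, 30, 40],  # counts for the second level of attribute A
]

r, s = len(table), len(table[0])
row_sums = [sum(row) for row in table]                       # n_i.
col_sums = [sum(row[j] for row in table) for j in range(s)]  # n_.j
n = sum(row_sums)

stat = 0.0
for i in range(r):
    for j in range(s):
        # Estimated expected frequency under independence: n_i. * n_.j / n
        expected = row_sums[i] * col_sums[j] / n
        stat += (table[i][j] - expected) ** 2 / expected

df = (r - 1) * (s - 1)
critical = 5.991  # upper 5% point of chi2(2), from standard tables

print(f"chi2 = {stat:.4f}, df = {df}, reject H0 at 5%: {stat > critical}")
```

For this table the expected frequencies are 25, 30, 45 in each row, giving a statistic of 28/9 (about 3.11), which is below 5.991, so independence is not rejected at the 5% level. `scipy.stats.chi2_contingency` performs the same computation and also returns a p-value.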
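The degrees of freedom can be calculated as follows: There are a total of $rs$ degrees of freedom. Estimating the maximum likelihood estimates of $p_{i\cdot}$ requires $r - 1$ equations (because $\sum_{i=1}^{r} p_{i\cdot} = 1$), which reduces $r - 1$ degrees of freedom. Estimating the maximum likelihood estimates of $p_{\cdot j}$ requires $s - 1$ equations, which reduces $s - 1$ degrees of freedom. Furthermore, since the null hypothesis holds:
$$\sum_{i=1}^{r} \sum_{j=1}^{s} p_{ij} = \sum_{i=1}^{r} \sum_{j=1}^{s} p_{i\cdot}\, p_{\cdot j} = 1.$$
This reduces another 1 degree of freedom. Thus, the remaining degrees of freedom are $rs - (r - 1) - (s - 1) - 1 = (r - 1)(s - 1)$.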
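One informal way to check this degrees-of-freedom count is by simulation, since a chi-squared variable has mean equal to its degrees of freedom. The sketch below draws many tables with independent rows and columns and averages the test statistic. The marginal probabilities, sample size, number of replications, seed, and function names are all arbitrary choices made for this illustration.

```python
import random

def independence_statistic(table):
    """Pearson statistic for independence in a two-way table of counts."""
    r, s = len(table), len(table[0])
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(r)) for j in range(s)]
    n = sum(row_sums)
    stat = 0.0
    for i in range(r):
        for j in range(s):
            expected = row_sums[i] * col_sums[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

def simulate_mean_statistic(row_probs, col_probs, n, reps, seed=0):
    """Average the statistic over tables generated under independence."""
    rng = random.Random(seed)
    r, s = len(row_probs), len(col_probs)
    total = 0.0
    for _ in range(reps):
        table = [[0] * s for _ in range(r)]
        # Row and column levels are drawn independently, so H0 is true.
        rows = rng.choices(range(r), weights=row_probs, k=n)
        cols = rng.choices(range(s), weights=col_probs, k=n)
        for i, j in zip(rows, cols):
            table[i][j] += 1
        total += independence_statistic(table)
    return total / reps

row_probs = [0.4, 0.6]       # hypothetical marginal distribution of A
col_probs = [0.2, 0.3, 0.5]  # hypothetical marginal distribution of B
mean_stat = simulate_mean_statistic(row_probs, col_probs, n=300, reps=1000)
df = (len(row_probs) - 1) * (len(col_probs) - 1)

print(f"simulated mean of statistic = {mean_stat:.3f}, (r-1)(s-1) = {df}")
```

The simulated mean should land close to 2 for this 2 x 3 layout, consistent with $(2-1)(3-1) = 2$. This is a sanity check under the chosen settings rather than a proof of the limiting distribution.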