Pearson’s chi-squared test ($\chi^2$ test) is a fundamental statistical method used to analyze categorical data by evaluating whether observed differences between data sets could have arisen by chance.

Let $F(x)$ be the distribution function of a discrete population. We have independent and identically distributed observation samples $X_1, X_2, \ldots, X_n$.

Let $F_0(x)$ be a specified distribution function. The hypothesis testing problem is:

$$H_0: F(x) = F_0(x) \quad \text{vs.} \quad H_1: F(x) \neq F_0(x).$$

When the sample size $n$ is large, the $\chi^2$ goodness-of-fit test, proposed by Karl Pearson in 1900, applies. The following discussion addresses two scenarios, according to whether $F_0(x)$ contains unknown parameters or not.

Case 1: Probability Function Does Not Contain Unknown Parameters

First, consider the case where $F_0(x)$ does not contain unknown parameters. Let the probability function of $F_0(x)$ be:

$$P(X = a_i) = p_i, \quad i = 1, 2, \ldots, r.$$

Let $n_i$ be the number of observations among $X_1, \ldots, X_n$ that equal $a_i$, so that $\sum_{i=1}^{r} n_i = n$. The observed frequency $n_i$ and the expected frequency $np_i$ are shown in the table below.

| Value              | $a_1$    | $a_2$    | $\cdots$ | $a_r$    |
|--------------------|----------|----------|----------|----------|
| Expected frequency | $np_1$   | $np_2$   | $\cdots$ | $np_r$   |
| Observed frequency | $n_1$    | $n_2$    | $\cdots$ | $n_r$    |

By the law of large numbers, when $H_0$ holds, $n_i/n$ converges in probability to $p_i$, $i = 1, \ldots, r$. Therefore, the expected frequency $np_i$ is close to the observed frequency $n_i$. When $H_1$ holds, $n_i/n$ does not converge in probability to $p_i$ for some $i$, so the expected frequency is far from the observed frequency. The test statistic is:

$$\chi^2 = \sum_{i=1}^{r} \frac{(n_i - np_i)^2}{np_i}.$$

When $H_0$ holds, it can be proven (details omitted) that, as $n \to \infty$,

$$\chi^2 \xrightarrow{d} \chi^2(r - 1).$$

This test method is called the $\chi^2$ goodness-of-fit test.
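As a concrete illustration, the statistic above can be computed for testing whether a die is fair. The observed counts below are hypothetical; the 0.95 quantile of $\chi^2(5)$, approximately 11.07, serves as the critical value.

```python
# A minimal sketch of the goodness-of-fit test with no unknown parameters,
# using hypothetical counts from n = 120 rolls of a die.

# Hypothesized probabilities under H0: each of the r = 6 faces equally likely.
r = 6
p = [1 / r] * r

# Hypothetical observed frequencies n_i.
observed = [14, 30, 18, 20, 20, 18]
n = sum(observed)

# Expected frequencies n * p_i under H0.
expected = [n * pi for pi in p]

# Pearson's statistic: sum over cells of (n_i - n p_i)^2 / (n p_i).
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Under H0, chi2 is approximately chi^2(r - 1) = chi^2(5);
# its 0.95 quantile is about 11.07.
critical = 11.07
print(round(chi2, 2), chi2 >= critical)
```

Here the statistic (7.2) falls below the critical value, so the fairness hypothesis is not rejected at the 5% level.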

Case 2: The Probability Function Contains Unknown Parameters

Let the probability function of the distribution function $F_0(x)$ be

$$P(X = a_i) = p_i(\theta_1, \theta_2, \ldots, \theta_k), \quad i = 1, 2, \ldots, r,$$

where $\theta_1, \theta_2, \ldots, \theta_k$ are unknown parameters, $k < r - 1$.

When $H_0$ holds, denote the maximum likelihood estimates of $\theta_1, \ldots, \theta_k$ by $\hat{\theta}_1, \ldots, \hat{\theta}_k$. According to the invariance property of maximum likelihood estimation, when $H_0$ holds, the maximum likelihood estimate of $p_i$ is

$$\hat{p}_i = p_i(\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k), \quad i = 1, 2, \ldots, r.$$

Construct the test statistic

$$\chi^2 = \sum_{i=1}^{r} \frac{(n_i - n\hat{p}_i)^2}{n\hat{p}_i}.$$

When $H_0$ holds, $\chi^2 \xrightarrow{d} \chi^2(r - k - 1)$ as $n \to \infty$, where $k$ is the number of unknown parameters. The degrees of freedom can be understood as follows: there are $r$ degrees of freedom in total. To find the maximum likelihood estimates of $\theta_1, \ldots, \theta_k$, $k$ equations are needed, i.e., there are $k$ constraints, reducing $k$ degrees of freedom. The constraint $\sum_{i=1}^{r} \hat{p}_i = 1$ further reduces one degree of freedom, leaving $r - k - 1$ degrees of freedom.
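A minimal sketch of this case, fitting a Poisson($\theta$) distribution to hypothetical count data: the cells are the counts $0$, $1$, $2$, and $\geq 3$, so $r = 4$ and $k = 1$, and the limiting distribution is $\chi^2(4 - 1 - 1) = \chi^2(2)$. Treating the "$\geq 3$" cell as exactly 3 when computing the sample mean is a simplification made for this illustration.

```python
import math

# Hypothetical observed frequencies n_i for cells 0, 1, 2, and ">= 3".
counts = {0: 52, 1: 48, 2: 16, 3: 4}
n = sum(counts.values())

# MLE of theta under H0 is the sample mean (the ">= 3" cell is treated
# as exactly 3 here, a simplification for this sketch).
theta_hat = sum(x * c for x, c in counts.items()) / n

def pois(x, lam):
    # Poisson probability mass function P(X = x).
    return math.exp(-lam) * lam ** x / math.factorial(x)

# Estimated cell probabilities p_i(theta_hat); the last cell is the
# tail probability P(X >= 3), so the p_hat sum to 1 by construction.
p_hat = [pois(0, theta_hat), pois(1, theta_hat), pois(2, theta_hat)]
p_hat.append(1 - sum(p_hat))

observed = [counts[0], counts[1], counts[2], counts[3]]

# Pearson's statistic with estimated expected frequencies n * p_hat_i.
chi2 = sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, p_hat))

# Under H0, chi2 is approximately chi^2(r - k - 1) = chi^2(2).
print(round(chi2, 3))
```

A small value of the statistic relative to the $\chi^2(2)$ quantiles indicates the Poisson model fits these counts adequately.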

Contingency Table

Let there be two attributes $A$ and $B$. Attribute $A$ has $r$ levels $A_1, A_2, \ldots, A_r$, and attribute $B$ has $s$ levels $B_1, B_2, \ldots, B_s$. There is a sample of $n$ observations, in which the frequency of observations with attribute $A$ at level $A_i$ and attribute $B$ at level $B_j$ is $n_{ij}$, $i = 1, \ldots, r$, $j = 1, \ldots, s$. The data are presented in the following contingency table:

|          | $B_1$    | $B_2$    | $\cdots$ | $B_s$    |
|----------|----------|----------|----------|----------|
| $A_1$    | $n_{11}$ | $n_{12}$ | $\cdots$ | $n_{1s}$ |
| $A_2$    | $n_{21}$ | $n_{22}$ | $\cdots$ | $n_{2s}$ |
| $\vdots$ | $\vdots$ | $\vdots$ |          | $\vdots$ |
| $A_r$    | $n_{r1}$ | $n_{r2}$ | $\cdots$ | $n_{rs}$ |

We want to examine whether attributes $A$ and $B$ are independent. This is the problem of testing for independence in a contingency table (testing whether the random variables corresponding to the rows and columns are independent). The hypothesis testing problem is:

$H_0$: Attributes $A$ and $B$ are independent $\quad$ vs. $\quad$ $H_1$: Attributes $A$ and $B$ are not independent

Let

$$p_{ij} = P(A_iB_j), \quad p_{i\cdot} = P(A_i), \quad p_{\cdot j} = P(B_j), \quad i = 1, \ldots, r, \; j = 1, \ldots, s.$$

When $H_0$ holds,

$$p_{ij} = p_{i\cdot}\,p_{\cdot j}, \quad i = 1, \ldots, r, \; j = 1, \ldots, s.$$

From the problem description, the maximum likelihood estimates of $p_{i\cdot}$ and $p_{\cdot j}$ are:

$$\hat{p}_{i\cdot} = \frac{n_{i\cdot}}{n}, \qquad \hat{p}_{\cdot j} = \frac{n_{\cdot j}}{n},$$

where $n_{i\cdot} = \sum_{j=1}^{s} n_{ij}$ is the sum of $n_{ij}$ over $j$ (the $i$-th row sum), and $n_{\cdot j} = \sum_{i=1}^{r} n_{ij}$ is the sum of $n_{ij}$ over $i$ (the $j$-th column sum).

The observed frequency of attribute $A$ at level $A_i$ and attribute $B$ at level $B_j$ is $n_{ij}$. The expected frequency when $H_0$ holds is $n\hat{p}_{i\cdot}\hat{p}_{\cdot j} = n_{i\cdot}n_{\cdot j}/n$. Therefore, the test statistic is defined as:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{s}\frac{\left(n_{ij} - n_{i\cdot}n_{\cdot j}/n\right)^2}{n_{i\cdot}n_{\cdot j}/n}.$$

When $H_0$ holds, it can be proven (details omitted here) that:

$$\chi^2 \xrightarrow{d} \chi^2\big((r - 1)(s - 1)\big).$$

The degrees of freedom can be calculated as follows. There are $rs$ degrees of freedom in total. Estimating the maximum likelihood estimates of $p_{1\cdot}, \ldots, p_{r\cdot}$ requires $r - 1$ equations (because $\sum_{i=1}^{r} p_{i\cdot} = 1$), which reduces $r - 1$ degrees of freedom. Estimating the maximum likelihood estimates of $p_{\cdot 1}, \ldots, p_{\cdot s}$ requires $s - 1$ equations (because $\sum_{j=1}^{s} p_{\cdot j} = 1$), which reduces $s - 1$ degrees of freedom. Furthermore, since the null hypothesis holds,

$$\sum_{i=1}^{r}\sum_{j=1}^{s} \hat{p}_{i\cdot}\hat{p}_{\cdot j} = 1.$$

This reduces another 1 degree of freedom. Thus, the remaining degrees of freedom are

$$rs - (r - 1) - (s - 1) - 1 = (r - 1)(s - 1).$$
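The whole computation can be sketched for a hypothetical $2 \times 3$ table; with $r = 2$ and $s = 3$ the limiting distribution is $\chi^2\big((2-1)(3-1)\big) = \chi^2(2)$, whose 0.95 quantile is about 5.99.

```python
# A minimal sketch of the independence test for an r x s contingency table,
# using hypothetical counts n_ij (rows = levels of A, columns = levels of B).

table = [
    [30, 20, 10],
    [10, 25, 25],
]
r, s = len(table), len(table[0])
n = sum(sum(row) for row in table)

row_sums = [sum(row) for row in table]                              # n_{i.}
col_sums = [sum(table[i][j] for i in range(r)) for j in range(s)]   # n_{.j}

# Pearson's statistic with expected frequency n_{i.} * n_{.j} / n under H0.
chi2 = 0.0
for i in range(r):
    for j in range(s):
        expected = row_sums[i] * col_sums[j] / n
        chi2 += (table[i][j] - expected) ** 2 / expected

# Under H0, chi2 is approximately chi^2((r-1)(s-1)) = chi^2(2);
# its 0.95 quantile is about 5.99.
print(round(chi2, 2), chi2 >= 5.99)
```

For these hypothetical counts the statistic (about 16.98) exceeds the critical value, so independence of $A$ and $B$ would be rejected at the 5% level.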