Chi-Square Test for Categorical Variables

Introduction

The chi-square test is a statistical method used to determine whether there is a statistically significant association between two categorical variables. It is widely used in fields such as the social sciences, marketing, and healthcare to analyze survey data, experimental results, and observational studies.

Concept

The chi-square test of independence is a non-parametric method for examining the association between two categorical variables. It evaluates whether the observed cell frequencies deviate from the frequencies expected under independence by more than chance alone would explain. Under the null hypothesis, and provided the expected counts are not too small, the test statistic approximately follows a chi-square distribution, which is what allows the size of the deviation to be judged.

Null Hypothesis and Alternative Hypothesis

The chi-square test involves formulating two hypotheses:

Null Hypothesis $(H_0)$ - Assumes that there is no association between the categorical variables, so that any observed differences between observed and expected frequencies are due to random chance.

Alternative Hypothesis $(H_1)$ - Assumes that there is an association between the variables, so that the observed differences are not due to chance alone.

Formula

The chi-square statistic is calculated using the formula:

$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$

where
$O_i$ is the observed frequency in cell $i$ of the contingency table.
$E_i$ is the expected frequency in cell $i$ under independence, calculated from the totals of the row and column containing that cell:

$E_i = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$

The sum is taken over all cells in the contingency table.
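
The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `expected_frequencies` and `chi_square_statistic` and the 2x3 table are hypothetical and chosen only for this example, and the table is assumed to be a list of rows of non-negative counts with no empty row or column.

```python
def expected_frequencies(observed):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum (O - E)^2 / E over every cell of the contingency table."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2x3 table, used only to exercise the functions.
table = [[10, 20, 30],
         [20, 20, 20]]
print(expected_frequencies(table))
print(round(chi_square_statistic(table), 3))
```

Computing the row and column totals once and reusing them keeps the code close to the formula for $E_i$.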

The calculated chi-square statistic is then compared with a critical value from the chi-square distribution table, which lists critical values for different degrees of freedom $(df)$ and significance levels $(\alpha)$. If the statistic exceeds the critical value, the null hypothesis is rejected.

The degrees of freedom for the test are calculated as:

$df = (r - 1) \times (c - 1)$

where $r$ is the number of rows and $c$ is the number of columns in the contingency table, not counting the totals.
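
As a rough sketch, the degrees of freedom follow directly from the table's shape. For the special case $df = 1$ only, the critical value can also be obtained without a printed table, because a chi-square variable with one degree of freedom is the square of a standard normal variable. The helper names below are hypothetical, and `critical_value_df1` is not valid for any other $df$.

```python
from statistics import NormalDist


def degrees_of_freedom(n_rows, n_cols):
    """df = (r - 1) * (c - 1) for an r x c contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def critical_value_df1(alpha=0.05):
    """Critical value for df = 1 only: the squared two-sided normal quantile."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


print(degrees_of_freedom(2, 2))            # 1
print(degrees_of_freedom(3, 4))            # 6
print(round(critical_value_df1(0.05), 3))  # 3.841
```

For tables larger than 2x2, a chi-square table or a statistics library is still needed to look up the critical value.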

Applications

Market Research: Analyzing the association between customer demographics and product preferences.
Healthcare: Studying the relationship between patient characteristics and disease incidence.
Social Sciences: Investigating the link between social factors (e.g., education level) and behavioral outcomes (e.g., voting patterns).
Education: Examining the connection between teaching methods and student performance.
Quality Control: Assessing the association between manufacturing conditions and product defects.

Practical Example - Weak Association

Suppose a researcher wants to determine if there is an association between gender (male, female) and preference for a new product (like, dislike). The researcher surveys 100 people and records the following data:

Category	Like	Dislike	Total
Male	20	30	50
Female	25	25	50
Total	45	55	100

Step 1: Calculate Expected Frequencies

Using the formula for expected frequencies:

$E_{\text{Male, Like}} = \frac{50 \times 45}{100} = 22.5$
$E_{\text{Male, Dislike}} = \frac{50 \times 55}{100} = 27.5$
$E_{\text{Female, Like}} = \frac{50 \times 45}{100} = 22.5$
$E_{\text{Female, Dislike}} = \frac{50 \times 55}{100} = 27.5$

Step 2: Compute Chi-Square Statistic

$\chi^2 = \frac{(20 - 22.5)^2}{22.5} + \frac{(30 - 27.5)^2}{27.5} + \frac{(25 - 22.5)^2}{22.5} + \frac{(25 - 27.5)^2}{27.5}$

$\chi^2 = \frac{(-2.5)^2}{22.5} + \frac{(2.5)^2}{27.5} + \frac{(2.5)^2}{22.5} + \frac{(-2.5)^2}{27.5}$

$\chi^2 = \frac{6.25}{22.5} + \frac{6.25}{27.5} + \frac{6.25}{22.5} + \frac{6.25}{27.5}$

$\chi^2 \approx 0.278 + 0.227 + 0.278 + 0.227$

$\chi^2 \approx 1.010$

Step 3: Determine Degrees of Freedom

$df = (2 - 1) \times (2 - 1) = 1$

Step 4: Interpret the Result

Using a chi-square distribution table, we compare the calculated chi-square value (1.010) with the critical value for one degree of freedom at a chosen significance level (here, 0.05), which is approximately 3.841.

Since 1.010 < 3.841, we fail to reject the null hypothesis. The data do not provide evidence of a significant association between gender and product preference in this sample.
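
The four steps of this example can be checked with a short script. The sketch below hard-codes the critical value 3.841 quoted in the text. It also reports a p-value, which is not part of the worked example: for $df = 1$ only, the upper-tail probability of the chi-square distribution equals `erfc(sqrt(chi2 / 2))`. All variable names are illustrative.

```python
from math import erfc, sqrt

observed = [[20, 30],   # Male: like, dislike
            [25, 25]]   # Female: like, dislike

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Steps 1 and 2: expected counts and the chi-square statistic.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi2 += (o - e) ** 2 / e

# Step 3: degrees of freedom for a 2x2 table.
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Step 4: compare with the critical value (df = 1, alpha = 0.05, from the text).
CRITICAL_VALUE = 3.841
reject_null = chi2 > CRITICAL_VALUE

# Valid for df = 1 only: upper-tail probability of the chi-square distribution.
p_value = erfc(sqrt(chi2 / 2))

print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.3f}, reject H0: {reject_null}")
```

The two views are equivalent: the statistic falls below the critical value exactly when the p-value is above 0.05.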

Practical Example - Strong Association

Consider a study investigating the relationship between smoking status (smoker, non-smoker) and the incidence of lung disease (disease, no disease). The researcher collects data from 200 individuals and records the following information:

Category	Disease	No Disease	Total
Smoker	50	30	80
Non-Smoker	20	100	120
Total	70	130	200

Step 1: Calculate Expected Frequencies

Using the formula for expected frequencies:

$E_{\text{Smoker, Disease}} = \frac{80 \times 70}{200} = 28$
$E_{\text{Smoker, No Disease}} = \frac{80 \times 130}{200} = 52$
$E_{\text{Non-Smoker, Disease}} = \frac{120 \times 70}{200} = 42$
$E_{\text{Non-Smoker, No Disease}} = \frac{120 \times 130}{200} = 78$

Step 2: Compute Chi-Square Statistic

$\chi^2 = \frac{(50 - 28)^2}{28} + \frac{(30 - 52)^2}{52} + \frac{(20 - 42)^2}{42} + \frac{(100 - 78)^2}{78}$

$\chi^2 = \frac{(22)^2}{28} + \frac{(-22)^2}{52} + \frac{(-22)^2}{42} + \frac{(22)^2}{78}$

$\chi^2 = \frac{484}{28} + \frac{484}{52} + \frac{484}{42} + \frac{484}{78}$

$\chi^2 \approx 17.286 + 9.308 + 11.524 + 6.205$

$\chi^2 \approx 44.32$

Step 3: Determine Degrees of Freedom

$df = (2 - 1) \times (2 - 1) = 1$

Step 4: Interpret the Result

Using a chi-square distribution table, we compare the calculated chi-square value (44.32) with the critical value for one degree of freedom at a significance level of 0.05, which is approximately 3.841. Since 44.32 > 3.841, we reject the null hypothesis. This indicates a statistically significant association between smoking status and the incidence of lung disease in this sample.
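
To see which cells drive the large statistic, the per-cell terms from Step 2 can be computed individually. The sketch below uses a dictionary keyed by (row, column) labels, which is just one convenient layout, and the variable names are illustrative. The total should agree with the hand calculation up to rounding.

```python
observed = {
    ("Smoker", "Disease"): 50,
    ("Smoker", "No Disease"): 30,
    ("Non-Smoker", "Disease"): 20,
    ("Non-Smoker", "No Disease"): 100,
}

# Accumulate the row and column totals from the cell counts.
row_totals, col_totals = {}, {}
for (row, col), count in observed.items():
    row_totals[row] = row_totals.get(row, 0) + count
    col_totals[col] = col_totals.get(col, 0) + count
n = sum(observed.values())

# Contribution of each cell: (O - E)^2 / E.
contributions = {}
for (row, col), o in observed.items():
    e = row_totals[row] * col_totals[col] / n
    contributions[(row, col)] = (o - e) ** 2 / e

chi2 = sum(contributions.values())
for cell, value in contributions.items():
    print(cell, round(value, 3))
print("chi2 =", round(chi2, 2))
```

The largest contribution comes from the (Smoker, Disease) cell, where the observed count of 50 is furthest above its expected value of 28 relative to that expected value.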

Conclusion

The chi-square test is a practical tool for analyzing the relationship between categorical variables. By comparing observed frequencies with the frequencies expected under independence, researchers can judge whether an association is statistically significant, which makes the test useful across many fields of study.
