
A new correlation coefficient between categorical, ordinal and interval variables with Pearson characteristics

M. Baak $^{a,*}$, R. Koopman $^{a}$, H. Snoek $^{b,c}$, S. Klous $^{a,d}$
$^{a}$ KPMG Advisory N.V., Laan van Langerhuize 1, 1186 DS, The Netherlands
$^{b}$ Nikhef National Institute for Subatomic Physics, Science Park 105, 1098 XG Amsterdam, The Netherlands
$^{c}$ Institute of Physics, University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands
$^{d}$ Informatics Institute, University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands

ARTICLE INFO

Article history:

Received 28 November 2018
Received in revised form 23 June 2020
Accepted 24 June 2020
Available online 27 June 2020

Keywords:

Data analysis
Correlation
Contingency test
Significance
Simulation

Abstract

A prescription is presented for a new and practical correlation coefficient, $\phi_{K}$, based on several refinements to Pearson's hypothesis test of independence of two variables. The combined features of $\phi_{K}$ form an advantage over existing coefficients. First, it works consistently between categorical, ordinal and interval variables, in essence by treating each variable as categorical, and can therefore be used to calculate correlations between variables of mixed type. Second, it captures nonlinear dependency. Third, the strength of $\phi_{K}$ is similar to Pearson's correlation coefficient, and is equivalent in the case of a bivariate normal input distribution. These are useful properties when studying the correlations between variables with mixed types, where some are categorical. Two more innovations are presented: one for the proper evaluation of the statistical significance of correlations, and one for the interpretation of variable relationships in a contingency table, in particular in the case of sparse or low-statistics samples and significant dependencies. Two practical applications are discussed. The presented algorithms are easy to use and available through a public Python library.$^{1}$

© 2020 Published by Elsevier B.V.

1. Introduction

The calculation of correlation coefficients between paired data variables is a standard tool of analysis for every data analyst. Pearson's correlation coefficient (Pearson, 1895) is a de facto standard in most fields, but by construction only works for interval variables. In practice it is inconvenient to mix and compare different correlation coefficients when dealing with mixed variable types. While many coefficients of association exist, each with different strengths, we have not been able to identify a correlation coefficient with Pearson-like strength and a sound statistical interpretation that can be used to calculate correlations between variables of mixed type. (The convention adopted here is that a correlation coefficient is bounded, e.g. in the range $[0,1]$ or $[-1,1]$, and that a coefficient of association is not.)
This paper describes, among other things, a novel correlation coefficient, $\phi_{K}$, with properties that, taken together, form an advantage over existing methods. More broadly, it covers innovations to three related topics typically encountered in data analysis.
The calculation of the correlation coefficient $\phi_{K}$ for each variable pair of interest. The correlation $\phi_{K}$ follows a uniform treatment for interval, ordinal and categorical variables, because its definition is invariant under the ordering of the values of each variable. Essentially, it treats each variable as if its type were categorical. This is particularly useful in modern-day analysis when studying the dependencies between a set of variables with mixed types, where some variables are categorical.
The correlation $\phi_{K}$ is derived from Pearson's $\chi^{2}$ contingency test (Barnard, 1992), i.e. the hypothesis test of independence between two (or more) variables in a contingency table, henceforth called the Factorization Assumption. In a contingency table each row is the category of one variable and each column the category of a second variable. Each cell describes the number of records occurring in both categories at the same time. The values for levels of correlation are bounded in the range $[0,1]$, with 0 for no association and +1 for complete association. By construction, the strength of $\phi_{K}$ is the same as Pearson's correlation coefficient in the case of a bivariate normal input distribution. Unlike Pearson's coefficient, which describes the average linear dependency between two variables, $\phi_{K}$ also captures nonlinear relations.
The evaluation of the statistical significance of each correlation. Presented is a new and robust statistical prescription for the significance evaluation of the level of variable association, based on a hybrid method of Monte Carlo sample simulation and adjustments of the $\chi^{2}$ distribution when using the $G$-test statistic (Sokal and Rohlf, 2012). The asymptotic approximation commonly advertised to evaluate the statistical significance of the hypothesis test, e.g. by statistics libraries such as R (Anon, 0000b) and scipy (Anon, 0000a), makes particular assumptions on the number of degrees of freedom and the shape of the $\chi^{2}$ distribution. This approach is unusable for sparse data samples, which may occur for two variables with a strong correlation and for low- to medium-statistics data samples, and leads to incorrect $p$-values. (Examples follow in Section 5.)
Insights in the correlation of each variable-pair, by studying outliers and their significances. To help interpret any relationship found, we provide a newly refined method for the detection of significant excesses or deficits of records with respect to the expected values in a contingency table, so-called outliers, using a statistically independent evaluation for expected frequency of records. We evaluate the significance of each outlier frequency, putting particular emphasis on the statistical uncertainty on the expected number of records and on the scenario of low statistics data samples.
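The three topics above all start from a contingency table and the cell counts expected under the Factorization Assumption. The following is a minimal sketch of that starting point, not the method of this paper: it builds the table from paired categorical records, computes the expected counts from the row and column totals, evaluates Pearson's $\chi^{2}$ statistic, and estimates a $p$-value by plain Monte Carlo simulation under independence. The function names, the toy data, and the choice to sample from the product of the observed marginals are all assumptions made for illustration; the hybrid prescription itself is the subject of Section 5.

```python
import random
from collections import Counter


def contingency_table(xs, ys):
    """Count how often each (x, y) category pair occurs."""
    rows, cols = sorted(set(xs)), sorted(set(ys))
    counts = Counter(zip(xs, ys))
    return rows, cols, [[counts[(r, c)] for c in cols] for r in rows]


def expected_counts(table):
    """Expected cell counts if the two variables were independent."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_tot] for r in row_tot]


def chi2_statistic(table):
    """Pearson's chi-square: sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row) if e > 0)


def monte_carlo_p_value(table, n_sim=2000, seed=1):
    """Fraction of simulated independent tables at least as extreme as the data.

    Assumption: each pseudo-sample draws N records from the product of the
    observed marginal distributions (plain Monte Carlo, no chi-square adjustment).
    """
    rng = random.Random(seed)
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n_rows, n_cols = len(row_tot), len(col_tot)
    observed = chi2_statistic(table)
    hits = 0
    for _ in range(n_sim):
        r_idx = rng.choices(range(n_rows), weights=row_tot, k=n)
        c_idx = rng.choices(range(n_cols), weights=col_tot, k=n)
        counts = Counter(zip(r_idx, c_idx))
        sim = [[counts[(i, j)] for j in range(n_cols)] for i in range(n_rows)]
        if chi2_statistic(sim) >= observed:
            hits += 1
    return hits / n_sim


# Toy data: two strongly dependent categorical variables, 40 records.
xs = ["a"] * 20 + ["b"] * 20
ys = ["u"] * 18 + ["v"] * 2 + ["u"] * 3 + ["v"] * 17
rows, cols, table = contingency_table(xs, ys)
chi2 = chi2_statistic(table)
p_mc = monte_carlo_p_value(table, n_sim=500)
print(table, round(chi2, 3), p_mc)
```

Because the statistic is a sum over cells, simulation remains usable when some cells are nearly empty, which is the regime where the asymptotic $\chi^{2}$ approximation is said above to break down.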
The methods presented in this work can be applied to many analysis problems. Insights in variable dependencies serve as useful input to all forms of model building, be it classification or regression based, such as the identification of customer groups, outlier detection for predictive maintenance or fraud analytics, and decision making engines. More generally, they can be used to find correlations across (big) data sets, and correlations over time (in correlograms). Two use-cases are discussed: the study of numbers of insurance claims and of survey responses.
In summary, this paper covers three innovations. We present a new correlation coefficient that is useful when dealing with data sets with mixed variable types, including categorical ones. We introduce a novel "hybrid" approach for the significance evaluation of contingency tables, which works in particular for sparse and low-statistics data sets. And we discuss a new, statistically refined method for inspecting the relationship between two dependent variables, one that also works for any two mixed variable types.
This document is organized as follows. A brief overview of existing correlation coefficients is provided in Section 2. Section 3 describes the contingency test, which serves as input for Section 4, detailing the derivation of the correlation coefficient $\phi_{K}$. The statistical significance evaluation of the contingency test is discussed in Section 5. In Section 6 we zoom in on the interpretation of the dependency between a specific pair of variables, where the significance evaluation of outlier frequencies in a contingency table is presented. Two practical applications can be found in Section 7. Section 8 describes the implementation of the presented algorithms in publicly available Python code, before concluding in Section 9.

2. Measures of variable association

A correlation coefficient quantifies the level of mutual, statistical dependence between two variables. Multiple types of correlation coefficients exist in probability theory, each with its own definition and features. Some focus on linear relationships while others are sensitive to any dependency, some are robust against outliers, etc. In general, different correlation coefficients are used to describe dependencies between interval, ordinal, and categorical variables separately. Typically their values range from -1 to +1 or 0 to +1, where 0 means no statistical association, +1 means the strongest possible association (a one-to-one dependency), and -1 means the strongest negative relation. While these particular two or three correlation points are well-defined statistically, the various correlation coefficients follow different approaches to navigate between them. When working with different correlation coefficients for different types of variables, their comparison can thus be complex. (Some correlation constants (Yoo et al., 2020) have an axiomatic foundation, which in this context forms an advantage over other correlation coefficients.)
In this paper, we are interested in correlation constants that work with interval, ordinal, and categorical variables alike. This section briefly discusses existing correlation coefficients and other measures of variable association. This is done separately for interval, ordinal, and categorical variables, as measures of association are typically designed to work with one specific variable type. In addition, several related concepts used throughout this work are presented.
An interval variable, sometimes called a continuous or real-valued variable, has well-defined intervals between the values of the variable. Examples are distance or temperature measurements. The Pearson correlation coefficient is a de facto standard to quantify the level of association between two interval variables. For a sample of size $N$ with variables $x$ and $y$, it is defined as the covariance of the two variables divided by the product of their standard deviations:
$$
\rho=\frac{\sum_{i=1}^{N}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(x_{i}-\bar{x}\right)^{2}} \sqrt{\sum_{i=1}^{N}\left(y_{i}-\bar{y}\right)^{2}}}, \tag{1}
$$
where $\bar{x}$ and $\bar{y}$ are the sample means. Notably, $\rho$ is symmetric in $x$ and $y$, and $\rho \in[-1,1]$.
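As an illustration of the definition above, the sketch below evaluates $\rho$ directly from the sums in the formula, using only the Python standard library. The function name `pearson_rho` and the two toy data sets are assumptions chosen for this example. The first data set is exactly linear; the second is a symmetric parabola, for which the numerator of $\rho$ cancels even though $y$ is fully determined by $x$.

```python
import math


def pearson_rho(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("rho is undefined for a constant variable")
    return s_xy / math.sqrt(s_xx * s_yy)


x = [-2, -1, 0, 1, 2]
y_linear = [2 * v + 1 for v in x]   # exact linear dependence
y_parabola = [v * v for v in x]     # exact, but nonlinear, dependence

rho_linear = pearson_rho(x, y_linear)
rho_parabola = pearson_rho(x, y_parabola)
print(rho_linear, rho_parabola)
```

The common factors of $1/N$ in the covariance and the standard deviations cancel, so the sketch works with the raw sums.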

Extending this to a set of input variables, Pearson's correlation matrix $C$, containing the $\rho$ values of all variable pairs, is obtained from the covariance matrix $V$ as:
$$
C_{ij}=\frac{V_{ij}}{\sqrt{V_{ii} V_{jj}}},
$$
where $i$ and $j$ are the indices of a variable pair. Equivalently one can write:
$$
V=E^{\frac{1}{2}} C E^{\frac{1}{2}} \quad ; \quad V^{-1}=E^{-\frac{1}{2}} C^{-1} E^{-\frac{1}{2}},
$$
where $E=\operatorname{diag}(V)$.
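The two relations above can be checked numerically with a small sketch. The helper names and the $2 \times 2$ example matrix are assumptions for illustration; matrices are plain nested lists so that the block has no dependencies. Since $E$ is diagonal, $E^{1/2} C E^{1/2}$ reduces to scaling element $(i, j)$ of $C$ by $\sqrt{V_{ii}}\sqrt{V_{jj}}$.

```python
import math


def correlation_from_covariance(V):
    """C_ij = V_ij / sqrt(V_ii * V_jj)."""
    n = len(V)
    return [[V[i][j] / math.sqrt(V[i][i] * V[j][j]) for j in range(n)]
            for i in range(n)]


def covariance_from_correlation(C, variances):
    """V = E^(1/2) C E^(1/2), with E the diagonal matrix of variances."""
    n = len(C)
    sigma = [math.sqrt(v) for v in variances]
    return [[sigma[i] * C[i][j] * sigma[j] for j in range(n)]
            for i in range(n)]


V = [[4.0, 3.0],
     [3.0, 9.0]]
C = correlation_from_covariance(V)
V_back = covariance_from_correlation(C, [V[0][0], V[1][1]])
print(C)       # unit diagonal, off-diagonal 3 / (2 * 3)
print(V_back)  # reproduces V
```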
The Pearson correlation coefficient measures the strength and direction of the linear relationship between two interval variables; a well-known limitation is therefore that nonlinear dependencies are not (well) captured. In addition, $\rho$ is known to be sensitive to outlier records. Pearson's correlation coefficient requires interval variables as input, which can be unbinned or binned. It cannot be evaluated for categorical variables, and ordinal variables can only be used when ranked (see below).
A direct relationship exists between $\rho$ and a bivariate normal distribution:
$$
\begin{aligned}
& f_{\mathrm{b.n.}}\left(x, y \mid \bar{x}, \bar{y}, \sigma_{x}, \sigma_{y}, \rho\right)= \\
& \qquad \frac{1}{2 \pi \sigma_{x} \sigma_{y} \sqrt{1-\rho^{2}}} \exp \left(-\frac{1}{2\left(1-\rho^{2}\right)}\left[\frac{(x-\bar{x})^{2}}{\sigma_{x}^{2}}+\frac{(y-\bar{y})^{2}}{\sigma_{y}^{2}}-\frac{2 \rho(x-\bar{x})(y-\bar{y})}{\sigma_{x} \sigma_{y}}\right]\right),
\end{aligned}
$$
where $\sigma_{x}$ ($\sigma_{y}$) is the width of the probability distribution in $x$ ($y$), and the correlation parameter $\rho$ signifies the linear tilt between $x$ and $y$. We use this relation in Section 4 to derive the correlation coefficient $\phi_{K}$.
Another useful measure is the global correlation coefficient (James and Roos, 1975), which is a number between zero and one obtained from the correlation matrix $C$ that gives the highest possible correlation between variable $k$ and the linear combination of all other variables:
$$
\begin{aligned}
g_{k} & =\sqrt{1-\left[V_{kk} \cdot\left(V^{-1}\right)_{kk}\right]^{-1}} \\
& =\sqrt{1-\left[\left(C^{-1}\right)_{kk}\right]^{-1}} .
\end{aligned}
$$
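As a quick consistency check of this definition, consider the simplest case of only two variables with correlation $\rho$, $|\rho|<1$. This worked example is an illustration added here, not part of the original derivation. The only "linear combination of all other variables" is then the other variable itself, so $g_{k}$ should reduce to $|\rho|$:

$$
C=\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad
C^{-1}=\frac{1}{1-\rho^{2}}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}, \qquad
\left(C^{-1}\right)_{kk}=\frac{1}{1-\rho^{2}},
$$

$$
g_{k}=\sqrt{1-\left[\left(C^{-1}\right)_{kk}\right]^{-1}}=\sqrt{1-\left(1-\rho^{2}\right)}=|\rho| .
$$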
An ordinal variable has two or more categories with a clear ordering of these categories. For example, take the variable "level of education" with six categories: no education, elementary school graduate, high school graduate, college graduate, university graduate, and PhD. A rank correlation measures the statistical relationship between two variables that can be ordered; the rank of a variable is its index in the ordered sequence of values. For ordinal variables a numbering is assigned to the categories, e.g. $0, 1, 2, 3$. Note the equidistant spacing between the categorical values.
Examples of rank correlation coefficients are Spearman's $\rho$ (Spearman, 1904), Kendall's $\tau$ (Kendall, 1938), Goodman-Kruskal's $\gamma$ (Goodman and Kruskal, 1954, 1959, 1963, 1972), and the polychoric correlation (Drasgow, 2006). The definition of Spearman's $\rho$ is simply Eq. (1), using the ranks of $x_{i}$ and $y_{i}$ as inputs, essentially treating the ranks as interval variables. This makes Spearman's $\rho$ very robust against outliers. Note that Goodman-Kruskal's $\gamma$ is dependent on the order of the two input variables, resulting in an asymmetric correlation matrix. See Yoo et al. (2020) for the theoretical properties of a generalization of Kendall's $\tau$ that deals with sparse data, which is also one of the characteristics of this work.
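The statement that Spearman's $\rho$ is Pearson's formula applied to ranks can be sketched in a few lines. The helper names are assumptions for this example, and tied values are given the average of the ranks they span, which is a common convention rather than something specified in the text. The toy data contain one extreme record, which lowers Pearson's coefficient noticeably but leaves the rank-based value unchanged.

```python
import math


def pearson_rho(x, y):
    """Pearson correlation computed from the raw sums."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's formula evaluated on the ranks."""
    return pearson_rho(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 1000]  # monotone in x, with one extreme record

rho_pearson = pearson_rho(x, y)
rho_spearman = spearman_rho(x, y)
print(round(rho_pearson, 3), rho_spearman)
```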
Although ranking is regular practice, the assumption of equidistant intervals - often made implicitly - can sometimes be hard to justify. For example, adding the category of “MBA” to the above example increases the distance between “PhD” and “no education”, where one could argue that this distance should be independent of the number of educational categories.
A categorical variable, sometimes called a nominal or class variable, has two or more categories which have no intrinsic ordering. An example is the variable gender, with two categories: male and female. Multiple measures of association exist that quantify the mutual dependence between two (or more) categorical variables, including Pearson's $\chi^{2}$ contingency test (Barnard, 1992), the $G$-test statistic (Sokal and Rohlf, 2012), mutual information (Cover and Thomas, 2006), and for $2 \times 2$ contingency tables Fisher's exact test (Fisher, 1922, 1970) and Barnard's test (Barnard, 1945, 1947). For an overview see Agresti (1992). These measures determine how similar the joint distribution $p(x, y)$ is to the product of the factorized marginal distributions $p(x) p(y)$. Each measure of association consists of a sum of contributions, one from each cell of the contingency table, and therefore does not depend on the intrinsic ordering of the cells.
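The cell-by-cell structure of these measures, and the resulting independence of category ordering, can be made concrete with a short sketch. The function names and the example table are assumptions for illustration. Three of the measures named above are computed from the same observed and expected counts; mutual information is expressed in nats, in which case the $G$ statistic equals $2N$ times the mutual information of the empirical distribution.

```python
import math


def expected_counts(table):
    """Expected counts under p(x, y) = p(x) p(y), from the table margins."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_tot] for r in row_tot]


def cells(table):
    """Yield (observed, expected) for every cell of the table."""
    for obs_row, exp_row in zip(table, expected_counts(table)):
        yield from zip(obs_row, exp_row)


def pearson_chi2(table):
    return sum((o - e) ** 2 / e for o, e in cells(table) if e > 0)


def g_statistic(table):
    # Empty cells contribute zero (limit of o * log(o) as o -> 0).
    return 2.0 * sum(o * math.log(o / e) for o, e in cells(table) if o > 0)


def mutual_information(table):
    """Mutual information of the empirical joint distribution, in nats."""
    n = sum(map(sum, table))
    return sum((o / n) * math.log(o / e) for o, e in cells(table) if o > 0)


table = [[30, 5], [10, 10], [5, 40]]
shuffled = [table[2], table[0], table[1]]  # same data, rows reordered

for name, measure in [("chi2", pearson_chi2), ("G", g_statistic),
                      ("MI", mutual_information)]:
    print(name, round(measure(table), 6), round(measure(shuffled), 6))
```

Reordering the rows only changes the order in which the cell contributions are summed, so each measure returns the same value for both tables.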
Though typically limited to categorical variables, these test statistics can also be applied to interval and ordinal type variables. However, their values are not bounded in the range $[0,1]$, and can become quite large. Moreover, their

    * Corresponding author.
    E-mail addresses: maxbaak@gmail.com (M. Baak), rfkoopman@gmail.com (R. Koopman).

    1 https://github.com/KaveIO/PhiK.