RMIB: Representation Matching Information Bottleneck for Matching Text Representations
Haihui Pan, Zhifang Liao, Wenrui Xie, Kun Han
Abstract
Recent studies have shown that matching the domains of text representations helps improve generalization in asymmetrical domains text matching tasks. This requires the distributions of the text representations to be as similar as possible, analogous to matching over heterogeneous data domains, so that the data become indistinguishable after feature extraction. However, how to match the distributions of text representations remains an open question, and the role of distribution matching is still unclear. In this work, we explicitly narrow the distributions of text representations by aligning them with the same prior distribution. We theoretically prove that narrowing the distribution of text representations in asymmetrical domains text matching is equivalent to optimizing the information bottleneck (IB). The interaction between text representations plays an important role in asymmetrical domains text matching, yet IB does not constrain this interaction. We therefore propose the adequacy of interaction and the inadequacy of a single text representation on the basis of IB and obtain the representation matching information bottleneck (RMIB). We theoretically prove that the constraints on text representations in RMIB are equivalent to maximizing the mutual information between text representations on the premise that the task information is given. On four text matching models and five text matching datasets, we verify that RMIB improves the performance of asymmetrical domains text matching. Our experimental code is available at https://github.com/chenxingphh/rmib.
1. Introduction
Text matching is a basic task in natural language processing (NLP) whose purpose is to predict the semantic relationship between two input texts. Text matching applies to many NLP scenarios. For example, in question-answer matching, we need to judge whether a candidate answer matches a given question (Tan et al., 2016); in natural language inference, given a hypothesis and a premise, we need to judge whether their relationship is entailment, contradiction or neutral (Bowman et al., 2015); in information retrieval, we need to determine the relevance between the query and the returned candidates (Reimers & Gurevych, 2019).
Asymmetrical domains text matching refers to the setting where the two input texts come from different domains; for example, in question-answer matching, the question and the candidate answer can be viewed as sampled from two different distributions. A common approach is to map texts from different domains into text representations with a deep model. In the common semantic space, text representations from different domains have the same dimensionality, so a semantic matching function is easy to define. Mainstream work focuses on designing more effective models that increase the interaction between input texts from different domains to obtain better text representations (Chen et al., 2017; Wang et al., 2017; Gong et al.; Kim et al., 2019; Yang et al., 2019). In practice, (Yu et al., 2022) observes an interesting phenomenon: text representations from different domains are usually separated in the early stage of training but gradually begin to mix with each other as training deepens. The explanation for this phenomenon is that there is a lexical gap between texts from different domains (Tay et al., 2018), so this gap must be bridged during learning, similar to cross-modal matching. Existing research (Li et al., 2020; Hazarika et al., 2020) shows that a matching model should generate domain-invariant features, that is, the global distributions of text representations from different domains in the common semantic space should be as similar as possible, so as to capture the commonality between the text representations. However, existing text matching models lack explicit constraints that make text representations domain invariant. How to align the distributions of text representations from different domains in the common semantic space remains an open question, as does the role such alignment plays. (Yu et al., 2022) designs a regularizer based on distribution distance to explicitly promote domain-invariant features. Unlike (Yu et al., 2022), which directly optimizes the distance between the distributions of text representations from different domains, we reduce that distance by explicitly aligning each distribution with the same prior distribution. We theoretically prove that narrowing the distribution of text representations in asymmetrical domains text matching is equivalent to optimizing the information bottleneck (IB). For text matching, the interaction between texts has an important impact on the matching result (Chen et al., 2017; Wang et al., 2017; Gong et al.; Kim et al., 2019; Yang et al., 2019), yet IB does not constrain the interaction between text representations. We therefore propose the adequacy of interaction and the inadequacy of a single text representation on the basis of IB and obtain the representation matching information bottleneck (RMIB). We theoretically prove that the constraints on text representations in RMIB are equivalent to maximizing the mutual information between text representations on the premise that the task information is given. The contributions of this work can be summarized as follows:
We improve domain-invariant features by explicitly aligning the distributions of text representations from different domains with the same prior distribution, and we theoretically prove that narrowing the distributions of text representations is equivalent to optimizing the information bottleneck (IB) in asymmetrical domains text matching.

We propose the adequacy of interaction and the inadequacy of a single text representation on the basis of IB and obtain the representation matching information bottleneck (RMIB).

We theoretically prove that the constraints on text representations in RMIB are equivalent to maximizing the mutual information between text representations on the premise that the task information is given.

We give a concrete implementation for optimizing RMIB. On four text matching models and five text matching datasets, we verify that RMIB improves the performance of asymmetrical domains text matching.
2. Methodology
2.1. Notation and Definitions
We use capital letters for random variables and lowercase letters for their realizations. Let $X$ and $Y$ be random variables. The information entropy of the random variable $X$ is defined as $H(X) = -\sum_{x} p(x)\log p(x)$. The conditional entropy is defined as $H(Y|X) = -\sum_{x,y} p(x,y)\log p(y|x)$. The mutual information between $X$ and $Y$ is defined as $I(X;Y) = H(X) - H(X|Y)$. The Kullback-Leibler divergence between two distributions $p$ and $q$ is defined as $\mathrm{KL}(p\,\|\,q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}$.
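To make these definitions concrete, the short script below (ours, not part of the paper) evaluates them for a small discrete joint distribution:

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log p(x), with 0 log 0 := 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(p_xy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table p_xy[x, y]."""
    return entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0)) - entropy(p_xy.ravel())

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])                              # a small joint p(x, y)
print(mutual_information(p_xy))                            # > 0: X and Y are dependent
print(entropy(p_xy.ravel()) - entropy(p_xy.sum(axis=1)))   # H(Y|X) = H(X,Y) - H(X)
print(kl_divergence(p_xy.sum(axis=1), np.array([0.5, 0.5])))  # KL to a uniform prior
```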
2.2. Information Bottleneck
Information bottleneck (IB) (Tishby et al., 2000), also known as the representation information bottleneck (RIB), regards the feedforward computation of a deep neural network as a Markov chain. Let $X$ be the model input, $Y$ the task label, $\hat{Y}$ the prediction of the model, and $Z$ the output of a middle layer, that is, the representation of the model input. Treating these variables as random variables yields the following Markov chain:

$$Y \to X \to Z \to \hat{Y}$$

This means that $p(z|x,y) = p(z|x)$, or equivalently $I(Y;Z|X) = 0$, is satisfied; that is, once $X$ is known, $Z$ carries no additional information about $Y$, so $Y$ and $Z$ are conditionally independent given $X$. IB puts forward the following two requirements for the representation $Z$ learned by a deep learning model.
Sufficient: The representation $Z$ should contain as much information related to the target label $Y$ as possible, i.e., $I(Z;Y)$ should be maximized.

Minimal: The representation $Z$ should contain as little information as possible about the input $X$, i.e., $I(X;Z)$ should be minimized.

On the premise of being sufficient, IB requires that the representation $Z$ contain as little information about the input $X$ as possible, which means it contains as little redundant information as possible. Combining sufficiency and minimality, we can write the optimization objective of IB as:

$$\max_{Z} \; I(Z;Y) \quad \text{s.t.} \quad I(X;Z) \le I_c$$

where $I_c$ is an information constraint.
By using the Lagrange multiplier method, we can remove the constraint and equivalently express the above formula as:

$$\min_{Z} \; -I(Z;Y) + \beta\, I(X;Z)$$

where $\beta$ is a positive constant. This objective is also called the IB Lagrangian.
2.3. Information Bottleneck in Text Matching
In the asymmetrical domains text matching task, the input of the model contains two texts, represented by two random variables $X_1$ and $X_2$. Usually $X_1$ and $X_2$ are sampled from different domains. The asymmetrical domains text matching task can then be expressed as:

$$\hat{y} = f(x_1, x_2; \theta)$$

where $\theta$ represents the parameters of the constructed text matching model and $\hat{y}$ is the predicted label. The samples $(x_1, x_2, y)$ are assumed to come from an unknown distribution $p(X_1, X_2, Y)$. Current text matching models mainly fall into two types: text encoding models and text interaction models. A text encoding model encodes the input texts $X_1$ and $X_2$ separately, with no interaction between them during encoding; a typical example is the siamese neural network (Koch et al., 2015). A text interaction model lets the text representations interact during the feedforward computation and usually obtains better results; a typical example is ESIM (Chen et al., 2017). However, no matter which kind of text matching model is used, two input text representations are produced during the feedforward computation. Therefore, the Markov chain of the text matching process can be expressed as:

$$Y \to (X_1, X_2) \to (Z_1, Z_2) \to Z \to \hat{Y}$$

where $Z_1$ and $Z_2$ are the representations of $X_1$ and $X_2$ respectively, and $Z$ is the final representation after fusing the two text representations, usually the output of the representation fusion layer. Accordingly, the optimization objective of the information bottleneck in text matching can be expressed as:

$$\min_{Z_1, Z_2} \; -I(Z_1, Z_2; Y) + \beta_1 I(X_1; Z_1) + \beta_2 I(X_2; Z_2)$$
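For concreteness, the skeleton below sketches this chain in PyTorch: two encoded inputs give $Z_1$ and $Z_2$, a fusion layer gives $Z$, and a classifier gives $\hat{Y}$. The architecture and all names are illustrative, not a specific model from the paper:

```python
import torch
import torch.nn as nn

class MatchingModel(nn.Module):
    """Skeleton of the chain (X1, X2) -> (Z1, Z2) -> Z -> Y_hat."""
    def __init__(self, vocab_size, dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.fusion = nn.Linear(4 * dim, dim)      # fuse [z1; z2; |z1 - z2|; z1 * z2]
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, x1, x2):
        h1, _ = self.encoder(self.embed(x1))       # (B, L1, dim)
        h2, _ = self.encoder(self.embed(x2))       # (B, L2, dim)
        z1, z2 = h1.max(dim=1).values, h2.max(dim=1).values  # pooled text representations
        z = torch.relu(self.fusion(torch.cat([z1, z2, (z1 - z2).abs(), z1 * z2], dim=-1)))
        return self.classifier(z), z1, z2          # logits for Y_hat, plus Z1 and Z2
```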
2.4. Text Domain Alignment Based on Prior Distribution
Recent studies have shown that domain matching of text representations helps improve the generalization ability of text matching (Yu et al., 2022). To effectively align the distributions of text representations, (Yu et al., 2022) designs a regularizer based on distribution distance to narrow the gap between them:

$$\min_{Z_1,Z_2} \; \mathbb{E}_{p(x_1,x_2,y)}\big[-\log p(y|z_1,z_2)\big] + \lambda\, D\big(p(z_1),\, p(z_2)\big)$$

where $D$ is a metric function of distribution distance. Unlike (Yu et al., 2022), we narrow the distance between the distributions of the text representations by explicitly aligning each of them with the same prior distribution. Specifically, let the prior distribution be $q(z)$ and the two input text representations be $Z_1$ and $Z_2$. To make the distributions of $Z_1$ and $Z_2$ as close as possible, we expect each of them to be as close to $q(z)$ as possible on the premise that the model can correctly predict the target label. Therefore, the optimization objective of text matching based on text representation alignment can be expressed as:

$$\min_{Z_1,Z_2} \; \mathbb{E}_{p(x_1,x_2,y)}\big[-\log p(y|z_1,z_2)\big] \quad \text{s.t.} \quad \mathrm{KL}\big(p(z_1|x_1)\,\|\,q(z)\big) = 0,\;\; \mathrm{KL}\big(p(z_2|x_2)\,\|\,q(z)\big) = 0$$

We can also use our method to explain DDR-Match (Yu et al., 2022): letting $\tilde{q}$ be the distribution after DDR-Match alignment, DDR-Match corresponds in our method to taking the prior to be $q(z) = \tilde{q}(z)$.

By using the Lagrange multiplier method, we can remove the constraints and equivalently express the above formula as:

$$\min_{Z_1,Z_2} \; \mathbb{E}_{p(x_1,x_2,y)}\big[-\log p(y|z_1,z_2)\big] + \beta_1\, \mathbb{E}_{p(x_1)}\mathrm{KL}\big(p(z_1|x_1)\,\|\,q(z)\big) + \beta_2\, \mathbb{E}_{p(x_2)}\mathrm{KL}\big(p(z_2|x_2)\,\|\,q(z)\big)$$

where $\beta_1$ and $\beta_2$ are positive constants. Although some studies have shown that domain alignment of text representations helps improve the performance of text matching, the role of distribution alignment remains unclear. We analyze this phenomenon and obtain Theorem 2.1; its proof can be found in the appendix.
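With a Gaussian posterior for each text representation and a standard normal prior (the choice later made in Section 3.2), the KL terms above have a closed form and can be optimized with the reparameterization trick; a minimal sketch with illustrative names:

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping gradients w.r.t. mu, logvar."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions, batch-averaged."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()

# Aligning both representations with the same prior shrinks the distance between them:
# loss = ce + beta1 * kl_to_standard_normal(mu1, logvar1) \
#           + beta2 * kl_to_standard_normal(mu2, logvar2)
```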
Theorem 2.1 shows that domain matching in text matching is equivalent to optimizing the information bottleneck in text matching, which indicates that domain alignment of the input texts makes the learned text representations forget as much of the input's redundant information as possible.
Theorem 2.1. Given matching texts $X_1, X_2$, text representations $Z_1, Z_2$ and task label $Y$, the following holds in asymmetrical domains text matching:

$$\min_{Z_1,Z_2} \mathbb{E}_{p(x_1,x_2,y)}\big[-\log p(y|z_1,z_2)\big] + \beta_1 \mathbb{E}_{p(x_1)}\mathrm{KL}\big(p(z_1|x_1)\,\|\,q(z)\big) + \beta_2 \mathbb{E}_{p(x_2)}\mathrm{KL}\big(p(z_2|x_2)\,\|\,q(z)\big) \;\ge\; \min_{Z_1,Z_2}\big(-I(Z_1,Z_2;Y) + \beta_1 I(X_1;Z_1) + \beta_2 I(X_2;Z_2)\big) + H(Y)$$

where $\beta_1$ and $\beta_2$ are positive constants.
2.5. Representation Matching Information Bottleneck
Information bottleneck (IB) only requires that the learned representation retain as much information as possible about the task label and as little information as possible about the input; it does not impose any explicit constraint on the interaction between text representations. In text matching, text interaction models usually outperform text encoding models, which indicates that the interaction between text representations has an important impact on the matching result. We therefore extend IB and obtain the representation matching information bottleneck (RMIB) for text matching. Specifically, RMIB's constraints relating the text representations to the inputs are consistent with IB, but RMIB puts forward the following constraints on the interaction between text representations and on the relationship between text representations and the task label.
Sufficient: The representations $Z_1$ and $Z_2$ should contain as much information related to the target label $Y$ as possible, i.e., $I(Z_1,Z_2;Y)$ should be large. This constraint is consistent with the sufficiency requirement of IB.

Interaction: The interaction between the text representations should be sufficient, which means there should be enough mutual information $I(Z_1;Z_2)$ between the two text representations.

Inadequacy: The correct final result cannot be obtained from a single text representation alone; equivalently, $H(Y|Z_1)$ and $H(Y|Z_2)$ should be large. In text matching, the model needs to make its decision based on the representations of both input texts, which means that using only a single text representation cannot yield the right result. For example, in a text encoding model the two input texts do not interact during the feedforward computation, so it is unreasonable to expect a single text representation to predict the target accurately.
Based on the constraints proposed by RMIB for text representations, we can obtain the optimization objective of RMIB:

$$\min_{Z_1,Z_2} \; \beta_1 I(X_1;Z_1) + \beta_2 I(X_2;Z_2) - \big(I(Z_1,Z_2;Y) + I(Z_1;Z_2) + H(Y|Z_1) + H(Y|Z_2)\big)$$

A more intuitive view of the RMIB optimization objective is shown in Figure 1. Although RMIB requires that the interaction between text representations be sufficient, it is difficult to know the impact of this interaction on the task label if we consider the interaction alone, without reference to the label. We further prove that the constraints of RMIB on the text representations and the task are equivalent to maximizing the mutual information between the text representations on the premise that the task information is given, and obtain Theorem 2.2, whose proof can be found in the appendix. Theorem 2.2 shows that the optimization objective of RMIB can be expressed as:

$$\min_{Z_1,Z_2} \; \beta_1 I(X_1;Z_1) + \beta_2 I(X_2;Z_2) - I(Z_1;Z_2\,|\,Y)$$
Although the differences between RMIB and IB can already be seen in the sufficiency of the text representation interaction and the inadequacy of a single text representation, Theorem 2.2 shows the difference more directly. IB only requires the text representations to contain as much task information as possible and places no constraint on the interaction between them, while RMIB requires the interaction between the text representations to be as large as possible on the premise that the task label is given. This indicates that RMIB is more suitable for text matching than IB. Another use of Theorem 2.2 is to provide a transformation for optimizing conditional mutual information.
Theorem 2.2. Given text representations $Z_1, Z_2$ and task label $Y$, the following holds in asymmetrical domains text matching:

$$I(Z_1;Z_2\,|\,Y) = I(Z_1,Z_2;Y) + I(Z_1;Z_2) + H(Y|Z_1) + H(Y|Z_2) - 2H(Y)$$
2.6. Equivalent Optimization Objective of Representation Matching Information Bottleneck
Although RMIB gives a mutual-information-based optimization objective for text matching, it is difficult to optimize the mutual information terms directly during training. Fortunately, as mutual information has recently received more attention in deep learning, several methods for approximately estimating it have been proposed (Alemi et al., 2016; Belghazi et al., 2018; Hjelm et al.), which makes the optimization objective of RMIB tractable. The optimization objective of RMIB can be further equivalently expressed as:

$$\min_{Z_1,Z_2} \; \beta_1 I(X_1;Z_1) + \beta_2 I(X_2;Z_2) + \big(C - I(Z_1;Z_2\,|\,Y)\big)$$

where the constant $C$ is the tightest theoretical upper bound of $I(Z_1;Z_2|Y)$, so it can easily be proved that $C - I(Z_1;Z_2|Y) \ge 0$. This is Theorem 2.3, whose proof can be found in the appendix.

Figure 1: The optimization objective of RMIB. $I(X_1;Z_1)$ and $I(X_2;Z_2)$ indicate that the text representations $Z_1$ and $Z_2$ should contain as little input information as possible. $H(Y|Z_1)$ and $H(Y|Z_2)$ indicate that a single text representation cannot complete correct text matching. $I(Z_1;Z_2)$ explicitly requires as much interaction as possible between $Z_1$ and $Z_2$.
Theorem 2.3. Given text representations $Z_1, Z_2$ and task label $Y$, let $C = \min\{\log|\mathcal{Z}_1|,\, \log|\mathcal{Z}_2|\}$, where $|\mathcal{Z}_i|$ is the number of elements of the random variable $Z_i$; then the following holds in asymmetrical domains text matching:

$$I(Z_1;Z_2\,|\,Y) \le C$$
Based on the Lagrange multiplier method and Theorem 2.2, we can further equivalently express the objective above as:

$$\min_{Z_1,Z_2} \; \beta_1 I(X_1;Z_1) + \beta_2 I(X_2;Z_2) - \big(I(Z_1,Z_2;Y) + I(Z_1;Z_2) + H(Y|Z_1) + H(Y|Z_2)\big)$$

where $\beta_1$ and $\beta_2$ are positive constants. For $I(Z_1,Z_2;Y)$, we can also replace it with $I(Z;Y)$, where $Z$ is the fusion of $Z_1$ and $Z_2$. Theorem 2.4 shows that if the costs of optimizing $I(Z_1,Z_2;Y)$ and $I(Z;Y)$ to the same value are no different, then directly optimizing $I(Z;Y)$ gives the optimization target of RMIB a smaller value, that is, a more optimized $Z_1$ and $Z_2$. The proof of Theorem 2.4 can be found in the appendix. Building on this objective, Theorem 2.5 gives the objective function of RMIB that is ultimately used for optimization during training; its proof can also be found in the appendix. It is worth noting that the application of RMIB is not limited to text encoding models or text interaction models, because the text representations $Z_1$ and $Z_2$ are produced in either kind of model.
Theorem 2.4. Given text representations $Z_1, Z_2$, fusion text representation $Z$ and task label $Y$, if the cost of optimizing $I(Z_1,Z_2;Y)$ and $I(Z;Y)$ to the same value is identical, then optimizing $I(Z;Y)$ will obtain better $Z_1$ and $Z_2$ in RMIB.
Theorem 2.5. The optimization objective of RMIB can be equivalently expressed as:

$$\min_{\theta} \; \mathbb{E}_{p(x_1,x_2,y)}\big[-\log p(y|z_1,z_2)\big] + \beta_1\, \mathbb{E}_{p(x_1)}\mathrm{KL}\big(p(z_1|x_1)\,\|\,q(z)\big) + \beta_2\, \mathbb{E}_{p(x_2)}\mathrm{KL}\big(p(z_2|x_2)\,\|\,q(z)\big) - \mathbb{E}_{p(z_1,z_2,y)}\big[\log g(z_1,z_2)\big] + \mathbb{E}_{p(y)}\log \mathbb{E}_{p(z_1|y)\,p(z_2|y)}\big[g(z_1,z_2)\big]$$

where $\beta_1$ and $\beta_2$ are positive constants and $g(\cdot)$ is a multilayer perceptron whose output layer activation function is sigmoid.
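This objective can be assembled into a training loss. The sketch below is a minimal reading of Theorem 2.5, not the authors' released implementation (the linked repository should be treated as authoritative): Gaussian encoder heads give the two KL terms in closed form, and the interaction term scores paired representations against in-batch shuffled pairs with a sigmoid-output MLP `g`, a common discriminator-style lower-bound estimator of the mutual information between $Z_1$ and $Z_2$. Note that the interaction term in Theorem 2.5 conditions on $Y$, which a fuller implementation would reflect (e.g., by shuffling within same-label groups); the module names and coefficients `beta1`, `beta2` are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), batch-averaged."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()

class Discriminator(nn.Module):
    """g(z1, z2): MLP with a sigmoid output scoring whether (z1, z2) is a real pair."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, z1, z2):
        return torch.sigmoid(self.net(torch.cat([z1, z2], dim=-1))).squeeze(-1)

def rmib_loss(logits, y, mu1, logvar1, mu2, logvar2, z1, z2, g, beta1, beta2):
    ce = F.cross_entropy(logits, y)                       # sufficiency surrogate
    kl = kl_to_standard_normal(mu1, logvar1) + kl_to_standard_normal(mu2, logvar2)
    perm = torch.randperm(z2.size(0), device=z2.device)   # shuffled z2 as negative pairs
    pos, neg = g(z1, z2), g(z1, z2[perm])
    interaction = -(torch.log(pos + 1e-8).mean() + torch.log(1.0 - neg + 1e-8).mean())
    return ce + beta1 * kl + beta2 * interaction
```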
3. Experiment
3.1. Dataset and Metrics
SICK (Sentences Involving Compositional Knowledge) (Marelli et al., 2014) is a dataset for compositional distributional semantics consisting of 9.8k sentence pairs. Each sentence pair is labeled as contradiction, neutral or entailment. Accuracy is used as the evaluation metric for this dataset.
SciTail (Science Entailment) (Khot et al., 2018) is an entailment dataset created from multiple-choice science exams and web sentences; the label is either entailment or neutral. The dataset contains 27k samples in total, of which 10k have entailment labels and the remaining 17k are marked neutral. Accuracy is used as the evaluation metric for this dataset.
WikiQA (Yang et al., 2015) is a retrieval-based question answering dataset built on Wikipedia; a binary label indicates whether a candidate sentence matches the question. The dataset contains 20.4k training pairs, 2.7k development pairs and 6.2k test pairs. Mean average precision (MAP) and mean reciprocal rank (MRR) are used as the evaluation metrics for this dataset (a small computation sketch follows the dataset descriptions).
SNLI (Stanford Natural Language Inference) (Bowman et al., 2015) is a benchmark dataset for natural language inference containing 570k human-annotated sentence pairs. One of the two sentences serves as the "premise" and the other as the "hypothesis", and the label is one of "entailment", "neutral", "contradiction" and "-". Following the practice of (Bowman et al., 2015), data labeled "-" are removed. Accuracy is used as the evaluation metric for this dataset.
Quora Question Pair is a dataset for paraphrase identification containing 400k question pairs with binary labels collected from the Quora website; the task is to determine whether the two questions are paraphrases of each other. Accuracy is used as the evaluation metric for this dataset.
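For reference, MAP and MRR over ranked candidate lists, as used for WikiQA above, can be computed as in the following sketch (ours; `ranked_labels` holds one binary relevance list per question, sorted by model score):

```python
def map_mrr(ranked_labels):
    """ranked_labels: list of per-question binary relevance lists, best-scored first."""
    aps, rrs = [], []
    for labels in ranked_labels:
        hits, precisions, rr = 0, [], 0.0
        for rank, rel in enumerate(labels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)   # precision at this relevant position
                if rr == 0.0:
                    rr = 1.0 / rank              # reciprocal rank of the first hit
        if hits:                                 # questions with no relevant answer are skipped
            aps.append(sum(precisions) / hits)
            rrs.append(rr)
    return sum(aps) / len(aps), sum(rrs) / len(rrs)

print(map_mrr([[0, 1, 0, 1], [1, 0, 0]]))       # (MAP, MRR) = (0.75, 0.75)
```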
3.2. Implementation Details
The data preprocessing for SNLI, Quora Question Pair, SciTail and WikiQA is consistent with RE2 (Yang et al., 2019). We use 840B-300d GloVe (Pennington et al., 2014) to initialize the embedding layers of RE2 (Yang et al., 2019) and ESIM (Chen et al., 2017). We choose the standard Gaussian distribution as the prior $q(z)$ because, given a fixed mean and variance, the Gaussian distribution has maximum entropy, i.e., it introduces the least prior knowledge. For the KL terms in RMIB, we optimize them with the reparameterization trick (Kingma & Welling, 2013). The hyper-parameters of ESIM and RE2 are consistent with (Yang et al., 2019). For BERT (Kenton & Toutanova, 2019) and SBERT (Reimers & Gurevych, 2019), we use max pooling to obtain the text representations, the learning rate is 2e-5 and the number of epochs is 6. We use BERT-base to initialize BERT and SBERT in our experiments. For SBERT, since $Z_1$ and $Z_2$ do not interact before the fusion layer, we apply the corresponding term to the fused representation instead of to $Z_1$ and $Z_2$ separately. All experiments use Adam (Kingma & Ba, 2014) as the optimizer, and the coefficients $\beta_1$ and $\beta_2$ in RMIB are selected by hyper-parameter search. To ensure the reproducibility of the results, all experiments use the same seed.
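The max pooling mentioned above, which turns BERT token states into a single text representation while ignoring padding, can be sketched as follows (names are ours):

```python
import torch

def max_pool(hidden_states, attention_mask):
    """hidden_states: (B, L, H) token states; attention_mask: (B, L), 1 for real tokens."""
    mask = attention_mask.unsqueeze(-1).bool()                # (B, L, 1)
    masked = hidden_states.masked_fill(~mask, float("-inf"))  # padding never wins the max
    return masked.max(dim=1).values                           # (B, H) text representation
```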
3.3. Performance of RMIB
To verify the validity of RMIB, we conduct experiments with four common text matching models on five public datasets. BERT and SBERT are pre-trained models, while RE2 and ESIM are non-pre-trained models; BERT, RE2 and ESIM are text interaction models, while SBERT is a text encoding model. The experimental results are shown in Table 1. From Table 1 it can be seen that, compared with the vanilla models, almost all models show significant improvements on the five datasets after adding RMIB. Compared with BERT, RE2 and ESIM, RMIB improves SBERT the most across the five datasets; for example, SBERT with RMIB achieves a large accuracy gain on SICK. Although SBERT is not as strong as BERT, the gap between the two shrinks sharply after adding RMIB; for example, the F1 difference between the original SBERT and BERT is reduced substantially once RMIB is added. Since the optimization goal of RMIB is, on the premise that the task information is given, to make the interaction between text representations as large as possible, and since SBERT itself has no interaction between text representations before the fusion layer, RMIB can improve SBERT's performance by explicitly increasing this interaction. This also indicates that for text matching tasks, the interaction between text representations is crucial to the final result.
We find that RMIB brings larger improvements in the case of data hunger. The improvements of the four text matching models on SICK, SciTail and WikiQA are significantly greater than those on SNLI and Quora. We think this is because SNLI and Quora have significantly more data than SICK, SciTail and WikiQA; for example, SNLI has about 57 times as much data as SICK. (Koltchinskii, 2001) indicates that generalization performance improves as the amount of training data increases: when the training set is large enough, the model itself already generalizes well, so RMIB brings less improvement. We also find that BERT's accuracy on SNLI
Table 1: Performance of four text matching models with RMIB on five public datasets. The superscript * denotes results reproduced in our experimental environment; the other numbers are taken from (Yu et al., 2022). Results not reported on Quora in (Yu et al., 2022) are marked with -.

| Models | SICK (Acc.) | SciTail (Acc.) | SciTail (F1) | WikiQA (MAP) | WikiQA (MRR) | SNLI (Acc.) | Quora (Acc.) |
|---|---|---|---|---|---|---|---|
| RE2 | 84.20 | 86.61 | 88.73 | 74.96 | 76.58 | 89.00 | - |
| RE2(DDR-Match,JS) | 84.30 | 86.74 | 88.85 | 73.93 | 76.04 | 88.75 | - |
| RE2(DDR-Match,MMD) | 84.35 | 87.39 | 89.58 | 74.84 | 76.24 | 89.02 | - |
| RE2(DDR-Match,WD) | 85.39 | 87.04 | 89.12 | 75.31 | 76.89 | 89.09 | - |
| RE2* | 84.96 | 86.27 | 88.54 | 74.28 | 76.05 | 88.94 | 89.52 |
| RE2(RMIB) |  |  |  |  |  |  |  |
| BERT | 86.65 | 89.56 | 91.52 | 77.52 | 78.95 | 87.91 | - |
| BERT(DDR-Match,JS) | 85.04 | 90.40 | 92.13 | 79.03 | 80.87 | 87.90 | - |
| BERT(DDR-Match,MMD) | 86.68 | 90.22 | 91.96 | 79.54 | 81.21 | 88.02 | - |
| BERT(DDR-Match,WD) | 86.72 | 90.53 | 92.29 | 79.58 | 81.23 | 88.23 | - |
| BERT* | 86.69 | 93.79 | 94.88 | 83.40 | 85.01 | 90.46 | 90.41 |
| BERT(RMIB) | 87.40 | 94.12 | 95.16 | 84.29 | 86.02 | 90.43 | 90.72 |
| SBERT | 63.31 | 79.23 | 83.74 | 68.38 | 69.74 | 83.76 | - |
| SBERT(DDR-Match,JS) | 63.35 | 80.75 | 83.68 | 68.89 | 70.27 | 84.01 | - |
| SBERT(DDR-Match,MMD) | 65.29 | 81.67 | 84.49 | 68.85 | 70.16 | 83.91 | - |
| SBERT(DDR-Match,WD) | 64.39 | 81.93 | 84.77 | 69.04 | 70.43 | 84.17 | - |
| SBERT* | 72.05 | 83.44 | 86.27 | 71.22 | 72.67 | 80.04 | 86.46 |
| SBERT(RMIB) |  |  |  |  |  |  | 87.34 (+0.88) |
| ESIM* | 82.59 | 85.04 | 87.24 | 72.38 | 74.10 | 87.51 | 88.09 |
| ESIM(RMIB) | 83.29 | 86.03 | 88.29 | 72.99 | 74.55 | 87.53 | 88.26 |
decreased slightly after adding RMIB. We believe this may be because the heterogeneity of the texts to be matched in SNLI is less significant. We also evaluate the performance of RMIB on three additional datasets; the results are shown in the appendix.
3.4. Ablation Study
As shown in Theorem 2.5, the optimization objective of RMIB consists of four parts. Besides the cross-entropy loss, we run ablation experiments on the other three items to explore the role of each. Since RMIB is obtained by adding Interaction and Inadequacy to IB, we also ablate these two items to compare RMIB with IB: by setting the coefficients of the interaction and inadequacy terms in the RMIB optimization objective to 0, IB is obtained. The results of the ablation experiments are shown in Table 2. From Table 2 it can be seen that, in most cases, ablating components of RMIB degrades the results compared with the full RMIB model. This is consistent with Theorem 2.2, because the information-theoretic interpretation of the three ablated items of RMIB is to maximize the mutual information between text representations on the premise that the task information is given; when one of the items is ablated, this interpretation no longer holds. Comparing RMIB and IB, we find that RMIB consistently outperforms IB on all datasets and models, which again indicates that RMIB is more suitable for text matching tasks. We also find that the ablated terms play different roles on different datasets; for example, ablating one coefficient in RMIB decreases BERT's performance on SICK but improves it on Quora, compared with the original model. We also find that for text encoding models like SBERT, no matter how RMIB is ablated, it always brings improvements on SICK, SciTail and WikiQA. We believe this is because these three datasets are small and SBERT itself lacks explicit interaction, so adding even some of the RMIB terms brings gains.
4. Related Work
Text matching methods can be mainly divided into text encoding models and text interaction models. Early work explored encoding each text individually as a vector and then building a neural network classifier on the two vectors. In this paradigm, several different models are used to encode the input texts individually: recurrent neural networks and convolutional neural networks are used as text encoders (Bowman et al., 2015; Tai et al., 2015; Yu et al., 2014; Tan et al., 2016), and SBERT (Reimers & Gurevych, 2019) uses the pre-trained model BERT (Kenton & Toutanova, 2019) as the text encoder. Text encoding models do not explicitly consider interactions between texts, while text interaction models add interactions between the input texts to improve performance. ESIM (Chen et al., 2017) uses attention mechanisms to let the input texts interact. BiMPM (Wang et al., 2017) utilizes multiple kinds of attention to increase the interaction between input texts. DIIN (Gong et al.) uses DenseNet (Huang et al., 2017) as a deep convolutional feature extractor to extract information from the alignment
Table 2: Ablation of RMIB. The superscript - indicates that the corresponding coefficient is set to 0; setting the coefficients of both the interaction and inadequacy terms to 0 makes RMIB degenerate to IB.

| Models | SICK (Acc.) | SciTail (Acc.) | SciTail (F1) | WikiQA (MAP) | WikiQA (MRR) | SNLI (Acc.) | Quora (Acc.) |
|---|---|---|---|---|---|---|---|
| RE2 | 84.96 | 86.27 | 88.54 | 74.28 | 76.05 | 88.94 | 89.52 |
| RE2(RMIB) |  |  |  |  |  |  |  |
| BERT | 86.69 | 93.79 | 94.88 | 83.40 | 85.01 | 90.46 | 90.41 |
| BERT(RMIB) | 87.40 | 94.12 | 95.16 | 84.29 | 86.02 | 90.43 | 90.72 |
| SBERT | 72.05 | 83.44 | 86.27 | 71.22 | 72.67 | 80.04 | 86.46 |
| SBERT(RMIB) |  |  |  |  |  |  | 87.34 |
| ESIM | 82.59 | 85.04 | 87.24 | 72.38 | 74.10 | 87.51 | 88.09 |
| ESIM(RMIB) | 83.29 | 86.03 | 88.29 | 72.99 | 74.55 | 87.53 | 88.26 |
results. DRCN (Kim et al., 2019) stacks encoding layers and attention layers and then connects all previously aligned results. RE2 (Yang et al., 2019) introduces an architecture built on convolutional layers, attention layers and reinforced residual connections. DDR-Match (Yu et al., 2022) is a regularizer based on distribution distance that directly narrows the encoding distributions of the input texts.
The information bottleneck (IB) is proposed by (Tishby et al., 2000) as a generalization of the minimal sufficient statistic, allowing trade-offs between the sufficiency and complexity of the representation. Later, (Tishby & Zaslavsky, 2015) and (Shwartz-Ziv & Tishby, 2017) advocate using the IB between data and the activations of deep neural networks to study the sufficiency and minimality of learned representations. Since mutual information is difficult to compute directly, VIB (Alemi et al., 2016) gives a variational bound on the information bottleneck objective so that it can be incorporated into the training process. DisenIB (Pan et al., 2021) introduces the disentangled information bottleneck from the perspective of supervised disentangling. PIB (Wang et al., 2021) establishes an information bottleneck between the accuracy of neural networks and the complexity of the information stored in the weights.
5. Relationship between RMIB and Contrastive Learning
Contrastive loss (Chen et al., 2020) aims at pulling together samples with similar semantics in the semantic vector space and pushing apart samples with significant semantic differences. Given a sample, the core of contrastive learning is how to efficiently construct positive and negative sample pairs. SimCSE (Gao et al., 2021) proposes two ways of constructing such pairs for NLP scenarios; for the supervised version, SimCSE treats sentence pairs labeled "entailment" as positives and pairs labeled "contradiction" as negatives. Contrastive learning emphasizes the distance between samples with different semantics, while RMIB emphasizes that the distributions of texts from different input domains should be as similar as possible. Moreover, the output of RMIB is discrete (a matching label), while that of contrastive learning is continuous (a similarity).
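For comparison with RMIB's objective, a minimal in-batch InfoNCE loss of the kind used by SimCSE looks as follows (a sketch; the temperature value and names are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(h1, h2, temperature=0.05):
    """h1[i] and h2[i] embed a positive pair; other in-batch items serve as negatives."""
    h1, h2 = F.normalize(h1, dim=-1), F.normalize(h2, dim=-1)
    logits = h1 @ h2.t() / temperature            # (B, B) cosine similarities
    targets = torch.arange(h1.size(0), device=h1.device)
    return F.cross_entropy(logits, targets)       # pull diagonal (positive) pairs together
```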
Since the aim of contrastive learning differs from that of text matching, we find it impossible to extend RMIB directly to SimCSE: the inadequacy constraint on representations in RMIB cannot be applied to contrastive learning. However, drawing on the ideas of contrastive learning, we may be able to extend RMIB to unsupervised learning. We can treat positive pairs $(x, x^+)$ as one class and negative pairs $(x, x^-)$ as another class to form a binary dataset, and then use RMIB to learn this task. The difference from SimCSE is that our aim is to classify rather than to learn the similarity of representation semantics.
6. Conclusion
In this paper, we theoretically prove that narrowing the distribution of text representations in asymmetrical domains text matching is equivalent to optimizing the information bottleneck. Since the information bottleneck does not constrain the interaction between text representations, we extend it for asymmetrical domains text matching and obtain the representation matching information bottleneck (RMIB). We also theoretically prove that the optimization objective of RMIB is equivalent to maximizing the mutual information between text representations given the task information. Experimental results show the effectiveness of RMIB. The main limitation of RMIB is that some hyper-parameters in its optimization objective need to be set manually; we will explore how to set these hyper-parameters adaptively in future work.
Impact Statement
To address the problem of handling domain variance in text matching tasks, we introduce the Representation Matching Information Bottleneck (RMIB) framework. We provide both theoretical foundations and empirical evidence to verify the effectiveness of RMIB.
References

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In International Conference on Machine Learning, pp. 531-540. PMLR, 2018.

Bowman, S., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632-642, 2015.

Chen, Q., Zhu, X., Ling, Z.-H., Wei, S., Jiang, H., and Inkpen, D. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1657-1668, 2017.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597-1607. PMLR, 2020.

Dolan, B. and Brockett, C. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005), 2005.

Gao, T., Yao, X., and Chen, D. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821, 2021.

Gong, Y., Luo, H., and Zhang, J. Natural language inference over interaction space. In International Conference on Learning Representations.

Hazarika, D., Zimmermann, R., and Poria, S. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 1122-1131, 2020.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, 2017.

Kenton, J. D. M.-W. C. and Toutanova, L. K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, volume 1, pp. 2, 2019.

Khot, T., Sabharwal, A., and Clark, P. SciTail: A textual entailment dataset from science question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Kim, S., Kang, I., and Kwak, N. Semantic sentence matching with densely-connected recurrent and co-attentive information. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 6586-6593, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., Salakhutdinov, R., et al. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2. Lille, 2015.

Koltchinskii, V. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902-1914, 2001.

Li, J., Jing, M., Zhu, L., Ding, Z., Lu, K., and Yang, Y. Learning modality-invariant latent representations for generalized zero-shot learning. In Proceedings of the 28th ACM International Conference on Multimedia, 2020.

Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R., Zamparelli, R., et al. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, pp. 216-223. Reykjavik, 2014.

Pan, Z., Niu, L., Zhang, J., and Zhang, L. Disentangled information bottleneck. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 9285-9293, 2021.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, 2014.

Reimers, N. and Gurevych, I. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982-3992, 2019.

Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

Tai, K. S., Socher, R., and Manning, C. D. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

Tan, M., Dos Santos, C., Xiang, B., and Zhou, B. Improved representation learning for question answer matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 464-473, 2016.

Tay, Y., Luu, A. T., and Hui, S. C. Hermitian co-attention networks for text matching in asymmetrical domains. In IJCAI, volume 18, pp. 4425-4431, 2018.

Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1-5. IEEE, 2015.

Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

Wang, Z., Hamza, W., and Florian, R. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 4144-4150, 2017.

Wang, Z., Huang, S.-L., Kuruoglu, E. E., Sun, J., Chen, X., and Zheng, Y. PAC-Bayes information bottleneck. In International Conference on Learning Representations, 2021.

Yang, R., Zhang, J., Gao, X., Ji, F., and Chen, H. Simple and effective text matching with richer alignment features. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4699-4709, 2019.

Yang, Y., Yih, W.-t., and Meek, C. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2013-2018, 2015.

Yu, L., Hermann, K. M., Blunsom, P., and Pulman, S. Deep learning for answer sentence selection. arXiv preprint arXiv:1412.1632, 2014.

Yu, W., Xu, C., Xu, J., Pang, L., and Wen, J.-R. Distribution distance regularized sequence representation for text matching in asymmetrical domains. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 2022.
Appendix

The Performance of RMIB on Additional Datasets
We also evaluate the performance of RMIB on three additional datasets. MRPC (Microsoft Research Paraphrase Corpus) (Dolan & Brockett, 2005) is a corpus of 5,801 sentence pairs collected from newswire articles, each pair labeled by human annotators as a paraphrase or not. The RTE (Recognizing Textual Entailment) datasets (Wang et al., 2018) come from a series of textual entailment challenges, with examples constructed from news and Wikipedia text. WNLI (Wang et al., 2018) contains pairs of sentences, and the task is to determine whether the second sentence is an entailment of the first; the dataset is used to train and evaluate models on their ability to understand these relationships between sentences. The experimental results are as follows and show that RMIB can significantly improve the performance of the models.
Table 3: Performance of BERT and SBERT with RMIB on three additional datasets.

| Models | MRPC (Acc.) | WNLI (Acc.) | RTE (Acc.) |
|---|---|---|---|
| BERT | 85.54 | 53.52 | 62.82 |
| RMIB(BERT) | 87.01 | 56.34 | 69.31 |
| SBERT | 71.57 | 50.70 | 54.51 |
| RMIB(SBERT) | 77.21 | 60.56 | 62.82 |
Proof of Theorem 2.1
Theorem 2.1. Given matching texts $X_1, X_2$, text representations $Z_1, Z_2$ and task label $Y$, the following holds in asymmetrical domains text matching:

$$\min_{Z_1,Z_2} \mathbb{E}_{p(x_1,x_2,y)}\big[-\log p(y|z_1,z_2)\big] + \beta_1 \mathbb{E}_{p(x_1)}\mathrm{KL}\big(p(z_1|x_1)\,\|\,q(z)\big) + \beta_2 \mathbb{E}_{p(x_2)}\mathrm{KL}\big(p(z_2|x_2)\,\|\,q(z)\big) \;\ge\; \min_{Z_1,Z_2}\big(-I(Z_1,Z_2;Y) + \beta_1 I(X_1;Z_1) + \beta_2 I(X_2;Z_2)\big) + H(Y)$$

where $\beta_1$ and $\beta_2$ are positive constants.
Proof. The optimization objective of the information bottleneck in text matching can be expressed as:

$$\min_{Z_1,Z_2} \; -I(Z_1,Z_2;Y) + \beta_1 I(X_1;Z_1) + \beta_2 I(X_2;Z_2)$$

For $I(X_1;Z_1)$, we can get:

$$I(X_1;Z_1) = \mathbb{E}_{p(x_1,z_1)}\left[\log \frac{p(z_1|x_1)}{p(z_1)}\right]$$

We can replace $p(z_1)$ by the prior distribution $q(z)$, and according to Gibbs' inequality we can get:

$$\mathbb{E}_{p(z_1)}\big[-\log p(z_1)\big] \;\le\; \mathbb{E}_{p(z_1)}\big[-\log q(z)\big]$$

Further, we can get:

$$I(X_1;Z_1) \;\le\; \mathbb{E}_{p(x_1,z_1)}\left[\log \frac{p(z_1|x_1)}{q(z)}\right] = \mathbb{E}_{p(x_1)}\,\mathrm{KL}\big(p(z_1|x_1)\,\|\,q(z)\big)$$

We can see that when $\mathbb{E}_{p(x_1)}\mathrm{KL}(p(z_1|x_1)\|q(z))$ is minimized, $I(X_1;Z_1)$ is also minimized. Based on the same proof method, we can also get:

$$I(X_2;Z_2) \;\le\; \mathbb{E}_{p(x_2)}\,\mathrm{KL}\big(p(z_2|x_2)\,\|\,q(z)\big)$$

For $I(Z_1,Z_2;Y)$, we can get:

$$I(Z_1,Z_2;Y) = H(Y) - H(Y|Z_1,Z_2)$$

Since $Z_1$ and $Z_2$ are the representations of $X_1$ and $X_2$, we can get:

$$H(Y|Z_1,Z_2) \;\le\; \mathbb{E}_{p(x_1,x_2,y)}\big[-\log p(y|z_1,z_2)\big]$$

Further, we can get:

$$I(Z_1,Z_2;Y) \;\ge\; H(Y) - \mathbb{E}_{p(x_1,x_2,y)}\big[-\log p(y|z_1,z_2)\big]$$

We can find that when maximizing $\mathbb{E}_{p(x_1,x_2,y)}[\log p(y|z_1,z_2)]$, $I(Z_1,Z_2;Y)$ is also maximized. Then we can get:

$$\min_{Z_1,Z_2} \mathbb{E}_{p(x_1,x_2,y)}\big[-\log p(y|z_1,z_2)\big] + \beta_1 \mathbb{E}_{p(x_1)}\mathrm{KL}\big(p(z_1|x_1)\|q(z)\big) + \beta_2 \mathbb{E}_{p(x_2)}\mathrm{KL}\big(p(z_2|x_2)\|q(z)\big) \;\ge\; \min_{Z_1,Z_2}\big(-I(Z_1,Z_2;Y) + \beta_1 I(X_1;Z_1) + \beta_2 I(X_2;Z_2)\big) + H(Y)$$

Since $H(Y)$ is constant for a given dataset, minimizing the left-hand side minimizes an upper bound of the information bottleneck objective. Therefore, Theorem 2.1 can be proved.
Proof of Theorem 2.2
Theorem 2.2. Given text representations $Z_1, Z_2$ and task label $Y$, the following holds in asymmetrical domains text matching:

$$I(Z_1;Z_2\,|\,Y) = I(Z_1,Z_2;Y) + I(Z_1;Z_2) + H(Y|Z_1) + H(Y|Z_2) - 2H(Y)$$
Proof. According to the definition of conditional mutual information, we can get:

$$I(Z_1;Z_2\,|\,Y) = I(Z_1;Z_2,Y) - I(Z_1;Y)$$

We can decompose $I(Z_1;Z_2,Y)$ into:

$$I(Z_1;Z_2,Y) = I(Z_1;Z_2) + I(Z_1;Y\,|\,Z_2)$$

So we can get:

$$I(Z_1;Z_2\,|\,Y) = I(Z_1;Z_2) + I(Z_1;Y\,|\,Z_2) - I(Z_1;Y)$$

We decompose $I(Z_1;Y|Z_2)$ and $I(Z_1;Y)$ into:

$$I(Z_1;Y\,|\,Z_2) = H(Y|Z_2) - H(Y|Z_1,Z_2), \qquad I(Z_1;Y) = H(Y) - H(Y|Z_1)$$

Further, we can get:

$$I(Z_1;Z_2\,|\,Y) = I(Z_1;Z_2) + H(Y|Z_1) + H(Y|Z_2) - H(Y|Z_1,Z_2) - H(Y)$$

We note the following relationship between $H(Y|Z_1,Z_2)$ and $I(Z_1,Z_2;Y)$:

$$H(Y|Z_1,Z_2) = H(Y) - I(Z_1,Z_2;Y)$$

Finally, we can get:

$$I(Z_1;Z_2\,|\,Y) = I(Z_1,Z_2;Y) + I(Z_1;Z_2) + H(Y|Z_1) + H(Y|Z_2) - 2H(Y)$$
Therefore, Theorem 2.2 can be proved.
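As a numerical sanity check of this identity (ours, not part of the paper), the following script evaluates both sides on a random discrete joint distribution $p(z_1, z_2, y)$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((3, 4, 2)); p /= p.sum()           # joint table p(z1, z2, y)

def H(pt):
    """Entropy of a (possibly multi-dimensional) probability table."""
    pt = pt[pt > 0]
    return -(pt * np.log(pt)).sum()

pz1, pz2, py = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))
pz1z2, pz1y, pz2y = p.sum(2), p.sum(1), p.sum(0)

lhs = H(pz1y) + H(pz2y) - H(py) - H(p)            # I(Z1;Z2|Y)
rhs = (H(pz1z2) + H(py) - H(p)                    # I(Z1,Z2;Y)
       + H(pz1) + H(pz2) - H(pz1z2)               # I(Z1;Z2)
       + H(pz1y) - H(pz1)                         # H(Y|Z1)
       + H(pz2y) - H(pz2)                         # H(Y|Z2)
       - 2 * H(py))                               # -2 H(Y)
print(np.isclose(lhs, rhs))                       # True
```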
Proof of Theorem 2.3
Theorem 2.3. Given text representations $Z_1, Z_2$ and task label $Y$, let $C = \min\{\log|\mathcal{Z}_1|,\, \log|\mathcal{Z}_2|\}$; then the following holds in asymmetrical domains text matching:

$$I(Z_1;Z_2\,|\,Y) \le C$$
Proof. According to the definition of conditional mutual information, we can get:

$$I(Z_1;Z_2\,|\,Y) = H(Z_1|Y) - H(Z_1|Z_2,Y)$$

Since $H(Z_1|Z_2,Y) \ge 0$, we can get:

$$I(Z_1;Z_2\,|\,Y) \le H(Z_1|Y)$$

Let the distribution $u(z_1)$ be uniform over the support of $Z_1$; then:

$$\mathrm{KL}\big(p(z_1|y)\,\|\,u(z_1)\big) = \log|\mathcal{Z}_1| - H(Z_1|Y=y) \ge 0$$

where $|\mathcal{Z}_1|$ is the number of elements of the random variable $Z_1$. Therefore, we can get:

$$I(Z_1;Z_2\,|\,Y) \le H(Z_1|Y) \le \log|\mathcal{Z}_1|$$

Since mutual information satisfies exchangeability, i.e., $I(Z_1;Z_2|Y) = I(Z_2;Z_1|Y)$, it can be obtained:

$$I(Z_1;Z_2\,|\,Y) \le \log|\mathcal{Z}_2|$$

Finally, we can get:

$$I(Z_1;Z_2\,|\,Y) \le \min\{\log|\mathcal{Z}_1|,\, \log|\mathcal{Z}_2|\} = C$$

Therefore, Theorem 2.3 can be proved.
Proof of Theorem 2.4
Theorem 2.4. Given text representations $Z_1, Z_2$, fusion text representation $Z$ and task label $Y$, if the cost of optimizing $I(Z_1,Z_2;Y)$ and $I(Z;Y)$ to the same value is identical, then optimizing $I(Z;Y)$ will obtain better $Z_1$ and $Z_2$ in RMIB.
Proof. According to the definition of mutual information:

$$I(Z_1,Z_2;Y) = H(Y) - H(Y|Z_1,Z_2), \qquad I(Z;Y) = H(Y) - H(Y|Z)$$

Since $Y \to (Z_1,Z_2) \to Z$ forms a Markov chain, $Y$ and $Z$ are independent of each other given $(Z_1,Z_2)$, that is, $p(y|z_1,z_2,z) = p(y|z_1,z_2)$. Then, by the data processing inequality:

$$I(Z_1,Z_2;Y) \;\ge\; I(Z;Y)$$

Based on the non-negativity of mutual information, then:

$$I(Z_1,Z_2;Y) - I(Z;Y) \;\ge\; 0$$

We can see that $I(Z_1,Z_2;Y)$ is an upper bound of $I(Z;Y)$. Therefore, if the cost of optimizing $I(Z_1,Z_2;Y)$ and $I(Z;Y)$ to the same value is the same, optimizing $I(Z;Y)$ will result in a higher $I(Z_1,Z_2;Y)$. Since one of the optimization goals of RMIB is to maximize $I(Z_1,Z_2;Y)$, the higher its value, the better $Z_1$ and $Z_2$ are learned.
Proof of Theorem 2.5
Theorem 2.5. The optimization objective of RMIB can be equivalently expressed as:

$$\min_{\theta} \; \mathbb{E}_{p(x_1,x_2,y)}\big[-\log p(y|z_1,z_2)\big] + \beta_1\, \mathbb{E}_{p(x_1)}\mathrm{KL}\big(p(z_1|x_1)\,\|\,q(z)\big) + \beta_2\, \mathbb{E}_{p(x_2)}\mathrm{KL}\big(p(z_2|x_2)\,\|\,q(z)\big) - \mathbb{E}_{p(z_1,z_2,y)}\big[\log g(z_1,z_2)\big] + \mathbb{E}_{p(y)}\log \mathbb{E}_{p(z_1|y)\,p(z_2|y)}\big[g(z_1,z_2)\big]$$

where $\beta_1$ and $\beta_2$ are positive constants and $g(\cdot)$ is a multilayer perceptron whose output layer activation function is sigmoid.
Proof. The optimization objective of RMIB is:

$$\min_{Z_1,Z_2} \; \beta_1 I(X_1;Z_1) + \beta_2 I(X_2;Z_2) - I(Z_1;Z_2\,|\,Y)$$

$H(Y)$ is a constant when a dataset is given. From the proof of Theorem 2.1 we can see that the compression terms are upper-bounded by KL terms against the prior, and that the sufficiency contained in $I(Z_1;Z_2|Y)$ (via Theorem 2.2) is optimized through the cross-entropy surrogate:

$$I(X_i;Z_i) \;\le\; \mathbb{E}_{p(x_i)}\,\mathrm{KL}\big(p(z_i|x_i)\,\|\,q(z)\big), \qquad i \in \{1,2\}$$

According to the definitions of mutual information and KL divergence, for $I(Z_1;Z_2|Y)$ there is:

$$I(Z_1;Z_2\,|\,Y) = \mathbb{E}_{p(z_1,z_2,y)}\left[\log \frac{p(z_1,z_2|y)}{p(z_1|y)\,p(z_2|y)}\right]$$

Let the variational distribution $q_g$ be defined as:

$$q_g(z_1,z_2\,|\,y) = \frac{p(z_1|y)\,p(z_2|y)\,g(z_1,z_2)}{\mathbb{E}_{p(z_1|y)\,p(z_2|y)}\big[g(z_1,z_2)\big]}$$

where $g$ is a given function whose outputs lie in $(0,1)$. By construction, $q_g$ is a valid probability distribution. We can note that:

$$I(Z_1;Z_2\,|\,Y) = \mathbb{E}_{p(z_1,z_2,y)}\left[\log \frac{q_g(z_1,z_2|y)}{p(z_1|y)\,p(z_2|y)}\right] + \mathbb{E}_{p(y)}\,\mathrm{KL}\big(p(z_1,z_2|y)\,\|\,q_g(z_1,z_2|y)\big)$$

Due to the non-negative property of KL divergence, it can be obtained that:

$$I(Z_1;Z_2\,|\,Y) \;\ge\; \mathbb{E}_{p(z_1,z_2,y)}\big[\log g(z_1,z_2)\big] - \mathbb{E}_{p(y)}\log \mathbb{E}_{p(z_1|y)\,p(z_2|y)}\big[g(z_1,z_2)\big]$$

Therefore, replacing each term of the RMIB objective with its bound yields a tractable upper bound, which is exactly the objective of Theorem 2.5, with $g$ realized as a multilayer perceptron whose output layer activation function is sigmoid. Based on the above derivation, Theorem 2.5 can be proved.
"Equal contribution Central South University, Changsha, China Cheetah Mobile Inc., Beijing, China Datawhale Org., Hangzhou, China. Correspondence to: Zhifang Liao zfliao@csu.edu.cn “ 中南大学,长沙,中国 猎豹移动公司,北京,中国 数据鲸鱼组织,杭州,中国。通信地址:辽志芳 zfliao@csu.edu.cn
Proceedings of the International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).