
MultiplexNet: Towards Fully Satisfied Logical Constraints in Neural Networks

Nicholas Hoernle,1 Rafael Michael Karampatsis,1 Vaishak Belle,1 Kobi Gal1,2
Abstract

We propose a novel way to incorporate expert knowledge into the training of deep neural networks. Many approaches encode domain constraints directly into the network architecture, requiring non-trivial or domain-specific engineering. In contrast, our approach, called MultiplexNet, represents domain knowledge as a logical formula in disjunctive normal form (DNF) which is easy to encode and to elicit from human experts. It introduces a Categorical latent variable that learns to choose which constraint term optimizes the error function of the network, and it compiles the constraints directly into the output of existing learning algorithms. We demonstrate the efficacy of this approach empirically on several classical deep learning tasks, such as density estimation and classification in both supervised and unsupervised settings where prior knowledge about the domains was expressed as logical constraints. Our results show that the MultiplexNet approach learned to approximate unknown distributions well, often requiring fewer data samples than the alternative approaches. In some cases, MultiplexNet finds better solutions than the baselines, or solutions that could not be achieved with the alternative approaches. Our contribution is in encoding domain knowledge in a way that facilitates inference that is shown to be both efficient and general; and critically, our approach guarantees 100% constraint satisfaction in a network’s output.

Introduction

An emerging theme in the development of deep learning is to provide expressive tools that allow domain experts to encode their prior knowledge into the training of neural networks. For example, in a manufacturing setting, we may wish to encode that an actuator for a robotic arm does not exceed some threshold (e.g., causing the arm to move at a hazardous speed). Another example is a self-driving car, where a controller should be known to operate within a predefined set of constraints (e.g., the car should always stop completely at a stop street). In such safety critical domains, machine learning solutions must guarantee to operate within distinct boundaries that are specified by experts (Amodei et al. 2016).

One possible solution is to encode the relevant domain knowledge directly into a network’s architecture which may require non-trivial and/or domain-specific engineering  (Goodfellow et al. 2016). An alternative approach is to express domain knowledge as logical constraints which can then be used to train neural networks (Xu et al. 2018; Fischer et al. 2019; Allen, Balažević, and Hospedales 2020). These approaches compile the constraints into the loss function of the training algorithm, by quantifying the extent to which the output of the network violates the constraints. This is appealing as logical constraints are easy to elicit from people. However, the solution outputted by the network is designed to minimize the loss function — which combines both data and constraints — rather than to guarantee the satisfaction of the domain constraints. Thus, representing constraints in the loss function is not suitable for safety critical domains where 100% constraint satisfaction is desirable.

Safety critical settings are not the only application for domain constraints. Another common problem in the training of large networks is that of data inefficiency. Deep models have shown unprecedented performance on a wide variety of tasks but these come at the cost of large data requirements (for instance, OpenAI’s GPT-3 (Brown et al. 2020) was trained on about 500 billion tokens, and ImageNet-21k, used to train the ViT network (Dosovitskiy et al. 2020), consists of 14 million images). For tasks where domain knowledge exists, learning algorithms should also use this knowledge to structure a network’s training to reduce the data burden that is placed on the learning process (Fischer et al. 2019).

This paper directly addresses these challenges by providing a new way of representing domain constraints directly in the output layer of a network that guarantees constraint satisfaction. The proposed approach represents domain knowledge as a logical formula in disjunctive normal form (DNF). It augments the output layer of an existing neural network to include a separate transformation for each term in the DNF formula. We introduce a latent Categorical variable that selects the best transformation that optimizes the loss function of the data. In this way, we are able to represent arbitrarily complex domain constraints in an automated manner, and we are also able to guarantee that the output of the network satisfies the specified constraints.

We show the efficacy of this MultiplexNet approach in three distinct experiments. First, we present a density estimation task on synthetic data. It is a common goal in machine learning to draw samples from a target distribution, and deep generative models have shown to be flexible and powerful tools for solving this problem. We show that by including domain knowledge, a model can learn to approximate an unknown distribution on fewer samples, and the model will (by construction) only produce samples that satisfy the domain constraints. This experiment speaks to both the data efficiency and the guaranteed constraint satisfaction desiderata. Second, we present an experiment on the popular MNIST data set (LeCun, Cortes, and Burges 2010) which combines structured data with domain knowledge. We structure the digits in a similar manner to the MNIST experiment from Manhaeve et al. (2018); however, we train the network in an entirely label-free manner (Stewart and Ermon 2017). In our third experiment, we apply our approach to the well known image classification task on the CIFAR100 data set (Krizhevsky, Hinton et al. 2009). Images are clustered according to “super classes” (e.g., both maple tree and oak tree fall under the super class tree). We follow the example of Fischer et al. (2019) and show that by including the knowledge that images within a super class are related, we can increase the classification accuracy at the super class level.

The paper contributes a novel and general way to integrate domain knowledge in the form of a logical specification into the training of neural networks. We show that domain knowledge may be used to restrict the network’s operating domain such that any output is guaranteed to satisfy the constraints; and in certain cases, the domain knowledge can help to train the networks on fewer samples of data.

Problem Specification

We consider a data set of $N$ i.i.d. samples from a hybrid (some mixture of discrete and/or continuous variables) probability density (Belle, Passerini, and Van den Broeck 2015). Moreover, we assume that: (1) the data set was generated by some random process $p^*(x)$; and (2) there exists domain or expert knowledge, in the form of a logical formula $\Phi$, about the random process $p^*(x)$ that can express the domain where $p^*(x)$ is feasible (non-zero). Both of these assumptions are summarised in Eq. 1. In Eq. 1, the notation $x \models \Phi$ denotes that the sample $x$ satisfies the formula $\Phi$ (Barrett et al. 2009). For example, if $\Phi := x > 3.5 \land y > 0$, and given some sample $(x, y) = (5, 2)$, we denote: $(x, y) \models \Phi$.

$x \sim p^*(x) \implies x \models \Phi$  (1)

Our aim is to approximate $p^*(x)$ with some parametric model $p_\theta(x)$ and to incorporate the domain knowledge $\Phi$ into the maximum likelihood estimation of $\theta$ on the available data set.

Given knowledge of the constraints $\Phi$, we are interested in ways of integrating these constraints into the training of a network that approximates $p^*(x)$. We desire an algorithm that does not require novel engineering to solve a reparameterisation of the network; moreover, and especially salient for safety-critical domains, any sample $x$ from the model, $x \sim p_\theta(x)$, should imply that the constraints are satisfied. This is an especially important aspect to consider when comparing this method to alternative approaches, namely Fischer et al. (2019) and Xu et al. (2018), that do not give this same guarantee.

Related Work

The integration of domain knowledge into the training of neural networks is an emerging area of focus. Many previous studies attempt to translate logical constraints into a numerical loss. The two most relevant works in this line are the DL2 framework by Fischer et al. (2019) and the Semantic Loss approach by Xu et al. (2018). DL2 uses a loss term that trades off data with the domain knowledge. It defines a non-negative loss by interpreting the logical constraints using fuzzy logic and defining a measure that quantifies how far a network’s output is from the nearest satisfying solution. Semantic Loss also defines a term that is added to the standard network loss. Their loss function uses weighted model counting (Chavira and Darwiche 2008) to evaluate the probability that a sample from a network’s output satisfies some Boolean constraint formulation. We differ from both of these approaches in that we do not add a loss term to the network’s loss function, rather we compile the constraints directly into its output. Furthermore, in contrast to the works above, any network output from MultiplexNet will satisfy the domain constraints, which is crucial in safety critical domains.

It is also important to compare the expressiveness of the MultiplexNet constraints to those permitted by Fischer et al. (2019) and Xu et al. (2018). In MultiplexNet, the constraints can consist of any quantifier-free linear arithmetic formula over the rationals. Thus, variables can be combined over $+$ and $\geq$, and formulae over $\neg$, $\lor$ and $\land$. For example, $(x + y \geq 5) \land \neg(z \geq 5)$ but also $(x + y \geq z) \land (z > 5 \lor z < 3)$ are well defined formulae and therefore well defined constraints in our framework. The expressiveness is significant: for example, Xu et al. (2018) only allow for Boolean variables over $\{\neg, \land, \lor\}$. While Fischer et al. (2019) allow non-Boolean variables to be combined over $\{\geq, \leq\}$ and formulae to be used over $\{\neg, \lor, \land\}$, it is not a probabilistic framework, but one that is based on fuzzy logic. Thus, our work is probabilistic like the Semantic Loss (Xu et al. 2018), but it is more expressive in that it also allows real-valued variables over summations too.

Hu et al. (2016) introduce “iterative rule knowledge distillation” which uses a student and teacher framework to balance constraint satisfaction on first order logic formulae with predictive accuracy on a classification task. During training, the student is used to form a constrained teacher by projecting its weights onto a subspace that ensures satisfaction of the logic. The student is then trained to imitate the teacher’s restricted predictions. Hu et al. (2016) use soft logic (Bach et al. 2017) to encode the logic, thereby allowing gradient estimation; however, the approach is unable to express rules that constrain real-valued outputs. Xsat (Fu and Su 2016) focuses on the Satisfiability Modulo Theory (SMT) problem, which is concerned with deciding whether a (usually a quantifier-free form) formula in first-order logic is satisfied against a background arithmetic theory; similar to what we consider. They present a means for solving SMT formulae but this is not differentiable. Manhaeve et al. (2018) present a compelling method for integrating logical constraints, in the form of a ProbLog program, into the training of a network. However, the networks are embedded into the logic (represented by a Sentential Decision Diagram (Darwiche 2011)), as “neural predicates” and thus it is not clear how to handle the real-valued arithmetic constraints that we represent in MultiplexNet.

We also relate to work on program synthesis (Solar-Lezama 2009; Jha et al. 2010; Feng et al. 2017; Osera 2019) where the goal is to produce a valid program for a given set of constraints. Here, the output of a program is designed to meet a given specification. These works differ from this paper as they don’t focus on the core problem of aiding training with the constraints and ensuring that the constraints are fully satisfied.

Other recent works have also explored how human expert knowledge can be used to guide a network’s training. Ross and Doshi-Velez (2018); Ross, Hughes, and Doshi-Velez (2017) explore how the robustness of an image classifier can be improved by regularizing input gradients towards regions of the image that contain information (as identified by a human expert). They highlight the difficulty in eliciting expert knowledge from people but their technique is similar to the other works presented here in that the knowledge loss is still represented as an additive term to the standard network loss. Takeishi and Kawahara (2020) present an example of how the knowledge of relations of objects can be used to regularise a generative model. Again, the solution involves appending terms to the loss function, but they demonstrate that relational information can aid a learning algorithm. Alternative works have also explored means for constraining the latent variables in a latent variable model (Ganchev et al. 2010; Graça, Ganchev, and Taskar 2007). In contrast to this, we focus on constraining the output space of a generative model, rather than the latent space.

Finally, we mention work on the post-hoc verification of networks. Examples include the works of Katz et al. (2017) and Bunel et al. (2018) who present methods for validating whether a network will operate within the bounds of pre-defined restrictions. Our own work focuses on how to guarantee that the networks operate within given constraints, rather than how to validate their output.

Incorporating Domain Constraints into Model Design

We begin by describing how a satisfiability problem can be hard coded into the output of a network. We then present how any specification of knowledge can be compiled into a form that permits this encoding. An overview of the proposed architecture with a general algorithm that details how to incorporate domain constraints into training a network can be found in Appendix B in the supplementary material.

Satisfiability as Reparameterisation

Let $\tilde{x}$ denote the unconstrained output of a network. Let $g$ be a network activation that is element-wise non-negative (for example an exponential function, or a ReLU (Nair and Hinton 2010) or Softplus (Dugas et al. 2001) layer). If the property to be encoded is a simple inequality $\Phi: \forall x\; cx \geq b$, it is sufficient to constrain $\tilde{x}$ to be non-negative by applying $g$ and thereafter applying a linear transformation $f$ such that $\forall \tilde{x}: cf(g(\tilde{x})) \geq b$. In this case, $f$ can implement the transformation $f(z) = \mathrm{sgn}(c)\, z + \frac{b}{c}$, where $\mathrm{sgn}$ is the operator that returns the sign of $c$. By construction we have:

$f(g(\tilde{x})) \models \Phi$  (2)

It follows that more complex conjunctions of constraints can be encoded by composing transformations of the form presented in Eq. 2. We present below a few examples to demonstrate how this can be achieved for a number of common constraints (where $\tilde{x}$ always refers to the unconstrained output of the network):

$a < x < b \;\rightarrow\; x = -g(-g(\tilde{x}) + k(a, b)) + b$  (3)
$x = c \;\rightarrow\; x = c$  (4)
$x_2 > h(x_1) \;\rightarrow\; x_1 = \tilde{x}_1;\; x_2 = h(x_1) + g(\tilde{x}_2)$  (5)

In Eq. 3, we introduce the function $k(a, b)$. This is merely a function to compute the correct offset for a given activation $g$. In the case of the Softplus function, which is the function used in all of our experiments, $k(a, b) = \log(\exp(b - a) - 1)$.
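To make the primitive concrete, the following PyTorch sketch implements the half-space transform of Eq. 2 and the interval transform of Eq. 3, assuming a Softplus activation for $g$; the helper names (constrain_halfspace, constrain_interval) are ours for illustration and not from the paper.

```python
import torch
import torch.nn.functional as F

def k(a, b):
    # Offset from Eq. 3 for the Softplus activation: k(a, b) = log(exp(b - a) - 1).
    return torch.log(torch.exp(torch.tensor(b - a)) - 1.0)

def constrain_halfspace(x_tilde, c, b):
    # Eq. 2: x = sgn(c) * g(x_tilde) + b / c, so that c * x >= b for any x_tilde.
    sign = 1.0 if c > 0 else -1.0
    return sign * F.softplus(x_tilde) + b / c

def constrain_interval(x_tilde, a, b):
    # Eq. 3: x = -g(-g(x_tilde) + k(a, b)) + b, so that a < x < b for any x_tilde.
    return -F.softplus(-F.softplus(x_tilde) + k(a, b)) + b

x_tilde = 10.0 * torch.randn(5)                    # arbitrary unconstrained network output
print(constrain_halfspace(x_tilde, c=2.0, b=3.0))  # every entry satisfies 2x >= 3
print(constrain_interval(x_tilde, a=-1.0, b=4.0))  # every entry lies in (-1, 4)
```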

In Section: Experiments, we implement three varied experiments that demonstrate how complex constraints can be constructed from this basic primitive in Eq. 2. Conceptually, appending additional conjunctions to $\Phi$ serves to restrict the space that the output can represent. However, in many situations domain knowledge will consist of complicated formulae that exist well beyond mere conjunctions of inequalities.

While conjunctions serve to restrict the space permitted by the network’s output, disjunctions serve to increase the permissible space. For two terms $\phi_1$ and $\phi_2$ in $\phi_1 \lor \phi_2$ there exist three possibilities: namely, that $x \models \phi_1$ or $x \models \phi_2$ or $(x \models \phi_1) \land (x \models \phi_2)$. Given the fact that any unconstrained network output can be transformed to satisfy some term $\phi_k$, we propose to introduce multiple transformations of a network’s unconstrained output, each to model a different term $\phi_k$. In this sense, the network’s output layer can be viewed as a multiplexor in a logical circuit that permits a branching of logic. If $h_1(\tilde{x})$ represents the transformation of $\tilde{x}$ that satisfies $\phi_1$ and $h_2(\tilde{x}) \models \phi_2$, then we know the output must also satisfy $\phi_1 \lor \phi_2$ by choosing either $h_1$ or $h_2$. It is this branching technique for dealing with disjunctions that gives rise to the name of the approach: MultiplexNet.

We finally turn to the desideratum of allowing any Boolean formula over linear inequalities as the input for the domain constraints. The suggested approach can represent conjunctions of constraints and disjunctions between these conjunctive terms, which is exactly a DNF representation. Thus, the approach can be used with any transformed version of $\Phi$ that is in DNF (Darwiche and Marquis 2002). We propose to use an off-the-shelf solver, e.g., Z3 (De Moura and Bjørner 2008), to provide the logical input to the algorithm in DNF. We thus assume the domain knowledge $\Phi$ is expressed as:

$\Phi = \phi_1 \lor \phi_2 \lor \ldots \lor \phi_K$  (6)
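As a purely Boolean illustration of this DNF requirement, the following sketch uses sympy’s to_dnf; the arithmetic atoms of our constraints (e.g., $x + y \geq 5$) would instead be handled by an SMT solver such as Z3, and the propositions p, q, r below are placeholders of our own.

```python
from sympy import symbols
from sympy.logic.boolalg import to_dnf

# Convert a Boolean formula into disjunctive normal form.
p, q, r = symbols("p q r")
phi = (p | q) & ~r
print(to_dnf(phi, simplify=True))   # (p & ~r) | (q & ~r)
```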

If $h_k$ is the branch of MultiplexNet that ensures the output of the network satisfies $x \models \phi_k$, then it follows by construction that $h_k(\tilde{x}) \models \Phi$ for all $k \in [1, \dots, K]$. For example, consider a network with a single real-valued output $\tilde{x} \in \mathbb{R}$. If the knowledge is $\Phi := (x \geq 2) \lor (x \leq -2)$, we would then have the two terms $h_1(\tilde{x}) = g(\tilde{x}) + 2$ and $h_2(\tilde{x}) = -g(-\tilde{x}) - 2$. Here, $g$ is the element-wise non-negative network activation that was referred to in Section: Satisfiability as Reparameterisation. It is clear that both $x_1 = h_1(\tilde{x})$ and $x_2 = h_2(\tilde{x})$ satisfy the formula $\Phi$.
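A minimal PyTorch sketch of these two branches, again assuming a Softplus activation for $g$ (the names are illustrative):

```python
import torch
import torch.nn.functional as F

def h1(x_tilde):
    # Branch for the term (x >= 2): h1(x_tilde) = g(x_tilde) + 2.
    return F.softplus(x_tilde) + 2.0

def h2(x_tilde):
    # Branch for the term (x <= -2): h2(x_tilde) = -g(-x_tilde) - 2.
    return -F.softplus(-x_tilde) - 2.0

x_tilde = torch.randn(4)                              # unconstrained output
branches = torch.stack([h1(x_tilde), h2(x_tilde)])    # shape (K=2, batch)
# Whichever branch the Categorical variable selects, the output satisfies
# Phi := (x >= 2) or (x <= -2).
```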

Lemma 0.1

Suppose $\Phi$ is a quantifier-free first-order formula in DNF over $\{x_1, \dots, x_J\}$ consisting of terms $\phi_1 \lor \dots \lor \phi_K$. Since each branch of MultiplexNet ($h_k$) is constructed to satisfy a specific term ($\phi_k$), by construction, the output of MultiplexNet will satisfy $\Phi$: $\{\hat{x}_1, \dots, \hat{x}_J\} \models \Phi$.



MultiplexNet as a Latent Variable Problem

MultiplexNet introduces a latent Categorical variable $k$ that selects among the different terms $\phi_k, k \in [1, \dots, K]$. The model then incorporates a constraint transformation term $h_k$, conditional on the value of the Categorical variable.

$p_\theta(x) = p_\theta(h_k(x) \mid k)\, p(k)$  (7)

A lower bound on the likelihood of the data can be obtained by introducing a variational approximation to the latent Categorical variable $k$. This standard form of the variational lower bound (ELBO) is presented in Eq. 8.

$\log p_\theta(x) \geq \mathbb{E}_{q(k)}[\log p_\theta(h_k(x) \mid k) + \log p(k) - \log q(k)] := \mathrm{ELBO}(x)$  (8)

Gradient based methods require calculating the derivative of Eq. 8. However, as $q(k)$ is a Categorical distribution, the standard reparameterisation trick cannot be applied (Kingma and Welling 2014). One possibility for dealing with this expectation is to use the score function estimator, as in REINFORCE (Williams 1992); however, while the resulting estimator is unbiased, it has a high variance (Mnih and Gregor 2014). It is also possible to replace the Categorical variable with a continuous approximation as is done by Maddison, Mnih, and Teh (2017) and Jang, Gu, and Poole (2016); or, if the dimensionality of the Categorical variable is small, it can be marginalised out as in Kingma et al. (2014). In the experiments in Section: Experiments, we follow Kingma et al. (2014) and marginalise this variable (although we note that the alternatives should also be explored), leading to the following learning objective:

$\mathcal{L}(\theta; x) = -\sum_{k=1}^{K} q(k)\big[\log p_\theta(h_k(x) \mid k) + \log p(k) - \log q(k)\big]$  (9)
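A sketch of this marginalised objective, assuming a uniform prior $p(k) = 1/K$ and a variational distribution $q(k)$ produced as logits by an auxiliary network head (both of these choices are our own assumptions for illustration):

```python
import torch

def multiplexnet_loss(log_px_given_k, q_logits):
    # log_px_given_k[:, k] holds log p_theta(h_k(x) | k), evaluated for every branch k;
    # q_logits parameterise q(k) for each example. Implements Eq. 9 with p(k) = 1/K.
    K = log_px_given_k.shape[1]
    log_q = torch.log_softmax(q_logits, dim=1)
    q = log_q.exp()
    log_prior = -torch.log(torch.tensor(float(K)))
    elbo = (q * (log_px_given_k + log_prior - log_q)).sum(dim=1)
    return -elbo.mean()   # negative ELBO, averaged over the batch

# Example with K = 3 branches and a batch of 2:
loss = multiplexnet_loss(torch.randn(2, 3), torch.randn(2, 3))
```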

We show in Section: Experiments that this approach can be applied equally successfully for a generative modeling task (where the goal is density estimation) as for a discriminative task (where the goal is structured classification). This helps to demonstrate the universal applicability of incorporating domain knowledge into the training of networks.

Experiments

We apply MultiplexNet to three separate experimental domains. The first domain demonstrates a density estimation task on synthetic data when the number of available data samples are limited. We show how the value of the domain constraints improves the training when the number of data samples decreases; this demonstrates the power of adding domain knowledge into the training pipeline. The second domain applies MultiplexNet to labeling MNIST images in an unsupervised manner by exploiting a structured problem and data set. We use a similar experimental setup to the MNIST experiment from DeepProbLog (Manhaeve et al. 2018); however, we present a natural integration with a generative model that is not possible with DeepProbLog. The third experiment uses hierarchical domain knowledge to facilitate an image classification task taken from Fischer et al. (2019) and Xu et al. (2018). We show how the use of this knowledge can help to improve classification accuracy at the super class level.

Synthetic Data

Figure 1: Simulated data from an unknown density. We assume that we know some constraints about the domain; these are represented by the red boxes. We aim to represent the unknown density, subject to the knowledge that the constraints must be satisfied.
Figure 2: Results from the synthetic data experiment. (a) Negative lower bound to the held out likelihood of data (-ELBO). The MultiplexNet approach learns to represent the data with a higher likelihood, and faster, than the baselines. (b) % of reconstruction samples from the VAE that obey the domain constraints. The MultiplexNet approach, by construction, can only generate samples within the specified constraints.

In this illustrative experiment, we consider a target data set that consists of the six rectangular modes that are shown in Figure 1. The samples from the true target density are shown, along with 8 rectangular boxes in red. The rectangular boxes represent the domain constraints for this experiment. Here, we show that an expert might know where the data can exist but that the domain knowledge does not capture all of the details of the target density. Thus, the network is still tasked with learning the intricacies of the data that the domain constraints fail to address (e.g., not all of the area within the constraints contains data). However, we desire that the knowledge leads the network towards a better solution, and also to achieve this on fewer data samples from the true distribution.

This experiment represents a density estimation task and thus we use a likelihood-based generative model to represent the unknown target density, using both data samples and domain knowledge. We use a variational autoencoder (VAE) which optimizes a lower bound to the marginal log-likelihood of the data. However, a different generative model, for example a normalizing flow (Papamakarios et al. 2019) or a GAN (Goodfellow et al. 2014), could as easily be used in this framework. We optimize Eq. 9 where, for this experiment, the likelihood term $\log p_\theta(\cdot \mid k)$ is replaced by the standard VAE loss. Additional experimental details, as well as the full loss function, can be found in Appendix A.

We vary the size of the training data set with $N \in \{100, 250, 500, 1000\}$ as the four experimental conditions. We compare the lower bound to the marginal log-likelihood under three conditions: the MultiplexNet approach, as well as two baselines. The first baseline (Unaware VAE) is a vanilla VAE that is unaware of the domain constraints. This scenario represents the standard setting where domain knowledge is simply ignored in the training of a deep generative network. The second baseline (DL2-VAE) represents a method that appends a loss term to the standard VAE loss. It is important to note that this approach, from DL2 (Fischer et al. 2019), does not guarantee that the constraints are satisfied (clearly seen in Figure 2(b)).

Figure 2 presents the results where we run the experiment on the specified range of training data set sizes. The top plot shows the variational loss as a function of the number of epochs. For all sizes of training data, the MultiplexNet loss on a test set can be seen to outperform the baselines. By including domain knowledge, we can reach a better result, and on fewer samples of data, than by not including the constraints. More important than the likelihood on held-out data is that the samples from the models’ posterior should conform with the constraints. Figure 2(b) shows that the baselines struggle to learn the structure of the constraints. While the MultiplexNet result is unsurprising, since the constraints are followed by construction, the comparison to the baselines is stark. We also present samples from both the prior and the posterior for all of these models in Appendix A. In all of these, MultiplexNet learns to approximate the unknown density within the predefined boundaries of the provided constraints.

MNIST - Label-free Structured Learning

We demonstrate how a structured data set, in combination with the relevant domain knowledge, can be used to make novel inferences in that domain. Here, we use a similar experiment to that from Kingma et al. (2014) where we model the MNIST digit data set in an unsupervised manner. Moreover, we take inspiration from Manhaeve et al. (2018) for constructing a structured data set where the images represent the terms in a summation (e.g., $image(2) + image(3) = 5$). However, we add to the complexity of the task by (1) using no labels for any of the images (in the MNIST experiment from Manhaeve et al. (2018), the authors use the result of the summation as labels for the algorithm; we have no such analogy in this experiment and thus cannot use their DeepProbLog implementation as a baseline); and (2) considering a generative task.

Kingma et al. (2014) propose a generative model that reasons about the cluster assignment of a data point (a single image). In particular, in their popular “Model 2,” they describe a generative model for an image $x$ such that the probability of the image pixel values is conditioned on a latent variable ($z$) and a class label ($y$): $p_\theta(x \mid z, y)\, p(z \mid y)\, p(y)$. We can interpret this model using the MultiplexNet framework where the cluster assignment label $y = k$ implies that the image $x$ was generated from cluster $k$. Given a reconstruction loss for image $x$, conditioned on class label $y$ ($\mathcal{L}(x, y)$), the domain knowledge in this setting is: $\Phi := \bigvee_{k=1}^{10} \mathcal{L}(x, y) \land (y = k)$. We can successfully model the clustering of the data using this setup but there is no means for determining which label corresponds to which cluster assignment.

We therefore propose to augment the data set such that each input is a tuple of four images $(x_1, x_2, x_3, x_4)$ in the form $label(x_1) + label(x_2) = (label(x_3), label(x_4))$. Here, the inputs $label(x_1)$ and $label(x_2)$ can be any integer from 0 to 9, and the result $(label(x_3), label(x_4))$ is a two digit number from $(00)$ to $(18)$. While we do not know explicitly any of the cluster labels, we do know that the data conform to this standard. Thus for all $i, j, k$ where $k = i + j$, the domain knowledge is of the form:

$\Phi := \bigvee_{i,j,k} \Big[(y_1 = i) \land (y_2 = j) \land (y_3 = \mathds{1}_{k > 9}) \land (y_4 = k \bmod 10) \bigwedge_{n=1}^{4} \mathcal{L}(x_n, y_n)\Big]$  (10)

In this setting, the Categorical variable in MultiplexNet chooses among the 100 combinations that satisfy $label(x_1) + label(x_2) = (label(x_3), label(x_4))$. This experiment has similarities to DeepProbLog (Manhaeve et al. 2018) as the primitive $\mathcal{L}(x, y)$ is repeated for each digit. In this sense, it is similar to the “neural predicate” used by Manhaeve et al. (2018), and the MultiplexNet output layer implements what would be the logical program from DeepProbLog. However, it is not clear how to implement this label-free, generative task within the DeepProbLog framework.
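A sketch of how the 100 valid label assignments (the branches of the Categorical variable) could be enumerated and combined with a per-image class-conditional loss; recon_loss_fn, the tensor shapes, and the uniform prior are our own assumptions for illustration, not the authors’ implementation:

```python
import itertools
import torch

# Enumerate the 100 assignments (y1, y2, y3, y4) satisfying
# label(x1) + label(x2) = (label(x3), label(x4)).
branches = []
for i, j in itertools.product(range(10), range(10)):
    k = i + j
    branches.append((i, j, int(k > 9), k % 10))

def multiplex_mnist_loss(recon_loss_fn, images, q_logits):
    # images is a list of four image batches; recon_loss_fn(x, y) is assumed to return
    # the per-example class-conditional loss L(x, y) with shape (batch,).
    # q_logits has shape (batch, 100) and parameterises q(k); p(k) is taken as uniform.
    log_q = torch.log_softmax(q_logits, dim=1)
    q = log_q.exp()
    per_branch = [sum(recon_loss_fn(images[n], y) for n, y in enumerate(b)) for b in branches]
    losses = torch.stack(per_branch, dim=-1)          # shape (batch, 100)
    return (q * (losses + log_q)).sum(dim=1).mean()   # Eq. 9 with the constant prior dropped
```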

In Figure 3, we present samples from the prior, conditioned on the different class labels. The model is able to learn a class-conditional representation for the data, given no labels for the images. This is in contrast to a vanilla model (from Kingma et al. (2014)) which does not use the structure of the data set to make inferences about the class labels. We present these baseline samples as well as the experimental details and additional notes in Appendix A. Empirically, the results from this experiment were sensitive to the network’s initialisation and thus we report the accuracy of the top 5 runs. We selected the runs based on the loss (the ELBO) on a validation set (i.e., the labels were still not used in selecting the run). The accuracy of the inferred labels on held out data is $97.5 \pm 0.3$.

Figure 3: Reconstructed/decoded samples from the prior, $z$, of the trained model where each column conditions on a different value for $y$. It can be seen that the model has learnt to represent all of the digits $[0-9]$ with the correct class label, even though no labels were supplied to the training process.

Hierarchical Domain Knowledge on CIFAR100

Table 1: Accuracy on class label prediction and super-class label prediction, and constraint satisfaction on the CIFAR100 data set

Model | Class Accuracy | Super-class Accuracy | Constraint Satisfaction
Vanilla ResNet | 75.0 ± 0.1 | 84.0 ± 0.2 | 83.8 ± 0.1
Vanilla ResNet (SC only) | NA | 83.2 ± 0.2 | NA
Hierarchical Model | 71.2 ± 0.2 | 84.7 ± 0.1 | 100.0 ± 0.0
DL2 | 75.3 ± 0.1 | 84.3 ± 0.1 | 85.8 ± 0.2
MultiplexNet | 74.4 ± 0.2 | 85.4 ± 0.3 | 100.0 ± 0.0

The final experiment demonstrates how to encode hierarchical domain knowledge into the output layer of a network. The CIFAR100 (Krizhevsky, Hinton et al. 2009) data set consists of 100 classes of images where the 100 classes are in turn broken into 20 super-classes (SC). We wish to encode the belief that images from the same SC are semantically related. Following the encoding in Fischer et al. (2019), we consider constraints which specify that groups of classes should together be very likely or very unlikely. For example, suppose that the SC label is trees and the class label is maple. Our domain knowledge should state that the trees group must be very likely even if there is uncertainty in the specific label maple. Intuitively, it is egregious to misclassify this example as a tractor but it would be acceptable to make the mistake of oak. This can be implemented by training a network to predict first the SC for an unknown image and thereafter the class label, conditioned on the value for the SC.

We chose rather to implement this same knowledge using the MultiplexNet framework. Let $x_k \in SC_i$ denote the output of a network that predicts the $k^{th}$ class label within the $i^{th}$ SC. Let $\alpha \in [0, 1]$ denote the minimum requirement for a SC prediction (e.g., if $\alpha = 0.95$, we require that a SC be predicted with probability $0.95$ or more). The domain knowledge is:

$\bigvee_{i=1}^{20}\left[\bigwedge_{k \in SC_i}\Big(x_k > \log\big(\tfrac{\alpha}{1-\alpha}\big) + \log\sum_{j \notin SC_i}\exp\{x_j\}\Big)\right]$  (11)

Eq. 11 states that for all labels within a SC group, the unnormalised logits of the network should be greater than the normalised sum of the other labels belonging to the other SCs, with a margin of $\log(\frac{\alpha}{1-\alpha})$. We explain Eq. 11 further and present other experimental details in Appendix A. This constraint places a semantic grouping on the data as the network is forced into a low entropy prediction at the super class level.
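One way to realise a single branch of this constraint as a reparameterisation is sketched below (a construction of our own for illustration, not necessarily the authors’ exact layer): the logits of classes outside $SC_i$ are left unchanged, and every logit inside $SC_i$ is pushed above the required threshold with a Softplus.

```python
import torch
import torch.nn.functional as F

def constrain_superclass_logits(x_tilde, sc_mask, alpha=0.95):
    # Branch i of Eq. 11: x_tilde has shape (batch, 100); sc_mask is a boolean mask of
    # shape (100,) selecting the classes of super-class i. Returns logits in which every
    # class in SC_i exceeds log(alpha / (1 - alpha)) + logsumexp of the remaining logits.
    margin = torch.log(torch.tensor(alpha / (1.0 - alpha)))
    outside = x_tilde.masked_fill(sc_mask, float("-inf"))        # drop the SC_i logits
    lse_outside = torch.logsumexp(outside, dim=1, keepdim=True)  # log sum_{j not in SC_i} exp(x_j)
    inside = lse_outside + margin + F.softplus(x_tilde)          # strictly above the threshold
    return torch.where(sc_mask, inside, x_tilde)

logits = torch.randn(4, 100)
mask = torch.zeros(100, dtype=torch.bool)
mask[:5] = True                                   # e.g., the five classes of one super-class
constrained = constrain_superclass_logits(logits, mask)
```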

We compare the performance of MultiplexNet to four baselines and report the prediction accuracy on the fine class label as well as that on the super class label. We use a Wide ResNet 28-10 (Zagoruyko and Komodakis 2016) model in all of the experimental conditions. The first two baselines (Vanilla) only use the Wide ResNet model and are trained to predict the fine class and the super class labels respectively. The third baseline (Hierarchical) is trained to predict the super class label and thereafter the fine class label, conditioned on the value for the super class label. This represents the bespoke engineering solution to this hierarchical problem. The final baseline (DL2) implements the same logical specification that is used for MultiplexNet but uses the DL2 framework to append it to the standard cross-entropy loss function.

Table 1 presents the results for this experiment. Firstly, it is important to note the difficulty of this task. The Vanilla ResNet that predicts only the super-class labels for the images underperforms the baseline that is tasked with predicting the true class label. Moreover, while the hierarchical baseline does outperform the vanilla models on the task of super-class prediction, this comes at a cost to the true class accuracy. As the hierarchical baseline represents the bespoke engineering solution to the problem, it also achieves 100% constraint satisfaction, but this comes at the cost of a domain specific and custom implementation. The MultiplexNet approach provides a slight improvement in the SC classification accuracy and, importantly, the domain constraints are always met. As the domain knowledge prioritizes the accuracy at the SC level, we note that the MultiplexNet approach does not outperform the Vanilla ResNet at the class accuracy. Surprisingly, the DL2 baseline improves upon the class accuracy but it has a limited impact on the super class accuracy and on the constraint satisfaction.

Limitations and Discussion

The limitations of the suggested approach relate to the technical specification of the domain knowledge and to the practical implementation of this knowledge. We discuss first these two aspects and then we discuss a potential negative societal impact.

First, we require that experts be able to express precisely, in the form of a logical formula, the constraints that are valid for their domain. This may not always be possible. For example, given an image classification task, we may wish to describe our knowledge about the content of the images. Consider an example where images contain pictures of dogs and fish and that we wish to express the knowledge that dogs must have four legs and fish must be in water. It is not clear how these conceptual constraints would then be mapped to a pixel level for actual specification. Moreover, it is entirely plausible to have images of dogs that do not include their legs, or images of fish where the fish is out of the water. The logical statement itself is brittle in these instances and would serve to hinder the training, rather than to help it. This example serves to present the inherent difficulty that is present when actually expressing robust domain knowledge in the form of logical formulae.

The second major limitation of this approach deals with the DNF requirement on the input formula. We require that knowledge be expressed in this form such that the “or” condition is controlled by the latent Categorical variable of MultiplexNet. It is well known that certain formulae have worst case representations in DNF that are exponential in the number of variables. This is clearly undesirable in that the network would have to learn to choose among the exponentially many terms.

One of the overarching motivations for this work is to constrain networks for safety critical domains. While constrained operation might be desired on many accounts, there may exist edge cases where an autonomously acting agent should act in an undesirable manner to avoid an even more undesirable outcome (a thought experiment of this spirit is the well known Trolley Problem (Hammond and Belle 2021)). By guaranteeing that the operating conditions of a system be restricted to some range, our approach does encounter vulnerability with respect to edge, and unforeseen, cases. However, to counter this point, we argue it is still necessary for experts to define the boundaries over the operation domain of a system in order to explicitly test and design for known worst case scenario settings.

Conclusions and Future Work

This work studied how logical knowledge in an expressive language could be used to constrain the output of a network. It provides a new and general way to encode domain knowledge as logical constraints directly in the output layer of a network. Compared to alternative approaches, we go beyond propositional logic by allowing for arithmetic operators in our constraints. We are able to guarantee that the network output is 100% compliant with the domain constraints, something that the alternative approaches, which append a “constraint loss,” are unable to match. Thus our approach is especially relevant for safety critical settings in which the network must guarantee to operate within predefined constraints. In a series of experiments we demonstrated that our approach leads to better results in terms of data efficiency (the amount of training data that is required for good performance), reducing the data burden that is placed on the training process. In the future, we are excited about exploring the prospects for using this framework on downstream tasks, such as robustness to adversarial attacks.

References

• Allen, C.; Balažević, I.; and Hospedales, T. 2020. A Probabilistic Framework for Discriminative and Neuro-Symbolic Semi-Supervised Learning. arXiv preprint arXiv:2006.05896.
• Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
• Bach, S. H.; Broecheler, M.; Huang, B.; and Getoor, L. 2017. Hinge-Loss Markov Random Fields and Probabilistic Soft Logic. J. Mach. Learn. Res., 18(1): 3846–3912.
• Barrett, C.; Sebastiani, R.; Seshia, S. A.; Tinelli, C.; Biere, A.; Heule, M.; van Maaren, H.; and Walsh, T. 2009. Handbook of satisfiability. Satisfiability modulo theories, 185: 825–885.
• Belle, V.; Passerini, A.; and Van den Broeck, G. 2015. Probabilistic inference in hybrid domains by weighted model integration. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), volume 2015, 2770–2776.
• Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
• Bunel, R.; Turkaslan, I.; Torr, P. H. S.; Kohli, P.; and Mudigonda, P. K. 2018. A Unified View of Piecewise Linear Neural Network Verification. Advances in Neural Information Processing Systems, 4795–4804.
• Chavira, M.; and Darwiche, A. 2008. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6-7): 772–799.
• Darwiche, A. 2011. SDD: A new canonical representation of propositional knowledge bases. In Twenty-Second International Joint Conference on Artificial Intelligence.
• Darwiche, A.; and Marquis, P. 2002. A knowledge compilation map. Journal of Artificial Intelligence Research, 17: 229–264.
• De Moura, L.; and Bjørner, N. 2008. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 337–340. Springer.
• Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
• Dugas, C.; Bengio, Y.; Bélisle, F.; Nadeau, C.; and Garcia, R. 2001. Incorporating second-order functional knowledge for better option pricing. Advances in Neural Information Processing Systems, 472–478.
• Feng, Y.; Martins, R.; Wang, Y.; Dillig, I.; and Reps, T. W. 2017. Component-Based Synthesis for Complex APIs. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, 599–612.
• Fischer, M.; Balunovic, M.; Drachsler-Cohen, D.; Gehr, T.; Zhang, C.; and Vechev, M. 2019. DL2: Training and Querying Neural Networks with Logic. In Proceedings of the 36th International Conference on Machine Learning, 1931–1941.
• Fu, Z.; and Su, Z. 2016. XSat: A Fast Floating-Point Satisfiability Solver. In Proceedings of the 28th International Conference on Computer Aided Verification, Part II, 187–209. Springer. ISBN 978-3-319-41539-0.
• Ganchev, K.; Graça, J.; Gillenwater, J.; and Taskar, B. 2010. Posterior regularization for structured latent variable models. The Journal of Machine Learning Research, 11: 2001–2049.
• Goodfellow, I.; Bengio, Y.; Courville, A.; and Bengio, Y. 2016. Deep learning, volume 1. MIT Press, Cambridge.
• Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
• Graça, J. V.; Ganchev, K.; and Taskar, B. 2007. Expectation maximization and posterior constraints.
• Hammond, L.; and Belle, V. 2021. Learning tractable probabilistic models for moral responsibility and blame. Data Mining and Knowledge Discovery, 35(2): 621–659.
• Hu, Z.; Ma, X.; Liu, Z.; Hovy, E.; and Xing, E. 2016. Harnessing Deep Neural Networks with Logic Rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2410–2420.
• Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
• Jha, S.; Gulwani, S.; Seshia, S. A.; and Tiwari, A. 2010. Oracle-Guided Component-Based Program Synthesis. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, 215–224.
• Katz, G.; Barrett, C.; Dill, D. L.; Julian, K.; and Kochenderfer, M. J. 2017. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, 97–117. Springer.
• Kingma, D. P.; Mohamed, S.; Rezende, D. J.; and Welling, M. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 3581–3589.
• Kingma, D. P.; and Welling, M. 2014. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR).
• Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
• LeCun, Y.; Cortes, C.; and Burges, C. 2010. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2.
• Maddison, C. J.; Mnih, A.; and Teh, Y. W. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In 5th International Conference on Learning Representations (ICLR).
• Manhaeve, R.; Dumancic, S.; Kimmig, A.; Demeester, T.; and De Raedt, L. 2018. DeepProbLog: Neural probabilistic logic programming. Advances in Neural Information Processing Systems, 31: 3749–3759.
• Mnih, A.; and Gregor, K. 2014. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 1791–1799. PMLR.
  • Nair and Hinton (2010) Nair, V.; and Hinton, G. E. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, 807–814.
  • Osera (2019) Osera, P.-M. 2019. Constraint-Based Type-Directed Program Synthesis. In Proceedings of the 4th ACM SIGPLAN International Workshop on Type-Driven Development, 64–76.
  • Papamakarios et al. (2019) Papamakarios, G.; Nalisnick, E.; Rezende, D. J.; Mohamed, S.; and Lakshminarayanan, B. 2019. Normalizing Flows for Probabilistic Modeling and Inference. arXiv preprint arXiv:1912.02762.
  • Ross and Doshi-Velez (2018) Ross, A.; and Doshi-Velez, F. 2018. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  • Ross, Hughes, and Doshi-Velez (2017) Ross, A. S.; Hughes, M. C.; and Doshi-Velez, F. 2017. Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI, 2662–2670.
  • Solar-Lezama (2009) Solar-Lezama, A. 2009. The Sketching Approach to Program Synthesis. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems, 4–13.
  • Stewart and Ermon (2017) Stewart, R.; and Ermon, S. 2017. Label-free supervision of neural networks with physics and domain knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
  • Takeishi and Kawahara (2020) Takeishi, N.; and Kawahara, Y. 2020. Knowledge-Based Regularization in Generative Modeling. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI, 2390–2396.
  • Williams (1992) Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4): 229–256.
  • Xu et al. (2018) Xu, J.; Zhang, Z.; Friedman, T.; Liang, Y.; and Van den Broeck, G. 2018. A Semantic Loss Function for Deep Learning with Symbolic Knowledge. In Proceedings of the 35th International Conference on Machine Learning, 5502–5511.
  • Zagoruyko and Komodakis (2016) Zagoruyko, S.; and Komodakis, N. 2016. Wide Residual Networks. In British Machine Vision Conference.

Appendix A: Additional Experimental Details

All code and data for repeating the experiments can be found in the repository at: https://github.com/NickHoernle/semantic_loss. The image experiments (on MNIST and CIFAR100) were run on Nvidia GeForce RTX 2080 Ti devices, and the synthetic data experiment was run on a 2015 MacBook Pro with a 2.2 GHz Quad-Core Intel Core i7 processor. The MNIST data set is available under the Creative Commons Attribution-Share Alike 3.0 license, and CIFAR100 is available under the Creative Commons Attribution-Share Alike 4.0 license. The DL2 framework (available under an MIT License), used in the baselines, is available from https://github.com/eth-sri/dl2.

In all experiments, the data were split into a train, validation and test set, where the test set was held constant across the experimental conditions (e.g., in the CIFAR100 experiment, the same test set was used to compare MultiplexNet against the vanilla models and the DL2 model). Where model selection was performed (early stopping on CIFAR100, and selection of the best runs in the MNIST experiment), we used the validation set to choose the best runs and/or models. In these cases the validation set was extracted from the training data set prior to training, with $10\%$ of the data used for validation. The standard test sets given by MNIST and CIFAR100 were used for those experiments, and an additional test set was generated for the synthetic experiment.

Synthetic Data

Deriving the Loss

We first present the full derivation of the loss function that was used for this experiment. We used a variational autoencoder (VAE) with a standard isotropic Gaussian prior. The standard VAE loss is presented in Eq. 12. In the below formulation, $x_{i}$ is a datapoint, $L$ is a minibatch size, and $z_{i}$ is a sample from the approximate posterior $q$.

\mathcal{L}(\theta)=-\sum\limits_{i=1}^{L}\left[\log p(x_{i}\mid z_{i})+\log p(z_{i})-\log q(z_{i}\mid x_{i})\right] (12)

We use an isotropic Gaussian likelihood for $\log p(x_{i}\mid z_{i})$ and an isotropic Gaussian for the posterior. Standard derivations (see Kingma and Welling (2014) for more details) allow the loss to be expressed as in Eq. 13. In this equation, $f_{\theta}$ is the decoder model and it predicts the mean of the likelihood term. A tunable parameter $\sigma$ controls the precision of the reconstructions – this parameter was held constant for all experimental conditions. The posterior distribution is a Gaussian with parameters $\sigma_{q}^{2}$ and $\mu_{q}$ that are output from the encoding network $q_{\theta}(x_{i})$.

\mathcal{L}(\theta)=-\sum\limits_{i=1}^{L}\left[\log\mathcal{N}(x_{i};f_{\theta}(z_{i}),\sigma^{2})+0.5\left(1+\log(\sigma_{q}^{2})-\mu_{q}^{2}-\sigma_{q}^{2}\right)\right] (13)

Finally, we present how the MultiplexNet loss uses $\mathcal{L}(\theta)$ in the transformation of the output layer of the network. MultiplexNet takes as input the unconstrained network output $f_{\theta}(z_{i})$ and it outputs the transformed (constrained) terms $h_{k}$ (for the $K$ terms in the DNF constraint formulation) and the probability $\pi_{k}$ of each term. Let $\mathcal{L}_{k}(\theta)$ denote the same loss $\mathcal{L}(\theta)$ from Eq. 13, but with the raw output of the network $f_{\theta}(z_{i})$ constrained by the constraint transformation $h_{k}$ (i.e., the likelihood term becomes $\log\mathcal{N}(x_{i};h_{k}(f_{\theta}(z_{i})),\sigma^{2})$). The final loss is then presented in Eq. 14.

\mathcal{L}_{\text{MPlexNet}}(\theta)=\sum\limits_{k=1}^{K}\pi_{k}\left(\mathcal{L}_{k}(\theta)+\log\pi_{k}\right) (14)
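As a concrete illustration, the following is a minimal PyTorch sketch of Eqs. 13 and 14. The helper names (vae_loss, multiplexnet_loss), the list of constraint transforms and the use of a logit vector for the Categorical selection variable are our assumptions and are not taken from the released code; standard negative-ELBO sign conventions are assumed.

import torch

def vae_loss(x, x_mean, mu_q, logvar_q, sigma=0.1):
    # Per-sample loss of Eq. 13: Gaussian reconstruction term plus the
    # closed-form KL between the Gaussian posterior and the standard prior.
    log_lik = torch.distributions.Normal(x_mean, sigma).log_prob(x).sum(-1)
    kl = -0.5 * (1 + logvar_q - mu_q ** 2 - logvar_q.exp()).sum(-1)
    return -log_lik + kl                                  # shape: [batch]

def multiplexnet_loss(x, raw_out, mu_q, logvar_q, pi_logits, transforms):
    # Eq. 14: evaluate the loss under every constraint transform h_k and
    # marginalise over the Categorical selection variable with weights pi_k.
    log_pi = pi_logits.log_softmax(-1)                    # [batch, K]
    losses = torch.stack(
        [vae_loss(x, h_k(raw_out), mu_q, logvar_q) for h_k in transforms],
        dim=-1)                                           # [batch, K]
    return (log_pi.exp() * (losses + log_pi)).sum(-1).mean()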

Samples from the Posterior

Fig. 4 shows samples from the posterior of the two models, i.e., the VAEs’ attempts to reconstruct the input data. It can clearly be seen that, while MultiplexNet strictly adheres to the constraints, the baseline VAE approach fails to capture the constraint boundaries.

Figure 4: (a) Samples from the vanilla VAE posterior for different sizes of training data sets (b) Samples from the MultiplexNet VAE posterior for different sizes of training data sets.

Samples from the Prior

Samples from the prior (Fig. 5) show how well a generative model has learnt the data manifold that it attempts to represent. We show these to demonstrate that, in this case, the vanilla VAE fails to capture many of the complexities in the data distribution. To sample from the MultiplexNet prior, we randomly sample the latent Categorical variable; hence the two vertical modes (which contain no data in reality) also receive samples here. This could easily be remedied by introducing a trainable prior parameter over the Categorical variable as well – an easy extension that we do not implement in this work.

Figure 5: (a) Samples from the vanilla VAE prior for different sizes of training data sets (b) Samples from the MultiplexNet VAE prior for different sizes of training data sets.

Network Architecture

The default network used in these experiments was a feed-forward network with a single hidden layer for both the decoder and the encoder models. The dimensionality of the latent random variable was 15 and the hidden layer contained 50 units. ReLU activations were used unless otherwise stated.
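For reference, a sketch of this default architecture in PyTorch is given below; the input dimension x_dim is our assumption, as it is not specified above.

import torch.nn as nn

x_dim, h_dim, z_dim = 2, 50, 15                        # x_dim assumed for illustration
encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                        nn.Linear(h_dim, 2 * z_dim))   # outputs mu_q and log sigma_q^2
decoder = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                        nn.Linear(h_dim, x_dim))       # outputs the unconstrained f_theta(z)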

MNIST - Label-free Structured Learning

Deriving the Loss

We follow the specification from Kingma et al. (2014), where the likelihood of a single image, $x$, conditioned on a cluster assignment label $y$, is shown in Eq. 15. Here, $z$ is a latent parameter, again assumed to follow an isotropic Gaussian distribution.

\log p(x,y)\geq\mathbb{E}_{q(z\mid x)}\big[\log p_{\theta}(x\mid y,z)+\log p(z)-\log q(z\mid x)\big] (15)

We refer to the right hand side of Eq. 15 as $-V(x,y)$. Eq. 15 assumes knowledge of the label $y$, but this is unknown for our domain. However, we can implement the knowledge from Eq. 10 (in the main text) that specifies all $100$ possibilities for the image inference task. Below we assume the data is of the form $image_{i}+image_{j}=(image_{k_{1}},image_{k_{2}})$, where $label_{i}\in[0,\dots,9]$, $label_{j}\in[0,\dots,9]$, $label_{k_{1}}\in[0,1]$ and $label_{k_{2}}\in[0,\dots,9]$. Finally, we let $\pi_{h}$ refer to the MultiplexNet Categorical selection variable that chooses which of the $100$ possible terms for $(i,j,k_{1},k_{2})$ is present. Following the MultiplexNet framework, the loss is then presented in Eq. 16:

\mathcal{L}(\theta)=\sum\limits_{i,j,k}\pi_{h}\big[V(x_{1},y_{1}=i)+V(x_{2},y_{2}=j)+V(x_{3},y_{3}=k_{1})+V(x_{4},y_{4}=k_{2})+\log\pi_{h}\big] (16)
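A minimal sketch of how Eq. 16 can be computed is shown below, assuming a helper V(x, y) that returns the per-image bound of Eq. 15 for a candidate label y, and a tensor pi_logits holding the unnormalised logits of the Categorical selection variable (both names are ours).

import torch

def structured_mnist_loss(V, x1, x2, x3, x4, pi_logits):
    # Enumerate the 100 admissible label assignments (i, j); the two output
    # digits are determined by k1, k2 = divmod(i + j, 10). Marginalise over
    # the Categorical selection variable as in Eq. 16.
    terms = []
    for i in range(10):
        for j in range(10):
            k1, k2 = divmod(i + j, 10)
            terms.append(V(x1, i) + V(x2, j) + V(x3, k1) + V(x4, k2))
    terms = torch.stack(terms, dim=-1)            # [batch, 100]
    log_pi = pi_logits.log_softmax(-1)            # [batch, 100]
    return (log_pi.exp() * (terms + log_pi)).sum(-1).mean()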

Samples from the Vanilla VAE prior from Kingma et al. (2014)

We repeat the MNIST experiment with a vanilla VAE model (“Model 2” from Kingma et al. (2014)). Here we simply show that the model can capture the label clustering of the data but that it cannot, unsurprisingly, infer the class labels correctly from the data without considering the fact that the data set has been structured:

Figure 6: Reconstructed/decoded samples from the prior, $z$, of “Model 2” from Kingma et al. (2014). The clustering of the data is clear but the model is unable to infer the correct class labels without considering the structured data set and domain knowledge.

Network Architecture

The default network used in these experiments was a feed-forward network with two hidden layers for both the decoder and the encoder models. The first hidden layer contained 250 units and the second 100 units. The dimensionality of the latent random variable was 50. ReLU activations were used unless otherwise stated.

Hierarchical Domain Knowledge on CIFAR100

Deriving the Loss

Following the encoding in (Fischer et al. 2019), we consider constraints which specify that groups of classes should together be very likely or very unlikely. For example, suppose that the SC label is trees and the class label is maple. Our domain knowledge should state that the trees group must be very likely even if there is uncertainty in the specific label maple. Fischer et al. (2019) use the following logic to encode this belief (for the $20$ super classes that are present in CIFAR100):

(p_{people}<\epsilon\lor p_{people}>1-\epsilon)\land\dots\land(p_{trees}<\epsilon\lor p_{trees}>1-\epsilon) (17)

This encoding is logically equivalent to the DNF form presented in Eq. 18, which we use so that the constraint is compatible with MultiplexNet.

(p_{people}>1-\epsilon\land p_{trees}<\epsilon\land\dots)\lor(p_{people}<\epsilon\land p_{trees}>1-\epsilon\land\dots)\lor\dots (18)

Since the classes, and thus the super classes, lie on a simplex, the above can be simplified: $(p_{people}>1-\epsilon)$ necessarily implies that $(p_{trees}<\epsilon\land\dots)$ holds too. Thus the logic reduces to:

(p_{people}>1-\epsilon)\lor(p_{trees}>1-\epsilon)\lor(p_{fish}>1-\epsilon)\lor\dots (19)

Again, as the probability values here lie on a simplex, we can represent a single constraint as in Eq. 20. Here, $Z$ is the normalizing constant that ensures the final output values are a valid probability distribution over the class labels (computed with a softmax layer in practice): $Z=e^{baby}+e^{boy}+\dots+e^{cattle}+e^{tractor}$, summing over all $100$ class labels in CIFAR100.

(p_{people}>1-\epsilon)\implies\frac{e^{baby}}{Z}+\frac{e^{boy}}{Z}+\frac{e^{girl}}{Z}+\frac{e^{man}}{Z}+\frac{e^{woman}}{Z}>1-\epsilon (20)

Finally, we can simplify the right hand side of Eq. 20 to obtain the following specification (for the $people$ super class; the other SCs all follow via symmetry):

e^{baby}+e^{boy}+e^{girl}+e^{man}+e^{woman}>\frac{1-\epsilon}{\epsilon}\left[e^{beaver}+e^{couch}+\dots+e^{streetcar}\right] (21)

Note that the right hand side of Eq. 21 contains the classes for all the other super classes, but not those of $people$. It thus contains $95$ labels in this example.

Studying each of the terms $j\in[baby,boy,girl,man,woman]$ separately, and noting that $e^{y}$ is strictly positive, we obtain Eq. 22. We use $y_{j}$ to denote a class label in the target super class (in this case $people$) and $y_{i}$ to refer to all other labels in all other super classes (SC).

e^{y_{j}}>\frac{1-\epsilon}{\epsilon}\left[\sum\limits_{i\notin SC_{people}}e^{y_{i}}\right] (22)

As we are interested in constraining the unnormalized output of the network, $y_{j}$, in MultiplexNet, we can take the logarithm of both sides of Eq. 22 to obtain the final objective for one class label, shown in Eq. 23. Together with Eq. 19, this yields the final logical constraint given in Eq. 11 of the main text.

y_{j}>\log\left(\frac{1-\epsilon}{\epsilon}\right)+\log\sum\limits_{i\notin SC_{people}}e^{y_{i}} (23)
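The sketch below illustrates how the constraint of Eq. 23 can be enforced on the unnormalised logits for one super class. The helper name, the use of a softplus offset above the threshold and the illustrative class indices are our assumptions, not the released implementation.

import torch
import torch.nn.functional as F

def constrain_superclass(logits, sc_mask, eps=0.05):
    # Transform unconstrained logits so that every class j inside the target
    # super class satisfies Eq. 23:
    #   y_j > log((1 - eps) / eps) + logsumexp_{i not in SC} y_i
    threshold = torch.log(torch.tensor((1.0 - eps) / eps)) + \
        torch.logsumexp(logits[..., ~sc_mask], dim=-1, keepdim=True)
    constrained = logits.clone()
    constrained[..., sc_mask] = threshold + F.softplus(logits[..., sc_mask])
    return constrained

# illustrative usage: 100 CIFAR100 logits, with the "people" super class
# assumed (for illustration only) to occupy the first five class indices
logits = torch.randn(100)
sc_mask = torch.zeros(100, dtype=torch.bool)
sc_mask[:5] = True
y_people = constrain_superclass(logits, sc_mask)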

This constraint can then be directly encoded into the MultiplexNet loss as usual, where $y_{k}$ refers to the constrained output of the network for each of the $20$ super classes, $\pi_{k}$ is the MultiplexNet probability of selecting logic term $k$, and $CE$ is the standard cross-entropy loss used in image classification:

\mathcal{L}(\theta)=\sum\limits_{k=1}^{20}\pi_{k}\left[CE(y_{k})+\log\pi_{k}\right] (24)

Network Architecture

We use a Wide ResNet 28-10 (Zagoruyko and Komodakis 2016) in all of the experimental conditions for this CIFAR100 experiment. We build on top of the PyTorch implementation of the Wide ResNet that is available here: https://github.com/xternalz/WideResNet-pytorch. This implementation is available under an MIT License.

Appendix B: MultiplexNet Architecture Overview

MultiplexNet accepts as input a data set consisting of samples from some target distribution, $p^{*}(x)$, and some constraints, $\Phi$, that are known about the data set. We assume that the constraints are correct, in that Eq. 1 (main text) holds for all $x$. We aim to model the unknown density, $p^{*}$, by maximising the likelihood of a parameterised model, $p_{\theta}(x)$, on the given data set. Moreover, our goal is to incorporate the domain constraints, $\Phi$, into the training of this model.

We first assume that the domain constraints are provided in DNF. This is a reasonable assumption as any logical formula can be compiled to DNF, although there might be an exponential number of terms in worst case scenarios (as discussed in Section: Limitations). For each term $\phi_{k}$ in the DNF representation of $\Phi=\phi_{1}\lor\phi_{2}\lor\dots\lor\phi_{K}$, we then introduce a transformation, $h_{k}$, that ensures any real-valued input is transformed to satisfy that term. Given a Softplus transformation $g$, we can suitably restrict the domain of any real-valued variable such that the output satisfies $\phi_{k}$. For example, consider the constraint $\phi_{1}:x>y+2\land x<5$. The transformation $h_{1}(x^{\prime})=-g(-g(x^{\prime})+\alpha)+\beta$ will constrain the real-valued variable $x^{\prime}$ such that $\phi_{1}$ is satisfied, where $\beta=5$ and $\alpha=\log(e^{5-(y+2)}-1)$; in this example, $y$ does not need to be constrained. Any combination of inequalities can be suitably restricted in this way. Equality constraints can be handled by simply setting the output to the value that is specified.
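A minimal numeric sketch of this example transformation in PyTorch follows; the helper name h1 is ours, and torch.nn.functional.softplus plays the role of $g$.

import torch
import torch.nn.functional as F

def h1(x_raw, y, upper=5.0):
    # Map an unconstrained value x_raw onto the open interval (y + 2, upper),
    # i.e. satisfy phi_1: x > y + 2 and x < upper, using the nested-Softplus
    # construction with beta = upper and alpha = log(exp(upper - (y + 2)) - 1).
    alpha = torch.log(torch.exp(torch.tensor(upper - (y + 2.0))) - 1.0)
    return upper - F.softplus(alpha - F.softplus(x_raw))

x_raw = torch.linspace(-10.0, 10.0, 5)
print(h1(x_raw, y=1.0))   # every value lies strictly inside (3, 5)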

MultiplexNet therefore accepts the unconstrained output of a network, $x^{\prime}\in\mathbb{R}$, and introduces $K$ constraint terms $h_{k}$ that each guarantee the constrained output $x_{k}=h_{k}(x^{\prime})$ will satisfy a term, $\phi_{k}$, in the DNF representation of the constraints. The output of the network is then $K$ transformed versions of $x^{\prime}$, where each output $x_{k}$ is guaranteed to satisfy $\Phi$. The Categorical selection variable $k\sim q(k\mid x)$ can be marginalised out, leading to the following objective:

\mathcal{L}(\theta)=\sum\limits_{k=1}^{K}\pi_{k}\left[\mathcal{L^{\prime}}(x_{k})+\log\pi_{k}\right] (25)

In Eq. 25, $\mathcal{L^{\prime}}$ refers to the observation likelihood that would be used in the absence of any constraint. $x_{k}$ is the $k^{th}$ constrained term of the unconstrained output of the network: $x_{k}=h_{k}(x^{\prime})$. This architecture is represented pictorially in Fig. 7.
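As an illustration of this architecture, the module below sketches how the output layer of Fig. 7 could be wrapped around an existing network in PyTorch; the module name and the use of a linear head for the selection logits are our assumptions.

import torch
import torch.nn as nn

class MultiplexHead(nn.Module):
    # Wraps an unconstrained output x' with the K constraint transforms
    # h_1, ..., h_K and predicts the logits of the Categorical selection
    # variable k, to be combined with the base loss as in Eq. 25.
    def __init__(self, in_dim, transforms):
        super().__init__()
        self.transforms = transforms                     # list of callables, one per DNF term
        self.pi_logits = nn.Linear(in_dim, len(transforms))

    def forward(self, x_raw):
        xs = torch.stack([h_k(x_raw) for h_k in self.transforms], dim=-2)  # K constrained outputs
        log_pi = self.pi_logits(x_raw).log_softmax(-1)                     # log pi_k
        return xs, log_pi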

Figure 7: Architecture of the MultiplexNet. We show how to append this framework to an existing learning scheme. The unconstrained output of the network, $x^{\prime}$, along with the constraint transformation terms $h_{1},\dots,h_{K}$, is used to create $K$ constrained output terms $x_{1},\dots,x_{K}$. The latent Categorical variable $k$ is used to select which term is active for a given input. In this paper, we marginalise the Categorical variable, leading to the specified loss function.