
Visually Grounded Commonsense Knowledge Acquisition

Yuan Yao, Tianyu Yu, Ao Zhang, Mengdi Li, Ruobing Xie, Cornelius Weber,
Zhiyuan Liu, Hai-Tao Zheng, Stefan Wermter, Tat-Seng Chua, Maosong Sun
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Shenzhen International Graduate School, Tsinghua University
Peng Cheng Laboratory
School of Computing, National University of Singapore, Singapore
Department of Informatics, University of Hamburg, Hamburg, Germany
WeChat AI, Tencent
yaoyuanthu@163.com

Abstract

Large-scale commonsense knowledge bases empower a broad range of AI applications, where the automatic extraction of commonsense knowledge (CKE) is a fundamental and challenging problem. CKE from text is known for suffering from the inherent sparsity and reporting bias of commonsense in text. Visual perception, on the other hand, contains rich commonsense knowledge about real-world entities, e.g., (person, can_hold, bottle), which can serve as promising sources for acquiring grounded commonsense knowledge. In this work, we present CLEVER, which formulates CKE as a distantLy supErVised multi-instancE leaRning problem, where models learn to summarize commonsense relations from a bag of images about an entity pair without any human annotation on image instances. To address the problem, CLEVER leverages vision-language pre-training models for deep understanding of each image in the bag, and selects informative instances from the bag to summarize commonsense entity relations via a novel contrastive attention mechanism. Comprehensive experimental results in held-out and human evaluation show that CLEVER can extract commonsense knowledge of promising quality, outperforming pre-trained language model-based methods by 3.9 AUC and 6.4 mAUC points. The predicted commonsense scores show a strong correlation with human judgment with a 0.78 Spearman coefficient. Moreover, the extracted commonsense can also be grounded into images with reasonable interpretability. The data and codes can be obtained at https://github.com/thunlp/CLEVER.

Introduction

Providing machines with commonsense knowledge is a longstanding goal of artificial intelligence (Davis, Shrobe, and Szolovits 1993). Tremendous efforts have been devoted to building commonsense knowledge bases (KBs) (Liu and Singh 2004; Speer, Chin, and Havasi 2017; Sap et al. 2019), which have facilitated various important applications in both computer vision (Wu et al. 2017; Narasimhan, Lazebnik, and Schwing 2018; Gu et al. 2019; Gardères et al. 2020) and natural language processing (Zhou et al. 2018; Wu et al. 2020; Lv et al. 2020). However, most commonsense KBs are manually curated, which greatly limits their coverage and scale.

Figure 1: Visually grounded commonsense knowledge acquisition as a distantly supervised multi-instance learning problem. Given an entity pair and associated images, our model first understands entity interactions in each image, and then selects informative ones (solid line) to summarize the commonsense relations.
This paper studies the fundamental and challenging problem of commonsense knowledge extraction (CKE), which aims to extract plausible commonsense interactions between entities, e.g., (person, can_hold, bottle). Previous works have attempted to extract commonsense knowledge from plain text (Li et al. 2016) or pre-trained language models (PLMs) (Petroni et al. 2019; Bosselut et al. 2019). However, there is a growing consensus that obvious commonsense is rarely reported in text (Gordon and Van Durme 2013; Paik et al. 2021), and commonsense in PLMs suffers from low consistency and significant reporting bias (Shwartz and Choi 2020; Zhou et al. 2020; Elazar et al. 2021). There is also widespread doubt whether learning purely from surface text forms can lead to real understanding of commonsense meanings (Bender and Koller 2020).
Visual perceptions (e.g., images), on the other hand, contain rich commonsense knowledge about real-world entities that can be consistently grounded. According to our statistics, many of the triplets in visual relation learning datasets cannot be found in ConceptNet, indicating a promising direction for CKE from image data. However, most existing image-based CKE methods are either confined to restricted interaction types (e.g., spatial or partonomy relations) (Chen, Shrivastava, and Gupta 2013; Collell, Van Gool, and Moens 2018; Xu, Lin, and Zhu 2018) or require extensive human annotation (Vedantam et al. 2015).
In this work, we present CLEVER, which formulates CKE as a distantLy supErVised multi-instancE leaRning problem (Dietterich, Lathrop, and Lozano-Pérez 1997), where models learn to summarize general commonsense relations of an entity pair from a bag of images, as shown in Figure 1. The commonsense relation labels are automatically created by aligning relational facts in existing KBs to image bags to provide distantly supervised learning signals. In this way, commonsense learning can easily scale up in the general domain without costly manual image annotation.
To extract commonsense facts about a pair of query entities, models need to first understand their semantic interactions in each image of the bag, and then select informative ones (i.e., images that express interactions of interest between query entities) to synthesize the commonsense relations. However, our pilot experiments show that existing multi-instance learning methods cannot serve the task well, due to the complexity of real-world commonsense relations. Therefore, we propose a dedicated framework that models image-level entity interactions via vision-language pre-training (VLP) models, and selects meaningful images to summarize bag-level commonsense relations via a novel contrastive attention mechanism.
Comprehensive experimental results in held-out and human evaluation show that CLEVER can extract commonsense knowledge of promising quality, outperforming PLM-based approaches by 3.9 AUC and 6.4 mAUC points. The predicted commonsense scores show a strong correlation with human judgment, achieving a 0.78 Spearman's rank correlation coefficient. Moreover, the extracted commonsense can also be grounded into images with reasonable interpretability. Compared with PLM-based methods that produce commonsense purely based on text surface forms in a black-box fashion, the interpretability of CLEVER can be leveraged to provide supporting evidence for commonsense knowledge in KBs, which can be useful for downstream applications.
Our contributions are summarized as fourfold: (1) We propose to formulate CKE as a distantly supervised multi-instance learning problem, which can easily scale up for commonsense relations in a general domain without manual image annotation. (2) We conduct extensive experiments on existing and adapted CKE methods from different data sources, showing their effectiveness and limitations. (3) We present a dedicated CKE framework that integrates VLP models with a novel contrastive attention mechanism to deal with complex commonsense relation learning. (4) We conduct comprehensive experiments which demonstrate the effectiveness of the proposed framework.

Related Work

Knowledge Bases. Large-scale knowledge bases (KBs) that store abundant structured human knowledge facilitate various AI applications. Many efforts have been devoted to building KBs of different knowledge types, including linguistic knowledge (Miller 1994), world knowledge (Bollacker et al. 2008) and commonsense knowledge (Liu and Singh 2004; Speer, Chin, and Havasi 2017; Sap et al. 2019). However, existing KBs are mainly constructed with human annotation, which greatly limits their coverage and scale.
Commonsense Knowledge Acquisition. To acquire commonsense knowledge, some works attempt to learn from internal structures of existing triplets (Speer, Havasi, and Lieberman 2008; Malaviya et al. 2020). However, these models usually suffer from the data sparsity of existing KBs. A more promising direction is to extract the commonsense contained in external data, i.e., commonsense knowledge extraction (CKE). Previous efforts in CKE can be divided into three categories according to the knowledge sources, including text-based, PLM-based and image-based models.
(1) Text-based methods. Early works attempt to extract commonsense from text (Angeli and Manning 2013; Li et al. 2016). However, CKE from text endures inherent reporting bias (Gordon and Van Durme 2013), i.e., people rarely state obvious commonsense facts in text, making text not an ideal commonsense knowledge source. (2) PLM-based methods. Since PLMs learn certain commonsense knowledge during pre-training, they can be probed or fine-tuned to generate commonsense knowledge (Petroni et al. 2019; Davison, Feldman, and Rush 2019; Bosselut et al. 2019). However, it has been found that the commonsense in PLMs suffers from both low consistency, where small changes in the query templates can lead to substantially different predictions (Zhou et al. 2020; Elazar et al. 2021), and significant bias, where the commonsense predictions can greatly differ from human judgments (Shwartz and Choi 2020; Paik et al. 2021). (3) Image-based methods. Some works have explored CKE from images that contain rich grounded commonsense knowledge. Chen, Shrivastava, and Gupta (2013) learn partonomy (i.e., part_of) and taxonomy (i.e., is_a) commonsense from images. Yatskar, Ordonez, and Farhadi (2016) and Xu, Lin, and Zhu (2018) extract spatial commonsense (e.g., located_near). Chao et al. (2015) learn unary affordance commonsense about entities. Vedantam et al. (2015) and Chen et al. (2022) extract more general commonsense interactions based on human annotation. Sadeghi, Kumar Divvala, and Farhadi (2015) mine commonsense based on spatial consistency of entities. Different from previous works, we extract general-type commonsense interactions between entities without human annotation or restricted assumptions about commonsense knowledge.
Scene Graph Generation. Understanding visual interactions between objects also lies in the interest of scene graph generation (Krishna et al. 2017; Lu et al. 2016; Xu et al. 2017; Tang et al. 2020; Yao et al. 2021b,c; Zhang et al. 2022). Different from CKE, which aims to summarize global commonsense relations between entities from a bag of images, the goal of scene graph generation is to identify the local relation in a specific image. Moreover, scene graph models usually require large amounts of image annotations, whereas the proposed distantly supervised CKE framework does not need annotated images.
World Knowledge Acquisition. The extraction of factual world knowledge, e.g., (Bob Dylan, composer, Blowin' in the Wind), is an important tool to supplement world knowledge bases. Most works in world knowledge acquisition focus on text as the knowledge source (Nguyen and Grishman 2015; Soares et al. 2019; Wu et al. 2019; Dong et al. 2020; Chen et al. 2021; Yao et al. 2019, 2021a; Zhang et al. 2021a), with some attempts in multimodal world knowledge acquisition (Wen et al. 2021). To alleviate human annotation, Mintz et al. (2009) propose distant supervision that aligns KBs to text to create noisy relation labels. Following works focus on dealing with the noise in distant supervision under the multi-instance learning formulation (Riedel, Yao, and McCallum 2010; Zeng et al. 2015; Liu et al. 2018). The most widely adopted method is the selective attention model (Lin et al. 2016), which selects high-quality instances in the bag based on the attention mechanism. In comparison, we aim to extract commonsense knowledge from bags of images. We find in our experiments that existing multi-instance learning models cannot serve the complex commonsense learning well, and therefore we propose a dedicated approach for the task.

Pilot Experiment and Analysis

To investigate the effectiveness and limitations of existing CKE methods, we first perform an empirical study of representative methods from different information sources, including text-based, PLM-based and image-based models.
Problem Definition. CKE aims to extract commonsense relational triplets (h, r, t), which depict plausible interactions between a head entity h and a tail entity t. For example, (person, can_hold, bottle) reflects the commonsense knowledge that a person can hold a bottle. A special NA relation is also included, indicating no relation between the entity pair.
Benchmark Construction. We construct the CKE benchmark based on Visual Genome (Krishna et al. 2017), which contains relational triplets about entities from real-world image data. Specifically, we select distinct triplets with the top 100 entity types and relation types. For automatic held-out evaluation (Mintz et al. 2009), we split the triplets into disjoint training, validation and test sets. Each entity pair is associated with Visual Genome images that contain the entities. The training/validation/test data contains 13,780/1,166/3,496 commonsense facts, 6,443/678/1,964 entity pairs, and 55,911/5,224/13,722 images respectively.
Existing CKE Models. We select representative CKE models for empirical study. (1) Text-based models. We adopt RTP (Schuster et al. 2015), a widely used triplet parser, which extracts commonsense triplets from captions based on dependency trees. We extract triplets from Conceptual Captions (Sharma et al. 2018) containing 3M captions, and obtain the confidence of the global triplets according to their frequency in the caption data. (2) PLM-based models. We adopt LAMA (Petroni et al. 2019), which probes knowledge in BERT by filling a prompting template containing the query entity pair and the masked relation (e.g., "person [MASK] bottle"). Following Lin et al. (2020), we further fine-tune the model based on the same prompts using the triplets in the training set to better learn the commonsense knowledge. Following Peng et al. (2020), we also adopt a vanilla fine-tuned BERT model which predicts relations based on the entity names using the [CLS] token.
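As an illustration of this kind of prompt-based probing, the following is a minimal sketch using a masked language model; the checkpoint and prompt template are generic placeholders rather than the exact LAMA or Prompt-FT configuration used here.

```python
# A minimal sketch of prompt-based commonsense probing with a masked language
# model. The checkpoint and prompt template are generic placeholders, not the
# exact LAMA / Prompt-FT configuration used in the paper.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Query the relation slot between a pair of entities.
for pred in fill_mask("person [MASK] bottle", top_k=5):
    print(f"{pred['token_str']:>12s}  score={pred['score']:.3f}")
```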
Multi-instance Learning for Image-based CKE. Intuitively, images are raw visual perceptions of rich real-world entity interactions, which can serve as a scalable and promising information source for CKE. However, most existing image-based CKE methods are either restricted in relation types, or require manual image annotation.
For general and scalable commonsense KB construction, it is desirable to extract general-type commonsense knowledge from large-scale images without human annotation. To this end, we propose to formulate CKE as a multi-instance learning problem (Dietterich, Lathrop, and Lozano-Pérez 1997), where the commonsense relation r between entities (h, t) is summarized from a bag of images containing the entity pair. Inspired by Mintz et al. (2009), we align existing commonsense KBs to image bags to provide distantly supervised learning signals. Specifically, the image bag is labeled with the relation r between (h, t) in the KB, assuming that at least a subset of images in the bag expresses the triplet (h, r, t), while there might be some images in the bag that do not express the triplet. To extract the commonsense triplet, models need to first understand the entity interactions in each image of the bag, and then select the meaningful ones to synthesize the commonsense relations.
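To make the alignment concrete, here is a minimal sketch of how KB triplets could be aligned to image bags to create distant labels; the data structures and example entries are illustrative assumptions, not the actual data format of the benchmark.

```python
# A minimal sketch of distant supervision for image-based CKE: KB triplets are
# aligned to image bags so that every bag of images sharing an entity pair is
# labeled with the KB relation, without any image-level annotation. The data
# structures and example entries are illustrative assumptions.
from collections import defaultdict

kb_triplets = {("person", "can_hold", "bottle"), ("banana", "in", "bowl")}
# image_objects: image id -> entity types detected in that image
image_objects = {
    "img_001": {"person", "bottle", "table"},
    "img_002": {"banana", "bowl"},
    "img_003": {"person", "bottle"},
}

# Group candidate images into bags keyed by entity pair.
bags = defaultdict(list)
for img_id, entities in image_objects.items():
    for h, r, t in kb_triplets:
        if h in entities and t in entities:
            bags[(h, t)].append(img_id)

# Label each bag with the KB relation of its entity pair (NA if none applies).
relation_of = {(h, t): r for h, r, t in kb_triplets}
bag_labels = {pair: relation_of.get(pair, "NA") for pair in bags}
print(bag_labels)  # e.g. {('person', 'bottle'): 'can_hold', ('banana', 'bowl'): 'in'}
```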
We note that some works explore problems with a similar formulation in world knowledge extraction from text. To investigate the effectiveness of existing multi-instance learning methods for image-based CKE, we adapt representative approaches that select and summarize the bag of instances using average pooling (Lin et al. 2016), the at-least-one strategy (Zeng et al. 2015), or the attention mechanism (Lin et al. 2016).
Specifically, given a triplet (h, r, t), we first select a bag of images containing the query entity pair. In practice, the number of candidate images can be large, while only a small portion reflects entity interactions. To compose a proper size of image bags, inspired by Zellers et al. (2018), we select images with top spatial overlaps (i.e., intersection over union in pixels) of the query entities, which are more likely to exhibit interactions. The query entity pair in each image of the bag is encoded into feature representations $\{\mathbf{x}_i\}_{i=1}^{n}$ using an adapted Neural Motif (Zellers et al. 2018) model, a widely used CNN-based entity pair encoder.
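A minimal sketch of this spatial-overlap heuristic is given below; the box format (x1, y1, x2, y2) and helper names are assumptions for illustration.

```python
# A minimal sketch of the spatial-overlap heuristic for bag construction: rank
# candidate images by the IoU of the two query entities' bounding boxes and
# keep the top-k. The (x1, y1, x2, y2) box format and helper names are
# illustrative assumptions.
def iou(box_a, box_b):
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def build_bag(candidates, k=20):
    """candidates: list of (image_id, head_box, tail_box) tuples."""
    ranked = sorted(candidates, key=lambda c: iou(c[1], c[2]), reverse=True)
    return [image_id for image_id, _, _ in ranked[:k]]
```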
Figure 2: Results of CKE models from different information sources (text-based, PLM-based, and image-based).

To obtain the bag representation $\mathbf{s}$, (1) average pooling (AVG) (Lin et al. 2016) computes the mean of instance representations: $\mathbf{s} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i$; (2) the at-least-one strategy (ONE) (Zeng et al. 2015) selects the most likely instance: $\mathbf{s} = \mathbf{x}_{i^*}$, where $\mathbf{x}_{i^*}$ achieves the highest score on the golden relation of the given training triplet; (3) the attention mechanism (ATT) (Lin et al. 2016) computes the weighted sum of instance representations: $\mathbf{s} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i$, where the attention weight $\alpha_i$ is computed based on the golden relation query $\mathbf{q}_r$: $\alpha_i = \operatorname{softmax}_i(\mathbf{q}_r^\top \mathbf{x}_i)$. The bag representation $\mathbf{s}$ is optimized towards the golden label via a softmax classifier. During inference, since the relation label is unknown, ONE and ATT enumerate relation queries for the corresponding relation prediction scores.
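As a concrete reference, below is a minimal PyTorch sketch of these three aggregation strategies; the tensor shapes, the relation-query parameterization, and function names are illustrative assumptions rather than the exact baseline implementations.

```python
# A minimal PyTorch sketch of the three bag-aggregation strategies over
# instance representations x of shape (n, d). The relation-query
# parameterization and variable names are illustrative assumptions.
import torch
import torch.nn.functional as F

def aggregate(x, rel_query, classifier_w, strategy="ATT", gold_rel=None):
    """x: (n, d) instance reps; rel_query: (d,) query vector of one relation;
    classifier_w: (R, d) softmax classifier weights; gold_rel: golden relation
    index used by ONE during training."""
    if strategy == "AVG":
        bag = x.mean(dim=0)                               # mean of instances
    elif strategy == "ONE":
        scores = x @ classifier_w.t()                     # (n, R)
        col = scores[:, gold_rel] if gold_rel is not None else scores.max(dim=1).values
        bag = x[col.argmax()]                             # instance scoring highest on the (golden) relation
    else:  # "ATT"
        alpha = F.softmax(x @ rel_query, dim=0)           # (n,) attention weights
        bag = (alpha.unsqueeze(1) * x).sum(dim=0)         # weighted sum of instances
    return F.log_softmax(classifier_w @ bag, dim=-1)      # relation log-probabilities
```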
In addition to multi-instance learning based approaches, we also adapt visual relation detection models for image-based CKE. To simulate a scalable scenario, we randomly select a moderate number (i.e., 100) of image-level annotations for each relation from Visual Genome, and train a Neural Motif (Zellers et al. 2018) model to predict the relation between an entity pair in specific images. During inference, the relation score of a bag is obtained by max pooling over the relation scores of all images in the bag.
Results. Following previous works in knowledge acquisition (Zeng et al. 2015; Lin et al. 2016), to provide a rigorous evaluation, we draw the precision-recall curve of held-out triplet predictions, and report the area under the curve (AUC). Besides the traditional micro result, we also report mAUC, the area under the macro curve (i.e., the average curve of different relations), to evaluate the performance on long-tail relations; a sketch of the metric computation is given after the observations below. From Figure 2 we have the following observations:
(1) Text-based method (RTP) and knowledge probing from PLMs (LAMA) struggle on CKE. The reason is the inherent lack of commonsense knowledge in text, and the models are not fine-tuned for the task. Further fine-tuning PLMs (Prompt-FT and Vanilla-FT) on the task can boost the performance to achieve a strong result.
(2) Visual perceptions from images can provide rich information for commonsense knowledge acquisition. Based on a relatively proper summarization approach (AVG), multi-instance learning-based models on images achieve the best results over all existing CKE models.

(3) The multi-instance learning formulation is necessary for scalable image-based CKE in the open domain. Adapted image-level visual relation detection models (VRD) do not perform well on CKE, despite more image-level relation annotations being used (e.g., 100 image-level annotations per relation).
(4) Simple adaptation of existing multi-instance learning approaches cannot serve CKE well. The overall performance is still not satisfactory for all models. Notably, despite their competitive performance in world knowledge acquisition from text, ONE and ATT perform poorly on CKE. The reason is that compared with the relation schemes of world knowledge, commonsense relations exhibit higher complexity, where fine-grained relations with overlapping semantics (e.g., stand_on and walk_on), and hyponym-hypernym conflicts (e.g., stand_on and on) frequently occur. Compared with AVG, the golden-query-only problem of ONE and ATT hinders them from distinguishing complex commonsense relations. We refer readers to the methodology section for a more detailed discussion on the problem.
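For reference, a minimal sketch of how the held-out AUC and mAUC described above could be computed is given below; it approximates mAUC by averaging per-relation precision-recall AUCs, and the record format is an illustrative assumption.

```python
# A minimal sketch of the held-out evaluation metrics: micro AUC over all
# triplet predictions and a macro variant over per-relation curves (here
# approximated by averaging per-relation PR-AUCs). Record format is assumed
# to be (score, is_correct, relation).
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def pr_auc(scores, labels):
    precision, recall, _ = precision_recall_curve(labels, scores)
    return auc(recall, precision)

def micro_macro_auc(records):
    scores = np.array([s for s, _, _ in records])
    labels = np.array([y for _, y, _ in records])
    micro = pr_auc(scores, labels)
    per_rel = []
    for rel in {r for _, _, r in records}:
        sub = [(s, y) for s, y, r in records if r == rel]
        if any(y for _, y in sub):                      # need at least one positive
            per_rel.append(pr_auc(np.array([s for s, _ in sub]),
                                  np.array([y for _, y in sub])))
    return micro, float(np.mean(per_rel))
```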

Methodology

The pilot experiment results show that dedicated approaches need to be developed to address the unique challenges of commonsense knowledge acquisition. Essentially, due to the complexity of commonsense relations, multi-instance learning based CKE presents challenges on two levels: (1) on the image level, models need to first understand complex entity interactions in each image, (2) on the bag level, models are required to select informative instances to summarize the fine-grained commonsense relations between the entities. We present a dedicated model for CKE from images, as shown in Figure 3, which (1) achieves deep understanding of the image-level interactions between entities through powerful vision-language pre-training (VLP) models, and (2) selects meaningful images to summarize bag-level commonsense relations via a contrastive attention mechanism.
Vision-language Pre-training Models for Image-level Entity Interaction Understanding. Recently, VLP models have pushed forward the state of the art of many multimodal tasks in a foundation role (Bommasani et al. 2021), such as visual question answering and visual grounding. However, few works have explored leveraging VLP methods to model complex visual relations for entity pairs. We show that pre-trained Transformers can serve as powerful foundation models to resolve complex image-level entity interactions.
Given a query entity pair (h, t) and the associated image bag, each query entity pair instance in the bag is encoded into deep representations via detector-based VLP models. In this work, we adopt VinVL (Zhang et al. 2021b), a state-of-the-art VLP model, as the encoder. Specifically, the query and context entities in each image are first encoded by object detectors to obtain a series of visual features. The visual features and the token embeddings of entity tags are then fed into pre-trained Transformers to obtain deep multimodal hidden representations. The image-level entity pair representation $\mathbf{x}$ is obtained by the concatenation of the visual and text hidden representations of the query entities: $\mathbf{x} = [\mathbf{h}^{v}_{h}; \mathbf{h}^{v}_{t}; \mathbf{h}^{w}_{h}; \mathbf{h}^{w}_{t}]$.

Figure 3: The CLEVER framework for visually grounded commonsense knowledge acquisition. Given a bag of images about an entity pair, our model leverages VLP models for image-level entity interaction understanding, and selects informative images to summarize bag-level commonsense relations via a contrastive attention mechanism.
Despite the simplicity, the approach exhibits three important advantages in image-level entity interaction modeling: (1) The messages of entities (including query and context entities) are fused through multiple self-attention layers in Transformers to help model complex entity interactions. (2) Visual and textual information of entities are fused into deep multimodal representations. (3) Pre-trained deep vision-language representations are utilized to facilitate commonsense understanding.
Contrastive Attention Mechanism for Bag-level Commonsense Summarization. From the pilot experimental results, we observe that the complexity of commonsense relations (e.g., overlapping semantics and hyponym-hypernym conflicts) makes the relation boundaries hard to distinguish by existing multi-instance learning methods. In particular, despite its success in world knowledge acquisition from text, attention mechanism (ATT) (Lin et al. 2016) performs poorly on CKE. Here we identify that golden-query-only is the key limitation of ATT in CKE, and show that by making the attention mechanism contrastive over golden relation and other negative relations, the boundaries of complex commonsense relations can be effectively distinguished to achieve significantly better CKE performance.
We begin by discussing the golden-query-only problem in ATT. During ATT training, the bag representation $\mathbf{s}$ is static for the prediction of different relations, and is computed only based on the golden relation query. However, during inference, since the golden relation is unknown, all possible relations need to be enumerated to query the bag to predict the corresponding relation scores. The golden-query-only problem leads to a lack of effective supervision for the bag representations (and relation scores) of other negative relations, resulting in negative bag representations that are indistinguishable from the golden ones.
To address the problem, we present a novel contrastive attention mechanism that imposes contrastive supervision for golden and negative bag representations and relation scores.

Specifically, for the prediction of each relation $r$, a relation-aware bag representation $\mathbf{s}_r$ is obtained by a weighted sum of instance representations, where the attention weights are computed using the corresponding relation query $\mathbf{q}_r$ as follows:

$$\mathbf{s}_r = \sum_{i=1}^{n} \alpha_{i,r} \mathbf{x}_i, \quad \alpha_{i,r} = \frac{\exp(\mathbf{q}_r^\top \mathbf{x}_i)}{\sum_{j=1}^{n} \exp(\mathbf{q}_r^\top \mathbf{x}_j)}.$$

The bag representations are optimized via a contrastive InfoNCE loss (Oord, Li, and Vinyals 2018) as follows:

$$\mathcal{L} = -\log \frac{\exp(\mathbf{s}_{r^*}^\top \mathbf{w}_{r^*})}{\sum_{r} \exp(\mathbf{s}_{r}^\top \mathbf{w}_{r})},$$

where $\mathbf{w}_r$ is the classifier embedding of relation $r$ and $r^*$ is the golden relation. In this way, the contrastive attention imposes clear boundaries between the bag representations of golden and negative relations to deal with the summarization of complex commonsense relations. The contrastive attention can also be viewed as a kind of cross-attention (Vaswani et al. 2017) between relation queries and image instances, which can potentially benefit from multi-layer stacking. We leave this for future work.
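A minimal PyTorch sketch of this contrastive attention and InfoNCE objective is shown below; tensor shapes and parameter names are illustrative assumptions rather than the released implementation.

```python
# A minimal PyTorch sketch of the contrastive attention mechanism: every
# relation query attends over the bag to obtain its own bag representation,
# and an InfoNCE-style loss contrasts the golden relation against all others.
# Shapes and parameter names are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_attention_loss(x, rel_queries, classifier_w, gold_rel):
    """x: (n, d) instance reps; rel_queries: (R, d) relation queries;
    classifier_w: (R, d) classifier embeddings; gold_rel: golden relation index."""
    alpha = F.softmax(rel_queries @ x.t(), dim=1)     # (R, n) attention per relation
    bag = alpha @ x                                   # (R, d) relation-aware bag reps
    logits = (bag * classifier_w).sum(dim=1)          # (R,) score of each relation
    # InfoNCE: the golden relation against all candidate relations.
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([gold_rel]))
```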
Integrating Multi-source Information for CKE. Intuitively, multiple heterogeneous data sources can provide complementary information for commonsense learning. We show that this complementarity can be leveraged by a simple ensemble of models from each information source, where the aggregated triplet score is a weighted sum of the prediction score from each source.
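As a simple illustration of this weighted-sum integration (the source names, scores, and weights below are placeholders, not tuned values):

```python
# A simple illustration of multi-source integration: the aggregated triplet
# score is a weighted sum of the prediction scores from each source model.
# Source names, scores, and weights below are placeholders, not tuned values.
def ensemble_score(scores_by_source, weights):
    return sum(weights[src] * score for src, score in scores_by_source.items())

score = ensemble_score({"image": 0.87, "text": 0.32, "plm": 0.61},
                       {"image": 0.6, "text": 0.1, "plm": 0.3})
print(score)
```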

Experiments

In this section, we empirically assess the effectiveness of the proposed model. We refer readers to the appendix for implementation details.
Source | Method | AUC | F1 | P@2% | mAUC | mF1 | mP@2%
- | Random | 1.76 | 3.51 | 1.71 | 2.04 | 5.13 | 1.94
Text | RTP (Schuster et al. 2015) | 12.30 | 23.67 | 16.65 | 4.10 | 8.62 | 7.34
PLM | LAMA (Petroni et al. 2019) | 5.97 | 14.11 | 12.80 | 3.84 | 3.59 | 5.59
PLM | Vanilla-FT (Peng et al. 2020) | 37.28 | 47.06 | 44.21 | 17.75 | 30.98 | 17.34
PLM | Prompt-FT (Lin et al. 2020) | 37.99 | 44.43 | 41.69 | 20.15 | 35.37 | 19.81
Image | AVG (Lin et al. 2016) | 39.04 | 47.49 | 44.34 | 24.73 | 41.07 | 20.83
Image | ONE (Zeng et al. 2015) | 19.69 | 31.10 | 25.20 | 15.70 | 30.40 | 12.82
Image | ATT (Lin et al. 2016) | 17.13 | 28.37 | 25.07 | 2.91 | 6.09 | 2.20
Image | CLEVER (Ours) | | | | | |
All | Ensemble (Ours) | | | | | |

Table 1: Experimental results of CKE methods from different information sources. The best results are highlighted in bold, and the best single-model results are underlined.
Figure 4: Experimental results of our model with different bag sizes. We report AUC, mAUC and their average.
Experimental Settings. (1) Benchmark and baselines. We perform experiments on the CKE benchmark constructed from Visual Genome as described in the pilot experiment section, and compare to strong baselines from different information sources. We also include a random baseline that randomly predicts relations for entity pairs. For multi-source information integration, we ensemble CLEVER, RTP and Vanilla-FT. (2) Evaluation metrics. To provide a multi-dimensional evaluation, we also report the maximum F1 on the precision-recall curve, and precision@K% (P@K%) of the top triplet predictions.
Main Results. From the experimental results in Table 1, we have the following observations: (1) CLEVER consistently achieves the best results among all baseline models in both micro and macro metrics. Specifically, CLEVER improves the performance of image-based models, and significantly outperforms the previous best PLM-based results by 3.9 AUC and 6.4 mAUC points. The results show that CLEVER can extract commonsense knowledge from visual perceptions with promising quality. (2) Ensembling multi-source information further improves the performance over single-source models. This indicates that CKE can benefit from exploiting complementary information in different sources.
Human Evaluation. In addition to the held-out evaluation, we also perform a human evaluation on top predictions.
Method | AUC | F1 | mAUC | mF1
CLEVER | | | |
VLP → CNN | 39.86 | 48.48 | 24.99 | 41.51
CST-ATT → AVG | 39.95 | 47.73 | 25.56 | 41.51
CST-ATT → ONE | 16.16 | 26.47 | 5.17 | 13.00
CST-ATT → ATT | 16.07 | 25.59 | 2.14 | 4.87

Table 2: Ablation study on the instance encoder and the commonsense summarization method.
We select the models that achieve the best micro performance on each source, including RTP, Vanilla-FT and CLEVER. Specifically, for each model, we sample from the top triplet predictions in a 1:50 ratio, resulting in 1,200 triplets for human evaluation. Each triplet is labeled by three independent annotators to decide the commonsense score: implausible (0), plausible but rare (1), common (2). We report the locally averaged triplet commonsense score given by human annotators in Figure 6. We can observe that triplets extracted by CLEVER are assigned significantly higher commonsense scores in most cases. In addition, the commonsense scores of CLEVER achieve a strong 0.78 Spearman's rank correlation coefficient with the human scores, which shows that commonsense scores from our model can be well aligned with human judgments. The reason is that the contrastive attention mechanism can implicitly leverage the redundancy of instances to reflect the commonsense degree, where multiple informative instances in a bag can contribute to higher commonsense scores.
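As a small reference for this correlation check, a sketch using SciPy's Spearman rank correlation is shown below; the score lists are illustrative placeholders, not the actual evaluation data.

```python
# A small sketch of the correlation check between model commonsense scores and
# human judgments using Spearman's rank correlation. The score lists are
# illustrative placeholders, not the actual evaluation data.
from scipy.stats import spearmanr

model_scores = [0.92, 0.75, 0.31, 0.88, 0.12]
human_scores = [2, 2, 1, 2, 0]   # implausible (0) / plausible but rare (1) / common (2)

rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```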
Interpretability. In addition to the competitive performance, a crucial advantage of CLEVER is that the extracted commonsense knowledge can be grounded into visual perceptions through contrastive attention scores over image instances. As shown in Figure 5, informative images are assigned larger attention scores for commonsense learning. Compared with PLM-based approaches that produce commonsense knowledge purely based on correlations between text tokens in a black-box fashion, CLEVER enables trustworthy commonsense knowledge acquisition with better interpretability in the extraction process. From an application perspective, the selected informative images can also serve as supporting evidence for the extracted triplets in KBs for better knowledge utilization in downstream applications.

Figure 5: Unnormalized attention scores of the extracted commonsense triplet (banana, in, bowl) over several images in a bag. (a) Informative instances (attention scores 3.55 and 2.98); (b) uninformative instances (attention scores 2.80 and 2.57).

Figure 6: Human evaluation results on top extracted triplets.
Ablation Study. We perform an ablation study by replacing the VLP encoder with the CNN-based encoder, and replacing the contrastive attention mechanism with existing multi-instance learning methods respectively. From the results in Table 2, we can see that both components contribute to the final results. The results show that image-level entity interaction understanding and bag-level summarization are both important for good CKE performance.
Effect of Bag Size. Intuitively, multiple images in a bag can provide diverse and complementary information about an entity pair for robust commonsense learning. To investigate the effect of bag size, we perform experiments on CLEVER with different bag sizes. From the results in Figure 4, we observe that: (1) A certain number of images is necessary to learn the commonsense interactions. The performance drops significantly when very small bag sizes are used. (2) The performance improvement is not significant when the bag size grows larger than 20. We hypothesize the reason is that although a larger bag provides richer commonsense information, it also challenges the model with more noisy instances. Therefore, more advanced methods need to be developed to better exploit the rich information in larger image bags, which we leave for future work.

Method | AUC | F1 | P@2% | mAUC | mF1 | mP@2%
Random | 41.3 | 45.7 | 43.0 | 23.7 | 38.0 | 21.7
CLIP | | 47.3 | 44.5 | 24.0 | 38.5 |
Overlap | 41.9 | | | | | 22.0

Table 3: Image sampling strategies for bag construction.
Effect of Instance Sampling Strategy for Bag Construction. Given the typically large number of open images containing an entity pair, it is desirable to select instances that are likely to express commonsense interactions at low cost to construct the bag. Besides the spatial overlap strategy, we experiment with another two sampling strategies: (1) Random sampling. Random candidate images are selected to compose the bag. (2) CLIP-based sampling. A text query is constructed for the entity pair as: "[head entity] has some relation with [tail entity]". Then we encode the text query and image candidates using CLIP (Radford et al. 2021), and select the images with top similarity scores. We can see from Table 3 that: (1) Entity interaction priors from CLIP and spatial overlap help select informative images for bag construction. (2) CLIP does not show a significant advantage over spatial overlap. The reason is that spatial overlap incorporates more inductive bias for entity pair interactions, while CLIP is optimized to handle general sentences. Therefore, we choose spatial overlap for bag construction due to its simplicity and efficiency.
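For concreteness, a minimal sketch of such CLIP-based scoring is shown below, using the open-source CLIP package; the query template mirrors the one above, and helper names and the top-k cut-off are illustrative assumptions.

```python
# A minimal sketch of CLIP-based sampling for bag construction: score each
# candidate image against a textual query about the entity pair and keep the
# top-k. Uses the open-source CLIP package; helper names and the cut-off k
# are illustrative assumptions.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_bag(image_paths, head, tail, k=20):
    text = clip.tokenize([f"{head} has some relation with {tail}"]).to(device)
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(images)
        txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sim = (img_feat @ txt_feat.t()).squeeze(1)        # cosine similarity per image
    top = sim.argsort(descending=True)[:k]
    return [image_paths[i] for i in top]
```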
Case Study. We provide examples of the extracted triplets from CLEVER in Table 4. We can see that our model can extract reasonable commonsense knowledge unseen during training, and most importantly, novel facts to supplement commonsense KBs. We note that our model can sometimes produce uncommonly observed facts from accidental scene images. We refer readers to the appendix for the supporting images selected by our model for examples in type III.

Type | Examples
I | (woman, hold, umbrella), (horse, pull, person), (skateboard, under, man), (flower, near, fence), (girl, wear, glove), (truck, has, handle)
II | (snow, cover, tire), (cow, with, nose), (flower, in, mountain), (wire, in, building), (logo, printed_on, train), (boy, hold, pillow)
III | (clock, has, flower), (boat, behind, car), (sheep, behind, bench), (…, on, book)

Table 4: Extracted commonsense triplet examples of different types. I: reasonable triplets unseen during training; II: novel facts for both Visual Genome and ConceptNet (i.e., newly discovered); III: uncommonly observed facts.

Conclusion

In this work, we propose a novel formulation for commonsense knowledge acquisition as an image-based distantly supervised multi-instance learning problem. We present a dedicated framework that achieves deep image-level understanding via vision-language pre-training models, and bag-level summarization via a contrastive attention mechanism. Comprehensive experiments show the effectiveness of our framework. In the future, we will explore more advanced multi-instance learning approaches, and acquire visual commonsense knowledge in more complex forms and types.

Acknowledgements

This work is funded by the Natural Science Foundation of China (NSFC 62061136001), the German Research Foundation (DFG TRR-169) in Project Crossmodal Learning, National Natural Science Foundation of China (Grant No.62276154), AMiner.Shenzhen SciBrain Fund, Shenzhen Science and Technology Innovation Commission (Research Center for Computer Network (Shenzhen) Ministry of Education), Beijing Academy of Artificial Intelligence (BAAI), the Natural Science Foundation of Guangdong Province (Grant No. 2021A1515012640), Basic Research Fund of Shenzhen City (Grant No. JCYJ20210324120012033 and JSGG20210802154402007), and Overseas Cooperation Research Fund of Tsinghua Shenzhen International Graduate School (Grant No. HW2021008).
For author contributions, Yuan Yao designed the framework and experiments, and wrote the paper. Tianyu Yu conducted the experiments. Ao Zhang, Mengdi Li, Ruobing Xie, Cornelius Weber, Zhiyuan Liu, Hai-Tao Zheng, Stefan Wermter, Tat-Seng Chua and Maosong Sun provided valuable suggestions.

References

Angeli, G.; and Manning, C. D. 2013. Philosophers are mortal: Inferring the truth of unseen facts. In Proceedings of CoNLL, 133-142.
Angeli, G.; and Manning, C. D. 2013. 哲學家是有限的:推斷看不見事實的真相。在 CoNLL 會議論文集中,133-142。
Bender, E. M.; and Koller, A. 2020. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. In Proceedings of ACL, 5185-5198.
Bender, E. M.; 和 Koller, A. 2020. 攀登至 NLU: 在數據時代的意義、形式和理解。在 ACL 會議論文集中,5185-5198。
Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD, 1247-1250.
Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; 和 Taylor, J. 2008. Freebase: 一個協作創建的圖形數據庫,用於結構化人類知識。在 ACM SIGMOD 會議論文集中,1247-1250。
Bommasani, R.; Hudson, D. A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M. S.; Bohg, J.; Bosselut,
A.; Brunskill, E.; et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
來源
Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; and Choi, Y. 2019. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. In Proceedings of ACL, 4762-4779.
Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; 和 Choi, Y. 2019. COMET: 共識轉換器用於自動知識圖構建。在 ACL 會議論文集中,4762-4779 頁。
Chao, Y.-W.; Wang, Z.; Mihalcea, R.; and Deng, J. 2015. Mining semantic affordances of visual object categories. In Proceedings of ICCV, 4259-4267.
Chao, Y.-W.; Wang, Z.; Mihalcea, R.; 和 Deng, J. 2015. 挖掘視覺物體類別的語義可負擔性。在 ICCV 會議論文集中,4259-4267 頁。
Chen, T.; Shi, H.; Tang, S.; Chen, Z.; Wu, F.; and Zhuang, Y. 2021. CIL: Contrastive Instance Learning Framework for Distantly Supervised Relation Extraction. In Proceedings of ACL, 6191-6200.
Chen, T.; Shi, H.; Tang, S.; Chen, Z.; Wu, F.; and Zhuang, Y. 2021. CIL: 對比實例學習框架用於遠程監督關係提取。在 ACL 會議論文集中,6191-6200。
Chen, X.; Shrivastava, A.; and Gupta, A. 2013. NEIL: Extracting visual knowledge from web data. In Proceedings of ICCV, 1409-1416.
Chen, X.; Shrivastava, A.; and Gupta, A. 2013. NEIL: 從網絡數據中提取視覺知識。在 ICCV 會議論文集中,1409-1416。
Chen, X.; Zhang, N.; Li, L.; Deng, S.; Tan, C.; Xu, C.; Huang, F.; Si, L.; and Chen, H. 2022. Hybrid Transformer with Multi-Level Fusion for Multimodal Knowledge Graph Completion. In Proceedings of ACM SIGIR, 904-915.
Chen, X.; Zhang, N.; Li, L.; Deng, S.; Tan, C.; Xu, C.; Huang, F.; Si, L.; and Chen, H. 2022. 多模式知識圖完成的混合 Transformer 與多級融合。在 ACM SIGIR 會議論文集中,904-915。
Collell, G.; Van Gool, L.; and Moens, M.-F. 2018. Acquiring common sense spatial knowledge through implicit spatial templates. In Proceedings of AAAI, volume 32.
Collell, G.; Van Gool, L.; and Moens, M.-F. 2018. 通過隱式空間模板獲取常識空間知識。在 AAAI 會議論文集中,第 32 卷。
Davis, R.; Shrobe, H.; and Szolovits, P. 1993. What is a knowledge representation? AI magazine, 14(1): 17-17.
Davis, R.; Shrobe, H.; and Szolovits, P. 1993. 知識表示是什麼?人工智能雜誌,14(1):17-17。
Davison, J.; Feldman, J.; and Rush, A. 2019. Commonsense Knowledge Mining from Pretrained Models. In Proceedings of EMNLP-IJCNLP, 1173-1178.
Davison, J.; Feldman, J.; and Rush, A. 2019. 從預訓練模型中挖掘常識知識。在 EMNLP-IJCNLP 會議論文集中,1173-1178。
Dietterich, T. G.; Lathrop, R. H.; and Lozano-Pérez, T. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence, 89(1-2): 31-71.
Dietterich, T. G.; Lathrop, R. H.; 和 Lozano-Pérez, T. 1997. 用軸平行矩形解決多實例問題。人工智慧,89(1-2):31-71。
Dong, B.; Yao, Y.; Xie, R.; Gao, T.; Han, X.; Liu, Z.; Lin, F.; Lin, L.; and Sun, M. 2020. Meta-information guided metalearning for few-shot relation classification. In Proceedings of the 28th International Conference on Computational Linguistics, 1594-1605.
Dong, B.; Yao, Y.; Xie, R.; Gao, T.; Han, X.; Liu, Z.; Lin, F.; Lin, L.; 和 Sun, M. 2020. 元信息引導的元學習,用於少數樣本關係分類。在第 28 屆國際計算語言學會議論文集中,1594-1605。
Elazar, Y.; Kassner, N.; Ravfogel, S.; Ravichander, A.; Hovy, E.; Schütze, H.; and Goldberg, Y. 2021. Measuring and improving consistency in pretrained language models. TACL, 9: 1012-1031.
Elazar, Y.; Kassner, N.; Ravfogel, S.; Ravichander, A.; Hovy, E.; Schütze, H.; 和 Goldberg, Y. 2021. 衡量和提升預訓練語言模型中的一致性。TACL,9:1012-1031。
Gardères, F.; Ziaeefard, M.; Abeloos, B.; and Lecue, F. 2020. ConceptBert: Concept-Aware Representation for Visual Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2020, 489-498.
Gardères, F.; Ziaeefard, M.; Abeloos, B.; 和 Lecue, F. 2020. ConceptBert: 概念感知表示法用於視覺問答。在計算語言學協會發現: EMNLP 2020, 489-498。
Gordon, J.; and Van Durme, B. 2013. Reporting bias and knowledge acquisition. In Proceedings of the 2013 workshop on Automated knowledge base construction, 25-30.
Gordon, J.; 和 Van Durme, B. 2013. 報導偏見和知識獲取。在 2013 年自動知識庫構建研討會論文集, 25-30。
Gu, J.; Zhao, H.; Lin, Z.; Li, S.; Cai, J.; and Ling, M. 2019. Scene graph generation with external knowledge and image reconstruction. In Proceedings of CVPR, 1969-1978.
Gu, J.; Zhao, H.; Lin, Z.; Li, S.; Cai, J.; 和 Ling, M. 2019. 帶有外部知識和圖像重建的場景圖生成。在 CVPR 論文集, 1969-1978。
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma, D. P.; and Ba, J. 2014. Adam: 一種用於隨機優化的方法. arXiv 預印本 arXiv:1412.6980.
Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual Genome: Connecting language and
Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; 等. 2017. 視覺基因組: 通過眾包的密集圖像標註連接語言和

vision using crowdsourced dense image annotations. IJCV, 123(1): 32-73.
視覺. IJCV, 123(1): 32-73.
Li, X.; Taheri, A.; Tu, L.; and Gimpel, K. 2016. Commonsense knowledge base completion. In Proceedings of ACL, 1445-1455.
Lin, B. Y.; Lee, S.; Khanna, R.; and Ren, X. 2020. Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-Trained Language Models. In Proceedings of EMNLP, 6862-6868.
Lin, Y.; Shen, S.; Liu, Z.; Luan, H.; and Sun, M. 2016. Neural relation extraction with selective attention over instances. In Proceedings of ACL, 2124-2133.
Liu, H.; and Singh, P. 2004. ConceptNet-a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4): 211-226.
Liu, T.; Zhang, X.; Zhou, W.; and Jia, W. 2018. Neural Relation Extraction via Inner-Sentence Noise Reduction and Transfer Learning. In Proceedings of EMNLP, 2195-2204.
Lu, C.; Krishna, R.; Bernstein, M.; and Fei-Fei, L. 2016. Visual relationship detection with language priors. In Proceedings of ECCV, 852-869. Springer.
Lv, S.; Guo, D.; Xu, J.; Tang, D.; Duan, N.; Gong, M.; Shou, L.; Jiang, D.; Cao, G.; and Hu, S. 2020. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. In Proceedings of AAAI, volume 34, 8449-8456.
Malaviya, C.; Bhagavatula, C.; Bosselut, A.; and Choi, Y. 2020. Commonsense knowledge base completion with structural and semantic context. In Proceedings of AAAI, volume 34, 2925-2933.
Miller, G. A. 1994. WordNet: A Lexical Database for English. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.
Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP, 1003-1011.
Moore, C. 2013. The development of commonsense psychology.
Narasimhan, M.; Lazebnik, S.; and Schwing, A. 2018. Out of the box: Reasoning with graph convolution nets for factual visual question answering. NeurIPS, 31.
Nguyen, T. H.; and Grishman, R. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 39-48.
Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Paik, C.; Aroca-Ouellette, S.; Roncone, A.; and Kann, K. 2021. The World of an Octopus: How Reporting Bias Influences a Language Model's Perception of Color. In Proceedings of EMNLP, 823-835.
Peng, H.; Gao, T.; Han, X.; Lin, Y.; Li, P.; Liu, Z.; Sun, M.; and Zhou, J. 2020. Learning from Context or Names? An Empirical Study on Neural Relation Extraction. In Proceedings of EMNLP, 3661-3672.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, 1532-1543.
Petroni, F.; Rocktäschel, T.; Riedel, S.; Lewis, P.; Bakhtin, A.; Wu, Y.; and Miller, A. 2019. Language Models as Knowledge Bases? In Proceedings of EMNLP-IJCNLP, 2463-2473.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In ICML, 8748-8763. PMLR.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. NeurIPS, 28.
Riedel, S.; Yao, L.; and McCallum, A. 2010. Modeling relations and their mentions without labeled text. In Proceedings of ECML-PKDD, 148-163. Springer.
Sadeghi, F.; Kumar Divvala, S. K.; and Farhadi, A. 2015. VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases. In Proceedings of ICCV, 1456-1464.
Sap, M.; Le Bras, R.; Allaway, E.; Bhagavatula, C.; Lourie, N.; Rashkin, H.; Roof, B.; Smith, N. A.; and Choi, Y. 2019. Atomic: An atlas of machine commonsense for if-then reasoning. In Proceedings of AAAI, volume 33, 3027-3035.
Schuster, S.; Krishna, R.; Chang, A.; Fei-Fei, L.; and Manning, C. D. 2015. Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval. In Proceedings of the Fourth Workshop on Vision and Language, 70-80.
Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL.
Shwartz, V.; and Choi, Y. 2020. Do neural language models overcome reporting bias? In Proceedings of COLING, 6863-6870.
Soares, L. B.; Fitzgerald, N.; Ling, J.; and Kwiatkowski, T. 2019. Matching the Blanks: Distributional Similarity for Relation Learning. In Proceedings of ACL, 2895-2905.
Speer, R.; Chin, J.; and Havasi, C. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of AAAI.
Speer, R.; Havasi, C.; and Lieberman, H. 2008. AnalogySpace: Reducing the Dimensionality of Common Sense Knowledge. In Proceedings of AAAI, volume 8, 548-553.
Tang, K.; Niu, Y.; Huang, J.; Shi, J.; and Zhang, H. 2020. Unbiased scene graph generation from biased training. In Proceedings of CVPR, 3716-3725.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. NeurIPS, 30.
Vedantam, R.; Lin, X.; Batra, T.; Zitnick, C. L.; and Parikh, D. 2015. Learning common sense through visual abstraction. In Proceedings of ICCV, 2542-2550.
Wen, H.; Lin, Y.; Lai, T.; Pan, X.; Li, S.; Lin, X.; Zhou, B.; Li, M.; Wang, H.; Zhang, H.; et al. 2021. RESIN: A dockerized schema-guided cross-document cross-lingual cross-media information extraction and event tracking system. In Proceedings of NAACL, 133-143.
Wu, Q.; Shen, C.; Wang, P.; Dick, A.; and Van Den Hengel, A. 2017. Image captioning and visual question answering based on attributes and external knowledge. TPAMI, 40(6): 1367-1381.
Wu, R.; Yao, Y.; Han, X.; Xie, R.; Liu, Z.; Lin, F.; Lin, L.; and Sun, M. 2019. Open Relation Extraction: Relational Knowledge Transfer from Supervised Data to Unsupervised Data. In Proceedings of EMNLP, 219-228.
Wu, S.; Li, Y.; Zhang, D.; Zhou, Y.; and Wu, Z. 2020. Diverse and informative dialogue generation with context-specific commonsense knowledge awareness. In Proceedings of ACL, 5811-5820.
Xu, D.; Zhu, Y.; Choy, C. B.; and Fei-Fei, L. 2017. Scene graph generation by iterative message passing. In Proceedings of ICCV, 5410-5419.
Xu, F. F.; Lin, B. Y.; and Zhu, K. 2018. Automatic Extraction of Commonsense LocatedNear Knowledge. In Proceedings of ACL.
Yao, Y.; Du, J.; Lin, Y.; Li, P.; Liu, Z.; Zhou, J.; and Sun, M. 2021a. CodRED: A Cross-Document Relation Extraction Dataset for Acquiring Knowledge in the Wild. In Proceedings of EMNLP, 4452-4472.
Yao, Y.; Ye, D.; Li, P.; Han, X.; Lin, Y.; Liu, Z.; Liu, Z.; Huang, L.; Zhou, J.; and Sun, M. 2019. DocRED: A Large-Scale Document-Level Relation Extraction Dataset. In Proceedings of ACL, 764-777.
Yao, Y.; Zhang, A.; Han, X.; Li, M.; Weber, C.; Liu, Z.; Wermter, S.; and Sun, M. 2021b. Visual distant supervision for scene graph generation. In Proceedings of ICCV, 15816-15826.
Yao, Y.; Zhang, A.; Zhang, Z.; Liu, Z.; Chua, T.-S.; and Sun, M. 2021c. CPT: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797.
Yatskar, M.; Ordonez, V.; and Farhadi, A. 2016. Stating the Obvious: Extracting Visual Common Sense Knowledge. In Proceedings of NAACL, 193-198.
Zellers, R.; Yatskar, M.; Thomson, S.; and Choi, Y. 2018. Neural motifs: Scene graph parsing with global context. In Proceedings of ICCV, 5831-5840.
Zeng, D.; Liu, K.; Chen, Y.; and Zhao, J. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of EMNLP, 1753-1762.
Zhang, A.; Yao, Y.; Chen, Q.; Ji, W.; Liu, Z.; Sun, M.; and Chua, T.-S. 2022. Fine-Grained Scene Graph Generation with Data Transfer. In Proceedings of ECCV.
Zhang, K.; Yao, Y.; Xie, R.; Han, X.; Liu, Z.; Lin, F.; Lin, L.; and Sun, M. 2021a. Open Hierarchical Relation Extraction. In Proceedings of NAACL, 5682-5693.
Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; and Gao, J. 2021b. VinVL: Revisiting visual representations in vision-language models. In Proceedings of CVPR, 5579-5588.
Zhou, H.; Young, T.; Huang, M.; Zhao, H.; Xu, J.; and Zhu, X. 2018. Commonsense knowledge aware conversation generation with graph attention. In IJCAI, 4623-4629.
Zhou, X.; Zhang, Y.; Cui, L.; and Huang, D. 2020. Evaluating commonsense in pre-trained language models. In Proceedings of AAAI, volume 34, 9733-9740.

Additional Experiments

In this section, we provide additional experimental results, including additional examples for the interpretability of commonsense triplet extractions, and supporting evidence selected by our model for uncommon facts.
Additional Examples for Interpretability. We provide more qualitative results for the interpretability of CLEVER commonsense extractions in Figure 7. Our model extracts reasonable commonsense knowledge involving diverse relations, including spatial, partonomy, and actional relations. Moreover, the contrastive attention scores clearly discriminate between informative and uninformative images in the bag, which provides interpretability for the extraction process of CLEVER and makes the resultant commonsense knowledge more trustworthy. In addition, the selected informative images can serve as supporting evidence for the extracted commonsense knowledge in KBs, which can be useful for downstream applications.
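To make the ranking in Figure 7 concrete, the following is a minimal illustrative sketch (not the exact CLEVER contrastive attention formulation; the feature dimension and the relation query vector are assumptions) of how image-level entity-pair features in a bag could be scored against a relation query and ranked, so that top-scoring images can be surfaced as supporting evidence.

```python
# Illustrative sketch only (not the exact CLEVER contrastive attention
# formulation): rank the images in a bag by an unnormalized attention score
# between each image-level entity-pair feature and a relation query vector,
# so the top-ranked images can be surfaced as supporting evidence.
import torch

def rank_bag_by_attention(instance_feats: torch.Tensor,
                          relation_query: torch.Tensor):
    """instance_feats: (bag_size, dim) image-level entity-pair features.
    relation_query: (dim,) embedding of the predicted relation (assumed)."""
    scores = instance_feats @ relation_query        # unnormalized attention scores
    order = torch.argsort(scores, descending=True)  # rank images within the bag
    return scores, order

# Toy usage with a hypothetical bag of 50 images and 768-d features.
feats = torch.randn(50, 768)
query = torch.randn(768)
scores, order = rank_bag_by_attention(feats, query)
print(f"Top image: {order[0].item()}, score: {scores[order[0]].item():.2f}")
```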
Supporting Evidence for Uncommon Facts. From the case study in the paper (type III in Table 4), we observe that our model can sometimes produce uncommonly observed facts from accidental scene images. We show the supporting images selected by our model with high attention scores in Figure 8. Although the facts are not common, their plausibility can be easily verified once the supporting images are provided. This demonstrates the advantage of extracting commonsense knowledge from grounded information sources, and the effectiveness of the proposed model in image-level entity interaction understanding and bag-level selection.

Discussion and Outlook

Despite the promising results of the proposed model, we note that there is still ample room for improvement. In this section, we discuss the limitations of this work and promising directions for future research.
Commonsense in More Complex Forms. In this work, we formulate commonsense as a triplet of binary commonsense relations between a pair of entities. Although binary relations constitute a fundamental part of real-world commonsense, humans can summarize commonsense from images in more complex forms: (1) n-ary commonsense interactions among multiple entities. For example, the commonsense "person can write letters with a pen" can be summarized as a ternary relation among three entities write(person, letter, pen). (2) Commonsense correlations between structured facts. For example, (rainwater, on, road) is highly correlated with (person, hold, umbrella).
Figure 7: Unnormalized attention scores of the extracted commonsense triplets over images and their ranks in a bag.
(tail, on, book)   (sheep, behind, bench)   (clock, has, flower)   (boat, behind, car)
Figure 8: Supporting images selected by our model for uncommonly observed facts.
Commonsense in More Complex Types. Although images contain rich commonsense knowledge about the visual world, we note that there are still important commonsense types out of reach of images: (1) Temporal commonsense. For example, a person needs to open the door of a refrigerator before getting the milk in it, which can hopefully be acquired from videos. (2) Invisible commonsense. For example, (love, can cause, happiness) may arguably only be extracted from text. Although PLMs can deal with flexible forms and types of commonsense, there is a general belief that learning purely from the correlation of surface text forms without grounding to real-world perceptions cannot lead to real understanding of commonsense meanings (Bender and Koller 2020). Moreover, a developmental learning procedure from concrete grounded commonsense (from visual perceptions) to abstract commonsense (from language) is also more biologically plausible and supported by human cognition (Moore 2013).
man, person, woman, tree, building, table, sign, boy, window, fence, pole, girl, dog, snow, car, bench, street, train, bird, light, head, chair, hand, sidewalk, door, bike, elephant, rock, horse, bus, glass, truck, bag, box, boat, beach, plate, clock, leaf, plant, board, umbrella, giraffe, leg, flower, motorcycle, track, cow, post, hill, zebra, surfboard, banana, shirt, shelf, house, face, food, wire, arm, hair, skateboard, paper, branch, bottle, handle, sheep, roof, bowl, wheel, book, logo, trunk, cup, mountain, lamp, seat, shoe, wave, pillow, jacket, cabinet, hat, tail, letter, tire, ear, nose, helmet, eye, cap, coat, mouth, glove, pant, tile, neck, jean, short, wing
Table 5: List of entities of the benchmark.
on, near, in, has, behind, with, above, of, under, in front of, holding, sitting on, over, attached to, wearing, standing on, for, between, looking at, at, hanging from, standing next to, belonging to, covering, touching, underneath, carrying, next, laying on, outside, wears, part of, on back of, beneath, leaning on, along, riding, standing by, watching, standing behind, standing near, from, against, to, standing, resting on, riding on, across, sitting on top of, mounted on, walking on, lying on, worn by, covered in, has on, in middle of, sitting, walking, eating, painted on, connected to, holding up, using, covered with, surrounding, growing on, held by, crossing, walking in, made of, supporting, full of, pulling, lining, playing with, printed on, filled with, walking down, parked on, on bottom of, laying in, cutting, lying on top of, contains, sitting at, hitting, built into, shows, parked in, written on, playing, says, driving on, adorning, growing in, hanging in, swinging, flying, throwing, floating in
Table 6: List of relations of the benchmark.

Implementation Details

In this section, we provide the implementation details of the experiments, including benchmark statistics, model training, CNN encoder and evaluation metrics.
Benchmark Statistics. To construct the benchmark, we select distinct triplets whose entities and relations fall in the top 100 entity categories and top 100 relation categories. We provide the category lists of entities and relations of the benchmark in Table 5 and Table 6.
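As a small illustrative sketch (with hypothetical variable names), restricting the benchmark to these categories amounts to a frequency-based filter over the mined triplets, assuming "top" means most frequent among triplets taken from Visual Genome scene graphs.

```python
# Hypothetical sketch of the top-100 filtering step; `vg_triplets` is an
# assumed iterable of (subject, relation, object) string tuples.
from collections import Counter

def build_benchmark(vg_triplets, n_entities=100, n_relations=100):
    triplets = set(vg_triplets)  # keep distinct triplets only
    ent_counts = Counter([s for s, _, _ in triplets] + [o for _, _, o in triplets])
    rel_counts = Counter(r for _, r, _ in triplets)
    top_ents = {e for e, _ in ent_counts.most_common(n_entities)}
    top_rels = {r for r, _ in rel_counts.most_common(n_relations)}
    return {(s, r, o) for s, r, o in triplets
            if s in top_ents and o in top_ents and r in top_rels}
```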
Bag Construction. For bag construction, the entity pairs in images are ranked according to their intersection over union (IoU). Entity pairs that do not have overlapping regions in an image are ranked according to the distance between the central points of the two entities.
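A minimal sketch of this ranking rule is shown below, assuming entity regions are given as (x1, y1, x2, y2) bounding boxes; function and variable names are illustrative.

```python
# Sketch of the bag-construction ranking: overlapping pairs first (higher IoU
# first), then non-overlapping pairs by ascending distance between box centers.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    if inter == 0.0:
        return 0.0
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union

def center_distance(box_a, box_b):
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return ((cax - cbx) ** 2 + (cay - cby) ** 2) ** 0.5

def rank_entity_pairs(pairs):
    """pairs: list of (subject_box, object_box) tuples."""
    overlapping = sorted([p for p in pairs if iou(*p) > 0], key=lambda p: -iou(*p))
    disjoint = sorted([p for p in pairs if iou(*p) == 0],
                      key=lambda p: center_distance(*p))
    return overlapping + disjoint
```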
Model Training. The hyperparameters in the experiments are selected by grid search according to the average of AUC and mAUC scores on the validation set. We use base-size pre-trained models in all our experiments. Our best model is trained with the AdamW (Kingma and Ba 2014) optimizer on 10 NVIDIA GeForce RTX 2080 Ti GPUs for 18 epochs, with bag size 50, learning rate 7e-5, batch size 60, and weight decay 0.01. We first warm up the training by linearly increasing the learning rate from 7e-6 to 7e-5 over 1,000 steps. The learning rate is then divided by 10 after each performance plateau on the validation set, and training terminates after three plateaus.
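As a rough, self-contained illustration of this schedule (with a toy stand-in model, random data, and a random validation metric in place of the real AUC/mAUC computation), the warm-up, plateau-based decay, and three-plateau stopping rule could look as follows.

```python
# Hedged sketch: linear warm-up from 7e-6 to 7e-5 over 1,000 steps, a 10x
# learning-rate decay after each validation plateau, and stopping after three
# plateaus. All model, data, and metric values here are placeholders.
import torch

model = torch.nn.Linear(768, 100)  # placeholder for the actual bag-level model
optimizer = torch.optim.AdamW(model.parameters(), lr=7e-5, weight_decay=0.01)
WARMUP_STEPS, START_LR, BASE_LR = 1000, 7e-6, 7e-5

def set_lr(lr):
    for group in optimizer.param_groups:
        group["lr"] = lr

step, plateaus, best = 0, 0, float("-inf")
for epoch in range(18):
    for _ in range(100):  # toy training steps on random "bags"
        if step < WARMUP_STEPS:
            set_lr(START_LR + (BASE_LR - START_LR) * step / WARMUP_STEPS)
        loss = model(torch.randn(60, 768)).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
    metric = torch.rand(1).item()  # stand-in for the validation AUC/mAUC average
    if metric > best:
        best = metric
    else:
        plateaus += 1
        set_lr(optimizer.param_groups[0]["lr"] * 0.1)  # decay by 10x on plateau
        if plateaus == 3:
            break  # terminate after three performance plateaus
```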
CNN Encoder. We adapt a Neural Motifs (Zellers et al. 2018) model to encode image-level entity pair features for AVG, ONE and ATT. Specifically, we adopt a Faster R-CNN (Ren et al. 2015) model pre-trained on Visual Genome as the visual encoder to extract raw features for objects. For each query entity pair, we concatenate the two raw feature vectors and feed the result into an MLP with LayerNorm and ReLU activation to generate the visual representation. Following the implementation of Zellers et al. (2018), we also utilize pre-trained GloVe vectors (Pennington, Socher, and Manning 2014) of the entity categories as extra information from the text domain. The Bi-LSTM context encoder is removed, since its limited parallelization capability prevents the model from converging at an acceptable time cost. We also conduct a few experiments to study the effectiveness of spatial features represented by bounding box positions. Models trained with spatial features achieve only marginal performance gains while occupying three times more memory. Therefore, we do not include spatial features in the input of our CNN-based model.
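A hedged sketch of such an entity-pair encoder is shown below; the feature dimensions and the exact way the GloVe vectors are combined are assumptions, and random tensors stand in for the Faster R-CNN object features and GloVe embeddings of the entity categories.

```python
# Sketch of an image-level entity-pair encoder: concatenate subject/object
# detector features with their category word vectors, then apply an MLP with
# LayerNorm and ReLU. Dimensions are assumed, not taken from the paper.
import torch
import torch.nn as nn

class PairEncoder(nn.Module):
    def __init__(self, visual_dim=2048, glove_dim=300, hidden_dim=1024):
        super().__init__()
        in_dim = 2 * (visual_dim + glove_dim)  # subject + object features
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, subj_feat, obj_feat, subj_word, obj_word):
        pair = torch.cat([subj_feat, subj_word, obj_feat, obj_word], dim=-1)
        return self.mlp(pair)

# Toy usage for one query entity pair in an image.
enc = PairEncoder()
rep = enc(torch.randn(1, 2048), torch.randn(1, 2048),
          torch.randn(1, 300), torch.randn(1, 300))
print(rep.shape)  # torch.Size([1, 1024])
```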
Evaluation Metrics. Following previous works on knowledge acquisition from text (Zeng et al. 2015; Lin et al. 2016), to provide a rigorous evaluation, we report results based on the precision-recall curve of held-out triplet predictions. Specifically, given a query entity pair, our model predicts a commonsense score for each relation (excluding NA), which indicates the plausibility of the corresponding candidate triplet. We rank all candidate triplets according to their commonsense scores, and compute the precision-recall curve by comparing the top predictions with the held-out triplets. The AUC is computed as the area under this curve. For each point on the curve, we compute the F1 score as the harmonic mean of precision and recall, and report the maximum F1 score on the curve. Precision@K% is the precision of the top K% extractions among the candidates. For macro evaluation, we first calculate the precision-recall curve of each relation, and then obtain the macro curve by averaging the per-relation curves.
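A minimal sketch of this held-out evaluation is given below, assuming `scored_triplets` is a list of (head, relation, tail, score) candidates and `held_out` is the set of gold triplets; the macro variant would average per-relation curves computed in the same way.

```python
# Illustrative held-out evaluation: rank candidates by score, build the
# precision-recall curve, and derive AUC, maximum F1, and Precision@K%.
import numpy as np

def heldout_metrics(scored_triplets, held_out):
    ranked = sorted(scored_triplets, key=lambda x: -x[3])      # sort by score
    hits = np.array([(h, r, t) in held_out for h, r, t, _ in ranked], dtype=float)
    tp = np.cumsum(hits)
    precision = tp / np.arange(1, len(ranked) + 1)
    recall = tp / max(len(held_out), 1)
    auc = np.trapz(precision, recall)                          # area under the P-R curve
    f1 = (2 * precision * recall / np.clip(precision + recall, 1e-12, None)).max()
    p_at_2 = precision[max(1, int(0.02 * len(ranked))) - 1]    # Precision@2%
    return auc, float(f1), float(p_at_2)
```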

  1. *Corresponding authors: Z. Liu (liuzy@tsinghua.edu.cn), H. Zheng (zheng.haitao@sz.tsinghua.edu.cn)
    Copyright (C) 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
  2. We randomly sample 200 distinct relational triplets from Visual Genome (Krishna et al. 2017) and manually verify whether the triplet or its variations are included in ConceptNet.
  3. We also experimented with masking the entities, and find that masking relations achieves better performance.