Zian Jia¹, Yun Xiong¹, Yuhong Nan²*, Yao Zhang¹, Jinjing Zhao³, Mi Wen⁴
¹Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
²School of Software Engineering, Sun Yat-sen University, China
³National Key Laboratory of Science and Technology on Information System Security, China
⁴Shanghai University of Electric Power, China
Abstract
Advanced Persistent Threats (APTs), adopted by the most sophisticated attackers, are becoming increasingly common and pose a great threat to various enterprises and institutions. Data provenance analysis on provenance graphs has emerged as a common approach in APT detection. However, previous works have exhibited several shortcomings: (1) requiring attack-containing data and a priori knowledge of APTs, (2) failing to extract the rich contextual information buried within provenance graphs and (3) becoming impracticable due to their prohibitive computation overhead and memory consumption.
In this paper, we introduce MAGIC, a novel and flexible self-supervised APT detection approach capable of performing multi-granularity detection under different levels of supervision. MAGIC leverages masked graph representation learning to model benign system entities and behaviors, performing efficient deep feature extraction and structure abstraction on provenance graphs. By ferreting out anomalous system behaviors via outlier detection methods, MAGIC is able to perform both system entity level and batched log level APT detection. MAGIC is specially designed to handle concept drift with a model adaptation mechanism and successfully applies to universal conditions and detection scenarios. We evaluate MAGIC on three widely-used datasets, including both real-world and simulated attacks. Evaluation results indicate that MAGIC achieves promising detection results in all scenarios and shows an enormous advantage over state-of-the-art APT detection approaches in performance overhead.
1 Introduction
Advanced Persistent Threats (APTs) are intentional and sophisticated cyber-attacks conducted by skilled attackers and pose a great threat to both enterprises and institutions [1]. Most APTs involve zero-day vulnerabilities and are especially difficult to detect due to their stealthy and changeful nature.
Recent works [2-18] leverage data provenance to perform APT detection. Data provenance transforms audit logs into provenance graphs, which extract the rich contextual information from audit logs and provide a perfect platform for fine-grained causality analysis and APT detection. Early works [2-6] construct rules based on typical or specific APT patterns and match audit logs against those rules to detect potential APTs. Several recent works [7-9] adopt a statistical anomaly detection approach, focusing on different provenance graph elements, e.g., system entities, interactions and communities. Most recent works [10-18], however, are deep learning-based approaches. They utilize various deep learning (DL) techniques to model APT patterns or system behaviors and perform APT detection in a classification or anomaly detection style.
While these existing approaches have demonstrated their capability to detect APTs with reasonable accuracy, they encounter various combinations of the following challenges: (1) Supervised methods suffer from the lack-of-data (LOD) problem as they require a priori knowledge about APTs (i.e., attack patterns or attack-containing logs). In addition, these methods are particularly vulnerable when confronted with new types of APTs they are not trained to deal with. (2) Statistics-based methods only require benign data to function, but they fail to extract the deep semantics and correlations of complex benign activities buried in audit logs, resulting in high false positive rates. (3) DL-based methods, especially sequence-based and graph-based approaches, have achieved promising effectiveness at the cost of heavy computation overhead, rendering them impractical in real-life detection scenarios.
In this paper, we address the above three issues by introducing MAGIC, a novel self-supervised APT detection approach that leverages masked graph representation learning and simple outlier detection methods to identify key attack system entities from massive audit logs. MAGIC first constructs the provenance graph from audit logs in simple yet universal steps. MAGIC then employs a graph representation module that obtains embeddings by incorporating graph features and structural information in a self-supervised way. The model
is built upon graph masked auto-encoders [19] under the joint supervision of both masked feature reconstruction and sample-based structure reconstruction. An unsupervised outlier detection method is employed to analyze the computed embeddings and attain the final detection result.
MAGIC is designed to be flexible and scalable. Depending on the application background, MAGIC is able to perform multi-granularity detection, i.e., detecting APT existence in batched logs or locating entity-level adversaries. Although MAGIC is designed to perform APT detection without attack-containing data, it is well-suited for semi-supervised and fully-supervised conditions. Furthermore, MAGIC also contains an optional model adaptation mechanism which provides a feedback channel for its users. Such feedback is important for MAGIC to further improve its performance, combat concept drift and reduce false positives.
We implement MAGIC and evaluate its performance and overhead on three different APT attack datasets: the DARPA Transparent Computing E3 datasets [20], the StreamSpot dataset [21] and the Unicorn Wget dataset [22]. The DARPA datasets contain real-world attacks, while the StreamSpot and Unicorn Wget datasets are fully simulated in controlled environments. Evaluation results show that MAGIC is able to perform entity-level APT detection with 97.26% precision and 99.91% recall at minimal overhead, demanding less memory and running significantly faster than state-of-the-art approaches (e.g., 51 times faster than ShadeWatcher [18]).
To benefit future research and encourage further improvement on MAGIC, we make our implementation of MAGIC and our pre-processed datasets open to the public¹. In summary, this paper makes the following contributions:
We propose MAGIC, a universal APT detection approach based on masked graph representation learning and outlier detection methods, capable of performing multi-granularity detection on massive audit logs.
We ensure MAGIC's practicability by minimizing its computation overhead with extended graph masked auto-encoders, allowing MAGIC to complete training and detection in acceptable time even under tight conditions.
We secure MAGIC's universality with various efforts. We leverage masked graph representation learning and outlier detection methods, enabling MAGIC to perform precise detection under different supervision levels, in different detection granularity and with audit logs from various sources.
We evaluate MAGIC on three widely-used datasets, involving both real-world and simulated APT attacks. Evaluation results show that MAGIC detects APTs with promising results and minimal computation overhead.
We provide an open-source implementation of MAGIC to benefit future research in the community and encourage further improvement on our approach.
Figure 1: The provenance graph of a real-world APT attack, exploiting the Pine Backdoor vulnerability. All attack-irrelevant entities and interactions have been removed from the provenance graph.
2 Background
2.1 Motivating Example
Here we provide a detailed illustration of an APT scenario that we use throughout the paper. Pine backdoor with Drakon Dropper is an APT attack from the DARPA Engagement 3 Trace dataset [20]. During the attack, an attacker constructs a malicious executable (/tmp/tcexec) and sends it to the target host via a phishing e-mail. The user then unknowingly downloads and opens the e-mail. Contained within the e-mail is an executable designed to perform a port-scan for internal reconnaissance and establish a silent connection between the target host and the attacker. Figure 1 displays the provenance graph of our motivating example. Nodes in the graph represent system entities and arrows represent directed interactions between entities. The graph shown is a subgraph abstracted from the complete provenance graph by removing most attack-irrelevant entities and interactions. Different node shapes correspond to different types of entities. Entities covered in stripes are considered malicious.
2.2 Prior Research and their Limitations
Supervised Methods. For early works [2-6], special heuristic rules need to be constructed to cover all attack patterns. Many DL-based APT detection methods [11,14-16,18] construct provenance graphs based on both benign and attack-containing data and detect APTs in a classification style. These supervised methods can achieve almost perfect detection results on learned attack patterns but are especially vulnerable facing concept drift or unseen attack patterns. Moreover, for rule-based methods, the construction and maintenance of heuristic rules can be very expensive and time-consuming. And for DL-based methods, the scarcity of attack-containing data prevents these supervised methods from being actually deployable. To address the above issues, MAGIC adopts a fully self-supervised anomaly detection style, allowing the absence of attack-containing data while effectively dealing
with unseen attack patterns.
Statistics-based Methods. Most recent statistics-based methods [7-9] detect APT signals by identifying system entities, interactions and communities based on their rarity or anomaly scores. However, the rarity of system entities does not necessarily indicate their abnormality, and anomaly scores, obtained via causal analysis or label propagation, amount to shallow feature extraction on provenance graphs. To illustrate, the process tcexec performs multiple port-scan operations on different IP addresses in our motivating example (see Figure 1), which may be considered normal system behavior. However, taking into consideration that the process tcexec, derived from the external network, also reads sensitive system information (uname) and makes connections with public IP addresses (162.66.239.75), we can easily identify tcexec as a malicious entity. Failure to extract deep semantics and correlations between system entities often results in the low detection performance and high false positive rate of statistics-based methods. MAGIC, however, employs a graph representation module to perform deep graph feature extraction on provenance graphs, resulting in high-quality embeddings.
DL-based Methods. Recently, DL-based APT detection methods, whether supervised or unsupervised, have produced very promising detection results. However, in reality, hundreds of GB of audit logs are produced every day in a medium-size enterprise [23]. Consequently, DL-based methods, especially sequence-based [11,14,24] and graph-based [10,12,15-18] methods, are impracticable due to their computation overhead. For instance, ATLAS [11] takes an average of 1 hour to train on 676 MB of audit logs and ShadeWatcher [18] takes 1 day to train on the DARPA E3 Trace dataset with a GPU available. Besides, some graph auto-encoder [25-27] based methods encounter explosive memory overhead when the scale of provenance graphs expands. MAGIC avoids being computationally demanding by introducing graph masked auto-encoders and completes its training on the DARPA E3 Trace dataset in mere minutes. A detailed evaluation of MAGIC's performance overhead is presented in Sec. 6.4.
End-to-end Approaches. Beyond the three major limitations discussed above, it is also worth mentioning that most recent APT detection approaches [11,17,18] are end-to-end detectors and focus on one specific detection task. For instance, ATLAS [11] focuses on end-to-end attack reconstruction and Unicorn [10] yields system-level alarms from streaming logs. Instead, MAGIC's approach is universal and performs multi-granularity APT detection under various detection scenarios, and it can also be applied to audit logs collected from different sources.
2.3 Threat Model and Definitions
We first present the threat model we use throughout the paper and then formally define key concepts that are crucial to
understanding how MAGIC performs APT detection.
Threat Model. We assume that attackers come from outside a system and target valuable information within the system. An attacker may perform sophisticated steps to achieve his goal but leaves trackable evidence in logs. The combination of the system hardware, operating system and system auditing software is our trusted computing base. Poisoning attacks and evasion attacks are not considered in our threat model.
Provenance Graph. A provenance graph is a directed cyclic graph extracted from raw audit logs. Constructing a provenance graph is common practice in data provenance, as it connects system entities and presents the interaction relationships between them. A provenance graph contains nodes representing different system entities (e.g., processes, files and sockets) and edges representing interactions between system entities (e.g., execute and connect), labeled with their types. For example, /tmp/tcexec is a FileObject system entity and the edge between /tmp/tcexec and tcexec is an execute operation from a FileObject targeting a Process (see Figure 1).
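To make the definition concrete, the following minimal sketch (using networkx, which is not part of MAGIC itself) builds a typed, directed multi-graph over a few entities and interactions from the motivating example; the attribute names are illustrative.

```python
# A toy provenance graph over entities from Figure 1 (illustrative only).
import networkx as nx

g = nx.MultiDiGraph()  # directed; multiple typed interactions may link one pair

# Nodes: system entities, labeled with their type.
g.add_node("/tmp/tcexec", type="FileObject")
g.add_node("tcexec", type="Process")
g.add_node("162.66.239.75", type="NetFlowObject")

# Edges: directed interactions, labeled with their type.
g.add_edge("/tmp/tcexec", "tcexec", type="execute")
g.add_edge("tcexec", "162.66.239.75", type="connect")

for u, v, attrs in g.edges(data=True):
    print(f"{u} --{attrs['type']}--> {v}")
```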
Multi-granularity Detection. MAGIC is capable of performing APT detection at two granularities: batched log level and system entity level. MAGIC's multi-granularity detection ability gives rise to a two-stage detection approach: first conduct batched log level detection on streaming batches of logs, and then perform system entity level detection on positive batches to produce detailed detection results. Applying this approach to real-world settings will effectively reduce workload, resource consumption and false positives and, in the meantime, produce detailed outcomes.
Batched log level detection. Under this granularity of APT detection, the major task is: given batched audit logs from a consistent source, MAGIC alerts if a potential APT is detected in a batch of logs. Similar to Unicorn [10], MAGIC does not accurately locate malicious system entities and interactions under this granularity of detection.
System entity level detection. The detection task under this granularity of APT detection is: given audit logs from a consistent source, MAGIC accurately locates malicious system entities in those audit logs. Identification of key system entities during APTs is vital to subsequent tasks such as attack investigation and attack story recovery, as it provides explicable detection results and reduces the need for domain experts as well as redundant manual efforts [11].
3 MAGIC Overview
MAGIC is a novel self-supervised APT detection approach that leverages masked graph representation learning and outlier detection methods and is capable of efficiently performing multi-granularity detection on massive audit logs. MAGIC's pipeline consists of three main components: (1) provenance graph construction, (2) a graph representation module and (3) a detection module. It also provides an optional (4)
Figure 2: Overview of MAGIC's detection pipeline.
model adaptation mechanism. During training, MAGIC transforms training data with (1), learns graph embeddings by (2) and memorizes benign behaviors in (3). During inference, MAGIC transforms target data with (1), obtains graph embeddings with the trained (2) and detects outliers through (3). Figure 2 gives an overview of the MAGIC architecture.
Streaming audit logs collected by system auditing software are usually stored in batches. During provenance graph construction (1), MAGIC transforms these logs into static provenance graphs. System entities and interactions between them are extracted and converted into nodes and edges respectively. Several complexity reduction techniques are utilized to remove redundant information.
The constructed provenance graphs are then fed through the graph representation module (2) to obtain output embeddings (i.e., comprehensive vector representations of objects). Built upon graph masked auto-encoders and integrating sample-based structure reconstruction, the graph representation module embeds, propagates and aggregates node and edge attributes into output embeddings, which contain both node embeddings and the system state embedding.
The graph representation module is trained with only benign audit logs to model benign system behaviors. When performing APT detection on potentially attack-containing audit logs, MAGIC utilizes outlier detection methods based on the output embeddings to detect outliers in system behaviors (3). Depending on the granularity of the task, different embeddings are used to complete APT detection. On batched log level tasks, the system state embeddings, which reflect the general behaviors of the whole system, are the detection targets. An outlier in such embeddings means its corresponding system state is unseen and potentially malicious, which reveals an APT signal in that batch. On system entity level tasks, the detection targets are those node embeddings, which
represent the behaviors of system entities. Outliers in node embeddings indicate suspicious system entities and detect APT threats at a finer granularity.
In real-world detection settings, MAGIC has two pre-designed applications. For each batch of logs collected by system auditing software, one can either directly utilize MAGIC's entity level detection to accurately identify malicious entities within the batch, or perform a two-stage detection, as stated in Sec. 2.3 and sketched below. In this case, MAGIC first scans a batch and sees if malicious signals exist in the batch (batched log level detection). If it alerts positive, MAGIC then performs entity level detection to identify malicious system entities at a finer granularity. Batched log level detection is significantly less computationally demanding than entity level detection. Therefore, such a two-stage routine can help MAGIC's users save computational resources and avoid false alarms without affecting MAGIC's detection fineness. However, if users favor fine-grained detection on all system entities, the former routine is still an accessible option.
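The two-stage routine can be summarized in a few lines. In the sketch below, detect_batch and detect_entities are hypothetical stand-ins for MAGIC's batched log level and entity level detectors; only flagged batches pay the cost of fine-grained detection.

```python
def two_stage_detection(log_batches, detect_batch, detect_entities):
    """Yield (batch, malicious_entities) only for batches flagged at the log level."""
    for batch in log_batches:
        if detect_batch(batch):                  # stage 1: cheap batched log level scan
            yield batch, detect_entities(batch)  # stage 2: fine-grained entity level
```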
To deal with concept drift and unseen attacks, an optional model adaptation mechanism is employed to provide feedback channels for its users (4). Detection results checked and confirmed by security analysts are fed back to MAGIC, helping it adapt to benign system behavior changes in a semi-supervised way. Under such conditions, MAGIC achieves even more promising detection results, which is discussed in Sec. 6.3. Furthermore, MAGIC can be easily applied to real-world online APT detection thanks to its ability to adapt itself to concept drift and its minimal computation overhead.
4 Design Details
In this section, we explain in detail how MAGIC performs efficient APT detection on massive audit logs. MAGIC con-
tains four major components: a graph construction phase that builds optimised and consistent provenance graphs (Sec. 4.1), a graph representation module that produces output embeddings with maximum efficiency (Sec. 4.2), a detection module that utilizes outlier detection methods to perform APT detection (Sec. 4.3) and a model adaptation mechanism to deal with concept drift and other high-quality feedback (Sec. 4.4).
4.1 Provenance Graph Construction
MAGIC first constructs a provenance graph out of raw audit logs before performing graph representation and APT detection. We follow three steps to construct a consistent and optimised provenance graph ready for graph representation.
Log Parsing. The first step is to parse each log entry and extract system entities and the system interactions between them. Then, a prototype provenance graph can be constructed with system entities as nodes and interactions as edges. Next, we extract categorical information regarding nodes and edges. For simple log formats that provide entity and interaction labels, we directly utilize these labels. For formats that provide complicated attributes of those entities and interactions, we apply multi-label hashing (e.g., xxhash [28]) to transform attributes into labels, as sketched below. At this stage, the provenance graph is a directed multi-graph. We design an example to demonstrate how we deal with the raw provenance graph after log parsing in Figure 3.
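As a rough illustration of the multi-label hashing step, the sketch below (assuming the Python xxhash package; the attribute names are invented for the example) collapses an attribute set into a single categorical label:

```python
import xxhash

def attributes_to_label(attrs: dict, num_labels: int = 2**16) -> int:
    """Hash an entity's/interaction's attribute dict into one categorical label."""
    canonical = "|".join(f"{k}={v}" for k, v in sorted(attrs.items()))
    return xxhash.xxh64(canonical).intdigest() % num_labels

# Entities with identical attributes receive identical labels.
label = attributes_to_label({"subtype": "FileObject", "permission": "0644"})
```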
Initial Embedding. In this stage, we transform node and edge labels into fixed-size feature vectors (i.e., initial embeddings) of dimension $d$, where $d$ is the hidden dimension of our graph representation module. We apply a lookup embedding, which establishes a one-to-one mapping between node/edge labels and $d$-dimensional feature vectors. As demonstrated in Figure 3 (I and II), processes $a$ and $b$ share the same label, so they are mapped to the same feature vector, while $a$ and $c$ are embedded into different feature vectors as they have different labels. We note that the possible number of unique node/edge labels is determined by the data source (i.e., the auditing log format). Therefore, the lookup embedding works under a transductive setting and does not need to learn embeddings for unseen labels.
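A minimal PyTorch sketch of the lookup embedding (dimensions are illustrative): each label indexes one row of a learnable table, so same-label nodes such as $a$ and $b$ share a feature vector.

```python
import torch
import torch.nn as nn

d = 64                    # hidden dimension of the graph representation module
num_node_labels = 2**16   # fixed by the data source / hash space

node_lookup = nn.Embedding(num_node_labels, d)  # one-to-one label -> vector

labels = torch.tensor([17, 17, 42])  # a and b share label 17; c has label 42
x = node_lookup(labels)              # x[0] equals x[1]; both differ from x[2]
```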
Noise Reduction. The graph representation module expects simple graphs as input. Thus, we need to combine multiple edges between node pairs. If multiple edges of the same label (also sharing the same initial embedding) exist between a pair of nodes, we remove redundant edges so that only one of them remains. Then we combine the remaining edges into one final edge. We note that between a pair of nodes, edges of several different labels may remain. After the combination, the initial embedding of the resulting unique edge is obtained by averaging the initial embeddings of the remaining edges. To illustrate, we show how our noise reduction combines multi-edges and how it affects the edge initial embeddings in Figure 3 (II and III). First, the three read and two write interactions between $a$ and $c$
Figure 3: Example of MAGIC's graph construction steps.
are merged into one for each label. Then we combine them together, forming one edge $e_{ac}$ whose initial embedding equals the average of the initial embeddings of the remaining edges ($e'_2$ and $e'_5$). We provide a comparison between our noise reduction steps and previous works in Appendix E.
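The sketch below (illustrative, not MAGIC's exact implementation) shows the two noise-reduction steps: same-label duplicates between a node pair are dropped, then the surviving per-label edges are combined into one edge with an averaged initial embedding.

```python
import torch

def combine_edges(edges, label_emb):
    """edges: iterable of (src, dst, label); label_emb: label -> d-dim tensor."""
    per_pair = {}
    for src, dst, label in edges:
        per_pair.setdefault((src, dst), set()).add(label)  # dedup same-label edges
    # One final edge per pair; its embedding averages the remaining labels' embeddings.
    return {pair: torch.stack([label_emb[l] for l in labels]).mean(dim=0)
            for pair, labels in per_pair.items()}
```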
After conducting the above three steps, MAGIC has finished constructing a consistent and information-preserving provenance graph ready for subsequent tasks. During provenance graph construction, little information is lost, as MAGIC only coarsens the original semantics by generalizing detailed descriptions of system entities and interactions into labels. Meanwhile, an average of 79.60% of all edges are removed on the DARPA E3 Trace dataset, saving MAGIC's training time and memory consumption.
4.2 Graph Representation Module
MAGIC employs a graph representation module to obtain high-quality embeddings from featured provenance graphs. As illustrated in Figure 4, the graph representation module consists of three phases: a masking procedure that partially hides node features (i.e., initial embeddings) for reconstruction purposes (Sec. 4.2.1), a graph encoder that produces node and system state output embeddings by propagating and aggregating graph features (Sec. 4.2.2), and a graph decoder that provides supervision signals for the training of the graph representation module via masked feature reconstruction and sample-based structure reconstruction (Sec. 4.2.3). The encoder and decoder form a graph masked auto-encoder, which excels at
Figure 4: Graph representation module of MAGIC.
producing fast and resource-saving embeddings.
4.2.1 Feature Masking
Before training our graph representation module, we perform masking on nodes, so that the graph masked auto-encoder can be trained on the reconstruction of these nodes. Masked nodes are randomly chosen, covering a certain proportion of all nodes. The initial embeddings of such masked nodes are replaced with a special mask token $x_{mask}$ to cover any original information about these nodes. Edges, however, are not masked, because these edges provide precious information about relationships between system entities. In summary, given node initial embeddings $x_n$, we mask nodes as follows:
$$emb_{n}= \begin{cases}x_{n}, & n \notin \widetilde{N} \\ x_{mask}, & n \in \widetilde{N}\end{cases}$$
where $\widetilde{N}$ is the set of randomly-chosen masked nodes and $emb_n$ is the embedding of node $n$ used to train the graph representation module. This masking process only happens during training. During detection, we do not mask any nodes.
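A minimal PyTorch sketch of this masking step (in practice the mask token would be a learned parameter of the module; here it is passed in explicitly):

```python
import torch

def mask_node_features(x, x_mask, mask_rate=0.5):
    """x: (num_nodes, d) initial embeddings; x_mask: (d,) mask token."""
    num_nodes = x.size(0)
    perm = torch.randperm(num_nodes)
    masked = perm[: int(mask_rate * num_nodes)]  # the randomly-chosen set N~
    emb = x.clone()
    emb[masked] = x_mask                         # hide all original information
    return emb, masked                           # indices kept for reconstruction
```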
4.2.2 Graph Encoder
Initial embeddings obtained from the graph construction steps take only raw features into consideration. However, raw features are far from enough to model detailed behaviors of system entities. Contextual information about an entity, such as its neighborhood, its multi-hop relationships and its interaction patterns with other system entities, plays an important role in obtaining high-quality entity embeddings [29]. Here we employ and extend graph masked auto-encoders [19] to generate output embeddings in a self-supervised way. The graph masked auto-encoder consists of an encoder and a decoder. The encoder produces output embeddings by propagating and aggregating graph features and the decoder reconstructs graph features to provide supervision signals for training. Such an encoder-decoder architecture maintains the contextual and semantic information within the generated embeddings, while its computation overhead is significantly reduced via masked learning.
The encoder of our graph representation module contains multiple stacked layers of graph attention networks (GAT) [30]. The function of a GAT layer is to generate output node embeddings according to both the features (initial embeddings) of the node itself and those of its neighbors. Differing from ordinary GNNs, GAT introduces an attention mechanism to measure the importance of those neighbors.
To explain in detail, one layer of GAT takes node embeddings generated by previous layers as input and propagates embeddings from source nodes to destination nodes as messages along the interactions. The message contains information about the source node and the interaction between source and destination:
$$MSG(src, dst)=W_{msg}^{T}\left(h_{src} \| emb_{e}\right)$$
The attention mechanism is then employed to calculate the attention coefficients between the message source and its destination:
$$\begin{aligned} \alpha(src, dst) & =\operatorname{LeakyReLU}\left(W_{as}^{T} h_{src}+W_{am} MSG(src, dst)\right), \\ a(src, dst) & =\operatorname{Softmax}(\alpha(src, dst)). \end{aligned}$$
Then, for the destination node, the GAT aggregates messages from incoming edges to update its node embedding by computing a weighted sum of all incoming messages. The weights are exactly the attention coefficients:
$$\begin{aligned} AGG\left(h_{dst}, h_{\mathcal{N}}\right) & =W_{self} h_{dst}+\sum_{i \in \mathcal{N}} a(i, dst) MSG(i, dst), \\ h_{n}^{l} & =AGG^{l}\left(h_{n}^{l-1}, h_{\mathcal{N}}^{l-1}\right) \end{aligned}$$
where $h_n^l$ is the hidden embedding of node $n$ at the $l$-th GAT layer, $h_n^{l-1}$ is that of layer $l-1$ and $\mathcal{N}_n$ is the one-hop neighborhood of $n$. The input of the first GAT layer is the initial node embeddings. $emb_e$ is the initial edge embedding and remains constant throughout the graph representation module. $W_{as}$, $W_{am}$, $W_{self}$ and $W_{msg}$ are trainable parameters. The updated node embedding forms a general abstraction of the node's one-hop interaction behavior.
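A plain-PyTorch sketch of one such layer over an edge list, following the three equations above (a library implementation such as DGL's GATConv differs in details; the per-destination softmax here is a simplified sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def segment_softmax(scores, dst, num_nodes):
    """Softmax over incoming edges, grouped by destination node."""
    exp = (scores - scores.max()).exp()
    denom = torch.zeros(num_nodes, device=scores.device).index_add_(0, dst, exp)
    return exp / (denom[dst] + 1e-16)

class GATLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W_msg = nn.Linear(2 * d, d, bias=False)   # W_msg
        self.W_as = nn.Linear(d, 1, bias=False)        # W_as
        self.W_am = nn.Linear(d, 1, bias=False)        # W_am
        self.W_self = nn.Linear(d, d, bias=False)      # W_self

    def forward(self, h, edge_index, emb_e):
        src, dst = edge_index                          # each of shape (num_edges,)
        msg = self.W_msg(torch.cat([h[src], emb_e], dim=-1))      # MSG(src, dst)
        alpha = F.leaky_relu(self.W_as(h[src]) + self.W_am(msg))  # attention logits
        a = segment_softmax(alpha.squeeze(-1), dst, h.size(0))    # a(src, dst)
        agg = torch.zeros_like(h).index_add_(0, dst, a.unsqueeze(-1) * msg)
        return self.W_self(h) + agg       # AGG: self term plus weighted message sum
```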
Multiple layers of such GATs are stacked to obtain the final node embedding $h$, which concatenates the original node embedding with the outputs of all GAT layers:
$$h_{n}=emb_{n}\left\|h_{n}^{1}\right\| \cdots \| h_{n}^{l}$$
where $\cdot\|\cdot$ denotes the concatenation operation. The more GAT layers are stacked, the wider the neighboring range and the farther the multi-hop interaction patterns a node's embedding is able to represent. Consequently, the graph encoder effectively incorporates node initial features and multi-hop interaction behaviors to abstract system entity behaviors into node embeddings. The graph encoder also applies average pooling to all node embeddings to generate a comprehensive embedding of the graph itself [31], which recapitulates the overall
state of the system:

$$h_{g}=\frac{1}{|N|} \sum_{n \in N} h_{n}$$

where $N$ is the set of all nodes in the graph.
The node embeddings and system state embeddings generated by the graph encoder are considered the output of the graph representation module and are used in subsequent tasks in different scenarios.
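Continuing the sketch, stacking such layers and reading out both embedding types could look like this (GATLayer is the illustrative layer from above):

```python
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    def __init__(self, d, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(GATLayer(d) for _ in range(num_layers))

    def forward(self, emb_n, edge_index, emb_e):
        outs, h = [emb_n], emb_n
        for layer in self.layers:
            h = layer(h, edge_index, emb_e)
            outs.append(h)
        h_n = torch.cat(outs, dim=-1)   # h_n = emb_n || h^1 || ... || h^l
        h_g = h_n.mean(dim=0)           # average pooling: system state embedding
        return h_n, h_g                 # node embeddings, system state embedding
```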
4.2.3 Graph Decoder
The graph encoder does not provide supervision signals that support model training. In typical graph auto-encoders [25, 27], a graph decoder is employed to decode node embeddings and supervise model training via feature reconstruction and structure reconstruction. Graph masked auto-encoders, however, abandon structure reconstruction to reduce computation overhead. Our graph decoder is a mixture of both, integrating masked feature reconstruction and sample-based structure reconstruction to construct an objective function that optimizes the graph representation module.
Given the node embeddings $h_n$ obtained from the graph encoder, the decoder first re-masks those masked nodes and transforms them into the input of masked feature reconstruction:
$$h_{n}^{*}= \begin{cases}W^{*} h_{n}, & n \notin \widetilde{N} \\ W^{*} v_{remask}, & n \in \widetilde{N}\end{cases}$$
Subsequently, the decoder uses a GAT layer similar to the one described above to reconstruct the initial embeddings of the masked nodes, allowing the calculation of a feature reconstruction loss:

$$\mathcal{L}_{fr}=\frac{1}{|\widetilde{N}|} \sum_{n \in \widetilde{N}}\left(1-\frac{x_{n}^{T} z_{n}}{\left\|x_{n}\right\| \cdot\left\|z_{n}\right\|}\right)^{\gamma}$$
where $\mathcal{L}_{fr}$ is the masked feature reconstruction loss, obtained by calculating a scaled cosine loss between the initial embeddings $x_n$ and the reconstructed embeddings $z_n$ of the masked nodes. This loss [19] scales dramatically between easy and difficult samples, which effectively speeds up learning. The degree of such scaling is controlled by a hyper-parameter $\gamma$.
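A compact sketch of this loss (following the scaled cosine error of GraphMAE [19]; tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def scaled_cosine_loss(x, z, masked, gamma=3.0):
    """x: initial embeddings; z: reconstructed embeddings; masked: node indices."""
    cos = F.cosine_similarity(x[masked], z[masked], dim=-1)  # one value per node
    return ((1.0 - cos) ** gamma).mean()  # gamma down-weights easy samples
```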
Meanwhile, sample-based structure reconstruction aims to reconstruct the graph structure (i.e., predict edges between nodes). Instead of reconstructing the whole adjacency matrix, which has $O(N^2)$ complexity, sample-based structure reconstruction applies contrastive sampling on node pairs and predicts edge probabilities between such pairs. Only non-masked nodes are involved in structure reconstruction. Positive samples are constructed from all existing edges between non-masked nodes and negative samples are drawn from node pairs with no existing edges between them.
A simple two-layer MLP is used to reconstruct edges between the sampled node pairs, generating one probability for each
sample. The reconstruction loss takes the form of a simple binary cross-entropy loss on those samples:

$$\mathcal{L}_{sr}=-\frac{1}{|S|} \sum_{(u, v) \in S}\left[y_{uv} \log \hat{y}_{uv}+\left(1-y_{uv}\right) \log \left(1-\hat{y}_{uv}\right)\right]$$

where $S$ is the set of sampled node pairs, $y_{uv}$ indicates whether an edge exists between $u$ and $v$, and $\hat{y}_{uv}$ is the predicted edge probability.
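A minimal sketch of this structure-reconstruction head (illustrative; the input dimension is that of the final node embeddings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeDecoder(nn.Module):
    """Two-layer MLP scoring the edge probability of a sampled node pair."""
    def __init__(self, d):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, h, pairs):
        u, v = pairs                                 # each of shape (num_samples,)
        return self.mlp(torch.cat([h[u], h[v]], dim=-1)).squeeze(-1)  # logits

def structure_loss(decoder, h, pos_pairs, neg_pairs):
    """Binary cross-entropy over positive (existing) and negative (sampled) pairs."""
    logits = torch.cat([decoder(h, pos_pairs), decoder(h, neg_pairs)])
    labels = torch.cat([torch.ones(pos_pairs[0].numel()),
                        torch.zeros(neg_pairs[0].numel())])
    return F.binary_cross_entropy_with_logits(logits, labels)
```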