MAGIC: Detecting Advanced Persistent Threats via Masked Graph Representation Learning
魔法：通过掩蔽图表示学习检测高级持续威胁

Zian Jia $^{1}$ , Yun Xiong $^{1}$ , Yuhong Nan $^{2 *}$ Yao Zhang $^{1}$ , Jinjing Zhao $^{3}$ , Mi Wen $^{4}$
贾子安 $^{1}$ , 熊云 $^{1}$ , 南玉红 $^{2 *}$ 张耀 $^{1}$ , 赵金晶 $^{3}$ , 文米 $^{4}$ $^{1}$ Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
上海数据科学重点实验室，复旦大学计算机科学学院，中国 $^{2}$ School of Software Engineering, Sun Yat-sen University, China
中山大学软件工程学院，中国 $^{3}$ National Key Laboratory of Science and Technology on Information System Security, China
中国信息系统安全国家重点实验室 $^{4}$ Shanghai University of Electric Power, China
$^{4}$ 上海电力学院，中国

Abstract 摘要

Advance Persistent Threats (APTs), adopted by most delicate attackers, are becoming increasing common and pose great threat to various enterprises and institutions. Data provenance analysis on provenance graphs has emerged as a common approach in APT detection. However, previous works have exhibited several shortcomings: (1) requiring attack-containing data and a priori knowledge of APTs, (2) failing in extracting the rich contextual information buried within provenance graphs and (3) becoming impracticable due to their prohibitive computation overhead and memory consumption.
高级持续威胁（APT）被大多数精细攻击者采用，正变得越来越普遍，并对各种企业和机构构成重大威胁。基于来源图的数据来源分析已成为 APT 检测中的一种常见方法。然而，以前的工作存在几个缺点：（1）需要包含攻击的数据和对 APT 的先验知识，（2）未能提取埋藏在来源图中的丰富上下文信息，以及（3）由于其高昂的计算开销和内存消耗而变得不可行。

In this paper, we introduce MAGIC, a novel and flexible self-supervised APT detection approach capable of performing multi-granularity detection under different level of supervision. MAGIC leverages masked graph representation learning to model benign system entities and behaviors, performing efficient deep feature extraction and structure abstraction on provenance graphs. By ferreting out anomalous system behaviors via outlier detection methods, MAGIC is able to perform both system entity level and batched log level APT detection. MAGIC is specially designed to handle concept drift with a model adaption mechanism and successfully applies to universal conditions and detection scenarios. We evaluate MAGIC on three widely-used datasets, including both real-world and simulated attacks. Evaluation results indicate that MAGIC achieves promising detection results in all scenarios and shows enormous advantage over state-of-the-art APT detection approaches in performance overhead.
在本文中，我们介绍了 MAGIC，这是一种新颖且灵活的自监督 APT 检测方法，能够在不同的监督级别下执行多粒度检测。MAGIC 利用掩蔽图表示学习来建模良性系统实体和行为，在来源图上进行高效的深度特征提取和结构抽象。通过利用异常检测方法识别异常系统行为，MAGIC 能够在系统实体级别和批量日志级别进行 APT 检测。MAGIC 特别设计用于处理概念漂移，具有模型适应机制，并成功应用于通用条件和检测场景。我们在三个广泛使用的数据集上评估 MAGIC，包括真实世界和模拟攻击。评估结果表明，MAGIC 在所有场景中都取得了良好的检测结果，并在性能开销方面显示出相对于最先进的 APT 检测方法的巨大优势。

1 Introduction 1 引言

Advanced Persistent Threats (APTs) are intentional and sophisticated cyber-attacks conducted by skilled attackers and pose great threat to both enterprises and institutions [1]. Most APTs involve zero-day vulnerabilities and are especially difficult to detect due to their stealthy and changeful nature.
高级持续性威胁（APT）是由熟练攻击者进行的有意且复杂的网络攻击，对企业和机构构成了重大威胁[1]。大多数 APT 涉及零日漏洞，由于其隐蔽和多变的特性，特别难以检测。

Recent works [2-18] on APT detection leverage data provenance to perform APT detection. Data provenance transforms audit logs into provenance graphs, which extract the rich contextual information from audit logs and provide a perfect platform for fine-grained causality analysis and APT detection. Early works [2-6] construct rules based on typical or specific APT patterns and match audit logs against those rules to detect potential APTs. Several recent works [7-9] adopt a statistical anomaly detection approach to detect APTs focusing on different provenance graph elements, e.g, system entities, interactions and communities. Most recent works [10-18], however, are deep learning-based approaches. They utilize various deep learning (DL) techniques to model APT patterns or system behaviors and perform APT detection in a classification or anomaly detection style.
最近的研究[2-18]利用数据来源进行 APT 检测。数据来源将审计日志转化为来源图，从中提取丰富的上下文信息，为细粒度因果分析和 APT 检测提供了完美的平台。早期的研究[2-6]基于典型或特定的 APT 模式构建规则，并将审计日志与这些规则进行匹配以检测潜在的 APT。一些最近的研究[7-9]采用统计异常检测方法，专注于不同的来源图元素，例如系统实体、交互和社区。然而，最近的研究[10-18]主要是基于深度学习的方法。它们利用各种深度学习（DL）技术来建模 APT 模式或系统行为，并以分类或异常检测的方式进行 APT 检测。

While these existing approaches have demonstrated their capability to detect APTs with reasonable accuracy, they encounter various combinations of the following challenges: (1) Supervised methods suffer from lack-of-data (LOD) problem as they require a priori knowledge about APTs (i.e. attack patterns or attack-containing logs). In addition, these methods are particularly vulnerable when confronted with new types of APTs they are not trained to deal with. (2) Statistics-based methods only require benign data to function, but they fail to extract the deep semantics and correlation of complex benign activities buried in audit logs, resulting in high false positive rate. (3) DL-based methods, especially sequence-based and graph-based approaches, have achieved promising effectiveness at the cost of heavy computation overhead, rendering them impractical in real-life detection scenarios.
虽然这些现有的方法已经证明了它们能够以合理的准确性检测 APT，但它们面临以下各种挑战的组合：(1) 监督方法由于需要关于 APT 的先验知识（即攻击模式或包含攻击的日志），因此遭遇缺乏数据（LOD）问题。此外，当面对它们未经过训练的新类型 APT 时，这些方法特别脆弱。(2) 基于统计的方法只需要良性数据就能运行，但它们无法提取埋藏在审计日志中的复杂良性活动的深层语义和关联，导致高误报率。(3) 基于深度学习的方法，特别是基于序列和图的 approaches，虽然在有效性上取得了可喜的成果，但代价是计算开销巨大，使它们在现实检测场景中不切实际。

In this paper, we address the above three issues by introducing MAGIC, a novel self-supervised APT detection approach that leverages masked graph representation learning and simple outlier detection methods to identify key attack system entities from massive audit logs. MAGIC first constructs the provenance graph from audit logs in simple yet universal steps. MAGIC then employs a graph representation module that obtains embeddings by incorporating graph features and structural information in a self-supervised way. The model
在本文中，我们通过引入 MAGIC，解决上述三个问题。MAGIC 是一种新颖的自监督 APT 检测方法，利用掩蔽图表示学习和简单的异常检测方法，从海量审计日志中识别关键攻击系统实体。MAGIC 首先通过简单而通用的步骤从审计日志构建源图。然后，MAGIC 采用图表示模块，通过以自监督的方式结合图特征和结构信息来获取嵌入。该模型
is built upon graph masked auto-encoders [19] under the joint supervision of both masked feature reconstruction and sample-based structure reconstruction. An unsupervised outlier detection method is employed to analyze the computed embeddings and attain the final detection result.
基于图形掩码自编码器[19]，在掩码特征重建和基于样本的结构重建的联合监督下构建。采用无监督异常检测方法分析计算得到的嵌入，并获得最终检测结果。

MAGIC is designed to be flexible and scalable. Depending on the application background, MAGIC is able to perform multi-granularity detection, i.e., detecting APT existence in batched logs or locating entity-level adversaries. Although MAGIC is designed to perform APT detection without attackcontaining data, it is well-suited for semi-supervised and fullysupervised conditions. Furthermore, MAGIC also contains an optional model adaption mechanism which provides a feedback channel for its users. Such feedback is important for MAGIC to further improve its performance, combat concept drift and reduce false positives.
MAGIC 被设计为灵活且可扩展。根据应用背景，MAGIC 能够执行多粒度检测，即在批量日志中检测 APT 的存在或定位实体级对手。尽管 MAGIC 被设计为在没有攻击数据的情况下进行 APT 检测，但它非常适合半监督和完全监督的条件。此外，MAGIC 还包含一个可选的模型适应机制，为用户提供反馈通道。这种反馈对 MAGIC 进一步提高性能、应对概念漂移和减少误报非常重要。

We implement MAGIC and evaluate its performance and overhead on three different APT attack datasets: the DARPA Transparent Computing E3 datasets [20], the StreamSpot dataset [21] and the Unicorn Wget dataset [22]. The DARPA datasets contain real-world attacks while the StreamSpot and Unicorn Wget dataset are fully simulated in controlled environments. Evaluation results show that MAGIC is able to perform entity-level APT detection with

97.26 %

precision and

99.91 %

recall as well as minimum overhead, less memory demanding and significantly faster than state-of-the-art approaches (e.g. 51 times faster than ShadeWatcher [18]).
我们实现了 MAGIC，并在三个不同的 APT 攻击数据集上评估其性能和开销：DARPA 透明计算 E3 数据集[20]、StreamSpot 数据集[21]和 Unicorn Wget 数据集[22]。DARPA 数据集包含真实世界的攻击，而 StreamSpot 和 Unicorn Wget 数据集则是在受控环境中完全模拟的。评估结果表明，MAGIC 能够以

97.26 %

的精确度和

99.91 %

的召回率执行实体级 APT 检测，同时开销最小，内存需求较低，并且比最先进的方法显著更快（例如，比 ShadeWatcher [18]快 51 倍）。

To benefit future research and encourage further improvement on MAGIC, we make our implementation of MAGIC and our pre-processed datasets open to public

^{1}

. In summary, this paper makes the following contributions:
为了促进未来的研究并鼓励对 MAGIC 的进一步改进，我们将 MAGIC 的实现和我们预处理的数据集向公众开放

^{1}

。总之，本文做出了以下贡献：

We propose MAGIC, a universal APT detection approach based on masked graph representation learning and outlier detection methods, capable of performing multi-granularity detection on massive audit logs.
我们提出了 MAGIC，一种基于掩蔽图表示学习和异常检测方法的通用 APT 检测方法，能够对海量审计日志进行多粒度检测。
We ensure MAGIC’s practicability by minimizing its computation overhead with extended graph masked auto-encoders, allowing MAGIC to complete training and detection in acceptable time even under tight conditions.
我们通过扩展图掩码自编码器来最小化 MAGIC 的计算开销，从而确保其可行性，使 MAGIC 即使在紧张条件下也能在可接受的时间内完成训练和检测。
We secure MAGIC’s universality with various efforts. We leverage masked graph representation learning and outlier detection methods, enabling MAGIC to perform precise detection under different supervision levels, in different detection granularity and with audit logs from various sources.
我们通过各种努力确保 MAGIC 的普遍性。我们利用掩蔽图表示学习和异常检测方法，使 MAGIC 能够在不同的监督级别、不同的检测粒度以及来自各种来源的审计日志下进行精确检测。
We evaluate MAGIC on three widely-used datasets, involving both real-world and simulated APT attacks. Evaluation results show that MAGIC detects APTs with promising results and minimum computation overhead.
我们在三个广泛使用的数据集上评估 MAGIC，这些数据集涉及真实世界和模拟的 APT 攻击。评估结果表明，MAGIC 能够以良好的效果检测 APT，并且计算开销最小。
We provide an open source implementation of MAGIC to benefit future research in the community and encourage further improvement on our approach.
我们提供 MAGIC 的开源实现，以惠及社区未来的研究，并鼓励对我们的方法进行进一步改进。

Figure 1: The provenance graph of a real-world APT attack, exploiting the Pine Backdoor vulnerability. All attackirrelevant entities and interactions have been removed from the provenance graph.
图 1：一个真实世界 APT 攻击的来源图，利用了 Pine Backdoor 漏洞。所有与攻击无关的实体和交互已从来源图中删除。

2 Background 2 背景

2.1 Motivating Example 2.1 激励示例

Here we provide a detailed illustration of an APT scenario that we use throughout the paper. Pine backdoor with Drakon Dropper is an APT attack from the DARPA Engagement 3 Trace dataset [20]. During the attack, an attacker constructs a malicious executable (/tmp/tcexec) and sends it to the target host via a phishing e-mail. The user then unconsciously downloads and opens the e-mail. Contained within the e-mail is an executable designed to perform a port-scan for internal reconnaissance and establish a silent connection between the target host and the attacker. Figure 1 displays the provenance graph of our motivation example. Nodes in the graph represent system entities and arrows represent directed interactions between entities. The graph shown is a subgraph abstracted from the complete provenance graph by removing most attack-irrelevant entities and interactions. Different node shape corresponds to different type of entities. Entities covered in stripes are considered malicious ones.
在这里，我们提供了一个详细的 APT 场景说明，贯穿整篇论文。Pine 后门与 Drakon Dropper 是来自 DARPA Engagement 3 Trace 数据集的 APT 攻击。在攻击过程中，攻击者构建了一个恶意可执行文件（/tmp/tcexec），并通过钓鱼电子邮件将其发送到目标主机。用户随后无意识地下载并打开了电子邮件。电子邮件中包含一个可执行文件，旨在进行内部侦察的端口扫描，并在目标主机与攻击者之间建立一个隐秘连接。图 1 展示了我们动机示例的来源图。图中的节点代表系统实体，箭头表示实体之间的定向交互。所示图形是从完整来源图中抽象出的子图，通过去除大多数与攻击无关的实体和交互。不同的节点形状对应不同类型的实体。被条纹覆盖的实体被视为恶意实体。

2.2 Prior Research and their Limitations
2.2 先前研究及其局限性

Supervised Methods. For early works [2-6], special heuristic rules need to be constructed to cover all attack patterns. Many DL-based APT detection methods [11,14-16,18] construct provenance graphs based on both benign and attackcontaining data and detect APTs in a classification style. These supervised methods can achieve almost perfect detection results on learned attack patterns but are especially vulnerable facing concept drift or unseen attack patterns. Moreover, for rule-based methods, the construction and maintenance of heuristic rules can be very expensive and time-consuming. And for DL-based methods, the scarcity of attack-containing data is preventing these supervised methods from being actually deployable. To address the above issue, MAGIC adopts a fully self-supervised anomaly detection style, allowing the absence of attack-containing data while effectively dealing
监督方法。对于早期的工作[2-6]，需要构建特殊的启发式规则以覆盖所有攻击模式。许多基于深度学习的 APT 检测方法[11,14-16,18]基于良性和包含攻击的数据构建溯源图，并以分类的方式检测 APT。这些监督方法在学习到的攻击模式上几乎可以实现完美的检测结果，但在面对概念漂移或未见过的攻击模式时尤其脆弱。此外，对于基于规则的方法，构建和维护启发式规则可能非常昂贵且耗时。而对于基于深度学习的方法，包含攻击的数据稀缺使得这些监督方法无法实际部署。为了解决上述问题，MAGIC 采用完全自监督的异常检测风格，允许在缺乏包含攻击的数据的情况下有效处理。
with unseen attack patterns.
具有未见的攻击模式。
Statistics-based Methods. Most recent statistics-based methods [7-9] detect APT signals by identifying system entities, interactions and communities based on their rarity or anomaly scores. However, the rarity of system entities may not necessarily indicate their abnormality and anomaly scores, obtained via causal analysis or label propagation, are shallow feature extraction on provenance graphs. To illustrate, the process tcexec performs multiple portscan operations on different IP addresses in our motivating example (See Figure 1), which may be considered as a normal system behavior. However, taking into consideration that process tcexec, derived from the external network, also reads sensitive system information (uname) and makes connection with public IP addresses (162.66.239.75), we can easily identify tcexec as a malicious entity. Failure to extract deep semantics and correlations between system entities often results in low detection performance and high false positive rate of statistics-based methods. MAGIC, however, employs a graph representation module to perform deep graph feature extraction on provenance graphs, resulting in high-quality embeddings.
基于统计的方法。最近的基于统计的方法[7-9]通过识别系统实体、交互和社区的稀有性或异常分数来检测 APT 信号。然而，系统实体的稀有性并不一定表示它们的异常性，而通过因果分析或标签传播获得的异常分数只是对来源图的浅层特征提取。举例来说，在我们的动机示例中，进程 tcexec 对不同 IP 地址执行多个端口扫描操作（见图 1），这可能被视为正常的系统行为。然而，考虑到进程 tcexec 源自外部网络，还读取敏感的系统信息（uname）并与公共 IP 地址（162.66.239.75）建立连接，我们可以很容易地将 tcexec 识别为恶意实体。未能提取系统实体之间的深层语义和关联通常会导致基于统计的方法检测性能低下和高误报率。然而，MAGIC 采用图表示模块对来源图进行深层图特征提取，从而生成高质量的嵌入。
DL-based Methods. Recently, DL-based APT detection methods, no matter supervised or unsupervised, are producing very promising detection results. However, in reality, hundreds of GB of audit logs are produced every day in a medium-size enterprise [23]. Consequently, DL-based methods, especially sequence-based

[11, 14, 24]

and graphbased

[10, 12, 15 - 18]

methods, are impracticable due to their computation overhead. For instance, ATLAS [11] takes an average 1 hour to train on 676 MB of audit logs and ShadeWatcher [18] takes 1 day to train on the DARPA E3 Trace dataset with GPU available. Besides, some graph autoencoder [25-27] based methods encounter explosive memory overhead problem when the scale of provenance graphs expands. MAGIC avoids to be computationally demanding by introducing graph masked auto-encoders and completes its training on the DARPA E3 Trace dataset in mere minutes. Detailed evaluation of MAGIC’s performance overhead is presented in Sec. 6.4.
基于深度学习的方法。最近，基于深度学习的 APT 检测方法，无论是监督还是无监督，都产生了非常有前景的检测结果。然而，在现实中，中型企业每天会产生数百 GB 的审计日志[23]。因此，基于深度学习的方法，特别是基于序列

[11, 14, 24]

和基于图

[10, 12, 15 - 18]

的方法，由于其计算开销，变得不可行。例如，ATLAS[11]在 676MB 的审计日志上训练平均需要 1 小时，而 ShadeWatcher[18]在可用 GPU 的情况下，在 DARPA E3 Trace 数据集上训练需要 1 天。此外，一些基于图自编码器[25-27]的方法在源图规模扩大时会遇到爆炸性的内存开销问题。MAGIC 通过引入图掩码自编码器避免了计算需求过高的问题，并在 DARPA E3 Trace 数据集上仅用几分钟完成训练。MAGIC 性能开销的详细评估在第 6.4 节中呈现。
End-to-end Approaches. Beyond the three major limitations discussed above, it is also worth to mention that most recent APT detection approaches

[11, 17, 18]

are end-to-end detectors and focus on one specific detection task. For instance, ATLAS [11] focused on end-to-end attack reconstruction and Unicorn [10] yields system-level alarms from streaming logs. Instead, MAGIC’s approach is universal and performs multigranularity APT detection under various detection scenarios, which can also be applied to audit logs collected from different sources.
端到端方法。除了上述讨论的三个主要限制之外，值得一提的是，最近的大多数 APT 检测方法

[11, 17, 18]

都是端到端检测器，并专注于一个特定的检测任务。例如，ATLAS [11] 专注于端到端攻击重建，而 Unicorn [10] 从流日志中生成系统级警报。相反，MAGIC 的方法是通用的，并在各种检测场景下执行多粒度 APT 检测，这也可以应用于从不同来源收集的审计日志。

2.3 Threat Model and Definitions
2.3 威胁模型和定义

We first present the threat model we use throughout the paper and then formally define key concepts that are crucial to
我们首先介绍在整篇论文中使用的威胁模型，然后正式定义对关键概念至关重要的内容
understanding how MAGIC performs APT detection.
理解 MAGIC 如何执行 APT 检测。
Threat Model. We assume that attackers come from outside a system and target valuable information within the system. An attacker may perform sophisticated steps to achieve his goal but leaves trackable evidence in logs. The combination of the system hardware, operating system and system audit softwares is our trusted computing base. Poison attacks and evasion attacks are not considered in our threat model.
威胁模型。我们假设攻击者来自系统外部，目标是系统内的有价值信息。攻击者可能会采取复杂的步骤来实现其目标，但会在日志中留下可追踪的证据。系统硬件、操作系统和系统审计软件的组合是我们可信计算基础。毒药攻击和规避攻击不在我们的威胁模型中考虑。
Provenance Graph. A provenance graph is a directed cyclic graph extracted from raw audit logs. Constructing a provenance graph is common practice in data provenance, as it connects system entities and presents the interaction relationships between them. A provenance graph contains nodes representing different system entities (e.g., processes, files and sockets) and edges representing interactions between system entities (e.g., execute and connect), labeled with their types. For example, /tmp/tcexec is a FileObject system entity and the edge between /tmp/tcexec and tcexec is an execute operation from a FileObject targeting a Process (See Figure 1).
来源图。来源图是从原始审计日志中提取的有向循环图。构建来源图是数据来源中的常见做法，因为它连接系统实体并展示它们之间的交互关系。来源图包含表示不同系统实体的节点（例如，进程、文件和套接字）以及表示系统实体之间交互的边（例如，执行和连接），并标注其类型。例如，/tmp/tcexec 是一个 FileObject 系统实体，而 /tmp/tcexec 和 tcexec 之间的边是一个从 FileObject 目标指向进程的执行操作（见图 1）。
Multi-granularity Detection. MAGIC is capable to perform APT detection at two-granularity: batched log level and system entity level. MAGIC’s multi-granularity detection ability gives rises to a two-stage detection approach: first conduct batched log level detection on streaming batches of logs, and then perform system entity level detection on positive batches to identify detailed detection results. Applying this approach to real-world settings will effectively reduce workload, resource consumption and false positives and, in the meantime, produce detailed outcomes.
多粒度检测。MAGIC 能够在两种粒度下进行 APT 检测：批量日志级别和系统实体级别。MAGIC 的多粒度检测能力产生了两阶段检测方法：首先对流式日志的批量进行批量日志级别检测，然后对正批量进行系统实体级别检测，以识别详细的检测结果。将这种方法应用于现实世界的环境中，将有效减少工作量、资源消耗和误报，同时产生详细的结果。

Batched log level detection. Under this granularity of APT detection, the major task is given batched audit logs from a consistent source, MAGIC alerts if a potential APT is detected in a batch of logs. Similar to Unicorn [10], MAGIC does not accurately locate malicious system entities and interactions under this granularity of detection.
批量日志级别检测。在这种 APT 检测的粒度下，主要任务是从一致的来源获取批量审计日志，如果在一批日志中检测到潜在的 APT，MAGIC 会发出警报。与 Unicorn [10]类似，MAGIC 在这种检测粒度下无法准确定位恶意系统实体和交互。
System entity level detection. The detection task under this granularity of APT detection is given audit logs from a consistent source, MAGIC is able to accurately locate malicious system entities in those audit logs. Identification of key system entities during APTs is vital to subsequent tasks such as attack investigation and attack story recovery as it provides explicable detection results and reduces the need for domain experts as well as redundant manual efforts [11].
系统实体级检测。在这种 APT 检测粒度下的检测任务中，MAGIC 能够从一致来源的审计日志中准确定位恶意系统实体。在 APT 期间识别关键系统实体对于后续任务（如攻击调查和攻击故事恢复）至关重要，因为它提供了可解释的检测结果，并减少了对领域专家的需求以及冗余的人工努力[11]。

3 MAGIC Overview 3 MAGIC 概述

MAGIC is a novel self-supervised APT detection approach that leverages masked graph representation learning and outlier detection methods and is capable of efficiently performing multi-granularity detection on massive audit logs. MAGIC’s pipeline consists of three main components: (1) provenance graph construction, (2) a graph representation module and (3) a detection module. It also provides an optional (4)
MAGIC 是一种新颖的自监督 APT 检测方法，利用了掩蔽图表示学习和异常检测方法，能够高效地对海量审计日志进行多粒度检测。MAGIC 的流程包括三个主要组件：（1）来源图构建，（2）图表示模块和（3）检测模块。它还提供了一个可选的（4）

Figure 2: Overview of MAGIC’s detection pipeline.
图 2：MAGIC 检测流程概述。
model adaption mechanism. During training, MAGIC transforms training data with (1), learns graph embedding by (2) and memorizes benign behaviors in (3). During inference, MAGIC transforms target data with (1), obtains graph embedding with the trained (2) and detects outliers through (3). Figure 2 gives an overview of the MAGIC architecture.
模型适应机制。在训练过程中，MAGIC 通过(1)转换训练数据，通过(2)学习图嵌入，并在(3)中记忆良性行为。在推理过程中，MAGIC 通过(1)转换目标数据，通过训练好的(2)获得图嵌入，并通过(3)检测异常值。图 2 给出了 MAGIC 架构的概述。

Streaming audit logs collected by system auditing softwares are usually stored in batches. During provenance graph construction (1), MAGIC transforms these logs into static provenance graphs. System entities and interactions between them are extracted and converted into nodes and edges respectively. Several complexity reduction techniques are utilized to remove redundant information.
系统审计软件收集的流式审计日志通常以批量方式存储。在来源图构建过程中（1），MAGIC 将这些日志转换为静态来源图。系统实体及其之间的交互被提取并分别转换为节点和边。采用多种复杂性降低技术来去除冗余信息。

The constructed provenance graphs are then fed through the graph representation module (2) to obtain output embeddings (i.e. comprehensive vector representations of objects). Built upon graph masked auto-encoders and integrating samplebased structure reconstruction, the graph representation module embeds, propagates and aggregates node and edge attributes into output embeddings, which contain both node embeddings and the system state embedding.
构建的来源图随后通过图表示模块（2）进行处理，以获得输出嵌入（即对象的综合向量表示）。该图表示模块基于图掩码自编码器，并整合基于样本的结构重建，嵌入、传播和聚合节点和边属性到输出嵌入中，这些嵌入包含节点嵌入和系统状态嵌入。

The graph representation module is trained with only benign audit logs to model benign system behaviors. When performing APT detection on potentially attack-containing audit logs, MAGIC utilizes outlier detection methods based on the output embeddings to detect outliers in system behaviors (3). Depending on the granularity of the task, different embeddings are used to complete APT detection. On batched log level tasks, the system state embeddings, which reflect the general behaviors of the whole system, are the detection targets. An outlier in such embeddings means its corresponding system state is unseen and potentially malicious, which reveals an APT signal in that batch. On system entity level tasks, the detection targets are those node embeddings, which
图形表示模块仅使用良性审计日志进行训练，以建模良性系统行为。在对可能包含攻击的审计日志进行 APT 检测时，MAGIC 利用基于输出嵌入的异常检测方法来检测系统行为中的异常（3）。根据任务的粒度，使用不同的嵌入来完成 APT 检测。在批量日志级别任务中，系统状态嵌入反映了整个系统的一般行为，是检测目标。这种嵌入中的异常意味着其对应的系统状态是未见过的，可能是恶意的，这在该批次中揭示了 APT 信号。在系统实体级别任务中，检测目标是那些节点嵌入，
represent the behaviors of system entities. Outliers in node embeddings indicates suspicious system entities and detects APT threats in finer granularity.
表示系统实体的行为。节点嵌入中的异常值指示可疑的系统实体，并以更细的粒度检测 APT 威胁。

In real-world detection settings, MAGIC has two predesigned applications. For each batch of logs collected by system auditing softwares, one can either directly utilize MAGIC’s entity level detection to accurately identify malicious entities within the batch, or perform a two-stage detection, as stated in Sec. 2.3. In this case, MAGIC first scans a batch and sees if malicious signals exist in the batch (batched log level detection). If it alerts positive, MAGIC then performs entity level detection to identify malicious system entities in finer granularity. Batched log level detection is significantly less computationally demanding than entity level detection. Therefore, such a two-stage routine can help MAGIC’s users to save computational resource and avoid false alarms without affecting MAGIC’s detection fineness. However, if users favor fine-grain detection on all system entities, the former routine is still an accessible option.
在现实世界的检测环境中，MAGIC 有两个预设计的应用程序。对于系统审计软件收集的每一批日志，可以直接利用 MAGIC 的实体级检测准确识别批次中的恶意实体，或者执行如第 2.3 节所述的两阶段检测。在这种情况下，MAGIC 首先扫描一批日志，查看该批次中是否存在恶意信号（批次日志级别检测）。如果检测结果为正，MAGIC 将进行实体级检测，以更细粒度识别恶意系统实体。批次日志级别检测的计算需求显著低于实体级检测。因此，这种两阶段的流程可以帮助 MAGIC 的用户节省计算资源，避免误报，而不影响 MAGIC 的检测精度。然而，如果用户更倾向于对所有系统实体进行细粒度检测，前一种流程仍然是一个可行的选择。

To deal with concept drift and unseen attacks, an optional model adaption mechanism is employed to provide feedback channels for its users (4). Detection results checked and confirmed by security analysts are fed back to MAGIC, helping it to adapt to benign system behavior changes in a semisupervised way. Under such conditions, MAGIC achieves even more promising detection results, which is discussed in Sec. 6.3. Furthermore, MAGIC can be easily applied to real-world online APT detection thanks to it’s ability to adapt itself to concept drift and its minimum computation overhead.
为了应对概念漂移和未见攻击，采用了一种可选的模型适应机制，为用户提供反馈渠道（4）。由安全分析师检查和确认的检测结果反馈给 MAGIC，帮助其以半监督的方式适应良性系统行为的变化。在这种情况下，MAGIC 实现了更有前景的检测结果，具体讨论见第 6.3 节。此外，由于 MAGIC 能够适应概念漂移并且计算开销最小，它可以轻松应用于现实世界的在线 APT 检测。

4 Design Details 4 设计细节

In this section, we explain in detail how MAGIC performs efficient APT detection on massive audit logs. MAGIC con-
在本节中，我们详细解释 MAGIC 如何在海量审计日志上执行高效的 APT 检测。MAGIC con-
tains four major components: a graph construction phase that builds optimised and consistent provenance graphs (Sec. 4.1), a graph representation module that produces output embeddings with maximum efficiency (Sec. 4.2), a detection module that utilizes outlier detection methods to perform APT detection (Sec. 4.3) and a model adaption mechanism to deal with concept drift and other high-quality feedbacks (Sec. 4.4).
包含四个主要组件：一个构建优化和一致的来源图的图构建阶段（第 4.1 节），一个生成最大效率输出嵌入的图表示模块（第 4.2 节），一个利用异常检测方法进行 APT 检测的检测模块（第 4.3 节），以及一个处理概念漂移和其他高质量反馈的模型适应机制（第 4.4 节）。

4.1 Provenance Graph Construction
4.1 来源图构建

MAGIC first constructs a provenance graph out of raw audit logs before performing graph representation and APT detection. We follow three steps to construct a consistent and optimised provenance graph ready for graph representation.
MAGIC 首先从原始审计日志中构建一个溯源图，然后进行图表示和 APT 检测。我们遵循三个步骤来构建一个一致且优化的溯源图，以便进行图表示。
Log Parsing. The first step is to simply parse each log entry, extract system entities and system interactions between them. Then, a prototype provenance graph can be constructed with system entities as nodes and interactions as edges. Now we extract categorical information regarding nodes and edges. For simple log format that provides entity and interaction labels, we directly utilize these labels. For some format that provides complicated attributes of those entities and interactions, we apply multi-label hashing (e.g., xxhash [28]) to transform attributes into labels. At this stage, the provenance graph is a directed multi-graph. We designed an example to demonstrate how we deal with the raw provenance graph after log parsing in Figure 3.
日志解析。第一步是简单地解析每个日志条目，提取系统实体及其之间的系统交互。然后，可以构建一个原始来源图，以系统实体作为节点，交互作为边。现在我们提取有关节点和边的分类信息。对于提供实体和交互标签的简单日志格式，我们直接使用这些标签。对于一些提供复杂属性的实体和交互的格式，我们应用多标签哈希（例如，xxhash [28]）将属性转换为标签。在这个阶段，来源图是一个有向多重图。我们设计了一个示例，以演示在日志解析后如何处理原始来源图，如图 3 所示。
Initial Embedding. In this stage, we transform node and edge labels into fixed-size feature vector (i.e., initial embedding) of dimension

d

, where

d

is the hidden dimension of our graph representation module. We apply a lookup embedding, which establish an one-to-one mapping between node/edge labels to

d

-dimension feature vectors. As demonstrated in Figure 3 (I and II), process

a

and

b

share the same label, so they are mapped to the same feature vector, while

a

and

c

are embedded into different feature vectors as they have different labels. We note that the possible number of unique node/edge labels is determined by the data source (i.e., auditing log format). Therefore, the lookup embedding works under a transductive setting and do not need to learn embeddings for unseen labels.
初始嵌入。在这个阶段，我们将节点和边的标签转换为固定大小的特征向量（即初始嵌入），其维度为

d

，其中

d

是我们图表示模块的隐藏维度。我们应用查找嵌入，建立节点/边标签与

d

维特征向量之间的一对一映射。如图 3（I 和 II）所示，过程

a

和

b

共享相同的标签，因此它们被映射到相同的特征向量，而

a

和

c

则嵌入到不同的特征向量中，因为它们具有不同的标签。我们注意到，唯一节点/边标签的可能数量由数据源（即审计日志格式）决定。因此，查找嵌入在传导设置下工作，不需要为未见标签学习嵌入。
Noise Reduction. The expected input provenance graph of our graph representation module would be simple-graphs. Thus, we need to combine multiple edges between node pairs. If multiple edges of the same label (also sharing the same initial embedding) exist between a pair of nodes, we remove redundant edges so that only one of them remains. Then we combine the remaining edges into one final edge. We note that between a pair of nodes, edges of several different labels may remain. After the combination, the initial embedding of the resulting unique edge is obtained by averaging the initial embeddings of the remaining edges. To illustrate, we show how our noise reduction combines multi-edges and how it affects the edge initial embeddings in Figure 3 (II an III). First, three read and two write interactions between

a

and

c

噪声减少。我们图表示模块的预期输入来源图将是简单图。因此，我们需要在节点对之间结合多个边。如果在一对节点之间存在多个相同标签的边（也共享相同的初始嵌入），我们将删除冗余边，以便只保留其中一条。然后，我们将剩余的边合并为一条最终边。我们注意到，在一对节点之间，可能会保留几种不同标签的边。合并后，结果唯一边的初始嵌入是通过对剩余边的初始嵌入进行平均得到的。为了说明，我们展示了我们的噪声减少如何结合多条边以及它如何影响边的初始嵌入，如图 3（II 和 III）所示。首先，在

a

和

c

之间有三次读取和两次写入交互。

Figure 3: Example of MAGIC’s graph construction steps.
图 3：MAGIC 图构建步骤的示例。
are merged into one for each label. Then we combine them together, forming one edge

e_{a c}

with initial embedding equal to the average initial embedding of the remaining edges (

e_{2}^{'}

and

e_{5}^{'})

. We provide a comparison between our noise reduction steps and previous works in Appendix E.
每个标签合并为一个。然后我们将它们组合在一起，形成一条边

e_{a c}

，其初始嵌入等于剩余边的平均初始嵌入（

e_{2}^{'}

和

e_{5}^{'})

）。我们在附录 E 中提供了我们的噪声减少步骤与之前工作的比较。

After conducting the above three steps, MAGIC has finished constructing a consistent and information-preserving provenance graph ready for subsequent tasks. During provenance graph construction, little information is lost as MAGIC only damages the original semantics by generalizing detailed descriptions of system entities and interactions into labels. However, an average

79.60 %

of all edges are reduced on the DARPA E3 Trace dataset, saving MAGIC’s training time and memory consumption.
在完成上述三个步骤后，MAGIC 已经构建了一个一致且保留信息的溯源图，准备进行后续任务。在溯源图构建过程中，几乎没有信息丢失，因为 MAGIC 仅通过将系统实体和交互的详细描述概括为标签来损害原始语义。然而，在 DARPA E3 Trace 数据集上，所有边的平均

79.60 %

被减少，从而节省了 MAGIC 的训练时间和内存消耗。

4.2 Graph Representation Module
4.2 图形表示模块

MAGIC employs a graph representation module to obtain high-quality embeddings from featured provenance graphs. As illustrated in Figure 4, the graph representation module consists of three phases: a masking procedure to partially hide node features (i.e. initial embeddings) for reconstruction purpose (Sec. 4.2.1), a graph encoder that produces node and system state output embeddings by propagating and aggregating graph features (Sec. 4.2.2), a graph decoder that provides supervision signals for the training of the graph representation module via masked feature reconstruction and sample-based structure reconstruction (Sec. 4.2.3). The encoder and decoder form a graph masked auto-encoder, which excels in
MAGIC 采用图表示模块从特征来源图中获取高质量的嵌入。如图 4 所示，图表示模块由三个阶段组成：一个掩蔽过程，用于部分隐藏节点特征（即初始嵌入）以进行重建（第 4.2.1 节），一个图编码器，通过传播和聚合图特征生成节点和系统状态输出嵌入（第 4.2.2 节），一个图解码器，通过掩蔽特征重建和基于样本的结构重建为图表示模块的训练提供监督信号（第 4.2.3 节）。编码器和解码器形成一个图掩蔽自编码器，表现出色。

Figure 4: Graph representation module of MAGIC.
图 4：MAGIC 的图形表示模块。
producing fast and resource-saving embeddings.
生成快速且节省资源的嵌入。

4.2.1 Feature Masking 4.2.1 特征屏蔽

Before training our graph representation module, we perform masking on nodes, so that the graph masked auto-encoder can be trained upon reconstruction of these nodes. Masked nodes are randomly chosen, covering a certain proportion of all nodes. The initial embeddings of such masked nodes are replaced with a special mask token

x_{mask}

to cover any original information about these nodes. Edges, however, are not masked because these edges provide precious information about relationships between system entities. In summary, given node initial embeddings

x_{n}

, we mask nodes as follows:
在训练我们的图表示模块之前，我们对节点进行掩码处理，以便图掩码自编码器可以在重建这些节点时进行训练。掩码节点是随机选择的，覆盖所有节点的某个比例。这些掩码节点的初始嵌入被替换为一个特殊的掩码标记

x_{mask}

，以遮盖关于这些节点的任何原始信息。然而，边缘并没有被掩码，因为这些边缘提供了关于系统实体之间关系的重要信息。总之，给定节点的初始嵌入

x_{n}

，我们对节点进行如下掩码处理：

e m b_{n} = \begin{array}{ll} x_{n}, & n \notin \tilde{N} \\ x_{mask}, & n \in \tilde{N} \end{array}

where

\tilde{N}

are randomly-chosen masked nodes,

e m b_{n}

is the embedding of node

n

ready for training the graph representation module. This masking process only happens during training. During detection, we do not mask any nodes.
其中

\tilde{N}

是随机选择的被遮蔽节点，

e m b_{n}

是节点

n

的嵌入，准备用于训练图表示模块。这个遮蔽过程仅在训练期间发生。在检测期间，我们不遮蔽任何节点。

4.2.2 Graph Encoder 4.2.2 图编码器

Initial embeddings obtained from the graph construction steps take only raw features into consideration. However, raw features are far from enough to model detailed behaviors of system entities. Contextual information of an entity, such as its neighborhood, its multi-hop relationships and its interaction patterns with other system entities plays an important role to obtain high-quality entity embeddings [29]. Here we employ and extend graph masked auto-encoders [19] to generate output embeddings in a self-supervised way. The graph masked auto-encoder consists of an encoder and a decoder. The encoder produces output embeddings by propagating and aggregating graph features and the decoder reconstructs graph features to provide supervision signals for training. Such encoder-decoder architecture maintains the contextual and semantic information within the generated embeddings, while its computation overhead is significantly reduced via masked learning.
初始嵌入是通过图构建步骤获得的，仅考虑原始特征。然而，原始特征远不足以建模系统实体的详细行为。实体的上下文信息，例如其邻域、多跳关系以及与其他系统实体的交互模式，在获得高质量实体嵌入方面起着重要作用[29]。在这里，我们采用并扩展图掩码自编码器[19]以自监督的方式生成输出嵌入。图掩码自编码器由编码器和解码器组成。编码器通过传播和聚合图特征生成输出嵌入，而解码器重建图特征以提供训练的监督信号。这种编码器-解码器架构在生成的嵌入中保持上下文和语义信息，同时通过掩码学习显著减少了计算开销。

The encoder of our graph representation module contains multiple stacked layers of graph attention networks (GAT) [30]. The function of a GAT layer is to generate output node embeddings according to both the features (initial embeddings) of the node itself and its neighbors. Differing from ordinary GNNs, GAT introduces an attention mechanism to measure the importance of those neighbors.
我们的图表示模块的编码器包含多个堆叠的图注意力网络（GAT）层[30]。GAT 层的功能是根据节点自身的特征（初始嵌入）和其邻居生成输出节点嵌入。与普通的图神经网络（GNN）不同，GAT 引入了一种注意力机制来衡量这些邻居的重要性。

To explain in detail, one layer of GAT takes node embeddings generated by previous layers as input and propagates embeddings from source nodes to destination nodes into messages along the interactions. The message contains information about the source node and the interaction between source and destination:
为了详细解释，GAT 的一层将前一层生成的节点嵌入作为输入，并沿着交互将源节点的嵌入传播到目标节点，形成消息。该消息包含有关源节点的信息以及源节点与目标节点之间的交互。

M S G (s r c, d s t) = W_{m s g}^{T} (h_{s r c} ‖ e m b_{e})

And the attention mechanism is employed to calculate the attention coefficients between the message source and its destination:
注意机制用于计算消息源与其目的地之间的注意系数：

\begin{aligned} α (s r c, d s t) & = LeakyReLU (W_{a s}^{T} h_{s r c} + W_{a m} M S G (s r c, d s t)), \\ a (s r c, d s t) & = Softmax (α (s r c, d s t)) . \end{aligned}

Then for the destination node, the GAT aggregates messages from incoming edges to update its node embedding by computing a weighted sum of all incoming messages. The weights are exactly the attention coefficients:
然后，对于目标节点，GAT 从输入边聚合消息，通过计算所有输入消息的加权和来更新其节点嵌入。权重正是注意力系数：

\begin{aligned} A G G (h_{d s t}, h_{X}) & = W_{self} h_{d s t} + \sum_{i \in N} a (i, d s t) M S G (i, d s t), \\ h_{n}^{l} & = A G G^{l} (h_{n}^{l - 1}, h_{N}^{l - 1}) \end{aligned}

where

h_{n}^{l}

is the hidden embedding of node

n

l

-th layer of GAT,

h_{n}^{l - 1}

is that of layer

l - 1

and

N_{n}

is the one-hop neighborhood of

n

. The input of the first GAT layer are the initial node embeddings.

e_{e}

is the initial edge embedding and remains constant throughout the graph representation module.

W_{a s}, W_{a m}, W_{self}, W_{m s g}

are trainable parameters. The updated node embedding forms a general abstraction of the node’s one-hop interaction behavior.
其中

h_{n}^{l}

是 GAT 第

l

层中节点

n

的隐藏嵌入，

h_{n}^{l - 1}

是第

l - 1

层的嵌入，

N_{n}

是

n

的一跳邻域。第一个 GAT 层的输入是初始节点嵌入。

e_{e}

是初始边嵌入，并在整个图表示模块中保持不变。

W_{a s}, W_{a m}, W_{self}, W_{m s g}

是可训练参数。更新后的节点嵌入形成节点一跳交互行为的一般抽象。

Multiple layers of such GATs are stacked to obtain the final node embedding

h

, which is concatenated by the original node embedding and outputs of all GAT layers:
多个这样的 GAT 层被堆叠以获得最终的节点嵌入

h

，该嵌入由原始节点嵌入和所有 GAT 层的输出连接而成：

h_{n} = e m b_{n} ‖ h_{n}^{1} ‖ \dots ‖ h_{n}^{l}

where

\cdot ‖ \cdot

denotes the concatenate operation. The more layers of GAT stacked, the wider the neighboring range is and the farther a node’s multi-hop interaction pattern its embedding is able to represent. Consequently, the graph encoder effectively incorporates node initial features and multi-hop interaction behaviors to abstract system entity behaviors into node embeddings. The graph encoder also applies an average pooling to all node embeddings to generate a comprehensive embedding of the graph itself [31], which recapitulates the overall
其中

\cdot ‖ \cdot

表示连接操作。堆叠的 GAT 层数越多，邻域范围越广，节点的多跳交互模式能够表示的嵌入也越远。因此，图编码器有效地将节点初始特征和多跳交互行为结合起来，将系统实体行为抽象为节点嵌入。图编码器还对所有节点嵌入应用平均池化，以生成图本身的综合嵌入 [31]，这概括了整体。
state of the system: 系统状态：

h_{G} = \frac{1}{| N |} \sum_{n_{i} \in N} h_{n_{i}}

The node embeddings and system state embeddings generated by the graph encoder are considered the output of the graph representation module, which are used in subsequent tasks in different scenarios.
图编码器生成的节点嵌入和系统状态嵌入被视为图表示模块的输出，这些输出在不同场景的后续任务中使用。

4.2.3 Graph Decoder 4.2.3 图解码器

The graph encoder does not provide supervision signals that support model training. In typical graph auto-encoders [25, 27], a graph decoder is employed to decode node embeddings and supervise model training via feature reconstruction and structure reconstruction. Graph masked auto-encoders, however, abandon structure reconstruction to reduce computation overhead. Our graph decoder is a mixture of both, which integrates masked feature reconstruction and sample-based structure reconstruction to construct an objective function that optimizes the graph representation module.
图编码器不提供支持模型训练的监督信号。在典型的图自编码器中[25, 27]，使用图解码器来解码节点嵌入，并通过特征重构和结构重构来监督模型训练。然而，图掩码自编码器放弃了结构重构，以减少计算开销。我们的图解码器是两者的混合，集成了掩码特征重构和基于样本的结构重构，以构建一个优化图表示模块的目标函数。

Given node embeddings

h_{n}

obtained from the graph encoder, the decoder first re-masks those masked nodes and transforms them into the input of masked feature reconstruction:
给定从图编码器获得的节点嵌入

h_{n}

，解码器首先重新掩蔽那些被掩蔽的节点，并将它们转化为掩蔽特征重建的输入：

h_{n}^{*} = {\begin{cases} W^{*} h_{n}, & n \notin \tilde{N} \\ W^{*} v_{remask}, & n \in \tilde{N} \end{cases}

Subsequently, the decoder uses a similar GAT layer described above to reconstruct the initial embeddings of the masked nodes, allowing the calculation of a feature reconstruction loss:
随后，解码器使用上述类似的 GAT 层来重建被遮蔽节点的初始嵌入，从而允许计算特征重建损失：

\begin{aligned} x_{n}^{*} & = A G G^{*} (h_{n}^{*}, h_{N_{n}}^{*}), \\ L_{f r} & = \frac{1}{| \tilde{N} |} \sum_{n_{i} \in \tilde{N}} {(1 - \frac{x_{n_{i}}^{T} x_{n_{i}}^{*}}{‖ x_{n_{i}} ‖ \cdot ‖ x_{n_{i}}^{*} ∣})}^{γ} \end{aligned}

where

L_{f r}

is the masked feature reconstruction loss obtained by calculating a scaled cosine loss between initial and reconstructed embeddings of the masked nodes. This loss [19] scales dramatically between easy and difficult samples which effectively speeds up learning. The degree of such scaling is controlled by a hyper-parameter

γ

.
其中

L_{f r}

是通过计算被遮蔽节点的初始嵌入和重建嵌入之间的缩放余弦损失获得的遮蔽特征重建损失。该损失 [19] 在简单样本和困难样本之间的缩放幅度非常大，从而有效加速学习。这种缩放的程度由超参数

γ

控制。

Meanwhile, sample-based structure reconstruction aims to reconstruct graph structure (i.e. predict edges between nodes). Instead of reconstructing the whole adjacency matrix, which has

O (N^{2})

complexity, sample-based structure reconstruction applies contrastive sampling on node pairs and predicts edge probabilities between such pairs. Only non-masked nodes are involved in structure reconstruction. Positive samples are constructed with all existing edges between non-masked nodes and negative samples are sampled among node pairs with no existing edges between them.
同时，基于样本的结构重建旨在重建图结构（即预测节点之间的边）。与重建整个邻接矩阵（其复杂度为

O (N^{2})

）不同，基于样本的结构重建对节点对进行对比采样，并预测这些节点对之间的边概率。只有未被屏蔽的节点参与结构重建。正样本由未被屏蔽的节点之间的所有现有边构成，负样本则在没有现有边的节点对中进行采样。

A simple two-layer MLP is used to reconstruct edges between node pairs samples, generating one probability for each
一个简单的两层多层感知器用于重建节点对样本之间的边缘，为每个样本生成一个概率
sample. The reconstruction loss takes the form of a simple binary cross-entropy loss on those samples:
样本。重建损失采用对这些样本的简单二元交叉熵损失形式：

\begin{aligned} prob (n, n^{'}) = σ (MLP (h_{n} | | h_{n^{'}})) \\ L_{s r} = - \frac{1}{| \hat{N} |} \sum_{n \in \hat{N}} (\log (1 - prob (n, n^{-})) + \log (prob (n, n^{+}))) \end{aligned}

where

(n, n^{-})

and

(n, n^{+})

are negative and positive samples respectively and

\hat{N} = N - \tilde{N}

is the set of non-masked nodes. Sample-based structure reconstruction only provides supervision to the output embeddings. Instead of using dot products, we employ a MLP to calculate edge probabilities as interacting entities are not necessarily similar in behaviors. Also, we are not forcing the model to learn to predict edge probabilities. The function of such structure reconstruction is to maximize behavioral information contained in the abstracted node embeddings so that a simple MLP is sufficient to incorporate and interpret such information into edge probabilities.
其中

(n, n^{-})

和

(n, n^{+})

分别是负样本和正样本，

\hat{N} = N - \tilde{N}

是未被屏蔽的节点集合。基于样本的结构重建仅为输出嵌入提供监督。我们采用多层感知机（MLP）来计算边缘概率，而不是使用点积，因为交互实体的行为不一定相似。此外，我们并没有强迫模型学习预测边缘概率。这种结构重建的功能是最大化抽象节点嵌入中包含的行为信息，因此简单的 MLP 足以将这些信息纳入并解释为边缘概率。

The final objective function

L = L_{f r} + L_{s r}

combines

L_{f r}

and

L_{s r}

and provides supervision signals to the graph representation module, enabling it to learn parameters in a selfsupervised way.
最终目标函数

L = L_{f r} + L_{s r}

结合了

L_{f r}

和

L_{s r}

，并为图表示模块提供监督信号，使其能够以自监督的方式学习参数。

4.3 Detection Module 4.3 检测模块

Based on the output embeddings generated by the graph representation module, we utilize outlier detection methods to perform APT detection in an unsupervised way. As detailedly explained in previous sections, such embeddings summarize system behaviors in different granularity. The goal of our detection model is to identify malicious system entities or states given only a priori knowledge of benign system behaviors. Embeddings generated via graph representation learning tend to form clusters if their corresponding entities share similar interaction behaviors in the graph [19,25-27,32]. Thus, outliers in system state embeddings indicate uncommon and suspicious system behaviors. Based on such insight, we develop a special outlier detection method to perform APT detection.
基于图表示模块生成的输出嵌入，我们利用异常检测方法以无监督的方式进行 APT 检测。如前面章节详细解释的那样，这些嵌入以不同的粒度总结了系统行为。我们检测模型的目标是仅根据良性系统行为的先验知识识别恶意系统实体或状态。通过图表示学习生成的嵌入如果其对应的实体在图中共享相似的交互行为，往往会形成聚类。因此，系统状态嵌入中的异常值表示不常见和可疑的系统行为。基于这样的洞察，我们开发了一种特殊的异常检测方法来进行 APT 检测。

During training, benign output embeddings are first abstracted from the training provenance graphs. What the detection module does at this stage is simply memorizing those embeddings and organize them in a

K

-D Tree [33]. After training, the detection module reveals outliers in three steps: k -nearest neighbor searching, similarity calculation and filtering. Given a target embedding, the detection module first obtains its k-nearest neighbors via K-D Tree searching. Such searching process only takes

\log (N)

time, where

N

is the total number of memorized training embeddings. Then, a similarity criterion is applied to evaluate the target embedding’s closeness to its neighbors and compute an anomaly score. If its anomaly score yields higher than a hyper-parameter

θ

, the target embedding is considered an outlier and its corresponding system entity or system state is malicious. An example workflow of the detection module is formalized as follows,
在训练过程中，良性输出嵌入首先从训练来源图中抽象出来。此阶段检测模块的工作仅仅是记住这些嵌入并将它们组织成一个

K

-维树[33]。训练完成后，检测模块通过三个步骤揭示异常值：k-最近邻搜索、相似性计算和过滤。给定一个目标嵌入，检测模块首先通过 K-D 树搜索获得其 k 个最近邻。这一搜索过程仅需

\log (N)

时间，其中

N

是记住的训练嵌入的总数。然后，应用相似性标准来评估目标嵌入与其邻居的接近程度并计算异常分数。如果其异常分数高于超参数

θ

，则目标嵌入被视为异常值，其对应的系统实体或系统状态被认为是恶意的。检测模块的一个示例工作流程形式化如下，
using euclidean distance as the similarity criterion:
使用欧几里得距离作为相似性标准：

\begin{aligned} N_{x} = K N N (x) \\ {dist}_{x} = \frac{1}{| N_{x} |} \sum_{x_{i} \in N_{x}} ‖ x - x_{i} | | \\ {score}_{x} = \frac{{dist}_{x}}{\overset{―}{\overset{―}{d i s t}}} \\ {result}_{x} = {\begin{cases} 1, & {score}_{x} \geq θ \\ 0, & {score}_{x} < θ \end{cases} \end{aligned}

where

\overset{―}{d i s t}

is the average distance between training embeddings and their k-nearest neighbors. When performing batched log level detection, the detection module memorizes benign system state embeddings that reflects system states and detects if the system state embedding of a newly-arrived provenance graph is an outlier. When performing system entity level detection, the detection module instead memorizes benign node embeddings that indicates system entity behaviors and given a newly-arrived provenance graph, it detects outliers within the embeddings of all system entities.
其中

\overset{―}{d i s t}

是训练嵌入与其 k 最近邻之间的平均距离。在执行批量日志级别检测时，检测模块记忆反映系统状态的良性系统状态嵌入，并检测新到达的来源图的系统状态嵌入是否为异常值。在执行系统实体级别检测时，检测模块则记忆指示系统实体行为的良性节点嵌入，并在给定新到达的来源图时，检测所有系统实体的嵌入中的异常值。

4.4 Model Adaption 4.4 模型适应

For an APT detector to effectively function in real-world detection scenarios, concept drift must be taken into consideration. When facing benign yet previously unseen system behaviors, MAGIC produces false positive detection results, which may mislead subsequent applications (e.g. attack investigation and story recovery). Recent works address this issue by forgetting outdated data [10] or fitting their model to benign system changes via a model adaption mechanism [18]. MAGIC also integrates a model adaption mechanism to combat concept drift and learn from false positives identified by security analysts. Slightly different from other works that use only false positives to retrain the model, MAGIC can be retrained with all feedbacks. As discussed in previous sections, the graph representation module in MAGIC encodes system entities into embeddings in a self-supervised way, without knowing its label. Any unseen data, including those true negatives, are valuable training data for the graph representation module to enhance its representation ability on unseen system behaviors.
为了使 APT 检测器在现实世界的检测场景中有效运作，必须考虑概念漂移。当面对良性但之前未见的系统行为时，MAGIC 会产生误报检测结果，这可能会误导后续应用（例如攻击调查和故事恢复）。最近的研究通过遗忘过时数据[10]或通过模型适应机制将模型调整为良性系统变化来解决这个问题[18]。MAGIC 还集成了一种模型适应机制，以应对概念漂移并从安全分析师识别的误报中学习。与仅使用误报重新训练模型的其他工作略有不同，MAGIC 可以使用所有反馈进行重新训练。如前面部分所讨论，MAGIC 中的图表示模块以自监督的方式将系统实体编码为嵌入，而无需知道其标签。任何未见数据，包括那些真实的负例，都是图表示模块的宝贵训练数据，以增强其对未见系统行为的表示能力。

The detection module can only be retrained with benign feedbacks to keep up to system behavior changes. And as it memorizes more and more benign feedbacks, its detection efficiency is lowered. To address this issue, we also implement a discounting mechanism on the detection module. When the volume of memorized embeddings exceeds a certain amount, earliest embeddings are simply removed as newly-arrived embeddings are learned. We provide the model adaption mechanism as an optional solution to concept drift and unseen system behaviors. It is recommended to adapt MAGIC to system changes by feeding confirmed false positive samples to MAGIC’s model adaption mechanism.
检测模块只能通过良性反馈进行再训练，以跟上系统行为的变化。随着它记忆越来越多的良性反馈，其检测效率降低。为了解决这个问题，我们还在检测模块上实施了折扣机制。当记忆的嵌入量超过一定数量时，最早的嵌入会被简单地移除，以便学习新到达的嵌入。我们提供模型适应机制作为应对概念漂移和未见系统行为的可选解决方案。建议通过向 MAGIC 的模型适应机制提供确认的假阳性样本来使 MAGIC 适应系统变化。

5 Implementation 5 实施

We implement MAGIC with about 3,500 lines of code in Python 3.8. We develop several log parsers to cope with different format of audit logs, including StreamSpot [34], Camflow [35] and CDM [36]. Provenance graphs are constructed using the graph processing library Networkx [37] and stored in JSON format. The graph representation module is implemented via PyTorch [38] and DGL [39]. The detection module is developed with Scikit-learn [40]. For hyper-parameters of MAGIC, the scaling factor

γ

in the feature reconstruction loss is set to 3 , the number of neighbors k is set to 10 , the learning rate as 0.001 and the weight decay factor equals

5 \times 10^{- 4}

. We use a 3 layer graph encoder and a mask rate of 0.5 in our experiments. The output embedding dimension

d

is different on two detection scenarios, batched log level detection and entity level detection. We use

d

equals 256 in batched log level detection and of and an we set

d

equals 64 in entity level detection to reduce resource consumption. The detection threshold

θ

is chosen by a simple linear search separately conducted on each dataset. The hyper-parameters may have other choices. We demonstrate the impact of these hyper-parameters on MAGIC later in the evaluation section. In our hyper-parameter analysis,

d

is chosen from

{16, 32

64, 128, 256}, l

from

{1, 2, 3, 4}

and

r

from

{0.3, 0.5, 0.7}

. For the threshold

θ

, it is chosen between 1 and 10 in batched log level detection. For entity level detection, please refer to Appendix D.
我们在 Python 3.8 中实现了约 3,500 行代码的 MAGIC。我们开发了几个日志解析器，以应对不同格式的审计日志，包括 StreamSpot [34]、Camflow [35] 和 CDM [36]。使用图处理库 Networkx [37] 构建了来源图，并以 JSON 格式存储。图表示模块通过 PyTorch [38] 和 DGL [39] 实现。检测模块使用 Scikit-learn [40] 开发。对于 MAGIC 的超参数，特征重构损失中的缩放因子

γ

设置为 3，邻居数量 k 设置为 10，学习率设置为 0.001，权重衰减因子等于

5 \times 10^{- 4}

。在我们的实验中，我们使用了 3 层图编码器和 0.5 的掩码率。输出嵌入维度

d

在两种检测场景中不同，批量日志级别检测和实体级别检测。我们在批量日志级别检测中使用

d

等于 256，在实体级别检测中设置

d

等于 64，以减少资源消耗。检测阈值

θ

通过在每个数据集上单独进行的简单线性搜索选择。超参数可能还有其他选择。我们将在评估部分展示这些超参数对 MAGIC 的影响。在我们的超参数分析中，

d

从

{16, 32

中选择，

64, 128, 256}, l

从

{1, 2, 3, 4}

中选择，

r

从

{0.3, 0.5, 0.7}

中选择。对于阈值

θ

，它在批量日志级别检测中选择在 1 到 10 之间。有关实体级别检测，请参阅附录 D。

6 Evaluation 6 评估

We use 131 GB of audit logs derived from various system auditing softwares to evaluate the effectiveness and efficiency of MAGIC. We first describe our experimental settings (Sec. 6.1), then elaborate the effectiveness of MAGIC in different scenarios (Sec. 6.2), conduct a false positive analysis and assess the usefulness of the model adaption mechanism (Sec. 6.3) and analyze the run-time performance overhead of MAGIC (Sec. 6.4). The impact of different components and hyperparameters of MAGIC is analyzed in Sec. 6.5. In addition, a detailed case study on our motivation example is conducted in Appendix C to illustrate how MAGIC’s pipeline work for APT detection. These experiments are conducted under the same device setting.
我们使用来自各种系统审计软件的 131 GB 审计日志来评估 MAGIC 的有效性和效率。我们首先描述实验设置（第 6.1 节），然后详细阐述 MAGIC 在不同场景下的有效性（第 6.2 节），进行假阳性分析并评估模型适应机制的有用性（第 6.3 节），分析 MAGIC 的运行时性能开销（第 6.4 节）。第 6.5 节分析 MAGIC 的不同组件和超参数的影响。此外，在附录 C 中对我们的动机示例进行了详细案例研究，以说明 MAGIC 的管道如何用于 APT 检测。这些实验在相同的设备设置下进行。

6.1 Experimental Settings
6.1 实验设置

We evaluate the effectiveness of MAGIC on three public datasets: the StreamSpot dataset [21], the Unicorn Wget dataset [22] and the DARPA Engagement 3 datasets [20]. These datasets vary in volume, origin and granularity. We believe by testing MAGIC’s performance on these datasets, we are able to compare MAGIC with as many state-of-the-art
我们在三个公共数据集上评估 MAGIC 的有效性：StreamSpot 数据集[21]、Unicorn Wget 数据集[22]和 DARPA Engagement 3 数据集[20]。这些数据集在规模、来源和粒度上各不相同。我们相信，通过在这些数据集上测试 MAGIC 的性能，我们能够将 MAGIC 与尽可能多的最先进技术进行比较。

Table 1: Datasets for batched log level detection.
表 1：批量日志级别检测的数据集。

Dataset 数据集	Scenario 场景	Malicious 恶意	#Log pieces #日志片段	Avg. #Entity 平均实体数量	Avg. #Interaction 平均互动次数	Size(GB) 大小(GB)
StreamSpot 流点
	CNN		100	8,989	294,903	0.9
	Download 下载		100	8,830	310,814	1.0
	Gmail		100	6,826	37,382	0.1
	VGame		100	8,636	112,958	0.4
	YouTube		100	8,292	113,229	0.3
	Attack 攻击	$✓$	100	8,890	28,423	0.1
Unicorn Wget 独角兽 Wget	Benign 良性		125	265,424	975,226	64.0
Unicorn Wget 独角兽 Wget	Attack 攻击	$✓$	25	257,156	949,887	12.6

APT detection approaches as possible and explore the universality and applicability of MAGIC. we provide detailed descriptions of the three datasets as follows.
APT 检测方法尽可能多，并探索 MAGIC 的普遍性和适用性。我们提供以下三个数据集的详细描述。
StreamSpot Dataset. The StreamSpot dataset (See Table 1) is a simulated dataset collected and made public by StreamSpot [34] using auditing system SystemTap [41]. The StreamSpot dataset contains 600 batches of audit logs monitoring system calls under 6 unique scenarios. Five of those scenarios are simulated benign user behaviors while the attack scenario simulates a drive-by-download attack. The dataset is considered a relatively small dataset and since no labels of log entries and system entities are provided, we perform batched log level detection on the StreamSpot dataset similar to previous works [10, 15, 17].
StreamSpot 数据集。StreamSpot 数据集（见表 1）是由 StreamSpot [34] 使用审计系统 SystemTap [41] 收集并公开的模拟数据集。StreamSpot 数据集包含 600 批审计日志，监控 6 种独特场景下的系统调用。其中五种场景模拟了良性用户行为，而攻击场景则模拟了驱动下载攻击。该数据集被认为是相对较小的数据集，由于没有提供日志条目和系统实体的标签，我们对 StreamSpot 数据集进行批量日志级别检测，类似于之前的工作 [10, 15, 17]。
Unicorn Wget Dataset. The Unicorn Wget dataset (See Table 1) contains simulated attacks designed by Unicorn [10]. Specifically, it contains 150 batches of logs collected with Camflow [35], where 125 of them are benign and 25 of them contain supply-chain attacks. Those attacks, categorized as stealth attacks, are elaborately designed to behave similar to benign system workflows and are expected to be difficult to identify. This dataset is considered the hardest among our experimental datasets for its huge volume, complicated log format and the stealthy nature of these attacks. The same as state-of-the-art approaches, we perform batched log level detection on this dataset.
独角兽 Wget 数据集。独角兽 Wget 数据集（见表 1）包含由独角兽设计的模拟攻击 [10]。具体而言，它包含 150 批使用 Camflow [35] 收集的日志，其中 125 批是良性的，25 批包含供应链攻击。这些攻击被归类为隐蔽攻击，经过精心设计，行为类似于良性系统工作流，预计难以识别。由于其庞大的数据量、复杂的日志格式以及这些攻击的隐蔽性，该数据集被认为是我们实验数据集中最困难的。与最先进的方法相同，我们在该数据集上进行批量日志级别检测。
DARPA E3 Datasets. The DARPA Engagement 3 datasets (See Table 2), as a part of the DARPA Transparent Computing program, are collected among an enterprise network during an adversarial engagement. APT attacks exploiting different vulnerabilities [20] are conducted by the red team to exfiltrate sensitive information. Blue teams try to identify those attacks by auditing the network hosts and performing causality analysis on them. Trace, THEIA and CADETS sub-datasets are included in our evaluation. These three sub-datasets consist of a total 51.69 GB of audit records, containing as many as

6, 539, 677

system entities and 68,127,444 interactions. Thus, we evaluate MAGIC’s system entity level detection ability and address the overhead issue on these datasets.
DARPA E3 数据集。DARPA Engagement 3 数据集（见表 2）作为 DARPA 透明计算计划的一部分，是在对抗性交互中从企业网络中收集的。红队利用不同的漏洞进行 APT 攻击 [20]，以窃取敏感信息。蓝队通过审计网络主机并对其进行因果分析来识别这些攻击。我们的评估中包括 Trace、THEIA 和 CADETS 子数据集。这三个子数据集总共包含 51.69 GB 的审计记录，涉及多达

6, 539, 677

个系统实体和 68,127,444 次交互。因此，我们评估 MAGIC 的系统实体级检测能力，并解决这些数据集上的开销问题。

For different dataset, we employ different dataset splits to evaluate the model and we use only benign samples for training. For the StreamSpot dataset, we randomly choose 400 batches out of 500 benign logs for training and the rest for testing, resulting in an balanced test set. For the Unicorn Wget dataset, 100 batches of benign logs are selected for training
对于不同的数据集，我们采用不同的数据集划分来评估模型，并且仅使用良性样本进行训练。对于 StreamSpot 数据集，我们随机选择 500 个良性日志中的 400 个批次进行训练，其余用于测试，从而形成一个平衡的测试集。对于 Unicorn Wget 数据集，选择 100 个良性日志批次进行训练。

Table 2: Datasets for system entity level detection.
表 2：系统实体级别检测的数据集。

Dataset 数据集

Scenario 场景

Malicious 恶意

#Node

#Edge

Size (GB) 大小 (GB)

DARPA E3 Trace DARPA E3 跟踪

良性扩展后门

Benign

Extension Backdoor

✓

3, 220, 594

732

松树后门钓鱼可执行文件

Pine Backdoor

Phishing Executable

✓

67,345

4, 080, 457

15.40

良性攻击

Benign

Attack

✓

1, 598, 647

25,319

2, 874, 821

17.91

DARPA E3 CADETS DARPA E3 学员

良性攻击

Benign

Attack

✓

1, 614, 189

12,846

3, 303, 264

18.38

Figure 5: ROC curves on all datasets.
图 5：所有数据集上的 ROC 曲线。
while the rest are for testing. For the DARPA E3 datasets, we use the same ground-truth labels as ThreaTrace [17] and split

\log

entries according to their order of occurrence. The earliest

80 % \log

entries are for training while the rest are preserved for testing. During evaluation, the average performance of MAGIC under 100 global random seeds is reported as the final result, so the experimental results may contain fractions of system entities/batches of logs.
其余部分用于测试。对于 DARPA E3 数据集，我们使用与 ThreaTrace [17]相同的真实标签，并根据出现顺序拆分

\log

条目。最早的

80 % \log

条目用于训练，其余部分保留用于测试。在评估过程中，MAGIC 在 100 个全局随机种子下的平均性能被报告为最终结果，因此实验结果可能包含系统实体/日志批次的分数。

6.2 Effectiveness 6.2 效果性

MAGIC’s effectiveness of multi-granularity APT detection is evaluated on three datasets. Here we present the detection results of MAGIC on each dataset, then compare it with state-of-the-art APT detection approaches on those datasets.
MAGIC 在三个数据集上评估了多粒度 APT 检测的有效性。这里我们展示 MAGIC 在每个数据集上的检测结果，然后将其与这些数据集上的最先进 APT 检测方法进行比较。
Detection Result. Results show that MAGIC successfully detects APTs with high accuracy in different scenarios. We present the detection results of MAGIC on each dataset in Table 3 and their corresponding ROC curves in Figure 5.
检测结果。结果表明，MAGIC 在不同场景中成功地以高准确率检测 APT。我们在表 3 中展示了 MAGIC 在每个数据集上的检测结果，以及在图 5 中展示的相应 ROC 曲线。

On easy datasets such as the StreamSpot dataset, MAGIC achieves almost perfect detection results. This is because the StreamSpot dataset collects only single user activity per log batch, resulting in system behaviors that can be easily separated from each other. We further present this effect by visualizing the distribution of system state embeddings abstracted from those log batches in Figure 6. The system state embeddings are separated into 6 categories, matching the 6 scenarios involved in the dataset. Also, this indicates that MAGIC’s graph representation module excels at abstracting system behaviors into such embeddings.
在像 StreamSpot 数据集这样的简单数据集上，MAGIC 几乎达到了完美的检测结果。这是因为 StreamSpot 数据集每个日志批次仅收集单个用户活动，导致系统行为可以很容易地相互区分。我们进一步通过在图 6 中可视化从这些日志批次中抽象出的系统状态嵌入的分布来展示这一效果。系统状态嵌入被分为 6 类，匹配数据集中涉及的 6 种场景。此外，这表明 MAGIC 的图表示模块在将系统行为抽象为这样的嵌入方面表现出色。

When dealing with the Unicorn Wget dataset, MAGIC
在处理独角兽 Wget 数据集时，MAGIC

Table 3: MAGIC’s detection results on different datasets. For batched log level detection, the detection targets are log pieces. And for system entity level detection, system entities are the targets.
表 3：MAGIC 在不同数据集上的检测结果。对于批量日志级别检测，检测目标是日志片段。对于系统实体级别检测，系统实体是目标。

Granularity 粒度	Dataset 数据集		Train Ratio 列车比例	Ground Truth 真实情况		#TP	#FP	#TN	#FN	Precision 精确度	Recall 召回	FPR	F1-Score F1 分数	AUC
Granularity 粒度	Dataset 数据集		Train Ratio 列车比例	#Benign	#Malicious	#TP	#FP	#TN	#FN	Precision 精确度	Recall 召回	FPR	F1-Score F1 分数	AUC
Batched log level 批处理日志级别	StreamSpot 流点		80%	100	100	100.0	0.59	99.41	0.0	99.41%	100.00%	0.59%	99.71%	99.95%
Batched log level 批处理日志级别	Unicorn Wget 独角兽 Wget		80%	25	25	24.0	0.5	24.5	1.0	98.02%	96.00%	2.00%	96.98%	96.32%
System entity level 系统实体级别	DARPA E3 Trace DARPA E3 跟踪	All 所有	80%	616,025	68,082	68,072	569	615,456	10	99.17%	99.98%	0.09%	$99.57 %$	99.99%
		Extension Backdoor 扩展后门			732	727			5
		Pine Backdoor 松树后门			67,345	67,342			3
		Phishing Executable 网络钓鱼可执行文件			5	3			2
	DARPA E3 THEIA	All 所有	80%	319,448	25,319	25,318	456	318,992	1	98.23%	99.99%	0.14%	99.11%	99.87%
	DARPA E3 CADETS DARPA E3 学员	All 所有	80%	344,327	12,846	12,816	759	343,568	30	94.40%	99.77%	0.22%	97.01%	99.77%

Figure 6: Latent space of system state embeddings in the StreamSpot dataset. Each point represents a log piece in the dataset, which belongs to one of six scenarios: watching YouTube, checking G-mail, playing vgame, undergoing a drive-by-download attack, downloading ordinary files and watching CNN.
图 6：StreamSpot 数据集中系统状态嵌入的潜在空间。每个点代表数据集中的一个日志片段，属于六种场景之一：观看 YouTube、查看 G-mail、玩电子游戏、遭受驱动下载攻击、下载普通文件和观看 CNN。
yields an average

98.01 %

precision and

96.00 %

recall, significantly lower compared to that of the StreamSpot dataset. MAGIC’s self-supervised style makes it difficult to distinguish between stealth attacks and benign system behaviors. However, MAGIC still successfully recovers an average 24 out of

25 \log

batches with only 0.5 false positives generated, better than any of the state-of-the-art detectors [10, 15, 17].
产生平均

98.01 %

的精确度和

96.00 %

的召回率，明显低于 StreamSpot 数据集的水平。MAGIC 的自监督风格使得区分隐蔽攻击和良性系统行为变得困难。然而，MAGIC 仍然成功恢复了平均

25 \log

批次中的 24 批，仅生成 0.5 个假阳性，优于任何最先进的检测器[10, 15, 17]。

On the DARPA datasets, MAGIC achieves an average

99.91 %

recall and

0.15 %

false positive rate with only benign log entries for training. This indicates MAGIC quickly learns to model system behaviors. The test set in this scenario is unbalanced, which means the total number of ground-truth benign entities far exceeds that of malicious entities. Among 1,386,046 test entities, only 106,246 are labeled malicious. However, few false positives are generated, as MAGIC identifies malicious entities with outlier detection and anomalous entities are naturally detected as abnormalities.
在 DARPA 数据集上，MAGIC 仅使用良性日志条目进行训练，达到了平均

99.91 %

的召回率和

0.15 %

的假阳性率。这表明 MAGIC 能够快速学习建模系统行为。在这种情况下，测试集是不平衡的，这意味着真实良性实体的总数远远超过恶意实体的数量。在 1,386,046 个测试实体中，只有 106,246 个被标记为恶意。然而，生成的假阳性很少，因为 MAGIC 通过异常检测识别恶意实体，而异常实体自然被检测为异常。

Among the false negative results, we notice that most of them are malicious files and libraries involved in the attacks. This indicates that MAGIC excels at detecting malicious processes and network connections that behave differently from benign system entities. However, MAGIC finds it hard to locate passive entities such as malicious files and libraries, whose behaviors tend to be similar to benign ones. Fortunately,
在假阴性结果中，我们注意到大多数都是涉及攻击的恶意文件和库。这表明 MAGIC 在检测与良性系统实体行为不同的恶意进程和网络连接方面表现出色。然而，MAGIC 很难定位被动实体，如恶意文件和库，因为它们的行为往往与良性实体相似。幸运的是，

Table 4: Comparison between MAGIC and state-of-the-art APT detection methods on different datasets. Within column supervision, B indicates benign data, A refers to attack data and SA for streaming attack data.
表 4：MAGIC 与最先进的 APT 检测方法在不同数据集上的比较。在监督列中，B 表示良性数据，A 表示攻击数据，SA 表示流式攻击数据。

Dataset 数据集	Approach 方法	Train Ratio 列车比例	Supervision 监督	Precision 精确度	F1-Score F1 分数	Recall 召回	FPR
	StreamSpot 流点	$80 %$	B	$73 %$	$81 %$	$91 %$	$6.6 %$
	Unicorn (baseline) 独角兽（基线）	$75 %$	B	$95 %$	$96 %$	$93 %$	$1.6 %$
StreamSpot 流点	Prov-Gem 普罗维德-宝石	$80 %$	B,A	$100 %$	$97 %$	$94 %$	$0 %$
	ThreaTrace	$75 %$	B	$98 %$	$99 %$	$99 %$	$0.4 %$
	MAGIC (Ours) 魔法（我们的）	$80 %$	B	$99 %$	$99 %$	$100 %$	$0.6 %$
	Unicorn (baseline) 独角兽（基线）	$80 %$	B	$86 %$	$90 %$	$95 %$	$15.5 %$
Unicorn 独角兽	Prov-Gem 普罗维德-宝石	$80 %$	B,A	$100 %$	$89 %$	$80 %$	$0 %$
Wget	ThreaTrace	$80 %$	B	$93 %$	$95 %$	$98 %$	$7.4 %$
	MAGIC (Ours) 魔法（我们的）	$80 %$	B	$98 %$	$97 %$	$96 %$	$2.0 %$
	DeepLog 深日志	N/A 不适用	B,A	$41 %$	$51 %$	$68 %$	$2.7 %$
DARPA	Log2vec (baseline) Log2vec（基线）	N/A 不适用	B,A	$54 %$	$64 %$	$78 %$	$1.8 %$
E3 Trace E3 跟踪	ThreaTrace	N/A 不适用	B	$72 %$	$83 %$	$99 %$	$1.1 %$
	ShadeWatcher 阴影观察者	$80 %$	B,SA	$97 %$	$99 %$	$99 %$	$0.3 %$
	MAGIC (Ours) 魔法（我们的）	$80 %$	B	$99 %$	$99 %$	$99 %$	$0.1 %$
	DeepLog 深日志	N/A 不适用	B,A	$16 %$	$15 %$	$14 %$	$0.5 %$
DARPA	Log2vec (baseline) Log2vec（基线）	N/A 不适用	B,A	$62 %$	$64 %$	$66 %$	$0.3 %$
E3 THEIA	ThreaTrace	N/A 不适用	B	$87 %$	$93 %$	$99 %$	$0.1 %$
	MAGIC (Ours) 魔法（我们的）	B0%	B	$98 %$	$99 %$	$99 %$	$0.1 %$
	DeepLog 深日志	N/A 不适用	B,A	$23 %$	$35 %$	$74 %$	$4.4 %$
DARPA	Log2vec (baseline) Log2vec（基线）	N/A 不适用	B,A	$49 %$	$62 %$	$85 %$	$1.6 %$
E3 CADETS E3 学员	ThreaTrace	N/A 不适用	B	$90 %$	$95 %$	$99 %$	$0.2 %$
	MAGIC (Ours) 魔法（我们的）	$80 %$	B	$94 %$	$97 %$	$99 %$	$0.2 %$

| Dataset | Approach | Train Ratio | Supervision | Precision | F1-Score | Recall | FPR | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | | StreamSpot | $80 \%$ | B | $73 \%$ | $81 \%$ | $91 \%$ | $6.6 \%$ | | | Unicorn (baseline) | $75 \%$ | B | $95 \%$ | $96 \%$ | $93 \%$ | $1.6 \%$ | | StreamSpot | Prov-Gem | $80 \%$ | B,A | $100 \%$ | $97 \%$ | $94 \%$ | $\mathbf{0 \%}$ | | | ThreaTrace | $75 \%$ | B | $98 \%$ | $99 \%$ | $99 \%$ | $0.4 \%$ | | | MAGIC (Ours) | $80 \%$ | B | $99 \%$ | $\mathbf{9 9 \%}$ | $\mathbf{1 0 0 \%}$ | $0.6 \%$ | | | Unicorn (baseline) | $80 \%$ | B | $86 \%$ | $90 \%$ | $95 \%$ | $15.5 \%$ | | Unicorn | Prov-Gem | $80 \%$ | B,A | $100 \%$ | $89 \%$ | $80 \%$ | $\mathbf{0 \%}$ | | Wget | ThreaTrace | $80 \%$ | B | $93 \%$ | $95 \%$ | $\mathbf{9 8 \%}$ | $7.4 \%$ | | | MAGIC (Ours) | $80 \%$ | B | $98 \%$ | $\mathbf{9 7 \%}$ | $96 \%$ | $2.0 \%$ | | | DeepLog | N/A | B,A | $41 \%$ | $51 \%$ | $68 \%$ | $2.7 \%$ | | DARPA | Log2vec (baseline) | N/A | B,A | $54 \%$ | $64 \%$ | $78 \%$ | $1.8 \%$ | | E3 Trace | ThreaTrace | N/A | B | $72 \%$ | $83 \%$ | $99 \%$ | $1.1 \%$ | | | ShadeWatcher | $80 \%$ | B,SA | $97 \%$ | $99 \%$ | $\mathbf{9 9 \%}$ | $0.3 \%$ | | | MAGIC (Ours) | $80 \%$ | B | $\mathbf{9 9 \%}$ | $\mathbf{9 9 \%}$ | $99 \%$ | $\mathbf{0 . 1 \%}$ | | | DeepLog | N/A | B,A | $16 \%$ | $15 \%$ | $14 \%$ | $0.5 \%$ | | DARPA | Log2vec (baseline) | N/A | B,A | $62 \%$ | $64 \%$ | $66 \%$ | $0.3 \%$ | | E3 THEIA | ThreaTrace | N/A | B | $87 \%$ | $93 \%$ | $99 \%$ | $\mathbf{0 . 1 \%}$ | | | MAGIC (Ours) | B0% | B | $\mathbf{9 8 \%}$ | $\mathbf{9 9 \%}$ | $\mathbf{9 9 \%}$ | $0.1 \%$ | | | DeepLog | N/A | B,A | $23 \%$ | $35 \%$ | $74 \%$ | $4.4 \%$ | | DARPA | Log2vec (baseline) | N/A | B,A | $49 \%$ | $62 \%$ | $85 \%$ | $1.6 \%$ | | E3 CADETS | ThreaTrace | N/A | B | $90 \%$ | $95 \%$ | $\mathbf{9 9 \%}$ | $0.2 \%$ | | | MAGIC (Ours) | $80 \%$ | B | $\mathbf{9 4 \%}$ | $\mathbf{9 7 \%}$ | $99 \%$ | $\mathbf{0 . 2 \%}$ |

these intermediate files and libraries can be easily identified during attack story recovery, given the malicious processes and connections successfully detected.
这些中间文件和库在攻击故事恢复过程中可以很容易地被识别，因为恶意进程和连接已成功检测到。
MAGIC vs. State-of-the-art. The three datasets used to evaluate MAGIC are also used by several state-of-the-art learningbased APT detection approaches. For instance, Unicorn [10], Prov-Gem [15] and ThreaTrace [17] for Unicorn Wget and StreamSpot dataset, ThreaTrace and ShadeWatcher [18] for sub-dataset E3-Trace. Methods that require a priori information about APTs, such as Holmes [3], Poirot [5] and Morse [6], are not taken into consideration as MAGIC cannot be compared to them in the same detection scenario.
MAGIC 与最先进技术。用于评估 MAGIC 的三个数据集也被多个最先进的基于学习的 APT 检测方法使用。例如，Unicorn [10]、Prov-Gem [15]和 ThreaTrace [17]用于 Unicorn Wget 和 StreamSpot 数据集，ThreaTrace 和 ShadeWatcher [18]用于子数据集 E3-Trace。需要关于 APT 的先验信息的方法，如 Holmes [3]、Poirot [5]和 Morse [6]，不被考虑，因为 MAGIC 无法在相同的检测场景中与它们进行比较。

Comparison results between MAGIC and other state-of-the-art approaches on each dataset are presented in Table 4. Comparison between MAGIC and other unsupervised approaches (i.e. Unicorn and ThreaTrace) yields a total victory of our approach, revealing MAGIC’s effectiveness in modeling and detecting outliers of benign system behaviors with no supervision from attack-containing logs.
MAGIC 与其他最先进方法在每个数据集上的比较结果见表 4。MAGIC 与其他无监督方法（即 Unicorn 和 ThreaTrace）的比较显示了我们方法的全面胜利，揭示了 MAGIC 在建模和检测良性系统行为的异常方面的有效性，而无需依赖包含攻击的日志进行监督。

Beyond unsupervised methods, Prov-Gem is a supervised APT detector based on GATs. However, it fails to achieve a better detection result on even the easiest StreamSpot dataset. This is mainly because simple GAT layers supervised on classification tasks are not as expressive as graph
超越无监督方法，Prov-Gem 是一个基于 GAT 的监督 APT 检测器。然而，它在最简单的 StreamSpot 数据集上也未能取得更好的检测结果。这主要是因为在分类任务上监督的简单 GAT 层并不像图那样具有表现力。
masked auto-encoders in producing high-quality embeddings. Another APT detector mentioned, ShadeWatcher, adopts a semi-supervised detection approach. ShadeWatcher leverages TransR [42] and graph neural networks (GNNs) to detect APTs based on recommendation and is able to achieve the best recall rate on the E3-Trace sub-dataset. TransR, a self-supervised graph representation method on Knowledge Graphs, contributes most to the detection accuracy of ShadeWatcher. Unfortunately, TransR is extremely expensive in computation overhead. For example, ShadeWatcher spends as much as 12 hours training the TransR module on the E3-Trace sub-dataset. On the contrary, evaluation on the computation overhead of MAGIC(Sec. 6.4) shows that MAGIC is able to complete training on the same amount of training data 51 times faster than ShadeWatcher [18].
在生成高质量嵌入方面，掩码自编码器表现出色。另一个提到的 APT 检测器 ShadeWatcher 采用半监督检测方法。ShadeWatcher 利用 TransR 和图神经网络（GNN）基于推荐检测 APT，并能够在 E3-Trace 子数据集上实现最佳召回率。TransR 是一种在知识图谱上自监督的图表示方法，对 ShadeWatcher 的检测准确性贡献最大。不幸的是，TransR 在计算开销上非常昂贵。例如，ShadeWatcher 在 E3-Trace 子数据集上训练 TransR 模块花费了多达 12 小时。相反，对 MAGIC（第 6.4 节）计算开销的评估显示，MAGIC 能够以比 ShadeWatcher 快 51 倍的速度完成相同训练数据的训练。

6.3 False Positive Analysis
6.3 假阳性分析

For real-time applications, APT detectors must prevent false alarms at best effort, as those false alarms often tire and confuse security analysts. We evaluate MAGIC’s false positive rate (FPR) on benign audit logs and investigate how our model adaption mechanism reduces those false alarms. Table 3 shows MAGIC’s false positive rate on each dataset. Within each dataset, only benign logs are used for training and testing. MAGIC yields low FPR (average

0.15 %

) with large training data. This is because MAGIC models benign system behaviors with self-supervised embeddings, allowing it to effectively handle unseen system entities. Such a low FPR enables MAGIC’s application under real-world settings. When conducting fine-grain entity-level detection only, MAGIC only yields 569 false alarms on the Trace dataset, with an average only 40 false alarms every day. Security analysts can easily handle this number of alarms and do security investigations on them. If the two-stage detection described in Sec. 3 is applied, the average number of false alarms every day can be further lowered to 24.
对于实时应用，APT 检测器必须尽力防止误报，因为这些误报常常使安全分析师感到疲惫和困惑。我们评估了 MAGIC 在良性审计日志上的误报率 (FPR)，并研究了我们的模型适应机制如何减少这些误报。表 3 显示了 MAGIC 在每个数据集上的误报率。在每个数据集中，仅使用良性日志进行训练和测试。MAGIC 在大规模训练数据下产生了低 FPR（平均

0.15 %

）。这是因为 MAGIC 通过自监督嵌入建模良性系统行为，使其能够有效处理未见过的系统实体。如此低的 FPR 使 MAGIC 能够在现实世界环境中应用。当仅进行细粒度实体级检测时，MAGIC 在 Trace 数据集上仅产生 569 个误报，平均每天仅 40 个误报。安全分析师可以轻松处理这个数量的警报并对其进行安全调查。如果应用第 3 节中描述的两阶段检测，平均每天的误报数量可以进一步降低到 24。

Our model adaption mechanism is designed to help MAGIC to learn from newly-observed unseen behaviors. We evaluate how such mechanism reduces false positives on benign audit logs in Table 5. Specifically, we test MAGIC on the Trace dataset under five different settings:
我们的模型适应机制旨在帮助 MAGIC 从新观察到的未见行为中学习。我们在表 5 中评估该机制如何减少良性审计日志中的误报。具体而言，我们在五种不同设置下对 Trace 数据集上的 MAGIC 进行测试：

Training on the first $80 %$ log entries and testing on the rest $20 %$ with no adaption, identical to our original setting.
在前 $80 %$ 个日志条目上进行训练，并在其余 $20 %$ 个条目上进行测试，未进行适应，与我们的原始设置相同。
Training on the first $20 %$ and testing on the last $20 %$ with no adaption, for comparison purpose.
在前 $20 %$ 进行训练，在最后 $20 %$ 进行测试，未进行适应，供比较之用。
Training on the first $20 %$ , adapting on false positives generated from the following $20 %$ , and testing on the last $20 %$ .
在前面的 $20 %$ 上进行训练，在后面的 $20 %$ 中对生成的假阳性进行调整，并在最后的 $20 %$ 上进行测试。
Training on the first $20 %$ , adapting on both false positives and true negatives generated from the following $20 % \log$ entries, and testing on the last $20 %$ .
在前 $20 %$ 上进行训练，针对接下来的 $20 % \log$ 条生成的假阳性和真阴性进行调整，并在最后 $20 %$ 上进行测试。
Training on the first $20 %$ , adapting on both false positives and true negatives generated from the following $40 % \log$
在第一次 $20 %$ 上进行训练，针对以下 $40 % \log$ 生成的假阳性和真阴性进行调整

Table 5: MAGIC’s false positive rates on different datasets. The effect of model adaption mechanism is tested under different settings.
表 5：MAGIC 在不同数据集上的假阳性率。模型适应机制的效果在不同设置下进行了测试。

Dataset 数据集	Train Ratio 列车比例	Adaption 适应	Test Ratio 测试比率	FPR
StreamSpot 流点	$80 %$	N/A 不适用	$20 %$	$0.59 %$
Unicorn Wget 独角兽 Wget	$80 %$	N/A 不适用	$20 %$	$2.00 %$
	$80 %$	N/A 不适用	$20 %$	$0.089 %$
	$20 %$	N/A 不适用	$20 %$	$0.426 %$
DARPA	$20 %$	$20 %$ FP $20 %$ FP	$20 %$	$0.272 %$
E3 Trace E3 跟踪	$20 %$	$20 %$ FP & TN $20 %$ FP 和 TN	$20 %$	$0.220 %$
	$20 %$	$40 %$ FP & TN $40 %$ FP 和 TN	$20 %$	$0.173 %$

Table 6: Performance overhead of MAGIC on the E3-Trace sub-dataset.
表 6：MAGIC 在 E3-Trace 子数据集上的性能开销。

Phase 阶段

Component 组件

Time consumption (s) 时间消耗（秒）

Peak Memory consumption (MB)
峰值内存消耗 (MB)

CPU only 仅限 CPU

图形构建

Graph

Construction

N/A 不适用

642

2,610

图形表示

Graph

Representation

151

685

1,564

Detection 检测

1,320

Inference 推理

图形表示

Graph

Representation

2,108

Detection 检测

825

1,667

entries, and testing on the last

20 %

.
条目，并在最后进行

20 %

的测试。
Experimental results indicate that adapting the model to benign feedbacks consistently reduces false positives. A further reduction can be achieved by feeding both false positives and true negatives to the model. This is because the graph representation module can be retrained with any data to enhance its representation ability, as described in Sec 4.4.
实验结果表明，将模型适应于良性反馈可以持续减少误报。通过将误报和真阴性同时输入模型，可以进一步减少误报。这是因为图表示模块可以使用任何数据进行再训练，以增强其表示能力，如第 4.4 节所述。

6.4 Performance Overhead 6.4 性能开销

MAGIC is designed to perform APT detection with minimum overhead, granting it applicability under various conditions. MAGIC completes its training and inference in logarithmic time and takes up linear space. We provide a detailed analysis of its time and space complexity in Appendix B. We further test MAGIC’s run-time performance on sub-dataset E3-Trace and present its time and memory consumption in Table 6.
MAGIC 旨在以最小的开销执行 APT 检测，使其在各种条件下都具有适用性。MAGIC 的训练和推理在对数时间内完成，并占用线性空间。我们在附录 B 中提供了其时间和空间复杂度的详细分析。我们进一步在子数据集 E3-Trace 上测试 MAGIC 的运行时性能，并在表 6 中展示其时间和内存消耗。

In real-world settings, GPUs may not be available at all. We also test MAGIC’s efficiency without the GPU. With only CPUs available, the training phase becomes apparently slower. The efficiency of graph construction and outlier detection phase is not affected as they are implemented to perform on CPUs. We also measure the max memory consumption during training and inference. MAGIC’s low memory consumption not only prevents OOM problems on huge datasets, but also makes MAGIC approachable under tight conditions.
在现实世界中，GPU 可能根本不可用。我们还测试了在没有 GPU 的情况下 MAGIC 的效率。仅使用 CPU 时，训练阶段显然变得更慢。图构建和异常检测阶段的效率不受影响，因为它们被实现为在 CPU 上执行。我们还测量了训练和推理期间的最大内存消耗。MAGIC 的低内存消耗不仅防止了在大型数据集上出现 OOM 问题，还使得在紧张条件下使用 MAGIC 成为可能。

Evaluation results on performance overhead manifest the claim that MAGIC is advantageous over other state-of-the-art APT detectors in efficiency. For instance, ATLAS [11] takes about an hour to train its model on 676MB of audit logs and ShadeWatcher [18] takes 1 day to train on the E3-Trace subdataset. Compared with ShadeWatcher, MAGIC is 51 times faster in training under the same train ratio setting (i.e.

80 %

评估结果显示，MAGIC 在效率上优于其他最先进的 APT 检测器。例如，ATLAS [11]在 676MB 的审计日志上训练模型大约需要一个小时，而 ShadeWatcher [18]在 E3-Trace 子数据集上训练则需要 1 天。与 ShadeWatcher 相比，MAGIC 在相同训练比例设置下（即

80 %

）的训练速度快了 51 倍。

Figure 7: Effect of different reconstruction components on MAGIC’s performance and efficiency.
图 7：不同重建组件对 MAGIC 性能和效率的影响。
log entries for training on E3-Trace sub-dataset).
E3-Trace 子数据集的训练日志条目。
These evaluation results also illustrate that MAGIC is fully practicable in different conditions. Considering the fact that sub-dataset E3-Trace is collected in 2 weeks, 1.37 GB of audit logs is produced per day. This means under CPU-only conditions, MAGIC takes only 2 minutes to detect APTs from those logs and complete model adaption every day. Such promising efficiency makes MAGIC an available choice for individuals and small-size enterprises. For larger enterprises and institutions, they produces audit logs in hundreds of GBs every day [23]. In this case, efficiency of MAGIC can be ensured by training and adapting itself with GPUs and parallelizing the detection module with distributed CPU cores.
这些评估结果还表明，MAGIC 在不同条件下都是完全可行的。考虑到子数据集 E3-Trace 是在两周内收集的，每天产生 1.37 GB 的审计日志。这意味着在仅使用 CPU 的情况下，MAGIC 每天只需 2 分钟就能从这些日志中检测 APT 并完成模型适应。这种令人鼓舞的效率使 MAGIC 成为个人和小型企业的可用选择。对于大型企业和机构，他们每天产生数百 GB 的审计日志。在这种情况下，可以通过使用 GPU 进行训练和自我适应，并通过分布式 CPU 核心并行化检测模块来确保 MAGIC 的效率。

6.5 Ablation Study 6.5 消融研究

In this section, we first address the effectiveness of important individual components in MAGIC’s graph representation module, then carry out a hyper-parameter analysis to evaluate the sensitivity of MAGIC. Analysis on individual components and most hyper-parameters are conducted on the most difficult Wget dataset and the hyper-parameter analysis on the detection threshold

θ

is conducted on all datasets.
在本节中，我们首先讨论 MAGIC 图表示模块中重要单个组件的有效性，然后进行超参数分析以评估 MAGIC 的敏感性。对单个组件和大多数超参数的分析是在最困难的 Wget 数据集上进行的，而对检测阈值

θ

的超参数分析是在所有数据集上进行的。
Individual Component Analysis. We study how feature reconstruction (FR) as well as structure reconstruction (SR) affect MAGIC’s performance and Figure 7 presents the impact of these component to both detection result and performance overhead. Both FR and SR provide supervision for MAGIC’s graph representation module. Compared with ordinary FR, Masked FR slightly boosts performance and significantly reduces training time. Sampled-based SR, however, is an effective complexity reduction component which accelerates training without losing performance, compared with full SR. Hyper-parameter Analysis. The function of MAGIC is controlled by several hyper-parameters, including an embedding dimension

d

, number of GAT layers

l

, node mask rate

r

and the outlier detection threshold

θ

. Figure 8 illustrates how these hyper-parameters affect model performance in different situations. Hyper-parameters have little impact in most cases.
个体组件分析。我们研究特征重建（FR）和结构重建（SR）如何影响 MAGIC 的性能，图 7 展示了这些组件对检测结果和性能开销的影响。FR 和 SR 都为 MAGIC 的图表示模块提供了监督。与普通 FR 相比，掩码 FR 略微提升了性能，并显著减少了训练时间。然而，基于采样的 SR 是一种有效的复杂性降低组件，它在不损失性能的情况下加速训练，相比于完整的 SR。超参数分析。MAGIC 的功能由几个超参数控制，包括嵌入维度

d

、GAT 层数

l

、节点掩码率

r

和异常检测阈值

θ

。图 8 说明了这些超参数在不同情况下如何影响模型性能。在大多数情况下，超参数的影响很小。

Generally speaking, relatively higher model performance
一般来说，相对较高的模型性能
is achieved with a larger embedding dimension and more GAT layers by collecting more information from more distant neighborhoods, as shown in Sub-figure 8(a) and 8(b). However, increasing

d

l

introduces heavier computation which leads to longer training and inference time.
通过使用更大的嵌入维度和更多的 GAT 层，从更远的邻域收集更多信息来实现，如子图 8(a)和 8(b)所示。然而，增加

d

或

l

会引入更重的计算，从而导致更长的训练和推理时间。

The default mask rate of 0.5 yields best results. This is because MAGIC is unable to get sufficient training under a low mask rate. And under a high mask rate, node features are severely damaged which prevents MAGIC to learn node embeddings via feature reconstruction. Increasing mask rate slightly introduces more computation burden.
默认的掩码率为 0.5 能产生最佳结果。这是因为 MAGIC 在低掩码率下无法获得足够的训练。而在高掩码率下，节点特征受到严重损害，导致 MAGIC 无法通过特征重构学习节点嵌入。稍微增加掩码率会引入更多的计算负担。

We further examine the anomaly scores of entities to assess the sensitive of

θ

. A lower

θ

naturally leads to a higher recall performance at the cost of more false positives and vice versa. As demonstrated in Figure 9, most malicious entities are given high anomaly scores compared with benign ones and are wellseparated from them with little overlapping. The considerable spaces between benign and malicious anomaly scores support the claim that MAGIC does not depend on a precise threshold

θ

to perform accurate detection in practical situations. We quantify such spaces in Appendix D.
我们进一步检查实体的异常分数，以评估

θ

的敏感性。较低的

θ

自然会导致更高的召回性能，但代价是更多的假阳性，反之亦然。如图 9 所示，大多数恶意实体的异常分数与良性实体相比较高，并且与它们之间有很好的分离，重叠很少。良性和恶意异常分数之间的显著间隔支持了 MAGIC 在实际情况下进行准确检测不依赖于精确阈值

θ

的说法。我们在附录 D 中量化了这些间隔。

For an unsupervised detector as MAGIC, hyper-parameters are usually difficult to select. However, that is not the case for MAGIC. We provide a general guideline for hyper-parameter selection also in Appendix D.
对于像 MAGIC 这样的无监督检测器，超参数通常很难选择。然而，MAGIC 并非如此。我们在附录 D 中提供了超参数选择的一般指导。

7 Discussion and Limitations
7 讨论与局限性

Quality of Training Data. MAGIC models benign system behaviors and detects APTs with outlier detection. Similar to other anomaly-based APT detection approaches [7, 8, 10], we assume that all up-to-date benign system behaviors are observed during training log collection. However, if MAGIC is trained with low quality data that insufficiently covers system behaviors, many false positives can be generated. .
训练数据的质量。MAGIC 模型模拟良性系统行为并通过异常检测来检测 APT。与其他基于异常的 APT 检测方法类似，我们假设在训练日志收集期间观察到所有最新的良性系统行为。然而，如果 MAGIC 使用低质量数据进行训练，且数据不足以覆盖系统行为，则可能会产生许多误报。
Outlier Detection. MAGIC implements a KNN-based outlier detection module for APT detection. While it completes training and inference in logarithmic time, its efficiency on large datasets is still unsatisfactory. To illustrate, our detection module takes 13.8 minutes to check 684,111 targets, making up

99 %

of total inference time. Other outlier detection methods, such as One-class SVM [43] and Isolation Forest [44], do not fit in our detection setting and cannot adapt to concept drift. Cluster-based methods and approximate KNN search may be more suitable to huge datasets. Meanwhile, KNN-based methods may be extended to GPU. We leave such improvements on our detection module to future research.
异常检测。MAGIC 实现了一个基于 KNN 的异常检测模块用于 APT 检测。尽管它在对数时间内完成训练和推理，但在大数据集上的效率仍然不令人满意。举例来说，我们的检测模块检查 684,111 个目标需要 13.8 分钟，占总推理时间的

99 %

。其他异常检测方法，如一类 SVM [43] 和孤立森林 [44]，不适合我们的检测设置，无法适应概念漂移。基于聚类的方法和近似 KNN 搜索可能更适合巨大的数据集。同时，基于 KNN 的方法可以扩展到 GPU。我们将对检测模块的此类改进留待未来研究。
Adversarial Attacks. In Sec. 6.2, we show that MAGIC handles stealth attacks well, which avoid detection by behaving similar to the benign system. However, if attackers get to know how MAGIC works in details, they might conduct elaborately designed attacks to infiltrate our detector. We make a simple analysis on adversarial attacks and demonstrate MAGIC’s robustness against them in Appendix A. As Graph-based
对抗攻击。在第 6.2 节中，我们展示了 MAGIC 能够很好地处理隐蔽攻击，这些攻击通过表现得类似于良性系统来避免检测。然而，如果攻击者详细了解 MAGIC 的工作原理，他们可能会进行精心设计的攻击以渗透我们的检测器。我们对对抗攻击进行了简单分析，并在附录 A 中展示了 MAGIC 对它们的鲁棒性。作为基于图的

Figure 8: Effect of different hyper-parameters on MAGIC’s performance and efficiency.
图 8：不同超参数对 MAGIC 性能和效率的影响。

Figure 9: Anomaly scores of system entities. We exclude the highest and lowest 5% scores on each dataset.
图 9：系统实体的异常分数。我们排除了每个数据集中最高和最低的 5%分数。
approaches become increasingly popular and powerful in different detection applications, designing and avoiding these adversarial attacks on both the input graphs and the GNNs stands for an interesting research topic.
随着不同检测应用中方法变得越来越流行和强大，设计和避免对输入图和图神经网络的对抗攻击成为一个有趣的研究课题。

MAGIC is mainly related to three fields of research, including the detection of APTs, graph representation learning and outlier detection methods.
MAGIC 主要与三个研究领域相关，包括 APT 的检测、图表示学习和异常检测方法。
APT Detection. The goal of APT detection is to detect APT signals, malicious entities and invalid interactions from audit logs. Recent works are mostly based on data provenance. As suggested by [18], provenance-based detectors can be categorized into rule-based, statistics-based and learning-based approaches. Rule-based approaches [2-6] utilize a priori knowledge about previous seen attacks and construct unique heuristic rules to detect them. Statistics-based approaches [7-9] construct statistics to measure the abnormality of provenance graph elements and perform anomaly detection on them. Learning-based approaches

[10 - 18, 45]

leverage deep learning to model either benign system behaviors

[10, 12, 17]

or attack patterns [11, 14-16,18] and perform APT detection as classification

[11, 15, 16]

or anomaly detection

[18, 45]

. Among them, sequence-based methods [11,14] detect APTs
APT 检测。APT 检测的目标是从审计日志中检测 APT 信号、恶意实体和无效交互。最近的研究大多基于数据来源。如[18]所建议，基于来源的检测器可以分为基于规则、基于统计和基于学习的方法。基于规则的方法[2-6]利用先前已知攻击的先验知识，构建独特的启发式规则来检测它们。基于统计的方法[7-9]构建统计数据以测量来源图元素的异常性，并对其进行异常检测。基于学习的方法

[10 - 18, 45]

利用深度学习来建模良性系统行为

[10, 12, 17]

或攻击模式[11, 14-16,18]，并将 APT 检测作为分类

[11, 15, 16]

或异常检测

[18, 45]

。其中，基于序列的方法[11,14]检测 APT。
based on system execution/workflow patterns and graphbased methods [10, 12, 15-18,45] model entities and interactions via GNNs and detect abnormal behaviors as APTs.
基于系统执行/工作流程模式和基于图的方法[10, 12, 15-18, 45]，通过图神经网络（GNNs）建模实体和交互，并检测异常行为作为高级持续威胁（APTs）。
Graph Representation Learning. Embedding techniques on graphs start from the graph convolutional network (GCN) [46] and are further enhanced by the graph attention network (GAT) [30] and GraphSAGE [47]. Graph auto-encoders [19, 25-27] bring unsupervised solutions for graph representation learning. GAE, VGAE [25] and GATE [27] utilize feature reconstruction and structure reconstruction to learn output embeddings in a self-supervised way. However, they focus on link prediction and graph clustering tasks, irrelevant to our application. Recently, graph masked auto-encoders [19] leverage masked feature reconstruction and have achieved state-of-the-art performance on various applications.
图表示学习。图上的嵌入技术始于图卷积网络（GCN）[46]，并通过图注意力网络（GAT）[30]和 GraphSAGE [47]进一步增强。图自编码器[19, 25-27]为图表示学习提供了无监督的解决方案。GAE、VGAE [25]和 GATE [27]利用特征重构和结构重构以自监督的方式学习输出嵌入。然而，它们专注于链接预测和图聚类任务，与我们的应用无关。最近，图掩码自编码器[19]利用掩码特征重构，并在各种应用中取得了最先进的性能。
Outlier Detection. Outlier detection methods view outliers (i.e. objects that may not belong to the ordinary distribution) as anomalies and aim to identify them. Traditional outlier detection methods include One-class SVMs [43], Isolation Forest [44] and Local Outlier Factor [48]. These traditional approaches are widely used in various detection scenarios, including credit card fraud detection [49] and malicious transaction detection [50]. This proves that outlier detection methods work well in anomaly detection. Meanwhile, auto-encoders themselves are effective tools for outlier detection. We explain why we do not use graph auto-encoders as anomaly detectors in Appendix F.
异常检测。异常检测方法将异常值（即可能不属于普通分布的对象）视为异常，并旨在识别它们。传统的异常检测方法包括一类支持向量机 [43]、孤立森林 [44] 和局部异常因子 [48]。这些传统方法在各种检测场景中广泛使用，包括信用卡欺诈检测 [49] 和恶意交易检测 [50]。这证明了异常检测方法在异常检测中的有效性。同时，自编码器本身也是有效的异常检测工具。我们在附录 F 中解释了为什么不使用图自编码器作为异常检测器。

9 Conclusion 9 结论

We have introduced MAGIC, an universally applicable APT detection approach that operates in utmost efficiency with little overhead. MAGIC leverages masked graph representation learning to model benign system behaviors from raw audit logs and performs multi-granularity APT detection via outlier detection methods. Evaluations on three widelyused datasets under various detection scenarios indicate that MAGIC achieves promising detection results with low false positive rate and minimum computation overhead.
我们引入了 MAGIC，这是一种普遍适用的 APT 检测方法，具有极高的效率和较小的开销。MAGIC 利用掩码图表示学习从原始审计日志中建模良性系统行为，并通过异常检测方法进行多粒度 APT 检测。在各种检测场景下对三个广泛使用的数据集的评估表明，MAGIC 在低误报率和最小计算开销的情况下取得了令人满意的检测结果。

Acknowledgments 致谢

We would like to thank the anonymous reviewers and our shepherd for their detailed and valuable comments. This work was supported by the National Natural Science Foundation of China (No. U1936213 and No. 62032025), CNKLSTISS, the Fundamental Research Funds for the Central Universities, Sun Yat-sen University (No. 22lgqb26) and Program of Shanghai Academic Research Leader (No. 21XD1421500).
我们要感谢匿名评审和我们的指导者提供的详细和宝贵的意见。本研究得到了中国国家自然科学基金（编号：U1936213 和 62032025）、CNKLSTISS、中南大学基础研究基金（编号：22lgqb26）以及上海学术研究领军人才计划（编号：21XD1421500）的支持。

References 参考文献

[1] Adel Alshamrani, Sowmya Myneni, Ankur Chowdhary, et al. A survey on advanced persistent threats: Techniques, solutions, challenges, and research opportunities. IEEE Communications Surveys & Tutorials, 2019.
阿德尔·阿尔沙姆拉尼，索姆亚·迈尼尼，安库尔·乔达里等。关于高级持续威胁的调查：技术、解决方案、挑战和研究机会。IEEE 通信调查与教程，2019 年。
[2] Wajih Ul Hassan, Adam Bates, and Daniel Marino. Tactical provenance analysis for endpoint detection and response systems. In 2020 IEEE Symposium on Security and Privacy (SP), 2020.
[2] Wajih Ul Hassan, Adam Bates 和 Daniel Marino. 针对终端检测和响应系统的战术来源分析. 载于 2020 年 IEEE 安全与隐私研讨会(SP), 2020.
[3] Sadegh M Milajerdi, Rigel Gjomemo, Birhanu Eshete, et al. Holmes: real-time apt detection through correlation of suspicious information flows. In 2019 IEEE Symposium on Security and Privacy (SP), 2019.
[3] Sadegh M Milajerdi, Rigel Gjomemo, Birhanu Eshete 等. Holmes：通过可疑信息流的关联进行实时 APT 检测. 2019 年 IEEE 安全与隐私研讨会（SP），2019。
[4] Md Nahid Hossain, Sadegh M Milajerdi, Junao Wang, et al. Sleuth: Real-time attack scenario reconstruction from cots audit data. In USENIX Security Symposium, 2017.
[4] Md Nahid Hossain, Sadegh M Milajerdi, Junao Wang 等. Sleuth: 基于商业现成审计数据的实时攻击场景重建. 在 USENIX 安全研讨会, 2017.
[5] Sadegh M Milajerdi, Birhanu Eshete, Rigel Gjomemo, et al. Poirot: Aligning attack behavior with kernel audit records for cyber threat hunting. In Proceedings of the 2019 ACM SIGSAC conference on computer and communications security, 2019.
[5] Sadegh M Milajerdi, Birhanu Eshete, Rigel Gjomemo 等. Poirot：将攻击行为与内核审计记录对齐以进行网络威胁狩猎. 载于 2019 年 ACM SIGSAC 计算机与通信安全会议论文集, 2019.
[6] Md Nahid Hossain, Sanaz Sheikhi, and R Sekar. Combating dependence explosion in forensic analysis using alternative tag propagation semantics. In 2020 IEEE Symposium on Security and Privacy (SP), 2020.
[6] Md Nahid Hossain, Sanaz Sheikhi 和 R Sekar. 使用替代标签传播语义应对法医分析中的依赖爆炸. 载于 2020 年 IEEE 安全与隐私研讨会（SP），2020。
[7] Wajih Ul Hassan, Shengjian Guo, Ding Li, et al. Nodoze: Combatting threat alert fatigue with automated provenance triage. In network and distributed systems security symposium, 2019.
[7] Wajih Ul Hassan, Shengjian Guo, Ding Li 等. Nodoze：通过自动化来源分类来应对威胁警报疲劳. 在网络与分布式系统安全研讨会, 2019.
[8] Qi Wang, Wajih Ul Hassan, Ding Li, et al. You are what you do: Hunting stealthy malware via data provenance analysis. In NDSS, 2020.
[8] 齐王，瓦吉赫·乌尔·哈桑，丁立等。你就是你所做的：通过数据来源分析猎捕隐秘恶意软件。在 NDSS，2020。
[9] Yushan Liu, Mu Zhang, Ding Li, et al. Towards a timely causality analysis for enterprise security. In NDSS, 2018.
[9] 刘玉山, 张穆, 李丁, 等. 面向企业安全的及时因果分析. 发表在 NDSS, 2018.
[10] Xueyuan Han, Thomas Pasquier, Adam Bates, et al. Unicorn: Runtime provenance-based detector for advanced
[10] Xueyuan Han, Thomas Pasquier, Adam Bates 等. Unicorn: 基于运行时来源的高级检测器
persistent threats. arXiv preprint arXiv:2001.01525, 2020.
持续威胁。arXiv 预印本 arXiv:2001.01525，2020 年。
[11] Abdulellah Alsaheel, Yuhong Nan, Shiqing Ma, et al. Atlas: A sequence-based learning approach for attack investigation. In USENIX Security Symposium, 2021.
[11] Abdulellah Alsaheel, Yuhong Nan, Shiqing Ma 等. Atlas: 一种基于序列的攻击调查学习方法. 在 USENIX 安全研讨会, 2021.
[12] Kexin Pei, Zhongshu Gu, Brendan Saltaformaggio, et al. Hercule: Attack story reconstruction via community discovery on correlated log graph. In Proceedings of the 32Nd Annual Conference on Computer Security Applications, 2016.
[12] 裴克欣, 顾中书, 布伦丹·萨尔塔福马吉奥, 等. Hercule: 通过相关日志图上的社区发现进行攻击故事重构. 载于 2016 年计算机安全应用年会论文集。
[13] Yun Shen, Enrico Mariconti, Pierre Antoine Vervier, et al. Tiresias: Predicting security events through deep learning. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, 2018.
[13] 云申，恩里科·马里孔蒂，皮埃尔·安托万·维尔维耶等。Tiresias：通过深度学习预测安全事件。发表于 2018 年 ACM SIGSAC 计算机与通信安全会议论文集，2018 年。
[14] Fucheng Liu, Yu Wen, Dongxue Zhang, et al. Log2vec: A heterogeneous graph embedding based approach for detecting cyber threats within enterprise. In Proceedings of the 2019 ACM SIGSAC conference on computer and communications security, 2019.
[14] 刘福成, 文宇, 张冬雪, 等. Log2vec：一种基于异构图嵌入的方法，用于检测企业中的网络威胁. 载于 2019 年 ACM SIGSAC 计算机与通信安全会议论文集, 2019.
[15] Maya Kapoor, Joshua Melton, Michael Ridenhour, et al. Prov-gem: Automated provenance analysis framework using graph embeddings. In 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), 2021.
[15] 玛雅·卡普尔，约书亚·梅尔顿，迈克尔·赖登豪尔等。Prov-gem：使用图嵌入的自动化来源分析框架。在 2021 年第 20 届 IEEE 国际机器学习与应用会议（ICMLA），2021。
[16] Zitong Li, Xiang Cheng, Lixiao Sun, et al. A hierarchical approach for advanced persistent threat detection with attention-based graph neural networks. Security and Communication Networks, 2021.
[16] 李子彤, 程翔, 孙立晓, 等. 基于注意力的图神经网络的高级持续威胁检测的分层方法. 安全与通信网络, 2021.
[17] Su Wang, Zhiliang Wang, Tao Zhou, et al. Threatrace: Detecting and tracing host-based threats in node level through provenance graph learning. IEEE Transactions on Information Forensics and Security, 2022.
[17] 苏旺，王志良，周涛等。Threatrace：通过来源图学习检测和追踪节点级的主机威胁。IEEE 信息取证与安全学报，2022 年。
[18] Jun Zengy, Xiang Wang, Jiahao Liu, et al. Shadewatcher: Recommendation-guided cyber threat analysis using system audit records. In 2022 IEEE Symposium on Security and Privacy

(S P), 2022

.
[18] 曾俊, 王翔, 刘嘉豪, 等. Shadewatcher: 基于推荐的网络威胁分析使用系统审计记录. 载于 2022 年 IEEE 安全与隐私研讨会

(S P), 2022

.
[19] Zhenyu Hou, Xiao Liu, Yuxiao Dong, et al. Graphmae: Self-supervised masked graph autoencoders. arXiv preprint arXiv:2205.10803, 2022.
[19] 侯振宇, 刘晓, 董宇霄, 等. Graphmae: 自监督掩蔽图自编码器. arXiv 预印本 arXiv:2205.10803, 2022.
[20] Darpa transparent computing program engagement 3 data release. https://github.com/darpa-i20/ Transparent-Computing. Accessed: 2022-09-20.
[20] 达尔帕透明计算项目参与 3 数据发布。https://github.com/darpa-i20/ 透明计算。访问时间：2022-09-20。
[21] The streamspot dataset. https://github.com/ sbustreamspot/sbustreamspot-data. Accessed: 2022-09-17.
[21] streamspot 数据集。https://github.com/ sbustreamspot/sbustreamspot-data。访问日期：2022-09-17。
[22] Wget dataset. https://dataverse.harvard.edu/ dataverse/unicorn-wget. Accessed: 2022-09-17.
[22] Wget 数据集。https://dataverse.harvard.edu/ dataverse/unicorn-wget。访问日期：2022-09-17。
[23] Zhang Xu, Zhenyu Wu, Zhichun Li, et al. High fidelity data reduction for big data security dependency analyses. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, 2016.
[23] 张旭，吴振宇，李志春等。大数据安全依赖分析的高保真数据减少。载于 2016 年 ACM SIGSAC 计算机与通信安全会议论文集，2016。
[24] Hailun Ding, Juan Zhai, Yuhong Nan, and Shiqing Ma. Airtag: Towards automated attack investigation by unsupervised learning with log texts. In 32nd USENIX Security Symposium (USENIX Security 23), pages 373390, 2023.
[24] 丁海伦, 翟娟, 南玉红, 和马世青. Airtag: 通过无监督学习与日志文本实现自动化攻击调查. 在第 32 届 USENIX 安全研讨会 (USENIX Security 23), 第 373-390 页, 2023.
[25] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
[25] Thomas N Kipf 和 Max Welling. 变分图自编码器. arXiv 预印本 arXiv:1611.07308, 2016.
[26] Jiwoong Park, Minsik Lee, Hyung Jin Chang, et al. Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In Proceedings of the IEEE/CVF international conference on computer vision, 2019.
[26] Jiwoong Park, Minsik Lee, Hyung Jin Chang 等. 对称图卷积自编码器用于无监督图表示学习. 载于 2019 年 IEEE/CVF 国际计算机视觉会议论文集.
[27] Amin Salehi and Hasan Davulcu. Graph attention autoencoders. arXiv preprint arXiv:1905.10715, 2019.
[27] Amin Salehi 和 Hasan Davulcu. 图注意力自编码器. arXiv 预印本 arXiv:1905.10715, 2019.
[28] Yann Collet. Xxhash: Extremely fast non-cryptographic hash algorithm. https://github.com/Cyan4973/ xxHash.
[28] Yann Collet. Xxhash：极快的非加密哈希算法。https://github.com/Cyan4973/ xxHash。
[29] Keyulu Xu, Weihua Hu, Jure Leskovec, et al. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
[29] 许凯璐, 胡伟华, 朱雷·莱斯科维奇等. 图神经网络有多强大？arXiv 预印本 arXiv:1810.00826, 2018.
[30] Petar Veličković, Guillem Cucurull, Arantxa Casanova, et al. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[30] Petar Veličković, Guillem Cucurull, Arantxa Casanova 等. 图注意力网络. arXiv 预印本 arXiv:1710.10903, 2017.
[31] Chuang Liu, Yibing Zhan, Chang Li, et al. Graph pooling for graph neural networks: Progress, challenges, and opportunities. arXiv preprint arXiv:2204.07321, 2022.
[31] 刘创, 战怡兵, 李畅, 等. 图神经网络的图池化：进展、挑战与机遇. arXiv 预印本 arXiv:2204.07321, 2022.
[32] Xiao Wang, Nian Liu, Hui Han, et al. Self-supervised heterogeneous graph neural network with co-contrastive learning. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, 2021.
[32] 小王，年刘，辉韩等。具有共同对比学习的自监督异构图神经网络。发表于第 27 届 ACM SIGKDD 知识发现与数据挖掘会议论文集，2021 年。
[33] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975.
[33] 乔恩·路易斯·本特利。用于关联搜索的多维二叉搜索树。《计算机协会通讯》，1975 年。
[34] Emaad Manzoor, Sadegh M Milajerdi, and Leman Akoglu. Fast memory-efficient anomaly detection in streaming heterogeneous graphs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
[34] Emaad Manzoor, Sadegh M Milajerdi, 和 Leman Akoglu. 在流式异构图中快速内存高效的异常检测. 载于第 22 届 ACM SIGKDD 国际知识发现与数据挖掘会议论文集, 2016.
[35] Thomas Pasquier, Xueyuan Han, Mark Goldstein, et al. Practical whole-system provenance capture. In Proceedings of the 2017 Symposium on Cloud Computing, 2017.
[35] Thomas Pasquier, Xueyuan Han, Mark Goldstein 等. 实用的全系统来源捕获. 载于 2017 年云计算研讨会论文集, 2017.
[36] Joud Khoury, Timothy Upthegrove, Armando Caro, et al. An event-based data model for granular information flow tracking. In Proceedings of the 12th USENIX Conference on Theory and Practice of Provenance, 2020.
[36] Joud Khoury, Timothy Upthegrove, Armando Caro 等. 一种基于事件的数据模型用于细粒度信息流跟踪. 载于第十二届 USENIX 证明理论与实践会议论文集, 2020.
[37] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynamics, and function using networkx. In Proceedings of the 7th Python in Science Conference, 2008.
[37] Aric A. Hagberg, Daniel A. Schult, 和 Pieter J. Swart. 使用 networkx 探索网络结构、动态和功能. 载于第七届科学中的 Python 会议论文集, 2008.
[38] Adam Paszke, Sam Gross, Francisco Massa, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32. 2019.
[38] Adam Paszke, Sam Gross, Francisco Massa 等. Pytorch：一种命令式风格的高性能深度学习库. 载于《神经信息处理系统进展》第 32 卷. 2019.
[39] Minjie Wang, Da Zheng, Zihao Ye, et al. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019.
[39] 王敏杰, 郑达, 叶子豪, 等. 深度图书馆：一个以图为中心的高性能图神经网络包. arXiv 预印本 arXiv:1909.01315, 2019.
[40] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 2011.
[40] F. Pedregosa, G. Varoquaux, A. Gramfort 等. Scikit-learn: Python 中的机器学习. 机器学习研究杂志, 2011.
[41] Vara Prasad, William Cohen, FC Eigler, et al. Locating system problems using dynamic instrumentation. In 2005 Ottawa Linux Symposium, 2005.
[41] Vara Prasad, William Cohen, FC Eigler 等. 使用动态插桩定位系统问题. 在 2005 年渥太华 Linux 研讨会, 2005.
[42] Yankai Lin, Zhiyuan Liu, Maosong Sun, et al. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the AAAI conference on artificial intelligence, 2015.
[42] 林燕凯, 刘志远, 孙茂松, 等. 学习实体和关系嵌入以完成知识图谱. 载于 2015 年人工智能 AAAI 会议论文集.
[43] Larry M Manevitz and Malik Yousef. One-class svms for document classification. Journal of machine Learning research, 2001.
[43] 拉里·M·马内维茨和马利克·尤塞夫。用于文档分类的一类支持向量机。机器学习研究杂志，2001 年。
[44] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 eighth ieee international conference on data mining, 2008.
[44] Fei Tony Liu, Kai Ming Ting, 和 Zhi-Hua Zhou. 隔离森林. 在 2008 年第八届 IEEE 国际数据挖掘会议上, 2008.
[45] Xueyuan Han, Xiao Yu, Thomas Pasquier, et al. Sigl: Securing software installations through deep graph learning. In 30th USENIX Security Symposium (USENIX Security 21), 2021.
[45] 韩雪媛, 余晓, 托马斯·帕斯基耶等. Sigl: 通过深度图学习保护软件安装. 在第 30 届 USENIX 安全研讨会（USENIX Security 21），2021 年.
[46] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[46] Thomas N Kipf 和 Max Welling. 使用图卷积网络的半监督分类. arXiv 预印本 arXiv:1609.02907, 2016.
[47] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 2017.
[47] Will Hamilton, Ying Zhitao 和 Jure Leskovec. 大图上的归纳表示学习. 神经信息处理系统进展, 2017.
[48] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, et al. Lof: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, 2000.
[48] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng 等. Lof: 识别基于密度的局部离群点. 载于 2000 年 ACM SIGMOD 国际数据管理会议论文集, 2000.
[49] Panpan Zheng, Shuhan Yuan, Xintao Wu, et al. Oneclass adversarial nets for fraud detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[49] 郑盼盼, 袁书涵, 吴新涛, 等. 一类对抗网络用于欺诈检测. 载于 2019 年人工智能协会会议论文集.
[50] Zakia Ferdousi and Akira Maeda. Unsupervised outlier detection in time series data. In 22nd International Conference on Data Engineering Workshops (ICDEW’06), 2006.
[50] Zakia Ferdousi 和前田晃。时间序列数据中的无监督异常检测。在第 22 届国际数据工程研讨会（ICDEW’06），2006 年。
[51] Jun Zeng, Zheng Leong Chua, Yinfang Chen, et al. Watson: Abstracting behaviors from audit logs via aggregation of contextual semantics. In NDSS, 2021.
[51] 曾俊, 蔡正良, 陈银芳, 等. Watson: 通过上下文语义聚合从审计日志中抽象行为. 在 NDSS, 2021.

Appendix 附录

A Analysis on Adversarial Attack
对抗攻击分析

Adversarial attacks against MAGIC, such as evasion attacks and poison attacks, are tricky to implement but still possible to carry out. Two types of adversarial attacks are potentially practical against MAGIC, manipulating input audit logs or exploiting model architecture. The later approach is not an option for MAGIC’s attackers, as they have no access to MAGIC’s inner parameters. Similar to SIGL [45], we conduct a simple experiment concerning MAGIC’s robustness against graph manipulations, which shows these attacks do not affect MAGIC’s detection effectiveness.
针对 MAGIC 的对抗性攻击，如规避攻击和毒化攻击，实施起来很棘手，但仍然可以进行。两种类型的对抗性攻击可能对 MAGIC 有效，分别是操纵输入审计日志或利用模型架构。后者对 MAGIC 的攻击者来说不是一个选项，因为他们无法访问 MAGIC 的内部参数。与 SIGL [45]类似，我们进行了一项简单实验，研究 MAGIC 对图形操作的鲁棒性，结果表明这些攻击不会影响 MAGIC 的检测有效性。

In this experiment, we expect attackers have no knowledge of MAGIC’s inner parameters and cannot get any feedback from MAGIC. However, attackers can freely manipulate the malicious entities within MAGIC’s input audit logs, as well as a small proportion of benign entities. Consequently, we consider four types of attacks:
在本实验中，我们预计攻击者对 MAGIC 的内部参数没有任何了解，也无法从 MAGIC 获得任何反馈。然而，攻击者可以自由操控 MAGIC 输入审计日志中的恶意实体，以及少量的良性实体。因此，我们考虑四种类型的攻击：

Malicious Feature Evasion (MFE). Attackers have altered the features of all malicious entities in the raw audit logs trying to evade detection. This affects the node initial embeddings of the input provenance graph and forces malicious entities to mimic benign ones in node features.
恶意特征规避（MFE）。攻击者已更改原始审计日志中所有恶意实体的特征，以试图逃避检测。这影响了输入来源图的节点初始嵌入，并迫使恶意实体在节点特征上模仿良性实体。

Malicious Structure Evasion (MSE). Attackers have adding new edges between malicious entities and benign ones, so that malicious entities behave more normally and tend to have local structures more similar to benign entities. This affects the graph representation module and pulls the embeddings of malicious nodes towards benign ones, making them more difficult to identify.
恶意结构规避（MSE）。攻击者在恶意实体和良性实体之间添加新的边缘，使得恶意实体的行为更为正常，并且倾向于具有与良性实体更相似的局部结构。这影响了图表示模块，并将恶意节点的嵌入拉向良性节点，使其更难以识别。

Combined Evasion (MCE). This type of attack is a combination of the MFE and MSE and causes malicious entities to approximate benign ones in both features and structures.
组合规避（MCE）。这种攻击类型是 MFE 和 MSE 的结合，导致恶意实体在特征和结构上接近良性实体。

Table 7: Impact of different adversarial attack strategies on MAGIC’s detection effectiveness.
表 7：不同对抗攻击策略对 MAGIC 检测有效性的影响。

Attack Type 攻击类型	None 翻译文本：无	MFE	MSE	MCE	BFP
AUC	0.9999	0.9999	0.9999	0.9994	0.9942

Benign Feature Poison (BFP). Attackers have manipulated the features of benign entities to poison MAGIC. Attackers have injected some benign entities with similar initial features as malicious entities, trying to convince MAGIC that the malicious entities behave normally. This shifts benign entities to mimic malicious ones in node features and gradually poisons MAGIC’s detection module.
良性特征毒素（BFP）。攻击者操纵了良性实体的特征，以毒害 MAGIC。攻击者向一些良性实体注入了与恶意实体相似的初始特征，试图说服 MAGIC 认为恶意实体的行为是正常的。这使得良性实体在节点特征上模仿恶意实体，并逐渐毒害 MAGIC 的检测模块。

We evaluate MAGIC’s robustness against the above four attack strategies on the E3-Trace sub-dataset and we present the results in Table 7. Experimental results confirm that MAGIC is robust enough against adversarial attacks. MFE has almost no impact on MAGIC’s effectiveness. MAGIC is slightly more vulnerable facing the structure evasion attacks, because the graph structure is critical to unsupervised graph representation learning. However, the structure attacks are still weak against MAGIC’s detection effectiveness. We believe that the feature and structure reconstruction involved in our graph representation module contributes to this robustness, as it learns a model to reconstruct both node features and its neighboring structure with the information from its neighbors. Consequently, adversarial attacks involving feature and structure altering of malicious entities will fail, and with MAGIC’s model itself well-protected, attackers have to either extend their effort on searching and manipulating benign entities, or find other attack strategies against MAGIC. Meanwhile, BFP poses a relatively greater threat to MAGIC’s performance. However, under BFP, only 0.057 AUC reduction is achieved by manipulating the input feature of an enormous 161,029 benign entities. This level of intervention in the benign system is beyond the capability of ordinary attackers and will definitely leave observable trace. Therefore, an effective BFP attack is also very difficult to carry out.
我们评估了 MAGIC 对上述四种攻击策略在 E3-Trace 子数据集上的鲁棒性，并在表 7 中展示了结果。实验结果确认 MAGIC 对对抗攻击具有足够的鲁棒性。MFE 几乎对 MAGIC 的有效性没有影响。MAGIC 在面对结构规避攻击时稍显脆弱，因为图结构对无监督图表示学习至关重要。然而，结构攻击对 MAGIC 的检测有效性仍然较弱。我们相信，图表示模块中涉及的特征和结构重建有助于这种鲁棒性，因为它学习一个模型来重建节点特征及其邻近结构，利用来自邻居的信息。因此，涉及恶意实体特征和结构改变的对抗攻击将失败，并且由于 MAGIC 的模型本身得到了良好的保护，攻击者必须要么加大努力寻找和操纵良性实体，要么寻找其他针对 MAGIC 的攻击策略。同时，BFP 对 MAGIC 的性能构成了相对更大的威胁。然而，在 BFP 下，仅有 0。057 AUC 减少是通过操控 161,029 个良性实体的输入特征实现的。这种对良性系统的干预程度超出了普通攻击者的能力，并且肯定会留下可观察的痕迹。因此，有效的 BFP 攻击也非常难以实施。

B Time and Space Complexity of MAGIC
魔法的时间和空间复杂度

Given number of system entities

N

, number of system interactions

E

, number of possible node/edge labels

t

, graph representation dimension

d

, number of GAT layers

l

and mask rate

r

the graph construction steps builds a featured provenance graph in

O ((N + E) * t)

time, masked feature reconstruction is completed in

O ((N + E) * d^{2} * l * r)

time and sample-based structure reconstruction takes only

O (N * d)

time. Training of the detection module takes

O (N * \log N * d)

time to build a K-D Tree and memorize benign embeddings. Detection result of a single target is obtained in

O (\log N * d * k)

time. Thus, the overall time complexity of MAGIC during training and inference is

O (N * \log N * d * k + E * d^{2} * l * r + (N + E) * t)

.
给定系统实体数量

N

、系统交互数量

E

、可能的节点/边标签数量

t

、图表示维度

d

、GAT 层数

l

和掩码率

r

，图构建步骤在

O ((N + E) * t)

时间内构建了一个特征溯源图，掩码特征重建在

O ((N + E) * d^{2} * l * r)

时间内完成，基于样本的结构重建仅需

O (N * d)

时间。检测模块的训练需要

O (N * \log N * d)

时间来构建 K-D 树并记忆良性嵌入。单个目标的检测结果在

O (\log N * d * k)

时间内获得。因此，MAGIC 在训练和推理过程中的整体时间复杂度为

O (N * \log N * d * k + E * d^{2} * l * r + (N + E) * t)

。

Figure 10: Another provenance graph of our motivating example (i.e. Pine Backdoor). Numbers assigned to nodes are the anomaly scores assessed by MAGIC’s detection module.
图 10：我们激励示例（即 Pine Backdoor）的另一个来源图。分配给节点的数字是 MAGIC 检测模块评估的异常分数。

MAGIC’s memory consumption largely depends on

d

and

t

. The graph representation module takes up

O ((N + E) * t)

space to store a provenance graph and

O ((N + E) * (t + d))

space to generate output embeddings. The detection module takes

O (N * d)

space to memorize benign embeddings. The overall space complexity of MAGIC is

O ((N + E) * (t + d))

.
MAGIC 的内存消耗在很大程度上取决于

d

和

t

。图表示模块占用

O ((N + E) * t)

空间来存储来源图，

O ((N + E) * (t + d))

空间来生成输出嵌入。检测模块占用

O (N * d)

空间来记忆良性嵌入。MAGIC 的整体空间复杂度为

O ((N + E) * (t + d))

。

C Case Study C 案例研究

We use our motivating example described in Sec. 2.1 again to illustrate how MAGIC detects APTs from audit logs. Our motivating example involves an APT attack: Pine Backdoor, which implants an malicious executable to a host via phishing e-mail, aiming to perform internal reconnaissance and build a silent connection between the host and the attacker. We perform system entity level detection on it and obtain real detection results from MAGIC. First, MAGIC constructs a provenance graph from raw audit logs. We provide the example provenance graph in Figure 10 to illustrate. Among them, tcexec, Connection-162.66.239.75 and the portscan NetFlowObjects which tcexec connects to are malicious entities while the others are benign ones. The graph representation module then obtains their embeddings via the graph masked auto-encoder. The embedding of tcexec is calculated by propagating and aggregating information from its multihop neighborhood to model its interaction behavior with other system entities. For instance, its 2-hop neighborhood is namely Connection-162.66.239.75, tcexec’s sub-process, uname, ld-linux-x86-64.so. 2 and other portscan NetFlowObjects. The detection module subsequently calculates distances between those embeddings and their k-nearest benign neighbors in the latent space and assigns anomaly scores to them. The malicious entities are given very high anomaly scores (3598.58 6802.93) that far exceed our detection threshold (3000) while others present low abnormality (

0.01 \sim 1636.34

). Thus, MAGIC successfully detects malicious system entities with no false alarm generated.
我们再次使用第 2.1 节中描述的激励示例来说明 MAGIC 如何从审计日志中检测 APT。我们的激励示例涉及一个 APT 攻击：Pine Backdoor，它通过网络钓鱼电子邮件将恶意可执行文件植入主机，旨在进行内部侦察并在主机与攻击者之间建立隐秘连接。我们对其进行系统实体级检测，并从 MAGIC 获得真实的检测结果。首先，MAGIC 从原始审计日志构建一个溯源图。我们在图 10 中提供了示例溯源图以作说明。其中，tcexec、Connection-162.66.239.75 和 tcexec 连接的 portscan NetFlowObjects 是恶意实体，而其他则是良性实体。图表示模块随后通过图掩码自编码器获取它们的嵌入。tcexec 的嵌入是通过传播和聚合来自其多跳邻域的信息来计算的，以建模其与其他系统实体的交互行为。例如，它的 2 跳邻域即为 Connection-162.66.239.75、tcexec 的子进程 uname、ld-linux-x86-64.so 2 和其他 portscan NetFlowObjects。检测模块随后计算这些嵌入与潜在空间中其 k 个最近良性邻居之间的距离，并为它们分配异常分数。恶意实体的异常分数非常高（3598.58 6802.93），远远超过我们的检测阈值（3000），而其他实体则表现出低异常（

0.01 \sim 1636.34

）。因此，MAGIC 成功检测到恶意系统实体，没有产生误报。

D Hyper-parameter Choice Guideline
D 超参数选择指南

The choice of detection threshold

θ

depends on the corresponding dataset MAGIC is operating on. However, finding a precise

θ

is not necessary on each individual dataset, according to Sec. 6.5. Even if there is only benign data available, users may adjust and choose the optimal

θ

according to the resulting false positive rate (i.e. use the first

θ

when false positive rate is below the desired value). Here we provide a simple analysis on how wide the threshold selection range is and what consequence selecting different

θ

may lead to.
检测阈值

θ

的选择取决于 MAGIC 正在操作的相应数据集。然而，根据第 6.5 节，在每个单独的数据集上找到精确的

θ

并不是必要的。即使只有良性数据可用，用户也可以根据结果的假阳性率调整并选择最佳的

θ

（即在假阳性率低于期望值时使用第一个

θ

）。在这里，我们提供一个简单的分析，阈值选择范围有多宽，以及选择不同的

θ

可能导致的后果。
Impact of Different

θ

Choices. Altering

θ

does not lead to drastic changes in detection results, especially for #FN. For example, on the E3-Trace sub-dataset, changing

θ

from 1000 to 5000 results in 2 more FNs and an

1 %

decrease in FPR. On E3-THEIA, altering

θ

from 100 to 500 provides

0.5 %

FPR decrease and does not change #FN. On E3-Cadets, setting

θ

as 100 instead of 10 leads to a

10 %

decreased FPR with only 30 new FNs generated.
不同

θ

选择的影响。改变

θ

不会导致检测结果的剧烈变化，特别是对于#FN。例如，在 E3-Trace 子数据集中，将

θ

从 1000 更改为 5000 会导致多出 2 个 FN，并且 FPR 下降了

1 %

。在 E3-THEIA 上，将

θ

从 100 更改为 500 提供了

0.5 %

的 FPR 下降，并且没有改变#FN。在 E3-Cadets 上，将

θ

设置为 100 而不是 10 导致 FPR 下降了

10 %

，仅生成了 30 个新的 FN。
Quantifing the Threshold Selection Range. By limiting FPR below

1 %

on entity level tasks, we get a minimum

θ

equals 1980 on E3-Trace, 180 on E3-THEIA dataset and 95 on E3-CADETS dataset. The resulting recalls are respectively

0.99985, 0.99996

and 0.99782 . Meanwhile, the maximum

θ

s that ensure recall

> 99 %

are actually 6600, 1020 and 120 on the three different sub-datasets. The space between the minimum and maximum

θ

is big enough and the curves are flat. Consequently, MAGIC does not need a precise

θ

to obtain the desirable result and selecting threshold

θ

based on false positive rate is practical.
量化阈值选择范围。通过将实体级任务中的假阳性率限制在

1 %

以下，我们在 E3-Trace 上得到最小值

θ

为 1980，在 E3-THEIA 数据集上为 180，在 E3-CADETS 数据集上为 95。相应的召回率分别为

0.99985, 0.99996

和 0.99782。同时，确保召回率

> 99 %

的最大

θ

实际上分别为 6600、1020 和 120，适用于这三个不同的子数据集。最小值和最大值

θ

之间的间隔足够大，曲线也很平坦。因此，MAGIC 不需要精确的

θ

来获得理想的结果，基于假阳性率选择阈值

θ

是可行的。

E Noise Reduction E 噪声减少

Noise reduction is widely adopted by various recent works [11,

15, 17, 18

] to reduce the complexity of provenance graphs and remove redundant and useless information. MAGIC applies a mild noise reduction approach as MAGIC is less sensitive to the scale of the provenance graph and more information is preserved in this way. Compared with noise reduction done in recent works

[11, 18]

, we neither delete irrelevant nodes nor merge nodes with the same interaction behavior. This is because (1) attack-irrelevant nodes provide information for benign system behaviors and (2) multiple nodes with the same interaction behavior duplicate the information propagated to near-by nodes and impact on their embeddings.
噪声减少被许多近期的研究广泛采用[11,

15, 17, 18

]，以降低来源图的复杂性并去除冗余和无用的信息。MAGIC 采用了一种温和的噪声减少方法，因为 MAGIC 对来源图的规模不太敏感，因此以这种方式保留了更多信息。与近期研究中进行的噪声减少

[11, 18]

相比，我们既不删除无关节点，也不合并具有相同交互行为的节点。这是因为（1）与攻击无关的节点为良性系统行为提供信息，且（2）具有相同交互行为的多个节点重复了传播到附近节点的信息，并影响它们的嵌入。

F Auto-encoder Based Anomaly Detection
基于自编码器的异常检测

Among traditional applications of machine learning, anomaly detection via auto-encoders is common practice. Typically, auto-encoders are trained to reconstructing a target and minimize its reconstruction loss. Thus, the reconstruction loss of a newly-arrived sample indicates how similar it behaves
在机器学习的传统应用中，通过自编码器进行异常检测是常见的做法。通常，自编码器被训练用于重建目标并最小化其重建损失。因此，新到样本的重建损失表明其行为的相似程度。

Figure 11: Attack graph of sub-dataset E3-Trace.
图 11：子数据集 E3-Trace 的攻击图。
to training samples and samples with high reconstruction errors are detected as outliers. However, we do not apply auto-encoder-based outlier detection because of two reasons: (1) our sample-based structure reconstruction produces highvariance reconstruction loss on single sample, which prevents stable threshold-based outlier detection and (2) MAGIC performs batched log level detection by detecting outliers in system state embeddings, which do not have a reconstruction target and cannot be compared in reconstruction error.
对于训练样本，重构误差较高的样本被检测为异常值。然而，我们不采用基于自编码器的异常检测，原因有两个：（1）我们的基于样本的结构重构在单个样本上产生高方差的重构损失，这阻碍了基于阈值的稳定异常检测；（2）MAGIC 通过检测系统状态嵌入中的异常值进行批量日志级别检测，这些嵌入没有重构目标，无法在重构误差中进行比较。

G Entity-level Data Labeling on DARPA TC datasets
G 实体级数据标注在 DARPA TC 数据集上

A Feasible Labeling Methodology. Mining and labeling attack-relevant entities in DARPA TC datasets can be extremely effort-consuming, given the fact that the ground truth document they provide is practically unreadable. Recently, Watson [51] have carrited out a successful attempt to label the E3-Trace sub-dataset. We are able to repeat this labeling methodology on sub-datasets E3-Trace, E3-THEIA and E3-CADETS. The following is a detailed description on this labeling process:
可行的标记方法。挖掘和标记 DARPA TC 数据集中与攻击相关的实体可能非常耗费精力，因为它们提供的真实文档几乎无法阅读。最近，Watson [51] 成功地对 E3-Trace 子数据集进行了标记。我们能够在 E3-Trace、E3-THEIA 和 E3-CADETS 子数据集上重复这种标记方法。以下是该标记过程的详细描述：

Traverse all log entries in the dataset. Among those records, we extract process, file and netflow entities.
遍历数据集中的所有日志条目。在这些记录中，我们提取进程、文件和网络流实体。
Extract entity names. Entities’ semantic names are stored in different fields. For instance, Subject.properties.map.name stores process names in sub-dataset E3-Trace.
提取实体名称。实体的语义名称存储在不同的字段中。例如，Subject.properties.map.name 在子数据集 E3-Trace 中存储过程名称。
Mine key attack-relevant entities from the ground truth document. Some attacks were not well-recorded but at least one of the attacks using the same strategy is well-recorded.
从真实文档中提取与攻击相关的关键实体。一些攻击记录不够完整，但至少有一种使用相同策略的攻击记录得很好。
Match the names of key attack entities with all extracted entities. The matching entities are labeled as positive. Explore the neighborhood of these entities and search for other entities that are involved in the attack. Newly-identified ones are also treated as positives.
将关键攻击实体的名称与所有提取的实体进行匹配。匹配的实体标记为正面。探索这些实体的邻域，并搜索参与攻击的其他实体。新识别的实体也被视为正面。
The Resulting Ground Truth. We perform such labeling steps on sub-datasets E3-Trace, E3-THEIA and E3-CADETS and obtain the following ground truth, in the form of descriptive text and attack graphs.
结果的真实情况。我们在子数据集 E3-Trace、E3-THEIA 和 E3-CADETS 上执行这些标记步骤，并获得以下真实情况，以描述性文本和攻击图的形式呈现。
E3-Trace (Figure 11). Two successful attack attempts worked on Trace: Browser Extension and Pine Backdoor. During the Browser Extension attack, the user downloaded and executed gtcache via browser extension pass_mgr. gtcache
E3-Trace（图 11）。两个成功的攻击尝试针对 Trace：浏览器扩展和 Pine 后门。在浏览器扩展攻击中，用户通过浏览器扩展 pass_mgr 下载并执行了 gtcache。

Figure 12: Attack graph of sub-dataset E3-THEIA.
图 12：子数据集 E3-THEIA 的攻击图。

Figure 13: Attack graph of sub-dataset E3-CADETS.
图 13：子数据集 E3-CADETS 的攻击图。
communicated with the attacker, scaned sensitive information and created ztmp to portscan host 128.55.12.73. In the Pine Backdoor attack, the user unfortunately launched the phishing executable tcexec. tcexec connected back to the attacker and performed a wide postscan on the local network.
与攻击者进行了沟通，扫描了敏感信息并创建了 ztmp 以对主机 128.55.12.73 进行端口扫描。在 Pine 后门攻击中，用户不幸地启动了钓鱼可执行文件 tcexec。tcexec 连接回攻击者，并在本地网络上进行了广泛的后扫描。

E3-THEIA (Figure 12). Three successful attack attempts worked on THEIA: Firefox Backdoor, Browser Extension and Pine Backdoor. The user was compromised by a malicious payload clean while browsing via firefox. clean acquired root privileges, connected back to the attacker and executed another payload profile. The Browser Extension attack aimed to resume the first attack by re-establishing the connection, grabbing root privileges and portscanning the local network via gtcache and mail. The Pine Backdoor attack is very similar to the one conducted on Trace but unexpectedly stopped due to missing library error.
E3-THEIA（图 12）。有三个成功的攻击尝试针对 THEIA：Firefox 后门、浏览器扩展和 Pine 后门。用户在使用 Firefox 浏览时被恶意有效载荷感染。该有效载荷获得了根权限，连接回攻击者并执行了另一个有效载荷配置。浏览器扩展攻击旨在通过重新建立连接、获取根权限并通过 gtcache 和邮件对本地网络进行端口扫描来恢复第一次攻击。Pine 后门攻击与对 Trace 进行的攻击非常相似，但由于缺少库错误意外停止。
E3-CADETS (Figure 13). The attacker exploited the Nginx Backdoor and tried two attacks on CADETS. During the first attack, the attacker connected to a vulnerable Nginx server running on CADETS and injected process vUgefal with root privileges. It read sensitive information and tried to further infect the sshd process with malicious implants before the host crashed. The attacker then tried another attack on CADETS, resulting in a malicious process XIM. The attacker also created another process test to establish a long-lasting connection and portscan the local network.
E3-CADETS（图 13）。攻击者利用了 Nginx 后门，并对 CADETS 进行了两次攻击。在第一次攻击中，攻击者连接到运行在 CADETS 上的易受攻击的 Nginx 服务器，并以 root 权限注入了进程 vUgefal。它读取了敏感信息，并试图在主机崩溃之前进一步感染 sshd 进程。随后，攻击者对 CADETS 进行了另一轮攻击，导致了恶意进程 XIM 的产生。攻击者还创建了另一个进程 test，以建立持久连接并对本地网络进行端口扫描。

$^{*}$ Corresponding author: Yuhong Nan
$^{*}$ 通讯作者：南玉红
$^{1}$ MAGIC is available at https://github.com/FDUDSDE/MAGIC
$^{1}$ MAGIC 可在 https://github.com/FDUDSDE/MAGIC 获取

MAGIC: Detecting Advanced Persistent Threats via Masked Graph Representation Learning 魔法：通过掩蔽图表示学习检测高级持续威胁

Abstract 摘要

1 Introduction 1 引言

2 Background 2 背景

2.1 Motivating Example 2.1 激励示例

2.2 Prior Research and their Limitations2.2 先前研究及其局限性

2.3 Threat Model and Definitions2.3 威胁模型和定义

3 MAGIC Overview 3 MAGIC 概述

4 Design Details 4 设计细节

4.1 Provenance Graph Construction4.1 来源图构建

4.2 Graph Representation Module4.2 图形表示模块

4.2.1 Feature Masking 4.2.1 特征屏蔽

4.2.2 Graph Encoder 4.2.2 图编码器

4.2.3 Graph Decoder 4.2.3 图解码器

4.3 Detection Module 4.3 检测模块

4.4 Model Adaption 4.4 模型适应

5 Implementation 5 实施

6 Evaluation 6 评估

6.1 Experimental Settings6.1 实验设置

6.2 Effectiveness 6.2 效果性

6.3 False Positive Analysis6.3 假阳性分析

6.4 Performance Overhead 6.4 性能开销

6.5 Ablation Study 6.5 消融研究

7 Discussion and Limitations7 讨论与局限性

8 Related Work 8 相关工作

9 Conclusion 9 结论

Acknowledgments 致谢

References 参考文献

Appendix 附录

A Analysis on Adversarial Attack对抗攻击分析

B Time and Space Complexity of MAGIC魔法的时间和空间复杂度

C Case Study C 案例研究

D Hyper-parameter Choice GuidelineD 超参数选择指南

E Noise Reduction E 噪声减少

F Auto-encoder Based Anomaly Detection基于自编码器的异常检测

G Entity-level Data Labeling on DARPA TC datasetsG 实体级数据标注在 DARPA TC 数据集上

MAGIC: Detecting Advanced Persistent Threats via Masked Graph Representation Learning
魔法：通过掩蔽图表示学习检测高级持续威胁

2.2 Prior Research and their Limitations
2.2 先前研究及其局限性

2.3 Threat Model and Definitions
2.3 威胁模型和定义

4.1 Provenance Graph Construction
4.1 来源图构建

4.2 Graph Representation Module
4.2 图形表示模块

6.1 Experimental Settings
6.1 实验设置

6.3 False Positive Analysis
6.3 假阳性分析

7 Discussion and Limitations
7 讨论与局限性

A Analysis on Adversarial Attack
对抗攻击分析

B Time and Space Complexity of MAGIC
魔法的时间和空间复杂度

D Hyper-parameter Choice Guideline
D 超参数选择指南

F Auto-encoder Based Anomaly Detection
基于自编码器的异常检测

G Entity-level Data Labeling on DARPA TC datasets
G 实体级数据标注在 DARPA TC 数据集上