VulChecker: Graph-based Vulnerability Localization in Source Code
Yisroel Mirsky, Ben-Gurion University of the Negev; George Macon, Georgia Tech Research Institute; Michael Brown, Georgia Institute of Technology; Carter Yagemann, Ohio State University; Matthew Pruett, Evan Downing, Sukarno Mertoguno, and Wenke Lee, Georgia Institute of Technology
This paper is included in the Proceedings of the 32nd USENIX Security Symposium.
August 9-11, 2023 · Anaheim, CA, USA
978-1-939133-37-3
Yisroel Mirsky¹, George Macon², Michael Brown³, Carter Yagemann⁴, Matthew Pruett³, Evan Downing³, Sukarno Mertoguno³ and Wenke Lee³
¹Ben-Gurion University of the Negev  ²Georgia Tech Research Institute  ³Georgia Institute of Technology  ⁴Ohio State University
Abstract
In software development, it is critical to detect vulnerabilities in a project as early as possible. Although deep learning has shown promise in this task, current state-of-the-art methods cannot classify and identify the line on which the vulnerability occurs. Instead, the developer is tasked with searching for an arbitrary bug in an entire function or even larger region of code. In this paper, we propose VulChecker: a tool that can precisely locate vulnerabilities in source code (down to the exact instruction) as well as classify their type (CWE). To accomplish this, we propose a new program representation, program slicing strategy, and the use of a message-passing graph neural network to utilize all of the code's semantics and improve the reach between a vulnerability's root cause and manifestation points. We also propose a novel data augmentation strategy for cheaply creating strong datasets for vulnerability detection in the wild, using free synthetic samples available online. With this training strategy, VulChecker was able to identify 24 CVEs (10 from 2019 & 2020) in 19 projects taken from the wild, with nearly zero false positives, compared to a commercial tool that could only detect 4. VulChecker also discovered an exploitable zero-day vulnerability, which has been reported to developers for responsible disclosure.
1 Introduction
Software vulnerabilities are typically introduced into systems as a consequence of flawed security control designs or developer error during the implementation of the software specification. Such flaws and errors are unavoidable during the design and implementation phases of the software life cycle, particularly for large, complex, and interconnected software systems, such as distributed systems. In 2020, there were 18,352 vulnerabilities publicly reported in the Common Vulnerabilities and Exposures (CVE) database [25], and the total number of new vulnerabilities reported has been increasing every year. In reality, many more vulnerabilities exist and are being silently patched (e.g., [29]), exploited by malicious actors (i.e., zero-days), or simply have yet to be discovered.
Detecting software vulnerabilities in the earliest stages of the software life cycle is critically important, primarily because vulnerabilities in deployed software can be exploited by an attacker to achieve a variety of malicious ends (e.g., data theft, ransom, etc.). Additionally, the cost to patch vulnerabilities at various points in the software life cycle increases dramatically; the cost to patch a vulnerability discovered in deployed software is exponentially higher than the cost to patch the vulnerability during implementation. As a result, detecting vulnerabilities in software source code has been a highly active area of both academic and commercial research and development for decades.
As such, static application security testing (SAST) tools have become pervasive in the software development process to help developers identify bugs and vulnerabilities in their source code. SAST tools typically employ static software analysis techniques (e.g., taint analysis, data-flow analysis) and/or libraries of vulnerability patterns, matching rules, etc., to scan source code, identify potential bugs or vulnerabilities, and report them for manual inspection. SAST reports are highly precise, yielding alerts that identify specific lines of source code as vulnerable to specific types of vulnerabilities, typically categorized as common weakness enumerations (CWEs). Many SAST tools are available to developers, including both open source (e.g., Flawfinder [5], Infer [7], Clang [3]) and commercial offerings (e.g., Fortify [2], Perforce QAC [6], and Coverity [4]). SAST tools are typically employed during code reviews as part of continuous integration and development processes, or launched manually by developers.
Although SAST tools are powerful, they have several disadvantages. First, they often report false positives (i.e., misidentify code as vulnerable). In most cases this is a result of (1) limitations or incomplete rule/pattern matching algorithms, and (2) vulnerabilities that exist in unreachable code. Second, SAST tools struggle to detect complex and contextual vulnerabilities, resulting in high false negative rates (i.e., failing to identify vulnerable code). This is also a limitation of rule/pattern matching algorithms. For example, some program behaviors, such as integer overflows and
underflows, are legitimate in some contexts (e.g., cryptography) and vulnerable to exploitation in others (e.g., memory allocation). Similarly, use-after-free vulnerabilities are highly context sensitive, sometimes requiring long sequences of (de)allocations over the program's lifetime to manifest.
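The context sensitivity of integer wraparound can be made concrete. Below is a minimal Python sketch (simulating C's unsigned 32-bit arithmetic with a mask; the element count and sizes are hypothetical) showing why the same wraparound is benign in modular-arithmetic code but dangerous in a size computation:

```python
# Simulate C's unsigned 32-bit arithmetic, where overflow wraps silently.
MASK32 = 0xFFFFFFFF

def add32(a: int, b: int) -> int:
    """Unsigned 32-bit addition, as it behaves in C."""
    return (a + b) & MASK32

# Benign context: many crypto primitives *define* their additions mod 2^32
# (e.g., the word additions in SHA-256), so wraparound is intended.
assert add32(0xFFFFFFFF, 1) == 0

# Vulnerable context (CWE-190 leading to CWE-122): an attacker-controlled
# element count makes the size computation wrap to 0, so a subsequent
# malloc(alloc_size) would under-allocate while the program still writes
# `count` elements into the buffer.
count = 0x40000000            # hypothetical attacker-controlled value
elem_size = 8
alloc_size = (count * elem_size) & MASK32
print(hex(alloc_size))        # 0x0
```

A rule-based checker that flags every wraparound would drown the first case in false positives; one that flags none misses the second.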
1.1 Recent Efforts
To overcome these limitations, researchers have recently turned to deep learning. Specifically, prior work has explored applications of convolutional (CNN) [19, 28, 38], recurrent (RNN) [20-23, 40], and graph neural networks (GNN) [9, 11, 13, 14, 30, 32, 35, 39] to detect vulnerabilities in source code. This is accomplished by representing the software as an annotated graph (G) to encode its data-flow (DFG), control-flow (CFG), and/or abstract syntax (AST).
While prior approaches have demonstrated good performance, software developers are unlikely to adopt or use these models in the same manner they currently use SAST tools. This is because (1) these methods make predictions over entire functions or code regions and/or (2) cannot classify the type of vulnerability found. Using such a tool would require developers to search through hundreds of lines of code for a vulnerability with no further information. This is even more problematic with tools that do not classify the vulnerability [9, 11, 14, 20-22, 30, 35, 38, 39], since the developer has no indication of what they are looking for or what remediation approach is required. This issue is amplified in the presence of false positives, which are a common occurrence in existing machine learning techniques.
1.2 Insights
To develop a solution that can be readily used by developers in a comparable manner as existing SAST tools, we first identify the issues with current approaches and their causes:
Broad Program Slicing. To localize a vulnerability, the current approach is to take a subgraph G_i from G as an observation for making a prediction. G_i is either an entire function [9, 11, 28, 30, 32, 35, 39] or a contiguous region extracted from G with a point of interest (PoI) at the center (a.k.a., a program slice) [13, 14, 19, 20, 22, 23, 38, 40]. As a result, a prediction made on G_i can relate to hundreds of lines of code. It also makes it harder for the optimizer since G_i may contain a number of potential root cause and manifestation points.
Insight 1: To increase the precision of a prediction, the observation must relate directly to the location of the vulnerability.
Incomplete Code Representations. To convert G_i into a format that machine learning can understand, symbols derived from the AST and/or functions are either compressed [20-23, 28, 40] or embedded using methods like Word2Vec [9, 11, 19, 30, 32, 35, 39] as a preprocessing step. This method of representing software is compatible with machine learning techniques, but is suboptimal because it prevents the learner from directly reasoning about instruction and dependency semantics. Moreover, information from the AST is either (1) omitted, (2) included in a manner that inhibits graph-based learning [9, 11, 30], or (3) included incorrectly in a manner that harms code semantics and model efficiency [32, 39].
Insight 2: To enable effective graph-based learning in this task, a new code representation that properly incorporates the flow of AST information is required. Furthermore, instruction level semantics should be explicitly provided in G to enable efficient and effective learning.
Manifestation Distance. The manifestation of a vulnerability can occur hundreds of lines after its root cause. Current GNN approaches use models that can only propagate a node's information one or two steps away. As a result, these GNN models cannot infer whether a potential root cause directly influences a potential manifestation point. Instead, these models are limited to identifying local patterns around either point (e.g., a missing guard branch prior to a memcpy is enough to raise an alarm).
Insight 3: To increase the model's ability to identify vulnerabilities, the model must be able to identify causality across the entirety of G_i.
The Lack of Labeled Data. Deep learning requires thousands of samples for training. However, since it is hard and expensive to label vulnerabilities in real code at the instruction or even line level, large fine-grained datasets do not exist. Therefore, current work must either: (1) use synthetic datasets (e.g., NIST SARD [26]), which do not generalize to large projects or real world code, or (2) use datasets that only label code regions (program slices) or entire functions.
Insight 4: To enable cost effective high precision detection in source code, there is a need to develop data augmentation techniques to efficiently expand existing datasets.
Level of Program Representation. Source-level programming languages are designed for programmer ease and flexibility, and as such can capture several inter-related and high level operations in a single line of source code. Such rich instruction semantics are in stark contrast to the characteristics of software vulnerabilities, which are typically associated with atomic and machine-level instruction semantics (e.g., out-of-bounds buffer accesses, integer overflows). Most prior work extracts G from the source code without lowering or compiling it. As a result, these models are challenged with identifying low-level vulnerabilities from abstract program representations.
Insight 5: To improve the model's ability to reason about vulnerabilities that have atomic machine-level characteristics, G should capture instruction semantics at lower levels rather than at source code.
1.3 The Proposed Solution
In this paper we propose VulChecker: a deep learning framework for detecting the precise manifestation point
of software vulnerabilities in source code. To accomplish this, VulChecker first passes the program's source code through a custom LLVM compiler toolchain to generate a GNN-optimized program representation, which we refer to as an enriched program dependency graph (ePDG).
ePDGs are graph structures in which nodes represent atomic machine-level instructions and edges represent control- and data-flow dependencies between instructions. ePDGs are well-conditioned for graph-based machine learning models since they more accurately capture the flow of the software.
To localize a given vulnerability, VulChecker searches the ePDG for potential vulnerability manifestation points, such as calls to a free() function for double free vulnerabilities (i.e., CWE-415). VulChecker then cuts a subgraph G_i backwards from that point. By having every subgraph terminate at a potential manifestation point, the model obtains a stable and consistent view. This enables the model to flow information about any potential root causes in G_i down to the singular manifestation point in question. During deployment, when a subgraph is predicted to be vulnerable with respect to a specific CWE, VulChecker alerts the developer to the exact location (line and instruction) of the manifestation point in source code and indicates the CWE type.
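The slicing step can be pictured as a reverse traversal over the ePDG. The sketch below is our own toy illustration, not VulChecker's implementation; the node names, graph shape, and depth cap are made up. It cuts a subgraph backwards from a candidate manifestation point along dependency edges:

```python
from collections import deque

def backward_slice(epdg_pred, manifestation, max_depth=50):
    """Cut the subgraph G_i reaching *backwards* from a candidate
    manifestation point. `epdg_pred` maps node -> list of predecessor
    nodes (a toy stand-in for control/data-dependency edges)."""
    seen = {manifestation}
    frontier = deque([(manifestation, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # bound the slice so huge functions stay tractable
        for pred in epdg_pred.get(node, []):
            if pred not in seen:
                seen.add(pred)
                frontier.append((pred, depth + 1))
    return seen

# Toy chain: alloc -> use -> free1 -> free2; free2 is the PoI for CWE-415.
preds = {"free2": ["free1"], "free1": ["use"], "use": ["alloc"], "alloc": []}
print(sorted(backward_slice(preds, "free2")))
# ['alloc', 'free1', 'free2', 'use']
```

Because every slice terminates at exactly one candidate manifestation point, the first free (a potential root cause) and the second free (the manifestation) land in the same observation.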
In order to give the model the ability to reason about causality, we chose a different GNN than previous approaches. In particular, we develop a model based on a message-passing GNN called Structure2Vec (S2V) [15]. With this model, we are able to efficiently pass information over hundreds of hops from one side of G_i to the other. Furthermore, by using S2V we are able to supply edge features (e.g., data type for data-flow dependencies) as well as consider a node's features at every propagation iteration.
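To see why the number of propagation rounds, rather than the number of layers, governs reach, consider the following simplified Structure2Vec-style update. This is a sketch with random untrained weights and tanh as the nonlinearity; the dimensions and toy path graph are illustrative, not the paper's exact parameterization. Each round mixes a node's own features, its neighbors' current embeddings, and the incident edge features:

```python
import numpy as np

def s2v_embed(X, edge_feat, neighbors, T, d=8, seed=0):
    """Sketch of an S2V-style update:
       mu_v <- tanh(W1 @ x_v + W2 @ sum_u mu_u + W3 @ sum_u e_{u,v}).
    The same weights are reused for T rounds, so information can travel
    T hops without stacking T layers (unlike a one- or two-layer GCN).
    Node features X and edge features are re-read every round."""
    rng = np.random.default_rng(seed)
    n, fx = X.shape
    fe = next(iter(edge_feat.values())).shape[0]
    W1 = rng.standard_normal((d, fx)) * 0.1
    W2 = rng.standard_normal((d, d)) * 0.1
    W3 = rng.standard_normal((d, fe)) * 0.1
    mu = np.zeros((n, d))
    for _ in range(T):
        new_mu = np.zeros_like(mu)
        for v in range(n):
            msg = sum((mu[u] for u in neighbors[v]), np.zeros(d))
            emsg = sum((edge_feat[(u, v)] for u in neighbors[v]), np.zeros(fe))
            new_mu[v] = np.tanh(W1 @ X[v] + W2 @ msg + W3 @ emsg)
        mu = new_mu
    return mu

# A 5-node path graph: with T=5 rounds, node 4's embedding can depend on
# node 0's features (4 hops away); with fewer rounds it cannot.
X = np.eye(5)
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
edge_feat = {(u, v): np.ones(2) for v in neighbors for u in neighbors[v]}
mu = s2v_embed(X, edge_feat, neighbors, T=5)
```

Raising T is cheap (no new parameters), which is what lets a message-passing model relate a root cause to a manifestation point hundreds of hops away.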
Finally, to mitigate the issue of obtaining fine-grained labeled datasets, we propose a data augmentation technique. The approach is to generate positive samples by injecting synthetic traces into projects taken from the wild. The approach is cheap because the synthetic traces are taken from open-source labeled datasets like the Juliet C/C++ Test Suite [26] and G is manipulated directly to avoid compilation issues. The approach is effective because we inject root causes at random distances away from the manifestation point. This improves generalization by forcing the model to search deeper into G_i over real code.
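In graph terms, the augmentation amounts to grafting a labeled synthetic subgraph into a wild project's graph at a randomized distance upstream of the manifestation point. The following is our own toy sketch of that idea, not the actual pipeline; the node names, the Juliet provenance of `trace`, and the distance are hypothetical:

```python
from collections import deque

def node_upstream(preds, node, dist):
    """Return some node `dist` predecessor-hops upstream of `node`
    (BFS over a toy graph given as node -> list of predecessors)."""
    frontier, seen = deque([(node, 0)]), {node}
    while frontier:
        n, d = frontier.popleft()
        if d == dist:
            return n
        for p in preds.get(n, []):
            if p not in seen:
                seen.add(p)
                frontier.append((p, d + 1))
    return None

def inject_trace(preds, trace, manifestation, dist):
    """Splice a synthetic root-cause chain `trace` (e.g., nodes lifted
    from a Juliet test case) into the graph so it feeds real code `dist`
    hops before `manifestation`. Editing the graph directly sidesteps
    having to compile a stitched-together source file."""
    anchor = node_upstream(preds, manifestation, dist)
    # Chain the trace nodes and attach the last one to the anchor.
    for a, b in zip(trace, trace[1:] + [anchor]):
        preds.setdefault(b, []).append(a)
    return anchor

# Toy wild-code chain n0 -> n1 -> n2 -> n3 -> free, as predecessor lists.
preds = {"free": ["n3"], "n3": ["n2"], "n2": ["n1"], "n1": ["n0"]}
anchor = inject_trace(preds, ["bad_alloc", "bad_math"], "free", dist=2)
print(anchor, preds["n2"])   # n2 ['n1', 'bad_math']
```

Randomizing `dist` per sample forces the learner to trace the injected root cause through varying amounts of real intervening code, rather than memorizing a fixed-distance pattern.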
We evaluated VulChecker on five distinct CWEs: integer overflow (CWE-190), stack overflow (CWE-121), heap overflow (CWE-122), double free (CWE-415), and use-after-free (CWE-416). We chose these CWEs because they are some of the most prevalent and exploitable types of memory corruption plaguing development in popular languages like C/C++. These CWEs cover both spatial and temporal memory safety issues, testing the flexibility of our approach. They are also featured in MITRE's 2021 list of the 25 most dangerous CWEs.¹ When trained on only augmented data, VulChecker was able to detect
24 vulnerabilities (reported CVEs) in the 19 C++ projects we evaluated. This was significantly more than the baseline SAST tools; even the commercial tool only detected 4 of the CVEs. We also had a vulnerability analyst review the top 100 results from a model trained on both augmented and CVE data. The analyst found that VulChecker has a hit rate (precision) of 50-80% on the top 50 results. VulChecker was also able to detect a previously unknown vulnerability (zero-day), which has been reported to the developers for responsible disclosure. Overall, we found VulChecker to be a practical tool for developers because: (1) it is able to operate on large projects (such as libgit2, with over 300 files, 110k lines of code, and 18 million instructions), and (2) it has a comparatively low false positive rate.
1.4 Contributions
Overall the main contributions of this work are as follows:
To the best of our knowledge, VulChecker is the first deep learning framework that can both detect vulnerabilities in source code with instruction and line-level precision and classify the type of CWE. This makes VulChecker the first practical deep learning-based SAST tool for developers, enabling them to quickly identify and remediate vulnerabilities in their code in the same manner as traditional SAST tools. Improving upon prior techniques, developers using VulChecker do not need to search hundreds of lines to find vulnerable code and/or determine what type of vulnerability is present.
We introduce the use of message passing neural networks for the task of vulnerability detection to learn and identify the relationship (causality) between a vulnerability's root cause and manifestation point in a PDG. Using this model, we are also able to explicitly assign features to edges in our ePDG, such as data type in the DFG component.
We propose the ePDG, a novel graph-based, machine-learning-optimized program representation that avoids the inefficiencies and drawbacks of representations used in prior work.
We present a novel data augmentation approach that helps mitigate the lack of fine-grained datasets for GNN-based source code vulnerability detection, while still generalizing to real world software projects taken from the wild. The approach is simple and easy to implement, making it a practical way to boost the performance of future models in this research domain.
The source code, models, and datasets presented in this paper are available online.² A demo of VulChecker working as a plugin for Visual Studio Code is also available online.³
2 Related Works
In this section, we review the last three years of related work on deep learning techniques for detecting vulnerabilities in
Table 1: Summary of related works in comparison with VulChecker
source code and contrast them to VulChecker's model, program representation, and application (summarized in Table 1).
2.1 Model
Early work in this area focused on CNNs and RNNs on linear (sequential) representations of the software's data-flow graphs [19, 23, 28, 38, 40]. However, a linear representation omits software structure, which prevents the model from learning and utilizing various contexts and semantics in its pattern recognition. These linear reductions also make it hard for the model to perform well on code from the wild, where a vulnerability can manifest after hundreds of lines of noisy and complex code [11].
To overcome these issues, later work [9, 11, 13, 14, 30, 32, 35, 39] used graph-based neural networks to consider the code's structural semantics. They utilize a graph convolutional neural network (GCN) that propagates information to neighboring nodes to learn embeddings for each node, which are then averaged or summed prior to use in classification. However, GCNs do not recall (i.e., propagate) a node's original feature vector at each iteration and struggle to learn long-distance relationships across the input structure. To remedy this, subsequent work [9, 30, 32] used gated graph recurrent neural networks (GRNN) that leveraged a recurrent layer to recall information passed to neighboring nodes at previous iterations. However, in these networks the number of layers dictated the number of propagation iterations, which was only one or two [13]. VulChecker's model is based on Structure2Vec (S2V) [15], a message-passing neural network that can perform hundreds of iterations without needing additional layers. Consequently, VulChecker is better suited for vulnerability detection because it can identify connections between distant root cause and manifestation points.
2.2 Program Representation
Prior work used a variety of graph-based program representations (depicted in Figure 1), the simplest of which are control-flow graphs (CFG), data-flow graphs (DFG), and combined CFG/DFGs called program dependency graphs (PDG). These representations can be readily generated from source code (e.g., with tools like Joern [37]), but do not capture instruction semantics explicitly.
More advanced representations include code property graphs (CPG) [37] and natural code sequence CPGs (ncsCPG) such as in [39]. CPGs merge the source code's PDG and AST by attaching each source line's AST subtree under the corresponding node in the PDG. This indirectly captures instruction semantics along with control- and data-flow dependencies in a single structure. However, such representations are not well structured for GNN learning models since information from the AST cannot flow across the graph. Finally, an ncsCPG is a CPG in which the leaf nodes in each AST subtree are superficially linked to leaf nodes in the preceding or successive trees, as visualized with orange edges in Figure 1 [32, 39]. This enables information to flow from the AST during GNN learning; however, it is suboptimal because semantically unrelated source code lines become linked, even if they share no data or control dependencies.
Finally, systems that use sequential models (i.e., CNNs and RNNs) represent code as a sequence extracted or flattened from one of the graphical forms above [19-23, 28, 38, 40]. While appropriate for the model type being used, these representations fail to leverage the inherently graph-like structure and operation of software.
Regardless of the program representation used, prior work relies on feature extraction to compress and express the contents of each token or node in G. Examples include one-hot encodings and pre-processed embeddings (e.g., Word2Vec [24]) to capture the meaning of different symbols and calls (e.g., int, =, for, free(), etc.). In some cases entire portions of code are summarized using Doc2Vec [18]. The issue with these representations is that (1) nodes in G_i would likely capture multiple
Figure 1: A comparison of different code representations. Most prior systems rely on CFG (green) and DFG (purple) edges alone, while more recent work includes ASTs (black) to capture the code better. However, these trees are not conducive to graph-based learning due to their terminal leaf nodes. This is mitigated by connecting leaves with natural code sequence (NCS) edges (orange) [32, 39], but this results in arbitrary paths w.r.t. the data and control dependency edges.
operations in a single line of source code, resulting in a loss of semantic precision, and (2) the use of pre-processed embeddings prevents the model from learning the best representation to optimize the learning objective (i.e., classifying vulnerabilities). In contrast, our proposed ePDG explicitly defines node and edge features that encode relevant information (e.g., operations) while avoiding superficial features (e.g., variable names).
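As a concrete contrast to pre-trained token embeddings, explicit features can be plain one-hot encodings over a fixed vocabulary of atomic operations and edge kinds. The vocabularies below are made up for illustration and are not VulChecker's actual feature set:

```python
import numpy as np

# Hypothetical vocabularies over atomic IR-level opcodes and edge kinds;
# the real ePDG feature set is defined by VulChecker's pipeline.
OPCODES = ["add", "mul", "load", "store", "call", "br", "icmp"]
EDGE_KINDS = ["control", "data:int32", "data:int64", "data:ptr"]

def node_features(opcode: str) -> np.ndarray:
    """One-hot over atomic opcodes: the learner sees the operation
    itself rather than a pre-trained, fixed token embedding."""
    v = np.zeros(len(OPCODES))
    v[OPCODES.index(opcode)] = 1.0
    return v

def edge_features(kind: str) -> np.ndarray:
    """Edges carry their own features too, e.g., the data type flowing
    along a data-dependency edge (variable names are deliberately absent)."""
    v = np.zeros(len(EDGE_KINDS))
    v[EDGE_KINDS.index(kind)] = 1.0
    return v
```

Because the encoding is fixed and lossless over its vocabulary, the downstream GNN, not a separate preprocessing model, decides which combinations of operations and types matter for the classification objective.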
2.3 Application
Some recent works have proposed detecting vulnerabilities at the binary level. For example, in [36] the authors propose Gemini, a deep learning model which uses Structure2Vec and a Siamese network to measure the similarity of a given function's CFG to that of a vulnerable one at the binary level. Other approaches such as DeepBinDiff [17], ASM2Vec [16], and Bin2Vec [8] all follow similar approaches for vulnerability detection where a repository of compiled vulnerable code snippets are checked against the binary in question. However, these approaches are limited since (1) the models search for instances of vulnerabilities (e.g., Heartbleed) and not the general pattern (CWE), (2) they require binary disassembly, a process that is undecidable in the general case, and as such may miss bugs in code the disassembler fails to recover, and (3) the vulnerabilities identified using these methods indicate code regions and not specific lines.
Regarding vulnerability detection in source code, the related works listed in Table 1 cannot directly identify the line of a vulnerability because their representations of G_i do not anchor a specific code line. Instead, they cut graphs to include an entire function [9, 11, 28, 30-32, 35, 39] or the code surrounding a potential root cause point for any vulnerability [13, 14, 19, 20, 22, 23, 38, 40], which leads to very broad predictions over entire functions or larger code regions.
One exception is VulDeeLocator [21], a work developed in parallel to VulChecker. In this work the authors first find PoIs in the source code through the program's AST and mark all library/API function calls, array definitions, pointer definitions, and arithmetic expressions. Each marked PoI is then traced down to a lower-level code representation (LLVM IR).⁴ Then, a forward and backward program slice is taken around the PoI and the slice is flattened into a sequence of 900 tokens (e.g., call, void, @, FUN1, '(', ...). Next, each token in the sequence is embedded into a vector using Word2Vec and the sequence of vectors is passed to a bidirectional recurrent neural network (biRNN), which predicts which line is vulnerable.
However, similar to prior work [9, 11, 14, 20, 22, 30, 35, 38, 39], VulDeeLocator cannot indicate which vulnerability has been detected. This leaves the developer to guess the vulnerability out of hundreds of CWEs. Moreover, selecting PoIs from ASTs is not suitable for spatial and temporal memory safety violations. In contrast, our ePDG representation, based on IR, is closer to machine code and resolves ambiguities regarding data types, temporary variables, and storage locations. Furthermore, like other works, VulDeeLocator removes loops and flattens code into a finite sequence of tokens, which reduces the code's semantics and patterns. Finally, many source-code-based systems like VulDeeLocator are tailored to specific languages. VulChecker, however, performs analysis at the IR level, making it somewhat language agnostic. Although we only evaluate on C and C++ projects in this paper, VulChecker has the potential to work on other languages such as Objective-C, Fortran, and Rust, since LLVM can lower them as well. Further research is needed to verify compatibility.
In summary, in contrast to the current state-of-the-art, VulChecker (1) can locate vulnerabilities at both line and instruction level, (2) can classify the type of vulnerability, (3) can better associate root cause and manifestation points by reaching deeper into program slices using a message passing GNN, and (4) may be generalized to a wider array of programming languages.
3 VulChecker
This section presents details on how the VulChecker framework functions. We start with an overview of the pipeline phases and important notation and then elaborate on each step.

Overview. For each CWE (i.e., class of vulnerability), VulChecker trains a separate model and uses a different sampling strategy to extract observations (i.e., potential manifestation points). Each of VulChecker's CWE pipelines follows the same four steps: (1) ePDG generation, (2) sampling, (3) feature extraction, and (4) model training or execution.
Figure 2: A diagram showing the steps of VulChecker's pipeline for one CWE. Note that the real graphs are significantly larger than what is visualized (e.g., projects like libgit2-v0.26.1 have over 18 million nodes in $G$). Solid edges represent control-flow and dashed edges are data dependencies.
Figure 2 provides a visual summary of these steps.
ePDG Generation: Given a target program's source code (denoted by $S$), we first compile it to LLVM IR (intermediate representation) and apply several optimizations. Next, we use a custom LLVM plugin specific to the CWE type of interest to analyze the IR, tag potential root cause and manifestation points, and produce an ePDG of $S$ denoted by $G$. This process is detailed in Section 3.1.
Sampling: Next, we scan $G$ to locate potential manifestation points for the given CWE (Section 3.2) that were tagged during ePDG generation. A manifestation point is any node $m$ (instruction) in $G$ that is known to manifest that CWE (e.g., stack memory writes for stack overflow). For the $i$-th manifestation point in $G$, we cut a subgraph $G_i$ backwards from that point up to a given depth. $G_i$ represents an observation upon which VulChecker predicts whether or not a vulnerability exists.
Feature Extraction: Using the same structure as $G_i$, we create an annotated graph $G_i'$ where each node contains a feature vector that explicitly captures the respective instruction in $G_i$. Similarly, we add explicit features to the edges of $G_i'$. Consequently, $G_i'$ captures the code leading up to the $i$-th manifestation point in a manner that a graph-based machine learning model can understand (Section 3.3).
Model Training/Execution: VulChecker predicts whether the $i$-th manifestation point is vulnerable or not by passing $G_i'$ through an S2V model $M$ (Section 3.4). The S2V model is a binary classifier trained on a specific CWE. To train $M$, we use many negative samples of $S$, where each sample has been augmented with many positive synthetic examples (Section 4).
Annotation: If $M$ predicts $G_i'$ as positive, then the debug symbols passed down from $S$ to the $i$-th manifestation point in the ePDG are used to report the line number and instruction to the developer.
3.1 ePDG Generation
The creation of an ePDG consists of two steps: (1) lowering the source code $S$ to LLVM IR and (2) extracting $G$ based on the structure and flows it contains.
3.1.1 Lowering Code to LLVM IR
The first step in ePDG generation is compiling the source code $S$ to LLVM IR, which provides a machine-level representation of the target program. This process greatly simplifies the program representation with respect to control-flow (e.g., complicated branching constructs in source code are reduced to conditional jumps that test a single condition), data-flow (e.g., definition-use chains are shorter and less complex as they are based on virtual register values rather than source code variables), and program semantics (IR instructions are atomic and directly translatable to instruction set architecture opcodes). To perform the initial lowering of source code, VulChecker uses LLVM's C/C++ frontend, Clang (v11.0.0). During lowering, VulChecker instructs Clang to embed debug information in the IR, which enables traceability of IR instructions back to source code instructions. If $S$ contains labels, then they are propagated to the LLVM IR using LLVM-provided debug information and passed down to ePDG nodes at a later stage.
In addition to simplifying the program representation, VulChecker also uses semantic-preserving compiler optimizations provided by LLVM to simplify and better express the code in $G$. Specifically, it applies: (1) function inlining to replace function call sites in the IR with a concrete copy of the called function body, (2) indirect branch expansion to eliminate indirect branching constructs, and (3) dead code elimination to reduce the size of the output graph. Altogether, these optimizations ensure that VulChecker can efficiently analyze our ePDGs interprocedurally in the domain of the original program, as the ePDGs contain fully-precise control- and data-flow information in a minimally sized, connected graph.
3.1.2 Generating the ePDG
With the simplified version of $S$ in LLVM IR form, the next objective is to generate an ePDG, $G$, that captures the target's control-flow, data-flow, and instruction semantics with minimal loss. Concretely, $G$ is a multigraph defined as:
$$G := (\mathcal{V}, \mathcal{E}, q, r)$$
where $\mathcal{V}$ is a set of nodes (instructions), $\mathcal{E}$ is a set of edges (flows), and both $q$ and $r$ are mappings of nodes and edges to classes and attributes. More formally, let $q$ be a mapping of nodes in $\mathcal{V}$ to a class of instruction, defined as
$$q: \mathcal{V} \rightarrow \left\{\{c, a\} : c \in C, a \in A_c\right\}$$
where $C$ is the set of all types of instructions in the LLVM instruction API (e.g., return, add, allocate, etc.) and $A_c$ is the set of all possible attributes for an instruction $v \in \mathcal{V}$ of type $c$. Examples include static values in arithmetic operations, function names for call instructions, and root cause and manifestation tags. Let $r$ be a mapping of edges in $\mathcal{E}$ to a pair of nodes, defined as:
$$r: \mathcal{E} \rightarrow \left\{\{(x, y), d, b\} : x, y \in \mathcal{V}, d \in D, b \in A_d\right\}$$
where $D$ is the set of edge types (i.e., control-flow or data-flow) and $A_d$ is the set of flow attributes for a flow type $d$ (e.g., the data type of the data dependency).
To generate $G$ according to this definition, VulChecker uses a custom plugin for LLVM's middle-end optimizer, Opt (v11.0.0). This plugin first invokes LLVM's built-in control- and data-flow analyses, and then performs an instruction-by-instruction scan of the target code. For each instruction $I_j$, VulChecker creates a corresponding node $v_j \in \mathcal{V}$ and its mapping entry $q_j$. Next, VulChecker uses LLVM's API to extract semantic information about $I_j$ to populate the entry $\{c_j, a_j\}$ of $q_j$ (e.g., the operation, whether the instruction is a conditional branch, etc.). In addition to semantic information obtained directly from LLVM's API, these attributes also include debug information (e.g., source file and line number), tags indicating potential root cause and manifestation points, and labels indicating actual root cause and manifestation points for model training.
Next, VulChecker performs a second instruction-by-instruction pass to generate control- and data-flow edges in $G$. This is determined using LLVM's API for identifying an instruction's predecessors/successors and the values it defines/uses. For each of a given instruction's predecessors and successors, a corresponding edge $e_{j,k} \in \mathcal{E}$ is generated with the appropriate type and attributes. Control-flow edges are assigned a void data type, and data-flow edges are assigned a data type corresponding to the value definition in the origin node ($I_j$). After both passes over the IR instructions derived from $S$ finish, the ePDG generation is complete. VulChecker outputs $G$ in JSON format for the next step: sampling.
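To make the serialized output concrete, the sketch below builds a toy two-node ePDG as plain Python dicts and dumps it to JSON. The schema (field names, tag values) is a hypothetical stand-in for illustration, not VulChecker's actual output format.

```python
import json

# Hypothetical ePDG schema: nodes carry an instruction class and attributes
# (debug line, root-cause/manifestation tags); edges carry a flow type and
# a data type, mirroring the definitions of q and r above.
nodes = [
    {"id": 0, "class": "alloca", "attrs": {"dtype": "int", "line": 3}},
    {"id": 1, "class": "store",  "attrs": {"dtype": "int", "line": 4,
                                           "tag": "manifestation"}},
]
edges = [
    # Control-flow edges get a void data type...
    {"src": 0, "dst": 1, "type": "control-flow", "dtype": "void"},
    # ...while data-flow edges inherit the type defined at the origin node.
    {"src": 0, "dst": 1, "type": "data-flow", "dtype": "int"},
]

epdg = {"nodes": nodes, "edges": edges}
serialized = json.dumps(epdg, indent=2)
print(serialized)
```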
3.2 Sampling
Now that $S$ is lowered into $G$, VulChecker can extract observations from $G$ for the machine learning process. First, we identify the points of interest (PoIs) in $G$, which are all potential manifestation points for a given CWE (denoted $m_i \in \mathcal{V}$). VulChecker then cuts the subgraph $G_i$ from $G$ such that $m_i$ is anchored in $G_i$ as the termination node.
3.2.1 PoI Criteria
To extract samples from $G$, we first identify the nodes in $\mathcal{V}$ where the given CWE can manifest itself. We identify a PoI, $m_i$, using a lookup list of potential manifestation IR instructions. This list is curated based on our domain expertise and related work. Specifically, for integer overflow (CWE-190), a PoI is any call to a function that passes integer arguments.$^{5}$ While we acknowledge that this is a heuristic, it is well validated in prior work [33] as an accurate criterion for distinguishing intended and unintended overflows. This is the only heuristic we use in our system. For stack and heap overflow (CWE-121, CWE-122), the PoIs are any store instructions to local or dynamically allocated memory, respectively. For use-after-free (CWE-416), PoIs are any memory accesses to dynamically allocated memory, and for double free (CWE-415), any calls to the memory manager's free function. Aside from CWE-190, these criteria are conservative, ensuring that all true positive manifestation points will have a corresponding PoI ($m_i$) in $G$.
For double free, notice how our design assumes that we already know which function in the program performs freeing. In practice, this information can be provided by a developer, our system's intended end-user.
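The lookup-driven scan described above can be sketched as a simple filter over the ePDG's instruction nodes. The instruction encoding and field names below are invented for this example; only the per-CWE criteria mirror the text.

```python
# Hypothetical per-CWE lookup of potential manifestation instructions:
# a predicate per CWE, applied to every instruction node in the ePDG.
POI_CRITERIA = {
    # CWE-415 (double free): calls to the memory manager's free function.
    "CWE-415": lambda ins: ins["op"] == "call" and ins.get("callee") == "free",
    # CWE-121 (stack overflow): stores to local (stack) memory.
    "CWE-121": lambda ins: ins["op"] == "store" and ins.get("region") == "stack",
}

def find_pois(instructions, cwe):
    """Return the indices of instructions matching the CWE's criteria."""
    match = POI_CRITERIA[cwe]
    return [i for i, ins in enumerate(instructions) if match(ins)]

program = [
    {"op": "alloca", "region": "stack"},
    {"op": "store",  "region": "stack"},
    {"op": "call",   "callee": "free"},
    {"op": "store",  "region": "heap"},
]
print(find_pois(program, "CWE-121"))  # [1]
print(find_pois(program, "CWE-415"))  # [2]
```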
3.2.2 Program Slicing
For each potential manifestation point $m_i$ identified in $\mathcal{V}$, VulChecker crawls $G$ backwards from $m_i$ using breadth-first search (BFS), up to a predefined depth $n_{\text{depth}}$, where $n_{\text{depth}}$ is a user-defined parameter of VulChecker.
The resulting subgraph $G_i$ efficiently and effectively captures any guarding branches and root causes that may result in a positive or negative manifestation of $m_i$. This is because the BFS is performed over both control-flow and data-flow edges, indiscriminately. Consequently, instructions that may be far apart in terms of control-flow (i.e., intermediate basic blocks) can be closely linked by data-flow, which is useful for bug classes like use-after-free.
Finally, since $m_i$ is the termination node of $G_i$, the manifestation point in question is anchored to a static location, benefiting effective message passing and localization of the prediction by obtaining the node's metadata $q(m_i)$.
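The backward slice can be sketched as a BFS over reversed edges, bounded by $n_{\text{depth}}$. The adjacency structure below is a toy example; a real ePDG would mix control- and data-flow predecessors in the same map.

```python
from collections import deque

def backward_slice(pred, m_i, n_depth):
    """BFS backwards from manifestation point m_i up to n_depth hops.

    pred maps each node to its predecessors over *both* control-flow
    and data-flow edges, so the slice crosses edge types indiscriminately.
    """
    seen = {m_i}
    queue = deque([(m_i, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == n_depth:          # depth budget exhausted on this path
            continue
        for p in pred.get(node, []):
            if p not in seen:
                seen.add(p)
                queue.append((p, depth + 1))
    return seen

# Toy predecessor map: node 4 is the manifestation point m_i.
pred = {4: [3], 3: [1, 2], 2: [0], 1: [0]}
print(sorted(backward_slice(pred, 4, 2)))  # [1, 2, 3, 4]
```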
3.2.3 Labeling
For each $G_i$ extracted from $G$, we associate a label $y = \{$negative, positive$\}$ based on the ground truth for the terminating manifestation point $m_i$ in $G_i$. Positive labels (vulnerable) are assigned to lines in the source code (for convenience) and then mapped down to IR using debug symbols. If a line of source code contains more than one potential manifestation point, we apply the label to the last relevant IR instruction in the statement.$^{6}$ Conversely, any $m_i$ not labeled as vulnerable receives a negative label in $G$.
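The last-instruction rule above can be sketched as follows; the identifiers and the `(poi_id, source_line)` encoding are hypothetical, chosen only to illustrate the mapping from line labels to IR-level labels.

```python
def label_pois(pois, vulnerable_lines):
    """Map source-line labels down to PoI (IR instruction) labels.

    pois: list of (poi_id, source_line) pairs in instruction order.
    A positive line label lands on the *last* PoI that the line's debug
    info maps to; every other PoI is labeled negative.
    """
    labels = {poi_id: "negative" for poi_id, _ in pois}
    last_on_line = {}
    for poi_id, line in pois:
        last_on_line[line] = poi_id      # later PoIs overwrite earlier ones
    for line in vulnerable_lines:
        if line in last_on_line:
            labels[last_on_line[line]] = "positive"
    return labels

# Line 10 holds two PoIs; only the last one receives the positive label.
pois = [("m0", 10), ("m1", 10), ("m2", 12)]
print(label_pois(pois, {10}))
```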
When labeling code projects taken from the wild, it is not possible to have the complete ground truth for all the vulnerabilities (exploitable $m_i$) contained in the code because there might be zero-day bugs. In practice, this means that our training labels may contain some false negatives. However, we observe good performance from our system in practice, which leads us to believe false negatives are rare and of minor consequence. We also note that correctly labeled negative samples will significantly outweigh$^{7}$ mislabeled zero-day bugs in the training set, and deep neural networks are robust to noisy labels [27].
Finally, we clarify that VulChecker obtains its positive samples from synthetic datasets. Therefore, labeling of vulnerabilities in projects from the wild is only necessary to evaluate VulChecker (good performance can be obtained from the augmented datasets described in Section 4).
3.3 Feature Extraction
For each $G_i$ extracted from $G$, we make a machine learning observation $G_i'$, which has the same graph structure as $G_i$. The nodes and edges of $G_i'$ are assigned feature vectors that express the respective instructions and flows. Table 2 summarizes the feature vectors, where 'count' is the number of features. There are a total of 1352 node features and 8 edge features. A node's feature vector captures the respective instruction using operational, structural, and semantic features. Edge features capture information regarding connectivity (control and data flows).
Operational Node Features. To capture the operation performed at a node (instruction), we extract features such as the static value, the operation type, the basic function, and whether or not the instruction is part of an if clause. The static values are important for recognizing guard checks implemented by the developer to gracefully prevent our target vulnerability classes. Knowing whether an instruction is part of an if clause is also helpful context for identifying bug-preventing checks or rare cases where a bug could manifest. The operation and function features are one-hot encodings that represent the action of a node (instruction). The categories for these features were collected by scanning several large software projects from the wild (over 40 million nodes). Some of the operations include add, load, store, allocate, unsigned_divide, logical_shift_right, and get_element_pointer. Some of the functions include malloc, free, fmemopen, printf, and lseek, among others. We also reserve a category called 'other' for operations which did not appear in the code collection but may appear in training and testing; including it lets the network apply a weight to these occurrences. To encourage the model to learn the patterns and avoid bias (cheating), we set the function feature of $m_i$ to zero. We note that since VulChecker inlines all user functions during the ePDG lowering process and uses an IR representation, there is no need to tokenize function names as in prior work. Therefore, the features of $G_i'$ explicitly represent the information that best benefits the learning process.
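The one-hot encoding with an 'other' fallback can be sketched as below. The vocabulary here is toy-sized; the paper uses 54 operation and 1228 function categories.

```python
# Toy operation vocabulary with a reserved 'other' slot for operations
# unseen during category collection but possibly present at train/test time.
OPS = ["add", "load", "store", "allocate", "other"]

def one_hot_op(op):
    """One-hot encode an operation, falling back to the 'other' category."""
    vec = [0] * len(OPS)
    idx = OPS.index(op) if op in OPS else OPS.index("other")
    vec[idx] = 1
    return vec

print(one_hot_op("store"))         # [0, 0, 1, 0, 0]
print(one_hot_op("fancy_new_op"))  # [0, 0, 0, 0, 1]  (falls into 'other')
```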
Structural Node Features. Since $G_i'$ is a directed graph, we use graph features to help the model understand the influence of each node as it propagates messages through the structure. For example, knowing the distance from the nearest potential root cause point $r$ to the anchored potential manifestation point $m_i$ gives the machine learning model implicit information on where relevant nodes are in the graph. We identify a potential root cause point $r_j$ in $G_i$ using domain expertise. Namely, integer arithmetic operations are potential root causes for integer overflow, stack and heap writes for stack and heap overflow, respectively, and calls to free memory for use-after-free and double free. We also use a feature called the betweenness centrality measure (BEC) [34]: a score that measures how critical a node is for efficiently drawing paths between nodes in $G_i'$. This is a common feature used in graph-based machine learning and graph-based anomaly detection [12]. We use this feature because it provides the S2V model with implicit information on a node's relative location in $G_i'$. Later, in Section 3.4, we explain how S2V uses these features to decide how to pass messages across the graph for graph classification.
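For intuition, betweenness centrality counts the fraction of shortest paths between node pairs that pass through a given node. The brute-force sketch below (graph libraries would normally compute this far more efficiently, e.g. via Brandes' algorithm) uses a tiny hypothetical directed graph.

```python
from collections import deque
from itertools import permutations

# Toy directed graph: node 3 is a cut point on every path into node 4.
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}

def shortest_paths(src, dst):
    """Enumerate all shortest src->dst paths by BFS over partial paths."""
    paths, best = [], None
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            continue                      # longer than a known shortest path
        if path[-1] == dst:
            best = len(path)
            paths.append(path)
            continue
        for nxt in adj[path[-1]]:
            if nxt not in path:           # avoid revisiting (no cycles)
                queue.append(path + [nxt])
    return paths

def betweenness(v):
    """Sum over (s, t) pairs of the fraction of shortest paths through v."""
    score = 0.0
    for s, t in permutations(adj, 2):
        if v in (s, t):
            continue
        paths = shortest_paths(s, t)
        if paths:
            score += sum(v in p for p in paths) / len(paths)
    return score

print(betweenness(3))  # node 3 lies on every shortest path into node 4
```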
Semantic Node Features. In addition to marking the distance to the nearest root cause point, we also indicate if a node itself is a potential root cause or manifestation point. This helps the model locate potential vulnerability sources to propagate to $m_i$ and other manifestation points that may be absorbing the signal prior to $m_i$. We also note the output type of the node's operation (e.g., int).
Edge Features. For edge feature vectors, we indicate the type of edge (control-flow or data-dependency) and capture the data type of the data dependencies so that the model can capture what kind of data goes where. By knowing the flow of static values, external inputs (from certain functions), and their data types, the model has enough information to foresee (simulate) the impact of data on a program.
We represent the attributed graph $G_i'$ as the tuple $(X_v, X_e, A, C)$: a matrix of node features, $X_v$; a matrix of edge features, $X_e$; its adjacency matrix, $A$; and its incidence matrix, $C$.
3.4 Model Training & Execution
VulChecker's machine learning model consists of two components that are trained end-to-end: a graph embedding network $M_G$ and a deep neural network (DNN) classifier $M_C$. Our graph embedding network $M_G$ is an adaptation of the Structure2Vec (S2V) model from [15]. The model uses a neural network to generate node embeddings by passing messages across the graph's structure. The parameters of $M_G$ are fitted to $M_C$'s classification learning objective such that $M_C(M_G(G_i')) = y$, where $y$ is the probability that $m_i$ is a vulnerability.
Table 2: Summary of features used in $G_i'$

| Node feature | Count |
| :--- | ---: |
| Has static value? | 1 |
| Static value | 1 |
| Operation $\{+, *, \%, \ldots\}$ | 54 |
| Basic function {malloc, read, ...} | 1228 |
| Part of IF clause | 1 |
| Number of data dependents | 1 |
| Number of control dependents | 1 |
| Betweenness centrality measure | 1 |
| Distance to $m_i$ | 1 |
| Distance to nearest $r$ | 1 |
| Operation of nearest $r$ | 54 |
| Output dtype {int, float, ...} | 6 |
| Node tag $\{r, m,$ none$\}$ | 2 |
| **Total** | **1352** |

| Edge feature | Count |
| :--- | ---: |
| Output dtype {float, pointer, ...} | 6 |
| Edge type {CFG, DFG} | 2 |
| **Total** | **8** |

The execution of $M_G(G_i')$ consists of a number of iterations of message passing from each node $v_i \in \mathcal{V}$ to its neighbors $v_j \in \Gamma(v_i)$, where a message from node $v_i$ takes the form of a vector (embedding) $e_i$. At the start of each iteration, a neural network is used to predict what the next broadcast message of the $i$-th node should be, based on its neighbors' last messages and the feature vectors stored in that node ($x_{vi} \in X_v$) and its incident edges ($x_{eij} \in X_e$). This is formulated as
$$e_i = \operatorname{ReLU}\left(W_v x_{vi} + \sum_{j \in \Gamma(i)} W_e x_{eij} + \sigma\left(\sum_{j \in \Gamma(i)} e_j\right)\right)$$
where $W_v$ and $W_e$ are matrices whose parameters are learned during training, and $\sigma$ is a deep neural network. A single iteration can be computed in matrix form as $E_t = \operatorname{ReLU}\left(W_v X_v + C W_e X_e + \sigma\left(A E_{t-1}\right)\right)$, where $E$ is a matrix containing the current node embeddings. Following the suggestions of [36], we use ReLU activations in $\sigma$ to help model complex relationships in the graph.
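A stripped-down iteration in the spirit of this update rule is sketched below. To stay self-contained it uses scalar embeddings, a fixed scalar weight in place of the learned $W_v$, tanh in place of the deep network $\sigma$, and omits the edge-feature term $W_e x_{eij}$; none of this matches VulChecker's actual architecture, it only illustrates the message-passing loop and the mean pooling.

```python
import math

# Toy undirected graph and per-node scalar features (all hypothetical).
adj = {0: [1], 1: [0, 2], 2: [1]}
x_v = [0.5, 1.0, -0.3]
w_v = 0.8  # scalar stand-in for the learned matrix W_v

def relu(z):
    return max(0.0, z)

def iterate(e):
    """One round: each node aggregates neighbors' last messages."""
    new_e = []
    for i in adj:
        neigh_sum = sum(e[j] for j in adj[i])
        # e_i = ReLU(W_v x_vi + sigma(sum of neighbor embeddings))
        new_e.append(relu(w_v * x_v[i] + math.tanh(neigh_sum)))
    return new_e

e = [0.0, 0.0, 0.0]
for _ in range(3):                    # n_iter rounds of message passing
    e = iterate(e)
graph_embedding = sum(e) / len(e)     # mean-pool node embeddings
print(graph_embedding)
```

In the full model this pooled vector would pass through batch normalization and then the classifier $M_C$.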
After $n_{\text{iter}}$ iterations (a user parameter), the node embeddings in $E$ are averaged together to form a single embedding vector, which is passed through a batch normalization layer before being passed to $M_C$.
To train the model parameters of $M_G$ and $M_C$, we optimize the following learning objective function:

$$\min_{M_G, M_C} \sum_i L_{CE}\left(M_C\left(M_G\left(G_i'\right)\right), y_i\right)$$

where $L_{CE}$ is the standard cross-entropy loss function and $y_i$ is the ground-truth label of the $i$-th sample.
Because of the different feature sets, we train a separate model for each CWE. We argue that having multiple models (one per CWE) is reasonable since (1) it only takes 1-2 ms to execute a model, (2) other SAST tools (like Perforce QAC and Checkmarx) also require the user to select which CWEs to scan for, since not all CWEs are important to the developer and each pattern adds complexity to the search, and (3) the models are very small (250 KB each), making them easy to store and fast to execute even on a non-GPU system.
3.5 Hyperparameters
Aside from the network parameters (e.g., the depth and width of $M_C$ and $M_G$), VulChecker has two main hyperparameters:
$n_{\text{depth}}$ and $n_{\text{iter}}$. Parameter $n_{\text{depth}}$ cuts the subgraph backwards from a potential manifestation point ($m_i$), in hopes of including a potential root cause. Parameter $n_{\text{iter}}$ controls how far information can be shared across the subgraph, in hopes of correlating the direct influence of potential root causes on $m_i$. Regarding time performance, increasing $n_{\text{depth}}$ has a polynomial time complexity depending on the branch factor of $G$, whereas increasing $n_{\text{iter}}$ has a linear impact. Regarding task performance, setting $n_{\text{depth}}$ too large harms performance since more irrelevant information is included in $G_i$. On control-flow edges, it is possible that the root cause for a positive $m_i$ will be further than $n_{\text{depth}}$ away from $m_i$. However, we found that the data dependency edges in $G_i$ can greatly reduce the distance between these points in wild code. We also found that increasing $n_{\text{iter}}$ improves performance up to $n_{\text{depth}}$ (the diameter of the network). To find the optimal hyperparameters for our CVE dataset, we used Bayesian parameter optimization [1]. For each trial, the optimizer trained a new model on a new dataset given the selected $n_{\text{depth}}$, $n_{\text{iter}}$, and other DNN parameters such as depth and learning rate. In our case, the optimizer found $n_{\text{iter}} = n_{\text{depth}} = 50$ to be the optimal setting.

Figure 3: An illustration of an ePDG from the wild $G^{(w)}$ being augmented with a synthetic vulnerability trace from Juliet $G_i^{(J)}$.
In general, the issue of scoping a program slice is an open research problem, and slicing is currently the best way to extract concise samples from GG. n_("depth "),n_("iter ")n_{\text {depth }}, n_{\text {iter }} 。参数 n_("depth ")n_{\text {depth }} 从潜在表现点 (m_(i))\left(m_{i}\right) 向后切割子图,以期包括潜在根本原因。参数 n_("iter ")n_{\text {iter }} 控制信息在子图中共享的距离,以期关联潜在根本原因对 m_(i)m_{i} 的直接影响。关于时间性能,增加 n_("depth ")n_{\text {depth }} 的时间复杂度是多项式的,取决于 GG 的分支因子,而增加 n_("iter ")n_{\text {iter }} 则有线性影响。关于任务性能,设置 n_("depth ")n_{\text {depth }} 过大将损害性能,因为更多无关信息被包含在 G_(i)G_{i} 中。在控制流边上,正 m_(i)m_{i} 的根本原因可能距离 m_(i)m_{i} 更远于 n_("depth ")n_{\text {depth }} 。然而,我们发现 G_(i)G_{i} 中的数据依赖边可以大大缩短这些点在野生代码中的距离。我们还发现,增加 n_("iter ")n_{\text {iter }} 可以将性能提升至 n_("depth ")n_{\text {depth }} (网络的直径)。为了找到我们 CVE 数据集的最佳超参数,我们使用了贝叶斯参数优化[1]。 对于每个试验,优化器在给定所选的 n_("depth "),n_("iter ")n_{\text {depth }}, n_{\text {iter }} 和其他深度神经网络参数(如深度和学习率)的情况下,训练了一个新模型。在我们的案例中,优化器发现 n_("iter ")=n_("depth ")=50n_{\text {iter }}=n_{\text {depth }}=50 是最佳设置。一般来说,程序切片的范围问题仍然是一个开放的研究问题,而切片目前是从 GG 中提取简洁样本的最佳方法。
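To make the role of $n_{\text{depth}}$ concrete, here is a minimal sketch (not the paper's implementation) of a backward cut: starting from a manifestation point, it walks reversed ePDG edges (control-flow and data-dependency alike) up to $n_{\text{depth}}$ hops. The adjacency structure and integer node ids are hypothetical toy data.

```python
from collections import deque

def backward_slice(pred, m, n_depth):
    """Collect all nodes within n_depth reverse hops of manifestation point m.

    pred maps each node to the nodes that point to it (reversed ePDG edges,
    covering both control-flow and data-dependency edges)."""
    seen = {m}
    frontier = deque([(m, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == n_depth:
            continue  # cut the subgraph at the depth bound
        for p in pred.get(node, ()):
            if p not in seen:
                seen.add(p)
                frontier.append((p, dist + 1))
    return seen

# Toy ePDG: control-flow chain 0 <- 1 <- 2 <- 3 plus a data edge 0 <- 3.
pred = {0: [1, 3], 1: [2], 2: [3]}
print(sorted(backward_slice(pred, 0, 1)))  # → [0, 1, 3]
```

Note how the data edge pulls node 3 into the slice at depth 1, even though it is three control-flow hops away; this is the distance-shortening effect described above.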
4 Data Augmentation
Motivation. To create a model that can operate on real projects, it must be trained on samples that reflect the real world, with large bodies of irrelevant code and benign patterns between root causes and manifestation points. Some line-labeled datasets exist, such as the Juliet C/C++ Test Suite from NIST [26]. However, they consist of short synthetic programs that do not reflect real-world code. In the absence of realistic line-labeled datasets, we propose a data augmentation technique that combines vulnerabilities from synthetic line-labeled datasets with real-world code to generate realistic training samples.
Method. To accomplish this, we augment a 'clean' ePDG from a real-world project$^{8}$ on GitHub by splicing into it multiple Juliet ePDG subgraphs containing vulnerabilities. This generates a training sample with many positive manifestation points, each of which is extracted separately during program slicing.
Formally, given an ePDG of a real-world project $G^{(w)}$ and a set of vulnerable subgraphs extracted from the Juliet dataset $G_i^{(J)} \in J$, we augment $G^{(w)}$ as follows (illustrated in Figure 3):
Augmentation Procedure
1. Select a $G_i^{(J)}$ at random from $J$
2. Choose a random path length $n$, where $3 \leq n \leq n_{\text{depth}}$
3. Select a random path $p$ in the control-flow of $G^{(w)}$ with length $n$
4. If $p$ does not overlap a previous splice, splice $G_i^{(J)}$ into $G^{(w)}$ along $p$; Else, count = count $+1$ // There was a collision
5. If count > limit, then return $G^{(w)}$; Else, loop to Step 1
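The procedure above can be sketched as the following loop. This is an illustrative skeleton over toy data structures (paths as node tuples, Juliet subgraphs as opaque ids), not the paper's graph-splicing code; the collision rule modeled here is simply node overlap with an earlier splice.

```python
import random

def augment(wild_paths, juliet_ids, n_depth, limit=10, seed=0):
    """Sketch of the augmentation loop: repeatedly try to splice Juliet
    vulnerability subgraphs onto random, non-overlapping control-flow paths
    of the wild ePDG, giving up after `limit` collisions."""
    rng = random.Random(seed)
    used = set()      # wild nodes already consumed by an earlier splice
    splices = []      # (juliet_id, path) pairs actually inserted
    count = 0
    while count <= limit:
        g = rng.choice(juliet_ids)                         # Step 1
        n = rng.randint(3, n_depth)                        # Step 2
        candidates = [p for p in wild_paths if len(p) == n]
        if not candidates:
            count += 1                                     # no path of length n
            continue
        p = rng.choice(candidates)                         # Step 3
        if used.isdisjoint(p):                             # Step 4: no collision
            used.update(p)
            splices.append((g, p))
        else:
            count += 1                                     # collision
    return splices                                         # Step 5
```

Each returned pair marks where one synthetic vulnerability trace would be spliced into $G^{(w)}$; the disjointness check keeps splices from interfering with one another.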
When VulChecker trains on the augmented $G^{(w)}$, it will cut a sample $G_i^{(w)}$ for every potential manifestation node, including the positive ones injected from Juliet (blue in the figure). However, since we place samples $S$ steps away from nodes marked in red, $G_i^{(w)}$ may sometimes overlap with $G_j^{(w)}$ for $i \neq j$. The reason for this is to help the model learn to focus on the target terminating manifestation point.
Validity. Since our augmentation process splices multiple ePDGs, it may produce samples where a vulnerability ePDG subgraph lies on an infeasible path (i.e., one that is dynamically unreachable) in the augmented ePDG. As is typical of static analysis tools, VulChecker considers both feasible and infeasible paths in a program. Therefore, it can still learn from such samples provided that (1) the augmentation maintains the vulnerabilities' data-flow and static reachability properties, and (2) splicing ePDGs does not otherwise invalidate the inserted vulnerabilities. To ensure (1), our augmentation process preserves the data-flow and static reachability of $G^{(w)}$ and $G^{(J)}$. This is illustrated in Figure 3, where (A) the augmented ePDG still contains the same data-flow edges found in $G^{(J)}$ (the dashed lines) and (B) all nodes remain statically reachable in the same order. Regarding (2), augmentation cannot invalidate a vulnerability because nodes in $G^{(J)}$ and $G^{(w)}$ are guaranteed not to interfere with each other. Non-interference among data-flows follows from our derivation of ePDGs from SSA (static single assignment) form LLVM IR. In SSA form, every data value is defined exactly once, meaning no node from $G^{(w)}$ can redefine a data value created by a node in $G^{(J)}$. Non-interference among static control-flows arises from our augmentation procedure, which ensures that the second half of the vulnerability subgraph remains statically reachable from the first half via at least one program path.$^{9}$
In summary, the augmented samples $G_i^{(w)}$ help the model generalize to more realistic scenarios because (1) they contain real-world code and (2) they teach the model to search through greater distances and noise to find the root cause. We note that the same augmentation approach can be extended to other code representations such as CFG, PDG, and CPG.
5 Evaluation
In this section, we evaluate VulChecker's performance as a tool for detecting and localizing vulnerabilities in source code. We accomplish this by evaluating VulChecker's performance at detecting vulnerabilities in the wild (real CVEs) when only trained on augmented (synthetic) datasets. We also explore the impact of the augmentation as well as the tool's precision to understand its practicality.
5.1 Experiment Setup
In this section we present our experiment setup. We note that all of our code has been published online, including our datasets and trained models, for use by the software and research communities.
5.1.1 Experiments
To evaluate VulChecker, we performed three experiments:
EXP1: Performance. To measure VulChecker's general performance, we use the area under the curve (AUC) measure, where values of 1.0 and 0.5 indicate a perfect classifier and random guessing, respectively. For application performance (where a detection threshold is set), we measure the false positive rate (FPR), true positive rate (TPR), and accuracy (ACC).
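For readers unfamiliar with these measures, a minimal sketch of how the ROC curve and its AUC are computed from raw detection scores (toy data; not the paper's evaluation harness):

```python
def roc_points(scores, labels):
    """FPR/TPR pairs at every threshold, highest-scoring samples flagged first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        pts.append((fp / N, tp / P))
    return pts

def auc(pts):
    """Trapezoidal area under the ROC curve; 1.0 = perfect, 0.5 = random."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Perfectly separated scores give AUC 1.0.
pts = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
print(auc(pts))  # → 1.0
```

Sweeping the threshold this way is also what produces the tradeoff plots discussed in EXP1.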
As a baseline, we compare VulChecker to five different source-code vulnerability detection tools and methods (two that detect vulnerabilities in code regions and three that perform line-level localization):
Region-level detectors. The first is the approach of [31], where ASTs of functions are converted into sequences of nodes. Unknown code is then assigned a vulnerability score based on its longest common subsequence (LCS) similarity to samples of vulnerable code. We contacted the authors for their source code but did not hear back from them. Therefore, we implemented their solution to the best of our ability. We normalized the LCS values between 0 and 1 based on the length of the longer sequence being compared. We note that although their work performs function-level localization, it can be used to classify the CWE. Therefore, we use this work as a baseline because it can be compared directly to VulChecker's detection performance. The second is Gemini, a deep learning-based binary-level code similarity tool described in Section 2.
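The normalized LCS score used by this baseline can be sketched as a standard dynamic program over two token sequences, with the final count divided by the longer sequence's length (a hypothetical re-implementation for illustration, not the authors' code):

```python
def lcs_score(a, b):
    """Longest-common-subsequence similarity of two AST node sequences,
    normalized to [0, 1] by the length of the longer sequence."""
    if not a or not b:
        return 0.0
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n)

print(lcs_score("abcde", "ace"))  # → 0.6 (LCS "ace" over max length 5)
```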
Line-level detectors. The first is Helix QAC, a licensed commercial SAST tool by Perforce.$^{10}$ QAC is a static code analyzer that automatically scans code for CWE violations (based on C/C++ coding rules). It both localizes at the line level and classifies vulnerabilities, like VulChecker. The second is an open-source SAST tool called Cppcheck,$^{11}$ which also performs localization and classification. The third is VulDeeLocator (described in Section 2). In the original paper, VulDeeLocator was trained on a mix of CWEs, meaning that users could not tell which CWE was predicted (only that a given line is vulnerable). In our implementation of VulDeeLocator, we mirror VulChecker and train a separate model on each CWE to refine the predictions. We used the authors' code$^{12}$ but found it very limited: (1) it cannot parse lines that result in large slices and (2) it cannot efficiently parse large projects. Therefore, we modified their code to reduce these limitations to the best of our ability.$^{13}$ We ensured that all positive cases were included, although many of the negative cases were omitted. Note that this may decrease VulDeeLocator's false positive rates in EXP1.
We selected Cppcheck and QAC because they are well-known SAST tools and they perform both line-level localization and vulnerability classification, like VulChecker. We selected [31] (LCS) because it captures the performance of both AST- and pattern matching-based approaches. We selected Gemini because it has some parallels to VulChecker: it uses Structure2Vec and it detects vulnerabilities on code structure (CFGs). However, it differs from VulChecker in that it only operates on binaries, does not use ePDGs, requires a large labeled dataset, and cannot localize vulnerabilities to the line level (only the function level). Finally, we use VulDeeLocator since it uses deep learning to perform line-level localization, like VulChecker.
EXP2: Precision. To support claims that VulChecker is a practical and helpful tool for developers, we employed an experienced vulnerability analyst to review the top 100 results, for each CWE and evaluated system, on a holdout set and measured precision. Examining thousands of cases$^{14}$ to determine whether they are buggy (and exploitable) is challenging for large real-world projects due to hard problems like determining reachability (i.e., determining whether the conditions needed to trigger a bug can be satisfied starting from the program's entry point). Since it is infeasible to confirm exploitability for so many cases, we employed the following methodology to determine the precision of the system:
1. The analyst first examined the source code for each case and determined whether the expected behavior for the predicted CWE is present and whether the ePDG contains any checks to handle that CWE. For example, a positive use-after-free case must contain a free statement, followed by a use of the freed variable name, without any checks in between for whether the variable has been freed or not. A positive integer-overflow case must contain arithmetic statements followed by a use of the calculated result, without checking for integer wraparound. The result of this step is two buckets: cases that cannot contain the predicted CWE and ones that might.
2. Next, the analyst examined every known CVE for the projects used in this experiment and collected their corresponding patches. With this information, the analyst examined each case in the "maybe buggy" bucket and identified the ones that match already-known CVEs. The result is a bucket of detected, previously known vulnerabilities.
3. Finally, for the cases that might be buggy and could not be matched to a known vulnerability, the analyst took a random sample and attempted to verify their exploitability using industry best practices like fuzz testing and manual reverse engineering. This process took several weeks.
For the purpose of this experiment, we consider cases in the "not buggy" bucket from Step 1 to be false positives and cases in the latter buckets (maybe buggy, matched to a prior CVE, verified novel vulnerability) to be true positives. This decision is based on the fact that a "maybe buggy" case can still be a vulnerability, even if the analyst failed to verify it given time constraints.
For the purposes of ethical disclosure, we chose to disclose novel bugs that were verified exploitable.$^{15}$ We also provide a video demo of VulChecker operating as a plugin for Visual Studio.$^{16}$
EXP3: Ablation. In our paper, we claim that training on augmented data is beneficial because it is challenging to obtain real-world instruction- or line-level labeled datasets. However, it is not clear how much the augmentation process helps the overall performance. To demonstrate the contribution of the augmentation process, we perform an ablation study contrasting a model trained on augmented data with one trained on synthetic data.
Case Studies. By using VulChecker on our datasets, we identified a zero-day vulnerability and incorrect CVE information. Details on these cases can be found in the appendix.
5.1.2 Datasets
We collected a wide variety of C/C++ projects for training and testing VulChecker and the baselines. A description of these projects, including a mapping between the projects and the experiments, can be found in Table 6 in the appendix.
Train set. To train the baselines, we used the Juliet C/C++ dataset, denoted $D_{jul}^{c}$ for CWE $c$. The dataset has the following number of samples for each CWE ($D_{jul}^{190}$: 3960, $D_{jul}^{121}$: 4944, $D_{jul}^{122}$: 5922, $D_{jul}^{415}$: 960, $D_{jul}^{416}$: 459).
To train VulChecker, we used $D_{jul}$ to augment 20 'clean' projects collected from GitHub. A 'clean' project was selected if it (1) did not contain a CVE for the given CWE, (2) had at least 40k lines of source code, and (3) was a cmake project.$^{17}$ Then, for each CWE, we augmented these projects using $D_{jul}$. We denote this augmented training set as $D_{aug}^{c}$ for CWE $c$. Overall, there were approximately 6k-24k positive manifestation points and 2 million negative points per CWE (with some projects having over 200 million nodes in total). To handle class imbalance, we down-sampled the negative points to equal proportions.
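The down-sampling step amounts to keeping all positive manifestation points and drawing an equally sized random subset of the negatives. A minimal sketch under the assumption that samples are dicts with a 0/1 `label` field (a toy stand-in for VulChecker's per-node samples):

```python
import random

def balance(points, seed=0):
    """Down-sample negative manifestation points so the two classes
    appear in equal proportions."""
    pos = [p for p in points if p["label"] == 1]
    neg = [p for p in points if p["label"] == 0]
    kept = random.Random(seed).sample(neg, k=min(len(pos), len(neg)))
    return pos + kept
```

With ~2 million negatives against 6k-24k positives per CWE, this reduces the training set to a manageable, balanced size while keeping every positive example.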
We note that the data augmentation process is part of VulChecker's algorithm (it cannot be applied to the other baseline models). However, VulChecker and the baselines were all trained on the same positive data, since $D_{jul}$ is captured in $D_{aug}$.
Test sets. In our evaluations, we used open-source projects collected from the Internet. To evaluate the models in EXP1, we collected projects from GitHub that contain CVEs. First, we collected a list of all relevant CVEs using the NVD database and filtered out all CVEs that were not tagged with one of our CWEs (leaving 3,788). For each CWE, we then filtered out all of those without version information and those that were confirmed to be closed source (leaving 524). We then manually collected all of the projects, filtered out any project that does not use the cmake compilation system, and verified the CWE label in the code. In the end, we had 19 projects with approximately 35 CVEs. Of these CVEs, 14 were from 2019 and 2020, and the majority had CVSS ranks (severity levels) of medium or high. We denote this dataset as $D_{cve}^{c}$ for CWE $c$. For each CVE, a vulnerability researcher tagged the positive manifestation point in the source code (all other points are considered negative). In total, we used the following number of projects containing one or more relevant CVEs ($D_{cve}^{190}$: 9, $D_{cve}^{121}$: 2, $D_{cve}^{122}$: 4, $D_{cve}^{415}$: 2, $D_{cve}^{416}$: 7). These projects contained millions of nodes. For more information on these projects, see Table 5 in the appendix.
To evaluate the precision of VulChecker in EXP2, we used 9 more unlabeled projects collected from GitHub. These projects were selected if they used cmake and had no reported CVEs. We denote this dataset as $D_{out}$.
Finally, in EXP3 (ablation on data augmentation), we used a CWE-190 dataset consisting of holdout data from $D_{jul}$, 1 project from $D_{cve}^{190}$, and 5 more 'clean' wild projects from GitHub.
5.1.3 Model Configuration
We configured all of the CWE models with the same hyperparameters based on results from experimentation and Bayesian optimization [1]. For program slicing, we used a backwards cut depth of $n_{\text{depth}} = 50$. For $M_G$, we used an embedding size of 32, 9 layers in the network $\sigma$, and performed $n_{\text{iter}} = 50$ propagation iterations. For $M_C$, we used 7 dense layers of 32 neurons each. The entire model was implemented using PyTorch and trained end-to-end with a learning rate of 0.0001 for 100 epochs on an NVIDIA Titan RTX GPU with 24GB of RAM.
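The interplay between $n_{\text{iter}}$ and the slice diameter can be illustrated with a toy message-passing round (this is only a reachability sketch, not the Structure2Vec update used by $M_G$): after $n_{\text{iter}}$ propagation rounds, a node has received information from exactly the nodes within $n_{\text{iter}}$ hops, which is why $n_{\text{iter}}$ is pushed toward $n_{\text{depth}}$.

```python
def reach(adj, n_iter):
    """Toy message passing: track which source nodes each node has 'heard
    from' after n_iter rounds of neighbor aggregation."""
    msg = {v: {v} for v in adj}
    for _ in range(n_iter):
        msg = {v: msg[v].union(*(msg[u] for u in adj[v])) for v in adj}
    return msg

# Path graph 0-1-2-3 (undirected adjacency).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sorted(reach(adj, 1)[0]))  # → [0, 1]
print(sorted(reach(adj, 3)[0]))  # → [0, 1, 2, 3]
```

With only one round, node 0 cannot see a root cause placed at node 3; three rounds (the graph's diameter) suffice.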
Figure 4: Performance of the line-based detectors (left) and region-based detectors (right) in detecting CVEs in the wild.
The models in EXP1 were set up as follows. For LCS, we followed the approach of the authors [31] and compared all 24k functions in Juliet to a random subset of 400 functions from the wild,$^{18}$ ensuring that all 23 vulnerable functions were included as well. For Gemini and VulDeeLocator, a separate model was trained on each CWE in the Juliet dataset (i.e., $D_{jul}^{c}$). For VulChecker, the models were trained on the respective augmented CWE datasets $D_{aug}^{c}$. Finally, for QAC and Cppcheck, the default configurations were used.
5.2 EXP1: Performance
In Figure 4 we present the results of the open-source baselines on $D_{cve}$. In the figure, we plot each model's receiver operating characteristic (ROC) curve and provide the models' AUCs. The ROC plot shows the tradeoff between the FPR and TPR at every threshold. In particular, we are interested in the left side of the plot, where the model can be tuned to achieve a low FPR. With an FPR of 0.01, we found that VulChecker was able to accurately identify 35% of all 63 positive manifestations in CWE-122, $45\%$ of the 20 manifestations in CWE-190, and all 3 manifestations in CWE-415. When relaxing the threshold to an FPR of 0.1, VulChecker can identify $95\%$, $64\%$, $46\%$, $100\%$, and $83\%$ of the CVE manifestations for CWEs 190, 121, 122, 415, and 416, respectively. A list of the 24 CVEs detected at this rate can be found in the Appendix along with the performance at
Table 3: Baseline comparison against a commercial SAST tool in detecting CVEs in the wild.
different set FPR rates (Table 4). Cppcheck is not included in the figure because it did not produce any TPs. Instead, Cppcheck produced 22 FPs for CWE-190 and 1 FP for CWE-415.
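Operating points like "TPR at an FPR of 0.01 or 0.1" can be read off a score list by sweeping the detection threshold and keeping the best TPR whose FPR stays within budget. A minimal sketch over toy scores (not the paper's evaluation code):

```python
def tpr_at_fpr(scores, labels, budget):
    """Best TPR achievable while keeping the false positive rate
    at or below `budget`, by sweeping the detection threshold."""
    P = sum(labels)
    N = len(labels) - P
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if fp / N <= budget:
            best = max(best, tp / P)
    return best
```

Raising the budget (e.g., from 0.01 to 0.1) admits lower thresholds and therefore more detections, which is exactly the sensitivity knob discussed for VulChecker below.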
The results in Figure 4 show that VulChecker outperforms the other methods by a large margin. This is because AST (LCS), CFG (Gemini), and linear (VulDeeLocator; see Section 2.3) structures cannot capture all code behaviors, such as data dependencies, whereas VulChecker's ePDGs capture these behaviors explicitly.
In Table 3 we compare the performance of VulChecker to the closed-source solution (Helix QAC) on $D_{cve}$ (TP and FP are the true positive and false positive counts, respectively). When setting VulChecker's FPR to be on par with the commercial tool QAC (FPR 0.1), we find that VulChecker detects significantly more CVEs than QAC in $D_{cve}$ (24 vs. 4). Unlike QAC, which cannot adjust its sensitivity, VulChecker can be adjusted. Specifically, we can raise the FPR to 0.2 and detect 31 CVEs (with double the FPs) or lower the FPR to 0.05 and detect 17 CVEs (with a quarter of the FPs). We note that although VulChecker and QAC have $\sim 460$ FPs at the line level, there are hundreds of thousands of lines of code in the target projects. Therefore, 460 FPs is reasonable.
We note that since the software projects in $D_{cve}^{c}$ are older versions of the software (Table 5), they contain a number of bugs and vulnerabilities that we have labeled as negative. Therefore, the FPR is actually lower, since the top results contain other bugs and vulnerabilities. Through a manual inspection of the top 50 results for each CWE, we found that 3-7 additional instances of each CWE were detected.
In summary, VulChecker demonstrates good performance in detecting the exact lines and instructions of exploitable vulnerabilities (CVEs) in real-world projects. Moreover, it can operate with the same FPR as a commercial SAST tool while detecting 6x more CVEs. This is a meaningful result, since the model was trained for 'free' on augmented data labeled with only synthetic samples.
5.3 EXP2: Model Precision
In deployment, a user would train VulChecker on both $D_{aug}^{c}$ and $D_{cve}^{c}$ to obtain a more complete and accurate model. To evaluate the precision and practicality of VulChecker in this scenario, we manually inserted different instances of vulnerability CWE-190 into libgd. We found that when a
Figure 5: The precision (hit rate) of VulChecker on the holdout set when trained on both augmented and CVE datasets.
model is trained on both $D_{aug}^{190}$ and $D_{cve}^{190}$, the rank of these manifestation points rises from the 20th-64th positions to the 1st position (highest score). Ranking is achieved by sorting results according to the model's softmax confidence scores.
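The ranking step can be sketched as follows: each candidate manifestation point's two class logits are converted to a positive-class softmax probability, and points are sorted by that probability. The two-logit layout and the point ids here are hypothetical, for illustration only.

```python
import math

def rank_by_confidence(logits):
    """Rank manifestation points by softmax confidence in the positive class.

    `logits` maps a point id to its (negative, positive) class logits."""
    def positive_prob(pair):
        neg, pos = pair
        z = max(neg, pos)                       # stabilize the exponentials
        e_neg, e_pos = math.exp(neg - z), math.exp(pos - z)
        return e_pos / (e_neg + e_pos)
    return sorted(logits, key=lambda k: positive_prob(logits[k]), reverse=True)

print(rank_by_confidence({"a": (2.0, 1.0), "b": (0.0, 3.0), "c": (1.0, 1.0)}))
# → ['b', 'c', 'a']
```

A developer would then triage the top of this list first, which is what the top-100 analysis below measures.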
With this knowledge, we trained new models on both $D_{aug}^{c}$ and $D_{cve}^{c}$ and executed each of them on our holdout set $D_{out}$. Since the holdout set has no ground truth, we hired a vulnerability analyst to examine the top 100 results (from 1 million potential manifestation points). Figure 5 plots the precision of VulChecker on the top $k$ results for each CWE. A rise in precision indicates a sequence (cluster) of positive cases found. As expected, the precision drops as $k$ increases, since the model's confidence lowers over $k$. Using the analyst's findings, we found that $50$-$80\%$ of the top 50 results and $45$-$70\%$ of the top 100 results were true positives, according to our criteria defined in Subsection 5.1. Concretely, the following counts of true positive cases were found in the top 100 results for each CWE (CWE-190: 68, CWE-121: 47, CWE-122: 46, CWE-415: 71, CWE-416: 28).
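The precision-at-$k$ figures quoted above reduce to a one-line computation over the ranked, analyst-labeled results (toy labels for illustration):

```python
def precision_at_k(ranked_labels, k):
    """Precision over the k highest-confidence predictions, where
    ranked_labels[i] is 1 if the analyst judged result i a true positive."""
    top = ranked_labels[:k]
    return sum(top) / len(top)

print(precision_at_k([1, 1, 0, 1, 0, 0], 4))  # → 0.75
```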
The analyst then took these true positives and attempted to manually match them with previously known CVEs, using their patches as reference. For CWEs 190, 121, 122, 415, and 416, we found $23.5\%$, $32.6\%$, $89.0\%$, $24.2\%$, and $14.8\%$ of the true positives, respectively, to correspond to 17 verified CVEs. Note that because a CVE can match multiple related lines in the source code, multiple true positive cases can correspond to the same CVE. The CVEs identified are included in Appendix A.
Lastly, the analyst attempted to verify the exploitability of the remaining true positive cases using standard practices like fuzz testing and manual reverse engineering. This yielded 1 verified zero-day vulnerability, which we disclosed to developers for patching. A case study of this vulnerability is included in Appendix B.
In summary, VulChecker provides valuable information to developers with a minimal number of false alarms. This means that in deployment, VulChecker can be a practical tool for identifying vulnerabilities in source code.
5.4 EXP3: Ablation Study on Augmentation
In Figure 6, we present the ROC curves and AUC values for a model trained on only synthetic data. The model performs extremely well on other synthetic samples because the samples are short and have virtually no noise. However, the same model fails to generalize to the real-world software and vulnerabilities found in $D_{\text{cve}}$. This is a troubling insight, since many prior proposals in this domain have evaluated their models on synthetic datasets like SARD [14, 19, 20, 22, 30, 35, 38], so it is not clear whether they can perform well in practice. We note that the proposed augmentation process can be applied to these works, since the algorithm is agnostic to CFG, PDG, and CPG code graphs. However, we leave this comparative study for future work, since it is outside the scope of our target application (line/instruction-level vulnerability detection).
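The AUC values reported in Figure 6 can be computed without any library as the probability that a randomly chosen positive manifestation point outscores a randomly chosen negative one (the Mann-Whitney formulation). A toy sketch with made-up confidence scores:

```python
def auc(pos_scores, neg_scores):
    """AUC as P(positive outscores negative), counting ties as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up model confidences for vulnerable (positive) and benign (negative)
# manifestation points; only the ranking matters for AUC.
pos = [0.9, 0.8, 0.35]
neg = [0.7, 0.3, 0.2, 0.1]
print(auc(pos, neg))  # 0.9166...
```

A synthetic-only model scores near 1.0 on synthetic pairs like these but collapses toward 0.5 on real-world pairs, which is exactly the gap Figure 6 illustrates.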
When contrasting the results in Figure 6 with those of Figure 4, we can clearly see the benefit and importance of the data augmentation. This is also apparent in the knowledge captured by each of the models: in Figure 7, we plot $M_G$'s embeddings of $G_i'$ (prior to the DNN classification) when trained on synthetic CWE-121 samples (left) and augmented CWE-121 samples (right). From the plots, we can see that the model trained on synthetic data learns a very simple representation, since it is easy for it to separate negative and positive manifestation points. On the other hand, the vulnerabilities in augmented data have much longer distances between the root cause and manifestation points. This forces the model to look deeper into $G_i'$ and to learn how to distinguish benign code from vulnerable code while overlooking irrelevant code. This can be seen in Figure 7, where the model trained on augmented data has difficulty separating the concepts in the ground truth (top right) but succeeds in associating the learnt concepts in samples from the wild (bottom right). In contrast, the model trained on synthetic data easily separates the concepts in training (top left) but fails to associate or identify any of the samples from the wild (bottom left).
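The 2-D projections in Figure 7 can be reproduced in principle with scikit-learn's T-SNE implementation. A sketch with random stand-in vectors, assuming scikit-learn is available and 32-dimensional embeddings as in the paper:

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
# Stand-in for the 32-dimensional graph embeddings produced by M_G;
# the real inputs would be the learned embeddings of each G_i'.
embeddings = rng.normal(size=(60, 32))

# Reduce 32 -> 2 dimensions for plotting, as done for Figure 7.
proj = TSNE(n_components=2, perplexity=10, init="random",
            random_state=0).fit_transform(embeddings)
print(proj.shape)  # (60, 2)
```

Each row of `proj` would then be scattered and colored by its label (negative, positive, or novel/wild).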
In summary, the proposed data augmentation strategy is a cheap and effective way to boost a model's performance on real software projects. Given the shortage of fine-grained labeled datasets from the wild, this augmentation approach enables the research, development, and actual deployment of line-level classifiers.
6 Assumptions & Limitations
In this framework we made several assumptions about the environment and use cases.
Responsiveness. In order to lower the code into LLVM IR, the code must be able to compile. This means that new reports will only become available to the developer at certain points, such as when the developer completes a line of code or an entire segment, and not in real time while typing. We believe this is a reasonable delay, since the programmer will still be engaged with the code and can quickly resolve the problem before moving on.
Figure 6: Performance of VulChecker when trained on synthetic data, then either tested on synthetic (left) or tested on real data (right).
Figure 7: The embeddings of $G_i'$ learnt by S2V when trained on synthetic (left) or augmented (right) data. Black (negative) and red (positive/vulnerable) points are samples taken from the model's training set. Grey points are embeddings of novel samples taken from the wild. For visualization, all plots have been reduced from 32 to 2 dimensions using T-SNE.
Security. An advanced attacker may use adversarial machine learning [10] to poison the training set (e.g., include crafted negative samples which will cause the model to miss certain vulnerability patterns in deployment). Although this attack is possible, it is challenging because the attacker must know which GitHub projects will be collected.
Regarding attacks on a trained model, it may be possible to generate adversarial perturbations in $S$ such that $G_i$ will not be detected by $M$ as vulnerable. This is less of a threat in private projects with trusted contributors (e.g., a software company). Regardless, the attacker faces some significant challenges in generating perturbations: (1) it is an open research question whether it is possible to generate adversarial perturbations on source code that will still compile, and (2) the mapping of $S$ to $G$ is not differentiable, so the attacker can only operate on $G_i'$.
Integrity. Since we use projects from the wild for our augmentation, it is possible that vulnerabilities in these projects exist and will be treated as negative samples during training. This may confuse the model and impact its performance. Although we showed empirically that this technique works well, further research is needed to understand how the quantity and quality of these bugs impact a trained model.
Depth. A fundamental limitation of VulChecker is the range of the parameter $n_{\text{depth}}$. If $n_{\text{depth}}$ is substantially large, then it becomes prohibitively expensive to train and execute VulChecker on every potential manifestation point. Moreover, the size of $G$ can negatively impact the model's ability to identify long-distance relationships [41]. Therefore, when a root cause is significantly far from a positive manifestation point, VulChecker will not be able to confidently detect the bug. We note that ePDGs include data-dependency edges, which significantly shorten the distance between root causes and manifestation points in $G$. However, efficient long-distance causality in static analysis is an open problem, unresolved in existing approaches (see Section 2).
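To illustrate why data-dependency edges matter here, consider a toy graph in which a root cause reaches a manifestation point only through a long control-flow chain; a single data-dependency edge collapses the hop distance that a message-passing GNN must cover. Node names are hypothetical:

```python
from collections import deque

def hops(adj, src, dst):
    """Shortest hop count from src to dst by BFS; None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Control-flow-only chain: root -> a -> b -> c -> sink (4 hops).
cfg = {"root": ["a"], "a": ["b"], "b": ["c"], "c": ["sink"]}
print(hops(cfg, "root", "sink"))  # 4

# Adding one data-dependency edge root -> sink collapses the distance to 1,
# so a smaller n_depth neighborhood still links root cause to manifestation.
epdg = {k: list(v) for k, v in cfg.items()}
epdg["root"].append("sink")
print(hops(epdg, "root", "sink"))  # 1
```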
7 Conclusion
There is a large gap between the current state-of-the-art and a practical deep learning SAST tool that can be used by developers. In this paper, we proposed VulChecker, the first tool that can both perform line/instruction-level vulnerability detection and classify the indicated vulnerability. We have also proposed (1) a new code representation (ePDG) for low-level GNN tasks on software and (2) a novel data augmentation strategy to help GNN-based SAST tools work on real-world software.
Ethical Disclosures
All previously unknown vulnerabilities found by VulChecker in this study have been disclosed to the respective software developers for remediation.
Acknowledgment
This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract HR00112090031. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA. This material is also based upon work supported by the Zuckerman STEM Leadership Program.
References
[1] “scikit-optimize: Sequential model-based optimization in Python.” [Online]. Available: https://scikit-optimize.github.io/stable/
[2] “Application security | CyberRes,” https://www.microfocus.com/en-us/cyberres/application-security, 2021, (Accessed on 07/23/2021).
[3] “Clang Static Analyzer,” https://clang-analyzer.llvm.org/, 2021, (Accessed on 07/23/2021).
[4] “Coverity - Google search,” https://www.google.com/search?q=coverity&oq=coverity&aqs=chrome..69i57j019.1102j0j4&sourceid=chrome&ie=UTF-8, 2021, (Accessed on 07/23/2021).
[5] “Flawfinder home page,” https://dwheeler.com/flawfinder/, 2021, (Accessed on 07/23/2021).
[6] “Helix QAC for C and C++ | Perforce,” https://www.perforce.com/products/helix-qac, 2021, (Accessed on 07/23/2021).
[7] “Infer static analyzer | Infer,” https://fbinfer.com/, 2021, (Accessed on 07/23/2021).
[8] S. Arakelyan, S. Arasteh, C. Hauser, E. Kline, and A. Galstyan, “Bin2vec: learning representations of binary executable programs for security tasks,” Cybersecurity, vol. 4, no. 1, pp. 1-14, 2021.
[9] S. Cao, X. Sun, L. Bo, Y. Wei, and B. Li, “Bgnn4vd: Constructing bidirectional graph neural-network for vulnerability detection,” Information and Software Technology, vol. 136, p. 106576, 2021.
[10] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay, “Adversarial attacks and defences: A survey,” arXiv preprint arXiv:1810.00069, 2018.
[11] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray, “Deep learning based vulnerability detection: Are we there yet?” IEEE Transactions on Software Engineering, 2021.
[12] A. Chaudhary, H. Mittal, and A. Arora, “Anomaly detection using graph neural networks,” in 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon). IEEE, 2019, pp. 346-350.
[13] X. Cheng, H. Wang, J. Hua, G. Xu, and Y. Sui, “Deepwukong: Statically detecting software vulnerabilities using deep graph neural network,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 30, no. 3, pp. 1-33, 2021.
[14] X. Cheng, H. Wang, J. Hua, M. Zhang, G. Xu, L. Yi, and Y. Sui, “Static detection of control-flow-related vulnerabilities using graph embedding,” in 2019 24th International Conference on Engineering of Complex Computer Systems (ICECCS). IEEE, 2019, pp. 41-50.
[15] H. Dai, B. Dai, and L. Song, “Discriminative embeddings of latent variable models for structured data,” in Proceedings of The 33rd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. F. Balcan and K. Q. Weinberger, Eds., vol. 48. New York, New York, USA: PMLR, 20-22 Jun 2016, pp. 2702-2711. [Online]. Available: http://proceedings.mlr.press/v48/daib16.html
[16] S. H. H. Ding, B. C. M. Fung, and P. Charland, “Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization,” in 2019 IEEE Symposium on Security and Privacy (SP), 2019, pp. 472-489.
[17] Y. Duan, X. Li, J. Wang, and H. Yin, “Deepbindiff: Learning program-wide code representations for binary diffing,” in Network and Distributed System Security Symposium, 2020.
[18] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in International conference on machine learning. PMLR, 2014, pp. 1188-1196.
[19] X. Li, L. Wang, Y. Xin, Y. Yang, and Y. Chen, “Automated vulnerability detection in source code using minimum intermediate representation learning,” Applied Sciences, vol. 10, no. 5, p. 1692, 2020.
[20] X. Li, L. Wang, Y. Xin, Y. Yang, Q. Tang, and Y. Chen, “Automated software vulnerability detection based on hybrid neural network,” Applied Sciences, vol. 11, no. 7, p. 3201, 2021.
[21] Z. Li, D. Zou, S. Xu, Z. Chen, Y. Zhu, and H. Jin, “Vuldeelocator: a deep learning-based fine-grained vulnerability detector,” IEEE Transactions on Dependable and Secure Computing, 2021.
[22] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, and Z. Chen, “Sysevr: A framework for using deep learning to detect software vulnerabilities,” IEEE Transactions on Dependable and Secure Computing, 2021.
[23] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, “Vuldeepecker: A deep learning-based system for vulnerability detection,” arXiv preprint arXiv:1801.01681, 2018.
[24] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[25] MITRE, “Cve,” https://cve.mitre.org/, 2021, (Accessed on 07/23/2021).
[26] NIST, “Software assurance reference dataset,” https://samate.nist.gov/SARD/, (Accessed on 07/23/2021).
[27] D. Rolnick, A. Veit, S. Belongie, and N. Shavit, “Deep learning is robust to massive label noise,” arXiv preprint arXiv:1705.10694, 2017.
[28] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. Ellingwood, and M. McConley, “Automated vulnerability detection in source code using deep representation learning,” in 2018 17th IEEE international conference on machine learning and applications (ICMLA). IEEE, 2018, pp. 757-762.
[29] T. Spring, “Microsoft quietly patches another critical malware protection engine flaw | Threatpost,” https://threatpost.com/microsoft-quietly-patches-another-critical-malware-protection-engine-flaw/125951/, 2021, (Accessed on 07/23/2021).
[30] S. Suneja, Y. Zheng, Y. Zhuang, J. Laredo, and A. Morari, “Learning to map source code to software vulnerability using code-as-a-graph,” arXiv preprint arXiv:2006.08614, 2020.
[31] N. Visalli, L. Deng, A. Al-Suwaida, Z. Brown, M. Joshi, and B. Wei, “Towards automated security vulnerability and software defect localization,” in 2019 IEEE 17th International Conference on Software Engineering Research, Management and Applications (SERA). IEEE, 2019, pp. 90-93.
[32] H. Wang, G. Ye, Z. Tang, S. H. Tan, S. Huang, D. Fang, Y. Feng, L. Bian, and Z. Wang, “Combining graph-based learning with automated data collection for code vulnerability detection,” IEEE Transactions on Information Forensics and Security, vol. 16, pp. 1943-1958, 2020.
[33] T. Wang, C. Song, and W. Lee, “Diagnosis and emergency patch generation for integer overflow exploits,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2014, pp. 255-275.
[34] D. R. White and S. P. Borgatti, “Betweenness centrality measures for directed graphs,” Social networks, vol. 16, no. 4, pp. 335-346, 1994.
[35] Y. Wu, J. Lu, Y. Zhang, and S. Jin, “Vulnerability detection in c/c++ source code with graph representation learning,” in 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC). IEEE, 2021, pp. 1519-1524.
[36] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song, “Neural network-based graph embedding for cross-platform binary code similarity detection,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '17. New York, NY, USA: Association for Computing Machinery, 2017, p. 363-376. [Online]. Available: https://doi.org/10.1145/3133956.3134018
[37] F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, “Modeling and discovering vulnerabilities with code property graphs,” in 2014 IEEE Symposium on Security and Privacy. IEEE, 2014, pp. 590-604.
[38] M. Zagane, M. K. Abdi, and M. Alenezi, “Deep learning for software vulnerabilities detection using code metrics,” IEEE Access, vol. 8, pp. 74 562-74 570, 2020.
[39] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks,” arXiv preprint arXiv:1909.03496, 2019.
[40] D. Zou, S. Wang, S. Xu, Z. Li, and H. Jin, “μVulDeePecker: A deep learning-based system for multiclass vulnerability detection,” IEEE Transactions on Dependable and Secure Computing, 2019.
[41] D. Zou, Z. Hu, Y. Wang, S. Jiang, Y. Sun, and Q. Gu, “Layer-dependent importance sampling for training deep and large graph convolutional networks,” Advances in neural information processing systems, vol. 32, 2019.
Appendix
A Additional Results on CVE Detection
The following breaks down which CVEs are detected by VulChecker and QAC for different FPRs:
With an FPR set to 0.05, and trained on augmented data only, VulChecker successfully identified the following CVEs in the projects listed in Table 5: CVE-2011-0904, CVE-2016-2799, CVE-2016-5824, CVE-2016-9584, CVE-2016-9591, CVE-2017-12982, CVE-2017-14520, CVE-2017-15565, CVE-2017-15642, CVE-2018-10887, CVE-2019-10020, CVE-2019-10024, CVE-2019-13590, CVE-2019-14288, CVE-2019-17546, CVE-2019-8356, CVE-2020-15305, and CVE-2020-15389.
With the FPR set to 0.1, we also detect CVE-2014-6053, CVE-2016-2799, CVE-2017-15372, CVE-2017-9775, CVE-2019-3822, and CVE-2019-5435.
With the FPR set to 0.2, we also detect CVE-2017-8816, CVE-2018-14618, CVE-2018-20330, CVE-2018-7225, CVE-2019-14289, CVE-2019-5482, and CVE-2020-27828.
Zero-day detection. During evaluation, VulChecker uncovered a novel and exploitable zero-day vulnerability in one of the analyzed C++ projects, demonstrating its ability to identify new bug patterns. The vulnerable code snippet is shown in Figure 8 of the appendix, with some lines removed for brevity. In this case, the vulnerable function is a lexical parser designed to process PDF files. Unfortunately, such parsing is difficult to implement correctly, and the function is almost 100 lines long, posing a challenge for both manual and SAST-assisted code review. As it turns out, due to how the developers nested calls to hd_read_byte, with only the outermost loop checking the buffer's bounds, a maliciously crafted PDF file can cause the final default switch case to write outside the buffer, causing a heap overflow.
Fortunately, VulChecker is able to detect the overflowing instruction in a matter of minutes as a manifestation point for CWE-122, which we verified with an expertly crafted proof-of-compromise (PoC). We have disclosed this vulnerability to the project's developers, who at the time of writing have acknowledged the issue and are developing a patch.
Supply Chain Risk Reduction. One detection made by VulChecker that surprised us occurred in Poppler version 0.10.6. In this case, VulChecker labeled a buggy line that we determined to be CVE-2009-0756; however, this conclusion was initially perplexing because, according to the official CVE advisory, the bug was only known to exist in versions 0.10.4 and earlier, not 0.10.6. However, once we downloaded the patch published by the developers and compared it to our version of Poppler, we discovered that we indeed had a vulnerable version of the library, despite having downloaded it directly from the vendor's website. This is significant because developers frequently check the library dependencies of their projects against CVE advisories for potential risks. In this case, without VulChecker's analysis, such checks would wrongly conclude that there is no need to worry about CVE-2009-0756, since the library's version falls outside of the known vulnerable releases. In reality, the CVE was present in our experiment.
Table 4: Performance at Different FPR Rates
char *s = lb->scratch;
char *e = s + lb->size;
/* truncated for brevity */
while (1) {
    if (s == e) {
        s += pdf_lexbuf_grow(ctx, lb);
        e = lb->scratch + lb->size;
    }
    c = hd_read_byte(ctx, f);
    switch (c) {
    case EOF:
        goto end;
    /* several delimiter cases truncated */
    case '\\':
        c = hd_read_byte(ctx, f);
        switch (c) { /* truncated */ }
        break;
    default:
        *s++ = c; /* BUG: overflow */
        break;
    }
}
Figure 8: Exploitable zero-day found by VulChecker. A lack of bounds checking while performing lexical parsing for PDF files can result in an overflow write.
C Details on the Datasets
Table 5: The CVE Projects used in the Testsets

| CWE | Project | Version | $\|V\|$ |
| :-- | :-- | :-- | :-- |
| 190 | curl | 7.56.1, 7.64.1 | 573k, 584k |
| | libcurl | 7.61.0, 7.63.0 | 571k |
| | libgit2 | 0.26.1, 0.27.2 | 18.2 mil., 18.7 mil. |
| | libjpeg-turbo | 2.0.1 | 274k |
| | libtiff | 4.0.10 | 297k |
| | libvncserver | 0.9.11 | 197k |
| | sox (Sound eXchange) | 14.4.2 | 416k |
| 121 | poppler | 0.55 | 2.7 mil. |
| | sox (Sound eXchange) | 14.4.2 | 416k |
| 122 | curl | 7.61.1 | 568k |
| | graphite2 | 1.3.5 | 228k |
| | jasper | 2.0.22 | 445k |
| | sox (Sound eXchange) | 14.4.2 | 416k |
| 415 | jasper | 2.0.11 | 417k |
| | openjpeg | 2.3.1 | 256k |
| 416 | jasper | 2.0.11 | 417k |
| | libical | 1.0.0, 2.0 | 1.1 mil., 973k |
| | openexr | 2.5.1 | 890k |
| | openjpeg | 2.3.1 | 256k |
| | sox (Sound eXchange) | 14.4.2 | 416k |
Used in EXP1 and EXP2 to make $D_{\text{aug}}$ for training (has no CVE for the CWE).
Used in EXP1 and EXP2 to make $D_{\text{cve}}$ for testing (has CVE label(s)).
Used in EXP2 for testing to make $D_{\text{out}}$ (no labels).
Used in EXP3 for testing; has no reported CVE for CWE-190.
Table 6: Details on the source code projects taken from the wild and used in this paper. Left: a mapping of which project was used for each dataset by CWE. Right: Statistics on the projects.
^6 Example (integer overflow): int x = (y * 5) + z;
^7 A project from GitHub can have over 100k negative samples.
^8 We ensure these projects are free of known CVEs (see section 3.2.3).
^9 We manually verified that our augmentation process was sound with respect to these two properties across a random sample of approximately 100 augmented ePDGs.
^15 The authors feel that reporting cases that might not be true bugs, without a reproducible proof-of-compromise (PoC), would be disrespectful of the developer's time.
^16 https://tinyurl.com/mucpt67x
^17 Our current implementation of the LLVM ePDG extractor supports cmake, but the pipeline is not fundamentally restricted to a particular build system.
^18 400 random negative samples are taken due to the slow speed of LCS.