VulChecker: Graph-based Vulnerability Localization in Source Code VulChecker:基于图的源代码漏洞定位
Yisroel Mirsky, Ben-Gurion University of the Negev; George Macon, Georgia Tech Research Institute; Michael Brown, Georgia Institute of Technology; Carter Yagemann, Ohio State University; Matthew Pruett, Evan Downing, 以色列·米尔斯基,内盖夫本-古里昂大学;乔治·梅肯,乔治亚理工研究所;迈克尔·布朗,乔治亚理工学院;卡特·亚格曼,俄亥俄州立大学;马修·普鲁特,埃文·唐宁,
This paper is included in the Proceedings of the 32nd USENIX Security Symposium. 本文收录于第 32 届 USENIX 安全研讨会论文集。
August 9-11, 2023 *\cdot Anaheim, CA, USA 2023 年 8 月 9 日至 11 日,美国加利福尼亚州阿纳海姆 978-1-939133-37-3978-1-939133-37-3
VulChecker: Graph-based Vulnerability Localization in Source Code VulChecker:基于图的源代码漏洞定位
Yisroel Mirsky ^(1){ }^{1}, George Macon ^(2){ }^{2}, Michael Brown ^(3){ }^{3}, Carter Yagemann ^(4){ }^{4} 以色列·米尔斯基 ^(1){ }^{1} , 乔治·梅肯 ^(2){ }^{2} , 迈克尔·布朗 ^(3){ }^{3} , 卡特·雅格曼 ^(4){ }^{4}Matthew Pruett ^(3){ }^{3}, Evan Downing ^(3){ }^{3}, Sukarno Mertoguno ^(3){ }^{3} and Wenke Lee ^(3){ }^{3} 马修·普鲁特 ^(3){ }^{3} , 埃文·唐宁 ^(3){ }^{3} , 苏卡诺·梅尔托古诺 ^(3){ }^{3} 和 温克·李 ^(3){ }^{3}^(1){ }^{1} Ben-Gurion University of the Negev 内盖夫本-古里昂大学^(2){ }^{2} Georgia Tech Research Institute 乔治亚理工学院研究所^(3){ }^{3} Georgia Institute of Technology 乔治亚理工学院^(4){ }^{4} Ohio State University 俄亥俄州立大学
Abstract 摘要
In software development, it is critical to detect vulnerabilities in a project as early as possible. Although, deep learning has shown promise in this task, current state-of-the-art methods cannot classify and identify the line on which the vulnerability occurs. Instead, the developer is tasked with searching for an arbitrary bug in an entire function or even larger region of code. In this paper, we propose VulChecker: a tool that can precisely locate vulnerabilities in source code (down to the exact instruction) as well as classify their type (CWE). To accomplish this, we propose a new program representation, program slicing strategy, and the use of a message-passing graph neural network to utilize all of code’s semantics and improve the reach between a vulnerability’s root cause and manifestation points. We also propose a novel data augmentation strategy for cheaply creating strong datasets for vulnerability detection in the wild, using free synthetic samples available online. With this training strategy, VulChecker was able to identify 24 CVEs ( 10 from 2019&20202019 \& 2020 ) in 19 projects taken from the wild, with nearly zero false positives compared to a commercial tool that could only detect 4 . VulChecker also discovered an exploitable zero-day vulnerability, which has been reported to developers for responsible disclosure. 在软件开发中,尽早检测项目中的漏洞至关重要。尽管深度学习在这项任务中显示出潜力,但当前最先进的方法无法分类和识别漏洞发生的具体行数。相反,开发者需要在整个函数或更大范围的代码中搜索任意错误。本文提出了 VulChecker:一个能够精确定位源代码中的漏洞(直到具体指令)并分类其类型(CWE)的工具。为此,我们提出了一种新的程序表示、程序切片策略,以及使用消息传递图神经网络来利用代码的所有语义,并改善漏洞根本原因与表现点之间的联系。我们还提出了一种新颖的数据增强策略,以便廉价地创建强大的数据集,用于在实际环境中检测漏洞,利用在线可用的免费合成样本。 通过这种训练策略,VulChecker 能够在 19 个来自实际环境的项目中识别出 24 个 CVE(其中 10 个来自 2019&20202019 \& 2020 ),与只能检测到 4 个的商业工具相比,几乎没有误报。VulChecker 还发现了一个可利用的零日漏洞,已向开发者报告以进行负责任的披露。
1 Introduction 1 引言
Software vulnerabilities are typically introduced into systems as a consequence of flawed security control designs or developer error during the implementation of the software specification. Such flaws and errors are unavoidable during the design and implementation phases of the software life cycle, particularly for large, complex, and interconnected software systems, such as distributed systems. In 2020, there were 18,352 vulnerabilities publicly reported in the Common Vulnerabilities and Exposures (CVE) database [25], and the total number of new vulnerabilities reported has been increasing every year. In reality, many more vulnerabilities exist and are being silently patched (e.g., [29]), exploited by malicious actors (i.e., zero-days), or simply have yet to be discovered. 软件漏洞通常是由于安全控制设计缺陷或开发人员在实施软件规范时的错误而引入到系统中的。这些缺陷和错误在软件生命周期的设计和实施阶段是不可避免的,特别是在大型、复杂和互联的软件系统中,例如分布式系统。2020 年,在公共漏洞和暴露(CVE)数据库中报告了 18,352 个漏洞[25],而新报告的漏洞总数每年都在增加。实际上,存在更多的漏洞,这些漏洞正在被默默修补(例如,[29]),被恶意行为者利用(即零日漏洞),或者尚未被发现。
Detecting software vulnerabilities in the earliest stages of the software life cycle is critically important, primarily because vulnerabilities in deployed software can be exploited by an attacker to achieve a variety of malicious ends (e.g., data theft, ransom, etc.). Additionally, the cost to patch vulnerabilities at various points in the software life cycle increases dramatically; the cost to patch a vulnerability discovered in deployed software is exponentially higher than the cost to patch the vulnerability during implementation. As a result, detecting vulnerabilities in software source code has been a highly active area of both academic and commercial research and development for decades. 在软件生命周期的最早阶段检测软件漏洞至关重要,主要是因为已部署软件中的漏洞可能被攻击者利用以实现各种恶意目的(例如,数据盗窃、勒索等)。此外,在软件生命周期的不同阶段修补漏洞的成本急剧增加;在已部署软件中发现的漏洞的修补成本比在实施阶段修补漏洞的成本高出几个数量级。因此,检测软件源代码中的漏洞几十年来一直是学术界和商业界研究与开发的一个活跃领域。
As such, static application security testing (SAST) tools have become pervasive in the software development process to help developers identify bugs and vulnerabilities in their source code. SAST tools typically employ static software analysis techniques (e.g., taint analysis, data-flow analysis) and/or libraries of vulnerability patterns, matching rules, etc., to scan source code, identify potential bugs or vulnerabilities, and report them for manual inspection. SAST reports are highly precise, yielding alerts that identify specific lines of source code as vulnerable to specific types of vulnerabilities, typically categorized as common weakness enumerations (CWEs). Many SAST tools are available to developers, including both open source (e.g., Flawfinder [5], Infer [7], Clang [3]) and commercial offerings (e.g., Fortify [2], Perforce QAC [6], and Coverity [4]). SAST tools are typically employed during code reviews as part of continuous integration and development processes, or launched manually by developers. 静态应用安全测试(SAST)工具在软件开发过程中变得普遍,帮助开发人员识别源代码中的错误和漏洞。SAST 工具通常采用静态软件分析技术(例如,污点分析、数据流分析)和/或漏洞模式库、匹配规则等,扫描源代码,识别潜在的错误或漏洞,并报告以供人工检查。SAST 报告非常精确,发出的警报能够识别特定行的源代码作为特定类型漏洞的易受攻击点,通常被归类为常见弱点枚举(CWE)。许多 SAST 工具可供开发人员使用,包括开源工具(例如,Flawfinder [5]、Infer [7]、Clang [3])和商业产品(例如,Fortify [2]、Perforce QAC [6] 和 Coverity [4])。SAST 工具通常在代码审查期间作为持续集成和开发过程的一部分使用,或由开发人员手动启动。
Although SAST tools are powerful, they have several disadvantages. First, they often report false positives (i.e., misidentify code as vulnerable). In most cases this is a result of (1) limitations or incomplete rule/pattern matching algorithms, and (2) vulnerabilities that exist in unreachable code. Second, SAST tools struggle to detect complex and contextual vulnerabilities, resulting in high false negative rates (i.e., failing to identify vulnerable code). This is also a limitation of rule/pattern matching algorithms. For example, some program behaviors, such as integer overflows and 尽管静态应用安全测试(SAST)工具功能强大,但它们也存在一些缺点。首先,它们经常报告误报(即错误地将代码识别为脆弱)。在大多数情况下,这源于(1)规则/模式匹配算法的局限性或不完整性,以及(2)存在于不可达代码中的脆弱性。其次,SAST 工具在检测复杂和上下文相关的脆弱性方面表现不佳,导致高误报率(即未能识别脆弱代码)。这也是规则/模式匹配算法的一个局限性。例如,一些程序行为,如整数溢出和
underflows, are legitimate in some contexts (e.g., cryptography) and vulnerable to exploitation in others (e.g., memory allocation). Similarly, use-after-free vulnerabilities are highly context sensitive, sometimes requiring long sequences of (de)allocations over the program’s lifetime to manifest. 在某些情况下(例如,加密),下溢是合法的,而在其他情况下(例如,内存分配)则容易受到利用。同样,使用后释放的漏洞高度依赖于上下文,有时需要在程序的生命周期内进行长序列的(重新)分配才能显现。
1.1 Recent Efforts 1.1 最近的努力
To overcome these limitations, researchers have recently turned to deep learning. Specifically, prior work has explored applications of convolutional (CNN) [19, 28, 38], recurrent (RNN) [20-23, 40], and graph neural networks (GNN) [9,11,13,14,30,32,35,39][9,11,13,14,30,32,35,39] to detect vulnerabilities in source code. This is accomplished by representing the software as an annotated graph (G)(G) to encode its data-flow (DFG), control-flow (CFG), and/or abstract syntax (AST). 为了克服这些限制,研究人员最近转向了深度学习。具体而言,先前的工作探讨了卷积神经网络(CNN)[19, 28, 38]、递归神经网络(RNN)[20-23, 40]和图神经网络(GNN) [9,11,13,14,30,32,35,39][9,11,13,14,30,32,35,39] 在源代码中检测漏洞的应用。这是通过将软件表示为带注释的图 (G)(G) 来实现的,以编码其数据流(DFG)、控制流(CFG)和/或抽象语法(AST)。
While prior approaches have demonstrated good performance, software developers are unlikely to adopt or use these models in the same manner they currently use SAST tools. This is because (1) these methods make predictions over entire functions or code regions and/or (2) cannot classify the type of vulnerability found. Using such a tool would require developers to search through hundreds of lines of code for a vulnerability with no further information. This is even more problematic with tools that do not classify the vulnerability [9,11,14,20-22[9,11,14,20-22, 30,35,38,3930,35,38,39 ] since the developer has no indication of what they are looking for or what remediation approach is required. This issue is amplified in the presence of false positives, which are common occurrence in existing machine learning techniques. 尽管先前的方法表现良好,但软件开发人员不太可能以当前使用 SAST 工具的方式采用或使用这些模型。这是因为(1)这些方法对整个函数或代码区域进行预测和/或(2)无法分类发现的漏洞类型。使用这样的工具将要求开发人员在数百行代码中搜索漏洞,而没有进一步的信息。这在不对漏洞进行分类的工具中尤为棘手,因为开发人员不知道他们在寻找什么或需要什么样的修复方法。在存在假阳性的情况下,这一问题更加严重,而假阳性在现有的机器学习技术中是常见的现象。
1.2 Insights 1.2 见解
To develop a solution that can be readily used by developers in a comparable manner as existing SAST tools, we first identify the issues with current approaches and their causes: 为了开发一种可以被开发者以与现有静态应用安全测试(SAST)工具相似的方式轻松使用的解决方案,我们首先识别当前方法的问题及其原因:
Broad Program Slicing. To localize a vulnerability, the current approach is to take a subgraph G_(i)G_{i} from GG as an observation for making a prediction. G_(i)G_{i} is either an entire function [9,11,28,30,32,35,39][9,11,28,30,32,35,39] or a contiguous region extracted from GG with a point of interest (PoI) at the center (a.k.a., a program slice) [13,14,19,20,22,23,38,40][13,14,19,20,22,23,38,40]. As a result, a prediction made on G_(i)G_{i} can relate to hundreds of lines of code. It also makes it harder for the optimizer since G_(i)G_{i} may contain a number of potential root cause and manifestation points. 广泛程序切片。为了定位漏洞,目前的方法是从 GG 中提取一个子图 G_(i)G_{i} 作为观察以进行预测。 G_(i)G_{i} 可以是整个函数 [9,11,28,30,32,35,39][9,11,28,30,32,35,39] ,也可以是从 GG 中提取的一个连续区域,中心有一个关注点(PoI)(即程序切片) [13,14,19,20,22,23,38,40][13,14,19,20,22,23,38,40] 。因此,对 G_(i)G_{i} 的预测可能与数百行代码相关。这也使得优化器的工作变得更加困难,因为 G_(i)G_{i} 可能包含多个潜在的根本原因和表现点。
Insight 1: To increase the precision of a prediction, the observation must relate directly to the location of the vulnerability. 洞察 1:为了提高预测的准确性,观察必须与脆弱性的位置直接相关。
Incomplete Code Representations. To convert G_(i)G_{i} into a format that machine learning can understand, symbols derived from the AST and/or functions are either compressed [2023,28,4023,28,40 ] or embedded using methods like Word 2 Vec [9, 11,19,30,32,35,39]11,19,30,32,35,39] as a preprocessing step. This method of representing software is compatible with machine learning techniques, but is suboptimal because it prevents the learner from directly reasoning about instruction and dependency semantics. Moreover, information from the AST is either (1) omitted, (2) included in manner that inhibits graph-based learning [9,11,30][9,11,30], or (3) is included incorrectly in a manner that harms code semantics and model efficiency [32,39][32,39]. 不完整的代码表示。为了将 G_(i)G_{i} 转换为机器学习可以理解的格式,从 AST 和/或函数派生的符号要么被压缩[20 23,28,4023,28,40 ],要么使用像 Word 2 Vec [9, 11,19,30,32,35,39]11,19,30,32,35,39] ]的方法嵌入,作为预处理步骤。这种表示软件的方法与机器学习技术兼容,但并不理想,因为它阻止学习者直接推理指令和依赖语义。此外,来自 AST 的信息要么被(1)省略,要么以抑制基于图的学习的方式包含 [9,11,30][9,11,30] ,要么以错误的方式包含,从而损害代码语义和模型效率 [32,39][32,39] 。
Insight 2: To enable effective graph-based learning in this task, a new code representation that properly incorporates the flow of AST information is required. Furthermore, instruction level semantics should be explicitly provided in GG to enable efficient and effective learning. 洞察 2:为了在此任务中实现有效的基于图的学习,需要一种新的代码表示,能够正确地融入 AST 信息的流动。此外,应在 GG 中明确提供指令级语义,以实现高效和有效的学习。
Manifestation Distance. The manifestation of a vulnerability can occur hundreds of lines after its root cause. Current GNN approaches use models that can only propagate a node’s information one or two steps away. As a result, these GNN models cannot infer whether a potential root cause directly influences a potential manifestation point. Instead, these models are limited to identifying local patterns around either point (e.g., a missing guard branch prior to a memcpy is enough to raise an alarm). 表现距离。漏洞的表现可能在其根本原因之后的数百行发生。目前的图神经网络(GNN)方法使用的模型只能传播节点信息一到两步。因此,这些 GNN 模型无法推断潜在根本原因是否直接影响潜在表现点。相反,这些模型仅限于识别任一节点周围的局部模式(例如,在 memcpy 之前缺少保护分支就足以引发警报)。
Insight 3: To increase the model’s ability to identify vulnerabilities, the model must be able to identify causality across the entirety of G_(i)G_{i}. 洞察 3:为了增强模型识别漏洞的能力,模型必须能够识别 G_(i)G_{i} 整体的因果关系。
The Lack of Labeled Data. Deep learning requires thousands of samples for training. However, since it is hard and expensive to label vulnerabilities in real code at the instruction or even line level, large fine-grained datasets do not exist. Therefore, current work must either: (1) use synthetic datasets (e.g., NIST SARD [26]), which do not generalize to large projects or real world code, or (2) use datasets that only label code regions (program slices) or entire functions. Insight 4: To enable cost effective high precision detection in source code, there is a need to develop data augmentation techniques to efficiently expand existing datasets. 缺乏标记数据。深度学习需要成千上万的样本进行训练。然而,由于在指令甚至行级别标记真实代码中的漏洞既困难又昂贵,因此不存在大型细粒度数据集。因此,目前的工作必须要么:(1)使用合成数据集(例如,NIST SARD [26]),这些数据集无法推广到大型项目或真实世界代码,或者(2)使用仅标记代码区域(程序切片)或整个函数的数据集。洞察 4:为了实现源代码中成本效益高的高精度检测,需要开发数据增强技术,以有效扩展现有数据集。
Level of Program Representation. Source-level programming languages are designed for programmer ease and flexibility, and as such can capture several inter-related and high level operations in a single line of source code. Such rich instruction semantics are in stark contrast to the characteristics of software vulnerabilities, which are typically associated with atomic and machine-level instruction semantics (e.g., out-of-bounds buffer accesses, integer overflows). Most prior work extracts GG from the source code without lowering or compiling it. As a result, these models are challenged with identifying low-level vulnerabilities from abstract program representations. 程序表示级别。源级编程语言旨在为程序员提供便利和灵活性,因此可以在一行源代码中捕捉多个相互关联的高级操作。这种丰富的指令语义与软件漏洞的特征形成鲜明对比,后者通常与原子和机器级指令语义相关(例如,越界缓冲区访问、整数溢出)。大多数先前的工作从源代码中提取 GG ,而不进行降级或编译。因此,这些模型在从抽象程序表示中识别低级漏洞时面临挑战。
Insight 5: To improve the model’s ability to reason about vulnerabilities that have atomic machine-level characteristics, GG should capture instruction semantics at lower levels rather than at source code. 洞察 5:为了提高模型对具有原子机器级特征的漏洞进行推理的能力, GG 应该在更低的层次捕捉指令语义,而不是在源代码层面。
1.3 The Proposed Solution 1.3 提出的解决方案
In this paper we propose VulChecker: a deep learning framework for detecting the precise manifestation point 在本文中,我们提出了 VulChecker:一个用于检测精确表现点的深度学习框架
of software vulnerabilities in source code. To accomplish this, VulChecker first passes the program’s source code through a custom LLVM compiler toolchain to generate a GNN-optimized program representation, which we refer to as an enriched program dependency graph (ePDG). 软件源代码中的漏洞。为此,VulChecker 首先通过自定义的 LLVM 编译器工具链处理程序的源代码,以生成一个 GNN 优化的程序表示,我们称之为增强程序依赖图(ePDG)。
ePDGs are graph structures in which nodes represent atomic machine-level instructions and edges represent control- and data-flow dependencies between instructions. ePDGs are well-conditioned for graph-based machine learning models since they more accurately capture the flow of the software. ePDG 是图结构,其中节点表示原子机器级指令,边表示指令之间的控制和数据流依赖关系。ePDG 对于基于图的机器学习模型具有良好的条件,因为它们更准确地捕捉了软件的流动。
To localize a given vulnerability, VulChecker searches the ePDG for potential vulnerability manifestation points, such as calls to a free () function for double free vulnerabilities (i.e., CWE-415). VulChecker then cuts a subgraph G_(i)G_{i} backwards from that point. By having every subgraph terminate at a potential manifestation point, the model obtains a stable and consistent view. This enables the model to flow information about any potential root causes in G_(i)G_{i} down to the singular manifestation point in question. During deployment, when a subgraph is predicted to be vulnerable with respect to a specific CWE, VulChecker alerts the developer to the exact location (line and instruction) of the manifestation point in source code and indicates the CWE type. 为了定位给定的漏洞,VulChecker 在 ePDG 中搜索潜在的漏洞表现点,例如对 free() 函数的调用以查找双重释放漏洞(即 CWE-415)。然后,VulChecker 从该点向后切割一个子图 G_(i)G_{i} 。通过使每个子图在潜在表现点终止,该模型获得了一个稳定且一致的视图。这使得模型能够将关于 G_(i)G_{i} 中任何潜在根本原因的信息传递到相关的单一表现点。在部署过程中,当预测某个子图在特定 CWE 下存在漏洞时,VulChecker 会提醒开发者源代码中表现点的确切位置(行和指令),并指明 CWE 类型。
In order to give the model the ability to reason about causality, we chose a different GNN than previous approaches. In particular, we develop a model based on a message passing GNN called Structure2Vec (S2V) [15]. With this model, we are able to efficiently pass information over hundreds of hops from one side of G_(i)G_{i} to the other. Furthermore, by using S2V we are able to supply edge features (e.g., data type for data-flow dependencies) as well as consider a node’s features at every propagation iteration. 为了赋予模型推理因果关系的能力,我们选择了一种不同于以往方法的图神经网络(GNN)。具体而言,我们开发了一种基于消息传递的 GNN 模型,称为 Structure2Vec(S2V)[15]。通过该模型,我们能够高效地在 G_(i)G_{i} 的一侧与另一侧之间传递数百跳的信息。此外,通过使用 S2V,我们能够提供边特征(例如,数据流依赖的数据类型),并在每次传播迭代中考虑节点的特征。
Finally, to mitigate the issue of obtaining fine-grained labeled datasets, we propose a data augmentation technique. The approach is to generate positive samples by injecting synthetic traces into projects taken from the wild. The approach is cheap because the synthetic traces are taken from open sourced labeled datasets like the Juliet C/C++ Test Suite [26] and GG is manipulated directly to avoid compilation issues. The approach is effective because we inject root causes at random distance away from the manifestation point. This improves generalization by forcing the model to search deeper into G_(i)G_{i} over real code. 最后,为了缓解获取细粒度标记数据集的问题,我们提出了一种数据增强技术。该方法是通过将合成痕迹注入来自实际项目的样本中来生成正样本。该方法成本低,因为合成痕迹来自开源标记数据集,如 Juliet C/C++测试套件[26],并且 GG 被直接处理以避免编译问题。该方法有效,因为我们在距离表现点随机注入根本原因。这通过迫使模型在真实代码中更深入地搜索 G_(i)G_{i} 来提高泛化能力。
We evaluated VulChecker on five distinct CWEs: integer overflow (CWE-190), stack overflow (CWE-121), heap overflow (CWE-122), double free (CWE-415), and use-after-free (CWE-416). We chose these CWEs because they are some of the most prevalent and exploitable types of memory corruption plaguing development in popular languages like C//C++\mathrm{C} / \mathrm{C}++. These CWEs cover both spatial and temporal memory safety issues, testing the flexibility of our approach. They are also featured in MITRE’s 2021 list of the 25 most dangerous CWEs. ^(1){ }^{1} When trained on only augmented data, VulChecker was able to detect 我们对 VulChecker 进行了评估,涵盖了五种不同的 CWE:整数溢出(CWE-190)、栈溢出(CWE-121)、堆溢出(CWE-122)、双重释放(CWE-415)和使用后释放(CWE-416)。我们选择这些 CWE 是因为它们是困扰流行语言开发的一些最常见和可利用的内存损坏类型。这些 CWE 涵盖了空间和时间内存安全问题,测试了我们方法的灵活性。它们还出现在 MITRE 2021 年发布的 25 个最危险 CWE 名单中。当仅在增强数据上进行训练时,VulChecker 能够检测到。
24 vulnerabilities (reported CVEs) in the 19C++19 \mathrm{C}++ projects we evaluated. This was significantly more than the baseline SAST tools, where even the commercial tool only detected 4 of the CVEs. We also had a vulnerability analyst review the top 100 results from a model trained on both augmented and CVE data. The analyst found that VulChecker has hit rate (precision) of 5080%80 \% on the top 50 results. VulChecker was also able to detect a previously unknown vulnerability (zero-day), which has been reported to the developers for responsible disclosure. Overall, we found VulChecker to be a practical tool for developers because: (1) it is able to operate on large projects (such as libgit2 with over 300 files, 110k lines of code, and 18 million instructions), and (2) it has a comparatively low false positive rate. 在我们评估的 19C++19 \mathrm{C}++ 个项目中发现了 24 个漏洞(报告的 CVE)。这显著高于基线 SAST 工具,甚至商业工具仅检测到 4 个 CVE。我们还让一位漏洞分析师审查了基于增强数据和 CVE 数据训练的模型的前 100 个结果。分析师发现,VulChecker 在前 50 个结果中的命中率(精确度)为 50 80%80 \% 。VulChecker 还能够检测到一个之前未知的漏洞(零日漏洞),该漏洞已报告给开发者以进行负责任的披露。总体而言,我们发现 VulChecker 是开发者的一个实用工具,因为:(1)它能够在大型项目上运行(例如 libgit2,拥有超过 300 个文件、11 万行代码和 1800 万条指令),以及(2)它的假阳性率相对较低。
1.4 Contributions 1.4 贡献
Overall the main contributions of this work are as follows: 总体而言,本工作的主要贡献如下:
To the best of our knowledge, VulChecker is the first deep learning framework that can both detect vulnerabilities in source code with instruction and line-level precision and classify the type of CWE. This makes VulChecker the first practical deep learning-based SAST tool for developers, enabling them to quickly identify and remediate vulnerabilities in their code in the same manner as traditional SAST tools. Improving upon prior techniques, developers using VulChecker do not need to search hundreds of lines to find vulnerable code and/or determine what type of vulnerability is present. 据我们所知,VulChecker 是第一个能够以指令和行级精度检测源代码中的漏洞并分类 CWE 类型的深度学习框架。这使得 VulChecker 成为第一个实用的基于深度学习的 SAST 工具,帮助开发者快速识别和修复代码中的漏洞,方式与传统 SAST 工具相同。与之前的技术相比,使用 VulChecker 的开发者无需搜索数百行代码来查找漏洞代码和/或确定存在何种类型的漏洞。
We introduce the use of message passing neural networks for the task of vulnerability detection to learn and identify the relationship (causality) between a vulnerability’s root cause and manifestation point in a PDG. Using this model, we are also able to explicitly assign features to edges in our ePDG, such as data type in the DFG component. 我们引入了消息传递神经网络用于漏洞检测任务,以学习和识别漏洞根本原因与表现点在程序依赖图(PDG)中的关系(因果关系)。使用该模型,我们还能够明确地将特征分配给我们的扩展程序依赖图(ePDG)中的边,例如数据流图(DFG)组件中的数据类型。
We propose the ePDG, a novel graph-based machinelearning optimized program representation that avoids the inefficiencies and drawbacks of others in prior work. 我们提出了 ePDG,一种新颖的基于图的机器学习优化程序表示,避免了先前工作中其他方法的低效和缺陷。
We present a novel data augmentation approach that helps mitigate the lack of fine-grained datasets for GNN-based source code vulnerability detection, while still generalizing to real world software projects taken from the wild. The approach is simple and easy to implement, making it a practical way to boost the performance of future models in this research domain. 我们提出了一种新颖的数据增强方法,旨在缓解基于图神经网络的源代码漏洞检测中细粒度数据集的缺乏,同时仍能推广到来自实际软件项目的真实场景。该方法简单易行,是提升该研究领域未来模型性能的实用方式。
The source code, models, and datasets presented in this paper are available online. ^(2){ }^{2} A demo of VulChecker working as a plugin for Visual Studio Code is also available online. ^(3){ }^{3} 本文中呈现的源代码、模型和数据集均可在线获取。 ^(2){ }^{2} VulChecker 作为 Visual Studio Code 插件的演示也可在线获取。 ^(3){ }^{3}
2 Related Works 2 相关工作
In this section, we review the last three years of related work on deep learning techniques for detecting vulnerabilities in 在本节中,我们回顾了过去三年关于深度学习技术检测漏洞的相关工作
Table 1: Summary of related works in comparison with VulChecker 表 1:与 VulChecker 相关工作的总结比较
source code and contrast them to VulChecker’s model, program representation, and application (summarized in Table 1). 源代码,并将其与 VulChecker 的模型、程序表示和应用进行对比(总结见表 1)。
2.1 Model 2.1 模型
Early work in this area focused on CNNs and RNNs on linear (sequential) representations of the software’s data-flow graphs [19,23,28,38,40][19,23,28,38,40]. However, a linear representation omits software structure, which prevents the model from learning and utilizing various contexts and semantics in its pattern recognition. These linear reductions also make it hard for the model to perform well on code from the wild where a vulnerability can manifest after hundreds of lines of noisy and complex code [11]. 早期的研究集中在卷积神经网络(CNN)和递归神经网络(RNN)对软件数据流图的线性(顺序)表示上 [19,23,28,38,40][19,23,28,38,40] 。然而,线性表示忽略了软件结构,这使得模型无法学习和利用其模式识别中的各种上下文和语义。这些线性简化也使得模型在处理来自实际环境的代码时表现不佳,因为漏洞可能在数百行嘈杂和复杂的代码之后显现[11]。
To overcome these issues, later work [14,39][9,11,13,30,32[14,39][9,11,13,30,32, 35] used graph-based neural networks to consider the code’s structural semantics. They utilize a graph convolutional neural network (GCN) that propagates information to neighboring nodes to learn embeddings for each node, which are then averaged or summed prior to use in classification. However, GCNs do not recall (i.e., propagate) a node’s original feature vector at each iteration and struggle to learn long-distance relationships across the input structure. To remedy this, subsequent work [9,30,32][9,30,32] used gated graph recurrent neural networks (GRNN) that leveraged a recurrent layer to recall information passed to neighboring nodes at previous iterations. However, in these networks the number of layers dictated the number of propagation iterations, which was only one or two [13]. VulChecker’s model is based on Structure2Vec (S2V) [15], a message-passing neural network that can perform hundreds of iterations without needing additional layers. Consequently, VulChecker is better suited for vulnerability detection because it can identify connections between distant root cause and manifestation points. 为了克服这些问题,后来的工作 [14,39][9,11,13,30,32[14,39][9,11,13,30,32 ,35]使用基于图的神经网络来考虑代码的结构语义。他们利用图卷积神经网络(GCN),该网络将信息传播到邻近节点,以学习每个节点的嵌入,然后在分类之前进行平均或求和。然而,GCN 在每次迭代中并不回忆(即传播)节点的原始特征向量,并且在学习输入结构中的长距离关系时存在困难。为了解决这个问题,后续的工作 [9,30,32][9,30,32] 使用了门控图递归神经网络(GRNN),该网络利用递归层回忆在先前迭代中传递给邻近节点的信息。然而,在这些网络中,层数决定了传播迭代的次数,通常只有一到两次[13]。VulChecker 的模型基于 Structure2Vec(S2V)[15],这是一种消息传递神经网络,可以在不需要额外层的情况下执行数百次迭代。因此,VulChecker 更适合用于漏洞检测,因为它能够识别远程根本原因和表现点之间的连接。
2.2 Program Representation 2.2 程序表示
Prior work used a variety of graph-based program representations (depicted in Figure 1), the simplest of which are control-flow graphs (CFG), data-flow graphs (DFG), and combined CFG/DFGs called program dependency graphs (PDG). These representations can be readily generated from source code (e.g., with tools like Joern [37]), but do not capture instruction semantics explicitly. 先前的工作使用了多种基于图的程序表示(如图 1 所示),其中最简单的是控制流图(CFG)、数据流图(DFG)以及称为程序依赖图(PDG)的组合 CFG/DFG。这些表示可以很容易地从源代码生成(例如,使用像 Joern [37]这样的工具),但并未明确捕捉指令语义。
More advanced representations include code property graphs (CPG) [37] and natural code sequence CPGs (ncsCPG) such as in [39]. CPGs merge the source code’s PDG and AST by attaching each source line’s AST subtree under the corresponding node in the PDG. This indirectly captures instruction semantics along with control- and data-flow dependencies in a single structure. However, such representations are not well structured for GNN learning models since information from the AST cannot flow across the graph. Finally, an ncsCPG is a CPG in which the leaf nodes in each AST subtree are superficially linked to leaf nodes in the preceding or successive trees, as visualized with orange edges in Figure 1 [32, 39]. This enables information to flow from the AST during GNN learning, however it is suboptimal because semantically unrelated source code lines become linked, even if they share no data or control dependencies. 更高级的表示包括代码属性图(CPG)[37]和自然代码序列 CPG(ncsCPG),如[39]所示。CPG 通过将每个源代码行的 AST 子树附加到 PDG 中相应节点下,合并了源代码的 PDG 和 AST。这间接捕捉了指令语义以及控制和数据流依赖关系于一个单一结构中。然而,这种表示对于 GNN 学习模型并不结构良好,因为 AST 中的信息无法在图中流动。最后,ncsCPG 是一个 CPG,其中每个 AST 子树的叶节点在表面上与前一个或后一个树的叶节点相连,如图 1 中的橙色边所示[32, 39]。这使得信息能够在 GNN 学习过程中从 AST 流动,然而这并不是最优的,因为语义上无关的源代码行被连接在一起,即使它们没有共享数据或控制依赖关系。
Finally, systems that use sequential models (i.e., CNNs and RNNs) represent code as a sequence extracted or flattened from one of the graphical forms above [19-23, 28, 38, 40]. While appropriate for the model type being used, these representations fail to leverage the inherently graph-like structure and operation of software. 最后,使用序列模型(即 CNN 和 RNN)的系统将代码表示为从上述图形形式之一提取或展平的序列[19-23, 28, 38, 40]。虽然适合所使用的模型类型,但这些表示未能利用软件固有的图形结构和操作。
Regardless of the program representation used, prior work relies on feature extraction to compress and express the contents of each token or node in GG. Examples include one hot encodings and pre-processed embeddings (e.g., Word2Vec [24]) to capture the meaning of different symbols and calls (e.g., int, ==, for, free ( ), etc.) In some cases entire portions of code are summarized using Doc2Vec [18]. The issue with these representations are that (1) nodes in G_(i)G_{i} would likely capture multiple 无论使用何种程序表示,先前的工作依赖于特征提取来压缩和表达每个标记或节点在 GG 中的内容。示例包括独热编码和预处理嵌入(例如,Word2Vec [24]),以捕捉不同符号和调用(例如,int, == ,for,free(),等)的含义。在某些情况下,整个代码部分使用 Doc2Vec [18]进行总结。这些表示的问题在于(1) G_(i)G_{i} 中的节点可能会捕捉到多个。
Figure 1: A comparison of different code representations. Most prior systems rely on CFG (green) and DFG (purple) edges alone, while more recent work includes ASTs (black) to capture the code better. However these trees are not conducive to graph-based learning due to their terminal leaf nodes. This is mitigated by connecting leaves with natural code sequence (NCS) edges (orange) [32,39], but this results in arbitrary paths w.r.t. the data and control dependency edges. 图 1:不同代码表示的比较。大多数先前的系统仅依赖于 CFG(绿色)和 DFG(紫色)边,而最近的工作则包括 AST(黑色)以更好地捕捉代码。然而,由于其终端叶节点,这些树不利于基于图的学习。通过用自然代码序列(NCS)边(橙色)连接叶节点可以缓解这一问题[32,39],但这导致了与数据和控制依赖边相关的任意路径。
operations in a single line of source code resulting in a loss in semantic precision and (2) the use of pre-processed embeddings prevents the model from learning the best representation to optimize the learning objective (i.e., classifying vulnerabilities). In contrast, our proposed ePDG explicitly defines node and edge features that encode relevant information (e.g., operations) while avoiding superficial features (e.g., variable names). 在单行源代码中进行的操作导致语义精度的损失,以及(2)使用预处理的嵌入使模型无法学习最佳表示以优化学习目标(即,分类漏洞)。相比之下,我们提出的 ePDG 明确定义了编码相关信息(例如,操作)的节点和边特征,同时避免了表面特征(例如,变量名称)。
2.3 Application 2.3 应用
Some recent works have proposed detecting vulnerabilities at the binary level. For example, in [36] the authors propose Gemini, a deep learning model which uses Structure2Vec and a Siamese network to measure the similarity of a given function’s CFG to that of a vulnerable one at the binary level. Other approaches such as DeepBinDiff [17], ASM2Vec [16], and Bin2Vec [8] all follow similar approaches for vulnerability detection where a repository of compiled vulnerable code snippets are checked against the binary in question. However, these approaches are limited since (1) the models search for instances of vulnerabilities (e.g., heartbleed) and not the general pattern (CWE), (2) they require binary disassembly, a process that is undecidable in the general case, and as such may miss bugs in code the disassembler fails to recover, and (3) the vulnerabilities identified using these methods indicate code regions and not specific lines. 一些近期的研究提出在二进制级别检测漏洞。例如,在[36]中,作者提出了 Gemini,这是一种深度学习模型,使用 Structure2Vec 和 Siamese 网络来测量给定函数的控制流图(CFG)与二进制级别的漏洞函数的相似性。其他方法如 DeepBinDiff [17]、ASM2Vec [16]和 Bin2Vec [8]也采用类似的方法进行漏洞检测,其中编译的漏洞代码片段库会与相关的二进制文件进行检查。然而,这些方法存在局限性,因为(1)模型搜索的是漏洞的实例(例如,心脏出血漏洞),而不是一般模式(CWE),(2)它们需要二进制反汇编,这在一般情况下是不可判定的,因此可能会遗漏反汇编器无法恢复的代码中的错误,以及(3)使用这些方法识别的漏洞指示的是代码区域而不是具体行。
Regarding vulnerability detection in source code, the related works listed in Table 1 cannot directly identify the line of a vulnerability because their representations of G_(i)G_{i} do not anchor a specific code line. Instead, they cut graphs to include an entire function [9,11,28,30-32,35,39][9,11,28,30-32,35,39] or the code surrounding a potential root cause point for any vulnerability [13,14,19,20,22,23,38,40][13,14,19,20,22,23,38,40], which leads to very broad predictions over entire functions or larger code regions. 关于源代码中的漏洞检测,表 1 中列出的相关工作无法直接识别漏洞的具体行,因为它们对 G_(i)G_{i} 的表示并未锚定特定的代码行。相反,它们切割图形以包含整个函数 [9,11,28,30-32,35,39][9,11,28,30-32,35,39] 或潜在根本原因点周围的代码 [13,14,19,20,22,23,38,40][13,14,19,20,22,23,38,40] ,这导致对整个函数或更大代码区域的非常广泛的预测。
One exception is VulDeeLocator [21], a work developed in parallel to VulChecker. In this work the authors first find PoIs in the source code through the program’s AST and mark all library/API function calls, array definitions, pointer definitions and arithmetic expressions. Each marked PoI is then traced down to a lower level code representation (LLVM IR). ^(4){ }^{4} Then, a forward and backward program slice is taken around the PoI and the slice is flattened into a sequence of 900 tokens (e.g., call, void, @, FUN1,’ (’, …). Next, each token in the sequence is embedded into a vector using Word2Vec and the sequence of vectors is passed to a bidirectional recurrent neural network (biRNN) which predicts which line is vulnerable. 一个例外是 VulDeeLocator [21],这是与 VulChecker 并行开发的工作。在这项工作中,作者首先通过程序的 AST 在源代码中找到感兴趣点(PoI),并标记所有库/API 函数调用、数组定义、指针定义和算术表达式。每个标记的 PoI 随后被追踪到更低级的代码表示(LLVM IR)。然后,在 PoI 周围进行前向和后向程序切片,并将切片展平为 900 个标记的序列(例如,call、void、@、FUN1、'('、…)。接下来,序列中的每个标记都使用 Word2Vec 嵌入到一个向量中,并将向量序列传递给一个双向递归神经网络(biRNN),该网络预测哪一行是脆弱的。
However, similar to prior work [9,11, 14, 20, 22, 30, 35,38,39], VulDeeLocator cannot indicate the vulnerability being detected. This leaves the developer to guess the vulnerability out of hundreds of CWEs. Moreover, selecting PoIs from ASTs is not suitable for spatial and temporal memory safety violations. In contrast our ePDG representation based on IR is closer to machine code and resolves ambiguities regarding data types, temporary variables, and storage locations. Furthermore, like other works, VulDeeLocator remove loops and flatten code into a sequence into a finite number of tokens which reducing the code’s semantics and patterns. Finally, many source code based systems like VulDeeLocator tailored to specific languages. However, VulChecker performs analysis at the IR level, making it somewhat language agnostic. Although we only evaluate on C and C++\mathrm{C}++ projects in this paper, VulChecker has the potential to work on other languages such as Objective-C, Fortran, and Rust since LLVM can lower them as well. Further research is need to verify compatibility. 然而,与之前的工作[9,11, 14, 20, 22, 30, 35,38,39]类似,VulDeeLocator 无法指示被检测到的漏洞。这使得开发者不得不在数百个 CWE 中猜测漏洞。此外,从 AST 中选择兴趣点不适合空间和时间内存安全违规。相比之下,我们基于 IR 的 ePDG 表示更接近机器代码,并解决了关于数据类型、临时变量和存储位置的模糊性。此外,像其他工作一样,VulDeeLocator 去除了循环并将代码展平为有限数量的标记序列,从而减少了代码的语义和模式。最后,许多基于源代码的系统,如 VulDeeLocator,都是针对特定语言量身定制的。然而,VulChecker 在 IR 级别进行分析,使其在某种程度上与语言无关。尽管我们在本文中仅对 C 和 C++\mathrm{C}++ 项目进行了评估,但 VulChecker 有潜力在其他语言上工作,如 Objective-C、Fortran 和 Rust,因为 LLVM 也可以将它们降级。需要进一步研究以验证兼容性。
In summary, in contrast to the current state-of-the-art, VulChecker (1) can locate vulnerabilities at both line and instruction level, (2) can classify the type of vulnerability, (3) can better associate root cause and manifestation points by reaching deeper into program slices using a message passing GNN, and (4) may be generalized to a wider array of programming languages. 总之,与当前的最先进技术相比,VulChecker (1) 可以在行级和指令级定位漏洞,(2) 可以对漏洞类型进行分类,(3) 可以通过使用消息传递的图神经网络更好地关联根本原因和表现点,深入程序切片,(4) 可能会推广到更广泛的编程语言。
3 VulChecker
This section presents details on how the VulChecker framework functions. We start with an overview of the pipeline phases and important notation and then elaborate on each step. Overview. For each CWE (i.e., class of vulnerability), VulChecker trains a separate model and uses a different sampling strategy to extract observations (i.e., potential manifestation points). Each of VulChecker’s CWE pipelines follow the same four steps: (1) ePDG generation, (2) sampling, (3) feature extraction, and (4) model training or execution. 本节介绍了 VulChecker 框架的功能细节。我们首先概述管道阶段和重要符号,然后详细说明每个步骤。概述。对于每个 CWE(即漏洞类别),VulChecker 训练一个单独的模型,并使用不同的采样策略来提取观察结果(即潜在表现点)。VulChecker 的每个 CWE 管道遵循相同的四个步骤:(1)ePDG 生成,(2)采样,(3)特征提取,以及(4)模型训练或执行。
Figure 2: A diagram showing the steps of VulChecker’s pipeline for one CWE. Note that the real graphs are significantly larger than what is visualized (e.g., projects like libgit2-v0.26.1 have over 18 million nodes in GG ). Solid edges represent control-flow and dashed edges are data dependencies. 图 2:一个图示,展示了 VulChecker 针对一个 CWE 的流程步骤。请注意,实际图形的规模远大于所展示的(例如,像 libgit2-v0.26.1 这样的项目有超过 1800 万个节点在 GG 中)。实线边表示控制流,虚线边表示数据依赖。
Figure 2 provides a visual summary of these steps. 图 2 提供了这些步骤的视觉总结。
ePDG Generation: Given a target program’s source code (denoted by SS ), we first compile it to LLVM IR (intermediate representation) and apply several optimizations. Next, we use a custom LLVM plugin specific to the CWE type of interest to analyze the IR, tag potential root cause and manifestation points, and produce an ePDG of SS denoted by GG. This process is detailed in Section 3.1. ePDG 生成:给定目标程序的源代码(表示为 SS ),我们首先将其编译为 LLVM IR(中间表示)并应用若干优化。接下来,我们使用一个特定于感兴趣的 CWE 类型的自定义 LLVM 插件来分析 IR,标记潜在的根本原因和表现点,并生成一个表示为 GG 的 SS 的 ePDG。该过程在第 3.1 节中详细说明。
Sampling: Next, we scan GG to locate potential manifestation points for the given CWE (Section 3.2) that were tagged during ePDG generation. A manifestation point is any node mm (instruction) in GG that is known to manifest that CWE (e.g., stack memory writes for stack overflow). For the ii-th manifestation point in GG, we cut a subgraph G_(i)G_{i} backwards from that point up to a given depth. G_(i)G_{i} represents an observation upon which VulChecker predicts whether or not a vulnerability exists. 采样:接下来,我们扫描 GG 以定位在 ePDG 生成过程中标记的给定 CWE(第 3.2 节)的潜在表现点。表现点是指在 GG 中已知会表现该 CWE 的任何节点 mm (指令)(例如,堆栈溢出的堆栈内存写入)。对于 GG 中的第 ii 个表现点,我们从该点向后切割一个子图 G_(i)G_{i} ,直到给定的深度。 G_(i)G_{i} 表示 VulChecker 预测是否存在漏洞的观察结果。
Feature Extraction: Using the same structure of G_(i)G_{i}, we create an annotated graph G_(i)^(')G_{i}^{\prime} where each node contains a feature vector that explicitly captures the respective instruction in G_(i)G_{i}. Similarly, we add explicit features to the edges of G_(i)^(')G_{i}^{\prime}. Consequently, G_(i)^(')G_{i}^{\prime} captures the code leading up to the ii-th manifestation point in a manner that a graph-based machine learning model can understand (Section 3.3). 特征提取:使用 G_(i)G_{i} 的相同结构,我们创建一个注释图 G_(i)^(')G_{i}^{\prime} ,其中每个节点包含一个特征向量,明确捕捉 G_(i)G_{i} 中的相应指令。同样,我们为 G_(i)^(')G_{i}^{\prime} 的边添加了显式特征。因此, G_(i)^(')G_{i}^{\prime} 以图形化机器学习模型可以理解的方式捕捉到达第 ii 个表现点之前的代码(第 3.3 节)。
Model Training/Execution: VulChecker predicts whether the ii-th manifestation point is vulnerable or not by passing G_(i)^(')G_{i}^{\prime} through a S2V model MM (Section 3.4). The S2V model is a binary classifier trained on a specific CWE. To train MM, we use many negative samples of SS, where each sample has been augmented with many positive synthetic examples (Section 4). 模型训练/执行:VulChecker 通过将 G_(i)^(')G_{i}^{\prime} 传递给 S2V 模型 MM (第 3.4 节)来预测第 ii 个表现点是否脆弱。S2V 模型是一个针对特定 CWE 训练的二元分类器。为了训练 MM ,我们使用了许多 SS 的负样本,每个样本都经过许多正合成示例的增强(第 4 节)。
Annotation If MM predicts G_(i)^(')G_{i}^{\prime} as positive, then the debug symbols passed down from SS to the ii-th manifestation point in the ePDG are used to report line number and instruction to the developer. 注释 如果 MM 将 G_(i)^(')G_{i}^{\prime} 预测为正,则从 SS 传递到 ePDG 中第 ii 个表现点的调试符号用于向开发者报告行号和指令。
3.1 ePDG Generation 3.1 ePDG 生成
The creation of an ePDG consists of two steps: (1) lowering the source code SS to LLVM IR and (2) extracting GG based on the structure and flows it contains. 创建电子程序设计图(ePDG)包括两个步骤:(1)将源代码 SS 降级为 LLVM IR,和(2)根据其包含的结构和流程提取 GG 。
3.1.1 Lowering Code to LLVM IR 3.1.1 降低代码到 LLVM IR
The first step in ePDG generation is compiling the source code SS