Automation in Construction
Volume 142, October 2022, 104524

Knowledge-informed semantic alignment and rule interpretation for automated compliance checking

https://doi.org/10.1016/j.autcon.2022.104524

Highlights

  • Proposes a novel knowledge-informed framework for automated rule checking (ARC).
  • Domain knowledge for ARC, including concepts, relationships, and constraints, is considered.
  • The proposed method outperforms rule-based methods, achieving 90.1% accuracy for semantic alignment.
  • Complex rules with implicit information and high-order constraints can be interpreted.
  • Automated rule interpretation can be 5 times faster than manual interpretation by domain experts.

Abstract

As an essential procedure for improving design quality in the construction industry, automated rule checking (ARC) requires intelligent rule interpretation from regulatory texts and precise alignment of concepts from different sources. However, semantic gaps still exist between design models and regulatory texts, hindering the application of ARC. Thus, a knowledge-informed framework for improved ARC is proposed based on natural language processing. Within the framework, an ontology is first established to represent domain knowledge, including concepts, synonyms, relationships, and constraints. Then, semantic alignment and conflict resolution are introduced to enhance the rule interpretation process based on predefined domain knowledge and unsupervised learning techniques. Finally, an algorithm is developed to identify the proper SPARQL function for each rule and then generate SPARQL-based queries for model checking, making it possible to interpret complex rules in which extra implicit data needs to be inferred. Experiments show that the proposed framework and methods successfully fill the semantic gaps between design models and regulatory texts with domain knowledge, achieving 90.1% accuracy and substantially outperforming the commonly used keyword matching method. In addition, the proposed rule interpretation method proves to be 5 times faster than manual interpretation by domain experts. This research contributes a novel framework and the corresponding methods to enhance automated rule checking with domain knowledge.

Keywords

Automated rule checking (ARC)
Automated compliance checking (ACC)
Rule interpretation
Natural language processing (NLP)
Semantic alignment
Knowledge modeling
Building information modeling (BIM)


1. Introduction

Building regulatory documents such as design guidelines, codes, standards, and manuals of practice stipulate the safety, sustainability, and comfort of the entire lifecycle of a built environment [1]. Given that extensive human knowledge is needed in the regulation compliance checking process, building designs were long checked manually by domain experts. In addition, subjective judgment is inevitable in this process, making it impossible to handle massive and heterogeneous building information and regulations effectively and sustainably [2]. Therefore, an automated compliance checking (ACC) method is critical for a more efficient and consistent checking procedure.
To promote the compliance checking process in the architecture, engineering, and construction (AEC) industry, automated rule checking (ARC, also known as ACC) has been extensively studied by many researchers, and several ARC systems have been established [[3], [4], [5], [6]]. The rule checking process can be broadly structured into four stages: 1) rule interpretation, which translates rules represented in natural language into a computer-processable format, 2) building model preparation, which prepares the required information for the checking process, 3) rule execution, which checks the prepared model using computer-processable rules, and 4) reporting checking results [3,5]. Among the four stages, rule interpretation is the most vital and complex stage that needs extensive investigation [5].
Most of the existing ARC systems, including CORENET, which emerged from the first large ARC effort in Singapore, and the widely used Solibri Model Checker (SMC), are usually based on hard-coded or manual rule interpretation methods [3]. They are sometimes referred to as ‘Black Box’ approaches because it is difficult to maintain and modify hard-coded rules efficiently [7,8]. Therefore, semiautomated and automated methods for interpreting regulatory text into a computer-processable format have been proposed to achieve a flexible, transparent, and portable ARC process. Initially, semiautomated rule interpretation methods, such as the RASE methodology [9], were introduced to help AEC experts analyze the semantic structure of regulatory requirements, and document markup techniques were used to annotate different components of a regulatory requirement. However, these methods only process text at a coarse granularity and require considerable human effort to mark up regulation documents and create queries or pseudocode from them. For this reason, natural language processing (NLP), a widely used method for processing and understanding text-based human language [6], has been introduced to automate rule interpretation from regulatory documents. Generally, the NLP-based method for automated rule interpretation consists of two tasks: 1) information extraction, which extracts semantic information from textual building regulatory documents automatically [10,11], and 2) information transformation, which transforms the extracted information into logic clauses to support reasoning in compliance checking [2,12]. Chaining the information extraction and information transformation tasks together has enormously facilitated the automated rule interpretation process. To enhance NLP-based methods for different tasks in rule interpretation, techniques such as ontology, machine learning, and deep learning have been introduced to achieve a more comprehensive understanding of regulatory texts [6].
Although many efforts have been made toward automated rule interpretation, there are still some research gaps. First, as pointed out by some researchers [7,13], to achieve fully automated rule checking, both the regulations and the BIM models should use the same concepts and relations to define and describe the same objects, i.e., the two representations should speak “the same language”. However, different terms and modeling paradigms may be adopted when dealing with regulatory documents and BIM models. That is, there exists a semantic gap between design models and regulatory documents. Therefore, an extra step called semantic alignment is needed to determine correspondences between the concepts and relations in regulations and those in BIM models. In previous studies, the methods for aligning the concepts and relations in computer-interpretable regulations with those in ontologies or BIMs were mainly based on keyword matching [10,11] or handcrafted mapping tables [2]. Researchers must expend huge effort to construct mapping tables containing all aliases, abbreviations, alternate spellings, etc. For instance, designers may use the alias “fire partition wall” to represent the concept “firewall”. If the alias of a concept is not contained in the mapping table, the semantic alignment process will fail. Therefore, a more robust and flexible method for automated semantic alignment is required. Second, when mapping the extracted information into plain description logic clauses, such as Horn clauses [12], the B-Prolog representation [10,11], or deontic logic (DL) clauses [2], hidden domain knowledge is required to infer implicit relationships, data types, and properties. Due to the lack of domain knowledge, many queries in the AEC domain are hampered by implicit required relationships and properties that are difficult to retrieve [14]. It is therefore important to incorporate domain knowledge into the rule interpretation process. In addition, the widely used B-Prolog representation and DL clauses have difficulty dealing with implicitly defined quantity information. For example, the facts that "room A" contains "safe exit B" and "safe exit C" are explicitly stored in a BIM model, while the number of safe exits contained in "room A" is implicit information that requires additional calculation. Therefore, a logical representation of complex rules or high-order information is needed.
To address the abovementioned problems, this research proposes a novel knowledge-informed semantic alignment and rule interpretation framework for ARC. Domain knowledge, such as equivalent concepts, constraints, and relationships, is modeled based on an ontology and further utilized in three ways to enhance the ARC process: 1) automated alignment of regulation concepts with ontology concepts based on semantic similarity, 2) improving the accuracy of rule interpretation by detecting and resolving potential conflicts in rules, and 3) automatically generating SPARQL-based representations of complex rules that involve implicit information.
The remainder of this paper is organized as follows: Section 2 reviews the related studies and highlights the potential research gaps. Section 3 describes the knowledge-informed framework for ARC proposed in this research; the proposed algorithms and methods are also explained in this section. Section 4 illustrates and analyzes the results of the experiments. Section 5 demonstrates the application of this framework. Section 6 discusses the advantages and contributions of this research, while also summarizing potential limitations. Finally, Section 7 concludes this research.

2. Overview of related studies

2.1. NLP-based automated regulatory document interpretation

Since decision tables were first utilized to check structural designs against building codes [16], ARC methods and systems have been extensively studied, especially in recent decades with the increasing maturity and adoption of BIM [4,15]. However, for the existing ARC systems, intensive manual effort is still needed in the rule interpretation stage [3]. Therefore, to make the rule interpretation stage more efficient and transparent, numerous automated rule interpretation methods have been proposed.
In recent years, NLP algorithms that can achieve a comprehensive understanding of a text [6] have been utilized to develop automated rule interpretation methods. NLP algorithms can be mainly divided into two approaches: the handwritten rules approach and the statistical approach [17]. The former depends on symbolic human-defined pattern-matching rules, including Backus-Naur Form (BNF) notation, regular expression syntax, and the Prolog language. The latter automatically builds probabilistic "rules" from training datasets (e.g., large annotated bodies of text) in the manner of machine-learning algorithms, including support vector machines (SVMs), hidden Markov models (HMMs), conditional random fields (CRFs), and naïve Bayes [17]. In the AEC industry, Zhang & El-Gohary pointed out that domain-specific regulatory text is more suitable for automated NLP than general nontechnical text (e.g., news articles, general websites) because regulatory text has fewer homonym conflicts and fewer coreference resolution problems. Additionally, an ontology for a specific domain is easier to develop than one that captures general knowledge across multiple domains [12]. A domain ontology is able to provide a shared vocabulary among the parties by defining abstract concepts and relationships, including the taxonomy of concepts, equivalent and disjoint concepts, and enumerations of terminologies. As such, raw data is provided with explicit meaning, making it easier for machines to automatically process and integrate data through the ontology [18]. A domain ontology may enhance the automated interpretability and understandability of domain-specific text [12]. Therefore, in the AEC industry, many studies of NLP-based automated rule interpretation methods have adopted rule-based and ontology-based approaches [6]. For example, Zhang & El-Gohary [10,12] proposed an automated rule interpretation method mainly consisting of three processing phases: 1) text classification, which recognizes relevant sentences in a regulatory text corpus; 2) information extraction, which extracts the words and phrases in the relevant sentences based on pattern-matching-based information extraction rules and ontology; and 3) information transformation, which transforms the extracted information into Horn clauses or the B-Prolog representation using regular expression-based mapping rules. Zhou & El-Gohary [11] proposed a rule-based, ontology-enhanced information extraction method for extracting building energy requirements from energy conservation codes, and then formatted the extracted information into a B-Prolog representation. Zhou et al. [20] proposed a novel fully automated rule extraction method based on a deep learning model and a set of context-free grammars (CFGs), which can interpret regulatory rules into pseudocode formats automatically with high generality, accuracy, and interpretability. Xu and Cai [2] proposed an ontology- and rule-based NLP framework to automate the interpretation of utility regulations into deontic logic (DL) clauses, where pattern-matching rules are used for information extraction; a pre-learned model and domain-specific handcrafted mapping methods were also adopted for semantic alignment between rules and the ontology.

2.2. Semantic alignment methods for ARC

Most of the studies focusing on information extraction assumed that the concepts and relations in regulatory documents were consistent with those in design models. However, this is not the case in most scenarios. To achieve fully automated rule checking, both the regulations and the design models (including BIMs and their equivalent ontologies) should use the same concepts and relations to define and describe the same objects, i.e., the two representations should speak “the same language” [7,13].
Thus, the semantic alignment in ARC aims to map the extracted named entities in the text to the corresponding concepts in the knowledge base (e.g., ontology, knowledge graph) to facilitate the understanding of the text [21,22]. In the existing studies of ARC, the methods for aligning and mapping the concepts and relations in computer-interpretable regulations to those in ontologies or BIMs are mainly rule-based approaches, such as exhaustive usage of keyword matching [10,11] and handcrafted mapping tables [2].
In other domains, however, semantic alignment methods have been widely researched [21] and can be divided into three types of approaches: 1) rule-based methods, 2) supervised learning methods, and 3) unsupervised learning methods. Rule-based methods utilize dictionary-based look-up and string-matching algorithms [23] or hand-written rules to measure the morphological similarity between extracted words and ontology concepts [24]. For string-matching algorithms, large efforts are involved in the construction of a large name dictionary containing aliases, abbreviations, alternate spellings, etc. [21]. If an alias of a concept is not contained in the mapping table, the semantic alignment process will fail. For hand-written rules, human effort is involved in defining text patterns and pattern-matching rules for semantic alignment. Supervised learning methods learn the similarities between extracted words and ontology concepts from labeled training data. In recent years, some open datasets for model training have been developed; therefore, deep learning-based semantic alignment methods have been investigated [25]. While this kind of approach eliminates human involvement in pattern definition, it still requires considerable manual effort to prepare a training dataset [2] because there are few open datasets that contain domain-specific entities in the construction domain. Unsupervised methods, such as word embedding models, can learn distributed representations of words from large unlabeled corpora and are promising approaches for capturing semantic information [22]. Unsupervised methods require less human effort in the data preparation stage than the former two types of methods.

2.3. Research gaps

Despite a large number of efforts toward automated interpretation of regulatory documents, there are still some limitations in the following two main aspects.
First, in the construction domain, the existing studies mainly rely on rule-based methods for semantic alignment, which depend on hard-coded rules and are hard to generalize. In addition, constructing mapping tables containing all aliases, abbreviations, and alternate spellings requires a large amount of effort.
Second, the existing studies mainly focus on interpreting rules expressed in natural language into pseudocode or plain description logic. Pseudocode formats are still difficult for computers to reason over directly [2]. The studies utilizing plain description logic mainly focus on the logical representation of simple rules; complex rules with implicit relationships and properties are usually ignored. One typical example is “the number of stories in a factory should not exceed 2”. When the number of stories is not explicitly stored in the BIM model, it should be derived through aggregation functions by counting the floor objects. However, the counting ability of plain description logic languages (e.g., first-order logic) is very limited [19], making it difficult to derive the implicit information required for rule checking.
There are two main obstacles to interpreting complex rules. On the one hand, the expressiveness of the widely used plain description logic clauses is limited for representing complex rules with implicit attributes that need additional reasoning. On the other hand, additional effort is needed to automatically choose the proper information transformation method for different regulatory texts. For example, regulatory texts that require counting functions to infer implicit information should first be identified. Thus, text classification is usually used to classify regulatory texts for further processing. Zhou & El-Gohary [10,11] adopted a text classification method to filter out irrelevant text. However, the existing text classification studies mainly focus on the intended scopes and applications at a coarse granularity level (e.g., relevant or irrelevant), so they cannot be used to choose the proper information transformation method for each rule.

3. Methodology and framework

To address the abovementioned problems, a novel knowledge-informed framework for ARC based on NLP techniques is proposed, as illustrated in Fig. 1. The proposed framework consists of four parts: ontology-based knowledge modeling, model preparation with semantic enrichment, enhanced rule interpretation, and checking execution. Based on the framework, three new methods are proposed: an unsupervised semantic alignment method, a conflict resolution method, and a code generation method that supports the interpretation of a wider range of rules. The main contributions of this work are highlighted in Fig. 1. First, we introduce the four parts of the proposed framework.
  • 1.
    Ontology-based knowledge modeling. This part is the basis of the proposed framework; the domain-specific concepts, properties, relations, and descriptions related to regulatory documents are integrated into the ontology. In addition, a set of semantic enrichment rules for model preparation and conflict resolution rules for rule interpretation can be established based on domain-specific knowledge.
  • 2.
    Model preparation with semantic enrichment. Since IFC (Industry Foundation Classes) is supported by a wide range of BIM software [26], it is utilized for data exchange to improve the scalability of ARC systems. However, the IFC file cannot be directly parsed by the reasoning engine, thus requiring further model conversion and enrichment. This part aims to automatically convert the BIM model in IFC format into the Turtle (Terse RDF Triple Language) format based on the proposed ontology, the IFC parsing method, and the ontology mapping method (a minimal sketch of this conversion is given after this list). Moreover, a set of semantic model enrichment rules is adopted to infer implicit information and supply missing information in the Turtle model before the checking process. The model preparation pipeline consists of model conversion and model enrichment, as shown in Fig. 1. The details of the model preparation pipeline are illustrated in Appendix A.
  • 3.
    Enhanced rule interpretation. In this part, the rules are automatically interpreted into a computer-executable format through a pipeline of NLP-based methods, including preprocessing, semantic labeling, parsing, semantic alignment, conflict resolution, and code generation. First, preprocessing splits the regulatory document into sentences to facilitate later steps. Semantic labeling assigns labels to words or phrases in a sentence. Subsequently, the parsing step parses the labeled sentence into a language-independent syntax tree. Semantic alignment automatically aligns the concepts in the syntax tree with those in the ontology based on unsupervised learning techniques, such as word2vec. Then, a set of conflict resolution rules is utilized to revise the alignment result. Finally, the code generation step transforms the aligned syntax tree into computer-executable code.
  • 4.
    Checking execution. Finally, automated compliance checking is performed through a reasoning engine, such as Protégé or GraphDB [27].
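To make the model conversion step in part 2 concrete, the following is a minimal sketch of how an IFC file could be parsed and re-serialized as Turtle, assuming the ifcopenshell and rdflib packages; the FPBO namespace URI, the mapped entity type, and the property name hasGlobalId are illustrative assumptions rather than the full mapping described in Appendix A.

```python
import ifcopenshell                      # IFC parsing
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

FPBO = Namespace("http://example.org/fpbo#")  # assumed namespace for the FPBO

def ifc_to_turtle(ifc_path: str, ttl_path: str) -> None:
    """Convert (part of) an IFC model into Turtle triples typed against FPBO classes."""
    model = ifcopenshell.open(ifc_path)
    g = Graph()
    g.bind("fpbo", FPBO)
    for wall in model.by_type("IfcWall"):            # one example entity type
        uri = FPBO[f"Wall_{wall.GlobalId}"]
        g.add((uri, RDF.type, FPBO.Wall))            # map IfcWall to the FPBO class Wall
        g.add((uri, FPBO.hasGlobalId, Literal(wall.GlobalId)))
    g.serialize(destination=ttl_path, format="turtle")
```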

Fig. 1. The proposed NLP-based and knowledge-informed automated rule checking framework.

To better illustrate the proposed knowledge-informed ARC framework, the regulatory texts from Section 3 of the Chinese Code for fire protection design of buildings (GB 50016–2014) [30] are selected as an example. GB 50016–2014 [30] is the Chinese national fire code that stipulates requirements on building shape, properties, safe evacuation, and so on. The regulatory texts of Section 3 in GB 50016–2014 [30] concern the fire safety of factories. First, a new ontology and a set of rules are developed to integrate domain knowledge (Section 3.1). Then, the automated rule interpretation method is illustrated in Section 3.2. Note that the model preparation part is not the core of this work, so it is only briefly introduced in Appendix A. For the checking execution stage, this work implements an automated rule checking system based on Python and GraphDB (Section 5).

3.1. Ontology-based knowledge modeling

To integrate domain-specific knowledge and concepts, the fire protection for building ontology (FPBO) was developed. In addition, domain-range constraints and equivalent classes for conflict resolution, as well as rules for semantic enrichment, are established based on the FPBO and domain knowledge.

3.1.1. Classes

For the building information domain, some concepts and hierarchical relations in ifcOWL [28] and the Building Topology Ontology [29] are referenced in the creation of the FPBO. For the fire protection domain, the concepts and relations are extracted and summarized from GB 50016–2014 [30]. A semiautomated ontology construction method proposed by Qiu et al. [31] was adopted. Based on the regulatory corpus, candidate concepts were first extracted using statistics-based methods (e.g., term frequency-inverse document frequency), and the top 500 words with the highest scores were retained. Then, a semantic clustering method was used to merge the domain words: if the similarity (i.e., word2vec-based similarity) of two clusters is higher than a threshold (0.6 in this work), the two clusters are merged. A total of 183 clusters were obtained. Next, the subsumption method was used to build the taxonomy structure within each cluster. The basic idea is that concept c1 is considered more specific than concept c2 if the document set in which c1 occurs is a subset of the document set in which c2 occurs [31]. After applying the subsumption method, a typical taxonomy structure has “building” at the first level and “plant”, “warehouse”, “residence”, etc. at the second level. Finally, the concepts and hierarchy of the ontology were manually adjusted and supplemented. This semiautomated method reduces the manual effort of keyword searching and clustering. The FPBO was then created in Protégé [32]. Fig. 2 illustrates part of the structure of the FPBO, in which classes are represented by blue boxes. The classes BuildingSpatialElement and BuildingComponentElement are the central concepts of the FPBO. BuildingSpatialElement, representing a general space that has a 3D spatial extent related to the building domain, is further categorized into four subclasses: BuildingSite, BuildingRegion, BuildingStorey, and BuildingSpace. BuildingComponentElement represents the essential elements in architecture and structural engineering, such as beams, columns, walls, and doors.
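As an illustration of two of these steps, the sketch below merges candidate-term clusters whose word2vec similarity exceeds the 0.6 threshold and applies the subsumption test used to order concepts. It assumes a pretrained gensim KeyedVectors model, and the use of average pairwise word similarity as the cluster similarity is an assumption about the exact criterion.

```python
from itertools import product
from gensim.models import KeyedVectors  # pretrained word2vec vectors (see Section 3.2.3)

def cluster_similarity(c1: list, c2: list, wv: KeyedVectors) -> float:
    """Average pairwise word2vec similarity between the words of two clusters (assumed criterion)."""
    pairs = [(a, b) for a, b in product(c1, c2) if a in wv and b in wv]
    return sum(wv.similarity(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def merge_clusters(clusters: list, wv: KeyedVectors, threshold: float = 0.6) -> list:
    """Greedily merge any two clusters whose similarity exceeds the threshold (0.6 in the paper)."""
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if cluster_similarity(clusters[i], clusters[j], wv) > threshold:
                    clusters[i] = clusters[i] + clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

def is_more_specific(docs_c1: set, docs_c2: set) -> bool:
    """Subsumption test: c1 is more specific than c2 if every document mentioning c1 also mentions c2."""
    return docs_c1 < docs_c2   # proper subset of the documents in which c2 occurs
```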

Fig. 2. Structure of the fire protection of building ontology.

3.1.2. Properties and their constraints

Object properties (e.g., hasBuildingSpatialElement, hasBuildingElement) are also introduced in the FPBO to establish the relationships between BuildingSpatialElement and BuildingComponentElement; they are represented by blue arrows in Fig. 2. Key terms in the fire protection domain are defined as data properties, such as hasFireResistanceLimit_hour representing the fire resistance rating; data properties are represented by orange arrows in Fig. 2. In addition, the domains and ranges of the data properties and object properties have been defined, which can be used for conflict resolution when generating rules from texts. Part of the domains and ranges of the properties are shown in Fig. 2. For example, the domain of the data property ‘isFireWall_Boolean’ is the class ‘Wall’, and its range is a Boolean value; other combinations are not allowed. As another example, the range of the data property ‘hasFireHazardCategory’ only includes the following values: Class A, B, C, D, and E, and these values uniquely correspond to this data property. The ranges of the properties are defined using Protégé's "data range expression" function, and the result for the data property ‘hasFireHazardCategory’ is shown in Fig. 2. The range of the data property ‘hasNumberOfFloors’ is defined using the same method, and its range values are shown in Table 1.

Table 1. Domain-range conflict resolution dictionary.

Key (original concept of the node) | Range of the concept | Value ([new concept of the node, new word of the node, concept of its sub-node, word of its sub-node, value of its sub-node])
isFireProtectionSubdivision_Boolean | True/False | [BuildingSpace, Building Space, isFireProtectionSubdivision_Boolean, is fire compartment, True]
IsSecurityExits_Boolean | True/False | [Doors, door, IsSecurityExits_Boolean, is Safe exit, True]
isFireWall_Boolean | True/False | [Wall, wall, isFireWall_Boolean, is fire wall, True]
hasFireHazardCategory | Class A, Class B, Class C, Class D, Class E | /
hasNumberOfFloors | Single-story, multiple-story | /

3.1.3. Descriptions and equivalent classes

Different natural-language vocabularies may be used by various regulatory documents to describe the same concept. For instance, different terms, such as “fire partition wall” and “firewall”, are often used to refer to the same concept ‘firewall’. Therefore, we define three types of descriptions (or metadata in some literature), namely names, aliases, and definitions, for each concept via annotations. Each concept has one name, one definition consisting of one or two short sentences, and zero to two aliases. A total of 287 descriptions are defined for 138 concepts. Since the aliases of the concepts are difficult to enumerate completely, the definitions and semantic similarity matching methods are utilized in the concept semantic alignment between rules and the FPBO. A typical example of a description (that of the class Column) is shown in the orange box in Fig. 2. In addition, equivalent classes are also defined in the FPBO for describing the same concept. The equivalent classes are defined by logic rules, shown in the gray box in Fig. 2, for example: class ‘FireWall’ = class ‘Wall’ + data property ‘isFireWall_Boolean’ + value ‘True’.
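For illustration only, the description metadata and equivalent-class decompositions could be recorded as simple lookup structures of the following kind; the definition texts below are paraphrased assumptions, not the annotations actually stored in the FPBO.

```python
# Hypothetical description records (name, definition, aliases) attached to FPBO concepts.
DESCRIPTIONS = {
    "FireWall": {
        "name": "firewall",
        "definition": "A wall that prevents the spread of fire between adjacent spaces.",
        "aliases": ["fire partition wall"],
    },
    "Column": {
        "name": "column",
        "definition": "A vertical structural member that transmits loads from above to the elements below.",
        "aliases": [],
    },
}

# Equivalent classes recorded as decompositions into basic concepts (Section 3.1.3),
# e.g. FireWall = Wall + isFireWall_Boolean = True.
EQUIVALENT_CLASSES = {
    "FireWall": ("Wall", "isFireWall_Boolean", True),
}
```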

3.1.4. Semantic enrichment rules

Missing, incomplete, implicit, and incorrect information are major obstacles to automated code compliance checking in the construction industry [33]. Thus, semantic enrichment should be performed to infer the implicit information and supply the missing information before beginning the checking process. In this work, since most of the attributes have been supplemented via the IFC conversion process and the constructed ontology, only a few topological relations between the objects need to be inferred. Therefore, a series of semantic enrichment SWRL rules are defined based on the FPBO. The defined SWRL rules aim to derive implicit spatial relationships, e.g., which building spaces and building components are contained in a building. The detailed SWRL rules are illustrated in Appendix A.
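The enrichment rules themselves are SWRL rules given in Appendix A; as an illustration of the same kind of inference, the sketch below propagates a containment relationship with a SPARQL update executed through rdflib. The namespace and the property name hasBuildingSpatialElement are assumptions, and SPARQL Update is used here only as a stand-in for the SWRL rules of the paper.

```python
from rdflib import Graph

# If a building contains a storey and the storey contains a space, record that the
# building also contains the space (an assumed containment-propagation rule).
PROPAGATE_CONTAINMENT = """
PREFIX fpbo: <http://example.org/fpbo#>
INSERT { ?building fpbo:hasBuildingSpatialElement ?space }
WHERE {
  ?building fpbo:hasBuildingSpatialElement ?storey .
  ?storey   fpbo:hasBuildingSpatialElement ?space .
}
"""

def enrich(graph: Graph) -> Graph:
    graph.update(PROPAGATE_CONTAINMENT)  # adds the inferred triples in place
    return graph
```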

3.2. Rule interpretation

This step aims to automatically interpret the domain regulatory texts expressed in natural language into computer-processable codes based on a pipeline of NLP techniques, including preprocessing (see Section 3.2.1), semantic labeling and parsing (see Section 3.2.2), semantic alignment (see Section 3.2.3), conflict resolution (see Section 3.2.4), and code generation (see Section 3.2.5). Fig. 3 shows an example of the proposed automated rule interpretation method.

Fig. 3. Example illustrating the processing of automated rule interpretation.

3.2.1. Preprocessing

Preprocessing splits the regulatory document into sentences to facilitate the subsequent steps. Sentence splitting is a rule-based, deterministic consequence of tokenization [34]: a sentence ends when sentence-ending punctuation (., !, or ?) occurs. Thus, a regulatory document can be split into single sentences by identifying these boundary indicators.
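A minimal sketch of such rule-based splitting is shown below; it assumes both Western and Chinese sentence-ending punctuation since the target regulations are Chinese, and the exact boundary rules used in the paper may differ.

```python
import re

# Split after sentence-ending punctuation (., !, ? and their full-width counterparts).
SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s*")

def split_sentences(text: str) -> list:
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]

# Example: split_sentences("句子一。句子二！") -> ["句子一。", "句子二！"]
```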

3.2.2. Semantic labeling and parsing

Semantic labeling and parsing aim to extract and parse the internal structure of a text represented in natural language and transform it into a syntax tree.
Semantic labeling is the process of assigning semantic labels to words or phrases in a sentence. Zhang et al. and Zhou et al. [20,35] pointed out that accurate and complete annotation and information extraction involve a deep level of semantic analysis that requires a complete understanding of building code requirement sentences. Some requirement sentences are semantically and syntactically rich and complex, which is challenging for conventional methods (e.g., POS tagging, gazetteer lookup). To better extract the features of the sentences, the authors [20] defined seven semantic labels, including obj, sobj, prop, cmp, Rprop, ARprop, and Robj, based on the hierarchical structure of the objects and attributes in the BIM model. The sobj (short for super object), obj (short for object), and prop (short for property) labels are used to locate and identify the elements to be checked; the difference lies in their levels: sobj is the parent node of obj, while prop is a child node of obj. The cmp (short for compare) label denotes the comparative/existential relation between a prop and a requirement. Rprop is the restriction requirement applied to a prop that shall be satisfied. ARprop is the applicability condition applied to a prop; rule checking is performed only if the applicability of an object is satisfied. Robj is the parent or referenced element of an Rprop or ARprop [20].
Then, the authors used the state-of-the-art pretrained DNN model BERT [36] for automated semantic annotation to better capture the deep semantics of the sentences. The meaning of each semantic label and the training process of the DNN model have been described in detail in previous work [20]. In this work, the predefined labels and the trained model are reused.
Parsing is the process of analyzing the structure of a labeled sentence and parsing it into a syntax tree. Domain-specific regulation text is more regular (e.g., less ambiguity and fewer homonym conflicts) than general nontechnical text; thus, it is more suitable for parsing with semantic labels [12]. According to the Chomsky hierarchy, there are two commonly used grammars: CFG (type-2) and regular grammar (type-3) [37]. The authors [20] adopted CFG because it has higher expressiveness and can be constructed from simpler regular expressions. The authors used the ANTLR4 package [38] and the Python language to implement the parsing process. The proposed method achieves state-of-the-art accuracy for parsing both simple and complex sentences. In this work, the same parsing method is utilized. Note that the syntax tree is still not computer-executable; further steps, including semantic alignment, conflict resolution, and code generation, are required to complete the interpretation.
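For orientation, the sketch below shows one plausible labeling and the nested syntax-tree structure for the example rule "the number of stories in a factory should not exceed 2" from Section 2.3; the label assignments are an illustrative assumption, not the output of the trained BERT model or the actual CFG.

```python
# Hypothetical semantic labeling of the example rule (labels from Section 3.2.2).
labeled_spans = [
    ("factory",           "obj"),    # element to be checked
    ("number of stories", "prop"),   # property of the object
    ("should not exceed", "cmp"),    # comparative relation
    ("2",                 "Rprop"),  # restriction requirement to be satisfied
]

# The parser turns the labeled sentence into a language-independent syntax tree,
# sketched here as a nested dictionary: obj -> prop -> (cmp, Rprop).
syntax_tree = {
    "obj": "factory",
    "children": [
        {"prop": "number of stories", "cmp": "should not exceed", "Rprop": "2"},
    ],
}
```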

3.2.3. Semantic alignment

Semantic alignment aims to map the labeled elements in the text to the corresponding concepts in the knowledge base (e.g., ontology, knowledge graph) to facilitate the understanding of the text [21,22]. In this study, semantic alignment maps the semantic elements extracted in each node of the aforementioned syntax tree to concepts in the FPBO, i.e., the classes and data properties defined in Section 3.1. An example of semantic alignment is shown in Fig. 3, where classes and data properties are represented as red and green circles, respectively. A dashed line indicates a conflict in the alignment result, which is handled by the downstream conflict resolution task.
The methods for semantic alignment include rule-based methods, supervised learning methods, and unsupervised learning methods. There are few open datasets that contain domain-specific entities in the construction domain; therefore, large efforts would be required to manually annotate a sufficiently large amount of training data for supervised learning methods. Analogously, rule-based methods require large efforts to construct a comprehensive name dictionary containing aliases, abbreviations, alternate spellings, etc. [21]. Given the huge effort needed to build such datasets and dictionaries for the construction domain, this study adopts an unsupervised learning-based method, which requires less effort than the other two types of methods.
The unsupervised learning-based approach adopted in this work assumes that semantically similar words usually have similar vector representations. The workflow of the semantic similarity measurement between a named entity and a concept is shown in Fig. 4. For each named entity in the syntax tree, the semantic similarity between the named entity and all the ontology concepts is measured, and the ontology concept with the highest similarity score is assigned to the named entity as its normalized concept. As each concept has several descriptions, the similarity between the named entity and each description of the concept is calculated first, and the average similarity value over all descriptions is taken as the semantic similarity between the concept and the target named entity. To measure the semantic similarity between one description and the target named entity, the multiword named entity and the description are preprocessed, including tokenizing, word splitting, and stop word removal, before the two composing wordlists are generated. Each word in a wordlist is represented as a real-valued vector using a pretrained unsupervised learning model described below. For the multiword named entity and description, the vector representations are computed as the weighted average of the vectors of their composing words. Two widely used weighting methods, the average weighting method and the TF-IDF (Term Frequency-Inverse Document Frequency) weighting method [39], are used. Finally, the cosine similarity score is taken as the semantic similarity between the entity and the description.

Fig. 4. Flow chart of semantic similarity measurement between a named entity and a concept.

In this work, the adopted unsupervised learning model is the word embedding model [40]. The open-source toolkit gensim [41] is used to train a word embedding model with a dimension of 400, following previous research [42,43]. The training corpus is composed of the Chinese corpus from Wikimedia [44] and 1560 Chinese code corpora from the website of the Chinese codes [45]. On the one hand, the domain-related code corpus alone is too small to cover some frequently used words; on the other hand, the Chinese corpus from Wikimedia contains few domain-related words. Models trained on both corpora perform better than those trained on only one of them [46]. Therefore, the model has been trained on the two corpora above.
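The similarity computation of Fig. 4 can be sketched as follows, assuming a pretrained gensim KeyedVectors model and pre-tokenized inputs (for Chinese text, a segmenter such as jieba would be needed); the function and variable names are illustrative, and only the unweighted average variant is shown.

```python
import numpy as np
from gensim.models import KeyedVectors

def phrase_vector(tokens, wv: KeyedVectors):
    """Average the word vectors of the tokens; tokens missing from the vocabulary are skipped."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else None

def cosine(a, b) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def align_entity(entity_tokens, concept_descriptions, wv: KeyedVectors):
    """Return the concept whose descriptions are, on average, most similar to the named entity.
    `concept_descriptions` maps a concept name to its tokenized descriptions (name, definition, aliases)."""
    entity_vec = phrase_vector(entity_tokens, wv)
    best_concept, best_score = None, -1.0
    for concept, descriptions in concept_descriptions.items():
        scores = []
        for desc_tokens in descriptions:
            desc_vec = phrase_vector(desc_tokens, wv)
            if entity_vec is not None and desc_vec is not None:
                scores.append(cosine(entity_vec, desc_vec))
        score = sum(scores) / len(scores) if scores else 0.0  # average over the concept's descriptions
        if score > best_score:
            best_concept, best_score = concept, score
    return best_concept, best_score
```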
In addition, we choose two dictionary-based keyword matching methods as baselines: the keyword matching method (KW) and the weighted keyword matching method (KW-Weighted). For KW, if a named entity completely matches one of the descriptions of an ontology concept, that concept is assigned as the normalized concept. For KW-Weighted, among the descriptions of an ontology concept, the shorter the description, the more information it carries and the higher its correlation with the matched term. Based on this idea, Formula (1) measures the similarity between a named entity and a concept as follows:

$$\text{similarity} = \frac{1}{N}\sum_{i=1}^{N}\frac{\text{similarity}_i}{\log\big(\operatorname{len}(\text{description}_i)\big)} \tag{1}$$

where similarity_i indicates the similarity between the named entity and the i-th description of the concept: if the named entity is identical to the i-th description, similarity_i equals 1; if the named entity is a substring of the i-th description, similarity_i equals 0.6; otherwise it equals 0. The denominator is the logarithm of the string length of description i, and N is the number of descriptions of the concept. The similarity between the named entity and the concept is thus the average of these per-description values.
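A direct sketch of Formula (1) is given below; the natural logarithm and the guard against very short descriptions are assumptions not specified in the text.

```python
import math

def kw_weighted_similarity(entity: str, descriptions: list) -> float:
    """KW-Weighted baseline (Formula (1)): average over descriptions of match score / log(length)."""
    scores = []
    for desc in descriptions:
        if entity == desc:
            score = 1.0
        elif entity in desc:          # the named entity is a substring of the description
            score = 0.6
        else:
            score = 0.0
        # Guard: log(len) must be positive, so descriptions of length <= 1 are clamped (assumption).
        scores.append(score / math.log(max(len(desc), 2)))
    return sum(scores) / len(scores) if scores else 0.0
```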

3.2.4. Conflict resolution

Conflict resolution refers to modifying the semantic alignment results according to the domain knowledge of civil engineering. It is necessary to ensure that the regulations and the models use the same concepts and relations to define and describe the same objects. As a result, two types of conflict resolution methods, the domain-range conflict resolution and the equivalent class conflict resolution, are proposed based on domain knowledge.
For the domain-range conflict resolution method, the domains and ranges of the data properties and object properties are defined in Section 3.1. For example, the range of a concept whose type is a data property cannot be an object; otherwise, the erroneous situation in which an attribute value has an object would occur, which violates the definition of the BIM model. Another situation is that the ranges of some data properties are restricted to certain values. For example, the range of the data property ‘hasFireHazardCategory’ only includes the following values: Class A, B, C, D, and E, and these values uniquely correspond to this data property. This means that if the data property is missing in the text, it can be identified according to its range. The results of semantic alignment can thus be checked against the ranges of the data properties and object properties and further modified via the domain-range conflict resolution dictionary (Table 1), which was extracted from the FPBO. If a wrong semantic alignment result is identified according to the range of a concept, the correct concepts can be determined by keyword matching with the wrong concept as the key, and the syntax tree structure is then modified. A specific case is shown in Fig. 5(a): the root entity ‘fire compartment’ is aligned to ‘isFireProtectionSubdivision_Boolean’, which is a data property, but the range of a data property cannot contain the object ‘Column’; therefore, the original semantic alignment result is wrong, and the syntax tree is modified according to Table 1. If the text that relates to a data property is missing in the original text, the property can be identified and the missing value supplemented according to its range. A specific case is shown in Fig. 5(b): the missing data property ‘hasFireHazardCategory’ is supplemented via the range value ‘Class A'.

Fig. 5. Example depicting the processing of conflict resolution (in the square brackets, plain words are in black font, classes in the FPBO are in red font, and data properties in the FPBO are in green font; the red dashed circles show the conflicts in the original sentences; the conflict resolution results are bolded). (For interpretation of the references to colour in this figure legend, the reader is referred to the online version of this article.)

For the equivalent class conflict resolution method, equivalent classes are also defined in FPBO. Since the IFC model contains the most basic concepts, we propose an equivalent concept conflict resolution method based on the decomposition of concepts using equivalent classes. Based on domain knowledge, some complex concepts in the regulations can be defined as the combination of basic concepts. For example: class: ‘FireWall’ = class: ‘Wall’ + data property: ‘isFireWall_Boolean’ + value: ‘True’. To identify this kind of conflict or mismatch, an equivalent concept conflict resolution dictionary is established according to the FPBO equivalent classes, as shown in Table 2. If the complex concepts are identified according to the key in Table 2, then the information of a series of simple concepts can be determined by keyword matching with the key; afterward, the syntax tree structure can also be modified. The specific case is shown in Fig. 5(c). The complex class ‘FireWall’ is decomposed into the simple class ‘Wall’ and the data property ‘isFireWall_Boolean’.

Table 2. Equivalent concept conflict resolution dictionary.

Key (original concept of the node) | Value ([new concept of the node, new word of the node, concept of its sub-node, word of its sub-node, value of its sub-node])
Firewall | [Wall, wall, isFireWall_Boolean, is fire wall, True]
Plant | [BuildingRegion, Building, hasBuildingType, has the building type of, Plant]
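Both resolution dictionaries can be applied to the syntax tree with a simple rewriting pass of the following kind; the Node structure and the merged dictionary below are illustrative assumptions about how Tables 1 and 2 could be implemented, and for brevity the range-violation check that triggers the Table 1 entries is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    concept: str                 # aligned FPBO concept (class or data property)
    word: str                    # surface text from the regulation
    value: str = ""              # literal value, if any
    children: list = field(default_factory=list)

# Entries mirror Tables 1 and 2: key -> (new concept, new word, sub-node concept, sub-node word, sub-node value)
RESOLUTION_DICT = {
    "isFireProtectionSubdivision_Boolean":
        ("BuildingSpace", "Building Space", "isFireProtectionSubdivision_Boolean", "is fire compartment", "True"),
    "FireWall":
        ("Wall", "wall", "isFireWall_Boolean", "is fire wall", "True"),
}

def resolve(node: Node) -> Node:
    """Rewrite nodes whose concept triggers a conflict: replace them with the basic class
    and attach a data-property child carrying the fixed value."""
    for child in list(node.children):        # recurse into the original children first
        resolve(child)
    if node.concept in RESOLUTION_DICT:
        new_concept, new_word, sub_concept, sub_word, sub_value = RESOLUTION_DICT[node.concept]
        node.concept, node.word = new_concept, new_word
        node.children.append(Node(sub_concept, sub_word, sub_value))
    return node
```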

3.2.5. Code generation

After semantic alignment and conflict resolution, the code generation step transforms the syntax tree into a computer-processable format. To support a wider range of rules, including those covering implicit information, a new code generation method enhanced by text classification is proposed: text classification is introduced to identify the proper SPARQL functions that can be used to infer implicit information for complex regulatory texts. Considering its ability to reason over implicit information, the SPARQL language [47,48] is selected.
Text classification aims to classify sentences into different categories based on semantics and is an important step in text analysis [49,50]. Salama [50] classified regulation documents into 14 predefined categories, including security, safety, health, etc., depending on the objective of the document. Solihin et al. [49] classified regulatory text into four categories according to the complexity and suitability of the tools or techniques required, as shown in Table 3. Following these two classification works, this study proposes new classification categories that consider the computability of the text. The categories and their definitions are shown in Table 3. Different categories of rules are subsequently processed using different code generation techniques. For example, the rules of Class 2 may require additional reasoning functions, such as the aggregate algebra provided by SPARQL.

Table 3. Rule categories.

[49]: Class 1: Rules that require a single or small number of explicit data
Ours: Class 1: Direct attribute related requirement (description: rules that require only explicit data in the BIM model)

[49]: Class 2: Rules that require simple derived attribute values
Ours: Class 2: Indirect attribute related requirement (description: rules that require simple values derived from explicit data in the BIM model); subdivided into Class 2.1: Quantity attribute related requirement and Class 2.2: Other complex indirect attribute related requirement

[49]: Class 3: Rules that require an extended data structure
Ours: Class 3: Spatial geometry related requirement (description: rules that require an extended data structure and complex derivation, such as geometric computing)

[49]: Class 4: Rules that require a "proof of solution"
Ours: Class 4: Others (description: rules that are hard to interpret automatically)
To date, several methods have been proposed for automated text classification. For example, Caldas et al. [51] developed an automated text classification system based on traditional machine learning methods, such as the Rocchio algorithm, naïve Bayes, k-nearest neighbors, and SVMs. Salama [50] classified documents based on machine learning methods, such as naïve Bayes, maximum entropy, and SVMs. This study adopts a classification method based on keyword matching and semantic ranking, because text classification techniques are not the focus of this paper and machine learning methods depend on large datasets that require very expensive manual labeling. In this work, we first define a table of keywords for classification; some of the keywords are shown in Table 4. A regulatory text is classified into a specific category if it contains a keyword of that category. A text may contain more than one keyword, resulting in overlapping classifications. Therefore, a classification priority for semantic ranking is defined as follows: Class 3 > Class 2.2 > Class 2.1 > Class 1 > Class 4 (a minimal sketch of this classifier is given after Table 4). The results of the automated text classification are shown in Section 4.

Table 4. Keywords for rule categories.

Class | Keywords (in Chinese)
Class 1: Direct attribute related requirement | length, width, height, thickness, depth, span, precision, ...
Class 2.1: Quantity attribute related requirement | number, times, ...
Class 2.2: Other complex indirect attribute related requirement | area, volume, ...
Class 3: Spatial geometry related requirement | distance, ...
Class 4: Others | (none)
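A minimal sketch of the keyword matching and priority ranking described above is given below; the keyword lists are abbreviated English glosses of the Chinese keywords in Table 4, and the function name is illustrative rather than the authors' implementation.

    # Priority for resolving overlapping matches: Class 3 > Class 2.2 > Class 2.1 > Class 1 > Class 4.
    KEYWORDS = {
        "Class 3": ["distance"],
        "Class 2.2": ["area", "volume"],
        "Class 2.1": ["number", "times"],
        "Class 1": ["length", "width", "height", "thickness", "depth", "span", "precision"],
    }
    PRIORITY = ["Class 3", "Class 2.2", "Class 2.1", "Class 1"]

    def classify(sentence):
        """Return the highest-priority class whose keyword occurs in the sentence, else Class 4."""
        text = sentence.lower()
        for cls in PRIORITY:
            if any(keyword in text for keyword in KEYWORDS[cls]):
                return cls
        return "Class 4"  # rules that are hard to interpret automatically

    # 'number' matches Class 2.1, so the rule is routed to the COUNT-based code generation path.
    print(classify("The number of safety exits of each fire compartment should not be less than 2"))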
After the proper SPARQL functions for the rules are identified by the text classification method, the syntax trees are transformed into SPARQL by pattern-matching rules. According to the SPARQL syntax [47,48], the following three aspects should be noted during the transformation, as shown in Fig. 6.
  • 1.
    Prefix. The PREFIX keyword is the SPARQL instruction for declaring a namespace prefix, which defines the source of the adopted functions. The prefix declarations should be written at the beginning of a SPARQL file. They allow us to write prefixed names in queries instead of having to use full URIs everywhere; therefore, they are a syntactic convenience for shorter, easier-to-read (and write) queries.
  • 2.
    Aggregate algebra. Aggregate algebra is a set of predefined functions provided by SPARQL that supports counting and other aggregate operations. According to the text classification, except for the rules of Class 1 that describe requirements on explicitly specified attributes, the data required for checking the other categories of rules cannot be obtained directly from the Turtle file, and additional derivations are needed. In this work, for the rules of Class 2, which involve implicitly specified attributes, the aggregate algebra provided by SPARQL can be used to derive implicit data from the explicit data in the BIM model. For example, the explicit data in the BIM model state that fire compartment A has safety exits A, B, and C. The indirect attribute related requirement "the number of safety exits of each fire compartment is not less than 2" can then be checked using the COUNT function provided by SPARQL: the expression "(COUNT(DISTINCT ?safety_exit) AS ?safety_exit_num)" derives the number of safety exits, which is implicit data, and stores it in ?safety_exit_num, realizing automated rule checking (a runnable query sketch is given after this list).
  • 3.
    Object property mapping. Object property mapping establishes the relationship between the two queried objects. For example, space A contains element B, and 'contains' is expressed through the object property 'hasBuildingElement'. If the object property is missing, the relationship between space A and element B cannot be reasoned, and the query fails. The object property between two classes can be uniquely determined by a HashMap generated from the FPBO, keyed by the names of the domain class and the range class of the object property.
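To illustrate the aggregate algebra and object property mapping above, the following sketch runs a query of the kind the code generator would emit for the Class 2 example on a toy Turtle graph using rdflib; the fp: namespace IRI, the instance names, and the exact property names are assumptions following the FPBO naming, not the authors' generated code.

    from rdflib import Graph

    # Toy Turtle data standing in for the converted BIM model.
    data = """
    @prefix fp: <http://example.org/fpbo#> .
    fp:FireCompartmentA a fp:BuildingSpace ;
        fp:isFireProtectionSubdivision_Boolean "True" ;
        fp:hasBuildingElement fp:SafetyExitA , fp:SafetyExitB , fp:SafetyExitC .
    fp:SafetyExitA a fp:SafetyExit .
    fp:SafetyExitB a fp:SafetyExit .
    fp:SafetyExitC a fp:SafetyExit .
    """

    # PREFIX declaration, COUNT aggregate, and the object property linking the two classes.
    query = """
    PREFIX fp: <http://example.org/fpbo#>
    SELECT ?compartment (COUNT(DISTINCT ?safety_exit) AS ?safety_exit_num)
    WHERE {
        ?compartment fp:isFireProtectionSubdivision_Boolean "True" ;
                     fp:hasBuildingElement ?safety_exit .
        ?safety_exit a fp:SafetyExit .
    }
    GROUP BY ?compartment
    """

    g = Graph().parse(data=data, format="turtle")
    for row in g.query(query):
        status = "pass" if int(row.safety_exit_num) >= 2 else "fail"
        print(row.compartment, int(row.safety_exit_num), status)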

Fig. 6. SPARQL syntax to be noted during the transformation process.

4. Experiments and results

This section describes the experiment conducted to validate the proposed method. First, several compulsory regulation texts are selected from GB 50016–2014 [30]. The gold standards of concept alignment and text classification are manually developed. Second, various automated semantic alignment methods are compared and evaluated based on the gold standard. Third, the results of automated text classification are compared with the gold standard. Finally, the time consumed by the proposed automated rule interpretation method is compared with the time consumed by the experts' manual rule interpretation method.

4.1. Dataset development

The regulatory texts from Section 3 of the Chinese Code for fire protection design of buildings (GB 50016–2014) [30] are selected as the data source. First, the rules represented in figures are transformed into natural language. Then, the regulatory text on building codes is split into single sentences by semicolons, periods, and newlines. Next, these sentences are classified by experts according to the rule categories in Table 3 to form the gold standard for the automated text classification method. Then, the semantic alignments of the selected sentences are manually developed according to the FPBO to form the gold standard for the automated semantic alignment method. Finally, 27 types of semantic alignment tags are established, and the dataset contains 99 sentences and 468 semantic alignment labels. A repository containing the dataset is established on GitHub and can be found at https://github.com/SkydustZ/auto-rule-transform.

4.2. Semantic alignment result

To evaluate the performance of the automated semantic alignment methods, the model predictions are compared with the gold standard. Accuracy and running time are employed as evaluation indices, and the results are shown in Table 5. Table 5 shows that the baseline algorithm KW achieves the lowest accuracy of 55.8%, and the KW-Weighted algorithm achieves an accuracy of 65.2%, which is only slightly higher than KW. The accuracy of the methods based on semantic similarity (i.e., W2V-avg, W2V-tfidf, and WMD) is higher, and the proposed semantic similarity and conflict resolution algorithm achieves the best accuracy of 90.1% (a minimal sketch of the W2V-avg similarity computation is given after Table 5). The KW-based algorithms only consider the morphological similarity between the named entities and the FPBO concepts; they do not consider the semantic information contained in the descriptions or metadata of a concept. In addition, different regulatory documents may use different natural-language vocabularies to describe the same concept. For instance, different terms such as "fire partition wall" and "firewall" are often used to refer to the same concept of a 'firewall'. Once the synonyms and aliases of a concept are not defined in the ontology, the KW-based methods cannot align the concept well. In contrast, semantic similarity-based methods can measure the semantic similarity of two concepts even if they differ greatly in morphology. The analysis of the mis-classifications made by the semantic similarity-based methods suggests two main reasons. First, a named entity composed of more than one concept is difficult to distinguish. For example, the named entity 'fire protection zone' is mistakenly aligned to the data property 'isFireProtectionSubdivision_Boolean' in the FPBO, whereas it should be aligned to the class 'BuildingSpace' with the data property 'isFireProtectionSubdivision_Boolean' and the value 'True'. Second, it is hard to recognize abbreviations or implicit concepts in the regulatory text. For example, "two-story house" is an abbreviation of "the number of floors in the house is 2"; the "two" in "two-story" should be the value of the data property 'hasNumberOfFloors', but the semantic similarity-based methods cannot align the named entity "two-story" with the data property 'hasNumberOfFloors' with a value of 2. Another example is that "Plant in Class A" is an abbreviation of "the plant with the fire hazard category of Class A"; "Class A" is the value of the data property 'hasFireHazardCategory', but the semantic similarity-based methods cannot align "Class A" with the data property 'hasFireHazardCategory' with the value 'Class A'. Both types of mis-classification can be identified and corrected by the conflict resolution methods proposed in Section 3.2.4. The proposed semantic similarity and conflict resolution method considers not only the semantics of a single word but also the structure of the whole sentence and the domain knowledge. Therefore, mis-classifications are found and corrected according to the established conflict resolution dictionaries, achieving an accuracy of 90.1%. There are two reasons why the accuracy does not reach 100%. 1) Syntax tree errors. When a sentence is too complicated, its syntax tree fails to be constructed, so the sentence structure information cannot be considered. In the validation dataset, the syntax trees of 3 complicated sentences fail to be constructed, while the other 96 sentences are constructed correctly; 30% of the alignment errors are caused by this type. 2) Conflict resolution errors. When the semantic alignment results are far from the ground truth, they cannot be corrected by the proposed conflict resolution method because the defined correction dictionary does not cover such situations. For example, when 'bearing wall' is wrongly predicted as 'RoofSheathing' during semantic alignment, the conflict resolution dictionary cannot identify the error; 70% of the alignment errors are caused by this type.

Table 5. Performance of different semantic alignment methods.

Method | Accuracy | Time
KW | 55.8% | 6.83 s
KW-Weighted | 65.2% | 5.10 s
W2V-avg | 77.6% | 25.8 s
W2V-tfidf | 77.0% | 33.9 s
WMD | 73.0% | 44.9 s
W2V-avg + conflict resolution | 90.1% | 75.0 s
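As an illustration of how the W2V-avg score in Table 5 is obtained, the following minimal sketch represents each phrase by the mean of its word vectors and aligns a regulatory term to the most similar candidate concept; the three-dimensional toy embeddings are made up for readability, whereas the actual method uses pretrained word2vec vectors.

    import numpy as np

    # Toy 3-d embeddings standing in for pretrained word2vec vectors.
    EMB = {
        "fire":      np.array([0.9, 0.1, 0.0]),
        "partition": np.array([0.2, 0.8, 0.1]),
        "wall":      np.array([0.1, 0.9, 0.2]),
        "firewall":  np.array([0.6, 0.5, 0.1]),
    }

    def w2v_avg(phrase):
        """W2V-avg: represent a phrase by the mean of the vectors of its (known) words."""
        vectors = [EMB[w] for w in phrase.lower().split() if w in EMB]
        return np.mean(vectors, axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Align the regulatory term to the candidate concept with the highest cosine similarity.
    term = "fire partition wall"
    candidates = ["firewall", "wall"]
    scores = {c: cosine(w2v_avg(term), w2v_avg(c)) for c in candidates}
    print(max(scores, key=scores.get), scores)   # 'firewall' wins despite the different wording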
Because the structures of syntax trees are considered in the proposed semantic similarity and conflict resolution method, the time consumed by syntax tree construction is added to the time of the proposed method. Although the computational complexity of the proposed method is higher than that of other algorithms, the running time on the validation datasets including 27 concepts and 99 sentences is still acceptable. The average time consumed per sentence does not exceed 1 s.
Table 6 shows the interpretation performance at the sentence level. Our method achieves a 60.6% accuracy in semantically aligning the elements of whole sentences, which is higher than the results of the other methods. Sentence-level accuracy is more suitable for indicating the performance of a semantic alignment method because this stricter criterion also considers the applicability of a method, such as supporting downstream tasks like code generation and rule execution [20]. In the entire ARC process, each rule interpretation result is executed for rule checking; therefore, the overall correctness is related to the sentence-level accuracy of rule interpretation. The high sentence-level alignment accuracy thus also demonstrates the strong interpretation performance of our method.

Table 6. Performance of different alignment methods at the sentence level.

Method | Sentence number | Correctly interpreted sentences | Accuracy
KW-Weighted | 99 | 14 | 14.1%
W2V-avg | 99 | 20 | 20.2%
W2V-avg + conflict resolution | 99 | 60 | 60.6%

4.3. Text classification result

To measure the result, the model predictions are compared with the gold standard, and the precision (P), recall (R), and F1-score (F1) are calculated for each semantic label as follows:

(2) P = N_correct / N_labeled
(3) R = N_correct / N_true
(4) F1 = 2PR / (P + R)

where N_correct, N_labeled, and N_true denote the number of elements the model correctly labeled, the number of elements the model labeled, and the number of true elements for a label, respectively. Finally, the weighted (i.e., micro) average F1-score is calculated to represent the overall performance, where n_i denotes the number of elements of the i-th semantic label:

(5) Weighted F1 = Σ_i (n_i · F1_i) / Σ_i n_i
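A short worked example of Eqs. (2)-(5) is given below; the per-label counts are illustrative numbers only, not the counts behind Table 7.

    # Worked example of Eqs. (2)-(5) with illustrative counts for two labels.
    def precision_recall_f1(n_correct, n_labeled, n_true):
        p = n_correct / n_labeled            # Eq. (2)
        r = n_correct / n_true               # Eq. (3)
        return p, r, 2 * p * r / (p + r)     # Eq. (4)

    labels = {                               # label: (n_correct, n_labeled, n_true)
        "Class 1": (40, 43, 40),
        "Class 2.2": (27, 28, 36),
    }
    f1 = {k: precision_recall_f1(*v)[2] for k, v in labels.items()}
    n_i = {k: v[2] for k, v in labels.items()}        # number of true elements per label
    weighted_f1 = sum(n_i[k] * f1[k] for k in labels) / sum(n_i.values())   # Eq. (5)
    print(f1, round(weighted_f1, 3))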
Table 7 shows that the adopted classification method based on keyword matching and semantic ranking obtains promising results, with an 83.4% overall F1-score on the validation dataset. As an early-stage proof of concept, the adopted sentence classification method shows promising results because it can classify long and complex sentences into the different categories with relatively high accuracy, which reduces the human effort and supports the information transformation stage. However, the F1-scores of some categories, such as Class 2.1 and Class 4, are low, which is due to the imperfect classification priority for semantic ranking or the incomplete keywords for the rule categories. In the future, we will develop a larger corpus and train deep learning models that can better understand the meaning of the text to improve the effect of automated text classification.

Table 7. Performance of text classification method.

Tag | Number | Precision | Recall | F1-score
Class 1 | 43 | 93.5% | 100% | 96.6%
Class 2.1 | 2 | 12.5% | 100% | 22.2%
Class 2.2 | 36 | 96.4% | 75% | 84.4%
Class 3 | 10 | 100% | 100% | 100%
Class 4 | 10 | 100% | 10.0% | 18.2%
Total | 101 | 94.2% | 82.2% | 83.4%

4.4. Performance of automated rule interpretation

Table 8 shows the rule categories that can be automatically interpreted by our methods and by previous methods. The SPARQL language provides predefined counting functions, which are more suitable and concise than the widely used first-order logic descriptions for rules with implicit attributes that need additional reasoning. Therefore, our method can interpret the rules of Classes 2.1 and 2.2, which was difficult in previous studies.

Table 8. Rule categories that can be interpreted.

Rule category | Previous studies | Our methods
Class 1 | √ | √
Class 2.1 & 2.2 | × | √
To demonstrate the effectiveness of the proposed method for automated rule interpretation, the time consumed by the proposed method is compared with the time consumed by the experts' manual rule interpretation. The two experts who participated in the experiment both majored in civil engineering and computer science and are skilled in writing SPARQL. It should be noted that previous works used the keyword mapping method or handcrafted mapping tables, resulting in low sentence-level alignment accuracy and difficulty in generating SPARQL code, as illustrated in Table 6. Therefore, the performance of our method is only compared with the manual results.
In this experiment, 4 direct attribute-related requirements (Class 1) with explicitly specified data and 8 indirect attribute-related requirements (Classes 2.1 & 2.2) with implicitly specified data are selected. The time consumed to interpret the rules from natural language into SPARQL code that can be executed correctly by the reasoning engine is recorded. It should be noted that the automated interpretation method cannot guarantee that the interpretation result is 100% correct, so manual correction is required; the time consumed by the manual correction is therefore also recorded as part of the automated interpretation time. The results in Table 9 show that, even when the manual correction time is included, the time consumed by automated rule interpretation is 20.7% of that of the manual interpretation for the direct attribute-related requirements and 17.2% for the indirect attribute-related requirements. This means that the proposed rule interpretation method greatly reduces the time and cost.

Table 9. Time consumed by different rule interpretation methods.

Method | Rule class | Parsing (s) | Semantic labeling & postprocessing (s) | Manual revised (s) | Total time (s)
Expert | Class 1 | 99 | 1288 | / | 1387
Expert | Class 2.1 & 2.2 | 146 | 2344 | / | 2490
Algorithm | Class 1 | 0.43 | 65.39 | 190.75 | 286.57
Algorithm | Class 2.1 & 2.2 | 0.56 | 128.64 | 298.69 | 427.89
Note: Manual correction refers to the time required to modify the automatically generated SPARQL code so that it can be executed correctly. Not all generated SPARQL codes need to be corrected manually.

5. Application

To demonstrate a real application of our method, this section implements rule checking of an actual plant building. Fig. 2 shows the workflow of our automated rule checking system. After the model is established, the IFC file exported from the Revit BIM model is converted into the Turtle format according to the model preparation method proposed in Appendix A. The IFC objects related to fire protection design compliance checking are converted into 507 instances of the FPBO. Then, semantic enrichment is conducted using the SWRL rules defined in Appendix A and the Pellet reasoner. The original file has 3817 axioms, while the inferred file has 10,824 axioms, which means that some implicit information is inferred. The logical consistency and the data consistency of the axioms are verified by reasoners and by experts, respectively. For example, the implicit information of the instance 'IfcBuilding0' has been reasoned out, as shown in the content with a yellow background in Fig. 7.

Fig. 7. Application of the semantic enrichment.

After that, three example regulation rules are selected, including one direct attribute-related requirement (rule 1) and two indirect attribute-related requirements (rules 2 and 3). Using the automated rule interpretation method in Section 3.2, these rules are interpreted into SPARQL code that can be executed by the reasoning engine, as shown in Fig. 8. Finally, the code is executed to check the model using GraphDB. Fig. 8 also shows a rule checking application in a BIM model using the three example rules. According to rule 1, columns with a fire resistance limit of 2.0 h, which is less than the required 3 h, are selected; a total of 13 column instances are screened out of 302 column instances. According to rule 2, the fire protection zone with 2 safety exits, which meets the requirement, is selected; because the plant has 2 fire protection zones, the other one fails to pass this check. For rule 3, the building is a two-story plant with a fire hazard category of Class C and a fire resistance grade of Class 3, which meets the requirement of the code. The model checker provides the global IDs of the selected elements, and the users can locate the elements according to the global IDs, as shown in Fig. 8. Thus, this rule checking identifies 13 column instances and 1 fire protection zone instance that fail to meet the requirements. The results are the same as the manual checking results. The automated rule checking successfully detects the issues and can greatly improve the efficiency of the design.
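For illustration, the following sketch shows a query of the kind generated for rule 1 (columns whose fire resistance limit is below the required 3 h), executed here with rdflib on two toy instances; in the presented application the generated queries are executed in GraphDB, and the fp: namespace IRI, instance names, and global IDs below are made up.

    from rdflib import Graph

    data = """
    @prefix fp: <http://example.org/fpbo#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    fp:Column_01 a fp:Column ; fp:hasGlobalId "col-01" ;
                 fp:hasFireResistanceLimits_hour "2.0"^^xsd:decimal .
    fp:Column_02 a fp:Column ; fp:hasGlobalId "col-02" ;
                 fp:hasFireResistanceLimits_hour "3.0"^^xsd:decimal .
    """

    # Rule 1 (direct attribute related requirement): report columns whose
    # fire resistance limit is below the required 3 h.
    query = """
    PREFIX fp: <http://example.org/fpbo#>
    SELECT ?column ?gid ?limit
    WHERE {
        ?column a fp:Column ;
                fp:hasGlobalId ?gid ;
                fp:hasFireResistanceLimits_hour ?limit .
        FILTER (?limit < 3.0)
    }
    """

    g = Graph().parse(data=data, format="turtle")
    for row in g.query(query):
        print("non-compliant column:", row.gid, "fire resistance limit:", row.limit)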

Fig. 8. Application example: use of the proposed ARC system for supporting fire protection design compliance checking.

6. Discussion

This work proposes a knowledge-informed automated rule checking framework consisting of ontology-based knowledge modeling, model preparation with semantic enrichment, enhanced rule interpretation, and checking execution. The ontology-based knowledge modeling part integrates the domain-specific concepts, relations, descriptions related to regulatory documents, and domain-specific rules for model enrichment and rule interpretation. The ARC processes are enhanced with the support of domain knowledge. A series of open-source toolkits are utilized to establish the framework. Finally, a demonstration of the rule checking application is conducted in an actual plant building as a proof of concept.
Together with the knowledge-informed automated rule checking framework, methods for semantic alignment and rule transformation are also introduced to create an end-to-end pipeline that interprets regulation texts into a computer-processable format. Compared to the state-of-the-art methods, the scientific contributions of this study are the following:
  • 1.
    This work proposes an unsupervised-learning-based semantic alignment method to automatically align the concepts and relations between regulations and BIMs. In addition, knowledge-informed conflict resolution rules are utilized to improve the accuracy of rule interpretation by detecting and resolving potential conflicts. Compared with the widely used keyword mapping method or handcrafted mapping tables, the proposed method requires less human effort in constructing ontologies and achieves the highest accuracy on the validation datasets, improving flexibility and generality.
  • 2.
    The proposed code generation method supports a wider range of rules, including some complex rules with implicit information to be inferred. The progress is achieved for two reasons. On one hand, text classification considering the computability of the text is integrated into the proposed method. Different categories of rules are subsequently processed using different SPARQL functions, improving the accuracy and the applicability. On the other hand, the proposed code generation method utilizes SPARQL rather than widely used plain description clauses, which is suitable for a wider range of rule interpretation and rule checking.
The proposed method is applicable for creating computable rules from various textual regulatory documents. For example, in proactive design, a designer can use the proposed method to create computable rules from specific or customized textual regulatory documents to check the models [52]. The proposed automated rule interpretation method can cover most cases, but some requirements in the building codes are so complicated that human effort cannot be avoided. Therefore, we provide a semiautomatic human-computer interaction function: users can modify the structure of the syntax trees or the results of the semantic alignment to ensure 100% accuracy of the interpretation result. Note that the tools used in the proposed framework are all open source and freely available. The prototype tools of the proposed framework and the datasets developed in this work are also released as open source, which will further facilitate the development of ARC.
Four main limitations of this research are identified; the authors will study them in future work.
First, the proposed method is only tested in compulsory regulation texts from GB 50016–2014. The FPBO cannot be directly applied to other fields, such as earthquake resistance, energy, and sustainability. However, the rule sets developed in this study are reusable and expandable. Future research can be done to enrich the ontology and rule sets to accommodate the regulatory requirements of different fields.
Second, the performance of the end-to-end automated rule interpretation method needs to be further improved. Future researchers can develop a larger dataset and then replace the adopted methods with state-of-the-art deep learning models to improve the performance. For example, the adopted dictionary-based text classification method and the unsupervised-learning-based semantic alignment method can be replaced by RNN-, CNN-, or attention-based methods, which can achieve a more comprehensive understanding of the texts.
Third, the scope of rule categories covered by automated rule interpretation needs to be further expanded. Although the direct attribute-related requirements and part of the indirect attribute-related requirements can be automatically interpreted by the proposed end-to-end pipeline, the existential requirements and the spatial geometry-related requirements are not yet supported. Besides, some complex rules that require additional information, such as references to rules in other sections, figures, and tables, are not yet considered. In the future, the following extensions can be pursued to further improve this research: SPARQL functions for querying spatial geometry-related building data in RDF format should be extended; approaches for transforming spatial geometry-related building data into RDF format should be further researched; and automated rule interpretation that considers rules from other sections, figures, and tables should be further investigated.
Fourth, in this work, for research purposes, the attributes of the objects in an IFC file are all stored in the proper places rather than in the 'name' or 'description' fields of the objects. In real engineering applications, however, some of the information that needs to be queried may be stored elsewhere, which increases the difficulty of parsing IFC files. Future work should pay more attention to accommodating such situations in practice.

7. Conclusion

In this research, we propose a knowledge-informed ARC framework consisting of four parts: ontology-based knowledge modeling, model preparation with semantic enrichment, enhanced rule interpretation, and checking execution. Based on the framework, an ontology is first established to represent domain knowledge, including concepts, synonyms, relationships, constraints, etc. To eliminate hard-coded patterns and improve the generalization of concept alignment, an unsupervised-learning-based semantic alignment method and knowledge-informed conflict resolution are proposed and adopted. To identify the proper SPARQL functions for complex rules, a domain-specific text classification method, which can identify the proper code generation method for each text, is proposed and used. Then, the regulatory texts are automatically interpreted into SPARQL, which is suitable for a wider application range for checking rules where implicit information needs to be inferred. The proposed algorithms are implemented in a prototype system. To demonstrate the system, a plant building is automatically checked using rules selected from GB 50016–2014. The experimental results indicate a number of contributions. First, the proposed semantic alignment and conflict resolution methods achieve an accuracy of 90.1% and exceed the commonly used dictionary-based matching method by 34.3 percentage points. This means that the method can not only save the manual effort of building mapping tables or ontologies but also improve the accuracy. Second, the proposed rule interpretation method can support a wider range of rules. Third, the proposed rule interpretation method is more than 5 times faster than manual interpretation by experts.
In this study, the tools utilized are all open source and free to access, and the prototype of the automated rule interpretation method and the developed datasets are also released as open source. These contributions can considerably promote the research and application of ARC.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The authors are grateful for the financial support received from the National Natural Science Foundation of China (No. 51908323, No. 72091512), the National Key Research and Development Program (No. 2019YFE0112800), and the Tencent Foundation through the XPLORER PRIZE.

Appendix A. Model preparation

IFC is a commonly used collaboration format in BIM-based projects, and many BIM software programs support the IFC format. Thus, using IFC files for data exchange can improve the scalability of ARC systems. However, IFC adopts the EXPRESS schema, which cannot be directly parsed by the reasoning engine. To date, several works have been proposed to convert IFC into IFCOWL, but these methods convert all instances and attributes in IFC to IFCOWL, resulting in information redundancy. In addition, the concepts and relations in the FPBO differ from those in IFCOWL, so the existing converters cannot be used directly. Therefore, a new RDF converter is customized and developed in this study to convert IFC data into the RDF format. Fig. 1 illustrates the workflow of the model preparation method, which consists of two main steps: model conversion and model enrichment.
The model conversion aims to convert the IFC data into the RDF format and consists of three sub-steps: concept mapping, IFC parsing, and Turtle file generation.
  • 1.
    Concept mapping. The concepts of IFC are aligned with those of the FPBO via the mapping table shown in Table A.1. The concepts of IFC objects are mapped to the classes in the FPBO; for example, 'IfcBuildingStorey' corresponds to 'BuildingStorey'. The concepts of IFC attributes are mapped to the data properties in the FPBO; for example, 'fire_resistance_rating' corresponds to 'hasFireResistanceLimits_hour'. The object properties among instances in the FPBO are determined by three relationships in the IFC file: decomposition, contains, and BoundedBy. First, the domain objects and range objects of the above three relationships are obtained by parsing the IFC file, and the FPBO classes of these objects are then determined. The object property between two objects can be uniquely determined by a HashMap with two keys, where one key is the class of the domain object and the other is the class of the range object.
  • 2.
    IFC parsing. Different toolkits are available for working with the IFC format; IfcOpenShell (Python), IfcPlusPlus (C++), and xBIM (.NET) are the most widely used. IfcOpenShell is based on the OpenCascade technology and features other tools, such as Blender importers. This paper utilizes IfcOpenShell to parse IFC files and store the necessary objects and attributes in memory. The in-memory objects and attributes are then mapped to the FPBO concepts according to the aforementioned ontology mapping result (a minimal conversion sketch is given after this list).
  • 3.
    Turtle file generation. The Terse RDF Triple Language (Turtle) is a common format for ontology data exchange. The mapped data stored in memory are then written into a Turtle file according to the Turtle syntax.
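A minimal sketch of such a converter is given below, combining ifcopenshell and rdflib; the FPBO namespace IRI, the file names, and the abbreviated class mapping are illustrative, and the derivation of object properties and of user-defined attributes such as fire_resistance_rating is omitted.

    import ifcopenshell                      # IfcOpenShell (Python), as referenced above
    from rdflib import Graph, Literal, Namespace, RDF

    FP = Namespace("http://example.org/fpbo#")   # assumed FPBO namespace IRI

    # Abbreviated concept mapping following Table A.1 (IFC entity -> FPBO class).
    CLASS_MAP = {
        "IfcBuilding": "BuildingRegion",
        "IfcBuildingStorey": "BuildingStorey",
        "IfcSpace": "BuildingSpace",
        "IfcColumn": "Column",
        "IfcWall": "Wall",
    }

    def convert(ifc_path, ttl_path):
        """Parse the IFC file, map entities to FPBO classes, and write a Turtle file."""
        model = ifcopenshell.open(ifc_path)
        graph = Graph()
        graph.bind("fp", FP)
        for ifc_class, fpbo_class in CLASS_MAP.items():
            for element in model.by_type(ifc_class):
                instance = FP[f"{fpbo_class}_{element.GlobalId}"]
                graph.add((instance, RDF.type, FP[fpbo_class]))
                graph.add((instance, FP.hasGlobalId, Literal(element.GlobalId)))
        graph.serialize(destination=ttl_path, format="turtle")

    # convert("plant.ifc", "plant.ttl")      # hypothetical file names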
In this work, the size of the converted IFCOWL file is nearly 200 times that of the .ttl file converted using the FPBO and the corresponding converter. The IFCOWL file is so large that opening it and operating on it (e.g., ontology alignment and model enrichment) via Protégé is too slow, which is not suitable for the research purpose.
Model enrichment aims to infer implicit information and supply missing information in the Turtle model. Model enrichment is performed based on the Pellet reasoner and the SWRL rules, which are defined as follows:
Category 1: Transitivity of the contain relationship among spatial elements. For example, space A contains space B, and space B contains space C. Therefore, space A contains space C.
SWRL1: BuildingRegion(?x) ^ BuildingStorey(?y) ^ BuildingSpace(?z) ^ hasBuildingSpatialElement(?x,?y) ^ hasBuildingSpatialElement(?y,?z) -> hasBuildingSpatialElement(?x,?z).
Category 2: Transitivity of the contain relationship between spatial elements and component elements. For example, space A contains space B, and space B contains element C. Therefore, space A contains element C.
SWRL2: BuildingRegion(?x) ^ BuildingSpace(?y) ^ hasBuildingSpatialElement(?x,?y) ^ BuildingComponentElement(?z) ^ hasBuildingElement(?y,?z) -> hasBuildingElement(?x,?z).
The above SWRL can support the derivation of implicit information by a reasoning engine. SWRL1 can derive the implicit information on which building spaces are contained in a building. SWRL2 can derive the implicit information on which building components are contained in a building.
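In the presented pipeline these rules are executed by the Pellet reasoner; purely as an illustration of what SWRL2 derives, the following sketch materialises the same inference with an equivalent SPARQL CONSTRUCT query in rdflib on a toy graph (the namespace IRI and instance names are made up).

    from rdflib import Graph

    data = """
    @prefix fp: <http://example.org/fpbo#> .
    fp:Plant a fp:BuildingRegion ; fp:hasBuildingSpatialElement fp:ZoneA .
    fp:ZoneA a fp:BuildingSpace ; fp:hasBuildingElement fp:ExitDoor1 .
    fp:ExitDoor1 a fp:BuildingComponentElement .
    """

    # Equivalent of SWRL2: if a region contains a space and the space contains a
    # component element, then the region also contains that component element.
    construct = """
    PREFIX fp: <http://example.org/fpbo#>
    CONSTRUCT { ?x fp:hasBuildingElement ?z }
    WHERE {
        ?x a fp:BuildingRegion ; fp:hasBuildingSpatialElement ?y .
        ?y a fp:BuildingSpace ; fp:hasBuildingElement ?z .
        ?z a fp:BuildingComponentElement .
    }
    """

    g = Graph().parse(data=data, format="turtle")
    inferred = list(g.query(construct))
    for triple in inferred:
        g.add(triple)                        # materialise the inferred triples
    print(inferred)                          # [(fp:Plant, fp:hasBuildingElement, fp:ExitDoor1)]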

Table A.1. Part of the ontology mapping dictionary.

IFC Schema | FPBO concepts
IfcBuildingStorey | BuildingStorey
IfcBuilding | BuildingRegion
IfcSpace | BuildingSpace
IfcColumn | Column
IfcBeam | Beam
IfcSlab | Floor
IfcWallStandardCase | Wall
IfcWall | Wall
IfcWindow | Window
IfcDoor | Doors
IfcStair | Stairs
GlobalId | hasGlobalId
fire_resistance_rating (user-defined attribute) | hasFireResistanceLimits_hour

Data availability

Data will be made available on request.

References
