Big Data - How to Realize the Promise
Alison Cave$^{1,*}$, Nikolai C. Brun$^{2}$, Fergus Sweeney$^{1}$, Guido Rasi$^{1}$ and Thomas Senderovitz$^{2}$ on behalf of the HMA-EMA Joint Big Data Taskforce
Abstract
The increasing volume and complexity of data now being captured across multiple settings and devices offers the opportunity to deliver a better characterization of diseases, treatments, and the performance of medicinal products in individual healthcare systems. Such data sources, commonly labeled as big data, are generally large, accumulate rapidly, and incorporate multiple data types and forms. Determining the acceptability of these data to support regulatory decisions demands an understanding of data provenance and quality, in addition to confirming the validity of new approaches and methods for processing and analyzing these data. The Heads of Medicines Agencies and the European Medicines Agency Joint Big Data Taskforce was established to consider these issues from the regulatory perspective. This review reflects the thinking from its first phase. It describes the big data landscape from a regulatory perspective and the challenges to be addressed so that regulators can know when and how to have confidence in the evidence generated from big datasets.
The role of medicines regulatory agencies is multifactorial and broad. Their overarching responsibility is to ensure that all approved medicines, medical devices, and combinations thereof are both effective and safe, but to deliver on that responsibility, medicines regulators must work across the entire drug development pathway. For example, there is a need not only for evidence on the intended and unintended effects of medicinal products, arising both from the highly controlled environment of a randomized controlled clinical trial and subsequently from clinical practice, but also for information on disease, its prevalence and progression across a population, on current standards of care across our diverse European population, and on the prevalence of potential adverse effects, in order to contextualize information around medicinal products. Increasingly, there is a need for an understanding of the accuracy of diagnostic tests, including imaging tests, which may impact on either the diagnosis of a disease and, hence, the prescribing of a medicinal product, or the monitoring of its effectiveness and/or safety.
The unparalleled pace of change in the scientific landscape is challenging the current regulatory paradigm and requiring regulatory agencies to look beyond conventional sources of evidence to support decision making across the entire product life cycle. These new sources of evidence, often collectively termed big data, offer opportunities to improve decision making but also bring uncertainties around the quality of the data and the analytic methods used and, hence, the veracity of the evidence ultimately generated.
Although the term big data is widely utilized, there is no commonly accepted definition, and the concept is quite nebulous, with no universally defined thresholds for any of the presumed characteristics. In our view, a definition should encompass not only the concept that big data is diverse, heterogeneous, and large, and incorporates multiple data types, but should also refer to the complexity and challenges of integrating the data to enable a combined analysis. Hence, we define big data as extremely large datasets, which may be complex, multidimensional, (un)structured, and heterogeneous, which are accumulating rapidly, and which may be analyzed computationally to reveal patterns, trends, and associations. In general, big datasets require advanced or specialized methods to provide an answer within reliable constraints. Thus, a single dataset may not strictly meet the definition of big data, but when it is pooled with other datasets of a similar type, or linked to other datasets of different types, the datasets become sufficiently large, or the difficulties in pooling, linking, and analyzing become sufficiently complex, for the data to assume the characteristics of big data.
Datasets of most immediate utility for regulatory decision making are data derived from previous clinical trials, real-world data, including postmarketing registry data, spontaneous adverse drug reaction (ADR) reports, and genomic data, especially if linked to clinical data. Other types of 'omic data, such as proteomics and metabolomics, represent more heterogeneous data, and at the far end of the uncertainty scale sit individually generated social media data. There is little doubt that the development of future treatments (medicines, in vitro diagnostics, devices, or digital therapeutics) will utilize such data, which may reach regulatory authorities either as supportive data together with more traditionally analyzed structured data or as the basis of the submission as a whole. Thus, it is essential that regulators understand when such data are present and how robustly they were generated in order to make a competent evaluation of the submission as a whole and to continuously monitor the medicine, device, or the performance of the in vitro diagnostics on the market.
This challenge is significant: The paradigm for authorization in most stringent regulatory authorities is based on the assessment of well-controlled, randomized, high-quality data of known provenance. In contrast, big data offers evidence that may be derived from unstructured, heterogeneous, and unvalidated data of potentially unknown provenance, often with unknowns around potential bias and with additional uncertainties of accuracy and precision. Much of this information may be at the individual level. Moreover, not all datasets are the same: quality and standardization are variable, data are generated under different scenarios and for different purposes, and ownership resides with multiple stakeholders, many of whom have no obligation to engage with regulatory systems. Thus, influencing the data landscape to meet regulatory needs is complex. Furthermore, with the availability of multiple linkable datasets, many associations will be observed, which may or may not be spurious, but which will consume considerable resources to validate and may generate significant concerns. Processes will need to be defined to replicate and test findings and also to determine how to define the threshold of evidence required for regulators to act.
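The concern about spurious associations can be illustrated with a small simulation. The sketch below is not taken from the Task Force report: it assumes a hypothetical linked dataset in which none of the candidate variables has any true relationship with the outcome, tests each one with a simple two-proportion z-test, and counts how many are flagged with and without a Bonferroni correction. All sizes, rates, and function names are illustrative assumptions.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def association_p_value(exposure, outcome):
    """Two-proportion z-test: outcome rate in exposed vs. unexposed patients."""
    n1 = sum(exposure)
    n0 = len(exposure) - n1
    if n1 == 0 or n0 == 0:
        return 1.0
    x1 = sum(o for e, o in zip(exposure, outcome) if e)
    x0 = sum(outcome) - x1
    pooled = (x1 + x0) / (n1 + n0)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    if se == 0:
        return 1.0
    return two_sided_p((x1 / n1 - x0 / n0) / se)


def count_flagged(n_patients=1000, n_variables=500, alpha=0.05, seed=0):
    """Count 'significant' associations when no true association exists.

    Returns (flagged at alpha, flagged at the Bonferroni threshold).
    """
    rng = random.Random(seed)
    # Hypothetical outcome with a 20% base rate, independent of everything else.
    outcome = [1 if rng.random() < 0.2 else 0 for _ in range(n_patients)]
    p_values = []
    for _ in range(n_variables):
        # Each candidate variable is pure noise by construction.
        exposure = [1 if rng.random() < 0.5 else 0 for _ in range(n_patients)]
        p_values.append(association_p_value(exposure, outcome))
    naive = sum(p < alpha for p in p_values)
    corrected = sum(p < alpha / n_variables for p in p_values)
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = count_flagged()
    print(f"flagged without correction: {naive}")
    print(f"flagged with Bonferroni correction: {corrected}")
```

Because every variable here is noise, anything flagged is a false positive. With an uncorrected threshold of 0.05, roughly 5% of the tested variables are expected to be flagged. Bonferroni is only one of several possible corrections and is used here for simplicity.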
Generalizing that evidence threshold is difficult when evidence suitable for regulatory decision making can be a variable concept. As such, lower grades of evidence may be acceptable in the case of rare diseases, where data are hard to collect compared with more frequent diseases, or when assessing a safety risk where randomized evidence is impossible to obtain. Nonetheless, certain evidentiary standards have evolved as scientific methodology and rigor have advanced from pure empirical thinking, relying on serial observations as the basis for causal inferences, through Popper's falsification principles, to the present-day randomized double-blind placebo-controlled clinical trial as the gold standard for evidence. When considering real-world data for use in a regulatory context, the point at which data become evidence is of crucial importance. A large dataset with a large series of observations does not in itself automatically constitute good evidence of causality, but when and why evidence reaches the causality threshold needs to be clarified. Regulators must set the standard for good evidence and guide innovators and industry in their efforts to generate evidence suitable for regulatory decision making.
Given this context, the Task Force of the EU Heads of Medicines Agencies and the European Medicines Agency (HMA-EMA Joint Big Data Taskforce) was formed to describe the big data landscape from a regulatory perspective, in order to inform the EU regulatory network's decisions and planning on the capability and capacity needed to guide, analyze, and interpret these data. This paper is a distillation of the report from the first phase of the Task Force and sets out the thinking of the Task Force around the steps that need to be taken in the wider data landscape so that regulators can know when and how to have confidence in the evidence generated from big datasets. A fuller version of this report and the reports of the individual subgroups are available on the EMA and HMA websites. As for many stakeholders, regulatory concerns center not only around the data itself, its quality (including any data transformations), and representativeness, but also on the analytical processes (data manipulation, modeling, and analytics) used to generate the evidence. It is clear that delivery of the Task Force recommendations will require the consolidated action of multiple stakeholders and substantial resources. The Task Force has continued its work, and its phase II report, proposing the priorities, timing, and approach to implementing recommendations for big data in medicines regulation, is anticipated for publication in 2020.
DATA CHARACTERISTICS
To a large extent, data quality determines the validity of the evidence that can be reliably derived from a given dataset. Thus, understanding quality is a key need. However, quality is hard to define, and it is even harder to prespecify what the required data quality attributes might be over a range of regulatory use cases. Acceptability will always be influenced by the context of use: the opportunities and timeliness of other data capture options, the question being asked, the level of risk associated with each decision, the availability of other treatments, and the unmet medical need (for recent examples see ref. 1). Thus, guidelines are needed to inform on regulatory expectations around data quality across the range of regulatory decisions, in the context of the risk associated with each decision. Where possible, these guidelines should seek alignment with other regulatory authorities.
To move the dial, we require the capability to characterize data based on a data quality framework, which enables a common understanding of the strengths and limitations of big datasets. Ideally, such a framework should be model and geography agnostic to allow a comparison across data sources of the same type whatever their origin. Moreover, the framework should address not just the data content but also the quality control measures in place to describe reproducibility over time, and these findings should be transparently recorded in a sustainable, accessible inventory. Standards should include measurement technologies to allow systematic benchmarking and validation as these analytical techniques evolve, with a particular focus on the reproducibility of results. The 2009 report by the Human Proteome Organization Test Sample Working Group illustrates the scale of the challenge: of 27 laboratories examining the same sample, which consisted of 20 highly purified proteins, only 7 laboratories reported all 20 proteins correctly.$^{2}$ Thus, achieving sustainability of such a venture will require collaboration, engagement, and sustained effort over multiple stakeholders to maximize its utility. On a much smaller scale, the current EMA Patient Disease Registry Initiative provides an example of how incorporating the needs of all relevant stakeholders informs the development of minimal quality standards and data elements in order to facilitate downstream data harmonization and maximize the utility of the data.$^{3}$ Facilitating broad engagement and seeking agreement will support and drive adoption.
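As a minimal sketch of what one entry in such an inventory could record, the example below computes per-field completeness for a handful of records. It is an illustration only: completeness is just one of many possible quality attributes, and the field names and records are hypothetical, not drawn from any framework cited here.

```python
def profile_completeness(records, fields):
    """Return, per field, the fraction of records with a non-missing value.

    A value counts as missing if the key is absent, None, or an empty string.
    """
    total = len(records)
    profile = {}
    for field in fields:
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        profile[field] = present / total if total else 0.0
    return profile


if __name__ == "__main__":
    # Hypothetical registry extract; all values are invented for illustration.
    sample = [
        {"age": 54, "sex": "F", "diagnosis_code": "E11"},
        {"age": None, "sex": "M", "diagnosis_code": ""},
        {"age": 61, "sex": "", "diagnosis_code": "I10"},
        {"age": 47, "sex": "F"},
    ]
    for name, share in profile_completeness(sample, ["age", "sex", "diagnosis_code"]).items():
        print(f"{name}: {share:.0%} complete")
```

Computing the same profile in the same way for every data source of a given type is what makes the resulting figures comparable across sources, which is the point of a model- and geography-agnostic framework.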
A key enabler for data characterization is standardization in order to drive harmonization across datasets, enhance interoperability, improve data quality, allow comparability, and facilitate data analyses. However, standardizing data is hard; much of the data are unstructured and heterogeneous, and this is especially true of social media data and data from wearables that are anticipated to account for much of the data volume increases in the coming years.
The need for standards is not new. It was recognized many years ago and, when required for regulatory purposes, has driven global harmonization, a key component of the mission of the International Council for Harmonisation (ICH) of Technical Requirements for Pharmaceuticals for Human Use. For instance, the data model and data elements of the Individual Case Safety Report, which is used for reporting of ADRs, provide a clear specification of reporting requirements for ADRs.$^{4}$ As a result, platforms such as EudraVigilance$^{5}$ contain extremely well-structured information, although the completeness and accuracy of the information within Individual Case Safety Report forms is still dependent on the reporter and, hence, variable. However, many other datasets are not standardized, partly because most have evolved over many years and, hence, the data encompass many technological developments, and partly because there is not one single owner. In addition, with the exception of clinical trials data, data were not generated to support regulatory decision making and, hence, were not required to comply with strict quality guidelines. Thus, data heterogeneity spans a continuum. Consider genomics as an example: nearly 250 million genomes are currently available worldwide, but although relatively well structured, much of the data is siloed by disease, institution, and country, generated with different methodologies, analyzed by nonstandardized software, and often stored in incompatible file formats, and consequently only a small percentage is linked.$^{6}$ This situation is replicated over multiple datasets in the big data landscape. Consider also that genomics is one of the most well-organized fields, with considerable harmonization efforts, such as Global Alliance for Genomics and Health Connect, already underway.$^{7}$ The resources to duplicate such efforts across the full spectrum of datasets will be considerable.
No single data standard will have the depth and breadth to be applicable to all datasets. However, as a community, we should strive as much as possible to minimize the number of standards. To encourage widespread uptake, standards should be transparent and open source, globally applicable, comprehensive, and maintained through a sustainable, ongoing process of testing and revision. It is, therefore, important to strongly support the use and maintenance of available data standards, and the development of standards where none are available (e.g., for novel data sources, such as m-health, and less mature fields, such as epigenetics, to ensure early alignment).$^{8}$ A recent example of regulatory uptake and support of standards is provided by International Organization for Standardization (ISO) Identification of Medicinal Products,$^{9}$ which aims to facilitate international identification of medicinal products.
Although the benefits of standardization are clear, implementation of standards is expensive, and the return on investment must be sufficient to outweigh these costs. The cost of implementation of ISO Identification of Medicinal Products alone across 14 European Federation of Pharmaceutical Industries and Associations member companies has been estimated to be at least [amount missing in source]. Hence, there is a need for a common understanding of the overall vision and scope, a clear definition of the ultimate value, and a well-formed plan for implementation. Sustainability of standards is challenging but will be enabled by widespread adoption. From a regulatory perspective, a prioritization of efforts will be needed, with an early focus on data most likely to impact on decision making in the near term.
At a global level, it is important to ensure that extremely expensive and time-consuming standardization and data mapping initiatives$^{11-13}$ do not pull in opposite directions but work together to achieve sustainable and global solutions. From a regulatory perspective, global cooperation is important, as for many rare diseases and cancers, or indeed rare ADRs, there may only be a handful of cases worldwide, and these data need to be interoperable to derive meaningful insights. We need to be aware also that data mapping is expensive, may create assumptions around equivalence, and always carries a risk of information being lost during data transformation; therefore, standardization of data at inception should be the goal. Where this is not possible, a clear framework to confirm the validity of mapped data for regulatory decision making needs to be established (e.g., following the implementation of common data models).
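The two mapping risks named above, information loss and assumed equivalence, can be made visible with a simple report. The sketch below is one possible check, not an established validation framework. The local codes (`DM2`, `T2D`, `HTN`, `XYZ`) and the mapping table are hypothetical. The target codes `E11` and `I10` are the ICD-10 categories for type 2 diabetes mellitus and essential hypertension.

```python
from collections import Counter, defaultdict


def mapping_report(source_codes, code_map):
    """Summarize how well a set of source codes maps onto a target vocabulary.

    - mapped_fraction: share of records whose code has a target
    - unmapped_codes: source codes with no target, with record counts
      (records that would be lost in the transformation)
    - collapsed_targets: target codes that absorb more than one source code
      (an equivalence assumption that merges distinct source concepts)
    """
    counts = Counter(source_codes)
    total = sum(counts.values())
    unmapped = {code: n for code, n in counts.items() if code not in code_map}
    sources_by_target = defaultdict(set)
    for code in counts:
        if code in code_map:
            sources_by_target[code_map[code]].add(code)
    collapsed = {t: sorted(s) for t, s in sources_by_target.items() if len(s) > 1}
    mapped = total - sum(unmapped.values())
    return {
        "mapped_fraction": mapped / total if total else 0.0,
        "unmapped_codes": unmapped,
        "collapsed_targets": collapsed,
    }


if __name__ == "__main__":
    # Hypothetical local diagnosis codes and a hypothetical mapping table.
    local_codes = ["DM2", "DM2", "T2D", "HTN", "XYZ"]
    to_icd10 = {"DM2": "E11", "T2D": "E11", "HTN": "I10"}
    print(mapping_report(local_codes, to_icd10))
```

A report like this does not establish that mapped data are valid for a given regulatory question. It only makes the losses and merges explicit so that they can be assessed.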
Standardization will not only enable better data characterization but will also facilitate data linkage between related datasets to provide additional insight not possible from single isolated datasets. This is a key requirement, as European healthcare data are heterogeneous; differences in healthcare systems, national guidelines, and clinical practice have driven different content, and, hence, findings from a single healthcare system in a single European country cannot be assumed to generalize to the whole of Europe. Moreover, data linkage applies not only to databases within a subgroup (e.g., how to integrate different registries or electronic health records) but also among disparate datasets (e.g., linking clinical data with genomic/pharmacogenomic data and proteomic data) and across care settings (e.g., primary, secondary, and tertiary care). For example, predicting a patient's response to a therapeutic intervention with a proteomic or genomic biomarker, in order to minimize exposure of patients to ineffective or intolerable therapies, can only be achieved if 'omics data are linked to clinical outcomes. Unfortunately, clinical outcome data relevant to regulatory decision making (e.g., data on efficacy or safety of treatments) are currently found only sporadically in public databases, which limits the value of those databases in a regulatory context. Raising awareness of the need for linkage of treatment and outcome data would be particularly beneficial.
Different questions will require linkage of data at different levels and require different data protection solutions. For some regulatory needs, linkage at an individual patient level would ideally be required (e.g., understanding the clinical outcome of an ADR or enabling longitudinal follow-up of a genomic targeted or gene-editing medicine). However, there are many scenarios where linkage of data at a population level would be sufficient (e.g., standard of care at different disease stages across Europe or outcomes from vaccination programs). To enable meaningful data linkage, sharing needs to move beyond simply sharing the raw data, to encompass associated metadata, which describes key characteristics about the data (e.g., sample type, disease stage, treatment, and genomic mutation).$^{14}$
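The difference between the two linkage levels can be sketched in a few lines. The example below is illustrative only, and the record layout, region codes, and key are assumptions. Individual-level linkage joins treatment and outcome rows on a keyed pseudonym of the patient identifier, while population-level linkage combines per-region counts without any patient-level join. A keyed hash is used merely to show that the two levels call for different handling of identifiers. It is not, on its own, a sufficient data protection solution.

```python
import hashlib
import hmac
from collections import Counter


def pseudonymize(patient_id, secret_key):
    """Keyed hash of an identifier, so linked rows do not carry the raw ID."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def link_individual(treatments, outcomes, secret_key):
    """Individual-level linkage: join rows that share a pseudonymized ID."""
    outcome_by_pid = {
        pseudonymize(o["patient_id"], secret_key): o["outcome"] for o in outcomes
    }
    linked = []
    for t in treatments:
        pid = pseudonymize(t["patient_id"], secret_key)
        if pid in outcome_by_pid:
            linked.append(
                {"pid": pid, "treatment": t["treatment"], "outcome": outcome_by_pid[pid]}
            )
    return linked


def link_population(treatments, outcomes):
    """Population-level linkage: combine per-region counts only."""
    treated = Counter(t["region"] for t in treatments)
    events = Counter(o["region"] for o in outcomes)
    return {
        region: {"treated": treated[region], "events": events.get(region, 0)}
        for region in treated
    }


if __name__ == "__main__":
    # Hypothetical rows; identifiers, regions, and outcomes are invented.
    treatments = [
        {"patient_id": "p1", "region": "DK", "treatment": "A"},
        {"patient_id": "p2", "region": "DK", "treatment": "B"},
        {"patient_id": "p3", "region": "NL", "treatment": "A"},
    ]
    outcomes = [
        {"patient_id": "p1", "region": "DK", "outcome": "recovered"},
        {"patient_id": "p3", "region": "NL", "outcome": "adverse reaction"},
    ]
    print(link_individual(treatments, outcomes, b"demo-key"))
    print(link_population(treatments, outcomes))
```

In this sketch the `region` and `treatment` fields play the role of the metadata discussed above: without them, neither kind of linkage would yield an interpretable result.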
DATA ANALYTICS
Big data analytics is a growing field of data science, which combines methods from various disciplines, including biostatistics, mathematical modeling and simulation, bioinformatics, and computer science, and encompasses data collection, data management, data integration, data standardization, machine learning (ML), and requires specialized information technology
$^{1}$European Medicines Agency, Amsterdam, Netherlands; $^{2}$Danish Medicines Agency, Copenhagen, Denmark. *Correspondence: Alison Cave (alisoncave@hotmail.co.uk)
The views expressed in this article are the personal views of the authors and may not be understood or quoted as being made on behalf of or reflecting the position of the regulatory agencies or organizations with which the authors are employed/affiliated.
Received October 28, 2019; accepted December 12, 2019. doi:10.1002/cpt.1736