
Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision

Bobby Azad
Electrical Engineering and Computer Science Department
South Dakota State University
Brookings, USA
Reza Azad
Faculty of Electrical Engineering and Information Technology
RWTH Aachen University
Aachen, Germany
Sania Eskandari
Department of Electrical Engineering
University of Kentucky
Lexington, USA
Afshin Bozorgpour
Faculty of Informatics and Data Science
University of Regensburg
Regensburg, Germany
Amirhossein Kazerouni
School of Electrical Engineering
Iran University of Science and Technology
Tehran, Iran
Islem Rekik
BASIRA Lab, Imperial-X and Computing Department
Imperial College London
London, UK
Dorit Merhof
Faculty of Informatics and Data Science
University of Regensburg
Regensburg, Germany
Corresponding author: Dorit Merhof, dorit.merhof@ur.de
Abstract

Foundation models, large-scale pre-trained deep-learning models adaptable to a wide range of downstream tasks, have recently gained significant interest, and various deep-learning problems are undergoing a paradigm shift with the rise of these models. Trained on large-scale datasets to bridge the gap between different modalities, foundation models facilitate contextual reasoning, generalization, and prompting capabilities at test time. The predictions of these models can be adjusted for new tasks by augmenting the model input with task-specific hints, called prompts, without requiring extensive labeled data or retraining. Capitalizing on the advances in computer vision, the medical imaging community has also shown growing interest in these models. With the aim of assisting researchers in navigating this direction, this survey intends to provide a comprehensive overview of foundation models in the domain of medical imaging. Specifically, we begin our exploration with an exposition of the fundamental concepts underlying foundation models. Subsequently, we offer a methodical taxonomy of foundation models within the medical domain, proposing a classification system primarily structured around training strategies, while also incorporating additional facets such as application domains, imaging modalities, specific organs of interest, and the algorithms integral to these models. Furthermore, we highlight the practical use cases of selected approaches and then discuss the opportunities, applications, and future directions of these large-scale pre-trained models for analyzing medical images. In the same vein, we address the prevailing challenges and research pathways associated with foundation models in medical imaging, encompassing interpretability, data management, computational requirements, and the nuanced issue of contextual comprehension. Finally, we gather the reviewed studies, with their available open-source implementations, at our GitHub, which we aim to update regularly with the latest relevant papers.

Keywords: Foundation models · Deep learning · Language and vision · Large language models · Score-based models · Self-supervised learning · Medical applications · Survey

1 Introduction

Medical imaging is at the forefront of healthcare, playing a pivotal role in diagnosing and treating diseases. Recent advancements in Artificial Intelligence (AI) have ushered in a new era in medical imaging, driven by the development of Foundation Models (FMs). FMs are AI models typically trained on extensive, diverse datasets, frequently using self-supervision techniques at massive scale. Following this initial training, they can be further adapted, for example through fine-tuning, to a wide array of downstream tasks related to the original training data [1].

In contrast to the conventional deep learning paradigm, which heavily relies on large-scale, task-specific, crowd-labeled data to train individual deep neural networks (DNNs) for various visual recognition tasks, FMs provide a more efficient alternative. They are pretrained on large-scale datasets that are nearly unlimited in availability, enabling straightforward application to downstream tasks with only a limited amount of labeled data. This shift in approach shows potential for significantly decreasing the labor and time usually necessary for such tasks. The recent surge can be attributed to the progress made possible by large language models (LLMs) and the expansion of data and model size [2]. Models such as GPT-3 [3], PaLM [4], Galactica [5], and LLaMA [6] have exhibited a strong ability to comprehend natural language and solve complex tasks with zero/few-shot learning, attaining remarkable results without requiring extensive task-specific data. Large-scale vision foundation models are currently making significant advances in perception tasks, as highlighted by [7, 8]. Specifically, vision-language models (VLMs) are pre-trained with large-scale image-text pairs and are then directly applicable to downstream visual recognition tasks. VLMs generally consist of three fundamental parts: textual features, visual features, and a fusion module. These elements work together in harmony, allowing the models to efficiently use text and visual data to generate contextually appropriate and logical results. Specifically, the pre-training of VLMs typically adheres to vision-language objectives that aid in acquiring image-text correlations from large collections of image-text pairs. For instance, the pioneering study CLIP [9], an image-text matching model, utilizes contrastive learning methods to generate fused representations for images and texts.
The learning objective is to minimize the gap between the representation of an image and its corresponding text, while simultaneously increasing the separation between the representations of unrelated pairs. In addition to these so-called “Textually Prompted Models (TPMs)", researchers have also explored foundation models that can be prompted by visual inputs (points, boxes, masks), which we refer to as “Visually Prompted Models (VPMs)" [2] (see Fig. 2 for a visual depiction of both). Recently, the Segment Anything Model (SAM) [10] has garnered significant attention in the vision community. SAM is a promptable model developed for the purpose of broad image segmentation. It was trained on a promptable segmentation task over 11 million images and more than 1 billion masks, which enables powerful zero-shot generalization. Furthermore, SAM has been expanded and refined through training on a large dataset encompassing 4.6 million medical images and 19.7 million corresponding masks [11]. This dataset offers rich diversity, covering 10 distinct medical data modalities and featuring annotations for 4 anatomical structures in addition to lesions. The training regimen is comprehensive, representing 31 major human organs. Notably, it has yielded impressive results that have bolstered the model’s capacity for enhanced generalization.
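To make the prompt-based inference idea concrete, the toy NumPy sketch below mimics CLIP-style zero-shot classification: an image embedding is scored against the embeddings of several textual prompts, and the best-matching prompt is returned. All vectors, prompt strings, and the 4-dimensional embedding size are illustrative assumptions, not values from any real model:

```python
import numpy as np

def normalize(v):
    """Project embeddings onto the unit sphere, as CLIP does before scoring."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical text embeddings for three candidate prompts (one row each).
prompts = ["a chest X-ray with pneumonia", "a normal chest X-ray", "a brain MRI"]
text_emb = normalize(np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.1],
]))

# Hypothetical embedding of the query image.
image_emb = normalize(np.array([0.85, 0.2, 0.05, 0.1]))

# Cosine similarity between the image and every prompt; a softmax gives scores.
logits = text_emb @ image_emb
scores = np.exp(logits) / np.exp(logits).sum()
prediction = prompts[int(np.argmax(scores))]
print(prediction)  # the prompt whose embedding is closest to the image's
```

Adapting the model to a new task here amounts to editing the prompt list, which is the sense in which predictions can be adjusted without retraining.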

Moreover, this generic visual prompt-based segmentation model has recently been adapted to a wide range of downstream tasks, including medical image analysis [12, 13], image inpainting [14], style transfer [15], and image captioning [16] to name a few. Apart from foundational models that rely on textual and visual prompts, research endeavors have also delved into creating models that harmonize various types of paired modalities (such as image-text, and video-audio) to learn representations assisting diverse downstream tasks.

The creation of foundation models has garnered significant attention in the realm of medical AI system development [17, 18, 19, 20, 21]. Despite substantial advancements in biomedical AI, the main methodologies used still tend to be task-specific models. However, medical practice encompasses various data modalities comprising text, imaging, genomics, and others, making it essentially multimodal [22]. Inherently, a Medical Foundation Model (MFM) has the ability to adaptively interpret various medical modalities, including diverse data sources such as images, electronic medical records, lab findings, genomic information, medical diagrams, and textual data [23]. Hence, foundation models have the potential to provide an enhanced foundation for addressing clinical issues, advancing the field of medical imaging, and improving the efficiency and effectiveness of diagnosing and treating diseases, opening the opportunity to develop a unified biomedical AI system that can interpret complex multimodal data. Given the accelerating production of biomedical data and rapid methodological advances, the influence of these models is expected to expand with an influx of new contributions. As shown in Figure 1, a significant body of research had been devoted to the application of FMs in diverse medical imaging contexts by the first release of our survey in October 2023. These contributions encompass a wide range of potential applications, from fundamental biomedical discoveries to the upgrading of healthcare delivery. Hence, it is advantageous for the community, and timely, to review the existing literature.

(a) Algorithms
(b) Modalities
(c) Organs
Figure 1: The diagram (a) displays the distribution of published papers categorized by their algorithm, (b) categorizes them by their imaging modalities, and (c) classifies them by the type of organ concerned. It is worth noting that the total number of papers included in the analysis is 40.

This paper provides a holistic overview of the foundation models developed for medical imaging applications. We distinguish existing works based on the taxonomy proposed in [2] and highlight the major strengths and shortcomings of the existing methods. We hope that this work will point the way forward, provide a roadmap for researchers, stimulate further interest and enthusiasm within the vision community, and harness the potential of foundation models in the medical discipline. This survey will be regularly updated to reflect the dynamic progress of MFMs, as this is a rapidly evolving and promising path towards AGI in the biomedical field. Our major contributions include:

• We conduct a thorough and exhaustive examination of foundation models proposed in the field of medical imaging, beginning with background and preliminaries for foundation models and proceeding to specific applications, along with the organ concerned and imaging modality, in a hierarchical and structured manner.

• Our work provides a taxonomy (Figure 3) and an in-depth analysis (e.g., task/organ-specific research progress and limitations), as well as a discussion of various aspects.

• Furthermore, we discuss the challenges and unresolved aspects linked to foundation models in medical imaging. We pinpoint new trends, raise important questions, and propose future directions for further exploration.

Figure 2: Visual illustration of how our extensive classification categorizes existing works into textually and visually prompted models, distinct from traditional vision models.
Figure 3: The suggested taxonomy for foundational models used in medical imaging research consists of six distinct groups: I) VPM-Generalist, II) TPM-Hybrid, III) TPM-Contrastive, IV) TPM-Generative, V) VPM-Adaptations, and VI) TPM-Conversational. To maintain conciseness, we assign ascending prefix numbers to each category in the paper’s name and cite each study accordingly as follows: 1. [24], 2. [23], 3. [25], 4. [22], 5. [26], 6. [27], 7. [28], 8. [29], 9. [30], 10. [31], 11. [32], 12. [33], 13. [34], 14. [35], 15. [36], 16. [37], 17. [38], 18. [39], 19. [40], 20. [18], 21. [41], 22. [42], 23. [43], 24. [12], 25. [44], 26. [11], 27. [21], 28. [45], 29. [46], 30. [47], 31. [48], 32. [49], 33. [50], 34. [51], 35. [52], 36. [53], 37. [54], 38. [55], 39. [56], 40. [57]

1.1 Clinical Importance

In medical imaging, foundation models are reshaping how research methods are designed and paradigms are approached, paving the way for innovative advancements and pioneering breakthroughs across various sectors, owing to inherent properties that align with the medical domain, as follows.

Multi-Modality: Despite advances in biomedical AI, most models today are limited to single-task, unimodal functions. For instance, a mammogram-interpretation AI excels at breast cancer screening but cannot incorporate patient records or additional data such as MRI, nor engage in meaningful dialogue, limiting its real-world applicability.

Explainability and Generalization: The absence of explainability in deep learning models can erode trust among clinicians accustomed to clear clinical insights [58]. The ability of models to generalize across different medical settings is vital due to varying data sources. Foundation models address these issues by offering a unified framework for tasks like detection and classification, often trained on diverse datasets from various medical centers, enhancing their potential for clinical use by ensuring interpretability and broad applicability.

Privacy Preservation: The computer vision community has a history of open-sourcing datasets, but in medical imaging, privacy regulations limit data sharing. Foundation models offer a privacy-preserving alternative by allowing knowledge transfer without direct access to sensitive data. Additionally, federated learning enables model training on distributed data while keeping it on local machines, ensuring data privacy. Moreover, foundation models facilitate privacy preservation by generating synthetic data resembling real medical images, eliminating the need for actual patient data in model training.

Adaptability: Existing medical AI models struggle when faced with distribution shifts caused by changes in technology, procedures, settings, or populations. In contrast, MFMs can effectively adapt to these shifts through in-context learning. For instance, a hospital can teach an MFM model to interpret X-rays from a new scanner by providing a few examples as prompts, enabling it to adjust to new data distributions in real time. This capability is mainly seen in large language models and is not common in conventional medical AI models, which would typically require complete retraining with new datasets.

Domain Knowledge: Unlike clinicians, traditional medical AI models often lack initial medical domain knowledge, relying solely on statistical associations. Medical imaging foundation models like GMAI can address this limitation by integrating formal medical knowledge, using structures like knowledge graphs and retrieving relevant context from existing databases, improving their performance on specific tasks. In summary, foundation models play a crucial role in advancing medical applications by providing a robust and adaptable framework that enhances efficiency, generalizability, and privacy preservation. Their ability to support various clinical tasks and promote collaboration makes them invaluable tools for improving patient care and medical research.

1.2 Relevant Surveys

With the recent success of foundation models, there has been a surge of surveys and contributions in this domain. Some reviews investigate recent advances in LLMs, distinguishing different aspects of LLMs by analyzing the impact of pre-training, adaptation tuning, utilization, and evaluation [59, 60, 61]. In the context of vision models, the work of [2] provides a comprehensive review of FMs including their typical architecture design, training objectives, and prompting mechanisms. The work of [62] delivers a comprehensive survey of research in prompt engineering on diverse types of vision-language models, organizing existing prompt-engineering approaches from a new perspective. Besides, [63] provides a systematic review of visual language models for various visual recognition tasks including image classification, object detection, and semantic segmentation. In the medical imaging field, [23] identifies the potential applications, opportunities, and challenges of MFMs. The work of [24] provides a comprehensive and objective evaluation of SAM on medical image segmentation, while [64] discusses the spectrum and future directions of foundation models. Different from the aforementioned works, however, we devise a multi-perspective taxonomy of foundation models in the medical community, providing a systematic categorization of research on medical foundation models and their applications, dividing them into textually prompted and visually prompted models, with each paper broadly classified according to its proposed algorithm, the organ concerned, and the imaging modality. We present the concepts and theoretical foundations behind foundation models, ranging from training objectives and instruction-aligning to prompt engineering (Section 2). In Section 3.1, we comprehensively cover an extensive and up-to-date overview of recent medical foundation models, as shown in Figure 3.
We wrap up this survey by pinpointing future directions and open challenges facing foundation models in medical imaging in Section 5.

1.3 Search Strategy

We conducted extensive searches across various platforms, such as DBLP, Google Scholar, and Arxiv Sanity Preserver. We leveraged their search capabilities to create tailored queries and compile comprehensive lists of academic works. These searches encompassed a broad spectrum of scholarly publications, including peer-reviewed journal articles, conference papers, workshop materials, non-peer-reviewed content, and preprints, and we tailored our search criteria to achieve this diversity. Our specific search queries consisted of the keywords (foundation* | generalist* | medical* | {Task}*), (med-{FM} | foundation*), and (foundation* | biomedical* | image* | model*), where {FM} and {Task} refer to a well-known vision foundation model (such as PaLM, CLIP, etc.) or task (such as segmentation, question answering, etc.) in medical imaging. We then applied filtering to eliminate false positives, ensuring that only papers related to foundation models were included in our analysis.

1.4 Paper Organization

The rest of the survey is organized as follows. Section 2 presents the background and preliminaries for foundation models. We adopt the taxonomy of [2] and categorize previous studies into two main groups: those prompted by textual inputs (discussed in Section 3.1) and those driven by visual cues (discussed in Section 3.2). In the context of textually prompted foundation models, we further subdivide them into contrastive, generative, hybrid (combining contrastive and generative approaches), and conversational visual language models. In addition, we differentiate visually prompted models into adaptations and generalist models. Furthermore, Section 5 discusses the risks, open problems, and future directions of foundation models. Finally, we conclude our survey in Section 6.

2 Preliminaries

The term "foundation models" made its debut at the Stanford Institute for Human-Centered AI in [1], with the definition of "the base models trained on large-scale data in a self-supervised or semi-supervised manner that can be adapted for several other downstream tasks". Specifically, inspired by the surge of large language models (LLMs) and building on deep-learning fundamentals such as DNNs and self-supervised learning, foundation models have emerged by massively scaling up both data and model size. In this section, we introduce the basic model architectures, concepts, and settings behind FMs, focusing on contributing factors for these models in computer vision such as training objectives, instruction-aligning, inference procedures, and prompting.

2.1 Pre-training Objectives

Diverse pretraining objectives have been devised to learn a rich understanding of the relationship between vision and language [65, 66, 67, 68]. We broadly categorize them into contrastive and generative objectives.

2.1.1 Contrastive Objectives

Contrastive objectives instruct models to acquire distinctive representations [69, 67] by bringing related sample pairs closer together while pushing unrelated pairs farther apart within the feature space. Specifically, Image Contrastive Loss (ICL) aims to learn discriminative image features, making a query image closely resemble its positive keys (i.e., its data augmentations) while ensuring it remains distant from its negative keys (i.e., other images) within the embedding space. Consider a batch of $B$ images; contrastive objectives such as InfoNCE [70] and its variations [67, 69], $\mathcal{L}_{\text{I}}^{\text{InfoNCE}}$, can be expressed as:

\mathcal{L}_{\text{I}}^{\text{InfoNCE}} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(\theta_{i}^{\text{query}}\cdot\theta_{+}^{\text{positive}}/\tau\right)}{\sum_{j=1,\,j\neq i}^{B+1}\exp\left(\theta_{i}^{\text{query}}\cdot\theta_{j}^{\text{key}}/\tau\right)}

where $\theta_{i}^{\text{query}}$ represents the query embedding and $\{\theta_{j}^{\text{key}}\}_{j=1,\,j\neq i}^{B+1}$ are the key embeddings, with $\theta_{+}^{\text{positive}}$ denoting the positive key corresponding to $\theta_{i}^{\text{query}}$, while the rest are considered negative keys. The hyperparameter $\tau$ governs the density of the learned representation.
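As a concrete illustration, the InfoNCE objective above can be sketched in a few lines of NumPy. The batch size, embedding dimension, number of negatives, and temperature value below are arbitrary assumptions chosen for the toy example:

```python
import numpy as np

def unit(v):
    """L2-normalize embeddings along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def info_nce(queries, positives, negatives, tau=0.07):
    """InfoNCE: -1/B * sum_i log [ exp(q_i.k_+/tau) / sum_j exp(q_i.k_j/tau) ].

    queries:   (B, D) query embeddings
    positives: (B, D) one positive key per query
    negatives: (B, K, D) K negative keys per query
    """
    pos_logits = np.einsum("bd,bd->b", queries, positives) / tau    # (B,)
    neg_logits = np.einsum("bd,bkd->bk", queries, negatives) / tau  # (B, K)
    all_logits = np.concatenate([pos_logits[:, None], neg_logits], axis=1)
    # log of the softmax probability assigned to the positive key
    log_prob_pos = pos_logits - np.log(np.exp(all_logits).sum(axis=1))
    return -log_prob_pos.mean()

rng = np.random.default_rng(0)
B, K, D = 8, 15, 32
q = unit(rng.normal(size=(B, D)))
negs = unit(rng.normal(size=(B, K, D)))
loss_random = info_nce(q, unit(rng.normal(size=(B, D))), negs)
loss_aligned = info_nce(q, q.copy(), negs)  # positives identical to the queries
print(loss_aligned < loss_random)  # aligned positives give a much lower loss
```

The comparison at the end shows the intended behavior of the objective: when a query and its positive key coincide, the loss is near zero; with random positives, it sits near the chance level.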

Image-Text Contrastive Loss (ITCL) seeks to develop distinctive image-text representations by bringing together the embeddings of matched images and texts and pushing apart those that do not match [68, 7]. Let $(i, t_{i})$ represent the $i$-th image-text example; then the image-to-text loss is calculated as:

\mathcal{L}_{I\rightarrow T} = -\log\left[\frac{\exp\left(\theta_{i}\cdot\theta_{+}/\tau\right)}{\sum_{j=1}^{N}\exp\left(\theta_{i}\cdot\theta_{j}/\tau\right)}\right]

where $N$ is the total number of such pairs, $\theta_{i}$ corresponds to the embedding for image $i$, and $\theta_{+}$ and $\theta_{j}$ denote positive and negative text representations, respectively. The losses are computed with a focus on the relationship between images and texts while considering the temperature parameter $\tau$.

The text-to-image loss is also calculated similarly, and the total loss is the sum of these two terms:

\mathcal{L}_{ITC} = \frac{1}{N}\sum_{i=1}^{N}\left[\mathcal{L}_{I\rightarrow T} + \mathcal{L}_{T\rightarrow I}\right]
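The symmetric image-text objective can likewise be sketched directly from the two directional losses, computing both from a single $N \times N$ similarity matrix. Batch size, dimensionality, and temperature in this NumPy toy are illustrative assumptions:

```python
import numpy as np

def itc_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric image-text contrastive loss over N matched (image, text) pairs.

    img_emb, txt_emb: (N, D) L2-normalized embeddings; row i of each is a matched pair.
    """
    logits = img_emb @ txt_emb.T / tau  # (N, N) similarity matrix
    log_softmax_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_softmax_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    loss_i2t = -np.mean(np.diag(log_softmax_rows))  # image -> text direction
    loss_t2i = -np.mean(np.diag(log_softmax_cols))  # text -> image direction
    return loss_i2t + loss_t2i

rng = np.random.default_rng(0)
N, D = 4, 16
img = rng.normal(size=(N, D))
img /= np.linalg.norm(img, axis=1, keepdims=True)
perfect = itc_loss(img, img.copy())                # matched pairs share an embedding
shuffled = itc_loss(img, np.roll(img, 1, axis=0))  # every pair mismatched
print(perfect < shuffled)  # matched pairs yield a far lower loss
```

Matched pairs lie on the diagonal of the similarity matrix, so the two log-softmax terms pick out exactly the quantities in the equations above.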

Akin to ICL and ITCL, various other contrastive loss functions have also found application, including SimCLR [69, 71], FILIP loss [72], Region-Word Alignment (RWA) [73], and Region-Word Contrastive (RWC) [74] losses.

2.1.2 Generative Objectives

Generative objectives involve teaching networks to produce image or text data, which allows them to acquire semantic features, accomplished through tasks like image generation [75], and language generation [76].

Masked Image Modelling (MIM) involves the acquisition of cross-patch correlations by applying masking and image reconstruction techniques. In MIM, a selection of patches within an input image is randomly masked, and the encoder is trained to reconstruct these masked patches based on the unmasked patches. For a given batch of $B$ images, the loss function is formulated as:

\mathcal{L}_{MIM} = -\frac{1}{B}\sum_{i=1}^{B}\log f_{\theta}\left(\bar{x}_{i}^{I}\mid\hat{x}_{i}^{I}\right),

where $\bar{x}_{i}^{I}$ and $\hat{x}_{i}^{I}$ represent the masked and unmasked patches within $x_{i}^{I}$, respectively [63].
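A minimal sketch of the MIM recipe is shown below. Since the survey does not prescribe a particular architecture, the "model" here is a trivial stand-in (the mean of the visible patches), and the log-likelihood term is instantiated as a pixel-wise squared error on the masked patches only, as is common in MAE-style training; all sizes and the 75% mask ratio are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
B, num_patches, patch_dim = 2, 16, 48  # toy sizes, e.g. a 4x4 grid of flattened patches
images = rng.normal(size=(B, num_patches, patch_dim))

# 1) Randomly mask 75% of the patches in every image (MAE-style ratio, an assumption).
mask_ratio = 0.75
num_masked = int(num_patches * mask_ratio)
mask = np.zeros((B, num_patches), dtype=bool)
for b in range(B):
    mask[b, rng.choice(num_patches, size=num_masked, replace=False)] = True

# 2) A stand-in "model": predicts every patch as the mean of the visible patches.
#    A real MIM model would be an encoder/decoder conditioned on visible patches.
visible_sum = np.where(~mask[..., None], images, 0.0).sum(axis=1, keepdims=True)
visible_mean = visible_sum / (~mask).sum(axis=1, keepdims=True)[..., None]
predictions = np.broadcast_to(visible_mean, images.shape)

# 3) Reconstruction loss, computed ONLY on the masked patches.
mim_loss = ((predictions - images) ** 2)[mask].mean()
print(mim_loss >= 0.0)
```

The key bookkeeping is in step 3: supervision comes exclusively from the masked positions, matching the conditional form of the loss above.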

Masked Language Modelling (MLM) is a widely adopted pretraining objective in Natural Language Processing (NLP). In MLM, a specific percentage of input text tokens is randomly masked, and these masked tokens are reconstructed using the unmasked ones. The loss function for MLM can be expressed as:

\mathcal{L}_{MLM} = -\frac{1}{B}\sum_{i=1}^{B}\log f_{\theta}\left(\bar{x}_{i}^{T}\mid\hat{x}_{i}^{T}\right),

where $\bar{x}_{i}^{T}$ and $\hat{x}_{i}^{T}$ denote the masked and unmasked tokens within $x_{i}^{T}$, respectively, and $B$ denotes the batch size [63].
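Analogously, the MLM pipeline can be sketched with a toy vocabulary: mask a fraction of the tokens, then score a stand-in predictor's probabilities only at the masked positions. The vocabulary, sentence, 15% masking rate, and the uniform "model" are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["[MASK]", "the", "lungs", "appear", "clear", "on", "this", "radiograph"]
MASK_ID = 0
tokens = np.array([1, 2, 3, 4, 5, 6, 7])  # "the lungs appear clear on this radiograph"

# 1) Randomly select ~15% of positions to mask (BERT's rate, an assumption here).
mask = rng.random(len(tokens)) < 0.15
mask[0] = True  # force at least one masked position in this toy run
corrupted = np.where(mask, MASK_ID, tokens)

# 2) A stand-in "model": a uniform distribution over the vocabulary.
#    A real MLM would predict these probabilities from the unmasked context.
probs = np.full((len(tokens), len(vocab)), 1.0 / len(vocab))

# 3) Negative log-likelihood of the true tokens, at masked positions only.
mlm_loss = -np.log(probs[np.arange(len(tokens)), tokens])[mask].mean()
print(round(mlm_loss, 4))  # -log(1/8) = log(8) = 2.0794 for the uniform stand-in
```

As with MIM, the loss is evaluated only where the input was corrupted, which is what forces the model to reconstruct masked content from the surviving context.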

Likewise, diverse additional generative loss functions have been introduced in the field including Masked Multimodal Modeling (MMM) loss [77], Image-conditioned Masked Language Modeling (IMLM) loss [78], and Captioning with Parallel Prediction (CapPa) [79].

2.2 Pre-training Tasks

As discussed in Section 2.1, FM pre-training has been studied with typical approaches, including contrastive and generative objectives. In natural language processing, common pre-training tasks include masked language modeling, where words in the input sequence are randomly hidden and the model predicts these hidden words during pre-training. Another task is next-sentence prediction, where pairs of sentences from distinct documents are presented and the model determines whether the order of these sentences is accurate. Additionally, the denoising auto-encoder task introduces noise into the original text corpus and then aims to reconstruct the pristine input from the noisy version. Likewise, to enable the generalization of learned representations to a range of downstream vision domains, pretext tasks such as inpainting [80], auxiliary supervised discriminative tasks, and data reconstruction tasks [81] are used in the pre-training stage.

2.3 Instruction-Aligning

Instruction-aligning methods aim to let the LM follow human intents and generate meaningful outputs. This process involves fine-tuning the model on a diverse set of tasks with human-annotated prompts and feedback (RLHF) [82], conducting supervised fine-tuning on publicly available benchmarks and datasets augmented with manually or automatically generated instructions, or improving the reasoning ability of LLMs by instructing them to produce a sequence of intermediate steps that ultimately lead to the solution of a multi-step problem (chain-of-thought) [83].

2.4 Prompt Engineering

Prompt engineering refers to a method that adapts a large pre-trained model to new tasks by incorporating task-specific hints, referred to as prompts, enabling predictions based solely on prompts without updating model parameters [62]. In the context of large language models (LLMs), prompting techniques can be categorized into two primary groups depending on the clarity of the templates they employ: "soft prompts" (optimizable, learnable) and "hard prompts" (manually crafted text prompts). Within the "hard prompt" category, there are four subcategories: task instructions, in-context learning, retrieval-based prompting, and chain-of-thought prompting. In contrast, "soft prompts" fall into two strategies: prompt tuning and prefix token tuning, which differ in whether they simply attach new tokens to the input or introduce them into the model’s architecture. In the vision domain, prompt engineering facilitates the acquisition of joint multi-modal representations (e.g., CLIP [68] and ALIGN [84] for image classification), introduces human interaction to foundation models, and employs vision-language models for visual tasks.
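As a toy illustration of the "soft prompt" strategy, the sketch below prepends a small set of learnable prompt vectors to frozen input embeddings, so that only the prompt is optimized while all model parameters stay fixed. Everything here is hypothetical: the frozen encoder is replaced by simple mean pooling, and all dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16              # embedding dimension of the (frozen) model
N_PROMPT = 4        # number of learnable soft-prompt tokens

# Frozen components: token embedding table and a stand-in classification head.
token_embeddings = rng.normal(size=(100, D))   # vocabulary of 100 tokens
W_frozen = rng.normal(size=(D, 3))             # maps pooled features to 3 classes

# The only trainable parameters in prompt tuning: the soft prompt itself.
soft_prompt = rng.normal(size=(N_PROMPT, D))

def forward(input_ids):
    x = token_embeddings[input_ids]               # (T, D) frozen input embeddings
    x = np.concatenate([soft_prompt, x], axis=0)  # prepend learnable prompt tokens
    pooled = x.mean(axis=0)                       # stand-in for the frozen encoder
    return pooled @ W_frozen                      # class logits

logits = forward(np.array([5, 17, 42]))
```

Prefix token tuning differs in that the learnable vectors are injected into every layer of the model's architecture rather than only attached to the input sequence.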

3 Foundational Models for Medical Imaging

Establishing a taxonomy for foundation models in medical imaging analysis follows the standard practices commonly employed in the field. However, we distinguish our approach by providing extensive additional information for each sub-category, as presented in Figure 3. In this section, we explore foundation-model-based methods that have been introduced to tackle diverse challenges in medical imaging analysis through the design of distinct training strategies.

3.1 Textually Prompted Models

3.1.1 Contrastive

Contrastive textually prompted models are increasingly prominent among foundation models for medical imaging. They learn representations that encapsulate the semantics and relationships between medical images and their textual prompts. Leveraging contrastive learning objectives, they draw similar image-text pairs closer in the feature space while pushing dissimilar pairs apart. These models are pivotal for image classification, segmentation, and retrieval tasks. Architectural explorations have ranged from dual-encoder designs, with separate visual and language encoders, to fusion designs that merge image and text representations via decoder- and transformer-based architectures. Their potential in medical imaging tasks such as lesion detection, disease classification, and image synthesis is evident in numerous studies. In this direction, Wang et al. [31] introduced the MedCLIP framework, demonstrating its superiority over state-of-the-art methods in zero-shot prediction, supervised classification, and image-text retrieval. Expanding upon the success of models like CLIP, [34] unveiled BiomedCLIP, tailored for biomedical vision-language processing. Its training on a vast dataset of 15 million figure-caption pairs highlighted the efficacy of specialized pretraining in the medical imaging field.
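The contrastive objective behind models such as MedCLIP and BiomedCLIP can be sketched as a symmetric InfoNCE loss over cosine similarities of paired image and text embeddings. The NumPy illustration below is a minimal sketch, not any specific model's implementation; the toy orthonormal embeddings and the temperature value are assumptions for the example.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalise so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix

    def xent(l):
        # cross-entropy with the matching pair (the diagonal) as the target
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # average of image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

B, D = 4, 32
img_emb = np.eye(B, D)    # 4 toy orthonormal "image" embeddings
loss_aligned    = clip_style_loss(img_emb, img_emb.copy())
loss_mismatched = clip_style_loss(img_emb, np.roll(img_emb, 1, axis=0))
```

With perfectly aligned pairs the loss is near zero; shuffling the pairing drives it up, which is exactly the pressure that pulls matching image-text pairs together in the feature space.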

Visual language pre-training has made significant advancements in representation learning, especially evident in challenging scenarios like zero-shot transfer tasks in open-set image recognition. Nevertheless, computational pathology has not delved deeply into zero-shot transfer due to data scarcity and the challenges presented by gigapixel histopathology whole-slide images (WSIs). Drawing inspiration from the success of multiple-instance learning in weakly supervised learning tasks, [36] introduced MI-Zero (Figure 4). In this method, each WSI is divided into smaller tiles, referred to as instances, which are more manageable for the image encoder. Each instance’s cosine similarity scores at the patch level are calculated independently against every text prompt within the latent space. Following this, instance-level scores are combined to generate slide-level scores using a permutation-invariant operator, similar to those in multiple-instance learning, such as mean or top-K pooling. An optional spatial smoothing step aggregates the information of neighboring patches. When tested on three different real-world cancer subtyping tasks, MI-Zero either matched or outperformed baselines in average median zero-shot accuracy.

Figure 4: Schematic of MI-Zero [36]. A gigapixel WSI is transformed into a set of patches (instances), with each patch being embedded into an aligned visual-language latent space, where the similarity scores between the patch embeddings and the prompt embeddings are combined using a permutation-invariant operation such as top-K max-pooling to generate the classification prediction at the WSI level.
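The aggregation step of MI-Zero can be sketched in a few lines: patch-level cosine similarities against each class prompt are pooled with a permutation-invariant top-K mean into slide-level scores. In the NumPy sketch below, the embeddings, the value of K, and the toy "slide" are all fabricated for illustration.

```python
import numpy as np

def mi_zero_predict(patch_emb, prompt_emb, k=3):
    """Slide-level zero-shot prediction from patch embeddings.

    patch_emb:  (N, D) embeddings of the N tiles (instances) of one WSI
    prompt_emb: (C, D) embeddings of one text prompt per class
    Returns the predicted class index and the (C,) slide-level scores.
    """
    # cosine similarity of every patch against every class prompt
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    sims = p @ t.T                              # (N, C) instance-level scores

    # permutation-invariant top-K pooling: mean of the k highest
    # patch scores per class gives the slide-level score
    topk = np.sort(sims, axis=0)[-k:, :]        # (k, C)
    slide_scores = topk.mean(axis=0)            # (C,)
    return int(slide_scores.argmax()), slide_scores

rng = np.random.default_rng(0)
N, C, D = 50, 2, 64
prompt_emb = rng.normal(size=(C, D))
# Toy slide: most patches are noise, a few resemble the class-1 prompt.
patch_emb = rng.normal(size=(N, D))
patch_emb[:5] = prompt_emb[1] + 0.1 * rng.normal(size=(5, D))

pred, scores = mi_zero_predict(patch_emb, prompt_emb)
```

Top-K pooling lets a handful of strongly matching tiles decide the slide label, which is why the operator suits weakly supervised gigapixel WSIs where most tiles are uninformative.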

In another study, Bannur et al. [32] unveiled the BioViL-T method for biomedical Vision-Language Processing (VLP). Exploiting the data’s temporal structure, BioViL-T reached state-of-the-art levels in tasks such as progression classification, phrase grounding, and report generation. Incorporating prior images and reports considerably enhanced the model’s efficacy in disease classification and sentence-similarity tasks. The hybrid multi-image encoder in BioViL-T adeptly captured spatiotemporal features, proving valuable for tasks demanding dense visual reasoning over time.

Furthermore, Tiu et al. [30] revealed the potential of self-supervised learning models in pathology detection. Their model, CheXzero, showcased accuracies on par with radiologists in pathology classification. Remarkably, it outdid fully supervised models in detecting certain pathologies and demonstrated adaptability to unannotated pathologies, which weren’t specifically included during training. Such results emphasize the strength of contrastive textually prompted models in deciphering medical image interpretation tasks from unannotated data, thus minimizing dependence on extensive labeling.

The body of work presented emphasizes contrastive textually prompted models’ indispensable role in medical imaging. They showcase efficiency, performance enhancements, and an uncanny ability to infer intricate medical connotations. These models offer a promising solution to data scarcity, enriching medical image understanding and ultimately optimizing healthcare delivery.

3.1.2 Generative

Generative models represent another category within the domain of textually prompted models for medical imaging. These models are designed to generate realistic medical images based on textual prompts or descriptions. They employ techniques such as variational autoencoders (VAEs) and generative adversarial networks (GANs) to understand the underlying distribution of medical images, subsequently creating new samples that correlate with given prompts. These models have shown promise in tasks such as producing images of specific diseases, augmenting training data, and crafting images that adhere to attributes detailed in the prompts. They offer valuable tools for data augmentation, anomaly detection, and creating varied medical image datasets for both training and evaluation. Nonetheless, challenges like capturing the intricacies and variability of medical images, maintaining semantic alignment between generated images and prompts, and addressing ethical concerns tied to fabricated medical images persist and warrant further research.

In a notable study, Yan et al. [41] launched Clinical-BERT, a vision-language pre-training model fine-tuned for the medical sector. Pre-training encompassed domain-specific tasks like Clinical Diagnosis (CD), Masked MeSH Modeling (MMM), and Image-MeSH Matching (IMM). Their research demonstrated that Clinical-BERT outperformed its counterparts, especially in radiograph diagnosis and report generation tasks. Such results emphasize the utility of infusing domain-specific insights during the pre-training phase, thereby refining medical image analysis and clinical decision-making.

Singhal et al. [42] put forth Med-PaLM 2, a state-of-the-art large language model (LLM) targeting expert-level competence in medical question answering. By blending foundational LLM enhancements with medical-specific fine-tuning and innovative prompting tactics, the team sought to amplify the model’s proficiency. Med-PaLM 2 exhibited remarkable progress, registering elevated accuracy and better alignment with clinical utility. In pairwise ranking assessments, medical practitioners even preferred Med-PaLM 2’s responses to those written by physicians in terms of clinical relevance. This progression signifies the budding potential of LLMs in the realm of medical inquiries, inching closer to rivaling human physicians.

Moor et al. [43] delved into the creation of Med-Flamingo, a few-shot learner with multimodal capabilities, tailor-made for medical applications. The model underwent pre-training on synchronized and staggered medical image-text data, followed by performance assessment on challenging visual question-answering (VQA) datasets. The outcome revealed that Med-Flamingo augmented generative medical VQA performance by up to 20%, as per clinician evaluations. Moreover, the model demonstrated prowess in addressing intricate medical queries and furnishing comprehensive justifications, surpassing preceding multimodal medical foundational models. These revelations underscore Med-Flamingo’s potential to enrich medical AI paradigms, promote personalized medicine, and bolster clinical decisions.

Collectively, these investigations showcase the strides made in the realm of generative textually prompted models and their implications for the medical sector. Merging domain-specific insights with advancements in language models and multimodal learning techniques has yielded auspicious results in areas like radiograph diagnosis, medical question resolution, and generative medical VQA. Such pioneering works fortify the burgeoning research landscape and chart the course for future innovations in generative models tailored for healthcare applications.

3.1.3 Hybrid

Hybrid textually prompted models distinguish themselves through the integration of training paradigms, specifically leveraging both generative and contrastive methodologies.

In a notable study, Chen et al. [28] unveiled a streamlined computer-aided diagnosis (CAD) system tailored for a specific 3D imaging modality, MRI. Drawing inspiration from BLIP-2 [85], they crafted a language-image pre-training model that employs bootstrapping to amalgamate 3D medical images with textual data via a query mechanism. At the outset, the researchers deployed a patch embedding that was trainable, bridging the disparity between 3D medical images and a previously trained image encoder. This approach markedly diminished the volume of image data requisite for training. Following this, they introduced the MedQFormer, an innovation that harnesses adjustable queries to align visual attributes seamlessly with the linguistic features demanded by a language model. To round off their methodology, they chose BioMedLM [86] as the foundational language model and fine-tuned it by harnessing the LoRA technique [87].
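The LoRA technique used for the fine-tuning step, in broad strokes, freezes a pre-trained weight matrix W and learns only a low-rank update BA, drastically shrinking the number of trainable parameters. The NumPy sketch below is a generic illustration of that idea, not the paper's implementation; the dimensions and rank are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4          # rank r << d: the low-rank bottleneck

W = rng.normal(size=(d_out, d_in))      # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

def lora_forward(x):
    # LoRA: y = Wx + B(Ax); only A and B are updated during fine-tuning
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
y0 = lora_forward(x)    # B is zero at init, so the adapted model starts
                        # exactly equal to the frozen pre-trained one
```

Here only r * (d_in + d_out) = 512 parameters are trainable versus 4096 in W, which is the efficiency argument for applying LoRA to a large language model such as BioMedLM.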

An exhaustive suite of experiments, encompassing over 30,000 image volumes sourced from five public Alzheimer’s disease (AD) datasets, affirmed the model’s prowess. The results spotlighted its proficiency in zero-shot classification, distinguishing healthy individuals, subjects with mild cognitive impairment (MCI), and those diagnosed with AD. This efficacy underscores the model’s potential in executing medical visual question-answering (VQA) tasks with precision.

3.1.4 Conversational

Conversational textually prompted models aim to enable interactive dialogues between medical professionals and the model by fine-tuning the foundational models on specific instruction sets. These models facilitate communication and collaboration between humans and the model, allowing medical experts to ask questions, provide instructions, or seek explanations regarding medical images. By incorporating conversational capabilities, these models enhance the interpretability and usability of foundational models in medical imaging. Researchers have explored various techniques to fine-tune the models on conversational datasets and develop architectures that can effectively process textual prompts in a dialogue context. Conversational textually prompted models hold great potential in medical imaging, enabling improved communication, knowledge transfer, and decision-making processes among medical professionals and AI systems. However, challenges related to understanding context, handling ambiguous queries, and ensuring accurate responses in complex medical scenarios are areas that require further investigation and refinement.

In the study conducted by Li et al. [51], a cost-efficient approach for training a vision-language conversational assistant for biomedical images was introduced. The researchers leveraged a large-scale biomedical figure-caption dataset and utilized GPT-4 to generate instructions from text alone. By fine-tuning a general-domain vision-language model using a curriculum learning method, they developed the LLaVA-Med model. The findings showed that LLaVA-Med outperformed previous state-of-the-art models on certain metrics in three standard biomedical visual question-answering datasets. This highlights the potential of Conversational Textually Prompted Models, such as LLaVA-Med, in assisting with inquiries and answering open-ended research questions about biomedical images.

Another study, conducted by Thawkar et al. [52], focused on the development of XrayGPT, a conversational medical vision-language model designed specifically for analyzing chest radiographs. XrayGPT aligned a medical visual encoder (MedClip) with a fine-tuned large language model (Vicuna) to enable visual conversation abilities grounded in a deep understanding of radiographs and medical knowledge. The study found that XrayGPT demonstrated exceptional visual conversation abilities and a deep understanding of radiographs and medical domain knowledge. Fine-tuning the large language model on medical data and generating high-quality summaries from free-text radiology reports further improved the model’s performance. These findings highlight the potential of Conversational Textually Prompted Models like XrayGPT in enhancing the automated analysis of chest radiographs and aiding medical decision-making.

The study by Shu et al. [57], introduced Visual Med-Alpaca, an open-source parameter-efficient biomedical foundation model that combines language and visual capabilities. Visual Med-Alpaca was built upon the LLaMa-7B architecture and incorporated plug-and-play visual modules. The model was trained using a curated instruction set generated collaboratively by GPT-3.5-Turbo and human experts. The findings showed that Visual Med-Alpaca is a parameter-efficient biomedical model capable of performing diverse multimodal biomedical tasks. Incorporating visual modules and using cost-effective techniques like Adapter, Instruct-Tuning, and Prompt Augmentation made the model accessible and effective. This study emphasizes the importance of domain-specific foundation models and demonstrates the potential of conversational textually prompted models like Visual Med-Alpaca in biomedical applications.

3.2 Visually Prompted Models

Within medical imaging, the recent surge of visually prompted models promises a blend of precision, adaptability, and generalization. These models, informed by the extensive capabilities of foundation models, offer the potential to revolutionize medical image analysis by catering to specific tasks while also adapting to a vast array of modalities and challenges. This section delves into two main trajectories of such models:

  1. Adaptations: As the name suggests, this sub-section explores the adaptations and modifications made to traditional segmentation models, enhancing their specificity and performance for medical imaging tasks. From models that augment SAM’s capabilities for medical images to frameworks that synergize few-shot localization with segmentation abilities, we traverse the journey of various innovations in the realm of medical image segmentation.

  2. Generalist: Moving beyond task-specific adaptability, the models in this sub-section embody the essence of a ’Generalist’ approach. They are designed to encompass a broader spectrum of tasks and data modalities. These models not only process different kinds of medical imaging data but also can integrate patient histories and genomic data, marking a stride towards a more holistic healthcare technology ecosystem.


As we delve deeper into this section, we will uncover the transformative potential of visually prompted models in medical imaging, highlighting both their specialized adaptations and their expansive generalist capabilities.

Figure 5: The SAM-Med2D pipeline [11] involves freezing the image encoder and introducing learnable adapter layers within each Transformer block to assimilate domain-specific expertise in the medical domain. The prompt encoder is fine-tuned using point, Bbox, and mask information, with the mask decoder’s parameters being updated through interactive training.

3.2.1 Adaptations

Traditional medical image segmentation has primarily relied on task-specific models, which, while accurate in their domains, often lack the ability to generalize across multiple tasks and imaging modalities. This necessitates a tailored, resource-intensive approach for each segmentation challenge. The advent of foundation models trained on extensive datasets presents an exciting solution. These models are capable of recognizing and segmenting numerous anatomical structures and pathological lesions across different imaging modalities. However, despite their potential, there are challenges with existing models like SAM, especially when applied to medical images [46]. This necessitates further innovations to extend their capabilities, and one such approach is the Medical SAM Adapter, which bridges the gap and enhances SAM’s performance in the medical domain [46]. This promises an integration of automated processes with specific customization.

Ma and Wang presented MedSAM, a novel foundation model crafted for medical image segmentation [12]. Built using a comprehensive dataset of over a million medical image-mask pairs, MedSAM can address numerous segmentation tasks across various imaging modalities. Its promptable configuration seamlessly blends automation with user-driven customization. MedSAM excelled in tasks, especially in computing pivotal biomarkers like accurate tumor volume in oncology. However, it had some limitations, such as modality representation imbalances in its training data and challenges in segmenting vessel-like structures. Nevertheless, its architecture permits future refinements to cater to specific tasks, emphasizing the adaptability of foundation models in medical image segmentation.

Lei et al. tackled the challenge of the intensive annotation workload inherent in SAM by introducing MedLSAM, a novel framework that synergizes few-shot landmark localization with SAM’s segmentation capabilities [13]. The MedLSAM framework consists of a Localization Anything Model (MedLAM), which employs a shared Pnet to transform support and query patches into 3D latent vectors. During inference, MedLAM initializes a randomly positioned agent in the query image and guides it toward the target landmark. The agent’s trajectory is updated based on a 3D offset computed by the MedLAM model, effectively localizing the landmark coarsely within the query image. This coarse localization is further refined using a Multi-Scale Similarity (MSS) component, enhancing the accuracy of landmark positioning significantly. Having localized the landmarks, the framework transitions to segmentation using both SAM and MedSAM, a specialized version of SAM fine-tuned for medical images. Trained on an extensive dataset of 14,012 CT scans, MedLSAM autonomously generates 2D bounding boxes across slices, facilitating SAM’s segmentation tasks. Impressively, it equaled SAM’s performance on two 3D datasets spanning 38 organs, but with significantly fewer annotations. Designed with forward compatibility in mind, MedLSAM opens doors for integration with evolving 3D SAM models, signaling even more effective segmentation in the medical domain.

Gong et al. tackled the challenges that arise when SAM, originally designed for 2D natural images, is applied to 3D medical image segmentation, particularly for tumor detection [88]. The team introduced a strategy that transforms SAM for 3D medical imaging while retaining most of its pre-trained parameters. By employing a visual sampler for the prompt encoder and a lightweight mask decoder emphasizing multi-layer aggregation, the resulting model, the 3DSAM-adapter, exhibited superior performance. It outperformed leading medical segmentation models in three of four tasks, reaffirming the potential to enhance SAM’s utility in intricate medical imaging tasks.

Figure 6: BiomedGPT [25] demonstrates its versatility in various tasks through pretraining, including unimodal and multimodal approaches, and incorporates object detection for location data. After pretraining, it excels in five downstream tasks, showcasing its data efficiency.

Cheng et al. introduced SAM-Med2D, a specialized model for 2D medical image segmentation [11]. Recognizing the need for domain adaptation, they amassed a substantial dataset of approximately 4.6M images and 19.7M masks, spanning diverse medical modalities. A notable feature of SAM-Med2D is its varied prompt strategies, going beyond bounding boxes and points to incorporate masks, offering a comprehensive interactive segmentation approach as shown in Figure 5. Thorough evaluations showcased its superior performance across various anatomical structures, with remarkable generalization capabilities proven on datasets from the MICCAI 2023 challenge. Despite its prowess, certain challenges remain, particularly with complex boundaries and low-contrast objects. With prospects of integrating natural language interaction, SAM-Med2D stands as a pioneering contribution to medical computer vision research. Building upon the theme of customization, another noteworthy effort is the development of SAMed. This model, unlike its predecessors, employs a low-rank-based finetuning strategy, enabling it to perform semantic segmentation on medical images with only a fraction of SAM’s parameters being updated. This selective approach to parameter adaptation allows SAMed to achieve competitive results, underscoring the potential of customizing large-scale models for specific medical segmentation tasks [47].
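Learnable adapter layers of the kind shown in Figure 5 are commonly realized as residual bottleneck modules inserted into otherwise frozen Transformer blocks. The NumPy sketch below illustrates that general pattern under stated assumptions; the dimensions are hypothetical and this is not SAM-Med2D's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, BOTTLENECK = 64, 8        # adapter squeezes D features down to a small bottleneck

# Trainable adapter weights; the surrounding Transformer block stays frozen.
W_down = rng.normal(size=(BOTTLENECK, D)) * 0.01
W_up   = np.zeros((D, BOTTLENECK))       # zero-init: adapter starts as the identity

def adapter(x):
    """Residual bottleneck adapter: x + W_up @ relu(W_down @ x)."""
    h = np.maximum(W_down @ x, 0.0)      # down-project + ReLU
    return x + W_up @ h                  # up-project and add back to the residual

x = rng.normal(size=D)
y = adapter(x)    # identity at init; fine-tuning shifts it toward the medical domain
```

Because only W_down and W_up (2 * 8 * 64 = 1024 values here) are trained, domain-specific expertise is absorbed at a small fraction of the cost of updating the frozen image encoder.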

In a stride to enhance reliability in medical image segmentation, Deng et al. put forth SAM-U [44], a novel approach employing multi-box prompts for refined uncertainty estimation in SAM predictions. This method significantly improves SAM’s performance, especially on low-quality medical images, and provides crucial insights through generated uncertainty maps, highlighting potential segmentation inaccuracies and serving as an essential guide for clinicians in areas requiring manual annotations. This innovative approach underscores the advancements and adaptability in the realm of medical image segmentation.
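The idea behind multi-box prompting can be sketched as follows: several jittered prompts yield several candidate masks, and their pixel-wise mean and variance provide a consensus segmentation plus an uncertainty map. The NumPy sketch below is a toy illustration of that aggregation, with fabricated masks; it is not SAM-U's actual pipeline.

```python
import numpy as np

def multi_prompt_uncertainty(masks):
    """Pixel-wise consensus and uncertainty from masks produced under
    different box prompts.

    masks: (P, H, W) binary or probability masks, one per prompt.
    """
    masks = np.asarray(masks, dtype=float)
    mean_mask = masks.mean(axis=0)       # consensus segmentation
    uncertainty = masks.var(axis=0)      # high where the prompts disagree
    return mean_mask, uncertainty

# Three hypothetical masks from three jittered box prompts: they agree on
# the object's centre and disagree on one boundary column.
m1 = np.zeros((4, 4)); m1[1:3, 1:3] = 1
m2 = np.zeros((4, 4)); m2[1:3, 1:3] = 1
m3 = np.zeros((4, 4)); m3[1:3, 1:4] = 1   # one prompt extends the boundary

mean_mask, uncertainty = multi_prompt_uncertainty([m1, m2, m3])
```

Pixels where all prompts agree get zero variance, while disputed boundary pixels light up in the uncertainty map, which is exactly the signal a clinician would use to prioritize manual review.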

3.2.2 Generalist

In contrast to their adaptability to specific tasks through prompts, foundational models also offer a ’Generalist’ approach, further disrupting the landscape of medical imaging. These Generalist models expand upon the foundational model capabilities by being intrinsically designed to handle a broader spectrum of medical imaging tasks and data modalities—ranging from X-rays to MRIs, and even incorporating patient histories and genomic data. The key advantage here is their capability for dynamic task specification, often enabled by natural language descriptions, obviating the need for model retraining. This inherent flexibility is further augmented by the models’ ability to formally represent medical knowledge, allowing for reasoned outputs and explanations. The emergence of Generalist models in medical imaging signifies a step towards a more integrated and efficient healthcare technology ecosystem.

Moor et al. [23] delve into the intricacies of developing General-purpose Medical Artificial Intelligence (GMAI), a specialized class of foundation models optimized for the healthcare domain. Unlike conventional medical AI, GMAI models are designed to process multiple data modalities, such as imaging studies and electronic health records, simultaneously. These models are not only capable of complex diagnostic tasks but can also generate treatment recommendations complete with evidence-based justifications. The authors discuss challenges unique to GMAI, including the need for multi-disciplinary panels for output verification and increased susceptibility to social biases due to the complex training data sets. Additionally, they raise concerns over patient privacy and the computational and environmental costs associated with model scaling. The paper underscores that the success of GMAI hinges on rigorous validation and ongoing oversight to mitigate these risks while harnessing its transformative potential in healthcare.

Tu et al. extend the pioneering work of Med-PaLM and Med-PaLM 2 [42] to introduce Med-PaLM M, a groundbreaking multimodal biomedical AI system capable of handling diverse medical modalities, including medical imaging, genomics, and electronic health records [22]. Building upon the foundational achievements of Med-PaLM, the first AI to surpass the pass mark on USMLE-style questions, and the subsequent improvements of Med-PaLM 2, which reached an accuracy of 86.5% on the same questions, Med-PaLM M employs a fusion of a Vision Transformer (ViT) for visual tasks and a large language model (LLM) for natural language tasks. These components are fine-tuned on the newly assembled MultiMedBench dataset. Med-PaLM M eclipses existing benchmarks, including specialized single-task models and predecessor generalist models such as PaLM-E that lacked biomedical fine-tuning. Notably, the system exhibits unprecedented zero-shot learning capabilities, successfully identifying tuberculosis in chest X-ray images without prior training [24]. It also excels at generating radiology reports, rivaling expert radiologists in human evaluations. While the study highlights the scalability and promise of multimodal AI models for a range of biomedical tasks, it also acknowledges existing challenges, such as data scarcity and the limitations of current benchmarks. The work serves as a seminal contribution, marking a new frontier in biomedical AI, albeit with cautionary notes on safety and equity considerations for real-world applications.

Zhang et al. introduce BiomedGPT [25], a unified framework that is trained across multiple modalities—including radiographs, digital images, and text—to perform a diverse range of tasks in the biomedical domain as shown in Figure 6. The model particularly excels in image classification on MedMNIST v2 datasets and visual question-answering on SLAKE and PathVQA, setting new state-of-the-art benchmarks. However, it lags in text-based tasks such as natural language inference on the MedNLI dataset. One reason for this performance gap is the model’s constrained scale; with only 182 million parameters, it is smaller than other state-of-the-art models. The study also pinpoints the model’s sensitivity to task instructions and challenges with handling out-of-distribution data as areas for future research. Nonetheless, BiomedGPT represents a significant step towards a versatile, generalist model in the biomedical field, capable of both vision and language tasks.

Wu et al. introduce the Radiology Foundation Model (RadFM) and the MedMD dataset, aiming to unify medical tasks and integrate diverse radiological images [26]. RadFM effectively merges medical scans with natural language, addressing various medical tasks. The study unveils RadBench, a benchmark demonstrating RadFM’s superior synthesis of visual and textual information. Despite the advancements, the authors highlight limitations, such as the prevalence of 2D images in the dataset and challenges in generating clinically useful sentences. Wu et al.’s release of these innovations significantly advances radiological models, encourages collaborative progress, and emphasizes the need for enhanced evaluative metrics and comprehensive solutions in the field.

In another study, Zhou et al. introduce RETFound [27], a versatile foundation model developed through self-supervised learning, trained on 1.6 million unlabeled retinal images. It demonstrates unparalleled adaptability and generalizability in diagnosing eye diseases and predicting systemic disorders with notable accuracy and reduced reliance on extensive annotation. RETFound overcomes significant barriers related to data limitations and model generalization, offering a pioneering solution in medical AI, with the potential to democratize and significantly advance healthcare AI applications.
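RETFound's self-supervised pretraining follows the masked-autoencoder recipe noted in Table 1. The masking step at its core can be sketched as follows; this is a schematic illustration under the standard MAE setup (non-overlapping patches, a fixed mask ratio), not the actual RETFound code:

```python
import numpy as np

def random_mask_patches(image, patch=16, mask_ratio=0.75, seed=0):
    """MAE-style masking: split an image into non-overlapping patches and
    hide a fixed ratio of them. The encoder sees only the visible patches;
    a decoder is trained to reconstruct the hidden ones."""
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch
    n = gh * gw
    rng = np.random.default_rng(seed)
    n_masked = int(n * mask_ratio)
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_masked, replace=False)] = True

    # Gather the visible (unmasked) patches in row-major order.
    visible = []
    for idx in np.flatnonzero(~mask):
        r, c = divmod(idx, gw)
        visible.append(image[r * patch:(r + 1) * patch,
                             c * patch:(c + 1) * patch])
    return np.stack(visible), mask
```

With a 64x64 image and 16-pixel patches, 12 of the 16 patches are hidden at the default 75% ratio, so only 4 patches reach the encoder, which is what makes this pretraining cheap at scale.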

Figure 7: Extensions of the SAM model for diverse medical image segmentation tasks [11]. This figure illustrates the versatility of SAM-based adaptations in addressing a wide range of medical image segmentation challenges, showcasing their applicability and adaptability across various healthcare scenarios.

4 Discussion

In the dynamic landscape of foundational models for medical imaging, each direction outlined in our taxonomy (Figure 3) brings its own set of advantages and distinctive capabilities to the forefront. These divergent paths cater to specific needs, creating a diversified toolkit for addressing the multifaceted challenges of the medical imaging domain. As we delve into this discussion, we will explore the unique advantages of each direction and consider scenarios where one direction might excel over the others, all while peering into how these models learn feature representations and the implications thereof.

Textually Prompted Contrastive Models: These models have shown remarkable prowess in bridging the semantic gap between medical images and text. By leveraging contrastive learning, these models can extract meaningful representations from unpaired medical image-text data, thereby reducing the dependence on vast amounts of labeled data. This approach is particularly advantageous in scenarios where labeled data is scarce or expensive to obtain, such as rare medical conditions or specialized imaging modalities. Contrastive models excel at capturing subtle medical meanings and are well-suited for tasks like zero-shot prediction in medical image-text tasks. For instance, in scenarios where a new, uncharacterized medical condition arises, these models can adapt swiftly by simply providing textual descriptions.
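The image-text alignment these models rely on can be made concrete with a minimal sketch of the symmetric InfoNCE objective used in CLIP-style pretraining. This is illustrative only; MedCLIP, for instance, replaces InfoNCE with a semantic matching loss, and the `temperature` hyperparameter shown here is exactly the kind of knob whose tuning the next paragraph cautions about:

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss over a batch.
    Matched image/text pairs share a row index; every other row in the
    batch serves as a negative."""
    # L2-normalise so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))   # diagonal = matched pairs

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

When image and text embeddings of true pairs point in the same direction, the loss is near zero; shuffling the pairing drives it up, which is the signal that pulls matched reports and scans together in embedding space.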

However, there are limitations to consider. Contrastive models might struggle with highly complex medical images or intricate pathologies, where the nuances demand a deeper level of feature representation. Additionally, they may still rely on the availability of large-scale text data, which could be a bottleneck in some cases. The contrastive learning process also hinges on careful tuning of hyperparameters, making it essential to invest time in fine-tuning for optimal performance.

Textually Prompted Generative Models: Textually prompted generative models, exemplified by models like Clinical-BERT [41] and Med-Flamingo [43], offer the ability to generate detailed responses and explanations for medical image-related queries. They excel in tasks requiring a deep understanding of the medical domain, making them invaluable in clinical decision support systems, medical education, and generating radiology reports.

These generative models can be a game-changer when interpretability and reasoning are crucial. For instance, in a clinical setting, generating explanations for a model’s predictions can enhance trust and facilitate collaboration between AI systems and medical professionals. In educational contexts, they can serve as powerful tutors, providing in-depth explanations and context.

Nevertheless, generative models are computationally intensive and demand significant training data. They may not be the most efficient choice for scenarios where quick, lightweight predictions are required. Additionally, they may face challenges in generating text that is both informative and concise, which could be important in some applications.

Textually Prompted Hybrid Models: Hybrid models, as represented by MedBLIP [28], combine the strengths of generative and contrastive methodologies. These models tackle the challenge of integrating textual data with 3D medical images, often a complex task due to the inherent differences in data modalities.

One of the key advantages of hybrid models is their potential for zero-shot prediction in medical image-text tasks. They seamlessly align visual attributes with linguistic features, making them adept at executing medical visual question-answering tasks with precision. For example, in cases where medical professionals need to quickly diagnose conditions based on both images and textual descriptions, hybrid models can provide valuable support.

Yet, hybrid models may face challenges related to the design of effective integration mechanisms between textual and visual data. The success of these models often relies on the quality of the alignment between different modalities. Additionally, they may require substantial computational resources for training and fine-tuning.
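As a hedged illustration of such an integration mechanism, a minimal bridge projects pooled 3D-image features into the language model's token-embedding space as a few "soft visual tokens". MedBLIP itself employs a more elaborate query-based module; the function name, shapes, and the linear projection here are illustrative assumptions:

```python
import numpy as np

def project_visual_to_text(vis_feats, W, b, n_tokens=4):
    """Map a pooled visual feature vector into a language model's
    token-embedding space as a short sequence of soft visual tokens.
    A stand-in for the vision-language bridge in hybrid models.

    vis_feats: (d_vis,) pooled visual features.
    W: (n_tokens * d_txt, d_vis) learned projection; b: matching bias.
    """
    flat = W @ vis_feats + b            # (n_tokens * d_txt,)
    d_txt = flat.size // n_tokens
    # These rows are prepended to the text token embeddings at inference.
    return flat.reshape(n_tokens, d_txt)
```

The design choice being sketched: only the small projection is trained, while the frozen image encoder and language model supply the representations, which keeps the alignment cost low.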

Textually Prompted Conversational Models: Conversational models, like LLaVA-Med [51] and XrayGPT [52], are designed to enable interactive dialogues between medical professionals and AI systems. These models are particularly beneficial in scenarios where medical experts need to ask questions, seek explanations, or instruct AI systems regarding medical images. One of the most significant advantages of conversational models is their potential to enhance communication and collaboration between humans and AI. They can facilitate knowledge transfer, clarify doubts, and provide detailed explanations for complex medical images. In a clinical context, this can lead to more informed decision-making and better patient care.

However, conversational models face the challenge of understanding context and handling ambiguous queries effectively. Ensuring accurate responses in complex medical scenarios remains an ongoing research challenge. Additionally, they require careful fine-tuning on conversational datasets to perform optimally.

Visually Prompted Adaptations Models: Visually prompted adaptation models, such as MedSAM [12], MedLSAM [13], 3DSAM-adapter [88], SAM-Med2D [11], SAMed [47], and SAM-U [44], focus on enhancing the specificity and performance of medical image segmentation tasks. These models adapt foundational models like SAM to the medical domain, addressing challenges such as data scarcity and complex boundaries. A sample of segmentation results on various medical image analysis tasks achieved by the adapted SAM model is presented in Figure 7, showcasing its remarkable performance in diverse medical imaging scenarios and its robust generalization power.
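The multi-box uncertainty idea behind SAM-U can be sketched with a stand-in predictor: several jittered box prompts are run through a promptable segmenter, and per-pixel disagreement between the resulting probability maps serves as the uncertainty estimate. The `predict_mask` callable below is a hypothetical placeholder; a real pipeline would call an actual SAM predictor with each box:

```python
import numpy as np

def multi_prompt_uncertainty(predict_mask, image, box,
                             n_prompts=8, jitter=5, seed=0):
    """SAM-U-style uncertainty estimation from perturbed box prompts.

    predict_mask(image, box) -> (H, W) probability map; stand-in for a
    promptable segmenter such as SAM. Returns the fused soft mask and a
    per-pixel uncertainty map (variance across prompts)."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_prompts):
        # Perturb each box coordinate independently by up to `jitter` px.
        shifted = np.asarray(box) + rng.integers(-jitter, jitter + 1, size=4)
        probs.append(predict_mask(image, shifted))
    probs = np.stack(probs)          # (n_prompts, H, W)
    mean_prob = probs.mean(axis=0)   # fused soft segmentation
    uncertainty = probs.var(axis=0)  # high where prompts disagree
    return mean_prob, uncertainty
```

With a toy predictor that fills the prompted box with ones, the interior of the object stays certain while pixels near the boundary, where jittered boxes disagree, accumulate nonzero variance, which is exactly the map a clinician would inspect for likely segmentation errors.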

Table 1: Overview of the reviewed foundation models in medical imaging, based on their algorithm choice presented in our taxonomy, Figure 3. For each algorithm family, the networks, core ideas, and practical use cases are listed.
Textually Prompted Contrastive Models
Networks: MedCLIP [31], BioViL-T [32], CheXzero [30], MI-Zero [36]
Core ideas:
• MedCLIP [31] tackles the difficulties of vision-text contrastive learning on medical images and reports. It decouples images and text for multimodal contrastive learning, scaling combinatorial training data at low cost to address medical data shortages. The study replaces InfoNCE with a medical-knowledge-based semantic matching loss to eliminate false negatives in contrastive learning. MedCLIP captures subtle but important medical semantics and outperforms baselines on zero-shot prediction, supervised classification, and image-text retrieval, demonstrating an effectiveness and data efficiency that can enhance clinical decision-making and downstream tasks.
• BioViL-T [32] exploits the temporal structure of data to improve biomedical vision-language processing (VLP). This pre-training framework trains and fine-tunes on prior images and reports, using temporal correlations and a multi-image encoder to handle missing images and longitudinal data without image registration. Analyzing the temporal relationship between images and reports improves modality alignment, enhancing both pre-training and downstream task performance. The study reports state-of-the-art results in progression classification, phrase grounding, and report generation; temporal and non-temporal tasks such as pneumonia detection benefit from prior context and temporal knowledge. The authors also release MS-CXR-T, a multimodal benchmark dataset curated by an expert radiologist for evaluating the temporal semantics of chest X-ray VLP models via image-text temporal correlations.
• CheXzero [30] offers a novel approach to pathology classification in medical imaging, using self-supervised learning without annotations to diagnose diseases in unannotated chest X-rays. Large labeled datasets are expensive and time-consuming for traditional machine-learning approaches to medical image interpretation; this work shows that a self-supervised system trained on unannotated chest X-rays can classify disease as well as radiologists. CheXzero learns a representation for zero-shot multi-label classification through contrastive learning on image-text pairs, drawing natural-language supervision from radiology reports without fine-tuning on labeled data, and generalizes to diverse image interpretation tasks and datasets. The natural labeling in radiology reports lets the self-supervised algorithm match professional radiologists and fully supervised approaches on unseen disorders, eliminating explicit labeling and the workflow inefficiencies of large-scale annotation.
Practical use cases:
• Zero-shot prediction in medical image-text tasks; supervised classification in medical image analysis; image-text retrieval in the medical domain; and support for clinical decision-making and downstream clinical tasks [31].
• Progression classification with state-of-the-art performance in tracking medical condition progression; phrase grounding, linking clinical report phrases to image regions for enhanced analysis; report generation improved by incorporating prior reports; consistent improvement in disease classification tasks; and state-of-the-art pneumonia detection [32].
• Automation of complex medical image interpretation tasks; disease diagnosis; improved diagnostic and label efficiency; decreased reliance on large labeled datasets, reducing labeling effort and cost; and the potential to learn a broad range of medical image interpretation tasks from unlabeled data [30].
• Zero-shot transfer for cancer subtype classification on three WSI datasets; moreover, the curated dataset of histopathology image-caption pairs can potentially be generalized and adapted to develop practical solutions in other domains [36].
Textually Prompted Generative Models
Networks: Clinical-BERT [41], Med-PaLM 2 [42], Med-Flamingo [43]
Core ideas:
• Clinical-BERT [41] is a medical pre-training paradigm built on domain-specific pre-training tasks, including Clinical Diagnosis (CD), Masked MeSH Modeling (MMM), and Image-MeSH Matching. MeSH terms in radiograph reports are emphasized and aligned with radiographs using region and word sparse attention, an attention mechanism that links visual features with MeSH phrases. Clinical-BERT achieves state-of-the-art results in radiograph diagnosis and report generation, showing that domain-specific pre-training tasks and MeSH keywords improve performance on medical tasks.
• Med-PaLM 2 [42] applies LLMs to expert-level medical question answering, aiming to close the gap between model and clinician responses. While LLMs have advanced across disciplines and can address medical questions, prior LLM-based models still fall short of clinician responses. The work therefore combines base-LLM improvements (PaLM 2), medical-domain fine-tuning, and a new ensemble refinement approach to strengthen medical reasoning and results.
• Med-Flamingo [43] is a vision-language model pre-trained on medical image-text data from various sources that generates open-ended replies from textual and visual input. Thanks to in-context learning, it outperforms prior models on generative medical visual question answering by 20% in clinical evaluation scores. The work also introduces Visual USMLE, a challenging VQA dataset of medical questions, images, and case vignettes, and argues that multimodal few-shot and in-context learning improve medical AI models.
Practical use cases:
• Radiograph diagnosis and report generation with state-of-the-art results on challenging datasets; enhanced downstream tasks and improved performance across the medical domain; and acquisition of domain-specific medical knowledge for better performance [41].
• Medical question answering with accurate and reliable answers; assistance in preparing for medical licensing examinations; clinical decision support that aids physicians during patient care; and trustworthy consumer health information for the general public [42].
• Generative medical visual question answering (VQA); medical reasoning and rationale generation; clinical evaluation and human-rater studies; and dataset creation for pre-training and evaluation [43].
Textually Prompted Hybrid Models
Networks: MedBLIP [28]
Core ideas:
• Extends a 2D image encoder to extract features from 3D medical images and obtains a lightweight language model for computer-aided diagnosis (CAD).
• Aligns different types of medical data into the common space of language models, alongside collecting the largest public dataset for studying Alzheimer's disease (AD).
Practical use cases:
• Zero-shot prediction in medical image-text tasks.
• Zero-shot medical visual question answering (VQA), producing an initial diagnosis for an unseen case by analyzing input images and textual descriptions while also offering explanations of the decision-making process.
Textually Prompted Conversational Models
Networks: LLaVA-Med [51], XrayGPT [52], Visual Med-Alpaca [57], PMC-LLaMA [89], ClinicalGPT [50], Radiology-Llama2 [53]
Core ideas:
• LLaVA-Med [51] trains a low-cost vision-language conversational assistant for biomedical imagery on a large PubMed Central figure-caption dataset. Caption data and GPT-4 self-instruction yield open-ended instruction-following data, and a novel curriculum first aligns biomedical vocabulary through figure-caption pairs and then learns open-ended conversational semantics, resembling how a layperson gradually acquires biomedical knowledge. Fine-tuned LLaVA-Med shows strong multimodal conversational ability, answers questions about biomedical images, and outperforms supervised state of the art on biomedical visual question answering. The instruction-following data and the LLaVA-Med model are released for biomedical multimodal learning research.
• XrayGPT [52] is a conversational medical vision-language model that answers open-ended questions about chest radiographs. It combines MedClip visual features with Vicuna's textual knowledge to assess radiographs with medical domain expertise, and produces interactive, high-quality free-text radiology report summaries that enhance automated chest radiograph processing.
• Visual Med-Alpaca [57] proposes "visual medical specialists" for multimodal biomedical tasks. The model is instruction-tuned using GPT-3.5-Turbo and human experts, and plug-and-play visual modules integrate text and vision for multimodal applications; it is open-source and inexpensive to deploy.
• PMC-LLaMA [89] is an open-source medical language model built by systematically adapting a general-purpose language model to medicine through medical expertise: 4.8 million biomedical academic papers and 30,000 medical textbooks. Fine-tuning on domain-specific instructions improves medical accuracy and reasoning; the model applies medical expertise to case facts, offers well-justified recommendations, and gains alignment ability to adapt to different tasks without task-specific training.
• ClinicalGPT [50] is a clinically optimized language model that addresses the factual errors and limited medical grounding of general large language models. Its training combines medical records, domain-specific knowledge, and multi-round dialogues, providing context and expertise for clinical tasks. A comprehensive evaluation covers medical knowledge question answering, exams, patient consultations, and diagnostic analysis of medical records, and parameter-efficient fine-tuning further improves the model for clinical use, where it outperforms other large language models in healthcare.
• Radiology-Llama2 [53] aligns the model with task-specific user objectives, develops radiology-specific language models, and evaluates and improves generated impressions, demonstrating better clinical impression generation than other generative language models; the authors argue that personalized language models can automate radiology workflows and augment human competency.
Practical use cases:
• Multimodal conversational assistant: LLaVA-Med demonstrates excellent multimodal conversational capability and can assist with inquiries about biomedical images; biomedical visual question answering (VQA), where it outperforms previous state-of-the-art methods on certain metrics; and empowering biomedical practitioners with assistance on open-ended research questions and improved understanding of biomedical images [51].
• Automated analysis of chest radiographs; concise summaries highlighting key findings and overall impressions; interactive engagement through follow-up questions; clinical decision support with valuable insights for medical professionals; and new avenues for research in automated chest radiograph analysis [52].
• Interpreting radiological images; addressing complex clinical inquiries; providing information on chemicals for hair-loss treatment (as a case study); supporting healthcare professionals in diagnosis, monitoring, and treatment; and enabling prompt generation for specialized tasks such as radiology image captioning [57].

Visually Prompted Adaptations Models
Networks: MedSAM [12], MedLSAM [13], 3DSAM-adapter [88], SAM-Med2D [11], SAMed [47], SAM-U [44]
Core ideas:
• MedSAM [12]: versatile and accurate delineation of anatomical structures and pathologies across various medical imaging modalities, surmounting the challenges of modality imbalance and intricate segmentation.
• MedLSAM [13]: merged localization and segmentation with a shared 3D coordinate system for streamlined, precise 3D medical image analysis.
• 3DSAM-adapter [88]: improves accuracy in decoding spatial patterns in volumetric data through enhanced, lightweight 3D medical image segmentation, focusing particularly on tumors.
• SAM-Med2D [11]: optimized for precise 2D medical image segmentation, utilizing diverse prompts and refinements.
• SAMed [47]: a low-rank-based fine-tuning strategy for specialized medical image segmentation, maintaining minimal costs and enhanced capabilities.
• SAM-U [44]: multi-box prompts refine SAM's segmentation with pixel-level uncertainty estimation, increasing accuracy and providing nuanced image understanding.
Practical use cases:
• Pivotal for a range of clinical applications including efficient segmentation, diagnosis, treatment planning, disease monitoring, and accurate tumor volume computation in oncology, contributing to personalized patient care and improved health outcomes [12].
• A versatile, scalable foundation in medical imaging that reduces annotation burdens and provides accurate, automated segmentation across medical disciplines, enhancing diagnostic procedures [13].
• Facilitates clinical diagnosis, treatment planning, and medical R&D through improved segmentation, and serves as a blueprint for domain-specific adaptations, enhancing medical imaging automation and processes [88].
• Enables accurate medical image analysis, offering insights for researchers and advancing medical computer vision and interactive segmentation [11].
• A crucial tool in computer-assisted medical diagnosis, excelling in multi-organ segmentation tasks and fully compatible with the existing SAM system, offering enhanced accessibility and utility in real-world medical settings [47].
• Valuable for pixel-level uncertainty estimation in segmentation, aiding precise diagnoses and the identification of segmentation errors, especially in fundus images; it enriches clinical analyses and fosters the development of advanced segmentation methods [44].

Visually Prompted Generalist Models
Networks: GMAI [23], BiomedGPT [25], Med-PaLM M [22], RadFM [26], RETFound [27]
Core ideas:
• GMAI [23]: self-supervision on diverse datasets enables multifunctional medical tasks with minimal labeled data, adapting to new tasks and supporting dynamic interaction and advanced reasoning.
• BiomedGPT [25]: excels in diverse tasks with a single set of model weights, surpassing specialized models and offering versatile zero-shot generalization in biomedicine.
• Med-PaLM M [22]: multi-task pretraining for knowledge transfer to unseen data, aiming to establish new benchmarks in biomedicine.
• RadFM [26]: adeptly integrates and analyzes multidimensional medical scans together with natural language.
• RETFound [27]: masked-autoencoder techniques identify retinal structures and patterns related to ocular and systemic diseases such as heart failure, showing high adaptability across tasks.
Practical use cases:
• Potential applications in generating radiology reports and aiding medical procedures, reducing radiologist workload through automated, contextual report drafting and visualization, and supporting surgical teams with real-time annotations, alerts, and medical reasoning, thereby improving healthcare delivery [23].
• Efficiently performs tasks such as image classification and report generation, supporting clinical decisions and diagnostics; serves versatile medical needs with reliable interpretations of diverse biomedical data, particularly where specialized models are unattainable and integrated insights are vital [25].
• Uncovers insights essential for healthcare advancement by integrating information from various medical fields for diverse applications and analyses, requiring no fine-tuning modifications and efficiently solving real-world problems [22].
• Integrates multiple images, essential for longitudinal follow-ups and diverse scenarios, helping professionals generate accurate, context-rich reports and plans by understanding both visual and textual medical data [26].
• Useful in clinical settings for early detection and risk assessment; offers a data-efficient solution that minimizes annotation effort, promoting broader implementation across clinical applications and contributing to the democratization of advanced healthcare AI technologies [27].

The primary advantage of adaptation models is their ability to excel in specialized medical image segmentation tasks. For instance, in scenarios where precise tumor volume calculation is critical, models like MedSAM can provide accurate results. These models are tailored for the medical domain, making them well-suited for specific clinical applications.

Nonetheless, adaptation models may require substantial annotated data for fine-tuning. They might face challenges in scenarios with limited labeled data, as achieving the desired level of performance could be challenging. Additionally, they might not be the most efficient choice for tasks that require generalization across diverse medical imaging modalities.

Visually Prompted Generalist Models: Visually prompted generalist models, exemplified by models like BiomedGPT [25] and Med-PalM M [22], offer versatility by handling a wide spectrum of medical imaging tasks and data modalities. They can seamlessly switch between tasks without the need for extensive retraining, making them suitable for dynamic healthcare environments.

The key advantage of generalist models is their flexibility. In scenarios where medical professionals need a single model that can handle various tasks, such as image classification, text generation, and question-answering, these models shine. Their ability to reason across different modalities and provide informed responses is invaluable in clinical decision support and medical research.

However, generalist models might face challenges related to task-specific fine-tuning. Achieving state-of-the-art performance in highly specialized tasks might require additional domain-specific data. Moreover, these models need robust mechanisms for handling out-of-distribution data effectively.

In conclusion, the choice between these directions largely depends on the specific use case and requirements. While textually prompted models excel in tasks requiring interpretation and detailed explanations, visually prompted models dominate in segmentation and image-specific tasks. Conversational models bridge the gap between human experts and AI systems, facilitating collaborative decision-making. The choice ultimately comes down to the nature of the problem, the availability of data, and the need for adaptability or versatility in the medical imaging domain. Each direction contributes to the evolving landscape of foundational models, offering a rich tapestry of tools to tackle diverse healthcare challenges. To facilitate comprehension, we have presented the benefits, drawbacks, and practical applications of each direction in Section 4. We also showcase the timeline of the reviewed papers in the past quarters, as illustrated in Figure 8. This figure presents a chronological overview of the key milestones and developments in the field, highlighting the rapid evolution and growing significance of textually prompted and visually prompted foundation models in medical imaging. Expanding on this, the timeline provides valuable insights into the progression of research, revealing the emergence of novel techniques as well as conversational and generalist models.

Figure 8: Timeline of advancements in textually and visually prompted foundation models in medical imaging over the past quarters. This timeline illustrates significant progress and breakthroughs in the field, spanning the last five quarters, highlighting the dynamic nature of research and innovation in textually and visually prompted foundation models for medical image analysis.

4.1 Hardware Requirements and Dataset Sizes

In the pursuit of advancing foundational models in medical imaging, it is imperative to consider the practical aspects of implementing these models in real-world healthcare settings. Two crucial aspects in this regard are the hardware requirements and dataset sizes, both of which significantly influence the feasibility and scalability of deploying these models.

Hardware Requirements: Many foundational models, owing to their immense complexity and scale, have substantial hardware requirements. While these models deliver remarkable performance, their training and inference often demand significant computational resources. For instance, models like BiomedGPT, Clinical-BERT, and Visual Med-Alpaca, with millions to billions of parameters, necessitate high-end GPUs or even dedicated hardware accelerators for efficient operation. It is essential to acknowledge that the hardware investment required for these models may present a challenge for resource-constrained healthcare institutions. Therefore, striking a balance between model performance and hardware feasibility is a crucial consideration when implementing these models in clinical practice. Future research should explore strategies to optimize these models for deployment on less resource-intensive hardware, making them accessible to a wider range of healthcare facilities.

Dataset Sizes: Another noteworthy aspect is the size of the datasets used to train and fine-tune foundational models. Larger datasets often result in improved model performance and generalization, but they can be challenging to obtain in the medical domain due to privacy concerns and the labor-intensive nature of medical data annotation. Several papers in our survey have employed datasets with varying sizes, from thousands to millions of medical images and reports. Understanding the dataset size requirements for achieving state-of-the-art results is vital for healthcare practitioners and researchers.
While some models demonstrate exceptional performance with relatively small datasets, others rely on extensive datasets to excel in complex medical tasks. Future research should explore techniques for efficient dataset collection, augmentation, and utilization, enabling the development of models that can perform well with limited data while preserving patient privacy.
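As a rough illustration of these hardware considerations, the sketch below estimates the training-time GPU memory footprint of a model from its parameter count alone. The per-parameter byte count assumes mixed-precision training with Adam (fp16 weights and gradients, fp32 master weights, and two fp32 optimizer moments); activations and framework overhead are excluded, so these are back-of-the-envelope approximations, not measurements.

```python
def training_memory_gb(n_params: float) -> float:
    """Rough GPU memory estimate (GB) for mixed-precision Adam training.

    Per parameter: 2 B fp16 weights + 2 B fp16 gradients
    + 4 B fp32 master weights + 8 B Adam moments = 16 B.
    Activation memory and framework overhead are not included.
    """
    bytes_per_param = 16
    return n_params * bytes_per_param / 1024**3

# A 7B-parameter conversational model needs roughly 104 GB for model and
# optimizer states alone, which is consistent with the multi-A100 setups
# reported for the conversational models in Table 2.
print(round(training_memory_gb(7e9), 1))
```

Such estimates help explain why the surveyed conversational models are trained across many 40-80 GB accelerators while smaller adaptation models fit on a single consumer GPU.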

To provide a more detailed overview of the hardware requirements and dataset sizes reported in the reviewed papers, we present Table 2, which lists sample hardware configurations utilized for training each network, alongside details regarding input sizes, batch sizes, and training schedules.

Table 2: A summary of publicly available information about medical foundational models, their computational demands and training information. The unavailable information is featured with a dash.
ID | Category | Sub-category | Short name | GPU Model | Number of GPUs | GPU Memory (GB) | Total GPU Memory (GB) | Training Time (GPU Hours) | Input Size | Total Batch Size | Epochs
1 | TPM | Contrastive | MedCLIP | Nvidia RTX 3090 | 1 | 24 | 24 | 8 | 224x224 | 100 | 10
2 | TPM | Contrastive | BioViL-T | Nvidia Tesla V100 | 8 | 32 | 256 | - | 448x448 | 240 | 50, 100
3 | TPM | Contrastive | CLIPDM-OTS | NVIDIA RTX A5000 | 8 | 24 | 192 | - | 96x96x96 | 42 | 50
4 | TPM | Contrastive | PTUnifier | NVIDIA A100 | 4 | 80 | 320 | - | 288x288-384x384 | 16-128 | 11-60
5 | TPM | Contrastive | BiomedCLIP | NVIDIA A100 | 16 | 40 | 640 | - | 224x224-336x336 | 4k-64k (context) | 40
6 | TPM | Contrastive | KoBo | Nvidia RTX 3090 | 2 | 24 | 48 | - | - | 100 | 50
7 | TPM | Contrastive | MI-Zero | NVIDIA A100 | 8 | 80 | 640 | - | 448x448 | 512 | 50
8 | TPM | Contrastive | CITE | GeForce GTX 2080 Ti | 2 | 11 | 22 | 0.37 | 224x224 | 128 | (1000 iterations)
9 | TPM | Generative | Clinical-BERT | Nvidia RTX 3090 | 2 | 24 | 48 | 96 | 224x224 | 256 | 50
10 | TPM | Generative | Med-Flamingo | Nvidia A100 | 8 | 80 | 640 | 1296 | - | 400 | -
11 | TPM | Hybrid | MedBLIP | Nvidia RTX 3090 | 1 | 24 | 24 | - | 224x224x224 | 7 | 100
12 | TPM | Hybrid | VLM for VQA in MI | GeForce GTX 1080 Ti | 1 | 11 | 11 | - | 224x224 | 50 | 50
13 | TPM | Conversational | DeID-GPT | Nvidia RTX 3090 | >1 | 24 | - | - | - | - | -
14 | TPM | Conversational | ChatDoctor | Nvidia A100 | 6 | 80 | 480 | 18 | max-seq-len: 2048 | 192 | 3
15 | TPM | Conversational | PMC-LLaMA | Nvidia A100 | 32 | 80 | 2560 | - | max-seq-len: 2048 | img: 256, text: 3200 | 8
16 | TPM | Conversational | LLaVA-Med | Nvidia A100 | 8 | 40 | 320 | 120 | - | 128 | 100
17 | TPM | Conversational | Radiology-Llama2 | Nvidia A100 | 4 | 80 | 320 | - | - | 128 | -
18 | VPM | Adaptations | SAMed | Nvidia RTX 3090 | 2 | 24 | 48 | - | 512x512 | 12 | 200
19 | VPM | Adaptations | MedSAM | Nvidia A100 | 20 | 80 | 1600 | - | 1024x1024 | 160 | 100
20 | VPM | Adaptations | AutoSAM | NVIDIA Tesla V100 | 1 | 16 | 16 | - | 1024x1024 | 4 | 120
21 | VPM | Adaptations | LVM-Med | Nvidia A100 | 16 | 80 | 1280 | 2688 | 224x224-1024x1024 | 16-64 | 20-200
22 | VPM | Adaptations | SAM-Med2D | Nvidia A100 | 8 | 80 | 640 | - | 256x256 | - | 12
23 | VPM | Generalist | SAM-B-ZSS | Nvidia RTX 3080 | 1 | 10 | 10 | - | 1024x1024 | 1 | 20
24 | VPM | Generalist | RadFM | Nvidia A100 | 32 | 80 | 2560 | - | 256 (3D), 512 (2D) | 1 (3D), 4 (2D) | 8
25 | VPM | Generalist | RETFound | Nvidia A100 | 8 | 40 | 320 | 2688 | 16x16 | 16, 1792 | 50, 800

5 Open Challenges and Future Directions

Throughout this survey, we have conducted an in-depth analysis of various foundational models, delving into their architectural designs, motivations, objectives, and use cases, all aimed at tackling real-world challenges. In this section, our focus shifts to underscore research directions that have the potential to further empower these models for addressing medical imaging applications.

5.1 Open-source Multimodal Models

The future direction of foundation models in medical imaging holds immense promise, primarily due to their seamless integration of diverse data modalities. This integration creates opportunities to explore medical concepts at multiple scales and leverage insights from various knowledge sources, including imaging, textual, and audio data. This multimodal integration empowers medical discoveries that are challenging to achieve with single-modality data alone, while also facilitating knowledge transfer across domains [23]. For example, current self-supervised learning methods are not universally generalizable and often need to be tailored and developed for each specific modality, highlighting the ongoing need for research and innovation in this area. Foundation models are poised to revolutionize healthcare by offering a holistic understanding of diseases and enabling more precise and data-driven medical interventions. However, to truly unlock the full potential of foundation models in this context, we must emphasize the need to consider inter-modality and cross-modality relationships more effectively. This involves developing methods that can effectively bridge the gap between different data modalities, allowing for better information fusion and more accurate predictions. By enhancing the ability to capture the intricate connections between different medical data, we can further increase the performance and utility of foundation models in medical imaging and healthcare. This interdisciplinary approach is critical for advancing our understanding of complex diseases and improving patient care.
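One concrete building block for such cross-modality bridging is contrastive alignment of embeddings in a shared space, as in the CLIP family of models discussed in this survey. The sketch below is illustrative only: random vectors stand in for real image- and report-encoder outputs, and matched image-report pairs are scored by cosine similarity of L2-normalized embeddings.

```python
import numpy as np

def cosine_similarity_matrix(img_emb, txt_emb):
    """Pairwise cosine similarities between image and text embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T  # shape: (n_images, n_texts)

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(3, 128))                   # stand-in image encoder outputs
txt_emb = img_emb + 0.1 * rng.normal(size=(3, 128))   # matched reports, slightly perturbed

sims = cosine_similarity_matrix(img_emb, txt_emb)
# With aligned embeddings, each image retrieves its own report.
print(sims.argmax(axis=1))  # → [0 1 2]
```

Contrastive pre-training pushes real encoders toward exactly this structure: high similarity on matched image-report pairs and low similarity elsewhere.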

5.2 Interpretability

Understanding a model’s capabilities, reasoning, and mechanisms provides profound insights into its outputs. Explainability and interpretability are pivotal in adopting foundation models for building trustworthy AI-driven systems and ensuring their ethical and practical use in healthcare [90]. These capabilities are essential for transparency, accountability, and regulatory compliance. Specifically, understanding what a model can do, why it behaves in certain ways, and how it operates is particularly vital when dealing with foundation models. These complex models, powered by extensive data, possess the ability to perform unforeseen tasks in entirely novel ways [1]. In healthcare, explainability is critical for decisions regarding patient symptoms, clinical trials, and informed consent. Transparent AI reasoning helps resolve disagreements between AI systems and human experts, explaining the reasons behind the created decisions. However, most current foundation models lack built-in explainability, requiring future research. By connecting AI outputs with medical knowledge, models become more understandable, enabling users to grasp not only what the model predicts but why. This interdisciplinary approach, merging AI with domain expertise, advances disease understanding, elevates patient care, and promotes responsible AI use in healthcare.

5.3 Bias and Variance in Foundational Models

Within the domain of foundational models for medical imaging, two critical aspects demand ongoing scrutiny and investigation: bias and variance [91].

Bias: One of the foremost challenges facing foundational models is the presence of bias in both data and predictions. Just as in vision and language models, foundational models in medical imaging can inherit and amplify biases present in the training data. These biases might be related to race, ethnicity, gender, or socioeconomic factors, and they can manifest in the models’ predictions and behaviors. For instance, a model might exhibit disparities in disease diagnosis or treatment recommendations for different demographic groups, potentially leading to unequal healthcare outcomes. Thus, addressing and mitigating biases in foundational models is of paramount importance to ensure fairness, inclusivity, and ethical deployment in the medical domain.
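A first diagnostic for such bias is a simple subgroup audit: stratify predictions by a demographic attribute and compare per-group performance. The sketch below uses toy labels and a hypothetical binary group attribute; real audits would use the actual protected attributes available in the dataset and a richer set of fairness metrics.

```python
def subgroup_accuracy_gap(y_true, y_pred, groups):
    """Per-group accuracy and the largest gap between groups (a basic fairness signal)."""
    acc = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        acc[g] = sum(y_true[i] == y_pred[i] for i in idx) / len(idx)
    gap = max(acc.values()) - min(acc.values())
    return acc, gap

# Toy diagnosis labels and predictions, split across two hypothetical groups.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

acc, gap = subgroup_accuracy_gap(y_true, y_pred, groups)
print(acc, round(gap, 2))  # group A: 0.75, group B: 0.5, gap 0.25
```

A persistent accuracy gap between groups is a prompt for deeper investigation into training-data composition and model behavior, not a conclusion in itself.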

Variance: Variance, on the other hand, pertains to the models’ sensitivity to fluctuations in the training data. In the context of medical imaging, variance can manifest as the models’ inability to generalize effectively across diverse patient populations or different healthcare settings. Models with high variance might perform exceptionally well on one dataset but poorly on another, hindering their reliability in real-world clinical applications. Therefore, strategies that enhance the robustness and generalization capabilities of foundational models are crucial for their widespread adoption and utility.

5.4 Adversarial Attacks

In the healthcare system, where the accuracy of medical decisions can have life-altering consequences, susceptibility to adversarial attacks [92] is a pressing concern. These attacks, which involve the deliberate manipulation of model inputs, can lead to not only erroneous but potentially harmful outputs, creating a perilous landscape for medical practitioners and patients alike. For instance, in the context of medical imaging, adversarial attacks could potentially result in misdiagnoses, causing patients to receive incorrect treatments or delay necessary interventions. Furthermore, the compromise of patient data privacy through adversarial tactics can lead to severe breaches of confidentiality, raising ethical and legal concerns.

Additionally, the potential for the spread of false medical information, fueled by adversarial attacks, could have far-reaching consequences, undermining public trust in foundational models and healthcare systems. Therefore, addressing these vulnerabilities and developing robust defence mechanisms are not just academic endeavors but essential imperatives for ensuring the safety, reliability, and ethical use of foundational models in medical applications. The healthcare domain demands a proactive stance in fortifying foundational models against adversarial threats to safeguard the integrity and efficacy of clinical decision-making processes and the privacy of patient data.
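To make the attack surface concrete, the sketch below implements the fast gradient sign method (FGSM) on a toy logistic-regression "classifier", where the input gradient is available in closed form; attacks on real imaging models follow the same recipe with backpropagated gradients. The weights and inputs are random stand-ins, not a real diagnostic model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, w, b, y, eps):
    """FGSM: perturb x by eps * sign(d loss / d x) for a logistic-regression model.

    For binary cross-entropy, d loss / d x = (sigmoid(w.x + b) - y) * w.
    """
    grad_x = (sigmoid(w @ x + b) - y) * w
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(1)
w = rng.normal(size=16)
b = 0.0
x = rng.normal(size=16)
y = 1.0 if sigmoid(w @ x + b) > 0.5 else 0.0  # attack the model's own predicted label

x_adv = fgsm(x, w, b, y, eps=0.3)
# The perturbation is bounded by eps per feature, yet it pushes the logit
# toward the wrong class and may flip the prediction entirely.
print(sigmoid(w @ x + b) > 0.5, sigmoid(w @ x_adv + b) > 0.5)
```

Defenses such as adversarial training and input sanitization target exactly this mechanism: small, bounded perturbations that systematically increase the model's loss.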

5.5 Downstream Task Adaptation

Foundation models offer powerful adaptability, including fine-tuning and prompting, making them versatile for healthcare and medical tasks. However, their extensive initial training demands substantial resources, and adapting them efficiently to different tasks without losing learned knowledge remains a critical challenge. Research is needed to reduce the computational and memory requirements of rapid adaptation, as current approaches often require careful hyperparameter selection that can impact generalization performance. These challenges point to the need for more efficient foundation models in the future to enhance their general-purpose utility.
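One widely used family of techniques for such efficient adaptation is low-rank fine-tuning, as employed by SAMed among the surveyed models: the pretrained weight matrix is frozen and only a small rank-r update B·A is trained. The numpy sketch below illustrates the idea for a single linear layer; the class name, initialization constants, and dimensions are illustrative, not taken from any specific implementation.

```python
import numpy as np

class LoRALinear:
    """Frozen dense layer with a trainable low-rank update (LoRA-style sketch)."""

    def __init__(self, w_frozen, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w_frozen.shape
        self.w = w_frozen                            # frozen pretrained weights
        self.a = rng.normal(0, 0.02, (rank, d_in))   # trainable down-projection
        self.b = np.zeros((d_out, rank))             # trainable up-projection, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        # Zero-initialized b means the layer starts identical to the frozen one.
        return x @ (self.w + self.scale * self.b @ self.a).T

    def trainable_params(self):
        return self.a.size + self.b.size

w = np.random.default_rng(2).normal(size=(768, 768))
layer = LoRALinear(w, rank=8)
out = layer(np.ones((1, 768)))
print(layer.trainable_params(), w.size)  # → 12288 589824
```

For this 768x768 layer, the rank-8 update trains roughly 2% of the original parameter count, which is the mechanism behind the modest hardware budgets of adaptation models such as SAMed in Table 2.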

5.6 Extensive Data and Computational Demands

Foundation models, while powerful, come with substantial computational costs for development, training, and deployment. In specific cases, smaller models can achieve similar or better results at a lower cost. Training large-scale models is data and compute-intensive, and acquiring extensive labeled data can be expensive and time-consuming, especially for specialized domains or less-resourced languages. Inference with these models is also costly due to their many parameters. A summary of the computational budget and training costs of some of the reviewed models in this paper is provided in Section 4.1.

These computational demands hinder their practicality in real-world applications, particularly those needing real-time inference or running on resource-constrained edge and mobile devices. For instance, visual prompt-based models like Segment Anything [10], while having robust image encoders, currently lack real-time processing speed, a crucial requirement for practical use. FastSAM [93], on the other hand, achieves comparable performance to the SAM method but at 50 times faster run-time speed by replacing the Transformer architecture with YOLOv8-seg [94], significantly expanding the utility of such models in real-world scenarios. Consequently, there’s potential to develop more efficient successors to address this issue, particularly in medical applications where running models on edge devices offer substantial advantages, especially in underserved areas.

5.7 Prompt Engineering

Prompt engineering is a critical aspect of foundational models in medical imaging, and its significance lies in its potential to bridge the gap between these models and radiologists, ultimately enhancing patient care [55]. In the context of medical image interpretation, effective communication between radiologists and AI models can lead to several noteworthy benefits. First and foremost, prompt engineering allows radiologists to have natural and interactive conversations with AI models. This capability is particularly valuable as it enables radiologists to seek clarifications, provide additional context, and ask follow-up questions, mirroring real-world clinical scenarios. For example, when reviewing a complex medical image, a radiologist may need to ask the AI model for further explanations about its findings, request alternative views, or explore differential diagnoses. Prompt engineering facilitates this conversational flow, making AI models more accessible and collaborative tools for radiologists. Moreover, the ability to converse with AI models through well-constructed prompts empowers radiologists with a more interactive and intuitive workflow. Instead of relying solely on fixed queries or predefined prompts, radiologists can tailor their interactions based on the specific nuances of each case. This adaptability allows for a more dynamic and personalized user experience, ultimately improving diagnostic accuracy and efficiency. Furthermore, prompt engineering contributes to the interpretability and transparency of AI models. Radiologists can gain insights into how the model arrives at its conclusions by crafting prompts that elicit detailed explanations. This transparency is crucial in a clinical context, where radiologists need to understand the reasoning behind the AI model’s recommendations and trust its diagnostic insights.
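As a concrete illustration, the sketch below assembles a structured radiology prompt carrying clinical context, prior conversation turns, and the current question — the kind of scaffolding prompt engineering provides. The template, field names, and wording are hypothetical and not taken from any surveyed model's API.

```python
def build_radiology_prompt(context, question, history=()):
    """Assemble a structured prompt for a medical VQA model (hypothetical template)."""
    lines = [
        "You are assisting a radiologist. Ground every statement in the image findings.",
        f"Clinical context: {context}",
    ]
    for q, a in history:  # prior turns keep the follow-up dialogue coherent
        lines += [f"Radiologist: {q}", f"Assistant: {a}"]
    lines += [f"Radiologist: {question}", "Assistant:"]
    return "\n".join(lines)

prompt = build_radiology_prompt(
    context="62-year-old with chronic cough; frontal chest radiograph.",
    question="Could this opacity represent early consolidation?",
    history=[("What do you see in the right lower lobe?",
              "A patchy opacity with indistinct borders.")],
)
print(prompt)
```

Tailoring the context, history, and phrasing per case is precisely the adaptability described above: the same model yields markedly different usefulness depending on how the conversation is scaffolded.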

5.8 Lack of Effective Benchmark to Monitor Progress

While various benchmark datasets and evaluation metrics exist, they often fall short in comprehensively assessing model performance across diverse medical imaging tasks, modalities, and real-world clinical scenarios. Addressing this issue and establishing a robust benchmarking framework is crucial for several reasons. Firstly, a comprehensive benchmark can facilitate fair and standardized model evaluation, enabling researchers to assess the true strengths and weaknesses of different foundational models accurately. Currently, models may excel in specific datasets or tasks but struggle when applied to new, untested scenarios. An effective benchmark should encompass a wide spectrum of medical imaging challenges, including rare conditions and diverse patient populations, to provide a holistic assessment of model capabilities. Secondly, a well-structured benchmark can drive innovation by defining clear objectives and goals for the field. It can serve as a reference point for researchers and encourage the development of models that can address real-world clinical needs effectively. Moreover, it can incentivize the creation of models that are robust, interpretable, and adaptable to the dynamic nature of healthcare. Lastly, an effective benchmarking framework can aid in the deployment of foundational models in clinical practice. By thoroughly evaluating models’ performance and generalization across various clinical settings, it can assist healthcare providers in selecting the most suitable models for specific tasks and ensure that AI-assisted medical decision-making is reliable and safe.

5.9 Enhancing Feature Representation from a Frequency Perspective

Given that the majority of foundational models employ ViT models as their backbone, it becomes crucial to assess these models from a frequency perspective to ensure their ability to capture and learn diverse frequency information necessary for object recognition. Recent research has shed light on the fact that traditional self-attention mechanisms in ViT, while effective in mitigating local feature disparities, tend to neglect vital high-frequency details, such as textures and edge characteristics [95, 96]. This oversight is particularly problematic in tasks like tumor detection, cancer-type identification through radiomics analysis, and treatment response assessment, as these tasks often hinge on recognizing subtle textural abnormalities. Additionally, it’s worth noting that self-attention mechanisms come with a quadratic computational complexity and may generate redundant features [97]. Given these considerations, the design of new foundational models should take these limitations into account and explore potential enhancements. This could involve incorporating CNN layers or adopting more efficient ViT architectures to strike a balance between computational efficiency and preserving high-frequency information.
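This frequency-domain view can be made concrete with a simple decomposition: low-pass and high-pass filtering of an image in Fourier space, where the high-frequency residual carries exactly the edge and texture detail the cited works find under-represented in ViT features. The cutoff radius below is an arbitrary illustrative choice.

```python
import numpy as np

def frequency_split(img, cutoff=0.1):
    """Split a 2D image into low- and high-frequency parts via an FFT mask.

    cutoff is the radius of the retained low-frequency disk, as a fraction
    of the smaller image dimension.
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
    mask = (yy**2 + xx**2) <= (cutoff * min(h, w)) ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    high = img - low
    return low, high

img = np.zeros((64, 64))
img[:, 32:] = 1.0  # a sharp vertical edge: predominantly high-frequency content
low, high = frequency_split(img, cutoff=0.05)
# The decomposition is exact: low + high reconstructs the image.
print(np.allclose(low + high, img))  # → True
```

Probing backbone features with such a split is one way to check empirically whether a candidate architecture retains the high-frequency information that tasks like subtle-texture tumor detection depend on.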

6 Conclusion

In this comprehensive survey, we have conducted an in-depth review of recent advancements in foundational models for medical imaging. Our survey commences with an introductory section that provides insight into the evolution of foundation models and their potential contributions to the healthcare sector.

Subsequently, we categorize these models into four main groups, differentiating between those prompted by text and those guided by visual cues. Each of these categories boasts unique strengths and capabilities, and in Section 3, we delve into these directions by presenting exemplary works and offering comprehensive methodological descriptions. Furthermore, our exploration extends to evaluating the advantages and limitations inherent to each model type. We shed light on their areas of excellence and identify areas where they have room for improvement. This information is presented in the form of pros, cons, and real-world use cases of these models in the context of medical imaging scenarios, and the summarized results can be found in Section 4. Moreover, we consider the hardware and dataset requirements for implementing these models. We provide various configuration strategies to elucidate the prerequisites for future research endeavors, helping researchers gain a clear understanding of the necessary resources. In conclusion, our survey not only reviews recent developments but also sets the stage for future research in foundational models. We propose several directions for future investigations, offering a roadmap for researchers to excel in the field of foundational models for medical imaging.

References

  • [1] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  • [2] Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundational models defining a new era in vision: A survey and outlook. arXiv preprint arXiv:2307.13721, 2023.
  • [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [4] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • [5] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.
  • [6] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [7] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
  • [8] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021.
  • [9] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  • [10] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  • [11] Junlong Cheng, Jin Ye, Zhongying Deng, Jianpin Chen, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, Hui Sun, Junjun He, Shaoting Zhang, Min Zhu, and Yu Qiao. Sam-med2d, 2023.
  • [12] Jun Ma and Bo Wang. Segment anything in medical images. arXiv preprint arXiv:2304.12306, 2023.
  • [13] Wenhui Lei, Xu Wei, Xiaofan Zhang, Kang Li, and Shaoting Zhang. Medlsam: Localize and segment anything model for 3d medical images. arXiv preprint arXiv:2306.14752, 2023.
  • [14] Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790, 2023.
  • [15] Songhua Liu, Jingwen Ye, and Xinchao Wang. Any-to-any style transfer: Making picasso and da vinci collaborate. arXiv e-prints, pages arXiv–2304, 2023.
  • [16] Teng Wang, Jinrui Zhang, Junjie Fei, Yixiao Ge, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao, Ying Shan, et al. Caption anything: Interactive image description with diverse multimodal controls. arXiv preprint arXiv:2305.02677, 2023.
  • [17] Zhanghexuan Ji, Dazhou Guo, Puyang Wang, Ke Yan, Jia Ge, Xianghua Ye, Minfeng Xu, Jingren Zhou, Le Lu, Mingchen Gao, et al. Continual segment: Towards a single, unified and accessible continual segmentation model of 143 whole-body organs in ct scans. arXiv preprint arXiv:2302.00162, 2023.
  • [18] Yunkun Zhang, Jin Gao, Mu Zhou, Xiaosong Wang, Yu Qiao, Shaoting Zhang, and Dequan Wang. Text-guided foundation model adaptation for pathological image classification. arXiv preprint arXiv:2307.14901, 2023.
  • [19] Reza Azad, Amirhossein Kazerouni, Moein Heidari, Ehsan Khodapanah Aghdam, Amirali Molaei, Yiwei Jia, Abin Jose, Rijo Roy, and Dorit Merhof. Advances in medical image analysis with vision transformers: A comprehensive review. Medical Image Analysis, 2023.
  • [20] Zhao Wang, Chang Liu, Shaoting Zhang, and Qi Dou. Foundation model for endoscopy video analysis via large-scale self-supervised pre-train. arXiv preprint arXiv:2306.16741, 2023.
  • [21] Duy MH Nguyen, Hoang Nguyen, Nghiem T Diep, Tan N Pham, Tri Cao, Binh T Nguyen, Paul Swoboda, Nhat Ho, Shadi Albarqouni, Pengtao Xie, et al. Lvm-med: Learning large-scale self-supervised vision models for medical imaging via second-order graph matching. arXiv preprint arXiv:2306.11925, 2023.
  • [22] Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Chuck Lau, Ryutaro Tanno, Ira Ktena, Basil Mustafa, Aakanksha Chowdhery, Yun Liu, Simon Kornblith, David Fleet, Philip Mansfield, Sushant Prakash, Renee Wong, Sunny Virmani, Christopher Semturs, S Sara Mahdavi, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Karan Singhal, Pete Florence, Alan Karthikesalingam, and Vivek Natarajan. Towards generalist biomedical ai, 2023.
  • [23] Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259–265, 2023.
  • [24] Peilun Shi, Jianing Qiu, Sai Mu Dalike Abaxi, Hao Wei, Frank P-W Lo, and Wu Yuan. Generalist vision foundation models for medical imaging: A case study of segment anything model on zero-shot medical segmentation. Diagnostics, 13(11):1947, 2023.
  • [25] Kai Zhang, Jun Yu, Zhiling Yan, Yixin Liu, Eashan Adhikarla, Sunyang Fu, Xun Chen, Chen Chen, Yuyin Zhou, Xiang Li, Lifang He, Brian D. Davison, Quanzheng Li, Yong Chen, Hongfang Liu, and Lichao Sun. Biomedgpt: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks, 2023.
  • [26] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology, 2023.
  • [27] Yukun Zhou, Mark A Chia, Siegfried K Wagner, Murat S Ayhan, Dominic J Williamson, Robbert R Struyven, Timing Liu, Moucheng Xu, Mateo G Lozano, Peter Woodward-Court, et al. A foundation model for generalizable disease detection from retinal images. Nature, pages 1–8, 2023.
  • [28] Qiuhui Chen, Xinyue Hu, Zirui Wang, and Yi Hong. Medblip: Bootstrapping language-image pre-training from 3d medical images and texts. arXiv preprint arXiv:2305.10799, 2023.
  • [29] Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Laila Bashmal, and Mansour Zuair. Vision–language model for visual question answering in medical imagery. Bioengineering, 10(3), 2023.
  • [30] Ekin Tiu, Ellie Talius, Pujan Patel, Curtis P Langlotz, Andrew Y Ng, and Pranav Rajpurkar. Expert-level detection of pathologies from unannotated chest x-ray images via self-supervised learning. Nature Biomedical Engineering, 6(12):1399–1406, 2022.
  • [31] Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. arXiv preprint arXiv:2210.10163, 2022.
  • [32] Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Perez-Garcia, Maximilian Ilse, Daniel C Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15016–15027, 2023.
  • [33] Jie Liu, Yixiao Zhang, Jie-Neng Chen, Junfei Xiao, Yongyi Lu, Bennett A Landman, Yixuan Yuan, Alan Yuille, Yucheng Tang, and Zongwei Zhou. Clip-driven universal model for organ segmentation and tumor detection. arXiv preprint arXiv:2301.00785, 2023.
  • [34] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, et al. Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint arXiv:2303.00915, 2023.
  • [35] Zhihong Chen, Shizhe Diao, Benyou Wang, Guanbin Li, and Xiang Wan. Towards unifying medical vision-and-language pre-training via soft prompts. arXiv preprint arXiv:2302.08958, 2023.
  • [36] Ming Y Lu, Bowen Chen, Andrew Zhang, Drew FK Williamson, Richard J Chen, Tong Ding, Long Phi Le, Yung-Sung Chuang, and Faisal Mahmood. Visual language pretrained multiple instance zero-shot transfer for histopathology images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19764–19775, 2023.
  • [37] Xiaofei Chen, Yuting He, Cheng Xue, Rongjun Ge, Shuo Li, and Guanyu Yang. Knowledge boosting: Rethinking medical contrastive vision-language pre-training. arXiv preprint arXiv:2307.07246, 2023.
  • [38] Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical twitter. Nature Medicine, pages 1–10, 2023.
  • [39] Shawn Xu, L. Yang, Christopher J. Kelly, Marcin Sieniek, Timo Kohlberger, Martin Q. Ma, Wei-Hung Weng, Attila Péter Király, Sahar Kazemzadeh, Zakkai Melamed, Jungyeon Park, Patricia Strachan, Yun Liu, Charles Lau, Preeti Singh, Christina Chen, Mozziyar Etemadi, Sreenivasa Raju Kalidindi, Yossi Matias, Katherine Chou, Greg S Corrado, Shravya Shetty, Daniel Tse, Shruthi Prabhakara, Daniel Golden, Rory Pilgrim, Krish Eswaran, and Andrew Sellergren. Elixr: Towards a general purpose x-ray artificial intelligence system through alignment of large language models and radiology vision encoders. ArXiv, abs/2308.01317, 2023.
  • [40] Weijian Huang, Hongyu Zhou, Cheng Li, Hao Yang, Jiarun Liu, and Shanshan Wang. Enhancing representation in radiography-reports foundation model: A granular alignment algorithm using masked contrastive learning. arXiv preprint arXiv:2309.05904, 2023.
  • [41] Bin Yan and Mingtao Pei. Clinical-bert: Vision-language pre-training for radiograph diagnosis and reports generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2982–2990, 2022.
  • [42] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023.
  • [43] Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Cyril Zakka, Yash Dalmia, Eduardo Pontes Reis, Pranav Rajpurkar, and Jure Leskovec. Med-flamingo: a multimodal medical few-shot learner. arXiv preprint arXiv:2307.15189, 2023.
  • [44] Guoyao Deng, Ke Zou, Kai Ren, Meng Wang, Xuedong Yuan, Sancong Ying, and Huazhu Fu. Sam-u: Multi-box prompts triggered uncertainty estimation for reliable sam in medical image, 2023.
  • [45] Xinrong Hu, Xiaowei Xu, and Yiyu Shi. How to efficiently adapt large segmentation model (sam) to medical images. arXiv preprint arXiv:2306.13731, 2023.
  • [46] Junde Wu, Rao Fu, Huihui Fang, Yuanpei Liu, Zhaowei Wang, Yanwu Xu, Yueming Jin, and Tal Arbel. Medical sam adapter: Adapting segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620, 2023.
  • [47] Kaidong Zhang and Dong Liu. Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785, 2023.
  • [48] Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Siqi Liu, Philippe Mathieu, Alexander van Eck, Donghun Lee, Julian Viret, et al. Virchow: A million-slide digital pathology foundation model. arXiv preprint arXiv:2309.07778, 2023.
  • [49] Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-llama: Towards building open-source language models for medicine, 2023.
  • [50] Guangyu Wang, Guoxing Yang, Zongxin Du, Longjun Fan, and Xiaohu Li. Clinicalgpt: Large language models finetuned with diverse medical data and comprehensive evaluation, 2023.
  • [51] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023.
  • [52] Omkar Thawkar, Abdelrahman Shaker, Sahal Shaji Mullappilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, and Fahad Shahbaz Khan. Xraygpt: Chest radiographs summarization using medical vision-language models. arXiv preprint arXiv:2306.07971, 2023.
  • [53] Zhengliang Liu, Yiwei Li, Peng Shu, Aoxiao Zhong, Longtao Yang, Chao Ju, Zihao Wu, Chong Ma, Jie Luo, Cheng Chen, Sekeun Kim, Jiang Hu, Haixing Dai, Lin Zhao, Dajiang Zhu, Jun Liu, Wei Liu, Dinggang Shen, Tianming Liu, Quanzheng Li, and Xiang Li. Radiology-llama2: Best-in-class large language model for radiology, 2023.
  • [54] Sheng Wang, Zihao Zhao, Xi Ouyang, Qian Wang, and Dinggang Shen. Chatcad: Interactive computer-aided diagnosis on medical image using large language models. arXiv preprint arXiv:2302.07257, 2023.
  • [55] Zheng-Long Liu, Xiao-Xing Yu, Lu Zhang, Zihao Wu, Chao-Yang Cao, Haixing Dai, Lin Zhao, W. Liu, Dinggang Shen, Quanzheng Li, Tianming Liu, Dajiang Zhu, and Xiang Li. Deid-gpt: Zero-shot medical text de-identification by gpt-4. ArXiv, abs/2303.11032, 2023.
  • [56] Li Yunxiang, Li Zihan, Zhang Kai, Dan Ruilong, and Zhang You. Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge. arXiv preprint arXiv:2303.14070, 2023.
  • [57] Chang Shu, Chen Baian, Fangyu Liu, Zihao Fu, Ehsan Shareghi, and Nigel Collier. Visual med-alpaca: A parameter-efficient biomedical llm with visual capabilities. https://cambridgeltl.github.io/visual-med-alpaca/. Accessed: 2023-09-01.
  • [58] Amirhossein Kazerouni, Ehsan Khodapanah Aghdam, Moein Heidari, Reza Azad, Mohsen Fayyaz, Ilker Hacihaliloglu, and Dorit Merhof. Diffusion models in medical imaging: A comprehensive survey. Medical Image Analysis, page 102846, 2023.
  • [59] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  • [60] Lizhou Fan, Lingyao Li, Zihui Ma, Sanggyu Lee, Huizi Yu, and Libby Hemphill. A bibliometric review of large language models research from 2017 to 2023. arXiv preprint arXiv:2304.02020, 2023.
  • [61] Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022.
  • [62] Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, and Philip Torr. A systematic survey of prompt engineering on vision-language foundation models. arXiv preprint arXiv:2307.12980, 2023.
  • [63] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. arXiv preprint arXiv:2304.00685, 2023.
  • [64] Shaoting Zhang and Dimitris Metaxas. On the challenges and perspectives of foundation models for medical image analysis. arXiv preprint arXiv:2306.05705, 2023.
  • [65] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training, 2021.
  • [66] J. Yang, C. Li, P. Zhang, B. Xiao, C. Liu, L. Yuan, and J. Gao. Unified contrastive learning in image-text-label space. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19141–19151, 2022.
  • [67] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
  • [68] [第六十八章] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [69] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • [70] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ArXiv, abs/1807.03748, 2018.
  • [71] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020.
  • [72] Lewei Yao, Runhu Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. ArXiv, abs/2111.07783, 2021.
  • [73] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
  • [74] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems, 35:36067–36080, 2022.
  • [75] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. ArXiv, abs/2106.08254, 2021.
  • [76] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692, 2019.
  • [77] Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650, 2022.
  • [78] Xinsong Zhang, Yan Zeng, Jipeng Zhang, and Hang Li. Toward building general foundation models for language, vision, and vision-language understanding tasks. arXiv preprint arXiv:2301.05065, 2023.
  • [79] Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, and Lucas Beyer. Image captioners are scalable vision learners too. arXiv preprint arXiv:2306.07915, 2023.
  • [80] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023.
  • [81] Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, and Tianrui Li. Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In International Conference on Machine Learning, pages 23033–23044. PMLR, 2023.
  • [82] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • [83] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
  • [84] Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq R. Joty, Caiming Xiong, and Steven C. H. Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Neural Information Processing Systems, 2021.
  • [85] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, abs/2301.12597, 2023.
  • [86] A Venigalla, J Frankle, and M Carbin. Biomedlm: a domain-specific large language model for biomedical text. MosaicML. Accessed: Dec, 23, 2022.
  • [87] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • [88] Shizhan Gong, Yuan Zhong, Wenao Ma, Jinpeng Li, Zhao Wang, Jingyang Zhang, Pheng-Ann Heng, and Qi Dou. 3dsam-adapter: Holistic adaptation of sam from 2d to 3d for promptable medical image segmentation. arXiv preprint arXiv:2306.13465, 2023.
  • [89] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-llama: Further finetuning llama on medical papers. arXiv preprint arXiv:2304.14454, 2023.
  • [90] Reza Azad, Ehsan Khodapanah Aghdam, Amelie Rauland, Yiwei Jia, Atlas Haddadi Avval, Afshin Bozorgpour, Sanaz Karimijafarbigloo, Joseph Paul Cohen, Ehsan Adeli, and Dorit Merhof. Medical image segmentation review: The success of u-net. arXiv preprint arXiv:2211.14830, 2022.
  • [91] Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. Rethinking bias-variance trade-off for generalization of neural networks. In International Conference on Machine Learning, pages 10767–10777. PMLR, 2020.
  • [92] Natalie Maus, Patrick Chao, Eric Wong, and Jacob R Gardner. Black box adversarial prompting for foundation models. In The Second Workshop on New Frontiers in Adversarial Machine Learning, 2023.
  • [93] Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment anything. arXiv preprint arXiv:2306.12156, 2023.
  • [94] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. YOLO by Ultralytics, January 2023.
  • [95] Peihao Wang, Wenqing Zheng, Tianlong Chen, and Zhangyang Wang. Anti-oversmoothing in deep vision transformers via the fourier domain analysis: From theory to practice. In International Conference on Learning Representations, 2022.
  • [96] Reza Azad, Amirhossein Kazerouni, Babak Azad, Ehsan Khodapanah Aghdam, Yury Velichko, Ulas Bagci, and Dorit Merhof. Laplacian-former: Overcoming the limitations of vision transformers in local texture detection. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 736–746. Springer, 2023.
  • [97] Reza Azad, Leon Niggemeier, Michael Huttemann, Amirhossein Kazerouni, Ehsan Khodapanah Aghdam, Yury Velichko, Ulas Bagci, and Dorit Merhof. Beyond self-attention: Deformable large kernel attention for medical image segmentation. arXiv preprint arXiv:2309.00121, 2023.