Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
医学成像的基础模型：综述与展望

Bobby Azad 鲍比·阿扎德
Electrical Engineering and Computer Science Department
South Dakota State University
Brookings, USA
&Reza Azad
Faculty of Electrical Engineering and Information Technology
RWTH Aachen University
Aachen, Germany
&Sania Eskandari
Department of Electrical Engineering
University of Kentucky
Lexington, USA
&Afshin Bozorgpour
Faculty of Informatics and Data Science
University of Regensburg
Regensburg, Germany
&Amirhossein Kazerouni
School of Electrical Engineering
Iran University of Science and Technology
Tehran, Iran
&Islem Rekik
BASIRA Lab, Imperial-X and Computing Department
Imperial College London
London, UK
&Dorit Merhof
Faculty of Informatics and Data Science
University of Regensburg
Regensburg, Germany
电气工程和计算机科学系美国南达科他州州立大学布鲁金斯和Reza Azad电气工程和信息技术学院RWTH亚琛大学亚琛，德国&Sania Eskandari肯塔基大学列克星敦电气工程系美国&Afshin Bozorgpour信息学和数据科学学院里根斯堡大学德国&Amirhossein Kazerouni电气工程学院伊朗科技大学伊朗德黑兰&Islem Rekik BASIRA实验室，Imperial-X和计算系伦敦帝国理工学院伦敦，英国&Dorit Merhof里根斯堡大学信息学和数据科学学院里根斯堡，德国 Corresponding author: Dorit Merhof, dorit.merhofur.de
通讯作者：Dorit Merhof，dorit.merhofur.de

Abstract 摘要

Foundation models, large-scale, pre-trained deep-learning models adapted to a wide range of downstream tasks have gained significant interest lately in various deep-learning problems undergoing a paradigm shift with the rise of these models. Trained on large-scale dataset to bridge the gap between different modalities, foundation models facilitate contextual reasoning, generalization, and prompt capabilities at test time. The predictions of these models can be adjusted for new tasks by augmenting the model input with task-specific hints called prompts without requiring extensive labeled data and retraining. Capitalizing on the advances in computer vision, medical imaging has also marked a growing interest in these models. With the aim of assisting researchers in navigating this direction, this survey intends to provide a comprehensive overview of foundation models in the domain of medical imaging. Specifically, we initiate our exploration by providing an exposition of the fundamental concepts forming the basis of foundation models. Subsequently, we offer a methodical taxonomy of foundation models within the medical domain, proposing a classification system primarily structured around training strategies, while also incorporating additional facets such as application domains, imaging modalities, specific organs of interest, and the algorithms integral to these models. Furthermore, we emphasize the practical use case of some selected approaches and then discuss the opportunities, applications, and future directions of these large-scale pre-trained models, for analyzing medical images. In the same vein, we address the prevailing challenges and research pathways associated with foundational models in medical imaging. These encompass the areas of interpretability, data management, computational requirements, and the nuanced issue of contextual comprehension. Finally, we gather the over-viewed studies with their available open-source implementations at our GitHub. We aim to update the relevant latest papers within it regularly.
基础模型，适应广泛下游任务的大规模预训练深度学习模型，最近在各种深度学习问题中引起了极大的兴趣，这些问题随着这些模型的兴起而发生了范式转变。在大规模数据集上进行训练，以弥合不同模态之间的差距，基础模型在测试时促进上下文推理，泛化和提示功能。这些模型的预测可以通过使用任务特定的提示（称为提示）来增强模型输入来调整新任务，而不需要大量的标记数据和重新训练。利用计算机视觉的进步，医学成像也标志着对这些模型的兴趣越来越大。为了帮助研究人员在这个方向上导航，本调查旨在提供一个全面的概述基础模型在医学成像领域。具体来说，我们开始我们的探索提供了一个形成基础模型的基础的基本概念的阐述。随后，我们提供了医学领域内基础模型的系统分类，提出了一个主要围绕训练策略构建的分类系统，同时还结合了其他方面，如应用领域，成像方式，感兴趣的特定器官以及这些模型的算法。此外，我们强调了一些选定方法的实际用例，然后讨论了这些大规模预训练模型用于分析医学图像的机会、应用和未来方向。同样，我们解决了与医学成像基础模型相关的普遍挑战和研究途径。这些包括可解释性，数据管理，计算要求和上下文理解的细微差别的问题。最后，我们在GitHub上收集了概述的研究及其可用的开源实现。我们的目标是定期更新其中的相关最新论文。

Keywords Foundation models $\cdot$ Deep learning $\cdot$ Language and vision $\cdot$ Large language models $\cdot$ Score-based models $\cdot$ Self-supervised learning $\cdot$ Medical applications $\cdot$ Survey
关键词基础模型 $\cdot$ 深度学习 $\cdot$ 语言和视觉 $\cdot$ 大型语言模型 $\cdot$ 基于分数的模型 $\cdot$ 自监督学习 $\cdot$ 医疗应用 $\cdot$ 调查

1 Introduction 1介绍

Medical imaging is at the forefront of healthcare, playing a pivotal role in diagnosing and treating diseases. Recent advancements in Artificial Intelligence (AI) have given rise to a new era in the field of medical imaging, driven by the development of Foundation Models (FMs). Foundation Models (FMs) are a type of artificial intelligence (AI) model that exhibit significant progress in their development. These models are typically trained on extensive, diverse dataset, frequently utilizing self-supervision techniques on a massive scale. Following this initial training, they can be further adapted, such as through fine-tuning, for a wide array of downstream tasks that are related to the original training data [1].
医学成像处于医疗保健的最前沿，在诊断和治疗疾病方面发挥着关键作用。人工智能（AI）的最新进展在基础模型（FM）的发展推动下，在医学成像领域开创了一个新时代。基础模型（FM）是一种人工智能（AI）模型，在其发展中取得了重大进展。这些模型通常在广泛的、多样化的数据集上进行训练，经常大规模地使用自我监督技术。在此初始训练之后，它们可以进一步调整，例如通过微调，用于与原始训练数据相关的各种下游任务[ 1]。

In contrast to the conventional deep learning paradigm, which heavily relies on large-scale, task-specific, and crowd-labeled data to train individual deep neural networks (DNNs) for various visual recognition tasks, FMs provide a more efficient alternative. They are pretrained on large-scale dataset that are nearly unlimited in availability, enabling straightforward application to downstream tasks with only a limited amount of labeled data. This shift in approach shows potential for significantly decreasing the labor and time usually necessary for such tasks. The recent surge can be attributed to the progress made possible by large language models (LLMs) and the expansion of data and size [2]. Models such as GPT-3 [3], PaLM [4], Galactica [5], and LLaMA [6] have exhibited strong ability to comprehend natural language and solve complex tasks with zero/few-shot learning, attaining remarkable results without requiring extensive task-specific data. Large-scale vision foundation models are currently making significant advances in perception tasks, as highlighted by [7, 8]. Specifically, vision-language models (VLM) are pre-trained with large-scale image-text pairs and are then directly applicable to downstream visual recognition tasks. VLMs generally consist of three fundamental parts: textual features, visual features, and a fusion module. These elements work together in harmony, allowing the models to efficiently use text and visual data to generate contextually appropriate and logical results. Specifically, the pre-training of VLMs typically adheres to vision-language objectives that aid in acquiring image-text correlations from large collections of image-text pairs. For instance, the pioneer study CLIP [9], an image-text matching model, utilizes contrastive learning methods to generate fused representations for images and texts. The learning objective is to minimize the gap between the representation of an image and its corresponding text, while simultaneously increasing the separation between the representations of unrelated pairs. In addition to this so-called “Textually Prompted Models (TPMs)", researchers have also explored Feature Maps (FMs) that can be prompted by visual inputs (points, boxes, masks) which we refer to as “Visually Prompted Models (VPMs)" [2] (see fig. 2 for a visual depiction of both). Recently, the Segment Anything (SA) model [10] has garnered significant attention in the vision community. SAM is a promptable model developed for the purpose of broad image segmentation. It was trained using a promptable segmentation task that enables powerful zero-shot generalization on 11 million images and over 1 billion masks. Furthermore, SAM has been expanded and refined through training on a large dataset, which encompasses 4.6 million medical images and 19.7 million corresponding masks [11]. This dataset offers a rich diversity, covering 10 distinct medical data modalities, and featuring annotations for 4 anatomical structures in addition to lesions. The training regimen is comprehensive, representing 31 major human organs. Notably, it has yielded impressive results that have bolstered the model’s capacity for enhanced generalization.
传统的深度学习范式严重依赖于大规模的、特定于任务的和人群标记的数据来训练个体深度神经网络（DNN）以完成各种视觉识别任务，与此相反，FM提供了一种更有效的替代方案。它们在可用性几乎无限的大规模数据集上进行预训练，从而可以直接应用于仅具有有限数量标记数据的下游任务。这种方法的转变显示出显著减少此类任务通常所需的劳动力和时间的潜力。最近的激增可以归因于大型语言模型（LLMs）以及数据和大小的扩展所取得的进展[2]。GPT—3 [3]、PaLM [4]、Galactica [5]和LLaMA [6]等模型都表现出了很强的理解自然语言的能力，并通过零/少次学习解决复杂任务，在不需要大量特定任务数据的情况下取得了显著的效果。大规模的视觉基础模型目前正在感知任务中取得重大进展，如[7，8]所强调的那样。具体来说，视觉语言模型（VLM）是用大规模的图像—文本对进行预训练的，然后直接应用于下游的视觉识别任务。VLM通常由三个基本部分组成：文本特征，视觉特征和融合模块。这些元素和谐地协同工作，使模型能够有效地使用文本和视觉数据来生成符合上下文的逻辑结果。具体而言，VLM的预训练通常遵循视觉语言目标，这些目标有助于从大量图像—文本对集合中获取图像—文本相关性。例如，先驱研究CLIP [9]，一种图像—文本匹配模型，利用对比学习方法来生成图像和文本的融合表示。学习目标是最小化图像的表示和其对应的文本之间的差距，同时增加不相关的对的表示之间的分离。除了这种所谓的“文本标记模型（TPMs）”之外，研究人员还探索了可以通过视觉输入（点，框，掩码）提示的特征映射（FM），我们称之为“视觉标记模型（VPMs）”[ 2]（见图2两者的视觉描述）。最近，Segment Anything（SA）模型[ 10]在视觉社区中引起了极大的关注。SAM是一个可扩展的模型，用于广泛的图像分割。它是使用可扩展的分割任务进行训练的，该任务可以对1100万张图像和超过10亿个掩码进行强大的零拍摄泛化。此外，SAM已经通过在一个大型数据集上的训练进行了扩展和改进，该数据集包括460万张医学图像和1970万张相应的掩模[ 11]。该数据集提供了丰富的多样性，涵盖了10种不同的医学数据模式，并具有除病变外的4种解剖结构的注释。训练方案是全面的，代表了31个主要的人体器官。值得注意的是，它产生了令人印象深刻的结果，增强了模型的泛化能力。

Moreover, this generic visual prompt-based segmentation model has recently been adapted to a wide range of downstream tasks, including medical image analysis [12, 13], image inpainting [14], style transfer [15], and image captioning [16] to name a few. Apart from foundational models that rely on textual and visual prompts, research endeavors have also delved into creating models that harmonize various types of paired modalities (such as image-text, and video-audio) to learn representations assisting diverse downstream tasks.
此外，这种通用的基于视觉识别的分割模型最近已经适应了广泛的下游任务，包括医学图像分析[ 12，13]，图像修复[ 14]，风格转移[ 15]和图像字幕[ 16]。除了依赖于文本和视觉提示的基础模型之外，研究工作还深入研究了创建协调各种类型的成对模态（例如图像-文本和视频-音频）的模型，以学习辅助不同下游任务的表示。

The creationn of foundation models has garnered significant attention in the medical AI system development realm [17, 18, 19, 20, 21]. Despite the substantial advancements in biomedical AI, the main methodologies used still tend to be tas-specific models. However, medical practice encompasses various data modalities comprising text, imaging, genomics, and others, making it essentially multimodal [22]. Inherently, a Medical Foundation Model (MFM) has the ability to adaptively interpret various medical modalities, including diverse data sources such as images, electronic medical records, lab findings, genomic information, medical diagrams, and textual data [23]. Hence, foundational models have the potential to provide an enhanced foundation for addressing clinical issues, advancing the field of medical imaging, and improving the efficiency and effectiveness of diagnosing and treating diseases, leading to the opportunity to develop a unified biomedical AI system that can interpret complex multimodal data. Due to the acceleration of both biomedical data production and advancements, the influence of these models is expected to expand due to an influx of contributions. As shown in Figure 1, a significant body of research has been devoted to the application of FMs in diverse medical imaging contexts until the first release of our survey in October 2023. These contributions encompass a wide range of potential applications, from fundamental biomedical discoveries to the upgrading of healthcare delivery. Hence, it is advantageous for the community and timely to review the existing literature.
基础模型的创建在医疗AI系统开发领域获得了极大的关注[17，18，19，20，21]。尽管生物医学人工智能取得了长足的进步，但使用的主要方法仍然倾向于特定于tas的模型。然而，医疗实践包括各种数据模式，包括文本，成像，基因组学等，使其基本上是多模态的[ 22]。从本质上讲，医学基础模型（MFM）能够自适应地解释各种医疗模式，包括各种数据源，如图像，电子病历，实验室发现，基因组信息，医学图表和文本数据[ 23]。因此，基础模型有可能为解决临床问题，推进医学成像领域以及提高诊断和治疗疾病的效率和有效性提供增强的基础，从而有机会开发一个统一的生物医学AI系统，可以解释复杂的多模态数据。由于生物医学数据生产和进步的加速，这些模型的影响力预计将因大量捐款而扩大。如图1所示，在我们于2023年10月首次发布调查之前，大量研究一直致力于FM在各种医学成像环境中的应用。这些贡献涵盖了广泛的潜在应用，从基础生物医学发现到医疗保健服务的升级。因此，对现有文献进行回顾，对社会有利，也是及时的。

This paper provides a holistic overview of the foundation models developed for medical imaging applications. We distinguish existing works inspired by the taxonomy proposed in [2] and highlight the major strengths and shortcomings of the existing methods. We hope that this work will point the way forward, provide a roadmap for researchers, stimulate further interest and enthusiasm within the vision community, and harness the potential of foundation models in the medical discipline. This survey will be regularly updated to reflect the dynamic progress of the MFMs, as this is a rapidly evolving and promising field towards AGI in the biomedical field. Our major contributions include:
本文提供了一个全面的概述基础模型开发的医学成像应用。我们区分现有的作品的启发，在[2]中提出的分类法，并强调现有方法的主要优点和缺点。我们希望这项工作将指明前进的方向，为研究人员提供路线图，激发视觉社区的进一步兴趣和热情，并利用医学学科基础模型的潜力。该调查将定期更新，以反映MFM的动态进展，因为这是生物医学领域AGI快速发展和有前途的领域。我们的主要贡献包括：

$\bullet$ We conduct a thorough and exhaustive examination of foundation models proposed in the field of medical imaging, beginning from background and preliminaries for foundation models, to specific applications along with the organ concerned and imaging modality in a hierarchical and structured manner
$\bullet$ 我们对医学成像领域中提出的基础模型进行了彻底和详尽的检查，从基础模型的背景和分类开始，以分层和结构化的方式沿着有关器官和成像模态的特定应用

$\bullet$ Our work provides a taxonomized (Figure 3), in-depth analysis (e.g. task/organ-specific research progress and limitations), as well as a discussion of various aspects.
$\bullet$ 我们的工作提供了分类（图3），深入分析（例如任务/器官特定的研究进展和局限性），以及对各个方面的讨论。

$\bullet$ Furthermore, we discuss the challenges and unresolved aspects linked to foundation models in medical imaging. We pinpoint new trends, raise important questions, and propose future directions for further exploration.
$\bullet$ 此外，我们还讨论了与医学成像基础模型相关的挑战和未解决的问题。我们指出了新的趋势，提出了重要的问题，并提出了进一步探索的未来方向。

1.1 Clinical Importance 1.1临床重要性

In medical imaging, foundation models are reshaping the way research methods are designed and paradigms are approached, paving the way for innovative advancements and pioneering breakthroughs across various sectors owing to some of their inherent properties aligned with the medical domain as follows.
在医学成像领域，基础模型正在重塑研究方法的设计方式和范式，为各个领域的创新进步和开拓性突破铺平道路，因为它们的一些固有特性与医学领域保持一致，如下所示。
Multi-Modality: Despite advances in biomedical AI, most models today are limited to single-task, unimodal functions. For instance, a mammogram interpretation AI excels at breast cancer screening but can’t incorporate patient records, and additional data like MRI, or engage in meaningful dialogue, limiting its real-world applicability.
多模态：尽管生物医学人工智能取得了进步，但今天的大多数模型仅限于单任务、单峰功能。例如，乳房X光片解释AI在乳腺癌筛查方面表现出色，但无法整合患者记录和MRI等其他数据，也无法进行有意义的对话，从而限制了其在现实世界中的适用性。
Explainability and Generalization: The absence of explainability in deep learning models can erode trust among clinicians accustomed to clear clinical insights [58]. The ability of models to generalize across different medical settings is vital due to varying data sources. Foundation models address these issues by offering a unified framework for tasks like detection and classification, often trained on diverse datasets from various medical centers, enhancing their potential for clinical use by ensuring interpretability and broad applicability.
可解释性和概括性：深度学习模型缺乏可解释性可能会削弱习惯于明确临床见解的临床医生之间的信任[ 58]。由于数据源不同，模型在不同医疗环境中推广的能力至关重要。基础模型通过为检测和分类等任务提供统一的框架来解决这些问题，这些任务通常在来自各个医疗中心的不同数据集上进行训练，通过确保可解释性和广泛的适用性来增强其临床应用潜力。
Privacy Preservation: The computer vision community has a history of open-sourcing datasets, but in medical imaging, privacy regulations limit data sharing. Foundation models offer a privacy-preserving alternative by allowing knowledge transfer without direct access to sensitive data. Additionally, federated learning enables model training on distributed data while keeping it on local machines, ensuring data privacy. Moreover, foundation models facilitate privacy preservation by generating synthetic data resembling real medical images, eliminating the need for actual patient data in model training.
隐私保护：计算机视觉社区有开源数据集的历史，但在医学成像领域，隐私法规限制了数据共享。基础模型提供了一种保护隐私的替代方案，允许在不直接访问敏感数据的情况下转移知识。此外，联邦学习可以在分布式数据上进行模型训练，同时将其保留在本地机器上，确保数据隐私。此外，基础模型通过生成类似于真实的医学图像的合成数据来促进隐私保护，从而消除了在模型训练中对实际患者数据的需要。
Adaptability: Existing medical AI models struggle when faced with distribution shifts caused by changes in technology, procedures, settings, or populations. In contrast, MFMs can effectively adapt to these shifts through in-context learning. For instance, a hospital can teach an MFM model to interpret X-rays from a new scanner by providing a few examples as prompts, enabling it to adjust to new data distributions in real time. This capability is mainly seen in large language models and is not common in conventional medical AI models, which would typically require complete retraining with new datasets.
适应性：现有的医疗AI模型在面临技术、程序、设置或人口变化引起的分布变化时会遇到困难。相比之下，MFM可以通过上下文学习有效地适应这些变化。例如，医院可以通过提供一些示例作为提示来教MFM模型解释来自新扫描仪的X射线，使其能够真实的调整新的数据分布。这种能力主要见于大型语言模型，在传统的医学AI模型中并不常见，这通常需要使用新的数据集进行完整的重新训练。
Domain Knowledge: Unlike clinicians, traditional medical AI models often lack initial medical domain knowledge, relying solely on statistical associations. Medical imaging foundation models like GMAI can address this limitation by integrating formal medical knowledge, using structures like knowledge graphs and retrieving relevant context from existing databases, improving their performance on specific tasks. In summary, foundation models play a crucial role in advancing medical applications by providing a robust and adaptable framework that enhances efficiency, generalizability, and privacy preservation. Their ability to support various clinical tasks and promote collaboration makes them invaluable tools for improving patient care and medical research.
领域知识：与临床医生不同，传统的医学AI模型通常缺乏初始的医学领域知识，仅依赖于统计关联。像GMAI这样的医学成像基础模型可以通过整合正式的医学知识，使用知识图等结构并从现有数据库中检索相关上下文来解决这一限制，从而提高其在特定任务上的性能。总而言之，基础模型通过提供一个强大的、可适应的框架来提高效率、可推广性和隐私保护，在推进医疗应用方面发挥着至关重要的作用。他们支持各种临床任务和促进合作的能力使他们成为改善患者护理和医学研究的宝贵工具。

1.2 Relevant Surveys 1.2相关调查

With the recent success of foundation models, there has been a surge of surveys and contributions in this domain. Some of the reviews investigate recent advances in LLMs, distinguishing different aspects of LLMs by analyzing the impact of pre-training adaptation tuning, utilization, and evaluation [59, 60, 61]. In the context of vision models, the work of [2] provides a comprehensive review of FMs including their typical architecture design, training objectives, and prompting mechanisms. The work of [62] delivers a comprehensive survey of research in prompt engineering on diverse types of vision-language models, organizing existing prompt-engineering approaches from a new perspective. Besides, [63] provides a systematic review of visual language models for various visual recognition tasks including image classification, object detection, and semantic segmentation. In the medical imaging field, [23] identifies the potential applications, opportunities, and challenges of MFMs. The work of [24] provides a comprehensive and objective evaluation of SAM on medical image segmentation, while [64] discusses the spectrum, and future directions of foundation models. However, different from the aforementioned works, we devise a multi-perspective taxonomy of foundation models in the medical community, providing a systematical category of research in medical foundation models and their applications dividing them into textually prompted models, and visually prompted models where each paper is broadly classified according to the proposed algorithm along with the organ concerned and imaging modality, respectively. We present the concepts and theoretical foundations behind foundation models ranging from training objectives and instruction-aligning to prompt engineering (Section 2). In Section 3.1, we comprehensively cover an extensive and up-to-date overview of the recent medical foundation models, as shown in Figure 3. We wrap up this survey by pinpointing future directions and open challenges facing foundation models in medical imaging in Section 5.
随着基金会模式最近的成功，这一领域的调查和捐款激增。一些评论调查了LLMs的最新进展，通过分析预训练适应调整，利用和评估的影响来区分LLMs的不同方面[ 59，60，61]。在视觉模型的背景下，[ 2]的工作提供了对FM的全面回顾，包括其典型的架构设计，培训目标和激励机制。[ 62]的工作提供了对不同类型的视觉语言模型的即时工程研究的全面调查，从新的角度组织现有的即时工程方法。此外，[ 63]还系统地回顾了用于各种视觉识别任务的视觉语言模型，包括图像分类，对象检测和语义分割。在医学成像领域，[ 23]确定了MFM的潜在应用，机遇和挑战。 [24]的工作提供了SAM在医学图像分割上的全面客观评价，而[64]讨论了基础模型的频谱和未来方向。然而，不同于上述作品，我们设计了一个多角度的分类基础模型在医学界，提供了一个系统的类别的研究在医学基础模型及其应用分为文本提示模型，视觉提示模型，其中每篇论文被广泛分类，根据所提出的算法沿着与有关的器官和成像方式，分别。我们提出了基础模型背后的概念和理论基础，从培训目标和预防调整到提示工程（第2节）。在第3.1节中，我们全面介绍了最近医学基金会模型的广泛和最新概述，如图3所示。我们通过在第5节中指出医学成像基础模型面临的未来方向和开放性挑战来总结本次调查。

1.3 Search Strategy 1.3搜索策略

We conducted extensive searches across various platforms, such as DBLP, Google Scholar, and Arxiv Sanity Preserver. We leveraged their search capabilities to create tailored queries and compile comprehensive lists of academic works. These searches encompassed a broad spectrum of scholarly publications, including peer-reviewed journal articles, conference papers, workshop materials, non-peer-reviewed content, and preprints. We tailored our search criteria to achieve this diversity. Our specific search queries consisted of keywords (foundation* $|$ generalist* $|$ medical* $|$ {Task}*), (med-{FM} $|$ foundation*), (foundation* $|$ biomedical* $|$ image* $|$ model*), where {FM} and {Task} refer to one well-known vision foundation model(such as PaLM, CLIP, etc) or Tasks (such as Segmentation, Question Answering, etc) in medical imaging. We then applied filtering to eliminate false positives, ensuring that only papers related to foundation models were included in our analysis.
我们在各种平台上进行了广泛的搜索，例如DBLP，Google Scholar和Arxiv Sanity Preserver。我们利用他们的搜索功能创建量身定制的查询，并编制全面的学术著作列表。这些检索涵盖了广泛的学术出版物，包括同行评审期刊文章、会议论文、研讨会材料、非同行评审内容和预印本。我们调整了搜索标准，以实现这种多样性。我们的特定搜索查询由关键字组成（基金会 * $|$ 通才 * $|$ 医疗 * $|$ {任务}*），（med-{FM} $|$ foundation*），（foundation* $|$ biomedical* $|$ image* $|$ model*），其中{FM}和{Task}是指一个众所周知的视觉基础模型（如PaLM、CLIP等）或任务（如分割、问题分类等）。然后，我们应用过滤来消除误报，确保只有与基础模型相关的论文才包含在我们的分析中。

1.4 Paper Organization. 1.4纸组织。

The rest of the survey is organized as follows. Section 2 presents the background and preliminaries for foundation models. We adopt the taxonomy of [2] and categorize previous studies into two main groups: those prompted by textual inputs (discussed in section 3.1) and those driven by visual cues (discussed in section 3.2). In the context of textually prompted foundation models, we further subdivide them into contrastive, generative, hybrid (combining contrastive and generative approaches), and conversational visual language models. In addition, we differentiate textually prompted models into adaptations and generalist models. Furthermore, Section 5 reveals the risk, open problems, and future directions of foundation models. Finally, we conclude our research in Section 6.
调查的其余部分组织如下。第2节介绍了基础模型的背景和基本原理。我们采用[2]的分类法，将以前的研究分为两大类：由文本输入提示的研究（在3.1节中讨论）和由视觉线索驱动的研究（在3.2节中讨论）。在文本提示的基础模型的背景下，我们进一步将其细分为对比，生成，混合（结合对比和生成方法）和会话视觉语言模型。此外，我们区分文本提示模型适应和通才模型。此外，第5节揭示了风险，开放的问题，和未来的发展方向的基础模型。最后，我们在第6节中总结了我们的研究。

2 Preliminaries 2000年

The term "foundational models" made its debut at Stanford Institute for Human-Centred AI in [1] with the definition of "the base models trained on large-scale data in a self-supervised or semi-supervised manner that can be adapted for several other downstream tasks". Specifically, inspired by the surge of large language models (LLMs), using the basic fundamentals of deep learning such as DNNs and self-supervised learning, foundation models have emerged by massively scaling up both data and model size. In this section, we introduce the basic model architectures, concepts, and settings behind FMs focusing on contributing factors for these models in computer vision such as training objectives, instruction-aligning, inference procedure and prompting.
术语“基础模型”在斯坦福大学以人为中心的人工智能研究所首次亮相[ 1]，定义为“以自我监督或半监督的方式在大规模数据上训练的基础模型，可以适应其他几个下游任务”。具体来说，受大型语言模型（LLMs）激增的启发，使用深度学习的基本原理，如DNN和自监督学习，基础模型通过大规模扩展数据和模型大小而出现。在本节中，我们将介绍FM背后的基本模型架构、概念和设置，重点介绍计算机视觉中这些模型的影响因素，如训练目标、推理对齐、推理过程和提示。

2.1 Pre-training Objectives
2.1前期培训目标

Diverse pretraining objectives have been devised to learn a rich understanding of the relationship between vision and language [65, 66, 67, 68]. We broadly categorize them into contrastive and generative objectives.
设计了各种预训练目标，以学习对视觉和语言之间关系的丰富理解[ 65，66，67，68]。我们大致将它们分为对比性和生成性目标。

2.1.1 Contrastive Objectives
2.1.1对比目标

Contrastive objectives instruct models to acquire distinctive representations [69, 67] by bringing related sample pairs closer together while pushing unrelated pairs farther apart within the feature space. Specifically, Image Contrastive Loss (ICL) aims to learn discriminative image features making a query image closely resemble its positive keys (i.e., its data augmentations) while ensuring it remains distant from its negative keys (i.e., other images) within the embedding space. Consider a batch of $B$ images, contrastive objectives such as InfoNCE [70] and its variations [67, 69], $\mathcal{L}_{\text{I}}^{\text{InfoNCE}}$ can be expressed as:
对比目标指导模型通过将相关样本对更紧密地结合在一起，同时将不相关的样本对在特征空间内推得更远来获得独特的表示[69，67]。具体地，图像对比度损失（ICL）旨在学习使查询图像与其正键（即，其数据增强）同时确保其保持远离其否定键（即，其他图像）。考虑一批 $B$ 图像，对比目标，如InfoNCE [70]及其变体[67，69]， $\mathcal{L}_{\text{I}}^{\text{InfoNCE}}$ 可以表示为：

\mathcal{L}_{\text{I}}^{\text{InfoNCE}}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(\theta_{i}^{\text{query}}\cdot\theta_{+}^{\text{positive}}/\tau\right)}{\sum_{j=1,j\neq i}^{B+1}\exp\left(\theta_{i}^{\text{query}}\cdot\theta_{j}^{\text{key}}/\tau\right)}

where $\theta_{i}^{\text{query}}$ represents the query embedding, $\{\theta_{j}^{\text{key}}\}_{j=1,j\neq i}^{B+1}$ are the key embeddings, where $\theta_{+}^{\text{positive}}$ denotes the positive key corresponding to $\theta_{i}^{\text{query}}$ , while the rest are considered negative keys. The hyperparameter $\tau$ governs the density of the learned representation.
其中 $\theta_{i}^{\text{query}}$ 表示查询嵌入， $\{\theta_{j}^{\text{key}}\}_{j=1,j\neq i}^{B+1}$ 是键嵌入，其中 $\theta_{+}^{\text{positive}}$ 表示与 $\theta_{i}^{\text{query}}$ 对应的正键，而其余的被认为是负键。超参数 $\tau$ 控制学习到的表示的密度。

Image-Text Contrastive Loss (ITCL) Seeks to develop distinctive image-text representations by bringing together the embeddings of matched images and texts and pushing apart those that do not match [68, 7]. Let $\left(i,t_{i}\right)$ represent the $i$ -th image-text example, then, the image-to-text loss is calculated as:
图像-文本对比损失（ITCL）试图通过将匹配的图像和文本的嵌入放在一起并将不匹配的嵌入推开来开发独特的图像-文本表示[ 68，7]。假设 $\left(i,t_{i}\right)$ 表示第 $i$ 个图像-文本示例，则图像-文本损失计算为：

\mathcal{L}_{I\rightarrow T}=-\log\left[\frac{\exp\left(\theta_{i}\cdot\theta_{+}/\tau\right)}{\sum_{j=1}^{N}\exp\left(\theta_{i}\cdot\theta_{j}/\tau\right)}\right]

where $N$ is the total number of such pairs, and $\theta_{i}$ corresponds to the embedding for image $i$ , while $\theta_{+}$ and $\theta_{j}$ denote positive and negative text representations, respectively. The losses are computed with a focus on the relationship between images and texts while considering the temperature parameter $\tau$ .
其中 $N$ 是这样的对的总数，并且 $\theta_{i}$ 对应于图像 $i$ 的嵌入，而 $\theta_{+}$ 和 $\theta_{j}$ 分别表示正文本表示和负文本表示。在考虑温度参数 $\tau$ 的同时，重点关注图像和文本之间的关系来计算损失。

The text-to-image loss is also calculated similarly, and the total loss is the sum of these two terms:
文本到图像的损失也以类似的方式计算，总损失是这两项的总和：

\mathcal{L}_{ITC}=\frac{1}{N}\sum_{i=1}^{N}\left[\mathcal{L}_{I\rightarrow T}+\mathcal{L}_{T\rightarrow I}\right]

Akin to ICL and ITCL, various contrastive loss functions have also found an application (SimCLR [69, 71], FILIP Loss [72], Region-Word Alignment (RWA) [73], and Region-Word Contrastive (RWC) [74]).
与ICL和ITCL类似，各种对比损失函数也有应用（Simplified [ 69，71]，FILIP Loss [ 72]，区域词对齐（RWA）[ 73]和区域词对比（RWC）[ 74]）。

2.1.2 Generative Objectives
2.1.2生成目标

Generative objectives involve teaching networks to produce image or text data, which allows them to acquire semantic features, accomplished through tasks like image generation [75], and language generation [76].
生成目标涉及教学网络生成图像或文本数据，这使它们能够通过图像生成[75]和语言生成[76]等任务获得语义特征。

Masked Image Modelling (MIM) involves the acquisition of cross-patch correlations by applying masking and image reconstruction techniques. In MIM, a selection of patches within an input image is randomly masked, and the encoder is trained to reconstruct these masked patches based on the unmasked patches. For a given batch of $B$ images, the loss function is formulated as:
掩模图像建模（MIM）涉及通过应用掩模和图像重建技术来获取跨块相关性。在MIM中，输入图像内的块的选择被随机地掩蔽，并且编码器被训练为基于未掩蔽的块来重建这些掩蔽的块。对于给定批次的 $B$ 图像，损失函数被公式化为：

\mathcal{L}_{MIM}=-\frac{1}{B}\sum_{i=1}^{B}\log f_{\theta}\left(\bar{x}_{i}^{I}\,|\,\hat{x}_{i}^{I}\right),

where $\bar{x}_{i}^{I}$ and $\hat{x}_{i}^{I}$ represent the masked and unmasked patches within $x_{i}^{I}$ , respectively [63].
其中 $\bar{x}_{i}^{I}$ 和 $\hat{x}_{i}^{I}$ 分别代表 $x_{i}^{I}$ 中的掩蔽和未掩蔽贴片[ 63]。

Masked Language Modelling (MLM) is a widely adopted pretraining objective in Natural Language Processing (NLP). In MLM, a specific percentage of input text tokens is randomly masked, and these masked tokens are reconstructed using the unmasked ones. The loss function for MLM can be expressed as:
Masked Language Modeling（MLM）是自然语言处理（NLP）中广泛采用的预训练目标。在MLM中，特定百分比的输入文本标记被随机屏蔽，并且这些被屏蔽的标记使用未屏蔽的标记进行重构。MLM的损失函数可以表示为：

\mathcal{L}_{MLM}=-\frac{1}{B}\sum_{i=1}^{B}\log f_{\theta}\left(\bar{x}_{i}^{T}\,|\,\hat{x}_{i}^{T}\right),

where $\bar{x}_{i}^{T}$ and $\hat{x}_{i}^{T}$ denote the masked and unmasked tokens within $x_{i}^{T}$ , respectively, and $B$ denotes the batch size [63].
其中 $\bar{x}_{i}^{T}$ 和 $\hat{x}_{i}^{T}$ 分别表示 $x_{i}^{T}$ 中的掩码和未掩码令牌， $B$ 表示批量大小[ 63]。

Likewise, diverse additional generative loss functions have been introduced in the field including Masked Multimodal Modeling (MMM) loss [77], Image-conditioned Masked Language Modeling (IMLM) loss [78], and Captioning with Parallel Prediction (CapPa) [79].
同样，在该领域中引入了各种额外的生成损失函数，包括掩蔽多模态建模（MMM）损失[ 77]，图像条件掩蔽语言建模（IMLM）损失[ 78]和并行预测字幕（CapPa）[ 79]。

2.2 Pre-training Tasks 2.2前期培训任务

As discussed in section 2.1, FMs pre-training has been studied with typical approaches including contrastive objectives, and generative objectives. In natural language processing, certain pre-training tasks include masked language modeling, where words in the input sequence are randomly hidden, and the model predicts these hidden words during pre-training. Another task involves next-sentence-prediction, where pairs of sentences from distinct documents are presented, and the model determines whether the order of these sentences is accurate. Additionally, there’s the denoising auto-encoder task, which introduces noise into the original text corpus and then aims to reconstruct the pristine input using the noisy version of the corpus. Likewise, to enable the generalization of learned representations to a range of downstream vision domains, pretext tasks such as inpainting [80], auxiliary supervised discriminative tasks, and data reconstruction tasks [81] are used in the pre-training stage.
如第2.1节所讨论的，FM预训练已经用典型的方法进行了研究，包括对比目标和生成目标。在自然语言处理中，某些预训练任务包括掩蔽语言建模，其中输入序列中的单词被随机隐藏，并且模型在预训练期间预测这些隐藏的单词。另一个任务涉及下一个句子预测，其中呈现来自不同文档的句子对，并且模型确定这些句子的顺序是否准确。此外，还有去噪自动编码器任务，它将噪声引入原始文本语料库，然后旨在使用语料库的噪声版本重建原始输入。同样，为了将学习的表示推广到一系列下游视觉域，在预训练阶段使用了诸如修复[ 80]，辅助监督判别任务和数据重建任务[ 81]等借口任务。

2.3 Instruction-Aligning

Instruction-aligning methods aim to let the LM follow human intents and generate meaningful outputs. This process involves either fine-tuning the model on a diverse set of tasks with human-annotated prompts and feedback (RLHF) [82], conducting supervised fine-tuning on publicly available benchmarks and datasets, which are augmented with manually or automatically generated instructions, and improving the reasoning ability of LLMs by instructing them to produce a sequence of intermediate actions that ultimately lead to the solution of a multi-step problem (Chain-of-thought) [83].
指令对齐方法旨在让LM遵循人类意图并生成有意义的输出。这个过程涉及使用人工注释的提示和反馈（RLHF）[ 82]对不同任务集的模型进行微调，对公开可用的基准和数据集进行监督微调，这些基准和数据集使用手动或自动生成的指令进行增强，提高了LLMs的推理能力通过指导他们产生一系列最终导致多步骤问题解决方案的中间行动（思维链）[ 83]。

2.4 Prompt Engineering 2.4 Prompt工程

Prompt engineering refers to a method that enhances a large pre-trained model by incorporating task-specific hints, referred to as prompts, to tailor the model for new tasks enabling the power to acquire predictions based only on prompts without updating model parameters [62]. In the context of large language models (LLMs) prompting techniques can be categorized into two primary groups depending on the clarity of the templates they employ: "soft prompts" (optimizable, learnable) and "hard prompts (manually crafted text prompts)". Within the "hard prompt" category, there are four subcategories: task instructions, in-context learning, retrieval-based prompting, and chain-of-thought prompting. In contrast, "soft prompts" fall into two strategies: prompt tuning and prefix token tuning, which differ in whether they introduce new tokens into the model’s architecture or simply attach them to the input. In the vision domain, prompt engineering facilitates the acquisition of joint multi-modal representations (e.g., CLIP [68] for image classification of ALIGN [84]) to introduce human interaction to the foundational models and employs vision-language models for visual tasks.
提示工程是指一种通过结合特定于任务的提示（称为提示）来增强大型预训练模型的方法，以针对新任务定制模型，从而能够仅基于提示获取预测而无需更新模型参数[ 62]。在大型语言模型（LLMs）的上下文中，提示技术可以根据它们所采用的模板的清晰度分为两大类：“软提示”（可优化，可学习）和“硬提示”（手动制作的文本提示）。在“硬提示”类别中，有四个子类别：任务指令，上下文学习，基于检索的提示和思维链提示。相比之下，“软提示”分为两种策略：提示调优和前缀令牌调优，它们的不同之处在于它们是将新令牌引入模型的体系结构还是简单地将它们附加到输入。在视觉领域中，即时工程有助于获取联合多模态表示（例如，CLIP [ 68]用于ALIGN [ 84]的图像分类），将人类交互引入基础模型，并将视觉语言模型用于视觉任务。

3 Foundational Models for Medical Imaging
3医学成像的基础模型

Establishing a taxonomy for foundational models in medical imaging analysis follows the standard practices commonly employed in the field. However, we distinguish our approach by providing extensive additional information for each sub-category as presented in Figure 3. In this section, we explore foundational-based methods, which have been introduced to tackle diverse challenges in medical imaging analysis through the design of distinct training strategies.
在医学成像分析中建立基础模型的分类遵循该领域通常采用的标准实践。然而，我们通过为每个子类别提供广泛的额外信息来区分我们的方法，如图3所示。在本节中，我们将探讨基于基础的方法，这些方法通过设计不同的训练策略来应对医学成像分析中的各种挑战。

3.1 Textually Prompted Models
3.1文本化的模型

3.1.1 Contrastive

Contrastive textually prompted models are increasingly recognized in the foundational models for medical imaging. They learn representations that encapsulate the semantics and relationships between medical images and their textual prompts. Leveraging contrastive learning objectives, they draw similar image-text pairs closer in the feature space while pushing dissimilar pairs apart. These models are pivotal for image classification, segmentation, and retrieval tasks. Architectural explorations have ranged from dual-encoder designs—with separate visual and language encoders—to fusion designs that merge image and text representations via decoder and transformer-based architectures. Their potential in medical imaging tasks such as lesion detection, disease classification, and image synthesis is evident in numerous studies. In this direction, Wang et al. [31] introduced the MedCLIP framework, demonstrating its superiority over state-of-the-art methods in zero-shot prediction, supervised classification, and image-text retrieval. Expanding upon the success of models like CLIP, [34] unveiled BiomedCLIP, tailored for biomedical vision-language processing. Its training on a vast dataset of 15 million figure-caption pairs highlighted the efficacy of specialized pretraining in the medical imaging field.
对比文本提示模型在医学成像的基础模型中越来越受到重视。他们学习封装医学图像及其文本提示之间的语义和关系的表示。利用对比学习目标，它们在特征空间中将相似的图像—文本对拉近，同时将不同的图像—文本对分开。这些模型是图像分类、分割和检索任务的关键。架构探索的范围从双编码器设计（具有单独的视觉和语言编码器）到融合设计（通过基于解码器和转换器的架构合并图像和文本表示）。它们在医学成像任务中的潜力，如病变检测，疾病分类和图像合成在许多研究中是显而易见的。在这个方向上，Wang et al. [ 31]介绍了MedCLIP框架，证明了它在零拍摄预测，监督分类和图像-文本检索方面优于最先进的方法。在CLIP等模型的成功基础上，[ 34]推出了专为生物医学视觉语言处理而定制的BiomedCLIP。它在1500万个图形-标题对的庞大数据集上进行的训练，突出了医学成像领域专业预训练的有效性。

Visual language pre-training has made significant advancements in representation learning, especially evident in challenging scenarios like zero-shot transfer tasks in open-set image recognition. Nevertheless, computational pathology hasn’t delved deeply into zero-shot transfer due to data scarcity and challenges presented by gigapixel histopathology whole-slide images (WSI). Drawing inspiration from the success of multiple-instance learning in weakly supervised learning tasks, [36] introduced MI-Zero (Figure 4). In this method, each WSI is divided into smaller tiles, referred to as instances, which are more manageable for the image encoder. Each instance’s cosine similarity scores at the patch level are calculated independently against every text prompt within the latent space. Following this, instance-level scores are combined to generate slide-level scores using a permutation-invariant operator, similar to those in multiple instance learning, such as mean or top K pooling. An optional spatial smoothing step aggregates the information of neighboring patches. When tested on three different real-world cancer subtyping tasks, MI-Zero either matched or outperformed baselines, achieving an average median zero-shot accuracy of $70.2$ %.
视觉语言预训练在表征学习方面取得了重大进展，尤其是在具有挑战性的场景中，如开集图像识别中的零镜头迁移任务。然而，由于数据稀缺和千兆像素组织病理学全切片图像（WSI）带来的挑战，计算病理学尚未深入研究零拍摄转移。从弱监督学习任务中多实例学习的成功中汲取灵感，[ 36]引入了MI-Zero（图4）。在该方法中，每个WSI被划分为更小的区块，称为实例，其对于图像编码器更易于管理。每个实例在补丁级别的余弦相似性分数是针对潜在空间内的每个文本提示独立计算的。在此之后，使用置换不变运算符组合实例级分数以生成幻灯片级分数，类似于多实例学习中的那些，例如均值或前K池。可选的空间平滑步骤聚合相邻块的信息。当在三个不同的现实世界癌症亚型任务中进行测试时，MI—Zero的表现与基线相匹配或优于基线，实现了 $70.2$ %的平均中位零射击准确度。

In another study, Bannur et al. [32] unveiled the BioViL-T method for biomedical Vision-Language Processing (VLP). Exploiting the data’s temporal structure, BioViL-T reached state-of-the-art levels in tasks such as progression classification, phrase grounding, and report generation. Incorporating prior images and reports considerably enhanced the model’s efficacy in disease classification and sentence-similarity tasks. The hybrid multi-image encoder in BioViL-T adeptly captured spatiotemporal features, proving valuable for tasks demanding dense visual reasoning over time.
在另一项研究中，Bannur等人[32]公布了用于生物医学视觉语言处理（VLP）的BioViL—T方法。利用数据的时间结构，BioViL—T在进展分类、短语基础和报告生成等任务中达到了最先进的水平。对先前的图像和报告进行分类，大大增强了模型在疾病分类和相似性任务中的功效。BioViL—T中的混合多图像编码器熟练地捕获了时空特征，证明了随着时间的推移需要密集视觉推理的任务的价值。

Furthermore, Tiu et al. [30] revealed the potential of self-supervised learning models in pathology detection. Their model, CheXzero, showcased accuracies on par with radiologists in pathology classification. Remarkably, it outdid fully supervised models in detecting certain pathologies and demonstrated adaptability to unannotated pathologies, which weren’t specifically included during training. Such results emphasize the strength of contrastive textually prompted models in deciphering medical image interpretation tasks from unannotated data, thus minimizing dependence on extensive labeling.
此外，Tiu等人[30]揭示了自监督学习模型在病理检测中的潜力。他们的模型CheXzero在病理分类方面表现出与放射科医生相当的准确性。值得注意的是，它在检测某些病理方面优于完全监督模型，并表现出对未注释病理的适应性，这些病理在训练过程中没有特别包括。这样的结果强调了对比性文本提示模型在从未注释的数据中破译医学图像解释任务中的优势，从而最大限度地减少了对广泛标签的依赖。

The body of work presented emphasizes contrastive textually prompted models’ indispensable role in medical imaging. They showcase efficiency, performance enhancements, and an uncanny ability to infer intricate medical connotations. These models offer a promising solution to data scarcity, enriching medical image understanding and ultimately optimizing healthcare delivery.
提出的工作的主体强调对比文本提示模型在医学成像中不可或缺的作用。他们展示了效率，性能增强，以及推断复杂医学内涵的不可思议的能力。这些模型为数据稀缺提供了一个有前途的解决方案，丰富了医学图像的理解，并最终优化了医疗服务。

3.1.2 Generative

Generative models represent another category within the domain of textually prompted models for medical imaging. These models are designed to generate realistic medical images based on textual prompts or descriptions. They employ techniques such as variational autoencoders (VAEs) and generative adversarial networks (GANs) to understand the underlying distribution of medical images, subsequently creating new samples that correlate with given prompts. These models have shown promise in tasks such as producing images of specific diseases, augmenting training data, and crafting images that adhere to attributes detailed in the prompts. They offer valuable tools for data augmentation, anomaly detection, and creating varied medical image datasets for both training and evaluation. Nonetheless, challenges like capturing the intricacies and variability of medical images, maintaining semantic alignment between generated images and prompts, and addressing ethical concerns tied to fabricated medical images persist and warrant further research.
生成模型表示用于医学成像的文本提示模型领域内的另一类。这些模型旨在根据文本提示或描述生成逼真的医学图像。他们采用变分自编码器（VAE）和生成对抗网络（GAN）等技术来理解医学图像的潜在分布，随后创建与给定提示相关的新样本。这些模型在诸如生成特定疾病的图像、增强训练数据以及制作符合提示中详细描述的属性的图像等任务中表现出了希望。它们为数据增强、异常检测以及创建用于训练和评估的各种医学图像数据集提供了有价值的工具。尽管如此，像捕捉医学图像的复杂性和可变性，保持生成的图像和提示之间的语义对齐，以及解决与伪造的医学图像相关的伦理问题等挑战仍然存在，并需要进一步的研究。

In a notable study, Yan et al. [41] launched Clinical-BERT, a vision-language pre-training model fine-tuned for the medical sector. Pre-training encompassed domain-specific tasks like Clinical Diagnosis (CD), Masked MeSH Modeling (MMM), and Image-MeSH Matching (IMM). Their research demonstrated that Clinical-BERT outperformed its counterparts, especially in radiograph diagnosis and report generation tasks. Such results emphasize the utility of infusing domain-specific insights during the pre-training phase, thereby refining medical image analysis and clinical decision-making.
在一项值得注意的研究中，Yan等人[ 41]推出了Clinical-BERT，这是一种针对医疗领域进行微调的视觉语言预训练模型。预训练包括特定领域的任务，如临床诊断（CD），掩蔽MeSH建模（MMM）和图像MeSH匹配（IMM）。他们的研究表明，Clinical-BERT的表现优于同类产品，特别是在X光片诊断和报告生成任务方面。这些结果强调了在预训练阶段注入特定领域见解的实用性，从而改进医学图像分析和临床决策。

Singhal et al. [42] put forth Med-PaLM 2, a state-of-the-art large language model (LLM) targeting expert competence in medical question answering. By blending foundational LLM enhancements with medical-specific fine-tuning and innovative prompting tactics, the team sought to amplify the model’s proficiency. Med-PaLM 2 exhibited remarkable progress, registering elevated accuracy and better alignment with clinical utility. When subjected to pairwise ranking assessments, medical practitioners even favored Med-PaLM 2’s responses over those of their peers in terms of clinical relevance. This progression signifies the budding potential of LLMs in the realm of medical inquiries, inching closer to rivaling human physicians.
Singhal等人。[42]提出了Med—PaLM 2，这是一种最先进的大型语言模型（LLM），目标是医学问题回答中的专家能力。通过将基础LLM增强与医疗特定的微调和创新的提示策略相结合，该团队试图增强模型的熟练程度。Med—PaLM 2表现出显著的进步，记录了更高的准确性和更好的临床实用性。当进行成对排名评估时，医疗从业者甚至在临床相关性方面更喜欢Med—PaLM 2的反应。这一进展标志着LLMs在医疗咨询领域的潜力正在萌芽，逐渐接近人类医生。

Moor et al. [43] delved into the creation of Med-Flamingo, a few-shot learner with multimodal capabilities, tailor-made for medical applications. The model underwent pre-training on synchronized and staggered medical image-text data, followed by performance assessment on challenging visual question-answering (VQA) datasets. The outcome revealed that Med-Flamingo augmented generative medical VQA performance by up to 20%, as per clinician evaluations. Moreover, the model demonstrated prowess in addressing intricate medical queries and furnishing comprehensive justifications, surpassing preceding multimodal medical foundational models. These revelations underscore Med-Flamingo’s potential to enrich medical AI paradigms, promote personalized medicine, and bolster clinical decisions.
摩尔等人。[ 43]深入研究了Med-Flamingo的创建，这是一个具有多模式功能的少量学习者，专为医疗应用而量身定制。该模型在同步和交错的医学图像-文本数据上进行了预训练，然后在具有挑战性的视觉问答（VQA）数据集上进行了性能评估。结果显示，Med-Flamingo增强了生成医学VQA性能高达20%，根据临床医生的评估。此外，该模型在解决复杂的医疗问题和提供全面的理由方面表现出了强大的能力，超过了以前的多模式医疗基础模型。这些发现强调了Med-Flamingo在丰富医疗AI范式、促进个性化医疗和支持临床决策方面的潜力。

Collectively, these investigations showcase the strides made in the realm of generative textually prompted models and their implications for the medical sector. Merging domain-specific insights with advancements in language models and multimodal learning techniques has yielded auspicious results in areas like radiograph diagnosis, medical question resolution, and generative medical VQA. Such pioneering works fortify the burgeoning research landscape and chart the course for future innovations in generative models tailored for healthcare applications.
总的来说，这些调查展示了在生成文本提示模型领域所取得的进展及其对医学领域的影响。将特定领域的见解与语言模型和多模态学习技术的进步相结合，在放射诊断、医疗问题解决和生成医学VQA等领域取得了可喜的成果。这些开创性的工作巩固了新兴的研究领域，并为医疗保健应用定制的生成模型的未来创新指明了方向。

3.1.3 Hybrid

Hybrid textually prompted models distinguish themselves through the integration of training paradigms, specifically leveraging both generative and contrastive methodologies.
混合文本提示模型通过整合训练范式，特别是利用生成和对比方法来区分自己。

In a notable study, Chen et al. [28] unveiled a streamlined computer-aided diagnosis (CAD) system tailored for a specific 3D imaging modality, MRI. Drawing inspiration from BLIP-2 [85], they crafted a language-image pre-training model that employs bootstrapping to amalgamate 3D medical images with textual data via a query mechanism. At the outset, the researchers deployed a patch embedding that was trainable, bridging the disparity between 3D medical images and a previously trained image encoder. This approach markedly diminished the volume of image data requisite for training. Following this, they introduced the MedQFormer, an innovation that harnesses adjustable queries to align visual attributes seamlessly with the linguistic features demanded by a language model. To round off their methodology, they chose BioMedLM [86] as the foundational language model and fine-tuned it by harnessing the LoRA technique [87].
在一项值得注意的研究中，Chen等人[ 28]推出了一种针对特定3D成像模式（MRI）定制的流线型计算机辅助诊断（CAD）系统。从BLIP-2 [ 85]中汲取灵感，他们制作了一个语言图像预训练模型，该模型采用自举法通过查询机制将3D医学图像与文本数据合并。一开始，研究人员部署了一种可训练的补丁嵌入，弥合了3D医学图像和先前训练的图像编码器之间的差异。这种方法显著减少了训练所需的图像数据量。在此之后，他们推出了MedQFormer，这是一项创新，利用可调整的查询将视觉属性与语言模型所需的语言特征无缝对齐。为了完善他们的方法，他们选择BioMedLM [ 86]作为基础语言模型，并通过利用LoRA技术对其进行微调[ 87]。

An exhaustive suite of experiments, encompassing over 30,000 image volumes sourced from five public Alzheimer’s disease (AD) datasets, affirmed the model’s prowess. The results spotlighted its proficiency in zero-shot classification, distinguishing healthy individuals, subjects with mild cognitive impairment (MCI), and those diagnosed with AD. This efficacy underscores the model’s potential in executing medical visual question-answering (VQA) tasks with precision.
一套详尽的实验，包括来自五个公共阿尔茨海默病（AD）数据集的30，000多张图像，证实了该模型的威力。结果突出了它在零射击分类方面的能力，区分健康个体，轻度认知障碍（MCI）受试者和诊断为AD的受试者。这种功效强调了该模型在精确执行医疗视觉问答（VQA）任务方面的潜力。

3.1.4 Conversational

Conversational textually prompted models aim to enable interactive dialogues between medical professionals and the model by fine-tuning the foundational models on specific instruction sets. These models facilitate communication and collaboration between humans and the model, allowing medical experts to ask questions, provide instructions, or seek explanations regarding medical images. By incorporating conversational capabilities, these models enhance the interpretability and usability of foundational models in medical imaging. Researchers have explored various techniques to fine-tune the models on conversational datasets and develop architectures that can effectively process textual prompts in a dialogue context. Conversational textually prompted models hold great potential in medical imaging, enabling improved communication, knowledge transfer, and decision-making processes among medical professionals and AI systems. However, challenges related to understanding context, handling ambiguous queries, and ensuring accurate responses in complex medical scenarios are areas that require further investigation and refinement.
会话文本提示模型旨在通过在特定指令集上微调基础模型来实现医学专业人员与模型之间的交互式对话。这些模型促进了人类与模型之间的沟通和协作，允许医学专家提出问题，提供指导或寻求有关医学图像的解释。通过整合会话功能，这些模型增强了医学成像基础模型的可解释性和可用性。研究人员已经探索了各种技术来微调会话数据集上的模型，并开发出可以在对话上下文中有效处理文本提示的架构。会话文本提示模型在医学成像中具有巨大的潜力，可以改善医疗专业人员和人工智能系统之间的沟通，知识转移和决策过程。然而，与理解上下文、处理模糊查询以及确保在复杂医疗场景中准确响应相关的挑战是需要进一步调查和改进的领域。

In the study conducted by Li et al. [51], a cost-efficient approach for training a vision-language conversational assistant for biomedical images was introduced. The researchers leveraged a large-scale biomedical figure-caption dataset and utilized GPT-4 to generate instructions from text alone. By fine-tuning a general-domain vision-language model using a curriculum learning method, they developed the LLaVA-Med model. The findings showed that LLaVA-Med outperformed previous state-of-the-art models on certain metrics in three standard biomedical visual question-answering datasets. This highlights the potential of Conversational Textually Prompted Models, such as LLaVA-Med, in assisting with inquiries and answering open-ended research questions about biomedical images.
在Li等人进行的研究中。[ 51]，介绍了一种用于培训生物医学图像视觉语言会话助理的成本效益方法。研究人员利用大规模生物医学图形标题数据集，并利用GPT-4仅从文本中生成指令。通过使用课程学习方法微调通用域视觉语言模型，他们开发了LLaVA-Med模型。研究结果表明，LLaVA-Med在三个标准生物医学视觉问答数据集中的某些指标上优于以前的最先进模型。这突出了会话文本标记模型（如LLaVA-Med）在协助询问和回答有关生物医学图像的开放式研究问题方面的潜力。

Another study, conducted by Thawkar et al. [52], focused on the development of XrayGPT, a conversational medical vision-language model designed specifically for analyzing chest radiographs. XrayGPT aligned a medical visual encoder (MedClip) with a fine-tuned large language model (Vicuna) to enable visual conversation abilities grounded in a deep understanding of radiographs and medical knowledge. The study found that XrayGPT demonstrated exceptional visual conversation abilities and a deep understanding of radiographs and medical domain knowledge. Fine-tuning the large language model on medical data and generating high-quality summaries from free-text radiology reports further improved the model’s performance. These findings highlight the potential of Conversational Textually Prompted Models like XrayGPT in enhancing the automated analysis of chest radiographs and aiding medical decision-making.
Thawkar等人进行的另一项研究[52]专注于XrayGPT的开发，这是一种专门用于分析胸部X光片的对话式医学视觉语言模型。XrayGPT将医学视觉编码器（MedClip）与微调的大型语言模型（维库纳）对齐，以实现基于对射线照片和医学知识的深入理解的视觉对话能力。研究发现，XrayGPT表现出卓越的视觉对话能力，并对X光片和医学领域知识有深刻的理解。对医学数据的大型语言模型进行微调，并从自由文本放射学报告中生成高质量的摘要，进一步提高了模型的性能。这些发现突出了像XrayGPT这样的会话文本识别模型在增强胸部X光片自动分析和辅助医疗决策方面的潜力。

The study by Shu et al. [57], introduced Visual Med-Alpaca, an open-source parameter-efficient biomedical foundation model that combines language and visual capabilities. Visual Med-Alpaca was built upon the LLaMa-7B architecture and incorporated plug-and-play visual modules. The model was trained using a curated instruction set generated collaboratively by GPT-3.5-Turbo and human experts. The findings showed that Visual Med-Alpaca is a parameter-efficient biomedical model capable of performing diverse multimodal biomedical tasks. Incorporating visual modules and using cost-effective techniques like Adapter, Instruct-Tuning, and Prompt Augmentation made the model accessible and effective. This study emphasizes the importance of domain-specific foundation models and demonstrates the potential of conversational textually prompted models like Visual Med-Alpaca in biomedical applications.
Shu等人的研究[ 57]引入了Visual Med-Alpaca，这是一种结合了语言和视觉功能的开源参数高效生物医学基础模型。Visual Med-Alpaca基于LLaMa-7 B架构构建，并集成了即插即用的视觉模块。该模型使用GPT-3.5-Turbo和人类专家协作生成的策展指令集进行训练。研究结果表明，Visual Med-Alpaca是一种参数高效的生物医学模型，能够执行各种多模态生物医学任务。扩展可视化模块并使用具有成本效益的技术，如适配器，指令调整和即时增强，使模型易于访问和有效。这项研究强调了特定领域的基础模型的重要性，并展示了会话文本提示模型，如Visual Med-Alpaca在生物医学应用中的潜力。

3.2 Visually Prompted Models
3.2可视化模型

Within medical imaging, the recent surge of visually prompted models promises a blend of precision, adaptability, and generalization. These models, informed by the extensive capabilities of foundation models, offer the potential to revolutionize medical image analysis by catering to specific tasks while also adapting to a vast array of modalities and challenges. This section delves into two main trajectories of such models:
在医学成像中，最近涌现的视觉提示模型有望融合精确性、适应性和泛化能力。这些模型以基础模型的广泛功能为基础，通过满足特定任务，同时适应各种模式和挑战，提供了彻底改变医学图像分析的潜力。本节深入探讨这类模型的两个主要轨迹：

1.

Adaptations: As the name suggests, this sub-section explores the adaptations and modifications made to traditional segmentation models, enhancing their specificity and performance for medical imaging tasks. From models that augment SAM’s capabilities for medical images to frameworks that synergize few-shot localization with segmentation abilities, we traverse the journey of various innovations in the realm of medical image segmentation.

1.改编：顾名思义，本小节探讨了对传统分割模型的调整和修改，增强了它们在医学成像任务中的特异性和性能。从增强SAM的医学图像功能的模型到将少数镜头定位与分割能力协同的框架，我们穿越了医学图像分割领域的各种创新之旅。
2.

Generalist: Moving beyond task-specific adaptability, the models in this sub-section embody the essence of a ’Generalist’ approach. They are designed to encompass a broader spectrum of tasks and data modalities. These models not only process different kinds of medical imaging data but also can integrate patient histories and genomic data, marking a stride towards a more holistic healthcare technology ecosystem.

2.通才：超越特定任务的适应性，本小节中的模型体现了“通才”方法的本质。它们旨在涵盖更广泛的任务和数据模式。这些模型不仅可以处理不同类型的医学成像数据，还可以整合患者历史和基因组数据，标志着向更全面的医疗技术生态系统迈出了一大步。

As we delve deeper into this section, we will uncover the transformative potential of visually prompted models in medical imaging, highlighting both their specialized adaptations and their expansive generalist capabilities.
随着我们深入研究这一部分，我们将揭示医学成像中视觉提示模型的变革潜力，突出它们的专业适应性和广泛的通才能力。

3.2.1 Adaptations

Traditional medical image segmentation has primarily relied on task-specific models, which, while accurate in their domains, often lack the ability to generalize across multiple tasks and imaging modalities. This necessitates a tailored, resource-intensive approach for each segmentation challenge. The advent of foundation models trained on extensive datasets presents an exciting solution. These models are capable of recognizing and segmenting numerous anatomical structures and pathological lesions across different imaging modalities. However, despite their potential, there are challenges with existing models like SAM, especially when applied to medical images [46]. This necessitates further innovations to extend their capabilities, and one such approach is the Medical SAM Adapter, which bridges the gap and enhances SAM’s performance in the medical domain [46]. This promises an integration of automated processes with specific customization.
传统的医学图像分割主要依赖于特定于任务的模型，这些模型虽然在其领域中是准确的，但通常缺乏跨多个任务和成像模态进行概括的能力。这需要针对每个细分挑战量身定制的资源密集型方法。在广泛的数据集上训练的基础模型的出现提供了一个令人兴奋的解决方案。这些模型能够在不同的成像模式中识别和分割许多解剖结构和病理病变。然而，尽管它们具有潜力，但SAM等现有模型仍存在挑战，特别是在应用于医学图像时[ 46]。这需要进一步的创新来扩展其功能，其中一种方法是医疗SAM适配器，它弥合了差距并增强了SAM在医疗领域的性能[ 46]。这保证了自动化流程与特定定制的集成。

Ma and Wang presented MedSAM, a novel foundation model crafted for medical image segmentation [12]. Built using a comprehensive dataset of over a million medical image-mask pairs, MedSAM can address numerous segmentation tasks across various imaging modalities. Its promptable configuration seamlessly blends automation with user-driven customization. MedSAM excelled in tasks, especially in computing pivotal biomarkers like accurate tumor volume in oncology. However, it had some limitations, such as modality representation imbalances in its training data and challenges in segmenting vessel-like structures. Nevertheless, its architecture permits future refinements to cater to specific tasks, emphasizing the adaptability of foundation models in medical image segmentation.
Ma和Wang介绍了MedSAM，这是一种用于医学图像分割的新型基础模型[ 12]。MedSAM使用超过一百万个医学图像-掩模对的综合数据集构建，可以解决各种成像模式的众多分割任务。它的可配置配置无缝地融合了自动化与用户驱动的定制。MedSAM在任务中表现出色，特别是在计算关键生物标志物方面，如肿瘤学中的准确肿瘤体积。然而，它也有一些局限性，例如训练数据中的模态表示不平衡以及分割血管样结构的挑战。尽管如此，它的架构允许未来的改进，以满足特定的任务，强调在医学图像分割的基础模型的适应性。

Lei et al. tackled the challenge of the intensive annotation workload inherent in the SAM, by introducing MedLSAM, a novel framework that synergizes few-shot landmark localization with SAM’s segmentation capabilities [13]. MedLSAM framework consists of a Localization Anything Model (MedLAM), which employs a shared Pnet to transform support and query patches into 3D latent vectors. During inference, MedLAM initiates with a randomly positioned agent in the query image and guides it toward the target landmark. The agent’s trajectory is updated based on a 3D offset computed from the MedLAM model, effectively localizing the landmark coarsely within the query image. This coarse localization is further refined using the Multi-Scale Similarity (MSS) component, enhancing the accuracy of landmark positioning significantly. Having localized the landmarks, the framework transitions to segmentation using both SAM and MedSAM, a specialized version of SAM fine-tuned for medical images. Trained on an extensive dataset of 14,012 CT scans, MedLSAM autonomously generates 2D bounding boxes across slices, facilitating SAM’s segmentation tasks. Impressively, it equaled SAM’s performance on two 3D datasets that spanned 38 organs but with significantly fewer annotations. Future-proofed with forward compatibility, MedLSAM opens doors for integration with evolving 3D SAM models, signaling even more effective segmentation in the medical domain.
Lei等人通过引入MedLSAM解决了SAM中固有的密集注释工作量的挑战，MedLSAM是一种新型框架，可将少量地标定位与SAM的分割功能协同作用[ 13]。MedLSAM框架由一个本地化模型（MedLAM）组成，它采用一个共享的Pnet将支持和查询补丁转换为3D潜在向量。在推理过程中，MedLAM在查询图像中随机定位代理，并将其引导到目标地标。智能体的轨迹基于从MedLAM模型计算的3D偏移来更新，有效地在查询图像内粗略地定位地标。使用多尺度相似性（MSS）组件进一步细化该粗略定位，从而显著提高地标定位的准确性。定位地标后，框架过渡到使用SAM和MedSAM（SAM针对医学图像进行微调的专用版本）进行分割。在14，012个CT扫描的广泛数据集上进行训练，MedLSAM自动生成跨切片的2D边界框，促进SAM的分割任务。令人印象深刻的是，它相当于SAM在两个3D数据集上的性能，这些数据集跨越38个器官，但注释明显较少。MedLSAM具有向前兼容性，面向未来，为与不断发展的3D SAM模型集成打开了大门，标志着医疗领域中更有效的分割。

Gong et al. tackled the challenges posed by SAM, originally for 2D natural images when applied to 3D medical image segmentation, particularly for tumor detection [88]. The team introduced a strategy transforming SAM for 3D medical imaging while retaining most of its pre-trained parameters. By employing a visual sampler for the prompt encoder and a lightweight mask decoder emphasizing multi-layer aggregation, the resulting model, the 3DSAM-adapter, exhibited superior performance. It outperformed leading medical segmentation models in three of four tasks, reaffirming the potential to enhance SAM’s utility in intricate medical imaging tasks.
Gong等人解决了SAM带来的挑战，最初用于2D自然图像，当应用于3D医学图像分割时，特别是用于肿瘤检测[88]。该团队引入了一种将SAM转换为3D医学成像的策略，同时保留了大部分预先训练的参数。通过采用一个视觉采样器的提示编码器和一个轻量级的掩码解码器，强调多层聚合，产生的模型，3DSAM适配器，表现出上级性能。它在四个任务中的三个任务中优于领先的医疗分割模型，重申了增强SAM在复杂医疗成像任务中的实用性的潜力。

Cheng et al. introduced SAM-Med2D, a specialized model for 2D medical image segmentation [11]. Recognizing the need for domain adaptation, they amassed a substantial dataset of approximately 4.6M images and 19.7M masks, spanning diverse medical modalities. A notable feature of SAM-Med2D is its varied prompt strategies, going beyond bounding boxes and points to incorporate masks, offering a comprehensive interactive segmentation approach as shown in Figure 5. Thorough evaluations showcased its superior performance across various anatomical structures, with remarkable generalization capabilities proven on datasets from the MICCAI 2023 challenge. Despite its prowess, certain challenges remain, particularly with complex boundaries and low-contrast objects. With prospects of integrating natural language interaction, SAM-Med2D stands as a pioneering contribution to medical computer vision research. Building upon the theme of customization, another noteworthy effort is the development of SAMed. This model, unlike its predecessors, employs a low-rank-based finetuning strategy, enabling it to perform semantic segmentation on medical images with only a fraction of SAM’s parameters being updated. This selective approach to parameter adaptation allows SAMed to achieve competitive results, underscoring the potential of customizing large-scale models for specific medical segmentation tasks [47].
Cheng等人介绍了SAM—Med2D，一种用于2D医学图像分割的专用模型[11]。认识到领域适应的必要性，他们积累了大约460万张图像和1970万张面具的大量数据集，涵盖了各种医疗模式。SAM—Med2D的一个显着特点是其多种提示策略，超越了边界框和点，并纳入掩码，提供了一个全面的交互式分割方法，如图5所示。全面的评估展示了其在各种解剖结构上的上级性能，并在MICCAI 2023挑战的数据集上证明了其卓越的泛化能力。尽管它的威力，一些挑战仍然存在，特别是复杂的边界和低对比度的对象。凭借整合自然语言交互的前景，SAM—Med2D是医学计算机视觉研究的先驱。在定制主题的基础上，另一项值得注意的努力是开发SAMed。与其前辈不同，该模型采用了基于低秩的微调策略，使其能够在仅更新SAM参数的一小部分的情况下对医学图像执行语义分割。这种选择性的参数自适应方法使SAMed能够实现有竞争力的结果，强调了为特定的医疗分割任务定制大规模模型的潜力[ 47]。

In a stride to enhance reliability in medical image segmentation, Deng et al. put forth SAM-U [44], a novel approach employing multi-box prompts for refined uncertainty estimation in SAM predictions. This method significantly improves SAM’s performance, especially in low-quality medical images, and provides crucial insights through generated uncertainty maps, highlighting potential segmentation inaccuracies and serving as an essential guide for clinicians in areas requiring manual annotations. This innovative approach underscores the advancements and adaptability in the realm of medical image segmentation
在提高医学图像分割可靠性的一个步骤中，Deng等人提出了SAM-U [ 44]，这是一种采用多框提示的新方法，用于SAM预测中的精确不确定性估计。这种方法显著提高了SAM的性能，特别是在低质量的医学图像中，并通过生成的不确定性图提供了重要的见解，突出了潜在的分割不准确性，并在需要手动注释的领域为临床医生提供了必要的指导。这种创新的方法强调了医学图像分割领域的先进性和适应性

3.2.2 Generalist

In contrast to their adaptability to specific tasks through prompts, foundational models also offer a ’Generalist’ approach, further disrupting the landscape of medical imaging. These Generalist models expand upon the foundational model capabilities by being intrinsically designed to handle a broader spectrum of medical imaging tasks and data modalities—ranging from X-rays to MRIs, and even incorporating patient histories and genomic data. The key advantage here is their capability for dynamic task specification, often enabled by natural language descriptions, obviating the need for model retraining. This inherent flexibility is further augmented by the models’ ability to formally represent medical knowledge, allowing for reasoned outputs and explanations. The emergence of Generalist models in medical imaging signifies a step towards a more integrated and efficient healthcare technology ecosystem.
与其通过提示对特定任务的适应性相反，基础模型还提供了一种“通才”方法，进一步破坏了医学成像的格局。这些Generalist模型扩展了基础模型的功能，其内在设计旨在处理更广泛的医学成像任务和数据模式-从X射线到MRI，甚至包括患者病史和基因组数据。这里的关键优势是它们的动态任务规范能力，通常由自然语言描述实现，避免了模型再训练的需要。这种固有的灵活性进一步增强了模型的能力，正式表示医学知识，允许合理的输出和解释。医学成像中的Generalist模型的出现标志着向更加集成和高效的医疗技术生态系统迈出了一步。

Moor et al. [23] delve into the intricacies of developing General-purpose Medical Artificial Intelligence (GMAI), a specialized class of foundation models optimized for the healthcare domain. Unlike conventional medical AI, GMAI models are designed to process multiple data modalities, such as imaging studies and electronic health records, simultaneously. These models are not only capable of complex diagnostic tasks but can also generate treatment recommendations complete with evidence-based justifications. The authors discuss challenges unique to GMAI, including the need for multi-disciplinary panels for output verification and increased susceptibility to social biases due to the complex training data sets. Additionally, they raise concerns over patient privacy and the computational and environmental costs associated with model scaling. The paper underscores that the success of GMAI hinges on rigorous validation and ongoing oversight to mitigate these risks while harnessing its transformative potential in healthcare.
摩尔等人[ 23]深入研究了开发通用医疗人工智能（GMAI）的复杂性，GMAI是一种针对医疗保健领域优化的基础模型。与传统的医疗AI不同，GMAI模型旨在同时处理多种数据模式，例如成像研究和电子健康记录。这些模型不仅能够完成复杂的诊断任务，而且还可以生成具有循证依据的治疗建议。作者讨论了GMAI特有的挑战，包括需要多学科小组进行输出验证，以及由于复杂的训练数据集而增加的对社会偏见的敏感性。此外，他们还提出了对患者隐私以及与模型缩放相关的计算和环境成本的担忧。该文件强调，GMAI的成功取决于严格的验证和持续的监督，以减轻这些风险，同时利用其在医疗保健领域的变革潜力。

Tu et al. extend the pioneering work of Med-PaLM and Med-PaLM2 [42] to introduce Med-PaLM M, a groundbreaking multi-modal biomedical AI system capable of handling diverse medical modalities, including medical imaging, genomics, and electronic health records [22]. Building upon the foundational achievements of Med-PaLM—which was the first AI to surpass the pass mark on USMLE-style questions—and the subsequent improvements in Med-PaLM2, which boasted an accuracy of 86.5% on the same questions, Med-PaLM M employs a fusion of Vision Transformer (ViT) for visual tasks and Language-agnostic Language Model (LLM) for natural language tasks. These components are fine-tuned on a newly assembled MultiMedBench dataset. Med-PaLM M eclipses existing benchmarks, including specialized single-task models and its predecessor generalist models like PaLM-E that lacked biomedical fine-tuning. Notably, the system exhibits unprecedented zero-shot learning capabilities, successfully identifying tuberculosis from chest X-ray images without prior training [24]. It also excels in generating radiology reports, rivaling the performance of expert radiologists in human evaluations. While the study highlights the scalability and promise of multi-modal AI models for a range of biomedical tasks, it also acknowledges existing challenges, such as data scarcity and limitations of current benchmarks. The work serves as a seminal contribution, marking a new frontier in biomedical AI, albeit with cautionary notes on safety and equity considerations for real-world applications.
Tu等人扩展了Med—PaLM和Med—PaLM2的开创性工作[42]，引入了Med—PaLM M，这是一种开创性的多模式生物医学AI系统，能够处理各种医疗模式，包括医学成像，基因组学和电子健康记录[22]。基于Med—PaLM的基础成就—这是第一个在USMLE风格的问题上超过及格分数的AI—以及Med—PaLM 2的后续改进，Med—PaLM M在相同的问题上拥有86.5%的准确率，Med—PaLM M采用视觉Transformer（ViT）融合视觉任务和与语言无关的语言模型（LLM）用于自然语言任务。这些组件在新组装的MultiMedBench数据集上进行微调。Med—PaLM M超越了现有的基准，包括专门的单任务模型和其前身缺乏生物医学微调的PaLM—E等通用模型。值得注意的是，该系统表现出前所未有的零拍摄学习能力，成功地从胸部X射线图像中识别结核病，而无需事先训练[ 24]。它在生成放射学报告方面也表现出色，可与放射科专家在人体评估方面的表现相媲美。虽然该研究强调了多模态人工智能模型在一系列生物医学任务中的可扩展性和前景，但它也承认了现有的挑战，例如数据稀缺和当前基准的局限性。这项工作是一项开创性的贡献，标志着生物医学人工智能的新前沿，尽管对现实世界应用的安全和公平考虑提出了警告。

Zhang et al. introduce BiomedGPT [25], a unified framework that is trained across multiple modalities—including radiographs, digital images, and text—to perform a diverse range of tasks in the biomedical domain as shown in Figure 6. The model particularly excels in image classification on MedMNIST v2 datasets and visual question-answering on SLAKE and PathVQA, setting new state-of-the-art benchmarks. However, it lags in text-based tasks such as natural language inference on the MedNLI dataset. One reason for this performance gap is the model’s constrained scale; with only 182 million parameters, it is smaller than other state-of-the-art models. The study also pinpoints the model’s sensitivity to task instructions and challenges with handling out-of-distribution data as areas for future research. Nonetheless, BiomedGPT represents a significant step towards a versatile, generalist model in the biomedical field, capable of both vision and language tasks.
Zhang等人介绍了BiomedGPT [25]，这是一个统一的框架，它在多种模式（包括射线照片，数字图像和文本）中进行训练，以执行生物医学领域的各种任务，如图6所示。该模型在MedMNIST v2数据集上的图像分类以及SLAKE和PathVQA上的视觉问答方面尤其出色，为最先进的基准设定了新的标准。然而，它在基于文本的任务中滞后，例如MedNLI数据集上的自然语言推理。这种性能差距的一个原因是模型的规模有限；只有1.82亿个参数，比其他最先进的模型要小。该研究还指出了该模型对任务指令的敏感性，以及处理分布外数据的挑战，作为未来研究的领域。尽管如此，BiomedGPT代表了生物医学领域朝着多功能、多面手模型迈出的重要一步，能够完成视觉和语言任务。

Wu et al. introduce the Radiology Foundation Model (RadFM) and the MedMD dataset, aiming to unify medical tasks and integrate diverse radiological images [26]. RadFM effectively merges medical scans with natural language, addressing various medical tasks. The study unveils RadBench, a benchmark demonstrating RadFM’s superior synthesis of visual and textual information. Despite the advancements, the authors highlight limitations, such as the prevalence of 2D images in the dataset and challenges in generating clinically useful sentences. Wu et al.’s release of these innovations significantly advances radiological models, encourages collaborative progress, and emphasizes the need for enhanced evaluative metrics and comprehensive solutions in the field.
Wu等人介绍了放射学基础模型（RadFM）和MedMD数据集，旨在统一医疗任务并整合各种放射学图像[ 26]。RadFM有效地将医学扫描与自然语言相结合，解决了各种医疗任务。该研究揭示了RadBench，这是一个基准，展示了RadFM对视觉和文本信息的上级合成。尽管取得了进步，但作者强调了局限性，例如数据集中2D图像的普遍性以及生成临床有用句子的挑战。Wu等人'这些创新的发布大大推进了放射学模型，鼓励合作进步，并强调需要增强该领域的评估指标和综合解决方案。

In another study, Zhou et al. introduce RETFound [27], a versatile foundation model developed through self-supervised learning, trained on 1.6 million unlabeled retinal images. It demonstrates unparalleled adaptability and generalizability in diagnosing eye diseases and predicting systemic disorders with notable accuracy and reduced reliance on extensive annotation. RETFound overcomes significant barriers related to data limitations and model generalization, offering a pioneering solution in medical AI, with the potential to democratize and significantly advance healthcare AI applications.
在另一项研究中，Zhou等人介绍了RETFound [ 27]，这是一种通过自我监督学习开发的多功能基础模型，在160万张未标记的视网膜图像上进行了训练。它在诊断眼科疾病和预测系统性疾病方面表现出无与伦比的适应性和普遍性，具有显着的准确性，并减少了对广泛注释的依赖。RETFound克服了与数据限制和模型泛化相关的重大障碍，为医疗AI提供了开创性的解决方案，有可能使医疗AI应用民主化并显着推进。

4 Discussion 4讨论

In the dynamic landscape of foundational models for medical imaging, each direction outlined in our taxonomy (Figure 3) brings its own set of advantages and distinctive capabilities to the forefront. These divergent paths cater to specific needs, creating a diversified toolkit for addressing the multifaceted challenges of the medical imaging domain. As we delve into this discussion, we will explore the unique advantages of each direction and consider scenarios where one direction might excel over the others, all while peering into how these models learn feature representations and the implications thereof.
在医学成像基础模型的动态环境中，我们的分类法中概述的每个方向（图3）都将其自身的一系列优势和独特的功能带到了最前沿。这些不同的路径迎合了特定的需求，为解决医学成像领域的多方面挑战创造了多样化的工具包。当我们深入研究这个讨论时，我们将探索每个方向的独特优势，并考虑一个方向可能优于其他方向的场景，同时研究这些模型如何学习特征表示及其含义。

Textually Prompted Contrastive Models: These models have shown remarkable prowess in bridging the semantic gap between medical images and text. By leveraging contrastive learning, these models can extract meaningful representations from unpaired medical image-text data, thereby reducing the dependence on vast amounts of labeled data. This approach is particularly advantageous in scenarios where labeled data is scarce or expensive to obtain, such as rare medical conditions or specialized imaging modalities. Contrastive models excel at capturing subtle medical meanings and are well-suited for tasks like zero-shot prediction in medical image-text tasks. For instance, in scenarios where a new, uncharacterized medical condition arises, these models can adapt swiftly by simply providing textual descriptions.
文本化对比模型：这些模型在弥合医学图像和文本之间的语义鸿沟方面表现出了非凡的能力。通过利用对比学习，这些模型可以从未配对的医学图像—文本数据中提取有意义的表示，从而减少对大量标记数据的依赖。这种方法在标记数据稀缺或获取昂贵的情况下特别有利，例如罕见的医疗条件或专门的成像模式。对比模型擅长捕捉微妙的医学含义，非常适合医学图像—文本任务中的零拍摄预测等任务。例如，在出现新的、未表征的医疗状况的情况下，这些模型可以通过简单地提供文本描述来快速适应。

However, there are limitations to consider. Contrastive models might struggle with highly complex medical images or intricate pathologies, where the nuances demand a deeper level of feature representation. Additionally, they may still rely on the availability of large-scale text data, which could be a bottleneck in some cases. The contrastive learning process also hinges on careful tuning of hyperparameters, making it essential to invest time in fine-tuning for optimal performance.
然而，有一些限制因素需要考虑。对比模型可能难以处理高度复杂的医学图像或复杂的病理，其中细微差别需要更深层次的特征表示。此外，他们可能仍然依赖于大规模文本数据的可用性，这在某些情况下可能是一个瓶颈。对比学习过程还取决于对超参数的仔细调整，因此必须投入时间进行微调以获得最佳性能。

Textually Prompted Generative Models: Textually prompted generative models, exemplified by models like Clinical-BERT [41] and Med-Flamingo [43], offer the ability to generate detailed responses and explanations for medical image-related queries. They excel in tasks requiring a deep understanding of the medical domain, making them invaluable in clinical decision support systems, medical education, and generating radiology reports.
文本提示生成模型：文本提示生成模型，例如Clinical-BERT [ 41]和Med-Flamingo [ 43]等模型，提供了为医学图像相关查询生成详细响应和解释的能力。他们擅长于需要深入了解医学领域的任务，使他们在临床决策支持系统，医学教育和生成放射学报告中具有宝贵价值。

These generative models can be a game-changer when interpretability and reasoning are crucial. For instance, in a clinical setting, generating explanations for a model’s predictions can enhance trust and facilitate collaboration between AI systems and medical professionals. In educational contexts, they can serve as powerful tutors, providing in-depth explanations and context.
当可解释性和推理至关重要时，这些生成模型可以改变游戏规则。例如，在临床环境中，为模型的预测生成解释可以增强信任，促进AI系统和医疗专业人员之间的合作。在教育背景下，他们可以作为强大的导师，提供深入的解释和背景。

Nevertheless, generative models are computationally intensive and demand significant training data. They may not be the most efficient choice for scenarios where quick, lightweight predictions are required. Additionally, they may face challenges in generating text that is both informative and concise, which could be important in some applications.
然而，生成模型是计算密集型的，需要大量的训练数据。对于需要快速、轻量级预测的场景，它们可能不是最有效的选择。此外，他们可能面临的挑战是生成既信息丰富又简洁的文本，这在某些应用程序中可能是重要的。

Textually Prompted Hybrid Models: Hybrid models, as represented by MedBLIP [28], combine the strengths of generative and contrastive methodologies. These models tackle the challenge of integrating textual data with 3D medical images, often a complex task due to the inherent differences in data modalities.
混合模型：以MedBLIP [ 28]为代表的混合模型，联合收割机结合了生成和对比方法的优势。这些模型解决了将文本数据与3D医学图像集成的挑战，由于数据模式的固有差异，这通常是一项复杂的任务。

One of the key advantages of hybrid models is their potential for zero-shot prediction in medical image-text tasks. They seamlessly align visual attributes with linguistic features, making them adept at executing medical visual question-answering tasks with precision. For example, in cases where medical professionals need to quickly diagnose conditions based on both images and textual descriptions, hybrid models can provide valuable support.
混合模型的主要优势之一是它们在医学图像—文本任务中的零拍摄预测的潜力。它们将视觉属性与语言特征无缝结合，使它们能够精确地执行医疗视觉问答任务。例如，在医疗专业人员需要根据图像和文本描述快速诊断病情的情况下，混合模型可以提供有价值的支持。

Yet, hybrid models may face challenges related to the design of effective integration mechanisms between textual and visual data. The success of these models often relies on the quality of the alignment between different modalities. Additionally, they may require substantial computational resources for training and fine-tuning.
然而，混合模型可能面临的挑战，设计有效的文本和视觉数据之间的集成机制。这些模式的成功往往取决于不同模式之间的一致性质量。此外，它们可能需要大量的计算资源来进行训练和微调。

Textually Prompted Conversational Models: Conversational models, like LLaVA-Med [51] and XrayGPT [52], are designed to enable interactive dialogues between medical professionals and AI systems. These models are particularly beneficial in scenarios where medical experts need to ask questions, seek explanations, or instruct AI systems regarding medical images. One of the most significant advantages of conversational models is their potential to enhance communication and collaboration between humans and AI. They can facilitate knowledge transfer, clarify doubts, and provide detailed explanations for complex medical images. In a clinical context, this can lead to more informed decision-making and better patient care.
会话模型，如LLaVA-Med [ 51]和XrayGPT [ 52]，旨在实现医疗专业人员和AI系统之间的交互式对话。这些模型在医学专家需要提出问题、寻求解释或指导人工智能系统进行医学图像处理的情况下特别有用。会话模型最重要的优势之一是它们有可能增强人类与人工智能之间的沟通和协作。它们可以促进知识转移，澄清疑问，并为复杂的医学图像提供详细的解释。在临床上，这可以导致更明智的决策和更好的病人护理。

However, conversational models face the challenge of understanding context and handling ambiguous queries effectively. Ensuring accurate responses in complex medical scenarios remains an ongoing research challenge. Additionally, they require careful fine-tuning on conversational datasets to perform optimally.
然而，会话模型面临的挑战，理解上下文和处理歧义查询有效。确保在复杂的医疗情况下做出准确的反应仍然是一个持续的研究挑战。此外，它们需要对会话数据集进行仔细的微调，以实现最佳性能。

Visually Prompted Adaptations Models: Visually prompted adaptations models, such as MedLSAM [13], MedLSAM, 3DSAM-adapter [88], SAM-Med2D [11], SAMed [47], and SAM-U [44], focus on enhancing the specificity and performance of medical image segmentation tasks. These models adapt foundational models like SAM for the medical domain, addressing challenges like data scarcity and complex boundaries. A sample of segmentation results on various medical image analysis tasks achieved through the adapted SAM model is presented in Figure 7, showcasing its remarkable achievement in diverse medical imaging scenarios and its robust generalization power.
视觉提示适应模型：视觉提示适应模型，如MedLSAM [ 13]，MedLSAM，3DSAM适配器[ 88]，SAM-Med 2D [ 11]，SAMed [ 47]和SAM-U [ 44]，专注于增强医学图像分割任务的特异性和性能。这些模型将SAM等基础模型应用于医疗领域，解决了数据稀缺和复杂边界等挑战。图7显示了通过自适应SAM模型实现的各种医学图像分析任务的分割结果示例，展示了其在各种医学成像场景中的卓越成就及其强大的泛化能力。

\SetRowpurple9 Algorithm \SetRowpurple9算法	Networks	Core Ideas 核心思想	Practical Use Cases 实际用例
Textually Prompted Contrastive Models 文本化的对比模型	¹MedClip [31] [ 31]第三十一话
²BioViL-T [32]
³CheXzero [30] [30]第三十话
⁴MI-Zero [36] [36]第三十六话	$\bullet$ [31] aims to tackle medical image and report vision-text contrastive learning difficulties. MediCLIP separates medical images and text for multimodal contrastive learning. Scaling combinatorial training data at low cost answers medical data shortages. The study advises replacing InfoNCE with a medical-based semantic matching loss to eliminate contrastive learning false negatives. MedCLIP captures subtle but important medical meanings better than zero-shot prediction, supervised classification, and image-text retrieval methods. The paper reveals MedCLIP’s efficacy and data efficiency, which might enhance clinical decision-making and downstream tasks. $\bullet$ [31]旨在解决医学图像和报告视觉—文本对比学习困难。MediCLIP将医学图像和文本分离，用于多模态对比学习。以低成本扩展组合训练数据解决了医疗数据短缺问题。该研究建议用基于医学的语义匹配损失取代InfoNCE，以消除对比学习的假阴性。MedCLIP捕捉微妙但重要的医学意义优于零射击预测，监督分类和图像—文本检索方法。该论文揭示了MedCLIP的有效性和数据效率，这可能会增强临床决策和下游任务。
$\bullet$ Use data temporal structure to improve biological vision-language processing (VLP) in [32]. Researchers introduce BioViL-T, a pre-training system that trains and fine-tunes using past pictures and data. This method uses temporal correlations and a multi-image encoder to handle missing images and longitudinal data without image registration. Modality alignment is improved by analyzing the temporal relationship between visuals and reports, enhancing pre-training and downstream task performance. The study exhibits advanced progression categorization, phrase grounding, and report generation results. Temporal and non-temporal tasks like pneumonia detection and phrase grounding benefit from prior context and temporal knowledge. To test and benchmark chest X-ray VLP models for temporal semantics, the authors offer MS-CXR-T, a multimodal benchmark dataset. An expert radiologist curated this dataset to measure image-text temporal correlations. $\bullet$ 使用数据时间结构来改善生物视觉语言处理（VLP）[ 32]。研究人员介绍了BioViL-T，这是一种预训练系统，可以使用过去的图片和数据进行训练和微调。该方法使用时间相关性和多图像编码器来处理丢失的图像和纵向数据，而无需图像配准。通过分析视觉和报告之间的时间关系，提高训练前和下游任务的性能，改善了模态对齐。这项研究展示了先进的进展分类，短语接地，和报告生成结果。时间和非时间任务，如肺炎检测和短语接地受益于先前的上下文和时间知识。为了测试和基准胸部X射线VLP模型的时间语义，作者提供了MS-CXR-T，一个多模态基准数据集。放射科专家策划了这个数据集，以测量图像-文本的时间相关性。
$\bullet$ In [30], authors offer a novel medical imaging pathological classifying approach. The study suggests employing self-supervised learning without annotations to accurately diagnose illnesses in unannotated chest X-rays. Large labeled datasets are expensive and time-consuming for traditional medical image interpretation machine-learning algorithms. This research shows that a self-supervised system trained on chest X-rays without annotations can classify illness as well as radiologists. A zero-shot multi-label classification method, natural language supervision from radiology reports, and generalization to diverse image interpretation tasks and datasets are presented in the research. CheXzero learns a representation for zero-shot multi-label classification without labeled data fine-tuning using contrastive learning with image-text pairs. Radiology reports’ natural labeling lets self-supervised algorithms perform as well as professional radiologists and fully supervised approaches on unknown disorders. This approach eliminates explicit labeling, eliminating medical machine-learning workflow inefficiencies from large-scale labeling. $\bullet$ 在[30]中，作者提供了一种新的医学成像病理分类方法。该研究建议采用无注释的自我监督学习来准确诊断未注释的胸部X光片中的疾病。对于传统的医学图像解释机器学习算法来说，大型标记数据集是昂贵且耗时的。这项研究表明，在没有注释的胸部X光片上训练的自我监督系统可以对疾病和放射科医生进行分类。在研究中提出了零次多标签分类方法，放射学报告的自然语言监督，以及对不同图像解释任务和数据集的推广。CheXzero使用图像—文本对的对比学习来学习零次多标签分类的表示，而无需对标记数据进行微调。放射学报告的自然标记使自监督算法在未知疾病上的表现与专业放射科医生和完全监督方法一样好。这种方法消除了显式标记，消除了大规模标记的医疗机器学习工作流程效率低下。	$\bullet$ Zero-shot prediction in medical image-text tasks, Supervised classification in medical image analysis, Image-text retrieval in the medical domain, Supporting clinical decision-making and downstream clinical tasks [31]. $\bullet$ 医学图像文本任务中的零拍摄预测，医学图像分析中的监督分类，医学领域中的图像文本检索，支持临床决策和下游临床任务[ 31]。
$\bullet$ Progression Classification: Achieving State-of-the-Art Performance in Tracking Medical Condition Progression, Phrase Grounding: Linking Clinical Report Phrases to Image Regions for Enhanced Analysis, Report Generation: Improved Performance by Incorporating Prior Reports, Disease Classification: Consistent Improvement in Disease Classification Tasks, Pneumonia Detection: State-of-the-Art Results in Detecting Pneumonia [32] $\bullet$ 进展分类：在跟踪医学状况进展方面实现最先进的性能，短语基础：将临床报告短语链接到图像区域以增强分析，报告生成：通过重复先前报告提高性能，疾病分类：疾病分类任务的一致改进，肺炎检测：检测肺炎的最先进结果[32]
$\bullet$ Automation of complex medical image interpretation tasks, Disease diagnosis, Diagnostic efficiency improvement, Label efficiency enhancement, Decreased reliance on large labeled datasets, Reduction in labeling efforts and costs, Potential for learning a broad range of medical image interpretation tasks from unlabeled data [30] $\bullet$ 复杂医学图像解释任务的自动化，疾病诊断，诊断效率提高，标签效率提高，减少对大型标记数据集的依赖，减少标记工作和成本，从未标记数据中学习广泛的医学图像解释任务的潜力[ 30]
$\bullet$ Zero-shot transfer for cancer subtype classification on 3 WSI datasets. Moreover, the curated dataset of histopathology image-caption pairs can potentially be generalized and adapted to develop practical solutions in other domains [36]. $\bullet$ 在3个WSI数据集上进行癌症亚型分类的零拍摄转移。此外，组织病理学图像-标题对的策划数据集可能会被推广并适用于其他领域的实际解决方案[ 36]。
Textually Prompted Generative Models 文本化生成模型	¹Clinical-BERT [41] [ 41]第四十一话
²Med-PaLM 2 [42]
³Med-Flamingo [43] [ 43]第四十三话	$\bullet$ Clinical-BERT [41], a medical pre-training paradigm, underpins. The research offers domain-specific pre-training activities, including Clinical Diagnosis (CD), Masked MeSH Modeling (MMM), and Image-MeSH Matching for model training. MeSH words in radiograph reports are stressed. The work aligns MeSH terms with radiographs using region and word sparse attention. The model links visual characteristics with MeSH phrases using this attention mechanism. Clinical-BERT radiograph diagnostic and report production provide cutting-edge results. The article shows domain-specific pre-training exercises and MeSH keywords to improve medical task performance. $\bullet$ Clinical—BERT [41]是一种医学预训练范式，是基础。该研究提供了特定领域的预训练活动，包括临床诊断（CD），MMM（Masked MeSH Modeling）和用于模型训练的图像MeSH匹配。X线片报告中的MeSH单词被强调。这项工作使MeSH术语与使用区域和单词稀疏注意的射线照片保持一致。该模型使用这种注意力机制将视觉特征与MeSH短语联系起来。临床BERT射线诊断和报告制作提供了最先进的结果。本文展示了特定领域的预训练练习和MeSH关键字，以提高医疗任务的性能。
$\bullet$ Expert medical question answering is done using LLMs in [42]. The study aims to enhance LLM performance to match model and clinician replies. The authors say LLMs have advanced in various disciplines and can address medical questions. They admit prior LLM-based models need to be improved, especially compared to clinician responses. The authors offer various LLM performance enhancements. Base LLM improvements (PaLM 2), medical domain-specific fine-tuning, and a new ensemble refinement approach are used. The strategies aim to enhance medical thinking and results. $\bullet$ 使用[42]中的LLMs完成专家医疗问题回答。该研究旨在增强LLM性能，以匹配模型和临床医生的回复。作者说LLMs在各个学科都有进步，可以解决医学问题。他们承认之前基于LLM的模型需要改进，特别是与临床医生的反应相比。作者提供了各种LLM性能增强。Base LLM改进（PaLM 2），医疗领域特定的微调，和一个新的合奏细化方法。这些战略旨在加强医学思维和成果。
$\bullet$ Med-Flamingo [43], a vision-language model suggested is pre-trained on medical image-text data from various sources and can create open-ended replies from textual and visual input. Med-Flamingo outperforms prior models in generative medical visual question-answering tasks by 20% in clinical assessment scores due to in-context learning The research also describes Visual USMLE, a difficult created VQA dataset including medical questions, images, and case vignettes. The paper says multimodal few-shot and in-context learning improve medical AI models. $\bullet$ Med-Flamingo [ 43]，一种建议的视觉语言模型，对来自各种来源的医学图像-文本数据进行了预训练，可以从文本和视觉输入中创建开放式回复。Med-Flamingo在生成医学视觉问答任务中的表现优于先前模型，由于上下文学习，临床评估分数提高了20%。该研究还描述了Visual USMLE，这是一个困难的VQA数据集，包括医学问题，图像和案例插图。该论文称，多模式少数镜头和上下文学习改善了医疗AI模型。	$\bullet$ Radiograph Diagnosis and Reports Generation: Achieving state-of-the-art results on challenging datasets, Enhancing Downstream Tasks in the Medical Domain, Improving performance in various medical domain tasks, Learning Medical Domain Knowledge: Enabling the model to acquire domain-specific knowledge for better performance [41] $\bullet$ 放射影像诊断和报告生成：在具有挑战性的数据集上实现最先进的结果，增强医学领域的下游任务，提高各种医学领域任务的性能，学习医学领域知识：使模型能够获取特定领域的知识，以获得更好的性能[ 41]
$\bullet$ Medical question answering: Providing accurate and reliable answers to medical questions. Medical exams: Assisting in preparing for medical licensing examinations. Clinical decision support: Aiding physicians in making informed decisions during patient care. Consumer health information: Delivering trustworthy medical information to the general public [42]. $\bullet$ 医疗问题回答：为医疗问题提供准确可靠的答案。医疗考试：协助准备医疗执照考试。临床决策支持：帮助医生在患者护理过程中做出明智的决定。消费者健康信息：向公众提供值得信赖的医疗信息[ 42]。
$\bullet$ Generative Medical Visual Question Answering (VQA), Medical Reasoning and Rationale Generation, Clinical Evaluation and Human Rater Study, Dataset Creation for Pre-training and Evaluation [43] $\bullet$ 生成医学视觉问题分类（VQA），医学推理和原理生成，临床评价和人类评分研究，用于预培训和评价的数据集创建[ 43]
Textually Prompted Hybrid Models 混合模型	¹MedBLIP [28] [28]第二十八话	$\bullet$ Extend a 2D image encoder to extract features from 3D medical images and obtain a lightweight language model for our CAD purpose. $\bullet$ 扩展2D图像编码器，从3D医学图像中提取特征，并获得用于CAD目的的轻量级语言模型。
$\bullet$ Align different types of medical data into the common space of language models, besides collecting the largest public dataset for studying Alzheimer’s disease (AD). $\bullet$ 将不同类型的医疗数据整合到语言模型的公共空间中，并收集最大的公共数据集用于研究阿尔茨海默病（AD）。	$\bullet$ Zero-shot prediction in medical image-text tasks $\bullet$ 医学图文任务中的零拍预测
$\bullet$ Zero-shot medical visual question answering (VQA) which involves producing an initial diagnosis for an unseen case by analyzing input images and textual descriptions, while also offering explanations for the decision-making process. $\bullet$ 零拍摄医学视觉问答（VQA），包括通过分析输入图像和文本描述为未知病例产生初步诊断，同时还为决策过程提供解释。
Textually Prompted Conversational Models 文本化会话模型	¹LLaVA-Med [51] [ 51]第五十一话
²XrayGPT [52]
³Visual Med-Alpaca [57]
⁴PMC-LLaMA[89] [ 89]第89话
⁵ClinicalGPT [50]
⁶Radiology-LLamA2 [53] ⁶ 放射学-LLamA 2 [ 53]

Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
医学成像的基础模型：综述与展望

Abstract 摘要