
IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation

Yizhi Song¹*, Zhifei Zhang², Zhe Lin², Scott Cohen², Brian Price², Jianming Zhang², Soo Ye Kim², He Zhang², Wei Xiong², Daniel Aliaga¹
Purdue University¹, Adobe Research²

Figure 1. Top: Comparison with three prior works, i.e., Paint-by-Example [46], ObjectStitch [41], and TF-ICON [27]. Our method IMPRINT outperforms others in terms of identity preservation and color/geometry harmonization. Bottom: Given a coarse mask, IMPRINT can change the pose of the object to follow the shape of the mask.

Abstract

Generative object compositing emerges as a promising new avenue for compositional image editing. However, the requirement of object identity preservation poses a significant challenge, limiting practical usage of most existing methods. In response, this paper introduces IMPRINT, a novel diffusion-based generative model trained with a two-stage learning framework that decouples learning of identity preservation from that of compositing. The first stage is targeted for context-agnostic, identity-preserving pretraining of the object encoder, enabling the encoder to learn an embedding that is both view-invariant and conducive to enhanced detail preservation. The subsequent stage leverages this representation to learn seamless harmonization of the object composited to the background. In addition, IMPRINT incorporates a shape-guidance mechanism offering user-directed control over the compositing process. Extensive experiments demonstrate that IMPRINT significantly outperforms existing methods and various baselines on identity preservation and composition quality. Project page: https://song630.github.io/IMPRINT-Project-Page/

1. Introduction

Image compositing, the art of merging a reference object with a background to create a cohesive and realistic image, has witnessed transformative advancements with the advent of diffusion models (DM) [13, 32, 33, 36]. These models have catalyzed the emergence of generative object compositing, a novel task that hinges on two critical aspects: identity (ID) preservation and background harmonization. The goal is to ensure that the object in the composite image retains its identity while adapting its color and geometry for seamless integration with the background. Existing methods [27, 41, 46] demonstrate impressive capabilities in generative compositing; however, they often fail in ID-preservation or context consistency.
Recent works [41, 46] typically struggle with balancing ID preservation and background harmony. While these methods have made strides in spatial adjustments, they predominantly capture categorical rather than detailed information. TF-ICON [27] and two concurrent works [4, 48] have advanced subject fidelity but at the expense of limiting pose and view variations for background integration, thus curtailing their applicability in real-world settings.
To address the trade-off between identity preservation and pose adjustment for background alignment, we introduce IMPRINT, a novel two-stage compositing framework that excels in ID preservation. Diverging from previous works, IMPRINT decouples the compositing process into ID preservation and background alignment stages. The first stage involves a novel context-agnostic ID-preserving training, wherein an image encoder is trained to learn view-invariant features, crucial for detail engraving. The second stage focuses on harmonizing the object with the background, utilizing the robust ID-preserving representation from the first stage. This bifurcation allows for unprecedented fidelity in object detail while facilitating adaptable color and geometry harmonization.
Our contributions can be summarized as follows:
  • We introduce a novel context-agnostic ID-preserving training, demonstrating superior appearance preservation through comprehensive experiments.
  • Our two-stage framework distinctively separates the tasks of ID preservation and background alignment, enabling realistic compositing effects.
  • We incorporate mask control into our model, enhancing shape guidance and generation flexibility.
  • We conduct an extensive study on appearance retention, offering insights into various factors influencing identity preservation, e.g., image encoders, multi-view datasets, training strategies, etc.

2. Related Work

2.1. Image Compositing

Image compositing, a pivotal task in image editing applications, aims to insert a foreground object into a background image seamlessly, striving for realism and high fidelity.
Traditionally, image harmonization [9, 16, 18, 45] and image blending [30, 42, 49, 50] focus on color and lighting consistency between the object and the background. However, these approaches fall short in addressing geometric adjustments. GAN-based works [1, 3, 24] target geometry inconsistency, yet are often domain-specific (e.g., indoor scene) and limited in handling complex transformations (e.g., out-of-plane rotation). Shadow synthesis methods like SGRNet [14] and PixHt-Lab [37] focus on realistic lighting effects.
With the advent of diffusion models [13, 33, 39, 40], recent research has shifted towards unified frameworks encompassing all aspects of image compositing. Methods like [41, 46] employ CLIP-based adapters for leveraging pretrained models, but they struggle in preserving the object’s identity due to their focus on high-level semantic representations. While TF-ICON [27] improves fidelity by incorporating noise modeling and composite self-attention injection, it faces limitations in object pose adaptability.
Recent research is increasingly centering on appearance preservation in generative object compositing. Two concurrent works, AnyDoor [4] and ControlCom [48], have made strides in this area. AnyDoor combines DINOv2 [29] with a high-frequency filter, and ControlCom introduces a local enhancement module. However, these models have limited spatial correction capabilities. In contrast, our model designs a novel approach that substantially enhances visual consistency of the object while maintaining geometry and color harmonization, representing a significant advancement in the field.

2.2. Subject-Driven Image Generation

Subject-driven image generation, the task of creating a subject within a novel context, often involves customizing subject attributes based on text prompts. Building on diffusion models, [7, 17] have led to techniques like using placeholder words for object representation, enabling high-fidelity customizations. Subsequent works [19, 26, 34, 35] extend this by fine-tuning pretrained text-to-image models for new concept learning. These advancements have facilitated diverse applications, such as subject swapping [8], open-world generation [22], and non-rigid image editing [2]. However, these methods usually require inference-time fine-tuning or multiple subject images, limiting their practicality. In contrast, our framework offers a fast-forward and background-preserving approach that is versatile for a broad spectrum of real-world data.

3. Approach

The proposed object compositing framework, IMPRINT, is summarized in Fig. 2. Formally, given input images of object $I_{obj} \in \mathbb{R}^{H \times W \times 3}$, background $I_{bg} \in \mathbb{R}^{H \times W \times 3}$, and mask $M \in \mathbb{R}^{H \times W}$ that indicates the location and scale for object compositing to the background, we aim to learn a compositing model $\mathcal{C}$ to achieve a composite image $I_{\text{out}} = \mathcal{C}(I_{obj}, I_{bg}, M) \in \mathbb{R}^{H \times W \times 3}$. The ideal outcome is an $I_{\text{out}}$ that appears visually coherent and natural, i.e., $\mathcal{C}$ should ensure that the composited object retains the identity of $I_{obj}$, aligns to the geometry of $I_{bg}$, and blends seamlessly into the background.
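To make the formulation concrete, the following is a minimal interface sketch of the compositing model $\mathcal{C}$; the class and module names (CompositingModel, encoder, generator) are illustrative placeholders under the tensor contract above, not the released implementation.

```python
import torch
from torch import nn

class CompositingModel(nn.Module):
    """Illustrative interface for C: (I_obj, I_bg, M) -> I_out.

    The encoder/generator internals are placeholders; only the tensor
    shapes from the formulation above are fixed here.
    """

    def __init__(self, encoder: nn.Module, generator: nn.Module):
        super().__init__()
        self.encoder = encoder      # E_u: ID-preserving image encoder
        self.generator = generator  # G_phi: diffusion-based generator backbone

    def forward(self, i_obj: torch.Tensor, i_bg: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # i_obj, i_bg: (B, 3, H, W); mask: (B, 1, H, W) giving location and scale
        obj_tokens = self.encoder(i_obj)                 # ID-preserving tokens E_u(I_obj)
        i_out = self.generator(i_bg, mask, obj_tokens)   # composite image, (B, 3, H, W)
        return i_out
```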
In this section, we expand upon our approach. To leverage pretrained text-to-image diffusion models, we design a novel image encoder to replace the text-encoding branch, thus retaining much richer information from the reference object (see Sec. 3.1). Distinct from existing works, our pipeline bifurcates the task into two specialized sub-tasks to concurrently ensure object fidelity and allow for geometric variations. The first stage defines a context-agnostic IDpreserving task, where the image encoder is trained to learn a unified representation of generic objects (Sec. 3.1). The second stage mainly trains the generator for an image compositing task (Sec. 3.2). In addition, we delve into various aspects contributing to the detail retention capability of our framework: Sec. 3.3 discusses the process of paired data collection, and Sec. 3.4 details our training strategy.

3.1. Context-Agnostic ID-preserving Stage

Distinct from prior methods, we introduce a supervised object view reconstruction task as the first stage of training, which helps identity preservation. The motivation behind this task is based on the following key observations:
  • Existing efforts [4, 27, 48], which successfully improve detail preservation, are limited in geometry harmonization and tend to demonstrate copy-and-paste behavior.
  • There is a fundamental trade-off between identity preservation and image compositing: the object is expected to be altered, in terms of color, lighting, and geometry, to better align with the background, while simultaneously, the object’s original pose, color tone, and illumination effects are memorized by the model and define its appearance.
  • Multi-view data plays a significant role in keeping identity, yet acquiring such datasets is costly. Most large-scale multi-view datasets ([5, 47]) lack sufficient contextual information for compositing; they either lack a background entirely or have a background area that is too limited.

Based on the above insights, we give a formal definition of the task (as depicted in Fig. 2a): given an object of two views $I_{v1}, I_{v2}$ and their associated masks $M_{v1}, M_{v2}$, the background is removed and the segmented object pairs are denoted as $\hat{I}_{v1} = I_{v1} \otimes M_{v1}$ and $\hat{I}_{v2} = I_{v2} \otimes M_{v2}$. We build a view synthesis model $\mathcal{S} = \{\mathcal{E}_u, \mathcal{G}_\theta\}$ conditioned on $\hat{I}_{v1}$ to generate the target view $\hat{I}_{v2}$, where $\mathcal{E}_u$ is the image encoder and $\mathcal{G}_\theta$ is the UNet backbone parameterized by $\theta$.
Image Encoder $\mathcal{E}_u$ consists of a pretrained DINOv2 [29] and a content adapter following [41]. DINOv2 is a SOTA ViT model outperforming its predecessors [15, 31, 38], which extracts highly expressive visual features for reference-based generation. The content adapter allows the utilization of pretrained T2I models by bridging the domain gap between the image and text embedding spaces.
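A minimal sketch of how $\mathcal{E}_u$ could be assembled is given below, assuming a ViT backbone (e.g., DINOv2) that returns patch tokens of shape (B, N, C); the adapter width and the freeze_backbone flag (used in the second stage, Sec. 3.2) are illustrative assumptions rather than the exact architecture.

```python
import torch
from torch import nn

class ContentAdapter(nn.Module):
    """Small MLP mapping backbone tokens toward the T2I text-embedding space.
    The widths below are illustrative choices, not the paper's exact adapter."""

    def __init__(self, in_dim: int, out_dim: int = 1024, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, in_dim) -> (B, N, out_dim)
        return self.net(tokens)


class IDEncoder(nn.Module):
    """E_u: a ViT backbone (e.g., DINOv2) followed by the content adapter."""

    def __init__(self, backbone: nn.Module, backbone_dim: int, freeze_backbone: bool = False):
        super().__init__()
        self.backbone = backbone
        self.adapter = ContentAdapter(backbone_dim)
        if freeze_backbone:  # second stage: keep the backbone fixed, train only the adapter
            for p in self.backbone.parameters():
                p.requires_grad_(False)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        tokens = self.backbone(image)  # assumed to return patch tokens (B, N, backbone_dim)
        return self.adapter(tokens)    # object tokens fed to cross-attention
```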
Image Decoder $\mathcal{G}_\theta$ adopts the conditional denoising autoencoder from Stable Diffusion [33], whose decoder is fine-tuned during training. The objective function is defined as (based on [33]):
$$\mathcal{L}_{\mathrm{id}} = \mathbb{E}_{\hat{I}_{v1}, \hat{I}_{v2}, t, \epsilon}\Big[\big\|\epsilon - \mathcal{G}_{\theta}\big(\hat{I}_{v2}, t, \mathcal{E}_{u}(\hat{I}_{v1})\big)\big\|_2^2\Big]$$
where $\mathcal{L}_{\mathrm{id}}$ is the ID-preserving loss and $\epsilon \sim \mathcal{N}(0, 1)$. The image encoder $\mathcal{E}_u$ and the decoder blocks of $\mathcal{G}_\theta$ are optimized in this process. Intuitively, the encoder trained for this task always extracts representations that are view-invariant while keeping identity-related details that are shared across different views. The qualitative results of this stage are shown in Sec. 4.7. Unlike previous view-synthesis works [25], our context-agnostic ID-preserving stage does not require any 3D information (e.g., camera parameters) as conditions, and we mainly focus on ID preservation instead of geometric consistency with the background (which is handled in the second stage). Therefore, only the image encoder is carried over to the next stage.
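For concreteness, a schematic training step for the loss $\mathcal{L}_{\mathrm{id}}$ above is sketched below under standard DDPM noise-prediction conventions; the noise-schedule handling, the zeroed null condition, and the function and argument names are simplifying assumptions, with the 0.05 embedding-drop rate taken from Sec. 4.1.

```python
import torch
import torch.nn.functional as F

def id_preserving_loss(unet, encoder, z_v2, i_v1_seg, alphas_cumprod, drop_prob=0.05):
    """One schematic step of the ID-preserving objective: predict the noise added to
    the target view, conditioned on tokens from the segmented source view.

    z_v2:           latent (or image) of the segmented target view, (B, C, H, W)
    i_v1_seg:       segmented source view I_v1 * M_v1, (B, 3, H', W')
    alphas_cumprod: 1-D tensor of cumulative noise-schedule products, (T,)
    """
    b = z_v2.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z_v2.device)
    eps = torch.randn_like(z_v2)

    # Forward diffusion: z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * eps
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a_bar.sqrt() * z_v2 + (1.0 - a_bar).sqrt() * eps

    # ID-preserving condition; randomly drop it (here by zeroing, an assumption)
    # to support classifier-free guidance, rate 0.05 as in Sec. 4.1
    cond = encoder(i_v1_seg)
    if torch.rand(()) < drop_prob:
        cond = torch.zeros_like(cond)

    eps_pred = unet(z_t, t, cond)  # G_theta(z_t, t, E_u(I_v1_seg))
    return F.mse_loss(eps_pred, eps)
```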


(a) Stage of context-agnostic ID-preserving: we design a novel image encoder (with pre-trained DINOv2 as backbone) trained on multi-view object pairs to learn a view-invariant, ID-preserving representation.
Figure 2. The two-stage training pipeline of the proposed IMPRINT.

Figure 3. Illustration of the background-blending process. At each denoising step, the background area of the denoised latent is masked and blended with unmasked area from the clean background (intuitively, the model is only denoising the foreground).

3.2. Compositing Stage

Fig. 2b illustrates the pipeline of the second stage, which is trained for the compositing task, comprising the fine-tuned image encoder $\mathcal{E}_u$ and a generator $\mathcal{G}_\phi$ (parameterized by $\phi$) conditioned on the ID-preserving representations.
A simple approach is to ignore the view synthesis stage, training the encoder and generator jointly in a single-stage framework. Unfortunately, we found quality degradation from two aspects in this naive endeavor (see Sec. 4.7):
  • When DINOv2 is trained in this stage, the model exhibits more frequent copy-paste-like behavior that composites the object in a very similar view as its original view.
  • When object-centric multi-view datasets, e.g., MVImgNet [47], are enabled in the training set, the model tends to produce more artifacts and exhibit poorer blending results due to the absence of background information in such datasets.
To overcome the issues above, we freeze the backbone of the image encoder (i.e., DINOv2) in the second stage and carefully collect a training set (see Sec. 3.3 for details).
In this stage, we also leverage a pretrained T2I model as the backbone of the generator, which takes the background $I_{bg}$ and a coarse mask $M$ as inputs and is conditioned on the ID-preserving object tokens $\hat{E}_u = \mathcal{E}_u(I_{obj})$, where $I_{obj}$ denotes the masked object image. Generation is guided by injecting the object tokens into the cross-attention layers of $\mathcal{G}_\phi$. The coarse mask also allows the synthesis of shadows and of interactions between the object and nearby objects.
As $\hat{E}_u$ already encompasses structured view-invariant details of the object, color and geometric adjustments are no longer limited by identity preservation efforts. This freedom allows for greater variation in compositing.
We define the objective function of this stage as:
$$\mathcal{L}_{\mathrm{comp}} = \mathbb{E}_{I_{obj}, I_{bg}^{*}, M, t, \epsilon}\Big[M\big\|\epsilon - \mathcal{G}_{\phi}\big(I_{bg}^{*}, t, \hat{E}_u\big)\big\|_2^2\Big]$$
where $\mathcal{L}_{\mathrm{comp}}$ is the compositing loss and $I_{bg}^{*}$ is the target image. $\mathcal{G}_\phi$ and the adapter are optimized.
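A sketch of the masked objective $\mathcal{L}_{\mathrm{comp}}$ is shown below under the same simplified noise-prediction setup; the way the background, mask, and object tokens are passed to the UNet is schematic, since the exact conditioning interface is not spelled out here.

```python
import torch
import torch.nn.functional as F

def compositing_loss(unet, encoder, z_target, i_obj, i_bg, mask, alphas_cumprod):
    """Masked noise-prediction loss for the compositing stage (schematic).

    z_target: latent of the target image I_bg^*, (B, C, h, w)
    i_obj:    masked object image fed to the ID encoder, (B, 3, H, W)
    i_bg:     background image, (B, 3, H, W)
    mask:     coarse compositing mask M as a float tensor, (B, 1, H, W)
    """
    b = z_target.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z_target.device)
    eps = torch.randn_like(z_target)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a_bar.sqrt() * z_target + (1.0 - a_bar).sqrt() * eps

    obj_tokens = encoder(i_obj)                      # E_u(I_obj), injected via cross-attention
    eps_pred = unet(z_t, t, i_bg, mask, obj_tokens)  # G_phi conditioned on background and mask

    # Weight the noise-prediction error by the (downsampled) mask M
    m = F.interpolate(mask, size=z_target.shape[-2:], mode="nearest")
    return (m * (eps_pred - eps) ** 2).mean()
```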
The Background-blending Process To ensure that the transition area between the object and the background is smooth, we adopt a background-blending strategy. This process is depicted in Fig. 3.
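The blending operation of Fig. 3 can be sketched inside a DDIM-style sampling loop as follows; the scheduler interface, the latent-resolution mask, and the classifier-free-guidance handling (scale 3.0, Sec. 4.1) are assumptions for illustration, not the exact sampler.

```python
import torch

@torch.no_grad()
def blended_sampling(unet, obj_tokens, z_bg, mask, scheduler, cfg_scale=3.0):
    """DDIM-style sampling with background blending: at every step, the region
    outside the mask is replaced by the clean background latent noised to the
    current timestep, so the model effectively denoises only the foreground.

    z_bg: clean background latent, (B, C, h, w); mask: foreground region, (B, 1, h, w), float.
    `scheduler` is assumed to expose `timesteps`, `add_noise(z, eps, t)`, and `step(eps, t, z_t)`.
    """
    z_t = torch.randn_like(z_bg)
    uncond = torch.zeros_like(obj_tokens)          # null tokens for classifier-free guidance

    for t in scheduler.timesteps:
        # Blend: keep the current latent inside the mask, noised clean background outside
        bg_t = scheduler.add_noise(z_bg, torch.randn_like(z_bg), t)
        z_t = mask * z_t + (1.0 - mask) * bg_t

        eps_c = unet(z_t, t, obj_tokens)
        eps_u = unet(z_t, t, uncond)
        eps = eps_u + cfg_scale * (eps_c - eps_u)  # CFG, scale 3.0 as in Sec. 4.1
        z_t = scheduler.step(eps, t, z_t)          # one DDIM update

    return mask * z_t + (1.0 - mask) * z_bg        # paste the clean background back at the end
```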
Shape-guided Controllable Compositing enables more practical guidance of the pose and view of the generated object by drawing a rough mask. However, most prior works [4, 27, 41] have no such control. In our proposed model, following [43], masks are defined at four levels of precision (see the Appendix), where the coarsest mask is a bounding box. Incorporating multiple levels of masks replicates real-world scenarios, where users often prefer more precise masks. Results are shown in Fig. 1.
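The precise mask-coarsening procedure follows [43] and is described in the Appendix; purely as an illustration of the idea of multi-level precision, the sketch below derives coarser masks by morphological dilation, with the bounding box as the coarsest level. The function name and dilation amounts are hypothetical.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def coarsen_mask(mask: np.ndarray, level: int) -> np.ndarray:
    """Illustrative mask coarsening (not the exact procedure from the appendix).

    level 0: precise mask; levels 1-2: increasingly dilated masks;
    level 3: bounding box, the coarsest guidance. mask is a binary (H, W) array.
    """
    mask = mask.astype(bool)
    if level <= 0:
        return mask.astype(np.uint8)
    if level >= 3:
        out = np.zeros(mask.shape, dtype=np.uint8)
        ys, xs = np.nonzero(mask)
        if ys.size:
            out[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
        return out
    # Intermediate levels: grow the mask with a dilation whose radius scales with the level
    return binary_dilation(mask, iterations=8 * level).astype(np.uint8)
```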

Figure 4. Illustration of the data augmentation pipeline.

3.3. Paired Data Generation

The dataset quality is another key to better identity preservation and pose variation. As proved by [4], multi-view datasets can significantly improve the generation fidelity. In practice, we use a combination of image datasets (Pixabay), panoptic video segmentation datasets (YoutubeVOS [44], VIPSeg [28] and PPR10K [23]) and object-centric datasets (MVImgNet [47] and Objaverse [5]). They are incorporated in different training stages and associated with various processing procedures in our self-supervised training.
The image datasets we collected have high resolution and rich background information, so they are only utilized in the second stage for better compositing. Inspired by [41, 46], to simulate the lighting and geometry changes in object compositing, we design an augmentation pipeline $\hat{I}_{obj} = \mathcal{P}(\mathcal{T}(I_{obj}))$, where $\mathcal{T}$ denotes the affine transformations and $\mathcal{P}$ is the color and light perturbation, supported by the lookup table in [16]. The perturbed object $\hat{I}_{obj}$ is used as the input, and the natural image $I_{bg}^{*}$ containing the original object is used as the target.
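A rough sketch of the augmentation pipeline $\hat{I}_{obj} = \mathcal{P}(\mathcal{T}(I_{obj}))$ using torchvision is given below; the LUT-based perturbation of [16] is approximated here by color jitter, and the transform parameters are illustrative rather than the paper's settings.

```python
from torchvision import transforms

# T: affine transformations; P: color/light perturbation.
# The paper's P uses the LUT of [16]; ColorJitter is only a stand-in here.
affine_T = transforms.RandomAffine(degrees=15, translate=(0.05, 0.05), scale=(0.9, 1.1))
perturb_P = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05)

augment = transforms.Compose([affine_T, perturb_P])  # \hat{I}_obj = P(T(I_obj))

# Usage: aug_obj = augment(obj_image)   # obj_image: PIL image or (C, H, W) tensor
```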
Video segmentation datasets usually suffer from low resolution and motion blur, which harm the generation quality. Nevertheless, they provide object pairs which naturally differ in lighting, geometry, view and even provide non-rigid pose variations. As a result, they are also used in the second stage. Illustrated by Fig. 4, each training pair comes from one video with instance-level segmentation labels. Two distinct frames are randomly sampled; one serves as the target image, while the object is extracted from the other frame as the augmented input.
Object-centric datasets offer a significantly larger scale than video segmentation datasets and provide more intricate object details. However, they are only used in the first stage due to the limited background information available in these datasets. During training, each pair $I_{v1}, I_{v2}$ is also randomly sampled from the same video with $|v_1 - v_2| \leq n$, where $n$ is the temporal sampling window. Empirically, we observe a loss in generation quality as $n$ increases, and $n = 7$ strikes a balance between fidelity and quality.
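The pair-sampling rule can be sketched as follows, with $n = 7$ as reported; the frames container and function name are placeholders.

```python
import random

def sample_view_pair(frames, n: int = 7):
    """Sample two views (v1, v2) from the same capture with |v1 - v2| <= n.

    `frames` is any indexable sequence of per-frame images; the window n = 7
    follows the fidelity/quality trade-off reported above.
    """
    v1 = random.randrange(len(frames))
    lo, hi = max(0, v1 - n), min(len(frames) - 1, v1 + n)
    v2 = random.randint(lo, hi)  # may equal v1, which still satisfies |v1 - v2| <= n
    return frames[v1], frames[v2]
```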

3.4. Training Strategies

All previous (or concurrent) training-free methods [4, 41, 46, 48] use a frozen transformer-based image encoder, either DINOv2 or CLIP. However, freezing the encoder limits their capability in extracting object details: i) CLIP only encodes the semantic features of the object; ii) DINOv2 is trained on a dataset constructed based on image retrieval, allowing objects that are not entirely identical to be treated as the same instance. To overcome this challenge, we fine-tune the encoder specifically for compositing, ensuring the extraction of instance-level features.
Due to the extensive scale of the aforementioned encoders, they are prone to overfitting. The implementation of appropriate training strategies can effectively stabilize the training process and improve identity preservation. To this end, we design a novel training scheme: Sequential Collaborative Training.
More specifically, the object compositing stage is further divided into two phases: 1) in the first $n$ epochs, we assign the adapter a larger learning rate of $4 \times 10^{-5}$ and assign the UNet a smaller learning rate of $4 \times 10^{-6}$; 2) in the next $n$ epochs, we swap the learning rates of these two components (and the training finishes). This strategy focuses on training one component at each phase, with the other component simultaneously trained at a lower rate to adapt to the changed domain; the generator is trained in the end to ensure the synthesis quality.
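A sketch of this schedule using two optimizer parameter groups is shown below; the optimizer choice and function names are illustrative, while the learning rates and the swap at epoch $n$ follow the description above.

```python
import torch

def build_optimizer(adapter, unet):
    """Two parameter groups so the adapter and UNet learning rates can be swapped."""
    return torch.optim.AdamW(
        [
            {"params": adapter.parameters(), "lr": 4e-5},  # phase 1: adapter trained faster
            {"params": unet.parameters(), "lr": 4e-6},     # phase 1: UNet adapts slowly
        ]
    )

def maybe_swap_learning_rates(optimizer, epoch: int, n: int):
    """At the phase boundary (epoch n), swap the two group learning rates:
    the UNet then trains at 4e-5 while the adapter drops to 4e-6."""
    if epoch == n:
        g0, g1 = optimizer.param_groups
        g0["lr"], g1["lr"] = g1["lr"], g0["lr"]
```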

4. Experiments

4.1. Training Details

The first stage is trained on 1,409,545 pairs and validated on 11,175 pairs from MVImgNet, which takes 5 epochs to finish. The learning rate associated with DINOv2 (ViT-g/14 with registers) is $4 \times 10^{-6}$, and the batch size is 256. The image embedding is dropped at a rate of 0.05.
The second stage is fine-tuned on a mixture of image datasets and video datasets, including a training set of 217,451 pairs and a validation set of 15,769 pairs (listed in Tab. 1), where we apply [20] to obtain the segmentation masks as labels. It is trained for 15 epochs with a batch size of 256. The embedding is dropped at a rate of 0.1.
In both stages, the images are resized to $512 \times 512$. During inference, the DDIM sampler generates the composite image after 50 denoising steps using a CFG [12] scale of 3.0. The model is trained on 8 NVIDIA A100 GPUs and is built on Stable Diffusion v1.4 [33].

4.2. Evaluation Benchmark

Datasets are collected from Pixabay and DreamBooth [34] for testing. More specifically, the Pixabay testing set has 1,000 high-resolution images and has no overlap with the training set.
| Datasets   | Pixabay | VIPSeg | YoutubeVOS | PPR10K |
| :--------- | ------: | -----: | ---------: | -----: |
| Training   | 116,820 | 51,743 |     42,868 |  6,020 |
| Validation |   6,490 |  5,487 |      3,690 |    102 |
Table 1. Statistics of the datasets used in the second stage.

A foreground object is selected from each image and perturbed through the data augmentation pipeline as in Sec. 3.3. The DreamBooth testing set consists of 25 unique objects with various views. Combined with 59 background images that are manually chosen, 113 pairs are generated for this test set. This dataset is challenging since most objects are of complex texture or structure. We also conduct a user study on this dataset.
Metrics measuring fidelity and realism are adopted to evaluate the effectiveness of different models in terms of identity preservation and background harmonization.