Yizhi Song$^{1*}$, Zhifei Zhang$^{2}$, Zhe Lin$^{2}$, Scott Cohen$^{2}$, Brian Price$^{2}$, Jianming Zhang$^{2}$, Soo Ye Kim$^{2}$, He Zhang$^{2}$, Wei Xiong$^{2}$, Daniel Aliaga$^{1}$
$^{1}$Purdue University, $^{2}$Adobe Research
Figure 1. Top: Comparison with three prior works, i.e., Paint-by-Example [46], ObjectStitch [41], and TF-ICON [27]. Our method IMPRINT outperforms others in terms of identity preservation and color/geometry harmonization. Bottom: Given a coarse mask, IMPRINT can change the pose of the object to follow the shape of the mask.
Abstract
Generative object compositing emerges as a promising new avenue for compositional image editing. However, the requirement of object identity preservation poses a significant challenge, limiting practical usage of most existing methods. In response, this paper introduces IMPRINT, a novel diffusion-based generative model trained with a two-stage learning framework that decouples learning of identity preservation from that of compositing. The first stage is targeted for context-agnostic, identity-preserving pretraining of the object encoder, enabling the encoder to learn an embedding that is both view-invariant and conducive to enhanced detail preservation. The subsequent stage leverages this representation to learn seamless harmonization of the object composited to the background. In addition, IMPRINT incorporates a shape-guidance mechanism offering user-directed control over the compositing process. Extensive experiments demonstrate that IMPRINT significantly outperforms existing methods and various baselines on identity preservation and composition quality. Project page: https://song630.github.io/IMPRINT-Project-Page/
1. Introduction
Image compositing, the art of merging a reference object with a background to create a cohesive and realistic image, has witnessed transformative advancements with the advent of diffusion models (DM) [13, 32, 33, 36]. These models have catalyzed the emergence of generative object compositing, a novel task that hinges on two critical aspects: identity (ID) preservation and background harmonization. The goal is to ensure that the object in the composite image retains its identity while adapting its color and geometry for seamless integration with the background. Existing methods [27, 41, 46] demonstrate impressive capabilities in generative compositing; however, they often fail in ID-preservation or context consistency.
Recent works [41, 46] typically struggle with balancing ID preservation and background harmony. While these methods have made strides in spatial adjustments, they predominantly capture categorical rather than detailed information. TF-ICON [27] and two concurrent works [4, 48] have advanced subject fidelity, but at the expense of limiting pose and view variations for background integration, thus curtailing their applicability in real-world settings.
To address the trade-off between identity preservation and pose adjustment for background alignment, we introduce IMPRINT, a novel two-stage compositing framework that excels in ID preservation. Diverging from previous works, IMPRINT decouples the compositing process into ID preservation and background alignment stages. The first stage involves a novel context-agnostic ID-preserving training, wherein an image encoder is trained to learn view-invariant features, crucial for detail engraving. The second stage focuses on harmonizing the object with the background, utilizing the robust ID-preserving representation from the first stage. This bifurcation allows for unprecedented fidelity in object detail while facilitating adaptable color and geometry harmonization.
Our contributions can be summarized as follows:
- We introduce a novel context-agnostic ID-preserving training, demonstrating superior appearance preservation through comprehensive experiments.
- Our two-stage framework distinctively separates the tasks of ID preservation and background alignment, enabling realistic compositing effects.
- We incorporate mask control into our model, enhancing shape guidance and generation flexibility.
- We conduct an extensive study on appearance retention, offering insights into various factors influencing identity preservation, e.g., image encoders, multi-view datasets, training strategies, etc.
2. Related Work
2.1. Image Compositing
Image compositing, a pivotal task in image editing applications, aims to insert a foreground object into a background image seamlessly, striving for realism and high fidelity.
Traditionally, image harmonization [9, 16, 18, 45] and image blending [30, 42, 49, 50] focus on color and lighting consistency between the object and the background. However, these approaches fall short in addressing geometric adjustments. GAN-based works [1, 3, 24] target geometry inconsistency, yet are often domain-specific (e.g., indoor scene) and limited in handling complex transformations (e.g., out-of-plane rotation). Shadow synthesis methods like SGRNet [14] and PixHt-Lab [37] focus on realistic lighting effects.
With the advent of diffusion models [13, 33, 39, 40], recent research has shifted towards unified frameworks encompassing all aspects of image compositing. Methods like [41, 46] employ CLIP-based adapters for leveraging pretrained models, but they struggle in preserving the object's identity due to their focus on high-level semantic representations. While TF-ICON [27] improves fidelity by incorporating noise modeling and composite self-attention injection, it faces limitations in object pose adaptability.
Recent research is increasingly centering on appearance preservation in generative object compositing. Two concurrent works, AnyDoor [4] and ControlCom [48], have made strides in this area. AnyDoor combines DINOv2 [29] and a high-frequency filter, and ControlCom introduces a local enhancement module. However, these models have limited spatial correction capabilities. In contrast, our model adopts a novel approach that substantially enhances the visual consistency of the object while maintaining geometry and color harmonization, representing a significant advancement in the field.
Subject-driven image generation, the task of creating a subject within a novel context, often involves customizing subject attributes based on text prompts. Based on diffusion models, [7, 17] have led to techniques like using placeholder words for object representation, enabling high-fidelity customizations. Subsequent works [19, 26, 34, 35] extend this by fine-tuning pretrained text-to-image models for new concept learning. These advancements have facilitated diverse applications, such as subject swapping [8], open-world generation [22], and non-rigid image editing [2]. However, these methods usually require inference-time fine-tuning or multiple subject images, limiting their practicality. In contrast, our framework offers a fast-forward and background-preserving approach that is versatile for a broad spectrum of real-world data.
3. Approach
The proposed object compositing framework, IMPRINT, is summarized in Fig. 2. Formally, given input images of object $I_{obj} \in \mathbb{R}^{H \times W \times 3}$, background $I_{bg} \in \mathbb{R}^{H \times W \times 3}$, and mask $M \in \mathbb{R}^{H \times W}$ that indicates the location and scale for object compositing to the background, we aim to learn a compositing model $\mathcal{C}$ to achieve a composite image $I_{out} = \mathcal{C}(I_{obj}, I_{bg}, M) \in \mathbb{R}^{H \times W \times 3}$. The ideal outcome is an $I_{out}$ that appears visually coherent and natural, i.e., $\mathcal{C}$ should ensure that the composited object retains the identity of $I_{obj}$, aligns to the geometry of $I_{bg}$, and blends seamlessly into the background.
In this section, we expand upon our approach. To leverage pretrained text-to-image diffusion models, we design a novel image encoder to replace the text-encoding branch, thus retaining much richer information from the reference object (see Sec. 3.1). Distinct from existing works, our pipeline bifurcates the task into two specialized sub-tasks to concurrently ensure object fidelity and allow for geometric variations. The first stage defines a context-agnostic ID-preserving task, where the image encoder is trained to learn a unified representation of generic objects (Sec. 3.1). The second stage mainly trains the generator for an image compositing task (Sec. 3.2). In addition, we delve into various aspects contributing to the detail retention capability of our framework: Sec. 3.3 discusses the process of paired data collection, and Sec. 3.4 details our training strategy.
3.1. Context-Agnostic ID-preserving Stage
Distinct from prior methods, we introduce a supervised object view reconstruction task as the first stage of training that helps identity preservation. The motivation behind this task is based on the following key observations:
- Existing efforts [4, 27, 48], which successfully improve detail preservation, are limited in geometry harmonization and tend to demonstrate copy-and-paste behavior.
- There is a fundamental trade-off between identity preservation and image compositing: the object is expected to be altered, in terms of color, lighting, and geometry, to better align with the background, while simultaneously, the object's original pose, color tone, and illumination effects are memorized by the model and define its appearance.
- Multi-view data plays a significant role in keeping identity, yet acquiring such datasets is costly. Most large-scale multi-view datasets [5, 47] lack sufficient contextual information for compositing; they either lack a background entirely or have a background area that is too limited.
Based on the above insights, we give a formal definition of the task (as depicted in Fig. 2a): given an object of two views $I_{v1}, I_{v2}$ and their associated masks $M_{v1}, M_{v2}$, the background is removed and the segmented object pairs are denoted as $\hat{I}_{v1} = I_{v1} \otimes M_{v1}$, $\hat{I}_{v2} = I_{v2} \otimes M_{v2}$. We build a view synthesis model $\mathcal{S} = \{\mathcal{E}_u, \mathcal{G}_\theta\}$ conditioned on $\hat{I}_{v1}$ to generate the target view $\hat{I}_{v2}$, where $\mathcal{E}_u$ is the image encoder and $\mathcal{G}_\theta$ is the UNet backbone parameterized by $\theta$.
**Image Encoder** $\mathcal{E}_u$ consists of a pretrained DINOv2 [29] and a content adapter following [41]. DINOv2 is a SOTA ViT model that outperforms its predecessors [15, 31, 38] and extracts highly expressive visual features for reference-based generation. The content adapter allows the utilization of pretrained T2I models by bridging the domain gap between image and text embedding spaces.
**Image Decoder** $\mathcal{G}_\theta$ takes the conditional denoising autoencoder from Stable Diffusion [33] and fine-tunes its decoder during training. The objective function is defined as (based on [33]):

$$\mathcal{L}_{\mathrm{id}}=\mathbb{E}_{\hat{I}_{v1}, \hat{I}_{v2}, t, \epsilon}\left[\left\|\epsilon-\mathcal{G}_{\theta}\left(\hat{I}_{v2}, t, \mathcal{E}_{u}(\hat{I}_{v1})\right)\right\|_{2}^{2}\right]$$

where $\mathcal{L}_{\mathrm{id}}$ is the ID-preserving loss and $\epsilon \sim \mathcal{N}(0,1)$. The image encoder $\mathcal{E}_u$ and the decoder blocks of $\mathcal{G}_\theta$ are optimized in this process. Intuitively, the encoder trained for this task will always extract representations that are view-invariant while keeping identity-related details that are shared across different views. The qualitative results of this stage are shown in Sec. 4.7. Unlike previous view-synthesis works [25], our context-agnostic ID-preserving stage does not require any 3D information (e.g., camera parameters) as conditions, and we mainly focus on ID preservation instead of geometric consistency with the background (which is handled in the second stage). Therefore, only the image encoder is carried over to the next stage.
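For concreteness, the sketch below illustrates how such an ID-preserving training step could look in a latent-diffusion setup; the module names (`image_encoder`, `unet`, `vae`), the toy noise schedule, and the call signatures are illustrative assumptions rather than the exact IMPRINT implementation.

```python
# Minimal sketch of one first-stage (ID-preserving) training step.
import torch
import torch.nn.functional as F

def id_preserving_step(image_encoder, unet, vae, obj_v1, obj_v2, num_timesteps=1000):
    """obj_v1 / obj_v2: two background-removed views of the same object, shape (B, 3, H, W)."""
    # View-invariant object tokens from the source view (DINOv2 + adapter in the paper).
    obj_tokens = image_encoder(obj_v1)                      # (B, N, C)

    # Encode the target view into latent space and add noise at a random timestep.
    with torch.no_grad():
        z0 = vae.encode(obj_v2)                             # (B, 4, h, w)
    t = torch.randint(0, num_timesteps, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    alpha_bar = torch.cos(t.float() / num_timesteps * torch.pi / 2) ** 2   # toy cosine schedule
    a = alpha_bar.view(-1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * eps

    # Predict the noise conditioned on the object tokens and regress it (L_id).
    eps_pred = unet(zt, t, context=obj_tokens)
    return F.mse_loss(eps_pred, eps)
```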
(a) Stage of context-agnostic ID-preserving: we design a novel image encoder (with pre-trained DINOv2 as backbone) trained on multi-view object pairs to learn view-invariant ID-preserving representation.
Figure 2. The two-stage training pipeline of the proposed IMPRINT.
Figure 3. Illustration of the background-blending process. At each denoising step, the background area of the denoised latent is masked and blended with unmasked area from the clean background (intuitively, the model is only denoising the foreground).
3.2. Compositing Stage
Fig. 2b illustrates the pipeline of the second stage which is trained for the compositing task, comprising the finetuned image encoder $\mathcal{E}_u$ and a generator $\mathcal{G}_\phi$ (parameterized by $\phi$) conditioned on the ID-preserving representations.
A simple approach is to ignore the view synthesis stage, training the encoder and generator jointly in a single-stage framework. Unfortunately, we found quality degradation from two aspects in this naive endeavor (see Sec. 4.7):
- When DINOv2 is trained in this stage, the model exhibits more frequent copy-paste-like behavior that composites the object in a very similar view as its original view.
- When object-centric multi-view datasets, e.g., MVImgNet [47], are enabled in the training set, the model tends to produce more artifacts and exhibit poorer blending results due to the absence of background information in such datasets.
To overcome the issues above, we freeze the backbone of the image encoder (i.e., DINOv2) in the second stage and carefully collect a training set (see Sec. 3.3 for details).
In this stage, we also leverage a pretrained T2I model as the backbone of the generator, which uses the background $I_{bg}$ and a coarse mask $M$ as inputs, and is conditioned on the ID-preserving object tokens $\hat{E}_u = \mathcal{E}_u(I_{obj})$, where $I_{obj}$ denotes a masked object image. The generation is guided by injecting the object tokens into the cross-attention layers of $\mathcal{G}_\phi$. The coarse mask also allows the synthesis of shadows and interactions between the object and nearby objects.
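As a rough illustration of this conditioning mechanism, the sketch below shows object tokens used as the cross-attention context in place of text embeddings; the layer dimensions, token count, and module structure are assumptions for illustration, not the actual IMPRINT architecture.

```python
import torch
import torch.nn as nn

class ObjectCrossAttention(nn.Module):
    """Object tokens act as keys/values for a UNet feature map (illustrative sizes)."""
    def __init__(self, query_dim=320, token_dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(query_dim, heads, kdim=token_dim,
                                          vdim=token_dim, batch_first=True)

    def forward(self, latent_feats, obj_tokens):
        # latent_feats: (B, HW, query_dim) flattened UNet features
        # obj_tokens:   (B, N, token_dim) ID-preserving tokens from the image encoder
        out, _ = self.attn(latent_feats, obj_tokens, obj_tokens)
        return latent_feats + out   # residual injection, as in standard LDM blocks

feats = torch.randn(2, 64 * 64, 320)        # example feature map of a 64x64 latent
tokens = torch.randn(2, 257, 1024)          # example token count/width
out = ObjectCrossAttention()(feats, tokens) # (2, 4096, 320)
```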
As $\hat{E}_u$ already encompasses structured view-invariant details of the object, color and geometric adjustments are no longer limited by identity preservation efforts. This freedom allows for greater variation in compositing.
We define the objective function of this stage as:
$$\mathcal{L}_{\mathrm{comp}}=\mathbb{E}_{I_{obj}, I_{bg}^{*}, M, t, \epsilon}\left[M\left\|\epsilon-\mathcal{G}_{\phi}\left(I_{bg}^{*}, t, \hat{E}_{u}\right)\right\|_{2}^{2}\right]$$
where $\mathcal{L}_{\mathrm{comp}}$ is the compositing loss, $I_{bg}^{*}$ is the target image. $\mathcal{G}_\phi$ and the adapter are optimized.
**The Background-blending Process.** To ensure that the transition area between the object and the background is smooth, we adopt a background-blending strategy. This process is depicted in Fig. 3.
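A minimal sketch of this blending idea is given below, assuming a generic latent-diffusion sampler; the `unet` and `scheduler` interfaces (loosely modeled on common diffusion toolkits) are placeholders, not the actual IMPRINT code.

```python
import torch

def blended_denoising(unet, scheduler, z_bg_clean, mask, obj_tokens):
    """z_bg_clean: clean background latent; mask: (B, 1, h, w), 1 inside the compositing region."""
    z = torch.randn_like(z_bg_clean)                   # start from pure noise
    for t in scheduler.timesteps:                      # e.g. 50 DDIM steps
        eps = unet(z, t, context=obj_tokens)           # predict noise for the full latent
        z = scheduler.step(eps, t, z)                  # one denoising step
        z_bg_t = scheduler.add_noise(z_bg_clean, t)    # clean background re-noised to level t
        z = mask * z + (1.0 - mask) * z_bg_t           # keep background, denoise only the foreground
    return z
```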
**Shape-guided Controllable Compositing** enables more practical guidance of the pose and view of the generated object by drawing a rough mask. However, most prior works [4, 27, 41] offer no such control. In our proposed model, following [43], masks are defined at four levels of precision (see the Appendix), where the coarsest mask is a bounding box. Incorporating multiple levels of masks replicates real-world scenarios, where users often prefer more precise masks. Results are shown in Fig. 1.
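The exact four mask levels are defined in the Appendix; as a hedged illustration only, the sketch below derives progressively coarser masks from a precise object mask, with the coarsest being the bounding box (the intermediate levels here are guesses, not the paper's definition).

```python
import numpy as np
from scipy.ndimage import binary_dilation

def mask_levels(precise_mask: np.ndarray):
    """precise_mask: (H, W) non-empty boolean mask. Returns masks from fine to coarse."""
    slightly_coarse = binary_dilation(precise_mask, iterations=5)    # loosened contour
    very_coarse = binary_dilation(precise_mask, iterations=25)       # rough blob
    ys, xs = np.nonzero(precise_mask)
    bbox = np.zeros_like(precise_mask)
    bbox[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = True        # coarsest level: bounding box
    return [precise_mask, slightly_coarse, very_coarse, bbox]
```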
Figure 4. Illustration of the data augmentation pipeline.
3.3. Paired Data Generation
The dataset quality is another key to better identity preservation and pose variation. As proved by [4], multi-view datasets can significantly improve the generation fidelity. In practice, we use a combination of image datasets (Pixabay), panoptic video segmentation datasets (YoutubeVOS [44], VIPSeg [28] and PPR10K [23]) and object-centric datasets (MVImgNet [47] and Objaverse [5]). They are incorporated in different training stages and associated with various processing procedures in our self-supervised training.
The image datasets we collected have high resolution and rich background information, so they are only utilized in the second stage for better compositing. Inspired by [41, 46], to simulate the lighting and geometry changes in object compositing, we design an augmentation pipeline $\hat{I}_{obj} = \mathcal{P}(\mathcal{T}(I_{obj}))$, where $\mathcal{T}$ denotes affine transformations and $\mathcal{P}$ is a color and light perturbation supported by the lookup table in [16]. The perturbed object $\hat{I}_{obj}$ is used as the input, and the natural image $I_{bg}^{*}$ containing the original object is used as the target.
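A rough stand-in for this augmentation pipeline is sketched below with off-the-shelf torchvision transforms; the specific parameter values are illustrative, and the LUT-based color/light perturbation of [16] is approximated here by a simple color jitter.

```python
from torchvision import transforms

# T: affine/geometry perturbation; P: color and lighting perturbation (approximated).
augment = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.05, 0.05), scale=(0.9, 1.1), shear=5),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
])
# obj_aug = augment(obj_image)   # PIL image or (C, H, W) tensor of the segmented object
```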
Video segmentation datasets usually suffer from low resolution and motion blur, which harm the generation quality. Nevertheless, they provide object pairs that naturally differ in lighting, geometry, and view, and even exhibit non-rigid pose variations. As a result, they are also used in the second stage. As illustrated in Fig. 4, each training pair comes from one video with instance-level segmentation labels. Two distinct frames are randomly sampled; one serves as the target image, while the object is extracted from the other frame as the augmented input.
Object-centric datasets offer a significantly larger scale than video segmentation datasets and provide more intricate object details. However, they are only used in the first stage due to the limited background information available in these datasets. During training, each pair $I_{v1}, I_{v2}$ is also randomly sampled from the same video with $|v1 - v2| \leq n$, where $n$ is the temporal sampling window. Empirically, we observe a loss in generation quality as $n$ increases, and $n = 7$ strikes a balance between fidelity and quality.
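The windowed pair sampling amounts to the small routine below (a sketch; the frame-indexing convention is an assumption):

```python
import random

def sample_view_pair(num_frames: int, n: int = 7):
    """Sample two distinct frame indices v1, v2 from one sequence with |v1 - v2| <= n."""
    assert num_frames > 1
    v1 = random.randrange(num_frames)
    lo, hi = max(0, v1 - n), min(num_frames - 1, v1 + n)
    v2 = v1
    while v2 == v1:
        v2 = random.randrange(lo, hi + 1)
    return v1, v2   # frame v1 -> source object view, frame v2 -> target view
```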
3.4. Training Strategies
All previous (or concurrent) training-free methods [4, 41, 46, 48] use a frozen transformer-based image encoder, either DINOv2 or CLIP. However, freezing the encoder limits their capability in extracting object details: i) CLIP only encodes the semantic features of the object; ii) DINOv2 is trained on a dataset constructed based on image retrieval, allowing objects that are not entirely identical to be treated as the same instance. To overcome this challenge, we fine-tune the encoder specifically for compositing, ensuring the extraction of instance-level features.
Due to the extensive scale of the aforementioned encoders, they are prone to overfitting. The implementation of appropriate training strategies can effectively stabilize the training process and improve identity preservation. To this end, we design a novel training scheme: Sequential Collaborative Training.
More specifically, the object compositing stage is further divided into two phases: 1) in the first $n$ epochs, we assign the adapter a larger learning rate of $4 \times 10^{-5}$ and the UNet a smaller learning rate of $4 \times 10^{-6}$; 2) in the next $n$ epochs, we swap the learning rates of these two components (and the training finishes). This strategy focuses on training one component in each phase, with the other component simultaneously trained at a lower rate to adapt to the changed domain; the generator is trained at the end to ensure synthesis quality.
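In optimizer terms, the phase swap can be expressed with two parameter groups whose learning rates are exchanged between phases; the sketch below is illustrative (optimizer choice and module names are assumptions).

```python
import torch

def build_optimizer(adapter, unet, phase: int):
    """Sequential Collaborative Training: swap learning rates between the two phases."""
    lr_hi, lr_lo = 4e-5, 4e-6
    if phase == 1:     # first n epochs: focus on the adapter
        groups = [{"params": adapter.parameters(), "lr": lr_hi},
                  {"params": unet.parameters(), "lr": lr_lo}]
    else:              # next n epochs: focus on the UNet (generator trained last)
        groups = [{"params": adapter.parameters(), "lr": lr_lo},
                  {"params": unet.parameters(), "lr": lr_hi}]
    return torch.optim.AdamW(groups)
```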
4. Experiments
4.1. Training Details
The first stage is trained on 1,409,545 pairs and validated on 11,175 pairs from MVImgNet, which takes 5 epochs to finish. The learning rate associated with DINOv2 (ViT-g/14 with registers) is $4 \times 10^{-6}$, and the batch size is 256. The image embedding is dropped at a rate of 0.05.
The second stage is fine-tuned on a mixture of image datasets and video datasets, including a training set of 217,451 pairs and a validation set of 15,769 pairs (listed in Tab. 1), where we apply [20] to obtain the segmentation masks as labels. It is trained for 15 epochs with a batch size of 256. The embedding is dropped at a rate of 0.1.
In both stages, the images are resized to $512 \times 512$. During inference, the DDIM sampler generates the composite image after 50 denoising steps using a CFG [12] scale of 3.0. The model is trained on 8 NVIDIA A100 GPUs. The model is built on Stable Diffusion v1.4 [33].
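The embedding dropout during training is what enables classifier-free guidance at inference; one guided denoising prediction with the reported scale of 3.0 looks roughly as follows (a sketch; `unet` and the token arguments are placeholders).

```python
import torch

def cfg_noise_pred(unet, z_t, t, obj_tokens, null_tokens, scale: float = 3.0):
    """Classifier-free guidance: combine unconditional and object-conditioned predictions."""
    eps_uncond = unet(z_t, t, context=null_tokens)   # embedding "dropped" (unconditional branch)
    eps_cond = unet(z_t, t, context=obj_tokens)      # conditioned on ID-preserving tokens
    return eps_uncond + scale * (eps_cond - eps_uncond)
```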
4.2. Evaluation Benchmark
**Datasets** are collected from Pixabay and DreamBooth [34] for testing. More specifically, the Pixabay testing set has 1,000 high-resolution images and has no overlap with the training set. A foreground object is selected from each image and perturbed through the data augmentation pipeline as in Sec. 3.3. The DreamBooth testing set consists of 25 unique objects with various views. Combined with 59 manually chosen background images, 113 pairs are generated for this test set. This dataset is challenging since most objects are of complex texture or structure. We also conduct a user study on this dataset.

Table 1. Statistics of the datasets used in the second stage.
**Metrics** measuring fidelity and realism are adopted to evaluate the effectiveness of different models in terms of identity preservation and background harmonization. We utilize CLIP-score [10], DINO-score, and DreamSim [6] as measurements of generation fidelity. To obtain more precise comparison results, we always crop the output images so that the generated object is located at the center of the image. FID [11] is employed to measure realism, which indicates the compositing quality.
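The object-centered cropping and an embedding-similarity fidelity score can be sketched as below; `encode` stands in for any of the CLIP/DINO feature extractors, and the padding value is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def centered_crop(image: torch.Tensor, mask: torch.Tensor, pad: int = 16):
    """image: (C, H, W); mask: (H, W) non-empty boolean mask of the composited object."""
    ys, xs = torch.nonzero(mask, as_tuple=True)
    y0, y1 = max(int(ys.min()) - pad, 0), int(ys.max()) + pad + 1
    x0, x1 = max(int(xs.min()) - pad, 0), int(xs.max()) + pad + 1
    return image[:, y0:y1, x0:x1]

def fidelity_score(encode, output, mask, reference):
    """Cosine similarity between embeddings of the cropped output and the reference object."""
    emb_out = encode(centered_crop(output, mask).unsqueeze(0))
    emb_ref = encode(reference.unsqueeze(0))
    return F.cosine_similarity(emb_out, emb_ref).item()
```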
4.3. Quantitative Evaluation
To demonstrate the effectiveness of our model, we test our model and three baseline methods (Paint-by-Example [46], ObjectStitch [41], and TF-ICON [27]) on the two aforementioned test sets. The same inputs (a mask and a reference object) are used in all models. For a fair comparison, we further fine-tune Paint-by-Example (PbE) on our second-stage training set.
When testing TF-ICON, we employ the parameter set in "same domain" mode, as suggested by the official implementation. It also requires a text prompt as an additional input, so we apply BLIP2 [21], a state-of-the-art vision-language model, to generate captions for the images. Moreover, the captions for the DreamBooth test set are manually refined to improve the performance. As shown in Tab. 2, IMPRINT achieves the best performance in both realism and fidelity. See the Appendix for quantitative comparisons with AnyDoor.
4.4. Qualitative Evaluation
Qualitative comparisons are shown in Fig. 5, comparing our model against prior methods. Although PbE and ObjectStitch show natural compositing effects, they often fail to capture the finer details of the objects. When the object has complex texture or structure, their generated object becomes less recognizable and may even suffer from artifacts. In contrast, TF-ICON shows better consistency between the input and output, especially in keeping surface textures and captions. However, its background adaptation ability is strictly restricted. As can be observed, TF-ICON has less variation in color and geometry changes, which results in degraded compositing effects. We further compare to AnyDoor in Fig. 7 (more visual comparisons are in the Appendix). The results show that IMPRINT achieves better ID preservation and shows flexibility in adapting to the background in terms of color and geometry.

Table 2. Quantitative comparison with prior works. IMPRINT and the baselines are tested on two datasets for realism and ID-preserving measurement: DreamBooth (top) and the Pixabay test set (bottom). The results on both datasets demonstrate the advantage of our model in both ID preservation and realistic harmonization with the background.

Table 3. User study results. We design two questions to measure the realism and fidelity of the generation. In both questions, the user is presented with side-by-side comparisons of our generated image and another image randomly chosen from one of the baselines. The results in the table show user preference percentages. Our model not only achieves better realism, but also outperforms the baselines in ID preservation by a large margin.
We also show the synthesis results of the first stage in Fig. 6. Using the ID-preserving representation, our model is able to generate high-fidelity objects with large view variations. This process requires no extra condition such as camera parameters.
4.5. User Study
We also conduct a user study using Amazon Mechanical Turk, comparing our method against the three baselines on the challenging DreamBooth dataset. The user study consists of side-by-side comparisons of our result and a randomly chosen result from the baselines. We design two questions: 1) Which image is more realistic? (the input objects are hidden from the users) 2) Which image is more similar to the reference object? Each question has 111 comparisons. We received more than 880 votes from over 130 users. The results are shown in Tab. 3. In terms of realism, our model outperforms PbE and TF-ICON, while being comparable with ObjectStitch. We also evaluate visual similarity. The preference rate in the table demonstrates that our method has a significant advantage over the baselines.

Figure 5. Qualitative comparison on the DreamBooth test set. Paint-by-Example and ObjectStitch lose most object details and only maintain categorical information. TF-ICON tends to copy the pose of the input subject. The comparison highlights the advantage of IMPRINT in keeping identity and making geometric changes.
4.6. Additional Visual Results of Shape-control
Shape-guided generation introduces a lot more flexibility for image editing, as the user now gains control over the shape, view and pose of the objects, and the transformation can be either rigid or non-rigid. Fig. 8 illustrates the diverse usage of image editing given a mask as guidance.
4.7. Ablation Study
In pursuing better identity preservation and background harmonization for generative object compositing, we gain valuable experience with a wide range of techniques that contribute to this task. In Tab. 4, we provide a complete analysis of, and insights into, all these factors, and demonstrate the effectiveness of our proposed method. The same metrics are utilized as explained in Sec. 4.2.
**Training strategies.** In setting 2, we also optimize the CLIP encoder. The results of settings 1 and 2 show that the optimized CLIP can capture better object identity. However, this improvement comes at the cost of variation. Settings 5 and 6 also demonstrate improved identity and less variation. For this reason, the encoder backbone is frozen in our second stage.
**Dataset.** The dataset is another component that significantly affects performance. After adding the video datasets, the model develops a stronger capability in engraving the details (settings 2 and 3). Nevertheless, if there are too many training pairs from object-centric datasets (MVImgNet), the generation quality degrades (settings 6 and 7) since the background information is insufficient.
**Architecture.** Inspired by [41], we also use an adapter to connect the encoder with the generator. Settings 4 and 5 indicate that using the adapter boosts the overall performance in both realism and fidelity. We also observe that the model converges faster when using the adapter.
Table 4. Ablation study on our methodologies and other common components. PRE indicates whether the setting includes our pretraining stage; MVImgNet and video data indicate whether they are used in the compositing stage.
Figure 6. Top: Results of context-agnostic ID-preserving pretraining (after the first stage); IMPRINT generates view pose changes while memorizing the details of the object. Bottom: Diverse poses of the object after the second stage.
Figure 7. Comparison with AnyDoor [4]. See the Appendix for more results.
**Pretraining.** In our framework, the first-stage pretraining is a key component in improving ID preservation and harmonization effects. To demonstrate its effectiveness, we test the original DINOv2 and our fine-tuned DINOv2 on an Objaverse test set. In this evaluation, the encoders generate embeddings for diverse views of 20 objects from various categories. The embeddings are then clustered and visualized in t-SNE figures (Fig. 10). The figure shows that the fine-tuned encoder produces better clustering results, demonstrating that our ID-preserving representation effectively encodes the key details of the objects. We further ablate the first-stage training using setting 7 (where there is only the compositing stage) and setting 10 (two-stage). Without the first stage, there is a notable drop in compositing quality (Fig. 9). Additionally, we assess the effect of freezing some components (i.e., the UNet or DINOv2) during the pretraining. Compared with setting 10, settings 8 and 9 exhibit a drop in both harmonization and ID preservation, validating the effectiveness of our training scheme.
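The clustering visualization in this ablation can be reproduced roughly as follows (a sketch; the t-SNE hyperparameters and the pooled-embedding format are assumptions).

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embedding_clusters(embeddings: np.ndarray, object_ids: np.ndarray, out_path: str):
    """embeddings: (N, D) pooled encoder features; object_ids: (N,) integer object labels."""
    # perplexity must be smaller than the number of points N
    xy = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
    plt.figure(figsize=(6, 6))
    plt.scatter(xy[:, 0], xy[:, 1], c=object_ids, cmap="tab20", s=8)
    plt.savefig(out_path, dpi=200)
```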
5. Conclusion, Limitation and Future Work
In this paper, we propose IMPRINT, a novel two-stage framework that achieves state-of-the-art performance in identity preservation and background harmonization for generative object compositing. We design a new pretraining scheme in which the model learns a view-invariant, identity-preserving representation that efficiently captures the details of the object. By decoupling the task into an identity-preserving stage and a harmonization stage, IMPRINT can generate large color and geometry variations to better align with the background. Through visual and numerical comparison results, we show that IMPRINT significantly outperforms previous methods on this task. Furthermore, we add shape guidance as an additional user control. Although IMPRINT effectively addresses both identity preservation and background alignment, it has several limitations. When the required view change is too large, there can be a notable drop in identity preservation, which could be improved by exploring and incorporating a 3D model or NeRF representation into our model. Another limitation is that the model may degrade the consistency of small texts or logos.
Figure 8. More shape-control results. IMPRINT introduces more user control by using a user-provided mask as input. Inspired by [43], we define four types of mask (including bounding box). In addition to object compositing, our model also performs edits on the input object. Depending on the shape of the coarse mask, IMPRINT can operate different types of editing, including changing the view of an object, and applying non-rigid transformation on the object.
Figure 9. Ablation study on our two-stage training scheme. In (b), MVImgNet is added to the training set and the whole network is simply trained in one stage. Compared with two-stage training, single-stage training has a notable degradation in quality and loses more details.
Potential ideas to improve this are to employ a more accurate latent auto-encoder to avoid loss of information in the latent space, and to learn object encoders at higher resolution to encode small local details more accurately.
(a) Clustering results using the original DINOv2.
Figure 10. Ablation study on our first stage training. We use DINOv2 (before and after the first stage) to predict embeddings of different views of 20 Objaverse objects. The embeddings are then clustered using the same algorithm and visualized using t-SNE figures. The improved clustering results demonstrate that the embeddings produced by finetuned DINOv2 have higher quality.
References
[1] Samaneh Azadi, Deepak Pathak, Sayna Ebrahimi, and Trevor Darrell. Compositional gan: Learning imageconditional binary composition. International Journal of Computer Vision, 128(10):2570-2585, 2020. 2 [1] 沈曼妮·阿扎迪、迪帕克·帕萨克、赛娜·埃布拉希米和特雷弗·达雷尔。组合 式生成对抗网络:学习图像条件的二元组合。计算机视觉国际期刊,128(10):2570-2585,2020。
[2] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and [2] 曹茗灯,王新涛,齐中昂,单颖,谢晓虎,郑银强. Masactrl:一种无需微调的互注意力自控机制,用于生成一致性图像和
editing. arXiv preprint arXiv:2304.08465, 2023. 3 arXiv preprint arXiv:2304.08465, 2023. 3
zh-CN: arXiv 预印本 arXiv:2304.08465, 2023 年 3
[3] Bor-Chun Chen and Andrew Kae. Toward realistic image compositing with adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8415-8424, 2019. 2 [3] 陈柏峻和安德鲁凯. 基于对抗学习的逼真图像合成. 在 2019 年 IEEE/CVF 计算机视觉和模式识别会议论文集, 第 8415-8424 页. 2
[4] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023. 2,3,4,5,8,12,3,4,5,8,1 [4] 陈曦, 黄联华, 刘宇, 沈宇军, 赵德利, 赵恒爽. Anydoor:零样本物体级图像定制. arXiv preprint arXiv:2307.09481, 2023. 2,3,4,5,8,12,3,4,5,8,1
[5] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142-13153, 2023. 3, 5 [5] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, 和 Ali Farhadi. Objaverse:一个标注的三维物体宇宙。在 IEEE/CVF 计算机视觉和模式识别会议论文集中,第 13142-13153 页,2023 年。3, 5
[6] Stephanie Fu*, Netanel Tamir*, Shobhita Sundaram*, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. arXiv:2306.09344, 2023. 6 ## 翻译:
[6] Stephanie Fu*, Netanel Tamir*, Shobhita Sundaram*, Lucy Chai, Richard Zhang, Tali Dekel 和 Phillip Isola. Dreamsim:使用合成数据学习人类视觉相似性的新维度. arXiv:2306.09344,2023. 6
[7] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion, 2022. 3 [7] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik 和 Daniel Cohen-Or. 一幅图像胜过千言万语:使用文本反演实现文本到图像生成的个性化,2022 年。3
[8] Jing Gu, Yilin Wang, Nanxuan Zhao, Tsu-Jui Fu, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, et al. Photoswap: Personalized subject swapping in images. arXiv preprint arXiv:2305.18286, 2023. 3 [8] 顾景, 王依林, 赵楠轩, 傅子俊, 熊伟, 刘庆, 张志飞, 张贺, 张建明, 郑贤俊等. Photoswap: 基于人物属性的图像个性化主体替换. arXiv 预印本 arXiv:2305.18286, 2023. 3
[9] Julian Jorge Andrade Guerreiro, Mitsuru Nakazawa, and Björn Stenger. Pct-net: Full resolution image harmonization using pixel-wise color transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5917-5926, 2023. 2 [9] 朱利安·豪尔赫·安德拉德·格雷罗、中沢 充、和比约恩·斯滕格。基于像素的颜色转换进行全分辨率图像色彩调整的 Pct-net 。在 2023 年的 IEEE/CVF 计算机视觉与模式识别会议论文集 中, 5917-5926 页。
[10] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.6 [[10]] Jack Hessel、Ari Holtzman、Maxwell Forbes、Ronan Le Bras 和 Yejin Choi。Clipscore: 图像字幕的无参考评估指标. arXiv 预印本 arXiv:2104.08718,2021.6
[11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. 6 ## 简体中文翻译:
[11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler 和 Sepp Hochreiter. 双时间尺度更新规则训练的 GAN 收敛到局部纳什均衡。神经信息处理系统进展,30,2017。6
[12] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. 5 [12] 乔纳森·何和蒂姆·萨利曼斯。无分类器扩散引导,2022。 5
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020. 2 ## 译文:
Jonathan Ho、Ajay Jain 和 Pieter Abbeel 提出了去噪扩散概率模型。发表于《神经信息处理进展》,33:6840-6851,2020 年。
[14] Yan Hong, Li Niu, and Jianfu Zhang. Shadow generation for composite image in real-world scenes. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 914-922, 2022. 2 ## 翻译:
[14] 严宏,牛李,张建夫。真实场景下合成图像的阴影生成。 AAAI 人工智能会议论文集,第 914-922 页,2022 年。
[15] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, et al. Openclip, july 2021. 2(4):5, 2021. 3 [15] Gabriel Ilharco、Mitchell Wortsman、Ross Wightman、Cade Gordon、Nicholas Carlini、Rohan Taori、Achal Dave、Vaishaal Shankar、Hongseok Namkoong、John Miller 等人。Openclip,2021 年 7 月。2(4):5,2021 年。3
[16] Yifan Jiang, He Zhang, Jianming Zhang, Yilin Wang, Zhe Lin, Kalyan Sunkavalli, Simon Chen, Sohrab Amirghodsi, Sarah Kong, and Zhangyang Wang. Ssh: A self-supervised [16] 江亦凡、张贺、张建明、王艺林、林哲、卡利亚恩·桑卡瓦利、陈西蒙、索赫拉布·阿米尔古德西、萨拉·孔和王长阳。Ssh:一个自监督
framework for image harmonization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4832-4841, 2021. 2, 5 图像协调框架。在 2021 年 IEEE/CVF 国际计算机视觉会议论文集中,第 4832-4841 页。2, 5
[17] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007-6017, 2023. 3 [17] Bahjat Kawar、Shiran Zada、Oran Lang、Omer Tov、Huiwen Chang、Tali Dekel、Inbar Mosseri 和 Michal Irani。Imagic:使用扩散模型进行基于文本的真实图像编辑。收录于 IEEE/CVF 计算机视觉与模式识别会议论文集,第 6007-6017 页,2023 年。
[18] Zhanghan Ke, Chunyi Sun, Lei Zhu, Ke Xu, and Rynson WH Lau. Harmonizer: Learning to perform white-box image and video harmonization. In European Conference on Computer Vision, pages 690-706. Springer, 2022. 2 [18] 张翰柯, 孙春义, 褚磊, 许珂, 和 Lau, Rynson WH. Harmonizer:白盒图像和视频协调化学习. 在欧洲计算机视觉会议上, 690-706. 斯普林格出版社, 2022.
[19] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931-1941, 2023. 3 [19] 努普尔·库马里、张冰良、张理查德、伊莱·舍赫特曼和朱君彦。文本到图像扩散的多概念定制。在 IEEE/CVF 计算机视觉和模式识别会议论文集中,第 1931-1941 页,2023 年。 3
[20] Youngwan Lee and Jongyoul Park. Centermask: Realtime anchor-free instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13906-13915, 2020. 5 [20] Youngwan Lee 和 Jongyoul Park. CenterMask: 实时无锚点实例分割. 2020 年 IEEE/CVF 计算机视觉和模式识别会议论文集,13906-13915 页. 5
[21] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.6 [21] 李君男,李东旭,西尔维奥·萨瓦雷斯和何向东。Blip-2:使用冻结图像编码器和大型语言模型引导语言图像预训练。arXiv 预印本 arXiv:2301.12597, 2023.6
[22] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511-22521, 2023.3 [22] 李宇恒、刘浩天、吴青阳、穆方洲、杨建伟、高俊峰、李春元、李永在。GliGen:开放集接地式文本到图像生成。在 IEEE/CVF 计算机视觉和模式识别会议论文集中,第 22511-22521 页,2023 年。
[23] Jie Liang, Hui Zeng, Miaomiao Cui, Xuansong Xie, and Lei Zhang. Ppr10k: A large-scale portrait photo retouching dataset with human-region mask and group-level consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021. 5 Jie Liang, Zeng Hui, Miaomiao Cui, Xuansong Xie 和 Lei Zhang. Ppr10k:一个包含人像区域掩模和组级一致性的,用于人像照片精修的大规模数据集。在 2021 年 IEEE 计算机视觉与模式识别会议论文集中发表。 5
[24] Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey. St-gan: Spatial transformer generative adversarial networks for image compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9455-9464, 2018. 2 [24] 林宸軒, 厄辛·尤默爾, 王奧利弗, 伊利·舍克特曼, 和西蒙·盧西. 空間變換生成對抗網絡用於圖像合成. 在 IEEE 電腦視覺與模式識別會議論文集中,第 9455-9464 頁,2018 年. 2
[25] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298-9309, 2023. 3 [25] 刘若识、吴润迪、巴西尔·范·霍里克、帕维尔·托克马科夫、谢尔盖·扎哈罗夫和卡尔·冯德里克. 零-1-到-3:零样本一幅图像到三维物体. 在 IEEE/CVF 国际计算机视觉会议论文集上发表,第 9298-9309 页,2023 年。 3
[26] Zhiheng Liu, Ruili Feng, Kai Zhu, Yifei Zhang, Kecheng Zheng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones: Concept neurons in diffusion models for customized generation. arXiv preprint arXiv:2303.05125, 2023. 3 [26] 刘志恒、冯瑞丽、朱恺、张翼飞、郑克成、刘宇、赵德利、周景仁、曹阳。Cones:用于定制生成的扩散模型中的概念神经元。arXiv 预印本 arXiv:2303.05125,2023 年。 3
[27] Shilin Lu, Yanzhu Liu, and Adams Wai-Kin Kong. Tf-icon: Diffusion-based training-free cross-domain image composition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2294-2305, 2023. 1, 2, 3, 4,6 [27] 施林禄、刘晏竹和亚当·卫钦·孔。基于扩散的无训练跨域图像合成模型 Tf-icon。在 IEEE/CVF 国际计算机视觉会议论文集中,第 2294-2305 页,2023 年。1, 2, 3, 4,6
[28] Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, and Yi Yang. Large-scale video panoptic segmentation in the wild: A benchmark. In Proceedings of [28] 贾绪 缪,王小寒,吴宇,李维,张旭,魏云超,杨毅. 规模化的视频全景分割:一个基准测试. 发表在
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21033-21043, 2022. 5 IEEE/CVF 计算机视觉与模式识别会议,第 21033-21043 页,2022 年。 5
[29] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, ShangWen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2023. 2, 3
[30] Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. In ACM SIGGRAPH 2003 Papers, pages 313318. 2003. 2 [30] Patrick Pérez、Michel Gangnet 和 Andrew Blake.泊松图像编辑。发表于 ACM SIGGRAPH 2003 论文集,第 313-318 页。2003 年。 2
[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748 -8763. PMLR, 2021. 3 [31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark 等人. 基于自然语言监督学习可迁移的视觉模型. 在国际机器学习会议上发表, 8748 -8763. PMLR, 2021. 3
[32] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 2 [32] Aditya Ramesh、Prafulla Dhariwal、Alex Nichol、Casey Chu 和 Mark Chen。使用 CLIP 潜变量进行分层文本条件图像生成。arXiv 预印本 arXiv:2204.06125,1(2):3,2022 年。2
[33] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684-10695, 2022. 2, 3, 5, 6 ## 高分辨率图像合成与潜在扩散模型
[33] Robin Rombach、Andreas Blattmann、Dominik Lorenz、Patrick Esser 和 Björn Ommer。基于潜在扩散模型的高分辨率图像合成。发表于 IEEE/CVF 计算机视觉与模式识别会议论文集,第 10684-10695 页,2022 年。2、3、5、6
[34] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2250022510, 2023. 3, 5 [34] 鲁伊兹、袁筝、贾姆帕尼、普里特奇、鲁宾施坦和阿贝尔曼. 梦工厂: 微调基于文本的图像扩散模型进行基于主体的生成. 在 IEEE/CVF 计算机视觉和模式识别会议论文集,第 22500-22509 页,2023 年. 3, 5
[35] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, and Kfir Aberman. Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models. arXiv preprint arXiv:2307.06949, 2023. 3 [35] 路易斯·纳塔涅尔,袁振李,瓦伦·贾姆帕尼,魏魏,候廷波,雅埃尔·普里奇,内尔·瓦德瓦,迈克尔·鲁宾斯坦,基尔·阿伯曼。Hyperdreambooth:用于文本到图像模型快速个性化的超网络。arXiv 预印本 arXiv:2307.06949,2023 年。
[36] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479-36494, 2022. 2 [36] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, 等人。具有深度语言理解的光照片级文本图像扩散模型。推进神经信息处理系统, 35: 36479-36494, 2022。 2
[37] Yichen Sheng, Jianming Zhang, Julien Philip, Yannick Hold-Geoffroy, Xin Sun, He Zhang, Lu Ling, and Bedrich Benes. Pixht-lab: Pixel height based light effect generation for image compositing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16643-16653, 2023. 2
[38] Mannat Singh, Laura Gustafson, Aaron Adcock, Vinicius de Freitas Reis, Bugra Gedik, Raj Prateek Kosaraju, Dhruv Mahajan, Ross Girshick, Piotr Dollár, and Laurens Van Der Maaten. Revisiting weakly supervised pre-training of visual perception models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 804-814, 2022. 3
[39] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256-2265. PMLR, 2015. 2
[40] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019. 2
[41] Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, and Daniel Aliaga. Objectstitch: Object compositing with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18310-18319, 2023. 1, 2, 3, 4, 5, 6, 7
[42] Huikai Wu, Shuai Zheng, Junge Zhang, and Kaiqi Huang. Gp-gan: Towards realistic high-resolution image blending. In Proceedings of the 27th ACM international conference on multimedia, pages 2487-2495, 2019. 2
[43] Shaoan Xie, Zhifei Zhang, Zhe Lin, Tobias Hinz, and Kun Zhang. Smartbrush: Text and shape guided object inpainting with diffusion model, 2022. 4, 9
[44] Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. Youtube-vos: Sequence-to-sequence video object segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 585-601, 2018. 5
[45] Ben Xue, Shenghui Ran, Quan Chen, Rongfei Jia, Binqiang Zhao, and Xing Tang. Dccf: Deep comprehensible color filter learning framework for high-resolution image harmonization. arXiv preprint arXiv:2207.04788, 2022. 2
[46] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381-18391, 2023. 1, 2, 5, 6
[47] Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-scale dataset of multi-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9150-9161, 2023. 3, 4, 5
[48] Bo Zhang, Yuxuan Duan, Jun Lan, Yan Hong, Huijia Zhu, Weiqiang Wang, and Li Niu. Controlcom: Controllable image composition using diffusion model. arXiv preprint arXiv:2308.10040, 2023. 2, 3, 5
[49] He Zhang, Jianming Zhang, Federico Perazzi, Zhe Lin, and Vishal M Patel. Deep image compositing. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 365-374, 2021. 2
[50] Lingzhi Zhang, Tarmily Wen, and Jianbo Shi. Deep image blending. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 231-240, 2020. 2
[51] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 1
IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation
Supplementary Material
1. Overview
The following sections provide additional material in support of the main paper:
Mask types (used for shape-guided generation);
Ablation study on two alternative architectures;
Additional results of shape-guided generation;
Additional qualitative comparison results;
Additional comparisons with AnyDoor [4];
Failure cases.
2. Mask Types
As discussed in Sec. 3.2, to enable more user control, we define four levels of coarse masks, including the bounding-box mask. Fig. 11 shows all the mask types. As the coarseness level increases (from mask 1 to mask 4), the model has more freedom in generating the object.
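The paper does not spell out how each coarseness level is constructed beyond the bounding-box case, so the following is only an illustrative sketch (Python, using OpenCV) of one plausible way to derive four increasingly coarse masks from a binary object segmentation; the kernel sizes and the use of a convex hull are our own assumptions.

```python
# Illustrative only: derive four increasingly coarse masks from a binary
# segmentation. Levels 1-2 use dilation, level 3 a convex hull, and level 4
# the axis-aligned bounding box (the coarsest mask mentioned above).
import cv2
import numpy as np

def coarse_masks(seg: np.ndarray) -> list:
    """seg: HxW uint8 binary mask (0/255). Returns masks 1-4, coarsest last."""
    masks = []
    for k in (15, 45):                      # assumed dilation kernel sizes
        masks.append(cv2.dilate(seg, np.ones((k, k), np.uint8)))
    contours, _ = cv2.findContours(seg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    hull_mask = np.zeros_like(seg)
    cv2.fillPoly(hull_mask, [cv2.convexHull(np.vstack(contours))], 255)
    masks.append(hull_mask)
    ys, xs = np.nonzero(seg)
    bbox_mask = np.zeros_like(seg)
    bbox_mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 255
    masks.append(bbox_mask)
    return masks
```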
3. Ablation Study on Alternative Architectures
In our efforts toward better identity preservation, we also explored two alternative architectures (Fig. 12) that inject object features more directly (omitted from the main paper due to the page limit): 1) concatenation and 2) ControlNet [51]. To provide extra features in these two pipelines, a naive choice is to use the same segmented object $I_{obj}$ as the additional input. However, both the concatenation and the ControlNet structures induce a spatial correspondence between the output and the additional input (i.e., the generated object tends to have the same size and position as the input), and using $I_{obj}$, which is much larger than the mask $M$, destroys this correspondence. For this reason, we instead use the inserted object image $I_{obj}^{*}$ as the additional hint, in which the cropped and resized object $I_{obj}$ is fitted into the mask area of the background image $I_{bg}$. To replace the text encoder branch, we use a combination of a CLIP image encoder (ViT-L/14) and an adapter, fine-tuned together with the UNet backbone following the sequential collaborative training strategy discussed in Sec. 3.4. Both pipelines are trained on the same datasets (Pixabay and the video datasets) as our proposed model in the second stage.
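As a concrete illustration of how $I_{obj}^{*}$ can be assembled (a minimal sketch under our own assumptions, not the exact training code), the segmented object is cropped to its tight bounding box, resized to the bounding box of the mask $M$ in the background, and pasted there:

```python
# Sketch: build the inserted-object image I_obj* by fitting the cropped object
# into the bounding box of the coarse mask M on the background I_bg.
import numpy as np
from PIL import Image

def make_inserted_object(obj_rgba: Image.Image, bg: Image.Image,
                         mask: np.ndarray) -> Image.Image:
    """obj_rgba: segmented object with alpha; mask: HxW binary area M."""
    alpha = obj_rgba.split()[-1]                 # object silhouette
    obj_rgba = obj_rgba.crop(alpha.getbbox())    # tight crop of the object
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    resized = obj_rgba.resize((x1 - x0, y1 - y0))
    out = bg.copy()
    out.paste(resized, (x0, y0), resized)        # alpha channel acts as paste mask
    return out
```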
3.1. Concatenation
The first architecture is illustrated in Fig. 12a. An additional feature-injection branch is added for better identity preservation: $I_{obj}^{*}$ is concatenated with the background image $I_{bg}$. After this modification, the UNet encoder takes 8 input channels, where the weights of the extra 4 channels are initialized to zero at the start of training.
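A minimal sketch of this modification, assuming a diffusers-style Stable Diffusion UNet (the checkpoint name below is only a placeholder, not necessarily the backbone used in the paper):

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")  # placeholder checkpoint

old_conv = unet.conv_in                        # Conv2d(4, C, 3, padding=1) in SD v1.x
new_conv = torch.nn.Conv2d(8, old_conv.out_channels,
                           kernel_size=old_conv.kernel_size,
                           padding=old_conv.padding)
with torch.no_grad():
    new_conv.weight.zero_()                    # extra 4 channels start at 0.0
    new_conv.weight[:, :4] = old_conv.weight   # keep pretrained weights for the latent
    new_conv.bias.copy_(old_conv.bias)
unet.conv_in = new_conv
unet.register_to_config(in_channels=8)         # keep the model config consistent
```

Zero-initializing the new channels means the widened model initially reproduces the behavior of the pretrained 4-channel UNet.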
3.2. ControlNet
The second architecture is illustrated in Fig. 12b. ControlNet is another structure for adding spatial conditioning control, such as depth maps, Canny edges, sketches, and human poses. In this pipeline, the extra inputs are fed into a trainable copy of the original UNet encoder to learn the condition. For our task of generative object compositing, the conditioning input is the concatenation of the inserted object $I_{obj}^{*}$ and a mask $1-M$ indicating the area in which to generate the object.
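The conditioning signal itself is simple to construct; the sketch below (our own tensor-layout assumption) concatenates $I_{obj}^{*}$ with the inverted mask along the channel dimension before it is passed to the trainable encoder copy:

```python
import torch

def controlnet_hint(inserted_obj: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """inserted_obj: (B, 3, H, W) image I_obj* in [-1, 1];
    mask: (B, 1, H, W) binary M, 1 = area in which to generate the object."""
    # The ControlNet branch receives the appearance hint together with 1 - M,
    # so it sees both what to insert and where the generation is allowed.
    return torch.cat([inserted_obj, 1.0 - mask], dim=1)   # (B, 4, H, W)
```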
3.3. Quantitative Comparison
To quantify the effects of these two architectures, an evaluation is conducted on the DreamBooth dataset, as in Sec. 4.3. Tab. 5 shows the results, where "Baseline" is setting 3 in the ablation study of the main paper (Sec. 4.7). Our model outperforms the other pipelines on all three metrics that measure identity preservation, demonstrating the effectiveness of IMPRINT in memorizing object details.
To further assess the compositing effects, we perform another user study with the same configuration as in the main paper (Sec. 4.5), comparing the realism and fidelity of our results against the concatenation and ControlNet pipelines. Tab. 6 displays the user preferences for the different frameworks on the two questions. The results validate the superiority of our model in both ID preservation and compositing.
3.4. Qualitative Comparison
Fig. 13 provides a qualitative comparison between our model and the other two pipelines. Although the structural correspondence inherent in these two pipelines helps ID preservation, it also constrains their ability to make spatial adjustments; thus, their compositing results in the figure are worse than our model's (in the first three examples, our outputs show larger pose changes). Moreover, owing to the pretraining stage, our model also preserves details better.
4. Additional Results of Shape-Guided Generation
4.1. Ablation Study
Shape guidance is an important feature of our model that enables more user control. This feature is not independent of our efforts in identity preservation: as demonstrated in Tab. 7, the overall performance (realism and fidelity) of shape-guided generation is also improved by our pretraining stage.
Figure 11. The four types of masks used in the second compositing stage. The generation is constrained to the masked area, so the user-provided mask can modify the pose, view, and shape of the subject.
(a) The concatenation-based pipeline. Aside from the embedding branch, an additional input (the inserted object $I_{obj}^{*}$) is concatenated with $I_{bg}$. Note that the UNet backbone encoder has 8 input channels, where the extra 4 channels are initialized to zero.
(b) The ControlNet-based pipeline. In the new ControlNet branch, the concatenation of $I_{obj}^{*}$ and a mask is given as the additional input.
Figure 12. The pipelines of the two alternative architectures for feature injection: Concatenation and ControlNet.
Table 5. Quantitative comparison on the DreamBooth test set. Baseline refers to setting 3 in the ablation study section of the main paper. The table reports detail-preservation metrics, comparing our proposed model with the three alternative architectures.
Table 6. User study results (in percentage). In the two questions evaluating realism and similarity, the workers are presented with side-by-side results from different models and are asked to compare them.
This ablation study is conducted on the test sets of the video datasets. We follow the same data generation pipeline as in Sec. 3.3: the target image and the input object are taken from frames $I_{n_1}$ and $I_{n_2}$, respectively, with $n_1 \neq n_2$. The guidance mask $M$ is a coarse mask of the object segmentation in the target frame $n_1$. We compare our proposed model with a model trained only on the second (compositing) stage. The quantitative results show the improvement brought by the pretraining stage.
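For clarity, the frame-pair sampling can be summarized by the hypothetical loader below; coarsen() stands in for the coarse-mask construction sketched in Sec. 2, and none of these names come from the released code.

```python
import random

def sample_training_pair(video):
    """video: list of (frame, object_segmentation) pairs for one tracked object."""
    n1, n2 = random.sample(range(len(video)), 2)   # guarantees n1 != n2
    target_frame, target_seg = video[n1]           # supervision target
    source_frame, source_seg = video[n2]           # provides the input object
    guidance_mask = coarsen(target_seg)            # hypothetical helper: coarse mask M
    return source_frame, source_seg, target_frame, guidance_mask
```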
5. Additional Qualitative Results
To further show the advantages of our model over the baseline methods (Paint-by-Example or PbE [46], ObjectStitch or OS [41], and TF-ICON [27]), we include more qualitative results in Fig. 14 and Fig. 15.
Figure 13. Qualitative comparisons with the concatenation-based and ControlNet-based pipelines. Our model shows a stronger ability to make geometric adjustments (especially in the first three examples) as well as better identity preservation.
Table 7. Ablation study on the pretraining stage in shape-guided generation. PRE denotes the pretraining stage. With pretraining, the model shows stronger ID preservation and realism, highlighting that our pretraining boosts the performance of shape-guided generation.
6. Additional Comparisons with AnyDoor
Table 8. Left: Quantitative comparison on the DreamBooth test set. Right: User study results (in percentage).
We provide additional comparisons below using the official implementation of AnyDoor. We observe that IMPRINT significantly outperforms AnyDoor in the following experiments:
We calculate the CLIP score and DINO score on the DreamBooth test set to measure identity preservation (shown in the left of Tab. 8). Note that, to obtain more accurate results, we mask the background of all generated images when performing the evaluation on the DreamBooth set (a sketch of this masked evaluation is provided below).
Figure 14. More qualitative comparisons. We compare our proposed model with Paint-by-Example (PbE), ObjectStitch (OS), and TF-ICON. IMPRINT better preserves object identity, and the generated object is more consistent with the background.
Figure 15. More qualitative comparisons. We compare our proposed model with Paint-by-Example (PbE), ObjectStitch (OS), and TF-ICON. IMPRINT better preserves object identity, and the generated object is more consistent with the background.
Figure 16. Additional qualitative comparisons with AnyDoor.
We conduct a new user study under the same setting as the user study in the main paper (shown in the right of Tab. 8). Users show a higher preference for our results in both realism and detail preservation.
In the additional visual comparisons in Fig. 16, our model demonstrates greater adaptability in adjusting the object’s pose to match the background, while preserving the details.
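As a reference for how the masked evaluation mentioned above can be implemented, here is a sketch using the open-source CLIP weights from Hugging Face; the exact evaluation code of the paper may differ, and the DINO score is computed analogously with a DINO/DINOv2 image encoder [29].

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_image_score(generated: Image.Image, reference: Image.Image,
                     mask: Image.Image) -> float:
    """mask: binary object mask for the generated image (white = object)."""
    black = Image.new("RGB", generated.size)
    masked = Image.composite(generated, black, mask.convert("L"))  # black out background
    inputs = processor(images=[masked, reference], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()                            # cosine similarity
```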
7. Failure Cases
Fig. 17 shows the limitations of IMPRINT, as discussed in Sec. 5. In the first example, though the vehicle is well aligned with the background, its structure is deformed and it partially loses its identity due to the large spatial transformation. In the second example, the small logos and text on the item cannot be fully maintained and exhibit small artifacts, mainly caused by the decoder in Stable Diffusion [33].
Figure 17. Limitations. 1) The first example shows identity loss when making large geometric corrections. The structure of the vehicle changes after generation. 2) The second example shows the degradation of small logos and texts after decoding from the latent space.