
Style-NeRF2NeRF: 3D Style Transfer from Style-Aligned Multi-View Images

Haruo Fujiwara
The University of Tokyo
fujiwara@mi.t.u-tokyo.ac.jp

Yusuke Mukuta
The University of Tokyo / RIKEN
mukuta@mi.t.u-tokyo.ac.jp

Tatsuya Harada
The University of Tokyo / RIKEN
harada@mi.t.u-tokyo.ac.jp

June 25, 2024

Abstract

We propose a simple yet effective pipeline for stylizing a 3D scene, harnessing the power of 2D image diffusion models. Given a NeRF model reconstructed from a set of multi-view images, we perform 3D style transfer by refining the source NeRF model using stylized images generated by a style-aligned image-to-image diffusion model. Given a target style prompt, we first generate perceptually similar multi-view images by leveraging a depth-conditioned diffusion model with an attention-sharing mechanism. Next, based on the stylized multi-view images, we propose to guide the style transfer process with the sliced Wasserstein loss based on the feature maps extracted from a pre-trained CNN model. Our pipeline consists of decoupled steps, allowing users to test various prompt ideas and preview the stylized 3D result before proceeding to the NeRF fine-tuning stage. We demonstrate that our method can transfer diverse artistic styles to real-world 3D scenes with competitive quality. Result videos are also available on our project page: https://haruolabs.github.io/style-n2n/

Keywords: Neural Radiance Fields · Style Transfer · Sliced Wasserstein

1 Introduction

Thanks to recent advancements in 3D reconstruction techniques such as Neural Radiance Fields (NeRF) Mildenhall et al. [2020], it is nowadays possible for creators to develop a 3D asset or a scene from captured real-world data without intensive labor. While such 3D reconstruction methods work well, editing an entire 3D scene to match a desired style or concept is not straightforward.

For instance, editing conventional 3D scenes based on explicit representations like mesh often involves specialized tools and skills. Changing the appearance of the entire mesh-based scene would often require skilled labor, such as shape modeling, texture creation, and material parameter modifications.

With the advent of implicit 3D representation techniques such as NeRF, style editing methods for 3D are also emerging Nguyen-Phuoc et al. [2022], Wang et al. [2023], Liu et al. [2023], Kamata et al. [2023], Haque et al. [2023], Dong and Wang [2024] to enhance creators' content development process. Following the recent development of 2D image generation models, prominent works such as Instruct-NeRF2NeRF Haque et al. [2023], Vachha and Haque [2024] and Dong and Wang [2024] proposed to leverage the knowledge of large-scale pre-trained text-to-image (T2I) models to supervise the 3D NeRF editing process.

These methods employ a custom pipeline based on an instruction-based T2I model, "Instruct-Pix2Pix" Brooks et al. [2023], to stylize a 3D scene with text instructions. While Instruct-NeRF2NeRF is proven to work well for editing 3D scenes, including large-scale 360° environments, their method involves an iterative process of editing and replacing the training data during NeRF optimization, occasionally resulting in unpredictable results. As editing by Instruct-Pix2Pix runs in tandem with NeRF training, we found it difficult to adjust or test editing styles beforehand.

To overcome this problem, we propose an artistic style-transfer method that trains a source 3D NeRF scene on stylized images prepared in advance by a text-guided, style-aligned diffusion model. Training is guided by the Sliced Wasserstein Distance (SWD) loss Heitz et al. [2021], Li et al. [2022] to effectively perform 3D style transfer with NeRF. A summary of our contributions is as follows:
  • We propose a novel 3D style-transfer approach for NeRF, including large-scale outdoor scenes.
  • We show that a style-aligned diffusion model conditioned on depth maps of corresponding source views can generate perceptually view-consistent style images for fine-tuning the source NeRF. Users can test stylization ideas with the diffusion pipeline before proceeding to the NeRF fine-tuning phase.
  • We find that fine-tuning the source NeRF with SWD loss can perform 3D style transfer well.
  • Our experimental results illustrate the rich capability of stylizing scenes with various text prompts.

2 Related Work

2.1 Implicit 3D Representation

NeRF, introduced by the seminal paper Mildenhall et al. [2020], became one of the most popular implicit 3D representation techniques due to several benefits. Thanks to its continuous representation with a compact model, NeRF can render photo-realistic novel views at arbitrary resolution, in contrast to explicit representations such as polygon meshes or voxels. In our research, we use the "nerfacto" model implemented by Nerfstudio Tancik et al. [2023], which combines modular features from multiple papers Wang et al. [2021], Barron et al. [2022], Müller et al. [2022], Martin-Brualla et al. [2021], Verbin et al. [2022], designed to achieve a balance between speed and quality.

2.2 Style Transfer

2.2.1 2D Style Transfer.

Style transfer is a technique for blending two images, a source image and a style image, to create another image that retains the former's content but exhibits the latter's style. Since the introduction of the foundational style transfer algorithm proposed by Gatys et al. [2015], many follow-up works on 2D style transfer have explored further improvements such as faster optimization Johnson et al. [2016], Huang and Belongie [2017], zero-shot style transfer Li et al. [2017], and photo-realism Luan et al. [2017].

2.2.2 3D Style Transfer.

Several recent 3D style transfer works have applied style transfer techniques based on deep feature statistics Huang and Belongie [2017] to NeRF Liu et al. [2023], Wang et al. [2023], Zhang et al. [2022]. While these methods require a reference style image, text-based 3D editing techniques have also been proposed that leverage foundational 2D Text-to-Image (T2I) generative models. Instruct 3D-to-3D Kamata et al. [2023] proposed using the Score Distillation Sampling (SDS) loss Poole et al. [2022] for text-guided NeRF stylization, whereas Instruct-NeRF2NeRF Haque et al. [2023] and ViCA-NeRF Dong and Wang [2024] perform NeRF editing by optimizing the underlying scene with a process referred to as Iterative Dataset Update (Iterative DU), which gradually replaces the input images with edited images from Instruct-Pix2Pix Brooks et al. [2023], an image-conditioned instruction-based diffusion model, followed by an update of NeRF. Inspired by these methods, we also develop a 3D style transfer method for NeRF, supervised by images created by a diffusion pipeline but without Iterative DU.

2.3 Diffusion Models

Diffusion models Sohl-Dickstein et al. [2015], Song et al. [2020a], Dhariwal and Nichol [2021] are generative models that have gained significant attention for their ability to generate high-quality, diverse images. Inspired by classical non-equilibrium thermodynamics, they are trained to generate an image by reversing the diffusion process, progressively denoising noisy images towards meaningful ones. Diffusion models are commonly trained with classifier-free guidance Ho and Salimans [2022] to enable image generation conditioned on an input text.
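For reference, classifier-free guidance steers sampling by mixing conditional and unconditional noise predictions; a standard formulation (notation ours: guidance scale $w$, text condition $y$, null condition $\varnothing$) is

$$\hat{\epsilon}_\theta(x_t, y) = \epsilon_\theta(x_t, \varnothing) + w \left(\epsilon_\theta(x_t, y) - \epsilon_\theta(x_t, \varnothing)\right).$$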

2.3.1 Controlled Generation with Diffusion Models.

Leveraging the success of T2I diffusion models, recent research has expanded their application to controlled image generation and editing, notably in image-to-image (I2I) tasks Meng et al. [2021], Parmar et al. [2023], Kawar et al. [2023], Tumanyan et al. [2023], Mokady et al. [2023], Hertz et al. [2023a, 2022], Brooks et al. [2023]. For example, SDEdit Meng et al. [2021] achieves this by first adding noise to a source image and then guiding the diffusion process toward an output based on a given prompt. ControlNet Zhang et al. [2023] was proposed as an add-on architecture for training T2I diffusion models with extra conditioning inputs such as depth, pose, edge maps, and more. Several recent techniques Hertz et al. [2023b], Sohn et al. [2024], Cheng et al. [2023] focus on generating style-aligned images. In our work, we use a depth-conditioned I2I pipeline with an attention-sharing mechanism similar to "StyleAligned" Hertz et al. [2023b] to create a set of multi-view images sharing a consistent style.
Figure 1: Overall Pipeline: Our method consists of distinct procedures. We first prepare a NeRF model of the source view images. Given the depth maps of the corresponding views (by either estimation or rendering by NeRF), we generate stylized multi-view images using a style-aligned diffusion model. Lastly, we fine-tune the source NeRF on the stylized images using the SWD loss.

3 Method

3.1 Preliminaries

3.1.1 Neural Radiance Fields.

NeRF Mildenhall et al. [2020] models a volumetric 3D scene as a continuous function by mapping a 3D coordinate $\mathbf{x}$ and a 2D viewing direction $\mathbf{d}$ to a color $\mathbf{c}$ and a density $\sigma$. This function $F_{\Theta} : (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$ is often parameterized by a neural network combined with voxel grid structures or other encoding techniques Fridovich-Keil et al. [2022], Müller et al. [2022], Sun et al. [2022a,b] to accelerate performance. Given a NeRF model trained on a set of 2D images taken from various viewpoints of a target scene, the accumulated color $\hat{C}(\mathbf{r})$ along an arbitrary camera ray $\mathbf{r}$ is calculated with the quadrature rule by volume rendering Max [1995]:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp\left(-\sigma_i \delta_i\right)\right) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right), \tag{1}$$

where $\delta_i$ is the distance between sampled points on the ray and $T_i$ is the accumulated transmittance from the ray origin to the $i$-th sample.
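As a minimal illustration of the quadrature rule in equation 1, the per-ray accumulation can be written in a few lines of PyTorch; the tensor names below (`sigmas`, `colors`, `deltas`) are assumed to come from an existing ray sampler and field network rather than from the paper's implementation:

```python
import torch

def composite_ray_color(sigmas, colors, deltas):
    """Accumulate color along one ray with the volume-rendering quadrature rule.

    sigmas: (N,) densities at the N sampled points
    colors: (N, 3) RGB values at the sampled points
    deltas: (N,) distances between consecutive samples
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)                  # per-segment opacity
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance from the ray origin
    accum = torch.cumsum(sigmas * deltas, dim=0)
    transmittance = torch.exp(-torch.cat([accum.new_zeros(1), accum[:-1]]))
    weights = transmittance * alphas                            # contribution of each sample
    return (weights.unsqueeze(-1) * colors).sum(dim=0)          # accumulated RGB, shape (3,)
```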

3.1.2 Conditional Diffusion Models.

Recent T2I diffusion models Rombach et al. [2022], Podell et al. [2023] are built on a U-Net architecture Ronneberger et al. [2015] that integrates convolutional layers and attention blocks Vaswani et al. [2017]. Within the model, attention blocks play a crucial role in correlating text with relevant parts of the deep features during image generation. Our work uses an open-source latent diffusion model Podell et al. [2023], which includes a CLIP text encoder Radford et al. [2021] for text embedding. The cross-attention between the contextual text embedding $C$ and the deep features $X$ of the denoising network is calculated as follows:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \quad Q = X W_Q,\; K = C W_K,\; V = C W_V, \tag{2}$$

where $W_Q$, $W_K$, $W_V$ are projection matrices for a deep feature map $X$ and the text embedding $C$. We may interpret the attention operation in equation 2 as values $V$, originating from the conditional text, weighted by the correlation between the queries $Q$ and the keys $K$. There are often multiple attention heads in each layer along the channel dimension $d$, allowing the model to jointly attend to information from different subspaces of the feature space:

$$\text{MultiHead}(Q, K, V) = \big[\text{head}_1, \ldots, \text{head}_h\big] W_O, \quad \text{head}_j = \text{Attention}\!\left(Q^{j}, K^{j}, V^{j}\right).$$
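A compact sketch of this multi-head cross-attention is given below; the explicit projection matrices and head split are illustrative, whereas production U-Net implementations rely on fused, optimized attention kernels:

```python
import torch

def cross_attention(x, ctx, w_q, w_k, w_v, num_heads):
    """Scaled dot-product cross-attention between deep features and text embeddings.

    x:   (T_x, D) flattened deep feature map (provides the queries)
    ctx: (T_c, D) contextual text embedding (provides keys and values)
    w_q, w_k, w_v: (D, D) projection matrices
    """
    q, k, v = x @ w_q, ctx @ w_k, ctx @ w_v
    d_head = q.shape[-1] // num_heads

    def split_heads(t):
        # Split channels into heads so each head attends to a different subspace.
        return t.reshape(t.shape[0], num_heads, d_head).transpose(0, 1)    # (H, T, d_head)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    attn = torch.softmax(q @ k.transpose(-1, -2) / d_head ** 0.5, dim=-1)  # (H, T_x, T_c)
    return (attn @ v).transpose(0, 1).reshape(x.shape[0], -1)              # concatenate heads
```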

3.2 Style-NeRF2NeRF 3.2 样式-NeRF2NeRF

Our method consists of two distinct steps. First, we prepare stylized images of the corresponding source views using our style-aligned diffusion pipeline; we then refine the source NeRF model on the generated views to acquire a style-transferred 3D scene.

3.2.1 Style-Aligned Image-to-Image Generation.

Given a set of source view images $\{I_i\}_{i=1}^{n}$, our first goal is to generate a corresponding set of stylized view images $\{\hat{I}_i\}_{i=1}^{n} = G\!\left(\{I_i\}_{i=1}^{n};\, y\right)$ under a text condition $y$, with as much perceptual view consistency among images as possible, where $G$ consists of a sampling process such as DDIM Song et al. [2020b].

Although T2I diffusion models can generate rich images with arbitrary text prompts, merely sharing the same prompt across different source views is insufficient to generate stylized images with a perceptually consistent style. To alleviate this problem, we apply a fully-shared-attention variant of the style-aligned image generation method proposed by Hertz et al. [2023b]. Let $Q_i$, $K_i$, $V_i$ be the queries, keys, and values from a deep feature $X_i$ for view $i$; we then generate $n$ stylized views simultaneously using the following fully-shared attention:

$$\text{Attention}\!\left(Q_i,\; \big[K_1, \ldots, K_n\big],\; \big[V_1, \ldots, V_n\big]\right),$$

where $[\cdot]$ denotes concatenation along the token dimension.
Figure 4 illustrates an example of multi-view images generated with and without the fully-shared-attention mechanism.
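A sketch of how such fully-shared attention could be realized is shown below, under our reading of the formulation above: each view's queries attend to the keys and values concatenated over all $n$ views generated in the same batch (tensor shapes and names are illustrative):

```python
import torch

def fully_shared_attention(q, k, v):
    """Shared self-attention across views generated in one batch.

    q, k, v: (n_views, heads, tokens, d_head) per-view projections.
    Each view's queries attend to the keys/values of *all* views, which
    encourages a consistent style across the generated multi-view images.
    """
    n, h, t, d = k.shape
    k_all = k.permute(1, 0, 2, 3).reshape(1, h, n * t, d).expand(n, -1, -1, -1)
    v_all = v.permute(1, 0, 2, 3).reshape(1, h, n * t, d).expand(n, -1, -1, -1)
    attn = torch.softmax(q @ k_all.transpose(-1, -2) / d ** 0.5, dim=-1)
    return attn @ v_all   # (n_views, heads, tokens, d_head)
```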

3.2.2 Conditioning on Source Views.

To further strengthen perceptual consistency across multi-view frames, we attach a depth-conditioned ControlNet Zhang et al. [2023] and optionally enable SDEdit Meng et al. [2021] for conditioning on the source view. As for the depth inputs, we may either render the corresponding depth maps from the source NeRF or use an off-the-shelf depth estimation model such as MiDaS Ranftl et al. [2020].
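A rough sketch of how such a depth-conditioned image-to-image pipeline could be assembled with the diffusers library is given below; the model identifiers, file names, and parameter values are illustrative assumptions, and the StyleAligned-style attention sharing would additionally have to be patched into the pipeline's attention processors (not shown):

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionXLControlNetImg2ImgPipeline

# Depth ControlNet attached to an SDXL base model (identifiers are examples).
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

source_image = Image.open("view_000.png")   # a source-view RGB image (hypothetical path)
depth_map = Image.open("depth_000.png")     # its rendered or estimated depth map

stylized = pipe(
    prompt="a watercolor painting",          # target style prompt
    image=source_image,                      # SDEdit-style conditioning on the source view
    control_image=depth_map,                 # depth conditioning via ControlNet
    strength=0.75,                           # how far the output may deviate from the source
    controlnet_conditioning_scale=0.8,
).images[0]
```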

Given the set of translated multi-view images $\{\hat{I}_i\}_{i=1}^{n}$ based on the style text $y$ and the corresponding camera poses used to train the source NeRF model, we may proceed to the NeRF refinement stage described below.

3.2.3 NeRF Fine-Tuning.

Based on the perceptually view-consistent images $\{\hat{I}_i\}_{i=1}^{n}$ created by the style-aligned image-to-image diffusion model, our next objective is to fine-tune the source NeRF scene to reflect the target style in a 3D-consistent manner.

Although the stylized multi-view images are a good starting point for fine-tuning the source NeRF, we found that using a common RGB pixel loss is prone to over-fitting due to ambiguities in 3D geometry and color. Therefore, an alternative loss function that reflects perceptual similarity is preferred for guiding the 3D style-transfer process. To meet this requirement, we employ the Sliced Wasserstein Distance (SWD) loss Heitz et al. [2021].
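To make the fine-tuning stage concrete, a minimal training step might look like the following sketch; `render_rgb_patch` stands in for the NeRF renderer, `swd_style_loss` refers to the SWD loss sketched in Section 3.3 below, and the patch-based rendering schedule is an assumption of this sketch rather than a detail taken from the paper:

```python
import torch

def finetune_step(nerf, optimizer, cameras, stylized_images, render_rgb_patch,
                  swd_style_loss, patch_size=128):
    """One fine-tuning step of the source NeRF against a randomly chosen stylized view."""
    view_id = torch.randint(len(cameras), (1,)).item()
    # Render an RGB patch from the current NeRF at the chosen camera pose.
    rendered = render_rgb_patch(nerf, cameras[view_id], patch_size)    # (3, H, W)
    target = stylized_images[view_id]                                  # matching stylized view, (3, H, W)
    loss = swd_style_loss(rendered.unsqueeze(0), target.unsqueeze(0))  # perceptual SWD guidance
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```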

3.3 Sliced Wasserstein Distance Loss.

Feature statistics of pre-trained Convolutional Neural Networks (CNNs) such as VGG-19 Simonyan and Zisserman [2014] are known to be useful for representing the style of an image Gatys et al. [2015], Johnson et al. [2016], Huang and Belongie [2017], Li et al. [2017], Luan et al. [2017]. In our study, we employ the SWD loss originally proposed for texture synthesis Heitz et al. [2021] as the loss term to guide the style-transfer process for NeRF.

Let $F_l^i \in \mathbb{R}^{C_l}$ denote the feature vector of the $l$-th convolutional layer at pixel $i$, where $N_l$ is the number of pixels and $C_l$ is the feature dimension size. Using the Dirac delta function, we may express the discrete probability density function $p_l$ of the features for layer $l$ as below:

$$p_l(F) = \frac{1}{N_l} \sum_{i=1}^{N_l} \delta\!\left(F - F_l^i\right).$$

Using the feature distributions $\{p_l\}$ for an image $I$ and $\{\tilde{p}_l\}$ for its corresponding optimization target $\tilde{I}$, the style loss is defined as a sum of SWD over the layers:

$$\mathcal{L}_{\text{style}}(I, \tilde{I}) = \sum_{l} \mathrm{SWD}\!\left(p_l, \tilde{p}_l\right),$$

where $\mathrm{SWD}(p_l, \tilde{p}_l)$ is the SWD term defined as the expectation over 1-dimensional Wasserstein distances of features projected by random directions $V$ sampled from a unit hypersphere $\mathbb{S}^{C_l - 1}$.

Using the projected scalar features $p_{l,V} = \{\langle F_l^i, V \rangle\}_{i=1}^{N_l}$, where $V \sim \mathbb{S}^{C_l-1}$, the SWD term is written as follows:

$$\mathrm{SWD}(p_l, \tilde{p}_l) = \mathbb{E}_{V \sim \mathbb{S}^{C_l-1}}\!\left[\mathcal{W}_2^2\!\left(p_{l,V}, \tilde{p}_{l,V}\right)\right],$$

where the 1-dimensional 2-Wasserstein distance $\mathcal{W}_2^2$ is trivially calculated in closed form by taking the element-wise distances between sorted scalars in $p_{l,V}$ and $\tilde{p}_{l,V}$. An illustration of a projected 1D Wasserstein distance is shown in figure 2.
Expectation over random projections $V$ provides a good approximation in practice, and an optimized distribution is proven to converge to the target distribution. SWD is known to capture the complete target distribution Pitie et al. [2005], as described below:

$$\mathrm{SWD}(p, \tilde{p}) = 0 \iff p = \tilde{p}.$$
The calculation of SWD scales in $\mathcal{O}(N \log N)$ for an $N$-dimensional distribution, making it suitable for machine learning applications with gradient descent algorithms.
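The SWD style loss above can be sketched compactly in PyTorch; the VGG-19 layer choice, the number of random projections, and the omission of ImageNet normalization below are simplifications of this sketch, not the paper's exact settings:

```python
import torch
import torchvision

# Frozen, pre-trained VGG-19 feature extractor (layer indices are illustrative).
_vgg = torchvision.models.vgg19(weights="DEFAULT").features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)
_LAYERS = {1, 6, 11, 20, 29}

def _features(img):
    """Feature maps of the selected VGG-19 layers for an image batch (B, 3, H, W)."""
    feats, x = [], img
    for i, layer in enumerate(_vgg):
        x = layer(x)
        if i in _LAYERS:
            feats.append(x)
    return feats

def sliced_wasserstein(f, g, n_proj=64):
    """SWD between two sets of feature vectors f, g of shape (N, C)."""
    proj = torch.randn(f.shape[1], n_proj, device=f.device)
    proj = proj / proj.norm(dim=0, keepdim=True)              # random unit directions
    fp = (f @ proj).sort(dim=0).values                        # sorted 1D projections
    gp = (g @ proj).sort(dim=0).values
    return ((fp - gp) ** 2).mean()                            # closed-form 1D W2 after sorting

def swd_style_loss(rendered, target):
    """Sum of per-layer SWD between a rendered patch and its stylized target."""
    loss = rendered.new_zeros(())
    for fr, ft in zip(_features(rendered), _features(target)):
        loss = loss + sliced_wasserstein(fr.flatten(2)[0].t(), ft.flatten(2)[0].t())
    return loss
```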

3.4 Style Blending. 3.4 风格混合。

Given two different stylized views $\hat{I}^{A}_i$ and $\hat{I}^{B}_i$ and their corresponding feature distributions $p_l^{A}$ and $p_l^{B}$, one may obtain a style-blended scene by refining the source NeRF model towards the Wasserstein barycenter, where $\alpha \in [0, 1]$ is the blending weight between the two styles:

$$\mathcal{L}_{\text{blend}} = \sum_l \left[\alpha\, \mathrm{SWD}\!\left(p_l, p_l^{A}\right) + (1 - \alpha)\, \mathrm{SWD}\!\left(p_l, p_l^{B}\right)\right].$$
An example of style blending is shown in figure 3.
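Reusing the swd_style_loss sketch above, blending toward the barycenter of two styles then amounts to a weighted sum of the two SWD terms (alpha and the per-style targets are placeholders):

```python
def blended_style_loss(rendered, target_a, target_b, alpha=0.5):
    """Weighted SWD toward two stylized targets, approximating the barycenter blend."""
    return (alpha * swd_style_loss(rendered, target_a)
            + (1.0 - alpha) * swd_style_loss(rendered, target_b))
```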


Figure 2: Sliced Wasserstein Distance: the features of $p_l$ and $\tilde{p}_l$ are projected onto a random unit direction $V$ (left). The 1-dimensional Wasserstein distance can be calculated by taking the