GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting
Xinjie Zhang$^{1,3\star+}$, Xingtong Ge$^{2,3\star+}$, Tongda Xu$^{4}$, Dailan He$^{5}$, Yan Wang$^{4}$, Hongwei Qin$^{3}$, Guo Lu$^{6}$, Jing Geng$^{2\dagger}$, and Jun Zhang$^{1\dagger}$

$^{1}$ The Hong Kong University of Science and Technology $^{2}$ Beijing Institute of Technology $^{3}$ SenseTime Research $^{4}$ Institute for AI Industry Research (AIR), Tsinghua University $^{5}$ The Chinese University of Hong Kong $^{6}$ Shanghai Jiaotong University

xzhangga@connect.ust.hk, xingtong.ge@gmail.com, x.tongda@nyu.edu, hedailan@link.cuhk.edu.hk, wangyan202199@163.com, qinhongwei@sensetime.com, luguo2014@sjtu.edu.cn, janegeng@bit.edu.cn, eejzhang@ust.hk
Abstract
Implicit neural representations (INRs) recently achieved great success in image representation and compression, offering high visual quality and fast rendering speeds of 10-1000 FPS, assuming sufficient GPU resources are available. However, this requirement often hinders their use on low-end devices with limited memory. In response, we propose a groundbreaking paradigm of image representation and compression by 2D Gaussian Splatting, named GaussianImage. We first introduce 2D Gaussians to represent the image, where each Gaussian has 8 parameters including position, covariance and color. Subsequently, we unveil a novel rendering algorithm based on accumulated summation. Remarkably, our method, with a minimum of $3\times$ lower GPU memory usage and $5\times$ faster fitting time, not only rivals INRs (e.g., WIRE, I-NGP) in representation performance, but also delivers a faster rendering speed of 1500-2000 FPS regardless of parameter size. Furthermore, we integrate existing vector quantization techniques to build an image codec. Experimental results demonstrate that our codec attains rate-distortion performance comparable to compression-based INRs such as COIN and COIN++, while facilitating decoding speeds of approximately 2000 FPS. Additionally, a preliminary proof of concept shows that our codec surpasses COIN and COIN++ in performance when using partial bits-back coding. Code is available at https://github.com/Xinjie-Q/GaussianImage.
1 Introduction

Image representation is a fundamental task in signal processing and computer vision. Traditional image representation methods, including grid graphics, wavelet
Fig. 1: Image representation (left) and compression (right) results with different decoding times on the Kodak and DIV2K datasets, respectively. The radius of each point indicates the parameter size (left) or bits per pixel (right). Our method enjoys the fastest decoding speed regardless of parameter size or bpp.
transform [30], and discrete cosine transform [3], have been extensively applied across a broad spectrum of applications, from image compression to vision task analysis. However, these techniques encounter significant obstacles when processing large-scale datasets and striving for highly efficient storage solutions.
The advent of implicit neural representations (INRs) [54, 56] marks a significant paradigm shift in image representation techniques. Typically, INRs employ a compact neural network to derive an implicit continuous mapping from input coordinates to the corresponding output values. This allows INRs to capture and retain image details with greater efficiency, which provides considerable benefits across various applications, including image compression [22, 23, 27, 57], deblurring [50, 65, 67], and super-resolution [15, 43, 49]. However, most state-of-the-art INR methods [25, 51, 53, 54, 59] rely on a large high-dimensional multi-layer perceptron (MLP) network to accurately represent high-resolution images. This dependency leads to prolonged training times, increased GPU memory requirements, and slow decoding speed. While recent innovations [18, 44, 48, 58] have introduced multi-resolution feature grids coupled with a compact MLP to accelerate training and inference, they still require enough GPU memory to support their fast training and inference, which is difficult to meet when resources are limited. Consequently, these challenges substantially hinder the practical deployment of INRs in real-world scenarios.
In light of these challenges, our research aims to develop an advanced image representation technique that enables efficient training, friendly GPU memory usage, and fast decoding. To achieve this goal, we resort to Gaussian Splatting (GS) [35], which was recently developed for 3D scene reconstruction. By leveraging explicit 3D Gaussian representations and differentiable tile-based rasterization, 3D GS not only enjoys high visual quality with competitive training times, but also achieves real-time rendering capabilities.
Nevertheless, it is non-trivial to directly adapt 3D GS for efficient single image representation. Firstly, considering that existing 3D GS methods [12, 35] depend on varying camera transformation matrices to render images from different perspectives, a straightforward adaptation for a single image is fixing the camera transformation matrix to render the image from a single viewing angle. Unfortunately, each 3D Gaussian usually includes 59 learnable parameters [35], and thousands of 3D Gaussians are required to represent a single image. This naive approach substantially increases the storage and communication demands. As can be inferred from Table 1, the storage footprint for a single image of tens of kilobytes can escalate to dozens of megabytes, which makes rendering difficult on low-end devices with limited memory. Secondly, the rasterization algorithm [35] in 3D GS, designed for $\alpha$-blending approximation, necessitates pre-sorted Gaussians based on depth information derived from camera parameters. This poses a challenge for single images because detailed camera parameters are often not known for natural individual images, while non-natural images, including screenshots and AI-generated content, are not captured by cameras. Without accurate depth information, the Gaussian sorting might be impaired, diminishing the final fitting performance. Moreover, the current rasterization process skips the remaining Gaussians once the accumulated opacity surpasses a given threshold, which results in underutilization of Gaussian data, thereby requiring more Gaussians for high-quality rendering.
To address these issues, we propose a new paradigm of image representation and compression, namely GaussianImage, using 2D Gaussian Splatting. Firstly, we adopt 2D Gaussians in lieu of 3D for a compact and expressive representation. Each 2D Gaussian is defined by 4 attributes (9 parameters in total): position, anisotropic covariance, color coefficients, and opacity. This modification results in a $6.5\times$ compression over 3D Gaussians with an equivalent number of Gaussian points, significantly mitigating the storage demands of the Gaussian representation. Subsequently, we advocate a unique rasterization algorithm that replaces depth-based Gaussian sorting and $\alpha$-blending with an accumulated summation process. This novel approach directly computes each pixel's color from the weighted sum of 2D Gaussians, which not only fully utilizes the information of all Gaussian points covering the current pixel to improve fitting performance, but also avoids the tedious calculation of accumulated transparency to accelerate training and inference. More importantly, this summation mechanism allows us to merge the color coefficients and opacity into a single set of weighted color coefficients, reducing the parameter count to 8 and further improving the compression ratio to $7.375\times$. Finally, we transfer our 2D Gaussian representation into a practical image codec. Framing image compression as a Gaussian attribute compression task, we employ a two-step compression strategy: attribute quantization-aware fine-tuning and encoding. By applying 16-bit float quantization, 6-bit integer quantization [11], and residual vector quantization (RVQ) [73] to the positions, covariance parameters, and weighted color coefficients, respectively, we successfully develop the first image codec based on 2D Gaussian Splatting. As a preliminary proof of concept, partial bits-back coding [52, 60] is optionally used to further improve the compression performance of our codec. Overall, our contributions are threefold:
- We present a pioneering paradigm of image representation and compression by 2D Gaussian Splatting. With a compact 2D Gaussian representation and a novel accumulated blending-based rasterization method, our approach achieves high representation performance with short training duration, minimal GPU memory overhead and, remarkably, 2000 FPS rendering speed.
- We develop a low-complexity neural image codec using vector quantization. Furthermore, a partial bits-back coding technique is optionally used to reduce the bitrate.
- Experimental results show that, compared with existing INR methods, our approach achieves remarkable training and inference acceleration with less GPU memory usage while maintaining similar visual quality. When used as an efficient image codec, our approach offers compression performance competitive with COIN [23] and COIN++ [22]. Comprehensive ablations and analyses demonstrate the effectiveness of each proposed component.
2 Related Works
2.1 Implicit Neural Representation
Recently, implicit neural representation has gained increasing attention for its wide-ranging potential applications, such as 3D scene rendering [7, 8, 46, 66], image [18, 48, 53, 54] and video [13, 14, 40, 74] representations. We roughly classify existing image INRs into two categories: (i) MLP-based INRs [25, 51, 53, 54, 59] take the positional encoding of spatial coordinates as input to an MLP network to learn the RGB values of images; since they rely only on the neural network to encode all the image information, training and inference are inefficient, especially for high-resolution images. (ii) Feature grid-based INRs [18, 44, 48, 58] adopt a large-scale multi-resolution grid, such as a quadtree or hash table, to provide prior information for a compact MLP. This reduces the learning difficulty of the MLP to a certain extent and accelerates the training process, making INRs more practical. Unfortunately, they still consume large amounts of GPU memory, which is difficult to accommodate on low-end devices. Instead of following existing INR methods, we aim to propose a brand-new image representation paradigm based on 2D Gaussian Splatting, which enables us to enjoy swifter training, faster rendering, and less GPU resource consumption.
2.2 Gaussian Splatting
Gaussian Splatting [35] has recently gained tremendous traction as a promising paradigm for 3D view synthesis. With explicit 3D Gaussian representations and differentiable tile-based rasterization, GS not only brings unprecedented control and editability but also facilitates high-quality and real-time rendering in 3D scene reconstruction. This versatility has opened up new avenues in various domains, including simultaneous localization and mapping (SLAM) [33, 34, 68],
Fig. 2: Our proposed GaussianImage framework. 2D Gaussians are first formatted and then rasterized to generate the output image. The rasterizer uses our proposed accumulated blending for efficient 2D image representation.
dynamic scene modeling [42, 63, 70], AI-generated content [16, 19, 76], and autonomous driving [69, 75]. Despite its great success in 3D scenarios, the application of GS to single image representation remains unexplored. Our work pioneers the adaptation of GS for 2D image representation, leveraging the strengths of GS in highly parallelized workflows and real-time rendering to outperform INR-based methods in terms of training efficiency and decoding speed.
2.3 Image Compression
Traditional image compression techniques, such as JPEG [61], JPEG2000 [55] and BPG [10], follow a transformation, quantization, and entropy coding procedure to achieve good compression efficiency with decent decompression speed. Recently, learning-based image compression methods based on variational autoencoders (VAEs) have re-imagined this pipeline, integrating complex nonlinear transformations [4, 20, 28, 41] and advanced entropy models [5, 6, 36, 47]. Despite these methods surpassing traditional codecs in rate-distortion (RD) performance, their extremely high computational complexity and very slow decoding speed severely limit their practical deployment. To tackle the computational inefficiency of existing art, some works have explored INR-based compression methods [22, 23, 27, 38, 39, 57]. However, as image resolutions climb, their decoding speeds falter dramatically, challenging their real-world applicability. In this paper, our approach diverges from the VAE and INR paradigms, utilizing 2D Gaussian Splatting to forge a neural image codec with unprecedented decoding efficiency. This marks an important milestone for neural image codecs.
3 Method
Fig. 2 delineates the overall processing pipeline of our GaussianImage. Our approach begins by forming 2D Gaussians on an image plane, a process that mainly entails calculating the 2D covariance matrix $\boldsymbol{\Sigma}$. Afterwards, we employ an accumulated blending mechanism to compute the value of each pixel. In what follows, we begin with the formation of a 2D Gaussian in Section 3.1. Next, in Section 3.2, we describe how to adapt the rasterization process of 3D GS to the unique characteristics of 2D image representation and upgrade the 2D Gaussian with fewer parameters. Then, we present a two-step compression pipeline to convert our GaussianImage into a neural image codec in Section 3.3. Finally, we state the training process for the image representation and compression tasks in Section 3.4.
3.1 2D Gaussian Formation
In 3D Gaussian Splatting, each 3D Gaussian is initially mapped onto a 2D plane through viewing and projection transformations. Then a differentiable rasterizer is used to render the current view image from these projected Gaussians. Since our application is no longer oriented to 3D scenes, but to 2D image representation, we discard many bloated operations and redundant parameters in 3D GS, such as the projection transformation, spherical harmonics, etc.
In our framework, the image representation unit is a 2D Gaussian. The basic 2D Gaussian is described by its position $\boldsymbol{\mu} \in \mathbb{R}^{2}$, 2D covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{2 \times 2}$, color coefficients $\boldsymbol{c} \in \mathbb{R}^{3}$ and opacity $o \in \mathbb{R}$. Note that the covariance matrix $\boldsymbol{\Sigma}$ of a Gaussian distribution must be positive semi-definite. Typically, it is difficult to constrain the learnable parameters using gradient descent so that they always yield such valid matrices. To avoid producing invalid matrices during training, we choose to optimize a factorized form of the covariance matrix. Here, we present two decomposition ways to cover all the information of the original covariance matrix. One intuitive decomposition is the Cholesky factorization [31], which breaks down $\boldsymbol{\Sigma}$ into the product of a lower triangular matrix $\boldsymbol{L} \in \mathbb{R}^{2 \times 2}$ and its conjugate transpose $\boldsymbol{L}^{T}$:

$$\boldsymbol{\Sigma} = \boldsymbol{L} \boldsymbol{L}^{T} \qquad (1)$$
For ease of writing, we use a Cholesky vector $\boldsymbol{l}=\{l_{1}, l_{2}, l_{3}\}$ to represent the lower triangular elements of matrix $\boldsymbol{L}$. Compared with a 3D Gaussian having 59 learnable parameters, our 2D Gaussian only requires 9 parameters, making it more lightweight and suitable for image representation.
Another decomposition follows 3D GS [35] to factorize the covariance matrix into a rotation matrix $\boldsymbol{R} \in \mathbb{R}^{2 \times 2}$ and a scaling matrix $\boldsymbol{S} \in \mathbb{R}^{2 \times 2}$:

$$\boldsymbol{\Sigma} = \boldsymbol{R} \boldsymbol{S} \boldsymbol{S}^{T} \boldsymbol{R}^{T} \qquad (2)$$

$$\boldsymbol{R}=\begin{pmatrix}\cos \theta & -\sin \theta \\ \sin \theta & \cos \theta\end{pmatrix}, \quad \boldsymbol{S}=\begin{pmatrix}s_{1} & 0 \\ 0 & s_{2}\end{pmatrix} \qquad (3)$$
Here, $\theta$ represents the rotation angle, and $s_{1}$ and $s_{2}$ are scaling factors along the two eigenvector directions. While the decomposition of the covariance matrix is not unique, the two forms have equivalent capability to represent the image. However, their robustness to compression is inconsistent, which is explained in detail in the appendix. Therefore, we need to carefully choose the decomposition form of the covariance matrix when facing different image tasks.
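To make the two factorizations concrete, the following minimal NumPy sketch (illustrative only, not the paper's CUDA implementation) builds $\boldsymbol{\Sigma}$ from a Cholesky vector $\{l_1, l_2, l_3\}$ and from a rotation angle $\theta$ with scales $s_1, s_2$:

```python
import numpy as np

def cov_from_cholesky(l1, l2, l3):
    """Sigma = L L^T with lower-triangular L = [[l1, 0], [l2, l3]] (Eq. 1)."""
    L = np.array([[l1, 0.0],
                  [l2, l3]])
    return L @ L.T  # positive semi-definite by construction

def cov_from_rotation_scaling(theta, s1, s2):
    """Sigma = R S S^T R^T with 2D rotation R and S = diag(s1, s2) (Eqs. 2-3)."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    M = R @ np.diag([s1, s2])
    return M @ M.T

# Either form yields a valid covariance without constraining the raw parameters:
print(cov_from_cholesky(1.2, 0.4, 0.9))
print(cov_from_rotation_scaling(0.3, 1.5, 0.8))
```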
3.2 Accumulated Blending-based Rasterization

During the rasterization phase, 3D GS first forms a sorted list of Gaussians $\mathcal{N}$ based on the projected depth information. Then $\alpha$-blending is adopted to render pixel $i$:

$$\boldsymbol{C}_{i}=\sum_{n \in \mathcal{N}} \boldsymbol{c}_{n} \cdot \alpha_{n} \cdot T_{n}, \quad T_{n}=\prod_{m=1}^{n-1}\left(1-\alpha_{m}\right) \qquad (4)$$
where $T_{n}$ denotes the accumulated transparency. The $\alpha_{n}$ is computed with the projected 2D covariance $\boldsymbol{\Sigma}$ and opacity $o_{n}$:

$$\alpha_{n}=o_{n} \cdot \exp \left(-\sigma_{n}\right), \quad \sigma_{n}=\frac{1}{2} \boldsymbol{d}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{d} \qquad (5)$$
where $\boldsymbol{d} \in \mathbb{R}^{2}$ is the displacement between the pixel center and the projected 2D Gaussian center.
Since the acquisition of depth information involves the viewing transformation, it requires knowing the intrinsic and extrinsic parameters of the camera in advance. However, it is difficult for natural individual images to access detailed camera parameters, while non-natural images, such as screenshots and AI-generated content, are not captured by cameras. In this case, retaining the $\alpha$-blending of 3D GS without depth cues would result in arbitrary blending sequences, compromising the rendering quality. Moreover, 3D GS only maintains Gaussians within a $99\%$ confidence interval in order to solve the problem of numerical instability in computing the projected 2D covariance, but this means that only part of the Gaussians covering pixel $i$ contribute to its rendering, leading to inferior fitting performance.
To overcome these limitations, we propose an accumulated summation mechanism to unleash the potential of our 2D Gaussian representation. Since there is no viewpoint influence when rendering an image, the ray observed at each pixel is fixed, and so are all the $\alpha$ values. Therefore, we merge the $T_{n}$ part of Equation 4 into the $o_{n}$ term, and simplify the computationally expensive $\alpha$-blending to a weighted sum:

$$\boldsymbol{C}_{i}=\sum_{n \in \mathcal{N}} \boldsymbol{c}_{n} \cdot \alpha_{n}=\sum_{n \in \mathcal{N}} \boldsymbol{c}_{n} \cdot o_{n} \cdot \exp \left(-\sigma_{n}\right) \qquad (6)$$
This removes the need for a Gaussian sequence order, so that we can remove sorting from the rasterization.
This novel rasterization algorithm brings multiple benefits. First, our accumulated blending process is insensitive to the order of the Gaussian points. This property allows us to avoid the impact of a random Gaussian point order on rendering, achieving robustness to any ordering of the Gaussian points. Second, compared with Equation 4, our rendering skips the tedious sequential calculation of the accumulated transparency $T_{n}$, improving training efficiency and rendering speed. Third, since the color coefficients $\boldsymbol{c}_{n}$ and the opacity $o_{n}$ are learnable parameters, they can be merged to further simplify Equation 6:

$$\boldsymbol{C}_{i}=\sum_{n \in \mathcal{N}} \boldsymbol{c}_{n}^{\prime} \cdot \exp \left(-\sigma_{n}\right) \qquad (7)$$
Fig. 3: Compression pipeline of our proposed GaussianImage. After overfitting the image, we apply attribute quantization-aware fine-tuning to build an ultra-fast image codec. Partial bits-back coding is used to achieve the best compression performance.
where the weighted color coefficients $\boldsymbol{c}_{n}^{\prime} \in \mathbb{R}^{3}$ are no longer limited to the range $[0,1]$. In this way, instead of the basic 2D Gaussian that requires 4 attributes as in Section 3.1, our upgraded 2D Gaussian is described by only 3 attributes (i.e., position, covariance, and weighted color coefficients) with a total of 8 parameters. This further improves the compression ratio to $7.375\times$ compared with 3D Gaussians under an equivalent number of Gaussian points.
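The following is a minimal, unoptimized PyTorch sketch of the accumulated summation of Equation 7, written as a dense per-Gaussian loop; the actual renderer is a tile-based CUDA kernel, and all names here are illustrative:

```python
import torch

def render_accumulated(mu, Sigma_inv, c_weighted, H, W):
    """Order-free rendering: C_i = sum_n c'_n * exp(-sigma_n), with
    sigma_n = 0.5 * d^T Sigma^{-1} d (Eq. 7). No sorting, no alpha-blending."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).float()            # (H, W, 2) pixel centers
    img = torch.zeros(H, W, 3)
    for n in range(mu.shape[0]):
        d = pix - mu[n]                                    # displacement to Gaussian n
        sigma = 0.5 * torch.einsum("hwi,ij,hwj->hw", d, Sigma_inv[n], d)
        img = img + torch.exp(-sigma)[..., None] * c_weighted[n]
    return img
```

Because the sum is commutative, shuffling the Gaussians leaves the output unchanged, which is exactly the order-robustness argued above.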
3.3 Compression Pipeline
After overfitting the image, we propose a compression pipeline for image compression with GaussianImage. As shown in Fig. 3, our standard compression pipeline is composed of two steps: image overfitting and attribute quantization-aware fine-tuning. To achieve the best compression performance, partial bits-back coding [52, 60] is an optional strategy. Herein, we elucidate the compression process using our GaussianImage based on Cholesky factorization as an example.

Attribute Quantization-aware Fine-tuning. Given a set of 2D Gaussian points fit to an image, we apply distinct quantization strategies to the various attributes. Since the Gaussian location is sensitive to quantization, we adopt 16-bit float precision for the position parameters to preserve reconstruction fidelity. For the Cholesky vector $\boldsymbol{l}_{n}$ of the $n$-th Gaussian, we incorporate a $b$-bit asymmetric quantization technique [11], where both the scaling factor $\gamma_{i}$ and the offset factor $\beta_{i}$ are learned during fine-tuning:

$$\hat{l}_{n}^{i}=\operatorname{clamp}\left(\left\lfloor\frac{l_{n}^{i}-\beta_{i}}{\gamma_{i}}\right\rceil, 0, 2^{b}-1\right), \quad \bar{l}_{n}^{i}=\hat{l}_{n}^{i} \cdot \gamma_{i}+\beta_{i} \qquad (8)$$
where $i \in\{0,1,2\}$ indexes the Cholesky vector elements. Note that we share the same scaling and offset factors across all Gaussians in order to reduce metadata overhead. After fine-tuning, the covariance parameters are encoded with $b$-bit precision, while the scaling and offset values required for re-scaling are stored in 32-bit float precision.
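A sketch of the $b$-bit asymmetric quantization with a shared learned scale $\gamma_{i}$ and offset $\beta_{i}$, using a plain straight-through estimator (an assumed simplification for illustration; LSQ+ [11] also derives gradient rules for the scale and offset themselves):

```python
import torch

def asym_fake_quantize(l, gamma, beta, b=6):
    """Fake-quantize covariance params to b bits during fine-tuning.
    l: (N, 3) Cholesky vectors; gamma, beta: (3,) shared across all Gaussians."""
    levels = 2 ** b - 1
    q = torch.clamp(torch.round((l - beta) / gamma), 0, levels)  # integer code
    l_hat = q * gamma + beta                                     # de-quantized value
    # straight-through: forward uses l_hat, backward treats rounding as identity
    return l + (l_hat - l).detach()
```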
As for the weighted color coefficients, a codebook enables representative color attribute encoding via vector quantization (VQ) [26]. While naively applying vector quantization leads to inferior rendering quality, we employ residual vector quantization (RVQ) [73], which cascades $M$ stages of VQ with codebook size $B$ to mitigate performance degradation:

$$\hat{\boldsymbol{c}}_{n}^{\prime m}=\sum_{k=1}^{m} \mathcal{C}^{k}\left[i_{n}^{k}\right], \quad i_{n}^{k}=\underset{j}{\arg \min }\left\|\boldsymbol{c}_{n}^{\prime}-\hat{\boldsymbol{c}}_{n}^{\prime k-1}-\mathcal{C}^{k}[j]\right\|_{2}^{2}, \quad m=1, \cdots, M \qquad (9)$$
where $\hat{\boldsymbol{c}}_{n}^{\prime m}$ denotes the output color vector after $m$ quantization stages, $\mathcal{C}^{m} \in \mathbb{R}^{B \times 3}$ represents the codebook at stage $m$, $i^{m} \in\{0, \cdots, B-1\}^{N}$ are the codebook indices at stage $m$, and $\mathcal{C}[i] \in \mathbb{R}^{3}$ is the vector at index $i$ of the codebook $\mathcal{C}$. To train the codebooks, we apply the commitment loss $\mathcal{L}_{c}$ as follows:
$$\mathcal{L}_{c}=\frac{1}{N \times B} \sum_{k=1}^{M} \sum_{n=1}^{N}\left\|\operatorname{sg}\left[\boldsymbol{c}_{n}^{\prime}-\hat{\boldsymbol{c}}_{n}^{\prime k-1}\right]-\mathcal{C}^{k}\left[i_{n}^{k}\right]\right\|_{2}^{2} \qquad (10)$$
where $N$ is the number of Gaussians and $\operatorname{sg}[\cdot]$ is the stop-gradient operation.
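A minimal sketch of the $M$-stage residual vector quantization of Equation 9 for the weighted color vectors; encoding only, with codebook training (commitment loss, EMA updates) omitted:

```python
import torch

def rvq_encode(c, codebooks):
    """c: (N, 3) weighted colors; codebooks: list of M tensors, each (B, 3).
    Stage k quantizes the residual left over by stages 1..k-1 (Eq. 9)."""
    residual = c.clone()
    c_hat = torch.zeros_like(c)
    indices = []
    for C in codebooks:                              # M cascaded stages
        idx = torch.cdist(residual, C).argmin(dim=1) # nearest codeword per vector
        chosen = C[idx]                              # (N, 3)
        c_hat = c_hat + chosen
        residual = residual - chosen
        indices.append(idx)                          # these index maps get entropy-coded
    return c_hat, indices
```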
Partial Bits-Back Coding. As we have not adopted any auto-regressive context [47] to encode the 2D Gaussian parameters, any permutation of the 2D Gaussian points can be seen as an equivariant graph without edges. Therefore, we can adopt the bits-back coding [60] for equivariant graphs described by [37] to save bitrate. More specifically, [37] shows that an unordered set with $N$ elements has $N!$ equivalent orderings, and bits-back coding can save a bitrate of
$$\log N!-\log N \qquad (11)$$
compared with directly storing those unordered elements.
However, vanilla bits-back coding requires initial bits [60] of $\log N!$, which means that it can only work on a dataset, not on a single image. To tackle this challenge, [52] introduces a partial bits-back coding strategy that segments the image data, applying vanilla entropy coding to a fraction of the image as the initial bit allocation, with the remainder encoded via bits-back coding.
In our case, we reuse the idea of [52]. Specifically, we encode the initial $K$ Gaussians by vanilla entropy coding, and the subsequent $N-K$ Gaussians by bits-back coding. This segmented approach is applicable to single image compression, contingent upon the bitrate of the initial $K$ Gaussians exceeding the initial bits $\log (N-K)!$. Letting $R_{k}$ denote the bitrate of the $k$-th Gaussian, the final bitrate saving can be formalized as:
$$\log \left(N-K^{*}\right)!-\log \left(N-K^{*}\right), \quad \text{where } K^{*}=\inf K, \ \text{s.t. } \sum_{k=1}^{K} R_{k}-\log \left(N-K^{*}\right)! \geq 0 \qquad (12)$$
Despite its theoretical efficacy, bits-back coding may not align with the objective of developing an ultra-fast codec due to its slow processing latency [37]. Consequently, we leave this part as a preliminary proof of concept of the best rate-distortion performance our codec can achieve, rather than a final result our codec can achieve at 2000 FPS.
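For intuition about the magnitude of Equation 11, a back-of-the-envelope calculation with illustrative numbers (not a result from the paper):

```python
import math

N = 30000                                        # number of Gaussians (illustrative)
saving_nats = math.lgamma(N + 1) - math.log(N)   # log N! - log N  (Eq. 11)
saving_bits = saving_nats / math.log(2)
print(f"{saving_bits:,.0f} bits")                # ~402,900 bits, roughly 49 KB
```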
Table 1: Quantitative comparison with various baselines in PSNR, MS-SSIM, training time, rendering speed, GPU memory usage and parameter size.
(a) Kodak dataset (columns: Methods, PSNR $\uparrow$, MS-SSIM $\uparrow$, Training Time (s) $\downarrow$)
3.4 Training

For image representation, our objective is to minimize the distortion between the original image $x$ and the reconstructed image $\hat{x}$. To this end, we employ the L2 loss function to optimize the Gaussian parameters. It is worth noting that the previous GS method [35] introduces adaptive density control to split and clone Gaussians when optimizing 3D scenes. Since there exist many empty areas in 3D space, it needs to avoid populating these areas. By contrast, there is no so-called empty area in the 2D image space. Therefore, we discard adaptive density control, which greatly simplifies the optimization process of 2D image representation.
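A minimal fitting-loop sketch for the representation task, reusing the `render_accumulated` sketch from Section 3.2; the image size, Gaussian count, initialization, and the Adam optimizer are illustrative stand-ins (the paper optimizes for 50000 steps with Adan [64]):

```python
import torch

H, W, N = 64, 96, 500
x = torch.rand(H, W, 3)                                  # target image (placeholder)
mu = torch.nn.Parameter(torch.rand(N, 2) * torch.tensor([W, H], dtype=torch.float))
chol = torch.nn.Parameter(torch.rand(N, 3) * 4.0 + 1.0)  # Cholesky vector {l1, l2, l3}
color = torch.nn.Parameter(torch.zeros(N, 3))            # weighted color coefficients

opt = torch.optim.Adam([mu, chol, color], lr=1e-3)
for step in range(1000):
    L = torch.zeros(N, 2, 2)
    L[:, 0, 0], L[:, 1, 0], L[:, 1, 1] = chol[:, 0], chol[:, 1], chol[:, 2]
    Sigma_inv = torch.linalg.inv(L @ L.transpose(1, 2))  # Sigma = L L^T (Eq. 1)
    x_hat = render_accumulated(mu, Sigma_inv, color, H, W)
    loss = ((x_hat - x) ** 2).mean()                     # plain L2, no density control
    opt.zero_grad()
    loss.backward()
    opt.step()
```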
As for the image compression task, the overall loss $\mathcal{L}$ consists of the reconstruction loss $\mathcal{L}_{rec}$ and the commitment loss $\mathcal{L}_{c}$:
$$\mathcal{L}=\mathcal{L}_{rec}+\lambda \mathcal{L}_{c} \qquad (13)$$
where $\lambda$ serves as a hyper-parameter balancing the weight of each loss component. The color codebooks are initialized using the K-means algorithm, providing a robust starting point for subsequent optimization. During fine-tuning, we adopt the exponential moving average mode to update the codebooks.
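The exact update rule is not spelled out here, so the following sketches a standard VQ-VAE-style exponential-moving-average codebook update as one assumed concrete form (all names illustrative):

```python
import torch

def ema_codebook_update(codebook, ema_sum, ema_count, assigned, vectors,
                        decay=0.99, eps=1e-5):
    """One assumed EMA step. codebook: (B, 3); ema_sum: (B, 3); ema_count: (B,);
    assigned: (N,) code index per color vector; vectors: (N, 3) color vectors."""
    B = codebook.shape[0]
    one_hot = torch.nn.functional.one_hot(assigned, B).float()      # (N, B)
    ema_count.mul_(decay).add_(one_hot.sum(0), alpha=1 - decay)     # usage counts
    ema_sum.mul_(decay).add_(one_hot.T @ vectors, alpha=1 - decay)  # summed vectors
    codebook.copy_(ema_sum / (ema_count.unsqueeze(1) + eps))        # new centroids
    return codebook
```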
4 Experiments
4.1 Experimental Setup
Dataset. Our evaluation of image representation and compression is conducted on two popular datasets. We use the Kodak dataset [1], which consists of 24 images with a resolution of $768 \times 512$, and the DIV2K validation set [2] with $2\times$ bicubic downscaling, featuring 100 images with dimensions varying from $408 \times 1020$ to $1020 \times 1020$.
Fig. 4: Rate-distortion curves of our approach and different baselines on the Kodak and DIV2K datasets in PSNR and MS-SSIM. BB denotes partial bits-back coding. Bound denotes the theoretical rate of our codec.
Evaluation Metrics. To assess image quality, we employ two established metrics, PSNR and MS-SSIM [62], which measure the distortion between reconstructed images and their originals. The bitrate for image compression is quantified in bits per pixel (bpp).
Implementation Details. Our GaussianImage, developed on top of gsplat [72], incorporates custom CUDA kernels for rasterization based on accumulated blending. We represent the covariance of 2D Gaussians using Cholesky factorization unless otherwise stated. The Gaussian parameters are optimized over 50000 steps using the Adan optimizer [64], starting with an initial learning rate of $1\mathrm{e}{-3}$, halved every 20000 steps. During attribute quantization-aware fine-tuning, the quantization precision $b$ of the covariance parameters is set to 6 bits, with the RVQ codebook size $B$ and the number of quantization stages $M$ fixed at 8 and 2, respectively. The number of iterations of the K-means algorithm is set to 5. Experiments are performed using NVIDIA V100 GPUs and PyTorch, with further details available in the supplementary material.
Benchmarks. For image representation comparisons, GaussianImage is benchmarked against competitive INR methods like SIREN [54], WIRE [53], I-NGP [48], and NeuRBF [18]. As for image compression, baselines span traditional codecs (JPEG [61], JPEG2000 [55]), VAE-based codecs (Ballé17 [5], Ballé18 [6]),
Fig. 5: Subjective comparison of various codecs on Kodak at low bpp.
Table 2: Computational complexity of traditional and learning-based image codecs on the DIV2K dataset at low and high bpp.
and INR-based codecs (COIN [23], COIN++ [22]). We utilize the open-source PyTorch implementation [17] of NeuRBF for I-NGP. These INR methods maintain training steps consistent with GaussianImage. Detailed implementation notes for the baselines are found in the appendix.
4.2 Image Representation
Fig. 1 (left) and Table 1 show the representation performance of various methods on the Kodak and DIV2K datasets under the same training steps. Although MLP-based INR methods (SIREN [54], WIRE [53]) utilize fewer parameters to fit an image, they suffer from enormous training times and hyper-slow rendering speed. Recent feature grid-based INR methods (I-NGP [48], NeuRBF [18]) accelerate training and inference, but they demand substantially more GPU memory compared to GS-based methods. Since the original 3D GS uses the 3D Gaussian as its representation unit, it faces the challenge of a giant parameter count, which decelerates training and restricts inference speed. By choosing the 2D Gaussian as the representation unit, our method secures pronounced advantages in training duration, rendering velocity, and GPU memory usage, while substantially reducing the number of stored parameters yet preserving comparable fitting quality.
4.3 Image Compression
Coding performance. Fig. 4 presents the RD curves of various codecs on the Kodak and DIV2K datasets. Notably, our method achieves compression performance comparable to COIN [23] and COIN++ [22] in PSNR. With the help of partial bits-back coding, our codec can outperform COIN and COIN++. Furthermore, when measured by MS-SSIM, our method surpasses COIN by a large margin. Fig. 5 provides a qualitative comparison between our method, JPEG [61], and COIN, revealing that our method restores image details more effectively and delivers superior reconstruction quality while consuming fewer bits.
Table 3: Ablation study of image representation on the Kodak dataset with 30000 Gaussian points over 50000 training steps. AR means accumulated blending-based rasterization; M indicates merging the color coefficients $\boldsymbol{c}$ and opacity $o$; RS denotes decomposing the covariance matrix into rotation and scaling matrices. The final row in each subclass represents our default solution.
(columns: Methods, PSNR $\uparrow$, MS-SSIM $\uparrow$, Training Time (s) $\downarrow$)
Computational complexity. Table 2 reports the computational complexity of several image codecs on the DIV2K dataset, with learning-based codecs operating on an NVIDIA V100 GPU and traditional codecs running on an Intel Core i9-10920X processor at a base frequency of 3.50 GHz in single-thread mode. Impressively, the decoding speed of our codec reaches around 2000 FPS, outpacing traditional codecs like JPEG, while also providing enhanced compression performance at lower bitrates. This establishes a significant advancement in the field of neural image codecs.
4.4 Ablation Study
Effect of different components. To highlight the contributions of the key components in GaussianImage, we conduct a comprehensive set of ablation studies, as detailed in Table 3. Initially, the original 3D GS [35] method, which employs a combination of L1 and SSIM losses, is adapted to use L2 loss. This modification halves the training time at a minor cost in performance. Then, we replace the 3D Gaussian with the basic 2D Gaussian of Section 3.1, which not only improves the fitting performance and decreases training time by $\frac{1}{3}$, but also doubles the inference speed and reduces the parameter count by $6.5\times$. By simplifying alpha blending to accumulated blending, we eliminate the effects of random 2D Gaussian ordering and bypass the complex calculation of the accumulated transparency $T$, resulting in a significant 0.8 dB improvement in PSNR alongside notable training and inference speed gains. This underscores the efficiency of our proposed accumulated blending approach. Furthermore, by merging the color vector $\boldsymbol{c}$ and opacity $o$ to form our upgraded 2D Gaussian, we observe a $10\%$ reduction in parameter count with a negligible 0.1 dB decrease in PSNR.
Loss function. We evaluate various combinations of L2, L1, and SSIM losses, with findings presented in Table 3. These results confirm that L2 loss is optimally
Table 4: Ablation study of quantization schemes on the Kodak dataset. The first row denotes our final solution and is set as the anchor.
suited for our approach, significantly improving image reconstruction quality while facilitating rapid training.
Factorized form of covariance matrix. As outlined in Section 3.1, we optimize the factorized form of the covariance matrix through decomposition. The findings detailed in Table 3 demonstrate that the various factorized forms possess similar capabilities in representing images, despite the decomposition's inherent non-uniqueness. The appendix provides additional analysis of the compression robustness of different factorized forms.
Quantization strategies. Table 4 investigates the effect of different quantization schemes on image compression. Without the commitment loss $\mathcal{L}_{c}$ (V1), the absence of supervision for the RVQ codebook leads to significant deviations of the codebook vectors from the original vectors, adversely affecting reconstruction quality. Moreover, eliminating RVQ in favor of 6-bit integer quantization for the color parameters (V2) results in a $7.02\%$ increase in bitrate consumption compared with our default solution. This outcome suggests that the color vectors across different Gaussians share similarities, making them more suitable for RVQ. Further exploration into the use of higher-bit quantization (V3) reveals a deterioration in compression efficiency.
5 Conclusion
In this work, we introduce GaussianImage, an innovative paradigm for image representation that leverages 2D Gaussian Splatting. This approach diverges significantly from the commonly utilized implicit neural networks, offering a discrete and explicit representation of images. Compared to 3D Gaussian Splatting, employing 2D Gaussian kernels brings two notable benefits for image representation. Firstly, the computationally intensive alpha blending is simplified to an efficient and permutation-invariant accumulated summation blending. Secondly, the number of parameters required for each Gaussian diminishes drastically from 59 to just 8, marking a substantial reduction in complexity. Consequently, GaussianImage emerges as a highly efficient and compact technique for image coding. Experimental results confirm that this explicit representation strategy enhances training and inference efficiency substantially. Moreover, after adopting vector quantization of the parameters, it delivers rate-distortion performance competitive with methods adopting implicit neural representations. These findings suggest promising avenues for further exploration of non-end-to-end image compression and representation strategies.
Acknowledgments. This work was supported by the National Natural Science Fund of China (Project No. 42201461, 62102024, 62331014) and the General Research Fund (Project No. 16209622) from the Hong Kong Research Grants Council.
References

Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: Dataset and study. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (July 2017)

Ballé, J., Minnen, D., Singh, S., Hwang, S.J., Johnston, N.: Variational image compression with a scale hyperprior. In: International Conference on Learning Representations (2018)

Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P.: Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5855-5864 (2021)

Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5470-5479 (2022)

Bégaint, J., Racapé, F., Feltman, S., Pushparaja, A.: Compressai: a pytorch library and evaluation platform for end-to-end compression research. arXiv preprint arXiv:2011.03029 (2020)

Bhalgat, Y., Lee, J., Nagel, M., Blankevoort, T., Kwak, N.: Lsq+: Improving low-bit quantization through learnable offsets and better initialization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 696-697 (2020)

Chen, G., Wang, W.: A survey on 3d gaussian splatting. arXiv preprint arXiv:2401.03890 (2024)

Chen, H., Gwilliam, M., Lim, S.N., Shrivastava, A.: Hnerv: A hybrid neural representation for videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10270-10279 (2023)

Chen, H., He, B., Wang, H., Ren, Y., Lim, S.N., Shrivastava, A.: Nerv: Neural representations for videos. Advances in Neural Information Processing Systems 34, 21557-21568 (2021)

Chen, Y., Liu, S., Wang, X.: Learning continuous image representation with local implicit image function. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8628-8638 (2021)

Chen, Y., Chen, Z., Zhang, C., Wang, F., Yang, X., Wang, Y., Cai, Z., Yang, L., Liu, H., Lin, G.: Gaussianeditor: Swift and controllable 3d editing with gaussian splatting. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

Chen, Z., Li, Z., Song, L., Chen, L., Yu, J., Yuan, J., Xu, Y.: Neurbf: A neural fields representation with adaptive radial basis functions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4182-4194 (2023)

Chen, Z., Wang, F., Liu, H.: Text-to-3d using gaussian splatting. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

Cheng, Z., Sun, H., Takeuchi, M., Katto, J.: Learned image compression with discretized gaussian mixture likelihoods and attention modules. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7939-7948 (2020)

Duda, J.: Asymmetric numeral systems: entropy coding combining speed of huffman coding with compression rate of arithmetic coding. arXiv: Information Theory (2013), https://api.semanticscholar.org/CorpusID:13409455

Dupont, E., Loya, H., Alizadeh, M., Golinski, A., Teh, Y., Doucet, A.: Coin++: neural compression across modalities. Transactions on Machine Learning Research 2022(11) (2022)

Dupont, E., Golinski, A., Alizadeh, M., Teh, Y.W., Doucet, A.: Coin: Compression with implicit neural representations. In: Neural Compression: From Information Theory to Applications - Workshop @ ICLR 2021 (2021)

Fathony, R., Sahu, A.K., Willmott, D., Kolter, J.Z.: Multiplicative filter networks. In: International Conference on Learning Representations (2020)

Gray, R.: Vector quantization. IEEE ASSP Magazine 1(2), 4-29 (1984)

Guo, Z., Flamich, G., He, J., Chen, Z., Hernández-Lobato, J.M.: Compression with bayesian implicit neural representations. Advances in Neural Information Processing Systems 36 (2024)

He, D., Yang, Z., Peng, W., Ma, R., Qin, H., Wang, Y.: Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5718-5727 (2022)

He, D., Yang, Z., Yu, H., Xu, T., Luo, J., Chen, Y., Gao, C., Shi, X., Qin, H., Wang, Y.: Po-elic: Perception-oriented efficient learned image coding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1764-1769 (2022)

Hu, Y., Yang, S., Yang, W., Duan, L.Y., Liu, J.: Towards coding for human and machine vision: A scalable image coding approach. In: 2020 IEEE International Conference on Multimedia and Expo (ICME). pp. 1-6. IEEE (2020)

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4) (2023)

Koyuncu, A.B., Gao, H., Boev, A., Gaikov, G., Alshina, E., Steinbach, E.: Contextformer: A transformer with spatio-channel attention for context modeling in learned image compression. In: European Conference on Computer Vision. pp. 447-463. Springer (2022)

Kunze, J., Severo, D., Zani, G., van de Meent, J.W., Townsend, J.: Entropy coding of unordered data structures. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=afQuNt3Ruh

Ladune, T., Philippe, P., Henry, F., Clare, G., Leguay, T.: Cool-chic: Coordinate-based low complexity hierarchical image codec. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13515-13522 (2023)

Leguay, T., Ladune, T., Philippe, P., Clare, G., Henry, F., Déforges, O.: Low-complexity overfitted neural image codec. In: 2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP). pp. 1-6. IEEE (2023)

Li, Z., Wang, M., Pi, H., Xu, K., Mei, J., Liu, Y.: E-nerv: Expedite neural video representation with disentangled spatial-temporal context. In: European Conference on Computer Vision. pp. 267-284. Springer (2022)

Liu, J., Sun, H., Katto, J.: Learned image compression with mixed transformer-cnn architectures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14388-14397 (2023)

Martel, J.N., Lindell, D.B., Lin, C.Z., Chan, E.R., Monteiro, M., Wetzstein, G.: Acorn: adaptive coordinate networks for neural scene representation. ACM Transactions on Graphics (TOG) 40(4), 1-13 (2021)

Mentzer, F., Toderici, G.D., Tschannen, M., Agustsson, E.: High-fidelity generative image compression. Advances in Neural Information Processing Systems 33, 11913-11924 (2020)

Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: European Conference on Computer Vision. pp. 405-421. Springer (2020)

Minnen, D., Ballé, J., Toderici, G.D.: Joint autoregressive and hierarchical priors for learned image compression. In: Advances in Neural Information Processing Systems (2018)

Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG) 41(4) (2022)

Nguyen, Q.H., Beksi, W.J.: Single image super-resolution via a dual interactive implicit neural network. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 4936-4945 (2023)

Quan, Y., Yao, X., Ji, H.: Single image defocus deblurring via implicit neural inverse kernels. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12600-12610 (2023)

Ramasinghe, S., Lucey, S.: Beyond periodicity: Towards a unifying framework for activations in coordinate-mlps. In: European Conference on Computer Vision. pp. 142-158. Springer (2022)

Ryder, T., Zhang, C., Kang, N., Zhang, S.: Split hierarchical variational compression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 386-395 (2022)
Saragadam, V., LeJeune, D., Tan, J., Balakrishnan, G., Veeraraghavan, A., Baraniuk, R.G.: Wire: Wavelet implicit neural representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1850718516 (2023) 2, 4, 10, 11, 12, 23 Saragadam, V., LeJeune, D., Tan, J., Balakrishnan, G., Veeraraghavan, A., Baraniuk, R.G.: Wire:小波隐式神经表征。In:pp. 1850718516 (2023) 2, 4, 10, 11, 12, 23
Sitzmann, V., Martel, J., Bergman, A., Lindell, D., Wetzstein, G.: Implicit neural representations with periodic activation functions. Advances in neural information processing systems 33, 7462-7473 (2020) 2, 4, 10, 11, 12, 23 Sitzmann, V.、Martel, J.、Bergman, A.、Lindell, D.、Wetzstein, G.:具有周期性激活函数的隐式神经表征。神经信息处理系统进展 33, 7462-7473 (2020) 2, 4, 10, 11, 12, 23
Skodras, A., Christopoulos, C., Ebrahimi, T.: The jpeg 2000 still image compression standard. IEEE Signal processing magazine 18(5), 36-58 (2001) 5, 11, 12 Skodras, A., Christopoulos, C., Ebrahimi, T.: jpeg 2000 静态图像压缩标准。电气和电子工程师学会信号处理杂志 18(5),36-58(2001) 5、11、12
Stanley, K.O.: Compositional pattern producing networks: A novel abstraction of development. Genetic programming and evolvable machines 8, 131-162 (2007) 2 Stanley, K.O.:构图模式生成网络:发展的新抽象。Genetic programming and evolvable machines 8, 131-162 (2007) 2
Strümpler, Y., Postels, J., Yang, R., Gool, L.V., Tombari, F.: Implicit neural representations for image compression. In: European Conference on Computer Vision. pp. 74-91. Springer (2022) 2, 5 Strümpler,Y.、Postels,J.、Yang,R.、Gool,L.V.、Tombari,F.:用于图像压缩的隐式神经表征。In:pp.Springer (2022) 2, 5
Takikawa, T., Litalien, J., Yin, K., Kreis, K., Loop, C., Nowrouzezahrai, D., Jacobson, A., McGuire, M., Fidler, S.: Neural geometric level of detail: Real-time rendering with implicit 3d shapes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11358-11367 (2021) 2, 4 Takikawa,T.、Litalien,J.、Yin,K.、Kreis,K.、Loop,C.、Nowrouzezahrai,D.、Jacobson,A.、McGuire,M.、Fidler,S.:神经几何细节水平:使用隐式三维形状进行实时渲染。In:第 11358-11367 页 (2021) 2, 4
Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., Ng, R.: Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems 33, 7537-7547 (2020) 2, 4 Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., Ng, R.: Fourier features let networks learn high frequency functions in low dimensional domains.神经信息处理系统进展 33, 7537-7547 (2020) 2, 4
Wallace, G.K.: The jpeg still picture compression standard. Communications of the ACM 34(4), 30-44 (1991) 5, 11, 12 Wallace, G.K.: The jpeg still picture compression standard.ACM 通信》34(4),30-44(1991 年) 5、11、12
Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003. vol. 2, pp. 1398-1402. Ieee (2003) 11 Wang,Z.、Simoncelli,E.P.、Bovik,A.C.:用于图像质量评估的多尺度结构相似性。In:第 2 卷,第 1398-1402 页。IEEE (2003) 11
Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528 (2023) 5 Wu,G.、Yi,T.、Fang,J.、Xie,L.、Zhang,X.、Wei,W.、Liu,W.、Tian,Q.、Wang,X.:用于实时动态场景渲染的 4d 高斯拼接。
Xu, D., Wang, P., Jiang, Y., Fan, Z., Wang, Z.: Signal processing for implicit neural representations. Advances in Neural Information Processing Systems 35, 13404-13418 (2022) 2 Xu, D., Wang, P., Jiang, Y., Fan, Z., Wang, Z.:隐式神经表征的信号处理神经信息处理系统进展 35, 13404-13418 (2022) 2
Xu, Q., Xu, Z., Philip, J., Bi, S., Shu, Z., Sunkavalli, K., Neumann, U.: Point-nerf: Point-based neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5438-5448 (2022) 4 Xu,Q.、Xu,Z.、Philip,J.、Bi,S.、Shu,Z.、Sunkavalli,K.、Neumann,U. Point-nerf:基于点的神经辐射场:Point-nerf:基于点的神经辐射场。In:pp.
Xu, W., Jiao, J.: Revisiting implicit neural representations in low-level vision. In: International Conference on Learning Representations Workshop (2023) 2 Xu, W., Jiao, J.: Revisiting implicit neural representations in low-level vision.In:国际学习表征研讨会(2023)2
Yan, C., Qu, D., Wang, D., Xu, D., Wang, Z., Zhao, B., Li, X.: Gs-slam: Dense visual slam with 3d gaussian splatting. arXiv preprint arXiv:2311.11700 (2023) 4 Yan, C., Qu, D., Wang, D., Xu, D., Wang, Z., Zhao, B., Li, X.:Gs-slam:ArXiv preprint arXiv:2311.11700 (2023) 4
Yan, Y., Lin, H., Zhou, C., Wang, W., Sun, H., Zhan, K., Lang, X., Zhou, X., Peng, S.: Street gaussians for modeling dynamic urban scenes. arXiv preprint arXiv:2401.01339 (2024) 5 Yan,Y.、Lin,H.、Zhou,C.、Wang,W.、Sun,H.、Zhan,K.、Lang,X.、Zhou,X.、Peng,S.:用于动态城市场景建模的街道高斯模型.arXiv 预印本 arXiv:2401.01339 (2024) 5
Yang, Z., Yang, H., Pan, Z., Zhu, X., Zhang, L.: Real-time photorealistic dynamic scene representation and rendering with 4 d gaussian splatting. arXiv preprint arXiv:2310.10642 (2023) 5
Ye, V., Kanazawa, A.: Mathematical supplement for the gsplat library. arXiv preprint arXiv:2312.02121 (2023) 20
Ye, V., Turkulainen, M., the Nerfstudio team: gsplat, https://github.com/ nerfstudio-project/gsplat 11
Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., Tagliasacchi, M.: Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, 495-507 (2021) 3, 9 Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., Tagliasacchi, M.: Soundstream:端到端神经音频编解码器。IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, 495-507 (2021) 3, 9
Zhang, X., Yang, R., He, D., Ge, X., Xu, T., Wang, Y., Qin, H., Zhang, J.: Boosting neural representations for videos with a conditional decoder. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 4 Zhang,X.、Yang,R.、He,D.、Ge,X.、Xu,T.、Wang,Y.、Qin,H.、Zhang,J.:用条件解码器提升视频的神经表征。In:IEEE/CVF 计算机视觉与模式识别会议论文集(2024 年) 4
Zielonka, W., Bagautdinov, T., Saito, S., Zollhöfer, M., Thies, J., Romero, J.: Drivable 3d gaussian avatars. arXiv preprint arXiv:2311.08581 (2023) 5
A Details of Gradient Computation
In this section, we delineate the process of computing the gradients of a scalar loss function with respect to the input Gaussian parameters. Beginning with the gradient of the scalar loss $\mathcal{L}$ with respect to each pixel of the output image, we employ the standard chain rule to propagate the gradients backward toward the original input parameters.
A.1 Gradients of Accumulated Rasterization
The initial step involves back-propagating the loss gradients from a given pixel $i$ to every Gaussian that contributed to that pixel. For a Gaussian $n$ impacting pixel $i$, we aim to calculate the gradients with respect to its color $\boldsymbol{c}' \in \mathbb{R}^{3}$, its 2D mean $\boldsymbol{\mu} \in \mathbb{R}^{2}$, and its 2D covariance $\boldsymbol{\Sigma} \in \mathbb{R}^{2 \times 2}$; throughout, $\boldsymbol{d} \in \mathbb{R}^{2}$ denotes the displacement between the pixel center and the 2D Gaussian center.
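Since the display equations for these gradients did not survive extraction, a minimal NumPy sketch of the chain rule is given below, assuming the accumulated-summation renderer $C_i = \sum_n \boldsymbol{c}'_n \alpha_n$ with $\alpha_n = \exp(-\frac{1}{2}\boldsymbol{d}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{d})$ described in the main paper; the function and variable names are illustrative, not the project's actual API.

```python
import numpy as np

def backward_single_gaussian(dL_dC, c, mu, Sigma, pixel):
    """Chain-rule gradients of a scalar loss L w.r.t. one 2D Gaussian's
    color c (3,), mean mu (2,) and covariance Sigma (2,2), given the
    upstream gradient dL_dC (3,) of L w.r.t. the rendered pixel color.

    Assumes the accumulated summation C_i += alpha * c with
    alpha = exp(-0.5 * d^T Sigma^{-1} d) and d = pixel - mu.
    """
    d = pixel - mu                                # displacement (2,)
    Sigma_inv = np.linalg.inv(Sigma)
    alpha = np.exp(-0.5 * d @ Sigma_inv @ d)      # Gaussian weight

    dL_dc = alpha * dL_dC                         # since dC/dc = alpha * I
    dL_dalpha = float(dL_dC @ c)                  # since dC/dalpha = c
    # d(alpha)/d(d) = -alpha * Sigma^{-1} d and d(d)/d(mu) = -I,
    # so the two minus signs cancel:
    dL_dmu = dL_dalpha * alpha * (Sigma_inv @ d)
    # d(alpha)/d(Sigma^{-1}) = -0.5 * alpha * d d^T, then pull back
    # through the inverse: dL/dSigma = -Sigma^{-1} (dL/dSigma^{-1}) Sigma^{-1}
    dL_dSigma_inv = -0.5 * dL_dalpha * alpha * np.outer(d, d)
    dL_dSigma = -Sigma_inv @ dL_dSigma_inv @ Sigma_inv
    return dL_dc, dL_dmu, dL_dSigma
```

These expressions can be checked against automatic differentiation on random inputs; because the accumulation over Gaussians is a plain sum with no depth ordering, per-Gaussian gradients simply add up across the pixels each Gaussian covers.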
B Details of Partial Bits-Back Coding
In Fig. 4 of Section 4.2, we show the results of our codec ("Ours") along with two variants that use bits-back coding ("Ours+BB" and "Ours-Bound"). "Ours" is the original GaussianImage codec without any bits-back coding; it is the practical codec that achieves 2000 FPS. "Ours+BB" is the partial bits-back coding codec described in Section 3.3. Compared with the original GaussianImage codec, it reduces the bitrate by

$$\log (N-K)!-\log (N-K)$$
Algorithm 1 Partial Bits-Back Coding: Encode

Input: the 2D Gaussian parameters $G[1:N]$
procedure Partial-Bits-Back-Encode($G[1:N]$)
    $m \leftarrow \emptyset$
    for $K = 1$ to $N$ do
        $m \leftarrow$ ans-encode$(m, G[K])$  {Rate $+R_{K}$}
        if len$(m) \geq \log (N-K)!$ then
            break
        end if
    end for
    $m, G[K+1:N] \leftarrow$ ans-decode$(m, \mathcal{U}_{(N-K)!}, G[K+1:N])$  {Rate $-\log (N-K)!$}
    $m \leftarrow$ ans-encode$(m, G[K+1:N])$  {Rate $+\sum_{i=K+1}^{N} R_{i}$}
    $m \leftarrow$ ans-encode$(m, N-K)$  {Rate $+\log (N-K)$}
    return $m$
end procedure
Here $K$ is selected as the smallest index such that the cumulative bitrate of the first $K$ 2D Gaussians is at least $\log (N-K)!$:

$$K^{*}=\inf K, \text{ s.t. } \sum_{k=1}^{K} R_{k} \geq \log (N-K)!$$
This rate reduction was introduced in [37] and can be implemented with a first-in-last-out entropy coder, namely Asymmetric Numeral Systems (ANS) [21]. The encoding procedure contains a decoding sub-procedure, and the decoding procedure contains an encoding sub-procedure [60]. The general process of partial bits-back coding is described in Algorithms 1 and 2, where $\mathcal{U}_{d}$ denotes the uniform distribution over $d$ elements.
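To make the rate arithmetic concrete, the sketch below (our illustration, with hypothetical per-Gaussian rates $R_k$ measured in bits and logarithms taken base 2) selects $K^{*}$ by the rule above and reports the resulting saving next to the unattainable dataset-level bound discussed next:

```python
import math

def log2_factorial(n: int) -> float:
    """log2(n!) computed stably via the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)

def partial_bits_back_saving(rates):
    """Pick the smallest K with sum_{k<=K} R_k >= log2((N-K)!) and
    return (K, net saving in bits) of partial bits-back coding,
    i.e. log2((N-K)!) - log2(N-K)."""
    N = len(rates)
    cumulative = 0.0
    for K in range(1, N + 1):
        cumulative += rates[K - 1]
        if cumulative >= log2_factorial(N - K):
            remaining = N - K
            saving = (log2_factorial(remaining) - math.log2(remaining)
                      if remaining > 1 else 0.0)
            return K, saving
    return N, 0.0

# Illustrative numbers only: 10000 Gaussians at 16 bits each.
N, per_gaussian_bits = 10000, 16.0
K, saving = partial_bits_back_saving([per_gaussian_bits] * N)
bound = log2_factorial(N) - math.log2(N)   # "Ours-Bound", never attained
print(f"K* = {K}, saving = {saving:.0f} bits, bound = {bound:.0f} bits")
```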
When applied to a whole dataset, partial bits-back coding becomes unnecessary. More specifically, we no longer need to encode the first $K$ Gaussians to obtain initial bits; instead, we can use the previous image as the initial bits and simply follow the vanilla bits-back coding of [60]. For a dataset with infinitely many images, the average rate reduction is

$$\log N!-\log N$$

However, this rate reduction is never attained, since no dataset is infinite. It is nevertheless the greatest lower bound: any rate above it is achievable, while the bound itself is not. We therefore name this variant "Ours-Bound".
C Experiments
C.1 Implementation Details
GaussianImage. During the formation of each 2D Gaussian, we apply the tanh function to limit the range of the position parameters to $(-1,1)$. For the covariance parameters, we add 0.5 to the diagonal elements $l_{1}, l_{3}$ of the lower triangular matrix $\boldsymbol{L}$ in the Cholesky factorization, or to the scaling elements $s_{1}, s_{2}$ in the rotation-scaling factorization. This adjustment prevents the scaling of the covariance from becoming excessively small. In addition, the covariance parameters and weighted color coefficients are initialized from a uniform distribution. The position parameters are likewise initialized using $\operatorname{rand}(n)$, which generates $n$ random numbers from a uniform distribution.
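A minimal PyTorch-style sketch of this parameterization is shown below; the class and attribute names are our own, and the inverse-tanh position initialization is an assumption chosen so that $\tanh$ of the raw parameters starts out uniform in $(-1,1)$, since the exact initialization formula appeared only as display math in the original.

```python
import torch

class Gaussian2D(torch.nn.Module):
    """Illustrative parameterization following the description above:
    tanh bounds positions to (-1, 1); 0.5 is added to the Cholesky
    diagonal so the covariance scaling cannot become excessively small."""

    def __init__(self, num_points: int):
        super().__init__()
        # Assumed init: atanh of a uniform draw, so positions() starts
        # uniform in (-1, 1); clamped to avoid infinities at the ends.
        u = (2 * torch.rand(num_points, 2) - 1).clamp(-0.999, 0.999)
        self._xy = torch.nn.Parameter(torch.atanh(u))
        self._cholesky = torch.nn.Parameter(torch.rand(num_points, 3))  # l1, l2, l3
        self._color = torch.nn.Parameter(torch.rand(num_points, 3))

    def positions(self) -> torch.Tensor:
        return torch.tanh(self._xy)               # bounded to (-1, 1)

    def covariance(self) -> torch.Tensor:
        l1 = self._cholesky[:, 0] + 0.5           # diagonal offset keeps
        l3 = self._cholesky[:, 2] + 0.5           # the scaling away from 0
        l2 = self._cholesky[:, 1]
        zeros = torch.zeros_like(l1)
        L = torch.stack([torch.stack([l1, zeros], dim=-1),
                         torch.stack([l2, l3], dim=-1)], dim=-2)
        return L @ L.transpose(-1, -2)            # Sigma = L L^T
```

The rotation-scaling variant would apply the same $+0.5$ offset to $s_{1}, s_{2}$ before assembling $\boldsymbol{\Sigma}=\boldsymbol{R}\boldsymbol{S}\boldsymbol{S}^{\top}\boldsymbol{R}^{\top}$.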
Baselines. For SIREN [54] and WIRE [53], we implement them using the open-source project$^{1}$ from WIRE. For I-NGP [48] and NeuRBF [18], we adopt the project$^{2}$ from NeuRBF. As for compression baselines, the implementation of COIN [23] uses their library$^{3}$. We evaluate the VAE-based codecs (Ballé17 [5], Ballé18 [6]) using the MSE-optimized models provided by CompressAI [9]. It is worth noting that during inference of the INR methods, we sample all image coordinates at once to output the corresponding RGB values; this is the maximum inference speed the INR methods can achieve when GPU memory resources are sufficient.
C.2 Image Representation and Compression
As shown in Fig. 6, we provide performance comparisons of image representation and compression on the DIV2K and Kodak datasets, respectively.
C.3 Ablation Study
Number of Gaussians. As shown in Fig. 7, our proposed methods improve the quality of the fitted image as the number of Gaussians increases.
Effect of the additive operation. The additive operation can be seen as convolving the covariance matrix with a continuous Gaussian, which helps anti-aliasing and thereby effectively improves the fitting performance, as illustrated in Table 5.
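The anti-aliasing reading follows from the standard identity that convolving two zero-mean Gaussians adds their covariances; in our notation (with $s$ denoting the added constant, our symbol rather than one from the paper):

$$\mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}) * \mathcal{N}(\boldsymbol{0}, s\boldsymbol{I}) = \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma} + s\boldsymbol{I})$$

so adding a constant to the covariance acts as an isotropic Gaussian low-pass filter applied to each primitive before it is sampled at discrete pixel centers, which suppresses aliasing for very small Gaussians.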
Fig. 6: Image representation (left) and compression (right) results with different decoding time on the DIV2K and Kodak dataset, respectively. The radius of each point indicates the parameter size (left) or bits per pixel (right). Our method enjoys the fastest decoding speed regardless of parameter size or bpp.
Fig. 7: Image representation with different numbers of Gaussians.
Robustness in different factorized forms. Fig. 8 highlights the application of identical quantization approaches to the various factorized forms. Notably, the RS codec variant underperforms the Cholesky codec variant, suggesting that rotation and scaling parameters are particularly susceptible to compression distortion and require carefully tailored quantization strategies to achieve efficient compression.
D Discussion
In this paper, we simply apply existing compression techniques to build our image codec. As depicted in Fig. 4 of the main paper, a considerable gap remains between our codec and existing traditional/VAE-based codecs in compression performance. This gap indicates an imperative need to develop specialized compression algorithms tailored for Gaussian-based codecs
Table 5: Ablation study of the additive operation on the Kodak and DIV2K datasets with 30000 Gaussian points over 50000 training steps.
Fig. 8: Image compression results with different factorized forms on the Kodak and DIV2K datasets, respectively.
to elevate the performance. Moreover, as shown in Table 2 of the main paper, although our encoding speed has improved by three orders of magnitude over COIN [23], a gap of four orders of magnitude remains compared with VAE-based codecs [5,6]. Therefore, exploring avenues for more rapid image fitting and Gaussian compression emerges as a critical research direction.
Considering that our GaussianImage provides an explicit representation and coding of images, we will further investigate this line of work from various aspects in the future. First, recent literature successfully performs segmentation-based text-guided editing on 3D scenes represented by Gaussians [16,24], since this discrete representation naturally provides a semantic layout. Intuitively, a similar property exists in 2D Gaussian images, which opens the potential for few-shot text-guided editing on them. Second, image coding for machines [32] is a popular topic in the learned image coding community; this explicit representation is also likely to benefit downstream vision tasks such as classification and detection. Finally, high-fidelity image representation [29,45] is also an essential task to delve into.
$^{\star}$ Equal Contribution. $^{+}$ This work was done when Xinjie Zhang and Xingtong Ge interned at SenseTime Research. $^{\dagger}$ Corresponding Authors.