
VGGT: Visual Geometry Grounded Transformer

Figure 1. VGGT is a large feed-forward transformer with minimal 3D-inductive biases trained on a trove of 3D-annotated data. It accepts up to hundreds of images and predicts cameras, point maps, depth maps, and point tracks for all images at once in less than a second, which often outperforms optimization-based alternatives without further processing.
Abstract
We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Code and models are publicly available at https://github.com/facebookresearch/vggt.

\section*{1. Introduction}

We consider the problem of estimating the 3D attributes of a scene, captured in a set of images, utilizing a feed-forward neural network. Traditionally, 3D reconstruction has been approached with visual-geometry methods, utilizing iterative optimization techniques like Bundle Adjustment (BA) [45]. Machine learning has often played an important complementary role, addressing tasks that cannot be solved by geometry alone, such as feature matching and monocular depth prediction. The integration has become increasingly tight, and now state-of-the-art Structure-from-Motion (SfM) methods like VGGSfM [125] combine machine learning and visual geometry end-to-end via differentiable BA. Even so, visual geometry still plays a major role in 3D reconstruction, which increases complexity and computational cost.
As networks become ever more powerful, we ask if, finally, 3D tasks can be solved directly by a neural network, eschewing geometry post-processing almost entirely. Recent contributions like DUSt3R [129] and its evolution MASt3R [62] have shown promising results in this direction, but these networks can only process two images at once and rely on post-processing to reconstruct more images, fusing pairwise reconstructions.
In this paper, we take a further step towards removing the need to optimize 3D geometry in post-processing. We do so by introducing Visual Geometry Grounded Transformer (VGGT), a feed-forward neural network that performs 3D reconstruction from one, a few, or even hundreds of input views of a scene. VGGT predicts a full set of 3D attributes, including camera parameters, depth maps, point maps, and 3D point tracks. It does so in a single forward pass, in seconds. Remarkably, it often outperforms optimization-based alternatives even without further processing. This is a substantial departure from DUSt3R, MASt3R, or VGGSfM, which still require costly iterative post-optimization to obtain usable results.
We also show that it is unnecessary to design a special network for 3D reconstruction. Instead, VGGT is based on a fairly standard large transformer [119], with no particular 3D or other inductive biases (except for alternating between frame-wise and global attention), but trained on a large number of publicly available datasets with 3D annotations. VGGT is thus built in the same mold as large models for natural language processing and computer vision, such as GPTs [1, 29, 148], CLIP [86], DINO [10, 78], and Stable Diffusion [34]. These have emerged as versatile backbones that can be fine-tuned to solve new, specific tasks. Similarly, we show that the features computed by VGGT can significantly enhance downstream tasks like point tracking in dynamic videos, and novel view synthesis.
There are several recent examples of large 3D neural networks, including DepthAnything [142], MoGe [128], and LRM [49]. However, these models only focus on a single 3D task, such as monocular depth estimation or novel view synthesis. In contrast, VGGT uses a shared backbone to predict all 3D quantities of interest together. We demonstrate that learning to predict these interrelated 3D attributes enhances overall accuracy despite potential redundancies. At the same time, we show that, during inference, we can derive the point maps from separately predicted depth and camera parameters, obtaining better accuracy compared to directly using the dedicated point map head.
To summarize, we make the following contributions: (1) We introduce VGGT, a large feed-forward transformer that, given one, a few, or even hundreds of images of a scene, can predict all its key 3D attributes, including camera intrinsics and extrinsics, point maps, depth maps, and 3D point tracks, in seconds. (2) We demonstrate that VGGT's predictions are directly usable, being highly competitive and usually better than those of state-of-the-art methods that use slow post-processing optimization techniques. (3) We also show that, when further combined with BA post-processing, VGGT achieves state-of-the-art results across the board, even when compared to methods that specialize in a subset of 3D tasks, often improving quality substantially.
We make our code and models publicly available at https://github.com/facebookresearch/vggt. We believe that this will facilitate further research in this direction and benefit the computer vision community by providing a new foundation for fast, reliable, and versatile 3D reconstruction.
\section*{2. Related Work}

Structure from Motion is a classic computer vision problem [45, 77, 80] that involves estimating camera parameters and reconstructing sparse point clouds from a set of images of a static scene captured from different viewpoints. The traditional SfM pipeline [2, 36, 70, 94, 103, 134] consists of multiple stages, including image matching, triangulation, and bundle adjustment. COLMAP [94] is the most popular framework based on the traditional pipeline. In recent years, deep learning has improved many components of the SfM pipeline, with keypoint detection [21, 31, 116, 149] and image matching [11, 67, 92, 99] being two primary areas of focus. Recent methods [5, 102, 109, 112, 113, 118, 122, 125, 131, 160] explored end-to-end differentiable SfM, where VGGSfM [125] started to outperform traditional algorithms on challenging phototourism scenarios.
Multi-view Stereo aims to densely reconstruct the geometry of a scene from multiple overlapping images, typically assuming known camera parameters, which are often estimated with SfM. MVS methods can be divided into three categories: traditional handcrafted [38, 39, 96, 130], global optimization [37, 74, 133, 147], and learning-based methods [42, 72, 84, 145, 157]. As in SfM, learning-based MVS approaches have recently seen a lot of progress. Here, DUSt3R [129] and MASt3R [62] directly estimate aligned dense point clouds from a pair of views, similar to MVS but without requiring camera parameters. Some concurrent works [111, 127, 141, 156] explore replacing DUSt3R’s test-time optimization with neural networks, though these attempts achieve only suboptimal or comparable performance to DUSt3R. Instead, VGGT outperforms DUSt3R and MASt3R by a large margin.

Tracking-Any-Point was first introduced in Particle Video [91] and revived by PIPs [44] during the deep learning era, aiming to track points of interest across video sequences including dynamic motions. Given a video and some 2D query points, the task is to predict 2D correspondences of these points in all other frames. TAP-Vid [23] proposed three benchmarks for this task and a simple baseline method later improved in TAPIR [24]. CoTracker [55, 56] utilized correlations between different points to track through occlusions, while DOT [60] enabled dense tracking through occlusions. Recently, TAPTR [63] proposed an end-to-end transformer for this task, and LocoTrack [13] extended commonly used pointwise features to nearby regions. All of these methods are specialized point trackers. Here, we demonstrate that VGGT's features yield state-of-the-art tracking performance when coupled with existing point trackers.

Figure 2. Architecture Overview. Our model first patchifies the input images into tokens by DINO, and appends camera tokens for camera prediction. It then alternates between frame-wise and global self attention layers. A camera head makes the final prediction for camera extrinsics and intrinsics, and a DPT [87] head for any dense output.
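The alternation of frame-wise and global self-attention described in the Figure 2 caption can be sketched as follows. This is our own minimal illustration, not the released implementation: the class names, the block depth, and the use of a plain pre-norm transformer block are assumptions made for clarity.

```python
import torch
import torch.nn as nn


class SelfAttentionBlock(nn.Module):
    """A standard pre-norm transformer block (hypothetical stand-in)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (batch, length, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class AlternatingAttention(nn.Module):
    """Interleave frame-wise (within-image) and global (cross-image) attention."""

    def __init__(self, dim, depth=4):
        super().__init__()
        self.frame_blocks = nn.ModuleList([SelfAttentionBlock(dim) for _ in range(depth)])
        self.global_blocks = nn.ModuleList([SelfAttentionBlock(dim) for _ in range(depth)])

    def forward(self, tokens):  # tokens: (B, N, P, C) = (batch, frames, tokens per frame, dim)
        B, N, P, C = tokens.shape
        for frame_blk, global_blk in zip(self.frame_blocks, self.global_blocks):
            # Frame-wise attention: each image attends only to its own tokens.
            tokens = frame_blk(tokens.reshape(B * N, P, C)).reshape(B, N, P, C)
            # Global attention: all tokens of all images attend to each other.
            tokens = global_blk(tokens.reshape(B, N * P, C)).reshape(B, N, P, C)
        return tokens
```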


\section*{3. Method}
We introduce VGGT, a large transformer that ingests a set of images as input and produces a variety of 3D quantities as output. We start by introducing the problem in Sec. 3.1, followed by our architecture in Sec. 3.2 and its prediction heads in Sec. 3.3, and finally the training setup in Sec. 3.4.

3.1. Problem definition and notation

The input is a sequence $(I_i)_{i=1}^{N}$ of $N$ RGB images $I_i \in \mathbb{R}^{3 \times H \times W}$, observing the same 3D scene. VGGT's transformer is a function that maps this sequence to a corresponding set of 3D annotations, one per frame:
$$
f\left((I_i)_{i=1}^{N}\right) = \left(\mathbf{g}_i, D_i, P_i, T_i\right)_{i=1}^{N} .
$$
The transformer thus maps each image $I_i$ to its camera parameters $\mathbf{g}_i \in \mathbb{R}^{9}$ (intrinsics and extrinsics), its depth map $D_i \in \mathbb{R}^{H \times W}$, its point map $P_i \in \mathbb{R}^{3 \times H \times W}$, and a grid $T_i \in \mathbb{R}^{C \times H \times W}$ of $C$-dimensional features for point tracking. We explain next how these are defined.
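For concreteness, the shapes implied by this mapping can be illustrated with a placeholder sketch; the function name, the dictionary keys, and the tracking-feature dimension $C = 128$ are our own assumptions, not the released API.

```python
import torch


def vggt_output_shapes(images: torch.Tensor) -> dict:
    """Illustrative output shapes for f((I_i)_{i=1}^N); the values are placeholders."""
    N, _, H, W = images.shape              # images: (N, 3, H, W), one scene
    C = 128                                # hypothetical tracking-feature dimension
    return {
        "camera": torch.zeros(N, 9),       # g_i = [q, t, f] per frame
        "depth":  torch.zeros(N, H, W),    # D_i
        "points": torch.zeros(N, 3, H, W), # P_i, expressed in the frame of camera 1
        "tracks": torch.zeros(N, C, H, W), # T_i, dense tracking features
    }
```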
For the camera parameters $\mathbf{g}_i$, we use the parametrization from [125] and set $\mathbf{g} = [\mathbf{q}, \mathbf{t}, \mathbf{f}]$, which is the concatenation of the rotation quaternion $\mathbf{q} \in \mathbb{R}^{4}$, the translation vector $\mathbf{t} \in \mathbb{R}^{3}$, and the field of view $\mathbf{f} \in \mathbb{R}^{2}$. We assume that the camera's principal point is at the image center, which is common in SfM frameworks [95, 125].
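As an illustration of this parametrization, the 9-D vector $\mathbf{g} = [\mathbf{q}, \mathbf{t}, \mathbf{f}]$ can be expanded into a conventional rotation matrix, translation, and intrinsics matrix. The sketch below is our own; it assumes a $(w, x, y, z)$ quaternion order and fields of view given in radians, conventions the text does not spell out here.

```python
import numpy as np


def camera_from_g(g, H, W):
    """Expand g = [q, t, f] (9-D) into (R, t, K) under the assumptions above."""
    q, t, f = g[:4], g[4:7], g[7:9]
    w, x, y, z = q / np.linalg.norm(q)     # unit quaternion, assumed (w, x, y, z)
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
    fy = 0.5 * H / np.tan(0.5 * f[0])      # focal lengths in pixels from the FOVs
    fx = 0.5 * W / np.tan(0.5 * f[1])
    K = np.array([[fx, 0.0, W / 2.0],      # principal point at the image centre
                  [0.0, fy, H / 2.0],
                  [0.0, 0.0, 1.0]])
    return R, t, K
```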
We denote the domain of the image $I_i$ with $\mathcal{I}(I_i) = \{1, \ldots, H\} \times \{1, \ldots, W\}$, i.e., the set of pixel locations. The depth map $D_i$ associates each pixel location $\mathbf{y} \in \mathcal{I}(I_i)$ with its corresponding depth value $D_i(\mathbf{y}) \in \mathbb{R}^{+}$, as observed from the $i$-th camera. Likewise, the point map $P_i$ associates each pixel with its corresponding 3D scene point $P_i(\mathbf{y}) \in \mathbb{R}^{3}$. Importantly, like in DUSt3R [129], the point maps are viewpoint invariant, meaning that the 3D points $P_i(\mathbf{y})$ are defined in the coordinate system of the first camera $\mathbf{g}_1$, which we take as the world reference frame.
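To make the relation between depth maps, cameras, and point maps explicit: a depth map can be unprojected through its own camera and re-expressed in the first camera's coordinate system. The sketch below is our own illustration and assumes the extrinsics convention $X_{\text{cam}} = R\, X_{\text{world}} + \mathbf{t}$, with the first camera's frame taken as the world frame.

```python
import numpy as np


def point_map_from_depth(D, K, R, t):
    """Unproject a depth map D (H, W) into the first-camera/world frame.

    Assumes (our convention) that X_cam = R @ X_world + t, i.e. (R, t) map
    world coordinates (the first camera's frame) into this camera's frame.
    """
    H, W = D.shape
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)  # pixel centres
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                             # camera-frame viewing rays
    X_cam = rays * D[..., None]                                 # scale rays by depth
    X_world = (X_cam - t) @ R                                   # R^T (X_cam - t)
    return X_world                                              # (H, W, 3) point map
```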
Finally, for keypoint tracking, we follow track-any-point methods such as [25, 57]. Namely, given a fixed query image point $\mathbf{y}_q$ in the query image $I_q$, the network outputs a track $\mathcal{T}^{\star}(\mathbf{y}_q) = (\mathbf{y}_i)_{i=1}^{N}$ formed by the corresponding 2D points $\mathbf{y}_i \in \mathbb{R}^{2}$ in all images $I_i$.
Note that the transformer $f$ above does not output the tracks directly but instead features $T_i \in \mathbb{R}^{C \times H \times W}$, which are used for tracking. The tracking is delegated to a separate module, described in Sec. 3.3, which implements a function $\mathcal{T}\left((\mathbf{y}_j)_{j=1}^{M}, (T_i)_{i=1}^{N}\right) = \left((\hat{\mathbf{y}}_{j,i})_{i=1}^{N}\right)_{j=1}^{M}$. It ingests the query point $\mathbf{y}_q$ and the dense tracking features $T_i$ output by the transformer $f$, and then computes the track. The two networks $f$ and $\mathcal{T}$ are trained jointly end-to-end.
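The tracking module $\mathcal{T}$ itself is described in Sec. 3.3. As a rough illustration of the general track-any-point recipe it builds on, one can bilinearly sample the query point's feature and correlate it against every frame's dense feature map, taking a soft-argmax of the correlation as the predicted location. This is a hypothetical, simplified sketch, not the actual VGGT tracking head.

```python
import torch
import torch.nn.functional as F


def naive_track(query_xy, q_idx, feats):
    """Correlation-based point tracking sketch (not the actual VGGT head).

    query_xy: (x, y) pixel coordinates of the query point, as Python floats.
    q_idx:    index of the query frame.
    feats:    (N, C, H, W) dense tracking features T_i from the transformer.
    Returns:  (N, 2) predicted (x, y) per frame via soft-argmax of the correlation.
    """
    N, C, H, W = feats.shape
    # Bilinearly sample the query feature (grid_sample expects coords in [-1, 1]).
    gx = 2.0 * query_xy[0] / (W - 1) - 1.0
    gy = 2.0 * query_xy[1] / (H - 1) - 1.0
    grid = torch.tensor([[[[gx, gy]]]], dtype=feats.dtype)              # (1, 1, 1, 2)
    q_feat = F.grid_sample(feats[q_idx:q_idx + 1], grid,
                           align_corners=True).reshape(C)               # (C,)
    # Correlate the query feature with every frame's feature map.
    corr = (feats * q_feat.view(1, C, 1, 1)).sum(dim=1)                 # (N, H, W)
    prob = corr.reshape(N, -1).softmax(dim=-1).reshape(N, H, W)
    # Soft-argmax over pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=feats.dtype),
                            torch.arange(W, dtype=feats.dtype), indexing="ij")
    x_hat = (prob * xs).sum(dim=(1, 2))
    y_hat = (prob * ys).sum(dim=(1, 2))
    return torch.stack([x_hat, y_hat], dim=-1)                          # (N, 2)
```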
Order of Predictions. The order of the images in the input sequence is arbitrary, except that the first image is chosen as the reference frame. The network architecture is designed to be permutation equivariant for all but the first frame.
Over-complete Predictions. Notably, not all quantities predicted by VGGT are independent. For example, as shown by DUSt3R [129], the camera parameters $\mathbf{g}$ can be inferred from the invariant point map $P$, for instance, by solving the Perspective-$n$-Point (PnP) problem [35, 61].
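For example, a point map and its pixel grid already form dense 2D-3D correspondences, so a camera can be recovered with any off-the-shelf PnP solver. The minimal OpenCV sketch below is our own illustration; the intrinsics guess `K` and the subsampling stride are hypothetical choices, not part of the method.

```python
import cv2
import numpy as np


def camera_from_point_map(P, K, subsample=8):
    """Recover a camera's extrinsics from its point map via PnP + RANSAC.

    P: (H, W, 3) point map in the world (first-camera) frame.
    K: (3, 3) float64 intrinsics guess, e.g. principal point at the image centre.
    Returns (R, t) such that X_cam = R @ X_world + t.
    """
    H, W, _ = P.shape
    v, u = np.mgrid[0:H:subsample, 0:W:subsample]                # subsampled pixel grid
    obj_pts = P[v, u].reshape(-1, 3).astype(np.float64)          # 3D points in world frame
    img_pts = np.stack([u, v], axis=-1).reshape(-1, 2).astype(np.float64)  # (x, y) pixels
    ok, rvec, tvec, _ = cv2.solvePnPRansac(obj_pts, img_pts, K, None)
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)                                   # axis-angle -> rotation matrix
    return R, tvec.reshape(3)
```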