
Wu, Guanjun, et al. “4D Gaussian Splatting for Real-Time Dynamic Scene Rendering.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

<Image source: 4D Gaussian Splatting>

2.1. Novel View Synthesis

  • [9] proposes an explicit voxel grid to model temporal information, cutting the training time for dynamic scenes to half an hour; this idea is also adopted in [19, 32, 62]. These deformation-based neural rendering methods are shown in Fig. 2 (a).
    • [9] Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
    • [19] Xiang Guo, Jiadai Sun, Yuchao Dai, Guanying Chen, Xiaoqing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, and Jingdong Wang. Forward flow for novel view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16022–16033, 2023.
    • [32] Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13–23, 2023.
    • [62] Taoran Yi, Jiemin Fang, Xinggang Wang, and Wenyu Liu. Generalizable neural voxels for fast human radiance fields. arXiv preprint arXiv:2303.15387, 2023.
  • Flow-based methods [14, 28, 32, 52, 67] adopt warping algorithms to synthesize novel views by blending nearby frames.
    • [14] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5712–5721, 2021.
    • [28] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6498–6508, 2021.
    • [32] Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13–23, 2023.
    • [52] Fengrui Tian, Shaoyi Du, and Yueqi Duan. Mononerf: Learning a generalizable dynamic radiance field from monocular videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17903–17913, 2023.
    • [67] Kaichen Zhou, Jia-Xing Zhong, Sangyun Shin, Kai Lu, Yiyuan Yang, Andrew Markham, and Niki Trigoni. Dynpoint: Dynamic neural point for view synthesis. Advances in Neural Information Processing Systems, 36, 2024.
  • [5, 12, 13, 25, 48, 53] further accelerate dynamic scene learning by adopting decomposed neural voxels. They treat the sampled points at each timestamp individually, as shown in Fig. 2 (b).
    • [5] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.
    • [12] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023.
    • [13] Wanshui Gan, Hongbin Xu, Yi Huang, Shifeng Chen, and Naoto Yokoya. V4d: Voxel for 4d novel view synthesis. IEEE Transactions on Visualization and Computer Graphics, 2023.
    • [25] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5521–5531, 2022.
    • [48] Ruizhi Shao, Zerong Zheng, Hanzhang Tu, Boning Liu, Hongwen Zhang, and Yebin Liu. Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16632–16642, 2023.
    • [53] Feng Wang, Zilong Chen, Guokang Wang, Yafei Song, and Huaping Liu. Masked space-time hash encoding for efficient dynamic scene reconstruction. Advances in Neural Information Processing Systems, 2023.
  • [16, 30, 41, 54, 56, 58] are efficient methods to handle multi-view setups.
    • [16] Xiangjun Gao, Jiaolong Yang, Jongyoo Kim, Sida Peng, Zicheng Liu, and Xin Tong. Mps-nerf: Generalizable 3d human rendering from multiview images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
    • [30] Haotong Lin, Sida Peng, Zhen Xu, Tao Xie, Xingyi He, Hujun Bao, and Xiaowei Zhou. High-fidelity and real-time novel view synthesis for dynamic scenes. In SIGGRAPH Asia Conference Proceedings, 2023.
    • [41] Sida Peng, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Representing volumetric videos as dynamic mlp maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4252–4262, 2023.
    • [54] Feng Wang, Sinan Tan, Xinghang Li, Zeyue Tian, Yafei Song, and Huaping Liu. Mixed neural voxels for fast multi-view video synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19706–19716, 2023.
    • [58] Qingshan Xu, Weihang Kong, Wenbing Tao, and Marc Pollefeys. Multi-scale geometric consistency guided and planar prior assisted multi-view stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4945–4963, 2022.

Though the aforementioned methods achieve fast training, real-time rendering of dynamic scenes remains challenging, especially for monocular input.

  • 4D Gaussian Splatting aims to construct a highly efficient training and rendering pipeline, shown in Fig. 2 (c), while maintaining quality, even for sparse inputs.

4.2 Gaussian Deformation Field Network

The network that learns the Gaussian deformation field consists of an efficient spatial-temporal structure encoder $H$ and a Gaussian deformation decoder $D$ that predicts the deformation of each 3D Gaussian.

  • the deformation of 3D Gaussians $\Delta G$
  • the Gaussian deformation field network $F$
  • the deformed 3D Gaussians $G'$
  • the spatial-temporal structure encoder $H$
  • the multi-head Gaussian deformation decoder $D$
\[f_d = H(G,t), \qquad \Delta G = D(f_d) = F(G,t), \qquad G' = G + \Delta G\]
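To make the composition concrete, here is a minimal PyTorch sketch of the top-level field $F = D \circ H$. This is not the authors' implementation; the class and argument names are my assumptions, and $H$ and $D$ are sketched in the subsections below.

```python
import torch.nn as nn

class GaussianDeformationField(nn.Module):
    """Sketch of F: (G, t) -> ΔG, applied as G' = G + ΔG (names assumed)."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.H = encoder  # spatial-temporal structure encoder
        self.D = decoder  # multi-head Gaussian deformation decoder

    def forward(self, xyz, rot, scale, t):
        f_d = self.H(xyz, t)                 # f_d = H(G, t)
        d_xyz, d_rot, d_scale = self.D(f_d)  # ΔG = D(f_d)
        # G' = G + ΔG, applied per attribute
        return xyz + d_xyz, rot + d_rot, scale + d_scale
```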

Spatial-Temporal Structure Encoder.

Nearby 3D Gaussians always share similar spatial and temporal information. To model the features of 3D Gaussians effectively, we introduce an efficient spatial-temporal structure encoder $H$ that includes a multi-resolution HexPlane $R(i,j)$ and a tiny MLP $\phi_d$, inspired by [5, 9, 12, 48].

  • [5] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.
  • [9] Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
  • [12] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023.
  • [48] Ruizhi Shao, Zerong Zheng, Hanzhang Tu, Boning Liu, Hongwen Zhang, and Yebin Liu. Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16632–16642, 2023.

Since the vanilla 4D neural voxel is memory-consuming, we adopt a 4D K-Planes [12] module to decompose the 4D neural voxel into six multi-resolution planes.

All 3D Gaussians in a certain area can be contained in the bounding plane voxels, and the deformation of Gaussians can also be encoded in the nearby temporal voxels.

Specifically, the spatial-temporal structure encoder $H$ contains six multi-resolution plane modules $R_l(i,j)$ and a tiny MLP $\phi_d$, i.e., $H(G,t) = \{R_l(i,j), \phi_d \mid (i,j) \in \{(x,y),(x,z),(y,z),(x,t),(y,t),(z,t)\},\ l \in \{1,2\}\}$.

The position $\mu = (x,y,z)$ is the mean of each 3D Gaussian in $G$.

Each voxel module is defined by $R(i,j) \in \mathbb{R}^{h \times lN_i \times lN_j}$, where

  • $h$: the hidden dimension of the features
  • $N$: the base resolution of the voxel grid
  • $l$: the upsampling scale

This entails encoding the information of the 3D Gaussians within six 2D voxel planes while accounting for temporal information.
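As a concrete reading of these shapes, here is a minimal PyTorch sketch that allocates the six plane modules at two resolution levels. The parameter names (`hidden_dim`, `base_res`, `time_res`) and the initialization are assumptions for illustration, not the official implementation.

```python
import torch
import torch.nn as nn

# The six (i, j) index pairs of the HexPlane / K-Planes decomposition.
PLANES = [("x", "y"), ("x", "z"), ("y", "z"), ("x", "t"), ("y", "t"), ("z", "t")]

class HexPlaneEncoder(nn.Module):
    def __init__(self, hidden_dim=32, base_res=64, time_res=25, scales=(1, 2)):
        super().__init__()
        self.scales = scales
        res = {"x": base_res, "y": base_res, "z": base_res, "t": time_res}
        # One learnable 2D grid per (scale, plane): R_l(i,j) with shape
        # (1, h, l*N_i, l*N_j), matching R(i,j) ∈ R^{h × lN_i × lN_j}.
        self.grids = nn.ParameterList(
            nn.Parameter(0.1 * torch.randn(1, hidden_dim, l * res[i], l * res[j]))
            for l in scales for (i, j) in PLANES
        )
        # Tiny MLP φ_d that later merges the concatenated multi-scale features.
        self.phi_d = nn.Sequential(
            nn.Linear(hidden_dim * len(scales), 64), nn.ReLU(), nn.Linear(64, 64)
        )
```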

The formula for computing the separate voxel features is as follows:

\[f_h = \bigcup_l \prod \text{interp}(R_l(i,j)), \quad (i,j) \in \{(x,y),(x,z),(y,z),(x,t),(y,t),(z,t)\}\]

$f_h \in \mathbb{R}^{h \cdot l}$ is the feature of the neural voxels. ‘interp’ denotes bilinear interpolation: each query is interpolated from the voxel features at the four vertices of its grid cell.

The production process (multiplying the interpolated features across planes) follows K-Planes [12].

Then a tiny MLP $\phi_d$ merges all the features by $f_d = \phi_d(f_h)$.
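Putting the last two equations together, a minimal sketch of the feature query might look as follows, continuing the `HexPlaneEncoder` sketch above: `grid_sample` plays the role of ‘interp’ (bilinear interpolation), the product runs over the six planes, and the union becomes concatenation across resolution levels. The coordinate normalization is an assumed preprocessing step.

```python
import torch
import torch.nn.functional as F

def forward(self, xyz, t):
    # Assumes x, y, z, t are already normalized to [-1, 1] for grid_sample.
    coords = {"x": xyz[:, 0], "y": xyz[:, 1], "z": xyz[:, 2],
              "t": t.expand(xyz.shape[0])}
    feats, g = [], 0
    for _ in self.scales:
        f = 1.0
        for (i, j) in PLANES:                    # product over the six planes
            plane = self.grids[g]; g += 1        # (1, h, H, W)
            # grid_sample expects (x, y) = (W-axis, H-axis), so j comes first.
            uv = torch.stack([coords[j], coords[i]], dim=-1)        # (N, 2)
            sampled = F.grid_sample(plane, uv.view(1, -1, 1, 2),
                                    align_corners=True)             # (1, h, N, 1)
            f = f * sampled.view(plane.shape[1], -1).T              # (N, h)
        feats.append(f)
    f_h = torch.cat(feats, dim=-1)   # union over levels: (N, h·|scales|)
    return self.phi_d(f_h)           # f_d = φ_d(f_h)
```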

Multi-head Gaussian Deformation Decoder.

When all the features of 3D Gaussians are encoded, we can compute any desired variable with a multi-head Gaussian deformation decoder $D = \{\phi_x, \phi_r, \phi_s \}$.

Separate MLPs are employed to compute:

  • the deformation of position, $\Delta X = \phi_x(f_d)$,
  • the deformation of rotation, $\Delta r = \phi_r(f_d)$,
  • the deformation of scaling, $\Delta s = \phi_s(f_d)$.

Then, the deformed features $(X', r', s')$ can be computed as:

\[(X', r', s') = (X + \Delta X, r + \Delta r, s + \Delta s).\]

Finally, we obtain the deformed 3D Gaussians $G' = \{X', s', r', \sigma, C\}$.
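A minimal sketch of the decoder under the same assumptions; the head widths and depths are illustrative, and rotations are treated as 4D quaternion offsets as in 3D Gaussian Splatting.

```python
import torch.nn as nn

class DeformationDecoder(nn.Module):
    """Multi-head decoder D = {φ_x, φ_r, φ_s} acting on the shared feature f_d."""
    def __init__(self, feat_dim=64, width=64):
        super().__init__()
        def head(out_dim):  # one small MLP per deformation target
            return nn.Sequential(nn.Linear(feat_dim, width), nn.ReLU(),
                                 nn.Linear(width, out_dim))
        self.phi_x = head(3)   # ΔX: position offset
        self.phi_r = head(4)   # Δr: rotation (quaternion) offset
        self.phi_s = head(3)   # Δs: scaling offset

    def forward(self, f_d):
        return self.phi_x(f_d), self.phi_r(f_d), self.phi_s(f_d)
```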

4.3 Optimization

3D Gaussian Initialization.

4D Gaussians can also leverage the power of proper 3D Gaussian initialization.

We optimize the 3D Gaussians for the initial 3000 iterations as a warm-up, rendering images with static 3D Gaussians, $\hat{I} = S(M,G)$, instead of 4D Gaussians, $\hat{I}=S(M,G')$; a minimal training-loop sketch follows the list below.

  • $M = [R,T]$: the view matrix
  • a novel-view image $\hat{I}$ is rendered by differentiable splatting $S$
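In the sketch below, `splatting_render` stands in for the differentiable splatting $S(M, G)$, and `deformation_field`, `gaussians`, `loader`, `optimizer`, and `tv_loss` are hypothetical names tying together the modules sketched above; none of them come from the official code.

```python
WARMUP_ITERS = 3000

for it, (view_matrix, gt_image, t) in enumerate(loader):
    if it < WARMUP_ITERS:
        # Warm-up: optimize the static 3D Gaussians only, Î = S(M, G).
        pred = splatting_render(view_matrix, gaussians.xyz,
                                gaussians.rot, gaussians.scale)
    else:
        # Afterwards: deform first, then render, Î = S(M, G').
        xyz, rot, scale = deformation_field(gaussians.xyz, gaussians.rot,
                                            gaussians.scale, t)
        pred = splatting_render(view_matrix, xyz, rot, scale)
    # L1 color loss plus grid TV regularization (sketched in the next subsection).
    loss = (pred - gt_image).abs().mean() + tv_loss(encoder)
    loss.backward(); optimizer.step(); optimizer.zero_grad()
```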

The illustration of the optimization process is shown in Fig. 4.


Loss Function.

Similar to other reconstruction methods [9, 22, 42], we use the L1 color loss to supervise the training process.

  • [9] Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
  • [22] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG), 42(4):1–14, 2023.
  • [42] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.

A grid-based total variation loss $L_{tv}$ [5, 9, 12, 51] is also applied; a sketch of one possible form appears after the reference list below.

\[L = |\hat{I}-I| + L_{tv}.\]
  • [5] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.
  • [9] Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
  • [12] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023.
  • [51] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5459–5469, 2022.
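A minimal sketch of one common form of grid-based total variation regularization, penalizing squared differences between neighboring cells of each plane grid; the exact form and weight used in the paper may differ, so treat this as an assumption.

```python
import torch

def tv_loss(encoder, weight=1e-4):
    """Total variation over each plane module R_l(i,j) of shape (1, h, H, W)."""
    loss = 0.0
    for grid in encoder.grids:
        loss = loss + (grid[..., 1:, :] - grid[..., :-1, :]).pow(2).mean()  # along H
        loss = loss + (grid[..., :, 1:] - grid[..., :, :-1]).pow(2).mean()  # along W
    return weight * loss
```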

How was a fair comparison set up?

5.2. Results

The K-Planes results on the synthetic dataset originate from the Deformable-3DGS [60] paper.

The other results of the compared methods are taken from their papers, reproduced with their code, or provided by the authors.

The rendering speed and storage data for [5, 9, 12, 22] are estimated based on the official implementations.
