
Fast Dynamic Radiance Fields with Time-Aware Neural Voxels

Explicit data structures, e.g. voxel features, show great potential to accelerate the training process.

However, voxel features face two major challenges when applied to dynamic scenes:

  • modeling temporal information
  • capturing different scales of point motions

In particular, most existing methods model dynamic scenes by introducing an additional deformation network, of a scale similar to the radiance network, which maps point coordinates into a canonical space.

This design substantially increases the cost of training and inference for dynamic fields.

This cumbersome time cost impedes wide application in real-life scenarios.

One direct and simple solution is to expand the voxel grids with an additional time dimension.

However, this approach significantly increases the memory cost.

Changing voxel grids from 4D to 5D, i.e. $(C, N_x, N_y, N_z) \rightarrow (C, N_x, N_y, N_z, N_t)$, will multiply the storage cost by $N_t$.
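To make the scaling concrete, here is a quick back-of-the-envelope estimate in Python; the channel count, resolution, and number of time steps are hypothetical values chosen only for illustration, not numbers from the paper.

```python
# Rough storage estimate for a hypothetical feature grid (illustrative numbers only).
C, Nx, Ny, Nz, Nt = 12, 160, 160, 160, 100   # channels, spatial resolution, time steps
bytes_per_float = 4                           # float32

static_bytes = C * Nx * Ny * Nz * bytes_per_float   # 4D grid (C, Nx, Ny, Nz)
dynamic_bytes = static_bytes * Nt                   # naive 5D grid multiplies cost by Nt

print(f"static : {static_bytes / 1e9:.2f} GB")   # ~0.20 GB
print(f"dynamic: {dynamic_bytes / 1e9:.2f} GB")  # ~19.66 GB
```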

On the other hand, there usually exist motions of different scales.

High-resolution voxels lie in small grids, which fail to model large motions; voxels in large grids fail to capture the details of small motions.

Key Takeaways

  • Use explicit data structures to reduce the training time for dynamic fields
  • Use multi-distance interpolation to model both large and small motions

To encode temporal information, we first build a highly compressed deformation network which maps 3D point coordinates into a coarse canonical space.

Voxel features are queried with the transformed coordinates.

We further enhance the temporal information by feeding time and coordinate embeddings into the latter radiance network.

Thus, the deviation introduced by point mapping can be automatically suppressed by the neural network.

Moreover, we propose a multi-distance interpolation method, where features are obtained from voxels with multiple distances.

In this way, both small and large motions can be modeled even though only a single set of single-resolution voxel features is constructed.

2.1 Neural Rendering for Dynamic Scenes

The key problem in dynamic-scene rendering lies in encoding temporal information. One stream of dynamic NeRF methods models deformations in scenes by extending radiance fields with an additional time dimension.

2.2 Neural Rendering Acceleration

Some works propose to represent scenes with explicit voxel-grid features/properties and directly optimize these voxels for extremely fast convergence speed, reducing training time from hours to minutes.

However, storage cost is significantly increased for storing voxel features.

Recent works effectively reduce the storage cost via voxel hashing, tensor decomposition, and bitrate dictionary lookup while still maintaining surprisingly high training speed.

However, these voxel-optimizable methods focus only on static scenes, where voxel features are straightforward to construct since they encode only spatial information.

We introduce optimizable explicit voxel features into dynamic scenes.

Temporal information is encoded to obtain time-aware neural voxel features.

3 Method

3.2 Multi-Distance Interpolation with Neural Voxels

To predict the properties of a 3D point, the neural voxel features stored at the eight vertices of the grid cell containing the point are queried and trilinearly interpolated.

The interpolated feature is then fed into neural networks to predict the expected properties.

  • These interpolated voxel features are first positionally encoded as in Eq. 4 and then fed into the neural network for inference.
    • This manner is of critical importance for compressing neural voxels into small sizes while maintaining strong performance for modeling details, which is evaluated in experiments (Sec. 4.3).
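As a rough sketch of this querying step (not the paper's exact implementation), trilinear interpolation over a voxel feature grid can be written with PyTorch's `grid_sample`; the grid shape, channel count, and the [-1, 1] coordinate normalization are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def query_voxels(grid, pts):
    """Trilinearly interpolate features from a voxel grid.

    grid: (1, C, Nz, Ny, Nx) learnable feature volume
    pts : (P, 3) point coordinates (x, y, z), already normalized to [-1, 1]
    returns: (P, C) interpolated features
    """
    coords = pts.view(1, 1, 1, -1, 3)                 # grid_sample expects (N, D, H, W, 3)
    feats = F.grid_sample(grid, coords, mode='bilinear', align_corners=True)
    return feats.view(grid.shape[1], -1).t()          # (1, C, 1, 1, P) -> (P, C)

voxel_grid = torch.randn(1, 8, 64, 64, 64, requires_grad=True)  # hypothetical sizes
points = torch.rand(1024, 3) * 2 - 1                            # points in [-1, 1]^3
features = query_voxels(voxel_grid, points)                     # (1024, 8)
```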


Since points in dynamic scenes may move along dramatic motion trajectories, small grids of neural voxels have limited capacity to model these point movements.

We propose a multi-distance interpolation method to model point motions with various scales.

As shown in Fig. 3, besides the smallest grid, voxel features are also interpolated from vertices of larger grids.

This means the final features for inference come not only from the nearest voxels, but also from the sub- and subsub-nearest voxels.

In this way, small motions can be modeled via near voxels while motions in a large region are perceived with farther voxels.
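Below is a minimal sketch of the multi-distance idea, assuming that interpolating from farther (sub- and subsub-nearest) vertices can be approximated by sampling average-pooled copies of the same single-resolution grid; the scales and the pooling scheme are my assumptions, not necessarily the paper's exact code.

```python
import torch
import torch.nn.functional as F

def multi_distance_query(grid, pts, scales=(1, 2, 4)):
    """Interpolate features at several "distances" from one single-resolution grid.

    grid  : (1, C, Nz, Ny, Nx) feature volume
    pts   : (P, 3) coordinates normalized to [-1, 1]
    scales: pooling factors; 1 = nearest voxels, >1 ~ sub-/subsub-nearest vertices
    returns: (P, C * len(scales)) concatenated features
    """
    coords = pts.view(1, 1, 1, -1, 3)
    feats = []
    for s in scales:
        g = grid if s == 1 else F.avg_pool3d(grid, kernel_size=s, stride=s)
        f = F.grid_sample(g, coords, mode='bilinear', align_corners=True)
        feats.append(f.view(grid.shape[1], -1).t())   # (P, C) per distance
    return torch.cat(feats, dim=-1)                   # (P, 3C)
```

Concatenating the features from the three distances gives the later MLP access to both fine local detail and coarser context, even though only one grid is stored.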

3.3 Temporal Information Encoding

Coarse Coordinate Deformation. Like previous implicit NeRF methods for dynamic scenes, we introduce a deformation network that shifts point coordinates to simulate point movement, but we compress the network into a very small one.

We use only a 3-layer MLP (multilayer perceptron) as the deformation network, which is much smaller than the ones adopted in previous dynamic-scene methods.

As this deformation network is applied to every sampled point and thus accounts for a large portion of the computation cost, we compress it in both width and depth to accelerate the optimization and rendering processes.
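A hedged sketch of such a compressed deformation network follows; only the 3-layer depth comes from the text, while the width, the time-embedding input, and the residual offset formulation are assumptions.

```python
import torch
import torch.nn as nn

class TinyDeformNet(nn.Module):
    """3-layer MLP predicting a coordinate offset for each sampled point (illustrative)."""
    def __init__(self, time_dim=16, width=64):   # hypothetical sizes
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + time_dim, width), nn.ReLU(inplace=True),
            nn.Linear(width, width),        nn.ReLU(inplace=True),
            nn.Linear(width, 3),            # per-point offset (dx, dy, dz)
        )

    def forward(self, xyz, t_emb):
        # xyz: (P, 3) sample coordinates, t_emb: (P, time_dim) time embedding
        offset = self.mlp(torch.cat([xyz, t_emb], dim=-1))
        return xyz + offset                 # shifted coordinates (x', y', z')
```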

Temporal Information Enhancement. Since the preceding MLP is compressed to only 3 layers, its limited capacity can cause unavoidable deviation in the coordinate shifting.

Moreover, this deviation is aggravated because neural voxels are queried according to the point coordinates; thus, the final error comes not only from the interpolation weights but also from the queried vertices.

To address this deviation/mismatch problem, as shown in Fig. 2, the temporal information is enhanced by concatenating the interpolated features with positional-encoded coordinates and neural-encoded temporal embeddings before feeding them into the neural networks, which mitigates the deviation.


In the Temporal Information Enhancement part of Fig. 2, the features of each color are as follows.

  • green: temporal information
  • blue: positional-encoded coordinates
  • orange, yellow, ivory: neural-encoded temporal embeddings

3.4 Overall Framework and Optimization

The overall framework is shown in Fig. 2 and sketched in code after the list below.

  • First, the time stamp is encoded by a two-layer MLP $\phi_t$ and then fed into a compressed deformation network $\phi_d$ along with the coordinates of the sampled point $(x,y,z)$ to obtain the shifted coordinates $(x',y',z')$ as in Eq. 6.
    • $\phi_t$: 2-layer MLP
    • $\phi_d$: 3-layer MLP
  • The shifted coordinates are used to query and interpolate neural voxels in a multi-distance manner as in Eq. 5.
  • Then, the concatenated neural voxel $v$ from Eq. 5, the encoded time embedding $t=\phi_t(\gamma(t))$, and the original coordinates $(x,y,z)$ are all concatenated and fed into a narrow and shallow radiance network $\phi_r$ to obtain the final density $\sigma$ and color $c$:
\[c, \sigma = \phi_r(\gamma(v), t, \gamma(x,y,z), \gamma(d)),\]
  • $\gamma$: the positional encoding
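Putting the steps above together, here is a hedged end-to-end sketch of the forward pass. It reuses the hypothetical `TinyDeformNet` and `multi_distance_query` from the earlier sketches; the frequency count of $\gamma$, the output parameterization, and the concatenation details are assumptions, with only the overall data flow following the description above.

```python
import torch

def gamma(x, n_freqs=4):
    """Sinusoidal positional encoding γ, applied channel-wise."""
    enc = [x]
    for k in range(n_freqs):
        enc += [torch.sin(2.0 ** k * x), torch.cos(2.0 ** k * x)]
    return torch.cat(enc, dim=-1)

def point_properties(xyz, t, view_dir, voxels, phi_t, phi_d, phi_r):
    """One forward pass following the bullets above.

    xyz: (P, 3) sampled coordinates, t: (P, 1) time stamps,
    view_dir: (P, 3) view directions, voxels: (1, C, Nz, Ny, Nx) feature grid,
    phi_t / phi_d / phi_r: time, deformation, and radiance networks.
    """
    t_emb = phi_t(gamma(t))                                # t = φ_t(γ(t))
    xyz_shifted = phi_d(xyz, t_emb)                        # (x', y', z') as in Eq. 6
    v = multi_distance_query(voxels, xyz_shifted)          # concatenated neural voxel v (Eq. 5)
    h = torch.cat([gamma(v), t_emb, gamma(xyz), gamma(view_dir)], dim=-1)
    raw = phi_r(h)                                         # narrow, shallow radiance MLP
    color, sigma = torch.sigmoid(raw[:, :3]), raw[:, 3]    # c in [0, 1], raw density σ
    return color, sigma
```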
