License: arXiv.org perpetual non-exclusive license
arXiv:2310.04152v2 [cs.CV] 18 Mar 2024
Improving Neural Radiance Fields Using Near-Surface Sampling With Point Cloud Generation

Hye Bin Yoo, Hyun Min Han, Sung Soo Hwang [2], Il Yong Chun [1,3]

[1] Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea

[2] Department of Information and Communication Engineering, Handong Global University, Pohang 37554, Republic of Korea

[3] Departments of Artificial Intelligence, Semiconductor Convergence Engineering, and Display Convergence Engineering, and Center for Neuroscience Imaging Research, Institute for Basic Science, Sungkyunkwan University, Suwon 16419, Republic of Korea

Two of the authors contributed equally to this work.

Abstract

Neural radiance field (NeRF) is an emerging view synthesis method that samples points in a three-dimensional (3D) space and estimates their existence and color probabilities. The disadvantage of NeRF is that it requires a long training time since it samples many 3D points. In addition, if one samples points from occluded regions or in the space where an object is unlikely to exist, the rendering quality of NeRF can be degraded. These issues can be solved by estimating the geometry of the 3D scene. This paper proposes a near-surface sampling framework to improve the rendering quality of NeRF. To this end, the proposed method estimates the surface of a 3D object using depth images of the training set and performs sampling only near the estimated surface. To obtain depth information on a novel view, the paper proposes a 3D point cloud generation method and a simple refining method for projected depth from a point cloud. Experimental results show that the proposed near-surface sampling NeRF framework can significantly improve the rendering quality, compared to the original NeRF and three different state-of-the-art NeRF methods. In addition, one can significantly reduce the training time of a NeRF model with the proposed near-surface sampling framework.

keywords:
Neural radiance field (NeRF), neural rendering, near-surface sampling, point cloud, depth image, three-dimensional geometry

1 Introduction

Recently, metaverse and virtual reality applications have been rapidly drawing attention. In such applications, it is important to generate novel views accurately. One way to achieve this goal is to generate a three-dimensional (3D) model first and follow a conventional rendering pipeline [1]. However, generating a 3D model requires considerable time and effort.

Image-based rendering (IBR) is another approach that generates novel views without explicitly generating a 3D model. Several methods generate a novel view using image morphing [4]. The Layered Depth Images method [24] stores multiple depth and color values for each pixel to effectively fill the hole behind the foreground object in a novel view. Light fields [11] and Lumigraph [7] that express light rays as a function were also proposed.

Recently, among IBR methods, neural radiance field (NeRF) [17] has been rapidly gaining attention. A ray, a core concept of NeRF, is a straight line cast from the camera position toward an object. A NeRF network predicts the color and density of each 3D point sampled along each ray. A novel view is then obtained by performing a line integral over these colors and densities.

The original NeRF [17] performs sampling within a range that includes the entire 3D object. This paper proposes to use depth information to sample 3D points only around the surface of an object in NeRF, where we consider the practical scenario that depth information is available (from depth cameras) only in a training dataset. Because measured/estimated depth maps may be inaccurate due to capturing environments, we propose to generate a 3D point cloud using the available (inaccurate) depth information in training, and to use this 3D point cloud to estimate a depth image for each novel view in test (i.e., inference). Figure 1 gives a brief overview of the proposed NeRF framework. Simply projecting a 3D point cloud onto a novel view generates a rather rough depth image. To obtain more accurate depth images, we additionally propose a refining method that removes unnecessary 3D points in generating a point cloud and fills the holes of the projected depth image. Simply put, to improve NeRF, the paper proposes an advanced sampling method around the surface of an object/a scene using depth images estimated from a generated point cloud. Our experimental results with different datasets demonstrate that the proposed framework outperforms the original NeRF and three different state-of-the-art NeRF methods.

Figure 1: The brief overview of the proposed NeRF framework that samples points near the estimated surface from a point cloud.

The rest of the paper is organized as follows. Section 2 reviews NeRF and its follow-up works, with emphasis on those most closely related to ours, and presents the differences between the proposed NeRF and existing depth-based NeRFs. Section 3 provides the motivation and details of the proposed method, Section 4 reports experiments and analysis, and Section 5 discusses conclusions, limitations, and future work.

2 Related works

2.1 NeRF

NeRF [17] is a state-of-the-art view synthesis technology that samples points on rays and synthesizes views through differentiable volume rendering. The input of this algorithm is a single continuous five-dimensional (5D) coordinate consisting of a 3D spatial location and a two-dimensional viewing direction. The output is a volume density and view-dependent emitted radiance at the corresponding spatial location. In other words, the key idea of NeRF is to train a neural network that takes a 5D coordinate and predicts a view-dependent color value and a volume density value. Using those two predicted values, a final rendered color value is determined by performing a line integral with classical volume rendering. To further improve the rendering quality, NeRF uses the following two techniques: positional encoding and hierarchical volume sampling. Positional encoding increases the dimension of input data; the hierarchical volume sampling technique allocates more samples to regions that are expected to include visible content. Hierarchical volume sampling is so named because it performs sampling with two different networks, a “coarse” one and a “fine” one. For each ray, the coarse network gives a view-dependent emitted color and volume density using $N_{\text{c}}$ points that are sampled with the stratified sampling method along the ray. A piecewise-constant probability density function (PDF) is generated (along each ray) by normalizing contribution weights that are calculated with volume densities and the distances between adjacent samples of the $N_{\text{c}}$ points. After integrating the generated PDF to calculate the cumulative distribution function, $N_{\text{f}}$ points are sampled through inverse transform sampling. The fine network gives a view-dependent color value and volume density using the $N_{\text{c}}$ points and those more informed $N_{\text{f}}$ points. Finally, one calculates the final rendering of the corresponding ray with $N_{\text{c}} + N_{\text{f}}$ points. Through this process, NeRF can represent a 3D object (in 360 degrees) and forward-facing scenes with continuous views. However, NeRF in its original form has several limitations. For example, it can represent only static scenes; its training and inference are slow; one NeRF network represents only one object/scene.
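The coarse-to-fine step described above can be illustrated with a minimal sketch. It is not the reference implementation of [17]: it assumes that the coarse pass is summarized by one density per bin between hypothetical `bin_edges`, it uses the standard volume rendering weights $w_i = T_i (1 - e^{-\sigma_i \delta_i})$, and it adds a small constant `eps` to the weights so that the PDF is never all zero. All function and variable names are illustrative.

```python
import bisect
import math
import random


def contribution_weights(densities, deltas):
    """Return w_i = T_i * (1 - exp(-sigma_i * delta_i)) for one ray."""
    weights = []
    transmittance = 1.0  # fraction of light reaching the current sample
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def sample_fine_depths(bin_edges, weights, n_fine, rng, eps=1e-5):
    """Draw n_fine depths from the piecewise-constant PDF given by weights."""
    padded = [w + eps for w in weights]  # eps keeps every bin reachable
    total = sum(padded)
    pdf = [w / total for w in padded]
    cdf = [0.0]
    for p in pdf:
        cdf.append(cdf[-1] + p)
    depths = []
    for _ in range(n_fine):
        u = rng.random()
        # Inverse transform sampling: locate the bin whose CDF range holds u.
        i = min(bisect.bisect_right(cdf, u) - 1, len(pdf) - 1)
        # Position inside the bin, clamped against floating-point round-off.
        frac = min(max((u - cdf[i]) / pdf[i], 0.0), 1.0)
        depths.append(bin_edges[i] + frac * (bin_edges[i + 1] - bin_edges[i]))
    return sorted(depths)


# Hypothetical ray: six coarse bins over [2, 5], with density only in bin 2.
edges = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
deltas = [b - a for a, b in zip(edges[:-1], edges[1:])]
densities = [0.0, 0.0, 50.0, 0.0, 0.0, 0.0]
weights = contribution_weights(densities, deltas)
fine = sample_fine_depths(edges, weights, n_fine=100, rng=random.Random(0))
```

In this toy ray almost all of the weight falls in the single dense bin, so the fine samples cluster there, which is the behavior that hierarchical sampling is designed to produce.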

2.2 Follow-up works of NeRF

Researchers have been improving the original NeRF model [17] in various aspects. The first aspect is to reduce the training time of NeRF models while maintaining rendering accuracy [9, 5, 27, 18]. [9] reduces training time by proposing a new sampling method that uses fewer samples per ray. [5] supervises depth to use a smaller number of views in training. [27] accelerates training by quickly generating an initial rough point cloud and refining it in an iterative manner. [18] uses a learnable encoding method instead of positional encoding, and updates only the parameters related to sampling positions instead of updating all parameters.

The second aspect is to improve the inference time of NeRF models [23, 14, 13, 19, 18]. [23] and [14] reduce inference time by spatially decomposing and processing the scene: [23] uses a spatially decomposed scene and a small network for each space; [14] skips spaces with irrelevant scenes among the decomposed spaces during inference. [13] uses a volume integral calculation network instead of the classical integral calculation method to shorten inference. [19] uses a rendering pipeline that includes a network to predict the optimal sample locations on rays to reduce inference time. Using a learnable encoding method instead of positional encoding, [18] can accelerate inference.

The third aspect is to consider different scenarios with NeRF models [29, 12, 26, 20, 10, 21, 22, 25, 2, 16]. [29] additionally estimates camera pose. [12] considers the case that camera poses are imperfect or unknown. [26, 20, 10] consider multi-object/scene representation. In particular, [26] disentangles foreground and background. Dynamic scene representation [21, 22] and relighting [25, 2, 16] make NeRF applicable to changing scenes rather than only static scenes.

2.3 Depth-based NeRFs and their relations with the proposed NeRF framework

Depth oracle neural radiance field (DONeRF) [19] uses ground-truth depth images of the training set to train ideal sample locations on rays, and performs sampling in the estimated locations. However, DONeRF works only on forward-facing scenes where all camera poses belong to a bounding box called the view cell. Depth supervised neural radiance field (DSNeRF) [5] uses a sparse depth map estimated with the structure from motion technique and adds an optimization process to the original NeRF using estimated depth information, to achieve the best rendering performance of original NeRF with fewer training iterations and images.

Similar to DONeRF, we aim to improve the quality of rendered images by using depth images available at hand in a training dataset. Note, however, that different from DONeRF, the proposed method does not use the view cell information that is required in DONeRF, and is applicable with less restricted camera positions. Similar to DSNeRF, we use depth information by leveraging a point cloud. However, the proposed framework and DSNeRF use a point cloud in a different way. DSNeRF uses a point cloud to adjust the volume density function of NeRF. Different from this, the proposed framework uses a point cloud to directly estimate the distance to the object surface from a camera.

3 Proposed method

3.1 Motivation

Figure 2: The NeRF rendering accuracy comparisons with different sampling ranges. Here, $d$ denotes the default sampling range of NeRF.

In NeRF [17], there is room for improvement in rendering accuracy. NeRF uses a hierarchical volume sampling method that performs sampling twice: “rough” sampling with a stratified sampling approach and “fine” sampling in the space where an object is likely to exist. See details in Section 2.1. The stratified sampling approach in NeRF divides a specified range into many bins and selects a sample uniformly at random from each bin. In the stratified sampling process, sampling is performed not only in the space where the object exists, but also in the free space or the occluded region. Sampling in free space and occluded regions may degrade rendering quality. If one can sample points only around an object in the rough sampling stage, the rendering performance might improve even without the fine sampling process.

To show the effects of the sampling density around an object on the rendering quality, we ran simple experiments with different sampling ranges around the surface of an object. Figure 2 shows the rendering accuracy in peak signal-to-noise ratio (PSNR) values for different sampling ranges, where we increased the default sampling range of NeRF by a factor of 2, 4, and 8 by increasing the distances between two samples. As the sampling range increases, i.e., the sampling density around an object decreases, the rendering accuracy rapidly degrades. We observed from these experiments that narrowing the sampling range around an object can improve the rendering quality in NeRF. This is consistent with the hierarchical volume sampling scheme of the original NeRF, which re-extracts samples with high volume density values to increase rendering efficiency.

Recently, diverse low-cost depth cameras with high accuracy have been proposed [15, 6]. Depth cameras (using multi-view) can measure the distance between an object and the device, giving additional 3D information about an object. We conjecture that if we sample points on each 3D ray only around the surface of an object, the rendering quality of NeRF improves.

3.2 Overview

Figure 3: The overall diagram of the proposed NeRF framework. The red words highlight proposed modules.

Figure 3 illustrates the overall process of the proposed framework. A training set consists of color images and depth images, and we use both at the training stage. In particular, we use depth images to sample in the area close to the surface of the object in 3D space, and we refer to this sampling strategy as surface-based sampling. By using the sample points obtained through surface-based sampling, we train the NeRF model. At the offline stage, we use depth images of the training set to generate a point cloud and save this point cloud for inference. At the test stage, we use the point cloud saved at the offline stage to generate a depth image corresponding to a novel view. We further refine depth images through computationally efficient hole filling for surface-based sampling. Using points sampled only around the surface of an object that is estimated with a refined depth, we render images of novel views with a single NeRF network.

3.3 Surface-based sampling

Figure 4: Sampling range comparisons between the original NeRF (blue) and the proposed surface-based sampling scheme (orange). The solid line represents the surface of an object, and the dotted lines inside the blue fan represent rays. The area within two dotted lines outside the blue region corresponds to the field of view of a camera. Different from the original NeRF, the proposed method samples only around the surface of an object.

Figure 4 illustrates the difference between the sampling range of the original NeRF’s sampling method (blue) and that of surface-based sampling method (orange). Different from original NeRF that samples 3D points at a wide range that includes the entire 3D object, the proposed surface-based sampling method mainly samples those around the surface of the object.

We now describe the geometry of the proposed surface-based sampling method for each ray of each view. As in the original NeRF, we assume that each ray is propagated from the location of a camera (see Figure 4). We define the location of a camera in each ray as $0$. The distance between the locations of a camera and an object is the depth value from a depth image, and we denote it as $d$. Let the half of some specified sampling range be $\alpha$. Then, the location of a point nearest to the camera within the sampling range can be calculated as follows:

$S_{r,0} = d - \alpha.$ (1)

Now, we determine the location of the $n$th sample for each ray (considering that a ray originates from the camera location, $0$) by

$S_{r,n} = S_{r,0} + (n-1)\frac{2\alpha}{N} + \gamma, \quad n = 1, \ldots, N,$ (2)

where $N$ is the number of sample points for each ray, and $\gamma$ is a random number generated between $0$ and $2\alpha/N$. We perform stratified sampling near the surface of an object, where we determine the sample locations by (2). In (2), $2\alpha/N$ is the length of each bin, as in the stratified sampling of the original NeRF method. Here, the parameter $\alpha$ determines the sampling range; if $N$ is fixed, $\alpha$ ultimately affects the sampling density around the surface. As $\alpha$ decreases, each bin becomes shorter and the sample points are expected to lie closer together, so the sampling density near the surface increases. As $\alpha$ increases, each bin becomes longer and the sample points are expected to lie farther apart, so the sampling density near the surface decreases.

Different from the two-step network sampling scheme in original NeRF, the proposed framework directly samples points near the surface of an object by using depth information in the near-surface sampling scheme (2) in a single step, i.e., it uses a single network. We expect that if the depth $d$ to the surface of a 3D object is accurately estimated, the rendering quality improves by using a small $\alpha$, i.e., by densely sampling 3D points. If it is poorly estimated, we expect that a small $\alpha$ instead degrades the rendering quality. With fixed $N$, we recommend setting $\alpha$ by considering the accuracy of the depth images.
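As an illustration, the sampling rule in (1) and (2) can be written as a short function. This is a minimal sketch under one stated assumption: the jitter $\gamma$ is drawn independently for every sample, which is how stratified sampling is usually done but is not spelled out in the text. The function name and the example values are hypothetical.

```python
import random


def near_surface_samples(d, alpha, n_samples, rng):
    """Sample locations S_{r,1..N} along one ray, following (1) and (2)."""
    bin_length = 2.0 * alpha / n_samples
    s0 = d - alpha  # nearest end of the sampling range, Eq. (1)
    samples = []
    for n in range(1, n_samples + 1):
        # Assumption: a fresh uniform jitter in [0, 2*alpha/N] per sample.
        gamma = rng.uniform(0.0, bin_length)
        samples.append(s0 + (n - 1) * bin_length + gamma)  # Eq. (2)
    return samples


# Hypothetical ray whose depth value is d = 3.0, with alpha = 1/2 and N = 8.
locations = near_surface_samples(d=3.0, alpha=0.5, n_samples=8,
                                 rng=random.Random(0))
```

With these values every sample falls inside the interval from $d - \alpha = 2.5$ to $d + \alpha = 3.5$, one per bin of length $2\alpha/N = 0.125$; halving $\alpha$ would halve the bin length and double the sampling density around the surface.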

Figure 5: An example of the proposed point cloud refinement. In the first step, we generate a point cloud from a depth image of a viewpoint. In the second step, we project the generated point cloud onto the next viewpoint. In the third step, we use the depth thresholding scheme (3) using projected points in the next viewpoint and ground-truth depth values. If a projected point in the next viewpoint has a similar value to the ground-truth, we consider that the corresponding 3D point is redundant to generate. We then generate new 3D points in the next viewpoint if they are determined to be necessary. We repeat the above steps.

3.4 Depth image generation for novel views

In the training stage, we perform surface-based sampling without any additional process, assuming that a depth image for each view is available. In the test stage, however, we assume that depth images are unavailable, so we perform depth estimation for a novel view for surface-based sampling. For depth estimation, in the offline stage, we generate and save a point cloud as shown in Figure 3. In the test stage, we use this point cloud to estimate depth images for novel views. Using this depth estimation process, surface-based sampling can be performed without a ground truth depth image in the test stage.

3.4.1 Point cloud generation and refinement in the offline stage

Figure 5 illustrates the key concept of the proposed point cloud generation and refinement method. To improve the accuracy of depth estimation, we generate 3D points with a subset of training images, by repeatedly eliminating inaccurate points. In constructing a subset of training images, we give a sufficient and uniform distance between their adjacent viewpoints. This setup is more efficient in constructing a 3D point cloud, compared to the setup that uses the entire training views. See details of this experimental setup later in Section 4.2.

Each iteration consists of the following four steps, and we repeat them as many times as the cardinality of the subset of training images, sequentially following the trajectory of viewpoints in the subset:

  1) We generate a point cloud using a depth image from a viewpoint.

  2) We project 3D points of the generated point cloud onto an image plane of the next viewpoint, and obtain the distance between each 3D point and the camera location of the next viewpoint by using the multiple view geometry calculation method [8].

  3) We compare each calculated distance to a ground-truth depth value from the depth image at the next viewpoint, and identify if the following condition is satisfied:

     $\big| \tilde{d} - d_{\text{GT}} \big| \leq \tau,$ (3)

     where $\tilde{d}$ denotes the distance calculated in the second step above, $d_{\text{GT}}$ denotes the ground-truth depth value at the pixel position where the 3D point is projected, and $\tau$ denotes some specified threshold.

  4) If the condition (3) is not satisfied, we generate a new 3D point by back-projecting a pixel of the value $d_{\text{GT}}$.

Setting $\tau$ appropriately is important to generate an accurate point cloud. If $\tau$ is too large, 3D points with similar locations will be considered as the same point. Consequently, fewer 3D points are generated, leading to faster rendering times; however, estimated depth images may contain many holes. Conversely, if $\tau$ is too small, the number of 3D points increases since overlapping points can be generated. This decreases the number of holes in depth images, but the rendering process takes a long time.
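The four steps and the thresholding condition (3) can be sketched as follows. This is a simplified illustration, not the authors' implementation, and it rests on several assumptions: a hypothetical pinhole `Camera` with identity rotation, depth values stored as the Euclidean distance to the camera center, zero depth meaning "no measurement", and new points being added for every valid pixel that no existing point explains within $\tau$ (including pixels onto which nothing projects). The sketch does not remove points that were added earlier.

```python
import math


class Camera:
    """Hypothetical pinhole camera with identity rotation, looking along +z."""

    def __init__(self, center, focal, width, height):
        self.center, self.focal = center, focal
        self.width, self.height = width, height

    def back_project(self, u, v, dist):
        """3D point at distance dist along the ray through pixel (u, v)."""
        dx = (u + 0.5 - self.width / 2) / self.focal
        dy = (v + 0.5 - self.height / 2) / self.focal
        norm = math.sqrt(dx * dx + dy * dy + 1.0)
        direction = (dx / norm, dy / norm, 1.0 / norm)
        return tuple(c + dist * k for c, k in zip(self.center, direction))

    def project(self, point):
        """Return (u, v, distance to camera), or None if outside the image."""
        x, y, z = (p - c for p, c in zip(point, self.center))
        if z <= 0.0:
            return None  # behind the camera
        u = math.floor(self.focal * x / z + self.width / 2)
        v = math.floor(self.focal * y / z + self.height / 2)
        if not (0 <= u < self.width and 0 <= v < self.height):
            return None
        return u, v, math.sqrt(x * x + y * y + z * z)


def build_point_cloud(cameras, depth_images, tau):
    """Accumulate 3D points view by view, skipping pixels already explained."""
    cloud = []
    for cam, depth in zip(cameras, depth_images):
        explained = set()
        for point in cloud:  # step 2: project existing points onto this view
            hit = cam.project(point)
            if hit is None:
                continue
            u, v, dist = hit
            d_gt = depth[v][u]
            if d_gt > 0.0 and abs(dist - d_gt) <= tau:  # step 3: condition (3)
                explained.add((u, v))
        for v, row in enumerate(depth):  # steps 1 and 4: back-project the rest
            for u, d_gt in enumerate(row):
                if d_gt > 0.0 and (u, v) not in explained:
                    cloud.append(cam.back_project(u, v, d_gt))
    return cloud
```

In this sketch, a larger `tau` makes more pixels count as explained, so fewer points are added per view, which mirrors the trade-off between point count and holes discussed above.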

Throughout the paper, we use a subset of training views for point cloud generation and refinement.

Difference with multi-view stereo (MVS) in point cloud generation. MVS is a standard approach for generating a point cloud or mesh from a set of images captured from many different views. We observed that the proposed point cloud generation method can generate more points than the standard MVS method [3] for similar computational time (footnote 1). Consequently, a point cloud generated by the proposed method can improve rendering quality compared to that generated by MVS. Within the proposed NeRF framework, a point cloud generated by the proposed point cloud generation method and that given by the standard MVS method resulted in 31.44 dB and 30.27 dB in PSNR, respectively (for the Pavillon dataset [19]; $\alpha = 1/2$, $N = 8$).

Footnote 1: With a standard graphics processing unit (GPU), the processing time of standard MVS is 86.78 seconds (sec.) and that of the proposed point cloud generation is 87.12 sec., both with 20 views.

3.4.2 Depth estimation from a point cloud in the test stage

Figure 6: A depth image by projecting a point cloud (left); a depth image by projecting a refined point cloud with hole filling (middle); the ground truth depth image (right).

To obtain a depth image at a novel viewpoint using a point cloud, we project the point cloud generated in Section 3.4.1 onto the image plane and calculate the distance from each 3D point to the camera location. If more than one 3D point is projected onto the same pixel location, we use the 3D point closest to the camera location for the distance calculation.
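The projection with a closest-point rule can be sketched as a simple depth buffer. The sketch assumes a hypothetical pinhole camera with identity rotation looking along +z, stores the Euclidean distance to the camera center as the depth value, and uses zero for pixels that receive no point; the function name and parameters are illustrative.

```python
import math


def render_depth(points, center, focal, width, height):
    """Project 3D points to a depth image, keeping the closest point per pixel."""
    depth = [[0.0] * width for _ in range(height)]  # 0.0 marks "no point yet"
    for px, py, pz in points:
        x, y, z = px - center[0], py - center[1], pz - center[2]
        if z <= 0.0:
            continue  # the point lies behind the camera
        u = math.floor(focal * x / z + width / 2)
        v = math.floor(focal * y / z + height / 2)
        if 0 <= u < width and 0 <= v < height:
            dist = math.sqrt(x * x + y * y + z * z)
            # Keep only the point closest to the camera for each pixel.
            if depth[v][u] == 0.0 or dist < depth[v][u]:
                depth[v][u] = dist
    return depth


# Two points on the optical axis hit the same pixel; a third is behind the camera.
cloud = [(0.0, 0.0, 5.0), (0.0, 0.0, 2.0), (0.0, 0.0, -1.0)]
image = render_depth(cloud, center=(0.0, 0.0, 0.0), focal=4.0, width=4, height=4)
```

Here the nearer of the two points that land on the same pixel determines its depth, and every pixel that no point reaches stays at zero.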

At a novel viewpoint, a projected depth image from a point cloud could have “holes”, i.e., pixels with zero values, if those do not have corresponding 3D point(s) in a point cloud. In projected depth images, however, one cannot identify if such holes correspond to background areas or are missing information on the surface of a foreground object due to limited 3D points.

In this section, we aim to fill in missing information on the object surface while maintaining background areas. To distinguish whether holes in projected depth images correspond to background area(s) or missing information on the surface of a foreground object, we use the following condition for a pixel of value $p$:

$\frac{p - \mu}{\sigma} > \kappa,$ (4)

where $\mu$ and $\sigma$ are the average and the standard deviation, respectively, calculated from the $M \times M$ neighboring pixels in a projected depth image whose center is the pixel of value $p$, and $\kappa$ is some specified threshold. If the condition (4) is satisfied, we determine that a hole is missing information on the surface, and fill that hole by applying the moving average filter with a kernel of size $M \times M$. If $\kappa$ is too large, there still may exist many holes with missing information on the surface of an object (not in background area(s)) even after the hole filling process. If $\kappa$ is too small, however, one may even fill holes in background area(s). Selecting an appropriate $\kappa$ value can generate more accurate/useful depth images by minimizing missing information on the object surface and mitigating hole-filling in the background areas.

Figure 6 shows examples of estimated depth images without and with the proposed hole filling process, and the ground-truth depth image. We observed that the proposed hole filling method estimates missing depth information for a foreground object, giving more appropriate depth maps. However, a few parts of the background that are supposed to have zero values are filled with some non-zero values. It is suboptimal in the perspective of depth estimation, but it is a simple method that can provide sufficiently useful information for proposed near-surface sampling in Section 3.3.

4 Results and discussion

4.1 Datasets

Figure 7: The Lego (1st), Ship (2nd), BlendedMVS (3rd), and Pavillon (4th) datasets.

We used the synthetic Lego and Ship datasets in original NeRF [17] (footnote 2), the real dataset with the identifier 5a8aa0fab18050187cbe060e in BlendedMVS [28], and the Pavillon scene dataset. Figure 7 shows these datasets. For each synthetic dataset, we used 150 training images and 50 test images, all with the spatial resolution of $800 \times 800$. In generating a point cloud (Section 3.4.1) for each synthetic dataset, we used 20 of 100 training images from the original dataset (footnote 3). In constructing a training dataset for each synthetic dataset, we selected 50 of 100 original test images by taking every other view and added them to the original training dataset. For the real dataset, we used 100 training images and 11 test images, all with the resolution of $574 \times 475$. In generating a point cloud, we used 20 of 100 training images (footnote 3). For all datasets, each instance has a different viewpoint. If not further specified, we used the above experimental setup throughout all experiments.

Footnote 2: Each original synthetic dataset consists of 100 training images and 100 test images; viewpoints are sampled on the upper hemisphere (with fixed diameter) around an object.

Footnote 3: We generated a point cloud with 20 viewpoints by sequentially using every fifth viewpoint from 100 viewpoints. We repeated the point cloud generation process 20 times, where each iteration consists of four steps in Section 3.4.1. (Section 3.4.1 describes the relation between the numbers of viewpoints and repetitions.)
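The view counts for the synthetic datasets can be reproduced with simple index slicing. This is only a sketch: the view indices are hypothetical, and the starting offsets (which view is taken first) are assumptions, since the text specifies only the strides.

```python
# Hypothetical view indices of one original synthetic dataset.
original_train = list(range(100))
original_test = list(range(100))

# Every fifth training view is used for point cloud generation (offset assumed).
point_cloud_views = original_train[::5]

# Every other test view is moved into the training set (offset assumed).
moved_to_train = original_test[::2]
test_views = original_test[1::2]

n_train = len(original_train) + len(moved_to_train)
n_test = len(test_views)
```

These slices give 20 views for point cloud generation, and `n_train` and `n_test` match the 150 training and 50 test images stated above.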

The chosen real data contains multi-view images taken around an object, and several images are captured from viewpoints closer to the object. In our experiments, we used the depth images included in [28], and used the blended color images reflecting view-dependent lighting [28] as the ground-truth color images.

We compared the proposed NeRF framework using near-surface sampling with a point cloud against the original NeRF, DONeRF [19], DSNeRF [5], and Instant-NGP [18]. For comparing performances between all five methods, we used the re-rendered Lego dataset and the Pavillon scene dataset to better fit the view cell methodology of DONeRF, which uses additional configurations for view cell generation and is forward-facing. We used 210 training images and 60 test images for these comparison experiments. For point cloud generation, we used 20 training images. For comparing performances between the proposed and original NeRF, we used all three datasets (Lego, Ship, and BlendedMVS), which are not necessarily forward-facing.

4.2 Experimental setup

Throughout experiments with different sampling ranges of the proposed surface-based sampling method, we assumed that the full sampling range of original NeRF [17], i.e., the radius of the blue fan-shape in Figure 4, is 4 (unitless). For synthetic datasets, we set half of the sampling range of proposed NeRF, i.e., $\alpha$ in (1)–(2), as $1/2$, $1/4$, $1/8$, and $1/16$. For the real dataset, we set $\alpha$ as $1$, $1/2$, $1/4$, and $1/8$. (We used larger sampling ranges in real dataset experiments compared to synthetic dataset experiments, since the depth quality of the real dataset is poorer than that of the synthetic datasets; see footnote 4.) To see the effects of depth estimation accuracy in the proposed NeRF framework, we also ran experiments with ground-truth depth images and estimated depth images via the proposed method. We set the number of sample points $N = 64$, except for experiments using different $N$'s.

Footnote 4: For the synthetic Lego and Ship datasets and the real BlendedMVS dataset, the PSNR value (in dB) for estimated depth in inference is 19.3, 16.8, and 10.4, respectively.

In experiments comparing different NeRF methods, we used different numbers of sampling points, i.e., $N$ in (2). For fair comparisons, the total number of sampling points per ray of original NeRF is set identical to those of proposed NeRF, DONeRF [19], DSNeRF [5], and Instant-NGP [18]. In the original NeRF approach, for each of the coarse and fine networks, we set the number of sample points per ray to 4, 8, 16, and 32. For the proposed NeRF, DONeRF, and DSNeRF, we set $N$ as 8, 16, 32, and 64, and used only one rendering network. Different from original NeRF, which uses samples with different locations for two different networks, Instant-NGP uses two networks that estimate color and density, respectively, but use samples with the same locations. For Instant-NGP, we set the number of samples per ray to 8, 16, 32, and 64. That is, in comparing different NeRF methods, we set the total number of sample points per ray as 8, 16, 32, and 64 consistently for all the NeRF methods.

The remaining hyperparameters of the proposed NeRF approach are as follows. In determining sampling locations (2), we randomly sampled $\gamma$ from the uniform distribution between $0$ and $2\alpha/N$. In the point cloud refinement condition (3), we set $\tau$ as $0.1$. In the hole filling condition (4), we set $\kappa$ and $M$ as $2$ and $11$, respectively.
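As a rough illustration of the sampling rule described above, the following Python sketch assumes that (2) places $N$ stratified samples in the interval $[d-\alpha,\, d+\alpha]$ around an estimated surface depth $d$, each shifted by a jitter $\gamma$ drawn uniformly from $[0, 2\alpha/N]$. The function name `near_surface_samples` and the exact form $t_i = d - \alpha + i \cdot 2\alpha/N + \gamma_i$ are assumptions made for this example and are not taken from the authors' code.

```python
import random


def near_surface_samples(depth, alpha, n_samples, rng=None):
    """Return n_samples distances along a ray inside [depth - alpha, depth + alpha].

    Assumed (hypothetical) reading of the sampling rule:
        t_i = depth - alpha + i * (2 * alpha / n_samples) + gamma_i,
    with gamma_i drawn uniformly from [0, 2 * alpha / n_samples].
    """
    if rng is None:
        rng = random.Random()
    bin_width = 2.0 * alpha / n_samples  # width of one stratum
    samples = []
    for i in range(n_samples):
        gamma = rng.uniform(0.0, bin_width)  # random jitter inside the i-th stratum
        samples.append(depth - alpha + i * bin_width + gamma)
    return samples


# Example: N = 64 samples with alpha = 1/16 around an estimated depth of 2.0.
ts = near_surface_samples(depth=2.0, alpha=1 / 16, n_samples=64, rng=random.Random(0))
```

Under this assumed form, every sample stays within $\alpha$ of the estimated depth, and the jitter keeps the samples from landing on the same fixed locations at every iteration.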

We used the following hyperparameters throughout all experiments. We set the total number of training iterations as 400,000, as the training losses tend to converge after 400,000 iterations. For each iteration, we set the batch size of input rays as 1024. We used the learning rate of $5\times 10^{-4}$ until 250,000 iterations, and reduced it to $5\times 10^{-5}$ after 250,000 iterations. We used the Adam optimizer.
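The learning-rate schedule stated above can be written as a simple step function. This is a minimal sketch; the function name `learning_rate` is hypothetical, and treating iteration 250,000 itself as still using the larger rate is an assumption, since the text does not specify the boundary.

```python
def learning_rate(iteration, switch_iter=250_000, lr_initial=5e-4, lr_final=5e-5):
    """Step schedule: lr_initial up to and including switch_iter, lr_final afterwards.

    Whether the switch iteration itself uses lr_initial is an assumption.
    """
    return lr_initial if iteration <= switch_iter else lr_final


# Learning rates at the start, just after the switch, and at the final iteration.
schedule_preview = [learning_rate(i) for i in (1, 250_001, 400_000)]
```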

For quantitative comparisons, we used the most representative measure, PSNR in dB, excluding the background area (if available). We used an NVIDIA GeForce RTX 4090 GPU (24 GB GDDR6X VRAM, 2.31 GHz), an Intel(R) Xeon(R) Gold 6326 CPU (2.90 GHz), and 503 GB of main memory.
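For reference, PSNR restricted to a foreground region can be computed as below. This is a sketch under the assumption that the background is excluded with a per-pixel boolean mask; the function name `masked_psnr` and the mask representation are hypothetical and may differ from the evaluation code used in the experiments.

```python
import math


def masked_psnr(rendered, reference, mask, max_value=1.0):
    """PSNR in dB computed only over pixels where mask is True.

    rendered, reference: flat sequences of pixel intensities in [0, max_value].
    mask: flat sequence of booleans; True marks foreground pixels to evaluate.
    """
    sq_errors = [(r - g) ** 2 for r, g, m in zip(rendered, reference, mask) if m]
    if not sq_errors:
        raise ValueError("mask selects no pixels")
    mse = sum(sq_errors) / len(sq_errors)
    if mse == 0.0:
        return math.inf  # identical images on the masked region
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: the last pixel is background and is ignored by the mask.
value = masked_psnr([0.5, 0.6, 0.0], [0.6, 0.6, 1.0], [True, True, False])
```

In the toy example, the large error on the masked-out background pixel does not affect the result.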

4.3 Comparisons with different sampling ranges in the proposed NeRF framework

Figure 8: PSNR (dB) comparisons with different sampling ranges, for three different datasets ($N=64$): (a) the Lego dataset, (b) the Ship dataset, and (c) the BlendedMVS dataset. The dotted and solid lines denote the rendering accuracy in PSNR values of proposed NeRF, with the ground-truth and estimated depth images, respectively.

Figure 9: Comparisons of rendered images via proposed NeRF for the Lego (1st row), Ship (2nd row), and BlendedMVS (3rd row) datasets, with different sampling ranges: (a) $\alpha=1/2$, (b) $\alpha=1/4$, (c) $\alpha=1/8$, and (d) ground truth (we used estimated depth images via the proposed method; $N=64$). The sampling ranges are scaled versions of the original NeRF's with $\alpha$'s in (2). Images in the 4th column are ground truths.

Using the proposed surface-based sampling method, we compared results between different sampling ranges, with either ground-truth or estimated depth images. First, we compare performances between different sampling ranges with the ground-truth depth images. The dotted lines in Figure 8 compare the rendering quality of proposed NeRF with different sampling ranges, for three different datasets. They demonstrate that, with the ground-truth depth information, the rendering accuracy improves as the sampling range becomes narrower. This is natural: the narrower the sampling range, the more sample points are located near the surface of an object.

Next, we compare performances between different sampling ranges, with the depth images estimated via the proposed point cloud generation and hole filling approaches. Figure 8 (solid lines) and Figure 9 compare the rendering quality of proposed NeRF with different sampling ranges, for three different datasets. In Figure 9, different columns show rendered images with different sampling ranges, with the ground truth images in the last column; different rows show rendered images for different datasets. Figures 8–9 demonstrate that the rendering quality of proposed NeRF improves as the sampling range becomes narrower, but only up to a certain sampling range, e.g., $1/8$ and $1/2$ of the full sampling range of original NeRF for synthetic data and real data, respectively. If the sampling range is too narrow, e.g., $1/16$ and $1/4$ for synthetic data and real data, respectively, the rendering accuracy degrades. This is because some estimated depth information is inaccurate, and an overly narrow range concentrates the samples around the inaccurately estimated depth, where no actual surface exists.
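The trade-off can be made concrete with a small check. The sketch below assumes the sampling interval is symmetric around the estimated depth, i.e., $[\hat{d}-\alpha,\, \hat{d}+\alpha]$, so the true surface is covered only when the depth error does not exceed $\alpha$. The function name and the numbers are hypothetical and chosen only for illustration.

```python
def surface_within_range(estimated_depth, true_depth, alpha):
    """True if the true surface lies inside [estimated_depth - alpha, estimated_depth + alpha]."""
    return abs(estimated_depth - true_depth) <= alpha


# Hypothetical depth error of about 0.1: covered by alpha = 1/8 but missed by alpha = 1/16.
depth_error_case = [(a, surface_within_range(2.1, 2.0, a)) for a in (1 / 8, 1 / 16)]
```

When the check fails, no sample on that ray can fall on the actual surface, which is consistent with the degradation observed for overly narrow ranges.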

Finally, we compare the rendering accuracy between the two proposed NeRF variants that use ground-truth and estimated depth images, respectively. Figure 8 demonstrates that in the proposed NeRF framework, using estimated depth images degrades the overall rendering accuracy compared to using the ground-truth depth, as one may expect. In particular, points sampled around the inaccurately estimated surface of an object degrade the rendering accuracy.

4.4 Rendering quality comparisons between different NeRF models

4.4.1 Comparisons between five different NeRF models

Table 1: PSNR (dB) comparisons with different numbers of samples per ray for different NeRF methods ($\alpha=1/16$ and $\alpha=1/2$ for the Lego and Pavillon datasets in [19], respectively).

(a) The Lego dataset

| $N$ | NeRF | DSNeRF | Instant-NGP | DONeRF | Proposed NeRF |
|---|---|---|---|---|---|
| 64 | 27.84 | 29.24 | 30.40 | 31.25 | 33.57 |
| 32 | 26.82 | 28.67 | 30.31 | 31.13 | 32.13 |
| 16 | 24.58 | 25.81 | 29.13 | 30.08 | 31.36 |
| 8 | 22.72 | 23.92 | 28.94 | 29.13 | 30.25 |

(b) The Pavillon scene dataset

| $N$ | NeRF | DSNeRF | Instant-NGP | DONeRF | Proposed NeRF |
|---|---|---|---|---|---|
| 64 | 25.13 | 27.61 | 32.09 | 32.00 | 33.21 |
| 32 | 22.26 | 26.15 | 31.71 | 31.99 | 32.40 |
| 16 | 19.25 | 23.89 | 30.60 | 31.71 | 31.97 |
| 8 | 17.52 | 21.18 | 28.42 | 31.29 | 31.44 |

Figure 10: Comparisons of rendered images with different NeRFs (the Lego dataset [19]; $N=8$, $\alpha=1/16$): (a) original NeRF, (b) DSNeRF, (c) Instant-NGP, (d) DONeRF, (e) proposed NeRF, and (f) ground truth.

Figure 11: Comparisons of rendered images with different NeRFs (the Pavillon scene dataset [19]; $N=8$, $\alpha=1/2$): (a) original NeRF, (b) DSNeRF, (c) Instant-NGP, (d) DONeRF, (e) proposed NeRF, and (f) ground truth.

Figure 12: PSNR (dB) comparisons with different numbers of samples per ray, for three different datasets (for Lego and Ship, $\alpha=1/16$; for BlendedMVS, $\alpha=1/4$): (a) the Lego dataset, (b) the Ship dataset, and (c) the BlendedMVS dataset. The green line with squares and yellow line with triangles denote the rendering accuracy of proposed and original NeRF, respectively.

Figure 13: Comparisons of rendered images via proposed NeRF for the Lego (1st row), Ship (2nd row), and BlendedMVS (3rd row) datasets, with different numbers of samples per ray: (a) $N=8$, (b) $N=16$, (c) $N=32$, (d) $N=64$, and (e) ground truth (for Lego and Ship, $\alpha=1/16$; for BlendedMVS, $\alpha=1/4$). Images in the 5th column are ground truths.

Figure 14: A closer look at rendered images by the original NeRF and proposed NeRF method for a real dataset in BlendedMVS [28] ($N=64$; we used the worst sampling range for the $N=64$ case, $\alpha=1/8$): (a) original NeRF, (b) proposed NeRF, and (c) ground truth.

Table 2: PSNR (dB) comparisons between the proposed method and original NeRF with different numbers of samples and sampling ranges. The numbers in parentheses denote the PSNR difference (dB) of proposed NeRF relative to original NeRF with the same $N$.

| $N$ | Method | Sampling range ($2\alpha$) | Lego | Ship | BlendedMVS |
|---|---|---|---|---|---|
| 64 | Original NeRF | 4 | 25.43 | 24.37 | 19.96 |
| 64 | Proposed NeRF | 1 | 26.19 (+0.76) | 24.85 (+0.48) | 20.92 (+0.96) |
| 64 | Proposed NeRF | 1/2 | 28.38 (+2.95) | 25.49 (+1.12) | 20.81 (+0.85) |
| 64 | Proposed NeRF | 1/4 | 28.87 (+3.44) | 25.63 (+1.26) | 19.81 (-0.15) |
| 64 | Proposed NeRF | 1/8 | 27.86 (+2.43) | 25.19 (+0.82) | |
| 32 | Original NeRF | 4 | 22.06 | 22.75 | 18.79 |
| 32 | Proposed NeRF | 1 | 23.50 (+1.44) | 23.57 (+0.82) | 19.92 (+1.13) |
| 32 | Proposed NeRF | 1/2 | 26.05 (+3.99) | 24.64 (+1.89) | 20.29 (+1.50) |
| 32 | Proposed NeRF | 1/4 | 27.45 (+5.39) | 25.09 (+2.34) | 19.54 (+0.75) |
| 32 | Proposed NeRF | 1/8 | 27.55 (+5.49) | 25.04 (+2.29) | |
| 16 | Original NeRF | 4 | 19.15 | 21.07 | 16.99 |
| 16 | Proposed NeRF | 1 | 21.10 (+1.95) | 22.59 (+1.52) | 18.30 (+1.31) |
| 16 | Proposed NeRF | 1/2 | 23.50 (+4.35) | 23.65 (+2.58) | 19.37 (+2.38) |
| 16 | Proposed NeRF | 1/4 | 25.16 (+6.01) | 24.23 (+3.16) | 18.99 (+2.00) |
| 16 | Proposed NeRF | 1/8 | 26.78 (+7.63) | 24.70 (+3.63) | |
| 8 | Original NeRF | 4 | 16.75 | 19.11 | 14.42 |
| 8 | Proposed NeRF | 1 | 19.51 (+2.76) | 21.48 (+2.37) | 16.73 (+2.31) |
| 8 | Proposed NeRF | 1/2 | 21.12 (+4.37) | 22.63 (+3.52) | 17.74 (+3.32) |
| 8 | Proposed NeRF | 1/4 | 22.67 (+5.92) | 23.21 (+4.10) | 18.06 (+3.64) |
| 8 | Proposed NeRF | 1/8 | 25.44 (+8.69) | 24.11 (+5.00) | |

Table 1 and Figures 10–11 compare the rendering quality between the five different NeRF models, with different numbers of samples. They demonstrate that the proposed NeRF outperforms original NeRF, DONeRF, DSNeRF, and Instant-NGP, regardless of the number of sample points per ray. Figures 10–11 show that the proposed NeRF framework produces significantly better details of a 3D object, compared to the original NeRF, DONeRF, DSNeRF, and Instant-NGP. Table 1 with two different datasets shows that rendering accuracy decreases as the number of sample points per ray decreases. This is observed in all five NeRF models, because with fewer sample points, the networks have less information with which to model a 3D object.

4.4.2 A closer look at original NeRF vs. proposed NeRF

Figure 12 compares the rendering performance between original and proposed NeRFs in particular, with different numbers of samples per ray. The figure demonstrates for the three different datasets that the proposed NeRF framework gives significantly better rendering accuracy than original NeRF, regardless of the number of sample points per ray. More importantly, Figure 12 shows that in the proposed NeRF framework, the performance degradation caused by reducing the number of samples per ray is significantly smaller than in original NeRF. In other words, proposed NeRF can largely maintain the rendering quality while reducing the number of samples per ray. Consequently, we conclude that with only a limited number of samples per ray, the proposed NeRF model can achieve better or comparable rendering accuracy, compared to the original NeRF model using many samples per ray. For the synthetic datasets, the proposed framework using $16$ samples per ray outperformed original NeRF using $64$ samples per ray; for the real data, the rendering accuracy of the proposed NeRF model using $16$ samples per ray is comparable with that of original NeRF using $64$ samples per ray. We expect that the smaller the error in estimated depth at a novel view, the narrower the sampling range that can be used while reducing the number of samples.

Figure 13 shows images rendered by the proposed framework for different numbers of sample points per ray, with three different datasets. Except for the extreme case of using only eight samples per ray ($N=8$), the quality of images rendered by the proposed framework degraded gradually as the number of samples per ray decreased. (When $N=8$, the rendering quality degraded significantly.) Together with the above results from Figure 12, this underscores the importance of the near-surface sampling approach.

Figure 14 compares images rendered by the original and proposed NeRF methods when $N=64$. For the proposed NeRF framework, we used the worst-performing sampling range for the BlendedMVS dataset. The proposed surface-based sampling method significantly improves the overall rendering quality of NeRF, but some dot artifacts remain. This is because, after the hole filling, some depth information is still missing or the filled holes have inaccurate depth. We conjecture that one could remove those artifacts by using a more sophisticated depth estimation method than the proposed simple hole filling scheme.

Table 2 summarizes PSNR values of the original and proposed NeRF models, for different numbers of samples per ray ($N$) and different sampling ranges ($2\alpha$). For each setup using an identical $N$ value, the proposed NeRF framework outperformed the original NeRF model for nearly every $\alpha$; the only exception in Table 2 is the BlendedMVS dataset with $N=64$ and $2\alpha=1/4$ ($-0.15$ dB).
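The parenthesized values in Table 2 can be reproduced directly from the reported PSNRs. The short Python sketch below does this for the $N=64$ rows with complete entries; the numbers are copied from Table 2, while the variable and function names are only illustrative.

```python
# PSNR values (dB) copied from Table 2 for N = 64; keys are the sampling range 2 * alpha.
original_n64 = {"Lego": 25.43, "Ship": 24.37, "BlendedMVS": 19.96}
proposed_n64 = {
    1.0: {"Lego": 26.19, "Ship": 24.85, "BlendedMVS": 20.92},
    0.5: {"Lego": 28.38, "Ship": 25.49, "BlendedMVS": 20.81},
    0.25: {"Lego": 28.87, "Ship": 25.63, "BlendedMVS": 19.81},
}


def psnr_gain(proposed, original):
    """Per-dataset PSNR difference (proposed minus original), rounded to 0.01 dB."""
    return {name: round(proposed[name] - original[name], 2) for name in proposed}


# Gain of proposed NeRF over original NeRF for each sampling range.
gains = {srange: psnr_gain(row, original_n64) for srange, row in proposed_n64.items()}

# Sampling range with the largest gain for each dataset (among the three ranges above).
best_range = {name: max(gains, key=lambda s: gains[s][name]) for name in original_n64}
```

Among these three ranges, the largest gain occurs at $2\alpha=1/4$ for both synthetic datasets and at $2\alpha=1$ for BlendedMVS, in line with the observation that poorer depth estimates favor wider sampling ranges.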

Table 3: Training time (hour) comparisons between the proposed method and four different NeRF models with different numbers of samples (the Pavillon scene dataset). We used 400,000 iterations throughout the experiments.

| $N$ | NeRF | DSNeRF | Instant-NGP | DONeRF | Proposed NeRF |
|---|---|---|---|---|---|
| 64 | 21.27 | 16.76 | 1.54 | 16.16 | 12.50 |
| 32 | 17.58 | 14.16 | 0.65 | 13.66 | 9.34 |
| 16 | 13.27 | 12.44 | 0.62 | 11.63 | 7.52 |
| 8 | 12.14 | 11.86 | 0.57 | 11.04 | 7.47 |

4.5 Training time comparisons between different NeRF models

Table 3 compares the training time between the five different NeRF methods, with different numbers of samples. The Instant-NGP model showed the fastest training time among the five NeRF models; note, however, that its rendering accuracy is worse than that of the proposed NeRF method (see Table 1). Except for Instant-NGP, the proposed NeRF method showed the fastest training time. Compared to the original NeRF in particular, the proposed NeRF was about 1.6 to 1.9 times faster (Table 3). The reason is that we trained a single fully-connected network in the proposed NeRF framework, whereas the original NeRF approach trained two fully-connected networks. Training DONeRF and DSNeRF took longer than training the proposed NeRF model (with the same number of iterations). This is natural because DONeRF and DSNeRF train an extra depth estimation network.

Regardless of the model, the training time decreased as the number of sample points decreased.
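The relative training speed can be checked from the values in Table 3. The sketch below computes the ratio of original NeRF training time to proposed NeRF training time for each $N$; the hours are copied from Table 3, and the variable names are illustrative.

```python
# Training times in hours, copied from Table 3 (Pavillon scene, 400,000 iterations).
hours = {
    64: {"NeRF": 21.27, "DSNeRF": 16.76, "Instant-NGP": 1.54, "DONeRF": 16.16, "Proposed": 12.50},
    32: {"NeRF": 17.58, "DSNeRF": 14.16, "Instant-NGP": 0.65, "DONeRF": 13.66, "Proposed": 9.34},
    16: {"NeRF": 13.27, "DSNeRF": 12.44, "Instant-NGP": 0.62, "DONeRF": 11.63, "Proposed": 7.52},
    8: {"NeRF": 12.14, "DSNeRF": 11.86, "Instant-NGP": 0.57, "DONeRF": 11.04, "Proposed": 7.47},
}

# Ratio of original NeRF training time to proposed NeRF training time, per N.
speedup_vs_nerf = {n: round(row["NeRF"] / row["Proposed"], 2) for n, row in hours.items()}

# Methods ordered from fastest to slowest for N = 64.
ranking_n64 = sorted(hours[64], key=hours[64].get)
```

The ratios fall between roughly 1.6 and 1.9 across the four values of $N$, and the $N=64$ ordering places the proposed method second after Instant-NGP.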

5 Conclusion

In NeRF methods, it is important to reduce the number of sample points per ray while maintaining the rendering quality, as using fewer samples can reduce training/inference time. Based on the assumption that the closer a sample point is to the surface of an object, the more important it is for rendering, we propose a near-surface sampling method for NeRF. The proposed framework samples 3D points only near the surface of an object, by estimating depth images from a 3D point cloud generated with a subset of training data and a simple hole filling method. For different datasets, the proposed NeRF framework significantly improves rendering quality over the original NeRF [17] and three state-of-the-art NeRF methods, DONeRF [19], DSNeRF [5], and Instant-NGP [18]. Compared to the original NeRF method in particular, the proposed framework can achieve significantly better rendering accuracy with only a quarter of the sample points per ray. In addition, the proposed near-surface sampling framework can make NeRF training nearly twice as fast, while improving the rendering quality with an appropriate sampling range parameter. The proposed method would be particularly useful for applications/technologies where visualizing details in novel views is important.

There are a number of avenues for future work to improve the proposed framework. First, the proposed framework takes a longer inference time than the original NeRF model, because projecting many 3D points to a view plane and estimating a depth image is slower than inference via the coarse network in original NeRF. We expect to reduce rendering time by speeding up the point cloud projection process. Second, the proposed NeRF framework is not completely end-to-end. In particular, the point cloud generation and refinement process runs in an offline stage and is not yet optimized for rendering. Therefore, we expect to improve the performance of the NeRF model by making it fully end-to-end, incorporating the point cloud generation and refinement process into training. Finally, we expect to further improve the rendering performance of the proposed method by using a more accurate depth estimation method.

Statements and Declarations

Funding The work of H. B. Yoo and I. Y. Chun was supported in part by NRF grants 2022R1F1A1074546 and RS-2023-00213455 funded by MSIT, and the BK21 FOUR Project. The work of I. Y. Chun was additionally supported in part by IITP grant 2019-0-00421 funded by MSIT, IBS grant R015-D1, KIAT grant P0022098 funded by MOTIE, the KEIT Technology Innovation program grant 20014967 funded by MOTIE, SKKU-SMC and SKKU-KBSMC Future Convergence Research Program grants, and SKKU seed grants. The work of H. M. Han and S. S. Hwang was supported by the NRF grant NRF-2022R1C1C1011084 funded by MSIT.

Competing interests The authors declare that they have no conflict of interest.

Ethics approval Not applicable

Consent to participate Not applicable

Consent for publication Not applicable

Availability of data and materials The NeRF dataset and BlendedMVS dataset are publicly available at https://paperswithcode.com/dataset/nerf and https://paperswithcode.com/dataset/blendedmvs

Code availability The code in this study is available from the corresponding author on reasonable request.

Authors’ contributions Conceptualization, H. B. Y., H. M. H., S. S. H., & I. Y. C.; data curation, H. M. H.; formal analysis, H. B. Y. & I. Y. C.; funding acquisition, S. S. H. & I. Y. C.; investigation, H. B. Y. & H. M. H.; methodology, H. B. Y., H. M. H., S. S. H., & I. Y. C.; project administration, S. S. H. & I. Y. C.; resources, I. Y. C.; software, H. B. Y. & H. M. H.; supervision, S. S. H. & I. Y. C.; validation, H. M. H., S. S. H., & I. Y. C.; visualization, H. M. H.; writing—original draft preparation, H. B. Y. & H. M. H.; writing—review and editing, I. Y. C. All authors have read and agreed to the published version of the manuscript.

References

  • Alan [1993] Alan W (1993) 3D computer Graphics. Addison-Wesley
  • Boss et al [2021] Boss M, Braun R, Jampani V, et al (2021) NeRD: Neural reflectance decomposition from image collections. In: IEEE/CVF International Conference on Computer Vision, pp 12664–12674, 10.1109/ICCV48922.2021.01245
  • Cernea [2020] Cernea D (2020) OpenMVS: Multi-view stereo reconstruction library, URL https://cdcseacave.github.io/openMVS
  • Chen and Williams [1993] Chen SE, Williams L (1993) View interpolation for image synthesis. In: Proceedings of the Conference on Computer Graphics and Interactive Techniques, pp 279–288, 10.1145/166117.166153
  • Deng et al [2022] Deng K, Liu A, Zhu JY, et al (2022) Depth-supervised NeRF: Fewer views and faster training for free. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12872–12881, 10.1109/CVPR52688.2022.01254
  • Draelos et al [2015] Draelos M, Qiu Q, Bronstein A, et al (2015) Intel realsense = real low cost gaze. In: IEEE International Conference on Image Processing, pp 2520–2524, 10.1109/ICIP.2015.7351256
  • Gortler et al [1996] Gortler SJ, Grzeszczuk R, Szeliski R, et al (1996) The lumigraph. In: Proceedings of the Conference on Computer Graphics and Interactive Techniques, pp 43–54, 10.1145/237170.237200
  • Hartley and Zisserman [2003] Hartley R, Zisserman A (2003) Multiple view geometry in computer vision, Second Edition. Cambridge University Press
  • Hu et al [2022] Hu T, Liu S, Chen Y, et al (2022) EfficientNeRF: Efficient neural radiance fields. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12902–12911, 10.1109/CVPR52688.2022.01256
  • Johari et al [2022] Johari MM, Lepoittevin Y, Fleuret F (2022) GeoNeRF: Generalizing NeRF with geometry priors. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 18344–18347, 10.1109/CVPR52688.2022.01782
  • Levoy and Hanrahan [1996] Levoy M, Hanrahan P (1996) Light field rendering. In: Proceedings of the Conference on Computer Graphics and Interactive Techniques, pp 31–42, 10.1145/237170.237199
  • Lin et al [2021] Lin CH, Ma WC, Torralba A, et al (2021) BARF: Bundle-adjusting neural radiance fields. In: IEEE/CVF International Conference on Computer Vision, pp 5721–5731, 10.1109/ICCV48922.2021.00569
  • Lindell et al [2021] Lindell DB, Martel JNP, Wetzstein G (2021) AutoInt: Automatic integration for fast neural volume rendering. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 14551–14560, 10.1109/CVPR46437.2021.01432
  • Liu et al [2020] Liu L, Gu J, Lin KZ, et al (2020) Neural sparse voxel fields. In: Proceedings of the International Conference on Neural Information Processing Systems, pp 15651–15663
  • Mankoff and Russo [2013] Mankoff K, Russo T (2013) The kinect: A low-cost, high-resolution, short-range 3d camera. Earth Surface Processes and Landforms 38:926–936. 10.1002/esp.3332
  • Martin-Brualla et al [2021] Martin-Brualla R, Radwan N, Sajjadi MSM, et al (2021) NeRF in the Wild: Neural radiance fields for unconstrained photo collections. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7206–7215, 10.1109/CVPR46437.2021.00713
  • Mildenhall et al [2020] Mildenhall B, Srinivasan PP, Tancik M, et al (2020) NeRF: Representing scenes as neural radiance fields for view synthesis. In: Proceedings of the European Conference on Computer Vision, pp 405–421, 10.1007/978-3-030-58452-8_24
  • Müller et al [2022] Müller T, Evans A, Schied C, et al (2022) Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans Graph 41. 10.1145/3528223.3530127
  • Neff et al [2021] Neff T, Stadlbauer P, Parger M, et al (2021) DONeRF: Towards real-time rendering of neural radiance fields using depth oracle networks. Computer Graphics Forum 40:45–49. 10.1111/cgf.14340
  • Niemeyer and Geiger [2021] Niemeyer M, Geiger A (2021) GIRAFFE: Representing scenes as compositional generative neural feature fields. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11448–11459, 10.1109/CVPR46437.2021.01129
  • Park et al [2021] Park K, Sinha U, Barron JT, et al (2021) Nerfies: Deformable neural radiance fields. In: IEEE/CVF International Conference on Computer Vision, pp 5845–5854, 10.1109/ICCV48922.2021.00581
  • Pumarola et al [2021] Pumarola A, Corona E, Pons-Moll G, et al (2021) D-NeRF: Neural radiance fields for dynamic scenes. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10313–10322, 10.1109/CVPR46437.2021.01018
  • Rebain et al [2021] Rebain D, Jiang W, Yazdani S, et al (2021) DeRF: Decomposed radiance fields. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 14148–14156, 10.1109/CVPR46437.2021.01393
  • Shade et al [1998] Shade J, Gortler S, He Lw, et al (1998) Layered depth images. In: Proceedings of the Conference on Computer Graphics and Interactive Techniques, pp 231–242, 10.1145/280814.280882
  • Srinivasan et al [2021] Srinivasan PP, Deng B, Zhang X, et al (2021) NeRV: Neural reflectance and visibility fields for relighting and view synthesis. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7491–7500, 10.1109/CVPR46437.2021.00741
  • Xie et al [2021] Xie C, Park K, Martin-Brualla R, et al (2021) Fig-NeRF: Figure-ground neural radiance fields for 3d object category modelling. In: International Conference on 3D Vision, pp 962–971, 10.1109/3DV53792.2021.00104
  • Xu et al [2022] Xu Q, Xu Z, Philip J, et al (2022) Point-NeRF: Point-based neural radiance fields. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5438–5448, 10.1109/CVPR52688.2022.00536
  • Yao et al [2020] Yao Y, Luo Z, Li S, et al (2020) BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1787–1796, 10.1109/cvpr42600.2020.00186
  • Yen-Chen et al [2021] Yen-Chen L, Florence P, Barron JT, et al (2021) iNeRF: Inverting neural radiance fields for pose estimation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp 1323–1330, 10.1109/IROS51168.2021.9636708