\equalcont

These authors contributed equally to this work.

[2]\fnmSung Soo \surHwang

[1,3]\fnmIl Yong \surChun

[1]\orgdivDepartment of Electrical and Computer Engineering, \orgnameSungkyunkwan University, \orgaddress\citySuwon, \postcode16419, \countryRepublic of Korea

[2]\orgdivDepartment of Information and Communication Engineering, \orgnameHandong Global University, \orgaddress\cityPohang, \postcode37554, \countryRepublic of Korea

[3]\orgdivDepartments of Artificial Intelligence, Semiconductor Convergence Engineering, and Display Convergence Engineering, and Center for Neuroscience Imaging Research, Institute for Basic Science, \orgnameSungkyunkwan University, \orgaddress\citySuwon, \postcode16419, \countryRepublic of Korea

Improving Neural Radiance Fields Using Near-Surface Sampling With Point Cloud Generation

\fnmHye Bin \surYoo \fnmHyun Min \surHan

Abstract

Neural radiance field (NeRF) is an emerging view synthesis method that samples points in a three-dimensional (3D) space and estimates their existence and color probabilities. A disadvantage of NeRF is that it requires a long training time since it samples many 3D points. In addition, if one samples points from occluded regions or in space where an object is unlikely to exist, the rendering quality of NeRF can degrade. These issues can be addressed by estimating the geometry of a 3D scene. This paper proposes a near-surface sampling framework to improve the rendering quality of NeRF. To this end, the proposed method estimates the surface of a 3D object using depth images of the training set and performs sampling only near the estimated surface. To obtain depth information on a novel view, the paper proposes a 3D point cloud generation method and a simple refining method for projected depth from a point cloud. Experimental results show that the proposed near-surface sampling NeRF framework can significantly improve the rendering quality, compared to the original NeRF and three different state-of-the-art NeRF methods. In addition, one can significantly reduce the training time of a NeRF model with the proposed near-surface sampling framework.

keywords:

Neural radiance field (NeRF), neural rendering, near-surface sampling, point cloud, depth image, three-dimensional geometry

1 Introduction

Recently, metaverse and virtual reality applications have been rapidly drawing attention. In such applications, it is important to generate novel views accurately. One way to achieve this goal is to generate a three-dimensional (3D) model first and follow a conventional rendering pipeline [1]. However, generating a 3D model requires a lot of time and effort.

Image-based rendering (IBR) is another approach that generates novel views without explicitly generating a 3D model. Several methods generate a novel view using image morphing [4]. The Layered Depth Images method [24] stores multiple depth and color values for each pixel to effectively fill the hole behind the foreground object in a novel view. Light fields [11] and the Lumigraph [7], which express light rays as a function, were also proposed.

Recently, among IBR methods, neural radiance field (NeRF) [17] has been rapidly gaining attention. A ray, a core concept of NeRF, is a straight line cast from the camera position toward an object. A NeRF network predicts the color and density of each 3D point sampled along each ray. Then a novel view is obtained by performing a line integral using these colors and densities.

The original NeRF [17] performs sampling within a range that includes the entire 3D object. This paper proposes to use depth information to sample 3D points only around the surface of an object in NeRF, where we consider the practical scenario that depth information is available at hand (from depth cameras) only in a training dataset. Because measured/estimated depth maps may be inaccurate due to capturing environments, we propose to generate a 3D point cloud using the available (inaccurate) depth information in training, and to use this 3D point cloud to estimate a depth image for each novel view in test (i.e., inference). Figure 1 gives a brief overview of the proposed NeRF framework. Simply projecting a 3D point cloud onto a novel view generates a rather rough depth image. To obtain more accurate depth images, we additionally propose a refining method that removes unnecessary 3D points in generating a point cloud and fills the holes of the projected depth image. Simply put, to improve NeRF, the paper proposes an advanced sampling method around the surface of an object/a scene using depth images estimated from a generated point cloud. Our experimental results with different datasets demonstrate that the proposed framework outperforms original NeRF and three different state-of-the-art NeRF methods.

Figure 1: The brief overview of the proposed NeRF framework that samples points near the estimated surface from a point cloud.

The rest of the paper is organized as follows. Section 2 reviews NeRF and its follow-up works, focusing on those most closely related to ours, and presents differences between the proposed NeRF and existing depth-based NeRFs. Section 3 provides the motivation and details of the proposed method, Section 4 reports experiments and analysis, and Section 5 discusses conclusions, limitations, and future work.

2 Related works

2.1 NeRF

NeRF [17] is a state-of-the-art view synthesis technology that samples points on rays and synthesizes views through differentiable volume rendering. The input of this algorithm is a single continuous five-dimensional (5D) coordinate consisting of a 3D spatial location and a two-dimensional viewing direction. The output is a volume density and view-dependent emitted radiance at the corresponding spatial location. In other words, the key idea of NeRF is to train a neural network that takes a 5D coordinate and predicts a view-dependent color value and a volume density value. Using those two predicted values, a final rendered color value is determined by performing a line integral with classical volume rendering. To further improve the rendering quality, NeRF uses the following two techniques: positional encoding and hierarchical volume sampling. Positional encoding increases the dimension of input data; hierarchical volume sampling allocates more samples to regions that are expected to include visible content. Hierarchical volume sampling is so named because it performs sampling with two different networks, a “coarse” one and a “fine” one. For each ray, the coarse network gives a view-dependent emitted color and volume density at $N_{\text{c}}$ points that are sampled along the ray with the stratified sampling method. A piecewise-constant probability density function (PDF) is generated (along each ray) by normalizing contribution weights that are calculated with volume densities and the distances between adjacent samples of the $N_{\text{c}}$ points. After integrating the generated PDF to obtain the cumulative distribution function, $N_{\text{f}}$ points are sampled through inverse transform sampling. The fine network gives a view-dependent color value and volume density using the $N_{\text{c}}$ points and those more informed $N_{\text{f}}$ points. Finally, one calculates the final rendering of the corresponding ray with all $N_{\text{c}}+N_{\text{f}}$ points.
Through this process, NeRF can represent a 3D object (in 360 degrees) and forward-facing scenes with continuous views. However, NeRF in its original form has several limitations. For example, it can represent only static scenes; its training and inference are slow; one NeRF network represents only one object/scene.
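To make the coarse-to-fine step concrete, the following is a minimal NumPy sketch for a single ray, not the reference NeRF implementation. It assumes the usual volume rendering weights $w_i=T_i(1-\exp(-\sigma_i\delta_i))$; the function names `contribution_weights` and `sample_fine`, the open-ended last segment, and the small constant `eps` are illustrative choices.

```python
import numpy as np

def contribution_weights(sigma, t):
    """Weights w_i = T_i * (1 - exp(-sigma_i * delta_i)) for samples at depths t."""
    delta = np.diff(t, append=t[-1] + 1e10)    # distance to the next sample (last one open-ended)
    alpha = 1.0 - np.exp(-sigma * delta)       # opacity of each segment
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))  # accumulated transmittance T_i
    return trans * alpha

def sample_fine(t, weights, n_fine, rng, eps=1e-5):
    """Inverse transform sampling from the piecewise-constant PDF over [t_i, t_{i+1}]."""
    pdf = weights[:-1] + eps                   # one bin per pair of adjacent coarse samples
    pdf = pdf / pdf.sum()
    cdf = np.concatenate(([0.0], np.cumsum(pdf)))
    u = rng.uniform(size=n_fine)
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(pdf) - 1)
    frac = (u - cdf[idx]) / pdf[idx]           # position inside the selected bin
    return np.sort(t[idx] + frac * (t[idx + 1] - t[idx]))
```

Given coarse depths `t` and densities `sigma`, `sample_fine(t, contribution_weights(sigma, t), n_fine, rng)` concentrates the $N_{\text{f}}$ fine samples in bins with large weights, which is the behavior the near-surface sampling in Section 3 aims to obtain directly from depth.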

2.2 Follow-up works of NeRF

Researchers have been improving the original NeRF model [17] in various aspects. The first aspect is to reduce the training time of NeRF models while maintaining rendering accuracy [9, 5, 27, 18]. [9] reduces training time by proposing a new sampling method that uses fewer samples per ray. [5] supervises depth to use a smaller number of views in training. [27] accelerates training by quickly generating an initial rough point cloud and refining it in an iterative manner. [18] uses a learnable encoding method instead of positional encoding, and updates only parameters related to sampling positions instead of updating all parameters.

The second aspect is to improve the inference time of NeRF models [23, 14, 13, 19, 18]. [23] and [14] reduce inference time by spatially decomposing and processing the scene: [23] uses a spatially decomposed scene and a small network for each space; [14] skips spaces with irrelevant scenes among the decomposed spaces during inference. [13] uses a volume integral calculation network instead of the classical integral calculation method to shorten inference. [19] uses a rendering pipeline that includes a network to predict the optimal sample locations on rays to reduce inference time. Using a learnable encoding method instead of positional encoding [18] can also accelerate inference.

The third aspect is to consider different scenarios with NeRF models [29, 12, 26, 20, 10, 21, 22, 25, 2, 16]. [29] additionally estimates camera pose. [12] considers the case that camera poses are imperfect or unknown. [26, 20, 10] consider multi-object/scene representation. In particular, [26] disentangles foreground and background. Dynamic scene representation [21, 22] and relighting [25, 2, 16] make NeRF applicable to changing scenes rather than only static scenes.

2.3 Depth-based NeRFs and their relations with the proposed NeRF framework

Depth oracle neural radiance field (DONeRF) [19] uses ground-truth depth images of the training set to train ideal sample locations on rays, and performs sampling in the estimated locations. However, DONeRF works only on forward-facing scenes where all camera poses belong to a bounding box called the view cell. Depth supervised neural radiance field (DSNeRF) [5] uses a sparse depth map estimated with the structure from motion technique and adds an optimization process to the original NeRF using estimated depth information, to achieve the best rendering performance of original NeRF with fewer training iterations and images.

Similar to DONeRF, we aim to improve the quality of rendered images by using depth images available at hand in a training dataset. Note, however, that different from DONeRF, the proposed method does not use the view cell information that is required in DONeRF, and is applicable with less restricted camera positions. Similar to DSNeRF, we use depth information by leveraging a point cloud. However, the proposed framework and DSNeRF use a point cloud in a different way. DSNeRF uses a point cloud to adjust the volume density function of NeRF. Different from this, the proposed framework uses a point cloud to directly estimate the distance to the object surface from a camera.

3 Proposed method

3.1 Motivation

In NeRF [17], there is room for improvement in rendering accuracy. NeRF uses a hierarchical volume sampling method that performs sampling twice: “rough” sampling with a stratified sampling approach and “fine” sampling in the space where an object is likely to exist. See details in Section 2.1. The stratified sampling approach in NeRF divides a specified range into many bins and selects a sample uniformly at random from each bin. In the stratified sampling process, sampling is performed not only in the space where the object exists, but also in free space or occluded regions. Sampling in free space and occluded regions may degrade rendering quality. If one can sample points only around an object in the rough sampling stage, the rendering performance might improve even without the fine sampling process.

To show the effects of the sampling density around an object on the rendering quality, we ran simple experiments with different sampling ranges around the surface of an object. Figure 2 shows the rendering accuracy in peak signal-to-noise ratio (PSNR) with different sampling ranges, where we increased the default sampling range of NeRF by a factor of $2$, $4$, and $8$ by increasing the distances between two samples. As the sampling range increases, i.e., the sampling density around an object decreases, the rendering accuracy rapidly degrades. We observed from these experiments that narrowing the sampling range around an object can improve the rendering quality in NeRF. This is consistent with the hierarchical volume sampling scheme of original NeRF that re-extracts samples with high volume density values to increase rendering efficiency.

Recently, diverse low-cost depth cameras with high accuracy have been proposed [15, 6]. Depth cameras (using multi-view) can measure the distance between an object and the device, giving additional 3D information of an object. We conjecture that if we sample points on a 3D ray only around the surface of an object, the rendering quality of NeRF improves.

3.2 Overview

Figure 3 illustrates the overall process of the proposed framework. A training set consists of color images and depth images, and at the training stage we use both. In particular, we use depth images to sample in the area close to the surface of the object in a 3D space, and we refer to this sampling strategy as surface-based sampling. By using the sample points obtained through surface-based sampling, we train the NeRF model. At the offline stage, we use depth images of the training set to generate a point cloud and save this point cloud for inference. At the test stage, we use the point cloud saved at the offline stage to generate a depth image corresponding to a novel view. We further refine depth images through computationally efficient hole filling for surface-based sampling. Using points sampled only around the surface of an object that is estimated with a refined depth, we render images of novel views with a single NeRF network.

3.3 Surface-based sampling

Figure 4 illustrates the difference between the sampling range of the original NeRF’s sampling method (blue) and that of surface-based sampling method (orange). Different from original NeRF that samples 3D points at a wide range that includes the entire 3D object, the proposed surface-based sampling method mainly samples those around the surface of the object.

We now describe the geometry of the proposed surface-based sampling method for each ray of each view. As in the original NeRF, we assume that each ray is cast from the location of a camera (see Figure 4). We define the location of the camera on each ray as $0$. The distance between the camera and an object is the depth value from a depth image, and we denote it as $d$. Let half of some specified sampling range be $\alpha$. Then, the location of the point nearest to the camera within the sampling range is

$S_{r,0}=d-\alpha.$ (1)

Now, we determine the location of the $n$th sample for each ray (considering that a ray originates from the camera location, $0$) by

$S_{r,n}=S_{r,0}+(n-1)\frac{2\alpha}{N}+\gamma,\quad n=1,\ldots,N,$ (2)

where $N$ is the number of sample points for each ray, and $\gamma$ is a random number generated between $0$ and $2\alpha/N$. We perform stratified sampling near the surface of an object, where we determine the sample locations by (2). In (2), $2\alpha/N$ is the length of each bin, playing the same role as the bin length in stratified sampling of the original NeRF method. Here, the parameter $\alpha$ determines the sampling range; if $N$ is fixed, $\alpha$ ultimately affects the sampling density around the surface. As $\alpha$ decreases, each bin becomes shorter and the sample points are expected to lie closer together, so the sampling density near the surface increases. As $\alpha$ increases, each bin becomes longer and the sample points are expected to lie farther apart, so the sampling density near the surface decreases.

Different from the two-step network sampling scheme in original NeRF, the proposed framework directly samples points near the surface of an object in a single step by using depth information in the near-surface sampling scheme (2), i.e., it uses a single network. We expect that if the depth $d$ to the surface of a 3D object is accurately estimated, the rendering quality improves by using small $\alpha$, i.e., by densely sampling 3D points. If it is poorly estimated, we expect that small $\alpha$ instead degrades the rendering quality. With fixed $N$, we recommend setting $\alpha$ according to the accuracy of the depth images.
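As an illustration, the sampling rule in (1)–(2) can be sketched in a few lines of NumPy. This is a sketch under stated assumptions rather than the authors' code: the function name `surface_samples` is hypothetical, and drawing $\gamma$ independently for every sample (rather than once per ray) is our reading of the stratified scheme.

```python
import numpy as np

def surface_samples(d, alpha, n_samples, rng):
    """Stratified sample locations in [d - alpha, d + alpha] following Eqs. (1)-(2).

    d: scalar or array of per-ray depths; returns an array of shape d.shape + (n_samples,).
    """
    d = np.asarray(d, dtype=float)[..., None]          # add a trailing axis for the samples
    bin_len = 2.0 * alpha / n_samples                  # length of each bin, 2*alpha/N
    s0 = d - alpha                                     # Eq. (1): nearest end of the range
    n = np.arange(1, n_samples + 1)                    # n = 1, ..., N
    # Assumption: gamma is drawn independently for every sample of every ray.
    gamma = rng.uniform(0.0, bin_len, size=d.shape[:-1] + (n_samples,))
    return s0 + (n - 1) * bin_len + gamma              # Eq. (2)
```

For example, with $d=3$, $\alpha=1/4$, and $N=8$, every returned location lies in $[2.75, 3.25]$ with exactly one sample per bin of length $1/16$; halving $\alpha$ halves the bin length and thus doubles the sampling density near the surface.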

3.4 Depth image generation for novel views

In the training stage, we perform surface-based sampling without any additional process, assuming that a depth image for each view is available. In the test stage, however, we assume that depth images are unavailable, so we perform depth estimation for a novel view for surface-based sampling. For depth estimation, in the offline stage, we generate and save a point cloud as shown in Figure 3. In the test stage, we use this point cloud to estimate depth images for novel views. Using this depth estimation process, surface-based sampling can be performed without a ground truth depth image in the test stage.

3.4.1 Point cloud generation and refinement in the offline stage

Figure 5 illustrates the key concept of the proposed point cloud generation and refinement method. To improve the accuracy of depth estimation, we generate 3D points with a subset of training images, by repeatedly eliminating inaccurate points. In constructing a subset of training images, we keep a sufficient and uniform distance between adjacent viewpoints. This setup is more efficient in constructing a 3D point cloud, compared to the setup that uses the entire set of training views. See details of this experimental setup later in Section 4.2.

Each iteration consists of the following four steps, and we repeat them as many times as the cardinality of the subset of training images, where we sequentially follow the trajectory of viewpoints in the subset of training data:

1) We generate a point cloud using a depth image from a viewpoint.

2) We project 3D points of the generated point cloud onto an image plane of the next viewpoint, and obtain the distance between each 3D point and the camera location of the next viewpoint by using the multiple view geometry calculation method [8].

3) We compare each calculated distance to a ground-truth depth value from the depth image at the next viewpoint, and identify if the following condition is satisfied:

$\big|\tilde{d}-d_{\text{GT}}\big|\leq\tau,$ (3)

where $\tilde{d}$ denotes the calculated distance using the second step above, $d_{\text{GT}}$ denotes the ground-truth depth value of a pixel position where the 3D point is projected, and $\tau$ denotes some specified threshold.

4) If the condition (3) is not satisfied, we generate a new 3D point by back-projecting a pixel of the value $d_{\text{GT}}$.

Setting $\tau$ appropriately is important to generate an accurate point cloud. If $\tau$ is too large, 3D points with similar locations are considered as the same point. Consequently, fewer 3D points are generated, leading to faster rendering; however, estimated depth images may contain many holes. Conversely, if $\tau$ is too small, the number of 3D points increases since overlapping points can be generated. This decreases the number of holes in depth images, but lengthens the rendering process.
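The four steps above can be sketched as follows. This is a minimal NumPy illustration under several assumptions that the text does not fix: a pinhole camera with intrinsics `K` and $4\times 4$ camera-to-world poses, depth measured along the optical axis, zero depth marking background, nearest-pixel rounding in projection, and pixels onto which no existing point projects being treated like pixels that fail condition (3). The function names are hypothetical, and existing points are kept rather than removed.

```python
import numpy as np

def backproject(depth, K, c2w, mask=None):
    """Lift pixels with positive depth (optionally restricted by mask) to world-space points."""
    keep = depth > 0 if mask is None else (mask & (depth > 0))
    v, u = np.nonzero(keep)
    z = depth[v, u]
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z
    cam = np.stack([x, y, z, np.ones_like(z)], axis=0)     # homogeneous camera coordinates
    return (c2w @ cam)[:3].T                               # (P, 3) world coordinates

def project(points, K, c2w, shape):
    """Project world points into a view; return pixel indices, depths, and a validity mask."""
    w2c = np.linalg.inv(c2w)
    cam = w2c[:3, :3] @ points.T + w2c[:3, 3:4]
    z = cam[2]
    valid = z > 1e-8                                       # in front of the camera
    zs = np.where(valid, z, 1.0)                           # avoid division by zero
    u = np.rint(K[0, 0] * cam[0] / zs + K[0, 2]).astype(int)
    v = np.rint(K[1, 1] * cam[1] / zs + K[1, 2]).astype(int)
    h, w = shape
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return u, v, z, valid

def build_point_cloud(depths, K, poses, tau=0.1):
    """Sequentially grow a point cloud over views, following steps 1)-4)."""
    cloud = backproject(depths[0], K, poses[0])                       # step 1
    for depth, pose in zip(depths[1:], poses[1:]):
        u, v, z, valid = project(cloud, K, pose, depth.shape)         # step 2
        ok = valid.copy()
        ok[valid] = np.abs(z[valid] - depth[v[valid], u[valid]]) <= tau   # step 3, condition (3)
        explained = np.zeros(depth.shape, dtype=bool)
        explained[v[ok], u[ok]] = True
        new_points = backproject(depth, K, pose, mask=~explained)    # step 4
        cloud = np.vstack([cloud, new_points])
    return cloud
```

In this sketch, a larger `tau` marks more pixels as already explained, so fewer points are added per view, while a smaller `tau` adds near-duplicate points; this mirrors the trade-off described above.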

Throughout the paper, we use a subset of training views for point cloud generation and refinement.

Difference with multi-view stereo (MVS) in point cloud generation. MVS is a standard approach for generating a point cloud or mesh from a set of images captured from many different views. We observed that the proposed point cloud generation method can generate more points than the standard MVS method [3] for similar computational time (with a standard graphics processing unit (GPU), the processing time of standard MVS is $86.78$ seconds and that of the proposed point cloud generation is $87.12$ seconds, both with $20$ views). Consequently, a point cloud generated by the proposed method can improve rendering quality compared to that generated by MVS. Within the proposed NeRF framework, a point cloud generated by the proposed point cloud generation method and that given by the standard MVS method resulted in $31.44$ dB and $30.27$ dB in PSNR, respectively (for the Pavillon dataset [19]; $\alpha=1/2$, $N=8$).

3.4.2 Depth estimation from a point cloud in the test stage

To obtain a depth image at a novel viewpoint using a point cloud, we project the point cloud generated in Section 3.4.1 onto the image plane and calculate the distance from each 3D point to the camera location. If more than one 3D point is projected onto the same pixel location, we use the 3D point closest to the camera location for the distance calculation.
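The projection with the closest-point rule amounts to a z-buffer. The following NumPy sketch shows one way to realize it, assuming a pinhole camera with intrinsics `K`, a $4\times 4$ camera-to-world pose `c2w`, depth measured along the optical axis, and nearest-pixel rounding; the function name `render_depth` is hypothetical.

```python
import numpy as np

def render_depth(points, K, c2w, shape):
    """Project a world-space point cloud to a depth image, keeping the nearest point per pixel."""
    w2c = np.linalg.inv(c2w)
    cam = w2c[:3, :3] @ points.T + w2c[:3, 3:4]       # points in camera coordinates
    z = cam[2]
    front = z > 1e-8                                  # discard points behind the camera
    cam, z = cam[:, front], z[front]
    u = np.rint(K[0, 0] * cam[0] / z + K[0, 2]).astype(int)
    v = np.rint(K[1, 1] * cam[1] / z + K[1, 2]).astype(int)
    h, w = shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.full(h * w, np.inf)
    # z-buffer: unbuffered minimum so that repeated pixel indices keep the smallest depth
    np.minimum.at(depth, v[inside] * w + u[inside], z[inside])
    depth[np.isinf(depth)] = 0.0                      # pixels hit by no point become holes
    return depth.reshape(h, w)
```

Pixels that receive no 3D point are left at zero, which produces the holes that the refinement step discussed next has to handle.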

At a novel viewpoint, a projected depth image from a point cloud could have “holes”, i.e., pixels with zero values, if those do not have corresponding 3D point(s) in a point cloud. In projected depth images, however, one cannot identify if such holes correspond to background areas or are missing information on the surface of a foreground object due to limited 3D points.

In this section, we aim to fill in missing information on the object surface while maintaining background areas. To distinguish whether holes in projected depth images correspond to background area(s) or missing information on the surface of a foreground object, we use the following condition for a pixel of value $p$:

$\frac{p-\mu}{\sigma}>\kappa,$ (4)

where $\mu$ and $\sigma$ are the average and the standard deviation, respectively, calculated from the $M\times M$ neighboring pixels in a projected depth image that are centered at the pixel of value $p$, and $\kappa$ is some specified threshold. If the condition (4) is satisfied, we determine that a hole is missing information on the surface, and fill that hole by applying the moving average filter with a kernel of size $M\times M$. If $\kappa$ is too large, there still may exist many holes with missing information on the surface of an object (not in background area(s)) even after the hole filling process. If $\kappa$ is too small, however, one may even fill holes in background area(s). Selecting an appropriate $\kappa$ value can generate more accurate/useful depth images by minimizing missing information on the object surface and limiting hole filling in the background areas.
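A possible realization of this hole filling test is sketched below in NumPy. It departs from the printed form of (4) in one labeled respect: because a hole has $p=0$ and depths are non-negative, the sketch measures the deviation in magnitude, $|p-\mu|/\sigma$, which is our interpretation and is not confirmed by the text. Further assumptions are that $\mu$ and $\sigma$ are computed over all pixels in the window (including zeros), that image borders are edge-padded, that $M$ is odd, and that the fill value is the window mean; the names `box_mean` and `fill_holes` are hypothetical.

```python
import numpy as np

def box_mean(img, m):
    """Mean over an m x m window centred on each pixel (m odd, edge-replicated padding)."""
    r = m // 2
    padded = np.pad(img, r, mode="edge")
    ii = np.pad(padded, ((1, 0), (1, 0))).cumsum(0).cumsum(1)    # integral image
    h, w = img.shape
    total = ii[m:m + h, m:m + w] - ii[:h, m:m + w] - ii[m:m + h, :w] + ii[:h, :w]
    return total / (m * m)

def fill_holes(depth, kappa=2.0, m=11):
    """Fill zero-valued pixels whose normalized deviation from the local mean exceeds kappa."""
    depth = np.asarray(depth, dtype=float)
    mu = box_mean(depth, m)
    var = np.maximum(box_mean(depth ** 2, m) - mu ** 2, 0.0)
    sigma = np.sqrt(var)
    score = np.zeros_like(depth)
    # Assumption: magnitude of the deviation; windows with zero spread get score 0.
    np.divide(np.abs(depth - mu), sigma, out=score, where=sigma > 1e-12)
    fill = (depth == 0) & (score > kappa)
    out = depth.copy()
    out[fill] = mu[fill]                                         # moving-average fill value
    return out
```

Under these assumptions, an isolated hole inside a surface region has a large score and is filled, whereas a hole in a window that is entirely background has zero spread and is left untouched, which matches the qualitative roles of $\kappa$ described above.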

Figure 6 shows examples of estimated depth images without and with the proposed hole filling process, and the ground-truth depth image. We observed that the proposed hole filling method estimates missing depth information for a foreground object, giving more appropriate depth maps. However, a few parts of the background that are supposed to have zero values are filled with some non-zero values. This is suboptimal from the perspective of depth estimation, but it is a simple method that can provide sufficiently useful information for the proposed near-surface sampling in Section 3.3.

4 Results and discussion

4.1 Datasets

We used the synthetic Lego and Ship datasets from original NeRF [17] (each original synthetic dataset consists of $100$ training images and $100$ test images; viewpoints are sampled on the upper hemisphere, with fixed diameter, around an object), the real dataset with the identifier 5a8aa0fab18050187cbe060e in BlendedMVS [28], and the Pavillon scene dataset. Figure 7 shows these datasets. For each synthetic dataset, we used $150$ training images and $50$ test images, all with the spatial resolution of $800\times 800$. In generating a point cloud (Section 3.4.1) for each synthetic dataset, we used $20$ of the $100$ training images from the original dataset. Specifically, we generated a point cloud with $20$ viewpoints by sequentially using every fifth viewpoint from the $100$ viewpoints, and repeated the point cloud generation process $20$ times, where each iteration consists of the four steps in Section 3.4.1. (Section 3.4.1 describes the relation between the numbers of viewpoints and repetitions.) In constructing a training dataset for each synthetic dataset, we selected $50$ of the $100$ original test images by taking every other view and added them to the original training dataset. For the real dataset, we used $100$ training images and $11$ test images, all with the resolution of $574\times 475$. In generating a point cloud, we used $20$ of the $100$ training images, selected in the same manner as for the synthetic datasets. For all datasets, each instance has a different viewpoint. If not further specified, we used the above experimental setup throughout all experiments.

The chosen real data contains multi-view images taken around an object, and several images are captured from viewpoints closer to the object. In our experiments, we used the depth images included in [28], and used the blended color images reflecting view-dependent lighting [28] as the ground truth color images.

We compared the proposed NeRF framework using near-surface sampling with a point cloud, with original NeRF, DONeRF [19], DSNeRF [5], and Instant-NGP [18]. For comparing performances between all five methods, we used the re-rendered Lego dataset and Pavillon scene dataset to better fit the view cell methodology of DONeRF that uses additional configurations for view cell generation, and is forward-facing. We used $210$ training images and $60$ test images for these comparison experiments. For point cloud generation, we used $20$ training images. For comparing performances between the proposed and original NeRF, we used all three different datasets (Lego, Ship, and BlendedMVS) that are not necessarily forward-facing.

4.2 Experimental setup

Throughout experiments with different sampling ranges of the proposed surface-based sampling method, we assumed that the full sampling range of original NeRF [17], i.e., the radius of the blue fan-shape in Figure 4, is $4$ (unitless). For synthetic datasets, we set half of the sampling range of proposed NeRF, i.e., $\alpha$ in (1)–(2), as $1/2$, $1/4$, $1/8$, and $1/16$. For the real dataset, we set $\alpha$ as $1$, $1/2$, $1/4$, and $1/8$. (We used larger sampling ranges in real dataset experiments compared to synthetic dataset experiments, since the depth quality of the real dataset is relatively poorer than that of the synthetic datasets: for the synthetic Lego and Ship datasets and the real BlendedMVS dataset, the PSNR value (in dB) of the estimated depth in inference is $19.3$, $16.8$, and $10.4$, respectively.) To see the effects of depth estimation accuracy in the proposed NeRF framework, we also ran experiments with ground-truth depth images and estimated depth images via the proposed method. We set the number of sample points $N=64$, except for experiments using different $N$’s.

In experiments comparing different NeRF methods, we used different numbers of sampling points, i.e., $N$ in (2). For fair comparisons, the total number of sampling points per ray of original NeRF is set identical to those of proposed NeRF, DONeRF [19], DSNeRF [5], and Instant-NGP [18]. In the original NeRF approach, for each of the coarse and fine networks, we set the number of sample points per ray to $4$, $8$, $16$, and $32$. For the proposed NeRF, DONeRF, and DSNeRF, we set $N$ as $8$, $16$, $32$, and $64$, and used only one rendering network. Different from original NeRF that uses samples with different locations for two different networks, Instant-NGP uses two networks that estimate color and density respectively, but uses samples with the same locations. For Instant-NGP, we set the number of samples per ray to $8$, $16$, $32$, and $64$. That is, in comparing different NeRF methods, we set the total number of sample points per ray as $8$, $16$, $32$, and $64$ consistently for all the NeRF methods.

The remaining hyperparameters of the proposed NeRF approach are listed as follows. In determining sampling locations (2), we randomly sampled $\gamma$ via the uniform distribution between $0$ and $2\alpha/N$ . In the point cloud refinement condition (3), we set $\tau$ as $0.1$ . In the hole filling condition (4), we set $\kappa$ and $M$ as $2$ and $11$ , respectively.

We used the following hyperparameters throughout all experiments. We set the total number of training iterations as $400,\!000$ , as the training losses tend to converge after $400,\!000$ iterations. For each iteration, we set the batch size of input rays as $1024$ . We used the learning rate of $5\times 10^{-4}$ until $250,\!000$ iterations, and reduced it to $5\times 10^{-5}$ after $250,\!000$ iterations. We used the ADAM optimizer.

For quantitative comparisons, we used the most representative measure, PSNR in dB, excluding the background area (if available). We used an NVIDIA GeForce RTX 4090 GPU with 24GB GDDR6X VRAM and 2.31GHz, Intel(R) Xeon(R) Gold 6326 CPU with 2.90GHz, and main memory of 503GB RAM.
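For reference, PSNR restricted to non-background pixels can be computed as in the short NumPy sketch below. It assumes images scaled to $[0,1]$ (peak value $1$) and a boolean foreground mask supplied by the caller; how the mask is obtained (for example, from pixels with non-zero ground-truth depth) is an assumption and is not specified in the text. The function name `masked_psnr` is hypothetical.

```python
import numpy as np

def masked_psnr(pred, target, mask=None, peak=1.0):
    """PSNR in dB over the pixels selected by mask (all pixels if mask is None)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if mask is None:
        mask = np.ones(pred.shape[:2], dtype=bool)
    err = (pred - target)[mask]                 # errors of foreground pixels only
    mse = np.mean(err ** 2)
    if mse == 0:
        return float("inf")                     # identical images on the masked region
    return 10.0 * np.log10(peak ** 2 / mse)
```

With a uniform error of $0.1$ on every pixel, the mean squared error is $0.01$ and the function returns $20$ dB; errors that occur only outside the mask do not affect the value.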

4.3 Comparisons with different sampling ranges in the proposed NeRF framework

Using the proposed surface-based sampling method, we compared results between different sampling ranges, either with ground-truth or estimated depth images. First, we compare performances between different sampling ranges, with the ground-truth depth images. Figure 8 with dotted lines compares the rendering quality of proposed NeRF with different sampling ranges, for three different datasets. It demonstrates that, with the ground truth depth information, the rendering quality of NeRF improves as the sampling range becomes narrower. This is natural, since the narrower the sampling range, the more sample points are located near the surface of an object.

Next, we compare performances between different sampling ranges, with the depth images estimated via the proposed point cloud generation and hole filling approaches. Figure 8 (solid lines) and Figure 9 compare the rendering quality of proposed NeRF with different sampling ranges, for three different datasets. In Figure 9, different columns show rendered images with different sampling ranges; in the last column, the ground truth images are presented; different rows show rendered images with different datasets. Figures 8–9 demonstrate that the rendering quality of proposed NeRF improves as the sampling range becomes narrower, but only up to a certain sampling range, e.g., $1/8$ and $1/2$ of the full sampling range of original NeRF for synthetic data and real data, respectively. If the sampling range is too narrow, e.g., $1/16$ and $1/4$ for synthetic data and real data, respectively, the rendering accuracy degrades. This is because some estimated depth information is inaccurate, so that points are sampled too close to the corresponding inaccurate regions where actual surfaces do not exist.

Finally, we compare the rendering accuracy between the two proposed NeRF methods using ground truth and estimated depth images, respectively. Figure 8 demonstrates that in the proposed NeRF framework, using estimated depth images degrades the overall rendering accuracy compared to using the ground truth depth, as one may expect. In particular, points sampled around the inaccurately estimated surface of an object degrade the rendering accuracy.

4.4 Rendering quality comparisons between different NeRF models

4.4.1 Comparisons between five different NeRF models

Table 1: PSNR (dB) comparisons with different numbers of samples per ray ($N$) for different NeRF methods ($\alpha=1/16$ and $\alpha=1/2$ for the Lego and Pavillon datasets in [19], respectively).

(a) The Lego dataset
$N$	NeRF	DSNeRF	Instant-NGP	DONeRF	Proposed NeRF
64	27.84	29.24	30.40	31.25	33.57
32	26.82	28.67	30.31	31.13	32.13
16	24.58	25.81	29.13	30.08	31.36
8	22.72	23.92	28.94	29.13	30.25
(b) The Pavillon scene dataset
$N$	NeRF	DSNeRF	Instant-NGP	DONeRF	Proposed NeRF
64	25.13	27.61	32.09	32.00	33.21
32	22.26	26.15	31.71	31.99	32.40
16	19.25	23.89	30.60	31.71	31.97
8	17.52	21.18	28.42	31.29	31.44

Table 2: PSNR (dB) comparisons between the proposed method and the original NeRF with different numbers of samples per ray ($N$) and different sampling ranges. The numbers in parentheses denote the PSNR difference of the proposed NeRF model relative to the original NeRF model with the same $N$.

$N$	Method	Sampling range ( $2\alpha$ )	Lego	Ship	BlendedMVS
64	Original NeRF	4	25.43	24.37	19.96
	Proposed NeRF	1	26.19 (+0.76)	24.85 (+0.48)	20.92 (+0.96)
		1/2	28.38 (+2.95)	25.49 (+1.12)	20.81 (+0.85)
		1/4	28.87 (+3.44)	25.63 (+1.26)	19.81 (-0.15)
		1/8	27.86 (+2.43)	25.19 (+0.82)
32	Original NeRF	4	22.06	22.75	18.79
	Proposed NeRF	1	23.50 (+1.44)	23.57 (+0.82)	19.92 (+1.13)
		1/2	26.05 (+3.99)	24.64 (+1.89)	20.29 (+1.50)
		1/4	27.45 (+5.39)	25.09 (+2.34)	19.54 (+0.75)
		1/8	27.55 (+5.49)	25.04 (+2.29)
16	Original NeRF	4	19.15	21.07	16.99
	Proposed NeRF	1	21.10 (+1.95)	22.59 (+1.52)	18.30 (+1.31)
		1/2	23.50 (+4.35)	23.65 (+2.58)	19.37 (+2.38)
		1/4	25.16 (+6.01)	24.23 (+3.16)	18.99 (+2.00)
		1/8	26.78 (+7.63)	24.70 (+3.63)
8	Original NeRF	4	16.75	19.11	14.42
	Proposed NeRF	1	19.51 (+2.76)	21.48 (+2.37)	16.73 (+2.31)
		1/2	21.12 (+4.37)	22.63 (+3.52)	17.74 (+3.32)
		1/4	22.67 (+5.92)	23.21 (+4.10)	18.06 (+3.64)
		1/8	25.44 (+8.69)	24.11 (+5.00)

Table 1 and Figures 10–11 compare the rendering quality of the five different NeRF models with different numbers of samples per ray. They demonstrate that the proposed NeRF outperforms the original NeRF, DONeRF, DSNeRF, and Instant-NGP, regardless of the number of sample points per ray. Figures 10–11 show that the proposed NeRF framework renders the details of a 3D object significantly better than the original NeRF, DONeRF, DSNeRF, and Instant-NGP. Table 1 shows for both datasets that the rendering accuracy decreases as the number of sample points per ray decreases. This is observed for all five NeRF models, because with fewer sample points, the networks have less information with which to model a 3D object.
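For reference, the PSNR metric reported in the tables can be computed as in the following minimal sketch. It uses the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ and assumes pixel intensities normalized to $[0, 1]$ and images given as flat lists. These conventions and the function name are assumptions for illustration, since the exact evaluation code is not specified here.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two equally sized flat pixel lists."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 0.1, so the MSE is 0.01 and the PSNR is about 20 dB.
print(round(psnr([0.5, 0.5], [0.6, 0.4]), 2))
```

Because PSNR is logarithmic in the mean squared error, a gain of about $3$ dB corresponds to roughly halving the mean squared error.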

4.4.2 A closer look at original NeRF vs. proposed NeRF

Figure 12 compares the rendering performance of the original and proposed NeRFs with different numbers of samples per ray. The figure demonstrates for the three different datasets that the proposed NeRF framework gives significantly better rendering accuracy than the original NeRF, regardless of the number of sample points per ray. More importantly, Figure 12 shows that the performance degradation caused by reducing the number of samples per ray is significantly smaller in the proposed NeRF framework than in the original NeRF. In other words, the proposed NeRF can largely maintain the rendering quality while reducing the number of samples per ray. Consequently, we conclude that with only a limited number of samples per ray, the proposed NeRF model can achieve significantly better rendering accuracy than the original NeRF model using many samples per ray. For the synthetic datasets, the proposed framework using $16$ samples per ray outperformed the original NeRF using $64$ samples per ray; for the real data, the rendering accuracy of the proposed NeRF model using $16$ samples per ray is comparable with that of the original NeRF using $64$ samples per ray. We expect that the smaller the error in the estimated depth at a novel view, the narrower the sampling range that can be used while reducing the number of samples.

Figure 13 shows images rendered by the proposed framework for different numbers of sample points per ray, with three different datasets. Except for the extreme case of using only eight samples per ray ($N=8$), the quality of the images rendered by the proposed framework degrades only gradually as the number of samples per ray decreases. (When $N=8$, the rendering quality degrades significantly.) Together with the above results from Figure 12, this underscores the importance of the near-surface sampling approach.

Figure 14 compares images rendered by the original and proposed NeRF methods when $N=64$. For the proposed NeRF framework, we used the worst-performing sampling range for the BlendedMVS dataset. The proposed surface-based sampling method significantly improves the overall rendering quality of NeRF, but some dot artifacts remain. This is because, even after the hole filling, some depth information is still missing or the filled holes have inaccurate depth values. We conjecture that a more sophisticated depth estimation method than the proposed simple hole filling scheme could remove those artifacts.
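To illustrate how hole filling can leave missing or inaccurate depth values, the sketch below fills each hole with the median of its valid neighbors in a $3 \times 3$ window. This neighbor-median rule is a generic stand-in chosen for illustration and is not necessarily the hole filling scheme proposed in this paper. Holes are marked with `math.inf`, and all names are hypothetical.

```python
import math
from statistics import median


def fill_holes(depth):
    """Fill holes (math.inf) with the median of valid depths in a 3x3 window."""
    height, width = len(depth), len(depth[0])
    filled = [row[:] for row in depth]  # read from the input, write to a copy
    for v in range(height):
        for u in range(width):
            if not math.isinf(depth[v][u]):
                continue
            neighbors = [
                depth[j][i]
                for j in range(max(0, v - 1), min(height, v + 2))
                for i in range(max(0, u - 1), min(width, u + 2))
                if not math.isinf(depth[j][i])
            ]
            if neighbors:  # a hole with no valid neighbor stays a hole
                filled[v][u] = median(neighbors)
    return filled


hole = math.inf
depth = [
    [2.0, 2.0, 2.0],
    [2.0, hole, 2.0],
    [2.0, 2.0, 5.0],
]
print(fill_holes(depth)[1][1])  # 2.0
```

With such a rule, a hole lying on a depth discontinuity receives a value taken from one side of the edge, and a hole without any valid neighbor stays unfilled. Both cases match the two causes of artifacts named above.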

Table 2 summarizes the PSNR values of the original and proposed NeRF models for different numbers of samples per ray ($N$) and different sampling ranges ($2\alpha$). For each setup using an identical $N$ value, the proposed NeRF framework outperformed the original NeRF model for all tested values of $\alpha$, with one exception: for the BlendedMVS dataset with $N=64$ and $2\alpha=1/4$, the PSNR was $0.15$ dB lower.
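The robustness to the number of samples can be checked directly from Table 2. The short script below copies the Lego column for the original NeRF and for the proposed NeRF with sampling range $2\alpha = 1/8$, and computes the PSNR drop when going from $N=64$ to $N=8$. The variable and function names are hypothetical.

```python
# Lego PSNR values (dB) copied from Table 2, keyed by samples per ray N.
original_lego = {64: 25.43, 32: 22.06, 16: 19.15, 8: 16.75}
proposed_lego = {64: 27.86, 32: 27.55, 16: 26.78, 8: 25.44}  # sampling range 1/8


def psnr_drop(psnr_by_n, n_from=64, n_to=8):
    """PSNR loss (dB) when the number of samples per ray goes from n_from to n_to."""
    return round(psnr_by_n[n_from] - psnr_by_n[n_to], 2)


print(psnr_drop(original_lego))               # 8.68
print(psnr_drop(proposed_lego))               # 2.42
print(proposed_lego[16] > original_lego[64])  # True
```

For this dataset and sampling range, the proposed NeRF loses $2.42$ dB between $N=64$ and $N=8$, compared with $8.68$ dB for the original NeRF. With $N=16$ it still exceeds the original NeRF with $N=64$.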

Table 3: Training time (hours) comparisons between the proposed method and four different NeRF models with different numbers of samples per ray (the Pavillon scene dataset). We used $400,\!000$ iterations throughout the experiments.

$N$	NeRF	DSNeRF	Instant-NGP	DONeRF	Proposed NeRF
64	21.27	16.76	1.54	16.16	12.50
32	17.58	14.16	0.65	13.66	9.34
16	13.27	12.44	0.62	11.63	7.52
8	12.14	11.86	0.57	11.04	7.47

4.5 Training time comparisons between different NeRF models

Table 3 compares the training time of the five different NeRF methods with different numbers of samples per ray. The Instant-NGP model showed the fastest training time among the five NeRF models; note, however, that its rendering accuracy is significantly lower than that of the proposed NeRF method (see Table 1). Except for Instant-NGP, the proposed NeRF method showed the fastest training time. In particular, the proposed NeRF trained about $1.6$ to $1.9$ times faster than the original NeRF. The reason is that we trained a single fully-connected network in the proposed NeRF framework, whereas the original NeRF approach trained two fully-connected networks. Training DONeRF and DSNeRF took longer than training the proposed NeRF model (with the same number of iterations). This is natural because DONeRF and DSNeRF train an extra depth estimation network.

Regardless of the model, the training time decreased as the number of sample points per ray decreased.
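The training speedup over the original NeRF can be computed from the entries of Table 3, as in the following short script. The values are copied from the NeRF and Proposed NeRF columns; the names are hypothetical.

```python
# Training times (hours) copied from Table 3, keyed by samples per ray N.
nerf_hours = {64: 21.27, 32: 17.58, 16: 13.27, 8: 12.14}
proposed_hours = {64: 12.50, 32: 9.34, 16: 7.52, 8: 7.47}


def speedups(baseline, proposed):
    """Ratio of baseline to proposed training time for each N, rounded to 2 digits."""
    return {n: round(baseline[n] / proposed[n], 2) for n in baseline}


print(speedups(nerf_hours, proposed_hours))
# {64: 1.7, 32: 1.88, 16: 1.76, 8: 1.63}
```

For the Pavillon scene, the speedup thus ranges from about $1.6$ to about $1.9$ depending on $N$.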

5 Conclusion

In NeRF methods, it is important to reduce the number of sample points per ray while maintaining the rendering quality, as using fewer samples can reduce training and inference time. Based on the assumption that the closer a sample point is to the surface of an object, the more important it is for rendering, we propose a near-surface sampling method for NeRF. The proposed framework samples 3D points only near the surface of an object, by estimating depth images from a 3D point cloud generated with a subset of training data and a simple hole filling method. For different datasets, the proposed NeRF framework significantly improves on the rendering quality of the original NeRF [17] and three state-of-the-art NeRF methods, DONeRF [19], DSNeRF [5], and Instant-NGP [18]. Particularly compared to the original NeRF method, the proposed framework can achieve significantly better rendering accuracy with only a quarter of the sample points per ray. In addition, with an appropriate sampling range parameter, the proposed near-surface sampling framework can speed up NeRF training by a factor of nearly two while improving the rendering quality. The proposed method would be particularly useful for applications and technologies where visualizing details in novel views is important.

There are a number of avenues for future work to improve the proposed framework. First, the proposed framework takes a longer inference time than the original NeRF model, because projecting many 3D points to a view plane and estimating a depth image is slower than inference via the coarse network in the original NeRF. We expect to reduce the rendering time by speeding up the point cloud projection process. Second, the proposed NeRF framework is not completely end-to-end. In particular, the point cloud generation and refinement process runs in an offline stage and is not yet optimized for rendering. Therefore, we expect to improve the performance of the NeRF model by making it fully end-to-end, incorporating the point cloud generation and refinement process into training. Finally, we expect to further improve the rendering performance of the proposed method by using a more accurate depth estimation method.
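To make the cost of the projection step concrete, the following minimal sketch projects a point cloud to a depth image with a z-buffer. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, and points already expressed in camera coordinates with the $z$-axis pointing forward. These names and conventions are hypothetical and only illustrate the general technique.

```python
import math


def project_depth(points_cam, fx, fy, cx, cy, width, height):
    """Project camera-space points (x, y, z) to a depth image using a z-buffer.

    Pixels that receive no point keep the value math.inf (holes).
    """
    depth = [[math.inf] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # point is behind the camera
        # Pinhole projection, rounded to the nearest pixel.
        u = math.floor(fx * x / z + cx + 0.5)
        v = math.floor(fy * y / z + cy + 0.5)
        # Keep only the closest point per pixel so occluded points are discarded.
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
    return depth


points = [(0.0, 0.0, 2.0), (0.0, 0.0, 5.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
image = project_depth(points, fx=1.0, fy=1.0, cx=1.0, cy=1.0, width=3, height=3)
print(image[1][1], image[1][2], image[0][0])  # 2.0 1.0 inf
```

The loop visits every point once, so the cost grows linearly with the size of the point cloud. This is consistent with the projection being a bottleneck for dense clouds, and the per-point independence suggests that the step could be parallelized.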

Statements and Declarations

Funding: The work of H. B. Yoo and I. Y. Chun was supported in part by NRF grants 2022R1F1A1074546 and RS-2023-00213455 funded by MSIT, and the BK21 FOUR Project. The work of I. Y. Chun was additionally supported in part by IITP grant 2019-0-00421 funded by MSIT, IBS grant R015-D1, KIAT grant P0022098 funded by MOTIE, the KEIT Technology Innovation program grant 20014967 funded by MOTIE, SKKU-SMC and SKKU-KBSMC Future Convergence Research Program grants, and SKKU seed grants. The work of H. M. Han and S. S. Hwang was supported by the NRF grant NRF-2022R1C1C1011084 funded by MSIT.

Competing interests: The authors declare that they have no conflict of interest.

Ethics approval: Not applicable.

Consent to participate: Not applicable.

Consent for publication: Not applicable.

Availability of data and materials: The NeRF dataset and BlendedMVS dataset are publicly available at https://paperswithcode.com/dataset/nerf and https://paperswithcode.com/dataset/blendedmvs

Code availability: The code in this study is available from the corresponding author on reasonable request.

Authors’ contributions: Conceptualization, H. B. Y., H. M. H., S. S. H., & I. Y. C.; data curation, H. M. H.; formal analysis, H. B. Y. & I. Y. C.; funding acquisition, S. S. H. & I. Y. C.; investigation, H. B. Y. & H. M. H.; methodology, H. B. Y., H. M. H., S. S. H., & I. Y. C.; project administration, S. S. H. & I. Y. C.; resources, I. Y. C.; software, H. B. Y. & H. M. H.; supervision, S. S. H. & I. Y. C.; validation, H. M. H., S. S. H., & I. Y. C.; visualization, H. M. H.; writing (original draft preparation), H. B. Y. & H. M. H.; writing (review and editing), I. Y. C. All authors have read and agreed to the published version of the manuscript.

References


Alan [1993] Alan W (1993) 3D Computer Graphics. Addison-Wesley

Boss et al [2021] Boss M, Braun R, Jampani V, et al (2021) NeRD: Neural reflectance decomposition from image collections. In: IEEE/CVF International Conference on Computer Vision, pp 12664–12674, 10.1109/ICCV48922.2021.01245

Cernea [2020] Cernea D (2020) OpenMVS: Multi-view stereo reconstruction library, URL https://cdcseacave.github.io/openMVS

Chen and Williams [1993] Chen SE, Williams L (1993) View interpolation for image synthesis. In: Proceedings of the Conference on Computer Graphics and Interactive Techniques, pp 279–288, 10.1145/166117.166153

Deng et al [2022] Deng K, Liu A, Zhu JY, et al (2022) Depth-supervised NeRF: Fewer views and faster training for free. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12872–12881, 10.1109/CVPR52688.2022.01254

Draelos et al [2015] Draelos M, Qiu Q, Bronstein A, et al (2015) Intel realsense = real low cost gaze. In: IEEE International Conference on Image Processing, pp 2520–2524, 10.1109/ICIP.2015.7351256

Gortler et al [1996] Gortler SJ, Grzeszczuk R, Szeliski R, et al (1996) The lumigraph. In: Proceedings of the Conference on Computer Graphics and Interactive Techniques, pp 43–54, 10.1145/237170.237200

Hartley and Zisserman [2003] Hartley R, Zisserman A (2003) Multiple view geometry in computer vision, Second Edition. Cambridge University Press

Hu et al [2022] Hu T, Liu S, Chen Y, et al (2022) EfficientNeRF: Efficient neural radiance fields. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12902–12911, 10.1109/CVPR52688.2022.01256

Johari et al [2022] Johari MM, Lepoittevin Y, Fleuret F (2022) GeoNeRF: Generalizing NeRF with geometry priors. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 18344–18347, 10.1109/CVPR52688.2022.01782

Levoy and Hanrahan [1996] Levoy M, Hanrahan P (1996) Light field rendering. In: Proceedings of the Conference on Computer Graphics and Interactive Techniques, pp 31–42, 10.1145/237170.237199

Lin et al [2021] Lin CH, Ma WC, Torralba A, et al (2021) BARF: Bundle-adjusting neural radiance fields. In: IEEE/CVF International Conference on Computer Vision, pp 5721–5731, 10.1109/ICCV48922.2021.00569

Lindell et al [2021] Lindell DB, Martel JNP, Wetzstein G (2021) AutoInt: Automatic integration for fast neural volume rendering. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 14551–14560, 10.1109/CVPR46437.2021.01432

Liu et al [2020] Liu L, Gu J, Lin KZ, et al (2020) Neural sparse voxel fields. In: Proceedings of the International Conference on Neural Information Processing Systems, pp 15651–15663

Mankoff and Russo [2013] Mankoff K, Russo T (2013) The Kinect: A low-cost, high-resolution, short-range 3D camera. Earth Surface Processes and Landforms 38:926–936. 10.1002/esp.3332

Martin-Brualla et al [2021] Martin-Brualla R, Radwan N, Sajjadi MSM, et al (2021) NeRF in the Wild: Neural radiance fields for unconstrained photo collections. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7206–7215, 10.1109/CVPR46437.2021.00713

Mildenhall et al [2020] Mildenhall B, Srinivasan PP, Tancik M, et al (2020) NeRF: Representing scenes as neural radiance fields for view synthesis. In: Proceedings of the European Conference on Computer Vision, pp 405–421, 10.1007/978-3-030-58452-8_24

Müller et al [2022] Müller T, Evans A, Schied C, et al (2022) Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans Graph 41. 10.1145/3528223.3530127

Neff et al [2021] Neff T, Stadlbauer P, Parger M, et al (2021) DONeRF: Towards real-time rendering of neural radiance fields using depth oracle networks. Computer Graphics Forum 40:45–49. 10.1111/cgf.14340

Niemeyer and Geiger [2021] Niemeyer M, Geiger A (2021) GIRAFFE: Representing scenes as compositional generative neural feature fields. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11448–11459, 10.1109/CVPR46437.2021.01129

Park et al [2021] Park K, Sinha U, Barron JT, et al (2021) Nerfies: Deformable neural radiance fields. In: IEEE/CVF International Conference on Computer Vision, pp 5845–5854, 10.1109/ICCV48922.2021.00581

Pumarola et al [2021] Pumarola A, Corona E, Pons-Moll G, et al (2021) D-NeRF: Neural radiance fields for dynamic scenes. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10313–10322, 10.1109/CVPR46437.2021.01018

Rebain et al [2021] Rebain D, Jiang W, Yazdani S, et al (2021) DeRF: Decomposed radiance fields. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 14148–14156, 10.1109/CVPR46437.2021.01393

Shade et al [1998] Shade J, Gortler S, He Lw, et al (1998) Layered depth images. In: Proceedings of the Conference on Computer Graphics and Interactive Techniques, pp 231–242, 10.1145/280814.280882

Srinivasan et al [2021] Srinivasan PP, Deng B, Zhang X, et al (2021) NeRV: Neural reflectance and visibility fields for relighting and view synthesis. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7491–7500, 10.1109/CVPR46437.2021.00741

Xie et al [2021] Xie C, Park K, Martin-Brualla R, et al (2021) Fig-NeRF: Figure-ground neural radiance fields for 3D object category modelling. In: International Conference on 3D Vision, pp 962–971, 10.1109/3DV53792.2021.00104

Xu et al [2022] Xu Q, Xu Z, Philip J, et al (2022) Point-NeRF: Point-based neural radiance fields. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5438–5448, 10.1109/CVPR52688.2022.00536

Yao et al [2020] Yao Y, Luo Z, Li S, et al (2020) BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1787–1796, 10.1109/cvpr42600.2020.00186

Yen-Chen et al [2021] Yen-Chen L, Florence P, Barron JT, et al (2021) iNeRF: Inverting neural radiance fields for pose estimation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp 1323–1330, 10.1109/IROS51168.2021.9636708

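[Figure panels, rendered images for different sampling ranges: (a) $\alpha=1/2$, (b) $\alpha=1/4$, (c) $\alpha=1/8$, (d) Ground truth]

[Figure panels, rendered images for different NeRF methods (two figures with the same layout): (a) Original NeRF, (b) DSNeRF, (c) Instant-NGP, (d) DONeRF, (e) Proposed NeRF, (f) Ground truth]

[Figure panels, rendered images for different numbers of samples per ray: (a) $N=8$, (b) $N=16$, (c) $N=32$, (d) $N=64$, (e) Ground truth]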