Pytorch implemntation from scratch
Introduction
Neural Radiance Fields are a way of storing a 3D scene within a neural network. This way of storing and representing a scene is often called an implicit representation, since the scene parameters are fully represented by the underlying Multi-Layer Perceptron (MLP). (As compared to an explicit representation that stores scene parameters like colour or density explicitly in voxel grids.) This novel way of representing a scene showed very impressive results in the task of novel view synthesis, the task of interpolating novel views from camera perspectives that are not in the training set. Furthermore, it allows us to store large scenes with a smaller memory footprint than explicit representation, since we merely need to store the weights of our neural network compared to voxel grids, which increase in memory size by a cubic term.
1) March camera rays through the scene to generate a sampled set of 3D points 2) Use those points and their corresponding 2D viewing directions as input to the neural network to produce an output set of colors and densities 3) Use classical volume rendering techniques to accumulate those colors and densities into a 2D image
Because this process is naturally differentiable, we can use gradient descent to optimize this model by minimizing the error between each observed image and the corresponding views rendered from our representation. Minimizing this error across multiple views encourages the network to predict a coherent model of the scene by assigning high volume densities and accurate colors to the locations that contain the true underlying scene content
Neural Radiance Field Scene Representation
– An approach for representing continuous scenes with complex geometry and
materials as 5D neural radiance fields, parameterized as basic MLP networks.
– A differentiable rendering procedure based on classical volume rendering techniques, which we use to optimize these representations from standard RGB
images. This includes a hierarchical sampling strategy to allocate the MLP’s
capacity towards space with visible scene content.
– A positional encoding to map each input 5D coordinate into a higher dimensional space, which enables us to successfully optimize neural radiance fields
to represent high-frequency scene content.
Volume rendering requires a differential probability of a ray terminating at an infinitesimal particle at location x.
Loss Formulation
As the discretized volumetric rendeing equation is fully differentiable, the weights of the underlying neural network can then be trained using a reconstruction loss on the rendered pixels.
Quadrature
(To be studied)
Optimizing a Neural radiance Field
1) Positional Encoding
Mapping the inputs to a higher
dimensional space using high frequency functions before passing them to the
network enables better fitting of data that contains high frequency variation.
A high frquency sinoisoidal function is used to encode the 3D coordinates.
Reformulating FΘ as a composition of two functions FΘ = F
0
Θ ◦ γ, one
learned and one not
2) Hierarchical Volume Sampling
Instead of just using a single network to represent the scene, they simultaneously optimize two networks: one “coarse” and one “fine”.
References
https://huggingface.co/learn/computer-vision-course/en/unit8/nerf
https://arxiv.org/pdf/2003.08934 \