Stanford and NVIDIA researchers present a tri-plane-based 3D GAN framework for geometry-aware, high-resolution image synthesis

Generative Adversarial Networks (GANs) have been one of the biggest hypes of recent years. Built on the well-known generator-discriminator mechanism, their conceptually simple training setup has prompted a constant stream of research improving the original architecture. In image generation, the state of the art is represented by StyleGANs, which produce strikingly realistic, high-quality images capable of fooling even humans.

While the generation of new samples has achieved excellent results in the 2D domain, 3D GANs are still very inefficient. Naively transferring the 2D GAN mechanism to 3D is computationally prohibitive, because dense 3D data is difficult for current GPUs to handle. For this reason, research has focused on geometry-aware GANs that infer the underlying 3D structure using only 2D images. In this case, however, the resulting approximations are generally not 3D-consistent: the same scene rendered from different viewpoints does not always agree.

To solve this problem, the NVIDIA team, in collaboration with Stanford University, has proposed an efficient geometry-aware 3D GAN. The architecture is capable not only of synthesizing multi-view-consistent, high-resolution 2D images, but also of producing high-quality 3D geometry. This was achieved through two main contributions: the first is an explicit-implicit hybrid 3D representation, called the tri-plane representation, which is both efficient and expressive; the second is a dual-discrimination strategy combined with pose-based conditioning of the generator, which together promote multi-view consistency.

3D tri-plane representation

In a 3D scene, a specific point can be described by its position (its x, y, z coordinates) and a viewing direction (the point of view from which it is observed). A 3D representation takes these two values as input and returns an RGB color and a density (if you imagine a ray passing through the scene, the density at a given point indicates how likely the ray is to stop there; for example, if the point lies inside a solid object, the density will be high).
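To make this concrete, here is a minimal PyTorch sketch of such a query; the network size, layer choices, and the RadianceField name are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class RadianceField(nn.Module):
    """Toy implicit 3D representation: maps a 3D position and a viewing
    direction to an RGB color and a volume density (illustrative only)."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 3, hidden), nn.ReLU(),   # (x, y, z) + view direction
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                  # RGB (3) + density (1)
        )

    def forward(self, position: torch.Tensor, direction: torch.Tensor):
        out = self.net(torch.cat([position, direction], dim=-1))
        rgb = torch.sigmoid(out[..., :3])          # colors in [0, 1]
        density = torch.relu(out[..., 3:])         # non-negative density
        return rgb, density

# Query a single point seen from a given direction.
rgb, sigma = RadianceField()(torch.rand(1, 3), torch.rand(1, 3))
```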

A 3D representation can be explicit or implicit; voxel grids and NeRF are examples of each, respectively. Explicit representations, such as a discrete voxel grid, are fast to query but very memory-intensive, because the entire explicit structure must be kept in memory (the cube in figure (b) below). Implicit representations, on the other hand, encode a scene as a continuous function; they are very memory-efficient but expensive to query, as suggested by the (relatively) large decoder in figure (a) below.

Source: https://arxiv.org/pdf/2112.07945.pdf

The tri-plane representation combines the advantages of both approaches into a hybrid explicit-implicit representation. Instead of a full voxel grid, it stores in memory only three feature planes of resolution NxNxC (C is not 1, as might be wrongly inferred from the image above). A 3D position is projected onto each of the three planes, the corresponding feature vectors are summed, and the result is passed to a small decoder. The tri-plane representation is therefore both fast to query and memory-efficient, and its expressiveness is demonstrated in the comparison with the other two approaches in the figure below.

Source: https://arxiv.org/pdf/2112.07945.pdf
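A minimal sketch of such a tri-plane query, assuming PyTorch and toy decoder sizes (the TriPlaneField name and layer choices are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlaneField(nn.Module):
    """Toy tri-plane query: three N x N x C feature planes plus a small
    decoder (sizes and layer choices are illustrative, not the paper's)."""

    def __init__(self, n: int = 256, c: int = 32):
        super().__init__()
        # Three axis-aligned feature planes (xy, xz, yz), each N x N with C channels.
        self.planes = nn.Parameter(torch.randn(3, c, n, n) * 0.01)
        self.decoder = nn.Sequential(nn.Linear(c, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, xyz: torch.Tensor):          # xyz in [-1, 1], shape (P, 3)
        # Project each point onto the three planes and bilinearly sample a feature.
        coords = torch.stack([xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]])  # (3, P, 2)
        grid = coords.unsqueeze(1)                                              # (3, 1, P, 2)
        feats = F.grid_sample(self.planes, grid, align_corners=True)            # (3, C, 1, P)
        feats = feats.squeeze(2).sum(dim=0).t()    # sum the three planes -> (P, C)
        out = self.decoder(feats)                  # RGB (3) + density (1) per point
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3:])

rgb, sigma = TriPlaneField()(torch.rand(8, 3) * 2 - 1)
```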

3D GAN Framework

Source: https://arxiv.org/pdf/2112.07945.pdf

Before presenting the whole framework, it should be noted that, although the authors often stress in the article that the algorithm works only with 2D images, each image is associated with a set of intrinsic and extrinsic camera parameters, estimated with a pose detector (P in the image above).

First, a random latent code and the camera parameters mentioned above are processed by a mapping network that returns an intermediate latent code, which is used to modulate both the StyleGAN2 generator and the subsequent super-resolution module. The StyleGAN2 generator then produces a 256x256x96 feature image, which is split along the channel dimension to form the tri-plane features. The tri-plane features are aggregated by the neural renderer, and a lightweight decoder outputs a 128x128x32 feature image I_F for a given camera pose.
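The neural renderer is essentially a volume renderer operating on features rather than colors. Under the usual NeRF-style quadrature, compositing the samples along a single ray could look like the sketch below (a generic illustration, not EG3D's implementation; composite_along_ray is a made-up helper):

```python
import torch

def composite_along_ray(features: torch.Tensor, densities: torch.Tensor,
                        deltas: torch.Tensor) -> torch.Tensor:
    """Classic volume-rendering quadrature: accumulate per-sample features
    weighted by alpha * transmittance (a generic sketch, not EG3D's code).

    features:  (S, C) feature vector at each of S samples along the ray
    densities: (S, 1) non-negative densities
    deltas:    (S, 1) distances between consecutive samples
    """
    alpha = 1.0 - torch.exp(-densities * deltas)               # opacity per sample
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha + 1e-10], dim=0), dim=0
    )[:-1]                                                      # light surviving so far
    weights = alpha * transmittance
    return (weights * features).sum(dim=0)                      # (C,) rendered feature

# One ray with 64 samples and 32-channel features, matching the 128x128x32 I_F.
pixel_feature = composite_along_ray(
    torch.rand(64, 32), torch.rand(64, 1), torch.full((64, 1), 0.02)
)
```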

The whole process is still too slow to render directly at high resolution; for this reason, the authors render at a relatively low resolution and upsample the result with a super-resolution module.

Finally, a discriminator is used to evaluate the renders, but with two modifications with respect to the original StyleGAN2 discriminator. First, dual discrimination is used, which works as follows: I_F is passed to the super-resolution module, which produces a 512x512x3 image I_RGB+. In parallel, the first three channels of I_F are interpreted as a low-resolution RGB image I_RGB, which is bilinearly upsampled to 512x512x3 and concatenated with I_RGB+ to form a six-channel image. This process encourages the super-resolved images to remain consistent with the neural renderings.
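A shape-level sketch of how that six-channel input could be assembled, assuming PyTorch tensors with the resolutions quoted above (dual_discriminator_input is an illustrative helper, not the authors' code):

```python
import torch
import torch.nn.functional as F

def dual_discriminator_input(I_F: torch.Tensor, I_RGB_plus: torch.Tensor) -> torch.Tensor:
    """Build the six-channel image fed to the discriminator (a sketch of the
    idea described above; the real discriminator is a modified StyleGAN2 one).

    I_F:        (B, 32, 128, 128) feature image from the neural renderer
    I_RGB_plus: (B, 3, 512, 512)  output of the super-resolution module
    """
    # Interpret the first three channels of I_F as a low-resolution RGB image.
    I_RGB = I_F[:, :3]
    # Bilinearly upsample it to the super-resolution output size ...
    I_RGB_up = F.interpolate(I_RGB, size=I_RGB_plus.shape[-2:],
                             mode="bilinear", align_corners=False)
    # ... and stack it with the super-resolved image along the channel axis.
    return torch.cat([I_RGB_up, I_RGB_plus], dim=1)   # (B, 6, 512, 512)

six_channel = dual_discriminator_input(torch.rand(1, 32, 128, 128),
                                        torch.rand(1, 3, 512, 512))
```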

Second, the authors condition the discriminator on the camera pose from which each generated image is rendered, which helps the generator learn correct 3D priors.
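One generic way to implement such conditioning is a projection-style head that couples the discriminator's image features with an embedding of the camera parameters. The sketch below is an illustrative scheme only: the 25-dimensional pose vector (flattened extrinsics plus intrinsics) and the PoseConditionedHead module are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PoseConditionedHead(nn.Module):
    """Generic projection-style conditioning head: the realism score depends
    on both image features and the rendering camera pose (illustrative)."""

    def __init__(self, feat_dim: int = 512, pose_dim: int = 25):
        super().__init__()
        self.pose_embed = nn.Linear(pose_dim, feat_dim)   # embed camera parameters
        self.score = nn.Linear(feat_dim, 1)               # unconditional score

    def forward(self, img_features: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        # Add an inner product between image features and the pose embedding,
        # so the realism judgement is tied to the claimed camera pose.
        cond = (img_features * self.pose_embed(pose)).sum(dim=1, keepdim=True)
        return self.score(img_features) + cond

score = PoseConditionedHead()(torch.rand(2, 512), torch.rand(2, 25))
```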

Results

The results of this approach are impressive for both 2D and 3D generation. A visual comparison with other existing techniques is shown below.

Source: https://arxiv.org/pdf/2112.07945.pdf

Interestingly, the authors also tested single-view 3D reconstruction, using pivotal tuning inversion to fit the test images and obtain a 3D reconstruction from a single RGB image (image below).

Source: https://arxiv.org/pdf/2112.07945.pdf

As the authors point out, this approach still lacks fine detail (such as individual teeth), but it represents a very significant improvement in the area of 3D-aware GANs, and we can't wait to see what's next!

Article: https://arxiv.org/pdf/2112.07945.pdf

Project: https://matthew-a-chan.github.io/EG3D/

https://www.youtube.com/watch?v=cXxEwI7QbKg