Generative models are becoming the de facto solution for many complex tasks in computer science, and they represent one of the most promising ways to analyze and synthesize visual data. Stable Diffusion is the best-known generative model for producing realistic images from complex input prompts. Its architecture is based on Diffusion Models (DMs), which have shown phenomenal generative power for images and videos. Rapid advances in diffusion and generative modeling are fueling a revolution in 2D content creation. The mantra is simple: "If you can describe it, the model can paint it for you." It is truly remarkable what generative models are capable of.
While 2D content has proven to be fertile ground for DMs, 3D content poses several challenges due to, among other things, the added dimension. Generating 3D content, such as avatars, with the same quality as 2D content is a difficult task: the memory and processing costs required to produce the rich detail of high-quality avatars can be prohibitive.
With technology pushing the use of digital avatars in movies, games, the metaverse, and the 3D industry, allowing anyone to create a digital avatar can be beneficial. This is the motivation that has driven the development of this work.
The authors propose the Roll-out diffusion network (Rodin) to address the problem of creating digital avatars. An overview of the model is given in the figure below.
The model's input can be an image, random noise, or a textual description of the desired avatar. A latent vector z is then derived from the given input and used to condition the diffusion process, which consists of several denoising steps: random noise is added to the starting state, and the model progressively removes it to obtain a cleaner result.
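To make the denoising idea concrete, here is a minimal toy sketch of the diffusion principle in NumPy. Note that this is an illustration of the general DDPM-style forward-and-reverse process, not Rodin's actual network: the `predicted_noise` function is a hypothetical stand-in that cheats by returning the true noise, where a trained model would predict it.

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.02, T)      # noise schedule (illustrative values)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))        # stand-in for clean content
eps = rng.standard_normal(x0.shape)     # the Gaussian noise being added

t = T - 1
# Forward process: q(x_t | x_0) adds noise in closed form
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

def predicted_noise(x_t, t):
    # Placeholder for the learned denoiser eps_theta(x_t, t);
    # here it "cheats" by returning the true noise.
    return eps

# One reverse (denoising) step: subtract the predicted noise contribution
eps_hat = predicted_noise(x_t, t)
x_prev = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
```

Iterating this reverse step from t = T-1 down to 0 gradually transforms pure noise into a clean sample.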
The difference here lies in the 3D nature of the content. The diffusion process works as usual, but instead of targeting a 2D image, the diffusion model generates the coarse geometry of the avatar, which is then followed by a diffusion upsampler for detail synthesis.
Computational and memory efficiency is one of the goals of this work. To achieve it, the authors exploit the tri-plane (three-axis) representation of a neural radiance field, which, compared to voxel grids, offers a significantly smaller memory footprint without sacrificing expressiveness.
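The tri-plane idea can be sketched as follows: a 3D point is projected onto three axis-aligned feature planes (XY, XZ, YZ), each plane is sampled bilinearly, and the three feature vectors are aggregated. The sketch below illustrates this with summation as the aggregation; plane resolution and channel count are illustrative choices, not the paper's values.

```python
import numpy as np

R, C = 32, 16                            # plane resolution and feature channels
rng = np.random.default_rng(0)
planes = {k: rng.standard_normal((R, R, C)) for k in ("xy", "xz", "yz")}

def bilinear(plane, u, v):
    # Sample a (R, R, C) feature plane at continuous coords u, v in [0, 1]
    x, y = u * (R - 1), v * (R - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, R - 1), min(y0 + 1, R - 1)
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * plane[y0, x0]
            + fx * (1 - fy) * plane[y0, x1]
            + (1 - fx) * fy * plane[y1, x0]
            + fx * fy * plane[y1, x1])

def triplane_features(p):
    # Project the 3D point onto each plane and aggregate the samples
    x, y, z = p                          # coordinates normalized to [0, 1]
    return (bilinear(planes["xy"], x, y)
            + bilinear(planes["xz"], x, z)
            + bilinear(planes["yz"], y, z))

feat = triplane_features((0.3, 0.6, 0.9))   # one C-dimensional feature vector
```

The memory advantage is clear from the shapes: three R×R planes store O(3R²C) values, whereas a dense voxel grid at the same resolution would need O(R³C).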
Another diffusion model is then trained to upsample the produced tri-plane representation to the desired resolution. Finally, a lightweight MLP decoder consisting of four fully connected layers maps the tri-plane features to RGB color and volumetric density for rendering.
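A lightweight decoder of this kind can be sketched as below: four fully connected layers mapping a tri-plane feature vector to RGB color and volume density, NeRF-style. The layer widths and activations here are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [16, 64, 64, 64, 4]              # 4 FC layers; last outputs RGB + density
weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def decode(feat):
    # Forward pass through the 4-layer MLP
    h = feat
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)       # ReLU on hidden layers
    rgb = 1.0 / (1.0 + np.exp(-h[:3]))   # sigmoid keeps color in [0, 1]
    sigma = np.maximum(h[3], 0.0)        # non-negative volume density
    return rgb, sigma

rgb, sigma = decode(rng.standard_normal(16))
```

Querying this decoder at many points along camera rays, and compositing colors weighted by density, is what turns the tri-plane representation into a rendered image of the avatar.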
Some results are reported below.
Compared to the state-of-the-art approaches considered, Rodin provides the sharpest digital avatars. Unlike the other techniques, no artifacts are visible in Rodin's shared samples.
This was a summary of Rodin, a novel framework for easily generating 3D digital avatars from various input sources. If you are interested, you can find more information in the links below.
Check out the Paper. All credit for this research goes to the researchers on this project. Also, don't forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He currently works in the Christian Doppler ATHENA laboratory, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.