Nvidia on Thursday introduced a new research project for text-to-3D content creation called LATTE3D.
LATTE3D turns text prompts into 3D representations of objects and animals within seconds, Nvidia said.
While researchers trained the model on two datasets of animals and objects, developers could use the same model architecture to train the AI system on other data types, such as a dataset of plants.
LATTE3D was trained with Nvidia A100 Tensor Core GPU, and prompts generated using OpenAI’s ChatGPT.
Nvidia said users can export the shapes or objects generated into graphics software applications or platforms such as Nvidia Omniverse, the AI hardware-software vendor’s platform for creating metaverse applications.
The new research project comes after Nvidia introduced two new powerful Blackwell GPU chips earlier this week.
It also comes as the vendor introduced a new project for using foundation models to design human-like robots.
The evolution of 3D images
What Nvidia is doing with LATTE3D is displaying the prowess of its GPUs and the possibilities of its hardware for a range of applications, said Chirag Shah, professor at the information school at the University of Washington.
“This work doesn’t necessarily produce better 3D images, at least not in a noticeable, groundbreaking way,” Shah said. “What it does, however, is produce such images much faster.”
Rendering 3D objects is computationally expensive and complex with the added layer of text-to-image, he added.
LATTE3D is a natural evolution of text-to-image generation technology, said Futurum Group analyst David Nicholson.
“We’ve been conditioned over decades to retrieve things,” Nicholson said, referring not only to LLMs but also to retrieving images from search. “When we go to a 3D thing, it’s even more complicated.”
For LATTE3D, Nvidia used amortized methods, a mechanism that enables the model to break down a text prompt into different components to generate results faster. With this approach, Nvidia is aiming to tackle some of the challenges of the earlier versions of 3D image generation models, such as the speed at which the images are generated.
The vendor is using both software and hardware to do that, Nicholson noted.
Nvidia is also using plain language prompts and diffusion technology to create and structure this model in inexpensively and quickly, he added.
“It is illustrative of Nvidia’s leadership position in all sorts of different areas of AI,” he continued.
Some limitations
However, LATTE3D might not become the go-to model for 3D used for creating video or animated films since it is still limited to the specific tasks it can perform.
“Just because this helps to refine the creation of 3D images and 3D models doesn’t mean this will find its way into say, the production of a full Hollywood animated film,” Nicholson said.
Moreover, the techniques Nvidia used still have the same limitations or risks as other text-to-image generation techniques. That is similar to what happened to Google Gemini, Shah said, referring to the bias problem that caused Google to shut down the image generation part of its Gemini multimodal model.
“Things will still cause issues,” he said. “We just have the ability to do that faster.”
Other than LATTE3D, Nvidia introduced another research project named EDM2.
EDM2 is a generative AI model that improves the structure of diffusion-based neural networks used in current image and video generators.
Esther Ajao is a TechTarget Editorial news writer and podcast host covering artificial intelligence software and systems.