NVIDIA Cosmos 3: One Open-Source Model for Text, Video, Audio, and Robot Actions

June 11, 2026

NVIDIA has released Cosmos 3, an omnimodal world model designed for Physical AI that handles text, images, video, audio, and robot actions within a single architecture.

What It Does

Cosmos 3 reasons, generates, and simulates across all five modalities simultaneously, according to the release. NVIDIA positions it as a unified foundation for physical AI development rather than a collection of separate specialized models.

Benchmark Claims

NVIDIA says Cosmos 3 ranks first among open-source models in three categories: image generation, video generation, and robot policy. The source does not specify which benchmarks were used or the margin of improvement over competing models.

Honest Limitations

The source material is sparse on technical specifics. Parameter count, training data composition, inference hardware requirements, and licensing terms are not detailed in the available information. The robot policy ranking claim is notable but has not been verified by independent evaluations, and NVIDIA's benchmarking methodology is not disclosed in the source.

Previously, NVIDIA released open-source physical AI agent skills and tools alongside an Isaac GR00T humanoid reference robot, signaling a sustained push to build out its physical AI developer stack.

If the benchmark claims hold under independent scrutiny, Cosmos 3 gives robotics developers a single open-source model to replace several task-specific ones — reducing integration overhead and potentially lowering the bar for deploying capable robot policies.
