Tencent's Hunyuan research team has released HY-Embodied-0.5, a family of open-source AI foundation models built specifically to power robots and autonomous agents operating in physical environments, with the code and weights publicly available on GitHub.

Most large vision-language models (VLMs) — the AI systems that interpret images and text together — are designed for screens, not the physical world. They struggle with tasks that require understanding where objects are in three-dimensional space, how a scene changes over time, or how to plan a sequence of physical actions. HY-Embodied-0.5 addresses that gap with models purpose-built for what researchers call "embodied intelligence."

Two Models, Two Use Cases

The suite ships in two variants. The smaller model uses 2 billion activated parameters and is optimized for edge deployment — meaning it can run on the computing hardware built into a robot rather than relying on a remote data centre. The larger model uses 32 billion activated parameters and targets complex reasoning tasks where raw capability matters more than speed or power efficiency.

According to the paper, the 32B model achieves performance "comparable to frontier models such as Gemini 2.0 Pro" on embodied understanding benchmarks. It is worth noting that these benchmark results are self-reported by the research team and have not been independently verified.

The smaller 2B model outperforms similarly sized state-of-the-art models on 16 out of 22 benchmarks tested, according to the Hunyuan team.

The Architecture: Teaching AI to See in 3D

The models use a Mixture-of-Transformers (MoT) architecture — a design that routes different types of information (such as visual data versus language) through specialized processing streams rather than treating everything the same way. This modality-specific approach, combined with what the team calls "latent tokens," is intended to sharpen how the model represents spatial and physical details that standard VLMs tend to blur.
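Tencent has not published the routing logic in this form, but the modality-specific idea can be sketched roughly as follows. Everything here — the expert functions, the token format, the layer name — is a hypothetical stand-in meant only to illustrate per-modality processing streams, not the actual HY-Embodied-0.5 implementation.

```python
# Minimal sketch of modality-specific routing, the core idea behind a
# Mixture-of-Transformers layer. All names and operations are
# illustrative stand-ins, not the released code.

def vision_expert(vec):
    # Stand-in for a vision-specialized transformer block.
    return [2.0 * x for x in vec]

def language_expert(vec):
    # Stand-in for a language-specialized transformer block.
    return [x + 1.0 for x in vec]

EXPERTS = {"vision": vision_expert, "language": language_expert}

def mot_layer(tokens):
    """Route each (modality, vector) token through its own expert
    stream, instead of pushing every modality through one shared
    stack as a standard VLM would."""
    return [(modality, EXPERTS[modality](vec)) for modality, vec in tokens]

tokens = [("vision", [1.0, 2.0]), ("language", [0.5, 0.5])]
print(mot_layer(tokens))
```

The design choice this illustrates: visual and linguistic tokens keep separate parameters where their statistics differ, rather than sharing one undifferentiated stack.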

For reasoning, the team developed an iterative, self-evolving post-training approach. Rather than training once on a fixed dataset, the model refines its own outputs over multiple rounds, progressively improving its ability to predict outcomes, plan interactions, and reason about physical consequences. The team also used on-policy distillation — a technique where the larger 32B model acts as a teacher, transferring its capabilities to the smaller 2B model to push its performance as far as the compact architecture allows.
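The defining feature of on-policy distillation is that the teacher's signal is computed on samples drawn from the student's own output distribution, rather than from a fixed dataset. A toy sketch of that loss, with hypothetical token distributions standing in for real model outputs (this is not the actual training code):

```python
import math

# Toy sketch of on-policy distillation: the training signal is the
# divergence between student and teacher, with the expectation taken
# under the STUDENT's own distribution. The distributions below are
# hypothetical stand-ins for next-token probabilities.

teacher = {"a": 0.7, "b": 0.2, "c": 0.1}   # frozen "teacher" (the 32B role)
student = {"a": 0.4, "b": 0.4, "c": 0.2}   # compact "student" (the 2B role)

def kl_on_policy(student_p, teacher_p):
    """KL(student || teacher): weighting by the student's own
    probabilities is what makes the distillation on-policy."""
    return sum(p * math.log(p / teacher_p[tok]) for tok, p in student_p.items())

loss = kl_on_policy(student, teacher)
print(round(loss, 4))
```

A training loop would repeatedly minimize this quantity on the student's own generations, pulling the small model toward the teacher's behavior precisely where the student actually spends its probability mass.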

From Vision-Language Model to Robot Controller

The ultimate test of any embodied AI model is whether it works on an actual robot. The Hunyuan team used HY-Embodied-0.5 as the backbone for a Vision-Language-Action (VLA) model — a system that takes visual input and language instructions and outputs physical control signals for a robot. VLA models are a growing area of robotics research, with Google, Physical Intelligence, and others pursuing similar architectures.
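The input/output contract of a VLA model can be sketched in a few lines. The action format, function names, and hard-coded rule below are all hypothetical; the paper does not specify HY-Embodied-0.5's action interface, and a real VLA would decode actions from the VLM backbone rather than applying a rule.

```python
from dataclasses import dataclass

# Rough sketch of a Vision-Language-Action interface: an image and a
# language instruction go in, a low-level control command comes out.
# The Action schema and the policy logic are illustrative stand-ins.

@dataclass
class Action:
    dx: float       # end-effector translation along x (metres)
    dy: float       # end-effector translation along y (metres)
    gripper: bool   # True = close the gripper

def vla_policy(image, instruction):
    """Toy stand-in for a learned policy. A trained VLA would run the
    vision-language backbone and decode action tokens; here a keyword
    rule fakes that mapping to show the interface shape."""
    if "pick" in instruction.lower():
        return Action(dx=0.05, dy=0.0, gripper=True)
    return Action(dx=0.0, dy=0.0, gripper=False)

print(vla_policy(image=None, instruction="Pick up the red block"))
```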

According to the paper, the resulting VLA model achieved strong results in real-world physical evaluations, though the abstract does not specify which robot platforms were tested or quantify success rates in detail. The full paper provides additional experimental context.

Open-Source in a Competitive Field

The decision to open-source both the code and model weights is notable. Much of the leading work in embodied AI — particularly systems close to commercial deployment — remains proprietary. By releasing HY-Embodied-0.5 under an open licence, Tencent Hunyuan positions the models as a public research foundation that other teams can build on, fine-tune, or benchmark against.

The embodied AI space has accelerated significantly over the past 18 months, driven by a combination of better base models, more available robot hardware, and growing industry investment. Companies including Boston Dynamics, Figure, and 1X Technologies are all pursuing general-purpose humanoid robots, and the quality of the underlying AI models is viewed as a central bottleneck to broader deployment.

HY-Embodied-0.5 is evaluated across 22 benchmarks covering visual perception, spatial reasoning, and embodied understanding — a broader evaluation surface than many comparable releases, though the selection and weighting of benchmarks always reflect choices made by the authors themselves.

What This Means

Open-sourcing capable, robot-ready foundation models lowers the barrier for academic and commercial teams working on physical AI, and Tencent Hunyuan's release adds competition to a field previously dominated by a handful of well-resourced Western labs.