Researchers have built a 3D object detection system capable of recognising more than 13,500 object categories from a single photograph, paired with a dataset of over 1 million images — both figures that dwarf anything previously available for open-world 3D detection.
The work, posted to arXiv in April 2025, introduces WildDet3D, a model designed to recover the size, location, and orientation of objects in three-dimensional space using only an ordinary RGB image as input. Its companion dataset, WildDet3D-Data, was constructed by generating candidate 3D bounding boxes from existing 2D image annotations, then filtering the results through human verification — a pipeline the authors say yields higher quality than fully automated labelling at scale.
Why Existing 3D Detectors Fall Short in the Real World
Most 3D object detectors today are trained and evaluated on tightly controlled datasets covering a handful of categories — typically vehicles, pedestrians, and furniture. When deployed outside those categories, performance collapses. The problem is twofold: the models are architecturally locked to a single way of receiving instructions (a text label, for instance, but not a click or a drawn box), and the training data simply does not reflect the variety of the open world.
WildDet3D attempts to solve both problems simultaneously. Its architecture is described as "geometry-aware" and accepts three distinct prompt types — natural language text, point prompts (a single click on an object), and bounding box prompts — within a single unified model. This matters because different applications demand different interaction styles, and previous systems forced developers to choose one.
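One way to picture such a unified interface is a single entry point that normalises all three prompt types into one conditioning format before they reach the model. The sketch below is purely illustrative; the class and function names are assumptions, not the paper's API:

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Illustrative only: these names are assumptions, not the paper's API.

@dataclass
class TextPrompt:
    label: str                                   # e.g. "coffee mug"

@dataclass
class PointPrompt:
    xy: Tuple[float, float]                      # a single click, pixel coords

@dataclass
class BoxPrompt:
    xyxy: Tuple[float, float, float, float]      # 2D box, pixel coords

Prompt = Union[TextPrompt, PointPrompt, BoxPrompt]

def encode_prompt(prompt: Prompt) -> dict:
    """Normalise any prompt type into one conditioning dict, so a single
    detection head can consume text, point, and box prompts alike."""
    if isinstance(prompt, TextPrompt):
        return {"type": "text", "value": prompt.label}
    if isinstance(prompt, PointPrompt):
        return {"type": "point", "value": prompt.xy}
    if isinstance(prompt, BoxPrompt):
        return {"type": "box", "value": prompt.xyxy}
    raise TypeError(f"unsupported prompt: {prompt!r}")
```

The design choice this illustrates is that the prompt type becomes runtime data rather than an architectural commitment, so one trained model can serve a text-driven search tool and a click-driven annotation tool alike.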
The model can also absorb auxiliary depth information at inference time, without requiring it. When a depth sensor or depth estimation output is available, performance improves markedly. The authors report an average gain of +20.7 AP (average precision) across tested settings when depth cues are added — a finding that suggests the architecture is designed with real-world deployment flexibility in mind, where depth data is sometimes available from LiDAR or stereo cameras but not always.
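One plausible way to structure that optionality is to treat depth as an extra feature channel fused only when it is supplied. The encoder and fusion below are hypothetical stand-ins for illustration, not the authors' architecture:

```python
from typing import Optional
import numpy as np

# Hypothetical stand-ins, not the paper's architecture; the point is that
# depth is an optional input, folded in only when available.

def rgb_features(image: np.ndarray) -> np.ndarray:
    # Placeholder RGB encoder: per-pixel mean intensity as one channel.
    return image.mean(axis=-1, keepdims=True)

def detect(image: np.ndarray, depth: Optional[np.ndarray] = None) -> np.ndarray:
    """Run from RGB alone; fuse a depth map (H, W) when one exists,
    e.g. from projected LiDAR, stereo, or monocular depth estimation."""
    feats = rgb_features(image)                                     # (H, W, 1)
    if depth is not None:
        feats = np.concatenate([feats, depth[..., None]], axis=-1)  # (H, W, 2)
    return feats
```

A model trained this way degrades gracefully: the same weights serve deployments with and without a depth sensor, rather than requiring two separate models.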
Benchmark Results Across Multiple Evaluation Settings
The paper reports results across several evaluation frameworks, all of which are self-reported by the research team and have not been independently verified at the time of publication.
On the newly introduced WildDet3D-Bench — designed to test open-world generalisation — the model achieves 22.6 AP3D with text prompts and 24.8 AP3D with box prompts. On Omni3D, a broader indoor/outdoor benchmark, it reaches 34.2 AP3D (text) and 36.4 AP3D (box). In zero-shot evaluation — testing on datasets the model was never trained on — WildDet3D achieves 40.3 ODS on Argoverse 2 (an autonomous driving dataset) and 48.9 ODS on ScanNet (an indoor scene dataset).
Zero-shot performance is particularly significant because it measures whether a model has learned general 3D reasoning or merely memorised training distributions. Strong zero-shot numbers point towards genuine generalisation rather than memorisation, though independent replication on held-out benchmarks would be needed to confirm the claim.
The Data Problem Was as Important as the Model
Building WildDet3D-Data was not a straightforward data collection exercise. The researchers leveraged the large corpus of existing 2D annotated images — where bounding boxes already exist for millions of object instances — and developed a pipeline to "lift" those annotations into 3D space, estimating depth and orientation to generate candidate 3D boxes. Human annotators then reviewed and accepted or rejected these candidates, yielding a dataset the team says spans diverse real-world scenes rather than studio or track environments.
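The geometric core of such a lifting step can be sketched with standard pinhole back-projection. Everything below is an assumption for illustration: the function name and intrinsics layout are not from the paper, whose pipeline also estimates object size and orientation before routing candidates to human review.

```python
import numpy as np

# Illustrative geometry only; names and layout are assumptions, not the
# paper's pipeline. Given a 2D box, an estimated depth map, and camera
# intrinsics (fx, fy, cx, cy), recover a candidate 3D box centre.

def lift_box_center(box_xyxy, depth_map, fx, fy, cx, cy):
    """Back-project the centre of a 2D box into camera space using a
    pinhole camera model and a per-pixel depth estimate."""
    x1, y1, x2, y2 = box_xyxy
    u = (x1 + x2) / 2.0                      # 2D box centre, pixels
    v = (y1 + y2) / 2.0
    z = float(depth_map[int(v), int(u)])     # estimated depth, metres
    x = (u - cx) * z / fx                    # pinhole back-projection
    y = (v - cy) * z / fy
    return np.array([x, y, z])               # candidate 3D box centre
```

Candidates produced this way inherit errors from the depth estimate, which is why the human accept/reject pass described above matters: annotators filter out geometrically implausible lifts rather than labelling boxes from scratch.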
The scale is notable: over 1 million images across 13,500 categories is more than two orders of magnitude broader in category coverage than prior public 3D detection datasets. The Omni3D dataset, one of the more comprehensive predecessors, covers around 100 categories. The jump to 13,500 represents a qualitative shift in what "open-world" detection can plausibly mean in practice.
The dataset will likely be as consequential as the model itself. Researchers working on robotics, augmented reality, and autonomous systems have long cited data scarcity as the primary bottleneck to building generalist 3D perception systems. A large, diverse, human-verified dataset released alongside a baseline model creates a new foundation for the field to build on.
What This Means
WildDet3D represents a substantial attempt to bring 3D object detection out of controlled benchmarks and into the open world — and if its results hold under independent scrutiny, the combination of a flexible multi-prompt architecture and a million-image dataset could accelerate progress in robotics, AR, and autonomous systems.