A new research framework that pairs a lightweight object detector with small vision-language models improved construction hazard identification accuracy by nearly 47% in relative terms compared with using the models alone, according to a study posted to arXiv in April 2025.
Construction sites generate thousands of visual safety decisions every day — whether a worker is too close to moving machinery, whether protective equipment is missing, whether a dangerous configuration has emerged. Automating that judgment requires AI that is both accurate and fast enough to be useful in practice. Large vision-language models (VLMs) can reason well about complex scenes, but their computational demands make real-time deployment on a building site largely impractical. Small VLMs, defined in this study as models with fewer than 4 billion parameters, run faster but tend to produce less accurate assessments and are prone to "hallucination" — generating plausible-sounding but incorrect descriptions of what they see.
How the Detection-Guided Framework Works
The researchers' solution is a two-stage pipeline. First, a YOLOv11n object detector — a compact, fast model in the YOLO (You Only Look Once) family — scans each image and pinpoints the locations of workers and construction machinery. Those locations are then converted into structured text prompts that tell the small VLM exactly where the relevant objects are before it begins reasoning about hazards. This "spatial grounding" step gives the language model a factual anchor, reducing the chance it invents details or misidentifies relationships between objects in a busy scene.
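The grounding step can be sketched in a few lines. The detection format and prompt wording below are illustrative assumptions for the sake of the example, not the authors' exact template, and the hazard question is invented:

```python
def detections_to_prompt(detections, question):
    """Convert object-detector output into a spatially grounded text prompt.

    `detections` is a list of (label, confidence, (x1, y1, x2, y2)) tuples
    with pixel coordinates -- an assumed format, since the paper's exact
    prompt structure is not reproduced here.
    """
    lines = ["Detected objects (label, confidence, bounding box):"]
    for label, conf, (x1, y1, x2, y2) in detections:
        lines.append(f"- {label} ({conf:.2f}) at [{x1}, {y1}, {x2}, {y2}]")
    lines.append("")          # blank line before the reasoning question
    lines.append(question)
    return "\n".join(lines)

# Hypothetical detections from a site camera frame, feeding the small VLM.
prompt = detections_to_prompt(
    [("worker", 0.91, (120, 80, 210, 340)),
     ("excavator", 0.88, (230, 60, 620, 400))],
    "Given these object locations, is any worker in a hazardous position?",
)
print(prompt)
```

Because the model receives object identities and coordinates as plain text before reasoning begins, it no longer has to solve localisation itself, which is where small VLMs tend to hallucinate.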
Integrating lightweight object detection with small VLM reasoning provides an effective and efficient solution for context-aware construction safety hazard detection.
The framework was tested on six small VLMs: Gemma-3 4B, Qwen-3-VL 2B and 4B, InternVL-3 1B and 2B, and SmolVLM-2B. All were evaluated in zero-shot settings — meaning the models received no task-specific training examples — on a curated dataset of construction site images annotated with hazard labels and written rationales. All benchmark results reported here are self-reported by the study authors and have not yet been independently replicated.
Performance Gains Across Every Model Tested
The framework improved hazard detection performance across all six models. The strongest result came from Gemma-3 4B, which reached an F1-score of 50.6%, up from 34.5% in the baseline configuration where the model received no detection guidance. The F1-score is a standard metric balancing precision (avoiding false alarms) and recall (catching real hazards); a gain of roughly 16 percentage points is substantial in safety-critical applications, where both missed hazards and unnecessary alerts carry real costs.
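For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall computed from raw detection counts. The counts in this sketch are invented purely to illustrate the arithmetic and do not come from the study:

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # fraction of flagged hazards that were real
    recall = tp / (tp + fn)     # fraction of real hazards that were flagged
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 51 hazards caught, 50 false alarms, 50 hazards missed.
print(round(f1_score(51, 50, 50), 3))  # → 0.505, roughly the paper's 50.6%
```

Because F1 punishes an imbalance between the two components, a model cannot score well by flagging everything (high recall, low precision) or by staying silent (high precision, low recall).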
Explanation quality improved markedly too. BERTScore F1 — a metric that measures how semantically similar a model's written explanation is to a human-written reference — rose from 0.61 to 0.82 for the best model. That matters because a system flagging a hazard without a clear rationale is harder for site supervisors to act on confidently.
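BERTScore works by embedding every token of the candidate and reference texts with a contextual language model, greedily matching each token to its most similar counterpart by cosine similarity, and averaging those matches into precision, recall, and F1. The toy sketch below illustrates only the greedy-matching step, using hand-made 2-D vectors in place of real BERT embeddings; it is not the actual bert_score library:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_f1(cand_vecs, ref_vecs):
    """Greedy-matching F1 in the style of BERTScore.

    Each candidate token vector is matched to its most similar reference
    vector (the precision side), and vice versa (the recall side).
    """
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)

# Toy 2-D "token embeddings" for a model explanation and a human reference.
cand = [(1.0, 0.0), (0.8, 0.6)]
ref = [(1.0, 0.1), (0.0, 1.0)]
print(round(greedy_match_f1(cand, ref), 3))
```

An identical candidate and reference score a perfect 1.0, and the score degrades gracefully as the explanations diverge semantically, which is why the metric suits free-form rationales better than exact string matching.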
Most practically relevant: the object detection step added only 2.5 milliseconds per image to inference time. That overhead is negligible for near real-time monitoring applications, suggesting the approach could realistically integrate into existing site camera systems without requiring expensive hardware upgrades.
Why Small Models Matter for Site Deployment
The emphasis on small models reflects a genuine operational constraint. Construction sites are often remote or poorly connected environments. Routing video feeds to cloud-based large models introduces latency, raises data privacy questions, and creates a dependency on reliable internet — none of which are guaranteed on a working site. Edge deployment, running AI directly on local hardware near the cameras, is more resilient but demands lean models.
Small VLMs with under 4 billion parameters can run on mid-range GPUs or even high-end CPUs, but until now their tendency to hallucinate in complex visual environments limited their usefulness for safety applications where a missed hazard has direct consequences. The detection-guided approach offloads the hard spatial reasoning to the object detector — a task detectors are well-suited for — and lets the language model focus on contextual interpretation rather than object localisation.
This division of labour is not entirely new in computer vision research, but applying it specifically to close the accuracy gap for small VLMs in safety-critical domains is a contribution to the field. The construction industry has been a target for AI safety tools for several years, driven by persistently high rates of workplace fatalities; in the United States, the construction sector consistently accounts for around 20% of all worker deaths annually, according to the Occupational Safety and Health Administration.
Limitations and What Comes Next
The study's authors acknowledge the framework still falls short of the accuracy levels large VLMs can achieve in less constrained settings. An F1-score of 50.6% leaves meaningful room for improvement before deployment in high-stakes environments without human oversight. The zero-shot evaluation also means performance could improve further with even light fine-tuning on domain-specific data — something the researchers flag as a direction for future work.
The dataset used for evaluation, while annotated with hazard labels and rationales, is described as "curated" rather than drawn from continuous real-world footage, which may not fully capture the noise, occlusion, and lighting variability of live construction environments. Independent validation on larger, more diverse datasets would strengthen confidence in the results.
The six models tested represent a snapshot of the small VLM landscape as of early 2025, a space that is evolving quickly. Newer releases from the same model families could shift the performance rankings significantly.
What This Means
For safety teams and technology developers in the construction sector, this research demonstrates a practical path to deploying AI hazard detection on-site without depending on large, resource-intensive models — potentially bringing automated safety monitoring within reach of projects that currently cannot afford it.