A new benchmark called FORGE has exposed a critical weakness in AI systems being deployed in manufacturing: the models lack domain-specific knowledge, not visual perception ability, according to researchers who tested 18 state-of-the-art multimodal large language models on industrial tasks.
The study, published on arXiv in April 2025, challenges a widely held assumption in the field. Researchers and engineers have long focused on improving visual grounding — the ability of AI to locate and identify objects in images — as the key hurdle for deploying AI in industrial settings. FORGE's results suggest the real obstacle lies elsewhere.
The paper's bottleneck analysis makes this explicit: visual grounding is not the primary limiting factor; insufficient domain-specific knowledge is.
Why Existing Benchmarks Fall Short for Industry
Multimodal large language models (MLLMs) combine image and text understanding, and manufacturers have been increasingly exploring them to move beyond basic visual inspection toward more autonomous decision-making. The problem, according to the FORGE authors, is that no existing benchmark adequately reflects the precision that industrial environments demand.
Current evaluation datasets lack what the researchers call fine-grained domain semantics — details like exact model numbers for components, precise structural descriptions, and the kind of specific technical vocabulary that manufacturing professionals use daily. Without benchmarks that test for this granularity, it has been difficult to know where AI systems actually fail on the factory floor.
FORGE addresses this by combining real-world 2D images and 3D point clouds, annotated with detailed domain-specific labels. The three tasks it evaluates — workpiece verification, structural surface inspection, and assembly verification — reflect genuine industrial use cases rather than synthetic or generalised scenarios.
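To make the idea of fine-grained, domain-specific annotation concrete, here is a minimal sketch of what one record in such a dataset might look like. The schema and field names below are illustrative assumptions, not the actual FORGE release format, and the model number shown is invented.

```python
from dataclasses import dataclass

# Hypothetical sketch of a FORGE-style annotation record. The actual
# schema in the released dataset may differ; all field names and the
# sample values are illustrative only.
@dataclass
class WorkpieceAnnotation:
    image_path: str              # 2D photograph of the part
    point_cloud_path: str        # matching 3D point-cloud file
    model_number: str            # exact component identifier
    structural_description: str  # precise technical description of the geometry
    task: str                    # which of the three benchmark tasks this serves

# The three tasks the paper evaluates.
VALID_TASKS = {"workpiece_verification", "surface_inspection", "assembly_verification"}

def is_valid(ann: WorkpieceAnnotation) -> bool:
    """Basic sanity check: a record names a defined task and a model number."""
    return ann.task in VALID_TASKS and bool(ann.model_number)

ann = WorkpieceAnnotation(
    image_path="part_0001.png",
    point_cloud_path="part_0001.ply",
    model_number="GX-204-B",  # invented identifier, not from the dataset
    structural_description="hex-flange bolt, M8 thread, 40 mm shank",
    task="workpiece_verification",
)
print(is_valid(ann))
```

The point is the granularity: each record carries an exact identifier and a precise structural description, which is the kind of label most general-purpose vision datasets omit.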
18 Models Tested, Significant Gaps Found
The researchers evaluated 18 current MLLMs across these three tasks. According to the paper, the results revealed significant gaps between what these models can do and what manufacturing applications actually require. The finding that visual grounding is not the primary limitation runs counter to conventional wisdom in the field.
This is a meaningful reframing. It implies that efforts to improve how AI models see industrial components may be less urgent than efforts to give those models deeper knowledge of the components themselves — their specifications, failure modes, acceptable tolerances, and the technical language used to describe them.
The distinction matters for how companies and researchers allocate development resources. If the vision side of multimodal AI is largely adequate for manufacturing tasks, investment should shift toward domain knowledge acquisition and training data curation.
Fine-Tuning a Small Model Yields Large Gains
Beyond diagnosis, the FORGE team also demonstrated a practical pathway forward. They used their structured dataset annotations as a training resource, applying supervised fine-tuning to a 3-billion-parameter model — relatively compact by current standards. According to the paper, this process produced up to a 90.8% relative improvement in accuracy on held-out manufacturing scenarios not seen during training.
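Since the headline number is a *relative* improvement, it is worth illustrating how that differs from an absolute accuracy gain. The baseline and improved scores below are made up for illustration; the paper does not report these particular figures.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Gain expressed as a fraction of the baseline score."""
    return (improved - baseline) / baseline

# Illustrative numbers only -- not the paper's reported accuracies.
# A jump from 30.0% to 57.2% accuracy is a 27.2-point absolute gain,
# but a roughly 90.7% relative improvement over the baseline.
print(round(relative_improvement(0.300, 0.572) * 100, 1))
```

Reading "90.8% relative improvement" as "near-doubling of accuracy over the baseline" is therefore closer to the truth than reading it as "90.8 percentage points gained".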
This result is self-reported by the researchers and has not yet been independently validated, which is standard at the pre-publication arXiv stage. That said, the scale of improvement — achieved with a smaller model fine-tuned on targeted data — suggests that high-quality, domain-specific training data may matter more than raw model size for industrial AI applications.
The 3B-parameter model result is particularly notable for manufacturing contexts, where edge deployment — running AI on local hardware rather than cloud servers — is often preferred for latency, cost, and data-security reasons. Smaller models that perform well after fine-tuning are more viable for that kind of deployment.
A Dataset Built for the Factory, Not the Lab
The FORGE dataset itself is a contribution independent of the benchmark results. Combining 2D images with 3D point cloud data gives the dataset a richer geometric representation of physical objects than image-only datasets provide. Point clouds capture the three-dimensional shape of an object — useful for inspecting whether a part has been correctly assembled or whether a surface has a structural defect.
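As a toy illustration of why 3D geometry helps, consider a dimensional check against a part's axis-aligned bounding box. This is a minimal sketch of a generic inspection idea, not the method used in the paper; the synthetic point cloud and tolerance are invented.

```python
# Minimal sketch of a point-cloud dimensional check (illustrative, not the
# paper's method): compare a part's axis-aligned bounding box against its
# nominal dimensions within a machining tolerance.

def bounding_box_dims(points):
    """Per-axis extent (x, y, z) of an iterable of (x, y, z) points."""
    return tuple(max(axis) - min(axis) for axis in zip(*points))

def within_tolerance(points, nominal, tol):
    """True if every measured extent is within +/- tol of its nominal value."""
    dims = bounding_box_dims(points)
    return all(abs(d - n) <= tol for d, n in zip(dims, nominal))

# Synthetic cloud: the eight corners of a 40 x 20 x 10 mm block.
corners = [(x, y, z) for x in (0, 40) for y in (0, 20) for z in (0, 10)]
print(within_tolerance(corners, (40, 20, 10), tol=0.5))  # True
```

A 2D image alone cannot support this kind of check directly, because depth must be inferred; a point cloud provides the measurement outright.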
Annotations include exact model numbers and other precise industrial identifiers, making the dataset more representative of the specificity required in real manufacturing quality-control workflows. The researchers have made the code and datasets publicly available at ai4manufacturing.github.io/forge-web, which should allow other teams to replicate findings and build on the resource.
Data scarcity is a persistent problem in industrial AI. Manufacturing companies are often reluctant to share operational imagery due to competitive sensitivity, which limits the size and diversity of publicly available training sets. FORGE does not fully solve that problem, but it adds a structured, annotated resource to a space where high-quality public data is genuinely scarce.
What This Means
For organisations evaluating or deploying multimodal AI in manufacturing, FORGE shifts the development priority from visual capability to domain knowledge. Curating high-quality, technically precise training data is likely a more productive investment than selecting larger or more visually capable foundation models.