Large language models hallucinate non-existent code library features in up to 40% of responses, according to new research published on arXiv. The study also finds that static analysis tools, while useful, face a fundamental ceiling in how much of the problem they can ever fix.

The study, titled An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations, examines one of the most persistent failure modes in AI-assisted software development: LLMs confidently generating code that calls functions, methods, or classes that simply do not exist in the referenced libraries. The researchers evaluated performance across multiple LLMs and natural-language-to-code benchmarks that require real library usage.

How Bad Is the Hallucination Problem?

Across the benchmarks tested, LLMs produced code referencing non-existent library features in between 8.1% and 40% of responses, a wide range that reflects meaningful variation across models and tasks. At the upper end of that range, two in five generated responses contain at least one fabricated library call.

The practical consequence is significant. A developer using an AI coding assistant who doesn't scrutinise every suggestion could introduce broken dependencies, silent failures, or security gaps into production code. The problem is compounded by the fact that hallucinated library calls can look entirely plausible — correct syntax, reasonable naming conventions, coherent logic — making them hard to spot through casual review.
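A hypothetical illustration of how plausible these fabrications can look (our own example, not one from the paper): the call below borrows JavaScript's `JSON.parse` naming, reads naturally, and is syntactically valid Python, yet the function does not exist.

```python
import json

# Hypothetical hallucination: `json.parse` mirrors JavaScript's
# JSON.parse and follows ordinary naming conventions, but Python's
# json module has no such function -- the real API is json.loads.
def parse_payload_broken(payload: str) -> dict:
    return json.parse(payload)  # raises AttributeError at runtime

# The correct call differs by a single token:
def parse_payload(payload: str) -> dict:
    return json.loads(payload)

print(parse_payload('{"retries": 3}'))  # {'retries': 3}
```

Nothing about the broken version looks wrong at a glance, which is exactly why casual review misses it.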

Static analysis tools can detect 14–85% of library hallucinations, but manual analysis reveals a hard upper bound of 48.5–77%: a substantial share of these errors is structurally invisible to such methods.

What Static Analysis Can and Cannot Do

Static analysis — the automated inspection of code without executing it — is an established software engineering practice. Tools in this category check code for syntax errors, type mismatches, undefined references, and similar issues before a program ever runs. The researchers tested whether these tools could be repurposed as a practical, low-cost filter for LLM hallucinations.
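The core idea can be sketched in a few lines (a toy of our own construction, not any tool evaluated in the paper): walk the abstract syntax tree of a generated snippet and flag `module.attr` references that the imported module does not actually provide, inspecting the library's API surface without ever executing the generated code.

```python
import ast
import importlib

def find_missing_attrs(source: str) -> list[str]:
    """Toy static check: flag `module.attr` references where the
    imported module lacks that attribute. Real linters are far more
    sophisticated, but the principle is the same."""
    tree = ast.parse(source)
    imported = {}  # local alias -> real module name
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imported[alias.asname or alias.name] = alias.name
    missing = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id in imported):
            module = importlib.import_module(imported[node.value.id])
            if not hasattr(module, node.attr):
                missing.append(f"{node.value.id}.{node.attr}")
    return missing

# `re.replace` is a plausible-looking fabrication; the real API is re.sub.
snippet = "import re\nprint(re.replace('a', 'b', 'banana'))\n"
print(find_missing_attrs(snippet))  # ['re.replace']
```

This catches undefined references of the kind described above, but only when the call target can be resolved from the source text alone.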

The results show real but inconsistent value. Static analysis tools detected between 16% and 70% of all errors, and between 14% and 85% of library-specific hallucinations, according to the paper. That wide variance reflects how much performance depends on the specific LLM generating the code and the dataset being evaluated — there is no single reliable detection rate.

Critically, the team went beyond measuring what static tools currently detect. Through manual analysis, they identified categories of hallucination that a static method could not plausibly catch under any circumstances — for example, cases where a fabricated function name closely mirrors a real one, or where the hallucinated call is syntactically valid even though the feature doesn't exist. This analysis produced an upper bound on static analysis potential of 48.5% to 77% — meaning that even a theoretically perfect static analyser would miss between 23% and 51.5% of library hallucinations.
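One way such a blind spot can arise (a hypothetical sketch of our own, not an example from the paper) is dynamic dispatch: when the attribute name is only constructed at runtime, no inspection of the source text can resolve which function is ultimately being called.

```python
import json

# Statically visible: a tool that knows json's API surface can flag
# this fabricated call directly (the real function is json.loads):
#     data = json.parse("{}")

# Statically invisible: the attribute name exists only at runtime, so
# the same fabricated reference cannot be resolved from the source.
method = "pa" + "rse"                        # evaluates to "parse"
loader = getattr(json, method, json.loads)   # silently falls back
print(loader('{"a": 1}'))  # {'a': 1}
```

A static analyser sees only `getattr(json, method, ...)`; whether `method` names a real function is undecidable from the text alone.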

Why the Ceiling Matters

The existence of a hard ceiling is a significant finding. Much applied work in AI reliability focuses on improving existing tools incrementally. This research draws a structural line: for a defined category of errors, static analysis is the wrong class of solution, not merely an immature one.

The authors describe static analysis as a "cheap method for addressing some forms of hallucination" — an important qualification. For teams deploying LLMs in coding workflows, static analysis tools offer a low-overhead, low-cost layer of protection that can catch a meaningful fraction of errors at essentially no computational cost compared to re-running the model. That is a practical benefit worth capturing.

But the findings also suggest that tooling improvements alone will not close the gap. The errors that static analysis cannot catch require different mitigation strategies — potentially including retrieval-augmented generation that grounds the model in actual library documentation, fine-tuning on verified code, or runtime testing approaches that execute the generated code against real library versions.
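The runtime-testing idea can be sketched as follows (our own illustration assuming a Python workflow, not tooling from the paper): execute the generated snippet in a fresh interpreter against the installed library versions, so a fabricated call surfaces as an ImportError or AttributeError regardless of how plausible it looked on the page.

```python
import subprocess
import sys
import tempfile

def passes_runtime_check(source: str) -> bool:
    """Run generated code in a subprocess; a non-zero exit status
    (e.g. AttributeError from a fabricated call) fails the check."""
    with tempfile.NamedTemporaryFile(
            "w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True)
    return result.returncode == 0

real = "import json; print(json.loads('{}'))"
fake = "import json; print(json.parse('{}'))"  # hallucinated call
print(passes_runtime_check(real), passes_runtime_check(fake))  # True False
```

Unlike static inspection, this catches dynamically constructed calls too, at the cost of actually running untrusted generated code, which in practice would need sandboxing.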

Broader Context in LLM Code Generation Research

Library hallucination sits at the intersection of two well-documented LLM weaknesses: factual hallucination and code generation reliability. LLMs are trained on code snapshots that may not reflect current library versions, and their parametric knowledge of any given library's API surface is inherently incomplete and potentially outdated.

Previous research has addressed hallucination in prose — where factual errors can sometimes be caught through retrieval or verification pipelines — but code hallucinations present a distinct challenge. A hallucinated library call does not just mislead; it produces non-functional software. The stakes in production environments are concrete.

The benchmark methodology used in the study focuses on natural-language-to-code tasks that specifically require library usage, which isolates the library hallucination problem rather than conflating it with general code quality issues. All results are based on the researchers' own evaluations as reported in the paper.

What This Means

For teams building or evaluating LLM-based coding tools, this research establishes that static analysis is a worthwhile but structurally limited first line of defence — and that solving library hallucinations fully will require approaches that go beyond code inspection alone.