Large language models consistently underperform a much smaller graph-based parser on relation extraction tasks as linguistic complexity increases, according to a new study posted to arXiv (cs.CL).

Relation extraction — identifying how entities in text relate to one another — is a foundational step in building knowledge graphs, powering everything from search engines to biomedical databases. The assumption that LLMs, with their vast parameter counts and broad training, would naturally excel at such tasks has driven widespread adoption. This study puts that assumption under pressure.

Four LLMs Benchmarked Across Six Datasets

Researchers evaluated four LLMs against a single graph-based parser across six relation extraction datasets, each containing sentence graphs of varying sizes and complexities. The paper does not name specific LLM products in the abstract, but the benchmark spans both supervised and in-context learning settings — the two most common deployment approaches for LLMs on structured tasks.

The core finding is straightforward: the graph-based parser's advantage grows as the number of relations in a document increases. In simpler documents with few entity relationships, the gap is modest. In complex documents with dense relational structures, the lightweight parser performs substantially better.

The graph-based parser increasingly outperforms the LLMs as the number of relations in the input documents increases.

This matters because real-world knowledge graph construction — in scientific literature, legal documents, or financial filings — routinely involves exactly the kind of dense, multi-relation sentences where LLMs appear to struggle most.

Why Graph Complexity Exposes LLM Limitations

The researchers attribute the performance difference to the nature of the underlying linguistic structures. When a sentence encodes many simultaneous relationships between entities, a graph-based parser can explicitly model those dependencies. LLMs, by contrast, process text as sequences and must implicitly infer relational structure — a process that appears to degrade as graph complexity scales.
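To make the contrast concrete, here is a minimal sketch of what "explicitly modelling" relational structure means: representing a sentence's relations as a labelled graph, where each relation is an edge between entity nodes. This is an illustration of the data structure, not the paper's parser; the sentence, entities, and relation labels are invented.

```python
from collections import defaultdict

class RelationGraph:
    """A minimal labelled multigraph over entity mentions (illustrative only)."""

    def __init__(self):
        # head entity -> list of (relation_label, tail_entity) edges
        self.edges = defaultdict(list)

    def add_relation(self, head, relation, tail):
        self.edges[head].append((relation, tail))

    def num_relations(self):
        return sum(len(v) for v in self.edges.values())

# Hypothetical sentence: "Aspirin, developed by Bayer, inhibits COX-1 and COX-2."
g = RelationGraph()
g.add_relation("Aspirin", "developed_by", "Bayer")
g.add_relation("Aspirin", "inhibits", "COX-1")
g.add_relation("Aspirin", "inhibits", "COX-2")

# A graph-based parser predicts this structure jointly; a sequence model
# must instead serialise it, e.g. as (head, relation, tail) triples in text.
print(g.num_relations())  # 3
```

Even this one short sentence packs three relations around a single entity; the study's finding is that as such edge counts grow, serialising the structure through a sequence model degrades faster than predicting it as a graph.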

This is not a new concern in NLP research. Graph neural networks and dependency parsers have long been used precisely because they can represent structural relationships directly, rather than compressing them into attention patterns. What this study adds is a systematic, cross-dataset demonstration that the gap is not merely theoretical — and that it grows predictably with complexity.

The finding also arrives in a specific context: the NLP community has debated whether LLMs have effectively made task-specific architectures obsolete. For many classification and generation tasks, that case is strong. For structured extraction from complex text, this research suggests the answer is more nuanced.

The Case for Lightweight, Specialised Models

Perhaps the most practically significant finding is the cost asymmetry. Graph-based parsers are substantially smaller than the LLMs they outperform here. Running a large language model for inference at scale requires significant compute: GPU hours, memory, and energy. A graph-based parser designed for dependency structures operates at a fraction of that cost.
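A back-of-envelope calculation shows why the asymmetry compounds at scale. The parameter counts below are hypothetical placeholders, not figures from the paper; the roughly 2 × parameters FLOPs-per-token estimate is a standard approximation for transformer inference.

```python
def inference_flops_per_token(num_params):
    # Standard rough approximation for a dense transformer forward pass.
    return 2 * num_params

llm_params = 7e9       # hypothetical 7B-parameter LLM
parser_params = 100e6  # hypothetical 100M-parameter graph-based parser

ratio = inference_flops_per_token(llm_params) / inference_flops_per_token(parser_params)
print(f"LLM costs roughly {ratio:.0f}x more compute per token")  # roughly 70x
```

Under these illustrative assumptions, every token processed by the LLM costs tens of times more compute, a gap that multiplies directly across millions of documents.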

For organisations building knowledge graphs at scale — processing millions of scientific abstracts, patent filings, or news articles — this tradeoff is not academic. Choosing the wrong architecture could mean higher costs and lower accuracy simultaneously.

The study evaluates models in both supervised settings, where the model is fine-tuned on labelled extraction data, and in-context learning settings, where LLMs are prompted with examples rather than trained. The fact that the graph-based parser outperforms in both conditions strengthens the conclusion: this is not simply a matter of LLMs needing more task-specific fine-tuning.
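The in-context setting described above typically works by assembling a few-shot prompt: labelled demonstrations followed by the query sentence. The paper's abstract does not specify its prompt format, so the sketch below is illustrative, with invented sentences and triples.

```python
def build_icl_prompt(demonstrations, query_sentence):
    """Few-shot prompt: each demonstration pairs a sentence with its gold triples."""
    parts = []
    for sent, triples in demonstrations:
        triple_str = "; ".join(f"({h}, {r}, {t})" for h, r, t in triples)
        parts.append(f"Sentence: {sent}\nRelations: {triple_str}")
    # The model is asked to complete the relations for the unseen sentence.
    parts.append(f"Sentence: {query_sentence}\nRelations:")
    return "\n\n".join(parts)

demos = [
    ("Marie Curie was born in Warsaw.",
     [("Marie Curie", "born_in", "Warsaw")]),
]
prompt = build_icl_prompt(demos, "Alan Turing was born in London.")
print(prompt)
```

In the supervised setting, by contrast, the model's weights are updated on such labelled pairs rather than seeing them only at prompt time. That the parser wins in both regimes is what rules out "just fine-tune more" as an explanation.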

Benchmarks and What They Don't Capture

It is worth noting that these results are drawn from six specific datasets with measurable graph complexity metrics — the findings reflect performance on those benchmarks, which may not generalise uniformly to every domain or language. The researchers measure complexity primarily through the number of relations per document, which is one meaningful dimension but not the only one. Factors like relation type diversity, entity ambiguity, and domain-specific vocabulary could interact differently across systems.
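The complexity dimension the study reportedly relies on, relations per document, is simple to compute. The dataset format below is invented for illustration; real benchmarks store gold relations in their own schemas.

```python
def relations_per_document(docs):
    """Mean number of gold relations per document (illustrative metric)."""
    return sum(len(d["relations"]) for d in docs) / len(docs)

# Toy corpus: one sparse document, one dense one.
corpus = [
    {"relations": [("A", "r1", "B")]},
    {"relations": [("A", "r1", "B"), ("B", "r2", "C"), ("C", "r3", "A")]},
]
print(relations_per_document(corpus))  # 2.0
```

Averaging hides exactly the variation the caveat above points to: two corpora with the same mean can differ sharply in relation-type diversity or entity ambiguity, which this single number does not capture.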

The benchmarks appear to be established academic datasets rather than proprietary evaluations, which adds credibility, though the paper had not undergone peer review at the time of its posting to arXiv.

The study also does not claim LLMs are unsuitable for relation extraction broadly. In lower-complexity settings, the performance difference narrows. The practical implication is about matching tool to task: LLMs may remain the better option for general-purpose or low-volume extraction, while specialised parsers hold the advantage where structure is dense.

What This Means

For teams building knowledge graphs or information extraction pipelines from complex text, this research is a concrete signal to evaluate graph-based parsers alongside LLMs — particularly when document complexity is high — rather than defaulting to larger models on the assumption that scale solves all problems.