Large language models grade essays differently from human raters, according to a new study posted to arXiv, with models from the GPT and Llama families showing only weak agreement with human scores when used without task-specific training.

The research, posted to arXiv's cs.AI category, evaluates automated essay scoring as a practical application of LLMs, an area attracting growing interest from educators and EdTech developers. The study tested models in an "out-of-the-box" setting, meaning no fine-tuning or specialised prompting was applied, to reflect how these tools are often deployed in real-world conditions.
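To make the "out-of-the-box" setting concrete, a minimal sketch of what such a grading prompt looks like is below. The wording, score range, and function name are illustrative assumptions, not the paper's actual prompt; the point is the absence of any rubric, examples, or task-specific guidance.

```python
# Hypothetical sketch of an "out-of-the-box" grading prompt: no rubric,
# no worked examples, no fine-tuning -- just the essay and a bare
# instruction. Prompt text and score range are invented for illustration.

def build_zero_shot_prompt(essay: str, low: int = 1, high: int = 6) -> str:
    """Assemble a minimal grading prompt with no task-specific guidance."""
    return (
        f"Score the following essay on a scale from {low} to {high}.\n"
        "Reply with the score only.\n\n"
        f"Essay:\n{essay}"
    )

prompt = build_zero_shot_prompt("Cell phones should be allowed in school because...")
print(prompt.splitlines()[0])
```

Everything the model knows about "quality" in this setting comes from its pretraining, which is exactly where the biases described below can enter.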

LLMs Reward Brevity, Punish Imperfect Prose

The study's most concrete finding is a consistent directional bias: LLMs tend to assign higher scores to short or underdeveloped essays and lower scores to longer essays that contain minor grammatical or spelling errors. Human raters, by contrast, do not penalise length-associated surface errors as heavily and do not reward brevity as a proxy for quality.
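A bias like this is straightforward to test for. The sketch below, with entirely fabricated toy scores, shows one way an analyst might check it: if LLMs reward brevity, the score gap (LLM minus human) should correlate negatively with essay length. This is a generic diagnostic, not the paper's own method.

```python
# Hypothetical length-bias check: correlate essay length with the
# LLM-vs-human score gap. All data rows here are fabricated toy values.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx ** 0.5 * vary ** 0.5)

# (word_count, human_score, llm_score) -- invented for illustration
rows = [(120, 2, 4), (250, 3, 3), (400, 4, 3), (600, 5, 3), (800, 5, 2)]
lengths = [r[0] for r in rows]
gaps = [r[2] - r[1] for r in rows]  # positive gap = LLM over-scored

r = pearson(lengths, gaps)
print(round(r, 2))  # strongly negative: short essays over-scored
```

In this toy data the correlation comes out strongly negative, mirroring the directional pattern the study reports: the shorter the essay, the more the LLM over-scores it relative to the human rater.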

This divergence suggests the models may be treating surface-level signals — clean formatting, absence of errors, conciseness — as indicators of quality, rather than engaging with the depth, argumentation, or originality that human graders typically prioritise.

The researchers do find one area of internal coherence: LLM-generated scores align with LLM-generated feedback. When a model praised an essay, it scored it higher; when it criticised, it scored lower. This internal consistency is notable, but it does not resolve the core problem: the model's praise and criticism are themselves calibrated to different criteria than a human teacher would apply.

What the Models Are Actually Responding To

The pattern of over-scoring short essays is particularly significant for practical deployment. A student who writes a brief but polished response may receive an inflated score, while a student who attempts a more ambitious, longer piece — and makes a handful of spelling mistakes in doing so — may be unfairly penalised. In educational contexts, these are not marginal errors. They could shape feedback loops, grade distributions, and student behaviour.

The study does not identify which specific features within essays trigger these biases, but the directional pattern is clear enough to raise concerns about deploying these models as standalone graders. The authors note that agreement between LLM and human scores "varies with essay characteristics," implying the gap is not uniform — some essay types may align better than others, though the paper does not specify which.
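Agreement in automated essay scoring is conventionally quantified with quadratic weighted kappa (QWK), which penalises large score disagreements more heavily than near-misses. The sketch below implements QWK from scratch on toy score lists (the actual scores used by the study are not reproduced here); it shows how "weak agreement" would be measured rather than what the paper found.

```python
# Quadratic weighted kappa (QWK): the standard chance-corrected agreement
# metric in automated essay scoring. Score lists below are toy values.

def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """QWK between two integer rating lists on [min_rating, max_rating]."""
    n = max_rating - min_rating + 1
    # observed counts of each (rating_a, rating_b) pair
    observed = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    # marginal histograms give expected counts under independence
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(col) for col in zip(*observed)]
    total = len(a)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = ((i - j) ** 2) / ((n - 1) ** 2)  # quadratic disagreement weight
            expected = hist_a[i] * hist_b[j] / total
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den

human = [2, 3, 4, 4, 5, 3, 2, 4]  # fabricated human scores
llm   = [4, 3, 3, 3, 3, 4, 3, 3]  # fabricated LLM scores
print(round(quadratic_weighted_kappa(human, llm, 1, 6), 2))  # low: weak agreement
```

A QWK near 1.0 indicates near-human reliability; values near or below zero mean the scorer agrees with humans no better than chance, which is the regime "weak agreement" points toward.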

How This Fits the Broader Automated Scoring Landscape

Automated essay scoring is not new. Rule-based systems and earlier machine learning approaches have been used for decades in high-stakes testing environments, including some standardised exams. What is new is the proposal to use general-purpose LLMs — not purpose-built scoring engines — as drop-in graders.

The appeal is obvious: LLMs can generate written feedback alongside a score, something traditional automated systems struggle to do meaningfully. The risk, as this study illustrates, is that general-purpose models carry assumptions and biases baked in from their training data that do not map cleanly onto educational assessment rubrics.

The researchers stop short of dismissing LLMs as useless in this domain. According to the paper, LLMs "can be reliably used in supporting essay scoring" — a carefully worded conclusion that frames the technology as an assistant rather than a replacement. The distinction matters: a tool that flags essays for human review or offers students a first-pass feedback draft carries far lower risk than one trusted to assign final grades.
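What an assistant-not-replacement deployment could look like in practice is sketched below. The routing rules and thresholds are invented assumptions for illustration, not anything the paper specifies: the LLM score is advisory, and essays in the known risk zones are sent to a human grader.

```python
# Hypothetical triage policy for the "assistant, not replacement" role:
# the LLM score is advisory only, and risky cases go to a human.
# Cutoffs and score band are invented assumptions, not from the paper.

def triage(llm_score: int, essay_words: int,
           short_cutoff: int = 150, band: tuple = (2, 5)) -> str:
    """Route an essay: accept LLM output as draft feedback only in safe cases."""
    lo, hi = band
    if essay_words < short_cutoff:
        return "human_review"    # short essays: known over-scoring bias
    if llm_score < lo or llm_score > hi:
        return "human_review"    # extreme scores always get a second look
    return "llm_draft_feedback"  # LLM output used as first-pass feedback only

print(triage(llm_score=5, essay_words=90))   # human_review
print(triage(llm_score=3, essay_words=420))  # llm_draft_feedback
```

Under a policy like this, no grade is final without a human in the loop; the model's contribution is triage and draft feedback, which is the lower-risk role the quoted conclusion describes.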

Limitations Worth Noting

The study tests models in a zero-shot, out-of-the-box configuration. This is a meaningful real-world baseline, but it is also the weakest version of the use case. Models fine-tuned on essay-scoring datasets or prompted with detailed rubrics may perform differently. The authors do not claim their findings generalise to all LLM deployment strategies — only to the default, unprompted setting.

The findings are also based on the research team's own analysis: as an arXiv preprint, the paper's benchmarks and datasets have not been independently verified, and it had not undergone formal peer review at the time of publication.

What This Means

Educators and EdTech developers considering LLMs for essay grading should treat these models as feedback-generation tools rather than autonomous scorers — the internal coherence is there, but the alignment with human judgement is not strong enough for high-stakes assessment without significant oversight.