A comparative study of nine AI models analysing 10,990 Arabic news headlines about the 2023 Gaza War has found that the choice of model dramatically shapes which sentiments are detected — with some systems classifying nearly everything as negative and others defaulting overwhelmingly to neutral.
The research, posted to arXiv in April 2025 and authored by Eleraqi, draws on a corpus of Arabic-language headlines compiled from conflict-period coverage. Rather than measuring models against a single human-annotated benchmark, the study treats sentiment classification itself as an interpretive act, one that varies systematically with model architecture. To quantify those differences, the author applied information-theoretic tools including Shannon entropy, Jensen-Shannon distance, and a custom Variance Score tracking each model's deviation from the group aggregate.
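The paper's exact formulas and per-model figures are not reproduced in this article, but the quantities it names can be sketched in a few lines. In the sketch below, the label distributions are invented for illustration, and reading the Variance Score as each model's Jensen-Shannon distance from the group-aggregate distribution is an assumption, not the paper's stated definition:

```python
import math

def shannon_entropy(p):
    """Entropy (in bits) of a probability distribution over sentiment labels.

    Low entropy means the model collapses onto one label; high entropy
    means its labels are spread more evenly.
    """
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence (base 2), a building block for JS distance."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence, bounded [0, 1]."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m))

# Hypothetical label distributions over (negative, neutral, positive);
# the study's actual per-model numbers are not reproduced here.
llama = [0.95, 0.04, 0.01]    # stand-in for near-total collapse into negativity
marbert = [0.15, 0.80, 0.05]  # stand-in for a strong pull toward neutral

# Group aggregate: the mean of the models' label distributions.
aggregate = [(a + b) / 2 for a, b in zip(llama, marbert)]

# One plausible reading of a "Variance Score": each model's JS distance
# from the group aggregate (an assumption, not the paper's formula).
for name, dist in [("LLaMA", llama), ("MARBERT", marbert)]:
    print(name,
          round(shannon_entropy(dist), 3),
          round(js_distance(dist, aggregate), 3))
```

On these toy numbers, the near-collapsed distribution has much lower entropy than the neutral-leaning one, which is the kind of contrast the study's metrics are designed to surface.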
How the Nine Models Split on Conflict Headlines
The study pitted three general-purpose large language models — including GPT-4.1 and Meta's LLaMA-3.1-8B — against six Arabic-specialised BERT variants, including MARBERT, which is fine-tuned on large volumes of Arabic social media text. The divergence was stark. Fine-tuned BERT models, particularly MARBERT, showed a strong pull toward neutral classifications, effectively flattening the emotional signal across headlines regardless of content. LLMs moved in the opposite direction, consistently assigning negative sentiment at elevated rates.
LLaMA-3.1-8B produced the most extreme result: according to the study, it exhibited near-total collapse into negativity, classifying the overwhelming majority of headlines as negative irrespective of framing or subject matter. GPT-4.1 behaved differently — it adjusted its sentiment judgments in response to the narrative frame of a given headline, distinguishing between humanitarian, legal, and security contexts and modulating its output accordingly. Other LLMs showed limited capacity for that kind of contextual sensitivity.
The choice of model constitutes a choice of interpretive lens, shaping how conflict narratives are algorithmically framed and emotionally evaluated.
Why Model Architecture Drives the Divergence
The study argues the divergence is not random noise but a structural feature of how different architectures process language. BERT-based models fine-tuned on Arabic corpora learn statistical associations from their training data — and MARBERT's heavy weighting toward neutral may reflect patterns in Arabic social media text, where explicit sentiment markers are often absent or ambiguous. LLMs, trained with reinforcement learning from human feedback and instruction-tuning, may amplify negative readings in conflict contexts because their training surfaces reflect how human annotators or raters respond to war-related language.
This architectural explanation matters because it implies the divergence is reproducible and predictable — not a bug to be patched but a property of the system. Researchers applying any single model to conflict media would, according to this framing, be measuring that model's interpretive tendencies as much as the underlying sentiment of the text itself.
What This Means for Computational Social Science
The stakes are practical. Sentiment analysis of news media is a standard tool in political science, media studies, and public health research. Studies tracking how coverage of a conflict shifts over time, or comparing tone across outlets, routinely rely on automated classifiers to process volumes of text no human team could read. If those classifiers embed systematic biases tied to architecture, comparisons across studies using different tools may be measuring different things entirely.
The paper highlights this as a methodological risk specific to contexts of war and crisis — moments when the emotional valence of coverage carries significant interpretive and political weight. A study using LLaMA-3.1-8B to measure negativity in Gaza war headlines would, on this evidence, find very high negativity almost by construction. A parallel study using MARBERT would find predominantly neutral coverage of the same headlines. Neither result would straightforwardly reflect the text.
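The "by construction" point can be made concrete with a deliberately crude sketch. The two classifiers below are not the real models; they are hypothetical stand-ins for the tendencies the study reports, and the headlines are invented examples:

```python
# Toy illustration: the same headlines, two hypothetical classifiers,
# two very different "findings" about how negative the coverage is.
headlines = [
    "Aid convoy reaches northern Gaza",
    "Strikes continue overnight in Rafah",
    "UN court hears arguments on the conflict",
    "Ceasefire talks resume in Cairo",
]

def llm_like(headline):
    # Stand-in for the reported LLaMA-3.1-8B behaviour:
    # nearly everything labelled negative, regardless of framing.
    return "negative"

def bert_like(headline):
    # Stand-in for the reported MARBERT behaviour:
    # a strong default to neutral, flattening the emotional signal.
    return "neutral"

def negativity_rate(classifier, texts):
    """Share of texts the classifier labels negative."""
    labels = [classifier(t) for t in texts]
    return labels.count("negative") / len(labels)

print(negativity_rate(llm_like, headlines))   # 1.0
print(negativity_rate(bert_like, headlines))  # 0.0
```

A study built on the first classifier "finds" uniformly negative coverage; one built on the second "finds" neutral coverage, over the identical corpus. The measured quantity is the classifier's disposition, not the text.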
The author also flags a subtler concern: automated sentiment outputs are often treated as neutral, objective measures of media tone. The study's epistemological framing directly contests that assumption, positioning each model as an agent with an interpretive stance rather than a transparent measuring instrument.
Limitations and What Comes Next
The study does not evaluate models against a human-annotated gold standard — a deliberate methodological choice, but one that limits traditional accuracy comparisons. Readers cannot determine from this research which model is correct; the design is built to surface disagreement, not resolve it. The corpus itself — 10,990 headlines — is large enough for distributional analysis but confined to a single language and conflict, so generalisability to other languages or crises remains untested.
The Arabic NLP community has invested heavily in domain-specific fine-tuning over the past five years, and tools like MARBERT represent genuine advances in handling the morphological complexity of Arabic text. But this study suggests that specialisation for a language does not automatically produce neutrality on politically charged content — it may simply relocate the bias.
Future work could extend the comparative framework to human annotation studies, allowing researchers to assess not just inter-model divergence but model-to-human divergence across annotator demographics and political contexts.
What This Means
For researchers, journalists, and policymakers using AI sentiment tools to interpret conflict media coverage, this study is a direct warning: the model you choose is not a neutral instrument but an interpretive position. Swapping models mid-analysis, or comparing findings across studies that used different tools, may produce conclusions that reflect architecture rather than reality.