AI Research
The latest breakthroughs, papers, and findings from AI research labs worldwide.
Researchers affiliated with Berkeley AI Research (BAIR) have published a blog post describing GRASP, a gradient-based planning method designed to make long-horizon planning with learned world models more robust. According to the post, GRASP lifts trajectories into virtual states to parallelize optimization across time, injects stochasticity into state iterates for exploration, and reshapes gradients so that action signals remain clean while avoiding gradients through high-dimensional vision models. The authors list Mike Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar as collaborators. The post frames long-horizon planning as the stress test where current world-model-based control methods tend to break down.


Stanford AI Index Reports US-China Model Performance Gap Narrowed to 2.7%
Stanford's 2026 AI Index Report documents the performance gap between top US and Chinese AI models shrinking from 17.5-31.6 percentage points in May 2023 to 2.7% in March 2026. The report records the change despite US private AI investment of $285.9 billion against China's $12.4 billion, and alongside an 89% drop in AI talent migration to the United States since 2017.

Vidoc Security Says It Replicated Anthropic's Mythos Findings Using Public Models
Researchers at Vidoc Security say they reproduced safety-relevant behaviors from Anthropic's Mythos work using off-the-shelf AI models, according to a report in Decrypt. The claim, if it holds up under peer review, speaks to a recurring question in AI safety research: whether findings from proprietary frontier models generalize to publicly available systems.

AI Models Play Cards Against Humanity — and Agree With Each Other More Than With Humans
A new study posted on arXiv tested five frontier large language models by having them play Cards Against Humanity across nearly 10,000 rounds, selecting the funniest card from ten options each time. All models beat random chance, but their humor preferences aligned far more closely with each other than with human players, raising the question of whether AI humor judgment reflects genuine taste or built-in structural biases.

LLMs Lose Ground to Lightweight Graph Parsers When Relation Extraction Gets Complex
A new study posted to arXiv's cs.CL section finds that large language models underperform much smaller graph-based parsers on relation extraction when text contains complex linguistic structures. Researchers tested four LLMs against a graph-based parser across six datasets and found that the performance gap widens as the number of relations in a document increases. The findings challenge assumptions about LLM superiority and point toward specialised architectures for knowledge graph construction.

New Research Reveals What Makes Preference Training Data Work for AI Reasoning
A new study posted on arXiv identifies two distinct factors that determine how effectively preference data trains language models to reason: the capability gap between the models that generate the good and bad examples, and the quality difference within individual data pairs. The findings offer a concrete recipe for building better training datasets: maximize the capability gap between the example generators, then filter by within-pair quality to train more efficiently.
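The second step of that recipe, filtering by within-pair quality, can be illustrated with a minimal sketch. The field names (`chosen_score`, `rejected_score`), the scoring scale, and the `min_margin` threshold are hypothetical and are not taken from the study; the code only shows the general idea of keeping pairs whose preferred response is clearly better than the rejected one.

```python
def filter_pairs_by_margin(pairs, min_margin):
    """Keep preference pairs whose quality margin is at least min_margin.

    Each pair is a dict with hypothetical keys 'chosen_score' and
    'rejected_score' (for example, scores from some external grader).
    """
    kept = []
    for pair in pairs:
        margin = pair["chosen_score"] - pair["rejected_score"]
        if margin >= min_margin:
            kept.append(pair)
    return kept


# Illustrative, made-up scores on an arbitrary 0-10 scale.
example_pairs = [
    {"id": "a", "chosen_score": 9.0, "rejected_score": 3.0},
    {"id": "b", "chosen_score": 6.0, "rejected_score": 5.5},
    {"id": "c", "chosen_score": 8.0, "rejected_score": 4.0},
]
filtered = filter_pairs_by_margin(example_pairs, min_margin=2.0)
print([p["id"] for p in filtered])
```

The threshold would normally be tuned on held-out data; the value here is arbitrary.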

AI Writing Tools Erasing Linguistic Fingerprints From Research Papers, Study Finds
A new study posted to arXiv's cs.CL section finds that large language models are accelerating the erasure of native-language signals in academic writing. Researchers analyzed papers from the ACL Anthology across three eras (pre-neural network, pre-LLM, and post-LLM) and found a consistent decline in the ability to identify an author's native language from their writing. The effect is uneven: Japanese and Korean authors show sharper-than-expected signal loss, while Chinese and French writers show unexpected resistance.

Study Exposes Systematic Failures When AI Agents Serve Multiple Users
A new paper posted on arXiv presents what its authors describe as the first systematic study of 'multi-user' AI agents: systems required to serve several people simultaneously, each with different roles and authority levels. Testing frontier large language models under these conditions, the researchers found recurring failures: models struggled to keep priorities stable between conflicting users, leaked private information over extended conversations, and became inefficient when coordination required gathering information from multiple sources.

AI Models Disagree Sharply on Sentiment in Gaza War Headlines, Study Finds
A new study comparing nine AI models — three large language models and six fine-tuned Arabic BERT models — on a corpus of 10,990 Arabic news headlines about the 2023 Gaza War found pronounced, non-random divergence in how each model classified sentiment. Fine-tuned BERT models leaned heavily toward neutral labels, while LLMs amplified negative sentiment, with Meta's LLaMA-3.1-8B collapsing almost entirely into negativity. The findings challenge assumptions that automated sentiment tools produce neutral or interchangeable readings of conflict media.

Tree-Structured Sparsity Cuts Transformer Compute to 5% Without Accuracy Loss
Researchers have demonstrated that replacing standard feed-forward layers in transformer models with tree-structured sparse alternatives can activate fewer than 5% of a layer's units per token while matching the performance of fully dense models. The work, published on arXiv, scales beyond 1 billion parameters and works in zero- and few-shot settings — offering a potentially practical path to cheaper large language model inference without a dedicated routing network.

New 'Attn-Sampler' Algorithm Improves Decoding for Diffusion Language Models
Researchers have proposed a training-free decoding algorithm called Attn-Sampler that improves how diffusion-based large language models generate text. Posted on arXiv, the paper argues that decoding tokens in order of their attention matrix column sums, rather than relying on token-level information alone, leads to better output quality and faster parallel generation. The method requires no additional model training and shows consistent gains across multiple benchmarks.
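As a rough illustration of ordering by attention column sums, the sketch below ranks still-masked positions by how much total attention each one receives. It is not the paper's implementation: the matrix layout (rows are queries, columns are keys), the descending order, and the helper names `column_sums` and `pick_positions` are all assumptions made for the example.

```python
def column_sums(attn):
    """Sum each column of a square attention matrix given as a list of rows.

    Assumes attn[i][j] is the weight that query position i places on key
    position j, so a column sum is the total attention position j receives.
    """
    n_cols = len(attn[0])
    return [sum(row[j] for row in attn) for j in range(n_cols)]


def pick_positions(attn, masked_positions, k):
    """Return up to k masked positions, highest column sum first (assumed order)."""
    sums = column_sums(attn)
    ranked = sorted(masked_positions, key=lambda j: sums[j], reverse=True)
    return ranked[:k]


# Toy 3x3 attention matrix; each row sums to 1.
toy_attn = [
    [0.1, 0.6, 0.3],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
]
# Positions 0 and 2 are still masked; decode one position this step.
next_positions = pick_positions(toy_attn, masked_positions=[0, 2], k=1)
print(next_positions)
```

Raising `k` decodes several positions in the same step, which is where the parallel-generation benefit described in the summary would come from.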

Temperature Settings Matter as Much as Prompting Strategy in Reasoning AI Models
A new study on arXiv evaluating Grok-4.1 across four temperature settings and two prompting strategies on Olympiad-level mathematics problems found that the common practice of using zero temperature for reasoning tasks may be suboptimal. Zero-shot prompting peaked at 59% accuracy at moderate temperatures, while chain-of-thought prompting performed best at the extremes. The findings suggest developers should tune temperature and prompting strategy together rather than independently.
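For readers unfamiliar with the setting, temperature rescales a model's output logits before sampling. The sketch below shows the standard temperature-scaled softmax in plain Python; the logits are made up, and treating temperature zero as greedy (argmax) decoding is a common convention assumed here, not a detail taken from the study.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature.

    A temperature of 0 is treated as greedy decoding (assumed convention):
    all probability mass goes to the highest logit.
    """
    if temperature <= 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # illustrative values only
for t in (0.0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution and make sampling more varied.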

Neural Models Match Human Consistency in Text-to-Speech Quality Ratings
Researchers have developed a suite of neural models capable of evaluating text-to-speech audio quality more consistently than human raters. Posted on arXiv, the study introduces NeuralSBS and WhisperBert, models that approximate expert judgments at scale, achieving a root mean square error (RMSE) of about 0.40 against a human inter-rater baseline of 0.62. The findings also flag that popular large language models, including Gemini 2.5 and Qwen2-Audio, fall short as zero-shot TTS evaluators.
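Root mean square error, the metric quoted above, measures the typical size of the difference between predicted and reference ratings, so lower is better. A minimal sketch of the computation follows; the rating values are invented for illustration and do not come from the study.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between two equal-length lists of ratings."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up quality ratings for four audio clips.
model_scores = [4.1, 3.0, 2.4, 4.8]
expert_scores = [4.0, 3.5, 2.0, 4.5]
print(round(rmse(model_scores, expert_scores), 3))
```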