Researchers have fine-tuned large language models on a 2,500-year-old Indian logical framework called Navya-Nyaya, reporting that the approach produces semantically correct reasoning on held-out test problems even when models don't strictly follow the prescribed format.
The paper, posted to arXiv, targets one of the most persistent problems in AI deployment: language models that generate confident-sounding answers without traceable justification. The authors cite Apple research showing that adding irrelevant context to mathematical problems degraded LLM performance by 65%, framing this as evidence of pattern-matching rather than genuine reasoning. The Pramana project attempts to address this by giving models an explicit epistemological methodology to follow, rather than relying on general-purpose chain-of-thought prompting.
Why Ancient Indian Logic, and What It Actually Requires
Navya-Nyaya is a classical school of Indian philosophy focused on formal logic and the sources of valid knowledge. The researchers structured it into a six-phase reasoning protocol that models must work through sequentially. The phases are: SAMSHAYA (identifying and articulating doubt), PRAMANA (identifying the source and type of evidence), PANCHA AVAYAVA (a five-member syllogism that includes universal rules), TARKA (counterfactual verification — checking whether the opposite could be true), HETVABHASA (fallacy detection), and NIRNAYA (a final ascertainment that distinguishes established knowledge from hypothesis).
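The sequential discipline of the protocol can be illustrated with a short sketch. This is not code from the paper; the enum names and the order-checking helper are illustrative, showing only what "must work through sequentially" means in practice:

```python
from enum import Enum

class NyayaPhase(Enum):
    """The six phases of the Nyaya-structured protocol, in prescribed order."""
    SAMSHAYA = 1        # identify and articulate the doubt
    PRAMANA = 2         # identify the source and type of evidence
    PANCHA_AVAYAVA = 3  # five-member syllogism including a universal rule
    TARKA = 4           # counterfactual check: could the opposite be true?
    HETVABHASA = 5      # scan the argument for logical fallacies
    NIRNAYA = 6         # final ascertainment: knowledge vs hypothesis

def phases_in_order(trace: list[str]) -> bool:
    """Check that a reasoning trace visits all six phases sequentially."""
    expected = [p.name for p in NyayaPhase]
    found = [step for step in trace if step in expected]
    return found == expected

# A trace that follows the full prescribed sequence:
trace = ["SAMSHAYA", "PRAMANA", "PANCHA_AVAYAVA",
         "TARKA", "HETVABHASA", "NIRNAYA"]
```

A trace that skips a phase or visits phases out of order would fail this check, which is the kind of structural requirement standard chain-of-thought prompting does not impose.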
This structure goes meaningfully beyond standard chain-of-thought prompting, which encourages models to show working but imposes no formal requirements on what that working must contain. Navya-Nyaya demands that models explicitly identify their evidence type, test for logical fallacies, and separate what is known from what is merely inferred.
The inability to ground claims in traceable evidence limits AI reliability in domains requiring justification, and each of Navya-Nyaya's six phases targets a specific failure point: unsourced evidence, untested counterfactuals, undetected fallacies, and conclusions that conflate established knowledge with hypothesis.
The Experimental Setup and What the Numbers Show
The team fine-tuned two open models: Llama 3.2-3B, a compact model from Meta, and DeepSeek-R1-Distill-Llama-8B, a distilled reasoning model from DeepSeek. Training used 55 Nyaya-structured logical problems spanning constraint satisfaction, Boolean satisfiability, and multi-step deduction tasks — a deliberately small dataset intended to test whether the framework itself, rather than data volume, drives improvement.
At Stage 1, the authors report 100% semantic correctness on held-out evaluation problems. Critically, this figure is self-reported by the research team and has not been independently verified. The paper also notes that strict format adherence at this stage was only 40%, meaning models frequently produced correct answers while departing from the prescribed six-phase structure. The authors interpret this as evidence that models internalize the reasoning content even when structural enforcement is imperfect — a finding with implications for how format compliance and actual understanding relate.
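The gap between the two metrics is easier to see with a concrete scoring sketch. The paper does not publish its evaluation code; the phase-header check and the `ANSWER:` extraction pattern below are assumptions, chosen only to show how an output can score as semantically correct while failing strict format adherence:

```python
import re

PHASES = ["SAMSHAYA", "PRAMANA", "PANCHA AVAYAVA",
          "TARKA", "HETVABHASA", "NIRNAYA"]

def format_adherent(output: str) -> bool:
    """Strict adherence: all six phase headers present, in prescribed order."""
    positions = [output.find(p) for p in PHASES]
    return all(pos >= 0 for pos in positions) and positions == sorted(positions)

def semantically_correct(output: str, gold: str) -> bool:
    """Illustrative correctness check: the final answer matches the gold label."""
    match = re.search(r"ANSWER:\s*(\w+)", output)
    return bool(match) and match.group(1) == gold

# A response can be semantically correct without following the format
# (the divergence behind the reported 100% vs 40% Stage 1 figures):
loose = "SAMSHAYA ... TARKA ... ANSWER: yes"  # skips several phases
strict = ("SAMSHAYA PRAMANA PANCHA AVAYAVA TARKA "
          "HETVABHASA NIRNAYA ANSWER: yes")
```

Under this scoring, `loose` counts toward semantic correctness but not format adherence, while `strict` satisfies both.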
Ablation studies identified that both format prompting style and temperature settings critically affect performance, with optimal configurations differing depending on which reasoning stage is being evaluated. This suggests the framework may require stage-specific tuning rather than a single universal configuration.
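A minimal sketch makes the stage-specific tuning point concrete. The prompt-style names, temperature values, and scores below are all invented for illustration; the point is only that when each reasoning stage is scored separately, the best configuration can differ per stage:

```python
def best_config_per_stage(scores):
    """Pick the top (style, temperature) pair independently for each stage.

    `scores` maps (stage, prompt_style, temperature) -> accuracy.
    """
    best = {}
    for (stage, style, temp), acc in scores.items():
        if stage not in best or acc > best[stage][1]:
            best[stage] = ((style, temp), acc)
    return {stage: cfg for stage, (cfg, _) in best.items()}

# Toy scores showing why a single universal configuration can be suboptimal:
scores = {
    ("PRAMANA", "strict_template", 0.0): 0.9,
    ("PRAMANA", "soft_instruction", 0.7): 0.6,
    ("TARKA", "strict_template", 0.0): 0.5,
    ("TARKA", "soft_instruction", 0.7): 0.8,
}
```

Here the evidence-identification stage prefers a low-temperature strict template while the counterfactual stage prefers a softer, higher-temperature prompt, so no single configuration wins both.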
What the Approach Does and Doesn't Claim
The paper is careful in its framing. The researchers are not claiming that Pramana eliminates hallucination or solves reasoning at scale. The dataset of 55 problems is small, and the evaluation is limited to the specific logical task types used in training. Generalisability to open-ended real-world queries — medical diagnosis, legal reasoning, scientific analysis — remains untested in this work.
What the paper does claim is that structured epistemological scaffolding produces measurable improvements over baseline reasoning on formal logic tasks, and that the Navya-Nyaya framework offers something distinct from Western logic traditions typically used in AI research. The six-phase structure's explicit fallacy-detection and evidence-sourcing steps map directly onto failure modes that current LLMs exhibit regularly.
All models, datasets, and training infrastructure have been released on Hugging Face, which means independent researchers can replicate, stress-test, and extend the findings without requiring access to the original team.
Situating Pramana in the Broader Reasoning Research Landscape
The problem Pramana addresses sits at the centre of current AI safety and reliability concerns. Language models are increasingly deployed in high-stakes contexts — healthcare triage, legal document review, financial analysis — where an answer that sounds correct but lacks grounded justification can cause direct harm. Approaches to fix this have ranged from retrieval-augmented generation (connecting models to external knowledge bases) to formal verification methods to structured prompting techniques.
Pramana's contribution is methodological rather than architectural. It doesn't change the underlying model structure; it changes what the model is trained to do when it reasons. By drawing on a non-Western philosophical tradition, the researchers also challenge the assumption that the logical frameworks embedded in most AI training data — predominantly derived from Aristotelian and Boolean traditions — represent the only available options.
The approach also raises a practical question that the paper begins to answer: how much structured reasoning methodology can be instilled through fine-tuning on a small, high-quality dataset, versus requiring massive data or architectural changes? The 100% semantic correctness result at Stage 1, if it holds under independent scrutiny, would suggest small-dataset epistemological fine-tuning merits further investigation.
What This Means
For AI researchers and practitioners building systems where reasoning must be auditable and grounded, Pramana offers a concrete, open-source framework to evaluate — and the 2,500-year-old logic it draws on may prove more practically useful than its age suggests.