Researchers Use LLM-as-Judge to Train Smaller Language Models Without Labeled Data
James Okafor
AI Research Correspondent | arXiv cs.CL | Verified across 1 source
The Brief
A new reinforcement learning framework uses a large language model as an evaluator to train smaller LLMs on unlabeled data, eliminating the need for ground-truth labels. The approach derives an efficient reward from a single judge token and, when combined with verifiable rewards, improved performance on math reasoning benchmarks, showing that LLM-based evaluators can provide effective training signals.
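The core idea can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: `mock_judge` stands in for a real judge-LLM call, and the yes/no prompt format and probability-as-reward mapping are assumptions for the sake of the example.

```python
# Sketch of an LLM-as-judge reward: the judge is asked a yes/no question
# about a candidate response and the probability it assigns to the single
# "yes" token is used directly as the scalar reward for RL training.
# `mock_judge` is a toy stand-in for a large-model API call.

def mock_judge(prompt: str, response: str) -> dict:
    """Stand-in for a judge LLM: returns next-token probabilities
    for the judge prompt 'Is this response correct? Answer yes or no.'"""
    correct = "4" in response  # toy correctness check for the demo prompt
    p_yes = 0.9 if correct else 0.1
    return {"yes": p_yes, "no": 1.0 - p_yes}

def single_token_reward(prompt: str, response: str) -> float:
    """Reward = probability mass the judge places on the 'yes' token.
    Only one token is generated, so evaluation cost stays low."""
    return mock_judge(prompt, response)["yes"]

rewards = [single_token_reward("What is 2+2?", r) for r in ["4", "5"]]
print(rewards)  # the judged-correct response receives the higher reward
```

Because the judge emits only one token per candidate, the reward call is far cheaper than generating a full critique, which is what makes this signal practical at RL scale.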
Sources