A new study posted to arXiv finds that a lightweight fine-tuning method called Contrastive Prompt Tuning can improve the accuracy of AI-generated code, but that its ability to make that code more energy-efficient remains unreliable across models and programming languages.

The paper, posted to arXiv (cs.LG) in April 2025, addresses a problem that has received growing attention as AI code generation tools become mainstream: large language models (LLMs) tend to produce code that works correctly but runs less efficiently than solutions written by experienced human developers. That efficiency gap translates directly into higher energy consumption at runtime — a conflict with the principles of Green Software Development (GSD), an industry movement aimed at reducing the carbon footprint of software.

Why AI-Generated Code Has an Energy Problem

LLMs trained on code are optimised primarily for functional correctness — producing outputs that pass test cases and satisfy requirements. Energy efficiency is rarely part of that training signal. The result is code that may solve the problem but does so with more computational overhead than necessary, consuming more CPU cycles, memory, and ultimately electricity over its lifetime.
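The gap between "passes the tests" and "uses minimal compute" is easy to illustrate. The following sketch is not from the paper; it simply shows two functions that a correctness-only training signal cannot distinguish, even though one does far more work than the other:

```python
# Illustrative only: two functionally equivalent solutions with very
# different computational cost, the kind of gap the paper describes.

def has_duplicates_quadratic(items):
    """Correct but wasteful: compares every pair, O(n^2) time."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_linear(items):
    """Same result with a set: O(n) time, far fewer CPU cycles."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

Both return identical answers on every input, so a benchmark that checks only functional correctness scores them the same — which is precisely why energy efficiency never enters the training signal by default.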

This matters at scale. As organisations integrate AI coding assistants into development pipelines, the cumulative energy cost of running millions of lines of AI-generated code could become significant. The researchers frame this as a direct tension between the productivity gains of AI-assisted development and sustainability goals.

The study's headline result reflects that tension: CPT delivers consistent accuracy improvements for two of the three models tested, but its efficiency gains vary by model, programming language, and task complexity, meaning the improvements are not uniformly reliable.

What Contrastive Prompt Tuning Actually Does

Contrastive Prompt Tuning (CPT) combines two distinct techniques. The first is Contrastive Learning, a training approach that teaches a model to distinguish between examples — in this case, between energy-efficient and energy-inefficient code. Exposed to paired examples of both, the model learns to favour patterns associated with lower computational cost.
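The abstract does not give the exact loss function, but a common contrastive formulation is a margin loss that pulls an anchor representation toward a "positive" (efficient) example and pushes it away from a "negative" (inefficient) one. A minimal sketch, using hypothetical toy embeddings rather than real model outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def triplet_contrastive_loss(anchor, efficient, inefficient, margin=0.2):
    """Zero when the efficient example is closer to the anchor than the
    inefficient one by at least `margin`; positive otherwise."""
    return max(0.0, cosine(anchor, inefficient) - cosine(anchor, efficient) + margin)

# Hypothetical embeddings of a coding prompt and two candidate solutions.
anchor      = [1.0, 0.0, 0.5]
efficient   = [0.9, 0.1, 0.6]   # near the anchor, so the loss is zero
inefficient = [-0.5, 1.0, 0.0]  # far from the anchor, correctly pushed away
```

Minimising a loss of this shape is what nudges the model's internal representations toward the efficient side of each pair; the paper's actual objective may differ in its details.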

The second component is Prompt Tuning, a method that falls under the broader category of Parameter-Efficient Fine-Tuning (PEFT). Rather than retraining all of a model's billions of parameters — an expensive process requiring significant compute and data — Prompt Tuning inserts a small set of learnable tokens into the model's input. Only these tokens are updated during training, making the process far cheaper than traditional fine-tuning while still shifting the model's behaviour.
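The mechanics can be shown with a deliberately tiny stand-in. This is not the paper's implementation: the "model" below is a single frozen weight, the soft prompt is two learnable values prepended to the input, and gradient descent touches only the prompt:

```python
# Toy sketch of Prompt Tuning: a frozen "model" plus a small learnable
# soft prompt prepended to the input. Only the prompt is ever updated.

FROZEN_WEIGHT = 0.5  # stands in for the billions of frozen LLM parameters

def model_output(soft_prompt, input_tokens):
    """The frozen model scores the soft prompt together with the input."""
    return FROZEN_WEIGHT * (sum(soft_prompt) + sum(input_tokens))

def tune_prompt(input_tokens, target, steps=200, lr=0.05):
    """Gradient descent on the soft prompt alone; FROZEN_WEIGHT never changes."""
    prompt = [0.0, 0.0]  # two learnable "tokens"
    for _ in range(steps):
        error = model_output(prompt, input_tokens) - target
        # d(error^2)/d(prompt_i) = 2 * error * FROZEN_WEIGHT for each token
        grad = 2.0 * error * FROZEN_WEIGHT
        prompt = [p - lr * grad for p in prompt]
    return prompt

# After tuning, the frozen model hits the target using only the two
# prompt values as trainable parameters.
tuned = tune_prompt([1.0, 2.0], target=4.0)
```

The proportions are the point: in a real LLM the frozen parameters number in the billions while the soft prompt is typically a few thousand values, which is what makes the approach affordable without large GPU clusters.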

The combination is designed to be practical: organisations or researchers without access to large-scale GPU clusters could theoretically apply CPT to adapt an existing LLM toward more energy-conscious code generation without prohibitive cost.

What the Study Tested and Found

The researchers evaluated CPT across three different LLMs (specific models are not named in the abstract) on coding problems in Python, Java, and C++ — three of the most widely used programming languages, each with different performance characteristics and common patterns.

On the accuracy front — whether the generated code is functionally correct — CPT produced consistent improvements in two of the three models tested. That is a meaningful finding in itself, suggesting that the contrastive learning signal helps models produce better code overall, not just more efficient code.

However, the energy efficiency results were more complicated. Gains were inconsistent, shifting depending on which model was used, which programming language was targeted, and how complex the task was. The researchers acknowledge that these improvements are "not uniformly reliable" — a candid assessment that distinguishes this paper from work that overstates results.

The variability by language is notable. Python, Java, and C++ have very different runtime environments and optimisation opportunities. An approach that successfully nudges a model toward efficient Python idioms may not transfer to the same effect in C++, where low-level memory management and compiler behaviour play a larger role in actual energy consumption.
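A concrete Python-specific idiom makes the point; this example is illustrative rather than drawn from the paper's benchmark. Building a string with repeated concatenation can copy the accumulated string on each step, while str.join does a single pass, yet both are functionally identical, so only an efficiency-aware signal can separate them:

```python
# Two functionally identical ways to build a string in Python. The
# naive version may reallocate and copy on each +=, doing quadratic
# work in the worst case; the idiomatic one does a single linear pass.

def concat_naive(parts):
    out = ""
    for p in parts:
        out += p           # may copy the whole accumulated string
    return out

def concat_idiomatic(parts):
    return "".join(parts)  # one allocation, one pass
```

The analogous transformation in C++ would hinge on allocator behaviour and compiler optimisations rather than a library idiom, which is one plausible reason a tuning signal learned from one language's patterns does not transfer cleanly to another.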

Limitations and What Comes Next

This study is explicitly described as an initial exploration, and the scope reflects that framing. Evaluating three models across three languages is broader than many prompt-tuning papers manage, but the results raise as many questions as they answer. Which model architectures respond best to CPT? Does the method scale to more complex, real-world codebases rather than contained coding problems? How is energy efficiency being measured — at the instruction level, through profiling tools, or via proxy metrics?
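The measurement question matters because the cheapest option, and a common proxy in efficiency studies, is CPU time, which correlates only loosely with actual energy draw. A minimal sketch of such a proxy (a hypothetical helper, not the paper's methodology):

```python
import time

def cpu_time_proxy(fn, *args, repeats=5):
    """Rough energy proxy, assuming CPU time tracks energy use: run the
    function several times and keep the best (least noisy) CPU time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.process_time()
        fn(*args)
        best = min(best, time.process_time() - start)
    return best
```

Proxies like this ignore memory traffic, I/O, and hardware power states, so two papers using different measurement choices can reach different conclusions about the same generated code — one reason cross-study comparisons in this area remain difficult.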

The paper does not appear to make claims about production readiness. Instead, it positions CPT as a research direction with the caveat that current results are mixed.

For the field of Green Software Development, even a partial, inconsistent improvement is a data point that did not previously exist. Establishing that contrastive learning can influence energy-related outputs — even imperfectly — opens a path for more targeted work: better training data pairing efficient and inefficient solutions, more refined prompting strategies, or hybrid approaches that combine CPT with runtime feedback.

What This Means

Organisations relying on LLMs for code generation cannot yet assume those tools will produce energy-efficient outputs by default — but this research suggests that targeted fine-tuning methods, even lightweight ones, may offer a practical path toward closing that gap as the technique matures.