Researchers have developed a prompting technique called Image Prompt Packaging (IPPg) that converts structured text into images before feeding them to multimodal AI models, cutting inference costs by as much as 91% in tested scenarios — though results vary sharply by model and task.

The work, published on arXiv in April 2025, addresses a practical bottleneck in deploying large multimodal language models commercially: token-based pricing. Most cloud AI APIs charge per token processed, and complex prompts — particularly those involving structured data like database schemas or long instructions — can accumulate significant costs at scale. IPPg attempts to sidestep this by encoding text visually, exploiting the fact that image tokens and text tokens are priced and counted differently.

How Image Prompt Packaging Works

The core idea is straightforward: instead of passing structured text as conventional tokens, IPPg renders that text into an image and submits the image as part of the prompt. Because multimodal models can read text embedded in images, the model still receives the same information — but the token count from the text portion drops substantially.
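The packaging step can be sketched in a few lines, here using Pillow. The function name and every rendering parameter below are illustrative assumptions, not the paper's actual pipeline:

```python
# Minimal sketch of prompt packaging: render a structured text prompt
# into an image that is submitted in place of the raw text.
from PIL import Image, ImageDraw, ImageFont

def package_prompt(text: str, width: int = 1024,
                   margin: int = 20, line_height: int = 16) -> Image.Image:
    # Fixed-size bitmap font; a real system would tune font choice and size.
    font = ImageFont.load_default()
    lines = text.splitlines() or [""]
    height = 2 * margin + line_height * len(lines)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line,
                  fill="black", font=font)
    return img

schema = ("CREATE TABLE users (id INT, name TEXT);\n"
          "CREATE TABLE orders (id INT, user_id INT, total REAL);")
image = package_prompt(schema)  # two schema lines -> one 1024x72 image
```

In a deployed system, the resulting image would be attached to the multimodal API request in place of the schema text, so that content is billed at image-token rather than text-token rates.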

The researchers developed a cost formulation that decomposes savings by token type, allowing precise accounting of where reductions occur. Across their benchmark suite of five datasets, token compression reached as high as 96%, and inference cost reductions ranged from 35.8% to 91.0%, according to the paper.
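The decomposition can be illustrated with a toy per-request cost model. The prices and token counts below are hypothetical, and the paper's exact formulation may differ:

```python
def inference_cost(text_tokens: int, image_tokens: int, output_tokens: int,
                   p_text: float, p_image: float, p_output: float) -> float:
    """Per-request cost decomposed by token type (prices are per token)."""
    return (text_tokens * p_text
            + image_tokens * p_image
            + output_tokens * p_output)

# Hypothetical scenario: a 4,000-token schema prompt is replaced by an
# image the provider bills as 800 image tokens, plus 200 residual text tokens.
PRICES = dict(p_text=2e-6, p_image=2e-6, p_output=8e-6)
baseline = inference_cost(4000, 0, 300, **PRICES)
packaged = inference_cost(200, 800, 300, **PRICES)
savings = 1 - packaged / baseline  # fraction of cost removed by packaging
```

Because each term is tracked separately, one can see exactly which token type drives the savings, and also how a provider with expensive image tokens could flip the sign of the result.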

A 125-configuration rendering ablation revealed accuracy shifts of 10 to 30 percentage points, establishing visual encoding choices as a first-class variable in multimodal system design.

The study tested IPPg against two task families: visual question answering (VQA) and code generation, specifically text-to-SQL tasks. These represent meaningfully different demands — VQA tests perceptual and semantic understanding, while code generation requires precise structural reasoning over schemas.

Results Are Uneven Across Models

The headline cost savings come with an important qualifier: outcomes are, in the researchers' own words, "highly model- and task-dependent." GPT-4.1 performed well under IPPg on the CoSQL benchmark — a text-to-SQL dataset — achieving simultaneous accuracy improvements and cost reductions. That is the best-case scenario the technique is designed to deliver.

Claude 3.5 Sonnet, however, actually incurred cost increases on several VQA benchmarks under the same approach. This counterintuitive result likely reflects differences in how models price and process image tokens internally, and serves as a caution against assuming universal applicability.

GPT-4o results fell between these extremes, reinforcing that IPPg's value proposition depends heavily on which model and which task a team is optimizing for.

Where the Technique Fails — and Where It Thrives

The researchers conducted a systematic error analysis and produced what they call a failure-mode taxonomy — a categorization of the conditions under which IPPg degrades performance. Three areas emerged as most vulnerable: spatial reasoning tasks, non-English language inputs, and character-sensitive operations such as those requiring exact string matching or precise numeric formatting.

This is intuitive. When text is rendered as an image, fine-grained character-level fidelity depends on rendering quality — font size, resolution, and layout all become variables that can introduce errors. For a model performing SQL generation, a misread column name or data type can invalidate an otherwise correct query.
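One practical consequence is that a cheap lexical guard can catch some misread identifiers before a generated query ships. The helper below is a hypothetical illustration, not part of the paper:

```python
import re

def undefined_identifiers(sql: str, schema_names: set) -> set:
    """Flag identifier-like tokens in generated SQL that appear in neither
    the schema nor a small SQL keyword list, a cheap guard against
    identifiers misread from an image-rendered prompt."""
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql.lower()))
    keywords = {"select", "from", "where", "and", "or", "as", "join", "on"}
    return tokens - keywords - schema_names

schema = {"users", "id", "name", "created_at"}
# "narne" stands in for a misread "name" after low-resolution rendering.
bad = undefined_identifiers("SELECT narne FROM users", schema)
```

A non-empty result signals that the model may have hallucinated or misread a column, prompting a retry at higher rendering fidelity or a fallback to plain text.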

Conversely, schema-structured tasks — where the prompt contains well-organized, tabular information — benefited most from the technique. Structured layouts appear to render cleanly and remain legible to vision encoders, making them strong candidates for IPPg deployment.

Rendering Choices Matter More Than Previously Understood

Perhaps the most significant methodological finding concerns rendering configuration. The team ran a 125-configuration ablation study, varying visual encoding choices such as font, layout, and image resolution. Accuracy shifts of 10 to 30 percentage points were observed across these configurations for the same underlying content.

This finding has implications beyond IPPg itself. It suggests that in any multimodal system where text is presented visually — whether deliberately or as a byproduct of document processing — the visual encoding decisions are not cosmetic. They are a primary determinant of model performance, comparable in importance to prompt wording in text-only systems.

The authors argue this elevates visual encoding to a first-class variable in multimodal system design — a variable that practitioners should actively tune rather than treat as a fixed preprocessing step.

What This Means

For teams deploying multimodal AI at scale, IPPg offers a concrete, low-infrastructure way to cut inference costs significantly on structured tasks. The gains are not guaranteed, however: some model and task combinations increase costs rather than reduce them, so the technique warrants careful validation on the specific model and task before deployment.