A new open-source model called InsEdit can edit videos from plain text instructions using roughly 100,000 training examples, far fewer than the millions of edited video pairs that comparable approaches typically demand.

Instruction-based video editing — telling a model to change a scene, swap an object, or alter a style using natural language — has long struggled with a fundamental bottleneck: models need enormous quantities of paired video data showing "before" and "after" versions of edits. High-quality paired video editing datasets are scarce and expensive to produce, making the field slow to advance despite rapid progress in video generation more broadly.

How InsEdit Gets Around the Data Problem

The researchers behind InsEdit, whose paper appeared on arXiv in April 2025, built their system on top of HunyuanVideo-1.5, a capable open-source video generation backbone. Rather than collecting or synthesising millions of edited video pairs, they designed a data pipeline that extracts more value from a smaller pool of material.

The key innovation is a technique called Mutual Context Attention (MCA). Conventional approaches to creating training pairs typically require edits to begin at the very first frame of a clip, which limits the variety and realism of the data. MCA allows edits to start partway through a clip, generating aligned video pairs — matching "original" and "edited" versions — that better reflect how real edits actually work. This produces richer, more diverse training signal from the same underlying footage.
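The notion of an aligned pair with a mid-clip edit point is easy to illustrate in isolation. The sketch below is not MCA itself; it only shows the data property the technique is said to produce: a pair that shares frames up to the edit point and diverges after it. `make_aligned_pair` and `edit_fn` are hypothetical names, and frames are collapsed to single intensity values for brevity.

```python
def make_aligned_pair(frames, edit_fn, edit_start):
    """Build an (original, edited) pair whose edit begins at frame
    `edit_start` instead of frame 0. Frames before the edit point are
    shared verbatim, so the pair stays aligned until it diverges."""
    original = list(frames)
    edited = frames[:edit_start] + [edit_fn(f) for f in frames[edit_start:]]
    return original, edited

# Toy clip: eight "frames", each reduced to one intensity value.
clip = [round(0.1 * t, 1) for t in range(8)]
orig, edit = make_aligned_pair(clip, lambda f: 1.0 - f, edit_start=3)
print(orig[:3] == edit[:3])   # True: shared prefix before the edit point
print(orig[3:] == edit[3:])   # False: the suffix carries the edit
```

A pipeline built this way can slide `edit_start` across a clip to mint several distinct training pairs from the same footage, which is one plausible reading of where the extra diversity comes from.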

With only O(100)K (on the order of 100,000) video editing examples, InsEdit achieves state-of-the-art results among open-source methods on video instruction editing benchmarks.

The architecture also incorporates a visual editing component alongside the video diffusion backbone, giving the model a structured way to interpret and apply textual instructions to specific visual regions and content across time.

What the Benchmarks Show

According to the paper, InsEdit achieves state-of-the-art performance among open-source methods on the team's video instruction editing benchmarks. These results are self-reported by the authors and have not yet been independently validated. The researchers do not claim superiority over proprietary closed systems; their comparison focuses on the open-source landscape.

One practical benefit of the training approach is dual-modality support. Because the researchers included image editing data alongside video data during training, InsEdit handles still image editing without requiring any changes to the model's architecture or a separate fine-tuning stage. This makes deployment simpler and gives practitioners a single model that covers both use cases.
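One way to picture that single-model coverage is to treat a still image as a one-frame clip, so images and videos flow through the same editing path. This framing is an assumption for illustration, not an interface detail from the paper; all names below are hypothetical, and `edit_fn` stands in for the model's instruction-conditioned edit operation.

```python
def edit_media(media, instruction, edit_fn):
    """Route still images and video clips through one editing path.
    `media` is a single frame or a list of frames; `edit_fn` is a toy
    stand-in for the model's instruction-conditioned edit step."""
    is_image = not isinstance(media, list)
    clip = [media] if is_image else media           # lift image -> 1-frame clip
    edited = [edit_fn(frame, instruction) for frame in clip]
    return edited[0] if is_image else edited        # unwrap images on the way out

brighten = lambda frame, instruction: frame + 1     # toy integer "edit"
print(edit_media(5, "brighten", brighten))          # 6: image in, image out
print(edit_media([1, 2, 3], "brighten", brighten))  # [2, 3, 4]: clip in, clip out
```

The design point is that the caller never branches on modality; the lift-and-unwrap happens once at the boundary.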

Why Data Efficiency Matters Here

The significance of achieving strong results at O(100)K scale — meaning roughly in the hundreds of thousands rather than millions — goes beyond academic interest. Assembling large video editing datasets requires either extensive human annotation, expensive synthetic generation pipelines, or both. Reducing that requirement by even one order of magnitude meaningfully lowers the barrier for academic groups, startups, and open-source communities to train competitive editing models.
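The order-of-magnitude point can be made concrete with a back-of-envelope calculation. The per-pair cost below is an illustrative assumption, not a figure from the paper:

```python
# Illustrative only: assume each annotated before/after pair costs $5 to produce.
COST_PER_PAIR_USD = 5

def dataset_cost(num_pairs):
    """Total annotation cost at a flat (assumed) per-pair rate."""
    return num_pairs * COST_PER_PAIR_USD

million_scale = dataset_cost(1_000_000)  # the typical paired-data regime
insedit_scale = dataset_cost(100_000)    # the O(100)K regime reported here
print(million_scale - insedit_scale)     # 4500000: savings at the assumed rate
```

Whatever the true per-pair cost, a tenfold reduction in required pairs scales the budget down by the same factor.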

This connects to a broader trend in AI research: finding ways to extract more capability from less data, rather than simply scaling up. Techniques like MCA represent an architectural answer to a resource constraint, and the authors argue the approach transfers well because a strong generation backbone already encodes rich visual knowledge that can be redirected toward editing tasks.

The model is built on HunyuanVideo-1.5, which Tencent released as an open-source video generation foundation model. Leveraging an existing, capable backbone rather than training from scratch is itself a form of data and compute efficiency — the editing capability is layered on top of a model that already understands how videos should look and move.

Limitations and Open Questions

The paper does not provide detailed comparisons against closed proprietary systems such as those from Google, OpenAI, or Runway, so the absolute performance ceiling relative to commercial tools remains unclear. The benchmarks used are also constructed by the research team itself, which is standard practice but means independent replication will be important for establishing the claims more firmly.

It is also worth noting that "instruction-based" editing covers a wide range of task complexity. Simple style transfers and object replacements are more tractable than fine-grained spatial edits or temporally consistent changes across long clips. The paper does not extensively document performance at the harder end of that spectrum.

What This Means

InsEdit demonstrates that the data bottleneck holding back open-source video editing models is surmountable, and that choices about how training pairs are constructed can matter as much as raw dataset size. For developers, that is a credible path to capable video editors without industrial-scale data collection.