Amazon Web Services published a technical walkthrough describing how developers can build a video semantic search system on Amazon Bedrock using its Nova Multimodal Embeddings model, which the company says processes text, documents, images, video, and audio into a shared vector space. The post, published on the AWS Machine Learning Blog, includes a reference implementation on GitHub and an architecture diagram covering ingestion and query pipelines.
What the model does
According to the AWS post, Nova Multimodal Embeddings generates 1024-dimensional vectors and supports up to 30 seconds of video per embedding. AWS states the model natively handles visual scenes, ambient audio, spoken dialogue, and temporal signals without first converting them to text. The company argues that the common practice of transcribing or captioning video before applying text embeddings "inevitably loses critical information," citing the loss of temporal understanding and errors introduced by transcription.
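The post does not publish the API contract for the model, but the constraints it does state (1024-dimensional output, 30 seconds of video per embedding) can be sketched as a request builder. The model ID and every field name below are assumptions for illustration, not documented values:

```python
import json

# Hypothetical model ID and request shape -- the AWS post does not
# document the API contract, so adjust to the actual Bedrock schema.
MODEL_ID = "amazon.nova-multimodal-embeddings-v1:0"  # assumed
MAX_SEGMENT_SECONDS = 30  # per-embedding video ceiling stated by AWS

def build_video_embedding_request(s3_uri: str, start_s: float, end_s: float) -> dict:
    """Build an (assumed) request body for one <=30 s video segment."""
    if end_s - start_s > MAX_SEGMENT_SECONDS:
        raise ValueError("segment exceeds the 30-second per-embedding limit")
    return {
        "inputType": "video",           # assumed field name
        "video": {
            "s3Uri": s3_uri,            # assumed field name
            "startSeconds": start_s,
            "endSeconds": end_s,
        },
        "embeddingDimension": 1024,     # dimension stated in the AWS post
    }

# The actual call would go through the Bedrock runtime client, e.g.:
#   import boto3
#   client = boto3.client("bedrock-runtime")
#   resp = client.invoke_model(
#       modelId=MODEL_ID,
#       body=json.dumps(build_video_embedding_request("s3://bucket/clip.mp4", 0.0, 25.0)),
#   )
```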
AWS describes the model as "a unified embedding model that natively processes text, documents, images, video, and audio into a shared semantic vector space" and claims it "delivers retrieval accuracy and cost efficiency." Those characterizations are the company's own; AWS did not publish benchmark figures or comparative accuracy numbers in the post.
The reference architecture
The blog post outlines a two-phase architecture: an ingestion pipeline and a search pipeline. On ingestion, AWS says videos uploaded to Amazon S3 trigger an AWS Lambda orchestrator that updates Amazon DynamoDB and starts an AWS Step Functions workflow. AWS Fargate then runs FFmpeg scene detection to segment the video.
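The post does not show the FFmpeg invocation itself. One common way to run scene detection is FFmpeg's `select` filter with a scene-change threshold; the 0.4 default below is an illustrative value, not a parameter taken from the post:

```python
def scene_detect_cmd(input_path: str, threshold: float = 0.4) -> list[str]:
    """FFmpeg command that logs a showinfo line at each detected scene cut.

    The scene-change threshold (0.0-1.0) is an illustrative default; the
    AWS post does not specify the parameters its Fargate task uses.
    """
    return [
        "ffmpeg", "-i", input_path,
        # Select frames whose scene-change score exceeds the threshold,
        # and print their timestamps via the showinfo filter.
        "-vf", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",   # discard decoded output; only the log matters
    ]

# Cut timestamps can then be parsed from "pts_time:..." entries in
# FFmpeg's stderr and used as segment boundaries.
```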
Each segment becomes the atomic unit of retrieval, the post notes: a segment that is too short loses the surrounding context that gives a moment its meaning, while one that is too long fuses multiple topics or scenes together.
Three parallel branches process each segment, according to the post: Nova Multimodal Embeddings produces visual and audio vectors stored in Amazon S3 Vectors; Amazon Transcribe converts speech to text, which is then embedded by Nova; and Amazon Rekognition performs celebrity detection mapped to timestamps. Separately, Amazon Nova 2 Lite generates segment-level captions and genre labels, and the final documents are indexed into Amazon OpenSearch Service.
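Combining those branches, a per-segment search document might look like the sketch below. The field names and the choice to reference S3 Vectors by key rather than inline are assumptions; the post does not publish its index mapping:

```python
def build_segment_document(video_id: str, start_s: float, end_s: float,
                           transcript: str, caption: str, genre: str,
                           celebrities: list[str], text_vector: list[float]) -> dict:
    """Assemble one segment's OpenSearch document (field names assumed)."""
    return {
        "video_id": video_id,
        "start_seconds": start_s,
        "end_seconds": end_s,
        "transcript": transcript,             # from Amazon Transcribe
        "caption": caption,                   # from Amazon Nova 2 Lite
        "genre": genre,                       # from Amazon Nova 2 Lite
        "celebrities": celebrities,           # from Amazon Rekognition
        "transcript_embedding": text_vector,  # 1024-dim Nova text embedding
        # Visual/audio vectors live in Amazon S3 Vectors; this document
        # references them by a derived key instead of embedding them inline.
        "s3_vector_key": f"{video_id}/{start_s:.1f}-{end_s:.1f}",
    }
```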
Query handling and hybrid search
For search, AWS says user queries pass through Amazon API Gateway to a Lambda function that runs two parallel operations: intent analysis and query embedding. AWS states that intent analysis uses Anthropic Claude Haiku via Bedrock to assign relevance weights between 0.0 and 1.0 across visual, audio, transcription, and metadata modalities. The query is then embedded three times — for visual, audio, and transcription similarity — according to the post.
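The post only states that Claude Haiku assigns per-modality weights between 0.0 and 1.0; how those weights are applied is not specified. The sketch below clamps the model-produced weights and combines per-modality similarity scores with a simple linear fusion, which is one plausible reading, not AWS's documented method:

```python
MODALITIES = ("visual", "audio", "transcription", "metadata")

def clamp_weights(raw: dict) -> dict:
    """Clamp LLM-produced relevance weights into [0.0, 1.0] per modality."""
    return {m: min(1.0, max(0.0, float(raw.get(m, 0.0)))) for m in MODALITIES}

def weighted_score(per_modality_scores: dict, weights: dict) -> float:
    """One plausible fusion: linearly weight each modality's similarity score.

    The linear combination is an assumption; the post does not describe
    how the intent weights enter the final ranking.
    """
    w = clamp_weights(weights)
    return sum(w[m] * per_modality_scores.get(m, 0.0) for m in MODALITIES)
```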
AWS describes the overall approach as a "hybrid search architecture that fuses semantic and lexical signals," combining keyword matching with vector similarity. The post does not publish quantitative retrieval accuracy comparisons between this hybrid approach and single-modality baselines.
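One widely used way to fuse a keyword ranking with a vector-similarity ranking is reciprocal rank fusion; the post does not say which fusion method AWS uses, so the sketch below is a generic illustration of the hybrid idea:

```python
def reciprocal_rank_fusion(keyword_ranked: list[str],
                           vector_ranked: list[str], k: int = 60) -> list[str]:
    """Fuse a lexical (keyword) ranking with a vector-similarity ranking.

    Each document scores 1/(k + rank) in every list it appears in; k=60 is
    the conventional default. This is a common technique, not necessarily
    the one the AWS implementation uses.
    """
    scores: dict[str, float] = {}
    for ranking in (keyword_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that rank well in both lists rise to the top, which is what lets keyword precision and semantic recall reinforce each other.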
Segmentation trade-offs
The AWS post emphasizes segmentation as a design decision with direct impact on retrieval quality. Fixed-length chunking is presented as a starting option, with the 30-second per-embedding ceiling cited as the upper bound. AWS writes that fixed boundaries "may arbitrarily truncate a scene mid-action or split a sentence mid-thought," and recommends scene-detection-based segmentation for semantic continuity.
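That trade-off can be made concrete with a small sketch that turns scene-cut timestamps into retrieval segments, splitting any scene longer than the 30-second per-embedding ceiling. The splitting rule here is an assumption for illustration, not logic taken from the post:

```python
MAX_SEGMENT_SECONDS = 30.0  # per-embedding ceiling stated by AWS

def scenes_to_segments(cut_times: list[float], duration: float) -> list[tuple[float, float]]:
    """Convert scene-cut timestamps into (start, end) segments of <= 30 s.

    Scene boundaries are respected where possible; over-long scenes are
    split at fixed 30 s intervals (an assumed policy, not AWS's).
    """
    boundaries = [0.0] + sorted(cut_times) + [duration]
    segments = []
    for start, end in zip(boundaries, boundaries[1:]):
        t = start
        while end - t > MAX_SEGMENT_SECONDS:   # split scenes over the ceiling
            segments.append((t, t + MAX_SEGMENT_SECONDS))
            t += MAX_SEGMENT_SECONDS
        if end > t:
            segments.append((t, end))
    return segments
```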
Use cases cited by AWS
The company frames the tool around several industry scenarios, including sports broadcasters surfacing scoring moments for highlight clips, studios locating scenes featuring specific actors across archives, and news organizations retrieving footage by mood, location, or event. AWS writes that a user searching for "a tense car chase with sirens" is "asking about a visual event and an audio event at the same time," which the company uses to illustrate the limitation of text-only indexing.
What the post does not cover
The AWS blog post does not publish pricing for Nova Multimodal Embeddings, a list of AWS Regions where the model is available, throughput or quota limits, or benchmark comparisons against competing services such as Google Cloud Video AI or Azure Video Indexer. DeepBrief has requested pricing, regional availability, and quota details from AWS and is seeking independent developer and analyst assessments of the tool relative to existing video understanding APIs. No independent corroborating coverage was available at the time of publication.
The reference implementation is published at github.com/aws-samples/sample-video-semantic-search-multimodal-embeddings, according to AWS.
