
Video Pipeline

AgentOS provides a provider-agnostic video pipeline covering generation (text-to-video, image-to-video), analysis (scene detection, transcription, summarisation), and RAG-ready indexing. Three high-level API functions expose the full pipeline:

| Function | Purpose |
| --- | --- |
| generateVideo() | Text-to-video and image-to-video generation |
| analyzeVideo() | Scene detection, description, audio transcription, summarisation |
| detectScenes() | Streaming scene boundary detection from a frame source |

Providers

Video generation is backed by three provider adapters, each implementing the IVideoGenerator interface:

| Provider | Env Var | Default Model | Capabilities |
| --- | --- | --- | --- |
| Runway | RUNWAY_API_KEY | gen-3-alpha | text-to-video, image-to-video |
| Replicate | REPLICATE_API_TOKEN | klingai/kling-v1 | text-to-video, image-to-video |
| Fal | FAL_API_KEY | varies | text-to-video |

When multiple providers are configured, a FallbackVideoProxy wraps the chain so that a request which fails transiently on the primary provider is automatically retried on the next available backend.
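The fallback behaviour can be pictured with a minimal sketch. The VideoGeneratorLike shape and the retry policy below are illustrative assumptions, not the actual AgentOS internals:

```typescript
// Illustrative sketch of provider fallback; the real FallbackVideoProxy
// may differ in interface and retry policy.
interface VideoGeneratorLike {
  name: string;
  generate(prompt: string): Promise<string>; // resolves to a video URL
}

async function generateWithFallback(
  providers: VideoGeneratorLike[],
  prompt: string,
): Promise<string> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      // First provider that succeeds wins; failures fall through to the next.
      return await provider.generate(prompt);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError ?? new Error('no video providers configured');
}
```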

generateVideo()

Generate a video from a text prompt or a source image.

```typescript
import { generateVideo } from '@framers/agentos';
import { readFileSync } from 'node:fs';

// Text-to-video
const result = await generateVideo({
  prompt: 'A drone flying over a misty forest at sunrise',
  provider: 'runway',
  durationSec: 5,
  aspectRatio: '16:9',
});
console.log(result.videos[0].url);

// Image-to-video — provide a source image for motion synthesis
const i2v = await generateVideo({
  prompt: 'Camera slowly zooms out revealing the full landscape',
  image: readFileSync('input.png'),
  provider: 'replicate',
});
```

GenerateVideoOptions

| Option | Type | Description |
| --- | --- | --- |
| prompt | string | Text prompt describing the desired video content (required) |
| image | Buffer | Source image for image-to-video generation |
| provider | string | Provider ID ("runway", "replicate", "fal") |
| model | string | Model override (e.g. "gen3a_turbo") |
| durationSec | number | Desired output duration in seconds |
| aspectRatio | VideoAspectRatio | Output aspect ratio ("16:9", "9:16", "1:1", etc.) |
| resolution | string | Output resolution (e.g. "1280x720") |
| negativePrompt | string | Content to avoid |
| seed | number | Seed for reproducible generation |
| timeoutMs | number | Maximum wait time in milliseconds |
| onProgress | (event) => void | Progress callback with VideoProgressEvent |
| providerPreferences | MediaProviderPreference | Reorder or filter the fallback chain |
| apiKey | string | Override the API key |

GenerateVideoResult

```typescript
interface GenerateVideoResult {
  model: string;    // e.g. "gen-3-alpha"
  provider: string; // e.g. "runway"
  created: number;  // Unix timestamp (ms)
  videos: GeneratedVideo[];
  usage?: VideoProviderUsage;
}
```

Each GeneratedVideo contains url, optional base64, mimeType, durationSec, width, height, and thumbnailUrl.

analyzeVideo()

Analyse a video and produce structured understanding: scene segmentation, per-scene descriptions, audio transcription, and an overall summary.

```typescript
import { analyzeVideo } from '@framers/agentos';

const analysis = await analyzeVideo({
  videoUrl: 'https://example.com/demo.mp4',
  prompt: 'What products are shown in this video?',
  transcribeAudio: true,
  descriptionDetail: 'detailed',
  indexForRAG: true,
});

console.log(analysis.description);
for (const scene of analysis.scenes ?? []) {
  console.log(`[${scene.startSec}s-${scene.endSec}s] ${scene.description}`);
}
// RAG chunk IDs are available for retrieval
console.log(analysis.ragChunkIds);
```

AnalyzeVideoOptions

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| videoUrl | string | - | URL of the video to analyse |
| videoBuffer | Buffer | - | Raw video bytes (alternative to URL) |
| prompt | string | - | Analysis question / guidance |
| model | string | auto | Vision LLM model identifier |
| maxFrames | number | - | Maximum frames to sample |
| sceneThreshold | number | 0.3 | Scene change sensitivity (0-1) |
| transcribeAudio | boolean | true | Transcribe audio track via STT |
| descriptionDetail | DescriptionDetail | 'detailed' | 'brief' / 'detailed' / 'exhaustive' |
| maxScenes | number | 100 | Cap on detected scenes |
| indexForRAG | boolean | false | Index results into the RAG vector store |
| onProgress | (event) => void | - | Progress callback |

When indexForRAG: true, scene descriptions and transcripts are chunked and embedded into the configured vector store. The returned ragChunkIds can be used for downstream retrieval.

STT auto-detection

Audio transcription probes for STT providers in priority order:

  1. OpenAI Whisper (OPENAI_API_KEY)
  2. Deepgram (DEEPGRAM_API_KEY)
  3. AssemblyAI (ASSEMBLYAI_API_KEY)
  4. Azure Speech (AZURE_SPEECH_KEY + AZURE_SPEECH_REGION)
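The probe order above can be sketched as a simple environment check. The provider identifiers returned here are illustrative, not AgentOS's actual internal names:

```typescript
// Illustrative sketch of the STT auto-detection priority order.
type SttProvider = 'whisper' | 'deepgram' | 'assemblyai' | 'azure' | null;

function detectSttProvider(env: Record<string, string | undefined> = process.env): SttProvider {
  if (env.OPENAI_API_KEY) return 'whisper';
  if (env.DEEPGRAM_API_KEY) return 'deepgram';
  if (env.ASSEMBLYAI_API_KEY) return 'assemblyai';
  // Azure requires both a key and a region.
  if (env.AZURE_SPEECH_KEY && env.AZURE_SPEECH_REGION) return 'azure';
  return null;
}
```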

detectScenes()

Stream scene boundaries from a frame source. Returns an AsyncGenerator<SceneBoundary> so callers can process boundaries as they are detected without buffering the entire video.

```typescript
import { detectScenes } from '@framers/agentos';

// Pre-recorded video (frames from ffmpeg or similar)
for await (const boundary of detectScenes({
  frames: extractFrames('video.mp4'),
  hardCutThreshold: 0.3,
  minSceneDurationSec: 1.0,
})) {
  console.log(`Scene ${boundary.index} at ${boundary.startTimeSec}s (${boundary.cutType})`);
}

// Live webcam with CLIP-based semantic detection
for await (const boundary of detectScenes({
  frames: webcamFrameStream,
  methods: ['histogram', 'ssim', 'clip'],
  clipProvider: 'openai',
})) {
  console.log(`Scene change at ${boundary.startTimeSec}s`);
}
```

Detection methods

| Method | Description |
| --- | --- |
| histogram | Chi-squared histogram distance (fast, good for hard cuts) |
| ssim | Structural similarity index (catches gradual transitions) |
| clip | CLIP embedding cosine distance (semantic scene changes) |

Multiple methods are combined by taking the maximum diff score across all methods.
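That max-combination rule can be expressed in a couple of lines (the function name and threshold value here are illustrative):

```typescript
// Sketch of multi-method score combination: a frame pair is a scene
// boundary candidate if the maximum per-method diff exceeds the threshold.
function combineDiffScores(scores: Record<string, number>): number {
  return Math.max(...Object.values(scores));
}

const combined = combineDiffScores({ histogram: 0.12, ssim: 0.45, clip: 0.2 });
const isBoundary = combined >= 0.3; // e.g. hardCutThreshold
```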

Scene boundary types

Each SceneBoundary includes a cutType classification:

  • hard-cut -- Abrupt frame-to-frame change
  • dissolve -- Cross-dissolve / superimposition
  • fade -- Fade from/to black or white
  • wipe -- Directional wipe transition
  • gradual -- Other gradual transition
  • start -- First scene in the video
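The classification above corresponds to a string union along these lines (the type name SceneCutType is an assumption; the members are taken from the list above):

```typescript
// Assumed type name; the union members match the cut types listed above.
type SceneCutType = 'hard-cut' | 'dissolve' | 'fade' | 'wipe' | 'gradual' | 'start';

const cutTypes: SceneCutType[] = ['hard-cut', 'dissolve', 'fade', 'wipe', 'gradual', 'start'];
```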

Types reference

VideoAspectRatio

```typescript
type VideoAspectRatio = '1:1' | '16:9' | '9:16' | '4:3' | '3:4' | '21:9' | (string & {});
```

VideoOutputFormat

```typescript
type VideoOutputFormat = 'mp4' | 'webm' | 'gif';
```

VideoProgressEvent

```typescript
interface VideoProgressEvent {
  status: 'queued' | 'processing' | 'downloading' | 'complete' | 'failed';
  progress?: number; // 0-100
  estimatedRemainingMs?: number;
  message?: string;
}
```

VideoAnalysisProgressEvent

```typescript
interface VideoAnalysisProgressEvent {
  phase: 'extracting-frames' | 'detecting-scenes' | 'describing' | 'transcribing' | 'summarizing';
  progress?: number;
  currentScene?: number;
  message?: string;
}
```

Observability

All video API calls emit OpenTelemetry spans (agentos.api.generate_video, agentos.api.analyze_video) and record usage metrics to the durable usage ledger when configured.