Video Pipeline
AgentOS provides a provider-agnostic video pipeline covering generation (text-to-video, image-to-video), analysis (scene detection, transcription, summarisation), and RAG-ready indexing. Three high-level API functions expose the full pipeline:
| Function | Purpose |
|---|---|
generateVideo() | Text-to-video and image-to-video generation |
analyzeVideo() | Scene detection, description, audio transcription, summarisation |
detectScenes() | Streaming scene boundary detection from a frame source |
Providers
Video generation is backed by three provider adapters, each implementing the IVideoGenerator interface:
| Provider | Env Var | Default Model | Capabilities |
|---|---|---|---|
| Runway | RUNWAY_API_KEY | gen-3-alpha | text-to-video, image-to-video |
| Replicate | REPLICATE_API_TOKEN | klingai/kling-v1 | text-to-video, image-to-video |
| Fal | FAL_API_KEY | varies | text-to-video |
When multiple providers are configured, a FallbackVideoProxy wraps the chain so that a transient failure on the primary provider automatically retries on the next available backend.
generateVideo()
Generate a video from a text prompt or a source image.
import { generateVideo } from '@framers/agentos';
// Text-to-video
const result = await generateVideo({
prompt: 'A drone flying over a misty forest at sunrise',
provider: 'runway',
durationSec: 5,
aspectRatio: '16:9',
});
console.log(result.videos[0].url);
// Image-to-video — provide a source image for motion synthesis
import { readFileSync } from 'node:fs';
const i2v = await generateVideo({
prompt: 'Camera slowly zooms out revealing the full landscape',
image: readFileSync('input.png'),
provider: 'replicate',
});
GenerateVideoOptions
| Option | Type | Description |
|---|---|---|
prompt | string | Text prompt describing the desired video content (required) |
image | Buffer | Source image for image-to-video generation |
provider | string | Provider ID ("runway", "replicate", "fal") |
model | string | Model override (e.g. "gen3a_turbo") |
durationSec | number | Desired output duration in seconds |
aspectRatio | VideoAspectRatio | Output aspect ratio ("16:9", "9:16", "1:1", etc.) |
resolution | string | Output resolution (e.g. "1280x720") |
negativePrompt | string | Content to avoid |
seed | number | Seed for reproducible generation |
timeoutMs | number | Maximum wait time in milliseconds |
onProgress | (event) => void | Progress callback with VideoProgressEvent |
providerPreferences | MediaProviderPreference | Reorder or filter the fallback chain |
apiKey | string | Override the API key |
GenerateVideoResult
interface GenerateVideoResult {
model: string; // e.g. "gen-3-alpha"
provider: string; // e.g. "runway"
created: number; // Unix timestamp (ms)
videos: GeneratedVideo[];
usage?: VideoProviderUsage;
}
Each GeneratedVideo contains url, optional base64, mimeType, durationSec, width, height, and thumbnailUrl.
analyzeVideo()
Analyse a video and produce structured understanding: scene segmentation, per-scene descriptions, audio transcription, and an overall summary.
import { analyzeVideo } from '@framers/agentos';
const analysis = await analyzeVideo({
videoUrl: 'https://example.com/demo.mp4',
prompt: 'What products are shown in this video?',
transcribeAudio: true,
descriptionDetail: 'detailed',
indexForRAG: true,
});
console.log(analysis.description);
for (const scene of analysis.scenes ?? []) {
console.log(`[${scene.startSec}s-${scene.endSec}s] ${scene.description}`);
}
// RAG chunk IDs are available for retrieval
console.log(analysis.ragChunkIds);
AnalyzeVideoOptions
| Option | Type | Default | Description |
|---|---|---|---|
videoUrl | string | - | URL of the video to analyse |
videoBuffer | Buffer | - | Raw video bytes (alternative to URL) |
prompt | string | - | Analysis question / guidance |
model | string | auto | Vision LLM model identifier |
maxFrames | number | - | Maximum frames to sample |
sceneThreshold | number | 0.3 | Scene change sensitivity (0-1) |
transcribeAudio | boolean | true | Transcribe audio track via STT |
descriptionDetail | DescriptionDetail | 'detailed' | 'brief' / 'detailed' / 'exhaustive' |
maxScenes | number | 100 | Cap on detected scenes |
indexForRAG | boolean | false | Index results into the RAG vector store |
onProgress | (event) => void | - | Progress callback |
When indexForRAG: true, scene descriptions and transcripts are chunked and embedded into the configured vector store. The returned ragChunkIds can be used for downstream retrieval.
STT auto-detection
Audio transcription probes for STT providers in priority order:
- OpenAI Whisper (
OPENAI_API_KEY) - Deepgram (
DEEPGRAM_API_KEY) - AssemblyAI (
ASSEMBLYAI_API_KEY) - Azure Speech (
AZURE_SPEECH_KEY+AZURE_SPEECH_REGION)
detectScenes()
Stream scene boundaries from a frame source. Returns an AsyncGenerator<SceneBoundary> so callers can process boundaries as they are detected without buffering the entire video.
import { detectScenes } from '@framers/agentos';
// Pre-recorded video (frames from ffmpeg or similar)
for await (const boundary of detectScenes({
frames: extractFrames('video.mp4'),
hardCutThreshold: 0.3,
minSceneDurationSec: 1.0,
})) {
console.log(`Scene ${boundary.index} at ${boundary.startTimeSec}s (${boundary.cutType})`);
}
// Live webcam with CLIP-based semantic detection
for await (const boundary of detectScenes({
frames: webcamFrameStream,
methods: ['histogram', 'ssim', 'clip'],
clipProvider: 'openai',
})) {
console.log(`Motion detected at ${boundary.startTimeSec}s`);
}
Detection methods
| Method | Description |
|---|---|
histogram | Chi-squared histogram distance (fast, good for hard cuts) |
ssim | Structural similarity index (catches gradual transitions) |
clip | CLIP embedding cosine distance (semantic scene changes) |
Multiple methods are combined by taking the maximum diff score across all methods.
Scene boundary types
Each SceneBoundary includes a cutType classification:
hard-cut-- Abrupt frame-to-frame changedissolve-- Cross-dissolve / superimpositionfade-- Fade from/to black or whitewipe-- Directional wipe transitiongradual-- Other gradual transitionstart-- First scene in the video
Types reference
VideoAspectRatio
type VideoAspectRatio = '1:1' | '16:9' | '9:16' | '4:3' | '3:4' | '21:9' | (string & {});
VideoOutputFormat
type VideoOutputFormat = 'mp4' | 'webm' | 'gif';
VideoProgressEvent
interface VideoProgressEvent {
status: 'queued' | 'processing' | 'downloading' | 'complete' | 'failed';
progress?: number; // 0-100
estimatedRemainingMs?: number;
message?: string;
}
VideoAnalysisProgressEvent
interface VideoAnalysisProgressEvent {
phase: 'extracting-frames' | 'detecting-scenes' | 'describing'
| 'transcribing' | 'summarizing';
progress?: number;
currentScene?: number;
message?: string;
}
Observability
All video API calls emit OpenTelemetry spans (agentos.api.generate_video, agentos.api.analyze_video) and record usage metrics to the durable usage ledger when configured.