📄️Image Generation

Generate images from text prompts across 5 providers with a single unified API.

📄️Image Editing (Img2Img, Inpainting, Upscaling)

Edit, upscale, and create variations of existing images across multiple providers with a unified API.

📄️Character Consistency

Generate images that maintain a consistent character identity across multiple outputs using reference images and face embeddings.

📄️Style Transfer

Apply the visual style of one image to another using transferStyle(), backed by Flux Redux and cross-provider img2img.

📄️Vision Pipeline (OCR & Image Understanding)

A 3-tier progressive enhancement pipeline for extracting text, understanding images, and generating visual embeddings.

📄️Image Segmentation (SAM2 / GroundedSAM)

Turn an image plus a prompt (text, point, box, or "segment everything") into

📄️Audio Generation

AgentOS provides provider-agnostic APIs for generating music and sound effects from text prompts. Two high-level functions cover the full audio generation pipeline.

📄️Provider Preferences

The ProviderPreferences system gives callers fine-grained control over which media providers are used and in what order. It applies to image generation, video generation, and audio generation (music and SFX separately).

📄️Video Pipeline

AgentOS provides a provider-agnostic video pipeline covering generation (text-to-video, image-to-video), analysis (scene detection, transcription, summarisation), and RAG-ready indexing. Three high-level API functions expose the full pipeline.

Media Generation