# Voice Synthesis

Voice input/output tools for AgentOS:

- `text_to_speech` via OpenAI, ElevenLabs, or a local Ollama-compatible runtime
- `speech_to_text` via OpenAI Whisper, Deepgram, or a local OpenAI-compatible STT runtime (Whisper-local)
## Installation

```bash
npm install @framers/agentos-ext-voice-synthesis
```
## Configuration

Set one or more of the following environment variables, or pass them via extension options:

- `OPENAI_API_KEY`
- `ELEVENLABS_API_KEY`
- `DEEPGRAM_API_KEY`
- `OLLAMA_BASE_URL` — for a local OpenAI-compatible TTS runtime
- `WHISPER_LOCAL_BASE_URL` — for a local OpenAI-compatible STT runtime
- `TTS_PROVIDER` — prefer `openai`, `elevenlabs`, `ollama`, or `auto`
- `STT_PROVIDER` — prefer `openai`, `deepgram`, `whisper-local`, or `auto`
- `OPENAI_BASE_URL` — for OpenAI-compatible speech endpoints
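For example, a `.env` for hosted OpenAI speech with a local Whisper fallback might look like this (values are placeholders):

```bash
OPENAI_API_KEY=sk-...
TTS_PROVIDER=auto
STT_PROVIDER=auto
WHISPER_LOCAL_BASE_URL=http://localhost:8000
```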
## Tool: `text_to_speech`

Input:

- `text` (string, required) — Text to convert, max 5000 chars
- `provider` (string, optional) — `openai`, `elevenlabs`, `ollama`, or `auto`
- `voice` (string, optional) — OpenAI: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`; ElevenLabs: `rachel`, `domi`, `bella`, `antoni`, `josh`, `arnold`, `adam`, `sam`
- `model` (string, optional) — OpenAI: `tts-1`, `tts-1-hd`; ElevenLabs: `eleven_monolingual_v1`, `eleven_multilingual_v2`
- `speed` (number, OpenAI only) — 0.25 to 4.0
- `stability` (number, ElevenLabs only) — 0 to 1
- `similarity_boost` (number, ElevenLabs only) — 0 to 1
- `format` (string) — `mp3`, `opus`, `aac`, `flac`, `wav`
Output: Base64-encoded audio, provider metadata, and a duration estimate.
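A common next step is decoding the base64 payload to a playable file. This sketch assumes a result shape with an `audioBase64` field; the exact field names in the extension's output may differ.

```typescript
import { writeFileSync } from "node:fs";

// Assumed result shape, inferred from this README's output description.
interface TextToSpeechResult {
  audioBase64: string;          // base64-encoded audio bytes
  provider: string;             // e.g. "openai"
  format: string;               // e.g. "mp3"
  durationEstimateSec?: number; // rough duration estimate
}

// Decode the base64 audio and write it to disk; returns bytes written.
function saveAudio(result: TextToSpeechResult, path: string): number {
  const bytes = Buffer.from(result.audioBase64, "base64");
  writeFileSync(path, bytes);
  return bytes.length;
}
```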
### Provider selection

When `provider` is omitted, the tool prefers providers in this order:

1. OpenAI, if `OPENAI_API_KEY` is set
2. ElevenLabs, if `ELEVENLABS_API_KEY` is set
3. Ollama-compatible local runtime, as a best-effort fallback

Ollama support is experimental and assumes an OpenAI-compatible TTS endpoint at `/v1/audio/speech`.
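The fallback order above can be sketched as a pure function; the function name is illustrative, not part of the extension's API.

```typescript
type TtsProvider = "openai" | "elevenlabs" | "ollama";

// Mirrors the documented auto-selection order: OpenAI, then ElevenLabs,
// then the Ollama-compatible local runtime as a best-effort fallback.
function pickTtsProvider(env: Record<string, string | undefined>): TtsProvider {
  if (env.OPENAI_API_KEY) return "openai";
  if (env.ELEVENLABS_API_KEY) return "elevenlabs";
  return "ollama"; // best-effort local fallback via OLLAMA_BASE_URL
}
```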
## Tool: `speech_to_text`

Providers:

- `openai` — hosted Whisper via `OPENAI_API_KEY`
- `deepgram` — hosted Deepgram STT via `DEEPGRAM_API_KEY`
- `whisper-local` — local OpenAI-compatible transcription endpoint via `WHISPER_LOCAL_BASE_URL`
- `auto` — prefers OpenAI, then Deepgram, then an explicitly configured local STT runtime
Input:

- `audioBase64` (string) — Base64 audio payload, optionally as a `data:` URL
- `audioUrl` (string) — Fetchable audio URL
- `provider` (string, optional) — `auto`, `openai`, `deepgram`, or `whisper-local`
- `mimeType` (string, optional) — For example `audio/wav`
- `fileName` (string, optional) — File name sent to the provider
- `format` (string, optional) — Audio format hint such as `wav`, `mp3`, `m4a`, `webm`
- `language` (string, optional) — ISO language hint
- `prompt` (string, optional) — Context prompt to bias transcription
- `model` (string, optional) — Provider model override such as `whisper-1`, `nova-2`, or `base`
- `temperature` (number, optional) — Whisper temperature override
- `responseFormat` (string, optional) — `json`, `text`, `srt`, `verbose_json`, or `vtt`
- `diarize` (boolean, optional) — Enable speaker diarization where supported
- `utterances` (boolean, optional) — Request utterance segmentation where supported
- `smartFormat` (boolean, optional) — Enable provider-side formatting where supported
- `detectLanguage` (boolean, optional) — Enable provider-side language detection where supported
Output: Transcribed text, provider/model metadata, language, optional confidence, duration, and optional segments.
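As an illustration, an input payload can be assembled from raw audio bytes like this; `buildSttInput` is a hypothetical helper, but the field names follow the parameter list above.

```typescript
// Build a speech_to_text input payload from raw audio bytes.
function buildSttInput(audio: Buffer, mimeType = "audio/wav") {
  const format = mimeType.split("/")[1]; // e.g. "wav" from "audio/wav"
  return {
    audioBase64: audio.toString("base64"),
    mimeType,
    format,                     // format hint for the provider
    provider: "auto" as const,  // OpenAI, then Deepgram, then local STT
  };
}
```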
`whisper-local` targets OpenAI-compatible local transcription servers. This keeps the tool contract stable even when you swap between hosted Whisper, Deepgram, and local runtimes behind the same API shape.
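Because local runtimes expose the same path shape as OpenAI's hosted transcription endpoint, the request URL only differs in its base. A minimal sketch (the helper name is illustrative):

```typescript
// Join a configured base URL (e.g. WHISPER_LOCAL_BASE_URL or OPENAI_BASE_URL)
// with the OpenAI-style transcription path, tolerating trailing slashes.
function buildTranscriptionUrl(baseUrl: string): string {
  return `${baseUrl.replace(/\/+$/, "")}/v1/audio/transcriptions`;
}
```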
## License

MIT - Frame.dev