Voice Pipeline

A voice agent that talks back is straightforward to build if you don't care that it interrupts the user, never knows when they've stopped speaking, can't recover when the network blips for half a second, and will keep happily generating into a phone that the user already hung up. A voice agent you actually want to use has to handle all of those, which is why the voice path through AgentOS is its own subsystem rather than a thin wrapper over text generation. Turn-taking is a first-class concern. Barge-in is a first-class concern. The fact that audio chunks arrive on a different schedule than text tokens is a first-class concern. The state machine has six states because conversation has at least six distinct things going on at any moment.

This page is the architectural map. The configuration surface is at the bottom; the conceptual model and the wiring sit on top.

Architecture

The pipeline is six interfaces wired together by the VoicePipelineOrchestrator; each interface is described in the Core Interfaces section below.

State Machine

The orchestrator manages the conversational loop as a six-state machine; the state definitions live in VoicePipelineOrchestrator.ts.

Quick Start

Programmatic

The agent({ voice }) field is typed against VoiceConfig. The factory is synchronous — it does not return a Promise.

```ts
import { agent } from '@framers/agentos';

// Basic voice mode (Whisper STT + OpenAI TTS)
const basic = agent({
  voice: { enabled: true },
});

// Deepgram STT + ElevenLabs TTS with diarization
const advanced = agent({
  provider: 'openai', // LLM provider
  voice: {
    enabled: true,
    stt: 'deepgram',
    tts: 'elevenlabs',
    ttsVoice: 'nova',
    endpointing: 'heuristic',
    diarization: true, // boolean, not an object
    bargeIn: 'hard-cut',
    language: 'en-US',
  },
});
```

Install the matching streaming voice packs and set the required API keys before enabling voice:

  • @framers/agentos-ext-streaming-stt-whisper + OPENAI_API_KEY
  • @framers/agentos-ext-streaming-stt-deepgram + DEEPGRAM_API_KEY
  • @framers/agentos-ext-streaming-tts-openai + OPENAI_API_KEY
  • @framers/agentos-ext-streaming-tts-elevenlabs + ELEVENLABS_API_KEY

Semantic endpointing also requires an LLM callback wired into the pipeline; when that callback is absent, the runtime falls back to heuristic endpointing.

Wunderland CLI

The same shape is consumed by the Wunderland CLI's chat command via --voice flags (documented in TELEPHONY_PROVIDERS.md). For example:

```bash
wunderland chat \
  --voice \
  --voice-stt=deepgram \
  --voice-tts=elevenlabs \
  --voice-endpointing=heuristic \
  --voice-barge-in=hard-cut \
  --voice-port=8765
```

CLI flags override values configured in code.

Core Interfaces

| Interface | Purpose |
| --- | --- |
| IStreamTransport | Bidirectional audio pipe (WebSocket now, WebRTC later) |
| IStreamingSTT | Real-time speech-to-text with interim results |
| IEndpointDetector | Turn-taking: decides when the user is done speaking |
| IDiarizationEngine | Speaker identification and labeling |
| IStreamingTTS | Token-stream to audio synthesis |
| IBargeinHandler | Handles user interruption during agent speech |

Endpointing Modes

| Mode | How it works | Latency | Cost |
| --- | --- | --- | --- |
| acoustic | Pure energy-based VAD + silence timeout | Highest (~3s) | Free |
| heuristic | Punctuation/syntax analysis + silence fallback | Low (~0.5s for . ? !) | Free |
| semantic | LLM classifier for ambiguous pauses | Lowest (smart) | LLM API call per ambiguous turn |
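The heuristic mode's decision rule can be sketched as a small pure function. This is an illustrative reconstruction, not the engine's actual code: the function name and thresholds are assumptions, chosen to match the latency figures in the table above (~0.5s after terminal punctuation, ~3s silence fallback).

```typescript
// Illustrative sketch of heuristic endpointing: terminal punctuation plus a
// short pause ends the turn quickly; otherwise a long silence ends it.
// Thresholds are assumptions matching the table above, not engine values.
const TERMINAL_PUNCTUATION = /[.?!]\s*$/;

interface EndpointInput {
  transcript: string; // latest interim transcript text
  silenceMs: number;  // silence observed since the last word
}

function isTurnComplete({ transcript, silenceMs }: EndpointInput): boolean {
  // Fast path: ". ? !" plus ~0.5s of silence ends the turn.
  if (TERMINAL_PUNCTUATION.test(transcript.trim()) && silenceMs >= 500) {
    return true;
  }
  // Fallback: a long silence (~3s, like acoustic mode) ends the turn
  // even without punctuation.
  return silenceMs >= 3000;
}
```

The semantic mode would sit on top of this: only when the heuristic is ambiguous (no punctuation, medium-length pause) does the LLM classifier get called.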

Barge-in Modes

| Mode | Behavior |
| --- | --- |
| hard-cut | Immediately cancel TTS after 300ms of user speech. Injects [interrupted] marker into conversation history. |
| soft-fade | Fade TTS over 200ms. If user speaks < 2s (backchannel), resume. If > 2s, cancel. |
| disabled | Agent speaks to completion regardless of user speech. |
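The two decision points in the table can be sketched as follows. These helpers are illustrative reconstructions of the documented thresholds (300ms hard-cut trigger, 2s backchannel cutoff), not the actual handler implementations:

```typescript
// Hard-cut: cancel once continuous user speech crosses 300ms while the
// agent is speaking. (Illustrative; thresholds from the table above.)
function hardCutShouldCancel(continuousSpeechMs: number): boolean {
  return continuousSpeechMs >= 300;
}

// Soft-fade: decided once the user's interjection ends. Under 2s it is
// treated as a backchannel ("mm-hm", "right") and TTS resumes; longer
// speech cancels the utterance.
function softFadeOutcome(interjectionMs: number): 'resume' | 'cancel' {
  return interjectionMs < 2000 ? 'resume' : 'cancel';
}
```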

Extension Packs

| Pack | npm Package | Provider | Env Var |
| --- | --- | --- | --- |
| Deepgram STT | @framers/agentos-ext-streaming-stt-deepgram | Deepgram Nova-2 | DEEPGRAM_API_KEY |
| Whisper STT | @framers/agentos-ext-streaming-stt-whisper | OpenAI Whisper | OPENAI_API_KEY |
| OpenAI TTS | @framers/agentos-ext-streaming-tts-openai | OpenAI TTS-1 | OPENAI_API_KEY |
| ElevenLabs TTS | @framers/agentos-ext-streaming-tts-elevenlabs | ElevenLabs | ELEVENLABS_API_KEY |
| Diarization | @framers/agentos-ext-diarization | Local x-vector | — |
| Semantic Endpoint | @framers/agentos-ext-endpoint-semantic | Any LLM | LLM API key |

WebSocket Protocol

The voice server communicates via WebSocket:

  • Binary messages: Raw audio (client→server: PCM Float32 mono; server→client: encoded mp3/opus)
  • Text messages: JSON control/metadata

Client → Server

```ts
// Text messages
{ type: 'config', sampleRate: 16000, voice: 'nova', language: 'en-US' }
{ type: 'control', action: 'mute' | 'unmute' | 'stop' }

// Binary messages: raw PCM Float32 mono audio
```

Server → Client

```ts
{ type: 'session_started', sessionId: '...', config: { sampleRate: 24000, format: 'opus' } }
{ type: 'transcript', text: 'Hello', isFinal: false, speaker: 'Speaker_0' }
{ type: 'agent_thinking' }
{ type: 'agent_speaking', text: 'Hi there!' }
{ type: 'agent_done' }
{ type: 'barge_in', action: 'cancelled' }
{ type: 'session_ended', reason: 'disconnect' }

// Binary messages: encoded audio (mp3/opus) in negotiated format
```
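A minimal client-side dispatch for this protocol can be sketched as follows. The classification helper and its names are illustrative, not part of the AgentOS client API; only the message shapes come from the protocol above:

```typescript
// Sketch of client-side frame handling: binary WebSocket frames carry audio,
// text frames carry JSON control/metadata. Types mirror the protocol above.
type ServerMessage =
  | { type: 'session_started'; sessionId: string; config: { sampleRate: number; format: string } }
  | { type: 'transcript'; text: string; isFinal: boolean; speaker?: string }
  | { type: 'agent_thinking' }
  | { type: 'agent_speaking'; text: string }
  | { type: 'agent_done' }
  | { type: 'barge_in'; action: string }
  | { type: 'session_ended'; reason: string };

type Frame =
  | { kind: 'audio'; bytes: ArrayBuffer }        // encoded mp3/opus per session config
  | { kind: 'control'; message: ServerMessage }; // parsed JSON control message

function classifyFrame(data: ArrayBuffer | string): Frame {
  if (typeof data === 'string') {
    return { kind: 'control', message: JSON.parse(data) as ServerMessage };
  }
  return { kind: 'audio', bytes: data };
}
```

In a browser client this would run inside the WebSocket `message` handler (with `binaryType = 'arraybuffer'`), routing audio frames to playback and control frames to UI state.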

Error Recovery

| Failure | Recovery |
| --- | --- |
| STT connection drops | Auto-reconnect with exponential backoff (100ms → 5s). Audio frames buffered during reconnect. |
| TTS connection drops | Cancel current utterance, re-create session, re-send buffered text. |
| Transport disconnects | Tear down all sessions. Client must reconnect. |
| Endpoint stuck | 30s watchdog timer forces turn_complete. |
| Diarization lag | Non-blocking. Transcript sent to LLM immediately; speaker labels backfilled. |
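The STT reconnect row combines two mechanisms: an exponential backoff schedule and a frame buffer. A minimal sketch of both, with names and class shape assumed for illustration:

```typescript
// Backoff schedule from the table above: 100ms doubling up to a 5s cap.
// attempt 0 -> 100ms, 1 -> 200ms, 2 -> 400ms, ..., capped at 5000ms.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 5000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Audio frames captured while the STT socket is down are buffered, then
// flushed in order once the connection is re-established. (Illustrative.)
class ReconnectBuffer {
  private frames: Float32Array[] = [];
  push(frame: Float32Array): void {
    this.frames.push(frame);
  }
  drain(): Float32Array[] {
    const out = this.frames;
    this.frames = [];
    return out;
  }
}
```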

Known Limitations

The voice pipeline is functional but has these known limitations that will be addressed in future releases:

No True Incremental LLM Streaming

The current chat --voice implementation gets the full LLM text reply first, then chunks it for TTS. This means:

  • First audio playback is delayed until the LLM finishes generating
  • Barge-in cannot cancel in-flight LLM generation — only TTS playback
  • Future: wire a real streaming text-turn API from the chat runtime into IVoicePipelineAgentSession

Semantic Endpointing Requires LLM Callback

The semantic endpoint detector (@framers/agentos-ext-endpoint-semantic) only invokes the LLM turn-completeness classifier when an explicit llmCall callback is provided. Without it, the detector falls back to heuristic endpointing (punctuation + silence timeout).

Telephony Media Stream Bridge

The TelephonyStreamTransport bridges provider media streams (Twilio, Telnyx, Plivo) into the voice pipeline. Webhook routes handle call lifecycle via CallManager, and media stream WebSocket connections feed audio through the same VoicePipelineOrchestrator used by browser voice. The VoiceTransportAdapter now fully wires deliverNodeOutput() to pushToTTS() and getNodeInput() to waitForUserTurn() for IVR graph flows.

Env-Based Provider Resolution

The SpeechProviderResolver and createStreamingPipeline() currently resolve voice components based on environment variables and static configuration. Future versions will resolve through a real ExtensionManager runtime with dynamic pack loading and hot-swapping.

No Call Recording or Transcript Persistence

Call transcripts are held in memory during the call but are not persisted to storage after the call ends. Future: integrate with AgentOS storage/memory system.


Voice-Graph Integration

AgentOS lets you embed voice I/O directly inside an orchestration graph. There are two complementary integration modes: voice nodes (one step in a larger graph is a voice session) and voice transport (the entire graph runs inside a phone call or real-time voice session).

Voice as a Graph Node Type

Use the voiceNode() builder to create a GraphNode of type 'voice'. The node manages a full multi-turn STT/TTS session and exits when one of its configured exit conditions fires.

```ts
import { voiceNode } from '@framers/agentos/orchestration';

const listenNode = voiceNode('intake', {
  mode: 'conversation',
  stt: 'deepgram',
  tts: 'elevenlabs',
  maxTurns: 5,
  exitOn: 'keyword',
  exitKeywords: ['confirmed', 'cancel'],
})
  .on('keyword:confirmed', 'process-intake')
  .on('keyword:cancel', 'goodbye')
  .on('hangup', 'end')
  .on('turns-exhausted', 'fallback')
  .build();
```

The builder produces a GraphNode with:

| Property | Value |
| --- | --- |
| type | 'voice' |
| executorConfig.type | 'voice' |
| executionMode | 'react_bounded' — models the multi-turn loop |
| effectClass | 'external' — touches real-world audio I/O |
| checkpoint | 'before' — snapshot taken before the session starts |

Exit reasons map to the next node via .on(exitReason, targetNodeId). The .on() chain is order-independent; the voice executor resolves the correct edge after the session ends.

Voice Transport Mode

When the entire workflow should run inside a single phone call, declare a transport at the workflow level. All nodes in the graph then receive input from STT and deliver output to TTS via a VoiceTransportAdapter.

```ts
import { workflow } from '@framers/agentos/orchestration';
import { VoiceTransportAdapter } from '@framers/agentos/orchestration/runtime/VoiceTransportAdapter';

const callFlow = workflow('phone-intake')
  .input(inputSchema)
  .returns(outputSchema)
  .transport('voice', { stt: 'deepgram', tts: 'openai', voice: 'alloy' })
  .step('greet', { voice: { mode: 'speak-only' } })
  .step('listen', { voice: { mode: 'conversation', maxTurns: 3 } })
  .step('confirm', { voice: { mode: 'conversation', exitOn: 'keyword', exitKeywords: ['yes', 'no'] } })
  .step('process', { tool: 'crm_update' })
  .compile();
```

The VoiceTransportAdapter bridges the graph I/O cycle:

  • getNodeInput(nodeId) — waits for the user's next speech turn (resolves on turn_complete).
  • deliverNodeOutput(nodeId, text) — sends the node's response to TTS and emits a voice_audio graph event.
  • init(state) — injects state.scratch.voiceTransport so voice nodes can access the transport.
  • dispose() — emits voice_session ended and tears down the adapter.
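The I/O cycle those two methods create — wait for speech, compute a reply, speak it — can be sketched against a mock of the adapter surface. The interface and loop below are illustrative, mirroring only the `getNodeInput`/`deliverNodeOutput` contract described above, not the real adapter class:

```typescript
// Hypothetical slice of the adapter surface, mirroring the two I/O methods
// described above. The real adapter lives at
// @framers/agentos/orchestration/runtime/VoiceTransportAdapter.
interface VoiceTransportLike {
  getNodeInput(nodeId: string): Promise<string>;               // resolves on turn_complete
  deliverNodeOutput(nodeId: string, text: string): Promise<void>; // sends text to TTS
}

// One conversation turn in transport mode: listen, respond, speak.
async function runConversationTurn(
  transport: VoiceTransportLike,
  nodeId: string,
  respond: (userText: string) => Promise<string>,
): Promise<string> {
  const userText = await transport.getNodeInput(nodeId); // wait for the user's turn
  const reply = await respond(userText);                 // node computes its response
  await transport.deliverNodeOutput(nodeId, reply);      // deliver via TTS
  return reply;
}
```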

YAML Syntax

Voice step in a YAML workflow

```yaml
name: phone-intake
steps:
  - id: greet
    voice:
      mode: speak-only
      tts: openai
      voice: alloy

  - id: collect-info
    voice:
      mode: conversation
      stt: deepgram
      endpointing: heuristic
      bargeIn: hard-cut
      maxTurns: 5
      exitOn: keyword
      exitKeywords:
        - confirmed
        - cancel
```

Voice transport at workflow level

```yaml
name: phone-intake
transport:
  type: voice
  stt: deepgram
  tts: elevenlabs
  voice: nova
  bargeIn: hard-cut
  endpointing: heuristic
steps:
  - id: greet
    voice:
      mode: speak-only
  - id: intake
    voice:
      mode: conversation
      maxTurns: 3
      exitOn: keyword
      exitKeywords: [confirmed, done]
```

When transport.type: voice is present, compileWorkflowYaml() attaches the config to compiled._transport so the caller can detect that the workflow expects a VoiceTransportAdapter at runtime.

YAML voice step fields

| Field | Type | Description |
| --- | --- | --- |
| mode | conversation \| listen-only \| speak-only | Required. Session direction. |
| stt | string | STT provider override (e.g. deepgram, openai). |
| tts | string | TTS provider override (e.g. openai, elevenlabs). |
| voice | string | TTS voice name. |
| endpointing | acoustic \| heuristic \| semantic | Endpoint detection mode. |
| bargeIn | hard-cut \| soft-fade \| disabled | Barge-in handling. |
| diarization | boolean | Enable speaker diarization. |
| language | string | BCP-47 language tag (e.g. en-US). |
| maxTurns | number | Maximum turns before turns-exhausted exit. 0 = unlimited. |
| exitOn | string | Primary exit condition: hangup, silence-timeout, keyword, turns-exhausted, manual. |
| exitKeywords | string[] | Phrases that trigger keyword exit. Case-insensitive substring match. |
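The exitKeywords semantics ("case-insensitive substring match") are small enough to sketch directly. The function name is illustrative; the behavior follows the field table above:

```typescript
// Case-insensitive substring match over the final transcript, per the
// exitKeywords field above. Returns the matched keyword, or null if none
// matched. (Illustrative name, not the executor's real helper.)
function matchExitKeyword(finalTranscript: string, exitKeywords: string[]): string | null {
  const haystack = finalTranscript.toLowerCase();
  for (const keyword of exitKeywords) {
    if (haystack.includes(keyword.toLowerCase())) {
      return keyword; // the exitReason becomes `keyword:<word>`
    }
  }
  return null;
}
```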

Barge-in Routing with Exit Conditions

The VoiceNodeExecutor races multiple exit conditions simultaneously via a Promise.race. The first condition to fire determines the exitReason string, which is then looked up in the node's edge map to resolve the routeTarget.

| exitReason | Trigger | Typical edge target |
| --- | --- | --- |
| hangup | Transport emits close or disconnected | end / cleanup node |
| turns-exhausted | turn_complete fires and turnCount >= maxTurns | summarize / fallback node |
| keyword:&lt;word&gt; | final_transcript contains a phrase from exitKeywords | intent-specific handler |
| silence-timeout | No speech for 30s when exitOn: silence-timeout | timeout handler / retry |
| interrupted | AbortController fired with a VoiceInterruptError (barge-in) | re-listen / cancel TTS |
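The race itself is plain Promise.race: each condition is a promise that resolves to its exitReason string, and the first to settle wins. The condition sources below are stubs for illustration, not the executor's real event wiring:

```typescript
// Illustrative reconstruction of the exit-condition race. Each exit
// condition is modeled as a promise resolving to its exitReason string.
type ExitReason = string;

function raceExitConditions(conditions: Array<Promise<ExitReason>>): Promise<ExitReason> {
  return Promise.race(conditions); // first condition to fire wins
}

// Example condition: a silence timeout that fires after `ms` of no speech.
function silenceTimeout(ms: number): Promise<ExitReason> {
  return new Promise((resolve) => setTimeout(() => resolve('silence-timeout'), ms));
}
```

In the real executor, the hangup, keyword, turns-exhausted, and interruption conditions would each be such a promise, and the winning exitReason is looked up in the node's edge map.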

When a barge-in occurs, the executor catches the VoiceInterruptError and returns exitReason: 'interrupted'. Wire a loopback edge .on('interrupted', 'listen') to restart the listen cycle:

```ts
voiceNode('listen', { mode: 'conversation' })
  .on('interrupted', 'listen') // barge-in → re-listen
  .on('turns-exhausted', 'summarize')
  .on('hangup', 'end')
  .build();
```

Graph Events for Voice

Voice nodes emit the following GraphEvent values in causal order:

| Event type | When |
| --- | --- |
| voice_session (action: started) | Immediately on execute() entry |
| voice_transcript (isFinal: false) | Each interim_transcript from STT |
| voice_transcript (isFinal: true) | Each confirmed final_transcript |
| voice_turn_complete | Each turn_complete from endpoint detector |
| voice_audio (direction: outbound) | When TTS delivery is triggered by VoiceTransportAdapter.deliverNodeOutput() |
| voice_barge_in | Each barge_in event from the pipeline session |
| voice_session (action: ended) | On node exit, with exitReason |

Consume events via the GraphRuntime stream:

```ts
for await (const event of runtime.stream(graph, input)) {
  if (event.type === 'voice_transcript' && event.isFinal) {
    console.log(`[${event.speaker}] ${event.text}`);
  }
  if (event.type === 'voice_session' && event.action === 'ended') {
    console.log('Session exit reason:', event.exitReason);
  }
}
```

Checkpoint Support

Voice nodes use checkpoint: 'before' so the runtime takes a state snapshot before each voice session starts. If the process crashes mid-call, the graph can be resumed from the beginning of that voice node.

In addition, the VoiceNodeExecutor writes a VoiceNodeCheckpoint to scratchUpdate[nodeId] after every execution:

```ts
interface VoiceNodeCheckpoint {
  turnIndex: number;             // total turns completed (inclusive of prior runs)
  transcript: TranscriptEntry[]; // full buffered transcript
  lastExitReason: string | null;
  speakerMap: Record<string, string>;
  sessionConfig: VoiceNodeConfig;
}
```

Pass state.scratch[nodeId].turnIndex back as the initialTurnCount when constructing a VoiceTurnCollector to resume the turn counter from where the previous run left off — enabling a call that spans multiple graph runs (e.g. after a human-approval pause) to count turns continuously rather than resetting to zero.
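The resumable turn counting can be sketched as a minimal counter that accepts an initialTurnCount, as described above. Only the counting behavior is shown; the real VoiceTurnCollector also buffers transcripts, and this class shape is an assumption:

```typescript
// Minimal sketch of turn counting that resumes from a checkpoint, assuming
// a collector-like counter seeded with initialTurnCount. Illustrative only.
class TurnCounter {
  private count: number;

  constructor(initialTurnCount = 0) {
    // Seed from state.scratch[nodeId].turnIndex to continue a prior run.
    this.count = initialTurnCount;
  }

  onTurnComplete(): number {
    return ++this.count; // called on each turn_complete event
  }

  // maxTurns = 0 means unlimited, per the YAML field table above.
  exhausted(maxTurns: number): boolean {
    return maxTurns > 0 && this.count >= maxTurns;
  }
}
```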


Provider Options (sttOptions / ttsOptions)

The orchestrator forwards pipeline-level sttOptions and ttsOptions to providers via providerOptions. This enables provider-specific features without changing the core interfaces.

Deepgram STT Options

Pass through VoicePipelineConfig.sttOptions:

```ts
const orchestrator = new VoicePipelineOrchestrator({
  stt: 'deepgram-streaming',
  tts: 'elevenlabs-streaming',
  sttOptions: {
    sentiment: true,        // Per-utterance sentiment analysis
    smart_format: true,     // Auto-punctuation, capitalization, numbers
    diarize: true,          // Speaker diarization labels
    utterance_end_ms: 1000, // Server-side silence endpoint (ms)
    keywords: [             // Keyword boosting (name:weight format)
      'Gideon:2',
      'The Crevasse:1.5',
      'fireball:1.5',
    ],
  },
});
```
| Option | Type | Deepgram Param | Effect |
| --- | --- | --- | --- |
| sentiment | boolean | sentiment=true | Returns sentiment per utterance (positive/negative/neutral + confidence) |
| smart_format | boolean | smart_format=true | Auto-punctuates, capitalizes, formats numbers and dates |
| diarize | boolean | diarize=true | Labels speaker identity per word (speaker: 0, speaker: 1) |
| utterance_end_ms | number | utterance_end_ms=N | Server-side silence endpoint detection (supplements client-side heuristic) |
| keywords | string[] | keywords=word:weight | Boosts recognition of specific terms (names, game terms, spells) |
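The mapping from sttOptions to Deepgram query parameters is mechanical; a hedged sketch of what the provider pack likely does (the parameter names come from the table above and Deepgram's API, while the builder function itself is illustrative):

```typescript
// Illustrative mapping of sttOptions onto Deepgram live-API query params.
// Parameter names match the table above; this builder is not the pack's
// actual code.
interface DeepgramSttOptions {
  sentiment?: boolean;
  smart_format?: boolean;
  diarize?: boolean;
  utterance_end_ms?: number;
  keywords?: string[]; // "name:weight" entries
}

function toDeepgramQuery(opts: DeepgramSttOptions): string {
  const params = new URLSearchParams();
  if (opts.sentiment) params.set('sentiment', 'true');
  if (opts.smart_format) params.set('smart_format', 'true');
  if (opts.diarize) params.set('diarize', 'true');
  if (opts.utterance_end_ms !== undefined) {
    params.set('utterance_end_ms', String(opts.utterance_end_ms));
  }
  // Each keyword becomes its own repeated keywords= parameter.
  for (const kw of opts.keywords ?? []) params.append('keywords', kw);
  return params.toString();
}
```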

Sentiment in TranscriptEvent

When sentiment: true is enabled, TranscriptEvent includes a sentiment field:

```ts
interface TranscriptEvent {
  text: string;
  confidence: number;
  words: TranscriptWord[];
  isFinal: boolean;
  durationMs?: number;
  sentiment?: {
    label: 'positive' | 'negative' | 'neutral';
    confidence: number;
  };
}
```

Consumers can use this for mood modulation, game mechanics, or UI feedback without additional NLP processing.

ElevenLabs TTS Options

Pass through VoicePipelineConfig.ttsOptions:

```ts
const orchestrator = new VoicePipelineOrchestrator({
  stt: 'deepgram-streaming',
  tts: 'elevenlabs-streaming',
  ttsOptions: {
    stability: 0.3,        // 0.0-1.0: lower = more expressive
    similarityBoost: 0.75, // 0.0-1.0: voice clone fidelity
    style: 0.6,            // 0.0-1.0: style exaggeration
    useSpeakerBoost: true, // Clarity enhancement
    speed: 0.85,           // 0.1-5.0: speaking rate
  },
});
```
| Option | Type | Range | Default | Effect |
| --- | --- | --- | --- | --- |
| stability | number | 0.0-1.0 | 0.5 | Intonation variability. Low = more expressive. |
| similarityBoost | number | 0.0-1.0 | 0.75 | Voice clone fidelity. |
| style | number | 0.0-1.0 | 0.0 | Exaggeration of the voice's natural style. |
| useSpeakerBoost | boolean | — | true | Clarity enhancement filter. |
| speed | number | 0.1-5.0 | 1.0 | Speaking rate multiplier. |

These are sent in the ElevenLabs WebSocket BOS (beginning-of-stream) message as voice_settings and generation_config.speed.

Dynamic Expressiveness

For applications that modulate voice based on character state (personality, mood, game context), compute ttsOptions per turn rather than setting them once at session start. The orchestrator creates a new TTS session per utterance, so changing ttsOptions between turns takes effect immediately.

```ts
// Example: mood-reactive voice
const expressiveness = computeExpressiveness(personality, currentMood);
const orchestrator = new VoicePipelineOrchestrator({
  // ...
  ttsOptions: expressiveness,
});
```
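One possible computeExpressiveness, since the snippet above leaves it undefined: map a single arousal scalar onto the ElevenLabs settings from the table. This simplified single-argument variant and its mapping are entirely illustrative assumptions:

```typescript
// Hypothetical expressiveness mapping: one mood scalar -> ttsOptions.
// Ranges respect the option table above; the coefficients are invented.
interface TtsExpressiveness {
  stability: number;       // lower when agitated = more expressive
  similarityBoost: number;
  style: number;
  speed: number;
}

// arousal in [0, 1]: 0 = calm, 1 = excited/agitated
function computeExpressiveness(arousal: number): TtsExpressiveness {
  const a = Math.min(1, Math.max(0, arousal));
  return {
    stability: 0.7 - 0.4 * a,  // 0.7 calm, down toward 0.3 agitated
    similarityBoost: 0.75,     // keep clone fidelity constant
    style: 0.2 + 0.5 * a,      // more exaggeration when excited
    speed: 0.9 + 0.25 * a,     // speak slightly faster when excited
  };
}
```

Because the orchestrator creates a new TTS session per utterance, recomputing this object between turns (as noted above) changes the voice immediately on the next utterance.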

References

Voice activity detection + endpoint detection

  • Tan, Z.-H., Sarkar, A. K., & Dehak, N. (2020). rVAD: An unsupervised segment-based robust voice activity detection method. Computer Speech & Language, 59, 1–21. — Robust VAD baseline informing the heuristic endpoint detector's silence-vs-speech discrimination. arXiv:1906.03588
  • Silero Team. (2024). Silero VAD: Pre-trained enterprise-grade voice activity detector. — Production-grade VAD model widely used in real-time pipelines; reference for the acoustic endpoint detector design. GitHub
  • Skerry-Ryan, R. J., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., Shor, J., Weiss, R., Clark, R., & Saurous, R. A. (2018). Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. ICML 2018. — Prosody-aware synthesis foundations behind the TTS provider abstraction. arXiv:1803.09047

Streaming ASR

  • Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. ICML 2006. — CTC foundations behind streaming ASR — informs how partial-transcript timing flows through the endpoint detector. ACM DL
  • Chiu, C.-C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss, R. J., Rao, K., Gonina, E., Jaitly, N., Li, B., Chorowski, J., & Bacchiani, M. (2018). State-of-the-art speech recognition with sequence-to-sequence models. ICASSP 2018. — Reference architecture for the streaming-STT provider interface. arXiv:1712.01769
  • Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. ICML 2023. — Whisper, the default fallback STT in the pipeline. arXiv:2212.04356

Barge-in / interruption handling

  • Edlund, J., Heldner, M., & Hirschberg, J. (2009). Pause and gap length in face-to-face interaction. Interspeech 2009. — Pause statistics informing the heuristic endpoint detector's silence thresholds. ISCA Archive
  • Skantze, G. (2021). Turn-taking in conversational systems and human-robot interaction: A review. Computer Speech & Language, 67, 101178. — Survey of turn-taking strategies; the barge-in handler implements the "hard cut on speech-detected during TTS" pattern from this taxonomy. DOI

Real-time voice agents

  • Anastassiou, P., Chen, J., Chen, J., Chen, Y., Chen, Z., Chen, Z., Cong, J., Deng, L., Ding, C., Gao, L., Gong, M., Huang, P., Huang, Q., Huang, Z., Huo, Y., Jia, D., Li, C., Li, F., Li, H., ... Wei, X. (2024). Seed-TTS: A family of high-quality versatile speech generation models. arXiv preprint. — Reference for low-latency, prosody-controllable TTS — informs the SPEAKING-state design where TTS is allowed to overlap with EOL planning. arXiv:2406.02430

Implementation references

  • packages/agentos/src/voice-pipeline/VoicePipelineOrchestrator.ts — the state machine
  • packages/agentos/src/voice-pipeline/HeuristicEndpointDetector.ts + AcousticEndpointDetector.ts — endpoint detection strategies
  • packages/agentos/src/voice-pipeline/HardCutBargeinHandler.ts + SoftFadeBargeinHandler.ts — barge-in handlers
  • packages/agentos/src/voice-pipeline/types.ts — IStreamTransport, IStreamingSTT, IStreamingTTS, IBargeinHandler interfaces