Background Audio for Speech-to-Speech Voice Agents
Custom extension to OpenAI’s TwilioRealtimeTransportLayer
Note: The full implementation can be found at https://github.com/chrisvin-jabamani/twilio-openai-background-audio
Overview
The voice AI industry is on the cusp of a significant architectural shift. Historically, voice AI has relied on chained architectures – transcribing speech to text, processing it through an LLM, and converting the output back to speech. This architecture has several drawbacks, including:
Inability to capture non-textual audio cues. Vocal signals like “hmm” that indicate thinking are lost during transcription.
Loss of prosody. Speech-to-text conversion strips away acoustic features like tone, emotion and expression that convey meaning beyond words.
Latency. Processing through three sequential stages introduces additional delay in response generation.
Speech-to-speech models represent a paradigm shift – a single model processes audio input directly and generates audio output in real-time, preserving the natural characteristics of human speech. Given the nascency of this technology, several capabilities remain undeveloped. One of the most critical is background audio, which serves as a key signal of authenticity when users interact with a voice agent.
Despite the advancements of OpenAI’s Realtime API and Agents SDK, neither framework natively supports background audio injection for voice agents. This writeup and GitHub repository detail a custom extension to OpenAI’s TwilioRealtimeTransportLayer that enables continuous background audio playback with automatic muting when the agent is speaking, resulting in more natural and realistic voice interactions.
Solution Summary
The solution cleanly extends OpenAI’s TwilioRealtimeTransportLayer through a custom TwilioBackgroundAudioTransport class that maintains a separate background audio stream synchronized with agent speech state, preserving base transport functionality (audio routing, interruption handling, etc.) without modifying SDK source code. Key features include the following; a minimal sketch of the resulting class shape follows the list.
1. Speech State Detection. The challenge is tracking agent speech across multiple lifecycle stages: when audio generation begins, when it completes, and when playback to the caller finishes. The solution combines three mechanisms:
_onAudio override to detect streaming start
RealtimeSession listeners (response.done, response.cancelled) to track generation lifecycle
Twilio mark events injected after each response, which Twilio echoes back upon playback completion, signaling when to resume background audio
2. Buffer Management. Sends Twilio’s clear event before each agent response to flush buffered audio, preventing background audio bleed-through and preserving pristine voice quality.
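For orientation, here is a minimal sketch of the class shape behind both features. Field and method names follow the description above and the constructor options mirror the Quick Start below; the base-class option shape and exact types are assumptions, not the repository’s verbatim code.

import * as fs from 'node:fs';
import type { WebSocket } from 'ws';
import { TwilioRealtimeTransportLayer } from '@openai/agents-extensions';

class TwilioBackgroundAudioTransport extends TwilioRealtimeTransportLayer {
  private backgroundAudio: Buffer;                      // raw μ-law 8kHz bytes
  private backgroundPosition = 0;                       // current offset into the loop
  private backgroundTimer: NodeJS.Timeout | null = null;
  private isAgentSpeaking = false;
  private pendingMarkName: string | null = null;
  private streamSid: string | null = null;
  private twilioWs: WebSocket;

  constructor(options: { twilioWebSocket: WebSocket; backgroundAudioPath: string }) {
    super({ twilioWebSocket: options.twilioWebSocket });
    this.twilioWs = options.twilioWebSocket;
    // loadBackgroundAudio(): μ-law at 8kHz is 1 byte per sample, so the raw
    // file can be streamed directly without decoding.
    this.backgroundAudio = fs.readFileSync(options.backgroundAudioPath);
    this.setupTwilioListeners(this.twilioWs);
  }

  // startBackgroundAudio(), stopBackgroundAudio(), _onAudio(),
  // sendEndOfAudioMark(), setupTwilioListeners(), and setupSessionListeners()
  // are sketched step by step in the execution flow below.
}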
Execution Flow
The following trace illustrates how background audio synchronizes with agent speech during a typical call.
Pre-Call Initialization
Constructor loads background audio file from disk via loadBackgroundAudio()
setupTwilioListeners() registers WebSocket event handlers for start and mark events
setupSessionListeners() (called from index.ts) registers OpenAI session event handlers
0:00 | Call Start
Twilio sends start event with streamSid
startBackgroundAudio() initiates a timer that sends 160-byte audio chunks every 20ms (160 bytes of 8kHz μ-law is exactly 20ms of audio, the format Twilio Media Streams expects)
Background audio loops continuously using modulo arithmetic on backgroundPosition, with drift correction to maintain precise timing
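A sketch of that loop, assuming the fields from the skeleton above. Scheduling each tick against a fixed anchor time (rather than a bare setInterval) keeps cumulative drift out of the stream; the media message is Twilio’s standard Media Streams payload.

private startBackgroundAudio(): void {
  const CHUNK_BYTES = 160; // 8000 samples/s × 0.020 s × 1 byte/sample (μ-law)
  const CHUNK_MS = 20;
  const startedAt = Date.now();
  let ticks = 0;

  const sendChunk = () => {
    if (this.isAgentSpeaking || !this.streamSid) return;

    // Loop the file with modulo arithmetic so wraparound is seamless.
    const chunk = Buffer.alloc(CHUNK_BYTES);
    for (let i = 0; i < CHUNK_BYTES; i++) {
      chunk[i] = this.backgroundAudio[(this.backgroundPosition + i) % this.backgroundAudio.length];
    }
    this.backgroundPosition =
      (this.backgroundPosition + CHUNK_BYTES) % this.backgroundAudio.length;

    this.twilioWs.send(JSON.stringify({
      event: 'media',
      streamSid: this.streamSid,
      media: { payload: chunk.toString('base64') },
    }));

    // Drift correction: schedule against the ideal timeline, not the last tick.
    ticks++;
    const delay = Math.max(0, startedAt + ticks * CHUNK_MS - Date.now());
    this.backgroundTimer = setTimeout(sendChunk, delay);
  };

  sendChunk();
}

private stopBackgroundAudio(): void {
  if (this.backgroundTimer) {
    clearTimeout(this.backgroundTimer);
    this.backgroundTimer = null;
  }
}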
0:03 | Agent Speaks
OpenAI generates first audio chunk, triggering _onAudio()
Sets isAgentSpeaking = true
Calls stopBackgroundAudio() to cancel the timer
Sends Twilio clear command to flush buffered background audio chunks
Forwards agent audio to caller via super._onAudio()
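A sketch of the override covering these four steps. The _onAudio hook is the one named above; its exact signature is an assumption here.

// Called by the transport for each audio delta streamed from OpenAI.
protected _onAudio(event: any): void { // signature assumed
  if (!this.isAgentSpeaking) {
    this.isAgentSpeaking = true;
    this.stopBackgroundAudio(); // cancel the 20ms timer

    // Flush background chunks Twilio has buffered but not yet played, so the
    // agent's voice starts clean rather than queued behind ambience.
    this.twilioWs.send(JSON.stringify({ event: 'clear', streamSid: this.streamSid }));
  }
  super._onAudio(event); // forward agent audio to the caller as usual
}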
0:06 | Agent Speech Generation Complete
OpenAI emits response.done event
sendEndOfAudioMark() injects a unique Twilio mark event into the audio stream
Stores mark name in pendingMarkName and waits for confirmation
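A sketch of the mark injection. The mark message shape is Twilio’s documented Media Streams mark event; how response.done / response.cancelled are surfaced off the RealtimeSession (shown here via a transport_event listener, with imports as in the Quick Start below) is an assumption.

private sendEndOfAudioMark(): void {
  const name = `end-of-response-${Date.now()}`; // unique per response
  this.pendingMarkName = name;
  this.twilioWs.send(JSON.stringify({
    event: 'mark',
    streamSid: this.streamSid,
    mark: { name },
  }));
}

setupSessionListeners(session: RealtimeSession): void {
  // Event plumbing assumed: adapt to however your session exposes raw events.
  session.on('transport_event', (event: any) => {
    if (event.type === 'response.done' || event.type === 'response.cancelled') {
      this.sendEndOfAudioMark();
    }
  });
}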
0:08 | Agent Speech Playback Complete
Twilio confirms mark playback via WebSocket callback
Verifies mark name matches pendingMarkName
Sets isAgentSpeaking = false and clears pendingMarkName
Calls startBackgroundAudio() to resume from current backgroundPosition – audio continues seamlessly without restarting
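A sketch of the Twilio-side listeners, handling both the start event from call setup and the mark echo. Message shapes follow Twilio’s Media Streams protocol; attaching a second message listener alongside the base transport’s own is assumed to be safe, since WebSocket emitters support multiple listeners.

private setupTwilioListeners(ws: WebSocket): void {
  ws.on('message', (raw) => {
    const msg = JSON.parse(raw.toString());

    if (msg.event === 'start') {
      this.streamSid = msg.start.streamSid; // captured at 0:00 above
      this.startBackgroundAudio();
    } else if (msg.event === 'mark' && msg.mark?.name === this.pendingMarkName) {
      // Twilio has played everything queued before the mark: the agent's
      // speech has actually reached the caller, so ambience can resume.
      this.isAgentSpeaking = false;
      this.pendingMarkName = null;
      this.startBackgroundAudio(); // resumes from backgroundPosition, no restart
    }
  });
}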
0:11 | User Interrupts Agent
OpenAI fires audio_interrupted event
Immediately sets isAgentSpeaking = false, clears pendingMarkName, and resumes background audio without waiting for mark confirmation
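The corresponding listener, added alongside the transport_event handler in setupSessionListeners(). The Agents SDK session emits audio_interrupted when the caller barges in; since the interrupted response’s mark may never be echoed back, the state is reset immediately.

// In setupSessionListeners(), next to the transport_event handler:
session.on('audio_interrupted', () => {
  this.isAgentSpeaking = false;
  this.pendingMarkName = null; // the interrupted response's mark may never arrive
  this.startBackgroundAudio(); // bring ambience back under the caller's voice
});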
Getting Started
Audio Requirements
A sample background audio file (sample-background-mulaw-8khz.raw) is included in the repository. To use custom audio, files must be μ-law encoded at 8kHz. Convert existing files using FFmpeg:
ffmpeg -i input.mp3 -ar 8000 -ac 1 -acodec pcm_mulaw output.raw
Quick Start
import { TwilioBackgroundAudioTransport } from './TwilioBackgroundAudioTransport';
import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';

// Create agent and transport with background audio
const agent = new RealtimeAgent({ name: 'Assistant', instructions: '...' });
const transport = new TwilioBackgroundAudioTransport({
  twilioWebSocket: connection,
  backgroundAudioPath: './sample-background-mulaw-8khz.raw' // included in repo
});

// Set up and connect
const session = new RealtimeSession(agent, { transport });
transport.setupSessionListeners(session);
await session.connect({ apiKey: process.env.OPENAI_API_KEY });
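Here connection is the WebSocket that Twilio opens against your server’s Media Streams endpoint (the url in your TwiML <Stream> verb). A minimal wiring sketch with the ws package, as an assumed example:

import { WebSocketServer } from 'ws';

// Accept Twilio's Media Streams connection on /media-stream.
const wss = new WebSocketServer({ port: 8080, path: '/media-stream' });
wss.on('connection', async (connection) => {
  // Construct the agent, transport, and session exactly as above,
  // passing this connection as twilioWebSocket.
});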
