ElevenLabs WebSocket API Documentation: WSS Text-to-Speech Integration
In this guide, we explore the ElevenLabs WebSocket API for Text-to-Speech (TTS), covering the details developers need to integrate and optimize streaming audio generation over wss://api.elevenlabs.io/v1/text-to-speech/:voice_id/stream-input.
What is ElevenLabs WebSocket API?
The ElevenLabs WebSocket API is a powerful interface designed for real-time audio generation from streamed or chunked text input. Unlike traditional HTTP endpoints, this API enables dynamic text input processing and offers word-to-audio alignment data, making it ideal for live applications, interactive systems, and advanced speech synthesis platforms.
WebSocket Connection and Handshake
To initiate a WebSocket handshake, connect to:
```
wss://api.elevenlabs.io/v1/text-to-speech/:voice_id/stream-input
```
- Method: GET
- Status: 101 Switching Protocols
- Required path parameter: voice_id, the unique ID for the voice model.
Required Headers
- xi-api-key: Your API key for authentication.
Optional Query Parameters
- authorization: Bearer token if needed.
- model_id: Specifies the model.
- language_code: ISO 639-1 code for the language.
- enable_logging: Default true; enable/disable request logging.
- enable_ssml_parsing: Default false; control SSML parsing.
- output_format: Select the audio format (e.g., MP3, PCM).
- inactivity_timeout: Default 20 seconds; maximum 180 seconds.
- sync_alignment: Default false; include timing data with chunks.
- auto_mode: Default false; disables buffering to reduce latency.
- apply_text_normalization: auto, on, or off.
- seed: Integer 0-4294967295; for deterministic output.
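As a minimal sketch (not an official client), the handshake can be opened from Python with the third-party websockets package; the voice ID, API key, and query-parameter values below are placeholders.

```python
# Minimal sketch of the handshake using the asyncio `websockets` package
# (pip install websockets). Voice ID, API key, and parameter values are placeholders.
import asyncio
import websockets

VOICE_ID = "<voice_id>"
API_KEY = "<YOUR_API_KEY>"

URI = (
    f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream-input"
    "?sync_alignment=true"      # request timing data with each audio chunk
    "&inactivity_timeout=60"    # keep the socket open longer between messages
)

async def main():
    # The xi-api-key header authenticates the handshake; depending on the
    # `websockets` release the keyword is `extra_headers` or `additional_headers`.
    async with websockets.connect(URI, extra_headers={"xi-api-key": API_KEY}) as ws:
        # 101 Switching Protocols has been received at this point; the
        # initialization, text, flush, and close messages shown later are
        # all sent over this single connection.
        ...

asyncio.run(main())
```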
Sending Messages Over WebSocket
Initialize Connection
The first message initializes the connection:
```json
{
  "text": " ",
  "voice_settings": {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "speed": 1
  },
  "xi_api_key": "<YOUR_API_KEY>"
}
```
- text: Required; must start with a single space.
- voice_settings: Optional; adjusts stability, similarity_boost, speed, style, and use_speaker_boost.
- generation_config: Optional; includes chunk_length_schedule.
- pronunciation_dictionary_locators: Optional; custom pronunciation dictionary references.
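For illustration, here is how that first message could be sent over an already-open connection ws (from the connection sketch above); the values are the example settings shown in the JSON, not recommendations.

```python
# Sketch: sending the initialization message over an open connection `ws`.
import json

async def send_init(ws, api_key):
    init_message = {
        "text": " ",  # the first message carries a single space
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.75,
            "speed": 1,
        },
        # Optional if the xi-api-key header was already sent on the handshake.
        "xi_api_key": api_key,
    }
    await ws.send(json.dumps(init_message))
```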
Send Text
Send subsequent text:
```json
{
  "text": "Your text chunk here ",
  "try_trigger_generation": false
}
```
- text: Required; must end with a space.
- try_trigger_generation: Optional; attempt immediate audio generation.
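A small sketch of streaming chunks over the same connection, making sure each chunk ends with a space:

```python
# Sketch: streaming text chunks over the open connection `ws`.
import json

async def send_text(ws, chunks):
    for chunk in chunks:
        await ws.send(json.dumps({
            "text": chunk.rstrip() + " ",     # every chunk must end with a space
            "try_trigger_generation": False,  # let the buffering schedule decide
        }))
```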
Flush Buffer
Force immediate generation:
```json
{
  "flush": true
}
```
- Ensures all buffered text generates audio output.
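As a sketch, the flush message is a one-liner on the same connection:

```python
# Sketch: force generation of any buffered text on connection `ws`.
import json

async def flush(ws):
    await ws.send(json.dumps({"flush": True}))
```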
Close Connection
End the session:
```json
{
  "text": ""
}
```
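Closing can be sketched the same way; sending the empty text field tells the server no more input is coming:

```python
# Sketch: signal end of input on connection `ws`; the server finishes
# generating the remaining audio and then closes the stream.
import json

async def close_stream(ws):
    await ws.send(json.dumps({"text": ""}))
```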
Audio Output Structure
The API returns:
```json
{
  "audio": "<base64_encoded_audio>",
  "isFinal": false,
  "normalizedAlignment": {
    "charStartTimesMs": [...],
    "charsDurationsMs": [...],
    "chars": [...]
  },
  "alignment": {
    "charStartTimesMs": [...],
    "charsDurationsMs": [...],
    "chars": [...]
  }
}
```
- audio: Partial audio chunk, base64-encoded.
- isFinal: true when the final chunk is sent.
- normalizedAlignment: Timing info for the normalized text.
- alignment: Timing info for the raw input text.
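As an illustrative sketch, a receive loop on the same connection can decode each chunk and stop once isFinal arrives; writing the raw bytes straight to a file assumes an MP3 output_format was selected.

```python
# Sketch: reading audio chunks from connection `ws` and appending the
# decoded bytes to a file. Assumes an MP3 output_format was requested.
import base64
import json

async def receive_audio(ws, out_path="output.mp3"):
    with open(out_path, "ab") as out:
        async for raw in ws:
            message = json.loads(raw)
            if message.get("audio"):
                out.write(base64.b64decode(message["audio"]))
            if message.get("alignment"):
                # charStartTimesMs / charsDurationsMs / chars can drive
                # subtitles or word-level highlighting.
                pass
            if message.get("isFinal"):
                break
```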
Chunk Length Scheduling
The chunk_length_schedule dictates when audio is generated based on buffer size:
```json
[120, 160, 250, 290]
```
- First chunk: 120 characters
- Second: Additional 160
- Third: Additional 250
- Fourth onward: 290 per chunk
Adjustable range: 50 - 500
Tip: Lower values reduce latency but may affect quality.
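For example, a custom schedule is passed inside generation_config on the first message; the values below are only an illustration within the allowed 50-500 range.

```python
# Sketch: supplying a custom chunk_length_schedule in the initialization message.
import json

async def send_init_with_schedule(ws):
    await ws.send(json.dumps({
        "text": " ",
        "generation_config": {
            # generate after ~90 buffered characters, then progressively larger chunks
            "chunk_length_schedule": [90, 120, 200, 260],
        },
    }))
```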
Advanced Features
Pronunciation Dictionaries
Enhance clarity using:
```json
{
  "pronunciation_dictionary_locators": [
    {
      "pronunciation_dictionary_id": "dict_id",
      "version_id": "version_id"
    }
  ]
}
```
Ensure this is part of the initial message.
Text Normalization
Control how numbers, abbreviations, and special text are spoken:
- auto (default)
- on
- off
Use Cases for ElevenLabs WebSocket TTS
- Live transcription and playback
- Interactive voice agents
- Dynamic narration tools
- Educational apps requiring word-level timing
- Games and virtual environments needing real-time speech
Best Practices for Using ElevenLabs WebSocket API
- Start your connection with a blank space (" ").
- End text chunks with a space to signal boundaries.
- Tune chunk_length_schedule for latency vs. quality trade-offs.
- Utilize flush to force generation without closing the connection.
- Incorporate alignment data to synchronize visuals or subtitles with speech.
- Use auto_mode where minimal latency is critical.
- Test various voice settings to match your brand’s tone.
The ElevenLabs WebSocket API provides unmatched flexibility for generating high-quality streaming speech from dynamic text input. Its advanced buffering system, configurable parameters, and precise alignment data make it the ideal solution for developers building the next generation of voice-powered applications.