ElevenLabs WebSocket API Documentation: WSS Text-to-Speech Integration

In this comprehensive guide, we explore the ElevenLabs WebSocket API for Text-to-Speech (TTS), providing developers with the crucial details needed to integrate and optimize streaming audio generation. Our goal is to deliver a detailed reference for leveraging wss://api.elevenlabs.io/v1/text-to-speech/:voice_id/stream-input effectively.

What is ElevenLabs WebSocket API?

The ElevenLabs WebSocket API is a powerful interface designed for real-time audio generation from streamed or chunked text input. Unlike traditional HTTP endpoints, this API enables dynamic text input processing and offers word-to-audio alignment data, making it ideal for live applications, interactive systems, and advanced speech synthesis platforms.

WebSocket Connection and Handshake

To initiate a WebSocket handshake, connect to:

wss://api.elevenlabs.io/v1/text-to-speech/:voice_id/stream-input
  • Method: GET
  • Status: 101 Switching Protocols
  • Required Path Parameter:
    • voice_id — Unique ID for the voice model.

Required Headers

  • xi-api-key: Your API key for authentication.

Optional Query Parameters

  • authorization — Bearer token if needed.
  • model_id — Specifies the model.
  • language_code — ISO 639-1 code for the language.
  • enable_logging — Default true; enable/disable request logging.
  • enable_ssml_parsing — Default false; control SSML parsing.
  • output_format — Output audio format (e.g., mp3_44100_128, pcm_16000).
  • inactivity_timeout — Default 20; max 180 seconds.
  • sync_alignment — Default false; include timing data with chunks.
  • auto_mode — Default false; when enabled, disables chunk buffering to reduce latency.
  • apply_text_normalization — auto, on, or off.
  • seed — Integer 0-4294967295; for deterministic output.
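Since the query parameters above are appended to the connection URL, a small helper can assemble the endpoint. This is an illustrative sketch, not part of any official SDK: the helper name and the example parameter values are assumptions; only the base URL and parameter names come from the documentation above.

```python
from urllib.parse import urlencode

BASE = "wss://api.elevenlabs.io/v1/text-to-speech"

def build_stream_url(voice_id, **params):
    """Return the stream-input WebSocket URL for a voice,
    with any optional query parameters appended."""
    url = f"{BASE}/{voice_id}/stream-input"
    if params:
        url += "?" + urlencode(params)
    return url

# Example (placeholder voice_id; model_id value is illustrative):
url = build_stream_url("your_voice_id",
                       model_id="eleven_turbo_v2",
                       sync_alignment="true")
```

The resulting URL is what you pass to your WebSocket client, together with the xi-api-key header.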

Sending Messages Over WebSocket

Initialize Connection

The first message initializes the connection:

{
  "text": " ",
  "voice_settings": {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "speed": 1
  },
  "xi_api_key": "<YOUR_API_KEY>"
}
  • text: Required — must be a single space (" ") in this first message.
  • voice_settings: Optional; adjust stability, similarity_boost, speed, style, use_speaker_boost.
  • generation_config: Optional; includes chunk_length_schedule.
  • pronunciation_dictionary_locators: Optional; custom pronunciation dictionary references.
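The initialization message above is plain JSON, so constructing it programmatically is straightforward. A minimal sketch, assuming Python's standard json module; the helper name and default values mirror the example, nothing more:

```python
import json

def make_init_message(api_key, stability=0.5, similarity_boost=0.75, speed=1):
    """Build the first message of a session: a single-space text field
    plus voice settings and the API key, as shown above."""
    return json.dumps({
        "text": " ",  # the protocol requires exactly one space here
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
            "speed": speed,
        },
        "xi_api_key": api_key,
    })
```

Send this string as the first WebSocket frame after the handshake completes.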

Send Text

Send subsequent text:

{
  "text": "Your text chunk here ",
  "try_trigger_generation": false
}
  • text: Required — Must end with a space.
  • try_trigger_generation: Optional; attempt immediate audio generation.
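Because every text chunk must end with a space, it is worth enforcing that invariant in code rather than relying on callers to remember it. A small hypothetical helper:

```python
import json

def make_text_message(chunk, try_trigger_generation=False):
    """Wrap a text chunk in the send-text message format,
    guaranteeing the required trailing space."""
    if not chunk.endswith(" "):
        chunk += " "
    return json.dumps({
        "text": chunk,
        "try_trigger_generation": try_trigger_generation,
    })
```

Forgetting the trailing space can cause the server to treat two chunks as one run-on word, so centralizing the check avoids subtle audio artifacts.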

Flush Buffer

Force immediate generation:

{
  "flush": true
}
  • Ensures all buffered text generates audio output.

Close Connection

End the session:

{
  "text": ""
}
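Putting the message sequence together, here is a hedged end-to-end sketch. The message shapes (single-space init, space-terminated chunks, empty-text close, base64 audio replies) come from this document; the third-party `websockets` package, the helper names, and the read-until-isFinal loop are assumptions, not an official client.

```python
import base64
import json

def protocol_messages(api_key, chunks):
    """Build the full outgoing sequence: init (single space),
    text chunks (each ending in a space), then the empty-text close."""
    msgs = [json.dumps({"text": " ", "xi_api_key": api_key})]
    for c in chunks:
        msgs.append(json.dumps({"text": c if c.endswith(" ") else c + " "}))
    msgs.append(json.dumps({"text": ""}))  # close the session
    return msgs

async def stream_tts(url, api_key, chunks):
    """Send the sequence and collect decoded audio until isFinal.

    Network sketch only: assumes `pip install websockets` and a valid key.
    """
    import websockets  # third-party; assumption, not stdlib
    audio = bytearray()
    async with websockets.connect(url) as ws:
        for msg in protocol_messages(api_key, chunks):
            await ws.send(msg)
        async for raw in ws:
            reply = json.loads(raw)
            if reply.get("audio"):
                audio += base64.b64decode(reply["audio"])
            if reply.get("isFinal"):
                break
    return bytes(audio)
```

Run it with `asyncio.run(stream_tts(url, key, ["Hello world"]))`; the returned bytes are in whatever output_format the connection requested.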

Audio Output Structure

The API returns:

{
  "audio": "<base64_encoded_audio>",
  "isFinal": false,
  "normalizedAlignment": {
    "charStartTimesMs": [...],
    "charsDurationsMs": [...],
    "chars": [...]
  },
  "alignment": {
    "charStartTimesMs": [...],
    "charsDurationsMs": [...],
    "chars": [...]
  }
}
  • audio: Partial audio chunk, base64-encoded.
  • isFinal: true when the final chunk is sent.
  • normalizedAlignment: Timing info for normalized text.
  • alignment: Timing info for raw input text.
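On the receiving side, each message needs its audio field base64-decoded and its isFinal flag checked. A minimal sketch of that handling, exercised here against a synthetic message (the payload below is fabricated for illustration, not real API output):

```python
import base64
import json

def handle_message(raw, audio_buffer):
    """Decode one server message: append its audio bytes to the buffer,
    and return (is_final, alignment) for the caller."""
    msg = json.loads(raw)
    if msg.get("audio"):
        audio_buffer += base64.b64decode(msg["audio"])
    return bool(msg.get("isFinal")), msg.get("alignment")

# Synthetic example message, shaped like the structure documented above:
raw = json.dumps({
    "audio": base64.b64encode(b"\x00\x01").decode(),
    "isFinal": False,
    "alignment": {"chars": ["H", "i"],
                  "charStartTimesMs": [0, 80],
                  "charsDurationsMs": [80, 90]},
})
buf = bytearray()
done, alignment = handle_message(raw, buf)
```

The three alignment arrays are parallel: chars[i] starts at charStartTimesMs[i] and lasts charsDurationsMs[i], which is what makes subtitle or karaoke-style highlighting possible.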

Chunk Length Scheduling

The chunk_length_schedule dictates when audio is generated based on buffer size:

[120, 160, 250, 290]
  • First threshold: audio is generated once 120 characters are buffered
  • Second: after a further 160 characters
  • Third: after a further 250
  • Fourth onward: every additional 290 characters

Adjustable range: 50 - 500
Tip: Lower values reduce latency but may affect quality.
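To build intuition for the trade-off, the threshold behavior can be simulated locally. This is a rough client-side model under the assumption that each threshold fires once the buffer reaches it and then resets; the real buffering logic runs server-side and may differ in detail:

```python
def generation_points(text_lengths, schedule=(120, 160, 250, 290)):
    """Simulate which incoming chunks would cross the next
    chunk_length_schedule threshold and trigger generation."""
    buffered, step, fired = 0, 0, []
    for n in text_lengths:
        buffered += n
        # The last schedule value repeats for all later chunks.
        threshold = schedule[min(step, len(schedule) - 1)]
        if buffered >= threshold:
            fired.append(True)
            buffered = 0
            step += 1
        else:
            fired.append(False)
    return fired
```

Feeding it chunk sizes shows why smaller schedule values mean faster first audio: with the default schedule, two 60-character chunks only trigger on the second, whereas a schedule starting at 50 would fire immediately.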

Advanced Features

Pronunciation Dictionaries

Enhance clarity using:

{
  "pronunciation_dictionary_locators": [
    {
      "pronunciation_dictionary_id": "dict_id",
      "version_id": "version_id"
    }
  ]
}

Ensure this is part of the initial message.

Text Normalization

Control how numbers, abbreviations, and special text are spoken:

  • auto (default)
  • on
  • off

Use Cases for ElevenLabs WebSocket TTS

  • Live transcription and playback
  • Interactive voice agents
  • Dynamic narration tools
  • Educational apps requiring word-level timing
  • Games and virtual environments needing real-time speech

Best Practices for Using ElevenLabs WebSocket API

  • Start your connection with a message containing only a single space (" ").
  • End text chunks with a space to signal boundaries.
  • Tune chunk_length_schedule for latency vs. quality trade-offs.
  • Utilize flush to force generation without closing the connection.
  • Incorporate alignment data to synchronize visuals or subtitles with speech.
  • Use auto_mode where minimal latency is critical.
  • Test various voice settings to match your brand’s tone.

The ElevenLabs WebSocket API provides unmatched flexibility for generating high-quality streaming speech from dynamic text input. Its advanced buffering system, configurable parameters, and precise alignment data make it the ideal solution for developers building the next generation of voice-powered applications.