ElevenLabs WebSocket API Documentation: WSS Text-to-Speech Integration

In this comprehensive guide, we explore the ElevenLabs WebSocket API for Text-to-Speech (TTS), providing developers with the crucial details needed to integrate and optimize streaming audio generation. Our goal is to deliver a detailed reference for leveraging wss://api.elevenlabs.io/v1/text-to-speech/:voice_id/stream-input effectively.

What is ElevenLabs WebSocket API?

The ElevenLabs WebSocket API is a powerful interface designed for real-time audio generation from streamed or chunked text input. Unlike traditional HTTP endpoints, this API enables dynamic text input processing and offers word-to-audio alignment data, making it ideal for live applications, interactive systems, and advanced speech synthesis platforms.

WebSocket Connection and Handshake

To initiate a WebSocket handshake, connect to:

wss://api.elevenlabs.io/v1/text-to-speech/:voice_id/stream-input
  • Method: GET
  • Status: 101 Switching Protocols
  • Required Path Parameter:
    • voice_id — Unique ID for the voice model.

Required Headers

  • xi-api-key: Your API key for authentication.

Optional Query Parameters

  • authorization — Bearer token if needed.
  • model_id — Specifies the model.
  • language_code — ISO 639-1 code for the language.
  • enable_logging — Default true; enable/disable request logging.
  • enable_ssml_parsing — Default false; control SSML parsing.
  • output_format — Output audio format (e.g., mp3_44100_128, pcm_16000).
  • inactivity_timeout — Default 20; max 180 seconds.
  • sync_alignment — Default false; include timing data with chunks.
  • auto_mode — Default false; when enabled, disables chunk buffering to reduce latency.
  • apply_text_normalization — auto, on, or off.
  • seed — Integer 0-4294967295; for deterministic output.
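Since the query parameters above are appended to the connection URL, a small helper can assemble the endpoint. This is an illustrative sketch, not part of any official SDK: the helper name and the example parameter values are assumptions; only the base URL and parameter names come from the documentation above.

```python
from urllib.parse import urlencode

BASE = "wss://api.elevenlabs.io/v1/text-to-speech"

def build_stream_url(voice_id, **params):
    """Return the stream-input WebSocket URL for a voice,
    with any optional query parameters appended."""
    url = f"{BASE}/{voice_id}/stream-input"
    if params:
        url += "?" + urlencode(params)
    return url

# Example (placeholder voice_id; model_id value is illustrative):
url = build_stream_url("your_voice_id",
                       model_id="eleven_turbo_v2",
                       sync_alignment="true")
```

The resulting URL is what you pass to your WebSocket client, together with the xi-api-key header.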

Sending Messages Over WebSocket

Initialize Connection

The first message initializes the connection:

{
  "text": " ",
  "voice_settings": {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "speed": 1
  },
  "xi_api_key": "<YOUR_API_KEY>"
}
  • text: Required — must be a single space (" ") in this first message.
  • voice_settings: Optional; adjust stability, similarity_boost, speed, style, use_speaker_boost.
  • generation_config: Optional; includes chunk_length_schedule.
  • pronunciation_dictionary_locators: Optional; custom pronunciation dictionary references.
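The initialization message above is plain JSON, so constructing it programmatically is straightforward. A minimal sketch, assuming Python's standard json module; the helper name and default values mirror the example, nothing more:

```python
import json

def make_init_message(api_key, stability=0.5, similarity_boost=0.75, speed=1):
    """Build the first message of a session: a single-space text field
    plus voice settings and the API key, as shown above."""
    return json.dumps({
        "text": " ",  # the protocol requires exactly one space here
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
            "speed": speed,
        },
        "xi_api_key": api_key,
    })
```

Send this string as the first WebSocket frame after the handshake completes.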

Send Text

Send subsequent text:

{
  "text": "Your text chunk here ",
  "try_trigger_generation": false
}
  • text: Required — Must end with a space.
  • try_trigger_generation: Optional; attempt immediate audio generation.
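Because every text chunk must end with a space, it is worth enforcing that invariant in code rather than relying on callers to remember it. A small hypothetical helper:

```python
import json

def make_text_message(chunk, try_trigger_generation=False):
    """Wrap a text chunk in the send-text message format,
    guaranteeing the required trailing space."""
    if not chunk.endswith(" "):
        chunk += " "
    return json.dumps({
        "text": chunk,
        "try_trigger_generation": try_trigger_generation,
    })
```

Forgetting the trailing space can cause the server to treat two chunks as one run-on word, so centralizing the check avoids subtle audio artifacts.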

Flush Buffer

Force immediate generation:

{
  "flush": true
}
  • Ensures all buffered text generates audio output.

Close Connection

End the session:

{
  "text": ""
}
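Putting the message sequence together, here is a hedged end-to-end sketch. The message shapes (single-space init, space-terminated chunks, empty-text close, base64 audio replies) come from this document; the third-party `websockets` package, the helper names, and the read-until-isFinal loop are assumptions, not an official client.

```python
import base64
import json

def protocol_messages(api_key, chunks):
    """Build the full outgoing sequence: init (single space),
    text chunks (each ending in a space), then the empty-text close."""
    msgs = [json.dumps({"text": " ", "xi_api_key": api_key})]
    for c in chunks:
        msgs.append(json.dumps({"text": c if c.endswith(" ") else c + " "}))
    msgs.append(json.dumps({"text": ""}))  # close the session
    return msgs

async def stream_tts(url, api_key, chunks):
    """Send the sequence and collect decoded audio until isFinal.

    Network sketch only: assumes `pip install websockets` and a valid key.
    """
    import websockets  # third-party; assumption, not stdlib
    audio = bytearray()
    async with websockets.connect(url) as ws:
        for msg in protocol_messages(api_key, chunks):
            await ws.send(msg)
        async for raw in ws:
            reply = json.loads(raw)
            if reply.get("audio"):
                audio += base64.b64decode(reply["audio"])
            if reply.get("isFinal"):
                break
    return bytes(audio)
```

Run it with `asyncio.run(stream_tts(url, key, ["Hello world"]))`; the returned bytes are in whatever output_format the connection requested.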

Audio Output Structure

The API returns:

{
  "audio": "<base64_encoded_audio>",
  "isFinal": false,
  "normalizedAlignment": {
    "charStartTimesMs": [...],
    "charsDurationsMs": [...],
    "chars": [...]
  },
  "alignment": {
    "charStartTimesMs": [...],
    "charsDurationsMs": [...],
    "chars": [...]
  }
}
  • audio: Partial audio chunk, base64-encoded.
  • isFinal: true when the final chunk is sent.
  • normalizedAlignment: Timing info for normalized text.
  • alignment: Timing info for raw input text.
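On the receiving side, each message needs its audio field base64-decoded and its isFinal flag checked. A minimal sketch of that handling, exercised here against a synthetic message (the payload below is fabricated for illustration, not real API output):

```python
import base64
import json

def handle_message(raw, audio_buffer):
    """Decode one server message: append its audio bytes to the buffer,
    and return (is_final, alignment) for the caller."""
    msg = json.loads(raw)
    if msg.get("audio"):
        audio_buffer += base64.b64decode(msg["audio"])
    return bool(msg.get("isFinal")), msg.get("alignment")

# Synthetic example message, shaped like the structure documented above:
raw = json.dumps({
    "audio": base64.b64encode(b"\x00\x01").decode(),
    "isFinal": False,
    "alignment": {"chars": ["H", "i"],
                  "charStartTimesMs": [0, 80],
                  "charsDurationsMs": [80, 90]},
})
buf = bytearray()
done, alignment = handle_message(raw, buf)
```

The three alignment arrays are parallel: chars[i] starts at charStartTimesMs[i] and lasts charsDurationsMs[i], which is what makes subtitle or karaoke-style highlighting possible.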

Chunk Length Scheduling

The chunk_length_schedule dictates when audio is generated based on buffer size:

[120, 160, 250, 290]
  • First threshold: audio is generated once 120 characters are buffered
  • Second: after a further 160 characters
  • Third: after a further 250
  • Fourth onward: every additional 290 characters

Adjustable range: 50 - 500
Tip: Lower values reduce latency but may affect quality.
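To build intuition for the trade-off, the threshold behavior can be simulated locally. This is a rough client-side model under the assumption that each threshold fires once the buffer reaches it and then resets; the real buffering logic runs server-side and may differ in detail:

```python
def generation_points(text_lengths, schedule=(120, 160, 250, 290)):
    """Simulate which incoming chunks would cross the next
    chunk_length_schedule threshold and trigger generation."""
    buffered, step, fired = 0, 0, []
    for n in text_lengths:
        buffered += n
        # The last schedule value repeats for all later chunks.
        threshold = schedule[min(step, len(schedule) - 1)]
        if buffered >= threshold:
            fired.append(True)
            buffered = 0
            step += 1
        else:
            fired.append(False)
    return fired
```

Feeding it chunk sizes shows why smaller schedule values mean faster first audio: with the default schedule, two 60-character chunks only trigger on the second, whereas a schedule starting at 50 would fire immediately.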

Advanced Features

Pronunciation Dictionaries

Enhance clarity using:

{
  "pronunciation_dictionary_locators": [
    {
      "pronunciation_dictionary_id": "dict_id",
      "version_id": "version_id"
    }
  ]
}

Ensure this is part of the initial message.

Text Normalization

Control how numbers, abbreviations, and special text are spoken:

  • auto (default)
  • on
  • off

Use Cases for ElevenLabs WebSocket TTS

  • Live transcription and playback
  • Interactive voice agents
  • Dynamic narration tools
  • Educational apps requiring word-level timing
  • Games and virtual environments needing real-time speech

Best Practices for Using ElevenLabs WebSocket API

  • Start your connection with a message containing only a single space (" ").
  • End text chunks with a space to signal boundaries.
  • Tune chunk_length_schedule for latency vs. quality trade-offs.
  • Utilize flush to force generation without closing the connection.
  • Incorporate alignment data to synchronize visuals or subtitles with speech.
  • Use auto_mode where minimal latency is critical.
  • Test various voice settings to match your brand’s tone.

The ElevenLabs WebSocket API provides unmatched flexibility for generating high-quality streaming speech from dynamic text input. Its advanced buffering system, configurable parameters, and precise alignment data make it the ideal solution for developers building the next generation of voice-powered applications.