Streaming Text-to-Speech

Convert text to natural-sounding Cantonese speech with low latency. Unlike the standard Text-to-Speech endpoint — which returns one complete audio file — this endpoint streams audio as it is generated, so playback can begin after the first chunk (typically a few hundred milliseconds) instead of waiting for the whole clip.

POST https://cantonese.ai/api/tts-stream

Request Body

Send a JSON body with the following parameters:

api_key (string, required)

Your API key for authentication.

text (string, required)

Text to convert to speech. The audio is streamed back as it is generated.

voice_id (string, required)

Identifier of the voice whose timbre is cloned. Pick any voice from the voice library; its reference sample determines the speaker.

model_id (string, optional, default: "6")

Streaming TTS model version. Currently "6" (V6).

language (string, optional, default: "Chinese")

Language of the text. "Chinese" synthesizes Cantonese; "English" synthesizes English.

temperature (number, optional, default: 0.9)

Sampling temperature. Lower is more deterministic.

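top_p (number, optional, default: 0.95)

Nucleus (top-p) sampling cutoff. Sampling is restricted to the smallest set of candidates whose cumulative probability reaches this value.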
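The parameter list above can be turned into a small request builder. The following is a minimal sketch, not an official client: the function name `build_tts_request` and the client-side validation are assumptions for illustration, while the field names and defaults come from the list above.

```python
import json

# Defaults as documented in the parameter list above.
DEFAULTS = {
    "model_id": "6",
    "language": "Chinese",
    "temperature": 0.9,
    "top_p": 0.95,
}


def build_tts_request(api_key, text, voice_id, **overrides):
    """Return the JSON request body as UTF-8 bytes (hypothetical helper)."""
    required = {"api_key": api_key, "text": text, "voice_id": voice_id}
    # Reject missing or empty required fields before sending anything.
    for name, value in required.items():
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} must be a non-empty string")
    # Only the documented optional parameters may be overridden.
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    body = {**required, **DEFAULTS, **overrides}
    # ensure_ascii=False keeps Cantonese text readable in the payload.
    return json.dumps(body, ensure_ascii=False).encode("utf-8")
```

Sending the defaults explicitly is a design choice made here for clarity; omitting them should be equivalent, since the server applies the same defaults.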

Example Request

The response body is a chunked stream of raw little-endian 16-bit PCM samples (mono). Read it incrementally and feed each chunk to your audio player for live playback.

# Streams raw 16-bit PCM (24kHz, mono) to out.pcm as it is generated.
curl -N -X POST https://cantonese.ai/api/tts-stream \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "YOUR_API_KEY",
    "text": "歡迎使用粵語人工智能嘅即時語音合成示範，呢段聲音係一邊生成一邊播放嘅。",
    "voice_id": "YOUR_VOICE_ID"
  }' \
  --output out.pcm

# Convert the raw PCM to a WAV file (s16le, 24kHz, mono):
ffmpeg -f s16le -ar 24000 -ac 1 -i out.pcm out.wav
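The same request can be read incrementally from Python. The sketch below uses only the standard library and is an illustration under assumptions: `stream_tts` and `aligned_chunks` are hypothetical names, and `body` is expected to be the JSON request already encoded as bytes. Because network chunk boundaries need not fall on 16-bit sample boundaries, `aligned_chunks` re-chunks the stream so that each piece handed to a player contains whole samples.

```python
import urllib.request


def aligned_chunks(chunks, sample_width=2):
    """Re-chunk a byte stream so every yielded piece holds whole samples."""
    carry = b""
    for chunk in chunks:
        data = carry + chunk
        usable = len(data) - (len(data) % sample_width)
        if usable:
            yield data[:usable]
        # Keep the leftover byte(s) for the next chunk.
        carry = data[usable:]
    # A trailing partial sample, if any, is dropped.


def stream_tts(body, url="https://cantonese.ai/api/tts-stream", read_size=4096):
    """Yield raw PCM bytes as they arrive (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        while True:
            chunk = response.read(read_size)
            if not chunk:
                break
            yield chunk
```

A caller would typically iterate `aligned_chunks(stream_tts(body))` and pass each piece to an audio output. A smaller `read_size` trades more reads for lower playback latency.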

Response

The response is not a JSON object or a complete audio file — it is a raw PCM byte stream delivered with Transfer-Encoding: chunked. Audio metadata is returned in the response headers:

Header	Description
Content-Type	application/octet-stream
X-Sample-Rate	Sample rate of the PCM stream in Hz (24000)
X-Audio-Format	PCM encoding of the stream (pcm_s16le)
X-Channels	Number of audio channels (1 — mono)
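A client can read these headers instead of hard-coding the format. The following is a sketch under assumptions: `PcmFormat` and `parse_audio_headers` are hypothetical names, the header values are assumed to be plain strings such as "24000" and "1", and the documented values are used as fallbacks when a header is absent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PcmFormat:
    sample_rate: int = 24000
    channels: int = 1
    encoding: str = "pcm_s16le"

    @property
    def bytes_per_second(self):
        # 16-bit samples occupy 2 bytes each.
        return self.sample_rate * self.channels * 2


def parse_audio_headers(headers):
    """Build a PcmFormat from a mapping of response headers."""
    # HTTP header names are case-insensitive, so normalize them first.
    lower = {name.lower(): value for name, value in headers.items()}
    fmt = PcmFormat(
        sample_rate=int(lower.get("x-sample-rate", 24000)),
        channels=int(lower.get("x-channels", 1)),
        encoding=lower.get("x-audio-format", "pcm_s16le"),
    )
    if fmt.encoding != "pcm_s16le":
        raise ValueError(f"unsupported encoding: {fmt.encoding}")
    return fmt
```

At the documented format this gives 48,000 bytes per second of audio, which is useful for estimating clip duration from the number of bytes received.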

Format: 16-bit signed little-endian PCM, 24,000 Hz, mono. To turn the raw stream into a playable file, wrap it in a WAV container (see the cURL example) or decode each chunk in the browser with the Web Audio API (see the JavaScript example).
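As an alternative to ffmpeg, the WAV wrapping can be done with Python's standard `wave` module. This is a minimal sketch: `pcm_to_wav_bytes` is a hypothetical helper name, and its defaults mirror the format stated above.

```python
import io
import wave


def pcm_to_wav_bytes(pcm, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container and return the file contents."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)      # 1 = mono
        wav.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)   # 24000 Hz
        wav.writeframes(pcm)
    # The wave module does not close a file object it did not open itself.
    return buffer.getvalue()
```

For example, after collecting the streamed chunks into a single bytes object, writing the result of `pcm_to_wav_bytes(pcm)` to `out.wav` should produce a file that ordinary audio players can open.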

Status Codes

The API returns standard HTTP status codes to indicate the success or failure of requests.

Status	Description
200	Success — audio is streamed as raw PCM in the response body
400	Bad Request — missing/empty text or voice_id
401	Unauthorized — invalid or missing API key
404	Not Found — no reference audio for the given voice_id
429	Too Many Requests — rate limit exceeded
502	Bad Gateway — the streaming model server was unreachable
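One way a client might act on these codes is sketched below. The retry policy is an assumption rather than documented behavior: 429 and 502 are treated as transient and worth retrying with exponential backoff, while 400, 401, and 404 are treated as errors that a retry will not fix. The names `classify_status` and `backoff_delays` are hypothetical.

```python
# Assumption: rate limiting and gateway errors are transient.
RETRYABLE = {429, 502}


def classify_status(status):
    """Return "ok", "retry", or "fail" for an HTTP status code."""
    if status == 200:
        return "ok"
    if status in RETRYABLE:
        return "retry"
    # 400, 401, 404 and anything unexpected need a corrected request.
    return "fail"


def backoff_delays(attempts=4, base=0.5, cap=8.0):
    """Exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```

A caller could sleep for each delay in turn between attempts while `classify_status` returns "retry", and stop immediately on "fail".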