Streaming Text-to-Speech
Convert text to natural-sounding Cantonese speech with low latency. Unlike the standard Text-to-Speech endpoint — which returns one complete audio file — this endpoint streams audio as it is generated, so playback can begin after the first chunk (typically a few hundred milliseconds) instead of waiting for the whole clip.
Request Body
Send a JSON body with the following parameters:
api_key: Your API key for authentication.
text: Text to convert to speech. The audio is streamed back as it is generated.
voice_id: Identifier of the voice whose timbre to clone. Pick any voice from the voice library; its reference sample sets the speaker.
Streaming TTS model version. Currently "6" (V6).
Language of the text. "Chinese" synthesizes Cantonese speech; "English" synthesizes English speech.
Sampling temperature. Lower is more deterministic.
Nucleus sampling cutoff.
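As a rough illustration of assembling this body in Python, the sketch below builds the POST request without sending it. The field names `api_key`, `text`, and `voice_id` and the URL are taken from the cURL example on this page; the helper name `build_tts_request` is hypothetical, and the model, language, and sampling parameters are left out of the sketch.

```python
import json
import urllib.request

# Endpoint as it appears in the cURL example on this page.
TTS_STREAM_URL = "https://cantonese.ai/api/tts-stream"


def build_tts_request(api_key: str, text: str, voice_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the streaming endpoint.

    Hypothetical helper, not part of any official SDK.
    """
    # The status-code table lists a missing or empty text/voice_id as a 400,
    # so fail fast locally instead of spending a round trip.
    if not text.strip():
        raise ValueError("text must be non-empty")
    if not voice_id.strip():
        raise ValueError("voice_id must be non-empty")

    payload = {"api_key": api_key, "text": text, "voice_id": voice_id}
    # ensure_ascii=False keeps Cantonese characters readable in the body;
    # either form is valid JSON once encoded as UTF-8.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        TTS_STREAM_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; keeping construction separate makes the body easy to inspect or log before any network call.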
Example Request
The response body is a chunked stream of raw little-endian 16-bit PCM samples (mono). Read it incrementally and feed each chunk to your audio player for live playback.
# Streams raw 16-bit PCM (24kHz, mono) to out.pcm as it is generated.
curl -N -X POST https://cantonese.ai/api/tts-stream \
-H "Content-Type: application/json" \
-d '{
"api_key": "YOUR_API_KEY",
"text": "歡迎使用粵語人工智能嘅即時語音合成示範,呢段聲音係一邊生成一邊播放嘅。",
"voice_id": "YOUR_VOICE_ID"
}' \
--output out.pcm
# Convert the raw PCM to a WAV file (s16le, 24kHz, mono):
ffmpeg -f s16le -ar 24000 -ac 1 -i out.pcm out.wav
Response
The response is not a JSON object or a complete audio file — it is a raw PCM byte stream delivered with Transfer-Encoding: chunked. Audio metadata is returned in the response headers:
| Header | Description |
|---|---|
| Content-Type | application/octet-stream |
| X-Sample-Rate | Sample rate of the PCM stream in Hz (24000) |
| X-Audio-Format | PCM encoding of the stream (pcm_s16le) |
| X-Channels | Number of audio channels (1 — mono) |
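One possible way to consume these headers in Python is sketched below. The names `AudioFormat` and `parse_audio_headers` are hypothetical, and falling back to the documented values (24000 Hz, mono, pcm_s16le) when a header is absent is an assumption of this sketch rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AudioFormat:
    sample_rate: int
    channels: int
    encoding: str

    @property
    def bytes_per_second(self) -> int:
        # pcm_s16le stores each sample in 2 bytes per channel.
        return self.sample_rate * self.channels * 2

    def duration_seconds(self, num_bytes: int) -> float:
        # Playback time represented by num_bytes of raw PCM.
        return num_bytes / self.bytes_per_second


def parse_audio_headers(headers: Mapping[str, str]) -> AudioFormat:
    """Read the X-* audio headers into an AudioFormat (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    fmt = AudioFormat(
        # Assumption: default to the documented values if a header is missing.
        sample_rate=int(lowered.get("x-sample-rate", "24000")),
        channels=int(lowered.get("x-channels", "1")),
        encoding=lowered.get("x-audio-format", "pcm_s16le"),
    )
    if fmt.encoding != "pcm_s16le":
        raise ValueError(f"unsupported encoding: {fmt.encoding}")
    return fmt
```

At the documented format the stream carries 48,000 bytes per second of audio, so `duration_seconds` gives a quick estimate of how much playback time has been buffered so far.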
Format: 16-bit signed little-endian PCM, 24,000 Hz, mono. To turn the raw stream into a playable file, wrap it in a WAV container (see the cURL example) or decode each chunk in the browser with the Web Audio API (see the JavaScript example).
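For a Python client, wrapping the stream in a WAV container might look like the sketch below, which uses the standard-library `wave` module. Network chunk boundaries need not coincide with 2-byte sample boundaries, so the sketch re-slices chunks before writing them. The function names `aligned_chunks` and `write_wav` are hypothetical, and the output object is assumed to be seekable so the WAV header can be patched with the final length.

```python
import wave
from typing import BinaryIO, Iterable, Iterator


def aligned_chunks(chunks: Iterable[bytes], frame_size: int = 2) -> Iterator[bytes]:
    """Re-slice byte chunks so each yielded piece holds only whole frames."""
    pending = b""
    for chunk in chunks:
        data = pending + chunk
        # Largest prefix whose length is a multiple of frame_size.
        usable = len(data) - (len(data) % frame_size)
        if usable:
            yield data[:usable]
        # Carry the leftover byte(s) into the next chunk.
        pending = data[usable:]
    # A trailing partial frame, if any, is dropped: it is not a full sample.


def write_wav(chunks: Iterable[bytes], out: BinaryIO, sample_rate: int = 24000) -> int:
    """Write PCM chunks to `out` as mono 16-bit WAV; return PCM bytes written."""
    total = 0
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(sample_rate)
        for piece in aligned_chunks(chunks):
            wav.writeframes(piece)
            total += len(piece)
    return total
```

With a response object from `urllib.request.urlopen`, an expression such as `iter(lambda: resp.read(4096), b"")` could serve as the `chunks` argument. The same `aligned_chunks` generator can feed a live audio player instead of a file.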
Status Codes
The API returns standard HTTP status codes to indicate the success or failure of requests.
| Status | Description |
|---|---|
| 200 | Success — audio is streamed as raw PCM in the response body |
| 400 | Bad Request — missing/empty text or voice_id |
| 401 | Unauthorized — invalid or missing API key |
| 404 | Not Found — no reference audio for the given voice_id |
| 429 | Too Many Requests — rate limit exceeded |
| 502 | Bad Gateway — the streaming model server was unreachable |
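A client could translate these codes into exceptions along the lines of the sketch below. The names `TTSStreamError` and `check_status` are hypothetical. Treating 429 and 502 as retryable is an assumption of this sketch: the table describes the errors but does not prescribe a retry policy.

```python
# Short messages paraphrasing the status-code table above.
ERROR_MESSAGES = {
    400: "bad request: missing or empty text or voice_id",
    401: "unauthorized: invalid or missing API key",
    404: "not found: no reference audio for the given voice_id",
    429: "too many requests: rate limit exceeded",
    502: "bad gateway: streaming model server unreachable",
}

# Assumption: only rate limiting and upstream outages are worth retrying;
# the other errors need a corrected request first.
RETRYABLE = {429, 502}


class TTSStreamError(Exception):
    """Raised for any non-200 response (hypothetical client-side type)."""

    def __init__(self, status: int, message: str, retryable: bool) -> None:
        super().__init__(f"{status}: {message}")
        self.status = status
        self.retryable = retryable


def check_status(status: int) -> None:
    """Return silently on 200; raise TTSStreamError otherwise."""
    if status == 200:
        return
    message = ERROR_MESSAGES.get(status, "unexpected status")
    raise TTSStreamError(status, message, retryable=status in RETRYABLE)
```

Calling `check_status` before reading the body keeps an error response from being mistaken for PCM audio and written to the output.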