Audio · Synthesis

Streaming Text-to-Speech

Convert text to natural-sounding Cantonese speech with low latency. Unlike the standard Text-to-Speech endpoint — which returns one complete audio file — this endpoint streams audio as it is generated, so playback can begin after the first chunk (typically a few hundred milliseconds) instead of waiting for the whole clip.

POST https://cantonese.ai/api/tts-stream

Request Body

Send a JSON body with the following parameters:

api_key (string, required)

Your API key for authentication.

text (string, required)

Text to convert to speech. The audio is streamed back as it is generated.

voice_id (string, required)

Identifier of the voice whose timbre is cloned. Pick any voice from the voice library; its reference sample determines the speaker.

model_id (string, optional, default: "6")

Streaming TTS model version. Currently "6" (V6).

language (string, optional, default: "Chinese")

Language of the text. "Chinese" synthesizes Cantonese; "English" synthesizes English.

temperature (number, optional, default: 0.9)

Sampling temperature. Lower is more deterministic.

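top_p (number, optional, default: 0.95)

Nucleus (top-p) sampling cutoff. Sampling is restricted to the smallest set of candidates whose cumulative probability reaches this value.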
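The parameters above can be assembled into a JSON body as in the following minimal Python sketch. The helper name build_tts_payload is hypothetical and not part of any official SDK. The defaults mirror the values listed above, and the client-side checks for empty text and voice_id are an assumption intended to catch mistakes before the request is sent.

```python
import json

# Documented defaults, sent explicitly here for clarity.
DEFAULTS = {"model_id": "6", "language": "Chinese", "temperature": 0.9, "top_p": 0.95}


def build_tts_payload(api_key: str, text: str, voice_id: str, **overrides) -> str:
    """Return the JSON request body as a string (hypothetical helper)."""
    # Assumption: reject empty required fields locally instead of waiting for the server.
    if not text.strip():
        raise ValueError("text must be non-empty")
    if not voice_id.strip():
        raise ValueError("voice_id must be non-empty")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    body = {"api_key": api_key, "text": text, "voice_id": voice_id, **DEFAULTS, **overrides}
    # ensure_ascii=False keeps Cantonese text readable in the encoded body.
    return json.dumps(body, ensure_ascii=False)
```

For example, build_tts_payload("YOUR_API_KEY", "你好", "YOUR_VOICE_ID", temperature=0.5) overrides only the temperature and leaves the other optional parameters at their documented defaults.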

Example Request

The response body is a chunked stream of raw little-endian 16-bit PCM samples (24 kHz, mono). Read it incrementally and feed each chunk to your audio player for live playback.

# Streams raw 16-bit PCM (24kHz, mono) to out.pcm as it is generated.
curl -N -X POST https://cantonese.ai/api/tts-stream \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "YOUR_API_KEY",
    "text": "歡迎使用粵語人工智能嘅即時語音合成示範,呢段聲音係一邊生成一邊播放嘅。",
    "voice_id": "YOUR_VOICE_ID"
  }' \
  --output out.pcm

# Convert the raw PCM to a WAV file (s16le, 24kHz, mono):
ffmpeg -f s16le -ar 24000 -ac 1 -i out.pcm out.wav
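The same flow as the cURL example can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the function names stream_pcm and write_wav are hypothetical, and the 4096-byte read size is an arbitrary choice. The WAV parameters (16-bit, mono, 24 kHz) come from the format described on this page.

```python
import json
import urllib.request
import wave
from typing import Iterable, Iterator

API_URL = "https://cantonese.ai/api/tts-stream"


def stream_pcm(api_key: str, text: str, voice_id: str, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield raw PCM chunks from the endpoint as they arrive (hypothetical helper)."""
    body = json.dumps({"api_key": api_key, "text": text, "voice_id": voice_id}).encode("utf-8")
    request = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        while True:
            # read() returns once chunk_size bytes are available or the stream ends.
            chunk = response.read(chunk_size)
            if not chunk:
                break
            yield chunk


def write_wav(chunks: Iterable[bytes], path: str, sample_rate: int = 24000) -> int:
    """Wrap s16le mono PCM chunks in a WAV container; return the PCM bytes written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # writeframesraw defers the header fix-up until the file is closed.
            wav.writeframesraw(chunk)
            total += len(chunk)
    return total
```

Calling write_wav(stream_pcm("YOUR_API_KEY", "你好", "YOUR_VOICE_ID"), "out.wav") would write the file as audio arrives. For live playback, the chunks from stream_pcm could instead be handed to an audio output library.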

Response

The response is not a JSON object or a complete audio file — it is a raw PCM byte stream delivered with Transfer-Encoding: chunked. Audio metadata is returned in the response headers:

Header          Description
Content-Type    application/octet-stream
X-Sample-Rate   Sample rate of the PCM stream in Hz (24000)
X-Audio-Format  PCM encoding of the stream (pcm_s16le)
X-Channels      Number of audio channels (1, mono)
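A client can read these headers before consuming the body. The following is a minimal Python sketch: AudioFormat and parse_audio_headers are hypothetical names, and falling back to the documented values when a header is absent is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AudioFormat:
    sample_rate: int = 24000
    encoding: str = "pcm_s16le"
    channels: int = 1

    @property
    def bytes_per_second(self) -> int:
        # 2 bytes per sample per channel; only valid for 16-bit PCM.
        return self.sample_rate * self.channels * 2


def parse_audio_headers(headers: Mapping[str, str]) -> AudioFormat:
    """Build an AudioFormat from response headers (header names are case-insensitive)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    fmt = AudioFormat(
        sample_rate=int(lowered.get("x-sample-rate", 24000)),
        encoding=lowered.get("x-audio-format", "pcm_s16le"),
        channels=int(lowered.get("x-channels", 1)),
    )
    # Assumption: fail loudly if the server ever announces a different encoding.
    if fmt.encoding != "pcm_s16le":
        raise ValueError(f"unsupported encoding: {fmt.encoding}")
    return fmt
```

Reading the format from the headers instead of hard-coding it lets the rest of the client keep working if the sample rate changes.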

Format: 16-bit signed little-endian PCM, 24,000 Hz, mono. To turn the raw stream into a playable file, wrap it in a WAV container (as the ffmpeg command in the cURL example does), or decode each chunk in the browser with the Web Audio API.
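Because each sample occupies two bytes, a network chunk can end in the middle of a sample. The Python sketch below shows one way to decode chunks safely by carrying the odd trailing byte over to the next chunk. The names PcmChunkDecoder and duration_seconds are hypothetical, and whether the server ever splits samples across chunks is not stated on this page, so this is a defensive assumption.

```python
import struct
from typing import List


class PcmChunkDecoder:
    """Decode s16le chunks into integer samples, tolerating split samples."""

    def __init__(self) -> None:
        self._pending = b""  # at most one leftover byte from the previous chunk

    def feed(self, chunk: bytes) -> List[int]:
        data = self._pending + chunk
        usable = len(data) - (len(data) % 2)  # largest even prefix
        self._pending = data[usable:]
        # "<h" is a little-endian signed 16-bit integer.
        return list(struct.unpack(f"<{usable // 2}h", data[:usable]))


def duration_seconds(num_bytes: int, sample_rate: int = 24000) -> float:
    """Playback duration of mono 16-bit PCM: 2 bytes per sample."""
    return num_bytes / (2 * sample_rate)
```

At 24,000 Hz mono, one second of audio is 48,000 bytes, so duration_seconds(48000) evaluates to 1.0.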

Status Codes

The API returns standard HTTP status codes to indicate the success or failure of requests.

Status  Description
200     Success: audio is streamed as raw PCM in the response body
400     Bad Request: missing or empty text or voice_id
401     Unauthorized: invalid or missing API key
404     Not Found: no reference audio for the given voice_id
429     Too Many Requests: rate limit exceeded
502     Bad Gateway: the streaming model server was unreachable
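A client might map these codes to actions as in the following Python sketch. The names classify_status and backoff_delay are hypothetical. Treating 429 and 502 as retryable, and the exponential backoff parameters, are common conventions assumed here; this page does not specify a retry policy.

```python
from typing import Tuple

# Meanings taken from the status table above.
STATUS_MEANINGS = {
    200: "audio is streamed as raw PCM",
    400: "missing or empty text or voice_id",
    401: "invalid or missing API key",
    404: "no reference audio for the given voice_id",
    429: "rate limit exceeded",
    502: "streaming model server unreachable",
}

# Assumption: only rate limiting and upstream failures are worth retrying.
RETRYABLE = frozenset({429, 502})


def classify_status(status: int) -> Tuple[str, bool]:
    """Return (meaning, should_retry) for a response status code."""
    meaning = STATUS_MEANINGS.get(status, "undocumented status")
    return meaning, status in RETRYABLE


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Exponential backoff in seconds for a zero-based retry attempt."""
    return min(cap, base * (2 ** attempt))
```

With these defaults the delays grow as 0.5, 1, 2, and 4 seconds, then stay at the 8-second cap. Errors such as 400, 401, and 404 are returned as non-retryable because resending the same request would fail the same way.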