Text-to-Speech
Convert text to natural-sounding Cantonese speech. Supports multiple voices, audio formats, and customization parameters. On model_id "v5"/"v6", you can also pass a jyutping romanization — alongside text to guide pronunciation, or on its own for jyutping-only synthesis.
Request Body
Send a JSON body with the following parameters:
api_key: Your API key for authentication.
text: Text to convert to speech. Maximum 5000 characters. On model_id "v5" or "v6", text may be omitted only when jyutping is supplied on its own; if both are supplied, the model uses them together (text for the characters, jyutping to guide pronunciation).
jyutping: Jyutping romanization to guide pronunciation. Only supported when model_id is "v5" or "v6". Can be supplied alongside text, or on its own (with text empty) for jyutping-only synthesis. Supports <break time="..."/> tags; when both text and jyutping contain breaks, the chunk counts must match.
model_id: TTS model version. Options: "v2", "v3", "v4", "v5", "v6". The jyutping input is only honored on "v5" and "v6".
frame_rate: Audio frame rate in Hz. Common values: "16000", "24000", "44100".
speed: Speech speed multiplier. Range: 0.5–3.0.
Target duration in seconds.
pitch: Pitch adjustment in semitones. Range: −12 to +12.
language: Language code. Options: "cantonese", "english", "mandarin".
output_extension: Audio output format. Options: "wav", "mp3".
voice_id: Unique identifier for the voice to use. Defaults to the system default voice.
Whether to apply audio enhancement.
Whether to convert simplified Chinese to traditional Chinese before synthesis.
should_return_timestamp: If true, returns a JSON response with base64-encoded audio plus SRT and word-level timestamps instead of a direct audio file.
Use the faster turbo model. Supported voices are listed in the voice library.
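The limits listed above can be checked on the client before a request is sent. The following is a minimal Python sketch, not an official client: the helper name build_tts_payload is hypothetical, the field names are the ones used in the example requests, and two choices are assumptions rather than documented behavior: an empty text is left out of the body for jyutping-only requests, and jyutping is rejected unless model_id is set explicitly to "v5" or "v6".

```python
import json

JYUTPING_MODELS = {"v5", "v6"}   # models that accept the jyutping input
MAX_TEXT_CHARS = 5000            # documented maximum text length


def build_tts_payload(api_key, text="", jyutping=None, model_id=None, **options):
    """Assemble a request body, checking the limits stated in the parameter list.

    Hypothetical helper: extra keyword arguments (speed, pitch, language, ...)
    are copied into the body unchanged.
    """
    if jyutping is not None and model_id not in JYUTPING_MODELS:
        raise ValueError('jyutping requires model_id "v5" or "v6"')
    if not text and jyutping is None:
        raise ValueError("text is required unless jyutping is supplied on its own")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError("text exceeds 5000 characters")
    speed = options.get("speed")
    if speed is not None and not 0.5 <= speed <= 3.0:
        raise ValueError("speed must be between 0.5 and 3.0")
    pitch = options.get("pitch")
    if pitch is not None and not -12 <= pitch <= 12:
        raise ValueError("pitch must be between -12 and 12")

    payload = {"api_key": api_key}
    if text:
        payload["text"] = text  # assumption: empty text is omitted, not sent as ""
    if jyutping is not None:
        payload["jyutping"] = jyutping
    if model_id is not None:
        payload["model_id"] = model_id
    payload.update(options)
    return payload


if __name__ == "__main__":
    body = build_tts_payload(
        "YOUR_API_KEY",
        text="你今日食咗飯未?",
        speed=1,
        pitch=0,
        language="cantonese",
        output_extension="wav",
    )
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

The server remains the authority on validation; a check like this only catches the documented limits early, before a round trip.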
Example Request
By default the API returns a direct audio file in the requested format (.wav or .mp3).
curl -X POST "https://cantonese.ai/api/tts" \
-H "Content-Type: application/json" \
-d '{
"api_key": "YOUR_API_KEY",
"text": "你今日食咗飯未?",
"frame_rate": "24000",
"speed": 1,
"pitch": 0,
"language": "cantonese",
"output_extension": "wav",
"voice_id": "2725cf0f-efe2-4132-9e06-62ad84b2973d",
"should_return_timestamp": false
}' \
--output output.wav
With Timestamps
When should_return_timestamp = true, the API returns a JSON response with the base64-encoded audio file and timing data.
curl -s -X POST "https://cantonese.ai/api/tts" \
-H "Content-Type: application/json" \
-d '{
"api_key": "YOUR_API_KEY",
"text": "你今日食咗飯未?",
"frame_rate": "24000",
"speed": 1,
"pitch": 0,
"language": "cantonese",
"output_extension": "wav",
"voice_id": "2725cf0f-efe2-4132-9e06-62ad84b2973d",
"should_return_timestamp": true
}' -o response.json
jq -r '.file' response.json | base64 -d > output.wav
jq '.timestamps' response.json > timestamps.json
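The same unpacking can be done without jq. The sketch below is a Python equivalent of the two jq commands above, assuming the response body has already been read into a string; the function names unpack_tts_response and save_tts_response are hypothetical, while the file, timestamps, and srt_timestamp fields are the ones the API documents.

```python
import base64
import json
from pathlib import Path


def unpack_tts_response(body):
    """Split a timestamped TTS response into audio bytes, timing entries, and SRT text."""
    data = json.loads(body)
    audio = base64.b64decode(data["file"])  # "file" holds the base64-encoded audio
    return audio, data["timestamps"], data["srt_timestamp"]


def save_tts_response(body, audio_path, srt_path):
    """Write the decoded audio and the SRT subtitles to disk; return the timing entries."""
    audio, timestamps, srt = unpack_tts_response(body)
    Path(audio_path).write_bytes(audio)
    Path(srt_path).write_text(srt, encoding="utf-8")
    return timestamps
```

For example, save_tts_response(Path("response.json").read_text(encoding="utf-8"), "output.wav", "output.srt") would produce the same output.wav as the jq pipeline, plus a subtitle file.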
With Jyutping (v5 / v6)
On model_id = "v5" or "v6", pass a jyutping string to guide pronunciation. Supply it alongside text, or on its own for jyutping-only synthesis.
curl -X POST "https://cantonese.ai/api/tts" \
-H "Content-Type: application/json" \
-d '{
"api_key": "YOUR_API_KEY",
"text": "你今日食咗飯未?",
"jyutping": "nei5 gam1 jat6 sik6 zo2 faan6 mei6",
"model_id": "v6",
"frame_rate": "24000",
"speed": 1,
"pitch": 0,
"language": "cantonese",
"output_extension": "wav",
"voice_id": "2725cf0f-efe2-4132-9e06-62ad84b2973d",
"should_return_timestamp": false
}' \
--output output.wav
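When text and jyutping both contain <break time="..."/> tags, their chunk counts must match. A rough client-side check might look like the Python sketch below. It assumes that a "chunk" is the stretch of input between two break tags, which the documentation does not spell out, and the helper name break_chunks_match and the time values in the usage note ("500ms") are illustrative only.

```python
import re

# Matches a self-closing break tag such as <break time="..."/>.
BREAK_TAG = re.compile(r'<break\s+time="[^"]*"\s*/>')


def break_chunks_match(text, jyutping):
    """Return True when text and jyutping split into the same number of chunks.

    The documented rule only covers the case where both inputs contain breaks,
    so inputs where either side has no break tag are not rejected here.
    """
    text_chunks = BREAK_TAG.split(text)
    jyutping_chunks = BREAK_TAG.split(jyutping)
    if len(text_chunks) == 1 or len(jyutping_chunks) == 1:
        return True  # at most one side contains breaks; the rule does not apply
    return len(text_chunks) == len(jyutping_chunks)
```

Calling break_chunks_match('你好<break time="500ms"/>早晨', 'nei5 hou2<break time="500ms"/>zou2 san4') passes, since each input has one break and therefore two chunks.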
Response
The default response is a direct audio file. When should_return_timestamp = true, the response is a JSON object with the following fields:
file: base64-encoded audio file in the requested format
request_id: unique identifier for this request
srt_timestamp: subtitle timestamps in SRT format
timestamps: array of timing entries with start, end, and text
{
"file": "Z+AOEA4wHjAXf7s/Qw7uXoYuwz8LD22PVH8gzwR+os6zrq...",
"request_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
"srt_timestamp": "1\n00:00:00,000 --> 00:00:01,984\n你今日食咗飯未\n\n",
"timestamps": [
{
"start": 0,
"end": 1.984,
"text": "你今日食咗飯未"
}
]
}
Jyutping requests on v5 / v6 return the same shape. The submitted jyutping is not echoed back in the response.
{
"file": "Z+AOEA4wHjAXf7s/Qw7uXoYuwz8LD22PVH8gzwR+os6zrq...",
"request_id": "b2c3d4e5-f6a7-8901-2345-67890abcdef0",
"srt_timestamp": "1\n00:00:00,000 --> 00:00:02,112\n你今日食咗飯未\n\n",
"timestamps": [
{
"start": 0,
"end": 2.112,
"text": "你今日食咗飯未"
}
]
}
Status Codes
The API returns standard HTTP status codes to indicate the success or failure of requests.
| Status | Description |
|---|---|
| 200 | Success — audio file generated successfully |
| 400 | Bad Request — invalid parameters or malformed request |
| 401 | Unauthorized — invalid or missing API key |
| 403 | Forbidden — API key doesn't have permission for this endpoint |
| 413 | Payload Too Large — text exceeds maximum length (5000 characters) |
| 422 | Unprocessable Entity — invalid parameter values or unsupported voice/format |
| 429 | Too Many Requests — rate limit exceeded |
| 500 | Internal Server Error — server encountered an unexpected condition |
| 503 | Service Unavailable — server is temporarily unable to handle the request |
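One way a client might act on the codes in the table is sketched below in Python. The grouping is an assumption, not documented API guidance: it treats 429 and 503 as worth retrying with exponential backoff, 401 and 403 as credential problems, and 400, 413, and 422 as requests that must be changed before resending. The names classify_status and backoff_delays are hypothetical.

```python
def classify_status(status):
    """Map an HTTP status code from the table to a coarse client action."""
    if status == 200:
        return "ok"
    if status in (429, 503):
        return "retry"          # assumption: transient, try again after a delay
    if status in (401, 403):
        return "check_api_key"
    if status in (400, 413, 422):
        return "fix_request"    # resending the same body will not help
    return "server_error"       # 500 and anything not listed in the table


def backoff_delays(attempts, base=1.0, cap=30.0):
    """Exponential backoff delays in seconds: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```

With the defaults, backoff_delays(4) yields waits of 1, 2, 4, and 8 seconds between retries.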