Audio · Synthesis

Text-to-Speech

Convert text to natural-sounding Cantonese speech. Supports multiple voices, audio formats, and customization parameters. With model_id "v5" or "v6", you can also pass a jyutping romanization, either alongside text to guide pronunciation or on its own for jyutping-only synthesis.

POST https://cantonese.ai/api/tts

Request Body

Send a JSON body with the following parameters:

api_key · string · Required

Your API key for authentication.

text · string · Required (may be omitted for jyutping-only synthesis on "v5"/"v6")

Text to convert to speech. Maximum 5000 characters. On model_id "v5" or "v6", text may be omitted when jyutping is provided (jyutping-only synthesis). If both are supplied, the model uses them together: text for the characters, jyutping to guide pronunciation.

jyutping · string · Optional

Jyutping romanization to guide pronunciation. Only supported when model_id is "v5" or "v6". Can be supplied alongside text, or on its own (with text empty) for jyutping-only synthesis. Supports <break time="..."/> tags; when both text and jyutping contain breaks, the chunk counts must match.
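
The chunk-count rule can be checked on the client before sending a request. The sketch below is one possible reading, not the API's own logic: it assumes a "chunk" is the text between consecutive <break time="..."/> tags, and it accepts any quoted time value because the accepted time syntax is not specified here. The names split_chunks and break_chunks_match are hypothetical helpers.

```python
import re

# Matches tags such as <break time="500ms"/>. The time syntax the API accepts
# is not documented here, so any quoted value is allowed (assumption).
BREAK_TAG = re.compile(r'<break\s+time="[^"]*"\s*/>')


def split_chunks(value: str) -> list[str]:
    """Split a string into the chunks that lie between <break/> tags."""
    return BREAK_TAG.split(value)


def break_chunks_match(text: str, jyutping: str) -> bool:
    """Return True when text and jyutping have compatible break structure.

    The counts are only compared when both inputs contain breaks.
    """
    text_chunks = split_chunks(text)
    jyutping_chunks = split_chunks(jyutping)
    if len(text_chunks) == 1 or len(jyutping_chunks) == 1:
        return True  # at most one side uses breaks, so there is nothing to compare
    return len(text_chunks) == len(jyutping_chunks)
```

How the service treats a request where only one of the two fields contains breaks is not stated; this sketch lets such requests through.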

model_id · string · Optional

TTS model version. Options: "v2", "v3", "v4", "v5", "v6". The jyutping input is only honored on "v5" and "v6".

frame_rate · string · Optional · Default: "24000"

Audio frame rate in Hz. Common values: "16000", "24000", "44100".

speed · number · Optional · Default: 1.0

Speech speed multiplier. Range: 0.5–3.0.

duration · number · Optional

Target duration in seconds.

pitch · number · Optional · Default: 0

Pitch adjustment in semitones. Range: −12 to +12.

language · string · Optional · Default: "cantonese"

Language code. Options: "cantonese", "english", "mandarin".

output_extension · string · Optional · Default: "wav"

Audio output format. Options: "wav", "mp3".

voice_id · string · Optional

Unique identifier for the voice to use. Defaults to the system default voice.

should_enhance · boolean · Optional · Default: false

Whether to apply audio enhancement.

should_convert_from_simplified_to_traditional · boolean · Optional · Default: false

Whether to convert simplified Chinese to traditional Chinese before synthesis.

should_return_timestamp · boolean · Optional · Default: false

If true, returns a JSON response with base64-encoded audio plus SRT and word-level timestamps instead of a direct audio file.

should_use_turbo_model · boolean · Optional · Default: false

Use the faster turbo model. Supported voices are listed in the voice library.
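
The documented ranges and option lists lend themselves to a client-side check before a request is sent. The following is a minimal sketch under stated assumptions, not the server's validation: validate_tts_params is a hypothetical helper, characters are counted as Python code points (the API may count differently), and sending jyutping on a model other than "v5"/"v6" is flagged as a problem even though the service may simply ignore it.

```python
MODEL_IDS = {"v2", "v3", "v4", "v5", "v6"}
JYUTPING_MODELS = {"v5", "v6"}
LANGUAGES = {"cantonese", "english", "mandarin"}
OUTPUT_EXTENSIONS = {"wav", "mp3"}
MAX_TEXT_CHARS = 5000


def validate_tts_params(params: dict) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems = []
    text = params.get("text", "")
    jyutping = params.get("jyutping", "")
    model_id = params.get("model_id")

    if not params.get("api_key"):
        problems.append("api_key is required")
    if len(text) > MAX_TEXT_CHARS:
        problems.append("text exceeds 5000 characters")
    if model_id is not None and model_id not in MODEL_IDS:
        problems.append("unknown model_id")
    # Client-side choice: treat jyutping on an unsupported model as an error.
    if jyutping and model_id not in JYUTPING_MODELS:
        problems.append("jyutping needs model_id v5 or v6")
    if not text and not (jyutping and model_id in JYUTPING_MODELS):
        problems.append("text is required unless jyutping is sent on v5/v6")
    if not 0.5 <= params.get("speed", 1.0) <= 3.0:
        problems.append("speed must be between 0.5 and 3.0")
    if not -12 <= params.get("pitch", 0) <= 12:
        problems.append("pitch must be between -12 and 12")
    if params.get("language", "cantonese") not in LANGUAGES:
        problems.append("unsupported language")
    if params.get("output_extension", "wav") not in OUTPUT_EXTENSIONS:
        problems.append("unsupported output_extension")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem at once.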

Example Request

By default the API returns a direct audio file in the requested format (.wav or .mp3).

curl -X POST "https://cantonese.ai/api/tts" \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "YOUR_API_KEY",
    "text": "你今日食咗飯未?",
    "frame_rate": "24000",
    "speed": 1,
    "pitch": 0,
    "language": "cantonese",
    "output_extension": "wav",
    "voice_id": "2725cf0f-efe2-4132-9e06-62ad84b2973d",
    "should_return_timestamp": false
  }' \
  --output output.wav
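
For callers who prefer a script to curl, the same request can be sketched with Python's standard library. The function names build_tts_request and synthesize_to_file are hypothetical; the endpoint, header, and JSON fields mirror the curl example above. Building the request is kept separate from sending it so the body can be inspected without touching the network.

```python
import json
import urllib.request

TTS_URL = "https://cantonese.ai/api/tts"


def build_tts_request(api_key: str, text: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST request for the TTS endpoint."""
    body = {"api_key": api_key, "text": text, **options}
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(request: urllib.request.Request, path: str) -> None:
    """Send the request and write the returned audio bytes to path."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

A call such as synthesize_to_file(build_tts_request("YOUR_API_KEY", "你今日食咗飯未?", output_extension="wav"), "output.wav") would then correspond to the curl command. Note that urlopen raises urllib.error.HTTPError on 4xx and 5xx responses.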

With Timestamps

When should_return_timestamp = true, the API returns a JSON response with the base64-encoded audio file and timing data.

curl -s -X POST "https://cantonese.ai/api/tts" \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "YOUR_API_KEY",
    "text": "你今日食咗飯未?",
    "frame_rate": "24000",
    "speed": 1,
    "pitch": 0,
    "language": "cantonese",
    "output_extension": "wav",
    "voice_id": "2725cf0f-efe2-4132-9e06-62ad84b2973d",
    "should_return_timestamp": true
  }' -o response.json

# --decode is accepted by both GNU and macOS base64
jq -r '.file' response.json | base64 --decode > output.wav
jq '.timestamps' response.json > timestamps.json

With Jyutping (v5 / v6)

On model_id = "v5" or "v6", pass a jyutping string to guide pronunciation. Supply it alongside text, or on its own for jyutping-only synthesis.

curl -X POST "https://cantonese.ai/api/tts" \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "YOUR_API_KEY",
    "text": "你今日食咗飯未?",
    "jyutping": "nei5 gam1 jat6 sik6 zo2 faan6 mei6",
    "model_id": "v6",
    "frame_rate": "24000",
    "speed": 1,
    "pitch": 0,
    "language": "cantonese",
    "output_extension": "wav",
    "voice_id": "2725cf0f-efe2-4132-9e06-62ad84b2973d",
    "should_return_timestamp": false
  }' \
  --output output.wav
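
The two jyutping modes differ only in whether text carries content. As a rough sketch, a hypothetical helper jyutping_body can produce either request body; it assumes that jyutping-only synthesis is requested by sending text as an empty string, which is one reading of the parameter description above.

```python
import json

JYUTPING_MODELS = ("v5", "v6")


def jyutping_body(api_key: str, jyutping: str, text: str = "",
                  model_id: str = "v6") -> str:
    """Return a JSON body for jyutping-guided or jyutping-only synthesis."""
    if model_id not in JYUTPING_MODELS:
        raise ValueError("jyutping is only honored on model_id v5 or v6")
    body = {
        "api_key": api_key,
        "text": text,  # empty string means jyutping-only (assumption)
        "jyutping": jyutping,
        "model_id": model_id,
    }
    # ensure_ascii=False keeps Chinese characters readable in the output
    return json.dumps(body, ensure_ascii=False)
```

Passing text="你今日食咗飯未?" gives the guided form shown in the curl example; leaving it out gives the jyutping-only form.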

Response

The default response is a direct audio file. When should_return_timestamp = true, the response is a JSON object with the following fields:

  • file — base64-encoded audio file in the requested format
  • request_id — unique identifier for this request
  • srt_timestamp — subtitle timestamps in SRT format
  • timestamps — array of timing entries with start, end, and text
{
  "file": "Z+AOEA4wHjAXf7s/Qw7uXoYuwz8LD22PVH8gzwR+os6zrq...",
  "request_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
  "srt_timestamp": "1\n00:00:00,000 --> 00:00:01,984\n你今日食咗飯未\n\n",
  "timestamps": [
    {
      "start": 0,
      "end": 1.984,
      "text": "你今日食咗飯未"
    }
  ]
}
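
A response of this shape can be unpacked in a few lines. The sketch below uses a hypothetical helper, decode_tts_response, and relies only on the file and timestamps fields listed above.

```python
import base64
import json


def decode_tts_response(raw_json: str) -> tuple[bytes, list[dict]]:
    """Return (audio_bytes, timestamps) from a timestamped TTS response."""
    payload = json.loads(raw_json)
    audio = base64.b64decode(payload["file"])  # raw bytes of the audio file
    return audio, payload["timestamps"]
```

The returned bytes can be written with open("output.wav", "wb"), and each timestamps entry exposes start, end, and text.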

Jyutping requests on v5 / v6 return the same shape. The submitted jyutping is not echoed back in the response.

{
  "file": "Z+AOEA4wHjAXf7s/Qw7uXoYuwz8LD22PVH8gzwR+os6zrq...",
  "request_id": "b2c3d4e5-f6a7-8901-2345-67890abcdef0",
  "srt_timestamp": "1\n00:00:00,000 --> 00:00:02,112\n你今日食咗飯未\n\n",
  "timestamps": [
    {
      "start": 0,
      "end": 2.112,
      "text": "你今日食咗飯未"
    }
  ]
}
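
The srt_timestamp field in either response is a plain SRT string, so it can be parsed into cues without extra libraries. This is a minimal sketch with hypothetical helper names; it assumes well-formed blocks (index line, time line, text) separated by blank lines, as in the examples above, and returns integer milliseconds to avoid floating-point rounding.

```python
import re

SRT_TIME = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def srt_time_to_ms(stamp: str) -> int:
    """Convert an SRT time such as 00:00:01,984 to integer milliseconds."""
    match = SRT_TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT time: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_srt(srt: str) -> list[tuple[int, int, str]]:
    """Parse an SRT string into (start_ms, end_ms, text) cues."""
    cues = []
    for block in srt.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip empty or incomplete blocks
        start, end = lines[1].split(" --> ")
        text = "\n".join(lines[2:])
        cues.append((srt_time_to_ms(start), srt_time_to_ms(end), text))
    return cues
```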

Status Codes

The API returns standard HTTP status codes to indicate the success or failure of requests.

Status  Description
200     Success: audio file generated successfully
400     Bad Request: invalid parameters or malformed request
401     Unauthorized: invalid or missing API key
403     Forbidden: API key doesn't have permission for this endpoint
413     Payload Too Large: text exceeds maximum length (5000 characters)
422     Unprocessable Entity: invalid parameter values or unsupported voice/format
429     Too Many Requests: rate limit exceeded
500     Internal Server Error: server encountered an unexpected condition
503     Service Unavailable: server is temporarily unable to handle the request
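
A client usually needs to decide which of these statuses justify another attempt. The sketch below is one common policy, not guidance from the API: it assumes 429, 500, and 503 are transient and worth retrying with exponential backoff, while the remaining 4xx codes indicate a request that must be changed first. All names are hypothetical.

```python
STATUS_MEANINGS = {
    200: "Success",
    400: "Bad Request",
    401: "Unauthorized",
    403: "Forbidden",
    413: "Payload Too Large",
    422: "Unprocessable Entity",
    429: "Too Many Requests",
    500: "Internal Server Error",
    503: "Service Unavailable",
}

# Assumption: throttling and server-side failures may succeed on a later try.
RETRYABLE = {429, 500, 503}


def describe_status(status: int) -> str:
    """Return the documented meaning of a status code."""
    return STATUS_MEANINGS.get(status, "Unknown status")


def should_retry(status: int) -> bool:
    """Return True for statuses treated as transient in this sketch."""
    return status in RETRYABLE


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff delays in seconds: base, 2*base, 4*base, capped."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```

The default base and cap values are placeholders; the rate limits that trigger a 429 are not documented here, so suitable delays would need to be tuned.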