Text-to-Speech

Convert text to natural-sounding Cantonese speech. This endpoint supports multiple voice options, audio formats, and customization parameters.

Request Body

Parameter	Type	Required	Description
api_key	string	Yes	Your API key for authentication.
text	string	Yes	The text to convert to speech. Maximum 5000 characters.
frame_rate	string	No	Audio frame rate in Hz. Common values: "16000", "24000", "44100". Defaults to "24000".
speed	number	No	Speech speed multiplier. Range: 0.5-3.0. Defaults to 1.0.
duration	number	No	Target duration of the output audio, in seconds.
pitch	number	No	Pitch adjustment in semitones. Range: -12 to +12. Defaults to 0.
language	string	No	Language code. Defaults to "cantonese". Options: "cantonese", "english", "mandarin".
output_extension	string	No	Audio output format. Defaults to "wav". Options: "wav", "mp3".
voice_id	string	No	Unique identifier for the voice to use. Defaults to system default voice.
should_enhance	boolean	No	Whether to apply audio enhancement. Defaults to false.
should_convert_from_simplified_to_traditional	boolean	No	Whether to convert simplified Chinese to traditional Chinese before synthesis. Defaults to false.
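should_return_timestamp	boolean	No	Whether to return a JSON response containing base64-encoded audio and timing data instead of a direct audio file. Defaults to false.
should_use_turbo_model	boolean	No	Whether to use the Turbo model for faster speech synthesis. Defaults to false.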
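
The documented limits can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the helper name `build_tts_payload` is hypothetical, and it assumes the ranges in the table are inclusive and that empty text is not useful to send.

```python
import json

ALLOWED_LANGUAGES = {"cantonese", "english", "mandarin"}
ALLOWED_EXTENSIONS = {"wav", "mp3"}


def build_tts_payload(api_key, text, *, speed=1.0, pitch=0,
                      language="cantonese", output_extension="wav",
                      frame_rate="24000", **extra):
    """Hypothetical helper: check the documented limits and return a request dict."""
    if not text or len(text) > 5000:
        raise ValueError("text must be 1 to 5000 characters")
    if not 0.5 <= speed <= 3.0:
        raise ValueError("speed must be between 0.5 and 3.0")
    if not -12 <= pitch <= 12:
        raise ValueError("pitch must be between -12 and 12 semitones")
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    if output_extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported output_extension: {output_extension}")
    payload = {
        "api_key": api_key,
        "text": text,
        "frame_rate": str(frame_rate),  # the table documents this as a string
        "speed": speed,
        "pitch": pitch,
        "language": language,
        "output_extension": output_extension,
    }
    # Remaining documented options (voice_id, should_enhance, ...) pass through as-is.
    payload.update(extra)
    return payload


payload = build_tts_payload("YOUR_API_KEY", "你今日食咗飯未？",
                            should_use_turbo_model=True)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Whether the server rejects or clamps out-of-range values is not stated here, so the client-side checks are only a convenience.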

Turbo Model v1

Setting should_use_turbo_model to true enables faster speech synthesis.
The voices that support the Turbo model are listed in the voice library.

Response Types

The API returns one of two response formats, depending on the should_return_timestamp parameter:

🎵 Audio File Response

Direct Audio File

When should_return_timestamp = false (default), the API returns a direct audio file.

curl -X POST "https://cantonese.ai/api/tts" \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "YOUR_API_KEY",
    "text": "你今日食咗飯未？",
    "frame_rate": "24000",
    "speed": 1,
    "duration": 2,
    "pitch": 0,
    "language": "cantonese",
    "output_extension": "wav",
    "voice_id": "2725cf0f-efe2-4132-9e06-62ad84b2973d",
    "should_enhance": false,
    "should_convert_from_simplified_to_traditional": true,
    "should_return_timestamp": false,
    "should_use_turbo_model": false
  }' \
  --output output.wav

📁 Output Format

The response body is the audio file itself, in the requested format (.wav or .mp3).
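
For callers not using curl, the same request can be sketched in Python with only the standard library. The function names `make_tts_request` and `save_audio` and the 60-second timeout are illustrative choices, not part of the API; the URL and header are taken from the curl example.

```python
import json
import urllib.request

TTS_URL = "https://cantonese.ai/api/tts"  # endpoint from the curl example


def make_tts_request(payload):
    """Build a POST request with a JSON body, mirroring the curl example."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_audio(payload, path, timeout=60):
    """Send the request and write the raw response body to `path`.

    Assumes should_return_timestamp is false, so the body is the audio file.
    Returns the number of bytes written.
    """
    with urllib.request.urlopen(make_tts_request(payload), timeout=timeout) as resp:
        audio = resp.read()
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)
```

A call might look like `save_audio({"api_key": "YOUR_API_KEY", "text": "你今日食咗飯未？"}, "output.wav")`; the file extension should match the output_extension sent in the payload.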

📊 JSON Response with Timestamps

JSON with Base64 Audio + Timestamps

When should_return_timestamp = true, the API returns a JSON response with base64-encoded audio and timing data.

curl -X POST "https://cantonese.ai/api/tts" \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "YOUR_API_KEY",
    "text": "你今日食咗飯未？",
    "frame_rate": "24000",
    "speed": 1,
    "duration": 2,
    "pitch": 0,
    "language": "cantonese",
    "voice_id": "2725cf0f-efe2-4132-9e06-62ad84b2973d",
    "should_enhance": false,
    "should_convert_from_simplified_to_traditional": true,
    "should_return_timestamp": true,
    "should_use_turbo_model": false
  }'

📋 JSON Response Structure

Field	Description
file	Base64-encoded audio file in the requested format
request_id	Unique identifier for this request
srt_timestamp	Subtitle timestamps in SRT format
timestamps	Array of word-level timing data with start/end times and text

{
  "file": "Z+AOEA4wHjAXf7s/Qw7uXoYuwz8LD22PVH8gzwR+os6zrq...",
  "request_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
  "srt_timestamp": "1\n00:00:00,000 --> 00:00:01,984\n你今日食咗飯未\n\n",
  "timestamps": [
    {
      "start": 0,
      "end": 1.984,
      "text": "你今日食咗飯未"
    }
  ]
}
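
To use this response, the file field has to be base64-decoded before it is written to disk. The sketch below shows one way to do that, assuming the field names from the structure above; the helper name `unpack_tts_response` is hypothetical, and the sample response is synthetic (its file field holds placeholder bytes, not real audio).

```python
import base64
import json
import os
import tempfile


def unpack_tts_response(response_text, audio_path, srt_path=None):
    """Decode the base64 audio to `audio_path` and return the timestamps list."""
    data = json.loads(response_text)
    audio = base64.b64decode(data["file"])
    with open(audio_path, "wb") as f:
        f.write(audio)
    if srt_path is not None:
        # The SRT text is already formatted, so it can be saved unchanged.
        with open(srt_path, "w", encoding="utf-8") as f:
            f.write(data["srt_timestamp"])
    return data["timestamps"]


# Synthetic response for illustration only.
sample = json.dumps({
    "file": base64.b64encode(b"placeholder-bytes").decode("ascii"),
    "request_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "srt_timestamp": "1\n00:00:00,000 --> 00:00:01,984\n你今日食咗飯未\n\n",
    "timestamps": [{"start": 0, "end": 1.984, "text": "你今日食咗飯未"}],
})

out_dir = tempfile.mkdtemp()
audio_path = os.path.join(out_dir, "output.wav")
srt_path = os.path.join(out_dir, "output.srt")

entries = unpack_tts_response(sample, audio_path, srt_path)
for entry in entries:
    print(f'{entry["start"]:.3f}s to {entry["end"]:.3f}s: {entry["text"]}')
```

In the example response the end value 1.984 lines up with the SRT time 00:00:01,984, which suggests that start and end are expressed in seconds.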

Status Codes

The API returns standard HTTP status codes to indicate the success or failure of requests.
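
A client can branch on the status code class before trying to read audio or JSON from the body. The following is a rough sketch under the assumption that only the general HTTP status classes matter; the function names `describe_status` and `send` are hypothetical, and no API-specific codes or error body formats are assumed.

```python
import urllib.error
import urllib.request


def describe_status(status_code):
    """Map an HTTP status code to its standard class."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client error"  # the request itself needs to be fixed
    if 500 <= status_code < 600:
        return "server error"  # the same request may succeed later
    return "other"


def send(request, timeout=60):
    """Return (status_code, body_bytes); HTTP errors are returned, not raised."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the error object still carries the body.
        return err.code, err.read()
```

Here `send` expects a `urllib.request.Request` object; only bodies from responses classified as "success" should be treated as audio or as the JSON structure described earlier.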