Speech-to-Text (Synchronous)

Convert Cantonese audio files to accurate text transcriptions. This endpoint supports multiple audio formats, timestamps, speaker diarization, and advanced transcription options.

Two modes — pick the right one for your audio:

Synchronous (this page): one blocking POST. Best for clips under ~10 minutes — long enough for short voice notes, message replies, and most podcast snippets, but the request is bound by Cloudflare's 100-second proxy timeout.
Async (/speech-to-text-async): submit + poll. Required for long-form audio (interviews, lectures, meetings, sermons) — anything that takes more than ~90 seconds of server-side processing.
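
The choice between the two modes can be sketched as a small helper. This is only an illustration: the 10-minute cutoff is the rough guidance given above, not a hard API limit, and `choose_mode` is a hypothetical name rather than part of any official SDK.

```python
# Rough guidance from the text above ("clips under ~10 minutes").
# This is an assumption used for illustration, not an enforced limit.
SYNC_MAX_AUDIO_SECONDS = 10 * 60


def choose_mode(audio_seconds: float) -> str:
    """Return "sync" for short clips and "async" for long-form audio."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return "sync" if audio_seconds < SYNC_MAX_AUDIO_SECONDS else "async"
```

Processing time, not audio length, is what actually hits the proxy timeout, so a clip near the cutoff with diarization and fusion enabled may still be safer on the async endpoint.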

Request Parameters

This endpoint requires multipart/form-data for file uploads.

Parameter	Type	Required	Description
api_key	string	Yes	Your API key for authentication
data	file	Yes	Audio file to transcribe. Supported formats: wav, mp3, m4a, flac, ogg.
with_timestamp	boolean	No	Include word-level timestamps in the response. Defaults to false.
with_diarization	boolean	No	Enable speaker diarization to identify different speakers. Defaults to false.
context	string	No	Free-form text describing the recording's topic, speakers, or domain (e.g. “quarterly earnings call for HSBC, speakers are CFO and analysts”). Used as a hint during the post-ASR fusion step to disambiguate homophones and prefer domain-specific vocabulary. Has no effect when fusion is disabled (`skip_fusion=true` or `corpus_ids="none"`).
skip_fusion	boolean	No	Set to `true` to skip the post-ASR fusion step entirely: no second-pass jyutping ASR and no LLM homophone correction. Use this for latency-sensitive calls (typically ~30–60% faster on Cantonese audio) when you only need the raw transcription. Defaults to `false` (fusion runs with the default public corpora). When skipped, the response omits `fused_transcription` and `jyutping_transcription`; use `transcription` instead. Equivalent to `corpus_ids="none"`; takes precedence if both are supplied.
corpus_ids	string	No	Comma-separated list of corpus IDs to bias the fusion LLM toward domain-specific vocabulary. Omit to use all default public corpora (opt-out behaviour for Cantonese). Pass `"none"` (or an empty string) to disable fusion, which is equivalent to `skip_fusion=true`. Only applies to `lang=cantonese` on the default backend.
wait_for_completion	boolean	No	Defaults to `false` (async). On this synchronous endpoint, set `wait_for_completion=true` if you want the response to include the full transcription instead of just `{ request_id, status }`. For long-form audio prefer the dedicated async endpoint — sync calls hit Cloudflare's 100-second timeout for anything that takes longer to process.
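
The interaction between `context`, `skip_fusion`, and `corpus_ids` is easy to get wrong, so here is a minimal sketch of a helper that assembles the text form fields. It assumes booleans are sent as the strings `true` and `false` (as in the curl example on this page). The function name `build_stt_fields` and the sample corpus IDs in the usage note are hypothetical.

```python
from typing import Dict, Optional, Sequence


def _flag(value: bool) -> str:
    # Assumption: the API accepts lowercase "true"/"false" form values.
    return "true" if value else "false"


def build_stt_fields(
    api_key: str,
    *,
    with_timestamp: bool = False,
    with_diarization: bool = False,
    context: Optional[str] = None,
    skip_fusion: bool = False,
    corpus_ids: Optional[Sequence[str]] = None,
    wait_for_completion: Optional[bool] = None,
) -> Dict[str, str]:
    """Build the non-file form fields for the request."""
    fields = {
        "api_key": api_key,
        "with_timestamp": _flag(with_timestamp),
        "with_diarization": _flag(with_diarization),
    }
    # An explicitly empty corpus list is treated like corpus_ids="none".
    fusion_disabled = skip_fusion or (corpus_ids is not None and len(corpus_ids) == 0)
    if fusion_disabled:
        # skip_fusion takes precedence, and context has no effect without
        # fusion, so neither corpus_ids nor context is sent.
        fields["skip_fusion"] = "true"
    else:
        if corpus_ids is not None:
            fields["corpus_ids"] = ",".join(corpus_ids)
        if context:
            fields["context"] = context
    if wait_for_completion is not None:
        fields["wait_for_completion"] = _flag(wait_for_completion)
    return fields
```

For example, `build_stt_fields("YOUR_API_KEY", corpus_ids=["finance", "hk-places"])` yields `corpus_ids=finance,hk-places`, while adding `skip_fusion=True` drops both `corpus_ids` and `context` from the request.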

Example Request

Sends the audio synchronously and blocks until transcription + fusion finish. The response arrives in one round trip — no polling. Cloudflare cuts requests at 100 seconds, so use the async endpoint for any audio whose total processing time is likely to exceed that.

curl -X POST "https://cantonese.ai/api/stt" \
  -F "api_key=YOUR_API_KEY" \
  -F "with_timestamp=false" \
  -F "with_diarization=false" \
  -F "context=Quarterly earnings call for HSBC, speakers are CFO and analysts" \
  -F "data=@YOUR_AUDIO_FILE.wav;type=audio/wav"
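
For callers who are not using curl, the same multipart request can be sketched with Python's standard library alone. The URL and the `data` field name come from this page. The helper names `encode_multipart` and `build_request` are illustrative, and the encoder is deliberately minimal: it does not escape quotes in field names or filenames.

```python
import urllib.request
import uuid
from typing import Dict, Tuple

STT_URL = "https://cantonese.ai/api/stt"  # from the curl example


def encode_multipart(
    fields: Dict[str, str],
    filename: str,
    audio_bytes: bytes,
    content_type: str = "audio/wav",
) -> Tuple[bytes, str]:
    """Encode text fields plus one audio file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    # The audio goes in the "data" field, as in the parameter table.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="data"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n".encode("utf-8")
    )
    parts.append(audio_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(
    fields: Dict[str, str], filename: str, audio_bytes: bytes
) -> urllib.request.Request:
    """Build (but do not send) the POST request."""
    body, header = encode_multipart(fields, filename, audio_bytes)
    return urllib.request.Request(
        STT_URL, data=body, headers={"Content-Type": header}, method="POST"
    )
```

To send it, pass the result to `urllib.request.urlopen(req, timeout=100)` and decode the JSON body. The 100-second client timeout mirrors the proxy limit described above and is a suggestion, not a requirement.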

Response

On success, the response returns a JSON object with the transcription results:

Default response format:

{
  "text": "When you call someone who is thousands of miles away, you're using a satellite.",
  "duration": "6.540000",
  "process_time": "0.19"
}

with_timestamp = true

{
  "text": "1\n00:00:01,032 --> 00:00:04,083\nWhen you call someone who is thousands of\n\n2\n00:00:04,083 --> 00:00:04,868\n miles away, you're using a satellite.\n\n",
  "duration": "6.540000",
  "process_time": "1.86"
}

with_diarization = true

{
  "text": "When you call someone who is thousands of miles away, you're using a satellite.",
  "diarization": "SPEAKER_00: When you call someone who is thousands of miles away, you're using a satellite.",
  "duration": "6.540000",
  "process_time": "0.19"
}

with_timestamp = true and with_diarization = true

{
  "text": "1\n00:00:01,032 --> 00:00:04,083\nSPEAKER_00: When you call someone who is thousands of\n\n2\n00:00:04,083 --> 00:00:04,868\nSPEAKER_00:  miles away, you're using a satellite.\n\n",
  "duration": "6.540000",
  "process_time": "3.22"
}
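
When `with_timestamp=true`, the `text` field in the samples above is an SRT-style string, optionally with a `SPEAKER_NN:` prefix on each cue. A minimal parsing sketch follows. It assumes the layout is exactly as in those samples (index line, time range line, text, blank line) and that speaker labels always match `SPEAKER_<digits>`. `Cue` and `parse_srt` are illustrative names.

```python
import re
from typing import List, NamedTuple, Optional


class Cue(NamedTuple):
    index: int
    start: float  # seconds
    end: float    # seconds
    speaker: Optional[str]
    text: str


_TIME = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")
_SPEAKER = re.compile(r"^(SPEAKER_\d+):\s*(.*)$", re.DOTALL)


def _seconds(stamp: str) -> float:
    """Convert "HH:MM:SS,mmm" to seconds."""
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    hours, minutes, secs, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + secs + millis / 1000


def parse_srt(text: str) -> List[Cue]:
    """Split the SRT-style `text` field into cues."""
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.split("\n")
        if len(lines) < 3:
            continue  # skip malformed or empty blocks
        start, end = lines[1].split(" --> ")
        body = "\n".join(lines[2:]).strip()
        match = _SPEAKER.match(body)
        speaker = match.group(1) if match else None
        if match:
            body = match.group(2).strip()
        cues.append(Cue(int(lines[0]), _seconds(start), _seconds(end), speaker, body))
    return cues
```

Applied to the combined timestamp and diarization sample, this yields two cues, both attributed to `SPEAKER_00`, with the first spanning roughly 1.032 s to 4.083 s.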

Status Codes

The API returns standard HTTP status codes to indicate the success or failure of requests.

Status Code	Description
200	Success - Audio transcribed successfully
400	Bad Request - Invalid parameters or malformed request
401	Unauthorized - Invalid or missing API key
403	Forbidden - API key doesn't have permission for this endpoint
413	Payload Too Large - Audio file exceeds maximum size limit
415	Unsupported Media Type - Audio format not supported
422	Unprocessable Entity - Audio file corrupted or invalid parameter values
429	Too Many Requests - Rate limit exceeded
500	Internal Server Error - Server encountered an unexpected condition
503	Service Unavailable - Server is temporarily unable to handle the request
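
One way a client might act on these codes is sketched below. The retry policy is an assumption, not documented API behaviour: it treats 429, 500, and 503 as transient and everything else as a caller error, and it uses a plain capped exponential backoff because this page does not say whether the API sends a `Retry-After` header.

```python
from typing import List

# Assumption: these codes indicate transient conditions worth retrying.
RETRYABLE_STATUS = frozenset({429, 500, 503})


def is_retryable(status: int) -> bool:
    """Return True if a request with this status may succeed on retry."""
    return status in RETRYABLE_STATUS


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Capped exponential backoff: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```

Codes such as 400, 401, 413, 415, and 422 describe problems with the request itself, so resending the same payload is unlikely to help. In production, adding random jitter to each delay is a common refinement.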