Speech-to-Text (Async)

Async mode lets you submit an audio file and fetch the transcription later. Use this for any recording that takes more than ~90 seconds of server-side processing — typically anything longer than ~10 minutes of audio with fusion enabled. The synchronous endpoint will time out behind Cloudflare's 100-second proxy limit for those workloads.

When to use async vs synchronous:

Synchronous (/speech-to-text): a single blocking call. Convenient for clips under ~10 min.
Async (this page): submit + poll. Required for long-form audio (interviews, podcasts, meetings, sermons).

How It Works

POST to /api/stt with wait_for_completion=false. The endpoint returns within seconds with { request_id, status: "processing" }. The transcription continues server-side independently of the HTTP connection.
Poll GET /api/speech-to-text/get-result?id=<request_id> every 5–15 seconds. Each poll is a fast round trip, so it stays well under Cloudflare's 100-second proxy limit.
Stop polling once transcription.status is completed or failed. The full transcription is on the same response.

Step 1 — Submit (POST /api/stt)

Same endpoint and parameters as the synchronous version. The only difference is wait_for_completion=false. Because false is already the default, passing it is optional; it is listed here for clarity.

Immediate response (returns in seconds):

{
  "success": true,
  "data": {
    "request_id": "873d6cbc-c273-436a-a2c9-16a0e91caaa8",
    "status": "processing"
  }
}

Save the request_id — that's what you'll use for the poll.

Step 2 — Poll (GET /api/speech-to-text/get-result)

Field	Value
Method	GET
Path	/api/speech-to-text/get-result
Query	id=<request_id>
Auth	`x-api-key` header — same key used to submit the request
Suggested poll interval	5–15 seconds (longer is fine; backoff acceptable)

Status values

transcription.status takes one of the values below. Only completed and failed are terminal; stop polling as soon as you see either.

Status	Meaning
pending	Initial state — request accepted, transcription not started yet.
processing	Transcription and/or fusion currently running. Keep polling.
completed	Done — read `transcription.transcription` / `fused_transcription` / etc. from the response.
failed	An error occurred — see `transcription.error`. Don't retry automatically; check the message first.

Full Example: Submit + Poll

#!/usr/bin/env bash
set -euo pipefail

API_KEY="YOUR_API_KEY"
BASE="https://cantonese.ai"

# 1. Submit the audio. wait_for_completion=false (default) returns
#    immediately with a request_id while the server keeps transcribing.
#    NOTE: the file field name ("data") and the file name (audio.wav) are
#    assumptions; use the file parameter documented for the synchronous endpoint.
RESPONSE=$(curl -s -X POST "$BASE/api/stt" \
  -F "api_key=$API_KEY" \
  -F "with_timestamp=false" \
  -F "with_diarization=false" \
  -F "wait_for_completion=false" \
  -F "context=Quarterly earnings call for HSBC, speakers are CFO and analysts" \
  -F "data=@audio.wav;type=audio/wav")
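REQUEST_ID=$(echo "$RESPONSE" | jq -r '.data.request_id // empty')
# Stop early if the submit call did not return a request_id.
if [ -z "$REQUEST_ID" ]; then
  echo "submit failed: $RESPONSE" >&2
  exit 1
fi
echo "submitted: request_id=$REQUEST_ID"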

# 2. Poll for the result. Each poll is a fast round trip — Cloudflare-safe.
while true; do
  RESULT=$(curl -s -G "$BASE/api/speech-to-text/get-result" \
    --data-urlencode "id=$REQUEST_ID" \
    -H "x-api-key: $API_KEY")
  STATUS=$(echo "$RESULT" | jq -r '.transcription.status')
  echo "status=$STATUS"
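  case "$STATUS" in
    completed|failed) break ;;
    # success: false responses (unknown id, wrong account) carry no status.
    null|"") echo "poll error: $RESULT" >&2; exit 1 ;;
  esac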
  sleep 5
done

# 3. Final transcription is in the same response.
echo "$RESULT" | jq '.transcription'

Poll Response Examples

While processing:

{
  "success": true,
  "transcription": {
    "request_id": "873d6cbc-c273-436a-a2c9-16a0e91caaa8",
    "status": "processing",
    "duration_s": 2824,
    "output_language": "cantonese",
    "backend": "cantonese_ai",
    "with_diarization": false,
    "include_timestamp": false,
    "context": "Quarterly earnings call for HSBC, speakers are CFO and analysts"
  }
}

When completed (47-min audio with fusion enabled — same shape as the synchronous response, just delivered via poll):

{
  "success": true,
  "transcription": {
    "request_id": "873d6cbc-c273-436a-a2c9-16a0e91caaa8",
    "status": "completed",
    "duration_s": 2824,
    "output_language": "cantonese",
    "backend": "cantonese_ai",
    "with_diarization": false,
    "include_timestamp": false,
    "context": "Quarterly earnings call for HSBC, speakers are CFO and analysts",
    "transcription": "噉，Hello 喂你好你好，你喺邊買先？Hello Brian，Hello hello，聽到話 …  (~18,400 chars)",
    "jyutping_transcription": "seng4 jat6 dou1 caau2 zok3 nei1 di1 siu1 sik1 ge3 waa2 …  (~56,500 chars)",
    "fused_transcription": "(post-fusion homophone-corrected version, same length as transcription)",
    "processing_time": 311.63,
    "transcription_time": 29.8,
    "fusion_time": 272.47,
    "credits_used": 5648
  }
}

Use fused_transcription for the final corrected text when fusion ran. Fusion applies LLM-based homophone correction, biased by the corpora and any caller-supplied context hint. Fall back to transcription (the raw WhisperX output) when fusion was skipped or fused_transcription is empty.
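
The fallback rule above can be written as a single jq filter. This is a minimal sketch, not part of the API: it assumes jq is installed (the full example already relies on it) and that a skipped fusion shows up as a missing, null, or empty fused_transcription field. The function name pick_final_text is hypothetical.

```shell
# pick_final_text: read a poll response on stdin and print the best text.
# Prefers fused_transcription; falls back to transcription when the fused
# field is missing, null, or empty (assumed here to mean fusion did not run).
pick_final_text() {
  jq -r '
    .transcription
    | if ((.fused_transcription // "") | length) > 0
      then .fused_transcription
      else .transcription
      end
  '
}
```

With the full example above, `echo "$RESULT" | pick_final_text` would print the text to show to the user.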

When failed:

{
  "success": true,
  "transcription": {
    "request_id": "873d6cbc-c273-436a-a2c9-16a0e91caaa8",
    "status": "failed",
    "duration_s": 2824,
    "output_language": "cantonese",
    "error": "fusion timed out: chunk 4/6 exceeded 180s budget",
    "processing_time": 198.4
  }
}

Other Response Shapes

The response shape inside transcription matches the synchronous endpoint exactly, so all the variations below apply when you poll a completed request that used those flags:

with_timestamp = true

{
  "text": "1\n00:00:01,032 --> 00:00:04,083\nWhen you call someone who is thousands of\n\n2\n00:00:04,083 --> 00:00:04,868\n miles away, you're using a satellite.\n\n",
  "duration": "6.540000",
  "process_time": "1.86"
}
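
The text field in this sample looks like SubRip (SRT) cues: an index line, a "start --> end" line, then the cue text. Under that assumption, the sketch below turns the cues into tab-separated rows with awk. The function name srt_to_tsv is hypothetical, and the layout is inferred from the sample only.

```shell
# srt_to_tsv: read SRT text on stdin, print "start<TAB>end<TAB>text" per cue.
# Assumes each cue is: index line, "start --> end" line, one or more text
# lines, with a blank line between cues (as in the sample above).
srt_to_tsv() {
  awk '
    BEGIN { RS = ""; FS = "\n" }        # paragraph mode: one cue per record
    NF >= 3 {
      split($2, t, " --> ")             # t[1] = start time, t[2] = end time
      text = $3
      for (i = 4; i <= NF; i++) text = text " " $i
      sub(/^[ \t]+/, "", text)          # drop leading whitespace
      printf "%s\t%s\t%s\n", t[1], t[2], text
    }
  '
}
```

Feed it the decoded string, for example by extracting the field with `jq -r` first so the `\n` escapes become real newlines.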

with_diarization = true

{
  "text": "When you call someone who is thousands of miles away, you're using a satellite.",
  "diarization": "SPEAKER_00: When you call someone who is thousands of miles away, you're using a satellite.",
  "duration": "6.540000",
  "process_time": "0.19"
}
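
The diarization field prefixes the text with a speaker label. The sample shows a single speaker only, so the sketch below makes an assumption that is not confirmed by this page: each speaker turn sits on its own line in the form "SPEAKER_NN: text". The function name split_speakers is hypothetical.

```shell
# split_speakers: read diarization text on stdin and print
# "speaker<TAB>text" for each line shaped like "SPEAKER_NN: text".
# Lines that do not match that shape are skipped.
split_speakers() {
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      SPEAKER_*": "*)
        # Label is everything before the first ": "; text is the rest.
        printf '%s\t%s\n' "${line%%: *}" "${line#*: }"
        ;;
    esac
  done
}
```

If the service separates turns differently, only the case pattern and the two parameter expansions need to change.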

Best Practices

Always use async for long audio. A 30-minute recording with fusion enabled can run 3–10 minutes server-side, well past the 100-second timeout of the Cloudflare proxy that fronts the synchronous endpoint.
Don't poll faster than every 5 seconds. The transcription completes in seconds-to-minutes, not milliseconds. Aggressive polling burns rate-limit headroom without delivering results faster.
Cap your polling time. If a request stays in processing for more than 30 minutes, treat it as stuck and surface an error to the user.
Save request_id on your side. If your poll loop crashes you can resume by polling the same request_id later — work continues server-side.
Handle failed gracefully. Read transcription.error for the reason. Failures are not auto-retried.
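
The practices above (5-second floor, 30-minute cap, saved request_id) can be combined in one small helper. This is a sketch under stated assumptions, not an official client: the state file path, the linear backoff schedule, and every function name are hypothetical; the poll request itself mirrors the full example.

```shell
# Assumptions: API_KEY and BASE as in the full example; STATE_FILE and all
# function names below are hypothetical.
API_KEY="${API_KEY:-YOUR_API_KEY}"
BASE="${BASE:-https://cantonese.ai}"
STATE_FILE="${STATE_FILE:-./stt_request_id.txt}"

# Persist the request_id so a crashed poll loop can resume later.
save_request_id() { printf '%s\n' "$1" > "$STATE_FILE"; }
load_request_id() { [ -f "$STATE_FILE" ] && cat "$STATE_FILE"; }

# Print the current transcription.status for a request_id.
fetch_status() {
  curl -s -G "$BASE/api/speech-to-text/get-result" \
    --data-urlencode "id=$1" \
    -H "x-api-key: $API_KEY" | jq -r '.transcription.status'
}

# Poll until completed or failed, or give up after max_wait_s seconds.
# Prints the final status, or "timeout" (exit code 1) when the cap is hit.
# waited_s counts sleep time only, so the real cap is slightly longer.
poll_until_done() {
  local request_id="$1"
  local max_wait_s="${2:-1800}"   # 30-minute cap
  local interval_s=5              # start at the 5-second floor
  local waited_s=0
  local status

  while [ "$waited_s" -lt "$max_wait_s" ]; do
    status="$(fetch_status "$request_id")"
    case "$status" in
      completed|failed) printf '%s\n' "$status"; return 0 ;;
    esac
    sleep "$interval_s"
    waited_s=$((waited_s + interval_s))
    # Linear backoff: 5 s, 10 s, then 15 s for every later poll.
    if [ "$interval_s" -lt 15 ]; then
      interval_s=$((interval_s + 5))
    fi
  done
  printf 'timeout\n'
  return 1
}
```

A caller would run `save_request_id "$REQUEST_ID"` right after submitting, then `poll_until_done "$(load_request_id)"`, which also works after a restart because the id is read back from disk.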

Status Codes

Status Code	Description
200	Success. Inspect `transcription.status` to know whether to keep polling.
401	Unauthorized — bad or missing `x-api-key`.
200 + success: false	`id` not found, or it doesn't belong to your account.
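
Because a 200 response can still mean "not found", a client has to look at three things: the HTTP code, the success flag, and transcription.status. The sketch below maps them to a next step. The function name next_action and its output labels are hypothetical, and treating every other HTTP code as "unexpected" is an assumption, since only 200 and 401 are listed above.

```shell
# next_action HTTP_CODE SUCCESS STATUS
# Maps one poll response to what the caller should do next. Prints one of:
# unauthorized, not_found, done, failed, keep_polling, unexpected.
next_action() {
  local http_code="$1"
  local success="$2"
  local status="$3"

  case "$http_code" in
    401) echo unauthorized; return ;;
    200) ;;                          # fall through to the body checks
    *)   echo unexpected; return ;;
  esac

  if [ "$success" != "true" ]; then
    echo not_found                   # id unknown or owned by another account
    return
  fi

  case "$status" in
    completed)          echo done ;;
    failed)             echo failed ;;
    pending|processing) echo keep_polling ;;
    *)                  echo unexpected ;;
  esac
}
```

The HTTP code can be captured with curl's `-w '%{http_code}'` option, and the other two values with `jq -r '.success'` and `jq -r '.transcription.status'` on the response body.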