Speech-to-Text (Async)
Async mode lets you submit an audio file and fetch the transcription later. Use this for any recording that takes more than ~90 seconds of server-side processing — typically anything longer than ~10 minutes of audio with fusion enabled. The synchronous endpoint will time out behind Cloudflare's 100-second proxy limit for those workloads.
- Synchronous (/speech-to-text): a single blocking call. Convenient for clips under ~10 min.
- Async (this page): submit + poll. Required for long-form audio (interviews, podcasts, meetings, sermons).
How It Works
- POST to /api/stt with wait_for_completion=false. The endpoint returns within seconds with { request_id, status: "processing" }. The transcription continues server-side independently of the HTTP connection.
- Poll GET /api/speech-to-text/get-result?id=<request_id> every 5–15 seconds. Each poll is a fast round trip, so it is Cloudflare-safe.
- Stop polling once transcription.status is completed or failed. The full transcription is on the same response.
Step 1 — Submit (POST /api/stt)
Same endpoint and parameters as the synchronous version. The only difference is wait_for_completion=false, which is already the default and is listed here for clarity.
Immediate response (returns in seconds):
{
"success": true,
"data": {
"request_id": "873d6cbc-c273-436a-a2c9-16a0e91caaa8",
"status": "processing"
}
}
Save the request_id; you will need it for every poll.
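If the submit call fails, the response may carry no request_id, and `jq -r` then prints the literal string `null`. A minimal guard, assuming the ID has already been extracted as in the full example on this page, might look like the sketch below. The function name `require_request_id` is hypothetical and not part of the API.

```bash
# require_request_id ID
# Hypothetical helper: prints ID when it looks usable, otherwise writes a
# message to stderr and returns 1 so the caller can stop before polling.
require_request_id() {
  case "$1" in
    ""|null)
      echo "submit failed: no request_id in response" >&2
      return 1
      ;;
    *)
      printf '%s\n' "$1"
      ;;
  esac
}
```

A caller could run it right after extracting the ID from the submit response and abort when it returns non-zero.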
Step 2 — Poll (GET /api/speech-to-text/get-result)
| Field | Value |
|---|---|
| Method | GET |
| Path | /api/speech-to-text/get-result |
| Query | id=<request_id> |
| Auth | x-api-key header — same key used to submit the request |
| Suggested poll interval | 5–15 seconds (longer is fine; backoff acceptable) |
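The table allows backoff within the suggested interval. One way to do that, sketched here as an assumption rather than a documented policy, is to grow the delay by 50% after each poll and clamp it to the 5–15 second window. The helper name `next_poll_delay` and the growth factor are illustrative choices.

```bash
# next_poll_delay CURRENT_DELAY
# Hypothetical helper: prints the delay (whole seconds) to use before the
# next poll. Grows by 50% per call and clamps to the 5-15 second window.
next_poll_delay() {
  current=$1
  next=$(( current + current / 2 ))
  if [ "$next" -lt 5 ]; then
    next=5
  fi
  if [ "$next" -gt 15 ]; then
    next=15
  fi
  echo "$next"
}
```

Starting from 5, successive calls yield 7, 10, 15, and then stay at 15.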
Status values
transcription.status takes one of the values below. Only completed and failed are terminal; stop polling on either of them.
| Status | Meaning |
|---|---|
| pending | Initial state — request accepted, transcription not started yet. |
| processing | Transcription and/or fusion currently running. Keep polling. |
| completed | Done — read transcription.transcription / fused_transcription / etc. from the response. |
| failed | An error occurred — see transcription.error. Don't retry automatically; check the message first. |
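A poll loop needs to tell terminal states from in-progress ones, and it helps to treat anything else as an error rather than polling forever. The sketch below assumes the status string has already been extracted from the response; the function name `is_terminal_status` and its return codes are hypothetical.

```bash
# is_terminal_status STATUS
# Returns 0 when polling should stop (completed or failed),
# 1 when polling should continue (pending or processing),
# 2 for any other value, e.g. "null" printed by `jq -r` for a missing field.
is_terminal_status() {
  case "$1" in
    completed|failed)
      return 0
      ;;
    pending|processing)
      return 1
      ;;
    *)
      echo "unexpected status: $1" >&2
      return 2
      ;;
  esac
}
```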
Full Example: Submit + Poll
#!/usr/bin/env bash
set -euo pipefail
API_KEY="YOUR_API_KEY"
BASE="https://cantonese.ai"
# 1. Submit the audio. wait_for_completion=false (default) returns
# immediately with a request_id while the server keeps transcribing.
RESPONSE=$(curl -s -X POST "$BASE/api/stt" \
-F "api_key=$API_KEY" \
-F "with_timestamp=false" \
-F "with_diarization=false" \
-F "wait_for_completion=false" \
-F "context=Quarterly earnings call for HSBC, speakers are CFO and analysts" \
-F "[email protected];type=audio/wav")
REQUEST_ID=$(echo "$RESPONSE" | jq -r '.data.request_id')
echo "submitted: request_id=$REQUEST_ID"
# 2. Poll for the result. Each poll is a fast round trip — Cloudflare-safe.
while true; do
  RESULT=$(curl -s -G "$BASE/api/speech-to-text/get-result" \
    --data-urlencode "id=$REQUEST_ID" \
    -H "x-api-key: $API_KEY")
  STATUS=$(echo "$RESULT" | jq -r '.transcription.status')
  echo "status=$STATUS"
  case "$STATUS" in
    completed|failed) break ;;
  esac
  sleep 5
done
# 3. Final transcription is in the same response.
echo "$RESULT" | jq '.transcription'
Poll Response Examples
While processing:
{
"success": true,
"transcription": {
"request_id": "873d6cbc-c273-436a-a2c9-16a0e91caaa8",
"status": "processing",
"duration_s": 2824,
"output_language": "cantonese",
"backend": "cantonese_ai",
"with_diarization": false,
"include_timestamp": false,
"context": "Quarterly earnings call for HSBC, speakers are CFO and analysts"
}
}
When completed (47-min audio with fusion enabled; same shape as the synchronous response, just delivered via poll):
{
"success": true,
"transcription": {
"request_id": "873d6cbc-c273-436a-a2c9-16a0e91caaa8",
"status": "completed",
"duration_s": 2824,
"output_language": "cantonese",
"backend": "cantonese_ai",
"with_diarization": false,
"include_timestamp": false,
"context": "Quarterly earnings call for HSBC, speakers are CFO and analysts",
"transcription": "噉,Hello 喂你好你好,你喺邊買先?Hello Brian,Hello hello,聽到話 … (~18,400 chars)",
"jyutping_transcription": "seng4 jat6 dou1 caau2 zok3 nei1 di1 siu1 sik1 ge3 waa2 … (~56,500 chars)",
"fused_transcription": "(post-fusion homophone-corrected version, same length as transcription)",
"processing_time": 311.63,
"transcription_time": 29.8,
"fusion_time": 272.47,
"credits_used": 5648
}
}
Use fused_transcription for the final corrected text when fusion ran. It applies LLM-based homophone correction biased by the corpora and any caller-supplied context hint. Fall back to transcription (raw whisperx) when fusion was skipped or empty.
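The fallback rule can be expressed as a small helper. This is a sketch that assumes both fields were already pulled out of the poll response (for instance with `jq -r`, which prints `null` for a missing field); the name `pick_final_text` is hypothetical.

```bash
# pick_final_text FUSED RAW
# Prints FUSED unless it is empty or the literal "null" (what `jq -r`
# prints for a missing field); in that case prints RAW instead.
pick_final_text() {
  fused=$1
  raw=$2
  if [ -n "$fused" ] && [ "$fused" != "null" ]; then
    printf '%s\n' "$fused"
  else
    printf '%s\n' "$raw"
  fi
}

# Example call, assuming RESULT holds a completed poll response:
#   pick_final_text \
#     "$(echo "$RESULT" | jq -r '.transcription.fused_transcription')" \
#     "$(echo "$RESULT" | jq -r '.transcription.transcription')"
```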
When failed:
{
"success": true,
"transcription": {
"request_id": "873d6cbc-c273-436a-a2c9-16a0e91caaa8",
"status": "failed",
"duration_s": 2824,
"output_language": "cantonese",
"error": "fusion timed out: chunk 4/6 exceeded 180s budget",
"processing_time": 198.4
}
}
Other Response Shapes
The response shape inside transcription matches the synchronous endpoint exactly, so all the variations below apply when you poll a completed request that used those flags:
with_timestamp = true
{
"text": "1\n00:00:01,032 --> 00:00:04,083\nWhen you call someone who is thousands of\n\n2\n00:00:04,083 --> 00:00:04,868\n miles away, you're using a satellite.\n\n",
"duration": "6.540000",
"process_time": "1.86"
}
with_diarization = true
{
"text": "When you call someone who is thousands of miles away, you're using a satellite.",
"diarization": "SPEAKER_00: When you call someone who is thousands of miles away, you're using a satellite.",
"duration": "6.540000",
"process_time": "0.19"
}
Best Practices
- Always use async for long audio. A 30-minute recording with fusion enabled can run 3–10 minutes server-side — well past Cloudflare's 100-second proxy timeout that fronts the synchronous endpoint.
- Don't poll faster than every 5 seconds. The transcription completes in seconds-to-minutes, not milliseconds. Aggressive polling burns rate-limit headroom without delivering results faster.
- Cap your polling time. If a request stays in processing for more than 30 minutes, treat it as stuck and surface an error to the user.
- Save request_id on your side. If your poll loop crashes you can resume by polling the same request_id later; work continues server-side.
- Handle failed gracefully. Read transcription.error for the reason. Failures are not auto-retried.
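The polling cap and the failure handling above can be combined in one bounded loop. The sketch below is an illustration under stated assumptions, not the service's reference client: `poll_until_done` is a hypothetical name, and its first argument is any command or function you supply that prints the current transcription.status (for example, a wrapper around the curl and jq calls from the full example).

```bash
# poll_until_done FETCH_CMD MAX_SECONDS INTERVAL
# FETCH_CMD   : caller-supplied command that prints transcription.status
# MAX_SECONDS : give up once this much sleep time has accumulated
# INTERVAL    : seconds to sleep between polls
# Returns 0 on completed, 1 on failed, 2 if the cap is reached first.
# Only sleep time is counted, so the real wall-clock cap is slightly longer.
poll_until_done() {
  fetch_cmd=$1
  max_seconds=$2
  interval=$3
  elapsed=0
  while [ "$elapsed" -le "$max_seconds" ]; do
    poll_status=$("$fetch_cmd")
    case "$poll_status" in
      completed) return 0 ;;
      failed)    return 1 ;;
    esac
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  return 2
}
```

With a 30-minute cap and a 5-second interval the call would be `poll_until_done my_fetch 1800 5`, where `my_fetch` is your own status-fetching function. Because the function only needs the request_id held by that fetcher, writing the ID to disk after submit lets a restarted process resume the same loop.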
Status Codes
| Status Code | Description |
|---|---|
| 200 | Success. Inspect transcription.status to know whether to keep polling. |
| 401 | Unauthorized — bad or missing x-api-key. |
| 200 + success: false | id not found, or it doesn't belong to your account. |
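Since a 200 response can still mean "not found" or "failed", a client has to look at three things: the HTTP code, the success flag, and transcription.status. The following sketch shows one way to fold them into a single outcome. The function name `classify_poll` and the outcome labels are hypothetical, and the inputs are assumed to have been extracted already (the HTTP code, for instance, via curl's `-w '%{http_code}'`).

```bash
# classify_poll HTTP_CODE SUCCESS STATUS
# HTTP_CODE : numeric HTTP status of the poll request
# SUCCESS   : value of .success in the body ("true" or "false")
# STATUS    : value of .transcription.status in the body
# Prints one of: done, failed, wait, not_found, unauthorized, error
classify_poll() {
  http_code=$1
  success=$2
  job_status=$3
  case "$http_code" in
    401) echo unauthorized; return ;;
    200) ;;                       # fall through to body checks
    *)   echo error; return ;;    # codes not listed in the table above
  esac
  if [ "$success" != "true" ]; then
    echo not_found
    return
  fi
  case "$job_status" in
    completed)          echo done ;;
    failed)             echo failed ;;
    pending|processing) echo wait ;;
    *)                  echo error ;;
  esac
}
```

Note that a failed transcription still arrives with success: true, as in the failed example on this page, so the success flag alone does not indicate that a transcript is available.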