Real-time Speech-to-Text
Transcribe speech to text in real time over a persistent Socket.IO connection. The client streams audio to the server, and transcription results are returned as they become available.
How It Works
- Connect to the Socket.IO endpoint at /api/stt-realtime-socket with your API key.
- Emit a start-session event to begin a transcription session.
- Stream audio by emitting audio-chunk events with base64-encoded PCM audio data.
- Receive transcription results via the transcript-message event.
- Emit stop-session to end the session.
Connection
Connect using a Socket.IO client with the following configuration:
| Parameter | Value |
|---|---|
| path | /api/stt-realtime-socket |
| extraHeaders.x-api-key | Your API key |
Client Events (Client → Server)
| Event | Payload | Description |
|---|---|---|
| start-session | { sampleRate: number } | Start a new transcription session. Sample rate should match your audio source (e.g. 16000). |
| audio-chunk | { audio_base_64: string, sample_rate: number } | Send a chunk of audio data. Audio must be base64-encoded 16-bit PCM. |
| stop-session | (none) | Stop the current transcription session and close the upstream connection. |
Server Events (Server → Client)
| Event | Payload | Description |
|---|---|---|
| session-started | (none) | The transcription session is ready. You can start sending audio chunks. |
| transcript-message | object (see below) | A transcription result from the upstream service. |
| session-error | { error: string } | An error occurred during the session. |
| session-closed | { code: number, reason: string } | The upstream connection was closed. Code 1000 indicates normal closure. |
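The server events above suggest a simple client-side lifecycle: no audio before session-started, and none after session-error or session-closed. The following is a minimal sketch of that idea, not part of the API. The helper name trackSession and the state labels ("idle", "ready", "error", "closed") are hypothetical; only the event names and payload fields come from the tables above.

```javascript
// Hypothetical helper: tracks session state from the documented server events.
// `socket` is any object with Socket.IO-style on() and emit() methods.
function trackSession(socket) {
  const session = { state: "idle", lastError: null, closeCode: null };

  socket.on("session-started", () => {
    session.state = "ready";
  });

  socket.on("session-error", (data) => {
    session.state = "error";
    session.lastError = data.error;
  });

  socket.on("session-closed", (data) => {
    session.closeCode = data.code; // 1000 indicates normal closure
    // Keep "error" if an error was already reported; otherwise mark closed.
    if (session.state !== "error") session.state = "closed";
  });

  // Emits an audio chunk only while the session is ready.
  // Returns true if the chunk was sent, false if it was dropped.
  session.sendAudio = (audioBase64, sampleRate) => {
    if (session.state !== "ready") return false;
    socket.emit("audio-chunk", {
      audio_base_64: audioBase64,
      sample_rate: sampleRate,
    });
    return true;
  };

  return session;
}
```

Dropping chunks while the session is not ready is one possible policy; buffering them until session-started arrives would be an equally reasonable choice.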
Transcript Message Types
The transcript-message event payload contains a message_type field indicating the type of transcription result:
| message_type | Description |
|---|---|
| session_started | The upstream transcription session has started. Contains a session_id. |
| partial_transcript | An in-progress transcription that may change. Contains text. |
| committed_transcript | A finalized transcription. Contains text. |
| committed_transcript_with_timestamps | A finalized transcription with word-level timestamps. |
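Because partial transcripts may change while committed ones are final, a client typically keeps the two apart. The sketch below shows one way to fold transcript-message payloads into a running transcript. The function names and the shape of the state object are hypothetical. The payload fields of committed_transcript_with_timestamps are not specified above, so the sketch deliberately ignores that type rather than guess at its structure.

```javascript
// Hypothetical helpers: fold transcript-message payloads into a running transcript.
function createTranscriptState() {
  return { sessionId: null, committed: [], partial: "" };
}

function applyTranscriptMessage(state, message) {
  switch (message.message_type) {
    case "session_started":
      state.sessionId = message.session_id;
      break;
    case "partial_transcript":
      // Partials may change, so each one replaces the previous partial.
      state.partial = message.text;
      break;
    case "committed_transcript":
      // A committed transcript is final: store it and clear the partial.
      state.committed.push(message.text);
      state.partial = "";
      break;
    default:
      // Other types (e.g. committed_transcript_with_timestamps) are ignored here.
      break;
  }
  return state;
}
```

In the transcript-message handler this could be used as `applyTranscriptMessage(state, message)`, with the display rebuilt from `state.committed` and `state.partial` after each call.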
Audio Format
Audio must be sent as base64-encoded 16-bit PCM (signed integer, little-endian). The recommended format is 16000 Hz, mono. If your audio source uses a different sample rate, pass the actual rate in both the start-session event (sampleRate) and every audio-chunk event (sample_rate).
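Many audio sources produce floating-point samples rather than 16-bit integers. As an illustration of the format described above, the following sketch converts float samples to base64-encoded 16-bit signed little-endian PCM in Node.js. The helper name floatToPcm16Base64 is hypothetical, and the sketch assumes mono input with samples in the range -1 to 1.

```javascript
// Hypothetical helper: converts float samples in [-1, 1] (array or Float32Array)
// to base64-encoded 16-bit signed little-endian PCM.
function floatToPcm16Base64(samples) {
  const buf = Buffer.alloc(samples.length * 2); // 2 bytes per sample
  for (let i = 0; i < samples.length; i++) {
    // Clamp out-of-range values, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    const v = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767);
    buf.writeInt16LE(v, i * 2);
  }
  return buf.toString("base64");
}
```

The returned string can be used directly as the audio_base_64 field of an audio-chunk event.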
Example
```javascript
const { io } = require("socket.io-client");

const API_KEY = "YOUR_API_KEY";
const BASE_URL = "https://cantonese.ai";

// Connect to the real-time STT socket
const socket = io(BASE_URL, {
  path: "/api/stt-realtime-socket",
  extraHeaders: {
    "x-api-key": API_KEY,
  },
});

socket.on("connect", () => {
  console.log("Connected to real-time STT");
  // Start a transcription session
  socket.emit("start-session", { sampleRate: 16000 });
});

socket.on("session-started", () => {
  console.log("Session started, ready to receive audio");
  // Send audio chunks (base64-encoded PCM)
  // socket.emit("audio-chunk", {
  //   audio_base_64: "<base64-encoded-pcm-audio>",
  //   sample_rate: 16000,
  // });
});

socket.on("transcript-message", (message) => {
  switch (message.message_type) {
    case "partial_transcript":
      process.stdout.write(`\rPartial: ${message.text}`);
      break;
    case "committed_transcript":
      console.log(`\nFinal: ${message.text}`);
      break;
  }
});

socket.on("session-error", (data) => {
  console.error("Session error:", data.error);
});

socket.on("session-closed", (data) => {
  console.log("Session closed:", data.code, data.reason);
});

// To stop the session:
// socket.emit("stop-session");
// socket.disconnect();
```
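The example leaves the audio-chunk call commented out. As a rough sketch of how raw audio might be fed into it, the helper below splits a Buffer of 16-bit mono PCM into audio-chunk payloads. The helper name toAudioChunks and the 100 ms default chunk duration are assumptions made for illustration; this page does not specify a required or recommended chunk size.

```javascript
// Hypothetical helper: splits raw 16-bit mono PCM (a Node.js Buffer) into
// audio-chunk payloads of roughly `chunkMs` milliseconds each.
function toAudioChunks(pcm, sampleRate, chunkMs = 100) {
  const bytesPerSample = 2; // 16-bit PCM
  const chunkBytes =
    Math.floor((sampleRate * chunkMs) / 1000) * bytesPerSample;
  if (chunkBytes <= 0) {
    throw new RangeError("chunkMs is too small for the given sample rate");
  }

  const payloads = [];
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    // subarray() clamps at the end, so the last chunk may be shorter.
    const slice = pcm.subarray(offset, offset + chunkBytes);
    payloads.push({
      audio_base_64: slice.toString("base64"),
      sample_rate: sampleRate,
    });
  }
  return payloads;
}

// Usage sketch inside the "session-started" handler (file name is a placeholder):
// const pcm = require("fs").readFileSync("audio.raw");
// for (const payload of toAudioChunks(pcm, 16000)) {
//   socket.emit("audio-chunk", payload);
// }
```

Splitting on a whole number of samples keeps every chunk aligned to 2-byte sample boundaries, so no sample is cut in half between two events.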
Status Codes
The Socket.IO handshake endpoint returns standard HTTP status codes:
| Status Code | Description |
|---|---|
| 200 | Success - Socket connection established |
| 401 | Unauthorized - Invalid or missing API key |
| 500 | Internal Server Error - Server encountered an unexpected condition |
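On the client side, a failed handshake surfaces through the Socket.IO connect_error event rather than as a plain HTTP response. The sketch below maps the status codes in the table to readable messages. The helper name describeHandshakeStatus is hypothetical, and reading the HTTP status from err.description is an assumption: whether and where the status appears on the error object can vary with the socket.io-client version and transport.

```javascript
// Hypothetical helper: maps the documented handshake status codes to messages.
function describeHandshakeStatus(status) {
  switch (status) {
    case 200:
      return "Socket connection established";
    case 401:
      return "Unauthorized: invalid or missing API key";
    case 500:
      return "Internal server error";
    default:
      return `Unexpected handshake status: ${status}`;
  }
}

// Usage sketch with a `socket` created by socket.io-client:
// socket.on("connect_error", (err) => {
//   // Assumption: err.description carries the HTTP status of the handshake.
//   console.error("Handshake failed:", err.message);
//   console.error(describeHandshakeStatus(err.description));
// });
```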