Real-time Speech-to-Text

Transcribe speech to text in real time over a persistent Socket.IO connection. The client streams audio to the server, and the server returns transcription results as they become available.

How It Works

1. Connect to the Socket.IO endpoint at `/api/stt-realtime-socket` with your API key.
2. Emit a `start-session` event to begin a transcription session.
3. After the server emits `session-started`, stream audio by emitting `audio-chunk` events with base64-encoded PCM audio data.
4. Receive transcription results via the `transcript-message` event.
5. Emit `stop-session` to end the session.

Connection

Connect using a Socket.IO client with the following configuration:

Parameter	Value
path	`/api/stt-realtime-socket`
extraHeaders.x-api-key	Your API key

Client Events (Client → Server)

Event	Payload	Description
start-session	{ sampleRate: number }	Start a new transcription session. Sample rate should match your audio source (e.g. 16000).
audio-chunk	{ audio_base_64: string, sample_rate: number }	Send a chunk of audio data. Audio must be base64-encoded 16-bit PCM.
stop-session	(none)	Stop the current transcription session and close the upstream connection.
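
The audio-chunk payload in the table above can be produced from a buffer of raw PCM as in the following minimal sketch. The helper name `emitPcmChunks` and the 100 ms chunk duration are assumptions made for illustration; the documentation does not specify a required chunk size. The `socket` argument is anything with an `emit` method, such as a connected Socket.IO client.

```javascript
// Assumed chunk duration; not a documented requirement.
const CHUNK_MS = 100;

// Split a Buffer of 16-bit mono PCM into audio-chunk events.
// Returns the number of chunks emitted.
function emitPcmChunks(socket, pcm, sampleRate) {
  const bytesPerSample = 2; // 16-bit PCM, one channel
  const chunkBytes =
    Math.floor((sampleRate * CHUNK_MS) / 1000) * bytesPerSample;
  let sent = 0;
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    // subarray clamps at the end of the buffer, so the last chunk may be shorter
    const chunk = pcm.subarray(offset, offset + chunkBytes);
    socket.emit("audio-chunk", {
      audio_base_64: chunk.toString("base64"),
      sample_rate: sampleRate,
    });
    sent += 1;
  }
  return sent;
}
```

This sketch emits every chunk immediately. With a live source such as a microphone, you would instead emit each chunk as it is captured.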

Server Events (Server → Client)

Event	Payload	Description
session-started	(none)	The transcription session is ready. You can start sending audio chunks.
transcript-message	object (see below)	A transcription result from the upstream service.
session-error	{ error: string }	An error occurred during the session.
session-closed	{ code: number, reason: string }	The upstream connection was closed. Code 1000 indicates normal closure.

Transcript Message Types

The transcript-message event payload contains a message_type field indicating the type of transcription result:

message_type	Description
session_started	The upstream transcription session has started. Contains a `session_id`.
partial_transcript	An in-progress transcription that may change. Contains `text`.
committed_transcript	A finalized transcription. Contains `text`.
committed_transcript_with_timestamps	A finalized transcription with word-level timestamps.
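
One way to consume these message types is to keep the latest partial text separate from the list of finalized segments, as in the sketch below. The helper `createTranscriptTracker` and its state shape are hypothetical. The sketch ignores `committed_transcript_with_timestamps` because its field layout is not described here.

```javascript
// Track transcript state from transcript-message payloads.
function createTranscriptTracker() {
  const state = { sessionId: null, committed: [], partial: "" };

  function handle(message) {
    switch (message.message_type) {
      case "session_started":
        state.sessionId = message.session_id;
        break;
      case "partial_transcript":
        // Partials may change, so each one replaces the previous partial.
        state.partial = message.text;
        break;
      case "committed_transcript":
        // A committed result is final: store it and clear the partial.
        state.committed.push(message.text);
        state.partial = "";
        break;
      default:
        // Other message types are ignored in this sketch.
        break;
    }
    return state;
  }

  return { handle, state };
}
```

To use it, pass each payload to the tracker, for example `socket.on("transcript-message", tracker.handle)`.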

Audio Format

Audio must be sent as base64-encoded 16-bit PCM (signed integer, little-endian). The recommended format is mono audio at a 16000 Hz sample rate. If your audio source uses a different sample rate, pass the actual rate in both events: as `sampleRate` in start-session and as `sample_rate` in audio-chunk.
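
Many audio sources produce floating-point samples in the range -1 to 1 rather than 16-bit integers. The sketch below shows one common way to convert such samples to the format described above. The function name `floatToPcm16Base64` is hypothetical, and the input is assumed to be mono.

```javascript
// Convert float samples in [-1, 1] to base64-encoded
// signed 16-bit little-endian PCM.
function floatToPcm16Base64(samples) {
  const buf = Buffer.alloc(samples.length * 2); // 2 bytes per sample
  for (let i = 0; i < samples.length; i++) {
    // Clamp out-of-range input so it does not overflow int16.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Scale negative and positive halves to the asymmetric int16 range.
    const v = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767);
    buf.writeInt16LE(v, i * 2);
  }
  return buf.toString("base64");
}
```

The returned string can be used directly as the `audio_base_64` field of an audio-chunk event.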

Example


const { io } = require("socket.io-client");

const API_KEY = "YOUR_API_KEY";
const BASE_URL = "https://cantonese.ai";

// Connect to the real-time STT socket
const socket = io(BASE_URL, {
  path: "/api/stt-realtime-socket",
  extraHeaders: {
    "x-api-key": API_KEY,
  },
});

socket.on("connect", () => {
  console.log("Connected to real-time STT");
  // Start a transcription session
  socket.emit("start-session", { sampleRate: 16000 });
});

socket.on("session-started", () => {
  console.log("Session started, ready to receive audio");
  // Send audio chunks (base64-encoded PCM)
  // socket.emit("audio-chunk", {
  //   audio_base_64: "<base64-encoded-pcm-audio>",
  //   sample_rate: 16000,
  // });
});

socket.on("transcript-message", (message) => {
  switch (message.message_type) {
    case "partial_transcript":
      process.stdout.write(`\rPartial: ${message.text}`);
      break;
    case "committed_transcript":
      console.log(`\nFinal: ${message.text}`);
      break;
  }
});

socket.on("session-error", (data) => {
  console.error("Session error:", data.error);
});

socket.on("session-closed", (data) => {
  console.log("Session closed:", data.code, data.reason);
});

// To stop the session:
// socket.emit("stop-session");
// socket.disconnect();

Status Codes

The Socket.IO handshake endpoint returns standard HTTP status codes:

Status Code	Description
200	Success - Socket connection established
401	Unauthorized - Invalid or missing API key
500	Internal Server Error - Server encountered an unexpected condition
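
A failed handshake surfaces on the Socket.IO client as a `connect_error` event. The sketch below maps the status codes in the table above to readable messages. The helper name `describeHandshakeError` is hypothetical. Reading the HTTP status from `err.description` is an assumption: it depends on the client version and transport, so the code falls back to `err.message` when no numeric status is present.

```javascript
// Messages taken from the status code table.
const HANDSHAKE_STATUS = {
  200: "Success - Socket connection established",
  401: "Unauthorized - Invalid or missing API key",
  500: "Internal Server Error - Server encountered an unexpected condition",
};

// Build a readable message from a connect_error error object.
function describeHandshakeError(err) {
  // Assumption: err.description may carry the HTTP status as a number.
  const status = typeof err.description === "number" ? err.description : null;
  if (status !== null && HANDSHAKE_STATUS[status]) {
    return HANDSHAKE_STATUS[status];
  }
  return `Connection failed: ${err.message}`;
}

// Usage with a Socket.IO client:
// socket.on("connect_error", (err) => {
//   console.error(describeHandshakeError(err));
// });
```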