cantonese.ai API Reference

Real-time Speech-to-Text

Transcribe speech to text in real time over a persistent Socket.IO connection. The client streams audio to the server, and transcription results are returned as they become available.

How It Works

  1. Connect to the Socket.IO endpoint at /api/stt-realtime-socket with your API key.
  2. Emit a start-session event to begin a transcription session.
  3. Stream audio by emitting audio-chunk events with base64-encoded PCM audio data.
  4. Receive transcription results via the transcript-message event.
  5. Emit stop-session to end the session.
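The steps above imply an ordering: audio should only be emitted once the session is ready. The following is a minimal sketch of one way to enforce that on the client. `ChunkGate` is a hypothetical helper, not part of the API, and the `emit` callback is assumed to wrap `socket.emit`.

```javascript
// Hypothetical helper: holds audio-chunk payloads until the session is ready,
// then flushes them in the order they were queued.
class ChunkGate {
  constructor(emit) {
    this.emit = emit; // e.g. (event, payload) => socket.emit(event, payload)
    this.ready = false;
    this.pending = [];
  }

  // Call from the session-started handler.
  open() {
    this.ready = true;
    for (const payload of this.pending) {
      this.emit("audio-chunk", payload);
    }
    this.pending = [];
  }

  // Call from the session-closed or session-error handler.
  // Dropping queued audio here is a design choice of this sketch.
  close() {
    this.ready = false;
    this.pending = [];
  }

  // Emit immediately when the session is ready, otherwise queue.
  send(payload) {
    if (this.ready) {
      this.emit("audio-chunk", payload);
    } else {
      this.pending.push(payload);
    }
  }
}
```

With this in place, capture code can call `gate.send(...)` as soon as audio is available, and the `session-started` handler only needs to call `gate.open()`.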

Connection

Connect using a Socket.IO client with the following configuration:

Parameter | Value
path | /api/stt-realtime-socket
extraHeaders.x-api-key | Your API key

Client Events (Client → Server)

Event | Payload | Description
start-session | { sampleRate: number } | Start a new transcription session. Sample rate should match your audio source (e.g. 16000).
audio-chunk | { audio_base_64: string, sample_rate: number } | Send a chunk of audio data. Audio must be base64-encoded 16-bit PCM.
stop-session | (none) | Stop the current transcription session and close the upstream connection.
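To illustrate the audio-chunk payload shape, here is a minimal sketch that splits a Buffer of 16-bit mono PCM into payloads ready to emit. `toAudioChunkPayloads` is a hypothetical helper, and the 100 ms default chunk length is an arbitrary illustrative value, not a documented requirement.

```javascript
// Hypothetical helper: split raw 16-bit mono PCM into audio-chunk payloads.
// chunkMs defaults to 100 ms, which is an assumption made for illustration.
function toAudioChunkPayloads(pcm, sampleRate, chunkMs = 100) {
  const bytesPerSample = 2; // 16-bit PCM, one channel
  const samplesPerChunk = Math.floor((sampleRate * chunkMs) / 1000);
  const chunkBytes = samplesPerChunk * bytesPerSample;
  if (chunkBytes <= 0) {
    throw new RangeError("chunkMs is too small for the given sample rate");
  }

  const payloads = [];
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    // subarray clamps at the end of the buffer, so the last chunk may be shorter.
    const slice = pcm.subarray(offset, offset + chunkBytes);
    payloads.push({
      audio_base_64: slice.toString("base64"),
      sample_rate: sampleRate,
    });
  }
  return payloads;
}
```

Each returned object can then be passed to `socket.emit("audio-chunk", payload)`.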

Server Events (Server → Client)

Event | Payload | Description
session-started | (none) | The transcription session is ready. You can start sending audio chunks.
transcript-message | object (see below) | A transcription result from the upstream service.
session-error | { error: string } | An error occurred during the session.
session-closed | { code: number, reason: string } | The upstream connection was closed. Code 1000 indicates normal closure.
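Since session-closed reports code 1000 for a normal closure, a client may want to tell an expected shutdown apart from an unexpected one. The sketch below shows one way to do that. `summarizeSessionClosed` is a hypothetical helper, and treating every code other than 1000 as abnormal is an assumption, because no other codes are listed here.

```javascript
// Hypothetical helper: classify a session-closed payload.
// Assumption: any code other than 1000 is treated as an abnormal closure.
function summarizeSessionClosed({ code, reason }) {
  const normal = code === 1000;
  const detail = reason ? reason : "no reason given";
  return {
    normal,
    message: normal
      ? "Session closed normally"
      : `Session closed unexpectedly (code ${code}): ${detail}`,
  };
}
```

A session-closed handler could log `summary.message` and decide whether to start a new session based on `summary.normal`.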

Transcript Message Types

The transcript-message event payload contains a message_type field indicating the type of transcription result:

message_type | Description
session_started | The upstream transcription session has started. Contains a session_id.
partial_transcript | An in-progress transcription that may change. Contains text.
committed_transcript | A finalized transcription. Contains text.
committed_transcript_with_timestamps | A finalized transcription with word-level timestamps.
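Because partial transcripts may change while committed ones are final, a client typically keeps them separate. The sketch below shows one possible way to fold incoming messages into a running state. `applyTranscriptMessage` and the state shape are hypothetical. The timestamped variant is deliberately ignored, since its payload fields and its relationship to committed_transcript are not described here.

```javascript
// Hypothetical reducer: fold a transcript-message payload into client state.
// State shape (an assumption): { sessionId, committed: string[], partial: string }
function applyTranscriptMessage(state, message) {
  switch (message.message_type) {
    case "session_started":
      return { ...state, sessionId: message.session_id };
    case "partial_transcript":
      // Partials replace each other; only the latest one is kept.
      return { ...state, partial: message.text };
    case "committed_transcript":
      // A committed result is final, so append it and clear the pending partial.
      return {
        ...state,
        committed: [...state.committed, message.text],
        partial: "",
      };
    default:
      // Other message types (including the timestamped variant) leave state unchanged.
      return state;
  }
}
```

Starting from `{ sessionId: null, committed: [], partial: "" }`, the transcript-message handler can call this once per message and render `committed.join("")` followed by the current partial.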

Audio Format

Audio must be sent as base64-encoded 16-bit PCM (signed integer, little-endian). The recommended format is mono at a 16000 Hz sample rate. If your audio source uses a different sample rate, pass the actual rate as sampleRate in the start-session event and as sample_rate in each audio-chunk event.
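Many audio sources produce floating-point samples rather than 16-bit integers. As a sketch of the required encoding, the function below converts samples in the range [-1, 1] to base64-encoded signed 16-bit little-endian PCM in Node.js. `floatTo16BitPcmBase64` is a hypothetical helper name, and the float input format is an assumption about the audio source.

```javascript
// Hypothetical helper: encode float samples in [-1, 1] as base64 16-bit LE PCM.
function floatTo16BitPcmBase64(samples) {
  const buffer = Buffer.alloc(samples.length * 2); // 2 bytes per sample
  for (let i = 0; i < samples.length; i++) {
    // Clamp so out-of-range input cannot overflow the int16 range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Scale the negative and positive halves separately: int16 spans -32768..32767.
    const value = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
    buffer.writeInt16LE(value, i * 2);
  }
  return buffer.toString("base64");
}
```

The returned string can be used directly as the audio_base_64 field of an audio-chunk payload.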

Example

const { io } = require("socket.io-client");

const API_KEY = "YOUR_API_KEY";
const BASE_URL = "https://cantonese.ai";

// Connect to the real-time STT socket
const socket = io(BASE_URL, {
  path: "/api/stt-realtime-socket",
  extraHeaders: {
    "x-api-key": API_KEY,
  },
});

socket.on("connect", () => {
  console.log("Connected to real-time STT");
  // Start a transcription session
  socket.emit("start-session", { sampleRate: 16000 });
});

socket.on("session-started", () => {
  console.log("Session started, ready to receive audio");
  // Send audio chunks (base64-encoded PCM)
  // socket.emit("audio-chunk", {
  //   audio_base_64: "<base64-encoded-pcm-audio>",
  //   sample_rate: 16000,
  // });
});

socket.on("transcript-message", (message) => {
  switch (message.message_type) {
    case "partial_transcript":
      process.stdout.write(`\rPartial: ${message.text}`);
      break;
    case "committed_transcript":
      console.log(`\nFinal: ${message.text}`);
      break;
  }
});

socket.on("session-error", (data) => {
  console.error("Session error:", data.error);
});

socket.on("session-closed", (data) => {
  console.log("Session closed:", data.code, data.reason);
});

// To stop the session:
// socket.emit("stop-session");
// socket.disconnect();

Status Codes

The Socket.IO handshake endpoint returns standard HTTP status codes:

Status Code | Description
200 | Success - Socket connection established
401 | Unauthorized - Invalid or missing API key
500 | Internal Server Error - Server encountered an unexpected condition
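A client may want to turn these codes into actionable messages. The sketch below maps a handshake status to a short description. `describeHandshakeStatus` is a hypothetical helper, and the `retryable` flag reflects an assumption of this sketch (a bad API key will not fix itself, while a server error might be transient), not documented behavior. How the status code reaches your code depends on the Socket.IO client version and transport, so the function only takes a status you have already obtained.

```javascript
// Hypothetical helper: describe a handshake HTTP status from the table above.
// The retryable flag is an assumption of this sketch, not documented behavior.
function describeHandshakeStatus(status) {
  switch (status) {
    case 200:
      return { ok: true, retryable: false, message: "Socket connection established" };
    case 401:
      return { ok: false, retryable: false, message: "Invalid or missing API key" };
    case 500:
      return { ok: false, retryable: true, message: "Server encountered an unexpected condition" };
    default:
      return { ok: false, retryable: false, message: `Unexpected status ${status}` };
  }
}
```

For example, a `connect_error` handler could call this with the status it extracted and stop reconnecting when `retryable` is false.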