Fast mode

Fast mode provides synchronous transcription for individual audio files. The API returns the transcript in the same response once transcription completes.

Key features

Single file processing: Processes one audio file at a time.
Immediate results: Returns results after the transcription completes.
Short recordings: Works best for short recordings.

Fast mode supports one-channel (mono) and two-channel (stereo) audio. Depending on `channel_separation`, the API returns either a single combined transcript or a separate transcript for each channel.

Input

We support two audio input channel formats.

Mono channel
Stereo channel

Output

Single transcript: One aggregated transcript when `channel_separation=false`.
Per-channel transcripts: One aggregated transcript per channel, tagged with its channel, when `channel_separation=true`.
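Before submitting a file, a client might check how many channels it has to decide whether `channel_separation` is worth enabling. The Python sketch below does this for WAV files only, using the standard-library `wave` module. The helper names `choose_channel_separation` and `make_silent_wav` are hypothetical, and MP3, M4A, and MP4 input would need a different decoder.

```python
import io
import wave


def choose_channel_separation(wav_file) -> bool:
    """Return True if a WAV file is stereo, so per-channel transcripts apply.

    `wav_file` may be a path or a binary file-like object. Only WAV is handled,
    because the standard-library `wave` module reads WAV files only.
    """
    with wave.open(wav_file, "rb") as w:
        channels = w.getnchannels()
    # Fast mode accepts mono (1) or stereo (2) audio.
    if channels not in (1, 2):
        raise ValueError(f"expected mono or stereo audio, got {channels} channels")
    return channels == 2


def make_silent_wav(channels: int, seconds: float = 0.1, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory silent 16-bit PCM WAV file (demo data only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    buf.seek(0)  # rewind so the buffer can be read back
    return buf


print(choose_channel_separation(make_silent_wav(1)))  # False
print(choose_channel_separation(make_silent_wav(2)))  # True
```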

Audio formats

WAV, MP3, M4A, and MP4
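A client can reject obviously unsupported files before calling the API. This minimal sketch assumes the file extension is a reasonable proxy for the format. `has_supported_extension` is a hypothetical helper, and the service itself remains the authority on what it accepts.

```python
from pathlib import PurePosixPath

# Formats listed for Fast mode: WAV, MP3, M4A, and MP4.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".mp4"}


def has_supported_extension(file_ref: str) -> bool:
    """Return True if a path or URL ends in a supported audio extension.

    Only the name is inspected; the actual encoding is not verified.
    """
    # Drop any query string or fragment so "clip.mp3?sig=abc" still matches.
    name = file_ref.split("?", 1)[0].split("#", 1)[0]
    return PurePosixPath(name).suffix.lower() in SUPPORTED_EXTENSIONS
```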

Transcription options

These options control the transcription language, segmentation mode, word-level timestamps, and channel separation.

Parameter	Type	Default	Description
`language`	string	—	Language code (e.g., en-US). See language abbreviations.
`segmentation_mode`	string	`"auto"`	Voice Activity Detection (VAD) mode. Set `"none"` to disable.
`word_time_offsets`	boolean	`false`	Include word-level timestamps.
`channel_separation`	boolean	`false`	Transcribe stereo channels separately.
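The options in the table can be assembled into the `config` object carried in the request body. In this minimal sketch, `build_config` is a hypothetical helper. It omits `language` when none is given because the table lists no default, passes `segmentation_mode` through unchecked, and verifies only that the boolean options are booleans.

```python
from typing import Optional


def build_config(
    language: Optional[str] = None,
    segmentation_mode: str = "auto",
    word_time_offsets: bool = False,
    channel_separation: bool = False,
) -> dict:
    """Assemble a Fast mode "config" object using the documented defaults."""
    for name, value in (
        ("word_time_offsets", word_time_offsets),
        ("channel_separation", channel_separation),
    ):
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be a boolean, got {type(value).__name__}")
    config = {}
    if language is not None:
        config["language"] = language  # e.g. "en-US"
    config["segmentation_mode"] = segmentation_mode  # "none" disables VAD
    config["word_time_offsets"] = word_time_offsets
    config["channel_separation"] = channel_separation
    return config
```

For example, `build_config("en-US", word_time_offsets=True)` yields the same option values as the cURL request, plus an explicit `"segmentation_mode": "auto"`.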

Audio transcription

Use this endpoint to send an audio file for synchronous transcription:

POST /aiservices/scribe/transcribe

Example request (cURL)

This cURL example submits an audio file by URL to the Scribe API and returns its transcription:

curl -X POST https://api.zoom.us/v2/aiservices/scribe/transcribe \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/path/clip.mp3",
    "config": {
      "language": "en-US",
      "word_time_offsets": true,
      "channel_separation": false
    }
  }'
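For clients that prefer Python over cURL, the following sketch builds the same request with the standard-library `urllib.request` module. The helper names `make_transcribe_request` and `transcribe` and the 120-second timeout are assumptions, not part of the API.

```python
import json
import urllib.request

TRANSCRIBE_URL = "https://api.zoom.us/v2/aiservices/scribe/transcribe"


def make_transcribe_request(token: str, file_url: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) the same POST request as the cURL example."""
    body = json.dumps({"file": file_url, "config": config}).encode("utf-8")
    return urllib.request.Request(
        TRANSCRIBE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def transcribe(req: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body.

    urlopen raises urllib.error.HTTPError for non-2xx status codes.
    """
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A real call would look like `transcribe(make_transcribe_request(os.environ["TOKEN"], "https://example.com/path/clip.mp3", {"language": "en-US"}))`, which reads the token from a `TOKEN` environment variable to mirror `$TOKEN` in the cURL command and requires a valid token.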

Response (200)

{
    "request_id": "req_123",
    "duration_sec": 27.4,
    "result": {
        "text_display": "Human-readable with punctuation.",
        "text_lexical": "lowercase lexical form",
        "segments": [
            {
                "start": 0.0,
                "end": 5.2,
                "text": "Welcome everyone ...",
                "words": [
                    { "word": "Welcome", "start": 0.0, "end": 0.4 },
                    { "word": "everyone", "start": 0.41, "end": 0.9 }
                ]
            }
        ]
    }
}
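Given a response shaped like the one above, the following sketch lists segments with their times and flattens the word timestamps. It assumes a segment carries `words` only when `word_time_offsets` is `true`, so a missing key is treated as empty. The helper names `segment_lines` and `word_timeline` are hypothetical.

```python
def segment_lines(response: dict) -> list[str]:
    """Format each segment as "[start-end] text", with times in seconds."""
    return [
        f"[{s['start']:.1f}-{s['end']:.1f}] {s['text']}"
        for s in response.get("result", {}).get("segments", [])
    ]


def word_timeline(response: dict) -> list[tuple[str, float, float]]:
    """Flatten all (word, start, end) tuples across segments."""
    words = []
    for segment in response.get("result", {}).get("segments", []):
        # Assumption: "words" is present only when word_time_offsets=true.
        for w in segment.get("words", []):
            words.append((w["word"], w["start"], w["end"]))
    return words


# Trimmed copy of the example response.
sample = {"result": {"segments": [{
    "start": 0.0, "end": 5.2, "text": "Welcome everyone ...",
    "words": [
        {"word": "Welcome", "start": 0.0, "end": 0.4},
        {"word": "everyone", "start": 0.41, "end": 0.9},
    ],
}]}}
for line in segment_lines(sample):
    print(line)  # [0.0-5.2] Welcome everyone ...
```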