Handling media data

Realtime Media Streams (RTMS) delivers audio data from Zoom Contact Center (ZCC) engagements over WebSocket connections between your application and the RTMS servers.

After establishing a connection to the RTMS server, your application receives media data based on:

The scopes configured for your app
The formats specified in your media handshake request

Audio

Audio data is available per participant and as a merged packet of all participants.

The RTMS server sends audio data packets from participants in base64-encoded binary format.

Each data packet contains the participant's unique identifier (channel_id) and the timestamp (timestamp) of the audio data. Because each participant's metadata is sent through events at the beginning of the session, your app can use the channel_id to associate audio with that metadata.

Example audio packet

{
    "msg_type": 14,
    "content": {
        "channel_id": "123abc",
        "data": "Hw1kDacNAA4sDkMOAQ5eDekMgAzXCw0L4Al4CDwHBAayBEwDmwHD/wn+rfyQ+3z6Z/k4+Ef3jvb09aj1YPUF9QL1OfVt9eH1f/YE96P3jPib+dX6Pvyr/ef+IABcAZ4CmQN3BFAF6QVxBs0G9wbjBqYGXwYFBm8FxwT/Aw4DHAIkARYA7v6o/T382/qq+YD4fPd99mz1p/Qq9KTzcfNv807zcPPQ8z70yfQ89az1SPbO9m/3Lfjf+I75M/rz+pT7KPzs/In9Fv7G/lP/q/8GAG4AtgBHAcABDQKtAkwDIgRQBYQGrgfyCEgKigvDDP4NFw/8D78QhBErEqoS/BL5EsYSfRLnEXARwhCMD1kO8ww8C5gJ5gfrBe8D3AGp/7P9IPyY+hL5t/dl9mv10fQ89NjzvvO78/HzbfTb9F71I/bw9vL3Kvlh+pX71/wr/m//owDCAbQCkwNoBPYEYQWxBZ0FZQUdBYkEBgRkA64CAgIdAT0AZf+T/tD93vz9+xf7B/o8+WX4pvcE90L2n/VM9d70pPSe9IL0nfTa9BX1cfXK9QP2S/a69hH3cPfx91P4w/hO+cr5XPr2+l/71/t9/AP9f/0R/p/+Pv/5/7IAWAH+AeMCAwQ7BaAG+wcmCZEKKQy8DT4PWhBXEVYSVBOgFM0VMxZSFhYW4BXTFUMVNBSzEtIQJg99Da0LogkXB40ESgJCAHj+ivyA+qn4KfcH9hr1JvRf89byq/Lt8kjzhvPy85P0n/X/9kn4lvni+kj84P1S/4wAuwHUAvcD9wS/BUwGaAZcBmcGTQb5BUkFQgQzAzkCVQEgALX+U/3N+4L6XfkU+Nj2o/WB9MTzOfO38g==",
        "timestamp": 1738392033699
    }
}
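As an illustration, the following sketch decodes a packet shaped like the example above. It assumes the payload is 16-bit signed little-endian PCM, which matches the s16le format used by the conversion code later on this page. The helper name decodeAudioPacket is hypothetical and not part of RTMS.

```javascript
// Hypothetical helper: decode one audio message (msg_type 14).
// Assumption: the payload is 16-bit signed little-endian PCM.
function decodeAudioPacket(message) {
    if (message.msg_type !== 14) {
        throw new Error(`Not an audio packet: msg_type ${message.msg_type}`);
    }
    const { channel_id, data, timestamp } = message.content;
    const pcm = Buffer.from(data, "base64");
    // Two bytes per sample; a trailing odd byte, if any, is ignored.
    const sampleCount = Math.floor(pcm.length / 2);
    const samples = new Int16Array(sampleCount);
    for (let i = 0; i < sampleCount; i++) {
        samples[i] = pcm.readInt16LE(i * 2);
    }
    return { channelId: channel_id, timestamp, pcm, samples };
}

// Usage with a tiny synthetic payload holding two samples (1 and -2).
const examplePacket = {
    msg_type: 14,
    content: {
        channel_id: "123abc",
        data: Buffer.from([0x01, 0x00, 0xfe, 0xff]).toString("base64"),
        timestamp: 1738392033699,
    },
};
const decoded = decodeAudioPacket(examplePacket);
console.log(decoded.channelId, Array.from(decoded.samples));
```

Keeping both the raw buffer and the sample array lets you append the bytes to a file while still inspecting individual samples.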

Send rate

The default interval is 20ms between audio packets. You can configure this in multiples of 20ms, up to a maximum of 1000ms. If you specify an interval above 1000ms in handshake requests, RTMS will change it to 1000ms.
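The rule above can be mirrored on the client so your app knows which interval to expect. This is a minimal sketch: effectiveSendRate is a hypothetical helper, and rejecting values that are not multiples of 20ms is an assumption of this example, since this page only describes the 1000ms cap.

```javascript
// Hypothetical helper: predict the interval RTMS will use for a requested send rate.
function effectiveSendRate(requestedMs = 20) {
    const STEP_MS = 20;
    const MAX_MS = 1000;
    // Documented behavior: anything above 1000ms is changed to 1000ms.
    if (requestedMs > MAX_MS) {
        return MAX_MS;
    }
    // Assumption: treat other invalid values as a client-side error.
    if (!Number.isInteger(requestedMs) || requestedMs < STEP_MS || requestedMs % STEP_MS !== 0) {
        throw new RangeError(`send rate must be a multiple of ${STEP_MS}ms, got ${requestedMs}`);
    }
    return requestedMs;
}

console.log(effectiveSendRate());     // 20
console.log(effectiveSendRate(100));  // 100
console.log(effectiveSendRate(2000)); // 1000
```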

Timestamps

Timestamps denote the creation time on Zoom's server. The timestamp for each audio packet changes relative to the send_rate defined by the handshake request.

If the send_rate is set to 20ms (default), the timestamp for each audio packet will change by 20ms.

When working with streaming audio, timestamps are useful for determining the sequence of messages. You can also use them to infer periods when a user might be muted.
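One way to spot possible muted periods is to compare consecutive timestamps on a single channel against the send rate. The sketch below assumes millisecond timestamps; findAudioGaps and its toleranceMs parameter are hypothetical names chosen for this example.

```javascript
// Hypothetical helper: report gaps in one channel's packet timestamps.
// Assumption: timestamps are in milliseconds and advance by sendRateMs
// between consecutive packets while the participant is sending audio.
function findAudioGaps(timestamps, sendRateMs = 20, toleranceMs = 5) {
    const gaps = [];
    for (let i = 1; i < timestamps.length; i++) {
        const delta = timestamps[i] - timestamps[i - 1];
        if (delta > sendRateMs + toleranceMs) {
            gaps.push({
                afterTimestamp: timestamps[i - 1],
                beforeTimestamp: timestamps[i],
                // Time not covered by audio between the two packets.
                missingMs: delta - sendRateMs,
            });
        }
    }
    return gaps;
}

console.log(findAudioGaps([1000, 1020, 1040, 1540, 1560]));
```

A gap only suggests that no audio was sent. Whether that means the user was muted is an inference your application has to make.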

Data processing

When processing audio data, write the audio for each channel_id to its own file. Once you have the separate files, you can combine them into one file.

Find the channel ID

Before you can save the files, you need to find and save the channel IDs.

ws.on("message", (data) => {
    const message = JSON.parse(data.toString());
    if (message.msg_type === 4) {
        // Media handshake response
        if (message.status_code === 0) {
            // Send CLIENT_READY_ACK to signaling connection
            signalingWs.send(
                JSON.stringify({
                    msg_type: 7,
                    rtms_stream_id: rtmsStreamId,
                }),
            );
        }
    } else if (message.msg_type === 12) {
        // Keep-alive request
        ws.send(JSON.stringify({ msg_type: 13, timestamp: message.timestamp }));
    } else if (message.msg_type === 14) {
        // Audio data
        const audioBuffer = Buffer.from(message.content.data, "base64");
        const channelId = message.content.channel_id;
        if (!engagementData.channelPaths.has(channelId)) {
            const rawPath = getChannelRawPath(
                engagementData.sessionDir,
                channelId,
            );
            const wavPath = getChannelWavPath(
                engagementData.sessionDir,
                channelId,
            );
            engagementData.channelPaths.set(channelId, { rawPath, wavPath });
            console.log(`🎙️  New channel ${channelId} → ${rawPath}`);
        }
        const { rawPath } = engagementData.channelPaths.get(channelId);
        saveRawAudio(audioBuffer, rawPath);
        engagementData.audioChunkCount++;
        if (engagementData.audioChunkCount % 100 === 0) {
            console.log(
                `🎵 Audio chunks: ${engagementData.audioChunkCount} (channels: ${engagementData.channelPaths.size})`,
            );
        }
    }
});

Convert and save stream

Now that you have the channel IDs, save each channel's stream data to a file, convert it to WAV, and then, if desired, combine the channels into a single file.

If you don't want a single file, use only the per-channel code and skip the combining step.

import fs from "fs";
import path from "path";
import { exec } from "child_process";
import { promisify } from "util";
const execAsync = promisify(exec);
// Cache of open write streams keyed by file path
const writeStreams = new Map();
export function makeSessionTimestamp() {
    const now = new Date();
    const pad = (n) => String(n).padStart(2, "0");
    return `${now.getFullYear()}-${pad(now.getMonth() + 1)}-${pad(now.getDate())}_${pad(now.getHours())}-${pad(now.getMinutes())}-${pad(now.getSeconds())}`;
}
export function getChannelRawPath(sessionDir, channelId) {
    return path.join(sessionDir, `channel_${channelId}.raw`);
}
export function getChannelWavPath(sessionDir, channelId) {
    return path.join(sessionDir, `channel_${channelId}.wav`);
}
export function saveRawAudio(chunk, rawPath) {
    const dir = path.dirname(rawPath);
    if (!fs.existsSync(dir)) {
        fs.mkdirSync(dir, { recursive: true });
    }
    let stream = writeStreams.get(rawPath);
    if (!stream) {
        stream = fs.createWriteStream(rawPath, { flags: "a" });
        writeStreams.set(rawPath, stream);
    }
    stream.write(chunk);
}
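// Convert raw 16-bit little-endian PCM to WAV with the ffmpeg CLI.
// Requires ffmpeg to be installed and available on the PATH.
export async function convertRawToWav(inputFile, outputFile, options = {}) {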
    const sampleRate = options.sampleRate || 16000;
    const channels = options.channels || 1;
    const command = `ffmpeg -y -f s16le -ar ${sampleRate} -ac ${channels} -i "${inputFile}" "${outputFile}"`;
    try {
        await execAsync(command);
    } catch (error) {
        throw new Error(`FFmpeg conversion failed: ${error.message}`);
    }
}
export function closeRawStream(rawPath) {
    const stream = writeStreams.get(rawPath);
    if (!stream) return Promise.resolve();
    return new Promise((resolve) => {
        stream.end(() => {
            writeStreams.delete(rawPath);
            resolve();
        });
    });
}
export function closeAllAudioStreams() {
    const promises = Array.from(writeStreams.entries()).map(
        ([, stream]) => new Promise((resolve) => stream.end(resolve)),
    );
    writeStreams.clear();
    return Promise.all(promises);
}
// WAV params: 16kHz, 16-bit
const WAV_SAMPLE_RATE = 16000;
const WAV_BITS_PER_SAMPLE = 16;
// Build 44-byte WAV header for raw PCM data
export function buildWavHeader(dataSize, channels) {
    const bitsPerSample = WAV_BITS_PER_SAMPLE;
    const byteRate = WAV_SAMPLE_RATE * channels * (bitsPerSample / 8);
    const blockAlign = channels * (bitsPerSample / 8);
    const fileSize = 36 + dataSize;
    const buf = Buffer.alloc(44);
    buf.write("RIFF", 0, "ascii");
    buf.writeUInt32LE(fileSize, 4);
    buf.write("WAVE", 8, "ascii");
    buf.write("fmt ", 12, "ascii");
    buf.writeUInt32LE(16, 16);
    buf.writeUInt16LE(1, 20); // PCM format
    buf.writeUInt16LE(channels, 22);
    buf.writeUInt32LE(WAV_SAMPLE_RATE, 24);
    buf.writeUInt32LE(byteRate, 28);
    buf.writeUInt16LE(blockAlign, 32);
    buf.writeUInt16LE(bitsPerSample, 34);
    buf.write("data", 36, "ascii");
    buf.writeUInt32LE(dataSize, 40);
    return buf;
}
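The helpers above produce one file per channel. As one possible approach to combining channels, the following sketch mixes PCM buffers in memory by summing samples and clamping to the 16-bit range. It assumes every input is 16-bit little-endian mono PCM at the same sample rate and that the channels are already time-aligned. mixPcm16 and pcmFrom are hypothetical helpers written for this example.

```javascript
// Hypothetical helper: mix several 16-bit LE PCM buffers into one.
function mixPcm16(buffers) {
    // Use whole samples only (2 bytes each); output is as long as the longest input.
    const lengths = buffers.map((buf) => buf.length - (buf.length % 2));
    const outBytes = Math.max(0, ...lengths);
    const mixed = Buffer.alloc(outBytes);
    for (let offset = 0; offset < outBytes; offset += 2) {
        let sum = 0;
        buffers.forEach((buf, i) => {
            // Shorter inputs contribute silence once they run out.
            if (offset < lengths[i]) sum += buf.readInt16LE(offset);
        });
        // Clamp to the signed 16-bit range to avoid wrap-around distortion.
        mixed.writeInt16LE(Math.max(-32768, Math.min(32767, sum)), offset);
    }
    return mixed;
}

// Hypothetical helper for the usage example: build a PCM buffer from sample values.
function pcmFrom(samples) {
    const buf = Buffer.alloc(samples.length * 2);
    samples.forEach((sample, i) => buf.writeInt16LE(sample, i * 2));
    return buf;
}

const mixedPcm = mixPcm16([pcmFrom([1000, 30000]), pcmFrom([2000, 30000, -5])]);
console.log(mixedPcm.readInt16LE(0), mixedPcm.readInt16LE(2), mixedPcm.readInt16LE(4));
// prints: 3000 32767 -5
```

To save the result as a WAV file, you could prefix it with the output of buildWavHeader(mixedPcm.length, 1). For long engagements, a streaming approach or an external tool is likely a better fit than holding every channel in memory.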

User IDs and timestamps

When you select multiple streams, your application receives an audio stream for each participant. Separate streams enable your app to perform audio mixing, isolation, and individual analysis.

Each user has a unique user_id and their own incrementing timestamp.

For merged audio, the user_id is 0.

Buffered audio

When a contact_center.voice_rtms_started webhook event is received, the RTMS server starts buffering audio packets, and timestamps begin to increment while the signaling connection is being made. The server buffers up to 60 seconds of audio while the signaling and media connections are established. Once the connections are established, it delivers the buffered audio packets.

To determine the amount of buffered data, calculate the difference between the timestamp of the voice_rtms_started event and the timestamp of the first audio packet.

buffer_duration = firstPacketTimestamp - rtmsStartedEventTs
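The calculation above can be sketched as a small function. It assumes both timestamps are in milliseconds and come from the same clock. bufferedDurationMs is a hypothetical helper name, and clamping negative results to zero is a choice made for this example.

```javascript
// Hypothetical helper: buffered audio duration in milliseconds.
// Assumption: both arguments are millisecond timestamps from the same clock.
function bufferedDurationMs(rtmsStartedEventTs, firstPacketTimestamp) {
    const duration = firstPacketTimestamp - rtmsStartedEventTs;
    // Guard against negative values from skewed or swapped inputs.
    return Math.max(0, duration);
}

console.log(bufferedDurationMs(1738392030000, 1738392033699)); // 3699
```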

Best practices

When an engagement starts, capture the first timestamp from the signaling connection. This denotes the start of the engagement.

When participants mute their microphones, the RTMS server stops sending audio packets for that user. Use timestamps to detect these gaps and insert silence if needed for your application.
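As a sketch of inserting silence, the code below pads a detected gap with zero-filled PCM before appending the next packet. It assumes 16 kHz, 16-bit mono PCM (the parameters used by the WAV code earlier on this page) and millisecond timestamps. makeSilence and chunksWithPadding are hypothetical helpers.

```javascript
// Hypothetical helper: zero-filled PCM covering durationMs of silence.
// Assumption: 16-bit samples, so each frame takes 2 bytes per channel.
function makeSilence(durationMs, sampleRate = 16000, channels = 1) {
    const frames = Math.round((durationMs / 1000) * sampleRate);
    return Buffer.alloc(frames * channels * 2); // zeros are silence in signed PCM
}

// Hypothetical helper: chunks to append for one packet, with silence first if a gap is found.
// Pass null as lastTimestamp for the first packet on a channel.
function chunksWithPadding(lastTimestamp, packetTimestamp, pcm, sendRateMs = 20) {
    const chunks = [];
    if (lastTimestamp !== null) {
        const missingMs = packetTimestamp - lastTimestamp - sendRateMs;
        // Only pad when at least one whole packet interval is missing.
        if (missingMs >= sendRateMs) {
            chunks.push(makeSilence(missingMs));
        }
    }
    chunks.push(pcm);
    return chunks;
}

const padded = chunksWithPadding(1000, 1520, Buffer.alloc(640));
console.log(padded.map((chunk) => chunk.length)); // [ 16000, 640 ]
```

Each returned chunk could then be written in order to the channel's raw file, which keeps per-channel files aligned in time.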

When working with pulse-code modulation (PCM) audio, consider the following:

Raw PCM buffers are uncompressed, so account for their size and the storage they require.
You may need to convert the audio to WAV format to use services such as live streaming or speech-to-text transcription.
Compression to lossy formats, such as MP3, requires the entire audio file to be complete before compression. We recommend that you compress the audio after the engagement has ended.
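To get a feel for the storage involved, this sketch estimates the size of uncompressed PCM. It assumes the 16 kHz, 16-bit mono parameters used by the WAV code earlier on this page. pcmBytes is a hypothetical helper.

```javascript
// Hypothetical helper: size in bytes of uncompressed PCM audio.
function pcmBytes(durationSeconds, sampleRate = 16000, bitsPerSample = 16, channels = 1) {
    return durationSeconds * sampleRate * (bitsPerSample / 8) * channels;
}

const tenMinutes = pcmBytes(10 * 60);
console.log(tenMinutes);                             // 19200000
console.log((tenMinutes / 1024 / 1024).toFixed(1));  // "18.3" (MiB)
```

The estimate is per channel, so an engagement with several participants stored separately multiplies accordingly.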

Transcripts

Transcript data is available per participant with attribution, also called diarization.

Transcripts arrive continuously as speech is detected and processed in real time.

The RTMS server sends transcript data packets from each participant as text data with the participant's identifier (user_id), username (user_name), timestamp, and language.

Example transcript packet

{
    "msg_type": 17,
    "content": {
        "channel_id": "c8b2d3ce-b0d6-4cd1-9195-fe17a4bd5b8e",
        "start_time": 1738392033699,
        "end_time": 1738392036866,
        "timestamp": 1727384349000000,
        "language": 9,
        "data": "Hi, hello world!"
    }
}

Languages

The language field provides the automatically detected language spoken. When switching from one language to another, it typically takes 10-30 seconds for automatic detection to identify the new language.

Timestamps

The timestamp marks when the sentence or utterance begins. Use it to create a log of when the user began speaking.

Best practices

Because Zoom's transcription optimizes for low latency, it may be helpful to post-process text transcripts into final assets.

In sentence detection, a pause in speech can sometimes be mistaken for the end of a sentence.

Grouping consecutive transcript messages by message interval, with a timeout, is a useful strategy for rejoining disjointed sentences.
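The merging strategy can be sketched as a post-processing pass over collected transcript messages. It joins consecutive messages from the same channel when the pause between them is short. The field names follow the example transcript packet above. mergeTranscripts and the 1500ms default for maxGapMs are assumptions of this example, not RTMS settings.

```javascript
// Hypothetical helper: join transcript fragments separated by short pauses.
// Assumption: start_time and end_time are millisecond timestamps.
function mergeTranscripts(packets, maxGapMs = 1500) {
    const merged = [];
    for (const { content } of packets) {
        const last = merged[merged.length - 1];
        const sameChannel = last !== undefined && last.channel_id === content.channel_id;
        if (sameChannel && content.start_time - last.end_time <= maxGapMs) {
            // Short pause on the same channel: treat as one utterance.
            last.data = `${last.data} ${content.data}`;
            last.end_time = content.end_time;
        } else {
            merged.push({
                channel_id: content.channel_id,
                start_time: content.start_time,
                end_time: content.end_time,
                data: content.data,
            });
        }
    }
    return merged;
}

const utterances = mergeTranscripts([
    { msg_type: 17, content: { channel_id: "a", start_time: 0, end_time: 1000, data: "I would like" } },
    { msg_type: 17, content: { channel_id: "a", start_time: 1400, end_time: 2500, data: "to check my order." } },
    { msg_type: 17, content: { channel_id: "b", start_time: 2600, end_time: 3000, data: "Sure." } },
]);
console.log(utterances.map((u) => u.data));
```

In a live pipeline, the same idea would typically use a per-channel timer that flushes the pending utterance when no new message arrives within the timeout.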