# Handling media data

Realtime Media Streams (RTMS) delivers audio, video, screen share, and transcript data from Zoom meetings and webinars over WebSocket connections between your application and the RTMS servers.

After establishing a connection to the RTMS server, your application receives media data based on:

- The [scopes configured for your app](/docs/rtms/meetings/add-features/#add-rtms-scopes-to-your-app)
- The formats specified in your [media handshake request](/docs/rtms/event-reference/#media-handshake-request)
- The availability of data in the meeting or webinar

Each media data type has specific formats and best practices for processing the data efficiently.

## Audio

Audio data is available per participant and as a merged packet of all participants. By default, audio data is sent as **uncompressed raw PCM (L16)** data with a **16kHz** sample rate and **mono** channels.

The RTMS server sends [audio data packets](/docs/rtms/event-reference/#audio-data) from participants in **base64-encoded binary format**. Each packet of data contains the user's participant identifier (`user_id`), username (`user_name`), and the timestamp (`timestamp`) of the audio data.
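
As a minimal sketch of what handling one of these packets might look like, the Python below decodes the base64 `data` field into 16-bit PCM samples and derives the packet duration from the default 16kHz mono format. The function names are hypothetical, and the little-endian byte order is an assumption that this page does not confirm, so verify it against real audio before relying on it.

```python
import base64
import json
import struct

SAMPLE_RATE_HZ = 16_000  # default sample rate described above
BYTES_PER_SAMPLE = 2     # L16 means 16-bit linear PCM
CHANNELS = 1             # mono


def decode_audio_packet(raw_message: str, little_endian: bool = True):
    """Return (user_id, timestamp, samples) for one audio message.

    `little_endian` is an assumption, not documented behavior; flip it
    if the decoded audio sounds like noise.
    """
    content = json.loads(raw_message)["content"]
    pcm = base64.b64decode(content["data"])
    # Ignore a trailing odd byte, if any, so the buffer holds whole samples.
    sample_count = len(pcm) // BYTES_PER_SAMPLE
    byte_order = "<" if little_endian else ">"
    samples = struct.unpack(
        f"{byte_order}{sample_count}h", pcm[: sample_count * BYTES_PER_SAMPLE]
    )
    return content["user_id"], content["timestamp"], samples


def duration_ms(sample_count: int) -> float:
    """Playback duration of `sample_count` samples in the default format."""
    return sample_count * 1000 / (SAMPLE_RATE_HZ * CHANNELS)
```

With the default format, 320 samples (640 bytes of PCM) correspond to 20 milliseconds of audio.
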
### Example audio packet:

```json
{
  "msg_type": 14,
  "content": {
    "user_id": 16778240,
    "user_name": "John Smith",
    "data": "Hw1kDacNAA4sDkMOAQ5eDekMgAzXCw0L4Al4CDwHBAayBEwDmwHD/wn+rfyQ+3z6Z/k4+Ef3jvb09aj1YPUF9QL1OfVt9eH1f/YE96P3jPib+dX6Pvyr/ef+IABcAZ4CmQN3BFAF6QVxBs0G9wbjBqYGXwYFBm8FxwT/Aw4DHAIkARYA7v6o/T382/qq+YD4fPd99mz1p/Qq9KTzcfNv807zcPPQ8z70yfQ89az1SPbO9m/3Lfjf+I75M/rz+pT7KPzs/In9Fv7G/lP/q/8GAG4AtgBHAcABDQKtAkwDIgRQBYQGrgfyCEgKigvDDP4NFw/8D78QhBErEqoS/BL5EsYSfRLnEXARwhCMD1kO8ww8C5gJ5gfrBe8D3AGp/7P9IPyY+hL5t/dl9mv10fQ89NjzvvO78/HzbfTb9F71I/bw9vL3Kvlh+pX71/wr/m//owDCAbQCkwNoBPYEYQWxBZ0FZQUdBYkEBgRkA64CAgIdAT0AZf+T/tD93vz9+xf7B/o8+WX4pvcE90L2n/VM9d70pPSe9IL0nfTa9BX1cfXK9QP2S/a69hH3cPfx91P4w/hO+cr5XPr2+l/71/t9/AP9f/0R/p/+Pv/5/7IAWAH+AeMCAwQ7BaAG+wcmCZEKKQy8DT4PWhBXEVYSVBOgFM0VMxZSFhYW4BXTFUMVNBSzEtIQJg99Da0LogkXB40ESgJCAHj+ivyA+qn4KfcH9hr1JvRf89byq/Lt8kjzhvPy85P0n/X/9kn4lvni+kj84P1S/4wAuwHUAvcD9wS/BUwGaAZcBmcGTQb5BUkFQgQzAzkCVQEgALX+U/3N+4L6XfkU+Nj2o/WB9MTzOfO38g==",
    "timestamp": 1738392033699
  }
}
```

### Send rate

The default interval is 20ms between audio packets. You can configure this in multiples of 20ms, up to a maximum of 1000ms. If you specify an interval above 1000ms in handshake requests, RTMS will change it to 1000ms.

### Timestamps

Timestamps denote the creation time on Zoom's server. The timestamp for each audio packet changes relative to the `send_rate` defined by the handshake request. If the `send_rate` is set to 20ms (default), the timestamp for each audio packet will change by 20ms.

When working with streaming audio, timestamps are useful in determining the sequence of messages. Use timestamps to:

- Infer the period of time where a user might be muted
- Match timestamps with video and screen share data to combine, or mux, the audio, video, and screen share frames

### User IDs and timestamps

When selecting multiple streams, your application will receive an audio stream for each participant.
By sending separate streams, RTMS enables your app to perform audio mixing, isolation, and individual analysis. Each user will have a unique `user_id` and their own incremental timestamp. For merged audio, the `user_id` will be `0`.

### Buffered audio

When a [`meeting.rtms_started`](/docs/api/rtms/events/#tag/meeting/postmeeting.rtms_started) or [`webinar.rtms_started`](/docs/api/rtms/events/#tag/webinar/postwebinar.rtms_started) webhook event is received, the RTMS server starts buffering audio packets, and the timestamps start to increment, while the signaling connection is being made. The RTMS server buffers up to 60 seconds of audio while the signaling and media connections are established. Once the connections are established, the buffered audio packets are delivered.

To determine the amount of buffered data, calculate the difference between the timestamp of the `rtms_started` event and the timestamp of the first packet of audio data.

`buffer_duration = firstPacketTimestamp - rtmsStartedEventTs`

> **Note**
>
> Video data is not buffered, and its timestamps begin incrementing as soon as the connections are established. As a result, the audio and video timestamps will be offset by the duration of the buffered audio. Take this offset into account when syncing audio and video data.

### Best practices

When a meeting or webinar starts, capture the first timestamp from the signaling connection. This denotes the start of the meeting or webinar.

When participants mute their microphones, the RTMS server stops sending audio packets for that user. Use timestamps to detect these gaps and insert silence if needed for your application.

When working with pulse-code modulation (PCM) audio, consider the following:

- The size of the raw PCM buffer and the storage it might take up without compression
- That you may need to convert the audio to WAV format to use services such as live streaming or speech-to-text transcription.
- That compression to lossy formats, such as MP3, requires the entire audio file to be completed before compression. We recommend you compress the audio after the meeting or webinar has ended.

## Video

Video data is sent as a single video stream of the active speaker.

**Supported resolutions:**

- **SD**: 480p (854×480) or 360p (640×360)
- **HD**: 720p (1280×720)
- **FHD**: 1080p (1920×1080)
- **QHD**: 2K (2560×1440)

> Video resolution may change dynamically based on participants' hardware capabilities and network conditions.

The RTMS server sends [video data packets](/docs/rtms/event-reference/#video-data) in **base64-encoded binary format**. Each packet of data contains the user's participant identifier (`user_id`), username (`user_name`), and the timestamp (`timestamp`) of the video data.

### Example video packet:

```json
{
  "msg_type": 15,
  "content": {
    "user_id": 16778240,
    "user_name": "John Smith",
    "data": "Hw1kDacNAA4sDkMOAQ5eDekMgAzXCw0L4Al4CDwHBAayBEwDmwHD/wn+rfyQ+3z6Z/k4+Ef3jvb09aj1YPUF9QL1OfVt9eH1f/YE96P3jPib+dX6Pvyr/ef+IABcAZ4CmQN3BFAF6QVxBs0G9wbjBqYGXwYFBm8FxwT/Aw4DHAIkARYA7v6o/T382/qq+YD4fPd99mz1p/Qq9KTzcfNv807zcPPQ8z70yfQ89az1SPbO9m/3Lfjf+I75M/rz+pT7KPzs/In9Fv7G/lP/q/8GAG4AtgBHAcABDQKtAkwDIgRQBYQGrgfyCEgKigvDDP4NFw/8D78QhBErEqoS/BL5EsYSfRLnEXARwhCMD1kO8ww8C5gJ5gfrBe8D3AGp/7P9IPyY+hL5t/dl9mv10fQ89NjzvvO78/HzbfTb9F71I/bw9vL3Kvlh+pX71/wr/m//owDCAbQCkwNoBPYEYQWxBZ0FZQUdBYkEBgRkA64CAgIdAT0AZf+T/tD93vz9+xf7B/o8+WX4pvcE90L2n/VM9d70pPSe9IL0nfTa9BX1cfXK9QP2S/a69hH3cPfx91P4w/hO+cr5XPr2+l/71/t9/AP9f/0R/p/+Pv/5/7IAWAH+AeMCAwQ7BaAG+wcmCZEKKQy8DT4PWhBXEVYSVBOgFM0VMxZSFhYW4BXTFUMVNBSzEtIQJg99Da0LogkXB40ESgJCAHj+ivyA+qn4KfcH9hr1JvRf89byq/Lt8kjzhvPy85P0n/X/9kn4lvni+kj84P1S/4wAuwHUAvcD9wS/BUwGaAZcBmcGTQb5BUkFQgQzAzkCVQEgALX+U/3N+4L6XfkU+Nj2o/WB9MTzOfO38g==",
    "timestamp": 1738392033699
  }
}
```

### Individual participant stream

By default, video streams the active speaker.
To receive video from a specific participant instead, see [Stream a single participant's video](/docs/rtms/meetings/video-single-stream/).

### Stop video behavior

When a user stops sharing video, the RTMS server will stop sending video packets until data is available again. If all users have stopped sharing video, no video data will be sent because no video data is available.

If you're recording the meeting or webinar for playback, you'll need to add filler frames for the periods with no video data when combining, or muxing, audio, video, and screen share data using timestamps. For more information, see [Combining media data](#combining-media-data).

### Codecs

Video data can be in `JPG` or `PNG` format when the frames per second (fps) is lower than or equal to 5 fps. Video data will be in `H.264` format when the fps is greater than 5 fps, up to the maximum of 30 fps.

### Combining media data

Video data is sent separately from audio and screen share data. Combine, or mux, these data formats using tools like FFmpeg and GStreamer to produce a playable file. Use our [recording sample apps](/docs/rtms/sample-apps/?tags=recording) as a reference for muxing.

### Timestamps

Timestamps denote the creation time on Zoom's server. The interval at which the timestamp increases depends on the fps settings specified in the [media handshake request](/docs/rtms/event-reference/#media-handshake-request). If the fps is set at 25, the timestamp should increase by 40ms per video packet.

Timestamps are essential to sync video, audio, and screen share data and can be used to determine the order of frames. Timestamps can also be used to:

- Infer the period of time where a user has their video turned off
- Match timestamps with audio and screen share data to combine, or mux, the audio, video, and screen share frames
- Sample video frames for downsampling
- Determine fps

### Best practices

H.264 format has a large disk size footprint.
When handling a large number of simultaneous disk writes, be mindful of the available disk space, network bandwidth, and disk I/O throughput.

In scenarios where you might be handling multiple streams, consider a distributed processing approach.

In scenarios where real-time data is not essential, consider processing data after the meeting or webinar to prevent unnecessary local resource usage by your application.

## Screen share

> Screen share is available using the REST API or through direct WebSocket connections.

Screen share data is sent as a single video stream of the participant sharing the screen.

### Supported resolutions:

- **HD**: 720p (1280×720)
- **FHD**: 1080p (1920×1080)
- **QHD**: 2K (2560×1440)

> Resolution may change dynamically based on participants' hardware capabilities and network conditions.

The RTMS server sends [screen share data packets](/docs/rtms/event-reference/#screen-share-data) in **base64-encoded binary format**. Each packet of data contains the user's participant identifier (`user_id`), username (`user_name`), and the timestamp (`timestamp`) of the screen share data.
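
The example packets on this page share one envelope: a numeric `msg_type` and a `content` object carrying `user_id`, `timestamp`, and `data`. As a rough sketch, a receiver might route each incoming message to a per-media handler as shown below. The `msg_type` values (14, 15, 16, 17) are taken from the examples on this page; the `dispatch` function and the handler signature are hypothetical and not part of any RTMS SDK.

```python
import base64
import json
from typing import Callable, Dict

# msg_type values as they appear in the example packets on this page.
MSG_AUDIO, MSG_VIDEO, MSG_SCREEN_SHARE, MSG_TRANSCRIPT = 14, 15, 16, 17

# Hypothetical handler signature: (user_id, timestamp, data).
Handler = Callable[[int, int, object], None]


def dispatch(raw_message: str, handlers: Dict[int, Handler]) -> bool:
    """Route one JSON message to its handler; return False if none matches."""
    message = json.loads(raw_message)
    handler = handlers.get(message.get("msg_type"))
    if handler is None:
        return False
    content = message["content"]
    data = content["data"]
    if message["msg_type"] != MSG_TRANSCRIPT:
        # Audio, video, and screen share payloads are base64-encoded binary;
        # transcript payloads are plain text and are passed through as-is.
        data = base64.b64decode(data)
    handler(content["user_id"], content["timestamp"], data)
    return True
```

Keeping the handlers in a dictionary lets an application subscribe only to the media types it requested in the handshake and ignore everything else.
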
### Example screen share packet:

```json
{
  "msg_type": 16,
  "content": {
    "user_id": 16778240,
    "user_name": "John Smith",
    "data": "Hw1kDacNAA4sDkMOAQ5eDekMgAzXCw0L4Al4CDwHBAayBEwDmwHD/wn+rfyQ+3z6Z/k4+Ef3jvb09aj1YPUF9QL1OfVt9eH1f/YE96P3jPib+dX6Pvyr/ef+IABcAZ4CmQN3BFAF6QVxBs0G9wbjBqYGXwYFBm8FxwT/Aw4DHAIkARYA7v6o/T382/qq+YD4fPd99mz1p/Qq9KTzcfNv807zcPPQ8z70yfQ89az1SPbO9m/3Lfjf+I75M/rz+pT7KPzs/In9Fv7G/lP/q/8GAG4AtgBHAcABDQKtAkwDIgRQBYQGrgfyCEgKigvDDP4NFw/8D78QhBErEqoS/BL5EsYSfRLnEXARwhCMD1kO8ww8C5gJ5gfrBe8D3AGp/7P9IPyY+hL5t/dl9mv10fQ89NjzvvO78/HzbfTb9F71I/bw9vL3Kvlh+pX71/wr/m//owDCAbQCkwNoBPYEYQWxBZ0FZQUdBYkEBgRkA64CAgIdAT0AZf+T/tD93vz9+xf7B/o8+WX4pvcE90L2n/VM9d70pPSe9IL0nfTa9BX1cfXK9QP2S/a69hH3cPfx91P4w/hO+cr5XPr2+l/71/t9/AP9f/0R/p/+Pv/5/7IAWAH+AeMCAwQ7BaAG+wcmCZEKKQy8DT4PWhBXEVYSVBOgFM0VMxZSFhYW4BXTFUMVNBSzEtIQJg99Da0LogkXB40ESgJCAHj+ivyA+qn4KfcH9hr1JvRf89byq/Lt8kjzhvPy85P0n/X/9kn4lvni+kj84P1S/4wAuwHUAvcD9wS/BUwGaAZcBmcGTQb5BUkFQgQzAzkCVQEgALX+U/3N+4L6XfkU+Nj2o/WB9MTzOfO38g==",
    "timestamp": 1738392033699
  }
}
```

### Stop screen share behavior

When a user stops sharing their screen, the RTMS server will stop sending messages until data is available again. If all users have stopped sharing their screen, no screen share data will be sent because no screen share data is available.

If you're recording the meeting or webinar for playback, you'll need to add filler frames for the periods with no screen share data when combining, or muxing, audio, video, and screen share data using timestamps. For more information, see **Combining media data**.

### Codecs

Screen share data can be in `JPG` or `PNG` format when the frames per second (fps) is lower than or equal to 1 fps. Screen share data will be in `H.264` format when the fps is greater than 1 fps, up to the maximum of 30 fps.

### Combining media data

Screen share data is sent separately from audio and video data. Combine, or mux, these data formats using tools like FFmpeg and GStreamer to produce a playable file.
Use our [recording sample apps](/docs/rtms/sample-apps/?tags=recording) as a reference for muxing.

### Timestamps

Timestamps denote the creation time on Zoom's server. The interval at which the timestamp increases depends on the fps settings specified in the [media handshake request](/docs/rtms/event-reference/#media-handshake-request). If the fps is set at 25, the timestamp should increase by 40ms per screen share packet.

Timestamps are essential to sync video, audio, and screen share data and can be used to determine the order of frames. Timestamps can also be used to:

- Infer the period of time where a user stopped sharing their screen
- Match timestamps with audio and video data to combine, or mux, the audio, video, and screen share frames
- Determine fps

### Best practices

H.264 format has a large disk size footprint. When handling a large number of simultaneous disk writes, be mindful of the available disk space, network bandwidth, and disk I/O throughput.

In scenarios where you might be handling multiple streams, consider a distributed processing approach.

In scenarios where real-time data is not essential, consider processing data after the meeting or webinar to prevent unnecessary local resource usage by your application.

## Transcripts

Transcript data is available per participant with attribution, also called diarization. Transcripts arrive continuously as speech is detected and processed in real time.

The RTMS server sends [transcript data packets](/docs/rtms/event-reference/#transcript-data) from each participant as text data with the participant's identifier (`user_id`), username (`user_name`), `timestamp`, and `language`.

### Example transcript packet

```json
{
  "msg_type": 17,
  "content": {
    "user_id": 19778240,
    "user_name": "John Smith",
    "start_time": 1727384100000,
    "end_time": 1727384310000,
    "timestamp": 1727384349000,
    "language": 9,
    "data": "Hi, hello world!"
  }
}
```

### Languages

The `language` field provides the automatically detected language spoken. When switching from one language to another, it typically takes 10-30 seconds for automatic detection to identify the new language.

### Timestamps

Timestamps are sent when the sentence or utterance begins. Use them to create a log of when the user began speaking.

### Best practices

Because Zoom's transcription optimizes for low latency, it may be helpful to post-process text transcripts into final assets.

In sentence detection, a pause in speech can sometimes be mistaken for the end of a sentence. Merging consecutive transcript messages from the same speaker, based on the interval between them and a timeout, is a useful strategy for rejoining disjointed sentences.
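
The merging strategy described above could be sketched as follows. This is an illustrative post-processing step, not Zoom's method: it assumes each fragment is the parsed `content` object of a transcript packet with `user_id`, `start_time`, `end_time`, and `data` fields as in the example, and the 1500ms gap threshold and the `merge_fragments` name are hypothetical choices you would tune for your application.

```python
from typing import Dict, Iterable, List

MAX_GAP_MS = 1500  # hypothetical threshold; tune for your use case


def merge_fragments(
    fragments: Iterable[Dict], max_gap_ms: int = MAX_GAP_MS
) -> List[Dict]:
    """Join fragments from the same speaker separated by short pauses."""
    merged: List[Dict] = []
    latest_by_user: Dict[int, Dict] = {}  # most recent merged entry per speaker
    for frag in sorted(fragments, key=lambda f: f["start_time"]):
        last = latest_by_user.get(frag["user_id"])
        if last is not None and frag["start_time"] - last["end_time"] <= max_gap_ms:
            # Short pause: treat this fragment as a continuation.
            last["data"] = f'{last["data"]} {frag["data"]}'
            last["end_time"] = max(last["end_time"], frag["end_time"])
        else:
            # Long pause or first fragment: start a new entry (copy, so the
            # caller's input is not mutated).
            entry = {k: frag[k] for k in ("user_id", "start_time", "end_time", "data")}
            merged.append(entry)
            latest_by_user[frag["user_id"]] = entry
    return merged
```

Tracking the latest entry per `user_id` keeps interleaved speakers from splitting each other's sentences, at the cost of output that is ordered by when each merged utterance started rather than by when it ended.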