Build real-time voice streaming applications with Amazon Nova Sonic and WebRTC

Building end-to-end live streaming applications with real-time voice interaction presents several challenges:

- Network bandwidth constraints can cause high latency and quality degradation in time-critical applications.
- Language barriers limit effective human-machine interaction in multilingual voice communication.
- Scalability and resilience require a difficult balance between performance and infrastructure costs.
- Cross-browser and mobile compatibility demands significant development effort, especially for startups.

This post introduces a solution based on Amazon Nova 2 Sonic (Nova Sonic) and Amazon Kinesis Video Streams with WebRTC (Kinesis Video Streams WebRTC) that addresses these challenges. WebRTC dynamically adjusts the bitrate on unstable networks, which helps maintain audio quality while reducing dropped connections. Nova Sonic provides natural spoken dialogue, so users can interact in their chosen language. Both services are fully managed by AWS, so they scale automatically with high resilience. AWS also provides open-source samples that you can use as a starting point for your own application.

In this post, we’ll walk through the solution architecture, implementation patterns, and two real-world scenario examples.

Nova Sonic and WebRTC

Traditional voice agent pipelines typically involve separate modules for speech recognition, language processing, and speech synthesis. Nova Sonic offers a unified speech-to-speech architecture that enables real-time voice conversations between users and AI agents with low latency.

With unified speech understanding and generation, Nova Sonic delivers natural, human-like conversational AI. The model offers multiple speaking styles and tool interfaces for connecting to external agents, so you can build more responsive and intuitive voice interfaces with greater contextual awareness.

A typical streaming pipeline comprises three main components: media source, media server, and media consumer. The previous diagram shows these components and their respective protocols, such as RTMP, RTSP, HLS, MPEG-DASH, and WebRTC.

Web Real-Time Communication (WebRTC) is an open standard that modernizes live streaming by enabling direct, real-time peer-to-peer connections without additional plugins or software installations. This approach eliminates the need for intermediate media servers and significantly reduces latency. Among common media streaming protocols, WebRTC delivers the lowest latency, as shown in the following image.

WebRTC also includes built-in features such as adaptive bitrate (ABR) streaming, forward error correction (FEC), and jitter buffer management. These features automatically adjust bandwidth consumption and mitigate packet loss and jitter under weak connectivity, so you can maintain fluent conversations even in poor network conditions.

WebRTC’s open-source nature and broad compatibility across browsers and platforms (Chrome, Firefox, Safari, Edge, Android, and iOS) accelerate solution adoption and encourage continuous improvement. WebRTC is also well suited for real-time processing of media streams with AI functions.

Solution architecture

You might want to deploy live streaming solutions with multilingual voice interaction for scenarios such as the following:

- Connected vehicles that assist drivers with real-time translation capabilities.
- Smart factories that support cross-cultural operator communication through voice-activated quality control systems.
- Robotics applications that provide multilingual customer service interactions.
- Smart home devices that offer instant voice control in different languages, plus global technical support through real-time audio translation and visual guidance.

The following diagram illustrates how to deploy the Nova Sonic solution together with Kinesis Video Streams as a managed WebRTC service. It shows tool integration with Retrieval Augmented Generation (RAG), Model Context Protocol (MCP) servers, and Strands Agents.

[1] In the client app, users initiate the WebRTC negotiation process by connecting to the Kinesis Video Streams WebRTC signaling channel. Audio and video data are then transmitted over the bidirectional WebRTC connection.

[2] After the signaling messages for the Session Description Protocol (SDP) offer/answer and Interactive Connectivity Establishment (ICE) candidate exchange, the client and server attempt to establish a bidirectional peer connection. Video and audio data can then be transmitted with low latency over the established connection.

[3] The media channel handles real-time audio and video streaming with adaptive bitrate control and codec negotiation. The data channel provides reliable, ordered transmission of arbitrary application data, such as text, files, and control messages. Both channels use Datagram Transport Layer Security (DTLS) encryption and the Session Traversal Utilities for NAT (STUN) and Traversal Using Relays around NAT (TURN) protocols for Network Address Translation (NAT) traversal; a sketch of fetching the STUN/TURN configuration follows this walkthrough.

[4] The speech-to-speech event processor orchestrates the input and output events exchanged with Nova Sonic. In our solution, these are categorized into media events, transmitted over the WebRTC media channel, and text data, transmitted over the WebRTC data channel.

[5] The server uses the Python SDK to establish an HTTP/2 connection for bidirectional streaming with Nova Sonic; a condensed sketch of opening this stream also follows the walkthrough. This connection supports real-time media data communication and minimizes latency for users.

[6] In addition to speech-to-speech audio conversation with pre-trained knowledge, Nova Sonic supports asynchronous tool calling to access MCP servers, Strands Agents, or RAG. This post demonstrates the tool use feature with examples.
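The following minimal sketch shows how a server-side component can retrieve the STUN/TURN configuration mentioned in step [3] from the Kinesis Video Streams signaling APIs with boto3. The channel ARN and Region are placeholders, and error handling is omitted.

```python
import boto3

CHANNEL_ARN = "arn:aws:kinesisvideo:us-east-1:111122223333:channel/demo-channel/1234567890"  # placeholder
REGION = "us-east-1"

# Look up the HTTPS endpoint used for ICE server configuration.
kvs = boto3.client("kinesisvideo", region_name=REGION)
endpoints = kvs.get_signaling_channel_endpoint(
    ChannelARN=CHANNEL_ARN,
    SingleMasterChannelEndpointConfiguration={
        "Protocols": ["WSS", "HTTPS"],  # WSS for signaling, HTTPS for ICE config
        "Role": "VIEWER",
    },
)
https_endpoint = next(
    e["ResourceEndpoint"]
    for e in endpoints["ResourceEndpointList"]
    if e["Protocol"] == "HTTPS"
)

# Fetch short-lived TURN URIs and credentials for NAT traversal.
signaling = boto3.client(
    "kinesis-video-signaling", endpoint_url=https_endpoint, region_name=REGION
)
ice = signaling.get_ice_server_config(ChannelARN=CHANNEL_ARN)
for server in ice["IceServerList"]:
    print(server["Uris"], "expires in", server["Ttl"], "seconds")
```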
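For step [5], the following condensed sketch opens the bidirectional HTTP/2 stream to Nova Sonic. It is based on the experimental aws_sdk_bedrock_runtime package used in the AWS Nova Sonic samples; exact class and parameter names may differ across SDK versions, and credentials are resolved from the environment here.

```python
from aws_sdk_bedrock_runtime.client import (
    BedrockRuntimeClient,
    InvokeModelWithBidirectionalStreamOperationInput,
)
from aws_sdk_bedrock_runtime.config import Config, HTTPAuthSchemeResolver, SigV4AuthScheme
from smithy_aws_core.credentials_resolvers.environment import EnvironmentCredentialsResolver

async def open_nova_sonic_stream(region: str = "us-east-1"):
    config = Config(
        endpoint_uri=f"https://bedrock-runtime.{region}.amazonaws.com",
        region=region,
        aws_credentials_identity_resolver=EnvironmentCredentialsResolver(),
        http_auth_scheme_resolver=HTTPAuthSchemeResolver(),
        http_auth_schemes={"aws.auth#sigv4": SigV4AuthScheme()},
    )
    client = BedrockRuntimeClient(config=config)
    # One long-lived HTTP/2 stream carries input events (audio, text, tool
    # results) and output events (transcripts, audio) in both directions.
    return await client.invoke_model_with_bidirectional_stream(
        InvokeModelWithBidirectionalStreamOperationInput(
            model_id="amazon.nova-sonic-v1:0",
        )
    )
```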

If you’re already using Nova Sonic, you will notice that this architecture is similar to the WebSocket-based solution. The following sections show the key differences.

Solution comparison

Compared to the WebSocket deployment option, this WebRTC-based speech-to-speech solution provides a different network layer suited to mobile and IoT devices, which often require low-latency connections without high network bandwidth. The solution also incorporates a customized Voice Activity Detection (VAD) layer for an enhanced user experience.

Audio streaming protocol changed from WebSocket to WebRTC

Voice data is transmitted through the WebRTC media channel as a stream: it travels over the audio track of the peer connection in Secure Real-time Transport Protocol (SRTP) format, instead of as WebSocket messages. We implemented the WebRTC features (such as SDP offer/answer, DTLS, Stream Control Transmission Protocol (SCTP), SRTP, and peer connection management) using the aiortc Python library.
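The following minimal sketch shows the server side of this setup with aiortc: answering an SDP offer and consuming decoded frames from the incoming audio track. The signaling transport and the forward_to_nova_sonic handler are hypothetical placeholders for parts provided elsewhere in the solution.

```python
import asyncio
from aiortc import RTCPeerConnection, RTCSessionDescription

pc = RTCPeerConnection()

@pc.on("track")
def on_track(track):
    # The audio track yields decoded frames from the SRTP media stream.
    if track.kind == "audio":
        asyncio.ensure_future(consume_audio(track))

async def consume_audio(track):
    while True:
        frame = await track.recv()          # an av.AudioFrame from the peer
        await forward_to_nova_sonic(frame)  # hypothetical downstream handler

async def handle_offer(offer_sdp: str) -> str:
    # Apply the remote SDP offer and produce the local answer; both are
    # exchanged over the Kinesis Video Streams signaling channel.
    await pc.setRemoteDescription(RTCSessionDescription(sdp=offer_sdp, type="offer"))
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)
    return pc.localDescription.sdp
```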

Human voice detection mechanism

The React WebRTC client continuously captures audio and sends it to the Python WebRTC server. To suppress noise, improve speech recognition accuracy, and reduce the number of audio tokens sent to Nova Sonic, the solution applies VAD to the pipeline on the server side. The implementation is based on the Python webrtcvad library, as shown in the following image. Built on a Gaussian mixture model (GMM), this library is lightweight, stable, and fast for WebRTC frame-level audio processing. You can also use other libraries, such as Silero VAD or pyannote VAD.
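As a minimal sketch of this approach, the following code filters incoming PCM audio with webrtcvad, assuming 16-bit mono audio at 16 kHz and 30 ms framing; the actual solution's framing and aggressiveness settings may differ.

```python
import webrtcvad

SAMPLE_RATE = 16000                               # webrtcvad accepts 8/16/32/48 kHz
FRAME_MS = 30                                     # webrtcvad accepts 10, 20, or 30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least) to 3 (most)

def speech_frames(pcm: bytes):
    """Yield only the 30 ms frames that contain speech."""
    for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[offset:offset + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```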

Audio data format adaptation

WebRTC defines specific audio and video format standards. When sending and receiving audio data through a WebRTC connection, you must perform some format adaptation: [1] interleaved stereo frames require extracting the left or right audio channel; [2] audio at 48 kHz or other sampling rates is resampled to the 16 kHz required by the Nova Sonic API; [3] int16 sample values are converted to float32 for enhanced calculation precision. A minimal sketch of these steps follows. For more information, see the GitHub documentation.
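The following sketch illustrates the three adaptation steps with NumPy, assuming interleaved int16 stereo PCM at 48 kHz as input; a production pipeline would likely use a higher-quality resampler than the naive decimation shown here.

```python
import numpy as np

def adapt_for_nova_sonic(pcm: bytes, in_rate: int = 48000) -> np.ndarray:
    samples = np.frombuffer(pcm, dtype=np.int16)

    # [1] Extract the left channel from the interleaved stereo frames.
    left = samples[0::2]

    # [2] Resample to 16 kHz. Simple decimation works here because 48000 is
    # an integer multiple of 16000; otherwise use a proper resampler.
    resampled = left[:: in_rate // 16000]

    # [3] Convert int16 values to float32 in [-1.0, 1.0].
    return resampled.astype(np.float32) / 32768.0
```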

Solution walkthrough

The solution in this GitHub repository provides a generic sample and two specific scenario examples: a smart home example and a connected vehicle example. You can adapt these patterns for your own applications.

Smart home example

In the smart home scenario, you open a dialog with Nova Sonic to control IoT devices. To illustrate a full command pipeline, the solution uses an Amazon Bedrock Knowledge Base to retrieve MQTT topics and generate AI responses. It then connects to the MCP server for AWS IoT Core to deliver command messages. The full architecture is shown in the following image.
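To make the command pipeline concrete, here is a simplified sketch of a tool handler that retrieves a device's MQTT topic from the knowledge base and publishes the command to AWS IoT Core. It calls IoT Core directly with boto3 instead of going through the MCP server, and the knowledge base ID, Region, and function name are illustrative assumptions.

```python
import json
import boto3

REGION = "us-east-1"                     # assumed Region
KNOWLEDGE_BASE_ID = "KB_ID_PLACEHOLDER"  # hypothetical knowledge base ID

bedrock_agent = boto3.client("bedrock-agent-runtime", region_name=REGION)
iot_data = boto3.client("iot-data", region_name=REGION)

def handle_device_command(device: str, command: str) -> dict:
    # Retrieve the MQTT topic for the device from the knowledge base.
    result = bedrock_agent.retrieve(
        knowledgeBaseId=KNOWLEDGE_BASE_ID,
        retrievalQuery={"text": f"MQTT topic for {device}"},
    )
    topic = result["retrievalResults"][0]["content"]["text"].strip()

    # Deliver the command message to the device through AWS IoT Core.
    iot_data.publish(topic=topic, qos=1, payload=json.dumps({"command": command}))
    return {"status": "sent", "topic": topic}
```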

For setup steps, see the smart-home readme on GitHub.

Connected vehicle example

In the connected vehicle scenario, the system establishes real-time monitoring to detect dangerous phone-use behavior by drivers. A voice assistant asks the driver whether assistance is needed and verifies driver attentiveness. Supervisory personnel can access real-time monitoring feeds through an independent video channel to confirm the safety status of both vehicles and drivers. The following architecture addresses this scenario:

The full media pipeline in the connected vehicle scenario is shown in the following diagram. The concurrent WebRTC connections are independent of each other, each with its own DTLS encryption.

For setup steps, see the connected-vehicle readme on GitHub.

Conclusion

In this post, we showed you how to build a WebRTC-based solution that combines Amazon Nova 2 Sonic and Amazon Kinesis Video Streams WebRTC. This solution addresses common barriers in live streaming, such as degraded performance on unstable networks and the lack of conversational intelligence. You can use it as the basis for building your own low-latency, intelligent, and resilient voice assistant applications for users of smart devices and connected vehicles.

To get started and learn more:


About the authors

Zihang Huang

Zihang Huang is a Specialist Solutions Architect for Agentic AI at AWS. He is an agentic AI expert for connected vehicles, smart home, renewable energy, and industrial IoT. Currently, he focuses on AI solutions with AgentCore, physical AI, IoT, edge computing, and big data.

Lana Zhang

Lana Zhang is a Senior Specialist Solutions Architect for Generative AI at AWS within the Worldwide Specialist Organization. She specializes in AI/ML, with a focus on use cases such as AI voice assistants and multimodal understanding. She works closely with customers across diverse industries, including media and entertainment, gaming, sports, advertising, financial services, and healthcare, to help them transform their business solutions through AI.

Bin Chen

Bin Chen is a Generative AI Specialist Solutions Architect at AWS, which he joined in 2019. He is dedicated to helping customers explore the frontiers of generative AI and bring projects from proof of concept to production using services such as Amazon Bedrock and Amazon SageMaker. He is currently especially focused on Agentic AI and end-to-end speech models.

Siva Somasundaram

Siva Somasundaram is a senior engineer at AWS who builds embedded SDK and server-side components for Kinesis Video Streams. With over 15 years of experience in video streaming services, he has developed media processing pipelines, transcoding, and security features for large-scale video ingestion. His expertise spans video compression, WebRTC, RTSP, and video AI. He is passionate about creating metadata hubs that power semantic search and RAG experiences, and about pushing the boundaries of what’s possible in video technology.