csm.rs: Blazing-fast Rust Conversational Speech Model
A few weeks ago, we introduced azzurra-voice, our open, state-of-the-art Italian Text-to-Speech (TTS) model. It was the first step toward our vision for the Azzurra Project: a private, personal, and empathetic AI companion that runs locally on your own device.
A beautiful voice is just one part of the equation. To create a truly interactive and real-time experience, that voice needs a powerful engine—one that can generate speech instantly, without compromising on quality or sending data to the cloud.
Today, we’re releasing the next crucial piece of the Azzurra Project:
csm.rs, a blazing-fast, open-source TTS inference engine written in Rust.
What is csm.rs?
csm.rs is a high-performance Rust implementation of Sesame’s Conversational Speech Model (CSM), the base model behind azzurra-voice, built on the powerful candle machine learning framework. It’s designed from the ground up for one thing: raw performance for real-time streaming TTS.
This is the engine that will power azzurra-voice, and any other csm-1b compatible model, in our upcoming Azzurra-Pipeline.
The engine is built around several key features that make it ideal for local AI:

- Blazing fast: built in Rust on candle to deliver the high-throughput, low-latency performance needed for natural, real-time conversation.
- Extremely efficient: supports GGUF-based q8_0 and q4_k quantization, letting you run large TTS models with a significantly smaller memory footprint — perfect for consumer hardware.
- Runs virtually anywhere: thanks to candle, it supports multiple hardware backends, including MKL (Intel), Accelerate (macOS), CUDA/cuDNN (NVIDIA), and Metal (Apple Silicon).
- Seamless integration: a built-in web server exposes an OpenAI-compatible API, so csm.rs can act as a drop-in replacement for existing TTS services.
- Broad model support: natively handles the original sesame/csm-1b weights as well as fine-tuned models from the Hugging Face Hub, like our own azzurra-voice.
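To put the quantization savings in rough numbers: a minimal back-of-the-envelope sketch, assuming the commonly cited average bit widths for GGUF formats (q8_0 stores a scale per block, averaging about 8.5 bits per weight; q4_k comes to roughly 4.5 bits per weight — approximations, not exact on-disk sizes) applied to a 1B-parameter model like csm-1b:

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model memory footprint in GB for an average bits-per-weight."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate average bits per weight; the q8_0 and q4_k figures are rough
# GGUF averages, ignoring per-block scale layout details.
N_PARAMS = 1e9  # a 1B-parameter model such as csm-1b
for fmt, bpw in [("f16", 16.0), ("q8_0", 8.5), ("q4_k", 4.5)]:
    print(f"{fmt}: ~{approx_size_gb(N_PARAMS, bpw):.2f} GB")
# f16: ~2.00 GB, q8_0: ~1.06 GB, q4_k: ~0.56 GB
```

So q4_k brings a model that needs about 2 GB of weights in fp16 down to well under 1 GB, which is what makes running it comfortably on consumer hardware plausible.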
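Because the server is OpenAI-compatible, any client that speaks the OpenAI text-to-speech API shape should work against it. The sketch below is a hypothetical example: the base URL, port, and request fields are assumptions modeled on OpenAI's `/v1/audio/speech` endpoint — check the csm.rs README for the actual server address and accepted parameters.

```python
import json
import urllib.request

# Assumption: a local csm.rs server listening here; adjust to your setup.
BASE_URL = "http://localhost:8080"

def build_speech_request(text: str, model: str = "azzurra-voice") -> urllib.request.Request:
    """Build an OpenAI-style text-to-speech request for a local csm.rs server."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With a running server, fetch the generated audio like this:
# with urllib.request.urlopen(build_speech_request("Ciao, come stai?")) as resp:
#     with open("out.wav", "wb") as f:
#         f.write(resp.read())
```

The drop-in property is the point: swapping an existing cloud TTS backend for csm.rs should mostly be a matter of changing the base URL.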
Why We Built a New Engine
The goal of the Azzurra Project is to create a fully local conversational agent that listens, thinks, and speaks on your personal computer. For a conversation to feel natural, the delay between you speaking and the agent responding must be minimal. Traditional Python-based inference frameworks, while excellent for research, often carry overhead that makes achieving this ultra-low latency a challenge.
By building csm.rs in Rust, we gain the low-level control needed to squeeze every last drop of performance out of the hardware. This ensures the voice generation step is never the bottleneck, paving the way for a truly fluid and responsive AI companion.
Open Source for a Private Future
We are releasing csm.rs under the GNU Affero General Public License (AGPL) v3. We believe that foundational tools for private, personal AI should be open, transparent, and available to everyone. We encourage developers, researchers, and hobbyists to use it, learn from it, and contribute back to the project.
What’s Next?
With the voice (azzurra-voice) and now the engine (csm.rs) released, we are another step closer to realizing the complete Azzurra Project. The next component we’re focused on is azzurra-brain, a conversational Italian large language model designed to be a thoughtful companion.
Stay tuned as we continue to build the future of private, empathetic AI.
Federico Galatolo, Cartesia CTO