Realtime state-space speech

State-space models (SSMs) are a notable alternative to the all-conquering transformer design behind ChatGPT, Gemini and Claude. An interesting use case for the more efficient SSMs emerged this week. Cartesia, which has been pioneering in this area, released Sonic, a new voice model for high-quality, lifelike audio. It has a latency of just 135ms, making it the fastest in its class. Essentially, it can generate speech from text in a range of voices almost instantaneously, which is great for user interactions and voice-powered solutions.

It also claims to be able to clone a voice with less than a minute of audio. Our initial experiments found the pre-prepared voices much more reliable than the cloned ones, but the potential is huge.

At ExoBrain we’ve recently been experimenting with ElevenLabs. Its AI voice generator converts text into natural-sounding speech in 29 languages. It supports various accents and styles and is the most common AI TTS (text-to-speech) solution in use today. ElevenLabs targets content creators, writers, game developers, and businesses looking to create audio experiences, but lacks the near-instant generation of Cartesia.

You can experience voice generation in a new podcast version of this newsletter. Last week’s episode, available here on Spotify, showcases a range of ElevenLabs voices, and this week’s episode, embedded below, uses a cloned-voice workflow. Whilst they sound pretty good, one of the issues with AI synthesis is its monotonous delivery, especially if there is a single narrator…

Takeaways: We’re using a new workflow based on full voice cloning, which improves the natural feel. If you want to quickly generate audio material, the following setup and workflow are worth exploring:

Setup:

Sign up for ElevenLabs and clone your voice. You’ll need to verify it’s you by reading out a generated phrase; being allowed to clone somebody else’s voice would be highly problematic.
We’ve found that a good-quality mic and recording under a strategically draped duvet to deaden the sound help to improve the capture. You’ll need to record a minimum of 10 minutes of training data, but more is better.

Workflow:

Claude or Gemini do a great job of writing scripts from source text. Tell them it’s a podcast and they will structure it in a suitable way, but some editing will be required.
ElevenLabs ‘projects’ allow a large amount of text to be converted to speech, but the monotony will creep into the resulting audio. We convert the AI script into our cloned voice first, section by section, which speeds narration up and introduces more variation.
We then use the voice-to-voice generation tool, feeding the cloned audio back in to generate more varied output with one of the many professional voices on offer.
Finally, we cut everything together in an audio editor (using AI music and effects from Suno).
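
If you would rather script the section-by-section conversion than use the web interface, the step might look something like the sketch below. This is a rough illustration rather than our actual tooling: the endpoint path, the `xi-api-key` header, the request fields and the `eleven_multilingual_v2` model name are assumptions based on ElevenLabs’ public REST API and should be checked against the current documentation, and the helper names (`split_into_sections`, `build_tts_request`, `narrate`) and the 1,500-character section size are made up for the example.

```python
import json
import urllib.request

# Assumed base URL for the ElevenLabs REST API; verify against current docs.
API_BASE = "https://api.elevenlabs.io/v1"


def split_into_sections(script: str, max_chars: int = 1500) -> list[str]:
    """Group blank-line-separated paragraphs into sections of roughly max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence.
    """
    sections: list[str] = []
    current = ""
    for para in (p.strip() for p in script.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            sections.append(current)
            current = para
        else:
            current = candidate
    if current:
        sections.append(current)
    return sections


def build_tts_request(text: str, voice_id: str, api_key: str,
                      model_id: str = "eleven_multilingual_v2") -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request for one section.

    The URL shape, header name and body fields are assumptions, as is the
    default model_id.
    """
    url = f"{API_BASE}/text-to-speech/{voice_id}"
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def narrate(script: str, voice_id: str, api_key: str,
            out_prefix: str = "section") -> list[str]:
    """Convert each section separately and save one audio file per section."""
    paths = []
    for i, section in enumerate(split_into_sections(script), start=1):
        request = build_tts_request(section, voice_id, api_key)
        with urllib.request.urlopen(request) as response:  # network call
            audio = response.read()
        path = f"{out_prefix}_{i:02d}.mp3"
        with open(path, "wb") as f:
            f.write(audio)
        paths.append(path)
    return paths
```

Calling `narrate(script_text, voice_id, api_key)` with your cloned voice’s ID would leave a numbered audio file per section, ready to feed into the voice-to-voice step or straight into the audio editor.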

For now, the state of the art is quite impressive, if not yet entirely convincing. The use of fast new architectures and OpenAI’s yet-to-be-released Voice Engine hint at significant progress to come.
