ExoBrain
AI safety, creative AI, model releases, multimodal AI

The art of conversation

Recent releases from Kyutai, OpenAI, Character.AI, and ElevenLabs demonstrate significant advancements in real-time multimodal and voice interactions, raising both excitement and ethical concerns regarding safety and misuse.

Joel Miller


2 min read

This week we saw demos and releases suggesting our interactions with AI will continue to get more fluid. Kyutai’s Moshi experiment, a new OpenAI GPT-4o voice mode demo, Character.AI, and ElevenLabs all showcased real-time capabilities.

Moshi, a relatively small open-weight multimodal language model from the French lab Kyutai, can process speech input and output simultaneously, understand and express emotions, and speak with different accents. Meanwhile, on stage at last week’s AI Engineer World’s Fair, OpenAI’s GPT-4o was demonstrated via an unreleased version of ChatGPT Desktop. The demo showcased integrated low-latency voice generation, visual context understanding, video generation, and rapid optical character recognition.

Character.AI unveiled ‘Character Calls’, a feature that lets users hold real-time voice conversations with AI characters in its mobile app. ElevenLabs, for its part, expanded its AI voice capabilities by adding AI-recreated voices of late Hollywood celebrities such as Judy Garland and James Dean to its reader product. Both companies emphasise safety measures to prevent misuse, but as ever, ethical questions loom large. The delay of GPT-4o voice mode until the autumn is likely due both to the technical demands of serving many users and to the challenge of preventing undesired emotions and content in voice output.

Takeaways: These advancements represent a significant evolution in AI user experience, moving us closer to genuine conversational interaction. The ability to switch seamlessly between visual, audio, and text inputs promises a great deal. However, we’re still in the early stages. The potential is exciting, but few of these features are widely available or fully integrated yet, which makes it hard to evaluate the practical utility and broader implications of these multimodal interactions.