GPT-4 goes omni-modal

At the highly anticipated OpenAI Spring event on Monday, the company unveiled GPT-4o, a new ‘omni-modal’ model capable of fluidly handling images, video, text, and audio. The new model boasts near-instant responsiveness, emotionally engaging interactions (think the movie ‘Her’), and new abilities in image generation. In perhaps the biggest news from the event, OpenAI announced plans to make GPT-4o accessible to its 100+ million free ChatGPT users. With a potential partnership with Apple for Siri further expanding its reach, the event showcased the potential of a GPT-4-class model to provide true assistance through voice while following along via video and screen sharing. Here’s a rundown of the key announcements and their current status:

GPT-4o: Now available to most users on the paid ChatGPT Plus tier. We’ve been using it extensively, and it’s fast and very human-like in its general answers. The free-tier rollout has reportedly started, although so far it is only available to a minority of users.
GPT-4o image input: Now live in ChatGPT.
GPT-4o image generation: ChatGPT still uses DALL-E, so we are still waiting to try the new integrated modality.
GPT-4o voice input and output: ChatGPT is still using the old system, so most users have no chance yet to experience the instant responses and “Her”-style answers.
Improvement to data analysis: Now rolling out.
Custom GPTs: AIs created by users are still stuck on the old GPT-4.
New Mac and iOS apps: Not widely available yet (and no Windows or Android versions mentioned). The mobile demos all seemed very much geared towards closing the deal with Apple.

Full rollout is expected over the next “few weeks”. The reaction has been mixed. Those who see LLMs hitting a wall in terms of progress suggested the release proves their thesis, while others argued that its capabilities and likely more efficient size show there is much progress still to come. GPT-4o is now officially top of the LMSYS Arena Leaderboard. Some have found it better at writing tasks, but less capable than the older GPT-4 at coding and more complex multi-step reasoning. It is an entirely new model compared to GPT-4, trained primarily to be cheaper to run and multi-modal. Many believe GPT-5 will build substantially on this new combined video, image, audio, and text approach.

Days after the launch event, OpenAI was back in the headlines with the resignations of Ilya Sutskever, OpenAI co-founder and Chief Scientist (who has kept a very low profile since Sam Altman’s temporary ousting in November 2023), and Jan Leike, co-head of OpenAI’s AI safety team. These resignations follow several other safety-related departures, including that of Daniel Kokotajlo, suggesting growing concern among AI safety researchers about OpenAI’s priorities. It’s likely the compute budget at OpenAI is being directed to new products. Sutskever, whose focus since being instrumental in getting GPT models to work so effectively has been on managing their growing capability as they scale, seems to be looking for other avenues to pursue this ‘alignment’ research. OpenAI has also recently hired Shivakumar Venkataraman, who previously led the Google search ads business. Given this and the product-focused nature of GPT-4o, the OpenAI culture appears to be in flux.

Takeaways: No new enterprise features were announced (and no sign of Microsoft), and nothing on autonomous agent-based capabilities. Look out for separate events in the coming weeks. The release of GPT-4o marks a significant milestone in the evolution of AI and its integration into everyday life through voice and video; the traditional digital assistant is about to be replaced. The products OpenAI (and Google) embed into our lives will have profound consequences.
