OpenAI has taken a significant step beyond chat this week with ChatGPT Agent. It represents the company’s first serious entry into multi-purpose autonomous AI: a system that can control computers, browse the web, and tackle many tasks in an integrated fashion.
Manus AI launched back in March and hit the headlines with an approach that was unique at the time. Perplexity Labs followed in June, and now ChatGPT Agent has arrived in July. All three promise similar capabilities: web browsing, code execution, document creation, and extended processing times. This rapid convergence suggests we’ve reached a technical consensus on what AI agents should do next, at least in the consumer sphere, even if execution varies.
ChatGPT Agent merges OpenAI’s previous Operator and Deep Research tools, running in its own virtual machine with access to a browser, a terminal, and connectors for Gmail and GitHub. Early demonstrations show it planning date nights by checking calendars and booking restaurants. As product lead Yash Kumar notes, users aren’t meant to watch the process; it’s designed for background operation.
The tone of the launch video indicated that OpenAI are aware that there are no guarantees of safety with this level of agentic power. The Register reports the system resists 95% of adversarial prompts from security researchers, but that 5% failure rate matters when an agent has terminal access to your files.
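To see why a 5% failure rate matters, it helps to look at how it compounds over repeated attempts. The sketch below is a back-of-envelope illustration only: it assumes each adversarial prompt is an independent event that gets through 5% of the time, which is a simplification and not how the reported figure was measured. The function name is hypothetical.

```python
def chance_of_compromise(resist_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` adversarial prompts succeeds.

    Assumption (ours, not from the reporting): every attempt is independent
    and is resisted with the same probability `resist_rate`.
    """
    if not 0.0 <= resist_rate <= 1.0:
        raise ValueError("resist_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # The agent stays safe only if it resists every single attempt.
    return 1.0 - resist_rate ** attempts


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(f"{n:>3} attempts: {chance_of_compromise(0.95, n):.1%} chance of at least one success")
```

Under these assumptions, ten adversarial prompts already give an attacker about a 40% chance of at least one success, which is why the headline 95% offers limited comfort for an agent that runs unattended.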
Performance benchmarks reveal some inconsistencies and notable promise. On FrontierMath mathematical reasoning tests, ChatGPT Agent achieves 27.4% accuracy on first attempts but reaches 49% given 16 tries. Yet on investment banking modelling tasks, like building three-statement financial models or leveraged buyout models, ChatGPT Agent hits 71.3% accuracy, outperforming both Deep Research and o3. There are likely to be many tasks that suit this combination of capabilities and training, and many that do not, still waiting to be discovered. What the benchmarks underscore is that giving a model tools can lift its performance well beyond that of the model alone.
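One way to read the FrontierMath figures is to compare them with a naive baseline. The sketch below assumes that "16 tries" means a problem counts as solved if any one of 16 attempts is correct, and asks what that score would be if every problem were equally hard and every attempt independent. Both assumptions are ours, the function name is hypothetical, and this is not OpenAI's evaluation method.

```python
def pass_at_k_if_independent(single_try_accuracy: float, k: int) -> float:
    """Expected any-of-k success rate if every try were an independent
    draw with the same per-problem success probability (a naive model)."""
    if not 0.0 <= single_try_accuracy <= 1.0:
        raise ValueError("single_try_accuracy must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    # A problem stays unsolved only if all k tries fail.
    return 1.0 - (1.0 - single_try_accuracy) ** k


FIRST_TRY = 0.274       # reported first-attempt accuracy
REPORTED_AT_16 = 0.49   # reported accuracy given 16 tries

naive_at_16 = pass_at_k_if_independent(FIRST_TRY, 16)

if __name__ == "__main__":
    print(f"Naive independent estimate at 16 tries: {naive_at_16:.1%}")
    print(f"Reported at 16 tries:                   {REPORTED_AT_16:.1%}")
    print(f"Gap:                                    {naive_at_16 - REPORTED_AT_16:.1%}")
```

The naive estimate comes out above 99%, far higher than the reported 49%. One plausible reading, under the assumptions above, is that retries mostly help on problems the agent can already sometimes solve, while a large share of problems stay out of reach however many times it tries.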
Our own testing at ExoBrain compared ChatGPT Agent against Manus and Perplexity Labs on research, coding and presentation tasks. Manus delivered the best final product (likely because it has already learnt from millions of user sessions), while ChatGPT Agent excelled at research depth, benefiting from its Deep Research foundation. Perplexity Labs produced the least comprehensive output.
These new Swiss Army knife agents excel at one-off, fuzzy tasks where some variability is acceptable. Planning a date night? Creative. Extracting insights from that folder of support messages and building an Excel spreadsheet? Done. Running critical business processes on a repeatable basis? Not yet.
Takeaways: Increasingly autonomous single-threaded systems such as ChatGPT Agent work best as intelligent assistants for discrete tasks rather than as autonomous workers. Its surprising excellence at investment banking modelling, set against its struggles with other tasks, reveals we’re still mapping where these tools excel. Businesses should experiment widely; you might discover your agent performs poorly at simple tasks yet excels at complex ones. The next challenge isn’t making agents more capable; it’s understanding and harnessing their uneven strengths while building dependable systems for repeated, critical, fully orchestrated work.
