ExoBrain
Claude clicks with computers
agentic AI, coding agents, multimodal AI, research and science

Anthropic’s experimental computer use features allow Claude models to interact with digital interfaces through visual reasoning, demonstrating a promising but currently limited approach to agentic AI capabilities.

Joel Miller

Anthropic released major updates to their Claude models this week, upgrading Claude 3.5 Sonnet and introducing a faster, cost-effective 3.5 Haiku. While 3.5 Opus, the largest model in the family, remains unreleased amid rumours of indefinite delays, the updates bring notable improvements. The new Sonnet delivers enhanced coding and reasoning capabilities, plus code execution in the Claude app, but it’s the experimental computer use features that are drawing the most attention. An open-source demonstration tool released by Anthropic lets the standard Sonnet model interact with a computer as a human would – moving the mouse, clicking buttons, and typing text.

The system runs in a controlled virtual environment, where Claude views a series of screenshots and drives standard Linux automation tools such as xdotool (contrary to some reports, it does not control your own computer out of the box). When ExoBrain tested the system, it consumed over 500,000 tokens just to research and compile a simple information table, and made several errors along the way, highlighting both its experimental nature and its current cost limitations. On OSWorld, a benchmark of over 300 human-like computer tasks, Claude scores 14.9% – nearly double the previous best AI system but far below the 70-75% achieved by a capable human computer user. The system struggles with common actions like scrolling, dragging and zooming. But in a fascinating insight from Anthropic, Claude showed remarkably human tendencies during demonstrations, occasionally getting distracted and stopping mid-task to browse photos of Yellowstone National Park.
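
To make the "model decides, xdotool acts" idea concrete, here is a minimal Python sketch of how a model's chosen actions might be translated into xdotool commands. It is an illustration only, not Anthropic's implementation: the action dictionaries, their field names and the to_xdotool helper are all hypothetical. Only the xdotool subcommands themselves (mousemove, click, type, key) are real, and the sketch builds the command lines without executing them, so it runs safely on any machine.

```python
import shlex


def to_xdotool(action: dict) -> list[str]:
    """Translate one hypothetical action dict into an xdotool argument list."""
    kind = action["type"]
    if kind == "mouse_move":
        # Move the pointer to absolute screen coordinates.
        return ["xdotool", "mousemove", str(action["x"]), str(action["y"])]
    if kind == "left_click":
        # Button 1 is the left mouse button.
        return ["xdotool", "click", "1"]
    if kind == "type":
        # "--" ends option parsing so the text is never read as a flag.
        return ["xdotool", "type", "--", action["text"]]
    if kind == "key":
        # Key names follow X keysym naming, e.g. "Return".
        return ["xdotool", "key", action["key"]]
    raise ValueError(f"unsupported action: {kind!r}")


# A made-up plan of the kind a model might propose after viewing a screenshot.
plan = [
    {"type": "mouse_move", "x": 640, "y": 360},
    {"type": "left_click"},
    {"type": "type", "text": "Yellowstone National Park"},
    {"type": "key", "key": "Return"},
]

commands = [to_xdotool(step) for step in plan]
for cmd in commands:
    # Print a shell-safe rendering instead of running the command.
    print(shlex.join(cmd))
```

In a real loop, each command would be run inside the virtual environment and followed by a fresh screenshot for the model to inspect, which is one reason token usage climbs so quickly.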

This approach differs from other AI computer solutions like Open Interpreter, which focuses on OS code execution, or OpenAdapt, which learns from human demonstrations similar to robotic process automation tools. Instead, Claude builds understanding through visual reasoning and trial-and-error learning – much like a human encountering a new application. For example, when working with spreadsheets, it learns to identify buttons and features by reasoning about their icons and testing their functions. An interesting parallel comes from the OS-Copilot project, which takes a more flexible approach by generating new tools and learning from experience. Their FRIDAY agent improved spreadsheet task performance by 35% through self-directed learning, hinting at how future systems might combine Claude’s approach with more adaptive capabilities. While OpenAI and Microsoft have demonstrated their AI apps analysing screen shares in a similar way, these features are not yet widely available.

The real significance here lies not in today’s capabilities but in what this approach reveals about multi-modal AI learning. By giving an AI model, trained on software imagery and video, direct access to computer interfaces along with the ability to try, fail, and learn from outcomes, we’re seeing how complex skills can be developed through exploration rather than explicit programming.

Takeaways: While computer use capabilities remain more prototype than practical tool, they demonstrate how frontier AI models can learn complex interfaces through reasoning and experimentation. This suggests a future where AI assistants might master new applications organically, similar to humans. For now, the focus should be on understanding these systems’ potential while being realistic about their current limitations. Early adopters should expect significant experimentation time and costs, but the insights gained could be valuable for understanding how AI might eventually automate routine computer tasks.