On October 22nd, 2024, Anthropic announced improvements and new features for its Claude models. Beyond the model upgrades, the headline news is a feature that allows Claude to use computers the way people do: by looking at a screen, moving a cursor, clicking buttons, and typing text. Currently in public beta, this experimental feature represents a leap in expanding the agentic capabilities of LLMs.
Key Capabilities:
Interacts with computer interfaces through an API
Can translate instructions into computer commands
Available through Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI
This breakthrough takes Claude beyond traditional text-based assistants, moving it closer to a fully interactive agent.
From Assistants to Agents
Even before the ChatGPT revolution, developers sought to expand LLMs into more capable agents. Initially, these models could only use plugins or APIs; later, user-defined functions added even more power, letting developers extend an LLM’s abilities with specialized knowledge or actions, such as data processing, complex calculations, or interactions with external systems. However, fully exploiting these capabilities has often required significant specialist expertise.
To be truly agentic, these models follow an observe-plan-act loop to make decisions and interact with their environment. In each cycle, the agent observes its surroundings, gathering context to understand the current state. Based on these observations, it plans its next steps, selecting the best action to reach its goals. In the act phase, it carries out this plan, interacting directly with the environment. This repeating cycle lets the agent adapt dynamically to changes in real time, enabling more autonomous and complex workflows.
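To make the structure of this loop concrete, here is a minimal Python sketch. The Environment class and plan_next_action function are hypothetical placeholders standing in for the GUI environment and the LLM call; they are not part of any real SDK.

```python
# A schematic observe-plan-act loop. Environment and plan_next_action are
# hypothetical placeholders illustrating the cycle described above.

def run_agent(environment, goal, max_steps=20):
    for _ in range(max_steps):
        # Observe: capture the current state (e.g., a screenshot of the GUI).
        observation = environment.observe()

        # Plan: ask the LLM to choose the next action given the goal and state.
        action = plan_next_action(goal, observation)
        if action is None:  # The model decided the goal has been reached.
            break

        # Act: execute the chosen action against the environment.
        environment.execute(action)
```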
Each of these components is included in the solution proposed by Anthropic:
An LLM with planning / reasoning capabilities: Claude 3.5 Sonnet (new)
An environment to interact with: a graphical user interface (GUI) and a shell
A set of tools to observe and interact with the GUI, the filesystem, and the terminal
The toolset comprises three Anthropic-defined tools: a Computer Tool for taking screenshots and controlling the mouse and keyboard, a Text Editor Tool for viewing and editing files, and a Bash Tool for running shell commands.
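As an illustration, the snippet below shows how these three tools are declared when calling the API, following the beta tool types Anthropic documented at launch; the display dimensions are example values.

```python
# Tool definitions for the three Anthropic-defined tools in the October 2024
# computer-use beta. Display dimensions here are illustrative values.
tools = [
    {
        "type": "computer_20241022",     # screenshots, mouse, and keyboard
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
        "display_number": 1,
    },
    {
        "type": "text_editor_20241022",  # view and edit files
        "name": "str_replace_editor",
    },
    {
        "type": "bash_20241022",         # run shell commands
        "name": "bash",
    },
]
```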
An example
Imagine Claude is tasked with generating a monthly sales report by gathering data from a CRM and formatting it in a spreadsheet.
Observe: Claude begins by taking a screenshot of the CRM dashboard using the Computer Tool to identify elements like “Reports” and “Sales Data.”
Plan: Recognizing the “Reports” tab, Claude plans to navigate there by moving the mouse and clicking to access the section.
Act: Claude uses the Computer Tool to move the cursor to the “Reports” tab and clicks.
Observe: After clicking, Claude takes another screenshot to confirm it’s now in the “Reports” section.
Plan: Claude identifies a date filter option and plans to use the keyboard to type the relevant month, ensuring it pulls data only for that period.
Act: Using the Computer Tool, Claude enters the date range and applies the filter.
Observe: Claude confirms the filtered data is displayed and takes a final screenshot to verify.
Plan: Now, it plans to click “Export” and select “CSV” as the output format.
Act: Using the Computer Tool to control the mouse and keyboard, Claude downloads the CSV.
Next, Claude opens the CSV in a text editor. Using the Text Editor Tool, it reads and formats the data, preparing it for the report. Finally, with the Bash Tool, it automates emailing the finished report to stakeholders.
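To make this cycle concrete, here is a compressed sketch of the agent loop against the Anthropic Messages API, based on the beta interface documented at launch. The execute_tool helper is a hypothetical stand-in for code that performs a requested action and returns its result (for example, a screenshot).

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "type": "computer_20241022", "name": "computer",
    "display_width_px": 1024, "display_height_px": 768, "display_number": 1,
}]
messages = [{"role": "user",
             "content": "Export this month's sales report from the CRM as CSV."}]

while True:
    response = client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=tools,
        messages=messages,
        betas=["computer-use-2024-10-22"],  # opt into the computer-use beta
    )
    messages.append({"role": "assistant", "content": response.content})

    # Observation and planning happen inside the model; tool_use blocks
    # are the actions it has chosen.
    tool_uses = [b for b in response.content if b.type == "tool_use"]
    if not tool_uses:
        break  # no more actions requested: the task is done

    # Act: run each requested action and feed the result (e.g., a screenshot)
    # back to the model. execute_tool is a hypothetical helper.
    tool_results = [
        {"type": "tool_result", "tool_use_id": t.id, "content": execute_tool(t)}
        for t in tool_uses
    ]
    messages.append({"role": "user", "content": tool_results})
```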
Reference implementation
Claude’s implementation of computer interaction is OS-agnostic, leaving it up to developers to interpret tool commands and translate them into specific actions. To help users get started quickly, however, Anthropic has provided a reference implementation for onboarding and testing these capabilities.
The reference implementation consists of a Linux Docker container running the following components:
A chatbot connected to Claude that executes the agent loop. Tool invocations are translated into xdotool invocations and bash commands (see the sketch after this list).
A GUI desktop running on the X Window System with a VNC server and a noVNC browser client
A web server displaying the chat and the GUI side by side.
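As an illustration of that translation layer, here is a simplified sketch of how a few Computer Tool actions could map onto xdotool commands. The action names mirror the documented tool input; the rest is schematic, and the actual reference implementation covers many more cases.

```python
import subprocess

def run_xdotool(args):
    # Run an xdotool command against the container's X display.
    subprocess.run(["xdotool", *args], check=True)

def execute_computer_action(action):
    """Translate a Computer Tool invocation into an xdotool call.

    `action` mirrors the tool_use input, e.g. {"action": "left_click"} or
    {"action": "mouse_move", "coordinate": [640, 400]}. Simplified sketch;
    the reference implementation handles many more action types.
    """
    if action["action"] == "mouse_move":
        x, y = action["coordinate"]
        run_xdotool(["mousemove", str(x), str(y)])
    elif action["action"] == "left_click":
        run_xdotool(["click", "1"])  # button 1 = left mouse button
    elif action["action"] == "type":
        run_xdotool(["type", "--delay", "12", action["text"]])
    else:
        raise NotImplementedError(action["action"])
```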
Key use cases
Claude’s new ability to interact with computers as humans do opens a wide range of automation possibilities across multiple fields. This capability enables Claude to handle tasks traditionally requiring human intervention, offering significant improvements in efficiency, accuracy, and accessibility. Claude’s skills can be applied to repetitive tasks, dynamic workflows, and user-guided tutorials alike. Below are some areas where Claude’s new capabilities can make a real difference:
IT Support Automation: Handles routine support tasks (e.g., software setup, password resets), reducing time spent on repetitive troubleshooting.
Data Entry & Reporting: Automates data input, form submissions, and report updates, improving speed and accuracy for large datasets.
Social Media & Customer Engagement: Manages social media interactions and schedules posts, freeing teams to focus on high-value engagement. Note: the current implementation includes safety measures that prevent these kinds of interactions.
Robotic Process Automation (RPA) Alternative: Simplifies automation for complex, dynamic workflows without needing custom code, making it more accessible.
Compliance Checks & QA Testing: Performs documentation verification and software testing by navigating interfaces, minimizing legal risks and boosting QA efficiency.
E-commerce & Order Processing: Automates order fulfillment and inventory updates, reducing manual intervention and improving processing accuracy.
Training & Onboarding: Provides guided, hands-free tutorials for new users, enhancing the onboarding experience.
As noted, certain safety measures, such as restricting Claude’s access to specific platforms (e.g., social media) or sensitive data, ensure its capabilities are used responsibly. Moving forward, maintaining robust guardrails will be crucial as Claude’s potential expands across more diverse and sensitive workflows.
Safety
Anthropic has implemented guardrails to prevent misuse of Claude’s new computer-interaction capabilities. While specific details are limited, the company has confirmed it uses classifiers and other methods to identify and mitigate potentially harmful actions. These safeguards are designed to flag activities that fall outside intended use cases, keeping Claude’s interactions secure and ethical.
Additionally, there are restrictions around sensitive domains. For example, Claude is discouraged from engaging in election-related tasks or interacting with critical areas like government websites, social media content creation, and domain registration.
One concern they’ve identified is prompt injection—a type of cyberattack where malicious instructions are fed to an AI model, potentially causing it to override its initial directions or perform unintended actions that deviate from the user’s intent. Since Claude can interpret screenshots from internet-connected computers, it could be exposed to content that contains prompt injection attacks.
Using sandboxed environments is also recommended, as they create a controlled space for testing and operating Claude’s capabilities safely. While the current implementation does not enforce sandboxing, Anthropic emphasizes its importance to protect system security and limit unintended access to sensitive applications or data. A sandbox setup helps isolate Claude’s interactions from the rest of the system, providing an extra layer of security, especially in high-stakes or sensitive use cases.
Limitations
The reference implementation gives a glimpse of what is possible when we give an agent full control of our computer. Nevertheless, it is highly experimental and has some limitations.
It is slow and potentially expensive: The cycle of taking a screenshot, reasoning, and executing the next action is painfully slow and costly. This is particularly true for workflows involving multiple actions like move and click, although some asynchronous use cases may tolerate this speed.
It is not reliable: On OSWorld, an evaluation designed to test how well models can use computers as people do, Claude achieves an accuracy of only 14.9%, significantly below human-level skill (typically 70-75%). This leads to frequent mistakes or to the agent getting lost mid-task. Precision-demanding applications, like drawing a house, are not feasible at this stage.
Observations are discrete and agent-initiated: The agent cannot respond to real-time changes in the GUI, such as notifications or dynamic game elements.
Safety concerns: Jailbreaking such a system could have serious risks. A safer approach could involve exposing a limited set of applications in a sandboxed environment.
Unpredictability: In predictable workflows, a Robotic Process Automation (RPA) solution might make more sense. For unpredictable workflows, production deployment is generally inadvisable. However, this approach could aid in defining RPA workflows automatically.
Complex gestures and animations: At present, the agent struggles to interpret complex gestures and animations in the GUI, such as scrolling and dropdown menus.
Custom prompting: Tasks often demand custom prompting to guide the agent effectively, which adds complexity and may limit how well the approach generalizes across tasks.
We expect that future iterations will address these limitations, gradually enhancing reliability, speed, and responsiveness. Innovations like continuous observation, user-friendly customization, and improved handling of complex gestures could transform this tool into a more robust, adaptable agent.
Looking Forward
Claude’s evolving ability to interact with computer interfaces opens the door to significant advancements in automated workflows. As these capabilities develop, we can anticipate a future where Claude not only performs tasks with greater speed and accuracy but also handles complex, multi-step processes with minimal human intervention. By extending its planning and reasoning horizon, Claude will be able to tackle intricate workflows that currently require substantial oversight, steadily reducing the need for human supervision. These changes could make Claude’s assistance more autonomous, flexible, and applicable across diverse contexts. Below are several key directions for future development:
Application-Specific GUI Access: Limiting the GUI interaction to a defined subset of applications or even a single application could enhance control, focus functionality, and mitigate potential security risks, especially for sensitive workflows.
Collaborative and User-Friendly Interactions: Building more intuitive, user-friendly interfaces for interacting with Claude could allow non-technical users to define and adjust workflows easily. Interactive guidance features and visual task builders would empower users to create automated solutions without coding.
Continuous and Reactive Observations: Future iterations could allow LLMs to perform ongoing monitoring and real-time responses to GUI changes—such as notifications or gameplay—expanding potential applications in dynamic, time-sensitive environments.
Integration into Other LLMs: As other models catch up, shared advancements and interoperability will drive the development of more versatile, widely accessible automation tools.
Each of these advancements could bring Claude’s interaction capabilities closer to seamless, human-like collaboration, broadening its role in both personal and professional settings.
This advancement underscores the potential of Anthropic’s computer use feature to transform AI from passive assistant into active agent capable of complex tasks.
References:
Claude 3.5 Sonnet Model card addendum: https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf
Anthropic announcement: https://www.anthropic.com/news/developing-computer-use
Reference implementation: https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo