
Browser AI Automation: Can LLMs Really Handle the Mundane? From Lunch Orders to Complex Workflows

How can Large Language Models automate browser-based tasks with minimal prompt engineering? We explore their capabilities, their limitations, and the broader landscape of AI-driven web automation tools.

TL;DR: LLMs for Browser Automation

  • Tested LLMs for automating browser-based tasks.
  • Focused on two scenarios: ordering lunch via Google Form and submitting holiday requests through an intranet.
  • Achieved full automation with simple natural language commands (e.g., “order meat dumplings, no soup”).
  • Found that while LLMs are powerful, they need careful guidance to handle web interface complexities.
  • Explored the broader AI browser automation landscape, including:
    • Commercial tools (e.g., OpenAI Operator).
    • Open-source solutions (e.g., browser-use, Open Operator).
    • Vision-based parsing technologies and custom-built solutions for business-specific needs.

Let’s examine this topic through two specific use cases.

The Rise of Browser-Based Processes & the Need for Automation

Much of our work and daily life now happens in web browsers, from using internal tools and SaaS apps to filling forms online. This reliance on browser workflows creates a demand for automation, especially for repetitive tasks like data entry or submitting reports. Automating these processes boosts productivity, reduces errors, and frees up employees for more strategic work.

Limitations of Traditional RPA

While Traditional Robotic Process Automation (RPA) has automated tasks for years, it often struggles with dynamic web interfaces. RPA relies on rigid rules, making it brittle to UI changes and inflexible with variations. AI-powered automation using Large Language Models (LLMs) offers a promising alternative. With their natural language understanding, LLMs provide a more adaptable and intuitive approach to browser automation, overcoming traditional RPA’s limitations.

browser-use

We based our solution on the browser-use library.

Why We Chose This Tool

  • Popularity: The rapid growth in GitHub stars (50k in 3 months) clearly reflects strong community interest and adoption.
  • Ease of Use: Creating a minimal working example for task automation requires only a prompt and about 20 lines of code. This simplicity allows for quick experimentation and iteration.

How Does browser-use Work?

browser-use is built on top of Playwright, a browser automation library developed by Microsoft. It extends Playwright’s capabilities by integrating large language models (LLMs) to enable a deeper understanding of web page content. This is achieved in two key ways:

  • Parsing HTML: The raw HTML of a page is analyzed and fed into the model’s context.
  • Visual Understanding (optional): If using a multimodal model like GPT-4o, the tool can take a screenshot of the current view and extract visual features from it.

By combining structured HTML analysis, visual context, and Playwright’s robust automation features, browser-use offers a powerful and intuitive way to navigate and interact with web pages using natural language.
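To make this concrete, below is a minimal sketch of how an agent can be created with browser-use, following the library's quick-start pattern. Exact imports and class names may differ between versions, and the URL, the order, and the model choice are placeholders.

```python
import asyncio

from langchain_openai import ChatOpenAI  # LLM wrapper used in browser-use examples
from browser_use import Agent


async def main() -> None:
    # The task is plain natural language; browser-use turns it into Playwright
    # actions using the LLM's reading of the page HTML (and, with a multimodal
    # model such as GPT-4o, screenshots of the current view).
    agent = Agent(
        task="Go to <LUNCH_FORM_URL> and order meat dumplings, no soup.",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    await agent.run()


if __name__ == "__main__":
    asyncio.run(main())
```

As the rest of this post shows, the interesting work is not in this boilerplate but in how the task string itself is written.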

Use Case No. 1: Luncher – Automating Lunch Orders

To explore the practical capabilities of AI agents in browser automation, we selected a common, relatable scenario: ordering lunch via a Google Form. Many organizations use similar forms for various simple, recurring tasks. While straightforward, manually filling out such forms daily can be repetitive. This makes it an ideal candidate for testing how effectively an AI agent can handle a typical browser-based workflow.

Task Flow – Step-by-Step

The lunch ordering process via Google Form is structured as follows:

  1. Page 1: User Identification & Order Details.
    • The user must select an option to include their email address in the form response.
    • The user is required to choose a main dish from a list of options.
    • The user must select a soup option, including a “no soup” choice if desired.
    • Navigation to the second page is done by clicking the “Next” button at the bottom of the page.
  2. Page 2: Confirmation & Submission.
    • Optionally, the user can select “Send me a copy of my responses”.
    • Finally, the user clicks the “Submit” button to place the lunch order.

While seemingly straightforward, this process has critical nuances. For instance, failing to select the email inclusion option or forgetting to choose a soup option (even “no soup”) will result in form validation errors, preventing successful submission. These subtle requirements make it a good test case for AI agents to demonstrate contextual understanding and error handling.

Agent’s Performance and Prompt Engineering Journey

Initially, we tried a very simple, natural language prompt: “go to <LUNCH_FORM_URL> and order me <MAIN DISH> and <SOUP>”. The results were inconsistent, to put it mildly. The agent frequently overlooked the option to include the email address, seemed to get lost scrolling up and down the page, and sometimes prematurely declared task completion before even reaching the second page and submitting the order. Despite these issues, the agent would confidently claim “success.”

This initial experiment highlighted a key limitation: the agent struggled with understanding the implicit context and constraints of the form. It wasn’t enough to simply tell it what to order; we needed to guide it through the process of correctly filling out the form.

Our solution was to provide explicit, step-by-step instructions within the prompt, essentially “injecting” our understanding of the form’s structure into the agent’s task definition. This led to our refined, and ultimately successful, prompt:

“Go to <LUNCH_FORM_URL> and order lunch.

Firstly, select the option to include my email in the response.

Secondly, select the main dish.

Thirdly, scroll down and select the soup option (or ‘no soup’ if I don’t want soup).

Finally, click on the ‘Next’ button. On the next page, select the ‘Send me a copy of my responses’ option and then click ‘Submit’.

Here is my order: <MAIN DISH> and <SOUP>”

As you can see, this prompt explicitly outlines each necessary step, mirroring the logical flow of the form. With this detailed prompt, the agent achieved 100% success in automating lunch orders.
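To make this reusable, the step-by-step prompt can be kept as a template and filled with a concrete order before it is handed to the agent. The sketch below is our own illustration: the template text mirrors the prompt above, while the helper name and placeholders are hypothetical.

```python
LUNCH_PROMPT_TEMPLATE = """Go to {form_url} and order lunch.
Firstly, select the option to include my email in the response.
Secondly, select the main dish.
Thirdly, scroll down and select the soup option (or 'no soup' if I don't want soup).
Finally, click on the 'Next' button. On the next page, select the
'Send me a copy of my responses' option and then click 'Submit'.
Here is my order: {main_dish} and {soup}"""


def build_lunch_prompt(form_url: str, main_dish: str, soup: str) -> str:
    """Fill the step-by-step lunch-order prompt with a concrete order."""
    return LUNCH_PROMPT_TEMPLATE.format(
        form_url=form_url, main_dish=main_dish, soup=soup
    )


# Example: the task string that would be passed to the agent.
task = build_lunch_prompt("<LUNCH_FORM_URL>", "meat dumplings", "no soup")
```

Keeping the procedural steps in the template and only the order itself as variables preserves the natural-language interface for users while locking down the parts the agent tends to get wrong.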

Key Takeaways from Google Forms Handling

  • Context is King: LLMs, even powerful ones, may not inherently grasp the implicit context of web forms. Explicitly guiding them through the process is crucial.
  • Detailed Prompt Engineering is Key: Investing time in understanding the task flow and translating that understanding into a detailed, step-by-step prompt is essential for reliable browser automation.
  • Simplicity vs. Clarity: While natural language prompts are appealing, sometimes a more structured and procedural prompt is necessary to ensure task completion, especially in complex web interfaces.

Use Case No. 2: Requesting Leave / a Day Off via an Intranet Site

Many organizations utilize internal web applications or intranets for managing common HR tasks like employee leave requests. While essential, submitting these requests often involves navigating specific forms and fields. This routine process presents another excellent opportunity to test how AI agents can automate interactions within typical internal company tools, simplifying the workflow for employees. For this test, we used our own internal solution called “Grapes”.

Task Flow – Step-by-Step

The process of creating a leave request in “Grapes” involves the following steps:

  1. Initiation: Navigate to the “Grapes” application URL and click the “Take leave days” button. This action opens a pop-up window to input leave details.
  2. Pop-up Form – Leave Details:
    • Processing Manager: Select the appropriate manager from a dropdown list.
    • Leave Type: Choose the type of leave (e.g., vacation, sick leave) from a dropdown menu.
    • Start and End Date: Specify the leave period. Dates can be entered in DD-MM-YYYY format directly or selected using a calendar widget.
    • Leave Notes (Optional): Add any optional notes or reasons for the leave request.
  3. Submission: Click the “OK” button within the pop-up. If all required fields are correctly filled, the leave request is successfully created.

Agent’s Performance and Prompt Refinement

For “Grapes,” we started with a seemingly straightforward prompt:

“Go to <LEAVE_REQUESTS_URL> and create a request for <LEAVE_TYPE> leave from <START_DATE> to <END_DATE>. My processing manager is <NAME SURNAME>.”

This prompt almost worked. The agent successfully identified and populated most of the required fields, including leave type and processing manager. However, it consistently stumbled on the date format: the “Grapes” application expects dates in DD-MM-YYYY format, but the agent stubbornly entered them as YYYY/MM/DD.

The resulting error message in “Grapes” was not particularly helpful: “Select start date.” This ambiguous feedback seemed to confuse the models. Interestingly, even advanced reasoning models such as OpenAI’s o1 and o3 worked out that the problem was related to the date format and attempted to adjust it to DD/MM/YYYY, but they failed to recognize that the separator also had to change from “/” to “-”. Ultimately, with this simple prompt, none of the tested models could successfully create a leave request.

Our breakthrough came with a simple, yet crucial addition to the prompt: explicitly specifying the required date format. The refined prompt became:

“Go to <LEAVE_REQUESTS_URL> and create a request for <LEAVE_TYPE> leave from <START_DATE> to <END_DATE>.

The date format is DD-MM-YYYY.

My processing manager is <NAME SURNAME>.”

Adding the line “The date format is DD-MM-YYYY” instantly resolved the date formatting issue. With this updated prompt, all tested models entered dates in the correct DD-MM-YYYY format on the first attempt.
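A robust way to apply this lesson is to normalize dates in code before they ever reach the prompt, so the agent always sees exactly the format the form expects. The sketch below is our own illustration, assuming ISO-formatted input dates; the helper names and placeholders are hypothetical.

```python
from datetime import date, datetime


def to_grapes_date(value: str | date) -> str:
    """Render a date as DD-MM-YYYY, the format the 'Grapes' form expects."""
    if isinstance(value, str):
        # Assume ISO-style input (YYYY-MM-DD); adjust the parse format if needed.
        value = datetime.strptime(value, "%Y-%m-%d").date()
    return value.strftime("%d-%m-%Y")


def build_leave_prompt(url: str, leave_type: str, start: str, end: str, manager: str) -> str:
    """Assemble the refined leave-request prompt with pre-formatted dates."""
    return (
        f"Go to {url} and create a request for {leave_type} leave "
        f"from {to_grapes_date(start)} to {to_grapes_date(end)}. "
        "The date format is DD-MM-YYYY. "
        f"My processing manager is {manager}."
    )


# Example: the task string that would be passed to the agent.
task = build_leave_prompt(
    "<LEAVE_REQUESTS_URL>", "vacation", "2025-08-04", "2025-08-08", "<NAME SURNAME>"
)
```

Pre-formatting the values and stating the expected format in the prompt are complementary: the former removes ambiguity from the input, the latter tells the agent what the form will accept.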

Intriguing Agent Behavior: Handling Missing Information

In a further experiment, we intentionally omitted the processing manager’s name from the prompt. Interestingly, instead of failing or throwing an error, the agents exhibited a degree of “self-reflection.” They seemed to recognize the missing information and, in a display of unexpected improvisation, selected the first manager from the dropdown list in the “Processing Manager” field. While this wasn’t the intended solution (and could lead to incorrect leave requests in a real-world scenario), it showcased the agent’s ability to identify missing data, reason about the context, and attempt to “fill in the gaps,” even if imperfectly.
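In a real deployment you would rather fail fast than let the agent guess. A simple guard such as the hypothetical sketch below rejects a request with missing required fields before the agent is ever launched.

```python
REQUIRED_FIELDS = ("leave_type", "start_date", "end_date", "processing_manager")


def validate_leave_request(fields: dict[str, str]) -> None:
    """Raise instead of letting the agent improvise over missing values."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"Missing required fields: {', '.join(missing)}")


# Example: this request omits the processing manager, so validation rejects it.
try:
    validate_leave_request({
        "leave_type": "vacation",
        "start_date": "04-08-2025",
        "end_date": "08-08-2025",
    })
except ValueError as err:
    print(err)  # Missing required fields: processing_manager
```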

Key Takeaways from Grapes

  • Data Type Awareness is Critical: When automating web forms, explicitly inform the agent about expected data types and formats, especially for dates, numbers, and specific input constraints. This significantly improves accuracy and reduces errors.
  • Agents Can Improvise (Sometimes): LLMs possess a degree of reasoning and can attempt to handle missing information, even if their improvisation isn’t always ideal. This highlights both the potential and the need for careful control and validation in AI automation.
  • Error Messages Matter (Even for AI): Ambiguous or misleading error messages in web applications can hinder even sophisticated AI agents. Clearer error feedback could potentially enable more advanced error handling and self-correction in future models.

Alternatives: Exploring the Spectrum of Browser AI Automation – From OpenAI Operator to Open Source and Custom Solutions

The field of browser AI automation is rich with diverse approaches and tools. Beyond direct LLM prompting, several exciting alternatives are emerging, each catering to different needs and technical capabilities. Let’s explore a broader spectrum:

OmniParser and Vision-Based GUI Understanding

OmniParser (from Microsoft) offers a powerful vision-based approach to understanding web interfaces. By parsing UI screenshots into structured elements, it enables more robust and accurate agent interactions, especially in dynamic and visually complex web applications. OmniParser is a valuable technology for building the “perception” layer of sophisticated browser automation systems. It should be stressed, however, that OmniParser itself does not automate web browsing; it only provides structured understanding of what is on screen.

OpenAI Operator: The Commercial Agent from OpenAI

OpenAI Operator stands out as a commercially available, ready-to-use browser automation agent. Leveraging GPT-4o and its “Computer-Using Agent” (CUA) model, Operator provides a user-friendly way to automate everyday browser tasks. Its key strengths lie in its ease of use, direct browser interaction, and integration with OpenAI’s ecosystem. However, being a general-purpose tool and still in research preview, it may have limitations for highly specific or complex business workflows.

Open Operator: The Open-Source, Build-Your-Own Agent

For those seeking more control, customization, and a deeper understanding of the underlying technology, Open Operator presents an intriguing open-source alternative. Developed by Browserbase, Open Operator is explicitly positioned as a proof-of-concept and a toolkit for building your own web agent.

Key aspects of Open Operator:

  • Open-Source and Customizable: Being open-source (MIT license), Open Operator allows developers to inspect, modify, and extend its code, offering maximum flexibility and control.
  • Building Blocks Approach: It’s not intended as a finished product but rather as a demonstration and a set of tools (leveraging Browserbase and Stagehand) for developers to construct their own agents.
  • Powered by Browserbase and Stagehand: It utilizes Browserbase for core browser automation and interaction, and Stagehand for precise DOM manipulation and state management – highlighting these technologies as crucial components for building web agents.
  • Community-Driven (Potentially): As an open-source project, it invites community contributions, potentially leading to further development and refinement by a wider group of developers.

Open Operator is ideal for:

  • Developers and Researchers: Those who want to learn about the inner workings of web agents and experiment with building their own custom automation solutions.
  • Organizations with Specific Needs: Companies with highly unique or complex automation requirements who prefer to build in-house solutions tailored to their exact specifications.
  • Those Seeking Transparency and Control: Users who value open-source software and want full control over their automation tools and data.

However, it’s crucial to understand Open Operator’s limitations:

  • Proof-of-Concept: It’s not a polished, ready-to-use product like OpenAI Operator. It requires technical expertise and development effort to build upon.
  • Requires Technical Setup: Setting up and running Open Operator involves configuring API keys (OpenAI and Browserbase), installing dependencies, and potentially coding or customizing the agent logic.
  • Not for End-Users (Directly): Primarily targeted at developers and technically proficient users, not intended for general end-user consumption without further development.

Custom Web Agent Solutions by deepsense.ai

Finally, for organizations seeking fully tailored, robust, and supported browser automation solutions, deepsense.ai can be a strong partner. We leverage our expertise in AI, machine learning, and technologies like OmniParser and vision-based parsing to build custom in-house solutions. We can also incorporate and adapt open-source components like those demonstrated in Open Operator and browser-use, ensuring the best fit for your specific business needs.

Compared to using general-purpose agents or building from scratch with browser-use or Open Operator, the advantages of custom in-house solutions from deepsense.ai include:

  • Expertise and Experience: Leverage deepsense.ai’s proven track record in AI and automation.
  • Turnkey Solutions (or Guided Development): We can deliver fully functional, ready-to-deploy solutions, or collaborate with your team in a guided development process.
  • Reduced Development Effort: Minimize your internal development time and resources by partnering with automation specialists.
  • Ongoing Support and Maintenance: Benefit from continuous support, updates, and adaptation of your automation solutions.

Choosing Your Browser Automation Path

Selecting the right browser automation approach depends on your goals, technical resources, and desired level of customization and control:

  • OpenAI Operator: For immediate, user-friendly automation of common browser tasks. Good for individuals and businesses seeking quick wins.
  • Open Operator: For developers and technically proficient users who want to build and customize their own agents, leveraging open-source tools and frameworks. Ideal for learning and highly specific needs.
  • OmniParser (and similar vision-based tools): As a technology component to enhance the robustness and visual understanding of any browser automation system, especially when dealing with complex UIs.
  • Custom In-House Solutions by deepsense.ai: For businesses requiring robust, tailored, deeply integrated, and expertly supported automation solutions designed for their unique workflows and long-term needs.

Summing Up

The browser AI automation landscape is diverse and rapidly evolving, offering a range of options from ready-to-use commercial agents like OpenAI Operator to open-source toolkits like Open Operator, vision-based parsing with OmniParser, and custom-built solutions.

The optimal path depends on your specific requirements – whether you prioritize ease of use, customization, control, or expert support. By understanding the strengths and weaknesses of each approach, businesses can strategically leverage AI to streamline browser-based workflows and unlock new levels of efficiency.

If you have experience with web agents or are interested in exploring them to enhance your business operations, please drop us a message.
