Generative AI Archives - deepsense.ai

Evaluations, Limitations, and the Future of Web Agents – WebGPT, WebVoyager, Agent-E

October 14, 2024/in Generative AI /by Maks Operlejn and Alicja Kotyla

Web Agents are no longer just a concept from science fiction—they’re the cutting-edge tools that are automating and streamlining our online interactions at an unprecedented scale. From effortlessly sifting through vast amounts of information to performing complex tasks like form submissions and website navigation, these agents are redefining efficiency in the digital age.

Thanks to groundbreaking advancements in Large Language Models (LLMs), we’re witnessing a paradigm shift. The humble beginnings of web scraping and archiving have rapidly evolved. Now, sophisticated agents like WebGPT and WebVoyager don’t just perform tasks; they understand context, interpret nuances, and mimic human browsing behavior in ways we once only imagined. They’re not just faster—they’re smarter.

But with great power comes great complexity. The real world isn’t a controlled environment, and web agents face formidable challenges. Dynamic content, ever-changing web landscapes, intricate user interactions—the hurdles are significant. Current systems, while advanced, are just scratching the surface of what’s possible.

This article isn’t just an overview; it’s a journey into the heart of web agent technology. We’ll explore the state-of-the-art solutions making waves today, delve into the intricacies of evaluating their performance, and spotlight the key obstacles that innovators are racing to overcome.

Whether you’re a business leader looking to harness this technology for a competitive edge or a tech professional eager to dive into the next big thing, understanding web agents is crucial. The future of web interaction is here, and it’s more exciting—and challenging—than ever before.

TL;DR – Web Agents: From Browsing to Automation

Forget bots that just scrape data—Web Agents are like having little AI assistants navigating the web for you. Need info buried deep in a website? Want to automate a tedious online task? These AI-powered tools are stepping up their game, using LLMs to understand and interact with websites like never before.

Key takeaways:

Forget basic bots—Web Agents are leveling up. They’re starting to understand context, interpret nuances, and browse the web in a way that’s more like us humans.
WebGPT, an early pioneer, uses text-based browsing to answer questions more accurately than traditional LLMs.
WebVoyager, powered by GPT-4v, utilizes visual input and the “Set-of-Mark” method to navigate real websites, overcoming challenges like dynamic content.
Agent-E introduces advanced HTML distillation techniques and a multi-agent architecture for efficient task execution and resource utilization.
Evaluating Web Agents requires new benchmarks: WebArena and WebVoyager’s evaluation framework offer realistic scenarios and diverse tasks to assess agent performance.
Limitations still exist: Handling complex web actions, file compatibility, integrating multimodal inputs, and ensuring ethical and secure deployment are ongoing challenges.
The future is promising: Combining textual and visual inputs, improving robustness, and developing standardized evaluation methods will pave the way for wider adoption.

The Fundamentals of Web Agents

The basics of Web Agents aren’t that different from regular AI Agents used in other fields. A pretty matching, though obviously very simplistic definition of an LLM-based AI Agent is that it is an LLM put into a loop. The diagram below shows how Web Agents work.

Components of Web Agent, source: own study

Here are the main parts of such a system:

Planning: Based on the user’s request, the agent makes a plan.
- It might get a detailed step-by-step guide and schedule all tasks at the beginning for future execution.
  For example, if the Web Agent is to operate on a concurrent web service with good documentation explaining the process to follow (e.g. in Sharepoint or Confluence), specific fields to fill in and buttons to click, the Web Agent can decompose this as a ready-made action plan.
- Alternatively, the agent might only receive the expected outcome from the user and be required to formulate a step-by-step plan in real-time. The agent, based on feedback from the web environment, has to plan the next step.
  For example, “My task is to edit the user’s profile. I see an icon with the user’s avatar, so I’m guessing I can click on it to go to the profile”.
Tools: Following the plan, the agent carries out actions.
- It could be some predefined actions like “click”, “type”, “scroll up/down” or “go back”.
- Another option – our agent might have the freedom to write code on the fly using tools like Selenium or Playwright.
Memory: The agent needs to remember what it did previously. This allows it to review past steps and figure out what to do next and how to do it.
For example, “My task is to edit the user’s profile. I see an icon with the user’s avatar, but I tried clicking on it before and nothing happened. Therefore, I will try…”

An additional component is the feedback from the Web Environment. At each step, it’s crucial to gather information from the browser regarding the current on-screen content. This data can include HTML code or, thanks to recent advances in vision language models, screenshots of the page, which align the web agent’s perspective more closely with that of the human user.

Such systems can be enhanced in various directions. Here are some examples of user queries to illustrate their versatility:

“How many commercial projects does deepsense.ai have?”: A straightforward query focusing primarily on extracting information from a single source.
“Summarize the news about LLMs throughout September 2024.”: This scenario may require intelligent aggregation of source data and subsequent processing.
“Book a flight for me to Warsaw in the next four days. Choose the cheapest flight with no connecting flights.”: Here, in addition to passively browsing through websites, we perform specific actions with permanent outcomes. (Quite a responsible task for a dummy AI agent, huh?)

Web Agents can be tailored for a variety of tasks, proving to be incredibly flexible and useful in different scenarios.

Overview of Web Agent Solutions: Differences Between WebGPT, WebVoyager and Agent-E

WebGPT: the first LLM-based web agent

Looking back, we can remember web archiving and web scraping tools – technologies that have been on the market for years and can be considered early forms of Web Agents. However, the introduction of LLMs allowed for far more intricate and less monotonous processes. Efforts to achieve this complexity predated even the first release of ChatGPT! Staying within OpenAI’s innovative endeavors, 2021 saw the launch of WebGPT.

Back then, it was already acknowledged that models tend to “hallucinate” or provide inaccurate information. To address this, WebGPT uses search queries, follows links, navigates pages, and looks for pertinent answers, emulating human browsing behavior within a text-based web browser created by its developers.

The model kicks off with an open-ended question and a summary of the browser state, then issues commands like “Search …”, “Find in page: …” or “Quote: …” to gather and compile information into an answer.

Showcase of WebGPT, source: https://openai.com/index/webgpt/

While WebGPT is more accurate than GPT-3 (the model on which it is based), it still commits basic errors and can perpetuate users’ biases. The citations it generates can lend an undue sense of credibility, potentially disguising the inherent inaccuracies and errors.

WebVoyager: Utilizing Vision Language Models for Browsing the Web

Although early attempts at developing LLM-based web agents began as far back as 2021 with WebGPT, it wasn’t until the summer of 2024 that two significant research papers emerged, each examining the topic from distinct perspectives. The first noteworthy work is WebVoyager, an agent that makes decisions based on its visual input, leveraging the multimodal capabilities of GPT-4v.

Workflow overview

When given an instruction, WebVoyager launches a web browser and takes actions guided by screenshots from the web. At every step, the agent evaluates the screenshot with possible options marked and decides on the next action, which is then executed in the browser environment. This sequence of actions continues until the agent determines that the task is complete and decides to stop.

Workflow of WebVoyager, source: https://arxiv.org/pdf/2401.13919

The authors of WebVoyager created an automated web-browsing environment utilizing Selenium. In contrast to WebGPT, which functions within an artificial setting, WebVoyager allows the agent to navigate the open web, encountering distinct challenges like floating ads, pop-up windows, and continuous updates. They chose real-time interaction with live websites to more accurately mirror real-world scenarios, where agents need to access up-to-date information from the web.

Set-of-mark. How the AI-powered wWeb Agents Perceive the World

Much like how humans explore the web, WebVoyager uses visual data, specifically screenshots, as its primary input. This approach avoids the complexity of parsing the HTML DOM tree or accessibility tree, which can often generate excessively verbose text and complicate the language model’s decision-making process. WebVoyager utilizes the innovative Set-of-Mark method to overlay bounding boxes over interactive elements on websites, guiding the agent’s action prediction more effectively. How is this achieved, and does it require another deep learning model? Not at all! Instead, it employs GPT-4V-ACT, a JavaScript tool that extracts interactive elements based on their web element types and overlays bounding boxes with numerical labels on the relevant regions. For instance, it identifies all <input> fields and elements with hyperlinks. The image below illustrates a screenshot prepared by GPT-4V-ACT:

Result of website marking, source: https://arxiv.org/pdf/2401.13919

Actions. What Can WebVoyager Do?

WebVoyager operates using a set of predefined actions, selecting one for each step and including parameters as needed. These actions are:

Click a Web Element.
Type content (deleting any existing content if necessary)
Scroll up or down.
Wait
Go back to the previous page
Return to google to start over.
Respond with the final answer

For certain actions, agents need to select a specific element from previously presented marks. When the model decides to click on the Search button from the example in the image above, the output would be: “Click [17]” If the user wants to specify the departure city, the action will require an additional parameter with some text. In this case, the example output might be: “Type [13]; Warsaw”

ReAct prompting – How the Web Agent Thinks and Acts?

The agent employs the ReAct prompting method to carefully consider its actions. ReAct is inspired by how humans naturally combine acting and reasoning to learn and make decisions. Acting allows the agent to interact with external sources, like knowledge bases or environments, to gather additional information. Reasoning helps the agent develop, track, and update action plans, as well as manage exceptions.

WebVoyager uses ReAct in the following way:

Fragment of WebVoyager prompt, source: https://arxiv.org/pdf/2401.13919

This approach can significantly enhance the model’s performance. For instance, consider the deepsense.ai web page shown below. WebVoyager has been assigned the task of navigating to the Services tab. However, through its improved reasoning capabilities, the agent recognizes that it must first close the cookie window to proceed. The output is:

Observation (from previous step): the image below.
Thought: To proceed, I need to accept the cookies to interact with the website properly.
Action: Click [24]

Observation with current web page status, source: https://arxiv.org/pdf/2401.13919

WebVoyager in Action. A Demo

We wanted to see how this agent works in some real-life scenarios. Let’s say that a client wants to get more information about our company. This is how the query looks like:

“How many commercial projects does deepsense.ai have? You can find that information on their site under the AI Guidance Services tab.”

As you can see, this instruction is quite detailed. Of course, we can also give it a simpler version, but increasing the detail means a greater chance of success.

Whole process can be seen below. The final answer from the model is:

“deepsense.ai has 200 commercial projects.”

Demo of WebVoyager operation, source: own study

Agent-E: Unleashing Web Automation with LLMs and Compact HTML

In July 2024, just one month after the release of WebVoyager, Emergence AI developed Agent-E, a novel web agent that brings significant advancements over previous state-of-the-art web agents. In this section, we’ll explore how Agent-E works, including its architecture and advanced techniques for processing web content.

Solution Architecture Overview

Agent-E is designed with a high-level architecture that includes two LLM-powered agents: the Planner Agent and the Browser Navigation Agent, supported by two executor components, the Planner Skills Executor and the Browser Navigation Skills Executor. Each LLM-powered agent has a set of skills, which are Python functions that the LLM can call to perform specific tasks. The executor components are responsible for carrying out the functions that the LLM selects and then reporting the results back to the LLM.

Agent-E task execution flow overview, source: https://arxiv.org/pdf/2407.13032

Agent-E is built using the Autogen framework, which is an open-source tool for creating multi-agent collaborative systems, and Playwright, which handles browser control. The architecture is designed to take advantage of the interaction between these skills and agents.

How Tasks Are Handled with Agent-E

When a user gives a command, Agent-E’s Planner Agent breaks down the task into a series of actionable steps. These steps are then assigned to the Browser Navigation Agent, which is freshly instantiated for each task to ensure that it starts with a clean slate and doesn’t carry over any previous context. This agent uses its foundational skills to assess the browser’s state and control it accordingly, executing the required steps to complete the task.

An illustration of Agent-E in action, showcasing the interaction between the planner and the browser navigation agent to complete the task of finding the price and minimum user requirements for a Canva Teams subscription, source: https://arxiv.org/pdf/2407.13032

Agent-E Skills Design: Sensing and Action

Agent-E’s skills are categorised into two main types: Sensing Skills and Action Skills.

Sensing Skills: These skills enable the Browser Navigation Agent to understand and interpret the webpage using various HTML synthesis techniques. Depending on the task, the agent can extract just the text content, identify input fields, or list all interactive elements on a page. This selective approach helps avoid the pitfalls of dealing with noisy or irrelevant HTML data.
Action Skills: These skills are designed to perform actions on the webpage, such as clicking a button, and then observe and report any resulting changes. For example, if a click action triggers a popup, the agent reports this change, helping the LLM determine the appropriate follow-up actions.

After executing each skill, the agent provides feedback in natural language, indicating whether the action was successful or not and explaining the reasons behind the outcome. This conversational feedback helps the LLM to better understand and correct any errors, leading to more accurate and efficient task execution.

Agent-E HTML Distillation: Optimising Web Interaction

One of the standout features of Agent-E is its advanced HTML distillation techniques. The size of a webpage’s HTML can significantly impact the speed, cost, and effectiveness of the LLM’s processing. To tackle this, Agent-E compacts and distills the HTML, preserving only the most relevant information for the task at hand. This process is critical because HTML can be so large that they often exceed the context window of an LLM, rendering them impractical for processing in their raw form.

Illustration of the difference in token counts between the raw HTML of popular websites and the more streamlined version after applying our distillation process, source: https://www.emergence.ai/blog/distilling-the-web-for-multi-agent-automation

Agent-E in action. A demo

This demo showcases Agent-E in action. A query, entered into the Agent-E sidebar, asks how many commercial projects deepsense.ai has. Upon submitting the query, Agent-E quickly processes the request, demonstrating its capability to understand complex queries and provide precise guidance. This example illustrates how Agent-E can assist users in navigating websites and retrieving specific information.

Evaluating WebAgents: Beyond Simplification and Toward Real-World Complexity

Assessing the performance of a WebAgent is a complex task. Earlier methods often treated agents like stateful functions, comparing outputs to a reference or ground truth. This approach, relying on keyword or phrase matching, struggled with nuanced or multi-step responses.

Additionally, earlier agent evaluation environments tended to oversimplify real-world tasks, limiting task variety and reducing complexity. These simplified environments often restricted agents to pre-defined states, limiting their ability to explore diverse, dynamic scenarios.

The following sections will introduce a range of new evaluation frameworks, each designed to assess WebAgents on multiple dimensions, including context-awareness, interactivity, and problem-solving strategies, to better reflect their performance in complex environments.

WebArena: Advancing Agent Development with Realistic Web Complexity

A new wave of WebAgent benchmarks has emerged to push the boundaries of autonomous agents in real-world web environments. One noteworthy example is WebArena that offers a realistic setting with four operational web applications across domains: online shopping, discussion forums, collaborative development, and business content management. WebArena includes utility tools such as a map, calculator, and scratchpad, allowing agents to mimic human-like interactions with web content. It is also supported by a broad range of knowledge resources, from general information like Wikipedia to specialized manuals for tools integrated within the system.

WebArena benchmark features 812 tasks evaluated using metrics such as Exact Match, Must Include, and Fuzzy Match, focusing on outcomes rather than steps. In tasks related to site navigation or content manipulation, WebArena assesses performance by programmatically examining the intermediate states during task execution. This process verifies whether the agent is interacting correctly with the website or database at each step. Additionally, agents are tested on their ability to recognize unachievable tasks.

WebVoyager Evaluation: Embracing Real-World Web Diversity

Another benchmark, introduced in the WebVoyager paper, focuses on evaluating agents in more dynamic, real-world scenarios. It spans 15 websites that represent various aspects of daily life, from online shopping and education to collaborative platforms and real-time information sources. Task generation in this system is a hybrid approach, combining automated processes with human validation to ensure diversity and accuracy. The final dataset consists of 643 tasks, including both well-structured queries and open-ended tasks like summarization.

A key feature of WebVoyager is its dual approach to answer annotation. For tasks requiring stable, factual responses, it uses “Golden” answers. For more open-ended or dynamic queries, it allows for multiple valid responses with “Possible” answers, reflecting the nuanced nature of real-world web interactions. This flexible evaluation system ensures that agents are not just judged by rigid correctness but by their ability to provide feasible, high-quality solutions.

Agent-E: Introducing Metrics for Evaluating Task Efficiency and Resource Utilization

Adding to this landscape, Agent-E introduces four key metrics for a comprehensive evaluation of web agents:

Task success rates measure the percentage of tasks completed successfully across various websites.
Self-aware vs. Oblivious failure rates differentiate between failures where the agent acknowledges its limitations and those where it provides incorrect answers or actions, with self-awareness being critical for minimizing errors.
Task completion times capture the average duration required to complete tasks, regardless of success.
Total number of LLM calls counts the average number of calls made by the agent (both planner and browser navigation) during task execution, reflecting its efficiency and resource utilization.

Navigating Limitations in Real-World Applications

Web agents show promise but face key challenges limiting their real-world use. Replicating complex web actions, like precise dragging motions on web pages, present difficulties due to their continuous nature. File compatibility is also an issue, as agents like WebVoyager can only handle basic file types such as text and PDFs, but struggle with multimedia formats like video, limiting their versatility.

Open-source models face trade-offs between image resolution and context size, hindering detailed web navigation. The need to integrate both text and visual inputs is also critical. Agents must integrate text and visual inputs to improve performance on visually complex sites, like booking systems or travel platforms, where calendars and intricate visual components are involved.

Finally, a “human-in-the-loop” approach is crucial for ensuring accuracy and managing security risks, while balancing ethical concerns and environmental costs. The authors of Agent-E stress that human-in-the-loop workflows are critical for the adoption of agentic systems, although no benchmarks currently exist for such agents.

The Future of Web Agents and Web Automation

Despite recent breakthroughs, we are still at the early stages of the evolution of Web Agents (and LLM Agents in general). The future promises many intriguing advancements, offering much to explore.

Currently, Agent-E focuses on textual input, while WebVoyager specializes in visual data. A unified approach that combines both could provide a more comprehensive understanding of web environments. This convergence of textual and visual inputs (and perhaps other modalities like audio or video) will allow agents to grasp context more holistically, resulting in more accurate and effective responses.

Present-day solutions, though fascinating, are still the products of research environments and are not yet ready for production use – a lot of unexpected things can happen like infinite loops, bad action parsing etc. In the future, agents will become more robust, addressing most edge cases and mitigating the non-deterministic nature of language models as much as possible. To achieve this, substantial input from software engineers is necessary to make these systems production-ready.

As an AI community, we have to work hard to introduce better methodologies for measuring agent performance. With an increasing number of testing methods and datasets, we will be able to conduct more extensive evaluations and identify specific areas needing improvement.

Summary

In conclusion, the journey of Web Agents from simple web scraping tools to sophisticated systems like WebVoyager, and Agent-E showcases the immense potential of leveraging large language models for web automation. Despite significant advancements, current Web Agents still face hurdles such as handling complex interactions, integrating multimodal inputs, and ensuring precise task execution in dynamic web environments. Addressing these challenges requires continuous innovation and development, along with robust evaluation frameworks to ensure real-world applicability. With these improvements, Web Agents could significantly enhance our ability to efficiently and effectively navigate the complexities of the web.

Why and How to Build AI Agents for LLM Applications

October 14, 2024/in Generative AI /by Alan Konarski and Sebastian Chwilczyński

Generative AI and Large Language Models (LLMs) have burst onto the scene, introducing us to “copilots,” “chatbots,” and the increasingly pivotal “AI agents.” These advancements unfold at breakneck speed, making it challenging to keep up.

We’ve been at the forefront of this revolution, witnessing how AI agents—or “agentic workflows,” as Andrew Ng refers to them—are becoming essential components of LLM-based applications. These agents are not just theoretical concepts; they’re practical tools enabling everything from natural language interactions with databases to generating entire software projects from simple requirements.

In this article, we’ll dive into AI agents’ fundamentals and explore their vast possibilities. Whether you’re a business leader aiming to stay ahead of the curve or a technical professional seeking deeper insights, we’ll provide a concrete and engaging look into how AI agents can be leveraged effectively.

TL;DR – What You Should Remember from This Article

We believe everyone will feel more confident about AI agents after reading this blogpost, regardless of the level of expertise and more business or technical roles. Still, if you consider yourself an AI Agents Jedi, at least check our main insights gathered while implementing Agentic systems in both internal R&D and commercial projects, it might be interesting for you.

We are already seeing many use cases where AI agents surpass pure LLM applications. One such use case is interacting with databases using natural language.
AI agents have unlocked new, previously impossible applications, such as performing a broad spectrum of tasks in the browser and generating entire software projects from requirements.
Artificial General Intelligence (AGI) is not there yet and very general, completely AI autonomous systems don’t perform reliably enough. Nevertheless, a well designed product with a certain level of autonomy (when it’s reliable) and human supervision proves to be very successful.
Anticipate breakthroughs in AI agents: their rapid improvement suggests a trajectory of increasing capability. Design your products with this potential in mind to stay ahead.
As we also stated in our previous blog post: Specialization beats Generalization. A group of specialized agents performs much better than one general agent. More specific agents require less powerful LLMs, leading to cost savings.
While some Agentic system design patterns are no-brainer, others present trade-offs between performance gains and costs in complexity, time, or resources. Add them to your system one by one after assessing improved performance and accepting the costs and complexity.
Without a proper evaluation pipeline and clearly defined goals, creating AI Agents is like alchemy, you never know if what you did improved the performance.

What Is an AI Agent

As more companies integrate LLMs into their workflows, they encounter various limitations such as low reliability, unpredictability, and high costs. Fortunately, the community has introduced numerous remedies to address these challenges. Over the two years since ChatGPT’s introduction, a plethora of LLM design patterns have emerged, with one of the most recent and hot topics being AI Agents.

When we refer to an “AI agent,” we’re describing a system that integrates multiple components to achieve superior results. The main element is the LLM Controller, which acts as the “brain” of the agents, making decisions and interacting with other components, such as:

Planner: Creates step-by-step plans by decomposing complex tasks into smaller ones. Many planning techniques exist offering trade offs between flexibility and reliability. For more adaptable solutions, we can use LLM planners, while completely predefined plans provide greater reliability. We can also combine both to achieve a balanced approach.
Memory: The history of previous steps and additional user or environment context that was learned along the way.
Tools: The agent has the capability to call external APIs or functions to get extra information or to perform given tasks, e.g., filling out and submitting a form on a registration page.

It’s also worth mentioning several AI Agents Design Patterns that have emerged in recent months, helping us take our AI agent-based applications to the next level. Here are the three most common:

Supervision: This pattern is built with two types of Agents: a Supervisor (also called Router) and Worker/Specialist agents. The Router agent picks which of the workers will perform the next task and decides when to finish processing.
Reflection: We prompt the LLM to reflect on and critique its past action, so it can improve its output.
Collaboration: Agents share common memories and work together on specific problems, each contributing as a specialist in a given area.

LLM Agentic Systems are built from a mix of LLM Agents and Agentic Design Patterns which both consist of various techniques that are rapidly developed.

Agentic systems represent the next step in the utilization of Large Language Models. By incorporating agentic workflows into your business:

Existing applications using LLMs can benefit significantly.
Completely new use cases can emerge.
The performance of non-LLM pipelines can be improved.

Differences Between AI Copilots, Chatbots, and Autonomous Agents, and Decision Recommendation Systems

Plethora of options and techniques related to AI Agents can be overwhelming. However, from a business perspective, it is more about what these Agents can do and not how they do this. Understanding the possibilities and limitations becomes easier after examining the different types of AI agents being developed.

We often encounter terms like co-pilot, chatbots, and agents, but what truly sets them apart? Let’s explore the distinctions.

Just as backend and frontend separation is beneficial for web applications, it is also advantageous for LLM-based applications. The backend handles the heavy lifting and is an ideal place to use agentic systems to achieve previously impossible tasks. The frontend’s role is to constantly:

Allow users to provide input.
Send this input to the backend.
Parse backend output and display the results.

At the third step, the system can:

Interact with the user environment (web page or application’s front-end , user database info)
Collaborate with the user helping them to solve their problem

Given these 2 criteria, we can imagine a 2D space where all LLM based applications reside

When designing LLM based applications we can think of how collaborative it can be and how it should influence the environment of the user, which facilitates the planning phase.

Imagine you want to build an LLM application for your video processing software. In the corners of our 2D space, you could have:

Copilot: Suggests different effects, helps with splitting the video, and answers questions about software functionality. It assists during the process but does not automate it.
Chatbot: Similar to a copilot, it helps but only resides in a chat window, providing text or image hints.
Autonomous Agentic System: Takes a query like “Take these 5 clips and create 20 second Instagram reel that is…” and outputs a ready video clip, using video processing software interface
Decision Recommendation System: Provides functionality like virality scores, taking a video as input and outputting a score.

Difference between different LLM frontends, on the stock exchange example. The same backend data generated by another Agentic System can be used differently to provide different user experience.

Use cases

Now that we understand what AI agents are, let’s explore some of their most interesting applications.

Web Agents

Nowadays, almost anything can be accomplished with an internet connection and a web browser: ordering food, starting a company, managing financial transactions, conducting market research, or even collaborating on complex projects. Whether it’s personal tasks or business operations, the possibilities are virtually endless!

But all of these tasks usually require you to go to the page, browse through available options, make payments, fill in all of the necessary data, and it takes time. And I think all of us value our time. But there is good news. With LLM-powered agents, most of these boring tasks could be automated.

We’ve already seen a couple of available solutions such as:

WebVoyager is an innovative Large Multimodal Model powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
Agent-E is an agent-based system that aims to automate actions on the user’s computer. At the moment, it focuses on automation within the browser. Agent-E is a newer agent and currently being SOTA solution for WebAgents.

They both have been evaluated on the WebVoyager dataset. You can see example tasks and results below.

Example tasks from WebVoyager dataset. Source: https://arxiv.org/pdf/2407.13032

Performance of different WebAgents on the WebVoyager dataset. WebVoyager (He et al. 2024), WILBUR (Lutz et al. 2024) and Agent-E. Source: https://arxiv.org/pdf/2407.13032

Remember that these AI agents are still evolving; even though they are not perfect, they are improving. Importantly, their current design is quite broad in scope, which enhances their flexibility as they can handle a wide range of tasks and queries across various domains. This broad capability is a significant advantage, and with ongoing improvements, their reliability will continue to increase. Currently, a much more reliable system can be achieved by limiting agent’s scope to the most important business operations and developing more specialized agents that focus on specific tasks. For instance, we could create an agent dedicated solely to conducting market research.

Sequence diagram representing interaction between user, Agentic System and gooddoctor.com page during the task of making an appointment.

We have extensive experience building these types of agents. If you’re interested in learning more, we’ll be sharing details about WebAgents in our next article. Stay tuned!

Coding Agents

Another promising application of LLM-based agents is going one step further than collaborative tools like Copilot or ChatGPT and building an autonomous agent capable of creating an entire software engineering project only from the requirements written in natural language. We already have solutions such as:

Genie – developed by Cosine.
SWE-agent – built and maintained by researchers from Princeton University.

They have been evaluated on the SWE-bench. SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.

SWE-bench collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Source: https://www.swebench.com

Genie got over 30% on this benchmark, while SWE-agent powered by GPT-4 got nearly 12.5%. But bear in mind that the former is a closed, enterprise solution while the latter is open-sourced, MIT-licensed, and free to use. So, as these numbers show us, while completing projects end-to-end solely from requirements using an autonomous agent holds great promise, we’re not there yet, and there is still a lot of work to be done.

Note: OpenAI recently dropped a new benchmark, SWE-bench Verified , so we might see some new, interesting results!

We also have several other popular solutions available, such as AutoGPT, gpt-engineer, and the recently announced Replit Agent. While they seem very promising, they don’t provide benchmark results, so we don’t know their true value, and each use case has to be evaluated individually.

In our opinion, currently, tools like Copilot, which helps the developer and not replaces them, offer the most value. IDEs like Zed and Cursor, which are natively LLM-powered, exemplify this approach. Although the mentioned solutions are the best available now, we believe that very soon, solutions with more autonomous agents, like Genie or SWE-agent, will surpass Copilots and provide even more value.

If you want to learn more about Coding Agents, We encourage you to read our previous blog posts on this topic: AI Agents in LLMs: Introduction to Coding Agents and How to Build Your Own Coding Agent: Our Process.

Specialized Database Agent

General-purpose AI agents that can be prompted to solve any task often prove unreliable and can perform unwanted, potentially dangerous actions. When creating Text2SQL-based solutions (a system that translates a query in natural language into a query in database language) for clients, we found that solving specific cases reliably brings more business value than general, unreliable cases.

To achieve high reliability, we utilized very concrete implementations of agentic behaviors:

A set of tools for interacting with the database that the LLM can automatically select. Instead of generating pure SQL, which can be dangerous, the LLM selects predefined business actions.
A contextualization component that remembers facts about users and later utilizes them to disambiguate questions, such as, “Does anyone from my department take holidays this week?”This can be thought of as a memory component.
Encapsulation of an expert’s domain knowledge and domain vocabulary inside the tools.
A system utilizing vector similarity to exchange synonyms (e.g., recognizing that “New York” is represented as “NY” in the database).
LLM based reflection provides feedback and potential solutions to the main agent when the query it generated doesn’t execute correctly.
LLM agent summarizing the table returned by the database into a user-friendly message.

To sum up, the most reliable solutions always incorporate multi- agent collaboration between non LLM and LLM agents equipped with tools and memory, utilizing patterns like reflection.

Looking for a way to build a reliable AI agent for database queries? Given this experience, we created db-ally, an efficient, consistent, and secure library for creating agents for querying structured data with natural language. It’s open-source, so you can use it in your project freely.

db-ally plays a role of a proxy between LLM and database ensuring security and reliability

The Future of AI Agents

It’s really impossible to predict the future with 100% confidence. But we can make assumptions, and this is what we think about agents:

There is a new player in the game that can power our agents: LMM—large multimodal models. By enabling access to new modalities like vision and speech, they open up a whole new realm of possibilities. For example, instead of relying solely on text representation of the website, we can now feed a screenshot to the model, providing it with more information that can be helpful in achieving a certain goal.
They will become more and more present in our lives and our systems. It seems like the natural order of things as AI agents are currently a remedy to the shortcomings of pure LLM applications, such as:
- LLMs lack awareness of events or developments that occur after their training cut-off date
- LLMs often struggle with performing accurate mathematical calculations or solving complex mathematical problems, leading to incorrect results.
- LLMs cannot interact directly with external environments, such as web browsers or other software, limiting their ability to execute tasks that require such interactions.
Breakthroughs will be occurring, and limits will be pushed. It’s important to prepare for a new wave of agent-powered applications by:
- Identifying business processes that can be automated or facilitated by them
- Thinking about gathering data that can help develop and evaluate agents on these tasks

Conclusions

As AI continues to evolve, the role of AI agents will become even more essential in shaping how businesses operate and innovate. These agents, with their ability to handle complex workflows and execute tasks autonomously, offer a glimpse into the future of technology—a future where specialization and human supervision coexist to create powerful, efficient systems. Whether you’re building advanced applications or exploring new possibilities for automation, understanding and leveraging AI agents will be key to staying competitive in this rapidly advancing landscape.

Interested in learning more about how AI agents can enhance your workflows? Keep an eye out for our upcoming articles, where we’ll explore specific use cases and best practices to help you stay ahead in this evolving space.

Implementing Small Language Models (SLMs) with RAG on Embedded Devices Leading to Cost Reduction, Data Privacy, and Offline Use

April 25, 2024/in Generative AI /by Kamil Czerski

In today’s rapidly evolving generative AI world, keeping pace requires more than embracing cutting-edge technology. At deepsense.ai, we don’t merely follow trends; we aspire to establish new solutions. Our latest achievement combines Advanced Retrieval-Augmented Generation (RAG) with Small Language Models (SLMs), aiming to enhance the capabilities of embedded devices beyond traditional cloud solutions. Yet, it’s not solely about the technology – it’s about the business opportunities it presents: cost reduction, improved data privacy, and seamless offline functionality.

What are Small Language Models?

Inherently, Small Language Models (SLMs) are smaller counterparts of Large Language Models. They have fewer parameters and are more lightweight and faster in inference time. We can consider models with more than 7 billion parameters as LLMs (the largest could have even more than 1 trillion parameters), demanding resource-heavy training and inference. The definition of a Small Language Model may vary among authors, but we consider models lightweight enough to run on edge devices, typically with 3 billion parameters or less. Please note that this division is conventional and does not provide full depth.

SLMs are compact versions of Language Models, and they excel in two main areas:

SLMs are suitable for Edge Devices, offering businesses benefits such as cost reduction, offline usage, or enhanced data privacy.
For research groups, SLMs facilitate speeding up R&D progress, swiftly testing new ideas, benchmarking at scale, and iterating relatively fast. Even retraining SLMs (even from scratch) is feasible for small groups with access to home-grade GPUs.

This article focuses on the first area: applying SLMs to Edge Devices for practical purposes.

Benefits of SLMs on Edge Devices

In this section, we present three compelling reasons why companies may find Small Language Model (SLM) applications preferable to their cloud-heavy Large Language Model (LLM) counterparts:

Cost Reduction

The expense of cloud inference for Large Language Models can be prohibitive. Transitioning LLM-based solutions directly to edge devices eliminates the need for cloud inference, resulting in significant cost savings at scale. This cost reduction may be a primary incentive for companies already employing cloud-based LLM inference on mobile phones or edge devices. Additionally, for specific applications, the quality offered by smaller models may already meet requirements. Moreover, companies seeking to cut LLM costs can benefit from shifting inference to local PC hardware.

We encourage readers interested in cost reduction topics to read our article on Reducing the cost of LLMs with quantization and efficient fine-tuning.

Offline Functionality

Deploying SLMs directly on edge devices eliminates the requirement for internet access, making SLM-based solutions suitable for scenarios where internet connectivity is limited. For instance, consider a drone application leveraging a Small Vision Language Model; it must operate seamlessly even in environments lacking internet connectivity. Another example can be a smartphone application as an RAG pipeline, utilizing the company’s documents and providing a question-answer mechanism. This application utilizes SLM and can reduce the costs of hosting larger LLM in a cloud.

Data Privacy

Sometimes, apprehensions arise regarding cloud services due to data protection regulations. All processing occurs locally by running on the Edge, offering the opportunity to adopt Language Model-based solutions while adhering to stringent data protection protocols.

Developing a Complete RAG Pipeline with SLMs on a Mobile Phone

To gain hands-on experience with Small Language Models, we decided to investigate an internal project where we explore SMLs and their usage.
The main goal of this internal project was to develop a complete Retrieval-Augmented Generation (RAG) pipeline, encompassing the embedding model, retrieval of relevant document chunks, and the question-answering model, ready for deployment on resource-constrained Android devices. The primary objective was to explore the capabilities of Small Language Models (SLMs) in terms of overall response quality and generation speed on mobile hardware.

Also, we publish code related to this project; check it at: https://github.com/deepsense-ai/edge-slm

What did we do?

We constructed a prototype pipeline for RAG using the llama.cpp framework and successfully deployed it on Android devices.
We experimented with SLMs, including Phi-2, Gemma, and TinyLlama, with parameter counts ranging from 1B to 3B.
Using the Ragas library, we evaluated their question-answering quality by combining human assessment with automated LLM-based metrics.
We gauged the impact of different quantization levels and prompt engineering on response quality.
We assessed the pipeline’s latency and memory consumption, gaining insights into the current possibilities for deploying language models on the edge.
In addition, we conducted experiments on the pipeline’s retrieval component, which involved embedding model selection and hyperparameter optimization.
Parallel to developing the main pipeline, we explored other frameworks such as ExecuTorch and MLC and alternatives to Transformer, like selective state space models (Mamba).

Demo of the RAG pipeline Phi-2 Q8 model with thenlper/gte-large embeddings model running on Samsung S24 Ultra.

What does the RAG pipeline look like?

It is a technique for injecting specific knowledge (consider your company documents and text data) into a system where users ask questions, and the Language Model answers those questions, incorporating knowledge from the mentioned documents. In other words, it is a zero-shot prompt technique for the Language Model, requiring no fine-tuning or training of the model. The main flow of the designed RAG pipeline is depicted in the diagram below, and this is precisely what we have fully implemented on the smartphone as our demo project.

For the offline component, documents are chunked, and embeddings are calculated. For the company’s records, this process was once an offline operation. When a mobile application is initiated, embeddings (indexed pointers to document chunks) are stored on the device in RAM, and documents are stored on the smartphone’s hard drive. Subsequently, when a user poses a question (user query), context is retrieved from this vector index. With appropriate prompt engineering, the Small Language Model takes user questions, retrieves contexts, and generates responses.

Offline processing:

The production-ready system can also include an offline component. The “Knowledge base” needs to be distributed to the edge devices, implying that the distribution may involve precalculated embeddings.

The document chunking step is conducted offline using Python scripts. This approach allows for the utilization of existing libraries and tools, such as LangChain.
Embedding vectors are computed offline to reduce loading time. This is achieved by developing simple applications using the developed library, ensuring consistent runtime implementation for the embeddings.

Online on edge part:

Knowledge base loading must occur during application startup, and indexed chunks must be loaded at runtime. The current solution stores all document chunks directly in memory. If the desired knowledge base was too large to fit within a reasonable amount of RAM, enhancing the solution by implementing mechanisms to store the actual documents outside the application would be necessary.
The context retrieval component takes the user query and the knowledge base. It extracts the K nearest elements retrieved from the knowledge base based on the cosine similarity score between the user query and document chunks.
The response generation step is the final stage in the pipeline. For this project’s scope, we did not implement the chatting functionality. It may be added in subsequent steps. This component generates the SLM model response based on the specific model prompt template, retrieved contexts, and the user query. The output from the LLM interface is capable of token streaming.

Tech Stack

Below, we provide a quick overview of the project, divided into research and inference sites. For the tech stack used in inference, we chose llama.cpp as an inference engine for SLMs, bert.cpp as a framework for embedding a model (now fully integrated into llama.cpp), Faiss as a library for realizing k-nn search for embeddings from user queries and embeddings from document chunks, OpenBLAS as Faiss requires a BLAS implementation, and Conan to manage C++ dependencies in the Android environment. On the research side, we evaluated models using both human and automated metrics (Ragas) and benchmarked the application.

Methods and Tools

Let’s start with the inference engine for the Small Language Model. We have tested and evaluated four LLM frameworks:

llama.cpp – This framework emerged as the best choice for runtime in mobile environments. It boasts a large community and is actively developed. It is compatible with mobile devices. In addition to supporting numerous Transformer models, there is a pending PR for incorporating Mamba models, which we also tested. While the community is rapidly growing, the PRs and codebase can be disorganized. Mobile optimizations are not the highest priority.
MLC LLM—Built on Apache TVM, our tests revealed this project’s limited applicability. It exhibited slower performance with GPU computations compared to llama.cpp. Additionally, it offers a narrower range of supported models.
Mamba.c – A runtime for Mamba written in C. It only supports CPU instructions on Android. A basic CUDA implementation is available, but it functions solely on PCs.A runtime for Mamba written in C. It only supports CPU instructions on Android. A basic CUDA implementation is available, but it functions solely on PCs.
ExecuTorch – This is a new framework designed for mobile and edge devices. While still in the development phase, it shows promise. However, support for Llama models is currently limited and buggy (we have opened an issue that needs to be solved). It may become a rising star due to its better organization, but it has yet to be production-ready.

Additionally, we researched and briefly examined language models support in Gemma and NCNN.

Google recently released its gemma.cpp, another development worth noting. While we haven’t tested it as an inference engine, it could interest those looking to utilize Gemma models. This framework appears to exclusively support Gemma models.
NCNN, on the other hand, is a popular ML framework for Android devices. However, it has limited support for small language models, with only the 7B llama model reported to run, and needs quantizations lower than int8.

Embedding model runtimes:

Bert.cpp – This repository utilizes the GGML runtime to execute the embedding models.

RAG:

FAISS – We developed the library for Android devices. FAISS is responsible for indexing the embedding vectors and enabling efficient search based on cosine similarity.

C++ libraries built:

OpenBLAS—FAISS requires a library that implements the BLAS and LAPACK interfaces. We incorporated OpenBLAS into a Conan recipe (the official recipe lacks Android support). However, an issue arises as OpenBLAS requires a Fortran compiler that is no longer supported with the Android NDK.

C++ Package management:

Conan – Responsible for managing all project dependencies. It integrates third-party repositories into our solution and enhances the manageability of Android and x86 builds.

Challenges with Implementing SLM with RAG on a Mobile Device

The key challenges faced during this project were:

1. Memory Limitations

The models’ size is crucial in their applicability on mobile hardware, which typically has 4-12GB of RAM. As a Small Language Model, we consider models ranging from 1B-7B parameters. To match the mobile’s memory constraint, we need to utilize quantization techniques like int8 (Q8) or lower (e.g., 4-bit or 5-bit representations). It is also important to mention that we can’t use all the memory; depending on the OS, we need to reserve 2-3.5GB for Android and other application components. Operations such as sparse kernel multiplication for pruned models have yet to be widespread in mobile frameworks like llama.cpp. Still, this field is progressing rapidly and will soon allow for the execution of larger models and/or faster inference.

2. Platform Independence

The Android platform has its own set of requirements for building applications. All the necessary components of the developed solution were designed so that the codebase should only be rebuilt or require minor tweaks specific to the target platform. Consequently, we implemented a library with a terminal application that can be deployed on Android devices and a regular x86 computer. Keeping the core functionality as a native-built shared object will allow the library to be used in regular Android apps (written in Kotlin) and Flutter (with Dart). Both require only wrappers on the public interface for the core RAG library.

3. Not Mature Enough Inference Engines

SLM inference Engines are evolving rapidly and still need to mature. Currently, llama.cpp is the best choice and supports more models than any other framework, but as a rapidly growing repository, it is somewhat disorganized. Additionally, it does not target Android and mobile performance optimization as the primary goal. Android GPU support via CLBLast is not producing correct results and is slower than the CPU. There is currently no support for pruning and sparse kernel operations. On the other hand, ExecuTorch seems to be well-organized and offers Qualcomm’s kernels for massive inference speed-up, but it is not yet mature enough to run Language Models. We expect the situation to change dramatically in the upcoming months.

4. Missing Features in Runtime Technologies and LLM Libraries for C++

The products are designed to deploy language model-based applications and systems, mainly targeting cloud-native deployments and the Python environment. In some cases, features ready to use in Python (i.e., more advanced retrieval techniques like hierarchical search) must be implemented from scratch in C++.

5. Android Constraints – a Single Process

Android deployment imposes constraints, as the entire application must be contained within a single process. This results in a much narrower choice of technologies like vector databases, not to mention that there is no clean way to build and include Fortran dependencies.

Performed evaluations

Here, we would like to discuss key findings from performance benchmarking.

Retrieval

The retrieval part was evaluated on a sample dataset containing a few PDF documents on the public Internet. The metric measured was mAP (mean average precision), which assesses how much relevant information was retrieved correctly.

RAG – evaluation datasets:

Source documents found in public resources containing the standard operating procedures for areas:
- Construction workplace safety
- Warehouse procedures
- Grocery store worker instructions
- COVID-19 guidelines
Queries and expected vital points to be retrieved were created manually.

As the best performing models, we chose the gte-base family. Depending on the memory available on the device, our recommendation is as follows:

- gte-base/fp32
  - mAP 0.65 at 3 chunks and 600 tokens
  - ~0.5GB

- gte-large/fp32
  - mAP 0.65 at 2 chunks and 200 tokens
  - ~1.5GB

The bigger model is better because it is sufficient to return the top-2 chunks to achieve an mAP of 0.65 on our dataset, with an average of 200 tokens for the SLM to parse in the next step. However, we need to allocate 1.5GB of RAM for this bigger model. The smaller model is 3x more lightweight at only 0.5GB, but it requires the top 3 chunks to achieve an mAP of 0.65, resulting in 600 tokens for the SML to process in the next step. This means that the SML will have more work to do (as it needs to process 600 tokens compared to 200 tokens) and potentially more challenging work (summarizing/reasoning with more non-relevant contexts). In other words, using a better embedding model can reduce the SLM input size. The smaller input prompt for the SLM model will be reflected in a shorter time without any output from the LLM (the prompt decoding step in llama.cpp).

The plots show top-k (Number of retrieved chunks) on the X-axis and mAP of the retrieval on the Y-axis. The number in the plot, close to the line, is how many tokens the entire retrieved-context is built from. Each plot shows a different model under a few configurations, where c_xxx means context size and o_XX means overlap between contexts when calculating embedding. For each query, a certain number of ground truth contexts are expected to be retrieved. For each query, the precision is calculated by correctly_retrieved_chunks_num / all_gt_chunks.

Embedder speed

The retrieval is fast, as user queries are often short.

The base model is 3x faster and occupies 3x less memory. But if memory is not a constraint, we suggest going with the larger model as it achieves better mAP and needs fewer chunks and tokens to achieve comparable mAP, making the next step, SML inference, way faster and easier as a task

Retrieval Indexing performance.

For retrieval, we used the Faiss library. Even though indexing is needed only offline in our case, we also show here on a benchmark that it is fast enough to consider it on a device. Searching time is blazingly fast for CPU index search. Also, a benchmark of the RAM needed for embedding shows that you can push thousands of pages without worrying about device memory. Num vectors in several embeddings, each corresponding to one text chunk; in our case, one page was like 6-10 chunks.

SLMs Benchmark

Here, we were evaluating models in the 1-3B range. It is possible to push 7B models with lower quantization levels, but they are slower and require memory that is available only on high-end smartphones.

Speaking about the memory, here are the RAM results needed per model and quantization level.

The lower quantization means some weight is stored with less precision. It is a severe reduction with needed RAM.

Plots show the generation speed for SLM on two mobile devices:

Galaxy S24 Ultra: 12 GB Mem, Snapdragon 8 Gen 3
S20FE: 6 GB Mem, Snapdragon 865

and 3 quantization levels (the lower-end device has 6GB memory and could not run some Q8 models).

The generation speed of 5-10 tokens per second might be considered fast enough for a good user experience. More important here is eval prompt time (how fast models read contexts and user queries), as contexts might be really long. Even though eval time is usually similar or 2x faster than generation time for tok/sec, time to first token (input lag user needs to wait for the first token to be generated) influences experience negatively.

Here, we can see that in the worst-case scenario of 1000 tokens (query + contexts), users must wait even 50 seconds after prompting the system.

SLM model response quality evaluation

But how well did the SLM fabricate the answer assuming retrieved contexts (not always correct) and user query as input? Two approaches were used: Ragas (an automated tool for RAG evaluation with an LLM-as-a-judge approach based on OpenAI models) and human-based manual evaluation. For Ragas, three metrics were calculated:

Answer Correctness – the accuracy of the generated answer when compared to the ground truth,
Faithfulness – consistency of the answer given the context,
Answer Relevance—how pertinent the generated answer is to the given prompt. A lower score is assigned to incomplete or redundant answers.

For Human manual evaluation, two metrics were calculated:

Averaged Score – scores in float range 0-1 mixing how correct the answer is and how good context is utilized, eventually punishing models for excessive off-topic content,
Averaged Hard_EQ_1 Score—as above, but thresholded to accept answers as correct only for those who scored a full 1.0. If not, a 0.0 score is assigned.

Three models in the range 1B-3B were evaluated:

phi2 – 2.78B Microsoft SLM released under MIT,
gemma – instruction tuned 2.51B Google SLM with gemma-terms-of-use license,
tinyLLama_v1.0 – 1.1B LLama model under apache-2.0.

Dataset:

Firefighters dataset—a tiny, handcrafted dataset using online-available PDFs to measure SLM quality given the pre-defined contexts. It consists of 24 human-crafted questions and answers and context grabbed from firefighters’ documents.
Phi-2/TinyLlama/Gemma evaluated for Ragas and Humans Score.

In the first stage, we decided to look for a quickly handcrafted prompt with a proper template. Using an adequate template proved crucial—without it, models tend not to halt quickly, talk about unrelated things, and overall response quality is worse.

phi2

example prompt:

Instruct: Generate answer to the Question, using provided context.\nContext: {space_separated_content}\nQuestion: {user_question}\\nOutput:

At this stage, the impression was positive and, as expected, showed the supremacy of less aggressive quantization. The model did overall good for the positive notes, showing a sign of correct reasoning for not the most straightforward question. Sometimes, despite missing contexts, the model was able to answer correctly. It might be attributed to the generalized skill of improvising and the model relying on its embedded knowledge from the pre-training. On the negative side, the model produced overwhelmingly long answers at this stage. Sometimes they were correct and relevant, but short, concise responses would be preferred.

The big issue with phi2 at this stage (for this templated prompt) was, even after correctly responding, producing additional questions and answers (starting with another Question: token) or changing the topic and continuing the monolog. Here, we observed that some of these (change topic) behaviors start from repetitive patterns (like {correct_answer} Output: unrelated content). One idea would be to apply post-processing and cut off these patterns; however, some ways the model goes to the next topic are creative and impossible to catch by a simple pattern matching. Also, a brute-force post-processing like that could harm some correct responses. The other idea would be to work with prompt engineering to better instruct the model for shorter answers and improve quality, which we have tried in the following steps.

Eval with Ragas

As a next step, we have constructed a small dataset for our needs (firefighters dataset) to check Humans vs automated Ragas eval.

One question was whether we can rely on Ragas’ Evaluation and how it relates to our human scores. This tiny test cannot be treated as a conclusive answer, but using Ragas might help automate evaluation tests. It would be too much to say that Ragas correlates well with human answers (in general), but at least for this small dataset, that looks good. Nevertheless, human checking on top of that (for smaller datasets) should be standard practice. It is worth noting that by keeping the same seed (seed_id for generating the answer with LLM), we can see an increase in the results for less aggressive quantizations on Ragas metrics.

Here are the impressions:

human evaluation scores and Ragas correlate (with few exceptions) on this small dataset,
a good practice would be to run Ragas on large datasets and then validate with human evaluation,
scrolling through answers when human evaluation reveals model weaknesses; some of them can be addressed with prompt engineering,
it’s also worth noting that Ragas can now help with synthetic RAG dataset generation; we didn’t test this functionality, but readers might find it interesting.

Better Prompt Engineering

In the next step, we checked the influence of prompt engineering on the answer.
As a general note, we achieved better results by:

Instructing models what are contexts; instead of space-separated contexts, listing contexts separately and providing the id, like that
context_id: {context}\n.
directing the model to produce short but concise answers, relying on the contexts,
and prompt engineering to counter the model’s flaws independently per each model.

We learned that each model has its problems and needs separate prompt engineering. Here, we show some issues with the Gemma model and how we can address them by modifying the original prompt.

Prompt engineering (without prompt template) after applying the above:

Generate a concise answer to the User Question using the provided Contexts. Read carefully, as the answer is always in one or more contexts. Be aware that not all contexts must be relevant. Contexts:\n{join_contexts}\nUser Question: {prompt}

where join_contexts are newline contexts with hardcoded context_id:
context_0: some text here
context_1: some other text here
…
context_n:…

This makes it easier for models to cite directly by context number.

Injecting these into the prompt helped (and in some cases wholly removed) the issues when manually scrolling through the answers. With the above examples and some techniques like ‘use step-by-step reasoning / read carefully,’ we were able to improve from naive prompts.

Conclusions

The most important takeaways from implementing SLMs with RAG on mobile devices:

While it is indeed possible to run SLM on edge devices and have satisfactory results for applications such as RAG, both in terms of speed and quality, some important caveats need to be mentioned:
- Heavy memory constraints. Recent phones with 12-16GB+ RAM can run models in the range 1-3B and even those in the range 7B (here, quantizations Q5, Q4, or lower). For devices with 6GB RAM, we could run Q5-quantized Gemma and smaller models. 4GB is challenging, but models like 1.1B TinyLLAMA can still be run there.
- Speed Constraints. Newer smartphones are admirably faster. Runtime becomes slow for the longest contexts; for the models with 20 eval-prompt-token/sec, it means that if the context has 1000 tokens, it will take 50 seconds before models start responding (worst case scenario). Check more details in the report.
- Early insights are that for models like Gemma or Phi2, quality can be satisfactory for RAG purposes.
- The rapid growth of supported models and better inference/memory-efficient models are expected shortly.
The llama.cpp is growing in popularity. It might also be used for cloud model serving. It had a rapid adoption of the Gemma model (2 days).
Executorch is an uprising star. Qualcomm kernels have very impressive runtimes for exemplary ML models like image segmentation. However, as of today, LLM support is poor, and the model is not yet production-ready.
The vector database search could be better (compared to Python/cloud services). There is no hybrid search feature and no sparse search, and there is no easy, out-of-the-box way to bring those to the inference/cpp site.
The latency for the SLM to start generating a response may be quite long. This delay time is related to the input model prompt length. It’s essential to get as much output from the retrieval part as possible. However, it’s getting faster with recent, more powerful phones and sophisticated Android-targeted inference engine speed-ups.
Evaluating SLMs requires more work with bigger datasets than our tiny handcrafted firefighter dataset. Automating the generation like that would be helpful, and it seems the Ragas library has introduced such a feature, which might be an interesting approach for continuing the project.
Given the limited lifespan of these projects, not all findings, including better prompt engineering techniques from the eval site, were migrated to the inference site.

With engineering effort put into inference engines, a growing community for SLMs, and tons of research put into it, we expect that in the coming months (sometimes even weeks), the situation will change drastically. Not only will models become more powerful for this 1-3B Range, but the inference site will improve in terms of speed and potential memory consumption.

Ongoing Research

Let’s briefly mention ongoing research efforts that aim to break the current limits of SLMs (or LLMs in general).

Better hardware utilization by dedicated kernel ops (e.g., Executorch + Qualcomm) [7]
Here, it’s purely an engineering effort, but it is worth mentioning, as proper GPU kernels with hardware-aware optimizations can significantly boost Android runtimes.
`1-bit` LLMs [1]
The idea is to go beyond classic quantization and train a model from scratch that keeps weights on merely 1.5 bit, optimizing some multiplications onto addition. The authors showed they could pair with fp16 models, and if this holds for other SLMs we have tested, someone would need to retrain models and add support, e.g., in llama.cpp, but memory and inference speed benefits can be huge. This technique would allow us to bring bigger models to the mobile, which is inaccessible for now due to RAM constraints.
A mixture of Expert (MoE) like in Mixtral of Experts [2] (coarse grain sparsity)
The idea of coarse grain sparsity is where, during inference, only some path(s) are active, and not every weight/layer needs to be calculated. This family of techniques does not lower needed memory but speed inference (as only a sub-part of the model needs to be active during inference). There is also a positive inductive bias to make model parts sparse (like paths of layers) with separate modules (a bunch of layers) specialized, resulting in more powerful models that train faster. In the future, this technique could be combined with sharding – loading some parts of the models from HDD to RAM – but this time minimizing the delays.
Mamba[3], MoE Mamba[4]
Here, there is a trend to mitigate quadratic attention mechanisms (the more significant the context window, the more computationally heavy it becomes) in favor of ‘linear’ attention with RNN/LSTM. Authors claim that Mamba models became significantly stronger for a given capacity than their transformer counterparts. llama.cpp has a PR that brings Mamba support—it’s not yet finished.
MoE Mamba is a combination of both techniques, a mixture of experts and Mamba. Here, the authors also show that both techniques are complementary, increasing model quality.
Sparse kernels+pruning (fine grain sparsity) [5][6]
It is about sparsity at the micro-level, where the model is pruned, and some weight gets locally removed. llama.cpp currently lacks support for sparse kernel operation and sparse weight storage. With minimal quality loss, there is an opportunity here to save more RAM and maybe even speed up inference. This technique, when mature, can also bring bigger models to mobile environments.
Draft + Verify [8]
Last but not least, a technique for speeding the inference by having a lightweight (weak) drafter and (strong) verifier NN on top of the drafter. Drafter proposes many tokens at once, and verifiers can verify them in parallel – then, as many tokens are accepted, the first rejected token from the verifier appears in chronological order. Language model training is parallelized (tokens+masking trained in parallel), but the inference is calculated token-by-token. The big deal is that verifiers can work in parallel, verifying many tokens in contrast to sequential, step-by-step token fabrication like in the current models we have tested. This technique might be brought to mobile someday, again increasing inference speed.
Symbolic Knowledge Distillation [14]
Rather than training SLMs from scratch, distilling skills and abilities like reasoning from the influential teacher (big LLM) via proxy NN critic can result in more powerful models. This is another paradigm shift for removing unrelated knowledge from small models by distilling essential skills and abilities and relying on a RAG-like pipeline for knowledge retrieval, which can result in lighter (and more powerful) models.

We have mentioned just a few techniques, but the research area in this domain is much larger, including dynamic neural network structures, a distillation of reasoning capabilities, and memory-augmented neural networks.

I was a lead engineer on the project and the author of this article. However, the entire team contributed to its success, so I would like to give a big shoutout to Marcin Ochman (Senior Engineer), Paweł Kaczmarczyk (Senior Engineer), and Artur Zygadło (Project Manager) for making it possible.

Tech Stack:
[7]https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html
[9] https://github.com/ggerganov/llama.cpp
[10] https://github.com/skeskinen/bert.cpp
[11] https://github.com/facebookresearch/faiss
[12] https://conan.io/
[13] https://docs.ragas.io/en/stable/

Active Research:
[1] https://arxiv.org/pdf/2402.17764.pdf
[2] https://arxiv.org/abs/2401.04088
[3] https://arxiv.org/abs/2312.00752
[4] https://arxiv.org/pdf/2401.04081.pdf
[5] https://www.youtube.com/watch?v=0PAiQ1jTN5k
[6] https://huggingface.co/neuralmagic
[8] https://arxiv.org/html/2401.07851v2
[14] https://www.youtube.com/watch?v=H_IfCbpS6G0
[15] https://arxiv.org/abs/2404.01744

From LLMs to RAG. Elevating Chatbot Performance. What is the Retrieval-Augmented Generation System and How to Implement It Correctly?

March 28, 2024/in Generative AI /by Patryk Kowalski

Chances are you’ve already heard about RAG – Retrieval-Augmented Generation. This technology has taken the industry by storm, and for good reason. The emergence of RAG systems is a natural consequence of the popularity of Large Language Models. They make it easier than ever before to create a chatbot – one deeply entrenched in the domain of your company data. It can provide a natural language interface for all the company information that a user would normally have to dig through heaps of internal documents to get.

This saves so much time! Let’s just consider the possibilities:

A factory worker could ask what an error code means and how to proceed with it, instead of hopelessly skimming through bulky instruction manuals.
An office worker could check on any policy without pestering HR.
A retail worker could see whether specific promotions stack together.

And the list goes on.

Why can’t we just use GPT though? Is this ‘RAG’ necessary? Well, there are issues with using LLMs directly in such cases:

Hallucinations – while LLMs are great at creating plausible sentences, they may not always be factually correct.
Lack of confidence – LLM by itself won’t be able to confidently declare how it knows what it says, or how the user can confirm it.
Domain adaptation – Large Language Models are large. Training them in the specifics of what you want them to know is not a task that comes easily or cheaply!
Domain drift – Let’s say you managed to train a GPT-like model to know everything about your particular use case. What if the underlying data have changed? Do we have to do everything over again?

There are a lot of risks involved in creating a chatbot using LLMs – thankfully, RAG is here to support us.

This article focuses on the retrieval component of retrieval-augmented generation – making sure the correct context is fetched from the company documents and passed onto the answer generation stage. It is based on our hands-on experience building multiple commercial RAG systems. We have read a ton of papers, and learned what works well on actual client data and what doesn’t – and we’ve compiled it all for you here in this article!

What is RAG? Retrieval-Augmented Generation explained

I assume I have managed to get your attention by now. You know you can use RAG to anchor a generative model in your company data. Who wouldn’t want a seemingly flawless solution like that? You’re probably still a tad suspicious though. It sounds too good to be true, and you’re not sure how it works. Let’s take care of that!

RAG workflow

Figure 1. RAG workflow, source: https://towardsdatascience.com/retrieval-augmented-generation-rag-from-theory-to-langchain-implementation-4e9bd5f6a4f2

A typical RAG workflow will look like this:

The user asks a question.
The question is converted to a numerical representation for convenient processing.
Pieces of company knowledge similar to the question asked – either semantically, or in terms of keywords – are picked up.
The relevant text gets packed into the LLM context.
The LLM is fed the relevant context and user question, and uses it to come up with an accurate answer.
An exact source and citation are provided for the user, so the truthfulness of the answer can be verified.

After the workflow finishes, the user is equipped with an exact answer to the question and a relevant passage from the internal documents, validating this information.

What are the benefits of the Retrieval-Augmented Generation?

There are multiple benefits of using Retrieval-Augmented Generation compared to alternative methods of creating chatbots anchored in a specific domain. Amongst the most important ones, we can highlight the following:

No training necessary

Before RAG, trying to teach an LLM domain-specific information required fine-tuning. While the Performance Efficient Fine Tuning branch of Machine Learning is growing strongly, training still requires:

know-how,
computational resources,
a lot of data.

Except for some very specific use cases, it’s best avoided altogether.

The RAG system does not require any training of the base Generative Model.

Fewer hallucinations

Even assuming someone has managed to fine-tune an LLM correctly, unfortunately, it is still prone to hallucinations. The model can use knowledge built-in during the pretraining to formulate an answer, or it can come up with a plausible-sounding false explanation when lacking data.

The RAG system handles hallucinations by providing the model with the exact context it needs to provide a truthful context. The model can be further instructed not to rely on any built-in knowledge if it’s not present in the retrieved context, thus reducing the probability of hallucinations. It’s not possible with a fine-tuned model, as built-in knowledge is all it has.

Dynamic knowledge base

With the RAG system, you can change the knowledge base whenever you feel like it. No repeated training is necessary, nor are any additional steps for that matter. All you need to do to make new knowledge available to the model, or deprecate some previous documentation, is to swap the documents uploaded to the vector database.

Citations

The RAG system structure makes it possible to return sources and citations for the information returned by the model. It allows the user to validate any answer received, and check the wider context in the linked company documents.

Building retrieval for your RAG system

Now that we know how retrieval-augmented generation is supposed to work, and what it’s good at, let’s see how to build one.

Note that everything we talk about here has been field-tested – this is knowledge based on actual commercial projects delivered by deepsense.ai!

Embeddings & Vector databases

The first thing you need to do when building an RAG system is to convert your documents to their vector representations and store them somewhere.

Embeddings

Embedding encapsulates the meaning of a sentence inside a numeric vector. It allows for further operations, like a similarity search. To create embedding vectors, we use sentence-embedding models. There are multiple models available, and a good place to start selection is a leaderboard. When making your choice, be sure to check the following model parameters:

Does it support the language you’re interested in?
Is the context size suitable for your needs? How large of a chunk do you need to embed at once?
Will you need to threshold retrieval results based on similarity scores? Some models are not suitable for thresholding, because of the resulting tendency of embeddings to always score highly on similarity.

Text splitters

A whole document will rarely fit inside an embedding model. The text needs to be split into digestible chunks, no larger than the embedding model’s maximum supported context. A text splitter will help separate the text into smaller, hopefully semantically homogenous chunks.

Splitting by character

Figure 2. Character Text Splitter, source: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb

The simplest text splitter is the character-based one. You need to define a separator – typically an empty string will work just fine, and the chunk is populated letter by letter until it reaches the maximum chunk size. You can opt for chunk overlap, where there are a couple of common letters between adjacent chunks. It’s typically a good idea to do so, because otherwise you risk splitting a sentence in an awkward place, losing the semantic meaning in either chunk.

This solution leaves much to be desired – a perfect chunk encapsulates a singular idea to make retrieval easier and the content leaner. We should find a way to maximize the probability of getting correct splits. That’s where recursive text splitters enter the stage.

Splitting recursively

Figure 3. Recursive Text Splitter, source: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb

Recursive text splitters let you define a number of separators in a specific order. This way, you are able to express an intended hierarchy. For example in the langchain implementation (available here) the default separators are [“\n\n”, “\n”, ” “, “”] – splitting by paragraphs, newlines, spaces, and with no alternatives left, on an empty string. Prioritizing splitting into paragraphs makes it less likely to break a coherent thought into separate chunks.

If the documents you’re working with have some specific structure like markdown or html – even better! You can use the knowledge about the format to come up with a better hierarchy of separators – for example, using the html tags.

Splitting semantically

Figure 4. Semantic Text Splitter, source: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb

A rather new and exciting idea is to try to split the text into chunks based on its meaning rather than particular predefined separators. This can be achieved by embedding the text sentence by sentence, and measuring the distance between consecutive embeddings. Wherever there is a peak of distance, it’s like the topic has changed, and it’s a natural place for a break. Keep in mind, this makes text splitting dependent on the embedding model used.

Structure

You can keep the resulting chunks in a flat structure, one next to the other, but a much better way is to create a hierarchy. We’ll get to why and how in a further section of this article, but for now, let’s keep in mind it’s useful to perform nested text splitting – choosing parent chunks with big chunk sizes first, and then splitting each one of them further into child chunks.

Tunable parameters

After this stage, you will get a number of parameters you want to sweep through to find the configuration that works best for your use case. These will be:

chunk size,
chunk overlap,
parent chunk size.

Vector databases

Now that we have our embeddings, we need to store them somewhere. Fortunately there is no need to build this storage from the ground up, as there are many refined implementations of vector databases specializing in storing, indexing, serving and performing searches on vectors. Some are even open-source!

Figure 5. Vector Database Workflow, source: https://www.pinecone.io/learn/vector-database/

Select your vector database

There are many options when it comes to selecting a vector database. Some of the characteristics you may want to pay attention to when making your decision are as follows:

License – is it open source? Do you need it to be?
Supported search – you will typically want some kind of sparse, dense and hybrid search. We’ll get to what that means in a minute.
Managed – are you going to host the vector database yourself, or is a managed instance more up your alley?
Framework integrations – do you care about any specific framework integrations? If your whole app is in langchain or llamaindex, you will probably require your vector database to play nicely.
Indexing – depending on what scale of data you’re working with, you will want a different kind of index. An index is what allows you to perform an efficient vector search.

To make an informed choice you can use a comparison tool, like this one:

Figure 6. Vector DB Comparison, source: https://vdbs.superlinked.com/

There are rumors of openAI and anthropic using qdrant internally – maybe there’s a slight edge there?

Configuration based on the example of Weaviate

Let’s see how to set up a vector database based on the example of Weaviate, which is one of the most popular providers.

Figure 7. Weaviate Vector DB Configurator, source: https://weaviate.io/developers/weaviate/installation

You can use a configurator to put together a docker-compose or a kubernetes-helm file that gets you started. You need to click through some options there, and if you are a data scientist, you will mostly care about the following:

Vectorizer

A vectorizer is the model that will be used to embed your data and the user queries. You can have it self-hosted. Weaviate offers a large selection of pre-built images ready to roll out. If you are into a niche model that hasn’t made its way there yet, you can build your own image, and it will work as long as it’s compatible with Hugging Face’s AutoModel. In any other case, you can’t go wrong with an external API, e.g., from openAI and their sturdy text-embedding-ada-002.

Reranker

A reranking model helps you keep the retrieval results sorted correctly – with the most important one at the top. It can be important for the quality of the generated answer, and the accuracy of the returned citation, so you may want to pay for the extra processing time and GPU usage to have it handy. The good news is that the reranking cross-encoder model will only run on the prefiltered set of retrieved chunks.

Indexes

Indexing makes it easier to calculate the semantic similarity between high-dimensional vectors, like the ones we get from text embeddings. Check out this page for a simple explanation of how different indexes work. Weaviate offers only two of them:

Flat indexing

The right choice when perfect accuracy is required and speed is not a consideration. If the dataset we are searching is small, flat indexing may also be a good choice as the search speed can still be reasonable.

HNSW

Figure 8. Hierarchical Navigable Small Worlds, source: modified https://arxiv.org/pdf/1603.09320.pdf

The HNSW index is a more complex index that is slower to build, but it scales well to large datasets as queries have a logarithmic time complexity.

It uses a multi-layered graph approach to indexing data. The lowest layer contains all the vectors as they are – but on each higher layer, they are increasingly grouped together. The user query starts its journey at the top layer, traversing its way to the bottom, getting closer to the most similar chunk with each step.

Search methods

In a vector database, you will typically encounter the following search methods:

Full text

Used for metadata filtering. You can use some additional metadata with your chunks, such as tags, that can be used to perform initial filtering. In this case, a full-text match is necessary.

Keyword (sparse vector) search

Word frequency-based search, good for catching keywords. Let’s imagine your use case involves a number of technical user manuals. Those tend to be semantically similar, with the crucial difference of describing different equipment. In such a case, the name of the part of interest – a keyword – needs to be a significant part of the search. Weaviate uses BM25 to perform this kind of search.

Vector search

The bread and butter of similarity search methods. You can further decide on a distance metric, but the default cosine similarity tends to get the job done.

Hybrid search

Figure 9. Weaviate Hybrid Search, source: https://weaviate.io/blog/hybrid-search-fusion-algorithms

Why choose one, if you can have it all? A hybrid search combines dense and sparse vector search methods and combines the results. You can use the alpha parameter to regulate dense/sparse search importance. There are caveats though! When using a hybrid search, the confidence score is calculated as a combination of the dense/sparse confidence scores, making it harder to interpret or threshold.

Get only relevant results

At this point we should be able to retrieve chunks of information relevant to the user query – but how many of them? We need to set a limit. Not enough chunks can make us miss important information, while too many of them can bloat the LLM context and confuse it.

Weaviate proposes an autocut feature – “Autocut aims to approximate where a user would cut the results intuitively after observing N jumps in the distance from the query.” It can get a bit tricky if you’re using the hybrid search method with it – the rescaled confidence score can trip autocut up. In this case, make sure you’re using RelativeScoreFusion, and not RankedFusion. It is supposed to work better, because it often results in natural clusters that autocut can detect.

Reranking

The first result is the best result. Or it should be. For a limited result set you can afford a more computationally expensive approach, so it’s the reranker’s time to shine. The cross-encoder model ingests a pair of user queries and one of the retrieved chunks to compute a relevance score between them. This number will be used to reorder the chunks and make sure the most relevant ones will be served first. An added benefit is that this score tends to be suitable for thresholding – for example, if you want to be able to decide that a similarity smaller than X should cut the RAG workflow short, and skip the answer generation step.

Tunable parameters

After this stage, the group of tunable parameters should welcome new contenders:

hybrid search alpha parameter,
hybrid search score fusion method,
chunk limit,
autocut limit,
switching reranker on or off.

Measuring results

We have said a great deal about tuning parameters – but for tuning we need a metric to optimize. For retrieval, Mean Average Precision is a great candidate to optimize, because it only scores highly if the relevant documents are first on the list.

This is a go-to metric for us in commercial projects and it has proven very reliable.

Let’s go through how it works together:

Figure 10. Mean Average Precision calculation, source: own study, inspired by https://www.educative.io/answers/what-is-the-mean-average-precision-in-information-retrieval

The user query is fed into the search algorithm and results in 5 documents being retrieved.
Only three-fifths of the documents are relevant.
For each of the documents, we calculate the precision. Precision@k will be the ratio of relevant to missed documents for the kth result.
For each of the documents. we multiply its relevance and its precision.
We summate the resulting numbers, and divide it by the total number of relevant documents that it’s possible to retrieve – thus arriving at the Average Precision.
We repeat the process for more queries, and average the results to get the Mean Average Precision.

As you can see, the calculations aren’t very complex, but you need to prepare a dataset beforehand for it to make sense. In particular, you need to be aware how many relevant documents there are for any given query.

You also need to decide when a retrieved chunk is considered relevant. It can be deemed relevant if it comes from a relevant document – but if your documents are large, you may need a more precise method. You can also define a set of ground-truth sentences you think should be retrieved for a given query. It introduces a new sort of trickery, because depending on how you split your text into chunks, a given ground-truth sentence can be split into multiple chunks, or one chunk can contain more than one ground-truth sentence! In such a case, feel free to modify the Mean Average Precision definition, for example, allowing the relevance metric to exceed the [0,1] range. It won’t technically be the Mean Average Precision anymore, but it will do great for parameter optimization and comparison of different retrieval setups.

Tuning retrieval parameters

We have gathered quite a few parameters to optimize. Let’s recall what they were:

embedding model type,
chunk size,
chunk overlap,
parent chunk size,
hybrid search alpha parameter,
hybrid search score fusion method,
chunk limit,
autocut limit,
reranker type.

To sweep through all of those effectively, we advise you to use Hydra & Neptune.ai working in tandem.

You can use hydra to set up ranges for the parameter sweep conveniently. Use the –multirun option to run your tests efficiently. A simple bash script may look like this:

If you are running your embedding model locally, do not forget to set the hydra.launcher.batch_size when sweeping through embedding parameters, like the chunk size, to make sure you fit in your GPU’s VRAM.

When running the experiments in parallel, setting the Neptune.ai synchronization to offline and uploading the experiments manually after they’re done can help you save the connection pool from unnecessary abuse. You can do it as follows:

Get accurate citations

LLM wouldn’t make stuff up, would it? Hopefully not, but as Ronald Reagan used to say – trust, but verify.

We want to provide the user with the means to validate whether the generated answer is factually correct. A direct citation from the source document gets the job done – but how to get it?

Use retrieval results directly

The simplest way is to just use the passage retrieved with the highest confidence score. It won’t cover scenarios with more than one relevant passage, and sometimes it will be plain wrong, if LLM used another chunk for its response… but it still tends to work in the majority of cases, and its simplicity and lack of added complexity cannot be valued highly enough.

OpenAI Function-calling

If you are able to work with the openAI API, you can make use of their function calling feature. Just define a function that returns the answer in the form of a JSON dictionary, with keys for the answer and the citation respectively. You can index the citations to allow the model to just pick a number, making it as easy as can be. You can also opt to make the model repeat the citation verbatim, but you will quickly find out that LLMs enjoy modifying the text they repeat a bit here and there, making a direct match impossible.

You can achieve a very similar result using Direct Prompting – just include a few-shot example in the model prompt, where you show that the answer always needs to be coupled with a citation index.

How well is it going to work? The answer can vary wildly depending on the actual LLM you’re using, and the quality of data. LLMs on the smaller end of the spectrum, like LLAMA2 7b, cannot be trusted to select a correct citation. GPT-4 will be correct most of the time given a clean enough dataset. It’s not going to go great if your data are messy, though. Suppose the documents you use contain leftover watermarks, random numbers, or OCR artifacts. In that case, the model will have difficulty determining where one citation ends and the other starts, and which number is the citation index.

Don’t forget to check out langchain implementation of those methods.

Learn from our experience – here’s what boosted the performance of our RAG systems

There are A LOT of tricks meant to improve your retrieval performance. This field is growing explosively and produces an unmanageable amount of ideas. Not all of them are all that useful though, and some just aren’t worth the time. When working on commercial projects, we sifted through the internet tips and academic papers to test them all – and a few of the methods tested have proven quite extraordinary.

Multiquery – Reciprocal Rank Fusion

Semantic search tends not to be very stable. RAG users will ask the same question in a multitude of different ways, and we would like them to always get the same results. That’s what multiquerying is for!

Figure 11. Reciprocal Rank Fusion, source: https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1

Each user query generates a number of alternatively phrased queries with the same meaning. It will work best if you use an LLM to request rephrasing, but if you are reluctant to introduce another LLM call, or are very wary of the response time, there are always simpler solutions. For example, you can shuffle the letters a bit to produce alternative queries by introducing typos.

This way you will get a set of different results instead of one, so they need to be combined back into a simple list. A Reciprocal Rank Fusion algorithm can help with that. It makes sure to put the following information at the top:

The most frequently retrieved documents
Documents rated with the highest confidence.

This is a great method to increase retrieval stability and robustness to imperfect queries.

Hierarchical chunking

Do you want your chunks big or small?

Small chunks, obviously, because…	Big chunks, obviously, because…
Small chunks make more meaningful embeddings	Big chunks decrease the risk of dropping information
Small chunks keep LLM context from bloating	Big chunks retain context
Small chunks decrease response time

Instead of deciding, let’s try to take the best of both worlds – hierarchical chunking!

Figure 12. Parent Chunking, source: https://medium.com/ai-insights-cobet/rag-and-parent-document-retrievers-making-sense-of-complex-contexts-with-code-5bd5c3474a8a

First run your text splitter with a big chunk size – bigger than you can fit into an embedding model. Then, for a second round, split each parent chunk further down into several child chunks – those will be vectorized.

You will perform a similarity search on the child chunks – but you are always free to swap a child chunk for its parent before building the LLM context! A smart way to go about this is defining another tunable parameter in the range [0,1], and determining when to perform the child -> parent swap. For example, setting it to 0.5 would mean you need to retrieve at least half of all child chunks to trigger the swap for a parent.

What’s left? Well, everything else, of course!

If you followed this journey with me, I’m sure at this point you have a wonderful retrieval setup. Does it mean we’re done? Well… almost. To have a full-scale RAG system running, you need to include the answer generation stage too. You will need to come up with a few prompts for the LLM and connect to your favorite model. That’s beyond the scope of today’s adventure though – let’s take a breather and get back to this with a fresh mind!

Summary

RAG systems are great for building chatbots anchored in domain data. They are cheap to build, require no training, and solve a lot of problems inherent to generative models. RAGs validate their answers by providing citations, have a decreased probability of returning hallucinations, and are easy to adapt to a new domain, which makes them a go-to solution for multiple use cases.

Building a solid retrieval mechanism is a cornerstone of any RAG system. Feeding the generative model with accurate and concise context enables it to provide great and informative answers. There is a lot of literature regarding building RAG, and filtering through all the tips and manuals can be time-consuming. We have already checked what works and what doesn’t – as part of successful commercial projects – so make sure to take advantage of a head start and use our tips:

Be mindful when selecting the components: the vectorizer, reranker and the vector database.
Create a benchmarking dataset – not necessarily a huge one – and tune all the retrieval parameters specifically for your use case.
Do not forget to use multiquerying and hierarchical chunking – they give you a lot of ‘bang for your buck’.

With retrieval built this way, you are on a sure path toward a perfect RAG system.

Cost-Effective LLMs: Leveraging Generative AI with Limited Hardware

Reducing the cost of LLMs with quantization and efficient fine-tuning: how can businesses benefit from Generative AI with limited hardware?

February 29, 2024/in Generative AI /by Alicja Kotyla and Artur Zygadlo

More than a year has passed since the release of ChatGPT, which led hundreds of millions of people to not only talk about AI, but actively use it on a daily basis. The wide adoption of ChatGPT and other large language models (LLMs) among individuals made companies of all sizes and across all sectors of industry wonder how they could benefit from this upward-trending technology. One of the main challenges with turning LLMs into business value is the high cost of the expensive hardware required to run the models. Fortunately, recent developments in the field have allowed companies to significantly lower these expenses through cost-effective LLM solutions and reducing LLM operational costs.

This article will be a summary of the most recent trends around LLMs, focusing on LLM democratization – making generative AI easily accessible to everyone. The two main topics we will dive into are quantized inference for LLMs and efficient fine-tuning of LLMs. Apart from explaining these concepts and stressing their importance, we will share our experience from their practical use in commercial LLM projects which we have recently delivered to our clients. These methods, including LLM quantization methods and mixed precision training benefits, are key to LLM optimization for minimal hardware and memory-efficient LLM training.

If you want to know more about the history of large language models, their different “flavors” and example applications, feel free to check out our report The diverse landscape of Large Language Models or watch our webinar on AI revolution with large language models.

LLMs for everyone – the recent trend of AI democratization

ChatGPT is an example (not the first of its kind but undoubtedly the most famous one) of a large language model. LLMs are machine learning models based on deep neural networks, capable of generating text by autoregressively predicting the next word (or the next token, to be more precise). Their applications range from answering questions based on provided documents or knowledge bases (so-called retrieval-augmented generation, or RAG for short), to text summarization, content creation, coding assistants, and more.

The most powerful models like those from OpenAI (ChatGPT, GPT-4), Google (Gemini Ultra), and several open-source alternatives (Falcon, Llama 2, or Mixtral, to name a few) are astonishingly performant, even superior to humans in many tasks. Their power can be unleashed thanks to the fact that they are large, i.e. they consist of dozens or hundreds of billions of so-called parameters – numbers that describe how to convert the input data into an output text.

Between 2018 and 2021, based on the visibly improving capabilities of increasingly larger models, the direction set in the field by the world’s biggest research labs was to push the model size to the extreme. However, with the large size comes many challenges, with the high cost of hardware (GPUs with enough memory) required not only for training the model but even serving it for inference being one of the main limitations and factors making business decision makers hesitant to build products or try to improve existing processes with the help of LLMs. Solutions like activation checkpointing in LLMs, Zero Redundancy Optimizer (ZeRO), and memory optimization for LLM training are essential in addressing these challenges.

Figure 1. Recent trends in LLMs, source: own study

Fortunately, around early 2022, LLM researchers started to understand that for the LLMs to be practically applicable, scaling the models further up was not the way to go. Instead, they turned their attention to training so-called compute-optimal models, which paved the way for all the much smaller (with “only” a few billion parameters), yet still powerful, language models developed both by big tech companies and the open-source community in 2023.

In parallel to the models, motivated by the need to reduce costs and broaden access to the technology, new algorithms and ideas around quantized inference and parameter-efficient fine-tuning have been developed. We will delve into the details in the following sections.

LLM quantization – reducing the memory required for inference

The larger the model, the more GPU memory it needs to load its parameters into. Typically, billions of parameters require gigabytes of RAM. But how many exactly? To answer this question, we will first describe the various ways in which numbers (and hence the model parameters as well) can be represented.

Numeric data types

Two common floating-point formats used in deep learning applications include the single-precision floating-point format (known as FP32) and the half-precision floating-point format (FP16). FP32 utilizes 32 bits for representing a floating-point number, distributing 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand (see Figure 2), whereas FP16 allocates 16 bits, dividing them into 1 bit for the sign, 5 bits for the exponent, and 10 bits for the significand. It is worth mentioning one more data type, designed specifically with deep neural networks in mind – bfloat16 (brain floating point). Numbers in this format are also represented with 16 bits, but composed differently: 1 sign bit, 8 exponent bits (same as FP32), and 7 bits for the significand. In this way, despite bfloat16’s lower numerical precision, its dynamic range is equivalent to that of FP32.

Figure 2. Illustration of various numeric data types, source: https://www.microsoft.com/en-us/research/blog/a-microsoft-custom-data-type-for-efficient-inference

As a consequence of how these representations work, to load an LLM with 1 billion parameters, we would need 4 GB of GPU memory in the case of FP32 (32 bits, so 4 bytes per parameter), and 2 GB (2 bytes per parameter) for FP16 or bfloat16. It should now be easy to calculate how much is needed to serve the largest Llama 2 model with 70 billion parameters!

Although loading the model in half-precision saves a lot of memory (only half the amount is required compared to FP32), and it has become the de facto standard for neural network inference (as little or no quality degradation is usually observed compared to FP32), this still might not be enough in the case of LLMs. As these multi-billion-parameter models require powerful (and therefore expensive) GPUs like the NVIDIA A100 with 80GB RAM to run on, further memory savings are often required.

For this purpose, various quantization techniques that aim to shrink the model size even further while still preserving the model’s performance (sometimes even with a positive impact on response latency) have been developed. Recently, several algorithms compressing model parameters to data types such as INT8 (1 bit for the sign, 7 bits for the significand) and INT4 (1 bit for the sign, 3 bits for the significand) have gained the attention of the LLM community, among which they are widely used. We will discuss the most popular ones in the further sections.

Additionally, if you want to know more about the basics of quantization of neural networks, feel free to watch our deeptalk which introduces the topic.

Benefits and possible pitfalls of quantization

Quantization involves representing weights and activations with low-precision data types, such as INT8 or INT4, instead of the typical floating-point numbers, resulting in reduced computational and memory costs of LLM inference.

Regarding inference, it is worth mentioning that LLMs can be utilized in two ways. One possible approach is to set up a dedicated server with adequate GPUs (either on premises or within a virtual private cloud) and deploy the model there. In the cloud environment, one needs to pay a few dollars per hour to keep the utilized machine running. Another approach is to leverage one of the LLM-as-a-Service APIs like OpenAI API or Anyscale Endpoints and send requests (prompts) on demand. In this case, the hardware and infrastructure is fully handled by the API provider, and related expenses are covered by the money paid by the API users for each request, typically described in terms of a fraction of a dollar per every million tokens sent (e.g., in Anyscale Endpoints, the prices range from $0.15 for Llama 2 and Mistral models with 7 billion parameters to $1.00 for Llama 2 with 70 billion parameters for each million tokens).

Table 1 summarizes the costs of using selected open-source LLMs based on examples of Google Cloud Platform pricing. In the case of the deployment of an LLM in one’s own cloud, a suitable machine with sufficient memory needs to be chosen from the available configurations. Different cloud instances offered by Google (GCP), AWS, or Microsoft (Azure) have their own pricing, with the presence of high-end GPUs like NVIDIA A100 with up to 80GB VRAM having the biggest impact on the price. With INT8 quantization, the required memory is around 2 times lower, and INT4 quantization leads to 4 times lower memory requirements, allowing users to utilize the quantized model on a 3-4 times cheaper machine.

Model name	Number of parameters	Quantization	Estimated GPU memory required for inference	Hourly cost of running a cloud instance with sufficient memory (example pricing of GCP instances)	Hardware required for inference (example configurations of NVIDIA GPUs available in GCP)
StarCoder	15.5B	FP16	36 GB	$2.87	1x A100 (40 GB)
		INT8	18 GB	$0.70	2x T4 (16 GB)
		INT4	10 GB	$0.35	1x T4 (16 GB)
Llama 2	70B	FP16	150 GB	$7.70	2x A100 (80 GB)
		INT8	80 GB	$3.85	1x A100 (80 GB)
		INT4	40 GB	$2.87	1x A100 (40 GB)

Table 1. Example cost of LLMs self-hosted in the cloud, based on Google Cloud Platform pricing (as of February 2024)

With these cost savings being the biggest advantage of quantization, one might wonder what the downsides are, e.g., with respect to performance or speed. While the evaluation of LLMs and other generative models in terms of output quality remains a challenging and widely researched topic, subjective manual analysis of sample outputs reveals that in many cases the modern quantization algorithms do not seem to introduce any visible degradation. Quantization should not be too aggressive though, as at some point an unacceptable level of degradation can be observed – going below 3 bits per parameter is not recommended.

When it comes to response latency, it should not increase after quantization. In fact, it is sometimes even possible to observe inference speed-ups compared to FP16, especially when using specialized low-level kernels for integer matrix calculations. However, this largely depends on the specific usage patterns and computing environment (batch sizes, prompt lengths, utilized hardware etc.), and requires verification in practice.

Selected quantization algorithms

The topic of model quantization is currently widely researched, with advancements being developed at a fast pace. In this section, we will introduce and explain the technical details of selected quantization algorithms and share our practical experience with some of them.

Naive approach

One of the classic quantization algorithms is range-based linear quantization, an approach in which floating-point values are quantized by multiplying them with a scale factor derived from the actual range of the tensor’s values. Within this algorithm, we can distinguish two modes: symmetric and asymmetric.

In asymmetric mode, we align the smallest and largest values from the floating-point range with those of the integer range using a zero-point and a scale factor.

Figure 3. Asymmetric mode, source: https://intellabs.github.io/distiller/algo_quantization.html

Conversely, in symmetric mode, we pick the maximum absolute value between the smallest and largest values of the float range without a zero-point, resulting in both the float range and the quantized range being symmetric around zero.

Figure 4. Symmetric mode, source: https://intellabs.github.io/distiller/algo_quantization.html

One of the issues with the approaches presented above, due to the reliance on identifying maximum values, is their sensitivity to outliers which can hugely impact the quantization results.

LLM.int8()

In 2022, Dettmers et al. introduced LLM.int8(), a quantization method addressing the problem of outlier values, frequently present in LLMs’ internal calculations. It uses vector-wise quantization, prioritizing precision for outliers in FP16 format while processing the vast majority of values in INT8 format. With outliers typically making up only around 0.1% of all values, this approach cuts LLM’s memory usage by nearly half (compared to FP16).

LLM.int8() operates in three main stages during matrix multiplication:

It identifies columns in the input hidden states that contain outlier features using a specific threshold.
It conducts the matrix multiplication, processing outliers in FP16 and non-outliers in INT8 using vector-wise quantization.
After dequantizing the non-outlier results from INT8 to FP16, it combines them with the outlier results to produce the complete result in FP16.

Figure 5. Illustration of LLM.int8(), source: https://huggingface.co/blog/hf-bitsandbytes-integration

LLM.int8() was definitely a great development, allowing users to run and experiment with otherwise unavailable models; however, due to its nature (on-the-fly dequantization to FP16 for outlier values), its usage can lead to inference slowdown compared to serving purely 16-bit models.

Our experience with LLM.int8()

The inference slowdown of LLM.int8() mentioned above was confirmed by our experience in a commercial project around LLMs for code completion. According to our experiment results (with 7 billion and 15 billion models from the CodeGen and StarCoder series), the response latency may increase up to 1.5 times. While making it possible to run an inference of StarCoder 15.5B on a single 24GB GPU or two 16GB GPUs (and spend around 4 times less compared to FP16, as shown in Table 1), the speed of code generation was unacceptably low for the model to be usable in the form of a coding assistant. Back then, we had to stick to a smaller model and serve it in FP16. Fortunately, faster quantization algorithms (described below) were proposed not long after.

GPTQ

Introduced by Frantar et al., 2023, the GPTQ (Generative Pre-trained Transformer model Quantization) algorithm is a post-training quantization method, i.e., the weights of an already trained model are converted to lower precision without necessitating any retraining. GPTQ quantizes the model layer by layer, by iteratively going through matrix columns and finding compressed versions of their elements (one for each row) that will yield a minimum mean squared error on a pre-defined calibration dataset. The approach builds and improves on the Optimal Brain Quanization (OBQ) method (Frantar et al., 2022) for solving the layer-wise quantization problem defined above.

GPTQ is currently one of the two most popular quantization techniques (along with AWQ described below). GPTQ-quantized (typically 8-bit and 4-bit) versions of all major newly released LLMs are introduced soon after, and ready-to-use for both further research and commercial applications.

Figure 6. Illustration of GPTQ, source: https://mlabonne.github.io/blog/posts/4_bit_Quantization_with_GPTQ.html

A recently published extension of GPTQ is ExLlamaV2 (EXL2, for short). EXL2, like GPTQ, uses the same optimization method and supports 2, 3, 4, 5, 6 and 8-bit quantization. This format enables you to blend different quantization levels within a model to reach an average bitrate of 2 to 8 bits per weight. Additionally, it makes it possible to apply various quantization levels to each linear layer, resembling sparse quantization, where more crucial weights are quantized with more bits. In this way, it allows Llama2 70B to run on a single 24 GB GPU with a 2048-token context, and 13B models can be used at 2.65 bits within 8 GB of VRAM.

Our experience with GPTQ

Currently, one of the most popular applications of LLMs is retrieval-augmented generation (RAG). In RAG, rather than relying solely on the model’s parameters, the user’s input is first leveraged to extract data from an external source of knowledge, and then both the user query and relevant information are integrated into an LLM prompt, enhancing response generation and mitigating the hallucination issue (to a certain extent). In one of our recent projects concerning RAG in the context of a copilot application for frontline workers, which we developed for a global company from the retail sector, we tested the 4-bit GPTQ version of the powerful open-source model Mixtral 8x7B. With this quantization, we managed to run the model on a single NVIDIA A100 card with 40GB vRAM. For comparison, to run the half-precision model you would need about 90GB GPU vRAM, which exceeds the capacity of the largest available A100 GPU (80GB). As presented in Table 1 (with the example of Llama 2), such memory savings lead to cutting the costs of inference by almost two-thirds. Moreover, as we manually verified the quality of the responses, we observed no difference between the outputs of the quantized model and those of Mixtral served in FP16. In our experiments, 4-bit GPTQ and FP16 models were more or less on par in terms of the speed of text generation.

If you want to learn more about our experience with building RAG systems, consider watching our deeptalk.

AWQ

Another recently popular post-training quantization technique is Activation-aware Weight Quantization, or AWQ (Li et al., 2023), based on the observation that among the LLM’s weights (parameters), not all are equally important for the model’s performance. By identifying a small fraction (0.1%-1%) of so-called salient weights and scaling them up, AWQ effectively reduces their relative quantization error. To pinpoint these salient weight channels, the algorithm analyzes the activation distribution instead of the weight distribution.

Going into more detail, AWQ consists of three main stages:

profiling activations – run a sample of data through the LLM and record activations, then analyze to identify salient weights.
optimal scaling – scale up salient weights to minimize quantization error.
quantization – apply optimal scaling and quantize all weights to INT8/4/2.

The AWQ paper highlights a 3.2-3.3x average speedup compared to Huggingface’s FP16 implementation across various LLMs, but these findings should be treated with caution, as they only report the results for a single, short input prompt. As already stated above, the observed speedup will depend on many factors related to hardware, prompt length and usage patterns. Memory savings, and therefore also cost reduction, in the case of AWQ are similar to those of GPTQ, leading to as much as 3-4 times less money spent on keeping the inference server running.

GGUF

Typically, LLMs are implemented in Python (using its deep learning libraries like PyTorch or Hugging Face transformers), which is not the optimal choice for maximizing inference speed. An interesting project called llama.cpp was started by Georgi Gerganov soon after Meta released their first Llama models in March 2023, with the goal of reimplementing these newly proposed LLMs in C++. Due to its lightweight, portable nature and support for a wide variety of hardware, the project became very popular and matured quickly, currently allowing users to choose from multiple LLMs other than Llama (all implemented in C++).

Apart from implementing the models, the authors of llama.cpp came up with their own quantization algorithm, called k-quants and often referred to as GGUF (after the format in which llama.cpp models are served). This algorithm is less sophisticated than GPTQ or AWQ, but useful in practice, and much superior in terms of inference speed in CPU-only scenarios. There is a wide selection of quantized representations (2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit), with the possibility to mix different levels of quantization within a single model.

Figure 7. Phi-2 as an example of a “small language model”, source: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models

Our experience with GGUF

We utilized llama.cpp and GGUF-quantized models in our recent PoC project regarding LLMs, or rather SLMs (small language models) deployed on edge devices like mobile phones with Android and only 4 or 6 GB of RAM. We were able to set up the entire RAG pipeline, including the vector index, an embedding model and the generative language model. We found out that 8-bit versions of models like Phi-2 or Gemma were performing reasonably well for retrieval-augmented generation (while for more extreme 4-bit and 5-bit quantizations we saw a significant drop in performance). As of today, running language models on devices with such limited memory is a challenging topic which still needs to be explored further. Nevertheless, there is no doubt that thanks to advancements like GGUF, the recently announced ExecuTorch and various hardware-level optimizations, running a personalized language model on a mobile phone does not seem impossible anymore, with the field of Edge AI continuing to flourish. Deploying language models on edge devices allows one to mitigate the issue with the high cost of hardware and cloud infrastructure required for inference.

Efficient fine-tuning – enabling LLM training on limited hardware

While foundation LLMs of various sizes are already capable of solving many business use cases out-of-the-box (with some time spent on proper prompt engineering, but without the need for any further training), in certain situations there might still be a need to fine-tune them on domain-specific data to reach the satisfactory level of output quality. Compared to the already expensive LLM inference, training a model (either from scratch or fine-tuning an existing one) requires even more powerful hardware, as it is not only the model that needs to fit into memory, but additional space for storing gradients and optimizer states as well. In the case of multi-billion-parameter LLMs, the standard approach to fine-tuning the models by updating all (or some major part) of their parameters is nowadays a procedure only “GPU-rich” companies can afford. The exact calculations are quite complex, with a rule of thumb that the required memory (in gigabytes) is around 12 times the number of model parameters (in billions). An example setup suitable for fine-tuning Llama 2 with 70 billion parameters in the cloud is a cluster of two nodes with 8 A100 GPUs with 80GB VRAM each, resulting in an hourly cost of $61.60 (in the case of GCP). With dozens of hours required to fine-tune an LLM, the total cost of a single model training experiment reaches hundreds or thousands of dollars.

To mitigate this issue, so-called parameter-efficient fine-tuning (PEFT) techniques have been developed, making it possible to effectively tailor pre-trained language models for different downstream tasks, without the requirement to fine-tune every single parameter of the model. Instead, PEFT focuses on fine-tuning only a limited set of additional model parameters, which significantly cuts down on the computational and storage costs linked with fine-tuning LLMs. With parameter-efficient fine-tuning, it is possible to train Llama 2 with 70 billion parameters on a single 80GB GPU, which is 16 times cheaper than with the standard approach.

Quantized inference and parameter-efficient fine-tuning are only selected examples of techniques developed to optimize required memory, decrease response latency, or reduce the time and cost of training LLMs. If you want to know more about the topic, feel free to check out our previous blog post.

Figure 8. Overview of PEFT techniques, source: https://arxiv.org/pdf/2303.15647.pdf

Various PEFT techniques have been developed and can be divided into categories such as additive, selective, and reparametrization-based methods. Additive methods can be further split into two groups: adapter-like and soft prompt-based methods. In the following sections, we will briefly discuss representative examples of algorithms from these groups. Many of them are implemented as part of the PEFT library, part of the Hugging Face ecosystem, which makes them pretty straightforward to use in practice.

Low-Rank Adaptation (LoRA)

Reparameterization-based methods aim to discover low-dimensional representations of weight matrices. A prominent example of such a method available in PEFT is Low-Rank Adaptation (LoRA, Hu et al., 2021), which is currently the go-to approach for efficient fine-tuning, with new ideas still being built on top of it.

Figure 9. Illustration of LoRA, source: https://huggingface.co/docs/peft/main/en/conceptual_guides/lora

Rather than modifying the parts of the original pre-trained model during fine-tuning, LoRA introduces new weight matrices (marked in orange in the picture above), which are paired with existing parameters (blue); only these newly added weights are then updated during the training process. As LoRA is based on matrix rank decomposition, the new matrices have significantly fewer parameters in total when compared to the original model. The exact numbers vary depending on the rank hyperparameter setting, but the models resulting from LoRA fine-tuning can often be of a quality almost on par with those obtained via full fine-tuning (on much more powerful hardware, hence often even impossible to run), while modifying, e.g., less than 1% of the original parameter count. Moreover, LoRA is an algorithm designed with production environments in mind, as multiple sets of LoRA weights (so-called adapters) can be trained for different use cases, and then easily switched between, while sharing the same underlying pre-trained weights. Additionally, since the original pre-trained weights remain unchanged, the risk of the model forgetting what it had learned before is reduced.

QLoRA

One important extension of the LoRA method available in PEFT is called QLoRA (Dettmers et al., 2023), which is a combination of LoRA and model quantization. It extends LoRA to enhance efficiency by representing the pre-trained weights with low-precision data types, enabling even more extreme reduction of memory required for fine-tuning. In QLoRA, the frozen parameters are stored as FP4 (4-bit floating-point representation newly introduced in the QLoRA paper), but dequantized on demand as training-related calculations and gradient updates are still performed in FP16.

Figure 10. Comparison of full fine-tuning, LoRA and QLoRA, source: https://arxiv.org/pdf/2305.14314.pdf

Our experience with QLoRA

In the abovementioned recent project regarding coding LLMs (language models generating code rather than natural language), our goal was to fine-tune the foundation model to be able to write code in a previously unfamiliar programming language. We only had 48GB of vRAM available at our disposal, but as we leveraged QLoRA, we were able to experiment with fine-tuning both larger models (up to 15 billion parameters with 2048 input tokens) and increasing the context size for smaller models (up to 8192 tokens for a 1-billion-parameter model), with the latter turning out to be the game-changer in the context of coding LLMs. While it was hard to capture the improvement with standard token-level metrics like BLEU or ROUGE, which could only serve as our proxy measurement of quality, the final assessment of the generated code samples conducted by the client (expert in this particular, highly specialized programming language) revealed that the fine-tuned model was superior to the original one. Due to the hardware and budget limitations, we were not able to perform full fine-tuning of analogous models to have a fair comparison, but our experience proves that parameter-efficient fine-tuning techniques, especially when combined with quantization, enable the development of use-case and client-specific LLMs, even with constraints.

Other parameter-efficient fine-tuning methods

Another interesting method found in PEFT is Adaptive Low-Rank Adaptation (AdaLoRA, Zhang et al., 2023). This method builds upon the principles introduced in LoRA but utilizes a distinct form of low-rank matrix decomposition. AdaLoRA optimizes fine-tuning by allocating more trainable parameters to matrices and layers of the model found to be more important.

While LoRA and its variants are currently the most widely used among all the PEFT methods, other noteworthy alternatives exist, as mentioned above.

For example, the goal of additive methods is to enhance a model by introducing a new set of parameters or network layers. During fine-tuning, only these new parameters’ weights are updated. Two such methods are available in PEFT: the Adapters method (Houlsby et al., 2019), which involves introducing small fully-connected networks after the Transformer sub-layers, and the (IA)3 method (Liu et al., 2022), which relies on augmenting the Transformer block with additional parameters.

Another group of methods is based on the concept of prompting. Prompting directs how a language model behaves by changing the input text with a prompt, typically made up of a task description and relevant examples. There are two types of prompts: hard prompts and soft prompts. Hard prompts are manually crafted text prompts using discrete input tokens. Unlike hard prompts, soft prompts cannot be directly viewed and edited in text form. They consist of an embedding, essentially a sequence of numbers, that draws knowledge from the larger model. Soft prompts can be tuned for the input layer only (P-Tuning, Liu et al., 2021 and Prompt Tuning, Lester et al., 2021) or for all layers (Prefix Tuning, Li and Liang, 2021).

Summary

In this article, we discussed the recent trend in the field of generative AI – the democratization of LLMs, which is about making these text-processing models accessible to everyone by reducing the hardware (and the associated cost) requirements. Recent advances such as new LLM quantization methods and parameter-efficient fine-tuning techniques not only make it possible to run or fine-tune one’s own LLM on their personal computer or edge device for fun or small-scale projects. Much more importantly, this also means that LLMs are becoming cheaper for businesses who want to incorporate AI into their existing products or processes, or unlock completely new ideas and functionalities previously impossible to implement. Business applications of cost-efficient LLMs and LLM case studies in cost reduction are becoming more widespread as a result.

At deepsense.ai, we have hands-on experience with applying the methods described above in commercial projects such as the development of a coding assistant for a highly specialized programming language, or an LLM-powered mobile application for field workers in the retail sector. Our experiments confirm the effectiveness and importance of these methods in turning the recent advancements in generative AI into business value. LLM implementation examples and making LLMs accessible for businesses are increasingly common, showcasing how affordable LLM solutions can democratize AI.

If the article has got you interested in potential use cases of LLMs at your company, we invite you to explore our LLM Discovery Workshops, where you can learn more from our experts and find out how your business can benefit from cutting-edge technology, effectively democratizing AI with LLMs.

Data Generation Methods: ControlNet, GLIGEN & Stable Diffusion Inpainting - deepsense.ai

Data generation with diffusion models. Part 3: Generating custom data in the blink of an eye

December 5, 2023/in Generative AI /by Natalia Czerep, Marianna Parzych and Piotr Banasiński

It’s time to wrap up our work on data generation using diffusion models. Previously we laid the foundation for this by introducing the concept and providing a quick overview of promising methods. Then, in the second part, we focused on obtaining images along with semantic segmentation maps. In this blog post, we would like to touch on the topic of methods which allow supplementary inputs.

Generating images based on additional conditions

Using diffusion models can lead to creative, imaginary outputs. However, to use this data in real-world business projects, additional information is also required, whether it is the composition of the objects, the size, or even the overall appearance. For this reason, it could be helpful to incorporate another input besides the prompt and feed it to the network, so that it provides guidance on what we expect. At deepsense.ai we experimented with the most popular methods for guided image generation, and we are excited to share our thoughts.

Stable Diffusion Inpainting

Let’s start with a method that originates from the Stable Diffusion paper [1]. The Inpainting pipeline allows you to edit specific parts of an image by providing a mask and a text prompt. This allows the user to erase and replace parts of the picture. The inpainting functionality of Stable Diffusion relies on a modified UNet architecture. This specialized network incorporates five additional input channels: four dedicated to the encoded masked image and one specifically for the mask itself.

Fig. 1 Images generated using a Stable Diffusion Inpainting model with the prompt “a blue robot on a bench”

ControlNet

Sound familiar? Yes, we have mentioned the ControlNet architecture [2] before in our first blog post of this ‘Data generation with diffusion models’ series – you can check it out here [3]. But we are coming back to this topic with more experience and carefully considered conclusions.

ControlNet modifies diffusion models by adding a component ready to be trained with additional inputs. One copy of the encoder is frozen. It carries the wisdom of the original network, gained from studying billions of images. This ensures the preservation of the network’s incredible capabilities. There is no need to train this part of the network. The trainable copy of the encoder learns conditional control so that it is possible to direct outputs with segmentation masks, key points, edges, etc. The outputs from trainable and frozen neural network blocks are connected with a custom layer called “zero convolution” and passed to the remaining parts of the stable diffusion model.

Figure 2 - ControlNet architecture overview

Fig. 2 ControlNet architecture overview. Image from [2].

GLIGEN

A similar concept of freezing the original weights of pre-trained diffusion models can be seen in the implementation of GLIGEN [4] (Grounded-Language-to-Image Generation). Apart from creating a copy of the entire encoder as in ControlNet, it introduces new layers called Gated Self-Attention (GSA), which are responsible for processing grounding input, within the encoder. A key distinction between GLIGEN and ControlNet is the way they deal with conditions and visual features. GLIGEN operates by processing a concatenation of inputs using the Transformer layer. ControlNet adopts the approach of concatenating the condition and visual features. This design choice positions GLIGEN as a more versatile choice. Both architectures exhibit impressive performance based on conditions like edge maps and segmentation maps. However, GLIGEN goes further by demonstrating great results based on conditions such as bounding boxes and reference images.

Fig. 3 Gated Self-Attention layer from the GLIGEN architecture. Image from [4].

Training the models

Pre-trained models are undoubtedly impressive, but we often find that they don’t meet the exact needs of business projects. Their inputs are often hyperrealistic, imaginary, and do not fit custom, specialized datasets. One can combat this with prompt engineering, but such a solution is poorly scalable. In this case, a re-training process is necessary.

Figure 4a - Images generated with a pre-trained stable diffusion inpainting model — *Fig. 4 Images generated with a pre-trained stable diffusion inpainting model. Base images are part of the Cityscapes dataset [5]. Inpainted cars stand out from the background style.*

Figure 4b - Images generated with a pre-trained stable diffusion inpainting model — *Fig. 4 Images generated with a pre-trained stable diffusion inpainting model. Base images are part of the Cityscapes dataset [5]. Inpainted cars stand out from the background style.*

However, training large diffusion models is not an easy task. First of all, a large dataset is needed. Stable-Diffusion-Inpainting uses LAION which contains 5 billion images. ControlNet, on the other hand, was trained on the ADE20K dataset with over 25,000 images for segmentation and a custom dataset for edges as control images: 3 million edge-image-caption pairs from the internet. In business cases, often only a few thousand images are at their disposal. When it comes to hardware and computational power, GLIGEN was trained with 16 V100 GPUs for 100,000 iterations, and ControlNet for segmentation – with 200 GPU-hours on Nvidia A100 80G.

From a business perspective, it can be impractical to experimentally train such large models due to deadlines and other limitations. Therefore, we have adopted suitable training strategies to overcome these issues, which can lead to satisfactory results for the chosen project.

LoRA to rule them all

In typical text-to-image techniques, where images are generated solely based on text prompts, several methods have been developed to effectively train new concepts. These fine-tuning methods enable swift and efficient learning of new styles or objects, even when the available data is extremely limited. One of these methods is the Low-Rank Adaptation method, commonly referred to as LoRA. At deepsense.ai, we have extensively explored and evaluated this method, uncovering its remarkable flexibility and versatility. You will soon witness its capabilities firsthand. But first, let’s delve into what exactly LoRA is.

Custom models with LoRA

LoRA was initially introduced in the LoRA: Low-Rank Adaptation of Large Language Models paper [5] as a fine-tuning method for Large Language Models like GPT-3. In the paper, it is demonstrated that by fine-tuning only a small portion of the parameters, the fine-tuned model outperformed the pre-trained ones. This technique has since been applied to Stable Diffusion, where it has exhibited remarkable effectiveness in fine-tuning diffusion models as well.

Figure 5 - Comparison of a cross-attention schematic view without and with additional LoRA layers

Fig. 5 Comparison of a cross-attention schematic view without and with additional LoRA layers

LoRA achieves this by incorporating additional linear layers into the cross-attention layer. Fig. 5 provides a schematic representation of how it works. In the Stable Diffusion model, the cross-attention layers play a vital role in image generation as they facilitate the interaction between the processed textual prompt and the generated image. During fine-tuning, only the additional layers, which use significantly fewer weights compared to the entire model, are trained. As a result, the training process is quick, and powerful computing resources are not required.

However, the most remarkable aspect of LoRA is its adaptability to pre-trained weights. Despite the existence of various methods built upon Stable Diffusion – ControlNet, GLIGEN, etc. – and each functioning differently, the cross-attention layers connecting a textual prompt with an image maintain a consistent structure and remain a crucial element. As a result, the weights trained using LoRA on one model can be seamlessly transferred to another model, even if these models represent different methods. This interoperability showcases the versatility of LoRA and its ability to facilitate knowledge transfer across diverse model architectures.

Fig. 6 Our proposed process for customizing models using LoRA

Now let’s look at how it works in practice.

Results for Cityscapes

We chose the “Cityscapes” dataset [5] to present the results of the Stable Diffusion model with the LoRA layers.

“Cityscapes” consist of images and corresponding segmentation masks captured in 50 different cities in Germany. The images can be easily recognized by a specific color scheme, a slight degree of blurriness, and the presence of the Mercedes hood. We assumed that specific characteristics of the Cityscapes dataset would be a good example by means of which to present the capabilities of the LoRA method. With just under 3,000 training images, the dataset may appear small for training deep-learning models. Nevertheless, it remains widely utilized for evaluating semantic segmentation techniques.

Figure 7a - Original images from Cityscapes dataset — *Fig. 7 Original images from Cityscapes dataset [5].*

Figure 7b - Original images from Cityscapes dataset — *Fig. 7 Original images from Cityscapes dataset [5].*

Stable Diffusion + LoRA

The Stable Diffusion model, trained on the Cityscapes dataset using the LoRA method, demonstrates great fidelity in generating images that mimic the training dataset. Most notably, the training process takes only several dozen minutes on a single GPU card!

Figure 8a - Images generated by Stable Diffusion fine-tuned with LoRA on the Cityscapes dataset — *Fig. 8 Images generated by Stable Diffusion fine-tuned with LoRA on the Cityscapes dataset*

Figure 8b - Images generated by Stable Diffusion fine-tuned with LoRA on the Cityscapes dataset — *Fig. 8 Images generated by Stable Diffusion fine-tuned with LoRA on the Cityscapes dataset*

Stable Diffusion Inpainting + LoRA

Now we are ready to customize Stable Diffusion models with additional conditions. Let’s begin with inpainting with Stable Diffusion. The model is great for removing objects from scenes. It can also handle the insertion of completely new objects. However, one limitation is that the inpainted objects may not always align seamlessly with the overall style of the image. This disparity is particularly noticeable in the distinctive images from Cityscapes. Fortunately, incorporating LoRA weights trained on Cityscapes into the generation process significantly improves the integration of inpainted objects, resulting in a much smoother blend.

Figure 9a - Results for the Stable Diffusion Inpainting method — *Fig. 9 Results for the Stable Diffusion Inpainting method. From left to right: background image; binary mask which determines object position; results without LoRA weights; results with LoRA weights.*

Figure 9b - Results for the Stable Diffusion Inpainting method — *Fig. 9 Results for the Stable Diffusion Inpainting method. From left to right: background image; binary mask which determines object position; results without LoRA weights; results with LoRA weights.*

ControlNet + LoRA

The inpainting technique relies on basic binary masks, which can limit its functionality to some extent. However, more advanced techniques such as ControlNet and GLIGEN offer enhanced flexibility, enabling the usage of various additional conditionals. Let’s explore how ControlNet excels in reconstructing Cityscapes data based on segmentation masks. The model processes the segmentation mask and creates a result based on the information therein.

Figure 10a - Results for the Stable Diffusion Inpainting method — *Fig. 10 Results for the Stable Diffusion Inpainting method. From left to right: background image; binary mask which determines object position; results without LoRA weights; results with LoRA weights.*

Figure 10b - Results for the Stable Diffusion Inpainting method — *Fig. 10 Results for the Stable Diffusion Inpainting method. From left to right: background image; binary mask which determines object position; results without LoRA weights; results with LoRA weights.*

GLIGEN + LoRA

A similar effect can be obtained for the GLIGEN method. Here, let’s focus on conditioning with bounding boxes. Each bounding box has a text prompt associated with it that determines what should be inside. Although GLIGEN works in a completely different way than ControlNet or Stable-Diffusion-Inpainting methods, LoRA also matches it and works really well.

Figure 11a - Results for the GLIGEN method without and with LoRA weights — Fig. 11 Results for the GLIGEN method without (top row) and with (bottom row) LoRA weights. The green and red bounding boxes indicate people and cars.

Figure 11b - Results for the GLIGEN method without and with LoRA weights — Fig. 11 Results for the GLIGEN method without (top row) and with (bottom row) LoRA weights. The green and red bounding boxes indicate people and cars.

Summary

In this article, we have introduced various approaches to generating data using additional conditioning. Techniques like Stable Diffusion Inpainting, ControlNet, and GLIGEN offer impressive capabilities that can greatly assist in data generation for business projects. However, retraining these methods typically requires time and a relatively large amount of data.

In this blog post, we have presented a method that addresses this challenge by taking advantage of fine-tuning with the Low-Rank Optimization technique. This approach allows for the efficient and seamless adaptation of these methods to specific use cases. By adopting this method, it becomes feasible to enrich business project data with data generated by diffusion models, opening up new possibilities for enhancing data diversity and quality.

References

“High-Resolution Image Synthesis with Latent Diffusion Models”, Rombach, R., Blattmann, A., Lorenz, D., et al. 2021
”Adding Conditional Control to Text-to-Image Diffusion Models” Lvmin Zhang et al., 2023
“Data generation with diffusion models – part 1”
“GLIGEN: Open-Set Grounded Text-to-Image Generation” Yuheng Li et al., 2023
“LoRA: Low-Rank Adaptation of Large Language Models”, Hu, E. J., Shen, Y., Wallis, P., et al., 2021
Cityscapes dataset

Evaluation Derangement Syndrome (EDS) in the GPU-poor’s GenAI. Part 1: the case for Evaluation-Driven Development

November 14, 2023/in Generative AI /by Jarosław Kochanowicz

GenAI, understood as a class of models capable of generating human-like, high dimensional outputs like text, image or sound, is experiencing great success and explosive growth [1, 2, 3]. However, this has also quietly given rise to a critical problem that permeates applied GenAI in its entirety – what we call Evaluation Derangement Syndrome (EDS). In short, EDS is the problem of the widespread lack of a rational approach to and methodology for the objective, automated and quantitative evaluation of performance in terms of generative model finetuning and prompt engineering for specific downstream GenAI tasks related to practical business applications.

In this post, we analyze EDS from a practical, applied, business perspective, drawing from our rich experience in GenAI development for business, both in image generation (with diffusion models and GANs) and LLMs in code generation, retrieval systems, voice assistants, etc. We’ll explore the underlying causes (both technical and business), and examine the consequences it may have for the GenAI community and beyond.We analyze its intricate relationship with GPU inequality [4] and address how the ‘GPU-rich’ (a handful of firms with thousands of the strongest GPUs, as well as resources like data, engineers, and labelers) approach the problem of GenAI evaluation, in contrast to the harsh realities of the ‘GPU-poor’ (everyone else, really). We also discuss the fundamental insufficiency of the ‘pseudo-evaluation’ approaches used by the GPU-poor and sketch a potentially more rational path forward for them: Evaluation-Driven Development (EDD). The subsequent post will take a deep dive into the nitty-gritty practicalities of this approach, drawing from our extensive experiences with Diffusion Models and LLMs in the ‘GPU-poor’ landscape. Enjoy!

GenAI evaluation in the realm of the GPU-poor

For fundamental technical reasons, GenAI does not naturally lend itself to any obvious and reliable analogues of quality monitoring tools (like F1 score, accuracy, precision, etc.) that all data scientists live and breathe when practicing traditional ML. Then there are business pressures to deliver at an extreme tempo, typical of the heated ‘hype economy’ driven by the fear of taking over the target niche in the AI revolution, contributing to the ‘produce fast, test later (i.e., never)’ approach.

Additionally, almost all GenAI contain several evaluation dimensions lying on a spectrum from soft (subjective) to hard (objective). To give specific examples from our own GenAI projects and practice, the evaluation of:

a chat/assistant includes subdimensions like helpfulness, friendliness, or even political correctness (subjective), vs factual correctness that can be measured in a ‘hard’ quantitative manner for some of the questions/answers (objective);
a retrieval system may include retrieval coverage correctness (‘hard’ in many cases) and conversation/summary style (soft), similar to the assistant;
code generation can be broken down into (hard) code correctness/test passing and (soft) code clarity;
diffusion-based face generation may consist of (soft) image attractiveness, similarity to the desired target, but also (hard) domain adherence, i.e., % of generated images actually containing a face.

This mix of soft and hard evaluation criteria poses technical difficulties (analyzed in this section), and the resulting simple truth is this: almost all GPU-poor researchers working on GenAI applications today work without any rational, objective, quantitative, repetitive framework to evaluate their work or to inform choices – their own or that of the business decision-maker – that depend on them. When dealing with our day-to-day dilemmas we all depend on subjective, arbitrary gut feelings built in short, selective inspections of the systems we train. These are (at best) accompanied by weak numeric pseudo-evaluation (‘broken evaluation methods that put more emphasis on style rather than accuracy or usefulness’ [5]), that do almost nothing to evaluate our specific business capacities (I’m looking at you, leaderboard rankings [6], BertScores [7], BLEUs [8], etc.). The former are used more as a rationalization than an actual trustworthy indicator of performance and a primary driver of our research.

This is truly astonishing when considering the rigorous objective evaluation practices ingrained in traditional ML (Image 1). More shockingly, everybody in the field seems to accept this and move along: if we move fast (as in the production of more and more GenAI), we are fine with not checking the direction (as in objective QA, and in comparison with competition or alternative approaches).

Image 1. Daily cycles as practiced by the GPU-poor. The differences in the evaluation standards are stark and consequential.

This is not an academic or theoretical issue, but a weighty real-life problem with business, technical, and human consequences. Again, deepsense.ai’s wide experience allows us to share a specific story that many GenAI researchers can relate to. One of our valued customers asked us to develop a code-generating solution for a somewhat niche language (think GitHub Copilot’s [9] competition for this language). Even though the team we established consisted of elite LLM experts, the task proved very challenging. Due to the limited time allocated for creating the evaluation pipeline, it relied solely on the BLEU-based evaluations. The main effort went directly into generative model creation itself. This, combined with the release of the new, possibly superior base models over the duration of the project, led to serious internal evaluation issues.

To cut a long story short – our team was highly competent and worked very hard, but had no reliable way of determining if any improvement had taken place. Considering our BLEU-based evaluations, we thought this may not be the case! Luckily, the client’s own internal evaluation at the end of the project was very positive. Apparently, according to the client, our model was visibly superior to both the foundational models and GitHub Copilot. Good enough for us, and another job well done! But, to be fair, we have no idea how objective and extensive this ‘client-based evaluation’ was, or whether this will work next time. Being professionals, we prefer to make our own luck instead of being part of EDS. Today’s GenAI development is full of similar stories, rarely with a happy ending like ours.

In summary, EDS is a serious, practical issue affecting all areas of GenAI, most notably LLMs [10, 11], and image generation (GANs [12], Diffusion Models [13]). Its reach will only grow in the future, together with GenAI use cases. Generative AI can create jokes [14], stories [15], poems [16] and beautiful images [17] for a continuously growing number of applications, but our ability to evaluate GenAI is constantly falling even further behind. Given the scale of the pandemic and the technological, economic, and political impacts of GenAI, EDS truly is a critical issue. Let’s now investigate the underlying causes of this situation.

EDS – business causes

EDS emerges in the GPU-poor domain due to a complex interplay of factors—some technical and others soft, encompassing the economic, psychological, business-strategic, and organizational dimensions. Before delving into the technical reasons behind this situation, let’s take a look at the broader realities of the GenAI business-economic ecosystem contributing to EDS amongst the GPU-poor.

The first EDS-inducing condition is that, since at least 2022, the GenAI business ecosystem has operated permanently in ‘hyped-economy’ mode [18, 19]. In the GPU-poor realm, this means the red-hot fervor of startups [20] competing in a race to quickly find their niche in the GenAI revolution. Or at least respond to the marketing pressure to be a part of GenAI… A pervasive sense of ‘it’s now or never’ exerts enormous pressure on businesses. CEOs of even modest startups aspire to innovate within GenAI, sometimes despite limited technical understanding and unreliable intuitions concerning the likely future GenAI developments and their risks to early adopters [21]. The unrelenting drive to produce GenAI en masse is fueled by sky-high VC investments [22, 23] partly streaming from a ‘pay-to-participate’ strategy to keep a horse in the big GenAI race. A strategy that is at times questionable, considering that the chance to join the GPU-rich is likely far-fetched at this point [24].

The second EDS-relevant characteristic of the GenAI ecosystem is the astonishing tempo of the general-purpose model/innovation of releases by the GPU-rich (proprietary or open-source). This creates significant potential to undermine or eliminate early GenAI adopters – their business strategies are potentially threatened each time we are exposed to a release that redefines the GenAI landscape and boundaries of what is possible. This disruptive tendency manifests every few months and shows no sign of slowing down, with the recent releases of Llama 2 [25] and Mistral [26] (the great hopes of open source NLP [27, 28]) and two proprietary game-changers seemingly just around the corner: Gemini [29] and GPT-5 [30]. As a result, the GPU-poor research and do business on ‘moving sands’. As a side-effect of any release by the GPU-rich, their research projects may likely be outdated before completion. Indeed, rapid and unpredictable innovations of the GPU-rich will repetitively wipe out niches targeted by countless GPU-poor early adapters, before any real chance of a return on investment.

It is not merely that “it’s totally hopeless to compete with us on training foundation models you shouldn’t try, and it’s your job to, like, try anyway” [31], as Sam Altman accurately summed up the situation of the GPU-poor vs GPU-rich. It’s also safe to assume that once the battle for supremacy for the ‘foundation models’ is more advanced, the same fate will befall utilities and applications of these models, where many of the GPU-poor hope to find a place for themselves. Think retrieval systems, agents, specialized finetunes, and other services built on and around the foundational models. Whenever serious consequences (in USD) are involved, the GPU-rich will (in time) provide these ‘auxiliary’ services out of the box.

Hence, the GPU-poor’s fear of this ‘side effect’ of business eradication is contributing to EDS by creating additional pressure to ship fast and skip evaluation, which is not perceived as a critical business goal.

Astonishing pace of business changes leads to Evaluation Derangement in GPU-poor

Image 2. The astonishing tempo of changes in the business landscape contributes to Evaluation Derangement Syndrome in the GPU-poor. [source: The Missing Link in Generative AI | Fiddler AI Blog ].

The third ‘soft’ cause is that fighting EDS among the GPU-poor may not be a primary interest of any key players involved. The main contributors to the GenAI revolution (the GPU-rich) have a proper way of evaluating GenAI (see RLHF section), but no incentives to develop or share alternatives for those who cannot afford it. In fact, from the perspective of GenAI’s future, they may see the current practices of the GPU-poor as inconsequential and irrelevant, and can leave them to their dubious practices. Sure, GPU-poor startups built around GPU-rich models may turn out to be successful from an economic standpoint (and deepsense.ai is here to make sure of it!), but they are not likely to shape the GenAI landscape in any major way over the course of a decade – this role belongs to the GPU-rich. Other parties are also fine with EDS; ML training providers are happy to train any model, regardless of its performance, as some developers and consultants are to create them regardless of the actual capacity to validate quality. And so the EDS machine may continue. We at deepsense.ai are not happy with this approach and strive to provide solutions that actually ARE better.

Technical causes: Why does EDS plague GenAI?

Technically speaking, why exactly would EDS haunt GenAI applications? We figured out the Traditional ML evaluation quite well. Why would this know-how not translate easily to GenAI? In this section, we’ll delve into the ‘hard’/technical factors behind EDS in Generative AI.

Let’s start by refreshing the (very straightforward) conventional approach to ML development and evaluation. Let’s break it down into three steps:

gather human-generated ground truths (GTs),
gather model-generated outputs,
compare them directly to produce meaningful metrics like accuracy, F1 score, and mean squared error.

There are several reasons why this approach breaks down for GenAI, regardless if it is related to the generation of text, images, or other content.

1. Inadequacy of the ‘ground truth’ concept in GenAI

Firstly, almost by definition, generative tasks often include a practically infinite variety of ‘optimal’ solutions, making any metrics which rely on direct comparisons with GT questionable. Within Natural Language Processing (NLP), ‘pseudo-evaluation’ approaches that we call ‘Superficial Utility Comparison Kriterion’ (SUCK) methods, like BLEU [32], METEOR [33], ROUGE [34], or BLEURT [35], attempt to salvage the situation. SUCK methods usually compare model outputs to GTs in specific embedding spaces, de facto under the assumption that output quality correlates with similarity to some GT within that space.

While this assumption might hold to some extent in quasi-generative tasks like summarization [link], it is fundamentally flawed. SUCK’s weaknesses are well recognized in the literature [link]. In our opinion, a fundamentally different approach is needed – we simply need to embrace the fact that ‘ground-truth’ is simply not a viable concept for most truly generative tasks. When the task is love poem generation, what is THE right answer that all others should imitate?

2. The innate subjectivity in GenAI evaluation

Secondly, assessing human-level creativity requires subjective and elusive criteria. Such evaluation is best expressed by poorly-defined concepts like aesthetics, novelty, style, creativity, helpfulness. Formal quality formulas have limited promise in this context. Furthermore, intersubjectivity is an inherent feature of evaluation, not a flaw to eliminate [36]. In many cases, generative models should optimize multiple objective quality indicators (e.g., the correctness of the generated code or retrieval ratio correctness etc.) and the subjective ones (e.g., code cleanliness, or proper conversational tone). All of them may non-trivially determine the final output quality, and more often than not conflict in practice [37, 38], and how to use them all in the practical evaluation is not necessarily obvious. Data collection, training, and evaluation procedures need to explicitly account for, quantify, and control subjectivity in the subdimensions of GenAI evaluation and their relation to the final quality of the output.

3. Extreme use-case specificity of the GenAI evaluation criteria

Thirdly, the interpretation of values relevant to the problem is highly application-specific and changes dramatically between use cases, and even users. Despite using identical terms, domains or use-cases dramatically redefine their interpretation. Context and goals determine standards of output beauty/aesthetics, proper tone, relevance, social acceptability or most other subjective metrics one can imagine. This makes capturing them with large, cross-domain, ‘once and for all’ datasets a challenge.

4. Diversity / mode collapse monitoring problems

The fourth reason is the susceptibility to mode collapse, where models produce outputs with limited diversity. While this can impact the utility of the model, it is also difficult to quantify and measure as part of model evaluation. Attempts to approximate the human sense of diversity, like FiD or the CLIP score, have serious limitations [39, 40]. On one hand, the definition of sample proximity may be non-trivial and domain-specific. On the other hand, diversity may be hard for humans to evaluate, as it is a feature of a large dataset – evaluation of multiple objects is a natural human weakness.

5. Potential evaluation dataset leaks in behind-closed-doors training

The fifth and final challenge in GenAI evaluation is the potential data leakage of any existing evaluation datasets one intends to utilize. Almost any model trained by the GPU-poor is a refinement of a general model (like Llama or Stable Diffusion) generously thrown open by the GPU-rich. These models have been trained on just about any data available, including all open datasets you wish to use to evaluate them or their fine-tuned iterations. The closed-door nature of many of these training procedures and their often undisclosed training datasets complicate the issue further. When we consider this in conjunction with the third challenge, which relates to the high degree of application specificity in given metrics, it becomes evident that depending on vast, openly accessible, one-size-fits-all datasets may no longer be a viable approach. It seems that the GPU-poor may have to rely on methodology allowing for easy evaluation with internally created, very small datasets capturing the case-specificity of their application.

Okay, so there are technical reasons why GenAI evaluation is hard, and why traditional ML approaches fail. But why would those technical obstacles hit the GPU-poor more? Why and how are the GPU-rich capable of dealing with them? Why can’t the GPU-poor copy what they do on a small scale? These are all great questions that we will address next.

How do the rich kids do it? The shiny new RLHF you (probably) can’t afford

Time for full disclosure: the problem above is only partially new. A lot of thought went into the proper approach to somewhat similar challenges long before the GenAI revolution. The field of Reinforcement Learning (RL) has much to say about the ‘healthy’ approach to a situation where the evaluation metric (‘reward’ in the RL lingo) is hard to formalize, complex, etc.

Successful RL conceptual frameworks like actor-critic [41] or world-model [42] have been used in physical or virtual environments to attack problems with ill-defined notions of ‘correct action’. The moment one sees GenAI as an agent, its outputs as actions, and human evaluation as a reward function, it all falls very nicely into the RL framework – hence Reinforcement Learning from Human Feedback (RLHF). One RL-borrowed component that allowed the GPU-rich to ‘put the harness’ on the GenAI they produced was the notion of a critic, preference, or reward model. Understanding this concept is vital to our conversation. The preference model is trained on human evaluations, to approximate human, well… preferences (Image 3, left). Once successfully trained, it can provide automated, pseudo-human feedback to train the generator (Image 3, right) concerning the human-level metrics (friendliness, political correctness, etc.). It is relevant to note that the reward model is a traditional (i.e., non-generative) model – it is not GenAI, so we know how to evaluate it directly. OpenAI used this approach to make ChatGPT by ensuring that GPT3/4 will not set off on racist rants, etc., by finetuning it to score highly on a reward model trained for OpenAI-preferred ideological biases. However, the same principle works for any GenAI (for example, reward models trained on human aesthetics [43] enable the creation of image generators like Midjourney [44]).

OpenAI's diagram: Reward model trained on feedback (left); used for generative model training (right)

Image 3. OpenAI’s famous diagram. The reward model is trained on feedback to capture the human-level labeler’s preferences (left). The reward model is then used to automatically provide the training signal (providing human-like preferences) in millions of training steps of the actual generative model (right). [source: https://openai.com/blog/chatgpt]

It should now be clear that RLHF makes the GPU-rich immune to EDS. Preference models can provide an automated, repetitive, quantitative evaluation for human-level metrics like beauty or friendliness. There is a ‘but’, however. Namely, the way GPU-rich use RLHF is categorically beyond the reach of GPU-poor. We cannot ‘just copy what they do on a smaller scale’ – we are too poor to do even that. The amount of GPU power, labeling, and the (often overlooked) scale of technical engineering difficulty needed to make full RLHF work is unknown and (probably rightfully) normally assumed prohibitive for today’s GPU-poor.

At this point, many highly technical GPU-poor experts, who intuitively feel much of what we have described, just shrug and say, ‘that’s why we cannot afford proper GenAI evaluation’. One reason for this may be a misleading intuition about what is hard in (HF)RL. The thing is, the difficulty considered here is not just training any evaluation model, but rather a reward model strong enough to be used in full RLHF, i.e., the provider of a training signal directly from the generative model (right column). In this setup, the reward model becomes vulnerable, as the generative model may ‘beat’ it, exposing any weakness (hidden misalignments with human preferences) it may have. Dealing with this has much in common with adversarial training stability challenges in GANs [45] or general issues with ML adversarial safety [46]. It is hard. In essence, the reward model and the entire training methodology must be very strong to stop the generator from ‘hacking’ the preferences, or it will find a way to get a high reward in undesired ways (by generating weird or noise-like images/sentences that should not get high rewards, but do). This takes a vast amount of data, GPUs, and expert engineering.

Here comes the trivial observation behind the Evaluation Driven Development (EDD) that the GPU-poor can adapt. Evaluation models may be easier by orders of magnitude to obtain than reward models. Our experiments indicate that evaluation models may be very cheap and easy to create for the GPU-poor’s own use cases, when taken out of the RLHF context. Unlike the GPU-rich shaping their base models, you do not need your reward model to provide a training signal for the model you fine-tune (or prompt engineer). If you do it right, and the generator cannot fool the evaluator through the training signal, as few as 100-200 samples may be enough to create human-like evaluator models. And that is all you need to get out of the EDS!

Okay. So we have the general idea behind the EDD. But that’s not really a framework yet, is it? And what does ‘if you do it right’ mean exactly? That’s an entirely different story – one we will tell in the second blog post of this series.

References

A survey of Generative AI Applications, Gozalo-Brizuela R., Garrido-Merchán E.C., 2023
From ChatGPT to ThreatGPT: Impact of generative AI in cybersecurity and privacy, Gupta M., Akiri C., Aryal K. et al., 2023
Art and the science of generative AI: A deeper dive, Epstein Z., Hertzmann A., Herman L. et al., 2023
https://www.businessinsider.com/gpu-rich-vs-gpu-poor-tech-companies-in-each-group-2023-8?IR=T
https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
BERTScore: Evaluating text generation with BERT, Zhang T., Kishore V., Wu F. et al., 2019.
BLEU: A method for automatic evaluation of machine translation, Papineni K., Roukos S., Ward T. et al., 2002.
https://github.com/features/copilot
USR: An unsupervised and reference free evaluation metric for dialog generation, Mehri S., Eskanazi M., 2020
Better automatic evaluation of open-domain dialogue systems with contextualized embeddings, Ghazarian S., Wei J. T.-Z., Galstyan A. et al., 2019
An empirical study on evaluation metrics of generative adversarial networks, Xu Q., Huang G., Yuan Y. et al., 2018
Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models, Stein G., Cresswell J.C., Hosseinzadeh R. et al., 2023
Witscript 3: A hybrid AI system for improvising jokes in a conversation, Toplyn J., 2023
MaxProb: Controllable story generation from storyline, Vychegzhanin S., Kotelnikova A., Sergeev A., Kotelnikov E., 2023
Modern French poetry generation with RoBERTa and GPT-2, Hämäläinen M., Alnajjar K., Poibeau T., 2022
SDXL: Improving latent diffusion models for high-resolution image synthesis, Podell D., English Z., Lacey K., 2023.
https://www.gartner.com/en/articles/what-s-new-in-artificial-intelligence-from-the-2022-gartner-hype-cycle
https://pitchbook.com/news/articles/generative-ai-hype-stability-jasper
https://www.wsj.com/articles/no-business-plan-no-problem-chatgpt-spawns-an-investor-gold-rush-in-ai-6bdbed3c
https://www.cfodive.com/news/many-ai-early-adopters-struggle-success/652138/
https://mergers.whitecase.com/highlights/generative-ai-boom-sparks-investment-spree#!
https://techcrunch.com/2023/03/28/generative-ai-venture-capital/
https://sifted.eu/articles/keelvar-ai-go-to-market
https://ai.meta.com/llama/
https://mistral.ai/news/announcing-mistral-7b/
https://zapier.com/blog/llama-meta/
https://www.digitaltrends.com/computing/why-meta-llama-2-is-such-a-big-deal/
https://www.techopedia.com/definition/google-gemini
https://medium.com/@trendingAI/gpt-5-revealed-release-date-what-we-know-so-far-ff7ec532d40e
https://www.zenger.news/2023/06/10/sam-altman-says-its-hopeless-to-compete-with-openai/
BLEU: A method for automatic evaluation of machine translation, Papineni K., Roukos S., Ward T. et al., 2002
METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, Banerjee S., Lavie A., 2005
ROUGE: A package for automatic evaluation of summaries, Lin C.-Y., 2004
BLEURT: Learning robust metrics for text generation, Sellam T., Das D., Parikh A.P., 2020
Understanding interobserver agreement: The kappa statistic, Viera A.J., Garrett J.M., 2005
Improved precision and recall metric for assessing generative models, Kynkäänniemi T., Karras T., Laine S., 2019
Evaluating creative language generation: The case of rap lyric ghostwriting, Potash P., Romanov A., Rumshisky A., 2018
Toward verifiable and reproducible human evaluation for text-to-image generation, Otani M., Togashi R., Sawai Y. et al., 2023
Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models, Stein G., Cresswell J.C., Hosseinzadeh R. et al., 2023
https://medium.com/intro-to-artificial-intelligence/the-actor-critic-reinforcement-learning-algorithm-c8095a655c14
Deep learning, reinforcement learning, and world models, Matsuo Y., LeCun Y., Sahani M., 2022
https://laion.ai/blog/laion-aesthetics/
https://www.midjourney.com/showcase/recent/
https://jonathan-hui.medium.com/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b
Key concepts in AI safety: Robustness and adversarial examples, Rudner T.G.J., Toner H., 2021

Creating your own code writing agent. How to get results fast and avoid the most common pitfalls

October 30, 2023/in Generative AI /by Alan Konarski, Maks Operlejn and Patryk Kowalski

In this blog post we walk you through our journey creating an LLM-based code writing agent from scratch – fine tuned-for your needs and processes – and we share our experience of how to improve it iteratively.

Introduction

This article is the second part in our series on Coding Agents. The first part provides an in-depth look at existing solutions, exploring their unique attributes and inherent limitations. We recommend starting there for the full picture.

Riding the wave of AutoGPT’s initial popularity surge, we embarked on a mission to uncover its potential for more complex software development projects. Our focus was Data Science, a domain close to our hearts.

Upon realizing the pitfalls of AutoGPT and other Coding Agents, we decided to create an innovative tool of our own. However, it turned out that other powerful solutions had entered the arena in the meantime, nudging us to test all the agents on a common benchmark.

In this article, we invite you on a journey through the evolution of our own AI Agent solutions, from humble beginnings with a basic model to advanced context retrieval. We’ll evaluate both our agents and those currently available to the public, and we’ll also reveal the challenges and limitations we encountered during the development process. Finally, we’ll share the insights and lessons learned from our experience. So join us as we navigate the realm of Coding Agent development, detailing the peaks, the valleys, and all the intricate details in between.

Creating a Data Scientist Agent

All of the agents that we presented in the previous article were mainly evaluated on pure software engineering tasks such as building simple games or writing web servers with Rest API. But when tested on traditional Data Science problems such as image classification, object detection or sentiment analysis, their performance was far from perfect. That’s why we decided to build our own Agent to be better at solving Data Science problems.

From the beginning, we determined the two principal approaches to be explored:

Approach A: The agent will initially generate the complete plan and subsequently execute it.
Approach B: The agent will iteratively create a plan for problem-solving.

Each of these approaches has its own set of advantages and disadvantages. In the following sections, you will find a detailed exploration of these aspects.

We utilized the GPT-3.5 and GPT-4 models to engineer our coding agents. We did explore other models, like Claude 2, but their output paled in comparison to the quality of code produced by the OpenAI models.

We evaluated numerous benchmarks, all of which included only initial objective descriptions. Illustrated in Figure 1 is the description of the simplest benchmark among the seven we tested. This particular benchmark will serve as the example for demonstrating how the agents function in the next sections of this blog post.

Figure 1. Initial objective description for one of the benchmarks

Approach A: Generating the plan upfront

In this section, we’ll walk you through how our AI Agent evolved over time. We started with a basic implementation and gradually improved it until we reached a point where the agent could successfully write code and tests, refactor previously written code, and much more.

Baseline

We started with the implementation of a baseline agent to quickly assess whether this path made sense. Our baseline represents one of the simplest approaches you can create. It relies heavily on the underlying models and the initial problem description provided by the user. For example, we asked the models to produce a plan in the form of a valid JSON, only to discover that the models failed to do so, returning JSON that couldn’t be parsed. As a result, the entire program would fail (and you can imagine how challenging it is to carry out a Data Science project without a proper plan!).

We can identify three main stages in the execution of this workflow:

Problem Definition: The user defines the problem.
Planning: The planner agent decomposes the main goal into a comprehensive plan consisting of individual tasks. In this iteration, only two task types are permitted: “write code” or “finish”. (See Figure 2)
Code Generation: Finally, we iterate through the plan, where the CodeWriter agent generates and stores the code into files – we use FileManagementToolkit from LangChain for that purpose, as it allows you to read, create and modify files (learn more here).

Figure 2. Plan produced by the baseline Planner Agent, Source: own study

Figure 3. Architecture of the baseline, Source: own study

Problems encountered:

Tasks are treated as independent, with no sharing of information among them – the model does not get the context and result from previous iterations and as a result it can’t import previously written functions, for example.
We found out that LangChain’s FileManagementToolkit is unreliable. Agents tend to call the wrong tool quite often.

In the second iteration we solved both of these problems.

Extending the agent to include context knowledge

This approach fixes many of the downsides of our baseline. It generates more files, and they are more complete with fewer obvious issues.

One of the first improvements made over the baseline was the replacement of LangChain’s FileManagementToolkit. The original toolkit often failed to call the appropriate tool, but our implementation addressed this issue by removing the responsibility for determining when to utilize the correct tool from the Agents. In this particular scenario, it was clear when the code needed to be saved, allowing for the implementation of more conventional algorithms.

The second major improvement was the introduction of the context. In this context-aware approach, we started to inject whole files, previously generated by the agent, to the prompt. That allowed the model to use previously implemented functions, e.g., for loading data in the file with the model training code.

Apart from these two improvements, we have also added new roles:

- TestWriter – responsible for writing unit tests in pytest based on the context provided.
- CodeRefactorer – responsible for refactoring previously written code and tests (we thought it was a good idea based on how human Software Engineers and Data Scientists usually work – by implementing a rough version first and refactoring it later.)
- AdditionalFilesWriter – projects usually contain more files than just the code: we often have some kind of documentation, reade.me files, files with licenses, or files for package management. This agent was responsible for exactly that – generating the content of these additional files based on the context provided.

Figure 4. Architecture of context-aware implementation, Source: own study

Problems encountered:

As we load entire files into the prompt, it quickly exceeds 16k GPT-4/GPT-3.5 context limits.
The cost of generation is high due to the large size of the prompts.
Code generation is time-consuming.
Occasionally the planner returns an invalid JSON which leads to the application crashing.

Improving planning and context retrieval

In the next iteration, we paid more attention to reducing the size of the input context, thus optimizing latency and cost. This time, instead of returning the full files in the context, we decided to divide the files into chunks and return only the useful ones to the model.

Text Embeddings are a popular solution for storing text fragments and comparing them. Each document is represented by its mathematical equivalent (a vector). These vectors can be compared with each other, allowing us to retrieve the three most similar embeddings to our target vector X, for example. Text Embeddings are a popular method, so we encourage you to learn more here or here.

Embeddings are stored in the Vector Store, a database that enables the optimal storage and querying of vectors. We picked FAISS as our personal choice, but there are plenty of options – you can learn more about vector databases here. If you are wondering which database is worth choosing, this link takes you to an article comparing the most popular options.

The question now is how to split the code appropriately so that each document in the vector store contains specific and complete information:

Storing the entire file as an embedding: this lengthens the context entered into the prompt, which not only adds to the cost but also poses a significant risk of surpassing the prompt length limit. Additionally, we store parts of the code that will not be of any use to Code Writer (e.g., imports from other files).
Splitting each file using a specific separator or by the number of characters (such as CharacterTextSplitter from the LangChain library): while this can work in plain text, in code it can lead to poor division and incomplete embeddings.
Our method: export only useful code fragments (functions and classes). In addition, only signatures and docstrings are stored, because the body of such a function itself is of little use – the most important thing is the input parameters and what it returns.

For example, this is what the evaluate.py file looked like before processing:

Figure 5. Function before processing, Source: own study

The following figure illustrates the result of how the function was processed by our custom Code Splitter. As you can see, all the relevant information has been preserved – the model will know how to use this function. By excluding the body, Code Writer will become more cost-effective, eliminating concerns about surpassing the maximum context length. Of course, with more functions, the splitter would produce more chunks.

Figure 6. Function after processing, Source: own study

The case is slightly different with classes. In this case, in addition to the class signature and docstring, we also store the full __init__ method so that Code Writer knows the contents of that class. As for methods, we process them in the same way as functions. This is what it looks like before splitting:

Figure 7. Class before processing, Source: Own study

And this is the final result, after using Code Splitter:

Figure 8. Class after processing, Source: own study

We store the code processed this way in a vector database. The question, given the task, is how to add the relevant code chunks to the context. We have already shown above that we can compare all vectors and return the most similar ones. A prerequisite creating function in Planner retrieves the relevant text, which can be used to query the vector database. It is implemented using the prompt below:

Figure 9. Prompt for prerequisites specification

The prompt should also include information that these prerequisites do not have to be included in each task. For example, this is what the first two created tasks look like – as they are not dependent on others, they do not have any prerequisites:

Figure 10. Initial tasks, Source: own study

The third task already uses elements from the previous files, so we write them out specifically as a list. This way, we can turn each element in the list into an embedding, and then compare it with the code fragments and return the ones we need.

Figure 11. The next task, this time with prerequisites, Source: own study

The full architecture of the third iteration is shown in the image below:

Figure 12. Architecture of the enhanced context solution, Source: own study

Problems encountered:

Lack of state-of-the-art knowledge – current GPT models were trained on data (and codebases) produced up to and including 2021 – this means that they have no knowledge of any news released after 2021. Two years is a very long time, especially in the IT industry, where technologies are developing at a galloping pace. Code developed by both our agents and others (MetaGPT, GPT Engineer) is therefore outdated.
Due to the predefined plan, the agent is not flexible and cannot modify the tasks during execution.

The code generated by GPT was fairly good but far from perfect. The code was adequately divided across various files, and everything fit together seamlessly, although occasionally there were difficulties with proper importing. The AI agent maintained compliance with pre-established patterns such as proper docstring formatting, using the specified technologies, and loading data from the correct locations. Launching a project using GPT-4 incurs a cost of slightly over $1.

Approach B: Generating a plan on the fly

This approach is different from the others, because it is inspired by the fact that developers are never able to plan every little step that will be taken when starting a new project. Python scripts are not written once from top to bottom, but instead are constantly changed and improved.

The Agent consists of these main components:

Problem Definition: The user defines the problem.
PrepAgent: Takes the project requirements and returns the output in the format:
1. ```
2. Domain:
3. Problem type:
4. Model:
5. Metrics:
6. Dataset to use:
7. Raw Dataset location:
8. ```
ActionAgent: The step repeats in a loop. Based on the history of previous tasks (file structure and content), the agent selects one of the following actions:
1. CREATE <file_name> – creates a file
2. EDIT <file_name> – modifying an existing file
3. DELETE <file_name> – deletes a file
4. PEEK <file_name> – allows you to preview the content of the indicated file and stores it in the prompt. Right now, only CSV files have been handled – it previews the first 5 lines.
5. FINISH – ends the loop. This option is available to the agent only after several warm-up steps; the number is determined by the parameter.

Secondary agents are also implemented in the main flow:

1. BobTheBuilderAgent – responsible for managing files
2. ScriptKiddieAgent – modifies the content of the selected file

Its architecture is shown in the image below:

Figure 13. Architecture of Approach B, Source: own study

Encountered problems:

Once again, a lack of state-of-the-art knowledge
The possibility of an infinite loop

With such an architecture, it is extremely easy to get stuck in an infinite loop and suspend agent execution. The lack of a plan sometimes causes it to go in strange directions – and the non-determinism of language models encourages this. There are situations when instead of choosing one action from the previously listed ones, an agent decides to create a new one. Once it claimed that he would perform the CONTACT_CLIENT action – so perhaps our next step should be to implement meeting scheduling and face-to-face (or rather matrix-to-face) conversations 😉.

While the lack of advance planning for each task seems to have great potential, it is also much more unstable and requires the consideration of many edge cases. It seems to us that combining the robustness of Approach A with the flexibility of Approach B could yield interesting results.

Evaluation

In order to measure and compare our open-source AI agent and others, we have prepared a set of benchmark tasks. We’ve evaluated coding writing agents over a total of seven small-scale data science projects, which included tasks like:

Detection of license plates in the image (CV)
Text summarizer (NLP)
Spotify genre classification (tabular)

We provided task descriptions and a dataset path as input, which were taken up by the AI agents. These benchmarks measured the ability of the AI models to understand instructions and apply their acquired knowledge to complete a project successfully.

Once the agents completed their tasks, we evaluated the outcomes of the projects they generated. The evaluation was undertaken in two steps:

First, we assessed the quantitative aspects by looking at:

Whether the code can be executed without any issues/failures.
How much it cost to generate the code and how long it took.

Then, we conducted a human review to assess:

The amount and complexity of work required either to run the generated code or fix any issues.
Whether the solution is reasonable, comprehensive, and solves the issue entirely.
The code style, to make sure that it adheres to best practices.

A full diagram showing how the evaluation process was conducted is shown below:

Figure 14. Full evaluation process, Source: own study

The conclusions of our analysis are as follows:

We identified MetaGPT and the 3rd iteration of our Approach A as the top performers in producing high-quality code, with GPT Engineer falling slightly behind.
Our solution excels in planning due to effective prompt engineering with few-shot examples. The plan is solid, ensuring the model consistently produces correct JSON, an achievement other solutions may not always guarantee. In some instances, other agents initially failed during the planning stage itself. What is more, creating prerequisites enriches the plan with useful context.
In terms of cost efficiency, our model didn’t perform as well as both the GPT Engineer and MetaGPT – with a relatively minor difference (well, we are still dealing with very small figures).

Note: This rating demonstrates our team’s findings; however, it’s important to include the caveat that these are based on our criteria and the specific projects that were assessed, so it’s a subjective list. Other trial conditions or assessment criteria could potentially yield different rankings.

Limitations

While AI Agents show surprising autonomy in the coding task performance, they still face several limitations, with some critical ones being:

Difficulty in selecting the optimal tool for a task, leading to problem-solving failures.
Existing memory implementations in Agents are not flawless, particularly in providing extensive information to LLM due to limited context length.
Underlying models, like those from OpenAI, have notable writing code capabilities for generic tasks similar to internet code. However, they struggle when faced with variations or unconventional data formats.
Moreover, these models often include outdated approaches – at today’s pace of advancement in the IT field, a model trained on data up to and including 2021 (GPT-4) will not be able to offer the best solutions.
Larger projects may require many iterations leading to substantial LLM usage and consequently high costs. However, we are optimistic that future iterations of open-source models, like LLAMA2 and Gorilla, will be capable of replacing current paid, state-of-the-art LLMs, like GPT4, without any loss in performance.

Lessons learned

We wish to share some of our insights and useful tricks that we learned from our experiments that you may find useful when experimenting with your own AI agents:

GPT-4 is the best model for code generation, but it is also one of the most expensive ones.
Specialization beats generalization – try to use specialized roles in your agent: code reviewer, code writer, project manager, etc.
Models often fail when we ask for output in a specific format (e.g., JSON), as this research paper illustrated well. It helps significantly to use few-shot prompting, that is, to put in an example of the output we require from the LLM.
Mix different models: for example GPT-4 for code generation but the cheaper GPT-3.5 for other tasks like summarization.
Structured answers output are better than plain text: some problems are easier/much cheaper when just implemented in code.
Add monitoring, tracing and logging of execution. LLM-based applications can quickly grow and you may find it hard to debug them without proper tools.

Creating Coding Agents: final thoughts

In this article, we presented the process of creating our unique Coding Agents and compared them to the currently available solutions. Initially, our solution surpassed the available counterparts. However, the dynamic nature of technology soon brought new contenders like MetaGPT, which at the time of writing exceeds the efficiency of all others and also stands as an equal rival for our agent.

Nonetheless, it’s essential to remember that despite its advanced performance, MetaGPT, like all other solutions, has its share of limitations. They excel in tasks with well-defined objectives, such as creating a snake game, compared to tasks lacking clear directions, like developing a generalized classification model for data provided. In the section dedicated to limitations above, we describe some of the issues with these AI agents, but we believe that these challenges will be overcome in the near future, paving the way for a surge in AI Agent-powered applications, particularly those focused on code generation.

Interested in streamlining your projects with AI-based software development? Feel free to reach out to us for comprehensive solutions tailored to your specific needs.

Creating your whole codebase at once using LLMs – how long until AI replaces human developers?

October 8, 2023/in Generative AI /by Alan Konarski, Maks Operlejn and Patryk Kowalski

A comprehensive review of the state of the art in terms of code-writing agents. We compare the existing solutions and explain how they work behind the scenes.

Introduction

While the concept of AI agents has been around for decades, it is undeniable that recent advancements in Large Language Models (LLMs) have revolutionized this field, opening up a whole new realm of possibilities and applications.

This article will walk you through the current landscape of code-writing AI agents. We present a couple of the existing implementations and discuss how they are built.

In a follow-up article (stay tuned) we’ll share our experiences with developing our own version of an AI agent. We’ll also compare open-source agents and our own ones based on the quality of code written, cost, usage, and speed.

What is an AI Agent?

Imagine having an ever-present, intelligent companion who is always at your side, ready to lend a helping hand whenever you ask, and who never says ‘no’. The AI agent of the future will be exactly that.

We can define an AI Agent as a computer program or system that can perceive its environment, process information, and make decisions or take actions to achieve specific goals (such as solving software engineering problems). AI agents are designed to mimic some aspects of human intelligence and behavior, allowing them to operate autonomously or semi-autonomously in their environment.

Traditionally, AI agents were primarily utilized in the domain of game development. From simple chess-playing programs to complex virtual characters with human-like behaviors, these agents have showcased impressive capabilities within constrained environments. However, the limitations of traditional AI methods often restricted their adaptability, creativity, and real-world applicability.

Usually agents will have:

Some kind of memory (state)
Multiple specialized roles:
- Planner – to “think” and generate a plan (if steps are not predefined)
- Executor – to “act” by executing the plan using specific tools
- Feedback provider – to assess the quality of the execution by means of auto-reflection. It can be augmented or replaced by human feedback.
External sources of informationA way to interact with the user (an optional feature)

Attempts have been made to create agents in various fields such as:

Robotics: Agents can range from simple automated machines to complex humanoid robots designed to assist or replace humans in various tasks.
Autonomous vehicles: These agents have the potential to revolutionize industries, especially transportation, in the form of self-driving cars, trucks, and drones.
Gaming: The use of intelligent agents in the development of more immersive and challenging gaming experiences is on the rise.

A significant aspect of these intelligent agents is their connection with Reinforcement Learning. Reinforcement Learning is one of the seminal branches of Machine Learning, with roots reaching back to the 1980s and 1990s. Despite its age, it remains at the bleeding edge of technology, driving the development of increasingly intelligent, adaptable and complex systems.

Although LLM-based agents are relatively new (as, in fact, are LLMs themselves), they are already being applied to a wide range of tasks, such as:

Helping you search through long documents and asking them questions: How we developed a GPT-based solution for extracting knowledge from documents – deepsense.ai
Playing games: GitHub – MineDojo/Voyager: An Open-Ended Embodied Agent with Large Language Models
Creating entire software engineering projects for a given problem statement: GitHub – AntonOsika/gpt-engineer: Specify what you want it to build, the AI asks for clarification, and then builds it.

In this article we will focus solely on AI Agents capable of solving software engineering and data science problems.

The current state of coding agents

LLMs have rapidly revolutionized the code creation process as developers eagerly include them in their workflow. Whether programmers ask ChatGPT their questions directly and make use of the code snippets it provides, or use dedicated code-writing tools like Copilot which are capable of suggesting whole code blocks at a time from the convenience of your IDE, one thing is certain: the utilization of AI-powered tools in our daily work provides a significant productivity boost. The burning question of today is whether we can take this idea even further and step away from code suggestions, moving towards end-to-end solutions capable of creating the entire project by themselves. AI Code Writing Agents represent a promising approach to this task.

There are a number of open-source projects introducing fresh approaches to AI Agent implementation. Among them, a couple stand out with their relative maturity and serve as inspiration for any new projects in this field.

General purpose coding agents

Auto-GPT

Auto-GPT was one of the first AI agents using Large Language Models to make waves, mainly due to its ability to independently handle diverse tasks. The AutoGPT agent has a large tool set of commands it can call upon, including:

Web-based functions like Google searches and browsing
Interactions with fellow agents (initiating, messaging, deleting)
Operating system operations like file manipulation and script execution

Workflow

Operating via a chain-of-thought feedback loop, the agent generates prompts that it loops back into its own processing. Let’s look into the workflow following an example – creating a sentiment classifier.

The user provides a prompt setting the task.
An agent is created based on the prompt.
- The agent has a personality specified – a SentimentClassifierGPT – making it well prepared for the task at hand.
The agent thinks about how best to solve the problem.
- To make sure there’s no jumping to conclusions, it is bound by a deliberate thinking structure – the chain of thought.
The agent selects a command to run, and comes up with arguments to this command.
- This time it went with executing python code, providing the code it came up with as an argument.
Time for us, the user, to authorize the action.
- Silly agent – it doesn’t know that running this code is a bit premature! Let’s roll with it anyhow – accept.
The result of running this command gets memorized.
- If we have more rounds in our tank, the workflow goes back to the thinking step, and the agent gets to reflect on the results of its actions.
Second round! Workflow loops back to step 3.
- The agent thinks – “The error indicates that the ‘torch’ module is not installed in the current environment… However, as an AI, I don’t have the capability to install packages.” Spot on, buddy! We are not going to succeed this particular time.

Figure 1. Simplified Auto-GPT Workflow, Source: own study

Extra details

For memory, the agent employs a dual approach. It possesses a short-term memory in text format, complemented by long-term memory embeddings within a vector database. This enables the agent to effectively leverage contextual information and the outcomes of past actions within its ongoing interactions.

Pros of Auto-GPT

Versatility – an agent can call up a large variety of commands to help it with the task at hand.
Chain-of-thought prompting – a brilliant technique helping agents hit the mark more often.
Memory handling – well-designed short- and long-term memory.

Cons of Auto-GPT

Operating within a loop – the costs can rack up quickly! Especially when the agent exhibits repetitive behavior.
- This is slightly evened out by the human control mechanism – the user can stop the loop if it seems to be going nowhere.
Memory handling and context limitations – while it is well designed, it is still one of the weakest chains in this tool.
- For big projects, agents are prone to forget what they have already accomplished, which makes them likely to repeat themselves.
- The relatively small context of the current LLMs makes memory handling even more challenging.
Poor code-writing abilities.

What makes the project stand out

Agents created on the fly

BabyCoder

BabyAGI is another AI agent working in a loop. It has inspired multiple projects, possibly due to the simplicity of its implementation.

BabyCoder is one of the BabyAGI variants which specializes in writing code. Unlike AutoGPT, which creates agents on the fly, BabyCoder uses a handful of predefined, specific agents who work together to come up with a functional project.

Workflow

The Task Initializer Agent creates a list of tasks to be carried out,
The Human Input Agent works with the user to refine the tasks
The Task Assigner Agent passes each task to a capable executor, such as
- the Command Executor Agent
- the Code Writer Agent
- the Code Refactor Agent

Let’s try this out on the same task – creating a sentiment analysis model.

The user provides a prompt setting the goal.
The task Creation Agent comes up with tasks – in our case all 14 of them. The tasks span from creating project and data directories, to installing the required python modules, to writing the specific code files.
The Command Executor Agent starts working on the first task – and performs it flawlessly: mkdir sentiment_classifier/data
The context is extended by the result of the first task.
The execution agent picks up the next task on the list.
Workflow loops back to step 3 – task execution.
… The program crashes.

While most likely not the intended workflow, that is exactly how this turned out during our tests. At this point the BabyCoder either forgets what directories it created and attempts to use incorrect paths, looping itself aimlessly, or exceeds the available LLM context window with a loud error message straight from the OpenAI API.

Figure 2. BabyCoder workflow, Source: https://github.com/yoheinakajima/babyagi/tree/main/babycoder

Splitting the main task into multiple small, specialized tasks is a brilliant workaround for the current limitations of LLMs, including their limited attention span and difficulty handling complex workflow.

Pros of BabyCoder

Very simple implementation

Cons of BabyCoder

Unreliable – tends to crash with an error before finishing the tasks it set

What makes the project stand out

Predefined specializing agents
The number of projects it has inspired

smol developer

smol developer lives up to its name by being lightweight and simple.

Workflow

The user initiates the app-building process with a singular prompt.
The smol developer creates a detailed plan that outlines the structure of the app to be generated, listing all the variables and functions. This plan will be included in each prompt the smol developer uses later on.
The agent creates the code according to the first task.
Memory is extended by the code generated.
The agent creates the code according to the second task… And so on.

Pros of smol developer

Simplicity
Linear workflow – no surprises in terms of the number of calls made to the OpenAI API

Cons of smol developer

Lack of advanced memory handing – smol developer stuffs all the information together, so it needs to fit inside the LLM context
Requires multiple tries before you get the prompt right

What makes the project stand out

Simplicity

GPT Engineer

GPT Engineer is a tad bigger, but still a lightweight approach to code generation. It starts from a prompt – but it also asks extra clarifying questions.

Workflow

The user provides the main objective.
The GPT Engineer asks clarifying questions about the target, focusing on the technical aspects of the solution (e.g., data format, preference for the model used). The user can answer or can leave it to the model to decide by itself
The agent prepares the “Core Classes, Functions, and Methods” section, which lists technical details of the solution. For example 1. `SentimentDataModule`: A PyTorch Lightning DataModule for loading and preparing the sentiment analysis dataset.
Based on the code functionalities described above, the GPT Engineer starts to create code, one element after another
Once the code is generated, the agent gives us the option to install the necessary libraries (from the requirements.txt file) and launch his creation

It seems that the code produced by gpt-engineer is already much better than that produced by its predecessors, although it is still far from perfect.

Figure 3. GPT Engineer’s sample question and assumptions, Source: own study

Pros of GPT Engineer

Clarifying questions – the agent asks questions to make sure it produces code that is consistent with our expectations
The workflow is concise, as is the code itself, which saves us time and money
Gives you the ability to install dependencies and execute code

Cons of GPT Engineer

The questions do not always make sense – sometimes they ask about information already included with the initial prompt

What makes the project stand out

Objective clarification
Dependency handling and code execution
Alternative approaches to code generation, for example Test-Driven Development (TDD)
At the end of the execution, the GPT Engineer collects data on the quality of its work – signaling the authors’ intention to keep working on the improvements

Figure 4. GPT Engineer gathers data to learn from its mistakes, Source: own study

MetaGPT

MetaGPT originated as a research paper, with its focus centered on multi-agent collaboration.

This project is heavily inspired by how human developers manage IT projects. MetaGPT is written in a way that resembles a software company, hiring agents for positions common in the industry – like architects, product managers and even customer service reps. MetaGPT is a complex tool, which is why we decided to describe it in more depth than usual.

Workflow

The user provides the prompt.
The boss agent broadcasts the goal.
The agents check the shared message board for information relevant to them.
If an agent sees something of interest it reacts.
1. Think – keeping in mind the context of the role, see what needs to be done and reflect on how to do it.
2. Observe – check the message board and read about the relevant events.
3. Act – perform a task, call an LLM or another function.
4. Broadcast – share the execution results and other action records with the environment.
After performing their actions, agents broadcast their results to the message board.
Back to point 2.

Figure 5. MetaGPT workflow, Source: own study

The description above, while accurate, doesn’t do much in terms of explaining what actually happens. For a step-by-step walkthrough, see below:

Example of an actual workflow

Portrayed using the sentiment analysis model creation prompt.

The user provides the prompt.
The boss agent picks up the prompt and passes it on to the company message board.
All of the employee agents look through the message board, looking for any information of interest.
Alice, the Product Manager, picks up the requirements and gets to work:
- Polishing the requirements
  - “Develop a sentiment classifier based on product reviews and their ratings” (…)
- Creating the user stories
  - “As a data scientist, I want to train a sentiment classifier so that I can predict the sentiment of product reviews” (…)
- Drafting the UI
  - “The scripts should be command-line executable and provide clear output to the user.” (…)
- Performing the competitive analysis by searching for the information on the internet
  - “Google’s AutoML Natural Language: Offers sentiment analysis but lacks customization for specific tasks” (…)
- Documenting her findings using mermaid charts (see figure 3)
- Posting the results of her work to the company message board
Bob, the Architect, sees that Alice is done and picks up the baton.
- Drafting the implementation approach
  - “We will use the transformers library from Hugging Face, which provides pre-trained BERT models and utilities for fine-tuning on a specific task.” (…)
- Deciding on the python package name
  - “sentiment_classifier”
- Creating the file list
  - “train.py”,
  - “validate.py”,
  - “inference.py”,
  - “model.py”,
  - “utils.py”
- Charting the data structures and interface definitions (see figure 3)
- Posting the results of his work to the company message board
Eve, the Project Manager, details the tasks
- Analyzing the dependencies
  - Creating the requirements.txt
- Creating full API specification
- Analyzing the program flow
  - “train.py”, “Contains the main training loop. Depends on model.py and utils.py”
- Posting the results of her work to the company message board
Alex, the Engineer, writes the code for each file.

Environment

The agents share an Environment where all the memories are stored. Memories are essentially a list of messages broadcasted by agents. Agents with specific roles can browse the environment to retrieve and store memories relevant to them. Each agent possesses a distinct context, subscribing to different types of events and striving to remember only what is directly relevant to its role.

Working in rounds

The execution of the program is limited by the number of rounds set, so it’s important to leave enough rounds for every agent to have an opportunity to act.

For example, during the first round, the only message in the environment comes from the boss – that’s the initial user prompt. The Project Manager will pick up the requirements, perform her actions and post the results to the environment message buffer. However, no other agent will pick up this information during the same round – they have already finished their observations. Starting next round, Bob, the Architect, will pick up the user stories and start coming up with the design. A couple of rounds have to pass before any code gets written.

Each agent can react any time an event of interest is registered in the environment.

In some cases multiple agents will take action during the same round – a state management system has been implemented to prevent conflicts between asynchronous actions, allowing an agent to indicate whether it is busy or idle. This design might make it easier to develop this solution towards multithreading – but at the moment it does not support parallelism.

The mermaid charts

MetaGPT outputs very detailed schematics of its designs using the mermaid diagramming tool.

Figure 6. (a) Results of competitive analysis conducted by MetaGPT after being asked to create a sentiment analysis solution. (b) Data API design (c) Sequence flow, Source: Generated by MetaGPT

Pros of MetaGPT

Standardized Operating Procedures – well-designed workflow reduces chaos and ensures the tasks are handled consistently
- The charts!

High quality solutions
A budget cap – a spending limit keeping us safe

Cons of MetaGPT

Not ideal for simple tasks – the complex workflow can be an overkill

What makes the project stand out

Complexity – in terms of tools, memory handling and overall workflow
Academic background – an excellent paper describes MetaGPT in detail

Comparing the models

	AutoGPT	BabyCoder	smol developer	GPT Engineer	MetaGPT
Workflow	Loop	Linear	Linear	Linear	Loop
Order of actions	Sequential	Sequential	Sequential	Sequential	Asynchronous
Purpose	Multipurpose	Coding	Coding	Coding	Coding
Memory	Selectively retrieved	Selectively retrieved	Context only	Context only	Selectively retrieved
Agents	Created on the fly	Predefined	Predefined	Predefined	Predefined

Table 1. Comparison of the discussed models, Source: own study

None of the models we ran produced a working sentiment analysis model. AutoGPT and BabyCoder were not able to finish their processing without errors. GPT Engineer and MetaGPT came closest to a working solution. MetaGPT impressed us the most due to the extras it produces – charts and specifications. Of the two, it was also easier to use – GPT Engineer takes more effort to start due to the extra clarifying questions it asks at the beginning.

Specialized coding agents

DemoGPT

DemoGPT allows users to create interactive Streamlit apps with just a prompt. What makes DemoGPT different from other solutions is that it specializes in generating LLM-based applications powered by Streamlit and LangChain as opposed to more generic agents described above.

The core functionalities of DemoGPT revolve around four distinct stages (see Figure 4):

Planning: DemoGPT generates a comprehensive plan based on the user’s instructions.
Task Creation: Next, specific tasks are created by DemoGPT, utilizing the generated plan and the provided instruction as a guide.
Code Snippet Generation: The generated tasks are then transformed into code snippets.
Final Code Assembly: The code snippets are merged to form a final code, resulting in an interactive Streamlit app.

Figure 7. DemoGPT flow, Source: https://github.com/melih-unsal/DemoGPT

Moreover, each of these steps undergoes refinement before progressing to the next stage, ensuring optimal quality and accuracy throughout the process. Once the user approves the final outcome, all the generated artifacts, including the plan, tasks, and code snippets, are saved in the database for future retrieval when generating similar applications.

By utilizing these functionalities, DemoGPT is able to convert user instructions into interactive applications, making it an effective tool for LLM-based application development.

Wasp

At its core, Wasp tries to redefine full-stack web application development. It covers three significant domains of a web application, creating a well-functioning client (front-end), server (back-end), and the database, all connected in one ecosystem.

Underneath, Wasp is fueled by React, Node.js, and Prisma. With these technologies at its backbone, Wasp defines web components, server queries, and actions.

The essence of its effectiveness is Wasp’s built-in compiler. Thanks to this, you only need to describe your features in a special config file, and Wasp will compile them for you. The output will be the client app, server app and the whole codebase needed for deployment.

Everything described above is summarized in the figure below:

Figure 8. Wasp – main components, Source: https://wasp-lang.dev/docs

And this is what a very basic sample Wasp config file looks like:

Figure 9. Wasp sample configuration file, Source: https://wasp-lang.dev/docs

You may wonder how the concept of generative AI tools comes into play here when we are the ones writing the code and configuration file. Well, Wasp has been on the market for quite some time. However, its team just recently started integrating GPT to develop a full web application with only a natural language objective specification.

The model does a great job of creating Wasp configurations and individual components. It facilitates the quick construction of a simple application prototype, complete with a database, an authentication system, and the backbone of essential CRUD operations.

You can test Wasp’s capabilities in generating web applications from scratch here, where a free playground has been provided.

Figure 10. GPT Web App Generator by Wasp, Source: https://wasp-lang.dev/docs

The future of AI Agents

We’re still in the early stages of LLM-powered AI Agent development, but they already have impressive capabilities. We believe that in the next few years we will see rapid improvements in that field and rapid increase in the products based on AI Agents. These solutions will include not just language-based models (LLMs), but also multimodal models, among others – allowing agents to work with different types of data like images and videos. Frankly, we already have models like GPT4 and Bard that are multimodal and can take images as an input, but it seems that these capabilities are not yet fully utilized in AI Agents.

We may even ask ourselves: can these AI agents lead to AGI?

Summary

In this blog post, we’ve discussed the rapid growth of AI Agents, fueled by recent breakthroughs in LLM. These AI agents attempt to write complete code for projects simply from the problem statements – albeit not always successfully. There are a variety of solutions offering different approaches to code writing – from the broad capabilities of MetaGPT, to the specialized nature of Wasp. However, every one of them shares a similar core – an LLM answering questions and creating code – and they can only be as good as the underlying model.

Currently, these solutions are far from being a replacement for humans – but if we give them time, they may come to surprise us all.

If you are interested in evaluating these agents and the ones we have created, as well as the lessons we learned from creating AI support for writing code, stay tuned for the upcoming second part of our blog post!

Need a reliable partner for a forward-thinking AI project for your business? Contact us today, and our experts will help you find a bespoke solution tailored to answer your business challenges. Choose our AI development services!

Generative AI developer toolkit

September 11, 2023/in Generative AI /by Paweł Kmiecik

A thrilling adventure in the world of next-gen programming awaits, powered not by replacing humans with AI, but by using AI to enhance human potential. In this blog post we will discuss the most interesting and powerful GenAI tools that you should know more about.

In recent times, the rapid advancement of AI technologies like ChatGPT and other Large Language Models (LLMs) have sparked growing panic among the software engineering community. Headlines warning of the looming robot takeover have fueled this unease, making developers question the future of their occupation. As technology evolves, it’s natural that we, as developers, feel apprehensive about the impact of AI on our careers. But worrying about robots taking over your job can only be harmful and hinder your development. Instead of dreaming up dystopian visions, it is better to know the possibilities that AI tools like GitHub Copilot or ChatGPT open up to complement and streamline your workflow. By embracing emerging generative AI, you can supercharge your efficiency and enrich your value in the ever-competitive job market, proving that you are adaptable and ready for whatever the future brings.

Don’t give up on being a developer

According to a 2019 report by the UK Office for National Statistics, software engineers face a 27.4% probability of their roles being replaced by robots. Yet this statistic is not exclusive to software engineers: many occupations across various sectors have seen a similar or higher figure projected. Despite being harbingers of fear, these statistics need to be understood in context. They certainly don’t consider the current limitations that AI possesses, nor its inability to replace key human skills.

Chart 1 – Occupations at risk of being automated. The yellow dot represents programmers and software engineers. Source.

At present, for all its impressive capabilities, AI technology cannot replicate human creativity, intuition and critical thinking. ChatGPT or GitHub Copilot, while perfectly competent at suggesting individual lines or blocks of code, lack the intuitive semantic understanding that developers possess. The tools also understand neither the broader context nor intent of code nor the larger architectural design decisions. These are elements that are profoundly human and cannot be easily learned by current LLMs, even the best ones.

Finally, an important observation to bear in mind is that developers who resist embracing these new-age tools aren’t in danger of being replaced by robots, but rather by other developers who choose to adapt and evolve. There is an increased demand in the job market for developers who are capable of adopting AI tools and willing to do so, with employers valuing the higher efficiency and enriched skill set that these individuals bring to the table.

Understand what’s going on – the LLM tsunami

In the midst of the burgeoning LLM wave, burying your head in the sand is not a viable solution. Instead of retreating to a corner, take the proactive route to understanding how these developing tools work. Equip yourself with knowledge, dive into the mechanics, and prepare to ride the wave instead of being consumed by it.

First and foremost, acquaint yourself with the general architecture of LLM models. Just as if you were learning a new framework or library, it’s crucial to understand its underlying assumptions and considerations first. Remember, curiosity is your greatest ally. Understanding the datasets these models are trained on, the process of training, and most importantly their limitations, can give you a well-rounded perspective of the tools that are based on them.

You may worry that this sounds like it might require a data scientist’s level of expertise. Don’t worry – you don’t need to dive deep into the intricate matrix of machine learning algorithms and neural networks. A high-level overview of the concepts is all you need. You don’t need to become an expert overnight, but getting your hands dirty and exploring their different functionalities is a concrete step forward.

Now that you have laid the groundwork, it’s time to learn the art of crafting better prompts. A good prompt is instrumental in how helpful these AI tools will prove to be. There are many resources available to sharpen this skill, including free courses by deeplearning.ai that offer excellent training or customized workshops like those offered by deepsense.ai.

This newfound knowledge of generative AI tools will stand you in good stead in the dynamic landscape of software development. Think of it as a lifelong learning process, not a sprint to the finish line. By understanding the currents of the LLM tsunami, you can learn to navigate them and utilize these AI tools to supercharge your workflow and provide the edge you need in a continually evolving job market.

Play with in-editor code generators

The journey of code generation spans a significant timeline. Some may recall the tools that produced Java classes from UML diagrams. Contemporary Integrated Development Environments (IDEs) can create standard code segments utilizing template systems, streamlining the development process by facilitating actions such as method extraction, renaming, and other refactoring techniques.

However, today’s code generation tools have evolved to be much more advanced and resourceful, courtesy of Large Language Models (LLMs). These models, trained on millions of lines of code, possess the capacity to comprehend the essential context, language syntax, and rules, and most impressively, they can independently craft the desired code.

The fundamental workflow of these tools is to propose code snippets as you type. Let’s begin with GitHub Copilot.

An example of in-line completions. The programmer starts typing a Python function that is about to find names in a collection of strings, whose prefix is one of the elements of the starting_with list and whose suffix is one of the elements of ending_with. GitHub Copilot found a proper solution and filled the docstring as well.

GitHub Copilot is an IDE plugin, compatible with MS VisualStudio, VisualStudio Code, JetBrains IDEs and even NeoVim. This plugin leverages the Codex model, an OpenAI creation that is diligently developed, trained, and maintained. The architecture of this tool is not overly complex – the pivotal element is the “client” segment of the plugin, which is responsible for prompt formulation. This prompt is then forwarded to the model backend service, and the response is utilized to generate a code suggestion.

Crucial to creating relevant and sensible code suggestions is the prompt itself – as I mentioned earlier, I wasn’t fibbing. The plugin garners specific metadata about the surrounding code – the currently open file, the code snippets from neighboring tabs, the programming language, the filename, and more. While this may set off cybersecurity concerns, we will delve into this discussion in an upcoming section.

Now that we have a basic understanding of GitHub Copilot’s functionality, comprehending its second mode – code generation from within comment descriptions – becomes simpler.

Example of completion made from in-comment description. GitHub Copilot is asked to create a fully working application. While some elements are stubbed (e.g. database connection), the rest of the code is correct.

This mode is closer to the way we use ChatGPT – we’re asking the plugin to generate code that is described in natural language. A crucial skill here is to be as specific as possible, while using simple language at the same time. Keep in mind that you cannot expect the plugin to create a fully working product for you – it’s much better to construct highly specified parts of the code which can be described using the Single Responsibility Principle. This makes us – developers – safe in the place we’re working in ;).

The market quickly understood the potential of tools like GitHub Copilot; that’s why Amazon released their own version of a code generation plugin – Amazon Code Whisperer. The way it works is almost the same as with the GitHub product. There are basically two modes we can work with. Amazon took more care over the origin of the completion. We have better control over the source of the suggestion that the tool proposes and whether it matches some open source code with limited licensing.

The number of LLM-based code generators is growing and we can expect the market to deliver at least one a week. At the time of writing, tools like Tabnine, JetBrains AI Assistant or Codeium are worth mentioning.

As evidenced, the realm of coding assistant tools is expanding rapidly. An eager developer might say, “Alright, give me the best one, and I’ll start using it!” However, the hitch is these tools are challenging to compare. Comparisons of code generators are available, but they do not explicitly rank one tool as superior over another. They do, however, offer some valuable insight:

Adhering to best practices in terms of code naming, simplicity of structure, etc., makes your code more intuitive for the tools to interpret. The old adage that “the quality you give is the quality you get” holds true – provide superior quality code and receive improved suggestions in return.
The tools can sometimes anticipate your desired output even with just a few words of a function name. However, this isn’t a golden rule and generally applies to basic scenarios.
The inclusion of descriptive names, type hints and documentation (like Javadoc or Python docstrings) significantly enhances the quality of code completion suggestions.

The overarching advice is to focus less on which tool to choose and more on how to utilize it efficiently. The quality of the suggestions is largely influenced by your coding approach and how you express your expectations, rather than the inherent quality of a specific tool. Moreover, anticipate swift evolution and frequent shuffling in the rankings of these tools in the upcoming months. While staying up to date with market trends is beneficial, frequently jumping from one tool to another is detrimental.

Prioritizing enduring skills such as prompt formation over mastering a particular tool bears a resemblance to learning universal coding principles instead of trying to memorize every aspect of a specific framework. When your foundational skills in prompting are well-honed, transitioning to a different tool becomes a breeze.

Discover the power of the GPT model

With the launch of ChatGPT by OpenAI, the tech industry encountered a ground-breaking game changer. Professionals are still discovering new practical use cases for this remarkable tool. This definitely holds true for programming – developers can use this chat-style tool in pair programming sessions as a digital partner.

ChatGPT can be employed to generate code, much like the integrated tools we’ve previously discussed. The key lies in crafting the appropriate prompt and providing all of the necessary context. Playing with the system in this manner not only presents an educational opportunity but also provides insights into how to utilize these integrated tools more efficiently.

What truly captivates our interest is the vast array of applications not strictly related to code writing; instead, they enhance various other tasks that developers routinely tackle. These include:

Creating concise but descriptive names: It’s a well-known joke amongst developers that naming conventions and cache invalidation are the two hardest aspects of programming. Thankfully, heeding the first problem is now an ace up your sleeve.
Comparing libraries, frameworks, and platforms: The GPT model has compiled a wealth of knowledge, allowing you to probe its database for comparisons between two solutions, while focusing on the aspects that particularly interest you.
Generating sample data: By supplying a data structure (in any format), you can instruct it to generate sample data for use in tests.
Code improvements: By simply pasting a snippet of code, ChatGPT can suggest improvements, and sometimes even spot bugs.

The ongoing evolution of Integrated Development Environments (IDEs) alongside developers’ tools indicates a clear shift towards integrating intelligent assistants into the daily life of a developer. Embrace these emerging technologies today and explore the plethora of tasks that the GPT model can facilitate. This will enable a smooth transition to using specialized, dedicated tools in the future.

Give various LLM-based assistants a chance

In today’s ever-evolving technological landscape, leveraging large language models (LLMs) to develop innovative digital products is easier than ever before. One can lean on proprietary offerings such as the OpenAI API or opt for self-hosted LLMs, depending on individual project requirements. While technical execution might not be straightforward but is at least well covered by the right LLMOps practices, the true challenge and artistry lies in utilizing LLMs creatively to address and resolve specific problems, and in designing the perfect prompts to do so. The potential of LLMs has already been recognized and harnessed by individuals and organizations alike, resulting in an array of solutions uniquely equipped to tackle specific challenges.

This burgeoning trend extends well into the realm of coding-related activities. Take phind.com, for instance. This cutting-edge, LLM-powered assistant for developers enables users to resolve complex programming conundrums step by step.

Meanwhile, a Chrome extension named codereview.gpt is revolutionizing the way developers engage with GitHub and GitLab platforms. When you’re drained but still staring at a queue of three pull requests, CodeReview.GPT is your ally. The tool analyzes proposed changes, identifies code smells and typos, and even suggests improvements, making code reviewing efficient and adaptive.

CodeReview.GPT

But it’s not just developers who are benefiting from these advancements. Data scientists and analysts who favor Jupyter notebooks can take advantage of genai. This tool seamlessly integrates with your notebook setup and provides intuitive hints and solutions for coding obstacles you may encounter within your cells.

The power shared by these versatile tools stems from their foundation on LLMs. While the implementation of LLMs is technically quite demanding, the real challenge and the secret to success lies in crafting the most effective prompts for any given user task. This underlines the importance of mastering the art of setting effective prompts – a sophisticated blend of skills that draws from both technical understanding and creative thinking.

Security considerations

As we journey deeper into the world of generative AI tools, it is imperative that we address the elephant in the room – security. Like any technology, tools such as GitHub Copilot or ChatGPT aren’t devoid of potential risks. While they offer numerous efficiencies and streamlining possibilities, caution must be exercised to ensure secure usage.

The most significant concern arises from the fact that these tools learn from public code repositories – they don’t possess the ability to differentiate between confidential proprietary code and public code. This could inadvertently risk the exposure of sensitive data.

Furthermore, there’s the concern of a tool like GitHub Copilot generating code snippets that it learned from open-source projects or libraries that come with strict licensing conditions. The usage of such pieces of code, without complying with these conditions, could lead to potential legal or compliance issues.

However, these limitations shouldn’t deter you from leveraging generative AI tools. Rather, they should reinforce the importance of a transparent, well-structured approach to the adoption of these tools.

To safely navigate these security concerns and prepare your company for the change, adopt a strategy that includes:

Implementing comprehensive access controls and guidelines to ensure the secure and responsible use of AI tools.
Regularly updating your risk assessment protocols to include artificial intelligence tool applications.
Training your team not only in the functional use of these tools but also educating them about the associated risks and how to circumvent them.

Instances where sensitive information accidentally trickles into the AI training data can be addressed by employing secure coding practices and routine data audits. Policies should be put in place that restrict AI access to sensitive information to the bare minimum.

In conclusion, adopting AI tools like GitHub Copilot or ChatGPT requires a cautious balance. On one hand, you should cautiously respect the potential risks; on the other hand, you shouldn’t be dissuaded from harnessing the immense benefits these tools offer. View these security aspects not as impediments, but as crucial components of your journey towards integrating AI into your everyday coding practices.

Summary

The choice of AI-based tools for coders has exploded in the last few months. As professionals, we could have ignored them and even stuck to the good old vi editor. But times are changing and a conscious developer should adapt to be more effective and valuable on the market. It’s the best time to start learning the new way of coding – even if not in your current project (due to security restrictions), try it in your pet project or programming exercises (coding katas).

The way software is being created is definitely evolving – starting with no-code and low-code solutions that help companies deliver products faster and with lower costs, to coding enhancing tools like Copilot that make developers more efficient, to Large Language Models that even help in architecture and business decision-making processes. Place yourself in this new landscape of the IT industry by starting your journey with a generative AI toolkit.