Creating your whole codebase at once using LLMs – how long until AI replaces human developers?
A comprehensive review of the state of the art in code-writing agents. We compare the existing solutions and explain how they work behind the scenes.
Introduction
While the concept of AI agents has been around for decades, it is undeniable that recent advancements in Large Language Models (LLMs) have revolutionized this field, opening up a whole new realm of possibilities and applications.
This article will walk you through the current landscape of code-writing AI agents. We present a couple of the existing implementations and discuss how they are built.
In a follow-up article (stay tuned) we’ll share our experiences with developing our own version of an AI agent. We’ll also compare the open-source agents and our own based on the quality of the code written, cost, usage, and speed.
What is an AI Agent?
Imagine having an ever-present, intelligent companion who is always at your side, ready to lend a helping hand whenever you ask, and who never says ‘no’. The AI agent of the future will be exactly that.
We can define an AI Agent as a computer program or system that can perceive its environment, process information, and make decisions or take actions to achieve specific goals (such as solving software engineering problems). AI agents are designed to mimic some aspects of human intelligence and behavior, allowing them to operate autonomously or semi-autonomously in their environment.
Traditionally, AI agents were primarily utilized in the domain of game development. From simple chess-playing programs to complex virtual characters with human-like behaviors, these agents have showcased impressive capabilities within constrained environments. However, the limitations of traditional AI methods often restricted their adaptability, creativity, and real-world applicability.
Usually agents will have:
- Some kind of memory (state)
- Multiple specialized roles:
- Planner – to “think” and generate a plan (if steps are not predefined)
- Executor – to “act” by executing the plan using specific tools
- Feedback provider – to assess the quality of the execution by means of auto-reflection. It can be augmented or replaced by human feedback.
- External sources of information
- A way to interact with the user (an optional feature)
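To make this anatomy concrete, here is a minimal Python sketch of how these pieces might fit together. All the names, the canned plan and the stubbed tool are our own illustration, not taken from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    goal: str
    memory: list = field(default_factory=list)  # state: what happened so far

    def plan(self):
        # Planner role: "think" and produce a list of steps (canned here).
        return [f"analyse: {self.goal}", f"solve: {self.goal}"]

    def execute(self, step):
        # Executor role: "act" on a single step using a tool (a stub here).
        return f"done: {step}"

    def reflect(self, result):
        # Feedback provider role: auto-reflect on the execution quality.
        return "ok" if result.startswith("done") else "retry"

    def run(self):
        for step in self.plan():
            result = self.execute(step)
            self.memory.append((step, result, self.reflect(result)))
        return self.memory

history = Agent(goal="a sentiment classifier").run()
```

In a real agent the planner and the feedback provider would each be an LLM call, and the executor would dispatch to actual tools; the loop structure, however, stays the same.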
Attempts have been made to create agents in various fields such as:
- Robotics: Agents can range from simple automated machines to complex humanoid robots designed to assist or replace humans in various tasks.
- Autonomous vehicles: These agents have the potential to revolutionize industries, especially transportation, in the form of self-driving cars, trucks, and drones.
- Gaming: The use of intelligent agents in the development of more immersive and challenging gaming experiences is on the rise.
A significant aspect of these intelligent agents is their connection with Reinforcement Learning. Reinforcement Learning is one of the seminal branches of Machine Learning, with roots reaching back to the 1980s and 1990s. Despite its age, it remains at the bleeding edge of technology, driving the development of increasingly intelligent, adaptable and complex systems.
Although LLM-based agents are relatively new (as, in fact, are LLMs themselves), they are already being applied to a wide range of tasks, such as:
- Helping you search through long documents and ask questions about them: How we developed a GPT-based solution for extracting knowledge from documents – deepsense.ai
- Playing games: GitHub – MineDojo/Voyager: An Open-Ended Embodied Agent with Large Language Models
- Creating entire software engineering projects for a given problem statement: GitHub – AntonOsika/gpt-engineer: Specify what you want it to build, the AI asks for clarification, and then builds it.
In this article we will focus solely on AI Agents capable of solving software engineering and data science problems.
The current state of coding agents
LLMs have rapidly revolutionized the code creation process as developers eagerly include them in their workflows. Whether programmers ask ChatGPT questions directly and use the code snippets it provides, or rely on dedicated code-writing tools like Copilot, which can suggest whole code blocks at a time from the convenience of your IDE, one thing is certain: AI-powered tools provide a significant productivity boost in our daily work. The burning question today is whether we can take this idea even further and step away from code suggestions towards end-to-end solutions capable of creating an entire project by themselves. AI code-writing agents represent a promising approach to this task.
There are a number of open-source projects introducing fresh approaches to AI Agent implementation. Among them, a couple stand out with their relative maturity and serve as inspiration for any new projects in this field.
General purpose coding agents
Auto-GPT
Auto-GPT was one of the first AI agents using Large Language Models to make waves, mainly due to its ability to independently handle diverse tasks. The Auto-GPT agent has a large set of commands it can call upon, including:
- Web-based functions like Google searches and browsing
- Interactions with fellow agents (initiating, messaging, deleting)
- Operating system operations like file manipulation and script execution
Workflow
Operating via a chain-of-thought feedback loop, the agent generates prompts that it loops back into its own processing. Let’s look into the workflow following an example – creating a sentiment classifier.
- The user provides a prompt setting the task.
- An agent is created based on the prompt.
- The agent has a personality specified – a SentimentClassifierGPT – making it well prepared for the task at hand.
- The agent thinks about how best to solve the problem.
- To make sure there’s no jumping to conclusions, it is bound by a deliberate thinking structure – the chain of thought.
- The agent selects a command to run, and comes up with arguments to this command.
- This time it went with executing Python code, providing the code it came up with as an argument.
- Time for us, the user, to authorize the action.
- Silly agent – it doesn’t know that running this code is a bit premature! Let’s roll with it anyhow – accept.
- The result of running this command gets memorized.
- If we have more rounds in our tank, the workflow goes back to the thinking step, and the agent gets to reflect on the results of its actions.
- Second round! Workflow loops back to step 3.
- The agent thinks – “The error indicates that the ‘torch’ module is not installed in the current environment… However, as an AI, I don’t have the capability to install packages.” Spot on, buddy! We are not going to succeed this particular time.
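The loop above can be boiled down to a few lines of Python. Everything below is a toy stand-in – the command set, the "LLM" and the canned torch error mirror our example run, not Auto-GPT's real internals:

```python
# Toy command set; real Auto-GPT has many more (search, browse, files...).
COMMANDS = {
    "execute_python": lambda code: "error: module 'torch' is not installed",
}

def fake_llm(memory):
    # Stands in for the chain-of-thought step: given the memory so far,
    # choose a command and come up with its arguments.
    return "execute_python", "import torch"

def run_agent(goal, max_rounds=2, authorize=lambda command: True):
    memory = [f"goal: {goal}"]
    for _ in range(max_rounds):
        command, args = fake_llm(memory)              # think, pick a command
        if not authorize(command):                    # human-in-the-loop gate
            break
        result = COMMANDS[command](args)              # act
        memory.append(f"{command}({args!r}) -> {result}")  # memorize
    return memory

memory = run_agent("train a sentiment classifier")
```

The `authorize` callback is where the user accepts or rejects each action, and `max_rounds` is the cap that keeps the loop – and the API bill – from running away.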
Extra details
For memory, the agent employs a dual approach. It possesses a short-term memory in text format, complemented by long-term memory embeddings within a vector database. This enables the agent to effectively leverage contextual information and the outcomes of past actions within its ongoing interactions.
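A toy illustration of this dual-memory idea, with a bag-of-words "embedding" standing in for a real embedding model and a plain list standing in for the vector database:

```python
import math
from collections import Counter

def embed(text):
    # Stand-in "embedding": a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class DualMemory:
    def __init__(self, short_term_size=3):
        self.short_term = []   # recent messages, kept verbatim
        self.long_term = []    # (embedding, text) pairs, queried by similarity

        self.size = short_term_size

    def remember(self, text):
        self.short_term.append(text)
        if len(self.short_term) > self.size:
            evicted = self.short_term.pop(0)
            self.long_term.append((embed(evicted), evicted))

    def recall(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.long_term,
                        key=lambda entry: cosine(q, entry[0]), reverse=True)
        return [text for _, text in ranked[:k]]

mem = DualMemory(short_term_size=1)
mem.remember("installed the torch module")
mem.remember("created the data directory")
mem.remember("wrote the training script")
```

Recent events stay in the short-term buffer word for word; older ones are evicted into long-term storage and come back only when a query is similar enough – e.g. `mem.recall("torch module error")` retrieves the torch-related memory.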
Pros of Auto-GPT
- Versatility – an agent can call up a large variety of commands to help it with the task at hand.
- Chain-of-thought prompting – a brilliant technique helping agents hit the mark more often.
- Memory handling – well-designed short- and long-term memory.
Cons of Auto-GPT
- Operating within a loop – the costs can rack up quickly! Especially when the agent exhibits repetitive behavior.
- This is slightly evened out by the human control mechanism – the user can stop the loop if it seems to be going nowhere.
- Memory handling and context limitations – while it is well designed, it is still one of the weakest links in this tool.
- For big projects, agents are prone to forget what they have already accomplished, which makes them likely to repeat themselves.
- The relatively small context of the current LLMs makes memory handling even more challenging.
- Poor code-writing abilities.
What makes the project stand out
- Agents created on the fly
BabyCoder
BabyAGI is another AI agent working in a loop. It has inspired multiple projects, possibly due to the simplicity of its implementation.
BabyCoder is one of the BabyAGI variants, specializing in writing code. Unlike Auto-GPT, which creates agents on the fly, BabyCoder uses a handful of predefined, specialized agents that work together to come up with a functional project.
Workflow
- The Task Initializer Agent creates a list of tasks to be carried out.
- The Human Input Agent works with the user to refine the tasks.
- The Task Assigner Agent passes each task to a capable executor, such as:
- the Command Executor Agent
- the Code Writer Agent
- the Code Refactor Agent
Let’s try this out on the same task – creating a sentiment analysis model.
- The user provides a prompt setting the goal.
- The Task Initializer Agent comes up with the tasks – in our case, 14 of them. The tasks span from creating the project and data directories, through installing the required Python modules, to writing the specific code files.
- The Command Executor Agent starts working on the first task – and performs it flawlessly: mkdir sentiment_classifier/data
- The context is extended by the result of the first task.
- The execution agent picks up the next task on the list.
- Workflow loops back to step 3 – task execution.
- … The program crashes.
While most likely not the intended workflow, that is exactly how this turned out during our tests. At this point the BabyCoder either forgets what directories it created and attempts to use incorrect paths, looping itself aimlessly, or exceeds the available LLM context window with a loud error message straight from the OpenAI API.
Splitting the main task into multiple small, specialized tasks is a brilliant workaround for the current limitations of LLMs, including their limited attention span and difficulty handling complex workflows.
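The division of labor described above can be sketched in a few lines. The agent names mirror the list earlier, while their internals are our own stubs rather than BabyCoder's actual prompts:

```python
def task_initializer(goal):
    # Breaks the goal into small, typed tasks (hard-coded here;
    # an LLM call in the real project).
    return [
        ("command", "mkdir sentiment_classifier/data"),
        ("code", "write train.py"),
        ("refactor", "clean up train.py"),
    ]

# Each specialist is a stub; in BabyCoder these are separate LLM prompts.
EXECUTORS = {
    "command": lambda task: f"CommandExecutor ran: {task}",
    "code": lambda task: f"CodeWriter produced: {task}",
    "refactor": lambda task: f"CodeRefactor improved: {task}",
}

def run(goal):
    results = []
    for kind, task in task_initializer(goal):  # Task Assigner routing
        results.append(EXECUTORS[kind](task))
    return results

results = run("build a sentiment classifier")
```

Each task is small enough to fit comfortably into a single prompt – which is exactly the point of the decomposition.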
Pros of BabyCoder
- Very simple implementation
Cons of BabyCoder
- Unreliable – tends to crash with an error before finishing the tasks it set for itself
What makes the project stand out
- Predefined specializing agents
- The number of projects it has inspired
smol developer
smol developer lives up to its name by being lightweight and simple.
Workflow
- The user initiates the app-building process with a singular prompt.
- The smol developer creates a detailed plan that outlines the structure of the app to be generated, listing all the variables and functions. This plan will be included in each prompt the smol developer uses later on.
- The agent creates the code according to the first task.
- Memory is extended by the code generated.
- The agent creates the code according to the second task… And so on.
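The defining trick – prepending the same global plan to every per-file prompt – can be sketched like this, where `fake_llm` is a stub for the real model call:

```python
def fake_llm(prompt):
    # Stub for the model call: echoes the last prompt line.
    return f"# generated from: {prompt.splitlines()[-1]}"

def build_app(objective, files):
    # The plan is created once and prepended to every later prompt,
    # so each file generation sees the same global picture.
    plan = f"Plan for {objective}: files={files}"
    codebase = {}
    for name in files:
        prompt = plan + "\n" + f"Write {name}"
        codebase[name] = fake_llm(prompt)
    return codebase

codebase = build_app("a sentiment analysis app", ["train.py", "inference.py"])
```

This is also where the main limitation comes from: the plan plus all the accumulated code has to fit into a single LLM context window.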
Pros of smol developer
- Simplicity
- Linear workflow – no surprises in terms of the number of calls made to the OpenAI API
Cons of smol developer
- Lack of advanced memory handling – smol developer stuffs all the information together, so it needs to fit inside the LLM context
- Requires multiple tries before you get the prompt right
What makes the project stand out
- Simplicity
GPT Engineer
GPT Engineer is a tad bigger, but still a lightweight approach to code generation. It starts from a prompt – but it also asks extra clarifying questions.
Workflow
- The user provides the main objective.
- GPT Engineer asks clarifying questions about the target, focusing on the technical aspects of the solution (e.g., data format, preference for the model used). The user can answer, or can leave it to the model to decide by itself.
- The agent prepares the “Core Classes, Functions, and Methods” section, which lists the technical details of the solution. For example: `SentimentDataModule`: A PyTorch Lightning DataModule for loading and preparing the sentiment analysis dataset.
- Based on the code functionalities described above, GPT Engineer starts to create the code, one element after another.
- Once the code is generated, the agent gives us the option to install the necessary libraries (from the requirements.txt file) and launch its creation.
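A toy rendition of this flow – the questions, component names and fallback behavior are our own illustration of the pattern, not GPT Engineer's actual prompts:

```python
QUESTIONS = ["What data format?", "Which model?"]

def clarify(answers):
    # Unanswered questions are left "to the model to decide".
    return {q: answers.get(q) or "model decides" for q in QUESTIONS}

def plan_components(spec):
    # The "Core Classes, Functions, and Methods" step (hard-coded here;
    # an LLM call in the real tool).
    return ["SentimentDataModule", "SentimentClassifier", "train()"]

def generate(spec):
    # One generation call per planned component (stubbed).
    return {c: f"# code for {c}, given {spec}" for c in plan_components(spec)}

files = generate(clarify({"What data format?": "CSV"}))
```

Because every component is generated against the same clarified spec, the resulting files are more consistent with each other – and with our expectations – than a single monolithic generation would be.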
It seems that the code produced by GPT Engineer is already much better than that produced by its predecessors, although it is still far from perfect.
Pros of GPT Engineer
- Clarifying questions – the agent asks questions to make sure it produces code that is consistent with our expectations
- The workflow is concise, as is the code itself, which saves us time and money
- Gives you the ability to install dependencies and execute code
Cons of GPT Engineer
- The questions do not always make sense – sometimes the agent asks about information already included in the initial prompt
What makes the project stand out
- Objective clarification
- Dependency handling and code execution
- Alternative approaches to code generation, for example Test-Driven Development (TDD)
- At the end of the execution, GPT Engineer collects data on the quality of its work – signaling the authors’ intention to keep working on improvements
MetaGPT
MetaGPT originated as a research paper, with its focus centered on multi-agent collaboration.
This project is heavily inspired by how human developers manage IT projects. MetaGPT is written in a way that resembles a software company, hiring agents for positions common in the industry – like architects, product managers and even customer service reps. MetaGPT is a complex tool, which is why we decided to describe it in more depth than usual.
Workflow
- The user provides the prompt.
- The boss agent broadcasts the goal.
- The agents check the shared message board for information relevant to them.
- If an agent sees something of interest it reacts.
- Think – keeping in mind the context of the role, see what needs to be done and reflect on how to do it.
- Observe – check the message board and read about the relevant events.
- Act – perform a task, call an LLM or another function.
- Broadcast – share the execution results and other action records with the environment.
- After performing their actions, agents broadcast their results to the message board.
- Back to point 2.
The description above, while accurate, doesn’t do much in terms of explaining what actually happens. For a step-by-step walkthrough, see below:
Example of an actual workflow
Portrayed using the sentiment analysis model creation prompt.
- The user provides the prompt.
- The boss agent picks up the prompt and passes it on to the company message board.
- All of the employee agents look through the message board, looking for any information of interest.
- Alice, the Product Manager, picks up the requirements and gets to work:
- Polishing the requirements
- “Develop a sentiment classifier based on product reviews and their ratings” (…)
- Creating the user stories
- “As a data scientist, I want to train a sentiment classifier so that I can predict the sentiment of product reviews” (…)
- Drafting the UI
- “The scripts should be command-line executable and provide clear output to the user.” (…)
- Performing the competitive analysis by searching for the information on the internet
- “Google’s AutoML Natural Language: Offers sentiment analysis but lacks customization for specific tasks” (…)
- Documenting her findings using mermaid charts (see figure 3)
- Posting the results of her work to the company message board
- Bob, the Architect, sees that Alice is done and picks up the baton.
- Drafting the implementation approach
- “We will use the transformers library from Hugging Face, which provides pre-trained BERT models and utilities for fine-tuning on a specific task.” (…)
- Deciding on the python package name
- “sentiment_classifier”
- Creating the file list
- “train.py”,
- “validate.py”,
- “inference.py”,
- “model.py”,
- “utils.py”
- Charting the data structures and interface definitions (see figure 3)
- Posting the results of his work to the company message board
- Eve, the Project Manager, details the tasks
- Analyzing the dependencies
- Creating the requirements.txt
- Creating full API specification
- Analyzing the program flow
- “train.py”, “Contains the main training loop. Depends on model.py and utils.py”
- Posting the results of her work to the company message board
- Alex, the Engineer, writes the code for each file.
Environment
The agents share an Environment where all the memories are stored. Memories are essentially a list of messages broadcasted by agents. Agents with specific roles can browse the environment to retrieve and store memories relevant to them. Each agent possesses a distinct context, subscribing to different types of events and striving to remember only what is directly relevant to its role.
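A minimal sketch of such an environment – one global message list with per-role subscriptions, where each role keeps only the memories relevant to it. The class, role and topic names are our own, not MetaGPT's actual API:

```python
class Environment:
    def __init__(self):
        self.messages = []      # the global message board
        self.subscribers = {}   # role -> (topics of interest, private memory)

    def subscribe(self, role, topics):
        self.subscribers[role] = (set(topics), [])

    def broadcast(self, topic, payload):
        self.messages.append((topic, payload))
        for topics, memory in self.subscribers.values():
            if topic in topics:
                memory.append(payload)  # each role keeps only relevant events

    def memory_of(self, role):
        return self.subscribers[role][1]

env = Environment()
env.subscribe("architect", ["requirements"])
env.subscribe("engineer", ["design"])
env.broadcast("requirements", "user stories ...")
env.broadcast("design", "file list ...")
```

Every message lands on the board, but the architect only remembers the requirements and the engineer only remembers the design – each agent's private context stays small and role-specific.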
Working in rounds
The execution of the program is limited by the number of rounds set, so it’s important to leave enough rounds for every agent to have an opportunity to act.
For example, during the first round, the only message in the environment comes from the boss – the initial user prompt. The Product Manager will pick up the requirements, perform her actions and post the results to the environment message buffer. However, no other agent will pick up this information during the same round – they have already finished their observations. Starting from the next round, Bob, the Architect, will pick up the user stories and start coming up with the design. A couple of rounds have to pass before any code gets written.
Each agent can react any time an event of interest is registered in the environment.
In some cases multiple agents will take action during the same round – a state management system has been implemented to prevent conflicts between asynchronous actions, allowing an agent to indicate whether it is busy or idle. This design might make it easier to develop this solution towards multithreading – but at the moment it does not support parallelism.
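Putting the message board, per-round observation and busy/idle state together, the round mechanics might look roughly like this – the roles and message kinds are illustrative, not MetaGPT's actual implementation:

```python
class Role:
    def __init__(self, name, watches, produces):
        self.name, self.watches, self.produces = name, watches, produces
        self.done = False   # busy/idle state: act at most once here

    def react(self, board):
        # Observe: look for a message of interest; act once, then idle.
        if self.done or not any(m["kind"] == self.watches for m in board):
            return None
        self.done = True
        return {"kind": self.produces, "by": self.name}

def run_rounds(board, roles, rounds):
    for _ in range(rounds):
        # Everyone observes the same board state within a round;
        # results are broadcast only after the round ends.
        new = [m for r in roles if (m := r.react(board)) is not None]
        board.extend(new)
    return board

roles = [
    Role("Alice (Product Manager)", watches="goal", produces="requirements"),
    Role("Bob (Architect)", watches="requirements", produces="design"),
    Role("Alex (Engineer)", watches="design", produces="code"),
]
board = run_rounds([{"kind": "goal", "by": "boss"}], roles, rounds=3)
```

Note that it takes all three rounds before the "code" message appears: each role only reacts to what was already on the board when the round started, which is why it is important to budget enough rounds.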
The mermaid charts
MetaGPT outputs very detailed schematics of its designs using the mermaid diagramming tool.
Pros of MetaGPT
- Standardized Operating Procedures – well-designed workflow reduces chaos and ensures the tasks are handled consistently
- The charts!
- High quality solutions
- A budget cap – a spending limit keeping us safe
Cons of MetaGPT
- Not ideal for simple tasks – the complex workflow can be overkill
What makes the project stand out
- Complexity – in terms of tools, memory handling and overall workflow
- Academic background – an excellent paper describes MetaGPT in detail
Comparing the models
| | Auto-GPT | BabyCoder | smol developer | GPT Engineer | MetaGPT |
| --- | --- | --- | --- | --- | --- |
| Workflow | Loop | Linear | Linear | Linear | Loop |
| Order of actions | Sequential | Sequential | Sequential | Sequential | Asynchronous |
| Purpose | Multipurpose | Coding | Coding | Coding | Coding |
| Memory | Selectively retrieved | Selectively retrieved | Context only | Context only | Selectively retrieved |
| Agents | Created on the fly | Predefined | Predefined | Predefined | Predefined |

Table 1. Comparison of the discussed models. Source: own study.
None of the models we ran produced a working sentiment analysis model. AutoGPT and BabyCoder were not able to finish their processing without errors. GPT Engineer and MetaGPT came closest to a working solution. MetaGPT impressed us the most due to the extras it produces – charts and specifications. Of the two, it was also easier to use – GPT Engineer takes more effort to start due to the extra clarifying questions it asks at the beginning.
Specialized coding agents
DemoGPT
DemoGPT allows users to create interactive Streamlit apps with just a prompt. What makes DemoGPT different from other solutions is that it specializes in generating LLM-based applications powered by Streamlit and LangChain, as opposed to the more generic agents described above.
The core functionalities of DemoGPT revolve around four distinct stages (see Figure 4):
- Planning: DemoGPT generates a comprehensive plan based on the user’s instructions.
- Task Creation: Next, specific tasks are created by DemoGPT, utilizing the generated plan and the provided instruction as a guide.
- Code Snippet Generation: The generated tasks are then transformed into code snippets.
- Final Code Assembly: The code snippets are merged to form a final code, resulting in an interactive Streamlit app.
Moreover, each of these steps undergoes refinement before progressing to the next stage, ensuring optimal quality and accuracy throughout the process. Once the user approves the final outcome, all the generated artifacts, including the plan, tasks, and code snippets, are saved in the database for future retrieval when generating similar applications.
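The four stages plus the per-stage refinement form a simple linear pipeline. A sketch with stubbed stage functions – in DemoGPT each stage and each refinement pass is an LLM call:

```python
def plan(instruction):
    return [f"plan for: {instruction}"]

def create_tasks(plan_items):
    return [f"task from {p}" for p in plan_items]

def to_snippets(tasks):
    return [f"# code for {t}" for t in tasks]

def assemble(snippets):
    return "\n".join(snippets)   # the final Streamlit app source (stubbed)

def refine(artifact):
    # Each stage's output is refined before the next stage runs
    # (an identity function here; an extra LLM pass in DemoGPT).
    return artifact

def demo_pipeline(instruction):
    p = refine(plan(instruction))
    t = refine(create_tasks(p))
    s = refine(to_snippets(t))
    return assemble(s)

app_source = demo_pipeline("build a summarizer app")
```

Unlike the looping agents discussed earlier, the pipeline shape makes the number of model calls predictable: one per stage plus one refinement pass each.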
By utilizing these functionalities, DemoGPT is able to convert user instructions into interactive applications, making it an effective tool for LLM-based application development.
Wasp
At its core, Wasp tries to redefine full-stack web application development. It covers three significant domains of a web application, creating a well-functioning client (front-end), server (back-end), and the database, all connected in one ecosystem.
Underneath, Wasp is fueled by React, Node.js, and Prisma. With these technologies at its backbone, Wasp defines web components, server queries, and actions.
The essence of its effectiveness is Wasp’s built-in compiler. Thanks to this, you only need to describe your features in a special config file, and Wasp will compile them for you. The output will be the client app, server app and the whole codebase needed for deployment.
Everything described above is summarized in the figure below:
And this is what a very basic sample Wasp config file looks like:
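For illustration, a minimal `main.wasp` file might look roughly like this – the app name, route and component path are our own example, and the exact syntax depends on the Wasp version:

```
app todoApp {
  wasp: {
    version: "^0.11.0"
  },
  title: "Todo App"
}

route RootRoute { path: "/", to: MainPage }
page MainPage {
  component: import Main from "@client/MainPage.jsx"
}
```

From a declaration like this, the Wasp compiler generates the React client, the Node.js server and the wiring between them.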
You may wonder how generative AI tools come into play here, when we are the ones writing the code and the configuration file. Well, Wasp has been on the market for quite some time; however, its team only recently started integrating GPT to develop a full web application from nothing but a natural language specification of the objective.
The model does a great job of creating Wasp configurations and individual components. It facilitates the quick construction of a simple application prototype, complete with a database, an authentication system, and the backbone of essential CRUD operations.
You can test Wasp’s capabilities in generating web applications from scratch here, where a free playground has been provided.
The future of AI Agents
We’re still in the early stages of LLM-powered AI agent development, but these agents already have impressive capabilities. We believe that in the next few years we will see rapid improvements in the field, along with a rapid increase in the number of products based on AI agents. These solutions will include not just language models, but also multimodal models, allowing agents to work with different types of data, such as images and videos. In fact, models like GPT-4 and Bard are already multimodal and can take images as input, but it seems these capabilities are not yet fully utilized in AI agents.
We may even ask ourselves: can these AI agents lead to AGI?
Summary
In this blog post, we’ve discussed the rapid growth of AI Agents, fueled by recent breakthroughs in LLMs. These AI agents attempt to write the complete code for a project from the problem statement alone – albeit not always successfully. There are a variety of solutions offering different approaches to code writing – from the broad capabilities of MetaGPT to the specialized nature of Wasp. However, every one of them shares a similar core – an LLM answering questions and creating code – and they can only be as good as the underlying model.
Currently, these solutions are far from being a replacement for humans – but if we give them time, they may come to surprise us all.
If you are interested in evaluating these agents and the ones we have created, as well as the lessons we learned from creating AI support for writing code, stay tuned for the upcoming second part of our blog post!
Need a reliable partner for a forward-thinking AI project for your business? Contact us today, and our experts will help you find a bespoke solution tailored to answer your business challenges. Choose our AI development services!