Self-correcting Code Generation Using Multi-Step Agent

This cookbook demonstrates how to build a self-correcting code generation pipeline using the smolagents framework and Multi-Step Agent. By integrating iterative code quality reviews and automated unit tests, the agent continuously refines its output to produce robust and correct code. As a result, this approach boosts the success rate from a 53.8% baseline using single-request LLMs to an impressive 81.8%, underscoring a significant improvement in quality and reliability.

Introduction

Agentic Applications

Agentic applications are autonomous, goal-oriented systems that perform tasks with minimal human intervention. They excel at:

  • Processing complex workflows
  • Adapting to real-time feedback
  • Making decisions based on learned patterns
  • Executing multi-step tasks

These applications leverage Large Language Models (LLMs) to interpret information, generate plans, and execute actions with sophisticated decision-making capabilities.

Existing Solutions

Aider is a command-line tool that acts as a GPT-powered pair programmer. It enables interactive code modification through natural language commands, integrates with git for version control, and leverages test results for code refinement. Aider excels in conversational, iterative code development.

OpenHands focuses on autonomous code generation and execution. It decomposes tasks, executes generated code, and uses feedback for self-correction. OpenHands aims to create agents capable of handling complex coding tasks with minimal human intervention, emphasizing planning and tool integration.

The smolagents Framework

smolagents is a Hugging Face framework designed for building agentic applications. Key features include:

  • Pre-built multi-step agents
  • Custom tool development support
  • Detailed action and result history tracking
  • Clear tool specification system

Multi-Step Agent Architecture

The Multi-Step Agent serves as the core interface for agentic applications in smolagents. It implements:

  1. System Prompts: Defines agent behavior and capabilities
  2. Step Execution: Manages workflow and memory operations
  3. Tool Integration: Coordinates multiple tool interactions
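
As a minimal illustration of this loop, the sketch below builds a bare ToolCallingAgent (assuming an OPENAI_API_KEY is configured; the task string is just an example, and this agent does not yet use this cookbook's custom tools):

# A bare ToolCallingAgent: it plans, calls tools, records each step in memory,
# and stops when it produces a final answer or reaches max_steps.
from smolagents import OpenAIServerModel, ToolCallingAgent

model = OpenAIServerModel(model_id="o3-mini")
agent = ToolCallingAgent(tools=[], model=model, add_base_tools=True, max_steps=5)
agent.run("Write a Python function that reverses a string and show it on 'smolagents'.")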

Goal of This Cookbook

The goal of this cookbook is to create a code generation agent using the smolagents framework.
In contrast to other solutions, this approach requires no user input beyond two artifacts: the task instructions (a set of rules and expectations regarding the generated code) and a set of predefined unit tests.
To achieve this goal, we will create two dedicated tools:

  • Code Reviewer, which gives the agent a ‘human-like’ review of the generated code
  • Unit Test Runner, which verifies the correctness of the code

These tools are used by a customized Multi-Step Agent provided by smolagents.

Expected agentic flow of the solution: the agent drafts an implementation, asks the Code Reviewer for quality feedback and applies it, then runs the Unit Test Runner and keeps refining the code until all tests pass or the step limit is reached.

Implementation

Environment Setup

!pip install openai==1.63.2
!pip install smolagents==1.9.2
!pip install jinja2==3.1.5
from pathlib import Path
import os
import tempfile

import yaml
from jinja2 import Template
from openai import OpenAI
from smolagents import OpenAIServerModel, Tool, ToolCallingAgent
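
The later snippets also reference a prompts file path and a small file-reading helper that are not defined in this excerpt; a minimal sketch of the assumed definitions (names and paths are illustrative):

# Assumed setup, not part of the original notebook: the OpenAI API key is expected
# in the OPENAI_API_KEY environment variable, and later cells reference a prompts
# file path and a helper that reads the unit test file.
CUSTOM_PROMPTS_FILE = Path("custom_prompts.yaml")  # hypothetical path


def read_file(path) -> str:
    """Return the contents of a text file."""
    return Path(path).read_text()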

AI Agents & Tools

Iterative code generation agent

This snippet defines the code generation agent on top of a selected model. As mentioned at the beginning, the agent is responsible for achieving the final goal of code generation using the provided tools; all it needs is a connection to the model.

The only customization compared to the base ToolCallingAgent class is an additional prompt for code generation, whose rules enforce the intended agentic workflow and the code requirements.

class IterativeCodeAgent(ToolCallingAgent):
    def __init__(self, *args, **kwargs):
        # Load the custom code generation prompt (a Jinja template) from the prompts file.
        with open(CUSTOM_PROMPTS_FILE) as fp:
            self.run_prompt = yaml.safe_load(fp)["code_generation_agent"]

        super().__init__(*args, **kwargs)

    def run_agent(self, task_instructions: str, *args) -> None:
        # Render the task instructions into the prompt template and start the agent run.
        prompt = Template(self.run_prompt)
        task = prompt.render(instructions=task_instructions)
        super().run(task, *args)
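
The full contents of CUSTOM_PROMPTS_FILE are not shown in this cookbook; the sketch below illustrates the layout these snippets assume (the wording of the code_generation_agent template is an illustration, not the original prompt):

# Hypothetical layout of custom_prompts.yaml assumed by IterativeCodeAgent and the
# reviewer tool below: one Jinja template for the run prompt, one reviewer prompt.
import yaml
from jinja2 import Template

EXAMPLE_PROMPTS_YAML = """
code_generation_agent: |
  Generate Python code that fulfils the instructions below.
  Use the code_reviewer tool to collect quality feedback and apply it.
  Use the unit_tests_runner tool and keep refining the code until all tests pass.
  Instructions:
  {{ instructions }}
reviewer_prompt: |
  Check if the code is following the principles below:
  ...
"""

prompts = yaml.safe_load(EXAMPLE_PROMPTS_YAML)
task = Template(prompts["code_generation_agent"]).render(instructions="Implement dice_score(dice).")
print(task)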

Code Quality Reviewer

CodeQualityReviewerTool is a prompt-based tool that produces ‘human-like’ reviews of the provided code.
The goal of each run is to check that the code quality is acceptable in five selected areas:

  1. Readability
  2. Maintainability
  3. Efficiency
  4. Robustness
  5. PEP-8

For each area, the Reviewer is asked to judge whether the code meets the expectation and to give optional feedback on how to improve it; the last line of the response is a single boolean that is True only when all principles are satisfied.

Let's take a look at the prompt.
The first part lists the code principles the tool should review.
The second part is a list of tasks for the Reviewer that shape the output.

  Check if the code is following the principles below:
      1. Readability - the code is easy to understand, even by someone who didn't write it. It should use meaningful variable names,
      consistent formatting, and type annotations for variables and arguments.
      2. Maintainability - the code is easy to modify, update and debug. It should follow coding standards, avoid overly complex logic and
      provide a modular solution if the requested task requires it.
      3. Efficiency - the code uses resources effectively. It should minimize execution time and memory usage.
      4. Robustness - the code handles errors. It should wrap risky code blocks in try-except.
      5. PEP-8 - the code follows the PEP 8 style guide. This covers aspects like indentation, line length and naming conventions.

  Tasks:
      1. Make sure the output follows the expected output format.
      2. For each principle listed above, analyze whether the code meets its respective requirements.
      3. Request any changes to the provided code as part of the comment.
      4. Don't assume any external documentation.
      5. Add a summary at the end of the feedback gathering all the suggestions above.
      6. As the last line of the response, output only a boolean value: True if all principles are satisfied, False if any of them is not.

  Expected output example:

  1. Readability:
    - The code uses clear and descriptive names for functions and variables.
    - A type hint is missing for the input parameter of the function "run(input_string):"

  2. Maintainability:
    - The solution is modularized into several functions
    - Error checking and consistent structure make it easy to modify or extend functionalities.

  3. Efficiency:
    - Code has been written with optimal structures
    - The solution is using efficient Python built-in functions

  4. Robustness:
    - The code includes appropriate error handling through type checks and try-except blocks

  5. PEP-8:
    - The code follows PEP-8 guidelines: proper indentations, spacing, meaningful names and line lengths

  Summary:
    - Readability: Add type hinting in the declaration of function run(input_string). 
      Proposed solution: "run(input_string: str) -> None:"
    - Maintainability: No changes required
    - Efficiency: No changes required
    - Robustness: No changes required
    - PEP-8: No changes required

  False

Example output of the run of the tool:

Observations: 1. Readability:
  - The function includes a detailed docstring explaining the rules and expected input/output.
  - Variable names such as "dice", "counts", and "score" are descriptive.
  - The inline comments clarify the purpose of each code block.
  - Suggestion: Adding type hints for the function’s parameter and return value would further improve clarity.

2. Maintainability:
  - Code is modularized into a single function which encapsulates the score calculation.
  - The logic is straightforward, making it relatively easy to modify or extend in the future.
  - Suggestion: Consider moving the import statement ("from collections import Counter") to the top of the module 
if this function is part of a larger module, to conform to common best practices.

3. Efficiency:
  - The use of the collections.Counter is efficient for counting dice occurrences.
  - The algorithm iterates through a small fixed range (numbers 2 to 6) and performs simple arithmetic operations.
  - No unnecessary loops or computations were identified.

4. Robustness:
  - The function assumes that the input is a list of integers representing dice outcomes.
  - There is no error handling (e.g., try-except blocks) to catch unexpected input types or values.
  - Suggestion: Add input validation or try-except blocks to handle cases where "dice" may not be a list or may 
contain non-integer values.

5. PEP-8:
  - The function follows general PEP-8 style guidelines with appropriate indentation and spacing.
  - The line lengths are acceptable.
  - Naming conventions for functions and variables are clear.
  - Suggestion: As mentioned under Readability, adding type hints would further align with modern Python best 
practices.

Summary:
  - Readability: Add type hints in the declaration of the function, e.g., "def calculate_score(dice: list[int]) -> 
int:".
  - Maintainability: Move the import statement ("from collections import Counter") to the top of the file if 
possible.
  - Efficiency: No changes required.
  - Robustness: Introduce input validation or try-except blocks to handle unexpected data types and ensure robust 
error handling.
  - PEP-8: Add type hints for parameters and return types to further follow modern best practices.

False
class CodeQualityReviewerTool(Tool):
    name = "code_reviewer"
    description = "Task of this tool is to determine, if given code has been created in alignment with Python code quality standard. It should check then the structure, docstrings, comments, readability of variables and other important aspects for human user. Output should be used as input for code modification tasks."
    inputs = {
        "code": {
            "type": "string",
            "description": "Code provided by previous step. It should be reviewed if it's following code quality requirements."
        }

    }
    output_type = "string"

    def __init__(self, model_id, **kwargs):
        super().__init__(**kwargs)

        self.client: OpenAI = OpenAI()
        self.model_id = model_id

        # Load the reviewer system prompt from the shared prompts file.
        with open(CUSTOM_PROMPTS_FILE) as file:
            self.prompt = yaml.safe_load(file)["reviewer_prompt"]

    def forward(self, code: str) -> str:
        # Send the code to the review model, with the reviewer prompt as the system message.
        response = self.client.chat.completions.create(
            model=self.model_id,
            messages=[
                {
                    "role": "system",
                    "content": self.prompt
                },
                {
                    "role": "user",
                    "content": code
                }
            ]
        )
        result = response.choices[0].message.content
        return result
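
Outside the agent loop, the tool can also be called directly, which is handy when tuning the reviewer prompt; a minimal sketch (the sample code is illustrative):

# Hypothetical standalone call to inspect the raw review text.
review_tool = CodeQualityReviewerTool(model_id="o3-mini")

sample_code = (
    "def add(a, b):\n"
    "    return a + b\n"
)
print(review_tool.forward(sample_code))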

Unit Tests Runner

The UnitTestsRunnerTool runs the provided unit tests against the generated code.
Its output is a typical unit test report, which is cleaned up and fed back to the agent as input for the next step; the idea is to use this output to drive changes in the generated code.
The verbosity of the test run can be adjusted, which influences both cost (number of tokens) and the quality of the feedback. For the experiments later in this cookbook it has been left at the default.

import io
import re
from contextlib import redirect_stdout

import pytest


class UnitTestsRunnerTool(Tool):
    name = "unit_tests_runner"
    description = "Task of this tool is to run the unit tests associated with the programming task and return their status."
    inputs = {
        "generated_code": {
            "type": "string",
            "description": "Code provided by the previous step. It will be checked against the unit tests."
        }
    }
    output_type = "string"

    def __init__(self, tests_path, **kwargs):
        super().__init__(**kwargs)
        self.tests_path = tests_path

    def forward(self, generated_code: str) -> str:
        # Write the generated code and the unit tests into a single temporary module.
        with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix=".py") as temp_file:
            temp_file_path = temp_file.name
            temp_file.write(generated_code + "\n\n" + read_file(self.tests_path))

        # Run pytest on the temporary module and capture its console output.
        with io.StringIO() as buffer, redirect_stdout(buffer):
            pytest.main([temp_file_path])
            pytest_output = buffer.getvalue()

        os.remove(temp_file_path)

        # Strip ANSI colour codes so the agent receives plain-text feedback.
        ansi_escape = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
        result = ansi_escape.sub('', pytest_output)

        return result
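
The runner can likewise be tried on its own; the snippet below feeds it a deliberately broken implementation (the test path is hypothetical). To change the verbosity, flags such as "-q" or "--tb=short" can be added to the pytest.main call inside forward, trading feedback detail against token count.

# Hypothetical direct run against a deliberately failing implementation;
# "tests/test_dice.py" stands in for the real unit test file.
runner = UnitTestsRunnerTool(tests_path="tests/test_dice.py")

broken_code = (
    "def dice_score(dice):\n"
    "    return 0\n"
)
print(runner.forward(broken_code))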

Example Run

The example below demonstrates how to use the newly created tools. The code generation task is straightforward: the instructions request a score engine for a dice game.

The agent's goal is to generate code that passes the unit tests. Given the simplicity of the task, we can expect success within a few steps: 1-2 code generation steps and 1-2 unit test runs.

The code snippet below demonstrates how to use the agent.

code_quality_review_tool = CodeQualityReviewerTool(model_id="o3-mini")
unit_test_runner = UnitTestsRunnerTool(test_file)

model = OpenAIServerModel(model_id="o3-mini")

agent = IterativeCodeAgent(
    tools=[code_quality_review_tool, unit_test_runner],
    model=model,
    managed_agents=[],
    add_base_tools=True,
    max_steps=10,
    name="master_agent",
    description="Coding Agent"
)

agent.run_agent(task_instructions=instructions)
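
The `instructions` and `test_file` variables above are assumed to hold the task description and the path to its unit test file; a minimal sketch with hypothetical file names:

# Hypothetical inputs for the dice-game task, defined before running the agent.
instructions = read_file("tasks/dice_game/instructions.md")   # assumed path
test_file = "tasks/dice_game/test_dice_game.py"                # assumed path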

Let's take a look at the processing steps that used the UnitTestsRunnerTool.

The first unit test iteration follows the default smolagents structure:

  • Step input details with the unformatted code blob
  • Output from the tool (1 test failed, 10 tests passed)
Observations: ============================= test session starts ==============================
platform linux -- Python 3.12.7, pytest-8.3.5, pluggy-1.5.0
rootdir: /tmp
plugins: anyio-4.8.0, typeguard-4.3.0
collected 11 items

../../../../../../tmp/tmphncy6usk.py ..F........                         [100%]

=================================== FAILURES ===================================
___________________________ TestDiceGame.test_score_C ___________________________

self = <tmp4q39of1z.TestDiceGame testMethod=test_score_C>

    def test_score_C(self):
        input = [1, 2, 3, 4, 5, 6]
>       self.assertEqual(dice_score(input), 2000)
E       AssertionError: 150 != 2000

/tmp/tmphncy6usk.py:66: AssertionError
=============================== warnings summary ===============================
../../venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1277
  /home/piotr/PycharmProjects/cookbooks/venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1277: 
PytestAssertRewriteWarning: Module already imported so cannot be rewritten: anyio
    self._mark_plugins_for_rewrite(hook)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================ short test summary info =================================
FAILED ../../../../../../tmp/tmphncy6usk.py::TestDiceGame::test_score_C - AssertionError: 150 != 2000
======================== 1 failed, 10 passed, 1 warning in 0.02s =========================

In the second iteration of the unit test tool, the structure is similar, but the output confirms the correctness of the code.

Observations: ============================= test session starts ==============================
platform linux -- Python 3.12.7, pytest-8.3.5, pluggy-1.5.0
rootdir: /tmp
plugins: anyio-4.8.0, typeguard-4.3.0
collected 11 items

../../../../../../tmp/tmphncy6usk.py ...........                         [100%]

=============================== warnings summary ===============================
../../venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1277
  /home/piotr/PycharmProjects/cookbooks/venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1277: 
PytestAssertRewriteWarning: Module already imported so cannot be rewritten: anyio
    self._mark_plugins_for_rewrite(hook)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 11 passed, 1 warning in 0.02s =========================

Evaluation

Benchmark

For the evaluation we will be using the Polyglot Benchmark, available here: Polyglot Benchmark.

It is aider's benchmark collection of programming exercises in multiple programming languages. Each task consists of three elements: an instructions file, a code base (with the name of the expected class/function) and unit test cases to evaluate the created code.
With that in mind, the evaluation is run on 33 Python-related tasks, each described by a pair of instructions and unit tests, whereas the default Aider Leaderboard uses 225 exercises across 5 technologies (C++, Go, Java, JS, Python).

More information and a detailed leaderboard are available here: Leaderboards
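
For reference, the per-task evaluation can be organized as a simple loop over the Python exercises; the directory layout and file names below are a hypothetical sketch, not the benchmark's actual harness:

# Hypothetical evaluation loop: one fresh agent per exercise.
from pathlib import Path

for task_dir in sorted(Path("polyglot_python_tasks").iterdir()):    # assumed layout
    task_instructions = (task_dir / "instructions.md").read_text()  # assumed file name
    task_tests = task_dir / "tests.py"                              # assumed file name

    agent = IterativeCodeAgent(
        tools=[CodeQualityReviewerTool(model_id="o3-mini"),
               UnitTestsRunnerTool(task_tests)],
        model=OpenAIServerModel(model_id="o3-mini"),
        max_steps=10,
        name="master_agent",
        description="Coding Agent",
    )
    agent.run_agent(task_instructions=task_instructions)
    # Pass/fail per exercise is then read off the final unit_tests_runner
    # observation in the agent's step logs.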

Results

Our agentic flow demonstrates a marked improvement over baseline models. Specifically, our implementation fully solved 81.8% of the exercises (all of their unit tests passing), and overall it passed 91.9% of the total unit tests. These figures stand in stark contrast to the performance of the baseline o3-mini (medium) model, which achieved only 53.8% correctness on 225 exercises.

This significant improvement in quality underlines the effectiveness of our self-correcting, multi-step approach that continuously refines the code based on both quality reviews and rigorous unit testing.

However, this enhanced quality comes with increased costs. While the o3-mini (medium) model incurs a mean cost of $0.04 per task (with a high-end variant at $0.08 per task), our agentic solution has a mean cost per task of $0.61. In relative terms, our approach costs over 15 times more than the baseline setup. Despite this, the dramatic boost in code quality and robustness clearly justifies the additional expenditure for applications where higher accuracy and reliability are critical.

Below is a comparative summary:

Model               Percent Correct    Mean cost per task
o3-mini (high)      60.4%              $0.08
o3-mini (medium)    53.8%              $0.04
agent (o3-mini)     81.8%              $0.61

In summary, while our agentic process incurs higher costs, the leap in performance—from a modest 53.8% to a robust 81.8% success rate—is both significant and compelling, validating the benefits of an iterative, self-correcting approach for high-quality code generation.

Summary

This cookbook has demonstrated how to build a self-correcting code generation pipeline using smolagents. By integrating code quality reviews and unit tests, the agent iteratively refines its code, improving its quality and correctness. The benchmark results highlight the effectiveness of this approach, showing significant improvements compared to single LLM requests. This methodology provides a robust and reliable way to generate high-quality code through an automated, iterative feedback loop. The implementation of custom tools like the CodeQualityReviewer and UnitTestsRunner allows for precise control over the code generation process, ensuring that the generated code meets specific quality standards and functional requirements. This approach not only enhances the reliability of LLM-generated code but also streamlines the development workflow, reducing the need for manual code reviews and debugging.
