
We have published the latest results from Business Utility Evaluation (BU Eval), our benchmark for assessing how well LLMs perform in realistic analytical business workflows.
The current leaderboard is based on the 2026-06-03 evaluation snapshot. BU Eval measures not only whether a model can produce a correct answer, but whether its output is useful from a business perspective. The benchmark reports mean score, instability, and a combined Business Utility metric, which rewards models that are both accurate and repeatable.
Latest ranking

In this round, Claude Opus 4.8 achieved the highest Business Utility score (0.42). It was followed by GPT-5.5 (0.23) and GPT-5.4 (0.22), which produced very similar results. The remaining evaluated models scored between 0.04 and 0.13 in this snapshot.
How to read the result
The Business Utility score is designed to reflect a deployment-oriented intuition: a model is more useful when it combines high average quality with repeatable behavior across runs. In BU Eval, each model is evaluated across multiple trajectories, and failed trajectories are scored as zero. The metric is bounded between 0 and 1, but the benchmark does not define a universal deployment threshold, since acceptable utility depends on use case, risk tolerance, and verification costs.
| Rank | Model | Business Utility |
|---|---|---|
| 1 | Claude Opus 4.8 | 0.42 |
| 2 | GPT-5.5 | 0.23 |
| 3 | GPT-5.4 | 0.22 |
| 4 | Claude Sonnet 4.6 | 0.13 |
| 5 | Gemini 3.1 Pro Preview | 0.12 |
| 6 | Gemini 3.5 Flash | 0.04 |
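The exact formula behind the Business Utility score is not given in this post, so the following is only a minimal sketch of the intuition described above: failed trajectories count as zero, and the result rewards a high mean combined with low run-to-run variation. The combination `mean * (1 - instability)`, the use of a rescaled standard deviation as the instability term, and all function names are assumptions made for illustration. They are not the benchmark's actual definition and will not reproduce the leaderboard values.

```python
from statistics import fmean, pstdev
from typing import Optional, Sequence


def trajectory_scores(raw: Sequence[Optional[float]]) -> list[float]:
    """Turn raw per-trajectory results into scores in [0, 1].

    A failed trajectory is represented here as None (an assumption)
    and is scored as zero, as the post describes.
    """
    return [0.0 if s is None else min(1.0, max(0.0, s)) for s in raw]


def summarize(raw: Sequence[Optional[float]]) -> dict[str, float]:
    """Hypothetical aggregation: mean score, instability, and utility."""
    scores = trajectory_scores(raw)
    if not scores:
        raise ValueError("at least one trajectory is required")

    mean_score = fmean(scores)

    # Assumed instability measure: the population standard deviation of
    # values in [0, 1] is at most 0.5, so doubling it gives a number in [0, 1].
    instability = min(1.0, 2.0 * pstdev(scores))

    # Assumed combination: high mean and low instability are both required,
    # and the result stays bounded between 0 and 1.
    utility = mean_score * (1.0 - instability)

    return {"mean": mean_score, "instability": instability, "utility": utility}


if __name__ == "__main__":
    print(summarize([0.8, 0.8, 0.8]))   # stable runs
    print(summarize([1.0, None]))       # one perfect run, one failed run
```

Under these assumptions, three runs that each score 0.8 keep a utility of about 0.8, while one perfect run paired with one failed run drops to 0, even though its mean score is 0.5. This is the sense in which a repeatable model can be more useful than one that is occasionally excellent.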
Evaluation setup
The current run includes models from Anthropic, OpenAI, and Google. Each model was evaluated using its provider-specific harness, such as Claude Code, Codex, or Gemini CLI. Models were run with default temperature settings and with the highest available reasoning-effort configuration where applicable.
The detailed results and artifacts are available in the public repository. The benchmark currently includes several business simulation tasks, such as bottlenecked employees, machinery malfunctions, marketplace activity, sales representatives, and supply chain scenarios.
About BU Eval
Business Utility Evaluation (BU Eval) is an LLM and AI agent benchmark designed to measure how well AI systems perform in real-world business workflows. Unlike traditional AI benchmarks that focus primarily on answer correctness, BU Eval evaluates whether a model is reliable enough to support decision-making in production environments.
Read more here about why we introduced the Business Utility metric.
The benchmark assesses both analytical quality and result stability, helping organizations understand not only whether a model can solve a business problem, but whether it can do so consistently across repeated runs. This makes BU Eval particularly relevant for teams evaluating AI systems for enterprise deployment, operational automation, AI agents, and data-driven decision support.
BU Eval uses synthetic business simulations that reflect realistic analytical tasks, including supply chain analysis, marketplace monitoring, operational bottleneck detection, and business performance investigation. Models receive simulator-generated datasets and must identify patterns, anomalies, root causes, and actionable business insights from incomplete or indirect evidence. At the core of the benchmark is the Business Utility score, a metric that combines average performance with repeatability and so accounts for both accuracy and stability.