Business Utility of Large Language Models as Exploratory Data Analysis Agents

Rafał Łabędzki; Patryk Miziuła; Hubert Rutkowski; Szymon Betlewski; Cezary Depta; Szymon Janowski; Jarosław Kochanowicz; Jan Kanty Milczek

2 June, 2026

Abstract

Large Language Models (LLMs) are increasingly used in analytical workflows, but their suitability as exploratory data analysis (EDA) agents in business settings remains uncertain. In practice, a deployable EDA agent must provide not only useful average performance but also sufficient repeatability to support trust in its outputs. We evaluate this requirement in a controlled, business-relevant benchmark built on an agent-based supply chain simulation. The task is to identify supplier-product combinations responsible for low quality and downstream sales loss by reasoning from indirect operational traces rather than from explicit labels. Fifteen model-variant configurations from eight model families were evaluated under four experimental conditions that varied data representation, prompt clarity, and signal strength, with five trajectories per condition. Outputs were scored against deterministic ground truth using the Jaccard index and assessed through a framework that combines mean score (ms), coefficient of variation (CV), exploratory cross-condition significance tests, and Business utility, a risk-adjusted metric that we propose to summarise quality and repeatability in a single operational measure. The results show that most configurations are not reliable enough for autonomous EDA use, even when their average scores appear acceptable. GPT-5.4 with extra-high reasoning effort achieved the strongest overall profile, with an experiment-averaged ms of 0.8748 and an experiment-averaged Business utility of 0.6952, while the next-best configurations lost substantially more utility after variability discounting. Our findings suggest that evaluation of EDA agents should treat average quality, repeatability, and condition sensitivity as complementary dimensions of operational trustworthiness.
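The scoring pipeline described above can be illustrated with a minimal sketch: each trajectory's predicted set of supplier-product pairs is compared with the ground-truth set using the Jaccard index, and the per-condition scores are then summarised by mean score (ms) and coefficient of variation (CV). The supplier and product identifiers, the five example trajectories, the function names, the use of the sample standard deviation, and the handling of empty sets are all assumptions made for illustration and are not taken from the paper. The Business utility metric is not sketched here because its formula is not stated in this abstract.

```python
from statistics import mean, stdev


def jaccard(predicted, truth):
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections of items."""
    a, b = set(predicted), set(truth)
    if not a and not b:
        # Assumption: two empty sets count as a perfect match.
        return 1.0
    return len(a & b) / len(a | b)


def summarise(scores):
    """Return (mean score, coefficient of variation) for one condition."""
    ms = mean(scores)
    # Assumption: sample standard deviation (n - 1 denominator);
    # the abstract does not say which estimator is used.
    cv = stdev(scores) / ms if ms > 0 else float("inf")
    return ms, cv


# Hypothetical ground truth: (supplier, product) pairs responsible for low quality.
truth = {("S1", "P3"), ("S2", "P7")}

# Hypothetical outputs of five trajectories under one experimental condition.
trajectories = [
    {("S1", "P3"), ("S2", "P7")},                # exact match
    {("S1", "P3")},                              # one pair missed
    {("S1", "P3"), ("S2", "P7"), ("S4", "P1")},  # one false positive
    set(),                                       # nothing identified
    {("S2", "P7"), ("S9", "P9")},                # one hit, one false positive
]

scores = [jaccard(t, truth) for t in trajectories]
ms, cv = summarise(scores)
print(f"scores={scores}")
print(f"ms={ms:.4f} CV={cv:.4f}")
```

In this toy example the mean score is moderate while the CV is high, which is the kind of profile the abstract flags as unreliable: an acceptable average can hide large run-to-run variation.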

