
Abstract
Fine-tuning object detection (OD) models on combined datasets assumes
annotation compatibility, yet datasets often encode conflicting spatial definitions for semantically equivalent categories. We propose an agentic label harmonization workflow that uses a vision-language model to reconcile both category semantics and bounding box granularity across heterogeneous sources before training. We evaluate on document layout detection as a challenging case study, where annotation standards vary widely across corpora. Without harmonization, naïve mixed-dataset fine-tuning degrades a pretrained RT-DETRv2 detector: on SCORE-Bench, which measures how accurately the full document conversion pipeline reproduces ground-truth structure, table TEDS drops from 0.800 to 0.750. Applied to two corpora whose 16- and 10-category taxonomies share only
8 direct correspondences, harmonization yields consistent gains across content fidelity, table structure, and spatial consistency: detection F-score improves from 0.860 to 0.883, table TEDS improves to 0.814, and mean bounding box overlap drops from 0.043 to 0.016. Representation analysis further shows that harmonized training produces more compact and separable post-decoder embeddings, confirming that annotation inconsistency distorts the learned feature space and that resolving it before training restores representation structure.
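As an illustrative sketch of the harmonization idea, the snippet below relabels COCO-style annotations from one corpus into a shared taxonomy. The category names, mapping tables, and `harmonize` helper are invented for illustration; in the paper, the correspondence between taxonomies is derived by a vision-language model rather than hand-written.

```python
# Hypothetical shared taxonomy covering the direct correspondences
# between two document-layout corpora (names are illustrative).
SHARED = ["text", "title", "table", "figure", "caption"]

# Per-dataset mappings into the shared label set (invented examples).
MAP_A = {"paragraph": "text", "section-header": "title",
         "table": "table", "picture": "figure", "caption": "caption"}

def harmonize(annotations, mapping):
    """Relabel annotations into the shared taxonomy.

    Categories with no counterpart in the shared set are dropped,
    so the merged training set contains only reconciled labels.
    """
    out = []
    for ann in annotations:
        shared = mapping.get(ann["category"])
        if shared is not None:  # unmapped categories are discarded
            out.append({**ann, "category": shared})
    return out

# Toy annotations from "dataset A"; "sidebar" has no shared counterpart.
ds_a = [{"bbox": [10, 10, 200, 40], "category": "section-header"},
        {"bbox": [10, 60, 200, 300], "category": "paragraph"},
        {"bbox": [10, 320, 200, 400], "category": "sidebar"}]

merged = harmonize(ds_a, MAP_A)
# merged keeps two boxes, relabeled "title" and "text"
```

In practice the hard part is not applying the mapping but constructing it, since semantically equivalent categories can differ in both name and spatial extent; that reconciliation is what the proposed workflow delegates to the vision-language model.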
Author: Vladimir Kirilenko
References
- J. Lambert, Z. Liu, O. Sener, J. Hays, and V. Koltun, “MSeg: A Composite Dataset for Multi-domain Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. [Online]. Available: https://arxiv.org/abs/2112.13762
- Z. Chen et al., “Dynamic Supervisor for Cross-dataset Object Detection,” Neurocomputing, 2022. [Online]. Available: https://arxiv.org/abs/2204.00183
- Y.-H. Liao, D. Acuna, R. Mahmood, J. Lucas, V. Prabhu, and S. Fidler, “Transferring Labels to Solve Annotation Mismatches Across Object Detection Datasets,” in International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://openreview.net/forum?id=ChHx5ORqF0
- J. Kirkpatrick et al., “Overcoming Catastrophic Forgetting in Neural Networks,” Proceedings of the National Academy of Sciences (PNAS), vol. 114, no. 13, pp. 3521–3526, 2017. [Online]. Available: https://arxiv.org/abs/1612.00796
- T. Feng, M. Wang, and H. Yuan, “Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [Online]. Available: https://arxiv.org/abs/2204.02136
- X. Zhong, J. Tang, and A. J. Yepes, “PubLayNet: Largest Dataset Ever for Document Layout Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), 2019. [Online]. Available: https://arxiv.org/abs/1908.07836
- M. Li et al., “DocBank: A Benchmark Dataset for Document Layout Analysis,” in International Conference on Computational Linguistics (COLING), 2020. [Online]. Available: https://arxiv.org/abs/2006.01038
- B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. W. J. Staar, “DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3743–3751. doi: 10.1145/3534678.3539043.
- S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2015. [Online]. Available: https://arxiv.org/abs/1506.01497
- J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [Online]. Available: https://arxiv.org/abs/1506.02640
- Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “YOLOX: Exceeding YOLO Series in 2021,” arXiv preprint arXiv:2107.08430, 2021. [Online]. Available: https://arxiv.org/abs/2107.08430
- G. Jocher, A. Chaurasia, and J. Qiu, “YOLOv8.” [Online]. Available: https://github.com/ultralytics/ultralytics
- N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers,” in European Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2005.12872
- X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable Transformers for End-to-End Object Detection,” in International Conference on Learning Representations (ICLR), 2021. [Online]. Available: https://arxiv.org/abs/2010.04159
- H. Zhang et al., “DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection,” in International Conference on Learning Representations (ICLR), 2023. [Online]. Available: https://arxiv.org/abs/2203.03605
- Y. Zhao et al., “DETRs Beat YOLOs on Real-time Object Detection,” 2023.
- W. Lv, Y. Zhao, Q. Chang, K. Huang, G. Wang, and Y. Liu, “RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer,” arXiv preprint arXiv:2407.17140, 2024. [Online]. Available: https://arxiv.org/abs/2407.17140
- C. Da, C. Luo, Q. Zheng, and C. Yao, “VGT: Vision Grid Transformer for Document Layout Analysis,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2023. [Online]. Available: https://arxiv.org/abs/2308.14978
- Z. Zhao, H. Kang, B. Wang, and C. He, “DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception,” arXiv preprint arXiv:2410.12628, 2024. [Online]. Available: https://arxiv.org/abs/2410.12628
- N. Livathinos, C. Auer, A. Nassar, et al., “Advanced Layout Analysis Models for Docling,” arXiv preprint arXiv:2509.11720, 2025. [Online]. Available: https://arxiv.org/abs/2509.11720
- S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010. [Online]. Available: https://ieeexplore.ieee.org/document/5288526
- W. Li, F. Li, Y. Luo, P. Wang, and J. Sun, “Deep Domain Adaptive Object Detection: A Survey,” in IEEE Symposium Series on Computational Intelligence (SSCI), 2020. [Online]. Available: https://arxiv.org/abs/2002.06797
- Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domain Adaptive Faster R-CNN for Object Detection in the Wild,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [Online]. Available: https://arxiv.org/abs/1803.03243
- K. Saito, Y. Ushiku, T. Harada, and K. Saenko, “Strong-Weak Distribution Alignment for Adaptive Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [Online]. Available: https://arxiv.org/abs/1812.04798
- C.-C. Hsu, Y.-H. Tsai, Y.-Y. Lin, and M.-H. Yang, “Every Pixel Matters: Center-Aware Feature Alignment for Domain Adaptive Object Detector,” in European Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2008.08574
- G. Zhao, G. Li, R. Xu, and L. Lin, “Collaborative Training between Region Proposal Localization and Classification for Domain Adaptive Object Detection,” in European Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2009.08119
- X. Wang, Z. Cai, D. Gao, and N. Vasconcelos, “Towards Universal Object Detection by Domain Attention,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [Online]. Available: https://arxiv.org/abs/1904.04402
- X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting Twenty-Thousand Classes Using Image-Level Supervision,” in European Conference on Computer Vision (ECCV), 2022. [Online]. Available: https://arxiv.org/abs/2201.02605
- M. Kennerley, A. I. Aviles-Rivero, C.-B. Schönlieb, and R. T. Tan, “Bridging Annotation Gaps: Transferring Labels to Align Object Detection Datasets,” arXiv preprint arXiv:2506.04737, 2025. [Online]. Available: https://arxiv.org/abs/2506.04737
- L. H. Li et al., “Grounded Language-Image Pre-Training,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [Online]. Available: https://arxiv.org/abs/2112.03857
- S. Liu et al., “Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection,” arXiv preprint arXiv:2303.05499, 2023. [Online]. Available: https://arxiv.org/abs/2303.05499
- J. Hoffman et al., “LSDA: Large Scale Detection Through Adaptation,” in Advances in Neural Information Processing Systems (NeurIPS), 2014. [Online]. Available: https://arxiv.org/abs/1407.5035
- H. Rasheed, M. Maaz, M. U. Khattak, S. H. Khan, and F. S. Khan, “Bridging the Gap between Object and Image-Level Representations for Open-Vocabulary Detection,” in Advances in Neural Information Processing Systems (NeurIPS), 2022. [Online]. Available: https://arxiv.org/abs/2207.03482
- S. Tewes, Y. Chen, O. Moured, J. Zhang, and R. Stiefelhagen, “SFDLA: Source-Free Document Layout Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), 2025. [Online]. Available: https://arxiv.org/abs/2503.18742
- J. Wang, K. Hu, Z. Zhong, L. Sun, and Q. Huo, “Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis,” Pattern Recognition, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0031320324005879
- Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou, “LayoutLM: Pre-training of Text and Layout for Document Image Understanding,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020. [Online]. Available: https://arxiv.org/abs/1912.13318
- S. Appalaraju, B. Jasani, B. U. Kota, Y. Xie, and R. Manmatha, “DocFormer: End-to-End Transformer for Document Understanding,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2021. [Online]. Available: https://arxiv.org/abs/2106.11539
- J. Li, Y. Xu, T. Lv, L. Cui, C. Zhang, and F. Wei, “DiT: Self-Supervised Pre-Training for Document Image Transformer,” in ACM International Conference on Multimedia (ACMMM), 2022. [Online]. Available: https://arxiv.org/abs/2203.02378
- G. Kim et al., “OCR-Free Document Understanding Transformer,” in European Conference on Computer Vision (ECCV), 2022. [Online]. Available: https://arxiv.org/abs/2111.15664
- Y. Huang, T. Lv, L. Cui, Y. Lu, and F. Wei, “LayoutLMv3: Pre-Training for Document AI with Unified Text and Image Masking,” in ACM International Conference on Multimedia (ACM MM), 2022. [Online]. Available: https://arxiv.org/abs/2204.08387
- A. Radford et al., “Learning Transferable Visual Models from Natural Language Supervision,” in International Conference on Machine Learning (ICML), 2021. [Online]. Available: https://arxiv.org/abs/2103.00020
- J. Ye et al., “mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding,” arXiv preprint arXiv:2307.02499, 2023. [Online]. Available: https://arxiv.org/abs/2307.02499
- H. Wei et al., “Vary: Scaling Up the Vision Vocabulary for Large Vision-Language Models,” arXiv preprint arXiv:2312.06109, 2023. [Online]. Available: https://arxiv.org/abs/2312.06109
- R. Li, A. Jimeno Yepes, Y. You, K. Pluciński, M. Operlejn, and C. Wolfe, “SCORE: A Semantic Evaluation Framework for Generative Document Parsing,” arXiv preprint arXiv:2509.19345, 2025. [Online]. Available: https://arxiv.org/abs/2509.19345





