
Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization

Read full paper

Abstract

Fine-tuning object detection (OD) models on combined datasets assumes
annotation compatibility, yet datasets often encode conflicting spatial definitions for semantically equivalent categories. We propose an agentic label harmonization workflow that uses a vision-language model to reconcile both category semantics and bounding box granularity across heterogeneous sources before training. We evaluate on document layout detection as a challenging case study, where annotation standards vary widely across corpora. Without harmonization, naïve mixed-dataset fine-tuning degrades a pretrained RT-DETRv2 detector: on SCORE-Bench, which measures how accurately the full document conversion pipeline reproduces ground-truth structure, table TEDS drops from 0.800 to 0.750. Applied to two corpora whose 16- and 10-category taxonomies share only 8 direct correspondences, harmonization yields consistent gains across content fidelity, table structure, and spatial consistency: detection F-score improves from 0.860 to 0.883, table TEDS improves to 0.814, and mean bounding box overlap drops from 0.043 to 0.016. Representation analysis further shows that harmonized training produces more compact and separable post-decoder embeddings, confirming that annotation inconsistency distorts the learned feature space and that resolving it before training restores representation structure.
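
The abstract describes the harmonization step only at a high level. Purely as a sketch of the idea, not the paper's implementation, the snippet below shows one way a vision-language model could be prompted to reconcile two taxonomies and their box conventions before the corpora are merged; `query_vlm`, the `Mapping` fields, and the prompt wording are all illustrative assumptions.

```python
import json
from dataclasses import dataclass

# Illustrative stand-in for the agentic harmonization step: a vision-language
# model proposes a correspondence between two category taxonomies plus a
# box-granularity rule per category. `query_vlm` is a hypothetical callable
# wrapping any chat-style VLM API.

@dataclass
class Mapping:
    source_label: str         # category name in the source dataset
    target_label: str | None  # unified category, or None if no counterpart
    box_rule: str             # e.g. "keep" or "merge lines into paragraph"

def propose_mapping(taxonomy_a: list[str], taxonomy_b: list[str],
                    query_vlm) -> list[Mapping]:
    """Ask the model for a label-level correspondence and a bounding box
    granularity rule for every category in taxonomy_b."""
    prompt = (
        "Dataset A categories: " + ", ".join(taxonomy_a) + "\n"
        "Dataset B categories: " + ", ".join(taxonomy_b) + "\n"
        "For each B category return a JSON object with fields "
        "'source_label', 'target_label' (an A category or null), and "
        "'box_rule' describing how box extents should be adjusted."
    )
    raw = query_vlm(prompt)  # assumed to return a JSON array as a string
    return [Mapping(**item) for item in json.loads(raw)]
```

Each accepted mapping would then drive relabeling of the second corpus and, where box conventions differ (say, line-level versus paragraph-level boxes), a geometric merge or split of its annotations before mixed-dataset fine-tuning.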
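
The spatial-consistency figure (mean bounding box overlap falling from 0.043 to 0.016) suggests a measure of how much predicted regions intrude on one another. The exact definition is not given in the abstract; a minimal sketch of one plausible formulation, mean pairwise IoU over a page's predicted boxes, follows. Lower is better, since well-formed layout regions rarely overlap.

```python
from itertools import combinations

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_box_overlap(boxes):
    """Mean pairwise IoU among one page's predicted boxes."""
    pairs = list(combinations(boxes, 2))
    if not pairs:
        return 0.0
    return sum(iou(a, b) for a, b in pairs) / len(pairs)

# Example: two disjoint paragraphs and a table that slightly overlaps one.
print(mean_box_overlap([(0, 0, 100, 50), (0, 60, 100, 110), (0, 100, 100, 200)]))
```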
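
Similarly, compactness and separability of post-decoder embeddings can be probed with standard cluster statistics. The toy check below uses scikit-learn's silhouette score as one such probe; this is an assumption about methodology, not necessarily the analysis the paper performs.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def embedding_separability(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Silhouette score over per-query decoder embeddings grouped by class;
    higher means more compact, better-separated clusters."""
    return float(silhouette_score(embeddings, labels))

rng = np.random.default_rng(0)
# Toy comparison: tight, well-separated clusters vs. diffuse, overlapping ones.
tight = np.concatenate([rng.normal(c, 0.2, (50, 8)) for c in (0.0, 3.0)])
loose = np.concatenate([rng.normal(c, 2.0, (50, 8)) for c in (0.0, 3.0)])
y = np.repeat([0, 1], 50)
print(embedding_separability(tight, y) > embedding_separability(loose, y))  # True
```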

Author: Vladimir Kirilenko

References

  1. J. Lambert, Z. Liu, O. Sener, J. Hays, and V. Koltun, “MSeg: A Composite Dataset for Multi-domain Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. [Online]. Available: https://arxiv.org/abs/2112.13762
  2. Z. Chen et al., “Dynamic Supervisor for Cross-dataset Object Detection,” Neurocomputing, 2022. [Online]. Available: https://arxiv.org/abs/2204.00183
  3. Y.-H. Liao, D. Acuna, R. Mahmood, J. Lucas, V. Prabhu, and S. Fidler, “Transferring Labels to Solve Annotation Mismatches Across Object Detection Datasets,” in International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://openreview.net/forum?id=ChHx5ORqF0
  4. J. Kirkpatrick et al., “Overcoming Catastrophic Forgetting in Neural Networks,” Proceedings of the National Academy of Sciences (PNAS), vol. 114, no. 13, pp. 3521–3526, 2017. [Online]. Available: https://arxiv.org/abs/1612.00796
  5. T. Feng, M. Wang, and H. Yuan, “Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [Online]. Available: https://arxiv.org/abs/2204.02136
  6. X. Zhong, J. Tang, and A. J. Yepes, “PubLayNet: Largest Dataset Ever for Document Layout Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), 2019. [Online]. Available: https://arxiv.org/abs/1908.07836
  7. M. Li et al., “DocBank: A Benchmark Dataset for Document Layout Analysis,” in International Conference on Computational Linguistics (COLING), 2020. [Online]. Available: https://arxiv.org/abs/2006.01038
  8. B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. W. J. Staar, “DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3743–3751. doi: 10.1145/3534678.3539043.
  9. S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2015. [Online]. Available: https://arxiv.org/abs/1506.01497
  10. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [Online]. Available: https://arxiv.org/abs/1506.02640
  11. Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “YOLOX: Exceeding YOLO Series in 2021,” arXiv preprint arXiv:2107.08430, 2021. [Online]. Available: https://arxiv.org/abs/2107.08430
  12. G. Jocher, A. Chaurasia, and J. Qiu, “YOLOv8,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
  13. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers,” in European Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2005.12872
  14. X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable Transformers for End-to-End Object Detection,” in International Conference on Learning Representations (ICLR), 2021. [Online]. Available: https://arxiv.org/abs/2010.04159
  15. H. Zhang et al., “DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection,” in International Conference on Learning Representations (ICLR), 2023. [Online]. Available: https://arxiv.org/abs/2203.03605
  16. Y. Zhao et al., “DETRs Beat YOLOs on Real-time Object Detection,” arXiv preprint arXiv:2304.08069, 2023. [Online]. Available: https://arxiv.org/abs/2304.08069
  17. W. Lv, Y. Zhao, Q. Chang, K. Huang, G. Wang, and Y. Liu, “RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer,” arXiv preprint arXiv:2407.17140, 2024. [Online]. Available: https://arxiv.org/abs/2407.17140
  18. C. Da, C. Luo, Q. Zheng, and C. Yao, “VGT: Vision Grid Transformer for Document Layout Analysis,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2023. [Online]. Available: https://arxiv.org/abs/2308.14978
  19. Z. Zhao, H. Kang, B. Wang, and C. He, “DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception,” arXiv preprint arXiv:2410.12628, 2024. [Online]. Available: https://arxiv.org/abs/2410.12628
  20. N. Livathinos, C. Auer, A. Nassar, et al., “Advanced Layout Analysis Models for Docling,” arXiv preprint arXiv:2509.11720, 2025. [Online]. Available: https://arxiv.org/abs/2509.11720
  21. S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010. [Online]. Available: https://ieeexplore.ieee.org/document/5288526
  22. W. Li, F. Li, Y. Luo, P. Wang, and J. Sun, “Deep Domain Adaptive Object Detection: A Survey,” in IEEE Symposium Series on Computational Intelligence (SSCI), 2020. [Online]. Available: https://arxiv.org/abs/2002.06797
  23. Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domain Adaptive Faster R-CNN for Object Detection in the Wild,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [Online]. Available: https://arxiv.org/abs/1803.03243
  24. K. Saito, Y. Ushiku, T. Harada, and K. Saenko, “Strong-Weak Distribution Alignment for Adaptive Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [Online]. Available: https://arxiv.org/abs/1812.04798
  25. C.-C. Hsu, Y.-H. Tsai, Y.-Y. Lin, and M.-H. Yang, “Every Pixel Matters: Center-Aware Feature Alignment for Domain Adaptive Object Detector,” in European Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2008.08574
  26. G. Zhao, G. Li, R. Xu, and L. Lin, “Collaborative Training between Region Proposal Localization and Classification for Domain Adaptive Object Detection,” in European Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2009.08119
  27. X. Wang, Z. Cai, D. Gao, and N. Vasconcelos, “Towards Universal Object Detection by Domain Attention,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [Online]. Available: https://arxiv.org/abs/1904.04402
  28. X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting Twenty-Thousand Classes Using Image-Level Supervision,” in European Conference on Computer Vision (ECCV), 2022. [Online]. Available: https://arxiv.org/abs/2201.02605
  29. M. Kennerley, A. I. Aviles-Rivero, C.-B. Schönlieb, and R. T. Tan, “Bridging Annotation Gaps: Transferring Labels to Align Object Detection Datasets,” arXiv preprint arXiv:2506.04737, 2025. [Online]. Available: https://arxiv.org/abs/2506.04737
  30. L. H. Li et al., “Grounded Language-Image Pre-Training,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [Online]. Available: https://arxiv.org/abs/2112.03857
  31. S. Liu et al., “Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection,” arXiv preprint arXiv:2303.05499, 2023. [Online]. Available: https://arxiv.org/abs/2303.05499
  32. J. Hoffman et al., “LSDA: Large Scale Detection Through Adaptation,” in Advances in Neural Information Processing Systems (NeurIPS), 2014. [Online]. Available: https://arxiv.org/abs/1407.5035
  33. H. Rasheed, M. Maaz, M. U. Khattak, S. H. Khan, and F. S. Khan, “Bridging the Gap between Object and Image-Level Representations for Open-Vocabulary Detection,” in Advances in Neural Information Processing Systems (NeurIPS), 2022. [Online]. Available: https://arxiv.org/abs/2207.03482
  34. S. Tewes, Y. Chen, O. Moured, J. Zhang, and R. Stiefelhagen, “SFDLA: Source-Free Document Layout Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), 2025. [Online]. Available: https://arxiv.org/abs/2503.18742
  35. J. Wang, K. Hu, Z. Zhong, L. Sun, and Q. Huo, “Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis,” Pattern Recognition, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0031320324005879
  36. Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou, “LayoutLM: Pre-training of Text and Layout for Document Image Understanding,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020. [Online]. Available: https://arxiv.org/abs/1912.13318
  37. S. Appalaraju, B. Jasani, B. U. Kota, Y. Xie, and R. Manmatha, “DocFormer: End-to-End Transformer for Document Understanding,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2021. [Online]. Available: https://arxiv.org/abs/2106.11539
  38. J. Li, Y. Xu, T. Lv, L. Cui, C. Zhang, and F. Wei, “DiT: Self-Supervised Pre-Training for Document Image Transformer,” in ACM International Conference on Multimedia (ACM MM), 2022. [Online]. Available: https://arxiv.org/abs/2203.02378
  39. G. Kim et al., “OCR-Free Document Understanding Transformer,” in European Conference on Computer Vision (ECCV), 2022. [Online]. Available: https://arxiv.org/abs/2111.15664
  40. Y. Huang, T. Lv, L. Cui, Y. Lu, and F. Wei, “LayoutLMv3: Pre-Training for Document AI with Unified Text and Image Masking,” in ACM International Conference on Multimedia (ACM MM), 2022. [Online]. Available: https://arxiv.org/abs/2204.08387
  41. A. Radford et al., “Learning Transferable Visual Models from Natural Language Supervision,” in International Conference on Machine Learning (ICML), 2021. [Online]. Available: https://arxiv.org/abs/2103.00020
  42. J. Ye et al., “mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding,” arXiv preprint arXiv:2307.02499, 2023. [Online]. Available: https://arxiv.org/abs/2307.02499
  43. H. Wei et al., “Vary: Scaling Up the Vision Vocabulary for Large Vision-Language Models,” arXiv preprint arXiv:2312.06109, 2023. [Online]. Available: https://arxiv.org/abs/2312.06109
  44. R. Li, A. Jimeno Yepes, Y. You, K. Pluciński, M. Operlejn, and C. Wolfe, “SCORE: A Semantic Evaluation Framework for Generative Document Parsing,” arXiv preprint arXiv:2509.19345, 2025. [Online]. Available: https://arxiv.org/abs/2509.19345