
An end-to-end AI pipeline that transforms complex source documents into structured, high-fidelity, MLR-compliant promotional content, combining OCR, multimodal LLMs, and agent-based validation to ensure accuracy, scalability, and regulatory confidence.
Meet our client
Client:
Industry:
Market:
Technology:
In a Nutshell
Client’s Challenge
A global pharmaceutical company set out to build an end-to-end platform for generating promotional materials using LLMs. The system needed to automatically create presentations and brochures by retrieving data, generating content, and validating it through AI agents aligned with strict Medical, Legal, and Regulatory (MLR) standards.
The key challenge was ensuring that extracted and generated content remained accurate, structured, and compliant. OCR pipelines struggled with complex documents containing tables, images, and graphs, often introducing artifacts or losing structure. Additionally, maintaining high fidelity required multiple LLM calls, making the process costly and slow.
Our Solution
We developed a scalable, multimodal AI pipeline combining OCR, LLM-based postprocessing, and agent-driven validation.
The system uses Azure OCR for initial extraction and applies parallel LLM refinement at the document-element level (paragraphs, tables, images) to improve accuracy and reduce hallucinations. We separated OCR processing from the MLR review pipeline to optimize cost and performance, and introduced background processing to improve user experience.
The solution also integrates a custom validation framework, enabling continuous evaluation of accuracy and compliance.
Client’s Benefits
The platform enables reliable, structured content extraction and prepares documents for automated MLR review. It significantly improves the quality and consistency of generated materials while reducing manual workload and enabling scalable content production aligned with regulatory requirements.
A Deep Dive
1. Overview
The project focused on building an AI platform for generating compliant promotional content. Its objective was to automate the creation and validation of pharmaceutical materials using LLMs.
As a result, we developed a high-fidelity OCR pipeline that produces structured outputs, reduced hallucinations through element-level LLM processing, and designed a scalable architecture that separates OCR and review workflows. Additionally, we implemented a built-in evaluation framework to enable continuous performance monitoring.
2. Client
A leading global pharmaceutical company focused on innovative therapies and responsible healthcare communication.
Key context:
- Operates across 60+ countries
- Focused on digital transformation in medical and commercial functions
- Aims to automate and scale compliant content generation
- Prioritizes strict adherence to MLR (Medical, Legal, Regulatory) standards
3. Challenge
Business Challenge
The client needed to automate the creation of promotional materials while ensuring full compliance with MLR standards. Any inaccuracies in extracted or generated content could lead to regulatory risks, making content fidelity and traceability critical.
The system had to:
- Generate structured, high-quality promotional content
- Ensure compliance with strict regulatory frameworks
- Enable scalable and repeatable content production
Technology Challenge
Several technical constraints made this problem non-trivial:
- Complex document understanding: extracting structured data (text, tables, images, graphs) from PDFs while preserving layout and hierarchy
- OCR limitations: standard OCR introduced artifacts and inconsistencies that compromised downstream processing
- High cost and latency: large documents required multiple LLM calls, increasing processing time and cost
- Hallucination risk: passing large document contexts to LLMs led to inaccuracies and hallucinated outputs
4. Solution
We designed a modular, multimodal AI pipeline that ensures accurate document processing and efficient MLR validation.
Core Approach
The pipeline combines structured OCR extraction, LLM-based refinement, and scalable validation mechanisms to deliver accuracy, efficiency, and compliance.
1. Structured OCR Extraction
- Azure OCR used to extract document content
- Documents decomposed into structured elements (paragraphs, tables, images)
- Data stored using Pydantic models for consistency and downstream processing
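To illustrate what element-level structuring looks like in practice, here is a minimal sketch of typed document elements. The production system uses Pydantic models; stdlib dataclasses stand in here to keep the example dependency-free, and all class and field names are illustrative, not the actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical element types standing in for the real Pydantic models.

@dataclass
class Paragraph:
    text: str
    page: int

@dataclass
class TableElement:
    rows: List[List[str]]
    page: int

@dataclass
class ImageElement:
    caption: str
    page: int

DocElement = Union[Paragraph, TableElement, ImageElement]

@dataclass
class StructuredDocument:
    source_file: str
    elements: List[DocElement] = field(default_factory=list)

    def by_type(self, cls) -> List[DocElement]:
        """Filter elements by type, e.g. all tables for table-specific refinement."""
        return [e for e in self.elements if isinstance(e, cls)]

doc = StructuredDocument(
    source_file="brochure.pdf",
    elements=[
        Paragraph(text="Indication and usage...", page=1),
        TableElement(rows=[["Dose", "Frequency"], ["10 mg", "Daily"]], page=2),
        ImageElement(caption="Efficacy over 12 weeks", page=2),
    ],
)
print(len(doc.by_type(TableElement)))  # 1
```

Typed elements like these give downstream stages a stable contract: each refinement or validation step can target exactly one element kind instead of re-parsing raw OCR output.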
2. LLM-Based Postprocessing
- Multimodal GPT-4.1 applied to refine OCR outputs
- Each document element processed independently to:
- Reduce hallucinations
- Improve accuracy
- Preserve structure
3. Parallel Processing Architecture
- Element-level processing executed in parallel
- Reduced latency and improved scalability
- Avoided large-context bottlenecks in LLM calls
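The parallel pattern described above can be sketched with asyncio. The per-element LLM call (GPT-4.1 in the real pipeline) is stubbed out here, and the concurrency bound is an assumed parameter, not the production value.

```python
import asyncio
from typing import List

async def refine_element(element: str) -> str:
    """Stub for the per-element LLM call (GPT-4.1 in the real pipeline).

    Simulates latency and tags the element; the production version would
    send a focused prompt containing only this single element.
    """
    await asyncio.sleep(0.01)  # stands in for the LLM round trip
    return f"refined:{element}"

async def refine_document(elements: List[str], max_concurrency: int = 8) -> List[str]:
    """Refine all elements concurrently, bounded by a semaphore so a
    large document cannot exhaust API rate limits."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(el: str) -> str:
        async with sem:
            return await refine_element(el)

    # gather preserves input order, so the document structure survives
    return await asyncio.gather(*(bounded(e) for e in elements))

elements = [f"paragraph-{i}" for i in range(20)]
refined = asyncio.run(refine_document(elements))
print(refined[0])  # refined:paragraph-0
```

Because each task carries only one element's context, total latency approaches that of the slowest single call rather than the sum of all calls, which is what removes the large-context bottleneck.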
4. Decoupled Pipeline Design
- OCR pipeline separated from MLR review system
- Enabled:
- Reuse of processed documents
- Significant reduction in token usage
- Lower operational costs
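One way such reuse can work is a content-addressed cache between the two pipelines: processed documents are stored under a hash of their bytes, so repeated MLR review runs skip OCR and refinement entirely. This is a minimal sketch of that idea, not the client's actual storage layer (which uses S3 and PostgreSQL).

```python
import hashlib
import json
import tempfile
from pathlib import Path

def document_key(pdf_bytes: bytes) -> str:
    """Content hash identifying a source document regardless of filename."""
    return hashlib.sha256(pdf_bytes).hexdigest()

class ProcessedDocStore:
    """Minimal cache keyed by document content: OCR + refinement output is
    written once, then reused by any number of review runs."""

    def __init__(self, root: Path):
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def get(self, key: str):
        path = self.root / f"{key}.json"
        return json.loads(path.read_text()) if path.exists() else None

    def put(self, key: str, elements: list) -> None:
        (self.root / f"{key}.json").write_text(json.dumps(elements))

def process(pdf_bytes: bytes, store: ProcessedDocStore, ocr_fn) -> list:
    """Run the expensive OCR + refinement path only on a cache miss."""
    key = document_key(pdf_bytes)
    cached = store.get(key)
    if cached is not None:
        return cached             # reuse: no OCR or LLM tokens spent
    elements = ocr_fn(pdf_bytes)  # expensive path, runs once per document
    store.put(key, elements)
    return elements

# Demo with a fake OCR function: the second call hits the cache.
calls = []
def fake_ocr(data: bytes) -> list:
    calls.append(1)
    return ["paragraph text", "table rows"]

store = ProcessedDocStore(Path(tempfile.mkdtemp()))
first = process(b"%PDF-1.7 ...", store, fake_ocr)
second = process(b"%PDF-1.7 ...", store, fake_ocr)
print(len(calls))  # 1 -- OCR ran only once
```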
5. Background Processing & UX Optimization
- Long-running tasks executed asynchronously
- Users receive real-time progress updates
- Improved usability for large document workflows
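The background-execution pattern can be sketched as a task that reports incremental progress while the caller stays responsive. In the production system this sits behind FastAPI endpoints; here a plain asyncio queue stands in for that transport, and the job body is a stub.

```python
import asyncio

async def long_job(n_elements: int, progress: asyncio.Queue) -> str:
    """Simulated long-running OCR/refinement job that reports progress."""
    for i in range(n_elements):
        await asyncio.sleep(0)           # stands in for real per-element work
        await progress.put((i + 1) / n_elements)
    return "done"

async def main() -> list:
    progress: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(long_job(4, progress))  # runs in the background
    updates = []
    # The "user-facing" side drains progress updates without blocking on
    # the whole job; a web frontend would poll or stream these instead.
    while not task.done() or not progress.empty():
        try:
            updates.append(await asyncio.wait_for(progress.get(), timeout=0.1))
        except asyncio.TimeoutError:
            pass
    await task
    return updates

updates = asyncio.run(main())
print(updates)  # [0.25, 0.5, 0.75, 1.0]
```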
6. Integrated Validation Framework
- Custom datasets curated for evaluation
- Internal validation system measures:
- Completeness
- Hallucination rate
- Provides actionable performance reports
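To make the two metrics concrete, here is a crude lexical proxy for them: completeness as the share of source tokens recovered, and a hallucination score as the share of extracted tokens grounded in the source (higher = fewer hallucinations). The production framework scores both with an LLM-as-a-judge over curated datasets; this token-overlap version is only a stand-in.

```python
import re

def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def completeness(source: str, extracted: str) -> float:
    """Fraction of source tokens recovered in the extraction (0-1)."""
    src = _tokens(source)
    return len(src & _tokens(extracted)) / len(src) if src else 1.0

def hallucination_score(source: str, extracted: str) -> float:
    """Share of extracted tokens grounded in the source; higher means
    fewer hallucinated tokens, matching the report's convention."""
    ext = _tokens(extracted)
    return len(ext & _tokens(source)) / len(ext) if ext else 1.0

src = "dose 10 mg daily with food"
out = "dose 10 mg daily"
print(round(completeness(src, out), 2))        # 0.67
print(round(hallucination_score(src, out), 2)) # 1.0
```

Even a proxy like this is useful for regression checks between releases; the judge-based metrics then confirm quality on the curated evaluation sets.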
Technology Stack
- OCR: Azure OCR
- LLM: OpenAI GPT-4.1 (multimodal)
- Backend: FastAPI
- Storage: AWS S3
- Database: PostgreSQL (AWS RDS)
What Makes This Solution Unique
This solution stands out due to its use of parallel LLM processing at the level of individual document elements, which significantly improves both speed and accuracy. By separating the OCR pipeline from the review workflow, we were able to optimize costs and enable more flexible reuse of processed data.
The system also incorporates a background execution model, allowing long-running processes to run asynchronously while keeping users informed about progress, which enhances overall user experience. Additionally, an integrated evaluation framework ensures continuous quality monitoring through measurable metrics.
Finally, the entire architecture was specifically designed to support MLR-compliant content workflows, making it well-suited for highly regulated environments such as the pharmaceutical industry.
5. Process
The solution was developed through an iterative, experimentation-driven approach:
- Initial OCR extraction using Azure OCR
- Structuring document content into typed elements (Pydantic models)
- Testing full-page LLM postprocessing → identified hallucination issues
- Transition to element-level processing → improved accuracy
- Parallelization to address latency and cost concerns
- Separation of OCR and review pipelines → cost optimization
- Integration of validation framework with curated datasets
Expertise involved:
- ML Engineers (LLM pipelines, optimization)
- Data Scientists (evaluation, hallucination mitigation)
- Backend Engineers (FastAPI, infrastructure)
6. Outcome
Quantitative Results
Using an internal evaluation framework based on curated datasets and an LLM-as-a-judge approach, the solution achieved:
- Completeness score: 0.87 (on a 0–1 scale)
- Hallucination detection score: 0.85 (higher = fewer hallucinations)
These results indicate that the pipeline captures the vast majority of relevant information from source documents while keeping hallucinated content to a minimum.
Practically, this means:
- Extracted content closely reflects original documents, reducing manual correction effort
- Low hallucination rates minimize compliance risks in regulated environments
- The system can be confidently integrated into production workflows where accuracy is non-negotiable
Overall, these benchmarks validate that the solution meets the high standards required for pharmaceutical-grade content automation, providing a solid foundation for scalable and compliant deployment.
Qualitative Results
The solution delivered several important qualitative improvements that enhanced both the reliability of the system and its readiness for production use. These outcomes go beyond raw performance metrics, reflecting meaningful progress in accuracy, scalability, and operational transparency.
Specifically, we:
- Improved accuracy and consistency of document processing
- Reduced hallucination risk in downstream LLM workflows
- Built a scalable foundation for automated MLR review
- Enhanced transparency through structured outputs and evaluation metrics
Lessons Learned
One of the key insights from this project was that processing large document contexts with LLMs significantly increases the risk of hallucinations and reduces output reliability. When entire pages or documents were passed to the model at once, the quality of results degraded, especially in complex, structured content.
Breaking documents down into smaller, well-defined elements—such as paragraphs, tables, and images—proved to be much more effective. This element-level decomposition improved both accuracy and consistency, while also making the system easier to scale and parallelize.
Another important lesson was the need to separate different stages of the pipeline. Decoupling the OCR process from the MLR review workflow allowed for better cost control and flexibility. It enabled reusing processed data without repeatedly incurring the cost of expensive OCR and LLM operations.
Finally, the project highlighted the importance of having a robust evaluation framework. Production-grade AI systems require continuous monitoring and measurable performance metrics. Without a structured way to assess completeness and hallucination rates, it would be difficult to ensure reliability and maintain confidence in the system over time.
7. Summary
We delivered a production-ready foundation for automated, compliant content generation in a highly regulated industry.
By combining structured OCR, multimodal LLM refinement, and agent-based validation, the solution transforms complex document workflows into scalable, accurate, and auditable pipelines.
This positions the client to:
- Build a long-term AI-driven content ecosystem
- Accelerate content production
- Ensure regulatory compliance at scale





