
An end-to-end AI pipeline that transforms complex source documents into structured, high-fidelity, MLR-compliant promotional content, combining OCR, multimodal LLMs, and agent-based validation to ensure accuracy, scalability, and regulatory confidence.
Meet our client
Client:
Industry:
Market:
Technology:
In a Nutshell
Client’s Challenge
A global pharmaceutical company set out to build an end-to-end platform for generating promotional materials using LLMs. The system needed to automatically create presentations and brochures by retrieving data, generating content, and validating it through AI agents aligned with strict Medical, Legal, and Regulatory (MLR) standards.
The key challenge was ensuring that extracted and generated content remained accurate, structured, and compliant. OCR pipelines struggled with complex documents containing tables, images, and graphs, often introducing artifacts or losing structure. Additionally, maintaining high fidelity required multiple LLM calls, making the process costly and slow.
Our Solution
We developed a scalable, multimodal AI pipeline combining OCR, LLM-based postprocessing, and agent-driven validation.
The system uses Azure OCR for initial extraction and applies parallel LLM refinement at the document-element level (paragraphs, tables, images) to improve accuracy and reduce hallucinations. We separated OCR processing from the MLR review pipeline to optimize cost and performance, and introduced background processing to improve user experience.
The solution also integrates a custom validation framework, enabling continuous evaluation of accuracy and compliance.
Client’s Benefits
The platform enables reliable, structured content extraction and prepares documents for automated MLR review. It significantly improves the quality and consistency of generated materials while reducing manual workload and enabling scalable content production aligned with regulatory requirements.
A Deep Dive
1. Overview
The project focused on building an AI platform for generating compliant promotional content. Its objective was to automate the creation and validation of pharmaceutical materials using LLMs.
As a result, we developed a high-fidelity OCR pipeline that produces structured outputs, reduced hallucinations through element-level LLM processing, and designed a scalable architecture that separates OCR and review workflows. Additionally, we implemented a built-in evaluation framework to enable continuous performance monitoring.
2. Client
A leading global pharmaceutical company focused on innovative therapies and responsible healthcare communication.
Key context:
- Operates across 60+ countries
- Focused on digital transformation in medical and commercial functions
- Aims to automate and scale compliant content generation
- Prioritizes strict adherence to MLR (Medical, Legal, Regulatory) standards
3. Challenge
Business Challenge
The client needed to automate the creation of promotional materials while ensuring full compliance with MLR standards. Any inaccuracies in extracted or generated content could lead to regulatory risks, making content fidelity and traceability critical.
The system had to:
- Generate structured, high-quality promotional content
- Ensure compliance with strict regulatory frameworks
- Enable scalable and repeatable content production
Technology Challenge
Several technical constraints made this problem non-trivial:
- Complex document understanding: extracting structured data (text, tables, images, graphs) from PDFs while preserving layout and hierarchy
- OCR limitations: standard OCR introduced artifacts and inconsistencies that compromised downstream processing
- High cost and latency: large documents required multiple LLM calls, increasing processing time and cost
- Hallucination risk: passing large document contexts to LLMs led to inaccuracies and hallucinated outputs
4. Solution
We designed a modular, multimodal AI pipeline that ensures accurate document processing and efficient MLR validation.
Core Approach
The pipeline combines structured OCR extraction, LLM-based refinement, and scalable validation mechanisms to deliver accuracy, efficiency, and compliance.
1. Structured OCR Extraction
- Azure OCR used to extract document content
- Documents decomposed into structured elements (paragraphs, tables, images)
- Data stored using Pydantic models for consistency and downstream processing
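To illustrate what element-level structuring looks like in practice, here is a minimal sketch of typed document elements. The production system uses Pydantic models; stdlib dataclasses stand in here to keep the example dependency-free, and all class and field names are illustrative, not the actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical element types standing in for the real Pydantic models.

@dataclass
class Paragraph:
    text: str
    page: int

@dataclass
class TableElement:
    rows: List[List[str]]
    page: int

@dataclass
class ImageElement:
    caption: str
    page: int

DocElement = Union[Paragraph, TableElement, ImageElement]

@dataclass
class StructuredDocument:
    source_file: str
    elements: List[DocElement] = field(default_factory=list)

    def by_type(self, cls) -> List[DocElement]:
        """Filter elements by type, e.g. all tables for table-specific refinement."""
        return [e for e in self.elements if isinstance(e, cls)]

doc = StructuredDocument(
    source_file="brochure.pdf",
    elements=[
        Paragraph(text="Indication and usage...", page=1),
        TableElement(rows=[["Dose", "Frequency"], ["10 mg", "Daily"]], page=2),
        ImageElement(caption="Efficacy over 12 weeks", page=2),
    ],
)
print(len(doc.by_type(TableElement)))  # 1
```

Typed elements like these give downstream stages a stable contract: each refinement or validation step can target exactly one element kind instead of re-parsing raw OCR output.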
2. LLM-Based Postprocessing
- Multimodal GPT-4.1 applied to refine OCR outputs
- Each document element processed independently to:
- Reduce hallucinations
- Improve accuracy
- Preserve structure
3. Parallel Processing Architecture
- Element-level processing executed in parallel
- Reduced latency and improved scalability
- Avoided large-context bottlenecks in LLM calls
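The parallel pattern described above can be sketched with asyncio. The per-element LLM call (GPT-4.1 in the real pipeline) is stubbed out here, and the concurrency bound is an assumed parameter, not the production value.

```python
import asyncio
from typing import List

async def refine_element(element: str) -> str:
    """Stub for the per-element LLM call (GPT-4.1 in the real pipeline).

    Simulates latency and tags the element; the production version would
    send a focused prompt containing only this single element.
    """
    await asyncio.sleep(0.01)  # stands in for the LLM round trip
    return f"refined:{element}"

async def refine_document(elements: List[str], max_concurrency: int = 8) -> List[str]:
    """Refine all elements concurrently, bounded by a semaphore so a
    large document cannot exhaust API rate limits."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(el: str) -> str:
        async with sem:
            return await refine_element(el)

    # gather preserves input order, so the document structure survives
    return await asyncio.gather(*(bounded(e) for e in elements))

elements = [f"paragraph-{i}" for i in range(20)]
refined = asyncio.run(refine_document(elements))
print(refined[0])  # refined:paragraph-0
```

Because each task carries only one element's context, total latency approaches that of the slowest single call rather than the sum of all calls, which is what removes the large-context bottleneck.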
4. Decoupled Pipeline Design
- OCR pipeline separated from MLR review system
- Enabled:
- Reuse of processed documents
- Significant reduction in token usage
- Lower operational costs
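One way such reuse can work is a content-addressed cache between the two pipelines: processed documents are stored under a hash of their bytes, so repeated MLR review runs skip OCR and refinement entirely. This is a minimal sketch of that idea, not the client's actual storage layer (which uses S3 and PostgreSQL).

```python
import hashlib
import json
import tempfile
from pathlib import Path

def document_key(pdf_bytes: bytes) -> str:
    """Content hash identifying a source document regardless of filename."""
    return hashlib.sha256(pdf_bytes).hexdigest()

class ProcessedDocStore:
    """Minimal cache keyed by document content: OCR + refinement output is
    written once, then reused by any number of review runs."""

    def __init__(self, root: Path):
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def get(self, key: str):
        path = self.root / f"{key}.json"
        return json.loads(path.read_text()) if path.exists() else None

    def put(self, key: str, elements: list) -> None:
        (self.root / f"{key}.json").write_text(json.dumps(elements))

def process(pdf_bytes: bytes, store: ProcessedDocStore, ocr_fn) -> list:
    """Run the expensive OCR + refinement path only on a cache miss."""
    key = document_key(pdf_bytes)
    cached = store.get(key)
    if cached is not None:
        return cached             # reuse: no OCR or LLM tokens spent
    elements = ocr_fn(pdf_bytes)  # expensive path, runs once per document
    store.put(key, elements)
    return elements

# Demo with a fake OCR function: the second call hits the cache.
calls = []
def fake_ocr(data: bytes) -> list:
    calls.append(1)
    return ["paragraph text", "table rows"]

store = ProcessedDocStore(Path(tempfile.mkdtemp()))
first = process(b"%PDF-1.7 ...", store, fake_ocr)
second = process(b"%PDF-1.7 ...", store, fake_ocr)
print(len(calls))  # 1 -- OCR ran only once
```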
5. Background Processing & UX Optimization
- Long-running tasks executed asynchronously
- Users receive real-time progress updates
- Improved usability for large document workflows
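The background-execution pattern can be sketched as a task that reports incremental progress while the caller stays responsive. In the production system this sits behind FastAPI endpoints; here a plain asyncio queue stands in for that transport, and the job body is a stub.

```python
import asyncio

async def long_job(n_elements: int, progress: asyncio.Queue) -> str:
    """Simulated long-running OCR/refinement job that reports progress."""
    for i in range(n_elements):
        await asyncio.sleep(0)           # stands in for real per-element work
        await progress.put((i + 1) / n_elements)
    return "done"

async def main() -> list:
    progress: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(long_job(4, progress))  # runs in the background
    updates = []
    # The "user-facing" side drains progress updates without blocking on
    # the whole job; a web frontend would poll or stream these instead.
    while not task.done() or not progress.empty():
        try:
            updates.append(await asyncio.wait_for(progress.get(), timeout=0.1))
        except asyncio.TimeoutError:
            pass
    await task
    return updates

updates = asyncio.run(main())
print(updates)  # [0.25, 0.5, 0.75, 1.0]
```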
6. Integrated Validation Framework
- Custom datasets curated for evaluation
- Internal validation system measures:
- Completeness
- Hallucination rate
- Provides actionable performance reports
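To make the two metrics concrete, here is a crude lexical proxy for them: completeness as the share of source tokens recovered, and a hallucination score as the share of extracted tokens grounded in the source (higher = fewer hallucinations). The production framework scores both with an LLM-as-a-judge over curated datasets; this token-overlap version is only a stand-in.

```python
import re

def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def completeness(source: str, extracted: str) -> float:
    """Fraction of source tokens recovered in the extraction (0-1)."""
    src = _tokens(source)
    return len(src & _tokens(extracted)) / len(src) if src else 1.0

def hallucination_score(source: str, extracted: str) -> float:
    """Share of extracted tokens grounded in the source; higher means
    fewer hallucinated tokens, matching the report's convention."""
    ext = _tokens(extracted)
    return len(ext & _tokens(source)) / len(ext) if ext else 1.0

src = "dose 10 mg daily with food"
out = "dose 10 mg daily"
print(round(completeness(src, out), 2))        # 0.67
print(round(hallucination_score(src, out), 2)) # 1.0
```

Even a proxy like this is useful for regression checks between releases; the judge-based metrics then confirm quality on the curated evaluation sets.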
Technology Stack
- OCR: Azure OCR
- LLM: OpenAI GPT-4.1 (multimodal)
- Backend: FastAPI
- Storage: AWS S3
- Database: PostgreSQL (AWS RDS)
What Makes This Solution Unique
This solution stands out due to its use of parallel LLM processing at the level of individual document elements, which significantly improves both speed and accuracy. By separating the OCR pipeline from the review workflow, we were able to optimize costs and enable more flexible reuse of processed data.
The system also incorporates a background execution model, allowing long-running processes to run asynchronously while keeping users informed about progress, which enhances overall user experience. Additionally, an integrated evaluation framework ensures continuous quality monitoring through measurable metrics.
Finally, the entire architecture was specifically designed to support MLR-compliant content workflows, making it well-suited for highly regulated environments such as the pharmaceutical industry.
5. Process
The solution was developed through an iterative, experimentation-driven approach:
- Initial OCR extraction using Azure OCR
- Structuring document content into typed elements (Pydantic models)
- Testing full-page LLM postprocessing → identified hallucination issues
- Transition to element-level processing → improved accuracy
- Parallelization to address latency and cost concerns
- Separation of OCR and review pipelines → cost optimization
- Integration of validation framework with curated datasets
Expertise involved:
- ML Engineers (LLM pipelines, optimization)
- Data Scientists (evaluation, hallucination mitigation)
- Backend Engineers (FastAPI, infrastructure)
6. Outcome
Quantitative Results
Using an internal evaluation framework based on curated datasets and an LLM-as-a-judge approach, the solution achieved:
- Completeness score: 0.87 (on a 0–1 scale)
- Hallucination detection score: 0.85 (higher = fewer hallucinations)
These results indicate that the pipeline captures the vast majority of relevant information from source documents while keeping hallucinated content to a minimum.
Practically, this means:
- Extracted content closely reflects original documents, reducing manual correction effort
- Low hallucination rates minimize compliance risks in regulated environments
- The system can be confidently integrated into production workflows where accuracy is non-negotiable
Overall, these benchmarks validate that the solution meets the high standards required for pharmaceutical-grade content automation, providing a solid foundation for scalable and compliant deployment.
Qualitative Results
The solution delivered several important qualitative improvements that enhanced both the reliability of the system and its readiness for production use. These outcomes go beyond raw performance metrics, reflecting meaningful progress in accuracy, scalability, and operational transparency.
Specifically, we:
- Improved accuracy and consistency of document processing
- Reduced hallucination risk in downstream LLM workflows
- Built a scalable foundation for automated MLR review
- Enhanced transparency through structured outputs and evaluation metrics
Lessons Learned
One of the key insights from this project was that processing large document contexts with LLMs significantly increases the risk of hallucinations and reduces output reliability. When entire pages or documents were passed to the model at once, the quality of results degraded, especially in complex, structured content.
Breaking documents down into smaller, well-defined elements—such as paragraphs, tables, and images—proved to be much more effective. This element-level decomposition improved both accuracy and consistency, while also making the system easier to scale and parallelize.
Another important lesson was the need to separate different stages of the pipeline. Decoupling the OCR process from the MLR review workflow allowed for better cost control and flexibility. It enabled reusing processed data without repeatedly incurring the cost of expensive OCR and LLM operations.
Finally, the project highlighted the importance of having a robust evaluation framework. Production-grade AI systems require continuous monitoring and measurable performance metrics. Without a structured way to assess completeness and hallucination rates, it would be difficult to ensure reliability and maintain confidence in the system over time.
7. Summary
We delivered a production-ready foundation for automated, compliant content generation in a highly regulated industry.
By combining structured OCR, multimodal LLM refinement, and agent-based validation, the solution transforms complex document workflows into scalable, accurate, and auditable pipelines.
This positions the client to:
- Build a long-term AI-driven content ecosystem
- Accelerate content production
- Ensure regulatory compliance at scale





