Automated Reporting Agent for High-Trust Document Workflows Using AWS

Introduction

In industries like finance, pharmaceuticals, and market research, the ability to generate accurate, structured reports from large corpora of documents is a time-consuming and often error-prone task. Extracting actionable insights from vast amounts of unstructured data can overwhelm manual processes, especially when the corpus includes complex document formats like tables, charts, or varied layouts.

To address these challenges, we developed a reporting agent designed to automate and streamline the process of transforming raw text and data into structured, comprehensive reports. The agent not only processes large corpora of documents but also follows a predefined report structure, ensuring consistency and precision in the final output. It handles complex formats—nested tables, tables with multi-level column headers, graphs, and any of these spread across two or more pages—and works around the limits imposed by LLM context windows, making it a valuable tool for businesses seeking to optimize their reporting workflows.

In this article, we will explore the workings of the reporting agent, dive into the underlying technologies that power it, and examine a case study where we specialized this solution for pharmacovigilance in drug discovery.

How the Agent Works

The Agent Architecture

At the core of the reporting agent lies a flexible template-driven architecture that allows users to define the structure of the reports they want to generate. These report templates are embedded with placeholders: tagged components that indicate where specific data or insights should be populated. Each placeholder is associated with a parameter, which acts as a logical abstraction that can be mapped to specific types of content such as text summaries, table extractions, chart interpretations, or metadata like page numbers. Crucially, these parameters can be reused across multiple templates, enabling a scalable and modular approach to report generation.
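To make this concrete, here is a minimal sketch of placeholder extraction and filling. It assumes a `{{parameter_name}}` tag syntax; the actual placeholder format used by the agent is not specified here, and the function names are illustrative.

```python
import re

# Assumed placeholder syntax: {{parameter_name}} embedded in the template text.
PLACEHOLDER_RE = re.compile(r"\{\{(\w+)\}\}")

def extract_placeholders(template_text: str) -> list[str]:
    """Return the parameter names referenced by a template, in order of appearance."""
    return PLACEHOLDER_RE.findall(template_text)

def render(template_text: str, values: dict[str, str]) -> str:
    """Fill each placeholder with its mapped value, leaving unmapped ones untouched."""
    return PLACEHOLDER_RE.sub(
        lambda m: values.get(m.group(1), m.group(0)), template_text
    )

template = "Study {{lay_study_title}} ended on {{study_end_date}}."
print(extract_placeholders(template))  # ['lay_study_title', 'study_end_date']
```

Because parameters are just named slots, the same parameter (and its extraction logic) can be reused across any number of templates.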

To initiate a reporting job, the user provides two inputs:

  1. a corpus of PDF documents; and
  2. a pre-defined template, built in advance with embedded placeholders tied to the reporting objectives.

Once initiated, the agent performs multi-modal document extraction, capturing not only raw text but also structured elements like tables, embedded images, and layout information such as page references. This extraction is designed to be both comprehensive and context-aware, ensuring no meaningful content is overlooked.

The extracted information is then mapped to parameters, filling the placeholders in the report template with domain-relevant insights. This mapping is done intelligently, maintaining the semantic alignment between what was requested and what was found in the source material.

To manage performance and accuracy at scale, the agent uses a dual-processing flow depending on document size.

  • Small documents (fewer than 100 pages) are processed as a whole, allowing the agent to analyze them in a single pass.
  • Large documents (100 pages or more) are automatically split into 30-page chunks, each of which is processed independently. This chunking approach keeps each unit within LLM context-window limits, while overlapping context between chunks helps preserve continuity and semantic meaning across section boundaries.
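The dual flow above can be captured by a small chunk-planning helper (thresholds taken from the figures above; the function name is illustrative):

```python
def plan_chunks(page_count: int, threshold: int = 100, chunk_size: int = 30) -> list[range]:
    """Return the page ranges to process: one range for small documents,
    fixed-size chunks for large ones (100-page threshold, 30-page chunks)."""
    if page_count < threshold:
        return [range(0, page_count)]          # single-pass processing
    return [range(start, min(start + chunk_size, page_count))
            for start in range(0, page_count, chunk_size)]
```

Each returned range can then be dispatched as an independent extraction task, which also makes it easy to process chunks in parallel.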

This dual strategy ensures that both light and heavy documents are processed efficiently without compromising the fidelity of the report output.

The User Journey

To make this process accessible and intuitive, the agent is delivered through a web-based interface with three top-level dashboards.

  • The Reports Dashboard provides an overview of all report-generation jobs submitted to the agent. Users can view job statuses, monitor progress, and access completed reports directly from this interface. It also serves as the starting point for creating new jobs, allowing users to select a report template and upload a corpus of documents with just a few clicks.
  • The Templates Dashboard contains a library of all currently available report templates. In many regulated industries, public-facing documents follow strict formatting guidelines or boilerplate structures. These templates can be created and registered here, embedded with reusable placeholders to reflect industry-standard reporting needs.
  • The Parameters Dashboard acts as the reference layer for the system, listing all defined parameters along with human-readable descriptions. This aids both users and the agent in understanding the semantics behind each placeholder—ensuring accurate data extraction and contextual alignment in the final report.

Together, these components create a seamless end-to-end experience for users—from defining reporting logic to generating high-quality outputs—all while abstracting away the complexity of underlying document processing.

The Technology

Behind the intuitive interface and seamless reporting flow is a robust and scalable cloud-native architecture built primarily on AWS. Each component of the reporting agent is designed to ensure efficiency, reliability, and flexibility, leveraging AWS infrastructure to handle scale and cost optimization.

The Database

At the core of the system is a MongoDB instance hosted on AWS, which serves as the primary database. This database supports all functional layers of the application:

  • job records for the Reports Dashboard;
  • template definitions for the Templates Dashboard;
  • parameter metadata for the Parameters Dashboard; and
  • document data, which includes structured content extracted from the source files.

When a document is ingested, the system computes a checksum, which is stored in the database alongside the extracted information. This enables the agent to perform a lightweight deduplication check: if a document with the same checksum is received again in a future job, the agent can reuse previously extracted content, saving both time and compute cost.
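A simplified sketch of the deduplication check, with an in-memory dict standing in for the MongoDB collection and SHA-256 assumed as the hash (the article does not name the checksum algorithm):

```python
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 hex digest used as the deduplication key (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()

def get_or_extract(data: bytes, cache: dict, extract) -> dict:
    """Reuse previously extracted content when the checksum is already known."""
    key = checksum(data)
    if key in cache:
        return cache[key]      # cache hit: skip OCR / parsing entirely
    result = extract(data)
    cache[key] = result        # store alongside the checksum for future jobs
    return result
```

In the deployed system the `cache` role is played by the MongoDB document collection, keyed by checksum, so repeated uploads across jobs also benefit.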

Data-Extraction Components

The data extraction layer is built using Python, leveraging libraries for parsing PDFs. For scanned or image-based documents, the agent uses Tesseract OCR, deployed via AWS services, to convert images into machine-readable text, tables, and visual metadata.

Report-Generation Components

When it comes to generating the final report content, the system connects to Amazon Bedrock to orchestrate interactions with large language models. By default, the agent utilizes Anthropic Claude 3.5 Sonnet v2 for its balanced performance across comprehension, summarization, and structured reasoning tasks. However, this can be configured to use other models available through Bedrock based on client or domain requirements.
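As a sketch, a Bedrock call for report generation might look like the following. The model ID shown is the Bedrock identifier for Claude 3.5 Sonnet v2 at the time of writing; verify it against your region's model listing, and note the helper names here are illustrative.

```python
import json

# Bedrock model ID for Anthropic Claude 3.5 Sonnet v2 (verify for your region).
MODEL_ID = "anthropic.claude-3-5-sonnet-20241022-v2:0"

def build_request(prompt: str, max_tokens: int = 2048) -> str:
    """Serialise a request body in the Anthropic Messages format used by Bedrock."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

def generate(prompt: str) -> str:
    """Invoke the model via Bedrock. Requires AWS credentials; sketch only."""
    import boto3  # deferred so the pure helper above works without AWS configured
    client = boto3.client("bedrock-runtime")
    response = client.invoke_model(modelId=MODEL_ID, body=build_request(prompt))
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
```

Swapping models is then a one-line change to `MODEL_ID`, which is how per-client or per-domain model configuration can be supported.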

Blob Storage for Generated Reports

Once a report is fully generated, it is saved to Amazon S3, making it accessible both within the UI and through pre-signed URLs for secure downloads. This ensures that report access remains efficient, secure, and easily integrated into downstream workflows.
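A minimal sketch of the upload-and-share step with boto3; the bucket key layout and expiry value are illustrative assumptions, not the production scheme.

```python
def report_key(job_id: str, report_name: str) -> str:
    """S3 object key layout for generated reports (layout is an assumption)."""
    return f"reports/{job_id}/{report_name}.pdf"

def publish_report(bucket: str, job_id: str, report_name: str, body: bytes,
                   expires: int = 3600) -> str:
    """Upload a finished report and return a time-limited pre-signed download URL."""
    import boto3  # deferred so the key helper above works without AWS configured
    s3 = boto3.client("s3")
    key = report_key(job_id, report_name)
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=expires
    )
```

The pre-signed URL lets downstream consumers fetch the report without the bucket ever being public.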

This modular, AWS-backed architecture allows the agent to scale seamlessly while maintaining performance across high-volume, computation-heavy reporting tasks—particularly in data-intensive industries like finance and pharmaceuticals.

Challenges Faced

Building a reporting agent capable of handling real-world document corpora required solving a number of non-trivial technical challenges. These ranged from dealing with the complexities of document formats to ensuring performance and semantic accuracy at scale. Below are some of the key obstacles we encountered and how we addressed them.

Complex Tables

One of the most persistent challenges was the extraction of information from complex tables. Many documents, particularly in finance and pharmaceutical reporting, include multi-layered tables with merged cells, nested structures, or variable header rows. These formats are often beyond traditional parsers. To address this, we implemented custom Python-based parsing logic capable of interpreting and reconstructing these tables into structured data representations suitable for template mapping.
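One recurring sub-problem is flattening multi-level headers, where merged cells leave blank gaps in the extracted grid. A simplified version of that logic (illustrative, not our production parser):

```python
def flatten_headers(header_rows: list[list[str]]) -> list[str]:
    """Collapse multi-level column headers into single labels.
    Empty cells inherit the value to their left (the usual artifact of merged
    cells in extracted tables); header levels are joined with ' / '."""
    filled = []
    for row in header_rows:
        current, out = "", []
        for cell in row:
            current = cell or current   # carry merged-cell values rightwards
            out.append(current)
        filled.append(out)
    return [" / ".join(p for p in parts if p) for parts in zip(*filled)]

headers = flatten_headers([
    ["", "Placebo", "", "Drug X", ""],
    ["Cohort", "n", "%", "n", "%"],
])
```

This yields unambiguous column names such as `"Placebo / n"`, which is what the template-mapping stage needs to align extracted values with parameters.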

Embedded Images

Another major hurdle was handling images embedded in the documents—such as charts, annotated figures, or image-based infographics—that often contained critical information absent from the surrounding text. To extract data from these formats, we integrated Tesseract OCR, deployed on AWS, which enabled the system to interpret and extract text and structure from image-based content with high fidelity.

Documents as Scanned Images

A related challenge was dealing with scanned documents, which are common in domains where archival or regulatory documents are often stored as image-only PDFs. These documents lacked embedded text and required robust OCR to extract meaningful information. The Tesseract-powered OCR pipeline helped bridge this gap, extracting not only text but also layout-aware elements like headings and tables for consistent downstream use.
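As an illustration of the layout-aware step: word-level OCR output (the dict format returned by pytesseract's `image_to_data` with `Output.DICT`) can be regrouped into ordered lines. This is a simplified helper, not the production pipeline.

```python
def group_lines(ocr: dict) -> list[str]:
    """Reassemble word-level OCR output (pytesseract `image_to_data` dict format,
    keys 'block_num', 'line_num', 'text') into ordered lines of text."""
    lines: dict[tuple, list[str]] = {}
    for block, line, word in zip(ocr["block_num"], ocr["line_num"], ocr["text"]):
        if word.strip():  # Tesseract emits empty entries for layout gaps
            lines.setdefault((block, line), []).append(word)
    return [" ".join(words) for _, words in sorted(lines.items())]
```

Keeping the block/line grouping (rather than concatenating all words) is what preserves headings and table rows as distinct units for downstream use.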

Document Sizes

We also encountered significant variability in document size, with some reports stretching beyond 100 pages. Processing such large documents directly with an LLM was not feasible due to token and context window constraints. To overcome this, we developed a two-track approach: smaller documents were processed in a single pass, while larger ones were automatically divided into 30-page chunks. This kept each processing unit within context limits while enabling distributed handling of large documents.

Dealing with Semantic Loss from Splitting Large Documents

Chunking large documents introduced the risk of losing semantic continuity between sections. Important context or cross-references could be missed if each chunk was treated in isolation. To mitigate this, we implemented a dual strategy: overlapping context windows between chunks helped preserve local continuity, and a final summarization pass was performed across all extracted data to unify insights and maintain narrative cohesion.
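A sketch of the overlapping-window strategy; the overlap size here is an illustrative choice, as the article does not specify one.

```python
def overlapping_chunks(pages: list[str], chunk_size: int = 30, overlap: int = 3) -> list[list[str]]:
    """Split pages into chunks that share `overlap` pages with their neighbour,
    so context and cross-references near a boundary appear in both chunks."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(pages), step):
        chunks.append(pages[start:start + chunk_size])
        if start + chunk_size >= len(pages):
            break  # last chunk already reaches the end of the document
    return chunks
```

After all chunks are processed, the final summarization pass runs over the combined extractions; the overlap means boundary material is seen twice rather than missed.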

Performance and Cost

Finally, we addressed the critical concern of performance and cost efficiency. With large-scale document processing, particularly involving OCR and LLM inference, costs can escalate quickly. We implemented a document checksum mechanism using our AWS-hosted MongoDB backend. When a document is uploaded, the agent computes a checksum and checks it against stored entries. If a match is found, the previously extracted data is reused, significantly reducing processing time and compute resource usage.

Together, these solutions allowed the reporting agent to handle a broad variety of document formats, maintain performance across different workloads, and generate high-quality outputs even in complex, high-volume environments.

A Domain-Specialization Case Study: Pharmacovigilance in Drug Discovery

One of the earliest and most impactful applications of the reporting agent was in the domain of pharmacovigilance—the process of detecting, assessing, understanding, and preventing adverse effects or any other drug-related problems during clinical development. Our goal was to enable faster, more accessible reporting for stakeholders ranging from regulatory bodies to non-technical executive teams.

The Use Case

Pharmaceutical companies routinely produce detailed clinical study reports (CSRs), which summarize trial design, methodology, results, and safety observations. These documents are typically 20–30 pages long but densely packed with complex tables, graphical summaries, and domain-specific language. They often include:

  • multi-column tables with varied formatting;
  • charts embedded as images with referenced text; and
  • dense narrative sections linking adverse events to patient cohorts or dosage groups.

Manually distilling these reports into accessible formats for different stakeholder groups was labor-intensive and error-prone. Our task was to specialize the reporting agent to automatically generate audience-specific summaries using customizable templates.

Template Design for Multi-audience Reporting

To accommodate different reporting requirements, we developed three distinct template categories:

  • Public-Facing Reports: Simplified summaries for non-technical audiences, focusing on safety, efficacy, and trial scope.
  • Regulatory Reports: Structured templates tailored to the data disclosure requirements of agencies like the FDA or EMA.
  • Shareholder Briefings: Executive-level summaries focused on trial outcomes, risks, and potential business implications.

Each template was embedded with placeholders and reusable parameters such as the following.

  • study_intervention
  • study_end_date
  • medicine_studied
  • country_distribution
  • lay_study_title

These parameters not only guided the agent in generating content but also preserved traceability between extracted insights and source material.

Figure 1. The Templates Dashboard.

Customization and Extraction

The core document used was a PDF-based CSR. Despite its relatively small size, its formatting complexity—such as tables with misaligned columns or text-delineated columns—demanded robust parsing. We adapted our extraction pipeline with:

  • Python-based custom table parsers; and
  • minimal prompt-tuning to help the LLM interpret domain-specific language (e.g., “serious adverse event” vs. “non-serious”).

Figure 2. The Parameters Dashboard.

LLM-Driven Generation and Result Traceability

Once extracted, data was mapped to the appropriate template using our parameter system, and the report was generated using Claude 3.5 Sonnet v2 via Amazon Bedrock.

An important enhancement in this domain was the ability to trace every generated insight back to its source. Parameters carried metadata such as:

  • page numbers;
  • original table or paragraph references; and
  • document checksums.

This allowed stakeholders to verify claims by drilling down from the generated report back to the exact page or table in the original CSR.
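As an illustration, each extracted value can be modelled as a record carrying this provenance metadata; the field names and class are hypothetical, not the production schema.

```python
from dataclasses import dataclass

@dataclass
class ExtractedValue:
    """A parameter value together with the provenance metadata listed above."""
    parameter: str
    value: str
    page: int                # page in the source CSR
    source_ref: str          # e.g. "Table 2" or a paragraph anchor
    document_checksum: str   # ties the value back to a specific ingested file

record = ExtractedValue(
    parameter="study_end_date",
    value="2024-03-15",
    page=12,
    source_ref="Table 2",
    document_checksum="sha256-placeholder",
)
```

Rendering these fields alongside each generated statement is what makes drill-down verification possible without re-reading the whole source document.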

Figure 3. The Reports Dashboard.

Measurable Impact

  • transparency and integrity were established thanks to a clear data lineage;
  • report generation time was reduced from hours to minutes;
  • manual validation effort dropped significantly due to traceability features; and
  • non-technical teams reported improved comprehension and trust in the outputs.

Figure 4. A view of a generated report.

Conclusion

As industries grapple with ever-growing volumes of complex documentation, the need for intelligent, scalable reporting solutions has never been more critical. The reporting agent we’ve developed addresses this gap head-on, transforming unstructured documents into structured, stakeholder-ready reports with remarkable efficiency.

By combining a robust template-driven design, multi-modal document extraction, and scalable language model integration through AWS infrastructure, the agent enables users to generate high-quality reports with traceable, verifiable insights. Its adaptability across domains—from finance and market research to highly specialized areas like pharmacovigilance—proves its utility in real-world, high-stakes environments.

The pharmacovigilance case study demonstrated not only a tangible reduction in effort and turnaround time but also the importance of transparency and traceability in AI-assisted reporting. By providing end users with both insight and lineage, the system builds trust while delivering measurable value.

Looking ahead, the potential for further domain specialization, integration with active learning systems, or even real-time document stream processing opens the door for the agent to become a core component in enterprise knowledge workflows.

As AI capabilities mature, tools like this reporting agent will be at the forefront of making complex information accessible, actionable, and aligned with business objectives.
