Introduction
In today’s digital economy, enterprises are inundated with unstructured and semi-structured data, particularly in the form of documents such as invoices, contracts, reports, and forms. Processing these documents manually is not only time-consuming but also error-prone and expensive. As organizations strive to streamline operations and extract actionable insights from data, Intelligent Document Processing (IDP) has emerged as a transformative capability.
At the intersection of artificial intelligence, machine learning, and natural language processing, IDP enables businesses to automate the extraction, classification, and interpretation of document content with high accuracy. However, traditional IDP solutions often fall short when it comes to flexibility, scalability, and seamless integration into broader enterprise workflows.
To address these limitations, we developed a modular, AI-powered agent purpose-built for intelligent document processing. This agent is designed to operate autonomously across various document types and use cases, while providing configuration options for business-specific rules and integrations.
This article provides a comprehensive overview of the agent, including how it works, the user journey it supports, the underlying technologies, and the development challenges we faced. A key part of the article is dedicated to a real-world case study in the banking industry, where the agent was used to automate document workflows for onboarding retail customers, a process traditionally burdened by high document volume and regulatory compliance requirements.
How the Agent Works
The Intelligent Document Processing (IDP) agent is designed to automate the transformation of unstructured document content—often in the form of scanned PDFs or handwritten forms—into structured, machine-readable data that can be consumed by downstream systems and workflows. By leveraging advanced AI technologies, the agent can process a wide variety of document types with minimal human intervention.
At a high level, the agent follows a multi-step pipeline to convert incoming documents into structured data. While the core pipeline remains consistent across use cases, there are two primary approaches to data extraction that the agent supports, depending on the nature of the input document and the required accuracy, cost, and runtime characteristics.

As illustrated in Figure 1, the extraction pipeline begins with a common set of preprocessing steps, followed by one of two extraction paths.
Pre-processing Pipeline
Regardless of the extraction method used, the initial document handling process includes the following stages (sketched in code after the list).
- Document Ingestion: The agent receives an uploaded document (in PDF format).
- Page-Splitting: The document is split into individual pages for independent processing.
- Image Conversion: Each page is converted into a high-resolution image.
- Image Post-processing: The image is cleaned and standardized—e.g., converted to grayscale, resized, and deskewed—to optimize downstream recognition performance.
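A minimal sketch of these stages in Python, assuming the pdf2image and OpenCV libraries (plus a local Poppler install); function names, the pixel threshold, and the deskew heuristic are illustrative, not the agent's actual code:

```python
import cv2
import numpy as np
from pdf2image import convert_from_path  # requires a local Poppler install


def deskew(img: np.ndarray) -> np.ndarray:
    """Estimate the dominant text angle and rotate the page upright."""
    coords = np.column_stack(np.where(img < 200)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:  # OpenCV may report the complementary angle
        angle -= 90
    h, w = img.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)


def preprocess(pdf_path: str, dpi: int = 300) -> list[np.ndarray]:
    """Ingest a PDF, split it into pages, rasterize, and clean each page image."""
    pages = convert_from_path(pdf_path, dpi=dpi)  # page-splitting + image conversion
    cleaned = []
    for page in pages:
        img = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)  # standardize: grayscale
        cleaned.append(deskew(img))  # resizing omitted here for brevity
    return cleaned
```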
Extraction Method 1: Vision-Language Model (VLM)
This method uses a Vision-Language Model (VLM), a type of large language model capable of interpreting both visual and textual inputs. VLMs are particularly effective for structured or printed documents where layout and formatting provide strong semantic cues.
- Input: Each pre-processed image (page) is passed to the VLM.
- Instruction Prompting: The model receives structured instructions describing the expected schema and layout of the document.
- Output: For each page, the VLM returns a structured data block (e.g., JSON), which is later aggregated across all pages (see the sketch below).
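A hedged sketch of this flow using the Amazon Bedrock Converse API via boto3 (the stack described later in this article); the model ID, region, and schema are placeholders, and the real agent's prompts are more elaborate:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def extract_page_vlm(image_bytes: bytes, schema: dict, model_id: str) -> dict:
    """Send one page image plus schema instructions to a VLM; expect JSON back."""
    prompt = ("Extract the fields described by this JSON schema from the attached "
              f"form page. Return JSON only.\nSchema: {json.dumps(schema)}")
    response = bedrock.converse(
        modelId=model_id,
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
                {"text": prompt},
            ],
        }],
        inferenceConfig={"temperature": 0, "maxTokens": 2048},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])
```

The per-page results can then be merged into a single document record before validation.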
While this method is well-suited for printed and well-scanned forms, its performance may degrade when processing handwritten text or documents with low visual quality. Additionally, VLMs are computationally expensive, which can affect scalability and cost-efficiency.
Extraction Method 2: OCR + Text-to-Text LLM
In scenarios where documents contain significant amounts of handwritten or degraded text—or when resource efficiency is a concern—the agent switches to a hybrid extraction approach that separates vision and language understanding tasks.
- Optical Character Recognition (OCR): Each page image is processed using a high-accuracy OCR engine. The output is structured as markdown or rich text that preserves the layout and semantic formatting of the original document.
- Text-to-Text Language Model: The extracted markdown text is passed to a text-only large language model (LLM), along with prompts describing the expected data schema and section definitions.
This two-step method provides greater flexibility for handling noisy inputs, while often reducing overall compute costs. The final structured output is aggregated from each page and validated against business rules.
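Under the same assumptions as the previous sketch (boto3 and the Bedrock Converse API), the language-understanding half of this flow might look like the following; the OCR step itself is shown in the Technology section:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def extract_page_text(page_markdown: str, schema: dict, model_id: str) -> dict:
    """Pass OCR markdown for one page to a text-only LLM with schema instructions."""
    prompt = ("The markdown below is OCR output from one page of a form. Extract "
              f"the fields and return JSON only.\nSchema: {json.dumps(schema)}\n\n"
              f"{page_markdown}")
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0, "maxTokens": 2048},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])
```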
By supporting both processing methods, the agent offers a robust and adaptive architecture capable of handling diverse document types—from clean printed forms to handwritten records—while balancing accuracy, cost, and performance based on the specific operational context.
Ensuring Accuracy: A Benchmarking Framework for Data Extraction
This framework compares two datasets, typically stored as JSON: the "expected" ground-truth values and the "actual" extraction results. Its goal is to measure how closely the actual data matches the expected data. It does this by thoroughly analyzing the structure and content of both datasets, even when they are complex, hierarchical, or contain lists. Discrepancies between the two datasets—such as missing or extra fields, as well as slight formatting differences—are identified and documented. The comparison also normalizes minor variations such as extra whitespace or differing capitalization, ensuring a consistent and fair evaluation.
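As an illustration of this comparison logic, a minimal sketch might recurse over both structures, normalize leaf values, and report mismatches by path; the real framework is more thorough:

```python
from itertools import zip_longest


def normalize(value):
    """Fold case and collapse whitespace so formatting noise is not a mismatch."""
    return " ".join(value.split()).lower() if isinstance(value, str) else value


def diff(expected, actual, path="$"):
    """Yield (path, expected, actual) for every leaf value that does not match."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        for key in expected.keys() | actual.keys():  # catches missing AND extra fields
            yield from diff(expected.get(key), actual.get(key), f"{path}.{key}")
    elif isinstance(expected, list) and isinstance(actual, list):
        for i, (e, a) in enumerate(zip_longest(expected, actual)):
            yield from diff(e, a, f"{path}[{i}]")
    elif normalize(expected) != normalize(actual):
        yield path, expected, actual
```

Overall accuracy can then be computed as the share of expected leaf fields with no recorded mismatch.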
Once the comparison is made, the framework logs the results in a dedicated file for future reference. Each comparison is recorded with details such as the accuracy percentage and a timestamp, allowing users to track the performance of different datasets over time. In addition to logging the overall accuracy, the framework breaks down the results into sections, providing a detailed view of how well each part of the dataset matches. This section-based breakdown is useful for pinpointing specific areas where discrepancies are most prominent, giving users insights into where their data may need adjustment or improvement.
To further enhance the analysis, the tool also generates a visual representation of the accuracy for each section. This is presented as a bar chart, where each bar shows the percentage accuracy of a specific section of the dataset. The chart provides an easy-to-understand overview of which sections performed well and which ones need attention, making it an effective way to visualize the results. Overall, this framework serves as a comprehensive solution for evaluating and comparing datasets, particularly when working with structured data like JSON, offering both numerical accuracy metrics and visual feedback to guide decision-making.
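A sketch of the charting step, assuming matplotlib and per-section scores computed by a comparison like the one above:

```python
import matplotlib.pyplot as plt


def plot_section_accuracy(scores: dict[str, float], out_path: str = "accuracy.png"):
    """Render one bar per dataset section, scaled to percentage accuracy."""
    sections = list(scores)
    values = [scores[s] * 100 for s in sections]
    plt.figure(figsize=(8, 4))
    plt.bar(sections, values)
    plt.ylim(0, 100)
    plt.ylabel("Accuracy (%)")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.savefig(out_path)
```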
The User Journey
To ensure a seamless and intuitive experience for business users, the Intelligent Document Processing (IDP) agent is accompanied by a web-based interface that facilitates document submission, monitoring, and review. This web application is designed to integrate smoothly into existing enterprise platforms or be deployed as a standalone interface, depending on the use case.
The user journey is designed to be simple and transparent, guiding users through each stage of the document processing lifecycle.
- Upload: Users begin by uploading a scanned or digital document—in PDF format—through a secure web interface. The document may originate from various sources, such as physical scans, email attachments, or uploads from third-party systems.
- Processing: Once submitted, the document enters the agent’s backend processing pipeline. The user is provided with real-time feedback on the processing status through visual indicators such as progress bars or status messages. This step is fully automated, requiring no user input while the AI performs image transformation, content recognition, and data extraction.
- Review and Confirmation: Upon completion, the user is presented with a structured web form populated with the extracted data. The original document is displayed side-by-side with the form, enabling users to quickly compare and verify the accuracy of the extracted information. Users can make manual corrections if needed before submitting the final structured data to downstream systems.
This intuitive, guided workflow ensures that non-technical users can interact effectively with the AI agent, reducing friction in adoption and increasing confidence in automated document processing outcomes.
The Technology
The intelligent document processing agent is built on top of a focused, cloud-native architecture leveraging key AWS services for scalability, observability, and seamless integration with enterprise systems.
When a document is uploaded through the web application, it is stored in Amazon S3, which serves as the central repository for all incoming files. S3 offers a reliable and secure way to handle high volumes of scanned documents, ensuring durability while enabling fast access for downstream processing.
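For illustration, the upload handler might store each file under a unique key with boto3; the bucket name, key prefix, and encryption settings below are assumptions:

```python
import uuid

import boto3

s3 = boto3.client("s3")


def store_upload(file_bytes: bytes, bucket: str = "idp-incoming-docs") -> str:
    """Persist an uploaded PDF under a unique key and return the key to the pipeline."""
    key = f"uploads/{uuid.uuid4()}.pdf"
    s3.put_object(Bucket=bucket, Key=key, Body=file_bytes,
                  ContentType="application/pdf",
                  ServerSideEncryption="aws:kms")
    return key
```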
To extract structured data from these documents, the agent supports two primary processing flows. In one approach, the agent uses Amazon Textract, AWS’s managed OCR service, to convert scanned or handwritten documents into text while preserving layout information. Textract is especially useful when dealing with lower-quality scans or documents with handwriting, where vision-language models may struggle with accuracy or efficiency.
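A minimal single-page Textract call could look like this; detect_document_text is the basic OCR API, while the agent's layout-preserving flow would use the richer analysis options:

```python
import boto3

textract = boto3.client("textract")


def ocr_page(image_bytes: bytes) -> str:
    """OCR one page image and join the detected LINE blocks in reading order."""
    result = textract.detect_document_text(Document={"Bytes": image_bytes})
    return "\n".join(block["Text"] for block in result["Blocks"]
                     if block["BlockType"] == "LINE")
```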
To enhance handwritten text recognition and overcome the limitations of traditional OCR and Vision Language Models (VLMs), the agent integrates LLMWhisperer, a specialized AI tool trained to improve accuracy on handwritten documents by combining advanced language modeling with custom fine-tuning. This complementary approach significantly boosts extraction accuracy for handwritten content.
For more complex extraction tasks, the agent uses foundation models available through Amazon Bedrock. Depending on the nature of the document and the configuration of the pipeline, either a Vision Language Model (VLM) or a text-to-text Language Model (LLM) is invoked. Vision models are used to interpret entire document pages as images, while text models operate on OCR output for a more cost-efficient and targeted analysis. Bedrock provides a unified interface to access these models, simplifying integration and ensuring that the agent remains adaptable to evolving model capabilities.
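Reusing the extraction helpers sketched earlier (extract_page_vlm, extract_page_text, ocr_page), the flow selection could be as simple as a conditional; the Bedrock model IDs shown reflect the Llama 3.2 choice discussed under Challenges Faced and are correct as of this writing:

```python
# Bedrock model IDs for the Llama 3.2 models discussed later (as of this writing).
VISION_MODEL_ID = "meta.llama3-2-90b-instruct-v1:0"
TEXT_MODEL_ID = "meta.llama3-2-3b-instruct-v1:0"


def extract_document(page_images: list[bytes], schema: dict,
                     handwritten: bool) -> list[dict]:
    """Route handwritten/degraded input to OCR+LLM; clean scans go to the VLM."""
    if handwritten:
        return [extract_page_text(ocr_page(img), schema, TEXT_MODEL_ID)
                for img in page_images]
    return [extract_page_vlm(img, schema, VISION_MODEL_ID) for img in page_images]
```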
As documents are processed, the system logs each job and its associated metadata to Amazon OpenSearch Service, which acts as the central indexing and monitoring solution. OpenSearch enables real-time search and filtering across processed jobs, making it easy for support teams or business users to query the status of a submission, review outputs, or audit the processing history.
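Job logging could use the opensearch-py client along these lines; the domain endpoint, index name, and document shape are assumptions for illustration:

```python
from datetime import datetime, timezone

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "search-idp.example.com", "port": 443}],
                    use_ssl=True)


def log_job(job_id: str, status: str, metadata: dict) -> None:
    """Index one processing-job record so it is searchable and auditable."""
    client.index(index="idp-jobs", id=job_id, body={
        "status": status,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **metadata,
    })
```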
Together, these technologies form a robust foundation for the agent, balancing performance, cost, and flexibility. By leaning on managed AWS services for storage, AI, and observability, the solution is both maintainable and scalable—ready to meet the document processing demands of a modern enterprise environment.
Challenges Faced
Building a robust intelligent document processing agent capable of handling diverse and complex inputs presented several technical challenges, particularly around the accurate extraction of handwritten content and the effective utilization of large language models.
One of the most significant hurdles was processing handwritten documents. Handwriting variability, noise from scanned images, and inconsistent formatting posed difficulties for both VLMs and OCR technologies. While Amazon Textract provided reliable OCR capabilities for printed and some handwritten texts, it struggled with more challenging handwritten inputs. To overcome this, the integration of LLMWhisperer—a specialized AI tool optimized for handwritten text recognition—proved critical in enhancing accuracy and overall extraction quality.
In parallel, the team encountered extensive trial and error in selecting and tuning the appropriate foundation models within Amazon Bedrock. This iterative process involved testing multiple models and configurations to balance performance and cost-effectiveness. Ultimately, the team settled on Llama 3.2 models tailored for different pipeline segments: the 90-billion-parameter version for vision-centric tasks involving direct image inputs, and the 3-billion-parameter version for the OCR-then-LLM flow to process textual data extracted from documents.
Given the complexity and variability of the extraction tasks, systematic evaluation was essential. This need led to the creation and deployment of a comprehensive benchmarking framework capable of quantitatively comparing extracted data against ground truth with a high degree of granularity. The framework enabled continuous assessment of model performance during the experimentation phase, guiding informed decisions about model selection, prompt engineering, and pipeline adjustments. It also provided transparent reporting and visualization of accuracy metrics, which was invaluable for tracking improvements and identifying persistent extraction issues.
Together, these challenges shaped the development approach, emphasizing rigorous testing, specialized tooling for handwritten text, and adaptive use of foundation models to deliver a dependable intelligent document processing solution.
A Case Study: Retail Banking Onboarding
One of the earliest and most impactful applications of our intelligent document processing (IDP) agent was in the retail banking sector, specifically in automating the onboarding process for new customers. This traditionally paper-heavy workflow required applicants to fill out physical forms by hand, which were then scanned and processed manually by operations staff. The workflow introduced several pain points.
- Manual Data Entry: Each form had to be read and transcribed by staff, a time-consuming and error-prone task, particularly with high volumes.
- Slow Processing: Delays between form submission and account activation led to poor customer experience and operational backlogs.
- Inconsistent Customer Experience: Accuracy depended heavily on individual operator attention, and errors in transcription could lead to failed compliance checks or delays in onboarding.
To address these challenges, we developed a specialized version of the IDP agent tailored to the bank’s onboarding forms. This version of the agent was embedded within a secure internal web application designed for use by frontline staff and operations teams.
Initiating a New Request
The user interface allows staff to easily upload scanned onboarding forms into the system. Upon submission, the agent begins processing the document automatically.

Real-Time Document Processing
Once uploaded, the user can monitor the progress of the request through a live status interface. The backend pipeline applies preprocessing, OCR, and AI-driven data extraction in a matter of seconds.

Accurate, Validated Results
Upon completion, the extracted data is displayed in a structured web form side-by-side with the original document. This allows staff to quickly verify the results. A key enhancement in this solution was the use of prompt engineering to enforce field-level validation rules—such as confirming that national ID numbers follow a certain pattern, that date of birth is in a valid range, or that mandatory fields are not left blank.
These rules were built into the LLM prompt context, ensuring that the agent didn’t simply extract what it saw, but also checked it against banking requirements. This dramatically reduced the number of form rejections and manual corrections.
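For illustration, such rules might be appended to the extraction prompt as plain instructions; the field names, ID pattern, and age range below are hypothetical, not the bank's actual requirements:

```python
import json

# Hypothetical rules for illustration; the real patterns and ranges are the bank's.
VALIDATION_RULES = """
After extracting, validate each field and add a "warnings" array to the JSON:
- national_id must match the pattern ^[0-9]{10}$.
- date_of_birth must be a valid date implying an age between 18 and 100.
- Mandatory fields (full_name, national_id, date_of_birth) must not be empty.
Flag any violation in "warnings" instead of guessing a corrected value.
"""


def build_onboarding_prompt(schema: dict) -> str:
    """Combine the schema description with field-level validation rules."""
    return ("Extract the fields below from the onboarding form. Return JSON only.\n"
            f"Schema: {json.dumps(schema)}\n{VALIDATION_RULES}")
```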

Results That Rival Human Accuracy
Through multiple iterations of prompting, benchmarking, and fine-tuning—including the selection of the Llama 3.2 90B model for vision tasks and the 3B model for OCR-to-text workflows—we were able to achieve greater than 95% extraction accuracy across real-world onboarding form submissions. This matched or exceeded the performance of manual data entry staff, with the added benefits of speed, consistency, and auditability.
Conclusion
Intelligent document processing has become a critical capability for organizations seeking to automate manual workflows, improve operational efficiency, and ensure data quality at scale. The agent described in this article illustrates how a thoughtful integration of AI technologies and cloud infrastructure can deliver reliable, accurate, and adaptable document automation.
Through the use of multiple extraction strategies—including vision-language models, OCR pipelines, and tailored prompt engineering—the agent achieves high performance across a range of document types and formats. In the case of retail banking onboarding, this approach delivered extraction accuracy that matched or exceeded manual data entry, improved processing times, and reduced the operational burden on staff.
The supporting benchmarking framework, cloud-native architecture, and web interface all contribute to a production-ready solution that is both scalable and auditable. As document complexity and volume continue to grow, this agent architecture can be adapted across industries such as banking, insurance, healthcare, and government services—where high-volume, high-accuracy document processing is essential to core operations.