Implementing AI Data Extraction Software: Code Examples and Best Practices for Programmers

by | Feb 20, 2026 | My Blog

AI data extraction is the process of pulling structured, typed fields from unstructured or semi-structured sources, such as PDFs, scanned images, and raw HTML, using machine learning models, large language models (LLMs), or managed cloud APIs rather than hand-written rules. Your regex-based extraction breaks the moment a vendor changes their invoice layout. This guide gives you working Python code, honest trade-off analysis, and production patterns to ship a pipeline that handles real-world document variability.

Quick Answer: How do I implement AI data extraction?
  • Choose your approach based on document type: managed API for structured forms, LLM for unstructured text, self-hosted for data sovereignty.
  • Define a Pydantic schema to enforce typed output before data touches your database.
  • Use OpenAI function calling or tool use for schema-constrained extraction from unstructured text.
  • Add retry logic with exponential backoff and route failures to a dead-letter queue, not /dev/null.
  • Gate low-confidence extractions for human review using confidence scores from the API or a secondary LLM call.

Three Approaches to AI Data Extraction and When to Use Each

The implementation path you choose determines your accuracy ceiling, your operational cost, and how much debugging your team will be doing at 2am. When you evaluate AI data extraction software, you have three real options, and each has a distinct failure profile.

  • Managed document AI APIs (AWS Textract, Azure Form Recognizer) use purpose-trained ML models for specific document types: invoices, receipts, tax forms, ID documents. They win on accuracy for known layouts and require almost no prompt engineering. The cost is vendor lock-in and per-page pricing that scales linearly with volume. For high-volume, structured document types, they’re the right call.
  • LLM-based extraction via the OpenAI API, Anthropic, or equivalent gives you flexibility for unstructured, variable-format documents. You define a schema in a system prompt or via function calling, send the document text, and parse the response. The catch: LLMs hallucinate. They’ll confidently return a field value that doesn’t exist in the source document. Output validation is not optional here.
  • Self-hosted open-source models (Llama 3, Mistral, or fine-tuned variants via Hugging Face) give you data sovereignty and eliminate per-call API costs. The overhead is real: you’re managing GPU infrastructure, model versioning, and inference latency. Teams under 10 engineers rarely have the capacity to maintain this well.

Use this decision heuristic: structured, high-volume documents with known layouts go to managed APIs. Unstructured, variable documents go to LLM APIs with schema enforcement. Regulated environments with data residency requirements go to self-hosted models, but only if your team has the infrastructure capacity to support them.

Extracting Structured Data from PDFs and Images with Python

Text-Layer PDFs with pdfplumber

For PDFs with a text layer (not scanned), pdfplumber gives you clean text extraction before you send anything to an LLM or structured API. The following example extracts text from each page and prepares it for downstream processing.

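import pdfplumber

def extract_text_from_pdf(file_path: str) -> str:
    """Return the text layer of every page, separated by blank lines."""
    pages = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            # Image-only pages yield None or "" depending on the pdfplumber version
            if text:
                pages.append(text.strip())
    return "\n\n".join(pages)

raw_text = extract_text_from_pdf("invoice.pdf")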

Running this returns a single string of concatenated page text, ready to pass to your extraction layer. If extract_text() returns None or an empty string for most pages (which one depends on your pdfplumber version), you’re dealing with a scanned document and need OCR instead.
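The scanned-document check described above can be automated. The following is a minimal sketch, not part of pdfplumber: needs_ocr is a hypothetical helper, and the 50% threshold is an arbitrary starting point you would tune on your own documents. It takes the list of per-page extract_text() results.

```python
from typing import Optional, Sequence


def needs_ocr(page_texts: Sequence[Optional[str]], min_text_ratio: float = 0.5) -> bool:
    """Return True when too few pages carry a usable text layer.

    page_texts holds the per-page results of page.extract_text(), which may be
    None or an empty string for image-only pages. min_text_ratio is an
    illustrative threshold (an assumption), not a pdfplumber setting.
    """
    if not page_texts:
        return True  # nothing extractable at all
    pages_with_text = sum(1 for text in page_texts if text and text.strip())
    return pages_with_text / len(page_texts) < min_text_ratio
```

Routing on this flag lets the same pipeline send text-layer PDFs straight to extraction and scanned ones through OCR first.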

Scanned Documents with AWS Textract

AWS Textract’s AnalyzeDocument API handles scanned images and PDFs using Amazon’s document ML models. The synchronous call below works for single-page documents and quick prototyping. Multi-page PDFs, and files above the synchronous size limit (5 MB when you pass raw bytes), need the asynchronous job pattern via start_document_analysis and get_document_analysis.


import boto3

textract = boto3.client("textract", region_name="us-east-1")

def extract_with_textract(image_bytes: bytes) -> dict:
    response = textract.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["FORMS", "TABLES"]
    )
    
    fields = {}
    for block in response["Blocks"]:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            key_text = get_text_for_block(block, response["Blocks"])
            value_block = get_value_block(block, response["Blocks"])
            if value_block:
                value_text = get_text_for_block(value_block, response["Blocks"])
                fields[key_text] = value_text
    return fields

def get_text_for_block(block, all_blocks):
    text = ""
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = next((b for b in all_blocks if b["Id"] == child_id), None)
                if child and child["BlockType"] == "WORD":
                    text += child["Text"] + " "
    return text.strip()

def get_value_block(key_block, all_blocks):
    for rel in key_block.get("Relationships", []):
        if rel["Type"] == "VALUE":
            for val_id in rel["Ids"]:
                return next((b for b in all_blocks if b["Id"] == val_id), None)
    return None

This returns a dictionary of key-value pairs extracted from form fields. For multi-page PDFs, upload the file to S3 first and use start_document_analysis with the S3 object reference, then poll get_document_analysis until JobStatus is SUCCEEDED.
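The asynchronous flow just described might look like the sketch below. It assumes the file is already in S3 and that the textract argument behaves like boto3.client("textract"); the function name run_textract_job, the polling interval, and the poll limit are illustrative choices. It also follows NextToken, because results for large documents arrive in pages.

```python
import time


def run_textract_job(textract, bucket: str, key: str,
                     poll_seconds: float = 5.0, max_polls: int = 60) -> list:
    """Start an async analysis job and collect all result blocks.

    `textract` is expected to behave like boto3.client("textract").
    poll_seconds and max_polls are illustrative defaults (assumptions).
    """
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS", "TABLES"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves IN_PROGRESS or the poll budget runs out.
    for _ in range(max_polls):
        response = textract.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} still running after {max_polls} polls")

    if status != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    # Results are paginated: follow NextToken until it is absent.
    blocks = list(response.get("Blocks", []))
    next_token = response.get("NextToken")
    while next_token:
        response = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
    return blocks
```

Passing the client in as a parameter keeps the function testable with a stub. The returned block list has the same shape as response["Blocks"] from the synchronous call, so the key-value helpers above can be reused on it.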

Azure Form Recognizer (now part of Azure AI Document Intelligence) offers a parallel capability via the azure-ai-formrecognizer SDK or its successor package, azure-ai-documentintelligence. Its prebuilt models for invoices, receipts, and business cards often require zero configuration for common document types, which is worth evaluating if your team is already on Azure.

Using the OpenAI API to Extract Typed Fields from Unstructured Text

Schema Enforcement with Function Calling

Raw JSON prompting works in demos. In production, the model occasionally returns malformed JSON, adds unexpected fields, or omits required ones. OpenAI’s function calling (exposed through the tools parameter) steers the output toward a defined JSON schema at the API level, making it far more reliable than asking the model to “return JSON.” Unless you also enable strict mode, schema conformance is not guaranteed, which is why the Pydantic validation step below stays in the path.

First, define your extraction schema as a Pydantic model. This gives you both the JSON schema for the API call and a validation layer for the response.


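import json
from typing import Optional

from openai import OpenAI
from pydantic import BaseModel, Field

class InvoiceExtraction(BaseModel):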
    vendor_name: str = Field(description="Name of the vendor or supplier")
    invoice_number: str = Field(description="Invoice or document number")
    invoice_date: str = Field(description="Invoice date in YYYY-MM-DD format")
    total_amount: float = Field(description="Total invoice amount as a number")
    currency: str = Field(description="Three-letter currency code, e.g. USD")
    due_date: Optional[str] = Field(default=None, description="Payment due date in YYYY-MM-DD format")

client = OpenAI()

def extract_invoice_fields(document_text: str) -> InvoiceExtraction:
    tools = [{
        "type": "function",
        "function": {
            "name": "extract_invoice",
            "description": "Extract structured invoice fields from document text",
            "parameters": InvoiceExtraction.model_json_schema()
        }
    }]
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract invoice fields from the provided text. Only extract values explicitly present in the document. If a field is not present, omit it or return null."},
            {"role": "user", "content": document_text}
        ],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "extract_invoice"}}
    )
    
    tool_call = response.choices[0].message.tool_calls[0]
    raw_data = json.loads(tool_call.function.arguments)
    return InvoiceExtraction(**raw_data)

Running this returns a validated InvoiceExtraction Pydantic object with typed fields. If the model omits a required field or returns a string that can’t be parsed as a float, Pydantic raises a ValidationError before the data reaches your application logic. Note that Pydantic’s default (lax) mode coerces a numeric string such as "1234.56" to a float but rejects "$1,234.56". That’s exactly the behavior you want.

The system prompt instruction “only extract values explicitly present in the document” is your primary hallucination mitigation strategy. Grounding the model in the source text, rather than asking it to infer or complete missing fields, reduces confabulated values. Research by Motzfeldt Jensen et al. (Aalborg University Hospital), published in PLOS ONE, found that GPT-4o achieved 92.4% accuracy across 484 data points in structured extraction tasks, with a reproducibility agreement of 94.1% across two separate sessions. That accuracy is real, but it’s not 100%, which is why your validation layer matters.
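The prompt instruction can be backed by a cheap post-hoc check: confirm that each extracted string actually occurs in the source text. The sketch below is one possible approach, not something the OpenAI API provides, and ungrounded_fields is a hypothetical helper name. It compares case-insensitively and ignores whitespace differences.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(extracted: dict, source_text: str) -> list:
    """Return names of string fields whose value never appears in the source.

    Non-string values (numbers, None) are skipped; they need their own checks.
    """
    haystack = _normalize(source_text)
    missing = []
    for name, value in extracted.items():
        if isinstance(value, str) and value.strip():
            if _normalize(value) not in haystack:
                missing.append(name)
    return missing
```

Fields the model is asked to reformat, such as dates normalized to YYYY-MM-DD, will be flagged even when correct, so treat a hit as a reason to review the field rather than proof of hallucination.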

Building an Extraction Pipeline with LangChain

Document Loading and Chunked Extraction

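LangChain’s document loaders and output parsers reduce boilerplate for multi-step workflows, particularly when you’re processing long documents that exceed a single context window. The pattern: load the document, split it into chunks small enough for the model (RecursiveCharacterTextSplitter measures chunk_size in characters by default, not tokens), extract fields from each chunk, then merge results. Chunks that lack a required field will fail parsing, which the try/except in the loop below absorbs.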

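# Import paths follow the split packages (langchain-core, langchain-text-splitters);
# the older langchain.text_splitter / langchain.prompts paths are not available in every release.
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate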

loader = PyMuPDFLoader("contract.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

parser = PydanticOutputParser(pydantic_object=InvoiceExtraction)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract invoice fields from the text. {format_instructions}"),
    ("human", "{text}")
]).partial(format_instructions=parser.get_format_instructions())

chain = prompt | llm | parser

results = []
for chunk in chunks:
    try:
        extracted = chain.invoke({"text": chunk.page_content})
        results.append(extracted)
    except Exception as e:
        print(f"Chunk extraction failed: {e}")
        continue

LangChain speeds up prototyping significantly. The trade-off is real: the abstraction layers make stack traces harder to read, and debugging a failing chain often requires unwrapping multiple intermediate objects. For production pipelines where observability matters, consider whether the boilerplate savings justify the debugging overhead. Many teams prototype with LangChain and then rewrite the core extraction logic with direct API calls before deploying to production.
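The chunked pattern still needs a merge step to turn per-chunk results into one record. A minimal sketch follows, assuming each result has been converted to a plain dict (for Pydantic objects, result.model_dump()). The first-non-empty-value rule and the "_conflicts" key are illustrative design choices, not LangChain features.

```python
from typing import Iterable


def merge_chunk_results(chunk_results: Iterable[dict]) -> dict:
    """Merge per-chunk extractions, keeping the first non-empty value per field.

    Differing values from later chunks are recorded under "_conflicts"
    rather than silently overwriting the first one.
    """
    merged: dict = {}
    conflicts: dict = {}
    for result in chunk_results:
        for field, value in result.items():
            if value is None or value == "":
                continue  # this chunk had nothing for the field
            if field not in merged:
                merged[field] = value
            elif merged[field] != value:
                conflicts.setdefault(field, []).append(value)
    if conflicts:
        merged["_conflicts"] = conflicts
    return merged
```

Surfacing conflicts instead of picking a winner gives the review queue something concrete to act on when two chunks disagree about, say, the total amount.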

Validating Extraction Output and Handling Failures in Production

Pydantic Validation and the Dead-Letter Queue Pattern

Skipping output validation causes silent data corruption. The LLM returns a string like "$1,234.56" for a field your schema expects as a float, Pydantic isn’t in the path, and your database stores a null or throws a type error three steps downstream. By then, you’ve lost the original document context and have no way to recover cleanly.
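For the "$1,234.56" case specifically, a small normalizer can run before schema validation. This is a sketch under a stated assumption: amounts use US-style formatting (comma as thousands separator, dot as decimal point). parse_amount is a hypothetical helper; in the schema used in this guide it could be called from a Pydantic field validator.

```python
from decimal import Decimal, InvalidOperation
import re


def parse_amount(raw: str) -> Decimal:
    """Convert a display string such as "$1,234.56" to a Decimal.

    Assumes US-style formatting (comma thousands separator, dot decimal
    point). Raises ValueError for anything it cannot interpret.
    """
    cleaned = re.sub(r"[^\d.\-]", "", raw)  # drop currency symbols, commas, spaces
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"Cannot parse amount from {raw!r}") from None
```

Raising on unparseable input, rather than returning a default, keeps the failure visible so it can be routed to review instead of stored as a null.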

To handle errors in an AI extraction pipeline, you need three components: a retry wrapper, a schema validator, and a fallback logger. The following example uses tenacity for retry logic with exponential backoff, which handles transient API rate limit errors without hammering the endpoint.

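from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import RateLimitError, APIConnectionError, APITimeoutError, InternalServerError
from pydantic import ValidationError
import logging

logger = logging.getLogger(__name__)

# Retry only transient failures. Retrying the APIError base class would also
# retry permanent errors such as invalid requests or bad credentials.
@retry(
    retry=retry_if_exception_type((RateLimitError, APIConnectionError, APITimeoutError, InternalServerError)),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(4),
    reraise=True,  # surface the original exception instead of tenacity.RetryError
)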
def extract_with_retry(document_text: str) -> InvoiceExtraction:
    return extract_invoice_fields(document_text)

def safe_extract(document_id: str, document_text: str) -> InvoiceExtraction | None:
    try:
        result = extract_with_retry(document_text)
        return result
    except ValidationError as e:
        logger.error("Schema validation failed", extra={
            "document_id": document_id,
            "errors": e.errors(),
            "event": "extraction_validation_failure"
        })
        route_to_dead_letter_queue(document_id, reason="validation_failure", details=str(e))
        return None
    except Exception as e:
        logger.error("Extraction failed", extra={
            "document_id": document_id,
            "error": str(e),
            "event": "extraction_failure"
        })
        route_to_dead_letter_queue(document_id, reason="api_failure", details=str(e))
        return None

def route_to_dead_letter_queue(document_id: str, reason: str, details: str):
    # Publish to SQS dead-letter queue, Kafka topic, or database review table
    logger.warning("Document routed to review queue", extra={
        "document_id": document_id,
        "reason": reason
    })

The dead-letter queue pattern is non-negotiable for batch extraction jobs. Silently dropping failed documents means you have no visibility into your actual failure rate. Route them to a review queue, log the failure reason with field-level detail, and build a simple admin view so your team can triage and reprocess. This is the failure mode teams consistently underestimate: they build the happy path, ship it, and discover weeks later that 8% of documents were silently dropped.
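One way to back the review queue is a plain database table. The sketch below uses SQLite from the standard library as a stand-in; the table name extraction_failures and its columns are assumptions for illustration, and a production system would more likely use SQS, Kafka, or its main database.

```python
import sqlite3
from datetime import datetime, timezone


def init_review_table(conn: sqlite3.Connection) -> None:
    """Create the failure table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS extraction_failures (
               document_id TEXT NOT NULL,
               reason      TEXT NOT NULL,
               details     TEXT,
               failed_at   TEXT NOT NULL,
               resolved    INTEGER NOT NULL DEFAULT 0
           )"""
    )


def record_failure(conn: sqlite3.Connection, document_id: str,
                   reason: str, details: str) -> None:
    """Persist one failed document with a UTC timestamp."""
    conn.execute(
        "INSERT INTO extraction_failures (document_id, reason, details, failed_at) "
        "VALUES (?, ?, ?, ?)",
        (document_id, reason, details, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def failure_counts(conn: sqlite3.Connection) -> dict:
    """Count unresolved failures per reason, e.g. for an admin view."""
    rows = conn.execute(
        "SELECT reason, COUNT(*) FROM extraction_failures "
        "WHERE resolved = 0 GROUP BY reason"
    ).fetchall()
    return dict(rows)
```

Dividing these counts by the number of documents processed gives the actual failure rate, which is the number silent drops hide from you.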

Confidence Scoring and Human-in-the-Loop Gating

Managed APIs like Textract and Form Recognizer return confidence scores per field. Use them. Set a threshold, say 0.85, and route anything below it to human review rather than passing it downstream. Check the scale first: Textract reports confidence from 0 to 100, while Form Recognizer uses 0 to 1, so normalize before comparing. For LLM-based extraction, you can add a secondary prompt that asks the model to rate its own confidence per field, though this adds latency and cost.
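A threshold gate can be as small as the sketch below. fields_needing_review is a hypothetical helper; the scale parameter is an assumption added to handle providers that report confidence on different ranges (100.0 for Textract, 1.0 for Form Recognizer).

```python
def fields_needing_review(field_confidences: dict, threshold: float = 0.85,
                          scale: float = 1.0) -> list:
    """Return field names whose confidence falls below the threshold.

    scale is the maximum value the provider reports. A missing score (None)
    is always sent to review rather than trusted.
    """
    flagged = []
    for name, confidence in field_confidences.items():
        if confidence is None or confidence / scale < threshold:
            flagged.append(name)
    return flagged
```

For example, fields_needing_review({"total_amount": 71.5}, scale=100.0) flags total_amount, because 0.715 is below the 0.85 threshold.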

The risk of over-relying on AI explanations is real. Research by Irons et al. at Australia’s Commonwealth Scientific and Industrial Research Organisation (CSIRO) found that participants given LLM-generated explanations as supporting information were less likely to detect errors than those reviewing the original source text. AI-generated confidence explanations can make reviewers trust incorrect extractions more than they should. When you build your human review interface, show reviewers the original document alongside the extracted fields, not just the AI’s rationale.

Enforcing Data Quality Across High-Volume Extraction Runs

Manual extraction at scale fails badly. Research cited by Gartlehner (RTI International / AHRQ) found that up to 63% of studies in systematic reviews contain at least one data extraction error, highlighting how error-prone human extraction becomes under volume pressure. AI extraction shifts the error profile rather than eliminating it, which is why your observability layer matters as much as the extraction logic itself.

Log extraction results with field-level metadata: confidence score, model version, document type, processing latency, and whether the result passed validation. Store this in a structured format (a database table or a structured log stream) so you can query it. When extraction quality degrades on a specific document type, you’ll see it in the data before users report it.
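A field-level log record might be shaped like the following sketch. The class name FieldLogRecord and its exact attributes are illustrative, derived from the metadata listed above. One JSON object per line keeps the stream easy to load into a table or query with log tooling.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class FieldLogRecord:
    """One row of extraction metadata for a single field (illustrative shape)."""
    document_id: str
    document_type: str
    field_name: str
    model_version: str
    latency_ms: int
    passed_validation: bool
    confidence: Optional[float] = None  # not every extractor reports one


def to_log_line(record: FieldLogRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

With records in this shape, a query grouped by document_type and field_name over passed_validation shows quality drift per field.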

Schema versioning is the problem teams ignore until it bites them. When your InvoiceExtraction schema adds a required field, documents processed under the old schema will fail validation on reprocessing. Tag every extracted record with the schema version used. When you deploy a new schema version, run a migration job that reprocesses affected documents rather than silently leaving stale records in place.
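Version tagging can be sketched in a few lines. The names below (CURRENT_SCHEMA_VERSION, tag_record, stale_document_ids) and the integer versioning scheme are assumptions for illustration; records without a tag are treated as stale.

```python
CURRENT_SCHEMA_VERSION = 2  # hypothetical: bump whenever a required field changes


def tag_record(fields: dict, schema_version: int = CURRENT_SCHEMA_VERSION) -> dict:
    """Wrap extracted fields with the schema version that produced them."""
    return {"schema_version": schema_version, "fields": fields}


def stale_document_ids(records: dict,
                       current_version: int = CURRENT_SCHEMA_VERSION) -> list:
    """Return sorted ids of records extracted under an older (or unknown) schema.

    records maps document_id -> tagged record.
    """
    return sorted(
        doc_id for doc_id, record in records.items()
        if record.get("schema_version", 0) < current_version
    )
```

The list returned by stale_document_ids is the input for the migration job: reprocess exactly those documents after a schema change.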

Variable type matters enormously for AI extraction reliability. Research presented at ISPOR Europe 2024 by Shree et al. (Evidera Ltd.) found that AI reliability dropped to an Intraclass Correlation Coefficient (ICC, a measure of agreement between repeated measurements) of 0 for study characteristics, while patient characteristics showed excellent reliability at ICC = 0.95. The lesson for your pipeline: audit extraction accuracy by field type, not just by document type. Some fields will be highly reliable; others will require systematic human review regardless of your prompt engineering.
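A per-field audit needs only a small labeled sample. The sketch below uses exact-match accuracy, which is a simpler metric than the ICC used in the cited research, and accuracy_by_field is a hypothetical helper. Each sample pairs an extracted dict with its hand-verified expected values.

```python
from collections import defaultdict


def accuracy_by_field(samples: list) -> dict:
    """Compute exact-match accuracy per field.

    samples is a list of (extracted, expected) dict pairs from a labeled set.
    A field missing from the extraction counts as wrong.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for extracted, expected in samples:
        for field, truth in expected.items():
            total[field] += 1
            if extracted.get(field) == truth:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}
```

Fields that score well below the rest are the candidates for mandatory human review.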

Performance and Cost Considerations at Scale

Cost math matters before you commit to an architecture. AWS Textract charges per page for document analysis, with pricing varying by feature set and region. Azure Form Recognizer prices similarly on a per-page model with a free tier for low-volume usage. OpenAI’s GPT-4o charges per input and output token, so the cost of a 10-page PDF converted to text depends on how much text those pages hold, not on the page count. Which option is cheaper at high volume depends on current price sheets and your own documents, so run the numbers on a representative sample before assuming either way.
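The comparison reduces to two formulas. In the sketch below every price is a parameter you fill in from the vendor’s current price sheet; nothing here encodes real pricing, and the function names are illustrative.

```python
def llm_cost_per_document(input_tokens: int, output_tokens: int,
                          input_price_per_million: float,
                          output_price_per_million: float) -> float:
    """Token-priced cost: tokens times price, summed over input and output."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


def per_page_cost_per_document(pages: int, price_per_page: float) -> float:
    """Page-priced cost: pages times the per-page rate."""
    return pages * price_per_page
```

Measure token counts on a sample of your real documents; dense contracts and sparse forms with the same page count can differ by an order of magnitude in tokens.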

For async and batch processing, avoid blocking your main application thread on extraction calls. Use Python’s asyncio with the OpenAI SDK’s AsyncOpenAI client (or aiohttp if you call the REST API directly) for concurrent calls. For Textract async jobs, either poll get_document_analysis or have Textract publish completion notifications to an SNS topic that feeds an SQS queue. Set concurrency limits that respect your API rate limits: OpenAI’s tier-based rate limits are per-minute, so a naive parallel implementation will hit 429 errors immediately at scale.
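A concurrency cap can be sketched with asyncio.Semaphore. The function extract_all and its extract_one parameter are hypothetical names; extract_one stands for whatever async function performs a single extraction call, and the default limit of 5 is an arbitrary starting point.

```python
import asyncio


async def extract_all(documents, extract_one, max_concurrency: int = 5) -> list:
    """Run extract_one over documents with at most max_concurrency in flight.

    extract_one is any async callable taking one document. Failures are
    returned in place as exception objects so one bad document does not
    cancel the rest of the batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(doc):
        async with semaphore:  # wait here when the limit is reached
            return await extract_one(doc)

    return await asyncio.gather(
        *(bounded(doc) for doc in documents), return_exceptions=True
    )
```

Results come back in input order, so exception entries can be matched to their documents and routed to the review queue. A semaphore limits calls in flight, not calls per minute, so it complements retry with backoff rather than replacing it.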

Cache extraction results keyed on a hash of the document content, not the filename or URL. If the source document changes, the content hash changes and the cache miss triggers a fresh extraction. If the document is identical, you avoid redundant API calls and cost. For documents that update frequently, set a short TTL or skip caching entirely.
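Content-hash caching might look like the sketch below. The dict stands in for a real cache such as Redis or a database table, and cached_extract and its extract parameter are hypothetical names; SHA-256 is one reasonable hash choice.

```python
import hashlib


def content_key(document_bytes: bytes) -> str:
    """Derive the cache key from the document's bytes, not its name or URL."""
    return hashlib.sha256(document_bytes).hexdigest()


def cached_extract(document_bytes: bytes, cache: dict, extract):
    """Return the cached result for identical content, else call extract()."""
    key = content_key(document_bytes)
    if key not in cache:
        cache[key] = extract(document_bytes)  # cache miss: pay for one extraction
    return cache[key]
```

Including the schema version or model name in the key is a sensible extension, so a schema change does not serve results produced under the old one.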

AI Data Extraction Libraries Compared

Use this table to match the extraction library to your project constraints before writing any code.

Library / API | Type | Best For | Pricing Model | License
AWS Textract | Managed cloud API | Scanned forms, tables, structured documents | Per page | Proprietary
Azure Form Recognizer | Managed cloud API | Invoices, receipts, ID documents with prebuilt models | Per page (free tier) | Proprietary
OpenAI API (gpt-4o) | LLM API | Unstructured, variable-format documents | Per token | Proprietary
LangChain | Open-source orchestration | Multi-step pipelines, rapid prototyping | Free (API costs separate) | MIT
Unstructured.io | Open-source / hosted | Document preprocessing, format normalization | Free OSS / hosted pricing | Apache 2.0

Choosing Your Implementation Path: A Decision Framework

Your document type and team constraints should determine your architecture, not the tool with the best marketing copy. Here’s the conditional logic that maps to real decisions.

  • Structured documents, known layouts, high volume: Use AWS Textract or Azure Form Recognizer. The per-page cost is predictable, accuracy on known document types is high, and you don’t need to manage prompts or model versions.
  • Unstructured or variable-format documents: Use OpenAI function calling with a Pydantic schema. Accept the per-token cost as the price of flexibility, and invest the time you save on prompt engineering into your validation layer instead.
  • Data sovereignty requirements or hard budget ceilings: Evaluate self-hosted models via Hugging Face, but only if your team has the infrastructure capacity. A poorly maintained self-hosted model will underperform a managed API on both accuracy and reliability.
  • AI extraction is the wrong tool when: your latency requirement is under 100ms (LLM calls typically run 1-5 seconds), your documents have no consistent structure and no training data exists for the domain, or your use case requires audit-grade provenance where every extracted value must be traceable to a specific character position in the source.

Start with the OpenAI function-calling pattern shown in this guide for your prototype. Validate output with Pydantic from the first commit. Then benchmark Textract or Form Recognizer against your actual document set before committing to a long-term architecture. The benchmark will tell you more than any comparison table, including this one.

Output validation is not a follow-up task. Ship it on day one. Teams that defer schema enforcement until “after we validate the concept” consistently discover that the concept has been running in production for three months by the time they circle back, and the downstream corruption is already in their database.

Frequently Asked Questions About AI Data Extraction

How accurate is AI data extraction in production?

Accuracy varies significantly by document type and field type. Well-structured fields like dates and amounts in standard invoice formats tend to extract reliably. Free-text fields, ambiguous labels, and domain-specific terminology degrade accuracy. Plan for a human review layer for any field where errors have meaningful downstream consequences.

What is the difference between OCR and AI extraction?

OCR (Optical Character Recognition) converts image pixels to machine-readable text. AI extraction takes that text (or a text-layer PDF) and pulls typed, structured fields from it. You often need both: OCR to digitize scanned documents, then AI extraction to interpret the resulting text into a schema your application can consume.

Can I run AI extraction locally without an API?

Yes. Open-source models like Llama 3 and Mistral run locally via Ollama or direct Hugging Face inference. Local inference eliminates per-call API costs and keeps data on your infrastructure, but requires GPU hardware for reasonable throughput. Expect higher setup complexity and ongoing model maintenance compared to managed APIs.

How do I prevent LLMs from hallucinating extracted fields?

Use three constraints together: instruct the model explicitly to only extract values present in the source text, use function calling or tool use to constrain output to your defined schema, and validate every response with Pydantic before it touches your application. No single constraint is sufficient on its own.

When should I fine-tune a model for extraction instead of prompting?

Fine-tuning makes sense when your document type is highly domain-specific, your prompt engineering has plateaued below an acceptable accuracy threshold, and you have at least several hundred labeled extraction examples to train on. For most teams, prompt engineering with function calling reaches acceptable accuracy faster and with lower maintenance burden than fine-tuning.

Kayleigh Baxter