
Building Reliable Invoice Extraction Prompts That Handle Edge Cases


Invoice extraction sounds straightforward until you actually do it. Feed an invoice to an LLM, get structured JSON back. Simple, right?

Then reality hits. A supplier sends a scan that's slightly rotated. Someone spills coffee on a printout before scanning it. An accountant handwrites corrections in the margins. And suddenly your "working" prompt starts returning garbage—or worse, confidently wrong numbers that slip into your accounting system undetected.

I've been building document processing systems for a while now, and prompt engineering for invoice extraction is genuinely harder than most tutorials suggest. The difference between a demo that impresses and a system you can actually trust in production comes down to how you handle the messy cases, not whether your prompt works on clean PDFs.

In my previous article on building an invoice processing pipeline, I focused on the reactive architecture and processing flow. This time, let's dig into the prompt itself—the part that actually determines whether you get accurate data or plausible-looking nonsense.

Why Invoice Extraction Is Harder Than It Looks

Here's what the benchmarks don't tell you: that 97-99% accuracy figure you see quoted everywhere assumes clean, digital PDFs with consistent formatting. Real invoices are chaos.

Consider what lands in a typical accounts payable inbox: computer-generated PDFs (the easy ones), scanned paper invoices at varying DPIs, mobile photos taken at odd angles, faxes (yes, still), documents with stamps and handwritten approvals, and invoices where someone has helpfully crossed out the original amount and written a correction.

Each of these failure modes requires different handling. A generic prompt that works beautifully on digital invoices will hallucinate confidently when fed a 150 DPI scan with coffee stains. And hallucination in financial documents isn't just annoying—it's the kind of thing that gets discovered during an audit.

The fundamental problem is that most prompts are optimised for the happy path. They assume the document contains the requested information in a readable format. Real-world invoice extraction needs to handle the inverse: documents where information is ambiguous, partially visible, or simply not present.

The Anatomy of an Effective Extraction Prompt

Let's build a prompt structure that actually works. Claude has been fine-tuned to pay special attention to XML tags, making them ideal for separating prompt components. Anthropic's documentation on XML tags explains why this structure improves accuracy—the model can clearly distinguish between instructions, examples, and the actual document content.

Here's the skeleton:

<system>
You are an expert document extraction system. Extract ONLY information
explicitly present in the document. Never infer or guess values.
For any field where the value cannot be clearly read, return null.
</system>

<document>
{{DOCUMENT_CONTENT_OR_IMAGE}}
</document>

<output_schema>
{{JSON_SCHEMA_DEFINITION}}
</output_schema>

<examples>
{{FEW_SHOT_EXAMPLES}}
</examples>

<instructions>
Extract all invoice fields according to the schema.
If a field cannot be determined with high confidence, set it to null.
Output valid JSON only.
</instructions>

A few things to note. First, the document comes before the instructions. This matters for vision tasks—Claude processes images more effectively when they appear early in the prompt. Second, the system instruction explicitly tells Claude to return null rather than guess. This is crucial. Without it, the model will try to be "helpful" by inferring values that aren't actually there.
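To make that ordering concrete, here's a minimal sketch of the same structure as an API call. The pdfBase64 and extractionPrompt values are placeholders, not part of the skeleton above:

import Anthropic from "@anthropic-ai/sdk";

// Placeholders for your own document and assembled prompt.
declare const pdfBase64: string;        // base64-encoded invoice PDF
declare const extractionPrompt: string; // the prompt skeleton, filled in

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-5-20250929",
  max_tokens: 2048,
  messages: [
    {
      role: "user",
      content: [
        {
          // The document block comes first...
          type: "document",
          source: {
            type: "base64",
            media_type: "application/pdf",
            data: pdfBase64,
          },
        },
        // ...and the instructions come second.
        { type: "text", text: extractionPrompt },
      ],
    },
  ],
});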

Why Few-Shot Examples Matter More Than You Think

You might be tempted to skip examples. Don't. Research from LangChain showed that Claude 3 Haiku achieved 75% accuracy with just three examples versus 11% zero-shot—a 7x improvement from minimal investment.

But not all examples are created equal. Your few-shot examples should:

  • Cover edge cases, not just perfect documents
  • Demonstrate null handling for missing fields
  • Include at least one example with nested line items
  • Show how to handle common formatting variations (date formats, currency symbols)

Here's what a good example looks like:

<example>
<input>[Scanned invoice with partially obscured vendor address]</input>
<output>{
  "invoice_number": "INV-2024-0892",
  "vendor_name": "Acme Supplies Pty Ltd",
  "vendor_address": null,
  "date": "2024-03-15",
  "total_amount": 1847.50,
  "line_items": [
    {"description": "Office Supplies", "quantity": 1, "unit_price": 1847.50}
  ],
  "extraction_notes": "Vendor address partially obscured, unable to extract reliably"
}</output>
</example>

Notice how the example explicitly shows null for the obscured field rather than guessing. This teaches Claude the behaviour you want.

Designing for Uncertainty: The Escape Hatches Your Prompt Needs

Here's where most invoice extraction tutorials fail you completely. They assume every document can be processed successfully. In production, that assumption will cost you.

Letting the Model Say "I Don't Know"

LLMs have a problematic tendency to be confidently wrong. If you ask for a value, they'll give you one—even if it means hallucinating. The solution isn't to ask Claude for a numerical confidence score (LLMs consistently overestimate their certainty and produce non-uniform distributions). Instead, build explicit uncertainty signals into your schema.

I use five flags:

  1. no_answer — Field not found in document
  2. partial_answer — Value partially visible or incomplete
  3. conflicting_values — Multiple different values present
  4. ambiguous — Cannot determine which interpretation is correct
  5. confident — Clear, unambiguous extraction

This approach works because you're asking Claude to classify the situation rather than quantify uncertainty. Classification is something LLMs do well.
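One way to encode those flags, sketched with Zod. The ExtractedField wrapper below is illustrative, not the exact schema used later in this article:

import { z } from "zod";

// The five classification flags, mirroring the list above.
const FieldConfidence = z.enum([
  "no_answer",          // field not found in document
  "partial_answer",     // value partially visible or incomplete
  "conflicting_values", // multiple different values present
  "ambiguous",          // cannot determine which interpretation is correct
  "confident",          // clear, unambiguous extraction
]);

// A per-field wrapper pairing the extracted value with its classification.
const ExtractedField = <T extends z.ZodTypeAny>(value: T) =>
  z.object({
    value: value.nullable(),
    confidence: FieldConfidence,
  });

// Example: a total_amount field that carries its own flag.
const TotalAmountField = ExtractedField(z.number());
type TotalAmountField = z.infer<typeof TotalAmountField>;
// => { value: number | null; confidence: "no_answer" | ... | "confident" }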

Building in an "Invalid Document" Path

Sometimes the right answer is "this document can't be processed." Maybe it's too damaged, maybe it's not actually an invoice, maybe it's in a language the model doesn't handle well.

Your schema should have an escape hatch:

interface InvoiceExtraction {
  extraction_status: "success" | "partial" | "failed";
  failure_reason?: string;

  // Only populated if extraction_status is 'success' or 'partial'
  invoice_number?: string;
  vendor_name?: string;
  total_amount?: number;
  // ... other fields

  fields_with_issues?: Array<{
    field_name: string;
    issue_type: "not_found" | "partial" | "conflicting" | "ambiguous";
    notes?: string;
  }>;
}

This structure means your downstream systems can handle uncertain extractions appropriately—routing them to human review rather than trusting potentially incorrect data.
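As a sketch of what that routing might look like against the interface above (the function name and rules are illustrative):

// Only fully successful extractions with no flagged fields are trusted
// automatically; everything else goes to a human reviewer.
function routeExtraction(result: InvoiceExtraction): "auto" | "human_review" {
  if (result.extraction_status !== "success") return "human_review";
  if (result.fields_with_issues && result.fields_with_issues.length > 0) {
    return "human_review";
  }
  return "auto";
}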

Testing Prompts Like You'd Test Code

A prompt that's used repeatedly deserves the same rigour as production code. Yet I constantly see people iterate on prompts with a handful of test documents, declare victory, and wonder why production accuracy is terrible.

Building a Ground Truth Dataset

You need 50-100+ documents minimum, stratified across:

  • Format tiers: digital PDFs, high-quality scans, low-quality scans, mobile photos
  • Layout variations: single-column, multi-column, with tables, without tables
  • Edge cases: handwritten annotations, stamps, partial pages, foreign language elements

Label these with expert verification. Ideally, two people verify each other's work. Store ground truth as JSON with the same schema your extraction uses.
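One way to structure those labelled cases, reusing the InvoiceExtraction shape from earlier (the field names here are illustrative):

// A labelled test case: the source file plus the hand-verified expected
// output, stored in the same shape the extraction prompt produces.
interface GroundTruthCase {
  id: string;
  documentPath: string;                                 // path to the PDF or image
  tier: "digital" | "scan_high" | "scan_low" | "photo"; // format stratum
  expected: InvoiceExtraction;                          // expert-verified ground truth
}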

This takes time. It's also the only way to know whether your prompt actually works.

Treating Prompts as Versioned Artefacts

Here's something that bit me: model updates can break working prompts. Research documents accuracy swings of 8.7% or more between model versions. Your prompt that worked perfectly on Claude Sonnet 4.5 might behave differently after an update.

The fix:

  1. Pin production deployments to specific model versions (claude-sonnet-4-5-20250929 rather than just claude-sonnet-4-5)
  2. Run your full test suite before any model migration
  3. Compare field-level metrics, not just overall accuracy
  4. Keep a changelog of prompt modifications

Regression testing isn't glamorous, but it's the difference between a system you can trust and one that silently degrades.
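A minimal sketch of that field-level comparison, assuming the GroundTruthCase type from above, the extractInvoice function built later in this article, and a hypothetical loadAsBase64 helper that reads and base64-encodes a file:

// Hypothetical helper: read a file and return its base64 encoding.
declare function loadAsBase64(path: string): Promise<string>;

// Compare each extracted field to ground truth and report per-field
// accuracy rather than a single overall number.
async function fieldLevelAccuracy(cases: GroundTruthCase[]) {
  const fields = ["invoice_number", "vendor_name", "total_amount"] as const;
  const hits: Record<string, number> = {};
  const totals: Record<string, number> = {};

  for (const testCase of cases) {
    const actual = await extractInvoice(await loadAsBase64(testCase.documentPath));
    for (const field of fields) {
      totals[field] = (totals[field] ?? 0) + 1;
      if (actual[field] === testCase.expected[field]) {
        hits[field] = (hits[field] ?? 0) + 1;
      }
    }
  }

  // Per-field accuracy across the whole dataset.
  return Object.fromEntries(
    fields.map((field) => [field, (hits[field] ?? 0) / totals[field]])
  );
}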

The LLM Challenger Pattern for High-Stakes Extractions

For invoices where accuracy is non-negotiable—large amounts, audit-sensitive documents, anything that feeds directly into payments—consider a two-pass approach.

// Primary extraction
const primaryResult = await extractWithClaude(document);

// Challenge extraction
const challengePrompt = `
Review this extraction result against the source document.
Flag any fields where the extracted value doesn't match what you see.

Extraction result: ${JSON.stringify(primaryResult)}
Source document: ${document}
`;
const validation = await validateWithClaude(challengePrompt);

// Only trust fields with consensus
const finalResult = filterFlaggedFields(primaryResult, validation);

The second pass isn't doing extraction—it's doing verification. Different prompts, different cognitive load. Production implementations using this pattern report error rates dropping to 0.0001% on verified fields.
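The filterFlaggedFields helper in the snippet above isn't spelled out; here's a hedged sketch, assuming the challenge pass returns a simple list of field names it disagreed with:

// Assumed shape of the challenge pass output.
interface ChallengeValidation {
  flagged_fields: string[];
}

// Drop any field the challenger flagged so only consensus values flow
// downstream; the flagged fields can then be routed to human review.
function filterFlaggedFields(
  primary: InvoiceExtraction,
  validation: ChallengeValidation
): Partial<InvoiceExtraction> {
  const result: Record<string, unknown> = { ...primary };
  for (const field of validation.flagged_fields) {
    delete result[field];
  }
  return result as Partial<InvoiceExtraction>;
}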

Yes, it doubles your API costs for those documents. For a $50,000 invoice, that's a reasonable trade-off.

Putting It All Together

Let's combine everything into a working implementation. If you're using Claude's structured outputs (which you should be—it eliminates an entire class of parsing errors), your schema enforces the structure automatically.

import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

// Define the schema with Zod
const LineItemSchema = z.object({
  description: z.string(),
  quantity: z.number().nullable(),
  unit_price: z.number().nullable(),
  confidence: z.enum(["confident", "partial", "ambiguous"]),
});

const FieldIssueSchema = z.object({
  field_name: z.string(),
  issue_type: z.enum(["not_found", "partial", "conflicting", "ambiguous"]),
  notes: z.string().nullable(),
});

const InvoiceExtractionSchema = z.object({
  extraction_status: z.enum(["success", "partial", "failed"]),
  failure_reason: z.string().nullable(),

  invoice_number: z.string().nullable(),
  vendor_name: z.string().nullable(),
  date: z.string().nullable(), // YYYY-MM-DD format
  total_amount: z.number().nullable(),
  currency: z.string().nullable(),
  line_items: z.array(LineItemSchema).nullable(),

  fields_with_issues: z.array(FieldIssueSchema).nullable(),
});

type InvoiceExtraction = z.infer<typeof InvoiceExtractionSchema>;

// The prompt
const EXTRACTION_PROMPT = `
<system>
You are an expert invoice extraction system. Extract ONLY information 
explicitly visible in the document. Never infer or guess values.

For any field that cannot be clearly read:
- Set the field value to null
- Add an entry to fields_with_issues explaining why

If the document is too damaged to process reliably, or is not an invoice,
set extraction_status to 'failed' with an explanation.
</system>

<examples>
<example>
<description>Invoice with coffee stain obscuring vendor address</description>
<output>
{
  "extraction_status": "partial",
  "invoice_number": "INV-001",
  "vendor_name": "Example Corp",
  "total_amount": 1500.00,
  "currency": "AUD",
  "fields_with_issues": [
    {"field_name": "vendor_address", "issue_type": "partial", 
     "notes": "Address obscured by stain, only partial postcode visible"}
  ]
}
</output>
</example>
</examples>

<instructions>
Extract all invoice fields from the attached document.
Set extraction_status to 'success' only if all critical fields 
(invoice_number, vendor_name, total_amount) are clearly readable.
Use 'partial' if some fields have issues but critical fields are present.
Use 'failed' if the document cannot be reliably processed.
</instructions>
`;

async function extractInvoice(
  pdfBase64Data: string
): Promise<InvoiceExtraction> {
  const client = new Anthropic();

  const response = await client.beta.messages.create({
    model: "claude-sonnet-4-5-20250929",
    betas: ["structured-outputs-2025-11-13"],
    max_tokens: 2048,
    // @ts-expect-error - temperature not yet in beta types
    temperature: 0.0, // Consistency over creativity
    messages: [
      {
        role: "user",
        content: [
          {
            type: "document",
            source: {
              type: "base64",
              media_type: "application/pdf",
              data: pdfBase64Data,
            },
          },
          { type: "text", text: EXTRACTION_PROMPT },
        ],
      },
    ],
    output_format: {
      type: "json_schema",
      schema: zodToJsonSchema(InvoiceExtractionSchema),
    },
  });

  const textBlock = response.content[0];
  if (textBlock.type !== "text") {
    throw new Error("Unexpected response type");
  }

  return InvoiceExtractionSchema.parse(JSON.parse(textBlock.text));
}

A few implementation notes. Temperature 0.0 is essential for extraction tasks—you want consistency, not creativity. The document comes first in the content array (images and PDFs should precede instructions). And structured outputs constrain the response to your schema, so you never have to handle malformed or truncated JSON—the final Zod parse is a runtime safety net that also gives you a properly typed result.

For high-volume processing, consider the batch API for 50% cost savings, or a hybrid OCR-LLM architecture where traditional OCR handles text extraction and Claude handles the reasoning. RaftLabs reported 70% cost reduction with this approach while maintaining 97% accuracy.

The Uncomfortable Truth About Invoice Extraction

There's no perfect prompt. Documents are too varied, quality is too inconsistent, and edge cases are infinite. What you can build is a system that knows its limitations—that flags uncertain extractions rather than guessing, that routes difficult documents to human review, and that continuously improves as you encounter new failure modes.

The prompts that work in production aren't the clever ones. They're the ones with escape hatches. The ones that make uncertainty explicit rather than hiding it behind confident-sounding numbers. The ones backed by actual test coverage rather than vibes.

Treat prompt development as engineering, not magic. Build in uncertainty handling from day one. Test, version, and regression-test like any other production code. And when something fails—because something always fails—make sure your system tells you about it instead of silently corrupting your data.

That's the difference between a demo and something you can actually trust.

Thomas Wiegold

AI Solutions Developer & Full-Stack Engineer with 14+ years of experience building custom AI systems, chatbots, and modern web applications. Based in Sydney, Australia.
