An autonomous AP agent handling 2,000+ supplier invoices a month for Aralab

An autonomous accounts payable agent that handles 2,000+ supplier invoices a month, freeing Aralab's three-person finance team for the strategic work they were hired to do.

Aralab
  • 3 FTEs redirected from data entry to strategic work
  • 2,000+ supplier invoices processed every month
  • 6 weeks from kickoff to live, with one engineer

The challenge

Aralab's operations generate more than 2,000 supplier invoices every month. Each one arrives as a PDF, and that's where the consistency ends. Every supplier has its own layout, line-item conventions, tax formatting, and reference number scheme. Processing them manually meant opening each invoice, reading it by eye, and working out where each piece of data belonged in the ERP.

The three-person finance team had been doing this for years, not because they lacked skill (these were experienced AP professionals) but because the volume left no other option. Each invoice took time. Multiply that by 2,000 and there was no capacity left for anything else.

The deeper problem was what that consumed capacity was costing Aralab. A three-person finance team buried in data entry is a finance team that isn't doing cash flow forecasting, supplier negotiations, or financial planning. The business was paying for judgment and getting data transcription instead.

The reason Aralab hadn't solved this before we got involved was that the problem was harder than it looked. Template-based OCR breaks on diverse supplier formats. Matching logic adds another layer: a single invoice can reference multiple POs, suppliers ship partial quantities, and units of measure differ. Above all of it sat a hard requirement: zero tolerance for pricing errors. Manufacturing margins are tight. A one-cent discrepancy on a unit price, multiplied across thousands of units, becomes a material error.

What we learned
  • Templates break on diverse formats: Every supplier ships its own layout. Rules-based OCR works until it doesn't, which is daily.
  • Capacity hides judgment debt: When skilled AP professionals do data entry, the business pays for transcription instead of forecasting and negotiation.
  • One cent compounds: Manufacturing margins don't tolerate pricing drift. A small unit-price discrepancy across thousands of units becomes a material error.

The solution

Twistag treated it as an agent problem, not an automation problem. Traditional invoice automation vendors approach AP as a data extraction and matching challenge — better templates, better OCR, more rules. We built a system that reasons about invoices the way a skilled AP clerk does, at machine speed, with deterministic guarantees on the parts that can't tolerate ambiguity.

The architecture is hybrid. Claude Sonnet 4.5 handles the contextual interpretation work: reading any supplier PDF format, parsing line items, resolving partial-match scenarios, identifying suppliers from inconsistent name variations. Deterministic code handles the arithmetic and validation: every unit price compared against the matched purchase order, every line total recalculated independently, every tax amount verified, every grand total checked against the sum of lines. The LLM proposes; the validation engine verifies. Neither alone is sufficient.
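
The deterministic half of that split can be illustrated with a minimal sketch. It assumes the LLM has already produced structured line items and that each line carries the unit price from its matched purchase order. The class names, field names, and the 23% tax rate in the usage note are illustrative assumptions, not details from Aralab's system.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


@dataclass(frozen=True)
class InvoiceLine:
    quantity: Decimal
    unit_price: Decimal      # as extracted from the invoice
    line_total: Decimal      # as printed on the invoice
    po_unit_price: Decimal   # from the matched purchase order (hypothetical field)


@dataclass(frozen=True)
class Invoice:
    lines: list[InvoiceLine]
    net_total: Decimal
    tax_rate: Decimal
    tax_amount: Decimal
    grand_total: Decimal


def round_cents(value: Decimal) -> Decimal:
    """Round to two decimals using exact decimal arithmetic, never floats."""
    return value.quantize(CENT, rounding=ROUND_HALF_UP)


def validate_invoice(inv: Invoice) -> list[str]:
    """Return every discrepancy found; an empty list means the invoice passes."""
    errors: list[str] = []
    for i, line in enumerate(inv.lines, start=1):
        # Zero tolerance: the invoiced price must equal the PO price exactly.
        if line.unit_price != line.po_unit_price:
            errors.append(f"line {i}: unit price {line.unit_price} != PO price {line.po_unit_price}")
        # Recalculate the line total independently of what the invoice prints.
        expected = round_cents(line.quantity * line.unit_price)
        if expected != line.line_total:
            errors.append(f"line {i}: line total {line.line_total} != recalculated {expected}")
    # The net total must equal the sum of the printed line totals.
    net = sum((line.line_total for line in inv.lines), Decimal("0"))
    if net != inv.net_total:
        errors.append(f"net total {inv.net_total} != sum of lines {net}")
    expected_tax = round_cents(inv.net_total * inv.tax_rate)
    if expected_tax != inv.tax_amount:
        errors.append(f"tax {inv.tax_amount} != recalculated {expected_tax}")
    if inv.net_total + inv.tax_amount != inv.grand_total:
        errors.append(f"grand total {inv.grand_total} != net + tax")
    return errors
```

For example, a line of 1,000 units invoiced at 1.26 against a PO price of 1.25 is internally consistent (total 1260.00, tax 289.80 at 23%, grand total 1549.80), so only the PO price comparison catches the ten-unit overcharge. Using `Decimal` rather than floats keeps every comparison exact to the cent.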

The pipeline starts with a PDF arriving in Cloud Storage on GCP, which triggers a Firestore-based workflow. Claude reads the document and produces a structured JSON object: supplier identifiers including the Portuguese NIF tax number, invoice number and date, every line item with quantity and price, tax calculations, document totals. A two-layer supplier identification system links the invoice to the right ERP entity — NIF normalization for the primary path, fuzzy matching as a fallback when the tax number is missing or ambiguous.
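
The two-layer supplier identification can be sketched as follows. This is an illustration under stated assumptions, not the production code: the function names, the 0.85 similarity threshold, the list of legal suffixes, and the use of `difflib` for fuzzy matching are all choices made for the example. The NIF check uses the standard mod-11 check digit over nine digits.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from typing import Optional

FUZZY_THRESHOLD = 0.85  # assumed cut-off, not taken from the case study
LEGAL_SUFFIXES = {"lda", "sa", "unipessoal"}  # assumed list of suffixes to ignore


def normalize_nif(raw: Optional[str]) -> Optional[str]:
    """Return the 9-digit NIF if it passes the mod-11 check digit, else None."""
    if not raw:
        return None
    digits = re.sub(r"\D", "", raw)  # drops "PT" prefixes, spaces, dots, dashes
    if len(digits) != 9:
        return None
    weighted = sum(int(d) * w for d, w in zip(digits[:8], range(9, 1, -1)))
    check = 11 - weighted % 11
    if check >= 10:
        check = 0
    return digits if check == int(digits[8]) else None


def clean_name(name: str) -> str:
    """Fold accents and case, strip punctuation and legal suffixes."""
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    folded = folded.lower().replace(".", "")
    tokens = re.sub(r"[^a-z0-9 ]", " ", folded).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def identify_supplier(
    raw_nif: Optional[str], name: str, suppliers: dict[str, str]
) -> Optional[tuple[str, str]]:
    """Return (supplier NIF, method) where method is "nif" or "fuzzy", or None.

    `suppliers` maps a known NIF to the supplier name held in the ERP.
    """
    # Primary path: a valid, known tax number identifies the supplier outright.
    nif = normalize_nif(raw_nif)
    if nif in suppliers:
        return nif, "nif"
    # Fallback: compare cleaned names and accept only a close match.
    target = clean_name(name)
    best_nif, best_score = None, 0.0
    for known_nif, known_name in suppliers.items():
        score = SequenceMatcher(None, target, clean_name(known_name)).ratio()
        if score > best_score:
            best_nif, best_score = known_nif, score
    if best_nif is not None and best_score >= FUZZY_THRESHOLD:
        return best_nif, "fuzzy"
    return None
```

With a synthetic supplier table such as `{"123456789": "Metalúrgica Exemplo, Lda."}`, a raw value of `"PT 123 456 789"` resolves through the NIF path, while an invoice with no tax number and the name `"METALURGICA EXEMPLO LDA"` resolves through the fuzzy fallback. Returning the method alongside the match lets a reviewer see which path produced it.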

Line-item matching is where the hybrid does its heaviest work. The deterministic layer handles clear-cut cases: exact PO references, aligned quantities, matching prices. These resolve instantly with no LLM cost. The LLM layer handles everything the rules can't — partial deliveries, multi-PO allocations, unit conversions, description mismatches. For each, Claude receives the invoice line alongside candidate ERP records and returns a structured decision with a confidence score and an explanation of its reasoning. High-confidence matches proceed automatically. Low-confidence matches surface in the dashboard for the finance team to review.
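
The routing described above can be sketched in a few lines. The 0.90 threshold, the type names, and the shape of the `llm_matcher` callable are assumptions for illustration; in the real system that callable would wrap the Claude request, which is stubbed out here.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Optional

AUTO_THRESHOLD = 0.90  # assumed cut-off for automatic acceptance


@dataclass(frozen=True)
class InvoiceLine:
    po_ref: Optional[str]
    quantity: Decimal
    unit_price: Decimal


@dataclass(frozen=True)
class PoLine:
    po_ref: str
    quantity: Decimal
    unit_price: Decimal


@dataclass(frozen=True)
class MatchDecision:
    po_ref: Optional[str]
    confidence: float
    explanation: str
    route: str  # "auto" or "review"


# Hypothetical signature: returns (matched PO reference, confidence, explanation).
LlmMatcher = Callable[[InvoiceLine, list[PoLine]], tuple[Optional[str], float, str]]


def match_line(line: InvoiceLine, candidates: list[PoLine], llm_matcher: LlmMatcher) -> MatchDecision:
    # Deterministic tier: exact PO reference, quantity and price. No LLM cost.
    for po in candidates:
        if (line.po_ref, line.quantity, line.unit_price) == (po.po_ref, po.quantity, po.unit_price):
            return MatchDecision(po.po_ref, 1.0, "exact PO reference, quantity and price", "auto")
    # LLM tier: partial deliveries, multi-PO allocations, unit conversions, etc.
    po_ref, confidence, explanation = llm_matcher(line, candidates)
    route = "auto" if confidence >= AUTO_THRESHOLD else "review"
    return MatchDecision(po_ref, confidence, explanation, route)
```

Passing the matcher in as a callable keeps the routing logic testable without a model call: a stub that returns a fixed confidence is enough to exercise both the automatic and the review paths. The explanation string travels with the decision so it can be shown in the review dashboard.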

LangFuse runs across the whole AI layer for observability. Every Claude call is traced: what the model was given, what it produced, how long it took, and what it cost. Prompt versions are tracked, and cost per invoice is visible in real time. In a finance context, the ability to see how the AI reached a decision is part of what makes the system trustworthy enough to delegate to.
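
To make concrete what each trace records, here is a stand-in written with the standard library only. It is deliberately not the LangFuse SDK: the `Tracer` class, its method names, and the convention that the model callable returns `(output, cost)` are all hypothetical, chosen to show the shape of the data rather than any real API.

```python
import time
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Callable


@dataclass(frozen=True)
class TraceRecord:
    invoice_id: str
    prompt_version: str
    input_text: str     # what the model was given
    output_text: str    # what it produced
    latency_s: float    # how long it took
    cost: Decimal       # what it cost


@dataclass
class Tracer:
    records: list[TraceRecord] = field(default_factory=list)

    def traced_call(
        self,
        invoice_id: str,
        prompt_version: str,
        prompt: str,
        model_call: Callable[[str], tuple[str, Decimal]],
    ) -> str:
        """Run model_call(prompt), record a trace, and return the model output."""
        start = time.perf_counter()
        output, cost = model_call(prompt)
        latency = time.perf_counter() - start
        self.records.append(
            TraceRecord(invoice_id, prompt_version, prompt, output, latency, cost)
        )
        return output

    def cost_per_invoice(self) -> dict[str, Decimal]:
        """Sum the recorded cost of every call, grouped by invoice."""
        totals: dict[str, Decimal] = {}
        for record in self.records:
            totals[record.invoice_id] = totals.get(record.invoice_id, Decimal("0")) + record.cost
        return totals
```

Because every call is keyed by invoice and prompt version, the same records answer both questions a finance team asks: how a given decision was reached, and what it cost to reach it.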

What this shaped
  • LLM proposes, code verifies: Language models read context; deterministic code does the arithmetic. Neither is sufficient alone for finance.
  • Confidence scores route human time: High-confidence matches resolve themselves; low-confidence ones surface for review, so humans see only ambiguity.
  • Observability is part of trust: Every Claude call traced, every prompt versioned, cost per invoice always visible. Finance teams need to see the reasoning.

The impact

The whole system shipped in six weeks with one engineer. Aralab's three-person finance team now works through a dashboard showing every invoice's processing status: fully automated, partially matched and awaiting line-item review, or flagged for full review. They process 2,000+ invoices a month, with no human involvement for the patternable 80% and a fast review interface for the ambiguous 20%.

What that bought back is more important than the volume number. Three FTEs who were buried in data entry are now doing the work the business actually pays them to do: cash flow forecasting, supplier negotiations, financial planning. Same headcount, completely different output.

What this proved
  • Same headcount, different output: Three FTEs went from data entry to cash flow forecasting and supplier negotiation without anyone being added.
  • Six weeks, one engineer: The right pattern lets one engineer ship what looks like a team's work.
  • Patternable 80% is the unlock: Most invoice work is patternable; the 20% that isn't gets a fast review surface, not a slower automation.

Technologies used

  • Anthropic Claude
  • GCP
  • Firebase
  • LangFuse
