An autonomous AP agent handling 2,000+ supplier invoices a month for Aralab
An autonomous accounts payable agent that handles 2,000+ supplier invoices a month, freeing Aralab's three-person finance team for the strategic work they were hired to do.

The challenge
Aralab's operations generate more than 2,000 supplier invoices every month. Each one arrives as a PDF, and that's where the consistency ends. Every supplier has its own layout, line-item conventions, tax formatting, and reference-number scheme. Processing them manually meant opening each invoice, reading it as a human, and working out where each piece of data belonged in the ERP.
The three-person finance team had been doing this for years, not because they lacked skill (these were experienced AP professionals) but because the volume left no other option. Each invoice took time; multiply that by 2,000 and there was nothing left.
The deeper problem was what that lost capacity was costing Aralab. A three-person finance team buried in data entry is a finance team that isn't doing cash flow forecasting, supplier negotiations, or financial planning. The business was paying for judgment and getting data transcription instead.
The reason Aralab hadn't solved this before us was that the problem was harder than it looked. Template-based OCR breaks on diverse supplier formats. Matching logic adds another layer — a single invoice can reference multiple POs, suppliers ship partial quantities, units differ. And above all of it sat a hard requirement: zero tolerance for pricing errors. Manufacturing margins are tight. A one-cent discrepancy on a unit price across thousands of units cascades into a material error.
What we learned
- **Templates break on diverse formats.** Every supplier ships its own layout; rules-based OCR works until it doesn't, which is daily.
- **Capacity hides judgment debt.** Skilled AP professionals doing data entry mean the business is paying for transcription instead of forecasting and negotiation.
- **One cent compounds.** Manufacturing margins don't tolerate pricing drift; a small unit-price discrepancy across thousands of units becomes a material error.
The solution
Twistag treated it as an agent problem, not an automation problem. Traditional invoice automation vendors approach AP as a data extraction and matching challenge — better templates, better OCR, more rules. We built a system that reasons about invoices the way a skilled AP clerk does, at machine speed, with deterministic guarantees on the parts that can't tolerate ambiguity.
The architecture is hybrid. Claude Sonnet 4.5 handles the contextual interpretation work: reading any supplier PDF format, parsing line items, resolving partial-match scenarios, identifying suppliers from inconsistent name variations. Deterministic code handles the arithmetic and validation: every unit price compared against the matched purchase order, every line total recalculated independently, every tax amount verified, every grand total checked against the sum of lines. The LLM proposes; the validation engine verifies. Neither alone is sufficient.
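A minimal sketch of the deterministic half of that split, written with exact decimal arithmetic. The invoice structure, field names, and a flat 23% VAT rate are illustrative assumptions, not Twistag's actual schema; the point is that every figure the LLM extracts is recomputed independently.

```python
from decimal import Decimal, ROUND_HALF_UP

def check_invoice(invoice: dict, po_prices: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the invoice passes."""
    errors = []
    computed_lines = Decimal("0")
    for line in invoice["lines"]:
        qty = Decimal(str(line["quantity"]))
        unit = Decimal(str(line["unit_price"]))
        # 1. Unit price must match the purchase order exactly -- zero tolerance.
        expected = po_prices.get(line["po_ref"])
        if expected is not None and unit != Decimal(str(expected)):
            errors.append(f"{line['po_ref']}: unit price {unit} != PO price {expected}")
        # 2. Recalculate the line total independently of what the document claims.
        total = (qty * unit).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
        if total != Decimal(str(line["total"])):
            errors.append(f"{line['po_ref']}: line total {line['total']} != {total}")
        computed_lines += total
    # 3. Verify the tax amount and the grand total against the sum of lines.
    tax = (computed_lines * Decimal(str(invoice["tax_rate"]))).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP)
    if tax != Decimal(str(invoice["tax_amount"])):
        errors.append(f"tax {invoice['tax_amount']} != {tax}")
    if computed_lines + tax != Decimal(str(invoice["grand_total"])):
        errors.append("grand total does not equal lines + tax")
    return errors
```

Using `Decimal` rather than floats matters here: binary floating point introduces exactly the kind of sub-cent drift the requirement rules out.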
The pipeline starts with a PDF arriving in Cloud Storage on GCP, which triggers a Firestore-based workflow. Claude reads the document and produces a structured JSON object: supplier identifiers including the Portuguese NIF tax number, invoice number and date, every line item with quantity and price, tax calculations, document totals. A two-layer supplier identification system links the invoice to the right ERP entity — NIF normalization for the primary path, fuzzy matching as a fallback when the tax number is missing or ambiguous.
Line-item matching is where the hybrid does its heaviest work. The deterministic layer handles clear-cut cases: exact PO references, aligned quantities, matching prices. These resolve instantly with no LLM cost. The LLM layer handles everything the rules can't — partial deliveries, multi-PO allocations, unit conversions, description mismatches. For each, Claude receives the invoice line alongside candidate ERP records and returns a structured decision with a confidence score and an explanation of its reasoning. High-confidence matches proceed automatically. Low-confidence matches surface in the dashboard for the finance team to review.
LangFuse runs across the whole AI layer for observability. Every Claude call is traced — what the model was given, what it produced, how long it took, what it cost. Prompt versions are tracked, cost per invoice is visible in real time. In a finance context, the ability to see how the AI reached a decision is part of what makes the system trustworthy enough to delegate to.
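To make the traced data concrete, here is a generic sketch of the record captured per call: prompt, completion, latency, cost, prompt version. This is not the LangFuse SDK (which provides this out of the box); the wrapper, its names, and the flat per-token price are illustrative assumptions.

```python
import time
from dataclasses import dataclass

@dataclass
class Trace:
    prompt_version: str
    prompt: str
    completion: str
    latency_s: float
    cost_usd: float

def traced_call(model_fn, prompt: str, prompt_version: str,
                usd_per_token: float = 0.000003) -> Trace:
    """Call the model and capture what it was given, what it produced,
    how long it took, and what it cost."""
    start = time.perf_counter()
    completion, tokens = model_fn(prompt)  # stub contract: returns (text, token count)
    return Trace(
        prompt_version=prompt_version,
        prompt=prompt,
        completion=completion,
        latency_s=time.perf_counter() - start,
        cost_usd=tokens * usd_per_token,
    )
```

Summing `cost_usd` over the traces for one invoice is what makes cost-per-invoice visible in real time.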
What this shaped
- **LLM proposes, code verifies.** Language models read context; deterministic code does the arithmetic. Neither alone is sufficient for finance.
- **Confidence scores route human time.** High-confidence matches resolve themselves; low-confidence ones surface for review, so humans see only the ambiguity.
- **Observability is part of trust.** Every Claude call traced, every prompt versioned, every cost-per-invoice visible. Finance teams need to see the reasoning.
The impact
The whole system shipped in six weeks with one engineer. Aralab's three-person finance team now works through a dashboard showing every invoice's processing status: fully automated, partially matched and awaiting line-item review, or flagged for full review. They process 2,000+ invoices a month with no human required for the patternable 80% and a fast review interface for the ambiguous 20%.
What that bought back is more important than the volume number. Three FTEs that were buried in data entry are now doing the work the business actually pays them to do — cash flow forecasting, supplier negotiations, financial planning. Same headcount, completely different output.
What this proved
- **Same headcount, different output.** Three FTEs went from data entry to cash flow forecasting and supplier negotiation without anyone being added.
- **Six weeks, one engineer.** The right pattern lets one engineer ship what looks like a team's work.
- **The patternable 80% is the unlock.** Most invoice work is patternable; the 20% that isn't gets a fast review surface, not slower automation.
Technologies used
- Anthropic Claude
- GCP
- Firebase
- LangFuse

