To evaluate AI document automation accuracy before you buy, measure four things on your own documents: precision (are extracted values correct?), recall (are all required fields captured?), hallucination rate (does the platform invent values that aren't in the document?), and citation depth (can you verify each value against its source?). A single headline accuracy number means little without these — and without a proof-of-concept run on your real documents, including the messy ones.
This is a practical framework for technology and operations buyers running POCs or vendor evaluations for document AI platforms.
This is part of a series of articles about AI for Document Workflows.
What "Accuracy" Actually Means
Vendors quote accuracy as one figure, but it hides four distinct properties that matter differently depending on your workflow.
| Metric | Question it answers | Why it matters |
|---|---|---|
| Precision | Are the extracted values correct? | Wrong values flow downstream into decisions |
| Recall | Are all required fields captured? | A missed field is a silent gap: nothing flags it as an error |
| Hallucination rate | Does it invent values not in the document? | Invented values are the most dangerous failure |
| Citation depth | Can each value be verified against its source? | Determines whether you can trust and audit output |
The hallucination rate deserves special attention: a platform that confidently returns a plausible value that isn't in the document is more dangerous than one that flags uncertainty, because the error is invisible. Citation depth is what makes hallucinations catchable — if every value links to a source, an invented value has nowhere to point.
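As a rough illustration, the four metrics can be computed per document once you have a human-reviewed ground truth. The sketch below is one possible scoring scheme, not a standard: the `Extraction` type, the `score` function, and the field names are hypothetical. It assumes the ground truth marks a field as `None` when the value is genuinely absent from the document, and it treats any value returned for such a field as a hallucination. Citation depth is simplified here to citation coverage, meaning the share of returned values that carry any source reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Extraction:
    value: Optional[str]            # None means the platform returned nothing
    citation: Optional[str] = None  # e.g. a page or clause reference; None if uncited


def score(ground_truth: dict[str, Optional[str]],
          extracted: dict[str, Extraction]) -> dict[str, float]:
    """Score one document. All definitions here are illustrative assumptions."""
    # Values the platform actually returned.
    returned = {f: e for f, e in extracted.items() if e.value is not None}
    # Fields the reviewers found in the document.
    expected = {f: v for f, v in ground_truth.items() if v is not None}

    correct = sum(1 for f, e in returned.items() if expected.get(f) == e.value)
    # A returned value for a field the reviewers marked absent (or never listed).
    invented = sum(1 for f in returned if ground_truth.get(f) is None)
    cited = sum(1 for e in returned.values() if e.citation)

    n = len(returned)
    return {
        "precision": correct / n if n else 0.0,
        "recall": correct / len(expected) if expected else 0.0,
        "hallucination_rate": invented / n if n else 0.0,
        "citation_coverage": cited / n if n else 0.0,
    }


# Hypothetical lease abstract: one correct value, one wrong value,
# one invented value, and one missed field.
truth = {"tenant": "Acme LLC", "rent": "4500",
         "start_date": "2024-01-01", "renewal_option": None}
output = {
    "tenant": Extraction("Acme LLC", citation="p1"),
    "rent": Extraction("4800", citation="p3"),
    "renewal_option": Extraction("5 years"),
}
metrics = score(truth, output)
```

Exact string matching is the strictest possible comparison. A real evaluation would likely normalize dates, currency, and whitespace before comparing, and would aggregate counts across all documents rather than averaging per-document ratios.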
How to Structure a POC
The single most important rule: use your own documents, not vendor-provided samples. Vendor demos run on documents chosen because the platform handles them well. Your POC should run on a representative slice of what you actually process, deliberately including edge cases: multi-amendment stacks, non-standard formats, handwritten notes, and scanned or low-quality images. Measure the results against a human-reviewed ground truth — a set of documents your own experts have abstracted — so you can compute precision, recall, and hallucination rate rather than eyeballing a few outputs. Run enough volume to see consistency, not just a lucky sample.
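To make "enough volume" concrete, one option is to report precision per document segment with a confidence interval, so a small or lucky sample shows up as a wide range. The sketch below uses the Wilson score interval at roughly 95% confidence (`z = 1.96`). The segment labels and the `precision_by_segment` helper are hypothetical, and the choice of interval is an assumption, not something a vendor or standard prescribes.

```python
import math
from collections import defaultdict


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is about 95% confidence)."""
    if total == 0:
        return (0.0, 1.0)  # no data: the true rate could be anything
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def precision_by_segment(results):
    """results: iterable of (segment_label, is_correct) pairs, one per extracted value."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [correct, total]
    for segment, is_correct in results:
        counts[segment][0] += int(is_correct)
        counts[segment][1] += 1
    return {seg: (c / n, wilson_interval(c, n)) for seg, (c, n) in counts.items()}


# Hypothetical POC results: the same 95% point estimate at two sample sizes.
results = ([("clean_pdf", True)] * 950 + [("clean_pdf", False)] * 50
           + [("scanned_amendment", True)] * 19 + [("scanned_amendment", False)] * 1)
report = precision_by_segment(results)
```

With these made-up numbers, both segments show 95% precision, but the interval for the 20-value scanned segment spans roughly 0.76 to 0.99, while the 1,000-value segment narrows to roughly 0.93 to 0.96. The edge-case segments are usually the small ones, which is a reason to oversample them deliberately.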
Red Flags in Vendor Demos
A few patterns should lower your confidence immediately. Pre-selected documents: if you can't bring your own, you're seeing a best case, not a real one. No citations on outputs: if the platform can't show where a value came from, you can't verify it at scale or defend it later. Accuracy claims without methodology: a "99% accurate" figure means nothing unless you know which documents it was measured on, whether it reports precision or recall, and what ground truth it was scored against. "We train on your data" buried in the contract: a data-handling term that turns your confidential documents into training material is both a security and a competitive concern. Treat each of these as a reason to dig deeper before trusting the headline number.
What Good Looks Like
A strong platform shows three things on your documents. First, consistent performance across format variations — accuracy doesn't collapse when you move from a clean PDF to a scanned amendment or an unfamiliar layout. Second, citations on every field, so every value is verifiable against its source. Third, clear flagging of low-confidence extractions for human review, rather than confident guesses on ambiguous items. A platform that does all three lets you trust it at scale and verify it where it counts — which is the whole point of measuring accuracy before you buy. Pair these with the right security posture: onshore processing, SOC 2 Type II certification, and no training on your data.
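The second and third properties can be checked mechanically during a POC. The sketch below assumes the platform exposes a per-field confidence score between 0 and 1 and a citation reference. Both are assumptions about the output format, and the `Field` type, the `route` function, and the 0.9 threshold are hypothetical. Anything uncited or below the threshold goes to a human reviewer instead of being accepted automatically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    name: str
    value: str
    confidence: float        # assumed to be in [0, 1]
    citation: Optional[str]  # source reference, or None if the platform gave none


def route(fields: list[Field], threshold: float = 0.9) -> tuple[list[Field], list[Field]]:
    """Split fields into (auto-accepted, needs human review)."""
    accepted, review = [], []
    for f in fields:
        # Accept only values that are both confident and traceable to a source.
        if f.confidence >= threshold and f.citation:
            accepted.append(f)
        else:
            review.append(f)
    return accepted, review


# Hypothetical output for one document.
fields = [
    Field("tenant", "Acme LLC", 0.98, "page 1, line 4"),
    Field("rent", "4500", 0.62, "page 3, clause 2.1"),  # low confidence
    Field("renewal_option", "5 years", 0.95, None),     # confident but uncited
]
accepted, review = route(fields)
```

During a POC it is worth comparing the review queue against the ground truth: if most of the platform's actual errors land in the review pile, the confidence scores are doing useful work. If errors sit in the accepted pile, the scores are not well calibrated for your documents.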
Learn more about AI document extraction source citations.
How Kolena Works
Kolena is an AI document automation platform built for insurance, real estate, banking, and financial services teams evaluating document AI. Your own documents — including the edge cases — go in; structured output comes out with every field cited to its source and low-confidence items flagged for review, so a POC measures real performance, not a curated demo.
It reads any format, keeps accuracy consistent across format variation, and pushes cited output into your existing systems, so you can verify each value against a human-reviewed ground truth. Every run produces a full audit trail: not just what was extracted, but the specific line, field, or clause that justified each data point. Kolena is SOC 2 Type II certified, processes documents onshore, and does not train on customer data.