Every document extraction vendor claims 95%+ accuracy. None of them publish how they measure it.
We built an open, reproducible benchmark for Koji — 653 documents, 9 categories, 3,961 fields with ground-truth expected outputs. Anyone can run it. Here’s the methodology and what it revealed.
If you evaluate document AI products, you’ve seen the claims. “99% accuracy on invoices.” “Enterprise-grade extraction.” “Production-ready.” The numbers are always high and the methodology is never published.
This makes the claims unfalsifiable. You can’t reproduce them, you can’t compare them, and you can’t tell whether “95% accuracy” means 95% of documents had zero errors or 95% of individual fields matched expected values. These are very different numbers — a document with 10 fields where 1 field is wrong is 90% per-field but 0% per-document (it has an error). Neither framing is wrong, but publishing one without specifying which is misleading.
The machine learning community solved this years ago. Models are evaluated against standard benchmarks: SQuAD for question answering, GLUE for language understanding, ImageNet for image classification. Papers report results on the same datasets using the same metrics, so you can compare them directly.
Document extraction has no equivalent. There’s no SQuAD for invoices, no GLUE for insurance policies. Every vendor builds their own test set, measures against their own expected outputs, and reports their own number. The customer has no way to verify.
We built Koji’s benchmark to be the thing we wished existed when we started.
The validation corpus is a public repository: 653 documents across 9 categories.
| Category | Documents | Real | Synthetic | Notes |
|---|---|---|---|---|
| sec_filings | 102 | 102 | 0 | EDGAR 10-K, 10-Q, 8-K, DEF 14A, S-1, 20-F + amendments |
| insurance_policies | 97 | 17 | 80 | Dec pages, endorsements, binders across 9 policy types |
| invoices | 155 | 0 | 155 | Synthetic with full schema coverage |
| receipts | 52 | 52 | 0 | SROIE scanned receipts (real OCR) |
| insurance_claims | 152 | 17 | 135 | FEMA proof-of-loss, WC FROI, loss runs |
| insurance_certificates | 61 | 21 | 40 | COIs from .gov/.edu + synthetic |
| irs_forms | 20 | 20 | 0 | Structured tax forms |
| adversarial | 11 | 0 | 11 | Blank docs, OCR noise, wrong-schema, stapled packets |
| multi_format | 3 | 3 | 0 | xlsx, docx, pptx |
Each document has three components:
The real documents come from public sources: EDGAR filings, state insurance department websites, SROIE dataset, government COI repositories. The synthetic documents are generated to cover specific failure modes — carrier letter-codes on insurance certificates, line-broken text on SEC cover pages, multi-policy COIs with per-policy additional insureds.
The mix matters. Real documents test whether the pipeline handles actual OCR artifacts, layout variations, and formatting inconsistencies. Synthetic documents test specific edge cases that real documents don’t cover densely enough.
One command:
```bash
koji bench --corpus . --model openai/gpt-4o-mini
```
This runs every document through the extraction pipeline, compares the output to expected values, and reports per-category and per-field accuracy.
We measure at the field level, not the document level. If a document has 7 fields and the pipeline gets 6 right, that’s 6/7 = 85.7% for that document. The category accuracy is the sum of correct fields divided by total fields across all documents in the category.
Field-level is more honest than document-level because it doesn’t let one hard field drag down an otherwise perfect extraction. If filing_date is consistently tricky but the other 3 fields on SEC filings are always correct, the field-level metric shows 75% per document (3/4) rather than 0% (document has an error).
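To make the difference between the two framings concrete, here is a minimal sketch of both metrics. The `DocResult` structure and the function names are hypothetical and are not taken from the Koji codebase; only the arithmetic follows the description above.

```python
from dataclasses import dataclass

@dataclass
class DocResult:
    """Per-document tally of field comparisons (hypothetical structure)."""
    correct: int
    total: int

def field_level_accuracy(results):
    """Sum of correct fields divided by total fields across all documents."""
    total = sum(r.total for r in results)
    return sum(r.correct for r in results) / total if total else 0.0

def document_level_accuracy(results):
    """Share of documents in which every field matched."""
    if not results:
        return 0.0
    return sum(r.correct == r.total for r in results) / len(results)

# Three SEC-style documents with 4 fields each; one misses filing_date.
results = [DocResult(4, 4), DocResult(3, 4), DocResult(4, 4)]
print(f"field-level:    {field_level_accuracy(results):.1%}")     # 11 of 12 fields
print(f"document-level: {document_level_accuracy(results):.1%}")  # 2 of 3 documents
```

The same three documents score roughly 92% on one metric and 67% on the other, which is why a published number needs to say which one it is.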
Extracted values are compared against expected values with normalization:
Dates normalize to YYYY-MM-DD before comparison:
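```python
import re
from datetime import datetime

def _normalize_date(value):
    """Return the date as YYYY-MM-DD, or None if the format is not recognized."""
    s = value.strip()
    # YYYY-MM-DD already (zero-pad month and day)
    m = re.match(r"^(\d{4})-(\d{1,2})-(\d{1,2})$", s)
    if m:
        return f"{m.group(1)}-{int(m.group(2)):02d}-{int(m.group(3)):02d}"
    # MM/DD/YYYY
    m = re.match(r"^(\d{1,2})/(\d{1,2})/(\d{4})$", s)
    if m:
        return f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}"
    # Month DD, YYYY, e.g. "April 10, 2026" (month names use the current locale)
    try:
        return datetime.strptime(s, "%B %d, %Y").strftime("%Y-%m-%d")
    except ValueError:
        return None
```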
“April 10, 2026”, “04/10/2026”, “2026-04-10” all normalize to 2026-04-10.
Numbers strip currency symbols and formatting. $1,000.00 → 1000.0:
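```python
def _to_number(value):
    """Return value as a float, or None if it is not numeric."""
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # Strip the currency symbol and thousands separators: "$1,000.00" -> "1000.00"
        cleaned = value.replace("$", "").replace(",", "").strip()
        try:
            return float(cleaned)
        except ValueError:
            # Non-numeric text should fail the comparison, not crash the run
            return None
    return None
```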
Strings use configurable fuzzy matching via Levenshtein ratio. A fuzzy threshold of 0.85 (85% character similarity) allows minor OCR errors and formatting differences without counting them as failures. Each schema sets its own threshold in its YAML:
```yaml
compare:
  fuzzy_threshold: 0.85
```
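As an illustration of what a 0.85 threshold means in practice, here is a minimal pure-Python sketch of a Levenshtein-based similarity check. It assumes the ratio is defined as one minus the edit distance divided by the longer string's length, and that case and surrounding whitespace are ignored. Both are assumptions: string-matching libraries normalize in different ways, and the function names here are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via two-row DP."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """1.0 for identical strings; lower as more edits are needed."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def fuzzy_match(expected: str, actual: str, threshold: float = 0.85) -> bool:
    # Assumption: case and surrounding whitespace do not count as differences.
    return similarity(expected.strip().lower(), actual.strip().lower()) >= threshold

# A single OCR confusion (capital I read as lowercase l) stays above the threshold.
print(fuzzy_match("Acme Insurance Co.", "Acme lnsurance Co."))
# A different company name does not.
print(fuzzy_match("Acme Insurance Co.", "Zenith Mutual"))
```

On an 18-character carrier name, one wrong character gives a similarity of about 0.94, so the threshold tolerates one or two OCR slips but not a different value.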
Nulls have four-way semantics, which is critical for adversarial testing:
```python
# Both empty                         → PASS ("correctly absent")
# Expected empty, actual non-empty   → FAIL ("hallucinated")
# Expected non-empty, actual empty   → FAIL ("missing")
# Both non-empty                     → type-aware comparison
```
This means our adversarial corpus (blank docs, wrong-schema docs) can assert that the model should return null — and a hallucinated value counts as a failure, not a success. The extraction returning nothing when there’s nothing to extract is a feature, not a gap.
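The four cases can be sketched as a small comparison function. This is an illustrative version rather than Koji's code: `is_empty`, `compare_field`, and the choice to treat blank strings and empty containers as absent are all assumptions, and plain equality stands in for the type-aware comparison.

```python
def is_empty(value) -> bool:
    """Treat None, blank strings, and empty containers as absent (an assumption)."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False  # note: 0 and False are real values, not "empty"

def compare_field(expected, actual, compare_values=lambda e, a: e == a):
    """Return (passed, reason) following the four-way null rules."""
    exp_empty, act_empty = is_empty(expected), is_empty(actual)
    if exp_empty and act_empty:
        return True, "correctly absent"
    if exp_empty:
        return False, "hallucinated"
    if act_empty:
        return False, "missing"
    # Both present: defer to a type-aware comparison (plain equality here).
    passed = compare_values(expected, actual)
    return passed, ("match" if passed else "mismatch")

# A blank adversarial document: the expected value is null.
print(compare_field(None, None))
print(compare_field(None, "ACME Corp"))
print(compare_field("ACME Corp", ""))
```

Keeping "hallucinated" and "missing" as separate reasons makes it possible to report the two failure types separately for adversarial documents.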
Arrays compare order-independently with nested field matching. Each expected item finds its best match in the actual array via recursive comparison. A list of insurance policies matches if every expected policy has a corresponding actual policy with matching fields, regardless of array order. This handles the common case where an LLM returns array items in a different sequence than the ground truth.
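A simplified sketch of order-independent matching is shown below. It assumes each actual item can be claimed by only one expected item, picks matches greedily, and compares field values with plain equality instead of the recursive, type-aware comparison described above. The function names are hypothetical.

```python
def score_item(expected: dict, actual: dict) -> int:
    """Count how many expected fields the actual item reproduces exactly."""
    return sum(1 for key, value in expected.items() if actual.get(key) == value)

def match_arrays(expected: list, actual: list):
    """Pair each expected item with its best unused actual item.

    Returns (correct_fields, total_fields). Greedy pairing is an assumption;
    it can be suboptimal when several items look alike.
    """
    unused = list(range(len(actual)))
    correct = 0
    total = sum(len(item) for item in expected)
    for exp in expected:
        if not unused:
            break  # remaining expected items have no counterpart
        best = max(unused, key=lambda i: score_item(exp, actual[i]))
        correct += score_item(exp, actual[best])
        unused.remove(best)
    return correct, total

expected = [
    {"type": "General Liability", "carrier": "Acme", "limit": 1000000},
    {"type": "Auto", "carrier": "Zenith", "limit": 500000},
]
actual = [  # same policies in reversed order, with one wrong limit
    {"type": "Auto", "carrier": "Zenith", "limit": 250000},
    {"type": "General Liability", "carrier": "Acme", "limit": 1000000},
]
print(match_arrays(expected, actual))  # (5, 6)
```

The reversed order costs nothing; only the wrong limit is counted as a field failure.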
We don’t measure parsing accuracy (is the OCR correct?). The benchmark inputs are pre-parsed markdown. If the OCR misread a digit, the extraction might be “correct” (it faithfully extracted the wrong text) but the end-to-end result is wrong. Parsing accuracy is a separate problem with separate benchmarks.
We don’t measure latency in the accuracy number. A field that takes 30 seconds to extract but returns the right value counts the same as one that returns in 1 second. Latency is tracked separately.
We don’t measure cost per field. The benchmark reports elapsed time and can be run against different models to compare cost-accuracy tradeoffs, but the accuracy number itself does not factor in cost.
Current results across the full corpus:
| Category | Accuracy | Fields |
|---|---|---|
| irs_forms | 100.0% | 180/180 |
| multi_format | 100.0% | 18/18 |
| sec_filings | 99.2% | 380/383 |
| insurance_policies | 99.1% | 756/763 |
| adversarial | 96.7% | 58/60 |
| insurance_claims | 95.5% | 974/1020 |
| invoices | 95.3% | 1478/1551 |
| insurance_certificates | 94.4% | 288/305 |
| receipts | 81.6% | 120/147 |
| Overall | 96.1% | 4252/4427 |
OCR quality (receipts, 81.6%). The SROIE receipt dataset contains scanned images of thermal-printed receipts — blurry, skewed, low-resolution. The extraction is often correct given the OCR output; the OCR output is often wrong given the original image. This is a parsing problem, not an extraction problem, but it shows up in the end-to-end number. We track it separately to avoid conflating the two.
Complex nested arrays (insurance certificates, 94.4%). Certificates of insurance contain a table of policies, each with its own carrier, limits, dates, and additional insureds. Extracting this as a structured array is one of the hardest tasks in document extraction — the LLM needs to associate the right carrier with the right policy row, match limits to the correct coverage type, and list per-policy additional insureds separately. Even small misassociations count as field failures.
LLM non-determinism (sec_filings, 99.2%). Even at temperature=0, large language models occasionally return different results for the same input. A field that extracts correctly 95% of the time will eventually fail in a 383-field benchmark. Our same-chunk retry mechanism (re-extract with identical input when a required field returns null) catches most of these, but some slip through. The remaining 0.8% failure rate on SEC filings is almost entirely non-determinism.
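A same-chunk retry could look roughly like the sketch below. The `extract` callable, the `max_retries` parameter, and the rule of filling in only the fields that were null are assumptions made for illustration; the actual mechanism in Koji may differ.

```python
from typing import Callable

def extract_with_retry(
    extract: Callable[[str], dict],
    chunk: str,
    required: list,
    max_retries: int = 2,
) -> dict:
    """Re-run extraction on the identical chunk while required fields are null."""
    result = extract(chunk)
    for _ in range(max_retries):
        missing = [f for f in required if result.get(f) is None]
        if not missing:
            break
        retry = extract(chunk)  # same input; relies on run-to-run variation
        for field in missing:
            if retry.get(field) is not None:
                result[field] = retry[field]
    return result

# Stand-in extractor that drops filing_date on its first call only.
calls = {"n": 0}

def flaky_extract(chunk: str) -> dict:
    calls["n"] += 1
    date = None if calls["n"] == 1 else "2026-04-10"
    return {"form_type": "10-K", "filing_date": date}

print(extract_with_retry(flaky_extract, "...", ["form_type", "filing_date"]))
```

A retry of this kind only helps with fields that come back null. A field that returns a wrong but non-null value passes straight through, which is one way failures can still slip past it.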
Run the same benchmark twice and you’ll get different numbers. We’ve seen the overall accuracy vary by 2-3 percentage points between runs — same code, same model, same documents. The sources:
We report the accuracy from a clean run with retries and stable connectivity. The number represents the pipeline’s capability, not the API’s reliability on any given day.
Document extraction is too important to ship untested. If your pipeline processes insurance claims that determine payouts, or SEC filings that inform investment decisions, or medical records that affect patient care — you need to know the accuracy before you deploy, and you need to know when it regresses.
A benchmark that runs on every engine change, against a public corpus with published expected outputs, makes accuracy a measurable property of the system rather than a marketing claim. It’s the difference between “we think it works” and “we measured it at 96.1% on May 16, 2026, and here’s the JSON to prove it.”
The corpus is public. The benchmark tool is open source. The methodology is what you just read. If you’re building document extraction and you want to measure honestly, you can start here.
```bash
git clone https://github.com/getkoji/corpus
koji bench --corpus corpus/ --model openai/gpt-4o-mini
```
The number you get is the number. No cherry-picking, no caveats, no fine print.
Frank Thomas is the founder of Koji, an open-source document extraction platform. The validation corpus is available at github.com/getkoji/corpus.