Benchmarking Document Extraction: How We Measure Accuracy Across 653 Documents

Every document extraction vendor claims 95%+ accuracy. None of them publish how they measure it.

We built an open, reproducible benchmark for Koji — 653 documents, 9 categories, 3,961 fields with ground-truth expected outputs. Anyone can run it. Here’s the methodology and what it revealed.

The credibility problem

If you evaluate document AI products, you’ve seen the claims. “99% accuracy on invoices.” “Enterprise-grade extraction.” “Production-ready.” The numbers are always high and the methodology is never published.

This makes the claims unfalsifiable. You can’t reproduce them, you can’t compare them, and you can’t tell whether “95% accuracy” means 95% of documents had zero errors or 95% of individual fields matched expected values. These are very different numbers — a document with 10 fields where 1 field is wrong is 90% per-field but 0% per-document (it has an error). Neither framing is wrong, but publishing one without specifying which is misleading.

The machine learning community solved this years ago. Models are evaluated against standard benchmarks: SQuAD for question answering, GLUE for language understanding, ImageNet for computer vision. Papers report results on the same datasets using the same metrics, so you can compare directly.

Document extraction has no equivalent. There’s no SQuAD for invoices, no GLUE for insurance policies. Every vendor builds their own test set, measures against their own expected outputs, and reports their own number. The customer has no way to verify.

We built Koji’s benchmark to be the thing we wished existed when we started.

The corpus

The validation corpus is a public repository: 653 documents across 9 categories.

Category	Documents	Real	Synthetic	Notes
sec_filings	102	102	0	EDGAR 10-K, 10-Q, 8-K, DEF 14A, S-1, 20-F + amendments
insurance_policies	97	17	80	Dec pages, endorsements, binders across 9 policy types
invoices	155	0	155	Synthetic with full schema coverage
receipts	52	52	0	SROIE scanned receipts (real OCR)
insurance_claims	152	17	135	FEMA proof-of-loss, WC FROI, loss runs
insurance_certificates	61	21	40	COIs from .gov/.edu + synthetic
irs_forms	20	20	0	Structured tax forms
adversarial	11	0	11	Blank docs, OCR noise, wrong-schema, stapled packets
multi_format	3	3	0	xlsx, docx, pptx

Each document has three components:

The document itself — parsed markdown (the input to extraction)
A schema — YAML defining what fields to extract, with types and validation rules
Expected output — JSON with the ground-truth values for every field

The real documents come from public sources: EDGAR filings, state insurance department websites, the SROIE dataset, and government COI repositories. The synthetic documents are generated to cover specific failure modes: carrier letter-codes on insurance certificates, line-broken text on SEC cover pages, and multi-policy COIs with per-policy additional insureds.

The mix matters. Real documents test whether the pipeline handles actual OCR artifacts, layout variations, and formatting inconsistencies. Synthetic documents test specific edge cases that real documents don’t cover densely enough.

How we measure

One command:

koji bench --corpus . --model openai/gpt-4o-mini

This runs every document through the extraction pipeline, compares the output to expected values, and reports per-category and per-field accuracy.

Field-level accuracy

We measure at the field level, not the document level. If a document has 7 fields and the pipeline gets 6 right, that’s 6/7 = 85.7% for that document. The category accuracy is the sum of correct fields divided by total fields across all documents in the category.

Field-level is more honest than document-level because it doesn’t let one hard field drag down an otherwise perfect extraction. If filing_date is consistently tricky but the other 3 fields on SEC filings are always correct, the field-level metric shows 75% per document (3/4) rather than 0% (document has an error).
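To make the difference between the two framings concrete, here is a minimal sketch that computes both from the same results. The function names and the input shape (one list of pass/fail booleans per document) are illustrative assumptions, not Koji's API.

```python
def field_level_accuracy(docs):
    """Correct fields divided by total fields, pooled across all documents."""
    correct = sum(sum(fields) for fields in docs)
    total = sum(len(fields) for fields in docs)
    return correct / total if total else 0.0


def document_level_accuracy(docs):
    """Share of documents in which every field is correct."""
    if not docs:
        return 0.0
    return sum(all(fields) for fields in docs) / len(docs)


# Hypothetical category: three filings with 4 fields each;
# one hard field (say, filing_date) is wrong in two of them.
docs = [
    [True, True, True, False],
    [True, True, True, True],
    [True, True, True, False],
]

print(f"field-level:    {field_level_accuracy(docs):.1%}")     # 10 of 12 fields
print(f"document-level: {document_level_accuracy(docs):.1%}")  # 1 of 3 documents
```

The same extraction run yields two very different headline numbers, which is why a reported accuracy needs to state which one it is.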

Comparison rules

Extracted values are compared against expected values with normalization:

Dates normalize to YYYY-MM-DD before comparison:

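import re
from datetime import datetime

def _normalize_date(value):
    """Return the date as YYYY-MM-DD, or None if the format is not recognized."""
    s = value.strip()
    # YYYY-MM-DD already (zero-pad month and day)
    m = re.match(r"^(\d{4})-(\d{1,2})-(\d{1,2})$", s)
    if m:
        return f"{m.group(1)}-{int(m.group(2)):02d}-{int(m.group(3)):02d}"
    # MM/DD/YYYY
    m = re.match(r"^(\d{1,2})/(\d{1,2})/(\d{4})$", s)
    if m:
        return f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}"
    # Month DD, YYYY (assumed branch, added so month-name dates are covered)
    try:
        return datetime.strptime(s, "%B %d, %Y").strftime("%Y-%m-%d")
    except ValueError:
        return None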

“April 10, 2026”, “04/10/2026”, “2026-04-10” all normalize to 2026-04-10.

Numbers strip currency symbols and formatting. $1,000.00 → 1000.0:

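def _to_number(value):
    """Return value as a float, or None if it cannot be parsed as a number."""
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # Strip the currency symbol and thousands separators: "$1,000.00" -> "1000.00"
        cleaned = value.replace("$", "").replace(",", "").strip()
        try:
            return float(cleaned)
        except ValueError:
            return None  # non-numeric text such as "N/A"
    return None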

Strings use configurable fuzzy matching via Levenshtein ratio. A fuzzy threshold of 0.85 (85% character similarity) allows minor OCR errors and formatting differences without counting them as failures. Each schema sets its own threshold in its YAML:

compare:
  fuzzy_threshold: 0.85
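As a rough illustration of how a threshold like this behaves, the sketch below computes an edit-distance similarity and applies the 0.85 cutoff. The ratio definition (1 minus distance over the longer string's length) and the lowercasing and trimming step are assumptions made for this example; the benchmark's own ratio may be defined differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed ratio: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_match(expected: str, actual: str, threshold: float = 0.85) -> bool:
    """Case-insensitive comparison against the similarity threshold."""
    return similarity(expected.strip().lower(), actual.strip().lower()) >= threshold


# One OCR-style substitution ("o" read as "0") in a 16-character name: 1 - 1/16 = 0.9375
print(fuzzy_match("Acme Corporation", "Acme C0rporation"))  # True
# A different company name falls well below the threshold
print(fuzzy_match("Acme Corporation", "Apex Industries"))   # False
```

Note that a fixed ratio is more forgiving on long strings than on short ones: a single wrong character in a 5-character value already drops the similarity to 0.80.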

Nulls have four-way semantics, which is critical for adversarial testing:

# Both empty → PASS ("correctly absent")
# Expected empty, actual non-empty → FAIL ("hallucinated")
# Expected non-empty, actual empty → FAIL ("missing")
# Both non-empty → type-aware comparison

This means our adversarial corpus (blank docs, wrong-schema docs) can assert that the model should return null — and a hallucinated value counts as a failure, not a success. The extraction returning nothing when there’s nothing to extract is a feature, not a gap.
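The four cases can be sketched as a small function. What counts as "empty" (None, blank strings, empty lists and dicts) and the `compare_values` hook are assumptions for illustration; plain equality stands in for the type-aware comparison.

```python
def is_empty(value) -> bool:
    """Treat None, blank strings, and empty containers as absent (an assumption)."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False  # note: 0 and False are real values, not absent


def compare_field(expected, actual, compare_values=lambda e, a: e == a):
    """Return (passed, reason) following the four cases listed above."""
    if is_empty(expected) and is_empty(actual):
        return True, "correctly absent"
    if is_empty(expected):
        return False, "hallucinated"
    if is_empty(actual):
        return False, "missing"
    # Both present: defer to a type-aware comparison (plain equality here)
    passed = compare_values(expected, actual)
    return passed, "match" if passed else "mismatch"


print(compare_field(None, None))         # (True, 'correctly absent')
print(compare_field(None, "ACME Corp"))  # (False, 'hallucinated')
print(compare_field("ACME Corp", ""))    # (False, 'missing')
print(compare_field(1000.0, 1000.0))     # (True, 'match')
```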

Arrays compare order-independently with nested field matching. Each expected item finds its best match in the actual array via recursive comparison. A list of insurance policies matches if every expected policy has a corresponding actual policy with matching fields, regardless of array order. This handles the common case where an LLM returns array items in a different sequence than the ground truth.
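One way to picture order-independent matching is the greedy sketch below: each expected item claims the unused actual item that agrees with it on the most fields. The function names, the greedy strategy, and the rule that an actual item can be claimed only once are assumptions for this example, and plain equality stands in for the type-aware comparison. A greedy pass is not guaranteed to find the best overall assignment.

```python
def count_matches(expected, actual):
    """Return (matched, total) leaf fields, recursing into dicts and lists."""
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            actual = {}
        matched = total = 0
        for key, exp_value in expected.items():
            m, t = count_matches(exp_value, actual.get(key))
            matched, total = matched + m, total + t
        return matched, total
    if isinstance(expected, list):
        return match_arrays(expected, actual if isinstance(actual, list) else [])
    # Scalar leaf: plain equality stands in for the type-aware comparison
    return (1 if expected == actual else 0), 1


def match_arrays(expected_items, actual_items):
    """Greedy order-independent matching over two lists of items."""
    unused = list(range(len(actual_items)))
    matched = total = 0
    for exp in expected_items:
        # Default when nothing is left to claim: zero matches out of exp's field count
        best_index, best = None, (0, count_matches(exp, None)[1])
        for i in unused:
            score = count_matches(exp, actual_items[i])
            if best_index is None or score[0] > best[0]:
                best_index, best = i, score
        if best_index is not None:
            unused.remove(best_index)  # each actual item is claimed at most once
        matched, total = matched + best[0], total + best[1]
    return matched, total


expected = [
    {"type": "general_liability", "carrier": "Carrier A", "limit": 1000000},
    {"type": "auto", "carrier": "Carrier B", "limit": 500000},
]
actual = [  # same policies in a different order, with one wrong limit
    {"type": "auto", "carrier": "Carrier B", "limit": 500000},
    {"type": "general_liability", "carrier": "Carrier A", "limit": 2000000},
]
print(match_arrays(expected, actual))  # (5, 6)
```

The reordering costs nothing; only the wrong limit is counted as a field failure.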

What we don’t measure

We don’t measure parsing accuracy (is the OCR correct?). The benchmark inputs are pre-parsed markdown. If the OCR misread a digit, the extraction might be “correct” (it faithfully extracted the wrong text) but the end-to-end result is wrong. Parsing accuracy is a separate problem with separate benchmarks.

We don’t measure latency in the accuracy number. A field that takes 30 seconds to extract but returns the right value counts the same as one that returns in 1 second. Latency is tracked separately.

We don’t measure cost per field. The benchmark reports elapsed time and can be run against different models to compare cost-accuracy tradeoffs, but the accuracy number itself takes no account of cost.

What the benchmark revealed

Current results across the full corpus:

Category	Accuracy	Fields
irs_forms	100.0%	180/180
multi_format	100.0%	18/18
sec_filings	99.2%	380/383
insurance_policies	99.1%	756/763
adversarial	96.7%	58/60
insurance_claims	95.5%	974/1020
invoices	95.3%	1478/1551
insurance_certificates	94.4%	288/305
receipts	81.6%	120/147
Overall	96.1%	4252/4427

Where extraction fails

OCR quality (receipts, 81.6%). The SROIE receipt dataset contains scanned images of thermal-printed receipts — blurry, skewed, low-resolution. The extraction is often correct given the OCR output; the OCR output is often wrong given the original image. This is a parsing problem, not an extraction problem, but it shows up in the end-to-end number. We track it separately to avoid conflating the two.

Complex nested arrays (insurance certificates, 94.4%). Certificates of insurance contain a table of policies, each with its own carrier, limits, dates, and additional insureds. Extracting this as a structured array is one of the hardest tasks in document extraction — the LLM needs to associate the right carrier with the right policy row, match limits to the correct coverage type, and list per-policy additional insureds separately. Even small misassociations count as field failures.

LLM non-determinism (sec_filings, 99.2%). Even at temperature=0, large language models occasionally return different results for the same input. A field that extracts correctly 95% of the time fails about once in every 20 attempts, so a 383-field benchmark will surface some of those failures. Our same-chunk retry mechanism (re-extracting with identical input when a required field returns null) catches most of them, but some slip through. The remaining 0.8% failure rate on SEC filings is almost entirely non-determinism.
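A same-chunk retry of the kind described can be sketched as follows. The `extract` callable, the `max_retries` parameter, and the choice to keep values already found while filling only the missing ones are assumptions for illustration, not Koji's actual implementation.

```python
def extract_with_retry(extract, chunk, required_fields, max_retries=2):
    """Call extract(chunk) again with identical input while any required
    field is still null, keeping the values already found."""
    result = dict(extract(chunk))
    for _ in range(max_retries):
        missing = [f for f in required_fields if result.get(f) is None]
        if not missing:
            break
        retry = extract(chunk)  # same chunk, same input
        for field in missing:
            if retry.get(field) is not None:
                result[field] = retry[field]
    return result


# Stand-in extractor that drops filing_date on its first call only
calls = {"n": 0}

def flaky_extract(chunk):
    calls["n"] += 1
    date = None if calls["n"] == 1 else "2026-04-10"
    return {"form_type": "10-K", "filing_date": date}


result = extract_with_retry(flaky_extract, "cover page text", ["form_type", "filing_date"])
print(result)      # {'form_type': '10-K', 'filing_date': '2026-04-10'}
print(calls["n"])  # 2
```

Retrying only on null keeps the cost low, but it cannot catch a field that comes back non-null and wrong.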

The variance problem

Run the same benchmark twice and you’ll get different numbers. We’ve seen the overall accuracy vary by 2-3 percentage points between runs — same code, same model, same documents. The sources:

LLM non-determinism. Temperature=0 reduces but doesn’t eliminate variation. OpenAI has confirmed this.
Rate limiting. At 653 documents, the benchmark fires hundreds of LLM calls. If the API rate-limits some of them, those extractions fail silently. We added retry with exponential backoff specifically to address this (it moved SEC filings from 93.7% to 99.2%).
Timing-dependent behavior. Some API calls time out under load but succeed when the API is less busy. Our 300-second timeout is generous, but not infinite.
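For readers who want to add the same protection to their own harness, retry with exponential backoff can be sketched like this. `RateLimitError`, `call_with_backoff`, and the default delays are hypothetical names and values chosen for the example; substitute the exception your API client actually raises.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for whatever exception the API client raises on HTTP 429."""


def call_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Retry call() on rate-limit errors, doubling the maximum wait each time."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # surface the failure instead of dropping it silently
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps parallel workers from retrying in lockstep
            sleep(random.uniform(0, delay))


# Stand-in API call that is rate-limited twice before succeeding
attempts = {"n": 0}

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


# Pass a no-op sleep so the demo does not actually wait
print(call_with_backoff(flaky_call, sleep=lambda seconds: None))  # ok
```

Re-raising on the final attempt matters as much as the retry itself: a failure that is logged can be excluded or rerun, while a silent one just lowers the score.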

We report the accuracy from a clean run with retries and stable connectivity. The number represents the pipeline’s capability, not the API’s reliability on any given day.

Why this matters

Document extraction is too important to ship untested. If your pipeline processes insurance claims that determine payouts, or SEC filings that inform investment decisions, or medical records that affect patient care — you need to know the accuracy before you deploy, and you need to know when it regresses.

A benchmark that runs on every engine change, against a public corpus with published expected outputs, makes accuracy a measurable property of the system rather than a marketing claim. It’s the difference between “we think it works” and “we measured it at 96.1% on May 16, 2026, and here’s the JSON to prove it.”

The corpus is public. The benchmark tool is open source. The methodology is what you just read. If you’re building document extraction and you want to measure honestly, you can start here.

git clone https://github.com/getkoji/corpus
koji bench --corpus corpus/ --model openai/gpt-4o-mini

The number you get is the number, within the run-to-run variance described above. No cherry-picking, no fine print.

Frank Thomas is the founder of Koji, an open-source document extraction platform. The validation corpus is available at github.com/getkoji/corpus.