frank thomas

Documents are a data source. Treat them like one.

Most extraction projects fail because teams treat the document as the artifact instead of the schema.

Most document-extraction projects don’t fail at the model. They fail at the schema — or rather, at the absence of one. The team treats the PDF as the artifact. The PDF is not the artifact. The structured record on the other side is.

I’ve been building Koji on the back of this observation. In engagement after engagement, a client handed us a folder of invoices and asked, in good faith, “can your model read these?” The answer was always yes, and the project was always a mess. The model could read them. The team couldn’t agree on what “read” meant.

So we changed the order of operations. We started writing the schema first, before opening a single document. The schema is the contract: it says what fields exist, what types they are, what’s nullable, what’s enumerated, and — most importantly — what the downstream system does with each one. Once the contract exists, the document is just one of several possible producers of records that satisfy it.

What “schema-first” actually means

Concretely, this is what we ask a new Koji user to write before they touch a parser:

schema "invoice" {
  field invoice_no   : string  required
  field issued_on    : date    required
  field due_on       : date    nullable
  field currency     : enum["USD","EUR","CAD","GBP"]
  field line_items   : list<line_item>  min=1
  field subtotal     : money   required
  field tax          : money   default=0
  field total        : money   required  invariant total == subtotal + tax
}

Notice what isn’t here: no mention of the PDF, the layout, the vendor, or the model. The schema describes the world the data needs to live in, not the world it came from. The invariant at the bottom is a property the extractor must satisfy or the record is rejected — not flagged, not warned, rejected. This is the only way to keep downstream code honest.
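To make the "rejected, not flagged" behaviour concrete, here is a minimal Python sketch of the same contract. It is a hand translation for illustration, not Koji's implementation: the ContractViolation name, the LineItem fields (the schema above doesn't define line_item), and the choice of Decimal for money are all assumptions.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional, Tuple

CURRENCIES = {"USD", "EUR", "CAD", "GBP"}


class ContractViolation(ValueError):
    """Hypothetical error type: the record does not satisfy the contract."""


@dataclass(frozen=True)
class LineItem:
    # Hypothetical fields; the post does not spell out line_item.
    description: str
    amount: Decimal


@dataclass(frozen=True)
class Invoice:
    invoice_no: str
    issued_on: date
    currency: str
    line_items: Tuple[LineItem, ...]
    subtotal: Decimal
    total: Decimal
    due_on: Optional[date] = None   # nullable
    tax: Decimal = Decimal("0")     # default=0

    def __post_init__(self):
        # Any failed check raises, so an invalid Invoice can never exist.
        if not self.invoice_no:
            raise ContractViolation("invoice_no is required")
        if self.currency not in CURRENCIES:
            raise ContractViolation(f"unknown currency: {self.currency!r}")
        if len(self.line_items) < 1:
            raise ContractViolation("line_items needs at least one entry")
        if self.total != self.subtotal + self.tax:
            raise ContractViolation("invariant failed: total != subtotal + tax")
```

Because the checks run in the constructor, downstream code that receives an Invoice never has to re-validate it. A record whose total is off by a cent raises ContractViolation instead of arriving with a warning attached.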

Why this matters in practice

Three things change once the schema leads.

  1. Disagreements move upstream. Stakeholders argue about whether “tax” includes withholding before any model is trained. That argument is going to happen anyway. Better to have it in week one than week twelve.
  2. The model becomes interchangeable. When the contract is the schema, the producer behind it — heuristics, LLM, hybrid pipeline, human-in-the-loop — is an implementation detail. We’ve swapped extractors three times on production pipelines and downstream consumers didn’t notice.
  3. Coverage becomes measurable. “Did this document produce a valid record?” is a binary question. Aggregate it over a corpus and you have a real coverage metric instead of vibes about model accuracy.
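Points 2 and 3 can be sketched together. In the Python below, an extractor is any callable that returns a record or raises, and coverage is the fraction of documents that yield a valid record. Both extractors and the tiny corpus are toy stand-ins invented for illustration; none of this is Koji's API.

```python
from typing import Callable, Iterable

# Assumption: a producer either returns a record dict or raises ValueError.
Extractor = Callable[[bytes], dict]


def heuristic_extractor(doc: bytes) -> dict:
    """Toy producer: expects text of the form 'INVOICE <number>'."""
    parts = doc.decode("utf-8", errors="replace").split()
    if len(parts) != 2 or parts[0] != "INVOICE":
        raise ValueError("no valid record")
    return {"invoice_no": parts[1]}


def lenient_extractor(doc: bytes) -> dict:
    """Second toy producer: also accepts a lowercase 'invoice' keyword."""
    parts = doc.decode("utf-8", errors="replace").split()
    if len(parts) != 2 or parts[0].upper() != "INVOICE":
        raise ValueError("no valid record")
    return {"invoice_no": parts[1]}


def coverage(extract: Extractor, corpus: Iterable[bytes]) -> float:
    """Fraction of documents that produced a valid record."""
    valid = 0
    total = 0
    for doc in corpus:
        total += 1
        try:
            extract(doc)
            valid += 1
        except ValueError:
            pass  # an explicit failure, counted against coverage
    return valid / total if total else 0.0


corpus = [b"INVOICE A-1", b"invoice A-2", b"garbage", b"INVOICE"]
print(coverage(heuristic_extractor, corpus))  # 0.25
print(coverage(lenient_extractor, corpus))    # 0.5
```

The measuring code never inspects how a record was produced, which is what lets one producer be swapped for another and compared on the same corpus.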

A worked example

Here’s a small extraction run from a recent project — invoices from a logistics provider, three vendor formats, six months of mail. Numbers are real.

Vendor                      Documents  Valid records  Coverage  Median latency
Vendor A (clean PDFs)           4,212          4,198     99.7%          410 ms
Vendor B (scanned + OCR)        1,847          1,793     97.1%           1.2 s
Vendor C (faxed, terrible)        602            514     85.4%           2.1 s

The 88 documents from Vendor C that failed validation weren’t silently mis-extracted. They were rejected at the contract boundary and routed to a human queue. This is the entire game. An extractor that produces 100 records, 90 of which are subtly wrong, is worse than an extractor that produces 85 correct records and 15 explicit failures. The first poisons your warehouse. The second tells you exactly where the problem is.
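The coverage and rejection figures follow directly from the document and valid-record counts. A short Python check, using only the numbers reported above (the summarize helper is a name made up for this sketch):

```python
# (documents, valid records) per vendor, as reported in the run above.
runs = {
    "Vendor A": (4212, 4198),
    "Vendor B": (1847, 1793),
    "Vendor C": (602, 514),
}


def summarize(documents: int, valid: int):
    """Return (coverage in percent, rounded to 0.1; number of rejected documents)."""
    return round(100 * valid / documents, 1), documents - valid


for vendor, (documents, valid) in runs.items():
    pct, rejected = summarize(documents, valid)
    print(f"{vendor}: {pct}% coverage, {rejected} rejected")
# Vendor A: 99.7% coverage, 14 rejected
# Vendor B: 97.1% coverage, 54 rejected
# Vendor C: 85.4% coverage, 88 rejected
```

Every rejected document is accounted for in this arithmetic, so the coverage number doubles as a work queue size.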

What the document does tell you

None of this is to say the document is unimportant. It’s the source of truth for the bytes you’re trying to extract. But it shouldn’t be the source of truth for the structure you’re extracting into. That structure exists in your business logic, your warehouse schema, your downstream API. Start there. Then ask whether the document can satisfy it.

If the answer is no, you’ve learned something more useful than any extraction metric: you’ve learned that your suppliers are sending you data your business can’t use. That’s a procurement problem, not a machine-learning problem.

Where Koji fits

Koji is the schema-first part of this, made open source. You write the schema in our DSL (or import it from JSON Schema), point it at a corpus, and it gives you an extractor with invariants enforced at the boundary: heuristic-first where the structure permits, model-assisted where it doesn’t. You can self-host it or use the hosted runner.

If any of this sounds useful, the project lives at getkoji.dev. It’s early. It’s also already running in production for three customers, which is more important than how polished the website is.

Documents in, structured data out. Everything else is implementation.
