Most extraction projects fail because teams treat the document as the artifact instead of the schema.
Most document-extraction projects don’t fail at the model. They fail at the schema — or rather, at the absence of one. The team treats the PDF as the artifact. The PDF is not the artifact. The structured record on the other side is.
I’ve been building Koji on the back of this observation. In engagement after engagement, a client handed us a folder of invoices and asked, in good faith, “can your model read these?” The answer was always yes, and the project was always a mess. The model could read them. The team couldn’t agree on what “read” meant.
So we changed the order of operations. We started writing the schema first, before opening a single document. The schema is the contract: it says what fields exist, what types they are, what’s nullable, what’s enumerated, and — most importantly — what the downstream system does with each one. Once the contract exists, the document is just one of several possible producers of records that satisfy it.
Concretely, this is what we ask a new Koji user to write before they touch a parser:
```
schema "invoice" {
  field invoice_no : string required
  field issued_on : date required
  field due_on : date nullable
  field currency : enum["USD","EUR","CAD","GBP"]
  field line_items : list<line_item> min=1
  field subtotal : money required
  field tax : money default=0
  field total : money required invariant total == subtotal + tax
}
```
Notice what isn’t here: no mention of the PDF, the layout, the vendor, or the model. The schema describes the world the data needs to live in, not the world it came from. The invariant at the bottom is a property the extractor must satisfy or the record is rejected — not flagged, not warned, rejected. This is the only way to keep downstream code honest.
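To make the "rejected, not warned" behavior concrete, here is a minimal Python sketch of the same contract. It is not Koji's implementation or API: the `Invoice` class, the `ContractViolation` exception, and the choice to treat line items as opaque values (the `line_item` type is not defined above) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional

# Mirrors the enum in the schema above.
ALLOWED_CURRENCIES = {"USD", "EUR", "CAD", "GBP"}


class ContractViolation(ValueError):
    """Hypothetical error type: raised when a record breaks the contract."""


@dataclass(frozen=True)
class Invoice:
    # Required fields come first; nullable/defaulted fields follow.
    invoice_no: str
    issued_on: date
    currency: str
    line_items: tuple  # assumption: line items are kept opaque in this sketch
    subtotal: Decimal
    total: Decimal
    due_on: Optional[date] = None   # nullable
    tax: Decimal = Decimal("0")     # default=0

    def __post_init__(self) -> None:
        # Every check raises, so an invalid record never exists as an object.
        if not self.invoice_no:
            raise ContractViolation("invoice_no is required")
        if self.currency not in ALLOWED_CURRENCIES:
            raise ContractViolation(f"currency {self.currency!r} is not in the enum")
        if len(self.line_items) < 1:
            raise ContractViolation("line_items needs at least one entry")
        if self.total != self.subtotal + self.tax:
            raise ContractViolation(
                f"invariant failed: {self.total} != {self.subtotal} + {self.tax}"
            )
```

Because validation runs in the constructor, downstream code that holds an `Invoice` can assume the invariant already held. Amounts use `Decimal` rather than floats so that the equality check in the invariant is exact.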
Three things change once the schema leads.
Here’s a small extraction run from a recent project — invoices from a logistics provider, three vendor formats, six months of mail. Numbers are real.
| Vendor | Documents | Valid records | Coverage | Median latency |
|---|---|---|---|---|
| Vendor A (clean PDFs) | 4,212 | 4,198 | 99.7% | 410 ms |
| Vendor B (scanned + OCR) | 1,847 | 1,793 | 97.1% | 1.2 s |
| Vendor C (faxed, terrible) | 602 | 514 | 85.4% | 2.1 s |
The 88 documents from Vendor C that failed validation weren’t silently mis-extracted. They were rejected at the contract boundary and routed to a human queue. This is the entire game. An extractor that produces 100 records, 90 of which are subtly wrong, is worse than an extractor that produces 85 correct records and 15 explicit failures. The first poisons your warehouse. The second tells you exactly where the problem is.
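The routing described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Koji's runner: `run`, `RunResult`, `check_totals`, and `ContractViolation` are hypothetical names, and candidates are assumed to arrive as plain dicts with string amounts.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Callable, Iterable, Tuple


class ContractViolation(ValueError):
    """Hypothetical error type: the candidate record breaks the contract."""


def check_totals(candidate: dict) -> dict:
    """Enforce total == subtotal + tax; return the candidate if it holds."""
    subtotal = Decimal(candidate["subtotal"])
    tax = Decimal(candidate.get("tax", "0"))
    total = Decimal(candidate["total"])
    if total != subtotal + tax:
        raise ContractViolation(f"{total} != {subtotal} + {tax}")
    return candidate


@dataclass
class RunResult:
    valid: list = field(default_factory=list)
    human_queue: list = field(default_factory=list)

    @property
    def coverage(self) -> float:
        """Share of documents that produced a valid record."""
        seen = len(self.valid) + len(self.human_queue)
        return len(self.valid) / seen if seen else 0.0


def run(
    candidates: Iterable[Tuple[str, dict]],
    validate: Callable[[dict], dict],
) -> RunResult:
    """Validate each candidate; rejects go to the human queue with a reason."""
    result = RunResult()
    for doc_id, candidate in candidates:
        try:
            result.valid.append(validate(candidate))
        except ContractViolation as exc:
            # Explicit failure: nothing invalid is written downstream.
            result.human_queue.append((doc_id, str(exc)))
    return result
```

Coverage here is valid records divided by documents seen, which is how the table's figures work out: for Vendor C, 514 / 602 is 85.4%, with the remaining 88 documents sitting in the queue along with the reason each was rejected.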
None of this is to say the document is unimportant. It’s the source of truth for the bytes you’re trying to extract. But it shouldn’t be the source of truth for the structure you’re extracting into. That structure exists in your business logic, your warehouse schema, your downstream API. Start there. Then ask whether the document can satisfy it.
If the answer is no, you’ve learned something more useful than any extraction metric: you’ve learned that your suppliers are sending you data your business can’t use. That’s a procurement problem, not a machine-learning problem.
Koji is the schema-first part of this, made open source. You write the schema in our DSL (or import it from JSON Schema), point it at a corpus, and it gives you an extractor — heuristic-first where the structure permits, model-assisted where it doesn’t — with invariants enforced at the boundary. You self-host it or use the hosted runner.
If any of this sounds useful, the project lives at getkoji.dev. It’s early. It’s also already running in production for three customers, which is more important than how polished the website is.
Documents in, structured data out. Everything else is implementation.