The document extraction pipeline: we tried everything

At Superkey we process insurance documents at volume: submissions, policies, certificates, endorsements. Every one of them is a PDF that needs to become structured data in our system. Over the past year we’ve gone through three different extraction approaches, each one fixing the previous one’s problems and introducing new ones.

This is the honest timeline.

Attempt 1: AWS Textract with custom orchestration

The first version was straightforward. Documents come in, we send them to AWS Textract for OCR, parse the response, and run a series of extraction rules against the text.

The pipeline looked like:

S3 upload → Lambda → Textract → Parse response → Rule engine → Database

What worked: Textract’s OCR is good. For clean, digital PDFs it reliably produces text. The AWS integration was familiar territory for the team.

What didn’t:

Textract gives you text and bounding boxes. It doesn’t give you structure. A table that’s obvious to a human — carrier name in column A, policy number in column B — comes back as a flat stream of text blocks with coordinates. Reconstructing the table structure from coordinates is its own engineering project, and we were getting it wrong on roughly 15% of documents.
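To make the coordinate problem concrete, here is a minimal sketch of the naive approach: cluster text blocks into rows by vertical position, then sort each row left to right. The Block class is a simplified stand-in rather than the Textract response schema, and the tolerance value and sample data are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Simplified stand-in for one OCR text block (not the Textract schema)."""
    text: str
    left: float    # left edge as a fraction of page width
    top: float     # top edge as a fraction of page height
    height: float  # block height as a fraction of page height


def group_into_rows(blocks, tolerance=0.5):
    """Cluster blocks into rows by vertical center, then order each row by x."""
    rows = []
    for block in sorted(blocks, key=lambda b: b.top + b.height / 2):
        center = block.top + block.height / 2
        # Join the current row if the center is close to that row's first block.
        if rows and abs(center - rows[-1]["center"]) <= tolerance * block.height:
            rows[-1]["blocks"].append(block)
        else:
            rows.append({"center": center, "blocks": [block]})
    return [
        [b.text for b in sorted(row["blocks"], key=lambda b: b.left)]
        for row in rows
    ]


# Illustrative data: blocks arrive in arbitrary order with slight vertical jitter.
blocks = [
    Block("POL-123", 0.55, 0.201, 0.02),
    Block("Carrier", 0.10, 0.100, 0.02),
    Block("Policy number", 0.55, 0.102, 0.02),
    Block("Acme Mutual", 0.10, 0.200, 0.02),
]
rows = group_into_rows(blocks)
print(rows)  # [['Carrier', 'Policy number'], ['Acme Mutual', 'POL-123']]
```

This works on a clean two-column table. Skewed scans, wrapped cell text, and merged cells all break the single-tolerance assumption, which is roughly where the engineering project starts.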

The cost was higher than expected. At volume, Textract’s per-page pricing adds up. We were spending more on OCR than on the compute running the rest of the pipeline.

But the real problem was the orchestration. Every failure mode needed its own handling: Textract timeouts, malformed PDFs, multi-page documents that needed to be processed in order, failed calls that needed to be retried. We built a Lambda-based state machine to manage this, and maintaining it became a part-time job. We were spending more time on plumbing than on extraction quality.
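As one small example of that plumbing, a retry wrapper with exponential backoff might look like the sketch below. TransientError, call_with_retry, and the delay values are hypothetical names and numbers chosen for illustration; this is not our actual state machine, and it covers only one of the failure modes listed above.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def call_with_retry(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


# Usage: a stand-in call that times out twice, then succeeds.
calls = {"count": 0}


def flaky_ocr_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("timeout")
    return "ok"


delays = []  # record delays instead of actually sleeping
result = call_with_retry(flaky_ocr_call, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Multiply this by ordering guarantees, malformed inputs, and partial results, and the part-time job becomes easy to picture.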

Attempt 2: Google Document AI with RAG

We switched to Google Document AI hoping for better structural understanding — it has native table detection, form parsing, and entity extraction. We also added a RAG layer: embed the parsed document chunks, retrieve the most relevant ones per field, and use an LLM to extract from the retrieved context.

GCS upload → Document AI → Chunk + embed → Vector store → RAG retrieve → LLM extract → Database

What worked: Document AI’s form parser was a genuine improvement over raw Textract for structured documents. Tables came back as tables. Key-value pairs came back as key-value pairs. The RAG layer meant we weren’t sending entire documents to the LLM.

What didn’t:

Performance was inconsistent. Document AI worked well on documents that looked like the ones it was trained on (standard forms, typed text) and poorly on anything unusual (handwritten notes, faded scans, documents with mixed layouts). We had no way to predict which documents would fail.

The RAG retrieval was the wrong abstraction. Cosine similarity finds chunks that are semantically close to your query — but “semantically close” and “contains the answer” aren’t the same thing. The chunk containing the actual dollar amount for a coverage limit might be titled “Schedule A” and contain nothing but numbers in a table. The semantic embedding doesn’t know that’s where the answer lives.
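A toy illustration of that mismatch follows. It uses word-count vectors as a crude stand-in for learned embeddings, so the numbers will not match what a real embedding model produces, and the chunk texts are invented for the example. The ranking problem is the same in kind: the chunk that talks about the limit outranks the chunk that contains it.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Word-count vector; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


query = bag_of_words("general liability coverage limit")

# Illustrative chunks: one describes the limit, the other holds the numbers.
chunks = {
    "overview": "This policy provides general liability coverage "
                "subject to the limit shown in the schedule.",
    "schedule_a": "Schedule A 1,000,000 2,000,000 5,000",
}

scores = {name: cosine(query, bag_of_words(text)) for name, text in chunks.items()}
best = max(scores, key=scores.get)
print(best, scores)  # "overview" ranks first; "schedule_a" scores 0.0
```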

Cost improved slightly but was still significant. Document AI charges per page, the vector store has hosting costs, and the LLM calls for extraction added up. Total cost per document was running $0.15 to $0.40 depending on page count and complexity.

The biggest gap: no way to measure accuracy systematically. We could spot-check individual documents, but we had no corpus, no expected outputs, no way to run a regression test. Every pipeline change was a prayer — did we make it better or worse? We’d find out when a customer complained.
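For reference, the missing piece is not complicated in principle. A minimal sketch of a regression check, assuming a hand-labeled corpus that maps document IDs to expected field values (the function name, field names, and sample values here are all hypothetical):

```python
def field_accuracy(expected_by_doc, extract):
    """Return per-field accuracy of `extract` over a labeled corpus.

    expected_by_doc maps a document ID to its expected {field: value} dict;
    extract is any callable that takes a document ID and returns a dict.
    """
    correct, total = {}, {}
    for doc_id, expected in expected_by_doc.items():
        actual = extract(doc_id)
        for field, want in expected.items():
            total[field] = total.get(field, 0) + 1
            if actual.get(field) == want:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / total[field] for field in total}


# Illustrative corpus and a stand-in extractor that gets one field wrong.
expected_by_doc = {
    "doc-1": {"carrier": "Acme Mutual", "policy_number": "POL-123"},
    "doc-2": {"carrier": "Beta Insurance", "policy_number": "POL-456"},
}
fake_outputs = {
    "doc-1": {"carrier": "Acme Mutual", "policy_number": "POL-123"},
    "doc-2": {"carrier": "Beta Insurance", "policy_number": "POL-465"},
}

report = field_accuracy(expected_by_doc, fake_outputs.get)
print(report)  # {'carrier': 1.0, 'policy_number': 0.5}
```

The hard part is building and maintaining the labeled corpus, not the scoring loop. Exact string equality is also a simplification; real fields such as dates and dollar amounts would likely need normalization before comparison.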

Attempt 3: LlamaIndex

LlamaIndex promised to wrap up the orchestration problems. Instead of DIY-ing the pipeline from OCR to embedding to retrieval to extraction, LlamaIndex provides abstractions for all of it. The workflow is clean: parse the document, split it into sections, extract fields from each section.

Document → Parse → Split → Extract → Structured output

What worked: The developer experience was immediately better. The parse-split-extract pipeline is a sensible decomposition of the problem, and LlamaIndex provides the building blocks for each stage. We stopped building custom Lambda state machines and started writing Python pipelines that were readable and modifiable.

The abstraction layer meant we could swap components without rewriting the pipeline. Try a different parser? Swap the module. Try a different LLM for extraction? Change the provider. The pipeline is still code you write and maintain, but it’s code that composes well.
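The shape of that composability can be sketched in plain Python. This is deliberately not LlamaIndex code: the Pipeline class and the three stand-in stages below are hypothetical, and only illustrate why a parse-split-extract decomposition makes swapping a single stage cheap.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass
class Pipeline:
    """Three pluggable stages; any one can be swapped without touching the rest."""
    parse: Callable[[bytes], str]
    split: Callable[[str], List[str]]
    extract: Callable[[str], Dict[str, str]]

    def run(self, document: bytes) -> Dict[str, str]:
        fields: Dict[str, str] = {}
        for section in self.split(self.parse(document)):
            fields.update(self.extract(section))
        return fields


# Stand-in stages for illustration only.
def parse_text(data: bytes) -> str:
    return data.decode("utf-8")


def split_on_blank_lines(text: str) -> List[str]:
    return [s for s in text.split("\n\n") if s.strip()]


def extract_key_values(section: str) -> Dict[str, str]:
    fields = {}
    for line in section.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # keep only lines that actually contain "key: value"
            fields[key.strip().lower()] = value.strip()
    return fields


pipeline = Pipeline(parse_text, split_on_blank_lines, extract_key_values)
result = pipeline.run(b"Carrier: Acme Mutual\n\nPolicy: POL-123")

# Swap only the extraction stage; parse and split are reused unchanged.
shouting = replace(
    pipeline,
    extract=lambda s: {k: v.upper() for k, v in extract_key_values(s).items()},
)
swapped = shouting.run(b"Carrier: Acme Mutual\n\nPolicy: POL-123")
print(result, swapped)
```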

What didn’t:

The abstraction that makes experimentation easy also makes production hard. LlamaIndex is designed for exploration — notebooks, prototypes, demos. When we tried to run it in production, we hit problems that the library doesn’t solve:

No built-in accuracy measurement. We still had no way to benchmark extraction quality across a corpus. LlamaIndex will happily extract data from every document you give it. Whether the extraction is correct is your problem.

No failure notification. When an extraction fails — the LLM returns garbage, the chunking produces empty results, the API times out — LlamaIndex doesn’t alert you. It returns whatever it got. We built monitoring around it, but that’s exactly the orchestration work we were trying to avoid.

No backtesting. When we changed a prompt or a chunking strategy, we had no way to run the new version against historical documents and compare results. Every change was deployed to production and evaluated by watching error rates. This is not how you operate a system that processes financial documents.

Deployment complexity. LlamaIndex pushes toward containerized deployment — agents as services. For a team our size, managing container registries, orchestration, health checks, and networking between services was overhead we didn’t need. The documents are PDFs. The extraction is an API call. It shouldn’t require Kubernetes.
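On the failure-notification point above, the kind of guard we ended up wrapping around extraction calls looks roughly like this sketch. The required field names and the ExtractionError class are hypothetical, and a real version would route the error to alerting rather than just raising.

```python
class ExtractionError(Exception):
    """Raised when an extraction result is unusable, instead of passing it on."""


# Hypothetical set of fields every document must yield.
REQUIRED_FIELDS = ("carrier", "policy_number")


def validate(doc_id, fields):
    """Return fields unchanged, or raise if any required field is missing or blank."""
    problems = [
        f"missing {name}"
        for name in REQUIRED_FIELDS
        if not str(fields.get(name) or "").strip()
    ]
    if problems:
        raise ExtractionError(f"{doc_id}: " + ", ".join(problems))
    return fields


# Usage: a complete result passes through; a blank field fails loudly.
good = validate("doc-1", {"carrier": "Acme Mutual", "policy_number": "POL-123"})

try:
    validate("doc-2", {"carrier": "Acme Mutual", "policy_number": " "})
    message = None
except ExtractionError as err:
    message = str(err)

print(message)  # doc-2: missing policy_number
```

This catches empty and missing values only. It does nothing for a value that is present, well-formed, and wrong.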

Where we’re landing

We’re going with LlamaIndex. It’s not perfect — the gaps I listed above are real and they’ll bite us eventually. But it’s the best option available right now. The developer experience is good, the abstraction layer saves us from the orchestration hell of attempts 1 and 2, and we can swap models without rewriting the pipeline.

The things I’m still worried about:

No accuracy measurement. We still don’t have a way to benchmark extraction quality across our full document set. We’re flying on spot checks and customer complaints. This is going to hurt us.
No backtesting. Every prompt change, every chunking tweak, every model swap is a production experiment. I’d kill for a corpus of expected outputs I could run against before deploying.
Operational blindness. When extraction fails silently — returns plausible but wrong data — we have no way to catch it systematically. The failure mode isn’t an error message, it’s a wrong number in a database that nobody notices for weeks.
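To be concrete about the backtesting wish: given a corpus of expected outputs, comparing a candidate pipeline against the current one before deploying is a short loop. This is a sketch under that assumption; the backtest function, document IDs, and field values are hypothetical, and we do not have such a corpus today.

```python
def backtest(expected_by_doc, baseline, candidate):
    """Compare two extractors on a labeled corpus, field by field.

    Returns the (doc_id, field) pairs the candidate breaks (regressions)
    and the pairs it newly gets right (fixes).
    """
    regressions, fixes = [], []
    for doc_id, expected in expected_by_doc.items():
        old, new = baseline(doc_id), candidate(doc_id)
        for field, want in expected.items():
            was_right = old.get(field) == want
            now_right = new.get(field) == want
            if was_right and not now_right:
                regressions.append((doc_id, field))
            elif now_right and not was_right:
                fixes.append((doc_id, field))
    return {"regressions": regressions, "fixes": fixes}


# Illustrative data: the candidate fixes doc-2 but breaks doc-1.
expected = {"doc-1": {"limit": "1000000"}, "doc-2": {"limit": "2000000"}}
baseline_out = {"doc-1": {"limit": "1000000"}, "doc-2": {"limit": "200000"}}
candidate_out = {"doc-1": {"limit": "100000"}, "doc-2": {"limit": "2000000"}}

outcome = backtest(expected, baseline_out.get, candidate_out.get)
print(outcome)
# {'regressions': [('doc-1', 'limit')], 'fixes': [('doc-2', 'limit')]}
```

A report like this turns "did we make it better or worse?" into a list of specific documents to look at, which would also surface some of the plausible-but-wrong values before a customer does.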

Maybe these are solvable within LlamaIndex’s ecosystem. Maybe we’ll outgrow it and need something purpose-built. For now, it’s the best tool for the job, and “best available” is usually the right call when you’re trying to ship product.

I’ll write an update in a few months on how it’s going.

Frank Thomas is CTO at Superkey Insurance.