frank thomas
All posts

Rate Limits, Retries, and the Hidden Accuracy Killer in LLM Pipelines

We spent weeks investigating a 6% accuracy variance in our document extraction benchmarks. The root cause wasn't the model, the prompts, or the routing.

We spent weeks investigating a 6% accuracy variance in our document extraction benchmarks. The root cause wasn’t the model, the prompts, or the routing. It was silent HTTP 429 errors from OpenAI that our pipeline treated as “field not found.”

This is the story of how we found it, why it matters, and what we changed.


The mystery

Koji extracts structured data from documents — insurance policies, SEC filings, invoices, medical claims. We have a corpus of 653 documents with ground-truth expected outputs, and we benchmark every engine change against it before merging.

The numbers wouldn’t stabilize:

RunAccuracy
194.5%
289.8%
388.8%
491.6%

Same 653 documents. Same model (gpt-4o-mini). Same schemas. Same code. Six percentage points of variance between runs.

Our first theory was LLM non-determinism. Even at temperature: 0, OpenAI’s API isn’t fully deterministic — they’ve acknowledged this publicly. We figured the model was occasionally missing fields on borderline cases, and the variance was just the random distribution of those misses across runs.

We were wrong.

The investigation

We started by running the failing documents individually. A 10-Q filing from KB Home that returned all nulls in the benchmark? Extracted perfectly when run alone. Every field found, correct values, medium-to-high confidence.

We ran it again. Perfect. Again. Perfect.

Then we embedded it back in the full 653-document benchmark. It failed.

We added same-chunk retry logic — when a required field comes back null, re-run extraction on the identical chunks before trying anything else. The idea was that if LLM non-determinism was the cause, a second attempt with the same input would often succeed.

It helped. Marginally. The accuracy variance shrank from 6 points to about 4. But the pattern was wrong for non-determinism: failures clustered in time, not by document. The middle of a benchmark run had more failures than the start or end. Random non-determinism wouldn’t do that.

The cause

We checked the Docker logs of the extract service.

[koji-extract] Group ['period_fiscal_year_end', 'period_date_of_report'] error:
  Client error '429 Too Many Requests' for url
  'https://api.openai.com/v1/chat/completions'

[koji-extract] Group ['filing_date'] error:
  Client error '429 Too Many Requests'

[koji-extract] Gap fill for filing_date error:
  Client error '429 Too Many Requests'

Rate limiting. Running 653 documents sequentially fires hundreds of LLM calls. The API has per-minute rate limits. Once you exceed them, requests get rejected with HTTP 429. Our extraction pipeline had no retry logic — a 429 was treated as a fatal error, and the extraction returned an empty result.

An empty result looks exactly like “the field isn’t in the document.” The pipeline records it as a null extraction with not_found confidence. The benchmark counts it as a field failure. Nothing in the output distinguishes “the LLM couldn’t find this field” from “the LLM was never asked because the API rejected the request.”

The clustering pattern made sense now. The benchmark starts with small documents (adversarial category, 11 docs) that process quickly. By the time it reaches the larger categories — insurance policies, SEC filings — it’s firing requests at a sustained rate that triggers the rate limiter. The middle of the run fails; the end recovers as the rate limit window resets.

The numbers

One benchmark run hit 355 rate-limit retries before we added the fix. That’s 355 LLM calls that silently failed and were recorded as extraction failures. On a corpus where the total field count is 3,961 — that’s nearly 9% of fields potentially affected by a single infrastructure issue.

The smoking gun was a run that reported 6.0% accuracy — 23 out of 383 fields on SEC filings. Nearly every document returned all nulls. The logs were wall-to-wall 429s. It wasn’t an extraction quality problem. It was an HTTP client problem.

The fix

The retry wrapper is small enough to show in full:

_RETRY_STATUS_CODES = {429, 502, 503, 529}
_MAX_RETRIES = 3
_BASE_DELAY = 2.0  # seconds

async def _retry_on_rate_limit(coro_factory, max_retries=_MAX_RETRIES):
    """Retry an async HTTP call with exponential backoff.

    coro_factory is a zero-arg callable that returns a fresh coroutine
    each time — coroutines can't be re-awaited, so we need a factory.
    """
    last_exc = None
    for attempt in range(max_retries + 1):
        try:
            resp = await coro_factory()
            if resp.status_code not in _RETRY_STATUS_CODES:
                return resp
            last_exc = httpx.HTTPStatusError(
                f"HTTP {resp.status_code}",
                request=resp.request, response=resp
            )
        except httpx.HTTPStatusError as e:
            if e.response.status_code not in _RETRY_STATUS_CODES:
                raise
            last_exc = e
        if attempt < max_retries:
            delay = _BASE_DELAY * (2 ** attempt)  # 2s, 4s, 8s
            await asyncio.sleep(delay)
    raise last_exc

The key detail is coro_factory — a callable that returns a new coroutine on each invocation. Python coroutines can only be awaited once, so you can’t retry by re-awaiting the same object. The factory pattern fixes this:

# Before: 429 is fatal
resp = await client.post(url, json=payload, headers=headers)
resp.raise_for_status()

# After: 429 retries with backoff
resp = await _retry_on_rate_limit(
    lambda: client.post(url, json=payload, headers=headers)
)
resp.raise_for_status()

We applied this to every outbound LLM call in the provider — generate() for extraction, chat() for tool-use patterns, and the section-map pass. One wrapper, applied at the HTTP layer, protecting every downstream consumer.

The status code set includes 529 (a non-standard code some providers use for overload) and 502/503 (transient infrastructure errors that resolve on retry). We don’t retry 500 — that’s usually a bug on the provider’s side that won’t fix itself.

The result on SEC filings:

BeforeAfter
93.7% (359/383 fields)99.2% (380/383 fields)

That 5.5% improvement isn’t better extraction. It’s the same extraction actually completing instead of being silently killed by the API.

A clean 20-document batch with no rate limiting: 100% accuracy. The extraction quality was always there. The infrastructure was hiding it.

What this means for LLM pipelines generally

If you’re building anything that makes LLM API calls at volume — extraction, classification, summarization, agents — you probably have this bug. Here’s what to check:

1. Never treat HTTP errors as application-level results.

An API error is not “the model couldn’t find the answer.” It’s “the model was never asked.” Your pipeline needs to distinguish between:

  • The LLM ran and returned null — genuine not-found, record as such
  • The LLM call failed — retry, and only record a result after the call succeeds or all retries exhaust

If your code catches exceptions from the HTTP client and returns an empty dict, you have this bug.

2. Your accuracy metrics are only as reliable as your infrastructure.

We were measuring our API client’s resilience, not our extraction quality. A benchmark that doesn’t retry on transient errors measures something — but not what you think it measures. If your accuracy numbers fluctuate between runs and you blame “LLM non-determinism,” check your logs first.

3. Log the HTTP layer, not just the application layer.

Our extraction logs showed “field not found” for the failing fields. Correct — the field wasn’t found, because the call that would have found it never completed. We only discovered the root cause by checking the HTTP response codes in the Docker logs. If we’d been running in a managed environment without access to those logs, we might still be chasing “non-determinism.”

4. Test at production throughput, not demo scale.

Our 20-document smoke tests never hit rate limits. The problem only manifested at 100+ documents. If your test suite runs 10 examples and your production runs 10,000, you’re not testing what matters. Rate limits, connection pools, memory pressure, timeouts — these are all scale-dependent. Your benchmark needs to run at a scale that triggers the same failure modes your production traffic will.

5. Retry is not optional.

Every LLM provider has rate limits. Every provider has transient errors (502, 503). Every provider has maintenance windows. If your pipeline treats these as permanent failures, your accuracy has a ceiling set by your provider’s reliability, not by your extraction quality.

Three retries with exponential backoff. It’s not clever. It took us embarrassingly long to add it. But it moved our accuracy more than any prompt optimization, routing improvement, or model upgrade we’d tried in weeks.


Frank Thomas is the founder of Koji, an open-source document extraction platform. The benchmarking corpus referenced in this post is public.

← Why Heuristic Routing Fails on Long Documents