
Building a Market-Data Pipeline: Caching, Rate Limits, and Gaps

Reliable backtests need reliable data, and pulling it live from an API on every run is slow, fragile, and costly. Here's how to build a local market-data pipeline that caches, respects rate limits, and handles gaps.

Owen
Engineer · Investor

Most beginner trading projects hit the same wall: the strategy code is fine, but the data layer is a mess of ad-hoc API calls that are slow, get rate-limited, silently miss days, and produce different results every run. A market-data pipeline — the unglamorous infrastructure that fetches, stores, and serves clean data — is what separates a reproducible research setup from a flaky notebook. For a developer, it’s a familiar engineering problem wearing a finance hat. None of this is investment advice.

Fetch once, read many: the caching layer

The first principle is simple: never pull the same historical data from an API twice. Historical bars rarely change once published (the main exceptions are split and dividend adjustments and occasional provider corrections), so fetching them live on every backtest run is slow, wasteful of your rate limit, and, on usage-priced providers, a literal cost. Fetch once, store locally, and read from local storage thereafter.

A practical setup downloads each symbol’s history into a local store — Parquet files or a local database like SQLite or DuckDB work well for this — keyed by symbol and date range. Your strategy code reads from the cache, not the network. When you need fresh data, you fetch only the incremental window since your last update and append it. This single change usually takes a backtest from minutes of waiting on API calls to seconds reading local files, and it makes runs reproducible because everyone’s reading the same stored data.
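The fetch-once, append-incrementally pattern can be sketched with nothing but the standard library. This is a minimal illustration, not a reference implementation: the `bars` table layout, the function names, and the `fetch_bars(symbol, start, end)` callable are all assumptions standing in for whatever your provider's client actually exposes, and only a closing price is stored to keep it short.

```python
import sqlite3
from datetime import date, timedelta


def open_cache(path=":memory:"):
    """Open (or create) a local bar cache keyed by symbol and day."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS bars (
               symbol TEXT NOT NULL,
               day    TEXT NOT NULL,
               close  REAL NOT NULL,
               PRIMARY KEY (symbol, day)
           )"""
    )
    return conn


def last_cached_day(conn, symbol):
    """Return the most recent cached day for a symbol, or None if empty."""
    row = conn.execute(
        "SELECT MAX(day) FROM bars WHERE symbol = ?", (symbol,)
    ).fetchone()
    return date.fromisoformat(row[0]) if row[0] else None


def update_symbol(conn, symbol, fetch_bars, start, end):
    """Fetch only the window missing from the cache and append it.

    fetch_bars is a hypothetical provider call that returns an iterable
    of (date, close) pairs. Returns the number of bars fetched.
    """
    last = last_cached_day(conn, symbol)
    if last is not None:
        start = max(start, last + timedelta(days=1))
    if start > end:
        return 0  # cache is already current; no network call at all
    rows = list(fetch_bars(symbol, start, end))
    with conn:  # one transaction per update
        conn.executemany(
            "INSERT OR IGNORE INTO bars VALUES (?, ?, ?)",
            [(symbol, day.isoformat(), close) for day, close in rows],
        )
    return len(rows)


def read_bars(conn, symbol):
    """What strategy code calls: a local read, never the network."""
    return conn.execute(
        "SELECT day, close FROM bars WHERE symbol = ? ORDER BY day", (symbol,)
    ).fetchall()
```

In this sketch a second `update_symbol` call over the same range makes no request, and a call with a later `end` requests only the days after the last cached one. Swapping SQLite for Parquet or DuckDB changes the storage calls but not the shape of the logic.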

Respecting rate limits without hating your life

Every data provider rate-limits you, and naive code that fires requests in a tight loop will get throttled or blocked. Handle this deliberately rather than by trial and error.

Batch your requests where the API supports it — many providers offer bulk endpoints that return many symbols or a wide date range in one call, which is dramatically more efficient than one request per symbol. Add exponential backoff with retry on the responses that signal throttling, so a temporary limit pauses you instead of failing the run. And pace your requests to stay comfortably under the published limit rather than racing up against it. Because you’re caching, all of this happens during ingestion, not during research — so the rate-limit dance never slows down your actual backtests.
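Backoff and pacing can be sketched as follows. Everything named here is an assumption for illustration: `Throttled` is a hypothetical exception your own client wrapper would raise when the provider signals throttling (commonly an HTTP 429 response), and the retry counts and delays are placeholders to tune against your provider's published limit.

```python
import random
import time


class Throttled(Exception):
    """Hypothetical: raised by your fetch wrapper on a throttling response."""


def fetch_with_backoff(fetch, *, max_retries=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call fetch(); on Throttled, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Throttled:
            if attempt == max_retries:
                raise  # give up and let the ingestion job fail loudly
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


class Pacer:
    """Spaces calls so they stay under max_per_second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self._next = 0.0  # earliest time the next call may start

    def wait(self):
        now = self.clock()
        if now < self._next:
            self.sleep(self._next - now)
            now = self._next
        self._next = now + self.min_interval
```

During ingestion you would call `pacer.wait()` before each request and wrap the request itself in `fetch_with_backoff`. The `sleep` and `clock` parameters exist only so the behavior can be tested without real waiting.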

Gaps, duplicates, and survivorship bias

Clean-looking data is rarely as clean as it looks, and three problems quietly wreck backtests.

Gaps and duplicates. Real feeds have missing bars (a holiday, an outage, a thin day) and sometimes duplicate or out-of-order records. Don’t assume your time series is complete and contiguous; validate it. Check that every session the exchange calendar expects is present, so a holiday isn’t mistaken for a gap and a gap isn’t waved through as a holiday. Drop or flag duplicates, and decide explicitly how to handle missing bars (forward-fill, skip, or error) rather than letting your strategy silently trade on a hole in the data.
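A validation pass along these lines might look like the sketch below. It makes two simplifying assumptions that you should replace in real use: expected sessions are approximated as weekdays minus a caller-supplied `holidays` set (a proper exchange calendar is the better source), and a duplicate is resolved by keeping the first record seen. The function and field names are illustrative.

```python
from datetime import date, timedelta


def expected_sessions(start, end, holidays=frozenset()):
    """Approximate trading days: weekdays in [start, end] minus holidays."""
    days = []
    d = start
    while d <= end:
        if d.weekday() < 5 and d not in holidays:
            days.append(d)
        d += timedelta(days=1)
    return days


def validate_bars(bars, start, end, holidays=frozenset()):
    """Check (day, close) records for gaps, duplicates, and ordering.

    Returns (clean, report): clean is sorted and de-duplicated (first
    record wins); report lists what was wrong so the caller can decide
    whether to forward-fill, skip, or raise.
    """
    seen = {}
    duplicates = []
    out_of_order = False
    prev = None
    for day, close in bars:
        if prev is not None and day < prev:
            out_of_order = True
        prev = day
        if day in seen:
            duplicates.append(day)
        else:
            seen[day] = close
    missing = [d for d in expected_sessions(start, end, holidays)
               if d not in seen]
    report = {
        "missing": missing,
        "duplicates": duplicates,
        "out_of_order": out_of_order,
    }
    return sorted(seen.items()), report
```

Running this at ingestion time, before anything is written to the cache, keeps the handling policy in one place. The strategy code then never has to guess whether a hole is a holiday or a feed problem.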

Survivorship bias. This is the one that flatters every naive backtest. If your universe is “stocks that exist today,” you’ve excluded every company that went bankrupt, got delisted, or was acquired — the losers. Backtesting only on survivors makes almost any strategy look profitable, because you removed the failures in advance. A serious pipeline includes delisted securities and point-in-time universe membership, so your backtest sees the same companies you’d actually have been able to trade back then.
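Point-in-time membership is easier to see in code. The sketch below assumes you can source a table of membership intervals, including delisted names, from your data provider. The `Membership` record and the tickers in the usage note are made up for illustration.

```python
from datetime import date
from typing import NamedTuple, Optional


class Membership(NamedTuple):
    symbol: str
    added: date               # first day the symbol was tradable in the universe
    removed: Optional[date]   # first day it was no longer a member; None if still in


def universe_on(memberships, as_of):
    """Symbols that were members on as_of, using only what was true that day."""
    return sorted({
        m.symbol
        for m in memberships
        if m.added <= as_of and (m.removed is None or as_of < m.removed)
    })
```

With hypothetical records for `AAA` (member since 2010), `BBB` (2010 until a 2015 delisting), and `CCC` (added 2018), `universe_on(..., date(2012, 6, 1))` includes `BBB` and excludes `CCC`. A backtest that calls this on each rebalance date, rather than using today's constituent list, trades the names that were actually available.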

The throughline is that your data layer deserves real engineering. A strategy is only as trustworthy as the data underneath it, and most “amazing” backtests are really just measurements of a flawed data pipeline. Build ingestion as a proper job, validate what you store, include the companies that failed, and your research rests on something solid instead of something that merely looks solid.

FAQ

What should I use to store market data locally?
For a single developer, columnar files like Parquet or an embedded database handle historical bars well: DuckDB if you want analytical, column-oriented queries, SQLite if you want a simple general-purpose store. Both are fast to read, simple to manage, and need no server. The key is the pattern: fetch once, store locally, and have your strategy read from the store rather than the API.
How do I avoid getting rate-limited?
Use bulk endpoints where available, add exponential backoff and retries on throttling responses, and pace requests below the published limit. Because you cache historical data, this only happens during scheduled ingestion — not during backtests — so it never slows your research once the data is stored.
What is survivorship bias and how do I fix it?
It's the distortion from testing only on companies that still exist, silently excluding those that failed or were delisted — which makes strategies look better than reality. Fix it by sourcing data that includes delisted securities and point-in-time universe membership, so your backtest trades the same names you actually could have at the time.

A market-data pipeline isn’t the exciting part of a trading project, but it’s the part that determines whether anything built on top of it can be trusted. Cache aggressively, respect rate limits during ingestion, validate for gaps, and include the companies that died — and your backtests will finally be measuring your strategy instead of your data’s flaws.
