Building a Stock Screener: From Data Source to a Ranked List
A pipeline-focused engineering guide to building a stock screener: universe selection, data sources, ranking metrics, storage, and the pitfalls that quietly break each stage.
You want to build a stock screener. Not because someone told you it would print money, but because it is the kind of tool that makes you think harder about the data and the mechanics behind it. The screener itself is a pipeline — a series of stages where bugs compound quietly. Get the plumbing right, and the output is a ranked list of candidates you can actually reason about. Get it wrong, and you are sorting noise through a broken funnel while believing the results mean something.
This is a guide to building that pipeline well. The question of whether any particular screen generates alpha is separate, and one this article deliberately does not answer.
Stage 1: Define Your Universe — and Respect the Survivorship Trap
The first decision is which stocks you are screening. The obvious starting point is a major index — S&P 500, Russell 2000, MSCI World. Index constituents are publicly listed, widely covered, and give you a coherent starting population.
The trap is what you do next. If you pull the current list of constituents, use that as your universe, and then apply it to historical data, you have already introduced survivorship bias. The companies in the S&P 500 today are not the companies that were in the S&P 500 five years ago. The ones that dropped off often dropped off for bad reasons — bankruptcy, delisting, permanent impairment. If your historical screen never touches those names, you are testing on a population that has already survived, and your backtest will look better than it should.
The correct approach is to use a point-in-time universe: the actual constituents on each historical date, not the current list projected backwards. This data is harder to get. Historical index membership files exist but are not always free. If you are building a serious backtest, this matters. If you are building a forward-looking screen for current candidates, the survivorship problem is less acute — but you should still be aware that a screen built only on large-cap survivors will have structural biases baked in from the start.
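A point-in-time lookup can be sketched as below, assuming you have obtained a membership history with add and removal dates. The `MEMBERSHIP` table, its tickers, and the `constituents_on` helper are hypothetical and only illustrate the idea; real index membership files will have their own layout.

```python
from datetime import date

# Hypothetical membership history: (ticker, date_added, date_removed or None).
MEMBERSHIP = [
    ("AAA", date(2015, 1, 2), None),
    ("BBB", date(2016, 6, 1), date(2020, 3, 20)),  # removed later
    ("CCC", date(2021, 9, 1), None),               # added later
]


def constituents_on(as_of, membership=MEMBERSHIP):
    """Return tickers that were members on `as_of`.

    The add date is treated as inclusive and the removal date as exclusive.
    """
    return sorted(
        ticker
        for ticker, added, removed in membership
        if added <= as_of and (removed is None or as_of < removed)
    )
```

Calling `constituents_on(date(2019, 1, 1))` includes the later-removed name and excludes the one that had not yet been added, which is exactly what a backtest on that date should see.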
A reasonable starting universe for a side project: all US-listed common stocks above some liquidity threshold (average daily volume or market cap), sourced from a provider that flags security type and exchange. You want to exclude ETFs, preferred shares, warrants, and foreign-domiciled ADRs if your metrics assume US-GAAP financials.
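The filter described above might look like the following sketch. The record fields (`type`, `is_adr`, `avg_dollar_volume`) and the threshold value are assumptions for illustration, not any provider's actual schema.

```python
# Hypothetical security records; field names are illustrative assumptions.
SECURITIES = [
    {"ticker": "AAA", "type": "common", "is_adr": False, "avg_dollar_volume": 5_000_000},
    {"ticker": "ETF1", "type": "etf", "is_adr": False, "avg_dollar_volume": 90_000_000},
    {"ticker": "PREF", "type": "preferred", "is_adr": False, "avg_dollar_volume": 2_000_000},
    {"ticker": "ADR1", "type": "common", "is_adr": True, "avg_dollar_volume": 8_000_000},
    {"ticker": "TINY", "type": "common", "is_adr": False, "avg_dollar_volume": 40_000},
]


def in_universe(sec, min_dollar_volume=1_000_000):
    """Keep liquid common stocks; drop ETFs, preferreds, warrants, and ADRs."""
    return (
        sec["type"] == "common"
        and not sec["is_adr"]
        and sec["avg_dollar_volume"] >= min_dollar_volume
    )


universe = [s["ticker"] for s in SECURITIES if in_universe(s)]
```

Allow-listing `common` rather than block-listing the unwanted types means an unfamiliar security type is excluded by default instead of slipping through.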
Stage 2: Data Sources — Fundamentals and Prices Are Separate Problems
Fundamentals (earnings, book value, free cash flow, debt) and prices (daily open/high/low/close, volume) come from different places and have different update cadences. Treat them as separate pipelines.
Fundamentals: SEC EDGAR
For US companies, the SEC’s EDGAR system is the authoritative source for financial filings. The structured company-facts API (data.sec.gov/api/xbrl/companyfacts/CIK{CIK}.json, where {CIK} is the company’s Central Index Key zero-padded to 10 digits) returns XBRL-tagged financial data directly from 10-K and 10-Q filings in JSON format, with no API key required. The data is filed by the companies themselves, with no third-party transformation layer, and covers thousands of tickers going back many years.
The practical constraints: the data is as fresh as the most recent filing, which for annual figures means you may be working with numbers that are 3-12 months old. Not every company tags every concept consistently; XBRL tagging quality varies, so you will encounter gaps and mismatched concept names across filers. Rate limits on EDGAR are moderate — the SEC asks for a max of 10 requests per second and requires a descriptive User-Agent header. For a screener that refreshes daily, this is not a problem in practice.
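A minimal sketch of a polite EDGAR client, using only the standard library. The `USER_AGENT` string is a placeholder you would replace with your own contact details, and `Throttle`, `build_request`, and `fetch_company_facts` are hypothetical helper names. The URL pads the CIK to ten digits after a `CIK` prefix.

```python
import json
import time
import urllib.request

BASE_URL = "https://data.sec.gov/api/xbrl/companyfacts/CIK{cik:010d}.json"
USER_AGENT = "example-screener admin@example.com"  # placeholder: use your own


def build_request(cik: int) -> urllib.request.Request:
    """Build a company-facts request with a descriptive User-Agent header."""
    return urllib.request.Request(
        BASE_URL.format(cik=cik), headers={"User-Agent": USER_AGENT}
    )


class Throttle:
    """Space calls at least 1/max_per_second apart."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_company_facts(cik: int, throttle: Throttle) -> dict:
    """Fetch and parse one company's facts (performs a network call)."""
    throttle.wait()
    with urllib.request.urlopen(build_request(cik), timeout=30) as resp:
        return json.load(resp)
```

Sharing one `Throttle` across every EDGAR call in the job keeps the whole pipeline under the limit, not just each function in isolation.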
Pickuma has a separate post covering the SEC EDGAR API in detail, including how to traverse the submissions endpoint and handle the XBRL concept taxonomy.
Prices: Market Data APIs
Price data is a different market. Several providers offer free tiers — Polygon.io, Alpha Vantage, Yahoo Finance (unofficial), and others — with varying coverage, rate limits, and data quality. For a screen that needs end-of-day closes for several thousand US tickers, you need a free tier that actually covers that. A tier limited to 25 requests per day does not.
Key questions when evaluating a price API: Does the data adjust for splits and dividends, or do you have to handle that yourself? Does the free tier include the full US universe, or only a subset? What is the historical depth? Pickuma has a separate comparison of price-data APIs for exactly this evaluation.
Store prices locally after the first fetch. A daily job that re-fetches two years of history for 5,000 tickers is wasteful and will hit rate limits. Fetch incrementally: append new closes, keep the history.
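One way to do the incremental append is a local SQLite table keyed on ticker and date. This is a sketch under assumptions: the `prices` table layout and the helper names are hypothetical, dates are stored as ISO strings so they sort correctly as text, and the bars are whatever your price API returned.

```python
import sqlite3


def init_price_store(conn):
    """Create the local price table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS prices (
               ticker TEXT NOT NULL,
               date   TEXT NOT NULL,   -- ISO date, e.g. 2024-01-02
               close  REAL NOT NULL,
               PRIMARY KEY (ticker, date)
           )"""
    )


def last_stored_date(conn, ticker):
    """Return the latest stored ISO date for a ticker, or None if empty."""
    row = conn.execute(
        "SELECT MAX(date) FROM prices WHERE ticker = ?", (ticker,)
    ).fetchone()
    return row[0]


def append_closes(conn, ticker, bars):
    """Insert only bars newer than what is stored; return how many were new.

    `bars` is an iterable of (iso_date, close) tuples.
    """
    last = last_stored_date(conn, ticker)
    new_rows = [(ticker, d, c) for d, c in bars if last is None or d > last]
    conn.executemany(
        "INSERT OR IGNORE INTO prices (ticker, date, close) VALUES (?, ?, ?)",
        new_rows,
    )
    conn.commit()
    return len(new_rows)
```

In a daily job you would call `last_stored_date` first and request only the dates after it from the API, so the request itself shrinks along with the write.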
Point-in-Time Correctness
When you compute a ratio using last quarter’s earnings and today’s price, the earnings number was available to the market on the filing date — not the period end date. A company with a December fiscal year end typically files in mid-February. If you are simulating a historical screen, using December 31st earnings as if they were available on January 1st is look-ahead bias. Use the actual filing date as the availability date.
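The availability rule can be sketched as a small as-of lookup. The `FACTS` list is hypothetical sample data shaped like the `end`, `filed`, and `val` fields of EDGAR company-facts entries, and `value_as_of` is an illustrative helper, not part of any library.

```python
from datetime import date

# Hypothetical entries shaped like EDGAR company-facts units.
FACTS = [
    {"end": "2023-09-30", "filed": "2023-11-03", "val": 100},
    {"end": "2023-12-31", "filed": "2024-02-14", "val": 120},
]


def value_as_of(facts, as_of):
    """Return the latest-period value that had been filed on or before `as_of`.

    Returns None when nothing had been filed yet.
    """
    known = [f for f in facts if date.fromisoformat(f["filed"]) <= as_of]
    if not known:
        return None
    # Prefer the most recent period; on ties, the most recently filed entry.
    latest = max(known, key=lambda f: (f["end"], f["filed"]))
    return latest["val"]
```

A simulated screen run on 2024-01-15 therefore still sees the September figure, even though the December period had already ended.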
```python
# Pseudocode: get the filing date, not the period end
for filing in company_facts["facts"]["us-gaap"]["NetIncomeLoss"]["units"]["USD"]:
    period_end = filing["end"]    # e.g. 2023-12-31
    filed_on = filing["filed"]    # e.g. 2024-02-14  <- use this
    value = filing["val"]
```

Stage 3: Define and Compute Your Ranking Metric
A screener needs a ranking rule. The simplest useful approach is a ratio — price-to-earnings, EV/EBIT, return on capital, debt-to-equity — computed for each ticker and then sorted.
A more sophisticated approach, made famous by Joel Greenblatt’s “magic formula” framework, combines multiple ranks: rank all stocks by one metric, rank all stocks by a second metric, add the ranks, sort by the combined score. This has the property of not over-weighting any single extreme value. Whether any particular combination of metrics generates persistent returns is an empirical question that this article is not going to answer — the “magic formula” is a useful pedagogical example of the rank-and-combine pattern, not an endorsement.
```python
import pandas as pd

# df has columns: ticker, ev_ebit, roc
df["rank_ev_ebit"] = df["ev_ebit"].rank(ascending=True)   # lower is better
df["rank_roc"] = df["roc"].rank(ascending=False)          # higher is better
df["combined"] = df["rank_ev_ebit"] + df["rank_roc"]
df_sorted = df.sort_values("combined").reset_index(drop=True)
```

Watch for missing data in your ranking. A ticker with no EBIT filed yet (a recent IPO, or a company that had a delayed filing) should be excluded from the screen rather than ranked as if its ratio were zero. Propagating nulls silently is one of the more common bugs in screener pipelines.
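A sketch of making the exclusion explicit before ranking, using hypothetical sample data. The positive-EV/EBIT filter is an added assumption rather than something the ranking rule requires: with "lower is better", a negative ratio from negative EBIT would otherwise sort to the top as if it were the cheapest name.

```python
import pandas as pd

# Hypothetical input: BBB has no EBIT yet, DDD has negative EBIT.
df = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "ev_ebit": [8.0, None, 12.0, -5.0],
    "roc": [0.20, 0.35, 0.10, 0.15],
})

required = ["ev_ebit", "roc"]

# Record which tickers are dropped for missing inputs, so it is not silent.
dropped_missing = df[df[required].isna().any(axis=1)]["ticker"].tolist()

screened = df.dropna(subset=required)
# Assumption: treat non-positive EV/EBIT as not rankable.
screened = screened[screened["ev_ebit"] > 0].copy()

screened["rank_ev_ebit"] = screened["ev_ebit"].rank(ascending=True)
screened["rank_roc"] = screened["roc"].rank(ascending=False)
screened["combined"] = screened["rank_ev_ebit"] + screened["rank_roc"]
ranked = screened.sort_values("combined").reset_index(drop=True)
```

Logging `dropped_missing` on each run gives you a quick way to see whether a sudden drop in coverage comes from the data source or from your own filters.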
Stage 4: Store Results and Schedule a Refresh
A screener that runs once is a script. A screener that runs daily and keeps history is a tool.
Store the ranked output in a database table with a run_date column. This lets you compare ranks across days, track changes in position, and debug why a ticker moved. A simple schema: (run_date, ticker, metric_1, metric_2, combined_rank). SQLite is sufficient for side-project scale; Postgres if you want more.
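The schema above could be set up in SQLite roughly as follows. The table name `screen_results` and the helper functions are hypothetical; the columns follow the schema in the text.

```python
import sqlite3


def init_results(conn):
    """Create the results table, one row per (run_date, ticker)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS screen_results (
               run_date      TEXT NOT NULL,
               ticker        TEXT NOT NULL,
               metric_1      REAL,
               metric_2      REAL,
               combined_rank INTEGER NOT NULL,
               PRIMARY KEY (run_date, ticker)
           )"""
    )


def write_run(conn, run_date, rows):
    """Store one run. `rows` holds (ticker, metric_1, metric_2, combined_rank)."""
    conn.executemany(
        "INSERT OR REPLACE INTO screen_results VALUES (?, ?, ?, ?, ?)",
        [(run_date, *row) for row in rows],
    )
    conn.commit()


def rank_changes(conn, prev_date, curr_date):
    """Return (ticker, prev_rank, curr_rank) for tickers present in both runs."""
    return conn.execute(
        """SELECT c.ticker, p.combined_rank, c.combined_rank
           FROM screen_results AS c
           JOIN screen_results AS p ON p.ticker = c.ticker
           WHERE p.run_date = ? AND c.run_date = ?
           ORDER BY c.combined_rank""",
        (prev_date, curr_date),
    ).fetchall()
```

The composite primary key makes a re-run for the same date overwrite that day's rows instead of duplicating them.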
Schedule the daily job with whatever fits your stack — a cron job, a GitHub Actions workflow on a schedule trigger, or a cloud function. The job sequence: fetch new prices, fetch any new filings from EDGAR, recompute ratios, write ranked output, log a run timestamp. Add a check: if fewer than a minimum expected number of tickers made it through the pipeline, emit a warning rather than silently writing a half-populated result.
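The coverage check at the end of the job can be very small. In this sketch the logger name, the function name, and the thresholds you pass in are all illustrative assumptions.

```python
import logging

logger = logging.getLogger("screener")


def check_coverage(n_ranked, n_expected_min):
    """Return True if enough tickers survived the pipeline.

    Logs a warning and returns False otherwise, so the caller can decide
    whether to skip writing the result.
    """
    if n_ranked < n_expected_min:
        logger.warning(
            "Only %d tickers ranked; expected at least %d.",
            n_ranked,
            n_expected_min,
        )
        return False
    return True
```

Returning a boolean rather than raising leaves the policy with the caller: a side project might write the partial result with a flag, while a stricter setup might refuse to write at all.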
One practical issue with EDGAR: large companies file on time; smaller companies are sometimes late. If your screener runs on a Saturday morning and a late filing does not arrive until the following Tuesday, the previous quarter’s value persists until the next run picks up the new filing. This is usually fine for a weekly or daily refresh cadence, but document it.
Stage 5: Present the Ranked List — and Read It Honestly
The output is a sorted table of tickers with their metrics. A screener surfaces candidates for further research. It is not a buy list.
A ranked list tells you which companies score well on the metrics you defined. It does not tell you whether those metrics are predictive, whether the market has already priced in what the screen sees, or whether there is a structural reason the top-ranked names look cheap (distress, regulatory risk, terminal business models). The screen is a filter, not a conclusion.
A useful discipline: before trusting any backtest result, write down every data dependency in the pipeline and check whether each piece of data was actually available at the simulated trade date. If any of it was not, the result is compromised.
FAQ
Do I need a paid data provider to build a useful screener?
How do I handle missing fundamental data for a ticker?
My screener shows strong historical performance. Should I trust it?