Building a Stock Screener: From Data Source to a Ranked List

You want to build a stock screener. Not because someone told you it would print money, but because it is the kind of tool that makes you think harder about the data and the mechanics behind it. The screener itself is a pipeline — a series of stages where bugs compound quietly. Get the plumbing right, and the output is a ranked list of candidates you can actually reason about. Get it wrong, and you are sorting noise through a broken funnel while believing the results mean something.

This is a guide to building that pipeline well. The question of whether any particular screen generates alpha is separate, and one this article deliberately does not answer.

Stage 1: Define Your Universe — and Respect the Survivorship Trap

The first decision is which stocks you are screening. The obvious starting point is a major index — S&P 500, Russell 2000, MSCI World. Index constituents are publicly listed, widely covered, and give you a coherent starting population.

The trap is what you do next. If you pull the current list of constituents, use that as your universe, and then apply it to historical data, you have already introduced survivorship bias. The companies in the S&P 500 today are not the companies that were in the S&P 500 five years ago. Some of the names that left were acquired, but many dropped off for bad reasons: bankruptcy, delisting, permanent impairment. If your historical screen never touches those names, you are testing on a population that has already survived, and your backtest will look better than it should.

The correct approach is to use a point-in-time universe: the actual constituents on each historical date, not the current list projected backwards. This data is harder to get. Historical index membership files exist but are not always free. If you are building a serious backtest, this matters. If you are building a forward-looking screen for current candidates, the survivorship problem is less acute — but you should still be aware that a screen built only on large-cap survivors will have structural biases baked in from the start.
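One way to represent a point-in-time universe is a table of membership intervals, queried by date. The sketch below assumes a hypothetical `membership` table with `added` and `removed` columns; the tickers and dates are invented for illustration, and `constituents_on` is a made-up helper name.

```python
import pandas as pd

# Hypothetical membership intervals; tickers and dates are made up for illustration.
# A missing `removed` date means the ticker is still a constituent.
membership = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB", "CCC"],
        "added": pd.to_datetime(["2015-01-02", "2018-06-18", "2021-03-22"]),
        "removed": pd.to_datetime(["2020-09-21", None, None]),
    }
)


def constituents_on(membership: pd.DataFrame, as_of: str) -> list[str]:
    """Return the tickers that were index members on the given date."""
    date = pd.Timestamp(as_of)
    joined = membership["added"] <= date
    still_in = membership["removed"].isna() | (membership["removed"] > date)
    return sorted(membership.loc[joined & still_in, "ticker"])
```

A backtest would call `constituents_on` once per rebalance date instead of reusing today's list, so a name like `AAA` still appears in the 2019 universe even though it has since left.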

A reasonable starting universe for a side project: all US-listed common stocks above some liquidity threshold (average daily volume or market cap), sourced from a provider that flags security type and exchange. You want to exclude ETFs, preferred shares, warrants, and foreign-domiciled ADRs if your metrics assume US-GAAP financials.
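The filter described above might look like the following sketch. The column names (`security_type`, `avg_dollar_volume`), the type labels, and the threshold are all assumptions; a real provider will use its own field names and codes.

```python
import pandas as pd

# Hypothetical security master; column names and values are assumptions.
securities = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB", "CCC", "DDD", "EEE"],
        "security_type": ["common", "etf", "common", "preferred", "adr"],
        "avg_dollar_volume": [5_000_000, 90_000_000, 40_000, 1_200_000, 8_000_000],
    }
)

MIN_AVG_DOLLAR_VOLUME = 1_000_000  # illustrative threshold, not a recommendation


def build_universe(securities: pd.DataFrame) -> pd.DataFrame:
    """Keep common stocks that clear the liquidity threshold."""
    is_common = securities["security_type"] == "common"
    is_liquid = securities["avg_dollar_volume"] >= MIN_AVG_DOLLAR_VOLUME
    return securities[is_common & is_liquid].reset_index(drop=True)
```

Here only `AAA` survives: `CCC` is a common stock but too illiquid, and the other rows are excluded by security type.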

Stage 2: Data Sources — Fundamentals and Prices Are Separate Problems

Fundamentals (earnings, book value, free cash flow, debt) and prices (daily open/high/low/close, volume) come from different places and have different update cadences. Treat them as separate pipelines.

Fundamentals: SEC EDGAR

For US companies, the SEC’s EDGAR system is the authoritative source for financial filings. The structured company-facts API (data.sec.gov/api/xbrl/companyfacts/CIK##########.json, where the digits are the company’s CIK zero-padded to 10 places) returns XBRL-tagged financial data from 10-K and 10-Q filings as JSON, with no API key required. The data is filed by the companies themselves, with no third-party transformation layer, and covers thousands of filers going back many years.

The practical constraints: the data is as fresh as the most recent filing, which for annual figures means you may be working with numbers that are 3-12 months old. Not every company tags every concept consistently; XBRL tagging quality varies, so you will encounter gaps and mismatched concept names across filers. Rate limits on EDGAR are moderate — the SEC asks for a max of 10 requests per second and requires a descriptive User-Agent header. For a screener that refreshes daily, this is not a problem in practice.
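A minimal fetch that respects both constraints could look like this, using only the standard library. The `USER_AGENT` value is a placeholder you would replace with your own contact details, and the helper names and the fixed 0.1-second pause are assumptions of this sketch rather than anything the SEC prescribes.

```python
import json
import time
import urllib.request

# Placeholder value: replace with your own app name and contact address.
USER_AGENT = "example-screener admin@example.com"
MIN_INTERVAL = 0.1  # seconds between requests, i.e. at most 10 per second


def company_facts_url(cik: int) -> str:
    """Build the company-facts URL; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik:010d}.json"


def fetch_company_facts(cik: int) -> dict:
    """Fetch and parse company facts for one CIK, then pause for the rate limit."""
    request = urllib.request.Request(
        company_facts_url(cik), headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        facts = json.load(response)
    time.sleep(MIN_INTERVAL)
    return facts
```

Sleeping after every call is the bluntest way to stay under the limit; it is slow for thousands of filers but hard to get wrong in a single-threaded daily job.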

Pickuma has a separate post covering the SEC EDGAR API in detail, including how to traverse the submissions endpoint and handle the XBRL concept taxonomy.

Prices: Market Data APIs

Price data is a different market. Several providers offer free tiers — Polygon.io, Alpha Vantage, Yahoo Finance (unofficial), and others — with varying coverage, rate limits, and data quality. For a screen that needs end-of-day closes for several thousand US tickers, you need a free tier that actually covers that. A tier limited to 25 requests per day does not.

Key questions when evaluating a price API: Does the data adjust for splits and dividends, or do you have to handle that yourself? Does the free tier include the full US universe, or only a subset? What is the historical depth? Pickuma has a separate comparison of price-data APIs for exactly this evaluation.

Store prices locally after the first fetch. A daily job that re-fetches two years of history for 5,000 tickers is wasteful and will hit rate limits. Fetch incrementally: append new closes, keep the history.
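As a sketch of the incremental pattern, assume prices live in a DataFrame with `ticker`, `date`, and `close` columns (the layout and both helper names are assumptions). One helper tells the fetch step where each ticker left off; the other merges whatever came back.

```python
import pandas as pd


def last_stored_dates(history: pd.DataFrame) -> dict:
    """Map each ticker to the most recent date already stored."""
    return history.groupby("ticker")["date"].max().to_dict()


def append_new_closes(history: pd.DataFrame, fetched: pd.DataFrame) -> pd.DataFrame:
    """Add fetched rows to history, keeping the stored row when a date overlaps."""
    combined = pd.concat([history, fetched], ignore_index=True)
    combined = combined.drop_duplicates(subset=["ticker", "date"], keep="first")
    return combined.sort_values(["ticker", "date"]).reset_index(drop=True)


# Illustrative prices only.
history = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA"],
        "date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
        "close": [10.0, 10.5],
    }
)
fetched = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA"],
        "date": pd.to_datetime(["2024-01-03", "2024-01-04"]),
        "close": [10.5, 11.0],
    }
)
updated = append_new_closes(history, fetched)
```

Keeping the stored row on overlap is a design choice of this sketch. If your provider restates adjusted history after a split or dividend, you would want to refetch and overwrite that ticker instead.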

Point-in-Time Correctness

When you compute a ratio using last quarter’s earnings and today’s price, the earnings number became available to the market on the filing date, not the period end date. A company with a December fiscal year end typically files its annual report somewhere between mid-February and late March, depending on its filer category. If you are simulating a historical screen, using December 31st earnings as if they were available on January 1st is look-ahead bias. Use the actual filing date as the availability date.

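# company_facts is the parsed JSON returned by the company-facts endpoint.
# Each entry carries the period end and the date the filing became public.
entries = company_facts["facts"]["us-gaap"]["NetIncomeLoss"]["units"]["USD"]
for filing in entries:
    period_end = filing["end"]      # e.g. 2023-12-31
    filed_on   = filing["filed"]    # e.g. 2024-02-14  <- use this as the availability date
    value      = filing["val"]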
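In a backtest, the filing date becomes the key for an as-of lookup: given a simulated date, return the latest value that had actually been filed by then. The sketch below assumes entries shaped like the company-facts lists, with `end`, `filed`, and `val` keys; the sample entries are illustrative, and `latest_value_as_of` is a hypothetical helper. Real entries also mix annual and quarterly periods, which this sketch ignores.

```python
from datetime import date
from typing import Optional


def latest_value_as_of(entries: list[dict], as_of: date) -> Optional[float]:
    """Return the value for the latest period that had been filed by `as_of`."""
    available = [e for e in entries if date.fromisoformat(e["filed"]) <= as_of]
    if not available:
        return None
    # Prefer the latest period; break ties with the latest filing (restatements).
    # ISO date strings sort chronologically, so string comparison is enough here.
    newest = max(available, key=lambda e: (e["end"], e["filed"]))
    return newest["val"]


# Illustrative entries, not real filings.
entries = [
    {"end": "2023-09-30", "filed": "2023-11-02", "val": 120.0},
    {"end": "2023-12-31", "filed": "2024-02-14", "val": 150.0},
]
```

With these entries, a simulated date of 2024-01-15 still sees the September figure, because the December figure was not filed until 2024-02-14.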

Stage 3: Define and Compute Your Ranking Metric

A screener needs a ranking rule. The simplest useful approach is a ratio — price-to-earnings, EV/EBIT, return on capital, debt-to-equity — computed for each ticker and then sorted.

A more sophisticated approach, made famous by Joel Greenblatt’s “magic formula” framework, combines multiple ranks: rank all stocks by one metric, rank all stocks by a second metric, add the two ranks, and sort by the combined score. Because ranks are ordinal, a single extreme value on one metric cannot dominate the combined score the way it would if you added raw ratios. Whether any particular combination of metrics generates persistent returns is an empirical question that this article is not going to answer. The “magic formula” appears here as a useful pedagogical example of the rank-and-combine pattern, not as an endorsement.

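import pandas as pd

# df has columns: ticker, ev_ebit, roc
# Drop rows with missing inputs, and non-positive EV/EBIT, which would otherwise sort as "cheapest".
df = df.dropna(subset=["ev_ebit", "roc"])
df = df[df["ev_ebit"] > 0].copy()
df["rank_ev_ebit"] = df["ev_ebit"].rank(ascending=True)   # lower is better
df["rank_roc"]     = df["roc"].rank(ascending=False)      # higher is better
df["combined"]     = df["rank_ev_ebit"] + df["rank_roc"]
df_sorted = df.sort_values("combined").reset_index(drop=True)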

Watch for missing data in your ranking. A ticker with no EBIT filed yet (a recent IPO, or a company that had a delayed filing) should be excluded from the screen rather than ranked as if its ratio were zero. Propagating nulls silently is one of the more common bugs in screener pipelines.
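One way to make the exclusion explicit and auditable is to split the input into rankable rows and skipped rows before ranking, recording which inputs were missing. This is a sketch with illustrative values; `split_rankable` and the `missing` column are hypothetical names.

```python
import pandas as pd


def split_rankable(
    df: pd.DataFrame, required: list[str]
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Separate rows with complete inputs from rows to skip, noting what is missing."""
    missing = df[required].isna()
    has_gap = missing.any(axis=1)
    excluded = df.loc[has_gap, ["ticker"]].copy()
    excluded["missing"] = [
        ",".join(col for col in required if flags[col])
        for flags in missing[has_gap].to_dict("records")
    ]
    return df[~has_gap].copy(), excluded


# Illustrative values only.
df = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB", "CCC"],
        "ev_ebit": [8.0, None, 12.0],
        "roc": [0.20, 0.15, None],
    }
)
rankable, excluded = split_rankable(df, ["ev_ebit", "roc"])
```

Writing `excluded` to a log or a table on each run gives you a record of which tickers were skipped and why.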

Stage 4: Store Results and Schedule a Refresh

A screener that runs once is a script. A screener that runs daily and keeps history is a tool.

Store the ranked output in a database table with a run_date column. This lets you compare ranks across days, track changes in position, and debug why a ticker moved. A simple schema: (run_date, ticker, metric_1, metric_2, combined_rank). SQLite is sufficient for side-project scale; Postgres if you want more.
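The schema above translates directly into SQLite through Python's standard `sqlite3` module. In this sketch the table name `screen_results` and both helper functions are assumptions; the composite primary key is a design choice that makes re-running a day overwrite its rows instead of duplicating them.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS screen_results (
    run_date      TEXT NOT NULL,
    ticker        TEXT NOT NULL,
    metric_1      REAL,
    metric_2      REAL,
    combined_rank REAL,
    PRIMARY KEY (run_date, ticker)
)
"""


def write_run(conn: sqlite3.Connection, run_date: str, rows: list[tuple]) -> None:
    """Store one run; rows are (ticker, metric_1, metric_2, combined_rank)."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR REPLACE INTO screen_results VALUES (?, ?, ?, ?, ?)",
        [(run_date, *row) for row in rows],
    )
    conn.commit()


def rank_history(conn: sqlite3.Connection, ticker: str) -> list[tuple]:
    """Return (run_date, combined_rank) for one ticker, oldest first."""
    cursor = conn.execute(
        "SELECT run_date, combined_rank FROM screen_results "
        "WHERE ticker = ? ORDER BY run_date",
        (ticker,),
    )
    return cursor.fetchall()
```

Storing `run_date` as an ISO string keeps the rows sortable by date without any extra parsing.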

Schedule the daily job with whatever fits your stack — a cron job, a GitHub Actions workflow on a schedule trigger, or a cloud function. The job sequence: fetch new prices, fetch any new filings from EDGAR, recompute ratios, write ranked output, log a run timestamp. Add a check: if fewer than a minimum expected number of tickers made it through the pipeline, emit a warning rather than silently writing a half-populated result.
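The minimum-count check can be a small guard that the job calls before writing. The threshold and the function name below are illustrative assumptions; pick a minimum that reflects your own universe size.

```python
import logging

logger = logging.getLogger("screener")

MIN_EXPECTED_TICKERS = 1000  # illustrative; set from your own universe size


def enough_tickers(ranked_count: int, minimum: int = MIN_EXPECTED_TICKERS) -> bool:
    """Return True if the run is large enough to write; log a warning otherwise."""
    if ranked_count < minimum:
        logger.warning(
            "Only %d tickers reached the ranking step (expected at least %d)",
            ranked_count,
            minimum,
        )
        return False
    return True
```

Returning a boolean leaves the decision to the caller: the job can skip the write, or write the rows under a flag that marks the run as incomplete.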

One practical issue with EDGAR: large companies file on time; smaller companies are sometimes late. If your screener runs on a Saturday morning and a filing that was due the previous Tuesday has not appeared yet, the previous quarter’s value will persist in your output. This is usually fine for a weekly or daily refresh cadence, but document the behavior so a stale value is not mistaken for a fresh one.

Stage 5: Present the Ranked List — and Read It Honestly

The output is a sorted table of tickers with their metrics. A screener surfaces candidates for further research. It is not a buy list.

A ranked list tells you which companies score well on the metrics you defined. It does not tell you whether those metrics are predictive, whether the market has already priced in what the screen sees, or whether there is a structural reason the top-ranked names look cheap (distress, regulatory risk, terminal business models). The screen is a filter, not a conclusion.

A useful discipline: before trusting any backtest result, write down every data dependency in the pipeline and check whether each piece of data was actually available at the simulated trade date. If any of it was not, the result is compromised.
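If the pipeline records an availability date for every input, that audit can be partly mechanical. The sketch below assumes a hypothetical audit table with `available_on` and `trade_date` columns and returns any row that became available only after the simulated trade; the values are illustrative.

```python
import pandas as pd


def leaked_rows(inputs: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose data became available after the simulated trade date.

    Assumes columns: ticker, field, available_on, trade_date (datetime-like).
    """
    return inputs[inputs["available_on"] > inputs["trade_date"]]


# Illustrative audit table, not real filings.
inputs = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA"],
        "field": ["close", "net_income"],
        "available_on": pd.to_datetime(["2024-01-02", "2024-02-14"]),
        "trade_date": pd.to_datetime(["2024-01-02", "2024-01-02"]),
    }
)
```

Here the `net_income` row is flagged, because the value was filed six weeks after the trade it supposedly informed. An empty result is necessary but not sufficient: it only covers the dependencies you remembered to record.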

FAQ

Do I need a paid data provider to build a useful screener?

Not necessarily. For US-listed companies, the SEC EDGAR company-facts API provides free, structured fundamental data sourced directly from filings. For prices, several providers offer free end-of-day tiers with decent US coverage. The main constraint on free tiers is rate limits and historical depth, not the existence of the data itself.

How do I handle missing fundamental data for a ticker?

Exclude it from the ranking for that run rather than treating the missing value as zero. A ticker with no EBIT on file will rank incorrectly if you propagate nulls as zeros. Log the exclusion so you can audit which tickers are being skipped and why.

My screener shows strong historical performance. Should I trust it?

Be skeptical first. Check whether the backtest used point-in-time data, whether the universe had survivorship bias removed, and whether the returns survive realistic transaction costs. A screener that looked good historically because of data leakage — not signal — is a common and costly mistake.
