How Engineers Should Read a 10-K: A Backtest-Driven Approach

The wrong way engineers read 10-Ks

The instinctive engineer move when first encountering a 10-K is to read it cover to cover. This is wrong for two reasons. First, 10-Ks run 80-300 pages, and most of the content is boilerplate repeated from year to year. Second, the parts you’d naturally focus on (the business description, plus the CEO letter in the glossy annual report that often wraps the 10-K) are the least informative: they’re written to sell the company’s narrative.

The right approach treats a 10-K as structured data to be parsed, not prose to be read. For a quant strategy, you care about specific numeric and textual signals from specific sections. Everything else is noise.

This is the section-by-section guide to what’s signal and what’s noise, organized for engineers who think in terms of “what do I actually pull from this document.”

Section 1: Business (skip mostly)

The first section is the company describing itself. Useful only on first encounter — once you’ve read one annual report from a company, the Business section doesn’t change much year-over-year.

Signal: Material changes from prior year. If a company added or removed a segment, that’s a real shift in their business mix that affects how their numbers translate.

Noise: Everything else. The “we are a leading provider of X” descriptions are PR.

How engineers handle it: diff the current year’s Business section against last year’s. Read the diff, not the section.

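import difflib

def section_diff(text_a, text_b):
    """Unified diff of two section texts: text_a is last year's, text_b is this year's."""
    # n=0 drops context lines, so the output shows only what changed
    return '\n'.join(difflib.unified_diff(
        text_a.split('\n'), text_b.split('\n'),
        lineterm='', n=0
    ))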

Section 1A: Risk Factors (where the gold is)

This is the most underrated section. Companies are legally required to disclose material risks, and lawyers err on the side of including everything. The result: Risk Factors are unusually honest about what could go wrong.

Signal:

New risks added vs. prior year (something specific changed)
Risk language that becomes more specific (e.g., “competition” → “competition from open-source LLM providers”)
Risks involving regulatory action, customer concentration, key personnel

Noise: Boilerplate risks (“changes in interest rates may affect our cost of capital”). Every company has these.

The high-value signal: section-over-section diff. A new risk added in year N that wasn’t in year N-1 is often the company quietly pre-disclosing a problem before it becomes public.

Quantamental hedge funds run NLP on Risk Factors diffs systematically. Retail-accessible version: just diff them yourself. Companies usually don’t add new risks lightly.
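A crude version of the "new risk added" check can be sketched as follows. This is a minimal sketch, not what any fund actually runs. It assumes each risk factor is a paragraph separated by blank lines in your extracted text. The function name and the 0.6 similarity cutoff are illustrative choices, not from the original pipeline. A paragraph counts as new when nothing in last year's section is a close fuzzy match, so a risk that was only reworded is not flagged.

```python
import difflib

def new_risk_paragraphs(prior_text, current_text, cutoff=0.6):
    """Return paragraphs in current_text with no close match in prior_text.

    Assumes paragraphs are separated by blank lines. `cutoff` is a
    difflib similarity ratio in [0, 1]; 0.6 is an arbitrary starting point.
    """
    def split(text):
        return [p.strip() for p in text.split("\n\n") if p.strip()]

    prior = split(prior_text)
    added = []
    for para in split(current_text):
        # get_close_matches returns [] when no prior paragraph scores >= cutoff
        if not difflib.get_close_matches(para, prior, n=1, cutoff=cutoff):
            added.append(para)
    return added
```

Each current paragraph is compared against every prior one, so long Risk Factors sections can be slow to process. That is usually fine for single-company research.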

Section 7: MD&A (Management Discussion and Analysis)

Here’s where management explains the numbers. This section is part-narrative, part-honest analysis.

Signal:

Operating margin trajectory and management’s stated reasons
Segment-level commentary (why segment A grew faster than segment B)
Forward-looking statements about specific factors (e.g., “we expect Q1 to be impacted by X”)
Cash flow analysis — particularly the gap between net income and operating cash flow

Noise: The “Year in review” intro and the high-level “our strategy” sections. These are PR.

The single most important thing to extract from MD&A: the reconciliation between accounting earnings and cash earnings. Companies with large gaps between net income and operating cash flow are sending signal. Sometimes the gap is benign (working capital changes from growth), sometimes it’s not (channel stuffing, revenue recognition aggressiveness).
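The gap can be turned into a number with a few lines. This is a sketch under stated assumptions. Scaling the gap by total assets is one common way to make it comparable across companies, but the text above does not prescribe it, and the function name is hypothetical. A large positive accrual ratio means reported earnings are running ahead of cash. The number alone does not say whether the cause is benign.

```python
def earnings_cash_gap(net_income, operating_cash_flow, total_assets):
    """Return (accruals, accrual_ratio, cash_conversion) for one fiscal year.

    accruals        = net income minus operating cash flow
    accrual_ratio   = accruals scaled by total assets (assumed scaling)
    cash_conversion = OCF / net income, or None when net income <= 0
    """
    accruals = net_income - operating_cash_flow
    accrual_ratio = accruals / total_assets
    cash_conversion = (
        operating_cash_flow / net_income if net_income > 0 else None
    )
    return accruals, accrual_ratio, cash_conversion
```

For example, net income of 100 against operating cash flow of 60 leaves 40 of earnings that never showed up as cash. That is the kind of gap the MD&A liquidity discussion should explain.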

Section 8: Financial Statements

The numbers themselves. This is what you’d parse programmatically.

Signal: Everything. Specifically:

Income statement: revenue growth, gross margin, operating margin, EPS
Balance sheet: cash position, debt levels, working capital trends, asset composition
Cash flow: operating cash flow vs. net income, capex, free cash flow

Noise: Almost nothing. Financial statements are dense signal.

For parsing: SEC EDGAR provides structured XBRL data alongside the human-readable 10-K. Python tools like python-xbrl or arelle can extract specific line items programmatically. For most retail quant use cases, sec-edgar-downloader to get the filing and BeautifulSoup to parse the financial statement tables is enough.

Section 9A: Controls and Procedures (skip unless flagged)

Material weakness disclosures live here. The large majority of companies report no material weaknesses; that language is boilerplate, so skip it.

Signal: Any disclosure of material weakness in internal controls. This is a reporting-quality red flag: it means the numbers in Section 8 came out of a process the company itself says is unreliable. The market can take a week or more to react to material weakness disclosures, especially for less-followed mid-caps. There’s potential signal here for slow-pricing-in trades.

Quantitatively: an audit-language NLP model can flag the rare “material weakness disclosed” 10-Ks and feed them into a screening pipeline.
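Before reaching for a model, a keyword screen gets you most of the way. The sketch below is a regex stand-in for the NLP model, not an NLP model. It flags a sentence that mentions a material weakness unless a negation ("no", "not", "without") appears within a few words before the phrase. The function name and the three-word negation window are assumptions for illustration, and real filings will contain phrasings this misses.

```python
import re

MENTION = re.compile(r"material\s+weakness", re.IGNORECASE)
# A negation word followed by at most three words, then the phrase,
# e.g. "no material weaknesses", "did not identify any material weaknesses".
NEGATED = re.compile(
    r"\b(?:no|not|without)\b(?:\W+\w+){0,3}?\W+material\s+weakness",
    re.IGNORECASE,
)

def flags_material_weakness(controls_text):
    """Return True if any sentence mentions a material weakness un-negated."""
    sentences = re.split(r"(?<=[.!?])\s+", controls_text)
    for sentence in sentences:
        if MENTION.search(sentence) and not NEGATED.search(sentence):
            return True
    return False
```

Treat a True result as "a human should read this Section 9A", not as a confirmed disclosure.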

Sections 10-14: Governance, Comp, etc. (mostly skip)

Executive compensation, director information, related-party transactions. Boilerplate for most companies.

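Signal: Officer and director ownership, plus related-party transactions. Two caveats: these items are usually incorporated by reference from the proxy statement rather than printed in the 10-K, and the transaction-level record of insider sales lives in Form 4 filings, not here. Use this section to cross-check holdings against the Form 4 trail. Large stock sales by C-suite that don’t fall under 10b5-1 plans are signal, typically negative for the stock 6-12 months out.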

Noise: Most everything else in these sections.

What I actually pull from a 10-K (the 8-field summary)

For a screening universe of US public companies, here are the 8 fields I extract per company per year. Everything else is downstream noise.

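Operating revenue (the 10-K gives the fiscal-year total; for TTM between annual filings, add the latest 10-Q year-to-date figure and subtract the prior-year YTD)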
Operating income (EBIT)
Net income
Operating cash flow (TTM)
Free cash flow = OCF − capex
Total debt (current + long-term)
Cash + short-term investments
Material risks added vs. prior year (text, from Risk Factors diff)

Field 8 is the unusual one. Most quant screens use only numeric fields. Adding the Risk Factors diff as a signal — even crudely, as “did this company add a new risk this year, yes/no” — captures real information.
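One way to hold the eight fields is a small record type. The class and attribute names below are hypothetical, and storing capex as a positive outflow is an assumed convention. Free cash flow is derived rather than stored, and the crude yes/no version of field 8 falls out of the list of new risk paragraphs.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TenKSummary:
    ticker: str
    fiscal_year: int
    revenue: float
    operating_income: float          # EBIT
    net_income: float
    operating_cash_flow: float
    capex: float                     # assumed stored as a positive outflow
    total_debt: float                # current + long-term
    cash_and_st_investments: float
    new_risks: List[str] = field(default_factory=list)  # from the Risk Factors diff

    @property
    def free_cash_flow(self):
        # Field 5: FCF = OCF - capex
        return self.operating_cash_flow - self.capex

    @property
    def added_new_risk(self):
        # Crude yes/no version of field 8
        return bool(self.new_risks)
```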

A minimal harness

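from sec_edgar_downloader import Downloader
import requests

# Pull the latest 10-K for any ticker.
# SEC asks for a name and contact email; the library sends them as the User-Agent.
dl = Downloader("YourName", "you@example.com")
dl.get("10-K", "AAPL", limit=1)
# Files now in ./sec-edgar-filings/AAPL/10-K/

# Get financial facts via SEC's structured API
def get_company_facts(cik):
    # cik is the numeric CIK; the endpoint expects it zero-padded to 10 digits
    url = f'https://data.sec.gov/api/xbrl/companyfacts/CIK{int(cik):010d}.json'
    resp = requests.get(url, headers={'User-Agent': 'YourName you@example.com'}, timeout=30)
    resp.raise_for_status()  # fail loudly on 403/404 instead of parsing an error page
    return resp.json()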

The structured XBRL data via data.sec.gov is the cleanest input. You get a JSON dump of every XBRL fact the company has reported across its filings (10-Qs included, not just 10-Ks), with as-reported values, units, and the form each fact came from. Parse that, not the human-readable 10-K, unless you specifically need the prose sections.
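Pulling one line item out of that JSON can be sketched as below. The function name is hypothetical, and the 350-380 day window for a "full-year" period is an assumed heuristic to skip quarterly figures that also appear in 10-Ks. The tag to request varies by company: net income is commonly `NetIncomeLoss`, but revenue shows up under more than one us-gaap tag, so check which one your company uses.

```python
from datetime import date

def annual_values(facts, concept, taxonomy="us-gaap", unit="USD"):
    """Return {period end date: value} for one concept, using 10-K facts only.

    `facts` is the parsed companyfacts JSON. When the same period appears
    in several 10-Ks (as a prior-year comparative), the latest filing wins.
    """
    try:
        rows = facts["facts"][taxonomy][concept]["units"][unit]
    except KeyError:
        return {}  # company does not report this concept in this unit

    latest = {}
    for row in rows:
        if row.get("form") != "10-K":
            continue
        if "start" in row:  # duration item: keep roughly full-year periods only
            days = (date.fromisoformat(row["end"])
                    - date.fromisoformat(row["start"])).days
            if not 350 <= days <= 380:
                continue
        prev = latest.get(row["end"])
        # ISO dates compare correctly as strings
        if prev is None or row["filed"] > prev["filed"]:
            latest[row["end"]] = row
    return {end: row["val"] for end, row in sorted(latest.items())}
```

Balance sheet items have no `start` key because they are point-in-time, so they pass through the duration filter untouched.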

When the 10-K is the wrong document

For some signals, 10-Qs are better:

Quarterly fluctuations in working capital
Segment-level reporting (some companies provide more detail quarterly)
The latest quarter’s growth trajectory

For some signals, 8-Ks are critical:

Material events between filings (acquisitions, executive changes, accounting restatements)
Real-time disclosure of things that don’t wait for the next 10-Q

For some signals, the DEF 14A (proxy statement, filed separately from the 10-K) is required:

Executive compensation in detail
Shareholder proposals (the vote outcomes are reported afterwards in an 8-K)
Director independence

Quant pipelines should be filing-type-aware. The 10-K isn’t always the right document.
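In practice, "filing-type-aware" can be as simple as a lookup from signal to form type that the downloader consults before fetching anything. The signal names and the mapping below are illustrative assumptions that mirror the lists above, not a complete taxonomy.

```python
# Hypothetical signal names mapped to the SEC form types that carry them
FORMS_BY_SIGNAL = {
    "annual_financials": ["10-K"],
    "risk_factor_diff": ["10-K"],
    "working_capital_trend": ["10-Q"],
    "latest_quarter_growth": ["10-Q"],
    "material_events": ["8-K"],
    "executive_compensation": ["DEF 14A"],
}

def forms_needed(signals):
    """Return the sorted, de-duplicated form types required for `signals`."""
    needed = set()
    for signal in signals:
        needed.update(FORMS_BY_SIGNAL[signal])  # KeyError on an unknown signal
    return sorted(needed)
```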

Verdict

Read 10-Ks the way an engineer reads code: skim the structure, focus on the parts that contain signal, ignore the boilerplate. The high-value sections are Risk Factors (and especially their year-over-year diff), MD&A (especially the cash flow narrative), and the financial statements themselves.

For systematic strategies, parse the XBRL via SEC’s data API and run NLP on Risk Factors text. For ad-hoc research on a single company, focus on the year-over-year diffs and the cash-flow-vs-earnings reconciliation.

The 80% of the 10-K that’s boilerplate is the 80% that wastes amateur readers’ time. Skip it deliberately and your reading time on a 10-K drops from roughly 4 hours to about 25 minutes.