How We Score Tools: The Rubric Behind Every pickuma Review
A look inside the five-dimension scoring rubric pickuma uses to rate developer and AI tools, how the weights shift by category, and where a single number stops being useful.
Every review on this site ends with a number, and a number with no method behind it is just a vibe wearing a lab coat. So here is the method. This is the rubric we run each tool through before it gets a score, the weights we attach to each part, and the cases where we throw the number out entirely because it would mislead you.
We write this down for two reasons. First, so you can argue with it — if you think we weight pricing too lightly for solo developers, you now have something concrete to push against. Second, so we hold ourselves to it. A rubric you publish is a rubric you can be caught violating.
The five things every score measures
We score every tool across five dimensions. Each one gets a 1-to-10 sub-score, and the headline number you see is a weighted blend of the five. The dimensions are fixed; the weights are not, which we’ll get to in the next section.
| Dimension | What we test | Common failure |
|---|---|---|
| Capability | Does it do the core job well, not just adequately? | Demo-perfect, breaks on real inputs |
| Time-to-value | How long from sign-up to first useful result? | A weekend of setup before anything works |
| Reliability | Does it behave the same on Tuesday as it did Monday? | Silent regressions, flaky outputs |
| Pricing honesty | Is the real cost the advertised cost? | Seat minimums, usage cliffs, gated exports |
| Lock-in cost | How expensive is it to leave? | Proprietary formats, no export |
Capability is the obvious one, but it’s also where most marketing pages lie by omission. We don’t score the feature list. We score whether the feature survives contact with a messy, real workload — the kind you’d actually throw at it on a Wednesday afternoon.
Time-to-value is the dimension readers underrate most. A tool that scores a 9 on capability but takes two days to configure is, for most people, worse than a 7 that works in ten minutes. We measure this from a cold start: new account, no prior setup, clock running.
Pricing honesty is separate from price. A tool can be expensive and honest, or cheap and dishonest. We penalize the gap between the number on the pricing page and the number on your invoice — seat minimums you discover at checkout, an export locked behind the next tier up, a free plan that throttles the one feature you came for.
Lock-in cost asks a single question: if you wanted to leave in a year, how much would it hurt? Tools that export clean, open formats score well here. Tools that trap your data in a shape only they can read score badly, no matter how good the rest of the experience is.
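The weighted blend described at the top of this section can be written out as a formula. This is a sketch under stated assumptions: the symbols ($s_i$ for the five sub-scores, $w_i$ for their weights) are illustrative notation rather than anything printed on a scorecard, and dividing by the weight total is one way to keep the result on the 1-to-10 scale even when the weights are not normalized.

$$
S = \frac{\sum_{i=1}^{5} w_i \, s_i}{\sum_{i=1}^{5} w_i}, \qquad 1 \le s_i \le 10, \quad w_i \ge 0
$$

As a worked example with made-up numbers: sub-scores of 8, 6, 7, 9, and 5 under equal weights of 0.2 each give $S = 0.2 \times (8 + 6 + 7 + 9 + 5) / 1.0 = 7.0$.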
How we weight them (and why the weights move)
A fixed weighting would be easier to defend and worse for you. The right weight depends on what the tool is for and who’s using it.
For an infrastructure tool a team will run in production, reliability and lock-in cost carry the most weight — a flaky database or a proprietary log format is a problem you live with for years. For a quick AI utility a solo developer might use for a single project, time-to-value and pricing honesty matter more, and lock-in barely registers because you’re not betting your stack on it.
So the weights shift by category. We publish the weighting we used at the top of each review’s scorecard, so a 7.5 in one category and a 7.5 in another aren’t pretending to be the same measurement. They’re not.
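To make the category shift concrete, here is a minimal Python sketch that runs one set of sub-scores through two weight profiles. The profile names, the weight values, and the sub-scores are all hypothetical, chosen only to mirror the infrastructure and solo-utility cases above; they are not the weights used in any actual review.

```python
# Hypothetical weight profiles (each sums to 1.0). Not the site's real numbers.
WEIGHTS = {
    "infrastructure": {
        "capability": 0.20, "time_to_value": 0.10, "reliability": 0.30,
        "pricing_honesty": 0.10, "lock_in_cost": 0.30,
    },
    "solo_ai_utility": {
        "capability": 0.20, "time_to_value": 0.30, "reliability": 0.15,
        "pricing_honesty": 0.30, "lock_in_cost": 0.05,
    },
}


def headline_score(sub_scores, weights):
    """Weighted average of 1-to-10 sub-scores, rounded to one decimal."""
    if set(sub_scores) != set(weights):
        raise ValueError("sub-scores and weights must cover the same dimensions")
    total = sum(weights.values())
    blended = sum(sub_scores[d] * weights[d] for d in weights) / total
    return round(blended, 1)


# Hypothetical tool: quick to start and honestly priced, but flaky and hard to leave.
# A higher lock_in_cost sub-score means leaving is cheaper, as in the rubric.
sub_scores = {
    "capability": 8, "time_to_value": 9, "reliability": 6,
    "pricing_honesty": 8, "lock_in_cost": 4,
}

infra = headline_score(sub_scores, WEIGHTS["infrastructure"])
solo = headline_score(sub_scores, WEIGHTS["solo_ai_utility"])
print(infra, solo)  # prints: 6.3 7.8
```

The same five sub-scores land a point and a half apart depending on the profile, which is why a headline number only makes sense next to the weighting that produced it.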
We keep the rubric, the per-category weights, and every tool’s sub-scores in a single shared workspace so the scoring stays consistent from one review to the next. If you’re building your own evaluation process — for a team tool bake-off, a vendor shortlist, or your own writing — a structured doc that forces every option through the same columns beats a folder of scattered notes.
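If you would rather prototype that structure in code than in a doc, a rough Python sketch of "the same columns for every option" might look like the following. The `Scorecard` class, its field names, and the two sample tools are assumptions made for illustration; nothing here describes how the actual workspace is set up.

```python
from dataclasses import dataclass
from datetime import date

# The five rubric dimensions; the identifier spellings are illustrative.
DIMENSIONS = (
    "capability", "time_to_value", "reliability",
    "pricing_honesty", "lock_in_cost",
)


@dataclass(frozen=True)
class Scorecard:
    tool: str
    category: str
    scored_on: date  # every score carries the date it was assigned
    capability: int
    time_to_value: int
    reliability: int
    pricing_honesty: int
    lock_in_cost: int

    def __post_init__(self):
        # Reject any sub-score outside the 1-to-10 range.
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")


# Hypothetical tools and scores, for illustration only.
cards = [
    Scorecard("tool-a", "infrastructure", date(2026, 6, 1), 8, 5, 9, 7, 6),
    Scorecard("tool-b", "infrastructure", date(2026, 6, 1), 9, 8, 6, 7, 3),
]

# Fixed columns make every option sortable on any single dimension.
by_reliability = sorted(cards, key=lambda c: c.reliability, reverse=True)
print([c.tool for c in by_reliability])  # prints: ['tool-a', 'tool-b']
```

Because every field is required, leaving a dimension blank fails at construction time instead of producing a scorecard with a silent gap.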
Notion
What we use to keep the rubric, weights, and per-tool scorecards in one place so reviews stay consistent. A database with fixed columns turns 'this feels better' into something you can sort and compare.
Free for personal use; paid plans from $10/user/mo
Affiliate link · We earn a commission at no cost to you.
Where scores fall short
A rubric is a tool, and like every tool it has a range outside of which it produces nonsense. We’d rather tell you where ours breaks than pretend it doesn’t.
The first limit is taste. Some tools are technically strong and genuinely unpleasant to use, and “unpleasant” resists a 1-to-10 score. We fold it into capability when it affects real work, but a review’s prose will always carry nuance the number can’t.
The second limit is timing. Scores are snapshots. A tool we rated a 6 last quarter may ship the exact feature that was dragging it down, and until we re-test, the published number is stale. We date every score and re-review when something material changes — but between those points, trust the date as much as the digit.
The third limit is you. Our weights encode an average reader who doesn’t exist. If you’re cost-sensitive, mentally raise the pricing weight. If you’re building something you’ll maintain for five years, raise reliability and lock-in. The sub-scores are there precisely so you can re-blend them for your own situation instead of inheriting ours.
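Re-blending for your own situation can be done by hand, but a short Python sketch shows the mechanics. It assumes equal published weights and invented sub-scores purely for illustration; the `reweight` helper and its multiplier approach are one possible way to "raise" a weight, not a prescribed method.

```python
def reweight(weights, boosts):
    """Scale selected weights by a multiplier, then renormalize to sum to 1."""
    unknown = set(boosts) - set(weights)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    adjusted = {d: w * boosts.get(d, 1.0) for d, w in weights.items()}
    total = sum(adjusted.values())
    return {d: w / total for d, w in adjusted.items()}


def blend(sub_scores, weights):
    """Weighted sum of sub-scores; expects weights that already sum to 1."""
    return sum(sub_scores[d] * weights[d] for d in weights)


# Hypothetical published weights (equal) and sub-scores for one tool.
published_weights = {
    "capability": 0.2, "time_to_value": 0.2, "reliability": 0.2,
    "pricing_honesty": 0.2, "lock_in_cost": 0.2,
}
sub_scores = {
    "capability": 9, "time_to_value": 8, "reliability": 8,
    "pricing_honesty": 4, "lock_in_cost": 6,
}

# A cost-sensitive reader doubles the pricing weight.
my_weights = reweight(published_weights, {"pricing_honesty": 2.0})

published = blend(sub_scores, published_weights)
personal = blend(sub_scores, my_weights)
print(round(published, 1), round(personal, 1))  # prints: 7.0 6.5
```

Doubling the pricing weight pulls this hypothetical tool from 7.0 down to 6.5, because its weakest sub-score now counts for a third of the total instead of a fifth.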
The goal was never to hand you a single digit and call it objectivity. It’s to make our judgment legible — to show the inputs, the weights, and the seams — so you can take what’s useful and override the rest.
FAQ
Do tools pay to get a higher score?
Why does the same score mean different things in different reviews?
How often do you re-score a tool?