How We Score Tools: The Rubric Behind Every pickuma Review

A look inside the five-dimension scoring rubric pickuma uses to rate developer and AI tools, how the weights shift by category, and where a single number stops being useful.

Every review on this site ends with a number, and a number with no method behind it is just a vibe wearing a lab coat. So here is the method. This is the rubric we run each tool through before it gets a score, the weights we attach to each part, and the cases where we throw the number out entirely because it would mislead you.

We write this down for two reasons. First, so you can argue with it — if you think we weight pricing too lightly for solo developers, you now have something concrete to push against. Second, so we hold ourselves to it. A rubric you publish is a rubric you can be caught violating.

The five things every score measures

We score every tool across five dimensions. Each one gets a 1-to-10 sub-score, and the headline number you see is a weighted blend of the five. The dimensions are fixed; the weights are not, which we’ll get to in the next section.

| Dimension | What we test | Common failure |
| --- | --- | --- |
| Capability | Does it do the core job well, not just adequately? | Demo-perfect, breaks on real inputs |
| Time-to-value | How long from sign-up to first useful result? | A weekend of setup before anything works |
| Reliability | Does it behave the same on Tuesday as it did Monday? | Silent regressions, flaky outputs |
| Pricing honesty | Is the real cost the advertised cost? | Seat minimums, usage cliffs, gated exports |
| Lock-in cost | How expensive is it to leave? | Proprietary formats, no export |
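
To make "weighted blend" concrete, here is one way it could be written down. This is a sketch that assumes the blend is a plain weighted average; the symbols and the example numbers are illustrative and are not taken from any published scorecard. With sub-scores s_i and weights w_i for the five dimensions:

```latex
S = \frac{\sum_{i=1}^{5} w_i \, s_i}{\sum_{i=1}^{5} w_i},
\qquad s_i \in [1, 10], \quad w_i \ge 0

% Hypothetical example: sub-scores (9, 6, 8, 7, 5) in table order,
% with assumed weights (0.30, 0.20, 0.20, 0.15, 0.15) that sum to 1:
S = 0.30 \cdot 9 + 0.20 \cdot 6 + 0.20 \cdot 8 + 0.15 \cdot 7 + 0.15 \cdot 5
  = 2.70 + 1.20 + 1.60 + 1.05 + 0.75
  = 7.3
```

Dividing by the sum of the weights keeps the headline number on the same 1-to-10 scale as the sub-scores, even if a given set of weights does not add up to exactly 1.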

Capability is the obvious one, but it’s also where most marketing pages lie by omission. We don’t score the feature list. We score whether the feature survives contact with a messy, real workload — the kind you’d actually throw at it on a Wednesday afternoon.

Time-to-value is the dimension readers underrate most. A tool that scores a 9 on capability but takes two days to configure is, for most people, worse than a 7 that works in ten minutes. We measure this from a cold start: new account, no prior setup, clock running.

Pricing honesty is separate from price. A tool can be expensive and honest, or cheap and dishonest. We penalize the gap between the number on the pricing page and the number on your invoice — seat minimums you discover at checkout, an export locked behind the next tier up, a free plan that throttles the one feature you came for.

Lock-in cost asks a single question: if you wanted to leave in a year, how much would it hurt? Tools that export clean, open formats score well here. Tools that trap your data in a shape only they can read score badly, no matter how good the rest of the experience is.

How we weight them (and why the weights move)

A fixed weighting would be easier to defend and worse for you. The right weight depends on what the tool is for and who’s using it.

For an infrastructure tool a team will run in production, reliability and lock-in cost carry the most weight — a flaky database or a proprietary log format is a problem you live with for years. For a quick AI utility a solo developer might use for a single project, time-to-value and pricing honesty matter more, and lock-in barely registers because you’re not betting your stack on it.

So the weights shift by category. We publish the weighting we used at the top of each review’s scorecard, so a 7.5 in one category and a 7.5 in another aren’t pretending to be the same measurement. They’re not.
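
As a rough illustration of why two identical headline numbers are not the same measurement, the sketch below runs one set of sub-scores through two weight profiles. The category names, the percentage weights, and the `headline_score` helper are all hypothetical; the actual weights are whatever each review's scorecard lists.

```python
# Hypothetical weight profiles, in whole percentages that sum to 100.
# These numbers are assumptions for illustration, not published weights.
DIMENSIONS = (
    "capability",
    "time_to_value",
    "reliability",
    "pricing_honesty",
    "lock_in_cost",
)

WEIGHTS_BY_CATEGORY = {
    # Production infrastructure: reliability and lock-in dominate.
    "infrastructure": {
        "capability": 20, "time_to_value": 10, "reliability": 30,
        "pricing_honesty": 10, "lock_in_cost": 30,
    },
    # Quick solo utility: time-to-value and pricing honesty dominate.
    "quick_utility": {
        "capability": 20, "time_to_value": 35, "reliability": 10,
        "pricing_honesty": 30, "lock_in_cost": 5,
    },
}


def headline_score(sub_scores, category):
    """Weighted average of the five sub-scores, rounded to one decimal."""
    weights = WEIGHTS_BY_CATEGORY[category]
    total_weight = sum(weights[d] for d in DIMENSIONS)
    weighted_sum = sum(weights[d] * sub_scores[d] for d in DIMENSIONS)
    return round(weighted_sum / total_weight, 1)


# The same hypothetical tool, scored once, blended two ways.
sub_scores = {
    "capability": 8, "time_to_value": 9, "reliability": 6,
    "pricing_honesty": 8, "lock_in_cost": 5,
}

infra = headline_score(sub_scores, "infrastructure")
utility = headline_score(sub_scores, "quick_utility")
print(infra, utility)  # 6.6 8.0
```

The same five sub-scores land more than a full point apart depending on the profile, which is why the weighting needs to be read alongside the number.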

We keep the rubric, the per-category weights, and every tool’s sub-scores in a single shared workspace so the scoring stays consistent from one review to the next. If you’re building your own evaluation process — for a team tool bake-off, a vendor shortlist, or your own writing — a structured doc that forces every option through the same columns beats a folder of scattered notes.

Notion

What we use to keep the rubric, weights, and per-tool scorecards in one place so reviews stay consistent. A database with fixed columns turns 'this feels better' into something you can sort and compare.

Free for personal use; paid plans from $10/user/mo

Affiliate link · We earn a commission at no cost to you.

Where scores fall short

A rubric is a tool, and like every tool it has a range outside of which it produces nonsense. We’d rather tell you where ours breaks than pretend it doesn’t.

The first limit is taste. Some tools are technically strong and genuinely unpleasant to use, and “unpleasant” resists a 1-to-10 score. We fold it into capability when it affects real work, but a review’s prose will always carry nuance the number can’t.

The second limit is timing. Scores are snapshots. A tool we rated a 6 last quarter may ship the exact feature that was dragging it down, and until we re-test, the published number is stale. We date every score and re-review when something material changes — but between those points, trust the date as much as the digit.

The third limit is you. Our weights encode an average reader who doesn’t exist. If you’re cost-sensitive, mentally raise the pricing honesty weight. If you’re building something you’ll maintain for five years, raise reliability and lock-in cost. The sub-scores are there precisely so you can re-blend them for your own situation instead of inheriting ours.
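
If you want to do that re-blending with actual arithmetic, a minimal sketch might look like the following. The `reblend` function, the base weights, and the sub-scores are all made up for illustration; swap in the sub-scores and weights from the scorecard you are reading.

```python
def reblend(sub_scores, base_weights, multipliers):
    """Scale selected weights by a multiplier, then take the weighted average.

    Dimensions missing from `multipliers` keep their original weight.
    """
    adjusted = {
        dim: weight * multipliers.get(dim, 1.0)
        for dim, weight in base_weights.items()
    }
    total_weight = sum(adjusted.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    weighted_sum = sum(adjusted[dim] * sub_scores[dim] for dim in adjusted)
    return round(weighted_sum / total_weight, 1)


# Hypothetical scorecard: strong tool, weak pricing honesty.
sub_scores = {
    "capability": 9, "time_to_value": 8, "reliability": 7,
    "pricing_honesty": 4, "lock_in_cost": 6,
}
# Assumed published weights, in whole percentages.
base_weights = {
    "capability": 30, "time_to_value": 20, "reliability": 20,
    "pricing_honesty": 15, "lock_in_cost": 15,
}

published = reblend(sub_scores, base_weights, {})
cost_sensitive = reblend(sub_scores, base_weights, {"pricing_honesty": 2.0})
long_term = reblend(
    sub_scores, base_weights, {"reliability": 2.0, "lock_in_cost": 2.0}
)
print(published, cost_sensitive, long_term)  # 7.2 6.8 7.0
```

Multipliers are used here instead of hand-edited weights so the adjustment stays relative to the published weighting, and dividing by the adjusted total keeps the result on the 1-to-10 scale.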

The goal was never to hand you a single digit and call it objectivity. It’s to make our judgment legible — to show the inputs, the weights, and the seams — so you can take what’s useful and override the rest.

FAQ

Do tools pay to get a higher score?
No. Affiliate relationships affect whether we earn a commission when you click through, not what number a tool receives. The rubric is applied the same way to tools we earn from and tools we don't, and the sub-scores are published so a suspiciously inflated rating would be easy to spot.

Why does the same score mean different things in different reviews?
Because the per-category weights move. Reliability counts for more in an infrastructure review than in a quick-utility review, so an 8.0 is a blend of different ingredients each time. We publish the weighting used at the top of each scorecard so you can see what went into it.

How often do you re-score a tool?
When something material changes (a major release, a pricing change, or a reliability issue we observe over time), and on a periodic pass otherwise. Every score carries the date it was last assessed, so you can judge how current it is before you rely on it.