How Pickuma Reviews Developer Tools: Our Testing Methodology

The structured process behind every review — minimum usage requirements, evaluation criteria, benchmark reproducibility, and the decision framework for when we reject a tool rather than reviewing it.

Every review on Pickuma follows a testing methodology that I formalized in January 2026 after the first few articles and have refined since. This article describes the process, the criteria, and the rejection framework — the tools I decided not to review and why.

I am publishing this for two reasons. First, transparency: if you are going to make a purchasing decision based on something I wrote, you should know exactly how I arrived at that assessment. Second, reproducibility: a review without a stated methodology is indistinguishable from an opinion, and opinions about developer tools are cheap.

The Minimum Usage Requirement

Every tool reviewed on Pickuma goes through a minimum usage period before a single word of the review is written. The duration varies by tool category, and the category classification determines the depth:

Editor or IDE integration (Cursor, Copilot, Codex CLI): minimum two full working weeks. I use the tool as my primary development environment for a real project — not a tutorial, not a toy, but production code with real deadlines, real colleagues, and real merge conflicts. The first three days of any editor review are noise — every tool feels magical when you open a blank file and it auto-completes. The second week is where the seams show.

Hosted service or API (Supabase, Vercel, Fly.io): minimum one full feature built end-to-end. I deploy something real, configure the production environment, set up monitoring, and then run it under production traffic for at least 48 hours. A service review that only covers the “getting started” tutorial is a documentation summary, not a review.

CLI tool or library (Bun, Biome, Drizzle ORM): minimum one project that exercises the tool’s primary use case from initialization through deployment. For a package manager, that means managing dependencies across a real project for two weeks. For a linter or formatter, that means running it on an existing codebase with 50+ files and evaluating the migration path.

Self-hosted software (Immich, Plausible, Uptime Kuma): minimum one full deployment on a real VPS or home server, configured with SSL, backups, and monitoring, running for at least one week of production use. Docker Compose in a local VM with no traffic does not count.

The Evaluation Criteria

Every review is scored against six criteria, each rated on a 1-to-10 scale. These criteria are the same across all tool categories, but the weight assigned to each changes depending on what the tool is supposed to do.

Setup and onboarding (weight: 15%). Starting from a blank directory or fresh account, how many commands, clicks, and configuration decisions are required before the tool does something useful? Time-to-first-value is measured in minutes, not hours.

Documentation quality (weight: 15%). Does the documentation answer the questions a developer will actually have after the first day? I evaluate reference completeness, tutorial quality, API doc accuracy, and whether the error messages tell you what to do next or just tell you what went wrong.

Core functionality (weight: 30%). Does the tool do what it claims to do? This is the heaviest-weighted criterion for a reason — a beautiful dashboard that fails silently on Unicode input is a worse tool than a CLI with arcane flags that works correctly every time.

Performance (weight: 15%). Benchmarked quantitatively. Build times, response latency, memory usage, disk footprint — whatever metrics are appropriate to the category. Every benchmark is run three times on the same hardware and the median value is reported.

Ecosystem and community (weight: 10%). How many GitHub issues are open? What is the average issue resolution time? How active is the contributor base? Does a Stack Overflow question from 2024 still have a relevant answer in 2026? For open-source tools, I evaluate bus factor: how many maintainers would need to walk away for the project to stall.

Pricing and licensing (weight: 15%). How does the pricing scale at 1 user, 5 users, and 50 users? Are there bait-and-switch features locked behind enterprise tiers? For open-source tools, is the license permissive (MIT, Apache 2.0) or restrictive (AGPL, SSPL), and does that restriction actually matter for the intended use case?
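The bus factor mentioned under ecosystem and community can be approximated from commit history. The sketch below is one possible proxy, not a description of the exact procedure used in reviews: it assumes bus factor means the smallest number of authors whose commits together exceed half of the total. The `bus_factor` function, the 0.5 threshold, and the commit counts are all hypothetical.

```python
def bus_factor(commits_by_author: dict[str, int], threshold: float = 0.5) -> int:
    """Smallest number of authors whose combined commits exceed
    `threshold` of all commits (an assumed proxy for bus factor)."""
    total = sum(commits_by_author.values())
    if total == 0:
        return 0
    covered = 0
    # Walk through authors from most to least active.
    ranked = sorted(commits_by_author.values(), reverse=True)
    for author_count, commits in enumerate(ranked, start=1):
        covered += commits
        if covered > threshold * total:
            return author_count
    return len(commits_by_author)


# Hypothetical commit counts, for illustration only.
example = {"alice": 620, "bob": 210, "carol": 90, "dave": 50, "erin": 30}
print(bus_factor(example))  # 1: a single author holds more than half the commits
```

A result of 1 or 2 suggests the project depends heavily on very few people, which is the situation the ecosystem score is meant to flag.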

The Rejection Framework

I reject a sizable share of the tools I start testing: about 35% do not result in a published review. Here is why, broken into categories, with each percentage given as a share of all tools tested:

Unstable during testing (18% of tools tested). The tool crashes, corrupts data, or produces incorrect output under normal usage within the minimum testing period. I do not publish “this tool is broken” reviews because by the time the review goes live, the tool may have fixed the issue. Instability during a testing window tells me the project needs more time before it is reviewable, not that it deserves a permanent negative verdict.

Too early-stage (12%). The tool works but is missing core features that the README or landing page implies it has. Early-stage tools get added to a watchlist and re-evaluated every three months. If the feature gap closes, the review proceeds.

Overlap with an existing review (5%). If I have already published a thorough review of a tool in the same category and the new tool does not differentiate itself in a meaningful way, I note the existence of the new tool but do not publish a standalone review. Readers benefit more from a high-signal comparison between established tools than from a long tail of reviews that all say the same thing.

The remaining 65% that do result in reviews pass through the full methodology. Some come out looking great, some come out looking mediocre, and the review reflects that honestly. The worst outcome for a tool is not a mediocre review — it is a non-review because it failed during testing.
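The percentages in this section can be cross-checked with a few lines of arithmetic. This sketch assumes each category figure is a share of all tools that enter testing, which is the reading under which the three categories add up to 35%. The variable names are illustrative.

```python
# Shares of all tools that enter testing, in percent (figures from this section).
rejections = {
    "unstable during testing": 18,
    "too early-stage": 12,
    "overlap with an existing review": 5,
}

rejected = sum(rejections.values())
published = 100 - rejected
print(f"rejected: {rejected}%, published: {published}%")

# The same categories expressed as a share of rejected tools only.
for reason, share in rejections.items():
    print(f"{reason}: {share / rejected:.0%} of rejections")
```

Seen this way, instability accounts for roughly half of all rejections.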

Benchmark Reproducibility

Whenever a review includes quantitative benchmarks — build speed, query latency, cold start time, bundle size — the methodology for reproducing those benchmarks is included in the review. Every benchmark meets three conditions:

First, it runs on documented hardware. I use an M2 MacBook Air with 16 GB RAM for all client-side benchmarks and a Hetzner CX22 VPS (2 vCPU, 4 GB RAM) for all server-side benchmarks. The hardware is not exotic, so anyone can reproduce the results with a comparably spec’d machine.

Second, it uses a fixed dataset or codebase. Performance benchmarks that use different inputs on different runs are not benchmarks — they are anecdotes. For database benchmarks, I use a standardized dataset of 100,000 rows across 15 tables with realistic foreign-key relationships. For build tool benchmarks, I use a standardized 200-component React application that exercises the common build paths. These datasets are published alongside the reviews in a public GitHub repository.

Third, it reports the median of three runs, not the best run. Cherry-picking the fastest result is the most common performance-misreporting pattern I have observed, and I take explicit steps to avoid it. All three runs are timed, with the first doubling as a warmup, and the median becomes the reported number.
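The median-of-three rule can be sketched as a small timing harness. This is a minimal illustration, not the harness used for published numbers: it assumes every run is timed with wall-clock time and that the median is taken over all three. The helper names `time_command` and `median_of_runs` and the placeholder workload are hypothetical.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd: list[str]) -> float:
    """Run `cmd` once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start


def median_of_runs(cmd: list[str], runs: int = 3) -> float:
    """Time `cmd` `runs` times and return the median duration."""
    durations = [time_command(cmd) for _ in range(runs)]
    return statistics.median(durations)


# Placeholder workload; swap in the real build or query command.
cmd = [sys.executable, "-c", "sum(range(1_000_000))"]
print(f"median of 3 runs: {median_of_runs(cmd):.3f} s")
```

Reporting the median rather than the minimum means one unusually fast or slow run cannot decide the published figure.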

How the Scoring Works in Practice

A concrete example helps. When I reviewed the Cursor IDE, the scores worked out as follows:

Setup and onboarding: 8/10. The installer works and the first AI completion appears within two minutes.

Documentation: 6/10. The getting-started guide is solid, but the advanced configuration docs assume familiarity with VS Code extension internals that many Cursor users do not have.

Core functionality: 9/10. The AI completions and chat features work reliably and the model-switching UX is fast.

Performance: 7/10. The editor is Electron-based and cold start is 4.2 seconds, which is slow compared to native editors but acceptable given the AI feature set.

Ecosystem: 8/10. Cursor is built on the VS Code extension ecosystem, so most VS Code extensions work.

Pricing: 7/10. $20/month is competitive, but the free tier limits completions aggressively and the jump from free to paid has no intermediate step.

The weighted score comes out to 7.7, which rounds to 8 and matches the overall assessment in the review: Cursor is the best AI coding editor available in 2026, but it is not perfect, and its weaknesses are in the areas that matter most over long usage periods — documentation depth and performance under sustained load.
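The arithmetic behind the 7.7 can be checked with a short sketch. The weights and scores are the ones stated in this article; the dictionary keys and the `weighted_score` function are shorthand chosen for illustration.

```python
# Criterion weights in percent, as listed under the evaluation criteria.
WEIGHTS = {
    "setup": 15,
    "documentation": 15,
    "core": 30,
    "performance": 15,
    "ecosystem": 10,
    "pricing": 15,
}

# Scores from the Cursor example, each on a 1-to-10 scale.
CURSOR_SCORES = {
    "setup": 8,
    "documentation": 6,
    "core": 9,
    "performance": 7,
    "ecosystem": 8,
    "pricing": 7,
}


def weighted_score(scores: dict[str, int], weights: dict[str, int]) -> float:
    """Weighted average of the scores, with weights given in percent."""
    return sum(scores[name] * weight for name, weight in weights.items()) / 100


total = weighted_score(CURSOR_SCORES, WEIGHTS)
print(total)         # 7.7
print(round(total))  # 8
```

Because core functionality carries twice the weight of most other criteria, the 9/10 there contributes 2.7 of the 7.7 points on its own.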

The scores are included in my testing notes but not published in the review, because raw numbers without the narrative explanation are misleading. A reader seeing “Documentation 6/10” without the context that this score reflects the gap between beginner and advanced documentation would draw the wrong conclusion.

The Category-Specific Criteria

While the six evaluation criteria are universal, some categories get additional scrutiny in specific areas. For self-hosted software, I add a deployment difficulty criterion: how many steps, configuration files, and debugging cycles are required to go from git clone to a running, production-ready instance with SSL, backups, and monitoring. For CLI tools, I add a composability criterion: how well the tool integrates with other Unix tools through pipes, exit codes, and standard output formatting. For APIs and SDKs, I add an error-handling criterion: whether, when the API returns an error, the SDK gives you enough information to fix the problem without opening a browser tab.

These additional criteria do not change the scoring. They exist to ensure I test the right things for the right category. A CLI tool that is fast but produces output in a format that breaks standard Unix pipeline composition is a bad CLI tool, regardless of how well it scores on the six universal criteria.
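The composability criterion for CLI tools lends itself to a scripted check. The sketch below is one way to do it, under stated assumptions: the tool is expected to exit with 0 on success, exit non-zero on failure, keep error text off standard output, and emit machine-readable JSON on success. The JSON expectation, the `check_composability` helper, and the stand-in commands are hypothetical. A real check would call the tool under review instead.

```python
import json
import subprocess
import sys


def check_composability(ok_cmd: list[str], failing_cmd: list[str]) -> dict[str, bool]:
    """Run a succeeding and a failing invocation and report pipeline-friendliness."""
    ok = subprocess.run(ok_cmd, capture_output=True, text=True)
    bad = subprocess.run(failing_cmd, capture_output=True, text=True)
    results = {
        "success exits with 0": ok.returncode == 0,
        "failure exits non-zero": bad.returncode != 0,
        "failure keeps stdout clean": bad.stdout == "",
        "failure reports on stderr": bad.stderr.strip() != "",
    }
    try:
        json.loads(ok.stdout)
        results["stdout parses as JSON"] = True
    except json.JSONDecodeError:
        results["stdout parses as JSON"] = False
    return results


# Stand-in commands that imitate a well-behaved CLI tool.
ok_cmd = [sys.executable, "-c", "import json; print(json.dumps({'files': 3}))"]
failing_cmd = [sys.executable, "-c", "import sys; sys.exit('config not found')"]

results = check_composability(ok_cmd, failing_cmd)
for check, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {check}")
```

A tool that prints errors to standard output or exits with 0 after a failure would fail these checks, and it would also break any shell pipeline built on top of it.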

FAQ

Do you use AI to evaluate tools or write benchmarks?
AI is used for drafting and structuring the review text, never for the evaluation itself. Every score, benchmark, and qualitative assessment comes from direct hands-on testing. The AI assists with organizing notes, drafting comparison tables, and structuring arguments — the kind of work where faster output does not compromise accuracy. It never generates opinions about tools it has not used, because it cannot use tools.
How do you handle tools that release updates mid-review?
If the update changes the core functionality I am testing, I restart the evaluation from the new version. This has happened twice — once with a major Supabase launch that landed three days into testing, and once with a Cursor release that changed the AI model behavior significantly. In both cases, I extended the review timeline rather than publishing against an obsolete version.
Do you accept free licenses or access from tool vendors?
I accept temporary evaluation licenses if the tool has no free tier that covers the testing scope. This has happened with three enterprise tools reviewed on the site. Every instance is disclosed in the review. I reject sponsored reviews — no vendor pays for coverage, and no vendor sees the review before publication.
What is the most common reason a tool scores poorly?
Documentation that omits error states. Most tools document the happy path beautifully — install, configure, deploy, done. The best tools document what happens when the deploy fails, when the configuration is wrong, and when the user is running an unsupported version. The gap between documented happy-path and undocumented error-path is the single largest predictor of a tool's real-world usability, and it is where most tools lose points.
