Why Half of Claude Skills Don't Work — Data From Testing 45+ of Them

We’ve now put 45+ Claude skills through the full test protocol — clean install, trigger checks, real-work tasks scored against a no-skill baseline. The headline finding is the one this site was built on: a large share of published skills fail before they ever help anyone. Here’s where they fail, with the patterns to check before you waste an evening.

Failure mode 1: the install that isn’t (≈30% of failures)

The dumbest breakage is also one of the most common: following the author’s own README on a fresh setup doesn’t produce a working skill. Causes, in order of frequency: wrong directory depth in the instructions (the skill lands at skills/name/name/SKILL.md instead of skills/name/SKILL.md), undeclared dependencies (a “PDF skill” that silently requires system libraries), and instructions written for an old Claude Code version.

Spot it early: a README whose install section is one vague line. Compare with the install blocks on skills that scored 5/5, like DOCX — specific paths, specific commands.
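The directory-depth problem is easy to check mechanically after an install. Here is a minimal sketch in Python, assuming each skill should sit exactly one folder below your skills root (skills/name/SKILL.md). The function name and the root you pass in are illustrative, not part of any official tooling.

```python
from pathlib import Path


def find_misplaced_skills(skills_root):
    """Return SKILL.md files that are not exactly one folder below skills_root."""
    root = Path(skills_root)
    misplaced = []
    for skill_file in sorted(root.rglob("SKILL.md")):
        # Expected layout: <root>/<skill-name>/SKILL.md, i.e. two path parts.
        depth = len(skill_file.relative_to(root).parts)
        if depth != 2:
            misplaced.append(skill_file)
    return misplaced
```

Call it with wherever your setup keeps skills, for example find_misplaced_skills(Path.home() / ".claude" / "skills") (that path is an assumption; adjust it to your install). Any path it returns is a skill that probably landed one level too deep or too shallow.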

Failure mode 2: the trigger that never fires (≈25%)

The skill installs fine and then… nothing. Claude never activates it, because the frontmatter description, the main text Claude relies on when deciding whether to load a skill, is written like a marketing tagline instead of a trigger specification. “Supercharge your workflow” matches no user request ever typed.

Spot it early: open SKILL.md before installing (it’s just markdown on GitHub). If the description doesn’t name concrete phrases you’d actually type, the skill will sit dead in your directory. We score this as “triggers reliably” — anything below 4/5 comes with a workaround in our test notes.
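That pre-install read can be roughed out in code. The sketch below is a heuristic, not our scoring method: it pulls a single-line description out of the frontmatter and flags tagline wording or a missing “use when” style cue. The word lists are hypothetical and worth tuning, and multi-line YAML descriptions are not handled.

```python
import re

# Hypothetical word lists for illustration; tune them to your own requests.
BUZZWORDS = {"supercharge", "revolutionize", "seamless", "powerful", "next-level"}
TRIGGER_CUES = ("use when", "use this when", "when the user", "asks to", "asked to")


def read_description(skill_md_text):
    """Pull the single-line `description:` value out of YAML frontmatter."""
    match = re.match(r"^---\s*\n(.*?)\n---\s*(\n|$)", skill_md_text, re.DOTALL)
    if not match:
        return None
    for line in match.group(1).splitlines():
        if line.startswith("description:"):
            return line[len("description:"):].strip().strip("\"'")
    return None


def description_warnings(description):
    """Return human-readable warnings for a frontmatter description."""
    if not description:
        return ["no description found in frontmatter"]
    warnings = []
    lowered = description.lower()
    hits = sorted(word for word in BUZZWORDS if word in lowered)
    if hits:
        warnings.append("marketing language: " + ", ".join(hits))
    if not any(cue in lowered for cue in TRIGGER_CUES):
        warnings.append("no trigger cue such as 'use when ...'")
    return warnings
```

A description like “Supercharge your workflow” trips both checks. One that names the file type and says “use when the user asks to edit a Word document” trips neither. An empty warning list is not a guarantee the skill will trigger, only a sign the author wrote the description as a specification.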

Failure mode 3: output ≤ baseline (≈35%)

The subtlest failure: the skill triggers, does something, and the result is no better than Claude with no skill at all. Common in categories where the base model is already strong (generic “writing improvement” skills) and in skills that are just a system prompt saying “be excellent.”

This is why our scoring weights output vs. baseline at 10 of 25 points, double the weight of any other criterion. A skill must beat naked Claude on a real task or it has no reason to exist. In our before/after examples you can see what an earned delta looks like.
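To see what that weighting does to a total, here is a small worked sketch. Only the 10-of-25 weight is stated above. The split of the remaining 15 points into three 5-point criteria, and the “install” label for the third, are assumptions made for illustration.

```python
# Only output_vs_baseline = 10 of 25 comes from the text above.
# The three 5-point criteria and the "install" label are assumptions.
MAX_POINTS = {
    "output_vs_baseline": 10,
    "install": 5,
    "triggers_reliably": 5,
    "docs_and_honesty": 5,
}


def total_score(points):
    """Sum per-criterion points after checking each against its maximum."""
    if set(points) != set(MAX_POINTS):
        raise ValueError("score every criterion exactly once")
    for name, value in points.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(points.values())


# A skill that installs, triggers, and documents itself perfectly
# but adds nothing over the baseline.
no_delta = {
    "output_vs_baseline": 0,
    "install": 5,
    "triggers_reliably": 5,
    "docs_and_honesty": 5,
}
print(total_score(no_delta), "of", sum(MAX_POINTS.values()))
```

Under these assumed weights, a skill that does everything right except beat the baseline tops out at 15 of 25, which is the point of weighting that criterion double.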

Failure mode 4: honesty problems (≈10%)

Hidden network calls, undisclosed telemetry, prompt-injection-shaped instructions buried mid-file. Rare, but the reason “docs & honesty” is a scored criterion and not a nice-to-have. We read every SKILL.md we list. You should too, for anything outside a tested directory.
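A quick scan can point you at the lines worth reading first. This sketch is a reading aid under stated assumptions, not a security tool: the patterns are illustrative, they will miss anything phrased differently, and a clean result does not mean a skill is safe.

```python
import re

# Illustrative patterns only; a clean scan does not prove a skill is safe.
RED_FLAGS = {
    "network call": re.compile(r"\b(curl|wget)\b|https?://", re.IGNORECASE),
    "injection-style phrasing": re.compile(
        r"ignore (all )?(previous|prior) instructions|do not (tell|inform|mention)",
        re.IGNORECASE,
    ),
}


def scan_skill_text(text):
    """Return (line_number, label, line) for each line matching a red-flag pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in RED_FLAGS.items():
            if pattern.search(line):
                findings.append((number, label, line.strip()))
    return findings
```

Feed it the contents of a SKILL.md and read every flagged line in context. Plenty of legitimate skills link to documentation, so a hit is a prompt to look, not a verdict.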

The meta-lesson

None of this means the ecosystem is bad — it means it’s young and unfiltered, like browser extensions circa 2010. The genuinely great skills (the 9+ tier) are transformative. The problem is that nothing in a GitHub star count tells you which tier you’re looking at: several skills we failed have hundreds of stars, and some of our highest scores have almost none.

Stars measure marketing. Tests measure whether it works. That’s the entire thesis of SkillProof.