Specshift

Your docs have a retrieval score. You just don’t know it yet.

Specshift evaluates how well a documentation site serves an LLM trying to use the platform the docs describe. Not “are the docs pretty.” Whether an agent reading these docs ends up at a correct answer, a working call, a passing build.

What we measure

retrieval (weight 1.0): Can a vector index over these docs answer realistic developer questions?
agent (weight 1.0): Can a tool-use agent complete a representative task using only these docs?
structure (weight 1.0): Are the docs organised so a model can navigate without context overflow?
drift (weight 1.0): Do the docs describe the platform's actual current behaviour?
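
To give a rough sense of how these suites and their weights might fit together, here is a minimal TypeScript sketch. The identifiers (SuiteId, suiteWeights) are illustrative assumptions, not Specshift's actual configuration format.

    // Hypothetical v1 suite configuration; identifiers are illustrative,
    // not Specshift's real config schema.
    type SuiteId = "retrieval" | "agent" | "structure" | "drift";

    const suiteWeights: Record<SuiteId, number> = {
      retrieval: 1.0, // vector-index question answering
      agent: 1.0,     // tool-use task completion
      structure: 1.0, // navigation without context overflow
      drift: 1.0,     // docs vs. actual platform behaviour
    };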

Future suite: oscal (compliance crosswalk for FedRAMP-targeting platforms). Deferred to a later methodology version — ships after the first 3PAO partnership lands.

How scoring works

Every test produces a score in [0, 1]. NaN means the test couldn’t run (timeout, infrastructure failure) and is excluded from the suite roll-up. Suite scores are weighted averages of their tests' scores. The overall score is the weighted average of suite scores.
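
To make the roll-up concrete, here is a minimal TypeScript sketch of a weighted average with NaN exclusion. The names (Scored, rollUp) are illustrative, and it assumes an excluded test's weight also drops out of the denominator, which the methodology text above does not spell out.

    // Illustrative roll-up: a weighted average that skips NaN scores.
    interface Scored {
      score: number;  // in [0, 1], or NaN if the test could not run
      weight: number;
    }

    function rollUp(results: Scored[]): number {
      // Assumption: excluded (NaN) entries drop out of both the numerator
      // and the denominator, so they neither raise nor lower the average.
      const ran = results.filter((r) => !Number.isNaN(r.score));
      const totalWeight = ran.reduce((sum, r) => sum + r.weight, 0);
      if (totalWeight === 0) return NaN; // nothing ran
      return ran.reduce((sum, r) => sum + r.weight * r.score, 0) / totalWeight;
    }

    // Suites roll up the same way the overall score does: each suite score
    // becomes one weighted entry in the top-level average.
    const retrieval = rollUp([{ score: 0.8, weight: 1 }, { score: NaN, weight: 1 }]); // 0.8
    const overall = rollUp([
      { score: retrieval, weight: 1.0 },
      { score: 0.6, weight: 1.0 }, // agent, for example
    ]); // 0.7

In this sketch, a suite whose tests all failed to run rolls up to NaN itself and drops out of the overall score in the same way.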

Reports are reproducible. Every scorecard pins both the methodology version and the engine version. Re-running the same suite on the same target with the same engine and methodology yields the same score, or the methodology documents the source of randomness.
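
As a rough illustration of what that pinning implies for a report, here is a hypothetical scorecard shape in TypeScript. The field names are assumptions, not the actual report schema.

    // Hypothetical scorecard fields implied by the reproducibility guarantee.
    interface Scorecard {
      target: string;               // e.g. "https://docs.example.com"
      methodologyVersion: string;   // pinned methodology, e.g. "v1"
      engineVersion: string;        // pinned engine build that produced the run
      suiteScores: Record<string, number>; // per-suite score, NaN if the suite could not run
      overall: number;              // weighted average of suite scores
      randomnessSources?: string[]; // documented when a run is not fully deterministic
    }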

How to run it

npx @wentzel/specshift run https://docs.example.com --methodology=v1

Or paste a URL into /eval and we’ll run it for you. You get a public report URL and an SVG badge for your README.