Specshift
Your docs have a retrieval score. You just don’t know it yet.
Specshift evaluates how well a documentation site serves an LLM trying to use the platform the docs describe. Not “are the docs pretty.” Whether an agent reading these docs ends up at a correct answer, a working call, a passing build.
What we measure
Future suite: oscal (compliance crosswalk for FedRAMP-targeting platforms). Deferred to a later methodology version — ships after the first 3PAO partnership lands.
How scoring works
Every test produces a score in [0, 1]. NaN means the test couldn’t run (timeout, infrastructure failure) and is excluded from the suite roll-up. Suite scores are weighted averages of their tests. The overall score is the weighted average of suite scores.
Reports are reproducible. Every scorecard pins both the methodology version and the engine version. Re-run the same suite on the same target with the same engine and methodology and you get a deterministic score — or the methodology documents the source of randomness.
How to run it
npx @wentzel/specshift run https://docs.example.com --methodology=v1

Or paste a URL into /eval and we’ll run it for you. You get a public report URL and an SVG badge for your README.