Specshift methodology v1.0
Public mirror of the scoring rules. Published at pragma.wentzel.ai/specshift/methodology and updated on every version bump.
What Specshift measures
How well a documentation site serves an LLM trying to use the platform the docs describe. Not "are the docs pretty" — "would an agent reading these docs end up at a correct answer / a working call / a passing build?"
Suites (v1)
| Suite | Question it answers | Default weight |
|---|---|---|
| retrieval | Can a vector index over these docs answer realistic developer questions? | 1.0 |
| agent | Can a tool-use agent complete a representative task using only these docs? | 1.0 |
| structure | Are the docs organised so a model can navigate without context overflow? | 1.0 |
| drift | Do the docs describe the platform's actual current behaviour? | 1.0 |
Future suite: oscal (compliance crosswalk for FedRAMP-targeting platforms). Deferred to v0.3; it ships after the first 3PAO partnership lands (PRAG/09).
Scoring
Every Test produces a score in [0, 1]. NaN means the test could not run (e.g. timeout, infrastructure failure) and is excluded from the suite roll-up. Suite scores are weighted averages of their tests; the overall score is the weighted average of suite scores.
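A minimal sketch of that roll-up, assuming an illustrative TestResult shape (the engine's actual types and field names may differ):

```typescript
// Sketch of the v1.0 roll-up contract. Field names are illustrative.
interface TestResult {
  score: number; // in [0, 1], or NaN if the test could not run
  weight: number;
}

// Weighted average that excludes NaN entries, per the suite roll-up rule.
function rollUp(results: TestResult[]): number {
  const runnable = results.filter((r) => !Number.isNaN(r.score));
  if (runnable.length === 0) return NaN; // nothing ran: suite is NaN
  const totalWeight = runnable.reduce((sum, r) => sum + r.weight, 0);
  const weighted = runnable.reduce((sum, r) => sum + r.score * r.weight, 0);
  return weighted / totalWeight;
}

// Suites roll up the same way: each suite's score becomes a
// TestResult-shaped entry fed back into rollUp.
const overall = rollUp([
  { score: 0.8, weight: 1.0 }, // retrieval
  { score: NaN, weight: 1.0 }, // agent: could not run, excluded (v1.0 rule)
  { score: 0.6, weight: 1.0 }, // structure
]); // => 0.7
```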
Reports are reproducible: methodologyVersion + engineVersion are pinned in every Scorecard. Re-running the same suite on the same target with the same engine + methodology MUST produce a deterministic score (or document the source of randomness).
Disputes
Customers may dispute a score per the public correction-and-dispute policy at PRAG/06. Each dispute writes a tamper-evident audit-chain record (PRAG-019) — the chain hash is published alongside the public ruling.
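A hash chain of that shape can be sketched as follows; the record fields and SHA-256 scheme here are assumptions for illustration, and PRAG-019 defines the actual record format.

```typescript
import { createHash } from "node:crypto";

// Hypothetical dispute record shape; PRAG-019 defines the real one.
interface DisputeRecord {
  disputeId: string;
  ruling: string;
  prevHash: string; // hash of the previous record, making the chain tamper-evident
  hash: string; // hash over this record's content plus prevHash
}

function appendRecord(
  chain: DisputeRecord[],
  disputeId: string,
  ruling: string
): DisputeRecord {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : "genesis";
  const hash = createHash("sha256")
    .update(`${prevHash}|${disputeId}|${ruling}`)
    .digest("hex");
  const record = { disputeId, ruling, prevHash, hash };
  chain.push(record);
  // Publishing `hash` alongside the ruling lets anyone verify the chain.
  return record;
}
```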
Versioning
- Bumping the methodology version is a load-bearing event. The methodology-pin CI gate (PRAG-016) blocks merges that change rules without:
  - A new entry in VERSIONS (the registry)
  - A CHANGELOG.md entry describing what changed and why
  - A limitations.md update describing what's still NOT measured
See limitations.md for what v1.0 does NOT measure.
v1 patches
v1.1 (2026-05-04) — non-breaking scoring patch
This patch makes the suite implementations representative for modern SPA-style developer documentation without changing the public scoring contract (Test still emits [0,1], Suite still rolls up via weighted average, methodology version pin remains v1.0). Changes:
- Retrieval + Agent corpora: each expects/successCriteria entry is now a synonym group; ANY synonym in the group counts as a hit (see the matching sketch after this list). The previous single-phrase form is still accepted. This stops Specshift penalising sites for using "auth token" instead of "api key" when both refer to the same concept.
- Retrieval + Agent fetch: the suites now visit a small bounded set of in-domain links from the landing (≤6 sub-pages, doc-shaped paths only: quickstart, auth, api, reference, sdk, etc.) and, when the landing is a marketing splash with a discoverable /docs entrypoint, pivot to that root before crawling (sketched below). SPA shells whose prose only renders client-side are no longer scored as if their HTML were the entire docs surface.
- Retrieval + Agent partial credit: once at least one expected concept is matched, the test scores no lower than 0.5; a 1-of-3 match shouldn't drop the test to 33%, because that conflates retrieval recall with a structural defect. A zero hit still scores 0. The floor is applied in the matching sketch below.
- Structure: heuristics return partial credit instead of binary scores; the sitemap check is now NaN (informational, not a hard zero) when no sitemap is found; anchored-headings is skipped on SPA shells with zero anchored ids in the initial HTML; and a new link-density test (sketched below) rewards landings that surface enough internal links to be navigable. Like retrieval, structure also pivots to /docs when the landing is a marketing splash.
- Drift: added https-canonical and landing-reachable checks so the suite produces a real score on sites that don't surface the 401 claim language on the landing. Tightened the robots.txt check so Disallow: /api/ no longer trips a site-wide-block false positive (see the robots.txt sketch below).
- Engine: when every test in a suite is NaN, the suite score is still NaN (so dashboards can show "incomplete") but contributes a neutral 0.5 to the overall roll-up rather than vanishing entirely; failed suites must pull the overall down, not silently inflate it (sketched below).
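A minimal sketch of the synonym-group matching and the partial-credit floor; the SynonymGroup shape, scoreRetrieval name, and substring normalisation are illustrative, and the real corpus fields may differ.

```typescript
// Each expected concept is a synonym group: any member counts as a hit.
// Single-phrase entries from the old corpus form become one-element groups.
type SynonymGroup = string[];

function scoreRetrieval(docText: string, expects: SynonymGroup[]): number {
  const haystack = docText.toLowerCase();
  const hits = expects.filter((group) =>
    group.some((phrase) => haystack.includes(phrase.toLowerCase()))
  ).length;
  if (hits === 0) return 0; // a zero hit still scores 0
  // Partial-credit floor: any match scores at least 0.5.
  return Math.max(0.5, hits / expects.length);
}

// "auth token" and "api key" in one group no longer penalise either wording:
scoreRetrieval("Pass your auth token in the Authorization header.", [
  ["api key", "auth token"],
  ["rate limit"],
  ["webhook"],
]); // => 0.5 (1 of 3 groups hit, floored at 0.5 rather than 0.33)
```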
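A sketch of the bounded fetch under stated assumptions: the DOC_SHAPED allow-list, the pivotRoot helper, and the splash-pivot shortcut are all illustrative, not the shipped crawler.

```typescript
// Hypothetical path filter for the bounded crawl; the real allow-list
// may differ, but the shape matches the bullet above.
const DOC_SHAPED = /\/(docs|quickstart|auth|api|reference|sdk|guides?)(\/|$)/i;
const MAX_SUBPAGES = 6;

function pickCrawlTargets(landingUrl: string, links: string[]): string[] {
  const origin = new URL(landingUrl).origin;
  const inDomain = links
    .map((href) => new URL(href, landingUrl)) // resolve relative links
    .filter((u) => u.origin === origin) // in-domain only
    .filter((u) => DOC_SHAPED.test(u.pathname)); // doc-shaped paths only
  return inDomain.slice(0, MAX_SUBPAGES).map((u) => u.toString());
}

// Marketing-splash pivot: if the landing links to a /docs root, crawl from
// there instead of treating the splash HTML as the docs surface.
function pivotRoot(landingUrl: string, links: string[]): string {
  const docsLink = pickCrawlTargets(landingUrl, links).find((u) =>
    new URL(u).pathname.startsWith("/docs")
  );
  return docsLink ?? landingUrl;
}
```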
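The link-density heuristic, sketched with an assumed threshold (the published constant and ramp may differ):

```typescript
// Hypothetical link-density heuristic: landings with enough in-domain links
// to navigate earn partial credit that ramps up to 1.0.
function linkDensityScore(internalLinkCount: number): number {
  const NAVIGABLE_AT = 10; // assumed threshold, not the published value
  return Math.min(1, internalLinkCount / NAVIGABLE_AT);
}

linkDensityScore(3); // => 0.3 (sparse landing, partial credit)
linkDensityScore(14); // => 1   (plenty of internal links to navigate)
```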
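A sketch of the tightened robots.txt check; the parsing here is deliberately minimal and only distinguishes a root-level Disallow: / for a wildcard agent from sub-tree rules.

```typescript
// Only a root-level "Disallow: /" inside a "User-agent: *" group counts as
// a site-wide block. "Disallow: /api/" blocks a sub-tree and must not trip
// the false positive.
function isSiteWideBlock(robotsTxt: string): boolean {
  let inWildcardGroup = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const [field, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    if (/^user-agent$/i.test(field)) inWildcardGroup = value === "*";
    else if (inWildcardGroup && /^disallow$/i.test(field) && value === "/") {
      return true; // blanket block on the whole site
    }
  }
  return false;
}

isSiteWideBlock("User-agent: *\nDisallow: /api/"); // => false (sub-tree only)
isSiteWideBlock("User-agent: *\nDisallow: /"); // => true
```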
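And the engine change, sketched against the roll-up from the Scoring section (overallScore is an illustrative name):

```typescript
// v1.1 overall roll-up: an all-NaN suite stays NaN for display but enters
// the overall average as a neutral 0.5 instead of being dropped.
function overallScore(suites: { score: number; weight: number }[]): number {
  const total = suites.reduce((s, r) => s + r.weight, 0);
  const weighted = suites.reduce(
    (s, r) => s + (Number.isNaN(r.score) ? 0.5 : r.score) * r.weight,
    0
  );
  return weighted / total;
}

// Dropping the failed suite (the v1.0 rule) would average the two runnable
// suites to 0.9; the v1.1 rule lets the failure pull the overall down.
overallScore([
  { score: 0.9, weight: 1.0 }, // retrieval
  { score: NaN, weight: 1.0 }, // agent: every test was NaN
  { score: 0.9, weight: 1.0 }, // structure
]); // ≈ 0.767
```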
The corpus on disk is still labelled v1.0 and remains replay-stable for any Scorecard pinned to methodology v1.0. v1.2 will re-evaluate whether the synonym-group form deserves a methodology version bump.