Scoring System vs. Grading System

“Scoring” and “Grading” name two distinct sub-systems that this spec keeps strictly apart: the Scoring System evidences how a single test or dimension performs, while the Grading System derives a letter mark from a collection of those scores. Each carries its own independent version namespace, so a scoring update does not automatically imply a grading update — and vice versa. A grader, scorer, or aggregator that conforms to gradingSpec/3.0.0 MUST keep both names — and both version strings — apart in every emitted artefact (grading entries, logs, error codes).

Comparison Table

The following table is the canonical, binding distinction between the two systems. Implementers MUST NOT collapse rows.

Aspect	Scoring System	Grading System
What it is	How tests are evidenced (one score per test / per dimension)	How a grade (a letter mark) is derived from a collection of scores
Input	Test outcome (HTTP status, LLM answer, imports scan, documentation match)	Collection of scores from the Scoring System
Output	Numerical score `1.0`–`5.0` or enum `pass` / `fail` / `stale` per dimension	Aggregate grade `A` / `B` / `C` / `D` / `F`, or `REJECTED` (on Categorical Veto)
What its updates change	How tests are evidenced (e.g. a new HTTP heuristic). Tracked under spec version `scoringSystem/X.Y.Z`	Thresholds, weights, tier trim, Veto list. Tracked under spec version `gradingSystem/X.Y.Z`
Reproducibility	Score is reproducible given identical inputs AND identical `scoringSystem` version	Grade is reproducible given identical scores AND identical `gradingSystem` version

Versioning Schema

The two systems carry two independent SemVer namespaces:

Namespace	Pattern	Bumped when
`scoringSystem/X.Y.Z`	`^scoringSystem/\d+\.\d+\.\d+$`	A scoring rule, heuristic, or evidence-extraction changes such that the score output for the same input changes
`gradingSystem/X.Y.Z`	`^gradingSystem/\d+\.\d+\.\d+$`	Thresholds, weights, tier trim rules, Categorical-Veto trigger list, or aggregation logic change such that the grade output for the same score input changes

SemVer semantics (MAJOR.MINOR.PATCH) apply per namespace independently. The two namespaces are NOT lock-stepped.

The third namespace gradingSpec/X.Y.Z (the document set you are reading) is versioned separately again — see 00-overview.md, “Three Independently Versioned Namespaces”.

Binding Rules

The following rules are binding for every grader, scorer, and aggregator that conforms to gradingSpec/3.0.0.

Both version strings MUST be present in every grading entry. Every grading entry (defined in 08-grading-model.md) MUST carry both scoringSystem and gradingSystem as top-level version fields. A grading entry that lacks either field is INVALID. The same two version strings are additionally surfaced in the index.json rollup (per namespace and per selection), so the version under which an aggregated node was last scored and graded is visible without opening each grading entry.
Scoring-System bump and Grading-System bump are independent triggers. A change in the Scoring System triggers a re-scoring at the next re-grading run, but does NOT automatically bump the Grading System. A change in the Grading System triggers a re-grading but does NOT automatically bump the Scoring System. Implementers MUST NOT couple the two version namespaces.
Example (HTTP-429 heuristic). Suppose the current versions are scoringSystem/1.0.0 and gradingSystem/1.0.0. A maintainer decides that HTTP 429 (“rate-limited, retry”) MUST be scored as a transient defect (not fail). The Scoring System advances to scoringSystem/1.1.0. The Grading System remains at gradingSystem/1.0.0 because the score-to-grade mapping is unchanged. New grading entries record scoringSystem/1.1.0 + gradingSystem/1.0.0; old entries remain at 1.0.0/1.0.0 until they are re-scored.

Score-to-Grade Thresholds (`gradingSystem/1.0.0`)

The aggregate grade is derived from the weighted mean of the per-answer scores. Each answer contributes a numeric value on the 1.0–5.0 scale: pass maps to 5.0, fail to 1.0, numeric scores as-is. n/a and stale answers are excluded from the mean (an all-excluded set yields no grade, i.e. a pending node). The mean is then banded:

Weighted mean	Grade
≥ 4.5	A
≥ 3.5	B
≥ 2.5	C
≥ 1.5	D
< 1.5	F

Tier trim. The banded grade is capped at the tier maximum: autonomous → B, group-bound → A. A score that would band to A on an autonomous-tier node is recorded as B (the pre-trim band is preserved as rawGrade). A Categorical Veto overrides the entire computation with REJECTED.

Changing any threshold, the tier-trim rule, or the numeric mapping bumps the gradingSystem version.

Relationship to the Schemas-Spec v4.3.0

The Schemas-Spec v4.3.0 provides the upstream contract for scoring: the prompts.json / scores.json artefact pair is defined in 22-scoring-protocol.md of the Schemas-Spec (sister repository flowmcp-spec). The Scoring System named here sub-consumes that protocol: scores produced by the Schemas-Spec scoring protocol enter this spec’s Scoring System as inputs, and the dimensions enumerated in 08-grading-model.md, “Dimension Enum”, extend that protocol with the additional grading dimensions defined here.

This Grading-Spec does NOT re-define the prompts.json / scores.json contract. Implementers MUST treat the Schemas-Spec v4.3.0 22-scoring-protocol.md as the highest instance for the artefact pair; conflicting prose in this spec is to be read as a refinement, not as a replacement.

./00-overview.md — see chapter 00.
./04-phases-single.md — see chapter 04.
./05-phases-selection.md — see chapter 05.
./06-determinism-and-tier.md — see chapter 06.
./08-grading-model.md — see chapter 08.