Grading Model

Conformance language (MUST/SHOULD/MAY) follows BCP 14 [RFC2119]/[RFC8174] as defined in 00-overview.md. The binding source is the FlowMCP Schemas Specification v4.2.0.


A grading is an array of evaluations that carries veto power, is subject to a tier trim (autonomous: max B; group-bound: max A), and can be re-triggered by the user. It is one data model with two skill families writing into it. The Categorical Veto, when present, overrides every aggregation logic and yields aggregateGrade = REJECTED.

The grading entry is the only durable artefact emitted by a grader; it MUST be valid against 08-grading-model.schema.json.


One data model, two skill families.

There is one shared data model (see fields below) and two skill families (Single-Schema-Validator + Selection-Validator) that write different Areas into this one model. Advantage: anti-drift at the spec level, clear separation of applications at the implementation level.

This architecture decision is binding. Implementers MUST NOT split the data model into two distinct types per skill family — the gradingTier field is the consumer-visible switch, the family separation lives in the implementation.


The grading entry is a JSON object with the following top-level fields. The column Required indicates MUST / SHOULD / OPTIONAL. The column Conditional captures the version-conditional rules.

| Field | Type | Required | Conditional | Description |
| --- | --- | --- | --- | --- |
| schemaId | string | MUST | | Identifier of the schema under grading. |
| selectionId | string | MUST when gradingTier=group-bound | If group-bound, REQUIRED; if autonomous, OPTIONAL | Identifier of the Selection under grading (when group-bound). |
| gradingTier | enum autonomous \| group-bound | MUST | | Tier classification (see 06-determinism-and-tier.md). |
| scoringSystem | string matching ^scoringSystem/\d+\.\d+\.\d+$ | MUST | | Scoring System version (see 07-scoring-vs-grading.md). |
| gradingSystem | string matching ^gradingSystem/\d+\.\d+\.\d+$ | MUST | | Grading System version (see 07-scoring-vs-grading.md). |
| area | enum (see Areas and Score Values) | MUST | | The Area this entry grades (const per entry). |
| gradings | array of answer entries | MUST | Minimum length 1 | The per-question answers for this Area (see gradings[] Element). |
| harness | enum ["claude-code"] | MUST | | The harness that drove the non-deterministic evaluation (see Envelope Fields). |
| persona | object { basePersonaId, lensId } | MUST for persona-bound areas | Required for persona-bound Areas (see Areas and Score Values) | The persona lens (see Envelope Fields, 12-personas-contract.md). |
| skillId | string | MUST for per-skill areas | Required for namespace-skills and selection-skills-L1/L2/L3 | The graded skill instance (see Envelope Fields). |
| categoricalVeto | object \| null | MUST (default null) | When non-null, forces aggregateGrade=REJECTED | The Categorical Veto record (see Categorical Veto). |
| regradingTrigger | object | OPTIONAL | Present iff this entry is a re-grading | The re-grading trigger that produced this entry (see Re-Grading Trigger). |
| aggregateGrade | enum A \| B \| C \| D \| F \| REJECTED | MUST | REJECTED iff categoricalVeto != null | The aggregate grade after weighted aggregation and tier trim. |
| maxAttainableGrade | enum A \| B | MUST | Derived from gradingTier | The highest grade attainable at this tier (see Tier Computation). |

A grading entry MUST contain all MUST fields, conditional fields when their condition holds, and MAY contain OPTIONAL fields. additionalProperties is false (see 08-grading-model.schema.json).
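
The cross-field rules from the table can be sketched as a plain pre-check. This is a minimal sketch, not the normative validator (that role belongs to 08-grading-model.schema.json); the function name checkEnvelope and the message strings are assumptions made for illustration.

```javascript
// Hypothetical helper: returns a list of rule violations for one grading entry.
// It covers only the cross-field rules from the table, not the full schema.
const MAX_GRADE_BY_TIER = { 'autonomous': 'B', 'group-bound': 'A' }

function checkEnvelope( entry ) {
    const problems = []

    // selectionId is REQUIRED only for group-bound entries
    if( entry.gradingTier === 'group-bound' && typeof entry.selectionId !== 'string' ) {
        problems.push( 'selectionId is required when gradingTier=group-bound' )
    }

    // gradings[] has a minimum length of 1
    if( !Array.isArray( entry.gradings ) || entry.gradings.length < 1 ) {
        problems.push( 'gradings must contain at least one answer' )
    }

    // aggregateGrade is REJECTED iff categoricalVeto is non-null
    const hasVeto = entry.categoricalVeto !== null && entry.categoricalVeto !== undefined
    const isRejected = entry.aggregateGrade === 'REJECTED'
    if( hasVeto !== isRejected ) {
        problems.push( 'aggregateGrade must be REJECTED iff categoricalVeto is non-null' )
    }

    // maxAttainableGrade is derived from gradingTier
    if( entry.maxAttainableGrade !== MAX_GRADE_BY_TIER[ entry.gradingTier ] ) {
        problems.push( 'maxAttainableGrade does not match gradingTier' )
    }

    return problems
}
```

An empty array means the entry passes these four checks; it says nothing about the remaining schema rules.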

Mandatory Fields + Hash Placement (restructured in 2.0.0)

The grading entry binds a grading to the concrete tested schema variant and makes the partial vs. full mode explicit. The fields below live on the grading entry.

| Field | Format | Example | Definition |
| --- | --- | --- | --- |
| schemaId | <namespace>.<tool> | etherscan.getContractEthereum | identifier of the schema under grading |
| version | flowmcp/4.\d+.\d+ | flowmcp/4.0.0 | Spec version (FlowMCP), frozen on major 4 |
| schemaHash | sha256, 8 chars (hex) | a1b2c3d4 | deterministic from canonical JSON (recorded here, derived) |
| gradingId | <schemaHash>--<timestamp> | a1b2c3d4--2026-05-29T15-34Z | unique grading instance |
| gradingMode | "partial" \| "full" | "full" | determines the aggregateGrade effect |
| aboutHash | sha256, 8 chars | ef56gh78 | hash of the about page (recorded here, derived) |

Hash placement (binding in 2.0.0). schemaHash and aboutHash are not part of the source schema contract. The source schema is neutral — it carries only logical names and the FlowMCP version field. The hashes are derived from the canonical content and recorded in two derived places: the grading entry (above) and the namespace/selection index.json. They never live inside the source .mjs. Rationale: an in-source hash drifts on every edit, so the recorded value stops matching the content (see 15-versioning-axes.md).

Example source schema header (neutral — no hashes, no snapshot version):

export const schema = {
    version: 'flowmcp/4.0.0',
    namespace: 'etherscan',
    name: 'getContractEthereum'
}

Example grading entry (excerpt):

{
    "gradingId": "a1b2c3d4--2026-05-29T15-34Z",
    "schemaId": "etherscan.getContractEthereum",
    "area": "single-test",
    "version": "flowmcp/4.0.0",
    "schemaHash": "a1b2c3d4",
    "gradingMode": "full",
    "aboutHash": "ef56gh78",
    "harness": "claude-code",
    "aggregateGrade": "B",
    "gradings": [ /* per-question answers */ ]
}

Cross-Refs:

The grading envelope carries three additional fields that describe how and under which lens a grading was produced.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| harness | enum ["claude-code"] | MUST | The harness that drove the non-deterministic evaluation. Currently the only allowed value is claude-code (sub-agent with a fresh, empty context, read-only tools, single pass, strict JSON). |
| persona | object { basePersonaId, lensId } | MUST for persona-bound areas | The persona under which a non-deterministic area was scored. basePersonaId ∈ ai-engineer \| decision-maker \| hackathon-builder \| schema-maintainer; lensId is the domain lens. See 12-personas-contract.md. |
| skillId | string | MUST for per-skill areas | Identifier of the graded skill instance (per-skill areas grade one skill at a time, not a level cohort; see 13-skills.md). |

harness makes the grading reproducible across drivers; persona records the lens; skillId distinguishes per-skill area instances. The deterministic answers come from code and are merged with the harness sub-agent answers into one grading entry.


Data Model — gradings[] Element (per-question answer)

Each element of the gradings[] array is a JSON object describing one answer to one question of the Area, scored by one grader at one timestamp. The element fields are:

| Field | Type | Required | Conditional | Description |
| --- | --- | --- | --- | --- |
| questionId | string matching ^Q-.+ | MUST | | Identifier of the Area question being answered. |
| score | number 1.0–5.0 OR enum pass \| fail \| stale \| n/a | MUST | See Score Values | The score value. |
| weight | number | MUST | | Weight contributed to the weighted aggregation. |
| determinism | enum deterministic \| non-deterministic | MUST | | Whether the answer is reproducible at the same scoringSystem version. |
| graderIdentity | object (kind, name, version) | MUST | | Identity of the grader; kind ∈ llm \| human \| script. |
| llmModel | string | MUST when graderIdentity.kind=llm | | Identifier of the LLM model used (e.g. claude-opus-4-7). |
| selectionContext | object (groupId, personaIds[], domainDocId) | MUST when determinism=non-deterministic | When the answer is non-deterministic, at least one persona is REQUIRED (see Personas Obligation) | The group / persona / domain context under which the answer was produced. |
| timestamp | string (ISO-8601) | MUST | | Time of scoring. |
| evidence | object or url | SHOULD | | Pointer to the underlying test evidence (HTTP response, LLM transcript, lint output, etc.). |
| reasoning | string | SHOULD | | Human-readable rationale (especially for non-deterministic answers). |
| naReason | enum (see n/a Convention with Standard Reasons) | MUST when score = n/a | | Closed-set reason for a non-applicable answer. |

previousGradingId is NOT a field on the gradings[] element; it lives on the top-level regradingTrigger object (see Re-Grading Trigger).


As of gradingSpec/2.0.0, a grading targets exactly one Area. The area field is a const per grading entry. There are 11 Areas, split between provider (namespace) grading and selection grading. Each Area carries its own question set; the per-question answers live in the gradings[] array (see gradings[] Element). The detailed question definitions and output schemas are specified in the per-Area chapters and the Area output schemas.

| # | Area | Grades | Persona-bound | Det / Non-det |
| --- | --- | --- | --- | --- |
| 1 | single-test | one tool | no | deterministic gate + non-det |
| 2 | tools-aggregate-schema | tools collection (schema-wide) | no | both |
| 3 | tools-aggregate-namespace | tools across the namespace | no | both |
| 4 | namespace-description | namespace metadata | no | non-det |
| 5 | namespace-skills | one namespace skill | yes | non-det |
| 6 | about-namespace | About resource (in one schema) | yes | deterministic (route-exists) + non-det |
| 7 | about-selection | About of the selection (= domain knowledge) | yes | deterministic + non-det |
| 8 | selection-skills-L1 | one L1 skill (per skill) | yes | non-det |
| 9 | selection-skills-L2 | one L2 skill (per skill) | yes | non-det |
| 10 | selection-skills-L3 | one L3 skill (per skill) | yes | non-det |
| 11 | selection-aggregate | the selection as a whole | yes | deterministic + non-det |

A grading entry that uses an area value not listed here is INVALID. Adding a new Area is a gradingSystem bump (see 07-scoring-vs-grading.md).

Areas 1–6 are provider areas (tier autonomous, max Grade B, rollup in providers/<ns>/index.json). Areas 7–11 are selection areas (tier group-bound, Grade A attainable, rollup in selections/<sel>/index.json). The two blocks are disjoint — a provider schema is not evaluated over the selection areas, and a selection is not evaluated over the provider areas. See 19-folder-layout.md for the _gradings/ location per Area.

selection-aggregate carries the selection-wide checks: thresholds (soft ≥ 5 / hard ≥ 7 members), topic coherence, domainConformance (members checked against the About / domain knowledge), personaUseCaseFit, the group-bound tier path to Grade A, and the cascade stop. Per-skill areas (8/9/10) grade one skill at a time and carry skillId in the envelope; there is no level-cohort grade.

Each Area defines how many answers its grading entry must carry, split into a deterministic block (computed by code) and a non-deterministic block (produced by the harness sub-agent). A deterministic block alone is not a valid Area grading — the two blocks are merged into one entry. The per-Area answer counts and question sets are normative in the Area output schemas.

The score field is one of:

  • a number in [1.0, 5.0] (numeric score), OR
  • the enum string pass / fail / stale / n/a.

Mixing the two domains (e.g. score = "3.0") is INVALID. The pass / fail enum is reserved for deterministic answers with a binary outcome (HTTP 200 is pass, anything else is fail — see 06-determinism-and-tier.md rule 1). The stale enum is reserved for aged-out time-dependent answers (see Timeline Rule + Aging). The n/a enum is reserved for non-applicable answers (see n/a Pragma).
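
The two score domains can be told apart with a small classifier. The following is a minimal sketch under the rules stated above; the name scoreDomain and its three return labels are assumptions, not part of the spec.

```javascript
// Hypothetical helper: classifies a score value into its domain.
const SCORE_ENUM = [ 'pass', 'fail', 'stale', 'n/a' ]

function scoreDomain( score ) {
    // Numeric domain: a finite number in [1.0, 5.0]
    if( typeof score === 'number' && Number.isFinite( score ) && score >= 1.0 && score <= 5.0 ) {
        return 'numeric'
    }
    // Enum domain: one of the four reserved strings
    if( typeof score === 'string' && SCORE_ENUM.includes( score ) ) {
        return 'enum'
    }
    // Everything else, including numeric strings such as "3.0", is INVALID
    return 'invalid'
}
```

The type check comes first on purpose: a string that merely looks numeric never reaches the range test, which is how the "3.0" case is rejected.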

n/a Convention with Standard Reasons (NEW in 1.1.0)

An answer entry with gradings[i].score === "n/a" is only permitted when gradings[i].naReason carries a value from the following closed set:

| naReason | Meaning |
| --- | --- |
| not-applicable-to-tool-type | Dimension structurally does not apply to this tool type |
| requires-private-data | The check would require a private / non-public data source |
| blocked-by-precondition | Pre-condition not met (e.g. member schema not stable) |
| out-of-scope-resource | Relates to Resources (out-of-scope, on-hold per n/a Pragma) |
| out-of-scope-prompt | Relates to Prompts (out-of-scope, on-hold per n/a Pragma) |
| out-of-scope-procedure | Relates to Procedures (out-of-scope, on-hold per n/a Pragma) |

Free-text reasons are rejected by the schema validator (NA-001 ERROR). Additional reason values can only be added through a spec bump.

Reference implementation: src/NaReason.mjs (closed-set static validator, NA-001 error code in ErrorCodes.mjs). Pre-existing gradings without naReason are migrated by setting naReason = "not-applicable-to-tool-type".

JSON-Schema fragment for gradings[i]:

{
    "score": { "oneOf": [ { "type": "number", "minimum": 1.0, "maximum": 5.0 }, { "enum": [ "pass", "fail", "stale", "n/a" ] } ] },
    "naReason": {
        "type": "string",
        "enum": [
            "not-applicable-to-tool-type",
            "requires-private-data",
            "blocked-by-precondition",
            "out-of-scope-resource",
            "out-of-scope-prompt",
            "out-of-scope-procedure"
        ]
    }
}
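
The closed-set rule and the migration default can also be expressed in a few lines of code. This sketch is not the reference implementation in src/NaReason.mjs; the function names checkNaReason and migrateNaReason are hypothetical, and returning null for answers whose score is not n/a is an assumption, since the text above only constrains the n/a case.

```javascript
const NA_REASONS = [
    'not-applicable-to-tool-type',
    'requires-private-data',
    'blocked-by-precondition',
    'out-of-scope-resource',
    'out-of-scope-prompt',
    'out-of-scope-procedure'
]

// Hypothetical helper: returns 'NA-001' when the n/a rule is violated, otherwise null.
function checkNaReason( answer ) {
    if( answer.score !== 'n/a' ) { return null }
    return NA_REASONS.includes( answer.naReason ) ? null : 'NA-001'
}

// Hypothetical helper: applies the migration default to a pre-existing answer.
// Returns a copy; the input object is left untouched.
function migrateNaReason( answer ) {
    if( answer.score === 'n/a' && answer.naReason === undefined ) {
        return { ...answer, naReason: 'not-applicable-to-tool-type' }
    }
    return answer
}
```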

The categoricalVeto field is either null (no veto) or an object describing a veto that was raised by a grader. The Veto is a closed list at this spec version; the four allowed triggers are enumerated below.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| triggeredBy | enum (see below) | MUST | The veto trigger name. |
| graderIdentity | object (kind, name, version) | MUST | Identity of the grader who raised the veto. |
| evidence | string or url | MUST | Pointer to the evidence behind the veto. |
| timestamp | string (ISO-8601) | MUST | Time of veto. |

The triggeredBy enum is closed. The four allowed values are:

  1. malicious-module — an imported module exhibits behaviour outside the tool’s stated purpose (tracker, telemetry without user knowledge, malware). Deterministic part: imports scan. Non-deterministic part: behaviour judgement. See 09-security-and-development.md.
  2. api-key-domain-mismatch — a requiredServerParams entry declares a key name that belongs to a different domain or company than the API itself (e.g. FACEBOOK_API_KEY for example.xyz). Deterministic. See 09-security-and-development.md.
  3. illegal-content — the schema, its output, or its purpose involves illegal content. Non-deterministic. See 09-security-and-development.md.
  4. ai-security-veto — the grader sees a security finding that is not on the closed deterministic list but is well-evidenced and well-reasoned. Non-deterministic; REQUIRES evidence AND reasoning. See 09-security-and-development.md.

Implementers MUST NOT extend the triggeredBy enum at runtime. Adding a new trigger is a gradingSystem bump.

When categoricalVeto != null, aggregateGrade = REJECTED (no aggregation is performed over gradings[]).


The implementation separates the writers of gradings[] entries into two skill families — both write into the same data model but cover different tiers:

| Family | Repository | Writes | Yields |
| --- | --- | --- | --- |
| Single-Schema-Validator | flowmcp-grading | Provider Areas 1–6 (see 04-phases-single.md) | gradingTier = autonomous |
| Selection-Validator | flowmcp-grading | Consumes provider grading entries plus selection Areas 7–11 (see 05-phases-selection.md); writes the group-bound Areas | gradingTier = group-bound |

A grading entry MUST be written by exactly one of the two families. A Selection-Validator entry MAY reference the Single-Schema-Validator entries it consumed via selectionContext.domainDocId and the surrounding aggregator’s bookkeeping; the spec does NOT require an explicit cross-link.


maxAttainableGrade is derived from gradingTier by a fixed mapping:

| gradingTier | maxAttainableGrade |
| --- | --- |
| autonomous | B |
| group-bound | A |

The mapping is binding. Implementers MUST emit maxAttainableGrade even though it is mechanically derived from gradingTier; consumers of grading entries (UIs, dashboards, registry pages) rely on the field being present so that they can communicate to the consumer that a higher grade is attainable by attaching the schema’s namespace to a Selection and running the Selection phases. See 06-determinism-and-tier.md.


Dimensions fall into two classes by their relationship to time:

| Class | Examples | Aging |
| --- | --- | --- |
| Time-independent | descriptionNeutrality, formattingCompliance, outputSchemaConformance, schema-structure validation | No aging. timestamp is required for audit purposes only. |
| Time-dependent | apiAvailability, tosMatch, legalAssessment | An aging threshold MUST be tracked; once the threshold is exceeded, the dimension’s score MUST become stale. |

Aging defaults — referenced throughout the codebase as the constant #AGING_DEFAULTS — are:

| Aging key | Default | Applies to |
| --- | --- | --- |
| API_DAYS | 14 days | apiAvailability |
| TOS_DAYS | 30 days | tosMatch, legalAssessment |
| RETENTION_DAYS | 180 days | Total grading-entry retention before archival |

Binding rule. Aging produces score = stale, not score = fail. The two outcomes are semantically distinct: fail is an active negative judgement; stale is an absence of a recent positive judgement. Aggregation logic MUST treat stale differently from fail (see Multi-Grader Rule).

The defaults are explicit per the no-hidden-defaults rule — implementers MUST NOT silently substitute alternative aging windows. Overrides MAY be configured per group but MUST be recorded in the Domain-Knowledge document (see 10-domain-knowledge.md).
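
The aging rule can be sketched as a pure function over a dimension name and two timestamps. The defaults and the dimension-to-key mapping are taken from the tables above; the function name applyAging, its parameter names, and the overrides parameter (standing in for values read from the Domain-Knowledge document) are assumptions for illustration.

```javascript
// Defaults from the aging table above
const AGING_DEFAULTS = { API_DAYS: 14, TOS_DAYS: 30, RETENTION_DAYS: 180 }

// Mapping taken from the "Applies to" column; unlisted dimensions are time-independent
const AGING_KEY_BY_DIMENSION = {
    apiAvailability: 'API_DAYS',
    tosMatch: 'TOS_DAYS',
    legalAssessment: 'TOS_DAYS'
}
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Hypothetical helper: returns the score to report for a dimension at time `now`.
// `overrides` must be passed explicitly; nothing is substituted silently.
function applyAging( { dimension, score, timestamp, now, overrides = {} } ) {
    const key = AGING_KEY_BY_DIMENSION[ dimension ]
    if( key === undefined ) { return score }   // time-independent: no aging
    const limitDays = overrides[ key ] ?? AGING_DEFAULTS[ key ]
    const ageDays = ( Date.parse( now ) - Date.parse( timestamp ) ) / MS_PER_DAY
    // Aging yields 'stale', never 'fail'
    return ageDays > limitDays ? 'stale' : score
}
```

For example, an apiAvailability answer scored on 2026-05-01 and read on 2026-05-29 is 28 days old, which exceeds API_DAYS and is reported as stale, while a tosMatch answer of the same age is still within TOS_DAYS.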


Multiple graders MAY independently answer the same question. The data model does NOT automatically consolidate these multi-grader entries. Each entry stands on its own under its own graderIdentity and timestamp. Aggregation logic at the level of aggregateGrade SHOULD pick the most recent valid entry per question; tie-breaking and disagreement-handling rules are out of scope for gradingSystem/1.0.0 and are tracked as a follow-up.
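
One way to pick the most recent entry per question is sketched below. The function name latestPerQuestion is hypothetical, "valid" is assumed to mean "has a parseable timestamp", and ties keep the first entry seen, which is an arbitrary choice because tie-breaking is out of scope here.

```javascript
// Hypothetical helper: keeps the most recent entry per questionId.
// Input entries are not modified or merged; older entries are simply not selected.
function latestPerQuestion( gradings ) {
    const latest = new Map()
    for( const entry of gradings ) {
        const time = Date.parse( entry.timestamp )
        // ASSUMPTION: an entry with an unparseable timestamp is treated as not valid
        if( Number.isNaN( time ) ) { continue }
        const current = latest.get( entry.questionId )
        if( current === undefined || time > Date.parse( current.timestamp ) ) {
            latest.set( entry.questionId, entry )
        }
    }
    return [ ...latest.values() ]
}
```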


A user, an aging job, or a version bump can trigger a re-grading. The regradingTrigger field records the trigger; the old grading entry is NOT deleted, and the new entry references it via previousGradingId.

The regradingTrigger object has these fields:

| Field | Type | Required | Conditional | Description |
| --- | --- | --- | --- | --- |
| triggeredBy | enum (see below) | MUST | | The re-grading trigger name. |
| reportedIssue | string | MUST when triggeredBy=user-report | | The free-text issue description supplied by the user. |
| requestedBy | string | MUST when triggeredBy=user-report | | Identifier of the user who requested the re-grading. |
| previousGradingId | string | MUST | | Identifier of the grading entry being superseded. |
| timestamp | string (ISO-8601) | MUST | | Time of re-grading. |

The triggeredBy enum has four values:

  1. user-report — a user reported a tool as “no longer working” via the CLI or issue template.
  2. scheduled — a scheduled re-grading run (e.g. monthly).
  3. scoring-system-bump — the scoringSystem version was bumped (see 07-scoring-vs-grading.md); affected dimensions are re-scored.
  4. grading-system-bump — the gradingSystem version was bumped; affected dimensions are re-aggregated.

The grader reads reportedIssue (when present) and prioritises the re-evaluation of the Area questions implicated by the report. Implementers MUST NOT delete or overwrite the superseded grading entry. The lineage is preserved through previousGradingId.
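
The field rules for regradingTrigger can be sketched as a small builder. The function name buildRegradingTrigger and its error messages are assumptions; the sketch only constructs the trigger object for the new entry and deliberately has no code path that touches the superseded entry.

```javascript
const REGRADING_TRIGGERS = [ 'user-report', 'scheduled', 'scoring-system-bump', 'grading-system-bump' ]

// Hypothetical helper: builds the regradingTrigger object for a new grading entry.
function buildRegradingTrigger( { triggeredBy, previousGradingId, timestamp, reportedIssue, requestedBy } ) {
    if( !REGRADING_TRIGGERS.includes( triggeredBy ) ) {
        throw new Error( 'unknown triggeredBy: ' + triggeredBy )
    }
    // The lineage link is mandatory for every trigger
    if( typeof previousGradingId !== 'string' || previousGradingId.length === 0 ) {
        throw new Error( 'previousGradingId is required' )
    }
    const trigger = { triggeredBy, previousGradingId, timestamp }
    // reportedIssue and requestedBy are required only for user reports
    if( triggeredBy === 'user-report' ) {
        if( !reportedIssue || !requestedBy ) {
            throw new Error( 'user-report requires reportedIssue and requestedBy' )
        }
        trigger.reportedIssue = reportedIssue
        trigger.requestedBy = requestedBy
    }
    return trigger
}
```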


“Any grading > no grading.”

The spec does NOT require an answer to every Area question with a numeric score. It requires an honest gradings[] array: questions that were not actually tested MUST be recorded with score = n/a. The Anti-Pattern — and it is explicitly forbidden — is to invent entries instead of writing n/a. A grader that does not have evidence for a question MUST emit n/a rather than fabricate a score.

Aggregation logic at the level of aggregateGrade MUST treat n/a as excluded from the weighted sum: the entry contributes neither to the numerator nor to the denominator. Implementers MUST NOT silently substitute n/a with 0, 1.0, or any other numeric value. No-silent-defaults is the binding interpretation of this rule.
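
The exclusion rule can be made concrete with a weighted average that skips n/a entirely. This is a sketch, not the gradingSystem implementation: the function name weightedAverage is hypothetical, the pass and fail mappings follow the aggregation rules of this chapter, and omitting stale is the stated default that a Grading System version may override.

```javascript
// Enum-to-number mapping named in this chapter; binding values live in the Grading System version
const ENUM_SCORE = { pass: 5.0, fail: 1.0 }

// Hypothetical helper: weighted average over gradings[], or null when nothing is countable.
function weightedAverage( gradings ) {
    let numerator = 0
    let denominator = 0
    for( const { score, weight } of gradings ) {
        // n/a (and, by default, stale) contribute to neither side of the fraction
        if( score === 'n/a' || score === 'stale' ) { continue }
        const value = typeof score === 'number' ? score : ENUM_SCORE[ score ]
        // Unknown values are an error, never a silent default
        if( value === undefined ) { throw new Error( 'unsupported score: ' + score ) }
        numerator += value * weight
        denominator += weight
    }
    return denominator === 0 ? null : numerator / denominator
}
```

With scores pass, 4.5, 4.0, and n/a at weight 1.0 each, the result is (5.0 + 4.5 + 4.0) / 3 = 4.5; substituting 0 for the n/a entry would have pulled it down to 3.375.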


Non-deterministic entries (determinism = non-deterministic) MUST carry at least one personaId in selectionContext.personaIds[]. A non-deterministic entry without persona context is INVALID; the JSON-Schema annex enforces this via a conditional if/then (see 08-grading-model.schema.json).

The Personas contract — including the Lens concept and the source of the four generalised personas — is defined in 12-personas-contract.md.

Error-code names for the personas obligation and the related veto-evidence rule are:

  • GRD-005 — non-deterministic entry missing personaIds[].
  • VET-003 — Categorical Veto entry missing required evidence or reasoning when triggeredBy = ai-security-veto.

The full error-code catalogue is delivered in a later stage.
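
The personas obligation for a single answer can be sketched as follows. The function name checkPersonas is hypothetical; the normative enforcement is the if/then block in 08-grading-model.schema.json.

```javascript
// Hypothetical helper: returns 'GRD-005' when a non-deterministic answer
// carries no personaId, otherwise null.
function checkPersonas( answer ) {
    if( answer.determinism !== 'non-deterministic' ) { return null }
    const personaIds = answer.selectionContext?.personaIds
    const hasPersona = Array.isArray( personaIds ) && personaIds.length >= 1
    return hasPersona ? null : 'GRD-005'
}
```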


The aggregateGrade is computed by the following rules.

  1. Veto short-circuit. If categoricalVeto != null, then aggregateGrade = REJECTED. No aggregation runs.
  2. Weighted sum. Otherwise, the grader computes a weighted average over gradings[]. Numeric scores enter as they are; pass, fail, and stale are mapped by the Grading System version (pass → 5.0, fail → 1.0, stale → omitted from numerator and denominator unless the Grading System version specifies otherwise). Entries with score = n/a are ignored.
  3. Tier trim. The weighted average is mapped to a grade letter A/B/C/D/F by thresholds defined at the gradingSystem version. The result is then trimmed by maxAttainableGrade: an autonomous entry capped at B cannot emit A.
  4. Minimum LLM rule. For aggregateGrade >= B, at least one non-deterministic (LLM) entry SHOULD be present (see 06-determinism-and-tier.md rule 3).
  5. Group-bound rule for A. For aggregateGrade >= A, at least one group-bound entry MUST be present (see 06-determinism-and-tier.md rule 4). A purely autonomous grading cannot yield A.

The concrete threshold values, weights per question, and stale-handling policy are NOT part of this spec chapter — they live in the gradingSystem/1.0.0 implementation. The above five rules are the binding contract.
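
Rules 1 and 3 can be sketched once a weighted average is available. The threshold values below are placeholders invented for this illustration and are NOT the gradingSystem/1.0.0 values; the function name aggregateGrade and its parameter object are likewise assumptions. Rules 4 and 5 are not covered by the sketch.

```javascript
// ASSUMPTION: placeholder thresholds for illustration only, ordered from best to worst
const HYPOTHETICAL_THRESHOLDS = [ [ 4.5, 'A' ], [ 3.5, 'B' ], [ 2.5, 'C' ], [ 1.5, 'D' ] ]
const GRADE_ORDER = [ 'F', 'D', 'C', 'B', 'A' ]

// Hypothetical helper: applies the veto short-circuit and the tier trim.
function aggregateGrade( { categoricalVeto, maxAttainableGrade, weightedAverage } ) {
    // Rule 1: a non-null veto short-circuits everything else
    if( categoricalVeto !== null && categoricalVeto !== undefined ) { return 'REJECTED' }

    // Rule 3, first half: map the weighted average to a grade letter
    const match = HYPOTHETICAL_THRESHOLDS.find( ( [ min ] ) => weightedAverage >= min )
    const letter = match === undefined ? 'F' : match[ 1 ]

    // Rule 3, second half: trim by maxAttainableGrade
    const capped = Math.min( GRADE_ORDER.indexOf( letter ), GRADE_ORDER.indexOf( maxAttainableGrade ) )
    return GRADE_ORDER[ capped ]
}
```

Under these placeholder thresholds, an autonomous entry with a weighted average of 4.5 maps to A and is then trimmed to B, while the same average on a group-bound entry stays at A.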


Autonomous Grading (single-test, three answers)

{
    "gradingId": "a1b2c3d4--2026-05-29T15-34Z",
    "schemaId": "etherscan.getBalance",
    "area": "single-test",
    "version": "flowmcp/4.0.0",
    "schemaHash": "a1b2c3d4",
    "gradingMode": "full",
    "gradingTier": "autonomous",
    "harness": "claude-code",
    "persona": { "basePersonaId": "decision-maker", "lensId": "crypto" },
    "scoringSystem": "scoringSystem/1.0.0",
    "gradingSystem": "gradingSystem/1.0.0",
    "gradings": [
        {
            "questionId": "Q-api-availability",
            "score": "pass",
            "weight": 1.0,
            "determinism": "deterministic",
            "graderIdentity": { "kind": "script", "name": "single-schema-validator", "version": "0.1.0" },
            "timestamp": "2026-05-29T10:00:00Z",
            "evidence": "https://example.org/proofs/etherscan-getBalance/2026-05-29.txt"
        },
        {
            "questionId": "Q-description-neutrality",
            "score": 4.5,
            "weight": 1.0,
            "determinism": "deterministic",
            "graderIdentity": { "kind": "script", "name": "single-schema-validator", "version": "0.1.0" },
            "timestamp": "2026-05-29T10:00:00Z"
        },
        {
            "questionId": "Q-when-to-use",
            "score": 4.0,
            "weight": 1.0,
            "determinism": "non-deterministic",
            "graderIdentity": { "kind": "llm", "name": "claude-opus-4-7", "version": "1m" },
            "llmModel": "claude-opus-4-7",
            "selectionContext": {
                "groupId": "crypto",
                "personaIds": ["decision-maker"],
                "domainDocId": "crypto-1.0.0"
            },
            "timestamp": "2026-05-29T10:00:00Z",
            "reasoning": "Clear, unambiguous trigger sentence; covers the canonical balance-lookup use case."
        }
    ],
    "categoricalVeto": null,
    "aggregateGrade": "B",
    "maxAttainableGrade": "B"
}

Rejected Grading (Categorical Veto: api-key-domain-mismatch)

{
    "gradingId": "deadbeef--2026-05-29T10-00-00Z",
    "schemaId": "example.maliciousAdapter",
    "area": "single-test",
    "version": "flowmcp/4.0.0",
    "schemaHash": "deadbeef",
    "gradingMode": "full",
    "gradingTier": "autonomous",
    "harness": "claude-code",
    "scoringSystem": "scoringSystem/1.0.0",
    "gradingSystem": "gradingSystem/1.0.0",
    "gradings": [
        {
            "questionId": "Q-security",
            "score": "fail",
            "weight": 1.0,
            "determinism": "deterministic",
            "graderIdentity": { "kind": "script", "name": "imports-scanner", "version": "0.1.0" },
            "timestamp": "2026-05-29T10:00:00Z",
            "evidence": "https://example.org/proofs/imports-scan.txt"
        }
    ],
    "categoricalVeto": {
        "triggeredBy": "api-key-domain-mismatch",
        "graderIdentity": { "kind": "script", "name": "api-key-domain-checker", "version": "0.1.0" },
        "evidence": "schema declares FACEBOOK_API_KEY for example.xyz",
        "timestamp": "2026-05-29T10:00:00Z"
    },
    "aggregateGrade": "REJECTED",
    "maxAttainableGrade": "B"
}

Both example documents validate against 08-grading-model.schema.json.


The normative JSON-Schema for the grading entry is 08-grading-model.schema.json (JSON-Schema 2020-12). Every grading entry emitted by a grader MUST validate against this schema. The schema mirrors the conditional rules (e.g. selectionId required when gradingTier=group-bound, llmModel required when graderIdentity.kind=llm, personaIds[] required when determinism=non-deterministic, harness constrained to claude-code) via JSON-Schema if/then blocks. Validation uses Ajv2020 (the draft-2020-12 build of Ajv) plus ajv-formats, not the default Ajv build.

import { readFileSync } from 'node:fs'
import Ajv2020 from 'ajv/dist/2020.js'
import addFormats from 'ajv-formats'

// Load a JSON document from disk
const readJson = ( path ) => JSON.parse( readFileSync( path, 'utf8' ) )

const schema = readJson( 'grading/2.0.0/08-grading-model.schema.json' )
const valid = readJson( 'grading/2.0.0/examples/grading-autonomous.json' )
const rejected = readJson( 'grading/2.0.0/examples/grading-rejected.json' )

// Ajv2020 is the draft-2020-12 build; ajv-formats adds the "format" keywords
const ajv = new Ajv2020( { strict: true, allErrors: true } )
addFormats( ajv )
const validate = ajv.compile( schema )

// validate.errors is overwritten on every call, so each result is checked
// before the next document is validated
if( !validate( valid ) ) { throw new Error( 'autonomous example invalid: ' + JSON.stringify( validate.errors ) ) }
if( !validate( rejected ) ) { throw new Error( 'rejected example invalid: ' + JSON.stringify( validate.errors ) ) }
if( rejected.aggregateGrade !== 'REJECTED' ) { throw new Error( 'rejected example must aggregate to REJECTED' ) }