Output Evaluator — Rubric

The Seven Dimensions

What the rubric evaluates

Each dimension examines a different aspect of output quality. Not just whether the output reads well, but whether it is doing what it should at every level: contextual, analytical, structural, and relational.

Fit to Context

Does the output demonstrate genuine understanding of the specific situation, audience, and purpose, or does it read as generically applicable?

Band	Score	Descriptor
Insufficient	0–2	Generic; could apply to any situation; no meaningful adaptation to context; the specific brief has not been understood or used.
Partial	3–4	Acknowledges context superficially; adaptation is surface-level or formulaic.
Adequate	5–6	Clearly shaped by the stated context; appropriate to audience and purpose as given; no significant mismatch.
Capable	7–8	Demonstrates genuine understanding including unstated contextual factors; calibrated to the specific situation, not just the brief.
Exemplary	9–10	Demonstrates insight into deeper contextual dynamics; anticipates what the context requires beyond the explicit brief; could only have been written for this situation.

Evidence and Grounding

Are claims, assertions, and recommendations supported by concrete evidence, examples, or reasoning, or are they asserted without foundation?

Band	Score	Descriptor
Insufficient	0–2	Claims made without any support; assertions presented as facts; no evidence, examples, or reasoning offered.
Partial	3–4	Some grounding present but inconsistent; key claims left unsupported; evidence where present is vague or weak.
Adequate	5–6	Claims broadly supported; evidence present and relevant; no significant unsupported assertions.
Capable	7–8	Claims consistently supported with specific, well-chosen evidence; uncertainty acknowledged where it genuinely exists.
Exemplary	9–10	Evidence is precise and proportionate; the distinction between established fact, reasoned inference, and acknowledged uncertainty is consistently maintained throughout.

Analytical Depth

Does the output generate insight and implication, or does it describe and summarise without advancing understanding?

Band	Score	Descriptor
Insufficient	0–2	Purely descriptive; restates information without analysis; nothing new is generated.
Partial	3–4	Some analytical movement but conclusions are surface-level, obvious, or not meaningfully derived from the content.
Adequate	5–6	Clear line of analysis present; insight generated; implications drawn; the output moves beyond description.
Capable	7–8	Analysis is penetrating; non-obvious connections surfaced; the reader's understanding is meaningfully advanced.
Exemplary	9–10	Reaches the underlying dynamics; surfaces what others would miss; the analysis itself is the value of the output, not merely a vehicle for presenting information.

Purposeful Structure

Does the structure serve the purpose of the output, guiding the reader efficiently toward understanding or decision, or does it impose form without function?

Band	Score	Descriptor
Insufficient	0–2	Structure absent, incoherent, or actively misleading; the reader cannot follow the logic.
Partial	3–4	A logical sequence exists but structure does not reinforce the core argument or purpose; some sections add noise rather than signal.
Adequate	5–6	Structure clearly serves the output's purpose; hierarchy and sequencing are coherent; the reader is guided reliably.
Capable	7–8	Structure and argument are well-integrated; organisation itself contributes to the output's persuasiveness or clarity.
Exemplary	9–10	Structure and content are inseparable; the architecture of the output communicates meaning; nothing could be moved or removed without loss.

Appropriate Register

Are the tone, language level, and relational posture calibrated correctly to the audience, domain, and moment, and do they remain consistent?

Band	Score	Descriptor
Insufficient	0–2	Significant mismatch between register and context; undermines credibility or utility; may actively alienate the intended audience.
Partial	3–4	Broadly appropriate register but with inconsistencies or misjudgements that weaken the output.
Adequate	5–6	Register consistently appropriate; language level and tone well-matched to audience and purpose throughout.
Capable	7–8	Register precisely calibrated; shifts intentionally when context requires; enhances rather than merely supports the content.
Exemplary	9–10	Register is a positive contributor to impact; the voice is distinctive, earned, and exactly right for the moment; the reader feels addressed rather than processed.

Critical Integrity

Does the output reflect honest, proportionate, and developmentally useful assessment, free from both approval-seeking affirmation and adversarial challenge?

Band	Score	Descriptor
Insufficient	0–2	Sycophantic or adversarial; either validates uncritically to please, or challenges without developmental intent; the output serves the wrong purpose.
Partial	3–4	Broadly honest but with softened critique, significant omissions of concern, or disproportionate framing of issues.
Adequate	5–6	Honest and proportionate; concerns raised clearly and without burying; challenge delivered with developmental purpose.
Capable	7–8	Engages as a trusted peer; identifies what needs to be said; maintains analytical independence whilst remaining constructive.
Exemplary	9–10	Identifies what others might miss or avoid saying; maintains complete honesty without compromising the relationship; useful precisely because it does not seek approval.

Evaluative Judgement

Does the output, and the process that produced it, demonstrate that the right level of critical engagement was applied at the right points?

The quality, accuracy, and generative power of evaluative judgement are what matter, not its timing. Genuine reflection that arrives after deliberation and improves the outcome materially is of higher value than rapid challenge that produces superficial change. This dimension does not reward speed of challenge or penalise considered acceptance.

Band	Score	Descriptor
Insufficient	0–2	No evaluative engagement evident; outputs accepted or rejected without apparent judgement; challenge and acceptance both appear arbitrary.
Partial	3–4	Some evaluative engagement but poorly calibrated; significant points accepted without scrutiny, or challenge applied where it adds no value.
Adequate	5–6	Evaluative judgement present and broadly sound; key assumptions tested; acceptance where appropriate is active rather than passive.
Capable	7–8	Strong discrimination between what warrants challenge and what does not; each evaluative decision can be justified; reflection is visible in the quality of interventions.
Exemplary	9–10	Evaluative judgement is precise and generative; active acceptance of strong outputs demonstrates discriminating quality of thought equal to any challenge; the thinking process itself is the standard.
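
The five bands use the same score boundaries in all seven dimension tables, so the step from a score to a band label can be sketched as a small lookup. This is an illustration only, not part of the rubric: the names `QUALITY_BANDS` and `quality_band` are hypothetical, and integer scores are an assumption, since the rubric does not say how fractional scores would be treated.

```python
# Hypothetical helper: maps a 0-10 dimension score to its rubric band.
# Boundaries come from the band tables; integer-only scoring is an assumption.
QUALITY_BANDS = [
    (2, "Insufficient"),  # scores 0-2
    (4, "Partial"),       # scores 3-4
    (6, "Adequate"),      # scores 5-6
    (8, "Capable"),       # scores 7-8
    (10, "Exemplary"),    # scores 9-10
]


def quality_band(score: int) -> str:
    """Return the band label for an integer score between 0 and 10."""
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError("score must be an integer from 0 to 10")
    # Walk the bands in ascending order and return the first whose upper bound fits.
    for upper, label in QUALITY_BANDS:
        if score <= upper:
            return label
    raise AssertionError("unreachable: every score from 0 to 10 is covered")


print(quality_band(7))  # Capable
```

The lookup only labels a score that has already been awarded. Deciding the score still depends on the descriptors and the evidence in the output.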

Holistic Measure

Level of AI Voice

After the seven dimensions are scored, the evaluation closes with a holistic measure of AI Voice. This is not a quality dimension in the same sense as the others. It is an assessment of authorship.

The question it asks is: to what degree does the output sound generated rather than authored? Where is AI voice present, where is it absent, and what is driving it?

For this measure, lower is better. A score of zero to two means the output reads as genuinely human-authored. A score of nine or ten means it is unmistakably AI-generated throughout.

AI voice is not always a problem. In some contexts it is neutral or acceptable. But it is always worth naming, because the presence of AI voice often signals that the human engagement that produced the output was passive rather than active.

AI Voice: Bands

Band	Score	Descriptor
Sounds human	0–2	The output reads as genuinely authored by a person. AI involvement is not detectable in voice or patterning.
Mostly human	3–4	Largely human in voice. Occasional AI patterning present but does not dominate.
Mixed	5–6	A blend of human and AI voice. Neither fully dominates; the output shifts between registers.
Mostly AI	7–8	AI voice is the dominant register. Formulaic structures, hedging language, or generic patterning are clearly present.
AI throughout	9–10	Unmistakably AI-generated throughout. Mechanical structure, excessive hedging, and AI patterns are pervasive.
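
Because the AI Voice scale runs in the opposite direction to the quality dimensions, a separate lookup makes the inversion explicit. This is a minimal sketch under the same assumptions as any band lookup: the names `AI_VOICE_BANDS` and `ai_voice_band` are hypothetical, and integer scores are assumed.

```python
# Hypothetical helper: maps a 0-10 AI Voice score to its band.
# Unlike the quality dimensions, a LOWER score is better on this measure.
AI_VOICE_BANDS = [
    (2, "Sounds human"),    # scores 0-2
    (4, "Mostly human"),    # scores 3-4
    (6, "Mixed"),           # scores 5-6
    (8, "Mostly AI"),       # scores 7-8
    (10, "AI throughout"),  # scores 9-10
]


def ai_voice_band(score: int) -> str:
    """Return the AI Voice band label for an integer score between 0 and 10."""
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError("score must be an integer from 0 to 10")
    # Bands are listed from most human to most AI; return the first that fits.
    for upper, label in AI_VOICE_BANDS:
        if score <= upper:
            return label
    raise AssertionError("unreachable: every score from 0 to 10 is covered")


print(ai_voice_band(1))  # Sounds human
```

Keeping this lookup apart from the quality bands avoids reading a high AI Voice score as a strong result.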

Principles

How the rubric is applied

The rubric is honest by design. These principles govern how evaluations are conducted, whether by AI or by a human evaluator working through the dimensions independently.

Honest before kind

Developmental feedback serves the person better than comfortable feedback. The delivery can be softened where appropriate. The content cannot.

Specific, not vague

Weaknesses are located precisely in the output. "Could be stronger" is not useful. Where it fell short and what it cost the output: that is useful.

Concerns before praise

If a dimension scores Partial, that is named clearly before noting what worked. Concerns are not buried at the end of paragraphs that lead with strengths.

No manufactured critique

If a dimension genuinely scores Exemplary, it is recorded as Exemplary. Inventing critique to appear rigorous is its own failure of the framework's principles.

Dimensional independence

A strong overall impression does not inflate weak dimensions. A weak overall impression does not deflate strong ones. Each dimension stands on its own evidence.

Genre calibration

Formal professional outputs and conversational opinion pieces are assessed differently on four dimensions. The rubric adjusts to what the output is, not what the evaluator prefers.

Dimension 3 of 7

Analytical Depth

Does the output generate insight and implication, or describe and summarise?

7/10

Capable

Strengths

This is the piece's strongest dimension. The central insight, that AI exposed rather than created the proxy problem, is genuinely non-obvious and is well-sustained across the early sections. The inversion of the gaming incentive is analytically sharp. The decision to position the doctoral model as destination rather than starting point shows structural analytical intelligence. The formative-to-summative bridge is the most developed analytical section in the piece and makes a claim that most practitioners will not have encountered in this form.

Weaknesses

The piece loses analytical momentum in its second half. The "What This Makes Possible" section drops into relatively predictable benefits language. "From passive subject to active designer" is a phrase that circulates in education discourse already. The risks section names the risks without interrogating them. The equity concern in particular deserved more analytical weight, both because it is a genuine vulnerability in the model and because the audience will raise it. The conclusion restates the argument rather than landing a final insight the reader has not already encountered.

Areas for development

The second half needs the same analytical standard that was applied to the first. The conclusion in particular needs a closing analytical move rather than a summary restatement. The equity risk deserves a more honest interrogation of whether the model, as currently articulated, has an adequate answer to it.

What a final standard would require

The analytical standard maintained in sections one through five is sustained through to the conclusion. The final paragraph leaves the reader with something they did not have when they started, not a summary of what they have just read.

Holistic Measure

Level of AI Voice

To what degree does the output sound generated rather than authored?

4/10

High AI presence

Assessment

The current draft carries a detectable AI voice in multiple sections, estimated at 35 to 40 percent of the text. This is significantly above the implicit sub-15 percent standard for thought leadership work of this kind, and it is the highest-priority development concern for the editing pass.

The AI signature appears most clearly in:

List-introduction structures. "The concerns cluster around three things" is a classic AI framing that signals a list is coming. The piece uses this pattern more than once.

Rhythmic removal sequences. "Strip the method away. Remove the examination, the observation, the portfolio..." is analytically effective but the rhythm is an AI cadence that an experienced editor will identify.

Formulaic benefit framing. "The benefits of the model are not abstract. They are observable at the level of the individual learner, the practitioner, and the system" is a structure that appears frequently in AI-generated professional writing.

Connective tissue passages. Several transitions between sections read as structural glue rather than genuine argument. "There is a further implication of this model that the sector has been circling for some time" is the clearest example.

Areas for development

The humanising pass needs to be systematic, not selective. Every section should be reviewed for AI cadence with the target of reducing detectable AI voice to below 15 percent. The author's voice in the collaborative exchanges is more distinctive than the voice in the drafted sections. The editing pass should move the drafted text toward the conversational register the author used when pushing back and challenging during ideation.

Overall Summary

Three findings + priority

The central argument is the piece's most important achievement at v0.1. The insight that AI exposed rather than created the proxy problem is genuinely differentiating, and the analytical development of it through the gaming counter, marking integrity and formative-to-summative sections is the strongest work in the piece. This is the spine the editing pass must protect.

The AI voice level is the most urgent development priority. At an estimated 35 to 40 percent, the piece does not yet read as the author's thought leadership. It reads as AI-assisted drafting. That is not the standard the piece needs to reach its audience or establish the credibility the argument deserves.

The evidence architecture is present but shallow. Names are doing the work that findings should be doing. The editing pass needs to go back into each reference and extract a specific claim that does genuine argumentative work, or reduce the reference to a passing acknowledgement rather than a substantive anchor.

Priority

The AI voice reduction pass. Everything else is secondary to establishing the author's voice as the authorial presence in the piece. Until that is done, no other editing decision can be fully evaluated.

This excerpt covers two of eight measures. The full report includes all seven dimensions, the holistic AI Voice assessment, a complete score summary, and the full evaluation record.
