A seven-dimension framework for evaluating the quality of AI-generated outputs in professional contexts. Honest by design. Developmental in intent.
Read about the framework below, then use it. Two evaluation routes are available: score an output yourself using the self-assessment tool, or submit it for AI-generated evaluation against the full rubric.
Work through each of the seven dimensions yourself. Select a score, add your notes, and generate a structured PDF evaluation record. No account required. Takes around fifteen to twenty minutes.
Paste or upload your output and Claude evaluates it against all seven dimensions. It asks five clarifying questions conversationally, then produces a full scored evaluation with a PDF report.
Every dimension is scored out of ten. Scores map to five bands. The descriptions are intentionally honest. Adequate is not a compliment, and Exemplary is earned precisely because the bands beneath it are not.
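The mapping from score to band can be sketched in a few lines of Python, using the score ranges given in the dimension tables. This is an illustration only, not part of the framework's tooling: the names `BANDS` and `band_for` are hypothetical, and the sketch assumes scores are whole numbers from 0 to 10, which the band ranges suggest but the text does not state outright.

```python
# Illustrative sketch only: BANDS and band_for are hypothetical names.
# Ranges are inclusive and copied from the rubric's dimension tables.
BANDS = [
    (0, 2, "Insufficient"),
    (3, 4, "Partial"),
    (5, 6, "Adequate"),
    (7, 8, "Capable"),
    (9, 10, "Exemplary"),
]


def band_for(score: int) -> str:
    """Return the band name for a dimension score.

    Assumption: scores are whole numbers from 0 to 10.
    """
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError("score must be a whole number")
    for low, high, name in BANDS:
        if low <= score <= high:
            return name
    raise ValueError("score must be between 0 and 10")


print(band_for(6))  # Adequate
print(band_for(9))  # Exemplary
```

Keeping the ranges in one table means the same lookup serves every dimension, since all seven share the same five bands.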
Each dimension examines a different aspect of output quality. Not just whether the output reads well, but whether it is doing what it should at every level: contextual, analytical, structural, and relational.
| Band | Score | Descriptor |
|---|---|---|
| Insufficient | 0–2 | Generic; could apply to any situation; no meaningful adaptation to context; the specific brief has not been understood or used. |
| Partial | 3–4 | Acknowledges context superficially; adaptation is surface-level or formulaic. |
| Adequate | 5–6 | Clearly shaped by the stated context; appropriate to audience and purpose as given; no significant mismatch. |
| Capable | 7–8 | Demonstrates genuine understanding including unstated contextual factors; calibrated to the specific situation, not just the brief. |
| Exemplary | 9–10 | Demonstrates insight into deeper contextual dynamics; anticipates what the context requires beyond the explicit brief; could only have been written for this situation. |
| Band | Score | Descriptor |
|---|---|---|
| Insufficient | 0–2 | Claims made without any support; assertions presented as facts; no evidence, examples, or reasoning offered. |
| Partial | 3–4 | Some grounding present but inconsistent; key claims left unsupported; evidence where present is vague or weak. |
| Adequate | 5–6 | Claims broadly supported; evidence present and relevant; no significant unsupported assertions. |
| Capable | 7–8 | Claims consistently supported with specific, well-chosen evidence; uncertainty acknowledged where it genuinely exists. |
| Exemplary | 9–10 | Evidence is precise and proportionate; the distinction between established fact, reasoned inference, and acknowledged uncertainty is consistently maintained throughout. |
| Band | Score | Descriptor |
|---|---|---|
| Insufficient | 0–2 | Purely descriptive; restates information without analysis; nothing new is generated. |
| Partial | 3–4 | Some analytical movement but conclusions are surface-level, obvious, or not meaningfully derived from the content. |
| Adequate | 5–6 | Clear line of analysis present; insight generated; implications drawn; the output moves beyond description. |
| Capable | 7–8 | Analysis is penetrating; non-obvious connections surfaced; the reader's understanding is meaningfully advanced. |
| Exemplary | 9–10 | Reaches the underlying dynamics; surfaces what others would miss; the analysis itself is the value of the output, not merely a vehicle for presenting information. |
| Band | Score | Descriptor |
|---|---|---|
| Insufficient | 0–2 | Structure absent, incoherent, or actively misleading; the reader cannot follow the logic. |
| Partial | 3–4 | A logical sequence exists but structure does not reinforce the core argument or purpose; some sections add noise rather than signal. |
| Adequate | 5–6 | Structure clearly serves the output's purpose; hierarchy and sequencing are coherent; the reader is guided reliably. |
| Capable | 7–8 | Structure and argument are well-integrated; organisation itself contributes to the output's persuasiveness or clarity. |
| Exemplary | 9–10 | Structure and content are inseparable; the architecture of the output communicates meaning; nothing could be moved or removed without loss. |
| Band | Score | Descriptor |
|---|---|---|
| Insufficient | 0–2 | Significant mismatch between register and context; undermines credibility or utility; may actively alienate the intended audience. |
| Partial | 3–4 | Broadly appropriate register but with inconsistencies or misjudgements that weaken the output. |
| Adequate | 5–6 | Register consistently appropriate; language level and tone well-matched to audience and purpose throughout. |
| Capable | 7–8 | Register precisely calibrated; shifts intentionally when context requires; enhances rather than merely supports the content. |
| Exemplary | 9–10 | Register is a positive contributor to impact; the voice is distinctive, earned, and exactly right for the moment; the reader feels addressed rather than processed. |
| Band | Score | Descriptor |
|---|---|---|
| Insufficient | 0–2 | Sycophantic or adversarial; either validates uncritically to please, or challenges without developmental intent; the output serves the wrong purpose. |
| Partial | 3–4 | Broadly honest but with softened critique, significant omissions of concern, or disproportionate framing of issues. |
| Adequate | 5–6 | Honest and proportionate; concerns raised clearly rather than buried; challenge delivered with developmental purpose. |
| Capable | 7–8 | Engages as a trusted peer; identifies what needs to be said; maintains analytical independence whilst remaining constructive. |
| Exemplary | 9–10 | Identifies what others might miss or avoid saying; maintains complete honesty without compromising the relationship; useful precisely because it does not seek approval. |
The quality, accuracy, and generative power of evaluative judgement are what matter, not its timing. Genuine reflection that arrives after deliberation and materially improves the outcome is of higher value than rapid challenge that produces superficial change. This dimension does not reward speed of challenge or penalise considered acceptance.
| Band | Score | Descriptor |
|---|---|---|
| Insufficient | 0–2 | No evaluative engagement evident; outputs accepted or rejected without apparent judgement; challenge and acceptance both appear arbitrary. |
| Partial | 3–4 | Some evaluative engagement but poorly calibrated; significant points accepted without scrutiny, or challenge applied where it adds no value. |
| Adequate | 5–6 | Evaluative judgement present and broadly sound; key assumptions tested; acceptance where appropriate is active rather than passive. |
| Capable | 7–8 | Strong discrimination between what warrants challenge and what does not; each evaluative decision can be justified; reflection is visible in the quality of interventions. |
| Exemplary | 9–10 | Evaluative judgement is precise and generative; active acceptance of strong outputs demonstrates discriminating quality of thought equal to any challenge; the thinking process itself is the standard. |
After the seven dimensions are scored, the evaluation closes with a holistic measure of AI Voice. This is not a quality dimension in the same sense as the others. It is an assessment of authorship.
The question it asks is: to what degree does the output sound generated rather than authored? Where is AI voice present, where is it absent, and what is driving it?
For this measure, lower is better. A score of two or below means the output reads as genuinely human-authored. A score of nine or ten means it is unmistakably AI-generated throughout.
AI voice is not always a problem. In some contexts it is neutral or acceptable. But it is always worth naming, because the presence of AI voice often signals that the human engagement that produced the output was passive rather than active.
| Band | Score | Descriptor |
|---|---|---|
| Sounds human | 0–2 | The output reads as genuinely authored by a person. AI involvement is not detectable in voice or patterning. |
| Mostly human | 3–4 | Largely human in voice. Occasional AI patterning present but does not dominate. |
| Mixed | 5–6 | A blend of human and AI voice. Neither fully dominates; the output shifts between registers. |
| Mostly AI | 7–8 | AI voice is the dominant register. Formulaic structures, hedging language, or generic patterning are clearly present. |
| AI throughout | 9–10 | Unmistakably AI-generated throughout. Mechanical structure, excessive hedging, and AI patterns are pervasive. |
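Because the AI voice measure runs in the opposite direction to the seven dimensions, it may help to keep its lookup separate from theirs. The Python sketch below shows one way to do that. The names `AI_VOICE_BANDS`, `ai_voice_band`, and `reads_more_human` are hypothetical, and the sketch assumes whole-number scores from 0 to 10.

```python
# Illustrative sketch only: all names here are hypothetical.
# Ranges are inclusive and copied from the AI voice table.
AI_VOICE_BANDS = [
    (0, 2, "Sounds human"),
    (3, 4, "Mostly human"),
    (5, 6, "Mixed"),
    (7, 8, "Mostly AI"),
    (9, 10, "AI throughout"),
]


def ai_voice_band(score: int) -> str:
    """Return the AI voice band for a score.

    Assumption: scores are whole numbers from 0 to 10.
    """
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError("score must be a whole number")
    for low, high, name in AI_VOICE_BANDS:
        if low <= score <= high:
            return name
    raise ValueError("score must be between 0 and 10")


def reads_more_human(score_a: int, score_b: int) -> bool:
    """True if score_a indicates less AI voice than score_b (lower is better)."""
    return score_a < score_b


print(ai_voice_band(2))        # Sounds human
print(reads_more_human(3, 8))  # True
```

The sketch deliberately does not fold the AI voice score into any total with the dimension scores, since the text treats it as an assessment of authorship, not a quality dimension.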
The rubric is honest by design. These principles govern how evaluations are conducted, whether by AI or by a human evaluator working through the dimensions independently.
Developmental feedback serves the person better than comfortable feedback. The delivery can be softened where appropriate. The content cannot.
Weaknesses are located precisely in the output. "Could be stronger" is not useful. Where it fell short and what it cost the output: that is useful.
If a dimension scores Partial, that is named clearly before noting what worked. Concerns are not buried at the end of paragraphs that lead with strengths.
If a dimension genuinely scores Exemplary, it is recorded as Exemplary. Inventing critique to appear rigorous is its own failure of the framework's principles.
A strong overall impression does not inflate weak dimensions. A weak overall impression does not deflate strong ones. Each dimension stands on its own evidence.
Formal professional outputs and conversational opinion pieces are assessed differently on four dimensions. The rubric adjusts to what the output is, not what the evaluator prefers.