Evaluation Utility Command
Evaluate a single iteration on a specific quality dimension using the ReAct reasoning pattern.
Syntax
/evaluate <dimension> <iteration_path> [spec_path]
Parameters:
- dimension: One of "technical", "creativity", "compliance", or "all"
- iteration_path: Path to the iteration file/directory to evaluate
- spec_path: Required for the "compliance" dimension, optional for others
Examples:
/evaluate technical output/iteration_001.html
/evaluate creativity output/iteration_005.html
/evaluate compliance output/iteration_003.html specs/example_spec.md
/evaluate all output/iteration_002.html specs/example_spec.md
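A minimal sketch of how these arguments could be validated, assuming Python tooling; the function and constant names are illustrative and not part of the command definition:

```python
from pathlib import Path

VALID_DIMENSIONS = {"technical", "creativity", "compliance", "all"}

def parse_evaluate_args(args: list[str]) -> dict:
    """Validate `/evaluate <dimension> <iteration_path> [spec_path]` arguments."""
    if len(args) < 2:
        raise ValueError("Usage: /evaluate <dimension> <iteration_path> [spec_path]")
    dimension, iteration_path = args[0], Path(args[1])
    spec_path = Path(args[2]) if len(args) > 2 else None

    if dimension not in VALID_DIMENSIONS:
        raise ValueError(f"Invalid dimension {dimension!r}; valid options: {sorted(VALID_DIMENSIONS)}")
    if not iteration_path.exists():
        raise FileNotFoundError(f"Iteration not found: {iteration_path}")
    # Assumption: "all" also needs a spec because it includes the compliance pass.
    if dimension in ("compliance", "all") and spec_path is None:
        raise ValueError("spec_path is required for the compliance dimension")
    return {"dimension": dimension, "iteration_path": iteration_path, "spec_path": spec_path}
```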
Execution Process
THOUGHT Phase: Reasoning About Evaluation
Before scoring, reason about:
- What defines quality in this dimension?
  - For technical: Architecture, code quality, performance, robustness
  - For creativity: Originality, innovation, aesthetic choices, uniqueness
  - For compliance: Requirement fulfillment, naming, structure, standards
- What evidence should I look for?
  - Concrete artifacts that demonstrate quality
  - Code patterns, design decisions, implementation details
  - Documentation and self-assessment comments
- What are potential pitfalls in this evaluation?
  - Subjective bias
  - Missing context
  - Unfair comparisons
  - Evaluation drift
- How will I ensure objective scoring?
  - Use specific criteria from the evaluator definitions
  - Look for measurable indicators
  - Document the reasoning for each score component
ACTION Phase: Execute Evaluation
- Load Iteration Content
  - Read the file(s) completely
  - Parse structure and components
  - Extract metadata and documentation
- Load Evaluation Criteria
  - For technical: Use evaluators/technical_quality.md
  - For creativity: Use evaluators/creativity_score.md
  - For compliance: Use evaluators/spec_compliance.md
- Apply Evaluation Logic
  For Technical Quality, scoring (0-100):
  - Code Quality (25 points): Clean, readable, maintainable code
  - Architecture (25 points): Well-structured, modular design
  - Performance (25 points): Efficient algorithms, optimized rendering
  - Robustness (25 points): Error handling, edge cases, validation
  For Creativity Score, scoring (0-100):
  - Originality (25 points): Novel ideas, unique approach
  - Innovation (25 points): Creative problem-solving, fresh perspective
  - Uniqueness (25 points): Differentiation from existing iterations
  - Aesthetic (25 points): Visual appeal, design sophistication
  For Spec Compliance, scoring (0-100):
  - Requirements Met (40 points): All spec requirements fulfilled
  - Naming Conventions (20 points): Follows spec naming patterns
  - Structure Adherence (20 points): Matches spec structure
  - Quality Standards (20 points): Meets spec quality criteria
- Calculate Scores (a sketch follows this list)
  - Score each sub-component
  - Sum to the dimension total
  - Document the scoring reasoning
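A minimal sketch of the scoring step, assuming Python; the rubric maxima mirror the tables above, and every name here (RUBRICS, EVALUATOR_FILES, dimension_score) is illustrative rather than part of the evaluator files:

```python
# Per-dimension rubrics: component name -> maximum points, taken from the tables above.
RUBRICS = {
    "technical": {"code_quality": 25, "architecture": 25, "performance": 25, "robustness": 25},
    "creativity": {"originality": 25, "innovation": 25, "uniqueness": 25, "aesthetic": 25},
    "compliance": {"requirements_met": 40, "naming_conventions": 20,
                   "structure_adherence": 20, "quality_standards": 20},
}

# Evaluator definitions consulted for each dimension.
EVALUATOR_FILES = {
    "technical": "evaluators/technical_quality.md",
    "creativity": "evaluators/creativity_score.md",
    "compliance": "evaluators/spec_compliance.md",
}

def dimension_score(dimension: str, component_scores: dict[str, int]) -> int:
    """Sum per-component scores after clamping each to its rubric maximum."""
    rubric = RUBRICS[dimension]
    if set(component_scores) != set(rubric):
        raise ValueError(f"Expected components {sorted(rubric)} for {dimension}")
    return sum(min(score, rubric[name]) for name, score in component_scores.items())

# Example matching the technical report below:
# dimension_score("technical", {"code_quality": 20, "architecture": 19,
#                               "performance": 18, "robustness": 21}) == 78
```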
OBSERVATION Phase: Document Results
Output format:
{
"iteration": "iteration_001.html",
"dimension": "technical",
"score": 78,
"breakdown": {
"code_quality": 20,
"architecture": 19,
"performance": 18,
"robustness": 21
},
"reasoning": {
"strengths": [
"Clean, well-commented code",
"Excellent error handling",
"Modular component structure"
],
"weaknesses": [
"Some repeated code blocks",
"Performance could be optimized for large datasets"
],
"evidence": [
"Lines 45-67: Robust input validation",
"Lines 120-145: Efficient caching mechanism"
]
},
"timestamp": "2025-10-10T14:23:45Z"
}
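One property worth checking before emitting a report is that the breakdown sums to the score; a hedged sketch, where check_report is a hypothetical helper:

```python
def check_report(report: dict) -> None:
    """Raise if the breakdown components do not sum to the reported score."""
    total = sum(report["breakdown"].values())
    if total != report["score"]:
        raise ValueError(f"Breakdown sums to {total}, but score is {report['score']}")

# Values from the example above: 20 + 19 + 18 + 21 = 78, so this check passes.
check_report({"score": 78, "breakdown": {"code_quality": 20, "architecture": 19,
                                         "performance": 18, "robustness": 21}})
```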
Human-Readable Summary:
=== EVALUATION RESULTS ===
Iteration: iteration_001.html
Dimension: Technical Quality
Score: 78/100
BREAKDOWN:
- Code Quality: 20/25 - Clean, well-commented code
- Architecture: 19/25 - Modular structure, minor coupling issues
- Performance: 18/25 - Good baseline, room for optimization
- Robustness: 21/25 - Excellent error handling
STRENGTHS:
+ Clean, well-commented code
+ Excellent error handling
+ Modular component structure
WEAKNESSES:
- Some repeated code blocks (DRY principle violation)
- Performance could be optimized for large datasets
EVIDENCE:
• Lines 45-67: Robust input validation with clear error messages
• Lines 120-145: Efficient caching mechanism reduces redundant calculations
REASONING:
This iteration demonstrates strong fundamentals with clean code and
excellent robustness. The architecture is well-thought-out with good
separation of concerns. Performance is adequate but could benefit from
optimization for edge cases. Overall, a solid technical implementation
that slightly exceeds expectations.
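The human-readable summary can be derived from the same JSON report; a minimal sketch, where render_summary_header is a hypothetical helper and the display labels are simplified:

```python
def render_summary_header(report: dict) -> str:
    """Format the header and breakdown of the human-readable summary from the JSON report."""
    lines = [
        "=== EVALUATION RESULTS ===",
        f"Iteration: {report['iteration']}",
        f"Dimension: {report['dimension'].title()}",
        f"Score: {report['score']}/100",
        "BREAKDOWN:",
    ]
    for name, points in report["breakdown"].items():
        # Per-component maxima (25, or 40/20 for compliance) come from the scoring tables above.
        lines.append(f"- {name.replace('_', ' ').title()}: {points} points")
    return "\n".join(lines)
```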
Multi-Dimension Evaluation (dimension="all")
When evaluating all dimensions:
- Execute each dimension evaluation sequentially
  - Technical → Creativity → Compliance
  - Each with a full THOUGHT-ACTION-OBSERVATION cycle
- Calculate the composite score (see the sketch at the end of this section):
  composite = (technical * 0.35) + (creativity * 0.35) + (compliance * 0.30)
- Identify quality trade-offs
  - High technical + low creativity?
  - High creativity + low compliance?
  - Document trade-off patterns
- Generate a comprehensive summary
=== COMPREHENSIVE EVALUATION ===
Iteration: iteration_001.html
COMPOSITE SCORE: 76/100
Dimension Scores:
- Technical Quality: 78/100 (Weight: 35%) = 27.3
- Creativity Score: 82/100 (Weight: 35%) = 28.7
- Spec Compliance: 68/100 (Weight: 30%) = 20.4
OVERALL ASSESSMENT:
This iteration excels in creativity and technical implementation but
shows room for improvement in spec compliance, particularly around
naming conventions and structure adherence.
QUALITY PROFILE: "Creative Innovator"
- Strengths: Novel approach, clean code, innovative solutions
- Growth Areas: Specification adherence, naming consistency
RECOMMENDATIONS:
1. Review spec naming conventions and apply consistently
2. Maintain creative innovation while improving compliance
3. The current balance favors creativity over compliance; consider closer alignment with the spec
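A minimal sketch of the composite calculation; the weights come from the formula above, and rounding to a whole number is an assumption based on the example scores:

```python
WEIGHTS = {"technical": 0.35, "creativity": 0.35, "compliance": 0.30}

def composite_score(scores: dict[str, float]) -> int:
    """Weighted composite of the three dimension scores, rounded to a whole number."""
    return round(sum(scores[dim] * weight for dim, weight in WEIGHTS.items()))

# Example from the summary above: 78*0.35 + 82*0.35 + 68*0.30 = 76.4, reported as 76.
assert composite_score({"technical": 78, "creativity": 82, "compliance": 68}) == 76
```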
Reasoning Documentation
For each evaluation, document the reasoning process:
- Pre-Evaluation Thoughts
  - What am I looking for?
  - What criteria matter most?
  - How will I avoid bias?
- During-Evaluation Observations
  - What patterns do I see?
  - What stands out positively?
  - What concerns emerge?
- Post-Evaluation Reflection
  - Does the score feel right?
  - Did I apply criteria consistently?
  - What would improve this iteration?
  - What can others learn from this evaluation?
Output Storage
Evaluation results are stored in:
{output_dir}/quality_reports/evaluations/iteration_{N}_evaluation.json
This enables:
- Historical tracking of quality trends
- Comparison across iterations
- Machine-readable quality data
- Re-evaluation with updated criteria
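A minimal sketch of writing a report to that location, assuming Python; store_evaluation and its parameters are illustrative, and the zero-padded {N} is an assumption:

```python
import json
from pathlib import Path

def store_evaluation(output_dir: str, iteration_number: int, report: dict) -> Path:
    """Write the evaluation JSON to the storage location described above."""
    # Zero-padding to three digits matches iteration file names like
    # iteration_001.html; this is an assumption about {N}.
    path = (Path(output_dir) / "quality_reports" / "evaluations"
            / f"iteration_{iteration_number:03d}_evaluation.json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(report, indent=2))
    return path
```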
Error Handling
- Iteration not found: Report error, skip evaluation
- Spec required but missing: Report error for compliance dimension
- Invalid dimension: Report valid options
- Evaluation criteria missing: Use defaults, log warning
- Scoring inconsistency: Re-evaluate with explicit reasoning
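For the "criteria missing" case in the list above, a hedged sketch of the fallback behavior; the logger name, DEFAULT_CRITERIA text, and load_criteria helper are assumptions:

```python
import logging
from pathlib import Path

logger = logging.getLogger("evaluate")

# Fallback text used when an evaluator file is missing; the wording is illustrative.
DEFAULT_CRITERIA = "Use the scoring tables defined in this command as the criteria."

def load_criteria(evaluator_file: str) -> str:
    """Load evaluator criteria, falling back to defaults with a warning if missing."""
    path = Path(evaluator_file)
    if not path.exists():
        logger.warning("Evaluation criteria missing (%s); using defaults", path)
        return DEFAULT_CRITERIA
    return path.read_text()
```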
Success Criteria
A successful evaluation demonstrates:
- Clear reasoning before scoring
- Objective, evidence-based scoring
- Specific examples supporting scores
- Actionable feedback for improvement
- Consistent application of criteria
- Transparent documentation of thought process
Remember: Evaluation is not about being harsh or lenient - it's about being fair, consistent, and helpful. Reason about quality, observe evidence, and let observations guide your scores.