# Evaluation Utility Command

Evaluate a single iteration on a specific quality dimension using the ReAct reasoning pattern.

## Syntax

```
/evaluate [dimension] [iteration_path] [spec_path]
```

**Parameters:**

- `dimension`: One of "technical", "creativity", "compliance", or "all"
- `iteration_path`: Path to the iteration file/directory to evaluate
- `spec_path`: Required for the "compliance" dimension, optional for others

**Examples:**

```
/evaluate technical output/iteration_001.html
/evaluate creativity output/iteration_005.html
/evaluate compliance output/iteration_003.html specs/example_spec.md
/evaluate all output/iteration_002.html specs/example_spec.md
```

## Execution Process

### THOUGHT Phase: Reasoning About Evaluation

Before scoring, reason about:

1. **What defines quality in this dimension?**
   - For technical: Architecture, code quality, performance, robustness
   - For creativity: Originality, innovation, aesthetic choices, uniqueness
   - For compliance: Requirement fulfillment, naming, structure, standards

2. **What evidence should I look for?**
   - Concrete artifacts that demonstrate quality
   - Code patterns, design decisions, implementation details
   - Documentation and self-assessment comments

3. **What are potential pitfalls in this evaluation?**
   - Subjective bias
   - Missing context
   - Unfair comparisons
   - Evaluation drift

4. **How will I ensure objective scoring?**
   - Use specific criteria from evaluator definitions
   - Look for measurable indicators
   - Document reasoning for each score component

### ACTION Phase: Execute Evaluation

1. **Load Iteration Content**
   - Read the file(s) completely
   - Parse structure and components
   - Extract metadata and documentation

2. **Load Evaluation Criteria**
   - For technical: Use `evaluators/technical_quality.md`
   - For creativity: Use `evaluators/creativity_score.md`
   - For compliance: Use `evaluators/spec_compliance.md`

3. **Apply Evaluation Logic**

   **For Technical Quality:**

   ```
   Scoring (0-100):
   - Code Quality (25 points): Clean, readable, maintainable code
   - Architecture (25 points): Well-structured, modular design
   - Performance (25 points): Efficient algorithms, optimized rendering
   - Robustness (25 points): Error handling, edge cases, validation
   ```

   **For Creativity Score:**

   ```
   Scoring (0-100):
   - Originality (25 points): Novel ideas, unique approach
   - Innovation (25 points): Creative problem-solving, fresh perspective
   - Uniqueness (25 points): Differentiation from existing iterations
   - Aesthetic (25 points): Visual appeal, design sophistication
   ```

   **For Spec Compliance:**

   ```
   Scoring (0-100):
   - Requirements Met (40 points): All spec requirements fulfilled
   - Naming Conventions (20 points): Follows spec naming patterns
   - Structure Adherence (20 points): Matches spec structure
   - Quality Standards (20 points): Meets spec quality criteria
   ```

4. **Calculate Scores**
   - Score each sub-component
   - Sum to dimension total (see the sketch after this list)
   - Document scoring reasoning
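To make the "Calculate Scores" step concrete, here is a minimal Python sketch of how sub-component scores could be summed into a dimension total. The rubric values mirror the Technical Quality criteria above; the names `technical_rubric` and `calculate_dimension_score` are illustrative assumptions, not part of the command.

```python
# Sketch only: sum sub-component scores into a dimension total (0-100).
# Rubric maxima come from the Technical Quality criteria above.

technical_rubric = {
    "code_quality": 25,
    "architecture": 25,
    "performance": 25,
    "robustness": 25,
}

def calculate_dimension_score(sub_scores: dict[str, int], rubric: dict[str, int]) -> int:
    """Sum sub-component scores after clamping each to its rubric maximum."""
    total = 0
    for component, max_points in rubric.items():
        awarded = sub_scores.get(component, 0)
        total += max(0, min(awarded, max_points))  # keep each score within 0..max
    return total

# Using the breakdown from the sample report below: 20 + 19 + 18 + 21 = 78
sub_scores = {"code_quality": 20, "architecture": 19, "performance": 18, "robustness": 21}
print(calculate_dimension_score(sub_scores, technical_rubric))  # 78
```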
### OBSERVATION Phase: Document Results

Output format:

```json
{
  "iteration": "iteration_001.html",
  "dimension": "technical",
  "score": 78,
  "breakdown": {
    "code_quality": 20,
    "architecture": 19,
    "performance": 18,
    "robustness": 21
  },
  "reasoning": {
    "strengths": [
      "Clean, well-commented code",
      "Excellent error handling",
      "Modular component structure"
    ],
    "weaknesses": [
      "Some repeated code blocks",
      "Performance could be optimized for large datasets"
    ],
    "evidence": [
      "Lines 45-67: Robust input validation",
      "Lines 120-145: Efficient caching mechanism"
    ]
  },
  "timestamp": "2025-10-10T14:23:45Z"
}
```

**Human-Readable Summary:**

```
=== EVALUATION RESULTS ===
Iteration: iteration_001.html
Dimension: Technical Quality
Score: 78/100

BREAKDOWN:
- Code Quality: 20/25 - Clean, well-commented code
- Architecture: 19/25 - Modular structure, minor coupling issues
- Performance: 18/25 - Good baseline, room for optimization
- Robustness: 21/25 - Excellent error handling

STRENGTHS:
+ Clean, well-commented code
+ Excellent error handling
+ Modular component structure

WEAKNESSES:
- Some repeated code blocks (DRY principle violation)
- Performance could be optimized for large datasets

EVIDENCE:
• Lines 45-67: Robust input validation with clear error messages
• Lines 120-145: Efficient caching mechanism reduces redundant calculations

REASONING:
This iteration demonstrates strong fundamentals with clean code and excellent
robustness. The architecture is well-thought-out with good separation of
concerns. Performance is adequate but could benefit from optimization for
edge cases. Overall, a solid technical implementation that slightly exceeds
expectations.
```

## Multi-Dimension Evaluation (dimension="all")

When evaluating all dimensions:

1. **Execute each dimension evaluation sequentially**
   - Technical → Creativity → Compliance
   - Each with full THOUGHT-ACTION-OBSERVATION cycle

2. **Calculate composite score** (see the sketch after this list)

   ```
   composite = (technical * 0.35) + (creativity * 0.35) + (compliance * 0.30)
   ```

3. **Identify quality trade-offs**
   - High technical + low creativity?
   - High creativity + low compliance?
   - Document trade-off patterns

4. **Generate comprehensive summary**

   ```
   === COMPREHENSIVE EVALUATION ===
   Iteration: iteration_001.html

   COMPOSITE SCORE: 76/100

   Dimension Scores:
   - Technical Quality: 78/100 (Weight: 35%) = 27.3
   - Creativity Score: 82/100 (Weight: 35%) = 28.7
   - Spec Compliance: 68/100 (Weight: 30%) = 20.4

   OVERALL ASSESSMENT:
   This iteration excels in creativity and technical implementation but shows
   room for improvement in spec compliance, particularly around naming
   conventions and structure adherence.

   QUALITY PROFILE: "Creative Innovator"
   - Strengths: Novel approach, clean code, innovative solutions
   - Growth Areas: Specification adherence, naming consistency

   RECOMMENDATIONS:
   1. Review spec naming conventions and apply consistently
   2. Maintain creative innovation while improving compliance
   3. Current balance favors creativity over compliance - consider alignment
   ```
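As a concrete illustration of the composite formula in step 2, the following minimal Python sketch performs the weighted combination. The weights are taken directly from the formula above; `DIMENSION_WEIGHTS` and `composite_score` are illustrative names, not part of the command.

```python
# Sketch only: weighted composite of the three dimension scores.
# Weights mirror the formula in step 2 above.

DIMENSION_WEIGHTS = {"technical": 0.35, "creativity": 0.35, "compliance": 0.30}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of dimension scores, rounded to one decimal place."""
    return round(sum(scores[d] * w for d, w in DIMENSION_WEIGHTS.items()), 1)

# The worked example from the comprehensive summary:
# 78 * 0.35 + 82 * 0.35 + 68 * 0.30 = 27.3 + 28.7 + 20.4 = 76.4 (reported as 76/100)
print(composite_score({"technical": 78, "creativity": 82, "compliance": 68}))  # 76.4
```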
## Reasoning Documentation

For each evaluation, document the reasoning process:

1. **Pre-Evaluation Thoughts**
   - What am I looking for?
   - What criteria matter most?
   - How will I avoid bias?

2. **During Evaluation Observations**
   - What patterns do I see?
   - What stands out positively?
   - What concerns emerge?

3. **Post-Evaluation Reflection**
   - Does the score feel right?
   - Did I apply criteria consistently?
   - What would improve this iteration?
   - What can others learn from this evaluation?

## Output Storage

Evaluation results are stored in:

```
{output_dir}/quality_reports/evaluations/iteration_{N}_evaluation.json
```

This enables:

- Historical tracking of quality trends
- Comparison across iterations
- Machine-readable quality data
- Re-evaluation with updated criteria

## Error Handling

- **Iteration not found**: Report error, skip evaluation
- **Spec required but missing**: Report error for compliance dimension
- **Invalid dimension**: Report valid options
- **Evaluation criteria missing**: Use defaults, log warning
- **Scoring inconsistency**: Re-evaluate with explicit reasoning

## Success Criteria

A successful evaluation demonstrates:

- Clear reasoning before scoring
- Objective, evidence-based scoring
- Specific examples supporting scores
- Actionable feedback for improvement
- Consistent application of criteria
- Transparent documentation of thought process

---

**Remember**: Evaluation is not about being harsh or lenient - it's about being fair, consistent, and helpful. Reason about quality, observe evidence, and let observations guide your scores.