Evaluation Utility Command
Evaluate a single iteration on a specific quality dimension using the ReAct reasoning pattern.
Syntax
/evaluate <dimension> <iteration_path> [spec_path]
Parameters:
- dimension: One of "technical", "creativity", "compliance", or "all"
- iteration_path: Path to the iteration file/directory to evaluate
- spec_path: Required for the "compliance" dimension, optional for others
Examples:
/evaluate technical output/iteration_001.html
/evaluate creativity output/iteration_005.html
/evaluate compliance output/iteration_003.html specs/example_spec.md
/evaluate all output/iteration_002.html specs/example_spec.md
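A minimal sketch of how these arguments could be validated, assuming Python tooling; the function and constant names are illustrative and not part of the command definition:

```python
from pathlib import Path

VALID_DIMENSIONS = {"technical", "creativity", "compliance", "all"}

def parse_evaluate_args(args: list[str]) -> dict:
    """Validate `/evaluate <dimension> <iteration_path> [spec_path]` arguments."""
    if len(args) < 2:
        raise ValueError("Usage: /evaluate <dimension> <iteration_path> [spec_path]")
    dimension, iteration_path = args[0], Path(args[1])
    spec_path = Path(args[2]) if len(args) > 2 else None

    if dimension not in VALID_DIMENSIONS:
        raise ValueError(f"Invalid dimension {dimension!r}; valid options: {sorted(VALID_DIMENSIONS)}")
    if not iteration_path.exists():
        raise FileNotFoundError(f"Iteration not found: {iteration_path}")
    # Assumption: "all" also needs a spec because it includes the compliance pass.
    if dimension in ("compliance", "all") and spec_path is None:
        raise ValueError("spec_path is required for the compliance dimension")
    return {"dimension": dimension, "iteration_path": iteration_path, "spec_path": spec_path}
```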
Execution Process
THOUGHT Phase: Reasoning About Evaluation
Before scoring, reason about:
- What defines quality in this dimension?
  - For technical: Architecture, code quality, performance, robustness
  - For creativity: Originality, innovation, aesthetic choices, uniqueness
  - For compliance: Requirement fulfillment, naming, structure, standards
- What evidence should I look for?
  - Concrete artifacts that demonstrate quality
  - Code patterns, design decisions, implementation details
  - Documentation and self-assessment comments
- What are potential pitfalls in this evaluation?
  - Subjective bias
  - Missing context
  - Unfair comparisons
  - Evaluation drift
- How will I ensure objective scoring?
  - Use specific criteria from the evaluator definitions
  - Look for measurable indicators
  - Document the reasoning for each score component
ACTION Phase: Execute Evaluation
- Load Iteration Content
  - Read the file(s) completely
  - Parse structure and components
  - Extract metadata and documentation
- Load Evaluation Criteria
  - For technical: Use evaluators/technical_quality.md
  - For creativity: Use evaluators/creativity_score.md
  - For compliance: Use evaluators/spec_compliance.md
- Apply Evaluation Logic
  For Technical Quality, scoring (0-100):
  - Code Quality (25 points): Clean, readable, maintainable code
  - Architecture (25 points): Well-structured, modular design
  - Performance (25 points): Efficient algorithms, optimized rendering
  - Robustness (25 points): Error handling, edge cases, validation
  For Creativity Score, scoring (0-100):
  - Originality (25 points): Novel ideas, unique approach
  - Innovation (25 points): Creative problem-solving, fresh perspective
  - Uniqueness (25 points): Differentiation from existing iterations
  - Aesthetic (25 points): Visual appeal, design sophistication
  For Spec Compliance, scoring (0-100):
  - Requirements Met (40 points): All spec requirements fulfilled
  - Naming Conventions (20 points): Follows spec naming patterns
  - Structure Adherence (20 points): Matches spec structure
  - Quality Standards (20 points): Meets spec quality criteria
- Calculate Scores (a sketch follows this list)
  - Score each sub-component
  - Sum to the dimension total
  - Document the scoring reasoning
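A minimal sketch of the scoring step, assuming Python; the rubric maxima mirror the tables above, and every name here (RUBRICS, EVALUATOR_FILES, dimension_score) is illustrative rather than part of the evaluator files:

```python
# Per-dimension rubrics: component name -> maximum points, taken from the tables above.
RUBRICS = {
    "technical": {"code_quality": 25, "architecture": 25, "performance": 25, "robustness": 25},
    "creativity": {"originality": 25, "innovation": 25, "uniqueness": 25, "aesthetic": 25},
    "compliance": {"requirements_met": 40, "naming_conventions": 20,
                   "structure_adherence": 20, "quality_standards": 20},
}

# Evaluator definitions consulted for each dimension.
EVALUATOR_FILES = {
    "technical": "evaluators/technical_quality.md",
    "creativity": "evaluators/creativity_score.md",
    "compliance": "evaluators/spec_compliance.md",
}

def dimension_score(dimension: str, component_scores: dict[str, int]) -> int:
    """Sum per-component scores after clamping each to its rubric maximum."""
    rubric = RUBRICS[dimension]
    if set(component_scores) != set(rubric):
        raise ValueError(f"Expected components {sorted(rubric)} for {dimension}")
    return sum(min(score, rubric[name]) for name, score in component_scores.items())

# Example matching the technical report below:
# dimension_score("technical", {"code_quality": 20, "architecture": 19,
#                               "performance": 18, "robustness": 21}) == 78
```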
OBSERVATION Phase: Document Results
Output format:
{
"iteration": "iteration_001.html",
"dimension": "technical",
"score": 78,
"breakdown": {
"code_quality": 20,
"architecture": 19,
"performance": 18,
"robustness": 21
},
"reasoning": {
"strengths": [
"Clean, well-commented code",
"Excellent error handling",
"Modular component structure"
],
"weaknesses": [
"Some repeated code blocks",
"Performance could be optimized for large datasets"
],
"evidence": [
"Lines 45-67: Robust input validation",
"Lines 120-145: Efficient caching mechanism"
]
},
"timestamp": "2025-10-10T14:23:45Z"
}
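One property worth checking before emitting a report is that the breakdown sums to the score; a hedged sketch, where check_report is a hypothetical helper:

```python
def check_report(report: dict) -> None:
    """Raise if the breakdown components do not sum to the reported score."""
    total = sum(report["breakdown"].values())
    if total != report["score"]:
        raise ValueError(f"Breakdown sums to {total}, but score is {report['score']}")

# Values from the example above: 20 + 19 + 18 + 21 = 78, so this check passes.
check_report({"score": 78, "breakdown": {"code_quality": 20, "architecture": 19,
                                         "performance": 18, "robustness": 21}})
```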
Human-Readable Summary:
=== EVALUATION RESULTS ===
Iteration: iteration_001.html
Dimension: Technical Quality
Score: 78/100
BREAKDOWN:
- Code Quality: 20/25 - Clean, well-commented code
- Architecture: 19/25 - Modular structure, minor coupling issues
- Performance: 18/25 - Good baseline, room for optimization
- Robustness: 21/25 - Excellent error handling
STRENGTHS:
+ Clean, well-commented code
+ Excellent error handling
+ Modular component structure
WEAKNESSES:
- Some repeated code blocks (DRY principle violation)
- Performance could be optimized for large datasets
EVIDENCE:
• Lines 45-67: Robust input validation with clear error messages
• Lines 120-145: Efficient caching mechanism reduces redundant calculations
REASONING:
This iteration demonstrates strong fundamentals with clean code and
excellent robustness. The architecture is well-thought-out with good
separation of concerns. Performance is adequate but could benefit from
optimization for edge cases. Overall, a solid technical implementation
that slightly exceeds expectations.
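The human-readable summary can be derived from the same JSON report; a minimal sketch, where render_summary_header is a hypothetical helper and the display labels are simplified:

```python
def render_summary_header(report: dict) -> str:
    """Format the header and breakdown of the human-readable summary from the JSON report."""
    lines = [
        "=== EVALUATION RESULTS ===",
        f"Iteration: {report['iteration']}",
        f"Dimension: {report['dimension'].title()}",
        f"Score: {report['score']}/100",
        "BREAKDOWN:",
    ]
    for name, points in report["breakdown"].items():
        # Per-component maxima (25, or 40/20 for compliance) come from the scoring tables above.
        lines.append(f"- {name.replace('_', ' ').title()}: {points} points")
    return "\n".join(lines)
```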
Multi-Dimension Evaluation (dimension="all")
When evaluating all dimensions:
- Execute each dimension evaluation sequentially
  - Technical → Creativity → Compliance
  - Each with a full THOUGHT-ACTION-OBSERVATION cycle
- Calculate the composite score (see the sketch at the end of this section):
  composite = (technical * 0.35) + (creativity * 0.35) + (compliance * 0.30)
- Identify quality trade-offs
  - High technical + low creativity?
  - High creativity + low compliance?
  - Document trade-off patterns
- Generate a comprehensive summary
=== COMPREHENSIVE EVALUATION ===
Iteration: iteration_001.html
COMPOSITE SCORE: 76/100
Dimension Scores:
- Technical Quality: 78/100 (Weight: 35%) = 27.3
- Creativity Score: 82/100 (Weight: 35%) = 28.7
- Spec Compliance: 68/100 (Weight: 30%) = 20.4
OVERALL ASSESSMENT:
This iteration excels in creativity and technical implementation but
shows room for improvement in spec compliance, particularly around
naming conventions and structure adherence.
QUALITY PROFILE: "Creative Innovator"
- Strengths: Novel approach, clean code, innovative solutions
- Growth Areas: Specification adherence, naming consistency
RECOMMENDATIONS:
1. Review spec naming conventions and apply consistently
2. Maintain creative innovation while improving compliance
3. The current balance favors creativity over compliance; consider closer alignment with the spec
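A minimal sketch of the composite calculation; the weights come from the formula above, and rounding to a whole number is an assumption based on the example scores:

```python
WEIGHTS = {"technical": 0.35, "creativity": 0.35, "compliance": 0.30}

def composite_score(scores: dict[str, float]) -> int:
    """Weighted composite of the three dimension scores, rounded to a whole number."""
    return round(sum(scores[dim] * weight for dim, weight in WEIGHTS.items()))

# Example from the summary above: 78*0.35 + 82*0.35 + 68*0.30 = 76.4, reported as 76.
assert composite_score({"technical": 78, "creativity": 82, "compliance": 68}) == 76
```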
Reasoning Documentation
For each evaluation, document the reasoning process:
- Pre-Evaluation Thoughts
  - What am I looking for?
  - What criteria matter most?
  - How will I avoid bias?
- During-Evaluation Observations
  - What patterns do I see?
  - What stands out positively?
  - What concerns emerge?
- Post-Evaluation Reflection
  - Does the score feel right?
  - Did I apply criteria consistently?
  - What would improve this iteration?
  - What can others learn from this evaluation?
Output Storage
Evaluation results are stored in:
{output_dir}/quality_reports/evaluations/iteration_{N}_evaluation.json
This enables:
- Historical tracking of quality trends
- Comparison across iterations
- Machine-readable quality data
- Re-evaluation with updated criteria
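A minimal sketch of writing a report to that location, assuming Python; store_evaluation and its parameters are illustrative, and the zero-padded {N} is an assumption:

```python
import json
from pathlib import Path

def store_evaluation(output_dir: str, iteration_number: int, report: dict) -> Path:
    """Write the evaluation JSON to the storage location described above."""
    # Zero-padding to three digits matches iteration file names like
    # iteration_001.html; this is an assumption about {N}.
    path = (Path(output_dir) / "quality_reports" / "evaluations"
            / f"iteration_{iteration_number:03d}_evaluation.json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(report, indent=2))
    return path
```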
Error Handling
- Iteration not found: Report error, skip evaluation
- Spec required but missing: Report error for compliance dimension
- Invalid dimension: Report valid options
- Evaluation criteria missing: Use defaults, log warning
- Scoring inconsistency: Re-evaluate with explicit reasoning
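For the "criteria missing" case in the list above, a hedged sketch of the fallback behavior; the logger name, DEFAULT_CRITERIA text, and load_criteria helper are assumptions:

```python
import logging
from pathlib import Path

logger = logging.getLogger("evaluate")

# Fallback text used when an evaluator file is missing; the wording is illustrative.
DEFAULT_CRITERIA = "Use the scoring tables defined in this command as the criteria."

def load_criteria(evaluator_file: str) -> str:
    """Load evaluator criteria, falling back to defaults with a warning if missing."""
    path = Path(evaluator_file)
    if not path.exists():
        logger.warning("Evaluation criteria missing (%s); using defaults", path)
        return DEFAULT_CRITERIA
    return path.read_text()
```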
Success Criteria
A successful evaluation demonstrates:
- Clear reasoning before scoring
- Objective, evidence-based scoring
- Specific examples supporting scores
- Actionable feedback for improvement
- Consistent application of criteria
- Transparent documentation of thought process
Remember: Evaluation is not about being harsh or lenient - it's about being fair, consistent, and helpful. Reason about quality, observe evidence, and let observations guide your scores.