Evaluation Utility Command

Evaluate a single iteration on a specific quality dimension using the ReAct reasoning pattern.

Syntax

/evaluate <dimension> <iteration_path> [spec_path]

Parameters:

  • dimension: One of "technical", "creativity", "compliance", or "all"
  • iteration_path: Path to the iteration file/directory to evaluate
  • spec_path: Required for the "compliance" dimension (and therefore for "all"); optional otherwise

Examples:

/evaluate technical output/iteration_001.html
/evaluate creativity output/iteration_005.html
/evaluate compliance output/iteration_003.html specs/example_spec.md
/evaluate all output/iteration_002.html specs/example_spec.md
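
Arguments can be checked up front before any scoring begins. A minimal validation sketch in Python, assuming the arguments arrive as a list of strings (the helper name below is illustrative, not part of the command itself):

```python
# Hypothetical helper: validate /evaluate arguments before evaluation starts.
VALID_DIMENSIONS = {"technical", "creativity", "compliance", "all"}

def parse_evaluate_args(args: list[str]) -> dict:
    """Return {dimension, iteration_path, spec_path} or raise ValueError."""
    if len(args) < 2:
        raise ValueError("usage: /evaluate <dimension> <iteration_path> [spec_path]")
    dimension, iteration_path = args[0], args[1]
    spec_path = args[2] if len(args) > 2 else None
    if dimension not in VALID_DIMENSIONS:
        raise ValueError(f"invalid dimension {dimension!r}; valid options: {sorted(VALID_DIMENSIONS)}")
    if dimension in {"compliance", "all"} and spec_path is None:
        raise ValueError("spec_path is required when the compliance dimension is evaluated")
    return {"dimension": dimension, "iteration_path": iteration_path, "spec_path": spec_path}
```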

Execution Process

THOUGHT Phase: Reasoning About Evaluation

Before scoring, reason about:

  1. What defines quality in this dimension?

    • For technical: Architecture, code quality, performance, robustness
    • For creativity: Originality, innovation, aesthetic choices, uniqueness
    • For compliance: Requirement fulfillment, naming, structure, standards
  2. What evidence should I look for?

    • Concrete artifacts that demonstrate quality
    • Code patterns, design decisions, implementation details
    • Documentation and self-assessment comments
  3. What are potential pitfalls in this evaluation?

    • Subjective bias
    • Missing context
    • Unfair comparisons
    • Evaluation drift
  4. How will I ensure objective scoring?

    • Use specific criteria from evaluator definitions
    • Look for measurable indicators
    • Document reasoning for each score component

ACTION Phase: Execute Evaluation

  1. Load Iteration Content

    • Read the file(s) completely
    • Parse structure and components
    • Extract metadata and documentation
  2. Load Evaluation Criteria

    • For technical: Use evaluators/technical_quality.md
    • For creativity: Use evaluators/creativity_score.md
    • For compliance: Use evaluators/spec_compliance.md
  3. Apply Evaluation Logic

    For Technical Quality:

    Scoring (0-100):
    - Code Quality (25 points): Clean, readable, maintainable code
    - Architecture (25 points): Well-structured, modular design
    - Performance (25 points): Efficient algorithms, optimized rendering
    - Robustness (25 points): Error handling, edge cases, validation
    

    For Creativity Score:

    Scoring (0-100):
    - Originality (25 points): Novel ideas, unique approach
    - Innovation (25 points): Creative problem-solving, fresh perspective
    - Uniqueness (25 points): Differentiation from existing iterations
    - Aesthetic (25 points): Visual appeal, design sophistication
    

    For Spec Compliance:

    Scoring (0-100):
    - Requirements Met (40 points): All spec requirements fulfilled
    - Naming Conventions (20 points): Follows spec naming patterns
    - Structure Adherence (20 points): Matches spec structure
    - Quality Standards (20 points): Meets spec quality criteria
    
  4. Calculate Scores

    • Score each sub-component
    • Sum to dimension total (see the sketch after this list)
    • Document scoring reasoning
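
A minimal sketch of the point allocations from step 3 expressed as data, assuming each sub-component score is produced by the evaluation logic above (names are illustrative):

```python
# Point allocations copied from the rubrics above; each dimension sums to 100.
RUBRICS = {
    "technical": {"code_quality": 25, "architecture": 25, "performance": 25, "robustness": 25},
    "creativity": {"originality": 25, "innovation": 25, "uniqueness": 25, "aesthetic": 25},
    "compliance": {"requirements_met": 40, "naming_conventions": 20,
                   "structure_adherence": 20, "quality_standards": 20},
}

def dimension_total(dimension: str, breakdown: dict[str, int]) -> int:
    """Sum sub-component scores, clamping each to its maximum point allocation."""
    rubric = RUBRICS[dimension]
    return sum(min(breakdown.get(name, 0), cap) for name, cap in rubric.items())
```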

OBSERVATION Phase: Document Results

Output format:

{
  "iteration": "iteration_001.html",
  "dimension": "technical",
  "score": 78,
  "breakdown": {
    "code_quality": 20,
    "architecture": 19,
    "performance": 18,
    "robustness": 21
  },
  "reasoning": {
    "strengths": [
      "Clean, well-commented code",
      "Excellent error handling",
      "Modular component structure"
    ],
    "weaknesses": [
      "Some repeated code blocks",
      "Performance could be optimized for large datasets"
    ],
    "evidence": [
      "Lines 45-67: Robust input validation",
      "Lines 120-145: Efficient caching mechanism"
    ]
  },
  "timestamp": "2025-10-10T14:23:45Z"
}
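
Before writing the JSON, the breakdown can be sanity-checked against the reported score. A minimal sketch, assuming the breakdown keys mirror the rubric sub-components:

```python
def check_result(result: dict) -> None:
    """Assert that the breakdown sums to the reported dimension score."""
    total = sum(result["breakdown"].values())
    assert total == result["score"], f"breakdown sums to {total}, not {result['score']}"
```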

Human-Readable Summary:

=== EVALUATION RESULTS ===

Iteration: iteration_001.html
Dimension: Technical Quality
Score: 78/100

BREAKDOWN:
- Code Quality: 20/25 - Clean, well-commented code
- Architecture: 19/25 - Modular structure, minor coupling issues
- Performance: 18/25 - Good baseline, room for optimization
- Robustness: 21/25 - Excellent error handling

STRENGTHS:
+ Clean, well-commented code
+ Excellent error handling
+ Modular component structure

WEAKNESSES:
- Some repeated code blocks (DRY principle violation)
- Performance could be optimized for large datasets

EVIDENCE:
• Lines 45-67: Robust input validation with clear error messages
• Lines 120-145: Efficient caching mechanism reduces redundant calculations

REASONING:
This iteration demonstrates strong fundamentals with clean code and
excellent robustness. The architecture is well-thought-out with good
separation of concerns. Performance is adequate but could benefit from
optimization for edge cases. Overall, a solid technical implementation
that slightly exceeds expectations.

Multi-Dimension Evaluation (dimension="all")

When evaluating all dimensions:

  1. Execute each dimension evaluation sequentially

    • Technical → Creativity → Compliance
    • Each with full THOUGHT-ACTION-OBSERVATION cycle
  2. Calculate composite score (a calculation sketch follows the summary below)

    composite = (technical * 0.35) + (creativity * 0.35) + (compliance * 0.30)
    
  3. Identify quality trade-offs

    • High technical + low creativity?
    • High creativity + low compliance?
    • Document trade-off patterns
  4. Generate comprehensive summary

=== COMPREHENSIVE EVALUATION ===

Iteration: iteration_001.html

COMPOSITE SCORE: 76/100

Dimension Scores:
- Technical Quality: 78/100 (Weight: 35%) = 27.3
- Creativity Score: 82/100 (Weight: 35%) = 28.7
- Spec Compliance: 68/100 (Weight: 30%) = 20.4

OVERALL ASSESSMENT:
This iteration excels in creativity and technical implementation but
shows room for improvement in spec compliance, particularly around
naming conventions and structure adherence.

QUALITY PROFILE: "Creative Innovator"
- Strengths: Novel approach, clean code, innovative solutions
- Growth Areas: Specification adherence, naming consistency

RECOMMENDATIONS:
1. Review spec naming conventions and apply consistently
2. Maintain creative innovation while improving compliance
3. Current balance favors creativity over compliance - consider alignment
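
A minimal sketch of the composite calculation shown above, assuming the three dimension scores are already available:

```python
# Weights taken from the composite formula above.
WEIGHTS = {"technical": 0.35, "creativity": 0.35, "compliance": 0.30}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted composite of the three dimension scores (each 0-100)."""
    return round(sum(scores[dim] * weight for dim, weight in WEIGHTS.items()), 1)

# Using the dimension scores from the summary above:
# composite_score({"technical": 78, "creativity": 82, "compliance": 68}) -> 76.4 (reported as 76/100)
```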

Reasoning Documentation

For each evaluation, document the reasoning process:

  1. Pre-Evaluation Thoughts

    • What am I looking for?
    • What criteria matter most?
    • How will I avoid bias?
  2. During Evaluation Observations

    • What patterns do I see?
    • What stands out positively?
    • What concerns emerge?
  3. Post-Evaluation Reflection

    • Does the score feel right?
    • Did I apply criteria consistently?
    • What would improve this iteration?
    • What can others learn from this evaluation?

Output Storage

Evaluation results are stored in:

{output_dir}/quality_reports/evaluations/iteration_{N}_evaluation.json

This enables:

  • Historical tracking of quality trends
  • Comparison across iterations
  • Machine-readable quality data
  • Re-evaluation with updated criteria
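
A minimal sketch of writing one result to that location, assuming the iteration number is zero-padded to three digits to match the iteration filenames:

```python
import json
from pathlib import Path

def store_evaluation(output_dir: str, iteration_number: int, result: dict) -> Path:
    """Write one evaluation result under quality_reports/evaluations/."""
    path = (Path(output_dir) / "quality_reports" / "evaluations"
            / f"iteration_{iteration_number:03d}_evaluation.json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result, indent=2))
    return path
```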

Error Handling

  • Iteration not found: Report error, skip evaluation
  • Spec required but missing: Report error for compliance dimension
  • Invalid dimension: Report valid options
  • Evaluation criteria missing: Use defaults, log warning
  • Scoring inconsistency: Re-evaluate with explicit reasoning
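
A minimal sketch of the "criteria missing" fallback, using the evaluator files listed in the ACTION phase (the function name and warning format are illustrative):

```python
from pathlib import Path

EVALUATOR_FILES = {
    "technical": "evaluators/technical_quality.md",
    "creativity": "evaluators/creativity_score.md",
    "compliance": "evaluators/spec_compliance.md",
}

def load_criteria(dimension: str) -> str | None:
    """Load the evaluator definition; fall back to defaults with a logged warning if missing."""
    path = Path(EVALUATOR_FILES[dimension])
    if not path.exists():
        print(f"WARNING: evaluation criteria missing at {path}; using default criteria")
        return None  # caller falls back to the built-in default rubric
    return path.read_text()
```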

Success Criteria

A successful evaluation demonstrates:

  • Clear reasoning before scoring
  • Objective, evidence-based scoring
  • Specific examples supporting scores
  • Actionable feedback for improvement
  • Consistent application of criteria
  • Transparent documentation of thought process

Remember: Evaluation is not about being harsh or lenient - it's about being fair, consistent, and helpful. Reason about quality, observe evidence, and let observations guide your scores.