CLAUDE.md - Infinite Loop Variant 4: Quality Evaluation & Ranking System

This file provides guidance to Claude Code when working with the Quality Evaluation & Ranking System variant of the infinite agentic loop pattern.

Project Overview

This is Infinite Loop Variant 4, implementing automated quality evaluation and ranking for AI-generated iterations. The system uses the ReAct pattern (Reasoning + Acting + Observation) to evaluate, score, rank, and continuously improve iteration quality across multiple dimensions.

Key Concepts

ReAct Pattern Integration

Every operation in this system follows the ReAct cycle:

  1. THOUGHT (Reasoning): Explicitly reason about quality, evaluation strategy, and improvement opportunities before acting
  2. ACTION (Acting): Execute evaluations, generate content, score iterations with clear intent
  3. OBSERVATION (Observing): Analyze results, identify patterns, extract insights to inform next cycle

Critical: Always document reasoning. Every evaluation, ranking, and report should show the thought process that led to conclusions.

Multi-Dimensional Quality

Quality is assessed across three dimensions, each weighted in the composite score:

  • Technical Quality (35%): Code, architecture, performance, robustness
  • Creativity Score (35%): Originality, innovation, uniqueness, aesthetic
  • Spec Compliance (30%): Requirements, naming, structure, standards

Critical: Never evaluate just one dimension. Quality is holistic. An iteration can be technically perfect but creatively bland, or wildly creative but technically flawed. Balance matters.
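
To make the weighting concrete, here is a minimal sketch of a composite-score calculation using the default weights above; the function name and rounding are illustrative choices, not part of the project's evaluator logic.

def composite_score(technical, creativity, compliance,
                    weights=(0.35, 0.35, 0.30)):
    """Combine three 0-100 dimension scores into a weighted 0-100 composite."""
    w_tech, w_creative, w_comply = weights
    return round(technical * w_tech + creativity * w_creative + compliance * w_comply, 1)

# Example: a technically strong but creatively middling iteration
# composite_score(88, 72, 90) -> 83.0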

Quality-Driven Improvement

In infinite mode, quality assessment drives generation strategy:

  • Early waves: Establish baseline, explore diversity
  • Mid waves: Learn from top performers, address gaps
  • Late waves: Push frontiers, optimize composite scores
  • All waves: Monitor trends, adapt continuously

Critical: Don't just generate and evaluate. Learn from evaluations and adapt strategy accordingly. Let observations inform next actions.

Commands to Use

Primary Command: /project:infinite-quality

When to use: Generating iterations with quality evaluation and ranking

Syntax:

/project:infinite-quality <spec_path> <output_dir> <count|infinite> [config_path]

Key responsibilities when executing:

  1. Initial Reasoning (THOUGHT):

    • Deeply understand spec quality criteria
    • Plan evaluation strategy
    • Design quality-driven creative directions
    • Consider what constitutes quality in this context
  2. Generation (ACTION):

    • Launch sub-agents with quality targets
    • Provide spec + quality standards
    • Assign diverse creative directions
    • Generate with self-assessment
  3. Evaluation (ACTION):

    • Score all iterations on all dimensions
    • Use evaluators from evaluators/ directory
    • Document evidence for all scores
    • Be fair, consistent, and thorough
  4. Ranking & Reporting (OBSERVATION):

    • Rank by composite score
    • Identify patterns and trade-offs
    • Extract actionable insights
    • Generate comprehensive report
  5. Strategy Adaptation (THOUGHT for next wave):

    • Learn from top performers
    • Address quality gaps
    • Adjust creative directions
    • Refine quality targets

Example execution flow:

User: /project:infinite-quality specs/example_spec.md output/ 10

You should:
1. Read and analyze specs/example_spec.md deeply
2. Read specs/quality_standards.md for evaluation criteria
3. Reason about quality goals for this spec (THOUGHT)
4. Launch 10 sub-agents with diverse creative directions (ACTION)
5. Evaluate all 10 iterations using evaluators/ logic (ACTION)
6. Rank iterations and generate quality report (OBSERVATION)
7. Present key findings and recommendations

Utility Command: /evaluate

When to use: Evaluating a single iteration on specific dimensions

Syntax:

/evaluate <dimension> <iteration_path> [spec_path]

Dimensions: technical, creativity, compliance, all

Key responsibilities:

  1. Pre-Evaluation Reasoning (THOUGHT):

    • What does quality mean for this dimension?
    • What evidence should I look for?
    • How do I remain objective?
  2. Evaluation (ACTION):

    • Read iteration completely
    • Load appropriate evaluator logic
    • Score each sub-dimension with evidence
    • Calculate total dimension score
  3. Analysis (OBSERVATION):

    • Identify specific strengths
    • Identify specific weaknesses
    • Provide evidence for scores
    • Suggest improvements

Critical: Always provide specific evidence. Never say "code quality is good" without examples like "lines 45-67 demonstrate excellent input validation with clear error messages."

Utility Command: /rank

When to use: Ranking all iterations in a directory

Syntax:

/rank <output_dir> [dimension]

Key responsibilities:

  1. Pre-Ranking Reasoning (THOUGHT):

    • What makes a fair ranking?
    • What patterns to look for?
    • How to interpret rankings?
  2. Ranking (ACTION):

    • Load all evaluations
    • Calculate composite scores
    • Sort and segment
    • Identify quality profiles
  3. Pattern Analysis (OBSERVATION):

    • Quality clusters and outliers
    • Dimension trade-offs
    • Success/failure factors
    • Strategic recommendations

Critical: Rankings should reveal insights, not just order. Explain what separates top from bottom performers.
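
As a rough, assumption-labeled sketch of the ranking step: the code below supposes each evaluation has already been stored as JSON with per-dimension scores. The *_evaluation.json naming follows the file-organization section later in this document; the field names and tier boundaries are hypothetical.

import json
from pathlib import Path

def rank_iterations(eval_dir, weights=(0.35, 0.35, 0.30)):
    """Load evaluation JSON files, compute weighted composites, sort descending."""
    ranked = []
    for path in Path(eval_dir).glob("*_evaluation.json"):
        data = json.loads(path.read_text())
        # Assumed fields: "technical", "creativity", "compliance" (0-100 each)
        composite = (data["technical"] * weights[0]
                     + data["creativity"] * weights[1]
                     + data["compliance"] * weights[2])
        ranked.append((path.stem.replace("_evaluation", ""), round(composite, 1)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

def segment_by_tier(ranked):
    """Group ranked iterations into rough tiers to support pattern analysis."""
    return {
        "top": [r for r in ranked if r[1] >= 85],
        "middle": [r for r in ranked if 70 <= r[1] < 85],
        "bottom": [r for r in ranked if r[1] < 70],
    }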

Utility Command: /quality-report

When to use: Generating comprehensive quality reports

Syntax:

/quality-report <output_dir> [wave_number]

Key responsibilities:

  1. Pre-Report Reasoning (THOUGHT):

    • Purpose and audience
    • Most important insights
    • How to visualize quality
  2. Report Generation (ACTION):

    • Aggregate all evaluation data
    • Calculate comprehensive statistics
    • Generate text visualizations
    • Identify patterns and insights
  3. Strategic Recommendations (OBSERVATION):

    • Actionable next steps
    • Creative direction suggestions
    • Quality targets for next wave
    • System improvements

Critical: Reports must be actionable. Every insight should lead to a concrete recommendation.

Evaluation Guidelines

Technical Quality Evaluation

Focus on:

  • Code Quality: Readability, comments, naming, DRY
  • Architecture: Modularity, separation, reusability, scalability
  • Performance: Render speed, animation fps, algorithms, DOM ops
  • Robustness: Validation, error handling, edge cases, compatibility

Scoring approach:

  • Look for concrete evidence in code
  • Compare against standards in evaluators/technical_quality.md
  • Score each sub-dimension 0-25 points
  • Document specific examples
  • Total: 0-100 points

Example evidence:

  • Good: "Lines 120-145: Efficient caching mechanism reduces redundant calculations"
  • Bad: "Performance is good"

Creativity Score Evaluation

Focus on:

  • Originality: Novel concepts, fresh perspectives, unexpected approaches
  • Innovation: Creative solutions, clever techniques, boundary-pushing
  • Uniqueness: Differentiation from others, distinctive identity, memorability
  • Aesthetic: Visual appeal, color harmony, typography, polish

Scoring approach:

  • Recognize creativity is partially subjective
  • Look for objective indicators of novelty
  • Compare against standards in evaluators/creativity_score.md
  • Reward creative risk-taking
  • Total: 0-100 points

Example evidence:

  • Good: "Novel data-as-music-notation concept, first iteration to use audio sonification"
  • Bad: "This is creative"

Spec Compliance Evaluation

Focus on:

  • Requirements Met: Functional, technical, design requirements (40 points)
  • Naming Conventions: Pattern adherence, quality (20 points)
  • Structure Adherence: File structure, code organization (20 points)
  • Quality Standards: Baselines met (20 points)

Scoring approach:

  • Treat spec as checklist
  • Binary or proportional scoring per requirement
  • Compare against standards in evaluators/spec_compliance.md
  • Be objective and evidence-based
  • Total: 0-100 points

Example evidence:

  • Good: "Spec requires 20+ data points, iteration has 50 points ✓ [4/4 points]"
  • Bad: "Meets requirements"

Scoring Calibration

Use these reference points to ensure consistent scoring:

  • 90-100 (Exceptional): Excellence in all sub-dimensions, exemplary work
  • 80-89 (Excellent): Strong across all sub-dimensions, minor improvements possible
  • 70-79 (Good): Solid in most sub-dimensions, some areas need work
  • 60-69 (Adequate): Meets basic requirements, notable weaknesses
  • 50-59 (Needs Improvement): Below expectations, significant issues
  • Below 50 (Insufficient): Major deficiencies, fails basic criteria

Critical: Most iterations should fall in the 60-85 range. Scores of 90+ should be rare and truly exceptional. Scores below 60 indicate serious problems.
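
The calibration bands above can be encoded directly as a lookup, which helps keep wording consistent across reports; this is a convenience sketch, not a required component.

def calibration_band(score):
    """Map a 0-100 score to the calibration label defined above."""
    if score >= 90:
        return "Exceptional"
    if score >= 80:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 60:
        return "Adequate"
    if score >= 50:
        return "Needs Improvement"
    return "Insufficient"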

Quality Report Best Practices

When generating quality reports:

  1. Start with Executive Summary: 3 key insights, 1 priority recommendation
  2. Provide Statistics: Mean, median, std dev, min, max for all dimensions
  3. Visualize Distribution: Text-based histograms and charts
  4. Identify Patterns: What makes top iterations succeed? What causes low scores?
  5. Analyze Trade-offs: Which dimensions compete? Which synergize?
  6. Give Strategic Recommendations: Specific, actionable, prioritized
  7. Self-Assess Report Quality: Is this useful? Honest? Comprehensive?

Critical: Every report should drive improvement. If it doesn't lead to actionable insights, it's not a good report.
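
One possible way to produce the statistics and text-based distribution mentioned above; the bucket size, bar glyph, and layout are arbitrary choices rather than a required report format.

from statistics import mean, median, pstdev

def text_histogram(scores, bucket=10, width=40):
    """Print summary statistics and a simple text histogram of composite scores."""
    print(f"mean={mean(scores):.1f}  median={median(scores):.1f}  "
          f"stdev={pstdev(scores):.1f}  min={min(scores)}  max={max(scores)}")
    counts = {}
    for s in scores:
        low = int(s // bucket) * bucket
        counts[low] = counts.get(low, 0) + 1
    for low in sorted(counts, reverse=True):
        bar = "#" * max(1, int(width * counts[low] / len(scores)))
        print(f"{low:>3}-{low + bucket - 1:<3} {bar} ({counts[low]})")

# text_histogram([62, 68, 71, 74, 77, 79, 81, 84, 88, 91])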

Infinite Mode Strategy

When running in infinite mode (count: infinite):

Wave 1 (Foundation)

  • Generate 6-8 iterations with diverse creative directions
  • Establish baseline quality metrics
  • Identify initial strengths and weaknesses
  • Generate wave 1 report

Wave 2+ (Progressive Improvement)

THOUGHT Phase:

  • What made wave 1 top performers successful?
  • What quality dimensions need improvement?
  • What creative directions are underexplored?
  • How can we push quality higher?

ACTION Phase:

  • Generate 6-8 new iterations
  • Incorporate lessons from previous waves
  • Target quality gaps
  • Increase challenge in strong areas

OBSERVATION Phase:

  • Evaluate new iterations
  • Update overall rankings
  • Generate wave-specific report
  • Compare to previous waves

Adaptation:

  • Quality improving? Continue strategy
  • Quality stagnating? Adjust approach
  • Quality declining? Investigate and correct

Critical: Don't just repeat the same strategy. Each wave should learn from previous waves. Show progressive improvement.
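
A hedged sketch of the adaptation check described above, comparing mean composite scores between consecutive waves; the threshold and the returned labels are placeholders.

def adaptation_signal(wave_means, epsilon=1.0):
    """Compare the latest wave's mean composite score against the previous wave."""
    if len(wave_means) < 2:
        return "baseline: establish metrics, explore diversity"
    delta = wave_means[-1] - wave_means[-2]
    if delta > epsilon:
        return "improving: continue current strategy"
    if delta < -epsilon:
        return "declining: investigate and correct"
    return "stagnating: adjust creative directions or quality targets"

# adaptation_signal([71.4, 74.9]) -> "improving: continue current strategy"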

Configuration Customization

Users can customize scoring through config/scoring_weights.json:

Default weights:

  • Technical: 35%, Creativity: 35%, Compliance: 30%

Alternative profiles:

  • technical_focus: 50/25/25 - For production code
  • creative_focus: 25/50/25 - For exploratory projects
  • compliance_focus: 30/25/45 - For standardization
  • innovation_priority: 20/60/20 - For research

When using custom config:

  1. Load and validate config
  2. Apply weights to composite score calculation
  3. Document which config is being used
  4. Note how it affects scoring
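
The real schema of config/scoring_weights.json lives in the project's config directory and is not reproduced here; as an assumption-labeled sketch, loading and validating a flat weights object might look like this.

import json

def load_scoring_weights(config_path="config/scoring_weights.json"):
    """Load dimension weights and check that they sum to 1.0 (assumed schema)."""
    # Assumed shape: {"technical": 0.35, "creativity": 0.35, "compliance": 0.30}
    with open(config_path) as f:
        weights = json.load(f)
    total = sum(weights[k] for k in ("technical", "creativity", "compliance"))
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"Scoring weights must sum to 1.0, got {total}")
    return weights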

Common Pitfalls to Avoid

  1. Evaluation without Reasoning: Don't just score - explain why
  2. Inconsistent Scoring: Apply same criteria to all iterations
  3. Vague Feedback: Provide specific evidence and examples
  4. Ignoring Trade-offs: Recognize when dimensions compete
  5. Not Learning from Results: Use observations to inform next actions
  6. Artificial Precision: Don't pretend scores are more accurate than they are
  7. Forgetting Balance: All three dimensions matter

Success Criteria

A successful execution of this system demonstrates:

  1. Meaningful Differentiation: Scores clearly separate quality levels
  2. Evidence-Based Scoring: Every score backed by specific examples
  3. Actionable Insights: Reports lead to concrete improvements
  4. Visible Learning: Quality improves over waves in infinite mode
  5. Transparent Reasoning: ReAct pattern evident throughout
  6. Fair Consistency: Same criteria applied to all iterations

File Organization

When generating iterations:

{output_dir}/
├── iteration_001.html
├── iteration_002.html
├── ...
└── quality_reports/
    ├── evaluations/
    │   ├── iteration_001_evaluation.json
    │   ├── iteration_002_evaluation.json
    │   └── ...
    ├── rankings/
    │   ├── ranking_report.md
    │   └── ranking_data.json
    └── reports/
        ├── wave_1_report.md
        ├── wave_2_report.md
        └── ...

Critical: Keep quality data organized. Store evaluations as JSON for machine readability, reports as Markdown for human readability.
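
The exact evaluation schema is defined by the evaluators, not by this document; as an illustrative sketch only, a single iteration_001_evaluation.json record could pair per-dimension scores with the evidence behind them.

import json

# Field names below are assumptions for illustration, not a required schema.
evaluation = {
    "iteration": "iteration_001",
    "scores": {"technical": 80, "creativity": 74, "compliance": 90},
    "composite": 80.9,  # 0.35*80 + 0.35*74 + 0.30*90
    "evidence": {
        "technical": "Lines 120-145: efficient caching avoids redundant recalculation",
        "creativity": "Novel data-as-music-notation concept with audio sonification",
        "compliance": "Spec requires 20+ data points; iteration renders 50",
    },
}

with open("output/quality_reports/evaluations/iteration_001_evaluation.json", "w") as f:
    json.dump(evaluation, f, indent=2)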

Integration with Main Infinite Loop Project

This variant builds on the original infinite loop pattern with quality-focused enhancements:

Shared concepts:

  • Multi-agent parallel orchestration
  • Specification-driven generation
  • Wave-based iteration (infinite mode)
  • Context management

New additions:

  • ReAct reasoning pattern
  • Multi-dimensional quality evaluation
  • Automated ranking and reporting
  • Quality-driven strategy adaptation

Critical: This is not a replacement for the original pattern - it's an enhancement focused on quality assessment and continuous improvement.

Examples

Example 1: Small Batch with Quality Focus

User: /project:infinite-quality specs/example_spec.md output/ 5

You should:
1. Analyze spec quality criteria
2. Reason about quality goals (THOUGHT)
3. Generate 5 diverse iterations (ACTION)
4. Evaluate all on all dimensions (ACTION)
5. Rank and report (OBSERVATION)
6. Provide top 3 insights

Expected output:
- 5 HTML files in output/
- 5 evaluation JSON files in output/quality_reports/evaluations/
- 1 ranking report in output/quality_reports/rankings/
- 1 quality report in output/quality_reports/reports/
- Summary with key findings and recommendations

Example 2: Infinite Mode with Learning

User: /project:infinite-quality specs/example_spec.md output/ infinite

You should:
Wave 1:
- Generate 6-8 iterations
- Evaluate and rank
- Report baseline quality

Wave 2:
- THOUGHT: Learn from wave 1 top performers
- ACTION: Generate 6-8 new iterations with lessons applied
- OBSERVATION: Evaluate, rank, report improvements

Wave 3+:
- Continue THOUGHT → ACTION → OBSERVATION cycle
- Show progressive quality improvement
- Adapt strategy based on observations
- Continue until context limits

Expected pattern:
- Quality scores increase over waves
- Strategy evolves based on observations
- Reports show learning and adaptation

When to Ask for Clarification

Ask user if:

  • Spec lacks quality criteria (offer to use defaults)
  • Custom config has invalid weights (offer to fix)
  • Unclear whether to prioritize technical vs creative
  • Infinite mode strategy needs direction
  • Evaluation criteria should be adjusted

Don't ask about:

  • How to score (use evaluators/ logic)
  • Report format (use templates/ structure)
  • Ranking methodology (use rank.md process)
  • Standard evaluation process (documented in commands)

Version & Maintenance

  • Current Version: 1.0
  • Created: 2025-10-10
  • Pattern: Infinite Agentic Loop + ReAct Reasoning
  • Dependencies: Claude Code custom commands, WebFetch for ReAct pattern research

Future considerations:

  • Automated testing integration
  • Visual quality report generation
  • Meta-learning on evaluation criteria
  • User feedback integration

Remember: Quality evaluation is not about being harsh or lenient - it's about being fair, consistent, and helpful. Use the ReAct pattern to reason thoughtfully, act systematically, and observe honestly. Let quality assessment drive continuous improvement.