CLAUDE.md - Infinite Loop Variant 4: Quality Evaluation & Ranking System
This file provides guidance to Claude Code when working with the Quality Evaluation & Ranking System variant of the infinite agentic loop pattern.
Project Overview
This is Infinite Loop Variant 4, implementing automated quality evaluation and ranking for AI-generated iterations. The system uses the ReAct pattern (Reasoning + Acting + Observation) to evaluate, score, rank, and continuously improve iteration quality across multiple dimensions.
Key Concepts
ReAct Pattern Integration
Every operation in this system follows the ReAct cycle:
- THOUGHT (Reasoning): Explicitly reason about quality, evaluation strategy, and improvement opportunities before acting
- ACTION (Acting): Execute evaluations, generate content, score iterations with clear intent
- OBSERVATION (Observing): Analyze results, identify patterns, extract insights to inform next cycle
Critical: Always document reasoning. Every evaluation, ranking, and report should show the thought process that led to conclusions.
Multi-Dimensional Quality
Quality is assessed across three complementary dimensions, each weighted into a composite score:
- Technical Quality (35%): Code, architecture, performance, robustness
- Creativity Score (35%): Originality, innovation, uniqueness, aesthetic
- Spec Compliance (30%): Requirements, naming, structure, standards
Critical: Never evaluate just one dimension. Quality is holistic. An iteration can be technically perfect but creatively bland, or wildly creative but technically flawed. Balance matters.
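For illustration, the weighted combination described above can be sketched as follows (a minimal sketch; the function and variable names are assumptions, not part of the evaluators):

```python
# Minimal sketch: combine dimension scores (0-100) into a composite score
# using the default weights (technical 35%, creativity 35%, compliance 30%).
DEFAULT_WEIGHTS = {"technical": 0.35, "creativity": 0.35, "compliance": 0.30}

def composite_score(scores: dict[str, float],
                    weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of dimension scores; weights are assumed to sum to 1.0."""
    return sum(scores[dim] * weight for dim, weight in weights.items())

# A technically strong but creatively bland iteration still lands mid-pack.
composite_score({"technical": 92, "creativity": 61, "compliance": 85})  # 79.05
```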
Quality-Driven Improvement
In infinite mode, quality assessment drives generation strategy:
- Early waves: Establish baseline, explore diversity
- Mid waves: Learn from top performers, address gaps
- Late waves: Push frontiers, optimize composite scores
- All waves: Monitor trends, adapt continuously
Critical: Don't just generate and evaluate. Learn from evaluations and adapt strategy accordingly. Let observations inform next actions.
Commands to Use
Primary Command: /project:infinite-quality
When to use: Generating iterations with quality evaluation and ranking
Syntax:
/project:infinite-quality <spec_path> <output_dir> <count|infinite> [config_path]
Key responsibilities when executing:
1. Initial Reasoning (THOUGHT):
- Deeply understand spec quality criteria
- Plan evaluation strategy
- Design quality-driven creative directions
- Consider what makes quality in this context
2. Generation (ACTION):
- Launch sub-agents with quality targets
- Provide spec + quality standards
- Assign diverse creative directions
- Generate with self-assessment
3. Evaluation (ACTION):
- Score all iterations on all dimensions
- Use evaluators from the evaluators/ directory
- Document evidence for all scores
- Be fair, consistent, and thorough
4. Ranking & Reporting (OBSERVATION):
- Rank by composite score
- Identify patterns and trade-offs
- Extract actionable insights
- Generate comprehensive report
5. Strategy Adaptation (THOUGHT for next wave):
- Learn from top performers
- Address quality gaps
- Adjust creative directions
- Refine quality targets
Example execution flow:
User: /project:infinite-quality specs/example_spec.md output/ 10
You should:
1. Read and analyze specs/example_spec.md deeply
2. Read specs/quality_standards.md for evaluation criteria
3. Reason about quality goals for this spec (THOUGHT)
4. Launch 10 sub-agents with diverse creative directions (ACTION)
5. Evaluate all 10 iterations using evaluators/ logic (ACTION)
6. Rank iterations and generate quality report (OBSERVATION)
7. Present key findings and recommendations
Utility Command: /evaluate
When to use: Evaluating a single iteration on specific dimensions
Syntax:
/evaluate <dimension> <iteration_path> [spec_path]
Dimensions: technical, creativity, compliance, all
Key responsibilities:
1. Pre-Evaluation Reasoning (THOUGHT):
- What does quality mean for this dimension?
- What evidence should I look for?
- How do I remain objective?
2. Evaluation (ACTION):
- Read iteration completely
- Load appropriate evaluator logic
- Score each sub-dimension with evidence
- Calculate total dimension score
3. Analysis (OBSERVATION):
- Identify specific strengths
- Identify specific weaknesses
- Provide evidence for scores
- Suggest improvements
Critical: Always provide specific evidence. Never say "code quality is good" without examples like "lines 45-67 demonstrate excellent input validation with clear error messages."
Utility Command: /rank
When to use: Ranking all iterations in a directory
Syntax:
/rank <output_dir> [dimension]
Key responsibilities:
1. Pre-Ranking Reasoning (THOUGHT):
- What makes fair ranking?
- What patterns to look for?
- How to interpret rankings?
2. Ranking (ACTION):
- Load all evaluations
- Calculate composite scores
- Sort and segment
- Identify quality profiles
3. Pattern Analysis (OBSERVATION):
- Quality clusters and outliers
- Dimension trade-offs
- Success/failure factors
- Strategic recommendations
Critical: Rankings should reveal insights, not just order. Explain what separates top from bottom performers.
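A minimal ranking sketch consistent with the responsibilities above; the tier thresholds and field names are illustrative assumptions:

```python
# Minimal sketch: sort evaluated iterations by composite score and segment
# them into rough tiers so the report can discuss top vs. bottom performers.
def rank_iterations(evaluations: list[dict]) -> list[dict]:
    ranked = sorted(evaluations, key=lambda e: e["composite"], reverse=True)
    for position, entry in enumerate(ranked, start=1):
        entry["rank"] = position
        # Tier boundaries loosely follow the scoring calibration bands.
        entry["tier"] = ("top" if entry["composite"] >= 80
                         else "middle" if entry["composite"] >= 60
                         else "bottom")
    return ranked
```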
Utility Command: /quality-report
When to use: Generating comprehensive quality reports
Syntax:
/quality-report <output_dir> [wave_number]
Key responsibilities:
1. Pre-Report Reasoning (THOUGHT):
- Purpose and audience
- Most important insights
- How to visualize quality
2. Report Generation (ACTION):
- Aggregate all evaluation data
- Calculate comprehensive statistics
- Generate text visualizations
- Identify patterns and insights
3. Strategic Recommendations (OBSERVATION):
- Actionable next steps
- Creative direction suggestions
- Quality targets for next wave
- System improvements
Critical: Reports must be actionable. Every insight should lead to a concrete recommendation.
Evaluation Guidelines
Technical Quality Evaluation
Focus on:
- Code Quality: Readability, comments, naming, DRY
- Architecture: Modularity, separation, reusability, scalability
- Performance: Render speed, animation fps, algorithms, DOM ops
- Robustness: Validation, error handling, edge cases, compatibility
Scoring approach:
- Look for concrete evidence in code
- Compare against standards in evaluators/technical_quality.md
- Score each sub-dimension 0-25 points
- Document specific examples
- Total: 0-100 points
Example evidence:
- Good: "Lines 120-145: Efficient caching mechanism reduces redundant calculations"
- Bad: "Performance is good"
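As a sketch of the 0-25-per-sub-dimension scheme above (the sub-dimension keys mirror the focus list; the data structure itself is an assumption, not something defined by evaluators/technical_quality.md):

```python
# Minimal sketch: four technical sub-dimensions scored 0-25 each,
# summing to a 0-100 technical quality score.
TECHNICAL_SUBDIMENSIONS = ("code_quality", "architecture", "performance", "robustness")

def technical_total(sub_scores: dict[str, int]) -> int:
    assert set(sub_scores) == set(TECHNICAL_SUBDIMENSIONS)
    assert all(0 <= score <= 25 for score in sub_scores.values())
    return sum(sub_scores.values())

technical_total({"code_quality": 21, "architecture": 18,
                 "performance": 22, "robustness": 17})  # 78
```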
Creativity Score Evaluation
Focus on:
- Originality: Novel concepts, fresh perspectives, unexpected approaches
- Innovation: Creative solutions, clever techniques, boundary-pushing
- Uniqueness: Differentiation from others, distinctive identity, memorability
- Aesthetic: Visual appeal, color harmony, typography, polish
Scoring approach:
- Recognize creativity is partially subjective
- Look for objective indicators of novelty
- Compare against standards in evaluators/creativity_score.md
- Reward creative risk-taking
- Total: 0-100 points
Example evidence:
- Good: "Novel data-as-music-notation concept, first iteration to use audio sonification"
- Bad: "This is creative"
Spec Compliance Evaluation
Focus on:
- Requirements Met: Functional, technical, design requirements (40 points)
- Naming Conventions: Pattern adherence, quality (20 points)
- Structure Adherence: File structure, code organization (20 points)
- Quality Standards: Baselines met (20 points)
Scoring approach:
- Treat spec as checklist
- Binary or proportional scoring per requirement
- Compare against standards in evaluators/spec_compliance.md
- Be objective and evidence-based
- Total: 0-100 points
Example evidence:
- Good: "Spec requires 20+ data points, iteration has 50 points ✓ [4/4 points]"
- Bad: "Meets requirements"
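One way the 40/20/20/20 split could be applied as a proportional checklist (a sketch; the pass counts and helper name are hypothetical):

```python
# Minimal sketch: proportional scoring per compliance area, scaled to the
# 40/20/20/20 point split described above.
def proportional_score(passed: int, total: int, max_points: int) -> float:
    """Award points in proportion to the requirements satisfied."""
    return max_points * passed / total if total else 0.0

compliance_total = (
    proportional_score(passed=9, total=10, max_points=40)    # requirements met
    + proportional_score(passed=4, total=4, max_points=20)   # naming conventions
    + proportional_score(passed=3, total=4, max_points=20)   # structure adherence
    + proportional_score(passed=4, total=5, max_points=20)   # quality standards
)  # 36 + 20 + 15 + 16 = 87
```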
Scoring Calibration
Use these reference points to ensure consistent scoring:
- 90-100 (Exceptional): Excellence in all sub-dimensions, exemplary work
- 80-89 (Excellent): Strong across all sub-dimensions, minor improvements possible
- 70-79 (Good): Solid in most sub-dimensions, some areas need work
- 60-69 (Adequate): Meets basic requirements, notable weaknesses
- 50-59 (Needs Improvement): Below expectations, significant issues
- Below 50 (Insufficient): Major deficiencies, fails basic criteria
Critical: Most iterations should fall in 60-85 range. Scores of 90+ should be rare and truly exceptional. Scores below 60 indicate serious problems.
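A small helper mapping a numeric score to its calibration band (labels taken from the reference points above) might look like this:

```python
# Minimal sketch: map a 0-100 score to its calibration band.
def calibration_band(score: float) -> str:
    if score >= 90:
        return "Exceptional"
    if score >= 80:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 60:
        return "Adequate"
    if score >= 50:
        return "Needs Improvement"
    return "Insufficient"
```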
Quality Report Best Practices
When generating quality reports:
- Start with Executive Summary: 3 key insights, 1 priority recommendation
- Provide Statistics: Mean, median, std dev, min, max for all dimensions
- Visualize Distribution: Text-based histograms and charts
- Identify Patterns: What makes top iterations succeed? What causes low scores?
- Analyze Trade-offs: Which dimensions compete? Which synergize?
- Give Strategic Recommendations: Specific, actionable, prioritized
- Self-Assess Report Quality: Is this useful? Honest? Comprehensive?
Critical: Every report should drive improvement. If it doesn't lead to actionable insights, it's not a good report.
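A minimal sketch of the statistics and text-based distribution listed above; the bucket size and bar rendering are assumptions:

```python
# Minimal sketch: summary statistics plus a text histogram of composite scores.
import statistics

def summarize(scores: list[float]) -> dict[str, float]:
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "std_dev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

def text_histogram(scores: list[float], bucket: int = 10) -> str:
    lines = []
    for low in range(0, 100, bucket):
        high = low + bucket
        # The top bucket includes 100 so a perfect score is still counted.
        count = sum(low <= s < high or (high == 100 and s == 100) for s in scores)
        lines.append(f"{low:3d}-{high - 1:<3d} {'#' * count}")
    return "\n".join(lines)
```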
Infinite Mode Strategy
When running in infinite mode (count: infinite):
Wave 1 (Foundation)
- Generate 6-8 iterations with diverse creative directions
- Establish baseline quality metrics
- Identify initial strengths and weaknesses
- Generate wave 1 report
Wave 2+ (Progressive Improvement)
THOUGHT Phase:
- What made wave 1 top performers successful?
- What quality dimensions need improvement?
- What creative directions are underexplored?
- How can we push quality higher?
ACTION Phase:
- Generate 6-8 new iterations
- Incorporate lessons from previous waves
- Target quality gaps
- Increase challenge in strong areas
OBSERVATION Phase:
- Evaluate new iterations
- Update overall rankings
- Generate wave-specific report
- Compare to previous waves
Adaptation:
- Quality improving? Continue strategy
- Quality stagnating? Adjust approach
- Quality declining? Investigate and correct
Critical: Don't just repeat the same strategy. Each wave should learn from previous waves and show progressive improvement. A sketch of the adaptation check follows this section.
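The adaptation check between waves can be sketched as below; the stagnation margin is an assumed threshold, not something specified by the system:

```python
# Minimal sketch: compare mean composite scores of consecutive waves to decide
# whether to continue, adjust, or investigate the generation strategy.
import statistics

def wave_adaptation(previous_scores: list[float], current_scores: list[float],
                    stagnation_margin: float = 2.0) -> str:
    delta = statistics.mean(current_scores) - statistics.mean(previous_scores)
    if delta > stagnation_margin:
        return "improving: continue strategy"
    if delta < -stagnation_margin:
        return "declining: investigate and correct"
    return "stagnating: adjust approach"
```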
Configuration Customization
Users can customize scoring through config/scoring_weights.json:
Default weights:
- Technical: 35%, Creativity: 35%, Compliance: 30%
Alternative profiles:
- technical_focus: 50/25/25 - For production code
- creative_focus: 25/50/25 - For exploratory projects
- compliance_focus: 30/25/45 - For standardization
- innovation_priority: 20/60/20 - For research
When using custom config:
- Load and validate config
- Apply weights to composite score calculation
- Document which config is being used
- Note how it affects scoring
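A sketch of loading and validating config/scoring_weights.json; the flat key layout shown here is an assumption about the file's schema:

```python
# Minimal sketch: load scoring weights, fall back to defaults if no config is
# provided, and verify that the weights sum to 1.0 before applying them.
import json
from pathlib import Path

DEFAULT_WEIGHTS = {"technical": 0.35, "creativity": 0.35, "compliance": 0.30}

def load_weights(config_path: str = "config/scoring_weights.json") -> dict[str, float]:
    path = Path(config_path)
    if not path.exists():
        return DEFAULT_WEIGHTS
    weights = json.loads(path.read_text())
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError(f"Scoring weights must sum to 1.0, got {sum(weights.values())}")
    return weights
```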
Common Pitfalls to Avoid
- Evaluation without Reasoning: Don't just score - explain why
- Inconsistent Scoring: Apply same criteria to all iterations
- Vague Feedback: Provide specific evidence and examples
- Ignoring Trade-offs: Recognize when dimensions compete
- Not Learning from Results: Use observations to inform next actions
- Artificial Precision: Don't pretend scores are more accurate than they are
- Forgetting Balance: All three dimensions matter
Success Criteria
A successful execution of this system demonstrates:
- Meaningful Differentiation: Scores clearly separate quality levels
- Evidence-Based Scoring: Every score backed by specific examples
- Actionable Insights: Reports lead to concrete improvements
- Visible Learning: Quality improves over waves in infinite mode
- Transparent Reasoning: ReAct pattern evident throughout
- Fair Consistency: Same criteria applied to all iterations
File Organization
When generating iterations:
{output_dir}/
├── iteration_001.html
├── iteration_002.html
├── ...
└── quality_reports/
├── evaluations/
│ ├── iteration_001_evaluation.json
│ ├── iteration_002_evaluation.json
│ └── ...
├── rankings/
│ ├── ranking_report.md
│ └── ranking_data.json
└── reports/
├── wave_1_report.md
├── wave_2_report.md
└── ...
Critical: Keep quality data organized. Store evaluations as JSON for machine readability, reports as Markdown for human readability.
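For illustration, one iteration_001_evaluation.json record might be written like this; the field names are hypothetical, and the real schema is whatever the evaluators emit:

```python
# Minimal sketch: write one evaluation record as machine-readable JSON.
import json

evaluation = {
    "iteration": "iteration_001.html",
    "technical": {"score": 78, "evidence": ["Lines 120-145: caching reduces redundant calculations"]},
    "creativity": {"score": 84, "evidence": ["Novel data-as-music-notation concept"]},
    "compliance": {"score": 87, "evidence": ["Spec requires 20+ data points, iteration has 50"]},
    "composite": 82.8,  # 78*0.35 + 84*0.35 + 87*0.30 under the default weights
}

with open("output/quality_reports/evaluations/iteration_001_evaluation.json", "w") as f:
    json.dump(evaluation, f, indent=2)
```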
Integration with Main Infinite Loop Project
This variant builds on the original infinite loop pattern with quality-focused enhancements:
Shared concepts:
- Multi-agent parallel orchestration
- Specification-driven generation
- Wave-based iteration (infinite mode)
- Context management
New additions:
- ReAct reasoning pattern
- Multi-dimensional quality evaluation
- Automated ranking and reporting
- Quality-driven strategy adaptation
Critical: This is not a replacement for the original pattern - it's an enhancement focused on quality assessment and continuous improvement.
Examples
Example 1: Small Batch with Quality Focus
User: /project:infinite-quality specs/example_spec.md output/ 5
You should:
1. Analyze spec quality criteria
2. Reason about quality goals (THOUGHT)
3. Generate 5 diverse iterations (ACTION)
4. Evaluate all on all dimensions (ACTION)
5. Rank and report (OBSERVATION)
6. Provide top 3 insights
Expected output:
- 5 HTML files in output/
- 5 evaluation JSON files in output/quality_reports/evaluations/
- 1 ranking report in output/quality_reports/rankings/
- 1 quality report in output/quality_reports/reports/
- Summary with key findings and recommendations
Example 2: Infinite Mode with Learning
User: /project:infinite-quality specs/example_spec.md output/ infinite
You should:
Wave 1:
- Generate 6-8 iterations
- Evaluate and rank
- Report baseline quality
Wave 2:
- THOUGHT: Learn from wave 1 top performers
- ACTION: Generate 6-8 new iterations with lessons applied
- OBSERVATION: Evaluate, rank, report improvements
Wave 3+:
- Continue THOUGHT → ACTION → OBSERVATION cycle
- Show progressive quality improvement
- Adapt strategy based on observations
- Continue until context limits are reached
Expected pattern:
- Quality scores increase over waves
- Strategy evolves based on observations
- Reports show learning and adaptation
When to Ask for Clarification
Ask user if:
- Spec lacks quality criteria (offer to use defaults)
- Custom config has invalid weights (offer to fix)
- Unclear whether to prioritize technical vs creative
- Infinite mode strategy needs direction
- Evaluation criteria should be adjusted
Don't ask about:
- How to score (use evaluators/ logic)
- Report format (use templates/ structure)
- Ranking methodology (use rank.md process)
- Standard evaluation process (documented in commands)
Version & Maintenance
- Current Version: 1.0
- Created: 2025-10-10
- Pattern: Infinite Agentic Loop + ReAct Reasoning
- Dependencies: Claude Code custom commands, WebFetch for ReAct pattern research
Future considerations:
- Automated testing integration
- Visual quality report generation
- Meta-learning on evaluation criteria
- User feedback integration
Remember: Quality evaluation is not about being harsh or lenient - it's about being fair, consistent, and helpful. Use the ReAct pattern to reason thoughtfully, act systematically, and observe honestly. Let quality assessment drive continuous improvement.