# CLAUDE.md - Infinite Loop Variant 4: Quality Evaluation & Ranking System

This file provides guidance to Claude Code when working with the Quality Evaluation & Ranking System variant of the infinite agentic loop pattern.

## Project Overview

This is Infinite Loop Variant 4, implementing **automated quality evaluation and ranking** for AI-generated iterations. The system uses the **ReAct pattern** (Reasoning + Acting + Observation) to evaluate, score, rank, and continuously improve iteration quality across multiple dimensions.

## Key Concepts

### ReAct Pattern Integration

Every operation in this system follows the ReAct cycle:

1. **THOUGHT (Reasoning)**: Explicitly reason about quality, evaluation strategy, and improvement opportunities before acting
2. **ACTION (Acting)**: Execute evaluations, generate content, score iterations with clear intent
3. **OBSERVATION (Observing)**: Analyze results, identify patterns, extract insights to inform the next cycle

**Critical**: Always document reasoning. Every evaluation, ranking, and report should show the thought process that led to conclusions.

### Multi-Dimensional Quality

Quality is assessed across **three equally important dimensions**:

- **Technical Quality (35%)**: Code, architecture, performance, robustness
- **Creativity Score (35%)**: Originality, innovation, uniqueness, aesthetic
- **Spec Compliance (30%)**: Requirements, naming, structure, standards

**Critical**: Never evaluate just one dimension. Quality is holistic. An iteration can be technically perfect but creatively bland, or wildly creative but technically flawed. Balance matters.

### Quality-Driven Improvement

In infinite mode, quality assessment drives generation strategy:

- **Early waves**: Establish baseline, explore diversity
- **Mid waves**: Learn from top performers, address gaps
- **Late waves**: Push frontiers, optimize composite scores
- **All waves**: Monitor trends, adapt continuously

**Critical**: Don't just generate and evaluate. Learn from evaluations and adapt strategy accordingly. Let observations inform next actions.

## Commands to Use

### Primary Command: `/project:infinite-quality`

**When to use**: Generating iterations with quality evaluation and ranking

**Syntax**:
```
/project:infinite-quality [config_path]
```

**Key responsibilities when executing**:

1. **Initial Reasoning (THOUGHT)**:
   - Deeply understand spec quality criteria
   - Plan evaluation strategy
   - Design quality-driven creative directions
   - Consider what makes quality in this context

2. **Generation (ACTION)**:
   - Launch sub-agents with quality targets
   - Provide spec + quality standards
   - Assign diverse creative directions
   - Generate with self-assessment

3. **Evaluation (ACTION)**:
   - Score all iterations on all dimensions
   - Use evaluators from `evaluators/` directory
   - Document evidence for all scores
   - Be fair, consistent, and thorough

4. **Ranking & Reporting (OBSERVATION)**:
   - Rank by composite score
   - Identify patterns and trade-offs
   - Extract actionable insights
   - Generate comprehensive report

5. **Strategy Adaptation (THOUGHT for next wave)**:
   - Learn from top performers
   - Address quality gaps
   - Adjust creative directions
   - Refine quality targets

**Example execution flow**:
```
User: /project:infinite-quality specs/example_spec.md output/ 10

You should:
1. Read and analyze specs/example_spec.md deeply
2. Read specs/quality_standards.md for evaluation criteria
3. Reason about quality goals for this spec (THOUGHT)
4. Launch 10 sub-agents with diverse creative directions (ACTION)
5. Evaluate all 10 iterations using evaluators/ logic (ACTION)
6. Rank iterations and generate quality report (OBSERVATION)
7. Present key findings and recommendations
```

### Utility Command: `/evaluate`

**When to use**: Evaluating a single iteration on specific dimensions

**Syntax**:
```
/evaluate [spec_path]
```

**Dimensions**: `technical`, `creativity`, `compliance`, `all`

**Key responsibilities**:

1. **Pre-Evaluation Reasoning (THOUGHT)**:
   - What does quality mean for this dimension?
   - What evidence should I look for?
   - How do I remain objective?

2. **Evaluation (ACTION)**:
   - Read iteration completely
   - Load appropriate evaluator logic
   - Score each sub-dimension with evidence
   - Calculate total dimension score

3. **Analysis (OBSERVATION)**:
   - Identify specific strengths
   - Identify specific weaknesses
   - Provide evidence for scores
   - Suggest improvements

**Critical**: Always provide specific evidence. Never say "code quality is good" without examples like "lines 45-67 demonstrate excellent input validation with clear error messages."

### Utility Command: `/rank`

**When to use**: Ranking all iterations in a directory

**Syntax**:
```
/rank [dimension]
```

**Key responsibilities**:

1. **Pre-Ranking Reasoning (THOUGHT)**:
   - What makes fair ranking?
   - What patterns to look for?
   - How to interpret rankings?

2. **Ranking (ACTION)**:
   - Load all evaluations
   - Calculate composite scores
   - Sort and segment
   - Identify quality profiles

3. **Pattern Analysis (OBSERVATION)**:
   - Quality clusters and outliers
   - Dimension trade-offs
   - Success/failure factors
   - Strategic recommendations

**Critical**: Rankings should reveal insights, not just order. Explain what separates top from bottom performers.

### Utility Command: `/quality-report`

**When to use**: Generating comprehensive quality reports

**Syntax**:
```
/quality-report [wave_number]
```

**Key responsibilities**:

1. **Pre-Report Reasoning (THOUGHT)**:
   - Purpose and audience
   - Most important insights
   - How to visualize quality

2. **Report Generation (ACTION)**:
   - Aggregate all evaluation data
   - Calculate comprehensive statistics
   - Generate text visualizations
   - Identify patterns and insights

3. **Strategic Recommendations (OBSERVATION)**:
   - Actionable next steps
   - Creative direction suggestions
   - Quality targets for next wave
   - System improvements

**Critical**: Reports must be actionable. Every insight should lead to a concrete recommendation.
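As a concrete reference for the composite scores that `/rank` and `/quality-report` work from, here is a minimal Python sketch of the weighted calculation using the default 35/35/30 weights. The evaluation-file field names (`technical`, `creativity`, `compliance` as 0-100 totals) and the `*_evaluation.json` glob are illustrative assumptions, not a canonical schema:

```python
import json
from pathlib import Path

# Default weights; a custom profile from config/scoring_weights.json would override these.
WEIGHTS = {"technical": 0.35, "creativity": 0.35, "compliance": 0.30}

def composite_score(evaluation: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the three 0-100 dimension scores."""
    return sum(evaluation[dim] * w for dim, w in weights.items())

def rank_iterations(eval_dir: str, weights: dict = WEIGHTS) -> list[dict]:
    """Load evaluation JSON files, attach composite scores, sort best-first."""
    ranked = []
    for path in sorted(Path(eval_dir).glob("*_evaluation.json")):
        evaluation = json.loads(path.read_text())
        evaluation["composite"] = round(composite_score(evaluation, weights), 2)
        ranked.append(evaluation)
    return sorted(ranked, key=lambda e: e["composite"], reverse=True)

# Example: rankings = rank_iterations("output/quality_reports/evaluations/")
```

Sorting by the composite keeps rankings consistent with whatever weight profile is active, so the same logic serves both default and custom configurations.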
## Evaluation Guidelines

### Technical Quality Evaluation

**Focus on**:
- **Code Quality**: Readability, comments, naming, DRY
- **Architecture**: Modularity, separation, reusability, scalability
- **Performance**: Render speed, animation fps, algorithms, DOM ops
- **Robustness**: Validation, error handling, edge cases, compatibility

**Scoring approach**:
- Look for concrete evidence in code
- Compare against standards in `evaluators/technical_quality.md`
- Score each sub-dimension 0-25 points
- Document specific examples
- Total: 0-100 points

**Example evidence**:
- Good: "Lines 120-145: Efficient caching mechanism reduces redundant calculations"
- Bad: "Performance is good"

### Creativity Score Evaluation

**Focus on**:
- **Originality**: Novel concepts, fresh perspectives, unexpected approaches
- **Innovation**: Creative solutions, clever techniques, boundary-pushing
- **Uniqueness**: Differentiation from others, distinctive identity, memorability
- **Aesthetic**: Visual appeal, color harmony, typography, polish

**Scoring approach**:
- Recognize creativity is partially subjective
- Look for objective indicators of novelty
- Compare against standards in `evaluators/creativity_score.md`
- Reward creative risk-taking
- Total: 0-100 points

**Example evidence**:
- Good: "Novel data-as-music-notation concept, first iteration to use audio sonification"
- Bad: "This is creative"

### Spec Compliance Evaluation

**Focus on**:
- **Requirements Met**: Functional, technical, design requirements (40 points)
- **Naming Conventions**: Pattern adherence, quality (20 points)
- **Structure Adherence**: File structure, code organization (20 points)
- **Quality Standards**: Baselines met (20 points)

**Scoring approach**:
- Treat the spec as a checklist
- Binary or proportional scoring per requirement
- Compare against standards in `evaluators/spec_compliance.md`
- Be objective and evidence-based
- Total: 0-100 points

**Example evidence**:
- Good: "Spec requires 20+ data points, iteration has 50 points ✓ [4/4 points]"
- Bad: "Meets requirements"

## Scoring Calibration

Use these reference points to ensure consistent scoring:

- **90-100 (Exceptional)**: Excellence in all sub-dimensions, exemplary work
- **80-89 (Excellent)**: Strong across all sub-dimensions, minor improvements possible
- **70-79 (Good)**: Solid in most sub-dimensions, some areas need work
- **60-69 (Adequate)**: Meets basic requirements, notable weaknesses
- **50-59 (Needs Improvement)**: Below expectations, significant issues
- **Below 50 (Insufficient)**: Major deficiencies, fails basic criteria

**Critical**: Most iterations should fall in the 60-85 range. Scores of 90+ should be rare and truly exceptional. Scores below 60 indicate serious problems.

## Quality Report Best Practices

When generating quality reports:

1. **Start with Executive Summary**: 3 key insights, 1 priority recommendation
2. **Provide Statistics**: Mean, median, std dev, min, max for all dimensions
3. **Visualize Distribution**: Text-based histograms and charts
4. **Identify Patterns**: What makes top iterations succeed? What causes low scores?
5. **Analyze Trade-offs**: Which dimensions compete? Which synergize?
6. **Give Strategic Recommendations**: Specific, actionable, prioritized
7. **Self-Assess Report Quality**: Is this useful? Honest? Comprehensive?

**Critical**: Every report should drive improvement. If it doesn't lead to actionable insights, it's not a good report.
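For the statistics and text-based distribution a report calls for, a minimal sketch along these lines may help; the 10-point buckets and bar style are arbitrary illustrative choices, not a prescribed report format:

```python
from statistics import mean, median, pstdev

def score_summary(scores: list[float]) -> dict:
    """Basic descriptive statistics for one dimension or for composite scores."""
    return {
        "mean": round(mean(scores), 1),
        "median": round(median(scores), 1),
        "std_dev": round(pstdev(scores), 1),
        "min": min(scores),
        "max": max(scores),
    }

def text_histogram(scores: list[float], bucket: int = 10) -> str:
    """Render a simple text histogram, e.g. '70-79  | ████ (4)'."""
    lines = []
    for low in range(0, 100, bucket):
        # Top bucket also captures a perfect 100.
        count = sum(1 for s in scores if low <= s < low + bucket or (low == 90 and s == 100))
        lines.append(f"{low:>3}-{low + bucket - 1:<3} | {'█' * count} ({count})")
    return "\n".join(lines)

# Example:
# composites = [72.5, 81.0, 68.5, 77.0, 84.5]
# print(score_summary(composites))
# print(text_histogram(composites))
```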
## Infinite Mode Strategy

When running in infinite mode (`count: infinite`):

### Wave 1 (Foundation)
- Generate 6-8 iterations with diverse creative directions
- Establish baseline quality metrics
- Identify initial strengths and weaknesses
- Generate wave 1 report

### Wave 2+ (Progressive Improvement)

**THOUGHT Phase**:
- What made wave 1 top performers successful?
- What quality dimensions need improvement?
- What creative directions are underexplored?
- How can we push quality higher?

**ACTION Phase**:
- Generate 6-8 new iterations
- Incorporate lessons from previous waves
- Target quality gaps
- Increase challenge in strong areas

**OBSERVATION Phase**:
- Evaluate new iterations
- Update overall rankings
- Generate wave-specific report
- Compare to previous waves

**Adaptation**:
- Quality improving? Continue strategy
- Quality stagnating? Adjust approach
- Quality declining? Investigate and correct

**Critical**: Don't just repeat the same strategy. Each wave should learn from previous waves. Show progressive improvement.

## Configuration Customization

Users can customize scoring through `config/scoring_weights.json`:

**Default weights**:
- Technical: 35%, Creativity: 35%, Compliance: 30%

**Alternative profiles**:
- `technical_focus`: 50/25/25 - For production code
- `creative_focus`: 25/50/25 - For exploratory projects
- `compliance_focus`: 30/25/45 - For standardization
- `innovation_priority`: 20/60/20 - For research

**When using custom config**:
1. Load and validate config
2. Apply weights to composite score calculation
3. Document which config is being used
4. Note how it affects scoring

A sketch of this load-and-validate step appears after the file layout below.

## Common Pitfalls to Avoid

1. **Evaluation without Reasoning**: Don't just score - explain why
2. **Inconsistent Scoring**: Apply same criteria to all iterations
3. **Vague Feedback**: Provide specific evidence and examples
4. **Ignoring Trade-offs**: Recognize when dimensions compete
5. **Not Learning from Results**: Use observations to inform next actions
6. **Artificial Precision**: Don't pretend scores are more accurate than they are
7. **Forgetting Balance**: All three dimensions matter

## Success Criteria

A successful execution of this system demonstrates:

1. **Meaningful Differentiation**: Scores clearly separate quality levels
2. **Evidence-Based Scoring**: Every score backed by specific examples
3. **Actionable Insights**: Reports lead to concrete improvements
4. **Visible Learning**: Quality improves over waves in infinite mode
5. **Transparent Reasoning**: ReAct pattern evident throughout
6. **Fair Consistency**: Same criteria applied to all iterations

## File Organization

**When generating iterations**:
```
{output_dir}/
├── iteration_001.html
├── iteration_002.html
├── ...
└── quality_reports/
    ├── evaluations/
    │   ├── iteration_001_evaluation.json
    │   ├── iteration_002_evaluation.json
    │   └── ...
    ├── rankings/
    │   ├── ranking_report.md
    │   └── ranking_data.json
    └── reports/
        ├── wave_1_report.md
        ├── wave_2_report.md
        └── ...
```

**Critical**: Keep quality data organized. Store evaluations as JSON for machine readability, reports as Markdown for human readability.
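For the "load and validate config" step above, a minimal sketch under an assumed schema (a map of profile names to dimension weights; the real `config/scoring_weights.json` layout may differ):

```python
import json

def load_scoring_weights(config_path: str, profile: str = "default") -> dict[str, float]:
    """Load one weight profile and check the three dimensions sum to 1.0.

    Assumed schema (illustrative only):
    {
      "default": {"technical": 0.35, "creativity": 0.35, "compliance": 0.30},
      "technical_focus": {"technical": 0.50, "creativity": 0.25, "compliance": 0.25}
    }
    """
    with open(config_path) as f:
        profiles = json.load(f)
    weights = profiles[profile]

    expected = {"technical", "creativity", "compliance"}
    if set(weights) != expected:
        raise ValueError(f"Profile '{profile}' must define exactly {sorted(expected)}")
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError(f"Weights in profile '{profile}' must sum to 1.0")
    return weights

# Example: weights = load_scoring_weights("config/scoring_weights.json", "creative_focus")
```

Recording which profile was applied in the ranking report keeps composite scores comparable across waves.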
## Integration with Main Infinite Loop Project

This variant builds on the original infinite loop pattern with quality-focused enhancements:

**Shared concepts**:
- Multi-agent parallel orchestration
- Specification-driven generation
- Wave-based iteration (infinite mode)
- Context management

**New additions**:
- ReAct reasoning pattern
- Multi-dimensional quality evaluation
- Automated ranking and reporting
- Quality-driven strategy adaptation

**Critical**: This is not a replacement for the original pattern - it's an enhancement focused on quality assessment and continuous improvement.

## Examples

### Example 1: Small Batch with Quality Focus

```
User: /project:infinite-quality specs/example_spec.md output/ 5

You should:
1. Analyze spec quality criteria
2. Reason about quality goals (THOUGHT)
3. Generate 5 diverse iterations (ACTION)
4. Evaluate all on all dimensions (ACTION)
5. Rank and report (OBSERVATION)
6. Provide top 3 insights

Expected output:
- 5 HTML files in output/
- 5 evaluation JSON files in output/quality_reports/evaluations/
- 1 ranking report in output/quality_reports/rankings/
- 1 quality report in output/quality_reports/reports/
- Summary with key findings and recommendations
```

### Example 2: Infinite Mode with Learning

```
User: /project:infinite-quality specs/example_spec.md output/ infinite

You should:

Wave 1:
- Generate 6-8 iterations
- Evaluate and rank
- Report baseline quality

Wave 2:
- THOUGHT: Learn from wave 1 top performers
- ACTION: Generate 6-8 new iterations with lessons applied
- OBSERVATION: Evaluate, rank, report improvements

Wave 3+:
- Continue THOUGHT → ACTION → OBSERVATION cycle
- Show progressive quality improvement
- Adapt strategy based on observations
- Continue until context limits

Expected pattern:
- Quality scores increase over waves
- Strategy evolves based on observations
- Reports show learning and adaptation
```

## When to Ask for Clarification

Ask user if:
- Spec lacks quality criteria (offer to use defaults)
- Custom config has invalid weights (offer to fix)
- Unclear whether to prioritize technical vs creative
- Infinite mode strategy needs direction
- Evaluation criteria should be adjusted

Don't ask about:
- How to score (use evaluators/ logic)
- Report format (use templates/ structure)
- Ranking methodology (use rank.md process)
- Standard evaluation process (documented in commands)

## Version & Maintenance

**Current Version**: 1.0
**Created**: 2025-10-10
**Pattern**: Infinite Agentic Loop + ReAct Reasoning
**Dependencies**: Claude Code custom commands, WebFetch for ReAct pattern research

**Future considerations**:
- Automated testing integration
- Visual quality report generation
- Meta-learning on evaluation criteria
- User feedback integration

---

**Remember**: Quality evaluation is not about being harsh or lenient - it's about being fair, consistent, and helpful. Use the ReAct pattern to reason thoughtfully, act systematically, and observe honestly. Let quality assessment drive continuous improvement.