# CLAUDE.md - Infinite Loop Variant 4: Quality Evaluation & Ranking System
This file provides guidance to Claude Code when working with the Quality Evaluation & Ranking System variant of the infinite agentic loop pattern.
## Project Overview
This is Infinite Loop Variant 4, implementing **automated quality evaluation and ranking** for AI-generated iterations. The system uses the **ReAct pattern** (Reasoning + Acting + Observation) to evaluate, score, rank, and continuously improve iteration quality across multiple dimensions.
## Key Concepts
### ReAct Pattern Integration
Every operation in this system follows the ReAct cycle:
1. **THOUGHT (Reasoning)**: Explicitly reason about quality, evaluation strategy, and improvement opportunities before acting
2. **ACTION (Acting)**: Execute evaluations, generate content, score iterations with clear intent
3. **OBSERVATION (Observing)**: Analyze results, identify patterns, extract insights to inform next cycle
**Critical**: Always document reasoning. Every evaluation, ranking, and report should show the thought process that led to conclusions.
### Multi-Dimensional Quality
Quality is assessed across **three complementary dimensions** (default weights shown):
- **Technical Quality (35%)**: Code, architecture, performance, robustness
- **Creativity Score (35%)**: Originality, innovation, uniqueness, aesthetic
- **Spec Compliance (30%)**: Requirements, naming, structure, standards
**Critical**: Never evaluate just one dimension. Quality is holistic. An iteration can be technically perfect but creatively bland, or wildly creative but technically flawed. Balance matters.
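For concreteness, here is a minimal sketch of how the three dimension scores might be folded into a composite, assuming the default 35/35/30 weights from `config/scoring_weights.json`; the function name and signature are illustrative, not part of any command:
```python
# Minimal sketch: weighted composite from the three dimension scores (0-100 each).
# Default weights follow the 35/35/30 split described above; names are illustrative.
DEFAULT_WEIGHTS = {"technical": 0.35, "creativity": 0.35, "compliance": 0.30}

def composite_score(technical: float, creativity: float, compliance: float,
                    weights: dict = DEFAULT_WEIGHTS) -> float:
    """Combine per-dimension scores (0-100) into a single 0-100 composite."""
    return round(
        technical * weights["technical"]
        + creativity * weights["creativity"]
        + compliance * weights["compliance"],
        1,
    )

# Example: strong technically, average creativity, good compliance.
print(composite_score(88, 72, 81))  # -> 80.3
```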
### Quality-Driven Improvement
In infinite mode, quality assessment drives generation strategy:
- **Early waves**: Establish baseline, explore diversity
- **Mid waves**: Learn from top performers, address gaps
- **Late waves**: Push frontiers, optimize composite scores
- **All waves**: Monitor trends, adapt continuously
**Critical**: Don't just generate and evaluate. Learn from evaluations and adapt strategy accordingly. Let observations inform next actions.
## Commands to Use
### Primary Command: `/project:infinite-quality`
**When to use**: Generating iterations with quality evaluation and ranking
**Syntax**:
```
/project:infinite-quality <spec_path> <output_dir> <count|infinite> [config_path]
```
**Key responsibilities when executing**:
1. **Initial Reasoning (THOUGHT)**:
- Deeply understand spec quality criteria
- Plan evaluation strategy
- Design quality-driven creative directions
- Consider what makes quality in this context
2. **Generation (ACTION)**:
- Launch sub-agents with quality targets
- Provide spec + quality standards
- Assign diverse creative directions
- Generate with self-assessment
3. **Evaluation (ACTION)**:
- Score all iterations on all dimensions
- Use evaluators from `evaluators/` directory
- Document evidence for all scores
- Be fair, consistent, and thorough
4. **Ranking & Reporting (OBSERVATION)**:
- Rank by composite score
- Identify patterns and trade-offs
- Extract actionable insights
- Generate comprehensive report
5. **Strategy Adaptation (THOUGHT for next wave)**:
- Learn from top performers
- Address quality gaps
- Adjust creative directions
- Refine quality targets
**Example execution flow**:
```
User: /project:infinite-quality specs/example_spec.md output/ 10
You should:
1. Read and analyze specs/example_spec.md deeply
2. Read specs/quality_standards.md for evaluation criteria
3. Reason about quality goals for this spec (THOUGHT)
4. Launch 10 sub-agents with diverse creative directions (ACTION)
5. Evaluate all 10 iterations using evaluators/ logic (ACTION)
6. Rank iterations and generate quality report (OBSERVATION)
7. Present key findings and recommendations
```
### Utility Command: `/evaluate`
**When to use**: Evaluating a single iteration on specific dimensions
**Syntax**:
```
/evaluate <dimension> <iteration_path> [spec_path]
```
**Dimensions**: `technical`, `creativity`, `compliance`, `all`
**Key responsibilities**:
1. **Pre-Evaluation Reasoning (THOUGHT)**:
- What does quality mean for this dimension?
- What evidence should I look for?
- How do I remain objective?
2. **Evaluation (ACTION)**:
- Read iteration completely
- Load appropriate evaluator logic
- Score each sub-dimension with evidence
- Calculate total dimension score
3. **Analysis (OBSERVATION)**:
- Identify specific strengths
- Identify specific weaknesses
- Provide evidence for scores
- Suggest improvements
**Critical**: Always provide specific evidence. Never say "code quality is good" without examples like "lines 45-67 demonstrate excellent input validation with clear error messages."
### Utility Command: `/rank`
**When to use**: Ranking all iterations in a directory
**Syntax**:
```
/rank <output_dir> [dimension]
```
**Key responsibilities**:
1. **Pre-Ranking Reasoning (THOUGHT)**:
- What makes fair ranking?
- What patterns to look for?
- How to interpret rankings?
2. **Ranking (ACTION)**:
- Load all evaluations
- Calculate composite scores
- Sort and segment
- Identify quality profiles
3. **Pattern Analysis (OBSERVATION)**:
- Quality clusters and outliers
- Dimension trade-offs
- Success/failure factors
- Strategic recommendations
**Critical**: Rankings should reveal insights, not just order. Explain what separates top from bottom performers.
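A hedged sketch of what the ranking step could look like in code, assuming each iteration has an evaluation JSON with a `composite_score` field (the file layout and tier cutoffs are assumptions, not defined by the command):
```python
import json
from pathlib import Path

# Sketch: load per-iteration evaluation JSON files, sort by composite score,
# and segment into rough tiers. File layout and tier cutoffs are assumptions.
def rank_iterations(eval_dir: str) -> list[dict]:
    evaluations = []
    for path in sorted(Path(eval_dir).glob("*_evaluation.json")):
        evaluations.append(json.loads(path.read_text()))
    ranked = sorted(evaluations, key=lambda e: e["composite_score"], reverse=True)
    for position, item in enumerate(ranked, start=1):
        item["rank"] = position
        score = item["composite_score"]
        item["tier"] = ("top" if score >= 85 else
                        "middle" if score >= 70 else "bottom")
    return ranked
```
Segmenting into tiers this way makes the pattern-analysis step easier: you can compare the top tier against the bottom tier dimension by dimension when explaining what separates them.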
### Utility Command: `/quality-report`
**When to use**: Generating comprehensive quality reports
**Syntax**:
```
/quality-report <output_dir> [wave_number]
```
**Key responsibilities**:
1. **Pre-Report Reasoning (THOUGHT)**:
- Purpose and audience
- Most important insights
- How to visualize quality
2. **Report Generation (ACTION)**:
- Aggregate all evaluation data
- Calculate comprehensive statistics
- Generate text visualizations
- Identify patterns and insights
3. **Strategic Recommendations (OBSERVATION)**:
- Actionable next steps
- Creative direction suggestions
- Quality targets for next wave
- System improvements
**Critical**: Reports must be actionable. Every insight should lead to a concrete recommendation.
## Evaluation Guidelines
### Technical Quality Evaluation
**Focus on**:
- **Code Quality**: Readability, comments, naming, DRY
- **Architecture**: Modularity, separation, reusability, scalability
- **Performance**: Render speed, animation fps, algorithms, DOM ops
- **Robustness**: Validation, error handling, edge cases, compatibility
**Scoring approach**:
- Look for concrete evidence in code
- Compare against standards in `evaluators/technical_quality.md`
- Score each sub-dimension 0-25 points
- Document specific examples
- Total: 0-100 points
**Example evidence**:
- Good: "Lines 120-145: Efficient caching mechanism reduces redundant calculations"
- Bad: "Performance is good"
### Creativity Score Evaluation
**Focus on**:
- **Originality**: Novel concepts, fresh perspectives, unexpected approaches
- **Innovation**: Creative solutions, clever techniques, boundary-pushing
- **Uniqueness**: Differentiation from others, distinctive identity, memorability
- **Aesthetic**: Visual appeal, color harmony, typography, polish
**Scoring approach**:
- Recognize creativity is partially subjective
- Look for objective indicators of novelty
- Compare against standards in `evaluators/creativity_score.md`
- Reward creative risk-taking
- Total: 0-100 points
**Example evidence**:
- Good: "Novel data-as-music-notation concept, first iteration to use audio sonification"
- Bad: "This is creative"
### Spec Compliance Evaluation
**Focus on**:
- **Requirements Met**: Functional, technical, design requirements (40 points)
- **Naming Conventions**: Pattern adherence, quality (20 points)
- **Structure Adherence**: File structure, code organization (20 points)
- **Quality Standards**: Baselines met (20 points)
**Scoring approach**:
- Treat spec as checklist
- Binary or proportional scoring per requirement
- Compare against standards in `evaluators/spec_compliance.md`
- Be objective and evidence-based
- Total: 0-100 points
**Example evidence**:
- Good: "Spec requires 20+ data points, iteration has 50 points ✓ [4/4 points]"
- Bad: "Meets requirements"
## Scoring Calibration
Use these reference points to ensure consistent scoring:
**90-100 (Exceptional)**: Excellence in all sub-dimensions, exemplary work
**80-89 (Excellent)**: Strong across all sub-dimensions, minor improvements possible
**70-79 (Good)**: Solid in most sub-dimensions, some areas need work
**60-69 (Adequate)**: Meets basic requirements, notable weaknesses
**50-59 (Needs Improvement)**: Below expectations, significant issues
**Below 50 (Insufficient)**: Major deficiencies, fails basic criteria
**Critical**: Most iterations should fall in the 60-85 range. Scores of 90+ should be rare and truly exceptional. Scores below 60 indicate serious problems.
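A small sketch of mapping a numeric score onto the calibration bands above, so labels stay consistent across evaluations and reports (band boundaries taken directly from the table):
```python
# Sketch: map a 0-100 score onto the calibration bands listed above.
CALIBRATION_BANDS = [
    (90, "Exceptional"),
    (80, "Excellent"),
    (70, "Good"),
    (60, "Adequate"),
    (50, "Needs Improvement"),
    (0,  "Insufficient"),
]

def calibration_label(score: float) -> str:
    for floor, label in CALIBRATION_BANDS:
        if score >= floor:
            return label
    return "Insufficient"

print(calibration_label(83))  # -> "Excellent"
```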
## Quality Report Best Practices
When generating quality reports:
1. **Start with Executive Summary**: 3 key insights, 1 priority recommendation
2. **Provide Statistics**: Mean, median, std dev, min, max for all dimensions
3. **Visualize Distribution**: Text-based histograms and charts
4. **Identify Patterns**: What makes top iterations succeed? What causes low scores?
5. **Analyze Trade-offs**: Which dimensions compete? Which synergize?
6. **Give Strategic Recommendations**: Specific, actionable, prioritized
7. **Self-Assess Report Quality**: Is this useful? Honest? Comprehensive?
**Critical**: Every report should drive improvement. If it doesn't lead to actionable insights, it's not a good report.
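A hedged sketch of points 2 and 3 above: per-dimension summary statistics plus a text-based histogram. The input is simply a list of scores (composite or a single dimension); bucket size and formatting are assumptions:
```python
import statistics

# Sketch: summary statistics and a text histogram for a list of scores (0-100).
def summarize(scores: list[float]) -> dict:
    return {
        "mean": round(statistics.mean(scores), 1),
        "median": statistics.median(scores),
        "std_dev": round(statistics.stdev(scores), 1) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

def text_histogram(scores: list[float], bucket_size: int = 10) -> str:
    lines = []
    for floor in range(0, 100, bucket_size):
        ceiling = floor + bucket_size
        count = sum(1 for s in scores
                    if floor <= s < ceiling or (ceiling == 100 and s == 100))
        lines.append(f"{floor:3d}-{ceiling - 1:<3d} | {'#' * count}")
    return "\n".join(lines)

scores = [62, 71, 74, 78, 81, 85, 88]
print(summarize(scores))
print(text_histogram(scores))
```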
## Infinite Mode Strategy
When running in infinite mode (`count: infinite`):
### Wave 1 (Foundation)
- Generate 6-8 iterations with diverse creative directions
- Establish baseline quality metrics
- Identify initial strengths and weaknesses
- Generate wave 1 report
### Wave 2+ (Progressive Improvement)
**THOUGHT Phase**:
- What made wave 1 top performers successful?
- What quality dimensions need improvement?
- What creative directions are underexplored?
- How can we push quality higher?
**ACTION Phase**:
- Generate 6-8 new iterations
- Incorporate lessons from previous waves
- Target quality gaps
- Increase challenge in strong areas
**OBSERVATION Phase**:
- Evaluate new iterations
- Update overall rankings
- Generate wave-specific report
- Compare to previous waves
**Adaptation**:
- Quality improving? Continue strategy
- Quality stagnating? Adjust approach
- Quality declining? Investigate and correct
**Critical**: Don't just repeat the same strategy. Each wave should learn from previous waves. Show progressive improvement.
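One way the improving/stagnating/declining decision could be expressed, as a rough sketch; the tolerance threshold is an assumption, not mandated by the commands:
```python
# Sketch: wave-over-wave adaptation signal based on mean composite score.
# The tolerance threshold is illustrative.
def adaptation_signal(prev_wave_mean: float, current_wave_mean: float,
                      tolerance: float = 1.0) -> str:
    delta = current_wave_mean - prev_wave_mean
    if delta > tolerance:
        return "improving: continue current strategy"
    if delta < -tolerance:
        return "declining: investigate and correct"
    return "stagnating: adjust creative directions or quality targets"

print(adaptation_signal(74.2, 78.9))  # -> "improving: continue current strategy"
```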
## Configuration Customization
Users can customize scoring through `config/scoring_weights.json`:
**Default weights**:
- Technical: 35%, Creativity: 35%, Compliance: 30%
**Alternative profiles**:
- `technical_focus`: 50/25/25 - For production code
- `creative_focus`: 25/50/25 - For exploratory projects
- `compliance_focus`: 30/25/45 - For standardization
- `innovation_priority`: 20/60/20 - For research
**When using custom config**:
1. Load and validate config
2. Apply weights to composite score calculation
3. Document which config is being used
4. Note how it affects scoring
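A minimal sketch of steps 1-2 above, assuming `config/scoring_weights.json` exposes the three weight keys directly at the top level (the exact schema is an assumption):
```python
import json

# Sketch: load and validate config/scoring_weights.json before applying it.
# Only the three weight keys from the profiles above are assumed.
def load_weights(config_path: str = "config/scoring_weights.json") -> dict:
    with open(config_path) as f:
        config = json.load(f)
    weights = {k: config[k] for k in ("technical", "creativity", "compliance")}
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError(f"Weights must sum to 1.0, got {sum(weights.values())}")
    return weights

# A technical_focus profile (50/25/25) would look like:
# {"technical": 0.50, "creativity": 0.25, "compliance": 0.25}
```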
## Common Pitfalls to Avoid
1. **Evaluation without Reasoning**: Don't just score - explain why
2. **Inconsistent Scoring**: Apply same criteria to all iterations
3. **Vague Feedback**: Provide specific evidence and examples
4. **Ignoring Trade-offs**: Recognize when dimensions compete
5. **Not Learning from Results**: Use observations to inform next actions
6. **Artificial Precision**: Don't pretend scores are more accurate than they are
7. **Forgetting Balance**: All three dimensions matter
## Success Criteria
A successful execution of this system demonstrates:
1. **Meaningful Differentiation**: Scores clearly separate quality levels
2. **Evidence-Based Scoring**: Every score backed by specific examples
3. **Actionable Insights**: Reports lead to concrete improvements
4. **Visible Learning**: Quality improves over waves in infinite mode
5. **Transparent Reasoning**: ReAct pattern evident throughout
6. **Fair Consistency**: Same criteria applied to all iterations
## File Organization
**When generating iterations**:
```
{output_dir}/
├── iteration_001.html
├── iteration_002.html
├── ...
└── quality_reports/
    ├── evaluations/
    │   ├── iteration_001_evaluation.json
    │   ├── iteration_002_evaluation.json
    │   └── ...
    ├── rankings/
    │   ├── ranking_report.md
    │   └── ranking_data.json
    └── reports/
        ├── wave_1_report.md
        ├── wave_2_report.md
        └── ...
```
**Critical**: Keep quality data organized. Store evaluations as JSON for machine readability, reports as Markdown for human readability.
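A hedged example of what one machine-readable evaluation file might contain and where it would be written; the field names are illustrative, not a fixed schema:
```python
import json
from pathlib import Path

# Sketch: write one evaluation record into quality_reports/evaluations/.
# Field names and values are illustrative.
evaluation = {
    "iteration": "iteration_001.html",
    "scores": {"technical": 82, "creativity": 76, "compliance": 90},
    "composite_score": 82.3,  # 82*0.35 + 76*0.35 + 90*0.30
    "evidence": {
        "technical": ["Lines 120-145: caching mechanism avoids redundant calculations"],
        "creativity": ["Novel data-as-music-notation concept"],
        "compliance": ["Spec requires 20+ data points; iteration renders 50"],
    },
}

out_dir = Path("output/quality_reports/evaluations")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "iteration_001_evaluation.json").write_text(json.dumps(evaluation, indent=2))
```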
## Integration with Main Infinite Loop Project
This variant builds on the original infinite loop pattern with quality-focused enhancements:
**Shared concepts**:
- Multi-agent parallel orchestration
- Specification-driven generation
- Wave-based iteration (infinite mode)
- Context management
**New additions**:
- ReAct reasoning pattern
- Multi-dimensional quality evaluation
- Automated ranking and reporting
- Quality-driven strategy adaptation
**Critical**: This is not a replacement for the original pattern - it's an enhancement focused on quality assessment and continuous improvement.
## Examples
### Example 1: Small Batch with Quality Focus
```
User: /project:infinite-quality specs/example_spec.md output/ 5
You should:
1. Analyze spec quality criteria
2. Reason about quality goals (THOUGHT)
3. Generate 5 diverse iterations (ACTION)
4. Evaluate all on all dimensions (ACTION)
5. Rank and report (OBSERVATION)
6. Provide top 3 insights
Expected output:
- 5 HTML files in output/
- 5 evaluation JSON files in output/quality_reports/evaluations/
- 1 ranking report in output/quality_reports/rankings/
- 1 quality report in output/quality_reports/reports/
- Summary with key findings and recommendations
```
### Example 2: Infinite Mode with Learning
```
User: /project:infinite-quality specs/example_spec.md output/ infinite
You should:
Wave 1:
- Generate 6-8 iterations
- Evaluate and rank
- Report baseline quality
Wave 2:
- THOUGHT: Learn from wave 1 top performers
- ACTION: Generate 6-8 new iterations with lessons applied
- OBSERVATION: Evaluate, rank, report improvements
Wave 3+:
- Continue THOUGHT → ACTION → OBSERVATION cycle
- Show progressive quality improvement
- Adapt strategy based on observations
- Continue until context limits
Expected pattern:
- Quality scores increase over waves
- Strategy evolves based on observations
- Reports show learning and adaptation
```
## When to Ask for Clarification
Ask user if:
- Spec lacks quality criteria (offer to use defaults)
- Custom config has invalid weights (offer to fix)
- Unclear whether to prioritize technical vs creative
- Infinite mode strategy needs direction
- Evaluation criteria should be adjusted
Don't ask about:
- How to score (use evaluators/ logic)
- Report format (use templates/ structure)
- Ranking methodology (use rank.md process)
- Standard evaluation process (documented in commands)
## Version & Maintenance
**Current Version**: 1.0
**Created**: 2025-10-10
**Pattern**: Infinite Agentic Loop + ReAct Reasoning
**Dependencies**: Claude Code custom commands, WebFetch for ReAct pattern research
**Future considerations**:
- Automated testing integration
- Visual quality report generation
- Meta-learning on evaluation criteria
- User feedback integration
---
**Remember**: Quality evaluation is not about being harsh or lenient - it's about being fair, consistent, and helpful. Use the ReAct pattern to reason thoughtfully, act systematically, and observe honestly. Let quality assessment drive continuous improvement.