# CLAUDE.md - Infinite Loop Variant 4: Quality Evaluation & Ranking System

This file provides guidance to Claude Code when working with the Quality Evaluation & Ranking System variant of the infinite agentic loop pattern.

## Project Overview

This is Infinite Loop Variant 4, implementing **automated quality evaluation and ranking** for AI-generated iterations. The system uses the **ReAct pattern** (Reasoning + Acting + Observation) to evaluate, score, rank, and continuously improve iteration quality across multiple dimensions.

## Key Concepts

### ReAct Pattern Integration

Every operation in this system follows the ReAct cycle:

1. **THOUGHT (Reasoning)**: Explicitly reason about quality, evaluation strategy, and improvement opportunities before acting
2. **ACTION (Acting)**: Execute evaluations, generate content, and score iterations with clear intent
3. **OBSERVATION (Observing)**: Analyze results, identify patterns, and extract insights to inform the next cycle

**Critical**: Always document reasoning. Every evaluation, ranking, and report should show the thought process that led to its conclusions.

### Multi-Dimensional Quality

Quality is assessed across **three complementary dimensions**, weighted by default as follows (see the composite-score sketch below):

- **Technical Quality (35%)**: Code, architecture, performance, robustness
- **Creativity Score (35%)**: Originality, innovation, uniqueness, aesthetics
- **Spec Compliance (30%)**: Requirements, naming, structure, standards

**Critical**: Never evaluate just one dimension. Quality is holistic. An iteration can be technically perfect but creatively bland, or wildly creative but technically flawed. Balance matters.

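
For illustration, a minimal sketch of how the default weights combine into a composite score; the function and dictionary names are hypothetical, not part of the command set:

```python
# Minimal sketch of the default composite score, assuming the 35/35/30 weights
# above and 0-100 scores per dimension. Names are illustrative, not a fixed API.
DEFAULT_WEIGHTS = {"technical": 0.35, "creativity": 0.35, "compliance": 0.30}

def composite_score(scores: dict[str, float], weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of dimension scores, e.g. 0.35*82 + 0.35*76 + 0.30*90 = 82.3."""
    return round(sum(weights[dim] * scores[dim] for dim in weights), 1)
```
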

### Quality-Driven Improvement

In infinite mode, quality assessment drives generation strategy:

- **Early waves**: Establish baseline, explore diversity
- **Mid waves**: Learn from top performers, address gaps
- **Late waves**: Push frontiers, optimize composite scores
- **All waves**: Monitor trends, adapt continuously

**Critical**: Don't just generate and evaluate. Learn from evaluations and adapt strategy accordingly. Let observations inform next actions.

## Commands to Use

### Primary Command: `/project:infinite-quality`

**When to use**: Generating iterations with quality evaluation and ranking

**Syntax**:

```
/project:infinite-quality <spec_path> <output_dir> <count|infinite> [config_path]
```

**Key responsibilities when executing**:


1. **Initial Reasoning (THOUGHT)**:
   - Deeply understand spec quality criteria
   - Plan evaluation strategy
   - Design quality-driven creative directions
   - Consider what quality means in this context

2. **Generation (ACTION)**:
   - Launch sub-agents with quality targets
   - Provide spec + quality standards
   - Assign diverse creative directions
   - Generate with self-assessment

3. **Evaluation (ACTION)**:
   - Score all iterations on all dimensions
   - Use evaluators from the `evaluators/` directory
   - Document evidence for all scores
   - Be fair, consistent, and thorough

4. **Ranking & Reporting (OBSERVATION)**:
   - Rank by composite score
   - Identify patterns and trade-offs
   - Extract actionable insights
   - Generate a comprehensive report

5. **Strategy Adaptation (THOUGHT for next wave)**:
   - Learn from top performers
   - Address quality gaps
   - Adjust creative directions
   - Refine quality targets

**Example execution flow**:

```
User: /project:infinite-quality specs/example_spec.md output/ 10

You should:
1. Read and analyze specs/example_spec.md deeply
2. Read specs/quality_standards.md for evaluation criteria
3. Reason about quality goals for this spec (THOUGHT)
4. Launch 10 sub-agents with diverse creative directions (ACTION)
5. Evaluate all 10 iterations using evaluators/ logic (ACTION)
6. Rank iterations and generate quality report (OBSERVATION)
7. Present key findings and recommendations
```


### Utility Command: `/evaluate`

**When to use**: Evaluating a single iteration on specific dimensions

**Syntax**:

```
/evaluate <dimension> <iteration_path> [spec_path]
```

**Dimensions**: `technical`, `creativity`, `compliance`, `all`

**Key responsibilities**:

1. **Pre-Evaluation Reasoning (THOUGHT)**:
   - What does quality mean for this dimension?
   - What evidence should I look for?
   - How do I remain objective?

2. **Evaluation (ACTION)**:
   - Read the iteration completely
   - Load the appropriate evaluator logic
   - Score each sub-dimension with evidence
   - Calculate the total dimension score

3. **Analysis (OBSERVATION)**:
   - Identify specific strengths
   - Identify specific weaknesses
   - Provide evidence for scores
   - Suggest improvements

**Critical**: Always provide specific evidence. Never say "code quality is good" without examples like "lines 45-67 demonstrate excellent input validation with clear error messages."


### Utility Command: `/rank`

**When to use**: Ranking all iterations in a directory

**Syntax**:

```
/rank <output_dir> [dimension]
```

**Key responsibilities**:

1. **Pre-Ranking Reasoning (THOUGHT)**:
   - What makes a fair ranking?
   - What patterns should I look for?
   - How should rankings be interpreted?

2. **Ranking (ACTION)**:
   - Load all evaluations
   - Calculate composite scores (see the sketch below)
   - Sort and segment
   - Identify quality profiles

3. **Pattern Analysis (OBSERVATION)**:
   - Quality clusters and outliers
   - Dimension trade-offs
   - Success/failure factors
   - Strategic recommendations

**Critical**: Rankings should reveal insights, not just order. Explain what separates top performers from bottom performers.

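
To make the ranking ACTION phase concrete, here is a minimal sketch of loading stored evaluations, computing composite scores, and sorting. It assumes the evaluation JSON layout described under File Organization and the default weights; the file and field names are illustrative, not a guaranteed schema.

```python
import json
from pathlib import Path

# Illustrative sketch: load per-iteration evaluation JSON files, compute the
# weighted composite, and sort descending. Field names ("technical", etc.) are
# assumptions about the evaluation schema, not a documented format.
WEIGHTS = {"technical": 0.35, "creativity": 0.35, "compliance": 0.30}

def rank_iterations(output_dir: str) -> list[tuple[str, float]]:
    eval_dir = Path(output_dir) / "quality_reports" / "evaluations"
    ranked = []
    for path in sorted(eval_dir.glob("*_evaluation.json")):
        scores = json.loads(path.read_text())
        composite = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
        ranked.append((path.stem.replace("_evaluation", ""), round(composite, 1)))
    # Highest composite score first
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```
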

### Utility Command: `/quality-report`

**When to use**: Generating comprehensive quality reports

**Syntax**:

```
/quality-report <output_dir> [wave_number]
```

**Key responsibilities**:

1. **Pre-Report Reasoning (THOUGHT)**:
   - Purpose and audience
   - Most important insights
   - How to visualize quality

2. **Report Generation (ACTION)**:
   - Aggregate all evaluation data
   - Calculate comprehensive statistics
   - Generate text visualizations
   - Identify patterns and insights

3. **Strategic Recommendations (OBSERVATION)**:
   - Actionable next steps
   - Creative direction suggestions
   - Quality targets for next wave
   - System improvements

**Critical**: Reports must be actionable. Every insight should lead to a concrete recommendation.


## Evaluation Guidelines

### Technical Quality Evaluation

**Focus on**:

- **Code Quality**: Readability, comments, naming, DRY
- **Architecture**: Modularity, separation, reusability, scalability
- **Performance**: Render speed, animation FPS, algorithms, DOM operations
- **Robustness**: Validation, error handling, edge cases, compatibility

**Scoring approach**:

- Look for concrete evidence in the code
- Compare against the standards in `evaluators/technical_quality.md`
- Score each sub-dimension 0-25 points
- Document specific examples
- Total: 0-100 points

**Example evidence**:

- Good: "Lines 120-145: Efficient caching mechanism reduces redundant calculations"
- Bad: "Performance is good"


### Creativity Score Evaluation

**Focus on**:

- **Originality**: Novel concepts, fresh perspectives, unexpected approaches
- **Innovation**: Creative solutions, clever techniques, boundary-pushing
- **Uniqueness**: Differentiation from others, distinctive identity, memorability
- **Aesthetic**: Visual appeal, color harmony, typography, polish

**Scoring approach**:

- Recognize that creativity is partially subjective
- Look for objective indicators of novelty
- Compare against the standards in `evaluators/creativity_score.md`
- Reward creative risk-taking
- Total: 0-100 points

**Example evidence**:

- Good: "Novel data-as-music-notation concept, first iteration to use audio sonification"
- Bad: "This is creative"


### Spec Compliance Evaluation

**Focus on**:

- **Requirements Met**: Functional, technical, design requirements (40 points)
- **Naming Conventions**: Pattern adherence, quality (20 points)
- **Structure Adherence**: File structure, code organization (20 points)
- **Quality Standards**: Baselines met (20 points)

**Scoring approach**:

- Treat the spec as a checklist
- Use binary or proportional scoring per requirement
- Compare against the standards in `evaluators/spec_compliance.md`
- Be objective and evidence-based
- Total: 0-100 points

**Example evidence**:

- Good: "Spec requires 20+ data points, iteration has 50 data points ✓ [4/4 points]"
- Bad: "Meets requirements"


## Scoring Calibration

Use these reference points to ensure consistent scoring (a small mapping helper is sketched below):

- **90-100 (Exceptional)**: Excellence in all sub-dimensions, exemplary work
- **80-89 (Excellent)**: Strong across all sub-dimensions, minor improvements possible
- **70-79 (Good)**: Solid in most sub-dimensions, some areas need work
- **60-69 (Adequate)**: Meets basic requirements, notable weaknesses
- **50-59 (Needs Improvement)**: Below expectations, significant issues
- **Below 50 (Insufficient)**: Major deficiencies, fails basic criteria

**Critical**: Most iterations should fall in the 60-85 range. Scores of 90+ should be rare and truly exceptional. Scores below 60 indicate serious problems.

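
If band labels are needed programmatically (for example in ranking or report output), a mapping consistent with the bands above might look like the following; the helper is a sketch, not an existing utility:

```python
# Illustrative helper mapping a 0-100 score to the calibration band above.
# The band labels mirror this document; the function itself is hypothetical.
def calibration_band(score: float) -> str:
    if score >= 90:
        return "Exceptional"
    if score >= 80:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 60:
        return "Adequate"
    if score >= 50:
        return "Needs Improvement"
    return "Insufficient"
```
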

## Quality Report Best Practices

When generating quality reports:

1. **Start with an Executive Summary**: 3 key insights, 1 priority recommendation
2. **Provide Statistics**: Mean, median, standard deviation, min, and max for all dimensions
3. **Visualize Distribution**: Text-based histograms and charts (see the sketch below)
4. **Identify Patterns**: What makes top iterations succeed? What causes low scores?
5. **Analyze Trade-offs**: Which dimensions compete? Which synergize?
6. **Give Strategic Recommendations**: Specific, actionable, prioritized
7. **Self-Assess Report Quality**: Is this useful? Honest? Comprehensive?

**Critical**: Every report should drive improvement. If it doesn't lead to actionable insights, it's not a good report.

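
As a concrete illustration of items 2 and 3 above, here is a minimal sketch of summary statistics and a text histogram for one dimension's scores; the function names and input format are assumptions, not an existing report API:

```python
import statistics

# Sketch for report items 2 and 3: summary statistics and a text histogram
# over a list of 0-100 scores for a single dimension.
def summarize(scores: list[float]) -> dict[str, float]:
    return {
        "mean": round(statistics.mean(scores), 1),
        "median": statistics.median(scores),
        "std_dev": round(statistics.stdev(scores), 1) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

def text_histogram(scores: list[float]) -> str:
    # Ten buckets of width 10; a score of exactly 100 is counted in the last bucket.
    lines = []
    for low in range(0, 100, 10):
        count = sum(1 for s in scores if low <= s < low + 10 or (low == 90 and s == 100))
        lines.append(f"{low:2d}-{low + 9:<3d} | {'#' * count}")
    return "\n".join(lines)
```
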

## Infinite Mode Strategy

When running in infinite mode (`count: infinite`):

### Wave 1 (Foundation)

- Generate 6-8 iterations with diverse creative directions
- Establish baseline quality metrics
- Identify initial strengths and weaknesses
- Generate the wave 1 report

### Wave 2+ (Progressive Improvement)

**THOUGHT Phase**:

- What made wave 1 top performers successful?
- What quality dimensions need improvement?
- What creative directions are underexplored?
- How can we push quality higher?

**ACTION Phase**:

- Generate 6-8 new iterations
- Incorporate lessons from previous waves
- Target quality gaps
- Increase challenge in strong areas

**OBSERVATION Phase**:

- Evaluate new iterations
- Update overall rankings
- Generate a wave-specific report
- Compare to previous waves

**Adaptation**:

- Quality improving? Continue the strategy
- Quality stagnating? Adjust the approach
- Quality declining? Investigate and correct

**Critical**: Don't just repeat the same strategy. Each wave should learn from previous waves. Show progressive improvement.


## Configuration Customization

Users can customize scoring through `config/scoring_weights.json`:

**Default weights**:

- Technical: 35%, Creativity: 35%, Compliance: 30%

**Alternative profiles** (Technical/Creativity/Compliance):

- `technical_focus`: 50/25/25 - For production code
- `creative_focus`: 25/50/25 - For exploratory projects
- `compliance_focus`: 30/25/45 - For standardization
- `innovation_priority`: 20/60/20 - For research

**When using a custom config**:

1. Load and validate the config (see the sketch below)
2. Apply the weights to the composite score calculation
3. Document which config is being used
4. Note how it affects scoring

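
A minimal sketch of steps 1 and 2 follows. The config schema is not fixed by this document, so the flat dimension-to-weight mapping assumed here is an illustration only:

```python
import json

# Sketch of loading and validating config/scoring_weights.json. The schema
# (a flat mapping of dimension name to weight summing to 1.0) is an assumption.
def load_weights(config_path: str = "config/scoring_weights.json") -> dict[str, float]:
    with open(config_path) as f:
        weights = json.load(f)
    expected = {"technical", "creativity", "compliance"}
    if set(weights) != expected or abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError(f"Invalid weights in {config_path}: {weights}")
    return weights
```
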

## Common Pitfalls to Avoid

1. **Evaluation without Reasoning**: Don't just score - explain why
2. **Inconsistent Scoring**: Apply the same criteria to all iterations
3. **Vague Feedback**: Provide specific evidence and examples
4. **Ignoring Trade-offs**: Recognize when dimensions compete
5. **Not Learning from Results**: Use observations to inform next actions
6. **Artificial Precision**: Don't pretend scores are more accurate than they are
7. **Forgetting Balance**: All three dimensions matter


## Success Criteria

A successful execution of this system demonstrates:

1. **Meaningful Differentiation**: Scores clearly separate quality levels
2. **Evidence-Based Scoring**: Every score backed by specific examples
3. **Actionable Insights**: Reports lead to concrete improvements
4. **Visible Learning**: Quality improves over waves in infinite mode
5. **Transparent Reasoning**: ReAct pattern evident throughout
6. **Fair Consistency**: Same criteria applied to all iterations


## File Organization

**When generating iterations**:

```
{output_dir}/
├── iteration_001.html
├── iteration_002.html
├── ...
└── quality_reports/
    ├── evaluations/
    │   ├── iteration_001_evaluation.json
    │   ├── iteration_002_evaluation.json
    │   └── ...
    ├── rankings/
    │   ├── ranking_report.md
    │   └── ranking_data.json
    └── reports/
        ├── wave_1_report.md
        ├── wave_2_report.md
        └── ...
```

**Critical**: Keep quality data organized. Store evaluations as JSON for machine readability, reports as Markdown for human readability.

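
For illustration, writing one evaluation record might look like the sketch below. The exact fields of an evaluation JSON are not fixed by this document, so the schema and values here are assumptions:

```python
import json
from pathlib import Path

# Hypothetical evaluation record for one iteration; field names are illustrative.
evaluation = {
    "iteration": "iteration_001",
    "technical": 82,    # 0-100, per evaluators/technical_quality.md
    "creativity": 76,   # 0-100, per evaluators/creativity_score.md
    "compliance": 90,   # 0-100, per evaluators/spec_compliance.md
    "composite": 82.3,  # 0.35*82 + 0.35*76 + 0.30*90
    "evidence": {"technical": "Lines 120-145: efficient caching reduces redundant calculations"},
}

out = Path("output/quality_reports/evaluations/iteration_001_evaluation.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(evaluation, indent=2))
```
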

## Integration with Main Infinite Loop Project

This variant builds on the original infinite loop pattern with quality-focused enhancements:

**Shared concepts**:

- Multi-agent parallel orchestration
- Specification-driven generation
- Wave-based iteration (infinite mode)
- Context management

**New additions**:

- ReAct reasoning pattern
- Multi-dimensional quality evaluation
- Automated ranking and reporting
- Quality-driven strategy adaptation

**Critical**: This is not a replacement for the original pattern - it's an enhancement focused on quality assessment and continuous improvement.


## Examples

### Example 1: Small Batch with Quality Focus

```
User: /project:infinite-quality specs/example_spec.md output/ 5

You should:
1. Analyze spec quality criteria
2. Reason about quality goals (THOUGHT)
3. Generate 5 diverse iterations (ACTION)
4. Evaluate all on all dimensions (ACTION)
5. Rank and report (OBSERVATION)
6. Provide top 3 insights

Expected output:
- 5 HTML files in output/
- 5 evaluation JSON files in output/quality_reports/evaluations/
- 1 ranking report in output/quality_reports/rankings/
- 1 quality report in output/quality_reports/reports/
- Summary with key findings and recommendations
```


### Example 2: Infinite Mode with Learning

```
User: /project:infinite-quality specs/example_spec.md output/ infinite

You should:

Wave 1:
- Generate 6-8 iterations
- Evaluate and rank
- Report baseline quality

Wave 2:
- THOUGHT: Learn from wave 1 top performers
- ACTION: Generate 6-8 new iterations with lessons applied
- OBSERVATION: Evaluate, rank, report improvements

Wave 3+:
- Continue THOUGHT → ACTION → OBSERVATION cycle
- Show progressive quality improvement
- Adapt strategy based on observations
- Continue until context limits

Expected pattern:
- Quality scores increase over waves
- Strategy evolves based on observations
- Reports show learning and adaptation
```


## When to Ask for Clarification

Ask the user if:

- The spec lacks quality criteria (offer to use defaults)
- A custom config has invalid weights (offer to fix)
- It is unclear whether to prioritize technical vs creative quality
- The infinite mode strategy needs direction
- Evaluation criteria should be adjusted

Don't ask about:

- How to score (use evaluators/ logic)
- Report format (use templates/ structure)
- Ranking methodology (use rank.md process)
- The standard evaluation process (documented in commands)


## Version & Maintenance

**Current Version**: 1.0

**Created**: 2025-10-10

**Pattern**: Infinite Agentic Loop + ReAct Reasoning

**Dependencies**: Claude Code custom commands, WebFetch for ReAct pattern research

**Future considerations**:

- Automated testing integration
- Visual quality report generation
- Meta-learning on evaluation criteria
- User feedback integration

---

**Remember**: Quality evaluation is not about being harsh or lenient - it's about being fair, consistent, and helpful. Use the ReAct pattern to reason thoughtfully, act systematically, and observe honestly. Let quality assessment drive continuous improvement.