Ranking Utility Command

Rank all iterations in a directory based on composite quality scores using ReAct reasoning.

Syntax

/rank <output_dir> [dimension]

Parameters:

  • output_dir: Directory containing iterations and evaluation results
  • dimension: Optional - Rank by specific dimension (technical/creativity/compliance) instead of composite

Examples:

/rank output/
/rank output/ creativity
/rank output/ technical

Execution Process

THOUGHT Phase: Reasoning About Ranking

Before ranking, reason about:

  1. What makes a fair ranking system?

    • Consistent evaluation criteria across all iterations
    • Appropriate weighting of dimensions
    • Recognition of different quality profiles
    • Avoidance of artificial precision
  2. What patterns should I look for?

    • Quality clusters (groups of similar scores)
    • Outliers (exceptionally high or low)
    • Quality trade-offs (high in one dimension, low in another)
    • Quality progression (improvement over iteration sequence)
  3. How should I interpret rankings?

    • Top 20%: Exemplary iterations
    • Middle 60%: Solid, meeting expectations
    • Bottom 20%: Learning opportunities
    • Not about "bad" vs "good" but about relative quality
  4. What insights can rankings reveal?

    • Which creative directions succeed?
    • Which quality dimensions need more focus?
    • Are there unexpected quality leaders?
    • Is quality improving over time?

ACTION Phase: Execute Ranking

  1. Load All Evaluations

    • Scan {output_dir}/quality_reports/evaluations/ for all evaluation JSON files
    • Parse each evaluation result
    • Extract scores for all dimensions
    • Verify evaluation completeness
  2. Calculate Composite Scores (if not already calculated; see the sketch after this list)

    For each iteration:

    composite_score = (technical * 0.35) + (creativity * 0.35) + (compliance * 0.30)
    

    Store in ranking structure:

    {
      "iteration": "iteration_001.html",
      "scores": {
        "technical": 78,
        "creativity": 82,
        "compliance": 68,
        "composite": 76.4
      }
    }
    
  3. Sort by Selected Dimension

    • Sort iterations by composite score (or specified dimension)
    • Maintain stable sort (preserve order for ties)
    • Assign ranks (1 = highest)
  4. Calculate Statistics (see the sketch after this list)

    Statistics:
    - Count: Total number of iterations
    - Mean: Average score
    - Median: Middle value
    - Std Dev: Score distribution spread
    - Min: Lowest score
    - Max: Highest score
    - Range: Max - Min
    - Quartiles: Q1 (25th percentile), Q2 (50th percentile), Q3 (75th percentile)
    
  5. Identify Quality Segments

    • Exemplary (Top 20%): Rank 1 to ceil(count * 0.2)
    • Proficient (Next 30%): Rank ceil(count * 0.2)+1 to ceil(count * 0.5)
    • Adequate (Next 30%): Rank ceil(count * 0.5)+1 to ceil(count * 0.8)
    • Developing (Bottom 20%): Rank ceil(count * 0.8)+1 to count
  6. Analyze Quality Profiles

    For each iteration, determine quality profile:

    def quality_profile(tech, creative, compliance):
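        # 80 marks an "excellent" dimension; if no dimension exceeds it, fall through to Balanced Generalist.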
        if tech > 80 and creative > 80 and compliance > 80:
            return "Triple Threat - Excellent in all dimensions"
        elif tech > 80 and creative > 80:
            return "Technical Innovator - Strong tech + creativity"
        elif creative > 80 and compliance > 80:
            return "Compliant Creator - Creative within bounds"
        elif tech > 80 and compliance > 80:
            return "Reliable Engineer - Solid technical compliance"
        elif creative > 80:
            return "Creative Maverick - Innovation focus"
        elif tech > 80:
            return "Technical Specialist - Engineering excellence"
        elif compliance > 80:
            return "Spec Guardian - Perfect adherence"
        else:
            return "Balanced Generalist - Even across dimensions"
    
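
A minimal Python sketch of steps 2-5 follows. It assumes each evaluation JSON exposes an "iteration" name and a "scores" object holding the three dimension scores; those field names and the file layout are illustrative assumptions, not a fixed schema.

    import json
    import math
    import statistics
    from pathlib import Path

    WEIGHTS = {"technical": 0.35, "creativity": 0.35, "compliance": 0.30}

    def load_rankings(output_dir, dimension="composite"):
        """Steps 2-3: compute composite scores, then sort and assign ranks."""
        entries = []
        eval_dir = Path(output_dir) / "quality_reports" / "evaluations"
        for path in sorted(eval_dir.glob("*.json")):                # sorted => stable tie order
            data = json.loads(path.read_text())
            scores = {dim: data["scores"][dim] for dim in WEIGHTS}  # assumed field names
            scores["composite"] = round(
                sum(scores[dim] * weight for dim, weight in WEIGHTS.items()), 1
            )
            entries.append({"iteration": data["iteration"], "scores": scores})

        # Python's sort is stable, so tied scores keep their original order.
        entries.sort(key=lambda e: e["scores"][dimension], reverse=True)
        for rank, entry in enumerate(entries, start=1):             # 1 = highest
            entry["rank"] = rank
        return entries

    def summarize(entries, dimension="composite"):
        """Steps 4-5: summary statistics plus the 20/30/30/20 quality segments."""
        values = [e["scores"][dimension] for e in entries]
        q1, q2, q3 = statistics.quantiles(values, n=4)              # 25th/50th/75th percentiles
        stats = {
            "count": len(values),
            "mean": round(statistics.mean(values), 1),
            "median": round(statistics.median(values), 1),
            "std_dev": round(statistics.stdev(values), 1),
            "min": min(values),
            "max": max(values),
            "range": round(max(values) - min(values), 1),
            "quartiles": {"q1": round(q1, 1), "q2": round(q2, 1), "q3": round(q3, 1)},
        }
        cutoffs = [                                                 # (tier, highest rank it covers)
            ("Exemplary", math.ceil(len(entries) * 0.2)),
            ("Proficient", math.ceil(len(entries) * 0.5)),
            ("Adequate", math.ceil(len(entries) * 0.8)),
            ("Developing", len(entries)),
        ]
        segments = {}
        for entry in entries:
            tier = next(name for name, limit in cutoffs if entry["rank"] <= limit)
            segments.setdefault(tier, []).append(entry["iteration"])
        return stats, segments

Calling load_rankings(output_dir, dimension) and then summarize(...) yields everything the report below needs except the qualitative commentary (profiles, insights, recommendations).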

OBSERVATION Phase: Document Rankings

Output comprehensive ranking report:

=== QUALITY RANKINGS REPORT ===

Directory: output/
Ranked by: Composite Score
Total Iterations: 20
Generated: 2025-10-10T14:45:23Z

--- SUMMARY STATISTICS ---

Composite Scores:
  Mean:     72.4
  Median:   73.5
  Std Dev:  8.2
  Min:      58.0
  Max:      89.5
  Range:    31.5

Quartiles:
  Q1 (25%): 67.2
  Q2 (50%): 73.5
  Q3 (75%): 78.8

--- TOP PERFORMERS (Top 20%) ---

Rank 1: iteration_012.html - Score: 89.5
  Technical: 92 | Creativity: 95 | Compliance: 78
  Profile: Technical Innovator - Strong tech + creativity
  Strengths: Exceptional innovation, excellent code quality, novel approach
  Notable: Highest creativity score in entire batch

Rank 2: iteration_007.html - Score: 86.2
  Technical: 88 | Creativity: 89 | Compliance: 81
  Profile: Triple Threat - Excellent in all dimensions
  Strengths: Well-rounded excellence, balanced quality, consistent execution
  Notable: Most balanced high performer

Rank 3: iteration_018.html - Score: 84.7
  Technical: 85 | Creativity: 82 | Compliance: 87
  Profile: Triple Threat - Excellent in all dimensions
  Strengths: Perfect spec adherence, clean architecture, robust implementation
  Notable: Highest compliance score in batch

Rank 4: iteration_003.html - Score: 82.1
  Technical: 80 | Creativity: 88 | Compliance: 76
  Profile: Creative Maverick - Innovation focus
  Strengths: Unique visual design, innovative interactions, aesthetic excellence

--- PROFICIENT PERFORMERS (Top 20-50%) ---

Rank 5: iteration_015.html - Score: 78.9
  Technical: 77 | Creativity: 79 | Compliance: 80
  Profile: Balanced Generalist - Even across dimensions

Rank 6: iteration_009.html - Score: 77.6
  Technical: 82 | Creativity: 75 | Compliance: 76
  Profile: Technical Specialist - Engineering excellence

[... continues ...]

--- DEVELOPING ITERATIONS (Bottom 20%) ---

Rank 17: iteration_005.html - Score: 62.3
  Technical: 65 | Creativity: 68 | Compliance: 55
  Profile: Balanced Generalist - Even across dimensions
  Growth Areas: Improve spec compliance, strengthen naming conventions

Rank 18: iteration_011.html - Score: 60.8
  Technical: 58 | Creativity: 72 | Compliance: 52
  Profile: Balanced Generalist - Even across dimensions
  Growth Areas: Boost technical robustness, enhance spec adherence

Rank 19: iteration_016.html - Score: 59.4
  Technical: 62 | Creativity: 55 | Compliance: 61
  Profile: Balanced Generalist - Even across dimensions
  Growth Areas: Increase creativity, explore unique approaches

Rank 20: iteration_001.html - Score: 58.0
  Technical: 60 | Creativity: 58 | Compliance: 56
  Profile: Balanced Generalist - Even across dimensions
  Growth Areas: Early iteration - establish stronger foundation

--- DIMENSIONAL ANALYSIS ---

Technical Quality Distribution:
  Mean: 74.2, Range: 58-92
  Top: iteration_012 (92)
  Pattern: Strong technical quality overall, few outliers

Creativity Score Distribution:
  Mean: 75.8, Range: 55-95
  Top: iteration_012 (95)
  Pattern: Wide distribution, high variance in creative approaches

Spec Compliance Distribution:
  Mean: 67.3, Range: 52-87
  Top: iteration_018 (87)
  Pattern: Compliance varies significantly, improvement opportunity

--- QUALITY TRADE-OFFS ---

Trade-off Pattern 1: "Creativity vs Compliance"
  Iterations: 003, 011, 004
  Pattern: High creativity (avg 85) paired with lower compliance (avg 62)
  Insight: Creative explorations sometimes sacrifice spec adherence

Trade-off Pattern 2: "Technical vs Creative"
  Iterations: 006, 013
  Pattern: High technical (avg 88) paired with moderate creativity (avg 70)
  Insight: Technical focus may constrain creative experimentation

--- QUALITY INSIGHTS ---

1. Quality Leaders Excel in Balance
   - Top 3 iterations all score 80+ in at least 2 dimensions
   - Success requires multi-dimensional excellence, not single strength

2. Compliance is Weakest Dimension
   - Mean compliance (67.3) lags technical (74.2) and creativity (75.8)
   - 60% of iterations score below 70 in compliance
   - Recommendation: Emphasize spec adherence in next wave

3. Creativity Shows Highest Variance
   - Std dev of 12.1 (vs 8.4 technical, 9.2 compliance)
   - Indicates diverse creative approaches - positive diversity
   - Some iterations play it safe, others push boundaries

4. Quality Improves Mid-Batch
   - Iterations 7-15 show 8% higher average scores than 1-6 or 16-20
   - Pattern suggests learning curve, then fatigue/repetition
   - Recommendation: Maintain mid-batch momentum in future waves

5. No "Perfect 100" Iterations
   - Max score: 89.5 (iteration_012)
   - Indicates room for improvement across all dimensions
   - Opportunity: Study iteration_012 and push further

--- RECOMMENDATIONS FOR NEXT WAVE ---

Based on ranking analysis:

1. **Amplify Success Patterns**
   - Study iteration_012 creative techniques
   - Replicate iteration_018 compliance approach
   - Maintain iteration_007 balanced excellence

2. **Address Compliance Gap**
   - Provide clearer spec guidance in sub-agent prompts
   - Add compliance checkpoints during generation
   - Review spec for clarity issues

3. **Encourage Balanced Excellence**
   - Reward multi-dimensional quality over single-dimension spikes
   - Design creative directions that maintain compliance
   - Set minimum thresholds for all dimensions (e.g., 70+)

4. **Explore Quality Frontiers**
   - Current max is 89.5 - can we reach 95+?
   - Identify specific innovations from top iterations
   - Push technical, creative, AND compliance simultaneously

5. **Maintain Creative Diversity**
   - High creativity variance is valuable
   - Continue diverse creative directions
   - But add "creative compliance" as explicit goal

--- RANKING DATA (JSON) ---

[Export full ranking data as JSON for programmatic access]

THOUGHT Phase: Reflect on Rankings

After generating rankings, reason about:

  1. Do the rankings make sense?

    • Do high-ranked iterations genuinely feel higher quality?
    • Are low-ranked iterations actually weaker?
    • Any surprising rankings that warrant investigation?
  2. What story do the rankings tell?

    • Is quality improving, declining, or stable?
    • Are there clear quality clusters?
    • What separates good from great?
  3. How should this inform strategy?

    • What should next wave prioritize?
    • Which creative directions should be amplified?
    • Which quality dimensions need focus?
  4. Are evaluation criteria working?

    • Do scores differentiate quality meaningfully?
    • Are weights (35/35/30) appropriate?
    • Should criteria be adjusted?

Output Storage

Rankings are stored in:

{output_dir}/quality_reports/rankings/ranking_report.md
{output_dir}/quality_reports/rankings/ranking_data.json

JSON format enables:

  • Historical tracking
  • Trend analysis
  • Visualization
  • Programmatic access
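
As a rough sketch of how the export might be written (the directory paths come from above; the payload fields are illustrative, not a required schema):

    import json
    from datetime import datetime, timezone
    from pathlib import Path

    def export_ranking_data(output_dir, entries, stats, dimension="composite"):
        """Write ranking_data.json alongside the markdown report for later trend analysis."""
        rankings_dir = Path(output_dir) / "quality_reports" / "rankings"
        rankings_dir.mkdir(parents=True, exist_ok=True)
        payload = {
            "generated": datetime.now(timezone.utc).isoformat(),
            "ranked_by": dimension,
            "statistics": stats,       # e.g. the dict returned by summarize() above
            "rankings": entries,
        }
        (rankings_dir / "ranking_data.json").write_text(json.dumps(payload, indent=2))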

Success Criteria

A successful ranking demonstrates:

  • Clear differentiation of quality levels
  • Meaningful insights about quality patterns
  • Actionable recommendations for improvement
  • Fair and consistent application of criteria
  • Transparent reasoning about rankings
  • Evidence-based quality assessment

Remember: Rankings are not judgments of worth - they're tools for learning. Every iteration teaches us something about quality, and rankings help us identify patterns and opportunities for growth.