Ranking Utility Command
Rank all iterations in a directory based on composite quality scores using ReAct reasoning.
Syntax
/rank <output_dir> [dimension]
Parameters:
- `output_dir`: Directory containing iterations and evaluation results
- `dimension`: Optional. Rank by a specific dimension (technical/creativity/compliance) instead of the composite score.
Examples:
/rank output/
/rank output/ creativity
/rank output/ technical
Execution Process
THOUGHT Phase: Reasoning About Ranking
Before ranking, reason about:
- **What makes a fair ranking system?**
  - Consistent evaluation criteria across all iterations
  - Appropriate weighting of dimensions
  - Recognition of different quality profiles
  - Avoidance of artificial precision
- **What patterns should I look for?**
  - Quality clusters (groups of similar scores)
  - Outliers (exceptionally high or low)
  - Quality trade-offs (high in one dimension, low in another)
  - Quality progression (improvement over the iteration sequence)
- **How should I interpret rankings?**
  - Top 20%: Exemplary iterations
  - Middle 60%: Solid, meeting expectations
  - Bottom 20%: Learning opportunities
  - Not about "bad" vs "good" but about relative quality
- **What insights can rankings reveal?**
  - Which creative directions succeed?
  - Which quality dimensions need more focus?
  - Are there unexpected quality leaders?
  - Is quality improving over time?
ACTION Phase: Execute Ranking
- **Load All Evaluations**
  - Scan `{output_dir}/quality_reports/evaluations/` for all evaluation JSON files
  - Parse each evaluation result
  - Extract scores for all dimensions
  - Verify evaluation completeness
- **Calculate Composite Scores** (if not already calculated)

  For each iteration:

  ```
  composite_score = (technical * 0.35) + (creativity * 0.35) + (compliance * 0.30)
  ```

  Store in a ranking structure:

  ```json
  {
    "iteration": "iteration_001.html",
    "scores": {
      "technical": 78,
      "creativity": 82,
      "compliance": 68,
      "composite": 76.4
    }
  }
  ```
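The composite calculation can be expressed as a small helper using the 35/35/30 weights above (a sketch; the function name is illustrative):

```python
def composite_score(technical, creativity, compliance):
    """Weighted composite using the 35/35/30 split from the spec,
    rounded to one decimal to avoid artificial precision."""
    return round(technical * 0.35 + creativity * 0.35 + compliance * 0.30, 1)
```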
- **Sort by Selected Dimension**
  - Sort iterations by composite score (or the specified dimension)
  - Use a stable sort (preserve order for ties)
  - Assign ranks (1 = highest)
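The sort-and-rank step might look like the following sketch. Python's `sorted` is stable, so tied scores keep their original (iteration) order:

```python
def rank_iterations(evaluations, dimension="composite"):
    """Return (rank, evaluation) pairs; rank 1 = highest score."""
    # sorted() is stable: equal scores keep their input order
    ordered = sorted(evaluations,
                     key=lambda ev: ev["scores"][dimension],
                     reverse=True)
    return list(enumerate(ordered, start=1))
```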
- **Calculate Statistics**
  - Count: Total number of iterations
  - Mean: Average score
  - Median: Middle value
  - Std Dev: Score distribution spread
  - Min: Lowest score
  - Max: Highest score
  - Range: Max minus Min
  - Quartiles: Q1 (25th percentile), Q2 (50th), Q3 (75th)
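These statistics map directly onto Python's standard `statistics` module; a minimal sketch (`statistics.quantiles` with `n=4` returns the three quartile cut points):

```python
import statistics

def score_stats(scores):
    """Summary statistics for a list of scores (needs >= 2 values)."""
    q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
    return {
        "count": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "std_dev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        "quartiles": (q1, q2, q3),
    }
```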
- **Identify Quality Segments**
  - Exemplary (Top 20%): Rank 1 to ceil(count * 0.2)
  - Proficient (Next 30%): Rank ceil(count * 0.2)+1 to ceil(count * 0.5)
  - Adequate (Next 30%): Rank ceil(count * 0.5)+1 to ceil(count * 0.8)
  - Developing (Bottom 20%): Rank ceil(count * 0.8)+1 to count
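The segment boundaries can be computed from the 1-based rank alone; a minimal sketch:

```python
import math

def quality_segment(rank, count):
    """Map a 1-based rank to its quality segment."""
    if rank <= math.ceil(count * 0.2):
        return "Exemplary"
    elif rank <= math.ceil(count * 0.5):
        return "Proficient"
    elif rank <= math.ceil(count * 0.8):
        return "Adequate"
    return "Developing"
```

With 20 iterations this yields ranks 1-4 Exemplary, 5-10 Proficient, 11-16 Adequate, and 17-20 Developing.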
- **Analyze Quality Profiles**

  For each iteration, determine its quality profile:

  ```python
  def quality_profile(tech, creative, compliance):
      if tech > 80 and creative > 80 and compliance > 80:
          return "Triple Threat - Excellent in all dimensions"
      elif tech > 80 and creative > 80:
          return "Technical Innovator - Strong tech + creativity"
      elif creative > 80 and compliance > 80:
          return "Compliant Creator - Creative within bounds"
      elif tech > 80 and compliance > 80:
          return "Reliable Engineer - Solid technical compliance"
      elif creative > 80:
          return "Creative Maverick - Innovation focus"
      elif tech > 80:
          return "Technical Specialist - Engineering excellence"
      elif compliance > 80:
          return "Spec Guardian - Perfect adherence"
      else:
          return "Balanced Generalist - Even across dimensions"
  ```
OBSERVATION Phase: Document Rankings
Output comprehensive ranking report:
=== QUALITY RANKINGS REPORT ===
Directory: output/
Ranked by: Composite Score
Total Iterations: 20
Generated: 2025-10-10T14:45:23Z
--- SUMMARY STATISTICS ---
Composite Scores:
Mean: 72.4
Median: 73.5
Std Dev: 8.2
Min: 58.0
Max: 89.5
Range: 31.5
Quartiles:
Q1 (25%): 67.2
Q2 (50%): 73.5
Q3 (75%): 78.8
--- TOP PERFORMERS (Top 20%) ---
Rank 1: iteration_012.html - Score: 89.5
Technical: 92 | Creativity: 95 | Compliance: 78
Profile: Technical Innovator - Strong tech + creativity
Strengths: Exceptional innovation, excellent code quality, novel approach
Notable: Highest creativity score in entire batch
Rank 2: iteration_007.html - Score: 86.2
Technical: 88 | Creativity: 89 | Compliance: 81
Profile: Triple Threat - Excellent in all dimensions
Strengths: Well-rounded excellence, balanced quality, consistent execution
Notable: Most balanced high performer
Rank 3: iteration_018.html - Score: 84.7
Technical: 85 | Creativity: 82 | Compliance: 87
Profile: Reliable Engineer - Solid technical compliance
Strengths: Perfect spec adherence, clean architecture, robust implementation
Notable: Highest compliance score in batch
Rank 4: iteration_003.html - Score: 82.1
Technical: 80 | Creativity: 88 | Compliance: 76
Profile: Creative Maverick - Innovation focus
Strengths: Unique visual design, innovative interactions, aesthetic excellence
--- PROFICIENT PERFORMERS (30-50%) ---
Rank 5: iteration_015.html - Score: 78.9
Technical: 77 | Creativity: 79 | Compliance: 80
Profile: Balanced Generalist - Even across dimensions
Rank 6: iteration_009.html - Score: 77.6
Technical: 82 | Creativity: 75 | Compliance: 76
Profile: Technical Specialist - Engineering excellence
[... continues ...]
--- DEVELOPING ITERATIONS (Bottom 20%) ---
Rank 17: iteration_005.html - Score: 62.3
Technical: 65 | Creativity: 68 | Compliance: 55
Profile: Balanced Generalist - Even across dimensions
Growth Areas: Improve spec compliance, strengthen naming conventions
Rank 18: iteration_011.html - Score: 60.8
Technical: 58 | Creativity: 72 | Compliance: 52
Profile: Creative Maverick - Innovation focus
Growth Areas: Boost technical robustness, enhance spec adherence
Rank 19: iteration_016.html - Score: 59.4
Technical: 62 | Creativity: 55 | Compliance: 61
Profile: Balanced Generalist - Even across dimensions
Growth Areas: Increase creativity, explore unique approaches
Rank 20: iteration_001.html - Score: 58.0
Technical: 60 | Creativity: 58 | Compliance: 56
Profile: Balanced Generalist - Even across dimensions
Growth Areas: Early iteration - establish stronger foundation
--- DIMENSIONAL ANALYSIS ---
Technical Quality Distribution:
Mean: 74.2, Range: 58-92
Top: iteration_012 (92)
Pattern: Strong technical quality overall, few outliers
Creativity Score Distribution:
Mean: 75.8, Range: 55-95
Top: iteration_012 (95)
Pattern: Wide distribution, high variance in creative approaches
Spec Compliance Distribution:
Mean: 67.3, Range: 52-87
Top: iteration_018 (87)
Pattern: Compliance varies significantly, improvement opportunity
--- QUALITY TRADE-OFFS ---
Trade-off Pattern 1: "Creativity vs Compliance"
Iterations: 003, 011, 004
Pattern: High creativity (avg 85) paired with lower compliance (avg 62)
Insight: Creative explorations sometimes sacrifice spec adherence
Trade-off Pattern 2: "Technical vs Creative"
Iterations: 006, 013
Pattern: High technical (avg 88) paired with moderate creativity (avg 70)
Insight: Technical focus may constrain creative experimentation
--- QUALITY INSIGHTS ---
1. Quality Leaders Excel in Balance
- Top 3 iterations all score 80+ in at least 2 dimensions
- Success requires multi-dimensional excellence, not single strength
2. Compliance is Weakest Dimension
- Mean compliance (67.3) lags technical (74.2) and creativity (75.8)
- 60% of iterations score below 70 in compliance
- Recommendation: Emphasize spec adherence in next wave
3. Creativity Shows Highest Variance
- Std dev of 12.1 (vs 8.4 technical, 9.2 compliance)
- Indicates diverse creative approaches - positive diversity
- Some iterations play it safe, others push boundaries
4. Quality Improves Mid-Batch
- Iterations 7-15 show 8% higher average scores than 1-6 or 16-20
- Pattern suggests learning curve, then fatigue/repetition
- Recommendation: Maintain mid-batch momentum in future waves
5. No "Perfect 100" Iterations
- Max score: 89.5 (iteration_012)
- Indicates room for improvement across all dimensions
- Opportunity: Study iteration_012 and push further
--- RECOMMENDATIONS FOR NEXT WAVE ---
Based on ranking analysis:
1. **Amplify Success Patterns**
- Study iteration_012 creative techniques
- Replicate iteration_018 compliance approach
- Maintain iteration_007 balanced excellence
2. **Address Compliance Gap**
- Provide clearer spec guidance in sub-agent prompts
- Add compliance checkpoints during generation
- Review spec for clarity issues
3. **Encourage Balanced Excellence**
- Reward multi-dimensional quality over single-dimension spikes
- Design creative directions that maintain compliance
- Set minimum thresholds for all dimensions (e.g., 70+)
4. **Explore Quality Frontiers**
- Current max is 89.5 - can we reach 95+?
- Identify specific innovations from top iterations
- Push technical, creative, AND compliance simultaneously
5. **Maintain Creative Diversity**
- High creativity variance is valuable
- Continue diverse creative directions
- But add "creative compliance" as explicit goal
--- RANKING DATA (JSON) ---
[Export full ranking data as JSON for programmatic access]
THOUGHT Phase: Reflect on Rankings
After generating rankings, reason about:
- **Do the rankings make sense?**
  - Do high-ranked iterations genuinely feel higher quality?
  - Are low-ranked iterations actually weaker?
  - Any surprising rankings that warrant investigation?
- **What story do the rankings tell?**
  - Is quality improving, declining, or stable?
  - Are there clear quality clusters?
  - What separates good from great?
- **How should this inform strategy?**
  - What should the next wave prioritize?
  - Which creative directions should be amplified?
  - Which quality dimensions need focus?
- **Are the evaluation criteria working?**
  - Do scores differentiate quality meaningfully?
  - Are the weights (35/35/30) appropriate?
  - Should criteria be adjusted?
Output Storage
Rankings are stored in:
{output_dir}/quality_reports/rankings/ranking_report.md
{output_dir}/quality_reports/rankings/ranking_data.json
JSON format enables:
- Historical tracking
- Trend analysis
- Visualization
- Programmatic access
Success Criteria
A successful ranking demonstrates:
- Clear differentiation of quality levels
- Meaningful insights about quality patterns
- Actionable recommendations for improvement
- Fair and consistent application of criteria
- Transparent reasoning about rankings
- Evidence-based quality assessment
Remember: Rankings are not judgments of worth - they're tools for learning. Every iteration teaches us something about quality, and rankings help us identify patterns and opportunities for growth.