Initial commit: P2P Wiki AI system

- RAG-based chat with 39k wiki articles (232k chunks)
- Article ingress pipeline for processing external URLs
- Review queue for AI-generated content
- FastAPI backend with web UI
- Traefik-ready Docker setup for p2pwiki.jeffemmett.com

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 13:53:29 +01:00 · 2026-01-23 13:53:29 +01:00 · 4ebd90cc64
commit 4ebd90cc64
16 changed files with 26481 additions and 0 deletions
--- a/.env.example
+++ b/.env.example
@ -0,0 +1,17 @@
 # P2P Wiki AI Configuration
 # Ollama (Local LLM)
 OLLAMA_BASE_URL=http://localhost:11434
 OLLAMA_MODEL=llama3.2
 # Claude API (Optional - for higher quality article drafts)
 ANTHROPIC_API_KEY=
 CLAUDE_MODEL=claude-sonnet-4-20250514
 # Hybrid Routing
 USE_CLAUDE_FOR_DRAFTS=true
 USE_OLLAMA_FOR_CHAT=true
 # Server
 HOST=0.0.0.0
 PORT=8420
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,28 @@
 # Virtual environment
 .venv/
 venv/
 env/
 # Python
 __pycache__/
 *.py[cod]
 *.egg-info/
 dist/
 build/
 # Data files (too large for git)
 data/articles.json
 data/chroma/
 data/review_queue/
 xmldump/
 xmldump-2014.tar.gz
 articles/
 articles.tar.gz
 # Environment
 .env
 # IDE
 .idea/
 .vscode/
 *.swp
--- a/Dockerfile
+++ b/Dockerfile
@ -0,0 +1,49 @@
 # P2P Wiki AI - Multi-stage build
 FROM python:3.11-slim as builder
 WORKDIR /app
 # Install build dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*
 # Install Python dependencies
 COPY pyproject.toml .
 RUN pip install --no-cache-dir build && \
    pip wheel --no-cache-dir --wheel-dir /wheels .
 # Production image
 FROM python:3.11-slim
 WORKDIR /app
 # Install runtime dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
    libxml2 \
    && rm -rf /var/lib/apt/lists/*
 # Copy wheels and install
 COPY --from=builder /wheels /wheels
 RUN pip install --no-cache-dir /wheels/*.whl && rm -rf /wheels
 # Copy application code
 COPY src/ src/
 COPY web/ web/
 # Create data directories
 RUN mkdir -p data/chroma data/review_queue
 # Environment variables
 ENV PYTHONUNBUFFERED=1
 ENV PYTHONDONTWRITEBYTECODE=1
 # Expose port
 EXPOSE 8420
 # Health check
 HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import httpx; httpx.get('http://localhost:8420/health').raise_for_status()" || exit 1
 # Run the application
 CMD ["python", "-m", "uvicorn", "src.api:app", "--host", "0.0.0.0", "--port", "8420"]
--- a/README.md
+++ b/README.md
@ -0,0 +1,199 @@
 # P2P Wiki AI
 AI-augmented system for the P2P Foundation Wiki with two main features:
 1. **Conversational Agent** - Ask questions about the 23,000+ wiki articles using RAG (Retrieval Augmented Generation)
 2. **Article Ingress Pipeline** - Drop article URLs to automatically analyze content, find matching wiki articles for citations, and generate draft articles
 ## Architecture
 ```
 ┌─────────────────────────────────────────────────────────────────┐
 │                    P2P Wiki AI System                           │
 ├─────────────────────────────────────────────────────────────────┤
 │                                                                 │
 │  ┌─────────────────┐     ┌─────────────────┐                   │
 │  │   Chat (Q&A)    │     │  Ingress Tool   │                   │
 │  │   via RAG       │     │  (URL Drop)     │                   │
 │  └────────┬────────┘     └────────┬────────┘                   │
 │           │                       │                             │
 │           └───────────┬───────────┘                             │
 │                       ▼                                         │
 │           ┌───────────────────────┐                             │
 │           │    FastAPI Backend    │                             │
 │           └───────────┬───────────┘                             │
 │                       │                                         │
 │        ┌──────────────┼──────────────┐                         │
 │        ▼              ▼              ▼                          │
 │  ┌──────────┐  ┌─────────────┐  ┌──────────────┐               │
 │  │ ChromaDB │  │ Ollama/     │  │   Article    │               │
 │  │ (Vector) │  │ Claude      │  │   Scraper    │               │
 │  └──────────┘  └─────────────┘  └──────────────┘               │
 │                                                                 │
 └─────────────────────────────────────────────────────────────────┘
 ```
 ## Quick Start
 ### 1. Prerequisites
 - Python 3.10+
 - [Ollama](https://ollama.ai) installed locally (or access to a remote Ollama server)
 - Optional: Anthropic API key for Claude (higher quality article drafts)
 ### 2. Install Dependencies
 ```bash
 cd p2pwiki-content
 pip install -e .
 ```
 ### 3. Parse Wiki Content
 Convert the MediaWiki XML dumps to searchable JSON:
 ```bash
 python -m src.parser
 ```
 This creates `data/articles.json` with all parsed articles (~23,000 pages).
 ### 4. Generate Embeddings
 Create the vector store for semantic search:
 ```bash
 python -m src.embeddings
 ```
 This creates the ChromaDB vector store in `data/chroma/`. Takes a few minutes.
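To give a rough idea of what happens to each article before it is embedded, here is a minimal sketch of fixed-size chunking with overlap. The 1000-character chunk size and 200-character overlap mirror the constants in `src/embeddings.py`; the function name `chunk_text` is hypothetical, and the real implementation additionally tries to break at sentence boundaries and prepends the article title.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # last chunk reached; do not emit a redundant tail
        start = end - overlap  # step back so neighbouring chunks share context
    return chunks


if __name__ == "__main__":
    sample = "x" * 2500
    print([len(c) for c in chunk_text(sample)])
```

The overlap means a sentence that straddles a chunk boundary is still fully contained in at least one chunk, at the cost of storing some text twice.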
 ### 5. Configure Environment
 ```bash
 cp .env.example .env
 # Edit .env with your settings
 ```
 ### 6. Run the Server
 ```bash
 python -m src.api
 ```
 Visit http://localhost:8420/ui for the web interface.
 ## Docker Deployment
 For production deployment on the RS 8000:
 ```bash
 # Build and run
 docker compose up -d --build
 # Check logs
 docker compose logs -f
 # Access at http://localhost:8420/ui
 # Or via Traefik at https://p2pwiki.jeffemmett.com
 ```
 ## API Endpoints
 ### Chat
 ```bash
 # Ask a question
 curl -X POST http://localhost:8420/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "What is commons-based peer production?"}'
 ```
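The same call can be made from Python. This is a minimal sketch using only the standard library; it assumes the server is running locally on the default port, and the helper names `build_chat_request` and `ask` are hypothetical. The `query` and `n_results` fields come from the `ChatRequest` model in `src/api.py`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8420"  # assumption: server running locally on the default port


def build_chat_request(query: str, n_results: int = 5) -> urllib.request.Request:
    """Build a POST request for the /chat endpoint (nothing is sent yet)."""
    body = json.dumps({"query": query, "n_results": n_results}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str) -> dict:
    """Send the question and return the decoded JSON response."""
    with urllib.request.urlopen(build_chat_request(query), timeout=60) as resp:
        return json.load(resp)
```

Calling `ask("What is commons-based peer production?")` should return a dictionary with the `answer`, `sources`, and `query` fields defined by `ChatResponse`.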
 ### Ingress
 ```bash
 # Process an external article
 curl -X POST http://localhost:8420/ingress \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article-about-cooperatives"}'
 ```
 ### Review Queue
 ```bash
 # Get all items in review queue
 curl http://localhost:8420/review
 # Approve a draft article
 curl -X POST http://localhost:8420/review/action \
  -H "Content-Type: application/json" \
  -d '{"filepath": "/path/to/item.json", "item_type": "draft", "item_index": 0, "action": "approve"}'
 ```
 ### Search
 ```bash
 # Direct vector search
 curl "http://localhost:8420/search?q=cooperative%20economics&n=10"
 # List article titles
 curl "http://localhost:8420/articles?limit=100"
 ```
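When building search URLs programmatically, the query string needs to be percent-encoded. The sketch below shows one way to do that with the standard library; `search_url` is a hypothetical helper, and the optional `categories` parameter follows the comma-separated form that the `/search` handler in `src/api.py` splits on.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8420"  # assumption: default local server


def search_url(query: str, n: int = 10, categories: Optional[list[str]] = None) -> str:
    """Build a /search URL; categories are sent as one comma-separated value."""
    params = {"q": query, "n": n}
    if categories:
        params["categories"] = ",".join(categories)
    # quote_via=quote encodes spaces as %20 rather than "+"
    return f"{BASE_URL}/search?{urlencode(params, quote_via=quote)}"
```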
 ## Hybrid AI Routing
 The system routes each task to either a local (Ollama) or a cloud (Claude) LLM, depending on the task type and the flags in `.env`:
 | Task | Default LLM | Reasoning |
 |------|-------------|-----------|
 | Chat Q&A | Ollama | Fast, free, good enough for retrieval-based answers |
 | Content Analysis | Claude | Better at extracting topics and identifying wiki relevance |
 | Draft Generation | Claude | Higher quality article writing |
 | Embeddings | Local (sentence-transformers) | Fast, free, optimized for semantic search |
 Configure in `.env`:
 ```
 USE_CLAUDE_FOR_DRAFTS=true
 USE_OLLAMA_FOR_CHAT=true
 ```
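As an illustration of how these two flags could drive backend selection, here is a small sketch. It is not the code in `src/llm.py`; the names `RoutingFlags` and `pick_backend` are hypothetical, and the fallback to Ollama when no API key is set is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RoutingFlags:
    """Hypothetical container for the routing settings from .env."""
    use_claude_for_drafts: bool = True
    use_ollama_for_chat: bool = True
    anthropic_api_key: str = ""


def pick_backend(task: str, flags: RoutingFlags) -> str:
    """Return "claude" or "ollama" for a task name ("chat" or "draft")."""
    if task == "draft":
        wants_claude = flags.use_claude_for_drafts
    elif task == "chat":
        wants_claude = not flags.use_ollama_for_chat
    else:
        raise ValueError(f"unknown task: {task}")
    # Assumption: fall back to the local model when no API key is configured.
    claude_available = bool(flags.anthropic_api_key)
    return "claude" if wants_claude and claude_available else "ollama"
```

With the defaults shown above and an API key set, chat requests would go to Ollama and draft generation to Claude.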
 ## Project Structure
 ```
 p2pwiki-content/
 ├── src/
 │   ├── api.py          # FastAPI backend
 │   ├── config.py       # Configuration settings
 │   ├── embeddings.py   # Vector store (ChromaDB)
 │   ├── ingress.py      # Article scraper & analyzer
 │   ├── llm.py          # LLM client (Ollama/Claude)
 │   ├── parser.py       # MediaWiki XML parser
 │   └── rag.py          # RAG chat system
 ├── web/
 │   └── index.html      # Web UI
 ├── data/
 │   ├── articles.json   # Parsed wiki content
 │   ├── chroma/         # Vector store
 │   └── review_queue/   # Pending ingress items
 ├── xmldump/            # MediaWiki XML dumps
 ├── docker-compose.yml
 ├── Dockerfile
 └── pyproject.toml
 ```
 ## Content Coverage
 The P2P Foundation Wiki contains ~23,000 articles covering:
 - Peer-to-peer networks and culture
 - Commons-based peer production (CBPP)
 - Alternative economics and post-capitalism
 - Cooperative business models
 - Open source and free culture
 - Collaborative governance
 - Sustainability and ecology
 ## License
 The wiki content is from the P2P Foundation under their respective licenses.
 The AI system code is provided as-is for educational purposes.
--- a/docker-compose.yml
+++ b/docker-compose.yml
@ -0,0 +1,38 @@
 version: '3.8'
 services:
  p2pwiki-ai:
    build: .
    container_name: p2pwiki-ai
    restart: unless-stopped
    ports:
      - "8420:8420"
    volumes:
      # Persist vector store and review queue
      - ./data:/app/data
      # Mount XML dumps for parsing (read-only)
      - ./xmldump:/app/xmldump:ro
    environment:
      # Ollama connection (adjust host for your setup)
      - OLLAMA_BASE_URL=${OLLAMA_BASE_URL:-http://host.docker.internal:11434}
      - OLLAMA_MODEL=${OLLAMA_MODEL:-llama3.2}
      # Claude API (optional, for higher quality drafts)
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - CLAUDE_MODEL=${CLAUDE_MODEL:-claude-sonnet-4-20250514}
      # Hybrid routing settings
      - USE_CLAUDE_FOR_DRAFTS=${USE_CLAUDE_FOR_DRAFTS:-true}
      - USE_OLLAMA_FOR_CHAT=${USE_OLLAMA_FOR_CHAT:-true}
    labels:
      # Traefik labels for reverse proxy
      - "traefik.enable=true"
      - "traefik.http.routers.p2pwiki-ai.rule=Host(`p2pwiki.jeffemmett.com`)"
      - "traefik.http.services.p2pwiki-ai.loadbalancer.server.port=8420"
    networks:
      - traefik-public
    # Map host.docker.internal to the host gateway (required on Linux; Docker Desktop provides it automatically)
    extra_hosts:
      - "host.docker.internal:host-gateway"
 networks:
  traefik-public:
    external: true
--- a/pagenames.txt
+++ b/pagenames.txt
--- a/pyproject.toml
+++ b/pyproject.toml
@ -0,0 +1,64 @@
 [project]
 name = "p2pwiki-ai"
 version = "0.1.0"
 description = "AI-augmented system for P2P Foundation Wiki - chat agent and ingress pipeline"
 requires-python = ">=3.10"
 dependencies = [
    # Core
    "fastapi>=0.109.0",
    "uvicorn[standard]>=0.27.0",
    "pydantic>=2.5.0",
    "pydantic-settings>=2.1.0",
    # XML parsing
    "lxml>=5.1.0",
    # Vector store & embeddings
    "chromadb>=0.4.22",
    "sentence-transformers>=2.3.0",
    # LLM integration
    "openai>=1.10.0",  # For Ollama-compatible API
    "anthropic>=0.18.0",  # For Claude API
    "httpx>=0.26.0",
    # Article scraping
    "trafilatura>=1.6.0",
    "newspaper3k>=0.2.8",
    "beautifulsoup4>=4.12.0",
    "requests>=2.31.0",
    # Utilities
    "python-dotenv>=1.0.0",
    "rich>=13.7.0",
    "tqdm>=4.66.0",
    "tenacity>=8.2.0",
 ]
 [project.optional-dependencies]
 dev = [
    "pytest>=7.4.0",
    "pytest-asyncio>=0.23.0",
    "black>=24.1.0",
    "ruff>=0.1.0",
 ]
 [project.scripts]
 p2pwiki-parse = "src.parser:main"
 p2pwiki-embed = "src.embeddings:main"
 p2pwiki-serve = "src.api:main"
 [build-system]
 requires = ["setuptools>=68.0", "wheel"]
 build-backend = "setuptools.build_meta"
 [tool.setuptools.packages.find]
 where = ["."]
 [tool.black]
 line-length = 100
 target-version = ["py310"]
 [tool.ruff]
 line-length = 100
 select = ["E", "F", "I", "N", "W"]
--- a/src/__init__.py
+++ b/src/__init__.py
@ -0,0 +1 @@
 """P2P Wiki AI System - Chat agent and ingress pipeline."""
--- a/src/api.py
+++ b/src/api.py
@ -0,0 +1,320 @@
 """FastAPI backend for P2P Wiki AI system."""
 import asyncio
 from contextlib import asynccontextmanager
 from pathlib import Path
 from typing import Optional
 from fastapi import FastAPI, HTTPException, BackgroundTasks
 from fastapi.middleware.cors import CORSMiddleware
 from fastapi.staticfiles import StaticFiles
 from fastapi.responses import FileResponse
 from pydantic import BaseModel, HttpUrl
 from .config import settings
 from .embeddings import WikiVectorStore
 from .rag import WikiRAG, RAGResponse
 from .ingress import IngressPipeline, get_review_queue, approve_item, reject_item
 # Global instances
 vector_store: Optional[WikiVectorStore] = None
 rag_system: Optional[WikiRAG] = None
 ingress_pipeline: Optional[IngressPipeline] = None
@asynccontextmanager
 async def lifespan(app: FastAPI):
    """Initialize services on startup."""
    global vector_store, rag_system, ingress_pipeline
    print("Initializing P2P Wiki AI system...")
    # Check if vector store has been populated
    chroma_path = settings.chroma_persist_dir
    if not chroma_path.exists() or not any(chroma_path.iterdir()):
        print("WARNING: Vector store not initialized. Run 'python -m src.parser' and 'python -m src.embeddings' first.")
    else:
        vector_store = WikiVectorStore()
        rag_system = WikiRAG(vector_store)
        ingress_pipeline = IngressPipeline(vector_store)
        print(f"Loaded vector store with {vector_store.get_stats()['total_chunks']} chunks")
    yield
    print("Shutting down...")
 app = FastAPI(
    title="P2P Wiki AI",
    description="AI-augmented system for P2P Foundation Wiki - chat agent and ingress pipeline",
    version="0.1.0",
    lifespan=lifespan,
 )
 # CORS middleware
 app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Configure appropriately for production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
 )
 # --- Request/Response Models ---
 class ChatRequest(BaseModel):
    """Chat request model."""
    query: str
    n_results: int = 5
    filter_categories: Optional[list[str]] = None
 class ChatResponse(BaseModel):
    """Chat response model."""
    answer: str
    sources: list[dict]
    query: str
 class IngressRequest(BaseModel):
    """Ingress request model."""
    url: HttpUrl
 class IngressResponse(BaseModel):
    """Ingress response model."""
    status: str
    message: str
    scraped_title: Optional[str] = None
    topics_found: int = 0
    wiki_matches: int = 0
    drafts_generated: int = 0
    queue_file: Optional[str] = None
 class ReviewActionRequest(BaseModel):
    """Review action request model."""
    filepath: str
    item_type: str  # "match" or "draft"
    item_index: int
    action: str  # "approve" or "reject"
 # --- API Endpoints ---
@app.get("/")
 async def root():
    """Root endpoint."""
    return {
        "name": "P2P Wiki AI",
        "version": "0.1.0",
        "status": "running",
        "vector_store_ready": vector_store is not None,
    }
@app.get("/health")
 async def health():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "vector_store_ready": vector_store is not None,
    }
@app.get("/stats")
 async def stats():
    """Get system statistics."""
    if not vector_store:
        return {"error": "Vector store not initialized"}
    return {
        "vector_store": vector_store.get_stats(),
        "review_queue_count": len(get_review_queue()),
    }
 # --- Chat Endpoints ---
@app.post("/chat", response_model=ChatResponse)
 async def chat(request: ChatRequest):
    """Chat with the wiki knowledge base."""
    if not rag_system:
        raise HTTPException(
            status_code=503,
            detail="RAG system not initialized. Run indexing first.",
        )
    response = await rag_system.ask(
        query=request.query,
        n_results=request.n_results,
        filter_categories=request.filter_categories,
    )
    return ChatResponse(
        answer=response.answer,
        sources=response.sources,
        query=response.query,
    )
@app.post("/chat/clear")
 async def clear_chat():
    """Clear chat history."""
    if rag_system:
        rag_system.clear_history()
    return {"status": "cleared"}
@app.get("/chat/suggestions")
 async def chat_suggestions(q: str = ""):
    """Get article title suggestions for autocomplete."""
    if not rag_system or not q:
        return {"suggestions": []}
    suggestions = rag_system.get_suggestions(q)
    return {"suggestions": suggestions}
 # --- Ingress Endpoints ---
@app.post("/ingress", response_model=IngressResponse)
 async def ingress(request: IngressRequest, background_tasks: BackgroundTasks):
    """
    Process an external article URL through the ingress pipeline.
    This scrapes the article, analyzes it for wiki relevance,
    finds matching existing articles, and generates draft articles.
    """
    if not ingress_pipeline:
        raise HTTPException(
            status_code=503,
            detail="Ingress pipeline not initialized. Run indexing first.",
        )
    try:
        result = await ingress_pipeline.process(str(request.url))
        return IngressResponse(
            status="success",
            message="Article processed successfully",
            scraped_title=result.scraped.title,
            topics_found=len(result.analysis.get("main_topics", [])),
            wiki_matches=len(result.wiki_matches),
            drafts_generated=len(result.draft_articles),
            queue_file=result.timestamp,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
 # --- Review Queue Endpoints ---
@app.get("/review")
 async def get_review_items():
    """Get all items in the review queue."""
    items = get_review_queue()
    return {"count": len(items), "items": items}
@app.get("/review/{filename}")
 async def get_review_item(filename: str):
    """Get a specific review item."""
    import json
    # Resolve the path and make sure it stays inside the review queue directory
    queue_dir = settings.review_queue_dir.resolve()
    filepath = (queue_dir / filename).resolve()
    if filepath.parent != queue_dir or not filepath.is_file():
        raise HTTPException(status_code=404, detail="Review item not found")
    with open(filepath, "r", encoding="utf-8") as f:
        return json.load(f)
@app.post("/review/action")
 async def review_action(request: ReviewActionRequest):
    """Approve or reject a review item."""
    if request.action == "approve":
        success = approve_item(request.filepath, request.item_type, request.item_index)
    elif request.action == "reject":
        success = reject_item(request.filepath, request.item_type, request.item_index)
    else:
        raise HTTPException(status_code=400, detail="Invalid action")
    if success:
        return {"status": "success", "action": request.action}
    else:
        raise HTTPException(status_code=500, detail="Action failed")
 # --- Search Endpoints ---
@app.get("/search")
 async def search(q: str, n: int = 10, categories: Optional[str] = None):
    """Direct search of the vector store."""
    if not vector_store:
        raise HTTPException(status_code=503, detail="Vector store not initialized")
    filter_cats = categories.split(",") if categories else None
    results = vector_store.search(q, n_results=n, filter_categories=filter_cats)
    return {"query": q, "count": len(results), "results": results}
@app.get("/articles")
 async def list_articles(limit: int = 100, offset: int = 0):
    """List article titles."""
    if not vector_store:
        raise HTTPException(status_code=503, detail="Vector store not initialized")
    titles = vector_store.get_article_titles()
    return {
        "total": len(titles),
        "limit": limit,
        "offset": offset,
        "titles": titles[offset : offset + limit],
    }
 # --- Static Files (Web UI) ---
 web_dir = Path(__file__).parent.parent / "web"
 if web_dir.exists():
    app.mount("/static", StaticFiles(directory=str(web_dir)), name="static")
    @app.get("/ui")
    async def ui():
        """Serve the web UI."""
        index_path = web_dir / "index.html"
        if index_path.exists():
            return FileResponse(index_path)
        raise HTTPException(status_code=404, detail="Web UI not found")
 def main():
    """Run the API server."""
    import uvicorn
    uvicorn.run(
        "src.api:app",
        host=settings.host,
        port=settings.port,
        reload=True,
    )
 if __name__ == "__main__":
    main()
--- a/src/config.py
+++ b/src/config.py
@ -0,0 +1,51 @@
 """Configuration settings for P2P Wiki AI system."""
 from pathlib import Path
 from pydantic_settings import BaseSettings
 class Settings(BaseSettings):
    """Application settings loaded from environment variables."""
    # Paths
    project_root: Path = Path(__file__).parent.parent
    data_dir: Path = project_root / "data"
    xmldump_dir: Path = project_root / "xmldump"
    # Vector store
    chroma_persist_dir: Path = data_dir / "chroma"
    embedding_model: str = "all-MiniLM-L6-v2"  # Fast, good quality
    # Ollama (local LLM)
    ollama_base_url: str = "http://localhost:11434"
    ollama_model: str = "llama3.2"  # Default model for local inference
    # Claude API (for complex tasks)
    anthropic_api_key: str = ""
    claude_model: str = "claude-sonnet-4-20250514"
    # Hybrid routing thresholds
    use_claude_for_drafts: bool = True  # Use Claude for article drafting
    use_ollama_for_chat: bool = True  # Use Ollama for simple Q&A
    # MediaWiki
    mediawiki_api_url: str = ""  # Set if you have a live wiki API
    # Server
    host: str = "0.0.0.0"
    port: int = 8420
    # Review queue
    review_queue_dir: Path = data_dir / "review_queue"
    # Pydantic v2 style config (class-based Config is deprecated)
    model_config = {"env_file": ".env", "env_file_encoding": "utf-8"}
 settings = Settings()
 # Ensure directories exist
 settings.data_dir.mkdir(parents=True, exist_ok=True)
 settings.chroma_persist_dir.mkdir(parents=True, exist_ok=True)
 settings.review_queue_dir.mkdir(parents=True, exist_ok=True)
--- a/src/embeddings.py
+++ b/src/embeddings.py
@ -0,0 +1,256 @@
 """Vector store setup and embedding generation using ChromaDB."""
 import json
 from pathlib import Path
 from typing import Optional
 import chromadb
 from chromadb.config import Settings as ChromaSettings
 from rich.console import Console
 from rich.progress import Progress
 from sentence_transformers import SentenceTransformer
 from .config import settings
 from .parser import WikiArticle
 console = Console()
 # Chunk size for embedding (in characters)
 CHUNK_SIZE = 1000
 CHUNK_OVERLAP = 200
 class WikiVectorStore:
    """Vector store for wiki articles using ChromaDB."""
    def __init__(self, persist_dir: Optional[Path] = None):
        self.persist_dir = persist_dir or settings.chroma_persist_dir
        # Initialize ChromaDB
        self.client = chromadb.PersistentClient(
            path=str(self.persist_dir),
            settings=ChromaSettings(anonymized_telemetry=False),
        )
        # Create or get collection
        self.collection = self.client.get_or_create_collection(
            name="wiki_articles",
            metadata={"hnsw:space": "cosine"},
        )
        # Load embedding model
        console.print(f"[cyan]Loading embedding model: {settings.embedding_model}[/cyan]")
        self.model = SentenceTransformer(settings.embedding_model)
        console.print("[green]Model loaded[/green]")
    def _chunk_text(self, text: str, title: str) -> list[tuple[str, dict]]:
        """Split text into overlapping chunks with metadata."""
        if len(text) <= CHUNK_SIZE:
            # Prepend title for context, matching the multi-chunk path
            return [(f"{title}\n\n{text}", {"chunk_index": 0, "total_chunks": 1})]
        chunks = []
        start = 0
        chunk_index = 0
        while start < len(text):
            end = start + CHUNK_SIZE
            # Try to break at sentence boundary
            if end < len(text):
                # Look for sentence end within last 100 chars
                for i in range(min(100, end - start)):
                    if text[end - i] in ".!?\n":
                        end = end - i + 1
                        break
            chunk_text = text[start:end].strip()
            if chunk_text:
                # Prepend title for context
                chunk_with_title = f"{title}\n\n{chunk_text}"
                chunks.append(
                    (chunk_with_title, {"chunk_index": chunk_index, "total_chunks": -1})
                )
                chunk_index += 1
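            # Stop once the end of the text is reached; stepping back by the
            # overlap here would emit a redundant tail chunk.
            if end >= len(text):
                break
            start = end - CHUNK_OVERLAP
        # Update total_chunks
        for _, meta in chunks:
            meta["total_chunks"] = len(chunks)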
        return chunks
    def get_embedded_article_ids(self) -> set:
        """Get set of article IDs that are already embedded."""
        results = self.collection.get(include=["metadatas"])
        article_ids = set()
        for meta in results["metadatas"]:
            if meta and "article_id" in meta:
                article_ids.add(meta["article_id"])
        return article_ids
    def add_articles(self, articles: list[WikiArticle], batch_size: int = 100, resume: bool = True):
        """Add articles to the vector store."""
        console.print(f"[cyan]Processing {len(articles)} articles...[/cyan]")
        # Check for already embedded articles if resuming
        if resume:
            embedded_ids = self.get_embedded_article_ids()
            original_count = len(articles)
            articles = [a for a in articles if a.id not in embedded_ids]
            skipped = original_count - len(articles)
            if skipped > 0:
                console.print(f"[yellow]Skipping {skipped} already-embedded articles[/yellow]")
            if not articles:
                console.print("[green]All articles already embedded![/green]")
                return
        all_chunks = []
        all_ids = []
        all_metadatas = []
        with Progress() as progress:
            task = progress.add_task("[cyan]Chunking articles...", total=len(articles))
            for article in articles:
                if not article.plain_text:
                    progress.advance(task)
                    continue
                chunks = self._chunk_text(article.plain_text, article.title)
                for chunk_text, chunk_meta in chunks:
                    chunk_id = f"{article.id}_{chunk_meta['chunk_index']}"
                    metadata = {
                        "article_id": article.id,
                        "title": article.title,
                        "categories": ",".join(article.categories[:10]),  # Limit categories
                        "timestamp": article.timestamp,
                        "chunk_index": chunk_meta["chunk_index"],
                        "total_chunks": chunk_meta["total_chunks"],
                    }
                    all_chunks.append(chunk_text)
                    all_ids.append(chunk_id)
                    all_metadatas.append(metadata)
                progress.advance(task)
        console.print(f"[cyan]Created {len(all_chunks)} chunks from {len(articles)} articles[/cyan]")
        # Generate embeddings and add in batches
        console.print("[cyan]Generating embeddings and adding to vector store...[/cyan]")
        with Progress() as progress:
            task = progress.add_task(
                "[cyan]Embedding and storing...",
                total=(len(all_chunks) + batch_size - 1) // batch_size,  # ceil division
            )
            for i in range(0, len(all_chunks), batch_size):
                batch_chunks = all_chunks[i : i + batch_size]
                batch_ids = all_ids[i : i + batch_size]
                batch_metadatas = all_metadatas[i : i + batch_size]
                # Generate embeddings
                embeddings = self.model.encode(batch_chunks, show_progress_bar=False)
                # Add to collection
                self.collection.add(
                    ids=batch_ids,
                    embeddings=embeddings.tolist(),
                    documents=batch_chunks,
                    metadatas=batch_metadatas,
                )
                progress.advance(task)
        console.print(f"[green]Added {len(all_chunks)} chunks to vector store[/green]")
    def search(
        self,
        query: str,
        n_results: int = 5,
        filter_categories: Optional[list[str]] = None,
    ) -> list[dict]:
        """Search for relevant chunks."""
        query_embedding = self.model.encode([query])[0]
        where_filter = None
        if filter_categories:
            # ChromaDB where filter for categories
            where_filter = {
                "$or": [{"categories": {"$contains": cat}} for cat in filter_categories]
            }
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results,
            where=where_filter,
            include=["documents", "metadatas", "distances"],
        )
        # Format results
        formatted = []
        if results["documents"] and results["documents"][0]:
            for i, doc in enumerate(results["documents"][0]):
                formatted.append(
                    {
                        "content": doc,
                        "metadata": results["metadatas"][0][i],
                        "distance": results["distances"][0][i],
                    }
                )
        return formatted
    def get_article_titles(self) -> list[str]:
        """Get all unique article titles in the store."""
        # Get all metadata
        results = self.collection.get(include=["metadatas"])
        titles = set()
        for meta in results["metadatas"]:
            if meta and "title" in meta:
                titles.add(meta["title"])
        return sorted(titles)
    def get_stats(self) -> dict:
        """Get statistics about the vector store."""
        count = self.collection.count()
        # Get sample of metadatas to count unique articles
        sample = self.collection.get(limit=10000, include=["metadatas"])
        unique_articles = len(set(m["article_id"] for m in sample["metadatas"] if m))
        return {
            "total_chunks": count,
            "unique_articles_sampled": unique_articles,
            "persist_dir": str(self.persist_dir),
        }
 def main():
    """CLI entry point for generating embeddings."""
    articles_path = settings.data_dir / "articles.json"
    if not articles_path.exists():
        console.print(f"[red]Articles file not found: {articles_path}[/red]")
        console.print("[yellow]Run 'python -m src.parser' first to parse XML dumps[/yellow]")
        return
    console.print(f"[cyan]Loading articles from {articles_path}...[/cyan]")
    with open(articles_path, "r", encoding="utf-8") as f:
        articles_data = json.load(f)
    articles = [WikiArticle(**a) for a in articles_data]
    console.print(f"[green]Loaded {len(articles)} articles[/green]")
    store = WikiVectorStore()
    store.add_articles(articles)
    stats = store.get_stats()
    console.print(f"[green]Vector store stats: {stats}[/green]")
 if __name__ == "__main__":
    main()
--- a/src/ingress.py
+++ b/src/ingress.py
@ -0,0 +1,467 @@
 """Article ingress pipeline - scrape, analyze, and draft wiki content."""
 import json
 import re
 from dataclasses import dataclass, field, asdict
 from datetime import datetime
 from pathlib import Path
 from typing import Optional
 from urllib.parse import urlparse
 import httpx
 import trafilatura
 from bs4 import BeautifulSoup
 from rich.console import Console
 from .config import settings
 from .embeddings import WikiVectorStore
 from .llm import llm_client
 console = Console()
@dataclass
 class ScrapedArticle:
    """Represents a scraped external article."""
    url: str
    title: str
    content: str
    author: Optional[str] = None
    date: Optional[str] = None
    domain: str = ""
    word_count: int = 0
    def __post_init__(self):
        if not self.domain:
            self.domain = urlparse(self.url).netloc
        if not self.word_count:
            self.word_count = len(self.content.split())
@dataclass
 class WikiMatch:
    """A matching wiki article for citation."""
    title: str
    article_id: int
    relevance_score: float
    categories: list[str]
    suggested_citation: str  # How to cite the scraped article in this wiki page
@dataclass
 class DraftArticle:
    """A draft wiki article generated from scraped content."""
    title: str
    content: str  # MediaWiki formatted content
    categories: list[str]
    source_url: str
    source_title: str
    summary: str
    related_articles: list[str]  # Existing wiki articles to link to
@dataclass
 class IngressResult:
    """Result of the ingress pipeline."""
    scraped: ScrapedArticle
    analysis: dict  # Topic analysis results
    wiki_matches: list[WikiMatch]  # Existing articles to update with citations
    draft_articles: list[DraftArticle]  # New articles to create
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
    def to_dict(self) -> dict:
        return {
            "scraped": asdict(self.scraped),
            "analysis": self.analysis,
            "wiki_matches": [asdict(m) for m in self.wiki_matches],
            "draft_articles": [asdict(d) for d in self.draft_articles],
            "timestamp": self.timestamp,
        }
 class ArticleScraper:
    """Scrapes and extracts content from URLs."""
    async def scrape(self, url: str) -> ScrapedArticle:
        """Scrape article content from URL."""
        console.print(f"[cyan]Scraping: {url}[/cyan]")
        async with httpx.AsyncClient(
            timeout=30.0,
            follow_redirects=True,
            headers={
                "User-Agent": "Mozilla/5.0 (compatible; P2PWikiBot/1.0; +http://p2pfoundation.net)"
            },
        ) as client:
            response = await client.get(url)
            response.raise_for_status()
            html = response.text
        # Use trafilatura for main content extraction
        content = trafilatura.extract(
            html,
            include_comments=False,
            include_tables=True,
            no_fallback=False,
        )
        if not content:
            # Fallback to BeautifulSoup
            soup = BeautifulSoup(html, "html.parser")
            # Remove script and style elements
            for element in soup(["script", "style", "nav", "footer", "header"]):
                element.decompose()
            content = soup.get_text(separator="\n", strip=True)
        # Extract metadata
        soup = BeautifulSoup(html, "html.parser")
        title = ""
        title_tag = soup.find("title")
        if title_tag:
            title = title_tag.get_text(strip=True)
        # Try og:title
        og_title = soup.find("meta", property="og:title")
        if og_title and og_title.get("content"):
            title = og_title["content"]
        author = None
        author_meta = soup.find("meta", attrs={"name": "author"})
        if author_meta and author_meta.get("content"):
            author = author_meta["content"]
        date = None
        date_meta = soup.find("meta", attrs={"name": "date"}) or soup.find(
            "meta", property="article:published_time"
        )
        if date_meta and date_meta.get("content"):
            date = date_meta["content"]
        return ScrapedArticle(
            url=url,
            title=title,
            content=content or "",
            author=author,
            date=date,
        )
 class ContentAnalyzer:
    """Analyzes scraped content for wiki relevance."""
    def __init__(self, vector_store: Optional[WikiVectorStore] = None):
        self.vector_store = vector_store or WikiVectorStore()
    async def analyze(self, article: ScrapedArticle) -> dict:
        """Analyze article for topics, concepts, and wiki relevance."""
        # Truncate very long articles for analysis
        content_for_analysis = article.content[:8000]
        analysis_prompt = f"""Analyze this article for potential wiki content about peer-to-peer culture, commons, alternative economics, and collaborative governance.
 Article Title: {article.title}
 Source: {article.domain}
 Article Content:
 {content_for_analysis}
 Please provide your analysis in the following JSON format:
 {{
    "main_topics": ["topic1", "topic2"],
    "key_concepts": ["concept1", "concept2"],
    "relevant_categories": ["category1", "category2"],
    "summary": "2-3 sentence summary",
    "wiki_relevance_score": 0.0-1.0,
    "suggested_article_titles": ["Title 1", "Title 2"],
    "key_quotes": ["notable quote 1", "notable quote 2"],
    "mentioned_organizations": ["org1", "org2"],
    "mentioned_people": ["person1", "person2"]
 }}
 Focus on topics relevant to:
 - Peer-to-peer networks and culture
 - Commons-based peer production
 - Alternative economics and post-capitalism
 - Cooperative business models
 - Open source / free culture
 - Collaborative governance
 - Sustainability and ecology"""
        # Reuse the truncated text computed above rather than slicing again
        response = await llm_client.analyze(
            content=content_for_analysis,
            task=analysis_prompt,
            temperature=0.3,
        )
        # Parse JSON from response
        try:
            # Find JSON in response
            json_match = re.search(r"\{[\s\S]*\}", response)
            if json_match:
                analysis = json.loads(json_match.group())
            else:
                analysis = {"error": "Could not parse analysis", "raw": response}
        except json.JSONDecodeError:
            analysis = {"error": "Invalid JSON in analysis", "raw": response}
        return analysis
    async def find_wiki_matches(
        self, article: ScrapedArticle, analysis: dict, n_results: int = 10
    ) -> list[WikiMatch]:
        """Find existing wiki articles that could cite this content."""
        matches = []
        # Search using main topics and concepts
        search_terms = analysis.get("main_topics", []) + analysis.get("key_concepts", [])
        for term in search_terms[:5]:  # Limit searches
            results = self.vector_store.search(term, n_results=3)
            for result in results:
                title = result["metadata"].get("title", "Unknown")
                article_id = result["metadata"].get("article_id", 0)
                distance = result.get("distance", 1.0)
                # Skip if already added
                if any(m.title == title for m in matches):
                    continue
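                # Calculate relevance (lower distance = higher relevance)
                relevance = max(0.0, 1.0 - distance)
                if relevance > 0.3:  # Threshold for relevance
                    # Categories are stored as a comma-separated string;
                    # drop empty entries so "" does not become [""]
                    categories = [
                        c.strip()
                        for c in result["metadata"].get("categories", "").split(",")
                        if c.strip()
                    ]
                    matches.append(
                        WikiMatch(
                            title=title,
                            article_id=article_id,
                            relevance_score=relevance,
                            categories=categories,
                            # MediaWiki external-link syntax: [URL label]
                            suggested_citation=f"See also: [{article.url} {article.title}]",
                        )
                    )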
        # Sort by relevance and limit
        matches.sort(key=lambda m: m.relevance_score, reverse=True)
        return matches[:n_results]
 class DraftGenerator:
    """Generates draft wiki articles from scraped content."""
    def __init__(self, vector_store: Optional[WikiVectorStore] = None):
        self.vector_store = vector_store or WikiVectorStore()
    async def generate_drafts(
        self,
        article: ScrapedArticle,
        analysis: dict,
        max_drafts: int = 3,
    ) -> list[DraftArticle]:
        """Generate draft wiki articles based on scraped content."""
        drafts = []
        suggested_titles = analysis.get("suggested_article_titles", [])
        if not suggested_titles:
            return drafts
        for title in suggested_titles[:max_drafts]:
            # Check if article already exists
            existing = self.vector_store.search(title, n_results=1)
            if existing and existing[0].get("distance", 1.0) < 0.1:
                console.print(f"[yellow]Skipping '{title}' - similar article exists[/yellow]")
                continue
            draft = await self._generate_single_draft(article, analysis, title)
            if draft:
                drafts.append(draft)
        return drafts
    async def _generate_single_draft(
        self,
        article: ScrapedArticle,
        analysis: dict,
        title: str,
    ) -> Optional[DraftArticle]:
        """Generate a single draft article."""
        # Find related existing articles
        related_search = self.vector_store.search(title, n_results=5)
        related_titles = [
            r["metadata"].get("title", "")
            for r in related_search
            if r.get("distance", 1.0) < 0.5
        ]
        categories = analysis.get("relevant_categories", [])
        summary = analysis.get("summary", "")
        draft_prompt = f"""Create a MediaWiki-formatted article for the P2P Foundation Wiki.
 Article Title: {title}
 Source Material:
 Title: {article.title}
 URL: {article.url}
 Summary: {summary}
 Key concepts to cover: {', '.join(analysis.get('key_concepts', []))}
 Related existing wiki articles: {', '.join(related_titles)}
 Categories to include: {', '.join(categories)}
 Please write the wiki article in MediaWiki markup format with:
 1. An introduction/definition section
 2. A "Description" section with key information
 3. Links to related wiki articles using [[Article Name]] format
 4. A "Sources" section citing the original article
 5. Category tags at the end using [[Category:Name]] format
 The article should:
 - Be encyclopedic and neutral in tone
 - Focus on the P2P/commons aspects of the topic
 - Be approximately 300-500 words
 - Include internal wiki links to related concepts"""
        content = await llm_client.generate_draft(
            draft_prompt,
            system="You are a wiki editor for the P2P Foundation Wiki. Write clear, encyclopedic articles in MediaWiki markup format.",
            temperature=0.5,
        )
        return DraftArticle(
            title=title,
            content=content,
            categories=categories,
            source_url=article.url,
            source_title=article.title,
            summary=summary,
            related_articles=related_titles,
        )
 class IngressPipeline:
    """Complete ingress pipeline for processing external articles."""
    def __init__(self, vector_store: Optional[WikiVectorStore] = None):
        self.vector_store = vector_store or WikiVectorStore()
        self.scraper = ArticleScraper()
        self.analyzer = ContentAnalyzer(self.vector_store)
        self.generator = DraftGenerator(self.vector_store)
    async def process(self, url: str) -> IngressResult:
        """Process a URL through the complete ingress pipeline."""
        console.print(f"[bold cyan]Processing: {url}[/bold cyan]")
        # Step 1: Scrape
        console.print("[cyan]Step 1/4: Scraping article...[/cyan]")
        scraped = await self.scraper.scrape(url)
        console.print(f"[green]Scraped: {scraped.title} ({scraped.word_count} words)[/green]")
        # Step 2: Analyze
        console.print("[cyan]Step 2/4: Analyzing content...[/cyan]")
        analysis = await self.analyzer.analyze(scraped)
        console.print(f"[green]Found {len(analysis.get('main_topics', []))} main topics[/green]")
        # Step 3: Find wiki matches
        console.print("[cyan]Step 3/4: Finding wiki matches...[/cyan]")
        matches = await self.analyzer.find_wiki_matches(scraped, analysis)
        console.print(f"[green]Found {len(matches)} potential wiki matches[/green]")
        # Step 4: Generate drafts
        console.print("[cyan]Step 4/4: Generating draft articles...[/cyan]")
        drafts = await self.generator.generate_drafts(scraped, analysis)
        console.print(f"[green]Generated {len(drafts)} draft articles[/green]")
        result = IngressResult(
            scraped=scraped,
            analysis=analysis,
            wiki_matches=matches,
            draft_articles=drafts,
        )
        # Save to review queue
        self._save_to_review_queue(result)
        return result
    def _save_to_review_queue(self, result: IngressResult):
        """Save ingress result to the review queue."""
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        # netloc may contain characters such as ":" (port) that are unsafe in filenames
        domain = re.sub(r"[^A-Za-z0-9_-]", "_", result.scraped.domain)
        filename = f"{timestamp}_{domain}.json"
        # Make sure the queue directory exists before writing
        settings.review_queue_dir.mkdir(parents=True, exist_ok=True)
        filepath = settings.review_queue_dir / filename
        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(result.to_dict(), f, indent=2, ensure_ascii=False)
        console.print(f"[green]Saved to review queue: {filepath}[/green]")
 def get_review_queue() -> list[dict]:
    """Get all items in the review queue."""
    queue_files = sorted(settings.review_queue_dir.glob("*.json"), reverse=True)
    items = []
    for filepath in queue_files:
        with open(filepath, "r", encoding="utf-8") as f:
            data = json.load(f)
            data["_filepath"] = str(filepath)
            items.append(data)
    return items
 def approve_item(filepath: str, item_type: str, item_index: int) -> bool:
    """
    Approve an item from the review queue.
    Args:
        filepath: Path to the review queue JSON file
        item_type: "match" or "draft"
        item_index: Index of the item to approve
    Returns:
        True if successful
    """
    # For now, just mark as approved in the file
    # In production, this would push to MediaWiki API
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Map the item type to its list in the queue file
    key = {"match": "wiki_matches", "draft": "draft_articles"}.get(item_type)
    items = data.get(key, []) if key else []
    # Unknown types and out-of-range (including negative) indices are not approved
    if not 0 <= item_index < len(items):
        return False
    items[item_index]["approved"] = True
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    return True
 def reject_item(filepath: str, item_type: str, item_index: int) -> bool:
    """Reject an item from the review queue."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Map the item type to its list in the queue file
    key = {"match": "wiki_matches", "draft": "draft_articles"}.get(item_type)
    items = data.get(key, []) if key else []
    # Unknown types and out-of-range (including negative) indices are not rejected
    if not 0 <= item_index < len(items):
        return False
    items[item_index]["rejected"] = True
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    return True
--- a/src/llm.py
+++ b/src/llm.py
@ -0,0 +1,153 @@
 """LLM client with hybrid routing between Ollama and Claude."""
 from typing import AsyncIterator, Optional
 import httpx
 from anthropic import AsyncAnthropic
 from tenacity import retry, stop_after_attempt, wait_exponential
 from .config import settings
 class LLMClient:
    """Unified LLM client with hybrid routing."""
    def __init__(self):
        self.ollama_url = settings.ollama_base_url
        self.ollama_model = settings.ollama_model
        # Initialize Claude client if API key is set.
        # The async client is used so API calls do not block the event loop.
        self.claude_client = None
        if settings.anthropic_api_key:
            self.claude_client = AsyncAnthropic(api_key=settings.anthropic_api_key)
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
    async def _call_ollama(
        self,
        prompt: str,
        system: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: int = 2048,
    ) -> str:
        """Call Ollama API."""
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})
        async with httpx.AsyncClient(timeout=120.0) as client:
            response = await client.post(
                f"{self.ollama_url}/api/chat",
                json={
                    "model": self.ollama_model,
                    "messages": messages,
                    "stream": False,
                    "options": {
                        "temperature": temperature,
                        "num_predict": max_tokens,
                    },
                },
            )
            response.raise_for_status()
            data = response.json()
            return data["message"]["content"]
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
    async def _call_claude(
        self,
        prompt: str,
        system: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: int = 4096,
    ) -> str:
        """Call Claude API."""
        if not self.claude_client:
            raise ValueError("Claude API key not configured")
        request = {
            "model": settings.claude_model,
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
        }
        # Only send a system prompt when one was provided
        if system:
            request["system"] = system
        message = await self.claude_client.messages.create(**request)
        return message.content[0].text
    async def chat(
        self,
        prompt: str,
        system: Optional[str] = None,
        use_claude: bool = False,
        temperature: float = 0.7,
        max_tokens: int = 2048,
    ) -> str:
        """
        Chat with LLM using hybrid routing.
        Args:
            prompt: User prompt
            system: System prompt
            use_claude: Prefer the Claude API when it is configured (otherwise uses Ollama)
            temperature: Sampling temperature
            max_tokens: Max response tokens
        Returns:
            LLM response text
        """
        if use_claude and self.claude_client:
            return await self._call_claude(prompt, system, temperature, max_tokens)
        else:
            return await self._call_ollama(prompt, system, temperature, max_tokens)
    async def generate_draft(
        self,
        prompt: str,
        system: Optional[str] = None,
        temperature: float = 0.5,
    ) -> str:
        """
        Generate article draft - uses Claude for higher quality.
        Args:
            prompt: Prompt describing what to generate
            system: System prompt for context
            temperature: Lower for more factual output
        Returns:
            Generated draft text
        """
        # Use Claude for drafts if configured, otherwise fall back to Ollama
        use_claude = settings.use_claude_for_drafts and self.claude_client is not None
        return await self.chat(
            prompt, system, use_claude=use_claude, temperature=temperature, max_tokens=4096
        )
    async def analyze(
        self,
        content: str,
        task: str,
        temperature: float = 0.3,
    ) -> str:
        """
        Analyze content for a specific task - uses Claude when configured, otherwise Ollama.
        Args:
            content: Content to analyze
            task: Description of analysis task
            temperature: Lower for more deterministic output
        Returns:
            Analysis result
        """
        prompt = f"""Task: {task}
 Content to analyze:
 {content}
 Provide your analysis:"""
        use_claude = self.claude_client is not None
        return await self.chat(prompt, use_claude=use_claude, temperature=temperature)
 # Singleton instance
 llm_client = LLMClient()
--- a/src/parser.py
+++ b/src/parser.py
@ -0,0 +1,267 @@
 """MediaWiki XML dump parser - converts to structured JSON."""
 import json
 import re
 from dataclasses import dataclass, field, asdict
 from pathlib import Path
 from typing import Iterator
 from lxml import etree
 from rich.progress import Progress
 from rich.console import Console
 from .config import settings
 console = Console()
 # MediaWiki namespace
 MW_NS = {"mw": "http://www.mediawiki.org/xml/export-0.6/"}
@dataclass
 class WikiArticle:
    """Represents a parsed wiki article."""
    id: int
    title: str
    content: str  # Raw wikitext
    plain_text: str  # Cleaned plain text for embedding
    categories: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)  # Internal wiki links
    external_links: list[str] = field(default_factory=list)
    timestamp: str = ""
    contributor: str = ""
    def to_dict(self) -> dict:
        return asdict(self)
 def clean_wikitext(text: str) -> str:
    """Convert MediaWiki markup to plain text for embedding."""
    if not text:
        return ""
    # Remove templates {{...}}
    text = re.sub(r"\{\{[^}]+\}\}", "", text)
    # Remove categories [[Category:...]]
    text = re.sub(r"\[\[Category:[^\]]+\]\]", "", text, flags=re.IGNORECASE)
    # Convert wiki links [[Page|Display]] or [[Page]] to just the display text
    text = re.sub(r"\[\[([^|\]]+)\|([^\]]+)\]\]", r"\2", text)
    text = re.sub(r"\[\[([^\]]+)\]\]", r"\1", text)
    # Remove external links [url text] -> text
    text = re.sub(r"\[https?://[^\s\]]+ ([^\]]+)\]", r"\1", text)
    text = re.sub(r"\[https?://[^\]]+\]", "", text)
    # Remove wiki formatting
    text = re.sub(r"'''?([^']+)'''?", r"\1", text)  # Bold/italic
    text = re.sub(r"={2,}([^=]+)={2,}", r"\1", text)  # Headers
    text = re.sub(r"^[*#:;]+", "", text, flags=re.MULTILINE)  # List markers
    # Remove HTML tags
    text = re.sub(r"<[^>]+>", "", text)
    # Clean up whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r" {2,}", " ", text)
    return text.strip()
 def extract_categories(text: str) -> list[str]:
    """Extract category names from wikitext."""
    pattern = r"\[\[Category:([^\]|]+)"
    return list(set(re.findall(pattern, text, re.IGNORECASE)))
 def extract_wiki_links(text: str) -> list[str]:
    """Extract internal wiki links from wikitext."""
    # Match [[Page]] or [[Page|Display]]
    pattern = r"\[\[([^|\]]+)"
    links = re.findall(pattern, text)
    # Filter out categories and files
    return list(
        set(
            link.strip()
            for link in links
            if not link.lower().startswith(("category:", "file:", "image:"))
        )
    )
 def extract_external_links(text: str) -> list[str]:
    """Extract external URLs from wikitext."""
    pattern = r"https?://[^\s\]\)\"']+"
    return list(set(re.findall(pattern, text)))
 def parse_xml_file(xml_path: Path) -> Iterator[WikiArticle]:
    """Parse a MediaWiki XML dump file and yield articles."""
    context = etree.iterparse(
        str(xml_path), events=("end",), tag="{http://www.mediawiki.org/xml/export-0.6/}page"
    )
    for event, page in context:
        # Get basic info
        title_elem = page.find("mw:title", MW_NS)
        id_elem = page.find("mw:id", MW_NS)
        ns_elem = page.find("mw:ns", MW_NS)
        # Skip non-main namespace pages (talk, user, etc.)
        if ns_elem is not None and ns_elem.text != "0":
            page.clear()
            continue
        title = title_elem.text if title_elem is not None else ""
        page_id = int(id_elem.text) if id_elem is not None else 0
        # Get latest revision (dumps list revisions oldest-first, so take the last)
        revisions = page.findall("mw:revision", MW_NS)
        if not revisions:
            page.clear()
            continue
        revision = revisions[-1]
        text_elem = revision.find("mw:text", MW_NS)
        timestamp_elem = revision.find("mw:timestamp", MW_NS)
        contributor = revision.find("mw:contributor", MW_NS)
        content = text_elem.text if text_elem is not None else ""
        timestamp = timestamp_elem.text if timestamp_elem is not None else ""
        contributor_name = ""
        if contributor is not None:
            username = contributor.find("mw:username", MW_NS)
            if username is not None:
                contributor_name = username.text or ""
        # Skip redirects and empty pages (tolerate leading whitespace before #REDIRECT)
        if not content or content.lstrip().lower().startswith("#redirect"):
            page.clear()
            continue
        article = WikiArticle(
            id=page_id,
            title=title,
            content=content,
            plain_text=clean_wikitext(content),
            categories=extract_categories(content),
            links=extract_wiki_links(content),
            external_links=extract_external_links(content),
            timestamp=timestamp,
            contributor=contributor_name,
        )
        # Clear element to free memory
        page.clear()
        yield article
 def parse_all_dumps(output_path: Path | None = None) -> list[WikiArticle]:
    """Parse all XML dump files and optionally save to JSON."""
    xml_files = sorted(settings.xmldump_dir.glob("*.xml"))
    if not xml_files:
        console.print(f"[red]No XML files found in {settings.xmldump_dir}[/red]")
        return []
    console.print(f"[green]Found {len(xml_files)} XML files to parse[/green]")
    all_articles = []
    with Progress() as progress:
        task = progress.add_task("[cyan]Parsing XML files...", total=len(xml_files))
        for xml_file in xml_files:
            progress.update(task, description=f"[cyan]Parsing {xml_file.name}...")
            for article in parse_xml_file(xml_file):
                all_articles.append(article)
            progress.advance(task)
    console.print(f"[green]Parsed {len(all_articles)} articles[/green]")
    if output_path:
        console.print(f"[cyan]Saving to {output_path}...[/cyan]")
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump([a.to_dict() for a in all_articles], f, ensure_ascii=False, indent=2)
        console.print(f"[green]Saved {len(all_articles)} articles to {output_path}[/green]")
    return all_articles
 def parse_mediawiki_files(articles_dir: Path, output_path: Path | None = None) -> list[WikiArticle]:
    """Parse individual .mediawiki files from a directory (Codeberg format)."""
    # Sort so the enumerate()-based article IDs below are stable across runs
    mediawiki_files = sorted(articles_dir.glob("*.mediawiki"))
    if not mediawiki_files:
        console.print(f"[red]No .mediawiki files found in {articles_dir}[/red]")
        return []
    console.print(f"[green]Found {len(mediawiki_files)} .mediawiki files to parse[/green]")
    all_articles = []
    with Progress() as progress:
        task = progress.add_task("[cyan]Parsing files...", total=len(mediawiki_files))
        for i, filepath in enumerate(mediawiki_files):
            # Title is the filename without extension
            title = filepath.stem
            try:
                content = filepath.read_text(encoding="utf-8", errors="replace")
            except Exception as e:
                console.print(f"[yellow]Warning: Could not read {filepath}: {e}[/yellow]")
                progress.advance(task)
                continue
            # Skip redirects and empty files
            if not content or content.strip().lower().startswith("#redirect"):
                progress.advance(task)
                continue
            article = WikiArticle(
                id=i,
                title=title,
                content=content,
                plain_text=clean_wikitext(content),
                categories=extract_categories(content),
                links=extract_wiki_links(content),
                external_links=extract_external_links(content),
                timestamp="",
                contributor="",
            )
            all_articles.append(article)
            progress.advance(task)
    console.print(f"[green]Parsed {len(all_articles)} articles[/green]")
    if output_path:
        console.print(f"[cyan]Saving to {output_path}...[/cyan]")
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump([a.to_dict() for a in all_articles], f, ensure_ascii=False, indent=2)
        console.print(f"[green]Saved {len(all_articles)} articles to {output_path}[/green]")
    return all_articles
 def main():
    """CLI entry point for parsing wiki content."""
    output_path = settings.data_dir / "articles.json"
    # Check for Codeberg-style articles directory first (newer, more complete)
    articles_dir = settings.project_root / "articles" / "articles"
    if articles_dir.exists():
        console.print("[cyan]Found Codeberg-style articles directory, using that...[/cyan]")
        parse_mediawiki_files(articles_dir, output_path)
    else:
        # Fall back to XML dumps
        parse_all_dumps(output_path)
 if __name__ == "__main__":
    main()
--- a/src/rag.py
+++ b/src/rag.py
@ -0,0 +1,159 @@
 """RAG (Retrieval Augmented Generation) system for wiki Q&A."""
 from dataclasses import dataclass
 from typing import Optional
 from .embeddings import WikiVectorStore
 from .llm import llm_client
 SYSTEM_PROMPT = """You are a knowledgeable assistant for the P2P Foundation Wiki, a comprehensive knowledge base about peer-to-peer culture, commons-based peer production, alternative economics, and collaborative governance.
 Your role is to answer questions about the wiki content accurately and helpfully. When answering:
 1. Base your answers on the provided wiki content excerpts
 2. Cite specific articles when relevant (use the article titles)
 3. If the provided content doesn't fully answer the question, say so
 4. Explain concepts in accessible language while maintaining accuracy
 5. Connect related concepts when helpful
 If asked about something not covered in the provided content, acknowledge this and suggest related topics that might be helpful."""
@dataclass
 class ChatMessage:
    """A chat message."""
    role: str  # "user" or "assistant"
    content: str
@dataclass
 class RAGResponse:
    """Response from the RAG system."""
    answer: str
    sources: list[dict]  # List of source articles used
    query: str
 class WikiRAG:
    """RAG system for answering questions about wiki content."""
    def __init__(self, vector_store: Optional[WikiVectorStore] = None):
        self.vector_store = vector_store or WikiVectorStore()
        self.conversation_history: list[ChatMessage] = []
    def _format_context(self, search_results: list[dict]) -> str:
        """Format search results as context for the LLM."""
        if not search_results:
            return "No relevant wiki content found for this query."
        context_parts = []
        for i, result in enumerate(search_results, 1):
            title = result["metadata"].get("title", "Unknown")
            content = result["content"]
            categories = result["metadata"].get("categories", "")
            context_parts.append(
                f"[Source {i}: {title}]\n"
                f"Categories: {categories}\n"
                f"Content:\n{content}\n"
            )
        return "\n---\n".join(context_parts)
    def _build_prompt(self, query: str, context: str) -> str:
        """Build the prompt for the LLM."""
        # Include recent conversation history for context
        history_text = ""
        if self.conversation_history:
            recent = self.conversation_history[-4:]  # Last 2 exchanges
            history_text = "\n\nRecent conversation:\n"
            for msg in recent:
                role = "User" if msg.role == "user" else "Assistant"
                # Truncate long messages
                content = msg.content[:500] + "..." if len(msg.content) > 500 else msg.content
                history_text += f"{role}: {content}\n"
        return f"""Based on the following wiki content, please answer the user's question.
 Wiki Content:
 {context}
 {history_text}
 User Question: {query}
 Please provide a helpful answer based on the wiki content above. Cite specific articles when relevant."""
    async def ask(
        self,
        query: str,
        n_results: int = 5,
        filter_categories: Optional[list[str]] = None,
    ) -> RAGResponse:
        """
        Ask a question and get an answer based on wiki content.
        Args:
            query: User's question
            n_results: Number of relevant chunks to retrieve
            filter_categories: Optional category filter
        Returns:
            RAGResponse with answer and sources
        """
        # Search for relevant content
        search_results = self.vector_store.search(
            query, n_results=n_results, filter_categories=filter_categories
        )
        # Format context
        context = self._format_context(search_results)
        # Build prompt
        prompt = self._build_prompt(query, context)
        # Get LLM response (use Ollama for chat by default)
        answer = await llm_client.chat(
            prompt,
            system=SYSTEM_PROMPT,
            use_claude=False,  # Use Ollama for chat
            temperature=0.7,
        )
        # Update conversation history
        self.conversation_history.append(ChatMessage(role="user", content=query))
        self.conversation_history.append(ChatMessage(role="assistant", content=answer))
        # Extract unique sources
        sources = []
        seen_titles = set()
        for result in search_results:
            title = result["metadata"].get("title", "Unknown")
            if title not in seen_titles:
                seen_titles.add(title)
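                sources.append(
                    {
                        "title": title,
                        "article_id": result["metadata"].get("article_id"),
                        # Drop empty entries so a missing "categories" value yields [] rather than [""]
                        "categories": [
                            c for c in result["metadata"].get("categories", "").split(",") if c
                        ],
                    }
                )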
        return RAGResponse(answer=answer, sources=sources, query=query)
    def clear_history(self):
        """Clear conversation history."""
        self.conversation_history = []
    def get_suggestions(self, partial_query: str, n_results: int = 5) -> list[str]:
        """Get article title suggestions for autocomplete."""
        # Case-insensitive substring matching on titles (not strictly a prefix match)
        all_titles = self.vector_store.get_article_titles()
        partial_lower = partial_query.lower()
        suggestions = [
            title for title in all_titles if partial_lower in title.lower()
        ][:n_results]
        return suggestions
--- a/web/index.html
+++ b/web/index.html
@ -0,0 +1,707 @@
 <!DOCTYPE html>
 <html lang="en">
 <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>P2P Wiki AI</title>
    <style>
        :root {
            --bg-primary: #1a1a2e;
            --bg-secondary: #16213e;
            --bg-tertiary: #0f3460;
            --text-primary: #e8e8e8;
            --text-secondary: #a0a0a0;
            --accent: #e94560;
            --accent-hover: #ff6b6b;
            --success: #4ecdc4;
            --border: #2a2a4a;
        }
        * {
            box-sizing: border-box;
            margin: 0;
            padding: 0;
        }
        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            background: var(--bg-primary);
            color: var(--text-primary);
            min-height: 100vh;
        }
        .container {
            max-width: 1200px;
            margin: 0 auto;
            padding: 20px;
        }
        header {
            display: flex;
            justify-content: space-between;
            align-items: center;
            padding: 20px 0;
            border-bottom: 1px solid var(--border);
            margin-bottom: 30px;
        }
        h1 {
            font-size: 1.8em;
            font-weight: 600;
        }
        h1 span {
            color: var(--accent);
        }
        .tabs {
            display: flex;
            gap: 10px;
        }
        .tab {
            padding: 10px 20px;
            background: var(--bg-secondary);
            border: 1px solid var(--border);
            border-radius: 8px;
            cursor: pointer;
            transition: all 0.2s;
        }
        .tab:hover, .tab.active {
            background: var(--bg-tertiary);
            border-color: var(--accent);
        }
        .panel {
            display: none;
        }
        .panel.active {
            display: block;
        }
        /* Chat Panel */
        .chat-container {
            display: flex;
            flex-direction: column;
            height: calc(100vh - 200px);
            background: var(--bg-secondary);
            border-radius: 12px;
            overflow: hidden;
        }
        .chat-messages {
            flex: 1;
            overflow-y: auto;
            padding: 20px;
        }
        .message {
            margin-bottom: 20px;
            max-width: 80%;
        }
        .message.user {
            margin-left: auto;
        }
        .message-content {
            padding: 15px;
            border-radius: 12px;
            line-height: 1.6;
        }
        .message.user .message-content {
            background: var(--bg-tertiary);
        }
        .message.assistant .message-content {
            background: var(--bg-primary);
            border: 1px solid var(--border);
        }
        .message-sources {
            margin-top: 10px;
            padding: 10px;
            background: rgba(233, 69, 96, 0.1);
            border-radius: 8px;
            font-size: 0.9em;
        }
        .message-sources h4 {
            color: var(--accent);
            margin-bottom: 5px;
        }
        .source-tag {
            display: inline-block;
            padding: 3px 8px;
            margin: 2px;
            background: var(--bg-tertiary);
            border-radius: 4px;
            font-size: 0.85em;
        }
        .chat-input {
            display: flex;
            gap: 10px;
            padding: 20px;
            background: var(--bg-primary);
            border-top: 1px solid var(--border);
        }
        .chat-input input {
            flex: 1;
            padding: 15px;
            background: var(--bg-secondary);
            border: 1px solid var(--border);
            border-radius: 8px;
            color: var(--text-primary);
            font-size: 1em;
        }
        .chat-input input:focus {
            outline: none;
            border-color: var(--accent);
        }
        .chat-input button {
            padding: 15px 30px;
            background: var(--accent);
            border: none;
            border-radius: 8px;
            color: white;
            font-weight: 600;
            cursor: pointer;
            transition: background 0.2s;
        }
        .chat-input button:hover {
            background: var(--accent-hover);
        }
        .chat-input button:disabled {
            opacity: 0.5;
            cursor: not-allowed;
        }
        /* Ingress Panel */
        .ingress-container {
            background: var(--bg-secondary);
            border-radius: 12px;
            padding: 30px;
        }
        .ingress-form {
            display: flex;
            gap: 10px;
            margin-bottom: 30px;
        }
        .ingress-form input {
            flex: 1;
            padding: 15px;
            background: var(--bg-primary);
            border: 1px solid var(--border);
            border-radius: 8px;
            color: var(--text-primary);
            font-size: 1em;
        }
        .ingress-form input:focus {
            outline: none;
            border-color: var(--accent);
        }
        .ingress-form button {
            padding: 15px 30px;
            background: var(--success);
            border: none;
            border-radius: 8px;
            color: var(--bg-primary);
            font-weight: 600;
            cursor: pointer;
            transition: opacity 0.2s;
        }
        .ingress-form button:hover {
            opacity: 0.9;
        }
        .ingress-form button:disabled {
            opacity: 0.5;
            cursor: not-allowed;
        }
        .ingress-result {
            background: var(--bg-primary);
            border-radius: 8px;
            padding: 20px;
            margin-bottom: 20px;
        }
        .ingress-result h3 {
            margin-bottom: 15px;
            color: var(--accent);
        }
        .result-stats {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
            gap: 15px;
            margin-bottom: 20px;
        }
        .stat {
            background: var(--bg-secondary);
            padding: 15px;
            border-radius: 8px;
            text-align: center;
        }
        .stat-value {
            font-size: 2em;
            font-weight: bold;
            color: var(--success);
        }
        .stat-label {
            color: var(--text-secondary);
            font-size: 0.9em;
        }
        /* Review Panel */
        .review-container {
            background: var(--bg-secondary);
            border-radius: 12px;
            padding: 30px;
        }
        .review-item {
            background: var(--bg-primary);
            border-radius: 8px;
            padding: 20px;
            margin-bottom: 20px;
        }
        .review-item h3 {
            margin-bottom: 10px;
        }
        .review-meta {
            color: var(--text-secondary);
            font-size: 0.9em;
            margin-bottom: 15px;
        }
        .review-section {
            margin-top: 20px;
            padding-top: 20px;
            border-top: 1px solid var(--border);
        }
        .review-section h4 {
            margin-bottom: 10px;
            color: var(--accent);
        }
        .match-item, .draft-item {
            background: var(--bg-secondary);
            padding: 15px;
            border-radius: 8px;
            margin-bottom: 10px;
        }
        .match-item .title, .draft-item .title {
            font-weight: 600;
            margin-bottom: 5px;
        }
        .match-item .score {
            color: var(--success);
        }
        .action-buttons {
            display: flex;
            gap: 10px;
            margin-top: 10px;
        }
        .btn-approve {
            padding: 8px 16px;
            background: var(--success);
            border: none;
            border-radius: 4px;
            color: var(--bg-primary);
            cursor: pointer;
        }
        .btn-reject {
            padding: 8px 16px;
            background: var(--accent);
            border: none;
            border-radius: 4px;
            color: white;
            cursor: pointer;
        }
        .loading {
            display: inline-block;
            width: 20px;
            height: 20px;
            border: 2px solid var(--text-secondary);
            border-top-color: var(--accent);
            border-radius: 50%;
            animation: spin 1s linear infinite;
        }
        @keyframes spin {
            to { transform: rotate(360deg); }
        }
        .empty-state {
            text-align: center;
            padding: 50px;
            color: var(--text-secondary);
        }
        /* Markdown-like formatting */
        .message-content p { margin-bottom: 10px; }
        .message-content ul, .message-content ol { margin-left: 20px; margin-bottom: 10px; }
        .message-content code { background: var(--bg-tertiary); padding: 2px 6px; border-radius: 4px; }
        .message-content pre { background: var(--bg-tertiary); padding: 15px; border-radius: 8px; overflow-x: auto; }
    </style>
 </head>
 <body>
    <div class="container">
        <header>
            <h1>P2P Wiki <span>AI</span></h1>
            <div class="tabs">
                <div class="tab active" data-panel="chat">Chat</div>
                <div class="tab" data-panel="ingress">Ingress</div>
                <div class="tab" data-panel="review">Review Queue</div>
            </div>
        </header>
        <!-- Chat Panel -->
        <div id="chat" class="panel active">
            <div class="chat-container">
                <div class="chat-messages" id="chatMessages">
                    <div class="message assistant">
                        <div class="message-content">
                            <p>Welcome to the P2P Wiki AI assistant! I can help you explore the P2P Foundation Wiki's knowledge about peer-to-peer culture, commons-based peer production, alternative economics, and collaborative governance.</p>
                            <p>Ask me anything about these topics!</p>
                        </div>
                    </div>
                </div>
                <div class="chat-input">
                    <input type="text" id="chatInput" placeholder="Ask about P2P, commons, cooperative economics..." />
                    <button id="chatSend">Send</button>
                </div>
            </div>
        </div>
        <!-- Ingress Panel -->
        <div id="ingress" class="panel">
            <div class="ingress-container">
                <h2>Article Ingress</h2>
                <p style="color: var(--text-secondary); margin-bottom: 20px;">
                    Drop an article URL to analyze it for wiki content. The AI will identify relevant topics,
                    find matching wiki articles for citations, and draft new articles.
                </p>
                <div class="ingress-form">
                    <input type="url" id="ingressUrl" placeholder="https://example.com/article-about-commons" />
                    <button id="ingressSubmit">Process Article</button>
                </div>
                <div id="ingressResult"></div>
            </div>
        </div>
        <!-- Review Panel -->
        <div id="review" class="panel">
            <div class="review-container">
                <h2>Review Queue</h2>
                <p style="color: var(--text-secondary); margin-bottom: 20px;">
                    Review and approve AI-generated wiki content before it's added to the wiki.
                </p>
                <div id="reviewItems">
                    <div class="empty-state">Loading review items...</div>
                </div>
            </div>
        </div>
    </div>
    <script>
        const API_BASE = '';  // Same origin
        // Tab switching
        document.querySelectorAll('.tab').forEach(tab => {
            tab.addEventListener('click', () => {
                document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
                document.querySelectorAll('.panel').forEach(p => p.classList.remove('active'));
                tab.classList.add('active');
                document.getElementById(tab.dataset.panel).classList.add('active');
                // Load review items when switching to review tab
                if (tab.dataset.panel === 'review') {
                    loadReviewItems();
                }
            });
        });
        // Chat functionality
        const chatMessages = document.getElementById('chatMessages');
        const chatInput = document.getElementById('chatInput');
        const chatSend = document.getElementById('chatSend');
        function addMessage(content, role, sources = []) {
            const div = document.createElement('div');
            div.className = `message ${role}`;
            div.innerHTML = `<div class="message-content">${formatMessage(content)}</div>`;
            if (sources.length > 0) {
                const sourcesDiv = document.createElement('div');
                sourcesDiv.className = 'message-sources';
                sourcesDiv.innerHTML = '<h4>Sources</h4>';
                sources.forEach(s => {
                    // textContent keeps article titles from being parsed as HTML
                    const tag = document.createElement('span');
                    tag.className = 'source-tag';
                    tag.textContent = s.title;
                    sourcesDiv.appendChild(tag);
                });
                div.appendChild(sourcesDiv);
            }
            chatMessages.appendChild(div);
            chatMessages.scrollTop = chatMessages.scrollHeight;
        }
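        function formatMessage(text) {
            // Escape HTML first so user input or model output cannot inject markup,
            // then apply basic markdown-like formatting
            const formatted = text
                .replace(/&/g, '&amp;')
                .replace(/</g, '&lt;')
                .replace(/>/g, '&gt;')
                .replace(/\n\n/g, '</p><p>')
                .replace(/\n/g, '<br>')
                .replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>')
                .replace(/\*(.+?)\*/g, '<em>$1</em>')
                .replace(/`(.+?)`/g, '<code>$1</code>');
            // Wrap in <p> so the </p><p> paragraph breaks above stay balanced
            return `<p>${formatted}</p>`;
        }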
        async function sendChat() {
            const query = chatInput.value.trim();
            if (!query) return;
            chatInput.value = '';
            chatSend.disabled = true;
            addMessage(query, 'user');
            // Show loading
            const loadingDiv = document.createElement('div');
            loadingDiv.className = 'message assistant';
            loadingDiv.innerHTML = '<div class="message-content"><span class="loading"></span> Thinking...</div>';
            chatMessages.appendChild(loadingDiv);
            chatMessages.scrollTop = chatMessages.scrollHeight;
            try {
                const response = await fetch(`${API_BASE}/chat`, {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ query, n_results: 5 })
                });
                const data = await response.json();
                chatMessages.removeChild(loadingDiv);
                if (response.ok) {
                    addMessage(data.answer, 'assistant', data.sources);
                } else {
                    addMessage(`Error: ${data.detail || 'Something went wrong'}`, 'assistant');
                }
            } catch (error) {
                chatMessages.removeChild(loadingDiv);
                addMessage(`Error: ${error.message}`, 'assistant');
            }
            chatSend.disabled = false;
            chatInput.focus();
        }
        chatSend.addEventListener('click', sendChat);
        chatInput.addEventListener('keypress', (e) => {
            if (e.key === 'Enter') sendChat();
        });
        // Ingress functionality
        const ingressUrl = document.getElementById('ingressUrl');
        const ingressSubmit = document.getElementById('ingressSubmit');
        const ingressResult = document.getElementById('ingressResult');
        async function processIngress() {
            const url = ingressUrl.value.trim();
            if (!url) return;
            ingressSubmit.disabled = true;
            ingressSubmit.textContent = 'Processing...';
            ingressResult.innerHTML = `
                <div class="ingress-result">
                    <h3>Processing Article</h3>
                    <p><span class="loading"></span> Scraping and analyzing content...</p>
                </div>
            `;
            try {
                const response = await fetch(`${API_BASE}/ingress`, {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ url })
                });
                const data = await response.json();
                if (response.ok) {
                    ingressResult.innerHTML = `
                        <div class="ingress-result">
                            <h3>Analysis Complete: ${data.scraped_title || 'Article'}</h3>
                            <div class="result-stats">
                                <div class="stat">
                                    <div class="stat-value">${data.topics_found}</div>
                                    <div class="stat-label">Topics Found</div>
                                </div>
                                <div class="stat">
                                    <div class="stat-value">${data.wiki_matches}</div>
                                    <div class="stat-label">Wiki Matches</div>
                                </div>
                                <div class="stat">
                                    <div class="stat-value">${data.drafts_generated}</div>
                                    <div class="stat-label">Drafts Generated</div>
                                </div>
                            </div>
                            <p style="color: var(--success);">
                                Results added to review queue. Check the Review tab to approve or reject suggestions.
                            </p>
                        </div>
                    `;
                } else {
                    ingressResult.innerHTML = `
                        <div class="ingress-result">
                            <h3 style="color: var(--accent);">Error</h3>
                            <p>${data.detail || 'Failed to process article'}</p>
                        </div>
                    `;
                }
            } catch (error) {
                ingressResult.innerHTML = `
                    <div class="ingress-result">
                        <h3 style="color: var(--accent);">Error</h3>
                        <p>${error.message}</p>
                    </div>
                `;
            }
            ingressSubmit.disabled = false;
            ingressSubmit.textContent = 'Process Article';
        }
        ingressSubmit.addEventListener('click', processIngress);
        ingressUrl.addEventListener('keypress', (e) => {
            if (e.key === 'Enter') processIngress();
        });
        // Review functionality
        const reviewItems = document.getElementById('reviewItems');
        async function loadReviewItems() {
            try {
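                const response = await fetch(`${API_BASE}/review`);
                const data = await response.json();
                // Surface server errors instead of trying to render an error payload
                if (!response.ok) {
                    throw new Error(data.detail || 'Failed to load review items');
                }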
                if (data.count === 0) {
                    reviewItems.innerHTML = '<div class="empty-state">No items in the review queue.</div>';
                    return;
                }
                reviewItems.innerHTML = data.items.map(item => `
                    <div class="review-item">
                        <h3>${item.scraped?.title || 'Unknown Article'}</h3>
                        <div class="review-meta">
                            Source: <a href="${item.scraped?.url}" target="_blank" rel="noopener noreferrer">${item.scraped?.domain}</a>
                            | Processed: ${new Date(item.timestamp).toLocaleString()}
                        </div>
                        ${item.wiki_matches?.length > 0 ? `
                            <div class="review-section">
                                <h4>Suggested Citations (${item.wiki_matches.length})</h4>
                                ${item.wiki_matches.map((match, i) => `
                                    <div class="match-item" ${match.approved ? 'style="opacity: 0.5"' : ''}>
                                        <div class="title">${match.title}</div>
                                        <div class="score">Relevance: ${(match.relevance_score * 100).toFixed(0)}%</div>
                                        <div>${match.suggested_citation}</div>
                                        ${!match.approved && !match.rejected ? `
                                            <div class="action-buttons">
                                                <button class="btn-approve" onclick="reviewAction('${item._filepath}', 'match', ${i}, 'approve')">Approve</button>
                                                <button class="btn-reject" onclick="reviewAction('${item._filepath}', 'match', ${i}, 'reject')">Reject</button>
                                            </div>
                                        ` : `<em>${match.approved ? 'Approved' : 'Rejected'}</em>`}
                                    </div>
                                `).join('')}
                            </div>
                        ` : ''}
                        ${item.draft_articles?.length > 0 ? `
                            <div class="review-section">
                                <h4>Draft Articles (${item.draft_articles.length})</h4>
                                ${item.draft_articles.map((draft, i) => `
                                    <div class="draft-item" ${draft.approved ? 'style="opacity: 0.5"' : ''}>
                                        <div class="title">${draft.title}</div>
                                        <div style="color: var(--text-secondary); font-size: 0.9em; margin-bottom: 10px;">
                                            ${draft.summary || ''}
                                        </div>
                                        <details>
                                            <summary style="cursor: pointer; color: var(--accent);">View Draft Content</summary>
                                            <pre style="margin-top: 10px; white-space: pre-wrap; font-size: 0.85em;">${draft.content}</pre>
                                        </details>
                                        ${!draft.approved && !draft.rejected ? `
                                            <div class="action-buttons">
                                                <button class="btn-approve" onclick="reviewAction('${item._filepath}', 'draft', ${i}, 'approve')">Approve</button>
                                                <button class="btn-reject" onclick="reviewAction('${item._filepath}', 'draft', ${i}, 'reject')">Reject</button>
                                            </div>
                                        ` : `<em>${draft.approved ? 'Approved' : 'Rejected'}</em>`}
                                    </div>
                                `).join('')}
                            </div>
                        ` : ''}
                    </div>
                `).join('');
            } catch (error) {
                reviewItems.innerHTML = `<div class="empty-state">Error loading review items: ${error.message}</div>`;
            }
        }
        async function reviewAction(filepath, itemType, itemIndex, action) {
            try {
                const response = await fetch(`${API_BASE}/review/action`, {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({
                        filepath,
                        item_type: itemType,
                        item_index: itemIndex,
                        action
                    })
                });
                if (response.ok) {
                    loadReviewItems();  // Refresh the list
                } else {
                    const data = await response.json();
                    alert(`Error: ${data.detail || 'Action failed'}`);
                }
            } catch (error) {
                alert(`Error: ${error.message}`);
            }
        }
        // Make reviewAction available globally
        window.reviewAction = reviewAction;
    </script>
 </body>
 </html>
@ -0,0 +1 @@
 """P2P Wiki AI System - Chat agent and ingress pipeline."""