Initial commit: P2P Wiki AI system
- RAG-based chat with 39k wiki articles (232k chunks)
- Article ingress pipeline for processing external URLs
- Review queue for AI-generated content
- FastAPI backend with web UI
- Traefik-ready Docker setup for p2pwiki.jeffemmett.com

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in: commit 4ebd90cc64
.env.example

@@ -0,0 +1,17 @@
# P2P Wiki AI Configuration

# Ollama (Local LLM)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2

# Claude API (Optional - for higher quality article drafts)
ANTHROPIC_API_KEY=
CLAUDE_MODEL=claude-sonnet-4-20250514

# Hybrid Routing
USE_CLAUDE_FOR_DRAFTS=true
USE_OLLAMA_FOR_CHAT=true

# Server
HOST=0.0.0.0
PORT=8420
.gitignore

@@ -0,0 +1,28 @@
# Virtual environment
.venv/
venv/
env/

# Python
__pycache__/
*.py[cod]
*.egg-info/
dist/
build/

# Data files (too large for git)
data/articles.json
data/chroma/
data/review_queue/
xmldump/
xmldump-2014.tar.gz
articles/
articles.tar.gz

# Environment
.env

# IDE
.idea/
.vscode/
*.swp
Dockerfile

@@ -0,0 +1,49 @@
# P2P Wiki AI - Multi-stage build
FROM python:3.11-slim as builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY pyproject.toml .
RUN pip install --no-cache-dir build && \
    pip wheel --no-cache-dir --wheel-dir /wheels .

# Production image
FROM python:3.11-slim

WORKDIR /app

# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    libxml2 \
    && rm -rf /var/lib/apt/lists/*

# Copy wheels and install
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*.whl && rm -rf /wheels

# Copy application code
COPY src/ src/
COPY web/ web/

# Create data directories
RUN mkdir -p data/chroma data/review_queue

# Environment variables
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

# Expose port
EXPOSE 8420

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import httpx; httpx.get('http://localhost:8420/health')" || exit 1

# Run the application
CMD ["python", "-m", "uvicorn", "src.api:app", "--host", "0.0.0.0", "--port", "8420"]
README.md

@@ -0,0 +1,199 @@
# P2P Wiki AI

AI-augmented system for the P2P Foundation Wiki with two main features:

1. **Conversational Agent** - Ask questions about the 23,000+ wiki articles using RAG (Retrieval Augmented Generation)
2. **Article Ingress Pipeline** - Drop article URLs to automatically analyze content, find matching wiki articles for citations, and generate draft articles

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                       P2P Wiki AI System                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────┐         ┌─────────────────┐               │
│   │   Chat (Q&A)    │         │  Ingress Tool   │               │
│   │    via RAG      │         │   (URL Drop)    │               │
│   └────────┬────────┘         └────────┬────────┘               │
│            │                           │                        │
│            └───────────┬───────────────┘                        │
│                        ▼                                        │
│            ┌───────────────────────┐                            │
│            │    FastAPI Backend    │                            │
│            └───────────┬───────────┘                            │
│                        │                                        │
│         ┌──────────────┼──────────────┐                         │
│         ▼              ▼              ▼                         │
│   ┌──────────┐  ┌─────────────┐  ┌──────────────┐               │
│   │ ChromaDB │  │   Ollama/   │  │   Article    │               │
│   │ (Vector) │  │   Claude    │  │   Scraper    │               │
│   └──────────┘  └─────────────┘  └──────────────┘               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## Quick Start

### 1. Prerequisites

- Python 3.10+
- [Ollama](https://ollama.ai) installed locally (or access to a remote Ollama server)
- Optional: Anthropic API key for Claude (higher quality article drafts)

### 2. Install Dependencies

```bash
cd /home/jeffe/Github/p2pwiki-content
pip install -e .
```

### 3. Parse Wiki Content

Convert the MediaWiki XML dumps to searchable JSON:

```bash
python -m src.parser
```

This creates `data/articles.json` with all parsed articles (~23,000 pages).
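The parser itself (`src/parser.py`) is not shown in this excerpt. As a rough, hedged sketch of what parsing a MediaWiki export involves, here is a standard-library-only example; the element names follow the MediaWiki export schema, the namespace version (0.10 here) depends on the actual dump, and `parse_pages` is a hypothetical helper rather than the project's real API:

```python
# Minimal sketch of extracting page titles and revision text from a
# MediaWiki XML export. Assumes the standard <page>/<revision>/<text>
# layout; the namespace URI varies by export schema version.
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def parse_pages(xml_text: str) -> list[dict]:
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter(f"{NS}page"):
        title = page.findtext(f"{NS}title", default="")
        text = page.findtext(f"{NS}revision/{NS}text", default="")
        pages.append({"title": title, "text": text})
    return pages

sample = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Commons</title>
    <revision><text>Shared resources...</text></revision>
  </page>
</mediawiki>"""

print(parse_pages(sample))
```

The real parser additionally strips wiki markup into `plain_text` and collects categories, as the `WikiArticle` usage elsewhere in this commit suggests.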
### 4. Generate Embeddings

Create the vector store for semantic search:

```bash
python -m src.embeddings
```

This creates the ChromaDB vector store in `data/chroma/`. Takes a few minutes.
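As a sanity check on store size: the embedding step (see `src/embeddings.py` in this commit) splits each article into 1,000-character windows with 200 characters of overlap, i.e. an effective stride of 800 characters. A standalone sketch of that arithmetic, with the sentence-boundary snapping the real code does omitted for clarity:

```python
# Window arithmetic mirroring CHUNK_SIZE/CHUNK_OVERLAP from src.embeddings.
# Each chunk covers 1000 chars and starts 800 chars after the previous one.
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200

def chunk_spans(length: int) -> list[tuple[int, int]]:
    spans, start = [], 0
    while start < length:
        end = min(start + CHUNK_SIZE, length)
        spans.append((start, end))
        if end == length:
            break
        start = end - CHUNK_OVERLAP
    return spans

print(len(chunk_spans(5000)))  # a 5,000-char article -> 6 chunks
```

At this rate a ~23,000-article corpus producing a couple hundred thousand chunks is plausible.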
### 5. Configure Environment

```bash
cp .env.example .env
# Edit .env with your settings
```

### 6. Run the Server

```bash
python -m src.api
```

Visit http://localhost:8420/ui for the web interface.

## Docker Deployment

For production deployment on the RS 8000:

```bash
# Build and run
docker compose up -d --build

# Check logs
docker compose logs -f

# Access at http://localhost:8420/ui
# Or via Traefik at https://p2pwiki.jeffemmett.com
```

## API Endpoints

### Chat

```bash
# Ask a question
curl -X POST http://localhost:8420/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "What is commons-based peer production?"}'
```

### Ingress

```bash
# Process an external article
curl -X POST http://localhost:8420/ingress \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article-about-cooperatives"}'
```

### Review Queue

```bash
# Get all items in review queue
curl http://localhost:8420/review

# Approve a draft article
curl -X POST http://localhost:8420/review/action \
  -H "Content-Type: application/json" \
  -d '{"filepath": "/path/to/item.json", "item_type": "draft", "item_index": 0, "action": "approve"}'
```

### Search

```bash
# Direct vector search
curl "http://localhost:8420/search?q=cooperative%20economics&n=10"

# List article titles
curl "http://localhost:8420/articles?limit=100"
```
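The same calls can be made from Python with only the standard library. `chat_payload` and `post` below are illustrative helpers, not part of the project; `post` performs a live HTTP request against a running server, so it is left commented out here:

```python
# Small client sketch for the endpoints above. Only payload construction
# is exercised; post() would hit a live server at BASE_URL.
import json
import urllib.request

BASE_URL = "http://localhost:8420"

def chat_payload(query: str, n_results: int = 5) -> dict:
    # Mirrors the ChatRequest model fields (query, n_results).
    return {"query": query, "n_results": n_results}

def post(path: str, payload: dict) -> dict:
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the server running:
# post("/chat", chat_payload("What is commons-based peer production?"))
print(chat_payload("What is commons-based peer production?"))
```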
## Hybrid AI Routing

The system uses intelligent routing between local (Ollama) and cloud (Claude) LLMs:

| Task | Default LLM | Reasoning |
|------|-------------|-----------|
| Chat Q&A | Ollama | Fast, free, good enough for retrieval-based answers |
| Content Analysis | Claude | Better at extracting topics and identifying wiki relevance |
| Draft Generation | Claude | Higher quality article writing |
| Embeddings | Local (sentence-transformers) | Fast, free, optimized for semantic search |

Configure in `.env`:
```
USE_CLAUDE_FOR_DRAFTS=true
USE_OLLAMA_FOR_CHAT=true
```
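The routing rule in the table can be sketched as a single decision function. The names and flags here are illustrative; `src/llm.py` is not shown in this excerpt, so this is an assumption about the logic rather than the actual implementation:

```python
# Hypothetical sketch of the hybrid routing rule: drafts and content
# analysis prefer Claude when enabled and an API key is configured;
# chat stays on Ollama; everything falls back to the local model.
def pick_backend(task: str, *, use_claude_for_drafts: bool = True,
                 use_ollama_for_chat: bool = True,
                 have_api_key: bool = True) -> str:
    if task in ("draft", "analysis") and use_claude_for_drafts and have_api_key:
        return "claude"
    if task == "chat" and use_ollama_for_chat:
        return "ollama"
    return "ollama"  # local fallback when Claude is disabled or unavailable

print(pick_backend("draft"), pick_backend("chat"))
```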
## Project Structure

```
p2pwiki-content/
├── src/
│   ├── api.py          # FastAPI backend
│   ├── config.py       # Configuration settings
│   ├── embeddings.py   # Vector store (ChromaDB)
│   ├── ingress.py      # Article scraper & analyzer
│   ├── llm.py          # LLM client (Ollama/Claude)
│   ├── parser.py       # MediaWiki XML parser
│   └── rag.py          # RAG chat system
├── web/
│   └── index.html      # Web UI
├── data/
│   ├── articles.json   # Parsed wiki content
│   ├── chroma/         # Vector store
│   └── review_queue/   # Pending ingress items
├── xmldump/            # MediaWiki XML dumps
├── docker-compose.yml
├── Dockerfile
└── pyproject.toml
```

## Content Coverage

The P2P Foundation Wiki contains ~23,000 articles covering:

- Peer-to-peer networks and culture
- Commons-based peer production (CBPP)
- Alternative economics and post-capitalism
- Cooperative business models
- Open source and free culture
- Collaborative governance
- Sustainability and ecology

## License

The wiki content is from the P2P Foundation under their respective licenses.
The AI system code is provided as-is for educational purposes.
docker-compose.yml

@@ -0,0 +1,38 @@
version: '3.8'

services:
  p2pwiki-ai:
    build: .
    container_name: p2pwiki-ai
    restart: unless-stopped
    ports:
      - "8420:8420"
    volumes:
      # Persist vector store and review queue
      - ./data:/app/data
      # Mount XML dumps for parsing (read-only)
      - ./xmldump:/app/xmldump:ro
    environment:
      # Ollama connection (adjust host for your setup)
      - OLLAMA_BASE_URL=${OLLAMA_BASE_URL:-http://host.docker.internal:11434}
      - OLLAMA_MODEL=${OLLAMA_MODEL:-llama3.2}
      # Claude API (optional, for higher quality drafts)
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - CLAUDE_MODEL=${CLAUDE_MODEL:-claude-sonnet-4-20250514}
      # Hybrid routing settings
      - USE_CLAUDE_FOR_DRAFTS=${USE_CLAUDE_FOR_DRAFTS:-true}
      - USE_OLLAMA_FOR_CHAT=${USE_OLLAMA_FOR_CHAT:-true}
    labels:
      # Traefik labels for reverse proxy
      - "traefik.enable=true"
      - "traefik.http.routers.p2pwiki-ai.rule=Host(`p2pwiki.jeffemmett.com`)"
      - "traefik.http.services.p2pwiki-ai.loadbalancer.server.port=8420"
    networks:
      - traefik-public
    # Add extra_hosts for Docker Desktop to access host services
    extra_hosts:
      - "host.docker.internal:host-gateway"

networks:
  traefik-public:
    external: true
File diff suppressed because it is too large
pyproject.toml

@@ -0,0 +1,64 @@
[project]
name = "p2pwiki-ai"
version = "0.1.0"
description = "AI-augmented system for P2P Foundation Wiki - chat agent and ingress pipeline"
requires-python = ">=3.10"
dependencies = [
    # Core
    "fastapi>=0.109.0",
    "uvicorn[standard]>=0.27.0",
    "pydantic>=2.5.0",
    "pydantic-settings>=2.1.0",

    # XML parsing
    "lxml>=5.1.0",

    # Vector store & embeddings
    "chromadb>=0.4.22",
    "sentence-transformers>=2.3.0",

    # LLM integration
    "openai>=1.10.0",      # For Ollama-compatible API
    "anthropic>=0.18.0",   # For Claude API
    "httpx>=0.26.0",

    # Article scraping
    "trafilatura>=1.6.0",
    "newspaper3k>=0.2.8",
    "beautifulsoup4>=4.12.0",
    "requests>=2.31.0",

    # Utilities
    "python-dotenv>=1.0.0",
    "rich>=13.7.0",
    "tqdm>=4.66.0",
    "tenacity>=8.2.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.4.0",
    "pytest-asyncio>=0.23.0",
    "black>=24.1.0",
    "ruff>=0.1.0",
]

[project.scripts]
p2pwiki-parse = "src.parser:main"
p2pwiki-embed = "src.embeddings:main"
p2pwiki-serve = "src.api:main"

[build-system]
requires = ["setuptools>=68.0", "wheel"]
build-backend = "setuptools.build_meta"

[tool.setuptools.packages.find]
where = ["."]

[tool.black]
line-length = 100
target-version = ["py310"]

[tool.ruff]
line-length = 100
select = ["E", "F", "I", "N", "W"]
src/__init__.py

@@ -0,0 +1 @@
"""P2P Wiki AI System - Chat agent and ingress pipeline."""
src/api.py

@@ -0,0 +1,320 @@
"""FastAPI backend for P2P Wiki AI system."""

import asyncio
from contextlib import asynccontextmanager
from pathlib import Path
from typing import Optional

from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse
from pydantic import BaseModel, HttpUrl

from .config import settings
from .embeddings import WikiVectorStore
from .rag import WikiRAG, RAGResponse
from .ingress import IngressPipeline, get_review_queue, approve_item, reject_item

# Global instances
vector_store: Optional[WikiVectorStore] = None
rag_system: Optional[WikiRAG] = None
ingress_pipeline: Optional[IngressPipeline] = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Initialize services on startup."""
    global vector_store, rag_system, ingress_pipeline

    print("Initializing P2P Wiki AI system...")

    # Check if vector store has been populated
    chroma_path = settings.chroma_persist_dir
    if not chroma_path.exists() or not any(chroma_path.iterdir()):
        print("WARNING: Vector store not initialized. Run 'python -m src.parser' and 'python -m src.embeddings' first.")
    else:
        vector_store = WikiVectorStore()
        rag_system = WikiRAG(vector_store)
        ingress_pipeline = IngressPipeline(vector_store)
        print(f"Loaded vector store with {vector_store.get_stats()['total_chunks']} chunks")

    yield

    print("Shutting down...")


app = FastAPI(
    title="P2P Wiki AI",
    description="AI-augmented system for P2P Foundation Wiki - chat agent and ingress pipeline",
    version="0.1.0",
    lifespan=lifespan,
)

# CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Configure appropriately for production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


# --- Request/Response Models ---


class ChatRequest(BaseModel):
    """Chat request model."""

    query: str
    n_results: int = 5
    filter_categories: Optional[list[str]] = None


class ChatResponse(BaseModel):
    """Chat response model."""

    answer: str
    sources: list[dict]
    query: str


class IngressRequest(BaseModel):
    """Ingress request model."""

    url: HttpUrl


class IngressResponse(BaseModel):
    """Ingress response model."""

    status: str
    message: str
    scraped_title: Optional[str] = None
    topics_found: int = 0
    wiki_matches: int = 0
    drafts_generated: int = 0
    queue_file: Optional[str] = None


class ReviewActionRequest(BaseModel):
    """Review action request model."""

    filepath: str
    item_type: str  # "match" or "draft"
    item_index: int
    action: str  # "approve" or "reject"


# --- API Endpoints ---


@app.get("/")
async def root():
    """Root endpoint."""
    return {
        "name": "P2P Wiki AI",
        "version": "0.1.0",
        "status": "running",
        "vector_store_ready": vector_store is not None,
    }


@app.get("/health")
async def health():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "vector_store_ready": vector_store is not None,
    }


@app.get("/stats")
async def stats():
    """Get system statistics."""
    if not vector_store:
        return {"error": "Vector store not initialized"}

    return {
        "vector_store": vector_store.get_stats(),
        "review_queue_count": len(get_review_queue()),
    }


# --- Chat Endpoints ---


@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Chat with the wiki knowledge base."""
    if not rag_system:
        raise HTTPException(
            status_code=503,
            detail="RAG system not initialized. Run indexing first.",
        )

    response = await rag_system.ask(
        query=request.query,
        n_results=request.n_results,
        filter_categories=request.filter_categories,
    )

    return ChatResponse(
        answer=response.answer,
        sources=response.sources,
        query=response.query,
    )


@app.post("/chat/clear")
async def clear_chat():
    """Clear chat history."""
    if rag_system:
        rag_system.clear_history()
    return {"status": "cleared"}


@app.get("/chat/suggestions")
async def chat_suggestions(q: str = ""):
    """Get article title suggestions for autocomplete."""
    if not rag_system or not q:
        return {"suggestions": []}

    suggestions = rag_system.get_suggestions(q)
    return {"suggestions": suggestions}


# --- Ingress Endpoints ---


@app.post("/ingress", response_model=IngressResponse)
async def ingress(request: IngressRequest, background_tasks: BackgroundTasks):
    """
    Process an external article URL through the ingress pipeline.

    This scrapes the article, analyzes it for wiki relevance,
    finds matching existing articles, and generates draft articles.
    """
    if not ingress_pipeline:
        raise HTTPException(
            status_code=503,
            detail="Ingress pipeline not initialized. Run indexing first.",
        )

    try:
        result = await ingress_pipeline.process(str(request.url))

        return IngressResponse(
            status="success",
            message="Article processed successfully",
            scraped_title=result.scraped.title,
            topics_found=len(result.analysis.get("main_topics", [])),
            wiki_matches=len(result.wiki_matches),
            drafts_generated=len(result.draft_articles),
            queue_file=result.timestamp,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


# --- Review Queue Endpoints ---


@app.get("/review")
async def get_review_items():
    """Get all items in the review queue."""
    items = get_review_queue()
    return {"count": len(items), "items": items}


@app.get("/review/{filename}")
async def get_review_item(filename: str):
    """Get a specific review item."""
    filepath = settings.review_queue_dir / filename
    if not filepath.exists():
        raise HTTPException(status_code=404, detail="Review item not found")

    import json

    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)

    return data


@app.post("/review/action")
async def review_action(request: ReviewActionRequest):
    """Approve or reject a review item."""
    if request.action == "approve":
        success = approve_item(request.filepath, request.item_type, request.item_index)
    elif request.action == "reject":
        success = reject_item(request.filepath, request.item_type, request.item_index)
    else:
        raise HTTPException(status_code=400, detail="Invalid action")

    if success:
        return {"status": "success", "action": request.action}
    else:
        raise HTTPException(status_code=500, detail="Action failed")


# --- Search Endpoints ---


@app.get("/search")
async def search(q: str, n: int = 10, categories: Optional[str] = None):
    """Direct search of the vector store."""
    if not vector_store:
        raise HTTPException(status_code=503, detail="Vector store not initialized")

    filter_cats = categories.split(",") if categories else None
    results = vector_store.search(q, n_results=n, filter_categories=filter_cats)

    return {"query": q, "count": len(results), "results": results}


@app.get("/articles")
async def list_articles(limit: int = 100, offset: int = 0):
    """List article titles."""
    if not vector_store:
        raise HTTPException(status_code=503, detail="Vector store not initialized")

    titles = vector_store.get_article_titles()
    return {
        "total": len(titles),
        "limit": limit,
        "offset": offset,
        "titles": titles[offset : offset + limit],
    }


# --- Static Files (Web UI) ---

web_dir = Path(__file__).parent.parent / "web"
if web_dir.exists():
    app.mount("/static", StaticFiles(directory=str(web_dir)), name="static")


@app.get("/ui")
async def ui():
    """Serve the web UI."""
    index_path = web_dir / "index.html"
    if index_path.exists():
        return FileResponse(index_path)
    raise HTTPException(status_code=404, detail="Web UI not found")


def main():
    """Run the API server."""
    import uvicorn

    uvicorn.run(
        "src.api:app",
        host=settings.host,
        port=settings.port,
        reload=True,
    )


if __name__ == "__main__":
    main()
src/config.py

@@ -0,0 +1,51 @@
"""Configuration settings for P2P Wiki AI system."""

from pathlib import Path
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    """Application settings loaded from environment variables."""

    # Paths
    project_root: Path = Path(__file__).parent.parent
    data_dir: Path = project_root / "data"
    xmldump_dir: Path = project_root / "xmldump"

    # Vector store
    chroma_persist_dir: Path = data_dir / "chroma"
    embedding_model: str = "all-MiniLM-L6-v2"  # Fast, good quality

    # Ollama (local LLM)
    ollama_base_url: str = "http://localhost:11434"
    ollama_model: str = "llama3.2"  # Default model for local inference

    # Claude API (for complex tasks)
    anthropic_api_key: str = ""
    claude_model: str = "claude-sonnet-4-20250514"

    # Hybrid routing thresholds
    use_claude_for_drafts: bool = True  # Use Claude for article drafting
    use_ollama_for_chat: bool = True  # Use Ollama for simple Q&A

    # MediaWiki
    mediawiki_api_url: str = ""  # Set if you have a live wiki API

    # Server
    host: str = "0.0.0.0"
    port: int = 8420

    # Review queue
    review_queue_dir: Path = data_dir / "review_queue"

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"


settings = Settings()

# Ensure directories exist
settings.data_dir.mkdir(parents=True, exist_ok=True)
settings.chroma_persist_dir.mkdir(parents=True, exist_ok=True)
settings.review_queue_dir.mkdir(parents=True, exist_ok=True)
src/embeddings.py

@@ -0,0 +1,256 @@
"""Vector store setup and embedding generation using ChromaDB."""

import json
from pathlib import Path
from typing import Optional

import chromadb
from chromadb.config import Settings as ChromaSettings
from rich.console import Console
from rich.progress import Progress
from sentence_transformers import SentenceTransformer

from .config import settings
from .parser import WikiArticle

console = Console()

# Chunk size for embedding (in characters)
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200


class WikiVectorStore:
    """Vector store for wiki articles using ChromaDB."""

    def __init__(self, persist_dir: Optional[Path] = None):
        self.persist_dir = persist_dir or settings.chroma_persist_dir

        # Initialize ChromaDB
        self.client = chromadb.PersistentClient(
            path=str(self.persist_dir),
            settings=ChromaSettings(anonymized_telemetry=False),
        )

        # Create or get collection
        self.collection = self.client.get_or_create_collection(
            name="wiki_articles",
            metadata={"hnsw:space": "cosine"},
        )

        # Load embedding model
        console.print(f"[cyan]Loading embedding model: {settings.embedding_model}[/cyan]")
        self.model = SentenceTransformer(settings.embedding_model)
        console.print("[green]Model loaded[/green]")

    def _chunk_text(self, text: str, title: str) -> list[tuple[str, dict]]:
        """Split text into overlapping chunks with metadata."""
        if len(text) <= CHUNK_SIZE:
            return [(text, {"chunk_index": 0, "total_chunks": 1})]

        chunks = []
        start = 0
        chunk_index = 0

        while start < len(text):
            end = start + CHUNK_SIZE

            # Try to break at sentence boundary
            if end < len(text):
                # Look for sentence end within last 100 chars
                for i in range(min(100, end - start)):
                    if text[end - i] in ".!?\n":
                        end = end - i + 1
                        break

            chunk_text = text[start:end].strip()
            if chunk_text:
                # Prepend title for context
                chunk_with_title = f"{title}\n\n{chunk_text}"
                chunks.append(
                    (chunk_with_title, {"chunk_index": chunk_index, "total_chunks": -1})
                )
                chunk_index += 1

            start = end - CHUNK_OVERLAP

        # Update total_chunks now that the final count is known
        for _, meta in chunks:
            meta["total_chunks"] = len(chunks)

        return chunks

    def get_embedded_article_ids(self) -> set:
        """Get set of article IDs that are already embedded."""
        results = self.collection.get(include=["metadatas"])
        article_ids = set()
        for meta in results["metadatas"]:
            if meta and "article_id" in meta:
                article_ids.add(meta["article_id"])
        return article_ids

    def add_articles(self, articles: list[WikiArticle], batch_size: int = 100, resume: bool = True):
        """Add articles to the vector store."""
        console.print(f"[cyan]Processing {len(articles)} articles...[/cyan]")

        # Check for already embedded articles if resuming
        if resume:
            embedded_ids = self.get_embedded_article_ids()
            original_count = len(articles)
            articles = [a for a in articles if a.id not in embedded_ids]
            skipped = original_count - len(articles)
            if skipped > 0:
                console.print(f"[yellow]Skipping {skipped} already-embedded articles[/yellow]")
            if not articles:
                console.print("[green]All articles already embedded![/green]")
                return

        all_chunks = []
        all_ids = []
        all_metadatas = []

        with Progress() as progress:
            task = progress.add_task("[cyan]Chunking articles...", total=len(articles))

            for article in articles:
                if not article.plain_text:
                    progress.advance(task)
                    continue

                chunks = self._chunk_text(article.plain_text, article.title)

                for chunk_text, chunk_meta in chunks:
                    chunk_id = f"{article.id}_{chunk_meta['chunk_index']}"

                    metadata = {
                        "article_id": article.id,
                        "title": article.title,
                        "categories": ",".join(article.categories[:10]),  # Limit categories
                        "timestamp": article.timestamp,
                        "chunk_index": chunk_meta["chunk_index"],
                        "total_chunks": chunk_meta["total_chunks"],
                    }

                    all_chunks.append(chunk_text)
                    all_ids.append(chunk_id)
                    all_metadatas.append(metadata)

                progress.advance(task)

        console.print(f"[cyan]Created {len(all_chunks)} chunks from {len(articles)} articles[/cyan]")

        # Generate embeddings and add in batches
        console.print("[cyan]Generating embeddings and adding to vector store...[/cyan]")

        with Progress() as progress:
            task = progress.add_task(
                "[cyan]Embedding and storing...", total=len(all_chunks) // batch_size + 1
            )

            for i in range(0, len(all_chunks), batch_size):
                batch_chunks = all_chunks[i : i + batch_size]
                batch_ids = all_ids[i : i + batch_size]
                batch_metadatas = all_metadatas[i : i + batch_size]

                # Generate embeddings
                embeddings = self.model.encode(batch_chunks, show_progress_bar=False)

                # Add to collection
                self.collection.add(
                    ids=batch_ids,
                    embeddings=embeddings.tolist(),
                    documents=batch_chunks,
                    metadatas=batch_metadatas,
                )

                progress.advance(task)

        console.print(f"[green]Added {len(all_chunks)} chunks to vector store[/green]")

    def search(
        self,
        query: str,
        n_results: int = 5,
        filter_categories: Optional[list[str]] = None,
    ) -> list[dict]:
        """Search for relevant chunks."""
        query_embedding = self.model.encode([query])[0]

        where_filter = None
        if filter_categories:
            # ChromaDB where filter for categories
            where_filter = {
                "$or": [{"categories": {"$contains": cat}} for cat in filter_categories]
            }

        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results,
            where=where_filter,
            include=["documents", "metadatas", "distances"],
        )

        # Format results
        formatted = []
        if results["documents"] and results["documents"][0]:
            for i, doc in enumerate(results["documents"][0]):
                formatted.append(
                    {
                        "content": doc,
                        "metadata": results["metadatas"][0][i],
                        "distance": results["distances"][0][i],
                    }
                )

        return formatted

    def get_article_titles(self) -> list[str]:
        """Get all unique article titles in the store."""
        # Get all metadata
        results = self.collection.get(include=["metadatas"])
        titles = set()
        for meta in results["metadatas"]:
            if meta and "title" in meta:
                titles.add(meta["title"])
        return sorted(titles)

    def get_stats(self) -> dict:
        """Get statistics about the vector store."""
        count = self.collection.count()

        # Get sample of metadatas to count unique articles
        sample = self.collection.get(limit=10000, include=["metadatas"])
        unique_articles = len(set(m["article_id"] for m in sample["metadatas"] if m))

        return {
            "total_chunks": count,
            "unique_articles_sampled": unique_articles,
            "persist_dir": str(self.persist_dir),
        }


def main():
    """CLI entry point for generating embeddings."""
    articles_path = settings.data_dir / "articles.json"

    if not articles_path.exists():
        console.print(f"[red]Articles file not found: {articles_path}[/red]")
        console.print("[yellow]Run 'python -m src.parser' first to parse XML dumps[/yellow]")
        return

    console.print(f"[cyan]Loading articles from {articles_path}...[/cyan]")
    with open(articles_path, "r", encoding="utf-8") as f:
        articles_data = json.load(f)

    articles = [WikiArticle(**a) for a in articles_data]
    console.print(f"[green]Loaded {len(articles)} articles[/green]")

    store = WikiVectorStore()
    store.add_articles(articles)

    stats = store.get_stats()
    console.print(f"[green]Vector store stats: {stats}[/green]")


if __name__ == "__main__":
    main()
@@ -0,0 +1,467 @@
"""Article ingress pipeline - scrape, analyze, and draft wiki content."""

import json
import re
from dataclasses import dataclass, field, asdict
from datetime import datetime
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

import httpx
import trafilatura
from bs4 import BeautifulSoup
from rich.console import Console

from .config import settings
from .embeddings import WikiVectorStore
from .llm import llm_client

console = Console()


@dataclass
class ScrapedArticle:
    """Represents a scraped external article."""

    url: str
    title: str
    content: str
    author: Optional[str] = None
    date: Optional[str] = None
    domain: str = ""
    word_count: int = 0

    def __post_init__(self):
        if not self.domain:
            self.domain = urlparse(self.url).netloc
        if not self.word_count:
            self.word_count = len(self.content.split())


@dataclass
class WikiMatch:
    """A matching wiki article for citation."""

    title: str
    article_id: int
    relevance_score: float
    categories: list[str]
    suggested_citation: str  # How to cite the scraped article in this wiki page


@dataclass
class DraftArticle:
    """A draft wiki article generated from scraped content."""

    title: str
    content: str  # MediaWiki formatted content
    categories: list[str]
    source_url: str
    source_title: str
    summary: str
    related_articles: list[str]  # Existing wiki articles to link to


@dataclass
class IngressResult:
    """Result of the ingress pipeline."""

    scraped: ScrapedArticle
    analysis: dict  # Topic analysis results
    wiki_matches: list[WikiMatch]  # Existing articles to update with citations
    draft_articles: list[DraftArticle]  # New articles to create
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())

    def to_dict(self) -> dict:
        return {
            "scraped": asdict(self.scraped),
            "analysis": self.analysis,
            "wiki_matches": [asdict(m) for m in self.wiki_matches],
            "draft_articles": [asdict(d) for d in self.draft_articles],
            "timestamp": self.timestamp,
        }


class ArticleScraper:
    """Scrapes and extracts content from URLs."""

    async def scrape(self, url: str) -> ScrapedArticle:
        """Scrape article content from a URL."""
        console.print(f"[cyan]Scraping: {url}[/cyan]")

        async with httpx.AsyncClient(
            timeout=30.0,
            follow_redirects=True,
            headers={
                "User-Agent": "Mozilla/5.0 (compatible; P2PWikiBot/1.0; +http://p2pfoundation.net)"
            },
        ) as client:
            response = await client.get(url)
            response.raise_for_status()
            html = response.text

        # Use trafilatura for main content extraction
        content = trafilatura.extract(
            html,
            include_comments=False,
            include_tables=True,
            no_fallback=False,
        )

        # Parse once with BeautifulSoup, for metadata and as a content fallback
        soup = BeautifulSoup(html, "html.parser")

        if not content:
            # Fallback to BeautifulSoup text extraction:
            # remove script, style, and page-chrome elements first
            for element in soup(["script", "style", "nav", "footer", "header"]):
                element.decompose()
            content = soup.get_text(separator="\n", strip=True)

        # Extract metadata
        title = ""
        title_tag = soup.find("title")
        if title_tag:
            title = title_tag.get_text(strip=True)
        # Prefer og:title when present
        og_title = soup.find("meta", property="og:title")
        if og_title and og_title.get("content"):
            title = og_title["content"]

        author = None
        author_meta = soup.find("meta", attrs={"name": "author"})
        if author_meta and author_meta.get("content"):
            author = author_meta["content"]

        date = None
        date_meta = soup.find("meta", attrs={"name": "date"}) or soup.find(
            "meta", property="article:published_time"
        )
        if date_meta and date_meta.get("content"):
            date = date_meta["content"]

        return ScrapedArticle(
            url=url,
            title=title,
            content=content or "",
            author=author,
            date=date,
        )


class ContentAnalyzer:
    """Analyzes scraped content for wiki relevance."""

    def __init__(self, vector_store: Optional[WikiVectorStore] = None):
        self.vector_store = vector_store or WikiVectorStore()

    async def analyze(self, article: ScrapedArticle) -> dict:
        """Analyze article for topics, concepts, and wiki relevance."""
        # Truncate very long articles for analysis
        content_for_analysis = article.content[:8000]

        # llm_client.analyze() appends the content itself, so the task prompt
        # carries only the instructions (avoids sending the text twice).
        analysis_prompt = f"""Analyze this article for potential wiki content about peer-to-peer culture, commons, alternative economics, and collaborative governance.

Article Title: {article.title}
Source: {article.domain}

Please provide your analysis in the following JSON format:
{{
    "main_topics": ["topic1", "topic2"],
    "key_concepts": ["concept1", "concept2"],
    "relevant_categories": ["category1", "category2"],
    "summary": "2-3 sentence summary",
    "wiki_relevance_score": 0.0-1.0,
    "suggested_article_titles": ["Title 1", "Title 2"],
    "key_quotes": ["notable quote 1", "notable quote 2"],
    "mentioned_organizations": ["org1", "org2"],
    "mentioned_people": ["person1", "person2"]
}}

Focus on topics relevant to:
- Peer-to-peer networks and culture
- Commons-based peer production
- Alternative economics and post-capitalism
- Cooperative business models
- Open source / free culture
- Collaborative governance
- Sustainability and ecology"""

        response = await llm_client.analyze(
            content=content_for_analysis,
            task=analysis_prompt,
            temperature=0.3,
        )

        # Parse JSON from response
        try:
            # Find JSON in response
            json_match = re.search(r"\{[\s\S]*\}", response)
            if json_match:
                analysis = json.loads(json_match.group())
            else:
                analysis = {"error": "Could not parse analysis", "raw": response}
        except json.JSONDecodeError:
            analysis = {"error": "Invalid JSON in analysis", "raw": response}

        return analysis

    async def find_wiki_matches(
        self, article: ScrapedArticle, analysis: dict, n_results: int = 10
    ) -> list[WikiMatch]:
        """Find existing wiki articles that could cite this content."""
        matches = []

        # Search using main topics and concepts
        search_terms = analysis.get("main_topics", []) + analysis.get("key_concepts", [])

        for term in search_terms[:5]:  # Limit searches
            results = self.vector_store.search(term, n_results=3)

            for result in results:
                title = result["metadata"].get("title", "Unknown")
                article_id = result["metadata"].get("article_id", 0)
                distance = result.get("distance", 1.0)

                # Skip if already added
                if any(m.title == title for m in matches):
                    continue

                # Calculate relevance (lower distance = higher relevance)
                relevance = max(0, 1 - distance)

                if relevance > 0.3:  # Threshold for relevance
                    matches.append(
                        WikiMatch(
                            title=title,
                            article_id=article_id,
                            relevance_score=relevance,
                            categories=result["metadata"].get("categories", "").split(","),
                            suggested_citation=f"See also: [{article.title}]({article.url})",
                        )
                    )

        # Sort by relevance and limit
        matches.sort(key=lambda m: m.relevance_score, reverse=True)
        return matches[:n_results]


class DraftGenerator:
    """Generates draft wiki articles from scraped content."""

    def __init__(self, vector_store: Optional[WikiVectorStore] = None):
        self.vector_store = vector_store or WikiVectorStore()

    async def generate_drafts(
        self,
        article: ScrapedArticle,
        analysis: dict,
        max_drafts: int = 3,
    ) -> list[DraftArticle]:
        """Generate draft wiki articles based on scraped content."""
        drafts = []

        suggested_titles = analysis.get("suggested_article_titles", [])
        if not suggested_titles:
            return drafts

        for title in suggested_titles[:max_drafts]:
            # Check if article already exists
            existing = self.vector_store.search(title, n_results=1)
            if existing and existing[0].get("distance", 1.0) < 0.1:
                console.print(f"[yellow]Skipping '{title}' - similar article exists[/yellow]")
                continue

            draft = await self._generate_single_draft(article, analysis, title)
            if draft:
                drafts.append(draft)

        return drafts

    async def _generate_single_draft(
        self,
        article: ScrapedArticle,
        analysis: dict,
        title: str,
    ) -> Optional[DraftArticle]:
        """Generate a single draft article."""
        # Find related existing articles
        related_search = self.vector_store.search(title, n_results=5)
        related_titles = [
            r["metadata"].get("title", "")
            for r in related_search
            if r.get("distance", 1.0) < 0.5
        ]

        categories = analysis.get("relevant_categories", [])
        summary = analysis.get("summary", "")

        draft_prompt = f"""Create a MediaWiki-formatted article for the P2P Foundation Wiki.

Article Title: {title}

Source Material:
Title: {article.title}
URL: {article.url}
Summary: {summary}

Key concepts to cover: {', '.join(analysis.get('key_concepts', []))}

Related existing wiki articles: {', '.join(related_titles)}

Categories to include: {', '.join(categories)}

Please write the wiki article in MediaWiki markup format with:
1. An introduction/definition section
2. A "Description" section with key information
3. Links to related wiki articles using [[Article Name]] format
4. A "Sources" section citing the original article
5. Category tags at the end using [[Category:Name]] format

The article should:
- Be encyclopedic and neutral in tone
- Focus on the P2P/commons aspects of the topic
- Be approximately 300-500 words
- Include internal wiki links to related concepts"""

        content = await llm_client.generate_draft(
            draft_prompt,
            system="You are a wiki editor for the P2P Foundation Wiki. Write clear, encyclopedic articles in MediaWiki markup format.",
            temperature=0.5,
        )

        return DraftArticle(
            title=title,
            content=content,
            categories=categories,
            source_url=article.url,
            source_title=article.title,
            summary=summary,
            related_articles=related_titles,
        )


class IngressPipeline:
    """Complete ingress pipeline for processing external articles."""

    def __init__(self, vector_store: Optional[WikiVectorStore] = None):
        self.vector_store = vector_store or WikiVectorStore()
        self.scraper = ArticleScraper()
        self.analyzer = ContentAnalyzer(self.vector_store)
        self.generator = DraftGenerator(self.vector_store)

    async def process(self, url: str) -> IngressResult:
        """Process a URL through the complete ingress pipeline."""
        console.print(f"[bold cyan]Processing: {url}[/bold cyan]")

        # Step 1: Scrape
        console.print("[cyan]Step 1/4: Scraping article...[/cyan]")
        scraped = await self.scraper.scrape(url)
        console.print(f"[green]Scraped: {scraped.title} ({scraped.word_count} words)[/green]")

        # Step 2: Analyze
        console.print("[cyan]Step 2/4: Analyzing content...[/cyan]")
        analysis = await self.analyzer.analyze(scraped)
        console.print(f"[green]Found {len(analysis.get('main_topics', []))} main topics[/green]")

        # Step 3: Find wiki matches
        console.print("[cyan]Step 3/4: Finding wiki matches...[/cyan]")
        matches = await self.analyzer.find_wiki_matches(scraped, analysis)
        console.print(f"[green]Found {len(matches)} potential wiki matches[/green]")

        # Step 4: Generate drafts
        console.print("[cyan]Step 4/4: Generating draft articles...[/cyan]")
        drafts = await self.generator.generate_drafts(scraped, analysis)
        console.print(f"[green]Generated {len(drafts)} draft articles[/green]")

        result = IngressResult(
            scraped=scraped,
            analysis=analysis,
            wiki_matches=matches,
            draft_articles=drafts,
        )

        # Save to review queue
        self._save_to_review_queue(result)

        return result

    def _save_to_review_queue(self, result: IngressResult):
        """Save ingress result to the review queue."""
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        domain = result.scraped.domain.replace(".", "_")
        filename = f"{timestamp}_{domain}.json"
        filepath = settings.review_queue_dir / filename

        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(result.to_dict(), f, indent=2, ensure_ascii=False)

        console.print(f"[green]Saved to review queue: {filepath}[/green]")


def get_review_queue() -> list[dict]:
    """Get all items in the review queue."""
    queue_files = sorted(settings.review_queue_dir.glob("*.json"), reverse=True)

    items = []
    for filepath in queue_files:
        with open(filepath, "r", encoding="utf-8") as f:
            data = json.load(f)
        data["_filepath"] = str(filepath)
        items.append(data)

    return items


def approve_item(filepath: str, item_type: str, item_index: int) -> bool:
    """
    Approve an item from the review queue.

    Args:
        filepath: Path to the review queue JSON file
        item_type: "match" or "draft"
        item_index: Index of the item to approve

    Returns:
        True if successful
    """
    # For now, just mark as approved in the file
    # In production, this would push to the MediaWiki API
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)

    if item_type == "match":
        if item_index < len(data.get("wiki_matches", [])):
            data["wiki_matches"][item_index]["approved"] = True
    elif item_type == "draft":
        if item_index < len(data.get("draft_articles", [])):
            data["draft_articles"][item_index]["approved"] = True

    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    return True
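
# Usage sketch (illustrative; the filename here is hypothetical - real queue
# files are written by IngressPipeline._save_to_review_queue under
# settings.review_queue_dir):
#   approve_item("data/review_queue/20240101_120000_example_com.json", "draft", 0)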


def reject_item(filepath: str, item_type: str, item_index: int) -> bool:
    """Reject an item from the review queue."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)

    if item_type == "match":
        if item_index < len(data.get("wiki_matches", [])):
            data["wiki_matches"][item_index]["rejected"] = True
    elif item_type == "draft":
        if item_index < len(data.get("draft_articles", [])):
            data["draft_articles"][item_index]["rejected"] = True

    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    return True
@@ -0,0 +1,153 @@
"""LLM client with hybrid routing between Ollama and Claude."""

from typing import Optional

import httpx
from anthropic import AsyncAnthropic
from tenacity import retry, stop_after_attempt, wait_exponential

from .config import settings


class LLMClient:
    """Unified LLM client with hybrid routing."""

    def __init__(self):
        self.ollama_url = settings.ollama_base_url
        self.ollama_model = settings.ollama_model

        # Initialize Claude client if an API key is set. The async client is
        # used so Claude calls don't block the event loop.
        self.claude_client: Optional[AsyncAnthropic] = None
        if settings.anthropic_api_key:
            self.claude_client = AsyncAnthropic(api_key=settings.anthropic_api_key)

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
    async def _call_ollama(
        self,
        prompt: str,
        system: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: int = 2048,
    ) -> str:
        """Call the Ollama chat API."""
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})

        async with httpx.AsyncClient(timeout=120.0) as client:
            response = await client.post(
                f"{self.ollama_url}/api/chat",
                json={
                    "model": self.ollama_model,
                    "messages": messages,
                    "stream": False,
                    "options": {
                        "temperature": temperature,
                        "num_predict": max_tokens,
                    },
                },
            )
            response.raise_for_status()
            data = response.json()
            return data["message"]["content"]

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
    async def _call_claude(
        self,
        prompt: str,
        system: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: int = 4096,
    ) -> str:
        """Call the Claude API."""
        if not self.claude_client:
            raise ValueError("Claude API key not configured")

        message = await self.claude_client.messages.create(
            model=settings.claude_model,
            max_tokens=max_tokens,
            system=system or "",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return message.content[0].text

    async def chat(
        self,
        prompt: str,
        system: Optional[str] = None,
        use_claude: bool = False,
        temperature: float = 0.7,
        max_tokens: int = 2048,
    ) -> str:
        """
        Chat with an LLM using hybrid routing.

        Args:
            prompt: User prompt
            system: System prompt
            use_claude: Prefer the Claude API (falls back to Ollama if it is not configured)
            temperature: Sampling temperature
            max_tokens: Max response tokens

        Returns:
            LLM response text
        """
        if use_claude and self.claude_client:
            return await self._call_claude(prompt, system, temperature, max_tokens)
        return await self._call_ollama(prompt, system, temperature, max_tokens)

    async def generate_draft(
        self,
        prompt: str,
        system: Optional[str] = None,
        temperature: float = 0.5,
    ) -> str:
        """
        Generate an article draft - uses Claude for higher quality when available.

        Args:
            prompt: Prompt describing what to generate
            system: System prompt for context
            temperature: Lower for more factual output

        Returns:
            Generated draft text
        """
        # Use Claude for drafts if configured, otherwise fall back to Ollama
        use_claude = settings.use_claude_for_drafts and self.claude_client is not None
        return await self.chat(
            prompt, system, use_claude=use_claude, temperature=temperature, max_tokens=4096
        )

    async def analyze(
        self,
        content: str,
        task: str,
        temperature: float = 0.3,
    ) -> str:
        """
        Analyze content for a specific task - uses Claude for complex analysis when available.

        Args:
            content: Content to analyze
            task: Description of the analysis task
            temperature: Lower for more deterministic output

        Returns:
            Analysis result
        """
        prompt = f"""Task: {task}

Content to analyze:
{content}

Provide your analysis:"""

        use_claude = self.claude_client is not None
        return await self.chat(prompt, use_claude=use_claude, temperature=temperature)


# Singleton instance
llm_client = LLMClient()
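
# Usage sketch (illustrative, inside an async context; routing falls back to
# Ollama automatically when no ANTHROPIC_API_KEY is configured):
#   answer = await llm_client.chat("What is commons-based peer production?")
#   draft = await llm_client.generate_draft("Write a stub article on platform cooperatives")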
@@ -0,0 +1,267 @@
"""MediaWiki XML dump parser - converts to structured JSON."""

import json
import re
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Iterator

from lxml import etree
from rich.progress import Progress
from rich.console import Console

from .config import settings

console = Console()

# MediaWiki namespace
MW_NS = {"mw": "http://www.mediawiki.org/xml/export-0.6/"}


@dataclass
class WikiArticle:
    """Represents a parsed wiki article."""

    id: int
    title: str
    content: str  # Raw wikitext
    plain_text: str  # Cleaned plain text for embedding
    categories: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)  # Internal wiki links
    external_links: list[str] = field(default_factory=list)
    timestamp: str = ""
    contributor: str = ""

    def to_dict(self) -> dict:
        return asdict(self)


def clean_wikitext(text: str) -> str:
    """Convert MediaWiki markup to plain text for embedding."""
    if not text:
        return ""

    # Remove templates {{...}}
    text = re.sub(r"\{\{[^}]+\}\}", "", text)

    # Remove categories [[Category:...]]
    text = re.sub(r"\[\[Category:[^\]]+\]\]", "", text, flags=re.IGNORECASE)

    # Convert wiki links [[Page|Display]] or [[Page]] to just the display text
    text = re.sub(r"\[\[([^|\]]+)\|([^\]]+)\]\]", r"\2", text)
    text = re.sub(r"\[\[([^\]]+)\]\]", r"\1", text)

    # Remove external links: [url text] -> text, bare [url] -> dropped
    text = re.sub(r"\[https?://[^\s\]]+ ([^\]]+)\]", r"\1", text)
    text = re.sub(r"\[https?://[^\]]+\]", "", text)

    # Remove wiki formatting
    text = re.sub(r"'''?([^']+)'''?", r"\1", text)  # Bold/italic
    text = re.sub(r"={2,}([^=]+)={2,}", r"\1", text)  # Headers
    text = re.sub(r"^[*#:;]+", "", text, flags=re.MULTILINE)  # List markers

    # Remove HTML tags
    text = re.sub(r"<[^>]+>", "", text)

    # Clean up whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r" {2,}", " ", text)

    return text.strip()
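
# Illustrative example of the transformations above (hand-checked against the
# regexes, not part of the original source):
#   clean_wikitext("'''Commons''' are [[Shared Resources|resources]].")
#   -> "Commons are resources."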


def extract_categories(text: str) -> list[str]:
    """Extract category names from wikitext."""
    pattern = r"\[\[Category:([^\]|]+)"
    return list(set(re.findall(pattern, text, re.IGNORECASE)))


def extract_wiki_links(text: str) -> list[str]:
    """Extract internal wiki links from wikitext."""
    # Match [[Page]] or [[Page|Display]]
    pattern = r"\[\[([^|\]]+)"
    links = re.findall(pattern, text)
    # Filter out categories and files
    return list(
        set(
            link.strip()
            for link in links
            if not link.lower().startswith(("category:", "file:", "image:"))
        )
    )


def extract_external_links(text: str) -> list[str]:
    """Extract external URLs from wikitext."""
    pattern = r"https?://[^\s\]\)\"']+"
    return list(set(re.findall(pattern, text)))
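
# Illustrative example of the extractors (hand-checked, not part of the
# original source):
#   extract_wiki_links("See [[Commons]] and [[Category:Economics]]")
#   -> ["Commons"]   (the category link is filtered out)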


def parse_xml_file(xml_path: Path) -> Iterator[WikiArticle]:
    """Parse a MediaWiki XML dump file and yield articles."""
    context = etree.iterparse(
        str(xml_path), events=("end",), tag="{http://www.mediawiki.org/xml/export-0.6/}page"
    )

    for event, page in context:
        # Get basic info
        title_elem = page.find("mw:title", MW_NS)
        id_elem = page.find("mw:id", MW_NS)
        ns_elem = page.find("mw:ns", MW_NS)

        # Skip non-main namespace pages (talk, user, etc.)
        if ns_elem is not None and ns_elem.text != "0":
            page.clear()
            continue

        title = title_elem.text if title_elem is not None else ""
        page_id = int(id_elem.text) if id_elem is not None else 0

        # Get latest revision
        revision = page.find("mw:revision", MW_NS)
        if revision is None:
            page.clear()
            continue

        text_elem = revision.find("mw:text", MW_NS)
        timestamp_elem = revision.find("mw:timestamp", MW_NS)
        contributor = revision.find("mw:contributor", MW_NS)

        content = text_elem.text if text_elem is not None else ""
        timestamp = timestamp_elem.text if timestamp_elem is not None else ""

        contributor_name = ""
        if contributor is not None:
            username = contributor.find("mw:username", MW_NS)
            if username is not None:
                contributor_name = username.text or ""

        # Skip redirects and empty pages
        if not content or content.lower().startswith("#redirect"):
            page.clear()
            continue

        article = WikiArticle(
            id=page_id,
            title=title,
            content=content,
            plain_text=clean_wikitext(content),
            categories=extract_categories(content),
            links=extract_wiki_links(content),
            external_links=extract_external_links(content),
            timestamp=timestamp,
            contributor=contributor_name,
        )

        # Clear element to free memory
        page.clear()

        yield article


def parse_all_dumps(output_path: Path | None = None) -> list[WikiArticle]:
    """Parse all XML dump files and optionally save to JSON."""
    xml_files = sorted(settings.xmldump_dir.glob("*.xml"))

    if not xml_files:
        console.print(f"[red]No XML files found in {settings.xmldump_dir}[/red]")
        return []

    console.print(f"[green]Found {len(xml_files)} XML files to parse[/green]")

    all_articles = []

    with Progress() as progress:
        task = progress.add_task("[cyan]Parsing XML files...", total=len(xml_files))

        for xml_file in xml_files:
            progress.update(task, description=f"[cyan]Parsing {xml_file.name}...")

            for article in parse_xml_file(xml_file):
                all_articles.append(article)

            progress.advance(task)

    console.print(f"[green]Parsed {len(all_articles)} articles[/green]")

    if output_path:
        console.print(f"[cyan]Saving to {output_path}...[/cyan]")
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump([a.to_dict() for a in all_articles], f, ensure_ascii=False, indent=2)
        console.print(f"[green]Saved {len(all_articles)} articles to {output_path}[/green]")

    return all_articles


def parse_mediawiki_files(articles_dir: Path, output_path: Path | None = None) -> list[WikiArticle]:
    """Parse individual .mediawiki files from a directory (Codeberg format)."""
    mediawiki_files = list(articles_dir.glob("*.mediawiki"))

    if not mediawiki_files:
        console.print(f"[red]No .mediawiki files found in {articles_dir}[/red]")
        return []

    console.print(f"[green]Found {len(mediawiki_files)} .mediawiki files to parse[/green]")

    all_articles = []

    with Progress() as progress:
        task = progress.add_task("[cyan]Parsing files...", total=len(mediawiki_files))

        for i, filepath in enumerate(mediawiki_files):
            # Title is the filename without extension
            title = filepath.stem

            try:
                content = filepath.read_text(encoding="utf-8", errors="replace")
            except Exception as e:
                console.print(f"[yellow]Warning: Could not read {filepath}: {e}[/yellow]")
                progress.advance(task)
                continue

            # Skip redirects and empty files
            if not content or content.strip().lower().startswith("#redirect"):
                progress.advance(task)
                continue

            article = WikiArticle(
                id=i,
                title=title,
                content=content,
                plain_text=clean_wikitext(content),
                categories=extract_categories(content),
                links=extract_wiki_links(content),
                external_links=extract_external_links(content),
                timestamp="",
                contributor="",
            )

            all_articles.append(article)
            progress.advance(task)

    console.print(f"[green]Parsed {len(all_articles)} articles[/green]")

    if output_path:
        console.print(f"[cyan]Saving to {output_path}...[/cyan]")
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump([a.to_dict() for a in all_articles], f, ensure_ascii=False, indent=2)
        console.print(f"[green]Saved {len(all_articles)} articles to {output_path}[/green]")

    return all_articles


def main():
    """CLI entry point for parsing wiki content."""
    output_path = settings.data_dir / "articles.json"

    # Check for Codeberg-style articles directory first (newer, more complete)
    articles_dir = settings.project_root / "articles" / "articles"
    if articles_dir.exists():
        console.print("[cyan]Found Codeberg-style articles directory, using that...[/cyan]")
        parse_mediawiki_files(articles_dir, output_path)
    else:
        # Fall back to XML dumps
        parse_all_dumps(output_path)


if __name__ == "__main__":
    main()
@@ -0,0 +1,159 @@
"""RAG (Retrieval Augmented Generation) system for wiki Q&A."""

from dataclasses import dataclass
from typing import Optional

from .embeddings import WikiVectorStore
from .llm import llm_client


SYSTEM_PROMPT = """You are a knowledgeable assistant for the P2P Foundation Wiki, a comprehensive knowledge base about peer-to-peer culture, commons-based peer production, alternative economics, and collaborative governance.

Your role is to answer questions about the wiki content accurately and helpfully. When answering:

1. Base your answers on the provided wiki content excerpts
2. Cite specific articles when relevant (use the article titles)
3. If the provided content doesn't fully answer the question, say so
4. Explain concepts in accessible language while maintaining accuracy
5. Connect related concepts when helpful

If asked about something not covered in the provided content, acknowledge this and suggest related topics that might be helpful."""


@dataclass
class ChatMessage:
    """A chat message."""

    role: str  # "user" or "assistant"
    content: str


@dataclass
class RAGResponse:
    """Response from the RAG system."""

    answer: str
    sources: list[dict]  # List of source articles used
    query: str


class WikiRAG:
    """RAG system for answering questions about wiki content."""

    def __init__(self, vector_store: Optional[WikiVectorStore] = None):
        self.vector_store = vector_store or WikiVectorStore()
        self.conversation_history: list[ChatMessage] = []

    def _format_context(self, search_results: list[dict]) -> str:
        """Format search results as context for the LLM."""
        if not search_results:
            return "No relevant wiki content found for this query."

        context_parts = []
        for i, result in enumerate(search_results, 1):
            title = result["metadata"].get("title", "Unknown")
            content = result["content"]
            categories = result["metadata"].get("categories", "")

            context_parts.append(
                f"[Source {i}: {title}]\n"
                f"Categories: {categories}\n"
                f"Content:\n{content}\n"
            )

        return "\n---\n".join(context_parts)

    def _build_prompt(self, query: str, context: str) -> str:
        """Build the prompt for the LLM."""
        # Include recent conversation history for context
        history_text = ""
        if self.conversation_history:
            recent = self.conversation_history[-4:]  # Last 2 exchanges
            history_text = "\n\nRecent conversation:\n"
            for msg in recent:
                role = "User" if msg.role == "user" else "Assistant"
                # Truncate long messages
                content = msg.content[:500] + "..." if len(msg.content) > 500 else msg.content
                history_text += f"{role}: {content}\n"

        return f"""Based on the following wiki content, please answer the user's question.

Wiki Content:
{context}
{history_text}
User Question: {query}

Please provide a helpful answer based on the wiki content above. Cite specific articles when relevant."""

    async def ask(
        self,
        query: str,
        n_results: int = 5,
filter_categories: Optional[list[str]] = None,
|
||||
) -> RAGResponse:
|
||||
"""
|
||||
Ask a question and get an answer based on wiki content.
|
||||
|
||||
Args:
|
||||
query: User's question
|
||||
n_results: Number of relevant chunks to retrieve
|
||||
filter_categories: Optional category filter
|
||||
|
||||
Returns:
|
||||
RAGResponse with answer and sources
|
||||
"""
|
||||
# Search for relevant content
|
||||
search_results = self.vector_store.search(
|
||||
query, n_results=n_results, filter_categories=filter_categories
|
||||
)
|
||||
|
||||
# Format context
|
||||
context = self._format_context(search_results)
|
||||
|
||||
# Build prompt
|
||||
prompt = self._build_prompt(query, context)
|
||||
|
||||
# Get LLM response (use Ollama for chat by default)
|
||||
answer = await llm_client.chat(
|
||||
prompt,
|
||||
system=SYSTEM_PROMPT,
|
||||
use_claude=False, # Use Ollama for chat
|
||||
temperature=0.7,
|
||||
)
|
||||
|
||||
# Update conversation history
|
||||
self.conversation_history.append(ChatMessage(role="user", content=query))
|
||||
self.conversation_history.append(ChatMessage(role="assistant", content=answer))
|
||||
|
||||
# Extract unique sources
|
||||
sources = []
|
||||
seen_titles = set()
|
||||
for result in search_results:
|
||||
title = result["metadata"].get("title", "Unknown")
|
||||
if title not in seen_titles:
|
||||
seen_titles.add(title)
|
||||
sources.append(
|
||||
{
|
||||
"title": title,
|
||||
"article_id": result["metadata"].get("article_id"),
|
||||
"categories": result["metadata"].get("categories", "").split(","),
|
||||
}
|
||||
)
|
||||
|
||||
return RAGResponse(answer=answer, sources=sources, query=query)
|
||||
|
||||
def clear_history(self):
|
||||
"""Clear conversation history."""
|
||||
self.conversation_history = []
|
||||
|
||||
def get_suggestions(self, partial_query: str, n_results: int = 5) -> list[str]:
|
||||
"""Get article title suggestions for autocomplete."""
|
||||
# Simple prefix matching on titles
|
||||
all_titles = self.vector_store.get_article_titles()
|
||||
partial_lower = partial_query.lower()
|
||||
|
||||
suggestions = [
|
||||
title for title in all_titles if partial_lower in title.lower()
|
||||
][:n_results]
|
||||
|
||||
return suggestions
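The source-dedup step at the end of `ask` (collapsing chunk-level search hits into one entry per article) can be exercised in isolation; a minimal standalone sketch, where the `hits` data is invented for illustration:

```python
def unique_sources(search_results: list[dict]) -> list[dict]:
    """Collapse chunk-level hits into one source entry per article title."""
    sources, seen_titles = [], set()
    for result in search_results:
        title = result["metadata"].get("title", "Unknown")
        if title not in seen_titles:
            seen_titles.add(title)
            sources.append({
                "title": title,
                "article_id": result["metadata"].get("article_id"),
                "categories": result["metadata"].get("categories", "").split(","),
            })
    return sources

# Hypothetical sample: two chunks from the same article plus one other.
hits = [
    {"metadata": {"title": "Commons", "article_id": 7, "categories": "Economics,Governance"}},
    {"metadata": {"title": "Commons", "article_id": 7, "categories": "Economics,Governance"}},
    {"metadata": {"title": "Peer Production", "article_id": 12, "categories": ""}},
]
print([s["title"] for s in unique_sources(hits)])  # → ['Commons', 'Peer Production']
```

Note that first-seen order is preserved, so the highest-ranked chunk determines where an article appears in the source list.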
@ -0,0 +1,707 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>P2P Wiki AI</title>
    <style>
        :root {
            --bg-primary: #1a1a2e;
            --bg-secondary: #16213e;
            --bg-tertiary: #0f3460;
            --text-primary: #e8e8e8;
            --text-secondary: #a0a0a0;
            --accent: #e94560;
            --accent-hover: #ff6b6b;
            --success: #4ecdc4;
            --border: #2a2a4a;
        }

        * {
            box-sizing: border-box;
            margin: 0;
            padding: 0;
        }

        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            background: var(--bg-primary);
            color: var(--text-primary);
            min-height: 100vh;
        }

        .container {
            max-width: 1200px;
            margin: 0 auto;
            padding: 20px;
        }

        header {
            display: flex;
            justify-content: space-between;
            align-items: center;
            padding: 20px 0;
            border-bottom: 1px solid var(--border);
            margin-bottom: 30px;
        }

        h1 {
            font-size: 1.8em;
            font-weight: 600;
        }

        h1 span {
            color: var(--accent);
        }

        .tabs {
            display: flex;
            gap: 10px;
        }

        .tab {
            padding: 10px 20px;
            background: var(--bg-secondary);
            border: 1px solid var(--border);
            border-radius: 8px;
            cursor: pointer;
            transition: all 0.2s;
        }

        .tab:hover, .tab.active {
            background: var(--bg-tertiary);
            border-color: var(--accent);
        }

        .panel {
            display: none;
        }

        .panel.active {
            display: block;
        }

        /* Chat Panel */
        .chat-container {
            display: flex;
            flex-direction: column;
            height: calc(100vh - 200px);
            background: var(--bg-secondary);
            border-radius: 12px;
            overflow: hidden;
        }

        .chat-messages {
            flex: 1;
            overflow-y: auto;
            padding: 20px;
        }

        .message {
            margin-bottom: 20px;
            max-width: 80%;
        }

        .message.user {
            margin-left: auto;
        }

        .message-content {
            padding: 15px;
            border-radius: 12px;
            line-height: 1.6;
        }

        .message.user .message-content {
            background: var(--bg-tertiary);
        }

        .message.assistant .message-content {
            background: var(--bg-primary);
            border: 1px solid var(--border);
        }

        .message-sources {
            margin-top: 10px;
            padding: 10px;
            background: rgba(233, 69, 96, 0.1);
            border-radius: 8px;
            font-size: 0.9em;
        }

        .message-sources h4 {
            color: var(--accent);
            margin-bottom: 5px;
        }

        .source-tag {
            display: inline-block;
            padding: 3px 8px;
            margin: 2px;
            background: var(--bg-tertiary);
            border-radius: 4px;
            font-size: 0.85em;
        }

        .chat-input {
            display: flex;
            gap: 10px;
            padding: 20px;
            background: var(--bg-primary);
            border-top: 1px solid var(--border);
        }

        .chat-input input {
            flex: 1;
            padding: 15px;
            background: var(--bg-secondary);
            border: 1px solid var(--border);
            border-radius: 8px;
            color: var(--text-primary);
            font-size: 1em;
        }

        .chat-input input:focus {
            outline: none;
            border-color: var(--accent);
        }

        .chat-input button {
            padding: 15px 30px;
            background: var(--accent);
            border: none;
            border-radius: 8px;
            color: white;
            font-weight: 600;
            cursor: pointer;
            transition: background 0.2s;
        }

        .chat-input button:hover {
            background: var(--accent-hover);
        }

        .chat-input button:disabled {
            opacity: 0.5;
            cursor: not-allowed;
        }

        /* Ingress Panel */
        .ingress-container {
            background: var(--bg-secondary);
            border-radius: 12px;
            padding: 30px;
        }

        .ingress-form {
            display: flex;
            gap: 10px;
            margin-bottom: 30px;
        }

        .ingress-form input {
            flex: 1;
            padding: 15px;
            background: var(--bg-primary);
            border: 1px solid var(--border);
            border-radius: 8px;
            color: var(--text-primary);
            font-size: 1em;
        }

        .ingress-form input:focus {
            outline: none;
            border-color: var(--accent);
        }

        .ingress-form button {
            padding: 15px 30px;
            background: var(--success);
            border: none;
            border-radius: 8px;
            color: var(--bg-primary);
            font-weight: 600;
            cursor: pointer;
            transition: opacity 0.2s;
        }

        .ingress-form button:hover {
            opacity: 0.9;
        }

        .ingress-form button:disabled {
            opacity: 0.5;
            cursor: not-allowed;
        }

        .ingress-result {
            background: var(--bg-primary);
            border-radius: 8px;
            padding: 20px;
            margin-bottom: 20px;
        }

        .ingress-result h3 {
            margin-bottom: 15px;
            color: var(--accent);
        }

        .result-stats {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
            gap: 15px;
            margin-bottom: 20px;
        }

        .stat {
            background: var(--bg-secondary);
            padding: 15px;
            border-radius: 8px;
            text-align: center;
        }

        .stat-value {
            font-size: 2em;
            font-weight: bold;
            color: var(--success);
        }

        .stat-label {
            color: var(--text-secondary);
            font-size: 0.9em;
        }

        /* Review Panel */
        .review-container {
            background: var(--bg-secondary);
            border-radius: 12px;
            padding: 30px;
        }

        .review-item {
            background: var(--bg-primary);
            border-radius: 8px;
            padding: 20px;
            margin-bottom: 20px;
        }

        .review-item h3 {
            margin-bottom: 10px;
        }

        .review-meta {
            color: var(--text-secondary);
            font-size: 0.9em;
            margin-bottom: 15px;
        }

        .review-section {
            margin-top: 20px;
            padding-top: 20px;
            border-top: 1px solid var(--border);
        }

        .review-section h4 {
            margin-bottom: 10px;
            color: var(--accent);
        }

        .match-item, .draft-item {
            background: var(--bg-secondary);
            padding: 15px;
            border-radius: 8px;
            margin-bottom: 10px;
        }

        .match-item .title, .draft-item .title {
            font-weight: 600;
            margin-bottom: 5px;
        }

        .match-item .score {
            color: var(--success);
        }

        .action-buttons {
            display: flex;
            gap: 10px;
            margin-top: 10px;
        }

        .btn-approve {
            padding: 8px 16px;
            background: var(--success);
            border: none;
            border-radius: 4px;
            color: var(--bg-primary);
            cursor: pointer;
        }

        .btn-reject {
            padding: 8px 16px;
            background: var(--accent);
            border: none;
            border-radius: 4px;
            color: white;
            cursor: pointer;
        }

        .loading {
            display: inline-block;
            width: 20px;
            height: 20px;
            border: 2px solid var(--text-secondary);
            border-top-color: var(--accent);
            border-radius: 50%;
            animation: spin 1s linear infinite;
        }

        @keyframes spin {
            to { transform: rotate(360deg); }
        }

        .empty-state {
            text-align: center;
            padding: 50px;
            color: var(--text-secondary);
        }

        /* Markdown-like formatting */
        .message-content p { margin-bottom: 10px; }
        .message-content ul, .message-content ol { margin-left: 20px; margin-bottom: 10px; }
        .message-content code { background: var(--bg-tertiary); padding: 2px 6px; border-radius: 4px; }
        .message-content pre { background: var(--bg-tertiary); padding: 15px; border-radius: 8px; overflow-x: auto; }
    </style>
</head>
<body>
    <div class="container">
        <header>
            <h1>P2P Wiki <span>AI</span></h1>
            <div class="tabs">
                <div class="tab active" data-panel="chat">Chat</div>
                <div class="tab" data-panel="ingress">Ingress</div>
                <div class="tab" data-panel="review">Review Queue</div>
            </div>
        </header>

        <!-- Chat Panel -->
        <div id="chat" class="panel active">
            <div class="chat-container">
                <div class="chat-messages" id="chatMessages">
                    <div class="message assistant">
                        <div class="message-content">
                            <p>Welcome to the P2P Wiki AI assistant! I can help you explore the P2P Foundation Wiki's knowledge about peer-to-peer culture, commons-based peer production, alternative economics, and collaborative governance.</p>
                            <p>Ask me anything about these topics!</p>
                        </div>
                    </div>
                </div>
                <div class="chat-input">
                    <input type="text" id="chatInput" placeholder="Ask about P2P, commons, cooperative economics..." />
                    <button id="chatSend">Send</button>
                </div>
            </div>
        </div>

        <!-- Ingress Panel -->
        <div id="ingress" class="panel">
            <div class="ingress-container">
                <h2>Article Ingress</h2>
                <p style="color: var(--text-secondary); margin-bottom: 20px;">
                    Drop an article URL to analyze it for wiki content. The AI will identify relevant topics,
                    find matching wiki articles for citations, and draft new articles.
                </p>
                <div class="ingress-form">
                    <input type="url" id="ingressUrl" placeholder="https://example.com/article-about-commons" />
                    <button id="ingressSubmit">Process Article</button>
                </div>
                <div id="ingressResult"></div>
            </div>
        </div>

        <!-- Review Panel -->
        <div id="review" class="panel">
            <div class="review-container">
                <h2>Review Queue</h2>
                <p style="color: var(--text-secondary); margin-bottom: 20px;">
                    Review and approve AI-generated wiki content before it's added to the wiki.
                </p>
                <div id="reviewItems">
                    <div class="empty-state">Loading review items...</div>
                </div>
            </div>
        </div>
    </div>

    <script>
        const API_BASE = ''; // Same origin

        // Tab switching
        document.querySelectorAll('.tab').forEach(tab => {
            tab.addEventListener('click', () => {
                document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
                document.querySelectorAll('.panel').forEach(p => p.classList.remove('active'));
                tab.classList.add('active');
                document.getElementById(tab.dataset.panel).classList.add('active');

                // Load review items when switching to review tab
                if (tab.dataset.panel === 'review') {
                    loadReviewItems();
                }
            });
        });

        // Chat functionality
        const chatMessages = document.getElementById('chatMessages');
        const chatInput = document.getElementById('chatInput');
        const chatSend = document.getElementById('chatSend');

        function addMessage(content, role, sources = []) {
            const div = document.createElement('div');
            div.className = `message ${role}`;

            let html = `<div class="message-content">${formatMessage(content)}</div>`;

            if (sources.length > 0) {
                html += `<div class="message-sources">
                    <h4>Sources</h4>
                    ${sources.map(s => `<span class="source-tag">${s.title}</span>`).join('')}
                </div>`;
            }

            div.innerHTML = html;
            chatMessages.appendChild(div);
            chatMessages.scrollTop = chatMessages.scrollHeight;
        }

        function formatMessage(text) {
            // Basic markdown-like formatting
            return text
                .replace(/\n\n/g, '</p><p>')
                .replace(/\n/g, '<br>')
                .replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>')
                .replace(/\*(.+?)\*/g, '<em>$1</em>')
                .replace(/`(.+?)`/g, '<code>$1</code>');
        }

        async function sendChat() {
            const query = chatInput.value.trim();
            if (!query) return;

            chatInput.value = '';
            chatSend.disabled = true;

            addMessage(query, 'user');

            // Show loading
            const loadingDiv = document.createElement('div');
            loadingDiv.className = 'message assistant';
            loadingDiv.innerHTML = '<div class="message-content"><span class="loading"></span> Thinking...</div>';
            chatMessages.appendChild(loadingDiv);
            chatMessages.scrollTop = chatMessages.scrollHeight;

            try {
                const response = await fetch(`${API_BASE}/chat`, {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ query, n_results: 5 })
                });

                const data = await response.json();

                chatMessages.removeChild(loadingDiv);

                if (response.ok) {
                    addMessage(data.answer, 'assistant', data.sources);
                } else {
                    addMessage(`Error: ${data.detail || 'Something went wrong'}`, 'assistant');
                }
            } catch (error) {
                chatMessages.removeChild(loadingDiv);
                addMessage(`Error: ${error.message}`, 'assistant');
            }

            chatSend.disabled = false;
            chatInput.focus();
        }

        chatSend.addEventListener('click', sendChat);
        chatInput.addEventListener('keypress', (e) => {
            if (e.key === 'Enter') sendChat();
        });

        // Ingress functionality
        const ingressUrl = document.getElementById('ingressUrl');
        const ingressSubmit = document.getElementById('ingressSubmit');
        const ingressResult = document.getElementById('ingressResult');

        async function processIngress() {
            const url = ingressUrl.value.trim();
            if (!url) return;

            ingressSubmit.disabled = true;
            ingressSubmit.textContent = 'Processing...';

            ingressResult.innerHTML = `
                <div class="ingress-result">
                    <h3>Processing Article</h3>
                    <p><span class="loading"></span> Scraping and analyzing content...</p>
                </div>
            `;

            try {
                const response = await fetch(`${API_BASE}/ingress`, {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ url })
                });

                const data = await response.json();

                if (response.ok) {
                    ingressResult.innerHTML = `
                        <div class="ingress-result">
                            <h3>Analysis Complete: ${data.scraped_title || 'Article'}</h3>
                            <div class="result-stats">
                                <div class="stat">
                                    <div class="stat-value">${data.topics_found}</div>
                                    <div class="stat-label">Topics Found</div>
                                </div>
                                <div class="stat">
                                    <div class="stat-value">${data.wiki_matches}</div>
                                    <div class="stat-label">Wiki Matches</div>
                                </div>
                                <div class="stat">
                                    <div class="stat-value">${data.drafts_generated}</div>
                                    <div class="stat-label">Drafts Generated</div>
                                </div>
                            </div>
                            <p style="color: var(--success);">
                                Results added to review queue. Check the Review Queue tab to approve or reject suggestions.
                            </p>
                        </div>
                    `;
                } else {
                    ingressResult.innerHTML = `
                        <div class="ingress-result">
                            <h3 style="color: var(--accent);">Error</h3>
                            <p>${data.detail || 'Failed to process article'}</p>
                        </div>
                    `;
                }
            } catch (error) {
                ingressResult.innerHTML = `
                    <div class="ingress-result">
                        <h3 style="color: var(--accent);">Error</h3>
                        <p>${error.message}</p>
                    </div>
                `;
            }

            ingressSubmit.disabled = false;
            ingressSubmit.textContent = 'Process Article';
        }

        ingressSubmit.addEventListener('click', processIngress);
        ingressUrl.addEventListener('keypress', (e) => {
            if (e.key === 'Enter') processIngress();
        });

        // Review functionality
        const reviewItems = document.getElementById('reviewItems');

        async function loadReviewItems() {
            try {
                const response = await fetch(`${API_BASE}/review`);
                const data = await response.json();

                if (data.count === 0) {
                    reviewItems.innerHTML = '<div class="empty-state">No items in the review queue.</div>';
                    return;
                }

                reviewItems.innerHTML = data.items.map(item => `
                    <div class="review-item">
                        <h3>${item.scraped?.title || 'Unknown Article'}</h3>
                        <div class="review-meta">
                            Source: <a href="${item.scraped?.url}" target="_blank">${item.scraped?.domain}</a>
                            | Processed: ${new Date(item.timestamp).toLocaleString()}
                        </div>

                        ${item.wiki_matches?.length > 0 ? `
                            <div class="review-section">
                                <h4>Suggested Citations (${item.wiki_matches.length})</h4>
                                ${item.wiki_matches.map((match, i) => `
                                    <div class="match-item" ${match.approved ? 'style="opacity: 0.5"' : ''}>
                                        <div class="title">${match.title}</div>
                                        <div class="score">Relevance: ${(match.relevance_score * 100).toFixed(0)}%</div>
                                        <div>${match.suggested_citation}</div>
                                        ${!match.approved && !match.rejected ? `
                                            <div class="action-buttons">
                                                <button class="btn-approve" onclick="reviewAction('${item._filepath}', 'match', ${i}, 'approve')">Approve</button>
                                                <button class="btn-reject" onclick="reviewAction('${item._filepath}', 'match', ${i}, 'reject')">Reject</button>
                                            </div>
                                        ` : `<em>${match.approved ? 'Approved' : 'Rejected'}</em>`}
                                    </div>
                                `).join('')}
                            </div>
                        ` : ''}

                        ${item.draft_articles?.length > 0 ? `
                            <div class="review-section">
                                <h4>Draft Articles (${item.draft_articles.length})</h4>
                                ${item.draft_articles.map((draft, i) => `
                                    <div class="draft-item" ${draft.approved ? 'style="opacity: 0.5"' : ''}>
                                        <div class="title">${draft.title}</div>
                                        <div style="color: var(--text-secondary); font-size: 0.9em; margin-bottom: 10px;">
                                            ${draft.summary || ''}
                                        </div>
                                        <details>
                                            <summary style="cursor: pointer; color: var(--accent);">View Draft Content</summary>
                                            <pre style="margin-top: 10px; white-space: pre-wrap; font-size: 0.85em;">${draft.content}</pre>
                                        </details>
                                        ${!draft.approved && !draft.rejected ? `
                                            <div class="action-buttons">
                                                <button class="btn-approve" onclick="reviewAction('${item._filepath}', 'draft', ${i}, 'approve')">Approve</button>
                                                <button class="btn-reject" onclick="reviewAction('${item._filepath}', 'draft', ${i}, 'reject')">Reject</button>
                                            </div>
                                        ` : `<em>${draft.approved ? 'Approved' : 'Rejected'}</em>`}
                                    </div>
                                `).join('')}
                            </div>
                        ` : ''}
                    </div>
                `).join('');
            } catch (error) {
                reviewItems.innerHTML = `<div class="empty-state">Error loading review items: ${error.message}</div>`;
            }
        }

        async function reviewAction(filepath, itemType, itemIndex, action) {
            try {
                const response = await fetch(`${API_BASE}/review/action`, {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({
                        filepath,
                        item_type: itemType,
                        item_index: itemIndex,
                        action
                    })
                });

                if (response.ok) {
                    loadReviewItems(); // Refresh the list
                } else {
                    const data = await response.json();
                    alert(`Error: ${data.detail || 'Action failed'}`);
                }
            } catch (error) {
                alert(`Error: ${error.message}`);
            }
        }

        // Make reviewAction available globally
        window.reviewAction = reviewAction;
    </script>
</body>
</html>