Initial commit: P2P Wiki AI system

- RAG-based chat with 39k wiki articles (232k chunks)
- Article ingress pipeline for processing external URLs
- Review queue for AI-generated content
- FastAPI backend with web UI
- Traefik-ready Docker setup for p2pwiki.jeffemmett.com

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Jeff Emmett 2026-01-23 13:53:29 +01:00
commit 4ebd90cc64
16 changed files with 26481 additions and 0 deletions

17
.env.example Normal file

@ -0,0 +1,17 @@
# P2P Wiki AI Configuration
# Ollama (Local LLM)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2
# Claude API (Optional - for higher quality article drafts)
ANTHROPIC_API_KEY=
CLAUDE_MODEL=claude-sonnet-4-20250514
# Hybrid Routing
USE_CLAUDE_FOR_DRAFTS=true
USE_OLLAMA_FOR_CHAT=true
# Server
HOST=0.0.0.0
PORT=8420

28
.gitignore vendored Normal file

@ -0,0 +1,28 @@
# Virtual environment
.venv/
venv/
env/
# Python
__pycache__/
*.py[cod]
*.egg-info/
dist/
build/
# Data files (too large for git)
data/articles.json
data/chroma/
data/review_queue/
xmldump/
xmldump-2014.tar.gz
articles/
articles.tar.gz
# Environment
.env
# IDE
.idea/
.vscode/
*.swp

49
Dockerfile Normal file

@ -0,0 +1,49 @@
# P2P Wiki AI - Multi-stage build
FROM python:3.11-slim AS builder
WORKDIR /app
# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY pyproject.toml .
RUN pip install --no-cache-dir build && \
pip wheel --no-cache-dir --wheel-dir /wheels .
# Production image
FROM python:3.11-slim
WORKDIR /app
# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
libxml2 \
&& rm -rf /var/lib/apt/lists/*
# Copy wheels and install
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*.whl && rm -rf /wheels
# Copy application code
COPY src/ src/
COPY web/ web/
# Create data directories
RUN mkdir -p data/chroma data/review_queue
# Environment variables
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
# Expose port
EXPOSE 8420
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import httpx; httpx.get('http://localhost:8420/health').raise_for_status()" || exit 1
# Run the application
CMD ["python", "-m", "uvicorn", "src.api:app", "--host", "0.0.0.0", "--port", "8420"]

199
README.md Normal file

@ -0,0 +1,199 @@
# P2P Wiki AI
AI-augmented system for the P2P Foundation Wiki with two main features:
1. **Conversational Agent** - Ask questions about the 23,000+ wiki articles using RAG (Retrieval Augmented Generation)
2. **Article Ingress Pipeline** - Drop article URLs to automatically analyze content, find matching wiki articles for citations, and generate draft articles
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                     P2P Wiki AI System                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐     ┌─────────────────┐                    │
│  │  Chat (Q&A)     │     │  Ingress Tool   │                    │
│  │  via RAG        │     │  (URL Drop)     │                    │
│  └────────┬────────┘     └────────┬────────┘                    │
│           │                       │                             │
│           └───────────┬───────────┘                             │
│                       ▼                                         │
│           ┌───────────────────────┐                             │
│           │   FastAPI Backend     │                             │
│           └───────────┬───────────┘                             │
│                       │                                         │
│        ┌──────────────┼──────────────┐                          │
│        ▼              ▼              ▼                          │
│  ┌──────────┐  ┌─────────────┐  ┌──────────────┐                │
│  │ ChromaDB │  │ Ollama/     │  │ Article      │                │
│  │ (Vector) │  │ Claude      │  │ Scraper      │                │
│  └──────────┘  └─────────────┘  └──────────────┘                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
## Quick Start
### 1. Prerequisites
- Python 3.10+
- [Ollama](https://ollama.ai) installed locally (or access to a remote Ollama server)
- Optional: Anthropic API key for Claude (higher quality article drafts)
### 2. Install Dependencies
```bash
cd /home/jeffe/Github/p2pwiki-content
pip install -e .
```
### 3. Parse Wiki Content
Convert the MediaWiki XML dumps to searchable JSON:
```bash
python -m src.parser
```
This creates `data/articles.json` with all parsed articles (~23,000 pages).
### 4. Generate Embeddings
Create the vector store for semantic search:
```bash
python -m src.embeddings
```
This creates the ChromaDB vector store in `data/chroma/`. Takes a few minutes.
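The overlapping-chunk idea behind this step can be sketched roughly as follows. This is a simplified illustration, assuming the 1000-character chunk size and 200-character overlap defined in `src/embeddings.py`. The function name `chunk_text` is hypothetical, and the sketch leaves out the sentence-boundary adjustment and title prefix that the project code applies.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where neighbours share `overlap` characters.

    Hypothetical helper for illustration; not the project's own function.
    """
    if len(text) <= size:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # the text is exhausted, so no further chunk is needed
        start = end - overlap  # step back so neighbouring chunks overlap
    return chunks
```

With these defaults, a 2,500-character article yields three chunks, and the last 200 characters of each chunk reappear at the start of the next one.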
### 5. Configure Environment
```bash
cp .env.example .env
# Edit .env with your settings
```
### 6. Run the Server
```bash
python -m src.api
```
Visit http://localhost:8420/ui for the web interface.
## Docker Deployment
For production deployment on the RS 8000:
```bash
# Build and run
docker compose up -d --build
# Check logs
docker compose logs -f
# Access at http://localhost:8420/ui
# Or via Traefik at https://p2pwiki.jeffemmett.com
```
## API Endpoints
### Chat
```bash
# Ask a question
curl -X POST http://localhost:8420/chat \
-H "Content-Type: application/json" \
-d '{"query": "What is commons-based peer production?"}'
```
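The same call can be made from Python. The following is a minimal sketch using only the standard library, assuming the server runs locally on the default port. The helper names `build_chat_request` and `ask` are hypothetical; the `query` and `n_results` fields mirror the `ChatRequest` model in `src/api.py`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8420"  # assumption: local server on the default port


def build_chat_request(
    query: str, n_results: int = 5, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST /chat request."""
    payload = {"query": query, "n_results": n_results}
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str) -> dict:
    """Send the request and return the decoded JSON body (answer, sources, query)."""
    with urllib.request.urlopen(build_chat_request(query), timeout=60) as resp:
        return json.load(resp)
```

Calling `ask("What is commons-based peer production?")` against a running server should return a dictionary with `answer`, `sources`, and `query` keys.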
### Ingress
```bash
# Process an external article
curl -X POST http://localhost:8420/ingress \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/article-about-cooperatives"}'
```
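The JSON returned by `/ingress` follows the `IngressResponse` model in `src/api.py`. The sketch below shows one way a client might summarize it; the helper name `summarize_ingress` is hypothetical, and the sample values in the usage note are made up for illustration.

```python
def summarize_ingress(response: dict) -> str:
    """Turn an /ingress response body into a one-line summary."""
    if response.get("status") != "success":
        return f"Ingress failed: {response.get('message', 'unknown error')}"
    title = response.get("scraped_title") or "(untitled)"
    return (
        f"{title}: {response.get('topics_found', 0)} topics, "
        f"{response.get('wiki_matches', 0)} wiki matches, "
        f"{response.get('drafts_generated', 0)} drafts"
    )
```

For example, a response with `scraped_title` set to `"Platform Cooperatives"`, three topics, five matches, and two drafts would be summarized as `Platform Cooperatives: 3 topics, 5 wiki matches, 2 drafts`.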
### Review Queue
```bash
# Get all items in review queue
curl http://localhost:8420/review
# Approve a draft article
curl -X POST http://localhost:8420/review/action \
-H "Content-Type: application/json" \
-d '{"filepath": "/path/to/item.json", "item_type": "draft", "item_index": 0, "action": "approve"}'
```
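A review action is a small JSON document. The sketch below builds one and validates it on the client side, assuming the allowed values noted in the `ReviewActionRequest` model (`"match"` or `"draft"`, `"approve"` or `"reject"`). The helper name `build_review_action` is hypothetical. Checking `item_type` on the client is an extra precaution added here; the endpoint as shown only validates `action`.

```python
VALID_ITEM_TYPES = {"match", "draft"}
VALID_ACTIONS = {"approve", "reject"}


def build_review_action(filepath: str, item_type: str, item_index: int, action: str) -> dict:
    """Return the JSON body for POST /review/action, rejecting obviously invalid input."""
    if item_type not in VALID_ITEM_TYPES:
        raise ValueError(f"item_type must be one of {sorted(VALID_ITEM_TYPES)}")
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be one of {sorted(VALID_ACTIONS)}")
    if item_index < 0:
        raise ValueError("item_index must be non-negative")
    return {
        "filepath": filepath,
        "item_type": item_type,
        "item_index": item_index,
        "action": action,
    }
```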
### Search
```bash
# Direct vector search
curl "http://localhost:8420/search?q=cooperative%20economics&n=10"
# List article titles
curl "http://localhost:8420/articles?limit=100"
```
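When calling `/search` from code, the query string needs URL encoding. A minimal sketch, assuming the `q`, `n`, and comma-separated `categories` parameters accepted by the `/search` endpoint in `src/api.py`; the helper name `build_search_url` and the default base URL are assumptions for illustration.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def build_search_url(
    q: str,
    n: int = 10,
    categories: Optional[list[str]] = None,
    base_url: str = "http://localhost:8420",
) -> str:
    """Build a /search URL with percent-encoded query parameters."""
    params = {"q": q, "n": n}
    if categories:
        # The endpoint splits this value on commas
        params["categories"] = ",".join(categories)
    # quote_via=quote encodes spaces as %20 rather than +
    return f"{base_url}/search?{urlencode(params, quote_via=quote)}"
```

`build_search_url("cooperative economics")` reproduces the URL used in the curl example above.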
## Hybrid AI Routing
The system routes each task to either a local (Ollama) or a cloud (Claude) LLM, depending on the task type and configuration:
| Task | Default LLM | Reasoning |
|------|-------------|-----------|
| Chat Q&A | Ollama | Fast, free, good enough for retrieval-based answers |
| Content Analysis | Claude | Better at extracting topics and identifying wiki relevance |
| Draft Generation | Claude | Higher quality article writing |
| Embeddings | Local (sentence-transformers) | Fast, free, optimized for semantic search |
Configure in `.env`:
```
USE_CLAUDE_FOR_DRAFTS=true
USE_OLLAMA_FOR_CHAT=true
```
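The routing table above could be expressed in code roughly like this. It is a sketch under stated assumptions, not the logic in `src/llm.py`: the names `RoutingConfig` and `pick_backend` are hypothetical, falling back to Ollama when no Anthropic API key is set is an assumption, and so is letting `USE_CLAUDE_FOR_DRAFTS` govern content analysis as well as drafting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingConfig:
    """Mirrors the routing-related settings in .env (hypothetical container)."""

    use_claude_for_drafts: bool = True
    use_ollama_for_chat: bool = True
    anthropic_api_key: str = ""


def pick_backend(task: str, cfg: RoutingConfig) -> str:
    """Return "ollama" or "claude" for a task of type "chat", "analysis", or "draft"."""
    claude_available = bool(cfg.anthropic_api_key)
    if task == "chat":
        # Chat stays local unless explicitly switched and a key is present
        if cfg.use_ollama_for_chat or not claude_available:
            return "ollama"
        return "claude"
    if task in {"analysis", "draft"}:
        # Assumption: without an API key, fall back to the local model
        if cfg.use_claude_for_drafts and claude_available:
            return "claude"
        return "ollama"
    raise ValueError(f"unknown task: {task!r}")
```

Keeping the decision in one pure function makes the routing easy to test without contacting either backend.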
## Project Structure
```
p2pwiki-content/
├── src/
│ ├── api.py # FastAPI backend
│ ├── config.py # Configuration settings
│ ├── embeddings.py # Vector store (ChromaDB)
│ ├── ingress.py # Article scraper & analyzer
│ ├── llm.py # LLM client (Ollama/Claude)
│ ├── parser.py # MediaWiki XML parser
│ └── rag.py # RAG chat system
├── web/
│ └── index.html # Web UI
├── data/
│ ├── articles.json # Parsed wiki content
│ ├── chroma/ # Vector store
│ └── review_queue/ # Pending ingress items
├── xmldump/ # MediaWiki XML dumps
├── docker-compose.yml
├── Dockerfile
└── pyproject.toml
```
## Content Coverage
The P2P Foundation Wiki contains ~23,000 articles covering:
- Peer-to-peer networks and culture
- Commons-based peer production (CBPP)
- Alternative economics and post-capitalism
- Cooperative business models
- Open source and free culture
- Collaborative governance
- Sustainability and ecology
## License
The wiki content is from the P2P Foundation under their respective licenses.
The AI system code is provided as-is for educational purposes.

38
docker-compose.yml Normal file

@ -0,0 +1,38 @@
version: '3.8'
services:
  p2pwiki-ai:
    build: .
    container_name: p2pwiki-ai
    restart: unless-stopped
    ports:
      - "8420:8420"
    volumes:
      # Persist vector store and review queue
      - ./data:/app/data
      # Mount XML dumps for parsing (read-only)
      - ./xmldump:/app/xmldump:ro
    environment:
      # Ollama connection (adjust host for your setup)
      - OLLAMA_BASE_URL=${OLLAMA_BASE_URL:-http://host.docker.internal:11434}
      - OLLAMA_MODEL=${OLLAMA_MODEL:-llama3.2}
      # Claude API (optional, for higher quality drafts)
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - CLAUDE_MODEL=${CLAUDE_MODEL:-claude-sonnet-4-20250514}
      # Hybrid routing settings
      - USE_CLAUDE_FOR_DRAFTS=${USE_CLAUDE_FOR_DRAFTS:-true}
      - USE_OLLAMA_FOR_CHAT=${USE_OLLAMA_FOR_CHAT:-true}
    labels:
      # Traefik labels for reverse proxy
      - "traefik.enable=true"
      - "traefik.http.routers.p2pwiki-ai.rule=Host(`p2pwiki.jeffemmett.com`)"
      - "traefik.http.services.p2pwiki-ai.loadbalancer.server.port=8420"
    networks:
      - traefik-public
    # Add extra_hosts for Docker Desktop to access host services
    extra_hosts:
      - "host.docker.internal:host-gateway"
networks:
  traefik-public:
    external: true

23705
pagenames.txt Normal file

File diff suppressed because it is too large

64
pyproject.toml Normal file

@ -0,0 +1,64 @@
[project]
name = "p2pwiki-ai"
version = "0.1.0"
description = "AI-augmented system for P2P Foundation Wiki - chat agent and ingress pipeline"
requires-python = ">=3.10"
dependencies = [
# Core
"fastapi>=0.109.0",
"uvicorn[standard]>=0.27.0",
"pydantic>=2.5.0",
"pydantic-settings>=2.1.0",
# XML parsing
"lxml>=5.1.0",
# Vector store & embeddings
"chromadb>=0.4.22",
"sentence-transformers>=2.3.0",
# LLM integration
"openai>=1.10.0", # For Ollama-compatible API
"anthropic>=0.18.0", # For Claude API
"httpx>=0.26.0",
# Article scraping
"trafilatura>=1.6.0",
"newspaper3k>=0.2.8",
"beautifulsoup4>=4.12.0",
"requests>=2.31.0",
# Utilities
"python-dotenv>=1.0.0",
"rich>=13.7.0",
"tqdm>=4.66.0",
"tenacity>=8.2.0",
]
[project.optional-dependencies]
dev = [
"pytest>=7.4.0",
"pytest-asyncio>=0.23.0",
"black>=24.1.0",
"ruff>=0.1.0",
]
[project.scripts]
p2pwiki-parse = "src.parser:main"
p2pwiki-embed = "src.embeddings:main"
p2pwiki-serve = "src.api:main"
[build-system]
requires = ["setuptools>=68.0", "wheel"]
build-backend = "setuptools.build_meta"
[tool.setuptools.packages.find]
where = ["."]
[tool.black]
line-length = 100
target-version = ["py310"]
[tool.ruff]
line-length = 100
select = ["E", "F", "I", "N", "W"]

1
src/__init__.py Normal file

@ -0,0 +1 @@
"""P2P Wiki AI System - Chat agent and ingress pipeline."""

320
src/api.py Normal file

@ -0,0 +1,320 @@
"""FastAPI backend for P2P Wiki AI system."""
import asyncio
from contextlib import asynccontextmanager
from pathlib import Path
from typing import Optional
from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse
from pydantic import BaseModel, HttpUrl
from .config import settings
from .embeddings import WikiVectorStore
from .rag import WikiRAG, RAGResponse
from .ingress import IngressPipeline, get_review_queue, approve_item, reject_item
# Global instances
vector_store: Optional[WikiVectorStore] = None
rag_system: Optional[WikiRAG] = None
ingress_pipeline: Optional[IngressPipeline] = None
@asynccontextmanager
async def lifespan(app: FastAPI):
    """Initialize services on startup."""
    global vector_store, rag_system, ingress_pipeline
    print("Initializing P2P Wiki AI system...")
    # Check if vector store has been populated
    chroma_path = settings.chroma_persist_dir
    if not chroma_path.exists() or not any(chroma_path.iterdir()):
        print("WARNING: Vector store not initialized. Run 'python -m src.parser' and 'python -m src.embeddings' first.")
    else:
        vector_store = WikiVectorStore()
        rag_system = WikiRAG(vector_store)
        ingress_pipeline = IngressPipeline(vector_store)
        print(f"Loaded vector store with {vector_store.get_stats()['total_chunks']} chunks")
    yield
    print("Shutting down...")


app = FastAPI(
    title="P2P Wiki AI",
    description="AI-augmented system for P2P Foundation Wiki - chat agent and ingress pipeline",
    version="0.1.0",
    lifespan=lifespan,
)

# CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Configure appropriately for production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
# --- Request/Response Models ---
class ChatRequest(BaseModel):
    """Chat request model."""

    query: str
    n_results: int = 5
    filter_categories: Optional[list[str]] = None


class ChatResponse(BaseModel):
    """Chat response model."""

    answer: str
    sources: list[dict]
    query: str


class IngressRequest(BaseModel):
    """Ingress request model."""

    url: HttpUrl


class IngressResponse(BaseModel):
    """Ingress response model."""

    status: str
    message: str
    scraped_title: Optional[str] = None
    topics_found: int = 0
    wiki_matches: int = 0
    drafts_generated: int = 0
    queue_file: Optional[str] = None


class ReviewActionRequest(BaseModel):
    """Review action request model."""

    filepath: str
    item_type: str  # "match" or "draft"
    item_index: int
    action: str  # "approve" or "reject"
# --- API Endpoints ---
@app.get("/")
async def root():
    """Root endpoint."""
    return {
        "name": "P2P Wiki AI",
        "version": "0.1.0",
        "status": "running",
        "vector_store_ready": vector_store is not None,
    }


@app.get("/health")
async def health():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "vector_store_ready": vector_store is not None,
    }


@app.get("/stats")
async def stats():
    """Get system statistics."""
    if not vector_store:
        return {"error": "Vector store not initialized"}
    return {
        "vector_store": vector_store.get_stats(),
        "review_queue_count": len(get_review_queue()),
    }
# --- Chat Endpoints ---
@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Chat with the wiki knowledge base."""
    if not rag_system:
        raise HTTPException(
            status_code=503,
            detail="RAG system not initialized. Run indexing first.",
        )
    response = await rag_system.ask(
        query=request.query,
        n_results=request.n_results,
        filter_categories=request.filter_categories,
    )
    return ChatResponse(
        answer=response.answer,
        sources=response.sources,
        query=response.query,
    )


@app.post("/chat/clear")
async def clear_chat():
    """Clear chat history."""
    if rag_system:
        rag_system.clear_history()
    return {"status": "cleared"}


@app.get("/chat/suggestions")
async def chat_suggestions(q: str = ""):
    """Get article title suggestions for autocomplete."""
    if not rag_system or not q:
        return {"suggestions": []}
    suggestions = rag_system.get_suggestions(q)
    return {"suggestions": suggestions}
# --- Ingress Endpoints ---
@app.post("/ingress", response_model=IngressResponse)
async def ingress(request: IngressRequest, background_tasks: BackgroundTasks):
    """
    Process an external article URL through the ingress pipeline.

    This scrapes the article, analyzes it for wiki relevance,
    finds matching existing articles, and generates draft articles.
    """
    if not ingress_pipeline:
        raise HTTPException(
            status_code=503,
            detail="Ingress pipeline not initialized. Run indexing first.",
        )
    try:
        result = await ingress_pipeline.process(str(request.url))
        return IngressResponse(
            status="success",
            message="Article processed successfully",
            scraped_title=result.scraped.title,
            topics_found=len(result.analysis.get("main_topics", [])),
            wiki_matches=len(result.wiki_matches),
            drafts_generated=len(result.draft_articles),
            queue_file=result.timestamp,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
# --- Review Queue Endpoints ---
@app.get("/review")
async def get_review_items():
    """Get all items in the review queue."""
    items = get_review_queue()
    return {"count": len(items), "items": items}


@app.get("/review/{filename}")
async def get_review_item(filename: str):
    """Get a specific review item."""
    filepath = settings.review_queue_dir / filename
    if not filepath.exists():
        raise HTTPException(status_code=404, detail="Review item not found")
    import json

    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data


@app.post("/review/action")
async def review_action(request: ReviewActionRequest):
    """Approve or reject a review item."""
    if request.action == "approve":
        success = approve_item(request.filepath, request.item_type, request.item_index)
    elif request.action == "reject":
        success = reject_item(request.filepath, request.item_type, request.item_index)
    else:
        raise HTTPException(status_code=400, detail="Invalid action")
    if success:
        return {"status": "success", "action": request.action}
    else:
        raise HTTPException(status_code=500, detail="Action failed")
# --- Search Endpoints ---
@app.get("/search")
async def search(q: str, n: int = 10, categories: Optional[str] = None):
    """Direct search of the vector store."""
    if not vector_store:
        raise HTTPException(status_code=503, detail="Vector store not initialized")
    filter_cats = categories.split(",") if categories else None
    results = vector_store.search(q, n_results=n, filter_categories=filter_cats)
    return {"query": q, "count": len(results), "results": results}


@app.get("/articles")
async def list_articles(limit: int = 100, offset: int = 0):
    """List article titles."""
    if not vector_store:
        raise HTTPException(status_code=503, detail="Vector store not initialized")
    titles = vector_store.get_article_titles()
    return {
        "total": len(titles),
        "limit": limit,
        "offset": offset,
        "titles": titles[offset : offset + limit],
    }
# --- Static Files (Web UI) ---
web_dir = Path(__file__).parent.parent / "web"
if web_dir.exists():
    app.mount("/static", StaticFiles(directory=str(web_dir)), name="static")


@app.get("/ui")
async def ui():
    """Serve the web UI."""
    index_path = web_dir / "index.html"
    if index_path.exists():
        return FileResponse(index_path)
    raise HTTPException(status_code=404, detail="Web UI not found")


def main():
    """Run the API server."""
    import uvicorn

    uvicorn.run(
        "src.api:app",
        host=settings.host,
        port=settings.port,
        reload=True,
    )


if __name__ == "__main__":
    main()

51
src/config.py Normal file

@ -0,0 +1,51 @@
"""Configuration settings for P2P Wiki AI system."""
from pathlib import Path
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
    """Application settings loaded from environment variables."""

    # Paths
    project_root: Path = Path(__file__).parent.parent
    data_dir: Path = project_root / "data"
    xmldump_dir: Path = project_root / "xmldump"
    # Vector store
    chroma_persist_dir: Path = data_dir / "chroma"
    embedding_model: str = "all-MiniLM-L6-v2"  # Fast, good quality
    # Ollama (local LLM)
    ollama_base_url: str = "http://localhost:11434"
    ollama_model: str = "llama3.2"  # Default model for local inference
    # Claude API (for complex tasks)
    anthropic_api_key: str = ""
    claude_model: str = "claude-sonnet-4-20250514"
    # Hybrid routing thresholds
    use_claude_for_drafts: bool = True  # Use Claude for article drafting
    use_ollama_for_chat: bool = True  # Use Ollama for simple Q&A
    # MediaWiki
    mediawiki_api_url: str = ""  # Set if you have a live wiki API
    # Server
    host: str = "0.0.0.0"
    port: int = 8420
    # Review queue
    review_queue_dir: Path = data_dir / "review_queue"

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"


settings = Settings()
# Ensure directories exist
settings.data_dir.mkdir(parents=True, exist_ok=True)
settings.chroma_persist_dir.mkdir(parents=True, exist_ok=True)
settings.review_queue_dir.mkdir(parents=True, exist_ok=True)

256
src/embeddings.py Normal file

@ -0,0 +1,256 @@
"""Vector store setup and embedding generation using ChromaDB."""
import json
from pathlib import Path
from typing import Optional
import chromadb
from chromadb.config import Settings as ChromaSettings
from rich.console import Console
from rich.progress import Progress
from sentence_transformers import SentenceTransformer
from .config import settings
from .parser import WikiArticle
console = Console()
# Chunk size for embedding (in characters)
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
class WikiVectorStore:
    """Vector store for wiki articles using ChromaDB."""

    def __init__(self, persist_dir: Optional[Path] = None):
        self.persist_dir = persist_dir or settings.chroma_persist_dir
        # Initialize ChromaDB
        self.client = chromadb.PersistentClient(
            path=str(self.persist_dir),
            settings=ChromaSettings(anonymized_telemetry=False),
        )
        # Create or get collection
        self.collection = self.client.get_or_create_collection(
            name="wiki_articles",
            metadata={"hnsw:space": "cosine"},
        )
        # Load embedding model
        console.print(f"[cyan]Loading embedding model: {settings.embedding_model}[/cyan]")
        self.model = SentenceTransformer(settings.embedding_model)
        console.print("[green]Model loaded[/green]")

    def _chunk_text(self, text: str, title: str) -> list[tuple[str, dict]]:
        """Split text into overlapping chunks with metadata."""
        if len(text) <= CHUNK_SIZE:
            return [(text, {"chunk_index": 0, "total_chunks": 1})]
        chunks = []
        start = 0
        chunk_index = 0
        while start < len(text):
            end = start + CHUNK_SIZE
            # Try to break at sentence boundary
            if end < len(text):
                # Look for sentence end within last 100 chars
                for i in range(min(100, end - start)):
                    if text[end - i] in ".!?\n":
                        end = end - i + 1
                        break
            chunk_text = text[start:end].strip()
            if chunk_text:
                # Prepend title for context
                chunk_with_title = f"{title}\n\n{chunk_text}"
                chunks.append(
                    (chunk_with_title, {"chunk_index": chunk_index, "total_chunks": -1})
                )
                chunk_index += 1
            # Stop once the text is exhausted; stepping back by the overlap here
            # would emit a tail chunk already contained in the previous one
            if end >= len(text):
                break
            start = end - CHUNK_OVERLAP
        # Update total_chunks
        for _, meta in chunks:
            meta["total_chunks"] = len(chunks)
        return chunks

    def get_embedded_article_ids(self) -> set:
        """Get set of article IDs that are already embedded."""
        results = self.collection.get(include=["metadatas"])
        article_ids = set()
        for meta in results["metadatas"]:
            if meta and "article_id" in meta:
                article_ids.add(meta["article_id"])
        return article_ids

    def add_articles(self, articles: list[WikiArticle], batch_size: int = 100, resume: bool = True):
        """Add articles to the vector store."""
        console.print(f"[cyan]Processing {len(articles)} articles...[/cyan]")
        # Check for already embedded articles if resuming
        if resume:
            embedded_ids = self.get_embedded_article_ids()
            original_count = len(articles)
            articles = [a for a in articles if a.id not in embedded_ids]
            skipped = original_count - len(articles)
            if skipped > 0:
                console.print(f"[yellow]Skipping {skipped} already-embedded articles[/yellow]")
            if not articles:
                console.print("[green]All articles already embedded![/green]")
                return
        all_chunks = []
        all_ids = []
        all_metadatas = []
        with Progress() as progress:
            task = progress.add_task("[cyan]Chunking articles...", total=len(articles))
            for article in articles:
                if not article.plain_text:
                    progress.advance(task)
                    continue
                chunks = self._chunk_text(article.plain_text, article.title)
                for chunk_text, chunk_meta in chunks:
                    chunk_id = f"{article.id}_{chunk_meta['chunk_index']}"
                    metadata = {
                        "article_id": article.id,
                        "title": article.title,
                        "categories": ",".join(article.categories[:10]),  # Limit categories
                        "timestamp": article.timestamp,
                        "chunk_index": chunk_meta["chunk_index"],
                        "total_chunks": chunk_meta["total_chunks"],
                    }
                    all_chunks.append(chunk_text)
                    all_ids.append(chunk_id)
                    all_metadatas.append(metadata)
                progress.advance(task)
        console.print(f"[cyan]Created {len(all_chunks)} chunks from {len(articles)} articles[/cyan]")
        # Generate embeddings and add in batches
        console.print("[cyan]Generating embeddings and adding to vector store...[/cyan]")
        with Progress() as progress:
            # Ceiling division: one progress step per batch
            task = progress.add_task(
                "[cyan]Embedding and storing...",
                total=(len(all_chunks) + batch_size - 1) // batch_size,
            )
            for i in range(0, len(all_chunks), batch_size):
                batch_chunks = all_chunks[i : i + batch_size]
                batch_ids = all_ids[i : i + batch_size]
                batch_metadatas = all_metadatas[i : i + batch_size]
                # Generate embeddings
                embeddings = self.model.encode(batch_chunks, show_progress_bar=False)
                # Add to collection
                self.collection.add(
                    ids=batch_ids,
                    embeddings=embeddings.tolist(),
                    documents=batch_chunks,
                    metadatas=batch_metadatas,
                )
                progress.advance(task)
        console.print(f"[green]Added {len(all_chunks)} chunks to vector store[/green]")

    def search(
        self,
        query: str,
        n_results: int = 5,
        filter_categories: Optional[list[str]] = None,
    ) -> list[dict]:
        """Search for relevant chunks."""
        query_embedding = self.model.encode([query])[0]
        where_filter = None
        if filter_categories:
            # ChromaDB where filter for categories (stored as one comma-separated string).
            # NOTE: whether `$contains` is accepted in a metadata filter depends on the
            # installed ChromaDB version.
            conditions = [{"categories": {"$contains": cat}} for cat in filter_categories]
            # `$or` needs at least two expressions, so pass a single condition directly
            where_filter = conditions[0] if len(conditions) == 1 else {"$or": conditions}
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results,
            where=where_filter,
            include=["documents", "metadatas", "distances"],
        )
        # Format results
        formatted = []
        if results["documents"] and results["documents"][0]:
            for i, doc in enumerate(results["documents"][0]):
                formatted.append(
                    {
                        "content": doc,
                        "metadata": results["metadatas"][0][i],
                        "distance": results["distances"][0][i],
                    }
                )
        return formatted

    def get_article_titles(self) -> list[str]:
        """Get all unique article titles in the store."""
        # Get all metadata
        results = self.collection.get(include=["metadatas"])
        titles = set()
        for meta in results["metadatas"]:
            if meta and "title" in meta:
                titles.add(meta["title"])
        return sorted(titles)

    def get_stats(self) -> dict:
        """Get statistics about the vector store."""
        count = self.collection.count()
        # Get sample of metadatas to count unique articles
        sample = self.collection.get(limit=10000, include=["metadatas"])
        unique_articles = len(set(m["article_id"] for m in sample["metadatas"] if m))
        return {
            "total_chunks": count,
            "unique_articles_sampled": unique_articles,
            "persist_dir": str(self.persist_dir),
        }


def main():
    """CLI entry point for generating embeddings."""
    articles_path = settings.data_dir / "articles.json"
    if not articles_path.exists():
        console.print(f"[red]Articles file not found: {articles_path}[/red]")
        console.print("[yellow]Run 'python -m src.parser' first to parse XML dumps[/yellow]")
        return
    console.print(f"[cyan]Loading articles from {articles_path}...[/cyan]")
    with open(articles_path, "r", encoding="utf-8") as f:
        articles_data = json.load(f)
    articles = [WikiArticle(**a) for a in articles_data]
    console.print(f"[green]Loaded {len(articles)} articles[/green]")
    store = WikiVectorStore()
    store.add_articles(articles)
    stats = store.get_stats()
    console.print(f"[green]Vector store stats: {stats}[/green]")


if __name__ == "__main__":
    main()

467
src/ingress.py Normal file

@ -0,0 +1,467 @@
"""Article ingress pipeline - scrape, analyze, and draft wiki content."""
import json
import re
from dataclasses import dataclass, field, asdict
from datetime import datetime
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse
import httpx
import trafilatura
from bs4 import BeautifulSoup
from rich.console import Console
from .config import settings
from .embeddings import WikiVectorStore
from .llm import llm_client
console = Console()
@dataclass
class ScrapedArticle:
    """Represents a scraped external article."""

    url: str
    title: str
    content: str
    author: Optional[str] = None
    date: Optional[str] = None
    domain: str = ""
    word_count: int = 0

    def __post_init__(self):
        if not self.domain:
            self.domain = urlparse(self.url).netloc
        if not self.word_count:
            self.word_count = len(self.content.split())


@dataclass
class WikiMatch:
    """A matching wiki article for citation."""

    title: str
    article_id: int
    relevance_score: float
    categories: list[str]
    suggested_citation: str  # How to cite the scraped article in this wiki page


@dataclass
class DraftArticle:
    """A draft wiki article generated from scraped content."""

    title: str
    content: str  # MediaWiki formatted content
    categories: list[str]
    source_url: str
    source_title: str
    summary: str
    related_articles: list[str]  # Existing wiki articles to link to


@dataclass
class IngressResult:
    """Result of the ingress pipeline."""

    scraped: ScrapedArticle
    analysis: dict  # Topic analysis results
    wiki_matches: list[WikiMatch]  # Existing articles to update with citations
    draft_articles: list[DraftArticle]  # New articles to create
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())

    def to_dict(self) -> dict:
        return {
            "scraped": asdict(self.scraped),
            "analysis": self.analysis,
            "wiki_matches": [asdict(m) for m in self.wiki_matches],
            "draft_articles": [asdict(d) for d in self.draft_articles],
            "timestamp": self.timestamp,
        }


class ArticleScraper:
    """Scrapes and extracts content from URLs."""

    async def scrape(self, url: str) -> ScrapedArticle:
        """Scrape article content from URL."""
        console.print(f"[cyan]Scraping: {url}[/cyan]")
        async with httpx.AsyncClient(
            timeout=30.0,
            follow_redirects=True,
            headers={
                "User-Agent": "Mozilla/5.0 (compatible; P2PWikiBot/1.0; +http://p2pfoundation.net)"
            },
        ) as client:
            response = await client.get(url)
            response.raise_for_status()
            html = response.text
        # Use trafilatura for main content extraction
        content = trafilatura.extract(
            html,
            include_comments=False,
            include_tables=True,
            no_fallback=False,
        )
        if not content:
            # Fallback to BeautifulSoup
            soup = BeautifulSoup(html, "html.parser")
            # Remove script and style elements
            for element in soup(["script", "style", "nav", "footer", "header"]):
                element.decompose()
            content = soup.get_text(separator="\n", strip=True)
        # Extract metadata
        soup = BeautifulSoup(html, "html.parser")
        title = ""
        title_tag = soup.find("title")
        if title_tag:
            title = title_tag.get_text(strip=True)
        # Try og:title
        og_title = soup.find("meta", property="og:title")
        if og_title and og_title.get("content"):
            title = og_title["content"]
        author = None
        author_meta = soup.find("meta", attrs={"name": "author"})
        if author_meta and author_meta.get("content"):
            author = author_meta["content"]
        date = None
        date_meta = soup.find("meta", attrs={"name": "date"}) or soup.find(
            "meta", property="article:published_time"
        )
        if date_meta and date_meta.get("content"):
            date = date_meta["content"]
        return ScrapedArticle(
            url=url,
            title=title,
            content=content or "",
            author=author,
            date=date,
        )


class ContentAnalyzer:
    """Analyzes scraped content for wiki relevance."""

    def __init__(self, vector_store: Optional[WikiVectorStore] = None):
        self.vector_store = vector_store or WikiVectorStore()

    async def analyze(self, article: ScrapedArticle) -> dict:
        """Analyze article for topics, concepts, and wiki relevance."""
        # Truncate very long articles for analysis
        content_for_analysis = article.content[:8000]
        analysis_prompt = f"""Analyze this article for potential wiki content about peer-to-peer culture, commons, alternative economics, and collaborative governance.
Article Title: {article.title}
Source: {article.domain}
Article Content:
{content_for_analysis}
Please provide your analysis in the following JSON format:
{{
"main_topics": ["topic1", "topic2"],
"key_concepts": ["concept1", "concept2"],
"relevant_categories": ["category1", "category2"],
"summary": "2-3 sentence summary",
"wiki_relevance_score": 0.0-1.0,
"suggested_article_titles": ["Title 1", "Title 2"],
"key_quotes": ["notable quote 1", "notable quote 2"],
"mentioned_organizations": ["org1", "org2"],
"mentioned_people": ["person1", "person2"]
}}
Focus on topics relevant to:
- Peer-to-peer networks and culture
- Commons-based peer production
- Alternative economics and post-capitalism
- Cooperative business models
- Open source / free culture
- Collaborative governance
- Sustainability and ecology"""
        response = await llm_client.analyze(
            content=content_for_analysis,
            task=analysis_prompt,
            temperature=0.3,
        )
        # Parse JSON from response
        try:
            # Find JSON in response
            json_match = re.search(r"\{[\s\S]*\}", response)
            if json_match:
                analysis = json.loads(json_match.group())
            else:
                analysis = {"error": "Could not parse analysis", "raw": response}
        except json.JSONDecodeError:
            analysis = {"error": "Invalid JSON in analysis", "raw": response}
        return analysis


    async def find_wiki_matches(
        self, article: ScrapedArticle, analysis: dict, n_results: int = 10
    ) -> list[WikiMatch]:
        """Find existing wiki articles that could cite this content."""
        matches = []

        # Search using main topics and concepts
        search_terms = analysis.get("main_topics", []) + analysis.get("key_concepts", [])

        for term in search_terms[:5]:  # Limit searches
            results = self.vector_store.search(term, n_results=3)

            for result in results:
                title = result["metadata"].get("title", "Unknown")
                article_id = result["metadata"].get("article_id", 0)
                distance = result.get("distance", 1.0)

                # Skip if already added
                if any(m.title == title for m in matches):
                    continue

                # Calculate relevance (lower distance = higher relevance)
                relevance = max(0, 1 - distance)

                if relevance > 0.3:  # Threshold for relevance
                    # Drop empty entries so a missing categories field gives [] rather than [""]
                    categories = [
                        c.strip()
                        for c in result["metadata"].get("categories", "").split(",")
                        if c.strip()
                    ]
                    matches.append(
                        WikiMatch(
                            title=title,
                            article_id=article_id,
                            relevance_score=relevance,
                            categories=categories,
                            suggested_citation=f"See also: [{article.title}]({article.url})",
                        )
                    )

        # Sort by relevance and limit
        matches.sort(key=lambda m: m.relevance_score, reverse=True)
        return matches[:n_results]
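
The analysis step above pulls JSON out of a free-form LLM reply with a greedy regex, which fails when the model adds braces after the object. The following is a minimal sketch of an alternative, not the project's method: `extract_first_json_object` is a hypothetical helper name, and it relies only on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json
from typing import Any, Optional


def extract_first_json_object(text: str) -> Optional[dict[str, Any]]:
    """Return the first decodable JSON object embedded in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON from this brace: try the next candidate
        start = text.find("{", start + 1)
    return None


reply = (
    'Here is the analysis: {"main_topics": ["commons"], "wiki_relevance_score": 0.8} '
    "Hope that helps {smile}."
)
parsed = extract_first_json_object(reply)
print(parsed)
```

A caller could fall back to the existing `{"error": ..., "raw": response}` dictionary whenever the helper returns `None`.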


class DraftGenerator:
    """Generates draft wiki articles from scraped content."""

    def __init__(self, vector_store: Optional[WikiVectorStore] = None):
        self.vector_store = vector_store or WikiVectorStore()

    async def generate_drafts(
        self,
        article: ScrapedArticle,
        analysis: dict,
        max_drafts: int = 3,
    ) -> list[DraftArticle]:
        """Generate draft wiki articles based on scraped content."""
        drafts = []

        suggested_titles = analysis.get("suggested_article_titles", [])
        if not suggested_titles:
            return drafts

        for title in suggested_titles[:max_drafts]:
            # Check if article already exists
            existing = self.vector_store.search(title, n_results=1)
            if existing and existing[0].get("distance", 1.0) < 0.1:
                console.print(f"[yellow]Skipping '{title}' - similar article exists[/yellow]")
                continue

            draft = await self._generate_single_draft(article, analysis, title)
            if draft:
                drafts.append(draft)

        return drafts

    async def _generate_single_draft(
        self,
        article: ScrapedArticle,
        analysis: dict,
        title: str,
    ) -> Optional[DraftArticle]:
        """Generate a single draft article."""
        # Find related existing articles
        related_search = self.vector_store.search(title, n_results=5)
        related_titles = [
            r["metadata"].get("title", "")
            for r in related_search
            if r.get("distance", 1.0) < 0.5
        ]

        categories = analysis.get("relevant_categories", [])
        summary = analysis.get("summary", "")

        draft_prompt = f"""Create a MediaWiki-formatted article for the P2P Foundation Wiki.
Article Title: {title}
Source Material:
Title: {article.title}
URL: {article.url}
Summary: {summary}
Key concepts to cover: {', '.join(analysis.get('key_concepts', []))}
Related existing wiki articles: {', '.join(related_titles)}
Categories to include: {', '.join(categories)}
Please write the wiki article in MediaWiki markup format with:
1. An introduction/definition section
2. A "Description" section with key information
3. Links to related wiki articles using [[Article Name]] format
4. A "Sources" section citing the original article
5. Category tags at the end using [[Category:Name]] format
The article should:
- Be encyclopedic and neutral in tone
- Focus on the P2P/commons aspects of the topic
- Be approximately 300-500 words
- Include internal wiki links to related concepts"""

        content = await llm_client.generate_draft(
            draft_prompt,
            system="You are a wiki editor for the P2P Foundation Wiki. Write clear, encyclopedic articles in MediaWiki markup format.",
            temperature=0.5,
        )

        return DraftArticle(
            title=title,
            content=content,
            categories=categories,
            source_url=article.url,
            source_title=article.title,
            summary=summary,
            related_articles=related_titles,
        )


class IngressPipeline:
    """Complete ingress pipeline for processing external articles."""

    def __init__(self, vector_store: Optional[WikiVectorStore] = None):
        self.vector_store = vector_store or WikiVectorStore()
        self.scraper = ArticleScraper()
        self.analyzer = ContentAnalyzer(self.vector_store)
        self.generator = DraftGenerator(self.vector_store)

    async def process(self, url: str) -> IngressResult:
        """Process a URL through the complete ingress pipeline."""
        console.print(f"[bold cyan]Processing: {url}[/bold cyan]")

        # Step 1: Scrape
        console.print("[cyan]Step 1/4: Scraping article...[/cyan]")
        scraped = await self.scraper.scrape(url)
        console.print(f"[green]Scraped: {scraped.title} ({scraped.word_count} words)[/green]")

        # Step 2: Analyze
        console.print("[cyan]Step 2/4: Analyzing content...[/cyan]")
        analysis = await self.analyzer.analyze(scraped)
        console.print(f"[green]Found {len(analysis.get('main_topics', []))} main topics[/green]")

        # Step 3: Find wiki matches
        console.print("[cyan]Step 3/4: Finding wiki matches...[/cyan]")
        matches = await self.analyzer.find_wiki_matches(scraped, analysis)
        console.print(f"[green]Found {len(matches)} potential wiki matches[/green]")

        # Step 4: Generate drafts
        console.print("[cyan]Step 4/4: Generating draft articles...[/cyan]")
        drafts = await self.generator.generate_drafts(scraped, analysis)
        console.print(f"[green]Generated {len(drafts)} draft articles[/green]")

        result = IngressResult(
            scraped=scraped,
            analysis=analysis,
            wiki_matches=matches,
            draft_articles=drafts,
        )

        # Save to review queue
        self._save_to_review_queue(result)

        return result

    def _save_to_review_queue(self, result: IngressResult):
        """Save ingress result to the review queue."""
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        domain = result.scraped.domain.replace(".", "_")
        filename = f"{timestamp}_{domain}.json"

        # Make sure the queue directory exists before writing (no-op if it already does)
        settings.review_queue_dir.mkdir(parents=True, exist_ok=True)
        filepath = settings.review_queue_dir / filename

        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(result.to_dict(), f, indent=2, ensure_ascii=False)

        console.print(f"[green]Saved to review queue: {filepath}[/green]")


def get_review_queue() -> list[dict]:
    """Get all items in the review queue, newest first."""
    # Filenames start with a timestamp, so a reverse name sort is newest first
    queue_files = sorted(settings.review_queue_dir.glob("*.json"), reverse=True)

    items = []
    for filepath in queue_files:
        with open(filepath, "r", encoding="utf-8") as f:
            data = json.load(f)
            data["_filepath"] = str(filepath)
            items.append(data)

    return items


def approve_item(filepath: str, item_type: str, item_index: int) -> bool:
    """
    Approve an item from the review queue.

    Args:
        filepath: Path to the review queue JSON file
        item_type: "match" or "draft"
        item_index: Index of the item to approve

    Returns:
        True if the item was found and marked, False otherwise
    """
    # For now, just mark as approved in the file
    # In production, this would push to MediaWiki API
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)

    key = {"match": "wiki_matches", "draft": "draft_articles"}.get(item_type)
    items = data.get(key, []) if key else []
    if not 0 <= item_index < len(items):
        return False  # Unknown item type or index out of range

    items[item_index]["approved"] = True

    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    return True


def reject_item(filepath: str, item_type: str, item_index: int) -> bool:
    """Reject an item from the review queue. Returns False if the item does not exist."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)

    key = {"match": "wiki_matches", "draft": "draft_articles"}.get(item_type)
    items = data.get(key, []) if key else []
    if not 0 <= item_index < len(items):
        return False  # Unknown item type or index out of range

    items[item_index]["rejected"] = True

    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    return True
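
`approve_item` and `reject_item` differ only in the flag they set, so the flag logic could live in one in-memory helper that is easy to test without touching files. The following is a sketch only: `mark_review_item` is a hypothetical name, and clearing the opposite flag is an assumption about the desired behavior rather than something the code above does.

```python
from typing import Any

_LIST_KEYS = {"match": "wiki_matches", "draft": "draft_articles"}
_OPPOSITE = {"approved": "rejected", "rejected": "approved"}


def mark_review_item(data: dict[str, Any], item_type: str, item_index: int, flag: str) -> bool:
    """Set `flag` on one queue entry in place; return False if it does not exist."""
    key = _LIST_KEYS.get(item_type)
    if key is None or flag not in _OPPOSITE:
        return False
    items = data.get(key, [])
    if not 0 <= item_index < len(items):
        return False
    items[item_index][flag] = True
    # Assumption: an item should never be approved and rejected at the same time
    items[item_index].pop(_OPPOSITE[flag], None)
    return True


queue_entry = {"wiki_matches": [{"title": "Commons"}], "draft_articles": []}
first_result = mark_review_item(queue_entry, "match", 0, "approved")
print(first_result, queue_entry["wiki_matches"][0])
```

The file-based functions would then only load the JSON, call the helper, and write the file back when it returns `True`.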

153
src/llm.py Normal file

@ -0,0 +1,153 @@
"""LLM client with hybrid routing between Ollama and Claude."""

from typing import Optional

import httpx
from anthropic import AsyncAnthropic
from tenacity import retry, stop_after_attempt, wait_exponential

from .config import settings


class LLMClient:
    """Unified LLM client with hybrid routing."""

    def __init__(self):
        self.ollama_url = settings.ollama_base_url
        self.ollama_model = settings.ollama_model

        # Initialize Claude client if API key is set.
        # The async client keeps Claude calls from blocking the event loop.
        self.claude_client = None
        if settings.anthropic_api_key:
            self.claude_client = AsyncAnthropic(api_key=settings.anthropic_api_key)

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
    async def _call_ollama(
        self,
        prompt: str,
        system: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: int = 2048,
    ) -> str:
        """Call Ollama API."""
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})

        async with httpx.AsyncClient(timeout=120.0) as client:
            response = await client.post(
                f"{self.ollama_url}/api/chat",
                json={
                    "model": self.ollama_model,
                    "messages": messages,
                    "stream": False,
                    "options": {
                        "temperature": temperature,
                        "num_predict": max_tokens,
                    },
                },
            )
            response.raise_for_status()
            data = response.json()
            return data["message"]["content"]

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
    async def _call_claude(
        self,
        prompt: str,
        system: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: int = 4096,
    ) -> str:
        """Call Claude API."""
        if not self.claude_client:
            raise ValueError("Claude API key not configured")

        message = await self.claude_client.messages.create(
            model=settings.claude_model,
            max_tokens=max_tokens,
            system=system or "",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return message.content[0].text

    async def chat(
        self,
        prompt: str,
        system: Optional[str] = None,
        use_claude: bool = False,
        temperature: float = 0.7,
        max_tokens: int = 2048,
    ) -> str:
        """
        Chat with LLM using hybrid routing.

        Args:
            prompt: User prompt
            system: System prompt
            use_claude: Use the Claude API if it is configured (otherwise Ollama)
            temperature: Sampling temperature
            max_tokens: Max response tokens

        Returns:
            LLM response text
        """
        if use_claude and self.claude_client:
            return await self._call_claude(prompt, system, temperature, max_tokens)
        else:
            return await self._call_ollama(prompt, system, temperature, max_tokens)

    async def generate_draft(
        self,
        prompt: str,
        system: Optional[str] = None,
        temperature: float = 0.5,
    ) -> str:
        """
        Generate article draft - uses Claude for higher quality when enabled.

        Args:
            prompt: Prompt describing what to generate
            system: System prompt for context
            temperature: Lower for more factual output

        Returns:
            Generated draft text
        """
        # Use Claude for drafts if configured, otherwise fall back to Ollama
        use_claude = settings.use_claude_for_drafts and self.claude_client is not None
        return await self.chat(
            prompt, system, use_claude=use_claude, temperature=temperature, max_tokens=4096
        )

    async def analyze(
        self,
        content: str,
        task: str,
        temperature: float = 0.3,
    ) -> str:
        """
        Analyze content for a specific task - uses Claude when configured, otherwise Ollama.

        Args:
            content: Content to analyze
            task: Description of analysis task
            temperature: Lower for more deterministic output

        Returns:
            Analysis result
        """
        prompt = f"""Task: {task}
Content to analyze:
{content}
Provide your analysis:"""

        use_claude = self.claude_client is not None
        return await self.chat(prompt, use_claude=use_claude, temperature=temperature)


# Singleton instance
llm_client = LLMClient()
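
The routing rule and the Ollama request body in `LLMClient` can be checked without any network access by expressing them as pure functions. This is an illustrative sketch: `choose_backend` and `build_ollama_chat_payload` are hypothetical names that do not exist in the module, and the payload fields simply mirror the ones sent by `_call_ollama` above.

```python
from typing import Any, Optional


def choose_backend(use_claude: bool, claude_configured: bool) -> str:
    """Mirror of the chat() rule: Claude only when requested and configured."""
    return "claude" if use_claude and claude_configured else "ollama"


def build_ollama_chat_payload(
    model: str,
    prompt: str,
    system: Optional[str] = None,
    temperature: float = 0.7,
    max_tokens: int = 2048,
) -> dict[str, Any]:
    """Build the JSON body that _call_ollama posts to /api/chat."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"temperature": temperature, "num_predict": max_tokens},
    }


payload = build_ollama_chat_payload("llama3.2", "What is a commons?", system="Be brief.")
print(choose_backend(use_claude=True, claude_configured=False))
print([m["role"] for m in payload["messages"]])
```

With no API key set, a request for Claude falls back to Ollama instead of failing, which matches the behavior of `chat()`.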

267
src/parser.py Normal file

@ -0,0 +1,267 @@
"""MediaWiki XML dump parser - converts to structured JSON."""

import json
import re
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Iterator

from lxml import etree
from rich.progress import Progress
from rich.console import Console

from .config import settings

console = Console()

# MediaWiki namespace
MW_NS = {"mw": "http://www.mediawiki.org/xml/export-0.6/"}


@dataclass
class WikiArticle:
    """Represents a parsed wiki article."""

    id: int
    title: str
    content: str  # Raw wikitext
    plain_text: str  # Cleaned plain text for embedding
    categories: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)  # Internal wiki links
    external_links: list[str] = field(default_factory=list)
    timestamp: str = ""
    contributor: str = ""

    def to_dict(self) -> dict:
        return asdict(self)


def clean_wikitext(text: str) -> str:
    """Convert MediaWiki markup to plain text for embedding."""
    if not text:
        return ""

    # Remove templates {{...}} (non-nested only)
    text = re.sub(r"\{\{[^}]+\}\}", "", text)

    # Remove categories [[Category:...]]
    text = re.sub(r"\[\[Category:[^\]]+\]\]", "", text, flags=re.IGNORECASE)

    # Convert wiki links [[Page|Display]] or [[Page]] to just the display text
    text = re.sub(r"\[\[([^|\]]+)\|([^\]]+)\]\]", r"\2", text)
    text = re.sub(r"\[\[([^\]]+)\]\]", r"\1", text)

    # Remove external links [url text] -> text
    text = re.sub(r"\[https?://[^\s\]]+ ([^\]]+)\]", r"\1", text)
    text = re.sub(r"\[https?://[^\]]+\]", "", text)

    # Remove wiki formatting
    text = re.sub(r"'''?([^']+)'''?", r"\1", text)  # Bold/italic
    text = re.sub(r"={2,}([^=]+)={2,}", r"\1", text)  # Headers
    text = re.sub(r"^[*#:;]+", "", text, flags=re.MULTILINE)  # List markers

    # Remove HTML tags
    text = re.sub(r"<[^>]+>", "", text)

    # Clean up whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r" {2,}", " ", text)

    return text.strip()


def extract_categories(text: str) -> list[str]:
    """Extract category names from wikitext."""
    pattern = r"\[\[Category:([^\]|]+)"
    return list(set(re.findall(pattern, text, re.IGNORECASE)))


def extract_wiki_links(text: str) -> list[str]:
    """Extract internal wiki links from wikitext."""
    # Match [[Page]] or [[Page|Display]]
    pattern = r"\[\[([^|\]]+)"
    links = re.findall(pattern, text)
    # Filter out categories and files
    return list(
        set(
            link.strip()
            for link in links
            if not link.lower().startswith(("category:", "file:", "image:"))
        )
    )


def extract_external_links(text: str) -> list[str]:
    """Extract external URLs from wikitext."""
    pattern = r"https?://[^\s\]\)\"']+"
    return list(set(re.findall(pattern, text)))


def parse_xml_file(xml_path: Path) -> Iterator[WikiArticle]:
    """Parse a MediaWiki XML dump file and yield articles."""
    context = etree.iterparse(
        str(xml_path), events=("end",), tag="{http://www.mediawiki.org/xml/export-0.6/}page"
    )

    for event, page in context:
        # Get basic info
        title_elem = page.find("mw:title", MW_NS)
        id_elem = page.find("mw:id", MW_NS)
        ns_elem = page.find("mw:ns", MW_NS)

        # Skip non-main namespace pages (talk, user, etc.)
        if ns_elem is not None and ns_elem.text != "0":
            page.clear()
            continue

        title = title_elem.text if title_elem is not None else ""
        page_id = int(id_elem.text) if id_elem is not None else 0

        # Get the first <revision>; in a current-revisions dump this is the latest one
        revision = page.find("mw:revision", MW_NS)
        if revision is None:
            page.clear()
            continue

        text_elem = revision.find("mw:text", MW_NS)
        timestamp_elem = revision.find("mw:timestamp", MW_NS)
        contributor = revision.find("mw:contributor", MW_NS)

        content = text_elem.text if text_elem is not None else ""
        timestamp = timestamp_elem.text if timestamp_elem is not None else ""

        contributor_name = ""
        if contributor is not None:
            username = contributor.find("mw:username", MW_NS)
            if username is not None:
                contributor_name = username.text or ""

        # Skip redirects and empty pages
        if not content or content.lower().startswith("#redirect"):
            page.clear()
            continue

        article = WikiArticle(
            id=page_id,
            title=title,
            content=content,
            plain_text=clean_wikitext(content),
            categories=extract_categories(content),
            links=extract_wiki_links(content),
            external_links=extract_external_links(content),
            timestamp=timestamp,
            contributor=contributor_name,
        )

        # Clear element to free memory
        page.clear()

        yield article


def parse_all_dumps(output_path: Path | None = None) -> list[WikiArticle]:
    """Parse all XML dump files and optionally save to JSON."""
    xml_files = sorted(settings.xmldump_dir.glob("*.xml"))

    if not xml_files:
        console.print(f"[red]No XML files found in {settings.xmldump_dir}[/red]")
        return []

    console.print(f"[green]Found {len(xml_files)} XML files to parse[/green]")

    all_articles = []

    with Progress() as progress:
        task = progress.add_task("[cyan]Parsing XML files...", total=len(xml_files))

        for xml_file in xml_files:
            progress.update(task, description=f"[cyan]Parsing {xml_file.name}...")

            for article in parse_xml_file(xml_file):
                all_articles.append(article)

            progress.advance(task)

    console.print(f"[green]Parsed {len(all_articles)} articles[/green]")

    if output_path:
        console.print(f"[cyan]Saving to {output_path}...[/cyan]")
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump([a.to_dict() for a in all_articles], f, ensure_ascii=False, indent=2)
        console.print(f"[green]Saved {len(all_articles)} articles to {output_path}[/green]")

    return all_articles


def parse_mediawiki_files(articles_dir: Path, output_path: Path | None = None) -> list[WikiArticle]:
    """Parse individual .mediawiki files from a directory (Codeberg format)."""
    # Sort so that the index-based article IDs below are stable across runs
    mediawiki_files = sorted(articles_dir.glob("*.mediawiki"))

    if not mediawiki_files:
        console.print(f"[red]No .mediawiki files found in {articles_dir}[/red]")
        return []

    console.print(f"[green]Found {len(mediawiki_files)} .mediawiki files to parse[/green]")

    all_articles = []

    with Progress() as progress:
        task = progress.add_task("[cyan]Parsing files...", total=len(mediawiki_files))

        for i, filepath in enumerate(mediawiki_files):
            # Title is the filename without extension
            title = filepath.stem

            try:
                content = filepath.read_text(encoding="utf-8", errors="replace")
            except Exception as e:
                console.print(f"[yellow]Warning: Could not read {filepath}: {e}[/yellow]")
                progress.advance(task)
                continue

            # Skip redirects and empty files
            if not content or content.strip().lower().startswith("#redirect"):
                progress.advance(task)
                continue

            article = WikiArticle(
                id=i,
                title=title,
                content=content,
                plain_text=clean_wikitext(content),
                categories=extract_categories(content),
                links=extract_wiki_links(content),
                external_links=extract_external_links(content),
                timestamp="",
                contributor="",
            )

            all_articles.append(article)
            progress.advance(task)

    console.print(f"[green]Parsed {len(all_articles)} articles[/green]")

    if output_path:
        console.print(f"[cyan]Saving to {output_path}...[/cyan]")
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump([a.to_dict() for a in all_articles], f, ensure_ascii=False, indent=2)
        console.print(f"[green]Saved {len(all_articles)} articles to {output_path}[/green]")

    return all_articles


def main():
    """CLI entry point for parsing wiki content."""
    output_path = settings.data_dir / "articles.json"

    # Check for Codeberg-style articles directory first (newer, more complete)
    articles_dir = settings.project_root / "articles" / "articles"

    if articles_dir.exists():
        console.print("[cyan]Found Codeberg-style articles directory, using that...[/cyan]")
        parse_mediawiki_files(articles_dir, output_path)
    else:
        # Fall back to XML dumps
        parse_all_dumps(output_path)


if __name__ == "__main__":
    main()
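
To see what the link and category patterns in this module capture, the following standalone sketch applies the same two regular expressions to a small sample. `split_links` and the sample wikitext are illustrative only; results are sorted here just to make the output deterministic, which the module's own functions do not do.

```python
import re

LINK_PATTERN = re.compile(r"\[\[([^|\]]+)")
CATEGORY_PATTERN = re.compile(r"\[\[Category:([^\]|]+)", re.IGNORECASE)
SKIPPED_PREFIXES = ("category:", "file:", "image:")


def split_links(wikitext: str) -> tuple[list[str], list[str]]:
    """Return (internal link targets, category names), each sorted and de-duplicated."""
    categories = sorted(set(CATEGORY_PATTERN.findall(wikitext)))
    links = sorted(
        {
            target.strip()
            for target in LINK_PATTERN.findall(wikitext)
            if not target.lower().startswith(SKIPPED_PREFIXES)
        }
    )
    return links, categories


sample = (
    "'''Commons''' relate to [[Peer production|P2P production]] and [[Open source]].\n"
    "[[File:Logo.png]] [[Category:Economics]] [[Category:Governance|G]]"
)
links, categories = split_links(sample)
print(links)
print(categories)
```

The link pattern stops at the first `|`, so a piped link contributes its target page rather than its display text, and file or category links are filtered out by prefix.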

159
src/rag.py Normal file

@ -0,0 +1,159 @@
"""RAG (Retrieval Augmented Generation) system for wiki Q&A."""

from dataclasses import dataclass
from typing import Optional

from .embeddings import WikiVectorStore
from .llm import llm_client

SYSTEM_PROMPT = """You are a knowledgeable assistant for the P2P Foundation Wiki, a comprehensive knowledge base about peer-to-peer culture, commons-based peer production, alternative economics, and collaborative governance.
Your role is to answer questions about the wiki content accurately and helpfully. When answering:
1. Base your answers on the provided wiki content excerpts
2. Cite specific articles when relevant (use the article titles)
3. If the provided content doesn't fully answer the question, say so
4. Explain concepts in accessible language while maintaining accuracy
5. Connect related concepts when helpful
If asked about something not covered in the provided content, acknowledge this and suggest related topics that might be helpful."""


@dataclass
class ChatMessage:
    """A chat message."""

    role: str  # "user" or "assistant"
    content: str


@dataclass
class RAGResponse:
    """Response from the RAG system."""

    answer: str
    sources: list[dict]  # List of source articles used
    query: str


class WikiRAG:
    """RAG system for answering questions about wiki content."""

    def __init__(self, vector_store: Optional[WikiVectorStore] = None):
        self.vector_store = vector_store or WikiVectorStore()
        self.conversation_history: list[ChatMessage] = []

    def _format_context(self, search_results: list[dict]) -> str:
        """Format search results as context for the LLM."""
        if not search_results:
            return "No relevant wiki content found for this query."

        context_parts = []
        for i, result in enumerate(search_results, 1):
            title = result["metadata"].get("title", "Unknown")
            content = result["content"]
            categories = result["metadata"].get("categories", "")

            context_parts.append(
                f"[Source {i}: {title}]\n"
                f"Categories: {categories}\n"
                f"Content:\n{content}\n"
            )

        return "\n---\n".join(context_parts)

    def _build_prompt(self, query: str, context: str) -> str:
        """Build the prompt for the LLM."""
        # Include recent conversation history for context
        history_text = ""
        if self.conversation_history:
            recent = self.conversation_history[-4:]  # Last 2 exchanges
            history_text = "\n\nRecent conversation:\n"
            for msg in recent:
                role = "User" if msg.role == "user" else "Assistant"
                # Truncate long messages
                content = msg.content[:500] + "..." if len(msg.content) > 500 else msg.content
                history_text += f"{role}: {content}\n"

        return f"""Based on the following wiki content, please answer the user's question.
Wiki Content:
{context}
{history_text}
User Question: {query}
Please provide a helpful answer based on the wiki content above. Cite specific articles when relevant."""

    async def ask(
        self,
        query: str,
        n_results: int = 5,
        filter_categories: Optional[list[str]] = None,
    ) -> RAGResponse:
        """
        Ask a question and get an answer based on wiki content.

        Args:
            query: User's question
            n_results: Number of relevant chunks to retrieve
            filter_categories: Optional category filter

        Returns:
            RAGResponse with answer and sources
        """
        # Search for relevant content
        search_results = self.vector_store.search(
            query, n_results=n_results, filter_categories=filter_categories
        )

        # Format context
        context = self._format_context(search_results)

        # Build prompt
        prompt = self._build_prompt(query, context)

        # Get LLM response (chat always goes to Ollama)
        answer = await llm_client.chat(
            prompt,
            system=SYSTEM_PROMPT,
            use_claude=False,  # Use Ollama for chat
            temperature=0.7,
        )

        # Update conversation history
        self.conversation_history.append(ChatMessage(role="user", content=query))
        self.conversation_history.append(ChatMessage(role="assistant", content=answer))

        # Extract unique sources
        sources = []
        seen_titles = set()
        for result in search_results:
            title = result["metadata"].get("title", "Unknown")
            if title not in seen_titles:
                seen_titles.add(title)
                # Drop empty entries so a missing categories field gives [] rather than [""]
                categories = [
                    c for c in result["metadata"].get("categories", "").split(",") if c
                ]
                sources.append(
                    {
                        "title": title,
                        "article_id": result["metadata"].get("article_id"),
                        "categories": categories,
                    }
                )

        return RAGResponse(answer=answer, sources=sources, query=query)

    def clear_history(self):
        """Clear conversation history."""
        self.conversation_history = []

    def get_suggestions(self, partial_query: str, n_results: int = 5) -> list[str]:
        """Get article title suggestions for autocomplete."""
        # Simple case-insensitive substring matching on titles
        all_titles = self.vector_store.get_article_titles()
        partial_lower = partial_query.lower()

        suggestions = [
            title for title in all_titles if partial_lower in title.lower()
        ][:n_results]

        return suggestions
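
`get_suggestions` returns the first substring matches in whatever order the titles arrive. For autocomplete, one possible refinement is to list titles that start with the query before titles that merely contain it. This is a sketch of that idea under stated assumptions, not the module's behavior: `suggest_titles` is a hypothetical function that takes the title list as a plain argument.

```python
def suggest_titles(titles: list[str], partial_query: str, n_results: int = 5) -> list[str]:
    """Case-insensitive suggestions: prefix matches first, then other substring matches."""
    needle = partial_query.strip().lower()
    if not needle:
        return []
    prefix_matches = [t for t in titles if t.lower().startswith(needle)]
    inner_matches = [
        t for t in titles if needle in t.lower() and not t.lower().startswith(needle)
    ]
    return (prefix_matches + inner_matches)[:n_results]


titles = ["Commons-Based Peer Production", "Peer Production", "Commons", "Open Source"]
peer_suggestions = suggest_titles(titles, "peer")
print(peer_suggestions)
```

Each group keeps the input order, so sorting `titles` beforehand gives alphabetical suggestions within the prefix and substring groups.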

707
web/index.html Normal file

@ -0,0 +1,707 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>P2P Wiki AI</title>
<style>
:root {
--bg-primary: #1a1a2e;
--bg-secondary: #16213e;
--bg-tertiary: #0f3460;
--text-primary: #e8e8e8;
--text-secondary: #a0a0a0;
--accent: #e94560;
--accent-hover: #ff6b6b;
--success: #4ecdc4;
--border: #2a2a4a;
}
* {
box-sizing: border-box;
margin: 0;
padding: 0;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
background: var(--bg-primary);
color: var(--text-primary);
min-height: 100vh;
}
.container {
max-width: 1200px;
margin: 0 auto;
padding: 20px;
}
header {
display: flex;
justify-content: space-between;
align-items: center;
padding: 20px 0;
border-bottom: 1px solid var(--border);
margin-bottom: 30px;
}
h1 {
font-size: 1.8em;
font-weight: 600;
}
h1 span {
color: var(--accent);
}
.tabs {
display: flex;
gap: 10px;
}
.tab {
padding: 10px 20px;
background: var(--bg-secondary);
border: 1px solid var(--border);
border-radius: 8px;
cursor: pointer;
transition: all 0.2s;
}
.tab:hover, .tab.active {
background: var(--bg-tertiary);
border-color: var(--accent);
}
.panel {
display: none;
}
.panel.active {
display: block;
}
/* Chat Panel */
.chat-container {
display: flex;
flex-direction: column;
height: calc(100vh - 200px);
background: var(--bg-secondary);
border-radius: 12px;
overflow: hidden;
}
.chat-messages {
flex: 1;
overflow-y: auto;
padding: 20px;
}
.message {
margin-bottom: 20px;
max-width: 80%;
}
.message.user {
margin-left: auto;
}
.message-content {
padding: 15px;
border-radius: 12px;
line-height: 1.6;
}
.message.user .message-content {
background: var(--bg-tertiary);
}
.message.assistant .message-content {
background: var(--bg-primary);
border: 1px solid var(--border);
}
.message-sources {
margin-top: 10px;
padding: 10px;
background: rgba(233, 69, 96, 0.1);
border-radius: 8px;
font-size: 0.9em;
}
.message-sources h4 {
color: var(--accent);
margin-bottom: 5px;
}
.source-tag {
display: inline-block;
padding: 3px 8px;
margin: 2px;
background: var(--bg-tertiary);
border-radius: 4px;
font-size: 0.85em;
}
.chat-input {
display: flex;
gap: 10px;
padding: 20px;
background: var(--bg-primary);
border-top: 1px solid var(--border);
}
.chat-input input {
flex: 1;
padding: 15px;
background: var(--bg-secondary);
border: 1px solid var(--border);
border-radius: 8px;
color: var(--text-primary);
font-size: 1em;
}
.chat-input input:focus {
outline: none;
border-color: var(--accent);
}
.chat-input button {
padding: 15px 30px;
background: var(--accent);
border: none;
border-radius: 8px;
color: white;
font-weight: 600;
cursor: pointer;
transition: background 0.2s;
}
.chat-input button:hover {
background: var(--accent-hover);
}
.chat-input button:disabled {
opacity: 0.5;
cursor: not-allowed;
}
/* Ingress Panel */
.ingress-container {
background: var(--bg-secondary);
border-radius: 12px;
padding: 30px;
}
.ingress-form {
display: flex;
gap: 10px;
margin-bottom: 30px;
}
.ingress-form input {
flex: 1;
padding: 15px;
background: var(--bg-primary);
border: 1px solid var(--border);
border-radius: 8px;
color: var(--text-primary);
font-size: 1em;
}
.ingress-form input:focus {
outline: none;
border-color: var(--accent);
}
.ingress-form button {
padding: 15px 30px;
background: var(--success);
border: none;
border-radius: 8px;
color: var(--bg-primary);
font-weight: 600;
cursor: pointer;
transition: opacity 0.2s;
}
.ingress-form button:hover {
opacity: 0.9;
}
.ingress-form button:disabled {
opacity: 0.5;
cursor: not-allowed;
}
.ingress-result {
background: var(--bg-primary);
border-radius: 8px;
padding: 20px;
margin-bottom: 20px;
}
.ingress-result h3 {
margin-bottom: 15px;
color: var(--accent);
}
.result-stats {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
gap: 15px;
margin-bottom: 20px;
}
.stat {
background: var(--bg-secondary);
padding: 15px;
border-radius: 8px;
text-align: center;
}
.stat-value {
font-size: 2em;
font-weight: bold;
color: var(--success);
}
.stat-label {
color: var(--text-secondary);
font-size: 0.9em;
}
/* Review Panel */
.review-container {
background: var(--bg-secondary);
border-radius: 12px;
padding: 30px;
}
.review-item {
background: var(--bg-primary);
border-radius: 8px;
padding: 20px;
margin-bottom: 20px;
}
.review-item h3 {
margin-bottom: 10px;
}
.review-meta {
color: var(--text-secondary);
font-size: 0.9em;
margin-bottom: 15px;
}
.review-section {
margin-top: 20px;
padding-top: 20px;
border-top: 1px solid var(--border);
}
.review-section h4 {
margin-bottom: 10px;
color: var(--accent);
}
.match-item, .draft-item {
background: var(--bg-secondary);
padding: 15px;
border-radius: 8px;
margin-bottom: 10px;
}
.match-item .title, .draft-item .title {
font-weight: 600;
margin-bottom: 5px;
}
.match-item .score {
color: var(--success);
}
.action-buttons {
display: flex;
gap: 10px;
margin-top: 10px;
}
.btn-approve {
padding: 8px 16px;
background: var(--success);
border: none;
border-radius: 4px;
color: var(--bg-primary);
cursor: pointer;
}
.btn-reject {
padding: 8px 16px;
background: var(--accent);
border: none;
border-radius: 4px;
color: white;
cursor: pointer;
}
.loading {
display: inline-block;
width: 20px;
height: 20px;
border: 2px solid var(--text-secondary);
border-top-color: var(--accent);
border-radius: 50%;
animation: spin 1s linear infinite;
}
@keyframes spin {
to { transform: rotate(360deg); }
}
.empty-state {
text-align: center;
padding: 50px;
color: var(--text-secondary);
}
/* Markdown-like formatting */
.message-content p { margin-bottom: 10px; }
.message-content ul, .message-content ol { margin-left: 20px; margin-bottom: 10px; }
.message-content code { background: var(--bg-tertiary); padding: 2px 6px; border-radius: 4px; }
.message-content pre { background: var(--bg-tertiary); padding: 15px; border-radius: 8px; overflow-x: auto; }
</style>
</head>
<body>
<div class="container">
<header>
<h1>P2P Wiki <span>AI</span></h1>
<div class="tabs">
<div class="tab active" data-panel="chat">Chat</div>
<div class="tab" data-panel="ingress">Ingress</div>
<div class="tab" data-panel="review">Review Queue</div>
</div>
</header>
<!-- Chat Panel -->
<div id="chat" class="panel active">
<div class="chat-container">
<div class="chat-messages" id="chatMessages">
<div class="message assistant">
<div class="message-content">
<p>Welcome to the P2P Wiki AI assistant! I can help you explore the P2P Foundation Wiki's knowledge about peer-to-peer culture, commons-based peer production, alternative economics, and collaborative governance.</p>
<p>Ask me anything about these topics!</p>
</div>
</div>
</div>
<div class="chat-input">
<input type="text" id="chatInput" placeholder="Ask about P2P, commons, cooperative economics..." />
<button id="chatSend">Send</button>
</div>
</div>
</div>
<!-- Ingress Panel -->
<div id="ingress" class="panel">
<div class="ingress-container">
<h2>Article Ingress</h2>
<p style="color: var(--text-secondary); margin-bottom: 20px;">
Drop an article URL to analyze it for wiki content. The AI will identify relevant topics,
find matching wiki articles for citations, and draft new articles.
</p>
<div class="ingress-form">
<input type="url" id="ingressUrl" placeholder="https://example.com/article-about-commons" />
<button id="ingressSubmit">Process Article</button>
</div>
<div id="ingressResult"></div>
</div>
</div>
<!-- Review Panel -->
<div id="review" class="panel">
<div class="review-container">
<h2>Review Queue</h2>
<p style="color: var(--text-secondary); margin-bottom: 20px;">
Review and approve AI-generated wiki content before it's added to the wiki.
</p>
<div id="reviewItems">
<div class="empty-state">Loading review items...</div>
</div>
</div>
</div>
</div>
<script>
const API_BASE = ''; // Same origin
// Tab switching
document.querySelectorAll('.tab').forEach(tab => {
tab.addEventListener('click', () => {
document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
document.querySelectorAll('.panel').forEach(p => p.classList.remove('active'));
tab.classList.add('active');
document.getElementById(tab.dataset.panel).classList.add('active');
// Load review items when switching to review tab
if (tab.dataset.panel === 'review') {
loadReviewItems();
}
});
});
// Chat functionality
const chatMessages = document.getElementById('chatMessages');
const chatInput = document.getElementById('chatInput');
const chatSend = document.getElementById('chatSend');
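function escapeHtml(text) {
// Escape characters that are significant in HTML so untrusted text cannot inject markup
return String(text)
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;')
.replace(/'/g, '&#39;');
}
function addMessage(content, role, sources = []) {
const div = document.createElement('div');
div.className = `message ${role}`;
let html = `<div class="message-content">${formatMessage(content)}</div>`;
if (sources.length > 0) {
html += `<div class="message-sources">
<h4>Sources</h4>
${sources.map(s => `<span class="source-tag">${escapeHtml(s.title)}</span>`).join('')}
</div>`;
}
div.innerHTML = html;
chatMessages.appendChild(div);
chatMessages.scrollTop = chatMessages.scrollHeight;
}
function formatMessage(text) {
// Escape first, then apply basic markdown-like formatting
const html = escapeHtml(text)
.replace(/\n\n/g, '</p><p>')
.replace(/\n/g, '<br>')
.replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>')
.replace(/\*(.+?)\*/g, '<em>$1</em>')
.replace(/`(.+?)`/g, '<code>$1</code>');
// Wrap in a paragraph so the </p><p> breaks above stay balanced
return `<p>${html}</p>`;
}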
async function sendChat() {
const query = chatInput.value.trim();
if (!query) return;
chatInput.value = '';
chatSend.disabled = true;
addMessage(query, 'user');
// Show loading
const loadingDiv = document.createElement('div');
loadingDiv.className = 'message assistant';
loadingDiv.innerHTML = '<div class="message-content"><span class="loading"></span> Thinking...</div>';
chatMessages.appendChild(loadingDiv);
chatMessages.scrollTop = chatMessages.scrollHeight;
try {
const response = await fetch(`${API_BASE}/chat`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ query, n_results: 5 })
});
const data = await response.json();
if (response.ok) {
addMessage(data.answer, 'assistant', data.sources || []);
} else {
addMessage(`Error: ${data.detail || 'Something went wrong'}`, 'assistant');
}
} catch (error) {
addMessage(`Error: ${error.message}`, 'assistant');
} finally {
// Always clear the loading indicator and re-enable input, even if rendering throws
loadingDiv.remove();
chatSend.disabled = false;
chatInput.focus();
}
}
chatSend.addEventListener('click', sendChat);
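// 'keypress' is deprecated; use 'keydown' and ignore Enter while a request is in flight
chatInput.addEventListener('keydown', (e) => {
if (e.key === 'Enter' && !e.isComposing && !chatSend.disabled) sendChat();
});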
// Ingress functionality
const ingressUrl = document.getElementById('ingressUrl');
const ingressSubmit = document.getElementById('ingressSubmit');
const ingressResult = document.getElementById('ingressResult');
async function processIngress() {
const url = ingressUrl.value.trim();
if (!url) return;
ingressSubmit.disabled = true;
ingressSubmit.textContent = 'Processing...';
ingressResult.innerHTML = `
<div class="ingress-result">
<h3>Processing Article</h3>
<p><span class="loading"></span> Scraping and analyzing content...</p>
</div>
`;
try {
const response = await fetch(`${API_BASE}/ingress`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ url })
});
const data = await response.json();
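if (response.ok) {
// Escape the scraped title: it comes from an external page and is inserted via innerHTML
const safeTitle = String(data.scraped_title || 'Article')
.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
ingressResult.innerHTML = `
<div class="ingress-result">
<h3>Analysis Complete: ${safeTitle}</h3>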
<div class="result-stats">
<div class="stat">
<div class="stat-value">${data.topics_found}</div>
<div class="stat-label">Topics Found</div>
</div>
<div class="stat">
<div class="stat-value">${data.wiki_matches}</div>
<div class="stat-label">Wiki Matches</div>
</div>
<div class="stat">
<div class="stat-value">${data.drafts_generated}</div>
<div class="stat-label">Drafts Generated</div>
</div>
</div>
<p style="color: var(--success);">
Results added to review queue. Check the Review tab to approve or reject suggestions.
</p>
</div>
`;
} else {
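ingressResult.innerHTML = `
<div class="ingress-result">
<h3 style="color: var(--accent);">Error</h3>
<p></p>
</div>
`;
// Set the message via textContent so server-provided text is not parsed as HTML
ingressResult.querySelector('p').textContent = data.detail || 'Failed to process article';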
}
} catch (error) {
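ingressResult.innerHTML = `
<div class="ingress-result">
<h3 style="color: var(--accent);">Error</h3>
<p></p>
</div>
`;
// Set the message via textContent so the error text is not parsed as HTML
ingressResult.querySelector('p').textContent = error.message;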
}
ingressSubmit.disabled = false;
ingressSubmit.textContent = 'Process Article';
}
ingressSubmit.addEventListener('click', processIngress);
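// 'keypress' is deprecated; use 'keydown' and ignore Enter while a request is in flight
ingressUrl.addEventListener('keydown', (e) => {
if (e.key === 'Enter' && !e.isComposing && !ingressSubmit.disabled) processIngress();
});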
// Review functionality
const reviewItems = document.getElementById('reviewItems');
async function loadReviewItems() {
// Escape every value interpolated into innerHTML below: scraped and
// AI-generated text must not be parsed as markup
const esc = (value) => String(value ?? '')
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;')
.replace(/'/g, '&#39;');
// Only allow http(s) links so a scraped URL cannot carry a javascript: scheme
const safeHref = (url) => /^https?:\/\//i.test(String(url ?? '')) ? esc(url) : '#';
// Encode a value as a JS string literal that is safe inside an onclick attribute
const jsArg = (value) => esc(JSON.stringify(String(value ?? '')));
try {
const response = await fetch(`${API_BASE}/review`);
const data = await response.json();
if (!response.ok) {
throw new Error(data.detail || `Request failed with status ${response.status}`);
}
if (data.count === 0) {
reviewItems.innerHTML = '<div class="empty-state">No items in the review queue.</div>';
return;
}
reviewItems.innerHTML = data.items.map(item => `
<div class="review-item">
<h3>${esc(item.scraped?.title || 'Unknown Article')}</h3>
<div class="review-meta">
Source: <a href="${safeHref(item.scraped?.url)}" target="_blank" rel="noopener noreferrer">${esc(item.scraped?.domain)}</a>
| Processed: ${new Date(item.timestamp).toLocaleString()}
</div>
${item.wiki_matches?.length > 0 ? `
<div class="review-section">
<h4>Suggested Citations (${item.wiki_matches.length})</h4>
${item.wiki_matches.map((match, i) => `
<div class="match-item" ${match.approved ? 'style="opacity: 0.5"' : ''}>
<div class="title">${esc(match.title)}</div>
<div class="score">Relevance: ${(match.relevance_score * 100).toFixed(0)}%</div>
<div>${esc(match.suggested_citation)}</div>
${!match.approved && !match.rejected ? `
<div class="action-buttons">
<button class="btn-approve" onclick="reviewAction(${jsArg(item._filepath)}, 'match', ${i}, 'approve')">Approve</button>
<button class="btn-reject" onclick="reviewAction(${jsArg(item._filepath)}, 'match', ${i}, 'reject')">Reject</button>
</div>
` : `<em>${match.approved ? 'Approved' : 'Rejected'}</em>`}
</div>
`).join('')}
</div>
` : ''}
${item.draft_articles?.length > 0 ? `
<div class="review-section">
<h4>Draft Articles (${item.draft_articles.length})</h4>
${item.draft_articles.map((draft, i) => `
<div class="draft-item" ${draft.approved ? 'style="opacity: 0.5"' : ''}>
<div class="title">${esc(draft.title)}</div>
<div style="color: var(--text-secondary); font-size: 0.9em; margin-bottom: 10px;">
${esc(draft.summary)}
</div>
<details>
<summary style="cursor: pointer; color: var(--accent);">View Draft Content</summary>
<pre style="margin-top: 10px; white-space: pre-wrap; font-size: 0.85em;">${esc(draft.content)}</pre>
</details>
${!draft.approved && !draft.rejected ? `
<div class="action-buttons">
<button class="btn-approve" onclick="reviewAction(${jsArg(item._filepath)}, 'draft', ${i}, 'approve')">Approve</button>
<button class="btn-reject" onclick="reviewAction(${jsArg(item._filepath)}, 'draft', ${i}, 'reject')">Reject</button>
</div>
` : `<em>${draft.approved ? 'Approved' : 'Rejected'}</em>`}
</div>
`).join('')}
</div>
` : ''}
</div>
`).join('');
} catch (error) {
reviewItems.innerHTML = `<div class="empty-state">Error loading review items: ${esc(error.message)}</div>`;
}
}
async function reviewAction(filepath, itemType, itemIndex, action) {
try {
const response = await fetch(`${API_BASE}/review/action`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
filepath,
item_type: itemType,
item_index: itemIndex,
action
})
});
if (response.ok) {
loadReviewItems(); // Refresh the list
} else {
const data = await response.json();
alert(`Error: ${data.detail || 'Action failed'}`);
}
} catch (error) {
alert(`Error: ${error.message}`);
}
}
// Make reviewAction available globally
window.reviewAction = reviewAction;
</script>
</body>
</html>