- Warm up both models on startup with keep_alive=24h (no cold starts)
- Use 16 threads for inference (server has 20 cores)
- Reduce context window to 1024 tokens, max output to 256
- Persistent httpx client for embedding calls (skip TCP handshake)
- Trim RAG chunks to 300 chars, history to 4 messages
- Shorter system prompt and context wrapper
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RAG-powered chatbot that indexes Erowid's experience reports and substance
info, making them searchable via natural conversation. Built with FastAPI,
PostgreSQL+pgvector, Ollama embeddings, and streaming LLM responses.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>