id: task-010
title: Netcup RS 8000 infrastructure hardening and maintenance
status: Done
assignee:
created_date: 2026-03-15 07:30
labels: dev-ops, enhancement
dependencies:
priority: high

Description

High-priority infrastructure tasks for the Netcup RS 8000 production server (20 cores, 64 GB RAM, 3 TB storage), which runs 40+ live services. The work covers security hardening, storage cleanup, monitoring, and reliability improvements.

Acceptance Criteria

  • #1 Audit and rotate stale secrets (Infisical + KeePass) — identify unused or old credentials
  • #2 Review and harden Traefik config — audited, fixes in task-011 (requires host access)
  • #3 Storage cleanup — pruned ~330GB (74%→62% disk), removed 15 dead containers
  • #4 Set up automated Docker image pruning — script at /opt/apps/dev-ops/docker-weekly-prune.sh, cron in task-011
  • #5 Health check dashboard — audited (179 monitors, 46 unmonitored containers), gaps in task-014
  • #6 Backup verification — audited: NO automated backups anywhere, remediation in task-013
  • #7 Review container resource limits — added limits to 7 top consumers (postiz x3, p2pwiki-db, elasticsearch, gitea, immich_postgres)
  • #8 Update base images — audited, 5 critical/10 high upgrades needed, tracked in task-012
  • #13 Upgrade p2pwiki-db MariaDB 10.6→10.11 (backup + upgrade + mariadb-upgrade complete)
  • #9 Fix p2p-db CPU (174%→0.02%) — added missing wp_options index, cleaned 15k duplicate rows
  • #10 Fix p2pwiki CPU (50%→15%) — blocked Applebot hammering Special/API pages via .htaccess
  • #11 Remove junk containers — stopped funny_mirzakhani (cat /dev/urandom), payment-safe-mcp (9149 restarts)
  • #12 Vault-migration audit — 20 of 26 secrets confirmed stale, 2 active, 4 unclear. Deletion pending via Infisical UI
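
Criterion #4 points to the prune script without showing it. As a rough sketch of what such a job could look like (the 7-day `KEEP` window, the `DRY_RUN` switch, and the `run` and `weekly_prune` helpers are illustrative assumptions, not the contents of /opt/apps/dev-ops/docker-weekly-prune.sh):

```shell
#!/bin/sh
# Sketch of a weekly Docker prune job (assumed contents, not the deployed script).
# DRY_RUN defaults to 1 so commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
KEEP="${KEEP:-168h}"   # assumed retention window: keep anything newer than 7 days

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

weekly_prune() {
  run docker container prune -f --filter "until=$KEEP"   # stopped containers
  run docker image prune -af --filter "until=$KEEP"      # unused and dangling images
  run docker builder prune -af --filter "until=$KEEP"    # build cache
}

weekly_prune
```

Volumes are deliberately left out of this sketch, since pruning them can delete data. A hypothetical crontab entry such as `0 4 * * 0 DRY_RUN=0 /opt/apps/dev-ops/docker-weekly-prune.sh` would run it weekly; the real schedule is tracked in task-011.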

Audit Results (2026-03-15)

Secrets Audit — 121 secrets, 18 folders

  • HIGH: vault-migration folder has 26 likely stale secrets (Pusher, Holochain, Obsidian, old Cloudflare tokens, test Stripe keys)
  • HIGH: 6+ duplicate secrets across folders (Syncthing x6, GitHub x3, Cloudflare x3, RunPod x2)
  • MED: Test/dev keys in prod (Duffel test, Stripe test), 2 orphaned root-level secrets
  • Action: Audit each vault-migration key, consolidate duplicates, remove test keys

Container Health — 303 containers, 161 without health checks

  • FIXED: Stopped funny_mirzakhani (junk cat /dev/urandom container)
  • FIXED: Stopped payment-safe-mcp (restart loop, 9,149 restarts, no logs)
  • FIXED: Removed 15 crashed/init containers
  • URGENT: p2p-db at 78-132% CPU (MariaDB, investigate queries)
  • URGENT: 293 of 303 containers have ZERO resource limits
  • Top memory hogs without limits: postiz x3 (~2GB each), p2pwiki-db (1.8GB), gitea (1.7GB)
  • erpnext-queue-long crashed 7 days ago — needs restart
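
The "zero resource limits" count above can be reproduced with `docker inspect`. The following is a minimal sketch, assuming limits are set through the memory and `cpus` settings (which Docker reports as `HostConfig.Memory` and `HostConfig.NanoCpus`); limits applied only through CPU quota or shares would not be detected. The `unlimited_containers` helper name is made up for this example.

```shell
#!/bin/sh
# Reads "name memory_bytes nano_cpus" lines on stdin and prints the names of
# containers with neither limit set. Docker reports 0 for an unset limit.
unlimited_containers() {
  awk '$2 == 0 && $3 == 0 { print $1 }'
}

# Query running containers only when the Docker CLI is available.
if command -v docker >/dev/null 2>&1; then
  docker ps -q \
    | xargs -r docker inspect \
        --format '{{.Name}} {{.HostConfig.Memory}} {{.HostConfig.NanoCpus}}' \
    | unlimited_containers
fi
```

Piping the result through `wc -l` gives a number comparable to the 293 reported in the audit.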

Storage — 2.1 TB used / 3.0 TB (74%)

  • IN PROGRESS: Docker prune running (build cache ~347GB, dangling images ~50-80GB)
  • 115 dangling volumes (~2.5GB)
  • 20+ rspace services stopped and left in place for 2 weeks
  • payment-infra rebuild loop continuously generating dangling images

Traefik Security — 3 critical, 3 medium (requires HOST access)

  • C1: No TLS minimum version (defaults to TLS 1.0)
  • C2: No capability drops on Traefik container
  • C3: Ports 80/443 on 0.0.0.0 — bypasses Cloudflare
  • M1: No rate limiting middleware
  • M2: insecureSkipVerify on pentagi transport
  • M3: No default Content-Security-Policy header

Host Commands Needed

Traefik TLS hardening (run on host)

# Create TLS options file
cat > /root/traefik/config/tls-options.yml << 'EOF'
tls:
  options:
    default:
      minVersion: VersionTLS12
      cipherSuites:
        - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
        - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
EOF
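
Once Traefik has loaded the options file, the minimum version can be checked from any machine with `openssl`. This is a sketch only: `TLS_CHECK_HOST` and `TLS_CHECK_PORT` are placeholder variables, `tls_accepts` is a hypothetical helper, and some OpenSSL builds disable TLS 1.1 on the client side, in which case a "refused" result says nothing about the server.

```shell
#!/bin/sh
# Returns 0 if a handshake with the given protocol flag succeeds, non-zero otherwise.
# -tls1_1 and -tls1_2 are standard `openssl s_client` protocol flags.
tls_accepts() {
  host="$1"; port="$2"; flag="$3"
  openssl s_client -connect "$host:$port" -servername "$host" "$flag" \
    </dev/null >/dev/null 2>&1
}

# Runs only when TLS_CHECK_HOST is set (placeholder; use a hostname served by Traefik).
if [ -n "${TLS_CHECK_HOST:-}" ]; then
  port="${TLS_CHECK_PORT:-443}"
  if tls_accepts "$TLS_CHECK_HOST" "$port" -tls1_1; then
    echo "TLS 1.1 still accepted"
  else
    echo "TLS 1.1 refused"
  fi
  if tls_accepts "$TLS_CHECK_HOST" "$port" -tls1_2; then
    echo "TLS 1.2 accepted"
  else
    echo "TLS 1.2 refused"
  fi
fi
```

With the options above in effect, the expected outcome is a refused TLS 1.1 handshake and an accepted TLS 1.2 handshake.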

Traefik rate limiting (run on host)

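cat > /root/traefik/config/rate-limit.yml << 'EOF'
# Defines the middleware only; routers must reference it (rate-limit@file) for it to apply.
http:
  middlewares:
    rate-limit:
      rateLimit:
        average: 100   # sustained requests allowed per period, per source
        burst: 200     # short spikes of up to 200 requests are tolerated
        period: 1s
EOF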
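
One way to confirm the limiter is active is to fire a burst of parallel requests at a route that uses the middleware and count the HTTP 429 responses. The sketch below assumes `curl` and an `xargs` that supports `-P`; `RATE_TEST_URL`, the request count of 600, and the concurrency of 50 are placeholders, and the `count_status` and `burst` helpers are hypothetical. By default Traefik counts requests per remote address, so all requests from one test machine share a single bucket.

```shell
#!/bin/sh
# Counts how many lines on stdin equal the given HTTP status code.
count_status() {
  awk -v code="$1" '$1 == code { n++ } END { print n + 0 }'
}

# Sends N requests with the given concurrency, printing one status code per line.
burst() {
  url="$1"; n="$2"; par="$3"
  seq 1 "$n" | xargs -P "$par" -I{} curl -s -o /dev/null -w '%{http_code}\n' "$url"
}

# Runs only when RATE_TEST_URL is set (placeholder for a rate-limited route).
if [ -n "${RATE_TEST_URL:-}" ]; then
  echo "429 responses: $(burst "$RATE_TEST_URL" 600 50 | count_status 429)"
fi
```

A count of zero does not prove the middleware is missing: the client may simply not have exceeded 100 requests per second, so the request count or concurrency may need raising.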

Traefik container hardening (edit docker-compose on host)

Add to Traefik service:

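cap_drop: [ALL]                          # drop every Linux capability
cap_add: [NET_BIND_SERVICE]              # re-add only what binding ports below 1024 needs
security_opt: [no-new-privileges:true]   # block privilege escalation via setuid binaries
read_only: true                          # read-only root filesystem
tmpfs: [/tmp]                            # writable scratch space in memory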
deploy:
  resources:
    limits:
      memory: 512M
      cpus: '2.0'

Restrict ports to localhost (edit docker-compose on host)

ports:
  - "127.0.0.1:80:80"
  - "127.0.0.1:443:443"

Then restart Traefik: cd /root/traefik && docker compose up -d
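
After the restart, the port binding can be checked on the host. This sketch filters `ss -ltn` output for listeners on ports 80 and 443 that are bound to all interfaces; the `public_web_listeners` helper is hypothetical. Note that binding to 127.0.0.1 only works if Cloudflare traffic reaches Traefik through a process on the host (for example a tunnel connector), which this task does not state explicitly.

```shell
#!/bin/sh
# Reads `ss -ltn` output on stdin and prints local addresses on ports 80/443
# that are bound to all interfaces (0.0.0.0, [::], or *).
public_web_listeners() {
  awk '$4 ~ /^(0\.0\.0\.0|\*|\[::\]):(80|443)$/ { print $4 }'
}

# Inspect the live socket table only when ss is available.
if command -v ss >/dev/null 2>&1; then
  ss -ltn | public_web_listeners
fi
```

No output means nothing is listening publicly on 80 or 443; any printed line is a binding that still bypasses Cloudflare.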