refactor(ops): standardize image-based production delivery

Commit 7bcc831b5c (parent ef5e8016a4), 2026-03-30 23:35:29 +02:00
17 changed files with 447 additions and 538 deletions
## Overview
This is the operational runbook for the canonical CapaKraken delivery path:

1. CI validates every PR.
2. Every push to `main` publishes immutable release images.
3. Staging deploys one `sha-<commit>` tag.
4. Production promotes the same tag.
5. The host never builds application code from Git.
## 1. CI Gate

The merge gate is [ci.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/ci.yml).

It covers:

- architecture guardrails
- typecheck
- lint
- unit tests
- build
- E2E

Before merging, all required checks must pass.
Useful local commands:
```bash
# Quick check (< 2 min)
pnpm --filter @capakraken/web exec tsc --project tsconfig.typecheck.json --noEmit
pnpm lint

# Full check (< 3 min): also run unit tests
pnpm test:unit

# Full check including build (< 5 min)
pnpm --filter @capakraken/web exec next build
```
## 2. Image Release

[release-image.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/release-image.yml) runs automatically on every push to `main`.
It publishes:
- `ghcr.io/<owner>/<repo>-app:sha-<commit>`
- `ghcr.io/<owner>/<repo>-migrator:sha-<commit>`
The workflow is also callable manually if a rebuild or tag override is needed.
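Downstream tooling consumes these tags through `deploy.env`. A hypothetical fragment is shown below — the authoritative key list lives in `tooling/deploy/deploy.env.example`; `APP_IMAGE` and `MIGRATOR_IMAGE` are the variables the deploy script pulls:

```env
# Illustrative deploy.env — owner, repo, and tag are placeholders.
APP_IMAGE=ghcr.io/<owner>/<repo>-app:sha-<commit>
MIGRATOR_IMAGE=ghcr.io/<owner>/<repo>-migrator:sha-<commit>
```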
## 3. Host Bootstrap
Each deploy target should have a dedicated directory such as `/opt/capakraken` containing:
```text
docker-compose.prod.yml
.env.production
deploy.env
tooling/deploy/deploy-compose.sh
```
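For orientation, a hypothetical sketch of what `docker-compose.prod.yml` wires together. Service names (`postgres`, `redis`, `migrator`, `app`), the image variables, and the `APP_HOST_PORT` mapping follow the rest of this runbook; image versions and options are placeholders — the file in the repo is authoritative:

```yaml
# Hypothetical topology sketch — not the real docker-compose.prod.yml.
services:
  postgres:
    image: postgres:16        # illustrative version
  redis:
    image: redis:7            # illustrative version
  migrator:
    image: ${MIGRATOR_IMAGE}
    env_file: .env.production
  app:
    image: ${APP_IMAGE}
    env_file: .env.production
    ports:
      - "${APP_HOST_PORT:-3000}:3000"
```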
Use these examples from the repo:
- [tooling/deploy/.env.production.example](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/.env.production.example)
- [tooling/deploy/deploy.env.example](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/deploy.env.example)

Important host-side rules:

- keep `RATE_LIMIT_BACKEND=redis`
- keep runtime secrets in `.env.production` or the platform secret layer
- do not rotate runtime secrets through admin settings
- ensure the host can pull from `ghcr.io`
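A representative `.env.production` fragment, with placeholder values — `RATE_LIMIT_BACKEND` per the rule above, the remaining keys as the app requires; check `.env.production.example` for the complete list:

```env
DATABASE_URL=postgresql://user:pass@host:5432/capakraken
REDIS_URL=redis://host:6379
NEXTAUTH_URL=https://capakraken.your-domain.com
NEXTAUTH_SECRET=<random-32-char-string>
RATE_LIMIT_BACKEND=redis
```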
Generate a secure `NEXTAUTH_SECRET` with:
```bash
openssl rand -base64 32
```
Runtime secret policy:

- production secrets are injected through the deployment environment or host secret store
- admin settings must not be used to enter or rotate AI, SMTP, or anonymization secrets
- the admin UI is only for status checks and cleanup of legacy database-stored secret values

## 4. Staging Deployment

Standard path:

1. merge to `main`
2. wait for [release-image.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/release-image.yml) to publish `sha-<commit>`
3. run [deploy-staging.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/deploy-staging.yml) with that tag
The workflow uploads:
- [docker-compose.prod.yml](/home/hartmut/Documents/Copilot/capakraken/docker-compose.prod.yml)
- [tooling/deploy](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/README.md)
- a short-lived `deploy.env`
On the host, [deploy-compose.sh](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/deploy-compose.sh):
1. validates the rendered compose file
2. pulls `APP_IMAGE` and `MIGRATOR_IMAGE`
3. starts PostgreSQL and Redis
4. runs Prisma migrations with the `migrator` image
5. starts the app
6. waits for `GET /api/ready`
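Step 6 can be sketched as a small polling helper. This is a minimal illustration, not the actual `deploy-compose.sh` code; the check command is passed as arguments so the loop is easy to exercise:

```bash
# Poll a readiness check until it succeeds or the attempt budget runs out.
wait_for_ready() {
  attempts="$1"; shift
  i=0
  until "$@"; do
    i=$((i + 1))
    if [ "$i" -ge "$attempts" ]; then
      echo "not ready after $attempts attempts" >&2
      return 1
    fi
    sleep 1
  done
}

# Real usage against the app published on the host port:
#   wait_for_ready 30 curl -fsS http://127.0.0.1:3000/api/ready
```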
## 5. Production Promotion
After staging is accepted:
1. run [deploy-prod.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/deploy-prod.yml)
2. use the exact same `sha-<commit>` tag
3. verify `GET /api/ready`
Production must promote the already-tested image, not rebuild from source.
## 6. Manual Host Dry Run
If you need to verify the host outside GitHub Actions:
```bash
# On your server, after updating the host-side env/secret source
cp tooling/deploy/.env.production.example .env.production
cp tooling/deploy/deploy.env.example deploy.env
# fill in real secrets and image refs first
set -a
. ./deploy.env
set +a
bash tooling/deploy/deploy-compose.sh staging
```
## 7. Health Endpoints
### GET `/api/health`
Process liveness only. Use it for coarse uptime checks.
### GET `/api/ready`
Checks PostgreSQL and Redis connectivity. Use it for deploy readiness and traffic admission.
For deploys, `/api/ready` is the source of truth.
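Example `/api/ready` payloads (HTTP 200 when ready, 503 when not):

```json
// Healthy (HTTP 200)
{ "status": "ready", "postgres": "ok", "redis": "ok" }

// Unhealthy (HTTP 503)
{ "status": "not_ready", "postgres": "ok", "redis": "error" }
```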
## 8. Rollback
Rollback is image-based:
1. choose the previous healthy `sha-<commit>`
2. rerun the staging or production deploy workflow with that tag
3. confirm `GET /api/ready`
Schema changes still need expand-and-contract discipline for rollback safety.
## 9. Troubleshooting
### CI failure
Run the failing command locally:
```bash
pnpm --filter @capakraken/web exec tsc --project tsconfig.typecheck.json --noEmit
pnpm lint
pnpm test:unit
pnpm --filter @capakraken/web exec next build
```
Use the repo-level `pnpm db:*` commands for Prisma/database operations. They load `.env`, `.env.local`, `.env.$NODE_ENV`, and `.env.$NODE_ENV.local` automatically before invoking Prisma.
If you rotate runtime secrets during a manual deploy, update the host-side environment source first, then restart the app so the new process reads the updated values. Do not patch those values through admin settings.
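A minimal sketch of that layering, assuming later files override earlier ones (which is how the wrapper is expected to behave — verify against the actual `pnpm db:*` scripts):

```bash
# Demonstrate dotenv layering: each file sources over the previous one.
dir="$(mktemp -d)"; cd "$dir"
NODE_ENV=production
echo 'DATABASE_URL=from-dotenv' > .env
echo 'DATABASE_URL=from-local'  > .env.local
echo 'DATABASE_URL=from-prod'   > .env.production
set -a
for f in .env .env.local ".env.$NODE_ENV" ".env.$NODE_ENV.local"; do
  if [ -f "$f" ]; then . "./$f"; fi
done
set +a
echo "$DATABASE_URL"   # prints "from-prod"
```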
### Deploy fails before container start

Check the rendered compose configuration on the host:

```bash
docker compose -f docker-compose.prod.yml config -q
```

Then verify `.env.production` and `deploy.env`.

### App never becomes ready

Check:

```bash
docker compose -f docker-compose.prod.yml ps
docker compose -f docker-compose.prod.yml logs --tail 200 app
curl -s http://127.0.0.1:${APP_HOST_PORT:-3000}/api/ready
```

### Database migration failure

Inspect the migrator logs:

```bash
docker compose -f docker-compose.prod.yml run --rm migrator
```
### Registry pull failure

Verify `GHCR_USERNAME` and `GHCR_TOKEN`, then test:

```bash
printf '%s\n' "$GHCR_TOKEN" | docker login ghcr.io -u "$GHCR_USERNAME" --password-stdin
```