# refactor(ops): standardize image-based production delivery

## Overview
CapaKraken uses GitHub Actions for continuous integration and Docker for deployment. This is the operational runbook for the canonical delivery path, from code push to production:

1. CI validates every PR.
2. Every push to `main` publishes immutable release images.
3. Staging deploys one `sha-<commit>` tag.
4. Production promotes the same tag.
5. The host never builds application code from Git.

---
## 1. CI Gate

### What triggers it

The merge gate is [ci.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/ci.yml).

| Event | Trigger |
|-------|---------|
| Pull request to `main` | All CI jobs run |
| Push to `main` | All CI jobs run |

### Jobs and their purpose

The gate covers:

- architecture guardrails
- typecheck
- lint
- unit tests
- build
- E2E

```
PR opened / pushed
│
├──→ typecheck (tsc --noEmit, ~40s)
├──→ lint (ESLint via Turborepo, ~20s)
├──→ test (Vitest unit tests, ~60s, needs PostgreSQL + Redis)
│
└──→ build (next build, ~90s, runs after typecheck)
      │
      └──→ e2e (Playwright, ~3-5 min, runs after build)
```

**typecheck, lint, and test run in parallel** for speed. Build waits for typecheck, and E2E waits for build. All required checks must pass before merging.

### What each job checks

| Job | Command | What it catches |
|-----|---------|-----------------|
| **typecheck** | `pnpm --filter @capakraken/web exec tsc --noEmit` | Type errors across the full web app |
| **lint** | `pnpm lint` | Code style violations, unused imports, etc. |
| **test** | `pnpm test:unit` | Unit test failures in engine, staffing, API, shared |
| **build** | `pnpm --filter @capakraken/web exec next build` | SSR errors, dynamic import issues, bundle problems |
| **e2e** | `pnpm test:e2e` | End-to-end user flow regressions |

### Required status checks

Before merging a PR, **all jobs must pass**. Configure this in GitHub Settings > Branches > Branch protection rules > Require status checks.

### Caching

The pipeline caches these artifacts to speed up subsequent runs:

| Cache | Key | Saves |
|-------|-----|-------|
| pnpm store | `pnpm-lock.yaml` hash | ~30s install time |
| Turborepo | `.turbo` directory | ~60s on unchanged packages |
| Playwright browsers | Playwright version | ~45s browser download |
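The content-addressed keys in the caching table can be illustrated with a short sketch. The `pnpm-store-` prefix and 12-character hash length are illustrative only; the workflow's real key format may differ:

```shell
# Sketch: derive a cache key from the lockfile's contents, so the
# cache is invalidated whenever pnpm-lock.yaml changes.
lockfile=$(mktemp)
printf 'lockfile contents\n' > "$lockfile"   # stand-in for pnpm-lock.yaml

key="pnpm-store-$(sha256sum "$lockfile" | cut -c1-12)"
echo "$key"
```

The same contents always hash to the same key, so an unchanged lockfile reuses the stored pnpm store.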

---

## 2. Local Development Quality Gates

Run these local commands before pushing to catch issues early:

```bash
# Quick check (< 2 min)
pnpm --filter @capakraken/web exec tsc --noEmit && pnpm lint

# Full check (< 3 min)
pnpm --filter @capakraken/web exec tsc --project tsconfig.typecheck.json --noEmit
pnpm lint
pnpm test:unit

# Full check including build (< 5 min)
pnpm --filter @capakraken/web exec next build
```
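The pre-push routine amounts to running gates in order and stopping at the first failure. A minimal sketch, where the gate commands are passed in as strings (the helper name is illustrative, not part of the repo):

```shell
# Sketch: run each gate command in order; abort on the first failure.
# $gate is intentionally unquoted so multi-word commands word-split.
run_gates() {
  local gate
  for gate in "$@"; do
    if ! $gate; then
      echo "gate failed: $gate" >&2
      return 1
    fi
  done
  echo "all gates passed"
}

# Usage (real gates would be the pnpm commands above):
# run_gates "pnpm lint" "pnpm test:unit"
```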

### Pre-commit hook (optional)

You can add a Git pre-commit hook to run the quick check automatically:

```bash
# .husky/pre-commit
pnpm --filter @capakraken/web exec tsc --noEmit
pnpm lint
```

---

## 3. Image Release

[release-image.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/release-image.yml) runs automatically on every push to `main`. It publishes:

- `ghcr.io/<owner>/<repo>-app:sha-<commit>`
- `ghcr.io/<owner>/<repo>-migrator:sha-<commit>`

The workflow can also be run manually if a rebuild or tag override is needed.
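The two image refs for one commit can be derived mechanically. In this sketch, `owner`, `repo`, and whether the tag embeds the full or abbreviated SHA are assumptions; they must match what release-image.yml actually publishes:

```shell
# Sketch: build the app and migrator image refs for a given commit.
# owner/repo/commit values are illustrative placeholders.
owner="acme"
repo="capakraken"
commit="4f2a9c1"

tag="sha-${commit}"
app_image="ghcr.io/${owner}/${repo}-app:${tag}"
migrator_image="ghcr.io/${owner}/${repo}-migrator:${tag}"

echo "$app_image"
echo "$migrator_image"
```

Because the tag is derived from the commit, the same commit always maps to the same immutable image refs.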

## 4. Host Bootstrap

Each deploy target should have a dedicated directory such as `/opt/capakraken` containing:

```text
docker-compose.prod.yml
.env.production
deploy.env
tooling/deploy/deploy-compose.sh
```
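A quick preflight can confirm the directory layout above before attempting a deploy. A sketch (`check_host_dir` is an illustrative helper, not part of the repo; the file list mirrors the layout above):

```shell
# Sketch: verify a deploy directory contains the expected four files.
check_host_dir() {
  local dir=$1 f missing=0
  for f in docker-compose.prod.yml .env.production deploy.env \
           tooling/deploy/deploy-compose.sh; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
# check_host_dir /opt/capakraken || exit 1
```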

Use these examples from the repo:

- [tooling/deploy/.env.production.example](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/.env.production.example)
- [tooling/deploy/deploy.env.example](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/deploy.env.example)

Important host-side rules:

- keep `RATE_LIMIT_BACKEND=redis`
- keep runtime secrets in `.env.production` or the platform secret layer
- do not rotate runtime secrets through admin settings
- ensure the host can pull from `ghcr.io`

---

## 5. Health Check Endpoints

Two endpoints are available for monitoring:

### GET `/api/health` — Liveness Probe

Returns 200 if the Node.js process is running. No external dependencies are checked.

```json
{ "status": "ok", "timestamp": "2026-03-19T10:00:00.000Z" }
```

**Use for:** Kubernetes/Docker liveness probe, uptime monitoring.

### GET `/api/ready` — Readiness Probe

Checks PostgreSQL and Redis connectivity. Returns 200 if all services are reachable, 503 if not.

```json
// Healthy
{ "status": "ready", "postgres": "ok", "redis": "ok" }

// Unhealthy
{ "status": "not_ready", "postgres": "ok", "redis": "error" }
```

**Use for:** Kubernetes/Docker readiness probe, load balancer health checks, nginx upstream checks.
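Deploy tooling can gate on `/api/ready` with a bounded retry loop. A minimal sketch, where the actual probe (for example a `curl -fsS` against the app port) is passed in as a command; the retry budget and probe URL are assumptions:

```shell
# Sketch: poll a readiness command until it succeeds or the retry
# budget is exhausted.
wait_for_ready() {
  local attempts=$1 delay=$2; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not ready after ${attempts} attempts" >&2
  return 1
}

# Usage during a deploy (URL is an assumption):
# wait_for_ready 30 2 curl -fsS http://127.0.0.1:3000/api/ready
```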

---

## 6. Production Docker Build

### Building the production image

```bash
# Build the image
docker build -f Dockerfile.prod -t capakraken:latest .

# Test it locally
docker compose -f docker-compose.prod.yml up -d
```

### Image details

| Property | Value |
|----------|-------|
| Base | `node:20-bookworm-slim` |
| Size | ~150-200 MB (vs ~1.5 GB dev image) |
| Output | Next.js standalone mode |
| Healthcheck | `curl -f http://localhost:3000/api/health` |
| Port | 3000 (internal), mapped to 3100 externally |

### Environment variables

The production image requires these environment variables:

```env
# Required
DATABASE_URL=postgresql://user:pass@host:5432/capakraken
REDIS_URL=redis://host:6379
NEXTAUTH_URL=https://capakraken.your-domain.com
NEXTAUTH_SECRET=<random-32-char-string>

# Optional
SENTRY_DSN=https://xxx@sentry.io/xxx
SMTP_HOST=smtp.example.com
SMTP_PORT=587
SMTP_USER=notifications@example.com
SMTP_PASSWORD=<password>
SMTP_FROM=CapaKraken <notifications@example.com>
OPENAI_API_KEY=<optional-if-openai-used>
AZURE_OPENAI_API_KEY=<optional-if-azure-chat-used>
AZURE_DALLE_API_KEY=<optional-if-azure-image-gen-used>
GEMINI_API_KEY=<optional-if-gemini-used>
ANONYMIZATION_SEED=<required-if-deterministic-anonymization-enabled>
```

Generate a secure `NEXTAUTH_SECRET` with:

```bash
openssl rand -base64 32
```
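A startup wrapper can fail fast when one of the required variables above is missing. A sketch (`require_env` is an illustrative bash helper, not part of the repo):

```shell
# Sketch: report every missing required environment variable, then
# fail if any was absent. ${!var} is bash indirect expansion.
require_env() {
  local missing=0 var
  for var in "$@"; do
    if [ -z "${!var:-}" ]; then
      echo "missing required env var: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
# require_env DATABASE_URL REDIS_URL NEXTAUTH_URL NEXTAUTH_SECRET || exit 1
```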
Runtime secret policy:

- production secrets are injected through the deployment environment or host secret store
- admin settings must not be used to enter or rotate AI, SMTP, or anonymization secrets
- the admin UI is only for status checks and cleanup of legacy database-stored secret values

---

## 7. Staging Deployment

Standard path:

1. merge to `main`
2. wait for [release-image.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/release-image.yml) to publish `sha-<commit>`
3. run [deploy-staging.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/deploy-staging.yml) with that tag

The workflow uploads to the host:

- [docker-compose.prod.yml](/home/hartmut/Documents/Copilot/capakraken/docker-compose.prod.yml)
- [tooling/deploy](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/README.md)
- a short-lived `deploy.env`

On the host, [deploy-compose.sh](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/deploy-compose.sh):

1. validates the rendered compose file
2. pulls `APP_IMAGE` and `MIGRATOR_IMAGE`
3. starts PostgreSQL and Redis
4. runs Prisma migrations with the `migrator` image
5. starts the app
6. waits for `GET /api/ready`
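The six host-side steps can be pictured as an abort-on-first-failure pipeline. The phase function names below are illustrative, not `deploy-compose.sh`'s real internals:

```shell
# Sketch of the deploy phases as an ordered pipeline that stops at
# the first failing step. Each phase function is a stand-in.
deploy() {
  validate_compose  || return 1   # 1. validate rendered compose file
  pull_images       || return 1   # 2. pull APP_IMAGE and MIGRATOR_IMAGE
  start_datastores  || return 1   # 3. start PostgreSQL and Redis
  run_migrations    || return 1   # 4. run Prisma migrations (migrator image)
  start_app         || return 1   # 5. start the app
  wait_for_ready                  # 6. wait for GET /api/ready
}
```

A failed migration therefore prevents the app from ever starting with a mismatched schema.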
## 8. Production Promotion

After staging is accepted:

1. run [deploy-prod.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/deploy-prod.yml)
2. use the exact same `sha-<commit>` tag
3. verify `GET /api/ready`

Production must promote the already-tested image, not rebuild from source.
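The same-tag rule can be enforced mechanically by comparing the requested production tag against the tag staging validated. A sketch (function and variable names are illustrative):

```shell
# Sketch: refuse a production deploy whose tag differs from the
# tag that staging accepted.
promote() {
  local staging_tag=$1 prod_tag=$2
  if [ "$prod_tag" != "$staging_tag" ]; then
    echo "refusing: prod tag $prod_tag != staging tag $staging_tag" >&2
    return 1
  fi
  echo "promoting $prod_tag"
}
```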
## 9. Manual Host Dry Run

If you need to verify the host outside GitHub Actions:

```bash
# On your server, after updating the host-side env/secret source
cp tooling/deploy/.env.production.example .env.production
cp tooling/deploy/deploy.env.example deploy.env
# fill in real secrets and image refs first

set -a
. ./deploy.env
set +a
bash tooling/deploy/deploy-compose.sh staging
```

Migrations run via the `migrator` image inside `deploy-compose.sh`; the host never builds or runs application code from Git. To seed initial data on a first deployment:

```bash
docker compose -f docker-compose.prod.yml exec app \
  pnpm db:seed
```

For deploys, `GET /api/ready` is the source of truth.

## 10. Rollback

Rollback is image-based:

1. choose the previous healthy `sha-<commit>`
2. rerun the staging or production deploy workflow with that tag
3. confirm `GET /api/ready`

Schema changes still need expand-and-contract discipline for rollback safety.
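Step 1 can be mechanized by scanning a newest-first tag history for the most recent entry that is not the current tag. A sketch with illustrative data (a real history would come from the registry or deploy logs):

```shell
# Sketch: given the current tag and a newest-first history of healthy
# tags, print the most recent tag that is not the current one.
previous_healthy_tag() {
  local current=$1; shift
  local tag
  for tag in "$@"; do
    if [ "$tag" != "$current" ]; then
      echo "$tag"
      return 0
    fi
  done
  return 1
}

# Usage:
# previous_healthy_tag sha-ccc sha-ccc sha-bbb sha-aaa   # prints sha-bbb
```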
## 11. Troubleshooting

### CI failure

Run the failing job's command locally (see the job table in the CI section):

```bash
pnpm --filter @capakraken/web exec tsc --noEmit   # typecheck
pnpm lint                                         # lint
pnpm test:unit                                    # test
pnpm --filter @capakraken/web exec next build     # build
pnpm test:e2e                                     # e2e
```

Use the repo-level `pnpm db:*` commands for Prisma/database operations. They load `.env`, `.env.local`, `.env.$NODE_ENV`, and `.env.$NODE_ENV.local` automatically before invoking Prisma.

If you rotate runtime secrets during a manual deploy, update the host-side environment source first, then restart the app so the new process reads the updated values. Do not patch those values through admin settings.

### nginx configuration

The existing nginx reverse proxy should forward to port 3100:

```nginx
server {
    server_name capakraken.hartmut-noerenberg.com;

    location / {
        proxy_pass http://127.0.0.1:3100;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # SSE support (keep connection open)
        proxy_read_timeout 86400s;
        proxy_buffering off;
    }
}
```

## 12. Monitoring Setup

### Sentry (error tracking)

After creating a Sentry project, add the DSN to `.env.production`:

```env
SENTRY_DSN=https://xxx@sentry.io/xxx
```

Errors are automatically captured by the Sentry integration in Next.js.

### Uptime monitoring

Point an external monitor (UptimeRobot, Better Stack, etc.) at:

```
https://capakraken.hartmut-noerenberg.com/api/health
```

Alert if the status code is not 200 for more than 2 consecutive checks.
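The consecutive-checks alert rule amounts to a failure counter that resets on success. A hosted monitor implements this internally; the sketch below is only illustrative (in practice the probe would be a `curl` against the health URL):

```shell
# Sketch: alert only after more than 2 consecutive failed checks.
failures=0
threshold=2

record_check() {            # $1 = "ok" or "fail"
  if [ "$1" = ok ]; then
    failures=0
  else
    failures=$((failures + 1))
  fi
  if [ "$failures" -gt "$threshold" ]; then
    echo ALERT
  fi
}

record_check fail
record_check fail
record_check fail   # third consecutive failure, prints ALERT
```

A single flaky check never alerts; only a sustained outage does.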

---

## 13. Troubleshooting Reference

### CI job fails: "tsc --noEmit"

TypeScript error in the web app. Run locally:

```bash
pnpm --filter @capakraken/web exec tsc --noEmit
```

### CI job fails: "test:unit"

Unit test failure. Run locally:

```bash
pnpm --filter @capakraken/web exec tsc --project tsconfig.typecheck.json --noEmit
pnpm lint
pnpm test:unit
```

### CI job fails: "next build"

Build error (often `ssr: false` in Server Components, missing exports). Run locally:

```bash
pnpm --filter @capakraken/web exec next build
```

### CI job fails: "e2e"

Playwright test failure. Check the HTML report artifact in the GitHub Actions run.

### Deploy fails before container start

Check the rendered compose configuration on the host:

```bash
docker compose -f docker-compose.prod.yml config -q
```

Then verify `.env.production` and `deploy.env`.

### Production: 502 Bad Gateway

The Next.js process isn't running. Check whether anything is listening on the app port:

```bash
ss -tlnp | grep 3100   # Is anything listening?
```

### App never becomes ready

Check:

```bash
docker compose -f docker-compose.prod.yml ps
docker compose -f docker-compose.prod.yml logs --tail 200 app
curl -s http://127.0.0.1:${APP_HOST_PORT:-3000}/api/ready
```
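The readiness JSON can be classified without extra tooling using shell pattern matching. A sketch matching the response bodies shown in the health section (a production check could use `jq` instead):

```shell
# Sketch: treat the body as ready only when it carries
# "status": "ready", per the readiness responses documented above.
is_ready() {
  case "$1" in
    *'"status": "ready"'*) return 0 ;;
    *) return 1 ;;
  esac
}

is_ready '{ "status": "ready", "postgres": "ok", "redis": "ok" }' && echo ready
```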

### Database migration failure

Inspect the migrator logs by re-running it:

```bash
docker compose -f docker-compose.prod.yml run --rm migrator
```

### Production: 500 Internal Server Error

Usually a stale Prisma client after schema changes. Regenerate, rebuild, and release a fresh image:

```bash
pnpm db:generate
pnpm db:validate
rm -rf apps/web/.next
pnpm --filter @capakraken/web exec next build
```

### Database connection issues

Check the `/api/ready` endpoint:

```bash
curl -s https://capakraken.hartmut-noerenberg.com/api/ready | jq .
```

If it reports `postgres: "error"`, verify:

```bash
docker ps | grep postgres                               # Is the container running?
psql -h localhost -p 5433 -U capakraken -d capakraken   # Can you connect?
```

### Registry pull failure

Verify `GHCR_USERNAME` and `GHCR_TOKEN`, then test the login:

```bash
printf '%s\n' "$GHCR_TOKEN" | docker login ghcr.io -u "$GHCR_USERNAME" --password-stdin
```