refactor(ops): standardize image-based production delivery

This commit is contained in:
2026-03-30 23:35:29 +02:00
parent ef5e8016a4
commit 7bcc831b5c
17 changed files with 447 additions and 538 deletions
## Goal
This document describes the canonical release path for CapaKraken.
The release model is now:
1. PRs are validated by CI before merge.
2. Every push to `main` publishes immutable `app` and `migrator` images.
3. Staging and production promote the exact same `sha-<commit>` tag.
4. The host deploys only from images and runtime env files.
5. A deployment is successful only after `GET /api/ready` passes.
## Core Idea
The production host should stop building application code from a Git checkout. Instead, it should only:
- pull a versioned `app` image
- pull a matching `migrator` image
- run Prisma deploy migrations
- start the application container
- wait for readiness
That removes "works on the server but not in CI" drift and makes rollbacks much simpler.
## Canonical Flow
### 1. Pull Request Validation
The main [ci.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/ci.yml) workflow remains the merge gate for:
- architecture guardrails
- typecheck
- lint
- unit tests
- build
- E2E
### 2. Automatic Image Release
[release-image.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/release-image.yml) now runs automatically on every push to `main` and can still be started manually for rebuilds or tag overrides.
It publishes two images from [Dockerfile.prod](/home/hartmut/Documents/Copilot/capakraken/Dockerfile.prod):
- `ghcr.io/<owner>/<repo>-app:sha-<commit>`
- `ghcr.io/<owner>/<repo>-migrator:sha-<commit>`
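The naming scheme can be sketched in plain shell; `owner` and the example commit id below are placeholders, not confirmed registry values:
```shell
# Sketch of deriving the immutable image references from a commit id.
# "owner" and "abc123" are placeholders, not real values.
commit="abc123"
tag="sha-${commit}"
APP_IMAGE="ghcr.io/owner/capakraken-app:${tag}"
MIGRATOR_IMAGE="ghcr.io/owner/capakraken-migrator:${tag}"
echo "${APP_IMAGE}"
```
Because the tag embeds the commit, pulling the same tag twice can never yield two different builds.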
### 3. Staging Promotion
[deploy-staging.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/deploy-staging.yml) copies the canonical deploy bundle to the staging host:
- [docker-compose.prod.yml](/home/hartmut/Documents/Copilot/capakraken/docker-compose.prod.yml)
- [tooling/deploy/deploy-compose.sh](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/deploy-compose.sh)
- the rest of [tooling/deploy](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/README.md)
GitHub Actions also writes a short-lived `deploy.env` containing `APP_IMAGE`, `MIGRATOR_IMAGE`, and the host port.
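A hypothetical `deploy.env` could look like this; the image values are placeholders, and any variable beyond `APP_IMAGE` and `MIGRATOR_IMAGE` is an assumption about the workflow:
```shell
# Hypothetical deploy.env content; all values are placeholders.
APP_IMAGE=ghcr.io/owner/capakraken-app:sha-abc123
MIGRATOR_IMAGE=ghcr.io/owner/capakraken-migrator:sha-abc123
APP_HOST_PORT=3000   # host-port variable name is an assumption
```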
### 4. Host-Side Deployment
On the target host, [deploy-compose.sh](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/deploy-compose.sh):
1. loads `.env.production` and `deploy.env`
2. validates the rendered compose file
3. pulls the immutable `app` and `migrator` images
4. starts PostgreSQL and Redis
5. runs Prisma migrations through the dedicated `migrator` image
6. starts the new `app` container
7. waits for `GET /api/ready`
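Step 7 is the success gate. A minimal polling sketch, with the probe command injected as a parameter so nothing here depends on a real host:
```shell
# Minimal readiness-wait sketch. $1 is the probe command, $2 the max
# number of attempts, $3 the delay between attempts in seconds.
wait_ready() {
  probe="$1"; tries="${2:-30}"; delay="${3:-2}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if $probe >/dev/null 2>&1; then
      return 0            # the app answered: deployment is considered good
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1                # never became ready: fail the deploy
}
```
In practice the probe would be something like `wait_ready "curl -fsS http://127.0.0.1:3000/api/ready" 30 2`, where the port is a placeholder for the configured host port.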
The host does not build application code from Git anymore.
### 5. Production Promotion
[deploy-prod.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/deploy-prod.yml) repeats the exact staging flow with the same image tag after staging acceptance.
That keeps staging and production on the same artifact instead of rebuilding.
## Required Infrastructure
- GitHub repository with Actions enabled
- GHCR or another container registry
- one Linux host with Docker Engine and Docker Compose v2
- PostgreSQL
- Redis
- SSH access from GitHub Actions to the host
- reverse proxy or load balancer in front of the app
### Recommended
- separate staging and production hosts
- GitHub Environments for `staging` and `production`
- required approval for the `production` environment
- monitoring on `/api/health` and `/api/ready`
- PostgreSQL backup and restore drills
## Runtime Configuration
### GitHub Environment Secrets
For `staging`:
- `STAGING_SSH_HOST`
- `STAGING_SSH_PORT`
- `STAGING_SSH_USER`
- `STAGING_SSH_KEY`
- `STAGING_DEPLOY_PATH`
- `STAGING_APP_HOST_PORT`
- `STAGING_GHCR_USERNAME`
- `STAGING_GHCR_TOKEN`
For `production`:
- `PROD_SSH_HOST`
- `PROD_SSH_PORT`
- `PROD_SSH_USER`
- `PROD_SSH_KEY`
- `PROD_DEPLOY_PATH`
- `PROD_APP_HOST_PORT`
- `PROD_GHCR_USERNAME`
- `PROD_GHCR_TOKEN`
### Host-side Files
The canonical host-side inputs are:
- [docker-compose.prod.yml](/home/hartmut/Documents/Copilot/capakraken/docker-compose.prod.yml)
- `.env.production`
- `deploy.env`
`.env.production` holds long-lived runtime configuration and secrets. The example file is [tooling/deploy/.env.production.example](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/.env.production.example).
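A minimal `.env.production` might look like the following sketch; every value is a placeholder, and the authoritative variable list is the example file above:
```shell
# Minimal .env.production sketch; all values are placeholders.
POSTGRES_PASSWORD=long-random-password
NEXTAUTH_URL=https://capakraken.example.com
NEXTAUTH_SECRET=long-random-secret
RATE_LIMIT_BACKEND=redis   # kept explicit in release environments
```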
`deploy.env` is short-lived deployment metadata. The example file is [tooling/deploy/deploy.env.example](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/deploy.env.example).
Important invariants:
- `RATE_LIMIT_BACKEND=redis` should stay explicit in release environments
- runtime AI, SMTP, and anonymization secrets belong to the host or platform secret layer
- admin settings are for verification and legacy-secret cleanup, not for secret rotation
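The deploy script loads `.env.production` first and `deploy.env` second (step 1 of the host-side flow), so both compose interpolation and the container runtime see the same values. A minimal sketch of that load order, assuming both are plain `KEY=VALUE` files readable by the shell:
```shell
# Load-order sketch: .env.production first, deploy.env second,
# so the short-lived image references win on any overlap.
load_env() {
  set -a               # auto-export every variable the files define
  . "$1"               # long-lived runtime configuration and secrets
  . "$2"               # short-lived image references from CI
  set +a
}
# usage: load_env .env.production deploy.env
```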
## Database Policy
Release environments must run migrations through the `migrator` image, which executes:
```bash
pnpm --filter @capakraken/db db:migrate:deploy
```
`db:push` remains a local-development tool, not a production rollout mechanism; it does not provide the release traceability that a migration-based deploy requires.
## Rollback Model
Rollback is image-based:
1. choose the previous healthy `sha-<commit>` tag
2. redeploy staging or production with that tag
3. confirm `GET /api/ready`
This assumes schema changes follow backwards-compatible expand-and-contract rollout rules.
## Production Update Summary
The standard production update is:
1. merge to `main` after CI is green
2. let [release-image.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/release-image.yml) publish `sha-<commit>` images
3. deploy that tag to staging through [deploy-staging.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/deploy-staging.yml)
4. validate staging
5. promote the same tag through [deploy-prod.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/deploy-prod.yml)
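The promotion steps above can be driven from the GitHub CLI. The `image_tag` dispatch-input name is an assumption about the workflows, so this helper only prints the commands instead of executing them:
```shell
# Print (rather than run) hypothetical promotion commands.
# "image_tag" is an assumed workflow_dispatch input name.
promote() {
  workflow="$1"; tag="$2"
  echo "gh workflow run ${workflow} -f image_tag=${tag}"
}
promote deploy-staging.yml sha-abc123
promote deploy-prod.yml sha-abc123
```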
The important property is artifact identity: staging and production run the same image, not two separate builds.