refactor(ops): standardize image-based production delivery

This commit is contained in:
2026-03-30 23:35:29 +02:00
parent ef5e8016a4
commit 7bcc831b5c
17 changed files with 447 additions and 538 deletions
## Goal
This document describes the canonical release path for CapaKraken.
The release model is now:
1. PRs are validated by CI before merge.
2. Every push to `main` publishes immutable `app` and `migrator` images.
3. Staging and production promote the exact same `sha-<commit>` tag.
4. The host deploys only from images and runtime env files.
5. A deployment is successful only after `GET /api/ready` passes.
## Core Idea
The production host should stop building application code from a Git checkout. Instead, it should only:
- pull a versioned `app` image
- pull a matching `migrator` image
- run Prisma deploy migrations
- start the application container
- wait for readiness
That removes "works on the server but not in CI" drift and makes rollbacks much simpler.
## Canonical Flow
### 1. Pull Request Validation
The main [ci.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/ci.yml) workflow remains the merge gate for:
- architecture guardrails
- typecheck
- lint
- unit tests
- build
- E2E
### 2. Automatic Image Release
[release-image.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/release-image.yml) now runs automatically on every push to `main` and can still be started manually for rebuilds or tag overrides.
It publishes two images from [Dockerfile.prod](/home/hartmut/Documents/Copilot/capakraken/Dockerfile.prod):
- `ghcr.io/<owner>/<repo>-app:sha-<commit>`
- `ghcr.io/<owner>/<repo>-migrator:sha-<commit>`
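The naming scheme can be sketched in plain shell; `owner` and the example commit id below are placeholders, not confirmed registry values:
```shell
# Sketch of deriving the immutable image references from a commit id.
# "owner" and "abc123" are placeholders, not real values.
commit="abc123"
tag="sha-${commit}"
APP_IMAGE="ghcr.io/owner/capakraken-app:${tag}"
MIGRATOR_IMAGE="ghcr.io/owner/capakraken-migrator:${tag}"
echo "${APP_IMAGE}"
```
Because the tag embeds the commit, pulling the same tag twice can never yield two different builds.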
### 3. Staging Promotion
[deploy-staging.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/deploy-staging.yml) copies the canonical deploy bundle to the staging host:
- [docker-compose.prod.yml](/home/hartmut/Documents/Copilot/capakraken/docker-compose.prod.yml)
- [tooling/deploy/deploy-compose.sh](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/deploy-compose.sh)
- the rest of [tooling/deploy](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/README.md)
GitHub Actions also writes a short-lived `deploy.env` containing `APP_IMAGE`, `MIGRATOR_IMAGE`, and the host port.
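A hypothetical `deploy.env` could look like this; the image values are placeholders, and any variable beyond `APP_IMAGE` and `MIGRATOR_IMAGE` is an assumption about the workflow:
```shell
# Hypothetical deploy.env content; all values are placeholders.
APP_IMAGE=ghcr.io/owner/capakraken-app:sha-abc123
MIGRATOR_IMAGE=ghcr.io/owner/capakraken-migrator:sha-abc123
APP_HOST_PORT=3000   # host-port variable name is an assumption
```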
### 4. Host-Side Deployment
On the target host, [deploy-compose.sh](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/deploy-compose.sh):
1. loads `.env.production` and `deploy.env`
2. validates the rendered compose file
3. pulls the immutable `app` and `migrator` images
4. starts PostgreSQL and Redis
5. runs Prisma migrations through the dedicated `migrator` image
6. starts the new `app` container
7. waits for `GET /api/ready`
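Step 7 is the success gate. A minimal polling sketch, with the probe command injected as a parameter so nothing here depends on a real host:
```shell
# Minimal readiness-wait sketch. $1 is the probe command, $2 the max
# number of attempts, $3 the delay between attempts in seconds.
wait_ready() {
  probe="$1"; tries="${2:-30}"; delay="${3:-2}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if $probe >/dev/null 2>&1; then
      return 0            # the app answered: deployment is considered good
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1                # never became ready: fail the deploy
}
```
In practice the probe would be something like `wait_ready "curl -fsS http://127.0.0.1:3000/api/ready" 30 2`, where the port is a placeholder for the configured host port.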
The host does not build application code from Git anymore.
### 5. Production Promotion
[deploy-prod.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/deploy-prod.yml) repeats the exact staging flow with the same image tag after staging acceptance.
That keeps staging and production on the same artifact instead of rebuilding.
## Required Infrastructure
- GitHub repository with Actions enabled
- GHCR or another container registry
- one Linux host with Docker Engine and Docker Compose v2
- PostgreSQL
- Redis
- SSH access from GitHub Actions to the host
- reverse proxy or load balancer in front of the app
### Recommended
- separate staging and production hosts
- GitHub Environments for `staging` and `production`
- required approval for the `production` environment
- monitoring on `/api/health` and `/api/ready`
- PostgreSQL backup and restore drills
## Runtime Configuration
### GitHub Environment Secrets
For `staging`:
- `STAGING_SSH_HOST`
- `STAGING_SSH_PORT`
- `STAGING_SSH_USER`
- `STAGING_SSH_KEY`
- `STAGING_DEPLOY_PATH`
- `STAGING_APP_HOST_PORT`
- `STAGING_GHCR_USERNAME`
- `STAGING_GHCR_TOKEN`
For `production`:
- `PROD_SSH_HOST`
- `PROD_SSH_PORT`
- `PROD_SSH_USER`
- `PROD_SSH_KEY`
- `PROD_DEPLOY_PATH`
- `PROD_APP_HOST_PORT`
- `PROD_GHCR_USERNAME`
- `PROD_GHCR_TOKEN`
### Host-side Files
The canonical host-side inputs are:
- [docker-compose.prod.yml](/home/hartmut/Documents/Copilot/capakraken/docker-compose.prod.yml)
- `.env.production`
- `deploy.env`
`.env.production` holds long-lived runtime configuration and secrets. The example file is [tooling/deploy/.env.production.example](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/.env.production.example).
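A minimal `.env.production` might look like the following sketch; every value is a placeholder, and the authoritative variable list is the example file above:
```shell
# Minimal .env.production sketch; all values are placeholders.
POSTGRES_PASSWORD=long-random-password
NEXTAUTH_URL=https://capakraken.example.com
NEXTAUTH_SECRET=long-random-secret
RATE_LIMIT_BACKEND=redis   # kept explicit in release environments
```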
`deploy.env` is short-lived deployment metadata. The example file is [tooling/deploy/deploy.env.example](/home/hartmut/Documents/Copilot/capakraken/tooling/deploy/deploy.env.example).
Important invariants:
- `RATE_LIMIT_BACKEND=redis` should stay explicit in release environments
- runtime AI, SMTP, and anonymization secrets belong to the host or platform secret layer
- admin settings are for verification and legacy-secret cleanup, not for secret rotation
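The deploy script loads `.env.production` first and `deploy.env` second (step 1 of the host-side flow), so both compose interpolation and the container runtime see the same values. A minimal sketch of that load order, assuming both are plain `KEY=VALUE` files readable by the shell:
```shell
# Load-order sketch: .env.production first, deploy.env second,
# so the short-lived image references win on any overlap.
load_env() {
  set -a               # auto-export every variable the files define
  . "$1"               # long-lived runtime configuration and secrets
  . "$2"               # short-lived image references from CI
  set +a
}
# usage: load_env .env.production deploy.env
```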
## Database Policy
Release environments must run migrations through the `migrator` image, which executes:
```bash
pnpm --filter @capakraken/db db:migrate:deploy
```
`db:push` remains a local-development tool, not a production rollout mechanism; it does not provide the release traceability that a migration-based deploy requires.
## Rollback Model
Rollback is image-based:
1. choose the previous healthy `sha-<commit>` tag
2. redeploy staging or production with that tag
3. confirm `GET /api/ready`
This assumes schema changes follow backwards-compatible expand-and-contract rollout rules.
## Production Update Summary
The standard production update is:
1. merge to `main` after CI is green
2. let [release-image.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/release-image.yml) publish `sha-<commit>` images
3. deploy that tag to staging through [deploy-staging.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/deploy-staging.yml)
4. validate staging
5. promote the same tag through [deploy-prod.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/deploy-prod.yml)
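The promotion steps above can be driven from the GitHub CLI. The `image_tag` dispatch-input name is an assumption about the workflows, so this helper only prints the commands instead of executing them:
```shell
# Print (rather than run) hypothetical promotion commands.
# "image_tag" is an assumed workflow_dispatch input name.
promote() {
  workflow="$1"; tag="$2"
  echo "gh workflow run ${workflow} -f image_tag=${tag}"
}
promote deploy-staging.yml sha-abc123
promote deploy-prod.yml sha-abc123
```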
The important property is artifact identity: staging and production run the same image, not two separate builds.