refactor(settings): adopt environment-only runtime secret flow

2026-03-30 19:55:06 +02:00
parent fed7aa5b61
commit a19d2cbae0
19 changed files with 757 additions and 172 deletions
@@ -11,7 +11,7 @@ At the same time, the codebase still carries several risks that are typical of f

 1. some critical cross-cutting concerns are only partially productized
 2. several files and routers have grown beyond comfortable ownership size
-3. runtime configuration and secret handling are still too application-database centric
+3. runtime secret handling is now materially cleaner, but the repo still needs to standardize the operational source of truth around that model
 4. the current operational model is improving, but not yet fully standardized
 5. production-grade multi-instance safeguards are not complete yet

@@ -47,10 +47,10 @@ The previously critical SSE and browser parser coverage issues were addressed du
   Evidence: [assistant-tools.ts](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/assistant-tools.ts), [resource.ts](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/resource.ts), [allocation.ts](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/allocation.ts), [timeline.ts](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/timeline.ts), [vacation.ts](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/vacation.ts), and large frontend files such as [SystemSettingsClient.tsx](/home/hartmut/Documents/Copilot/capakraken/apps/web/src/components/admin/SystemSettingsClient.tsx) and [TimelineProjectPanel.tsx](/home/hartmut/Documents/Copilot/capakraken/apps/web/src/components/timeline/TimelineProjectPanel.tsx) are each well past the size where safe ownership stays easy.
   Risk: AI-generated changes become harder to review, humans lose local reasoning context, and regressions become more likely.

-2. Secret handling is still application-database centric.
-   Evidence: system settings mutate and persist API keys and SMTP credentials in [settings.ts](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/settings.ts).
-   Risk: operational secrets remain too coupled to the main app data plane for a gold-standard project.
-   Update: runtime resolution is now env-first for the active secret consumers, but persistence is still transitional and should be reduced further.
+2. Runtime secret policy is mostly corrected, but deploy standardization still has to catch up.
+   Evidence: runtime resolution and admin flows now treat environment-backed secrets as the preferred source in [settings.ts](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/settings.ts), [system-settings-runtime.ts](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/lib/system-settings-runtime.ts), and [SystemSettingsClient.tsx](/home/hartmut/Documents/Copilot/capakraken/apps/web/src/components/admin/SystemSettingsClient.tsx).
+   Risk: a strong secret policy is only fully effective once staging and production provisioning use one canonical deployment path and operators clear remaining legacy database copies.
+   Update: the application no longer persists new operational secret values through admin settings; the remaining work is rollout discipline and cleanup completion.

 3. Least-privilege is materially better documented now, but it still needs long-lived enforcement rather than relying mainly on one hardening batch.
   Evidence: the route audience model is now explicit in [route-access-matrix.md](/home/hartmut/Documents/Copilot/capakraken/docs/route-access-matrix.md) and backed by multiple focused auth tests, but the remaining guarantee still depends on continuing test coverage and architecture guardrails as new routes evolve.
@@ -80,9 +80,9 @@ This is materially better than a typical startup CRUD app and already has the bo

 ### Security Posture

-`7/10`
+`7.5/10`

-There are good foundations, and the most obvious real-time and comment-visibility gaps were closed, but secrets policy and long-lived least-privilege enforcement still need structural work.
+There are good foundations, and the most obvious real-time, comment-visibility, and runtime-secret-policy gaps were closed, but long-lived least-privilege enforcement and operational standardization still need structural work.

 ### Maintainability

@@ -124,8 +124,8 @@ Goals:
 - Keep SSE audience scoping under test and CI guardrails.
 - Keep hardened spreadsheet parser boundaries under regression coverage.
 - Treat the route access matrix and narrowed auth slices as maintained architecture contracts.
-- Move production secrets out of regular application settings, or add an interim encrypted-secrets layer with clear migration path.
-  Status: in progress. Runtime consumers now prefer environment overrides; the remaining gap is eliminating or encrypting compatibility persistence in the admin settings path.
+- Enforce the environment-only runtime secret policy operationally and clear remaining legacy database secret residue.
+  Status: mostly completed in code. Runtime consumers prefer environment values, admin updates no longer store new secret material, and operators now need to finish rollout/bootstrap documentation plus cleanup of old database copies.

 Definition of done:

@@ -222,12 +222,11 @@ Artifacts to add:

 ## Suggested Order Of Execution

-1. secrets policy
-2. router/component decomposition
-3. architecture fitness checks in CI
-4. full operational standardization
-5. production-grade rate limiting
-6. performance hotspot reduction
+1. router/component decomposition
+2. architecture fitness checks in CI
+3. full operational standardization
+4. production-grade rate limiting
+5. performance hotspot reduction

 ## Success Criteria For The Next 60 Days

@@ -0,0 +1,89 @@
+# ADR 0001: Runtime Secret Provisioning
+
+**Status:** Accepted
+**Date:** 2026-03-30
+
+## Context
+
+CapaKraken historically allowed some operational runtime secrets to be persisted through `SystemSettings`.
+
+That included values such as:
+
+- primary AI API credentials
+- dedicated DALL-E credentials
+- Gemini credentials
+- SMTP password
+- anonymization seed
+
+This was convenient for fast iteration, but it coupled operational secret material to the main application data plane and blurred the line between configuration metadata and deployment secrets.
+
+The project is moving toward a production model where the running artifact should be immutable and environment-driven. That model is weakened if operators can still rotate runtime secrets through normal application writes.
+
+## Decision
+
+Operational runtime secrets must be provisioned outside the application database.
+
+Allowed sources:
+
+- deployment environment variables
+- host-level secret files such as `.env.production` on self-managed infrastructure
+- platform secret managers or encrypted environment facilities
+
+Disallowed source for new secret values:
+
+- admin updates that write runtime secrets into `SystemSettings`
+
+`SystemSettings` remains valid for non-secret runtime metadata such as:
+
+- provider selection
+- endpoints
+- model names
+- SMTP host/user/from settings
+- anonymization mode and domain
+
+Legacy secret values that already exist in `SystemSettings` may still be read during migration for compatibility, but they are not the target state and should be cleared after equivalent deployment secrets are provisioned.
+
+## Consequences
+
+Positive:
+
+- production updates become more predictable because images and runtime secrets are managed as separate deployment concerns
+- operational secrets stop depending on ordinary application write paths
+- admin tooling can expose status and diagnostics without pretending to be the system of record for secrets
+- secret rotation becomes an infrastructure operation rather than a product mutation
+
+Tradeoffs:
+
+- smaller self-managed installs need a disciplined host bootstrap process
+- operators must understand that secret rotation no longer happens through app settings; it requires updating the deployment secret source
+- migration requires visibility into which secrets are still backed by database residue
+
+## Implementation Notes
+
+The implementation should follow these rules:
+
+1. runtime consumers resolve supported secret values from environment first
+2. admin settings reads expose presence and source status, not secret values
+3. admin settings updates ignore incoming secret payloads
+4. the UI explains the expected environment variables for each runtime secret
+5. a dedicated cleanup action removes legacy database-stored secret values after migration
+
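Rules 1 and 2 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: `resolveSecret`, `describeSecret`, and the `SecretSource` labels are hypothetical names, and the environment is passed in as a plain map so the sketch stays self-contained.

```typescript
// Hypothetical sketch: none of these names are taken from the actual codebase.
type SecretSource = "environment" | "legacy-database" | "missing";

interface ResolvedSecret {
  value: string | undefined;
  source: SecretSource;
}

type EnvMap = Record<string, string | undefined>;

// Rule 1: the environment wins; a legacy database value is only a migration fallback.
function resolveSecret(
  envName: string,
  env: EnvMap,
  legacyDbValue?: string | null,
): ResolvedSecret {
  const fromEnv = env[envName]?.trim();
  if (fromEnv) return { value: fromEnv, source: "environment" };

  const fromDb = legacyDbValue?.trim();
  if (fromDb) return { value: fromDb, source: "legacy-database" };

  return { value: undefined, source: "missing" };
}

// Rule 2: an admin read exposes presence and source only, never the secret value.
function describeSecret(resolved: ResolvedSecret): { present: boolean; source: SecretSource } {
  return { present: resolved.source !== "missing", source: resolved.source };
}
```

In a Node process the caller would likely pass `process.env` as the `env` argument. A `legacy-database` source in the status output is the signal that cleanup (rule 5) is still pending for that secret.
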
+## Operational Guidance
+
+For staging and production:
+
+1. provision runtime secrets on the host or platform before starting a new release
+2. deploy the already-built application image
+3. restart the application so the new process reads the current secret source
+4. verify runtime status in admin settings
+5. clear any leftover legacy database secret values once the environment-backed source is confirmed
+
+Secret rotation should follow the same model. In most cases, no application data mutation is needed. The operator updates the deployment secret source and restarts or redeploys the app.
+
+## Follow-up
+
+Still required after this decision:
+
+- complete the canonical image-based staging/production rollout
+- ensure staging and production hosts both use the same secret provisioning rules
+- periodically verify that legacy database secret fields remain empty
@@ -20,6 +20,7 @@
 - comment entity support is now centralized across shared constants, API registry policy, assistant tool metadata, and the web comment target API without pretending a second consumer exists
 - `resource` is now onboarded as the second real comment entity, reusing the same ownership and staff-visibility rules as the resource detail route
 - comment mention autocomplete now uses a dedicated entity-scoped API route instead of inheriting the narrower `user.listAssignable` audience
+- runtime secret handling is now environment-first end to end: admin updates no longer persist new operational secrets, runtime status is surfaced explicitly, and legacy database secret copies can be cleared through a dedicated cleanup path

 ## Next Up

@@ -52,9 +52,9 @@ These files already have unrelated local edits. Audience parity work that would

 ## Next Major Themes

-1. convert the still-open runtime secret model away from application-database centric storage
-2. add broader authorization regression coverage and long-lived guardrails around the narrowed route audiences
-3. reduce oversized routers and UI ownership surfaces so audience rules stay reviewable
+1. add broader authorization regression coverage and long-lived guardrails around the narrowed route audiences
+2. reduce oversized routers and UI ownership surfaces so audience rules stay reviewable
+3. keep runtime secret policy and role/audience boundaries aligned as adjacent architecture guardrails

 ## Slice Definition

@@ -154,6 +154,11 @@ SMTP_PORT=587
 SMTP_USER=notifications@example.com
 SMTP_PASSWORD=<password>
 SMTP_FROM=CapaKraken <notifications@example.com>
+OPENAI_API_KEY=<optional-if-openai-used>
+AZURE_OPENAI_API_KEY=<optional-if-azure-chat-used>
+AZURE_DALLE_API_KEY=<optional-if-azure-image-gen-used>
+GEMINI_API_KEY=<optional-if-gemini-used>
+ANONYMIZATION_SEED=<required-if-deterministic-anonymization-enabled>
 ```

 Generate a secure `NEXTAUTH_SECRET`:
@@ -162,6 +167,12 @@ Generate a secure `NEXTAUTH_SECRET`:
 openssl rand -base64 32
 ```

+Runtime secret policy:
+
+- production secrets are injected through the deployment environment or host secret store
+- admin settings must not be used to enter or rotate AI, SMTP, or anonymization secrets
+- the admin UI is only for status checks and cleanup of legacy database-stored secret values
+
 ---

 ## 5. Deployment
@@ -169,13 +180,13 @@ openssl rand -base64 32
 ### docker-compose (simplest)

 ```bash
-# On your server
+# On your server, after updating the host-side env/secret source
 git pull
 docker compose -f docker-compose.prod.yml up -d --build

 # Run database migrations
 docker compose -f docker-compose.prod.yml exec app \
-  pnpm db:push
+  pnpm --filter @capakraken/db db:migrate:deploy

 # Seed initial data (first deployment only)
 docker compose -f docker-compose.prod.yml exec app \
@@ -193,6 +204,7 @@ git pull origin main
 pnpm install
 pnpm db:generate
 pnpm db:validate
+pnpm --filter @capakraken/db db:migrate:deploy
 pnpm --filter @capakraken/web exec next build
 rm -rf apps/web/.next/cache  # clear stale cache

@@ -203,6 +215,8 @@ PORT=3100 pnpm --filter @capakraken/web start &

 Use the repo-level `pnpm db:*` commands for Prisma/database operations. They load `.env`, `.env.local`, `.env.$NODE_ENV`, and `.env.$NODE_ENV.local` automatically before invoking Prisma.

+If you rotate runtime secrets during a manual deploy, update the host-side environment source first, then restart the app so the new process reads the updated values. Do not patch those values through admin settings.
+
 ### nginx configuration

 The existing nginx reverse proxy should forward to port 3100:
@@ -30,6 +30,7 @@ That removes "works on the server but not in CI" drift and makes rollbacks much

 The existing `CI` workflow continues to validate:

+- architecture guardrails for SSE audience scoping
 - typecheck
 - lint
 - unit tests
@@ -38,6 +39,12 @@ The existing `CI` workflow continues to validate:

 This remains the quality gate before merge.

+The guardrail step currently enforces three invariants:
+
+- no role-based SSE audience fan-out in [event-bus.ts](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/sse/event-bus.ts)
+- no role-derived subscription audiences in [subscription-policy.ts](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/sse/subscription-policy.ts)
+- no client-provided audience parsing in [route.ts](/home/hartmut/Documents/Copilot/capakraken/apps/web/src/app/api/sse/timeline/route.ts)
+
 ### 2. Image Build

 The new manual workflow [release-image.yml](/home/hartmut/Documents/Copilot/capakraken/.github/workflows/release-image.yml) builds two images from [Dockerfile.prod](/home/hartmut/Documents/Copilot/capakraken/Dockerfile.prod):
@@ -149,6 +156,28 @@ NEXTAUTH_SECRET=<long-random-secret>

 GitHub Actions only injects the short-lived image references through `deploy.env`. The deploy script then loads both files before calling Docker Compose, so compose interpolation and container runtime env use the same source of truth.

+### Runtime Secret Provisioning Policy
+
+Production and staging secrets should be provisioned at the host or platform-secret layer, not through admin mutations and not through application database writes.
+
+That includes at least:
+
+```env
+OPENAI_API_KEY=<optional-if-openai-used>
+AZURE_OPENAI_API_KEY=<optional-if-azure-chat-used>
+AZURE_DALLE_API_KEY=<optional-if-azure-image-gen-used>
+GEMINI_API_KEY=<optional-if-gemini-used>
+SMTP_PASSWORD=<required-if-smtp-auth-used>
+ANONYMIZATION_SEED=<required-if-deterministic-anonymization-enabled>
+```
+
+Operational rule:
+
+- keep these values in `.env.production` only on smaller self-managed hosts; elsewhere, prefer the host's secret manager or encrypted environment facility
+- do not rotate or patch these values through `SystemSettings`
+- use the admin settings page only to verify runtime source/status and to clear leftover legacy database copies
+- after migration, legacy database secret fields should be empty in both staging and production
+
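The conditional requirements in the env listing above (`required-if-...`) could be checked at startup along these lines. This is a hedged sketch only: `findMissingSecrets` and the `RuntimeFlags` shape are hypothetical, and how the application actually decides whether SMTP auth or deterministic anonymization is enabled is not specified here.

```typescript
// Hypothetical startup check; the flag names are assumptions for illustration.
type DeployEnv = Record<string, string | undefined>;

interface RuntimeFlags {
  smtpAuthUsed: boolean;
  deterministicAnonymization: boolean;
}

// Returns the names of conditionally required secrets that are absent or blank.
function findMissingSecrets(env: DeployEnv, flags: RuntimeFlags): string[] {
  const required: string[] = [];
  if (flags.smtpAuthUsed) required.push("SMTP_PASSWORD");
  if (flags.deterministicAnonymization) required.push("ANONYMIZATION_SEED");

  return required.filter((name) => !(env[name] ?? "").trim());
}
```

A deploy script might call this with the loaded environment and refuse to start the new release if the returned list is not empty, which keeps the failure at the provisioning step rather than at first use.
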
 ## Database Policy

 For release environments, use:
@@ -183,6 +212,8 @@ The intended production update path is:

 That means the production host no longer builds from Git. It only receives a versioned image and starts it after migrations complete.

+The same principle applies to secrets: the running container reads them from the deployment environment at start time, so an update only needs a new image tag unless secret material itself is being rotated.
+
 ## Current Status

 The repository now contains the CI/CD scaffolding, but the existing manual production setup remains untouched:
@@ -46,7 +46,8 @@ See `.github/PULL_REQUEST_TEMPLATE.md` for the security checklist that must be c

 - No secrets in source code
 - Environment variables for all credentials (`DATABASE_URL`, API keys)
-- `SystemSettings` table for runtime-configurable secrets (AI keys, SMTP credentials)
+- Runtime application secrets are provisioned outside the application data plane through environment variables or a deployment-time secret manager
+- `SystemSettings` may still contain legacy secret residue during migration, but new secret values must not be written there
 - `.env` files excluded from version control via `.gitignore`

 ## Incident Response
@@ -65,6 +65,8 @@ publicProcedure
 - Runtime secrets now resolve env-first for AI, Gemini, SMTP, and anonymization seed values. Database-backed `SystemSettings` values remain transitional compatibility storage, not the preferred production source of truth.
 - Recommended runtime overrides: `OPENAI_API_KEY`, `AZURE_OPENAI_API_KEY`, `AZURE_DALLE_API_KEY`, `GEMINI_API_KEY`, `SMTP_PASSWORD`, `ANONYMIZATION_SEED`
 - Admin settings reads expose only presence flags (`hasApiKey`, `hasSmtpPassword`, `hasGeminiApiKey`) instead of returning secret values to the browser, and those flags also reflect environment-backed runtime overrides
+- The admin settings mutation no longer persists new secret values into `SystemSettings`; secret inputs must be provisioned through environment or a deployment-time secret manager, and legacy database copies can be cleared explicitly
+- The admin UI now exposes runtime secret source/status plus an explicit "clear legacy DB secrets" cleanup path so operators can complete the migration without direct database writes
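
The mutation and cleanup behavior described in the last two bullets can be pictured roughly as below. This is a sketch under assumptions: the field names are inferred from the presence flags (`hasApiKey`, `hasSmtpPassword`, `hasGeminiApiKey`) and may not match the real `SystemSettings` columns, and `stripSecretFields` and `buildLegacyCleanupPatch` are hypothetical helpers.

```typescript
// Assumed field names, inferred from the presence flags; real columns may differ.
const SECRET_FIELDS: readonly string[] = ["apiKey", "geminiApiKey", "smtpPassword"];

type SettingsPayload = Record<string, unknown>;

// Drops secret fields from an admin update so they are never persisted.
function stripSecretFields(input: SettingsPayload): SettingsPayload {
  const sanitized: SettingsPayload = {};
  for (const key of Object.keys(input)) {
    if (SECRET_FIELDS.indexOf(key) === -1) sanitized[key] = input[key];
  }
  return sanitized;
}

// Builds the patch an explicit "clear legacy DB secrets" action could apply.
function buildLegacyCleanupPatch(): Record<string, null> {
  const patch: Record<string, null> = {};
  for (const field of SECRET_FIELDS) patch[field] = null;
  return patch;
}
```

Keeping the secret field list in one place means the update path and the cleanup path cannot drift apart when a new secret is added.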

 ### Anonymization