Hartmut 01c45d0344 security: align client password policy with server, enforce AUTH_SECRET length + entropy (#56)
Client-side validators (reset-password, invite-accept, first-admin setup,
user-create modal) previously checked password.length < 8 while every
server-side Zod schema required .min(12). External API consumers (or a
confused browser UI) could get past the client check but fail at the tRPC
boundary — or worse, quietly under-enforce policy compared to what
admins expect.

Fix: introduce PASSWORD_MIN_LENGTH (12) and PASSWORD_MAX_LENGTH (128) in
@capakraken/shared and import them from every pre-submit client validator
and every server Zod schema. Single source of truth; drift becomes a
compile error rather than a security finding.

Also hardens the AUTH_SECRET runtime check: in addition to the existing
placeholder-blacklist, production startup now rejects secrets shorter
than 32 chars OR with Shannon entropy below 3.5 bits/char. That covers
low-entropy-but-long values like "aaaa..." (38 chars, entropy 0) which
would have passed the previous checks.

Documented the rotation process for AUTH_SECRET + POSTGRES_PASSWORD in
docs/security-architecture.md §3.

Verified:
- pnpm test:unit — 396 files / 1922 tests passed
- pnpm --filter @capakraken/web exec tsc --noEmit — clean
- pnpm --filter @capakraken/api exec tsc --noEmit — clean

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-17 14:56:43 +02:00

# Security Architecture — CapaKraken
> Version: 1.0 | Date: 2026-03-27
---
## 1. Authentication
- **Auth.js v5** (NextAuth) with Credentials provider
- **Password hashing**: Argon2id via `@node-rs/argon2` (memory cost 65536, time cost 3)
- **Multi-Factor Authentication**: TOTP (RFC 6238) via `otpauth` library
  - Configurable per user (enable/disable via admin or self-service)
  - 30-second window, SHA-1, 6-digit codes with 1-step tolerance
- **Rate limiting**: 5 login attempts per 15 minutes per email address (in-memory sliding window)
- **Session strategy**: JWT with server-side validation
  - Absolute timeout: 8 hours (configurable via `sessionMaxAge`)
  - Idle timeout: 30 minutes (configurable via `sessionIdleTimeout`)
- **Concurrent session limit**: configurable `maxConcurrentSessions` (default 3), kick-oldest strategy
- **Login/logout audit**: all authentication events (success, failure, rate-limit, invalid TOTP, logout) are recorded in the audit log
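
The login rate limit above (5 attempts per 15 minutes per email, in-memory sliding window) can be sketched as follows. This is a minimal illustration, not the project's implementation: `isLoginAllowed`, the `attempts` map, and the choice not to record rejected attempts are all assumptions.

```typescript
// Hypothetical in-memory sliding-window limiter; all names are illustrative.
const WINDOW_MS = 15 * 60 * 1000; // 15 minutes
const MAX_ATTEMPTS = 5;

// Normalised email -> timestamps (ms) of attempts still inside the window.
const attempts = new Map<string, number[]>();

function isLoginAllowed(email: string, now: number = Date.now()): boolean {
  const key = email.trim().toLowerCase();
  const cutoff = now - WINDOW_MS;
  // Drop attempts that have slid out of the window.
  const recent = (attempts.get(key) ?? []).filter((t) => t > cutoff);
  if (recent.length >= MAX_ATTEMPTS) {
    attempts.set(key, recent);
    return false; // over the limit: reject without recording (assumption)
  }
  recent.push(now);
  attempts.set(key, recent);
  return true;
}
```

Because the window slides, a blocked address becomes usable again as soon as its oldest recorded attempt is more than 15 minutes old, rather than at a fixed reset time.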
## 2. Authorization
### Role-Based Access Control (RBAC)
Five-level role hierarchy:
| Role | Level | Capabilities |
| ---------- | ----- | ---------------------------------------------------------- |
| ADMIN | 5 | Full system access, user management, system settings |
| MANAGER | 4 | Project management, resource allocation, vacation approval |
| CONTROLLER | 3 | Financial views, budget management, reporting |
| USER | 2 | Self-service (own vacations, own resource profile) |
| VIEWER | 1 | Read-only access to permitted areas |
### Per-User Permission Overrides
- `permissionOverrides` JSONB field on User model
- `resolvePermissions(role, overrides)` computes effective permissions
- `requirePermission(ctx, key)` enforced on every tRPC procedure
- Granular `PermissionKey` enum covering all domain actions
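
A minimal sketch of how role defaults and per-user overrides might combine. The override shape (`granted` / `denied`), the sample permission keys, the role defaults, and the "deny wins" rule are all assumptions for illustration; only the function names `resolvePermissions` and `requirePermission` come from this document.

```typescript
type Role = "ADMIN" | "MANAGER" | "CONTROLLER" | "USER" | "VIEWER";
// Hypothetical keys; the real PermissionKey enum is much larger.
type PermissionKey = "user.manage" | "project.manage" | "budget.view" | "vacation.approve";

// Assumed JSONB shape of User.permissionOverrides.
interface PermissionOverrides {
  granted?: PermissionKey[];
  denied?: PermissionKey[];
}

// Assumed role defaults, for illustration only.
const ROLE_DEFAULTS: Record<Role, PermissionKey[]> = {
  ADMIN: ["user.manage", "project.manage", "budget.view", "vacation.approve"],
  MANAGER: ["project.manage", "vacation.approve"],
  CONTROLLER: ["budget.view"],
  USER: [],
  VIEWER: [],
};

function resolvePermissions(role: Role, overrides: PermissionOverrides | null): Set<PermissionKey> {
  const effective = new Set<PermissionKey>(ROLE_DEFAULTS[role]);
  for (const key of overrides?.granted ?? []) effective.add(key);
  // Denials are applied last so an explicit deny always wins (assumption).
  for (const key of overrides?.denied ?? []) effective.delete(key);
  return effective;
}

function requirePermission(
  ctx: { role: Role; overrides: PermissionOverrides | null },
  key: PermissionKey,
): void {
  if (!resolvePermissions(ctx.role, ctx.overrides).has(key)) {
    // A tRPC procedure would more likely throw a TRPCError with code "FORBIDDEN".
    throw new Error(`FORBIDDEN: missing permission ${key}`);
  }
}
```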
### tRPC Middleware Stack
```
publicProcedure
  -> protectedProcedure (requires authenticated session)
    -> controllerProcedure (ADMIN + MANAGER + CONTROLLER)
      -> managerProcedure (ADMIN + MANAGER)
        -> adminProcedure (ADMIN only)
```
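
Each procedure tier in the stack above corresponds to a minimum level in the role hierarchy. The sketch below shows that mapping with a numeric comparison. The helper `hasMinimumRole` and the `PROCEDURE_MINIMUM` table are hypothetical, and the real middleware may check explicit role lists instead.

```typescript
// Levels taken from the role hierarchy table in this section.
const ROLE_LEVEL = { VIEWER: 1, USER: 2, CONTROLLER: 3, MANAGER: 4, ADMIN: 5 } as const;
type Role = keyof typeof ROLE_LEVEL;

// Hypothetical helper: true when `actual` is at or above `required`.
function hasMinimumRole(actual: Role, required: Role): boolean {
  return ROLE_LEVEL[actual] >= ROLE_LEVEL[required];
}

// Minimum role per procedure tier (protectedProcedure accepts any signed-in user).
const PROCEDURE_MINIMUM = {
  protectedProcedure: "VIEWER",
  controllerProcedure: "CONTROLLER",
  managerProcedure: "MANAGER",
  adminProcedure: "ADMIN",
} as const;

function canCall(procedure: keyof typeof PROCEDURE_MINIMUM, role: Role): boolean {
  return hasMinimumRole(role, PROCEDURE_MINIMUM[procedure]);
}
```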
## 3. Data Protection
### Database Security
- **PostgreSQL** with TLS in production
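- **Prisma ORM**: parameterized queries by default, which removes the classic SQL-injection vector as long as raw-query escape hatches such as `$queryRawUnsafe` are never fed user input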
- Database not exposed to the internet (Docker internal network only)
- All monetary values stored as integer cents (no floating-point precision issues)
### Data at Rest
- Passwords: Argon2id hash (never stored in plaintext)
- TOTP secrets: stored in DB (encrypted at-rest via PostgreSQL TDE when available)
- Runtime secrets now resolve env-first for AI, Gemini, SMTP, and anonymization seed values. Database-backed `SystemSettings` values remain transitional compatibility storage, not the preferred production source of truth.
- Recommended runtime overrides: `OPENAI_API_KEY`, `AZURE_OPENAI_API_KEY`, `AZURE_DALLE_API_KEY`, `GEMINI_API_KEY`, `SMTP_PASSWORD`, `ANONYMIZATION_SEED`
- Admin settings reads expose only presence flags (`hasApiKey`, `hasSmtpPassword`, `hasGeminiApiKey`) instead of returning secret values to the browser, and those flags also reflect environment-backed runtime overrides
- The admin settings mutation no longer persists new secret values into `SystemSettings`; secret inputs must be provisioned through environment or a deployment-time secret manager, and legacy database copies can be cleared explicitly
- The admin UI now exposes runtime secret source/status plus an explicit "clear legacy DB secrets" cleanup path so operators can complete the migration without direct database writes
- Production startup now validates Auth.js runtime configuration and refuses to boot if `AUTH_SECRET`/`NEXTAUTH_SECRET` is missing, is left on a known development placeholder, is paired with a non-HTTPS public auth URL, is shorter than 32 characters, or has a Shannon entropy below 3.5 bits/char (for example, a long run of one repeated character)
- User passwords: minimum 12 characters, maximum 128 characters. The shared constants `PASSWORD_MIN_LENGTH` and `PASSWORD_MAX_LENGTH` (`@capakraken/shared/constants`) are imported by every client-side pre-submit validator and every server-side Zod schema, which prevents client/server policy drift
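
The length and entropy rules for `AUTH_SECRET` can be illustrated with a short sketch. It assumes entropy is computed per character over the secret's own character frequencies; `shannonEntropy` and `authSecretProblems` are hypothetical names, and the placeholder and HTTPS checks are left out.

```typescript
const MIN_SECRET_LENGTH = 32;
const MIN_ENTROPY_BITS_PER_CHAR = 3.5;

// Shannon entropy in bits per character, from observed character frequencies.
function shannonEntropy(value: string): number {
  if (value.length === 0) return 0;
  const counts = new Map<string, number>();
  for (let i = 0; i < value.length; i++) {
    const ch = value.charAt(i);
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  let entropy = 0;
  counts.forEach((count) => {
    const p = count / value.length;
    entropy -= p * Math.log2(p);
  });
  return entropy;
}

// Returns a list of human-readable problems; an empty list means the secret passes.
function authSecretProblems(secret: string | undefined): string[] {
  if (!secret) return ["AUTH_SECRET is missing"];
  const problems: string[] = [];
  if (secret.length < MIN_SECRET_LENGTH) {
    problems.push(`AUTH_SECRET is shorter than ${MIN_SECRET_LENGTH} characters`);
  }
  if (shannonEntropy(secret) < MIN_ENTROPY_BITS_PER_CHAR) {
    problems.push(`AUTH_SECRET entropy is below ${MIN_ENTROPY_BITS_PER_CHAR} bits/char`);
  }
  return problems;
}
```

A secret made of 38 repeated `a` characters is long enough but has an entropy of 0 bits/char, so only the entropy rule rejects it.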
#### Secret rotation
- **`AUTH_SECRET` / `NEXTAUTH_SECRET`** is the signing key for all JWT session cookies. Rotation forces every user to re-authenticate on their next request.
  - Generate replacement: `openssl rand -base64 32`
  - Deploy path:
    1. Update the secret in the deployment secret store (not in repo).
    2. Roll all application containers. Existing JWTs signed under the old key fail verification and the user is redirected to sign-in.
    3. There is no multi-key transition window: this is a hard cut on purpose, because a compromised signing key must be retired immediately.
  - Recommended cadence: quarterly, or immediately on suspected compromise.
- **`POSTGRES_PASSWORD`** rotation is coordinated across postgres container init, the app container's `DATABASE_URL`, and any external replication consumers; follow the deployment runbook.
### Anonymization
- Configurable global anonymization for VIEWER role
- Resource names, emails replaced with deterministic pseudonyms (seeded hash)
- Anonymization domain and mode configurable in SystemSettings
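
Deterministic pseudonyms from a seeded hash could look like the sketch below. It assumes an HMAC-SHA-256 keyed with the anonymization seed; the function name `pseudonymize`, the `Resource-xxxxxxxx` naming scheme, and the default domain are hypothetical.

```typescript
import { createHmac } from "node:crypto";

interface Pseudonym {
  name: string;
  email: string;
}

// Same (value, seed) always yields the same pseudonym; a different seed yields a different one.
function pseudonymize(value: string, seed: string, domain: string = "example.invalid"): Pseudonym {
  const digest = createHmac("sha256", seed)
    .update(value.trim().toLowerCase())
    .digest("hex");
  const tag = digest.slice(0, 8); // short, stable identifier derived from the hash
  return { name: `Resource-${tag}`, email: `resource-${tag}@${domain}` };
}
```

Keying the hash with a secret seed means a viewer cannot rebuild the mapping by hashing a list of known names, while the same resource still appears under one stable pseudonym across pages.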
## 4. Session Management
- **Server-side JWT** with `SameSite=Strict` cookies
- `httpOnly` cookies keep the session token unreadable from JavaScript, which blocks XSS-based cookie theft
- `secure` flag enforced in production (HTTPS only)
- CSRF protection via Auth.js built-in CSRF token
- Configurable session timeouts (absolute + idle) via SystemSettings
- Active session registry with concurrent session limit enforcement
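
The absolute and idle timeouts can be checked together on each request. The sketch below assumes both settings are expressed in seconds and that the session records its issue time and last activity time; `checkSession` and the `SessionTimes` shape are hypothetical.

```typescript
// Timestamps in milliseconds since epoch (assumed session bookkeeping).
interface SessionTimes {
  issuedAt: number;
  lastActivityAt: number;
}

// Setting names follow this document; seconds as the unit is an assumption.
interface SessionPolicy {
  sessionMaxAge: number;
  sessionIdleTimeout: number;
}

const DEFAULT_POLICY: SessionPolicy = {
  sessionMaxAge: 8 * 60 * 60, // 8 hours
  sessionIdleTimeout: 30 * 60, // 30 minutes
};

type SessionState = "active" | "expired-absolute" | "expired-idle";

function checkSession(
  session: SessionTimes,
  now: number,
  policy: SessionPolicy = DEFAULT_POLICY,
): SessionState {
  // The absolute limit applies no matter how active the user has been.
  if (now - session.issuedAt >= policy.sessionMaxAge * 1000) return "expired-absolute";
  if (now - session.lastActivityAt >= policy.sessionIdleTimeout * 1000) return "expired-idle";
  return "active";
}
```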
## 5. Input Validation
- **Zod schemas** on every tRPC procedure input
- Strict TypeScript (`strict: true`, `exactOptionalPropertyTypes: true`)
- Blueprint dynamic fields validated at runtime against stored Zod schema definitions
- File uploads validated by:
  - MIME type whitelist (`image/png`, `image/jpeg`, `image/webp`, `image/tiff`, `image/bmp`)
  - Size limit (10 MB client-side, 4 MB server-side after compression)
  - Magic byte verification (actual file content matched against declared MIME)
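
Magic byte verification compares the first bytes of the upload with the well-known signature of the declared format. The sketch below covers the five whitelisted types; the function names are hypothetical and the project's actual checks may be stricter.

```typescript
type AllowedMime = "image/png" | "image/jpeg" | "image/webp" | "image/tiff" | "image/bmp";

// True when `bytes` contains `signature` starting at `offset`.
function hasSignature(bytes: Uint8Array, signature: number[], offset: number = 0): boolean {
  if (bytes.length < offset + signature.length) return false;
  return signature.every((b, i) => bytes[offset + i] === b);
}

function matchesDeclaredMime(bytes: Uint8Array, mime: AllowedMime): boolean {
  switch (mime) {
    case "image/png": // \x89 P N G \r \n \x1a \n
      return hasSignature(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
    case "image/jpeg": // SOI marker followed by another marker byte
      return hasSignature(bytes, [0xff, 0xd8, 0xff]);
    case "image/webp": // "RIFF" <4-byte size> "WEBP"
      return (
        hasSignature(bytes, [0x52, 0x49, 0x46, 0x46]) &&
        hasSignature(bytes, [0x57, 0x45, 0x42, 0x50], 8)
      );
    case "image/tiff": // little-endian "II*\0" or big-endian "MM\0*"
      return (
        hasSignature(bytes, [0x49, 0x49, 0x2a, 0x00]) ||
        hasSignature(bytes, [0x4d, 0x4d, 0x00, 0x2a])
      );
    case "image/bmp": // "BM"
      return hasSignature(bytes, [0x42, 0x4d]);
    default:
      return false;
  }
}
```

A file renamed to `.png` but containing JPEG data passes the MIME whitelist on its declared type and is then rejected here, because its first bytes are `FF D8 FF` rather than the PNG signature.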
### Prompt-Injection Guard (defense-in-depth only)
`packages/api/src/lib/prompt-guard.ts` runs a short regex list against every
free-text user prompt sent to an AI tool (assistant chat + project-cover
DALL-E prompt). Input is normalised before the regex runs:
1. Unicode NFKD decomposition (collapses fullwidth / compatibility forms and
splits diacritics from their base letter).
2. Strip zero-width / directional / combining code points that attackers use
to break contiguous substring matches.
3. Fold a small set of Cyrillic / Greek homoglyphs to their Latin
equivalents.
This guard is **defense-in-depth, not an authorisation boundary**. The actual
security boundary for AI-initiated actions is the per-tool
`requirePermission(ctx, PermissionKey.*)` check inside every assistant tool —
an LLM that has been successfully jailbroken still cannot perform an action
its caller's role does not allow. Motivated adversaries **will** find prompts
that defeat the regex layer; its purpose is to raise the cost of casual
injection attempts and to surface them as audit-log entries.
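
The three normalisation steps listed above can be sketched as follows. This is an illustration under assumptions, not the contents of `prompt-guard.ts`: the code point ranges, the homoglyph table, and the single sample pattern are all hypothetical.

```typescript
// Step 2: combining diacritics plus zero-width and directional controls.
const INVISIBLE_OR_COMBINING = /[\u0300-\u036F\u200B-\u200F\u202A-\u202E\u2060-\u2064\uFEFF]/g;

// Step 3: a small, illustrative Cyrillic / Greek to Latin fold table.
const HOMOGLYPHS: Record<string, string> = {
  "\u0430": "a", // Cyrillic a
  "\u0435": "e", // Cyrillic ie
  "\u043E": "o", // Cyrillic o
  "\u0440": "p", // Cyrillic er
  "\u0441": "c", // Cyrillic es
  "\u0445": "x", // Cyrillic ha
  "\u0456": "i", // Cyrillic Ukrainian i
  "\u03B1": "a", // Greek alpha
  "\u03BF": "o", // Greek omicron
};

function normalizePrompt(input: string): string {
  const decomposed = input.normalize("NFKD"); // step 1
  const stripped = decomposed.replace(INVISIBLE_OR_COMBINING, ""); // step 2
  return stripped
    .split("")
    .map((ch) => HOMOGLYPHS[ch] ?? ch) // step 3
    .join("");
}

// Hypothetical pattern list; the real list is not shown in this document.
const SUSPICIOUS_PATTERNS: RegExp[] = [/ignore (all )?previous instructions/i];

function looksLikeInjection(prompt: string): boolean {
  const normalized = normalizePrompt(prompt);
  return SUSPICIOUS_PATTERNS.some((pattern) => pattern.test(normalized));
}
```

Running the patterns only on the normalised text is what lets one plain-ASCII regex catch fullwidth, zero-width-split, and homoglyph variants of the same phrase.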
## 6. Audit Logging
### Activity History System
- Centralized `createAuditEntry()` function (fire-and-forget, never blocks)
- Covers 29+ of 36 tRPC routers
- Logged fields: `entityType`, `entityId`, `action`, `userId`, `changes` (JSONB with before/after/diff), `source`, `summary`
- Authentication events: login success/failure, logout, rate limiting, MFA failures
### External API Call Logging
- All OpenAI/Azure/Gemini API calls logged via `loggedAiCall()` wrapper
- Structured Pino logs: `{ provider, model, promptLength, responseTimeMs }`
- Failed calls logged at `warn` level with sanitized diagnostics only, with URL and secret-like tokens redacted before they reach structured logs
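
Redaction before logging might look like the sketch below. The patterns are assumptions chosen for illustration (URLs, `sk-` style keys, Bearer tokens, and long opaque strings); `redactDiagnostics` is a hypothetical name and the project's real rules are not shown here.

```typescript
// Any http(s) URL up to the next whitespace or quote.
const URL_PATTERN = /\bhttps?:\/\/[^\s"'<>]+/gi;

// Secret-like tokens: "sk-..." keys, Bearer tokens, or long opaque strings.
const SECRET_PATTERN =
  /\b(?:sk-[A-Za-z0-9_-]{8,}|Bearer\s+[A-Za-z0-9._~+\/-]{8,}=*|[A-Za-z0-9_-]{32,})/g;

function redactDiagnostics(message: string): string {
  return message
    .replace(URL_PATTERN, "[redacted-url]")
    .replace(SECRET_PATTERN, "[redacted-secret]");
}
```

Applying the URL rule first keeps query-string credentials from being partially matched by the token rule.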
### tRPC Request Logging
- Every tRPC call logged with request ID, user ID, path, duration
- Slow calls (>500ms) logged at `warn` level
## 7. HTTP Security Headers
Static headers are configured in `next.config.ts`. The Content-Security-Policy
is emitted per-request by `apps/web/src/middleware.ts` so it can carry a
per-request nonce.
| Header | Value |
| ------------------------- | ---------------------------------------------- |
| Strict-Transport-Security | `max-age=63072000; includeSubDomains; preload` |
| Content-Security-Policy | Restrictive CSP with nonce-based script-src |
| X-Frame-Options | `DENY` |
| X-Content-Type-Options | `nosniff` |
| X-XSS-Protection | `1; mode=block` |
| Referrer-Policy | `strict-origin-when-cross-origin` |
| Permissions-Policy | Camera, microphone, geolocation disabled |
### Content-Security-Policy directives (production)
| Directive | Value | Rationale |
| ----------------- | ------------------------- | -------------------------------------------------- |
| `default-src` | `'self'` | Baseline deny-all-cross-origin. |
| `script-src` | `'self' 'nonce-<random>'` | No `unsafe-inline` / `unsafe-eval` in prod. |
| `style-src` | `'self' 'unsafe-inline'` | Accepted residual risk — see note below. |
| `img-src` | `'self' data: blob:` | Allow base64 previews and generated blobs only. |
| `font-src` | `'self' data:` | Data URLs for inline-embedded fonts. |
| `connect-src` | `'self'` | All AI / third-party calls are server-side. |
| `frame-ancestors` | `'none'` | Clickjacking defence. |
| `frame-src` | `'none'` | No third-party iframes. |
| `object-src` | `'none'` | Blocks legacy `<object>` / Flash / applet vectors. |
| `media-src` | `'self'` | No cross-origin video / audio. |
| `worker-src` | `'self' blob:` | Next.js runtime uses blob-URL workers. |
| `base-uri` | `'self'` | Blocks `<base>` hijacks. |
| `form-action` | `'self'` | Blocks form-exfiltration to third parties. |
**Residual risk — `style-src 'unsafe-inline'`:** React inlines component-scoped
style attributes and `@react-pdf/renderer` emits inline `<style>` blocks that
cannot carry a nonce. A strict `style-src-elem` would break both. The risk is
bounded because `script-src` is nonce-based — a pure CSS-injection attack
cannot escalate to JS execution in this application.
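
The production policy in the table above can be assembled from a per-request nonce as sketched below. `buildCsp` is a hypothetical helper; how the middleware generates the nonce and attaches the header is not shown here.

```typescript
// Builds the production CSP string for one request from its nonce.
function buildCsp(nonce: string): string {
  const directives: Array<[string, string]> = [
    ["default-src", "'self'"],
    ["script-src", `'self' 'nonce-${nonce}'`],
    ["style-src", "'self' 'unsafe-inline'"],
    ["img-src", "'self' data: blob:"],
    ["font-src", "'self' data:"],
    ["connect-src", "'self'"],
    ["frame-ancestors", "'none'"],
    ["frame-src", "'none'"],
    ["object-src", "'none'"],
    ["media-src", "'self'"],
    ["worker-src", "'self' blob:"],
    ["base-uri", "'self'"],
    ["form-action", "'self'"],
  ];
  return directives.map(([name, value]) => `${name} ${value}`).join("; ");
}
```

The nonce has to be freshly generated from a cryptographically secure source on every request and echoed on each allowed `<script>` tag; reusing it across requests would defeat its purpose.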
## 8. Rate Limiting
- **Per-IP rate limiting**: via middleware on all API routes
- **Per-user rate limiting**: configurable per-procedure
- **Shared rate-limit backend**: Redis-backed counters when `REDIS_URL` is configured; in-memory fallback remains available for local development and degraded operation
- **Auth-specific rate limiting**: 5 attempts / 15 min per email
- **AI API call rate limits**: upstream provider limits surfaced as user-friendly errors
## 9. Error Handling
- **Sentry** integration for production error tracking
- **Pino** structured logging (JSON in production, pretty-print in development)
- tRPC errors mapped to appropriate HTTP status codes
- AI API errors translated to human-readable messages via `parseAiError()` / `parseGeminiError()`
- Admin connection tests for AI/SMTP return sanitized, user-facing diagnostics only; raw upstream details stay in server logs with redaction for URLs, hosts, emails, and secret-like tokens
- Internal errors never leak stack traces to the client
## 10. Dependency Security
- **Dependabot** configured for automated dependency updates
- `pnpm audit` runs in the scheduled [nightly-security.yml](../.github/workflows/nightly-security.yml) workflow, and high-signal architecture guardrails run on every PR in [ci.yml](../.github/workflows/ci.yml)
- Lockfile integrity verified on install
- Transitive audit hotspots such as `flatted` and `picomatch` are pinned through root `pnpm.overrides` to keep dev-tooling CVEs from drifting back in through nested dependencies
- Runtime workbook parsing and export generation now use `exceljs` boundaries instead of direct `xlsx` usage in application, engine, and web paths
- `pnpm audit --audit-level=high` is clean as of 2026-03-30; the remaining dependency findings are low/moderate only
## 11. Network Architecture
```
Browser -> Next.js (port 3100) -> tRPC -> Prisma -> PostgreSQL (port 5433)
                               -> Redis (port 6380, SSE pub/sub)
                               -> Azure OpenAI / Gemini (external HTTPS)
                               -> SMTP (email notifications)
```
- PostgreSQL and Redis accessible only within Docker network
- External API calls (AI, SMTP) over TLS
- No direct database access from the internet
## 12. Database Security
### Authentication and Access
- PostgreSQL uses password-based authentication (`capakraken` user with strong password)
- Connection restricted to the Docker internal network (port 5433 on host, 5432 inside container)
- No direct internet access to the database — all queries routed through Prisma ORM via the application layer
- Application uses a single database user; no shared or anonymous access
### Query Safety
- **Prisma ORM** enforces parameterized queries by default — no raw SQL concatenation
- All user inputs validated by Zod schemas before reaching the data layer
- JSONB fields (blueprints, skill matrices, permission overrides) are type-checked at the application boundary
### Active Hardening Measures
- **PostgreSQL audit logging** enabled via `docker-compose.yml` command flags:
  - `log_connections=on` / `log_disconnections=on`: all connection lifecycle events
  - `log_statement=ddl`: all DDL statements (CREATE, ALTER, DROP)
  - `log_min_duration_statement=1000`: slow queries (>1s) logged for performance review
  - `log_line_prefix='%t [%p] %u@%d '`: timestamp, PID, user, and database in every log line
- **SUPERUSER removed** from the application database user (`capakraken`); hardening script at `scripts/harden-postgres.sh`
- **Minimal privilege grants**: application user has only SELECT, INSERT, UPDATE, DELETE on tables and USAGE/SELECT on sequences — no CREATE, DROP, or SUPERUSER capabilities
### Recommendations for Further Production Hardening
1. **Enable PostgreSQL SSL/TLS**: Add `sslmode=require` to the Prisma connection string (`DATABASE_URL`) and configure `postgresql.conf` with `ssl = on`, `ssl_cert_file`, `ssl_key_file`
2. **Restrict connections by IP**: Configure `pg_hba.conf` to accept connections only from the application container's subnet (e.g., `172.18.0.0/16`)
3. **Use separate database roles**: Create a read-only role for reporting queries and a migration-only role for schema changes, limiting the default application role to DML operations
4. **Enable connection pooling**: Use PgBouncer in production to limit maximum connections and prevent resource exhaustion attacks
5. **Backup encryption**: Ensure `pg_dump` backups are encrypted at rest (GPG or filesystem-level encryption)
### Redis Security
- Redis instance runs without authentication in development (Docker-internal only)
- **Production recommendation**: Enable `requirepass` in Redis configuration and set `REDIS_URL` to include the password (`redis://:password@host:port`)
- Redis is used only for SSE pub/sub (no sensitive data persisted)
## 13. Proactive Monitoring
### Health Check Cron (`/api/cron/health-check`)
- Verifies PostgreSQL and Redis connectivity on each invocation
- On failure: creates CRITICAL in-app notifications for all ADMIN users
- Designed to be triggered by external cron (e.g., `curl` every 5 minutes)
- Protected by `CRON_SECRET` Bearer token
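
A minimal sketch of the Bearer-token check for the cron endpoints, assuming the secret arrives in the `Authorization` header. `isAuthorizedCronRequest` and `constantTimeEqual` are hypothetical names.

```typescript
// Compares two strings without returning early at the first mismatch.
function constantTimeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const length = Math.max(a.length, b.length);
  for (let i = 0; i < length; i++) {
    // charCodeAt returns NaN past the end of a string; treat that as 0.
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

function isAuthorizedCronRequest(
  authorizationHeader: string | null,
  cronSecret: string | undefined,
): boolean {
  // Fail closed when the secret is not configured.
  if (!cronSecret || !authorizationHeader) return false;
  const prefix = "Bearer ";
  if (authorizationHeader.slice(0, prefix.length) !== prefix) return false;
  return constantTimeEqual(authorizationHeader.slice(prefix.length), cronSecret);
}
```

In a Node.js runtime, `crypto.timingSafeEqual` on equal-length buffers is the more robust choice for the comparison; the hand-written loop only keeps this sketch free of imports.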
### Security Audit Cron (`/api/cron/security-audit`)
- Scans installed dependency versions against known minimum safe versions
- Alerts ADMIN users when high-severity outdated packages are detected
- Complements Dependabot with an in-app awareness layer
### nginx Hardening
- Reference configuration: `docs/nginx-hardening.conf`
- Covers: server token removal, rate limiting (auth: 1r/s, API: 10r/s), SSL hardening (TLS 1.2+), OCSP stapling
- Security headers applied at nginx level as a defense-in-depth backup to Next.js headers