Hartmut c4b01c1bfc security: workbook path allowlist + stronger image polyglot validation (#54)
- dispo workbook imports are pinned to DISPO_IMPORT_DIR (default ./imports):
  tRPC input rejects absolute paths and .. segments, runtime reader
  re-validates containment via path.relative. Closes a path-traversal
  class that reached ExcelJS CVEs through admin/compromised tokens.
- image validator now checks the full 8-byte PNG magic, enforces PNG IEND
  and JPEG EOI trailers, scans the decoded buffer for markup polyglot
  markers (<script, <svg, <iframe, javascript:, onerror=, ...), and
  explicitly rejects SVG. Provider-generated covers (DALL-E, Gemini) run
  through the same validator before persistence — an untrusted upstream
  cannot smuggle a stored-XSS payload past us.
- added image-validation.test.ts and tightened documentation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-17 15:26:29 +02:00


# Security Architecture — CapaKraken
> Version: 1.0 | Date: 2026-03-27
---
## 1. Authentication
- **Auth.js v5** (NextAuth) with Credentials provider
- **Password hashing**: Argon2id via `@node-rs/argon2` (memory cost 65536 KiB = 64 MiB, time cost 3)
- **Multi-Factor Authentication**: TOTP (RFC 6238) via `otpauth` library
  - Configurable per user (enable/disable via admin or self-service)
  - 30-second window, SHA-1, 6-digit codes with 1-step tolerance
- **Rate limiting**: 5 login attempts per 15 minutes per email address (in-memory sliding window)
- **Session strategy**: JWT with server-side validation
  - Absolute timeout: 8 hours (configurable via `sessionMaxAge`)
  - Idle timeout: 30 minutes (configurable via `sessionIdleTimeout`)
- **Concurrent session limit**: configurable `maxConcurrentSessions` (default 3), kick-oldest strategy
- **Login/logout audit**: all authentication events (success, failure, rate-limit, invalid TOTP, logout) are recorded in the audit log
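
The login rate limit above (5 attempts per 15 minutes per email, in-memory sliding window) can be pictured with the following minimal sketch. `LoginRateLimiter` and `tryAttempt` are hypothetical names, and the choice not to count rejected attempts against the window is an assumption; the production limiter may differ.

```typescript
// Hypothetical sketch of an in-memory sliding-window limiter keyed by email.
const WINDOW_MS = 15 * 60 * 1000; // 15 minutes, per the policy above
const MAX_ATTEMPTS = 5;

class LoginRateLimiter {
  private readonly attempts = new Map<string, number[]>();

  /** Records an attempt and returns true if it is allowed. */
  tryAttempt(email: string, now: number = Date.now()): boolean {
    const key = email.trim().toLowerCase();
    const cutoff = now - WINDOW_MS;
    // Keep only the timestamps that are still inside the window.
    const recent = (this.attempts.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= MAX_ATTEMPTS) {
      this.attempts.set(key, recent);
      return false; // blocked; the rejected attempt itself is not recorded
    }
    recent.push(now);
    this.attempts.set(key, recent);
    return true;
  }
}
```

Passing `now` explicitly keeps the limiter deterministic in unit tests.
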
## 2. Authorization
### Role-Based Access Control (RBAC)
Five-level role hierarchy:
| Role | Level | Capabilities |
| ---------- | ----- | ---------------------------------------------------------- |
| ADMIN | 5 | Full system access, user management, system settings |
| MANAGER | 4 | Project management, resource allocation, vacation approval |
| CONTROLLER | 3 | Financial views, budget management, reporting |
| USER | 2 | Self-service (own vacations, own resource profile) |
| VIEWER | 1 | Read-only access to permitted areas |
### Per-User Permission Overrides
- `permissionOverrides` JSONB field on User model
- `resolvePermissions(role, overrides)` computes effective permissions
- `requirePermission(ctx, key)` enforced on every tRPC procedure
- Granular `PermissionKey` enum covering all domain actions
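
As a rough illustration of how role defaults and per-user overrides could combine, consider the sketch below. Only the names `resolvePermissions`, `requirePermission`, and `PermissionKey` come from this document; the permission keys, the `grant` / `revoke` override shape, the role defaults, and the simplified `ctx` are assumptions made for the example.

```typescript
// Illustrative only: keys, override shape, and role defaults are hypothetical.
type Role = "ADMIN" | "MANAGER" | "CONTROLLER" | "USER" | "VIEWER";
type PermissionKey = "project.edit" | "budget.view" | "vacation.approve";

interface PermissionOverrides {
  grant?: PermissionKey[];
  revoke?: PermissionKey[];
}

const ROLE_DEFAULTS: Record<Role, PermissionKey[]> = {
  ADMIN: ["project.edit", "budget.view", "vacation.approve"],
  MANAGER: ["project.edit", "vacation.approve"],
  CONTROLLER: ["budget.view"],
  USER: [],
  VIEWER: [],
};

function resolvePermissions(
  role: Role,
  overrides: PermissionOverrides = {},
): Set<PermissionKey> {
  const effective = new Set<PermissionKey>(ROLE_DEFAULTS[role]);
  for (const key of overrides.grant ?? []) effective.add(key);
  // Revocations are applied last so that an explicit deny always wins.
  for (const key of overrides.revoke ?? []) effective.delete(key);
  return effective;
}

function requirePermission(
  ctx: { permissions: Set<PermissionKey> },
  key: PermissionKey,
): void {
  if (!ctx.permissions.has(key)) {
    throw new Error(`FORBIDDEN: missing permission ${key}`);
  }
}
```
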
### tRPC Middleware Stack
```
publicProcedure
  -> protectedProcedure (requires authenticated session)
    -> controllerProcedure (ADMIN + MANAGER + CONTROLLER)
      -> managerProcedure (ADMIN + MANAGER)
        -> adminProcedure (ADMIN only)
```
## 3. Data Protection
### Database Security
- **PostgreSQL** with TLS in production
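- **Prisma ORM**: parameterized queries by default, so user input is never concatenated into ORM-generated SQL (raw-query escape hatches such as `$queryRawUnsafe` fall outside this guarantee)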
- Database not exposed to the internet (Docker internal network only)
- All monetary values stored as integer cents (no floating-point precision issues)
### Data at Rest
- Passwords: Argon2id hash (never stored in plaintext)
- TOTP secrets: stored in DB (encrypted at-rest via PostgreSQL TDE when available)
- Runtime secrets now resolve env-first for AI, Gemini, SMTP, and anonymization seed values. Database-backed `SystemSettings` values remain transitional compatibility storage, not the preferred production source of truth.
- Recommended runtime overrides: `OPENAI_API_KEY`, `AZURE_OPENAI_API_KEY`, `AZURE_DALLE_API_KEY`, `GEMINI_API_KEY`, `SMTP_PASSWORD`, `ANONYMIZATION_SEED`
- Admin settings reads expose only presence flags (`hasApiKey`, `hasSmtpPassword`, `hasGeminiApiKey`) instead of returning secret values to the browser, and those flags also reflect environment-backed runtime overrides
- The admin settings mutation no longer persists new secret values into `SystemSettings`; secret inputs must be provisioned through environment or a deployment-time secret manager, and legacy database copies can be cleared explicitly
- The admin UI now exposes runtime secret source/status plus an explicit "clear legacy DB secrets" cleanup path so operators can complete the migration without direct database writes
- Production startup now validates Auth.js runtime configuration and refuses to boot if `AUTH_SECRET`/`NEXTAUTH_SECRET` is missing, left on a known development placeholder, paired with a non-HTTPS public auth URL, shorter than 32 characters, or below the Shannon-entropy threshold of 3.5 bits/char
- User passwords: minimum 12 characters, maximum 128 characters. The `PASSWORD_MIN_LENGTH` / `PASSWORD_MAX_LENGTH` constants (`@capakraken/shared/constants`) are imported by every client-side pre-submit validator and server-side Zod schema, which prevents client/server policy drift
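
For readers unfamiliar with the entropy criterion, the sketch below shows one way to compute Shannon entropy per character and apply the length and entropy thresholds quoted above. The function names are hypothetical, and the placeholder and HTTPS checks are left out.

```typescript
/** Shannon entropy of a string, in bits per character. */
function shannonEntropyBitsPerChar(value: string): number {
  const chars = value.split("");
  if (chars.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const ch of chars) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let entropy = 0;
  counts.forEach((count) => {
    const p = count / chars.length;
    entropy -= p * Math.log2(p);
  });
  return entropy;
}

/** Hypothetical helper: applies only the length and entropy rules. */
function isAcceptableAuthSecret(secret: string): boolean {
  return secret.length >= 32 && shannonEntropyBitsPerChar(secret) >= 3.5;
}
```

A secret made of one repeated character scores 0 bits/char regardless of length, which is why the length rule alone is not sufficient.
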
#### Secret rotation
- **`AUTH_SECRET` / `NEXTAUTH_SECRET`** is the signing key for all JWT session cookies. Rotation forces every user to re-authenticate on their next request.
  - Generate replacement: `openssl rand -base64 32`
  - Deploy path:
    1. Update the secret in the deployment secret store (not in repo).
    2. Roll all application containers; existing JWTs signed under the old key fail verification and the user is redirected to sign-in.
    3. There is no multi-key transition window: this is a hard cut on purpose, because a compromised signing key must be retired immediately.
  - Recommended cadence: quarterly, or immediately on suspected compromise.
- **`POSTGRES_PASSWORD`** rotation is coordinated across postgres container init, the app container's `DATABASE_URL`, and any external replication consumers — follow the deployment runbook.
### Anonymization
- Configurable global anonymization for VIEWER role
- Resource names, emails replaced with deterministic pseudonyms (seeded hash)
- Anonymization domain and mode configurable in SystemSettings
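
A deterministic pseudonym can be derived from a keyed hash of the real value. The sketch below uses HMAC-SHA256 as one possible "seeded hash"; this document does not specify the actual construction, and the `pseudonymize` name, output format, and default domain are assumptions.

```typescript
import { createHmac } from "node:crypto";

/**
 * Hypothetical sketch: maps a real identifier to a stable pseudonym.
 * The same (value, seed) pair always yields the same result.
 */
function pseudonymize(
  value: string,
  seed: string,
  domain: string = "example.invalid",
): { name: string; email: string } {
  const digest = createHmac("sha256", seed)
    .update(value.trim().toLowerCase())
    .digest("hex");
  const tag = digest.slice(0, 10); // short, stable label derived from the digest
  return { name: `Resource ${tag}`, email: `resource-${tag}@${domain}` };
}
```

Because the seed acts as a key, rotating `ANONYMIZATION_SEED` changes every pseudonym at once.
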
## 4. Session Management
- **Server-side JWT** with `SameSite=Strict` cookies
- `httpOnly` cookies prevent XSS-based session theft
- `secure` flag enforced in production (HTTPS only)
- CSRF protection via Auth.js built-in CSRF token
- Configurable session timeouts (absolute + idle) via SystemSettings
- Active session registry with concurrent session limit enforcement
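
The kick-oldest strategy for the concurrent session limit can be sketched as follows. `ActiveSession` and `registerSession` are hypothetical names, and the real registry is presumably persisted rather than held in an array.

```typescript
interface ActiveSession {
  id: string;
  userId: string;
  createdAt: number; // epoch milliseconds
}

/**
 * Adds `next` to the registry and evicts the user's oldest sessions
 * until at most `maxConcurrentSessions` remain for that user.
 */
function registerSession(
  registry: ActiveSession[],
  next: ActiveSession,
  maxConcurrentSessions: number = 3,
): { registry: ActiveSession[]; evicted: ActiveSession[] } {
  const mine = registry
    .filter((s) => s.userId === next.userId)
    .concat(next)
    .sort((a, b) => a.createdAt - b.createdAt);
  const overflow = Math.max(0, mine.length - maxConcurrentSessions);
  const evicted = mine.slice(0, overflow); // oldest first
  const evictedIds = new Set(evicted.map((s) => s.id));
  const updated = registry.concat(next).filter((s) => !evictedIds.has(s.id));
  return { registry: updated, evicted };
}
```
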
## 5. Input Validation
- **Zod schemas** on every tRPC procedure input
- Strict TypeScript (`strict: true`, `exactOptionalPropertyTypes: true`)
- Blueprint dynamic fields validated at runtime against stored Zod schema definitions
- File uploads validated by:
  - MIME type whitelist (`image/png`, `image/jpeg`, `image/webp`, `image/tiff`, `image/bmp`). SVG is explicitly rejected because XML markup could carry `<script>`.
  - Size limit (10 MB client-side, 4 MB server-side after compression)
  - Full magic-byte verification: declared MIME must match actual content. PNG uses the full 8-byte signature, not a short prefix that would accept polyglots.
  - Trailer check: PNG must end with an `IEND` chunk, JPEG with the `FFD9` EOI marker. Any bytes appended after the trailer are rejected.
  - Polyglot-marker scan: the decoded buffer is searched (latin1, lowercased) for markup fragments (`<script`, `<svg`, `<iframe`, `javascript:`, `onerror=`, …) and rejected if any appear. Provider-generated images (DALL-E, Gemini) run through the same validator before persistence, so an untrusted upstream cannot smuggle a stored-XSS payload past us by virtue of being "our" API.
- Dispo workbook imports must live under the `DISPO_IMPORT_DIR` directory (defaults to `./imports`). The tRPC input schema accepts only relative paths (no `..` segments, no absolute paths), and the runtime workbook reader re-validates that the resolved absolute path stays inside `DISPO_IMPORT_DIR`. This closes a path-traversal class that would have let an admin (or compromised admin token) point the ExcelJS parser at arbitrary files on disk, keeping known ExcelJS CVEs from being reachable through our own API.
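
The image and workbook-path checks above can be sketched roughly as follows. This is a simplified illustration, not the project's validator: it covers only PNG and JPEG, the function names are hypothetical, and the marker list is the abbreviated one quoted in this section.

```typescript
import * as path from "node:path";

const PNG_MAGIC = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
// Final 12 bytes of a PNG: zero-length IEND chunk followed by its CRC.
const PNG_IEND = [0, 0, 0, 0, 0x49, 0x45, 0x4e, 0x44, 0xae, 0x42, 0x60, 0x82];
const MARKERS = ["<script", "<svg", "<iframe", "javascript:", "onerror="];

function startsWithBytes(bytes: Uint8Array, prefix: number[]): boolean {
  return bytes.length >= prefix.length && prefix.every((b, i) => bytes[i] === b);
}

function endsWithBytes(bytes: Uint8Array, suffix: number[]): boolean {
  const offset = bytes.length - suffix.length;
  return offset >= 0 && suffix.every((b, i) => bytes[offset + i] === b);
}

/** Magic bytes, trailer, and polyglot-marker scan for PNG / JPEG only. */
function looksLikeSafeImage(bytes: Uint8Array, mime: string): boolean {
  const shapeOk =
    mime === "image/png"
      ? startsWithBytes(bytes, PNG_MAGIC) && endsWithBytes(bytes, PNG_IEND)
      : mime === "image/jpeg"
        ? startsWithBytes(bytes, [0xff, 0xd8, 0xff]) && endsWithBytes(bytes, [0xff, 0xd9])
        : false;
  if (!shapeOk) return false;
  // latin1 decode: each byte maps to the code point with the same value.
  let text = "";
  bytes.forEach((b) => {
    text += String.fromCharCode(b);
  });
  const lowered = text.toLowerCase();
  return !MARKERS.some((marker) => lowered.indexOf(marker) !== -1);
}

/** True if `relativePath` resolves to a location strictly inside `baseDir`. */
function isInsideImportDir(baseDir: string, relativePath: string): boolean {
  if (path.isAbsolute(relativePath)) return false;
  const base = path.resolve(baseDir);
  const rel = path.relative(base, path.resolve(base, relativePath));
  const escapes = rel === ".." || rel.startsWith(".." + path.sep);
  return rel !== "" && !escapes && !path.isAbsolute(rel);
}
```

Comparing against `".." + path.sep` rather than a bare `".."` prefix avoids rejecting legitimate file names that merely begin with two dots.
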
### Prompt-Injection Guard (defense-in-depth only)
`packages/api/src/lib/prompt-guard.ts` runs a short regex list against every
free-text user prompt sent to an AI tool (assistant chat + project-cover
DALL-E prompt). Input is normalised before the regex runs:
1. Unicode NFKD decomposition (collapses fullwidth / compatibility forms and
splits diacritics from their base letter).
2. Strip zero-width / directional / combining code points that attackers use
to break contiguous substring matches.
3. Fold a small set of Cyrillic / Greek homoglyphs to their Latin
equivalents.
This guard is **defense-in-depth, not an authorisation boundary**. The actual
security boundary for AI-initiated actions is the per-tool
`requirePermission(ctx, PermissionKey.*)` check inside every assistant tool —
an LLM that has been successfully jailbroken still cannot perform an action
its caller's role does not allow. Motivated adversaries **will** find prompts
that defeat the regex layer; its purpose is to raise the cost of casual
injection attempts and to surface them as audit-log entries.
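
The three normalisation steps could look roughly like the sketch below. The homoglyph table, the stripped code-point ranges, and the single example pattern are assumptions chosen for illustration; the actual contents of `prompt-guard.ts` are not reproduced here.

```typescript
// Hypothetical, deliberately small homoglyph table (Cyrillic / Greek -> Latin).
const HOMOGLYPHS: Record<string, string> = {
  "\u0430": "a", // Cyrillic a
  "\u0435": "e", // Cyrillic ie
  "\u043E": "o", // Cyrillic o
  "\u0440": "p", // Cyrillic er
  "\u0441": "c", // Cyrillic es
  "\u0456": "i", // Cyrillic Byelorussian-Ukrainian i
  "\u03BF": "o", // Greek omicron
  "\u03B9": "i", // Greek iota
};

function normalizePrompt(input: string): string {
  return input
    // 1. Compatibility decomposition: fullwidth forms collapse, diacritics split off.
    .normalize("NFKD")
    // 2. Drop combining marks, zero-width characters, and directional controls.
    .replace(/[\u0300-\u036F\u200B-\u200F\u202A-\u202E\u2060-\u2064\uFEFF]/g, "")
    // 3. Fold known homoglyphs to their Latin look-alikes.
    .replace(/[\u0430\u0435\u043E\u0440\u0441\u0456\u03BF\u03B9]/g, (ch) => HOMOGLYPHS[ch] ?? ch);
}

// One illustrative pattern; the real regex list is not shown in this document.
const INJECTION_PATTERNS: RegExp[] = [/ignore (all )?previous instructions/i];

function isSuspiciousPrompt(input: string): boolean {
  const normalized = normalizePrompt(input);
  return INJECTION_PATTERNS.some((pattern) => pattern.test(normalized));
}
```
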
## 6. Audit Logging
### Activity History System
- Centralized `createAuditEntry()` function. Security-critical callers (auth, assistant
prompts, admin mutations) `await` the write so the entry is durable before the
user-visible effect completes; non-critical callers may fire-and-forget
- Covers 29+ of 36 tRPC routers
- Logged fields: `entityType`, `entityId`, `action`, `userId`, `changes` (JSONB with before/after/diff), `source`, `summary`
- Authentication events: login success/failure, logout, rate limiting, MFA failures
### Assistant prompt audit
Each user turn through the AI assistant writes an `AssistantPrompt` audit row
with conversation ID, prompt length, SHA-256 fingerprint, current page context,
and whether the prompt-injection guard flagged the input. Raw prompt text is
**not** retained by default — the hash + length fingerprint is enough for a
responder to correlate an audit row with a later forensic export if the user
retains their chat transcript, but the audit store itself does not accumulate a
plain-text corpus of everything users typed into the assistant. This balances
GDPR Art. 30 (records of processing) against data-minimisation.
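
A minimal sketch of the fingerprinting idea, assuming a hypothetical row shape and helper name (the real `AssistantPrompt` schema may carry more fields):

```typescript
import { createHash } from "node:crypto";

// Hypothetical audit-row shape; note that it has no raw prompt field.
interface AssistantPromptAudit {
  conversationId: string;
  promptLength: number;
  promptSha256: string;
  pageContext: string;
  guardFlagged: boolean;
}

function buildAssistantPromptAudit(
  conversationId: string,
  prompt: string,
  pageContext: string,
  guardFlagged: boolean,
): AssistantPromptAudit {
  return {
    conversationId,
    promptLength: prompt.length,
    // Hex-encoded SHA-256 of the UTF-8 prompt text.
    promptSha256: createHash("sha256").update(prompt, "utf8").digest("hex"),
    pageContext,
    guardFlagged,
  };
}
```

A responder holding a user's exported transcript can recompute the hash of each turn and match it against the stored fingerprint.
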
### External API Call Logging
- All OpenAI/Azure/Gemini API calls logged via `loggedAiCall()` wrapper
- Structured Pino logs: `{ provider, model, promptLength, responseTimeMs }`
- Failed calls logged at `warn` level with sanitized diagnostics only, with URL and secret-like tokens redacted before they reach structured logs
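
Redaction before logging might be approximated as below. The patterns are assumptions meant to show the idea (URLs, bearer tokens, `sk-`-style keys, long opaque strings) and are not the project's actual redaction rules.

```typescript
/** Hypothetical sketch: strips URLs and secret-like tokens from a log message. */
function redactForLog(message: string): string {
  return message
    .replace(/\bhttps?:\/\/[^\s"'<>]+/gi, "[redacted-url]")
    .replace(/\bBearer\s+[A-Za-z0-9._~+\/=-]{8,}/gi, "Bearer [redacted-token]")
    .replace(/\bsk-[A-Za-z0-9_-]{8,}/g, "[redacted-token]")
    // Catch-all for long opaque strings that look like keys or hashes.
    .replace(/\b[A-Za-z0-9+\/_-]{32,}/g, "[redacted-token]");
}
```
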
### tRPC Request Logging
- Every tRPC call logged with request ID, user ID, path, duration
- Slow calls (>500ms) logged at `warn` level
## 7. HTTP Security Headers
Static headers are configured in `next.config.ts`. The Content-Security-Policy
is emitted per-request by `apps/web/src/middleware.ts` so it can carry a
per-request nonce.
| Header | Value |
| ------------------------- | ---------------------------------------------- |
| Strict-Transport-Security | `max-age=63072000; includeSubDomains; preload` |
| Content-Security-Policy | Restrictive CSP with nonce-based script-src |
| X-Frame-Options | `DENY` |
| X-Content-Type-Options | `nosniff` |
| X-XSS-Protection | `1; mode=block` |
| Referrer-Policy | `strict-origin-when-cross-origin` |
| Permissions-Policy | Camera, microphone, geolocation disabled |
### Content-Security-Policy directives (production)
| Directive | Value | Rationale |
| ----------------- | ------------------------- | -------------------------------------------------- |
| `default-src` | `'self'` | Baseline deny-all-cross-origin. |
| `script-src` | `'self' 'nonce-<random>'` | No `unsafe-inline` / `unsafe-eval` in prod. |
| `style-src` | `'self' 'unsafe-inline'` | Accepted residual risk — see note below. |
| `img-src` | `'self' data: blob:` | Allow base64 previews and generated blobs only. |
| `font-src` | `'self' data:` | Data URLs for inline-embedded fonts. |
| `connect-src` | `'self'` | All AI / third-party calls are server-side. |
| `frame-ancestors` | `'none'` | Clickjacking defence. |
| `frame-src` | `'none'` | No third-party iframes. |
| `object-src` | `'none'` | Blocks legacy `<object>` / Flash / applet vectors. |
| `media-src` | `'self'` | No cross-origin video / audio. |
| `worker-src` | `'self' blob:` | Next.js runtime uses blob-URL workers. |
| `base-uri` | `'self'` | Blocks `<base>` hijacks. |
| `form-action` | `'self'` | Blocks form-exfiltration to third parties. |
**Residual risk — `style-src 'unsafe-inline'`:** React inlines component-scoped
style attributes and `@react-pdf/renderer` emits inline `<style>` blocks that
cannot carry a nonce. A strict `style-src-elem` would break both. The risk is
bounded because `script-src` is nonce-based — a pure CSS-injection attack
cannot escalate to JS execution in this application.
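
To make the directive table concrete, here is a sketch of how a per-request policy string could be assembled from a nonce. `buildCsp` is a hypothetical helper; how `middleware.ts` actually generates the nonce and sets the header is not shown here.

```typescript
/** Builds the production CSP from the directive table, given a per-request nonce. */
function buildCsp(nonce: string): string {
  const directives: Array<[string, string[]]> = [
    ["default-src", ["'self'"]],
    ["script-src", ["'self'", `'nonce-${nonce}'`]],
    ["style-src", ["'self'", "'unsafe-inline'"]],
    ["img-src", ["'self'", "data:", "blob:"]],
    ["font-src", ["'self'", "data:"]],
    ["connect-src", ["'self'"]],
    ["frame-ancestors", ["'none'"]],
    ["frame-src", ["'none'"]],
    ["object-src", ["'none'"]],
    ["media-src", ["'self'"]],
    ["worker-src", ["'self'", "blob:"]],
    ["base-uri", ["'self'"]],
    ["form-action", ["'self'"]],
  ];
  return directives
    .map(([name, values]) => `${name} ${values.join(" ")}`)
    .join("; ");
}
```

The nonce must be unpredictable and fresh for every response; reusing one across requests defeats the purpose of a nonce-based `script-src`.
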
## 8. Rate Limiting
- **Per-IP rate limiting**: via middleware on all API routes
- **Per-user rate limiting**: configurable per-procedure
- **Shared rate-limit backend**: Redis-backed counters when `REDIS_URL` is configured; in-memory fallback remains available for local development and degraded operation
- **Auth-specific rate limiting**: 5 attempts / 15 min per email
- **AI API call rate limits**: upstream provider limits surfaced as user-friendly errors
## 9. Error Handling
- **Sentry** integration for production error tracking
- **Pino** structured logging (JSON in production, pretty-print in development)
- tRPC errors mapped to appropriate HTTP status codes
- AI API errors translated to human-readable messages via `parseAiError()` / `parseGeminiError()`
- Admin connection tests for AI/SMTP return sanitized, user-facing diagnostics only; raw upstream details stay in server logs with redaction for URLs, hosts, emails, and secret-like tokens
- Internal errors never leak stack traces to the client
## 10. Dependency Security
- **Dependabot** configured for automated dependency updates
- `pnpm audit` runs in the scheduled [nightly-security.yml](../.github/workflows/nightly-security.yml) workflow, and high-signal architecture guardrails run on every PR in [ci.yml](../.github/workflows/ci.yml)
- Lockfile integrity verified on install
- Transitive audit hotspots such as `flatted` and `picomatch` are pinned through root `pnpm.overrides` to keep dev-tooling CVEs from drifting back in through nested dependencies
- Runtime workbook parsing and export generation now use `exceljs` boundaries instead of direct `xlsx` usage in application, engine, and web paths
- `pnpm audit --audit-level=high` is clean as of 2026-03-30; the remaining dependency findings are low/moderate only
## 11. Network Architecture
```
Browser -> Next.js (port 3100) -> tRPC -> Prisma -> PostgreSQL (port 5433)
                                       -> Redis (port 6380, SSE pub/sub)
                                       -> Azure OpenAI / Gemini (external HTTPS)
                                       -> SMTP (email notifications)
```
- PostgreSQL and Redis accessible only within Docker network
- External API calls (AI, SMTP) over TLS
- No direct database access from the internet
## 12. Database Security
### Authentication and Access
- PostgreSQL uses password-based authentication (`capakraken` user with strong password)
- Connection restricted to the Docker internal network (port 5433 on host, 5432 inside container)
- No direct internet access to the database — all queries routed through Prisma ORM via the application layer
- Application uses a single database user; no shared or anonymous access
### Query Safety
- **Prisma ORM** enforces parameterized queries by default — no raw SQL concatenation
- All user inputs validated by Zod schemas before reaching the data layer
- JSONB fields (blueprints, skill matrices, permission overrides) are type-checked at the application boundary
### Active Hardening Measures
- **PostgreSQL audit logging** enabled via `docker-compose.yml` command flags:
  - `log_connections=on` / `log_disconnections=on`: all connection lifecycle events
  - `log_statement=ddl`: all DDL statements (CREATE, ALTER, DROP)
  - `log_min_duration_statement=1000`: slow queries (>1s) logged for performance review
  - `log_line_prefix='%t [%p] %u@%d '`: timestamp, PID, user, and database in every log line
- **SUPERUSER removed** from the application database user (`capakraken`); hardening script at `scripts/harden-postgres.sh`
- **Minimal privilege grants**: application user has only SELECT, INSERT, UPDATE, DELETE on tables and USAGE/SELECT on sequences — no CREATE, DROP, or SUPERUSER capabilities
### Recommendations for Further Production Hardening
1. **Enable PostgreSQL SSL/TLS**: Add `sslmode=require` to the Prisma connection string and configure `postgresql.conf` with `ssl = on`, `ssl_cert_file`, `ssl_key_file`
2. **Restrict connections by IP**: Configure `pg_hba.conf` to accept connections only from the application container's subnet (e.g., `172.18.0.0/16`)
3. **Use separate database roles**: Create a read-only role for reporting queries and a migration-only role for schema changes, limiting the default application role to DML operations
4. **Enable connection pooling**: Use PgBouncer in production to limit maximum connections and prevent resource exhaustion attacks
5. **Backup encryption**: Ensure `pg_dump` backups are encrypted at rest (GPG or filesystem-level encryption)
### Redis Security
- Redis instance runs without authentication in development (Docker-internal only)
- **Production recommendation**: Enable `requirepass` in Redis configuration and set `REDIS_URL` to include the password (`redis://:password@host:port`)
- Redis is used only for SSE pub/sub (no sensitive data persisted)
## 13. Proactive Monitoring
### Health Check Cron (`/api/cron/health-check`)
- Verifies PostgreSQL and Redis connectivity on each invocation
- On failure: creates CRITICAL in-app notifications for all ADMIN users
- Designed to be triggered by external cron (e.g., `curl` every 5 minutes)
- Protected by `CRON_SECRET` Bearer token
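
Bearer-token protection for a cron endpoint can be sketched as follows. The constant-time comparison is an assumption about good practice rather than a statement about the existing handler, and `isAuthorizedCronRequest` is a hypothetical name.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

/** Returns true only if the Authorization header carries the expected secret. */
function isAuthorizedCronRequest(
  authorizationHeader: string | null | undefined,
  cronSecret: string,
): boolean {
  if (!authorizationHeader || !cronSecret) return false;
  const match = /^Bearer (.+)$/.exec(authorizationHeader);
  if (!match) return false;
  // Hash both sides so the buffers have equal length, as timingSafeEqual requires.
  const presented = createHash("sha256").update(match[1]).digest();
  const expected = createHash("sha256").update(cronSecret).digest();
  return timingSafeEqual(presented, expected);
}
```

Rejecting an empty `cronSecret` means a misconfigured deployment fails closed instead of accepting any token.
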
### Security Audit Cron (`/api/cron/security-audit`)
- Scans installed dependency versions against known minimum safe versions
- Alerts ADMIN users when high-severity outdated packages are detected
- Complements Dependabot with an in-app awareness layer
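
The version comparison behind such a scan might look like the sketch below. The function names and the plain `major.minor.patch` parsing are assumptions; pre-release tags and version ranges are ignored for brevity.

```typescript
type Version = [number, number, number];

function parseVersion(raw: string): Version {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(raw.trim());
  if (!match) throw new Error(`Unparseable version: ${raw}`);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

/** Negative if a < b, zero if equal, positive if a > b (numeric, not lexicographic). */
function compareVersions(a: string, b: string): number {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  return pa[0] - pb[0] || pa[1] - pb[1] || pa[2] - pb[2];
}

/** Names of installed packages that are below their minimum safe version. */
function findOutdated(
  installed: Record<string, string>,
  minimumSafe: Record<string, string>,
): string[] {
  return Object.keys(minimumSafe).filter((name) => {
    const current = installed[name];
    const minimum = minimumSafe[name];
    return (
      current !== undefined &&
      minimum !== undefined &&
      compareVersions(current, minimum) < 0
    );
  });
}
```

Comparing components numerically matters: a string comparison would rank `1.10.0` below `1.2.3`.
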
### nginx Hardening
- Reference configuration: `docs/nginx-hardening.conf`
- Covers: server token removal, rate limiting (auth: 1r/s, API: 10r/s), SSL hardening (TLS 1.2+), OCSP stapling
- Security headers applied at nginx level as a defense-in-depth backup to Next.js headers