# CapaKraken V2 Architecture Proposal

**Date:** 2026-03-11

**Scope:** Codebase review, v2 direction, architecture rethink, parallel agent strategy

## Executive Summary

CapaKraken already has a good base:

- monorepo boundaries are mostly clean
- `engine` and `staffing` contain useful pure domain logic
- Next.js + tRPC + Prisma keeps product iteration fast
- Redis-backed SSE is already a reasonable realtime baseline

The main issue is not the stack. The issue is that domain logic is split across:

- large client components
- large tRPC routers
- JSONB-heavy persistence models
- ad-hoc calculations in handlers

My recommendation for **v2** is:

1. **Do not jump to microservices yet.**
2. **Do move to a modular monolith with a real application layer and async workers.**
3. **Split “planning demand” from “actual assignments” at the data model level.**
4. **Keep JSONB only for extensibility, not for core planning workflows.**
5. **Introduce event/outbox-driven parallel agents for matching, conflicts, budget risk, notifications, and AI work.**

This gives you a v2 that is safer, easier to change, and still realistic for a small team.

---

## What The Codebase Does Well

- Domain packages are separated from the web app.
- Shared types and schemas reduce transport mismatch.
- Money is stored in integer cents.
- The app stays operationally simple: one app, one DB, one Redis.
- The timeline already has virtualization and SSE hooks, which means the product is past prototype stage.

---

## Current Pain Points

## 1. Critical correctness and security issues exist today

### Auth hashing is inconsistent

- Login verifies Argon2 hashes in [`apps/web/src/server/auth.ts#L20`](/home/hartmut/Documents/Copilot/capakraken/apps/web/src/server/auth.ts#L20).
- Admin-created users are still stored with SHA-256 in [`packages/api/src/router/user.ts#L41`](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/user.ts#L41).
- Impact: users created from the admin flow are likely unable to log in.

### Notification creation is open to any authenticated user

- `notification.create` is only `protectedProcedure` in [`packages/api/src/router/notification.ts#L66`](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/notification.ts#L66).
- Impact: any logged-in user can create notifications for arbitrary users.

### AI connection testing is Azure-shaped even when provider is OpenAI

- `testAiConnection` always constructs an Azure deployment URL in [`packages/api/src/router/settings.ts#L122`](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/settings.ts#L122).
- Impact: the provider abstraction is not actually reliable.

### Repo health checks are currently failing

- `pnpm test:unit` fails because `@capakraken/shared` has a Vitest script but no tests in [`packages/shared/package.json`](/home/hartmut/Documents/Copilot/capakraken/packages/shared/package.json).
- `pnpm typecheck` fails because `crypto.randomUUID()` is used without a visible import/global typing in [`packages/shared/src/schemas/project.schema.ts#L5`](/home/hartmut/Documents/Copilot/capakraken/packages/shared/src/schemas/project.schema.ts#L5).

These are not “v2 someday” items. They should be fixed before deeper refactoring.

## 2. Large surfaces are carrying too much responsibility

The biggest modules are already a warning sign:

- [`apps/web/src/components/timeline/TimelineView.tsx`](/home/hartmut/Documents/Copilot/capakraken/apps/web/src/components/timeline/TimelineView.tsx) is 1720 lines.
- [`apps/web/src/components/projects/ProjectWizard.tsx`](/home/hartmut/Documents/Copilot/capakraken/apps/web/src/components/projects/ProjectWizard.tsx) is 1171 lines.
- [`packages/api/src/router/resource.ts`](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/resource.ts) is 908 lines.
- [`packages/api/src/router/timeline.ts`](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/timeline.ts) is 631 lines.

That usually means:

- transport, orchestration, validation, business rules, and data access are mixed
- testing becomes expensive
- one change touches too many concerns

## 3. The core planning model is overloaded

The Prisma schema uses JSONB heavily in core workflows:

- blueprints and role presets in [`packages/db/prisma/schema.prisma#L147`](/home/hartmut/Documents/Copilot/capakraken/packages/db/prisma/schema.prisma#L147)
- resource availability, skills, and dynamic fields in [`packages/db/prisma/schema.prisma#L208`](/home/hartmut/Documents/Copilot/capakraken/packages/db/prisma/schema.prisma#L208)
- project staffing requirements and dynamic fields in [`packages/db/prisma/schema.prisma#L267`](/home/hartmut/Documents/Copilot/capakraken/packages/db/prisma/schema.prisma#L267)
- allocation metadata in [`packages/db/prisma/schema.prisma#L301`](/home/hartmut/Documents/Copilot/capakraken/packages/db/prisma/schema.prisma#L301)

The bigger modeling problem is that **`Allocation` currently represents both demand and assignment**:

- placeholder demand is modeled with `resourceId = null`
- headcount is stored on the same entity
- legacy `role` text and `roleId` coexist

This is the wrong aggregate for v2.

## 4. Staffing logic is not yet trustworthy enough to become a differentiator

`staffing.getSuggestions` currently:

- loads all active resources with overlapping allocations
- computes utilization in the router
- uses only Monday availability as the denominator in [`packages/api/src/router/staffing.ts#L45`](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/staffing.ts#L45)

That means the suggestion layer is:

- hard to scale
- not consistent with calendar-aware engine logic
- not a strong base for “AI-assisted staffing”

## 5. Routers are doing application-service work

Representative examples:

- timeline queries and update workflows live directly in [`packages/api/src/router/timeline.ts#L12`](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/timeline.ts#L12)
- allocation creation, placeholder fill, validation, vacation handling, cost calculation, audit logging, and event emission all live in [`packages/api/src/router/allocation.ts#L8`](/home/hartmut/Documents/Copilot/capakraken/packages/api/src/router/allocation.ts#L8)

The pure `engine` package exists, but the application layer that should orchestrate it does not.

---

## Recommended V2 Architecture

## Core Decision

**V2 should be a modular monolith plus worker processes, not a microservice split.**

Why:

- the product is still changing fast
- most failures are domain modeling and module-boundary problems, not network topology problems
- a microservice split would increase operational cost before domain seams are stable

### Target shape

```text
apps/web                 -> UI + route handlers only
packages/api             -> transport adapters only (tRPC procedures, auth boundary, DTO mapping)
packages/application     -> use cases / command handlers / query handlers
packages/domain-people
packages/domain-projects
packages/domain-demand
packages/domain-scheduling
packages/domain-calendar
packages/domain-notifications
packages/domain-ai       -> pure domain logic and policies
packages/infrastructure  -> Prisma repos, Redis pub/sub, job queue, mail, AI clients
workers/agents           -> async processors consuming outbox events and jobs
```

The key change is: **routers stop containing business workflows**. They become thin.

---

## Data Model Changes For V2

## 1. Split demand from assignment

Replace the current overloaded `Allocation` concept with:

- `DemandRequirement`
  - projectId
  - roleId
  - requiredSkills
  - date range
  - hoursPerDay
  - headcount
  - priority
  - status
- `Assignment`
  - demandRequirementId (nullable during migration)
  - resourceId
  - projectId
  - date range
  - hoursPerDay
  - cost snapshot
  - status
- `AssignmentChange` or `AssignmentRevision`
  - audit-friendly timeline history
  - supports undo/redo and reasoning

This removes:

- a nullable resource meaning two different business states
- headcount logic from real assignments
- placeholder branching across the whole codebase

## 2. Normalize the skill model

Today `Resource.skills` is JSONB. For v2, use:

- `Skill`
- `ResourceSkill`
- optional `RoleSkillProfile`

Keep JSONB only for imported raw skill matrix payloads if needed.

Benefits:

- real filtering
- better analytics
- reusable recommendation features
- explainable ranking

## 3. Normalize calendar capacity

Today availability is template-like JSON plus vacation overlays. For v2:

- `AvailabilityTemplate`
- `ResourceAvailabilityOverride`
- `CalendarException`
- `PublicHolidayCalendar`

This lets the engine answer:

- “what is capacity on this exact date?”
- “why is this person unavailable?”
- “what changed after a vacation approval?”

## 4. Keep blueprints, but narrow their role

Blueprints should remain for:

- custom fields
- UI configuration
- optional default demand templates

Blueprints should **not** continue to carry core planning state in JSONB.

## 5. Add an outbox

Introduce:

- `DomainEventOutbox`
- `Job`

Every important mutation writes:

- domain row changes
- an audit row
- an outbox event

in one transaction. That is the foundation for safe parallel agents.
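The single-transaction write above can be sketched as follows. This is a minimal illustration with hypothetical names (`Tx`, `assignResource`); in the real codebase the three writes would happen inside one Prisma interactive transaction, and the outbox row would later be drained by a worker.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical shapes -- the real tables would be Prisma models.
type OutboxEvent = { type: string; payload: Record<string, unknown>; occurredAt: Date };
type Assignment = { id: string; demandRequirementId: string; resourceId: string; status: string };

// Stand-in for a database transaction: all three collections are written
// together, so they commit or roll back as one unit.
interface Tx {
  assignments: Assignment[];
  auditLog: { action: string; entityId: string }[];
  outbox: OutboxEvent[];
}

export function assignResource(
  tx: Tx,
  input: { demandRequirementId: string; resourceId: string },
): Assignment {
  const assignment: Assignment = { id: randomUUID(), ...input, status: "active" };

  // 1. Domain row change
  tx.assignments.push(assignment);
  // 2. Audit row
  tx.auditLog.push({ action: "assignment.created", entityId: assignment.id });
  // 3. Outbox event, consumed asynchronously by the parallel agents
  tx.outbox.push({
    type: "AssignmentCreated",
    payload: { assignmentId: assignment.id, resourceId: input.resourceId },
    occurredAt: new Date(),
  });

  return assignment;
}
```

Because the event row commits with the domain row, an agent can never observe an `AssignmentCreated` event for an assignment that was rolled back.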
---

## Application Layer Design

Every important user action should map to a use case, for example:

- `CreateProject`
- `DefineDemand`
- `AssignResource`
- `MoveAssignment`
- `ApproveVacation`
- `ImportSkillMatrix`
- `RecomputeValueScore`
- `GenerateAiSummary`

Each use case should:

- load aggregates via repositories
- call pure domain policies
- persist through a transaction
- publish outbox events

Routers then become simple wrappers:

- validate input
- call the use case
- map the result to a DTO

This is the main architectural upgrade missing today.

---

## Query Side Design

V2 should use a **CQRS-lite** pattern:

- commands go through application services
- heavy timeline/dashboard/staffing reads use query services or read models

Examples:

- `timeline_read_model`
- `resource_capacity_snapshot`
- `project_budget_snapshot`
- `staffing_candidate_snapshot`

These can start as SQL views/materialized views or dedicated query handlers. No need for a separate read database yet.

This is especially important because the timeline and dashboards are read-heavy and aggregate-heavy.

---

## Parallel Runtime Agents

These are the v2 agents I would actually build. They should run as worker processes consuming outbox events and job records.

## 1. Match Agent

Input:

- `DemandRequirementCreated`
- `DemandRequirementChanged`
- `ResourceSkillChanged`
- `CalendarChanged`

Output:

- ranked candidate snapshots
- recommendation explanations

Responsibility:

- candidate filtering
- deterministic scoring
- optional AI explanation layer after deterministic ranking

## 2. Conflict Agent

Input:

- `AssignmentCreated`
- `AssignmentChanged`
- `VacationApproved`
- `CalendarExceptionChanged`

Output:

- overallocation/conflict records
- blocked-demand warnings

Responsibility:

- recompute exact day-level conflicts
- explain why a conflict exists

## 3. Budget Risk Agent

Input:

- assignment changes
- project budget changes
- project date changes

Output:

- burn snapshots
- over-budget warnings
- forecast deltas

Responsibility:

- separate financial forecasting from request/response latency

## 4. Notification Agent

Input:

- all user-visible domain events

Output:

- in-app notifications
- email sends
- digest batches

Responsibility:

- centralize fan-out
- remove notification logic from feature routers

## 5. Import Agent

Input:

- uploaded Excel/CSV/HRIS files

Output:

- staged import rows
- validation results
- normalized upserts

Responsibility:

- make imports resumable and auditable

## 6. AI Agent

Input:

- explicit AI jobs only

Output:

- summaries
- staffing rationale
- project risk narratives

Responsibility:

- all model interaction happens asynchronously
- prompt/result metadata is stored for traceability

Important rule: **AI never becomes the system of record.** It annotates deterministic outputs.

---

## Parallel Build Workstreams

If you want to execute v2 with parallel coding agents, use these lanes to avoid file collisions.
## Agent A: Core Model Refactor

Owns:

- `packages/db`
- `packages/shared`
- new domain packages

Tasks:

- introduce `DemandRequirement`
- introduce normalized skill/calendar models
- add outbox and job tables
- define new shared DTOs/events

## Agent B: Application Service Extraction

Owns:

- `packages/application` (new package)
- router-to-service extraction in `packages/api`

Tasks:

- move create/update/fill/approve workflows out of routers
- standardize transaction boundaries
- standardize audit + outbox emission

## Agent C: Timeline V2

Owns:

- `apps/web/src/components/timeline/*`
- timeline read models and UI contracts

Tasks:

- break `TimelineView` into a screen shell + view model + row renderers
- move the timeline state machine into dedicated hooks/store
- consume new query DTOs instead of raw Prisma-shaped payloads

## Agent D: Project Creation And Staffing UX

Owns:

- `apps/web/src/components/projects/*`
- staffing query DTO consumers

Tasks:

- split `ProjectWizard`
- convert the wizard from local mega-state to step reducers / use cases
- integrate recommendation snapshots from the Match Agent

## Agent E: Security, Platform, And Notifications

Owns:

- auth
- user management
- settings
- notification workflows

Tasks:

- unify password hashing
- close permission gaps
- move secret handling behind infrastructure services
- wire up the Notification Agent

This split keeps most workstreams independent.

---

## Migration Plan

## Phase 0: Stabilize The Current System

Do this before any architecture refactor:

1. Fix user creation to use Argon2.
2. Restrict `notification.create` to admin/system workflows.
3. Fix `testAiConnection` to truly support both providers.
4. Make `pnpm test:unit` and `pnpm typecheck` green again.
5. Remove remaining legacy `role`/`roleId` ambiguity where possible.
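The first Phase 0 item reduces to one shared helper that every signup path imports, so login and admin creation can never diverge again. The sketch below uses Node's built-in scrypt purely as a runnable stand-in; the actual fix should call the same Argon2 helper that the login path in `auth.ts` already uses, and `hashPassword`/`verifyPassword` are hypothetical names.

```typescript
import { scryptSync, randomBytes, timingSafeEqual } from "node:crypto";

// One hashing module shared by login verification AND admin user creation.
// Stand-in KDF: Node's scrypt; the real code would delegate to Argon2.
export function hashPassword(password: string): string {
  const salt = randomBytes(16).toString("hex");
  const hash = scryptSync(password, salt, 32).toString("hex");
  return `${salt}:${hash}`; // salt stored alongside the derived key
}

export function verifyPassword(password: string, stored: string): boolean {
  const [salt, hash] = stored.split(":");
  const candidate = scryptSync(password, salt, 32).toString("hex");
  // Constant-time comparison to avoid timing side channels
  return timingSafeEqual(Buffer.from(candidate, "hex"), Buffer.from(hash, "hex"));
}
```

With this in place, the SHA-256 call site in `user.ts` becomes a one-line change to `hashPassword(...)`, and admin-created users can log in through the normal flow.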
## Phase 1: Extract The Application Layer

Without changing the UI yet:

- add use-case services
- move router logic into them
- introduce outbox writes
- standardize domain events

This phase creates the seam for the rest of v2.

## Phase 2: Introduce New Core Tables With Dual Write

- create `DemandRequirement`, normalized skill, and normalized calendar tables
- dual-write from old flows
- build migration scripts and backfills
- add compatibility query adapters

## Phase 3: Rebuild The Timeline And Wizard Against New Read Models

- the timeline consumes query DTOs
- the wizard consumes demand/assignment APIs
- staffing suggestions come from snapshots, not direct all-resource scans

## Phase 4: Turn On Parallel Agents

- Match Agent
- Conflict Agent
- Budget Risk Agent
- Notification Agent
- Import Agent
- AI Agent

## Phase 5: Optional Service Extraction

Only after the domain seams hold:

- extract workers into separate deployables if load justifies it
- keep the transactional core close to the DB

---

## Recommended Immediate Improvement Backlog

If I had to choose the highest-leverage next moves:

1. Fix auth, notification permissions, the AI test path, and the broken repo checks.
2. Create `packages/application` and move allocation/timeline/project workflows into it.
3. Introduce `DemandRequirement` and stop using placeholder allocations as a dual-purpose model.
4. Rebuild staffing suggestions around normalized skills + calendar-aware capacity.
5. Split the timeline and project wizard around view-model boundaries, not just JSX extraction.

---

## Bottom Line

**V2 should not be “more features on the current shape.”**

It should be:

- a cleaner domain model
- a thinner API layer
- async agents for expensive side effects
- read models for planning screens
- normalized planning entities with JSONB reserved for extension points

That will make CapaKraken better at the thing it claims to be: a planning system, not just a CRUD app with a timeline.