# Planarchy V2 Architecture Proposal

**Date:** 2026-03-11
**Scope:** Codebase review, v2 direction, architecture rethink, parallel agent strategy

## Executive Summary

Planarchy already has a good base:
- monorepo boundaries are mostly clean
- `engine` and `staffing` contain useful pure domain logic
- Next.js + tRPC + Prisma keeps product iteration fast
- Redis-backed SSE is already a reasonable realtime baseline

The main issue is not the stack. The issue is that domain logic is split across:
- large client components
- large tRPC routers
- JSONB-heavy persistence models
- ad-hoc calculations in handlers

My recommendation for **v2** is:

1. **Do not jump to microservices yet.**
2. **Do move to a modular monolith with a real application layer and async workers.**
3. **Split “planning demand” from “actual assignments” at the data model level.**
4. **Keep JSONB only for extensibility, not for core planning workflows.**
5. **Introduce event/outbox-driven parallel agents for matching, conflicts, budget risk, notifications, and AI work.**

This gives you a v2 that is safer, easier to change, and still realistic for a small team.

---

## What The Codebase Does Well

- Domain packages are separated from the web app.
- Shared types and schemas reduce transport mismatch.
- Money is stored in integer cents.
- The app stays operationally simple: one app, one DB, one Redis.
- The timeline already has virtualization and SSE hooks, which means the product is past prototype stage.

---

## Current Pain Points

## 1. Critical correctness and security issues exist today

### Auth hashing is inconsistent
- Login verifies Argon2 hashes in [`apps/web/src/server/auth.ts#L20`](/home/hartmut/Documents/Copilot/planarchy/apps/web/src/server/auth.ts#L20).
- Admin-created users are still stored with SHA-256 in [`packages/api/src/router/user.ts#L41`](/home/hartmut/Documents/Copilot/planarchy/packages/api/src/router/user.ts#L41).
- Impact: users created from the admin flow are likely unable to log in.

### Notification creation is open to any authenticated user
- `notification.create` is only `protectedProcedure` in [`packages/api/src/router/notification.ts#L66`](/home/hartmut/Documents/Copilot/planarchy/packages/api/src/router/notification.ts#L66).
- Impact: any logged-in user can create notifications for arbitrary users.

### AI connection testing is Azure-shaped even when provider is OpenAI
- `testAiConnection` always constructs an Azure deployment URL in [`packages/api/src/router/settings.ts#L122`](/home/hartmut/Documents/Copilot/planarchy/packages/api/src/router/settings.ts#L122).
- Impact: the connection test cannot validate an OpenAI configuration, so the provider abstraction is not reliable.

### Repo health checks are currently failing
- `pnpm test:unit` fails because `@planarchy/shared` has a Vitest script but no tests in [`packages/shared/package.json`](/home/hartmut/Documents/Copilot/planarchy/packages/shared/package.json).
- `pnpm typecheck` fails because `crypto.randomUUID()` is used without a visible import/global typing in [`packages/shared/src/schemas/project.schema.ts#L5`](/home/hartmut/Documents/Copilot/planarchy/packages/shared/src/schemas/project.schema.ts#L5).

These are not “v2 someday” items. They should be fixed before deeper refactoring.

## 2. Large surfaces are carrying too much responsibility

The biggest modules are already a warning sign:
- [`apps/web/src/components/timeline/TimelineView.tsx`](/home/hartmut/Documents/Copilot/planarchy/apps/web/src/components/timeline/TimelineView.tsx) is 1720 lines.
- [`apps/web/src/components/projects/ProjectWizard.tsx`](/home/hartmut/Documents/Copilot/planarchy/apps/web/src/components/projects/ProjectWizard.tsx) is 1171 lines.
- [`packages/api/src/router/resource.ts`](/home/hartmut/Documents/Copilot/planarchy/packages/api/src/router/resource.ts) is 908 lines.
- [`packages/api/src/router/timeline.ts`](/home/hartmut/Documents/Copilot/planarchy/packages/api/src/router/timeline.ts) is 631 lines.

That usually means:
- transport, orchestration, validation, business rules, and data access are mixed
- testing becomes expensive
- one change touches too many concerns

## 3. The core planning model is overloaded

The Prisma schema uses JSONB heavily in core workflows:
- blueprints and role presets in [`packages/db/prisma/schema.prisma#L147`](/home/hartmut/Documents/Copilot/planarchy/packages/db/prisma/schema.prisma#L147)
- resource availability, skills, and dynamic fields in [`packages/db/prisma/schema.prisma#L208`](/home/hartmut/Documents/Copilot/planarchy/packages/db/prisma/schema.prisma#L208)
- project staffing requirements and dynamic fields in [`packages/db/prisma/schema.prisma#L267`](/home/hartmut/Documents/Copilot/planarchy/packages/db/prisma/schema.prisma#L267)
- allocation metadata in [`packages/db/prisma/schema.prisma#L301`](/home/hartmut/Documents/Copilot/planarchy/packages/db/prisma/schema.prisma#L301)

The bigger modeling problem is that **`Allocation` currently represents both demand and assignment**:
- placeholder demand is modeled with `resourceId = null`
- headcount is stored on the same entity
- legacy `role` text and `roleId` coexist

This is the wrong aggregate for v2.

## 4. Staffing logic is not yet trustworthy enough to become a differentiator

`staffing.getSuggestions` currently:
- loads all active resources with overlapping allocations
- computes utilization in the router
- uses only Monday availability as the denominator in [`packages/api/src/router/staffing.ts#L45`](/home/hartmut/Documents/Copilot/planarchy/packages/api/src/router/staffing.ts#L45)

That means the suggestion layer is:
- hard to scale
- not consistent with calendar-aware engine logic
- not a strong base for “AI-assisted staffing”
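
To illustrate why a Monday-only denominator skews utilization, here is a minimal sketch of a calendar-aware capacity calculation. The `WeeklyHours` shape and both function names are assumptions made for illustration; they are not taken from the existing `engine` package or the Prisma schema.

```typescript
// Hypothetical shape: available hours per weekday, index 0 = Sunday ... 6 = Saturday.
type WeeklyHours = readonly [number, number, number, number, number, number, number];

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Sum the available hours of every calendar day in [startIso, endIso] (inclusive, UTC).
function capacityHours(week: WeeklyHours, startIso: string, endIso: string): number {
  const start = Date.parse(`${startIso}T00:00:00Z`);
  const end = Date.parse(`${endIso}T00:00:00Z`);
  let total = 0;
  for (let t = start; t <= end; t += MS_PER_DAY) {
    total += week[new Date(t).getUTCDay()] ?? 0;
  }
  return total;
}

// Utilization against real capacity; null when the range has no capacity at all.
function utilization(booked: number, week: WeeklyHours, startIso: string, endIso: string): number | null {
  const capacity = capacityHours(week, startIso, endIso);
  return capacity === 0 ? null : booked / capacity;
}

// Part-time resource: Mon-Wed 8h, Thu 4h, Fri off.
const partTime: WeeklyHours = [0, 8, 8, 8, 4, 0, 0];
const weekCapacity = capacityHours(partTime, "2026-03-09", "2026-03-13"); // Monday to Friday
const mondayOnly = partTime[1] * 5; // the Monday-only shortcut
```

For this part-time resource the calendar-aware capacity is 28 hours, while the Monday-only shortcut assumes 40, so the same bookings would look noticeably less utilized than they really are.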

## 5. Routers are doing application-service work

Representative examples:
- timeline queries and update workflows live directly in [`packages/api/src/router/timeline.ts#L12`](/home/hartmut/Documents/Copilot/planarchy/packages/api/src/router/timeline.ts#L12)
- allocation creation, placeholder fill, validation, vacation handling, cost calc, audit log, and event emission all live in [`packages/api/src/router/allocation.ts#L8`](/home/hartmut/Documents/Copilot/planarchy/packages/api/src/router/allocation.ts#L8)

The pure `engine` package exists, but the application layer that should orchestrate it does not.

---

## Recommended V2 Architecture

## Core Decision

**V2 should be a modular monolith plus worker processes, not a microservice split.**

Why:
- the product is still changing fast
- most failures are domain modeling and module-boundary problems, not network topology problems
- a microservice split would increase operational cost before domain seams are stable

### Target shape

```text
apps/web
  -> UI + route handlers only

packages/api
  -> transport adapters only (tRPC procedures, auth boundary, DTO mapping)

packages/application
  -> use cases / command handlers / query handlers

packages/domain-people
packages/domain-projects
packages/domain-demand
packages/domain-scheduling
packages/domain-calendar
packages/domain-notifications
packages/domain-ai
  -> pure domain logic and policies

packages/infrastructure
  -> Prisma repos, Redis pub/sub, job queue, mail, AI clients

workers/agents
  -> async processors consuming outbox events and jobs
```

The key change is: **routers stop containing business workflows**. They become thin.

---

## Data Model Changes For V2

## 1. Split demand from assignment

Replace the current overloaded `Allocation` concept with:

- `DemandRequirement`
  - projectId
  - roleId
  - requiredSkills
  - date range
  - hoursPerDay
  - headcount
  - priority
  - status

- `Assignment`
  - demandRequirementId (nullable during migration)
  - resourceId
  - projectId
  - date range
  - hoursPerDay
  - cost snapshot
  - status

- `AssignmentChange` or `AssignmentRevision`
  - audit-friendly timeline history
  - supports undo/redo and reasoning

This removes:
- nullable resource meaning two different business states
- headcount logic from real assignments
- placeholder branching across the whole codebase
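
A rough sketch of how the split could look at the type level. Field names follow the lists above; the concrete types, the status values, and the `openHeadcount` helper are assumptions for illustration only.

```typescript
// Hypothetical v2 shapes; only a subset of the proposed fields is shown.
interface DemandRequirement {
  id: string;
  projectId: string;
  roleId: string;
  headcount: number;
}

interface ResourceAssignment {
  id: string;
  demandRequirementId: string | null; // nullable only during migration
  resourceId: string; // always set: an assignment is never a placeholder
  projectId: string;
  status: "PROPOSED" | "CONFIRMED" | "CANCELLED"; // illustrative statuses
}

// Open headcount is derived, not stored: demand minus non-cancelled assignments.
function openHeadcount(demand: DemandRequirement, assignments: readonly ResourceAssignment[]): number {
  const filled = assignments.filter(
    (a) => a.demandRequirementId === demand.id && a.status !== "CANCELLED",
  ).length;
  return Math.max(0, demand.headcount - filled);
}
```

Because `resourceId` is non-nullable on the assignment side, "unfilled demand" can only be expressed as a `DemandRequirement` with remaining headcount, which is the point of the split.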

## 2. Normalize the skill model

Today `Resource.skills` is JSONB. For v2, use:
- `Skill`
- `ResourceSkill`
- optional `RoleSkillProfile`

Keep JSONB only for imported raw skill matrix payloads if needed.

Benefits:
- real filtering
- better analytics
- reusable recommendation features
- explainable ranking
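
As a sketch of what "real filtering" means once skills are rows instead of JSONB, the following filters resources by required skills and minimum levels. The row shape, the numeric level scale, and the function name are assumptions; in practice this would be a SQL join rather than an in-memory filter.

```typescript
// Hypothetical ResourceSkill row; a numeric proficiency level is an assumption.
interface ResourceSkillRow {
  resourceId: string;
  skillId: string;
  level: number;
}

interface SkillNeed {
  skillId: string;
  minLevel: number;
}

// Return the ids of resources that satisfy every need, in sorted order.
function resourcesMatching(rows: readonly ResourceSkillRow[], needs: readonly SkillNeed[]): string[] {
  const ids = Array.from(new Set(rows.map((r) => r.resourceId))).sort();
  return ids.filter((id) =>
    needs.every((need) =>
      rows.some((r) => r.resourceId === id && r.skillId === need.skillId && r.level >= need.minLevel),
    ),
  );
}
```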

## 3. Normalize calendar capacity

Today availability is template-like JSON plus vacation overlays. For v2, use:
- `AvailabilityTemplate`
- `ResourceAvailabilityOverride`
- `CalendarException`
- `PublicHolidayCalendar`

This lets the engine answer:
- “what is capacity on this exact date?”
- “why is this person unavailable?”
- “what changed after a vacation approval?”
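
A minimal sketch of a per-date capacity lookup that answers both "how much" and "why". The `WeekTemplate` and `DatedException` shapes are simplified stand-ins for the proposed `AvailabilityTemplate` and `CalendarException` tables, and the precedence rule (exception beats template) is an assumption.

```typescript
// Hypothetical shapes standing in for AvailabilityTemplate and CalendarException rows.
type WeekTemplate = readonly [number, number, number, number, number, number, number]; // Sun..Sat hours

interface DatedException {
  date: string; // ISO date, e.g. "2026-03-11"
  hours: number; // capacity on that date; 0 means fully unavailable
  reason: string; // e.g. "vacation", "public holiday"
}

interface DayCapacity {
  date: string;
  hours: number;
  source: string; // "template" or the exception reason
}

// Exceptions win over the weekly template, and the answer carries its own explanation.
function capacityOn(date: string, template: WeekTemplate, exceptions: readonly DatedException[]): DayCapacity {
  const exception = exceptions.find((e) => e.date === date);
  if (exception) {
    return { date, hours: exception.hours, source: exception.reason };
  }
  const weekday = new Date(`${date}T00:00:00Z`).getUTCDay();
  return { date, hours: template[weekday] ?? 0, source: "template" };
}

const fullTime: WeekTemplate = [0, 8, 8, 8, 8, 8, 0];
const approvedVacation: DatedException[] = [{ date: "2026-03-11", hours: 0, reason: "vacation" }];

const wednesday = capacityOn("2026-03-11", fullTime, approvedVacation);
const thursday = capacityOn("2026-03-12", fullTime, approvedVacation);
```

Here `wednesday` reports 0 hours with source `"vacation"`, while `thursday` falls back to the template's 8 hours.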

## 4. Keep blueprints, but narrow their role

Blueprints should remain for:
- custom fields
- UI configuration
- optional default demand templates

Blueprints should **not** continue to carry core planning state in JSONB.

## 5. Add an outbox

Introduce:
- `DomainEventOutbox`
- `Job`

Every important mutation writes:
- domain row changes
- audit row
- outbox event

in one transaction.

That is the foundation for safe parallel agents.
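
The "one transaction" rule can be sketched with an in-memory stand-in for the database. Everything here (`PlanningStore`, `inTransaction`, `createAssignment`) is hypothetical; with Prisma, the same shape would typically live inside an interactive `$transaction` callback.

```typescript
interface OutboxEntry { id: number; type: string; payload: unknown; }
interface AuditEntry { actor: string; action: string; entityId: string; }

interface PlanningStore {
  assignments: Map<string, { resourceId: string }>;
  audit: AuditEntry[];
  outbox: OutboxEntry[];
}

// Stand-in for a database transaction: work on a copy, commit only if nothing throws.
function inTransaction(store: PlanningStore, work: (tx: PlanningStore) => void): void {
  const tx: PlanningStore = {
    assignments: new Map(store.assignments),
    audit: [...store.audit],
    outbox: [...store.outbox],
  };
  work(tx); // a throw here leaves `store` untouched
  store.assignments = tx.assignments;
  store.audit = tx.audit;
  store.outbox = tx.outbox;
}

function createAssignment(store: PlanningStore, id: string, resourceId: string, actor: string): void {
  inTransaction(store, (tx) => {
    tx.assignments.set(id, { resourceId }); // 1. domain row change
    if (actor === "") throw new Error("audit entry requires an actor");
    tx.audit.push({ actor, action: "assignment.create", entityId: id }); // 2. audit row
    tx.outbox.push({ id: tx.outbox.length + 1, type: "AssignmentCreated", payload: { id, resourceId } }); // 3. outbox event
  });
}

const planningStore: PlanningStore = { assignments: new Map(), audit: [], outbox: [] };
createAssignment(planningStore, "a1", "r1", "alice");

let rolledBack = false;
try {
  createAssignment(planningStore, "a2", "r2", ""); // fails after the row write
} catch {
  rolledBack = !planningStore.assignments.has("a2");
}
```

The second call fails after the domain row was written inside the transaction, yet the store still holds only `a1`: no row without its audit entry and event, and no event without its row.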

---

## Application Layer Design

Every important user action should map to a use case, for example:

- `CreateProject`
- `DefineDemand`
- `AssignResource`
- `MoveAssignment`
- `ApproveVacation`
- `ImportSkillMatrix`
- `RecomputeValueScore`
- `GenerateAiSummary`

Each use case should:
- load aggregates via repositories
- call pure domain policies
- persist through a transaction
- publish outbox events

Routers then become simple wrappers:
- validate input
- call use case
- map result to DTO

This is the main architectural upgrade missing today.
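
A minimal sketch of the split between a use case and a thin transport wrapper, using `AssignResource` as the example. The ports in `AssignResourceDeps`, the status codes, and the synchronous signatures are simplifying assumptions; a real version would be async, run inside a transaction, and sit behind a tRPC procedure.

```typescript
interface AssignResourceInput {
  demandRequirementId: string;
  resourceId: string;
  hoursPerDay: number;
}

// Ports the use case depends on; real versions would be backed by repositories.
interface AssignResourceDeps {
  freeHoursPerDay(resourceId: string): number;
  saveAssignment(input: AssignResourceInput): string; // returns the new assignment id
  publish(eventType: string, payload: unknown): void; // writes to the outbox
}

type AssignResourceResult =
  | { kind: "ok"; assignmentId: string }
  | { kind: "rejected"; error: string };

// Use case: load state, apply the policy, persist, publish.
function assignResource(deps: AssignResourceDeps, input: AssignResourceInput): AssignResourceResult {
  if (input.hoursPerDay <= 0) return { kind: "rejected", error: "hoursPerDay must be positive" };
  if (input.hoursPerDay > deps.freeHoursPerDay(input.resourceId)) {
    return { kind: "rejected", error: "resource is over capacity" };
  }
  const assignmentId = deps.saveAssignment(input);
  deps.publish("AssignmentCreated", { assignmentId, resourceId: input.resourceId });
  return { kind: "ok", assignmentId };
}

function isAssignResourceInput(raw: unknown): raw is AssignResourceInput {
  if (typeof raw !== "object" || raw === null) return false;
  const r = raw as Record<string, unknown>;
  return (
    typeof r["demandRequirementId"] === "string" &&
    typeof r["resourceId"] === "string" &&
    typeof r["hoursPerDay"] === "number"
  );
}

// Thin transport wrapper: validate input, call the use case, map the result to a DTO.
function assignResourceHandler(deps: AssignResourceDeps, raw: unknown): { status: number; body: unknown } {
  if (!isAssignResourceInput(raw)) return { status: 400, body: { error: "invalid input" } };
  const result = assignResource(deps, raw);
  return result.kind === "ok"
    ? { status: 200, body: { assignmentId: result.assignmentId } }
    : { status: 409, body: { error: result.error } };
}
```

Because the use case only sees its ports, it can be unit-tested with in-memory fakes, without a database or HTTP layer.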

---

## Query Side Design

V2 should use a **CQRS-lite** pattern:

- commands go through application services
- heavy timeline/dashboard/staffing reads use query services or read models

Examples:
- `timeline_read_model`
- `resource_capacity_snapshot`
- `project_budget_snapshot`
- `staffing_candidate_snapshot`

These can start as SQL views/materialized views or dedicated query handlers. No need for a separate read database yet.

This is especially important because the timeline and dashboards are read-heavy and aggregate-heavy.

---

## Parallel Runtime Agents

These are the v2 agents I would actually build. They should run as worker processes consuming outbox events and job records.
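
One way a worker could drain the outbox is sketched below. The `PendingEvent` shape, the handler registry, and the retry bookkeeping are assumptions; a real worker would poll the `DomainEventOutbox` table and persist these fields.

```typescript
interface PendingEvent {
  id: number;
  type: string;
  processed: boolean;
  attempts: number;
}

type AgentHandler = (event: PendingEvent) => void;

// Dispatch each unprocessed event to the handlers registered for its type.
// An event is marked processed only when every handler succeeded; otherwise
// it stays pending and is retried on the next poll.
function drainOutbox(events: PendingEvent[], handlers: Record<string, AgentHandler[]>): number {
  let done = 0;
  for (const event of events) {
    if (event.processed) continue;
    event.attempts += 1;
    try {
      for (const handle of handlers[event.type] ?? []) handle(event);
      event.processed = true;
      done += 1;
    } catch {
      // Leave the event unprocessed so the next poll retries it.
    }
  }
  return done;
}
```

This gives at-least-once delivery: if a later handler fails, earlier handlers for the same event run again on retry, so agent handlers need to be idempotent.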

## 1. Match Agent

Input:
- `DemandRequirementCreated`
- `DemandRequirementChanged`
- `ResourceSkillChanged`
- `CalendarChanged`

Output:
- ranked candidate snapshots
- recommendation explanations

Responsibility:
- candidate filtering
- deterministic scoring
- optional AI explanation layer after deterministic ranking
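
A sketch of what deterministic scoring with built-in explanations could look like. The 70/30 weighting, the 0-100 scale, and all type names are illustrative assumptions, not tuned values or existing code.

```typescript
interface DemandProfile { requiredSkills: readonly string[]; hoursPerDay: number; }
interface StaffingCandidate { resourceId: string; skills: readonly string[]; freeHoursPerDay: number; }
interface RankedCandidate { resourceId: string; score: number; reasons: string[]; }

function compareIds(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Deterministic 0-100 score: 70 points for skill coverage, 30 for free capacity.
function rankCandidates(demand: DemandProfile, candidates: readonly StaffingCandidate[]): RankedCandidate[] {
  const ranked = candidates.map((c): RankedCandidate => {
    const matched = demand.requiredSkills.filter((s) => c.skills.includes(s));
    const skillRatio = demand.requiredSkills.length === 0 ? 1 : matched.length / demand.requiredSkills.length;
    const capacityRatio =
      demand.hoursPerDay <= 0 ? 1 : Math.max(0, Math.min(1, c.freeHoursPerDay / demand.hoursPerDay));
    return {
      resourceId: c.resourceId,
      score: Math.round(70 * skillRatio + 30 * capacityRatio),
      reasons: [
        `skills matched: ${matched.length}/${demand.requiredSkills.length}`,
        `free hours: ${c.freeHoursPerDay}/${demand.hoursPerDay} per day`,
      ],
    };
  });
  // Highest score first; ties are broken by id so the ranking is stable across runs.
  return ranked.sort((a, b) => b.score - a.score || compareIds(a.resourceId, b.resourceId));
}

const ranking = rankCandidates(
  { requiredSkills: ["react", "typescript"], hoursPerDay: 8 },
  [
    { resourceId: "r1", skills: ["react"], freeHoursPerDay: 8 },
    { resourceId: "r2", skills: ["react", "typescript"], freeHoursPerDay: 4 },
    { resourceId: "r3", skills: ["react", "typescript"], freeHoursPerDay: 8 },
  ],
);
```

The `reasons` array is what an optional AI layer would later rephrase; the order itself never depends on a model call.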

## 2. Conflict Agent

Input:
- `AssignmentCreated`
- `AssignmentChanged`
- `VacationApproved`
- `CalendarExceptionChanged`

Output:
- overallocation/conflict records
- blocked-demand warnings

Responsibility:
- recompute exact day-level conflicts
- explain why a conflict exists
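
Day-level conflict detection might be sketched as follows. The shapes and the `capacityFor` callback are assumptions; in the proposed design the callback would be backed by the normalized calendar tables.

```typescript
interface BookedAssignment { id: string; resourceId: string; start: string; end: string; hoursPerDay: number; }

interface DayConflict {
  resourceId: string;
  date: string;
  bookedHours: number;
  capacityHours: number;
  assignmentIds: string[]; // the "why": which assignments collide on this day
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Expand each assignment into days, sum booked hours per resource and day,
// and report the days where bookings exceed capacity.
function findDayConflicts(
  assignments: readonly BookedAssignment[],
  capacityFor: (resourceId: string, date: string) => number,
): DayConflict[] {
  const byDay = new Map<string, DayConflict>();
  for (const a of assignments) {
    const end = Date.parse(`${a.end}T00:00:00Z`);
    for (let t = Date.parse(`${a.start}T00:00:00Z`); t <= end; t += DAY_MS) {
      const date = new Date(t).toISOString().slice(0, 10);
      const key = `${a.resourceId}|${date}`;
      const day: DayConflict = byDay.get(key) ?? {
        resourceId: a.resourceId,
        date,
        bookedHours: 0,
        capacityHours: capacityFor(a.resourceId, date),
        assignmentIds: [],
      };
      day.bookedHours += a.hoursPerDay;
      day.assignmentIds.push(a.id);
      byDay.set(key, day);
    }
  }
  return Array.from(byDay.values())
    .filter((d) => d.bookedHours > d.capacityHours)
    .sort((x, y) => (`${x.resourceId}|${x.date}` < `${y.resourceId}|${y.date}` ? -1 : 1));
}
```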

## 3. Budget Risk Agent

Input:
- assignment changes
- project budget changes
- project date changes

Output:
- burn snapshots
- over-budget warnings
- forecast deltas

Responsibility:
- separate financial forecasting from request/response latency
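
A burn snapshot could be as small as the sketch below. It keeps money in integer cents, as the codebase already does; the field names and the precomputed `workingDays` input are assumptions.

```typescript
interface CostedAssignment {
  hoursPerDay: number;
  workingDays: number; // assumed to be precomputed from the calendar
  rateCentsPerHour: number;
}

interface BurnSnapshot {
  budgetCents: number;
  committedCents: number;
  remainingCents: number;
  overBudget: boolean;
}

// Sum committed cost in integer cents and compare it against the budget.
function burnSnapshot(budgetCents: number, assignments: readonly CostedAssignment[]): BurnSnapshot {
  const committedCents = assignments.reduce(
    (sum, a) => sum + Math.round(a.hoursPerDay * a.workingDays * a.rateCentsPerHour),
    0,
  );
  const remainingCents = budgetCents - committedCents;
  return { budgetCents, committedCents, remainingCents, overBudget: remainingCents < 0 };
}
```

For example, a budget of 1,000,000 cents with assignments of 8h x 10 days at 7,500 cents/h and 4h x 20 days at 6,000 cents/h commits 1,080,000 cents, leaving the project 80,000 cents over budget.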

## 4. Notification Agent

Input:
- all user-visible domain events

Output:
- in-app notifications
- email sends
- digest batches

Responsibility:
- centralize fan-out
- remove notification logic from feature routers

## 5. Import Agent

Input:
- uploaded Excel/CSV/HRIS files

Output:
- staged import rows
- validation results
- normalized upserts

Responsibility:
- make imports resumable and auditable

## 6. AI Agent

Input:
- explicit AI jobs only

Output:
- summaries
- staffing rationale
- project risk narratives

Responsibility:
- all model interaction happens asynchronously
- stores prompt/result metadata for traceability

Important rule: **AI never becomes the system of record.** It annotates deterministic outputs.

---

## Parallel Build Workstreams

If you want to execute v2 with parallel coding agents, use these lanes to avoid file collisions.

## Agent A: Core Model Refactor

Owns:
- `packages/db`
- `packages/shared`
- new domain packages

Tasks:
- introduce `DemandRequirement`
- introduce normalized skill/calendar models
- add outbox and job tables
- define new shared DTOs/events

## Agent B: Application Service Extraction

Owns:
- `packages/application` (new package)
- router-to-service extraction in `packages/api`

Tasks:
- move create/update/fill/approve workflows out of routers
- standardize transaction boundaries
- standardize audit + outbox emission

## Agent C: Timeline V2

Owns:
- `apps/web/src/components/timeline/*`
- timeline read models and UI contracts

Tasks:
- break `TimelineView` into screen shell + view model + row renderers
- move timeline state machine into dedicated hooks/store
- consume new query DTOs instead of raw Prisma-shaped payloads

## Agent D: Project Creation And Staffing UX

Owns:
- `apps/web/src/components/projects/*`
- staffing query DTO consumers

Tasks:
- split `ProjectWizard`
- convert wizard from local mega-state to step reducers / use cases
- integrate recommendation snapshots from Match Agent

## Agent E: Security, Platform, And Notifications

Owns:
- auth
- user management
- settings
- notification workflows

Tasks:
- unify password hashing
- close permission gaps
- move secret handling behind infrastructure services
- wire Notification Agent

This split keeps most workstreams independent.

---

## Migration Plan

## Phase 0: Stabilize The Current System

Do this before any architecture refactor:

1. Fix user creation to use Argon2.
2. Restrict `notification.create` to admin/system workflows.
3. Fix `testAiConnection` to truly support both providers.
4. Make `pnpm test:unit` and `pnpm typecheck` green again.
5. Remove remaining legacy `role`/`roleId` ambiguity where possible.

## Phase 1: Extract The Application Layer

Without changing the UI yet:
- add use-case services
- move router logic into them
- introduce outbox writes
- standardize domain events

This phase creates the seam for the rest of v2.

## Phase 2: Introduce New Core Tables With Dual Write

- create `DemandRequirement`, normalized skills, normalized calendar tables
- dual-write from old flows
- build migration scripts and backfills
- add compatibility query adapters

## Phase 3: Rebuild The Timeline And Wizard Against New Read Models

- timeline consumes query DTOs
- wizard consumes demand/assignment APIs
- staffing suggestions come from snapshots, not direct all-resource scans

## Phase 4: Turn On Parallel Agents

- Match Agent
- Conflict Agent
- Budget Risk Agent
- Notification Agent
- Import Agent
- AI Agent

## Phase 5: Optional Service Extraction

Only after the domain seams hold:
- extract workers into separate deployables if load justifies it
- keep the transactional core close to the DB

---

## Recommended Immediate Improvement Backlog

If I had to choose the highest-leverage next moves:

1. Fix auth, notification permissions, AI test path, and broken repo checks.
2. Create `packages/application` and move allocation/timeline/project workflows into it.
3. Introduce `DemandRequirement` and stop using placeholder allocations as a dual-purpose model.
4. Rebuild staffing suggestions around normalized skills + calendar-aware capacity.
5. Split timeline and project wizard around view-model boundaries, not just JSX extraction.

---

## Bottom Line

**V2 should not be “more features on the current shape.”**
It should be:

- a cleaner domain model
- a thinner API layer
- async agents for expensive side effects
- read models for planning screens
- normalized planning entities with JSONB reserved for extension points

That will make Planarchy better at the thing it claims to be: a planning system, not just a CRUD app with a timeline.