CapaKraken V2 Architecture Proposal

Date: 2026-03-11
Scope: Codebase review, v2 direction, architecture rethink, parallel agent strategy

Executive Summary

CapaKraken already has a good base:

  • monorepo boundaries are mostly clean
  • engine and staffing contain useful pure domain logic
  • Next.js + tRPC + Prisma keeps product iteration fast
  • Redis-backed SSE is already a reasonable realtime baseline

The main issue is not the stack. The issue is that domain logic is split across:

  • large client components
  • large tRPC routers
  • JSONB-heavy persistence models
  • ad-hoc calculations in handlers

My recommendation for v2 is:

  1. Do not jump to microservices yet.
  2. Do move to a modular monolith with a real application layer and async workers.
  3. Split “planning demand” from “actual assignments” at the data model level.
  4. Keep JSONB only for extensibility, not for core planning workflows.
  5. Introduce event/outbox-driven parallel agents for matching, conflicts, budget risk, notifications, and AI work.

This gives you a v2 that is safer, easier to change, and still realistic for a small team.


What The Codebase Does Well

  • Domain packages are separated from the web app.
  • Shared types and schemas reduce transport mismatch.
  • Money is stored in integer cents.
  • The app stays operationally simple: one app, one DB, one Redis.
  • The timeline already has virtualization and SSE hooks, which means the product is past prototype stage.

Current Pain Points

1. Critical correctness and security issues exist today

  • Auth hashing is inconsistent
  • Notification creation is open to any authenticated user
  • AI connection testing is Azure-shaped even when the provider is OpenAI
  • Repo health checks are currently failing

These are not “v2 someday” items. They should be fixed before deeper refactoring.

2. Large surfaces are carrying too much responsibility

The biggest modules are already a warning sign.

That usually means:

  • transport, orchestration, validation, business rules, and data access are mixed
  • testing becomes expensive
  • one change touches too many concerns

3. The core planning model is overloaded

The Prisma schema uses JSONB heavily in core workflows.

The bigger modeling problem is that Allocation currently represents both demand and assignment:

  • placeholder demand is modeled with resourceId = null
  • headcount is stored on the same entity
  • legacy role text and roleId coexist

This is the wrong aggregate for v2.

4. Staffing logic is not yet trustworthy enough to become a differentiator

staffing.getSuggestions currently relies on direct all-resource scans and ad-hoc scoring in the request path.

That means the suggestion layer is:

  • hard to scale
  • not consistent with calendar-aware engine logic
  • not a strong base for “AI-assisted staffing”

5. Routers are doing application-service work

Representative examples appear throughout the larger routers.

The pure engine package exists, but the application layer that should orchestrate it does not.


Core Decision

V2 should be a modular monolith plus worker processes, not a microservice split.

Why:

  • the product is still changing fast
  • most failures are domain modeling and module-boundary problems, not network topology problems
  • a microservice split would increase operational cost before domain seams are stable

Target shape

apps/web
  -> UI + route handlers only

packages/api
  -> transport adapters only (tRPC procedures, auth boundary, DTO mapping)

packages/application
  -> use cases / command handlers / query handlers

packages/domain-people
packages/domain-projects
packages/domain-demand
packages/domain-scheduling
packages/domain-calendar
packages/domain-notifications
packages/domain-ai
  -> pure domain logic and policies

packages/infrastructure
  -> Prisma repos, Redis pub/sub, job queue, mail, AI clients

workers/agents
  -> async processors consuming outbox events and jobs

The key change is: routers stop containing business workflows. They become thin.


Data Model Changes For V2

1. Split demand from assignment

Replace the current overloaded Allocation concept with:

  • DemandRequirement

    • projectId
    • roleId
    • requiredSkills
    • date range
    • hoursPerDay
    • headcount
    • priority
    • status
  • Assignment

    • demandRequirementId nullable during migration
    • resourceId
    • projectId
    • date range
    • hoursPerDay
    • cost snapshot
    • status
  • AssignmentChange or AssignmentRevision

    • audit-friendly timeline history
    • supports undo/redo and reasoning

This removes:

  • nullable resource meaning two different business states
  • headcount logic from real assignments
  • placeholder branching across the whole codebase
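
The split above can be sketched as TypeScript types. Field names follow the proposal; the exact shapes (id types, status enums) and the `remainingHeadcount` helper are illustrative assumptions, not the final schema:

```typescript
type DateRange = { start: string; end: string }; // ISO dates

type DemandStatus = "open" | "partially_filled" | "filled" | "cancelled";

interface DemandRequirement {
  id: string;
  projectId: string;
  roleId: string;
  requiredSkills: string[]; // skill ids once the skill model is normalized
  range: DateRange;
  hoursPerDay: number;
  headcount: number; // headcount lives on demand only
  priority: number;
  status: DemandStatus;
}

interface Assignment {
  id: string;
  demandRequirementId: string | null; // nullable only during migration
  resourceId: string; // never null: no placeholder state on assignments
  projectId: string;
  range: DateRange;
  hoursPerDay: number;
  costPerHourCents: number; // cost snapshot, integer cents as today
  status: "planned" | "confirmed" | "cancelled";
}

// With headcount on the demand side, "how many seats are still open?"
// becomes a pure function instead of placeholder-allocation branching.
function remainingHeadcount(
  demand: DemandRequirement,
  assignments: Assignment[]
): number {
  const filled = assignments.filter(
    (a) => a.demandRequirementId === demand.id && a.status !== "cancelled"
  ).length;
  return Math.max(0, demand.headcount - filled);
}
```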

2. Normalize the skill model

Today Resource.skills is JSONB. For v2, use:

  • Skill
  • ResourceSkill
  • optional RoleSkillProfile

Keep JSONB only for imported raw skill matrix payloads if needed.

Benefits:

  • real filtering
  • better analytics
  • reusable recommendation features
  • explainable ranking
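
A minimal sketch of what "real filtering" buys, assuming the three entities above; the `level`/`minLevel` fields and the matcher are assumptions for illustration:

```typescript
interface Skill { id: string; name: string; }
interface ResourceSkill { resourceId: string; skillId: string; level: number; } // e.g. 1..5
interface RoleSkillProfile { roleId: string; skillId: string; minLevel: number; }

// With normalized rows instead of JSONB blobs, "who meets every required
// skill at the minimum level?" is a straightforward relational query.
function matchingResources(
  required: RoleSkillProfile[],
  resourceSkills: ResourceSkill[]
): string[] {
  const byResource = new Map<string, Map<string, number>>();
  for (const rs of resourceSkills) {
    const skills = byResource.get(rs.resourceId) ?? new Map<string, number>();
    skills.set(rs.skillId, rs.level);
    byResource.set(rs.resourceId, skills);
  }
  return [...byResource.entries()]
    .filter(([, skills]) =>
      required.every((r) => (skills.get(r.skillId) ?? 0) >= r.minLevel)
    )
    .map(([resourceId]) => resourceId);
}
```

In production this would be a SQL join on ResourceSkill rather than an in-memory scan, but the shape of the question stays the same.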

3. Normalize calendar capacity

Today availability is template-like JSON plus vacation overlays. For v2:

  • AvailabilityTemplate
  • ResourceAvailabilityOverride
  • CalendarException
  • PublicHolidayCalendar

This lets the engine answer:

  • “what is capacity on this exact date?”
  • “why is this person unavailable?”
  • “what changed after a vacation approval?”
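
The resolution order the engine needs can be sketched as a pure function, assuming exceptions beat overrides, which beat the template; the entity shapes are assumptions:

```typescript
interface AvailabilityTemplate { hoursPerWeekday: number[]; } // index 0 = Sunday
interface ResourceAvailabilityOverride { date: string; hours: number; reason: string; }
interface CalendarException { date: string; reason: string; } // holiday/vacation: 0 hours

// Answers both "what is capacity on this exact date?" and
// "why is this person unavailable?" in one call.
function capacityOn(
  date: string,
  template: AvailabilityTemplate,
  overrides: ResourceAvailabilityOverride[],
  exceptions: CalendarException[]
): { hours: number; reason: string } {
  const exception = exceptions.find((e) => e.date === date);
  if (exception) return { hours: 0, reason: exception.reason };
  const override = overrides.find((o) => o.date === date);
  if (override) return { hours: override.hours, reason: override.reason };
  const weekday = new Date(date + "T00:00:00Z").getUTCDay();
  return { hours: template.hoursPerWeekday[weekday], reason: "template" };
}
```

Because the inputs are plain rows, "what changed after a vacation approval?" is a diff of `capacityOn` results before and after the new CalendarException.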

4. Keep blueprints, but narrow their role

Blueprints should remain for:

  • custom fields
  • UI configuration
  • optional default demand templates

Blueprints should stop carrying core planning state in JSONB.

5. Add an outbox

Introduce:

  • DomainEventOutbox
  • Job

Every important mutation writes:

  • domain row changes
  • audit row
  • outbox event

in one transaction.

That is the foundation for safe parallel agents.
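
The "one transaction" rule can be sketched against a hypothetical transaction port; in production this port would wrap prisma.$transaction, and the table names here follow the proposal, not an existing schema:

```typescript
interface DomainEvent { type: string; payload: unknown; occurredAt: string; }

// Hypothetical transaction port; every insert below commits or rolls back together.
interface Tx {
  insert(
    table: "assignment" | "audit_log" | "domain_event_outbox",
    row: unknown
  ): void;
}

function assignResourceTx(tx: Tx, assignment: { id: string; resourceId: string }): void {
  tx.insert("assignment", assignment); // domain row change
  tx.insert("audit_log", { action: "AssignResource", entityId: assignment.id }); // audit row
  const event: DomainEvent = {
    type: "AssignmentCreated",
    payload: assignment,
    occurredAt: new Date().toISOString(),
  };
  tx.insert("domain_event_outbox", event); // outbox event, same transaction
}
```

Workers then poll DomainEventOutbox and mark rows processed, so agents never observe a mutation that did not commit.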


Application Layer Design

Every important user action should map to a use case, for example:

  • CreateProject
  • DefineDemand
  • AssignResource
  • MoveAssignment
  • ApproveVacation
  • ImportSkillMatrix
  • RecomputeValueScore
  • GenerateAiSummary

Each use case should:

  • load aggregates via repositories
  • call pure domain policies
  • persist through a transaction
  • publish outbox events

Routers then become simple wrappers:

  • validate input
  • call use case
  • map result to DTO

This is the main architectural upgrade missing today.
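
The use-case shape can be sketched for AssignResource. The ports, DTOs, and the id scheme are illustrative assumptions; the point is the division of labor, with the tRPC procedure reduced to validate input, call `execute`, and map the result:

```typescript
interface AssignResourceCommand { demandId: string; resourceId: string; }

// Ports the use case depends on; implemented in packages/infrastructure.
interface AssignmentRepo {
  isAvailable(resourceId: string, demandId: string): boolean;
  save(assignment: AssignResourceCommand & { id: string }): void;
}
interface EventPublisher {
  publish(type: string, payload: unknown): void;
}

class AssignResourceUseCase {
  constructor(
    private repo: AssignmentRepo,
    private events: EventPublisher
  ) {}

  execute(cmd: AssignResourceCommand): { id: string } {
    // Pure domain policy check before any persistence.
    if (!this.repo.isAvailable(cmd.resourceId, cmd.demandId)) {
      throw new Error("resource not available for this demand");
    }
    const id = `asg_${cmd.demandId}_${cmd.resourceId}`; // illustrative id scheme
    this.repo.save({ ...cmd, id }); // persist via repository, inside a transaction
    this.events.publish("AssignmentCreated", { id }); // outbox event
    return { id };
  }
}
```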


Query Side Design

V2 should use a CQRS-lite pattern:

  • commands go through application services
  • heavy timeline/dashboard/staffing reads use query services or read models

Examples:

  • timeline_read_model
  • resource_capacity_snapshot
  • project_budget_snapshot
  • staffing_candidate_snapshot

These can start as SQL views/materialized views or dedicated query handlers. No need for a separate read database yet.

This is especially important because the timeline and dashboards are read-heavy and aggregate-heavy.
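
As a sketch of one read model, a resource_capacity_snapshot row can be derived from assignment rows off the request path; the row shapes are assumptions, and the same aggregation could equally live in a SQL materialized view:

```typescript
interface AssignmentRow { resourceId: string; date: string; hours: number; }
interface CapacitySnapshotRow { resourceId: string; date: string; assignedHours: number; }

// Aggregate assigned hours per resource per day, so timeline and
// dashboard reads never re-sum raw assignments at request time.
function buildCapacitySnapshot(assignments: AssignmentRow[]): CapacitySnapshotRow[] {
  const acc = new Map<string, CapacitySnapshotRow>();
  for (const a of assignments) {
    const key = `${a.resourceId}|${a.date}`;
    const row = acc.get(key) ?? { resourceId: a.resourceId, date: a.date, assignedHours: 0 };
    row.assignedHours += a.hours;
    acc.set(key, row);
  }
  return [...acc.values()];
}
```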


Parallel Runtime Agents

These are the v2 agents I would actually build. They should run as worker processes consuming outbox events and job records.

1. Match Agent

Input:

  • DemandRequirementCreated
  • DemandRequirementChanged
  • ResourceSkillChanged
  • CalendarChanged

Output:

  • ranked candidate snapshots
  • recommendation explanations

Responsibility:

  • candidate filtering
  • deterministic scoring
  • optional AI explanation layer after deterministic ranking
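
Deterministic scoring with a built-in explanation might look like the sketch below; the two inputs and the 70/30 weighting are illustrative assumptions, and any AI narrative would be layered on top of this ranking, never replace it:

```typescript
interface Candidate {
  resourceId: string;
  skillMatch: number; // 0..1, from the normalized skill model
  freeHours: number;  // from calendar-aware capacity
}
interface RankedCandidate { resourceId: string; score: number; explanation: string; }

function rankCandidates(candidates: Candidate[], hoursNeeded: number): RankedCandidate[] {
  return candidates
    .map((c) => {
      const availability = Math.min(1, c.freeHours / hoursNeeded);
      // Deterministic, reproducible score; weights are illustrative.
      const score = 0.7 * c.skillMatch + 0.3 * availability;
      return {
        resourceId: c.resourceId,
        score,
        explanation:
          `skills ${(c.skillMatch * 100).toFixed(0)}%, ` +
          `availability ${(availability * 100).toFixed(0)}%`,
      };
    })
    .sort((a, b) => b.score - a.score);
}
```

The explanation string is what makes the snapshot useful in the UI: the ranking can be defended without invoking a model at all.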

2. Conflict Agent

Input:

  • AssignmentCreated
  • AssignmentChanged
  • VacationApproved
  • CalendarExceptionChanged

Output:

  • overallocation/conflict records
  • blocked-demand warnings

Responsibility:

  • recompute exact day-level conflicts
  • explain why a conflict exists
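
The day-level recompute can be sketched as a pure function; the shapes are assumptions, and capacity per day would come from the calendar read model described earlier:

```typescript
interface DayLoad { date: string; hours: number; } // one assignment's load on one day
interface Conflict {
  date: string;
  assignedHours: number;
  capacityHours: number;
  reason: string;
}

// Sum load per day and flag exact dates where assignments exceed capacity,
// with a reason string so the UI can explain why the conflict exists.
function findConflicts(loads: DayLoad[], capacity: Map<string, number>): Conflict[] {
  const perDay = new Map<string, number>();
  for (const l of loads) perDay.set(l.date, (perDay.get(l.date) ?? 0) + l.hours);

  const conflicts: Conflict[] = [];
  for (const [date, assignedHours] of perDay) {
    const capacityHours = capacity.get(date) ?? 0;
    if (assignedHours > capacityHours) {
      conflicts.push({
        date,
        assignedHours,
        capacityHours,
        reason: `assigned ${assignedHours}h exceeds capacity ${capacityHours}h`,
      });
    }
  }
  return conflicts;
}
```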

3. Budget Risk Agent

Input:

  • assignment changes
  • project budget changes
  • project date changes

Output:

  • burn snapshots
  • over-budget warnings
  • forecast deltas

Responsibility:

  • separate financial forecasting from request/response latency

4. Notification Agent

Input:

  • all user-visible domain events

Output:

  • in-app notifications
  • email sends
  • digest batches

Responsibility:

  • centralize fan-out
  • remove notification logic from feature routers

5. Import Agent

Input:

  • uploaded Excel/CSV/HRIS files

Output:

  • staged import rows
  • validation results
  • normalized upserts

Responsibility:

  • make imports resumable and auditable

6. AI Agent

Input:

  • explicit AI jobs only

Output:

  • summaries
  • staffing rationale
  • project risk narratives

Responsibility:

  • all model interaction happens asynchronously
  • stores prompt/result metadata for traceability

Important rule: AI never becomes the system of record. It annotates deterministic outputs.


Parallel Build Workstreams

If you want to execute v2 with parallel coding agents, use these lanes to avoid file collisions.

Agent A: Core Model Refactor

Owns:

  • packages/db
  • packages/shared
  • new domain packages

Tasks:

  • introduce DemandRequirement
  • introduce normalized skill/calendar models
  • add outbox and job tables
  • define new shared DTOs/events

Agent B: Application Service Extraction

Owns:

  • packages/application new package
  • router-to-service extraction in packages/api

Tasks:

  • move create/update/fill/approve workflows out of routers
  • standardize transaction boundaries
  • standardize audit + outbox emission

Agent C: Timeline V2

Owns:

  • apps/web/src/components/timeline/*
  • timeline read models and UI contracts

Tasks:

  • break TimelineView into screen shell + view model + row renderers
  • move timeline state machine into dedicated hooks/store
  • consume new query DTOs instead of raw Prisma-shaped payloads

Agent D: Project Creation And Staffing UX

Owns:

  • apps/web/src/components/projects/*
  • staffing query DTO consumers

Tasks:

  • split ProjectWizard
  • convert wizard from local mega-state to step reducers / use cases
  • integrate recommendation snapshots from Match Agent

Agent E: Security, Platform, And Notifications

Owns:

  • auth
  • user management
  • settings
  • notification workflows

Tasks:

  • unify password hashing
  • close permission gaps
  • move secret handling behind infrastructure services
  • wire Notification Agent

This split keeps most workstreams independent.


Migration Plan

Phase 0: Stabilize The Current System

Do this before any architecture refactor:

  1. Fix user creation to use Argon2.
  2. Restrict notification.create to admin/system workflows.
  3. Fix testAiConnection to truly support both providers.
  4. Make pnpm test:unit and pnpm typecheck green again.
  5. Remove remaining legacy role/roleId ambiguity where possible.

Phase 1: Extract The Application Layer

Without changing the UI yet:

  • add use-case services
  • move router logic into them
  • introduce outbox writes
  • standardize domain events

This phase creates the seam for the rest of v2.

Phase 2: Introduce New Core Tables With Dual Write

  • create DemandRequirement, normalized skills, normalized calendar tables
  • dual-write from old flows
  • build migration scripts and backfills
  • add compatibility query adapters

Phase 3: Rebuild The Timeline And Wizard Against New Read Models

  • timeline consumes query DTOs
  • wizard consumes demand/assignment APIs
  • staffing suggestions come from snapshots, not direct all-resource scans

Phase 4: Turn On Parallel Agents

  • Match Agent
  • Conflict Agent
  • Budget Risk Agent
  • Notification Agent
  • Import Agent
  • AI Agent

Phase 5: Optional Service Extraction

Only after the domain seams hold:

  • extract workers into separate deployables if load justifies it
  • keep the transactional core close to the DB

If I had to choose the highest-leverage next moves:

  1. Fix auth, notification permissions, AI test path, and broken repo checks.
  2. Create packages/application and move allocation/timeline/project workflows into it.
  3. Introduce DemandRequirement and stop using placeholder allocations as a dual-purpose model.
  4. Rebuild staffing suggestions around normalized skills + calendar-aware capacity.
  5. Split timeline and project wizard around view-model boundaries, not just JSX extraction.

Bottom Line

V2 should not be “more features on the current shape.”
It should be:

  • a cleaner domain model
  • a thinner API layer
  • async agents for expensive side effects
  • read models for planning screens
  • normalized planning entities with JSONB reserved for extension points

That will make CapaKraken better at the thing it claims to be: a planning system, not just a CRUD app with a timeline.