ea31ed657c
Phase 1 of PLAN_REFACTOR.md — all four sub-tasks implemented:
1.1 PipelineLogger (backend/app/core/pipeline_logger.py)
- Structured step_start/step_done/step_error/step_progress API
- Publishes to Python logging AND Redis SSE via log_task_event
- Context manager `pl.step("name")` for auto-timing
1.2 RenderJobDocument (backend/app/domains/rendering/job_document.py)
- Pydantic JSONB schema: state machine + per-step records + timing
- begin_step/finish_step/fail_step/skip_step helpers
- Migration 048: adds render_job_doc JSONB column to order_lines
- OrderLine model updated with render_job_doc field
1.3 TenantContextMiddleware (backend/app/core/middleware.py)
- Decodes JWT, stores tenant_id + role in request.state
- get_db updated to auto-apply RLS SET LOCAL from request.state
- Registered in main.py (runs before every request)
- JWT now embeds tenant_id claim via create_access_token()
- Login endpoint passes tenant_id to token creation
1.4 ProcessStep Registry (backend/app/core/process_steps.py)
- StepName StrEnum with all 20 pipeline step names
- Single source of truth for log prefixes, DB records, UI labels
Also adds db_utils.py with set_tenant_sync() + get_sync_session()
for use inside Celery tasks (bypass-safe RLS helper).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Schaeffler Automat — Refactor Plan

> Document date: 2026-03-08
> Branch: refactor/v2
> Author: Architecture review via Claude Code

---

## Executive Summary

### Current State

Schaeffler Automat is a working Blender-based media production pipeline with:
- Domain-driven backend structure (partially migrated, many compat shims still present)
- 7 Docker services with GPU render-worker
- PostgreSQL with tenant_id columns + Row Level Security (RLS) enabled but inconsistently
  applied at the application layer
- Celery task queues with two workers (step_processing + thumbnail_rendering)
- WebSocket real-time events via Redis Pub/Sub
- React/Vite frontend with workflow editor (ReactFlow), media browser, notifications

### Core Problems

1. `step_tasks.py` is 1,170 lines — monolithic task file containing 8+ distinct pipeline steps
2. Tenant isolation is partial: RLS is defined in DB migration 036 but `set_tenant_context()`
   is not called consistently in every router; Celery tasks bypass RLS entirely
3. Pillow overlay code (green bar + model name label) is dead code — all renders use
   `transparent_bg=True` but the 55-line block still runs conditionally
4. STL workflow remnants: `stl_quality` setting, `VALID_STL_QUALITIES`, `stl_size_bytes` in
   render_log dicts still reference the old STL-based pipeline; the actual pipeline is GLB-only
5. Render job cancellation uses a synthetic task ID (`render-{line_id}`) that does not match
   actual Celery task IDs — making revoke() a no-op
6. The MATERIAL_PALETTE + palette fallback lives in `step_processor.py` — should be replaced
   with `SCHAEFFLER_059999_FailedMaterial` (magenta) per the project goals
7. Log messages are inconsistent: some use Python f-strings with no prefix, others use
   `[STEP_NAME]` markers; structured logging is not enforced
8. `render_order_line_task` in `step_tasks.py` duplicates most of
   `render_order_line_still_task` in `domains/rendering/tasks.py`
9. The blender_render.py Blender script is 853 lines with no sub-module structure
10. No GPU-first enforcement: `cycles_device` defaults to "auto" with no explicit fallback log

### Vision

A clean, modular pipeline where:
- Every step is a named `ProcessStep` with start/progress/done log events and DB audit trail
- Render jobs are tracked as structured JSON documents (job tickets) in the DB
- Tenant isolation is enforced at the dependency-injection layer, not ad-hoc per endpoint
- Dead code (Pillow overlays, STL workflow, Flamenco shims, threejs renderer) is deleted
- The auth hierarchy supports GlobalAdmin > TenantAdmin > ProjectManager > Client
- Workers scale dynamically without service restarts
- Notifications are batched summaries, not per-render noise

---

## Architecture Overview

### Current Architecture

```
┌─────────────┐    HTTP     ┌──────────────────────────────────────────┐
│  Frontend   │ ──────────> │ backend:8888 (FastAPI)                   │
│ React/Vite  │             │  ├─ domains/auth                         │
│    :5173    │ <─ WS ────  │  ├─ domains/orders                       │
└─────────────┘             │  ├─ domains/products                     │
                            │  ├─ domains/rendering                    │
                            │  ├─ domains/tenants                      │
                            │  └─ api/routers/ (compat shims)          │
                            └──────────┬───────────────────────────────┘
                                       │ Celery tasks via Redis broker
                     ┌─────────────────┼──────────────────┐
                     │                 │                  │
              ┌──────▼──────┐   ┌──────▼──────┐    ┌──────▼──────┐
              │   worker    │   │render-worker│    │    beat     │
              │  step_proc  │   │ thumbnail_  │    │  scheduler  │
              │  ai_valid   │   │  rendering  │    └─────────────┘
              │  concurr=8  │   │  concurr=1  │
              └─────────────┘   └──────▼──────┘
                                       │ subprocess
                                ┌──────▼──────┐
                                │   blender   │
                                │ /opt/blend  │
                                └─────────────┘

┌──────────────┐   ┌──────────┐   ┌──────────┐
│  PostgreSQL  │   │  Redis   │   │  MinIO   │
│    :5432     │   │  :6379   │   │  :9000   │
└──────────────┘   └──────────┘   └──────────┘
```

### Target Architecture (Post-Refactor)

```
┌─────────────────────────────────────────────────────────┐
│ Frontend React/Vite :5173                               │
│  ├─ WorkflowEditor (ReactFlow) — visual pipeline        │
│  ├─ MediaBrowser — server-side filtered + virtual scroll│
│  ├─ NotificationCenter — batched summaries only         │
│  └─ Admin — tooltips on every setting                   │
└────────────────────┬────────────────────────────────────┘
                     │ HTTP + WebSocket
┌────────────────────▼────────────────────────────────────┐
│ backend:8888 (FastAPI)                                  │
│  middleware: TenantContextMiddleware (injects RLS)      │
│  ├─ domains/auth (GlobalAdmin|TenantAdmin|PM|Client)    │
│  ├─ domains/pipeline (process step registry + dispatch) │
│  ├─ domains/rendering (render job documents, workflows) │
│  ├─ domains/products (CAD files, media assets)          │
│  ├─ domains/orders (order state machine)                │
│  ├─ domains/tenants (tenant management)                 │
│  └─ domains/billing (pricing, invoices)                 │
└────────────────────┬────────────────────────────────────┘
                     │ Celery canvas / chain / group
     ┌───────────────┼───────────────┐
     │               │               │
┌────▼────┐   ┌──────▼──────┐   ┌────▼────┐
│ worker  │   │render-worker│   │  beat   │
│ step_   │   │  concurr=1  │   │ sched.  │
│ process │   │ +Blender GPU│   │ recover │
│ concr=8 │   └──────▼──────┘   │ queues  │
└─────────┘          │          └─────────┘
          subprocess (SIGTERM → SIGKILL + cleanup)
                     │
              ┌──────▼──────┐
              │   blender   │ (GPU-first, explicit CPU-fallback log)
              └─────────────┘
```

---

## Phase 1: Foundation (Weeks 1–2)

Critical infrastructure that blocks everything else.

### 1.1 Structured Logging Framework

**Current state:**
Log messages are a mix of bare `logger.info(f"...")`, `emit(order_line_id, "...")`, and
`log_task_event(task_id, "...")`. No consistent prefix, no structured fields.

**Target:**
A `PipelineLogger` class that wraps Python's `logging` module and additionally writes
structured events to the DB (`audit_log` or a new `pipeline_events` table).

**Design:**
```python
# backend/app/core/pipeline_logger.py
class PipelineLogger:
    PREFIX_FORMAT = "[{step_name}]"

    def step_start(self, step: str, context: dict): ...
    def step_progress(self, step: str, pct: int, msg: str): ...
    def step_done(self, step: str, duration_s: float, result: dict): ...
    def step_error(self, step: str, error: str, exc: Exception | None): ...
```

Every log call emits:
- a Python `logging` line with `[STEP_NAME] message`
- a Redis `log_task_event` for SSE streaming
- an optional DB insert into `pipeline_events(task_id, step_name, level, message, duration_s, context JSONB, created_at)`

**Files to create:**
- `backend/app/core/pipeline_logger.py` — PipelineLogger class
- `backend/alembic/versions/048_pipeline_events.py` — new table migration

**Files to modify:**
- All task files to replace bare `logger.info/error` with `PipelineLogger` calls
- `backend/app/core/task_logs.py` — keep Redis SSE publish, add DB write path

### 1.2 Render Job Document

**Current state:**
`OrderLine.render_log` is a loosely structured JSONB dict. No schema, no state machine,
no step-level results stored.

**Target:**
A `RenderJobDocument` JSONB schema stored in `order_lines.render_job_doc`. Acts as the
single source of truth for a render job's state machine.

**Schema (JSONB):**
```json
{
  "version": 1,
  "job_id": "<order_line_id>",
  "created_at": "ISO8601",
  "state": "pending|queued|running|completed|failed|cancelled",
  "celery_task_id": "uuid",
  "steps": [
    {
      "name": "resolve_step_path",
      "status": "done",
      "started_at": "ISO8601",
      "completed_at": "ISO8601",
      "duration_s": 0.02,
      "output": {"step_path": "/app/uploads/..."}
    },
    {
      "name": "occ_glb_export",
      "status": "done",
      "duration_s": 8.4,
      "output": {"glb_path": "...", "size_bytes": 204800}
    },
    {
      "name": "blender_render",
      "status": "running",
      "started_at": "ISO8601",
      "gpu_type": "OPTIX",
      "engine": "cycles",
      "samples": 256
    }
  ],
  "error": null,
  "result": {
    "output_path": "...",
    "duration_s": 34.2,
    "engine_used": "cycles",
    "gpu": "RTX 3090"
  }
}
```

**Migration:**
- `backend/alembic/versions/049_render_job_document.py` — add `render_job_doc JSONB` to `order_lines`; keep `render_log` for backward compat (deprecate, remove in Phase 3)

**Files to create:**
- `backend/app/domains/rendering/job_document.py` — `RenderJobDocument` Pydantic model + helpers (`update_step`, `set_state`, `append_error`)

### 1.3 Tenant Context Middleware

**Current state:**
`set_tenant_context()` must be called manually in each endpoint. Celery tasks bypass RLS
entirely (they use sync engines without `SET LOCAL app.current_tenant_id`).

**Problem:**
Migration 036 enables RLS, but `build_tenant_db_dep()` in `database.py` actually yields
`db` without setting the tenant context (line 92: `yield db  # context-setting happens
via set_tenant_context when needed`). This means most endpoints silently bypass RLS.

**Target:**
A FastAPI middleware `TenantContextMiddleware` that automatically sets the RLS context for
every request based on the JWT `tenant_id` claim.

```python
# backend/app/core/middleware.py
class TenantContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # Extract JWT, decode tenant_id
        # Store in request.state.tenant_id
        # After DB session is acquired, SET LOCAL app.current_tenant_id
        ...
```

**JWT changes:**
`create_access_token()` must embed `tenant_id` in its claims:
```python
payload = {"sub": user_id, "role": role, "tenant_id": str(tenant_id), "exp": expires}
```

**Celery tasks:**
All sync DB sessions in Celery tasks must receive `tenant_id` as a task argument and
execute `session.execute(text("SET LOCAL app.current_tenant_id = :tid"), {"tid": tenant_id})`
immediately after session creation. Add a `_set_tenant(session, tenant_id)` helper in
`backend/app/core/db_utils.py`.

**Files to create:**
- `backend/app/core/middleware.py` — TenantContextMiddleware
- `backend/app/core/db_utils.py` — `_set_tenant(session, tenant_id)`

**Files to modify:**
- `backend/app/main.py` — add middleware
- `backend/app/utils/auth.py` — embed tenant_id in JWT
- All Celery task functions — accept a `tenant_id: str | None` parameter and call `_set_tenant`

### 1.4 Process Step Registry

**Current state:**
Pipeline steps are implicit — scattered across `step_tasks.py`, `rendering/tasks.py`,
`step_processor.py`, `render_blender.py`. No central definition.

**Target:**
A `ProcessStep` enum and registry that all tasks reference by name.

```python
# backend/app/domains/pipeline/steps.py
class ProcessStep(str, enum.Enum):
    UPLOAD_STEP = "upload_step"
    PARSE_EXCEL = "parse_excel"
    EXTRACT_METADATA = "extract_metadata"
    OCC_GLB_EXPORT = "occ_glb_export"
    RENDER_THUMBNAIL = "render_thumbnail"
    RENDER_STILL = "render_still"
    RENDER_TURNTABLE = "render_turntable"
    EXPORT_GLB = "export_glb"
    EXPORT_BLEND = "export_blend"
    DELIVER = "deliver"
```

Each step maps to exactly one Celery task and one workflow node type. This enum becomes
the contract between the visual workflow editor and the task executor.
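
Per the project notes, the registry is also the single source of truth for log prefixes, DB records, and UI labels. One way to attach that metadata to the enum (the two entries and the label strings shown here are illustrative):

```python
import enum


class ProcessStep(str, enum.Enum):
    # Abbreviated: two entries shown; the full enum lists every pipeline step.
    OCC_GLB_EXPORT = "occ_glb_export"
    RENDER_STILL = "render_still"


# Per-step metadata keyed by the enum, so tasks, logs, and the UI all agree.
STEP_META: dict[ProcessStep, dict] = {
    ProcessStep.OCC_GLB_EXPORT: {"label": "GLB Export", "log_prefix": "[OCC_GLB_EXPORT]"},
    ProcessStep.RENDER_STILL: {"label": "Still Render", "log_prefix": "[RENDER_STILL]"},
}


def log_prefix(step: ProcessStep) -> str:
    """Resolve the canonical log prefix for a step."""
    return STEP_META[step]["log_prefix"]
```

Because `ProcessStep` is a `str` enum, workflow-editor node types serialized as plain strings round-trip through `ProcessStep("occ_glb_export")` without a separate lookup table.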

---

## Phase 2: Pipeline Modularity (Weeks 3–4)

Break up `step_tasks.py` (1,170 lines). One file = one pipeline stage.

### 2.1 Decompose step_tasks.py

**Current functions and their new homes:**

| Current location | Function | Target file |
|---|---|---|
| `step_tasks.py` | `process_step_file` | `domains/pipeline/tasks/extract_metadata.py` |
| `step_tasks.py` | `render_step_thumbnail` | `domains/pipeline/tasks/render_thumbnail.py` |
| `step_tasks.py` | `generate_gltf_geometry_task` | `domains/pipeline/tasks/export_glb_geometry.py` |
| `step_tasks.py` | `generate_gltf_production_task` | `domains/pipeline/tasks/export_glb_production.py` |
| `step_tasks.py` | `regenerate_thumbnail` | `domains/pipeline/tasks/render_thumbnail.py` |
| `step_tasks.py` | `dispatch_order_line_render` | `domains/pipeline/tasks/dispatch.py` |
| `step_tasks.py` | `render_order_line_task` | **DELETE** (duplicate of `domains/rendering/tasks.render_order_line_still_task`) |
| `step_tasks.py` | `reextract_cad_metadata` | `domains/pipeline/tasks/extract_metadata.py` |
| `step_tasks.py` | `_auto_populate_materials_for_cad` | `domains/pipeline/tasks/auto_materials.py` |
| `step_tasks.py` | `_bbox_from_glb`, `_bbox_from_step_cadquery` | `domains/pipeline/tasks/bbox.py` |
| `rendering/tasks.py` | `render_order_line_still_task` | `domains/rendering/tasks/render_still.py` |
| `rendering/tasks.py` | `render_turntable_task` | `domains/rendering/tasks/render_turntable.py` |
| `rendering/tasks.py` | `export_gltf_for_order_line_task` | `domains/pipeline/tasks/export_glb_geometry.py` |
| `rendering/tasks.py` | `export_blend_for_order_line_task` | `domains/rendering/tasks/export_blend.py` |
| `rendering/tasks.py` | `publish_asset` | `domains/media/tasks.py` |

**`step_tasks.py` becomes a compatibility shim** (import-only, deprecated) until all
callers are updated. Remove it in Phase 3.

### 2.2 Render Job Document Integration

Every Celery task in the new structure:
1. Reads/creates the `RenderJobDocument` at task start
2. Updates the relevant step via `job_doc.update_step(step_name, status="running")`
3. On completion: `job_doc.update_step(step_name, status="done", duration_s=elapsed)`
4. On failure: `job_doc.set_state("failed")` + `job_doc.append_error(...)`
5. Writes the document back to `order_lines.render_job_doc`
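
The five steps above can be condensed into one task skeleton. `load_doc` and `save_doc` are hypothetical persistence helpers for `order_lines.render_job_doc`, injected here so the protocol is testable without a DB:

```python
import time


def run_step(order_line_id: str, step_name: str, work, load_doc, save_doc):
    """Skeleton of the per-task job-document protocol described above.

    `work` is the actual step implementation (a zero-argument callable);
    `load_doc`/`save_doc` are assumed helpers that read/write the JSONB
    document on order_lines.render_job_doc.
    """
    doc = load_doc(order_line_id)                      # 1. read/create document
    doc.update_step(step_name, status="running")       # 2. mark step running
    save_doc(order_line_id, doc)
    t0 = time.monotonic()
    try:
        result = work()
    except Exception as exc:
        doc.update_step(step_name, status="failed")
        doc.set_state("failed")                        # 4. failure path
        doc.append_error(str(exc))
        save_doc(order_line_id, doc)
        raise
    doc.update_step(step_name, status="done",
                    duration_s=time.monotonic() - t0)  # 3. completion
    save_doc(order_line_id, doc)                       # 5. write back
    return result
```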

### 2.3 Render Job Cancellation (Proper)

**Current problem:**
`celery_app.control.revoke("render-{line_id}", terminate=True)` — this ID is synthetic
and does not match the actual Celery task ID, so the revoke is a no-op. The Blender
process continues running.

**Solution:**
1. Store the actual Celery task ID in `render_job_doc.celery_task_id` when the task starts
2. The cancel endpoint reads `render_job_doc.celery_task_id` and revokes with that real ID
3. The render subprocess uses `start_new_session=True` (already done in `render_blender.py`)
   and stores `proc.pid` in the job document
4. On SIGTERM, the Celery task's signal handler calls `os.killpg(pgid, SIGTERM)`, waits 10s,
   then `os.killpg(pgid, SIGKILL)`
5. Clean up: remove the partial output file, remove the `_frames_*` temp directory
6. Update `render_job_doc.state = "cancelled"`, set `OrderLine.render_status = "cancelled"`

**Files to modify:**
- `backend/app/api/routers/orders.py` — read celery_task_id from the job doc, not the synthetic ID
- `backend/app/domains/rendering/tasks/render_still.py` — store task ID + PID in the job doc,
  register a SIGTERM handler
- `backend/app/domains/rendering/tasks/render_turntable.py` — same
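
Step 4 can be sketched as a standalone helper (POSIX-only, since it relies on `os.killpg`; it works because the render subprocess is started with `start_new_session=True` and is therefore the leader of its own process group):

```python
import os
import signal
import subprocess


def terminate_render_proc(proc: subprocess.Popen, grace_s: float = 10.0) -> int:
    """SIGTERM the subprocess's whole process group; escalate to SIGKILL.

    Killing the group (not just the pid) also takes down any children the
    Blender process spawned. Returns the process's exit code.
    """
    if proc.poll() is not None:
        return proc.returncode  # already exited
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # grace period expired: hard kill
        return proc.wait()
```

The Celery task's SIGTERM handler would call this, then perform the cleanup from step 5 before marking the job document `cancelled`.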

### 2.4 GPU-Primary Rendering

**Current state:**
`cycles_device` defaults to "auto". When the GPU is unavailable, Blender silently falls back
to CPU with no log message. The `_activate_gpu()` function in `blender_render.py` already
probes for a GPU, but the result is not reflected in the render job document.

**Target:**
- The `cycles_device` default changes from "auto" to "gpu" in system settings
- The `_activate_gpu()` result is logged with a `[GPU_PROBE]` prefix:
  - Success: `[GPU_PROBE] RTX 3090 activated (OPTIX) — using GPU render`
  - Failure: `[GPU_PROBE] No GPU found, falling back to CPU — set cycles_device=cpu to suppress this warning`
- GPU type and fallback reason are written to `render_job_doc.result.gpu_info`
- The Admin UI shows GPU status on the Settings page (already partially exists via worker activity)

**Files to modify:**
- `render-worker/scripts/blender_render.py` — enhance `_activate_gpu()` logging
- `backend/app/api/routers/admin.py` — change the default `cycles_device` to "gpu"
- `backend/app/domains/rendering/job_document.py` — add a `gpu_info` field to the result
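
The probe decision and the `[GPU_PROBE]` message can live in a pure function so they are testable outside Blender; the real `_activate_gpu()` would feed it the Cycles device list from `bpy` and apply the chosen backend. The backend priority order shown is an assumption:

```python
def gpu_probe_log(devices: list[tuple[str, str]]) -> tuple[str, str]:
    """Pick a render device from Blender's device list and build the log line.

    `devices` mirrors the Cycles preferences device list as (name, type)
    pairs, e.g. [("NVIDIA GeForce RTX 3090", "OPTIX"), ("CPU", "CPU")].
    Returns (chosen_device, log_message).
    """
    # Assumed preference order across Cycles GPU backends.
    for backend in ("OPTIX", "CUDA", "HIP", "METAL", "ONEAPI"):
        for name, dev_type in devices:
            if dev_type == backend:
                msg = f"[GPU_PROBE] {name} activated ({backend}) - using GPU render"
                return "gpu", msg
    return "cpu", ("[GPU_PROBE] No GPU found, falling back to CPU - "
                   "set cycles_device=cpu to suppress this warning")
```

Both the chosen device and the message would then be written into `render_job_doc.result.gpu_info` alongside the fallback reason.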

### 2.5 Blender Script Modularity

**Current state:**
`render-worker/scripts/blender_render.py` is 853 lines with everything inline.

**Target structure:**
```
render-worker/scripts/
├── blender_render.py      — entry point, arg parsing, top-level flow
├── _blender_gpu.py        — GPU probe + activation
├── _blender_import.py     — GLB import, rotation, smooth shading
├── _blender_materials.py  — material library application + fallback
├── _blender_camera.py     — auto camera from bbox, clip planes
├── _blender_scene.py      — scene setup (Mode A vs Mode B)
└── _blender_post.py       — (currently the Pillow overlay — DELETE THIS FILE)
```

`blender_render.py` imports from these sub-modules. Blender Python's `sys.path` is updated
at the top of the script to include the scripts directory.

---

## Phase 3: Code Deletion (Weeks 3–4, parallel with Phase 2)

### 3.1 Remove Pillow Overlay Code

**Location:** `render-worker/scripts/blender_render.py` lines 798–851

**Why it's dead:** `transparent_bg=True` is always passed for production renders, so the
`else:` branch at line 802 can never execute in production. The green Schaeffler bar is
now part of the `.blend` template, not post-processing.

**Delete:**
- Lines 798–851 in `blender_render.py` (the entire `if transparent_bg: ... else: try PIL...` block)
- Pillow from the render-worker dependencies in `render-worker/Dockerfile`
- The line `- Schaeffler green top bar + model name label via Pillow post-processing.`
  from the script docstring

### 3.2 Remove STL Workflow Remnants

**What to delete:**

| Location | What to remove |
|---|---|
| `backend/app/api/routers/admin.py` | `VALID_STL_QUALITIES`, `stl_quality` from `SettingsOut`, `SettingsUpdate`, and all `SETTINGS_DEFAULTS` |
| `backend/app/api/routers/admin.py` | `generate-missing-stls` endpoint (if still present) |
| `backend/app/api/routers/cad.py` | `generate-stl/{quality}` endpoint |
| `backend/app/services/render_blender.py` | `stl_quality` parameter from `render_still()` and `render_turntable_to_file()` |
| `backend/app/services/render_blender.py` | Key `stl_duration_s` → rename to `glb_duration_s` (remove the `# key kept for backward compat` comment) |
| `backend/app/tasks/step_tasks.py` | `generate_stl_cache` task (check whether it still exists) |
| `render-worker/scripts/` | Any `_import_stl`, `_convert_stl`, `_scale_mm_to_m` functions |
| `backend/app/api/routers/analytics.py` | `avg_stl_s` field in the analytics response |
| All render log dicts | Replace `stl_size_bytes: 0` and `stl_duration_s` with `glb_*` equivalents |
| DB migration | `backend/alembic/versions/050_cleanup_stl_settings.py` — `DELETE FROM system_settings WHERE key = 'stl_quality'` |

**Files to delete entirely:**
- `blender-renderer/` directory (already removed from docker-compose.yml; remove the directory)
- `threejs-renderer/` directory (migration 033 already removed it from services)
- `flamenco/` directory (migration 032 removed Flamenco; verify nothing still imports from it)

**Verify before deleting:**
```bash
grep -rn "blender-renderer\|threejs-renderer\|flamenco" backend/ frontend/ --include="*.py" --include="*.ts" --include="*.tsx"
```

### 3.3 Remove Compat Shims

After all callers are migrated, delete these shim files:
- `backend/app/models/user.py` (shim → `domains/auth/models.py`)
- `backend/app/models/cad_file.py` (shim → `domains/products/models.py`)
- `backend/app/services/render_dispatcher.py` (shim, 10 lines)
- `backend/app/services/material_service.py` (shim → `domains/materials/service.py`)
- `backend/app/services/render_blender.py` (move fully into `domains/rendering/`)
- `backend/app/models/` directory → all models are already in `domains/*/models.py`

### 3.4 Remove Duplicate render_order_line_task

`step_tasks.render_order_line_task` (lines 705–1050 of `step_tasks.py`) duplicates
`rendering/tasks.render_order_line_still_task`. The step_tasks version carries more
baggage (compat imports, `emit()` calls, stl_quality references). Delete the step_tasks
version and migrate all queue routes to the `rendering/tasks` version.

**Migration:**
- `celery_app.py` task routes: remove the `app.tasks.step_tasks.*` entries from the
  routing table once all tasks are migrated
- Update `CLAUDE.md` to reflect the new task locations

---

## Phase 4: Tenant & Auth (Weeks 5–6)

### 4.1 Role Hierarchy

**Current roles:** `admin | project_manager | client`

**Target roles:**
```python
class UserRole(str, enum.Enum):
    global_admin = "global_admin"        # platform operator, bypass RLS, all tenants
    tenant_admin = "tenant_admin"        # per-tenant admin, full control within tenant
    project_manager = "project_manager"  # order/render management within tenant
    client = "client"                    # read own orders, create draft orders
```

**Permission matrix:**

| Permission | GlobalAdmin | TenantAdmin | ProjectManager | Client |
|---|---|---|---|---|
| Manage tenants | YES | no | no | no |
| Manage users (all tenants) | YES | no | no | no |
| Manage users (own tenant) | YES | YES | no | no |
| All system settings | YES | YES | no | no |
| Trigger renders | YES | YES | YES | no |
| View all orders in tenant | YES | YES | YES | no |
| Create/view own orders | YES | YES | YES | YES |
| Reject orders | YES | YES | YES | no |
| Delete renders | YES | YES | YES | no |
| View analytics | YES | YES | YES | no |

**DB migration:**
- `backend/alembic/versions/051_role_hierarchy.py` — rename `admin` → `global_admin`,
  add `tenant_admin` to the `userrole` enum; backfill existing `admin` users to `global_admin`

**Auth utilities:**
- `require_global_admin()` — replaces `require_admin()`
- `require_tenant_admin_or_above()` — TenantAdmin or GlobalAdmin
- `require_pm_or_above()` — PM, TenantAdmin, GlobalAdmin
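
The `*_or_above` helpers fall out of an explicit role ranking; a dependency-free sketch (in the real FastAPI dependencies, the `PermissionError` would be an `HTTPException(403)`):

```python
import enum


class UserRole(str, enum.Enum):
    global_admin = "global_admin"
    tenant_admin = "tenant_admin"
    project_manager = "project_manager"
    client = "client"


# Higher number = more privilege; the hierarchy is encoded in one place.
_RANK = {
    UserRole.client: 0,
    UserRole.project_manager: 1,
    UserRole.tenant_admin: 2,
    UserRole.global_admin: 3,
}


def require_role_at_least(minimum: UserRole):
    """Return a checker that enforces `role >= minimum` in the hierarchy."""
    def check(role: UserRole) -> UserRole:
        if _RANK[role] < _RANK[minimum]:
            raise PermissionError(f"requires {minimum.value} or above")
        return role
    return check


require_pm_or_above = require_role_at_least(UserRole.project_manager)
require_tenant_admin_or_above = require_role_at_least(UserRole.tenant_admin)
require_global_admin = require_role_at_least(UserRole.global_admin)
```

New roles only need a `_RANK` entry; the three named helpers (and any future ones) stay one-liners.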

### 4.2 Tenant Isolation — Consistency Audit

**The problem:**
`database.py:build_tenant_db_dep()` yields the session without setting the RLS context
(the comment at line 92 says "context-setting happens via set_tenant_context when needed").
This means every endpoint that uses `Depends(get_db)` bypasses RLS.

**Fix — middleware approach (preferred):**

```python
# backend/app/core/middleware.py
class TenantContextMiddleware(BaseHTTPMiddleware):
    """Set PostgreSQL RLS context on every request from JWT claims."""

    BYPASS_PATHS = {"/health", "/api/auth/login", "/api/auth/refresh"}

    async def dispatch(self, request: Request, call_next):
        if request.url.path in self.BYPASS_PATHS:
            return await call_next(request)

        token = self._extract_token(request)
        if token:
            payload = decode_token_safe(token)
            tenant_id = payload.get("tenant_id")
            role = payload.get("role")
            request.state.tenant_id = tenant_id
            request.state.role = role

        response = await call_next(request)
        return response
```

The `get_db` dependency is modified to read `tenant_id` from `request.state`:

```python
async def get_db(request: Request) -> AsyncGenerator[AsyncSession, None]:
    async with AsyncSessionLocal() as session:
        tenant_id = getattr(request.state, "tenant_id", None)
        role = getattr(request.state, "role", None)
        if tenant_id:
            tid = "bypass" if role == "global_admin" else str(tenant_id)
            # SET LOCAL cannot take bind parameters on asyncpg;
            # set_config(..., true) is the transaction-scoped equivalent.
            await session.execute(
                text("SELECT set_config('app.current_tenant_id', :tid, true)"),
                {"tid": tid},
            )
        yield session
```

### 4.3 Tenant Isolation Strategy — Shared vs. Dedicated Containers

**Decision: shared containers with DB-level isolation (current model)**

**Analysis:**

| Factor | Shared containers | Dedicated containers per tenant |
|---|---|---|
| Cost | Low (6 containers total) | High (6 containers × N tenants) |
| Complexity | Low | Very high (orchestration, networking) |
| Data isolation | DB-level (RLS) | Full OS-level |
| GPU sharing | Single GPU shared | Dedicated GPU per tenant (expensive) |
| Blender jobs | Queue + concurrency control | Per-tenant render queue |
| Failure blast radius | All tenants affected by a worker crash | Isolated per tenant |
| Scaling | Celery autoscale | Docker Swarm / Kubernetes HPA |
| Migration effort | Weeks (Phases 3–4) | Months (new orchestration layer) |

**Recommendation:** Maintain shared containers with DB-level RLS isolation. Dedicated
containers are only justified if tenants have strict contractual data-isolation requirements
(e.g., GDPR-mandated separate processing). For the current internal use case (Schaeffler
internal teams), RLS + tenant_id partitioning is sufficient.

**If dedicated containers are required in the future:**
- Docker Compose override file per tenant (`docker-compose.{tenant-slug}.yml`)
- Each tenant gets its own PostgreSQL schema (not a separate DB) with schema-based routing
- Shared MinIO with per-tenant bucket policies
- Separate Redis database (0–15) per tenant (max 16 tenants)
- Celery routing: per-tenant queue prefix `{tenant_slug}.thumbnail_rendering`

### 4.4 Per-Tenant Feature Flags

Add a `tenant_config` JSONB column to the `tenants` table:

```python
# backend/alembic/versions/052_tenant_feature_flags.py
tenant_config JSONB DEFAULT '{
  "max_concurrent_renders": 3,
  "render_engines_allowed": ["cycles"],
  "max_order_size": 500,
  "fallback_material": "SCHAEFFLER_059999_FailedMaterial",
  "notifications_enabled": true,
  "invoice_prefix": "INV"
}'
```

Feature flags are checked at render dispatch time:
- `max_concurrent_renders` — enforced in Celery queue routing
- `render_engines_allowed` — validated in OutputType creation
- `fallback_material` — passed to the Blender scripts (see §5.1)

---

## Phase 5: Material & Rendering Improvements (Weeks 5–6)

### 5.1 Fallback Material — SCHAEFFLER_059999_FailedMaterial

**Current state:**
`step_processor.py:MATERIAL_PALETTE` assigns rainbow colors from a palette when material
assignment fails or no material is specified. `blender_render.py` has its own
`PALETTE_LINEAR` for the same purpose.

**Target:**
When material resolution fails (no alias, no exact match, material library link broken),
assign `SCHAEFFLER_059999_FailedMaterial` (magenta) so failed assignments are immediately
visible in renders.

**Implementation:**
- `domains/materials/service.py:resolve_material_map()` — instead of pass-through, return
  `SCHAEFFLER_059999_FailedMaterial` for unresolved parts (configurable per tenant via
  `tenant_config.fallback_material`)
- `render-worker/scripts/blender_render.py` — when a material library is provided but a
  part name does not match any library material, assign `SCHAEFFLER_059999_FailedMaterial`
  rather than a palette color
- `render-worker/scripts/_blender_materials.py` — a new sub-module for material logic
  with explicit logging: `[MATERIAL] part 'Outer_Ring' → 'SCHAEFFLER_010101_Steel-Bare' (alias match)`
  and `[MATERIAL] part 'Unknown_Part' → 'SCHAEFFLER_059999_FailedMaterial' (no match)`
- `step_processor.py` — remove `MATERIAL_PALETTE` and `_material_to_color()`; the palette
  is no longer used once the fallback material is in place. Part colors for the geometry GLB viewer
  should come from the material library color map, not a rainbow palette.
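
The resolution order implied above (exact match, then alias, then fallback) can be kept as a pure function that also produces the `[MATERIAL]` log line; the alias-map shape is an assumption:

```python
FAILED_MATERIAL = "SCHAEFFLER_059999_FailedMaterial"


def resolve_material(part_name: str, library: set[str], aliases: dict[str, str],
                     fallback: str = FAILED_MATERIAL) -> tuple[str, str]:
    """Resolve a part name to a library material; return (material, log_line).

    Order: exact library match, then alias lookup, then the (per-tenant
    configurable) failed-material fallback, so unresolved parts render
    magenta instead of a silent palette color.
    """
    if part_name in library:
        return part_name, f"[MATERIAL] part '{part_name}' -> '{part_name}' (exact match)"
    alias_target = aliases.get(part_name)
    if alias_target in library:
        return alias_target, (f"[MATERIAL] part '{part_name}' -> "
                              f"'{alias_target}' (alias match)")
    return fallback, f"[MATERIAL] part '{part_name}' -> '{fallback}' (no match)"
```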

### 5.2 Remove EEVEE Fallback

**Current state:**
`render_blender.py` has an EEVEE-to-Cycles fallback:
```python
if returncode > 0 and engine == "eevee":
    logger.warning("EEVEE failed (exit %d) — retrying with Cycles", returncode)
    returncode, stdout_lines2, stderr_lines2 = _run("cycles")
    engine_used = "cycles (eevee fallback)"
```

This hides failures and makes debugging harder. Per the Blender 5.0.1 requirement, EEVEE
Next should work reliably. If it fails, it should be a hard failure, not a silent retry.

**Target:** Remove the EEVEE-to-Cycles fallback. If EEVEE fails, the task fails with a
clear error. Add an `EEVEE_FALLBACK_ENABLED` system setting that defaults to false from
now on.

### 5.3 Remove Blender Version Check

**Current state:**
`backend/app/services/render_blender.py` defines:
```python
MIN_BLENDER_VERSION = (5, 0, 1)
```

This constant is defined but the check that uses it has been removed. Search for any
remaining version-comparison code in `blender_render.py` and the render scripts.

**Target:**
- Remove `MIN_BLENDER_VERSION = (5, 0, 1)` from `render_blender.py`
- Remove any `bpy.app.version` comparisons in render scripts
- Blender 5.0.1+ is assumed; older versions are not supported

---

## Phase 6: Notification Center Refactor (Week 7)

### 6.1 Current Problems

Per-render notifications (render.completed, render.failed) fire for every single
`OrderLine`. An order with 200 lines generates 200 notifications. This is too noisy.
|
||
|
||
### 6.2 Notification Architecture

**Three channels:**

1. **Activity Feed** (`/api/activity`) — per-action events: every render start/complete,
   every order state change, every upload. Low-level, not shown in the bell dropdown.
   Available on a dedicated `/activity` page for debugging.

2. **Notification Center** (`/api/notifications`) — batch summaries only:
   - "Order #ORD-2026-042 rendering complete: 47/50 succeeded, 3 failed"
   - "Excel import failed: 12 products skipped (see import log)"
   - "Worker recovery: 3 stalled renders requeued after 120min timeout"

3. **System Alerts** (admin only) — infrastructure issues: GPU probe failed, Blender
   binary not found, Redis connection lost.

**Notification trigger rules:**
- `render.completed` per-line → suppress; emit a batch when ALL lines in the order reach a terminal state
- `render.failed` per-line → suppress; emit a batch on order completion
- `excel.imported` → one notification per upload with summary counts
- `order.submitted` → one notification (always keep)
- System alerts → always emit individually

**DB changes:**
- `audit_log` — add a `channel VARCHAR(20)` column: `activity | notification | alert`
- `notification_configs` — extend `event_type` to include the new batch event types
- New beat task: `batch_render_notifications` — runs every 60s, checks for orders where
  all lines are terminal but no batch notification has been emitted; emits the summary
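The beat task above could be sketched as follows. The `batch_notified_at` column and the `emit` helper are assumptions for illustration; the plan only specifies that an order must not be summarized twice:

```python
# Sketch of batch_render_notifications; batch_notified_at is an assumed marker column.
def batch_render_notifications(session, emit):
    """Emit one summary per order whose lines are all terminal and which
    has no batch notification yet."""
    rows = session.execute(
        """
        SELECT o.id, o.order_number,
               COUNT(*) FILTER (WHERE l.render_status = 'completed') AS ok,
               COUNT(*) FILTER (WHERE l.render_status = 'failed') AS failed,
               COUNT(*) AS total
        FROM orders o JOIN order_lines l ON l.order_id = o.id
        WHERE o.batch_notified_at IS NULL
        GROUP BY o.id, o.order_number
        HAVING bool_and(l.render_status IN ('completed', 'failed', 'cancelled'))
        """
    )
    for order_id, number, ok, failed, total in rows:
        emit(f"Order #{number} rendering complete: {ok}/{total} succeeded, {failed} failed")
        # Mark the order so the next beat run skips it.
        session.execute("UPDATE orders SET batch_notified_at = now() WHERE id = %s", (order_id,))
```

Running it every 60s keeps the summary latency bounded without any per-line notification traffic.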
### 6.3 Per-User Notification Preferences

The current `notification_configs` table has `event_type` + `channel` + `enabled`. Extend it:
- Add a `frequency: str` column — `immediate | hourly | daily | never`
- Frequency is respected by the batch notification beat task

**Files to modify:**
- `backend/app/domains/notifications/models.py` — add `channel`, `frequency` columns
- `backend/app/services/notification_service.py` — add an `emit_batch_notification()` function
- `backend/app/tasks/beat_tasks.py` — add the `batch_render_notifications` schedule
- `frontend/src/pages/NotificationSettings.tsx` — add a frequency selector per event type
- `frontend/src/pages/Notifications.tsx` — separate tabs for Activity | Notifications | Alerts

---
## Phase 7: UI/UX Improvements (Week 7–8)

### 7.1 Tooltip / Help Text System

Every setting, parameter, and action in the Admin UI and order wizard needs a tooltip
explaining what it does and what it affects in the pipeline.

**Architecture:**

```typescript
// frontend/src/help/helpTexts.ts
export const HELP_TEXTS: Record<string, HelpText> = {
  "setting.blender_cycles_samples": {
    title: "Cycles Samples",
    body: "Number of render samples per pixel. Higher = better quality, longer render time. 256 is a good balance for product shots. 64 is fast for previews.",
    affects: ["render quality", "render time"],
    unit: "samples",
    range: [1, 4096],
    recommendation: "256 for production, 64 for preview",
  },
  "setting.gltf_preview_linear_deflection": {
    title: "3D Viewer Mesh Quality",
    body: "Controls tessellation precision for the 3D browser viewer. Lower values = finer mesh, larger file. 0.1mm is a good default for medium-complexity parts.",
    affects: ["3D viewer file size", "viewer load time"],
    unit: "mm",
  },
  "action.regenerate_thumbnails": {
    title: "Regenerate All Thumbnails",
    body: "Re-renders thumbnails for all STEP files using current settings. This queues all files on the thumbnail_rendering worker. Expected time: N × 30s. Only needed after changing renderer settings.",
    warning: "This will queue a large number of tasks. Only run during off-peak hours.",
  },
  // ... all settings
}
```

```typescript
// frontend/src/components/HelpTooltip.tsx
interface HelpTooltipProps {
  helpKey: string
  position?: "top" | "right" | "bottom" | "left"
}

export function HelpTooltip({ helpKey, position = "right" }: HelpTooltipProps) {
  const help = HELP_TEXTS[helpKey]
  if (!help) return null
  return (
    <Tooltip content={<HelpContent help={help} />} position={position}>
      <HelpCircle size={14} className="text-text-muted ml-1 cursor-help" />
    </Tooltip>
  )
}
```

**Where to add tooltips (minimum required):**
- All `system_settings` keys in Admin > Settings
- All `OutputType.render_settings` fields in the OutputType editor
- All `RenderTemplate` fields in the template editor
- All actions in Admin > Settings (regenerate thumbnails, process unprocessed, etc.)
- All fields in the Order Wizard with non-obvious meaning
### 7.2 Media Browser Refactor

**Current state:**
`frontend/src/pages/MediaBrowser.tsx` exists, but its current filter capabilities are undocumented.

**Target:**
A server-side filtered media browser with:
- Filters: `lagertyp | category_key | render_status | asset_type | tenant_id (admin)`
- Text search on product name and pim_id
- Server-side pagination (50 per page)
- Virtual scroll for large catalogs (react-virtual or TanStack Virtual)
- Batch download of selected assets
**API changes:**
```
GET /api/media/assets?
    asset_type=still&
    category_key=TRB&
    lagertyp=Axial-Zylinderrollenlager&
    render_status=completed&
    page=1&
    page_size=50&
    q=81113
```
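The query parameters above map naturally onto the planned `MediaAssetFilter` model. As a minimal stand-in sketch (the real version would be a Pydantic model in `domains/media/schemas.py`; field names here mirror the URL above):

```python
# Dataclass stand-in for the planned MediaAssetFilter Pydantic model.
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class MediaAssetFilter:
    asset_type: Optional[str] = None      # e.g. still
    category_key: Optional[str] = None    # e.g. TRB
    lagertyp: Optional[str] = None
    render_status: Optional[str] = None
    q: Optional[str] = None               # text search on name / pim_id
    page: int = 1
    page_size: int = 50

    def to_query_params(self) -> dict:
        """Drop unset filters so the URL only carries active ones."""
        return {k: v for k, v in asdict(self).items() if v is not None}

params = MediaAssetFilter(asset_type="still", category_key="TRB", q="81113").to_query_params()
```

On the backend, the same model doubles as the dependency that parses and validates the query string.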
**DB indexes required:**
```sql
-- backend/alembic/versions/053_media_browser_indexes.py
CREATE INDEX ix_media_assets_asset_type_created ON media_assets(asset_type, created_at DESC);
CREATE INDEX ix_products_category_lagertyp ON products(category_key, lagertyp);
CREATE INDEX ix_products_name_gin ON products USING GIN(to_tsvector('simple', COALESCE(name, '') || ' ' || COALESCE(pim_id, '')));
```

**Files to modify:**
- `backend/app/domains/media/router.py` — add `GET /assets` with filter params
- `backend/app/domains/media/schemas.py` — add a `MediaAssetFilter` Pydantic model
- `frontend/src/pages/MediaBrowser.tsx` — complete rewrite with virtual scroll
- `frontend/src/api/media.ts` — add a `getMediaAssets(filters)` function
### 7.3 Workflow Editor — Pipeline Step Nodes

**Current state:**
`WorkflowEditor.tsx` has 5 node types (Upload, Parse, Render, Export, Deliver), but they
do not map to actual Celery tasks. `WorkflowDefinition.config` is a free-form JSONB blob
with no schema validation.

**Target:**
Node types correspond 1:1 to `ProcessStep` enum values. The workflow editor saves a
validated workflow config that the `dispatch_workflow()` function can execute.

**WorkflowDefinition config schema:**
```json
{
  "version": 1,
  "nodes": [
    {"id": "n1", "step": "extract_metadata", "params": {}},
    {"id": "n2", "step": "render_thumbnail", "params": {"engine": "cycles", "samples": 64}},
    {"id": "n3", "step": "render_still", "params": {"width": 2048, "height": 2048}},
    {"id": "n4", "step": "export_glb", "params": {"quality": "high"}},
    {"id": "n5", "step": "deliver", "params": {}}
  ],
  "edges": [
    {"from": "n1", "to": "n2"},
    {"from": "n2", "to": "n3"},
    {"from": "n3", "to": "n4"},
    {"from": "n4", "to": "n5"}
  ]
}
```

Backend validation: `workflow_router.py` validates that all `step` values are in the
`ProcessStep` enum before saving.
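The save-time validation could be sketched as follows. The three-member enum is a placeholder for the real 20-step `StepName` registry, and the edge check (every edge references a declared node id) is an added assumption beyond what the plan specifies:

```python
# Sketch of workflow config validation against the step registry.
from enum import Enum

class ProcessStep(str, Enum):  # stand-in for app.core.process_steps.StepName
    extract_metadata = "extract_metadata"
    render_thumbnail = "render_thumbnail"
    deliver = "deliver"

def validate_workflow_config(config: dict) -> list[str]:
    """Return a list of human-readable errors; empty list means valid."""
    errors = []
    valid_steps = {s.value for s in ProcessStep}
    node_ids = {n["id"] for n in config.get("nodes", [])}
    for node in config.get("nodes", []):
        if node.get("step") not in valid_steps:
            errors.append(f"node {node['id']}: unknown step {node.get('step')!r}")
    for edge in config.get("edges", []):
        if edge["from"] not in node_ids or edge["to"] not in node_ids:
            errors.append(f"edge {edge['from']}→{edge['to']}: unknown node id")
    return errors
```

The router would return the error list as a 422 response body so the editor can highlight offending nodes.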
Frontend: `WorkflowEditor.tsx` builds the available node types from a `GET /api/workflows/steps`
endpoint that returns all `ProcessStep` entries with their parameter schemas.
### 7.4 Kanban Rejection Flow

**Current state:**
`OrderStatus.rejected` exists, but the rejection flow is undefined. The admin panel has no
rejection UI. A `rejected_at` column exists, but there is no rejection-reason field.

**Target flow:**
1. **Who can reject:** `ProjectManager`, `TenantAdmin`, `GlobalAdmin`
2. **Trigger:** `POST /api/orders/{id}/reject` with body `{"reason": "...", "notify_client": true}`
3. **What happens:**
   - Order status → `rejected`, `rejected_at` = now
   - `rejection_reason` stored (new `Text` column on `Order`)
   - All pending/processing renders are cancelled (same as the cancel-renders endpoint)
   - Notification emitted to the order creator: "Your order #ORD-2026-042 was rejected. Reason: ..."
   - Audit log entry created
4. **Client sees:** Order status badge changes to `REJECTED` with the reason visible
5. **Re-submission:** The client can `POST /api/orders/{id}/resubmit`, which clears the
   rejection and resets the order to `draft`, allowing edits before re-submitting. Re-submit
   creates a new audit log entry and emits a notification to PMs.
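Step 3 of the flow could be sketched as a service-layer function. The helper names (`cancel_pending_renders`, `notify_user`, `audit`) are assumptions passed in as callables, not real project APIs:

```python
# Sketch of the rejection flow's "what happens" step (hypothetical helper names).
from datetime import datetime, timezone

def reject_order(order, reason: str, notify_client: bool,
                 cancel_pending_renders, notify_user, audit) -> None:
    order.status = "rejected"
    order.rejected_at = datetime.now(timezone.utc)
    order.rejection_reason = reason                  # new Text column (migration 054)
    cancel_pending_renders(order)                    # same path as the cancel-renders endpoint
    if notify_client:
        notify_user(order.created_by,
                    f"Your order #{order.order_number} was rejected. Reason: {reason}")
    audit("order.rejected", order_id=order.id, reason=reason)
```

Keeping the mutations in one function means the router stays a thin permission check plus a call to this service.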
**DB migration:**
- `backend/alembic/versions/054_order_rejection.py` — add `rejection_reason TEXT` to `orders`

---

## Phase 8: Scalable Workers (Week 8)

### 8.1 Current Concurrency Controls

- `worker` (step_processing): `CELERY_WORKER_CONCURRENCY` env var, default 8
- `render-worker` (thumbnail_rendering): hardcoded to 1 (serial Blender access)
- Both require a Docker service restart to change concurrency
### 8.2 Dynamic Worker Scaling

**Short term (no Kubernetes):**
Use Celery's built-in `--autoscale` option:
```yaml
# docker-compose.yml
render-worker:
  command: >
    celery -A app.tasks.celery_app worker
    --loglevel=info
    -Q thumbnail_rendering
    --autoscale=1,1
```
(`--autoscale` takes `max,min`; with a single Blender instance both are 1, so a separate
`--concurrency` flag is redundant.)

For `worker`:
```yaml
worker:
  command: >
    celery -A app.tasks.celery_app worker
    --loglevel=info
    -Q step_processing,ai_validation
    --autoscale=${MAX_CONCURRENCY:-8},${MIN_CONCURRENCY:-2}
```

**Per-queue concurrency via DB:**
Add a `worker_configs` table:
```sql
CREATE TABLE worker_configs (
    queue_name VARCHAR(100) PRIMARY KEY,
    max_concurrency INT NOT NULL DEFAULT 8,
    min_concurrency INT NOT NULL DEFAULT 2,
    updated_at TIMESTAMP NOT NULL DEFAULT now()
);
```

A beat task `apply_worker_concurrency` runs every 5 minutes and uses Celery control
commands to adjust pool size:
```python
celery_app.control.broadcast("pool_shrink", arguments={"n": 2}, destination=["worker@host"])
celery_app.control.broadcast("pool_grow", arguments={"n": 4}, destination=["worker@host"])
```
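The core of `apply_worker_concurrency` is computing, per queue, how far the current pool is from the configured target. A minimal sketch, assuming the configs come from `worker_configs` and the current sizes from a Celery inspect call (both passed in here as plain dicts):

```python
# Sketch: compute per-queue pool deltas for the apply_worker_concurrency beat task.
def plan_pool_adjustments(configs: dict[str, int], current_sizes: dict[str, int]) -> dict[str, int]:
    """Return per-queue deltas: positive → pool_grow n, negative → pool_shrink n."""
    return {
        queue: target - current_sizes.get(queue, 0)
        for queue, target in configs.items()
        if target != current_sizes.get(queue, 0)
    }
```

The beat task then translates each delta into the `pool_grow`/`pool_shrink` broadcasts shown above, targeting the worker serving that queue.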
**Long term (Kubernetes):**
Workers run as Kubernetes Deployments with HPA on a `celery_queue_length` metric (exposed via
Flower or a custom `/metrics` endpoint for Prometheus). Render-workers use GPU node pools
with `nvidia.com/gpu: 1` resource requests.
### 8.3 Worker Health Recovery

**Current state:**
`beat_tasks.recover_stuck_cad_files` runs every 5 minutes and handles the stuck-processing state.

**Extend it to:**
- Detect `render_status = 'processing'` where `render_started_at` is more than `render_stall_timeout_minutes` ago
- SIGTERM any still-running Blender PID (stored in `render_job_doc.celery_task_id`)
- Reset `render_status` to `failed`, update `render_job_doc.state = 'failed'`
- Emit a system alert notification (admin channel)
- Log with `[WORKER_RECOVERY] Stalled render for order_line {id} terminated after {N}min`
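The stall-detection step of that list could be sketched as a pure function, which keeps the timeout logic testable apart from the DB query. The timeout value would come from the `render_stall_timeout_minutes` system setting:

```python
# Sketch: select order lines stuck in 'processing' past the stall timeout.
from datetime import datetime, timedelta, timezone

def find_stalled(lines, timeout_minutes: int, now=None):
    """Return order lines whose render started before the stall cutoff."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=timeout_minutes)
    return [
        line for line in lines
        if line.render_status == "processing"
        and line.render_started_at is not None
        and line.render_started_at < cutoff
    ]
```

The beat task then terminates, resets, and alerts on each returned line as described above.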
---

## Detailed Task Breakdown by Area

### A. step_tasks.py Decomposition

**Current problems:**
- 1,170 lines, 8 distinct Celery tasks, many private helpers, and multiple inline DB session
  creation patterns
- Imports are scattered: some at module level, some inside functions (a common Celery pattern)
- `render_order_line_task` (lines 705–1050+) duplicates `render_order_line_still_task`

**Migration path:**
1. Create a new `domains/pipeline/tasks/` directory with one file per step
2. Each new task calls `PipelineLogger` instead of bare `logger.info`
3. Each new task writes to `render_job_doc` via the `job_document.py` helpers
4. The old `step_tasks.py` becomes an import-only shim: `from app.domains.pipeline.tasks.extract_metadata import process_step_file`
5. After a 2-week migration period, delete `step_tasks.py`
### B. Auth Token Claims

**Current:** `{"sub": user_id, "role": role, "exp": expires}` — no tenant_id in the token

**Target:** `{"sub": user_id, "role": role, "tenant_id": str(tenant_id), "exp": expires}`

**Impact:** All existing tokens become invalid after deploy; users must re-login.
**Mitigation:** Rotate `JWT_SECRET_KEY` as part of the deployment to force re-login.
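Only the payload shape changes; the signing itself stays with whatever JWT library `create_access_token()` already uses. A sketch of the extended claim builder (`build_claims` is an illustrative name):

```python
# Sketch of the extended JWT claim set; signing is omitted.
from datetime import datetime, timedelta, timezone

def build_claims(user_id: str, role: str, tenant_id, expires_minutes: int = 30) -> dict:
    return {
        "sub": user_id,
        "role": role,
        "tenant_id": str(tenant_id),   # new claim; middleware falls back if absent
        "exp": datetime.now(timezone.utc) + timedelta(minutes=expires_minutes),
    }
```

`TenantContextMiddleware` reads `tenant_id` from the decoded payload and applies it via `SET LOCAL` for RLS.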
### C. Celery Task Routing Update

After the Phase 2 decomposition, update `celery_app.conf.update(task_routes={...})`:
```python
task_routes = {
    "app.domains.pipeline.tasks.*": {"queue": "step_processing"},
    "app.domains.rendering.tasks.*": {"queue": "thumbnail_rendering"},
    "app.domains.media.tasks.*": {"queue": "step_processing"},
    "app.tasks.ai_tasks.*": {"queue": "ai_validation"},
    "app.tasks.beat_tasks.*": {"queue": "step_processing"},
}
```

### D. Frontend API Client Consistency

All `frontend/src/api/*.ts` files should:
- Use the axios client from `api/client.ts` (which injects the `X-Tenant-ID` header)
- Export typed interfaces for all response shapes
- Use `useQuery` / `useMutation` from TanStack Query, not bare `axios.get` in components

**Audit needed:** Check each `api/*.ts` file to confirm the `X-Tenant-ID` header is sent
(it is wired in the axios interceptor per commit 5da90b5, but verify that all files use
the configured client, not `axios.create()` directly).

---
## Architectural Decisions (ADRs)

### ADR-001: Shared containers vs. per-tenant containers
**Decision:** Shared containers with PostgreSQL RLS
**Rationale:** Cost and complexity savings. RLS provides adequate isolation for internal use.
**Consequences:** RLS must be applied consistently (Phase 1.3). Blender sessions are
shared; GPU contention is managed via Celery queue depth, not isolation.

### ADR-002: Render Job Document as JSONB
**Decision:** Store the render job state machine as JSONB in `order_lines.render_job_doc`
**Rationale:** Avoids additional `workflow_node_results` table queries for debugging;
JSONB is flexible for schema evolution and can be indexed for state-based queries.
**Alternatives considered:** A separate `render_job_steps` table — rejected (too many joins
for the common "show me render status" query).

### ADR-003: No per-render notifications
**Decision:** Suppress individual render.completed notifications; emit a batch at order completion
**Rationale:** An order with 200 lines generates 200 notifications under the current model.
Batch summaries at order completion are actionable; per-render events are noise.
**Consequences:** The activity feed still records all events for debugging.

### ADR-004: GPU-first rendering
**Decision:** Default `cycles_device = "gpu"`, with an explicit log on CPU fallback
**Rationale:** The render-worker has a GPU reservation in docker-compose.yml. CPU fallback
should be visible and logged, not silent.
**Consequences:** Renders on machines without a GPU will always log a CPU fallback warning.

### ADR-005: Fallback material over palette
**Decision:** Replace the `MATERIAL_PALETTE` rainbow fallback with `SCHAEFFLER_059999_FailedMaterial`
**Rationale:** Failed material assignments should be immediately visible (magenta) rather
than disguised as intentional palette colors.
**Consequences:** Parts with a missing material mapping will render magenta in both
thumbnail and production renders. This is a feature, not a bug.

### ADR-006: Blender 5.0.1 minimum, no version guards
**Decision:** Remove all `bpy.app.version` checks and `MIN_BLENDER_VERSION` guards
**Rationale:** The project is Blender 5.0.1-only. Version shims add complexity without value.
**Consequences:** Running an older Blender binary will cause cryptic errors. Document
the minimum version requirement clearly in the Dockerfile and README.

---
## What Gets Deleted

### Python files to delete entirely:
- `backend/app/models/user.py` — compat shim
- `backend/app/models/cad_file.py` — compat shim
- `backend/app/models/order.py` — compat shim (if it exists)
- `backend/app/models/order_item.py` — compat shim
- `backend/app/models/order_line.py` — compat shim
- `backend/app/models/material.py` — compat shim
- `backend/app/models/material_alias.py` — compat shim
- `backend/app/models/render_template.py` — compat shim
- `backend/app/models/output_type.py` — compat shim
- `backend/app/models/system_setting.py` — compat shim
- `backend/app/models/template.py` — compat shim
- `backend/app/models/render_position.py` — compat shim
- `backend/app/services/render_dispatcher.py` — 10-line shim
- `backend/app/services/material_service.py` — 3-line shim
- `backend/app/tasks/step_tasks.py` — after the Phase 2 migration is complete
- `backend/app/domains/rendering/tasks.py` — split into per-step files in Phase 2

### Directories to delete entirely:
- `blender-renderer/` — HTTP microservice, removed from docker-compose in refactor/v2
- `threejs-renderer/` — removed in migration 033
- `flamenco/` — removed in migration 032

### Code blocks to delete (within files):
- `render-worker/scripts/blender_render.py` lines 798–851 — Pillow overlay
- `render-worker/scripts/blender_render.py` line 17 — docstring Pillow mention
- `backend/app/services/render_blender.py` line 17 — `MIN_BLENDER_VERSION = (5, 0, 1)`
- `backend/app/services/render_blender.py` lines 229–233 — EEVEE-to-Cycles fallback
- `backend/app/services/step_processor.py` lines 19–31 — `MATERIAL_PALETTE` + `_material_to_color()`
- `backend/app/api/routers/admin.py` — `VALID_STL_QUALITIES`, `stl_quality` in all schemas

### System settings to delete (DB migration):
- `stl_quality` — GLB-only pipeline, no STL concept
- `threejs_render_size` — renderer removed
- `thumbnail_renderer` — was multi-value (pillow|blender|threejs), now always blender

---
## Migration Strategy

### Deployment Order (Zero-Downtime)

**Step 1 — DB migrations (non-breaking):**
- Run migrations 048–054 (new columns: `render_job_doc`, `rejection_reason`, feature flags, etc.)
- New columns are nullable, so no existing queries break

**Step 2 — Backend deploy (backward compatible):**
- Deploy the new backend with compat shims in place
- New endpoints and middleware active
- Old endpoints still work
- JWT tokens are extended with the `tenant_id` claim (existing tokens without it still work
  via a fallback in the middleware)

**Step 3 — Celery worker deploy:**
- Deploy the new `domains/pipeline/tasks/` structure
- The `step_tasks.py` compat shim routes to the new functions
- Old task names remain registered via the shim

**Step 4 — Frontend deploy:**
- New WorkflowEditor with validated step types
- HelpTooltip components added
- MediaBrowser refactor with virtual scroll

**Step 5 — Cleanup (breaking):**
- Remove compat shims
- Delete `step_tasks.py`
- Rotate `JWT_SECRET_KEY` to force re-login (tenant_id is now required in claims)
- Run a DB migration to clean up the stl_quality and threejs settings

### Rollback Plan
- All migrations have `downgrade()` implemented
- Compat shims mean old task names still work during the migration window
- The `render_log` column is kept alongside `render_job_doc` until all consumers have migrated

### Testing Before Delete
Before deleting any compat shim or old code, verify:
```bash
grep -rn "<old_import_path>" backend/ frontend/ --include="*.py" --include="*.ts" --include="*.tsx"
```
Must return 0 results from non-shim files.
---

## Open Questions

These require product decisions before implementation:

1. **Tenant onboarding flow** — How are new tenants created? Self-service signup, or does an
   admin create the tenant + TenantAdmin user manually? What is the initial data setup?

2. **Blender binary distribution** — Currently host-mounted (`/opt/blender:/opt/blender:ro`).
   If multiple render-workers run on different hosts in a future cluster, how is Blender
   distributed? Container image vs. network share?

3. **MinIO vs. filesystem storage** — All media assets are stored on the local filesystem
   (`/app/uploads` volume). MinIO is configured but not yet used for primary storage. Should
   Phase 2 migrate assets to MinIO for horizontal scaling?

4. **Invoice workflow** — `billing/models.py` has `Invoice` + `InvoiceLine` models and an
   `invoices` table (migration 042). Is billing actually used? If not, should it be removed
   to reduce complexity?

5. **AI validation (Azure OpenAI)** — `ai_tasks.py` and `azure_ai.py` exist, but Azure
   credentials are optional. Is this feature actively used, or can it be removed?

6. **Email notifications** — SMTP settings exist in system_settings, but email sending is
   not implemented. Is this a required feature for the next phase?

7. **Rejection re-submission UX** — When a client re-submits a rejected order, do they
   create a new order or update the existing one? The current data model supports only
   one status per order, not a history of submissions.

8. **Media browser download format** — Bulk download: a ZIP of individual files, or separate
   download links? ZIP requires server-side assembly, which adds load.

9. **Tooltip language** — Help texts in English (per CLAUDE.md coding standards) or German
   (for end-user-facing UI)? The admin UI currently uses English labels.

10. **3D Viewer geometry quality** — The `gltf_preview_linear_deflection` default is 0.1mm.
    For very small parts (sub-1mm features), this may be too coarse. Should the deflection
    auto-scale based on the CAD file's bounding box dimensions?