
Schaeffler Automat — Refactor Plan

Document date: 2026-03-08
Branch: refactor/v2
Author: Architecture review via Claude Code


Executive Summary

Current State

Schaeffler Automat is a working Blender-based media production pipeline with:

  • Domain-driven backend structure (partially migrated, many compat shims still present)
  • 7 Docker services with GPU render-worker
  • PostgreSQL with tenant_id columns + Row Level Security (RLS) enabled but inconsistently applied at the application layer
  • Celery task queues with two workers (step_processing + thumbnail_rendering)
  • WebSocket real-time events via Redis Pub/Sub
  • React/Vite frontend with workflow editor (ReactFlow), media browser, notifications

Core Problems

  1. step_tasks.py is 1,170 lines — monolithic task file containing 8+ distinct pipeline steps
  2. Tenant isolation is partial: RLS is defined in DB migration 036 but set_tenant_context() is not called consistently in every router; Celery tasks bypass RLS entirely
  3. Pillow overlay code (green bar + model name label) is dead code — all renders use transparent_bg=True but the 55-line block still runs conditionally
  4. STL workflow remnants: stl_quality setting, VALID_STL_QUALITIES, stl_size_bytes in render_log dicts still reference the old STL-based pipeline; the actual pipeline is GLB-only
  5. Render job cancellation uses a synthetic task ID (render-{line_id}) that does not match actual Celery task IDs — making revoke() a no-op
  6. The MATERIAL_PALETTE + palette fallback lives in step_processor.py — should be replaced with SCHAEFFLER_059999_FailedMaterial (magenta) per the project goals
  7. Log messages are inconsistent: some use Python f-strings with no prefix, others use [STEP_NAME] markers; structured logging is not enforced
  8. render_order_line_task in step_tasks.py duplicates most of render_order_line_still_task in domains/rendering/tasks.py
  9. The blender_render.py Blender script is 853 lines with no sub-module structure
  10. No GPU-first enforcement: cycles_device defaults to "auto" with no explicit fallback log

Vision

A clean, modular pipeline where:

  • Every step is a named ProcessStep with start/progress/done log events and DB audit trail
  • Render jobs are tracked as structured JSON documents (job tickets) in the DB
  • Tenant isolation is enforced at the dependency-injection layer, not ad-hoc per endpoint
  • Dead code (Pillow overlays, STL workflow, Flamenco shims, threejs renderer) is deleted
  • The auth hierarchy supports GlobalAdmin > TenantAdmin > ProjectManager > Client
  • Workers scale dynamically without service restarts
  • Notifications are batched summaries, not per-render noise

Architecture Overview

Current Architecture

┌─────────────┐    HTTP     ┌──────────────────────────────────────────┐
│  Frontend   │ ──────────> │  backend:8888 (FastAPI)                  │
│  React/Vite │             │  ├─ domains/auth                         │
│  :5173      │ <─ WS ──── │  ├─ domains/orders                       │
└─────────────┘             │  ├─ domains/products                     │
                            │  ├─ domains/rendering                    │
                            │  ├─ domains/tenants                      │
                            │  └─ api/routers/ (compat shims)          │
                            └──────────┬───────────────────────────────┘
                                       │ Celery tasks via Redis broker
                     ┌─────────────────┼──────────────────┐
                     │                 │                  │
              ┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐
              │   worker    │  │render-worker│  │    beat     │
              │ step_proc   │  │thumbnail_   │  │ scheduler   │
              │ ai_valid    │  │ rendering   │  └─────────────┘
              │ concurr=8   │  │ concurr=1   │
              └─────────────┘  └──────▼──────┘
                                      │ subprocess
                               ┌──────▼──────┐
                               │  blender    │
                               │  /opt/blend │
                               └─────────────┘
         ┌──────────────┐  ┌──────────┐  ┌──────────┐
         │  PostgreSQL  │  │  Redis   │  │  MinIO   │
         │  :5432       │  │  :6379   │  │  :9000   │
         └──────────────┘  └──────────┘  └──────────┘

Target Architecture (Post-Refactor)

┌─────────────────────────────────────────────────────────┐
│  Frontend React/Vite :5173                              │
│  ├─ WorkflowEditor (ReactFlow) — visual pipeline        │
│  ├─ MediaBrowser — server-side filtered + virtual scroll│
│  ├─ NotificationCenter — batched summaries only         │
│  └─ Admin — tooltips on every setting                   │
└────────────────────┬────────────────────────────────────┘
                     │ HTTP + WebSocket
┌────────────────────▼────────────────────────────────────┐
│  backend:8888 (FastAPI)                                 │
│  middleware: TenantContextMiddleware (injects RLS)      │
│  ├─ domains/auth      (GlobalAdmin|TenantAdmin|PM|Client)│
│  ├─ domains/pipeline  (process step registry + dispatch) │
│  ├─ domains/rendering (render job documents, workflows)  │
│  ├─ domains/products  (CAD files, media assets)         │
│  ├─ domains/orders    (order state machine)             │
│  ├─ domains/tenants   (tenant management)               │
│  └─ domains/billing   (pricing, invoices)               │
└────────────────────┬────────────────────────────────────┘
                     │ Celery canvas / chain / group
     ┌───────────────┼───────────────┐
     │               │               │
┌────▼────┐  ┌──────▼──────┐  ┌────▼────┐
│ worker  │  │render-worker│  │  beat   │
│ step_   │  │ concurr=1   │  │ sched.  │
│ process │  │ +Blender GPU│  │ recover │
│ concr=8 │  └──────▼──────┘  │ queues  │
└─────────┘         │          └─────────┘
                subprocess (SIGTERM → SIGKILL + cleanup)
                    │
             ┌──────▼──────┐
             │  blender    │  (GPU-first, explicit CPU-fallback log)
             └─────────────┘

Phase 1: Foundation (Weeks 1-2)

Critical infrastructure that blocks everything else.

1.1 Structured Logging Framework

Current state:
Log messages are a mix of bare logger.info(f"..."), emit(order_line_id, "..."), and
log_task_event(task_id, "..."). No consistent prefix, no structured fields.

Target:
A PipelineLogger class that wraps Python's logging module and additionally writes
structured events to the DB (audit_log or a new pipeline_events table).

Design:

# backend/app/core/pipeline_logger.py
class PipelineLogger:
    PREFIX_FORMAT = "[{step_name}]"

    def step_start(self, step: str, context: dict): ...
    def step_progress(self, step: str, pct: int, msg: str): ...
    def step_done(self, step: str, duration_s: float, result: dict): ...
    def step_error(self, step: str, error: str, exc: Exception | None): ...

Every log call emits:

  • Python logging line with [STEP_NAME] message
  • Redis log_task_event for SSE streaming
  • Optional DB insert into pipeline_events(task_id, step_name, level, message, duration_s, context JSONB, created_at)
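
A minimal sketch of how one method fans out to all three sinks, assuming the log_task_event(task_id, message) publisher already in task_logs.py; the pipeline_events insert stays a stub until the migration exists:

# backend/app/core/pipeline_logger.py - sketch, not the final implementation
import logging
import time

from app.core.task_logs import log_task_event  # existing Redis SSE publisher

logger = logging.getLogger("pipeline")


class PipelineLogger:
    PREFIX_FORMAT = "[{step_name}]"

    def __init__(self, task_id: str, db_session=None):
        self.task_id = task_id
        self.db = db_session               # optional third sink (pipeline_events)
        self._started: dict[str, float] = {}

    def step_start(self, step: str, context: dict | None = None) -> None:
        self._started[step] = time.monotonic()
        self._emit(step, "started", context)

    def step_done(self, step: str, result: dict | None = None) -> None:
        duration = time.monotonic() - self._started.pop(step, time.monotonic())
        self._emit(step, f"done in {duration:.1f}s", result)

    def step_error(self, step: str, error: str, exc: Exception | None = None) -> None:
        self._emit(step, f"ERROR: {error}", None, level=logging.ERROR)

    def _emit(self, step: str, message: str, context, level=logging.INFO) -> None:
        line = f"{self.PREFIX_FORMAT.format(step_name=step.upper())} {message}"
        logger.log(level, line)                # sink 1: Python logging
        log_task_event(self.task_id, line)     # sink 2: Redis SSE stream
        if self.db is not None:
            pass  # sink 3: INSERT INTO pipeline_events(...) once the migration lands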

Files to create:

  • backend/app/core/pipeline_logger.py — PipelineLogger class
  • backend/alembic/versions/048_pipeline_events.py — new table migration

Files to modify:

  • All task files to replace bare logger.info/error with PipelineLogger calls
  • backend/app/core/task_logs.py — keep Redis SSE publish, add DB write path

1.2 Render Job Document

Current state:
OrderLine.render_log is a loosely-structured JSONB dict. No schema, no state machine,
no step-level results stored.

Target:
A RenderJobDocument JSONB schema stored in order_lines.render_job_doc. Acts as the
single source of truth for a render job's state machine.

Schema (JSONB):

{
  "version": 1,
  "job_id": "<order_line_id>",
  "created_at": "ISO8601",
  "state": "pending|queued|running|completed|failed|cancelled",
  "celery_task_id": "uuid",
  "steps": [
    {
      "name": "resolve_step_path",
      "status": "done",
      "started_at": "ISO8601",
      "completed_at": "ISO8601",
      "duration_s": 0.02,
      "output": {"step_path": "/app/uploads/..."}
    },
    {
      "name": "occ_glb_export",
      "status": "done",
      "duration_s": 8.4,
      "output": {"glb_path": "...", "size_bytes": 204800}
    },
    {
      "name": "blender_render",
      "status": "running",
      "started_at": "ISO8601",
      "gpu_type": "OPTIX",
      "engine": "cycles",
      "samples": 256
    }
  ],
  "error": null,
  "result": {
    "output_path": "...",
    "duration_s": 34.2,
    "engine_used": "cycles",
    "gpu": "RTX 3090"
  }
}

Migration:

  • backend/alembic/versions/049_render_job_document.py — add render_job_doc JSONB to order_lines; keep render_log for backward compat (deprecate, remove in Phase 3)

Files to create:

  • backend/app/domains/rendering/job_document.py — RenderJobDocument Pydantic model + helpers (update_step, set_state, append_error)
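
A sketch of the model mirroring the JSONB schema above; field names come from the example document, and the helper body is an assumption:

# backend/app/domains/rendering/job_document.py - sketch
from datetime import datetime, timezone

from pydantic import BaseModel, Field


class StepRecord(BaseModel):
    name: str
    status: str = "pending"        # pending|running|done|failed|skipped
    started_at: datetime | None = None
    completed_at: datetime | None = None
    duration_s: float | None = None
    output: dict = Field(default_factory=dict)


class RenderJobDocument(BaseModel):
    version: int = 1
    job_id: str
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    state: str = "pending"         # pending|queued|running|completed|failed|cancelled
    celery_task_id: str | None = None
    steps: list[StepRecord] = Field(default_factory=list)
    error: str | None = None
    result: dict = Field(default_factory=dict)

    def update_step(self, name: str, **changes) -> StepRecord:
        """Upsert a step record by name and apply field changes."""
        for step in self.steps:
            if step.name == name:
                break
        else:
            step = StepRecord(name=name)
            self.steps.append(step)
        for key, value in changes.items():
            setattr(step, key, value)
        return step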

1.3 Tenant Context Middleware

Current state:
set_tenant_context() must be called manually in each endpoint. Celery tasks bypass RLS
entirely (they use sync engines without SET LOCAL app.current_tenant_id).

Problem:
Migration 036 enables RLS, but build_tenant_db_dep() in database.py actually yields
db without setting the tenant context (line 92: yield db # context-setting happens via set_tenant_context when needed). This means most endpoints are silently bypassing RLS.

Target:
A FastAPI middleware TenantContextMiddleware that automatically sets RLS context for
every request based on the JWT tenant_id claim.

# backend/app/core/middleware.py
class TenantContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # Extract JWT, decode tenant_id
        # Store in request.state.tenant_id
        # After DB session is acquired, SET LOCAL app.current_tenant_id
        ...

JWT changes:
create_access_token() must embed tenant_id in claims:

payload = {"sub": user_id, "role": role, "tenant_id": str(tenant_id), "exp": expires}

Celery tasks:
All sync DB sessions in Celery tasks must receive tenant_id as a task argument and
execute session.execute(text("SET LOCAL app.current_tenant_id = :tid"), {"tid": tenant_id})
immediately after session creation. Add a _set_tenant(session, tenant_id) helper in
backend/app/core/db_utils.py.
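
A sketch of the helper, using set_config(), the parameterizable form of SET LOCAL (SET itself cannot take bind parameters on all drivers):

# backend/app/core/db_utils.py - sketch
from sqlalchemy import text
from sqlalchemy.orm import Session


def _set_tenant(session: Session, tenant_id: str | None) -> None:
    """Apply the RLS tenant context to a sync Celery DB session.

    Must run inside the transaction the task's queries use:
    set_config(..., true) scopes the setting to the current
    transaction, the same effect as SET LOCAL.
    """
    if tenant_id is None:
        return  # e.g. beat/system tasks that operate cross-tenant
    session.execute(
        text("SELECT set_config('app.current_tenant_id', :tid, true)"),
        {"tid": str(tenant_id)},
    )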

Files to create:

  • backend/app/core/middleware.py — TenantContextMiddleware
  • backend/app/core/db_utils.py — _set_tenant(session, tenant_id)

Files to modify:

  • backend/app/main.py — add middleware
  • backend/app/utils/auth.py — embed tenant_id in JWT
  • All Celery task functions — accept tenant_id: str | None parameter, call _set_tenant

1.4 Process Step Registry

Current state:
Pipeline steps are implicit — scattered across step_tasks.py, rendering/tasks.py,
step_processor.py, render_blender.py. No central definition.

Target:
A ProcessStep enum and registry that all tasks reference by name.

# backend/app/domains/pipeline/steps.py
class ProcessStep(str, enum.Enum):
    UPLOAD_STEP       = "upload_step"
    PARSE_EXCEL       = "parse_excel"
    EXTRACT_METADATA  = "extract_metadata"
    OCC_GLB_EXPORT    = "occ_glb_export"
    RENDER_THUMBNAIL  = "render_thumbnail"
    RENDER_STILL      = "render_still"
    RENDER_TURNTABLE  = "render_turntable"
    EXPORT_GLB        = "export_glb"
    EXPORT_BLEND      = "export_blend"
    DELIVER           = "deliver"

Each step maps to exactly one Celery task and one workflow node type. This enum becomes
the contract between the visual workflow editor and the task executor.
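
Since the mapping is 1:1, the registry can be a plain dict keyed by the enum. A sketch; the task paths follow the Phase 2 target layout and are assumptions until that decomposition lands:

# backend/app/domains/pipeline/registry.py - sketch
from app.domains.pipeline.steps import ProcessStep

STEP_TASKS: dict[ProcessStep, str] = {
    ProcessStep.EXTRACT_METADATA: "app.domains.pipeline.tasks.extract_metadata.process_step_file",
    ProcessStep.RENDER_THUMBNAIL: "app.domains.pipeline.tasks.render_thumbnail.render_step_thumbnail",
    ProcessStep.RENDER_STILL: "app.domains.rendering.tasks.render_still.render_order_line_still_task",
    ProcessStep.RENDER_TURNTABLE: "app.domains.rendering.tasks.render_turntable.render_turntable_task",
    ProcessStep.EXPORT_GLB: "app.domains.pipeline.tasks.export_glb_geometry.generate_gltf_geometry_task",
}


def task_for(step: ProcessStep) -> str:
    """Resolve a workflow node's step name to its Celery task path."""
    return STEP_TASKS[step]  # KeyError = unregistered step, fail loudly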


Phase 2: Pipeline Modularity (Weeks 3-4)

Break up step_tasks.py (1,170 lines). One file = one pipeline stage.

2.1 Decompose step_tasks.py

Current functions and their new homes:

| Current location | Function | Target file |
|---|---|---|
| step_tasks.py | process_step_file | domains/pipeline/tasks/extract_metadata.py |
| step_tasks.py | render_step_thumbnail | domains/pipeline/tasks/render_thumbnail.py |
| step_tasks.py | generate_gltf_geometry_task | domains/pipeline/tasks/export_glb_geometry.py |
| step_tasks.py | generate_gltf_production_task | domains/pipeline/tasks/export_glb_production.py |
| step_tasks.py | regenerate_thumbnail | domains/pipeline/tasks/render_thumbnail.py |
| step_tasks.py | dispatch_order_line_render | domains/pipeline/tasks/dispatch.py |
| step_tasks.py | render_order_line_task | DELETE (duplicate of domains/rendering/tasks.render_order_line_still_task) |
| step_tasks.py | reextract_cad_metadata | domains/pipeline/tasks/extract_metadata.py |
| step_tasks.py | _auto_populate_materials_for_cad | domains/pipeline/tasks/auto_materials.py |
| step_tasks.py | _bbox_from_glb, _bbox_from_step_cadquery | domains/pipeline/tasks/bbox.py |
| rendering/tasks.py | render_order_line_still_task | domains/rendering/tasks/render_still.py |
| rendering/tasks.py | render_turntable_task | domains/rendering/tasks/render_turntable.py |
| rendering/tasks.py | export_gltf_for_order_line_task | domains/pipeline/tasks/export_glb_geometry.py |
| rendering/tasks.py | export_blend_for_order_line_task | domains/rendering/tasks/export_blend.py |
| rendering/tasks.py | publish_asset | domains/media/tasks.py |

step_tasks.py becomes a compatibility shim (import-only, deprecated) until all
callers are updated. Remove it in Phase 3.

2.2 Render Job Document Integration

Every Celery task in the new structure:

  1. Reads/creates RenderJobDocument at task start
  2. Updates the relevant step via job_doc.update_step(step_name, status="running")
  3. On completion: job_doc.update_step(step_name, status="done", duration_s=elapsed)
  4. On failure: job_doc.set_state("failed") + job_doc.append_error(...)
  5. Writes document back to order_lines.render_job_doc
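
That lifecycle as a context manager each task can wrap a step in; a sketch, with load_job_document/save_job_document as hypothetical persistence helpers:

# sketch - wraps one pipeline step in the document lifecycle described above
import time
from contextlib import contextmanager


@contextmanager
def tracked_step(session, order_line, step_name: str):
    job_doc = load_job_document(order_line)            # hypothetical helper
    job_doc.update_step(step_name, status="running")
    save_job_document(session, order_line, job_doc)    # hypothetical helper
    start = time.monotonic()
    try:
        yield job_doc
    except Exception as exc:
        job_doc.update_step(step_name, status="failed")
        job_doc.set_state("failed")
        job_doc.append_error(str(exc))
        raise
    else:
        job_doc.update_step(step_name, status="done",
                            duration_s=time.monotonic() - start)
    finally:
        save_job_document(session, order_line, job_doc)  # write back to render_job_doc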

2.3 Render Job Cancellation (Proper)

Current problem:
celery_app.control.revoke(f"render-{line_id}", terminate=True) — this ID is synthetic
and does not match the actual Celery task ID, so revoke is a no-op. The Blender process
continues running.

Solution:

  1. Store the actual Celery task ID in render_job_doc.celery_task_id when the task starts
  2. Cancel endpoint reads render_job_doc.celery_task_id and revokes with that real ID
  3. The render subprocess uses start_new_session=True (already done in render_blender.py) and stores proc.pid in the job document
  4. On SIGTERM, the Celery task's signal handler calls os.killpg(pgid, SIGTERM), waits 10s, then os.killpg(pgid, SIGKILL) (sketched after this list)
  5. Clean up: remove partial output file, remove _frames_* temp directory
  6. Update render_job_doc.state = "cancelled", clear OrderLine.render_status = "cancelled"
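
A sketch of steps 4-5, assuming the subprocess was started with start_new_session=True so its process group can be killed safely:

# sketch - SIGTERM handling for a running Blender render
import os
import shutil
import signal
import time


def terminate_render(proc, output_path: str | None, frames_dir: str | None) -> None:
    try:
        pgid = os.getpgid(proc.pid)
        os.killpg(pgid, signal.SIGTERM)          # ask Blender to exit cleanly
        deadline = time.monotonic() + 10
        while proc.poll() is None and time.monotonic() < deadline:
            time.sleep(0.5)
        if proc.poll() is None:
            os.killpg(pgid, signal.SIGKILL)      # hard kill after the 10s grace
    except ProcessLookupError:
        pass                                     # process group already gone
    # step 5: remove partial output + temp frame directory
    if output_path and os.path.exists(output_path):
        os.remove(output_path)
    if frames_dir and os.path.isdir(frames_dir):
        shutil.rmtree(frames_dir, ignore_errors=True)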

Files to modify:

  • backend/app/api/routers/orders.py — read celery_task_id from job doc, not synthetic ID
  • backend/app/domains/rendering/tasks/render_still.py — store task ID + PID in job doc, register SIGTERM handler
  • backend/app/domains/rendering/tasks/render_turntable.py — same

2.4 GPU-Primary Rendering

Current state:
cycles_device defaults to "auto". When GPU is unavailable, Blender silently falls back
to CPU with no log message. The _activate_gpu() function in blender_render.py already
probes for GPU but the result is not reflected in the render job document.

Target:

  • cycles_device default changes from "auto" to "gpu" in system settings
  • _activate_gpu() result is logged with [GPU_PROBE] prefix:
    • Success: [GPU_PROBE] RTX 3090 activated (OPTIX) — using GPU render
    • Failure: [GPU_PROBE] No GPU found, falling back to CPU — set cycles_device=cpu to suppress this warning
  • GPU type and fallback reason are written to render_job_doc.result.gpu_info
  • Admin UI shows GPU status on the Settings page (already partially exists via worker activity)
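
A sketch of the enhanced probe inside Blender Python; the Cycles preferences API calls are real, the log wording follows the examples above, and the returned dict shape for gpu_info is an assumption:

# render-worker/scripts/_blender_gpu.py - sketch of the enhanced probe
import bpy


def activate_gpu() -> dict:
    """Try OPTIX, then CUDA; return gpu_info for render_job_doc.result."""
    prefs = bpy.context.preferences.addons["cycles"].preferences
    for backend in ("OPTIX", "CUDA"):
        prefs.compute_device_type = backend
        prefs.get_devices()
        gpus = [d for d in prefs.devices if d.type != "CPU"]
        if gpus:
            for d in gpus:
                d.use = True
            bpy.context.scene.cycles.device = "GPU"
            name = gpus[0].name
            print(f"[GPU_PROBE] {name} activated ({backend}) — using GPU render")
            return {"device": "GPU", "backend": backend, "name": name}
    print("[GPU_PROBE] No GPU found, falling back to CPU — "
          "set cycles_device=cpu to suppress this warning")
    bpy.context.scene.cycles.device = "CPU"
    return {"device": "CPU", "backend": None, "fallback_reason": "no GPU device found"}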

Files to modify:

  • render-worker/scripts/blender_render.py — enhance _activate_gpu() logging
  • backend/app/api/routers/admin.py — change default cycles_device to "gpu"
  • backend/app/domains/rendering/job_document.py — add gpu_info field to result

2.5 Blender Script Modularity

Current state:
render-worker/scripts/blender_render.py is 853 lines with everything inline.

Target structure:

render-worker/scripts/
├── blender_render.py          — entry point, arg parsing, top-level flow
├── _blender_gpu.py            — GPU probe + activation
├── _blender_import.py         — GLB import, rotation, smooth shading
├── _blender_materials.py      — material library application + fallback
├── _blender_camera.py         — auto camera from bbox, clip planes
├── _blender_scene.py          — scene setup (Mode A vs Mode B)
└── _blender_post.py           — (currently Pillow overlay — DELETE THIS FILE)

blender_render.py imports from these sub-modules. Blender Python's sys.path is updated
at the top of the script to include the scripts directory.


Phase 3: Code Deletion (Weeks 3-4, parallel with Phase 2)

3.1 Remove Pillow Overlay Code

Location: render-worker/scripts/blender_render.py lines 798-851

Why it's dead: transparent_bg=True is always passed for production renders. The
else: branch at line 802 can never execute in production. The green Schaeffler bar is
now part of the .blend template, not post-processing.

Delete:

  • Lines 798-851 in blender_render.py (the entire if transparent_bg: ... else: try PIL... block)
  • Remove Pillow from render-worker dependencies in render-worker/Dockerfile
  • Remove the line "- Schaeffler green top bar + model name label via Pillow post-processing." from the script docstring

3.2 Remove STL Workflow Remnants

What to delete:

| Location | What to remove |
|---|---|
| backend/app/api/routers/admin.py | VALID_STL_QUALITIES, stl_quality from SettingsOut, SettingsUpdate, and all SETTINGS_DEFAULTS |
| backend/app/api/routers/admin.py | generate-missing-stls endpoint (if still present) |
| backend/app/api/routers/cad.py | generate-stl/{quality} endpoint |
| backend/app/services/render_blender.py | stl_quality parameter from render_still() and render_turntable_to_file() |
| backend/app/services/render_blender.py | key stl_duration_s → rename to glb_duration_s (remove the "key kept for backward compat" comment) |
| backend/app/tasks/step_tasks.py | generate_stl_cache task (check if it still exists) |
| render-worker/scripts/ | any _import_stl, _convert_stl, _scale_mm_to_m functions |
| backend/app/api/routers/analytics.py | avg_stl_s field in analytics response |
| all render log dicts | replace stl_size_bytes and stl_duration_s with glb_* equivalents |
| DB migration | new migration backend/alembic/versions/050_cleanup_stl_settings.py: DELETE FROM system_settings WHERE key = 'stl_quality' |

Files to delete entirely:

  • blender-renderer/ directory (already removed from docker-compose.yml, remove directory)
  • threejs-renderer/ directory (migration 033 already removed it from services)
  • flamenco/ directory (migration 032 removed Flamenco; verify nothing still imports from it)

Verify before deleting:

grep -rn "blender-renderer\|threejs-renderer\|flamenco" backend/ frontend/ --include="*.py" --include="*.ts" --include="*.tsx"

3.3 Remove Compat Shims

After all callers are migrated, delete these shim files:

  • backend/app/models/user.py (shim → domains/auth/models.py)
  • backend/app/models/cad_file.py (shim → domains/products/models.py)
  • backend/app/services/render_dispatcher.py (shim, 10 lines)
  • backend/app/services/material_service.py (shim → domains/materials/service.py)
  • backend/app/services/render_blender.py (move fully into domains/rendering/)
  • backend/app/models/ directory → all models are already in domains/*/models.py

3.4 Remove Duplicate render_order_line_task

step_tasks.render_order_line_task (lines 705-1050 of step_tasks.py) duplicates
rendering/tasks.render_order_line_still_task. The step_tasks version has more
baggage (compat imports, emit() calls, stl_quality references). Delete the step_tasks
version, migrate all queue routes to the rendering/tasks version.

Migration:

  • celery_app.py task routes: drop the app.tasks.step_tasks.* entries from the routing table once all tasks are migrated
  • Update CLAUDE.md to reflect new task locations

Phase 4: Tenant & Auth (Weeks 5-6)

4.1 Role Hierarchy

Current roles: admin | project_manager | client

Target roles:

class UserRole(str, enum.Enum):
    global_admin    = "global_admin"     # platform operator, bypass RLS, all tenants
    tenant_admin    = "tenant_admin"     # per-tenant admin, full control within tenant
    project_manager = "project_manager"  # order/render management within tenant
    client          = "client"           # read own orders, create draft orders

Permission matrix:

| Permission | GlobalAdmin | TenantAdmin | ProjectManager | Client |
|---|---|---|---|---|
| Manage tenants | YES | no | no | no |
| Manage users (all tenants) | YES | no | no | no |
| Manage users (own tenant) | YES | YES | no | no |
| All system settings | YES | YES | no | no |
| Trigger renders | YES | YES | YES | no |
| View all orders in tenant | YES | YES | YES | no |
| Create/view own orders | YES | YES | YES | YES |
| Reject orders | YES | YES | YES | no |
| Delete renders | YES | YES | YES | no |
| View analytics | YES | YES | YES | no |

DB migration:

  • backend/alembic/versions/051_role_hierarchy.py — rename admin → global_admin, add tenant_admin to the userrole enum; backfill existing admin users to global_admin

Auth utilities:

  • require_global_admin() — replaces require_admin()
  • require_tenant_admin_or_above() — TenantAdmin or GlobalAdmin
  • require_pm_or_above() — PM, TenantAdmin, GlobalAdmin

4.2 Tenant Isolation — Consistency Audit

The problem:
database.py:build_tenant_db_dep() yields the session without setting RLS context
(line 92 comments say "context-setting happens via set_tenant_context when needed").
This means every endpoint that uses Depends(get_db) bypasses RLS.

Fix — Middleware approach (preferred):

# backend/app/core/middleware.py
class TenantContextMiddleware(BaseHTTPMiddleware):
    """Set PostgreSQL RLS context on every request from JWT claims."""

    BYPASS_PATHS = {"/health", "/api/auth/login", "/api/auth/refresh"}

    async def dispatch(self, request: Request, call_next):
        if request.url.path in self.BYPASS_PATHS:
            return await call_next(request)

        token = self._extract_token(request)
        if token:
            payload = decode_token_safe(token)
            tenant_id = payload.get("tenant_id")
            role = payload.get("role")
            request.state.tenant_id = tenant_id
            request.state.role = role

        response = await call_next(request)
        return response

The get_db dependency is modified to read tenant_id from request.state:

async def get_db(request: Request) -> AsyncGenerator[AsyncSession, None]:
    async with AsyncSessionLocal() as session:
        tenant_id = getattr(request.state, "tenant_id", None)
        role = getattr(request.state, "role", None)
        if tenant_id:
            # SET LOCAL cannot take bind parameters over the extended protocol
            # (e.g. asyncpg); set_config(..., true) is the parameterizable,
            # transaction-scoped equivalent.
            await session.execute(
                text("SELECT set_config('app.current_tenant_id', :tid, true)"),
                {"tid": "bypass" if role == "global_admin" else str(tenant_id)},
            )
        yield session

4.3 Tenant Isolation Strategy — Shared vs. Dedicated Containers

Decision: Shared containers with DB-level isolation (current model)

Analysis:

| Factor | Shared containers | Dedicated containers per tenant |
|---|---|---|
| Cost | Low (6 containers total) | High (6 containers × N tenants) |
| Complexity | Low | Very high (orchestration, networking) |
| Data isolation | DB-level (RLS) | Full OS-level |
| GPU sharing | Single GPU shared | Dedicated GPU per tenant (expensive) |
| Blender jobs | Queue + concurrency control | Per-tenant render queue |
| Failure blast radius | All tenants affected by worker crash | Isolated per tenant |
| Scaling | Celery autoscale | Docker Swarm / Kubernetes HPA |
| Migration effort | Weeks (Phase 3-4) | Months (new orchestration layer) |

Recommendation: Maintain shared containers with DB-level RLS isolation. Dedicated
containers are only justified if tenants have strict contractual data isolation requirements
(e.g., GDPR-mandated separate processing). For the current internal use case (Schaeffler
internal teams), RLS + tenant_id partitioning is sufficient.

If dedicated containers are required in future:

  • Docker Compose override file per tenant (docker-compose.{tenant-slug}.yml)
  • Each tenant gets own PostgreSQL schema (not separate DB) with schema-based routing
  • Shared MinIO with per-tenant bucket policies
  • Separate Redis database (0-15) per tenant (max 16 tenants)
  • Celery routing: per-tenant queue prefix {tenant_slug}.thumbnail_rendering

4.4 Per-Tenant Feature Flags

Add a tenant_config JSONB column to the tenants table:

# backend/alembic/versions/052_tenant_feature_flags.py
tenant_config JSONB DEFAULT '{
    "max_concurrent_renders": 3,
    "render_engines_allowed": ["cycles"],
    "max_order_size": 500,
    "fallback_material": "SCHAEFFLER_059999_FailedMaterial",
    "notifications_enabled": true,
    "invoice_prefix": "INV"
}'

Feature flags checked at render dispatch time:

  • max_concurrent_renders — enforced in Celery queue routing (dispatch-time check sketched below)
  • render_engines_allowed — validated in OutputType creation
  • fallback_material — passed to Blender scripts (see §5.1)
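
A sketch of the dispatch-time check for the first flag; the model import path and the status values counted as "active" are assumptions:

# sketch - enforce tenant_config.max_concurrent_renders before dispatch
from sqlalchemy import func, select

from app.domains.orders.models import OrderLine  # assumed import path


def can_dispatch_render(session, tenant) -> bool:
    limit = (tenant.tenant_config or {}).get("max_concurrent_renders", 3)
    active = session.execute(
        select(func.count())
        .select_from(OrderLine)
        .where(
            OrderLine.tenant_id == tenant.id,
            OrderLine.render_status.in_(("pending", "processing")),
        )
    ).scalar_one()
    return active < limit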

Phase 5: Material & Rendering Improvements (Weeks 5-6)

5.1 Fallback Material — SCHAEFFLER_059999_FailedMaterial

Current state:
step_processor.py:MATERIAL_PALETTE assigns rainbow colors from a palette when material
assignment fails or no material is specified. blender_render.py has its own
PALETTE_LINEAR for the same purpose.

Target:
When material resolution fails (no alias, no exact match, material library link broken),
assign SCHAEFFLER_059999_FailedMaterial (magenta) so failed assignments are immediately
visible in renders.

Implementation:

  • domains/materials/service.py:resolve_material_map() — instead of pass-through, return SCHAEFFLER_059999_FailedMaterial for unresolved parts (configurable per-tenant via tenant_config.fallback_material)
  • render-worker/scripts/blender_render.py — when material library is provided but a
    part name does not match any library material, assign SCHAEFFLER_059999_FailedMaterial rather than palette color
  • render-worker/scripts/_blender_materials.py — a new sub-module for material logic
    with explicit logging (see the sketch after this list): [MATERIAL] part 'Outer_Ring' → 'SCHAEFFLER_010101_Steel-Bare' (alias match)
    and [MATERIAL] part 'Unknown_Part' → 'SCHAEFFLER_059999_FailedMaterial' (no match)
  • step_processor.py — remove MATERIAL_PALETTE and _material_to_color(); the palette is no longer used once fallback material is in place. Part colors for geometry GLB viewer should come from the material library color map, not a rainbow palette.
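
A sketch of the resolution logic with the logging shown above; the library/alias dict shapes are assumptions:

# render-worker/scripts/_blender_materials.py - sketch
FAILED_MATERIAL = "SCHAEFFLER_059999_FailedMaterial"  # magenta; overridable via tenant_config


def resolve_material(part_name: str, library: dict, aliases: dict) -> str:
    """Map a part name to a library material, falling back to magenta."""
    if part_name in library:
        print(f"[MATERIAL] part '{part_name}' → '{part_name}' (exact match)")
        return part_name
    alias_target = aliases.get(part_name)
    if alias_target and alias_target in library:
        print(f"[MATERIAL] part '{part_name}' → '{alias_target}' (alias match)")
        return alias_target
    print(f"[MATERIAL] part '{part_name}' → '{FAILED_MATERIAL}' (no match)")
    return FAILED_MATERIAL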

5.2 Remove EEVEE Fallback

Current state:
render_blender.py has an EEVEE-to-Cycles fallback:

if returncode > 0 and engine == "eevee":
    logger.warning("EEVEE failed (exit %d) — retrying with Cycles", returncode)
    returncode, stdout_lines2, stderr_lines2 = _run("cycles")
    engine_used = "cycles (eevee fallback)"

This hides failures and makes debugging harder. Per the Blender 5.0.1 requirement, EEVEE
Next should work reliably. If it fails, it should be a hard failure, not a silent retry.

Target: Remove the EEVEE-to-Cycles fallback. If EEVEE fails, the task fails with a
clear error. Add an EEVEE_FALLBACK_ENABLED system setting that defaults to false.

5.3 Remove Blender Version Check

Current state:
backend/app/services/render_blender.py defines:

MIN_BLENDER_VERSION = (5, 0, 1)

This constant is defined but the check that uses it has been removed. Search for any
remaining version-comparison code in blender_render.py and render scripts.

Target:

  • Remove MIN_BLENDER_VERSION = (5, 0, 1) from render_blender.py
  • Remove any bpy.app.version comparisons in render scripts
  • Blender 5.0.1+ is assumed; older versions are not supported

Phase 6: Notification Center Refactor (Week 7)

6.1 Current Problems

Per-render notifications (render.completed, render.failed) fire for every single
OrderLine. An order with 200 lines generates 200 notifications. This is too noisy.

6.2 Notification Architecture

Three channels:

  1. Activity Feed (/api/activity) — per-action events: every render start/complete, every order state change, every upload. Low-level, not shown in bell dropdown. Available in a dedicated /activity page for debugging.

  2. Notification Center (/api/notifications) — batch summaries only:

    • "Order #ORD-2026-042 rendering complete: 47/50 succeeded, 3 failed"
    • "Excel import failed: 12 products skipped (see import log)"
    • "Worker recovery: 3 stalled renders requeued after 120min timeout"
  3. System Alerts (admin only) — infrastructure issues: GPU probe failed, Blender binary not found, Redis connection lost.

Notification trigger rules:

  • render.completed per-line → suppress; emit batch when ALL lines in order reach terminal state
  • render.failed per-line → suppress; emit batch on order completion
  • excel.imported → one notification per upload with summary counts
  • order.submitted → one notification (always keep)
  • System alerts → always emit individually

DB changes:

  • audit_log — add channel VARCHAR(20) column: activity | notification | alert
  • notification_configs — extend event_type to include new batch event types
  • New beat task: batch_render_notifications — runs every 60s, checks for orders where
    all lines are terminal but no batch notification has been emitted; emits the summary
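
The beat task sketched below; the query helper, the order fields, and the sent-marker are assumptions, while emit_batch_notification is the new service function named in §6.3:

# app/tasks/beat_tasks.py - sketch of batch_render_notifications (every 60s)
TERMINAL_STATES = ("completed", "failed", "cancelled")


@celery_app.task
def batch_render_notifications():
    with get_sync_session() as session:                      # hypothetical session helper
        for order in orders_with_all_lines_terminal(session):  # hypothetical query helper
            if order.batch_notification_sent:                # hypothetical marker column
                continue
            ok = sum(1 for line in order.lines if line.render_status == "completed")
            bad = sum(1 for line in order.lines if line.render_status == "failed")
            emit_batch_notification(
                order,
                f"Order #{order.order_number} rendering complete: "
                f"{ok}/{len(order.lines)} succeeded, {bad} failed",
            )
            order.batch_notification_sent = True
        session.commit()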

6.3 Per-User Notification Preferences

Current notification_configs table has event_type + channel + enabled. Extend:

  • Add frequency: str column — immediate | hourly | daily | never
  • Frequency is respected by the batch notification beat task

Files to modify:

  • backend/app/domains/notifications/models.py — add channel, frequency columns
  • backend/app/services/notification_service.py — add emit_batch_notification() function
  • backend/app/tasks/beat_tasks.py — add batch_render_notifications schedule
  • frontend/src/pages/NotificationSettings.tsx — add frequency selector per event type
  • frontend/src/pages/Notifications.tsx — separate tabs for Activity | Notifications | Alerts

Phase 7: UI/UX Improvements (Weeks 7-8)

7.1 Tooltip / Help Text System

Every setting, parameter, and action in the Admin UI and order wizard needs a tooltip
explaining what it does and what it affects in the pipeline.

Architecture:

// frontend/src/help/helpTexts.ts
export const HELP_TEXTS: Record<string, HelpText> = {
  "setting.blender_cycles_samples": {
    title: "Cycles Samples",
    body: "Number of render samples per pixel. Higher = better quality, longer render time. 256 is a good balance for product shots. 64 is fast for previews.",
    affects: ["render quality", "render time"],
    unit: "samples",
    range: [1, 4096],
    recommendation: "256 for production, 64 for preview",
  },
  "setting.gltf_preview_linear_deflection": {
    title: "3D Viewer Mesh Quality",
    body: "Controls tessellation precision for the 3D browser viewer. Lower values = finer mesh, larger file. 0.1mm is a good default for medium-complexity parts.",
    affects: ["3D viewer file size", "viewer load time"],
    unit: "mm",
  },
  "action.regenerate_thumbnails": {
    title: "Regenerate All Thumbnails",
    body: "Re-renders thumbnails for all STEP files using current settings. This queues all files on the thumbnail_rendering worker. Expected time: N × 30s. Only needed after changing renderer settings.",
    warning: "This will queue a large number of tasks. Only run during off-peak hours.",
  },
  // ... all settings
}
// frontend/src/components/HelpTooltip.tsx
interface HelpTooltipProps {
  helpKey: string
  position?: "top" | "right" | "bottom" | "left"
}

export function HelpTooltip({ helpKey, position = "right" }: HelpTooltipProps) {
  const help = HELP_TEXTS[helpKey]
  if (!help) return null
  return (
    <Tooltip content={<HelpContent help={help} />} position={position}>
      <HelpCircle size={14} className="text-text-muted ml-1 cursor-help" />
    </Tooltip>
  )
}

Where to add tooltips (minimum required):

  • All system_settings keys in Admin > Settings
  • All OutputType.render_settings fields in the OutputType editor
  • All RenderTemplate fields in the template editor
  • All actions in Admin > Settings (regenerate thumbnails, process unprocessed, etc.)
  • All fields in the Order Wizard with non-obvious meaning

7.2 Media Browser Refactor

Current state:
frontend/src/pages/MediaBrowser.tsx — exists, but its current filter capabilities are undocumented.

Target:
Server-side filtered media browser with:

  • Filters: lagertyp | category_key | render_status | asset_type | tenant_id (admin)
  • Text search on product name, pim_id
  • Server-side pagination (50 per page)
  • Virtual scroll for large catalogs (react-virtual or TanStack Virtual)
  • Batch download selected assets

API changes:

GET /api/media/assets?
  asset_type=still&
  category_key=TRB&
  lagertyp=Axial-Zylinderrollenlager&
  render_status=completed&
  page=1&
  page_size=50&
  q=81113

DB indexes required:

-- backend/alembic/versions/053_media_browser_indexes.py
CREATE INDEX ix_media_assets_asset_type_created ON media_assets(asset_type, created_at DESC);
CREATE INDEX ix_products_category_lagertyp ON products(category_key, lagertyp);
CREATE INDEX ix_products_name_gin ON products USING GIN(to_tsvector('simple', COALESCE(name, '') || ' ' || COALESCE(pim_id, '')));

Files to modify:

  • backend/app/domains/media/router.py — add GET /assets with filter params
  • backend/app/domains/media/schemas.py — add MediaAssetFilter Pydantic model
  • frontend/src/pages/MediaBrowser.tsx — complete rewrite with virtual scroll
  • frontend/src/api/media.ts — add getMediaAssets(filters) function

7.3 Workflow Editor — Pipeline Step Nodes

Current state:
WorkflowEditor.tsx has 5 node types (Upload, Parse, Render, Export, Deliver) but they
do not map to actual Celery tasks. WorkflowDefinition.config is a free-form JSONB blob
with no schema validation.

Target:
Node types correspond 1:1 to ProcessStep enum values. The workflow editor saves a
validated workflow config that the dispatch_workflow() function can execute.

WorkflowDefinition config schema:

{
  "version": 1,
  "nodes": [
    {"id": "n1", "step": "extract_metadata", "params": {}},
    {"id": "n2", "step": "render_thumbnail", "params": {"engine": "cycles", "samples": 64}},
    {"id": "n3", "step": "render_still", "params": {"width": 2048, "height": 2048}},
    {"id": "n4", "step": "export_glb", "params": {"quality": "high"}},
    {"id": "n5", "step": "deliver", "params": {}}
  ],
  "edges": [
    {"from": "n1", "to": "n2"},
    {"from": "n2", "to": "n3"},
    {"from": "n3", "to": "n4"},
    {"from": "n4", "to": "n5"}
  ]
}

Backend validation: workflow_router.py validates that all step values are in
ProcessStep enum before saving.
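
A sketch of that validation as Pydantic v2 models; the class names are assumptions, and only the step check is the point:

# sketch - schema validation for WorkflowDefinition.config (workflow_router.py)
from pydantic import BaseModel, Field, field_validator

from app.domains.pipeline.steps import ProcessStep


class WorkflowNode(BaseModel):
    id: str
    step: str
    params: dict = Field(default_factory=dict)

    @field_validator("step")
    @classmethod
    def step_must_be_registered(cls, v: str) -> str:
        if v not in {s.value for s in ProcessStep}:
            raise ValueError(f"unknown pipeline step: {v!r}")
        return v


class WorkflowEdge(BaseModel):
    from_: str = Field(alias="from")   # 'from' is a Python keyword
    to: str


class WorkflowConfig(BaseModel):
    version: int
    nodes: list[WorkflowNode]
    edges: list[WorkflowEdge]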

Frontend: WorkflowEditor.tsx builds available node types from a GET /api/workflows/steps
endpoint that returns all ProcessStep entries with their parameter schemas.

7.4 Kanban Rejection Flow

Current state:
OrderStatus.rejected exists but the rejection flow is undefined. The admin panel has no
rejection UI. rejected_at column exists but there is no rejection reason field.

Target flow:

  1. Who can reject: ProjectManager, TenantAdmin, GlobalAdmin
  2. Trigger: POST /api/orders/{id}/reject with body {"reason": "...", "notify_client": true} (endpoint sketched after this list)
  3. What happens:
    • Order status → rejected, rejected_at = now
    • rejection_reason stored (new Text column on Order)
    • All pending/processing renders are cancelled (same as cancel-renders endpoint)
    • Notification emitted to order creator: "Your order #ORD-2026-042 was rejected. Reason: ..."
    • Audit log entry created
  4. Client sees: Order status badge changes to REJECTED with reason visible
  5. Re-submission: Client can POST /api/orders/{id}/resubmit which clears rejection, resets to draft, allowing edits before re-submitting. Re-submit creates a new audit log entry and emits notification to PMs.
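
Steps 2-3 as an endpoint sketch; the Order import path and the cancel/notify/audit helpers are assumptions, and require_pm_or_above comes from Phase 4.1:

# sketch - POST /api/orders/{id}/reject
from datetime import datetime, timezone

from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel

from app.domains.orders.models import Order  # assumed import path

router = APIRouter()


class RejectBody(BaseModel):
    reason: str
    notify_client: bool = True


@router.post("/api/orders/{order_id}/reject")
async def reject_order(order_id: str, body: RejectBody,
                       user=Depends(require_pm_or_above), db=Depends(get_db)):
    order = await db.get(Order, order_id)
    if order is None:
        raise HTTPException(status_code=404)
    order.status = "rejected"
    order.rejected_at = datetime.now(timezone.utc)
    order.rejection_reason = body.reason                    # new column (migration 054)
    await cancel_pending_renders(db, order)                 # hypothetical; reuses cancel-renders logic
    if body.notify_client:
        await notify_order_creator(order, body.reason)      # hypothetical notification helper
    await write_audit_log(db, user, "order.rejected", order.id)  # hypothetical audit helper
    await db.commit()
    return {"status": "rejected"}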

DB migration:

  • backend/alembic/versions/054_order_rejection.py — add rejection_reason TEXT to orders

Phase 8: Scalable Workers (Week 8)

8.1 Current Concurrency Controls

  • worker (step_processing): CELERY_WORKER_CONCURRENCY env var, default 8
  • render-worker (thumbnail_rendering): hardcoded 1 (Blender serial access)
  • Both require Docker service restart to change concurrency

8.2 Dynamic Worker Scaling

Short term (no Kubernetes): Use Celery's built-in autoscale option:

# docker-compose.yml
render-worker:
  command: celery -A app.tasks.celery_app worker
    --loglevel=info
    -Q thumbnail_rendering
    --autoscale=1,1   # max=1, min=1 (Celery's autoscale order is max,min; single Blender concurrency)
    --concurrency=1

For worker:

worker:
  command: celery -A app.tasks.celery_app worker
    --loglevel=info
    -Q step_processing,ai_validation
    --autoscale=${MAX_CONCURRENCY:-8},${MIN_CONCURRENCY:-2}

Per-queue concurrency via DB:
Add a worker_configs table:

CREATE TABLE worker_configs (
    queue_name VARCHAR(100) PRIMARY KEY,
    max_concurrency INT NOT NULL DEFAULT 8,
    min_concurrency INT NOT NULL DEFAULT 2,
    updated_at TIMESTAMP NOT NULL DEFAULT now()
);

A beat task apply_worker_concurrency runs every 5 minutes and uses Celery control
commands to adjust pool size:

celery_app.control.broadcast("pool_shrink", arguments={"n": 2}, destination=["worker@host"])
celery_app.control.broadcast("pool_grow", arguments={"n": 4}, destination=["worker@host"])

Long term (Kubernetes):
Workers run as Kubernetes Deployments with HPA on celery_queue_length metric (exposed via
Flower or a custom /metrics endpoint for Prometheus). Render-workers use GPU node pools
with nvidia.com/gpu: 1 resource requests.

8.3 Worker Health Recovery

Current state:
beat_tasks.recover_stuck_cad_files runs every 5 minutes and handles stuck processing state.

Extend to:

  • Detect render_status = 'processing' with render_started_at > render_stall_timeout_minutes ago
  • SIGTERM any still-running Blender PID (stored in the render job document alongside celery_task_id, see §2.3)
  • Reset render_status to failed, update render_job_doc.state = 'failed'
  • Emit system alert notification (admin channel)
  • Log with [WORKER_RECOVERY] Stalled render for order_line {id} terminated after {N}min
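
A sketch of the stalled-render sweep; the job-document helpers and the timeout setting lookup are assumptions:

# sketch - stalled-render recovery added to the existing beat task
import logging
from datetime import datetime, timedelta, timezone

from sqlalchemy import select

from app.domains.orders.models import OrderLine  # assumed import path

logger = logging.getLogger(__name__)


def recover_stalled_renders(session, timeout_minutes: int) -> None:
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=timeout_minutes)
    stalled = session.execute(
        select(OrderLine).where(
            OrderLine.render_status == "processing",
            OrderLine.render_started_at < cutoff,
        )
    ).scalars().all()
    for line in stalled:
        job_doc = load_job_document(line)             # hypothetical helper
        if getattr(job_doc, "pid", None):
            terminate_render_pgid(job_doc.pid)        # SIGTERM then SIGKILL, see §2.3
        line.render_status = "failed"
        job_doc.set_state("failed")
        save_job_document(session, line, job_doc)     # hypothetical helper
        logger.warning("[WORKER_RECOVERY] Stalled render for order_line %s "
                       "terminated after %dmin", line.id, timeout_minutes)
        # plus: emit a system alert on the admin channel (see §6.2)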

Detailed Task Breakdown by Area

A. step_tasks.py Decomposition

Current problems:

  • 1,170 lines, 8 distinct Celery tasks, many private helpers, multiple inline DB session creation patterns
  • Imports scattered: some at module level, some inside functions (Celery pattern)
  • render_order_line_task (lines 705-1050) duplicates render_order_line_still_task

Migration path:

  1. Create new domains/pipeline/tasks/ directory with one file per step
  2. Each new task calls PipelineLogger instead of bare logger.info
  3. Each new task writes to render_job_doc via job_document.py helpers
  4. Old step_tasks.py becomes import-only shim: from app.domains.pipeline.tasks.extract_metadata import process_step_file
  5. After 2-week migration period, delete step_tasks.py

B. Auth Token Claims

Current: {"sub": user_id, "role": role, "exp": expires} — no tenant_id in token

Target: {"sub": user_id, "role": role, "tenant_id": str(tenant_id), "exp": expires}

Impact: Existing tokens lack the tenant_id claim, so they cannot be trusted for RLS scoping.
Mitigation: Rotate JWT_SECRET_KEY as part of the deployment to invalidate all existing tokens and force re-login.

C. Celery Task Routing Update

After Phase 2 decomposition, update celery_app.conf.update(task_routes={...}):

task_routes = {
    "app.domains.pipeline.tasks.*": {"queue": "step_processing"},
    "app.domains.rendering.tasks.*": {"queue": "thumbnail_rendering"},
    "app.domains.media.tasks.*": {"queue": "step_processing"},
    "app.tasks.ai_tasks.*": {"queue": "ai_validation"},
    "app.tasks.beat_tasks.*": {"queue": "step_processing"},
}

D. Frontend API Client Consistency

All frontend/src/api/*.ts files should:

  • Use the axios client from api/client.ts (which injects X-Tenant-ID header)
  • Export typed interfaces for all response shapes
  • Use useQuery / useMutation from TanStack Query, not bare axios.get in components

Audit needed: Check each api/*.ts file to confirm X-Tenant-ID header is sent
(it is wired in the axios interceptor per commit 5da90b5, but verify all files use
the configured client, not axios.create() directly).


Architectural Decisions (ADRs)

ADR-001: Shared containers vs. per-tenant containers

Decision: Shared containers with PostgreSQL RLS
Rationale: Cost and complexity savings. RLS provides adequate isolation for internal use.
Consequences: Must ensure RLS is applied consistently (Phase 1.3). Blender sessions are
shared; GPU contention is managed via Celery queue depth, not isolation.

ADR-002: Render Job Document as JSONB

Decision: Store render job state machine as JSONB in order_lines.render_job_doc
Rationale: Avoids additional workflow_node_results table queries for debugging;
JSONB is flexible for schema evolution; indexed for state-based queries.
Alternatives considered: Separate render_job_steps table — rejected (too many joins
for the common "show me render status" query).

ADR-003: No per-render notifications

Decision: Suppress individual render.completed notifications; emit batch at order completion
Rationale: An order with 200 lines generates 200 notifications under the current model.
Batch summaries at order completion are actionable; per-render events are noise.
Consequences: Activity feed still records all events for debugging.

ADR-004: GPU-first rendering

Decision: Default cycles_device = "gpu", explicit log on CPU fallback
Rationale: The render-worker has GPU reservation in docker-compose.yml. CPU fallback
should be visible and logged, not silent.
Consequences: Renders on machines without GPU will always log a CPU fallback warning.

ADR-005: Fallback material over palette

Decision: Replace MATERIAL_PALETTE rainbow fallback with SCHAEFFLER_059999_FailedMaterial
Rationale: Failed material assignments should be immediately visible (magenta) rather
than disguised as intentional palette colors.
Consequences: Parts with missing material mapping will render magenta in both
thumbnail and production renders. This is a feature, not a bug.

ADR-006: Blender 5.0.1 minimum, no version guards

Decision: Remove all bpy.app.version checks and MIN_BLENDER_VERSION guards
Rationale: The project is Blender 5.0.1-only. Version shims add complexity without value.
Consequences: Running with an older Blender binary will cause cryptic errors. Document
the minimum version requirement clearly in the Dockerfile and README.


What Gets Deleted

Python files to delete entirely:

  • backend/app/models/user.py — compat shim
  • backend/app/models/cad_file.py — compat shim
  • backend/app/models/order.py — compat shim (if exists)
  • backend/app/models/order_item.py — compat shim
  • backend/app/models/order_line.py — compat shim
  • backend/app/models/material.py — compat shim
  • backend/app/models/material_alias.py — compat shim
  • backend/app/models/render_template.py — compat shim
  • backend/app/models/output_type.py — compat shim
  • backend/app/models/system_setting.py — compat shim
  • backend/app/models/template.py — compat shim
  • backend/app/models/render_position.py — compat shim
  • backend/app/services/render_dispatcher.py — 10-line shim
  • backend/app/services/material_service.py — 3-line shim
  • backend/app/tasks/step_tasks.py — after Phase 2 migration complete
  • backend/app/domains/rendering/tasks.py — split into per-step files in Phase 2

Directories to delete entirely:

  • blender-renderer/ — HTTP microservice, removed from docker-compose in refactor/v2
  • threejs-renderer/ — removed in migration 033
  • flamenco/ — removed in migration 032

Code blocks to delete (within files):

  • render-worker/scripts/blender_render.py lines 798-851 — Pillow overlay
  • render-worker/scripts/blender_render.py line 17 — docstring Pillow mention
  • backend/app/services/render_blender.py line 17 — MIN_BLENDER_VERSION = (5, 0, 1)
  • backend/app/services/render_blender.py lines 229-233 — EEVEE-to-Cycles fallback
  • backend/app/services/step_processor.py lines 19-31 — MATERIAL_PALETTE + _material_to_color()
  • backend/app/api/routers/admin.py — VALID_STL_QUALITIES, stl_quality in all schemas

System settings to delete (DB migration):

  • stl_quality — GLB-only pipeline, no STL concept
  • threejs_render_size — renderer removed
  • thumbnail_renderer — was multi-value (pillow|blender|threejs), now always blender

Migration Strategy

Deployment Order (Zero-Downtime)

Step 1 — DB migrations (non-breaking):

  • Run migrations 048-054 (new columns: render_job_doc, rejection_reason, feature flags, etc.)
  • New columns are nullable, no existing queries break

Step 2 — Backend deploy (backward compatible):

  • Deploy new backend with compat shims in place
  • New endpoints and middleware active
  • Old endpoints still work
  • JWT tokens are extended with tenant_id claim (existing tokens without it still work via fallback in middleware)

Step 3 — Celery worker deploy:

  • Deploy new domains/pipeline/tasks/ structure
  • step_tasks.py compat shim routes to new functions
  • Old task names still registered via shim

Step 4 — Frontend deploy:

  • New WorkflowEditor with validated step types
  • HelpTooltip components added
  • MediaBrowser refactor with virtual scroll

Step 5 — Cleanup (breaking):

  • Remove compat shims
  • Delete step_tasks.py
  • Rotate JWT_SECRET_KEY to force re-login (tenant_id now required in claims)
  • Run DB migration to clean up stl_quality and threejs settings

Rollback Plan

  • All migrations have downgrade() implemented
  • Compat shims mean old task names still work during migration window
  • render_log column kept alongside render_job_doc until all consumers migrated

Testing Before Delete

Before deleting any compat shim or old code, verify:

grep -rn "<old_import_path>" backend/ frontend/ --include="*.py" --include="*.ts" --include="*.tsx"

Must return 0 results from non-shim files.


Open Questions

These require product decisions before implementation:

  1. Tenant onboarding flow — How are new tenants created? Self-service signup, or admin creates tenant + TenantAdmin user manually? What is the initial data setup?

  2. Blender binary distribution — Currently host-mounted (/opt/blender:/opt/blender:ro). If multiple render-workers run on different hosts in a future cluster, how is Blender distributed? Container image vs. network share?

  3. MinIO vs. filesystem storage — All media assets are stored on the local filesystem (/app/uploads volume). MinIO is configured but not used for primary storage yet. Should Phase 2 migrate assets to MinIO for horizontal scaling?

  4. Invoice workflow — billing/models.py has Invoice + InvoiceLine models and an invoices table (migration 042). Is billing actually used? If not, should it be removed to reduce complexity?

  5. AI validation (Azure OpenAI) — ai_tasks.py and azure_ai.py exist but Azure credentials are optional. Is this feature actively used or can it be removed?

  6. Email notifications — SMTP settings exist in system_settings but email sending is not implemented. Is this a required feature for the next phase?

  7. Rejection re-submission UX — When a client re-submits a rejected order, do they create a new order or update the existing one? The current data model supports only one status per order, not a history of submissions.

  8. Media browser download format — Bulk download: ZIP of individual files, or separate download links? ZIP requires server-side assembly which adds load.

  9. Tooltip language — Help texts in English (per CLAUDE.md coding standards) or German (for end-user-facing UI)? The admin UI is currently in English labels.

  10. 3D Viewer geometry quality — The gltf_preview_linear_deflection default is 0.1mm. For very small parts (sub-1mm features), this may be too coarse. Should the deflection auto-scale based on the CAD file's bounding box dimensions?agentId: a6cf206cd46b868cb (for resuming to continue this agent's work if needed) total_tokens: 132964 tool_uses: 72 duration_ms: 467361