Phase 1 of PLAN_REFACTOR.md — all four sub-tasks implemented:
1.1 PipelineLogger (backend/app/core/pipeline_logger.py)
- Structured step_start/step_done/step_error/step_progress API
- Publishes to Python logging AND Redis SSE via log_task_event
- Context manager `pl.step("name")` for auto-timing
1.2 RenderJobDocument (backend/app/domains/rendering/job_document.py)
- Pydantic JSONB schema: state machine + per-step records + timing
- begin_step/finish_step/fail_step/skip_step helpers
- Migration 048: adds render_job_doc JSONB column to order_lines
- OrderLine model updated with render_job_doc field
1.3 TenantContextMiddleware (backend/app/core/middleware.py)
- Decodes JWT, stores tenant_id + role in request.state
- get_db updated to auto-apply RLS SET LOCAL from request.state
- Registered in main.py (runs before every request)
- JWT now embeds tenant_id claim via create_access_token()
- Login endpoint passes tenant_id to token creation
1.4 ProcessStep Registry (backend/app/core/process_steps.py)
- StepName StrEnum with all 20 pipeline step names
- Single source of truth for log prefixes, DB records, UI labels
Also adds db_utils.py with set_tenant_sync() + get_sync_session()
for use inside Celery tasks (bypass-safe RLS helper).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Schaeffler Automat — Refactor Plan
Document date: 2026-03-08 Branch: refactor/v2 Author: Architecture review via Claude Code
Executive Summary
Current State
Schaeffler Automat is a working Blender-based media production pipeline with:
- Domain-driven backend structure (partially migrated, many compat shims still present)
- 7 Docker services with GPU render-worker
- PostgreSQL with tenant_id columns + Row Level Security (RLS) enabled but inconsistently applied at the application layer
- Celery task queues with two workers (step_processing + thumbnail_rendering)
- WebSocket real-time events via Redis Pub/Sub
- React/Vite frontend with workflow editor (ReactFlow), media browser, notifications
Core Problems
- `step_tasks.py` is 1,170 lines — a monolithic task file containing 8+ distinct pipeline steps
- Tenant isolation is partial: RLS is defined in DB migration 036, but `set_tenant_context()` is not called consistently in every router; Celery tasks bypass RLS entirely
- Pillow overlay code (green bar + model name label) is dead code — all renders use `transparent_bg=True`, but the 55-line block still runs conditionally
- STL workflow remnants: the `stl_quality` setting, `VALID_STL_QUALITIES`, and `stl_size_bytes` in render_log dicts still reference the old STL-based pipeline; the actual pipeline is GLB-only
- Render job cancellation uses a synthetic task ID (`render-{line_id}`) that does not match actual Celery task IDs — making revoke() a no-op
- The MATERIAL_PALETTE + palette fallback lives in `step_processor.py` — it should be replaced with `SCHAEFFLER_059999_FailedMaterial` (magenta) per the project goals
- Log messages are inconsistent: some use Python f-strings with no prefix, others use `[STEP_NAME]` markers; structured logging is not enforced
- `render_order_line_task` in `step_tasks.py` duplicates most of `render_order_line_still_task` in `domains/rendering/tasks.py`
- The `blender_render.py` Blender script is 853 lines with no sub-module structure
- No GPU-first enforcement: `cycles_device` defaults to "auto" with no explicit fallback log
Vision
A clean, modular pipeline where:
- Every step is a named `ProcessStep` with start/progress/done log events and a DB audit trail
- Render jobs are tracked as structured JSON documents (job tickets) in the DB
- Tenant isolation is enforced at the dependency-injection layer, not ad-hoc per endpoint
- Dead code (Pillow overlays, STL workflow, Flamenco shims, threejs renderer) is deleted
- The auth hierarchy supports GlobalAdmin > TenantAdmin > ProjectManager > Client
- Workers scale dynamically without service restarts
- Notifications are batched summaries, not per-render noise
Architecture Overview
Current Architecture
┌─────────────┐ HTTP ┌──────────────────────────────────────────┐
│ Frontend │ ──────────> │ backend:8888 (FastAPI) │
│ React/Vite │ │ ├─ domains/auth │
│ :5173 │ <─ WS ──── │ ├─ domains/orders │
└─────────────┘ │ ├─ domains/products │
│ ├─ domains/rendering │
│ ├─ domains/tenants │
│ └─ api/routers/ (compat shims) │
└──────────┬───────────────────────────────┘
│ Celery tasks via Redis broker
┌─────────────────┼──────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ worker │ │render-worker│ │ beat │
│ step_proc │ │thumbnail_ │ │ scheduler │
│ ai_valid │ │ rendering │ └─────────────┘
│ concurr=8 │ │ concurr=1 │
└─────────────┘ └──────▼──────┘
│ subprocess
┌──────▼──────┐
│ blender │
│ /opt/blend │
└─────────────┘
┌──────────────┐ ┌──────────┐ ┌──────────┐
│ PostgreSQL │ │ Redis │ │ MinIO │
│ :5432 │ │ :6379 │ │ :9000 │
└──────────────┘ └──────────┘ └──────────┘
Target Architecture (Post-Refactor)
┌─────────────────────────────────────────────────────────┐
│ Frontend React/Vite :5173 │
│ ├─ WorkflowEditor (ReactFlow) — visual pipeline │
│ ├─ MediaBrowser — server-side filtered + virtual scroll│
│ ├─ NotificationCenter — batched summaries only │
│ └─ Admin — tooltips on every setting │
└────────────────────┬────────────────────────────────────┘
│ HTTP + WebSocket
┌────────────────────▼────────────────────────────────────┐
│ backend:8888 (FastAPI) │
│ middleware: TenantContextMiddleware (injects RLS) │
│ ├─ domains/auth (GlobalAdmin|TenantAdmin|PM|Client)│
│ ├─ domains/pipeline (process step registry + dispatch) │
│ ├─ domains/rendering (render job documents, workflows) │
│ ├─ domains/products (CAD files, media assets) │
│ ├─ domains/orders (order state machine) │
│ ├─ domains/tenants (tenant management) │
│ └─ domains/billing (pricing, invoices) │
└────────────────────┬────────────────────────────────────┘
│ Celery canvas / chain / group
┌───────────────┼───────────────┐
│ │ │
┌────▼────┐ ┌──────▼──────┐ ┌────▼────┐
│ worker │ │render-worker│ │ beat │
│ step_ │ │ concurr=1 │ │ sched. │
│ process │ │ +Blender GPU│ │ recover │
│ concr=8 │ └──────▼──────┘ │ queues │
└─────────┘ │ └─────────┘
subprocess (SIGTERM → SIGKILL + cleanup)
│
┌──────▼──────┐
│ blender │ (GPU-first, explicit CPU-fallback log)
└─────────────┘
Phase 1: Foundation (Weeks 1–2)
Critical infrastructure that blocks everything else.
1.1 Structured Logging Framework
Current state:
Log messages are a mix of bare logger.info(f"..."), emit(order_line_id, "..."), and
log_task_event(task_id, "..."). No consistent prefix, no structured fields.
Target:
A PipelineLogger class that wraps Python's logging module and additionally writes
structured events to the DB (audit_log or a new pipeline_events table).
Design:
# backend/app/core/pipeline_logger.py
class PipelineLogger:
PREFIX_FORMAT = "[{step_name}]"
def step_start(self, step: str, context: dict): ...
def step_progress(self, step: str, pct: int, msg: str): ...
def step_done(self, step: str, duration_s: float, result: dict): ...
def step_error(self, step: str, error: str, exc: Exception | None): ...
Every log call emits:
- A Python `logging` line with a `[STEP_NAME]` message
- A Redis `log_task_event` for SSE streaming
- An optional DB insert into `pipeline_events` (task_id, step_name, level, message, duration_s, context JSONB, created_at)
Files to create:
- `backend/app/core/pipeline_logger.py` — PipelineLogger class
- `backend/alembic/versions/048_pipeline_events.py` — new table migration
Files to modify:
- All task files — replace bare `logger.info/error` with `PipelineLogger` calls
- `backend/app/core/task_logs.py` — keep Redis SSE publish, add DB write path
1.2 Render Job Document
Current state:
OrderLine.render_log is a loosely-structured JSONB dict. No schema, no state machine,
no step-level results stored.
Target:
A RenderJobDocument JSONB schema stored in order_lines.render_job_doc. Acts as the
single source of truth for a render job's state machine.
Schema (JSONB):
{
"version": 1,
"job_id": "<order_line_id>",
"created_at": "ISO8601",
"state": "pending|queued|running|completed|failed|cancelled",
"celery_task_id": "uuid",
"steps": [
{
"name": "resolve_step_path",
"status": "done",
"started_at": "ISO8601",
"completed_at": "ISO8601",
"duration_s": 0.02,
"output": {"step_path": "/app/uploads/..."}
},
{
"name": "occ_glb_export",
"status": "done",
"duration_s": 8.4,
"output": {"glb_path": "...", "size_bytes": 204800}
},
{
"name": "blender_render",
"status": "running",
"started_at": "ISO8601",
"gpu_type": "OPTIX",
"engine": "cycles",
"samples": 256
}
],
"error": null,
"result": {
"output_path": "...",
"duration_s": 34.2,
"engine_used": "cycles",
"gpu": "RTX 3090"
}
}
Migration:
- `backend/alembic/versions/049_render_job_document.py` — add `render_job_doc JSONB` to `order_lines`; keep `render_log` for backward compat (deprecate, remove in Phase 3)
Files to create:
- `backend/app/domains/rendering/job_document.py` — `RenderJobDocument` Pydantic model + helpers (`update_step`, `set_state`, `append_error`)
1.3 Tenant Context Middleware
Current state:
set_tenant_context() must be called manually in each endpoint. Celery tasks bypass RLS
entirely (they use sync engines without SET LOCAL app.current_tenant_id).
Problem:
Migration 036 enables RLS, but build_tenant_db_dep() in database.py actually yields
db without setting the tenant context (line 92: yield db # context-setting happens via set_tenant_context when needed). This means most endpoints are silently bypassing RLS.
Target:
A FastAPI middleware TenantContextMiddleware that automatically sets RLS context for
every request based on the JWT tenant_id claim.
# backend/app/core/middleware.py
class TenantContextMiddleware(BaseHTTPMiddleware):
async def dispatch(self, request: Request, call_next):
# Extract JWT, decode tenant_id
# Store in request.state.tenant_id
# After DB session is acquired, SET LOCAL app.current_tenant_id
...
JWT changes:
create_access_token() must embed tenant_id in claims:
payload = {"sub": user_id, "role": role, "tenant_id": str(tenant_id), "exp": expires}
Celery tasks:
All sync DB sessions in Celery tasks must receive tenant_id as a task argument and
execute session.execute(text("SET LOCAL app.current_tenant_id = :tid"), {"tid": tenant_id})
immediately after session creation. Add a _set_tenant(session, tenant_id) helper in
backend/app/core/db_utils.py.
Files to create:
- `backend/app/core/middleware.py` — TenantContextMiddleware
- `backend/app/core/db_utils.py` — `_set_tenant(session, tenant_id)`
Files to modify:
- `backend/app/main.py` — add middleware
- `backend/app/utils/auth.py` — embed tenant_id in JWT
- All Celery task functions — accept a `tenant_id: str | None` parameter and call `_set_tenant`
1.4 Process Step Registry
Current state:
Pipeline steps are implicit — scattered across step_tasks.py, rendering/tasks.py,
step_processor.py, render_blender.py. No central definition.
Target:
A ProcessStep enum and registry that all tasks reference by name.
# backend/app/domains/pipeline/steps.py
class ProcessStep(str, enum.Enum):
UPLOAD_STEP = "upload_step"
PARSE_EXCEL = "parse_excel"
EXTRACT_METADATA = "extract_metadata"
OCC_GLB_EXPORT = "occ_glb_export"
RENDER_THUMBNAIL = "render_thumbnail"
RENDER_STILL = "render_still"
RENDER_TURNTABLE = "render_turntable"
EXPORT_GLB = "export_glb"
EXPORT_BLEND = "export_blend"
DELIVER = "deliver"
Each step maps to exactly one Celery task and one workflow node type. This enum becomes
the contract between the visual workflow editor and the task executor.
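The registry side of this contract could look like the sketch below: each enum member keys a metadata record carrying the log prefix, UI label, and owning Celery task. The enum is abbreviated and the dotted task paths are illustrative, not the real module paths:

```python
import enum
from dataclasses import dataclass

class ProcessStep(str, enum.Enum):
    # Abbreviated; the full enum lists every pipeline step shown above.
    EXTRACT_METADATA = "extract_metadata"
    RENDER_THUMBNAIL = "render_thumbnail"
    RENDER_STILL = "render_still"
    DELIVER = "deliver"

@dataclass(frozen=True)
class StepMeta:
    log_prefix: str   # used by PipelineLogger
    ui_label: str     # shown in the workflow editor
    celery_task: str  # dotted task path (illustrative values)

STEP_REGISTRY: dict[ProcessStep, StepMeta] = {
    ProcessStep.EXTRACT_METADATA: StepMeta(
        "[EXTRACT_METADATA]", "Extract metadata",
        "domains.pipeline.tasks.extract_metadata"),
    ProcessStep.RENDER_THUMBNAIL: StepMeta(
        "[RENDER_THUMBNAIL]", "Render thumbnail",
        "domains.pipeline.tasks.render_thumbnail"),
    ProcessStep.RENDER_STILL: StepMeta(
        "[RENDER_STILL]", "Render still",
        "domains.rendering.tasks.render_still"),
    ProcessStep.DELIVER: StepMeta(
        "[DELIVER]", "Deliver",
        "domains.pipeline.tasks.deliver"),
}

# Completeness check: every step must have registry metadata.
assert set(STEP_REGISTRY) == set(ProcessStep)
```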
Phase 2: Pipeline Modularity (Weeks 3–4)
Break up step_tasks.py (1,170 lines). One file = one pipeline stage.
2.1 Decompose step_tasks.py
Current functions and their new homes:
| Current location | Function | Target file |
|---|---|---|
| step_tasks.py | process_step_file | domains/pipeline/tasks/extract_metadata.py |
| step_tasks.py | render_step_thumbnail | domains/pipeline/tasks/render_thumbnail.py |
| step_tasks.py | generate_gltf_geometry_task | domains/pipeline/tasks/export_glb_geometry.py |
| step_tasks.py | generate_gltf_production_task | domains/pipeline/tasks/export_glb_production.py |
| step_tasks.py | regenerate_thumbnail | domains/pipeline/tasks/render_thumbnail.py |
| step_tasks.py | dispatch_order_line_render | domains/pipeline/tasks/dispatch.py |
| step_tasks.py | render_order_line_task | DELETE (duplicate of domains/rendering/tasks.render_order_line_still_task) |
| step_tasks.py | reextract_cad_metadata | domains/pipeline/tasks/extract_metadata.py |
| step_tasks.py | _auto_populate_materials_for_cad | domains/pipeline/tasks/auto_materials.py |
| step_tasks.py | _bbox_from_glb, _bbox_from_step_cadquery | domains/pipeline/tasks/bbox.py |
| rendering/tasks.py | render_order_line_still_task | domains/rendering/tasks/render_still.py |
| rendering/tasks.py | render_turntable_task | domains/rendering/tasks/render_turntable.py |
| rendering/tasks.py | export_gltf_for_order_line_task | domains/pipeline/tasks/export_glb_geometry.py |
| rendering/tasks.py | export_blend_for_order_line_task | domains/rendering/tasks/export_blend.py |
| rendering/tasks.py | publish_asset | domains/media/tasks.py |
step_tasks.py becomes a compatibility shim (import-only, deprecated) until all
callers are updated. Remove it in Phase 3.
2.2 Render Job Document Integration
Every Celery task in the new structure:
- Reads/creates a `RenderJobDocument` at task start
- Updates the relevant step via `job_doc.update_step(step_name, status="running")`
- On completion: `job_doc.update_step(step_name, status="done", duration_s=elapsed)`
- On failure: `job_doc.set_state("failed")` + `job_doc.append_error(...)`
- Writes the document back to `order_lines.render_job_doc`
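The per-task bookkeeping above can be factored into one wrapper. This is hypothetical glue, not the actual task code: `job_doc` is duck-typed and only needs the `update_step`/`set_state`/`append_error` helpers from §1.2:

```python
import time

def run_step(job_doc, step_name: str, fn, **kwargs):
    """Run one pipeline step with job-document bookkeeping.

    Marks the step running, times fn(), records done/failed, and re-raises
    failures so Celery still sees the exception. (Illustrative helper.)
    """
    job_doc.update_step(step_name, status="running")
    t0 = time.monotonic()
    try:
        output = fn(**kwargs)
    except Exception as exc:
        job_doc.update_step(step_name, status="failed")
        job_doc.set_state("failed")
        job_doc.append_error(f"{step_name}: {exc}")
        raise
    job_doc.update_step(
        step_name,
        status="done",
        duration_s=round(time.monotonic() - t0, 3),
        output=output,
    )
    return output
```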
2.3 Render Job Cancellation (Proper)
Current problem:
celery_app.control.revoke("render-{line_id}", terminate=True) — this ID is synthetic
and does not match the actual Celery task ID, so revoke is a no-op. The Blender process
continues running.
Solution:
- Store the actual Celery task ID in `render_job_doc.celery_task_id` when the task starts
- The cancel endpoint reads `render_job_doc.celery_task_id` and revokes with that real ID
- The render subprocess uses `start_new_session=True` (already done in `render_blender.py`) and stores `proc.pid` in the job document
- On SIGTERM, the Celery task's signal handler calls `os.killpg(pgid, SIGTERM)`, waits 10s, then `os.killpg(pgid, SIGKILL)`
- Clean up: remove the partial output file and the `_frames_*` temp directory
- Update `render_job_doc.state = "cancelled"` and set `OrderLine.render_status = "cancelled"`
Files to modify:
- `backend/app/api/routers/orders.py` — read celery_task_id from the job doc, not the synthetic ID
- `backend/app/domains/rendering/tasks/render_still.py` — store task ID + PID in the job doc, register SIGTERM handler
- `backend/app/domains/rendering/tasks/render_turntable.py` — same
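The SIGTERM-then-SIGKILL escalation can be sketched as below, assuming the Blender subprocess was started with `start_new_session=True` (so its pid is the process-group id). The real handler additionally deletes the partial output file and the `_frames_*` temp directory:

```python
import os
import signal
import subprocess

def terminate_render(proc: subprocess.Popen, grace_s: float = 10.0) -> str:
    """SIGTERM the whole Blender process group; escalate to SIGKILL
    after grace_s seconds. Returns how the group was stopped. (Sketch;
    POSIX-only, cleanup of partial outputs omitted.)"""
    pgid = os.getpgid(proc.pid)
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return "already-dead"
    try:
        proc.wait(timeout=grace_s)
        return "terminated"
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()
        return "killed"
```

Killing the group rather than the single pid is what stops Blender's own child processes; revoking only the Celery task would leave them running.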
2.4 GPU-Primary Rendering
Current state:
cycles_device defaults to "auto". When GPU is unavailable, Blender silently falls back
to CPU with no log message. The _activate_gpu() function in blender_render.py already
probes for GPU but the result is not reflected in the render job document.
Target:
- `cycles_device` default changes from "auto" to "gpu" in system settings
- The `_activate_gpu()` result is logged with a `[GPU_PROBE]` prefix:
  - Success: `[GPU_PROBE] RTX 3090 activated (OPTIX) — using GPU render`
  - Failure: `[GPU_PROBE] No GPU found, falling back to CPU — set cycles_device=cpu to suppress this warning`
- GPU type and fallback reason are written to `render_job_doc.result.gpu_info`
- Admin UI shows GPU status on the Settings page (already partially exists via worker activity)
Files to modify:
- `render-worker/scripts/blender_render.py` — enhance `_activate_gpu()` logging
- `backend/app/api/routers/admin.py` — change default `cycles_device` to "gpu"
- `backend/app/domains/rendering/job_document.py` — add `gpu_info` field to result
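The log line and the `gpu_info` payload can be derived from the probe result in one pure function, which keeps the format testable outside Blender. The device-dict shape here is an assumption; inside `_activate_gpu()` the real probe reads Cycles device preferences via `bpy`:

```python
def gpu_probe_log(devices: list[dict]) -> tuple[str, dict]:
    """Format the [GPU_PROBE] log line plus the gpu_info payload for
    render_job_doc.result. (Sketch; device dicts are assumed to carry
    'name' and 'type' keys.)"""
    gpus = [d for d in devices
            if d.get("type") in {"OPTIX", "CUDA", "HIP", "METAL"}]
    if gpus:
        dev = gpus[0]
        info = {"available": True, "name": dev["name"], "backend": dev["type"]}
        line = f"[GPU_PROBE] {dev['name']} activated ({dev['type']}) — using GPU render"
    else:
        info = {"available": False,
                "fallback_reason": "no compatible GPU device found"}
        line = ("[GPU_PROBE] No GPU found, falling back to CPU — "
                "set cycles_device=cpu to suppress this warning")
    return line, info
```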
2.5 Blender Script Modularity
Current state:
render-worker/scripts/blender_render.py is 853 lines with everything inline.
Target structure:
render-worker/scripts/
├── blender_render.py — entry point, arg parsing, top-level flow
├── _blender_gpu.py — GPU probe + activation
├── _blender_import.py — GLB import, rotation, smooth shading
├── _blender_materials.py — material library application + fallback
├── _blender_camera.py — auto camera from bbox, clip planes
├── _blender_scene.py — scene setup (Mode A vs Mode B)
└── _blender_post.py — (currently Pillow overlay — DELETE THIS FILE)
blender_render.py imports from these sub-modules. Blender Python's sys.path is updated
at the top of the script to include the scripts directory.
Phase 3: Code Deletion (Weeks 3–4, parallel with Phase 2)
3.1 Remove Pillow Overlay Code
Location: render-worker/scripts/blender_render.py lines 798–851
Why it's dead: transparent_bg=True is always passed for production renders. The
else: branch at line 802 can never execute in production. The green Schaeffler bar is
now part of the .blend template, not post-processing.
Delete:
- Lines 798–851 in `blender_render.py` (the entire `if transparent_bg: ... else: try PIL...` block)
- Remove Pillow from the render-worker dependencies in `render-worker/Dockerfile`
- Remove the line "- Schaeffler green top bar + model name label via Pillow post-processing." from the script docstring
3.2 Remove STL Workflow Remnants
What to delete:
| Location | What to remove |
|---|---|
| backend/app/api/routers/admin.py | VALID_STL_QUALITIES, stl_quality from SettingsOut, SettingsUpdate, and all SETTINGS_DEFAULTS |
| backend/app/api/routers/admin.py | generate-missing-stls endpoint (if still present) |
| backend/app/api/routers/cad.py | generate-stl/{quality} endpoint |
| backend/app/services/render_blender.py | stl_quality parameter from render_still() and render_turntable_to_file() |
| backend/app/services/render_blender.py | Key stl_duration_s → rename to glb_duration_s (remove the "# key kept for backward compat" comment) |
| backend/app/tasks/step_tasks.py | generate_stl_cache task (check if it still exists) |
| render-worker/scripts/ | Any _import_stl, _convert_stl, _scale_mm_to_m functions |
| backend/app/api/routers/analytics.py | avg_stl_s field in analytics response |
| All render log dicts | Replace stl_size_bytes: 0 and stl_duration_s entries with glb_* equivalents |
| DB migration | backend/alembic/versions/050_cleanup_stl_settings.py — DELETE FROM system_settings WHERE key = 'stl_quality' |
Files to delete entirely:
- `blender-renderer/` directory (already removed from docker-compose.yml; remove the directory)
- `threejs-renderer/` directory (migration 033 already removed it from services)
- `flamenco/` directory (migration 032 removed Flamenco; verify nothing still imports from it)
Verify before deleting:
grep -rn "blender-renderer\|threejs-renderer\|flamenco" backend/ frontend/ --include="*.py" --include="*.ts" --include="*.tsx"
3.3 Remove Compat Shims
After all callers are migrated, delete these shim files:
- `backend/app/models/user.py` (shim → `domains/auth/models.py`)
- `backend/app/models/cad_file.py` (shim → `domains/products/models.py`)
- `backend/app/services/render_dispatcher.py` (shim, 10 lines)
- `backend/app/services/material_service.py` (shim → `domains/materials/service.py`)
- `backend/app/services/render_blender.py` (move fully into `domains/rendering/`)
- `backend/app/models/` directory → all models are already in `domains/*/models.py`
3.4 Remove Duplicate render_order_line_task
step_tasks.render_order_line_task (lines 705–1050 of step_tasks.py) duplicates
rendering/tasks.render_order_line_still_task. The step_tasks version has more
baggage (compat imports, emit() calls, stl_quality references). Delete the step_tasks
version, migrate all queue routes to the rendering/tasks version.
Migration:
- `celery_app.py` task routes: route `app.tasks.step_tasks.*` to an empty list, then remove step_tasks from the routing table after all tasks are migrated
- Update `CLAUDE.md` to reflect new task locations
Phase 4: Tenant & Auth (Weeks 5–6)
4.1 Role Hierarchy
Current roles: admin | project_manager | client
Target roles:
class UserRole(str, enum.Enum):
global_admin = "global_admin" # platform operator, bypass RLS, all tenants
tenant_admin = "tenant_admin" # per-tenant admin, full control within tenant
project_manager = "project_manager" # order/render management within tenant
client = "client" # read own orders, create draft orders
Permission matrix:
| Permission | GlobalAdmin | TenantAdmin | ProjectManager | Client |
|---|---|---|---|---|
| Manage tenants | YES | no | no | no |
| Manage users (all tenants) | YES | no | no | no |
| Manage users (own tenant) | YES | YES | no | no |
| All system settings | YES | YES | no | no |
| Trigger renders | YES | YES | YES | no |
| View all orders in tenant | YES | YES | YES | no |
| Create/view own orders | YES | YES | YES | YES |
| Reject orders | YES | YES | YES | no |
| Delete renders | YES | YES | YES | no |
| View analytics | YES | YES | YES | no |
DB migration:
- `backend/alembic/versions/051_role_hierarchy.py` — rename `admin` → `global_admin`, add `tenant_admin` to the `userrole` enum; backfill existing `admin` users to `global_admin`
Auth utilities:
- `require_global_admin()` — replaces `require_admin()`
- `require_tenant_admin_or_above()` — TenantAdmin or GlobalAdmin
- `require_pm_or_above()` — PM, TenantAdmin, GlobalAdmin
4.2 Tenant Isolation — Consistency Audit
The problem:
database.py:build_tenant_db_dep() yields the session without setting RLS context
(line 92 comments say "context-setting happens via set_tenant_context when needed").
This means every endpoint that uses Depends(get_db) bypasses RLS.
Fix — Middleware approach (preferred):
# backend/app/core/middleware.py
class TenantContextMiddleware(BaseHTTPMiddleware):
"""Set PostgreSQL RLS context on every request from JWT claims."""
BYPASS_PATHS = {"/health", "/api/auth/login", "/api/auth/refresh"}
async def dispatch(self, request: Request, call_next):
if request.url.path in self.BYPASS_PATHS:
return await call_next(request)
token = self._extract_token(request)
if token:
payload = decode_token_safe(token)
tenant_id = payload.get("tenant_id")
role = payload.get("role")
request.state.tenant_id = tenant_id
request.state.role = role
response = await call_next(request)
return response
The get_db dependency is modified to read tenant_id from request.state:
async def get_db(request: Request) -> AsyncGenerator[AsyncSession, None]:
async with AsyncSessionLocal() as session:
tenant_id = getattr(request.state, "tenant_id", None)
role = getattr(request.state, "role", None)
if tenant_id:
if role == "global_admin":
await session.execute(text("SET LOCAL app.current_tenant_id = 'bypass'"))
else:
await session.execute(
text("SET LOCAL app.current_tenant_id = :tid"),
{"tid": str(tenant_id)},
)
yield session
4.3 Tenant Isolation Strategy — Shared vs. Dedicated Containers
Decision: Shared containers with DB-level isolation (current model)
Analysis:
| Factor | Shared containers | Dedicated containers per tenant |
|---|---|---|
| Cost | Low (6 containers total) | High (6 containers × N tenants) |
| Complexity | Low | Very high (orchestration, networking) |
| Data isolation | DB-level (RLS) | Full OS-level |
| GPU sharing | Single GPU shared | Dedicated GPU per tenant (expensive) |
| Blender jobs | Queue + concurrency control | Per-tenant render queue |
| Failure blast radius | All tenants affected by worker crash | Isolated per tenant |
| Scaling | Celery autoscale | Docker Swarm / Kubernetes HPA |
| Migration effort | Weeks (Phase 3-4) | Months (new orchestration layer) |
Recommendation: Maintain shared containers with DB-level RLS isolation. Dedicated
containers are only justified if tenants have strict contractual data isolation requirements
(e.g., GDPR-mandated separate processing). For the current internal use case (Schaeffler
internal teams), RLS + tenant_id partitioning is sufficient.
If dedicated containers are required in future:
- Docker Compose override file per tenant (`docker-compose.{tenant-slug}.yml`)
- Each tenant gets its own PostgreSQL schema (not a separate DB) with schema-based routing
- Shared MinIO with per-tenant bucket policies
- Separate Redis database (0–15) per tenant (max 16 tenants)
- Celery routing: per-tenant queue prefix `{tenant_slug}.thumbnail_rendering`
4.4 Per-Tenant Feature Flags
Add a tenant_config JSONB column to the tenants table:
# backend/alembic/versions/052_tenant_feature_flags.py
tenant_config JSONB DEFAULT '{
"max_concurrent_renders": 3,
"render_engines_allowed": ["cycles"],
"max_order_size": 500,
"fallback_material": "SCHAEFFLER_059999_FailedMaterial",
"notifications_enabled": true,
"invoice_prefix": "INV"
}'
Feature flags checked at render dispatch time:
- `max_concurrent_renders` — enforced in Celery queue routing
- `render_engines_allowed` — validated in OutputType creation
- `fallback_material` — passed to Blender scripts (see §5.1)
Phase 5: Material & Rendering Improvements (Weeks 5–6)
5.1 Fallback Material — SCHAEFFLER_059999_FailedMaterial
Current state:
step_processor.py:MATERIAL_PALETTE assigns rainbow colors from a palette when material
assignment fails or no material is specified. blender_render.py has its own
PALETTE_LINEAR for the same purpose.
Target:
When material resolution fails (no alias, no exact match, material library link broken),
assign SCHAEFFLER_059999_FailedMaterial (magenta) so failed assignments are immediately
visible in renders.
Implementation:
- `domains/materials/service.py:resolve_material_map()` — instead of pass-through, return `SCHAEFFLER_059999_FailedMaterial` for unresolved parts (configurable per-tenant via `tenant_config.fallback_material`)
- `render-worker/scripts/blender_render.py` — when a material library is provided but a part name does not match any library material, assign `SCHAEFFLER_059999_FailedMaterial` rather than a palette color
- `render-worker/scripts/_blender_materials.py` — a new sub-module for material logic with explicit logging: `[MATERIAL] part 'Outer_Ring' → 'SCHAEFFLER_010101_Steel-Bare' (alias match)` and `[MATERIAL] part 'Unknown_Part' → 'SCHAEFFLER_059999_FailedMaterial' (no match)`
- `step_processor.py` — remove `MATERIAL_PALETTE` and `_material_to_color()`; the palette is no longer used once the fallback material is in place. Part colors for the geometry GLB viewer should come from the material library color map, not a rainbow palette.
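The resolution order (exact library match, then alias, then visible fallback) can be sketched like this; the data shapes and the function signature are assumptions for illustration:

```python
FAILED_MATERIAL = "SCHAEFFLER_059999_FailedMaterial"  # magenta fallback

def resolve_material_map(requested: dict, library: set, aliases: dict,
                         fallback: str = FAILED_MATERIAL) -> dict:
    """Resolve part -> material name.

    requested: part name -> requested material string (may be unknown)
    library:   set of canonical material names
    aliases:   alias -> canonical material name
    Unresolved parts get the magenta fallback so failures show in renders.
    """
    resolved = {}
    for part, material in requested.items():
        if material in library:
            resolved[part] = material          # exact match
        elif aliases.get(material) in library:
            resolved[part] = aliases[material]  # alias match
        else:
            resolved[part] = fallback           # no match: make it visible
    return resolved
```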
5.2 Remove EEVEE Fallback
Current state:
render_blender.py has an EEVEE-to-Cycles fallback:
if returncode > 0 and engine == "eevee":
logger.warning("EEVEE failed (exit %d) — retrying with Cycles", returncode)
returncode, stdout_lines2, stderr_lines2 = _run("cycles")
engine_used = "cycles (eevee fallback)"
This hides failures and makes debugging harder. Per the Blender 5.0.1 requirement, EEVEE
Next should work reliably. If it fails, it should be a hard failure, not a silent retry.
Target: Remove the EEVEE-to-Cycles fallback. If EEVEE fails, the task fails with a
clear error. Set EEVEE_FALLBACK_ENABLED=false system setting (default false from now on).
5.3 Remove Blender Version Check
Current state:
backend/app/services/render_blender.py defines:
MIN_BLENDER_VERSION = (5, 0, 1)
This constant is defined but the check that uses it has been removed. Search for any
remaining version-comparison code in blender_render.py and render scripts.
Target:
- Remove `MIN_BLENDER_VERSION = (5, 0, 1)` from `render_blender.py`
- Remove any `bpy.app.version` comparisons in render scripts
- Blender 5.0.1+ is assumed; older versions are not supported
Phase 6: Notification Center Refactor (Week 7)
6.1 Current Problems
Per-render notifications (render.completed, render.failed) fire for every single
OrderLine. An order with 200 lines generates 200 notifications. This is too noisy.
6.2 Notification Architecture
Three channels:
- Activity Feed (`/api/activity`) — per-action events: every render start/complete, every order state change, every upload. Low-level, not shown in the bell dropdown. Available on a dedicated `/activity` page for debugging.
- Notification Center (`/api/notifications`) — batch summaries only:
  - "Order #ORD-2026-042 rendering complete: 47/50 succeeded, 3 failed"
  - "Excel import failed: 12 products skipped (see import log)"
  - "Worker recovery: 3 stalled renders requeued after 120min timeout"
- System Alerts (admin only) — infrastructure issues: GPU probe failed, Blender binary not found, Redis connection lost.
Notification trigger rules:
- `render.completed` per-line → suppress; emit a batch notification when ALL lines in the order reach a terminal state
- `render.failed` per-line → suppress; emit batch on order completion
- `excel.imported` → one notification per upload with summary counts
- `order.submitted` → one notification (always keep)
- System alerts → always emit individually
DB changes:
- `audit_log` — add a `channel VARCHAR(20)` column: `activity | notification | alert`
- `notification_configs` — extend `event_type` to include the new batch event types
- New beat task: `batch_render_notifications` — runs every 60s, checks for orders where all lines are terminal but no batch notification has been emitted; emits the summary
6.3 Per-User Notification Preferences
Current notification_configs table has event_type + channel + enabled. Extend:
- Add a `frequency: str` column — `immediate | hourly | daily | never`
- Frequency is respected by the batch notification beat task
Files to modify:
- `backend/app/domains/notifications/models.py` — add `channel`, `frequency` columns
- `backend/app/services/notification_service.py` — add an `emit_batch_notification()` function
- `backend/app/tasks/beat_tasks.py` — add the `batch_render_notifications` schedule
- `frontend/src/pages/NotificationSettings.tsx` — add a frequency selector per event type
- `frontend/src/pages/Notifications.tsx` — separate tabs for Activity | Notifications | Alerts
Phase 7: UI/UX Improvements (Week 7–8)
7.1 Tooltip / Help Text System
Every setting, parameter, and action in the Admin UI and order wizard needs a tooltip
explaining what it does and what it affects in the pipeline.
Architecture:
// frontend/src/help/helpTexts.ts
export const HELP_TEXTS: Record<string, HelpText> = {
"setting.blender_cycles_samples": {
title: "Cycles Samples",
body: "Number of render samples per pixel. Higher = better quality, longer render time. 256 is a good balance for product shots. 64 is fast for previews.",
affects: ["render quality", "render time"],
unit: "samples",
range: [1, 4096],
recommendation: "256 for production, 64 for preview",
},
"setting.gltf_preview_linear_deflection": {
title: "3D Viewer Mesh Quality",
body: "Controls tessellation precision for the 3D browser viewer. Lower values = finer mesh, larger file. 0.1mm is a good default for medium-complexity parts.",
affects: ["3D viewer file size", "viewer load time"],
unit: "mm",
},
"action.regenerate_thumbnails": {
title: "Regenerate All Thumbnails",
body: "Re-renders thumbnails for all STEP files using current settings. This queues all files on the thumbnail_rendering worker. Expected time: N × 30s. Only needed after changing renderer settings.",
warning: "This will queue a large number of tasks. Only run during off-peak hours.",
},
// ... all settings
}
// frontend/src/components/HelpTooltip.tsx
interface HelpTooltipProps {
helpKey: string
position?: "top" | "right" | "bottom" | "left"
}
export function HelpTooltip({ helpKey, position = "right" }: HelpTooltipProps) {
const help = HELP_TEXTS[helpKey]
if (!help) return null
return (
<Tooltip content={<HelpContent help={help} />} position={position}>
<HelpCircle size={14} className="text-text-muted ml-1 cursor-help" />
</Tooltip>
)
}
Where to add tooltips (minimum required):
- All `system_settings` keys in Admin > Settings
- All `OutputType.render_settings` fields in the OutputType editor
- All `RenderTemplate` fields in the template editor
- All actions in Admin > Settings (regenerate thumbnails, process unprocessed, etc.)
- All fields in the Order Wizard with non-obvious meaning
7.2 Media Browser Refactor
Current state:
frontend/src/pages/MediaBrowser.tsx — exists but no details on current filter capabilities.
Target:
Server-side filtered media browser with:
- Filters: `lagertyp | category_key | render_status | asset_type | tenant_id` (admin)
- Text search on product name, pim_id
- Server-side pagination (50 per page)
- Virtual scroll for large catalogs (react-virtual or TanStack Virtual)
- Batch download selected assets
API changes:
GET /api/media/assets?
asset_type=still&
category_key=TRB&
lagertyp=Axial-Zylinderrollenlager&
render_status=completed&
page=1&
page_size=50&
q=81113
DB indexes required:
-- backend/alembic/versions/053_media_browser_indexes.py
CREATE INDEX ix_media_assets_asset_type_created ON media_assets(asset_type, created_at DESC);
CREATE INDEX ix_products_category_lagertyp ON products(category_key, lagertyp);
CREATE INDEX ix_products_name_gin ON products USING GIN(to_tsvector('simple', COALESCE(name, '') || ' ' || COALESCE(pim_id, '')));
Files to modify:
- `backend/app/domains/media/router.py` — add `GET /assets` with filter params
- `backend/app/domains/media/schemas.py` — add `MediaAssetFilter` Pydantic model
- `frontend/src/pages/MediaBrowser.tsx` — complete rewrite with virtual scroll
- `frontend/src/api/media.ts` — add `getMediaAssets(filters)` function
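The filter semantics above can be made concrete with a small sketch. This is a framework-agnostic stand-in for the planned `MediaAssetFilter` Pydantic model — field names mirror the query params shown earlier, and `to_sql` is illustrative only (the real endpoint would compose a SQLAlchemy query):

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for the planned MediaAssetFilter model in domains/media/schemas.py.
@dataclass
class MediaAssetFilter:
    asset_type: Optional[str] = None
    category_key: Optional[str] = None
    lagertyp: Optional[str] = None
    render_status: Optional[str] = None
    q: Optional[str] = None  # text search on product name / pim_id
    page: int = 1
    page_size: int = 50

    def to_sql(self):
        """Build a parameterized WHERE clause plus pagination params."""
        clauses, params = [], {}
        for col in ("asset_type", "category_key", "lagertyp", "render_status"):
            value = getattr(self, col)
            if value is not None:
                clauses.append(f"{col} = :{col}")
                params[col] = value
        if self.q:
            # Shaped to hit the GIN tsvector index from migration 053.
            clauses.append(
                "to_tsvector('simple', COALESCE(name,'') || ' ' || COALESCE(pim_id,''))"
                " @@ plainto_tsquery('simple', :q)"
            )
            params["q"] = self.q
        where = " AND ".join(clauses) or "TRUE"
        params["limit"] = self.page_size
        params["offset"] = (self.page - 1) * self.page_size
        return f"WHERE {where} LIMIT :limit OFFSET :offset", params
```

For example, `MediaAssetFilter(asset_type="still", page=2)` yields `offset=50` for the second page at the default page size of 50.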
7.3 Workflow Editor — Pipeline Step Nodes
Current state:
WorkflowEditor.tsx has 5 node types (Upload, Parse, Render, Export, Deliver) but they
do not map to actual Celery tasks. WorkflowDefinition.config is a free-form JSONB blob
with no schema validation.
Target:
Node types correspond 1:1 to ProcessStep enum values. The workflow editor saves a
validated workflow config that the dispatch_workflow() function can execute.
WorkflowDefinition config schema:
{
"version": 1,
"nodes": [
{"id": "n1", "step": "extract_metadata", "params": {}},
{"id": "n2", "step": "render_thumbnail", "params": {"engine": "cycles", "samples": 64}},
{"id": "n3", "step": "render_still", "params": {"width": 2048, "height": 2048}},
{"id": "n4", "step": "export_glb", "params": {"quality": "high"}},
{"id": "n5", "step": "deliver", "params": {}}
],
"edges": [
{"from": "n1", "to": "n2"},
{"from": "n2", "to": "n3"},
{"from": "n3", "to": "n4"},
{"from": "n4", "to": "n5"}
]
}
Backend validation: workflow_router.py validates that all step values are in
ProcessStep enum before saving.
Frontend: WorkflowEditor.tsx builds available node types from a GET /api/workflows/steps
endpoint that returns all ProcessStep entries with their parameter schemas.
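The backend validation rule can be sketched as a pure function. The `ProcessStep` values below are the five used in the example config, standing in for the full 20-name enum in `app/core/process_steps.py`:

```python
from enum import Enum

# Illustrative subset of the ProcessStep registry; the real StrEnum
# in app/core/process_steps.py carries all 20 pipeline step names.
class ProcessStep(str, Enum):
    EXTRACT_METADATA = "extract_metadata"
    RENDER_THUMBNAIL = "render_thumbnail"
    RENDER_STILL = "render_still"
    EXPORT_GLB = "export_glb"
    DELIVER = "deliver"

def validate_workflow_config(config: dict) -> list:
    """Return human-readable errors; an empty list means the config is valid."""
    errors = []
    valid_steps = {s.value for s in ProcessStep}
    node_ids = set()
    for node in config.get("nodes", []):
        node_ids.add(node["id"])
        if node.get("step") not in valid_steps:
            errors.append(f"node {node['id']}: unknown step {node.get('step')!r}")
    for edge in config.get("edges", []):
        # Both endpoints must reference declared node ids.
        for end in ("from", "to"):
            if edge.get(end) not in node_ids:
                errors.append(f"edge references unknown node {edge.get(end)!r}")
    return errors
```

`workflow_router.py` would call this before persisting and return HTTP 422 with the error list on failure.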
7.4 Kanban Rejection Flow
Current state:
OrderStatus.rejected exists but the rejection flow is undefined. The admin panel has no
rejection UI. rejected_at column exists but there is no rejection reason field.
Target flow:
- Who can reject: `ProjectManager`, `TenantAdmin`, `GlobalAdmin`
- Trigger: `POST /api/orders/{id}/reject` with body `{"reason": "...", "notify_client": true}`
- What happens:
  - Order status → `rejected`, `rejected_at` = now
  - `rejection_reason` stored (new `Text` column on `Order`)
  - All pending/processing renders are cancelled (same as cancel-renders endpoint)
  - Notification emitted to order creator: "Your order #ORD-2026-042 was rejected. Reason: ..."
  - Audit log entry created
- Client sees: Order status badge changes to `REJECTED` with reason visible
- Re-submission: Client can `POST /api/orders/{id}/resubmit` which clears rejection, resets to `draft`, allowing edits before re-submitting. Re-submit creates a new audit log entry and emits notification to PMs.
DB migration:
`backend/alembic/versions/054_order_rejection.py` — add `rejection_reason TEXT` to `orders`
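The rejection transition itself can be kept as a pure function, separate from FastAPI wiring. Role names follow the "Who can reject" list above; the side-effect labels and the which-statuses-are-rejectable guard are illustrative assumptions:

```python
from datetime import datetime, timezone

# Roles from the rejection flow above.
REJECT_ROLES = {"ProjectManager", "TenantAdmin", "GlobalAdmin"}

def reject_order(order: dict, actor_role: str, reason: str) -> list:
    """Apply the rejection transition in place; return side effects to emit.

    The real endpoint would additionally cancel pending renders, emit the
    notification, and write the audit log row named in the returned list.
    """
    if actor_role not in REJECT_ROLES:
        raise PermissionError(f"role {actor_role!r} may not reject orders")
    # Assumption: already-rejected and completed orders are not rejectable.
    if order["status"] in ("rejected", "completed"):
        raise ValueError(f"cannot reject order in status {order['status']!r}")
    order["status"] = "rejected"
    order["rejected_at"] = datetime.now(timezone.utc)
    order["rejection_reason"] = reason
    return ["cancel_pending_renders", "notify_order_creator", "write_audit_log"]
```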
Phase 8: Scalable Workers (Week 8)
8.1 Current Concurrency Controls
- `worker` (step_processing): `CELERY_WORKER_CONCURRENCY` env var, default 8
- `render-worker` (thumbnail_rendering): hardcoded 1 (Blender serial access)
- Both require Docker service restart to change concurrency
8.2 Dynamic Worker Scaling
Short term (no Kubernetes):
Use Celery's built-in autoscale option:
# docker-compose.yml
render-worker:
command: celery -A app.tasks.celery_app worker
--loglevel=info
-Q thumbnail_rendering
--autoscale=1,1       # max=1, min=1 (Celery parses autoscale as max,min; Blender stays serial)
--concurrency=1
For worker:
worker:
command: celery -A app.tasks.celery_app worker
--loglevel=info
-Q step_processing,ai_validation
--autoscale=${MAX_CONCURRENCY:-8},${MIN_CONCURRENCY:-2}
Per-queue concurrency via DB:
Add a worker_configs table:
CREATE TABLE worker_configs (
queue_name VARCHAR(100) PRIMARY KEY,
max_concurrency INT NOT NULL DEFAULT 8,
min_concurrency INT NOT NULL DEFAULT 2,
updated_at TIMESTAMP NOT NULL DEFAULT now()
);
A beat task apply_worker_concurrency runs every 5 minutes and uses Celery control
commands to adjust pool size:
celery_app.control.broadcast("pool_shrink", arguments={"n": 2}, destination=["worker@host"])
celery_app.control.broadcast("pool_grow", arguments={"n": 4}, destination=["worker@host"])
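The beat task's decision logic can stay pure and testable: compare each queue's live pool size against its `worker_configs` row and return the broadcasts to send. The loop issuing `celery_app.control.broadcast(...)` over the returned plan is omitted, and the queue-to-worker destination mapping is an assumption:

```python
from typing import Dict, List, Tuple

def plan_pool_adjustments(
    current_sizes: Dict[str, int],
    configs: Dict[str, Tuple[int, int]],  # queue -> (max_concurrency, min_concurrency)
) -> List[Tuple[str, str, int]]:
    """Decide which control broadcasts apply_worker_concurrency should send.

    Returns (queue, "pool_grow"|"pool_shrink", n) tuples; the beat task would
    broadcast each with arguments={"n": n} to the worker serving that queue.
    """
    plans = []
    for queue, current in current_sizes.items():
        if queue not in configs:
            continue  # no worker_configs row for this queue
        max_c, min_c = configs[queue]
        if current < min_c:
            plans.append((queue, "pool_grow", min_c - current))
        elif current > max_c:
            plans.append((queue, "pool_shrink", current - max_c))
    return plans
```

Keeping the arithmetic out of the Celery task makes the 5-minute beat cycle trivially unit-testable without a broker.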
Long term (Kubernetes):
Workers run as Kubernetes Deployments with HPA on celery_queue_length metric (exposed via
Flower or a custom /metrics endpoint for Prometheus). Render-workers use GPU node pools
with nvidia.com/gpu: 1 resource requests.
8.3 Worker Health Recovery
Current state:
beat_tasks.recover_stuck_cad_files runs every 5 minutes and handles stuck processing state.
Extend to:
- Detect `render_status = 'processing'` with `render_started_at` > `render_stall_timeout_minutes` ago
- SIGTERM any still-running Blender PID (stored in `render_job_doc.celery_task_id`)
- Reset `render_status` to `failed`, update `render_job_doc.state = 'failed'`
- Emit system alert notification (admin channel)
- Log with `[WORKER_RECOVERY] Stalled render for order_line {id} terminated after {N}min`
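The detection step above can be isolated the same way. The dict keys mirror the `order_lines` columns named in the list; the caller would then SIGTERM the process and mark each result failed:

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

def find_stalled_renders(
    lines: List[dict], stall_timeout_minutes: int, now: Optional[datetime] = None
) -> List[dict]:
    """Select order lines stuck in 'processing' longer than the timeout.

    Pure selection logic for the recovery beat task; termination, status
    reset, and the [WORKER_RECOVERY] log line happen in the caller.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=stall_timeout_minutes)
    return [
        line
        for line in lines
        if line["render_status"] == "processing"
        and line["render_started_at"] < cutoff
    ]
```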
Detailed Task Breakdown by Area
A. step_tasks.py Decomposition
Current problems:
- 1,170 lines, 8 distinct Celery tasks, many private helpers, multiple inline DB session creation patterns
- Imports scattered: some at module level, some inside functions (Celery pattern)
`render_order_line_task` (lines 705–1050+) duplicates `render_order_line_still_task`
Migration path:
- Create new `domains/pipeline/tasks/` directory with one file per step
- Each new task calls `PipelineLogger` instead of bare `logger.info`
- Each new task writes to `render_job_doc` via `job_document.py` helpers
- Old `step_tasks.py` becomes import-only shim: `from app.domains.pipeline.tasks.extract_metadata import process_step_file`
- After 2-week migration period, delete `step_tasks.py`
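A decomposed task file would follow the shape below. The `PipelineLogger` here is a minimal runnable stand-in (the real class in `app/core/pipeline_logger.py` also writes to Python logging and publishes SSE events via Redis); the `@shared_task` decorator, DB session, and `render_job_doc` writes are omitted:

```python
import time
from contextlib import contextmanager

class PipelineLogger:
    """Minimal stand-in for app.core.pipeline_logger.PipelineLogger."""

    def __init__(self, order_line_id):
        self.order_line_id = order_line_id
        self.events = []

    @contextmanager
    def step(self, name):
        # Auto-timed step context: records start, done-with-duration, or error.
        self.events.append(("step_start", name))
        start = time.monotonic()
        try:
            yield
        except Exception as exc:
            self.events.append(("step_error", name, str(exc)))
            raise
        self.events.append(("step_done", name, time.monotonic() - start))

# Shape of one per-step file, e.g. domains/pipeline/tasks/extract_metadata.py.
def process_step_file(order_line_id: int, payload: dict) -> dict:
    pl = PipelineLogger(order_line_id)
    with pl.step("extract_metadata"):
        # Real task: open the STEP file and extract assembly metadata.
        metadata = {"parts": payload.get("parts", 0)}
    return metadata
```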
B. Auth Token Claims
Current: {"sub": user_id, "role": role, "exp": expires} — no tenant_id in token
Target: {"sub": user_id, "role": role, "tenant_id": str(tenant_id), "exp": expires}
Impact: All existing tokens become invalid after deploy. Users must re-login.
Mitigation: Rotate JWT_SECRET_KEY as part of the deployment to force re-login.
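The claim change amounts to one extra key in the payload that `create_access_token()` encodes. A sketch of the claims builder (JWT encoding via the existing library is unchanged; the 30-minute default is an assumption):

```python
from datetime import datetime, timedelta, timezone

def build_token_claims(user_id: str, role: str, tenant_id, expires_minutes: int = 30) -> dict:
    """Claims payload for create_access_token().

    tenant_id is stringified so UUID values survive JSON encoding in the JWT.
    """
    return {
        "sub": user_id,
        "role": role,
        "tenant_id": str(tenant_id),
        "exp": datetime.now(timezone.utc) + timedelta(minutes=expires_minutes),
    }
```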
C. Celery Task Routing Update
After Phase 2 decomposition, update celery_app.conf.update(task_routes={...}):
task_routes = {
"app.domains.pipeline.tasks.*": {"queue": "step_processing"},
"app.domains.rendering.tasks.*": {"queue": "thumbnail_rendering"},
"app.domains.media.tasks.*": {"queue": "step_processing"},
"app.tasks.ai_tasks.*": {"queue": "ai_validation"},
"app.tasks.beat_tasks.*": {"queue": "step_processing"},
}
D. Frontend API Client Consistency
All frontend/src/api/*.ts files should:
- Use the axios client from `api/client.ts` (which injects `X-Tenant-ID` header)
- Export typed interfaces for all response shapes
- Use `useQuery`/`useMutation` from TanStack Query, not bare `axios.get` in components
Audit needed: Check each api/*.ts file to confirm X-Tenant-ID header is sent
(it is wired in the axios interceptor per commit 5da90b5, but verify all files use
the configured client, not axios.create() directly).
Architectural Decisions (ADRs)
ADR-001: Shared containers vs. per-tenant containers
Decision: Shared containers with PostgreSQL RLS
Rationale: Cost and complexity savings. RLS provides adequate isolation for internal use.
Consequences: Must ensure RLS is applied consistently (Phase 1.3). Blender sessions are
shared; GPU contention is managed via Celery queue depth, not isolation.
ADR-002: Render Job Document as JSONB
Decision: Store render job state machine as JSONB in order_lines.render_job_doc
Rationale: Avoids additional workflow_node_results table queries for debugging;
JSONB is flexible for schema evolution; indexed for state-based queries.
Alternatives considered: Separate render_job_steps table — rejected (too many joins
for the common "show me render status" query).
ADR-003: No per-render notifications
Decision: Suppress individual render.completed notifications; emit batch at order completion
Rationale: An order with 200 lines generates 200 notifications under the current model.
Batch summaries at order completion are actionable; per-render events are noise.
Consequences: Activity feed still records all events for debugging.
ADR-004: GPU-first rendering
Decision: Default cycles_device = "gpu", explicit log on CPU fallback
Rationale: The render-worker has GPU reservation in docker-compose.yml. CPU fallback
should be visible and logged, not silent.
Consequences: Renders on machines without GPU will always log a CPU fallback warning.
ADR-005: Fallback material over palette
Decision: Replace MATERIAL_PALETTE rainbow fallback with SCHAEFFLER_059999_FailedMaterial
Rationale: Failed material assignments should be immediately visible (magenta) rather
than disguised as intentional palette colors.
Consequences: Parts with missing material mapping will render magenta in both
thumbnail and production renders. This is a feature, not a bug.
ADR-006: Blender 5.0.1 minimum, no version guards
Decision: Remove all bpy.app.version checks and MIN_BLENDER_VERSION guards
Rationale: The project is Blender 5.0.1-only. Version shims add complexity without value.
Consequences: Running with an older Blender binary will cause cryptic errors. Document
the minimum version requirement clearly in the Dockerfile and README.
What Gets Deleted
Python files to delete entirely:
- `backend/app/models/user.py` — compat shim
- `backend/app/models/cad_file.py` — compat shim
- `backend/app/models/order.py` — compat shim (if exists)
- `backend/app/models/order_item.py` — compat shim
- `backend/app/models/order_line.py` — compat shim
- `backend/app/models/material.py` — compat shim
- `backend/app/models/material_alias.py` — compat shim
- `backend/app/models/render_template.py` — compat shim
- `backend/app/models/output_type.py` — compat shim
- `backend/app/models/system_setting.py` — compat shim
- `backend/app/models/template.py` — compat shim
- `backend/app/models/render_position.py` — compat shim
- `backend/app/services/render_dispatcher.py` — 10-line shim
- `backend/app/services/material_service.py` — 3-line shim
- `backend/app/tasks/step_tasks.py` — after Phase 2 migration complete
- `backend/app/domains/rendering/tasks.py` — split into per-step files in Phase 2
Directories to delete entirely:
- `blender-renderer/` — HTTP microservice, removed from docker-compose in refactor/v2
- `threejs-renderer/` — removed in migration 033
- `flamenco/` — removed in migration 032
Code blocks to delete (within files):
- `render-worker/scripts/blender_render.py` lines 798–851 — Pillow overlay
- `render-worker/scripts/blender_render.py` line 17 — docstring Pillow mention
- `backend/app/services/render_blender.py` line 17 — `MIN_BLENDER_VERSION = (5, 0, 1)`
- `backend/app/services/render_blender.py` lines 229–233 — EEVEE-to-Cycles fallback
- `backend/app/services/step_processor.py` lines 19–31 — `MATERIAL_PALETTE` + `_material_to_color()`
- `backend/app/api/routers/admin.py` — `VALID_STL_QUALITIES`, `stl_quality` in all schemas
System settings to delete (DB migration):
- `stl_quality` — GLB-only pipeline, no STL concept
- `threejs_render_size` — renderer removed
- `thumbnail_renderer` — was multi-value (pillow|blender|threejs), now always blender
Migration Strategy
Deployment Order (Zero-Downtime)
Step 1 — DB migrations (non-breaking):
- Run migrations 048–054 (new columns: `render_job_doc`, `rejection_reason`, feature flags, etc.)
- New columns are nullable, no existing queries break
Step 2 — Backend deploy (backward compatible):
- Deploy new backend with compat shims in place
- New endpoints and middleware active
- Old endpoints still work
- JWT tokens are extended with `tenant_id` claim (existing tokens without it still work via fallback in middleware)
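The middleware fallback for legacy tokens can be sketched as follows; `lookup_user_tenant` is a hypothetical stand-in for the real user-to-tenant DB query inside `TenantContextMiddleware`:

```python
from typing import Callable, Optional

def resolve_tenant_id(claims: dict, lookup_user_tenant: Callable) -> Optional[str]:
    """Prefer the tenant_id claim; fall back to a user lookup for old tokens.

    Legacy tokens issued before the claim existed carry only sub/role/exp,
    so the middleware resolves the tenant from the user record instead.
    """
    tenant_id = claims.get("tenant_id")
    if tenant_id is not None:
        return str(tenant_id)
    return lookup_user_tenant(claims["sub"])
```

After Step 5 rotates `JWT_SECRET_KEY`, the fallback branch becomes dead code and can be removed.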
Step 3 — Celery worker deploy:
- Deploy new `domains/pipeline/tasks/` structure
- `step_tasks.py` compat shim routes to new functions
- Old task names still registered via shim
Step 4 — Frontend deploy:
- New WorkflowEditor with validated step types
- HelpTooltip components added
- MediaBrowser refactor with virtual scroll
Step 5 — Cleanup (breaking):
- Remove compat shims
- Delete `step_tasks.py`
- Rotate `JWT_SECRET_KEY` to force re-login (`tenant_id` now required in claims)
- Run DB migration to clean up `stl_quality` and threejs settings
Rollback Plan
- All migrations have `downgrade()` implemented
- Compat shims mean old task names still work during migration window
- `render_log` column kept alongside `render_job_doc` until all consumers migrated
Testing Before Delete
Before deleting any compat shim or old code, verify:
grep -rn "<old_import_path>" backend/ frontend/ --include="*.py" --include="*.ts" --include="*.tsx"
Must return 0 results from non-shim files.
Open Questions
These require product decisions before implementation:
- Tenant onboarding flow — How are new tenants created? Self-service signup, or admin creates tenant + TenantAdmin user manually? What is the initial data setup?
- Blender binary distribution — Currently host-mounted (`/opt/blender:/opt/blender:ro`). If multiple render-workers run on different hosts in a future cluster, how is Blender distributed? Container image vs. network share?
- MinIO vs. filesystem storage — All media assets are stored on the local filesystem (`/app/uploads` volume). MinIO is configured but not used for primary storage yet. Should Phase 2 migrate assets to MinIO for horizontal scaling?
- Invoice workflow — `billing/models.py` has `Invoice` + `InvoiceLine` models and an `invoices` table (migration 042). Is billing actually used? If not, should it be removed to reduce complexity?
- AI validation (Azure OpenAI) — `ai_tasks.py` and `azure_ai.py` exist but Azure credentials are optional. Is this feature actively used or can it be removed?
- Email notifications — SMTP settings exist in `system_settings` but email sending is not implemented. Is this a required feature for the next phase?
- Rejection re-submission UX — When a client re-submits a rejected order, do they create a new order or update the existing one? The current data model supports only one status per order, not a history of submissions.
- Media browser download format — Bulk download: ZIP of individual files, or separate download links? ZIP requires server-side assembly which adds load.
- Tooltip language — Help texts in English (per CLAUDE.md coding standards) or German (for end-user-facing UI)? The admin UI is currently in English labels.
- 3D Viewer geometry quality — The `gltf_preview_linear_deflection` default is 0.1mm. For very small parts (sub-1mm features), this may be too coarse. Should the deflection auto-scale based on the CAD file's bounding box dimensions?