fix: deduplicate GLB/USD generation with Redis locks + review fixes

- Add per-file Redis SET NX EX 1800 locks to generate_gltf_geometry_task
  and generate_usd_master_task — concurrent duplicates (e.g. double-click
  of bulk action buttons) now log a warning and return immediately instead
  of running two expensive OCC tessellation subprocesses on the same file
- Fix eng.dispose() called inside with Session() block in cache-hit path
  of both tasks — moved to after the with block exits (Tasks 3+4 from plan)
- Add cad.updated_at = datetime.utcnow() in save_manual_material_overrides
  (was missing vs parallel save_part_materials endpoint)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 13:50:05 +01:00
parent 409fb92899
commit 71e099305c
4 changed files with 545 additions and 410 deletions
+123 -81
@@ -1,108 +1,150 @@
# Plan: P2 USD Foundation — Commit & Verify
# Plan: Deduplication for GLB/USD Generation + Two Review Fixes
## Context
All five P2 milestones are already implemented in the working tree as uncommitted changes.
The task now is to apply the DB migrations, commit the work, and verify the stack runs.
Two problems to solve:
### Milestone status (assessed 2026-03-12)
**1. Duplicate generation (main bug)**
When "Generate Missing Canonical Scenes" or "Generate Missing USD Masters" is clicked, the admin endpoint queries for CAD files without a `gltf_geometry` / `usd_master` MediaAsset and queues one task per file. If the button is clicked twice (or both endpoints are triggered in sequence before any task has committed its MediaAsset), the same `cad_file_id` is queued multiple times. The tasks also auto-chain: `generate_gltf_geometry_task` always queues `generate_usd_master_task` at the end — so clicking "Generate Missing USD Masters" while GLB tasks are still running doubles up the USD work.
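The double-queueing described above can be reproduced with a small stand-alone simulation. This is only a sketch: `files_missing_asset` and `simulate_double_click` are hypothetical names that model the admin endpoint's "query, then queue" behaviour, not the project's actual code.

```python
from collections import Counter


def files_missing_asset(cad_file_ids, committed_assets):
    """Return the CAD file ids that have no committed MediaAsset yet."""
    return [cid for cid in cad_file_ids if cid not in committed_assets]


def simulate_double_click(cad_file_ids):
    """Queue tasks twice before any task has committed its asset."""
    queue = []
    committed_assets = set()  # still empty: the first batch is still running
    for _click in range(2):
        queue.extend(files_missing_asset(cad_file_ids, committed_assets))
    return Counter(queue)


counts = simulate_double_click(["cad-1", "cad-2"])
print(counts)
```

Because the query only looks at committed assets, both clicks see the same "missing" set and every file is queued twice.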
| Milestone | Status | Key files |
|---|---|---|
| M1: `export_step_to_usd.py` with `schaeffler:partKey` | ✅ DONE | `render-worker/scripts/export_step_to_usd.py` (631 lines) |
| M2: `usd_master` MediaAsset + migrations 060–062 + Celery task | ✅ DONE | migrations 060/061/062, `generate_usd_master_task` in `export_glb.py` |
| M3: `GET /api/cad/{id}/scene-manifest` | ✅ DONE | `part_key_service.py`, `SceneManifest` schema, endpoint in `cad.py` |
| M4: `PUT /api/cad/{id}/manual-material-overrides` | ✅ DONE | New endpoint pair in `cad.py`, `saveManualOverrides` in `cad.ts` |
| M5: ThreeDViewer uses partKey, survives reload | ✅ DONE | `partKeyMap` in GLB extras, `effectiveMaterials` merge, server-side persistence |
The existing cache check (`step_file_hash`) only short-circuits tessellation when a MediaAsset already exists; it does not prevent two concurrent tasks from both starting the expensive subprocess on the same file. Two processes writing to `_geometry.glb` at the same time can corrupt the file and waste compute.
## Affected Files (all uncommitted — working tree only)
**Solution**: Apply the same Redis `SET NX EX` dedup lock that `process_step_file` uses (lock key `step_processing_lock:{id}`, released in `finally`). Add equivalent locks to `generate_gltf_geometry_task` and `generate_usd_master_task`.
**Backend**
- `backend/alembic/versions/060_usd_master_asset_type.py` — new migration
- `backend/alembic/versions/061_material_assignment_layers.py` — new migration
- `backend/alembic/versions/062_rename_tessellation_settings.py` — new migration
- `backend/app/domains/media/models.py` — `MediaAssetType.usd_master` added
- `backend/app/domains/products/models.py` — 3 new JSONB columns on `CadFile`
- `backend/app/domains/products/schemas.py` — `SceneManifest`, `PartEntry` Pydantic models
- `backend/app/domains/pipeline/tasks/export_glb.py` — `generate_usd_master_task` + auto-chain
- `backend/app/domains/pipeline/tasks/extract_metadata.py` — minor update
- `backend/app/domains/pipeline/tasks/render_thumbnail.py` — minor update
- `backend/app/domains/pipeline/tasks/render_order_line.py` — minor update
- `backend/app/api/routers/cad.py` — scene-manifest + manual-material-overrides endpoints
- `backend/app/api/routers/admin.py` — generate-missing-usd-masters + generate-missing-canonical-scenes buttons
- `backend/app/services/part_key_service.py` — new file: `build_scene_manifest()`, `generate_part_key()`
- `backend/app/core/config_service.py` — minor update
- `backend/app/core/tenant_context.py` — new file
- `backend/app/tasks/step_tasks.py` — re-exports `generate_usd_master_task`
**2. Review fix A — `eng.dispose()` inside `with Session` block**
`export_glb.py` line 89: `eng.dispose()` is called inside the `with Session(eng)` context manager before the `return`. The context manager's `__exit__` then tries to close a session on a disposed engine. Safe in practice (no exception raised) but fragile and misleading. Move `eng.dispose()` to after the `with` block exits.
**Render worker**
- `render-worker/scripts/export_step_to_usd.py` — new file: full USD exporter
- `render-worker/scripts/export_step_to_gltf.py` — injects `partKeyMap` into GLB extras
- `render-worker/scripts/still_render.py` — USD path support
- `render-worker/scripts/turntable_render.py` — USD path support
- `render-worker/Dockerfile` — `usd-core>=24.11` added
**3. Review fix B — `save_manual_material_overrides` missing `updated_at`**
`cad.py` line 537: `cad.manual_material_overrides = body.overrides` is committed without updating `cad.updated_at`. The parallel endpoint `save_part_materials` (line 430) does call `cad.updated_at = datetime.utcnow()`. Add the same line to `save_manual_material_overrides`.
**Frontend**
- `frontend/src/api/cad.ts` — `getManualOverrides()`, `saveManualOverrides()`
- `frontend/src/api/media.ts` — `usd_master` type added
- `frontend/src/api/sceneManifest.ts` — new file: `SceneManifest`, `fetchSceneManifest()`
- `frontend/src/components/cad/ThreeDViewer.tsx` — `partKeyMap`, `effectiveMaterials`, reconciliation panel
- `frontend/src/components/cad/MaterialPanel.tsx` — dual-path save, provenance badge
- `frontend/src/pages/Admin.tsx` — USD master bulk action buttons
- `frontend/src/pages/ProductDetail.tsx` — `usd_master` row in asset table
- `frontend/src/pages/Orders.tsx` — minor update
## Affected Files
| File | Change |
|---|---|
| `backend/app/domains/pipeline/tasks/export_glb.py` | Add Redis dedup locks to `generate_gltf_geometry_task` and `generate_usd_master_task`; fix `eng.dispose()` placement |
| `backend/app/api/routers/cad.py` | Add `cad.updated_at = datetime.utcnow()` in `save_manual_material_overrides` |
## Tasks (in order)
### [ ] Task 1: Apply migrations 060–062
- **What**: Run `docker compose exec backend alembic upgrade head` to apply the three pending migrations
- **Acceptance gate**: `docker compose exec backend alembic current` shows `062` (or higher) as current
### [x] Task 1: Add Redis dedup locks to `generate_gltf_geometry_task`
- **File**: `backend/app/domains/pipeline/tasks/export_glb.py`
- **What**: At the top of `generate_gltf_geometry_task`, after `pl.step_start(...)`, acquire a Redis lock using the same pattern as `extract_metadata.py`:
```python
import redis as _redis_lib

_lock_key = f"glb_geometry_lock:{cad_file_id}"
_r = _redis_lib.from_url(app_settings.redis_url)
# SET NX EX: only the first caller gets the key; it expires after 30 minutes.
_acquired = _r.set(_lock_key, "1", nx=True, ex=1800)  # 30-min TTL
if not _acquired:
    logger.warning("generate_gltf_geometry_task: %s already in-flight — skipping duplicate", cad_file_id)
    pl.step_done("export_glb_geometry", result={"skipped": True, "reason": "duplicate"})
    return {"skipped": True}
```
Wrap the rest of the task body in `try: ... finally: _r.delete(_lock_key)`.
Note: `app_settings` is already imported inside the function. Import `redis` locally as `import redis as _redis_lib`, before acquiring the lock (same pattern as `extract_metadata.py`, which also imports it locally).
- **Acceptance gate**: Trigger "Generate Missing Canonical Scenes" twice in quick succession — worker logs show `"already in-flight — skipping duplicate"` for the second batch; no file ends up being tessellated twice.
- **Dependencies**: none
- **Risk**: Low — each migration is additive (ADD VALUE, ADD COLUMN, UPDATE). Check for phantom drops before running.
- **Risk**: Low — same pattern as `process_step_file`, TTL 30min covers worst-case tessellation time.
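The acquire / skip / release-in-`finally` pattern from the new Task 1 can also be expressed as a context manager. The following is a minimal sketch under stated assumptions: `dedup_lock` is a hypothetical helper, and `InMemoryRedis` is a hypothetical stand-in that mimics only the `set(..., nx=True, ex=...)` and `delete()` behaviour the plan relies on, so the example runs without a Redis server.

```python
from contextlib import contextmanager


class InMemoryRedis:
    """Hypothetical stand-in for a Redis client (no real expiry handling)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, nx=False, ex=None):
        # Like redis-py, return None when NX prevents the write.
        if nx and key in self._data:
            return None
        self._data[key] = value
        return True

    def delete(self, key):
        return 1 if self._data.pop(key, None) is not None else 0


@contextmanager
def dedup_lock(client, key, ttl_seconds=1800):
    """Yield True if this caller owns the lock, False if it is a duplicate."""
    acquired = bool(client.set(key, "1", nx=True, ex=ttl_seconds))
    try:
        yield acquired
    finally:
        # Only the owner releases; a skipped duplicate must not delete the key.
        if acquired:
            client.delete(key)


client = InMemoryRedis()
with dedup_lock(client, "glb_geometry_lock:42") as first:
    with dedup_lock(client, "glb_geometry_lock:42") as second:
        pass  # a real task would log a warning and return here
with dedup_lock(client, "glb_geometry_lock:42") as third:
    pass
print(first, second, third)
```

The detail worth noting is that the duplicate never deletes the key, which matches the plan's early `return` before the `try` block.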
### [ ] Task 2: TypeScript check
- **What**: Run `docker compose exec frontend npx tsc --noEmit` to verify no type errors in the frontend changes
- **Acceptance gate**: Zero TypeScript errors
- **Dependencies**: none (frontend hot-reload, no rebuild needed)
- **Risk**: Low
### [x] Task 2: Add Redis dedup lock to `generate_usd_master_task`
### [ ] Task 3: Rebuild and restart backend + render-worker
- **What**: `docker compose up -d --build backend worker render-worker beat` — picks up new Dockerfile (usd-core), new tasks, and new migrations
- **Acceptance gate**: `docker compose logs backend | grep "Application startup complete"` and `docker compose exec render-worker python3 -c "from pxr import Usd; print(Usd.GetVersion())"` both succeed
- **File**: `backend/app/domains/pipeline/tasks/export_glb.py`
- **What**: Same pattern at the top of `generate_usd_master_task`, after `pl.step_start(...)`:
```python
import redis as _redis_lib

_lock_key = f"usd_master_lock:{cad_file_id}"
_r = _redis_lib.from_url(app_settings.redis_url)
# SET NX EX: only the first caller gets the key; it expires after 30 minutes.
_acquired = _r.set(_lock_key, "1", nx=True, ex=1800)  # 30-min TTL
if not _acquired:
    logger.warning("generate_usd_master_task: %s already in-flight — skipping duplicate", cad_file_id)
    pl.step_done("usd_master", result={"skipped": True, "reason": "duplicate"})
    return {"skipped": True}
```
Wrap the rest of the function body in `try: ... finally: _r.delete(_lock_key)`.
- **Acceptance gate**: Trigger "Generate Missing USD Masters" while GLB tasks are still running — worker logs show USD tasks skipping duplicates instead of starting a second tessellation.
- **Dependencies**: Task 1
- **Risk**: Medium — `usd-core` pip install adds build time; if it fails the render-worker won't start
### [ ] Task 4: Commit all P2 work
- **What**: Stage and commit all uncommitted P2 files in a single `feat(P2)` commit
- **Acceptance gate**: `git status` shows clean working tree (except LEARNINGS.md and review-report.md which can be included)
- **Dependencies**: Tasks 1–3 (verify before committing)
- **Risk**: Low
### [ ] Task 5: Smoke-test end-to-end via Admin panel
- **What**: Via Admin → "Generate Missing Canonical Scenes" to regenerate GLBs with `partKeyMap` + auto-chain USD masters for existing CAD files
- **Acceptance gate**:
- `GET /api/cad/{id}/scene-manifest` returns `{"parts": [...], ...}` for a processed CadFile
- ThreeDViewer loads, click a part → MaterialPanel shows assignment provenance
- Assign a material → reload page → assignment still present
- **Dependencies**: Task 3
- **Risk**: Medium — existing CAD files need backfill; may take minutes for bulk jobs to complete
### [x] Task 3: Fix `eng.dispose()` placement in cache-hit early-return path
- **File**: `backend/app/domains/pipeline/tasks/export_glb.py`
- **What**: In `generate_gltf_geometry_task`, the cache-hit path (lines 86–95) calls `eng.dispose()` at line 89 while still inside the `with Session(eng)` block, then returns. Move `eng.dispose()` to *after* the `with` block exits.
Current (broken):
```python
with Session(eng) as session:
    ...
    if existing_geo:
        pl.step_done(...)
        eng.dispose()          # ← inside with block
        try:
            generate_usd_master_task.delay(cad_file_id)
        ...
        return {"cached": True, ...}
eng.dispose()  # normal path
```
Fixed: remove the `eng.dispose()` at line 89, and move the `generate_usd_master_task.delay()` + `return` to after the `with` block:
```python
_cache_hit_asset_id: str | None = None
with Session(eng) as session:
...
if existing_geo:
logger.info("[CACHE] hash match — skipping geometry GLB tessellation for %s", cad_file_id)
pl.step_done("export_glb_geometry", result={"cached": True, "asset_id": str(existing_geo.id)})
_cache_hit_asset_id = str(existing_geo.id)
eng.dispose()
if _cache_hit_asset_id is not None:
try:
generate_usd_master_task.delay(cad_file_id)
except Exception:
logger.debug("Could not queue generate_usd_master_task from cache-hit path (non-fatal)")
return {"cached": True, "asset_id": _cache_hit_asset_id}
# ... rest of function (tessellation path)
```
- **Acceptance gate**: `docker compose exec render-worker python3 -c "import app"` (no import errors); cache-hit path still skips tessellation and chains USD master.
- **Dependencies**: none
- **Risk**: Low — pure refactor, no logic change.
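The ordering that the Task 3 fix aims for can be checked without SQLAlchemy. In this sketch `FakeEngine`, `fake_session` and `cache_hit_fixed` are hypothetical stand-ins that only record the order of events; they are not the project's engine, session, or task code.

```python
from contextlib import contextmanager

events = []


class FakeEngine:
    """Hypothetical engine that only records when it is disposed."""

    def dispose(self):
        events.append("engine.dispose")


@contextmanager
def fake_session(engine):
    """Hypothetical session context manager that records open and close."""
    events.append("session.open")
    try:
        yield object()
    finally:
        events.append("session.close")


def cache_hit_fixed(engine):
    cached_asset_id = None
    with fake_session(engine):
        # Inside the block: only remember the result, do not dispose or return.
        cached_asset_id = "asset-1"
    engine.dispose()  # runs after the session has closed
    if cached_asset_id is not None:
        events.append("queue usd task")
        return {"cached": True, "asset_id": cached_asset_id}
    return {"cached": False}


result = cache_hit_fixed(FakeEngine())
print(events)
```

Recording the asset id inside the block and acting on it afterwards keeps a single `dispose()` call that always runs after the session has closed.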
### [x] Task 4: Add `updated_at` in `save_manual_material_overrides`
- **File**: `backend/app/api/routers/cad.py`
- **What**: In `save_manual_material_overrides` (around line 537), add `cad.updated_at = datetime.utcnow()` before `await db.commit()`:
```python
cad.manual_material_overrides = body.overrides
cad.updated_at = datetime.utcnow() # ← add this line
await db.commit()
```
- **Acceptance gate**: `PUT /api/cad/{id}/manual-material-overrides` → `GET /api/cad/{id}` shows updated `updated_at` timestamp.
- **Dependencies**: none
- **Risk**: None
## Migration Check
Three migrations are pending in the working tree:
- `060_usd_master_asset_type.py` — additive enum value
- `061_material_assignment_layers.py` — additive JSONB columns
- `062_rename_tessellation_settings.py` — UPDATE on `system_settings` rows (already checked: migration 062 was applied per review-report)
**Before running**: read each migration file to confirm no unexpected DROP statements.
No migration required — no new columns or tables.
## Order Recommendation
Migrations → TypeScript check → Rebuild → Commit → Smoke test
Tasks 3 and 4 are independent cleanup items — implement first (low risk).
Tasks 1 and 2 are the core dedup fix — implement after.
Order: Task 4 → Task 3 → Task 1 → Task 2
## Risks / Open Questions
- `usd-core` build in Docker may be slow (first build) — expected, not a problem
- Migration 062 may already be applied (review noted "verified by 0-row SELECT") — `alembic upgrade head` is idempotent if so
- Existing CAD files need backfill for `partKeyMap` in GLB extras — handled by "Generate Missing Canonical Scenes" bulk action
- `resolvePartKey()` falls back to identity (raw mesh name) for GLBs generated before this change — graceful degradation, not a blocking issue
- Redis TTL of 30 minutes: if a task crashes hard (OOM, SIGKILL) without running `finally`, the lock stays for 30 minutes. This is the same tradeoff as `process_step_file`. Acceptable.
- `generate_usd_master_task` is also queued by the cache-hit path in `generate_gltf_geometry_task` — that chained call will be deduplicated by the lock too if the primary USD task is already running. Correct behaviour.
- The auto-chain from `generate_gltf_geometry_task → generate_usd_master_task` is still desirable (keeps canonical scene up-to-date after a fresh GLB). The lock prevents the *duplicate*, not the *legitimate* chain.
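
The stale-lock tradeoff in the first of the three lock-related bullets above can be illustrated with a manual clock. This is a sketch only: `ExpiringLockStore` is a hypothetical in-memory model of `SET NX EX`, not Redis itself, and the lock key and timings are example values.

```python
class ExpiringLockStore:
    """Hypothetical model of SET NX EX driven by a manually advanced clock."""

    def __init__(self):
        self.now = 0.0
        self._expires_at = {}

    def set_nx_ex(self, key, ttl_seconds):
        """Acquire the key unless an unexpired holder already exists."""
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > self.now:
            return False
        self._expires_at[key] = self.now + ttl_seconds
        return True

    def advance(self, seconds):
        self.now += seconds


store = ExpiringLockStore()
key = "usd_master_lock:42"

# The first task acquires the lock and is then killed: no finally, no delete.
crashed_holder = store.set_nx_ex(key, 1800)

store.advance(600)  # 10 minutes later the lock is still held
retry_after_10_min = store.set_nx_ex(key, 1800)

store.advance(1300)  # past the 30-minute TTL the key has expired
retry_after_ttl = store.set_nx_ex(key, 1800)

print(crashed_holder, retry_after_10_min, retry_after_ttl)
```

In this model a retry during the TTL window is skipped as a duplicate even though no task is running, which is the cost the plan accepts in exchange for automatic recovery after 30 minutes.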