HartOMat/plan.md
Hartmut 71e099305c fix: deduplicate GLB/USD generation with Redis locks + review fixes
- Add per-file Redis SET NX EX 1800 locks to generate_gltf_geometry_task
  and generate_usd_master_task — concurrent duplicates (e.g. double-click
  of bulk action buttons) now log a warning and return immediately instead
  of running two expensive OCC tessellation subprocesses on the same file
- Fix eng.dispose() called inside with Session() block in cache-hit path
  of both tasks — moved to after the with block exits (Tasks 3+4 from plan)
- Add cad.updated_at = datetime.utcnow() in save_manual_material_overrides
  (was missing vs parallel save_part_materials endpoint)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 13:50:05 +01:00

# Plan: Deduplication for GLB/USD Generation + Two Review Fixes
## Context
Three items to address: one main bug and two review fixes.
**1. Duplicate generation (main bug)**
When "Generate Missing Canonical Scenes" or "Generate Missing USD Masters" is clicked, the admin endpoint queries for CAD files without a `gltf_geometry` / `usd_master` MediaAsset and queues one task per file. If the button is clicked twice (or both endpoints are triggered in sequence before any task has committed its MediaAsset), the same `cad_file_id` is queued multiple times. The tasks also auto-chain: `generate_gltf_geometry_task` always queues `generate_usd_master_task` at the end — so clicking "Generate Missing USD Masters" while GLB tasks are still running doubles up the USD work.
The existing cache check (`step_file_hash`) only short-circuits tessellation when a MediaAsset already exists — it does not prevent two concurrent tasks from both starting the expensive subprocess on the same file. Two processes writing to `_geometry.glb` simultaneously causes corruption / wasted compute.
**Solution**: Apply the same Redis `SET NX EX` dedup lock that `process_step_file` uses (lock key `step_processing_lock:{id}`, released in `finally`). Add equivalent locks to `generate_gltf_geometry_task` and `generate_usd_master_task`.
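For illustration, the `SET NX EX` pattern is an atomic test-and-set: the first caller creates the key and wins, every concurrent caller gets `None` back. A minimal sketch with a dict-backed stand-in (the `MiniRedis` class is hypothetical, for demonstration only; the real tasks use `redis.from_url`, and TTL handling is omitted here):

```python
class MiniRedis:
    """Dict-backed stand-in for the two Redis calls the plan relies on:
    SET key val NX EX and DEL. Illustration only; no TTL handling."""

    def __init__(self):
        self.store = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None          # NX: refuse if the key already exists
        self.store[key] = value
        return True              # acquisition succeeded

    def delete(self, key):
        self.store.pop(key, None)

r = MiniRedis()
assert r.set("glb_geometry_lock:42", "1", nx=True, ex=1800)       # first task wins the lock
assert not r.set("glb_geometry_lock:42", "1", nx=True, ex=1800)   # concurrent duplicate is refused
r.delete("glb_geometry_lock:42")                                  # the finally-block release
assert r.set("glb_geometry_lock:42", "1", nx=True, ex=1800)       # file can be processed again
```

This is why a double-click is harmless: the second batch of tasks sees the key and returns immediately instead of starting a second tessellation subprocess.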
**2. Review fix A — `eng.dispose()` inside `with Session` block**
`export_glb.py` line 89: `eng.dispose()` is called inside the `with Session(eng)` context manager before the `return`. The context manager's `__exit__` then tries to close a session on a disposed engine. Safe in practice (no exception raised) but fragile and misleading. Move `eng.dispose()` to after the `with` block exits.
**3. Review fix B — `save_manual_material_overrides` missing `updated_at`**
`cad.py` line 537: `cad.manual_material_overrides = body.overrides` is committed without updating `cad.updated_at`. The parallel endpoint `save_part_materials` (line 430) does call `cad.updated_at = datetime.utcnow()`. Add the same line to `save_manual_material_overrides`.
## Affected Files
| File | Change |
|---|---|
| `backend/app/domains/pipeline/tasks/export_glb.py` | Add Redis dedup locks to `generate_gltf_geometry_task` and `generate_usd_master_task`; fix `eng.dispose()` placement |
| `backend/app/api/routers/cad.py` | Add `cad.updated_at = datetime.utcnow()` in `save_manual_material_overrides` |
## Tasks (in order)
### [x] Task 1: Add Redis dedup locks to `generate_gltf_geometry_task`
- **File**: `backend/app/domains/pipeline/tasks/export_glb.py`
- **What**: At the top of `generate_gltf_geometry_task`, after `pl.step_start(...)`, acquire a Redis lock using the same pattern as `extract_metadata.py`:
```python
import redis as _redis_lib

_lock_key = f"glb_geometry_lock:{cad_file_id}"
_r = _redis_lib.from_url(app_settings.redis_url)
_acquired = _r.set(_lock_key, "1", nx=True, ex=1800)  # 30-min TTL
if not _acquired:
    logger.warning("generate_gltf_geometry_task: %s already in-flight — skipping duplicate", cad_file_id)
    pl.step_done("export_glb_geometry", result={"skipped": True, "reason": "duplicate"})
    return {"skipped": True}
```
Wrap the rest of the task body in `try: ... finally: _r.delete(_lock_key)`.
Note: `app_settings` is already imported inside the function. Import `redis` at the top of the `try` block as `import redis as _redis_lib` (same pattern as `extract_metadata.py` which imports it locally).
- **Acceptance gate**: Trigger "Generate Missing Canonical Scenes" twice in quick succession — worker logs show `"already in-flight — skipping duplicate"` for the second batch; no file ends up being tessellated twice.
- **Dependencies**: none
- **Risk**: Low — same pattern as `process_step_file`, TTL 30min covers worst-case tessellation time.
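The acquire/work/release shape described above can be sketched end to end. This is a stand-in, not the real task: `run_with_dedup_lock` and `DictRedis` are hypothetical names for illustration. The point it demonstrates is that the `finally` releases the lock even when the work raises:

```python
class DictRedis:
    """Dict-backed stand-in for the Redis client (illustration only)."""

    def __init__(self):
        self.d = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.d:
            return None
        self.d[key] = value
        return True

    def delete(self, key):
        self.d.pop(key, None)

def run_with_dedup_lock(r, lock_key, work):
    """Sketch of the lock-wrapped task body: SET NX EX, run, release in finally."""
    if not r.set(lock_key, "1", nx=True, ex=1800):
        return {"skipped": True}   # a concurrent duplicate bails out early
    try:
        return work()
    finally:
        r.delete(lock_key)         # released even if work() raises

r = DictRedis()
try:
    run_with_dedup_lock(r, "glb_geometry_lock:42", lambda: 1 / 0)
except ZeroDivisionError:
    pass
assert "glb_geometry_lock:42" not in r.d  # lock released despite the exception
```

The same shape applies to both tasks; only the lock-key prefix and the step name differ.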
### [x] Task 2: Add Redis dedup lock to `generate_usd_master_task`
- **File**: `backend/app/domains/pipeline/tasks/export_glb.py`
- **What**: Same pattern at the top of `generate_usd_master_task`, after `pl.step_start(...)`:
```python
import redis as _redis_lib

_lock_key = f"usd_master_lock:{cad_file_id}"
_r = _redis_lib.from_url(app_settings.redis_url)
_acquired = _r.set(_lock_key, "1", nx=True, ex=1800)  # 30-min TTL
if not _acquired:
    logger.warning("generate_usd_master_task: %s already in-flight — skipping duplicate", cad_file_id)
    pl.step_done("usd_master", result={"skipped": True, "reason": "duplicate"})
    return {"skipped": True}
```
Wrap the rest of the function body in `try: ... finally: _r.delete(_lock_key)`.
- **Acceptance gate**: Trigger "Generate Missing USD Masters" while GLB tasks are still running — worker logs show USD tasks skipping duplicates instead of starting a second tessellation.
- **Dependencies**: Task 1
- **Risk**: Low
### [x] Task 3: Fix `eng.dispose()` placement in cache-hit early-return path
- **File**: `backend/app/domains/pipeline/tasks/export_glb.py`
- **What**: In `generate_gltf_geometry_task`, the cache-hit path (lines 86–95) calls `eng.dispose()` at line 89 while still inside the `with Session(eng)` block, then returns. Move `eng.dispose()` to *after* the `with` block exits.
Current (broken):
```python
with Session(eng) as session:
    ...
    if existing_geo:
        pl.step_done(...)
        eng.dispose()  # ← inside the with block
        try:
            generate_usd_master_task.delay(cad_file_id)
        ...
        return {"cached": True, ...}
eng.dispose()  # normal path
```
Fixed: remove the `eng.dispose()` at line 89, and move the `generate_usd_master_task.delay()` + `return` to after the `with` block:
```python
_cache_hit_asset_id: str | None = None
with Session(eng) as session:
    ...
    if existing_geo:
        logger.info("[CACHE] hash match — skipping geometry GLB tessellation for %s", cad_file_id)
        pl.step_done("export_glb_geometry", result={"cached": True, "asset_id": str(existing_geo.id)})
        _cache_hit_asset_id = str(existing_geo.id)
eng.dispose()
if _cache_hit_asset_id is not None:
    try:
        generate_usd_master_task.delay(cad_file_id)
    except Exception:
        logger.debug("Could not queue generate_usd_master_task from cache-hit path (non-fatal)")
    return {"cached": True, "asset_id": _cache_hit_asset_id}
# ... rest of function (tessellation path)
```
- **Acceptance gate**: `docker compose exec render-worker python3 -c "import app"` (no import errors); cache-hit path still skips tessellation and chains USD master.
- **Dependencies**: none
- **Risk**: Low — pure refactor, no logic change.
### [x] Task 4: Add `updated_at` in `save_manual_material_overrides`
- **File**: `backend/app/api/routers/cad.py`
- **What**: In `save_manual_material_overrides` (around line 537), add `cad.updated_at = datetime.utcnow()` before `await db.commit()`:
```python
cad.manual_material_overrides = body.overrides
cad.updated_at = datetime.utcnow() # ← add this line
await db.commit()
```
- **Acceptance gate**: `PUT /api/cad/{id}/manual-material-overrides` → `GET /api/cad/{id}` shows updated `updated_at` timestamp.
- **Dependencies**: none
- **Risk**: None
## Migration Check
No migration required — no new columns or tables.
## Order Recommendation
Tasks 3 and 4 are independent cleanup items — implement first (low risk).
Tasks 1 and 2 are the core dedup fix — implement after.
Order: Task 4 → Task 3 → Task 1 → Task 2
## Risks / Open Questions
- Redis TTL of 30 minutes: if a task crashes hard (OOM, SIGKILL) without running `finally`, the lock stays for 30 minutes. This is the same tradeoff as `process_step_file`. Acceptable.
- `generate_usd_master_task` is also queued by the cache-hit path in `generate_gltf_geometry_task` — that chained call will be deduplicated by the lock too if the primary USD task is already running. Correct behaviour.
- The auto-chain from `generate_gltf_geometry_task → generate_usd_master_task` is still desirable (keeps canonical scene up-to-date after a fresh GLB). The lock prevents the *duplicate*, not the *legitimate* chain.
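The first risk item can be made concrete with a deterministic sketch. `ExpiringLock` is a hypothetical stand-in with an injected clock (real expiry happens server-side in Redis): a lock orphaned by a hard crash blocks retries only until the 30-minute TTL elapses, after which the file becomes processable again.

```python
class ExpiringLock:
    """Stand-in for SET NX EX with lazy expiry; the clock is injected so the
    sketch is deterministic. Only the expiry time is tracked (illustration)."""

    def __init__(self, clock):
        self.clock = clock
        self.expires = {}  # key -> absolute expiry time

    def set(self, key, value, nx=False, ex=1800):
        exp = self.expires.get(key)
        if exp is not None and exp <= self.clock():
            del self.expires[key]  # TTL elapsed: the orphaned lock is gone
        if nx and key in self.expires:
            return None            # still held, so the duplicate is refused
        self.expires[key] = self.clock() + ex
        return True

now = [0.0]
r = ExpiringLock(lambda: now[0])
assert r.set("step_processing_lock:7", "1", nx=True, ex=1800)      # worker acquires, then crashes hard
assert not r.set("step_processing_lock:7", "1", nx=True, ex=1800)  # retries are blocked...
now[0] += 1801                                                     # ...until 30 minutes pass
assert r.set("step_processing_lock:7", "1", nx=True, ex=1800)      # the lock self-heals
```

This matches the tradeoff accepted above: no lock server or fencing needed, at the cost of a worst-case 30-minute stall after an unclean worker death.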