Files
HartOMat/plan.md
T
Hartmut 71e099305c fix: deduplicate GLB/USD generation with Redis locks + review fixes
- Add per-file Redis SET NX EX 1800 locks to generate_gltf_geometry_task
  and generate_usd_master_task — concurrent duplicates (e.g. double-click
  of bulk action buttons) now log a warning and return immediately instead
  of running two expensive OCC tessellation subprocesses on the same file
- Fix eng.dispose() called inside with Session() block in cache-hit path
  of both tasks — moved to after the with block exits (Tasks 3+4 from plan)
- Add cad.updated_at = datetime.utcnow() in save_manual_material_overrides
  (was missing vs parallel save_part_materials endpoint)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 13:50:05 +01:00

8.0 KiB
Raw Blame History

Plan: Deduplication for GLB/USD Generation + Two Review Fixes

Context

Two problems to solve:

1. Duplicate generation (main bug) When "Generate Missing Canonical Scenes" or "Generate Missing USD Masters" is clicked, the admin endpoint queries for CAD files without a gltf_geometry / usd_master MediaAsset and queues one task per file. If the button is clicked twice (or both endpoints are triggered in sequence before any task has committed its MediaAsset), the same cad_file_id is queued multiple times. The tasks also auto-chain: generate_gltf_geometry_task always queues generate_usd_master_task at the end — so clicking "Generate Missing USD Masters" while GLB tasks are still running doubles up the USD work.

The existing cache check (step_file_hash) only short-circuits tessellation when a MediaAsset already exists — it does not prevent two concurrent tasks from both starting the expensive subprocess on the same file. Two processes writing to _geometry.glb simultaneously causes corruption / wasted compute.

Solution: Apply the same Redis SET NX EX dedup lock that process_step_file uses (lock key step_processing_lock:{id}, released in finally). Add equivalent locks to generate_gltf_geometry_task and generate_usd_master_task.

2. Review fix A — eng.dispose() inside with Session block export_glb.py line 89: eng.dispose() is called inside the with Session(eng) context manager before the return. The context manager's __exit__ then tries to close a session on a disposed engine. Safe in practice (no exception raised) but fragile and misleading. Move eng.dispose() to after the with block exits.

3. Review fix B — save_manual_material_overrides missing updated_at cad.py line 537: cad.manual_material_overrides = body.overrides is committed without updating cad.updated_at. The parallel endpoint save_part_materials (line 430) does call cad.updated_at = datetime.utcnow(). Add the same line to save_manual_material_overrides.

Affected Files

File Change
backend/app/domains/pipeline/tasks/export_glb.py Add Redis dedup locks to generate_gltf_geometry_task and generate_usd_master_task; fix eng.dispose() placement
backend/app/api/routers/cad.py Add cad.updated_at = datetime.utcnow() in save_manual_material_overrides

Tasks (in order)

[x] Task 1: Add Redis dedup locks to generate_gltf_geometry_task

  • File: backend/app/domains/pipeline/tasks/export_glb.py

  • What: At the top of generate_gltf_geometry_task, after pl.step_start(...), acquire a Redis lock using the same pattern as extract_metadata.py:

    import redis as _redis_lib
    _lock_key = f"glb_geometry_lock:{cad_file_id}"
    _r = _redis_lib.from_url(app_settings.redis_url)
    _acquired = _r.set(_lock_key, "1", nx=True, ex=1800)  # 30-min TTL
    if not _acquired:
        logger.warning("generate_gltf_geometry_task: %s already in-flight — skipping duplicate", cad_file_id)
        pl.step_done("export_glb_geometry", result={"skipped": True, "reason": "duplicate"})
        return {"skipped": True}
    

    Wrap the rest of the task body in try: ... finally: _r.delete(_lock_key).

    Note: app_settings is already imported inside the function. Import redis at the top of the try block as import redis as _redis_lib (same pattern as extract_metadata.py which imports it locally).

  • Acceptance gate: Trigger "Generate Missing Canonical Scenes" twice in quick succession — worker logs show "already in-flight — skipping duplicate" for the second batch; no file ends up being tessellated twice.

  • Dependencies: none

  • Risk: Low — same pattern as process_step_file, TTL 30min covers worst-case tessellation time.

[x] Task 2: Add Redis dedup lock to generate_usd_master_task

  • File: backend/app/domains/pipeline/tasks/export_glb.py

  • What: Same pattern at the top of generate_usd_master_task, after pl.step_start(...):

    import redis as _redis_lib
    _lock_key = f"usd_master_lock:{cad_file_id}"
    _r = _redis_lib.from_url(app_settings.redis_url)
    _acquired = _r.set(_lock_key, "1", nx=True, ex=1800)  # 30-min TTL
    if not _acquired:
        logger.warning("generate_usd_master_task: %s already in-flight — skipping duplicate", cad_file_id)
        pl.step_done("usd_master", result={"skipped": True, "reason": "duplicate"})
        return {"skipped": True}
    

    Wrap the rest of the function body in try: ... finally: _r.delete(_lock_key).

  • Acceptance gate: Trigger "Generate Missing USD Masters" while GLB tasks are still running — worker logs show USD tasks skipping duplicates instead of starting a second tessellation.

  • Dependencies: Task 1

  • Risk: Low

[x] Task 3: Fix eng.dispose() placement in cache-hit early-return path

  • File: backend/app/domains/pipeline/tasks/export_glb.py

  • What: In generate_gltf_geometry_task, the cache-hit path (lines 8695) calls eng.dispose() at line 89 while still inside the with Session(eng) block, then returns. Move eng.dispose() to after the with block exits.

    Current (broken):

    with Session(eng) as session:
        ...
        if existing_geo:
            pl.step_done(...)
            eng.dispose()          # ← inside with block
            try:
                generate_usd_master_task.delay(cad_file_id)
            ...
            return {"cached": True, ...}
    eng.dispose()                  # normal path
    

    Fixed: remove the eng.dispose() at line 89, and move the generate_usd_master_task.delay() + return to after the with block:

    _cache_hit_asset_id: str | None = None
    with Session(eng) as session:
        ...
        if existing_geo:
            logger.info("[CACHE] hash match — skipping geometry GLB tessellation for %s", cad_file_id)
            pl.step_done("export_glb_geometry", result={"cached": True, "asset_id": str(existing_geo.id)})
            _cache_hit_asset_id = str(existing_geo.id)
    eng.dispose()
    
    if _cache_hit_asset_id is not None:
        try:
            generate_usd_master_task.delay(cad_file_id)
        except Exception:
            logger.debug("Could not queue generate_usd_master_task from cache-hit path (non-fatal)")
        return {"cached": True, "asset_id": _cache_hit_asset_id}
    
    # ... rest of function (tessellation path)
    
  • Acceptance gate: docker compose exec render-worker python3 -c "import app" (no import errors); cache-hit path still skips tessellation and chains USD master.

  • Dependencies: none

  • Risk: Low — pure refactor, no logic change.

[x] Task 4: Add updated_at in save_manual_material_overrides

  • File: backend/app/api/routers/cad.py

  • What: In save_manual_material_overrides (around line 537), add cad.updated_at = datetime.utcnow() before await db.commit():

    cad.manual_material_overrides = body.overrides
    cad.updated_at = datetime.utcnow()   # ← add this line
    await db.commit()
    
  • Acceptance gate: PUT /api/cad/{id}/manual-material-overridesGET /api/cad/{id} shows updated updated_at timestamp.

  • Dependencies: none

  • Risk: None

Migration Check

No migration required — no new columns or tables.

Order Recommendation

Tasks 3 and 4 are independent cleanup items — implement first (low risk). Tasks 1 and 2 are the core dedup fix — implement after.

Order: Task 4 → Task 3 → Task 1 → Task 2

Risks / Open Questions

  • Redis TTL of 30 minutes: if a task crashes hard (OOM, SIGKILL) without running finally, the lock stays for 30 minutes. This is the same tradeoff as process_step_file. Acceptable.
  • generate_usd_master_task is also queued by the cache-hit path in generate_gltf_geometry_task — that chained call will be deduplicated by the lock too if the primary USD task is already running. Correct behaviour.
  • The auto-chain from generate_gltf_geometry_task → generate_usd_master_task is still desirable (keeps canonical scene up-to-date after a fresh GLB). The lock prevents the duplicate, not the legitimate chain.