HartOMat/plan.md
Hartmut cfccdd5397 feat: rich product metadata extraction from STEP files
Extract volume, surface area, part count, assembly hierarchy, and
complexity from STEP files via OCC B-rep analysis.

Backend:
- extract_rich_metadata() in step_processor.py: computes per-part volume
  (BRepGProp), surface area, triangle/vertex count, assembly depth,
  instance count, complexity score, largest part identification
- cad_metadata JSONB column on Product model (DB migration)
- Auto-populated during STEP processing (non-fatal, 10s timeout)
- Also stored in cad_files.mesh_attributes["rich_metadata"]
- Batch re-extract endpoint: POST /admin/settings/reextract-rich-metadata

AI Agent:
- search_products returns part_count, volume_cm3, complexity, largest_part
- query_database tool description documents cad_metadata schema

Frontend:
- ProductDetail page: CAD Metadata section with stat cards
  (parts, volume, surface area, complexity, triangles, assembly depth)
- Admin System Tools: "Re-extract Rich Metadata" button for backfill

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 18:49:50 +01:00


Plan: Rich Product Metadata Extraction from STEP Files

Context

The AI chat agent was asked "What is the biggest product from my order?" and couldn't answer because dimensional data wasn't available in tool results. While cad_files.mesh_attributes already stores bounding box dimensions, much more metadata is extractable from STEP files via OCC that would make the AI agent and the product library significantly more useful.

Currently extracted: part names, bounding box (xyz), sharp edges, smooth angle

Available but not extracted: per-part volume, surface area, assembly hierarchy, instance counts, embedded colors, triangle counts, geometric complexity

Goal: Expand the STEP metadata extraction to compute richer product characteristics and store them in a structured cad_metadata JSONB field, accessible to the AI agent, product search, and frontend.

Affected Files

| File | Change |
|---|---|
| backend/app/services/step_processor.py | Expand extract_step_metadata() with volume, surface area, hierarchy, complexity |
| backend/app/domains/products/models.py | Add cad_metadata JSONB column to Product |
| backend/alembic/versions/XXX_add_cad_metadata.py | Migration |
| backend/app/domains/pipeline/tasks/extract_metadata.py | Populate cad_metadata after STEP processing |
| backend/app/domains/products/schemas.py | Expose cad_metadata in ProductOut |
| backend/app/services/chat_service.py | Include metadata in search_products and system prompt |
| frontend/src/pages/ProductDetail.tsx | Display rich metadata (volume, part count, complexity) |

Tasks (in order)

[ ] Task 1: Expand STEP metadata extraction

  • File: backend/app/services/step_processor.py
  • What: Expand extract_step_metadata() to compute additional properties after the existing bbox/edge extraction. Add a new function extract_rich_metadata(doc, shape_tool) that returns:
    {
        "part_count": 42,                    # Number of leaf parts
        "assembly_depth": 3,                 # Max nesting depth
        "total_volume_cm3": 1250.4,          # Sum of all part volumes (cm³)
        "total_surface_area_cm2": 3400.2,    # Sum of all surface areas (cm²)
        "total_triangle_count": 45000,       # After tessellation
        "total_vertex_count": 23000,         # After tessellation
        "largest_part": {                    # Part with largest volume
            "name": "OuterRing",
            "volume_cm3": 450.2,
        },
        "smallest_dimension_mm": 0.5,        # Smallest bbox dimension across all parts
        "instance_count": 36,                # Total instances (parts may repeat)
        "unique_part_count": 12,             # Distinct shapes
        "complexity_score": "high",          # low/medium/high based on triangle count
    }
    
    Use OCC:
    • GProp_GProps + BRepGProp.VolumeProperties() for volume
    • BRepGProp.SurfaceProperties() for surface area
    • Poly_Triangulation for triangle/vertex counts (after tessellation)
    • Assembly tree walk (already done in _collect_part_key_map) for hierarchy depth + instance count
  • Acceptance gate: extract_rich_metadata() returns all fields for a test STEP file
  • Dependencies: None
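
The aggregation step of Task 1 can be sketched independently of OCC. The following is a minimal sketch, not the project's implementation: `PartMeasure`, `complexity_bucket`, `aggregate_rich_metadata`, and the triangle thresholds are hypothetical, and it assumes the per-part measurements arrive in millimetre units. `part_count` is left out because the plan does not spell out how it differs from `instance_count`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PartMeasure:
    """Measurements for one leaf-part instance (hypothetical container)."""
    name: str
    shape_id: str                       # same id for repeated instances of one shape
    depth: int                          # nesting depth in the assembly tree
    volume_mm3: float
    area_mm2: float
    triangles: int
    vertices: int
    bbox_mm: Tuple[float, float, float]


def complexity_bucket(triangles: int, medium_at: int = 10_000, high_at: int = 40_000) -> str:
    """Map a triangle count to low/medium/high; the thresholds are illustrative only."""
    if triangles >= high_at:
        return "high"
    return "medium" if triangles >= medium_at else "low"


def aggregate_rich_metadata(parts: List[PartMeasure]) -> Optional[dict]:
    """Fold per-instance measurements into the aggregate fields of the plan."""
    if not parts:
        return None
    largest = max(parts, key=lambda p: p.volume_mm3)
    total_triangles = sum(p.triangles for p in parts)
    return {
        "assembly_depth": max(p.depth for p in parts),
        # 1 cm³ = 1000 mm³ and 1 cm² = 100 mm²
        "total_volume_cm3": round(sum(p.volume_mm3 for p in parts) / 1000, 1),
        "total_surface_area_cm2": round(sum(p.area_mm2 for p in parts) / 100, 1),
        "total_triangle_count": total_triangles,
        "total_vertex_count": sum(p.vertices for p in parts),
        "largest_part": {
            "name": largest.name,
            "volume_cm3": round(largest.volume_mm3 / 1000, 1),
        },
        "smallest_dimension_mm": min(min(p.bbox_mm) for p in parts),
        "instance_count": len(parts),
        "unique_part_count": len({p.shape_id for p in parts}),
        "complexity_score": complexity_bucket(total_triangles),
    }
```

Keeping the aggregation separate from the OCC calls would let the acceptance gate be checked with hand-written measurements, without loading a STEP file.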

[ ] Task 2: Add cad_metadata column to Product model

  • File: backend/app/domains/products/models.py
  • What: Add cad_metadata: Mapped[dict | None] = mapped_column(JSONB, nullable=True, default=None) to the Product model. This stores the rich metadata at the product level (not cad_file) because products are the user-facing entity.
  • Migration: alembic revision --autogenerate -m "add cad_metadata to products"
  • Also: Add to ProductOut schema in backend/app/domains/products/schemas.py
  • Acceptance gate: Column exists, schema includes it
  • Dependencies: None

[ ] Task 3: Populate cad_metadata during STEP processing

  • File: backend/app/domains/pipeline/tasks/extract_metadata.py
  • What: After process_step_file extracts objects and queues thumbnail, call extract_rich_metadata() and store the result on the Product's cad_metadata field. Also store it on cad_files.mesh_attributes (merge with existing data).
  • Also: Add a "reextract metadata" admin action that re-runs this for all existing products
  • Acceptance gate: After STEP processing, product.cad_metadata is populated with volume, part_count, etc.
  • Dependencies: Tasks 1, 2
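
A minimal sketch of the wiring in Task 3, assuming plain dicts as stand-ins for the Product and cad_files rows. `merge_rich_metadata` and `populate_cad_metadata` are hypothetical names; the `rich_metadata` key and the non-fatal behaviour follow the commit message.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def merge_rich_metadata(mesh_attributes: Optional[dict], rich: dict) -> dict:
    """Return a copy of mesh_attributes with the rich metadata stored under one key."""
    merged = dict(mesh_attributes or {})   # keep existing bbox/edge data untouched
    merged["rich_metadata"] = rich
    return merged


def populate_cad_metadata(
    product: dict,
    cad_file: dict,
    extract: Callable[[], Optional[dict]],
) -> bool:
    """Run the extractor and store its result; never raise into the pipeline."""
    try:
        rich = extract()
    except Exception:
        # Non-fatal: STEP processing continues without rich metadata.
        logger.warning("rich metadata extraction failed; continuing", exc_info=True)
        return False
    if not rich:
        return False
    product["cad_metadata"] = rich
    cad_file["mesh_attributes"] = merge_rich_metadata(cad_file.get("mesh_attributes"), rich)
    return True
```

Returning a boolean rather than raising keeps thumbnail queuing and object extraction unaffected when a file cannot be measured.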

[ ] Task 4: Expose metadata in AI agent tools

  • File: backend/app/services/chat_service.py
  • What:
    1. Update _tool_search_products() to include cad_metadata fields (part_count, total_volume_cm3, complexity_score) in results
    2. Update query_database tool description to mention products.cad_metadata JSONB field
    3. Update system prompt to mention available metadata
  • Acceptance gate: AI agent can answer "What is the biggest product?" using volume data
  • Dependencies: Task 3
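
To illustrate how the acceptance gate of Task 4 could be met, here is a minimal sketch that flattens cad_metadata into search results and picks the product with the largest volume. `search_result_fields` and `biggest_product` are hypothetical helpers operating on plain dicts; the result keys follow the commit message (part_count, volume_cm3, complexity, largest_part).

```python
from typing import List, Optional


def search_result_fields(product: dict) -> dict:
    """Flatten the cad_metadata fields an agent tool result might carry."""
    meta = product.get("cad_metadata") or {}       # tolerate products without metadata
    largest = meta.get("largest_part") or {}
    return {
        "name": product.get("name"),
        "part_count": meta.get("part_count"),
        "volume_cm3": meta.get("total_volume_cm3"),
        "complexity": meta.get("complexity_score"),
        "largest_part": largest.get("name"),
    }


def biggest_product(products: List[dict]) -> Optional[dict]:
    """Return the search row with the largest volume, ignoring unmeasured products."""
    rows = [search_result_fields(p) for p in products]
    measured = [r for r in rows if r["volume_cm3"] is not None]
    return max(measured, key=lambda r: r["volume_cm3"], default=None)
```

Products that have not been backfilled yet are skipped rather than treated as zero volume, so a missing measurement cannot be mistaken for a small product.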

[ ] Task 5: Display rich metadata on ProductDetail page

  • File: frontend/src/pages/ProductDetail.tsx
  • What: Add a "CAD Metadata" section on the product detail page showing:
    • Part count + unique parts + instances
    • Total volume (cm³) + surface area (cm²)
    • Largest part name + volume
    • Complexity score badge (low/medium/high)
    • Triangle/vertex count
    • Assembly depth
  • Acceptance gate: Metadata displayed on product page; empty gracefully when not available
  • Dependencies: Task 2

[ ] Task 6: Batch re-extract metadata for existing products

  • File: backend/app/api/routers/admin.py
  • What: Add a "Re-extract Rich Metadata" button in System Tools that queues a Celery task to re-process all completed STEP files and populate cad_metadata for all products.
  • Acceptance gate: Button triggers batch job; existing products get metadata populated
  • Dependencies: Tasks 1, 3
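
The body of the batch job in Task 6 might look like the sketch below, which isolates failures per file so that one bad STEP file does not abort the backfill. `reextract_all` and the `reextract_one` callback are hypothetical; the Celery task and endpoint wiring are deliberately omitted.

```python
from typing import Callable, Iterable


def reextract_all(cad_file_ids: Iterable[str], reextract_one: Callable[[str], bool]) -> dict:
    """Re-run extraction for every id and report what happened to each."""
    summary = {"updated": 0, "skipped": 0, "failed": []}
    for cad_file_id in cad_file_ids:
        try:
            if reextract_one(cad_file_id):
                summary["updated"] += 1
            else:
                summary["skipped"] += 1    # e.g. extractor returned no metadata
        except Exception as exc:
            # Record the failure and keep going with the remaining files.
            summary["failed"].append({"id": cad_file_id, "error": str(exc)})
    return summary
```

The returned summary could be surfaced in the admin UI so the operator can see how many products were populated.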

Migration Check

Yes — one new JSONB column on products table.

Order Recommendation

  1. Task 1 (extraction logic) + Task 2 (model + migration) — parallel
  2. Task 3 (wire up in pipeline)
  3. Task 4 (AI agent) + Task 5 (frontend) — parallel
  4. Task 6 (batch re-extract)

Risks / Open Questions

  1. Volume calculation accuracy: OCC BRepGProp computes exact B-rep volume, not mesh-based. This is accurate but can be slow for very complex shapes. Cap at 5s per file.

  2. Performance: Rich metadata extraction adds ~100-500ms per STEP file. This is acceptable since STEP processing already takes 1-5s.

  3. Existing products: ~45 products with STEP files need backfill. Task 6 handles this.

  4. Triangle count varies: Depends on tessellation settings (linear and angular deflection). Store the count at the current tessellation quality for reference, and note that it is approximate.
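
The per-file cap from risk 1 could be approximated as follows. This is a sketch under stated assumptions: `run_with_cap` is a hypothetical helper, and a thread-based timeout only stops waiting for the result. The worker thread itself keeps running until the computation finishes, so a hard cap on a long OCC call would need process isolation instead.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_cap(fn: Callable[[], T], seconds: float) -> Optional[T]:
    """Return fn() if it finishes within `seconds`, else None.

    Exceptions raised by fn propagate to the caller unchanged.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        return None                      # caller treats this as "no rich metadata"
    finally:
        # Do not block on a still-running worker; it cannot be cancelled mid-call.
        executor.shutdown(wait=False)
```

A call such as `run_with_cap(lambda: extract_rich_metadata(doc, shape_tool), 5.0)` would then yield either the metadata dict or None, which fits the non-fatal handling in the pipeline.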