# Plan: Rich Product Metadata Extraction from STEP Files

## Context

The AI chat agent was asked "What is the biggest product from my order?" and couldn't answer because dimensional data wasn't available in tool results. While `cad_files.mesh_attributes` already stores bounding box dimensions, much more metadata is extractable from STEP files via OCC that would make the AI agent and the product library significantly more useful.

**Currently extracted**: part names, bounding box (xyz), sharp edges, smooth angle

**Available but not extracted**: per-part volume, surface area, assembly hierarchy, instance counts, embedded colors, triangle counts, geometric complexity

**Goal**: Expand the STEP metadata extraction to compute richer product characteristics and store them in a structured `cad_metadata` JSONB field, accessible to the AI agent, product search, and frontend.

## Affected Files

| File | Change |
|------|--------|
| `backend/app/services/step_processor.py` | Expand `extract_step_metadata()` with volume, surface area, hierarchy, complexity |
| `backend/app/domains/products/models.py` | Add `cad_metadata` JSONB column to Product |
| `backend/alembic/versions/XXX_add_cad_metadata.py` | Migration |
| `backend/app/domains/pipeline/tasks/extract_metadata.py` | Populate `cad_metadata` after STEP processing |
| `backend/app/domains/products/schemas.py` | Expose `cad_metadata` in ProductOut |
| `backend/app/services/chat_service.py` | Include metadata in search_products and system prompt |
| `frontend/src/pages/ProductDetail.tsx` | Display rich metadata (volume, part count, complexity) |

## Tasks (in order)

### [ ] Task 1: Expand STEP metadata extraction

- **File**: `backend/app/services/step_processor.py`
- **What**: Expand `extract_step_metadata()` to compute additional properties after the existing bbox/edge extraction.
  Add a new function `extract_rich_metadata(doc, shape_tool)` that returns:

  ```python
  {
      "part_count": 42,                    # Number of leaf parts
      "assembly_depth": 3,                 # Max nesting depth
      "total_volume_cm3": 1250.4,          # Sum of all part volumes (cm³)
      "total_surface_area_cm2": 3400.2,    # Sum of all surface areas (cm²)
      "total_triangle_count": 45000,       # After tessellation
      "total_vertex_count": 23000,         # After tessellation
      "largest_part": {                    # Part with largest volume
          "name": "OuterRing",
          "volume_cm3": 450.2,
      },
      "smallest_dimension_mm": 0.5,        # Smallest bbox dimension across all parts
      "instance_count": 36,                # Total instances (parts may repeat)
      "unique_part_count": 12,             # Distinct shapes
      "complexity_score": "high",          # low/medium/high based on triangle count
  }
  ```

  Use OCC:
  - `GProp_GProps` + `BRepGProp.VolumeProperties()` for volume
  - `BRepGProp.SurfaceProperties()` for surface area
  - `Poly_Triangulation` for triangle/vertex counts (after tessellation)
  - Assembly tree walk (already done in `_collect_part_key_map`) for hierarchy depth + instance count
- **Acceptance gate**: `extract_rich_metadata()` returns all fields for a test STEP file
- **Dependencies**: None

### [ ] Task 2: Add cad_metadata column to Product model

- **File**: `backend/app/domains/products/models.py`
- **What**: Add `cad_metadata: Mapped[dict | None] = mapped_column(JSONB, nullable=True, default=None)` to the Product model. This stores the rich metadata at the product level (not cad_file) because products are the user-facing entity.
- **Migration**: `alembic revision --autogenerate -m "add cad_metadata to products"`
- **Also**: Add to ProductOut schema in `backend/app/domains/products/schemas.py`
- **Acceptance gate**: Column exists, schema includes it
- **Dependencies**: None

### [ ] Task 3: Populate cad_metadata during STEP processing

- **File**: `backend/app/domains/pipeline/tasks/extract_metadata.py`
- **What**: After `process_step_file` extracts objects and queues thumbnail, call `extract_rich_metadata()` and store the result on the Product's `cad_metadata` field. Also store it on `cad_files.mesh_attributes` (merge with existing data).
- **Also**: Add a "reextract metadata" admin action that re-runs this for all existing products
- **Acceptance gate**: After STEP processing, product.cad_metadata is populated with volume, part_count, etc.
- **Dependencies**: Tasks 1, 2

### [ ] Task 4: Expose metadata in AI agent tools

- **File**: `backend/app/services/chat_service.py`
- **What**:
  1. Update `_tool_search_products()` to include `cad_metadata` fields (part_count, total_volume_cm3, complexity_score) in results
  2. Update `query_database` tool description to mention `products.cad_metadata` JSONB field
  3. Update system prompt to mention available metadata
- **Acceptance gate**: AI agent can answer "What is the biggest product?"
  using volume data
- **Dependencies**: Task 3

### [ ] Task 5: Display rich metadata on ProductDetail page

- **File**: `frontend/src/pages/ProductDetail.tsx`
- **What**: Add a "CAD Metadata" section on the product detail page showing:
  - Part count + unique parts + instances
  - Total volume (cm³) + surface area (cm²)
  - Largest part name + volume
  - Complexity score badge (low/medium/high)
  - Triangle/vertex count
  - Assembly depth
- **Acceptance gate**: Metadata displayed on product page; the section degrades gracefully when metadata is not available
- **Dependencies**: Task 2

### [ ] Task 6: Batch re-extract metadata for existing products

- **File**: `backend/app/api/routers/admin.py`
- **What**: Add a "Re-extract Rich Metadata" button in System Tools that queues a Celery task to re-process all completed STEP files and populate `cad_metadata` for all products.
- **Acceptance gate**: Button triggers batch job; existing products get metadata populated
- **Dependencies**: Tasks 1, 3

## Migration Check

**Yes**: one new JSONB column on the `products` table.

## Order Recommendation

1. Task 1 (extraction logic) + Task 2 (model + migration), in parallel
2. Task 3 (wire up in pipeline)
3. Task 4 (AI agent) + Task 5 (frontend), in parallel
4. Task 6 (batch re-extract)

## Risks / Open Questions

1. **Volume calculation accuracy**: OCC `BRepGProp` computes exact B-rep volume, not mesh-based. This is accurate but can be slow for very complex shapes. Cap at 5s per file.
2. **Performance**: Rich metadata extraction adds ~100-500ms per STEP file. This is acceptable since STEP processing already takes 1-5s.
3. **Existing products**: ~45 products with STEP files need backfill. Task 6 handles this.
4. **Triangle count varies**: Depends on tessellation settings (linear and angular deflection). Store the count at the current tessellation quality for reference, with a note that it's approximate.
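
Because the triangle count (and with it `complexity_score`) shifts with tessellation settings, it may help to keep the aggregation step separate from the OCC calls. The sketch below shows one possible way to fold per-instance measurements into a subset of the Task 1 fields. Everything in it is an assumption made for illustration: the `PartMeasurement` record, the `aggregate_metadata` and `complexity_score` helpers, the 10,000 / 40,000 triangle thresholds, and the premise that OCC reports lengths in millimetres. `part_count` is left out because the plan does not pin down how it differs from `instance_count`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PartMeasurement:
    """One leaf-part instance; a hypothetical record, not an OCC type."""
    name: str
    shape_id: str                        # shared by repeated instances of one shape
    depth: int                           # nesting depth in the assembly tree
    volume_mm3: float
    surface_area_mm2: float
    bbox_mm: tuple[float, float, float]  # bounding-box extents (x, y, z)
    triangle_count: int
    vertex_count: int


def complexity_score(triangles: int, medium_at: int = 10_000, high_at: int = 40_000) -> str:
    """Bucket a triangle count; both thresholds are placeholder assumptions."""
    if triangles >= high_at:
        return "high"
    if triangles >= medium_at:
        return "medium"
    return "low"


def aggregate_metadata(parts: list[PartMeasurement]) -> dict | None:
    """Fold per-instance measurements into a subset of the Task 1 fields."""
    if not parts:
        return None  # nothing measured: leave cad_metadata as NULL
    largest = max(parts, key=lambda p: p.volume_mm3)
    triangles = sum(p.triangle_count for p in parts)
    return {
        "assembly_depth": max(p.depth for p in parts),
        # Unit conversion: 1 cm^3 = 1000 mm^3 and 1 cm^2 = 100 mm^2.
        "total_volume_cm3": round(sum(p.volume_mm3 for p in parts) / 1000.0, 1),
        "total_surface_area_cm2": round(sum(p.surface_area_mm2 for p in parts) / 100.0, 1),
        "total_triangle_count": triangles,
        "total_vertex_count": sum(p.vertex_count for p in parts),
        "largest_part": {
            "name": largest.name,
            "volume_cm3": round(largest.volume_mm3 / 1000.0, 1),
        },
        "smallest_dimension_mm": min(min(p.bbox_mm) for p in parts),
        "instance_count": len(parts),
        "unique_part_count": len({p.shape_id for p in parts}),
        "complexity_score": complexity_score(triangles),
    }
```

Keeping this step free of OCC imports means it can be unit-tested without a STEP file, and `complexity_score` can be recomputed from stored triangle counts if the tessellation quality or the thresholds change later.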