HartOMat/plan.md
Hartmut cfccdd5397 feat: rich product metadata extraction from STEP files
Extract volume, surface area, part count, assembly hierarchy, and
complexity from STEP files via OCC B-rep analysis.

Backend:
- extract_rich_metadata() in step_processor.py: computes per-part volume
  (BRepGProp), surface area, triangle/vertex count, assembly depth,
  instance count, complexity score, largest part identification
- cad_metadata JSONB column on Product model (DB migration)
- Auto-populated during STEP processing (non-fatal, 10s timeout)
- Also stored in cad_files.mesh_attributes["rich_metadata"]
- Batch re-extract endpoint: POST /admin/settings/reextract-rich-metadata

AI Agent:
- search_products returns part_count, volume_cm3, complexity, largest_part
- query_database tool description documents cad_metadata schema

Frontend:
- ProductDetail page: CAD Metadata section with stat cards
  (parts, volume, surface area, complexity, triangles, assembly depth)
- Admin System Tools: "Re-extract Rich Metadata" button for backfill

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 18:49:50 +01:00


# Plan: Rich Product Metadata Extraction from STEP Files
## Context
The AI chat agent was asked "What is the biggest product from my order?" and couldn't answer because dimensional data wasn't available in tool results. While `cad_files.mesh_attributes` already stores bounding box dimensions, much more metadata is extractable from STEP files via OCC that would make the AI agent and the product library significantly more useful.
**Currently extracted**: part names, bounding box (xyz), sharp edges, smooth angle
**Available but not extracted**: per-part volume, surface area, assembly hierarchy, instance counts, embedded colors, triangle counts, geometric complexity
**Goal**: Expand the STEP metadata extraction to compute richer product characteristics and store them in a structured `cad_metadata` JSONB field, accessible to the AI agent, product search, and frontend.
## Affected Files
| File | Change |
|------|--------|
| `backend/app/services/step_processor.py` | Expand `extract_step_metadata()` with volume, surface area, hierarchy, complexity |
| `backend/app/domains/products/models.py` | Add `cad_metadata` JSONB column to Product |
| `backend/alembic/versions/XXX_add_cad_metadata.py` | Migration |
| `backend/app/domains/pipeline/tasks/extract_metadata.py` | Populate `cad_metadata` after STEP processing |
| `backend/app/domains/products/schemas.py` | Expose `cad_metadata` in ProductOut |
| `backend/app/services/chat_service.py` | Include metadata in search_products and system prompt |
| `frontend/src/pages/ProductDetail.tsx` | Display rich metadata (volume, part count, complexity) |
| `backend/app/api/routers/admin.py` | Batch re-extract action for existing products (Task 6) |
## Tasks (in order)
### [ ] Task 1: Expand STEP metadata extraction
- **File**: `backend/app/services/step_processor.py`
- **What**: Expand `extract_step_metadata()` to compute additional properties after the existing bbox/edge extraction. Add a new function `extract_rich_metadata(doc, shape_tool)` that returns:
```python
{
    "part_count": 42,                  # Number of leaf parts
    "assembly_depth": 3,               # Max nesting depth
    "total_volume_cm3": 1250.4,        # Sum of all part volumes (cm³)
    "total_surface_area_cm2": 3400.2,  # Sum of all surface areas (cm²)
    "total_triangle_count": 45000,     # After tessellation
    "total_vertex_count": 23000,       # After tessellation
    "largest_part": {                  # Part with largest volume
        "name": "OuterRing",
        "volume_cm3": 450.2,
    },
    "smallest_dimension_mm": 0.5,      # Smallest bbox dimension across all parts
    "instance_count": 36,              # Total instances (parts may repeat)
    "unique_part_count": 12,           # Distinct shapes
    "complexity_score": "high",        # low/medium/high based on triangle count
}
```
Use OCC:
- `GProp_GProps` + `BRepGProp.VolumeProperties()` for volume
- `BRepGProp.SurfaceProperties()` for surface area
- `Poly_Triangulation` for triangle/vertex counts (after tessellation)
- Assembly tree walk (already done in `_collect_part_key_map`) for hierarchy depth + instance count
- **Acceptance gate**: `extract_rich_metadata()` returns all fields for a test STEP file
- **Dependencies**: None
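
The aggregation step for Task 1 could look roughly like the sketch below. It is plain Python and does not call OCC. It assumes the per-part values have already been measured, and that OCC reports them in model units of mm³ and mm² (so volume is divided by 1000 and area by 100). The `PartInstance` record, the function names, and the complexity thresholds are all hypothetical. It also treats every leaf occurrence as one instance, since the plan does not define how `part_count` differs from `instance_count`.

```python
from dataclasses import dataclass

MM3_PER_CM3 = 1000.0  # 1 cm³ = 1000 mm³
MM2_PER_CM2 = 100.0   # 1 cm² = 100 mm²


@dataclass(frozen=True)
class PartInstance:
    """One leaf-part occurrence in the assembly tree (hypothetical record)."""
    name: str
    shape_key: str    # repeated instances of the same shape share this key
    depth: int        # nesting depth of this occurrence
    volume_mm3: float
    area_mm2: float
    triangles: int
    vertices: int


def complexity_score(triangle_count: int,
                     medium_from: int = 5_000,
                     high_from: int = 25_000) -> str:
    """Bucket a triangle count; the thresholds are placeholders, not from the plan."""
    if triangle_count >= high_from:
        return "high"
    if triangle_count >= medium_from:
        return "medium"
    return "low"


def aggregate_rich_metadata(instances: list[PartInstance]) -> dict:
    """Fold per-instance measurements into a subset of the planned fields."""
    if not instances:
        return {}
    largest = max(instances, key=lambda p: p.volume_mm3)
    total_triangles = sum(p.triangles for p in instances)
    return {
        "assembly_depth": max(p.depth for p in instances),
        "total_volume_cm3": round(sum(p.volume_mm3 for p in instances) / MM3_PER_CM3, 1),
        "total_surface_area_cm2": round(sum(p.area_mm2 for p in instances) / MM2_PER_CM2, 1),
        "total_triangle_count": total_triangles,
        "total_vertex_count": sum(p.vertices for p in instances),
        "largest_part": {
            "name": largest.name,
            "volume_cm3": round(largest.volume_mm3 / MM3_PER_CM3, 1),
        },
        "instance_count": len(instances),
        "unique_part_count": len({p.shape_key for p in instances}),
        "complexity_score": complexity_score(total_triangles),
    }
```

Keeping the aggregation separate from the OCC calls would let the acceptance gate be checked with hand-built records before a real STEP file is involved.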
### [ ] Task 2: Add cad_metadata column to Product model
- **File**: `backend/app/domains/products/models.py`
- **What**: Add `cad_metadata: Mapped[dict | None] = mapped_column(JSONB, nullable=True, default=None)` to the Product model. This stores the rich metadata at the product level (not cad_file) because products are the user-facing entity.
- **Migration**: `alembic revision --autogenerate -m "add cad_metadata to products"`
- **Also**: Add to ProductOut schema in `backend/app/domains/products/schemas.py`
- **Acceptance gate**: Column exists, schema includes it
- **Dependencies**: None
### [ ] Task 3: Populate cad_metadata during STEP processing
- **File**: `backend/app/domains/pipeline/tasks/extract_metadata.py`
- **What**: After `process_step_file` extracts objects and queues the thumbnail, call `extract_rich_metadata()` and store the result on the Product's `cad_metadata` field. Also store it on `cad_files.mesh_attributes`, merged with the existing data rather than replacing it.
- **Also**: Add a "reextract metadata" admin action that re-runs this for all existing products
- **Acceptance gate**: After STEP processing, product.cad_metadata is populated with volume, part_count, etc.
- **Dependencies**: Tasks 1, 2
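
One way to keep the extraction non-fatal and time-bounded in Task 3 is sketched below, using only the standard library. The helper names `run_non_fatal` and `merge_mesh_attributes` are hypothetical. The `"rich_metadata"` key follows the commit message. The thread-based timeout is an assumption about how the cap might be enforced, not a description of the actual pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Optional


def run_non_fatal(extract: Callable[[], dict], timeout_s: float) -> Optional[dict]:
    """Run `extract` with a time cap; return None on timeout or any error.

    The worker thread is not killed on timeout. It keeps running in the
    background, and the caller simply stops waiting for it.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(extract)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return None
    except Exception:
        # Extraction problems must not fail the surrounding STEP processing.
        return None
    finally:
        executor.shutdown(wait=False)


def merge_mesh_attributes(existing: Optional[dict], rich: Optional[dict]) -> dict:
    """Return a copy of `existing` with `rich` stored under "rich_metadata"."""
    merged = dict(existing or {})
    if rich is not None:
        merged["rich_metadata"] = rich
    return merged
```

Because a timed-out thread continues to consume CPU, a process-based worker may be a better fit if slow files turn out to be common.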
### [ ] Task 4: Expose metadata in AI agent tools
- **File**: `backend/app/services/chat_service.py`
- **What**:
1. Update `_tool_search_products()` to include `cad_metadata` fields (part_count, total_volume_cm3, complexity_score) in results
2. Update `query_database` tool description to mention `products.cad_metadata` JSONB field
3. Update system prompt to mention available metadata
- **Acceptance gate**: AI agent can answer "What is the biggest product?" using volume data
- **Dependencies**: Task 3
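
As a rough illustration of Task 4, the sketch below flattens `cad_metadata` into the fields named in step 1 and picks the largest product by total volume. Products are modeled as plain dicts, and both function names are hypothetical. The real `_tool_search_products()` presumably works on ORM objects, so this shows only the shape of the logic.

```python
from typing import Optional


def summarize_cad_metadata(cad_metadata: Optional[dict]) -> dict:
    """Pick the fields the agent needs; missing metadata yields None values."""
    meta = cad_metadata or {}
    largest = meta.get("largest_part") or {}
    return {
        "part_count": meta.get("part_count"),
        "total_volume_cm3": meta.get("total_volume_cm3"),
        "complexity_score": meta.get("complexity_score"),
        "largest_part": largest.get("name"),
    }


def biggest_product(products: list[dict]) -> Optional[dict]:
    """Return the product with the largest total volume, ignoring ones without data."""
    with_volume = [
        p for p in products
        if (p.get("cad_metadata") or {}).get("total_volume_cm3") is not None
    ]
    if not with_volume:
        return None
    return max(with_volume, key=lambda p: p["cad_metadata"]["total_volume_cm3"])
```

Products that have not been backfilled yet are skipped rather than treated as zero volume, so the answer to "What is the biggest product?" is not skewed by missing data.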
### [ ] Task 5: Display rich metadata on ProductDetail page
- **File**: `frontend/src/pages/ProductDetail.tsx`
- **What**: Add a "CAD Metadata" section on the product detail page showing:
- Part count + unique parts + instances
- Total volume (cm³) + surface area (cm²)
- Largest part name + volume
- Complexity score badge (low/medium/high)
- Triangle/vertex count
- Assembly depth
- **Acceptance gate**: Metadata is displayed on the product page; the section degrades gracefully (no errors, no broken cards) when `cad_metadata` is not available
- **Dependencies**: Task 2
### [ ] Task 6: Batch re-extract metadata for existing products
- **File**: `backend/app/api/routers/admin.py`
- **What**: Add an admin endpoint, triggered by a "Re-extract Rich Metadata" button in System Tools, that queues a Celery task to re-process all completed STEP files and populate `cad_metadata` for all products.
- **Acceptance gate**: Button triggers batch job; existing products get metadata populated
- **Dependencies**: Tasks 1, 3
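
The core loop of the batch job in Task 6 might look like the sketch below. It leaves out Celery and the database entirely. Products are plain dicts, and `step_path`, `reextract_all`, and the `extract` callable are hypothetical stand-ins. The point is only that one bad file should not abort the backfill.

```python
from typing import Callable


def reextract_all(products: list[dict], extract: Callable[[str], dict]) -> dict:
    """Re-run extraction for every product and report per-outcome counts."""
    counts = {"updated": 0, "skipped": 0, "failed": 0}
    for product in products:
        path = product.get("step_path")
        if not path:
            # No STEP file attached, so there is nothing to extract.
            counts["skipped"] += 1
            continue
        try:
            product["cad_metadata"] = extract(path)
            counts["updated"] += 1
        except Exception:
            # Keep going; a single failing file should not stop the batch.
            counts["failed"] += 1
    return counts
```

Returning the counts gives the admin UI something concrete to show once the job finishes.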
## Migration Check
**Yes** — one new JSONB column on `products` table.
## Order Recommendation
1. Task 1 (extraction logic) + Task 2 (model + migration) — parallel
2. Task 3 (wire up in pipeline)
3. Task 4 (AI agent) + Task 5 (frontend) — parallel
4. Task 6 (batch re-extract)
## Risks / Open Questions
1. **Volume calculation accuracy**: OCC `BRepGProp` computes exact B-rep volume, not mesh-based. This is accurate but can be slow for very complex shapes. Cap at 5s per file.
2. **Performance**: Rich metadata extraction adds ~100-500ms per STEP file. This is acceptable since STEP processing already takes 1-5s.
3. **Existing products**: ~45 products with STEP files need backfill. Task 6 handles this.
4. **Triangle count varies**: Depends on tessellation settings (linear and angular deflection). Store the count at the current tessellation quality for reference, with a note that it's approximate.
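
To make risk 4 concrete, the sketch below estimates how many straight segments are needed to approximate a circle within a given linear deflection (the maximum chord-to-arc distance). For an inscribed regular polygon with n sides that distance is r(1 - cos(π/n)), so n must be at least π / arccos(1 - d/r). This is a simplified geometric model, not how OCC's mesher actually decides, and the function name is hypothetical.

```python
import math


def segments_for_circle(radius_mm: float, linear_deflection_mm: float) -> int:
    """Minimum polygon sides so the chord-to-arc gap stays within the deflection."""
    if not 0 < linear_deflection_mm < radius_mm:
        raise ValueError("deflection must be positive and smaller than the radius")
    # Half of the angle subtended by one chord whose sagitta equals the deflection.
    half_angle = math.acos(1.0 - linear_deflection_mm / radius_mm)
    return max(3, math.ceil(math.pi / half_angle))
```

Under this model, tightening the deflection by a factor of 10 roughly triples the segment count for the same circle. This is why the stored triangle count is only meaningful alongside the tessellation settings used.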