refactor: things

2026-03-06 17:17:59 +08:00
parent 8c6087683f
commit 33ab1a5d9d
171 changed files with 293 additions and 29894 deletions
@@ -0,0 +1,83 @@
+# Integrating Local Binary Extensions with `uv`
+
+This guide explains how we packaged the local `pyzed` binary extension (originally from a system installation) so that `uv` can manage it as a project dependency.
+
+## The Problem
+
+The ZED SDK installs its Python wrapper (`pyzed`) as a system-level package (often in `/usr/local/lib/...`). It consists of a compiled extension module (a `.so` file that provides `pyzed.sl`) and Python bindings.
+
+`uv` strictly manages virtual environments and dependencies. It cannot directly "see" or import packages from the global system site-packages unless explicitly configured to use system site-packages (which reduces isolation). Furthermore, a raw `.so` file or a bare directory without metadata isn't a valid package source for `uv`.
+
+## The Solution: A Local Package Wrapper
+
+To make `pyzed` compatible with `uv`, we wrapped the raw library files into a proper, minimally compliant Python package located at `libs/pyzed_pkg`.
+
+### 1. Directory Structure
+
+We organized the files into a standard package layout:
+
+```text
+libs/pyzed_pkg/
+├── pyproject.toml       # Package metadata (CRITICAL)
+└── pyzed/               # The actual importable package
+    ├── __init__.py
+    ├── sl.cpython-312-x86_64-linux-gnu.so  # The compiled extension
+    └── ...
+```
+
+### 2. The Local `pyproject.toml`
+
+We created a `pyproject.toml` inside `libs/pyzed_pkg` to tell build tools how to handle the files. 
+
+Key configuration points:
+1.  **Build System**: Uses `setuptools` to bundle the files.
+2.  **Package Discovery**: Explicitly lists `pyzed`.
+3.  **Package Data**: **Crucially**, configures `setuptools` to include the compiled extension (`*.so`) and the type stubs (`*.pyi`). These are non-`.py` files, which `setuptools` does not package unless they are listed as package data.
+
+```toml
+[project]
+name = "pyzed"
+version = "0.1.0"
+description = "Wrapper for ZED SDK"
+requires-python = ">=3.12"
+dependencies = []
+
+[build-system]
+requires = ["setuptools", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[tool.setuptools]
+packages = ["pyzed"]
+
+# IMPORTANT: Ensure the binary extension is included in the build
+[tool.setuptools.package-data]
+pyzed = ["*.so", "*.pyi"]
+```
+
+## Configuring the Main Project
+
+In the root `pyproject.toml` of the application, we use **`[tool.uv.sources]`** to redirect the `pyzed` dependency to our local path.
+
+```toml
+[project]
+dependencies = [
+    "pyzed",          # Declared as a normal dependency
+    "cupy-cuda12x",
+    "numpy",
+]
+
+# uv-specific configuration
+[tool.uv.sources]
+pyzed = { path = "libs/pyzed_pkg" }
+```
+
+## How `uv` Processes This
+
+1.  **Resolution**: When you run `uv sync` or `uv run`, `uv` sees `pyzed` in the dependencies.
+2.  **Source Lookup**: It checks `tool.uv.sources` and finds the local path `libs/pyzed_pkg`.
+3.  **Build/Install**: 
+    *   `uv` treats the directory as a local source tree (an unbuilt package), not as a prebuilt wheel.
+    *   It uses the `build-system` defined in `libs/pyzed_pkg/pyproject.toml` to build a temporary wheel.
+    *   This wheel (containing the `.so` file) is installed into the project's virtual environment (`.venv`).
+
+This ensures that your application works seamlessly with `import pyzed.sl` while maintaining a clean, isolated, and reproducible environment managed by `uv`.
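One way to confirm the result is to ask the interpreter where it would load the module from. The helper below is a hypothetical sketch (it is not part of the repository) and uses only `importlib`. When run inside the project environment, for example via `uv run python`, `module_origin("pyzed.sl")` would be expected to return a path under `.venv` rather than `/usr/local/lib`.

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a module would be loaded from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # find_spec raises when the parent package (e.g. `pyzed`) is missing.
        return None
    return spec.origin if spec is not None else None
```

A `None` result means the wrapper package was not installed into the environment that is currently active.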
@@ -0,0 +1,292 @@
+# Calibrate Extrinsics Workflow
+
+This document explains the workflow for `calibrate_extrinsics.py`, focusing on ground plane alignment (`--auto-align`) and depth-based refinement (`--verify-depth`, `--refine-depth`).
+
+## CLI Overview
+
+The script calibrates camera extrinsics using ArUco markers detected in SVO recordings.
+
+**Key Options:**
+- `--svo`: Path to SVO file(s) or directory containing them.
+- `--markers`: Path to the marker configuration parquet file.
+- `--auto-align`: Enables automatic ground plane alignment (opt-in).
+- `--verify-depth`: Enables depth-based verification of computed poses.
+- `--refine-depth`: Enables optimization of poses using depth data (requires `--verify-depth`).
+- `--use-confidence-weights`: Uses ZED depth confidence map to weight residuals in optimization.
+- `--benchmark-matrix`: Runs a comparison of baseline vs. robust refinement configurations.
+- `--max-samples`: Limits the number of processed samples for fast iteration.
+- `--debug`: Enables verbose debug logging (default is INFO).
+
+## Ground Plane Alignment (`--auto-align`)
+
+When `--auto-align` is enabled, the script attempts to align the global coordinate system such that a specific face of the marker object becomes the ground plane (XZ plane, normal pointing +Y).
+
+**Prerequisites:**
+- The marker parquet file MUST contain `name` and `ids` columns defining which markers belong to which face (e.g., "top", "bottom", "front").
+- If this metadata is missing, alignment is skipped with a warning.
+
+**Decision Flow:**
+The script selects the ground face using the following precedence:
+
+1.  **Explicit Face (`--ground-face`)**:
+    - If you provide `--ground-face="bottom"`, the script looks up the markers for "bottom" in the loaded map.
+    - It computes the average normal of those markers and aligns it to the global up vector.
+
+2.  **Marker ID Mapping (`--ground-marker-id`)**:
+    - If you provide `--ground-marker-id=21`, the script finds which face contains marker 21 (e.g., "bottom").
+    - It then proceeds as if `--ground-face="bottom"` was specified.
+
+3.  **Heuristic Detection (Fallback)**:
+    - If neither option is provided, the script analyzes all visible markers.
+    - It computes the normal for every defined face.
+    - It selects the face whose normal is most aligned with the camera's "down" direction (assuming the camera is roughly upright).
+
+**Logging:**
+The script logs the selected decision path for debugging:
+- `Mapped ground-marker-id 21 to face 'bottom' (markers=[21])`
+- `Using explicit ground face 'bottom' (markers=[21])`
+- `Heuristically detected ground face 'bottom' (markers=[21])`
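The precedence described above can be sketched as a small pure function. This is an illustration only: the function name, the `heuristic` callback, and the fall-through when an explicit face name is not found in the map are assumptions, not the script's actual code.

```python
from typing import Callable, Optional


def select_ground_face(
    face_map: dict[str, list[int]],
    ground_face: Optional[str] = None,
    ground_marker_id: Optional[int] = None,
    heuristic: Optional[Callable[[], Optional[str]]] = None,
) -> tuple[Optional[str], str]:
    """Return (face_name, decision_path) following the documented precedence."""
    # 1. An explicit face name wins when it exists in the loaded map.
    if ground_face is not None and ground_face.lower() in face_map:
        return ground_face.lower(), "explicit"
    # 2. Otherwise map a marker ID back to the face that contains it.
    if ground_marker_id is not None:
        for name, ids in face_map.items():
            if ground_marker_id in ids:
                return name, "marker-id"
    # 3. Fall back to the heuristic (supplied by the caller in this sketch).
    if heuristic is not None:
        return heuristic(), "heuristic"
    return None, "none"
```

Returning the decision path alongside the face name mirrors the three log messages listed above, so the caller can report which rule fired.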
+
+## Depth Verification & Refinement
+
+This workflow uses the ZED camera's depth map to verify and improve the ArUco-based pose estimation.
+
+### 1. Verification (`--verify-depth`)
+- **Input**: The computed extrinsic pose ($T_{world\_from\_cam}$) and the known 3D world coordinates of the marker corners.
+- **Process**:
+    1. Projects marker corners into the camera frame using the computed pose.
+    2. Samples the ZED depth map at these projected 2D locations (using a 5x5 median filter for robustness).
+    3. Compares the *measured* depth (ZED) with the *computed* depth (distance from camera center to projected corner).
+- **Output**:
+    - RMSE (Root Mean Square Error) of the depth residuals.
+    - Number of valid points (where depth was available and finite).
+    - Added to JSON output under `depth_verify`.
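The sampling and RMSE steps can be sketched in plain Python as follows. The helper names are hypothetical, the depth map is a nested list rather than a ZED matrix, and treating non-positive depths as invalid is an assumption of this sketch.

```python
import math
from statistics import median


def sample_depth_median(depth, u, v, window=5):
    """Median of finite, positive depths in a window x window patch centred on pixel (u, v)."""
    half = window // 2
    rows, cols = len(depth), len(depth[0])
    values = [
        depth[r][c]
        for r in range(max(0, v - half), min(rows, v + half + 1))
        for c in range(max(0, u - half), min(cols, u + half + 1))
        if math.isfinite(depth[r][c]) and depth[r][c] > 0
    ]
    return median(values) if values else None


def depth_rmse(measured, computed):
    """RMSE over pairs whose measured depth is valid; returns (rmse, n_valid)."""
    residuals = [m - c for m, c in zip(measured, computed) if m is not None]
    if not residuals:
        return None, 0
    return math.sqrt(sum(r * r for r in residuals) / len(residuals)), len(residuals)
```

Taking the median over a patch means a single missing or noisy pixel at the projected corner does not invalidate the sample.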
+
+### 2. Refinement (`--refine-depth`)
+- **Trigger**: Runs only if verification is enabled and enough valid depth points (>4) are found.
+- **Process**:
+    - Uses `scipy.optimize.least_squares` with a robust loss function (`soft_l1`) to handle outliers.
+    - **Objective Function**: Minimizes the robust residual between computed depth and measured depth for all visible marker corners.
+    - **Confidence Weighting** (`--use-confidence-weights`): If enabled, residuals are weighted by the ZED confidence map (higher confidence = higher weight).
+    - **Constraints**: Bounded optimization to prevent drifting too far from the initial ArUco pose (default: ±5 degrees, ±5cm).
+- **Output**:
+    - Refined pose replaces the original pose in the JSON output.
+    - Improvement stats (delta rotation, delta translation, RMSE reduction) added under `refine_depth`.
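To see why `soft_l1` tolerates outliers, it helps to evaluate the cost by hand. The sketch below reproduces the cost SciPy documents for `loss="soft_l1"`, namely `0.5 * sum(C**2 * rho((r / C)**2))` with `rho(z) = 2 * (sqrt(1 + z) - 1)` and `C = f_scale`. Multiplying each residual by its confidence weight before the loss is an assumption about how the weighting is applied; the function names are illustrative.

```python
import math


def soft_l1_cost(residuals, weights=None, f_scale=1.0):
    """Soft-L1 cost as defined for scipy.optimize.least_squares(loss="soft_l1")."""
    if weights is None:
        weights = [1.0] * len(residuals)
    total = 0.0
    for r, w in zip(residuals, weights):
        z = (w * r / f_scale) ** 2  # squared, scaled (and weighted) residual
        total += f_scale**2 * 2.0 * (math.sqrt(1.0 + z) - 1.0)
    return 0.5 * total


def linear_cost(residuals):
    """Plain least-squares cost, 0.5 * sum(r^2), for comparison."""
    return 0.5 * sum(r * r for r in residuals)
```

For small residuals the two costs agree closely, while a large residual grows roughly linearly instead of quadratically under soft-L1, so one bad depth sample pulls the solution far less.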
+
+### 3. Best Frame Selection
+When multiple frames are available, the system scores them to pick the best candidate for verification/refinement:
+- **Criteria**:
+    - Number of detected markers (primary factor).
+    - Reprojection error (lower is better).
+    - Valid depth ratio (percentage of marker corners with valid depth data).
+    - Depth confidence (if available).
+- **Benefit**: Ensures refinement uses high-quality data rather than just the last valid frame.
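A minimal sketch of such a ranking is shown below. It orders frames lexicographically, with marker count first and the other criteria as tie-breakers. This ordering, the `FrameStats` type, and its field names are assumptions made for illustration; the actual script may blend the criteria into a single weighted score.

```python
from dataclasses import dataclass


@dataclass
class FrameStats:
    frame_index: int
    n_markers: int
    reproj_error: float
    valid_depth_ratio: float
    mean_confidence: float = 0.0


def pick_best_frame(frames):
    """Rank by marker count first, then use the remaining criteria as tie-breakers."""
    return max(
        frames,
        key=lambda f: (f.n_markers, -f.reproj_error, f.valid_depth_ratio, f.mean_confidence),
    )
```

Negating the reprojection error lets a single `max` call prefer lower error while preferring higher values for every other field.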
+
+## Benchmark Matrix (`--benchmark-matrix`)
+
+This mode runs a comparative analysis of different refinement configurations on the same data to evaluate improvements. It compares:
+1. **Baseline**: Linear loss (MSE), no confidence weighting.
+2. **Robust**: Soft-L1 loss, no confidence weighting.
+3. **Robust + Confidence**: Soft-L1 loss with confidence-weighted residuals.
+4. **Robust + Confidence + Best Frame**: All of the above, using the highest-scored frame.
+
+**Output:**
+- Prints a summary table for each camera showing RMSE improvement and iteration counts.
+- Adds a `benchmark` object to the JSON output containing detailed stats for each configuration.
+
+## Fast Iteration (`--max-samples`)
+
+For development or quick checks, processing thousands of frames is unnecessary.
+- Use `--max-samples N` to stop after `N` valid samples (frames where markers were detected).
+- Example: `--max-samples 1` will process the first valid frame, run alignment/refinement, save the result, and exit.
+
+## Example Workflow
+
+**Full Run with Alignment and Robust Refinement:**
+```bash
+uv run calibrate_extrinsics.py \
+  --svo output/recording.svo \
+  --markers aruco/markers/box.parquet \
+  --aruco-dictionary DICT_APRILTAG_36h11 \
+  --auto-align \
+  --ground-marker-id 21 \
+  --verify-depth \
+  --refine-depth \
+  --use-confidence-weights \
+  --output output/calibrated.json
+```
+
+**Benchmark Run:**
+```bash
+uv run calibrate_extrinsics.py \
+  --svo output/recording.svo \
+  --markers aruco/markers/box.parquet \
+  --benchmark-matrix \
+  --max-samples 100
+```
+
+**Fast Debug Run:**
+```bash
+uv run calibrate_extrinsics.py \
+  --svo output/ \
+  --markers aruco/markers/box.parquet \
+  --auto-align \
+  --max-samples 1 \
+  --debug \
+  --no-preview
+```
+
+## Depth Data Management
+
+To enable decoupled refinement workflows, the system supports saving the depth data used during calibration.
+
+### Saving Depth Data
+Use the `--save-depth <path.h5>` flag with `calibrate_extrinsics.py`.
+```bash
+uv run calibrate_extrinsics.py ... --save-depth output/calibration_depth.h5
+```
+
+**HDF5 Format Structure:**
+- `meta/`: Global metadata (schema version, units=meters).
+- `cameras/{serial}/`:
+  - `intrinsics`: Camera matrix (3x3).
+  - `resolution`: [width, height].
+  - `pooled_depth`: The aggregated depth map used for verification (gzip compressed).
+  - `raw_frames/`: (Optional) Individual frames if pooling wasn't used.
+
+This allows `refine_ground_plane.py` to run repeatedly with different parameters without re-processing the raw SVO files.
+
+## Ground Plane Refinement (`refine_ground_plane.py`)
+
+This standalone tool refines camera extrinsics by ensuring all cameras agree on the ground plane location. It addresses common issues where ArUco markers are slightly tilted or not perfectly coplanar with the floor.
+
+### Workflow
+```bash
+uv run refine_ground_plane.py \
+  --input-extrinsics output/calibrated.json \
+  --input-depth output/calibration_depth.h5 \
+  --output-extrinsics output/refined.json \
+  --plot --plot-output output/ground_debug.html
+```
+
+### Algorithm Details
+
+The algorithm proceeds in four stages:
+
+1.  **Plane Detection (Per Camera)**
+    - Unprojects the depth map to a point cloud in the **world frame** (using current extrinsics).
+    - Uses **RANSAC** (via Open3D) to segment the dominant plane.
+    - **Quality Gates**:
+        - Minimum inliers (default: 500 points).
+        - Normal orientation check (must be roughly vertical, `normal_vertical_thresh=0.9`).
+
+2.  **Robust Consensus**
+    - Computes a "consensus plane" from all valid camera detections.
+    - **Method**:
+        - Aligns all normals to the upper hemisphere.
+        - Computes the **geometric median** of normals and distances.
+        - Filters outliers based on deviation from the median (>15° angle or >0.5m distance).
+        - Computes a weighted average of the remaining inlier planes.
+
+3.  **Correction Calculation**
+    - Computes a rigid transform $T_{corr}$ for each camera.
+    - **Constraints**:
+        - **Rotation**: Only corrects pitch and roll (aligns normal to vertical). Yaw is preserved.
+        - **Translation**: Only corrects vertical height (aligns plane distance). X/Z position is preserved.
+    - **Consensus-Relative Correction**: By default, cameras are aligned to the *consensus plane* rather than absolute Y=0. This ensures relative consistency between cameras even if the absolute floor height is slightly off.
+
+4.  **Safety Guardrails**
+    - The correction is **rejected** if:
+        - Rotation > `max_rotation_deg` (default: 5°).
+        - Translation > `max_translation_m` (default: 0.1m).
+        - Deviation from consensus > `max_consensus_deviation` (default: 10°, 0.5m).
+    - **Why no ICP?**: For flat floors, plane-to-plane alignment is more robust than ICP. ICP on featureless planes can drift (slide) along the surface.
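The outlier filter in stage 2 can be illustrated with a short sketch. Two simplifications are made here and should not be read as the tool's implementation: a component-wise median stands in for the geometric median, and planes are assumed to be stored as `(normal, d)` with `normal · x + d = 0`.

```python
import math
from statistics import median


def angle_deg(a, b):
    """Angle between two 3-vectors in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (na * nb)))))


def filter_planes(planes, max_angle_deg=15.0, max_dist_m=0.5):
    """planes: list of (normal, d). Returns the inliers around the median plane."""
    # Flip normals into the upper hemisphere (+Y) so opposite signs do not cancel.
    upper = [(n, d) if n[1] >= 0 else ([-x for x in n], -d) for n, d in planes]
    # Component-wise median: a simple stand-in for the geometric median.
    med_n = [median(n[i] for n, _ in upper) for i in range(3)]
    med_d = median(d for _, d in upper)
    return [
        (n, d)
        for n, d in upper
        if angle_deg(n, med_n) <= max_angle_deg and abs(d - med_d) <= max_dist_m
    ]
```

Flipping `d` together with the normal keeps each entry describing the same physical plane. The weighted average of the surviving planes is left out of the sketch.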
+
+### Tuning Guidance
+
+Based on end-to-end observations:
+
+-   **`--stride`**: Default is 8. Decrease to 4 or 2 for higher density if the floor is far away or sparse.
+-   **`--ransac-dist-thresh`**: Default 0.02m (2cm). Increase to 0.03-0.05m if the floor is uneven or depth noise is high.
+-   **`--max-rotation-deg`**: Keep this tight (3-5°). If the floor correction needs >5°, the initial ArUco calibration is likely poor and should be re-run.
+-   **`--target-y`**: Use this if you need the floor to be at a specific absolute height (e.g., -1.5m) instead of just consistent.
+
+## Known Unexpected Behavior / Troubleshooting
+
+### Resolved: Depth Refinement Failure (Unit Mismatch)
+
+*Note: This issue has been resolved in the latest version by enforcing explicit meter units in the SVO reader and removing ambiguous manual conversions.*
+
+**Previous Symptoms:**
+- `depth_verify` reports extremely large RMSE values (e.g., > 1000).
+- `refine_depth` reports `success: false`, `iterations: 0`, and near-zero improvement.
+
+**Resolution:**
+The system now explicitly sets `InitParameters.coordinate_units = sl.UNIT.METER` when opening SVO files, ensuring consistent units across the pipeline.
+
+### Optimization Stalls
+If `refine_depth` shows `success: false` but `nfev` (evaluations) is high, the optimizer may have hit a flat region or local minimum.
+- **Check**: Look at `termination_message` in the JSON output.
+- **Fix**: Try enabling `--use-confidence-weights` or checking if the initial ArUco pose is too far off (reprojection error > 2.0).
+
+## Implementation Details
+
+### 1. Depth Data Structure (`--save-depth`)
+
+The system uses HDF5 for efficient, compressed storage of depth data required for decoupled refinement.
+
+**File Structure:**
+- **`meta/`**: Global metadata.
+  - `schema_version`: Integer version (currently 1).
+  - `units`: Explicitly "meters".
+  - `coordinate_frame`: "world_from_cam".
+- **`cameras/{serial}/`**: Per-camera data.
+  - `intrinsics`: 3x3 camera matrix.
+  - `resolution`: [width, height].
+  - `pooled_depth`: Aggregated depth map (gzip compressed, level 4).
+  - `raw_frames/`: (Optional) Individual frames if pooling wasn't used.
+
+This structure allows `refine_ground_plane.py` to load pre-processed depth maps without needing the original SVO files or re-running the ArUco detection pipeline.
+
+### 2. Ground Plane Refinement Pipeline
+
+The `refine_ground_plane.py` tool implements a robust multi-camera consensus algorithm to align the floor plane.
+
+**Core Algorithm (`aruco/ground_plane.py`):**
+
+1.  **Per-Camera Plane Detection**:
+    -   **Unprojection**: Converts the depth map to a point cloud in the *world frame* using the initial ArUco extrinsics.
+    -   **RANSAC**: Uses Open3D's `segment_plane` to find the dominant plane.
+    -   **Quality Gates**:
+        -   `min_inliers`: Requires at least 500 points.
+        -   `normal_vertical_thresh`: Normal must be roughly vertical (>0.9 dot product with Y-axis).
+
+2.  **Robust Consensus**:
+    -   Computes the **geometric median** of all valid plane normals and distances to reject outliers.
+    -   **Outlier Rejection**: Discards planes deviating >15° in angle or >0.5m in distance from the median.
+    -   **Weighted Average**: Computes the final consensus plane from the remaining inliers.
+
+3.  **Correction Calculation**:
+    -   Computes a rigid transform $T_{corr}$ for each camera to align its detected floor to the consensus plane (or absolute Y=0).
+    -   **Constraints**:
+        -   **Rotation**: Corrects only pitch and roll to align the normal. Yaw is preserved.
+        -   **Translation**: Corrects only vertical height. X/Z positions are preserved.
+    -   **Consensus-Relative Correction**: By default, aligns cameras to the *consensus plane* to ensure relative consistency.
+    -   **Safety Bounds**: The correction is **rejected** if it exceeds safety limits (default: 5° rotation, 0.1m translation).
+
+### 3. Observed Behavior & Tuning
+
+**Real-World Performance:**
+-   **Legacy/Unstable Behavior**: In early versions (before unit standardization), the system often reported 0 corrections or attempted extreme translations (>1m) due to mm/m confusion or depth noise.
+-   **Hardened Behavior**: In validated runs, the system now applies small, precise corrections (e.g., max ~0.078m translation, < 1° rotation), effectively "snapping" the floor to a consistent level without disrupting the lateral calibration.
+
+**Why No ICP?**
+Iterative Closest Point (ICP) is **not enabled** by default for ground plane alignment.
+-   **Reason**: ICP on featureless planar surfaces is ill-constrained; it can "slide" along the floor, introducing drift in X/Z.
+-   **Approach**: Plane-to-plane alignment is analytically exact for the vertical dimension and rotation, which are the only degrees of freedom we want to correct.
+
+**When to Escalate:**
+If the ground plane refinement fails or produces large corrections (>5°), it usually indicates:
+1.  **Poor Initial Calibration**: The ArUco markers were moved or poorly detected. Re-run `calibrate_extrinsics.py`.
+2.  **Non-Planar Floor**: The floor has significant slopes or steps.
+3.  **Obstacles**: Large objects are occluding the floor in the depth map.
@@ -0,0 +1,70 @@
+# ICP Depth Bias Diagnosis Report
+
+## Executive Summary
+Recent diagnostics indicate that while geometric overlap between camera pairs is high (~71%–80%), significant cross-camera depth bias is the primary blocker for ICP convergence. The overlap regions exhibit acceptable planarity, suggesting that the issue is not a degenerate scene geometry but rather a systematic inconsistency in depth measurement between specific camera units.
+
+## Measured Diagnostics
+
+### 1. Shared FOV Overlap Proxies
+Geometric overlap is sufficient across the cluster.
+- **Range:** 0.707 – 0.799
+- **Pair Minima:** ~0.71
+- **Caution:** Geometric overlap $\neq$ usable overlap. High overlap values do not guarantee ICP success if depth noise or bias is high.
+
+### 2. Valid Depth Ratios
+Percentage of pixels with valid depth data per camera:
+- **41831756:** 0.871
+- **44289123:** 0.870
+- **44435674:** 0.789
+- **46195029:** 0.805
+
+### 3. Overlap Region Planarity
+Measured via $\lambda_3 / \sum \lambda_i$ (ratio of smallest eigenvalue to sum of eigenvalues):
+- **Mean Range:** 0.136 – 0.170
+- **Interpretation:** The overlap regions are not grossly degenerate (not perfectly planar, but sufficiently structured for ICP).
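For reference, the ratio can be written out as follows, assuming the eigenvalues come from the covariance matrix of the 3D points $p_i$ in the overlap region (the report does not state this explicitly):

$$
C = \frac{1}{N}\sum_{i=1}^{N} (p_i - \bar{p})(p_i - \bar{p})^\top, \qquad \lambda_1 \ge \lambda_2 \ge \lambda_3 \ge 0, \qquad \text{ratio} = \frac{\lambda_3}{\lambda_1 + \lambda_2 + \lambda_3}
$$

Under that definition the ratio is 0 for perfectly coplanar points and reaches its maximum of $1/3$ when the points spread equally in all three directions, so the measured 0.136 – 0.170 lies roughly midway between the two extremes.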
+
+### 4. Signed Residual Symmetric Bias Ranking
+Median absolute signed residuals between camera pairs (sorted by worst inconsistency):
+1. **44289123 - 44435674:** 0.137m
+2. **41831756 - 44435674:** 0.115m
+3. **44435674 - 46195029:** 0.098m
+4. **41831756 - 46195029:** 0.095m
+5. **44289123 - 46195029:** 0.082m
+6. **41831756 - 44289123:** 0.038m
+
+## Oracle Interpretation
+The strongest evidence points to **cross-camera depth bias**. 
+- Camera **44435674** is involved in the top 3 most biased pairs, suggesting it is the primary outlier in depth scale or offset.
+- The bias (up to 13.7cm) is significantly larger than the expected noise floor for ZED cameras at typical ranges, which prevents ICP from finding a stable global minimum.
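The outlier claim can be cross-checked by averaging the pairwise residuals per camera. The sketch below uses only the six values from the ranking above; the averaging itself is an illustrative summary, not a metric produced by the diagnostics.

```python
from collections import defaultdict

# Pairwise residuals (meters) copied from the ranking above.
PAIR_BIAS_M = {
    ("44289123", "44435674"): 0.137,
    ("41831756", "44435674"): 0.115,
    ("44435674", "46195029"): 0.098,
    ("41831756", "46195029"): 0.095,
    ("44289123", "46195029"): 0.082,
    ("41831756", "44289123"): 0.038,
}


def mean_bias_per_camera(pair_bias):
    """Average the pairwise residuals over every pair a camera takes part in."""
    sums, counts = defaultdict(float), defaultdict(int)
    for (a, b), bias in pair_bias.items():
        for cam in (a, b):
            sums[cam] += bias
            counts[cam] += 1
    return {cam: sums[cam] / counts[cam] for cam in sums}


per_camera = mean_bias_per_camera(PAIR_BIAS_M)
worst = max(per_camera, key=per_camera.get)
```

With these inputs the per-camera means are about 0.117 m for 44435674, 0.092 m for 46195029, 0.086 m for 44289123, and 0.083 m for 41831756, which is consistent with 44435674 being the primary outlier.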
+
+### What this implies for ICP behavior
+- ICP will likely "drift" or fail to converge because the point clouds from different cameras do not represent the same physical surface at the same coordinates.
+- Residuals will remain high even at the "optimal" alignment.
+- Point cloud "ghosting" or double-surfaces will be visible in fused outputs.
+
+## Remediation Plan (3-Step)
+
+1. **Step 1: Per-Camera Depth Validation (Go/No-Go)**
+   - Measure a known flat target at a fixed distance (e.g., 2m) with each camera.
+   - **Go Criteria:** Absolute depth error < 2% of distance.
+   - **No-Go:** If error > 5%, recalibrate internal camera parameters or apply a linear scale correction.
+
+2. **Step 2: Pairwise Scale Alignment**
+   - Use the least-biased pair (41831756-44289123) as the "golden" reference.
+   - Calculate a scale/offset correction factor for 44435674 to minimize the 13.7cm median residual.
+
+3. **Step 3: Constrained ICP with Bias Compensation**
+   - Re-run ICP using the corrected depth maps.
+   - **Success Criteria:** Median residuals drop below 0.05m across all pairs.
+
+## Recommended Immediate Next Diagnostic
+Perform a **Static Target Depth Sweep**. Place a planar target in the center of the shared FOV and record 100 frames from all cameras. Calculate the per-camera mean distance to the plane to isolate the absolute offset per unit.
+
+## Remediation Applied (2026-02-11)
+Automatic depth bias estimation and pre-correction have been implemented and integrated into the `refine_ground_plane.py` pipeline.

+
+**Outcome:**
+- **Bias-enabled run:** Successfully optimized 1 non-reference camera that previously failed to converge due to depth inconsistency.
+- **No-bias run:** Optimized 0 cameras (baseline), confirming that depth bias was indeed the primary blocker for this dataset.
+
+The system now estimates per-camera offsets relative to a reference unit and applies them to the point clouds before ICP registration, significantly improving robustness to unit-to-unit depth variations.
@@ -0,0 +1,49 @@
+# Marker Parquet Format
+
+This document describes the expected structure for marker configuration files (e.g., `standard_box_markers_600mm.parquet`). These files define both the physical geometry of markers and their logical grouping into faces.
+
+## Schema
+
+The parquet file must contain the following columns:
+
+| Column | Type | Description |
+| :--- | :--- | :--- |
+| `name` | `string` | Name of the face (e.g., "bottom", "front"). Used for logical grouping. |
+| `ids` | `list<int64>` | List of ArUco marker IDs belonging to this face. |
+| `corners` | `list<list<list<float64>>>` | 3D coordinates of marker corners. Shape must be `(N, 4, 3)` where N is the number of markers. |
+
+## Example Data
+
+Based on `standard_box_markers_600mm.parquet`:
+
+| name | ids | corners (approximate structure) |
+| :--- | :--- | :--- |
+| "bottom" | `[21]` | `[[[-0.225, -0.3, 0.226], [0.225, -0.3, 0.226], [0.225, -0.3, -0.224], [-0.225, -0.3, -0.224]]]` |
+
+## Loader Behavior
+
+The system uses two different loading strategies based on this file:
+
+### 1. Geometry Loader (`load_marker_geometry`)
+- **Ignores**: `name` column.
+- **Uses**: `ids` and `corners`.
+- **Process**:
+  - Flattens all `ids` and `corners` from all rows.
+  - Reshapes `corners` to `(-1, 4, 3)`.
+  - Validates that each marker has exactly 4 corners with 3 coordinates (x, y, z).
+  - Validates coordinates are finite and within reasonable range (< 100m).
+- **Output**: `dict[int, np.ndarray]` mapping Marker ID → (4, 3) corner array.
+
+### 2. Face Mapping Loader (`load_face_mapping`)
+- **Uses**: `name` and `ids`.
+- **Process**:
+  - Reads face names and associated marker IDs.
+  - Normalizes names to lowercase.
+- **Output**: `dict[str, list[int]]` mapping Face Name → List of Marker IDs.
+- **Usage**: Used for ground plane alignment (e.g., identifying the "bottom" face).
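Both loading strategies can be sketched on plain Python rows (one dict per parquet row). The function names below are deliberately different from `load_marker_geometry` and `load_face_mapping`: they are stand-ins that follow the documented behaviour, not the repository's implementations, and they skip the parquet reading step.

```python
import math


def geometry_from_rows(rows):
    """rows: iterable of dicts with 'ids' and 'corners'. Returns {marker_id: 4x3 corners}."""
    geometry = {}
    for row in rows:
        if len(row["ids"]) != len(row["corners"]):
            raise ValueError("ids and corners must have the same length")
        for marker_id, corners in zip(row["ids"], row["corners"]):
            if len(corners) != 4 or any(len(pt) != 3 for pt in corners):
                raise ValueError(f"marker {marker_id}: corners must have shape (4, 3)")
            for value in (c for pt in corners for c in pt):
                if not math.isfinite(value) or abs(value) >= 100.0:
                    raise ValueError(f"marker {marker_id}: coordinate out of range")
            geometry[int(marker_id)] = [list(pt) for pt in corners]
    return geometry


def face_mapping_from_rows(rows):
    """Returns {lowercased face name: [marker ids]}."""
    return {row["name"].lower(): [int(i) for i in row["ids"]] for row in rows}
```

The geometry path never reads `name`, and the face path never reads `corners`, which matches the split described above.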
+
+## Validation Rules
+Runtime validation in `marker_geometry.py` ensures:
+- `corners` shape is strictly `(4, 3)` per marker.
+- No `NaN` or `Inf` values.
+- Coordinates are absolute (meters) and must be < 100m.
@@ -0,0 +1,489 @@
+# Visualization Conventions & Coordinate Frame Reference
+
+> **Status**: Canonical reference as of 2026-02-08.
+> **Applies to**: `visualize_extrinsics.py`, `calibrate_extrinsics.py`, and `inside_network.json`.
+
+---
+
+## Executive Summary
+
+The `visualize_extrinsics.py` script went through multiple iterations of coordinate-frame
+switching (OpenCV ↔ OpenGL), Plotly camera/view hacks, and partial basis transforms that
+created compounding confusion about whether the visualization was correct. The root cause
+was **conflating Plotly's scene camera settings with actual data-frame transforms**:
+adjusting `camera.up`, `autorange: "reversed"`, or eye position changes *how you look at
+the data* but does **not** change the coordinate frame the data lives in.
+
+After several rounds of adding and removing `--world-basis`, `--render-space`, and
+`--pose-convention` flags, the visualizer was simplified to a single convention:
+
+- **All data is in OpenCV convention** (+X right, +Y down, +Z forward).
+- **No basis switching**. The `--world-basis` flag was removed.
+- **Plotly's scene camera** is configured with `up = {x:0, y:-1, z:0}` so that the
+  OpenCV +Y-down axis renders as "down" on screen.
+
+The confusion was never a bug in the calibration math — it was a visualization-layer
+problem caused by trying to make Plotly (which defaults to Y-up) display OpenCV data
+(which is Y-down) without a clear separation between "data frame" and "view frame."
+
+---
+
+## Current Policy Checklist (2026-02-09)
+
+For engineers maintaining or using these tools:
+
+- [x] **`calibrate_extrinsics.py`**: Outputs `T_world_from_cam` (OpenCV). Auto-aligns to **Y-down** (gravity along +Y). Writes `_meta` block.
+- [x] **`visualize_extrinsics.py`**: Renders raw JSON data. Ignores `_meta`. Sets view camera up to `-Y`.
+- [x] **`apply_calibration_to_fusion_config.py`**: Direct pose copy only. No `--cv-to-opengl` conversion.
+- [x] **`compare_pose_sets.py`**: Symmetric inputs (`--pose-a-json`, `--pose-b-json`). Heuristic parsing.
+- [x] **Conventions**: Always OpenCV frame (+X Right, +Y Down, +Z Forward). Units in meters.
+
+---
+
+## Ground Truth Conventions
+
+### 1. Calibration Output: `world_from_cam`
+
+`calibrate_extrinsics.py` stores poses as **T_world_from_cam** (4×4 homogeneous):
+
+```
+T_world_from_cam = invert_transform(T_cam_from_world)
+```
+
+- `solvePnP` returns `T_cam_from_world` (maps world points into camera frame).
+- The script **inverts** this before saving to JSON.
+- The translation column `T[:3, 3]` is the **camera center in world coordinates**.
+- The rotation columns `T[:3, :3]` are the camera's local axes expressed in world frame.
+
+**JSON format** (`pose` is a string of 16 space-separated floats, a row-major 4×4 matrix):
+```json
+{
+  "44289123": {
+    "pose": "0.878804 -0.039482 0.475548 -2.155006 0.070301 0.996409 ..."
+  }
+}
+```
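A consumer of this format needs to split the string, reshape it row by row, and, if it wants `T_cam_from_world`, invert the rigid transform. The helpers below are a hypothetical, dependency-free sketch of those steps; their names do not come from the repository.

```python
def parse_pose(pose: str):
    """Turn 16 space-separated floats (row-major) into a 4x4 nested list."""
    values = [float(v) for v in pose.split()]
    if len(values) != 16:
        raise ValueError(f"expected 16 values, got {len(values)}")
    return [values[r * 4:(r + 1) * 4] for r in range(4)]


def camera_center(T):
    """Translation column of T_world_from_cam, i.e. the camera position in world."""
    return [T[r][3] for r in range(3)]


def invert_rigid(T):
    """Invert a rigid transform: R -> R^T, t -> -R^T t."""
    Rt = [[T[c][r] for c in range(3)] for r in range(3)]
    t = camera_center(T)
    t_inv = [-sum(Rt[r][c] * t[c] for c in range(3)) for r in range(3)]
    return [Rt[r] + [t_inv[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
```

`invert_rigid` relies on the rotation block being orthonormal, so it is only valid for rigid poses and not for general 4×4 matrices.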
+
+### 2. Camera-Local Axes (OpenCV)
+
+Every camera's local frame follows the OpenCV pinhole convention:
+
+| Axis | Direction | Color in visualizer |
+|------|-----------|-------------------|
+| +X   | Right     | Red               |
+| +Y   | Down      | Green             |
+| +Z   | Forward (into scene) | Blue   |
+
+The frustum is drawn along the camera's local +Z axis. The four corners of the
+frustum's far plane are at `(±w, ±h, frustum_scale)` in camera-local coordinates.
+
+### 3. Metadata & Auto-Alignment (`_meta`)
+
+`calibrate_extrinsics.py` now writes a `_meta` key to the output JSON. This metadata
+describes the conventions used but is **optional** for consumers (like the visualizer)
+which may ignore it.
+
+```json
+{
+  "_meta": {
+    "pose_convention": "world_from_cam",
+    "frame_convention": "opencv",
+    "auto_aligned": true,
+    "gravity_direction_world": [0.0, 1.0, 0.0],
+    "alignment": { ... }
+  },
+  "SN123": { ... }
+}
+```
+
+- `pose_convention`: Always `world_from_cam`.
+- `frame_convention`: Always `opencv` (+X Right, +Y Down, +Z Forward).
+- `auto_aligned`: `true` if `--auto-align` was used.
+- `gravity_direction_world`: The vector in world space that points "down" (gravity).
+  - For **Y-down** (current default), this is `[0, 1, 0]`.
+  - For **Y-up** (legacy/Fusion), this would be `[0, -1, 0]`.
+
+**Auto-Alignment Target**:
+The `--auto-align` flag now targets a **Y-down** world frame by default (`target_axis=[0, -1, 0]`
+in internal logic, resulting in gravity pointing +Y).
+- **Old behavior**: Targeted Y-up (gravity along -Y).
+- **New behavior**: Targets Y-down (gravity along +Y) to match the OpenCV camera frame convention.
+- **Result**: The ground plane is still XZ, but "up" is -Y and "down" is +Y.
+
+### 4. Plotly Scene/Camera Interpretation Pitfalls
+
+Plotly's 3D scene has its own camera model that controls **how you view** the data:
+
+| Plotly setting | What it does | What it does NOT do |
+|----------------|-------------|-------------------|
+| `camera.up` | Sets which direction is "up" on screen | Does not transform data coordinates |
+| `camera.eye` | Sets the viewpoint position | Does not change axis orientation |
+| `yaxis.autorange = "reversed"` | Flips the Y axis tick direction | Does not negate Y data values |
+| `aspectmode = "data"` | Preserves metric proportions | Does not imply any convention |
+
+**Critical insight**: Changing `camera.up` from `{y:1}` to `{y:-1}` makes the plot
+*look* like Y-down is rendered correctly, but the underlying Plotly axis still runs
+bottom-to-top by default. This is purely a view transform — the data coordinates are
+unchanged.
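As a concrete illustration, the view-only settings can be collected in one plain dictionary and handed to Plotly with `fig.update_layout(scene=OPENCV_SCENE)`. The constant name is hypothetical and the exact layout used by `visualize_extrinsics.py` may differ; the point is that nothing in it touches the data arrays.

```python
# View-only configuration for a Plotly 3D scene showing OpenCV-frame data.
OPENCV_SCENE = {
    "camera": {"up": {"x": 0, "y": -1, "z": 0}},  # -Y is screen-up, so +Y renders as down
    "aspectmode": "data",  # keep metric proportions between axes
    "xaxis": {"title": {"text": "X (Right)"}},
    "yaxis": {"title": {"text": "Y (Down)"}},  # deliberately no autorange="reversed"
    "zaxis": {"title": {"text": "Z (Forward)"}},
}
```

Because the dictionary only describes the camera and axis titles, the same traces can be re-rendered with a different `up` vector without recomputing any coordinates.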
+
+---
+
+## Historical Confusion Timeline
+
+This section documents the sequence of changes that led to confusion, for future
+reference. All commits are on `visualize_extrinsics.py`.
+
+### Phase 1: Initial Plotly Rewrite (`7b9782a`)
+- Rewrote the visualizer from matplotlib to Plotly with a `--diagnose` mode.
+- Used Plotly defaults (Y-up). OpenCV data (Y-down) appeared "upside down."
+- Frustums pointed in the correct direction in data space but *looked* inverted.
+
+### Phase 2: Y-Up Enforcement (`a8d3751`)
+- Attempted to fix by setting `camera.up = {y:1}` and using `autorange: "reversed"`.
+- This made the view *look* correct for some angles but introduced axis-label confusion.
+- The Y axis ticks ran in the opposite direction from the data, misleading users.
+
+### Phase 3: Render-Space Option (`ab88a24`)
+- Added `--render-space` flag to switch between "cv" and "opengl" rendering.
+- The OpenGL path applied a basis-change matrix `diag(1, -1, -1)` to all data.
+- This actually transformed the data, not just the view — a correct approach but
+  introduced a second code path that was hard to validate.
+
+### Phase 4: Ground Plane & Origin Triad (`18e8142`, `57f0dff`)
+- Added ground plane overlay and world-origin axis triad.
+- These were drawn in the *data* frame, so they were correct in CV mode but
+  appeared wrong in OpenGL mode (the basis transform was applied inconsistently
+  to some elements but not others).
+
+### Phase 5: `--world-basis` with Global Transform (`79f2ab0`)
+- Renamed `--render-space` to `--world-basis` with `cv` and `opengl` options.
+- Introduced `world_to_plot()` as a central transform function.
+- In `opengl` mode: `world_to_plot` applied `diag(1, -1, -1)` to all points.
+- **Problem**: The Plotly `camera.up` and axis labels were not always updated
+  consistently with the basis choice, leading to "it looks right from one angle
+  but wrong from another" reports.
+
+### Phase 6: Restore After Removal (`6330e0e`)
+- `--world-basis` was briefly removed, then restored at a user's request.
+- This back-and-forth left the README with stale documentation referencing both
+  the old and new interfaces.
+
+### Phase 7: Final Cleanup — CV Only (`d07c244`)
+- **Removed `--world-basis` entirely.**
+- `world_to_plot()` became a no-op (identity function).
+- Plotly camera set to `up = {x:0, y:-1, z:0}` to render Y-down correctly.
+- Axis labels explicitly set to `X (Right)`, `Y (Down)`, `Z (Forward)`.
+- Added `--origin-axes-scale` for independent control of the origin triad size.
+- Removed the `--diagnose` and `--pose-convention` flags (`--render-space` had already been renamed to `--world-basis` in Phase 5).
+
+**This is the current state.**
+
+---
+
+## Peculiar Behaviors Catalog
+
+| # | Symptom | Root Cause | Fix / Explanation |
+|---|---------|-----------|-------------------|
+| 1 | Frustum appears to point in "-Z" direction | Plotly default camera has Y-up; OpenCV frustum points +Z which looks "backward" when viewed from a Y-up perspective | Set `camera.up = {y:-1}` (done in current code). The frustum is correct in data space. |
+| 2 | Switching to `--world-basis opengl` makes some elements flip but not others | The `world_to_plot()` transform was applied to camera traces but not consistently to ground plane or origin triad | Removed `--world-basis`. Single convention eliminates partial-transform bugs. |
+| 3 | `yaxis.autorange = "reversed"` makes ticks confusing | Plotly reverses the tick labels but the data coordinates stay the same. Users see "0 at top, -2 at bottom" which contradicts Y-down intuition. | Removed `autorange: reversed`. Use `camera.up = {y:-1}` instead, which rotates the view without mangling tick labels. |
+| 4 | Camera positions don't match `inside_network.json` | `inside_network.json` stores poses in the ZED Fusion coordinate frame (gravity-aligned, Y-up). `calibrate_extrinsics.py` stores poses in the ArUco marker object's frame (Y-down if the marker board is horizontal). These are **different world frames**. | Not a bug. The two systems use different world origins and orientations. To compare, you must apply the alignment transform between the two frames. See FAQ below. |
+| 5 | Origin triad too small or too large relative to cameras | Origin triad defaulted to `--scale` (camera axis size), which is often much smaller than the camera spread | Use `--origin-axes-scale 0.6` (or similar) independently of `--scale`. |
+| 6 | Bird-eye view shows unexpected orientation | `--birdseye` uses an orthographic projection along the Y axis, with the eye on the -Y side. In CV convention +Y is "down", so -Y is above the scene, even though it would be below in a Y-up viewer. | Expected behavior. The bird-eye view shows the X-Z plane as seen from the -Y direction (above the cameras in CV convention). |
+
+---
+
+## Known Pitfalls & Common Confusions
+
+1.  **Y-Up vs Y-Down**:
+    -   **OpenCV/ArUco**: Y is **Down**. Gravity is `[0, 1, 0]`.
+    -   **OpenGL/Fusion**: Y is **Up**. Gravity is `[0, -1, 0]`.
+    -   **Pitfall**: Assuming "Y is vertical" is ambiguous. You must know the sign.
+
+2.  **Frame Mismatch vs Origin Mismatch**:
+    -   Two pose sets can have the **same convention** (e.g., both Y-down) but different **world origins** (e.g., one at marker, one at camera 1).
+    -   **Fix**: Use `compare_pose_sets.py` to align them rigidly before comparing errors.
+
+3.  **Visualizer "Inversion"**:
+    -   The visualizer sets `camera.up = {y:-1}`. This makes the scene look "normal" (floor at bottom) even though the data is Y-down.
+    -   **Pitfall**: Don't try to "fix" the data to match the view. The view is already compensating for the data.
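+
+The sign conventions in item 1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the name `CV_TO_GL` is illustrative and does not come from the codebase. It applies the `diag(1, -1, -1)` basis change to the gravity and forward vectors.
+
+```python
+import numpy as np
+
+# Basis change between OpenCV axes (X right, Y down, Z forward)
+# and OpenGL axes (X right, Y up, Z backward). Illustrative name.
+CV_TO_GL = np.diag([1.0, -1.0, -1.0])
+
+gravity_cv = np.array([0.0, 1.0, 0.0])   # "down" in a Y-down frame
+gravity_gl = CV_TO_GL @ gravity_cv       # same physical direction, Y-up frame
+
+forward_cv = np.array([0.0, 0.0, 1.0])   # camera looks along +Z in OpenCV
+forward_gl = CV_TO_GL @ forward_cv       # camera looks along -Z in OpenGL
+
+# The matrix is its own inverse, so applying it twice returns the input.
+roundtrip = CV_TO_GL @ gravity_gl
+```
+
+Because the matrix is an involution, the same operator converts in both directions. This is also why a partially applied transform is hard to spot: each element looks plausible on its own.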
+
+---
+
+## Canonical Rules Going Forward
+
+1. **Single convention**: All visualization data is in OpenCV frame. No basis switching.
+2. **`world_to_plot()` is identity**: It exists as a hook but performs no transform.
+   If a future need arises for basis conversion, it should be the *only* place it happens.
+3. **Plotly camera settings are view-only**: Never use `autorange: reversed` or axis
+   negation to simulate a coordinate change. Use `camera.up` and `camera.eye` only.
+4. **Poses are `world_from_cam`**: The 4×4 matrix maps camera-local points to world.
+   Translation = camera position in world. Rotation columns = camera axes in world.
+5. **Colors are RGB = XYZ**: Red = X (right), Green = Y (down), Blue = Z (forward).
+   This applies to both per-camera axis triads and the world-origin triad.
+6. **Units are meters**: Consistent with marker parquet geometry and calibration output.
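+
+Rule 4 can be illustrated with a small NumPy sketch. The pose values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from any calibration output.
+
+```python
+import numpy as np
+
+# Hypothetical pose: camera 1.3 m above the board (negative Y in a Y-down
+# world), rotated 90 degrees about the world Y axis.
+world_from_cam = np.array([
+    [ 0.0, 0.0, 1.0,  0.5],
+    [ 0.0, 1.0, 0.0, -1.3],
+    [-1.0, 0.0, 0.0,  2.0],
+    [ 0.0, 0.0, 0.0,  1.0],
+])
+
+rotation = world_from_cam[:3, :3]
+camera_center = world_from_cam[:3, 3]    # translation = camera position in world
+right, down, forward = rotation.T        # columns = camera X, Y, Z axes in world
+
+# A point 2 m in front of the camera (camera-local +Z), mapped to world.
+point_cam = np.array([0.0, 0.0, 2.0, 1.0])
+point_world = (world_from_cam @ point_cam)[:3]
+
+height_above_board = -camera_center[1]   # +Y is down, so height is -Y
+```
+
+Here `point_world` equals `camera_center + 2 * forward`, which is what "maps camera-local points to world" means in practice.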
+
+---
+
+## Current CLI Behavior
+
+### Available Flags
+
+```
+visualize_extrinsics.py
+  -i, --input TEXT          [required] Path to JSON extrinsics file
+  -o, --output TEXT         Output path (.html or .png)
+  --show                    Open interactive Plotly viewer
+  --scale FLOAT             Camera axis length (default: 0.2)
+  --frustum-scale FLOAT     Frustum depth (default: 0.5)
+  --fov FLOAT               Horizontal FOV degrees (default: 60.0)
+  --birdseye                Top-down orthographic view
+  --show-ground/--no-show-ground    Ground plane toggle
+  --ground-y FLOAT          Ground plane Y position (default: 0.0)
+  --ground-size FLOAT       Ground plane side length (default: 8.0)
+  --show-origin-axes/--no-show-origin-axes  Origin triad toggle (default: on)
+  --origin-axes-scale FLOAT Origin triad size (defaults to --scale)
+  --zed-configs TEXT        ZED calibration file(s) for accurate frustums
+  --resolution [FHD1200|FHD|2K|HD|SVGA|VGA]
+  --eye [left|right]
+```
+
+### Removed Flags (Historical Only)
+
+| Flag | Removed In | Reason |
+|------|-----------|--------|
+| `--world-basis` | `d07c244` | Caused partial/inconsistent transforms. Single CV convention is simpler. |
+| `--pose-convention` | `d07c244` | Only `world_from_cam` is supported. No need for a flag. |
+| `--diagnose` | `d07c244` | Diagnostic checks moved out of the visualizer. |
+| `--render-space` | `79f2ab0` | Renamed to `--world-basis`, then removed. |
+
+> **Note**: The README.md still contains stale references to `--world-basis`,
+> `--pose-convention`, and `--diagnose` in the Troubleshooting section. These should
+> be cleaned up to match the current CLI.
+
+---
+
+## Verification Playbook
+
+### Quick Sanity Check
+
+```bash
+# Render with origin triad at 0.6m scale, save as PNG
+uv run visualize_extrinsics.py \
+  --input output/e2e_refine_depth_smoke_rerun.json \
+  --output output/_final_opencv_origin_axes_scaled.png \
+  --origin-axes-scale 0.6
+```
+
+**Expected result**:
+- Origin triad at (0,0,0) with Red→+X (right), Green→+Y (down), Blue→+Z (forward).
+- Camera frustums pointing along each camera's local +Z (blue axis).
+- Camera positions spread out in world space (not bunched at origin).
+- Y values for cameras should be negative (cameras are above the marker board,
+  which is at Y≈0; "above" in CV convention means negative Y).
+
+### Interactive Validation
+
+```bash
+# Open interactive HTML for rotation/inspection
+uv run visualize_extrinsics.py \
+  --input output/e2e_refine_depth_smoke_rerun.json \
+  --show \
+  --origin-axes-scale 0.6
+```
+
+**What to check**:
+1. **Rotate the view**: The origin triad should remain consistent — Red/Green/Blue
+   always point in the same data-space directions regardless of view angle.
+2. **Hover over camera centers**: Tooltip shows the camera serial number.
+3. **Frustum orientation**: Each frustum's open end faces away from the camera center
+   along the camera's blue (Z) axis.
+
+### Bird-Eye Sanity Check
+
+```bash
+uv run visualize_extrinsics.py \
+  --input output/e2e_refine_depth_smoke_rerun.json \
+  --birdseye --show \
+  --origin-axes-scale 0.6
+```
+
+**Expected**: Top-down view of the X-Z plane. Cameras should form a recognizable
+spatial layout matching the physical installation. The Red (X) axis points right, and
+the Blue (Z) axis points "up" on screen (forward in world).
+
+---
+
+## FAQ
+
+### "Why does an OpenGL-like view look strange?"
+
+The data is in OpenCV convention (Y-down, Z-forward), while an OpenGL-style viewer
+assumes Y-up, Z-backward. To make Plotly act like an OpenGL viewer, you need to
+either:
+
+1. **Transform all data** by applying `diag(1, -1, -1)` — correct but doubles the
+   code paths and creates consistency risks.
+2. **Adjust the Plotly camera** — only changes the view, not the data. Axis labels
+   and hover values still show CV coordinates.
+
+We chose option (2) with `camera.up = {y:-1}`: minimal code, no data transformation,
+axis labels match the actual coordinate values. The trade-off is that the default
+Plotly orbit feels "inverted" compared to a Y-up 3D viewer. This is expected.
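+
+A view-only configuration of this kind could look like the sketch below. It is written as a plain dict in the shape Plotly accepts for `layout.scene`, so it runs without Plotly installed. The `eye` position and `aspectmode` are illustrative assumptions, not values taken from `visualize_extrinsics.py`.
+
+```python
+# View-only scene settings: nothing here touches the trace data.
+scene = {
+    "camera": {
+        "up": {"x": 0, "y": -1, "z": 0},           # render Y-down data upright
+        "eye": {"x": 1.5, "y": -1.0, "z": -1.5},   # illustrative viewpoint
+    },
+    "xaxis": {"title": {"text": "X (Right)"}},
+    "yaxis": {"title": {"text": "Y (Down)"}},
+    "zaxis": {"title": {"text": "Z (Forward)"}},
+    "aspectmode": "data",
+}
+
+# Guard against reintroducing the reversed-axis workaround.
+uses_reversed_axis = any(
+    scene[axis].get("autorange") == "reversed"
+    for axis in ("xaxis", "yaxis", "zaxis")
+)
+```
+
+With Plotly available, the dict would be applied with `fig.update_layout(scene=scene)`. Hover values and tick labels keep showing the original CV coordinates.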
+
+### "Does flipping axes in the view equal changing the world frame?"
+
+**No.** Plotly's `camera.up`, `camera.eye`, and `autorange: reversed` are purely
+view transforms. They change how the data is *displayed* but not what the coordinates
+*mean*. The data always lives in the frame it was computed in (OpenCV/ArUco world frame).
+
+If you set `camera.up = {y:1}` (a Y-up view), the plot will render Y-up on screen,
+but the data values are still Y-down. This creates a visual inversion that looks like
+"the cameras are upside down". They are not; the view is just flipped.
+
+### "How do I compare with the C++ viewer and `inside_network.json`?"
+
+The C++ ZED Fusion viewer and `inside_network.json` use a **different world frame**
+than `calibrate_extrinsics.py`:
+
+| Property | `calibrate_extrinsics.py` | ZED Fusion / `inside_network.json` |
+|----------|--------------------------|-------------------------------------|
+| World origin | ArUco marker object center | Gravity-aligned, first camera or user-defined |
+| Y direction | Down (OpenCV) | Up (gravity-aligned) |
+| Pose meaning | `T_world_from_cam` | `T_world_from_cam` (same semantics, different world) |
+| Units | Meters | Meters |
+
+To compare numerically:
+1. The **relative** poses between cameras should match (up to the alignment transform).
+2. The **absolute** positions will differ because the world origins are different.
+3. To convert: apply the rigid alignment transform (rotation and translation) that maps
+   the ArUco world frame to the Fusion world frame. If `--auto-align` was used with a
+   ground face, the ArUco frame is partially aligned (ground = XZ plane), but the origin
+   and yaw may still differ.
+
+**Quick visual comparison**: Look at the *shape* of the camera arrangement (distances
+and angles between cameras), not the absolute positions. If the shape matches, the
+calibration is consistent.
+
+### "Why are camera Y-positions negative?"
+
+In OpenCV convention, +Y is down. Cameras mounted above the marker board (which defines
+Y≈0) have negative Y values. This is correct. A camera at `Y = -1.3` is 1.3 meters
+above the board.
+
+### "What does `inside_network.json` camera 41831756's pose mean?"
+
+```
+Translation: [0.0, -1.175, 0.0]
+Rotation: Identity
+```
+
+This camera is the reference camera (identity rotation), positioned 1.175 m from the
+world origin in the -Y direction. In the Fusion frame (Y-up), that is 1.175 m *below*
+the world origin. In practice, this is the height offset of the camera relative to the
+origin of the Fusion coordinate system.
+
+---
+
+## Methodology: Comparing Different World Frames
+
+Since `inside_network.json` (Fusion) and `calibrate_extrinsics.py` (ArUco) use different
+world origins, raw coordinate comparison is meaningless. We validated consistency using
+**rigid SE(3) alignment**:
+
+1.  **Match Serials**: Identify cameras present in both JSON files.
+2.  **Extract Centers**: Extract the translation column `t` from `T_world_from_cam` for
+    each camera.
+    *   **Crucial**: Both systems use `T_world_from_cam`. It is **not** `cam_from_world`.
+3.  **Compute Alignment**: Solve for the rigid transform `(R_align, t_align)` that
+    minimizes the sum of squared distances between corresponding camera centers
+    (Kabsch algorithm).
+    *   Scale is fixed at 1.0 (both systems use meters).
+4.  **Apply & Compare**:
+    *   Transform Fusion points: `P_aligned = R_align * P_fusion + t_align`.
+    *   **Position Residual**: `|| P_aruco - P_aligned ||`.
+    *   **Orientation Check**: Apply `R_align` to Fusion rotation matrices and compare
+        column vectors (Right/Down/Forward) with ArUco rotations.
+5.  **Up-Vector Verification**:
+    *   Fusion uses Y-Up (gravity). ArUco uses Y-Down (image).
+    *   After alignment, the transformed Fusion Y-axis should be approximately parallel
+        to the ArUco -Y axis (or +Y depending on the specific alignment solution found,
+        but they must be collinear with gravity).
+
+**Result**: The overlay images in `output/` were generated using this aligned frame.
+The low residuals (<2cm) confirm that the internal calibration is consistent, even
+though the absolute world coordinates differ.
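+
+The alignment in steps 3 to 5 can be sketched with NumPy as follows. This is a generic Kabsch implementation run on synthetic camera centers, not the code from `compare_pose_sets.py`. The function name `kabsch_align` and all numeric values are assumptions made for illustration.
+
+```python
+import numpy as np
+
+def kabsch_align(src, dst):
+    """Return (R, t) minimizing sum ||R @ src_i + t - dst_i||^2."""
+    src = np.asarray(src, dtype=float)
+    dst = np.asarray(dst, dtype=float)
+    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
+    # 3x3 cross-covariance of the centered point sets.
+    H = (src - src_mean).T @ (dst - dst_mean)
+    U, _, Vt = np.linalg.svd(H)
+    # Force det(R) = +1 so the result is a rotation, not a reflection.
+    d = np.sign(np.linalg.det(Vt.T @ U.T))
+    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
+    t = dst_mean - R @ src_mean
+    return R, t
+
+# Synthetic Fusion-frame camera centers (hypothetical values, meters).
+fusion_centers = np.array([
+    [0.0, 1.2, 0.0], [2.0, 1.1, 0.5], [1.0, 1.4, 3.0], [-1.5, 1.3, 2.0],
+])
+R_true = np.diag([1.0, -1.0, -1.0])      # Y-up to Y-down flip
+t_true = np.array([0.3, 0.0, 1.0])
+aruco_centers = fusion_centers @ R_true.T + t_true
+
+R_align, t_align = kabsch_align(fusion_centers, aruco_centers)
+aligned = fusion_centers @ R_align.T + t_align
+residuals = np.linalg.norm(aruco_centers - aligned, axis=1)
+
+# Step 5: the Fusion up vector expressed in the ArUco frame.
+fusion_up_in_aruco = R_align @ np.array([0.0, 1.0, 0.0])
+```
+
+On this noise-free data the residuals are at floating-point precision and the Fusion +Y axis lands on the ArUco -Y axis. With real calibration data the residuals will be nonzero, and the up-vector check becomes an approximate one.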
+
+---
+
+## `compare_pose_sets.py` Input Formats
+
+The `compare_pose_sets.py` tool is designed to be agnostic to the source of the JSON files.
+It uses a **symmetric, heuristic parser** for both `--pose-a-json` and `--pose-b-json`.
+
+### Accepted JSON Schemas
+
+The parser automatically detects and handles either of these two structures for any input file:
+
+**1. Flat Format (Standard Output)**
+Used by `calibrate_extrinsics.py` and `refine_extrinsics.py`.
+```json
+{
+  "SERIAL_NUMBER": {
+    "pose": "r00 r01 r02 tx r10 r11 r12 ty r20 r21 r22 tz 0 0 0 1"
+  }
+}
+```
+
+**2. Nested Fusion Format**
+Used by ZED Fusion `inside_network.json` configuration files.
+```json
+{
+  "SERIAL_NUMBER": {
+    "FusionConfiguration": {
+      "pose": "r00 r01 r02 tx r10 r11 r12 ty r20 r21 r22 tz 0 0 0 1"
+    }
+  }
+}
+```
+
+### Key Behaviors
+
+1.  **Interchangeability**: You can swap inputs. Comparing A (ArUco) vs B (Fusion) is valid,
+    as is A (Fusion) vs B (ArUco). The script aligns B to A.
+2.  **Pose Semantics**: All poses are interpreted as `T_world_from_cam` (camera-to-world).
+    The script does **not** invert matrices; it assumes the input strings are already in the
+    correct convention.
+3.  **Minimum Overlap**: The script requires at least **3 shared camera serials** between
+    the two files to compute a rigid alignment.
+4.  **Heuristic Parsing**: For each serial key, the parser looks for `FusionConfiguration.pose`
+    first, then falls back to `pose`.
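+
+The lookup order in item 4 could be implemented along these lines. This is a hedged sketch, not the actual parser: `parse_pose` and `load_poses` are hypothetical names, the serial `1001` is made up, and the error handling is an assumption.
+
+```python
+import json
+import numpy as np
+
+def parse_pose(entry):
+    """Return a 4x4 world_from_cam matrix, or None if no pose is found."""
+    # Prefer the nested Fusion layout, then fall back to the flat layout.
+    fusion = entry.get("FusionConfiguration")
+    pose_str = fusion.get("pose") if isinstance(fusion, dict) else None
+    if pose_str is None:
+        pose_str = entry.get("pose")
+    if pose_str is None:
+        return None
+    values = [float(v) for v in pose_str.split()]
+    if len(values) != 16:
+        raise ValueError(f"expected 16 values, got {len(values)}")
+    return np.array(values).reshape(4, 4)   # row-major, as in the schemas above
+
+def load_poses(text):
+    """Map serial -> 4x4 matrix for every entry that carries a pose."""
+    poses = {}
+    for serial, entry in json.loads(text).items():
+        if isinstance(entry, dict):
+            pose = parse_pose(entry)
+            if pose is not None:
+                poses[serial] = pose
+    return poses
+
+flat = '{"1001": {"pose": "1 0 0 0.5 0 1 0 -1.3 0 0 1 2 0 0 0 1"}}'
+nested = ('{"1001": {"FusionConfiguration": '
+          '{"pose": "1 0 0 0 0 1 0 -1.175 0 0 1 0 0 0 0 1"}}}')
+poses_a = load_poses(flat)
+poses_b = load_poses(nested)
+shared = sorted(poses_a.keys() & poses_b.keys())
+```
+
+Because both inputs go through the same function, swapping `--pose-a-json` and `--pose-b-json` changes only which set is treated as the alignment reference.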
+
+### Example: Swapped Inputs
+
+Since the parser is symmetric, you can verify consistency by reversing the alignment direction:
+
+```bash
+# Align Fusion (B) to ArUco (A)
+uv run compare_pose_sets.py \
+    --pose-a-json output/e2e_refine_depth.json \
+    --pose-b-json ../zed_settings/inside_network.json \
+    --report-json output/report_aruco_ref.json
+
+# Align ArUco (B) to Fusion (A)
+uv run compare_pose_sets.py \
+    --pose-a-json ../zed_settings/inside_network.json \
+    --pose-b-json output/e2e_refine_depth.json \
+    --report-json output/report_fusion_ref.json
+```
+
+---
+
+## Appendix: Stale README References
+
+The following lines in `py_workspace/README.md` reference removed flags and should be
+updated:
+
+- **Line ~104**: References `--pose-convention` (removed).
+- **Line ~105**: References `--world-basis opengl` (removed).
+- **Line ~116**: References `--diagnose` (removed).
+
+These were left from earlier iterations and do not reflect the current CLI.