# Calibrate Extrinsics Workflow
This document explains the workflow for `calibrate_extrinsics.py`, focusing on ground plane alignment (`--auto-align`) and depth-based refinement (`--verify-depth`, `--refine-depth`).
## CLI Overview
The script calibrates camera extrinsics using ArUco markers detected in SVO recordings.
**Key Options:**
- `--svo`: Path to SVO file(s) or directory containing them.
- `--markers`: Path to the marker configuration parquet file.
- `--auto-align`: Enables automatic ground plane alignment (opt-in).
- `--verify-depth`: Enables depth-based verification of computed poses.
- `--refine-depth`: Enables optimization of poses using depth data (requires `--verify-depth`).
- `--use-confidence-weights`: Uses ZED depth confidence map to weight residuals in optimization.
- `--benchmark-matrix`: Runs a comparison of baseline vs. robust refinement configurations.
- `--max-samples`: Limits the number of processed samples for fast iteration.
- `--debug`: Enables verbose debug logging (default is INFO).
## Ground Plane Alignment (`--auto-align`)
When `--auto-align` is enabled, the script attempts to align the global coordinate system such that a specific face of the marker object becomes the ground plane (XZ plane, normal pointing +Y).
**Prerequisites:**
- The marker parquet file MUST contain `name` and `ids` columns defining which markers belong to which face (e.g., "top", "bottom", "front").
- If this metadata is missing, alignment is skipped with a warning.
**Decision Flow:**
The script selects the ground face using the following precedence:
1. **Explicit Face (`--ground-face`)**:
   - If you provide `--ground-face="bottom"`, the script looks up the markers for "bottom" in the loaded map.
   - It computes the average normal of those markers and aligns it to the global up vector.
2. **Marker ID Mapping (`--ground-marker-id`)**:
   - If you provide `--ground-marker-id=21`, the script finds which face contains marker 21 (e.g., "bottom").
   - It then proceeds as if `--ground-face="bottom"` had been specified.
3. **Heuristic Detection (Fallback)**:
   - If neither option is provided, the script analyzes all visible markers.
   - It computes the normal for every defined face.
   - It selects the face whose normal is most aligned with the camera's "down" direction (assuming the camera is roughly upright).
**Logging:**
The script logs the selected decision path for debugging:
- `Mapped ground-marker-id 21 to face 'bottom' (markers=[21])`
- `Using explicit ground face 'bottom' (markers=[21])`
- `Heuristically detected ground face 'bottom' (markers=[21])`
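The precedence above can be sketched as a small selection function. This is an illustrative sketch, not the script's actual API; the names `select_ground_face`, `face_markers`, and `face_normals` are assumptions.

```python
# Hypothetical sketch of the ground-face selection precedence; names are
# illustrative, not the script's actual API.
import numpy as np

def select_ground_face(face_markers, ground_face=None, ground_marker_id=None,
                       face_normals=None, camera_down=(0.0, -1.0, 0.0)):
    """Return the face name chosen as the ground plane.

    face_markers: dict mapping face name -> list of marker ids.
    face_normals: dict mapping face name -> unit normal (world frame),
                  needed only for the heuristic fallback.
    """
    # 1. Explicit face wins outright.
    if ground_face is not None:
        if ground_face not in face_markers:
            raise ValueError(f"unknown face {ground_face!r}")
        return ground_face
    # 2. Map a marker id to the face that contains it.
    if ground_marker_id is not None:
        for face, ids in face_markers.items():
            if ground_marker_id in ids:
                return face
        raise ValueError(f"marker {ground_marker_id} not in any face")
    # 3. Heuristic: pick the face whose normal best matches camera "down".
    down = np.asarray(camera_down, dtype=float)
    return max(face_normals, key=lambda f: float(np.dot(face_normals[f], down)))

faces = {"top": [20], "bottom": [21]}
normals = {"top": np.array([0.0, 1.0, 0.0]),
           "bottom": np.array([0.0, -1.0, 0.0])}
print(select_ground_face(faces, ground_marker_id=21))   # -> "bottom"
print(select_ground_face(faces, face_normals=normals))  # -> "bottom"
```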
## Depth Verification & Refinement
This workflow uses the ZED camera's depth map to verify and improve the ArUco-based pose estimation.
### 1. Verification (`--verify-depth`)
- **Input**: The computed extrinsic pose ($T_{world\_from\_cam}$) and the known 3D world coordinates of the marker corners.
- **Process**:
  1. Projects marker corners into the camera frame using the computed pose.
  2. Samples the ZED depth map at these projected 2D locations (using a 5x5 median filter for robustness).
  3. Compares the *measured* depth (ZED) with the *computed* depth (distance from the camera center to the projected corner).
- **Output**:
  - RMSE (Root Mean Square Error) of the depth residuals.
  - Number of valid points (where depth was available and finite).
  - Added to the JSON output under `depth_verify`.
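The verification step can be sketched as follows. This is a minimal sketch assuming pinhole intrinsics `K` and corners already in the camera frame; the function names are illustrative, not the script's.

```python
# Minimal sketch of depth verification: project corners, sample the depth map
# with a median window, compare measured vs. computed depth. Illustrative only.
import numpy as np

def sample_depth_median(depth, u, v, k=5):
    """Median of a k x k window around pixel (u, v), ignoring invalid depth."""
    r = k // 2
    win = depth[max(v - r, 0):v + r + 1, max(u - r, 0):u + r + 1]
    valid = win[np.isfinite(win) & (win > 0)]
    return float(np.median(valid)) if valid.size else float("nan")

def depth_verify(corners_cam, K, depth):
    """corners_cam: (N, 3) marker corners in the camera frame (meters).

    Returns (rmse, n_valid) of measured-minus-computed depth residuals.
    """
    residuals = []
    for X, Y, Z in corners_cam:
        u = int(round(K[0, 0] * X / Z + K[0, 2]))  # project to pixel coords
        v = int(round(K[1, 1] * Y / Z + K[1, 2]))
        measured = sample_depth_median(depth, u, v)
        computed = float(np.linalg.norm([X, Y, Z]))  # distance from camera center
        if np.isfinite(measured):
            residuals.append(measured - computed)
    residuals = np.asarray(residuals)
    if residuals.size == 0:
        return float("nan"), 0
    return float(np.sqrt(np.mean(residuals ** 2))), int(residuals.size)

K = np.array([[500.0, 0.0, 64.0], [0.0, 500.0, 64.0], [0.0, 0.0, 1.0]])
depth = np.full((128, 128), 2.0)  # synthetic flat depth map, 2 m everywhere
corners = np.array([[0.0, 0.0, 2.0], [0.1, 0.0, 2.0]])
rmse, n_valid = depth_verify(corners, K, depth)
```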
### 2. Refinement (`--refine-depth`)
- **Trigger**: Runs only if verification is enabled and enough valid depth points (>4) are found.
- **Process**:
  - Uses `scipy.optimize.least_squares` with a robust loss function (`soft_l1`) to handle outliers.
  - **Objective Function**: Minimizes the robust residual between computed and measured depth for all visible marker corners.
  - **Confidence Weighting** (`--use-confidence-weights`): If enabled, residuals are weighted by the ZED confidence map (higher confidence = higher weight).
  - **Constraints**: Bounded optimization prevents drifting too far from the initial ArUco pose (default: ±5 degrees, ±5 cm).
- **Output**:
  - The refined pose replaces the original pose in the JSON output.
  - Improvement stats (delta rotation, delta translation, RMSE reduction) are added under `refine_depth`.
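The optimization can be sketched like this. The real objective lives inside `calibrate_extrinsics.py`; the pose parameterization (rotation vector plus translation) and the helper names here are assumptions.

```python
# Sketch of depth-based pose refinement with a robust loss and bounds that
# keep the solution near the initial ArUco pose. Illustrative, not the tool's code.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def refine_pose(corners_cam, measured_depth, weights=None):
    """corners_cam: (N, 3) corners in the camera frame under the initial pose.
    measured_depth: (N,) depths sampled from the ZED depth map (meters)."""
    weights = np.ones(len(measured_depth)) if weights is None else weights

    def residuals(p):
        # p = (rx, ry, rz, tx, ty, tz): small perturbation of the initial pose.
        pts = Rotation.from_rotvec(p[:3]).apply(corners_cam) + p[3:]
        computed = np.linalg.norm(pts, axis=1)
        return weights * (measured_depth - computed)

    deg5 = np.deg2rad(5.0)
    bounds = ([-deg5] * 3 + [-0.05] * 3, [deg5] * 3 + [0.05] * 3)  # ±5°, ±5 cm
    return least_squares(residuals, x0=np.zeros(6), bounds=bounds, loss="soft_l1")

# Toy data: the "true" pose is shifted 2 cm along Z relative to the initial one.
corners = np.array([[0.1, 0.0, 1.0], [-0.1, 0.0, 1.0],
                    [0.0, 0.1, 1.0], [0.0, -0.1, 1.0], [0.0, 0.0, 1.2]])
measured = np.linalg.norm(corners + [0.0, 0.0, 0.02], axis=1)
res = refine_pose(corners, measured)
```

With confidence weighting enabled, `weights` would come from the ZED confidence map instead of being uniform.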
### 3. Best Frame Selection
When multiple frames are available, the system scores them to pick the best candidate for verification/refinement:
- **Criteria**:
  - Number of detected markers (primary factor).
  - Reprojection error (lower is better).
  - Valid depth ratio (percentage of marker corners with valid depth data).
  - Depth confidence (if available).
- **Benefit**: Ensures refinement uses high-quality data rather than just the last valid frame.
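A scoring function combining these criteria might look like the following. The weights and field names here are illustrative assumptions, not the script's actual values.

```python
# Hypothetical frame score: marker count dominates, reprojection error
# penalizes, depth coverage and confidence break ties. Weights are assumed.
def score_frame(n_markers, reproj_error, valid_depth_ratio, mean_confidence=None):
    score = 10.0 * n_markers            # marker count is the primary factor
    score -= reproj_error               # lower reprojection error is better
    score += 2.0 * valid_depth_ratio    # reward usable depth coverage (0..1)
    if mean_confidence is not None:
        score += mean_confidence        # optional confidence tie-break (0..1)
    return score

frames = [
    {"n_markers": 4, "reproj_error": 1.5, "valid_depth_ratio": 0.90},
    {"n_markers": 6, "reproj_error": 0.8, "valid_depth_ratio": 0.95},
]
best = max(frames, key=lambda f: score_frame(**f))
```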
## Benchmark Matrix (`--benchmark-matrix`)
This mode runs a comparative analysis of different refinement configurations on the same data to evaluate improvements. It compares:
1. **Baseline**: Linear loss (MSE), no confidence weighting.
2. **Robust**: Soft-L1 loss, no confidence weighting.
3. **Robust + Confidence**: Soft-L1 loss with confidence-weighted residuals.
4. **Robust + Confidence + Best Frame**: All of the above, using the highest-scored frame.
**Output:**
- Prints a summary table for each camera showing RMSE improvement and iteration counts.
- Adds a `benchmark` object to the JSON output containing detailed stats for each configuration.
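The four configurations can be written out as a small table. The flag semantics mirror the CLI options above; the dictionary structure itself is an assumption for illustration.

```python
# The four configurations compared by --benchmark-matrix, as a hypothetical
# config table (structure assumed; semantics follow the CLI flags above).
BENCHMARK_CONFIGS = [
    {"name": "baseline",                "loss": "linear",  "confidence_weights": False, "best_frame": False},
    {"name": "robust",                  "loss": "soft_l1", "confidence_weights": False, "best_frame": False},
    {"name": "robust+confidence",       "loss": "soft_l1", "confidence_weights": True,  "best_frame": False},
    {"name": "robust+confidence+frame", "loss": "soft_l1", "confidence_weights": True,  "best_frame": True},
]
```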
## Fast Iteration (`--max-samples`)
For development or quick checks, processing thousands of frames is unnecessary.
- Use `--max-samples N` to stop after `N` valid samples (frames where markers were detected).
- Example: `--max-samples 1` will process the first valid frame, run alignment/refinement, save the result, and exit.
## Example Workflow
**Full Run with Alignment and Robust Refinement:**
```bash
uv run calibrate_extrinsics.py \
--svo output/recording.svo \
--markers aruco/markers/box.parquet \
--aruco-dictionary DICT_APRILTAG_36h11 \
--auto-align \
--ground-marker-id 21 \
--verify-depth \
--refine-depth \
--use-confidence-weights \
--output output/calibrated.json
```
**Benchmark Run:**
```bash
uv run calibrate_extrinsics.py \
--svo output/recording.svo \
--markers aruco/markers/box.parquet \
--benchmark-matrix \
--max-samples 100
```
**Fast Debug Run:**
```bash
uv run calibrate_extrinsics.py \
--svo output/ \
--markers aruco/markers/box.parquet \
--auto-align \
--max-samples 1 \
--debug \
--no-preview
```
## Depth Data Management
To enable decoupled refinement workflows, the system supports saving the depth data used during calibration.
### Saving Depth Data
Use the `--save-depth <path.h5>` flag with `calibrate_extrinsics.py`.
```bash
uv run calibrate_extrinsics.py ... --save-depth output/calibration_depth.h5
```
**HDF5 Format Structure:**
- `meta/`: Global metadata (schema version, units=meters).
- `cameras/{serial}/`:
  - `intrinsics`: Camera matrix (3x3).
  - `resolution`: [width, height].
  - `pooled_depth`: The aggregated depth map used for verification (gzip compressed).
  - `raw_frames/`: (Optional) Individual frames if pooling wasn't used.
This allows `refine_ground_plane.py` to run repeatedly with different parameters without re-processing the raw SVO files.
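A minimal `h5py` sketch of this layout (dataset names follow the structure above; exact dtypes and attribute placement are assumptions):

```python
# Sketch of writing and reading the depth file layout described above.
# Dataset paths follow the documented structure; dtypes are assumptions.
import os
import tempfile
import h5py
import numpy as np

def save_depth(path, serial, K, depth):
    with h5py.File(path, "w") as f:
        meta = f.create_group("meta")
        meta.attrs["schema_version"] = 1
        meta.attrs["units"] = "meters"
        cam = f.create_group(f"cameras/{serial}")
        cam.create_dataset("intrinsics", data=K)
        cam.create_dataset("resolution", data=[depth.shape[1], depth.shape[0]])
        cam.create_dataset("pooled_depth", data=depth,
                           compression="gzip", compression_opts=4)

def load_depth(path, serial):
    with h5py.File(path, "r") as f:
        cam = f[f"cameras/{serial}"]
        return cam["intrinsics"][()], cam["pooled_depth"][()]

# Round-trip check against a temporary file.
path = os.path.join(tempfile.mkdtemp(), "calibration_depth.h5")
K = np.eye(3)
depth = np.linspace(0.5, 3.0, 16, dtype=np.float32).reshape(4, 4)
save_depth(path, "12345", K, depth)
K2, depth2 = load_depth(path, "12345")
```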
## Ground Plane Refinement (`refine_ground_plane.py`)
This standalone tool refines camera extrinsics by ensuring all cameras agree on the ground plane location. It addresses common issues where ArUco markers are slightly tilted or not perfectly coplanar with the floor.
### Workflow
```bash
uv run refine_ground_plane.py \
--input-extrinsics output/calibrated.json \
--input-depth output/calibration_depth.h5 \
--output-extrinsics output/refined.json \
--plot --plot-output output/ground_debug.html
```
### Algorithm Details
The algorithm proceeds in four stages:
1. **Plane Detection (Per Camera)**
   - Unprojects the depth map to a point cloud in the **world frame** (using current extrinsics).
   - Uses **RANSAC** (via Open3D) to segment the dominant plane.
   - **Quality Gates**:
     - Minimum inliers (default: 500 points).
     - Normal orientation check (must be roughly vertical, `normal_vertical_thresh=0.9`).
2. **Robust Consensus**
   - Computes a "consensus plane" from all valid camera detections.
   - **Method**:
     - Aligns all normals to the upper hemisphere.
     - Computes the **geometric median** of normals and distances.
     - Filters outliers based on deviation from the median (>15° angle or >0.5 m distance).
     - Computes a weighted average of the remaining inlier planes.
3. **Correction Calculation**
   - Computes a rigid transform $T_{corr}$ for each camera.
   - **Constraints**:
     - **Rotation**: Only corrects pitch and roll (aligns the normal to vertical). Yaw is preserved.
     - **Translation**: Only corrects vertical height (aligns the plane distance). X/Z position is preserved.
   - **Consensus-Relative Correction**: By default, cameras are aligned to the *consensus plane* rather than absolute Y=0. This ensures relative consistency between cameras even if the absolute floor height is slightly off.
4. **Safety Guardrails**
   - The correction is **rejected** if:
     - Rotation > `max_rotation_deg` (default: 5°).
     - Translation > `max_translation_m` (default: 0.1 m).
     - Deviation from consensus > `max_consensus_deviation` (default: 10°, 0.5 m).
   - **Why no ICP?** For flat floors, plane-to-plane alignment is more robust than ICP; ICP on featureless planes can drift (slide) along the surface.
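The correction calculation in stage 3 can be sketched as follows: the minimal rotation taking the detected normal onto +Y has its axis in the horizontal plane, so it adjusts only pitch and roll and leaves yaw untouched. The function name and signature are illustrative assumptions.

```python
# Sketch of the pitch/roll-only correction: rotate the detected floor normal
# onto +Y and shift only the vertical height. Illustrative, not the tool's code.
import numpy as np
from scipy.spatial.transform import Rotation

def plane_correction(normal, plane_y, target_y=0.0):
    """Return (R, t): minimal rotation taking `normal` to +Y (no yaw
    component), plus a vertical-only shift from `plane_y` to `target_y`."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    up = np.array([0.0, 1.0, 0.0])
    axis = np.cross(n, up)                 # lies in the horizontal plane
    s = np.linalg.norm(axis)
    angle = np.arctan2(s, float(np.dot(n, up)))
    R = (np.eye(3) if s < 1e-12
         else Rotation.from_rotvec(axis / s * angle).as_matrix())
    t = np.array([0.0, target_y - plane_y, 0.0])  # X/Z preserved
    return R, t
```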
### Tuning Guidance
Based on end-to-end observations:
- **`--stride`**: Default is 8. Decrease to 4 or 2 for higher density if the floor is far away or sparse.
- **`--ransac-dist-thresh`**: Default 0.02m (2cm). Increase to 0.03-0.05m if the floor is uneven or depth noise is high.
- **`--max-rotation-deg`**: Keep this tight (3-5°). If the floor correction needs >5°, the initial ArUco calibration is likely poor and should be re-run.
- **`--target-y`**: Use this if you need the floor to be at a specific absolute height (e.g., -1.5m) instead of just consistent.
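To build intuition for how `--ransac-dist-thresh` trades inlier count against plane precision, here is a minimal numpy RANSAC plane fit. The tool itself uses Open3D's `segment_plane`; this standalone sketch only mirrors the idea.

```python
# Minimal numpy RANSAC plane fit (the tool uses Open3D; this is a sketch).
# Fits a plane n·p + d = 0 and returns the inlier mask for a distance threshold.
import numpy as np

def ransac_plane(points, dist_thresh=0.02, iters=200, rng=None):
    rng = np.random.default_rng(rng)
    best = (None, None, np.zeros(len(points), dtype=bool))
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-12:
            continue  # degenerate (collinear) sample
        n = n / norm
        d = -float(np.dot(n, sample[0]))
        mask = np.abs(points @ n + d) < dist_thresh
        if mask.sum() > best[2].sum():
            best = (n, d, mask)
    return best

# Synthetic floor at y≈0 (σ = 5 mm) plus off-plane outliers.
gen = np.random.default_rng(0)
floor = np.column_stack([gen.uniform(-1, 1, 300),
                         gen.normal(0.0, 0.005, 300),
                         gen.uniform(-1, 1, 300)])
clutter = gen.uniform(-1, 1, (50, 3)) + [0.0, 1.0, 0.0]
pts = np.vstack([floor, clutter])
n, d, mask = ransac_plane(pts, dist_thresh=0.02, rng=1)
```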
## Known Unexpected Behavior / Troubleshooting
### Resolved: Depth Refinement Failure (Unit Mismatch)
*Note: This issue has been resolved in the latest version by enforcing explicit meter units in the SVO reader and removing ambiguous manual conversions.*
**Previous Symptoms:**
- `depth_verify` reports extremely large RMSE values (e.g., > 1000).
- `refine_depth` reports `success: false`, `iterations: 0`, and near-zero improvement.
**Resolution:**
The system now explicitly sets `InitParameters.coordinate_units = sl.UNIT.METER` when opening SVO files, ensuring consistent units across the pipeline.
### Optimization Stalls
If `refine_depth` shows `success: false` but `nfev` (evaluations) is high, the optimizer may have hit a flat region or local minimum.
- **Check**: Look at `termination_message` in the JSON output.
- **Fix**: Try enabling `--use-confidence-weights` or checking if the initial ArUco pose is too far off (reprojection error > 2.0).
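A quick way to audit these fields is a small script over the JSON output. The exact nesting of the output file is not documented here, so this sketch searches for `refine_depth` objects recursively rather than assuming a layout; only the field names (`success`, `nfev`, `termination_message`) come from the text above.

```python
# Hedged sketch: find every refine_depth object in the JSON output and flag
# failed optimizations. The output file's nesting is assumed unknown.
import json

def find_refine_stats(node, path=""):
    """Yield (path, stats) for every `refine_depth` object in the JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "refine_depth":
                yield path, value
            else:
                yield from find_refine_stats(value, f"{path}/{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from find_refine_stats(value, f"{path}/{i}")

doc = json.loads("""
{"cameras": {"12345": {"refine_depth":
  {"success": false, "nfev": 200, "termination_message": "xtol reached"}}}}
""")
for path, stats in find_refine_stats(doc):
    if not stats.get("success", True):
        print(path, stats.get("nfev"), stats.get("termination_message"))
```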
## Implementation Details
### 1. Depth Data Structure (`--save-depth`)
The system uses HDF5 for efficient, compressed storage of depth data required for decoupled refinement.
**File Structure:**
- **`meta/`**: Global metadata.
  - `schema_version`: Integer version (currently 1).
  - `units`: Explicitly "meters".
  - `coordinate_frame`: "world_from_cam".
- **`cameras/{serial}/`**: Per-camera data.
  - `intrinsics`: 3x3 camera matrix.
  - `resolution`: [width, height].
  - `pooled_depth`: Aggregated depth map (gzip compressed, level 4).
  - `raw_frames/`: (Optional) Individual frames if pooling wasn't used.
This structure allows `refine_ground_plane.py` to load pre-processed depth maps without needing the original SVO files or re-running the ArUco detection pipeline.
### 2. Ground Plane Refinement Pipeline
The `refine_ground_plane.py` tool implements a robust multi-camera consensus algorithm to align the floor plane.
**Core Algorithm (`aruco/ground_plane.py`):**
1. **Per-Camera Plane Detection**:
   - **Unprojection**: Converts the depth map to a point cloud in the *world frame* using the initial ArUco extrinsics.
   - **RANSAC**: Uses Open3D's `segment_plane` to find the dominant plane.
   - **Quality Gates**:
     - `min_inliers`: Requires at least 500 points.
     - `normal_vertical_thresh`: Normal must be roughly vertical (>0.9 dot product with the Y-axis).
2. **Robust Consensus**:
   - Computes the **geometric median** of all valid plane normals and distances to reject outliers.
   - **Outlier Rejection**: Discards planes deviating >15° in angle or >0.5 m in distance from the median.
   - **Weighted Average**: Computes the final consensus plane from the remaining inliers.
3. **Correction Calculation**:
   - Computes a rigid transform $T_{corr}$ for each camera to align its detected floor to the consensus plane (or absolute Y=0).
   - **Constraints**:
     - **Rotation**: Corrects only pitch and roll to align the normal. Yaw is preserved.
     - **Translation**: Corrects only vertical height. X/Z positions are preserved.
   - **Consensus-Relative Correction**: By default, aligns cameras to the *consensus plane* to ensure relative consistency.
   - **Safety Bounds**: The correction is **rejected** if it exceeds safety limits (default: 5° rotation, 0.1 m translation).
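The geometric median used in the consensus step makes a single tilted camera plane far less influential than it would be under a simple mean. A Weiszfeld-iteration sketch (the real implementation may differ in details):

```python
# Weiszfeld-style geometric median, as used conceptually for the plane
# consensus. Sketch only; the tool's implementation may differ.
import numpy as np

def geometric_median(points, iters=100, eps=1e-9):
    """Point minimizing the sum of Euclidean distances to `points` (N, D)."""
    points = np.asarray(points, dtype=float)
    median = points.mean(axis=0)          # mean as the starting guess
    for _ in range(iters):
        dist = np.linalg.norm(points - median, axis=1)
        dist = np.maximum(dist, eps)      # avoid division by zero
        weights = 1.0 / dist
        new = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new - median) < eps:
            break
        median = new
    return median

# Three well-aligned plane normals plus one badly tilted outlier: the median
# stays near the cluster, while the mean would be pulled toward the outlier.
normals = np.array([[0.0, 1.0, 0.0], [0.02, 1.0, 0.0],
                    [-0.01, 1.0, 0.01], [0.7, 0.7, 0.0]])
med = geometric_median(normals)
```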
### 3. Observed Behavior & Tuning
**Real-World Performance:**
- **Legacy/Unstable Behavior**: In early versions (before unit standardization), the system often reported 0 corrections or attempted extreme translations (>1m) due to mm/m confusion or depth noise.
- **Hardened Behavior**: In validated runs, the system now applies small, precise corrections (e.g., max ~0.078m translation, < 1° rotation), effectively "snapping" the floor to a consistent level without disrupting the lateral calibration.
**Why No ICP?**
Iterative Closest Point (ICP) is **not enabled** by default for ground plane alignment.
- **Reason**: ICP on featureless planar surfaces is ill-constrained; it can "slide" along the floor, introducing drift in X/Z.
- **Approach**: Plane-to-plane alignment is analytically exact for the vertical dimension and rotation, which are the only degrees of freedom we want to correct.
**When to Escalate:**
If the ground plane refinement fails or produces large corrections (>5°), it usually indicates:
1. **Poor Initial Calibration**: The ArUco markers were moved or poorly detected. Re-run `calibrate_extrinsics.py`.
2. **Non-Planar Floor**: The floor has significant slopes or steps.
3. **Obstacles**: Large objects are occluding the floor in the depth map.