docs(calibration): update findings summary and troubleshooting for depth refinement

2026-02-07 04:13:46 +00:00
parent fca04c47c1
commit ddb7054f96
1 changed files with 89 additions and 0 deletions
@@ -101,3 +101,92 @@ uv run calibrate_extrinsics.py \
  --debug \
  --no-preview
 ```
+
+## Known Unexpected Behavior / Troubleshooting
+
+### Depth Refinement Failure (Unit Mismatch)
+
+**Symptoms:**
+- `depth_verify` reports extremely large RMSE values (e.g., > 1000).
+- `refine_depth` reports `success: false`, `iterations: 0`, and near-zero improvement.
+- The optimization fails to converge or produces nonsensical results.
+
+**Root Cause:**
+The ZED SDK's `retrieve_measure(sl.MEASURE.DEPTH)` returns depth values in the unit set by `InitParameters.coordinate_units`, which defaults to **millimeters** (`sl.UNIT.MILLIMETER`). The calibration system (extrinsics, marker geometry), however, operates in **meters**.
+
+This 1000× scale mismatch inflates the residuals in the optimization objective by roughly three orders of magnitude, which breaks the numerical stability of the L-BFGS-B solver.
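
To make the scale of the problem concrete, here is a minimal sketch with made-up depth values (not taken from any real capture). It compares the RMSE when both sides are in meters against the RMSE when the measured depth is left in millimeters. The helper `rmse` and all numbers are illustrative assumptions.

```python
import math


def rmse(measured, predicted):
    """Root-mean-square error between two equal-length depth sequences."""
    if len(measured) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical depths predicted from the extrinsics at three marker corners, in meters.
predicted_m = [1.50, 2.00, 2.50]
# The same surfaces as a sensor might report them, with about 1 cm of noise, in meters.
measured_m = [1.51, 1.99, 2.51]
# The identical readings, but left in millimeters.
measured_mm = [d * 1000.0 for d in measured_m]

rmse_consistent = rmse(measured_m, predicted_m)   # centimeter scale
rmse_mismatched = rmse(measured_mm, predicted_m)  # thousands
```

With consistent units the error stays at the centimeter level. With the mismatch it lands in the thousands, which matches the symptom listed above.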
+
+**Mitigation:**
+The `SVOReader` class in `aruco/svo_sync.py` explicitly converts the retrieved depth map to meters:
+```python
+# aruco/svo_sync.py
+# Depth arrives in millimeters (the ZED default); convert to meters.
+return depth_data / 1000.0
+```
+This ensures that all geometric math downstream remains consistent in meters.
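
The unconditional division is only correct for millimeter input. As a hedged sketch of what a guarded variant could look like, the snippet below converts according to a declared unit and adds a plausibility check on the median depth. The names `depth_to_meters` and `looks_like_meters`, the string unit labels, and the 40 m bound are all assumptions for illustration; none of them come from `aruco/svo_sync.py` or the ZED SDK. Plain lists keep the sketch dependency-free, whereas real code would likely operate on NumPy arrays.

```python
import math
import statistics

# Divisors that bring each unit to meters. The keys are hypothetical labels
# for this sketch, not ZED SDK enum values.
_PER_METER = {"millimeter": 1000.0, "centimeter": 100.0, "meter": 1.0}


def depth_to_meters(depth_values, unit):
    """Convert depth samples to meters according to the declared unit."""
    try:
        divisor = _PER_METER[unit]
    except KeyError:
        raise ValueError(f"unsupported depth unit: {unit!r}") from None
    return [d / divisor for d in depth_values]


def looks_like_meters(depth_m, max_plausible_m=40.0):
    """Heuristic check: the median finite depth should be a plausible distance in meters."""
    finite = [d for d in depth_m if math.isfinite(d)]
    if not finite:
        return False
    return 0.0 < statistics.median(finite) <= max_plausible_m
```

A call such as `looks_like_meters(depth_to_meters(samples, "millimeter"))` would flag a frame whose values were already in meters or were scaled twice, because the median would then fall far outside the expected range on one side or the other only if the bound is chosen for the actual scene. The bound is scene-dependent and should be tuned.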
+
+**Diagnostic Check:**
+If you suspect a unit mismatch, check the `depth_verify` RMSE in the output JSON.
+- **Healthy:** RMSE < 0.5 (meters)
+- **Mismatch:** RMSE > 100 (likely millimeters)
+
+*Note: Confidence filtering (`--depth-confidence-threshold`) is orthogonal to this issue. A unit mismatch affects all valid pixels regardless of confidence.*
+
+## Findings Summary (2026-02-07 exhaustive search)
+
+This section summarizes the latest deep investigation across local code, outputs, and external docs.
+
+### Confirmed Facts
+
+1. **Marker geometry parquet is in meters**
+   - `aruco/markers/standard_box_markers_600mm.parquet` stores values around `0.3` (meters), not `300` (millimeters).
+   - `docs/marker-parquet-format.md` also documents meter-scale coordinates.
+
+2. **Depth unit contract is still fragile**
+   - ZED defaults to millimeters unless `InitParameters.coordinate_units` is explicitly set.
+   - Current reader path converts depth by dividing by `1000.0` in `aruco/svo_sync.py`.
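+   - This is correct only while incoming depth really is in millimeters. If `coordinate_units` is changed elsewhere (for example, set to meters), the same division would shrink depth values by a further factor of 1000.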
+
+3. **Observed runtime behavior still indicates refinement instability**
+   - Existing outputs (for example `output/aligned_refined_extrinsics*.json`) show very large `depth_verify.rmse`, often `refine_depth.success: false`, `iterations: 0`, and negligible improvement.
+   - This suggests that factors beyond the original mm↔m mismatch currently limit refinement quality.
+
+4. **Current refinement objective is not robust enough**
+   - The objective consists of plain squared depth residuals plus a simple regularization term.
+   - It does **not** currently include a robust loss (Huber/Soft-L1), confidence weighting in the objective, or strong convergence diagnostics.
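
For reference, the difference between a plain squared residual and a robust, confidence-weighted one can be sketched as follows. This is not the project's objective function. The function names, the `delta` of 0.05 m, and the convention that weights lie in `[0, 1]` with higher meaning more trusted are assumptions; how a ZED confidence map would be mapped to such weights is left open.

```python
def huber(residual, delta):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def robust_depth_cost(measured_m, predicted_m, weights, delta=0.05):
    """Confidence-weighted mean Huber loss over depth residuals in meters."""
    if not (len(measured_m) == len(predicted_m) == len(weights)):
        raise ValueError("inputs must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0.0:
        raise ValueError("no residual has positive weight")
    cost = sum(
        w * huber(m - p, delta)
        for m, p, w in zip(measured_m, predicted_m, weights)
    )
    return cost / total_weight


# A 1 m outlier contributes 0.5 under the half-squared loss but far less under Huber.
outlier_squared = 0.5 * 1.0 ** 2
outlier_huber = huber(1.0, delta=0.05)
```

Because the Huber penalty grows linearly past `delta`, a single bad depth pixel can no longer dominate the cost, and a low weight reduces its influence further.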
+
+### Likely Contributors to Poor Refinement
+
+- Depth outliers are not sufficiently down-weighted in optimization.
+- Confidence map is used for verification filtering, but not as residual weights in the optimizer objective.
+- Representative frame choice uses the latest valid frame, not necessarily the best-quality frame.
+- Optimizer diagnostics are limited, making it hard to distinguish "real convergence" from "stuck at initialization".
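
One way to separate "real convergence" from "stuck at initialization" is an explicit acceptance check on the optimizer result. The sketch below is an assumption-laden illustration: `RefineVerdict`, `assess_refinement`, and the 1% relative-gain threshold are hypothetical, and only the `success` and `iterations` inputs mirror fields already reported under `refine_depth`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefineVerdict:
    accepted: bool
    reason: str


def assess_refinement(rmse_before, rmse_after, success, iterations,
                      min_relative_gain=0.01):
    """Decide whether a refinement run produced a meaningful improvement."""
    if not success:
        return RefineVerdict(False, "optimizer reported failure")
    if iterations == 0:
        return RefineVerdict(False, "optimizer never left the initial estimate")
    if rmse_before <= 0.0:
        return RefineVerdict(False, "initial RMSE is not positive; nothing to improve")
    gain = (rmse_before - rmse_after) / rmse_before
    if gain < min_relative_gain:
        return RefineVerdict(False, f"no effective refinement (relative gain {gain:.2%})")
    return RefineVerdict(True, f"RMSE improved by {gain:.2%}")
```

Logging the `reason` string per camera would make it visible at a glance which runs were rejected and why.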
+
+### Recommended Implementation Order (for next session)
+
+1. **Unit hardening (P0)**
+   - Explicitly set `init_params.coordinate_units = sl.UNIT.METER` in the SVO reader.
+   - Remove or guard the manual `/1000.0` conversion to avoid double-scaling risk.
+   - Add depth sanity logs (min/median/max sampled depth) under `--debug`.
+
+2. **Robust objective (P0)**
+   - Replace the MSE-only residual with a Huber (or Soft-L1) loss, keeping residuals in meters.
+   - Add confidence-weighted depth residuals to the objective function.
+   - Split translation/rotation regularization coefficients.
+
+3. **Frame quality selection (P1)**
+   - Replace "latest valid frame" with best-frame scoring:
+     - marker count (higher better)
+     - median reprojection error (lower better)
+     - valid depth ratio (higher better)
+
+4. **Diagnostics and acceptance gates (P1)**
+   - Log optimizer termination reason, gradient/step behavior, and effective valid points.
+   - Treat tiny RMSE changes as "no effective refinement" even if the optimizer returns without error.
+
+5. **Benchmark matrix (P1)**
+   - Compare baseline vs robust loss vs robust+confidence vs robust+confidence+best-frame.
+   - Report per-camera pre/post RMSE, iteration count, and success/failure reason.
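
The best-frame scoring in step 3 could be sketched roughly as below. The three criteria come from the list above, but everything else is an assumption: the `FrameStats` fields, the equal weighting of the three terms, and the `reproj_scale_px` normalization are illustrative choices, not the project's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameStats:
    index: int
    marker_count: int
    median_reproj_error_px: float
    valid_depth_ratio: float  # fraction of usable depth pixels, in [0, 1]


def frame_score(frame, max_markers, reproj_scale_px=1.0):
    """Higher is better; each term is normalized to roughly [0, 1]."""
    marker_term = frame.marker_count / max_markers if max_markers else 0.0
    # Maps an error of 0 px to 1.0 and decays toward 0 as the error grows.
    reproj_term = 1.0 / (1.0 + frame.median_reproj_error_px / reproj_scale_px)
    return marker_term + reproj_term + frame.valid_depth_ratio


def select_best_frame(frames):
    """Return the frame with the highest combined score."""
    if not frames:
        raise ValueError("no candidate frames")
    max_markers = max(f.marker_count for f in frames)
    return max(frames, key=lambda f: frame_score(f, max_markers))


# Hypothetical candidates; the latest one (index 2) is not the best one.
candidates = [
    FrameStats(0, marker_count=2, median_reproj_error_px=3.0, valid_depth_ratio=0.4),
    FrameStats(1, marker_count=4, median_reproj_error_px=0.5, valid_depth_ratio=0.9),
    FrameStats(2, marker_count=4, median_reproj_error_px=2.0, valid_depth_ratio=0.8),
]
best = select_best_frame(candidates)
```

In this made-up example a "latest valid frame" policy would pick index 2, while the score prefers index 1 because of its lower reprojection error and higher valid depth ratio. The relative weights would need tuning against the benchmark matrix.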
+
+### Practical note
+
+The previous troubleshooting section correctly explains one important failure mode (unit mismatch), but current evidence shows that **robust objective design and frame quality control** are now the primary bottlenecks for meaningful depth refinement gains.