Robust Optimization Patterns

  • Use method='trf' when combining a robust loss with bounds; method='lm' supports neither.
  • loss='cauchy' is highly effective for outlier-heavy depth data.
  • f_scale should be tuned to the expected inlier noise (e.g., sensor precision).
  • least_squares has no weights parameter, so per-point weights must be multiplied into the residual vector manually.
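
The four points above can be combined in one call. The following is a minimal sketch on a toy line fit, not the project's refinement code: the data, the weights array, the bounds, and the starting guess are all made up for illustration. It assumes the intent is to weight each squared residual by w, which means scaling the residual by sqrt(w).

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)

# Toy data: y = 2x + 1 with 1 cm noise, plus five gross 3 m outliers.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.01, x.size)
y[::10] += 3.0

# Hypothetical per-point weights, e.g. derived from sensor confidence.
weights = np.where(x > 0.9, 0.25, 1.0)


def residuals(params):
    slope, intercept = params
    # No weights argument exists, so scale the residuals directly.
    # Multiplying by sqrt(w) weights each squared residual by w.
    return np.sqrt(weights) * (slope * x + intercept - y)


result = least_squares(
    residuals,
    x0=[1.8, 1.1],                         # rough initial guess near the solution
    method="trf",                          # supports robust loss and bounds together
    loss="cauchy",
    f_scale=0.1,                           # expected inlier noise scale, in meters
    bounds=([0.0, 0.0], [5.0, 5.0]),
)
print(result.x)
```

The Cauchy loss is non-convex, so this sketch starts from a guess close to the solution, much as a refinement step would start from an initial calibration.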

Unit Hardening Learnings

  • SDK Unit Consistency: Explicitly setting init_params.coordinate_units = sl.UNIT.METER ensures that all SDK-retrieved measures (depth, point clouds, tracking) are in meters, avoiding manual conversion errors.
  • Double Scaling Guard: When moving to SDK-level meter units, existing manual conversions (e.g., / 1000.0) must be guarded or removed. Checking cam.get_init_parameters().coordinate_units provides a safe runtime check.
  • Depth Sanity Logging: Adding min/median/max/p95 stats for valid depth values in debug logs helps identify scaling issues (e.g., seeing values in the thousands when expecting meters) or data quality problems early.
  • Loguru Integration: Standardized on loguru for debug logging in SVOReader to match project patterns.
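
The double-scaling guard and the depth sanity stats can be sketched without the SDK. In this illustration the unit is passed as a plain string standing in for the SDK enum, and the helper names (to_meters, depth_stats, looks_like_millimeters) and the 100 m threshold are assumptions, not names from the project.

```python
import numpy as np


def to_meters(depth, unit_name):
    """Convert depth to meters exactly once, based on the reported unit.

    unit_name is a plain string standing in for the SDK unit enum (assumption).
    """
    if unit_name == "METER":
        return depth
    if unit_name == "MILLIMETER":
        return depth / 1000.0
    raise ValueError(f"unsupported unit: {unit_name}")


def depth_stats(depth_m):
    """Summarize finite, positive depth values (assumed to be in meters)."""
    valid = depth_m[np.isfinite(depth_m) & (depth_m > 0)]
    if valid.size == 0:
        return None
    return {
        "min": float(valid.min()),
        "median": float(np.median(valid)),
        "p95": float(np.percentile(valid, 95)),
        "max": float(valid.max()),
    }


def looks_like_millimeters(stats, threshold_m=100.0):
    """Heuristic: a median far above plausible scene depth suggests mm data."""
    return stats is not None and stats["median"] > threshold_m
```

Logging the dictionary returned by depth_stats at debug level is enough to spot values in the thousands where meters are expected.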

Best-Frame Selection (Task 4)

  • Implemented score_frame function in calibrate_extrinsics.py to evaluate frame quality.
  • Scoring criteria:
    • Base score: n_markers * 100.0 - reproj_err
    • Depth bonus: Up to +50.0 based on valid depth ratio at marker corners.
  • Main loop now tracks the frame with the highest score per camera instead of just the latest valid frame.
  • Deterministic tie-breaking: The first frame to reach a given score is kept, because the comparison current_score > best_so_far["score"] is strict.
  • This ensures depth verification and refinement use the highest quality data available in the SVO.
  • Regression Testing for Units: Added tests/test_depth_units.py which mocks sl.Camera and sl.Mat to verify that _retrieve_depth correctly handles both sl.UNIT.METER (no scaling) and sl.UNIT.MILLIMETER (divides by 1000) paths. This ensures the unit hardening is robust against future changes.
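
The scoring and tie-breaking rules can be sketched as follows. This is an illustration under assumptions rather than the code in calibrate_extrinsics.py: the depth bonus is assumed to be linear in the valid-depth ratio (the notes only say "up to +50.0"), and the frame dictionaries and select_best helper are hypothetical.

```python
def score_frame(n_markers, reproj_err, depth_valid_ratio=0.0):
    """Base score plus a depth bonus of at most +50.0."""
    base = n_markers * 100.0 - reproj_err
    # Assumption: the bonus scales linearly with the clamped ratio.
    bonus = 50.0 * min(max(depth_valid_ratio, 0.0), 1.0)
    return base + bonus


def select_best(frames):
    """Keep the highest-scoring frame; on ties, the earliest one wins."""
    best_so_far = None
    for frame in frames:
        current_score = score_frame(
            frame["n_markers"], frame["reproj_err"], frame["depth_valid_ratio"]
        )
        # Strict '>' means a later frame with an equal score never replaces
        # the one already stored.
        if best_so_far is None or current_score > best_so_far["score"]:
            best_so_far = {"score": current_score, "frame": frame}
    return best_so_far


frames = [
    {"id": 0, "n_markers": 3, "reproj_err": 0.8, "depth_valid_ratio": 0.5},
    {"id": 1, "n_markers": 4, "reproj_err": 0.5, "depth_valid_ratio": 1.0},
    {"id": 2, "n_markers": 4, "reproj_err": 0.5, "depth_valid_ratio": 1.0},
]
best = select_best(frames)
```

Because the marker count is multiplied by 100.0, one extra detected marker outweighs both the depth bonus and typical reprojection errors.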

Robust Optimizer Implementation (Task 2)

  • Replaced scipy.optimize.minimize (method='L-BFGS-B') with scipy.optimize.least_squares (method='trf', loss='soft_l1').
  • Key Finding: soft_l1 loss with f_scale=0.1 (10 cm) effectively ignores 3 m outliers in synthetic tests, whereas the plain squared-error (MSE) objective is heavily biased by them.
  • Regularization: Split into reg_rot (0.1) and reg_trans (1.0) to penalize translation more heavily in meters.
  • Testing: Synthetic tests require careful depth map painting to ensure markers project into the correct "measured" regions as the optimizer moves the camera. Depth is read through a 5x5 window around each projected point, and that point shifts during optimization, so the painted region must extend at least +/- 30 pixels to cover the optimization trajectory.
  • Convergence: least_squares with robust loss may stop slightly earlier than MSE on clean data due to gradient dampening; the unit-test tolerance was relaxed to 5 mm.
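
The key finding can be reproduced in a toy setting. The sketch below estimates a single depth offset from synthetic data with 10% of the points shifted by 3 m; it is a simplified stand-in for the project's synthetic tests, and every name and number other than soft_l1 and f_scale=0.1 is made up for the example.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(42)

true_offset = 0.05                      # meters
n_points = 100
predicted = rng.uniform(1.0, 4.0, n_points)
measured = predicted + true_offset + rng.normal(0.0, 0.005, n_points)
measured[:10] += 3.0                    # 10% of points are 3 m outliers


def residuals(params):
    # One residual per point: model depth plus offset minus measured depth.
    return predicted + params[0] - measured


fit_linear = least_squares(residuals, x0=[0.0], loss="linear")
fit_robust = least_squares(
    residuals, x0=[0.0], method="trf", loss="soft_l1", f_scale=0.1
)

print("linear :", fit_linear.x[0])
print("soft_l1:", fit_robust.x[0])
```

With the linear loss the estimate is pulled toward the outliers by roughly their share of the data times their magnitude, while the soft_l1 estimate should stay much closer to the true offset.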

Task 5: Diagnostics and Acceptance Gates

  • Surfaced rich optimizer diagnostics in refine_extrinsics_with_depth stats: termination_status, nfev, njev, optimality, n_active_bounds.
  • Added data quality counts: n_points_total, n_depth_valid, n_confidence_rejected.
  • Implemented warning gates in calibrate_extrinsics.py:
    • Negligible improvement: Warns if improvement_rmse < 1e-4 after more than 5 iterations.
    • Stalled/Failed: Warns if success is false or nfev <= 1.
  • These diagnostics provide better visibility into why refinement might be failing or doing nothing, which is critical for the upcoming benchmark matrix (Task 6).
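
A minimal sketch of how these diagnostics and gates might fit together is shown below. It reads only attributes that scipy's least_squares result provides (success, status, nfev, njev, optimality, active_mask); the function names and warning strings are hypothetical, and nfev is used as the iteration count, which is an assumption about how "iterations" is measured.

```python
def collect_diagnostics(result):
    """Pull optimizer diagnostics from a least_squares result object."""
    return {
        "success": bool(result.success),
        "termination_status": int(result.status),
        "nfev": int(result.nfev),
        "njev": result.njev,
        "optimality": float(result.optimality),
        # active_mask is 0 for free variables and +/-1 at a bound.
        "n_active_bounds": sum(1 for m in result.active_mask if m != 0),
    }


def refinement_warnings(stats, improvement_rmse):
    """Return warning labels for the two gates described above."""
    warnings = []
    if not stats["success"] or stats["nfev"] <= 1:
        warnings.append("stalled_or_failed")
    # Assumption: nfev stands in for the number of iterations.
    if stats["nfev"] > 5 and improvement_rmse < 1e-4:
        warnings.append("negligible_improvement")
    return warnings
```

Returning labels instead of logging directly keeps the gates easy to unit test; the caller can decide how to report them.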

Benchmark Matrix Implementation

  • Added --benchmark-matrix flag to calibrate_extrinsics.py.
  • Implemented run_benchmark_matrix to compare 4 configurations:
    1. baseline (linear loss, no confidence)
    2. robust (soft_l1, f_scale=0.1, no confidence)
    3. robust+confidence (soft_l1, f_scale=0.1, confidence weights)
    4. robust+confidence+best-frame (same as 3 but using the best-scored frame instead of the first valid one)
  • The benchmark results are printed as a table to stdout and saved in the output JSON under the benchmark key for each camera.
  • Captured first_frames in the main loop to provide a consistent baseline for comparison against the best_frame (verification_frames).
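
The four configurations can be written down as data. The sketch below is one possible layout, not the structure used by run_benchmark_matrix: the dictionary keys, the run_matrix and format_table helpers, and the refine callback are all hypothetical.

```python
# Hypothetical config keys; the four rows mirror the list above.
BENCHMARK_CONFIGS = [
    # f_scale has no effect with the linear loss; 1.0 is scipy's default.
    {"name": "baseline", "loss": "linear", "f_scale": 1.0,
     "use_confidence": False, "frame": "first"},
    {"name": "robust", "loss": "soft_l1", "f_scale": 0.1,
     "use_confidence": False, "frame": "first"},
    {"name": "robust+confidence", "loss": "soft_l1", "f_scale": 0.1,
     "use_confidence": True, "frame": "first"},
    {"name": "robust+confidence+best-frame", "loss": "soft_l1", "f_scale": 0.1,
     "use_confidence": True, "frame": "best"},
]


def run_matrix(configs, refine):
    """Call refine(config) for each config and collect one row per run."""
    return [{"name": cfg["name"], **refine(cfg)} for cfg in configs]


def format_table(rows, columns):
    """Render rows as a plain-text table suitable for stdout."""
    header = " | ".join(columns)
    lines = [header, "-" * len(header)]
    for row in rows:
        lines.append(" | ".join(str(row.get(col, "")) for col in columns))
    return "\n".join(lines)
```

Here refine is expected to return a dictionary of metrics for one configuration, so the same rows can be printed with format_table and stored under the benchmark key in the output JSON.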

Documentation Updates (2026-02-07)

Workflow Documentation

  • Updated docs/calibrate-extrinsics-workflow.md to reflect the new robust refinement pipeline.
  • Added documentation for new CLI flags: --use-confidence-weights, --benchmark-matrix.
  • Explained the switch from L-BFGS-B (MSE) to least_squares (Soft-L1) for robust optimization.
  • Documented the "Best Frame Selection" logic (scoring based on marker count, reprojection error, and valid depth).
  • Marked the "Unit Mismatch" issue as resolved due to explicit meter enforcement in SVOReader.

Key Learnings

  • Documentation as Contract: Updating the docs after implementation revealed that the "Unit Mismatch" section was outdated. Explicitly marking it as "Resolved" preserves the history while clarifying current behavior.
  • Benchmark Matrix Value: Documenting the benchmark matrix makes it a first-class citizen in the workflow, encouraging users to empirically verify refinement improvements rather than trusting defaults.
  • Confidence Weights: Explicitly documenting this feature highlights the importance of sensor uncertainty in the optimization process.

Bug Fix: Variable-Length Residual Vectors

  • Fixed a ValueError in scipy.optimize.least_squares caused by the residual vector changing length between iterations.
  • The root cause was filtering for valid depth points inside the residual function. If a point projected outside the image or had invalid depth in one iteration but not another, the vector length would change, which least_squares does not support.
  • Solution: Identify "active" points at the start of refinement (T_initial) and use this fixed set of points for all iterations.
  • If a point becomes invalid during optimization (e.g., projects out of bounds), it is now assigned a large constant residual (10.0 m) instead of being removed from the vector. This maintains a stable dimensionality while discouraging the optimizer from moving towards invalid regions.
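
The fix can be illustrated with a simplified residual function. In this sketch the projection step is left out and pixel coordinates are passed in directly; the helper names and array layout are assumptions, and only the 10.0 m penalty and the fixed active set come from the notes above.

```python
import numpy as np

INVALID_RESIDUAL_M = 10.0  # constant penalty for points that become invalid


def valid_mask(pixels, depth_map):
    """True where an integer (u, v) pixel is in bounds and has usable depth."""
    height, width = depth_map.shape
    u, v = pixels[:, 0], pixels[:, 1]
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    mask = np.zeros(len(pixels), dtype=bool)
    depth = depth_map[v[inside], u[inside]]
    mask[inside] = np.isfinite(depth) & (depth > 0)
    return mask


def fixed_length_residuals(pixels, predicted_depth, depth_map, active_idx):
    """Return one residual per active point, regardless of current validity."""
    px = pixels[active_idx]
    ok = valid_mask(px, depth_map)
    residuals = np.full(len(active_idx), INVALID_RESIDUAL_M)
    measured = depth_map[px[ok, 1], px[ok, 0]]
    residuals[ok] = predicted_depth[active_idx][ok] - measured
    return residuals


# Choose the active set once, from the projection at the initial transform.
depth_map = np.full((4, 4), 2.0)
depth_map[0, 0] = np.nan
pixels_initial = np.array([[1, 1], [2, 2], [0, 0], [5, 5]])
predicted_depth = np.array([2.1, 1.9, 2.0, 2.0])
active_idx = np.flatnonzero(valid_mask(pixels_initial, depth_map))

# Later iteration: the second point has moved out of the image.
pixels_moved = np.array([[1, 1], [9, 2], [0, 0], [5, 5]])
r_initial = fixed_length_residuals(pixels_initial, predicted_depth, depth_map, active_idx)
r_moved = fixed_length_residuals(pixels_moved, predicted_depth, depth_map, active_idx)
```

Both calls return a vector whose length equals the size of the active set, which is the property least_squares requires between iterations.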