Robust Optimization Patterns

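Use method='trf' in scipy.optimize.least_squares when combining a robust loss with bounds; method='lm' supports neither bounds nor non-linear losses.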
loss='cauchy' is highly effective for outlier-heavy depth data.
f_scale should be tuned to the expected inlier noise (e.g., sensor precision).
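Weights must be multiplied into the residual vector manually, because least_squares has no weights argument; with a robust loss, the loss is then applied to the weighted residuals.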
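
A minimal sketch of the four points above, assuming a toy straight-line fit rather than the project's depth residuals. The data, the weights, the bounds, and the choice to start from an ordinary polyfit are all illustrative assumptions:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)

# Toy data (assumption): the line y = 2x + 1 with 1 cm noise and a few gross outliers.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.01, x.size)
y[::10] += 3.0  # every tenth sample is off by 3 m

# Hypothetical per-point confidence weights in [0, 1].
weights = np.ones_like(x)
weights[::7] = 0.5


def residuals(params, x, y, weights):
    slope, intercept = params
    # least_squares has no weights argument, so the residuals are scaled directly.
    return weights * (slope * x + intercept - y)


# Start from the ordinary (non-robust) fit; polyfit returns [slope, intercept].
x0 = np.polyfit(x, y, 1)

result = least_squares(
    residuals,
    x0=x0,
    args=(x, y, weights),
    method="trf",      # supports both bounds and robust losses
    loss="cauchy",
    f_scale=0.05,      # same units as the residuals, a few times the inlier noise
    bounds=([-10.0, -10.0], [10.0, 10.0]),
)
slope, intercept = result.x
```

Because the weights scale the residual itself, a weight meant to scale the squared residual would enter as its square root.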

Unit Hardening Learnings

SDK Unit Consistency: Explicitly setting init_params.coordinate_units = sl.UNIT.METER ensures that all SDK-retrieved measures (depth, point clouds, tracking) are in meters, avoiding manual conversion errors.
Double Scaling Guard: When moving to SDK-level meter units, existing manual conversions (e.g., / 1000.0) must be guarded or removed. Checking cam.get_init_parameters().coordinate_units provides a safe runtime check.
Depth Sanity Logging: Adding min/median/max/p95 stats for valid depth values in debug logs helps identify scaling issues (e.g., seeing values in the thousands when expecting meters) or data quality problems early.
Loguru Integration: Standardized on loguru for debug logging in SVOReader to match project patterns.
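
The depth sanity statistics described above can be sketched with NumPy alone. The function names and the 100 m threshold are hypothetical; the resulting dictionary is what would be passed to the debug logger:

```python
import numpy as np


def depth_stats(depth):
    """Return min/median/max/p95 over valid (finite, positive) depth values."""
    depth = np.asarray(depth, dtype=float)
    valid = depth[np.isfinite(depth) & (depth > 0)]
    if valid.size == 0:
        return None
    return {
        "n_valid": int(valid.size),
        "min": float(valid.min()),
        "median": float(np.median(valid)),
        "max": float(valid.max()),
        "p95": float(np.percentile(valid, 95)),
    }


def looks_like_millimeters(stats, threshold_m=100.0):
    # Heuristic (assumption): a median above ~100 "meters" suggests millimeter data.
    return stats is not None and stats["median"] > threshold_m


depth_m = np.array([[0.8, 1.2, np.nan], [2.5, np.inf, 3.1]])
stats_m = depth_stats(depth_m)            # values in meters
stats_mm = depth_stats(depth_m * 1000.0)  # same scene, accidentally in millimeters
```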

Implemented score_frame function in calibrate_extrinsics.py to evaluate frame quality.
Scoring criteria:
- Base score: n_markers * 100.0 - reproj_err
- Depth bonus: Up to +50.0 based on valid depth ratio at marker corners.
Main loop now tracks the frame with the highest score per camera instead of just the latest valid frame.
Deterministic tie-breaking: The first frame with a given score is kept, because the update uses a strict comparison (current_score > best_so_far["score"]) and a later frame with an equal score never replaces it.
This ensures depth verification and refinement use the highest quality data available in the SVO.
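
A minimal sketch of the scoring and tie-breaking rule described above. The signature of score_frame, the linear form of the depth bonus, and the pick_best helper are assumptions for illustration, not the code in calibrate_extrinsics.py:

```python
def score_frame(n_markers, reproj_err, valid_depth_ratio):
    """Score a frame: more markers and lower reprojection error are better."""
    base = n_markers * 100.0 - reproj_err
    # Assumption: the depth bonus scales linearly with the valid-depth ratio.
    bonus = 50.0 * min(max(valid_depth_ratio, 0.0), 1.0)
    return base + bonus


def pick_best(frames):
    """Return the highest-scoring frame; on ties the earliest frame wins."""
    best_so_far = None
    for index, frame in enumerate(frames):
        current_score = score_frame(**frame)
        # Strict ">" keeps the first frame seen with a given score.
        if best_so_far is None or current_score > best_so_far["score"]:
            best_so_far = {"index": index, "score": current_score}
    return best_so_far


frames = [
    {"n_markers": 2, "reproj_err": 0.5, "valid_depth_ratio": 1.0},
    {"n_markers": 3, "reproj_err": 1.0, "valid_depth_ratio": 0.5},
    {"n_markers": 3, "reproj_err": 1.0, "valid_depth_ratio": 0.5},  # ties with the previous frame
]
best = pick_best(frames)
```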
Regression Testing for Units: Added tests/test_depth_units.py which mocks sl.Camera and sl.Mat to verify that _retrieve_depth correctly handles both sl.UNIT.METER (no scaling) and sl.UNIT.MILLIMETER (divides by 1000) paths. This ensures the unit hardening is robust against future changes.
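
A sketch of how such a mock-based unit test can look. To stay runnable without the ZED SDK it uses a stand-in Unit enum instead of sl.UNIT and a hypothetical depth_to_meters helper instead of the project's _retrieve_depth, so only the testing pattern carries over:

```python
from enum import Enum
from unittest.mock import MagicMock

import numpy as np


class Unit(Enum):
    """Stand-in for sl.UNIT so the sketch runs without the ZED SDK."""
    METER = "meter"
    MILLIMETER = "millimeter"


def depth_to_meters(cam, raw_depth):
    """Hypothetical helper: convert raw depth to meters based on the camera's units."""
    units = cam.get_init_parameters().coordinate_units
    if units == Unit.METER:
        return raw_depth  # already in meters, no scaling
    if units == Unit.MILLIMETER:
        return raw_depth / 1000.0
    raise ValueError(f"Unsupported coordinate units: {units!r}")


def make_mock_camera(units):
    # The mock only needs to answer get_init_parameters().coordinate_units.
    cam = MagicMock()
    cam.get_init_parameters.return_value.coordinate_units = units
    return cam


def test_meter_path_is_not_rescaled():
    depth = np.array([1.5, 2.0])
    out = depth_to_meters(make_mock_camera(Unit.METER), depth)
    np.testing.assert_allclose(out, [1.5, 2.0])


def test_millimeter_path_divides_by_1000():
    depth = np.array([1500.0, 2000.0])
    out = depth_to_meters(make_mock_camera(Unit.MILLIMETER), depth)
    np.testing.assert_allclose(out, [1.5, 2.0])
```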

Replaced minimize(L-BFGS-B) with least_squares(trf, soft_l1).
Key Finding: soft_l1 loss with f_scale=0.1 (10 cm) strongly down-weights 3 m outliers in synthetic tests, so they barely affect the result, whereas the plain squared loss (MSE) is heavily biased by them.
Regularization: Split into reg_rot (0.1) and reg_trans (1.0) so that translation changes (in meters) are penalized more heavily than rotation changes.
Testing: Synthetic tests require careful depth map painting so that markers keep projecting into the correct "measured" regions as the optimizer moves the camera. Depth is read through a 5x5 window lookup, so we need to paint at least +/- 30 pixels to cover both the window and the optimization trajectory.
Convergence: least_squares with robust loss may stop slightly earlier than MSE on clean data due to gradient dampening; relaxed tolerance to 5mm for unit tests.
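
The difference between the two losses can be illustrated with a toy problem: estimating a single depth offset from points where 10% of the measurements are 3 m outliers. The data and the one-parameter model are assumptions chosen for brevity, not the project's synthetic test:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)

true_offset = 0.05                      # meters (toy value)
predicted = rng.uniform(1.0, 4.0, 200)  # depth predicted from the initial pose
measured = predicted + true_offset + rng.normal(0.0, 0.005, predicted.size)
measured[:20] += 3.0                    # 10% of the points are 3 m outliers


def residuals(params):
    (offset,) = params
    return predicted + offset - measured


# Plain squared loss: every residual counts quadratically.
linear = least_squares(residuals, x0=[0.0], method="trf", loss="linear")

# soft_l1: residuals well beyond f_scale grow only linearly in the cost.
robust = least_squares(residuals, x0=[0.0], method="trf", loss="soft_l1", f_scale=0.1)

linear_error = abs(linear.x[0] - true_offset)
robust_error = abs(robust.x[0] - true_offset)
```

The robust loss bounds the influence of each outlier rather than removing it, so a small residual bias remains; it is much smaller than the bias of the squared loss.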

Surfaced rich optimizer diagnostics in refine_extrinsics_with_depth stats: termination_status, nfev, njev, optimality, n_active_bounds.
Added data quality counts: n_points_total, n_depth_valid, n_confidence_rejected.
Implemented warning gates in calibrate_extrinsics.py:
- Negligible improvement: Warns if improvement_rmse < 1e-4 after more than 5 iterations.
- Stalled/Failed: Warns if success is false or nfev <= 1.
These diagnostics provide better visibility into why refinement might be failing or doing nothing, which is critical for the upcoming benchmark matrix (Task 6).
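
A sketch of how these diagnostics can be read from the result object that scipy.optimize.least_squares returns. The helper names are hypothetical, and using nfev as a stand-in for the iteration count is an assumption, since the result does not expose iterations directly:

```python
import numpy as np
from scipy.optimize import least_squares


def collect_diagnostics(result, initial_rmse):
    """Summarize a least_squares result using the stat names listed above."""
    final_rmse = float(np.sqrt(np.mean(result.fun ** 2)))
    return {
        "success": bool(result.success),
        "termination_status": int(result.status),
        "nfev": int(result.nfev),
        "njev": result.njev,
        "optimality": float(result.optimality),
        "n_active_bounds": int(np.count_nonzero(result.active_mask)),
        "improvement_rmse": initial_rmse - final_rmse,
    }


def warnings_for(stats, min_improvement=1e-4, min_evals=5):
    messages = []
    if not stats["success"] or stats["nfev"] <= 1:
        messages.append("refinement stalled or failed")
    # Assumption: nfev is used as a proxy for the iteration count.
    if stats["improvement_rmse"] < min_improvement and stats["nfev"] > min_evals:
        messages.append("negligible improvement")
    return messages


def residuals(params):
    # Toy problem: the unconstrained optimum (1, 5) lies outside the bounds below.
    return params - np.array([1.0, 5.0])


x0 = np.array([0.0, 0.0])
initial_rmse = float(np.sqrt(np.mean(residuals(x0) ** 2)))
result = least_squares(residuals, x0, method="trf", bounds=([-2.0, -2.0], [2.0, 2.0]))
stats = collect_diagnostics(result, initial_rmse)
```

With method='trf' the iterates stay strictly feasible, so active_mask is determined within a tolerance and n_active_bounds is best read as a hint rather than an exact count.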

Added --benchmark-matrix flag to calibrate_extrinsics.py.
Implemented run_benchmark_matrix to compare 4 configurations:
1. baseline (linear loss, no confidence)
2. robust (soft_l1, f_scale=0.1, no confidence)
3. robust+confidence (soft_l1, f_scale=0.1, confidence weights)
4. robust+confidence+best-frame (same as 3 but using the best-scored frame instead of the first valid one)
The benchmark results are printed as a table to stdout and saved in the output JSON under the benchmark key for each camera.
Captured first_frames in the main loop to provide a consistent baseline for comparison against the best_frame (verification_frames).
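
The four configurations can be expressed as data and run in a loop. This is a sketch only: BenchmarkConfig, run_matrix, and the refine callable are hypothetical names, and the real run_benchmark_matrix may be organized differently:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkConfig:
    name: str
    loss: str
    f_scale: float
    use_confidence: bool
    use_best_frame: bool


CONFIGS = [
    # f_scale has no effect with the linear loss; 1.0 is only a placeholder.
    BenchmarkConfig("baseline", "linear", 1.0, False, False),
    BenchmarkConfig("robust", "soft_l1", 0.1, False, False),
    BenchmarkConfig("robust+confidence", "soft_l1", 0.1, True, False),
    BenchmarkConfig("robust+confidence+best-frame", "soft_l1", 0.1, True, True),
]


def run_matrix(refine, first_frame, best_frame):
    """Run every configuration and collect one result row per configuration.

    `refine` is a hypothetical callable: refine(frame, config) -> dict of stats.
    """
    rows = {}
    for config in CONFIGS:
        # Only the last configuration switches from the first valid frame to the best one.
        frame = best_frame if config.use_best_frame else first_frame
        rows[config.name] = refine(frame, config)
    return rows


def fake_refine(frame, config):
    # Stub standing in for the real refinement call.
    return {"frame": frame, "loss": config.loss}


rows = run_matrix(fake_refine, first_frame="first", best_frame="best")
```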

Updated docs/calibrate-extrinsics-workflow.md to reflect the new robust refinement pipeline.
Added documentation for new CLI flags: --use-confidence-weights, --benchmark-matrix.
Explained the switch from L-BFGS-B (MSE) to least_squares (Soft-L1) for robust optimization.
Documented the "Best Frame Selection" logic (scoring based on marker count, reprojection error, and valid depth).
Marked the "Unit Mismatch" issue as resolved due to explicit meter enforcement in SVOReader.

Documentation as Contract: Updating the docs after implementation revealed that the "Unit Mismatch" section was outdated. Explicitly marking it as "Resolved" preserves the history while clarifying current behavior.
Benchmark Matrix Value: Documenting the benchmark matrix makes it a first-class citizen in the workflow, encouraging users to empirically verify refinement improvements rather than trusting defaults.
Confidence Weights: Explicitly documenting this feature highlights the importance of sensor uncertainty in the optimization process.

Fixed a ValueError in scipy.optimize.least_squares caused by the residual vector changing length between iterations.
The root cause was filtering for valid depth points inside the residual function. If a point projected outside the image or had invalid depth in one iteration but not another, the vector length would change, which least_squares does not support.
Solution: Identify "active" points at the start of refinement (T_initial) and use this fixed set of points for all iterations.
If a point becomes invalid during optimization (e.g., projects out of bounds), it is now assigned a large constant residual (10.0m) instead of being removed from the vector. This maintains a stable dimensionality while discouraging the optimizer from moving towards invalid regions.
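
A minimal sketch of the fixed-length residual idea, assuming a nearest-pixel depth lookup instead of the project's window lookup. The helper names are hypothetical; the two pixel arrays stand for the same active set projected at two different iterations:

```python
import numpy as np

INVALID_RESIDUAL_M = 10.0  # constant penalty for points that become invalid


def lookup_depth(depth_map, pixels):
    """Return (depth, valid) per pixel; out-of-bounds or non-finite depth is invalid."""
    height, width = depth_map.shape
    u = np.round(pixels[:, 0]).astype(int)
    v = np.round(pixels[:, 1]).astype(int)
    in_bounds = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.full(len(pixels), np.nan)
    depth[in_bounds] = depth_map[v[in_bounds], u[in_bounds]]
    valid = in_bounds & np.isfinite(depth) & (depth > 0)
    return depth, valid


def fixed_length_residuals(predicted_depth, pixels, depth_map):
    """One residual per active point, whatever its validity at this iteration."""
    measured, valid = lookup_depth(depth_map, pixels)
    residuals = np.full(len(pixels), INVALID_RESIDUAL_M)
    residuals[valid] = predicted_depth[valid] - measured[valid]
    return residuals


depth_map = np.full((4, 4), 2.0)
depth_map[0, 0] = np.nan

predicted = np.array([2.1, 1.9, 2.0])
pixels_a = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])  # all valid
pixels_b = np.array([[0.0, 0.0], [2.0, 2.0], [5.0, 3.0]])  # NaN depth, valid, out of bounds

res_a = fixed_length_residuals(predicted, pixels_a, depth_map)
res_b = fixed_length_residuals(predicted, pixels_b, depth_map)
```

Both calls return three residuals, so the vector length seen by least_squares stays the same even though two of the points are invalid in the second call.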