zed-playground/py_workspace/.sisyphus/plans/finished/depth-refinement-robust.md

Robust Depth Refinement for Camera Extrinsics

TL;DR

Quick Summary: Replace the failing depth-based pose refinement pipeline with a robust optimizer (scipy.optimize.least_squares with soft-L1 loss), add unit hardening, confidence-weighted residuals, best-frame selection, rich diagnostics, and a benchmark matrix comparing configurations.

Deliverables:

  • Unit-hardened depth retrieval (set coordinate_units=METER, guard double-conversion)
  • Robust optimization objective using least_squares(method="trf", loss="soft_l1", f_scale=0.1)
  • Confidence-weighted depth residuals (toggleable via CLI flag)
  • Best-frame selection replacing naive "latest valid frame"
  • Rich optimizer diagnostics and acceptance gates
  • Benchmark matrix comparing baseline/robust/+confidence/+best-frame
  • Updated tests for all new functionality

Estimated Effort: Medium (3-4 hours implementation)
Parallel Execution: YES (4 waves, 2 of them with parallelizable tasks)
Critical Path: Task 1 (units) → Task 2 (robust optimizer) → Task 3 (confidence) → Task 5 (diagnostics) → Task 6 (benchmark)


Context

Original Request

Implement the 5 items from "Recommended Implementation Order" in docs/calibrate-extrinsics-workflow.md, plus research and choose the best optimization method for depth-based camera extrinsic refinement.

Interview Summary

Key Discussions:

  • Requirements were explicitly specified in the documentation (no interactive interview needed)
  • Research confirmed scipy.optimize.least_squares is superior to scipy.optimize.minimize for this problem class

Research Findings:

  • freemocap/anipose (production multi-camera calibration) uses exactly least_squares(method="trf", loss=loss, f_scale=threshold) for bundle adjustment — validates our approach
  • scipy docs recommend soft_l1 or huber for robust fitting; f_scale controls the inlier/outlier threshold
  • Current output JSONs confirm catastrophic failure: RMSE 5000+ meters (aligned_refined_extrinsics_fast.json), RMSE ~11.6m (test_refine_current.json), iterations=0/1, success=false across all cameras
  • Unit mismatch still active despite /1000.0 conversion — ZED defaults to mm, code divides by 1000, but no coordinate_units=METER set
  • Confidence map retrieved but only used in verify filtering, not in optimizer objective
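The soft_l1 finding above can be illustrated with a toy fit on synthetic data (illustration only, not the project's residual function): with a fraction of gross outliers, plain least squares is dragged far from the true value while the robust loss stays close.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
z_true = 2.0                               # true depth offset in meters
z = z_true + rng.normal(0, 0.02, 50)       # inliers: ~2 cm noise
z[:10] += 5.0                              # 20% gross outliers (e.g. a unit mix-up)

residuals = lambda p: z - p[0]             # fit a single offset

fit_linear = least_squares(residuals, x0=[0.0], loss="linear")
fit_robust = least_squares(residuals, x0=[0.0], loss="soft_l1", f_scale=0.1)
# fit_linear.x[0] is pulled toward the outliers; fit_robust.x[0] stays near 2.0
```

The f_scale=0.1 threshold is what makes residuals beyond ~0.1 m count roughly linearly instead of quadratically, which is the downweighting behavior the research findings refer to.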

Metis Review

Identified Gaps (addressed):

  • Output JSON schema backward compatibility → New fields are additive only (existing fields preserved)
  • Confidence weighting can interact with robust loss → Made toggleable, logged statistics
  • Best-frame selection changes behavior → Deterministic scoring, old behavior available as fallback
  • Zero valid points edge case → Explicit early exit with diagnostic
  • Numerical pass/fail gate → Added RMSE threshold checks
  • Regression guard → Default CLI behavior unchanged unless user opts into new features

Work Objectives

Core Objective

Make depth-based extrinsic refinement actually work by fixing the unit mismatch, switching to a robust optimizer, incorporating confidence weighting, and selecting the best frame for refinement.

Concrete Deliverables

  • Modified aruco/svo_sync.py with unit hardening
  • Rewritten aruco/depth_refine.py using least_squares with robust loss
  • Updated aruco/depth_verify.py with confidence weight extraction helper
  • Updated calibrate_extrinsics.py with frame scoring, diagnostics, new CLI flags
  • New and updated tests in tests/
  • Updated docs/calibrate-extrinsics-workflow.md with new behavior docs

Definition of Done

  • uv run pytest passes with 0 failures
  • Synthetic test: robust optimizer converges (success=True, nfev > 1) with injected outliers
  • Existing tests still pass (backward compatibility)
  • Benchmark matrix produces 4 comparable result records

Must Have

  • coordinate_units = sl.UNIT.METER set in SVOReader
  • least_squares with loss="soft_l1" and f_scale=0.1 as default optimizer
  • Confidence weighting via --use-confidence-weights flag
  • Best-frame selection with deterministic scoring
  • Optimizer diagnostics in output JSON and logs
  • All changes covered by automated tests

Must NOT Have (Guardrails)

  • Must NOT change unrelated calibration logic (marker detection, PnP, pose averaging, alignment)
  • Must NOT change file I/O formats or break JSON schema (only additive fields)
  • Must NOT introduce new dependencies beyond scipy/numpy already in use
  • Must NOT implement multi-optimizer auto-selection or hyperparameter search
  • Must NOT turn frame scoring into a ML quality model — simple weighted heuristic only
  • Must NOT add premature abstractions or over-engineer the API
  • Must NOT remove existing CLI flags or change their default behavior

Verification Strategy

UNIVERSAL RULE: ZERO HUMAN INTERVENTION

ALL tasks in this plan MUST be verifiable WITHOUT any human action. Every criterion is verified by running uv run pytest or inspecting code.

Test Decision

  • Infrastructure exists: YES (pytest configured in pyproject.toml, tests/ directory)
  • Automated tests: YES (tests-after, matching existing project pattern)
  • Framework: pytest (via uv run pytest)

Agent-Executed QA Scenarios (MANDATORY — ALL tasks)

Verification Tool by Deliverable Type:

| Type | Tool | How Agent Verifies |
| --- | --- | --- |
| Python module changes | Bash (uv run pytest) | Run tests, assert 0 failures |
| New functions | Bash (uv run pytest -k test_name) | Run specific test, assert pass |
| CLI behavior | Bash (uv run python calibrate_extrinsics.py --help) | Verify new flags present |

Execution Strategy

Parallel Execution Waves

Wave 1 (Start Immediately):
├── Task 1: Unit hardening (svo_sync.py) [no dependencies]
└── Task 4: Best-frame selection (calibrate_extrinsics.py) [no dependencies]

Wave 2 (After Wave 1):
├── Task 2: Robust optimizer (depth_refine.py) [depends: 1]
├── Task 3: Confidence weighting (depth_verify.py + depth_refine.py) [depends: 2]
└── Task 5: Diagnostics and acceptance gates [depends: 2]

Wave 3 (After Wave 2):
└── Task 6: Benchmark matrix [depends: 2, 3, 4, 5]

Wave 4 (After All):
└── Task 7: Documentation update [depends: all]

Critical Path: Task 1 → Task 2 → Task 3 → Task 5 → Task 6

Dependency Matrix

| Task | Depends On | Blocks | Can Parallelize With |
| --- | --- | --- | --- |
| 1 | None | 2, 3 | 4 |
| 2 | 1 | 3, 5, 6 | - |
| 3 | 2 | 6 | 5 |
| 4 | None | 6 | 1 |
| 5 | 2 | 6 | 3 |
| 6 | 2, 3, 4, 5 | 7 | - |
| 7 | All | None | - |

Agent Dispatch Summary

| Wave | Tasks | Recommended Agents |
| --- | --- | --- |
| 1 | 1, 4 | category="quick" for T1; category="unspecified-low" for T4 |
| 2 | 2, 3, 5 | category="deep" for T2; category="quick" for T3, T5 |
| 3 | 6 | category="unspecified-low" |
| 4 | 7 | category="writing" |

TODOs

  • 1. Unit Hardening (P0)

    What to do:

    • In aruco/svo_sync.py, add init_params.coordinate_units = sl.UNIT.METER in the SVOReader.__init__ method, right after init_params.set_from_svo_file(path) (around line 42)
    • Guard the existing /1000.0 conversion: check whether coordinate_units is already METER. If METER is set, skip the division. If not set or MILLIMETER, apply the division. Add a log warning if division is applied as fallback
    • Add depth sanity logging under --debug mode: after retrieving depth, log min/median/max/p95 of valid depth values. This goes in the _retrieve_depth method
    • Write a test that verifies the unit-hardened path doesn't double-convert
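The guard described above can be sketched as a pure function (normalize_depth and units_are_meters are hypothetical names for illustration; the real code would check init_params.coordinate_units against sl.UNIT.METER inside _retrieve_depth and log a warning on the fallback path):

```python
import numpy as np

def normalize_depth(depth: np.ndarray, units_are_meters: bool) -> np.ndarray:
    """Return depth in meters, dividing by 1000 only when units are NOT meters."""
    if units_are_meters:
        return depth                  # already meters: skip the legacy /1000
    # Fallback: assume millimeters (the real code logs a warning here).
    return depth / 1000.0

# Same physical depths expressed in each unit normalize to the same values.
depth_m = np.array([0.5, 1.2, 3.0])
depth_mm = np.array([500.0, 1200.0, 3000.0])
assert np.allclose(normalize_depth(depth_m, True), normalize_depth(depth_mm, False))
```

The test for this task can follow the same shape: feed meter-valued and millimeter-valued synthetic maps through the guarded path and assert they agree, proving no double-conversion occurs.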

    Must NOT do:

    • Do NOT change depth retrieval for confidence maps
    • Do NOT modify the grab_synced() or grab_all() methods
    • Do NOT add new CLI parameters for this task

    Recommended Agent Profile:

    • Category: quick
      • Reason: Small, focused change in one file + one test file
    • Skills: [git-master]
      • git-master: Atomic commit of unit hardening change

    Parallelization:

    • Can Run In Parallel: YES
    • Parallel Group: Wave 1 (with Task 4)
    • Blocks: Tasks 2, 3
    • Blocked By: None

    References:

    Pattern References (existing code to follow):

    • aruco/svo_sync.py:40-44 — Current init_params setup where coordinate_units must be added
    • aruco/svo_sync.py:180-189 — Current _retrieve_depth method with /1000.0 conversion to modify
    • aruco/svo_sync.py:191-196 — Confidence retrieval pattern (do NOT modify, but understand adjacency)

    API/Type References (contracts to implement against):

    • ZED SDK InitParameters.coordinate_units — Set to sl.UNIT.METER
    • loguru.logger — Used project-wide for debug logging

    Test References (testing patterns to follow):

    • tests/test_depth_verify.py:36-66 — Test pattern using synthetic depth maps (follow this style)
    • tests/test_depth_refine.py:21-39 — Test pattern with synthetic K matrix and depth maps

    Documentation References:

    • docs/calibrate-extrinsics-workflow.md:116-132 — Documents the unit mismatch problem and mitigation strategy
    • docs/calibrate-extrinsics-workflow.md:166-169 — Specifies the exact implementation steps for unit hardening

    Acceptance Criteria:

    • init_params.coordinate_units = sl.UNIT.METER is set in SVOReader.__init__ before cam.open()
    • The /1000.0 division in _retrieve_depth is guarded (only applied if units are NOT meters)
    • Debug logging of depth statistics (min/median/max) is added to _retrieve_depth when depth mode is active
    • uv run pytest tests/test_depth_refine.py tests/test_depth_verify.py -q → all pass (no regressions)

    Agent-Executed QA Scenarios:

    Scenario: Verify unit hardening doesn't break existing tests
      Tool: Bash (uv run pytest)
      Preconditions: All dependencies installed
      Steps:
        1. Run: uv run pytest tests/test_depth_refine.py tests/test_depth_verify.py -q
        2. Assert: exit code 0
        3. Assert: output contains "passed" and no "FAILED"
      Expected Result: All existing tests pass
      Evidence: Terminal output captured
    
    Scenario: Verify coordinate_units is set in code
      Tool: Bash (grep)
      Preconditions: File modified
      Steps:
        1. Run: grep -n "coordinate_units" aruco/svo_sync.py
        2. Assert: output contains "UNIT.METER" or "METER"
      Expected Result: Unit setting is present
      Evidence: Grep output
    

    Commit: YES

    • Message: fix(svo): harden depth units — set coordinate_units=METER, guard /1000 conversion
    • Files: aruco/svo_sync.py, tests/test_depth_refine.py
    • Pre-commit: uv run pytest tests/ -q

  • 2. Robust Optimizer — Replace MSE with least_squares + Soft-L1 Loss (P0)

    What to do:

    • Rewrite depth_residual_objective → Replace with a residual vector function depth_residuals(params, ...) that returns an array of residuals (not a scalar cost). Each element is (z_measured - z_predicted) for one marker corner. This is what least_squares expects.
    • Add regularization as pseudo-residuals: Append [reg_weight_rot * delta_rvec, reg_weight_trans * delta_tvec] to the residual vector. This naturally penalizes deviation from the initial pose. Split into separate rotation and translation regularization weights (default: reg_rot=0.1, reg_trans=1.0 — translation more tightly regularized in meters scale).
    • Replace minimize(method="L-BFGS-B") with least_squares(method="trf", loss="soft_l1", f_scale=0.1):
      • method="trf" — Trust Region Reflective, handles bounds naturally
      • loss="soft_l1" — Smooth robust loss, downweights outliers beyond f_scale
      • f_scale=0.1 — Residuals >0.1m are treated as outliers (matches ZED depth noise ~1-5cm)
      • bounds — Same ±5°/±5cm bounds, expressed as (lower_bounds_array, upper_bounds_array) tuple
      • x_scale="jac" — Automatic Jacobian-based scaling (prevents ill-conditioning)
      • max_nfev=200 — Maximum function evaluations
    • Update refine_extrinsics_with_depth signature: Add parameters for loss, f_scale, reg_rot, reg_trans. Keep backward-compatible defaults. Return enriched stats dict including: termination_message, nfev, optimality, active_mask, cost.
    • Handle zero residuals: If residual vector is empty (no valid depth points), return initial pose unchanged with stats indicating "reason": "no_valid_depth_points".
    • Maintain backward-compatible scalar cost reporting: Compute initial_cost and final_cost from the residual vector for comparison with old output format.
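The residual-vector construction above can be sketched as follows. The depth_residuals name matches the plan, but the toy z_predicted_fn and single-offset setup are illustrative assumptions, not the real per-corner geometry (which goes through compute_depth_residual):

```python
import numpy as np
from scipy.optimize import least_squares

def depth_residuals(params, z_measured, z_predicted_fn, params0,
                    reg_rot=0.1, reg_trans=1.0):
    """Per-corner depth residuals plus regularization pseudo-residuals."""
    res = z_measured - z_predicted_fn(params)         # one entry per corner
    delta = params - params0                          # deviation from initial pose
    reg = np.concatenate([reg_rot * delta[:3], reg_trans * delta[3:]])
    return np.concatenate([res, reg])

# Toy setup: only the z-translation component shifts predicted depth.
params0 = np.zeros(6)                                 # [rvec (3), tvec (3)]
base = np.full(20, 1.5)
z_meas = base + 0.02                                  # true offset: 2 cm
pred = lambda p: base + p[5]

deg5, cm5 = np.deg2rad(5.0), 0.05                     # same ±5°/±5cm bounds
lb = np.array([-deg5] * 3 + [-cm5] * 3)
result = least_squares(
    depth_residuals, params0, args=(z_meas, pred, params0),
    method="trf", loss="soft_l1", f_scale=0.1,
    bounds=(lb, -lb), x_scale="jac", max_nfev=200,
)
# result.x[5] lands near 0.02, pulled slightly toward 0 by reg_trans
```

Note how regularization enters as extra residual entries rather than a penalty term added to a scalar cost; this keeps the problem in the form least_squares expects and lets the robust loss apply only where appropriate.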

    Must NOT do:

    • Do NOT change extrinsics_to_params or params_to_extrinsics (the Rodrigues parameterization is correct)
    • Do NOT modify depth_verify.py in this task
    • Do NOT add confidence weighting here (that's Task 3)
    • Do NOT add CLI flags here (that's Task 5)

    Recommended Agent Profile:

    • Category: deep
      • Reason: Core algorithmic change, requires understanding of optimization theory and careful residual construction
    • Skills: []
      • No specialized skills needed — pure Python/numpy/scipy work

    Parallelization:

    • Can Run In Parallel: NO
    • Parallel Group: Wave 2 (sequential after Wave 1)
    • Blocks: Tasks 3, 5, 6
    • Blocked By: Task 1

    References:

    Pattern References (existing code to follow):

    • aruco/depth_refine.py:19-47 — Current depth_residual_objective function to REPLACE
    • aruco/depth_refine.py:50-112 — Current refine_extrinsics_with_depth function to REWRITE
    • aruco/depth_refine.py:1-16 — Import block and helper functions (keep extrinsics_to_params, params_to_extrinsics)
    • aruco/depth_verify.py:27-67 — compute_depth_residual function — this is the per-point residual computation called from the objective. Understand its contract: returns float(z_measured - z_predicted) or None.

    API/Type References:

    • scipy.optimize.least_squares — scipy docs: fun(x, *args) -> residuals_array; parameters: method="trf", loss="soft_l1", f_scale=0.1, bounds=(lb, ub), x_scale="jac", max_nfev=200
    • Return type: OptimizeResult with attributes: .x, .cost, .fun, .jac, .grad, .optimality, .active_mask, .nfev, .njev, .status, .message, .success

    External References (production examples):

    • freemocap/anipose bundle_adjust method — Uses least_squares(error_fun, x0, jac_sparsity=jac_sparse, f_scale=f_scale, x_scale="jac", loss=loss, ftol=ftol, method="trf", tr_solver="lsmr") for multi-camera calibration. Key pattern: residual function returns per-point reprojection errors.
    • scipy Context7 docs — Example shows least_squares(fun, x0, loss='soft_l1', f_scale=0.1, args=(t_train, y_train)) where fun returns residual vector

    Test References:

    • tests/test_depth_refine.py — ALL 4 existing tests must still pass. They test: roundtrip, no-change convergence, offset correction, and bounds respect. The new optimizer must satisfy these same properties.

    Acceptance Criteria:

    • from scipy.optimize import least_squares replaces from scipy.optimize import minimize
    • depth_residuals() returns np.ndarray (vector), not scalar float
    • least_squares(method="trf", loss="soft_l1", f_scale=0.1) is the optimizer call
    • Regularization is split: separate reg_rot and reg_trans weights, appended as pseudo-residuals
    • Stats dict includes: termination_message, nfev, optimality, cost
    • Zero-residual case returns initial pose with reason: "no_valid_depth_points"
    • uv run pytest tests/test_depth_refine.py -q → all 4 existing tests pass
    • New test: synthetic data with 30% outlier depths → robust optimizer converges (success=True, nfev > 1) with lower median residual than would occur with pure MSE

    Agent-Executed QA Scenarios:

    Scenario: All existing depth_refine tests pass after rewrite
      Tool: Bash (uv run pytest)
      Preconditions: Task 1 completed, aruco/depth_refine.py rewritten
      Steps:
        1. Run: uv run pytest tests/test_depth_refine.py -v
        2. Assert: exit code 0
        3. Assert: output contains "4 passed"
      Expected Result: All 4 existing tests pass
      Evidence: Terminal output captured
    
    Scenario: Robust optimizer handles outliers better than MSE
      Tool: Bash (uv run pytest)
      Preconditions: New test added
      Steps:
        1. Run: uv run pytest tests/test_depth_refine.py::test_robust_loss_handles_outliers -v
        2. Assert: exit code 0
        3. Assert: test passes
      Expected Result: With 30% outliers, robust optimizer has lower median abs residual
      Evidence: Terminal output captured
    

    Commit: YES

    • Message: feat(refine): replace L-BFGS-B MSE with least_squares soft-L1 robust optimizer
    • Files: aruco/depth_refine.py, tests/test_depth_refine.py
    • Pre-commit: uv run pytest tests/test_depth_refine.py -q

  • 3. Confidence-Weighted Depth Residuals (P0)

    What to do:

    • Add confidence weight extraction helper to aruco/depth_verify.py: Create a function get_confidence_weight(confidence_map, u, v, confidence_thresh=50) -> float that returns a normalized weight in [0, 1]. ZED confidence is in [1, 100], where higher means LESS confident. Normalize as max(0, confidence_thresh - conf_value) / confidence_thresh, so values at or above the threshold map to zero weight, then clamp the result to [eps, 1.0] with eps=1e-6 so no residual is multiplied by exactly zero.
    • Update depth_residuals() in aruco/depth_refine.py: Accept optional confidence_map and confidence_thresh parameters. If confidence_map is provided, multiply each depth residual by sqrt(weight) before returning. This implements weighted least squares within the least_squares framework.
    • Update refine_extrinsics_with_depth signature: Add confidence_map=None, confidence_thresh=50 parameters. Pass through to depth_residuals().
    • Update calibrate_extrinsics.py: Pass confidence_map=frame.confidence_map and confidence_thresh=depth_confidence_threshold to refine_extrinsics_with_depth when confidence weighting is requested
    • Add --use-confidence-weights/--no-confidence-weights CLI flag (default: False for backward compatibility)
    • Log confidence statistics under --debug: After computing weights, log n_zero_weight, mean_weight, median_weight
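A sketch of the helper and the sqrt-weight application under the semantics above (the signature comes from the plan; clamping to eps rather than exactly zero is the convention assumed here):

```python
import numpy as np

EPS = 1e-6

def get_confidence_weight(confidence_map, u, v, confidence_thresh=50):
    """ZED confidence is in [1, 100], higher = LESS confident. Map to [EPS, 1]."""
    conf = float(confidence_map[v, u])                 # row index = v, column = u
    w = max(0.0, confidence_thresh - conf) / confidence_thresh
    return min(1.0, max(EPS, w))

conf_map = np.array([[1.0, 50.0],
                     [80.0, 25.0]])
w_good = get_confidence_weight(conf_map, 0, 0)   # conf=1  -> weight 0.98
w_bad = get_confidence_weight(conf_map, 1, 0)    # conf=50 -> clamped to EPS

# Weighted least squares inside least_squares: scale each residual by sqrt(w),
# so the squared residual is scaled by w.
residual = 0.03
weighted_residual = residual * np.sqrt(w_good)
```

Multiplying by sqrt(weight) rather than the weight itself is what makes the squared cost carry the weight linearly, which is the standard weighted-least-squares trick within the least_squares framework.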

    Must NOT do:

    • Do NOT change the verification logic in verify_extrinsics_with_depth (it already uses confidence correctly)
    • Do NOT change confidence semantics (higher ZED value = less confident)
    • Do NOT make confidence weighting the default behavior

    Recommended Agent Profile:

    • Category: quick
      • Reason: Adding parameters and weight multiplication — straightforward plumbing
    • Skills: []

    Parallelization:

    • Can Run In Parallel: NO (depends on Task 2)
    • Parallel Group: Wave 2 (after Task 2)
    • Blocks: Task 6
    • Blocked By: Task 2

    References:

    Pattern References:

    • aruco/depth_verify.py:82-96 — Existing confidence handling pattern (filtering, NOT weighting). Follow these semantics but produce a continuous weight instead of a binary skip
    • aruco/depth_verify.py:93-95 — ZED confidence semantics: "Higher confidence value means LESS confident... Range [1, 100], where 100 is typically occlusion/invalid"
    • aruco/depth_refine.py — Updated in Task 2 with depth_residuals() function. Add confidence_map parameter here
    • calibrate_extrinsics.py:136-148 — Current call site for refine_extrinsics_with_depth. Add confidence_map/thresh forwarding

    Test References:

    • tests/test_depth_verify.py:69-84 — Test pattern for compute_marker_corner_residuals. Follow for confidence weight test

    Acceptance Criteria:

    • get_confidence_weight() function exists in depth_verify.py
    • Confidence weighting is off by default (backward compatible)
    • --use-confidence-weights flag exists in CLI
    • Low-confidence points have lower influence on optimization (verified by test)
    • uv run pytest tests/ -q → all pass

    Agent-Executed QA Scenarios:

    Scenario: Confidence weighting reduces outlier influence
      Tool: Bash (uv run pytest)
      Steps:
        1. Run: uv run pytest tests/test_depth_refine.py::test_confidence_weighting -v
        2. Assert: exit code 0
      Expected Result: With low-confidence outlier points, weighted optimizer ignores them
      Evidence: Terminal output
    
    Scenario: CLI flag exists
      Tool: Bash
      Steps:
        1. Run: uv run python calibrate_extrinsics.py --help | grep -i confidence-weight
        2. Assert: output contains "--use-confidence-weights"
      Expected Result: Flag is available
      Evidence: Help text
    

    Commit: YES

    • Message: feat(refine): add confidence-weighted depth residuals with --use-confidence-weights flag
    • Files: aruco/depth_verify.py, aruco/depth_refine.py, calibrate_extrinsics.py, tests/test_depth_refine.py
    • Pre-commit: uv run pytest tests/ -q

  • 4. Best-Frame Selection (P1)

    What to do:

    • Create score_frame_quality() function in calibrate_extrinsics.py (or a new aruco/frame_scoring.py if cleaner). The function takes: n_markers: int, reproj_error: float, depth_map: np.ndarray, marker_corners_world: Dict[int, np.ndarray], T_world_cam: np.ndarray, K: np.ndarray and returns a float score (higher = better).
    • Scoring formula: score = w_markers * n_markers + w_reproj * (1 / (reproj_error + eps)) + w_depth * valid_depth_ratio
      • w_markers = 1.0 — more markers = better constraint
      • w_reproj = 5.0 — lower reprojection error = more accurate PnP
      • w_depth = 3.0 — higher ratio of valid depth at marker locations = better depth signal
      • valid_depth_ratio = n_valid_depths / n_total_corners
      • eps = 1e-6 to avoid division by zero
    • Replace "last valid frame" logic in calibrate_extrinsics.py: Instead of overwriting verification_frames[serial] every time (line 467-471), track ALL valid frames per camera with their scores. After the processing loop, select the frame with the highest score.
    • Log selected frame: Under --debug, log the chosen frame index, score, and component breakdown for each camera
    • Ensure deterministic tiebreaking: If scores are equal, pick the frame with the lower frame_index (earliest)
    • Keep frame storage bounded: Store at most max_stored_frames=10 candidates per camera (configurable), keeping the top-scoring ones
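The scoring formula and deterministic tiebreak above can be sketched as follows (valid_depth_ratio is passed in directly here for brevity; the real function computes it from the depth map and marker corners):

```python
def score_frame_quality(n_markers, reproj_error, valid_depth_ratio,
                        w_markers=1.0, w_reproj=5.0, w_depth=3.0, eps=1e-6):
    """Higher is better: more markers, lower reprojection error, fuller depth."""
    return (w_markers * n_markers
            + w_reproj / (reproj_error + eps)
            + w_depth * valid_depth_ratio)

# Candidates as (frame_index, score); ties break toward the LOWER frame index.
candidates = [
    (0, score_frame_quality(2, 0.8, 0.5)),
    (1, score_frame_quality(4, 0.4, 0.9)),
]
best = max(candidates, key=lambda c: (c[1], -c[0]))
# best is frame 1: more markers, lower reprojection error, better depth coverage
```

The key (score, -frame_index) implements the deterministic tiebreak without a second pass: equal scores resolve to the earliest frame.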

    Must NOT do:

    • Do NOT add ML-based frame scoring
    • Do NOT change the frame grabbing/syncing logic
    • Do NOT add new dependencies

    Recommended Agent Profile:

    • Category: unspecified-low
      • Reason: New functionality but straightforward heuristic
    • Skills: []

    Parallelization:

    • Can Run In Parallel: YES
    • Parallel Group: Wave 1 (with Task 1)
    • Blocks: Task 6
    • Blocked By: None

    References:

    Pattern References:

    • calibrate_extrinsics.py:463-471 — Current "last valid frame" logic to REPLACE. Currently: verification_frames[serial] = {"frame": frame, "ids": ids, "corners": corners}
    • calibrate_extrinsics.py:452-478 — Full frame processing context (pose estimation, accumulation, frame caching)
    • aruco/depth_verify.py:27-67 — compute_depth_residual can be used to check valid depth at marker locations for scoring

    Test References:

    • tests/test_depth_cli_postprocess.py — Test pattern for calibrate_extrinsics functions

    Acceptance Criteria:

    • score_frame_quality() function exists and returns a float
    • Best frame is selected (not last frame) for each camera
    • Scoring is deterministic (same inputs → same selected frame)
    • Frame selection metadata is logged under --debug
    • uv run pytest tests/ -q → all pass (no regressions)

    Agent-Executed QA Scenarios:

    Scenario: Frame scoring is deterministic
      Tool: Bash (uv run pytest)
      Steps:
        1. Run: uv run pytest tests/test_frame_scoring.py -v
        2. Assert: exit code 0
      Expected Result: Same inputs always produce same score and selection
      Evidence: Terminal output
    
    Scenario: Higher marker count increases score
      Tool: Bash (uv run pytest)
      Steps:
        1. Run: uv run pytest tests/test_frame_scoring.py::test_more_markers_higher_score -v
        2. Assert: exit code 0
      Expected Result: Frame with more markers scores higher
      Evidence: Terminal output
    

    Commit: YES

    • Message: feat(calibrate): replace naive frame selection with quality-scored best-frame
    • Files: calibrate_extrinsics.py, tests/test_frame_scoring.py
    • Pre-commit: uv run pytest tests/ -q

  • 5. Diagnostics and Acceptance Gates (P1)

    What to do:

    • Enrich refine_extrinsics_with_depth stats dict: The least_squares result (from Task 2) already provides .status, .message, .nfev, .njev, .optimality, .active_mask. Surface these in the returned stats dict as: termination_status (int), termination_message (str), nfev (int), njev (int), optimality (float), n_active_bounds (int, count of parameters at bound limits).
    • Add effective valid points count: Log how many marker corners had valid (finite, positive) depth, and how many were used after confidence filtering. Add to stats: n_depth_valid, n_confidence_filtered.
    • Add RMSE improvement gate: If improvement_rmse < 1e-4 AND nfev > 5, log WARNING: "Refinement converged with negligible improvement — consider checking depth data quality"
    • Add failure diagnostic: If success == False or nfev <= 1, log WARNING with termination message and suggest checking depth unit consistency
    • Log optimizer progress under --debug: Before and after optimization, log: initial cost, final cost, delta_rotation, delta_translation, termination message, number of function evaluations
    • Surface diagnostics in JSON output: Add fields to refine_depth dict in output JSON: termination_status, termination_message, nfev, n_valid_points, loss_function, f_scale
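Surfacing the OptimizeResult fields might look like this (a toy bound-constrained problem stands in for the actual refinement; the stats keys match the plan):

```python
import numpy as np
from scipy.optimize import least_squares

# Toy problem: the unconstrained optimum of the second parameter (-2.0)
# lies outside its lower bound (-1.0), so one bound becomes active.
result = least_squares(
    lambda p: np.array([p[0] - 1.0, p[1] + 2.0]),
    x0=np.zeros(2), method="trf",
    bounds=(np.array([-5.0, -1.0]), np.array([5.0, 5.0])),
)

stats = {
    "termination_status": int(result.status),
    "termination_message": result.message,
    "nfev": int(result.nfev),
    "njev": int(result.njev or 0),
    "optimality": float(result.optimality),
    "n_active_bounds": int(np.count_nonzero(result.active_mask)),
}
if not result.success or result.nfev <= 1:
    print("WARNING: refinement failed; check depth unit consistency")
```

active_mask marks parameters pinned at a bound (-1 for lower, +1 for upper), so counting its nonzero entries gives n_active_bounds directly.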

    Must NOT do:

    • Do NOT add automated "redo with different params" logic
    • Do NOT add email/notification alerts
    • Do NOT change the optimization algorithm or parameters (already done in Task 2)

    Recommended Agent Profile:

    • Category: quick
      • Reason: Adding logging and dict fields — no algorithmic changes
    • Skills: []

    Parallelization:

    • Can Run In Parallel: YES (with Task 3)
    • Parallel Group: Wave 2
    • Blocks: Task 6
    • Blocked By: Task 2

    References:

    Pattern References:

    • aruco/depth_refine.py:103-111 — Current stats dict construction (to EXTEND, not replace)
    • calibrate_extrinsics.py:159-181 — Current refinement result logging and JSON field assignment
    • loguru.logger — Project uses loguru for structured logging

    API/Type References:

    • scipy.optimize.OptimizeResult — .status (int: -1=improper input, 0=max_nfev reached, 1-4=converged via gtol/ftol/xtol), .message (str), .nfev, .njev, .optimality (infinity norm of the scaled gradient)

    Acceptance Criteria:

    • Stats dict contains: termination_status, termination_message, nfev, n_valid_points
    • Output JSON refine_depth section contains diagnostic fields
    • WARNING log emitted when improvement < 1e-4 with nfev > 5
    • WARNING log emitted when success=False or nfev <= 1
    • uv run pytest tests/ -q → all pass

    Agent-Executed QA Scenarios:

    Scenario: Diagnostics present in refine stats
      Tool: Bash (uv run pytest)
      Steps:
        1. Run: uv run pytest tests/test_depth_refine.py -v
        2. Assert: All tests pass
        3. Check that stats dict from refine function contains "termination_message" key
      Expected Result: Diagnostics are in stats output
      Evidence: Terminal output
    

    Commit: YES

    • Message: feat(refine): add rich optimizer diagnostics and acceptance gates
    • Files: aruco/depth_refine.py, calibrate_extrinsics.py, tests/test_depth_refine.py
    • Pre-commit: uv run pytest tests/ -q

  • 6. Benchmark Matrix (P1)

    What to do:

    • Add --benchmark-matrix flag to calibrate_extrinsics.py CLI
    • When enabled, run the depth refinement pipeline 4 times per camera with different configurations:
      1. baseline: loss="linear" (no robust loss), no confidence weights
      2. robust: loss="soft_l1", f_scale=0.1, no confidence weights
      3. robust+confidence: loss="soft_l1", f_scale=0.1, confidence weighting ON
      4. robust+confidence+best-frame: Same as #3 but using best-frame selection
    • Output: For each configuration, report per-camera: pre-refinement RMSE, post-refinement RMSE, improvement, iteration count, success/failure, termination reason
    • Format: Print a formatted table to stdout (using click.echo) AND save to a benchmark section in the output JSON
    • Implementation: Create a helper function run_benchmark_matrix(T_initial, marker_corners_world, depth_map, K, confidence_map, ...) that returns a list of result dicts
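The orchestration could be sketched as follows (run_benchmark_matrix comes from the plan; refine_fn stands in for refine_extrinsics_with_depth, and the config keys are assumptions about its eventual signature):

```python
def run_benchmark_matrix(refine_fn):
    """Run the four configurations and collect one result record per config."""
    configs = [
        {"name": "baseline", "loss": "linear", "use_confidence": False},
        {"name": "robust", "loss": "soft_l1", "f_scale": 0.1,
         "use_confidence": False},
        {"name": "robust+confidence", "loss": "soft_l1", "f_scale": 0.1,
         "use_confidence": True},
        {"name": "robust+confidence+best-frame", "loss": "soft_l1",
         "f_scale": 0.1, "use_confidence": True, "best_frame": True},
    ]
    results = []
    for cfg in configs:
        stats = refine_fn(**{k: v for k, v in cfg.items() if k != "name"})
        results.append({"config": cfg["name"], **stats})
    return results

# Stub refine_fn for illustration only; the real one returns the stats dict
# from Task 5 (RMSE before/after, nfev, termination reason, ...).
records = run_benchmark_matrix(lambda **kw: {"rmse_post": 0.01, "success": True})
```

Each record then feeds both the stdout table (via click.echo) and the benchmark section of the output JSON.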

    Must NOT do:

    • Do NOT implement automated configuration tuning
    • Do NOT add visualization/plotting dependencies
    • Do NOT change the default (non-benchmark) codepath behavior

    Recommended Agent Profile:

    • Category: unspecified-low
      • Reason: Orchestration code, calling existing functions with different params
    • Skills: []

    Parallelization:

    • Can Run In Parallel: NO (depends on all previous tasks)
    • Parallel Group: Wave 3 (after all)
    • Blocks: Task 7
    • Blocked By: Tasks 2, 3, 4, 5

    References:

    Pattern References:

    • calibrate_extrinsics.py:73-196 — apply_depth_verify_refine_postprocess function. The benchmark matrix calls this logic with varied parameters
    • aruco/depth_refine.py — Updated refine_extrinsics_with_depth with loss, f_scale, confidence_map params

    Acceptance Criteria:

    • --benchmark-matrix flag exists in CLI
    • When enabled, 4 configurations are run per camera
    • Output table is printed to stdout
    • Benchmark results are in output JSON under benchmark key
    • uv run pytest tests/ -q → all pass

    Agent-Executed QA Scenarios:

    Scenario: Benchmark flag in CLI help
      Tool: Bash
      Steps:
        1. Run: uv run python calibrate_extrinsics.py --help | grep benchmark
        2. Assert: output contains "--benchmark-matrix"
      Expected Result: Flag is present
      Evidence: Help text output
    

    Commit: YES

    • Message: feat(calibrate): add --benchmark-matrix for comparing refinement configurations
    • Files: calibrate_extrinsics.py, tests/test_benchmark.py
    • Pre-commit: uv run pytest tests/ -q

  • 7. Documentation Update

    What to do:

    • Update docs/calibrate-extrinsics-workflow.md:
      • Add new CLI flags: --use-confidence-weights, --benchmark-matrix
      • Update "Depth Verification & Refinement" section with new optimizer details
      • Update "Refinement" section: document least_squares with soft_l1 loss, f_scale, confidence weighting
      • Add "Best-Frame Selection" section explaining the scoring formula
      • Add "Diagnostics" section documenting new output JSON fields
      • Update "Example Workflow" commands to show new flags
      • Mark the "Known Unexpected Behavior" unit mismatch section as RESOLVED with the fix description

    Must NOT do:

    • Do NOT rewrite unrelated documentation sections
    • Do NOT add tutorial-style content

    Recommended Agent Profile:

    • Category: writing
      • Reason: Pure documentation writing
    • Skills: []

    Parallelization:

    • Can Run In Parallel: NO
    • Parallel Group: Wave 4 (final)
    • Blocks: None
    • Blocked By: All previous tasks

    References:

    Pattern References:

    • docs/calibrate-extrinsics-workflow.md — Entire file. Follow existing section structure and formatting

    Acceptance Criteria:

    • New CLI flags documented
    • least_squares optimizer documented with parameter explanations
    • Best-frame selection documented
    • Unit mismatch section updated as resolved
    • Example commands include new flags

    Commit: YES

    • Message: docs: update calibrate-extrinsics-workflow for robust refinement changes
    • Files: docs/calibrate-extrinsics-workflow.md
    • Pre-commit: uv run pytest tests/ -q

Commit Strategy

| After Task | Message | Files | Verification |
| --- | --- | --- | --- |
| 1 | fix(svo): harden depth units — set coordinate_units=METER, guard /1000 conversion | aruco/svo_sync.py, tests | uv run pytest tests/ -q |
| 2 | feat(refine): replace L-BFGS-B MSE with least_squares soft-L1 robust optimizer | aruco/depth_refine.py, tests | uv run pytest tests/ -q |
| 3 | feat(refine): add confidence-weighted depth residuals with --use-confidence-weights flag | aruco/depth_verify.py, aruco/depth_refine.py, calibrate_extrinsics.py, tests | uv run pytest tests/ -q |
| 4 | feat(calibrate): replace naive frame selection with quality-scored best-frame | calibrate_extrinsics.py, tests | uv run pytest tests/ -q |
| 5 | feat(refine): add rich optimizer diagnostics and acceptance gates | aruco/depth_refine.py, calibrate_extrinsics.py, tests | uv run pytest tests/ -q |
| 6 | feat(calibrate): add --benchmark-matrix for comparing refinement configurations | calibrate_extrinsics.py, tests | uv run pytest tests/ -q |
| 7 | docs: update calibrate-extrinsics-workflow for robust refinement changes | docs/calibrate-extrinsics-workflow.md | uv run pytest tests/ -q |

Success Criteria

Verification Commands

uv run pytest tests/ -q                    # Expected: all pass, 0 failures
uv run pytest tests/test_depth_refine.py -v  # Expected: all tests pass including new robust/confidence tests

Final Checklist

  • All "Must Have" items present
  • All "Must NOT Have" items absent
  • All tests pass (uv run pytest tests/ -q)
  • Output JSON backward compatible (existing fields preserved, new fields additive)
  • Default CLI behavior unchanged (new features opt-in)
  • Optimizer actually converges on synthetic test data (success=True, nfev > 1)