zed-playground/py_workspace/.sisyphus/plans/finished/depth-refinement-robust.md

Robust Depth Refinement for Camera Extrinsics

TL;DR

Quick Summary: Replace the failing depth-based pose refinement pipeline with a robust optimizer (scipy.optimize.least_squares with soft-L1 loss), add unit hardening, confidence-weighted residuals, best-frame selection, rich diagnostics, and a benchmark matrix comparing configurations.

Deliverables:

  • Unit-hardened depth retrieval (set coordinate_units=METER, guard double-conversion)
  • Robust optimization objective using least_squares(method="trf", loss="soft_l1", f_scale=0.1)
  • Confidence-weighted depth residuals (toggleable via CLI flag)
  • Best-frame selection replacing naive "latest valid frame"
  • Rich optimizer diagnostics and acceptance gates
  • Benchmark matrix comparing baseline/robust/+confidence/+best-frame
  • Updated tests for all new functionality

Estimated Effort: Medium (3-4 hours implementation)
Parallel Execution: YES (4 waves, 2 of them with parallelizable tasks)
Critical Path: Task 1 (units) → Task 2 (robust optimizer) → Task 3 (confidence) → Task 5 (diagnostics) → Task 6 (benchmark)


Context

Original Request

Implement the 5 items from "Recommended Implementation Order" in docs/calibrate-extrinsics-workflow.md, plus research and choose the best optimization method for depth-based camera extrinsic refinement.

Interview Summary

Key Discussions:

  • Requirements were explicitly specified in the documentation (no interactive interview needed)
  • Research confirmed scipy.optimize.least_squares is superior to scipy.optimize.minimize for this problem class

Research Findings:

  • freemocap/anipose (production multi-camera calibration) uses exactly least_squares(method="trf", loss=loss, f_scale=threshold) for bundle adjustment — validates our approach
  • scipy docs recommend soft_l1 or huber for robust fitting; f_scale controls the inlier/outlier threshold
  • Current output JSONs confirm catastrophic failure: RMSE 5000+ meters (aligned_refined_extrinsics_fast.json), RMSE ~11.6m (test_refine_current.json), iterations=0/1, success=false across all cameras
  • Unit mismatch still active despite /1000.0 conversion — ZED defaults to mm, code divides by 1000, but no coordinate_units=METER set
  • Confidence map retrieved but only used in verify filtering, not in optimizer objective
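The soft_l1 finding above can be illustrated with a toy fit on synthetic data (illustration only, not the project's residual function): with a fraction of gross outliers, plain least squares is dragged far from the true value while the robust loss stays close.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
z_true = 2.0                               # true depth offset in meters
z = z_true + rng.normal(0, 0.02, 50)       # inliers: ~2 cm noise
z[:10] += 5.0                              # 20% gross outliers (e.g. a unit mix-up)

residuals = lambda p: z - p[0]             # fit a single offset

fit_linear = least_squares(residuals, x0=[0.0], loss="linear")
fit_robust = least_squares(residuals, x0=[0.0], loss="soft_l1", f_scale=0.1)
# fit_linear.x[0] is pulled toward the outliers; fit_robust.x[0] stays near 2.0
```

The f_scale=0.1 threshold is what makes residuals beyond ~0.1 m count roughly linearly instead of quadratically, which is the downweighting behavior the research findings refer to.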

Metis Review

Identified Gaps (addressed):

  • Output JSON schema backward compatibility → New fields are additive only (existing fields preserved)
  • Confidence weighting can interact with robust loss → Made toggleable, logged statistics
  • Best-frame selection changes behavior → Deterministic scoring, old behavior available as fallback
  • Zero valid points edge case → Explicit early exit with diagnostic
  • Numerical pass/fail gate → Added RMSE threshold checks
  • Regression guard → Default CLI behavior unchanged unless user opts into new features

Work Objectives

Core Objective

Make depth-based extrinsic refinement actually work by fixing the unit mismatch, switching to a robust optimizer, incorporating confidence weighting, and selecting the best frame for refinement.

Concrete Deliverables

  • Modified aruco/svo_sync.py with unit hardening
  • Rewritten aruco/depth_refine.py using least_squares with robust loss
  • Updated aruco/depth_verify.py with confidence weight extraction helper
  • Updated calibrate_extrinsics.py with frame scoring, diagnostics, new CLI flags
  • New and updated tests in tests/
  • Updated docs/calibrate-extrinsics-workflow.md with new behavior docs

Definition of Done

  • uv run pytest passes with 0 failures
  • Synthetic test: robust optimizer converges (success=True, nfev > 1) with injected outliers
  • Existing tests still pass (backward compatibility)
  • Benchmark matrix produces 4 comparable result records

Must Have

  • coordinate_units = sl.UNIT.METER set in SVOReader
  • least_squares with loss="soft_l1" and f_scale=0.1 as default optimizer
  • Confidence weighting via --use-confidence-weights flag
  • Best-frame selection with deterministic scoring
  • Optimizer diagnostics in output JSON and logs
  • All changes covered by automated tests

Must NOT Have (Guardrails)

  • Must NOT change unrelated calibration logic (marker detection, PnP, pose averaging, alignment)
  • Must NOT change file I/O formats or break JSON schema (only additive fields)
  • Must NOT introduce new dependencies beyond scipy/numpy already in use
  • Must NOT implement multi-optimizer auto-selection or hyperparameter search
  • Must NOT turn frame scoring into a ML quality model — simple weighted heuristic only
  • Must NOT add premature abstractions or over-engineer the API
  • Must NOT remove existing CLI flags or change their default behavior

Verification Strategy

UNIVERSAL RULE: ZERO HUMAN INTERVENTION

ALL tasks in this plan MUST be verifiable WITHOUT any human action. Every criterion is verified by running uv run pytest or inspecting code.

Test Decision

  • Infrastructure exists: YES (pytest configured in pyproject.toml, tests/ directory)
  • Automated tests: YES (tests-after, matching existing project pattern)
  • Framework: pytest (via uv run pytest)

Agent-Executed QA Scenarios (MANDATORY — ALL tasks)

Verification Tool by Deliverable Type:

| Type | Tool | How Agent Verifies |
| --- | --- | --- |
| Python module changes | Bash (uv run pytest) | Run tests, assert 0 failures |
| New functions | Bash (uv run pytest -k test_name) | Run specific test, assert pass |
| CLI behavior | Bash (uv run python calibrate_extrinsics.py --help) | Verify new flags present |

Execution Strategy

Parallel Execution Waves

Wave 1 (Start Immediately):
├── Task 1: Unit hardening (svo_sync.py) [no dependencies]
└── Task 4: Best-frame selection (calibrate_extrinsics.py) [no dependencies]

Wave 2 (After Wave 1):
├── Task 2: Robust optimizer (depth_refine.py) [depends: 1]
├── Task 3: Confidence weighting (depth_verify.py + depth_refine.py) [depends: 2]
└── Task 5: Diagnostics and acceptance gates [depends: 2]

Wave 3 (After Wave 2):
└── Task 6: Benchmark matrix [depends: 2, 3, 4, 5]

Wave 4 (After All):
└── Task 7: Documentation update [depends: all]

Critical Path: Task 1 → Task 2 → Task 3 → Task 5 → Task 6

Dependency Matrix

| Task | Depends On | Blocks | Can Parallelize With |
| --- | --- | --- | --- |
| 1 | None | 2, 3 | 4 |
| 2 | 1 | 3, 5, 6 | - |
| 3 | 2 | 6 | 5 |
| 4 | None | 6 | 1 |
| 5 | 2 | 6 | 3 |
| 6 | 2, 3, 4, 5 | 7 | - |
| 7 | All | None | - |

Agent Dispatch Summary

| Wave | Tasks | Recommended Agents |
| --- | --- | --- |
| 1 | 1, 4 | category="quick" for T1; category="unspecified-low" for T4 |
| 2 | 2, 3, 5 | category="deep" for T2; category="quick" for T3, T5 |
| 3 | 6 | category="unspecified-low" |
| 4 | 7 | category="writing" |

TODOs

  • 1. Unit Hardening (P0)

    What to do:

    • In aruco/svo_sync.py, add init_params.coordinate_units = sl.UNIT.METER in the SVOReader.__init__ method, right after init_params.set_from_svo_file(path) (around line 42)
    • Guard the existing /1000.0 conversion: check whether coordinate_units is already METER. If METER is set, skip the division. If not set or MILLIMETER, apply the division. Add a log warning if division is applied as fallback
    • Add depth sanity logging under --debug mode: after retrieving depth, log min/median/max/p95 of valid depth values. This goes in the _retrieve_depth method
    • Write a test that verifies the unit-hardened path doesn't double-convert
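The guard described above can be sketched as a pure function (normalize_depth and units_are_meters are hypothetical names for illustration; the real code would check init_params.coordinate_units against sl.UNIT.METER inside _retrieve_depth and log a warning on the fallback path):

```python
import numpy as np

def normalize_depth(depth: np.ndarray, units_are_meters: bool) -> np.ndarray:
    """Return depth in meters, dividing by 1000 only when units are NOT meters."""
    if units_are_meters:
        return depth                  # already meters: skip the legacy /1000
    # Fallback: assume millimeters (the real code logs a warning here).
    return depth / 1000.0

# Same physical depths expressed in each unit normalize to the same values.
depth_m = np.array([0.5, 1.2, 3.0])
depth_mm = np.array([500.0, 1200.0, 3000.0])
assert np.allclose(normalize_depth(depth_m, True), normalize_depth(depth_mm, False))
```

The test for this task can follow the same shape: feed meter-valued and millimeter-valued synthetic maps through the guarded path and assert they agree, proving no double-conversion occurs.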

    Must NOT do:

    • Do NOT change depth retrieval for confidence maps
    • Do NOT modify the grab_synced() or grab_all() methods
    • Do NOT add new CLI parameters for this task

    Recommended Agent Profile:

    • Category: quick
      • Reason: Small, focused change in one file + one test file
    • Skills: [git-master]
      • git-master: Atomic commit of unit hardening change

    Parallelization:

    • Can Run In Parallel: YES
    • Parallel Group: Wave 1 (with Task 4)
    • Blocks: Tasks 2, 3
    • Blocked By: None

    References:

    Pattern References (existing code to follow):

    • aruco/svo_sync.py:40-44 — Current init_params setup where coordinate_units must be added
    • aruco/svo_sync.py:180-189 — Current _retrieve_depth method with /1000.0 conversion to modify
    • aruco/svo_sync.py:191-196 — Confidence retrieval pattern (do NOT modify, but understand adjacency)

    API/Type References (contracts to implement against):

    • ZED SDK InitParameters.coordinate_units — Set to sl.UNIT.METER
    • loguru.logger — Used project-wide for debug logging

    Test References (testing patterns to follow):

    • tests/test_depth_verify.py:36-66 — Test pattern using synthetic depth maps (follow this style)
    • tests/test_depth_refine.py:21-39 — Test pattern with synthetic K matrix and depth maps

    Documentation References:

    • docs/calibrate-extrinsics-workflow.md:116-132 — Documents the unit mismatch problem and mitigation strategy
    • docs/calibrate-extrinsics-workflow.md:166-169 — Specifies the exact implementation steps for unit hardening

    Acceptance Criteria:

    • init_params.coordinate_units = sl.UNIT.METER is set in SVOReader.__init__ before cam.open()
    • The /1000.0 division in _retrieve_depth is guarded (only applied if units are NOT meters)
    • Debug logging of depth statistics (min/median/max) is added to _retrieve_depth when depth mode is active
    • uv run pytest tests/test_depth_refine.py tests/test_depth_verify.py -q → all pass (no regressions)

    Agent-Executed QA Scenarios:

    Scenario: Verify unit hardening doesn't break existing tests
      Tool: Bash (uv run pytest)
      Preconditions: All dependencies installed
      Steps:
        1. Run: uv run pytest tests/test_depth_refine.py tests/test_depth_verify.py -q
        2. Assert: exit code 0
        3. Assert: output contains "passed" and no "FAILED"
      Expected Result: All existing tests pass
      Evidence: Terminal output captured
    
    Scenario: Verify coordinate_units is set in code
      Tool: Bash (grep)
      Preconditions: File modified
      Steps:
        1. Run: grep -n "coordinate_units" aruco/svo_sync.py
        2. Assert: output contains "UNIT.METER" or "METER"
      Expected Result: Unit setting is present
      Evidence: Grep output
    

    Commit: YES

    • Message: fix(svo): harden depth units — set coordinate_units=METER, guard /1000 conversion
    • Files: aruco/svo_sync.py, tests/test_depth_refine.py
    • Pre-commit: uv run pytest tests/ -q

  • 2. Robust Optimizer — Replace MSE with least_squares + Soft-L1 Loss (P0)

    What to do:

    • Rewrite depth_residual_objective → Replace with a residual vector function depth_residuals(params, ...) that returns an array of residuals (not a scalar cost). Each element is (z_measured - z_predicted) for one marker corner. This is what least_squares expects.
    • Add regularization as pseudo-residuals: Append [reg_weight_rot * delta_rvec, reg_weight_trans * delta_tvec] to the residual vector. This naturally penalizes deviation from the initial pose. Split into separate rotation and translation regularization weights (default: reg_rot=0.1, reg_trans=1.0 — translation more tightly regularized in meters scale).
    • Replace minimize(method="L-BFGS-B") with least_squares(method="trf", loss="soft_l1", f_scale=0.1):
      • method="trf" — Trust Region Reflective, handles bounds naturally
      • loss="soft_l1" — Smooth robust loss, downweights outliers beyond f_scale
      • f_scale=0.1 — Residuals >0.1m are treated as outliers (matches ZED depth noise ~1-5cm)
      • bounds — Same ±5°/±5cm bounds, expressed as (lower_bounds_array, upper_bounds_array) tuple
      • x_scale="jac" — Automatic Jacobian-based scaling (prevents ill-conditioning)
      • max_nfev=200 — Maximum function evaluations
    • Update refine_extrinsics_with_depth signature: Add parameters for loss, f_scale, reg_rot, reg_trans. Keep backward-compatible defaults. Return enriched stats dict including: termination_message, nfev, optimality, active_mask, cost.
    • Handle zero residuals: If residual vector is empty (no valid depth points), return initial pose unchanged with stats indicating "reason": "no_valid_depth_points".
    • Maintain backward-compatible scalar cost reporting: Compute initial_cost and final_cost from the residual vector for comparison with old output format.
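The residual-vector construction above can be sketched as follows. The depth_residuals name matches the plan, but the toy z_predicted_fn and single-offset setup are illustrative assumptions, not the real per-corner geometry (which goes through compute_depth_residual):

```python
import numpy as np
from scipy.optimize import least_squares

def depth_residuals(params, z_measured, z_predicted_fn, params0,
                    reg_rot=0.1, reg_trans=1.0):
    """Per-corner depth residuals plus regularization pseudo-residuals."""
    res = z_measured - z_predicted_fn(params)         # one entry per corner
    delta = params - params0                          # deviation from initial pose
    reg = np.concatenate([reg_rot * delta[:3], reg_trans * delta[3:]])
    return np.concatenate([res, reg])

# Toy setup: only the z-translation component shifts predicted depth.
params0 = np.zeros(6)                                 # [rvec (3), tvec (3)]
base = np.full(20, 1.5)
z_meas = base + 0.02                                  # true offset: 2 cm
pred = lambda p: base + p[5]

deg5, cm5 = np.deg2rad(5.0), 0.05                     # same ±5°/±5cm bounds
lb = np.array([-deg5] * 3 + [-cm5] * 3)
result = least_squares(
    depth_residuals, params0, args=(z_meas, pred, params0),
    method="trf", loss="soft_l1", f_scale=0.1,
    bounds=(lb, -lb), x_scale="jac", max_nfev=200,
)
# result.x[5] lands near 0.02, pulled slightly toward 0 by reg_trans
```

Note how regularization enters as extra residual entries rather than a penalty term added to a scalar cost; this keeps the problem in the form least_squares expects and lets the robust loss apply only where appropriate.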

    Must NOT do:

    • Do NOT change extrinsics_to_params or params_to_extrinsics (the Rodrigues parameterization is correct)
    • Do NOT modify depth_verify.py in this task
    • Do NOT add confidence weighting here (that's Task 3)
    • Do NOT add CLI flags here (that's Task 5)

    Recommended Agent Profile:

    • Category: deep
      • Reason: Core algorithmic change, requires understanding of optimization theory and careful residual construction
    • Skills: []
      • No specialized skills needed — pure Python/numpy/scipy work

    Parallelization:

    • Can Run In Parallel: NO
    • Parallel Group: Wave 2 (sequential after Wave 1)
    • Blocks: Tasks 3, 5, 6
    • Blocked By: Task 1

    References:

    Pattern References (existing code to follow):

    • aruco/depth_refine.py:19-47 — Current depth_residual_objective function to REPLACE
    • aruco/depth_refine.py:50-112 — Current refine_extrinsics_with_depth function to REWRITE
    • aruco/depth_refine.py:1-16 — Import block and helper functions (keep extrinsics_to_params, params_to_extrinsics)
    • aruco/depth_verify.py:27-67 — compute_depth_residual function — this is the per-point residual computation called from the objective. Understand its contract: returns float(z_measured - z_predicted) or None.

    API/Type References:

    • scipy.optimize.least_squares — scipy docs: fun(x, *args) -> residuals_array; parameters: method="trf", loss="soft_l1", f_scale=0.1, bounds=(lb, ub), x_scale="jac", max_nfev=200
    • Return type: OptimizeResult with attributes: .x, .cost, .fun, .jac, .grad, .optimality, .active_mask, .nfev, .njev, .status, .message, .success

    External References (production examples):

    • freemocap/anipose bundle_adjust method — Uses least_squares(error_fun, x0, jac_sparsity=jac_sparse, f_scale=f_scale, x_scale="jac", loss=loss, ftol=ftol, method="trf", tr_solver="lsmr") for multi-camera calibration. Key pattern: residual function returns per-point reprojection errors.
    • scipy Context7 docs — Example shows least_squares(fun, x0, loss='soft_l1', f_scale=0.1, args=(t_train, y_train)) where fun returns residual vector

    Test References:

    • tests/test_depth_refine.py — ALL 4 existing tests must still pass. They test: roundtrip, no-change convergence, offset correction, and bounds respect. The new optimizer must satisfy these same properties.

    Acceptance Criteria:

    • from scipy.optimize import least_squares replaces from scipy.optimize import minimize
    • depth_residuals() returns np.ndarray (vector), not scalar float
    • least_squares(method="trf", loss="soft_l1", f_scale=0.1) is the optimizer call
    • Regularization is split: separate reg_rot and reg_trans weights, appended as pseudo-residuals
    • Stats dict includes: termination_message, nfev, optimality, cost
    • Zero-residual case returns initial pose with reason: "no_valid_depth_points"
    • uv run pytest tests/test_depth_refine.py -q → all 4 existing tests pass
    • New test: synthetic data with 30% outlier depths → robust optimizer converges (success=True, nfev > 1) with lower median residual than would occur with pure MSE

    Agent-Executed QA Scenarios:

    Scenario: All existing depth_refine tests pass after rewrite
      Tool: Bash (uv run pytest)
      Preconditions: Task 1 completed, aruco/depth_refine.py rewritten
      Steps:
        1. Run: uv run pytest tests/test_depth_refine.py -v
        2. Assert: exit code 0
        3. Assert: output contains "4 passed"
      Expected Result: All 4 existing tests pass
      Evidence: Terminal output captured
    
    Scenario: Robust optimizer handles outliers better than MSE
      Tool: Bash (uv run pytest)
      Preconditions: New test added
      Steps:
        1. Run: uv run pytest tests/test_depth_refine.py::test_robust_loss_handles_outliers -v
        2. Assert: exit code 0
        3. Assert: test passes
      Expected Result: With 30% outliers, robust optimizer has lower median abs residual
      Evidence: Terminal output captured
    

    Commit: YES

    • Message: feat(refine): replace L-BFGS-B MSE with least_squares soft-L1 robust optimizer
    • Files: aruco/depth_refine.py, tests/test_depth_refine.py
    • Pre-commit: uv run pytest tests/test_depth_refine.py -q

  • 3. Confidence-Weighted Depth Residuals (P0)

    What to do:

    • Add confidence weight extraction helper to aruco/depth_verify.py: Create a function get_confidence_weight(confidence_map, u, v, confidence_thresh=50) -> float that returns a normalized weight in [0, 1]. ZED confidence is in [1, 100], where higher means LESS confident. Normalize as max(0, confidence_thresh - conf_value) / confidence_thresh, so values at or above the threshold map to zero weight, then clamp the result to [eps, 1.0] with eps=1e-6 so no residual is multiplied by exactly zero.
    • Update depth_residuals() in aruco/depth_refine.py: Accept optional confidence_map and confidence_thresh parameters. If confidence_map is provided, multiply each depth residual by sqrt(weight) before returning. This implements weighted least squares within the least_squares framework.
    • Update refine_extrinsics_with_depth signature: Add confidence_map=None, confidence_thresh=50 parameters. Pass through to depth_residuals().
    • Update calibrate_extrinsics.py: Pass confidence_map=frame.confidence_map and confidence_thresh=depth_confidence_threshold to refine_extrinsics_with_depth when confidence weighting is requested
    • Add --use-confidence-weights/--no-confidence-weights CLI flag (default: False for backward compatibility)
    • Log confidence statistics under --debug: After computing weights, log n_zero_weight, mean_weight, median_weight
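A sketch of the helper and the sqrt-weight application under the semantics above (the signature comes from the plan; clamping to eps rather than exactly zero is the convention assumed here):

```python
import numpy as np

EPS = 1e-6

def get_confidence_weight(confidence_map, u, v, confidence_thresh=50):
    """ZED confidence is in [1, 100], higher = LESS confident. Map to [EPS, 1]."""
    conf = float(confidence_map[v, u])                 # row index = v, column = u
    w = max(0.0, confidence_thresh - conf) / confidence_thresh
    return min(1.0, max(EPS, w))

conf_map = np.array([[1.0, 50.0],
                     [80.0, 25.0]])
w_good = get_confidence_weight(conf_map, 0, 0)   # conf=1  -> weight 0.98
w_bad = get_confidence_weight(conf_map, 1, 0)    # conf=50 -> clamped to EPS

# Weighted least squares inside least_squares: scale each residual by sqrt(w),
# so the squared residual is scaled by w.
residual = 0.03
weighted_residual = residual * np.sqrt(w_good)
```

Multiplying by sqrt(weight) rather than the weight itself is what makes the squared cost carry the weight linearly, which is the standard weighted-least-squares trick within the least_squares framework.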

    Must NOT do:

    • Do NOT change the verification logic in verify_extrinsics_with_depth (it already uses confidence correctly)
    • Do NOT change confidence semantics (higher ZED value = less confident)
    • Do NOT make confidence weighting the default behavior

    Recommended Agent Profile:

    • Category: quick
      • Reason: Adding parameters and weight multiplication — straightforward plumbing
    • Skills: []

    Parallelization:

    • Can Run In Parallel: NO (depends on Task 2)
    • Parallel Group: Wave 2 (after Task 2)
    • Blocks: Task 6
    • Blocked By: Task 2

    References:

    Pattern References:

    • aruco/depth_verify.py:82-96 — Existing confidence handling pattern (filtering, NOT weighting). Follow these semantics but produce a continuous weight instead of a binary skip
    • aruco/depth_verify.py:93-95 — ZED confidence semantics: "Higher confidence value means LESS confident... Range [1, 100], where 100 is typically occlusion/invalid"
    • aruco/depth_refine.py — Updated in Task 2 with depth_residuals() function. Add confidence_map parameter here
    • calibrate_extrinsics.py:136-148 — Current call site for refine_extrinsics_with_depth. Add confidence_map/thresh forwarding

    Test References:

    • tests/test_depth_verify.py:69-84 — Test pattern for compute_marker_corner_residuals. Follow for confidence weight test

    Acceptance Criteria:

    • get_confidence_weight() function exists in depth_verify.py
    • Confidence weighting is off by default (backward compatible)
    • --use-confidence-weights flag exists in CLI
    • Low-confidence points have lower influence on optimization (verified by test)
    • uv run pytest tests/ -q → all pass

    Agent-Executed QA Scenarios:

    Scenario: Confidence weighting reduces outlier influence
      Tool: Bash (uv run pytest)
      Steps:
        1. Run: uv run pytest tests/test_depth_refine.py::test_confidence_weighting -v
        2. Assert: exit code 0
      Expected Result: With low-confidence outlier points, weighted optimizer ignores them
      Evidence: Terminal output
    
    Scenario: CLI flag exists
      Tool: Bash
      Steps:
        1. Run: uv run python calibrate_extrinsics.py --help | grep -i confidence-weight
        2. Assert: output contains "--use-confidence-weights"
      Expected Result: Flag is available
      Evidence: Help text
    

    Commit: YES

    • Message: feat(refine): add confidence-weighted depth residuals with --use-confidence-weights flag
    • Files: aruco/depth_verify.py, aruco/depth_refine.py, calibrate_extrinsics.py, tests/test_depth_refine.py
    • Pre-commit: uv run pytest tests/ -q

  • 4. Best-Frame Selection (P1)

    What to do:

    • Create score_frame_quality() function in calibrate_extrinsics.py (or a new aruco/frame_scoring.py if cleaner). The function takes: n_markers: int, reproj_error: float, depth_map: np.ndarray, marker_corners_world: Dict[int, np.ndarray], T_world_cam: np.ndarray, K: np.ndarray and returns a float score (higher = better).
    • Scoring formula: score = w_markers * n_markers + w_reproj * (1 / (reproj_error + eps)) + w_depth * valid_depth_ratio
      • w_markers = 1.0 — more markers = better constraint
      • w_reproj = 5.0 — lower reprojection error = more accurate PnP
      • w_depth = 3.0 — higher ratio of valid depth at marker locations = better depth signal
      • valid_depth_ratio = n_valid_depths / n_total_corners
      • eps = 1e-6 to avoid division by zero
    • Replace "last valid frame" logic in calibrate_extrinsics.py: Instead of overwriting verification_frames[serial] every time (line 467-471), track ALL valid frames per camera with their scores. After the processing loop, select the frame with the highest score.
    • Log selected frame: Under --debug, log the chosen frame index, score, and component breakdown for each camera
    • Ensure deterministic tiebreaking: If scores are equal, pick the frame with the lower frame_index (earliest)
    • Keep frame storage bounded: Store at most max_stored_frames=10 candidates per camera (configurable), keeping the top-scoring ones
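The scoring formula and deterministic tiebreak above can be sketched as follows (valid_depth_ratio is passed in directly here for brevity; the real function computes it from the depth map and marker corners):

```python
def score_frame_quality(n_markers, reproj_error, valid_depth_ratio,
                        w_markers=1.0, w_reproj=5.0, w_depth=3.0, eps=1e-6):
    """Higher is better: more markers, lower reprojection error, fuller depth."""
    return (w_markers * n_markers
            + w_reproj / (reproj_error + eps)
            + w_depth * valid_depth_ratio)

# Candidates as (frame_index, score); ties break toward the LOWER frame index.
candidates = [
    (0, score_frame_quality(2, 0.8, 0.5)),
    (1, score_frame_quality(4, 0.4, 0.9)),
]
best = max(candidates, key=lambda c: (c[1], -c[0]))
# best is frame 1: more markers, lower reprojection error, better depth coverage
```

The key (score, -frame_index) implements the deterministic tiebreak without a second pass: equal scores resolve to the earliest frame.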

    Must NOT do:

    • Do NOT add ML-based frame scoring
    • Do NOT change the frame grabbing/syncing logic
    • Do NOT add new dependencies

    Recommended Agent Profile:

    • Category: unspecified-low
      • Reason: New functionality but straightforward heuristic
    • Skills: []

    Parallelization:

    • Can Run In Parallel: YES
    • Parallel Group: Wave 1 (with Task 1)
    • Blocks: Task 6
    • Blocked By: None

    References:

    Pattern References:

    • calibrate_extrinsics.py:463-471 — Current "last valid frame" logic to REPLACE. Currently: verification_frames[serial] = {"frame": frame, "ids": ids, "corners": corners}
    • calibrate_extrinsics.py:452-478 — Full frame processing context (pose estimation, accumulation, frame caching)
    • aruco/depth_verify.py:27-67 — compute_depth_residual can be used to check valid depth at marker locations for scoring

    Test References:

    • tests/test_depth_cli_postprocess.py — Test pattern for calibrate_extrinsics functions

    Acceptance Criteria:

    • score_frame_quality() function exists and returns a float
    • Best frame is selected (not last frame) for each camera
    • Scoring is deterministic (same inputs → same selected frame)
    • Frame selection metadata is logged under --debug
    • uv run pytest tests/ -q → all pass (no regressions)

    Agent-Executed QA Scenarios:

    Scenario: Frame scoring is deterministic
      Tool: Bash (uv run pytest)
      Steps:
        1. Run: uv run pytest tests/test_frame_scoring.py -v
        2. Assert: exit code 0
      Expected Result: Same inputs always produce same score and selection
      Evidence: Terminal output
    
    Scenario: Higher marker count increases score
      Tool: Bash (uv run pytest)
      Steps:
        1. Run: uv run pytest tests/test_frame_scoring.py::test_more_markers_higher_score -v
        2. Assert: exit code 0
      Expected Result: Frame with more markers scores higher
      Evidence: Terminal output
    

    Commit: YES

    • Message: feat(calibrate): replace naive frame selection with quality-scored best-frame
    • Files: calibrate_extrinsics.py, tests/test_frame_scoring.py
    • Pre-commit: uv run pytest tests/ -q

  • 5. Diagnostics and Acceptance Gates (P1)

    What to do:

    • Enrich refine_extrinsics_with_depth stats dict: The least_squares result (from Task 2) already provides .status, .message, .nfev, .njev, .optimality, .active_mask. Surface these in the returned stats dict as: termination_status (int), termination_message (str), nfev (int), njev (int), optimality (float), n_active_bounds (int, count of parameters at bound limits).
    • Add effective valid points count: Log how many marker corners had valid (finite, positive) depth, and how many were used after confidence filtering. Add to stats: n_depth_valid, n_confidence_filtered.
    • Add RMSE improvement gate: If improvement_rmse < 1e-4 AND nfev > 5, log WARNING: "Refinement converged with negligible improvement — consider checking depth data quality"
    • Add failure diagnostic: If success == False or nfev <= 1, log WARNING with termination message and suggest checking depth unit consistency
    • Log optimizer progress under --debug: Before and after optimization, log: initial cost, final cost, delta_rotation, delta_translation, termination message, number of function evaluations
    • Surface diagnostics in JSON output: Add fields to refine_depth dict in output JSON: termination_status, termination_message, nfev, n_valid_points, loss_function, f_scale
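Surfacing the OptimizeResult fields might look like this (a toy bound-constrained problem stands in for the actual refinement; the stats keys match the plan):

```python
import numpy as np
from scipy.optimize import least_squares

# Toy problem: the unconstrained optimum of the second parameter (-2.0)
# lies outside its lower bound (-1.0), so one bound becomes active.
result = least_squares(
    lambda p: np.array([p[0] - 1.0, p[1] + 2.0]),
    x0=np.zeros(2), method="trf",
    bounds=(np.array([-5.0, -1.0]), np.array([5.0, 5.0])),
)

stats = {
    "termination_status": int(result.status),
    "termination_message": result.message,
    "nfev": int(result.nfev),
    "njev": int(result.njev or 0),
    "optimality": float(result.optimality),
    "n_active_bounds": int(np.count_nonzero(result.active_mask)),
}
if not result.success or result.nfev <= 1:
    print("WARNING: refinement failed; check depth unit consistency")
```

active_mask marks parameters pinned at a bound (-1 for lower, +1 for upper), so counting its nonzero entries gives n_active_bounds directly.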

    Must NOT do:

    • Do NOT add automated "redo with different params" logic
    • Do NOT add email/notification alerts
    • Do NOT change the optimization algorithm or parameters (already done in Task 2)

    Recommended Agent Profile:

    • Category: quick
      • Reason: Adding logging and dict fields — no algorithmic changes
    • Skills: []

    Parallelization:

    • Can Run In Parallel: YES (with Task 3)
    • Parallel Group: Wave 2
    • Blocks: Task 6
    • Blocked By: Task 2

    References:

    Pattern References:

    • aruco/depth_refine.py:103-111 — Current stats dict construction (to EXTEND, not replace)
    • calibrate_extrinsics.py:159-181 — Current refinement result logging and JSON field assignment
    • loguru.logger — Project uses loguru for structured logging

    API/Type References:

    • scipy.optimize.OptimizeResult — .status (int: -1=improper input, 0=max_nfev reached, 1-4=converged via gtol/ftol/xtol), .message (str), .nfev, .njev, .optimality (infinity norm of the scaled gradient)

    Acceptance Criteria:

    • Stats dict contains: termination_status, termination_message, nfev, n_valid_points
    • Output JSON refine_depth section contains diagnostic fields
    • WARNING log emitted when improvement < 1e-4 with nfev > 5
    • WARNING log emitted when success=False or nfev <= 1
    • uv run pytest tests/ -q → all pass

    Agent-Executed QA Scenarios:

    Scenario: Diagnostics present in refine stats
      Tool: Bash (uv run pytest)
      Steps:
        1. Run: uv run pytest tests/test_depth_refine.py -v
        2. Assert: All tests pass
        3. Check that stats dict from refine function contains "termination_message" key
      Expected Result: Diagnostics are in stats output
      Evidence: Terminal output
    

    Commit: YES

    • Message: feat(refine): add rich optimizer diagnostics and acceptance gates
    • Files: aruco/depth_refine.py, calibrate_extrinsics.py, tests/test_depth_refine.py
    • Pre-commit: uv run pytest tests/ -q

  • 6. Benchmark Matrix (P1)

    What to do:

    • Add --benchmark-matrix flag to calibrate_extrinsics.py CLI
    • When enabled, run the depth refinement pipeline 4 times per camera with different configurations:
      1. baseline: loss="linear" (no robust loss), no confidence weights
      2. robust: loss="soft_l1", f_scale=0.1, no confidence weights
      3. robust+confidence: loss="soft_l1", f_scale=0.1, confidence weighting ON
      4. robust+confidence+best-frame: Same as #3 but using best-frame selection
    • Output: For each configuration, report per-camera: pre-refinement RMSE, post-refinement RMSE, improvement, iteration count, success/failure, termination reason
    • Format: Print a formatted table to stdout (using click.echo) AND save to a benchmark section in the output JSON
    • Implementation: Create a helper function run_benchmark_matrix(T_initial, marker_corners_world, depth_map, K, confidence_map, ...) that returns a list of result dicts
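The orchestration could be sketched as follows (run_benchmark_matrix comes from the plan; refine_fn stands in for refine_extrinsics_with_depth, and the config keys are assumptions about its eventual signature):

```python
def run_benchmark_matrix(refine_fn):
    """Run the four configurations and collect one result record per config."""
    configs = [
        {"name": "baseline", "loss": "linear", "use_confidence": False},
        {"name": "robust", "loss": "soft_l1", "f_scale": 0.1,
         "use_confidence": False},
        {"name": "robust+confidence", "loss": "soft_l1", "f_scale": 0.1,
         "use_confidence": True},
        {"name": "robust+confidence+best-frame", "loss": "soft_l1",
         "f_scale": 0.1, "use_confidence": True, "best_frame": True},
    ]
    results = []
    for cfg in configs:
        stats = refine_fn(**{k: v for k, v in cfg.items() if k != "name"})
        results.append({"config": cfg["name"], **stats})
    return results

# Stub refine_fn for illustration only; the real one returns the stats dict
# from Task 5 (RMSE before/after, nfev, termination reason, ...).
records = run_benchmark_matrix(lambda **kw: {"rmse_post": 0.01, "success": True})
```

Each record then feeds both the stdout table (via click.echo) and the benchmark section of the output JSON.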

    Must NOT do:

    • Do NOT implement automated configuration tuning
    • Do NOT add visualization/plotting dependencies
    • Do NOT change the default (non-benchmark) codepath behavior

    Recommended Agent Profile:

    • Category: unspecified-low
      • Reason: Orchestration code, calling existing functions with different params
    • Skills: []

    Parallelization:

    • Can Run In Parallel: NO (depends on all previous tasks)
    • Parallel Group: Wave 3 (after all)
    • Blocks: Task 7
    • Blocked By: Tasks 2, 3, 4, 5

    References:

    Pattern References:

    • calibrate_extrinsics.py:73-196 — apply_depth_verify_refine_postprocess function. The benchmark matrix calls this logic with varied parameters
    • aruco/depth_refine.py — Updated refine_extrinsics_with_depth with loss, f_scale, confidence_map params

    Acceptance Criteria:

    • --benchmark-matrix flag exists in CLI
    • When enabled, 4 configurations are run per camera
    • Output table is printed to stdout
    • Benchmark results are in output JSON under benchmark key
    • uv run pytest tests/ -q → all pass

    Agent-Executed QA Scenarios:

    Scenario: Benchmark flag in CLI help
      Tool: Bash
      Steps:
        1. Run: uv run python calibrate_extrinsics.py --help | grep benchmark
        2. Assert: output contains "--benchmark-matrix"
      Expected Result: Flag is present
      Evidence: Help text output
    

    Commit: YES

    • Message: feat(calibrate): add --benchmark-matrix for comparing refinement configurations
    • Files: calibrate_extrinsics.py, tests/test_benchmark.py
    • Pre-commit: uv run pytest tests/ -q

  • 7. Documentation Update

    What to do:

    • Update docs/calibrate-extrinsics-workflow.md:
      • Add new CLI flags: --use-confidence-weights, --benchmark-matrix
      • Update "Depth Verification & Refinement" section with new optimizer details
      • Update "Refinement" section: document least_squares with soft_l1 loss, f_scale, confidence weighting
      • Add "Best-Frame Selection" section explaining the scoring formula
      • Add "Diagnostics" section documenting new output JSON fields
      • Update "Example Workflow" commands to show new flags
      • Mark the "Known Unexpected Behavior" unit mismatch section as RESOLVED with the fix description

    Must NOT do:

    • Do NOT rewrite unrelated documentation sections
    • Do NOT add tutorial-style content

    Recommended Agent Profile:

    • Category: writing
      • Reason: Pure documentation writing
    • Skills: []

    Parallelization:

    • Can Run In Parallel: NO
    • Parallel Group: Wave 4 (final)
    • Blocks: None
    • Blocked By: All previous tasks

    References:

    Pattern References:

    • docs/calibrate-extrinsics-workflow.md — Entire file. Follow existing section structure and formatting

    Acceptance Criteria:

    • New CLI flags documented
    • least_squares optimizer documented with parameter explanations
    • Best-frame selection documented
    • Unit mismatch section updated as resolved
    • Example commands include new flags

    Commit: YES

    • Message: docs: update calibrate-extrinsics-workflow for robust refinement changes
    • Files: docs/calibrate-extrinsics-workflow.md
    • Pre-commit: uv run pytest tests/ -q

Commit Strategy

| After Task | Message | Files | Verification |
| --- | --- | --- | --- |
| 1 | fix(svo): harden depth units — set coordinate_units=METER, guard /1000 conversion | aruco/svo_sync.py, tests | uv run pytest tests/ -q |
| 2 | feat(refine): replace L-BFGS-B MSE with least_squares soft-L1 robust optimizer | aruco/depth_refine.py, tests | uv run pytest tests/ -q |
| 3 | feat(refine): add confidence-weighted depth residuals with --use-confidence-weights flag | aruco/depth_verify.py, aruco/depth_refine.py, calibrate_extrinsics.py, tests | uv run pytest tests/ -q |
| 4 | feat(calibrate): replace naive frame selection with quality-scored best-frame | calibrate_extrinsics.py, tests | uv run pytest tests/ -q |
| 5 | feat(refine): add rich optimizer diagnostics and acceptance gates | aruco/depth_refine.py, calibrate_extrinsics.py, tests | uv run pytest tests/ -q |
| 6 | feat(calibrate): add --benchmark-matrix for comparing refinement configurations | calibrate_extrinsics.py, tests | uv run pytest tests/ -q |
| 7 | docs: update calibrate-extrinsics-workflow for robust refinement changes | docs/calibrate-extrinsics-workflow.md | uv run pytest tests/ -q |

Success Criteria

Verification Commands

uv run pytest tests/ -q                    # Expected: all pass, 0 failures
uv run pytest tests/test_depth_refine.py -v  # Expected: all tests pass including new robust/confidence tests

Final Checklist

  • All "Must Have" items present
  • All "Must NOT Have" items absent
  • All tests pass (uv run pytest tests/ -q)
  • Output JSON backward compatible (existing fields preserved, new fields additive)
  • Default CLI behavior unchanged (new features opt-in)
  • Optimizer actually converges on synthetic test data (success=True, nfev > 1)