Robust Depth Refinement for Camera Extrinsics
TL;DR
Quick Summary: Replace the failing depth-based pose refinement pipeline with a robust optimizer (`scipy.optimize.least_squares` with soft-L1 loss), add unit hardening, confidence-weighted residuals, best-frame selection, rich diagnostics, and a benchmark matrix comparing configurations.

Deliverables:
- Unit-hardened depth retrieval (set `coordinate_units=METER`, guard against double conversion)
- Robust optimization objective using `least_squares(method="trf", loss="soft_l1", f_scale=0.1)`
- Confidence-weighted depth residuals (toggleable via CLI flag)
- Best-frame selection replacing naive "latest valid frame"
- Rich optimizer diagnostics and acceptance gates
- Benchmark matrix comparing baseline/robust/+confidence/+best-frame
- Updated tests for all new functionality
Estimated Effort: Medium (3-4 hours implementation)
Parallel Execution: YES - 4 waves
Critical Path: Task 1 (units) → Task 2 (robust optimizer) → Task 3 (confidence) → Task 5 (diagnostics) → Task 6 (benchmark)
Context
Original Request
Implement the 5 items from "Recommended Implementation Order" in docs/calibrate-extrinsics-workflow.md, plus research and choose the best optimization method for depth-based camera extrinsic refinement.
Interview Summary
Key Discussions:
- Requirements were explicitly specified in the documentation (no interactive interview needed)
- Research confirmed `scipy.optimize.least_squares` is superior to `scipy.optimize.minimize` for this problem class
Research Findings:
- freemocap/anipose (production multi-camera calibration) uses exactly `least_squares(method="trf", loss=loss, f_scale=threshold)` for bundle adjustment, which validates our approach
- scipy docs recommend `soft_l1` or `huber` for robust fitting; `f_scale` controls the inlier/outlier threshold
- Current output JSONs confirm catastrophic failure: RMSE 5000+ meters (`aligned_refined_extrinsics_fast.json`), RMSE ~11.6 m (`test_refine_current.json`), iterations=0/1, success=false across all cameras
- Unit mismatch is still active despite the `/1000.0` conversion: ZED defaults to mm, the code divides by 1000, but `coordinate_units=METER` is never set
- Confidence map is retrieved but only used in verify filtering, not in the optimizer objective
Metis Review
Identified Gaps (addressed):
- Output JSON schema backward compatibility → New fields are additive only (existing fields preserved)
- Confidence weighting can interact with robust loss → Made toggleable, logged statistics
- Best-frame selection changes behavior → Deterministic scoring, old behavior available as fallback
- Zero valid points edge case → Explicit early exit with diagnostic
- Numerical pass/fail gate → Added RMSE threshold checks
- Regression guard → Default CLI behavior unchanged unless user opts into new features
Work Objectives
Core Objective
Make depth-based extrinsic refinement actually work by fixing the unit mismatch, switching to a robust optimizer, incorporating confidence weighting, and selecting the best frame for refinement.
Concrete Deliverables
- Modified `aruco/svo_sync.py` with unit hardening
- Rewritten `aruco/depth_refine.py` using `least_squares` with robust loss
- Updated `aruco/depth_verify.py` with a confidence weight extraction helper
- Updated `calibrate_extrinsics.py` with frame scoring, diagnostics, and new CLI flags
- New and updated tests in `tests/`
- Updated `docs/calibrate-extrinsics-workflow.md` documenting the new behavior
Definition of Done
- `uv run pytest` passes with 0 failures
- Synthetic test: robust optimizer converges (success=True, nfev > 1) with injected outliers
- Existing tests still pass (backward compatibility)
- Benchmark matrix produces 4 comparable result records
Must Have
- `coordinate_units = sl.UNIT.METER` set in SVOReader
- `least_squares` with `loss="soft_l1"` and `f_scale=0.1` as the default optimizer
- Confidence weighting via `--use-confidence-weights` flag
- Best-frame selection with deterministic scoring
- Optimizer diagnostics in output JSON and logs
- All changes covered by automated tests
Must NOT Have (Guardrails)
- Must NOT change unrelated calibration logic (marker detection, PnP, pose averaging, alignment)
- Must NOT change file I/O formats or break JSON schema (only additive fields)
- Must NOT introduce new dependencies beyond scipy/numpy already in use
- Must NOT implement multi-optimizer auto-selection or hyperparameter search
- Must NOT turn frame scoring into a ML quality model — simple weighted heuristic only
- Must NOT add premature abstractions or over-engineer the API
- Must NOT remove existing CLI flags or change their default behavior
Verification Strategy
UNIVERSAL RULE: ZERO HUMAN INTERVENTION
ALL tasks in this plan MUST be verifiable WITHOUT any human action. Every criterion is verified by running `uv run pytest` or inspecting code.
Test Decision
- Infrastructure exists: YES (pytest configured in pyproject.toml, tests/ directory)
- Automated tests: YES (tests-after, matching existing project pattern)
- Framework: pytest (via `uv run pytest`)
Agent-Executed QA Scenarios (MANDATORY — ALL tasks)
Verification Tool by Deliverable Type:
| Type | Tool | How Agent Verifies |
|---|---|---|
| Python module changes | Bash (`uv run pytest`) | Run tests, assert 0 failures |
| New functions | Bash (`uv run pytest -k test_name`) | Run specific test, assert pass |
| CLI behavior | Bash (`uv run python calibrate_extrinsics.py --help`) | Verify new flags present |
Execution Strategy
Parallel Execution Waves
Wave 1 (Start Immediately):
├── Task 1: Unit hardening (svo_sync.py) [no dependencies]
└── Task 4: Best-frame selection (calibrate_extrinsics.py) [no dependencies]
Wave 2 (After Wave 1):
├── Task 2: Robust optimizer (depth_refine.py) [depends: 1]
├── Task 3: Confidence weighting (depth_verify.py + depth_refine.py) [depends: 2]
└── Task 5: Diagnostics and acceptance gates [depends: 2]
Wave 3 (After Wave 2):
└── Task 6: Benchmark matrix [depends: 2, 3, 4, 5]
Wave 4 (After All):
└── Task 7: Documentation update [depends: all]
Critical Path: Task 1 → Task 2 → Task 3 → Task 5 → Task 6
Dependency Matrix
| Task | Depends On | Blocks | Can Parallelize With |
|---|---|---|---|
| 1 | None | 2, 3 | 4 |
| 2 | 1 | 3, 5, 6 | - |
| 3 | 2 | 6 | 5 |
| 4 | None | 6 | 1 |
| 5 | 2 | 6 | 3 |
| 6 | 2, 3, 4, 5 | 7 | - |
| 7 | All | None | - |
Agent Dispatch Summary
| Wave | Tasks | Recommended Agents |
|---|---|---|
| 1 | 1, 4 | category="quick" for T1; category="unspecified-low" for T4 |
| 2 | 2, 3, 5 | category="deep" for T2; category="quick" for T3, T5 |
| 3 | 6 | category="unspecified-low" |
| 4 | 7 | category="writing" |
TODOs
1. Unit Hardening (P0)
What to do:
- In `aruco/svo_sync.py`, add `init_params.coordinate_units = sl.UNIT.METER` in the `SVOReader.__init__` method, right after `init_params.set_from_svo_file(path)` (around line 42)
- Guard the existing `/1000.0` conversion: check whether `coordinate_units` is already METER. If METER is set, skip the division. If not set or MILLIMETER, apply the division. Log a warning if the division is applied as a fallback
- Add depth sanity logging under `--debug` mode: after retrieving depth, log min/median/max/p95 of valid depth values. This goes in the `_retrieve_depth` method
- Write a test that verifies the unit-hardened path doesn't double-convert
Must NOT do:
- Do NOT change depth retrieval for confidence maps
- Do NOT modify the `grab_synced()` or `grab_all()` methods
- Do NOT add new CLI parameters for this task
Recommended Agent Profile:
- Category: `quick`
- Reason: Small, focused change in one file + one test file
- Skills: [`git-master`]
  - `git-master`: Atomic commit of the unit hardening change
Parallelization:
- Can Run In Parallel: YES
- Parallel Group: Wave 1 (with Task 4)
- Blocks: Tasks 2, 3
- Blocked By: None
References:
Pattern References (existing code to follow):
- `aruco/svo_sync.py:40-44` — Current `init_params` setup where `coordinate_units` must be added
- `aruco/svo_sync.py:180-189` — Current `_retrieve_depth` method with the `/1000.0` conversion to modify
- `aruco/svo_sync.py:191-196` — Confidence retrieval pattern (do NOT modify, but understand adjacency)
API/Type References (contracts to implement against):
- ZED SDK `InitParameters.coordinate_units` — set to `sl.UNIT.METER`
- `loguru.logger` — used project-wide for debug logging
Test References (testing patterns to follow):
- `tests/test_depth_verify.py:36-66` — Test pattern using synthetic depth maps (follow this style)
- `tests/test_depth_refine.py:21-39` — Test pattern with synthetic K matrix and depth maps
Documentation References:
- `docs/calibrate-extrinsics-workflow.md:116-132` — Documents the unit mismatch problem and mitigation strategy
- `docs/calibrate-extrinsics-workflow.md:166-169` — Specifies the exact implementation steps for unit hardening
Acceptance Criteria:
- `init_params.coordinate_units = sl.UNIT.METER` is set in `SVOReader.__init__` before `cam.open()`
- The `/1000.0` division in `_retrieve_depth` is guarded (only applied if units are NOT meters)
- Debug logging of depth statistics (min/median/max) is added to `_retrieve_depth` when depth mode is active
- `uv run pytest tests/test_depth_refine.py tests/test_depth_verify.py -q` → all pass (no regressions)
Agent-Executed QA Scenarios:
Scenario: Verify unit hardening doesn't break existing tests
- Tool: Bash (`uv run pytest`)
- Preconditions: All dependencies installed
- Steps:
  1. Run: `uv run pytest tests/test_depth_refine.py tests/test_depth_verify.py -q`
  2. Assert: exit code 0
  3. Assert: output contains "passed" and no "FAILED"
- Expected Result: All existing tests pass
- Evidence: Terminal output captured

Scenario: Verify coordinate_units is set in code
- Tool: Bash (grep)
- Preconditions: File modified
- Steps:
  1. Run: `grep -n "coordinate_units" aruco/svo_sync.py`
  2. Assert: output contains "UNIT.METER" or "METER"
- Expected Result: Unit setting is present
- Evidence: Grep output

Commit: YES
- Message: `fix(svo): harden depth units — set coordinate_units=METER, guard /1000 conversion`
- Files: `aruco/svo_sync.py`, `tests/test_depth_refine.py`
- Pre-commit: `uv run pytest tests/ -q`
2. Robust Optimizer — Replace MSE with `least_squares` + Soft-L1 Loss (P0)
What to do:
- Rewrite `depth_residual_objective`: replace it with a residual vector function `depth_residuals(params, ...)` that returns an array of residuals (not a scalar cost). Each element is `(z_measured - z_predicted)` for one marker corner. This is what `least_squares` expects.
- Add regularization as pseudo-residuals: append `[reg_weight_rot * delta_rvec, reg_weight_trans * delta_tvec]` to the residual vector. This naturally penalizes deviation from the initial pose. Split into separate rotation and translation regularization weights (default: `reg_rot=0.1`, `reg_trans=1.0` — translation is more tightly regularized at meter scale).
- Replace `minimize(method="L-BFGS-B")` with `least_squares(method="trf", loss="soft_l1", f_scale=0.1)`:
  - `method="trf"` — Trust Region Reflective, handles bounds naturally
  - `loss="soft_l1"` — smooth robust loss, downweights outliers beyond `f_scale`
  - `f_scale=0.1` — residuals > 0.1 m are treated as outliers (matches ZED depth noise of ~1-5 cm)
  - `bounds` — same ±5°/±5 cm bounds, expressed as a `(lower_bounds_array, upper_bounds_array)` tuple
  - `x_scale="jac"` — automatic Jacobian-based scaling (prevents ill-conditioning)
  - `max_nfev=200` — maximum function evaluations
- Update the `refine_extrinsics_with_depth` signature: add parameters for `loss`, `f_scale`, `reg_rot`, `reg_trans`. Keep backward-compatible defaults. Return an enriched stats dict including `termination_message`, `nfev`, `optimality`, `active_mask`, `cost`.
- Handle zero residuals: if the residual vector is empty (no valid depth points), return the initial pose unchanged with stats indicating `"reason": "no_valid_depth_points"`.
- Maintain backward-compatible scalar cost reporting: compute `initial_cost` and `final_cost` from the residual vector for comparison with the old output format.
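A minimal, self-contained sketch of the intended optimizer call. The `depth_residuals` here is a simplified stand-in that predicts corner depths by applying a small delta pose to points already expressed in the camera frame; the real implementation would compute each element via `compute_depth_residual` per marker corner:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def depth_residuals(params, pts_cam, z_measured, reg_rot=0.1, reg_trans=1.0):
    """Residual VECTOR for least_squares: one depth error per corner, plus
    regularization pseudo-residuals pulling the delta pose toward zero."""
    rvec, tvec = params[:3], params[3:]
    pts = Rotation.from_rotvec(rvec).apply(pts_cam) + tvec
    depth_err = z_measured - pts[:, 2]
    reg = np.concatenate([reg_rot * rvec, reg_trans * tvec])
    return np.concatenate([depth_err, reg])

# Synthetic scene: 40 corners ~2 m away, true +3 cm z-offset, 6 gross outliers
rng = np.random.default_rng(0)
pts_cam = rng.uniform(-0.5, 0.5, (40, 3)) + np.array([0.0, 0.0, 2.0])
z_measured = pts_cam[:, 2] + 0.03 + rng.normal(0.0, 0.005, 40)
z_measured[:3] += 1.0   # injected gross outliers (both signs)
z_measured[3:6] -= 1.0

lim = np.array([np.deg2rad(5.0)] * 3 + [0.05] * 3)   # ±5°, ±5 cm
result = least_squares(
    depth_residuals, np.zeros(6), bounds=(-lim, lim),
    method="trf", loss="soft_l1", f_scale=0.1, x_scale="jac", max_nfev=200,
    args=(pts_cam, z_measured),
)
print(result.success, result.nfev, result.x[5])  # recovered tz should be near +0.03
```

With `loss="soft_l1"` and `f_scale=0.1`, the meter-scale outliers are heavily downweighted and the recovered z-translation lands close to the true +3 cm; a pure least-squares fit (`loss="linear"`) would be dragged toward the outliers.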
Must NOT do:
- Do NOT change `extrinsics_to_params` or `params_to_extrinsics` (the Rodrigues parameterization is correct)
- Do NOT modify `depth_verify.py` in this task
- Do NOT add confidence weighting here (that's Task 3)
- Do NOT add CLI flags here (that's Task 5)
Recommended Agent Profile:
- Category: `deep`
- Reason: Core algorithmic change; requires understanding of optimization theory and careful residual construction
- Skills: []
- No specialized skills needed — pure Python/numpy/scipy work
Parallelization:
- Can Run In Parallel: NO
- Parallel Group: Wave 2 (sequential after Wave 1)
- Blocks: Tasks 3, 5, 6
- Blocked By: Task 1
References:
Pattern References (existing code to follow):
- `aruco/depth_refine.py:19-47` — Current `depth_residual_objective` function to REPLACE
- `aruco/depth_refine.py:50-112` — Current `refine_extrinsics_with_depth` function to REWRITE
- `aruco/depth_refine.py:1-16` — Import block and helper functions (keep `extrinsics_to_params`, `params_to_extrinsics`)
- `aruco/depth_verify.py:27-67` — `compute_depth_residual` function: the per-point residual computation called from the objective. Understand its contract: returns `float(z_measured - z_predicted)` or `None`.
API/Type References:
- `scipy.optimize.least_squares` — scipy docs: `fun(x, *args) -> residuals_array`; parameters: `method="trf"`, `loss="soft_l1"`, `f_scale=0.1`, `bounds=(lb, ub)`, `x_scale="jac"`, `max_nfev=200`
- Return type: `OptimizeResult` with attributes `.x`, `.cost`, `.fun`, `.jac`, `.grad`, `.optimality`, `.active_mask`, `.nfev`, `.njev`, `.status`, `.message`, `.success`
External References (production examples):
- freemocap/anipose `bundle_adjust` method — uses `least_squares(error_fun, x0, jac_sparsity=jac_sparse, f_scale=f_scale, x_scale="jac", loss=loss, ftol=ftol, method="trf", tr_solver="lsmr")` for multi-camera calibration. Key pattern: the residual function returns per-point reprojection errors.
- scipy Context7 docs — example shows `least_squares(fun, x0, loss='soft_l1', f_scale=0.1, args=(t_train, y_train))` where `fun` returns a residual vector
Test References:
- `tests/test_depth_refine.py` — ALL 4 existing tests must still pass. They test: roundtrip, no-change convergence, offset correction, and bounds respect. The new optimizer must satisfy these same properties.
Acceptance Criteria:
- `from scipy.optimize import least_squares` replaces `from scipy.optimize import minimize`
- `depth_residuals()` returns `np.ndarray` (a vector), not a scalar float
- `least_squares(method="trf", loss="soft_l1", f_scale=0.1)` is the optimizer call
- Regularization is split: separate `reg_rot` and `reg_trans` weights, appended as pseudo-residuals
- Stats dict includes: `termination_message`, `nfev`, `optimality`, `cost`
- Zero-residual case returns the initial pose with `reason: "no_valid_depth_points"`
- `uv run pytest tests/test_depth_refine.py -q` → all 4 existing tests pass
- New test: synthetic data with 30% outlier depths → robust optimizer converges (success=True, nfev > 1) with a lower median residual than pure MSE would achieve
Agent-Executed QA Scenarios:
Scenario: All existing depth_refine tests pass after rewrite
- Tool: Bash (`uv run pytest`)
- Preconditions: Task 1 completed, `aruco/depth_refine.py` rewritten
- Steps:
  1. Run: `uv run pytest tests/test_depth_refine.py -v`
  2. Assert: exit code 0
  3. Assert: output contains "4 passed"
- Expected Result: All 4 existing tests pass
- Evidence: Terminal output captured

Scenario: Robust optimizer handles outliers better than MSE
- Tool: Bash (`uv run pytest`)
- Preconditions: New test added
- Steps:
  1. Run: `uv run pytest tests/test_depth_refine.py::test_robust_loss_handles_outliers -v`
  2. Assert: exit code 0
  3. Assert: test passes
- Expected Result: With 30% outliers, the robust optimizer has a lower median absolute residual
- Evidence: Terminal output captured

Commit: YES
- Message: `feat(refine): replace L-BFGS-B MSE with least_squares soft-L1 robust optimizer`
- Files: `aruco/depth_refine.py`, `tests/test_depth_refine.py`
- Pre-commit: `uv run pytest tests/test_depth_refine.py -q`
3. Confidence-Weighted Depth Residuals (P0)
What to do:
- Add a confidence weight extraction helper to `aruco/depth_verify.py`: create a function `get_confidence_weight(confidence_map, u, v, confidence_thresh=50) -> float` that returns a normalized weight in [0, 1]. ZED confidence is in [1, 100] where higher = LESS confident. Normalize as `max(0, confidence_thresh - conf_value) / confidence_thresh`. Values above the threshold → weight 0. Clamp to `[eps, 1.0]` where eps=1e-6.
- Update `depth_residuals()` in `aruco/depth_refine.py`: accept optional `confidence_map` and `confidence_thresh` parameters. If `confidence_map` is provided, multiply each depth residual by `sqrt(weight)` before returning. This implements weighted least squares within the `least_squares` framework.
- Update the `refine_extrinsics_with_depth` signature: add `confidence_map=None`, `confidence_thresh=50` parameters. Pass them through to `depth_residuals()`.
- Update `calibrate_extrinsics.py`: pass `confidence_map=frame.confidence_map` and `confidence_thresh=depth_confidence_threshold` to `refine_extrinsics_with_depth` when confidence weighting is requested
- Add a `--use-confidence-weights` / `--no-confidence-weights` CLI flag (default: False for backward compatibility)
- Log confidence statistics under `--debug`: after computing weights, log `n_zero_weight`, `mean_weight`, `median_weight`
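A sketch of the helper and the square-root weighting, following the normalization described above. Note the clamp to `[eps, 1.0]` means over-threshold points keep a negligible rather than exactly-zero weight:

```python
import numpy as np

EPS = 1e-6

def get_confidence_weight(confidence_map, u, v, confidence_thresh=50):
    """ZED confidence lies in [1, 100]; HIGHER means LESS confident.
    Map it to a weight: ~1 near full confidence, ~0 at/above the threshold."""
    conf = float(confidence_map[int(v), int(u)])     # row = v, column = u
    weight = max(0.0, confidence_thresh - conf) / confidence_thresh
    return float(np.clip(weight, EPS, 1.0))

# Weighted least squares inside least_squares: scale each residual by sqrt(w),
# so the squared cost effectively sees the full weight w.
conf_map = np.array([[1.0, 49.0], [50.0, 100.0]])
weights = [get_confidence_weight(conf_map, u, v) for v in (0, 1) for u in (0, 1)]
residuals = np.full(4, 0.2)
weighted = residuals * np.sqrt(weights)
print(weights)
```

Multiplying residuals by `sqrt(w)` (rather than `w`) is the standard trick: `least_squares` minimizes the sum of squared residuals, so the squared, weighted residual contributes `w * r**2`.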
Must NOT do:
- Do NOT change the verification logic in `verify_extrinsics_with_depth` (it already uses confidence correctly)
- Do NOT change the confidence semantics (a higher ZED value means less confident)
- Do NOT make confidence weighting the default behavior
Recommended Agent Profile:
- Category: `quick`
- Reason: Adding parameters and weight multiplication — straightforward plumbing
- Skills: []
Parallelization:
- Can Run In Parallel: NO (depends on Task 2)
- Parallel Group: Wave 2 (after Task 2)
- Blocks: Task 6
- Blocked By: Task 2
References:
Pattern References:
- `aruco/depth_verify.py:82-96` — Existing confidence handling pattern (filtering, NOT weighting). Follow these semantics but produce a continuous weight instead of a binary skip
- `aruco/depth_verify.py:93-95` — ZED confidence semantics: "Higher confidence value means LESS confident... Range [1, 100], where 100 is typically occlusion/invalid"
- `aruco/depth_refine.py` — Updated in Task 2 with the `depth_residuals()` function. Add the `confidence_map` parameter here
- `calibrate_extrinsics.py:136-148` — Current call site for `refine_extrinsics_with_depth`. Add confidence_map/thresh forwarding
Test References:
- `tests/test_depth_verify.py:69-84` — Test pattern for `compute_marker_corner_residuals`. Follow for the confidence weight test
Acceptance Criteria:
- `get_confidence_weight()` function exists in `depth_verify.py`
- Confidence weighting is off by default (backward compatible)
- `--use-confidence-weights` flag exists in the CLI
- Low-confidence points have lower influence on optimization (verified by test)
- `uv run pytest tests/ -q` → all pass
Agent-Executed QA Scenarios:
Scenario: Confidence weighting reduces outlier influence
- Tool: Bash (`uv run pytest`)
- Steps:
  1. Run: `uv run pytest tests/test_depth_refine.py::test_confidence_weighting -v`
  2. Assert: exit code 0
- Expected Result: With low-confidence outlier points, the weighted optimizer ignores them
- Evidence: Terminal output

Scenario: CLI flag exists
- Tool: Bash
- Steps:
  1. Run: `uv run python calibrate_extrinsics.py --help | grep -i confidence-weight`
  2. Assert: output contains "--use-confidence-weights"
- Expected Result: Flag is available
- Evidence: Help text

Commit: YES
- Message: `feat(refine): add confidence-weighted depth residuals with --use-confidence-weights flag`
- Files: `aruco/depth_verify.py`, `aruco/depth_refine.py`, `calibrate_extrinsics.py`, `tests/test_depth_refine.py`
- Pre-commit: `uv run pytest tests/ -q`
4. Best-Frame Selection (P1)
What to do:
- Create a `score_frame_quality()` function in `calibrate_extrinsics.py` (or a new `aruco/frame_scoring.py` if cleaner). The function takes `n_markers: int`, `reproj_error: float`, `depth_map: np.ndarray`, `marker_corners_world: Dict[int, np.ndarray]`, `T_world_cam: np.ndarray`, `K: np.ndarray` and returns a float score (higher = better).
- Scoring formula: `score = w_markers * n_markers + w_reproj * (1 / (reproj_error + eps)) + w_depth * valid_depth_ratio`
  - `w_markers = 1.0` — more markers = better constraint
  - `w_reproj = 5.0` — lower reprojection error = more accurate PnP
  - `w_depth = 3.0` — a higher ratio of valid depth at marker locations = better depth signal
  - `valid_depth_ratio = n_valid_depths / n_total_corners`
  - `eps = 1e-6` to avoid division by zero
- Replace the "last valid frame" logic in `calibrate_extrinsics.py`: instead of overwriting `verification_frames[serial]` every time (lines 467-471), track ALL valid frames per camera with their scores. After the processing loop, select the frame with the highest score.
- Log the selected frame: under `--debug`, log the chosen frame index, score, and component breakdown for each camera
- Ensure deterministic tiebreaking: if scores are equal, pick the frame with the lower frame_index (earliest)
- Keep frame storage bounded: store at most `max_stored_frames=10` candidates per camera (configurable), keeping the top-scoring ones
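A simplified sketch of the scoring heuristic. The signature here takes precomputed counts instead of the full `depth_map`/`K` inputs named above, and the `candidates` list is hypothetical — it only illustrates the deterministic selection rule:

```python
def score_frame_quality(n_markers: int, reproj_error: float,
                        n_valid_depths: int, n_total_corners: int,
                        w_markers: float = 1.0, w_reproj: float = 5.0,
                        w_depth: float = 3.0, eps: float = 1e-6) -> float:
    """Weighted heuristic: more markers, lower reprojection error, and a
    higher valid-depth ratio all raise the score."""
    valid_depth_ratio = n_valid_depths / max(n_total_corners, 1)
    return (w_markers * n_markers
            + w_reproj * (1.0 / (reproj_error + eps))
            + w_depth * valid_depth_ratio)

# Deterministic selection: highest score wins; ties go to the LOWER frame index
candidates = [
    {"frame_index": 3, "score": score_frame_quality(4, 0.8, 14, 16)},
    {"frame_index": 7, "score": score_frame_quality(6, 0.5, 22, 24)},
]
best = max(candidates, key=lambda c: (c["score"], -c["frame_index"]))
print(best["frame_index"])  # -> 7
```

Sorting by `(score, -frame_index)` makes the tiebreak deterministic without a second pass: on equal scores, the smaller frame index produces the larger key.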
Must NOT do:
- Do NOT add ML-based frame scoring
- Do NOT change the frame grabbing/syncing logic
- Do NOT add new dependencies
Recommended Agent Profile:
- Category: `unspecified-low`
- Reason: New functionality, but a straightforward heuristic
- Skills: []
Parallelization:
- Can Run In Parallel: YES
- Parallel Group: Wave 1 (with Task 1)
- Blocks: Task 6
- Blocked By: None
References:
Pattern References:
- `calibrate_extrinsics.py:463-471` — Current "last valid frame" logic to REPLACE. Currently: `verification_frames[serial] = {"frame": frame, "ids": ids, "corners": corners}`
- `calibrate_extrinsics.py:452-478` — Full frame processing context (pose estimation, accumulation, frame caching)
- `aruco/depth_verify.py:27-67` — `compute_depth_residual` can be used to check valid depth at marker locations for scoring
Test References:
- `tests/test_depth_cli_postprocess.py` — Test pattern for calibrate_extrinsics functions
Acceptance Criteria:
- `score_frame_quality()` function exists and returns a float
- The best frame (not the last frame) is selected for each camera
- Scoring is deterministic (same inputs → same selected frame)
- Frame selection metadata is logged under `--debug`
- `uv run pytest tests/ -q` → all pass (no regressions)
Agent-Executed QA Scenarios:
Scenario: Frame scoring is deterministic
- Tool: Bash (`uv run pytest`)
- Steps:
  1. Run: `uv run pytest tests/test_frame_scoring.py -v`
  2. Assert: exit code 0
- Expected Result: Same inputs always produce the same score and selection
- Evidence: Terminal output

Scenario: Higher marker count increases score
- Tool: Bash (`uv run pytest`)
- Steps:
  1. Run: `uv run pytest tests/test_frame_scoring.py::test_more_markers_higher_score -v`
  2. Assert: exit code 0
- Expected Result: A frame with more markers scores higher
- Evidence: Terminal output

Commit: YES
- Message: `feat(calibrate): replace naive frame selection with quality-scored best-frame`
- Files: `calibrate_extrinsics.py`, `tests/test_frame_scoring.py`
- Pre-commit: `uv run pytest tests/ -q`
5. Diagnostics and Acceptance Gates (P1)
What to do:
- Enrich the `refine_extrinsics_with_depth` stats dict: the `least_squares` result (from Task 2) already provides `.status`, `.message`, `.nfev`, `.njev`, `.optimality`, `.active_mask`. Surface these in the returned stats dict as `termination_status` (int), `termination_message` (str), `nfev` (int), `njev` (int), `optimality` (float), `n_active_bounds` (int, count of parameters at bound limits).
- Add an effective valid points count: log how many marker corners had valid (finite, positive) depth, and how many were used after confidence filtering. Add to stats: `n_depth_valid`, `n_confidence_filtered`.
- Add an RMSE improvement gate: if `improvement_rmse < 1e-4` AND `nfev > 5`, log WARNING: "Refinement converged with negligible improvement — consider checking depth data quality"
- Add a failure diagnostic: if `success == False` or `nfev <= 1`, log WARNING with the termination message and suggest checking depth unit consistency
- Log optimizer progress under `--debug`: before and after optimization, log the initial cost, final cost, delta_rotation, delta_translation, termination message, and number of function evaluations
- Surface diagnostics in the JSON output: add fields to the `refine_depth` dict in the output JSON: `termination_status`, `termination_message`, `nfev`, `n_valid_points`, `loss_function`, `f_scale`
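A sketch of how the `OptimizeResult` fields map onto the stats dict, using a toy two-parameter problem (with an interior solution) purely to obtain a result object; the warning line stands in for the `loguru` call the plan describes:

```python
import numpy as np
from scipy.optimize import least_squares

# Toy problem whose solution x = [0.5, 0.0] lies strictly inside the bounds
result = least_squares(lambda x: np.array([x[0] - 0.5, x[1]]), np.zeros(2),
                       bounds=(-np.ones(2), np.ones(2)), method="trf")

stats = {
    "termination_status": int(result.status),
    "termination_message": str(result.message),
    "nfev": int(result.nfev),
    "njev": int(result.njev),
    "optimality": float(result.optimality),
    # parameters pinned at a bound show up as nonzero entries in active_mask
    "n_active_bounds": int(np.count_nonzero(result.active_mask)),
    "cost": float(result.cost),
}

if not result.success or result.nfev <= 1:
    print("WARNING: refinement failed:", stats["termination_message"])
```

Casting to plain `int`/`float`/`str` keeps the stats dict JSON-serializable, which matters once these fields land in the output JSON's `refine_depth` section.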
Must NOT do:
- Do NOT add automated "redo with different params" logic
- Do NOT add email/notification alerts
- Do NOT change the optimization algorithm or parameters (already done in Task 2)
Recommended Agent Profile:
- Category: `quick`
- Reason: Adding logging and dict fields — no algorithmic changes
- Skills: []
Parallelization:
- Can Run In Parallel: YES (with Task 3)
- Parallel Group: Wave 2
- Blocks: Task 6
- Blocked By: Task 2
References:
Pattern References:
- `aruco/depth_refine.py:103-111` — Current stats dict construction (to EXTEND, not replace)
- `calibrate_extrinsics.py:159-181` — Current refinement result logging and JSON field assignment
- `loguru.logger` — The project uses loguru for structured logging
API/Type References:
- `scipy.optimize.OptimizeResult` — `.status` (int: 1=convergence, 0=max_nfev, -1=improper), `.message` (str), `.nfev`, `.njev`, `.optimality` (gradient infinity norm)
Acceptance Criteria:
- Stats dict contains: `termination_status`, `termination_message`, `nfev`, `n_valid_points`
- The output JSON `refine_depth` section contains the diagnostic fields
- WARNING log emitted when improvement < 1e-4 with nfev > 5
- WARNING log emitted when success=False or nfev <= 1
- `uv run pytest tests/ -q` → all pass
Agent-Executed QA Scenarios:
Scenario: Diagnostics present in refine stats
- Tool: Bash (`uv run pytest`)
- Steps:
  1. Run: `uv run pytest tests/test_depth_refine.py -v`
  2. Assert: all tests pass
  3. Check that the stats dict from the refine function contains a "termination_message" key
- Expected Result: Diagnostics are in the stats output
- Evidence: Terminal output

Commit: YES
- Message: `feat(refine): add rich optimizer diagnostics and acceptance gates`
- Files: `aruco/depth_refine.py`, `calibrate_extrinsics.py`, `tests/test_depth_refine.py`
- Pre-commit: `uv run pytest tests/ -q`
6. Benchmark Matrix (P1)
What to do:
- Add a `--benchmark-matrix` flag to the `calibrate_extrinsics.py` CLI
- When enabled, run the depth refinement pipeline 4 times per camera with different configurations:
  1. baseline: `loss="linear"` (no robust loss), no confidence weights
  2. robust: `loss="soft_l1"`, `f_scale=0.1`, no confidence weights
  3. robust+confidence: `loss="soft_l1"`, `f_scale=0.1`, confidence weighting ON
  4. robust+confidence+best-frame: same as #3 but using best-frame selection
- Output: for each configuration, report per camera the pre-refinement RMSE, post-refinement RMSE, improvement, iteration count, success/failure, and termination reason
- Format: print a formatted table to stdout (using `click.echo`) AND save to a benchmark section in the output JSON
- Implementation: create a helper function `run_benchmark_matrix(T_initial, marker_corners_world, depth_map, K, confidence_map, ...)` that returns a list of result dicts
Must NOT do:
- Do NOT implement automated configuration tuning
- Do NOT add visualization/plotting dependencies
- Do NOT change the default (non-benchmark) codepath behavior
Recommended Agent Profile:
- Category: `unspecified-low`
- Reason: Orchestration code, calling existing functions with different params
- Skills: []
Parallelization:
- Can Run In Parallel: NO (depends on all previous tasks)
- Parallel Group: Wave 3 (after all)
- Blocks: Task 7
- Blocked By: Tasks 2, 3, 4, 5
References:
Pattern References:
- `calibrate_extrinsics.py:73-196` — The `apply_depth_verify_refine_postprocess` function. The benchmark matrix calls this logic with varied parameters
- `aruco/depth_refine.py` — Updated `refine_extrinsics_with_depth` with `loss`, `f_scale`, `confidence_map` params
Acceptance Criteria:
- `--benchmark-matrix` flag exists in the CLI
- When enabled, 4 configurations are run per camera
- The output table is printed to stdout
- Benchmark results appear in the output JSON under a `benchmark` key
- `uv run pytest tests/ -q` → all pass
Agent-Executed QA Scenarios:
Scenario: Benchmark flag in CLI help
- Tool: Bash
- Steps:
  1. Run: `uv run python calibrate_extrinsics.py --help | grep benchmark`
  2. Assert: output contains "--benchmark-matrix"
- Expected Result: Flag is present
- Evidence: Help text output

Commit: YES
- Message: `feat(calibrate): add --benchmark-matrix for comparing refinement configurations`
- Files: `calibrate_extrinsics.py`, `tests/test_benchmark.py`
- Pre-commit: `uv run pytest tests/ -q`
7. Documentation Update
What to do:
- Update `docs/calibrate-extrinsics-workflow.md`:
  - Add the new CLI flags: `--use-confidence-weights`, `--benchmark-matrix`
  - Update the "Depth Verification & Refinement" section with the new optimizer details
  - Update the "Refinement" section: document `least_squares` with `soft_l1` loss, `f_scale`, and confidence weighting
  - Add a "Best-Frame Selection" section explaining the scoring formula
  - Add a "Diagnostics" section documenting the new output JSON fields
  - Update the "Example Workflow" commands to show the new flags
  - Mark the "Known Unexpected Behavior" unit mismatch section as RESOLVED with a description of the fix
Must NOT do:
- Do NOT rewrite unrelated documentation sections
- Do NOT add tutorial-style content
Recommended Agent Profile:
- Category: `writing`
- Reason: Pure documentation writing
- Skills: []
Parallelization:
- Can Run In Parallel: NO
- Parallel Group: Wave 4 (final)
- Blocks: None
- Blocked By: All previous tasks
References:
Pattern References:
- `docs/calibrate-extrinsics-workflow.md` — Entire file. Follow the existing section structure and formatting
Acceptance Criteria:
- New CLI flags documented
- `least_squares` optimizer documented with parameter explanations
- Best-frame selection documented
- Unit mismatch section updated as resolved
- Example commands include new flags
Commit: YES
- Message: `docs: update calibrate-extrinsics-workflow for robust refinement changes`
- Files: `docs/calibrate-extrinsics-workflow.md`
- Pre-commit: `uv run pytest tests/ -q`
Commit Strategy
| After Task | Message | Files | Verification |
|---|---|---|---|
| 1 | fix(svo): harden depth units — set coordinate_units=METER, guard /1000 conversion | `aruco/svo_sync.py`, tests | `uv run pytest tests/ -q` |
| 2 | feat(refine): replace L-BFGS-B MSE with least_squares soft-L1 robust optimizer | `aruco/depth_refine.py`, tests | `uv run pytest tests/ -q` |
| 3 | feat(refine): add confidence-weighted depth residuals with --use-confidence-weights flag | `aruco/depth_verify.py`, `aruco/depth_refine.py`, `calibrate_extrinsics.py`, tests | `uv run pytest tests/ -q` |
| 4 | feat(calibrate): replace naive frame selection with quality-scored best-frame | `calibrate_extrinsics.py`, tests | `uv run pytest tests/ -q` |
| 5 | feat(refine): add rich optimizer diagnostics and acceptance gates | `aruco/depth_refine.py`, `calibrate_extrinsics.py`, tests | `uv run pytest tests/ -q` |
| 6 | feat(calibrate): add --benchmark-matrix for comparing refinement configurations | `calibrate_extrinsics.py`, tests | `uv run pytest tests/ -q` |
| 7 | docs: update calibrate-extrinsics-workflow for robust refinement changes | `docs/calibrate-extrinsics-workflow.md` | `uv run pytest tests/ -q` |
Success Criteria
Verification Commands
uv run pytest tests/ -q # Expected: all pass, 0 failures
uv run pytest tests/test_depth_refine.py -v # Expected: all tests pass including new robust/confidence tests
Final Checklist
- All "Must Have" items present
- All "Must NOT Have" items absent
- All tests pass (`uv run pytest tests/ -q`)
- Output JSON backward compatible (existing fields preserved, new fields additive)
- Default CLI behavior unchanged (new features opt-in)
- Optimizer actually converges on synthetic test data (success=True, nfev > 1)