# Multi-Frame Depth Pooling for Extrinsic Calibration

## TL;DR

> **Quick Summary**: Replace single-best-frame depth verification/refinement with top-N temporal pooling to reduce noise sensitivity and improve calibration robustness, while keeping the existing verify/refine function signatures untouched.
>
> **Deliverables**:
> - New `pool_depth_maps()` utility function in `aruco/depth_pool.py`
> - Extended frame collection (top-N per camera) in the main loop
> - New `--depth-pool-size` CLI option (default 1 = backward compatible)
> - Unit tests for pooling, fallback, and N=1 equivalence
> - E2E smoke comparison (pooled vs single-frame RMSE)
>
> **Estimated Effort**: Medium
> **Parallel Execution**: YES — 3 waves
> **Critical Path**: Task 1 → Task 3 → Task 5 → Task 7

---

## Context

### Original Request

User asked: "Is `apply_depth_verify_refine_postprocess` optimal? When `depth_mode` is not NONE, every frame computes depth regardless of whether it's used. Is there a better way to utilize every depth map when verify/refine is enabled?"
### Interview Summary

**Key Discussions**:
- Oracle confirmed that single-best-frame selection is biased toward simplicity but leaves accuracy on the table
- Recommended pooling the top 3–5 frames temporally, with confidence gating
- Phased approach: quick win (pooling), medium (weighted selection), advanced (joint optimization)

**Research Findings**:
- `calibrate_extrinsics.py:682-714`: the current loop stores exactly one `verification_frames[serial]` entry per camera (best-scored)
- `aruco/depth_verify.py`: `verify_extrinsics_with_depth()` accepts a single `depth_map` + `confidence_map`
- `aruco/depth_refine.py`: `refine_extrinsics_with_depth()` accepts a single `depth_map` + `confidence_map`
- `aruco/svo_sync.py:FrameData`: each frame already carries `depth_map` + `confidence_map`
- Memory: each depth map is ~3.5 MB (720×1280 float32); storing 5 per camera = ~17.5 MB/cam, ~70 MB total for 4 cameras — acceptable
- Existing tests use synthetic depth maps, so new tests can follow the same pattern

### Metis Review

**Identified Gaps** (addressed):
- Camera motion during capture → addressed by assuming cameras are static during calibration; documented as a guardrail
- "Top-N by score" may not correlate with depth quality → addressed by keeping confidence gating in the pooling function
- Fewer than N frames available → addressed with explicit fallback behavior
- All pixels invalid after gating → addressed with a fallback to the best single frame
- N=1 must reproduce the baseline exactly → addressed with an explicit equivalence test

---

## Work Objectives

### Core Objective

Pool depth maps from the top-N scored frames per camera to produce a more robust single depth target for verification and refinement, reducing sensitivity to single-frame noise.
### Concrete Deliverables
- `aruco/depth_pool.py` — new module with the `pool_depth_maps()` function
- Modified `calibrate_extrinsics.py` — top-N collection + pooling integration + CLI flag
- `tests/test_depth_pool.py` — unit tests for the pooling logic
- Updated `tests/test_depth_cli_postprocess.py` — integration test for N=1 equivalence

### Definition of Done
- [x] `uv run pytest -k "depth_pool"` → all tests pass
- [x] `uv run basedpyright` → 0 new errors
- [x] `--depth-pool-size 1` produces output identical to the current baseline
- [x] `--depth-pool-size 5` produces equal or lower post-RMSE on test SVOs

### Must Have
- Feature-flagged behind `--depth-pool-size` (default 1)
- Pure function `pool_depth_maps()` with deterministic output
- Confidence gating during pooling
- Graceful fallback when pooling fails (insufficient valid pixels)
- N=1 code path identical to current behavior

### Must NOT Have (Guardrails)
- NO changes to the `verify_extrinsics_with_depth()` or `refine_extrinsics_with_depth()` signatures
- NO scoring function redesign (use the existing `score_frame()` as-is)
- NO cross-camera fusion or spatial alignment/warping between frames
- NO GPU acceleration or threading changes
- NO new artifact files or dashboards
- NO unbounded history — enforce a max pool size cap (10)
- NO optical flow, Kalman filters, or temporal alignment beyond frame selection

---

## Verification Strategy (MANDATORY)

> **UNIVERSAL RULE: ZERO HUMAN INTERVENTION**
>
> ALL tasks in this plan MUST be verifiable WITHOUT any human action.
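The "no unbounded history" guardrail above pairs naturally with the top-N collection: one way to keep only the best-scored frames per camera is a bounded min-heap. The helpers below (`push_top_n`, `top_frames`, the `(score, seq, frame)` tuple layout) are illustrative names, not part of the existing codebase.

```python
import heapq
from typing import Any

MAX_POOL_SIZE = 10  # guardrail: hard cap, no unbounded history


def push_top_n(
    pool: list[tuple[float, int, dict[str, Any]]],
    score: float,
    seq: int,
    frame: dict[str, Any],
    pool_size: int,
) -> None:
    """Keep only the pool_size highest-scored frames in a min-heap."""
    pool_size = min(pool_size, MAX_POOL_SIZE)
    entry = (score, seq, frame)  # seq breaks score ties so dicts are never compared
    if len(pool) < pool_size:
        heapq.heappush(pool, entry)
    elif score > pool[0][0]:
        # New frame beats the current worst retained frame: swap it in.
        heapq.heapreplace(pool, entry)


def top_frames(pool: list[tuple[float, int, dict[str, Any]]]) -> list[dict[str, Any]]:
    """Return retained frames sorted by score descending."""
    return [f for _, _, f in sorted(pool, key=lambda e: e[0], reverse=True)]
```

With `pool_size=1` this degenerates to keeping only the single best frame, which is the behavioral equivalence the plan requires for the default path.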
### Test Decision
- **Infrastructure exists**: YES
- **Automated tests**: YES (tests-after, matching the existing pattern)
- **Framework**: pytest (via `uv run pytest`)

### Agent-Executed QA Scenarios (MANDATORY — ALL tasks)

**Verification Tool by Deliverable Type:**

| Type | Tool | How Agent Verifies |
|------|------|--------------------|
| Library/Module | Bash (`uv run pytest`) | Run targeted tests, compare output |
| CLI | Bash (`uv run calibrate_extrinsics.py`) | Run with flags, check JSON output |
| Type safety | Bash (`uv run basedpyright`) | Zero new errors |

---

## Execution Strategy

### Parallel Execution Waves

```
Wave 1 (Start Immediately):
├── Task 1: Create pool_depth_maps() utility
└── Task 2: Unit tests for pool_depth_maps()

Wave 2 (After Wave 1):
├── Task 3: Extend main loop to collect top-N frames
├── Task 4: Add --depth-pool-size CLI option
└── Task 5: Integrate pooling into postprocess function

Wave 3 (After Wave 2):
├── Task 6: N=1 equivalence regression test
└── Task 7: E2E smoke comparison (pooled vs single-frame)
```

### Dependency Matrix

| Task | Depends On | Blocks | Can Parallelize With |
|------|------------|--------|----------------------|
| 1 | None | 2, 3, 5 | 2 |
| 2 | 1 | None | 1 |
| 3 | 1 | 5, 6 | 4 |
| 4 | None | 5 | 3 |
| 5 | 1, 3, 4 | 6, 7 | None |
| 6 | 5 | None | 7 |
| 7 | 5 | None | 6 |

---

## TODOs

- [x] 1. Create `pool_depth_maps()` utility in `aruco/depth_pool.py`

**What to do**:
- Create a new file `aruco/depth_pool.py`
- Implement `pool_depth_maps(depth_maps: list[np.ndarray], confidence_maps: list[np.ndarray | None], confidence_thresh: float = 50.0, min_valid_count: int = 1) -> tuple[np.ndarray, np.ndarray | None]`
- Algorithm:
  1. Stack depth maps along a new axis → shape (N, H, W)
  2. At each pixel position, mask invalid values (NaN, inf, ≤ 0) AND confidence-rejected pixels (conf > thresh)
  3. Compute the per-pixel **median** across valid frames → pooled depth
  4. For confidence: compute the per-pixel **minimum** (most confident) across frames → pooled confidence
  5. Pixels with fewer than `min_valid_count` valid observations → set to NaN in the pooled depth
- Handle edge cases:
  - Empty input list → raise `ValueError`
  - Single map (N=1) → return a copy of the input (exact equivalence path)
  - All maps invalid at a pixel → NaN in the output
  - Shape mismatch across maps → raise `ValueError`
  - Mixed `None` confidence maps → pool only the non-`None` maps, or return `None` if all are `None`
- Add type hints and a docstring with Args/Returns

**Must NOT do**:
- No weighted mean (median is more robust to outliers; keep Phase 1 simple)
- No spatial alignment or warping

**Recommended Agent Profile**:
- **Category**: `quick` — single focused module, pure function, no complex dependencies
- **Skills**: [] — no special skills needed; standard Python/numpy work

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 1 (with Task 2)
- **Blocks**: Tasks 2, 3, 5
- **Blocked By**: None

**References**:

**Pattern References**:
- `aruco/depth_verify.py:39-79` — `compute_depth_residual()` shows how invalid depth is handled (NaN, ≤0, window median pattern)
- `aruco/depth_verify.py:27-36` — `get_confidence_weight()` shows the confidence semantics (ZED: 1 = most confident, 100 = least; threshold default 50)

**API/Type References**:
- `aruco/svo_sync.py:10-18` — `FrameData` dataclass: `depth_map: np.ndarray | None`, `confidence_map: np.ndarray | None`

**Test References**:
- `tests/test_depth_verify.py:36-60` — pattern for creating synthetic depth maps and testing residual computation

**WHY Each Reference Matters**:
- `depth_verify.py:39-79`: defines the invalid-depth encoding convention (NaN/≤0) that pooling must respect
- `depth_verify.py:27-36`: defines the confidence semantics and threshold convention; the pooling gate must match
- `svo_sync.py:10-18`: defines the data types the pooling function will receive

**Acceptance Criteria**:
- [ ] File `aruco/depth_pool.py` exists with the `pool_depth_maps()` function
- [ ] Function handles N=1 by returning an exact copy of the input
- [ ] Function raises `ValueError` on empty input or shape mismatch
- [ ] `uv run basedpyright aruco/depth_pool.py` → 0 errors

**Agent-Executed QA Scenarios:**

```
Scenario: Module imports without error
Tool: Bash
Steps:
  1. uv run python -c "from aruco.depth_pool import pool_depth_maps; print('OK')"
  2. Assert: stdout contains "OK"
Expected Result: Clean import
```

**Commit**: YES
- Message: `feat(aruco): add pool_depth_maps utility for multi-frame depth pooling`
- Files: `aruco/depth_pool.py`

---

- [x] 2. Unit tests for `pool_depth_maps()`

**What to do**:
- Create `tests/test_depth_pool.py`
- Test cases:
  1. **Single map (N=1)**: output equals input exactly
  2. **Two maps, clean**: median of the two values at each pixel
  3. **Three maps with NaN**: median ignores NaN pixels correctly
  4. **Confidence gating**: pixels above the threshold are excluded from the median
  5. **All invalid at a pixel**: output is NaN
  6. **Empty input**: raises `ValueError`
  7. **Shape mismatch**: raises `ValueError`
  8. **min_valid_count**: pixel with fewer valid observations → NaN
  9. **None confidence maps**: graceful handling (pools depth only, returns `None` confidence)
- Use `numpy.testing.assert_allclose` for numerical checks
- Use `pytest.raises(ValueError, match=...)` for error cases

**Must NOT do**:
- No integration with `calibrate_extrinsics.py` yet (unit tests only)

**Recommended Agent Profile**:
- **Category**: `quick` — focused test-file creation following existing patterns
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 1 (with Task 1)
- **Blocks**: None
- **Blocked By**: Task 1

**References**:

**Test References**:
- `tests/test_depth_verify.py:36-60` — pattern for synthetic depth map creation and assertion style
- `tests/test_depth_refine.py:10-18` — pattern for roundtrip/equivalence testing

**WHY Each Reference Matters**:
- Shows the exact assertion patterns and synthetic-data conventions used in this codebase

**Acceptance Criteria**:
- [ ] `uv run pytest tests/test_depth_pool.py -v` → all tests pass
- [ ] At least 9 test cases covering the enumerated scenarios

**Agent-Executed QA Scenarios:**

```
Scenario: All pool tests pass
Tool: Bash
Steps:
  1. uv run pytest tests/test_depth_pool.py -v
  2. Assert: exit code 0
  3. Assert: output contains "passed" with 0 "failed"
Expected Result: All tests green
```

**Commit**: YES (groups with Task 1)
- Message: `test(aruco): add unit tests for pool_depth_maps`
- Files: `tests/test_depth_pool.py`

---

- [x] 3. Extend main loop to collect top-N frames per camera

**What to do**:
- In `calibrate_extrinsics.py`, modify the verification-frame collection (lines ~682-714):
  - Change `verification_frames` from `dict[serial, single_frame_dict]` to `dict[serial, list[frame_dict]]`
  - Maintain each list sorted by score (descending), truncated to `depth_pool_size`
  - Use `heapq` or sorted insertion to keep the top-N efficiently
  - When `depth_pool_size == 1`, behavior must be identical to current (store only the best frame)
- Update all downstream references to `verification_frames` that assume the single-frame structure
- The `first_frames` dict remains unchanged (it exists for benchmarking, a separate concern)

**Must NOT do**:
- Do NOT change the scoring function `score_frame()`
- Do NOT change the `FrameData` structure
- Do NOT store frames outside the sampled loop (collect only from frames that already have depth)

**Recommended Agent Profile**:
- **Category**: `unspecified-low` — surgical modification to existing loop logic; requires careful attention to existing consumers
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 2 (with Task 4)
- **Blocks**: Tasks 5, 6
- **Blocked By**: Task 1

**References**:

**Pattern References**:
- `calibrate_extrinsics.py:620-760` — main loop where verification frames are collected; lines 682-714 are the critical section
- `calibrate_extrinsics.py:118-258` — `apply_depth_verify_refine_postprocess()`, which consumes `verification_frames`

**API/Type References**:
- `aruco/svo_sync.py:10-18` — the `FrameData` structure stored in `verification_frames`

**WHY Each Reference Matters**:
- `calibrate_extrinsics.py:682-714`: the exact code being modified; must understand the score comparison and dict storage
- `calibrate_extrinsics.py:118-258`: must understand how `verification_frames` is consumed downstream to know which structure changes are safe

**Acceptance Criteria**:
- [ ] `verification_frames[serial]` is now a list of frame dicts, sorted by score descending
- [ ] List length ≤ `depth_pool_size` for each camera
- [ ] When `depth_pool_size == 1`, the list has exactly one element matching the current best-frame behavior
- [ ] `uv run basedpyright calibrate_extrinsics.py` → 0 new errors

**Agent-Executed QA Scenarios:**

```
Scenario: Structural change does not break imports
Tool: Bash
Steps:
  1. uv run python -c "from calibrate_extrinsics import apply_depth_verify_refine_postprocess; print('OK')"
  2. Assert: stdout contains "OK"
Expected Result: No import errors from the structural changes
```

**Commit**: NO (groups with Task 5)

---

- [x] 4. Add `--depth-pool-size` CLI option

**What to do**:
- Add a click option to `main()` in `calibrate_extrinsics.py`:

```python
@click.option(
    "--depth-pool-size",
    default=1,
    type=click.IntRange(min=1, max=10),
    help="Number of top-scored frames to pool for depth verification/refinement (1=single best frame, >1=median pooling).",
)
```

- Pass it through to the function signature
- Add it to the `apply_depth_verify_refine_postprocess()` parameters (or pass `depth_pool_size` to control pooling)
- Update the `--depth-mode` help text if needed to mention the pooling interaction

**Must NOT do**:
- Do NOT implement the actual pooling logic here (that's Task 5)
- Do NOT allow values > 10 (memory guardrail)

**Recommended Agent Profile**:
- **Category**: `quick` — single CLI option addition, boilerplate only
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 2 (with Task 3)
- **Blocks**: Task 5
- **Blocked By**: None

**References**:

**Pattern References**:
- `calibrate_extrinsics.py:474-478` — existing `--max-samples` option as the pattern for an optional integer CLI flag
- `calibrate_extrinsics.py:431-436` — `--depth-mode` option pattern

**WHY Each Reference Matters**:
- Shows the exact click option pattern and placement convention in this file

**Acceptance Criteria**:
- [ ] `uv run calibrate_extrinsics.py --help` shows `--depth-pool-size` with its description
- [ ] Default value is 1
- [ ] Values outside 1-10 are rejected by click

**Agent-Executed QA Scenarios:**

```
Scenario: CLI option appears in help
Tool: Bash
Steps:
  1. uv run calibrate_extrinsics.py --help
  2. Assert: output contains "--depth-pool-size"
  3. Assert: output contains "1=single best frame"
Expected Result: Option visible with correct help text

Scenario: Invalid pool size rejected
Tool: Bash
Steps:
  1. uv run calibrate_extrinsics.py --depth-pool-size 0 --help 2>&1 || true
  2. Assert: output contains an error or "Invalid value"
Expected Result: Click rejects the out-of-range value
```

**Commit**: NO (groups with Task 5)

---

- [x] 5. Integrate pooling into `apply_depth_verify_refine_postprocess()`

**What to do**:
- Modify `apply_depth_verify_refine_postprocess()` to accept a `depth_pool_size: int = 1` parameter
- When `depth_pool_size > 1` and multiple frames are available:
  1. Extract the depth_maps and confidence_maps from the top-N frame list
  2. Call `pool_depth_maps()` to produce the pooled depth/confidence
  3. Use the pooled maps for `verify_extrinsics_with_depth()` and `refine_extrinsics_with_depth()`
  4. Use the **best-scored frame's** `ids` for marker corner lookup (it has the best detection quality)
- When `depth_pool_size == 1` OR only 1 frame is available:
  - Use the existing single-frame path exactly (no pooling call)
- Add pooling metadata to the JSON output: `"depth_pool": {"pool_size_requested": N, "pool_size_actual": M, "pooled": true/false}`
- Wire `depth_pool_size` from `main()` through to this function
- Handle the edge case: if pooling produces a map with fewer valid points than the best single frame, log a warning and fall back to the single frame

**Must NOT do**:
- Do NOT change the `verify_extrinsics_with_depth()` or `refine_extrinsics_with_depth()` function signatures
- Do NOT add new CLI output formats

**Recommended Agent Profile**:
- **Category**: `unspecified-high` — core integration task with multiple touchpoints; requires careful wiring and edge-case handling
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: NO
- **Parallel Group**: Sequential (after Wave 2)
- **Blocks**: Tasks 6, 7
- **Blocked By**: Tasks 1, 3, 4

**References**:

**Pattern References**:
- `calibrate_extrinsics.py:118-258` — the full `apply_depth_verify_refine_postprocess()` function being modified
- `calibrate_extrinsics.py:140-156` — frame-data extraction pattern (accessing `vf["frame"]`, `vf["ids"]`)
- `calibrate_extrinsics.py:158-180` — verification call pattern
- `calibrate_extrinsics.py:182-245` — refinement call pattern

**API/Type References**:
- `aruco/depth_pool.py:pool_depth_maps()` — the pooling function (Task 1 output)
- `aruco/depth_verify.py:119-179` — `verify_extrinsics_with_depth()` signature
- `aruco/depth_refine.py:71-227` — `refine_extrinsics_with_depth()` signature

**WHY Each Reference Matters**:
- `calibrate_extrinsics.py:140-156`: shows how frame data is currently extracted; must be adapted for the list-of-frames structure
- `depth_pool.py`: the function being called for multi-frame pooling
- `depth_verify.py`/`depth_refine.py`: confirms the signatures remain unchanged (just pass a different `depth_map`)

**Acceptance Criteria**:
- [ ] With `--depth-pool-size 1`: output JSON identical to the baseline (no `depth_pool` metadata needed for N=1)
- [ ] With `--depth-pool-size 5`: output JSON includes `depth_pool` metadata; verify/refine use the pooled maps
- [ ] Fallback to a single frame is logged when pooling produces fewer valid points
- [ ] `uv run basedpyright calibrate_extrinsics.py` → 0 new errors

**Agent-Executed QA Scenarios:**

```
Scenario: Pool size 1 produces baseline-equivalent output
Tool: Bash
Preconditions: output/ directory with SVO files
Steps:
  1. uv run calibrate_extrinsics.py -s output/ -m aruco/markers/standard_box_markers_600mm.parquet --aruco-dictionary DICT_APRILTAG_36h11 --verify-depth --no-preview --max-samples 5 --depth-pool-size 1 --output output/_test_pool1.json
  2. Assert: exit code 0
  3. Assert: output/_test_pool1.json exists and contains depth_verify entries
Expected Result: Runs cleanly, produces valid output

Scenario: Pool size 5 runs and includes pool metadata
Tool: Bash
Preconditions: output/ directory with SVO files
Steps:
  1. uv run calibrate_extrinsics.py -s output/ -m aruco/markers/standard_box_markers_600mm.parquet --aruco-dictionary DICT_APRILTAG_36h11 --verify-depth --refine-depth --no-preview --max-samples 10 --depth-pool-size 5 --output output/_test_pool5.json
  2. Assert: exit code 0
  3. Parse output/_test_pool5.json
  4. Assert: at least one camera entry contains a "depth_pool" key
Expected Result: Pooling metadata present in the output
```

**Commit**: YES
- Message: `feat(calibrate): integrate multi-frame depth pooling with --depth-pool-size flag`
- Files: `calibrate_extrinsics.py`, `aruco/depth_pool.py`, `tests/test_depth_pool.py`
- Pre-commit: `uv run pytest tests/test_depth_pool.py && uv run basedpyright calibrate_extrinsics.py`

---

- [x] 6. N=1 equivalence regression test

**What to do**:
- Add a test in `tests/test_depth_cli_postprocess.py` (or `tests/test_depth_pool.py`):
  - Create a synthetic scenario with known depth maps and marker geometry
  - Run `apply_depth_verify_refine_postprocess()` with pool_size=1 using the old single-frame structure
  - Run with pool_size=1 using the new list-of-frames structure
  - Assert the outputs are numerically identical (atol=0)
- This proves the refactor preserves backward compatibility

**Must NOT do**:
- No E2E CLI test here (that's Task 7)

**Recommended Agent Profile**:
- **Category**: `quick` — focused regression test with synthetic data
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 3 (with Task 7)
- **Blocks**: None
- **Blocked By**: Task 5

**References**:

**Test References**:
- `tests/test_depth_cli_postprocess.py` — existing integration test patterns
- `tests/test_depth_verify.py:36-60` — synthetic depth map creation pattern

**Acceptance Criteria**:
- [ ] `uv run pytest -k "pool_size_1_equivalence"` → passes
- [ ] Test asserts exact numerical equality between the old-path and new-path outputs

**Commit**: YES
- Message: `test(calibrate): add N=1 equivalence regression test for depth pooling`
- Files: `tests/test_depth_pool.py` or `tests/test_depth_cli_postprocess.py`

---

- [x] 7. E2E smoke comparison: pooled vs single-frame RMSE

**What to do**:
- Run calibration on the test SVOs with `--depth-pool-size 1` and `--depth-pool-size 5`
- Compare:
  - Post-refinement RMSE per camera
  - Depth-normalized RMSE
  - CSV residual distribution (mean_abs, p50, p90)
  - Runtime (wall clock)
- Document the results in a brief summary (stdout or saved to a comparison file)
- **Success criterion**: pooled RMSE ≤ single-frame RMSE for the majority of cameras; runtime overhead < 25%

**Must NOT do**:
- No automated pass/fail assertion on real data (the metrics are directional, not deterministic)
- No permanent benchmark infrastructure

**Recommended Agent Profile**:
- **Category**: `quick` — run two commands, compare the JSON output, summarize
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 3 (with Task 6)
- **Blocks**: None
- **Blocked By**: Task 5

**References**:

**Pattern References**:
- Previous smoke runs in this session: `output/e2e_refine_depth_full_neural_plus.json` as the baseline

**Acceptance Criteria**:
- [ ] Both runs complete without error
- [ ] A comparison summary is printed showing per-camera RMSE for pool=1 vs pool=5
- [ ] Runtime logged for both runs

**Agent-Executed QA Scenarios:**

```
Scenario: Compare pool=1 vs pool=5 on full SVOs
Tool: Bash
Steps:
  1. Run with --depth-pool-size 1 --verify-depth --refine-depth --output output/_compare_pool1.json
  2. Run with --depth-pool-size 5 --verify-depth --refine-depth --output output/_compare_pool5.json
  3. Parse both JSON files
  4. Print a per-camera post-RMSE comparison table
  5. Print the runtime difference
Expected Result: Both complete; comparison table printed
Evidence: Terminal output captured
```

**Commit**: NO (no code change; verification only)

---

## Commit Strategy

| After Task | Message | Files | Verification |
|------------|---------|-------|--------------|
| 1+2 | `feat(aruco): add pool_depth_maps utility with tests` | `aruco/depth_pool.py`, `tests/test_depth_pool.py` | `uv run pytest tests/test_depth_pool.py` |
| 5 (includes 3+4) | `feat(calibrate): integrate multi-frame depth pooling with --depth-pool-size flag` | `calibrate_extrinsics.py` | `uv run pytest && uv run basedpyright` |
| 6 | `test(calibrate): add N=1 equivalence regression test for depth pooling` | `tests/test_depth_pool.py` or `tests/test_depth_cli_postprocess.py` | `uv run pytest -k pool_size_1` |

---

## Success Criteria

### Verification Commands

```bash
uv run pytest tests/test_depth_pool.py -v          # All pool unit tests pass
uv run pytest -k "pool_size_1_equivalence" -v      # N=1 regression passes
uv run basedpyright                                # 0 new errors
uv run calibrate_extrinsics.py --help | grep pool  # CLI flag visible
```

### Final Checklist
- [x] `pool_depth_maps()` pure function exists with full edge-case handling
- [x] `--depth-pool-size` CLI option with default=1, max=10
- [x] N=1 produces identical results to the baseline
- [x] All existing tests still pass
- [x] Type checker clean
- [x] E2E comparison shows pooled RMSE ≤ single-frame RMSE for the majority of cameras
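The per-camera comparison step in Task 7 could be scripted as below. The JSON layout (`"cameras"` keyed by serial, with a `"post_rmse"` field) is hypothetical — adjust the keys to match whatever `calibrate_extrinsics.py` actually emits.

```python
import json
from pathlib import Path


def print_rmse_comparison(path_pool1: str, path_pool5: str) -> None:
    """Print per-camera post-refinement RMSE for pool=1 vs pool=5.

    NOTE: the "cameras"/"post_rmse" keys are assumed, not confirmed
    against the real output schema of calibrate_extrinsics.py.
    """
    r1 = json.loads(Path(path_pool1).read_text())
    r5 = json.loads(Path(path_pool5).read_text())
    print(f"{'serial':<12} {'pool=1':>10} {'pool=5':>10} {'delta':>10}")
    for serial, cam1 in r1["cameras"].items():
        a = cam1["post_rmse"]
        b = r5["cameras"][serial]["post_rmse"]
        # Negative delta means pooling lowered the RMSE for this camera.
        print(f"{serial:<12} {a:>10.4f} {b:>10.4f} {b - a:>+10.4f}")
```

This keeps Task 7 directional, as the guardrails require: the table is printed for a human-readable summary, with no automated pass/fail assertion on real data.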