zed-playground/py_workspace/.sisyphus/plans/finished/multi-frame-depth-pooling.md

Multi-Frame Depth Pooling for Extrinsic Calibration

TL;DR

Quick Summary: Replace single-best-frame depth verification/refinement with top-N temporal pooling to reduce noise sensitivity and improve calibration robustness, while keeping existing verify/refine function signatures untouched.

Deliverables:

  • New pool_depth_maps() utility function in aruco/depth_pool.py
  • Extended frame collection (top-N per camera) in main loop
  • New --depth-pool-size CLI option (default 1 = backward compatible)
  • Unit tests for pooling, fallback, and N=1 equivalence
  • E2E smoke comparison (pooled vs single-frame RMSE)

Estimated Effort: Medium
Parallel Execution: YES — 3 waves
Critical Path: Task 1 → Task 3 → Task 5 → Task 7


Context

Original Request

User asked: "Is apply_depth_verify_refine_postprocess optimal? When depth_mode is not NONE, every frame computes depth regardless of whether it's used. Is there a better way to utilize every depth map when verify/refine is enabled?"

Interview Summary

Key Discussions:

  • Oracle confirmed single-best-frame is simplicity-biased but leaves accuracy on the table
  • Recommended top 3–5 frame temporal pooling with confidence gating
  • Phased approach: quick win (pooling), medium (weighted selection), advanced (joint optimization)

Research Findings:

  • calibrate_extrinsics.py:682-714: Current loop stores exactly one verification_frames[serial] per camera (best-scored)
  • aruco/depth_verify.py: verify_extrinsics_with_depth() accepts single depth_map + confidence_map
  • aruco/depth_refine.py: refine_extrinsics_with_depth() accepts single depth_map + confidence_map
  • aruco/svo_sync.py:FrameData: Each frame already carries depth_map + confidence_map
  • Memory: each depth map is ~3.5MB (720×1280 float32); storing 5 per camera = ~17.5MB/cam, ~70MB total for 4 cameras — acceptable
  • Existing tests use synthetic depth maps, so new tests can follow same pattern

Metis Review

Identified Gaps (addressed):

  • Camera motion during capture → addressed via assumption that cameras are static during calibration; documented as guardrail
  • "Top-N by score" may not correlate with depth quality → addressed by keeping confidence gating in pooling function
  • Fewer than N frames available → addressed with explicit fallback behavior
  • All pixels invalid after gating → addressed with fallback to best single frame
  • N=1 must reproduce baseline exactly → addressed with explicit equivalence test

Work Objectives

Core Objective

Pool depth maps from the top-N scored frames per camera to produce a more robust single depth target for verification and refinement, reducing sensitivity to single-frame noise.

Concrete Deliverables

  • aruco/depth_pool.py — new module with pool_depth_maps() function
  • Modified calibrate_extrinsics.py — top-N collection + pooling integration + CLI flag
  • tests/test_depth_pool.py — unit tests for pooling logic
  • Updated tests/test_depth_cli_postprocess.py — integration test for N=1 equivalence

Definition of Done

  • uv run pytest -k "depth_pool" → all tests pass
  • uv run basedpyright → 0 new errors
  • --depth-pool-size 1 produces identical output to current baseline
  • --depth-pool-size 5 produces equal or lower post-RMSE on test SVOs

Must Have

  • Feature-flagged behind --depth-pool-size (default 1)
  • Pure function pool_depth_maps() with deterministic output
  • Confidence gating during pooling
  • Graceful fallback when pooling fails (insufficient valid pixels)
  • N=1 code path identical to current behavior

Must NOT Have (Guardrails)

  • NO changes to verify_extrinsics_with_depth() or refine_extrinsics_with_depth() signatures
  • NO scoring function redesign (use existing score_frame() as-is)
  • NO cross-camera fusion or spatial alignment/warping between frames
  • NO GPU acceleration or threading changes
  • NO new artifact files or dashboards
  • NO "unbounded history" — enforce max pool size cap (10)
  • NO optical flow, Kalman filters, or temporal alignment beyond frame selection

Verification Strategy (MANDATORY)

UNIVERSAL RULE: ZERO HUMAN INTERVENTION

ALL tasks in this plan MUST be verifiable WITHOUT any human action.

Test Decision

  • Infrastructure exists: YES
  • Automated tests: YES (Tests-after, matching existing pattern)
  • Framework: pytest (via uv run pytest)

Agent-Executed QA Scenarios (MANDATORY — ALL tasks)

Verification Tool by Deliverable Type:

| Type | Tool | How Agent Verifies |
| --- | --- | --- |
| Library/Module | Bash (uv run pytest) | Run targeted tests, compare output |
| CLI | Bash (uv run calibrate_extrinsics.py) | Run with flags, check JSON output |
| Type safety | Bash (uv run basedpyright) | Zero new errors |

Execution Strategy

Parallel Execution Waves

Wave 1 (Start Immediately):
├── Task 1: Create pool_depth_maps() utility
└── Task 2: Unit tests for pool_depth_maps()

Wave 2 (After Wave 1):
├── Task 3: Extend main loop to collect top-N frames
├── Task 4: Add --depth-pool-size CLI option
└── Task 5: Integrate pooling into postprocess function

Wave 3 (After Wave 2):
├── Task 6: N=1 equivalence regression test
└── Task 7: E2E smoke comparison (pooled vs single-frame)

Dependency Matrix

| Task | Depends On | Blocks | Can Parallelize With |
| --- | --- | --- | --- |
| 1 | None | 2, 3, 5 | 2 |
| 2 | 1 | None | 1 |
| 3 | 1 | 5, 6 | 4 |
| 4 | None | 5 | 3 |
| 5 | 1, 3, 4 | 6, 7 | None |
| 6 | 5 | None | 7 |
| 7 | 5 | None | 6 |

TODOs

  • 1. Create pool_depth_maps() utility in aruco/depth_pool.py

    What to do:

    • Create new file aruco/depth_pool.py
    • Implement pool_depth_maps(depth_maps: list[np.ndarray], confidence_maps: list[np.ndarray | None], confidence_thresh: float = 50.0, min_valid_count: int = 1) -> tuple[np.ndarray, np.ndarray | None]
    • Algorithm:
      1. Stack depth maps along new axis → shape (N, H, W)
      2. For each pixel position, mask invalid values (NaN, inf, ≤ 0) AND confidence-rejected pixels (conf > thresh)
      3. Compute per-pixel median across valid frames → pooled depth
      4. For confidence: compute per-pixel minimum (most confident) across frames → pooled confidence
      5. Pixels with < min_valid_count valid observations → set to NaN in pooled depth
    • Handle edge cases:
      • Empty input list → raise ValueError
      • Single map (N=1) → return copy of input (exact equivalence path)
      • All maps invalid at a pixel → NaN in output
      • Shape mismatch across maps → raise ValueError
      • Mixed None confidence maps → pool only non-None, or return None if all None
    • Add type hints, docstring with Args/Returns

    Must NOT do:

    • No weighted mean (median is more robust to outliers; keep simple for Phase 1)
    • No spatial alignment or warping

    Recommended Agent Profile:

    • Category: quick
      • Reason: Single focused module, pure function, no complex dependencies
    • Skills: []
      • No special skills needed; standard Python/numpy work

    Parallelization:

    • Can Run In Parallel: YES
    • Parallel Group: Wave 1 (with Task 2)
    • Blocks: Tasks 2, 3, 5
    • Blocked By: None

    References:

    Pattern References:

    • aruco/depth_verify.py:39-79 — compute_depth_residual() shows how invalid depth is handled (NaN, ≤0, window median pattern)
    • aruco/depth_verify.py:27-36 — get_confidence_weight() shows confidence semantics (ZED: 1=most confident, 100=least; threshold default 50)

    API/Type References:

    • aruco/svo_sync.py:10-18 — FrameData dataclass: depth_map: np.ndarray | None, confidence_map: np.ndarray | None

    Test References:

    • tests/test_depth_verify.py:36-60 — Pattern for creating synthetic depth maps and testing residual computation

    WHY Each Reference Matters:

    • depth_verify.py:39-79: Defines the invalid-depth encoding convention (NaN/≤0) that pooling must respect
    • depth_verify.py:27-36: Defines confidence semantics and threshold convention; pooling gating must match
    • svo_sync.py:10-18: Defines the data types the pooling function will receive

    Acceptance Criteria:

    • File aruco/depth_pool.py exists with pool_depth_maps() function
    • Function handles N=1 by returning exact copy of input
    • Function raises ValueError on empty input or shape mismatch
    • uv run basedpyright aruco/depth_pool.py → 0 errors

    Agent-Executed QA Scenarios:

    Scenario: Module imports without error
      Tool: Bash
      Steps:
        1. uv run python -c "from aruco.depth_pool import pool_depth_maps; print('OK')"
        2. Assert: stdout contains "OK"
      Expected Result: Clean import
    

    Commit: YES

    • Message: feat(aruco): add pool_depth_maps utility for multi-frame depth pooling
    • Files: aruco/depth_pool.py

  • 2. Unit tests for pool_depth_maps()

    What to do:

    • Create tests/test_depth_pool.py
    • Test cases:
      1. Single map (N=1): output equals input exactly
      2. Two maps, clean: median of two values at each pixel
      3. Three maps with NaN: median ignores NaN pixels correctly
      4. Confidence gating: pixels above threshold excluded from median
      5. All invalid at pixel: output is NaN
      6. Empty input: raises ValueError
      7. Shape mismatch: raises ValueError
      8. min_valid_count: pixel with fewer valid observations → NaN
      9. None confidence maps: graceful handling (pools depth only, returns None confidence)
    • Use numpy.testing.assert_allclose for numerical checks
    • Use pytest.raises(ValueError, match=...) for error cases

    Must NOT do:

    • No integration with calibrate_extrinsics.py yet (unit tests only)

    Recommended Agent Profile:

    • Category: quick
      • Reason: Focused test file creation following existing patterns
    • Skills: []

    Parallelization:

    • Can Run In Parallel: YES
    • Parallel Group: Wave 1 (with Task 1)
    • Blocks: None
    • Blocked By: Task 1

    References:

    Test References:

    • tests/test_depth_verify.py:36-60 — Pattern for synthetic depth map creation and assertion style
    • tests/test_depth_refine.py:10-18 — Pattern for roundtrip/equivalence testing

    WHY Each Reference Matters:

    • Shows the exact assertion patterns and synthetic data conventions used in this codebase

    Acceptance Criteria:

    • uv run pytest tests/test_depth_pool.py -v → all tests pass
    • At least 9 test cases covering the enumerated scenarios

    Agent-Executed QA Scenarios:

    Scenario: All pool tests pass
      Tool: Bash
      Steps:
        1. uv run pytest tests/test_depth_pool.py -v
        2. Assert: exit code 0
        3. Assert: output contains "passed" with 0 "failed"
      Expected Result: All tests green
    

    Commit: YES (groups with Task 1)

    • Message: test(aruco): add unit tests for pool_depth_maps
    • Files: tests/test_depth_pool.py

  • 3. Extend main loop to collect top-N frames per camera

    What to do:

    • In calibrate_extrinsics.py, modify the verification frame collection (lines ~682-714):
      • Change verification_frames from dict[serial, single_frame_dict] to dict[serial, list[frame_dict]]
      • Maintain list sorted by score (descending), truncated to depth_pool_size
      • Use heapq or sorted insertion to keep top-N efficiently
      • When depth_pool_size == 1, behavior must be identical to current (store only best)
    • Update all downstream references to verification_frames that assume single-frame structure
    • The first_frames dict remains unchanged (it's for benchmarking, separate concern)
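
The top-N maintenance described above can be sketched with a bounded min-heap. The helper names (push_top_n, best_first) and the frame-dict shape are illustrative, not the real calibrate_extrinsics.py structures; the counter tie-breaker keeps heapq from ever comparing frame dicts.

```python
import heapq
from itertools import count

_tiebreak = count()

def push_top_n(heap: list, score: float, frame: dict, pool_size: int) -> None:
    """Keep at most pool_size highest-scoring frames (min-heap keyed by score)."""
    item = (score, next(_tiebreak), frame)
    if len(heap) < pool_size:
        heapq.heappush(heap, item)
    elif score > heap[0][0]:
        # New frame beats the current worst retained frame; replace it
        heapq.heapreplace(heap, item)

def best_first(heap: list) -> list[dict]:
    """Return retained frames sorted by score descending, for downstream pooling."""
    return [frame for _, _, frame in sorted(heap, reverse=True)]
```

With pool_size == 1 this degenerates to keeping the single best-scored frame, matching the current behavior.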

    Must NOT do:

    • Do NOT change the scoring function score_frame()
    • Do NOT change FrameData structure
    • Do NOT store frames outside the sampled loop (only collect from frames that already have depth)

    Recommended Agent Profile:

    • Category: unspecified-low
      • Reason: Surgical modification to existing loop logic; requires careful attention to existing consumers
    • Skills: []

    Parallelization:

    • Can Run In Parallel: YES
    • Parallel Group: Wave 2 (with Task 4)
    • Blocks: Tasks 5, 6
    • Blocked By: Task 1

    References:

    Pattern References:

    • calibrate_extrinsics.py:620-760 — Main loop where verification frames are collected; lines 682-714 are the critical section
    • calibrate_extrinsics.py:118-258 — apply_depth_verify_refine_postprocess() which consumes verification_frames

    API/Type References:

    • aruco/svo_sync.py:10-18 — FrameData structure that's stored in verification_frames

    WHY Each Reference Matters:

    • calibrate_extrinsics.py:682-714: This is the exact code being modified; must understand score comparison and dict storage
    • calibrate_extrinsics.py:118-258: Must understand how verification_frames is consumed downstream to know what structure changes are safe

    Acceptance Criteria:

    • verification_frames[serial] is now a list of frame dicts, sorted by score descending
    • List length ≤ depth_pool_size for each camera
    • When depth_pool_size == 1, list has exactly one element matching current best-frame behavior
    • uv run basedpyright calibrate_extrinsics.py → 0 new errors

    Agent-Executed QA Scenarios:

    Scenario: Top-N collection works with pool size 3
      Tool: Bash
      Steps:
        1. uv run python -c "from calibrate_extrinsics import apply_depth_verify_refine_postprocess; print('OK')"
        2. Assert: stdout contains "OK"
      Expected Result: No import errors from structural changes
    

    Commit: NO (groups with Task 5)


  • 4. Add --depth-pool-size CLI option

    What to do:

    • Add click option to main() in calibrate_extrinsics.py:
      @click.option(
          "--depth-pool-size",
          default=1,
          type=click.IntRange(min=1, max=10),
          help="Number of top-scored frames to pool for depth verification/refinement (1=single best frame, >1=median pooling).",
      )
      
    • Pass through to function signature
    • Add to apply_depth_verify_refine_postprocess() parameters (or pass depth_pool_size to control pooling)
    • Update help text for --depth-mode if needed to mention pooling interaction

    Must NOT do:

    • Do NOT implement the actual pooling logic here (that's Task 5)
    • Do NOT allow values > 10 (memory guardrail)

    Recommended Agent Profile:

    • Category: quick
      • Reason: Single CLI option addition, boilerplate only
    • Skills: []

    Parallelization:

    • Can Run In Parallel: YES
    • Parallel Group: Wave 2 (with Task 3)
    • Blocks: Task 5
    • Blocked By: None

    References:

    Pattern References:

    • calibrate_extrinsics.py:474-478 — Existing --max-samples option as pattern for optional integer CLI flag
    • calibrate_extrinsics.py:431-436 — --depth-mode option pattern

    WHY Each Reference Matters:

    • Shows the exact click option pattern and placement convention in this file

    Acceptance Criteria:

    • uv run calibrate_extrinsics.py --help shows --depth-pool-size with description
    • Default value is 1
    • Values outside 1-10 are rejected by click

    Agent-Executed QA Scenarios:

    Scenario: CLI option appears in help
      Tool: Bash
      Steps:
        1. uv run calibrate_extrinsics.py --help
        2. Assert: output contains "--depth-pool-size"
        3. Assert: output contains "1=single best frame"
      Expected Result: Option visible with correct help text
    
    Scenario: Invalid pool size rejected
      Tool: Bash
      Steps:
        1. uv run calibrate_extrinsics.py --depth-pool-size 0 --help 2>&1 || true
        2. Assert: output contains error or "Invalid value"
      Expected Result: Click rejects out-of-range value
    

    Commit: NO (groups with Task 5)


  • 5. Integrate pooling into apply_depth_verify_refine_postprocess()

    What to do:

    • Modify apply_depth_verify_refine_postprocess() to accept depth_pool_size: int = 1 parameter
    • When depth_pool_size > 1 and multiple frames available:
      1. Extract depth_maps and confidence_maps from the top-N frame list
      2. Call pool_depth_maps() to produce pooled depth/confidence
      3. Use pooled maps for verify_extrinsics_with_depth() and refine_extrinsics_with_depth()
      4. Use the best-scored frame's ids for marker corner lookup (it has best detection quality)
    • When depth_pool_size == 1 OR only 1 frame available:
      • Use existing single-frame path exactly (no pooling call)
    • Add pooling metadata to JSON output: "depth_pool": {"pool_size_requested": N, "pool_size_actual": M, "pooled": true/false}
    • Wire depth_pool_size from main() through to this function
    • Handle edge case: if pooling produces a map with fewer valid points than best single frame, log warning and fall back to single frame
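
The selection-plus-fallback logic above can be sketched as a small helper. The function name choose_depth_target, the frame-dict keys, and the metadata keys mirror the plan's description but are hypothetical until Task 5 fixes the real shape; pool_fn stands in for pool_depth_maps from Task 1.

```python
import numpy as np

def choose_depth_target(frames: list[dict], pool_size: int, pool_fn):
    """Pick pooled maps when beneficial, else fall back to the best frame (sketch).

    `frames` is assumed sorted by score descending, each holding
    'depth_map' and 'confidence_map' entries.
    """
    best = frames[0]
    if pool_size <= 1 or len(frames) == 1:
        # Existing single-frame path: no pooling call at all
        return best["depth_map"], best["confidence_map"], {"pooled": False}
    pooled_depth, pooled_conf = pool_fn(
        [f["depth_map"] for f in frames],
        [f["confidence_map"] for f in frames],
    )
    # Guardrail: if pooling leaves fewer valid pixels than the best single
    # frame, fall back (the caller should also log a warning here)
    best_valid = int(np.isfinite(best["depth_map"]).sum())
    pooled_valid = int(np.isfinite(pooled_depth).sum())
    if pooled_valid < best_valid:
        return best["depth_map"], best["confidence_map"], {"pooled": False}
    return pooled_depth, pooled_conf, {
        "pooled": True,
        "pool_size_requested": pool_size,
        "pool_size_actual": len(frames),
    }
```

Keeping this as a pure helper makes the N=1 equivalence test (Task 6) straightforward: the pool_size ≤ 1 branch never touches the pooling code.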

    Must NOT do:

    • Do NOT change verify_extrinsics_with_depth() or refine_extrinsics_with_depth() function signatures
    • Do NOT add new CLI output formats

    Recommended Agent Profile:

    • Category: unspecified-high
      • Reason: Core integration task with multiple touchpoints; requires careful wiring and edge case handling
    • Skills: []

    Parallelization:

    • Can Run In Parallel: NO
    • Parallel Group: Sequential (after Wave 2)
    • Blocks: Tasks 6, 7
    • Blocked By: Tasks 1, 3, 4

    References:

    Pattern References:

    • calibrate_extrinsics.py:118-258 — Full apply_depth_verify_refine_postprocess() function being modified
    • calibrate_extrinsics.py:140-156 — Frame data extraction pattern (accessing vf["frame"], vf["ids"])
    • calibrate_extrinsics.py:158-180 — Verification call pattern
    • calibrate_extrinsics.py:182-245 — Refinement call pattern

    API/Type References:

    • aruco/depth_pool.py:pool_depth_maps() — The pooling function (Task 1 output)
    • aruco/depth_verify.py:119-179 — verify_extrinsics_with_depth() signature
    • aruco/depth_refine.py:71-227 — refine_extrinsics_with_depth() signature

    WHY Each Reference Matters:

    • calibrate_extrinsics.py:140-156: Shows how frame data is currently extracted; must adapt for list-of-frames
    • depth_pool.py: The function we're calling for multi-frame pooling
    • depth_verify.py/depth_refine.py: Confirms signatures remain unchanged (just pass different depth_map)

    Acceptance Criteria:

    • With --depth-pool-size 1: output JSON identical to baseline (no depth_pool metadata needed for N=1)
    • With --depth-pool-size 5: output JSON includes depth_pool metadata; verify/refine uses pooled maps
    • Fallback to single frame logged when pooling produces fewer valid points
    • uv run basedpyright calibrate_extrinsics.py → 0 new errors

    Agent-Executed QA Scenarios:

    Scenario: Pool size 1 produces baseline-equivalent output
      Tool: Bash
      Preconditions: output/ directory with SVO files
      Steps:
        1. uv run calibrate_extrinsics.py -s output/ -m aruco/markers/standard_box_markers_600mm.parquet --aruco-dictionary DICT_APRILTAG_36h11 --verify-depth --no-preview --max-samples 5 --depth-pool-size 1 --output output/_test_pool1.json
        2. Assert: exit code 0
        3. Assert: output/_test_pool1.json exists and contains depth_verify entries
      Expected Result: Runs cleanly, produces valid output
    
    Scenario: Pool size 5 runs and includes pool metadata
      Tool: Bash
      Preconditions: output/ directory with SVO files
      Steps:
        1. uv run calibrate_extrinsics.py -s output/ -m aruco/markers/standard_box_markers_600mm.parquet --aruco-dictionary DICT_APRILTAG_36h11 --verify-depth --refine-depth --no-preview --max-samples 10 --depth-pool-size 5 --output output/_test_pool5.json
        2. Assert: exit code 0
        3. Parse output/_test_pool5.json
        4. Assert: at least one camera entry contains "depth_pool" key
      Expected Result: Pooling metadata present in output
    

    Commit: YES

    • Message: feat(calibrate): integrate multi-frame depth pooling with --depth-pool-size flag
    • Files: calibrate_extrinsics.py, aruco/depth_pool.py, tests/test_depth_pool.py
    • Pre-commit: uv run pytest tests/test_depth_pool.py && uv run basedpyright calibrate_extrinsics.py

  • 6. N=1 equivalence regression test

    What to do:

    • Add test in tests/test_depth_cli_postprocess.py (or tests/test_depth_pool.py):
      • Create synthetic scenario with known depth maps and marker geometry
      • Run apply_depth_verify_refine_postprocess() with pool_size=1 using the old single-frame structure
      • Run with pool_size=1 using the new list-of-frames structure
      • Assert outputs are numerically identical (atol=0)
    • This proves the refactor preserves backward compatibility

    Must NOT do:

    • No E2E CLI test here (that's Task 7)

    Recommended Agent Profile:

    • Category: quick
      • Reason: Focused regression test with synthetic data
    • Skills: []

    Parallelization:

    • Can Run In Parallel: YES
    • Parallel Group: Wave 3 (with Task 7)
    • Blocks: None
    • Blocked By: Task 5

    References:

    Test References:

    • tests/test_depth_cli_postprocess.py — Existing integration test patterns
    • tests/test_depth_verify.py:36-60 — Synthetic depth map creation pattern

    Acceptance Criteria:

    • uv run pytest -k "pool_size_1_equivalence" → passes
    • Test asserts exact numerical equality between old-path and new-path outputs

    Commit: YES

    • Message: test(calibrate): add N=1 equivalence regression test for depth pooling
    • Files: tests/test_depth_pool.py or tests/test_depth_cli_postprocess.py

  • 7. E2E smoke comparison: pooled vs single-frame RMSE

    What to do:

    • Run calibration on test SVOs with --depth-pool-size 1 and --depth-pool-size 5
    • Compare:
      • Post-refinement RMSE per camera
      • Depth-normalized RMSE
      • CSV residual distribution (mean_abs, p50, p90)
      • Runtime (wall clock)
    • Document results in a brief summary (stdout or saved to a comparison file)
    • Success criterion: pooled RMSE ≤ single-frame RMSE for majority of cameras; runtime overhead < 25%
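
A sketch of the comparison step, once both JSON outputs are loaded (e.g. via json.load). The per-camera layout (serial → metrics dict with a post-RMSE key) is an assumption about the output schema and must be adapted to the real one.

```python
def rmse_table(results_a: dict, results_b: dict, key: str = "post_rmse") -> list[str]:
    """Per-camera RMSE comparison rows for pool=1 vs pool=5 runs (sketch).

    Assumes each argument maps camera serial -> metrics dict containing `key`;
    this layout is hypothetical, not the confirmed output schema.
    """
    rows = []
    for serial in sorted(set(results_a) & set(results_b)):
        a = results_a[serial].get(key)
        b = results_b[serial].get(key)
        if a is None or b is None:
            continue  # skip cameras missing the metric in either run
        verdict = "<=" if b <= a else ">"
        rows.append(f"{serial}: pool=1 {a:.4f} vs pool=5 {b:.4f} ({verdict})")
    return rows
```

Printing these rows plus the two wall-clock times satisfies the "comparison summary" criterion without building permanent benchmark infrastructure.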

    Must NOT do:

    • No automated pass/fail assertion on real data (metrics are directional, not deterministic)
    • No permanent benchmark infrastructure

    Recommended Agent Profile:

    • Category: quick
      • Reason: Run two commands, compare JSON output, summarize
    • Skills: []

    Parallelization:

    • Can Run In Parallel: YES
    • Parallel Group: Wave 3 (with Task 6)
    • Blocks: None
    • Blocked By: Task 5

    References:

    Pattern References:

    • Previous smoke runs in this session: output/e2e_refine_depth_full_neural_plus.json as baseline

    Acceptance Criteria:

    • Both runs complete without error
    • Comparison summary printed showing per-camera RMSE for pool=1 vs pool=5
    • Runtime logged for both runs

    Agent-Executed QA Scenarios:

    Scenario: Compare pool=1 vs pool=5 on full SVOs
      Tool: Bash
      Steps:
        1. Run with --depth-pool-size 1 --verify-depth --refine-depth --output output/_compare_pool1.json
        2. Run with --depth-pool-size 5 --verify-depth --refine-depth --output output/_compare_pool5.json
        3. Parse both JSON files
        4. Print per-camera post RMSE comparison table
        5. Print runtime difference
      Expected Result: Both complete; comparison table printed
      Evidence: Terminal output captured
    

    Commit: NO (no code change; just verification)


Commit Strategy

| After Task | Message | Files | Verification |
| --- | --- | --- | --- |
| 1+2 | feat(aruco): add pool_depth_maps utility with tests | aruco/depth_pool.py, tests/test_depth_pool.py | uv run pytest tests/test_depth_pool.py |
| 5 (includes 3+4) | feat(calibrate): integrate multi-frame depth pooling with --depth-pool-size flag | calibrate_extrinsics.py | uv run pytest && uv run basedpyright |
| 6 | test(calibrate): add N=1 equivalence regression test for depth pooling | tests/test_depth_pool.py or tests/test_depth_cli_postprocess.py | uv run pytest -k pool_size_1 |

Success Criteria

Verification Commands

uv run pytest tests/test_depth_pool.py -v           # All pool unit tests pass
uv run pytest -k "pool_size_1_equivalence" -v        # N=1 regression passes
uv run basedpyright                                   # 0 new errors
uv run calibrate_extrinsics.py --help | grep pool    # CLI flag visible

Final Checklist

  • pool_depth_maps() pure function exists with full edge case handling
  • --depth-pool-size CLI option with default=1, max=10
  • N=1 produces identical results to baseline
  • All existing tests still pass
  • Type checker clean
  • E2E comparison shows pooled RMSE ≤ single-frame RMSE for majority of cameras