zed-playground/py_workspace/.sisyphus/plans/multi-frame-depth-pooling.md

# Multi-Frame Depth Pooling for Extrinsic Calibration

## TL;DR

> **Quick Summary**: Replace single-best-frame depth verification/refinement with top-N temporal pooling to reduce noise sensitivity and improve calibration robustness, while keeping existing verify/refine function signatures untouched.
>
> **Deliverables**:
> - New `pool_depth_maps()` utility function in `aruco/depth_pool.py`
> - Extended frame collection (top-N per camera) in main loop
> - New `--depth-pool-size` CLI option (default 1 = backward compatible)
> - Unit tests for pooling, fallback, and N=1 equivalence
> - E2E smoke comparison (pooled vs single-frame RMSE)
>
> **Estimated Effort**: Medium
> **Parallel Execution**: YES — 3 waves
> **Critical Path**: Task 1 → Task 3 → Task 5 → Task 7

---

## Context

### Original Request
User asked: "Is `apply_depth_verify_refine_postprocess` optimal? When `depth_mode` is not NONE, every frame computes depth regardless of whether it's used. Is there a better way to utilize every depth map when verify/refine is enabled?"

### Interview Summary
**Key Discussions**:
- Oracle confirmed single-best-frame is simplicity-biased but leaves accuracy on the table
- Recommended top 3–5 frame temporal pooling with confidence gating
- Phased approach: quick win (pooling), medium (weighted selection), advanced (joint optimization)

**Research Findings**:
- `calibrate_extrinsics.py:682-714`: Current loop stores exactly one `verification_frames[serial]` per camera (best-scored)
- `aruco/depth_verify.py`: `verify_extrinsics_with_depth()` accepts single `depth_map` + `confidence_map`
- `aruco/depth_refine.py`: `refine_extrinsics_with_depth()` accepts single `depth_map` + `confidence_map`
- `aruco/svo_sync.py:FrameData`: Each frame already carries `depth_map` + `confidence_map`
- Memory: each depth map is ~3.5 MiB (720×1280 float32); storing 5 per camera = ~17.6 MiB/cam, ~70 MiB total for 4 cameras (depth only, confidence maps not counted), which is acceptable
- Existing tests use synthetic depth maps, so new tests can follow same pattern
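
The memory estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: it counts float32 depth maps at 720×1280 and ignores confidence maps and any per-frame overhead, and the function name is hypothetical.

```python
# Back-of-the-envelope memory estimate for keeping top-N depth maps in RAM.
HEIGHT, WIDTH = 720, 1280
BYTES_PER_FLOAT32 = 4
MIB = 1024 * 1024


def depth_pool_memory_mib(pool_size: int, num_cameras: int) -> float:
    """MiB needed for `pool_size` float32 depth maps on each of `num_cameras`."""
    bytes_per_map = HEIGHT * WIDTH * BYTES_PER_FLOAT32
    return pool_size * num_cameras * bytes_per_map / MIB


print(f"per map:          {depth_pool_memory_mib(1, 1):.2f} MiB")
print(f"N=5, one camera:  {depth_pool_memory_mib(5, 1):.2f} MiB")
print(f"N=5, four cameras: {depth_pool_memory_mib(5, 4):.2f} MiB")
```

At the pool-size cap of 10 the same formula gives roughly double the N=5 figure, which is the basis for treating the cap as a memory guardrail.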

### Metis Review
**Identified Gaps** (addressed):
- Camera motion during capture → addressed via assumption that cameras are static during calibration; documented as guardrail
- "Top-N by score" may not correlate with depth quality → addressed by keeping confidence gating in pooling function
- Fewer than N frames available → addressed with explicit fallback behavior
- All pixels invalid after gating → addressed with fallback to best single frame
- N=1 must reproduce baseline exactly → addressed with explicit equivalence test

---

## Work Objectives

### Core Objective
Pool depth maps from the top-N scored frames per camera to produce a more robust single depth target for verification and refinement, reducing sensitivity to single-frame noise.

### Concrete Deliverables
- `aruco/depth_pool.py` — new module with `pool_depth_maps()` function
- Modified `calibrate_extrinsics.py` — top-N collection + pooling integration + CLI flag
- `tests/test_depth_pool.py` — unit tests for pooling logic
- Updated `tests/test_depth_cli_postprocess.py` — integration test for N=1 equivalence

### Definition of Done
- [x] `uv run pytest -k "depth_pool"` → all tests pass
- [x] `uv run basedpyright` → 0 new errors
- [x] `--depth-pool-size 1` produces identical output to current baseline
- [x] `--depth-pool-size 5` produces equal or lower post-RMSE for the majority of cameras on test SVOs

### Must Have
- Feature-flagged behind `--depth-pool-size` (default 1)
- Pure function `pool_depth_maps()` with deterministic output
- Confidence gating during pooling
- Graceful fallback when pooling fails (insufficient valid pixels)
- N=1 code path identical to current behavior

### Must NOT Have (Guardrails)
- NO changes to `verify_extrinsics_with_depth()` or `refine_extrinsics_with_depth()` signatures
- NO scoring function redesign (use existing `score_frame()` as-is)
- NO cross-camera fusion or spatial alignment/warping between frames
- NO GPU acceleration or threading changes
- NO new artifact files or dashboards
- NO "unbounded history" — enforce max pool size cap (10)
- NO optical flow, Kalman filters, or temporal alignment beyond frame selection

---

## Verification Strategy (MANDATORY)

> **UNIVERSAL RULE: ZERO HUMAN INTERVENTION**
>
> ALL tasks in this plan MUST be verifiable WITHOUT any human action.

### Test Decision
- **Infrastructure exists**: YES
- **Automated tests**: YES (Tests-after, matching existing pattern)
- **Framework**: pytest (via `uv run pytest`)

### Agent-Executed QA Scenarios (MANDATORY — ALL tasks)

**Verification Tool by Deliverable Type:**

| Type | Tool | How Agent Verifies |
|------|------|-------------------|
| Library/Module | Bash (uv run pytest) | Run targeted tests, compare output |
| CLI | Bash (uv run calibrate_extrinsics.py) | Run with flags, check JSON output |
| Type safety | Bash (uv run basedpyright) | Zero new errors |

---

## Execution Strategy

### Parallel Execution Waves

```
Wave 1 (Start Immediately):
├── Task 1: Create pool_depth_maps() utility
└── Task 2: Unit tests for pool_depth_maps()

Wave 2 (After Wave 1):
├── Task 3: Extend main loop to collect top-N frames
├── Task 4: Add --depth-pool-size CLI option
└── Task 5: Integrate pooling into postprocess function (sequential, after Tasks 3 and 4)

Wave 3 (After Wave 2):
├── Task 6: N=1 equivalence regression test
└── Task 7: E2E smoke comparison (pooled vs single-frame)
```

### Dependency Matrix

| Task | Depends On | Blocks | Can Parallelize With |
|------|------------|--------|---------------------|
| 1 | None | 2, 3, 5 | 2 |
| 2 | 1 | None | 1 |
| 3 | 1 | 5, 6 | 4 |
| 4 | None | 5 | 3 |
| 5 | 1, 3, 4 | 6, 7 | None |
| 6 | 5 | None | 7 |
| 7 | 5 | None | 6 |

---

## TODOs

- [x] 1. Create `pool_depth_maps()` utility in `aruco/depth_pool.py`

  **What to do**:
  - Create new file `aruco/depth_pool.py`
  - Implement `pool_depth_maps(depth_maps: list[np.ndarray], confidence_maps: list[np.ndarray | None], confidence_thresh: float = 50.0, min_valid_count: int = 1) -> tuple[np.ndarray, np.ndarray | None]`
  - Algorithm:
    1. Stack depth maps along new axis → shape (N, H, W)
    2. For each pixel position, mask invalid values (NaN, inf, ≤ 0) AND confidence-rejected pixels (conf > thresh)
    3. Compute per-pixel **median** across valid frames → pooled depth
    4. For confidence: compute per-pixel **minimum** (most confident) across frames → pooled confidence
    5. Pixels with < `min_valid_count` valid observations → set to NaN in pooled depth
  - Handle edge cases:
    - Empty input list → raise ValueError
    - Single map (N=1) → return copy of input (exact equivalence path)
    - All maps invalid at a pixel → NaN in output
    - Shape mismatch across maps → raise ValueError
    - Mixed None confidence maps → pool only non-None, or return None if all None
  - Add type hints, docstring with Args/Returns
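
The algorithm and edge cases listed above could be sketched roughly as follows. This is a minimal illustration, not the final module. It assumes floating-point depth maps, treats a frame whose confidence map is `None` as ungated, and takes the pooled confidence as the minimum over all non-`None` maps, which is one possible reading of the "mixed None" rule.

```python
from __future__ import annotations

import numpy as np


def pool_depth_maps(
    depth_maps: list[np.ndarray],
    confidence_maps: list[np.ndarray | None],
    confidence_thresh: float = 50.0,
    min_valid_count: int = 1,
) -> tuple[np.ndarray, np.ndarray | None]:
    """Median-pool depth maps, ignoring invalid and low-confidence pixels."""
    if not depth_maps:
        raise ValueError("depth_maps must not be empty")
    if len(confidence_maps) != len(depth_maps):
        raise ValueError("need one confidence entry (or None) per depth map")
    shape = depth_maps[0].shape
    for depth, conf in zip(depth_maps, confidence_maps):
        if depth.shape != shape or (conf is not None and conf.shape != shape):
            raise ValueError("all maps must share the same shape")

    if len(depth_maps) == 1:
        # N=1: return untouched copies so the single-frame path is reproduced.
        conf = confidence_maps[0]
        return depth_maps[0].copy(), None if conf is None else conf.copy()

    stack = np.stack(depth_maps)  # shape (N, H, W)
    with np.errstate(invalid="ignore"):
        valid = np.isfinite(stack) & (stack > 0)
        for i, conf in enumerate(confidence_maps):
            if conf is not None:
                # ZED semantics: lower value = more confident.
                valid[i] &= conf <= confidence_thresh

    masked = np.where(valid, stack, np.nan)
    enough = valid.sum(axis=0) >= max(min_valid_count, 1)
    pooled = np.full(shape, np.nan, dtype=masked.dtype)
    if enough.any():
        # Median over valid samples only, at pixels with enough observations.
        pooled[enough] = np.nanmedian(masked[:, enough], axis=0)

    present = [c for c in confidence_maps if c is not None]
    pooled_conf = np.stack(present).min(axis=0) if present else None
    return pooled, pooled_conf
```

Restricting `np.nanmedian` to pixels that already have at least one valid sample avoids the all-NaN-slice warning and keeps the output deterministic for a given input order.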

  **Must NOT do**:
  - No weighted mean (median is more robust to outliers; keep simple for Phase 1)
  - No spatial alignment or warping

  **Recommended Agent Profile**:
  - **Category**: `quick`
    - Reason: Single focused module, pure function, no complex dependencies
  - **Skills**: []
    - No special skills needed; standard Python/numpy work

  **Parallelization**:
  - **Can Run In Parallel**: YES
  - **Parallel Group**: Wave 1 (with Task 2)
  - **Blocks**: Tasks 2, 3, 5
  - **Blocked By**: None

  **References**:

  **Pattern References**:
  - `aruco/depth_verify.py:39-79` — `compute_depth_residual()` shows how invalid depth is handled (NaN, ≤0, window median pattern)
  - `aruco/depth_verify.py:27-36` — `get_confidence_weight()` shows confidence semantics (ZED: 1=most confident, 100=least; threshold default 50)

  **API/Type References**:
  - `aruco/svo_sync.py:10-18` — `FrameData` dataclass: `depth_map: np.ndarray | None`, `confidence_map: np.ndarray | None`

  **Test References**:
  - `tests/test_depth_verify.py:36-60` — Pattern for creating synthetic depth maps and testing residual computation

  **WHY Each Reference Matters**:
  - `depth_verify.py:39-79`: Defines the invalid-depth encoding convention (NaN/≤0) that pooling must respect
  - `depth_verify.py:27-36`: Defines confidence semantics and threshold convention; pooling gating must match
  - `svo_sync.py:10-18`: Defines the data types the pooling function will receive

  **Acceptance Criteria**:
  - [ ] File `aruco/depth_pool.py` exists with `pool_depth_maps()` function
  - [ ] Function handles N=1 by returning exact copy of input
  - [ ] Function raises ValueError on empty input or shape mismatch
  - [ ] `uv run basedpyright aruco/depth_pool.py` → 0 errors

  **Agent-Executed QA Scenarios:**
  ```
  Scenario: Module imports without error
    Tool: Bash
    Steps:
      1. uv run python -c "from aruco.depth_pool import pool_depth_maps; print('OK')"
      2. Assert: stdout contains "OK"
    Expected Result: Clean import
  ```

  **Commit**: YES
  - Message: `feat(aruco): add pool_depth_maps utility for multi-frame depth pooling`
  - Files: `aruco/depth_pool.py`

---

- [x] 2. Unit tests for `pool_depth_maps()`

  **What to do**:
  - Create `tests/test_depth_pool.py`
  - Test cases:
    1. **Single map (N=1)**: output equals input exactly
    2. **Two maps, clean**: median of two values at each pixel (equal to their mean)
    3. **Three maps with NaN**: median ignores NaN pixels correctly
    4. **Confidence gating**: pixels above threshold excluded from median
    5. **All invalid at pixel**: output is NaN
    6. **Empty input**: raises ValueError
    7. **Shape mismatch**: raises ValueError
    8. **min_valid_count**: pixel with fewer valid observations → NaN
    9. **None confidence maps**: graceful handling (pools depth only, returns None confidence)
  - Use `numpy.testing.assert_allclose` for numerical checks
  - Use `pytest.raises(ValueError, match=...)` for error cases

  **Must NOT do**:
  - No integration with calibrate_extrinsics.py yet (unit tests only)

  **Recommended Agent Profile**:
  - **Category**: `quick`
    - Reason: Focused test file creation following existing patterns
  - **Skills**: []

  **Parallelization**:
  - **Can Run In Parallel**: YES
  - **Parallel Group**: Wave 1 (with Task 1)
  - **Blocks**: None
  - **Blocked By**: Task 1

  **References**:

  **Test References**:
  - `tests/test_depth_verify.py:36-60` — Pattern for synthetic depth map creation and assertion style
  - `tests/test_depth_refine.py:10-18` — Pattern for roundtrip/equivalence testing

  **WHY Each Reference Matters**:
  - Shows the exact assertion patterns and synthetic data conventions used in this codebase

  **Acceptance Criteria**:
  - [ ] `uv run pytest tests/test_depth_pool.py -v` → all tests pass
  - [ ] At least 9 test cases covering the enumerated scenarios

  **Agent-Executed QA Scenarios:**
  ```
  Scenario: All pool tests pass
    Tool: Bash
    Steps:
      1. uv run pytest tests/test_depth_pool.py -v
      2. Assert: exit code 0
      3. Assert: output contains "passed" with 0 "failed"
    Expected Result: All tests green
  ```

  **Commit**: YES (groups with Task 1)
  - Message: `test(aruco): add unit tests for pool_depth_maps`
  - Files: `tests/test_depth_pool.py`

---

- [x] 3. Extend main loop to collect top-N frames per camera

  **What to do**:
  - In `calibrate_extrinsics.py`, modify the verification frame collection (lines ~682-714):
    - Change `verification_frames` from `dict[serial, single_frame_dict]` to `dict[serial, list[frame_dict]]`
    - Maintain list sorted by score (descending), truncated to `depth_pool_size`
    - Use `heapq` or sorted insertion to keep top-N efficiently
    - When `depth_pool_size == 1`, behavior must be identical to current (store only best)
  - Update all downstream references to `verification_frames` that assume single-frame structure
  - The `first_frames` dict remains unchanged (it's for benchmarking, separate concern)
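
One way to keep the top-N frames per camera is a bounded min-heap, sketched below with the standard-library `heapq`. The helper names and the frame-dict contents are hypothetical. The tie-breaking rule (an earlier frame wins on equal score) assumes the current loop replaces the best frame only on a strictly greater score, which should be checked against the real code.

```python
from __future__ import annotations

import heapq
from itertools import count
from typing import Any

# Monotonic sequence number: breaks score ties without comparing frame dicts.
_sequence = count()

HeapEntry = tuple[float, int, dict[str, Any]]


def push_top_n(heap: list[HeapEntry], score: float, frame: dict[str, Any], pool_size: int) -> None:
    """Keep only the `pool_size` highest-scored frames in a min-heap."""
    # Negated sequence number makes an earlier frame rank higher on equal score.
    entry = (score, -next(_sequence), frame)
    if len(heap) < pool_size:
        heapq.heappush(heap, entry)
    elif entry > heap[0]:
        # New frame beats the current worst kept frame: swap it in.
        heapq.heapreplace(heap, entry)


def sorted_frames(heap: list[HeapEntry]) -> list[dict[str, Any]]:
    """Return kept frames best-first (score descending, earlier frame first)."""
    return [frame for _, _, frame in sorted(heap, reverse=True)]


# Usage with made-up scores; in the real loop one heap per camera serial.
heap: list[HeapEntry] = []
for index, score in enumerate([0.2, 0.9, 0.5, 0.9, 0.1]):
    push_top_n(heap, score, {"frame_index": index}, pool_size=3)
top_frames = sorted_frames(heap)
```

With `pool_size=1` the heap holds a single entry and only a strictly better score replaces it, so the N=1 case reduces to "store only best".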

  **Must NOT do**:
  - Do NOT change the scoring function `score_frame()`
  - Do NOT change `FrameData` structure
  - Do NOT store frames outside the sampled loop (only collect from frames that already have depth)

  **Recommended Agent Profile**:
  - **Category**: `unspecified-low`
    - Reason: Surgical modification to existing loop logic; requires careful attention to existing consumers
  - **Skills**: []

  **Parallelization**:
  - **Can Run In Parallel**: YES
  - **Parallel Group**: Wave 2 (with Task 4)
  - **Blocks**: Tasks 5, 6
  - **Blocked By**: Task 1

  **References**:

  **Pattern References**:
  - `calibrate_extrinsics.py:620-760` — Main loop where verification frames are collected; lines 682-714 are the critical section
  - `calibrate_extrinsics.py:118-258` — `apply_depth_verify_refine_postprocess()` which consumes `verification_frames`

  **API/Type References**:
  - `aruco/svo_sync.py:10-18` — `FrameData` structure that's stored in verification_frames

  **WHY Each Reference Matters**:
  - `calibrate_extrinsics.py:682-714`: This is the exact code being modified; must understand score comparison and dict storage
  - `calibrate_extrinsics.py:118-258`: Must understand how `verification_frames` is consumed downstream to know what structure changes are safe

  **Acceptance Criteria**:
  - [ ] `verification_frames[serial]` is now a list of frame dicts, sorted by score descending
  - [ ] List length ≤ `depth_pool_size` for each camera
  - [ ] When `depth_pool_size == 1`, list has exactly one element matching current best-frame behavior
  - [ ] `uv run basedpyright calibrate_extrinsics.py` → 0 new errors

  **Agent-Executed QA Scenarios:**
  ```
  Scenario: Module imports cleanly after the verification_frames structure change
    Tool: Bash
    Steps:
      1. uv run python -c "from calibrate_extrinsics import apply_depth_verify_refine_postprocess; print('OK')"
      2. Assert: stdout contains "OK"
    Expected Result: No import errors from structural changes
  ```

  **Commit**: NO (groups with Task 5)

---

- [x] 4. Add `--depth-pool-size` CLI option

  **What to do**:
  - Add click option to `main()` in `calibrate_extrinsics.py`:
    ```python
    @click.option(
        "--depth-pool-size",
        default=1,
        type=click.IntRange(min=1, max=10),
        help="Number of top-scored frames to pool for depth verification/refinement (1=single best frame, >1=median pooling).",
    )
    ```
  - Add a matching `depth_pool_size: int` parameter to the `main()` signature
  - Pass `depth_pool_size` on to `apply_depth_verify_refine_postprocess()` as a new parameter (the pooling behavior itself is wired up in Task 5)
  - Update help text for `--depth-mode` if needed to mention pooling interaction

  **Must NOT do**:
  - Do NOT implement the actual pooling logic here (that's Task 5)
  - Do NOT allow values > 10 (memory guardrail)

  **Recommended Agent Profile**:
  - **Category**: `quick`
    - Reason: Single CLI option addition, boilerplate only
  - **Skills**: []

  **Parallelization**:
  - **Can Run In Parallel**: YES
  - **Parallel Group**: Wave 2 (with Task 3)
  - **Blocks**: Task 5
  - **Blocked By**: None

  **References**:

  **Pattern References**:
  - `calibrate_extrinsics.py:474-478` — Existing `--max-samples` option as pattern for optional integer CLI flag
  - `calibrate_extrinsics.py:431-436` — `--depth-mode` option pattern

  **WHY Each Reference Matters**:
  - Shows the exact click option pattern and placement convention in this file

  **Acceptance Criteria**:
  - [ ] `uv run calibrate_extrinsics.py --help` shows `--depth-pool-size` with description
  - [ ] Default value is 1
  - [ ] Values outside 1-10 are rejected by click

  **Agent-Executed QA Scenarios:**
  ```
  Scenario: CLI option appears in help
    Tool: Bash
    Steps:
      1. uv run calibrate_extrinsics.py --help
      2. Assert: output contains "--depth-pool-size"
      3. Assert: output contains "1=single best frame"
    Expected Result: Option visible with correct help text

  Scenario: Invalid pool size rejected
    Tool: Bash
    Steps:
      1. uv run calibrate_extrinsics.py --depth-pool-size 0 2>&1 || true
      2. Assert: output contains error or "Invalid value"
    Expected Result: Click rejects out-of-range value
  ```

  **Commit**: NO (groups with Task 5)

---

- [x] 5. Integrate pooling into `apply_depth_verify_refine_postprocess()`

  **What to do**:
  - Modify `apply_depth_verify_refine_postprocess()` to accept `depth_pool_size: int = 1` parameter
  - When `depth_pool_size > 1` and multiple frames available:
    1. Extract depth_maps and confidence_maps from the top-N frame list
    2. Call `pool_depth_maps()` to produce pooled depth/confidence
    3. Use pooled maps for `verify_extrinsics_with_depth()` and `refine_extrinsics_with_depth()`
    4. Use the **best-scored frame's** `ids` for marker corner lookup (it has best detection quality)
  - When `depth_pool_size == 1` OR only 1 frame available:
    - Use existing single-frame path exactly (no pooling call)
  - Add pooling metadata to JSON output: `"depth_pool": {"pool_size_requested": N, "pool_size_actual": M, "pooled": true/false}`
  - Wire `depth_pool_size` from `main()` through to this function
  - Handle edge case: if pooling produces a map with fewer valid points than best single frame, log warning and fall back to single frame
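
The fallback rule and the metadata entry described above might look like the sketch below. It assumes "valid points" means finite, strictly positive depth pixels. The plan may instead intend valid marker-corner samples, so treat the counting function as a placeholder. Both helper names are hypothetical.

```python
from __future__ import annotations

import numpy as np


def count_valid(depth: np.ndarray) -> int:
    """Number of usable depth pixels (finite and strictly positive)."""
    with np.errstate(invalid="ignore"):
        return int(np.count_nonzero(np.isfinite(depth) & (depth > 0)))


def choose_depth_target(
    best_depth: np.ndarray,
    pooled_depth: np.ndarray,
    pool_size_requested: int,
    pool_size_actual: int,
) -> tuple[np.ndarray, dict[str, object]]:
    """Pick pooled or single-frame depth and build the `depth_pool` metadata."""
    # Fall back when pooling did not happen or lost valid pixels.
    use_pooled = pool_size_actual > 1 and count_valid(pooled_depth) >= count_valid(best_depth)
    metadata: dict[str, object] = {
        "pool_size_requested": pool_size_requested,
        "pool_size_actual": pool_size_actual,
        "pooled": use_pooled,
    }
    return (pooled_depth if use_pooled else best_depth), metadata
```

Note that this compares a confidence-gated pooled map with an ungated single frame. Whether the real fallback should gate the single frame before counting is a design question the plan leaves open.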

  **Must NOT do**:
  - Do NOT change `verify_extrinsics_with_depth()` or `refine_extrinsics_with_depth()` function signatures
  - Do NOT add new CLI output formats

  **Recommended Agent Profile**:
  - **Category**: `unspecified-high`
    - Reason: Core integration task with multiple touchpoints; requires careful wiring and edge case handling
  - **Skills**: []

  **Parallelization**:
  - **Can Run In Parallel**: NO
  - **Parallel Group**: Sequential (after Wave 2)
  - **Blocks**: Tasks 6, 7
  - **Blocked By**: Tasks 1, 3, 4

  **References**:

  **Pattern References**:
  - `calibrate_extrinsics.py:118-258` — Full `apply_depth_verify_refine_postprocess()` function being modified
  - `calibrate_extrinsics.py:140-156` — Frame data extraction pattern (accessing `vf["frame"]`, `vf["ids"]`)
  - `calibrate_extrinsics.py:158-180` — Verification call pattern
  - `calibrate_extrinsics.py:182-245` — Refinement call pattern

  **API/Type References**:
  - `aruco/depth_pool.py:pool_depth_maps()` — The pooling function (Task 1 output)
  - `aruco/depth_verify.py:119-179` — `verify_extrinsics_with_depth()` signature
  - `aruco/depth_refine.py:71-227` — `refine_extrinsics_with_depth()` signature

  **WHY Each Reference Matters**:
  - `calibrate_extrinsics.py:140-156`: Shows how frame data is currently extracted; must adapt for list-of-frames
  - `depth_pool.py`: The function we're calling for multi-frame pooling
  - `depth_verify.py/depth_refine.py`: Confirms signatures remain unchanged (just pass different depth_map)

  **Acceptance Criteria**:
  - [ ] With `--depth-pool-size 1`: output JSON identical to baseline (`depth_pool` metadata omitted for N=1)
  - [ ] With `--depth-pool-size 5`: output JSON includes `depth_pool` metadata; verify/refine uses pooled maps
  - [ ] Fallback to single frame logged when pooling produces fewer valid points
  - [ ] `uv run basedpyright calibrate_extrinsics.py` → 0 new errors

  **Agent-Executed QA Scenarios:**
  ```
  Scenario: Pool size 1 produces baseline-equivalent output
    Tool: Bash
    Preconditions: output/ directory with SVO files
    Steps:
      1. uv run calibrate_extrinsics.py -s output/ -m aruco/markers/standard_box_markers_600mm.parquet --aruco-dictionary DICT_APRILTAG_36h11 --verify-depth --no-preview --max-samples 5 --depth-pool-size 1 --output output/_test_pool1.json
      2. Assert: exit code 0
      3. Assert: output/_test_pool1.json exists and contains depth_verify entries
    Expected Result: Runs cleanly, produces valid output

  Scenario: Pool size 5 runs and includes pool metadata
    Tool: Bash
    Preconditions: output/ directory with SVO files
    Steps:
      1. uv run calibrate_extrinsics.py -s output/ -m aruco/markers/standard_box_markers_600mm.parquet --aruco-dictionary DICT_APRILTAG_36h11 --verify-depth --refine-depth --no-preview --max-samples 10 --depth-pool-size 5 --output output/_test_pool5.json
      2. Assert: exit code 0
      3. Parse output/_test_pool5.json
      4. Assert: at least one camera entry contains "depth_pool" key
    Expected Result: Pooling metadata present in output
  ```

  **Commit**: YES
  - Message: `feat(calibrate): integrate multi-frame depth pooling with --depth-pool-size flag`
  - Files: `calibrate_extrinsics.py`
  - Pre-commit: `uv run pytest tests/test_depth_pool.py && uv run basedpyright calibrate_extrinsics.py`

---

- [x] 6. N=1 equivalence regression test

  **What to do**:
  - Add test in `tests/test_depth_cli_postprocess.py` (or `tests/test_depth_pool.py`):
    - Create synthetic scenario with known depth maps and marker geometry
    - Run `apply_depth_verify_refine_postprocess()` with pool_size=1 using the old single-frame structure
    - Run with pool_size=1 using the new list-of-frames structure
    - Assert outputs are numerically identical (`rtol=0`, `atol=0`; `assert_allclose` otherwise keeps its default `rtol=1e-7`)
  - This proves the refactor preserves backward compatibility
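
A small illustration of why the equivalence assertion needs both tolerances set to zero: `numpy.testing.assert_allclose` keeps its default relative tolerance when only `atol` is given. The arrays are synthetic and the helper name is hypothetical.

```python
import numpy as np

baseline = np.array([1.0, 2.0, np.nan])
drifted = baseline * (1 + 1e-9)  # tiny relative drift, not bit-identical

# Passes even with atol=0, because the default rtol=1e-7 still applies.
np.testing.assert_allclose(drifted, baseline, atol=0)


def is_exactly_equal(actual: np.ndarray, expected: np.ndarray) -> bool:
    """True only if arrays match with zero tolerance (NaNs compare equal)."""
    try:
        np.testing.assert_allclose(actual, expected, rtol=0, atol=0)
    except AssertionError:
        return False
    return True


print(is_exactly_equal(baseline.copy(), baseline))
print(is_exactly_equal(drifted, baseline))
```

`numpy.testing.assert_array_equal` is an alternative for the same purpose; it also treats NaNs in matching positions as equal.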

  **Must NOT do**:
  - No E2E CLI test here (that's Task 7)

  **Recommended Agent Profile**:
  - **Category**: `quick`
    - Reason: Focused regression test with synthetic data
  - **Skills**: []

  **Parallelization**:
  - **Can Run In Parallel**: YES
  - **Parallel Group**: Wave 3 (with Task 7)
  - **Blocks**: None
  - **Blocked By**: Task 5

  **References**:

  **Test References**:
  - `tests/test_depth_cli_postprocess.py` — Existing integration test patterns
  - `tests/test_depth_verify.py:36-60` — Synthetic depth map creation pattern

  **Acceptance Criteria**:
  - [ ] `uv run pytest -k "pool_size_1_equivalence"` → passes
  - [ ] Test asserts exact numerical equality between old-path and new-path outputs

  **Commit**: YES
  - Message: `test(calibrate): add N=1 equivalence regression test for depth pooling`
  - Files: `tests/test_depth_pool.py` or `tests/test_depth_cli_postprocess.py`

---

- [x] 7. E2E smoke comparison: pooled vs single-frame RMSE

  **What to do**:
  - Run calibration on test SVOs with `--depth-pool-size 1` and `--depth-pool-size 5`
  - Compare:
    - Post-refinement RMSE per camera
    - Depth-normalized RMSE
    - CSV residual distribution (mean_abs, p50, p90)
    - Runtime (wall clock)
  - Document results in a brief summary (stdout or saved to a comparison file)
  - **Success criterion**: pooled RMSE ≤ single-frame RMSE for majority of cameras; runtime overhead < 25%
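
The success criterion above could be evaluated with a helper along these lines. It takes per-camera RMSE values that have already been extracted; how they are read from the output JSON is left out because the schema is not shown here. Function names and the sample numbers are made up for illustration.

```python
from __future__ import annotations


def summarize_rmse(single: dict[str, float], pooled: dict[str, float]) -> tuple[list[str], bool]:
    """Build a per-camera comparison table and the majority verdict."""
    serials = sorted(single.keys() & pooled.keys())
    lines = [f"{'camera':<12}{'pool=1':>10}{'pool=5':>10}{'delta':>10}"]
    improved = 0
    for serial in serials:
        delta = pooled[serial] - single[serial]
        if delta <= 0:
            improved += 1  # pooled RMSE is equal or lower for this camera
        lines.append(f"{serial:<12}{single[serial]:>10.4f}{pooled[serial]:>10.4f}{delta:>+10.4f}")
    majority_improved = improved * 2 > len(serials)
    return lines, majority_improved


def runtime_overhead(seconds_single: float, seconds_pooled: float) -> float:
    """Relative wall-clock overhead of the pooled run (0.25 means +25%)."""
    return (seconds_pooled - seconds_single) / seconds_single


# Illustrative values only, not measured results.
table, ok = summarize_rmse(
    {"cam_a": 0.05, "cam_b": 0.04, "cam_c": 0.03},
    {"cam_a": 0.04, "cam_b": 0.05, "cam_c": 0.02},
)
print("\n".join(table))
print("majority improved:", ok)
```

Because the metrics are directional, the verdict is meant for the printed summary rather than a hard test assertion.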

  **Must NOT do**:
  - No automated pass/fail assertion on real data (metrics are directional, not deterministic)
  - No permanent benchmark infrastructure

  **Recommended Agent Profile**:
  - **Category**: `quick`
    - Reason: Run two commands, compare JSON output, summarize
  - **Skills**: []

  **Parallelization**:
  - **Can Run In Parallel**: YES
  - **Parallel Group**: Wave 3 (with Task 6)
  - **Blocks**: None
  - **Blocked By**: Task 5

  **References**:

  **Pattern References**:
  - Previous smoke runs in this session: `output/e2e_refine_depth_full_neural_plus.json` as baseline

  **Acceptance Criteria**:
  - [ ] Both runs complete without error
  - [ ] Comparison summary printed showing per-camera RMSE for pool=1 vs pool=5
  - [ ] Runtime logged for both runs

  **Agent-Executed QA Scenarios:**
  ```
  Scenario: Compare pool=1 vs pool=5 on full SVOs
    Tool: Bash
    Steps:
      1. Run with --depth-pool-size 1 --verify-depth --refine-depth --output output/_compare_pool1.json
      2. Run with --depth-pool-size 5 --verify-depth --refine-depth --output output/_compare_pool5.json
      3. Parse both JSON files
      4. Print per-camera post RMSE comparison table
      5. Print runtime difference
    Expected Result: Both complete; comparison table printed
    Evidence: Terminal output captured
  ```

  **Commit**: NO (no code change; just verification)

---

## Commit Strategy

| After Task | Message | Files | Verification |
|------------|---------|-------|--------------|
| 1+2 | `feat(aruco): add pool_depth_maps utility with tests` | `aruco/depth_pool.py`, `tests/test_depth_pool.py` | `uv run pytest tests/test_depth_pool.py` |
| 5 (includes 3+4) | `feat(calibrate): integrate multi-frame depth pooling with --depth-pool-size flag` | `calibrate_extrinsics.py` | `uv run pytest && uv run basedpyright` |
| 6 | `test(calibrate): add N=1 equivalence regression test for depth pooling` | `tests/test_depth_pool.py` or `tests/test_depth_cli_postprocess.py` | `uv run pytest -k pool_size_1` |

---

## Success Criteria

### Verification Commands
```bash
uv run pytest tests/test_depth_pool.py -v           # All pool unit tests pass
uv run pytest -k "pool_size_1_equivalence" -v        # N=1 regression passes
uv run basedpyright                                   # 0 new errors
uv run calibrate_extrinsics.py --help | grep pool    # CLI flag visible
```

### Final Checklist
- [x] `pool_depth_maps()` pure function exists with full edge case handling
- [x] `--depth-pool-size` CLI option with default=1, max=10
- [x] N=1 produces identical results to baseline
- [x] All existing tests still pass
- [x] Type checker clean
- [x] E2E comparison shows pooled RMSE ≤ single-frame RMSE for majority of cameras