feat(zed): add DuckDB segment timestamp indexer

Add a new mcap_video_bounds helper binary plus a zed_segment_time_index.py CLI that builds and queries an embedded DuckDB index for bundled ZED segment recordings.

The index stores segment folders, MCAP paths, video time bounds, durations, camera labels, and dataset metadata, and reuses the existing recursive multi-camera segment discovery logic so nested kindergarten layouts are indexed correctly.

Infer a dataset default timezone from folder names versus MCAP timestamps, and make point queries precision-aware so second-level folder timestamps like 2026-03-18T12-00-23 resolve to the matching segment instead of missing due to subsecond start offsets.

Verification:
- uv add 'duckdb>=1.0'
- cmake --build build --target mcap_video_bounds
- uv run python -m unittest tests.test_zed_segment_time_index
- uv run python scripts/zed_segment_time_index.py build /workspaces/data/kindergarten --jobs 8
- uv run python scripts/zed_segment_time_index.py query /workspaces/data/kindergarten --at 2026-03-18T12-00-23
2026-03-23 09:35:54 +00:00
parent a0b9c95d5b
commit e3a423433e
7 changed files with 1185 additions and 0 deletions
@@ -0,0 +1,97 @@
# ZED Segment Time Index
`scripts/zed_segment_time_index.py` builds and queries an embedded DuckDB index for bundled ZED segment folders.
Default artifact name:
```text
<DATASET_ROOT>/segment_time_index.duckdb
```
Primary commands:
```bash
uv run python scripts/zed_segment_time_index.py build <DATASET_ROOT>
uv run python scripts/zed_segment_time_index.py query <DATASET_ROOT> --at 2026-03-18T12-00-23
uv run python scripts/zed_segment_time_index.py query <DATASET_ROOT> --start 2026-03-18T12-00-23 --end 2026-03-18T12-00-30
```
## Data Source Rules
- Segment discovery is recursive and follows the same multi-camera layout assumptions as the batch ZED tooling.
- A directory is considered a valid segment when it contains at least two unique `*_zedN.svo` or `*_zedN.svo2` files and no duplicate camera labels.
- Timing is sourced from the segment MCAP, not from the SVO/SVO2 files.
- An otherwise valid segment is skipped, and therefore not indexed, when its directory contains no `.mcap` file or more than one.
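The rules above can be sketched as a small classifier over one directory listing. This is an illustrative sketch, not the script's actual discovery code: the function name `classify_segment` and the exact regular expression are assumptions, and the real tooling may match file names differently.

```python
import re

# Assumed pattern: a name ending in _zed<digits>.svo or _zed<digits>.svo2.
CAMERA_FILE = re.compile(r"_(zed\d+)\.svo2?$")


def classify_segment(file_names):
    """Return (camera_labels, mcap_names, indexable) for one directory listing.

    Hypothetical helper: a directory counts as a segment when it holds at
    least two camera files with no duplicate labels, and it is indexable
    only when exactly one .mcap file sits next to them.
    """
    labels = []
    for name in file_names:
        match = CAMERA_FILE.search(name)
        if match:
            labels.append(match.group(1))
    mcaps = [name for name in file_names if name.endswith(".mcap")]
    is_segment = len(labels) >= 2 and len(labels) == len(set(labels))
    indexable = is_segment and len(mcaps) == 1
    return sorted(labels), mcaps, indexable
```

A directory with two cameras but zero or two MCAP files is still recognized as a segment here, but `indexable` is `False`, which mirrors the skip rule.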
## MCAP Bounds Extraction
`build/bin/mcap_video_bounds` scans `foxglove.CompressedVideo` messages in one MCAP and emits:
- `start_ns`
- `end_ns`
- `duration_ns`
- `video_message_count`
- `start_iso_utc`
- `end_iso_utc`
The helper prefers the protobuf `CompressedVideo.timestamp` field and falls back to MCAP `logTime` when that field is zero.
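The timestamp preference can be illustrated in Python. This is only a sketch of the selection rule: it assumes the messages have already been decoded into `(timestamp_ns, log_time_ns)` pairs, the function name `video_bounds` is hypothetical, and the ISO fields are left out.

```python
def video_bounds(messages):
    """Compute bounds from an iterable of (timestamp_ns, log_time_ns) pairs.

    Hypothetical helper: each pair stands for one foxglove.CompressedVideo
    message, with the protobuf timestamp and the MCAP logTime both already
    converted to integer nanoseconds.
    """
    start_ns = None
    end_ns = None
    count = 0
    for timestamp_ns, log_time_ns in messages:
        # Prefer the protobuf timestamp; fall back to logTime when it is zero.
        effective = timestamp_ns if timestamp_ns != 0 else log_time_ns
        start_ns = effective if start_ns is None else min(start_ns, effective)
        end_ns = effective if end_ns is None else max(end_ns, effective)
        count += 1
    if count == 0:
        return None  # no video messages, so no bounds to report
    return {
        "start_ns": start_ns,
        "end_ns": end_ns,
        "duration_ns": end_ns - start_ns,
        "video_message_count": count,
    }
```

Taking the minimum and maximum rather than the first and last message keeps the bounds correct even if messages are not stored in timestamp order.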
## DuckDB Layout
The database contains two tables: `meta` and `segments`.
### `meta`
Key-value metadata for the index:
- `schema_version`: schema version of the index, currently `1`
- `dataset_root`: absolute dataset root used when the index was built
- `built_at_utc`: build timestamp in UTC
- `default_timezone`: inferred dataset wall-clock timezone used when querying with `--timezone dataset`
### `segments`
One row per indexed segment.
| Column | Type | Meaning |
|---|---|---|
| `segment_dir` | `VARCHAR` | Absolute path to the segment directory |
| `relative_segment_dir` | `VARCHAR` | Path relative to the dataset root |
| `group_path` | `VARCHAR` | Parent path of the segment within the dataset |
| `activity` | `VARCHAR` | First path component under the dataset root |
| `segment_name` | `VARCHAR` | Segment directory basename |
| `mcap_path` | `VARCHAR` | Absolute MCAP path used for timing |
| `start_ns` | `BIGINT` | Earliest video timestamp in nanoseconds since Unix epoch |
| `end_ns` | `BIGINT` | Latest video timestamp in nanoseconds since Unix epoch |
| `duration_ns` | `BIGINT` | `end_ns - start_ns` |
| `start_iso_utc` | `VARCHAR` | UTC rendering of `start_ns` |
| `end_iso_utc` | `VARCHAR` | UTC rendering of `end_ns` |
| `camera_count` | `INTEGER` | Number of discovered camera inputs in the segment directory |
| `camera_labels` | `VARCHAR` | Comma-separated camera labels, for example `zed1,zed2,zed3,zed4` |
| `video_message_count` | `BIGINT` | Number of `foxglove.CompressedVideo` messages observed in the MCAP |
| `index_source` | `VARCHAR` | Label of the extractor that produced the row, currently `mcap_video_bounds` |
Indexes are created on `start_ns` and `end_ns`.
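As an illustration of how the path-derived columns relate to each other, the sketch below computes them from a dataset root and a segment directory. The function name `path_columns` and the example folder names are hypothetical, and treating `group_path` as relative to the dataset root is an assumption based on the column description.

```python
from pathlib import PurePosixPath


def path_columns(dataset_root, segment_dir):
    """Derive the path-based `segments` columns for one segment (sketch)."""
    root = PurePosixPath(dataset_root)
    segment = PurePosixPath(segment_dir)
    relative = segment.relative_to(root)  # raises ValueError if outside root
    return {
        "segment_dir": str(segment),
        "relative_segment_dir": str(relative),
        # Assumption: group_path is stored relative to the dataset root.
        "group_path": str(relative.parent),
        "activity": relative.parts[0],
        "segment_name": segment.name,
    }


# Hypothetical layout: <root>/<activity>/<group>/<segment>
row = path_columns(
    "/workspaces/data/kindergarten",
    "/workspaces/data/kindergarten/outdoor/morning/2026-03-18T12-00-23",
)
```

With this hypothetical layout, `activity` is `outdoor`, `group_path` is `outdoor/morning`, and `segment_name` is `2026-03-18T12-00-23`.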
## Query Semantics
- `--at` performs an overlap lookup, not just an exact nanosecond equality check.
- Query precision follows the precision supplied by the user.
- A second-precision value like `2026-03-18T12-00-23` is treated as the whole second `[12:00:23.000, 12:00:23.999999999]`.
- Integer epochs are widened similarly by their apparent unit:
  - 10 digits or fewer: seconds
  - 11-13 digits: milliseconds
  - 14-16 digits: microseconds
  - 17+ digits: nanoseconds
- `--start/--end` returns every segment whose `[start_ns, end_ns]` overlaps the requested interval.
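The widening and overlap rules above can be sketched as follows. This is a minimal illustration rather than the script's code: the function names are hypothetical, negative epochs are not handled, and `widen_second` defaults to UTC here only to keep the example deterministic (the CLI default is the dataset timezone).

```python
from datetime import datetime, timezone

# (maximum digit count, nanoseconds per unit) for seconds, ms, and us.
NS_PER_UNIT = ((10, 1_000_000_000), (13, 1_000_000), (16, 1_000))


def widen_epoch(value):
    """Return the inclusive [lo_ns, hi_ns] range for a non-negative epoch."""
    digits = len(str(value))
    for max_digits, ns_per_unit in NS_PER_UNIT:
        if digits <= max_digits:
            lo = value * ns_per_unit
            return lo, lo + ns_per_unit - 1
    return value, value  # 17+ digits: already nanoseconds


def widen_second(text, tz=timezone.utc):
    """Widen a folder-style timestamp like 2026-03-18T12-00-23 to one second."""
    moment = datetime.strptime(text, "%Y-%m-%dT%H-%M-%S").replace(tzinfo=tz)
    lo = int(moment.timestamp()) * 1_000_000_000
    return lo, lo + 999_999_999


def overlaps(start_ns, end_ns, lo_ns, hi_ns):
    """True when the segment [start_ns, end_ns] intersects [lo_ns, hi_ns]."""
    return start_ns <= hi_ns and end_ns >= lo_ns
```

A segment whose first frame lands 0.4 s after the folder timestamp still matches, because the query covers the whole second instead of a single nanosecond.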
## Timezone Behavior
- Query default is `--timezone dataset`.
- `dataset` resolves to the `default_timezone` stored in `meta`.
- If inference is unavailable, the script falls back to `local`.
- Explicit values are also accepted:
  - `local`
  - `UTC`
  - fixed offsets such as `UTC+08:00`
  - IANA zone names such as `Asia/Shanghai`
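One way the accepted `--timezone` values could map to Python `tzinfo` objects is sketched below. The function name `resolve_timezone`, its `dataset_default` parameter (standing in for the `default_timezone` value read from `meta`), and the offset regular expression are assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumed fixed-offset syntax: UTC+HH:MM or UTC-HH:MM.
FIXED_OFFSET = re.compile(r"^UTC([+-])(\d{2}):(\d{2})$")


def resolve_timezone(name, dataset_default=None):
    """Map a --timezone value to a tzinfo object (illustrative sketch)."""
    if name == "dataset":
        # Fall back to the local zone when no inferred default is stored.
        name = dataset_default or "local"
    if name == "local":
        return datetime.now().astimezone().tzinfo
    if name == "UTC":
        return timezone.utc
    match = FIXED_OFFSET.match(name)
    if match:
        sign = 1 if match.group(1) == "+" else -1
        offset = timedelta(hours=int(match.group(2)), minutes=int(match.group(3)))
        return timezone(sign * offset)
    # Anything else is treated as an IANA zone name such as Asia/Shanghai.
    return ZoneInfo(name)
```

Checking the fixed-offset form before the IANA lookup keeps strings like `UTC+08:00` from being passed to `ZoneInfo`, which would reject them.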