# Demo Window, Stride, and Sequence Behavior (ScoNet)

This note explains how the `opengait/demo` runtime feeds silhouettes into the neural network, what `stride` means, and when the sliding window is reset.

## Why sequence input (not single frame)

ScoNet-style inference is sequence-based.

- ScoNet / ScoNet-MT paper: https://arxiv.org/html/2407.05726v3
- DRF follow-up paper: https://arxiv.org/html/2509.00872v1

Both works use temporal information across walking frames rather than a single independent image.

### Direct quotes from the papers

From ScoNet / ScoNet-MT (MICCAI 2024, `2407.05726v3`):

> "For experiments, **30 frames were selected from each gait sequence as input**."
> (Section 4.1, Implementation Details)

From the same paper's dataset description:

> "Each sequence, containing approximately **300 frames at 15 frames per second**..."
> (Section 2.2, Data Collection and Preprocessing)

From DRF (MICCAI 2025, `2509.00872v1`):

DRF follows ScoNet-MT's sequence-level setup/architecture in its implementation details, and its PAV branch also aggregates across frames:

> "Sequence-Level PAV Refinement ... (2) **Temporal Aggregation**: For each metric, the mean of valid measurements across **all frames** is computed..."
> (Section 3.1, PAV: Discrete Clinical Prior)

## What papers say (and do not say) about stride

The papers define sequence-based inputs and temporal aggregation, but they do **not** define a deployment/runtime `stride` knob for online inference windows.

In other words:

- The papers give the sequence framing (e.g., 30-frame inputs in ScoNet experiments).
- The demo's `stride` is an engineering control for how often to run inference in streaming mode.

## What the demo feeds into the network

In `opengait/demo`, each inference uses the current silhouette buffer from `SilhouetteWindow`:

- Per-frame silhouette shape: `64 x 44`
- Tensor shape for inference: `[1, 1, window_size, 64, 44]`
- Default `window_size`: `30`

So by default, one prediction uses **30 silhouettes**.

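
As a sketch of how a full buffer could be packed into that tensor layout (assuming NumPy; `pack_window` is an illustrative helper, not the demo's actual code):

```python
import numpy as np

# Illustrative sketch: stack a buffer of 64x44 silhouettes into the
# [1, 1, window_size, 64, 44] layout described above.
# The two leading dims of size 1 are the batch and "set" dimensions.
def pack_window(silhouettes):
    window = np.stack(silhouettes, axis=0)      # [window_size, 64, 44]
    return window[np.newaxis, np.newaxis, ...]  # [1, 1, window_size, 64, 44]

buffer = [np.zeros((64, 44), dtype=np.float32) for _ in range(30)]
print(pack_window(buffer).shape)  # (1, 1, 30, 64, 44)
```
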
## What is stride?

`stride` is the minimum frame distance between two consecutive classifications **after** the window is already full.

In this demo, the window is a true sliding buffer. It is **not** cleared after each inference. After inference, the pipeline only records the last classified frame and continues buffering new silhouettes.

- If `stride = 1`: classify at every new frame once the window is ready
- If `stride = 30` (default): classify every 30 frames once the window is ready

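
The scheduling rule above can be sketched as a small predicate (an illustrative sketch; names like `last_classified` are assumptions, not the demo's actual variables):

```python
# Sketch of the stride rule: run inference only when the window is full
# AND at least `stride` frames have passed since the last classification.
def should_classify(buffer_len, window_size, frame_idx, last_classified, stride):
    if buffer_len < window_size:
        return False  # window not full yet
    if last_classified is None:
        return True   # first prediction as soon as the window fills
    return frame_idx - last_classified >= stride

# Simulate 120 frames with window=30, stride=30 (0-indexed frames):
last = None
fired = []
for i in range(120):
    if should_classify(min(i + 1, 30), 30, i, last, 30):
        fired.append(i)
        last = i
print(fired)  # [29, 59, 89, 119]
```
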
## Window mode shortcut (`--window-mode`)

To make window scheduling explicit, the demo CLI supports:

- `--window-mode manual` (default): use the exact `--stride` value
- `--window-mode sliding`: force `stride = 1` (max overlap)
- `--window-mode chunked`: force `stride = window` (no overlap)

This is only a shortcut for runtime behavior. It does not change ScoNet weights or architecture.

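
The mode-to-stride mapping can be expressed as one helper (an illustrative sketch; the demo's actual CLI handling may differ):

```python
# Sketch: resolve the effective stride from --window-mode, per the
# mapping documented above.
def effective_stride(mode, window_size, manual_stride):
    if mode == "sliding":
        return 1              # max overlap
    if mode == "chunked":
        return window_size    # no overlap
    if mode == "manual":
        return manual_stride  # use --stride as given
    raise ValueError(f"unknown window mode: {mode}")

print(effective_stride("chunked", 30, 10))  # 30
```
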
Examples:

- Sliding windows: `--window 30 --window-mode sliding` -> windows like `0-29, 1-30, 2-31, ...`
- Chunked windows: `--window 30 --window-mode chunked` -> windows like `0-29, 30-59, 60-89, ...`
- Manual stride: `--window 30 --stride 10 --window-mode manual` -> windows every 10 frames

The time interval between predictions is approximately:

`prediction_interval_seconds ~= stride / fps`

If `--target-fps` is set, use the emitted (downsampled) fps in this formula.

Examples:

- `stride=30`, `fps=15` -> about `2.0s`
- `stride=15`, `fps=30` -> about `0.5s`

The first-prediction latency is approximately:

`first_prediction_latency_seconds ~= window_size / fps`

assuming detections are continuous.

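
Both timing formulas can be checked with a few lines of arithmetic (nothing here is demo code, just the formulas above):

```python
# Prediction cadence and first-prediction latency from the formulas above.
def prediction_interval_s(stride, fps):
    return stride / fps

def first_prediction_latency_s(window_size, fps):
    return window_size / fps

print(prediction_interval_s(30, 15))       # 2.0
print(prediction_interval_s(15, 30))       # 0.5
print(first_prediction_latency_s(30, 15))  # 2.0 (default window at 15 FPS)
```
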
## Does the window clear when tracking target switches?

Yes. The window is reset in either of two cases:

1. **Track ID changed** (new tracking target)
2. **Frame gap too large** (`frame_idx - last_frame > gap_threshold`)

The default `gap_threshold` in the demo is `15` frames.

This prevents silhouettes from different people or long interrupted segments from being mixed into one inference window.

To be explicit:

- **Inference finished** -> window stays (sliding continues)
- **Track ID changed** -> window reset
- **Frame gap > gap_threshold** -> window reset

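
The two reset conditions can be sketched as a predicate (illustrative only; names such as `last_track_id` are assumptions, not the demo's actual fields):

```python
# Sketch of the reset rule: clear the buffer on a new track ID or on a
# frame gap above gap_threshold. Finishing an inference never clears it.
def should_reset(track_id, last_track_id, frame_idx, last_frame, gap_threshold=15):
    if last_track_id is not None and track_id != last_track_id:
        return True  # tracking target switched
    if last_frame is not None and frame_idx - last_frame > gap_threshold:
        return True  # detections interrupted for too long
    return False

print(should_reset(track_id=7, last_track_id=3, frame_idx=10, last_frame=9))   # True
print(should_reset(track_id=3, last_track_id=3, frame_idx=40, last_frame=10))  # True (gap 30 > 15)
print(should_reset(track_id=3, last_track_id=3, frame_idx=12, last_frame=11))  # False
```
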
## Practical note about real-time detections

The window fills only when a valid silhouette is produced (i.e., person detection/segmentation succeeds). If detections are intermittent, the real-world time covered by one `window_size` can be longer than `window_size / fps`.

## Online vs offline behavior (important)

ScoNet's neural network does not hard-code a fixed frame count in the model graph. In OpenGait, frame count is controlled by sampling/runtime policy:

- The training config typically uses `frames_num_fixed: 30` with random fixed-frame sampling.
- Offline evaluation often uses `all_ordered` sequences (with `frames_all_limit` as a memory guard).
- The online demo uses the runtime window/stride scheduler.

So this does not mean the method only works offline. It means online performance depends on the latency/robustness trade-off you choose:

- Smaller windows / larger stride -> lower latency, potentially less stable predictions
- Larger windows / overlap -> smoother predictions, higher compute/latency

If you want behavior closest to ScoNet training assumptions, start from `--window 30` and tune stride (or `--window-mode`) for your deployment latency budget.

## Temporal downsampling (`--target-fps`)

Use `--target-fps` to normalize incoming frame cadence before silhouettes are pushed into the classification window.

- Default (`--target-fps 15`): timestamp-based pacing emits frames at approximately 15 FPS into the window
- Optional override (`--no-target-fps`): disable temporal downsampling and use all frames

Current default is `--target-fps 15` to align runtime cadence with ScoNet training assumptions.

For offline video sources, pacing uses video-time timestamps (`CAP_PROP_POS_MSEC`) when available, with an FPS-based synthetic timestamp fallback. This avoids coupling downsampling to processing throughput.

This is useful when camera FPS differs from training cadence. For example, with a 24 FPS camera:

- `--target-fps 15 --window 30` keeps model input near ~2.0 seconds of gait context (close to the paper setup)
- `--stride` is interpreted in emitted-frame units after pacing
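
Timestamp-based pacing of this kind can be sketched as a minimal emitter (an illustrative assumption using millisecond timestamps; this is not the demo's actual implementation, and a real pacer may compensate for accumulated drift):

```python
# Sketch: emit a frame only when at least 1/target_fps seconds of
# source time have elapsed since the last emitted frame.
class FramePacer:
    def __init__(self, target_fps=15.0):
        self.min_gap_ms = 1000.0 / target_fps
        self.last_emit_ms = None

    def should_emit(self, timestamp_ms):
        if self.last_emit_ms is None or timestamp_ms - self.last_emit_ms >= self.min_gap_ms:
            self.last_emit_ms = timestamp_ms
            return True
        return False

# One second of a 24 FPS source (frames every ~41.7 ms) paced to <= 15 FPS.
# This naive rule keeps every other frame here (12 of 24), slightly under
# the 15 FPS target, because it never emits faster than the target.
pacer = FramePacer(target_fps=15.0)
emitted = [i for i in range(24) if pacer.should_emit(i * 1000.0 / 24.0)]
print(len(emitted))  # 12
```
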