Demo Window, Stride, and Sequence Behavior (ScoNet)
This note explains how the opengait/demo runtime feeds silhouettes into the neural network, what stride means, and when the sliding window is reset.
Why sequence input (not single frame)
ScoNet-style inference is sequence-based.
- ScoNet / ScoNet-MT paper: https://arxiv.org/html/2407.05726v3
- DRF follow-up paper: https://arxiv.org/html/2509.00872v1
Both works use temporal information across walking frames rather than a single independent image.
Direct quotes from the papers
From ScoNet / ScoNet-MT (MICCAI 2024, 2407.05726v3):
"For experiments, 30 frames were selected from each gait sequence as input."
(Section 4.1, Implementation Details)
From the same paper's dataset description:
"Each sequence, containing approximately 300 frames at 15 frames per second..."
(Section 2.2, Data Collection and Preprocessing)
From DRF (MICCAI 2025, 2509.00872v1):
DRF follows ScoNet-MT's sequence-level setup/architecture in its implementation details, and its PAV branch also aggregates across frames:
"Sequence-Level PAV Refinement ... (2) Temporal Aggregation: For each metric, the mean of valid measurements across all frames is computed..."
(Section 3.1, PAV: Discrete Clinical Prior)
What papers say (and do not say) about stride
The papers define sequence-based inputs and temporal aggregation, but they do not define a deployment/runtime stride knob for online inference windows.
In other words:
- Paper gives the sequence framing (e.g., 30-frame inputs in ScoNet experiments).
- Demo `stride` is an engineering control for how often to run inference in streaming mode.
What the demo feeds into the network
In opengait/demo, each inference uses the current silhouette buffer from `SilhouetteWindow`:
- Per-frame silhouette shape: `64 x 44`
- Tensor shape for inference: `[1, 1, window_size, 64, 44]`
- Default `window_size`: `30`
So by default, one prediction uses 30 silhouettes.
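As a rough sketch of this buffering, the following class mimics the behavior described above. The class and method names are hypothetical, not the actual opengait/demo API:

```python
from collections import deque

class SilhouetteWindow:
    """Minimal sketch of a sliding silhouette buffer.

    Class and method names are illustrative; the real opengait/demo
    implementation may differ."""

    def __init__(self, window_size=30):
        self.window_size = window_size
        # deque with maxlen: oldest silhouettes drop out automatically
        self.buffer = deque(maxlen=window_size)

    def push(self, silhouette):
        # silhouette: one 64 x 44 binary mask (nested lists for simplicity)
        self.buffer.append(silhouette)

    def is_full(self):
        return len(self.buffer) == self.window_size

    def as_batch(self):
        # Nest to [1, 1, window_size, 64, 44], matching the demo's tensor shape
        return [[list(self.buffer)]]

win = SilhouetteWindow(window_size=30)
blank = [[0] * 44 for _ in range(64)]
for _ in range(30):
    win.push(blank)
batch = win.as_batch()
```

Because the deque caps its own length, pushing a 31st silhouette silently evicts the oldest one, which is exactly the sliding behavior described below.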
What is stride?
`stride` is the minimum number of frames between two consecutive classifications once the window is already full.
In this demo, the window is a true sliding buffer. It is not cleared after each inference. After inference, the pipeline only records the last classified frame and continues buffering new silhouettes.
- If `stride = 1`: classify at every new frame once the window is ready
- If `stride = 30` (default): classify every 30 frames once the window is ready
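The scheduling rule can be sketched as a small gate function (a simplification with hypothetical names, not the demo's actual code):

```python
def should_classify(frame_idx, last_classified, window_full, stride=30):
    """Classify only when the window is full and at least `stride`
    frames have passed since the last classification."""
    if not window_full:
        return False
    if last_classified is None:  # first prediction as soon as the window fills
        return True
    return frame_idx - last_classified >= stride

# Simulate 90 frames with window_size=30 and stride=30:
last = None
fired = []
for f in range(90):
    if should_classify(f, last, window_full=(f >= 29), stride=30):
        fired.append(f)
        last = f
# fires at frames 29, 59, 89
```

Note that the window itself keeps sliding between firings; `stride` only throttles how often inference runs on it.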
Window mode shortcut (--window-mode)
To make window scheduling explicit, the demo CLI supports:
- `--window-mode manual` (default): use the exact `--stride` value
- `--window-mode sliding`: force `stride = 1` (maximum overlap)
- `--window-mode chunked`: force `stride = window` (no overlap)
This is only a shortcut for runtime behavior. It does not change ScoNet weights or architecture.
Examples:
- Sliding windows: `--window 30 --window-mode sliding` -> windows like `0-29, 1-30, 2-31, ...`
- Chunked windows: `--window 30 --window-mode chunked` -> windows like `0-29, 30-59, 60-89, ...`
- Manual stride: `--window 30 --stride 10 --window-mode manual` -> a window every 10 frames (e.g. `0-29, 10-39, 20-49, ...`)
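The mode-to-stride mapping is simple enough to express directly (a sketch; the function name is hypothetical):

```python
def resolve_stride(window, stride, mode="manual"):
    """Map a window-mode shortcut to an effective stride value."""
    if mode == "sliding":
        return 1         # maximum overlap: classify at every new frame
    if mode == "chunked":
        return window    # no overlap: back-to-back windows
    return stride        # manual: honor the explicit --stride value

effective = resolve_stride(window=30, stride=10, mode="manual")
```

Because `sliding` and `chunked` are just endpoints of the same knob, any intermediate overlap is reachable with `manual` plus an explicit stride.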
Time interval between predictions is approximately:
`prediction_interval_seconds ~= stride / fps`
If `--target-fps` is set, use the emitted (downsampled) fps in this formula.
Examples:
- `stride=30`, `fps=15` -> about `2.0 s`
- `stride=15`, `fps=30` -> about `0.5 s`
First prediction latency is approximately:
`first_prediction_latency_seconds ~= window_size / fps`
assuming detections are continuous.
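Both formulas are plain ratios, so they can be checked with two one-line helpers (helper names are illustrative):

```python
def prediction_interval_s(stride, fps):
    # Steady-state time between predictions once the window is full
    return stride / fps

def first_prediction_latency_s(window_size, fps):
    # Time to fill the window before the first prediction,
    # assuming detections are continuous
    return window_size / fps

interval = prediction_interval_s(stride=30, fps=15)          # 2.0 s
latency = first_prediction_latency_s(window_size=30, fps=15) # 2.0 s
```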
Does the window clear when tracking target switches?
Yes. The window is reset in either case:
- Track ID changed (new tracking target)
- Frame gap too large (`frame_idx - last_frame > gap_threshold`)
The default `gap_threshold` in the demo is 15 frames.
This prevents silhouettes from different people or long interrupted segments from being mixed into one inference window.
To be explicit:
- Inference finished -> window stays (sliding continues)
- Track ID changed -> window reset
- Frame gap > gap_threshold -> window reset
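The two reset conditions can be sketched as a single predicate (a simplification with hypothetical names, not the demo's exact code):

```python
def should_reset(track_id, last_track_id, frame_idx, last_frame,
                 gap_threshold=15):
    """Reset the window on a tracking-target switch or a long detection gap."""
    if last_track_id is not None and track_id != last_track_id:
        return True  # new tracking target: do not mix people in one window
    if last_frame is not None and frame_idx - last_frame > gap_threshold:
        return True  # detection gap too large: segments are not contiguous
    return False     # otherwise keep sliding

# Track switch resets; a 10-frame gap within the threshold does not
switch = should_reset(track_id=2, last_track_id=1, frame_idx=100, last_frame=99)
small_gap = should_reset(track_id=1, last_track_id=1, frame_idx=110, last_frame=100)
```

Note that finishing an inference is deliberately absent from this predicate: only identity changes and gaps clear the buffer.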
Practical note about real-time detections
The window fills only when a valid silhouette is produced (i.e., person detection/segmentation succeeds). If detections are intermittent, the real-world time covered by one `window_size` can be longer than `window_size / fps`.
Online vs offline behavior (important)
ScoNet's neural network does not hard-code a fixed frame count in the model graph. In OpenGait, frame count is controlled by sampling/runtime policy:
- Training config typically uses `frames_num_fixed: 30` with random fixed-frame sampling.
- Offline evaluation often uses `all_ordered` sequences (with `frames_all_limit` as a memory guard).
- Online demo uses the runtime window/stride scheduler.
So this does not mean the method only works offline. It means online performance depends on the latency/robustness trade-off you choose:
- Smaller windows / larger stride -> lower latency, potentially less stable predictions
- Larger windows / overlap -> smoother predictions, higher compute/latency
If you want behavior closest to ScoNet training assumptions, start from `--window 30` and tune `stride` (or `--window-mode`) for your deployment latency budget.
Temporal downsampling (--target-fps)
Use `--target-fps` to normalize incoming frame cadence before silhouettes are pushed into the classification window.
- Default (`--target-fps 15`): timestamp-based pacing emits frames into the window at approximately 15 FPS
- Override (`--no-target-fps`): disable temporal downsampling and use every frame
The current default is `--target-fps 15`, which aligns runtime cadence with ScoNet training assumptions.
For offline video sources, pacing uses video-time timestamps (`CAP_PROP_POS_MSEC`) when available, with an FPS-based synthetic timestamp fallback. This avoids coupling downsampling to processing throughput.
This is useful when camera FPS differs from training cadence. For example, with a 24 FPS camera:
- `--target-fps 15 --window 30` keeps the model input near ~2.0 seconds of gait context (close to the paper setup)
- `--stride` is interpreted in emitted-frame units after pacing
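A minimal sketch of timestamp-based pacing, assuming a simple scheduled-slot scheme (the actual demo's pacing logic may differ in detail):

```python
def make_pacer(target_fps=15.0):
    """Emit a frame only when source time has reached the next scheduled slot."""
    period = 1.0 / target_fps
    state = {"next_t": 0.0}

    def emit(timestamp_s):
        # Small epsilon guards against floating-point jitter at slot edges
        if timestamp_s + 1e-9 >= state["next_t"]:
            # Advance the schedule; max() avoids a burst of emits after a gap
            state["next_t"] = max(state["next_t"] + period, timestamp_s)
            return True
        return False

    return emit

# One second of a 24 FPS source, paced down toward 15 FPS
emit = make_pacer(15.0)
kept = sum(emit(i / 24.0) for i in range(24))  # 15 frames kept out of 24
```

Keying the decision to source timestamps rather than wall-clock arrival time is what decouples the emitted cadence from processing throughput, as noted above for offline video.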