OpenGait/docs/demo-window-and-stride.md

Demo Window, Stride, and Sequence Behavior (ScoNet)

This note explains how the opengait/demo runtime feeds silhouettes into the neural network, what stride means, and when the sliding window is reset.

Why sequence input (not single frame)

ScoNet-style inference is sequence-based: both ScoNet and DRF use temporal information across walking frames rather than classifying a single independent image.

Direct quotes from the papers

From ScoNet / ScoNet-MT (MICCAI 2024, 2407.05726v3):

"For experiments, 30 frames were selected from each gait sequence as input."
(Section 4.1, Implementation Details)

From the same paper's dataset description:

"Each sequence, containing approximately 300 frames at 15 frames per second..."
(Section 2.2, Data Collection and Preprocessing)

From DRF (MICCAI 2025, 2509.00872v1):

DRF follows ScoNet-MT's sequence-level setup/architecture in its implementation details, and its PAV branch also aggregates across frames:

"Sequence-Level PAV Refinement ... (2) Temporal Aggregation: For each metric, the mean of valid measurements across all frames is computed..."
(Section 3.1, PAV: Discrete Clinical Prior)

What papers say (and do not say) about stride

The papers define sequence-based inputs and temporal aggregation, but they do not define a deployment/runtime stride knob for online inference windows.

In other words:

  • The papers define the sequence framing (e.g., 30-frame inputs in ScoNet experiments).
  • Demo stride is an engineering control for how often to run inference in streaming mode.

What the demo feeds into the network

In opengait/demo, each inference uses the current silhouette buffer from SilhouetteWindow:

  • Per-frame silhouette shape: 64 x 44
  • Tensor shape for inference: [1, 1, window_size, 64, 44]
  • Default window_size: 30

So by default, one prediction uses 30 silhouettes.
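As a sketch, assuming NumPy-style buffers (the demo's actual helper and variable names may differ), assembling that input tensor looks like:

```python
import numpy as np

WINDOW_SIZE = 30        # default --window
SIL_H, SIL_W = 64, 44   # per-frame silhouette shape used by the demo

# Hypothetical buffer of binarized silhouettes, one array per frame.
buffer = [np.zeros((SIL_H, SIL_W), dtype=np.float32) for _ in range(WINDOW_SIZE)]

# Stack into the [batch, channel, frames, H, W] layout described above:
batch = np.stack(buffer)               # (30, 64, 44)
batch = batch[np.newaxis, np.newaxis]  # (1, 1, 30, 64, 44)

assert batch.shape == (1, 1, WINDOW_SIZE, SIL_H, SIL_W)
```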

What is stride?

stride means the minimum frame distance between two consecutive classifications after the window is already full.

In this demo, the window is a true sliding buffer. It is not cleared after each inference. After inference, the pipeline only records the last classified frame and continues buffering new silhouettes.

  • If stride = 1: classify at every new frame once ready
  • If stride = 30 (default): classify every 30 frames once ready
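The scheduling rule above can be sketched as follows (function and argument names are illustrative, not the demo's actual API):

```python
def should_classify(buffer_len, frame_idx, last_classified_frame,
                    window_size=30, stride=30):
    """Classify only when the window is full AND at least `stride`
    frames have passed since the last classification."""
    if buffer_len < window_size:
        return False                  # window not full yet
    if last_classified_frame is None:
        return True                   # first prediction as soon as ready
    return frame_idx - last_classified_frame >= stride

# stride=1: classify at every new frame once the window is full
assert should_classify(30, 31, 30, stride=1)
# stride=30: wait a full 30 frames between predictions
assert not should_classify(30, 45, 30, stride=30)
assert should_classify(30, 60, 30, stride=30)
```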

Window mode shortcut (--window-mode)

To make window scheduling explicit, the demo CLI supports:

  • --window-mode manual (default): use the exact --stride value
  • --window-mode sliding: force stride = 1 (max overlap)
  • --window-mode chunked: force stride = window (no overlap)

This is only a shortcut for runtime behavior. It does not change ScoNet weights or architecture.

Examples:

  • Sliding windows: --window 30 --window-mode sliding -> windows like 0-29, 1-30, 2-31, ...
  • Chunked windows: --window 30 --window-mode chunked -> windows like 0-29, 30-59, 60-89, ...
  • Manual stride: --window 30 --stride 10 --window-mode manual -> windows every 10 frames
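The shortcut amounts to a small mapping from mode to stride (a sketch; the real CLI parsing may differ):

```python
def effective_stride(window_mode, window, stride):
    """Map the --window-mode shortcut onto a concrete stride value."""
    if window_mode == "sliding":
        return 1        # max overlap
    if window_mode == "chunked":
        return window    # no overlap
    return stride        # "manual": use --stride as given

assert effective_stride("sliding", 30, 10) == 1
assert effective_stride("chunked", 30, 10) == 30
assert effective_stride("manual", 30, 10) == 10
```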

Time interval between predictions is approximately:

prediction_interval_seconds ~= stride / fps

If --target-fps is set, use the emitted (downsampled) fps in this formula.

Examples:

  • stride=30, fps=15 -> about 2.0s
  • stride=15, fps=30 -> about 0.5s

First prediction latency is approximately:

first_prediction_latency_seconds ~= window_size / fps

assuming detections are continuous.
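Both formulas are simple enough to check directly (variable names are illustrative):

```python
def prediction_interval_s(stride, fps):
    # Approximate time between consecutive predictions in steady state.
    return stride / fps

def first_prediction_latency_s(window_size, fps):
    # Approximate time to fill the window, assuming continuous detections.
    return window_size / fps

assert prediction_interval_s(30, 15) == 2.0
assert prediction_interval_s(15, 30) == 0.5
assert first_prediction_latency_s(30, 15) == 2.0
```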

Does the window clear when tracking target switches?

Yes. The window is reset in either of the following cases:

  1. Track ID changed (new tracking target)
  2. Frame gap too large (frame_idx - last_frame > gap_threshold)

Default gap_threshold in demo is 15 frames.

This prevents silhouettes from different people or long interrupted segments from being mixed into one inference window.

To be explicit:

  • Inference finished -> window stays (sliding continues)
  • Track ID changed -> window reset
  • Frame gap > gap_threshold -> window reset
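The reset rule can be sketched as follows (names are illustrative, not the demo's actual API):

```python
def should_reset(track_id, last_track_id, frame_idx, last_frame,
                 gap_threshold=15):
    """Reset the silhouette window on a new tracking target, or when
    the gap since the last buffered frame exceeds gap_threshold."""
    if last_track_id is not None and track_id != last_track_id:
        return True                   # track ID changed
    if last_frame is not None and frame_idx - last_frame > gap_threshold:
        return True                   # frame gap too large
    return False

assert should_reset(7, 3, 100, 99)        # new target -> reset
assert should_reset(3, 3, 120, 100)       # 20-frame gap > 15 -> reset
assert not should_reset(3, 3, 101, 100)   # same target, contiguous -> keep
```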

Practical note about real-time detections

The window fills only when a valid silhouette is produced (i.e., person detection/segmentation succeeds). If detections are intermittent, the real-world time covered by one window_size can be longer than window_size / fps.

Online vs offline behavior (important)

ScoNet's neural network does not hard-code a fixed frame count in the model graph. In OpenGait, frame count is controlled by sampling/runtime policy:

  • Training config typically uses frames_num_fixed: 30 with random fixed-frame sampling.
  • Offline evaluation often uses all_ordered sequences (with frames_all_limit as a memory guard).
  • Online demo uses the runtime window/stride scheduler.

So this does not mean the method only works offline. It means online performance depends on the latency/robustness trade-off you choose:

  • Smaller windows / larger stride -> lower latency, potentially less stable predictions
  • Larger windows / overlap -> smoother predictions, higher compute/latency

If you want behavior closest to ScoNet training assumptions, start from --window 30 and tune stride (or --window-mode) for your deployment latency budget.

Temporal downsampling (--target-fps)

Use --target-fps to normalize incoming frame cadence before silhouettes are pushed into the classification window.

  • Default (--target-fps 15): timestamp-based pacing emits frames at approximately 15 FPS into the window
  • Optional override (--no-target-fps): disable temporal downsampling and use all frames

Current default is --target-fps 15 to align runtime cadence with ScoNet training assumptions.

For offline video sources, pacing uses video-time timestamps (CAP_PROP_POS_MSEC) when available, with an FPS-based synthetic timestamp fallback. This avoids coupling downsampling to processing throughput.

This is useful when camera FPS differs from training cadence. For example, with a 24 FPS camera:

  • --target-fps 15 --window 30 keeps the model input at roughly 2.0 seconds of gait context (close to the paper setup)
  • --stride is interpreted in emitted-frame units after pacing
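Timestamp-based pacing can be sketched as below. This is a minimal illustration, not the demo's actual implementation; class and method names are hypothetical, and a production pacer would also guard against stalled or non-monotonic timestamps:

```python
class FramePacer:
    """Emit frames at ~target_fps based on source timestamps, so
    downsampling does not depend on processing throughput."""
    def __init__(self, target_fps=15.0):
        self.interval_ms = 1000.0 / target_fps
        self.next_emit_ms = None

    def accept(self, timestamp_ms):
        # First frame is always emitted; afterwards, emit only when the
        # source timestamp reaches the next scheduled emission time.
        if self.next_emit_ms is None:
            self.next_emit_ms = timestamp_ms + self.interval_ms
            return True
        if timestamp_ms >= self.next_emit_ms:
            self.next_emit_ms += self.interval_ms
            return True
        return False

# A 24 FPS source (one frame every ~41.7 ms of video time) paced to ~15 FPS:
pacer = FramePacer(target_fps=15.0)
emitted = sum(pacer.accept(i * 1000.0 / 24.0) for i in range(24))
assert 14 <= emitted <= 16   # roughly 15 of 24 frames emitted per second
```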