ScoNet Checkpoint Evaluation Reproduction Notes

This document records the findings and successful procedure for reproducing ScoNet checkpoint evaluation using uv and the OpenGait framework.

Observed Failure Sequence and Root Causes

1. Missing Dependencies (Eager Auto-Import)

OpenGait uses a dynamic registration pattern in opengait/modeling/models/__init__.py. When main.py imports the models package, the package iterates over every module in the models/ directory and imports each one. If any model file (e.g., BiggerGait_DINOv2.py) depends on a package that is not installed in the current environment (such as timm), that import raises and the entire program fails, even if you are not using that specific model.

Root Cause: iter_modules in opengait/modeling/models/__init__.py triggers imports of all sibling files.
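
The failure mode can be reproduced in isolation. The sketch below is not OpenGait's code: it builds a throwaway package whose __init__.py imports every sibling module through pkgutil.iter_modules, then shows that one sibling with an uninstalled dependency makes the whole package import fail. The names demo_models, good_model, needs_missing_dep, and surely_not_installed_dep are hypothetical and exist only for this illustration.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Eager registration in the spirit of the pattern described above:
# the package imports every sibling module at import time.
INIT_SRC = textwrap.dedent("""
    from importlib import import_module
    from pkgutil import iter_modules

    for _, name, _ in iter_modules(__path__):
        import_module(f"{__name__}.{name}")
""")


def import_demo_package() -> str:
    """Build the throwaway package and report how importing it ends."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_models"
        pkg.mkdir()
        (pkg / "__init__.py").write_text(INIT_SRC)
        (pkg / "good_model.py").write_text("NAME = 'good'\n")
        # Hypothetical sibling that needs a package which is not installed.
        (pkg / "needs_missing_dep.py").write_text(
            "import surely_not_installed_dep\n"
        )
        sys.path.insert(0, tmp)
        try:
            importlib.import_module("demo_models")
            return "ok"
        except ModuleNotFoundError as exc:
            return f"failed: missing {exc.name}"
        finally:
            sys.path.remove(tmp)
            # Drop partially imported modules so the demo can be rerun.
            for mod in [m for m in sys.modules if m.startswith("demo_models")]:
                del sys.modules[mod]


if __name__ == "__main__":
    print(import_demo_package())
```

Even though good_model imports cleanly, the package as a whole cannot be imported. This is why a missing timm blocks a ScoNet run.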

2. GPU/World Size Mismatch

The runtime enforces a strict equality between the number of visible GPUs and the DDP world size in opengait/main.py:

# opengait/main.py
if torch.distributed.get_world_size() != torch.cuda.device_count():
    raise ValueError("Expect number of available GPUs({}) equals to the world size({}).".format(
        torch.cuda.device_count(), torch.distributed.get_world_size()))

Error Message: ValueError: Expect number of available GPUs(2) equals to the world size(1)
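
On a single node, torch.distributed.launch starts --nproc_per_node processes, so the world size equals that value. The error above therefore appears when two GPUs are visible but only one process is launched. A pre-flight check along the following lines can catch the mismatch before launching. It is a minimal sketch with hypothetical helper names (visible_gpu_count, check_launch), and it assumes CUDA_VISIBLE_DEVICES is set explicitly. Counting entries in that string only approximates torch.cuda.device_count(), which is what main.py actually checks.

```python
def visible_gpu_count(cuda_visible_devices: str) -> int:
    """Count non-empty entries in a CUDA_VISIBLE_DEVICES-style string."""
    return sum(1 for entry in cuda_visible_devices.split(",") if entry.strip())


def check_launch(cuda_visible_devices: str, nproc_per_node: int) -> None:
    """Raise before launching if the two counts disagree."""
    n_gpus = visible_gpu_count(cuda_visible_devices)
    if n_gpus != nproc_per_node:
        raise ValueError(
            f"{n_gpus} visible GPU(s) but --nproc_per_node={nproc_per_node}; "
            "adjust CUDA_VISIBLE_DEVICES or --nproc_per_node so they match."
        )


if __name__ == "__main__":
    check_launch("0", 1)  # same pairing as the successful command in these notes
    try:
        check_launch("0,1", 1)  # two visible GPUs, one process
    except ValueError as err:
        print(err)
```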

3. Evaluator Sampler Batch Size Rule

The evaluator enforces that the total batch size must equal the number of GPUs in testing mode, as checked in opengait/modeling/base_model.py.

Error Message: ValueError: The batch size (8) must be equal to the number of GPUs (1) in testing mode!
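
The rule can be paraphrased as a small guard. The function below is an illustrative sketch, not a copy of the check in opengait/modeling/base_model.py. The name check_eval_batch_size is hypothetical, and only the error text is taken from the message quoted above.

```python
def check_eval_batch_size(batch_size: int, world_size: int) -> None:
    """Paraphrase of the rule above: total batch size equals the GPU count."""
    if batch_size != world_size:
        raise ValueError(
            f"The batch size ({batch_size}) must be equal to the number of GPUs "
            f"({world_size}) in testing mode!"
        )


if __name__ == "__main__":
    check_eval_batch_size(1, 1)  # passes: one sample per GPU
    try:
        check_eval_batch_size(8, 1)  # the failing combination from these notes
    except ValueError as err:
        print(err)
```

With one visible GPU, the only accepted value is batch_size: 1.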

Successful Reproduction Environment

  • Runtime: uv, with project metadata declared in pyproject.toml (PEP 621)
  • Hardware: 1 visible GPU
  • Dataset Path: Symlinked at datasets/Scoliosis1K (user-created link pointing to the actual data root).
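
One way to create such a link is sketched below. The helper name link_dataset is hypothetical, and the demo uses temporary stand-in directories rather than the real data root, which these notes do not specify.

```python
import tempfile
from pathlib import Path


def link_dataset(data_root: Path, link_path: Path) -> Path:
    """Create link_path as a symlink to data_root, replacing a stale link."""
    link_path.parent.mkdir(parents=True, exist_ok=True)
    if link_path.is_symlink():
        link_path.unlink()
    link_path.symlink_to(data_root, target_is_directory=True)
    return link_path


if __name__ == "__main__":
    # Stand-in directories; substitute the real data root and repo checkout.
    with tempfile.TemporaryDirectory() as tmp:
        real = Path(tmp) / "storage" / "Scoliosis1K"
        real.mkdir(parents=True)
        link = link_dataset(real, Path(tmp) / "repo" / "datasets" / "Scoliosis1K")
        print(link.is_symlink(), link.resolve() == real.resolve())
```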

Successful Command and Config

Command

CUDA_VISIBLE_DEVICES=0 uv run python -m torch.distributed.launch \
    --nproc_per_node=1 \
    opengait/main.py \
    --cfgs ./configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
    --phase test

Config Highlights (configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml)

data_cfg:
  dataset_root: ./datasets/Scoliosis1K/Scoliosis1K-sil-pkl
  dataset_partition: ./datasets/Scoliosis1K/Scoliosis1K_1116.json

evaluator_cfg:
  restore_hint: ./ckpt/ScoNet-20000.pt
  sampler:
    batch_size: 1  # Must be an integer equal to the number of GPUs in test mode
    sample_type: all_ordered

Final Metrics

The successful evaluation of the ScoNet-20000.pt checkpoint yielded:

Metric             Value
Accuracy           80.88%
Macro Precision    81.50%
Macro Recall       78.82%
Macro F1           75.14%
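
A Macro F1 below both Macro Precision and Macro Recall can look surprising. It is consistent with macro F1 being the unweighted mean of per-class F1 scores (the convention scikit-learn uses). That convention is an assumption here, since these notes do not state how the evaluator computes it. The sketch below uses a made-up two-class confusion matrix, not the ScoNet results, to show the effect.

```python
def macro_metrics(confusion):
    """Return (macro precision, macro recall, macro F1).

    confusion[i][j] is the number of samples of true class i predicted as j.
    Macro F1 is taken as the mean of per-class F1 scores (assumed convention).
    """
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))
        actual_k = sum(confusion[k])
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    # Toy data: class 0 is precise but poorly recalled, class 1 the reverse.
    toy = [[5, 5],
           [0, 5]]
    p, r, f1 = macro_metrics(toy)
    print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

In this toy matrix, macro precision and macro recall are both 0.75 while macro F1 is about 0.667. Each class's F1 is pulled toward the weaker of its own precision and recall.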

Troubleshooting Checklist

  1. Environment: Ensure all dependencies for all registered models are installed (e.g., timm for BiggerGait_DINOv2.py) to avoid eager import failures in opengait/modeling/models/__init__.py.
  2. GPU Visibility: Make the number of devices listed in CUDA_VISIBLE_DEVICES equal to --nproc_per_node (checked in opengait/main.py).
  3. Config Check: Verify evaluator_cfg.sampler.batch_size equals the number of GPUs (checked in opengait/modeling/base_model.py).
  4. Data Paths: Ensure dataset_root and dataset_partition in the YAML point to valid paths (use symlinks under datasets/ for convenience).