ScoNet Checkpoint Evaluation Reproduction Notes
This document records the findings and successful procedure for reproducing ScoNet checkpoint evaluation using uv and the OpenGait framework.
Observed Failure Sequence and Root Causes
1. Missing Dependencies (Eager Auto-Import)
OpenGait uses a dynamic registration pattern in opengait/modeling/models/__init__.py. When main.py imports models, it attempts to iterate through all modules in the models/ directory. If any model file (e.g., BiggerGait_DINOv2.py) has dependencies not installed in the current environment (like timm), the entire program fails even if you are not using that specific model.
Root Cause: The iter_modules loop in opengait/modeling/models/__init__.py imports every sibling module at package import time, so a single unresolved dependency in any model file aborts startup.
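The failure mode can be reproduced in isolation. The sketch below builds a throwaway package and runs an eager import loop in the style described above. It illustrates the pattern only and is not a copy of OpenGait's `__init__.py`; the names `demo_models`, `good_model`, `needs_missing_dep`, and `not_installed_dependency_xyz` are hypothetical.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path


def eager_import_all(package_dir: Path, package_name: str) -> list:
    """Import every module found in package_dir, as an eager registry would."""
    imported = []
    for _, module_name, _ in pkgutil.iter_modules([str(package_dir)]):
        # A single failing import here aborts the whole loop.
        importlib.import_module(f"{package_name}.{module_name}")
        imported.append(module_name)
    return imported


def build_demo_package(root: Path) -> Path:
    """Create a hypothetical package with one healthy and one broken module."""
    pkg = root / "demo_models"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "good_model.py").write_text("NAME = 'good'\n")
    # Stand-in for a model file that needs a library that is not installed.
    (pkg / "needs_missing_dep.py").write_text(
        "import not_installed_dependency_xyz\n"
    )
    return pkg


def demo() -> str:
    """Return the name of the module whose absence broke the eager import."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        pkg = build_demo_package(root)
        sys.path.insert(0, str(root))
        importlib.invalidate_caches()
        try:
            eager_import_all(pkg, "demo_models")
        except ModuleNotFoundError as exc:
            return exc.name
        finally:
            sys.path.remove(str(root))
    return ""


missing = demo()
print(missing)
```

Even though `good_model` imports cleanly, the loop never completes, which is why an unused model with an uninstalled dependency is enough to stop the run.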
2. GPU/World Size Mismatch
The runtime enforces a strict equality between the number of visible GPUs and the DDP world size in opengait/main.py:
    # opengait/main.py
    if torch.distributed.get_world_size() != torch.cuda.device_count():
        raise ValueError("Expect number of available GPUs({}) equals to the world size({}).".format(
            torch.cuda.device_count(), torch.distributed.get_world_size()))
Error Message: ValueError: Expect number of available GPUs(2) equals to the world size(1)
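A small pre-flight check can surface this mismatch before the launcher starts. The sketch below is a hypothetical helper (`visible_gpu_count` and `check_launch` are not part of OpenGait) and assumes `CUDA_VISIBLE_DEVICES` is a plain comma-separated list of device entries.

```python
from typing import Optional


def visible_gpu_count(env_value: Optional[str]) -> Optional[int]:
    """Count devices listed in a CUDA_VISIBLE_DEVICES-style string.

    Returns None when the variable is unset, because the count then
    depends on the hardware rather than on the environment.
    """
    if env_value is None:
        return None
    entries = [e.strip() for e in env_value.split(",") if e.strip()]
    return len(entries)


def check_launch(env_value: Optional[str], nproc_per_node: int) -> None:
    """Raise early if the visible GPU count and process count disagree."""
    count = visible_gpu_count(env_value)
    if count is not None and count != nproc_per_node:
        raise ValueError(
            f"CUDA_VISIBLE_DEVICES exposes {count} GPU(s) "
            f"but --nproc_per_node is {nproc_per_node}"
        )


# Matches a single-GPU setup: one visible device, one process.
check_launch("0", 1)
```

In practice the first argument would come from `os.environ.get("CUDA_VISIBLE_DEVICES")` and the second from the value passed to `--nproc_per_node`.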
3. Evaluator Sampler Batch Size Rule
The evaluator enforces that the total batch size must equal the number of GPUs in testing mode, as checked in opengait/modeling/base_model.py.
Error Message: ValueError: The batch size (8) must be equal to the number of GPUs (1) in testing mode!
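The rule reduces to a single comparison. The following sketch mirrors the error message quoted above; `check_test_batch_size` is a hypothetical stand-in for illustration, not the code in `opengait/modeling/base_model.py`.

```python
def check_test_batch_size(batch_size: int, world_size: int) -> None:
    """Mimic the test-mode rule: batch size must equal the process count."""
    if batch_size != world_size:
        raise ValueError(
            "The batch size ({}) must be equal to the number of GPUs ({}) "
            "in testing mode!".format(batch_size, world_size)
        )


# A batch size of 1 on one GPU passes silently.
check_test_batch_size(batch_size=1, world_size=1)

# A batch size of 8 on one GPU is rejected.
try:
    check_test_batch_size(batch_size=8, world_size=1)
except ValueError as exc:
    message = str(exc)
    print(message)
```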
Successful Reproduction Environment
- Runtime: uv with PEP 621 (pyproject.toml)
- Hardware: 1 visible GPU
- Dataset Path: Symlinked at datasets/Scoliosis1K (user-created link pointing to the actual data root).
Successful Command and Config
Command
CUDA_VISIBLE_DEVICES=0 uv run python -m torch.distributed.launch \
--nproc_per_node=1 \
opengait/main.py \
--cfgs ./configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
--phase test
Config Highlights (configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml)
data_cfg:
  dataset_root: ./datasets/Scoliosis1K/Scoliosis1K-sil-pkl
  dataset_partition: ./datasets/Scoliosis1K/Scoliosis1K_1116.json
evaluator_cfg:
  restore_hint: ./ckpt/ScoNet-20000.pt
  sampler:
    batch_size: 1 # Must be integer for evaluation
    sample_type: all_ordered
Final Metrics
The successful evaluation of the ScoNet-20000.pt checkpoint yielded:
| Metric | Value |
|---|---|
| Accuracy | 80.88% |
| Macro Precision | 81.50% |
| Macro Recall | 78.82% |
| Macro F1 | 75.14% |
Troubleshooting Checklist
- Environment: Ensure all dependencies for all registered models are installed (e.g., timm for BiggerGait_DINOv2.py) to avoid eager import failures in opengait/modeling/models/__init__.py.
- GPU Visibility: Match CUDA_VISIBLE_DEVICES count exactly with --nproc_per_node (checked in opengait/main.py).
- Config Check: Verify evaluator_cfg.sampler.batch_size equals the number of GPUs (checked in opengait/modeling/base_model.py).
- Data Paths: Ensure dataset_root and dataset_partition in the YAML point to valid paths (use symlinks under datasets/ for convenience).
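The data-path item can be checked mechanically. The sketch below assumes the YAML has already been loaded into a nested dict and that `restore_hint` holds a file path, as it does in the config above; `missing_paths` is a hypothetical helper, not an OpenGait function.

```python
from pathlib import Path


def missing_paths(cfg: dict) -> list:
    """Return the config keys whose paths do not exist on disk."""
    candidates = {
        "data_cfg.dataset_root": cfg["data_cfg"]["dataset_root"],
        "data_cfg.dataset_partition": cfg["data_cfg"]["dataset_partition"],
        "evaluator_cfg.restore_hint": cfg["evaluator_cfg"]["restore_hint"],
    }
    return [key for key, value in candidates.items() if not Path(value).exists()]


# Values copied from the config highlights in this document.
cfg = {
    "data_cfg": {
        "dataset_root": "./datasets/Scoliosis1K/Scoliosis1K-sil-pkl",
        "dataset_partition": "./datasets/Scoliosis1K/Scoliosis1K_1116.json",
    },
    "evaluator_cfg": {"restore_hint": "./ckpt/ScoNet-20000.pt"},
}

for key in missing_paths(cfg):
    print(f"missing: {key}")
```

Relative paths resolve against the current working directory, so the check is only meaningful when run from the repository root, the same place the evaluation command is launched from.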