# ScoNet Checkpoint Evaluation Reproduction Notes

This document records the failures encountered, and the procedure that finally succeeded, when reproducing ScoNet checkpoint evaluation with `uv` and the OpenGait framework.

## Observed Failure Sequence and Root Causes

### 1. Missing Dependencies (Eager Auto-Import)
OpenGait registers models dynamically in `opengait/modeling/models/__init__.py`. When `main.py` imports `models`, that `__init__.py` iterates over every module in the `models/` directory and imports each one. If any model file (e.g., `BiggerGait_DINOv2.py`) depends on a package that is not installed in the current environment (such as `timm`), the import raises and the entire program fails, even if you are not using that model.

**Root Cause:** `iter_modules` in `opengait/modeling/models/__init__.py` triggers imports of all sibling files.
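The failure mode can be reproduced in isolation. The sketch below builds a throwaway package (`demo_models`, a made-up name) with one healthy module and one module that imports a package that is not installed, then walks it with `pkgutil.iter_modules`. The `tolerant` flag is a hypothetical workaround shown for contrast; it is not OpenGait's code, and this reproduction did not patch the framework.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> str:
    """Create a throwaway package with one healthy and one broken module."""
    pkg = root / "demo_models"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "good_model.py").write_text("class GoodModel:\n    pass\n")
    # Stands in for a model file that needs an uninstalled package such as timm.
    (pkg / "broken_model.py").write_text("import a_package_that_is_not_installed_xyz\n")
    importlib.invalidate_caches()  # make sure the new files are discoverable
    return "demo_models"


def import_all(package_name: str, tolerant: bool) -> tuple[list[str], list[str]]:
    """Import every sibling module; optionally skip those that fail to import."""
    package = importlib.import_module(package_name)
    loaded, skipped = [], []
    for _, module_name, _ in pkgutil.iter_modules(package.__path__):
        try:
            importlib.import_module(f"{package_name}.{module_name}")
            loaded.append(module_name)
        except ImportError:
            if not tolerant:
                raise  # eager behaviour: one bad module aborts everything
            skipped.append(module_name)
    return loaded, skipped


with tempfile.TemporaryDirectory() as tmp:
    sys.path.insert(0, tmp)
    try:
        name = build_demo_package(Path(tmp))
        try:
            import_all(name, tolerant=False)
            eager_failed = False
        except ImportError:
            eager_failed = True
        loaded, skipped = import_all(name, tolerant=True)
    finally:
        sys.path.remove(tmp)

print(f"eager import failed: {eager_failed}; loaded={loaded}; skipped={skipped}")
```

With the eager loop, the broken module aborts the whole import even though `good_model` is fine, which mirrors the `timm` failure described above.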

### 2. GPU/World Size Mismatch
The runtime enforces a strict equality between the number of visible GPUs and the DDP world size in `opengait/main.py`:

```python
# opengait/main.py
if torch.distributed.get_world_size() != torch.cuda.device_count():
    raise ValueError("Expect number of available GPUs({}) equals to the world size({}).".format(
        torch.cuda.device_count(), torch.distributed.get_world_size()))
```

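**Error Message:** `ValueError: Expect number of available GPUs(2) equals to the world size(1).`

In this run two GPUs were visible to the process while the launcher started only one process, so `device_count()` returned 2 and the world size was 1.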
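A pre-flight check can catch this mismatch before the launcher starts. The sketch below is a hypothetical helper, not part of OpenGait: `count_visible_devices` and `check_launch` are names invented here. It only counts comma-separated entries in `CUDA_VISIBLE_DEVICES`, assumes every listed ID is valid, and cannot know the GPU count when the variable is unset.

```python
import os
from typing import Optional


def count_visible_devices(cuda_visible_devices: Optional[str]) -> Optional[int]:
    """Count entries in a CUDA_VISIBLE_DEVICES value.

    Returns None when the variable is unset, because all physical GPUs are
    then visible and the count cannot be derived from the environment alone.
    """
    if cuda_visible_devices is None:
        return None
    return len([entry for entry in cuda_visible_devices.split(",") if entry.strip()])


def check_launch(cuda_visible_devices: Optional[str], nproc_per_node: int) -> bool:
    """Return True when the visible GPU count matches the planned world size."""
    visible = count_visible_devices(cuda_visible_devices)
    if visible is None:
        # Unknown count: defer to the runtime check in opengait/main.py.
        return True
    return visible == nproc_per_node


if __name__ == "__main__":
    ok = check_launch(os.environ.get("CUDA_VISIBLE_DEVICES"), nproc_per_node=1)
    print("launch settings consistent" if ok else "GPU count and --nproc_per_node differ")
```

For the failing case above, `check_launch("0,1", 1)` returns `False`, while `check_launch("0", 1)` returns `True`.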

### 3. Evaluator Sampler Batch Size Rule
In testing mode, the evaluator sampler's total batch size must equal the number of GPUs. The check lives in `opengait/modeling/base_model.py`.

**Error Message:** `ValueError: The batch size (8) must be equal to the number of GPUs (1) in testing mode!`
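The same rule can be checked against the parsed config before launching. This is a minimal sketch under assumptions: `check_eval_batch_size` is a hypothetical helper, the config is a plain dict mirroring the YAML keys used in this document, and the message imitates the error above rather than reproducing OpenGait's code.

```python
def check_eval_batch_size(cfg: dict, world_size: int) -> None:
    """Raise ValueError when the evaluator batch size differs from the world size."""
    batch_size = cfg["evaluator_cfg"]["sampler"]["batch_size"]
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(batch_size, bool) or not isinstance(batch_size, int):
        raise ValueError(f"Evaluation batch_size must be an integer, got {batch_size!r}")
    if batch_size != world_size:
        raise ValueError(
            f"The batch size ({batch_size}) must be equal to the number of GPUs "
            f"({world_size}) in testing mode!"
        )


# A single-GPU evaluation config passes; a batch size of 8 on one GPU does not.
good_cfg = {"evaluator_cfg": {"sampler": {"batch_size": 1}}}
bad_cfg = {"evaluator_cfg": {"sampler": {"batch_size": 8}}}

check_eval_batch_size(good_cfg, world_size=1)
try:
    check_eval_batch_size(bad_cfg, world_size=1)
    message = ""
except ValueError as err:
    message = str(err)
print(message)
```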

## Successful Reproduction Environment

- **Runtime:** `uv` with PEP 621 (`pyproject.toml`)
- **Hardware:** 1 visible GPU
- **Dataset Path:** Symlinked at `datasets/Scoliosis1K` (user-created link pointing to the actual data root).

## Successful Command and Config

### Command
```bash
CUDA_VISIBLE_DEVICES=0 uv run python -m torch.distributed.launch \
    --nproc_per_node=1 \
    opengait/main.py \
    --cfgs ./configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
    --phase test
```

### Config Highlights (`configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml`)
```yaml
data_cfg:
  dataset_root: ./datasets/Scoliosis1K/Scoliosis1K-sil-pkl
  dataset_partition: ./datasets/Scoliosis1K/Scoliosis1K_1116.json

evaluator_cfg:
  restore_hint: ./ckpt/ScoNet-20000.pt
  sampler:
    batch_size: 1  # Integer equal to the number of GPUs (1 here)
    sample_type: all_ordered
```

## Final Metrics
The successful evaluation of the `ScoNet-20000.pt` checkpoint yielded:

| Metric | Value |
| :--- | :--- |
| **Accuracy** | 80.88% |
| **Macro Precision** | 81.50% |
| **Macro Recall** | 78.82% |
| **Macro F1** | 75.14% |
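Macro F1 (75.14%) is lower than both macro precision and macro recall. That can happen when macro F1 is the mean of per-class F1 scores, because a class with high precision but low recall (or the reverse) contributes a low per-class F1. The toy example below illustrates the arithmetic. It assumes that averaging convention and uses made-up counts; it does not reproduce the Scoliosis1K results or OpenGait's evaluator code.

```python
def macro_metrics(confusion: list[list[int]]) -> tuple[float, float, float]:
    """Return (macro precision, macro recall, macro F1).

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up two-class counts: class 0 has P=1.0, R=0.5; class 1 has P=0.5, R=1.0.
toy = [[2, 2],
       [0, 2]]
precision, recall, f1 = macro_metrics(toy)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision and recall are both 0.75, while macro F1 is about 0.667
```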

## Troubleshooting Checklist

1. **Environment:** Ensure all dependencies for *all* registered models are installed (e.g., `timm` for `BiggerGait_DINOv2.py`) to avoid eager import failures in `opengait/modeling/models/__init__.py`.
2. **GPU Visibility:** Make the number of GPUs listed in `CUDA_VISIBLE_DEVICES` exactly equal to `--nproc_per_node` (checked in `opengait/main.py`).
3. **Config Check:** Verify `evaluator_cfg.sampler.batch_size` equals the number of GPUs (checked in `opengait/modeling/base_model.py`).
4. **Data Paths:** Ensure `dataset_root` and `dataset_partition` in the YAML point to valid paths (use symlinks under `datasets/` for convenience).
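The data-path item of the checklist can be automated with a small script. The sketch below is a hypothetical helper (`missing_paths` is not an OpenGait function). It assumes the three config values are string paths, as in the config shown earlier, and that relative paths are resolved against the directory from which the command is run.

```python
from pathlib import Path


def missing_paths(cfg: dict, base_dir: Path) -> list[str]:
    """Return the config keys whose paths do not exist relative to base_dir."""
    candidates = {
        "data_cfg.dataset_root": cfg["data_cfg"]["dataset_root"],
        "data_cfg.dataset_partition": cfg["data_cfg"]["dataset_partition"],
        "evaluator_cfg.restore_hint": cfg["evaluator_cfg"]["restore_hint"],
    }
    # Path.exists() follows symlinks, so a dangling link is reported as missing.
    return [key for key, value in candidates.items() if not (base_dir / value).exists()]


# Values copied from the config highlights in this document.
cfg = {
    "data_cfg": {
        "dataset_root": "./datasets/Scoliosis1K/Scoliosis1K-sil-pkl",
        "dataset_partition": "./datasets/Scoliosis1K/Scoliosis1K_1116.json",
    },
    "evaluator_cfg": {"restore_hint": "./ckpt/ScoNet-20000.pt"},
}

if __name__ == "__main__":
    problems = missing_paths(cfg, Path.cwd())
    print("all paths found" if not problems else f"missing: {problems}")
```

Run it from the repository root, since the paths in the config are relative.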