# ScoNet Checkpoint Evaluation Reproduction Notes
This document records the findings and successful procedure for reproducing ScoNet checkpoint evaluation using `uv` and the OpenGait framework.
## Observed Failure Sequence and Root Causes
### 1. Missing Dependencies (Eager Auto-Import)
OpenGait registers models dynamically in `opengait/modeling/models/__init__.py`. When `main.py` imports `models`, the package `__init__` iterates over every module in the `models/` directory and imports it. If any model file (e.g., `BiggerGait_DINOv2.py`) depends on a package that is not installed in the current environment (such as `timm`), that import fails and the whole program aborts, even when that model is not the one being evaluated.
**Root Cause:** `iter_modules` in `opengait/modeling/models/__init__.py` triggers imports of all sibling files.
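
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the OpenGait source: it builds a throwaway package (the names `demo_models`, `good_model`, `needs_extra`, and `a_dependency_that_is_not_installed` are all hypothetical) and imports every sibling module with `pkgutil.iter_modules`, which is assumed to approximate what the registry does.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> str:
    """Create a throwaway package with one healthy and one broken module."""
    pkg = root / "demo_models"  # hypothetical package name
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "good_model.py").write_text("NAME = 'good'\n")
    # Stands in for a model file whose third-party dependency is not installed.
    (pkg / "needs_extra.py").write_text(
        "import a_dependency_that_is_not_installed\n"
    )
    return pkg.name


def eager_import_all(package_name: str) -> list:
    """Import every sibling module, as an eager registry would."""
    package = importlib.import_module(package_name)
    loaded = []
    for _, module_name, _ in pkgutil.iter_modules(package.__path__):
        importlib.import_module(f"{package_name}.{module_name}")
        loaded.append(module_name)
    return loaded


def demo() -> str:
    """Return the name of the module whose absence aborted the eager import."""
    with tempfile.TemporaryDirectory() as tmp:
        sys.path.insert(0, tmp)
        try:
            name = build_demo_package(Path(tmp))
            importlib.invalidate_caches()
            try:
                eager_import_all(name)
            except ModuleNotFoundError as exc:
                return exc.name
            return ""
        finally:
            sys.path.remove(tmp)
```

Calling `demo()` shows that a single unimportable sibling stops the whole loop, even though `good_model` itself has no missing dependency.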
### 2. GPU/World Size Mismatch
The runtime enforces a strict equality between the number of visible GPUs and the DDP world size in `opengait/main.py`:
```python
# opengait/main.py
if torch.distributed.get_world_size() != torch.cuda.device_count():
    raise ValueError("Expect number of available GPUs({}) equals to the world size({}).".format(
        torch.cuda.device_count(), torch.distributed.get_world_size()))
```
**Error Message:** `ValueError: Expect number of available GPUs(2) equals to the world size(1)`
In this case two GPUs were visible but only one process had been launched.
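
A rough pre-launch check can catch this mismatch before `main.py` starts. The sketch below is an assumption-laden approximation: `count_visible_gpus` and `check_launch` are hypothetical helpers that only count the entries in an explicitly set `CUDA_VISIBLE_DEVICES` string, whereas the real check uses `torch.cuda.device_count()`. It does not handle an unset variable (all GPUs visible) or invalid device IDs.

```python
def count_visible_gpus(cuda_visible_devices: str) -> int:
    """Count the device entries listed in a CUDA_VISIBLE_DEVICES string.

    Assumption: the variable is set explicitly and every entry is valid.
    """
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def check_launch(cuda_visible_devices: str, nproc_per_node: int) -> None:
    """Raise if the visible device count differs from the process count."""
    visible = count_visible_gpus(cuda_visible_devices)
    if visible != nproc_per_node:
        raise ValueError(
            f"CUDA_VISIBLE_DEVICES lists {visible} device(s) but "
            f"--nproc_per_node is {nproc_per_node}"
        )


# Matches the single-GPU launch used in these notes.
check_launch("0", 1)
```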
### 3. Evaluator Sampler Batch Size Rule
The evaluator enforces that the total batch size must equal the number of GPUs in testing mode, as checked in `opengait/modeling/base_model.py`.
**Error Message:** `ValueError: The batch size (8) must be equal to the number of GPUs (1) in testing mode!`
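
The rule can be restated as a small check on an already-loaded config. This is a sketch under assumptions, not the code in `base_model.py`: `check_eval_batch_size` is a hypothetical helper, and the config is assumed to be a plain nested dict with the `evaluator_cfg.sampler.batch_size` key.

```python
def check_eval_batch_size(cfg: dict, num_gpus: int) -> None:
    """Validate the evaluation batch size against the GPU count.

    Hypothetical helper; mirrors the rule described above, not OpenGait's code.
    """
    batch_size = cfg["evaluator_cfg"]["sampler"]["batch_size"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(batch_size, bool) or not isinstance(batch_size, int):
        raise TypeError(
            f"evaluation batch_size must be an integer, got {batch_size!r}"
        )
    if batch_size != num_gpus:
        raise ValueError(
            f"batch size ({batch_size}) must equal the number of GPUs ({num_gpus})"
        )


# The single-GPU setting used in these notes passes the check.
check_eval_batch_size({"evaluator_cfg": {"sampler": {"batch_size": 1}}}, 1)
```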
## Successful Reproduction Environment
- **Runtime:** `uv` with PEP 621 (`pyproject.toml`)
- **Hardware:** 1 visible GPU
- **Dataset Path:** Symlinked at `datasets/Scoliosis1K` (user-created link pointing to the actual data root).
## Successful Command and Config
### Command
```bash
CUDA_VISIBLE_DEVICES=0 uv run python -m torch.distributed.launch \
--nproc_per_node=1 \
opengait/main.py \
--cfgs ./configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
--phase test
```
### Config Highlights (`configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml`)
```yaml
data_cfg:
  dataset_root: ./datasets/Scoliosis1K/Scoliosis1K-sil-pkl
  dataset_partition: ./datasets/Scoliosis1K/Scoliosis1K_1116.json
evaluator_cfg:
  restore_hint: ./ckpt/ScoNet-20000.pt
  sampler:
    batch_size: 1 # Must be integer for evaluation
    sample_type: all_ordered
```
## Final Metrics
The successful evaluation of the `ScoNet-20000.pt` checkpoint yielded:
| Metric | Value |
| :--- | :--- |
| **Accuracy** | 80.88% |
| **Macro Precision** | 81.50% |
| **Macro Recall** | 78.82% |
| **Macro F1** | 75.14% |
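
Macro F1 can sit below both macro precision and macro recall, as it does in the table, because it is the mean of per-class F1 scores rather than the harmonic mean of the two macro averages. The sketch below illustrates this with the standard definitions on toy labels. It is an assumption that the evaluator follows these definitions, and the data is invented for illustration, not taken from Scoliosis1K.

```python
def macro_metrics(y_true: list, y_pred: list) -> dict:
    """Compute accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,  # mean of per-class F1 scores
    }


# Toy labels: class "a" has P=1.0, R=0.5; class "b" has P=0.5, R=1.0.
toy = macro_metrics(["a", "a", "b"], ["a", "b", "b"])
```

In this toy case macro precision and macro recall are both 0.75, while each class has an F1 of 2/3, so macro F1 is lower than either.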
## Troubleshooting Checklist
1. **Environment:** Ensure all dependencies for *all* registered models are installed (e.g., `timm` for `BiggerGait_DINOv2.py`) to avoid eager import failures in `opengait/modeling/models/__init__.py`.
2. **GPU Visibility:** Make the number of devices in `CUDA_VISIBLE_DEVICES` exactly match `--nproc_per_node` (checked in `opengait/main.py`).
3. **Config Check:** Verify `evaluator_cfg.sampler.batch_size` equals the number of GPUs (checked in `opengait/modeling/base_model.py`).
4. **Data Paths:** Ensure `dataset_root` and `dataset_partition` in the YAML point to valid paths (use symlinks under `datasets/` for convenience).
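
Items 1 and 4 of the checklist can be scripted as a quick preflight. This is a minimal sketch under assumptions: `missing_modules` and `missing_paths` are hypothetical helpers, the config is assumed to be loaded into a plain dict, and the module list to check (for example `["timm"]`) has to be supplied by hand.

```python
import importlib.util
from pathlib import Path


def missing_modules(names: list) -> list:
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_paths(cfg: dict) -> list:
    """Return the data_cfg paths that do not exist on disk."""
    data_cfg = cfg["data_cfg"]
    keys = ("dataset_root", "dataset_partition")
    return [data_cfg[k] for k in keys if not Path(data_cfg[k]).exists()]
```

Running both before launching gives a short list of what to install or relink, instead of a traceback from deep inside the import chain.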