docs: add uv workflow and ScoNet eval reproduction notes
@@ -0,0 +1,74 @@
# ScoNet Checkpoint Evaluation Reproduction Notes

This document records the failures encountered and the procedure that successfully reproduced ScoNet checkpoint evaluation using `uv` and the OpenGait framework.

## Observed Failure Sequence and Root Causes

### 1. Missing Dependencies (Eager Auto-Import)

OpenGait uses a dynamic registration pattern in `opengait/modeling/models/__init__.py`. When `main.py` imports `models`, the package iterates over every module in the `models/` directory and imports it. If any model file (e.g., `BiggerGait_DINOv2.py`) depends on a package that is not installed in the current environment (such as `timm`), the whole program fails at import time, even if that model is never used.

**Root Cause:** The `iter_modules` loop in `opengait/modeling/models/__init__.py` imports every sibling file.
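
One possible mitigation, sketched here rather than taken from OpenGait, is to make a package-level auto-import tolerant of missing optional dependencies. The helper name `import_available_modules` and the throwaway `demo_models` package are hypothetical; only `pkgutil.iter_modules` and `importlib.import_module` are real standard-library calls.

```python
import importlib
import pkgutil
import warnings


def import_available_modules(package_name):
    """Import each submodule of a package, skipping ones with missing dependencies.

    Hypothetical helper for illustration; it is not part of OpenGait.
    Returns (loaded, skipped), where skipped maps a module name to the
    name of the dependency that could not be found.
    """
    package = importlib.import_module(package_name)
    loaded, skipped = [], {}
    for _, module_name, _ in pkgutil.iter_modules(package.__path__):
        full_name = f"{package_name}.{module_name}"
        try:
            importlib.import_module(full_name)
        except ModuleNotFoundError as exc:
            # exc.name holds the module that was not found (e.g. "timm").
            skipped[module_name] = exc.name
            warnings.warn(f"Skipping {full_name}: missing dependency {exc.name!r}")
        else:
            loaded.append(module_name)
    return loaded, skipped


if __name__ == "__main__":
    import sys
    import tempfile
    from pathlib import Path

    # Build a throwaway package: one importable module, one with a missing dependency.
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_models"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "light_model.py").write_text("NAME = 'light'\n")
        (pkg / "heavy_model.py").write_text("import dependency_that_is_not_installed\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        print(import_available_modules("demo_models"))
```

The trade-off of this design is that a skipped model only fails later, when it is requested by name, so the warning is the only early signal that a dependency is missing.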

### 2. GPU/World Size Mismatch

The runtime enforces strict equality between the number of visible GPUs and the DDP world size in `opengait/main.py`:

```python
# opengait/main.py
if torch.distributed.get_world_size() != torch.cuda.device_count():
    raise ValueError("Expect number of available GPUs({}) equals to the world size({}).".format(
        torch.cuda.device_count(), torch.distributed.get_world_size()))
```

**Error Message:** `ValueError: Expect number of available GPUs(2) equals to the world size(1).`
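
To catch this mismatch before launching, one could compare the device list in `CUDA_VISIBLE_DEVICES` with the intended `--nproc_per_node` value. The helper names below are hypothetical, and the sketch only inspects the environment variable string: when the variable is unset, all GPUs are visible and the count would have to come from `torch.cuda.device_count()` instead.

```python
def count_visible_devices(env_value):
    """Count the devices listed in a CUDA_VISIBLE_DEVICES-style string.

    Returns None when the variable is unset, because the count cannot be
    derived from the environment alone in that case.
    Simplification: every listed entry is assumed to be a valid device.
    """
    if env_value is None:
        return None
    return len([entry for entry in env_value.split(",") if entry.strip()])


def check_launch(nproc_per_node, env_value):
    """Raise early if the process count differs from the visible device count."""
    visible = count_visible_devices(env_value)
    if visible is not None and visible != nproc_per_node:
        raise ValueError(
            f"--nproc_per_node={nproc_per_node} but CUDA_VISIBLE_DEVICES "
            f"lists {visible} device(s)"
        )


if __name__ == "__main__":
    check_launch(nproc_per_node=1, env_value="0")  # consistent, no error
    try:
        check_launch(nproc_per_node=1, env_value="0,1")  # two devices, one process
    except ValueError as exc:
        print(exc)
```

In a real wrapper script the second argument would typically be `os.environ.get("CUDA_VISIBLE_DEVICES")`.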

### 3. Evaluator Sampler Batch Size Rule

In testing mode, the evaluator requires the total batch size to equal the number of GPUs; the check lives in `opengait/modeling/base_model.py`.

**Error Message:** `ValueError: The batch size (8) must be equal to the number of GPUs (1) in testing mode!`
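
The same rule can be checked against a parsed config before launching. This is a minimal sketch that assumes the YAML has already been loaded into a plain dict (for example with PyYAML's `yaml.safe_load`); `check_eval_batch_size` is a hypothetical helper, not an OpenGait function.

```python
def check_eval_batch_size(cfg, num_gpus):
    """Validate evaluator_cfg.sampler.batch_size for test mode.

    Mirrors the rule described above: the value must be an integer
    equal to the number of GPUs.
    """
    batch_size = cfg["evaluator_cfg"]["sampler"]["batch_size"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(batch_size, bool) or not isinstance(batch_size, int):
        raise TypeError(f"batch_size must be an integer in test mode, got {batch_size!r}")
    if batch_size != num_gpus:
        raise ValueError(
            f"batch_size ({batch_size}) must equal the number of GPUs ({num_gpus})"
        )
    return batch_size


if __name__ == "__main__":
    cfg = {"evaluator_cfg": {"sampler": {"batch_size": 1, "sample_type": "all_ordered"}}}
    print(check_eval_batch_size(cfg, num_gpus=1))
```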

## Successful Reproduction Environment

- **Runtime:** `uv` with PEP 621 project metadata (`pyproject.toml`)
- **Hardware:** 1 visible GPU
- **Dataset Path:** Symlinked at `datasets/Scoliosis1K` (user-created link pointing to the actual data root).

## Successful Command and Config

### Command

```bash
CUDA_VISIBLE_DEVICES=0 uv run python -m torch.distributed.launch \
    --nproc_per_node=1 \
    opengait/main.py \
    --cfgs ./configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
    --phase test
```

### Config Highlights (`configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml`)

```yaml
data_cfg:
  dataset_root: ./datasets/Scoliosis1K/Scoliosis1K-sil-pkl
  dataset_partition: ./datasets/Scoliosis1K/Scoliosis1K_1116.json

evaluator_cfg:
  restore_hint: ./ckpt/ScoNet-20000.pt
  sampler:
    batch_size: 1 # Must be an integer for evaluation
    sample_type: all_ordered
```

## Final Metrics

The successful evaluation of the `ScoNet-20000.pt` checkpoint yielded:

| Metric | Value |
| :--- | :--- |
| **Accuracy** | 80.88% |
| **Macro Precision** | 81.50% |
| **Macro Recall** | 78.82% |
| **Macro F1** | 75.14% |
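
The macro F1 above is lower than both macro precision and macro recall. That is possible when macro F1 is the unweighted mean of per-class F1 scores rather than the harmonic mean of the two macro averages. The sketch below assumes that definition (it is the one scikit-learn's `average="macro"` uses; whether OpenGait's evaluator computes it identically is not confirmed here) and runs on a made-up confusion matrix, not on the evaluation results.

```python
def macro_metrics(confusion):
    """Compute accuracy and macro-averaged precision, recall, and F1.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": sum(confusion[k][k] for k in range(n)) / total,
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,  # mean of per-class F1 scores
    }


if __name__ == "__main__":
    # Toy 3-class matrix (illustrative only): class 0 has high precision but
    # low recall, class 1 the opposite, class 2 is classified perfectly.
    toy = [
        [5, 5, 0],
        [0, 5, 0],
        [0, 0, 5],
    ]
    for name, value in macro_metrics(toy).items():
        print(f"{name}: {value:.4f}")
```

On this toy matrix the per-class imbalances between precision and recall pull each class's F1 down, so the macro F1 ends up below both macro averages, which matches the pattern in the table.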

## Troubleshooting Checklist

1. **Environment:** Ensure the dependencies for *all* registered models are installed (e.g., `timm` for `BiggerGait_DINOv2.py`) to avoid eager import failures in `opengait/modeling/models/__init__.py`.
2. **GPU Visibility:** Make the number of devices listed in `CUDA_VISIBLE_DEVICES` match `--nproc_per_node` exactly (checked in `opengait/main.py`).
3. **Config Check:** Verify that `evaluator_cfg.sampler.batch_size` equals the number of GPUs (checked in `opengait/modeling/base_model.py`).
4. **Data Paths:** Ensure `dataset_root` and `dataset_partition` in the YAML point to valid paths (use symlinks under `datasets/` for convenience).
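
The path check in the last item can be automated with a few lines. This is a sketch under the assumption that the config has been loaded into a plain dict with the keys shown in the config highlights; `missing_paths` is a hypothetical helper name.

```python
from pathlib import Path


def missing_paths(cfg, base_dir="."):
    """Return the config entries whose path does not exist under base_dir.

    Path.exists() follows symlinks, so a dangling symlink is reported as missing.
    """
    candidates = {
        "data_cfg.dataset_root": cfg["data_cfg"]["dataset_root"],
        "data_cfg.dataset_partition": cfg["data_cfg"]["dataset_partition"],
        "evaluator_cfg.restore_hint": cfg["evaluator_cfg"]["restore_hint"],
    }
    base = Path(base_dir)
    return {
        key: value
        for key, value in candidates.items()
        # Only string values are treated as paths.
        if isinstance(value, str) and not (base / value).exists()
    }


if __name__ == "__main__":
    cfg = {
        "data_cfg": {
            "dataset_root": "./datasets/Scoliosis1K/Scoliosis1K-sil-pkl",
            "dataset_partition": "./datasets/Scoliosis1K/Scoliosis1K_1116.json",
        },
        "evaluator_cfg": {"restore_hint": "./ckpt/ScoNet-20000.pt"},
    }
    # Prints the entries that do not resolve from the current directory.
    print(missing_paths(cfg))
```

Run from the repository root, an empty dict means every referenced path resolves.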