docs: add uv workflow and ScoNet eval reproduction notes

2026-02-26 14:19:00 +08:00
parent 0fdd35bd78
commit 5c06a80d93
8 changed files with 2693 additions and 0 deletions
@@ -0,0 +1,74 @@
+# ScoNet Checkpoint Evaluation Reproduction Notes
+
+This document records the findings and successful procedure for reproducing ScoNet checkpoint evaluation using `uv` and the OpenGait framework.
+
+## Observed Failure Sequence and Root Causes
+
+### 1. Missing Dependencies (Eager Auto-Import)
+OpenGait uses a dynamic registration pattern in `opengait/modeling/models/__init__.py`. When `main.py` imports `models`, the package iterates over every module in the `models/` directory and imports each one. If any model file (e.g., `BiggerGait_DINOv2.py`) depends on a package that is not installed in the current environment (such as `timm`), that import fails and the whole program aborts, even if the selected config never uses that model.
+
+**Root Cause:** `iter_modules` in `opengait/modeling/models/__init__.py` triggers imports of all sibling files.
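
To make the failure mode concrete, here is a minimal, self-contained sketch of an eager "import every sibling" loop. It is not OpenGait's actual code: the package `demo_models`, its two modules, and the `tolerant` flag are all hypothetical names invented for illustration. With `tolerant=False` a single missing dependency aborts the whole import, which mirrors the behaviour described above. With `tolerant=True` the broken module is skipped instead.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path


def import_all_submodules(package_name, tolerant=False):
    """Import every module found in a package, as an eager registry would.

    Returns (loaded, skipped) lists of module names. With tolerant=False the
    first ImportError propagates to the caller.
    """
    package = importlib.import_module(package_name)
    loaded, skipped = [], []
    for _, name, _ in pkgutil.iter_modules(package.__path__):
        try:
            importlib.import_module(f"{package_name}.{name}")
            loaded.append(name)
        except ImportError:
            if not tolerant:
                raise
            skipped.append(name)
    return loaded, skipped


# Build a throwaway package (hypothetical): one healthy module and one that
# imports a dependency which is not installed.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_models"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "good_model.py").write_text("NAME = 'good'\n")
(pkg / "needs_extra.py").write_text("import not_installed_dependency_xyz\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

loaded, skipped = import_all_submodules("demo_models", tolerant=True)
print("loaded:", loaded, "skipped:", skipped)
```

In these notes the issue was avoided by installing the missing dependency. The tolerant variant is only one possible alternative and would hide genuine import bugs unless the skipped names are logged.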
+
+### 2. GPU/World Size Mismatch
+The runtime enforces a strict equality between the number of visible GPUs and the DDP world size in `opengait/main.py`:
+
+```python
+# opengait/main.py
+if torch.distributed.get_world_size() != torch.cuda.device_count():
+    raise ValueError("Expect number of available GPUs({}) equals to the world size({}).".format(
+        torch.cuda.device_count(), torch.distributed.get_world_size()))
+```
+
+**Error Message:** `ValueError: Expect number of available GPUs(2) equals to the world size(1)`
+
+### 3. Evaluator Sampler Batch Size Rule
+The evaluator enforces that the total batch size must equal the number of GPUs in testing mode, as checked in `opengait/modeling/base_model.py`.
+
+**Error Message:** `ValueError: The batch size (8) must be equal to the number of GPUs (1) in testing mode!`
+
+## Successful Reproduction Environment
+
+- **Runtime:** `uv` with PEP 621 (`pyproject.toml`)
+- **Hardware:** 1 visible GPU
+- **Dataset Path:** Symlinked at `datasets/Scoliosis1K` (user-created link pointing to the actual data root).
+
+## Successful Command and Config
+
+### Command
+```bash
+CUDA_VISIBLE_DEVICES=0 uv run python -m torch.distributed.launch \
+    --nproc_per_node=1 \
+    opengait/main.py \
+    --cfgs ./configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
+    --phase test
+```
+
+### Config Highlights (`configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml`)
+```yaml
+data_cfg:
+  dataset_root: ./datasets/Scoliosis1K/Scoliosis1K-sil-pkl
+  dataset_partition: ./datasets/Scoliosis1K/Scoliosis1K_1116.json
+
+evaluator_cfg:
+  restore_hint: ./ckpt/ScoNet-20000.pt
+  sampler:
+    batch_size: 1  # Single integer; must equal the number of GPUs in testing mode
+    sample_type: all_ordered
+```
+
+## Final Metrics
+The successful evaluation of the `ScoNet-20000.pt` checkpoint yielded:
+
+| Metric | Value |
+| :--- | :--- |
+| **Accuracy** | 80.88% |
+| **Macro Precision** | 81.50% |
+| **Macro Recall** | 78.82% |
+| **Macro F1** | 75.14% |
+
+## Troubleshooting Checklist
+
+1. **Environment:** Ensure all dependencies for *all* registered models are installed (e.g., `timm` for `BiggerGait_DINOv2.py`) to avoid eager import failures in `opengait/modeling/models/__init__.py`.
+2. **GPU Visibility:** Make the number of GPUs listed in `CUDA_VISIBLE_DEVICES` exactly equal to `--nproc_per_node` (checked in `opengait/main.py`).
+3. **Config Check:** Verify `evaluator_cfg.sampler.batch_size` equals the number of GPUs (checked in `opengait/modeling/base_model.py`).
+4. **Data Paths:** Ensure `dataset_root` and `dataset_partition` in the YAML point to valid paths (use symlinks under `datasets/` for convenience).
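
Items 2 to 4 of the checklist can be verified before launching. The sketch below is a hypothetical pre-flight helper, not part of OpenGait: `count_visible_gpus` and `preflight` are invented names, and the check assumes a single-node launch in which the DDP world size equals `--nproc_per_node`. It takes the `CUDA_VISIBLE_DEVICES` value as an explicit string, so it does not cover the case where the variable is unset and all GPUs are visible.

```python
from pathlib import Path


def count_visible_gpus(cuda_visible_devices):
    """Count the device IDs in a CUDA_VISIBLE_DEVICES-style string such as "0,1"."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def preflight(cuda_visible_devices, nproc_per_node, eval_batch_size, paths=()):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    visible = count_visible_gpus(cuda_visible_devices)
    # Checklist item 2: visible GPUs must match the number of launched processes.
    if visible != nproc_per_node:
        problems.append(
            f"visible GPUs ({visible}) != nproc_per_node ({nproc_per_node})"
        )
    # Checklist item 3: evaluation batch size must equal the number of GPUs.
    if eval_batch_size != nproc_per_node:
        problems.append(
            f"eval batch_size ({eval_batch_size}) != number of GPUs ({nproc_per_node})"
        )
    # Checklist item 4: every configured path must exist.
    for p in paths:
        if not Path(p).exists():
            problems.append(f"path does not exist: {p}")
    return problems


# The working setup from these notes: one GPU, one process, batch size 1.
print(preflight("0", 1, 1))
# A setup that reproduces both errors: two visible GPUs, one process, batch size 8.
print(preflight("0,1", 1, 8))
```

Pass the `dataset_root` and `dataset_partition` values from the YAML as `paths` to cover item 4 as well.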