Add resumable ScoNet skeleton training diagnostics

2026-03-09 15:57:13 +08:00
parent 4e0b0a18dc
commit 36aef46a0d
15 changed files with 1226 additions and 44 deletions
+2
@@ -89,6 +89,8 @@ CUDA_VISIBLE_DEVICES=0,1 uv run python -m torch.distributed.launch --nproc_per_n
```
> **Note:** The `--nproc_per_node` argument must exactly match the number of GPUs specified in `CUDA_VISIBLE_DEVICES`. For single-GPU evaluation, use `CUDA_VISIBLE_DEVICES=0` and `--nproc_per_node=1` with the DDP launcher.
>
> **Resume Tip:** To recover from interrupted training runs, set `trainer_cfg.resume_every_iter` to a non-zero value and, optionally, set `trainer_cfg.auto_resume_latest: true`. OpenGait then keeps `output/.../checkpoints/latest.pt` up to date for crash recovery.
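
As a rough illustration of the resume tip, the fragment below shows how the two keys might sit in a training config. Only the key names `trainer_cfg.resume_every_iter` and `trainer_cfg.auto_resume_latest` come from the tip itself. The value `500` and the reading of `resume_every_iter` as an interval in iterations are assumptions, so check the trainer code before relying on them.

```yaml
# Hypothetical config fragment: merge into the trainer_cfg section of your existing config.
trainer_cfg:
  # Non-zero enables periodic refresh of checkpoints/latest.pt.
  # 500 is an illustrative value; the unit (iterations) is assumed from the key name.
  resume_every_iter: 500
  # Optional: pick up latest.pt automatically when the run is relaunched.
  auto_resume_latest: true
```

A smaller interval loses less progress after a crash but writes checkpoints more often, so the right value depends on how expensive a save is relative to an iteration.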