# Stable Long-Running Training with `systemd-run --user`

This note documents the recommended way to launch long-running OpenGait training jobs on a local workstation.

## Why use `systemd-run --user`

For long training runs, `systemd-run --user` is more reliable than shell background tricks like:

- `nohup ... &`
- `disown`
- one-shot detached shell wrappers

Why:

- the training process is supervised by the user systemd instance instead of a transient shell process
- stdout/stderr can be sent to a real log file and the systemd journal
- you can query status with `systemctl --user`
- you can stop the job cleanly with `systemctl --user stop ...`
- the job is no longer tied to the lifetime of a tool process tree

In this repo, detached shell launches were observed to die unexpectedly even when the training code itself was healthy. `systemd-run --user` avoids that failure mode.

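Under the hood, the pattern is just `systemd-run --user` wrapping the training command. A minimal sketch of how such a command line can be assembled (the function, unit name, entry point, and log path here are illustrative, not the helper's actual internals):

```python
import shlex

def build_systemd_run_cmd(unit: str, train_cmd: list[str], log_path: str) -> list[str]:
    """Wrap a training command in a transient user unit supervised by systemd."""
    # Redirect stdout/stderr inside the unit itself, so the log file keeps
    # filling even after the launching terminal is gone.
    inner = f"exec {shlex.join(train_cmd)} >> {shlex.quote(log_path)} 2>&1"
    return [
        "systemd-run", "--user",
        f"--unit={unit}",
        "--collect",  # garbage-collect the transient unit after it exits
        "bash", "-c", inner,
    ]

cmd = build_systemd_run_cmd(
    "opengait-demo-train",
    ["python", "-m", "torch.distributed.run", "opengait/main.py"],
    "/tmp/opengait-demo-train.log",
)
```

Once launched this way, the process answers to `systemctl --user status opengait-demo-train` and `systemctl --user stop opengait-demo-train` rather than to the shell that started it.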
## Prerequisites

Check that user services are available:

```bash
systemd-run --user --version
```

If you want jobs to survive logout, enable linger:

```bash
loginctl enable-linger "$USER"
```

This is optional if you only need the job to survive shell/session teardown while you stay logged in.

## Recommended launcher

Use the helper script:

- [scripts/systemd_run_opengait.py](/home/crosstyan/Code/OpenGait/scripts/systemd_run_opengait.py)

It:

- uses `torchrun` (`python -m torch.distributed.run`), not the deprecated `torch.distributed.launch`
- accepts GPU UUIDs instead of ordinal indices
- launches a transient user service with `systemd-run --user`
- writes stdout/stderr to a real log file
- provides `status`, `logs`, and `stop` helpers

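The UUID-to-`torchrun` wiring described above can be sketched as follows. This is a hedged approximation: `build_torchrun_cmd` is illustrative, and the entry point follows OpenGait's documented `opengait/main.py --cfgs ... --phase ...` invocation, not necessarily the helper's exact code.

```python
def build_torchrun_cmd(gpu_uuids: list[str], cfgs: str, phase: str):
    """Build env and command: one torchrun rank per requested GPU UUID."""
    if not gpu_uuids:
        raise ValueError("at least one GPU UUID is required")
    # CUDA accepts full UUIDs (GPU-xxxx...) in CUDA_VISIBLE_DEVICES, which pins
    # the intended devices regardless of ordinal enumeration order.
    env = {"CUDA_VISIBLE_DEVICES": ",".join(gpu_uuids)}
    cmd = [
        "python", "-m", "torch.distributed.run",
        # --nproc_per_node is inferred from the number of UUIDs passed in.
        f"--nproc_per_node={len(gpu_uuids)}",
        "opengait/main.py",
        "--cfgs", cfgs,
        "--phase", phase,
    ]
    return env, cmd

env, cmd = build_torchrun_cmd(
    ["GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6",
     "GPU-1155e14e-6097-5942-7feb-20453868b202"],
    "configs/baseline/baseline.yaml",
    "train",
)
```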
## Launch examples

Single-GPU train:

```bash
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_skeleton_112_sigma15_joint8_bodyonly_plaince_bridge_1gpu_10k.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
```

Single-GPU eval:

```bash
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
  --phase test \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
```

Two-GPU train:

```bash
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,GPU-1155e14e-6097-5942-7feb-20453868b202
```

Dry run:

```bash
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
  --dry-run
```

## Monitoring and control

Show service status:

```bash
uv run python scripts/systemd_run_opengait.py status opengait-baseline-train
```

Show recent journal lines:

```bash
uv run python scripts/systemd_run_opengait.py logs opengait-baseline-train -n 200
```

Follow the journal directly:

```bash
journalctl --user -u opengait-baseline-train -f
```

Stop the run:

```bash
uv run python scripts/systemd_run_opengait.py stop opengait-baseline-train
```

## Logging behavior

The launcher configures both:

- a file log under `/tmp` by default
- the systemd journal for the transient unit

This makes it easier to recover logs even if the original shell or tool session disappears.

## GPU selection

Prefer GPU UUIDs, not ordinal indices.

Reason:

- local `CUDA_VISIBLE_DEVICES` ordinal mapping can be unstable or surprising
- UUIDs make the intended device explicit

Get UUIDs with:

```bash
nvidia-smi -L
```

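`nvidia-smi -L` prints one line per device in the form `GPU 0: <name> (UUID: GPU-...)`. If you want to pick UUIDs programmatically rather than copy-pasting, a small parser is enough (a sketch; the device names in the sample are illustrative):

```python
import re

def parse_gpu_uuids(nvidia_smi_output: str) -> dict[int, str]:
    """Map ordinal index -> UUID from `nvidia-smi -L` output."""
    pattern = re.compile(r"GPU (\d+):.*\(UUID: (GPU-[0-9a-f-]+)\)")
    return {int(m.group(1)): m.group(2) for m in pattern.finditer(nvidia_smi_output)}

# Sample output in the `nvidia-smi -L` format (device names are illustrative).
sample = (
    "GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6)\n"
    "GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-1155e14e-6097-5942-7feb-20453868b202)\n"
)
uuids = parse_gpu_uuids(sample)
```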
## Notes

- The helper uses `torchrun` through `python -m torch.distributed.run`.
- `--nproc_per_node` is inferred from the number of UUIDs passed to `--gpu-uuids`.
- OpenGait evaluator constraints still apply: test batch/world-size settings must match the visible GPU count.
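
One reading of that last constraint is that the test batch size must divide evenly across ranks. A tiny pre-launch check under that assumption (this divisibility rule is an assumption about OpenGait's evaluation sampler, not something this note documents):

```python
def check_eval_batch(world_size: int, test_batch_size: int) -> None:
    # Assumption: the evaluation sampler shards the test batch across ranks,
    # so the batch size must be divisible by the visible GPU count.
    if test_batch_size % world_size != 0:
        raise ValueError(
            f"test batch size {test_batch_size} is not divisible "
            f"by world size {world_size}"
        )

check_eval_batch(world_size=2, test_batch_size=4)  # passes silently
```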