# Stable Long-Running Training with `systemd-run --user`
This note documents the recommended way to launch long OpenGait jobs on a local workstation.
## Why use `systemd-run --user`
For long training runs, `systemd-run --user` is more reliable than shell background tricks like:
- `nohup ... &`
- `disown`
- one-shot detached shell wrappers
Why:
- the training process is supervised by the user systemd instance instead of a transient shell process
- stdout/stderr can be sent to a real log file and the systemd journal
- you can query status with `systemctl --user`
- you can stop the job cleanly with `systemctl --user stop ...`
- the job is no longer tied to the lifetime of a tool process tree
In this repo, detached shell launches were observed to die unexpectedly even when the training code itself was healthy. `systemd-run --user` avoids that failure mode.
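The pattern the helper wraps looks roughly like the sketch below. The unit name, log path, and entry-point path are illustrative assumptions (not the script's actual defaults), and the assembled command is only printed here so it can be inspected first:
```shell
# Sketch of a hand-rolled systemd-run launch (illustrative names/paths).
unit=opengait-demo-train
log=/tmp/${unit}.log
cmd=(systemd-run --user --unit="$unit" --collect
     --property=WorkingDirectory="$PWD"
     bash -lc "uv run python -m torch.distributed.run --nproc_per_node=1 opengait/main.py --cfgs configs/baseline/baseline.yaml --phase train >> $log 2>&1")
# Print the assembled command; run "${cmd[@]}" instead to actually launch.
printf '%s ' "${cmd[@]}"; echo
```
`--collect` lets systemd garbage-collect the transient unit even if it fails, so repeated experiments don't leave failed units behind.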
## Prerequisites
Check that the user systemd instance is reachable by running a trivial command as a transient unit and waiting for it to finish:
```bash
systemd-run --user --wait true
```
If you want jobs to survive logout, enable linger:
```bash
loginctl enable-linger "$USER"
```
This is optional if you only need the job to survive shell/session teardown while you stay logged in.
## Recommended launcher
Use the helper script:
- [scripts/systemd_run_opengait.py](../scripts/systemd_run_opengait.py)
It:
- uses `torchrun` (`python -m torch.distributed.run`), not the deprecated `torch.distributed.launch`
- accepts GPU UUIDs instead of ordinal indices
- launches a transient user service with `systemd-run --user`
- writes stdout/stderr to a real log file
- provides `status`, `logs`, and `stop` helpers
## Launch examples
Single-GPU train:
```bash
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_skeleton_112_sigma15_joint8_bodyonly_plaince_bridge_1gpu_10k.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
```
Single-GPU eval:
```bash
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
  --phase test \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
```
Two-GPU train:
```bash
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,GPU-1155e14e-6097-5942-7feb-20453868b202
```
Dry run:
```bash
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
  --dry-run
```
## Monitoring and control
Show service status:
```bash
uv run python scripts/systemd_run_opengait.py status opengait-baseline-train
```
Show recent journal lines:
```bash
uv run python scripts/systemd_run_opengait.py logs opengait-baseline-train -n 200
```
Follow the journal directly:
```bash
journalctl --user -u opengait-baseline-train -f
```
Stop the run:
```bash
uv run python scripts/systemd_run_opengait.py stop opengait-baseline-train
```
## Logging behavior
The launcher configures both:
- a file log under `/tmp` by default
- the systemd journal for the transient unit
This makes it easier to recover logs even if the original shell or tool session disappears.
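One way to feed both sinks is to pipe training output through `tee`: the unit's stdout (captured by the journal) and the file both receive every line. Whether the helper uses `tee` or a plain redirect is an implementation detail; a minimal standalone illustration:
```shell
# Duplicate one stream of output into a file and stdout with tee.
logfile=$(mktemp)
echo "epoch 1: loss=0.42" | tee -a "$logfile"
# The same line is now also on disk:
cat "$logfile"
```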
## GPU selection
Prefer GPU UUIDs over ordinal indices.
Why:
- ordinal indices depend on the driver's enumeration order (e.g. `CUDA_DEVICE_ORDER=FASTEST_FIRST` vs `PCI_BUS_ID`), so the mapping can differ between tools and change across reboots or driver updates
- UUIDs identify the intended physical device unambiguously, and `CUDA_VISIBLE_DEVICES` accepts them directly
Get UUIDs with:
```bash
nvidia-smi -L
```
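`nvidia-smi -L` prints one line per device with the UUID inside `(UUID: ...)`. A quick way to extract just the UUID token; the sample line below stands in for real output (device name included is a placeholder) so the command can be tried without a GPU:
```shell
# Pull the GPU-... token out of an `nvidia-smi -L` style line.
sample='GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6)'
uuid=$(echo "$sample" | sed -n 's/.*UUID: \(GPU-[0-9a-f-]*\)).*/\1/p')
echo "$uuid"
```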
## Notes
- The helper uses `torchrun` through `python -m torch.distributed.run`.
- `--nproc_per_node` is inferred from the number of UUIDs passed to `--gpu-uuids`.
- OpenGait evaluator constraints still apply: test batch/world-size settings must match the visible GPU count.
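The `--nproc_per_node` inference in the second note amounts to splitting `--gpu-uuids` on commas and counting the fields, which can be reproduced in plain shell (assumed to mirror what the helper does internally):
```shell
# Count comma-separated UUIDs to derive --nproc_per_node.
gpu_uuids="GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,GPU-1155e14e-6097-5942-7feb-20453868b202"
nproc=$(awk -F, '{print NF}' <<< "$gpu_uuids")
echo "$nproc"   # prints 2
```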