# Stable Long-Running Training with `systemd-run --user`
This note documents the recommended way to launch long OpenGait jobs on a local workstation.
## Why use `systemd-run --user`
For long training runs, `systemd-run --user` is more reliable than shell background tricks like:

- `nohup ... &`
- `disown`
- one-shot detached shell wrappers
Why:

- the training process is supervised by the user systemd instance instead of a transient shell process
- stdout/stderr can be sent to a real log file and the systemd journal
- you can query status with `systemctl --user`
- you can stop the job cleanly with `systemctl --user stop ...`
- the job is no longer tied to the lifetime of a tool process tree
In this repo, detached shell launches were observed to die unexpectedly even when the training code itself was healthy. `systemd-run --user` avoids that failure mode.
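Under the hood, a launch like this reduces to a single transient-unit command. The sketch below shows one plausible way to assemble it; the unit name, log path, and property flags are illustrative, not necessarily the helper's exact choices (`StandardOutput=append:` requires a reasonably recent systemd):

```python
import shlex

def build_systemd_run_cmd(unit: str, log_path: str, train_cmd: list) -> list:
    """Sketch: wrap a training command in a transient user service."""
    return [
        "systemd-run", "--user",
        "--unit", unit,        # queryable later via `systemctl --user status <unit>`
        "--collect",           # garbage-collect the unit record once it exits
        # mirror stdout/stderr into a real file in addition to the journal
        "-p", f"StandardOutput=append:{log_path}",
        "-p", f"StandardError=append:{log_path}",
        *train_cmd,
    ]

# hypothetical training command for illustration
cmd = build_systemd_run_cmd(
    "opengait-baseline-train",
    "/tmp/opengait-baseline-train.log",
    ["torchrun", "--nproc_per_node", "1", "opengait/main.py",
     "--cfgs", "configs/baseline/baseline.yaml", "--phase", "train"],
)
print(shlex.join(cmd))
```

Because the unit is supervised by the user systemd instance, it survives the shell that issued this command.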
## Prerequisites

Check that user services are available:
```shell
systemd-run --user --version
```
If you want jobs to survive logout, enable linger:
```shell
loginctl enable-linger "$USER"
```
This is optional if you only need the job to survive shell/session teardown while you stay logged in.
## Recommended launcher
Use the helper script `scripts/systemd_run_opengait.py`. It:

- uses `torchrun` (`python -m torch.distributed.run`), not the deprecated `torch.distributed.launch`
- accepts GPU UUIDs instead of ordinal indices
- launches a transient user service with `systemd-run --user`
- writes stdout/stderr to a real log file
- provides `status`, `logs`, and `stop` helpers
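GPU UUIDs map cleanly onto a `torchrun` launch: the NVIDIA driver accepts UUIDs in `CUDA_VISIBLE_DEVICES`, and the process count falls out of the list length. A minimal sketch of that inference (function and argument names here are illustrative):

```python
import os

def torchrun_launch(gpu_uuids: str, script_args: list) -> tuple:
    """Turn a comma-separated UUID list into env + torchrun argv."""
    uuids = [u for u in gpu_uuids.split(",") if u]
    # the driver accepts full GPU UUIDs here, not just ordinal indices
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(uuids))
    argv = ["torchrun", f"--nproc_per_node={len(uuids)}", *script_args]
    return env, argv

env, argv = torchrun_launch(
    "GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,"
    "GPU-1155e14e-6097-5942-7feb-20453868b202",
    ["opengait/main.py", "--phase", "train"],
)
print(argv[1])  # --nproc_per_node=2
```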
## Launch examples
Single-GPU train:

```shell
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_skeleton_112_sigma15_joint8_bodyonly_plaince_bridge_1gpu_10k.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
```
Single-GPU eval:

```shell
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
  --phase test \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
```
Two-GPU train:

```shell
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,GPU-1155e14e-6097-5942-7feb-20453868b202
```
Dry run:

```shell
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
  --dry-run
```
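The unit names used for monitoring below (e.g. `opengait-baseline-train`) are derived from the config file and phase. One plausible derivation, consistent with those names but not guaranteed to match the helper exactly:

```python
from pathlib import Path

def unit_name(cfg_path: str, phase: str) -> str:
    """Sketch: config stem + phase -> transient unit name."""
    stem = Path(cfg_path).stem  # e.g. "baseline" from configs/baseline/baseline.yaml
    return f"opengait-{stem}-{phase}"

name = unit_name("configs/baseline/baseline.yaml", "train")
print(name)  # opengait-baseline-train
```

Use `--dry-run` to see the exact unit name the helper actually picks before committing GPUs to a run.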
## Monitoring and control
Show service status:

```shell
uv run python scripts/systemd_run_opengait.py status opengait-baseline-train
```
Show recent journal lines:

```shell
uv run python scripts/systemd_run_opengait.py logs opengait-baseline-train -n 200
```
Follow the journal directly:

```shell
journalctl --user -u opengait-baseline-train -f
```
Stop the run:

```shell
uv run python scripts/systemd_run_opengait.py stop opengait-baseline-train
```
## Logging behavior

The launcher configures both:

- a file log under `/tmp` by default
- the systemd journal for the transient unit

This makes it easier to recover logs even if the original shell or tool session disappears.
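If several runs reuse the same unit name, a timestamped file name keeps earlier logs recoverable instead of appending everything into one file. A small sketch of such a naming scheme (the helper's actual naming may differ):

```python
import time
from pathlib import Path

def log_path(unit: str, root: str = "/tmp") -> Path:
    """Sketch: one log file per launch, keyed by unit name and start time."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    return Path(root) / f"{unit}.{stamp}.log"

p = log_path("opengait-baseline-train")
print(p)
```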
## Moving outputs off the SSD
OpenGait writes checkpoints, TensorBoard summaries, best-checkpoint snapshots, and file logs under a run output root.
By default that root is `output/`, but you can override it per run with `output_root` in the engine config:
```yaml
trainer_cfg:
  output_root: /mnt/hddl/data/OpenGait-output

evaluator_cfg:
  output_root: /mnt/hddl/data/OpenGait-output
```
The final path layout stays the same under that root:

```
<output_root>/<dataset>/<model>/<save_name>/
```
For long scoliosis runs, using an HDD-backed root is recommended so local SSD space is not consumed by:
- numbered checkpoints
- rolling resume checkpoints
- best-N retained checkpoints
- TensorBoard summary files
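The layout above is easy to compose programmatically, e.g. when pre-creating directories or estimating disk usage per run. A sketch (the dataset/model/save names below are placeholders, not values from any real config):

```python
from pathlib import Path

def run_dir(output_root: str, dataset: str, model: str, save_name: str) -> Path:
    """Compose <output_root>/<dataset>/<model>/<save_name>/."""
    return Path(output_root) / dataset / model / save_name

d = run_dir("/mnt/hddl/data/OpenGait-output", "SomeDataset", "SomeModel", "some_run")
print(d.as_posix())
```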
## GPU selection
Prefer GPU UUIDs, not ordinal indices.
Reason:

- local `CUDA_VISIBLE_DEVICES` ordinal mapping can be unstable or surprising
- UUIDs make the intended device explicit
Get UUIDs with:

```shell
nvidia-smi -L
```
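`nvidia-smi -L` prints one line per device in the form `GPU 0: <name> (UUID: GPU-...)`. Extracting just the UUIDs for `--gpu-uuids` is a one-liner; the sample output below is illustrative:

```python
import re

# example nvidia-smi -L output (device names are made up)
sample = """GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-1155e14e-6097-5942-7feb-20453868b202)"""

# UUIDs are hex digits and dashes after the "GPU-" prefix
uuids = re.findall(r"\(UUID: (GPU-[0-9a-f-]+)\)", sample)
print(",".join(uuids))  # ready to paste after --gpu-uuids
```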
## Notes
- The helper uses `torchrun` through `python -m torch.distributed.run`. `--nproc_per_node` is inferred from the number of UUIDs passed to `--gpu-uuids`.
- OpenGait evaluator constraints still apply: test batch/world-size settings must match the visible GPU count.
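The evaluator constraint in the last note can be sanity-checked before launching. A minimal sketch, assuming the test sampler's batch size must be divisible by the visible GPU count (verify against your config and OpenGait's actual sampler requirements):

```python
def check_eval_world_size(gpu_uuids: list, test_batch_size: int) -> None:
    """Fail fast if the test batch cannot be split across the visible GPUs."""
    world_size = len(gpu_uuids)
    if test_batch_size % world_size != 0:
        raise ValueError(
            f"test batch_size {test_batch_size} is not divisible by "
            f"world size {world_size}; adjust the config or the GPU list"
        )

# one GPU, batch of 4: fine
check_eval_world_size(["GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6"], 4)
```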