Stable Long-Running Training with systemd-run --user

This note documents the recommended way to launch long OpenGait jobs on a local workstation.

Why use systemd-run --user

For long training runs, systemd-run --user is more reliable than shell background tricks like:

  • nohup ... &
  • disown
  • one-shot detached shell wrappers

Why:

  • the training process is supervised by the user systemd instance instead of a transient shell process
  • stdout/stderr can be sent to a real log file and the systemd journal
  • you can query status with systemctl --user
  • you can stop the job cleanly with systemctl --user stop ...
  • the job is no longer tied to the lifetime of a tool process tree

In this repo, detached shell launches were observed to die unexpectedly even when the training code itself was healthy. systemd-run --user avoids that failure mode.
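As a rough sketch, the launch the helper performs boils down to something like the command composed below. The unit name, log path, entry point (opengait/main.py), and exact flags are illustrative assumptions, not the script's literal behavior; the snippet only prints the command rather than running it.

```shell
# Illustrative only: compose (and print, not run) a systemd-run invocation
# roughly like the one the helper issues. Unit name, log path, and the
# torchrun arguments here are assumptions.
UNIT=opengait-baseline-train
LOG=/tmp/$UNIT.log
GPU_UUIDS=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
CMD="systemd-run --user --unit=$UNIT --collect \
  --property=Environment=CUDA_VISIBLE_DEVICES=$GPU_UUIDS \
  bash -c 'torchrun --nproc_per_node=1 opengait/main.py \
    --cfgs configs/baseline/baseline.yaml --phase train >$LOG 2>&1'"
echo "$CMD"
```

The key pieces are --unit (so the job has a queryable name), --collect (so a failed transient unit is garbage-collected), and Environment= to pin CUDA_VISIBLE_DEVICES to the chosen UUIDs.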

Prerequisites

Check that user services are available:

systemd-run --user --version

If you want jobs to survive logout, enable linger:

loginctl enable-linger "$USER"

This is optional if you only need the job to survive shell/session teardown while you stay logged in.
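To see whether lingering is already enabled, a query like the following may help. This is a sketch: it needs a systemd-based host with logind, and falls back to a placeholder value elsewhere.

```shell
# Query the Linger property for the current user; fall back gracefully
# on hosts without loginctl or without a logind session.
if command -v loginctl >/dev/null 2>&1; then
  LINGER=$(loginctl show-user "${USER:-$(id -un)}" --property=Linger 2>/dev/null \
    || echo "Linger=unknown")
else
  LINGER="Linger=unknown"
fi
echo "$LINGER"
```

`Linger=yes` means user services keep running after logout; `Linger=no` means you still need `loginctl enable-linger`.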

Helper script

Use the helper script scripts/systemd_run_opengait.py. It:

  • uses torchrun (python -m torch.distributed.run), not the deprecated torch.distributed.launch
  • accepts GPU UUIDs instead of ordinal indices
  • launches a transient user service with systemd-run --user
  • writes stdout/stderr to a real log file
  • provides status, logs, and stop helpers

Launch examples

Single-GPU train:

uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_skeleton_112_sigma15_joint8_bodyonly_plaince_bridge_1gpu_10k.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6

Single-GPU eval:

uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
  --phase test \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6

Two-GPU train:

uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,GPU-1155e14e-6097-5942-7feb-20453868b202

Dry run:

uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
  --dry-run

Monitoring and control

Show service status:

uv run python scripts/systemd_run_opengait.py status opengait-baseline-train

Show recent journal lines:

uv run python scripts/systemd_run_opengait.py logs opengait-baseline-train -n 200

Follow the journal directly:

journalctl --user -u opengait-baseline-train -f

Stop the run:

uv run python scripts/systemd_run_opengait.py stop opengait-baseline-train
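If you forget which runs are active, you can list transient units directly. This assumes the helper names units with an opengait- prefix, as in the examples above, and falls back to a message on hosts without a user systemd instance.

```shell
# List opengait-* transient user units, if a user systemd instance exists.
UNITS=$(systemctl --user list-units 'opengait-*' --no-pager 2>/dev/null \
  || echo "no user systemd instance available")
echo "$UNITS"
```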

Logging behavior

The launcher configures both:

  • a file log under /tmp by default
  • the systemd journal for the transient unit

This makes it easier to recover logs even if the original shell or tool session disappears.

Moving outputs off the SSD

OpenGait writes checkpoints, TensorBoard summaries, best-checkpoint snapshots, and file logs under a run output root.

By default that root is output/, but you can override it per run with output_root in the engine config:

trainer_cfg:
  output_root: /mnt/hddl/data/OpenGait-output

evaluator_cfg:
  output_root: /mnt/hddl/data/OpenGait-output

The final path layout stays the same under that root:

<output_root>/<dataset>/<model>/<save_name>/

For long scoliosis runs, using an HDD-backed root is recommended so local SSD space is not consumed by:

  • numbered checkpoints
  • rolling resume checkpoints
  • best-N retained checkpoints
  • TensorBoard summary files
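Before starting a long run, it can be worth confirming that the intended root exists and has space. The path below matches the config example above; the /tmp fallback is only so the check degrades gracefully on machines without that mount.

```shell
# Confirm the HDD-backed output root exists and report free space.
OUTPUT_ROOT=/mnt/hddl/data/OpenGait-output
mkdir -p "$OUTPUT_ROOT" 2>/dev/null || OUTPUT_ROOT=/tmp   # fall back if the mount is absent
df -h "$OUTPUT_ROOT"
```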

GPU selection

Prefer GPU UUIDs, not ordinal indices.

Reason:

  • local CUDA_VISIBLE_DEVICES ordinal mapping can be unstable or surprising
  • UUIDs make the intended device explicit

Get UUIDs with:

nvidia-smi -L
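To pull just the UUIDs out of that listing, a sed filter like the one below works on the usual `GPU N: <name> (UUID: GPU-...)` line format. The sample line stands in for real output so the snippet runs without a GPU; in practice, pipe `nvidia-smi -L` into the sed instead.

```shell
# Extract GPU UUIDs from nvidia-smi -L style output.
printf 'GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6)\n' \
  | sed -n 's/.*UUID: \(GPU-[0-9a-f-]*\)).*/\1/p'
# → GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
```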

Notes

  • The helper uses torchrun through python -m torch.distributed.run.
  • --nproc_per_node is inferred from the number of UUIDs passed to --gpu-uuids.
  • OpenGait evaluator constraints still apply: test batch/world-size settings must match the visible GPU count.
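The inference of --nproc_per_node from the UUID list can be sketched as a simple count of comma-separated entries (hypothetical; this mirrors what the helper is described as doing, not its exact code):

```shell
# Derive the process count from a comma-separated UUID list.
GPU_UUIDS="GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,GPU-1155e14e-6097-5942-7feb-20453868b202"
NPROC=$(printf '%s\n' "$GPU_UUIDS" | tr ',' '\n' | wc -l)
echo "--nproc_per_node=$NPROC"   # → --nproc_per_node=2
```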