Stable Long-Running Training with systemd-run --user

This note documents the recommended way to launch long OpenGait jobs on a local workstation.

Why use systemd-run --user

For long training runs, systemd-run --user is more reliable than shell background tricks like:

  • nohup ... &
  • disown
  • one-shot detached shell wrappers

Why:

  • the training process is supervised by the user systemd instance instead of a transient shell process
  • stdout/stderr can be sent to a real log file and the systemd journal
  • you can query status with systemctl --user
  • you can stop the job cleanly with systemctl --user stop ...
  • the job is no longer tied to the lifetime of a tool process tree

In this repo, detached shell launches were observed to die unexpectedly even when the training code itself was healthy. systemd-run --user avoids that failure mode.
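As a rough sketch, the launch the helper performs boils down to something like the command composed below. The unit name, log path, entry point (opengait/main.py), and exact flags are illustrative assumptions, not the script's literal behavior; the snippet only prints the command rather than running it.

```shell
# Illustrative only: compose (and print, not run) a systemd-run invocation
# roughly like the one the helper issues. Unit name, log path, and the
# torchrun arguments here are assumptions.
UNIT=opengait-baseline-train
LOG=/tmp/$UNIT.log
GPU_UUIDS=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
CMD="systemd-run --user --unit=$UNIT --collect \
  --property=Environment=CUDA_VISIBLE_DEVICES=$GPU_UUIDS \
  bash -c 'torchrun --nproc_per_node=1 opengait/main.py \
    --cfgs configs/baseline/baseline.yaml --phase train >$LOG 2>&1'"
echo "$CMD"
```

The key pieces are --unit (so the job has a queryable name), --collect (so a failed transient unit is garbage-collected), and Environment= to pin CUDA_VISIBLE_DEVICES to the chosen UUIDs.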

Prerequisites

Check that user services are available:

systemd-run --user --version

If you want jobs to survive logout, enable linger:

loginctl enable-linger "$USER"

This is optional if you only need the job to survive shell/session teardown while you stay logged in.
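To see whether lingering is already enabled, a query like the following may help. This is a sketch: it needs a systemd-based host with logind, and falls back to a placeholder value elsewhere.

```shell
# Query the Linger property for the current user; fall back gracefully
# on hosts without loginctl or without a logind session.
if command -v loginctl >/dev/null 2>&1; then
  LINGER=$(loginctl show-user "${USER:-$(id -un)}" --property=Linger 2>/dev/null \
    || echo "Linger=unknown")
else
  LINGER="Linger=unknown"
fi
echo "$LINGER"
```

`Linger=yes` means user services keep running after logout; `Linger=no` means you still need `loginctl enable-linger`.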

Helper script

Use the helper script scripts/systemd_run_opengait.py. It:

  • uses torchrun (python -m torch.distributed.run), not the deprecated torch.distributed.launch
  • accepts GPU UUIDs instead of ordinal indices
  • launches a transient user service with systemd-run --user
  • writes stdout/stderr to a real log file
  • provides status, logs, and stop helpers

Launch examples

Single-GPU train:

uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_skeleton_112_sigma15_joint8_bodyonly_plaince_bridge_1gpu_10k.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6

Single-GPU eval:

uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
  --phase test \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6

Two-GPU train:

uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,GPU-1155e14e-6097-5942-7feb-20453868b202

Dry run:

uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
  --dry-run

Monitoring and control

Show service status:

uv run python scripts/systemd_run_opengait.py status opengait-baseline-train

Show recent journal lines:

uv run python scripts/systemd_run_opengait.py logs opengait-baseline-train -n 200

Follow the journal directly:

journalctl --user -u opengait-baseline-train -f

Stop the run:

uv run python scripts/systemd_run_opengait.py stop opengait-baseline-train
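If you forget which runs are active, you can list transient units directly. This assumes the helper names units with an opengait- prefix, as in the examples above, and falls back to a message on hosts without a user systemd instance.

```shell
# List opengait-* transient user units, if a user systemd instance exists.
UNITS=$(systemctl --user list-units 'opengait-*' --no-pager 2>/dev/null \
  || echo "no user systemd instance available")
echo "$UNITS"
```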

Logging behavior

The launcher configures both:

  • a file log under /tmp by default
  • the systemd journal for the transient unit

This makes it easier to recover logs even if the original shell or tool session disappears.

Moving outputs off the SSD

OpenGait writes checkpoints, TensorBoard summaries, best-checkpoint snapshots, and file logs under a run output root.

By default that root is output/, but you can override it per run with output_root in the engine config:

trainer_cfg:
  output_root: /mnt/hddl/data/OpenGait-output

evaluator_cfg:
  output_root: /mnt/hddl/data/OpenGait-output

The final path layout stays the same under that root:

<output_root>/<dataset>/<model>/<save_name>/

For long scoliosis runs, using an HDD-backed root is recommended so local SSD space is not consumed by:

  • numbered checkpoints
  • rolling resume checkpoints
  • best-N retained checkpoints
  • TensorBoard summary files
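Before starting a long run, it can be worth confirming that the intended root exists and has space. The path below matches the config example above; the /tmp fallback is only so the check degrades gracefully on machines without that mount.

```shell
# Confirm the HDD-backed output root exists and report free space.
OUTPUT_ROOT=/mnt/hddl/data/OpenGait-output
mkdir -p "$OUTPUT_ROOT" 2>/dev/null || OUTPUT_ROOT=/tmp   # fall back if the mount is absent
df -h "$OUTPUT_ROOT"
```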

GPU selection

Prefer GPU UUIDs, not ordinal indices.

Reason:

  • local CUDA_VISIBLE_DEVICES ordinal mapping can be unstable or surprising
  • UUIDs make the intended device explicit

Get UUIDs with:

nvidia-smi -L
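To pull just the UUIDs out of that listing, a sed filter like the one below works on the usual `GPU N: <name> (UUID: GPU-...)` line format. The sample line stands in for real output so the snippet runs without a GPU; in practice, pipe `nvidia-smi -L` into the sed instead.

```shell
# Extract GPU UUIDs from nvidia-smi -L style output.
printf 'GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6)\n' \
  | sed -n 's/.*UUID: \(GPU-[0-9a-f-]*\)).*/\1/p'
# → GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
```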

Notes

  • The helper uses torchrun through python -m torch.distributed.run.
  • --nproc_per_node is inferred from the number of UUIDs passed to --gpu-uuids.
  • OpenGait evaluator constraints still apply: test batch/world-size settings must match the visible GPU count.
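The inference of --nproc_per_node from the UUID list can be sketched as a simple count of comma-separated entries (hypothetical; this mirrors what the helper is described as doing, not its exact code):

```shell
# Derive the process count from a comma-separated UUID list.
GPU_UUIDS="GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,GPU-1155e14e-6097-5942-7feb-20453868b202"
NPROC=$(printf '%s\n' "$GPU_UUIDS" | tr ',' '\n' | wc -l)
echo "--nproc_per_node=$NPROC"   # → --nproc_per_node=2
```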