# Stable Long-Running Training with `systemd-run --user`

This note documents the recommended way to launch long OpenGait jobs on a local workstation.

## Why use `systemd-run --user`

For long training runs, `systemd-run --user` is more reliable than shell background tricks like:

- `nohup ... &`
- `disown`
- one-shot detached shell wrappers

Why:

- the training process is supervised by the user systemd instance instead of a transient shell process
- stdout/stderr can be sent to a real log file and the systemd journal
- you can query status with `systemctl --user`
- you can stop the job cleanly with `systemctl --user stop ...`
- the job is no longer tied to the lifetime of a tool process tree

In this repo, detached shell launches were observed to die unexpectedly even when the training code itself was healthy. `systemd-run --user` avoids that failure mode.

## Prerequisites

Check that user services are available:

```bash
systemd-run --user --version
```

If you want jobs to survive logout, enable linger:

```bash
loginctl enable-linger "$USER"
```

This is optional if you only need the job to survive shell/session teardown while you stay logged in.
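As context for the repo helper described below, the basic transient-unit pattern can be sketched as follows. The unit name and payload here are illustrative assumptions, and the command is printed rather than executed so its shape is easy to inspect before running it for real:

```bash
#!/usr/bin/env bash
# Sketch of a transient user-service launch (unit name and payload are
# illustrative). Printed instead of executed so the shape can be reviewed.
unit=demo-train
cmd=(systemd-run --user
  --unit="$unit"   # name the transient unit so systemctl/journalctl can find it
  --same-dir       # keep the current working directory
  --collect        # garbage-collect the unit after it exits
  bash -c 'echo training... && sleep 3600')
printf '%s\n' "${cmd[@]}"
```

Once launched for real, `systemctl --user status demo-train` and `journalctl --user -u demo-train` work exactly as for any other user unit.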
## Recommended launcher

Use the helper script:

- [scripts/systemd_run_opengait.py](/home/crosstyan/Code/OpenGait/scripts/systemd_run_opengait.py)

It:

- uses `torchrun` (`python -m torch.distributed.run`), not the deprecated `torch.distributed.launch`
- accepts GPU UUIDs instead of ordinal indices
- launches a transient user service with `systemd-run --user`
- writes stdout/stderr to a real log file
- provides `status`, `logs`, and `stop` helpers

## Launch examples

Single-GPU train:

```bash
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_skeleton_112_sigma15_joint8_bodyonly_plaince_bridge_1gpu_10k.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
```

Single-GPU eval:

```bash
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
  --phase test \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
```

Two-GPU train:

```bash
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,GPU-1155e14e-6097-5942-7feb-20453868b202
```

Dry run:

```bash
uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
  --dry-run
```

## Monitoring and control

Show service status:

```bash
uv run python scripts/systemd_run_opengait.py status opengait-baseline-train
```

Show recent journal lines:

```bash
uv run python scripts/systemd_run_opengait.py logs opengait-baseline-train -n 200
```

Follow the journal directly:

```bash
journalctl --user -u opengait-baseline-train -f
```

Stop the run:

```bash
uv run python scripts/systemd_run_opengait.py stop opengait-baseline-train
```

## Logging behavior

The launcher configures both:

- a file log under `/tmp` by default
- the systemd journal for the transient unit

This makes it easier to recover logs even if
the original shell or tool session disappears.

## Moving outputs off the SSD

OpenGait writes checkpoints, TensorBoard summaries, best-checkpoint snapshots, and file logs under a run output root. By default that root is `output/`, but you can override it per run with `output_root` in the engine config:

```yaml
trainer_cfg:
  output_root: /mnt/hddl/data/OpenGait-output
evaluator_cfg:
  output_root: /mnt/hddl/data/OpenGait-output
```

The final path layout stays the same under that root:

```text
<output_root>/<dataset_name>/<model_name>/<save_name>/
```

For long scoliosis runs, using an HDD-backed root is recommended so local SSD space is not consumed by:

- numbered checkpoints
- rolling resume checkpoints
- best-N retained checkpoints
- TensorBoard summary files

## GPU selection

Prefer GPU UUIDs, not ordinal indices. Reasons:

- local `CUDA_VISIBLE_DEVICES` ordinal mapping can be unstable or surprising
- UUIDs make the intended device explicit

Get UUIDs with:

```bash
nvidia-smi -L
```

## Notes

- The helper uses `torchrun` through `python -m torch.distributed.run`.
- `--nproc_per_node` is inferred from the number of UUIDs passed to `--gpu-uuids`.
- OpenGait evaluator constraints still apply: test batch/world-size settings must match the visible GPU count.
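To go from `nvidia-smi -L` output to the comma-separated value `--gpu-uuids` expects, a small sketch (the sample output below is a hard-coded stand-in, including the GPU model names; on a real machine you would pipe `nvidia-smi -L` instead):

```bash
#!/usr/bin/env bash
# Extract GPU UUIDs from "nvidia-smi -L"-style output and join them with
# commas. The sample text is stand-in data, not a real query.
sample='GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-1155e14e-6097-5942-7feb-20453868b202)'
uuids=$(printf '%s\n' "$sample" | grep -o 'GPU-[0-9a-f-]*' | paste -sd, -)
echo "$uuids"
```

Note that only the `GPU-...` UUID tokens match the pattern; the leading `GPU 0:` labels do not, because they lack the hyphen.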