feat: add systemd-run training launcher and docs
@@ -49,6 +49,8 @@ CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 o

You can run the commands in [train.sh](train.sh) to train different models.

For long-running local jobs, prefer the supervised `systemd-run --user` workflow documented in [systemd-run-training.md](systemd-run-training.md). It uses `torchrun`, UUID-based GPU selection, real log files, and survives shell/session teardown more reliably than `nohup ... &`.

## Test

Evaluate the trained model with:

```

@@ -3,6 +3,7 @@

This note records the current Scoliosis1K implementation status in this repo and the main conclusions from the recent reproduction/debugging work.

For a stricter paper-vs-local reproducibility breakdown, see [scoliosis_reproducibility_audit.md](scoliosis_reproducibility_audit.md).

For the recommended long-running local launch workflow, see [systemd-run-training.md](systemd-run-training.md).

## Current status

@@ -79,6 +80,9 @@ The current working conclusion is:

- the main remaining suspect is the skeleton-map representation and preprocessing path
- for practical model development, `1:1:2` is currently the better working split than `1:1:8`
- for practical model development, the current best skeleton recipe is still `body-only + plain CE + SGD` on `1:1:2`
- the first practical DRF bridge on that same winning `1:1:2` recipe did not improve on the plain skeleton baseline:
  - best retained DRF checkpoint (`2000`) on the full test set: `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1`
  - current best plain skeleton checkpoint (`7000`) on the full test set: `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1`

For readability in this repo's docs, `ScoNet-MT-ske` refers to the skeleton-map variant that the DRF paper writes as `ScoNet-MT^{ske}`.

@@ -0,0 +1,146 @@

# Stable Long-Running Training with `systemd-run --user`

This note documents the recommended way to launch long OpenGait jobs on a local workstation.

## Why use `systemd-run --user`

For long training runs, `systemd-run --user` is more reliable than shell background tricks like:

- `nohup ... &`
- `disown`
- one-shot detached shell wrappers

Why:

- the training process is supervised by the user systemd instance instead of a transient shell process
- stdout/stderr can be sent to a real log file and the systemd journal
- you can query status with `systemctl --user`
- you can stop the job cleanly with `systemctl --user stop ...`
- the job is no longer tied to the lifetime of a tool process tree

In this repo, detached shell launches were observed to die unexpectedly even when the training code itself was healthy. `systemd-run --user` avoids that failure mode.
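
The shape of the command such a launch builds can be sketched as follows. The unit name, log path, and training command below are illustrative placeholders, not the helper script's actual defaults:

```python
# Sketch of composing a transient user-service launch command.
# Unit name, log path, and training command are hypothetical examples;
# the real helper derives its own values from the config and phase.
import shlex

def build_systemd_run_cmd(unit: str, log_file: str, train_cmd: list[str]) -> list[str]:
    """Compose a `systemd-run --user` argv that supervises train_cmd."""
    return [
        "systemd-run", "--user",
        f"--unit={unit}",
        "--collect",  # garbage-collect the transient unit once it exits
        f"--property=StandardOutput=append:{log_file}",
        f"--property=StandardError=append:{log_file}",
        *train_cmd,
    ]

cmd = build_systemd_run_cmd(
    "opengait-demo-train",
    "/tmp/opengait-demo-train.log",
    ["python", "-m", "torch.distributed.run", "--nproc_per_node=1", "opengait/main.py"],
)
print(shlex.join(cmd))
```

Because the process tree is parented to the user systemd instance rather than the launching shell, the job keeps running after the shell exits.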

## Prerequisites

Check that the user service manager is running:

```bash
systemctl --user is-system-running
```

If you want jobs to survive logout, enable linger:

```bash
loginctl enable-linger "$USER"
```

This is optional if you only need the job to survive shell/session teardown while you stay logged in.

## Recommended launcher

Use the helper script:

- [scripts/systemd_run_opengait.py](scripts/systemd_run_opengait.py)

It:

- uses `torchrun` (`python -m torch.distributed.run`), not the deprecated `torch.distributed.launch`
- accepts GPU UUIDs instead of ordinal indices
- launches a transient user service with `systemd-run --user`
- writes stdout/stderr to a real log file
- provides `status`, `logs`, and `stop` helpers

## Launch examples

Single-GPU train:

```bash
uv run python scripts/systemd_run_opengait.py launch \
    --cfgs configs/sconet/sconet_scoliosis1k_skeleton_112_sigma15_joint8_bodyonly_plaince_bridge_1gpu_10k.yaml \
    --phase train \
    --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
```

Single-GPU eval:

```bash
uv run python scripts/systemd_run_opengait.py launch \
    --cfgs configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
    --phase test \
    --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
```

Two-GPU train:

```bash
uv run python scripts/systemd_run_opengait.py launch \
    --cfgs configs/baseline/baseline.yaml \
    --phase train \
    --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,GPU-1155e14e-6097-5942-7feb-20453868b202
```

Dry run:

```bash
uv run python scripts/systemd_run_opengait.py launch \
    --cfgs configs/baseline/baseline.yaml \
    --phase train \
    --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
    --dry-run
```

## Monitoring and control

Show service status:

```bash
uv run python scripts/systemd_run_opengait.py status opengait-baseline-train
```

Show recent journal lines:

```bash
uv run python scripts/systemd_run_opengait.py logs opengait-baseline-train -n 200
```

Follow the journal directly:

```bash
journalctl --user -u opengait-baseline-train -f
```

Stop the run:

```bash
uv run python scripts/systemd_run_opengait.py stop opengait-baseline-train
```

## Logging behavior

The launcher configures both:

- a file log under `/tmp` by default
- the systemd journal for the transient unit

This makes it easier to recover logs even if the original shell or tool session disappears.

## GPU selection

Prefer GPU UUIDs, not ordinal indices.

Reasons:

- local `CUDA_VISIBLE_DEVICES` ordinal mapping can be unstable or surprising
- UUIDs make the intended device explicit

Get UUIDs with:

```bash
nvidia-smi -L
```
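
CUDA itself also accepts UUIDs in `CUDA_VISIBLE_DEVICES`, which is one reason UUID-based selection composes cleanly with the launcher. A quick check, reusing the example UUID from this note:

```bash
# CUDA_VISIBLE_DEVICES accepts GPU UUIDs as well as ordinal indices,
# so the same stable identifier works outside the launcher too.
export CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
echo "$CUDA_VISIBLE_DEVICES"
```

With a UUID, the selected device no longer depends on enumeration order across reboots or driver updates.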

## Notes

- The helper uses `torchrun` through `python -m torch.distributed.run`.
- `--nproc_per_node` is inferred from the number of UUIDs passed to `--gpu-uuids`.
- OpenGait evaluator constraints still apply: test batch/world-size settings must match the visible GPU count.
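
The UUID-count inference described above can be sketched as follows; the function name and return shape are illustrative, not the helper's actual API:

```python
# Sketch: derive torchrun's --nproc_per_node and the CUDA device env var
# from a comma-separated UUID list, mirroring the inference the helper
# is described as doing. plan_launch is a hypothetical name.
def plan_launch(gpu_uuids: str) -> tuple[int, dict[str, str]]:
    uuids = [u.strip() for u in gpu_uuids.split(",") if u.strip()]
    if not uuids:
        raise ValueError("at least one GPU UUID is required")
    # CUDA accepts UUIDs in CUDA_VISIBLE_DEVICES, so pass them through as-is.
    return len(uuids), {"CUDA_VISIBLE_DEVICES": ",".join(uuids)}

nproc, env = plan_launch(
    "GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,GPU-1155e14e-6097-5942-7feb-20453868b202"
)
print(nproc)  # 2
```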