feat: add systemd-run training launcher and docs

2026-03-11 00:45:02 +08:00
parent e2908febfa
commit 63e2ed1097
4 changed files with 346 additions and 0 deletions
@@ -49,6 +49,8 @@ CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 o
 You can run commands in [train.sh](train.sh) for training different models.
 For long-running local jobs, prefer the supervised `systemd-run --user` workflow documented in [systemd-run-training.md](systemd-run-training.md). It uses `torchrun`, UUID-based GPU selection, and real log files, and it survives shell/session teardown more reliably than `nohup ... &`.
 ## Test
 Evaluate the trained model by
 ```
@@ -3,6 +3,7 @@
 This note records the current Scoliosis1K implementation status in this repo and the main conclusions from the recent reproduction/debugging work.
 For a stricter paper-vs-local reproducibility breakdown, see [scoliosis_reproducibility_audit.md](scoliosis_reproducibility_audit.md).
 For the recommended long-running local launch workflow, see [systemd-run-training.md](systemd-run-training.md).
 ## Current status
@@ -79,6 +80,9 @@ The current working conclusion is:
 - the main remaining suspect is the skeleton-map representation and preprocessing path
 - for practical model development, `1:1:2` is currently the better working split than `1:1:8`
 - for practical model development, the current best skeleton recipe is still `body-only + plain CE + SGD` on `1:1:2`
 - the first practical DRF bridge on that same winning `1:1:2` recipe did not improve on the plain skeleton baseline:
  - best retained DRF checkpoint (`2000`) on the full test set: `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1`
  - current best plain skeleton checkpoint (`7000`) on the full test set: `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1`
 For readability in this repo's docs, `ScoNet-MT-ske` refers to the skeleton-map variant that the DRF paper writes as `ScoNet-MT^{ske}`.
@@ -0,0 +1,146 @@
 # Stable Long-Running Training with `systemd-run --user`
 This note documents the recommended way to launch long OpenGait jobs on a local workstation.
 ## Why use `systemd-run --user`
 For long training runs, `systemd-run --user` is more reliable than shell background tricks like:
 - `nohup ... &`
 - `disown`
 - one-shot detached shell wrappers
 Why:
 - the training process is supervised by the user systemd instance instead of a transient shell process
 - stdout/stderr can be sent to a real log file or to the systemd journal
 - you can query status with `systemctl --user`
 - you can stop the job cleanly with `systemctl --user stop ...`
 - the job is no longer tied to the lifetime of the shell or tool process tree that launched it
 In this repo, detached shell launches were observed to die unexpectedly even when the training code itself was healthy. `systemd-run --user` avoids that failure mode.
 ## Prerequisites
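 Check that `systemd-run` is installed and that your user service manager is reachable:
 ```bash
 systemd-run --user --version
 systemctl --user is-system-running
 ```
 The first command only prints the version. The second prints the manager state (for example `running` or `degraded`) and fails with a bus connection error if user services are unavailable.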
 If you want jobs to survive logout, enable linger:
 ```bash
 loginctl enable-linger "$USER"
 ```
 This is optional if you only need the job to survive shell/session teardown while you stay logged in.
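Whether linger is already enabled can be checked before launching. The sketch below is a minimal illustration and not part of the helper script: it assumes `loginctl show-user <user> --property=Linger` prints a line such as `Linger=yes`, and the names `parse_linger` and `linger_enabled` are hypothetical.

```python
from __future__ import annotations

import getpass
import subprocess


def parse_linger(output: str) -> bool:
    """Return True if the `loginctl show-user` output contains `Linger=yes`."""
    for line in output.splitlines():
        key, _, value = line.strip().partition("=")
        if key == "Linger":
            return value == "yes"
    return False


def linger_enabled(user: str | None = None) -> bool:
    """Ask loginctl whether linger is enabled for `user` (default: current user)."""
    result = subprocess.run(
        ["loginctl", "show-user", user or getpass.getuser(), "--property=Linger"],
        text=True,
        capture_output=True,
        check=False,
    )
    # A non-zero exit code (e.g. unknown user) is treated as "not enabled".
    return result.returncode == 0 and parse_linger(result.stdout)
```

Keeping the parsing separate from the `subprocess` call makes the check easy to test without a running systemd instance.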
 ## Recommended launcher
 Use the helper script:
 - `scripts/systemd_run_opengait.py` (path relative to the repository root)
 It:
 - uses `torchrun` (`python -m torch.distributed.run`), not deprecated `torch.distributed.launch`
 - accepts GPU UUIDs instead of ordinal indices
 - launches a transient user service with `systemd-run --user`
 - writes stdout/stderr to a real log file
 - provides `status`, `logs`, and `stop` helpers
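The points above roughly correspond to assembling a single `systemd-run` command line. The following is a simplified sketch of that assembly, not the helper script itself: `build_command` is a hypothetical name, and the unit name, log path, and config path in the usage lines are placeholder values.

```python
from __future__ import annotations


def build_command(unit: str, log_path: str, gpu_uuids: list[str], cfgs: str, phase: str) -> list[str]:
    """Build a systemd-run argv that starts torchrun with one process per GPU UUID."""
    return [
        "systemd-run", "--user", "--unit", unit,
        "--collect",   # unload the transient unit after it exits, even on failure
        "--same-dir",  # run the service in the caller's working directory
        "--property", f"StandardOutput=append:{log_path}",
        "--property", f"StandardError=append:{log_path}",
        "--setenv", f"CUDA_VISIBLE_DEVICES={','.join(gpu_uuids)}",
        "uv", "run", "python", "-m", "torch.distributed.run",
        "--nproc_per_node", str(len(gpu_uuids)),
        "opengait/main.py", "--cfgs", cfgs, "--phase", phase,
    ]


argv = build_command(
    unit="opengait-baseline-train",
    log_path="/tmp/opengait-baseline-train.log",
    gpu_uuids=["GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6"],
    cfgs="configs/baseline/baseline.yaml",
    phase="train",
)
print(" ".join(argv))
```

Building the command as a list rather than a shell string avoids quoting problems when it is passed to `subprocess.run`.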
 ## Launch examples
 Single-GPU train:
 ```bash
 uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_skeleton_112_sigma15_joint8_bodyonly_plaince_bridge_1gpu_10k.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
 ```
 Single-GPU eval:
 ```bash
 uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/sconet/sconet_scoliosis1k_local_eval_1gpu.yaml \
  --phase test \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6
 ```
 Two-GPU train:
 ```bash
 uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6,GPU-1155e14e-6097-5942-7feb-20453868b202
 ```
 Dry run:
 ```bash
 uv run python scripts/systemd_run_opengait.py launch \
  --cfgs configs/baseline/baseline.yaml \
  --phase train \
  --gpu-uuids GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
  --dry-run
 ```
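When `--unit` is not given, the unit name is derived from the config file stem and the phase, which is why the `baseline.yaml` training examples run as `opengait-baseline-train`. A minimal sketch of that derivation, mirroring the helper's sanitization rule (the function name `default_unit_name` is illustrative):

```python
import re
from pathlib import Path


def default_unit_name(cfgs: str, phase: str) -> str:
    """Derive `opengait-<config stem>-<phase>`, replacing characters outside [A-Za-z0-9_.@-]."""
    raw = f"opengait-{Path(cfgs).stem}-{phase}"
    return re.sub(r"[^A-Za-z0-9_.@-]+", "-", raw).strip("-")


print(default_unit_name("configs/baseline/baseline.yaml", "train"))
```

Because the name depends only on the config stem and phase, two launches of the same config and phase use the same unit name unless `--unit` is passed explicitly.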
 ## Monitoring and control
 Show service status:
 ```bash
 uv run python scripts/systemd_run_opengait.py status opengait-baseline-train
 ```
 Show recent journal lines:
 ```bash
 uv run python scripts/systemd_run_opengait.py logs opengait-baseline-train -n 200
 ```
 Follow the journal directly:
 ```bash
 journalctl --user -u opengait-baseline-train -f
 ```
 Stop the run:
 ```bash
 uv run python scripts/systemd_run_opengait.py stop opengait-baseline-train
 ```
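For scripted monitoring, the unit state can also be polled until the run ends. This is a hedged sketch, not part of the helper: it relies on `systemctl --user is-active <unit>` printing a state such as `active` or `inactive`, and the names `unit_state` and `wait_until_finished` as well as the injectable `run` parameter are assumptions made for illustration and testing.

```python
import subprocess
import time
from typing import Callable


def unit_state(unit: str, run: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> str:
    """Return the state printed by `systemctl --user is-active <unit>`."""
    result = run(
        ["systemctl", "--user", "is-active", unit],
        text=True,
        capture_output=True,
        check=False,  # is-active exits non-zero for inactive units; that is not an error here
    )
    return result.stdout.strip()


def wait_until_finished(unit: str, poll_seconds: float = 30.0, run=subprocess.run) -> str:
    """Poll until the unit is no longer active or activating, then return the last state seen."""
    while True:
        state = unit_state(unit, run)
        if state not in ("active", "activating"):
            return state
        time.sleep(poll_seconds)
```

A call such as `wait_until_finished("opengait-baseline-train")` blocks until the training unit stops, after which the log file can be inspected for the result.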
 ## Logging behavior
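 The launcher sets `StandardOutput=append:<log>` and `StandardError=append:<log>` on the transient unit, so:
 - training stdout/stderr is appended to a log file, `/tmp/<unit>.log` by default
 - the systemd journal for the transient unit records the unit's lifecycle messages (start, stop, exit status), not the training output itself
 Because the log file is opened by systemd rather than by a shell redirect, the logs remain recoverable even if the original shell or tool session disappears.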
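Reading the tail of the log file is a quick way to check progress from a script. A minimal sketch, assuming the default `/tmp/<unit>.log` location; `tail_lines` is a hypothetical helper and not part of the launcher:

```python
from collections import deque
from pathlib import Path


def tail_lines(log_path: str, n: int = 200) -> list[str]:
    """Return the last `n` lines of a log file, keeping at most `n` lines in memory."""
    with Path(log_path).open("r", encoding="utf-8", errors="replace") as handle:
        # deque with maxlen discards older lines as the file is streamed.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]
```

For example, `tail_lines("/tmp/opengait-baseline-train.log", 20)` returns the most recent 20 lines of the baseline training log, if that file exists.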
 ## GPU selection
 Prefer GPU UUIDs, not ordinal indices.
 Reason:
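 - ordinal indices in `CUDA_VISIBLE_DEVICES` follow CUDA's own enumeration order, which by default (`CUDA_DEVICE_ORDER=FASTEST_FIRST`) can differ from the PCI bus order that `nvidia-smi` reports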
 - UUIDs make the intended device explicit
 Get UUIDs with:
 ```bash
 nvidia-smi -L
 ```
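The `nvidia-smi -L` output can be turned into a `--gpu-uuids` value programmatically. The sketch below assumes each physical GPU is listed as `GPU <index>: <name> (UUID: GPU-...)`; the device names in the sample text are placeholders, and `parse_gpu_list` is a hypothetical helper.

```python
import re

# Matches lines such as: "GPU 0: <name> (UUID: GPU-xxxxxxxx-...)"
_GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (GPU-[0-9a-fA-F-]+)\)$")


def parse_gpu_list(output: str) -> dict[int, str]:
    """Map each GPU index reported by `nvidia-smi -L` to its UUID."""
    mapping = {}
    for line in output.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:
            mapping[int(match.group(1))] = match.group(3)
    return mapping


# Illustrative sample text; real output comes from running `nvidia-smi -L`.
sample = (
    "GPU 0: Example GPU A (UUID: GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6)\n"
    "GPU 1: Example GPU B (UUID: GPU-1155e14e-6097-5942-7feb-20453868b202)\n"
)
uuids = parse_gpu_list(sample)
print(",".join(uuids[i] for i in sorted(uuids)))  # comma-separated value for --gpu-uuids
```

Lines that do not match the pattern are ignored, so extra rows in the listing do not end up in the UUID list.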
 ## Notes
 - The helper uses `torchrun` through `python -m torch.distributed.run`.
 - `--nproc_per_node` is inferred from the number of UUIDs passed to `--gpu-uuids`.
 - OpenGait evaluator constraints still apply: test batch/world-size settings must match the visible GPU count.
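The last two notes can be illustrated together. This sketch reproduces the UUID splitting used to infer `--nproc_per_node` and adds a hypothetical pre-flight check; it assumes that "match" means the test batch size equals the number of visible GPUs, and `check_test_batch_size` is not part of OpenGait or the helper.

```python
def split_gpu_uuids(value: str) -> list[str]:
    """Split a comma-separated UUID string, dropping whitespace and empty entries."""
    uuids = [part.strip() for part in value.split(",") if part.strip()]
    if not uuids:
        raise ValueError("At least one GPU UUID is required.")
    return uuids


def check_test_batch_size(test_batch_size: int, gpu_uuids: str) -> int:
    """Return the inferred process count if it equals the test batch size, else raise."""
    nproc = len(split_gpu_uuids(gpu_uuids))  # one torchrun process per UUID
    if test_batch_size != nproc:
        raise ValueError(f"test batch size {test_batch_size} != visible GPU count {nproc}")
    return nproc
```

Running such a check before a `--phase test` launch surfaces a mismatch immediately instead of after the service has started.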
@@ -0,0 +1,194 @@
 from __future__ import annotations
 import re
 import subprocess
 from collections.abc import Sequence
 from pathlib import Path
 import click
 REPO_ROOT = Path(__file__).resolve().parents[1]
 DEFAULT_LOG_DIR = Path("/tmp")
 def _sanitize_unit_name(raw: str) -> str:
    sanitized = re.sub(r"[^A-Za-z0-9_.@-]+", "-", raw).strip("-")
    if not sanitized:
        raise click.ClickException("Unit name cannot be empty after sanitization.")
    return sanitized
 def _split_gpu_uuids(value: str) -> list[str]:
    uuids = [part.strip() for part in value.split(",") if part.strip()]
    if not uuids:
        raise click.ClickException("At least one GPU UUID is required.")
    return uuids
 def _run_command(
    args: Sequence[str],
    *,
    cwd: Path | None = None,
    check: bool = True,
 ) -> subprocess.CompletedProcess[str]:
    return subprocess.run(
        list(args),
        cwd=str(cwd) if cwd is not None else None,
        text=True,
        capture_output=True,
        check=check,
    )
 def _default_unit_name(cfgs: Path, phase: str) -> str:
    stem = cfgs.stem
    return _sanitize_unit_name(f"opengait-{stem}-{phase}")
@click.group()
 def cli() -> None:
    """Launch and manage OpenGait runs under systemd user services."""
@cli.command("launch")
@click.option("--cfgs", type=click.Path(path_type=Path, exists=True, dir_okay=False), required=True)
@click.option("--phase", type=click.Choice(["train", "test"]), required=True)
@click.option(
    "--gpu-uuids",
    required=True,
    help="Comma-separated GPU UUID list for CUDA_VISIBLE_DEVICES.",
 )
@click.option("--unit", type=str, default=None, help="systemd unit name. Defaults to a name derived from cfgs + phase.")
@click.option(
    "--log-path",
    type=click.Path(path_type=Path, dir_okay=False),
    default=None,
    help="Optional file to append stdout/stderr to. Defaults to /tmp/<unit>.log",
 )
@click.option(
    "--workdir",
    type=click.Path(path_type=Path, file_okay=False),
    default=REPO_ROOT,
    show_default=True,
 )
@click.option("--description", type=str, default=None, help="Optional systemd unit description.")
@click.option("--dry-run", is_flag=True, help="Print the resolved systemd-run command without launching it.")
 def launch(
    cfgs: Path,
    phase: str,
    gpu_uuids: str,
    unit: str | None,
    log_path: Path | None,
    workdir: Path,
    description: str | None,
    dry_run: bool,
 ) -> None:
    """Launch an OpenGait run via systemd-run --user using torchrun."""
    resolved_cfgs = cfgs if cfgs.is_absolute() else (workdir / cfgs).resolve()
    if not resolved_cfgs.exists():
        raise click.ClickException(f"Config not found: {resolved_cfgs}")
    unit_name = _sanitize_unit_name(unit) if unit is not None else _default_unit_name(resolved_cfgs, phase)
    resolved_log_path = (log_path if log_path is not None else DEFAULT_LOG_DIR / f"{unit_name}.log").resolve()
    resolved_log_path.parent.mkdir(parents=True, exist_ok=True)
    gpu_uuid_list = _split_gpu_uuids(gpu_uuids)
    nproc = len(gpu_uuid_list)
    command = [
        "systemd-run",
        "--user",
        "--unit",
        unit_name,
        "--collect",
        "--same-dir",
        "--property",
        "KillMode=mixed",
        "--property",
        f"StandardOutput=append:{resolved_log_path}",
        "--property",
        f"StandardError=append:{resolved_log_path}",
        "--setenv",
        f"CUDA_VISIBLE_DEVICES={','.join(gpu_uuid_list)}",
    ]
    if description:
        command.extend(["--description", description])
    command.extend(
        [
            "uv",
            "run",
            "python",
            "-m",
            "torch.distributed.run",
            "--nproc_per_node",
            str(nproc),
            "opengait/main.py",
            "--cfgs",
            str(resolved_cfgs),
            "--phase",
            phase,
        ]
    )
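    if dry_run:
        import shlex  # local import: only needed to shell-quote the dry-run output

        # Quote each argument so the printed command can be pasted into a shell as-is.
        click.echo(shlex.join(command))
        return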
    result = _run_command(command, cwd=workdir, check=False)
    if result.returncode != 0:
        raise click.ClickException(
            f"systemd-run launch failed.\nstdout:\n{result.stdout}\nstderr:\n{result.stderr}"
        )
    click.echo(f"unit={unit_name}")
    click.echo(f"log={resolved_log_path}")
    click.echo("journal: journalctl --user -u " + unit_name + " -f")
    if result.stdout.strip():
        click.echo(result.stdout.strip())
@cli.command("status")
@click.argument("unit")
 def status(unit: str) -> None:
    """Show systemd user-unit status."""
    result = _run_command(["systemctl", "--user", "status", unit], check=False)
    click.echo(result.stdout, nl=False)
    if result.stderr:
        click.echo(result.stderr, err=True, nl=False)
    if result.returncode != 0:
        raise SystemExit(result.returncode)
@cli.command("logs")
@click.argument("unit")
@click.option("-n", "--lines", type=int, default=200, show_default=True)
 def logs(unit: str, lines: int) -> None:
    """Show recent journal lines for a unit."""
    result = _run_command(
        ["journalctl", "--user", "-u", unit, "-n", str(lines), "--no-pager"],
        check=False,
    )
    click.echo(result.stdout, nl=False)
    if result.stderr:
        click.echo(result.stderr, err=True, nl=False)
    if result.returncode != 0:
        raise SystemExit(result.returncode)
@cli.command("stop")
@click.argument("unit")
 def stop(unit: str) -> None:
    """Stop a systemd user unit."""
    result = _run_command(["systemctl", "--user", "stop", unit], check=False)
    if result.stdout:
        click.echo(result.stdout, nl=False)
    if result.stderr:
        click.echo(result.stderr, err=True, nl=False)
    if result.returncode != 0:
        raise SystemExit(result.returncode)
 if __name__ == "__main__":
    cli()