feat: add systemd-run training launcher and docs

2026-03-11 00:45:02 +08:00
parent e2908febfa
commit 63e2ed1097
4 changed files with 346 additions and 0 deletions
+2
@@ -49,6 +49,8 @@ CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 o
You can run commands in [train.sh](train.sh) for training different models.
For long-running local jobs, prefer the supervised `systemd-run --user` workflow documented in [systemd-run-training.md](systemd-run-training.md). It uses `torchrun`, UUID-based GPU selection, real log files, and survives shell/session teardown more reliably than `nohup ... &`.
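As a rough illustration of that workflow, a launch might be assembled as in the sketch below. This is not the launcher added by this commit: the GPU UUIDs, unit name, log path, and `train.py` script name are placeholder assumptions, and the `append:` log redirection assumes a reasonably recent systemd. By default the script only prints the command (dry run), so it can be inspected before anything is started.

```shell
#!/usr/bin/env bash
# Sketch only: all values below are hypothetical placeholders, not taken from this repo.
set -euo pipefail

# GPU UUIDs as reported by `nvidia-smi -L`; CUDA_VISIBLE_DEVICES accepts UUIDs as well as indices.
GPU_UUIDS="${GPU_UUIDS:-GPU-aaaa,GPU-bbbb}"
UNIT_NAME="${UNIT_NAME:-train-job}"        # assumed transient unit name
LOG_FILE="${LOG_FILE:-$PWD/train.log}"     # assumed log location
TRAIN_SCRIPT="${TRAIN_SCRIPT:-train.py}"   # assumed training entry point

build_launch_cmd() {
  # Build the command as an array so every argument stays intact.
  local cmd=(
    systemd-run --user
    --unit="$UNIT_NAME"
    --working-directory="$PWD"
    --setenv=CUDA_VISIBLE_DEVICES="$GPU_UUIDS"
    --property=StandardOutput="append:$LOG_FILE"
    --property=StandardError="append:$LOG_FILE"
    torchrun --nproc_per_node=2 "$TRAIN_SCRIPT"
  )
  # Print a shell-quoted form of the command on one line.
  printf '%q ' "${cmd[@]}"
  printf '\n'
}

if [[ "${DRY_RUN:-1}" == "1" ]]; then
  build_launch_cmd            # default: only show what would run
else
  eval "$(build_launch_cmd)"  # start the job as a transient user unit
fi
```

With `DRY_RUN=0` the job runs as a transient user unit, so `systemctl --user status train-job` and `tail -f train.log` can be used to follow it. Depending on the host's login configuration, `loginctl enable-linger` may also be needed for user units to keep running after logout.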
## Test
Evaluate the trained model by
```