feat: retain best checkpoints and support alternate output roots

2026-03-11 01:14:05 +08:00
parent 63e2ed1097
commit a0150c791f
14 changed files with 852 additions and 9 deletions
@@ -85,4 +85,28 @@
>> if torch.distributed.get_rank() == 0 and self.training and self.iteration % 100==0:
>> summary_writer.add_video('outs', outs.mean(2).unsqueeze(2), self.iteration)
>> ```
> Note that this example requires the [`moviepy`](https://github.com/Zulko/moviepy) package; install it first with `pip install moviepy`.
### Keep Best Checkpoints
> If you want to retain the strongest evaluation checkpoints instead of relying only on the latest or final save, you can enable best-checkpoint tracking in `trainer_cfg`.
>
> Example:
>> ```yaml
>> trainer_cfg:
>> with_test: true
>> eval_iter: 1000
>> save_iter: 1000
>> best_ckpt_cfg:
>> keep_n: 3
>> metric_names:
>> - scalar/test_f1/
>> - scalar/test_accuracy/
>> ```
>
> Behavior:
> * The normal numbered checkpoints are still written by `save_iter`.
> * After each eval, the trainer reads the configured scalar metrics and retains the top `keep_n` checkpoints independently for each metric.
> * Best checkpoints are saved under `output/.../checkpoints/best/<metric>/`.
> * Each best-metric directory contains an `index.json` file with the retained iterations, scores, and paths.
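>
> For example, an `index.json` under a best-metric directory might look roughly like this (field names and values here are illustrative; the actual schema may differ):
>> ```json
>> [
>>   {"iteration": 2000, "score": 0.71, "path": "checkpoints/best/scalar_test_f1/iter_2000.pt"},
>>   {"iteration": 3000, "score": 0.64, "path": "checkpoints/best/scalar_test_f1/iter_3000.pt"}
>> ]
>> ```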
>
> This is useful for long or unstable runs where the best checkpoint may appear well before the final iteration.
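>
> The retention step described above can be sketched roughly as follows. This is a minimal standalone illustration, not the trainer's actual implementation; the class name `BestCheckpointTracker` and its methods are hypothetical:
>> ```python
>> import heapq
>> import json
>> import os
>> import shutil
>>
>> class BestCheckpointTracker:
>>     """Keep the top-N checkpoints per metric (hypothetical sketch)."""
>>
>>     def __init__(self, best_dir, metric_names, keep_n=3):
>>         self.best_dir = best_dir
>>         self.keep_n = keep_n
>>         # One min-heap of (score, iteration) per metric: the lowest
>>         # score sits at heap[0] and is the first candidate for eviction.
>>         self.heaps = {m: [] for m in metric_names}
>>
>>     def update(self, iteration, metrics, ckpt_path):
>>         for name, heap in self.heaps.items():
>>             if name not in metrics:
>>                 continue
>>             score = metrics[name]
>>             metric_dir = os.path.join(
>>                 self.best_dir, name.strip('/').replace('/', '_'))
>>             os.makedirs(metric_dir, exist_ok=True)
>>             if len(heap) < self.keep_n:
>>                 heapq.heappush(heap, (score, iteration))
>>                 shutil.copy(ckpt_path,
>>                             os.path.join(metric_dir, f'iter_{iteration}.pt'))
>>             elif score > heap[0][0]:
>>                 # Evict the current worst entry and delete its file.
>>                 _, worst_iter = heapq.heapreplace(heap, (score, iteration))
>>                 old = os.path.join(metric_dir, f'iter_{worst_iter}.pt')
>>                 if os.path.exists(old):
>>                     os.remove(old)
>>                 shutil.copy(ckpt_path,
>>                             os.path.join(metric_dir, f'iter_{iteration}.pt'))
>>             self._write_index(metric_dir, heap)
>>
>>     def _write_index(self, metric_dir, heap):
>>         # Record retained iterations, scores, and paths, best first.
>>         records = [{'iteration': it, 'score': s,
>>                     'path': os.path.join(metric_dir, f'iter_{it}.pt')}
>>                    for s, it in sorted(heap, reverse=True)]
>>         with open(os.path.join(metric_dir, 'index.json'), 'w') as f:
>>             json.dump(records, f, indent=2)
>> ```
>
> With `keep_n: 2`, a run whose eval scores go 0.5, 0.7, 0.6, 0.4 would end up retaining only the 0.7 and 0.6 checkpoints for that metric.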