feat: retain best checkpoints and support alternate output roots

2026-03-11 01:14:05 +08:00
parent 63e2ed1097
commit a0150c791f
14 changed files with 852 additions and 9 deletions
@@ -85,4 +85,28 @@
>> if torch.distributed.get_rank() == 0 and self.training and self.iteration % 100==0:
>> summary_writer.add_video('outs', outs.mean(2).unsqueeze(2), self.iteration)
>> ```
> Note that this example requires the [`moviepy`](https://github.com/Zulko/moviepy) package; install it first with `pip install moviepy`.
### Keep Best Checkpoints
> If you want to retain the strongest evaluation checkpoints instead of relying only on the latest or final save, you can enable best-checkpoint tracking in `trainer_cfg`.
>
> Example:
>> ```yaml
>> trainer_cfg:
>> with_test: true
>> eval_iter: 1000
>> save_iter: 1000
>> best_ckpt_cfg:
>> keep_n: 3
>> metric_names:
>> - scalar/test_f1/
>> - scalar/test_accuracy/
>> ```
>
> Behavior:
> * The normal numbered checkpoints are still written by `save_iter`.
> * After each eval, the trainer reads the configured scalar metrics and retains the top `keep_n` checkpoints independently for each metric.
> * Best checkpoints are saved under `output/.../checkpoints/best/<metric>/`.
> * Each best-metric directory contains an `index.json` file with the retained iterations, scores, and paths.
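>
> For example, an `index.json` under a best-metric directory might look roughly like this (field names and values here are illustrative; the actual schema may differ):
>> ```json
>> [
>>   {"iteration": 2000, "score": 0.71, "path": "checkpoints/best/scalar_test_f1/iter_2000.pt"},
>>   {"iteration": 3000, "score": 0.64, "path": "checkpoints/best/scalar_test_f1/iter_3000.pt"}
>> ]
>> ```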
>
> This is useful for long or unstable runs where the best checkpoint may appear well before the final iteration.
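>
> The retention step described above can be sketched roughly as follows. This is a minimal standalone illustration, not the trainer's actual implementation; the class name `BestCheckpointTracker` and its methods are hypothetical:
>> ```python
>> import heapq
>> import json
>> import os
>> import shutil
>>
>> class BestCheckpointTracker:
>>     """Keep the top-N checkpoints per metric (hypothetical sketch)."""
>>
>>     def __init__(self, best_dir, metric_names, keep_n=3):
>>         self.best_dir = best_dir
>>         self.keep_n = keep_n
>>         # One min-heap of (score, iteration) per metric: the lowest
>>         # score sits at heap[0] and is the first candidate for eviction.
>>         self.heaps = {m: [] for m in metric_names}
>>
>>     def update(self, iteration, metrics, ckpt_path):
>>         for name, heap in self.heaps.items():
>>             if name not in metrics:
>>                 continue
>>             score = metrics[name]
>>             metric_dir = os.path.join(
>>                 self.best_dir, name.strip('/').replace('/', '_'))
>>             os.makedirs(metric_dir, exist_ok=True)
>>             if len(heap) < self.keep_n:
>>                 heapq.heappush(heap, (score, iteration))
>>                 shutil.copy(ckpt_path,
>>                             os.path.join(metric_dir, f'iter_{iteration}.pt'))
>>             elif score > heap[0][0]:
>>                 # Evict the current worst entry and delete its file.
>>                 _, worst_iter = heapq.heapreplace(heap, (score, iteration))
>>                 old = os.path.join(metric_dir, f'iter_{worst_iter}.pt')
>>                 if os.path.exists(old):
>>                     os.remove(old)
>>                 shutil.copy(ckpt_path,
>>                             os.path.join(metric_dir, f'iter_{iteration}.pt'))
>>             self._write_index(metric_dir, heap)
>>
>>     def _write_index(self, metric_dir, heap):
>>         # Record retained iterations, scores, and paths, best first.
>>         records = [{'iteration': it, 'score': s,
>>                     'path': os.path.join(metric_dir, f'iter_{it}.pt')}
>>                    for s, it in sorted(heap, reverse=True)]
>>         with open(os.path.join(metric_dir, 'index.json'), 'w') as f:
>>             json.dump(records, f, indent=2)
>> ```
>
> With `keep_n: 2`, a run whose eval scores go 0.5, 0.7, 0.6, 0.4 would end up retaining only the 0.7 and 0.6 checkpoints for that metric.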