feat: retain best checkpoints and support alternate output roots
@@ -85,4 +85,28 @@
>> if torch.distributed.get_rank() == 0 and self.training and self.iteration % 100 == 0:
>>     summary_writer.add_video('outs', outs.mean(2).unsqueeze(2), self.iteration)
>> ```
> Note that this example requires the [`moviepy`](https://github.com/Zulko/moviepy) package, and hence you should run `pip install moviepy` first.

### Keep Best Checkpoints

> If you want to retain the strongest evaluation checkpoints instead of relying only on the latest or final save, you can enable best-checkpoint tracking in `trainer_cfg`.
>
> Example:
>> ```yaml
>> trainer_cfg:
>>   with_test: true
>>   eval_iter: 1000
>>   save_iter: 1000
>>   best_ckpt_cfg:
>>     keep_n: 3
>>     metric_names:
>>       - scalar/test_f1/
>>       - scalar/test_accuracy/
>> ```
>
> Behavior:
> * The normal numbered checkpoints are still written by `save_iter`.
> * After each eval, the trainer checks the configured scalar metrics and keeps the top `N` checkpoints separately for each metric.
> * Best checkpoints are saved under `output/.../checkpoints/best/<metric>/`.
> * Each best-metric directory contains an `index.json` file with the retained iterations, scores, and paths.
>
> This is useful for long or unstable runs where the best checkpoint may appear well before the final iteration.
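The per-metric retention described above can be sketched roughly as follows. This is an illustrative sketch, not the trainer's actual implementation: the function name `update_best_checkpoints`, its signature, and the exact `index.json` layout are assumptions, and it assumes higher metric values are better.

```python
import heapq
import json
from pathlib import Path

def update_best_checkpoints(best_dir, metric, iteration, score, ckpt_path, keep_n=3):
    """Track the top-`keep_n` checkpoints for one metric in an index.json file.

    Hypothetical helper: mirrors the documented behavior (top-N per metric,
    index of retained iterations/scores/paths), not the real trainer API.
    """
    # One directory per metric, e.g. "scalar/test_f1/" -> "scalar_test_f1"
    metric_dir = Path(best_dir) / metric.strip("/").replace("/", "_")
    metric_dir.mkdir(parents=True, exist_ok=True)
    index_file = metric_dir / "index.json"

    # Load the existing index (if any) and add the new candidate.
    entries = json.loads(index_file.read_text()) if index_file.exists() else []
    entries.append({"iteration": iteration, "score": score, "path": str(ckpt_path)})

    # Keep only the strongest keep_n entries; assumes higher score is better.
    entries = heapq.nlargest(keep_n, entries, key=lambda e: e["score"])
    index_file.write_text(json.dumps(entries, indent=2))
    return entries
```

In a trainer, this would be called after each evaluation, once per configured metric; checkpoints whose entries fall out of the index could then be deleted from the `best/<metric>/` directory.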