# DRF Author Checkpoint Compatibility Note
This note records what happened when evaluating the author-provided DRF bundle in this repo:
- checkpoint: `artifact/scoliosis_drf_author_118_compat/DRF_118_unordered_iter2w_lr0.001_8830-08000.pt`
- config: `ckpt/drf_author/drf_scoliosis1k_20000.yaml`
The short version:
- the weight file is real and structurally usable
- the provided YAML is not a reliable source of truth
- the main problem was integration-contract mismatch, not a broken checkpoint
## What Was Wrong
The author bundle was internally inconsistent in several ways.
### 1. Split mismatch
The DRF paper says the main experiment uses `1:1:8`, i.e. the `118` split.
But the provided YAML pointed to:
- `./datasets/Scoliosis1K/Scoliosis1K_112.json`
while the checkpoint filename itself says:
- `DRF_118_...`
So the bundle already disagreed with itself.
### 2. Class-order mismatch
The biggest hidden bug was class ordering.
The current repo evaluator assumes:
- `negative = 0`
- `neutral = 1`
- `positive = 2`
But the author stub in `research/drf.py` uses:
- `negative = 0`
- `positive = 1`
- `neutral = 2`
That means an otherwise good checkpoint can look very bad if logits are interpreted in the wrong class order.
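The effect can be illustrated with a minimal sketch. It is not the in-tree implementation: `canonicalize_logits` and `argmax` are hypothetical helpers, and the logit values are invented. Only the two class orders come from the lists above.

```python
# Evaluator order assumed by the current repo (from the list above).
EVALUATOR_ORDER = ["negative", "neutral", "positive"]
# Order used by the author stub (from the list above).
AUTHOR_ORDER = ["negative", "positive", "neutral"]


def canonicalize_logits(logits, source_order, target_order=EVALUATOR_ORDER):
    """Reorder one row of logits from source_order into target_order."""
    if sorted(source_order) != sorted(target_order):
        raise ValueError("source and target must name the same classes")
    index_of = {name: i for i, name in enumerate(source_order)}
    return [logits[index_of[name]] for name in target_order]


def argmax(values):
    """Index of the largest value (hypothetical helper)."""
    return max(range(len(values)), key=lambda i: values[i])


# Invented logits that confidently predict "positive" in author order.
author_logits = [0.1, 2.5, 0.3]

# Reading author-order logits with evaluator-order indices mislabels the sample.
naive_label = EVALUATOR_ORDER[argmax(author_logits)]
# Permuting the logits first recovers the intended class.
fixed_label = EVALUATOR_ORDER[argmax(canonicalize_logits(author_logits, AUTHOR_ORDER))]

print(naive_label, fixed_label)  # neutral positive
```

Every `positive` prediction is scored as `neutral` and vice versa, while `negative` is unaffected, so the damage concentrates in the two minority classes.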
### 3. Legacy module-name mismatch
The author checkpoint stores PGA weights under:
- `attention_layer.*`
The current repo uses:
- `PGA.*`
This is a small compatibility issue, but it must be remapped before loading.
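A remap of this kind could look like the sketch below. It assumes the legacy prefix sits at the start of each key and uses placeholder strings instead of tensors; `remap_legacy_keys` and the example key names are hypothetical, not the code in `opengait/modeling/models/drf.py`.

```python
from collections import OrderedDict

LEGACY_PREFIX = "attention_layer."
CURRENT_PREFIX = "PGA."


def remap_legacy_keys(state_dict):
    """Return a copy of state_dict with attention_layer.* renamed to PGA.*."""
    remapped = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(LEGACY_PREFIX):
            key = CURRENT_PREFIX + key[len(LEGACY_PREFIX):]
        if key in remapped:
            # Guard against a checkpoint that already contains both names.
            raise KeyError(f"duplicate key after remap: {key}")
        remapped[key] = value
    return remapped


# Placeholder values stand in for tensors; these key names are illustrative only.
legacy = OrderedDict([
    ("attention_layer.fc.weight", "w"),
    ("attention_layer.fc.bias", "b"),
    ("backbone.conv1.weight", "c"),
])
remapped = remap_legacy_keys(legacy)
print(list(remapped))  # ['PGA.fc.weight', 'PGA.fc.bias', 'backbone.conv1.weight']
```

Keys outside the legacy prefix pass through unchanged, so the same function is safe to apply to checkpoints that already use `PGA.*`.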
### 4. Preprocessing/runtime-contract mismatch
The author checkpoint does not line up with the stale YAML's full runtime contract.
Most importantly, it did **not** work well with the more paper-literal local export:
- `Scoliosis1K-drf-pkl-118-paper`
It worked much better with the more OpenGait-like aligned export:
- `Scoliosis1K-drf-pkl-118-aligned`
That strongly suggests the checkpoint was trained against a preprocessing/runtime path closer to the aligned OpenGait integration than to the later local “paper-literal” summed-heatmap ablation.
## What Was Added In-Tree
The current repo now has a small compatibility layer in:
- `opengait/modeling/models/drf.py`
It does two things:
- remaps legacy keys `attention_layer.* -> PGA.*`
- supports configurable `model_cfg.label_order`
The model also canonicalizes inference logits back into the repo's evaluator order, so author checkpoints can be evaluated without modifying the evaluator itself.
## Tested Compatibility Results
### Best usable author-checkpoint path
Config:
- `configs/drf/drf_author_eval_118_aligned_1gpu.yaml`
Dataset/runtime:
- dataset root: `Scoliosis1K-drf-pkl-118-aligned`
- partition: `Scoliosis1K_118.json`
- transform: `BaseSilCuttingTransform`
- label order:
  - `negative`
  - `positive`
  - `neutral`
Result:
- `80.24 Acc / 76.73 Prec / 76.40 Rec / 76.56 F1`
This is the strongest recovered path so far.
### Verified provenance of `Scoliosis1K-drf-pkl-118-aligned`
The `118-aligned` root is no longer just an informed guess. It was verified
directly against the raw pose source:
- `/mnt/public/data/Scoliosis1K/Scoliosis1K-pose-pkl`
The matching preprocessing path is:
- `datasets/pretreatment_scoliosis_drf.py`
- default heatmap config:
  - `configs/drf/pretreatment_heatmap_drf.yaml`
- archived equivalent config:
  - `configs/drf/pretreatment_heatmap_drf_118_aligned.yaml`
That means the aligned root was produced with:
- shared `sigma: 8.0`
- `align: True`
- `final_img_size: 64`
- default `heatmap_reduction=upstream`
- no `--stats_partition`, i.e. dataset-level PAV min-max stats
Equivalent command:
```bash
uv run python datasets/pretreatment_scoliosis_drf.py \
--pose_data_path /mnt/public/data/Scoliosis1K/Scoliosis1K-pose-pkl \
--output_path /mnt/public/data/Scoliosis1K/Scoliosis1K-drf-pkl-118-aligned
```
Verification evidence:
- a regenerated `0_heatmap.pkl` sample from the raw pose input matched the stored
`Scoliosis1K-drf-pkl-118-aligned` sample exactly (`array_equal == True`)
- a full recomputation of `pav_stats.pkl` from the raw pose input matched the
stored `pav_min`, `pav_max`, and `stats_partition=None` exactly
So `118-aligned` is the old default OpenGait-style DRF export, not the later:
- `118-paper` paper-literal summed-heatmap export
- `118` train-only-stats splitroot export
- `sigma15` / `sigma15_joint8` exports
### Targeted preprocessing ablations around the recovered path
After verifying the aligned root provenance, a few focused runtime/data ablations
were tested against the author checkpoint to see which part of the contract still
mattered most.
Baseline:
- `118-aligned`
- `BaseSilCuttingTransform`
- result:
  - `80.24 Acc / 76.73 Prec / 76.40 Rec / 76.56 F1`
Hybrid 1:
- aligned heatmap + splitroot PAV
- result:
  - `77.30 Acc / 73.70 Prec / 73.04 Rec / 73.28 F1`
Hybrid 2:
- splitroot heatmap + aligned PAV
- result:
  - `80.37 Acc / 77.16 Prec / 76.48 Rec / 76.80 F1`
Runtime ablation:
- `118-aligned` + `BaseSilTransform` (`no-cut`)
- result:
  - `49.93 Acc / 50.49 Prec / 51.58 Rec / 47.75 F1`
What these ablations suggest:
- `BaseSilCuttingTransform` is necessary; `no-cut` breaks the checkpoint badly
- dataset-level PAV stats (`stats_partition=None`) matter more than the exact
aligned-vs-splitroot heatmap writer
- the heatmap export is still part of the contract, but it is no longer the
dominant remaining mismatch
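Why the choice of stats partition can matter is easiest to see on a toy example. The sketch below assumes plain min-max scaling, `(x - min) / (max - min)`, which this note does not confirm as the exact PAV normalization. The helper names and all values are invented.

```python
def minmax_stats(values):
    """Return (min, max) over a list of scalars."""
    return min(values), max(values)


def minmax_normalize(value, lo, hi):
    """Scale value into [0, 1] given the chosen stats; constant features map to 0."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


# Toy scalar feature; real PAV features are not scalars like this.
train_values = [2.0, 4.0, 6.0]
test_values = [1.0, 10.0]

# Dataset-level stats (the stats_partition=None case): train and test pooled.
dataset_lo, dataset_hi = minmax_stats(train_values + test_values)
# Train-only stats (the splitroot-style case).
train_lo, train_hi = minmax_stats(train_values)

sample = 4.0
with_dataset_stats = minmax_normalize(sample, dataset_lo, dataset_hi)
with_train_stats = minmax_normalize(sample, train_lo, train_hi)

print(round(with_dataset_stats, 4), with_train_stats)  # 0.3333 0.5
```

The same raw sample lands at a different normalized value under each set of stats, so a checkpoint trained against one convention sees shifted inputs under the other.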
### Other tested paths
`configs/drf/drf_author_eval_118_splitroot_1gpu.yaml`
- dataset root: `Scoliosis1K-drf-pkl-118`
- result:
  - `77.17 Acc / 73.61 Prec / 72.59 Rec / 72.98 F1`
`configs/drf/drf_author_eval_112_1gpu.yaml`
- dataset root: `Scoliosis1K-drf-pkl`
- partition: `Scoliosis1K_112.json`
- result:
  - `85.19 Acc / 57.98 Prec / 56.65 Rec / 57.30 F1`
`configs/drf/drf_author_eval_118_paper_1gpu.yaml`
- dataset root: `Scoliosis1K-drf-pkl-118-paper`
- transform: `BaseSilTransform`
- result:
  - `27.24 Acc / 9.08 Prec / 33.33 Rec / 14.27 F1`
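The `118-paper` numbers are arithmetically consistent with a model that emits the same class for every sample. This is an inference from the metrics, not something verified from prediction dumps. The sketch below assumes macro-averaged metrics over three classes, with never-predicted classes scored as zero precision. `macro_metrics_if_single_class` is a hypothetical helper.

```python
def macro_metrics_if_single_class(predicted_share, num_classes=3):
    """Macro Prec/Rec/F1 in percent when every sample receives the same label.

    predicted_share is the fraction of samples whose true label is the
    predicted class, which also equals the accuracy in this scenario.
    Classes that are never predicted contribute 0 to each macro average.
    """
    precision_hit = predicted_share  # true positives / all predictions
    recall_hit = 1.0                 # every sample of that class is "found"
    f1_hit = 2 * precision_hit * recall_hit / (precision_hit + recall_hit)
    return (
        100 * precision_hit / num_classes,
        100 * recall_hit / num_classes,
        100 * f1_hit / num_classes,
    )


# Accuracy from the 118-paper row, expressed as a fraction.
prec, rec, f1 = macro_metrics_if_single_class(0.2724)
print(round(prec, 2), round(rec, 2), round(f1, 2))  # 9.08 33.33 14.27
```

All three derived values match the reported row, which suggests that on this dataset root the checkpoint degenerates to a constant prediction rather than being merely inaccurate.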
## Interpretation
What these results mean:
- the checkpoint is not garbage
- the original “very bad” local eval was mostly a compatibility failure
- the largest single hidden bug was the class-order mismatch
- the author checkpoint is also sensitive to which local DRF dataset root is used
- the recovered runtime is now good enough to make the checkpoint believable, but
preprocessing alone did not recover the paper DRF headline row
What they do **not** mean:
- we have perfectly reconstructed the author's original training path
- the provided YAML is trustworthy as-is
- the paper's full DRF claim is fully reproduced here
One practical caveat on `1:1:2` vs `1:1:8` comparisons in this repo:
- local `Scoliosis1K_112.json` and `Scoliosis1K_118.json` are not the same train/test
split with only a different class ratio
- they differ substantially in membership
- so local `112` vs `118` results should not be overinterpreted as a pure
class-balance ablation unless the train/test pool is explicitly held fixed
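A membership difference of this kind can be checked with a small set comparison. The sketch assumes each partition JSON is a dict with `TRAIN_SET` and `TEST_SET` lists of ids. `membership_overlap` is a hypothetical helper and the ids are toy values, not real subject ids.

```python
def membership_overlap(partition_a, partition_b):
    """Count ids shared by, or unique to, two partitions, per split."""
    report = {}
    for split in ("TRAIN_SET", "TEST_SET"):
        a, b = set(partition_a[split]), set(partition_b[split])
        report[split] = {
            "shared": len(a & b),
            "only_a": len(a - b),
            "only_b": len(b - a),
        }
    return report


# Toy partitions with invented ids.
toy_112 = {"TRAIN_SET": ["001", "002", "003"], "TEST_SET": ["004", "005"]}
toy_118 = {"TRAIN_SET": ["001", "004", "006"], "TEST_SET": ["002", "005"]}

report = membership_overlap(toy_112, toy_118)
print(report["TRAIN_SET"])  # {'shared': 1, 'only_a': 2, 'only_b': 2}
print(report["TEST_SET"])   # {'shared': 1, 'only_a': 1, 'only_b': 1}
```

For the real files, the two dicts would come from `json.load` on `Scoliosis1K_112.json` and `Scoliosis1K_118.json`. Large `only_a` / `only_b` counts in `TEST_SET` mean the two evaluations are not scored on the same subjects.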
To support a clean same-pool comparison, the repo now also includes:
- `datasets/Scoliosis1K/Scoliosis1K_118_fixedpool_train112.json`
That partition keeps the full `118` `TEST_SET` unchanged and keeps the same
positive/neutral `TRAIN_SET` ids as `118`, but downsamples `TRAIN_SET` negatives
to `148` so the train ratio becomes `74 / 74 / 148` (`1:1:2`).
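The construction described above can be sketched as follows. How the shipped file was actually sampled (seed, ordering) is not recorded in this note, so the seeded `random.sample` is an assumption. `build_fixedpool_train`, the `label_of` lookup, and the toy negative count of 300 are all hypothetical.

```python
import random


def build_fixedpool_train(partition, label_of, num_negatives, seed=0):
    """Keep TEST_SET and non-negative train ids; downsample train negatives."""
    train = partition["TRAIN_SET"]
    keep = [sid for sid in train if label_of[sid] != "negative"]
    negatives = sorted(sid for sid in train if label_of[sid] == "negative")
    if len(negatives) < num_negatives:
        raise ValueError("not enough negatives to sample from")
    sampled = random.Random(seed).sample(negatives, num_negatives)
    return {
        "TRAIN_SET": sorted(keep + sampled),
        "TEST_SET": list(partition["TEST_SET"]),  # unchanged
    }


# Toy pool: 74 positive, 74 neutral, and an invented 300 negative train ids.
label_of = {}
for prefix, label, count in (("pos", "positive", 74), ("neu", "neutral", 74), ("neg", "negative", 300)):
    for i in range(count):
        label_of[f"{prefix}_{i:03d}"] = label

toy_118 = {"TRAIN_SET": sorted(label_of), "TEST_SET": ["test_000", "test_001"]}
fixedpool = build_fixedpool_train(toy_118, label_of, num_negatives=148)

counts = {}
for sid in fixedpool["TRAIN_SET"]:
    counts[label_of[sid]] = counts.get(label_of[sid], 0) + 1
print(counts["positive"], counts["neutral"], counts["negative"])  # 74 74 148
```

Because only train negatives are dropped, any metric change relative to the `118` run can be attributed to the class ratio rather than to a different test pool.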
The strongest recovered result:
- `80.24 / 76.73 / 76.40 / 76.56`
This is close to the paper's reported `ScoNet-MT^ske` F1 and much better than our earlier broken compat evals, but it is still below the paper's DRF headline result:
- paper DRF: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`
## Practical Recommendation
If someone wants to use the author checkpoint in this repo today, the recommended path is:
1. use `configs/drf/drf_author_eval_118_aligned_1gpu.yaml`
2. keep the author label order:
- `negative, positive, neutral`
3. keep the legacy `attention_layer -> PGA` remap in the model
4. do **not** assume the stale `112` YAML is the correct training/eval contract
If someone wants to push this further, the highest-value next step is:
- finetune from the author checkpoint on the aligned `118` path instead of starting DRF from scratch
## How To Run
Recommended eval:
```bash
CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
uv run torchrun --nproc_per_node=1 --master_port=29693 \
opengait/main.py \
--cfgs ./configs/drf/drf_author_eval_118_aligned_1gpu.yaml \
--phase test
```
Other compatibility checks:
```bash
# Stale 112 split from the author YAML
CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
uv run torchrun --nproc_per_node=1 --master_port=29695 \
opengait/main.py \
--cfgs ./configs/drf/drf_author_eval_112_1gpu.yaml \
--phase test

# 118 splitroot export (train-only PAV stats)
CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
uv run torchrun --nproc_per_node=1 --master_port=29696 \
opengait/main.py \
--cfgs ./configs/drf/drf_author_eval_118_splitroot_1gpu.yaml \
--phase test

# 118 paper-literal summed-heatmap export
CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
uv run torchrun --nproc_per_node=1 --master_port=29697 \
opengait/main.py \
--cfgs ./configs/drf/drf_author_eval_118_paper_1gpu.yaml \
--phase test
```
If someone wants to reproduce this on another machine, the usual paths to change are:
- `data_cfg.dataset_root`
- `data_cfg.dataset_partition`
- `evaluator_cfg.restore_hint`
The archived artifact bundle is:
- `artifact/scoliosis_drf_author_118_compat`