feat: add drf author checkpoint compatibility bundle

2026-03-14 17:12:27 +08:00
parent d4e2a59ad2
commit 5f98844aff
18 changed files with 1144 additions and 8 deletions
@@ -0,0 +1,192 @@
+# DRF Author Checkpoint Compatibility Note
+
+This note records what we found when evaluating the author-provided DRF bundle in this repo:
+
+- checkpoint: `artifact/scoliosis_drf_author_118_compat/DRF_118_unordered_iter2w_lr0.001_8830-08000.pt`
+- config: `ckpt/drf_author/drf_scoliosis1k_20000.yaml`
+
+The short version:
+- the weight file is real and structurally usable
+- the provided YAML is not a reliable source of truth
+- the main problem was a mismatch between the bundle and this repo's integration contract (split, class order, module names, preprocessing), not a broken checkpoint
+
+## What Was Wrong
+
+The author bundle was internally inconsistent in several ways.
+
+### 1. Split mismatch
+
+The DRF paper says the main experiment uses `1:1:8`, i.e. the `118` split.
+
+But the provided YAML pointed to:
+- `./datasets/Scoliosis1K/Scoliosis1K_112.json`
+
+while the checkpoint filename itself says:
+- `DRF_118_...`
+
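+So the bundle disagreed with itself before any code was run: the YAML pointed at the `112` partition, while the paper and the checkpoint filename both indicate `118`.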
+
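A mismatch like this can be caught mechanically before any evaluation runs. The sketch below is illustrative and not part of the repo: `split_tag` and `check_bundle_consistency` are hypothetical helpers, and the assumption that the split appears as a three-digit token after an underscore is inferred only from the two names quoted above.

```python
import re
from typing import Optional


def split_tag(name: str) -> Optional[str]:
    """Return the first three-digit token that follows an underscore, if any.

    Assumption: the split is encoded as e.g. "_118_" or "_112.json".
    """
    match = re.search(r"_(\d{3})(?=[_.])", name)
    return match.group(1) if match else None


def check_bundle_consistency(checkpoint_name: str, partition_name: str) -> bool:
    """True only when both names carry the same split tag."""
    ckpt_tag = split_tag(checkpoint_name)
    part_tag = split_tag(partition_name)
    return ckpt_tag is not None and ckpt_tag == part_tag


checkpoint = "DRF_118_unordered_iter2w_lr0.001_8830-08000.pt"
stale_partition = "Scoliosis1K_112.json"
fixed_partition = "Scoliosis1K_118.json"

print(check_bundle_consistency(checkpoint, stale_partition))
print(check_bundle_consistency(checkpoint, fixed_partition))
```

With the names from this bundle, the first check fails and the second passes, which is the same disagreement described above.
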
+### 2. Class-order mismatch
+
+The biggest hidden bug was class ordering.
+
+The current repo evaluator assumes:
+- `negative = 0`
+- `neutral = 1`
+- `positive = 2`
+
+But the author stub in `research/drf.py` uses:
+- `negative = 0`
+- `positive = 1`
+- `neutral = 2`
+
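+Indices 1 and 2 are swapped between the two conventions, so an otherwise good checkpoint can look very bad when its logits are read in the wrong class order: each `positive` prediction is scored as `neutral`, and each `neutral` prediction as `positive`.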
+
+### 3. Legacy module-name mismatch
+
+The author checkpoint stores PGA weights under:
+- `attention_layer.*`
+
+The current repo uses:
+- `PGA.*`
+
+This is a small compatibility issue, but the keys must be remapped before the checkpoint is loaded.
+
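A minimal sketch of such a remap, assuming the legacy prefix sits at the very start of each key. This is not the code in `opengait/modeling/models/drf.py`; `remap_legacy_keys` is a hypothetical helper, and the key names in the toy dictionary (`fc.weight`, `Backbone.conv1.weight`) are made up for illustration.

```python
from collections import OrderedDict
from typing import Any, Mapping

LEGACY_PREFIX = "attention_layer."
CURRENT_PREFIX = "PGA."


def remap_legacy_keys(state_dict: Mapping[str, Any]) -> "OrderedDict[str, Any]":
    """Return a copy of state_dict with the legacy PGA prefix renamed."""
    remapped = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(LEGACY_PREFIX):
            key = CURRENT_PREFIX + key[len(LEGACY_PREFIX):]
        remapped[key] = value
    return remapped


# Toy stand-in for a checkpoint; real values would be tensors.
legacy = OrderedDict([
    ("attention_layer.fc.weight", "w"),
    ("attention_layer.fc.bias", "b"),
    ("Backbone.conv1.weight", "c"),
])
print(list(remap_legacy_keys(legacy)))
```

The function returns a new mapping and leaves the input untouched, so the original checkpoint contents stay available for inspection. If a checkpoint wraps its keys in a further prefix, the `startswith` test would need adjusting.
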
+### 4. Preprocessing/runtime-contract mismatch
+
+The author checkpoint does not line up with the stale YAML’s full runtime contract.
+
+Most importantly, it did **not** work well with the more paper-literal local export:
+- `Scoliosis1K-drf-pkl-118-paper`
+
+It worked much better with the more OpenGait-like aligned export:
+- `Scoliosis1K-drf-pkl-118-aligned`
+
+That strongly suggests the checkpoint was trained against a preprocessing/runtime path closer to the aligned OpenGait integration than to the later local “paper-literal” summed-heatmap ablation.
+
+## What Was Added In-Tree
+
+The current repo now has a small compatibility layer in:
+- `opengait/modeling/models/drf.py`
+
+It does two things:
+- remaps legacy keys `attention_layer.* -> PGA.*`
+- supports configurable `model_cfg.label_order`
+
+The model also canonicalizes inference logits back into the repo’s evaluator order, so author checkpoints can be evaluated without modifying the evaluator itself.
+
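The canonicalization step can be pictured as a fixed permutation of the class dimension. The sketch below uses plain Python lists and hypothetical names (`canonical_permutation`, `canonicalize`); it is not the in-tree implementation. On a batch tensor, the same permutation would presumably be applied by indexing the class dimension.

```python
from typing import List, Sequence

EVALUATOR_ORDER = ["negative", "neutral", "positive"]
AUTHOR_ORDER = ["negative", "positive", "neutral"]


def canonical_permutation(model_order: Sequence[str],
                          target_order: Sequence[str] = EVALUATOR_ORDER) -> List[int]:
    """For each target class, the index of that class in the model's output."""
    if sorted(model_order) != sorted(target_order):
        raise ValueError("label orders must contain the same class names")
    return [list(model_order).index(name) for name in target_order]


def canonicalize(logits: Sequence[float], perm: Sequence[int]) -> List[float]:
    """Reorder one row of logits into the target class order."""
    return [logits[i] for i in perm]


perm = canonical_permutation(AUTHOR_ORDER)
author_logits = [0.1, 2.5, 0.3]  # highest score on the author's index 1 ("positive")
fixed = canonicalize(author_logits, perm)

# Reading the raw logits in evaluator order mislabels the prediction.
naive_pred = EVALUATOR_ORDER[author_logits.index(max(author_logits))]
fixed_pred = EVALUATOR_ORDER[fixed.index(max(fixed))]
print(perm, naive_pred, fixed_pred)
```

In this toy example, the uncorrected read reports `neutral` for a sample the model scored as `positive`, while the permuted logits give `positive`.
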
+## Tested Compatibility Results
+
+### Best usable author-checkpoint path
+
+Config:
+- `configs/drf/drf_author_eval_118_aligned_1gpu.yaml`
+
+Dataset/runtime:
+- dataset root: `Scoliosis1K-drf-pkl-118-aligned`
+- partition: `Scoliosis1K_118.json`
+- transform: `BaseSilCuttingTransform`
+- label order:
+  - `negative`
+  - `positive`
+  - `neutral`
+
+Result:
+- `80.24 Acc / 76.73 Prec / 76.40 Rec / 76.56 F1`
+
+This is the strongest recovered path so far.
+
+### Other tested paths
+
+`configs/drf/drf_author_eval_118_splitroot_1gpu.yaml`
+- dataset root: `Scoliosis1K-drf-pkl-118`
+- result:
+  - `77.17 Acc / 73.61 Prec / 72.59 Rec / 72.98 F1`
+
+`configs/drf/drf_author_eval_112_1gpu.yaml`
+- dataset root: `Scoliosis1K-drf-pkl`
+- partition: `Scoliosis1K_112.json`
+- result:
+  - `85.19 Acc / 57.98 Prec / 56.65 Rec / 57.30 F1`
+
+`configs/drf/drf_author_eval_118_paper_1gpu.yaml`
+- dataset root: `Scoliosis1K-drf-pkl-118-paper`
+- transform: `BaseSilTransform`
+- result:
+  - `27.24 Acc / 9.08 Prec / 33.33 Rec / 14.27 F1`
+
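The last row has a recognizable shape: a macro recall of exactly one third, and a macro precision equal to one third of the accuracy, is what three-class macro metrics produce when every sample is assigned to a single class. This is an inference from the reported numbers, not something verified by inspecting predictions. The sketch below works through the arithmetic; `collapsed_macro_metrics` is a hypothetical helper, and it assumes precision for never-predicted classes is counted as 0.

```python
from typing import Tuple


def collapsed_macro_metrics(share: float, num_classes: int = 3) -> Tuple[float, ...]:
    """Macro metrics (in percent) when every sample is assigned to one class.

    share: fraction of test samples that truly belong to the predicted class.
    Assumption: precision for never-predicted classes is counted as 0.
    """
    accuracy = share
    precision = share / num_classes         # (share + 0 + 0) / num_classes
    recall = 1.0 / num_classes              # (1 + 0 + 0) / num_classes
    f1_predicted = 2 * share / (1 + share)  # F1 of the single predicted class
    f1 = f1_predicted / num_classes         # the other classes contribute 0
    return tuple(round(100 * x, 2) for x in (accuracy, precision, recall, f1))


print(collapsed_macro_metrics(0.2724))
```

With `share = 0.2724` the function returns `27.24 / 9.08 / 33.33 / 14.27`, matching the reported row. If that reading is right, the `paper` export path is not merely worse but drives the checkpoint to a constant prediction.
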
+## Interpretation
+
+What these results mean:
+
+- the checkpoint is not garbage
+- the original “very bad” local eval was mostly a compatibility failure
+- the largest single hidden bug was the class-order mismatch
+- the author checkpoint is also sensitive to which local DRF dataset root is used
+
+What they do **not** mean:
+
+- we have perfectly reconstructed the author’s original training path
+- the provided YAML is trustworthy as-is
+- the paper’s full DRF claim is fully reproduced here
+
+The strongest recovered result:
+- `80.24 / 76.73 / 76.40 / 76.56`
+
+This is close to the paper’s reported `ScoNet-MT^ske` F1 and much better than our earlier broken compat evals, but it is still below the paper’s DRF headline result:
+- paper DRF: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`
+
+## Practical Recommendation
+
+If someone wants to use the author checkpoint in this repo today, the recommended path is:
+
+1. use `configs/drf/drf_author_eval_118_aligned_1gpu.yaml`
+2. keep the author label order:
+   - `negative, positive, neutral`
+3. keep the legacy `attention_layer -> PGA` remap in the model
+4. do **not** assume the stale `112` YAML is the correct training/eval contract
+
+If someone wants to push this further, the highest-value next step is:
+- finetune from the author checkpoint on the aligned `118` path instead of starting DRF from scratch
+
+## How To Run
+
+Recommended eval:
+
+```bash
+CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
+uv run torchrun --nproc_per_node=1 --master_port=29693 \
+  opengait/main.py \
+  --cfgs ./configs/drf/drf_author_eval_118_aligned_1gpu.yaml \
+  --phase test
+```
+
+Other compatibility checks:
+
+```bash
+CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
+uv run torchrun --nproc_per_node=1 --master_port=29695 \
+  opengait/main.py \
+  --cfgs ./configs/drf/drf_author_eval_112_1gpu.yaml \
+  --phase test
+
+CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
+uv run torchrun --nproc_per_node=1 --master_port=29696 \
+  opengait/main.py \
+  --cfgs ./configs/drf/drf_author_eval_118_splitroot_1gpu.yaml \
+  --phase test
+
+CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
+uv run torchrun --nproc_per_node=1 --master_port=29697 \
+  opengait/main.py \
+  --cfgs ./configs/drf/drf_author_eval_118_paper_1gpu.yaml \
+  --phase test
+```
+
+To reproduce this on another machine, the usual settings to change are:
+- `CUDA_VISIBLE_DEVICES` (the GPU UUID in the commands above is specific to the original machine)
+- `data_cfg.dataset_root`
+- `data_cfg.dataset_partition`
+- `evaluator_cfg.restore_hint`
+
+The archived artifact bundle is:
+- `artifact/scoliosis_drf_author_118_compat`