DRF Author Checkpoint Compatibility Note

This note records what happened when evaluating the author-provided DRF bundle in this repo:

  • checkpoint: artifact/scoliosis_drf_author_118_compat/DRF_118_unordered_iter2w_lr0.001_8830-08000.pt
  • config: ckpt/drf_author/drf_scoliosis1k_20000.yaml

The short version:

  • the weight file is real and structurally usable
  • the provided YAML is not a reliable source of truth
  • the main problem was integration-contract mismatch, not a broken checkpoint

What Was Wrong

The author bundle was internally inconsistent in several ways.

1. Split mismatch

The DRF paper says the main experiment uses 1:1:8, i.e. the 118 split.

But the provided YAML pointed to:

  • ./datasets/Scoliosis1K/Scoliosis1K_112.json

while the checkpoint filename itself says:

  • DRF_118_...

So the bundle already disagreed with itself.

2. Class-order mismatch

The biggest hidden bug was class ordering.

The current repo evaluator assumes:

  • negative = 0
  • neutral = 1
  • positive = 2

But the author stub in research/drf.py uses:

  • negative = 0
  • positive = 1
  • neutral = 2

That means an otherwise good checkpoint can look very bad if logits are interpreted in the wrong class order.
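To make the failure mode concrete, here is a minimal sketch of reordering one row of logits from the author order into the evaluator order. The two orders come from this note; the helper names (`canonicalize_logits`, `argmax`) and the logit values are hypothetical and are not the repo's actual API.

```python
# Class orders as described in this note.
EVALUATOR_ORDER = ["negative", "neutral", "positive"]
AUTHOR_ORDER = ["negative", "positive", "neutral"]


def canonicalize_logits(logits, source_order, target_order=EVALUATOR_ORDER):
    """Reorder one row of per-class logits from source_order into target_order.

    Hypothetical helper for illustration only.
    """
    if sorted(source_order) != sorted(target_order):
        raise ValueError("source and target orders must name the same classes")
    index_of = {name: i for i, name in enumerate(source_order)}
    return [logits[index_of[name]] for name in target_order]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits in AUTHOR order; index 1 ("positive") is the largest.
author_logits = [0.1, 2.5, -0.3]

# Reading the raw row with the evaluator's order mislabels the sample.
naive_pred = EVALUATOR_ORDER[argmax(author_logits)]

# Reordering first recovers the label the checkpoint actually predicted.
fixed_pred = EVALUATOR_ORDER[argmax(canonicalize_logits(author_logits, AUTHOR_ORDER))]

print(naive_pred, fixed_pred)
```

With these two orders only the neutral and positive slots are swapped, so negative predictions survive the misreading while every neutral or positive prediction is mislabeled.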

3. Legacy module-name mismatch

The author checkpoint stores PGA weights under:

  • attention_layer.*

The current repo uses:

  • PGA.*

This is a small compatibility issue, but it must be remapped before loading.
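A minimal sketch of such a remap, applied to a plain dictionary standing in for a checkpoint state dict. Only the two prefixes come from this note; the function name and the example keys are hypothetical, and the sketch assumes the legacy prefix sits at the start of each key (a real checkpoint may nest it under another prefix).

```python
from collections import OrderedDict

LEGACY_PREFIX = "attention_layer."
CURRENT_PREFIX = "PGA."


def remap_legacy_keys(state_dict):
    """Return a copy with 'attention_layer.*' keys renamed to 'PGA.*'.

    Hypothetical helper; all other keys are passed through unchanged.
    """
    remapped = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(LEGACY_PREFIX):
            key = CURRENT_PREFIX + key[len(LEGACY_PREFIX):]
        if key in remapped:
            # Guard against a checkpoint that contains both spellings.
            raise KeyError(f"duplicate key after remap: {key}")
        remapped[key] = value
    return remapped


# Example keys are invented for illustration.
legacy = OrderedDict(
    [
        ("backbone.conv1.weight", "w0"),
        ("attention_layer.fc.weight", "w1"),
        ("attention_layer.fc.bias", "b1"),
    ]
)
print(list(remap_legacy_keys(legacy)))
```

Doing the rename on the loaded dictionary, before it reaches the model, leaves the checkpoint file on disk untouched.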

4. Preprocessing/runtime-contract mismatch

The author checkpoint does not line up with the stale YAML's full runtime contract.

Most importantly, it did not work well with the more paper-literal local export:

  • Scoliosis1K-drf-pkl-118-paper

It worked much better with the more OpenGait-like aligned export:

  • Scoliosis1K-drf-pkl-118-aligned

That strongly suggests the checkpoint was trained against a preprocessing/runtime path closer to the aligned OpenGait integration than to the later local “paper-literal” summed-heatmap ablation.

What Was Added In-Tree

The current repo now has a small compatibility layer in:

  • opengait/modeling/models/drf.py

It does two things:

  • remaps legacy keys attention_layer.* -> PGA.*
  • supports configurable model_cfg.label_order

The model also canonicalizes inference logits back into the repo's evaluator order, so author checkpoints can be evaluated without modifying the evaluator itself.

Tested Compatibility Results

Best usable author-checkpoint path

Config:

  • configs/drf/drf_author_eval_118_aligned_1gpu.yaml

Dataset/runtime:

  • dataset root: Scoliosis1K-drf-pkl-118-aligned
  • partition: Scoliosis1K_118.json
  • transform: BaseSilCuttingTransform
  • label order:
    • negative
    • positive
    • neutral

Result:

  • 80.24 Acc / 76.73 Prec / 76.40 Rec / 76.56 F1

This is the strongest recovered path so far.

Verified provenance of Scoliosis1K-drf-pkl-118-aligned

The 118-aligned root is no longer just an informed guess. It was verified directly against the raw pose source:

  • /mnt/public/data/Scoliosis1K/Scoliosis1K-pose-pkl

The matching preprocessing path is:

  • datasets/pretreatment_scoliosis_drf.py
  • default heatmap config:
    • configs/drf/pretreatment_heatmap_drf.yaml
  • archived equivalent config:
    • configs/drf/pretreatment_heatmap_drf_118_aligned.yaml

That means the aligned root was produced with:

  • shared sigma: 8.0
  • align: True
  • final_img_size: 64
  • default heatmap_reduction=upstream
  • no --stats_partition, i.e. dataset-level PAV min-max stats

Equivalent command:

uv run python datasets/pretreatment_scoliosis_drf.py \
  --pose_data_path /mnt/public/data/Scoliosis1K/Scoliosis1K-pose-pkl \
  --output_path /mnt/public/data/Scoliosis1K/Scoliosis1K-drf-pkl-118-aligned

Verification evidence:

  • a regenerated 0_heatmap.pkl sample from the raw pose input matched the stored Scoliosis1K-drf-pkl-118-aligned sample exactly (array_equal == True)
  • a full recomputation of pav_stats.pkl from the raw pose input matched the stored pav_min, pav_max, and stats_partition=None exactly

So 118-aligned is the old default OpenGait-style DRF export, not the later:

  • 118-paper paper-literal summed-heatmap export
  • 118 train-only-stats splitroot export
  • sigma15 / sigma15_joint8 exports

Targeted preprocessing ablations around the recovered path

After verifying the aligned root provenance, a few focused runtime/data ablations were tested against the author checkpoint to see which part of the contract still mattered most.

Baseline:

  • 118-aligned
  • BaseSilCuttingTransform
  • result:
    • 80.24 Acc / 76.73 Prec / 76.40 Rec / 76.56 F1

Hybrid 1:

  • aligned heatmap + splitroot PAV
  • result:
    • 77.30 Acc / 73.70 Prec / 73.04 Rec / 73.28 F1

Hybrid 2:

  • splitroot heatmap + aligned PAV
  • result:
    • 80.37 Acc / 77.16 Prec / 76.48 Rec / 76.80 F1

Runtime ablation:

  • 118-aligned + BaseSilTransform (no-cut)
  • result:
    • 49.93 Acc / 50.49 Prec / 51.58 Rec / 47.75 F1

What these ablations suggest:

  • BaseSilCuttingTransform is necessary; no-cut breaks the checkpoint badly
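  • dataset-level PAV stats (stats_partition=None) matter more than the exact aligned-vs-splitroot heatmap writer: swapping in splitroot PAV (Hybrid 1) drops F1 from 76.56 to 73.28, while swapping in the splitroot heatmap (Hybrid 2) leaves it at 76.80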
  • the heatmap export is still part of the contract, but it is no longer the dominant remaining mismatch
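The difference between dataset-level and train-only min-max stats can be sketched on toy data. This assumes, without confirmation from the preprocessing script, that each PAV is a fixed-length vector normalized per dimension; the function names, sample ids, and values are all hypothetical.

```python
def min_max_stats(vectors):
    """Per-dimension (min, max) over a list of equal-length vectors."""
    columns = list(zip(*vectors))
    return [min(c) for c in columns], [max(c) for c in columns]


def normalize(vector, lo, hi, eps=1e-8):
    """Min-max normalize one vector with the given per-dimension stats."""
    return [(v - l) / (h - l + eps) for v, l, h in zip(vector, lo, hi)]


# Toy two-dimensional "PAV" vectors; not real data.
train = {"id_a": [0.0, 10.0], "id_b": [2.0, 20.0]}
test = {"id_c": [4.0, 5.0]}

# Dataset-level stats: every sequence contributes (no stats partition).
all_lo, all_hi = min_max_stats(list(train.values()) + list(test.values()))

# Train-only stats: test sequences are excluded from the min/max.
train_lo, train_hi = min_max_stats(list(train.values()))

dataset_level = normalize(test["id_c"], all_lo, all_hi)
train_only = normalize(test["id_c"], train_lo, train_hi)
print(dataset_level)  # stays inside [0, 1]
print(train_only)     # can fall outside [0, 1]
```

If a checkpoint was trained on inputs normalized one way, feeding it inputs normalized the other way shifts the PAV value range it sees, which is one plausible reading of the Hybrid 1 drop.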

Other tested paths

configs/drf/drf_author_eval_118_splitroot_1gpu.yaml

  • dataset root: Scoliosis1K-drf-pkl-118
  • result:
    • 77.17 Acc / 73.61 Prec / 72.59 Rec / 72.98 F1

configs/drf/drf_author_eval_112_1gpu.yaml

  • dataset root: Scoliosis1K-drf-pkl
  • partition: Scoliosis1K_112.json
  • result:
    • 85.19 Acc / 57.98 Prec / 56.65 Rec / 57.30 F1

configs/drf/drf_author_eval_118_paper_1gpu.yaml

  • dataset root: Scoliosis1K-drf-pkl-118-paper
  • transform: BaseSilTransform
  • result:
    • 27.24 Acc / 9.08 Prec / 33.33 Rec / 14.27 F1
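The 118-paper numbers have the arithmetic signature of a model that outputs one class for every test sample. The sketch below computes macro metrics for that degenerate case, assuming three classes and assuming never-predicted classes count as 0 in the macro averages; the function name is hypothetical, and the only input taken from this note is the 27.24 accuracy. The match is consistent with a prediction collapse but does not by itself prove one, and it does not say which class is being predicted.

```python
def constant_prediction_macro_metrics(predicted_share, n_classes=3):
    """Macro Acc/Prec/Rec/F1 (percent) when every sample gets the same label.

    predicted_share: fraction of test samples that truly belong to the
    single predicted class. Classes that are never predicted contribute
    0 precision, 0 recall, and 0 F1 to the macro averages.
    """
    accuracy = predicted_share
    precision = predicted_share  # correct predictions / all predictions
    recall = 1.0                 # every member of the class is retrieved
    f1 = 2 * precision * recall / (precision + recall)
    macro = (accuracy, precision / n_classes, recall / n_classes, f1 / n_classes)
    return tuple(round(100 * value, 2) for value in macro)


acc, prec, rec, f1 = constant_prediction_macro_metrics(0.2724)
print(acc, prec, rec, f1)
```

Under these assumptions the function returns 27.24 / 9.08 / 33.33 / 14.27, the same four values reported for this config.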

Interpretation

What these results mean:

  • the checkpoint is not garbage
  • the original “very bad” local eval was mostly a compatibility failure
  • the largest single hidden bug was the class-order mismatch
  • the author checkpoint is also sensitive to which local DRF dataset root is used
  • the recovered runtime is now good enough to make the checkpoint believable, but preprocessing alone did not recover the paper DRF headline row

What they do not mean:

  • we have perfectly reconstructed the author's original training path
  • the provided YAML is trustworthy as-is
  • the paper's full DRF claim is fully reproduced here

One practical caveat on 1:1:2 vs 1:1:8 comparisons in this repo:

  • local Scoliosis1K_112.json and Scoliosis1K_118.json are not the same train/test split with only a different class ratio
  • they differ substantially in membership
  • so local 112 vs 118 results should not be overinterpreted as a pure class-balance ablation unless the train/test pool is explicitly held fixed

To support a clean same-pool comparison, the repo now also includes:

  • datasets/Scoliosis1K/Scoliosis1K_118_fixedpool_train112.json

That partition keeps the full 118 TEST_SET unchanged and keeps the same positive/neutral TRAIN_SET ids as 118, but downsamples TRAIN_SET negatives to 148 so the train ratio becomes 74 / 74 / 148 (1:1:2).
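The construction described above can be sketched as follows. The TRAIN_SET and TEST_SET keys and the target of 148 negatives come from this note; the function name, the label lookup, the seeded random sampling, and the toy ids are assumptions, since the note does not say how the negatives in the real file were selected.

```python
import random


def build_fixedpool_partition(partition, label_of, n_negative=148, seed=0):
    """Downsample TRAIN_SET negatives; leave TEST_SET and other classes as-is.

    partition: dict with "TRAIN_SET" and "TEST_SET" id lists.
    label_of: hypothetical mapping from id to class name.
    """
    train = partition["TRAIN_SET"]
    negatives = [sid for sid in train if label_of[sid] == "negative"]
    if len(negatives) < n_negative:
        raise ValueError("not enough negative ids to sample from")
    kept = set(random.Random(seed).sample(negatives, n_negative))
    # Preserve the original ordering of the ids that survive.
    new_train = [
        sid for sid in train if label_of[sid] != "negative" or sid in kept
    ]
    return {"TRAIN_SET": new_train, "TEST_SET": list(partition["TEST_SET"])}


# Toy ids sized to a 1:1:8 train ratio; not the real dataset ids.
label_of = {}
for name, count in (("positive", 74), ("neutral", 74), ("negative", 592)):
    for i in range(count):
        label_of[f"{name}_{i:04d}"] = name
toy = {"TRAIN_SET": sorted(label_of), "TEST_SET": ["test_0001", "test_0002"]}
label_of.update({"test_0001": "negative", "test_0002": "positive"})

fixed = build_fixedpool_partition(toy, label_of)
counts = {}
for sid in fixed["TRAIN_SET"]:
    counts[label_of[sid]] = counts.get(label_of[sid], 0) + 1
print(counts)
```

Because only the negative training ids change, any metric difference against the full 118 partition can be attributed to the train ratio rather than to a different test pool.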

The strongest recovered result:

  • 80.24 / 76.73 / 76.40 / 76.56

This is close to the paper's reported ScoNet-MT^ske F1 and much better than our earlier broken compat evals, but it is still below the paper's DRF headline result:

  • paper DRF: 86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1

Practical Recommendation

If someone wants to use the author checkpoint in this repo today, the recommended path is:

  1. use configs/drf/drf_author_eval_118_aligned_1gpu.yaml
  2. keep the author label order:
    • negative, positive, neutral
  3. keep the legacy attention_layer -> PGA remap in the model
  4. do not assume the stale 112 YAML is the correct training/eval contract

If someone wants to push this further, the highest-value next step is:

  • finetune from the author checkpoint on the aligned 118 path instead of starting DRF from scratch

How To Run

Recommended eval:

CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
uv run torchrun --nproc_per_node=1 --master_port=29693 \
  opengait/main.py \
  --cfgs ./configs/drf/drf_author_eval_118_aligned_1gpu.yaml \
  --phase test

Other compatibility checks:

CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
uv run torchrun --nproc_per_node=1 --master_port=29695 \
  opengait/main.py \
  --cfgs ./configs/drf/drf_author_eval_112_1gpu.yaml \
  --phase test

CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
uv run torchrun --nproc_per_node=1 --master_port=29696 \
  opengait/main.py \
  --cfgs ./configs/drf/drf_author_eval_118_splitroot_1gpu.yaml \
  --phase test

CUDA_VISIBLE_DEVICES=GPU-9cc7b26e-90d4-0c49-4d4c-060e528ffba6 \
uv run torchrun --nproc_per_node=1 --master_port=29697 \
  opengait/main.py \
  --cfgs ./configs/drf/drf_author_eval_118_paper_1gpu.yaml \
  --phase test

If someone wants to reproduce this on another machine, the usual paths to change are:

  • data_cfg.dataset_root
  • data_cfg.dataset_partition
  • evaluator_cfg.restore_hint

The archived artifact bundle is:

  • artifact/scoliosis_drf_author_118_compat