OpenGait/docs/scoliosis_reproducibility_audit.md

Scoliosis1K Reproducibility Audit

This note is the current audit of what is and is not reproducible in the Scoliosis1K ScoNet/DRF work inside this repo.

It separates three questions that should not be mixed together:

  • is the repo/training stack itself working?
  • is the paper-level DRF claim reproducible?
  • what is the strongest practical model we have right now?

Related notes:

Primary references:

Executive Summary

Current audit conclusion:

  • the repo and training stack are working
  • the skeleton branch is learnable
  • the published DRF result is still not independently reproducible here
  • the best practical model in this repo is currently not DRF

Current practical winner:

  • model family: ScoNet-MT-ske
  • split: 1:1:2
  • representation: body-only
  • loss: plain CE + triplet
  • optimizer path: later AdamW cosine finetune
  • verified best retained checkpoint:
    • 92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1

That means:

  • practical model development is in a good state
  • paper-faithful DRF reproduction is not
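As a quick reference, the winning recipe can be captured in one place. The dict below is a hypothetical summary, not the repo's actual config schema; every key name is illustrative:

```python
# Hypothetical summary of the current practical winner recipe.
# Key names are illustrative, not the repo's real config schema.
winner_recipe = {
    "model": "ScoNet-MT-ske",
    "split": "1:1:2",
    "representation": "body-only",
    "loss": ["cross_entropy", "triplet"],  # plain CE + triplet, unweighted
    "schedule": [
        {"phase": "baseline", "optimizer": "SGD"},
        {"phase": "finetune", "optimizer": "AdamW", "lr_schedule": "cosine"},
    ],
}

def describe(recipe):
    """Render the recipe as a one-line human-readable summary."""
    phases = " -> ".join(p["optimizer"] for p in recipe["schedule"])
    return (f"{recipe['model']} on {recipe['split']} "
            f"({recipe['representation']}), {phases}")

print(describe(winner_recipe))
```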

What We Can Say With High Confidence

1. The core training and evaluation stack works

Evidence:

  • the silhouette ScoNet path behaves sensibly
  • train/eval loops, checkpointing, resume, and standalone eval all work
  • the strong practical skeleton result is reproducible from a saved checkpoint

Conclusion:

  • the repo is not globally broken
  • the main remaining issues are method-specific, not infrastructure-wide

2. The raw Scoliosis1K pose data is usable

Evidence:

  • earlier dataset analysis showed high pose confidence and stable sequence lengths
  • the skeleton branch eventually learns well on the easier practical split

Conclusion:

  • the raw pose source is not the main blocker

3. The skeleton branch is learnable

Evidence:

  • on 1:1:2, the skeleton branch moved from poor early baselines to a strong retained checkpoint family
  • the best verified practical result is:
    • 92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1

Conclusion:

  • the skeleton path is not dead
  • earlier failures on harder settings were not proof that skeleton input is fundamentally wrong

What Is Reproducible

4. The silhouette ScoNet path is reproducible enough to trust

Evidence:

  • the silhouette pipeline and trainer behave consistently
  • strong silhouette checkpoints evaluate sensibly on their intended split family

Conclusion:

  • silhouette ScoNet remains a valid sanity anchor for this repo

5. The high-level DRF idea is reproducible

Evidence:

  • the DRF paper defines the method as skeleton map + PAV + PGA
  • this repo now contains:
    • a DRF model
    • DRF-specific preprocessing
    • PAV generation
    • PGA integration

Conclusion:

  • the architecture-level idea is implementable
  • a plausible DRF implementation exists locally
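To make the architecture-level claim concrete, here is a minimal late-fusion sketch of the DRF idea: a skeleton-branch embedding concatenated with a PAV prior vector and projected to logits. All dimensions and names are assumptions, PGA attention is deliberately omitted, and this is not the repo's actual DRF wiring:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_drf(skeleton_embedding, pav_prior, w_fuse):
    """Minimal late-fusion sketch: concatenate the skeleton-branch
    embedding with the PAV prior vector and project to class logits.
    (The real DRF model also applies PGA attention; omitted here.)"""
    joint = np.concatenate([skeleton_embedding, pav_prior])
    return w_fuse @ joint  # shape: (num_classes,)

skel = rng.standard_normal(128)          # assumed skeleton embedding size
pav = rng.standard_normal(16)            # assumed PAV prior vector size
w = rng.standard_normal((2, 128 + 16))   # assumed binary screening head

logits = fuse_drf(skel, pav, w)
print(logits.shape)  # (2,)
```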

6. The PAV concept is reproducible

Evidence:

  • PAV metrics were implemented and produced stable sequence-level signals
  • the earlier analysis showed PAV still carries useful class signal

Conclusion:

  • “we could not build the clinical prior” is not the main explanation anymore
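For intuition only, one plausible sequence-level signal in the spirit of PAV is a normalized left/right asymmetry averaged over frames. The keypoint names and the specific metric below are illustrative assumptions, not the repo's actual PAV definition:

```python
def shoulder_asymmetry(frames):
    """One plausible sequence-level asymmetry signal in the spirit of PAV:
    mean left/right shoulder height difference, normalized by torso length.
    Each frame is a dict of (x, y) keypoints; names are assumptions."""
    values = []
    for f in frames:
        torso = abs(f["neck"][1] - f["hip_center"][1]) or 1.0
        values.append(abs(f["l_shoulder"][1] - f["r_shoulder"][1]) / torso)
    return sum(values) / len(values)

# Toy sequence: right shoulder consistently 5 units lower than left.
seq = [
    {"neck": (0, 100), "hip_center": (0, 0),
     "l_shoulder": (-10, 95), "r_shoulder": (10, 90)}
    for _ in range(4)
]
print(round(shoulder_asymmetry(seq), 3))  # 0.05
```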

What Is Only Partially Reproducible

7. The skeleton-map branch is only partially specified by the papers

The papers define the representation conceptually, but not enough for quantitative reproduction from text alone.

Missing or under-specified details included:

  • numeric Gaussian widths
  • joint-vs-limb relative weighting
  • crop and alignment policy
  • resize/padding policy
  • quantization/dtype behavior
  • runtime transform assumptions

Why that matters:

  • small rasterization and alignment changes moved results a lot
  • many early failures came from details the paper did not pin down tightly enough

Conclusion:

  • the paper gives the representation idea
  • it does not fully specify the winning implementation
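The kind of detail that is missing can be made concrete with a minimal joint-rasterization sketch. The `sigma` (Gaussian width) and `amplitude` (joint-vs-limb relative weighting) parameters below are exactly the knobs the papers leave unpinned; the grid size and coordinate handling stand in for the crop/resize policies that were also unspecified:

```python
import math

def rasterize_joint(h, w, cx, cy, sigma, amplitude=1.0):
    """Render one joint as a 2D Gaussian onto an h-by-w grid.
    sigma and amplitude are the under-specified knobs: small changes
    to rasterization details like these moved results a lot."""
    return [[amplitude * math.exp(-((x - cx) ** 2 + (y - cy) ** 2)
                                  / (2.0 * sigma ** 2))
             for x in range(w)] for y in range(h)]

heat = rasterize_joint(h=8, w=8, cx=4.0, cy=4.0, sigma=1.0)
print(round(heat[4][4], 3))  # peak value at the joint center
```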

8. The visualization story is only partially reproducible

The papers cite visualization families, but do not specify enough extraction detail to reproduce the exact figures cleanly.

Missing detail included:

  • which layer
  • before or after temporal pooling
  • exact normalization/rendering procedure

Conclusion:

  • local visualization work is useful for debugging
  • it should not be treated as paper-faithful evidence
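A minimal normalization step for rendering an activation map might look like the sketch below; which layer to tap, and whether to pool over time first, remains the part the papers do not specify:

```python
def normalize_for_display(fmap):
    """Min-max normalize a 2D activation map to [0, 1] for rendering.
    Which layer the map comes from, and whether temporal pooling was
    applied first, is exactly what the papers leave unspecified."""
    flat = [v for row in fmap for v in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # guard against a constant map
    return [[(v - lo) / scale for v in row] for row in fmap]

norm = normalize_for_display([[0.0, 2.0], [4.0, 8.0]])
print(norm)  # [[0.0, 0.25], [0.5, 1.0]]
```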

What Is Not Reproduced

9. The published DRF quantitative claim is not reproduced here

Paper-side numbers:

  • ScoNet-MT-ske: 82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1
  • DRF: 86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1

Local practical DRF result on the current workable 1:1:2 path:

  • best retained DRF checkpoint (2000) full test:
    • 80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1

Conclusion:

  • DRF currently loses to the stronger plain skeleton baseline in this repo
  • the published DRF advantage is not established locally

10. The paper-level 1:1:8 skeleton story is not reproduced here

What happened locally:

  • 1:1:8 skeleton runs remained much weaker and more unstable
  • the stronger practical result came from moving to 1:1:2

Conclusion:

  • the hard 1:1:8 regime is still unresolved here
  • current local evidence says class distribution is a major part of the failure mode

11. Representation findings

Current local ranking:

  • body-only is the best practical representation so far
  • head-lite helped some small proxy runs but did not transfer to the full test set
  • full-body has not yet beaten the best body-only checkpoint family

Concrete comparison:

  • body-only + plain CE full test at 7000:
    • 83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1
  • head-lite + plain CE full test at 7000:
    • 78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1

Conclusion:

  • the stable useful signal is mainly torso-centered
  • adding limited head information did not improve full-test performance
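A body-only filter is essentially a keypoint subset. The sketch below assumes COCO-17 ordering, where indices 0-4 are the head keypoints (nose, eyes, ears); the repo's actual index map may differ:

```python
# Hypothetical body-only filter assuming COCO-17 keypoint ordering,
# where indices 0-4 are nose/eyes/ears. The repo's index map may differ.
HEAD_INDICES = {0, 1, 2, 3, 4}

def body_only(keypoints):
    """Drop head keypoints from one frame's (x, y, conf) keypoint list."""
    return [kp for i, kp in enumerate(keypoints) if i not in HEAD_INDICES]

frame = [(float(i), float(i), 1.0) for i in range(17)]
print(len(body_only(frame)))  # 12 body keypoints remain
```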

12. Loss and optimizer findings

Current local ranking:

  • on the practical 1:1:2 branch, plain CE beat weighted CE
  • SGD produced the first strong baseline
  • later AdamW cosine finetune beat that baseline substantially
  • earlier AdamW multistep finetune was unstable and inferior

Conclusion:

  • the current best recipe is not “AdamW from scratch”
  • it is “strong SGD-style baseline first, then milder AdamW cosine finetune”
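The finetune half of that recipe is a standard cosine anneal. The sketch below gives only the schedule formula; the actual base and minimum learning rates of the winning run are not restated in this note, so the values shown are placeholders:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate for the finetune phase.
    base_lr should be 'milder' than the SGD-phase peak; the numeric
    values below are placeholders, not the winning run's settings."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

print(cosine_lr(0, 100, 1e-4))    # starts at base_lr
print(cosine_lr(100, 100, 1e-4))  # decays to min_lr
```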

13. DRF-specific finding

Current local interpretation:

  • DRF is not failing because the skeleton branch is dead
  • it is failing because the extra prior branch is not yet adding a strong enough complementary signal

Most likely reasons:

  • the body-only skeleton baseline already captures most of the useful torso signal on 1:1:2
  • the current PAV/PGA path looks weakly selective
  • DRF tended to peak early and then degrade, suggesting less stable optimization than the plain baseline

Conclusion:

  • DRF is still a research branch here
  • it is not the current practical winner

Important Evaluation Caveat

14. There was no test-sample gradient leakage

For the long finetune run:

  • training batches came only from TRAIN_SET
  • repeated evaluation ran under torch.no_grad()
  • no TEST_SET samples were used in backprop or optimizer updates

Conclusion:

  • there was no train/test mixing inside gradient descent

15. There was test-set model-selection leakage

For the long finetune run:

  • full TEST_SET eval was run repeatedly during training
  • best-checkpoint retention used test_accuracy and test_f1
  • the final retained winner was chosen with test metrics

That means:

  • the retained best checkpoint is real and reproducible
  • but its metric should be interpreted as a test-guided, selection-biased result, not a pristine one-shot final estimate

Conclusion:

  • acceptable for practical model selection
  • not ideal for a publication-style clean generalization claim
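The selection rule behind this caveat can be stated in a few lines. The history entries below are illustrative (only the retained winner's metrics come from this note); the point is that the selection key is itself the test metric, which makes the retained score optimistically biased:

```python
def retain_best(history):
    """Mimic the audit's retention rule: pick the checkpoint with the
    best (test_accuracy, test_f1) seen during training. Because the
    selection metric IS the test metric, the retained score is biased
    upward relative to a one-shot final evaluation."""
    return max(history, key=lambda h: (h["test_accuracy"], h["test_f1"]))

# Illustrative history; only the step-2000 metrics come from this note.
history = [
    {"step": 1000, "test_accuracy": 88.10, "test_f1": 84.00},
    {"step": 2000, "test_accuracy": 92.38, "test_f1": 88.70},
    {"step": 3000, "test_accuracy": 90.00, "test_f1": 86.50},
]
print(retain_best(history)["step"])  # 2000
```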

Bottom Line

The repo currently supports:

  • strong practical scoliosis model development
  • stable skeleton-map experimentation
  • plausible DRF implementation work

The repo does not currently support:

  • claiming an independent reproduction of the DRF paper's quantitative result
  • claiming that the DRF paper fully specifies the winning skeleton-map preprocessing
  • treating DRF as the default best model in this codebase

Practical bottom line:

  • keep the body-only skeleton baseline as the mainline path
  • keep the retained best checkpoint family as the working artifact
  • treat DRF as an optional follow-up branch, not the current winner

Current strongest practical checkpoint:

  • 92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1

If future work claims to improve on this, it should:

  1. beat the current retained best checkpoint on full-test macro-F1
  2. be verifiable by standalone eval from a saved checkpoint
  3. state clearly whether the change is paper-faithful or purely empirical
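
Criterion 1 is checkable with a small standalone helper. The sketch below assumes binary screening labels; swap in the repo's actual class set if it differs:

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Macro-F1 matching criterion 1: per-class F1 computed
    independently, then averaged without class weighting.
    The binary class set is an assumption."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

print(round(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]), 4))  # 0.7333
```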