Scoliosis1K Reproducibility Audit
This note is the current audit of what is and is not reproducible in the Scoliosis1K ScoNet/DRF work inside this repo.
It separates three questions that should not be mixed together:
- is the repo/train stack itself working?
- is the paper-level DRF claim reproducible?
- what is the strongest practical model we have right now?
Related notes:
- ScoNet and DRF: Status, Architecture, and Reproduction Notes
- Scoliosis Training Change Log
- Scoliosis: Next Experiments
Primary references:
- ScoNet paper: arXiv-2407.05726v3-main.tex
- DRF paper: arXiv-2509.00872v1-main.tex
Executive Summary
Current audit conclusion:
- the repo and training stack are working
- the skeleton branch is learnable
- the published DRF result is still not independently reproducible here
- the best practical model in this repo is currently not DRF
Current practical winner:
- model family: ScoNet-MT-ske
- split: 1:1:2
- representation: body-only
- loss: plain CE + triplet
- optimizer path: later AdamW cosine finetune
- verified best retained checkpoint: 92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1
That means:
- practical model development is in a good state
- paper-faithful DRF reproduction is not
What We Can Say With High Confidence
1. The core training and evaluation stack works
Evidence:
- the silhouette ScoNet path behaves sensibly
- train/eval loops, checkpointing, resume, and standalone eval all work
- the strong practical skeleton result is reproducible from a saved checkpoint
Conclusion:
- the repo is not globally broken
- the main remaining issues are method-specific, not infrastructure-wide
2. The raw Scoliosis1K pose data is usable
Evidence:
- earlier dataset analysis showed high pose confidence and stable sequence lengths
- the skeleton branch eventually learns well on the easier practical split
Conclusion:
- the raw pose source is not the main blocker
3. The skeleton branch is learnable
Evidence:
- on 1:1:2, the skeleton branch moved from poor early baselines to a strong retained checkpoint family
- the best verified practical result is: 92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1
Conclusion:
- the skeleton path is not dead
- earlier failures on harder settings were not proof that skeleton input is fundamentally wrong
What Is Reproducible
4. The silhouette ScoNet path is reproducible enough to trust
Evidence:
- the silhouette pipeline and trainer behave consistently
- strong silhouette checkpoints evaluate sensibly on their intended split family
Conclusion:
- silhouette ScoNet remains a valid sanity anchor for this repo
5. The high-level DRF idea is reproducible
Evidence:
- the DRF paper defines the method as skeleton map + PAV + PGA
- this repo now contains:
- a DRF model
- DRF-specific preprocessing
- PAV generation
- PGA integration
Conclusion:
- the architecture-level idea is implementable
- a plausible DRF implementation exists locally
6. The PAV concept is reproducible
Evidence:
- PAV metrics were implemented and produced stable sequence-level signals
- the earlier analysis showed PAV still carries useful class signal
Conclusion:
- “we could not build the clinical prior” is not the main explanation anymore
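To make the PAV concept concrete, here is a minimal sketch of a sequence-level posture asymmetry signal. The left/right joint pairing, the midline definition, and the specific statistic are illustrative assumptions for this note, not the repo's actual PAV implementation.

```python
import numpy as np

# Hypothetical COCO-style left/right index pairs: shoulders, elbows, wrists, hips.
# The real PAV path may pair and weight joints differently.
LR_PAIRS = [(5, 6), (7, 8), (9, 10), (11, 12)]

def posture_asymmetry_vector(seq, midline_x=0.5):
    """seq: array of shape (T, J, 2) with normalized (x, y) joint coordinates.

    For each left/right pair, compare how far the two joints sit from the body
    midline and average the absolute mismatch over the sequence. A perfectly
    symmetric gait yields zeros; lateral asymmetry raises the pair's entry.
    """
    seq = np.asarray(seq, dtype=np.float64)
    out = []
    for left, right in LR_PAIRS:
        left_off = np.abs(seq[:, left, 0] - midline_x)
        right_off = np.abs(seq[:, right, 0] - midline_x)
        out.append(float(np.mean(np.abs(left_off - right_off))))
    return np.array(out)
```

The point of the sketch is only that such a vector is cheap to compute and stable per sequence, which matches the audit finding that the clinical prior itself is not the blocker.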
What Is Only Partially Reproducible
7. The skeleton-map branch is only partially specified by the papers
The papers define the representation conceptually, but not enough for quantitative reproduction from text alone.
Missing or under-specified details included:
- numeric Gaussian widths
- joint-vs-limb relative weighting
- crop and alignment policy
- resize/padding policy
- quantization/dtype behavior
- runtime transform assumptions
Why that matters:
- small rasterization and alignment changes moved results a lot
- many early failures came from details the paper did not pin down tightly enough
Conclusion:
- the paper gives the representation idea
- it does not fully specify the winning implementation
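To show why the missing rasterization details matter, here is a minimal joint-to-heatmap sketch. The sigma, per-joint weighting, and clipping below are exactly the knobs the papers leave open; the values chosen here are illustrative assumptions, not the winning configuration.

```python
import numpy as np

def rasterize_skeleton_map(joints, size=64, sigma=2.0, joint_weight=1.0):
    """Render 2D joint coordinates (normalized to [0, 1]) as a Gaussian heatmap.

    sigma, joint_weight, size, and the final clipping are all under-specified
    by the papers; small changes here shifted downstream results noticeably.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    canvas = np.zeros((size, size), dtype=np.float32)
    for jx, jy in joints:
        cx, cy = jx * (size - 1), jy * (size - 1)
        canvas += joint_weight * np.exp(
            -((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2)
        )
    return np.clip(canvas, 0.0, 1.0)
```

Every constant in this function is a degree of freedom the paper text does not pin down, which is why text-only reproduction of the skeleton map was not quantitative.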
8. The visualization story is only partially reproducible
The papers cite visualization families, but do not specify enough extraction detail to reproduce the exact figures cleanly.
Missing detail included:
- which layer
- before or after temporal pooling
- exact normalization/rendering procedure
Conclusion:
- local visualization work is useful for debugging
- it should not be treated as paper-faithful evidence
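As an example of how much freedom the missing normalization detail leaves, here is one plausible rendering choice (channel-mean then min-max scaling). This is a debugging aid under assumed conventions, not the papers' extraction procedure.

```python
import numpy as np

def activation_to_image(fmap, eps=1e-8):
    """Collapse a (C, H, W) activation tensor to a renderable (H, W) map in [0, 1].

    Channel-mean followed by min-max scaling is only one plausible choice; the
    papers do not pin down the layer, the pooling stage, or the normalization,
    so any figure produced this way is local debugging, not paper-faithful.
    """
    fmap = np.asarray(fmap, dtype=np.float64)
    img = fmap.mean(axis=0)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + eps)
```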
What Is Not Reproduced
9. The published DRF quantitative claim is not reproduced here
Paper-side numbers:
- ScoNet-MT-ske: 82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1
- DRF: 86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1
Local practical DRF result on the current workable 1:1:2 path:
- best retained DRF checkpoint (2000) full test: 80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1
Conclusion:
- DRF currently loses to the stronger plain skeleton baseline in this repo
- the published DRF advantage is not established locally
10. The paper-level 1:1:8 skeleton story is not reproduced here
What happened locally:
- 1:1:8 skeleton runs remained much weaker and more unstable
- the stronger practical result came from moving to 1:1:2
Conclusion:
- the hard 1:1:8 regime is still unresolved here
- current local evidence says class distribution is a major part of the failure mode
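One standard probe of whether class distribution drives the 1:1:8 failure is inverse-frequency reweighting of the loss. A minimal stdlib sketch follows; this is not the repo's weighted-CE implementation, just the arithmetic behind the idea.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weight proportional to 1/frequency, normalized so that a
    perfectly balanced label set gets weight 1.0 for every class."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return {c: total / (n_classes * cnt) for c, cnt in counts.items()}
```

On a 1:1:8 distribution this up-weights each minority class by roughly 3.3x and down-weights the majority class, which is the mechanism the earlier weighted-CE runs relied on.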
Practical Findings From The Search
11. Representation findings
Current local ranking:
- body-only is the best practical representation so far
- head-lite helped some small proxy runs but did not transfer to the full test set
- full-body has not yet beaten the best body-only checkpoint family
Concrete comparison:
- body-only + plain CE, full test at 7000: 83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1
- head-lite + plain CE, full test at 7000: 78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1
Conclusion:
- the stable useful signal is mainly torso-centered
- adding limited head information did not improve full-test performance
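The representation variants above amount to keeping different joint subsets before rasterization. A sketch under the assumption of COCO-17 keypoint ordering; the actual index sets used in the repo may differ.

```python
# Assumed COCO-17 ordering: indices 0-4 are head keypoints (nose, eyes, ears),
# indices 5-16 are body keypoints. These index sets are illustrative guesses.
FULL_BODY = list(range(17))
BODY_ONLY = list(range(5, 17))        # drop all five head keypoints
HEAD_LITE = [0] + list(range(5, 17))  # keep only the nose from the head

def select_joints(seq, keep):
    """seq: (T, J, ...) nested lists of per-frame joints; restrict each frame
    to the indices in `keep`, preserving their order."""
    return [[frame[j] for j in keep] for frame in seq]
```

Framed this way, body-only vs head-lite vs full-body is a one-line change to the `keep` list, which is why the representation sweep was cheap to run.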
12. Loss and optimizer findings
Current local ranking:
- on the practical 1:1:2 branch, plain CE beat weighted CE
- SGD produced the first strong baseline
- a later AdamW cosine finetune beat that baseline substantially
- an earlier AdamW multistep finetune was unstable and inferior
Conclusion:
- the current best recipe is not “AdamW from scratch”
- it is “strong SGD-style baseline first, then milder AdamW cosine finetune”
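The "milder AdamW cosine finetune" phase reduces to a cosine-decayed learning rate on top of the SGD checkpoint. A stdlib sketch of the schedule; the base and floor values are illustrative assumptions, not the run's actual hyperparameters.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=1e-6):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps.

    base_lr and min_lr here are placeholder values; the point is the shape:
    a gentle, monotone decay rather than the abrupt multistep drops that
    proved unstable in the earlier finetune attempt.
    """
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

This matches the schedule shape of PyTorch's CosineAnnealingLR, so in practice the repo can use the built-in scheduler rather than hand-rolling it.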
13. DRF-specific finding
Current local interpretation:
- DRF is not failing because the skeleton branch is dead
- it is failing because the extra prior branch is not yet adding a strong enough complementary signal
Most likely reasons:
- the body-only skeleton baseline already captures most of the useful torso signal on 1:1:2
- the current PAV/PGA path looks weakly selective
- DRF tended to peak early and then degrade, suggesting less stable optimization than the plain baseline
Conclusion:
- DRF is still a research branch here
- it is not the current practical winner
Important Evaluation Caveat
14. There was no test-sample gradient leakage
For the long finetune run:
- training batches came only from TRAIN_SET
- repeated evaluation ran under torch.no_grad()
- no TEST_SET samples were used in backprop or optimizer updates
Conclusion:
- there was no train/test mixing inside gradient descent
15. There was test-set model-selection leakage
For the long finetune run:
- full TEST_SET eval was run repeatedly during training
- best-checkpoint retention used test_accuracy and test_f1
- the final retained winner was chosen with test metrics
That means:
- the retained best checkpoint is real and reproducible
- but its metric should be interpreted as a test-guided selected result, not a pristine one-shot final estimate
Conclusion:
- acceptable for practical model selection
- not ideal for a publication-style clean generalization claim
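A leakage-free variant of the selection step is simple: pick the checkpoint on a validation metric and read the test metric exactly once. A sketch with hypothetical metric logs (the dict-of-step-to-F1 shape is an assumption, not the repo's logging format):

```python
def select_checkpoint_cleanly(val_f1_by_step, test_f1_by_step):
    """Choose the best step using validation F1 only, then report its test F1.

    The test log is consulted exactly once, after selection, so the reported
    number is a one-shot estimate rather than a test-guided selected result.
    """
    best_step = max(val_f1_by_step, key=val_f1_by_step.get)
    return best_step, test_f1_by_step[best_step]
```

Note that with this protocol the reported test F1 can be lower than the best value in the test log; that gap is exactly the optimism the current test-guided retention bakes in.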
Bottom Line
The repo currently supports:
- strong practical scoliosis model development
- stable skeleton-map experimentation
- plausible DRF implementation work
The repo does not currently support:
- claiming an independent reproduction of the DRF paper’s quantitative result
- claiming that the DRF paper fully specifies the winning skeleton-map preprocessing
- treating DRF as the default best model in this codebase
Practical bottom line:
- keep the body-only skeleton baseline as the mainline path
- keep the retained best checkpoint family as the working artifact
- treat DRF as an optional follow-up branch, not the current winner
Current strongest practical checkpoint:
92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1
If future work claims to improve on this, it should:
- beat the current retained best checkpoint on full-test macro-F1
- be verifiable by standalone eval from a saved checkpoint
- state clearly whether the change is paper-faithful or purely empirical
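For the macro-F1 criterion, a dependency-free reference implementation may help keep comparisons honest across runs; it is equivalent in spirit to scikit-learn's f1_score with average="macro" (the function name and signature here are this note's own):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores; empty classes contribute 0.0."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because macro-F1 weights every class equally, it is the right yardstick for this repo's imbalanced splits: a claimed improvement must move this number on the full test set, verified by standalone eval from the saved checkpoint.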