# Scoliosis1K Reproducibility Audit

This note is the current audit of what is and is not reproducible in the Scoliosis1K ScoNet/DRF work inside this repo. It separates three questions that should not be mixed together:

- is the repo/train stack itself working?
- is the paper-level DRF claim reproducible?
- what is the strongest practical model we have right now?

Related notes:

- [ScoNet and DRF: Status, Architecture, and Reproduction Notes](sconet-drf-status-and-training.md)
- [Scoliosis Training Change Log](scoliosis_training_change_log.md)
- [Scoliosis: Next Experiments](scoliosis_next_experiments.md)
- [DRF Author Checkpoint Compatibility Note](drf_author_checkpoint_compat.md)

Primary references:

- ScoNet paper: [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex)
- DRF paper: [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)

## Executive Summary

Current audit conclusion:

- the repo and training stack are working
- the skeleton branch is learnable
- the published DRF result is still not independently reproducible here
- the best practical model in this repo is currently not DRF

Current practical winner:

- model family: `ScoNet-MT-ske`
- split: `1:1:2`
- representation: `body-only`
- loss: plain CE + triplet
- optimizer path: later `AdamW` cosine finetune
- verified best retained checkpoint:
  - `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`

That means:

- practical model development is in a good state
- paper-faithful DRF reproduction is not

## What We Can Say With High Confidence

### 1. The core training and evaluation stack works

Evidence:

- the silhouette ScoNet path behaves sensibly
- train/eval loops, checkpointing, resume, and standalone eval all work
- the strong practical skeleton result is reproducible from a saved checkpoint

Conclusion:

- the repo is not globally broken
- the main remaining issues are method-specific, not infrastructure-wide
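The checkpoint/resume contract the stack depends on can be illustrated in miniature. This is a hedged sketch, not the repo's actual trainer API: the field names (`step`, `model_state`, `optim_state`) and the pickle format are assumptions for illustration only.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, step, model_state, optim_state):
    # Persist everything a resume needs: iteration count plus model
    # and optimizer state (hypothetical field names, not the repo's).
    with open(path, "wb") as f:
        pickle.dump({"step": step,
                     "model_state": model_state,
                     "optim_state": optim_state}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip: a resumed run and a standalone eval should both see
# exactly the state that was saved.
path = os.path.join(tempfile.mkdtemp(), "ckpt_0100.pkl")
save_checkpoint(path, 100, {"w": [0.1, 0.2]}, {"lr": 0.01})
ckpt = load_checkpoint(path)
assert ckpt["step"] == 100
assert ckpt["model_state"]["w"] == [0.1, 0.2]
```

The point of the round-trip check is the audit claim above: "reproducible from a saved checkpoint" only holds if the checkpoint carries the full training state, not just weights.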
### 2. The raw Scoliosis1K pose data is usable

Evidence:

- earlier dataset analysis showed high pose confidence and stable sequence lengths
- the skeleton branch eventually learns well on the easier practical split

Conclusion:

- the raw pose source is not the main blocker

### 3. The skeleton branch is learnable

Evidence:

- on `1:1:2`, the skeleton branch moved from poor early baselines to a strong retained checkpoint family
- the best verified practical result is:
  - `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`

Conclusion:

- the skeleton path is not dead
- earlier failures on harder settings were not proof that skeleton input is fundamentally wrong

## What Is Reproducible

### 4. The silhouette ScoNet path is reproducible enough to trust

Evidence:

- the silhouette pipeline and trainer behave consistently
- strong silhouette checkpoints evaluate sensibly on their intended split family

Conclusion:

- silhouette ScoNet remains a valid sanity anchor for this repo

### 5. The high-level DRF idea is reproducible

Evidence:

- the DRF paper defines the method as `skeleton map + PAV + PGA`
- this repo now contains:
  - a DRF model
  - DRF-specific preprocessing
  - PAV generation
  - PGA integration

Conclusion:

- the architecture-level idea is implementable
- a plausible DRF implementation exists locally

### 6. The PAV concept is reproducible

Evidence:

- PAV metrics were implemented and produced stable sequence-level signals
- the earlier analysis showed PAV still carries useful class signal

Conclusion:

- “we could not build the clinical prior” is not the main explanation anymore

## What Is Only Partially Reproducible

### 7. The skeleton-map branch is only partially specified by the papers

The papers define the representation conceptually, but not enough for quantitative reproduction from text alone.
Missing or under-specified details include:

- numeric Gaussian widths
- joint-vs-limb relative weighting
- crop and alignment policy
- resize/padding policy
- quantization/dtype behavior
- runtime transform assumptions

Why that matters:

- small rasterization and alignment changes moved results a lot
- many early failures came from details the paper did not pin down tightly enough

Conclusion:

- the paper gives the representation idea
- it does not fully specify the winning implementation

### 8. The visualization story is only partially reproducible

The papers cite visualization families, but do not specify enough extraction detail to reproduce the exact figures cleanly.

Missing details include:

- which layer
- before or after temporal pooling
- the exact normalization/rendering procedure

Conclusion:

- local visualization work is useful for debugging
- it should not be treated as paper-faithful evidence

## What Is Not Reproduced

### 9. The published DRF quantitative claim is not reproduced here

Paper-side numbers:

- `ScoNet-MT-ske`: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
- `DRF`: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`

Local practical DRF result on the current workable `1:1:2` path:

- best retained DRF checkpoint (`2000`) full test:
  - `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1`

Conclusion:

- DRF currently loses to the stronger plain skeleton baseline in this repo
- the published DRF advantage is not established locally
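The quadruples compared above are `Acc / Prec / Rec / F1`, and the improvement criteria at the end of this note refer to macro-F1. For reference, a minimal sketch of macro-averaged precision/recall/F1 from a 3-class confusion matrix; the averaging convention and the toy counts are assumptions for illustration, and the repo evaluator's exact implementation may differ.

```python
def macro_metrics(cm):
    """cm[i][j] = count of samples with true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    acc = sum(cm[i][i] for i in range(n)) / total
    precs, recs, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        pred_k = sum(cm[i][k] for i in range(n))  # column sum: predicted as k
        true_k = sum(cm[k][j] for j in range(n))  # row sum: actually k
        p = tp / pred_k if pred_k else 0.0
        r = tp / true_k if true_k else 0.0
        precs.append(p)
        recs.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    # Macro averaging: unweighted mean over classes, so minority
    # classes count as much as the majority class.
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n

# Toy 3-class confusion matrix (hypothetical counts, not repo data):
cm = [[50, 5, 5],
      [10, 30, 10],
      [5, 5, 40]]
acc, prec, rec, f1 = macro_metrics(cm)  # acc == 0.75
```

Macro averaging matters here because the splits are imbalanced (`1:1:2`, `1:1:8`): a high accuracy can coexist with a much lower macro-F1 when minority-class recall is poor, which is exactly the pattern in the local DRF numbers above.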
### 9a. The author-provided DRF checkpoint is partially recoverable

This changed one important part of the audit:

- the author checkpoint itself is not unusable
- the earlier very poor local eval was mostly a compatibility failure

The recovered best author-checkpoint path is:

- config: `configs/drf/drf_author_eval_118_aligned_1gpu.yaml`
- result:
  - `80.24 Acc / 76.73 Prec / 76.40 Rec / 76.56 F1`

This is still below the paper's DRF headline result:

- `86.0 / 84.1 / 79.2 / 80.8`

But it is far better than the earlier broken compat evals, which means:

- the weight file is real
- the stale author YAML is not a reliable runtime contract

The main causes were:

- split mismatch:
  - the checkpoint name says `118`
  - the provided YAML points to `112`
- class-order mismatch:
  - the author stub uses `negative=0, positive=1, neutral=2`
  - the repo evaluator assumes `negative=0, neutral=1, positive=2`
- legacy module naming mismatch:
  - `attention_layer.*` vs `PGA.*`
- preprocessing/runtime mismatch:
  - the checkpoint aligns much better with `Scoliosis1K-drf-pkl-118-aligned`
  - it performs very badly on the local `118-paper` export

Conclusion:

- the author checkpoint can be made meaningfully usable in this repo
- but the provided bundle still does not fully specify the original training/eval contract

### 10. The paper-level `1:1:8` skeleton story is not reproduced here

What happened locally:

- `1:1:8` skeleton runs remained much weaker and more unstable
- the stronger practical result came from moving to `1:1:2`

Conclusion:

- the hard `1:1:8` regime is still unresolved here
- current local evidence says class distribution is a major part of the failure mode
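The class-order mismatch recorded in 9a is mechanical to fix once diagnosed: author-side class scores (`negative=0, positive=1, neutral=2`) can be permuted into the repo convention (`negative=0, neutral=1, positive=2`) before evaluation. A minimal sketch; the function and variable names are illustrative, not the repo's actual adapter code.

```python
# Class order used by the author stub vs. by the repo evaluator
# (both orders are stated in section 9a above).
AUTHOR_ORDER = ["negative", "positive", "neutral"]
REPO_ORDER = ["negative", "neutral", "positive"]

# perm[repo_index] = author_index of the same class name
perm = [AUTHOR_ORDER.index(name) for name in REPO_ORDER]  # [0, 2, 1]

def to_repo_order(author_scores):
    """Reorder one sample's per-class scores from author to repo convention."""
    return [author_scores[i] for i in perm]

# A sample the author model calls "positive" (author index 1) must land
# on repo index 2 after remapping, or every positive prediction is
# scored as "neutral" by the evaluator.
scores = [0.1, 0.7, 0.2]
repo_scores = to_repo_order(scores)  # [0.1, 0.2, 0.7]
assert repo_scores.index(max(repo_scores)) == REPO_ORDER.index("positive")
```

Without this remap, "positive" and "neutral" predictions swap labels at eval time, which alone is enough to produce the kind of broken compat numbers described above.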
## Practical Findings From The Search

### 11. Representation findings

Current local ranking:

- `body-only` is the best practical representation so far
- `head-lite` helped some small proxy runs but did not transfer to the full test set
- `full-body` has not yet beaten the best `body-only` checkpoint family

Concrete comparison:

- `body-only + plain CE` full test at `7000`:
  - `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1`
- `head-lite + plain CE` full test at `7000`:
  - `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`

Conclusion:

- the stable useful signal is mainly torso-centered
- adding limited head information did not improve full-test performance

### 12. Loss and optimizer findings

Current local ranking:

- on the practical `1:1:2` branch, `plain CE` beat `weighted CE`
- `SGD` produced the first strong baseline
- the later `AdamW` cosine finetune beat that baseline substantially
- the earlier `AdamW` multistep finetune was unstable and inferior

Conclusion:

- the current best recipe is not “AdamW from scratch”
- it is “strong SGD-style baseline first, then milder AdamW cosine finetune”

### 13. DRF-specific finding

Current local interpretation:

- DRF is not failing because the skeleton branch is dead
- it is failing because the extra prior branch is not yet adding a strong enough complementary signal

Most likely reasons:

- the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`
- the current PAV/PGA path looks weakly selective
- DRF tended to peak early and then degrade, suggesting less stable optimization than the plain baseline

Conclusion:

- DRF is still a research branch here
- it is not the current practical winner

## Important Evaluation Caveat

### 14. There was no test-sample gradient leakage

For the long finetune run:

- training batches came only from `TRAIN_SET`
- repeated evaluation ran under `torch.no_grad()`
- no `TEST_SET` samples were used in backprop or optimizer updates

Conclusion:

- there was no train/test mixing inside gradient descent
### 15. There was test-set model-selection leakage

For the long finetune run:

- full `TEST_SET` eval was run repeatedly during training
- best-checkpoint retention used `test_accuracy` and `test_f1`
- the final retained winner was chosen with test metrics

That means:

- the retained best checkpoint is real and reproducible
- but its metric should be interpreted as a test-guided selected result, not a pristine one-shot final estimate

Conclusion:

- acceptable for practical model selection
- not ideal for a publication-style clean generalization claim

## Bottom Line

The repo currently supports:

- strong practical scoliosis model development
- stable skeleton-map experimentation
- plausible DRF implementation work

The repo does not currently support:

- claiming an independent reproduction of the DRF paper's quantitative result
- claiming that the DRF paper fully specifies the winning skeleton-map preprocessing
- treating DRF as the default best model in this codebase

Practical bottom line:

- keep the `body-only` skeleton baseline as the mainline path
- keep the retained best checkpoint family as the working artifact
- treat DRF as an optional follow-up branch, not the current winner
- if DRF work continues, the author checkpoint on the aligned `118` path is now a much better starting point than scratch DRF

Current strongest practical checkpoint:

- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`

If future work claims to improve on this, it should:

1. beat the current retained best checkpoint on full-test macro-F1
2. be verifiable by standalone eval from a saved checkpoint
3. state clearly whether the change is paper-faithful or purely empirical
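A cleaner version of criterion 1, given the model-selection caveat in section 15: retain checkpoints on a held-out validation metric only, and read the test metric once, for the single retained step. A minimal sketch of that retention rule; the per-step metric log and its values are hypothetical.

```python
# Hypothetical per-step metric log: step -> (val_macro_f1, test_macro_f1).
# Retention looks only at the validation column; the test column is read
# exactly once, for the retained checkpoint.
history = {
    1000: (0.81, 0.80),
    2000: (0.86, 0.84),
    3000: (0.84, 0.88),  # best test F1, but val-based retention skips it
}

best_step = max(history, key=lambda s: history[s][0])  # select on val only
reported_test_f1 = history[best_step][1]

assert best_step == 2000
assert reported_test_f1 == 0.84
```

Under this rule the reported number can be lower than the best test score ever observed during training (as in the toy log above), but it is the number a one-shot final evaluation would actually produce.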