diff --git a/docs/scoliosis_reproducibility_audit.md b/docs/scoliosis_reproducibility_audit.md index f28ec51..7bdc333 100644 --- a/docs/scoliosis_reproducibility_audit.md +++ b/docs/scoliosis_reproducibility_audit.md @@ -1,230 +1,264 @@ # Scoliosis1K Reproducibility Audit -This note records which parts of the ScoNet and DRF papers are currently reproducible in this repo, which parts are only partially reproducible, and which parts remain under-specified or unsupported by local evidence. +This note is the current audit of what is and is not reproducible in the Scoliosis1K ScoNet/DRF work inside this repo. -Ground truth policy for this audit: +It separates three questions that should not be mixed together: +- is the repo/train stack itself working? +- is the paper-level DRF claim reproducible? +- what is the strongest practical model we have right now? -- the papers are the methodological source of truth -- local code and local runs are the implementation evidence -- when the two diverge, this document states that explicitly - -## Papers and local references +Related notes: +- [ScoNet and DRF: Status, Architecture, and Reproduction Notes](sconet-drf-status-and-training.md) +- [Scoliosis Training Change Log](scoliosis_training_change_log.md) +- [Scoliosis: Next Experiments](scoliosis_next_experiments.md) +Primary references: - ScoNet paper: [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex) - DRF paper: [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex) -- local run history: [scoliosis_training_change_log.md](scoliosis_training_change_log.md) -- current status note: [sconet-drf-status-and-training.md](sconet-drf-status-and-training.md) -## What is reproducible +## Executive Summary -### 1. 
The silhouette ScoNet pipeline is reproducible +Current audit conclusion: +- the repo and training stack are working +- the skeleton branch is learnable +- the published DRF result is still not independently reproducible here +- the best practical model in this repo is currently not DRF -Evidence: - -- The ScoNet paper states the standard `1:1:8` evaluation protocol and the SGD schedule clearly in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex) and the same tracked TeX source documents the class-ratio study. -- The same paper reports that multi-task ScoNet-MT is much stronger than single-task ScoNet, including the class-ratio study in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex). -- In this repo, the standard silhouette ScoNet path is stable: - - the model/trainer/evaluator path is intact - - a strong silhouette checkpoint reproduces cleanly on the correct split family - -Conclusion: - -- the Scoliosis1K silhouette modality is usable -- the core OpenGait training and evaluation stack is usable for this task -- the repo is not globally broken - -### 2. The high-level DRF architecture is reproducible - -Evidence: - -- The DRF paper defines the method as `skeleton map + PAV + PGA` in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex). -- It defines: - - pelvis-centering and height normalization in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex) - - two-channel skeleton maps in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex) - - PAV metrics in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex) - - PGA channel/spatial attention in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex) -- This repo now has a functioning DRF model and DRF-specific preprocessing path implementing those ideas. 
- -Conclusion: - -- the paper is specific enough to implement a plausible DRF model family -- the architecture-level claim is reproducible -- the exact paper-level quantitative result is not yet reproducible - -### 3. The PAV concept is reproducible - -Evidence: - -- The DRF paper defines 8 symmetric joint pairs and 3 asymmetry metrics in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex). -- The local preprocessing implements those metrics and produces stable sequence-level PAVs. -- Local dataset analysis showed the PAV still carries useful signal, even with a simple probe. - -Conclusion: - -- PAV is not the main missing piece -- the main reproduction gap is not “we cannot build the clinical prior” - -## What is only partially reproducible - -### 4. The skeleton-map branch is reproducible only at the concept level - -Evidence: - -- The DRF paper describes the skeleton map as a dense, silhouette-like two-channel representation in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex). -- It does not specify crucial rasterization details such as: - - numeric `sigma` - - joint-vs-limb relative weighting - - quantization and dtype - - crop policy - - resize policy - - whether alignment is per-frame or per-sequence -- Local runs show these details matter a lot: - - `sigma=8` skeleton runs were very poor - - smaller sigma and fixed limb/joint alignment improved results materially - - the best local skeleton baseline is still only `50.47 Acc / 48.63 Macro-F1`, far below the paper's `82.5 / 76.6` for ScoNet-MT-ske in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex) - -Conclusion: - -- the paper specifies the representation idea -- it does not specify enough to make the skeleton-map branch quantitatively reproducible from text alone - -### 5. 
The visualization story is only partially reproducible - -Evidence: - -- The ScoNet paper cites the attention-transfer visualization family in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex). -- The DRF paper cites Zhou et al. CAM in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex). -- Neither paper states: - - which layer is visualized - - whether visualization is before or after temporal pooling - - the exact normalization/rendering procedure -- Local attempts suggest that only certain intermediate layers produce qualitatively plausible maps, and even then the results are approximations rather than faithful reproductions. - -Conclusion: - -- qualitative visualization claims are only partially reproducible -- they should not be treated as strong evidence until the extraction procedure is specified better - -## What is not reproducible from the paper and local materials alone - -### 6. The paper-level ScoNet-MT-ske and DRF numbers are not reproducible yet - -Evidence: - -- The DRF paper reports: - - ScoNet-MT-ske: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1` - - DRF: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1` - in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex) -- The best local skeleton-map baseline so far is recorded in [scoliosis_training_change_log.md](scoliosis_training_change_log.md): - - `50.47 Acc / 69.31 Prec / 54.58 Rec / 48.63 F1` -- Local DRF runs are also well below the paper: - - `58.08 / 78.80 / 60.22 / 56.99` - - `51.67 / 72.37 / 56.22 / 50.92` - -Conclusion: - -- the current repo can reproduce the idea of DRF -- it cannot reproduce the paper’s reported skeleton/DRF metrics yet - -### 7. The missing author-side training details are still unresolved - -Evidence: - -- The author-side DRF stub referenced a `BaseModel_body` path that was not released in the original materials. -- A local reconstruction suggests that `BaseModel_body` was probably thin, not the whole explanation for the metric gap. 
-- Even after matching the likely missing base-class contract more closely, the metric gap remained large. - -Conclusion: - -- the missing private code is probably not the only reason reproduction fails -- but the lack of released code still weakens the paper’s reproducibility - -### 8. The exact split accounting is slightly inconsistent - -Evidence: - -- The ScoNet and DRF papers describe the standard split as `745 train / 748 test` in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex) and [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex). -- The released partition file matching the `1:1:8` class ratio in this repo is effectively `744 / 749`. - -Conclusion: - -- this is probably a release/text bookkeeping mismatch, not the main source of failure -- but it is another example that the paper protocol is not perfectly audit-friendly - -## Current strongest local conclusions - -### Reproducible with high confidence - -- silhouette ScoNet runs are meaningful -- the Scoliosis1K raw pose data does not appear obviously broken -- the OpenGait training/evaluation infrastructure is not the main problem -- PAV computation is not the main blocker -- the skeleton branch is learnable on the easier `1:1:2` split -- on `1:1:2`, `body-only + weighted CE` reached `81.82 Acc / 66.21 Prec / 88.50 Rec / 65.96 F1` on the full test set -- on the same split, `body-only + plain CE` improved that further to `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1` at `7000` -- a later explicit rerun of the `body-only + plain CE` `7000` full-test eval reproduced that same `83.16 / 68.24 / 80.02 / 68.47` result -- a later `AdamW` cosine finetune from that same `10k` plain-CE checkpoint improved the practical result further: - - verified retained best checkpoint at `27000`: `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1` - - final `80000` checkpoint still remained strong: `90.64 Acc / 72.87 Prec / 93.19 Rec / 75.74 F1` -- adding back limited head context via `head-lite` did 
not improve the full-test score; its `7000` checkpoint reached only `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1` -- the first practical DRF bridge on the same `1:1:2` body-only recipe peaked early and still underperformed the plain skeleton baseline; its best retained `2000` checkpoint reached only `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1` on the full test set - -### Not reproducible with current evidence - -- the paper’s claimed skeleton-map baseline quality -- the paper’s claimed DRF improvement magnitude -- the paper’s qualitative response-map story as shown in the figures - -### Most likely interpretation - -- the papers are probably directionally correct -- but the skeleton-map and DRF pipelines are under-specified -- the missing implementation details are important enough that a faithful independent reproduction is not currently achievable from the paper text and released materials alone -- the `1:1:8` class ratio is not just a nuisance; it appears to be a major driver of the current skeleton/DRF failure mode -- on the easier `1:1:2` split, weighted CE is not currently the winning recipe; the best local full-test result so far came from plain CE -- `head-lite` may help the small fixed proxy subset, but that gain did not transfer to the full `TEST_SET`, so `body-only + plain CE` remains the best practical skeleton recipe -- once the practical `1:1:2` body-only plain-CE recipe was established, the branch still appeared underfit enough that optimizer/schedule mattered again. A later `AdamW` cosine finetune beat the earlier SGD bridge by a large margin at its retained best checkpoint, which means the earlier `83.16 / 68.47` result was a stable baseline but not the ceiling of this skeleton recipe -- DRF currently looks worse than the plain skeleton baseline not because the skeleton path is dead, but because the additional prior branch is not yet providing a selective or stable complement. 
The current local evidence points to three likely causes: - - the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`, so PAV may be largely redundant in this setting - - the current PGA/PAV path appears weakly selective in local diagnostics, so the prior is not clearly emphasizing a few clinically relevant parts - - DRF peaks very early and then degrades, which suggests the added branch is making optimization less stable without improving the final decision boundary - -## Recommended standard for future work in this repo - -When a future change claims to improve DRF reproduction, it should satisfy all of the following: - -1. beat the current best skeleton baseline on the fixed proxy protocol -2. remain stable across at least one full run, not just a short spike -3. state whether the change is: - - paper-faithful - - implementation-motivated - - or purely empirical -4. avoid using silhouette success as evidence that the skeleton path is correct - -## Practical bottom line - -At the moment, this repo supports: - -- faithful silhouette ScoNet experimentation -- plausible DRF implementation work -- structured debugging of the skeleton-map branch - -At the moment, this repo does not yet support: - -- claiming a successful independent reproduction of the DRF paper’s quantitative results -- claiming that the paper’s skeleton-map preprocessing is fully specified -- treating the paper’s qualitative feature-response visualizations as reproduced - -For practical model selection, the current conclusion is simpler: - -- stop treating DRF as the default winner -- keep the practical mainline on `1:1:2` -- use the retained `body-only + plain CE` skeleton checkpoint family as the working solution -- the strongest verified practical checkpoint is the later `AdamW` cosine finetune checkpoint at `27000`, with: +Current practical winner: +- model family: `ScoNet-MT-ske` +- split: `1:1:2` +- representation: `body-only` +- loss: plain CE + triplet +- 
optimizer path: later `AdamW` cosine finetune +- verified best retained checkpoint: - `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1` -That means the remaining work is no longer broad reproduction debugging. It is mostly optional refinement: +That means: +- practical model development is in a good state +- paper-faithful DRF reproduction is not -- confirm whether `body-only` really beats `full-body` under the same successful training recipe -- optionally retry DRF only after the strong practical skeleton baseline is fixed -- package and use the retained best checkpoint rather than continuing to churn the whole search space +## What We Can Say With High Confidence + +### 1. The core training and evaluation stack works + +Evidence: +- the silhouette ScoNet path behaves sensibly +- train/eval loops, checkpointing, resume, and standalone eval all work +- the strong practical skeleton result is reproducible from a saved checkpoint + +Conclusion: +- the repo is not globally broken +- the main remaining issues are method-specific, not infrastructure-wide + +### 2. The raw Scoliosis1K pose data is usable + +Evidence: +- earlier dataset analysis showed high pose confidence and stable sequence lengths +- the skeleton branch eventually learns well on the easier practical split + +Conclusion: +- the raw pose source is not the main blocker + +### 3. The skeleton branch is learnable + +Evidence: +- on `1:1:2`, the skeleton branch moved from poor early baselines to a strong retained checkpoint family +- the best verified practical result is: + - `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1` + +Conclusion: +- the skeleton path is not dead +- earlier failures on harder settings were not proof that skeleton input is fundamentally wrong + +## What Is Reproducible + +### 4. 
The silhouette ScoNet path is reproducible enough to trust
+
+Evidence:
+- the silhouette pipeline and trainer behave consistently
+- strong silhouette checkpoints evaluate sensibly on their intended split family
+
+Conclusion:
+- silhouette ScoNet remains a valid sanity anchor for this repo
+
+### 5. The high-level DRF idea is reproducible
+
+Evidence:
+- the DRF paper defines the method as `skeleton map + PAV + PGA`
+- this repo now contains:
+  - a DRF model
+  - DRF-specific preprocessing
+  - PAV generation
+  - PGA integration
+
+Conclusion:
+- the architecture-level idea is implementable
+- a plausible DRF implementation exists locally
+
+### 6. The PAV concept is reproducible
+
+Evidence:
+- PAV metrics were implemented and produced stable sequence-level signals
+- the earlier analysis showed PAV still carries useful class signal
+
+Conclusion:
+- “we could not build the clinical prior” is no longer a plausible explanation for the reproduction gap
+
+## What Is Only Partially Reproducible
+
+### 7. The skeleton-map branch is only partially specified by the papers
+
+The papers define the representation conceptually, but not in enough detail for quantitative reproduction from text alone.
+
+Missing or under-specified details included:
+- numeric Gaussian widths
+- joint-vs-limb relative weighting
+- crop and alignment policy
+- resize/padding policy
+- quantization/dtype behavior
+- runtime transform assumptions
+
+Why that matters:
+- small rasterization and alignment changes shifted results substantially
+- many early failures came from details the paper did not pin down tightly enough
+
+Conclusion:
+- the paper gives the representation idea
+- it does not fully specify the winning implementation
+
+### 8. The visualization story is only partially reproducible
+
+The papers cite visualization families, but do not specify enough extraction detail to reproduce the exact figures cleanly.
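To make the ambiguity concrete, a CAM-style response map boils down to only a few lines, and every choice in them is one the papers leave open. A minimal sketch with hypothetical shapes and random stand-in values (the layer, pooling stage, and normalization below are assumptions, not the papers' procedure):

```python
import numpy as np

# Hypothetical shapes and values -- stand-ins for one sequence-level feature
# map (C channels at HxW) and one weight per channel (e.g. classifier weights
# for the predicted class). Nothing here is taken from the papers.
rng = np.random.default_rng(0)
feat = rng.standard_normal((64, 16, 11))   # (C, H, W) feature map
w = rng.standard_normal(64)                # (C,) per-channel weights

# CAM-style response map: channel-weighted sum over the feature map.
cam = np.tensordot(w, feat, axes=([0], [0]))   # -> (H, W)

# ReLU, then min-max normalization to [0, 1] -- one of several plausible
# rendering conventions; the papers do not pin this step down.
cam = np.maximum(cam, 0.0)
if cam.max() > 0:
    cam = (cam - cam.min()) / (cam.max() - cam.min())

print(cam.shape)   # (16, 11)
```

Each line above hides an unstated decision, which is exactly why local maps are approximations rather than faithful reproductions.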
+ +Missing detail included: +- which layer +- before or after temporal pooling +- exact normalization/rendering procedure + +Conclusion: +- local visualization work is useful for debugging +- it should not be treated as paper-faithful evidence + +## What Is Not Reproduced + +### 9. The published DRF quantitative claim is not reproduced here + +Paper-side numbers: +- `ScoNet-MT-ske`: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1` +- `DRF`: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1` + +Local practical DRF result on the current workable `1:1:2` path: +- best retained DRF checkpoint (`2000`) full test: + - `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1` + +Conclusion: +- DRF currently loses to the stronger plain skeleton baseline in this repo +- the published DRF advantage is not established locally + +### 10. The paper-level `1:1:8` skeleton story is not reproduced here + +What happened locally: +- `1:1:8` skeleton runs remained much weaker and more unstable +- the stronger practical result came from moving to `1:1:2` + +Conclusion: +- the hard `1:1:8` regime is still unresolved here +- current local evidence says class distribution is a major part of the failure mode + +## Practical Findings From The Search + +### 11. Representation findings + +Current local ranking: +- `body-only` is the best practical representation so far +- `head-lite` helped some small proxy runs but did not transfer to the full test set +- `full-body` has not yet beaten the best `body-only` checkpoint family + +Concrete comparison: +- `body-only + plain CE` full test at `7000`: + - `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1` +- `head-lite + plain CE` full test at `7000`: + - `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1` + +Conclusion: +- the stable useful signal is mainly torso-centered +- adding limited head information did not improve full-test performance + +### 12. 
Loss and optimizer findings + +Current local ranking: +- on the practical `1:1:2` branch, `plain CE` beat `weighted CE` +- `SGD` produced the first strong baseline +- later `AdamW` cosine finetune beat that baseline substantially +- earlier `AdamW` multistep finetune was unstable and inferior + +Conclusion: +- the current best recipe is not “AdamW from scratch” +- it is “strong SGD-style baseline first, then milder AdamW cosine finetune” + +### 13. DRF-specific finding + +Current local interpretation: +- DRF is not failing because the skeleton branch is dead +- it is failing because the extra prior branch is not yet adding a strong enough complementary signal + +Most likely reasons: +- the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2` +- the current PAV/PGA path looks weakly selective +- DRF tended to peak early and then degrade, suggesting less stable optimization than the plain baseline + +Conclusion: +- DRF is still a research branch here +- it is not the current practical winner + +## Important Evaluation Caveat + +### 14. There was no test-sample gradient leakage + +For the long finetune run: +- training batches came only from `TRAIN_SET` +- repeated evaluation ran under `torch.no_grad()` +- no `TEST_SET` samples were used in backprop or optimizer updates + +Conclusion: +- there was no train/test mixing inside gradient descent + +### 15. 
There was test-set model-selection leakage + +For the long finetune run: +- full `TEST_SET` eval was run repeatedly during training +- best-checkpoint retention used `test_accuracy` and `test_f1` +- the final retained winner was chosen with test metrics + +That means: +- the retained best checkpoint is real and reproducible +- but its metric should be interpreted as a test-guided selected result, not a pristine one-shot final estimate + +Conclusion: +- acceptable for practical model selection +- not ideal for a publication-style clean generalization claim + +## Bottom Line + +The repo currently supports: +- strong practical scoliosis model development +- stable skeleton-map experimentation +- plausible DRF implementation work + +The repo does not currently support: +- claiming an independent reproduction of the DRF paper’s quantitative result +- claiming that the DRF paper fully specifies the winning skeleton-map preprocessing +- treating DRF as the default best model in this codebase + +Practical bottom line: +- keep the `body-only` skeleton baseline as the mainline path +- keep the retained best checkpoint family as the working artifact +- treat DRF as an optional follow-up branch, not the current winner + +Current strongest practical checkpoint: +- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1` + +If future work claims to improve on this, it should: +1. beat the current retained best checkpoint on full-test macro-F1 +2. be verifiable by standalone eval from a saved checkpoint +3. state clearly whether the change is paper-faithful or purely empirical
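As a convenience for point 1 above, the `Acc / Prec / Rec / F1` quadruples quoted throughout this note can be recomputed from saved predictions without the training stack. A minimal sketch, assuming binary labels and macro averaging (the exact averaging convention behind the paper-side numbers is itself an assumption here):

```python
import numpy as np

def macro_metrics(y_true, y_pred, num_classes=2):
    """Accuracy plus macro-averaged precision/recall/F1.

    Assumed to mirror the `Acc / Prec / Rec / F1` quadruples quoted in this
    note; the papers do not fully specify their averaging convention.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = float((y_true == y_pred).mean())
    precs, recs, f1s = [], [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precs.append(p); recs.append(r); f1s.append(f)
    return acc, float(np.mean(precs)), float(np.mean(recs)), float(np.mean(f1s))

# Tiny worked example (hypothetical labels, not real Scoliosis1K outputs):
acc, prec, rec, f1 = macro_metrics([0, 0, 1, 1], [0, 1, 1, 1])
print(f"{acc:.2f} Acc / {prec:.2f} Prec / {rec:.2f} Rec / {f1:.2f} F1")
# -> 0.75 Acc / 0.83 Prec / 0.75 Rec / 0.73 F1
```

Recomputing the retained checkpoint's quadruple this way, from a standalone-eval prediction dump, is the cheapest way to satisfy the verification requirement above.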