# Scoliosis1K Reproducibility Audit
This note is the current audit of what is and is not reproducible in the Scoliosis1K ScoNet/DRF work inside this repo.
It separates three questions that should not be mixed together:
- is the repo/train stack itself working?
- is the paper-level DRF claim reproducible?
- what is the strongest practical model we have right now?
Related notes:
- [ScoNet and DRF: Status, Architecture, and Reproduction Notes](sconet-drf-status-and-training.md)
- [Scoliosis Training Change Log](scoliosis_training_change_log.md)
- [Scoliosis: Next Experiments](scoliosis_next_experiments.md)
Primary references:
- ScoNet paper: [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex)
- DRF paper: [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
## Executive Summary
Current audit conclusion:
- the repo and training stack are working
- the skeleton branch is learnable
- the published DRF result is still not independently reproducible here
- the best practical model in this repo is currently not DRF
Current practical winner:
- model family: `ScoNet-MT-ske`
- split: `1:1:2`
- representation: `body-only`
- loss: plain CE + triplet
- optimizer path: `SGD` baseline followed by a later `AdamW` cosine finetune
- verified best retained checkpoint:
- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
That means:
- practical model development is in a good state
- paper-faithful DRF reproduction is not
## What We Can Say With High Confidence
### 1. The core training and evaluation stack works
Evidence:
- the silhouette ScoNet path behaves sensibly
- train/eval loops, checkpointing, resume, and standalone eval all work
- the strong practical skeleton result is reproducible from a saved checkpoint
Conclusion:
- the repo is not globally broken
- the main remaining issues are method-specific, not infrastructure-wide
### 2. The raw Scoliosis1K pose data is usable
Evidence:
- earlier dataset analysis showed high pose confidence and stable sequence lengths
- the skeleton branch eventually learns well on the easier practical split
Conclusion:
- the raw pose source is not the main blocker
### 3. The skeleton branch is learnable
Evidence:
- on `1:1:2`, the skeleton branch moved from poor early baselines to a strong retained checkpoint family
- the best verified practical result is:
- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
Conclusion:
- the skeleton path is not dead
- earlier failures on harder settings were not proof that skeleton input is fundamentally wrong
## What Is Reproducible
### 4. The silhouette ScoNet path is reproducible enough to trust
Evidence:
- the silhouette pipeline and trainer behave consistently
- strong silhouette checkpoints evaluate sensibly on their intended split family
Conclusion:
- silhouette ScoNet remains a valid sanity anchor for this repo
### 5. The high-level DRF idea is reproducible
Evidence:
- the DRF paper defines the method as `skeleton map + PAV + PGA`
- this repo now contains:
- a DRF model
- DRF-specific preprocessing
- PAV generation
- PGA integration
Conclusion:
- the architecture-level idea is implementable
- a plausible DRF implementation exists locally
### 6. The PAV concept is reproducible
Evidence:
- PAV metrics were implemented and produced stable sequence-level signals
- the earlier analysis showed PAV still carries useful class signal
Conclusion:
- “we could not build the clinical prior” is not the main explanation anymore
## What Is Only Partially Reproducible
### 7. The skeleton-map branch is only partially specified by the papers
The papers define the representation conceptually, but not enough for quantitative reproduction from text alone.
Missing or under-specified details included:
- numeric Gaussian widths
- joint-vs-limb relative weighting
- crop and alignment policy
- resize/padding policy
- quantization/dtype behavior
- runtime transform assumptions
Why that matters:
- small rasterization and alignment changes shifted results substantially
- many early failures came from details the paper did not pin down tightly enough
Conclusion:
- the paper gives the representation idea
- it does not fully specify the winning implementation
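To make the under-specified knobs concrete, here is a minimal joint-rasterization sketch. Everything numeric in it is a placeholder (canvas size, `sigma`, `joint_weight` are illustrative assumptions, not the repo's actual settings); it only shows where each missing detail enters the pipeline.

```python
import numpy as np

def rasterize_joints(joints, H=64, W=64, sigma=1.5, joint_weight=1.0):
    """Render 2D joints (N, 2) with coordinates in [0, 1] into a Gaussian map.

    sigma, joint_weight, and the (H, W) canvas are exactly the knobs the
    papers leave unspecified; limb rendering and crop/align policy would
    add further free parameters on top of these.
    """
    ys, xs = np.mgrid[0:H, 0:W]
    canvas = np.zeros((H, W), dtype=np.float32)
    for x, y in joints:
        cx, cy = x * (W - 1), y * (H - 1)
        canvas += joint_weight * np.exp(
            -((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2)
        )
    # Quantization/dtype policy is another unspecified choice; clipping to
    # [0, 1] float32 here is one arbitrary option among several.
    return np.clip(canvas, 0.0, 1.0)
```

Changing `sigma` or the crop box rescales every pixel the downstream network sees, which is consistent with the observation that small rasterization changes moved results a lot.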
### 8. The visualization story is only partially reproducible
The papers cite visualization families, but do not specify enough extraction detail to reproduce the exact figures cleanly.
Missing detail included:
- which layer
- before or after temporal pooling
- exact normalization/rendering procedure
Conclusion:
- local visualization work is useful for debugging
- it should not be treated as paper-faithful evidence
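The "which layer, before or after temporal pooling" ambiguity can be pinned down locally with a forward hook. The toy backbone below is a stand-in, not the repo's model; it only demonstrates capturing per-frame features pre-pooling versus the pooled tensor post-pooling.

```python
import torch
import torch.nn as nn

# Toy stand-in backbone; the real model's layer names will differ.
backbone = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()  # per-frame activation, pre-pooling
    return hook

backbone[0].register_forward_hook(save_activation("conv1"))

x = torch.randn(2, 5, 1, 16, 16)                   # (batch, time, C, H, W)
feats = backbone(x.flatten(0, 1))                  # frames folded into batch
pooled = feats.view(2, 5, 8, 16, 16).amax(dim=1)   # post temporal pooling
```

Whatever normalization/rendering is then applied to `captured["conv1"]` versus `pooled` is a local choice, which is why such visualizations are debugging aids rather than paper-faithful evidence.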
## What Is Not Reproduced
### 9. The published DRF quantitative claim is not reproduced here
Paper-side numbers:
- `ScoNet-MT-ske`: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
- `DRF`: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`
Local practical DRF result on the current workable `1:1:2` path:
- best retained DRF checkpoint (`2000`) full test:
- `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1`
Conclusion:
- DRF currently loses to the stronger plain skeleton baseline in this repo
- the published DRF advantage is not established locally
### 10. The paper-level `1:1:8` skeleton story is not reproduced here
What happened locally:
- `1:1:8` skeleton runs remained much weaker and more unstable
- the stronger practical result came from moving to `1:1:2`
Conclusion:
- the hard `1:1:8` regime is still unresolved here
- current local evidence says class distribution is a major part of the failure mode
## Practical Findings From The Search
### 11. Representation findings
Current local ranking:
- `body-only` is the best practical representation so far
- `head-lite` helped some small proxy runs but did not transfer to the full test set
- `full-body` has not yet beaten the best `body-only` checkpoint family
Concrete comparison:
- `body-only + plain CE` full test at `7000`:
- `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1`
- `head-lite + plain CE` full test at `7000`:
- `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`
Conclusion:
- the stable useful signal is mainly torso-centered
- adding limited head information did not improve full-test performance
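A sketch of what the representation variants mean operationally, assuming a hypothetical COCO-17 keypoint layout (the repo's actual joint index map and subset definitions may differ):

```python
import numpy as np

# Hypothetical COCO-17 layout; indices are an assumption for illustration.
HEAD = [0, 1, 2, 3, 4]          # nose, eyes, ears
BODY = list(range(5, 17))       # shoulders down to ankles

def select_joints(pose_seq, variant="body-only"):
    """pose_seq: (T, 17, 2) array. Returns the joint subset for a variant."""
    if variant == "body-only":
        keep = BODY
    elif variant == "head-lite":
        keep = BODY + [0]       # body plus a single head anchor point
    elif variant == "full-body":
        keep = HEAD + BODY
    else:
        raise ValueError(variant)
    return pose_seq[:, keep, :]

seq = np.zeros((30, 17, 2), dtype=np.float32)
```

The finding above is that the `body-only` subset already carries the dominant torso-centered signal, so widening the subset has not paid off on the full test set.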
### 12. Loss and optimizer findings
Current local ranking:
- on the practical `1:1:2` branch, `plain CE` beat `weighted CE`
- `SGD` produced the first strong baseline
- later `AdamW` cosine finetune beat that baseline substantially
- earlier `AdamW` multistep finetune was unstable and inferior
Conclusion:
- the current best recipe is not “AdamW from scratch”
- it is “strong SGD-style baseline first, then milder AdamW cosine finetune”
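The two-phase recipe can be sketched as follows. The model, learning rates, step counts, and weight decays are illustrative assumptions, not the tuned values from the change log; only the phase structure (SGD first, then a milder AdamW finetune with cosine decay) reflects the finding above.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 3)  # stand-in for the skeleton branch

# Phase 1: SGD-style baseline (hyperparameters here are illustrative).
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
for step in range(100):
    loss = nn.functional.cross_entropy(
        model(torch.randn(8, 16)), torch.randint(0, 3, (8,)))
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: milder AdamW finetune with cosine decay, warm-started from phase 1.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
for step in range(100):
    loss = nn.functional.cross_entropy(
        model(torch.randn(8, 16)), torch.randint(0, 3, (8,)))
    opt.zero_grad(); loss.backward(); opt.step(); sched.step()
```

The cosine schedule decays the finetune learning rate smoothly to zero over `T_max` steps, which matches the observation that the earlier multistep AdamW schedule was the unstable variant.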
### 13. DRF-specific finding
Current local interpretation:
- DRF is not failing because the skeleton branch is dead
- it is failing because the extra prior branch is not yet adding a strong enough complementary signal
Most likely reasons:
- the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`
- the current PAV/PGA path looks weakly selective
- DRF tended to peak early and then degrade, suggesting less stable optimization than the plain baseline
Conclusion:
- DRF is still a research branch here
- it is not the current practical winner
## Important Evaluation Caveat
### 14. There was no test-sample gradient leakage
For the long finetune run:
- training batches came only from `TRAIN_SET`
- repeated evaluation ran under `torch.no_grad()`
- no `TEST_SET` samples were used in backprop or optimizer updates
Conclusion:
- there was no train/test mixing inside gradient descent
### 15. There was test-set model-selection leakage
For the long finetune run:
- full `TEST_SET` eval was run repeatedly during training
- best-checkpoint retention used `test_accuracy` and `test_f1`
- the final retained winner was chosen with test metrics
That means:
- the retained best checkpoint is real and reproducible
- but its metric should be interpreted as a test-guided selected result, not a pristine one-shot final estimate
Conclusion:
- acceptable for practical model selection
- not ideal for a publication-style clean generalization claim
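The distinction between the two selection protocols can be made concrete with simulated metrics (all numbers below are synthetic, not from the actual run):

```python
import random

random.seed(0)
# Simulated per-checkpoint metrics across a finetune run (synthetic values).
checkpoints = [{"step": s,
                "val_f1": random.gauss(0.85, 0.02),
                "test_f1": random.gauss(0.85, 0.02)}
               for s in range(0, 10000, 500)]

# What this run did: retain the checkpoint with the best *test* F1.
# Real, reproducible, but test-guided (model-selection leakage).
leaky_best = max(checkpoints, key=lambda c: c["test_f1"])

# Publication-style protocol: select on a held-out validation split,
# then report the test metric of that single selected checkpoint.
clean_best = max(checkpoints, key=lambda c: c["val_f1"])
reported = clean_best["test_f1"]
```

By construction `reported` can never exceed `leaky_best["test_f1"]`, which is exactly why a test-selected metric is an optimistic estimate of one-shot generalization.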
## Bottom Line
The repo currently supports:
- strong practical scoliosis model development
- stable skeleton-map experimentation
- plausible DRF implementation work
The repo does not currently support:
- claiming an independent reproduction of the DRF paper's quantitative result
- claiming that the DRF paper fully specifies the winning skeleton-map preprocessing
- treating DRF as the default best model in this codebase
Practical bottom line:
- keep the `body-only` skeleton baseline as the mainline path
- keep the retained best checkpoint family as the working artifact
- treat DRF as an optional follow-up branch, not the current winner
Current strongest practical checkpoint:
- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
If future work claims to improve on this, it should:
1. beat the current retained best checkpoint on full-test macro-F1
2. be verifiable by standalone eval from a saved checkpoint
3. state clearly whether the change is paper-faithful or purely empirical
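For criterion 1, macro-F1 needs no framework dependency; a minimal reference implementation (function name and interface are illustrative, not a repo API):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over the given label set."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class contributes equally regardless of support, macro-F1 is the right selection metric for the imbalanced splits discussed above, and it is cheap to recompute in a standalone eval from a saved checkpoint.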