# Scoliosis1K Reproducibility Audit
This note is the current audit of what is and is not reproducible in the Scoliosis1K ScoNet/DRF work inside this repo.
It separates three questions that should not be mixed together:
- is the repo/train stack itself working?
- is the paper-level DRF claim reproducible?
- what is the strongest practical model we have right now?
Related notes:
- [ScoNet and DRF: Status, Architecture, and Reproduction Notes](sconet-drf-status-and-training.md)
- [Scoliosis Training Change Log](scoliosis_training_change_log.md)
- [Scoliosis: Next Experiments](scoliosis_next_experiments.md)
- [DRF Author Checkpoint Compatibility Note](drf_author_checkpoint_compat.md)
Primary references:
- ScoNet paper: [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex)
- DRF paper: [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
## Executive Summary
Current audit conclusion:
- the repo and training stack are working
- the skeleton branch is learnable
- the published DRF result is still not independently reproducible here
- the best practical model in this repo is currently not DRF
Current practical winner:
- model family: `ScoNet-MT-ske`
- split: `1:1:2`
- representation: `body-only`
- loss: plain CE + triplet
- optimizer path: later `AdamW` cosine finetune
- verified best retained checkpoint:
- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
That means:
- practical model development is in a good state
- paper-faithful DRF reproduction is not
## What We Can Say With High Confidence
### 1. The core training and evaluation stack works
Evidence:
- the silhouette ScoNet path behaves sensibly
- train/eval loops, checkpointing, resume, and standalone eval all work
- the strong practical skeleton result is reproducible from a saved checkpoint
Conclusion:
- the repo is not globally broken
- the main remaining issues are method-specific, not infrastructure-wide
### 2. The raw Scoliosis1K pose data is usable
Evidence:
- earlier dataset analysis showed high pose confidence and stable sequence lengths
- the skeleton branch eventually learns well on the easier practical split
Conclusion:
- the raw pose source is not the main blocker
### 3. The skeleton branch is learnable
Evidence:
- on `1:1:2`, the skeleton branch moved from poor early baselines to a strong retained checkpoint family
- the best verified practical result is:
- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
Conclusion:
- the skeleton path is not dead
- earlier failures on harder settings were not proof that skeleton input is fundamentally wrong
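The `Acc / Prec / Rec / F1` quadruples quoted throughout this note can be recomputed from a per-class confusion matrix. A minimal sketch, assuming macro averaging over the three classes (matching the `full-test macro-F1` criterion in the Bottom Line); the counts below are hypothetical, not actual Scoliosis1K results:

```python
# Hypothetical 3-class confusion matrix, class order negative=0, neutral=1, positive=2.
# conf[i][j] = number of samples with true class i predicted as class j.
conf = [
    [50, 3, 2],
    [4, 30, 6],
    [1, 5, 40],
]

def macro_metrics(conf):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = conf[c][c]
        pred_c = sum(conf[i][c] for i in range(n))  # column sum: predicted as c
        true_c = sum(conf[c])                       # row sum: truly c
        p = tp / pred_c if pred_c else 0.0
        r = tp / true_c if true_c else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return correct / total, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

acc, prec, rec, f1 = macro_metrics(conf)
print(f"{acc*100:.2f} Acc / {prec*100:.2f} Prec / {rec*100:.2f} Rec / {f1*100:.2f} F1")
```

Any future eval claiming to beat the retained checkpoint should state whether its numbers use this macro averaging or a weighted variant, since the two diverge under the skewed class distribution.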
## What Is Reproducible
### 4. The silhouette ScoNet path is reproducible enough to trust
Evidence:
- the silhouette pipeline and trainer behave consistently
- strong silhouette checkpoints evaluate sensibly on their intended split family
Conclusion:
- silhouette ScoNet remains a valid sanity anchor for this repo
### 5. The high-level DRF idea is reproducible
Evidence:
- the DRF paper defines the method as `skeleton map + PAV + PGA`
- this repo now contains:
- a DRF model
- DRF-specific preprocessing
- PAV generation
- PGA integration
Conclusion:
- the architecture-level idea is implementable
- a plausible DRF implementation exists locally
### 6. The PAV concept is reproducible
Evidence:
- PAV metrics were implemented and produced stable sequence-level signals
- the earlier analysis showed PAV still carries useful class signal
Conclusion:
- “we could not build the clinical prior” is not the main explanation anymore
## What Is Only Partially Reproducible
### 7. The skeleton-map branch is only partially specified by the papers
The papers define the representation conceptually, but not enough for quantitative reproduction from text alone.
Missing or under-specified details included:
- numeric Gaussian widths
- joint-vs-limb relative weighting
- crop and alignment policy
- resize/padding policy
- quantization/dtype behavior
- runtime transform assumptions
Why that matters:
- small rasterization and alignment changes moved results a lot
- many early failures came from details the paper did not pin down tightly enough
Conclusion:
- the paper gives the representation idea
- it does not fully specify the winning implementation
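The sensitivity to unpinned rasterization detail can be made concrete with a toy joint-heatmap sketch. The grid size, σ values, and joint coordinates here are hypothetical illustrations, not the values used by the paper or by this repo's preprocessing:

```python
import math

def rasterize_joint(x, y, sigma, size=8):
    """Render one joint as an unnormalized 2D Gaussian on a size x size grid."""
    return [
        [math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2 * sigma ** 2))
         for c in range(size)]
        for r in range(size)
    ]

# The same joint at two plausible Gaussian widths: the peak location is
# identical, but the mass spread (what the network actually sees) differs.
narrow = rasterize_joint(3.0, 4.0, sigma=0.5)
wide = rasterize_joint(3.0, 4.0, sigma=2.0)
spread_narrow = sum(map(sum, narrow))
spread_wide = sum(map(sum, wide))
```

Here `spread_wide > spread_narrow`, which is the whole point: an unspecified σ (and likewise crop, resize, or dtype policy) silently changes the effective input representation even when the joint coordinates are identical.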
### 8. The visualization story is only partially reproducible
The papers cite visualization families, but do not specify enough extraction detail to reproduce the exact figures cleanly.
Missing detail included:
- which layer
- before or after temporal pooling
- exact normalization/rendering procedure
Conclusion:
- local visualization work is useful for debugging
- it should not be treated as paper-faithful evidence
## What Is Not Reproduced
### 9. The published DRF quantitative claim is not reproduced here
Paper-side numbers:
- `ScoNet-MT-ske`: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
- `DRF`: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`
Local practical DRF result on the current workable `1:1:2` path:
- best retained DRF checkpoint (`2000`) full test:
- `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1`
Conclusion:
- DRF currently loses to the stronger plain skeleton baseline in this repo
- the published DRF advantage is not established locally
### 9a. The author-provided DRF checkpoint is partially recoverable
This changed one important part of the audit:
- the author checkpoint itself is not unusable
- the earlier very poor local eval was mostly a compatibility failure
The recovered best author-checkpoint path is:
- config: `configs/drf/drf_author_eval_118_aligned_1gpu.yaml`
- result:
- `80.24 Acc / 76.73 Prec / 76.40 Rec / 76.56 F1`
This is still below the paper's DRF headline result:
- `86.0 / 84.1 / 79.2 / 80.8`
But it is far better than the earlier broken compat evals, which means:
- the weight file is real
- the stale author YAML is not a reliable runtime contract
The main causes were:
- split mismatch:
- checkpoint name says `118`
- provided YAML points to `112`
- class-order mismatch:
- author stub uses `negative=0, positive=1, neutral=2`
- repo evaluator assumes `negative=0, neutral=1, positive=2`
- legacy module naming mismatch:
- `attention_layer.*` vs `PGA.*`
- preprocessing/runtime mismatch:
- the checkpoint aligns much better with `Scoliosis1K-drf-pkl-118-aligned`
- it performs very badly on the local `118-paper` export
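The naming and class-order mismatches above are mechanical to patch once identified. A minimal sketch operating on a plain dict standing in for the checkpoint `state_dict`; the key names follow the `attention_layer.*` vs `PGA.*` mismatch, and the permutation maps the author class order (`negative=0, positive=1, neutral=2`) into the repo order (`negative=0, neutral=1, positive=2`):

```python
def remap_keys(state_dict, old_prefix="attention_layer.", new_prefix="PGA."):
    """Rename legacy module keys so they match the current model definition."""
    return {
        (new_prefix + k[len(old_prefix):]) if k.startswith(old_prefix) else k: v
        for k, v in state_dict.items()
    }

# Author order: negative=0, positive=1, neutral=2
# Repo order:   negative=0, neutral=1,  positive=2
# AUTHOR_TO_REPO[author_idx] = repo_idx
AUTHOR_TO_REPO = [0, 2, 1]

def remap_class_rows(rows):
    """Reorder per-class rows (e.g. classifier weights) into the repo class order."""
    repo_rows = [None] * len(rows)
    for author_idx, row in enumerate(rows):
        repo_rows[AUTHOR_TO_REPO[author_idx]] = row
    return repo_rows

# Toy example standing in for a real checkpoint.
ckpt = {"attention_layer.qkv.weight": [1.0], "backbone.conv1.weight": [2.0]}
fixed = remap_keys(ckpt)
```

The split and preprocessing mismatches cannot be patched this way; they have to be resolved by choosing the matching dataset export (`Scoliosis1K-drf-pkl-118-aligned`), which is what the recovered config does.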
Conclusion:
- the author checkpoint can be made meaningfully usable in this repo
- but the provided bundle still does not fully specify the original training/eval contract
### 10. The paper-level `1:1:8` skeleton story is not reproduced here
What happened locally:
- `1:1:8` skeleton runs remained much weaker and more unstable
- the stronger practical result came from moving to `1:1:2`
Conclusion:
- the hard `1:1:8` regime is still unresolved here
- current local evidence says class distribution is a major part of the failure mode
## Practical Findings From The Search
### 11. Representation findings
Current local ranking:
- `body-only` is the best practical representation so far
- `head-lite` helped some small proxy runs but did not transfer to the full test set
- `full-body` has not yet beaten the best `body-only` checkpoint family
Concrete comparison:
- `body-only + plain CE` full test at `7000`:
- `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1`
- `head-lite + plain CE` full test at `7000`:
- `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`
Conclusion:
- the stable useful signal is mainly torso-centered
- adding limited head information did not improve full-test performance
### 12. Loss and optimizer findings
Current local ranking:
- on the practical `1:1:2` branch, `plain CE` beat `weighted CE`
- `SGD` produced the first strong baseline
- later `AdamW` cosine finetune beat that baseline substantially
- earlier `AdamW` multistep finetune was unstable and inferior
Conclusion:
- the current best recipe is not “AdamW from scratch”
- it is “strong SGD-style baseline first, then milder AdamW cosine finetune”
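The "milder AdamW cosine finetune" stage can be sketched as a standard cosine decay from a reduced base LR applied on top of the SGD checkpoint. The numeric values are hypothetical, not the repo's actual config:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate, decaying from base_lr to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Hypothetical finetune horizon and base LR: small, smooth decay rather than
# the abrupt multistep drops that proved unstable in the earlier AdamW run.
TOTAL, BASE = 10_000, 1e-4
schedule = [cosine_lr(s, TOTAL, BASE) for s in (0, 5_000, 10_000)]
```

The smooth monotone decay (full LR at step 0, half at the midpoint, ~0 at the end) is the property that distinguishes this recipe from the multistep variant; the repo's own config values should be read from the YAML, not from this sketch.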
### 13. DRF-specific finding
Current local interpretation:
- DRF is not failing because the skeleton branch is dead
- it is failing because the extra prior branch is not yet adding a strong enough complementary signal
Most likely reasons:
- the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`
- the current PAV/PGA path looks weakly selective
- DRF tended to peak early and then degrade, suggesting less stable optimization than the plain baseline
Conclusion:
- DRF is still a research branch here
- it is not the current practical winner
## Important Evaluation Caveat
### 14. There was no test-sample gradient leakage
For the long finetune run:
- training batches came only from `TRAIN_SET`
- repeated evaluation ran under `torch.no_grad()`
- no `TEST_SET` samples were used in backprop or optimizer updates
Conclusion:
- there was no train/test mixing inside gradient descent
### 15. There was test-set model-selection leakage
For the long finetune run:
- full `TEST_SET` eval was run repeatedly during training
- best-checkpoint retention used `test_accuracy` and `test_f1`
- the final retained winner was chosen with test metrics
That means:
- the retained best checkpoint is real and reproducible
- but its metric should be interpreted as a test-guided selected result, not a pristine one-shot final estimate
Conclusion:
- acceptable for practical model selection
- not ideal for a publication-style clean generalization claim
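The distinction between points 14 and 15 can be made concrete: the clean protocol selects the retained checkpoint on a held-out validation metric and only then reads the test metric once. A minimal sketch with hypothetical per-checkpoint scores:

```python
# Hypothetical (val_f1, test_f1) pairs logged at successive checkpoints.
checkpoints = {
    1000: (0.80, 0.82),
    2000: (0.84, 0.85),
    3000: (0.83, 0.89),  # test peak that validation selection would not pick
}

# Leaky protocol (what the long finetune run did): retain by test_f1.
leaky_pick = max(checkpoints, key=lambda k: checkpoints[k][1])

# Clean protocol: retain by val_f1, then report that checkpoint's test_f1 once.
clean_pick = max(checkpoints, key=lambda k: checkpoints[k][0])
reported_test_f1 = checkpoints[clean_pick][1]
```

In this toy setup the leaky protocol retains checkpoint `3000` and reports `0.89`, while the clean protocol retains `2000` and reports `0.85`; only the latter number is a one-shot generalization estimate. This is why the retained `92.38` result should be labeled test-guided.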
## Bottom Line
The repo currently supports:
- strong practical scoliosis model development
- stable skeleton-map experimentation
- plausible DRF implementation work
The repo does not currently support:
- claiming an independent reproduction of the DRF paper's quantitative result
- claiming that the DRF paper fully specifies the winning skeleton-map preprocessing
- treating DRF as the default best model in this codebase
Practical bottom line:
- keep the `body-only` skeleton baseline as the mainline path
- keep the retained best checkpoint family as the working artifact
- treat DRF as an optional follow-up branch, not the current winner
- if DRF work continues, the author checkpoint on the aligned `118` path is now a much better starting point than scratch DRF
Current strongest practical checkpoint:
- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
If future work claims to improve on this, it should:
1. beat the current retained best checkpoint on full-test macro-F1
2. be verifiable by standalone eval from a saved checkpoint
3. state clearly whether the change is paper-faithful or purely empirical