# Scoliosis1K Reproducibility Audit
This note is the current audit of what is and is not reproducible in the Scoliosis1K ScoNet/DRF work inside this repo.

It separates three questions that should not be mixed together:

- is the repo/train stack itself working?
- is the paper-level DRF claim reproducible?
- what is the strongest practical model we have right now?

Related notes:

- [ScoNet and DRF: Status, Architecture, and Reproduction Notes](sconet-drf-status-and-training.md)
- [Scoliosis Training Change Log](scoliosis_training_change_log.md)
- [Scoliosis: Next Experiments](scoliosis_next_experiments.md)
- [DRF Author Checkpoint Compatibility Note](drf_author_checkpoint_compat.md)

Primary references:

- ScoNet paper: [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex)
- DRF paper: [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)

## Executive Summary

Current audit conclusion:

- the repo and training stack are working
- the skeleton branch is learnable
- the published DRF result is still not independently reproducible here
- the best practical model in this repo is currently not DRF

Current practical winner:

- model family: `ScoNet-MT-ske`
- split: `1:1:2`
- representation: `body-only`
- loss: plain CE + triplet
- optimizer path: later `AdamW` cosine finetune
- verified best retained checkpoint:
  - `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`

That means:

- practical model development is in a good state
- paper-faithful DRF reproduction is not

## What We Can Say With High Confidence
### 1. The core training and evaluation stack works

Evidence:

- the silhouette ScoNet path behaves sensibly
- train/eval loops, checkpointing, resume, and standalone eval all work
- the strong practical skeleton result is reproducible from a saved checkpoint

Conclusion:

- the repo is not globally broken
- the main remaining issues are method-specific, not infrastructure-wide

### 2. The raw Scoliosis1K pose data is usable

Evidence:

- earlier dataset analysis showed high pose confidence and stable sequence lengths
- the skeleton branch eventually learns well on the easier practical split

Conclusion:

- the raw pose source is not the main blocker

### 3. The skeleton branch is learnable

Evidence:

- on `1:1:2`, the skeleton branch moved from poor early baselines to a strong retained checkpoint family
- the best verified practical result is:
  - `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`

Conclusion:

- the skeleton path is not dead
- earlier failures on harder settings were not proof that skeleton input is fundamentally wrong

## What Is Reproducible
### 4. The silhouette ScoNet path is reproducible enough to trust

Evidence:

- the silhouette pipeline and trainer behave consistently
- strong silhouette checkpoints evaluate sensibly on their intended split family

Conclusion:

- silhouette ScoNet remains a valid sanity anchor for this repo

### 5. The high-level DRF idea is reproducible

Evidence:

- the DRF paper defines the method as `skeleton map + PAV + PGA`
- this repo now contains:
  - a DRF model
  - DRF-specific preprocessing
  - PAV generation
  - PGA integration

Conclusion:

- the architecture-level idea is implementable
- a plausible DRF implementation exists locally

### 6. The PAV concept is reproducible

Evidence:

- PAV metrics were implemented and produced stable sequence-level signals
- the earlier analysis showed PAV still carries useful class signal

Conclusion:

- “we could not build the clinical prior” is not the main explanation anymore

## What Is Only Partially Reproducible
### 7. The skeleton-map branch is only partially specified by the papers

The papers define the representation conceptually, but not in enough detail for quantitative reproduction from text alone.

Missing or under-specified details included:

- numeric Gaussian widths
- joint-vs-limb relative weighting
- crop and alignment policy
- resize/padding policy
- quantization/dtype behavior
- runtime transform assumptions

Why that matters:

- small rasterization and alignment changes moved results a lot
- many early failures came from details the paper did not pin down tightly enough

Conclusion:

- the paper gives the representation idea
- it does not fully specify the winning implementation

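To make the under-specification concrete, here is a minimal sketch of one plausible rasterization in NumPy. The canvas size, `sigma`, per-joint weight, and max-composition rule are all assumptions for illustration; none of them are fixed by the papers, which is exactly the problem described above.

```python
import numpy as np

def rasterize_skeleton(joints, canvas=(64, 64), sigma=1.5, joint_weight=1.0):
    """Render 2D joints as Gaussian blobs into a single-channel skeleton map.

    joints: (J, 2) array of (x, y) in [0, 1] normalized coordinates.
    sigma, joint_weight, and canvas are free parameters the papers leave open.
    """
    h, w = canvas
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.zeros((h, w), dtype=np.float32)
    for x, y in joints:
        cx, cy = x * (w - 1), y * (h - 1)
        blob = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        # max-compose blobs; sum-composition is another equally plausible choice
        heatmap = np.maximum(heatmap, joint_weight * blob)
    return heatmap

# two joints, roughly top-of-spine and pelvis
m = rasterize_skeleton(np.array([[0.5, 0.2], [0.5, 0.8]]))
print(m.shape, float(m.max()))
```

Every free parameter here (plus crop, resize, and dtype handling downstream) is a degree of freedom that moved results in the local experiments.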
### 8. The visualization story is only partially reproducible

The papers cite visualization families, but do not specify enough extraction detail to reproduce the exact figures cleanly.

Missing detail included:

- which layer
- before or after temporal pooling
- exact normalization/rendering procedure

Conclusion:

- local visualization work is useful for debugging
- it should not be treated as paper-faithful evidence

## What Is Not Reproduced
### 9. The published DRF quantitative claim is not reproduced here

Paper-side numbers:

- `ScoNet-MT-ske`: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
- `DRF`: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`

Local practical DRF result on the current workable `1:1:2` path:

- best retained DRF checkpoint (`2000`) full test:
  - `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1`

Conclusion:

- DRF currently loses to the stronger plain skeleton baseline in this repo
- the published DRF advantage is not established locally

### 9a. The author-provided DRF checkpoint is partially recoverable

This changed one important part of the audit:

- the author checkpoint itself is not unusable
- the earlier very poor local eval was mostly a compatibility failure

The recovered best author-checkpoint path is:

- config: `configs/drf/drf_author_eval_118_aligned_1gpu.yaml`
- result:
  - `80.24 Acc / 76.73 Prec / 76.40 Rec / 76.56 F1`

This is still below the paper's DRF headline result:

- `86.0 / 84.1 / 79.2 / 80.8`

But it is far better than the earlier broken compat evals, which means:

- the weight file is real
- the stale author YAML is not a reliable runtime contract

The main causes were:

- split mismatch:
  - checkpoint name says `118`
  - provided YAML points to `112`
- class-order mismatch:
  - author stub uses `negative=0, positive=1, neutral=2`
  - repo evaluator assumes `negative=0, neutral=1, positive=2`
- legacy module naming mismatch:
  - `attention_layer.*` vs `PGA.*`
- preprocessing/runtime mismatch:
  - the checkpoint aligns much better with `Scoliosis1K-drf-pkl-118-aligned`
  - it performs very badly on the local `118-paper` export

Conclusion:

- the author checkpoint can be made meaningfully usable in this repo
- but the provided bundle still does not fully specify the original training/eval contract

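The class-order and module-naming mismatches are mechanical to fix once identified. A minimal sketch in plain Python, operating on a state_dict-shaped dict; the `attention_layer.` → `PGA.` prefix rename and the two class orderings are taken from the findings above, while the remaining key names and helper names are illustrative only.

```python
# Class orderings from the compat findings: index -> class name.
AUTHOR_ORDER = ["negative", "positive", "neutral"]   # author stub: 0, 1, 2
REPO_ORDER   = ["negative", "neutral", "positive"]   # repo evaluator: 0, 1, 2

# index_map[author_index] -> repo_index
INDEX_MAP = {i: REPO_ORDER.index(name) for i, name in enumerate(AUTHOR_ORDER)}

def remap_state_dict_keys(state_dict):
    """Rename the legacy module prefix so the repo model can load the weights."""
    prefix = "attention_layer."
    return {
        (("PGA." + k[len(prefix):]) if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }

def remap_labels(author_labels):
    """Translate class indices emitted under the author ordering into repo ordering."""
    return [INDEX_MAP[i] for i in author_labels]

legacy = {"attention_layer.weight": "...", "backbone.conv1.weight": "..."}
print(sorted(remap_state_dict_keys(legacy)))  # ['PGA.weight', 'backbone.conv1.weight']
print(remap_labels([0, 1, 2]))                # [0, 2, 1]
```

The same permutation would also have to be applied to the classifier head's output rows if logits (rather than predicted labels) are compared; the split and preprocessing mismatches are data-side and cannot be fixed in code this way.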
### 10. The paper-level `1:1:8` skeleton story is not reproduced here

What happened locally:

- `1:1:8` skeleton runs remained much weaker and more unstable
- the stronger practical result came from moving to `1:1:2`

Conclusion:

- the hard `1:1:8` regime is still unresolved here
- current local evidence says class distribution is a major part of the failure mode

## Practical Findings From The Search
### 11. Representation findings

Current local ranking:

- `body-only` is the best practical representation so far
- `head-lite` helped some small proxy runs but did not transfer to the full test set
- `full-body` has not yet beaten the best `body-only` checkpoint family

Concrete comparison:

- `body-only + plain CE` full test at `7000`:
  - `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1`
- `head-lite + plain CE` full test at `7000`:
  - `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`

Conclusion:

- the stable useful signal is mainly torso-centered
- adding limited head information did not improve full-test performance

### 12. Loss and optimizer findings

Current local ranking:

- on the practical `1:1:2` branch, `plain CE` beat `weighted CE`
- `SGD` produced the first strong baseline
- a later `AdamW` cosine finetune beat that baseline substantially
- an earlier `AdamW` multistep finetune was unstable and inferior

Conclusion:

- the current best recipe is not “AdamW from scratch”
- it is “strong SGD-style baseline first, then a milder AdamW cosine finetune”

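The "milder AdamW cosine finetune" schedule can be sketched as the standard cosine annealing curve. The peak and floor learning rates and step count below are placeholders, not the values used in the actual runs:

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-4, lr_min=1e-6):
    """Cosine-annealed learning rate: starts at lr_max, decays smoothly to lr_min."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

# The recipe: load the strong SGD baseline checkpoint first, then run AdamW
# under this schedule, rather than training AdamW from scratch.
print(cosine_lr(0, 1000))     # starts at lr_max
print(cosine_lr(1000, 1000))  # ends at lr_min
```

In PyTorch this corresponds to `torch.optim.AdamW` paired with `torch.optim.lr_scheduler.CosineAnnealingLR`, initialized from the saved SGD checkpoint.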
### 13. DRF-specific finding

Current local interpretation:

- DRF is not failing because the skeleton branch is dead
- it is failing because the extra prior branch is not yet adding a strong enough complementary signal

Most likely reasons:

- the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`
- the current PAV/PGA path looks weakly selective
- DRF tended to peak early and then degrade, suggesting less stable optimization than the plain baseline

Conclusion:

- DRF is still a research branch here
- it is not the current practical winner

## Important Evaluation Caveat
### 14. There was no test-sample gradient leakage

For the long finetune run:

- training batches came only from `TRAIN_SET`
- repeated evaluation ran under `torch.no_grad()`
- no `TEST_SET` samples were used in backprop or optimizer updates

Conclusion:

- there was no train/test mixing inside gradient descent

### 15. There was test-set model-selection leakage

For the long finetune run:

- full `TEST_SET` eval was run repeatedly during training
- best-checkpoint retention used `test_accuracy` and `test_f1`
- the final retained winner was chosen with test metrics

That means:

- the retained best checkpoint is real and reproducible
- but its metric should be interpreted as a test-guided selected result, not a pristine one-shot final estimate

Conclusion:

- acceptable for practical model selection
- not ideal for a publication-style clean generalization claim

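For future runs, this leakage is avoidable by retaining checkpoints on a held-out validation split and touching the test set exactly once at the end. A minimal sketch, assuming per-checkpoint metric records (the field names are illustrative):

```python
def select_checkpoint(history, key="val_f1"):
    """Pick the retained checkpoint by validation metric only.

    history: list of dicts like {"step": int, "val_f1": float, "test_f1": float}.
    The test metric is carried along for the final one-shot report, but it
    must not influence which checkpoint wins.
    """
    return max(history, key=lambda rec: rec[key])

history = [
    {"step": 1000, "val_f1": 0.81, "test_f1": 0.79},
    {"step": 2000, "val_f1": 0.86, "test_f1": 0.84},
    {"step": 3000, "val_f1": 0.84, "test_f1": 0.88},  # test peak, but val says no
]
best = select_checkpoint(history)
print(best["step"], best["test_f1"])  # report the test_f1 of the val-selected step
```

Under this protocol the reported number is a clean one-shot estimate, at the cost of carving a validation split out of the already small training set.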
## Bottom Line

The repo currently supports:

- strong practical scoliosis model development
- stable skeleton-map experimentation
- plausible DRF implementation work

The repo does not currently support:

- claiming an independent reproduction of the DRF paper’s quantitative result
- claiming that the DRF paper fully specifies the winning skeleton-map preprocessing
- treating DRF as the default best model in this codebase

Practical bottom line:

- keep the `body-only` skeleton baseline as the mainline path
- keep the retained best checkpoint family as the working artifact
- treat DRF as an optional follow-up branch, not the current winner
- if DRF work continues, the author checkpoint on the aligned `118` path is now a much better starting point than scratch DRF

Current strongest practical checkpoint:

- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`

If future work claims to improve on this, it should:

1. beat the current retained best checkpoint on full-test macro-F1
2. be verifiable by standalone eval from a saved checkpoint
3. state clearly whether the change is paper-faithful or purely empirical
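For criterion 1, macro-F1 is computable without any framework, which makes the standalone-eval check easy to audit. A minimal sketch for the three-class setup (the example arrays are illustrative):

```python
def macro_f1(y_true, y_pred, num_classes=3):
    """Unweighted mean of per-class F1 scores over the label indices."""
    f1s = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / num_classes

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(round(macro_f1(y_true, y_pred), 4))  # 0.8222
```

This matches `sklearn.metrics.f1_score(..., average="macro")` and keeps the comparison metric independent of the training stack.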