docs: update scoliosis reproducibility audit conclusion

2026-03-11 11:09:59 +08:00
parent c62bdee1f9
commit ede9690318
+251 -217
# Scoliosis1K Reproducibility Audit
This note is the current audit of which parts of the ScoNet and DRF papers are reproducible in this repo, which parts are only partially reproducible, and which parts remain under-specified or unsupported by local evidence.
It separates three questions that should not be mixed together:
- is the repo/train stack itself working?
- is the paper-level DRF claim reproducible?
- what is the strongest practical model we have right now?
Ground truth policy for this audit:
- the papers are the methodological source of truth
- local code and local runs are the implementation evidence
- when the two diverge, this document states that explicitly
## Papers and local references
Primary references:
- ScoNet paper: [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex)
- DRF paper: [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
Related notes:
- [ScoNet and DRF: Status, Architecture, and Reproduction Notes](sconet-drf-status-and-training.md)
- [Scoliosis Training Change Log](scoliosis_training_change_log.md)
- [Scoliosis: Next Experiments](scoliosis_next_experiments.md)
## Executive Summary
Current audit conclusion:
- the repo and training stack are working
- the skeleton branch is learnable
- the published DRF result is still not independently reproducible here
- the best practical model in this repo is currently not DRF
## What is reproducible
### 1. The silhouette ScoNet pipeline is reproducible
Evidence:
- The ScoNet paper states the standard `1:1:8` evaluation protocol and the SGD schedule clearly in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex).
- The same paper reports that multi-task ScoNet-MT is much stronger than single-task ScoNet, and the same tracked TeX source documents the class-ratio study.
- In this repo, the standard silhouette ScoNet path is stable:
- the model/trainer/evaluator path is intact
- a strong silhouette checkpoint reproduces cleanly on the correct split family
Conclusion:
- the Scoliosis1K silhouette modality is usable
- the core OpenGait training and evaluation stack is usable for this task
- the repo is not globally broken
### 2. The high-level DRF architecture is reproducible
Evidence:
- The DRF paper defines the method as `skeleton map + PAV + PGA` in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- The same source defines:
  - pelvis-centering and height normalization
  - two-channel skeleton maps
  - PAV metrics
  - PGA channel/spatial attention
- This repo now has a functioning DRF model and DRF-specific preprocessing path implementing those ideas.
Conclusion:
- the paper is specific enough to implement a plausible DRF model family
- the architecture-level claim is reproducible
- the exact paper-level quantitative result is not yet reproducible
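The pelvis-centering and height-normalization step above is simple enough to sketch. The joint layout and index choices below are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def normalize_pose(kpts, pelvis_idx=(11, 12), head_idx=0):
    """Pelvis-center and height-normalize one frame of 2D keypoints.

    kpts: (J, 2) array of (x, y) joint coordinates.
    pelvis_idx: the two hip joints (a COCO-style layout is assumed here).
    head_idx: joint used as the top reference for the height estimate.
    """
    kpts = np.asarray(kpts, dtype=np.float64)
    # Pelvis center = midpoint of the two hips; subtract it from every joint.
    pelvis = kpts[list(pelvis_idx)].mean(axis=0)
    centered = kpts - pelvis
    # Height proxy = vertical distance from the head joint to the pelvis.
    height = abs(centered[head_idx, 1])
    if height < 1e-6:
        return centered  # degenerate frame: leave unscaled
    return centered / height

frame = np.array([[0.0, 0.0],   # head joint
                  [-1.0, 5.0],  # left hip (hypothetical index)
                  [1.0, 5.0]])  # right hip (hypothetical index)
out = normalize_pose(frame, pelvis_idx=(1, 2), head_idx=0)
```

The point of the sketch is that the transform is scale- and translation-invariant per frame; whether the real pipeline applies it per frame or per sequence is one of the open details noted below.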
### 3. The PAV concept is reproducible
Evidence:
- The DRF paper defines 8 symmetric joint pairs and 3 asymmetry metrics in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- The local preprocessing implements those metrics and produces stable sequence-level PAVs.
- Local dataset analysis showed the PAV still carries useful signal, even with a simple probe.
Conclusion:
- PAV is not the main missing piece
- the main reproduction gap is not “we cannot build the clinical prior”
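As a concrete illustration of the PAV idea, here is a minimal numpy sketch of sequence-level asymmetry features. The joint pairs and the three per-pair quantities are hypothetical stand-ins, not the paper's exact 8 pairs and 3 metrics:

```python
import numpy as np

# Hypothetical symmetric joint pairs (left, right); the paper's actual
# pair list and metric definitions are not restated here.
PAIRS = [(5, 6), (7, 8), (9, 10), (11, 12)]

def asymmetry_features(kpts, pairs=PAIRS):
    """Per-frame asymmetry features from pelvis-centered 2D keypoints."""
    kpts = np.asarray(kpts, dtype=np.float64)
    feats = []
    for l, r in pairs:
        dy = kpts[l, 1] - kpts[r, 1]            # vertical offset of the pair
        dx = abs(kpts[l, 0]) - abs(kpts[r, 0])  # distance-to-midline mismatch
        tilt = np.arctan2(dy, kpts[l, 0] - kpts[r, 0])  # pair-axis tilt
        feats.extend([dy, dx, tilt])
    return np.array(feats)

def sequence_pav(frames):
    """Aggregate per-frame features into one sequence-level vector (mean)."""
    return np.mean([asymmetry_features(f) for f in frames], axis=0)

# A perfectly symmetric pose produces zero offset and zero midline mismatch.
sym = np.zeros((13, 2))
for l, r in PAIRS:
    sym[l] = [-1.0, float(l)]
    sym[r] = [1.0, float(l)]
pav = sequence_pav([sym, sym])
```

A symmetric skeleton zeroes the offset terms, which is why a simple probe on such a vector can already separate asymmetric gait from symmetric gait.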
## What is only partially reproducible
### 4. The skeleton-map branch is reproducible only at the concept level
Evidence:
- The DRF paper describes the skeleton map as a dense, silhouette-like two-channel representation in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- It does not specify crucial rasterization details such as:
  - numeric `sigma`
  - joint-vs-limb relative weighting
  - quantization and dtype
  - crop policy
  - resize policy
  - whether alignment is per-frame or per-sequence
- Local runs show these details matter a lot:
  - `sigma=8` skeleton runs were very poor
  - smaller sigma and fixed limb/joint alignment improved results materially
  - the best local skeleton baseline is still only `50.47 Acc / 48.63 Macro-F1`, far below the paper's `82.5 Acc / 76.6 F1` for ScoNet-MT-ske in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
Conclusion:
- the paper specifies the representation idea
- it does not specify enough to make the skeleton-map branch quantitatively reproducible from text alone
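A minimal sketch of the under-specified rasterization step shows why `sigma` alone changes the input so much. The map size, max-blending, and coordinate scaling below are all assumptions, not the paper's procedure:

```python
import numpy as np

def joint_channel(joints, size=64, sigma=2.0):
    """Rasterize 2D joints into one dense channel of a skeleton map.

    Each joint becomes an isotropic Gaussian blob. `sigma`, the map size,
    and max-blending are exactly the kind of knobs the paper leaves open.
    joints: (J, 2) coordinates already scaled into [0, size) pixel space.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    chan = np.zeros((size, size), dtype=np.float64)
    for x, y in joints:
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        chan = np.maximum(chan, blob)  # max-blend overlapping joints
    return chan

one_joint = np.array([[32.0, 32.0]])
m_small = joint_channel(one_joint, sigma=2.0)
m_large = joint_channel(one_joint, sigma=8.0)
```

With `sigma=8` the single joint covers a large fraction of the map, smearing nearby joints together; with `sigma=2` the blobs stay separable. That is consistent with the local observation that `sigma=8` runs were very poor.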
### 5. The visualization story is only partially reproducible
Evidence:
- The ScoNet paper cites the attention-transfer visualization family in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex).
- The DRF paper cites Zhou et al. CAM in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- Neither paper states:
  - which layer is visualized
  - whether visualization is before or after temporal pooling
  - the exact normalization/rendering procedure
- Local attempts suggest that only certain intermediate layers produce qualitatively plausible maps, and even then the results are approximations rather than faithful reproductions.
Conclusion:
- qualitative visualization claims are only partially reproducible
- they should not be treated as strong evidence until the extraction procedure is specified better
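For reference, the cited CAM mechanics themselves are simple to sketch; the unresolved part is which feature map to feed in and where temporal pooling happens, so everything below the docstring is illustrative:

```python
import numpy as np

def cam(feature_map, fc_weights, class_idx):
    """Zhou-et-al-style class activation map.

    feature_map: (C, H, W) activations from some convolutional layer.
    fc_weights: (num_classes, C) weights of the final linear classifier.
    Which layer to use, and whether temporal pooling happens first, is
    exactly what the papers do not pin down; this shows only the mechanics.
    """
    # Weight channels by the target class's classifier weights, sum over C.
    heat = np.tensordot(fc_weights[class_idx], feature_map, axes=([0], [0]))
    heat = np.maximum(heat, 0.0)          # keep positively weighted evidence
    rng = heat.max() - heat.min()
    return (heat - heat.min()) / rng if rng > 0 else heat

rng0 = np.random.default_rng(0)
feat = rng0.random((4, 8, 8))   # stand-in feature map
w = rng0.random((2, 4))         # stand-in classifier weights
heat = cam(feat, w, class_idx=1)
```

Because every choice of layer produces a map of the same shape, a plausible-looking figure is weak evidence on its own; that is why the local maps are treated as approximations here.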
## What is not reproducible from the paper and local materials alone
### 6. The paper-level ScoNet-MT-ske and DRF numbers are not reproducible yet
Evidence:
- The DRF paper reports, in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex):
  - ScoNet-MT-ske: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
  - DRF: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`
- The best local skeleton-map baseline so far is recorded in [scoliosis_training_change_log.md](scoliosis_training_change_log.md):
  - `50.47 Acc / 69.31 Prec / 54.58 Rec / 48.63 F1`
- Local DRF runs are also well below the paper:
  - `58.08 / 78.80 / 60.22 / 56.99`
  - `51.67 / 72.37 / 56.22 / 50.92`
Conclusion:
- the current repo can reproduce the idea of DRF
- it cannot reproduce the papers' reported skeleton/DRF metrics yet
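When comparing these four-number tuples it helps to fix the metric definitions. A small sketch, assuming the reported Prec/Rec/F1 are unweighted macro averages with per-class F1 computed before averaging:

```python
def macro_metrics(y_true, y_pred, classes):
    """Accuracy plus macro-averaged precision/recall/F1.

    Assumes unweighted macro averaging over `classes`; if the papers use a
    different averaging convention, the gap analysis would need adjusting.
    """
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    n = len(classes)
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n

acc, p, r, f1 = macro_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0],
                              classes=(0, 1, 2))
```

Note that under macro averaging, precision can sit well above F1 when one minority class has poor recall, which matches the shape of the local tuples above.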
### 7. The missing author-side training details are still unresolved
Evidence:
- The author-side DRF stub referenced a `BaseModel_body` path that was not released in the original materials.
- A local reconstruction suggests that `BaseModel_body` was probably thin, not the whole explanation for the metric gap.
- Even after matching the likely missing base-class contract more closely, the metric gap remained large.
Conclusion:
- the missing private code is probably not the only reason reproduction fails
- but the lack of released code still weakens the paper's reproducibility
### 8. The exact split accounting is slightly inconsistent
Evidence:
- The ScoNet and DRF papers describe the standard split as `745 train / 748 test` in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex) and [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- The released partition file matching the `1:1:8` class ratio in this repo is effectively `744 / 749`.
Conclusion:
- this is probably a release/text bookkeeping mismatch, not the main source of failure
- but it is another example that the paper protocol is not perfectly audit-friendly
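A check like the following keeps the split accounting auditable. The `TRAIN_SET`/`TEST_SET` key layout is an assumption about the partition file format used in this repo, and the ID lists below are synthetic stand-ins:

```python
import json

def audit_split(partition_json):
    """Count train/test entries in an OpenGait-style partition structure.

    The TRAIN_SET/TEST_SET key layout is an assumption about this repo's
    partition file, not something the papers specify.
    """
    part = json.loads(partition_json)
    train, test = part["TRAIN_SET"], part["TEST_SET"]
    overlap = sorted(set(train) & set(test))  # should always be empty
    return len(train), len(test), overlap

# Synthetic stand-in for the released partition file:
demo = json.dumps({
    "TRAIN_SET": [f"s{i:04d}" for i in range(744)],
    "TEST_SET": [f"s{i:04d}" for i in range(744, 1493)],
})
n_train, n_test, overlap = audit_split(demo)
```

Counting the released file this way is how a `744 / 749` effective split gets surfaced against the papers' stated `745 / 748`.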
## Current strongest local conclusions
### Reproducible with high confidence
- silhouette ScoNet runs are meaningful
- the Scoliosis1K raw pose data does not appear obviously broken
- the OpenGait training/evaluation infrastructure is not the main problem
- PAV computation is not the main blocker
- the skeleton branch is learnable on the easier `1:1:2` split
- on `1:1:2`, `body-only + weighted CE` reached `81.82 Acc / 66.21 Prec / 88.50 Rec / 65.96 F1` on the full test set
- on the same split, `body-only + plain CE` improved that further to `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1` at `7000`
- a later explicit rerun of the `body-only + plain CE` `7000` full-test eval reproduced that same `83.16 / 68.24 / 80.02 / 68.47` result
- a later `AdamW` cosine finetune from that same `10k` plain-CE checkpoint improved the practical result further:
  - verified retained best checkpoint at `27000`: `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
  - final `80000` checkpoint still remained strong: `90.64 Acc / 72.87 Prec / 93.19 Rec / 75.74 F1`
- adding back limited head context via `head-lite` did not improve the full-test score; its `7000` checkpoint reached only `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`
- the first practical DRF bridge on the same `1:1:2` body-only recipe peaked early and still underperformed the plain skeleton baseline; its best retained `2000` checkpoint reached only `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1` on the full test set
### Not reproducible with current evidence
- the papers' claimed skeleton-map baseline quality
- the papers' claimed DRF improvement magnitude
- the papers' qualitative response-map story as shown in the figures
### Most likely interpretation
- the papers are probably directionally correct
- but the skeleton-map and DRF pipelines are under-specified
- the missing implementation details are important enough that a faithful independent reproduction is not currently achievable from the paper text and released materials alone
- the `1:1:8` class ratio is not just a nuisance; it appears to be a major driver of the current skeleton/DRF failure mode
- on the easier `1:1:2` split, weighted CE is not currently the winning recipe; the best local full-test result so far came from plain CE
- `head-lite` may help the small fixed proxy subset, but that gain did not transfer to the full `TEST_SET`, so `body-only + plain CE` remains the best practical skeleton recipe
- once the practical `1:1:2` body-only plain-CE recipe was established, the branch still appeared underfit enough that optimizer/schedule mattered again. A later `AdamW` cosine finetune beat the earlier SGD bridge by a large margin at its retained best checkpoint, which means the earlier `83.16 / 68.47` result was a stable baseline but not the ceiling of this skeleton recipe
- DRF currently looks worse than the plain skeleton baseline not because the skeleton path is dead, but because the additional prior branch is not yet providing a selective or stable complement. The current local evidence points to three likely causes:
  - the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`, so PAV may be largely redundant in this setting
  - the current PGA/PAV path appears weakly selective in local diagnostics, so the prior is not clearly emphasizing a few clinically relevant parts
  - DRF peaks very early and then degrades, which suggests the added branch is making optimization less stable without improving the final decision boundary
## Recommended standard for future work in this repo
When a future change claims to improve DRF reproduction, it should satisfy all of the following:
1. beat the current best skeleton baseline on the fixed proxy protocol
2. remain stable across at least one full run, not just a short spike
3. state whether the change is:
   - paper-faithful
   - implementation-motivated
   - or purely empirical
4. avoid using silhouette success as evidence that the skeleton path is correct
## Practical bottom line
At the moment, this repo supports:
- faithful silhouette ScoNet experimentation
- plausible DRF implementation work
- structured debugging of the skeleton-map branch
At the moment, this repo does not yet support:
- claiming a successful independent reproduction of the DRF paper's quantitative results
- claiming that the papers' skeleton-map preprocessing is fully specified
- treating the papers' qualitative feature-response visualizations as reproduced
For practical model selection, the current conclusion is simpler:
- stop treating DRF as the default winner
- keep the practical mainline on `1:1:2`
- use the retained `body-only + plain CE` skeleton checkpoint family as the working solution
Current practical winner:
- model family: `ScoNet-MT-ske`
- split: `1:1:2`
- representation: `body-only`
- loss: plain CE + triplet
- optimizer path: later `AdamW` cosine finetune
- strongest verified practical checkpoint: the `AdamW` cosine finetune checkpoint at `27000`, with `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
That means:
- practical model development is in a good state
- paper-faithful DRF reproduction is not
The remaining work is no longer broad reproduction debugging. It is mostly optional refinement:
- confirm whether `body-only` really beats `full-body` under the same successful training recipe
- optionally retry DRF only after the strong practical skeleton baseline is fixed
- package and use the retained best checkpoint rather than continuing to churn the whole search space
## What We Can Say With High Confidence
### 1. The core training and evaluation stack works
Evidence:
- the silhouette ScoNet path behaves sensibly
- train/eval loops, checkpointing, resume, and standalone eval all work
- the strong practical skeleton result is reproducible from a saved checkpoint
Conclusion:
- the repo is not globally broken
- the main remaining issues are method-specific, not infrastructure-wide
### 2. The raw Scoliosis1K pose data is usable
Evidence:
- earlier dataset analysis showed high pose confidence and stable sequence lengths
- the skeleton branch eventually learns well on the easier practical split
Conclusion:
- the raw pose source is not the main blocker
### 3. The skeleton branch is learnable
Evidence:
- on `1:1:2`, the skeleton branch moved from poor early baselines to a strong retained checkpoint family
- the best verified practical result is:
  - `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
Conclusion:
- the skeleton path is not dead
- earlier failures on harder settings were not proof that skeleton input is fundamentally wrong
## What Is Reproducible
### 4. The silhouette ScoNet path is reproducible enough to trust
Evidence:
- the silhouette pipeline and trainer behave consistently
- strong silhouette checkpoints evaluate sensibly on their intended split family
Conclusion:
- silhouette ScoNet remains a valid sanity anchor for this repo
### 5. The high-level DRF idea is reproducible
Evidence:
- the DRF paper defines the method as `skeleton map + PAV + PGA`
- this repo now contains:
  - a DRF model
  - DRF-specific preprocessing
  - PAV generation
  - PGA integration
Conclusion:
- the architecture-level idea is implementable
- a plausible DRF implementation exists locally
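To make the attention side of that idea concrete, here is a generic channel-plus-spatial gating sketch driven by a prior vector. It mirrors the PGA idea only in spirit; every shape, weight, and the use of numpy rather than the repo's actual model code is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pga_like_gate(feat, pav, w_c, w_s):
    """Generic channel + spatial gating driven by a prior vector.

    feat: (C, H, W) feature map; pav: (P,) prior asymmetry vector.
    w_c: (C, P) maps the prior to per-channel gates.
    w_s: (1, C) maps channels to a single spatial gate map.
    This shows only the general mechanism, not the paper's actual module.
    """
    c_gate = sigmoid(w_c @ pav)              # (C,) channel attention
    feat = feat * c_gate[:, None, None]      # reweight channels by the prior
    s_gate = sigmoid(np.tensordot(w_s[0], feat, axes=([0], [0])))  # (H, W)
    return feat * s_gate[None, :, :]         # reweight spatial positions

rng = np.random.default_rng(0)
out = pga_like_gate(rng.random((4, 8, 8)), rng.random(6),
                    rng.random((4, 6)), rng.random((1, 4)))
```

The "weakly selective" diagnosis discussed later corresponds to gates that stay near a constant value, so the prior barely changes which channels or positions dominate.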
### 6. The PAV concept is reproducible
Evidence:
- PAV metrics were implemented and produced stable sequence-level signals
- the earlier analysis showed PAV still carries useful class signal
Conclusion:
- “we could not build the clinical prior” is not the main explanation anymore
## What Is Only Partially Reproducible
### 7. The skeleton-map branch is only partially specified by the papers
The papers define the representation conceptually, but not enough for quantitative reproduction from text alone.
Missing or under-specified details included:
- numeric Gaussian widths
- joint-vs-limb relative weighting
- crop and alignment policy
- resize/padding policy
- quantization/dtype behavior
- runtime transform assumptions
Why that matters:
- small rasterization and alignment changes moved results a lot
- many early failures came from details the paper did not pin down tightly enough
Conclusion:
- the paper gives the representation idea
- it does not fully specify the winning implementation
### 8. The visualization story is only partially reproducible
The papers cite visualization families, but do not specify enough extraction detail to reproduce the exact figures cleanly.
Missing detail included:
- which layer
- before or after temporal pooling
- exact normalization/rendering procedure
Conclusion:
- local visualization work is useful for debugging
- it should not be treated as paper-faithful evidence
## What Is Not Reproduced
### 9. The published DRF quantitative claim is not reproduced here
Paper-side numbers:
- `ScoNet-MT-ske`: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
- `DRF`: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`
Local practical DRF result on the current workable `1:1:2` path:
- best retained DRF checkpoint (`2000`) full test:
  - `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1`
Conclusion:
- DRF currently loses to the stronger plain skeleton baseline in this repo
- the published DRF advantage is not established locally
### 10. The paper-level `1:1:8` skeleton story is not reproduced here
What happened locally:
- `1:1:8` skeleton runs remained much weaker and more unstable
- the stronger practical result came from moving to `1:1:2`
Conclusion:
- the hard `1:1:8` regime is still unresolved here
- current local evidence says class distribution is a major part of the failure mode
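Deriving the easier ratio from the full pool can be sketched as seeded per-class subsampling. The class names and pool sizes below are hypothetical stand-ins, not this repo's actual partition code:

```python
import random

def subsample_to_ratio(ids_by_class, ratio_by_class, seed=0):
    """Subsample per-class ID lists toward a target ratio such as 1:1:2.

    ids_by_class: class name -> list of subject ids.
    ratio_by_class: class name -> integer share in the target ratio.
    The unit size is fixed by the most constrained class, and sampling
    is seeded so the derived split stays reproducible across runs.
    """
    rng = random.Random(seed)
    unit = min(len(ids) // ratio_by_class[name]
               for name, ids in ids_by_class.items())
    return {name: rng.sample(ids, unit * ratio_by_class[name])
            for name, ids in ids_by_class.items()}

# Hypothetical class names and pool sizes, standing in for the real data:
pool = {"positive": [f"p{i}" for i in range(100)],
        "neutral": [f"m{i}" for i in range(100)],
        "negative": [f"n{i}" for i in range(800)]}
sub = subsample_to_ratio(pool, {"positive": 1, "neutral": 1, "negative": 2})
```

Seeding matters here: an unseeded subsample would make every `1:1:2` result depend on which negatives happened to be drawn, which would muddy exactly the comparisons this audit relies on.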
## Practical Findings From The Search
### 11. Representation findings
Current local ranking:
- `body-only` is the best practical representation so far
- `head-lite` helped some small proxy runs but did not transfer to the full test set
- `full-body` has not yet beaten the best `body-only` checkpoint family
Concrete comparison:
- `body-only + plain CE` full test at `7000`:
  - `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1`
- `head-lite + plain CE` full test at `7000`:
  - `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`
Conclusion:
- the stable useful signal is mainly torso-centered
- adding limited head information did not improve full-test performance
### 12. Loss and optimizer findings
Current local ranking:
- on the practical `1:1:2` branch, `plain CE` beat `weighted CE`
- `SGD` produced the first strong baseline
- later `AdamW` cosine finetune beat that baseline substantially
- earlier `AdamW` multistep finetune was unstable and inferior
Conclusion:
- the current best recipe is not “AdamW from scratch”
- it is “strong SGD-style baseline first, then milder AdamW cosine finetune”
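The "milder AdamW cosine finetune" part of that recipe is mostly a schedule shape. A sketch of the cosine curve with warmup, using illustrative rather than the repo's actual hyperparameters:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0, warmup=0):
    """Cosine-decay learning rate with optional linear warmup.

    base_lr, warmup, and total_steps here are illustrative values,
    not the finetune run's actual configuration.
    """
    if warmup and step < warmup:
        return base_lr * (step + 1) / warmup          # linear warmup ramp
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

schedule = [cosine_lr(s, total_steps=80000, base_lr=1e-4, warmup=1000)
            for s in range(0, 80001, 8000)]
```

The curve ramps up briefly, then decays smoothly toward zero with no step discontinuities, which is the "milder" contrast with the earlier unstable multistep finetune.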
### 13. DRF-specific finding
Current local interpretation:
- DRF is not failing because the skeleton branch is dead
- it is failing because the extra prior branch is not yet adding a strong enough complementary signal
Most likely reasons:
- the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`
- the current PAV/PGA path looks weakly selective
- DRF tended to peak early and then degrade, suggesting less stable optimization than the plain baseline
Conclusion:
- DRF is still a research branch here
- it is not the current practical winner
## Important Evaluation Caveat
### 14. There was no test-sample gradient leakage
For the long finetune run:
- training batches came only from `TRAIN_SET`
- repeated evaluation ran under `torch.no_grad()`
- no `TEST_SET` samples were used in backprop or optimizer updates
Conclusion:
- there was no train/test mixing inside gradient descent
### 15. There was test-set model-selection leakage
For the long finetune run:
- full `TEST_SET` eval was run repeatedly during training
- best-checkpoint retention used `test_accuracy` and `test_f1`
- the final retained winner was chosen with test metrics
That means:
- the retained best checkpoint is real and reproducible
- but its metric should be interpreted as a test-guided selected result, not a pristine one-shot final estimate
Conclusion:
- acceptable for practical model selection
- not ideal for a publication-style clean generalization claim
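The distinction between the two selection modes can be shown in a few lines: selection is clean when the selection metric never touches the test split. The run history below is synthetic, chosen only to show how the two rules can disagree:

```python
def pick_best(checkpoints, key):
    """Return the checkpoint step whose `key` metric is highest."""
    return max(checkpoints, key=lambda c: c[key])["step"]

# Synthetic run history: per-checkpoint metrics on val and test splits.
history = [
    {"step": 7000,  "val_f1": 0.66, "test_f1": 0.68},
    {"step": 27000, "val_f1": 0.84, "test_f1": 0.89},
    {"step": 80000, "val_f1": 0.86, "test_f1": 0.76},
]
clean = pick_best(history, "val_f1")    # selection never sees test metrics
leaky = pick_best(history, "test_f1")   # selection uses test metrics directly
```

When the two rules disagree, the test-selected checkpoint is still a real, reproducible artifact, but its test metric is optimistically biased rather than a one-shot estimate, which is exactly the caveat stated above.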
## Bottom Line
The repo currently supports:
- strong practical scoliosis model development
- stable skeleton-map experimentation
- plausible DRF implementation work
The repo does not currently support:
- claiming an independent reproduction of the DRF paper's quantitative result
- claiming that the DRF paper fully specifies the winning skeleton-map preprocessing
- treating DRF as the default best model in this codebase
Practical bottom line:
- keep the `body-only` skeleton baseline as the mainline path
- keep the retained best checkpoint family as the working artifact
- treat DRF as an optional follow-up branch, not the current winner
Current strongest practical checkpoint:
- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
If future work claims to improve on this, it should:
1. beat the current retained best checkpoint on full-test macro-F1
2. be verifiable by standalone eval from a saved checkpoint
3. state clearly whether the change is paper-faithful or purely empirical