docs: update scoliosis reproducibility audit conclusion

2026-03-11 11:09:59 +08:00
parent c62bdee1f9
commit ede9690318
+251 -217
# Scoliosis1K Reproducibility Audit
This note is the current audit of which parts of the ScoNet and DRF papers are reproducible in this repo, which parts are only partially reproducible, and which parts remain under-specified or unsupported by local evidence.
It separates three questions that should not be mixed together:
- is the repo/train stack itself working?
- is the paper-level DRF claim reproducible?
- what is the strongest practical model we have right now?
Ground truth policy for this audit:
- the papers are the methodological source of truth
- local code and local runs are the implementation evidence
- when the two diverge, this document states that explicitly
## Papers and local references
Primary references:
- ScoNet paper: [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex)
- DRF paper: [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
Related notes:
- [ScoNet and DRF: Status, Architecture, and Reproduction Notes](sconet-drf-status-and-training.md)
- [Scoliosis Training Change Log](scoliosis_training_change_log.md)
- [Scoliosis: Next Experiments](scoliosis_next_experiments.md)
## Executive Summary
Current audit conclusion:
- the repo and training stack are working
- the skeleton branch is learnable
- the published DRF result is still not independently reproducible here
- the best practical model in this repo is currently not DRF
## What is reproducible
### 1. The silhouette ScoNet pipeline is reproducible
Evidence:
- The ScoNet paper states the standard `1:1:8` evaluation protocol and the SGD schedule clearly in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex).
- The same paper reports that multi-task ScoNet-MT is much stronger than single-task ScoNet, and the same tracked TeX source documents the class-ratio study.
- In this repo, the standard silhouette ScoNet path is stable:
- the model/trainer/evaluator path is intact
- a strong silhouette checkpoint reproduces cleanly on the correct split family
Conclusion:
- the Scoliosis1K silhouette modality is usable
- the core OpenGait training and evaluation stack is usable for this task
- the repo is not globally broken
### 2. The high-level DRF architecture is reproducible
Evidence:
- The DRF paper defines the method as `skeleton map + PAV + PGA` in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- The same source defines:
  - pelvis-centering and height normalization
  - two-channel skeleton maps
  - PAV metrics
  - PGA channel/spatial attention
- This repo now has a functioning DRF model and DRF-specific preprocessing path implementing those ideas.
Conclusion:
- the paper is specific enough to implement a plausible DRF model family
- the architecture-level claim is reproducible
- the exact paper-level quantitative result is not yet reproducible
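The pelvis-centering and height-normalization step above is simple enough to sketch. The joint layout and index choices below are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def normalize_pose(kpts, pelvis_idx=(11, 12), head_idx=0):
    """Pelvis-center and height-normalize one frame of 2D keypoints.

    kpts: (J, 2) array of (x, y) joint coordinates.
    pelvis_idx: the two hip joints (a COCO-style layout is assumed here).
    head_idx: joint used as the top reference for the height estimate.
    """
    kpts = np.asarray(kpts, dtype=np.float64)
    # Pelvis center = midpoint of the two hips; subtract it from every joint.
    pelvis = kpts[list(pelvis_idx)].mean(axis=0)
    centered = kpts - pelvis
    # Height proxy = vertical distance from the head joint to the pelvis.
    height = abs(centered[head_idx, 1])
    if height < 1e-6:
        return centered  # degenerate frame: leave unscaled
    return centered / height

frame = np.array([[0.0, 0.0],   # head joint
                  [-1.0, 5.0],  # left hip (hypothetical index)
                  [1.0, 5.0]])  # right hip (hypothetical index)
out = normalize_pose(frame, pelvis_idx=(1, 2), head_idx=0)
```

The point of the sketch is that the transform is scale- and translation-invariant per frame; whether the real pipeline applies it per frame or per sequence is one of the open details noted below.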
### 3. The PAV concept is reproducible
Evidence:
- The DRF paper defines 8 symmetric joint pairs and 3 asymmetry metrics in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- The local preprocessing implements those metrics and produces stable sequence-level PAVs.
- Local dataset analysis showed the PAV still carries useful signal, even with a simple probe.
Conclusion:
- PAV is not the main missing piece
- the main reproduction gap is not “we cannot build the clinical prior”
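As a concrete illustration of the PAV idea, here is a minimal numpy sketch of sequence-level asymmetry features. The joint pairs and the three per-pair quantities are hypothetical stand-ins, not the paper's exact 8 pairs and 3 metrics:

```python
import numpy as np

# Hypothetical symmetric joint pairs (left, right); the paper's actual
# pair list and metric definitions are not restated here.
PAIRS = [(5, 6), (7, 8), (9, 10), (11, 12)]

def asymmetry_features(kpts, pairs=PAIRS):
    """Per-frame asymmetry features from pelvis-centered 2D keypoints."""
    kpts = np.asarray(kpts, dtype=np.float64)
    feats = []
    for l, r in pairs:
        dy = kpts[l, 1] - kpts[r, 1]            # vertical offset of the pair
        dx = abs(kpts[l, 0]) - abs(kpts[r, 0])  # distance-to-midline mismatch
        tilt = np.arctan2(dy, kpts[l, 0] - kpts[r, 0])  # pair-axis tilt
        feats.extend([dy, dx, tilt])
    return np.array(feats)

def sequence_pav(frames):
    """Aggregate per-frame features into one sequence-level vector (mean)."""
    return np.mean([asymmetry_features(f) for f in frames], axis=0)

# A perfectly symmetric pose produces zero offset and zero midline mismatch.
sym = np.zeros((13, 2))
for l, r in PAIRS:
    sym[l] = [-1.0, float(l)]
    sym[r] = [1.0, float(l)]
pav = sequence_pav([sym, sym])
```

A symmetric skeleton zeroes the offset terms, which is why a simple probe on such a vector can already separate asymmetric gait from symmetric gait.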
## What is only partially reproducible
### 4. The skeleton-map branch is reproducible only at the concept level
Evidence:
- The DRF paper describes the skeleton map as a dense, silhouette-like two-channel representation in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- It does not specify crucial rasterization details such as:
  - numeric `sigma`
  - joint-vs-limb relative weighting
  - quantization and dtype
  - crop policy
  - resize policy
  - whether alignment is per-frame or per-sequence
- Local runs show these details matter a lot:
  - `sigma=8` skeleton runs were very poor
  - smaller sigma and fixed limb/joint alignment improved results materially
  - the best local skeleton baseline is still only `50.47 Acc / 48.63 Macro-F1`, far below the paper's `82.5 Acc / 76.6 F1` for ScoNet-MT-ske in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
Conclusion:
- the paper specifies the representation idea
- it does not specify enough to make the skeleton-map branch quantitatively reproducible from text alone
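A minimal sketch of the under-specified rasterization step shows why `sigma` alone changes the input so much. The map size, max-blending, and coordinate scaling below are all assumptions, not the paper's procedure:

```python
import numpy as np

def joint_channel(joints, size=64, sigma=2.0):
    """Rasterize 2D joints into one dense channel of a skeleton map.

    Each joint becomes an isotropic Gaussian blob. `sigma`, the map size,
    and max-blending are exactly the kind of knobs the paper leaves open.
    joints: (J, 2) coordinates already scaled into [0, size) pixel space.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    chan = np.zeros((size, size), dtype=np.float64)
    for x, y in joints:
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        chan = np.maximum(chan, blob)  # max-blend overlapping joints
    return chan

one_joint = np.array([[32.0, 32.0]])
m_small = joint_channel(one_joint, sigma=2.0)
m_large = joint_channel(one_joint, sigma=8.0)
```

With `sigma=8` the single joint covers a large fraction of the map, smearing nearby joints together; with `sigma=2` the blobs stay separable. That is consistent with the local observation that `sigma=8` runs were very poor.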
### 5. The visualization story is only partially reproducible
Evidence:
- The ScoNet paper cites the attention-transfer visualization family in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex).
- The DRF paper cites Zhou et al. CAM in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- Neither paper states:
  - which layer is visualized
  - whether visualization is before or after temporal pooling
  - the exact normalization/rendering procedure
- Local attempts suggest that only certain intermediate layers produce qualitatively plausible maps, and even then the results are approximations rather than faithful reproductions.
Conclusion:
- qualitative visualization claims are only partially reproducible
- they should not be treated as strong evidence until the extraction procedure is specified better
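For reference, the cited CAM mechanics themselves are simple to sketch; the unresolved part is which feature map to feed in and where temporal pooling happens, so everything below the docstring is illustrative:

```python
import numpy as np

def cam(feature_map, fc_weights, class_idx):
    """Zhou-et-al-style class activation map.

    feature_map: (C, H, W) activations from some convolutional layer.
    fc_weights: (num_classes, C) weights of the final linear classifier.
    Which layer to use, and whether temporal pooling happens first, is
    exactly what the papers do not pin down; this shows only the mechanics.
    """
    # Weight channels by the target class's classifier weights, sum over C.
    heat = np.tensordot(fc_weights[class_idx], feature_map, axes=([0], [0]))
    heat = np.maximum(heat, 0.0)          # keep positively weighted evidence
    rng = heat.max() - heat.min()
    return (heat - heat.min()) / rng if rng > 0 else heat

rng0 = np.random.default_rng(0)
feat = rng0.random((4, 8, 8))   # stand-in feature map
w = rng0.random((2, 4))         # stand-in classifier weights
heat = cam(feat, w, class_idx=1)
```

Because every choice of layer produces a map of the same shape, a plausible-looking figure is weak evidence on its own; that is why the local maps are treated as approximations here.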
## What is not reproducible from the paper and local materials alone
### 6. The paper-level ScoNet-MT-ske and DRF numbers are not reproducible yet
Evidence:
- The DRF paper reports, in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex):
  - ScoNet-MT-ske: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
  - DRF: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`
- The best local skeleton-map baseline so far is recorded in [scoliosis_training_change_log.md](scoliosis_training_change_log.md):
  - `50.47 Acc / 69.31 Prec / 54.58 Rec / 48.63 F1`
- Local DRF runs are also well below the paper:
  - `58.08 / 78.80 / 60.22 / 56.99`
  - `51.67 / 72.37 / 56.22 / 50.92`
Conclusion:
- the current repo can reproduce the idea of DRF
- it cannot reproduce the papers' reported skeleton/DRF metrics yet
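When comparing these four-number tuples it helps to fix the metric definitions. A small sketch, assuming the reported Prec/Rec/F1 are unweighted macro averages with per-class F1 computed before averaging:

```python
def macro_metrics(y_true, y_pred, classes):
    """Accuracy plus macro-averaged precision/recall/F1.

    Assumes unweighted macro averaging over `classes`; if the papers use a
    different averaging convention, the gap analysis would need adjusting.
    """
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    n = len(classes)
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n

acc, p, r, f1 = macro_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0],
                              classes=(0, 1, 2))
```

Note that under macro averaging, precision can sit well above F1 when one minority class has poor recall, which matches the shape of the local tuples above.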
### 7. The missing author-side training details are still unresolved
Evidence:
- The author-side DRF stub referenced a `BaseModel_body` path that was not released in the original materials.
- A local reconstruction suggests that `BaseModel_body` was probably thin, not the whole explanation for the metric gap.
- Even after matching the likely missing base-class contract more closely, the metric gap remained large.
Conclusion:
- the missing private code is probably not the only reason reproduction fails
- but the lack of released code still weakens the paper's reproducibility
### 8. The exact split accounting is slightly inconsistent
Evidence:
- The ScoNet and DRF papers describe the standard split as `745 train / 748 test` in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex) and [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- The released partition file matching the `1:1:8` class ratio in this repo is effectively `744 / 749`.
Conclusion:
- this is probably a release/text bookkeeping mismatch, not the main source of failure
- but it is another example that the paper protocol is not perfectly audit-friendly
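A check like the following keeps the split accounting auditable. The `TRAIN_SET`/`TEST_SET` key layout is an assumption about the partition file format used in this repo, and the ID lists below are synthetic stand-ins:

```python
import json

def audit_split(partition_json):
    """Count train/test entries in an OpenGait-style partition structure.

    The TRAIN_SET/TEST_SET key layout is an assumption about this repo's
    partition file, not something the papers specify.
    """
    part = json.loads(partition_json)
    train, test = part["TRAIN_SET"], part["TEST_SET"]
    overlap = sorted(set(train) & set(test))  # should always be empty
    return len(train), len(test), overlap

# Synthetic stand-in for the released partition file:
demo = json.dumps({
    "TRAIN_SET": [f"s{i:04d}" for i in range(744)],
    "TEST_SET": [f"s{i:04d}" for i in range(744, 1493)],
})
n_train, n_test, overlap = audit_split(demo)
```

Counting the released file this way is how a `744 / 749` effective split gets surfaced against the papers' stated `745 / 748`.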
## Current strongest local conclusions
### Reproducible with high confidence
- silhouette ScoNet runs are meaningful
- the Scoliosis1K raw pose data does not appear obviously broken
- the OpenGait training/evaluation infrastructure is not the main problem
- PAV computation is not the main blocker
- the skeleton branch is learnable on the easier `1:1:2` split
- on `1:1:2`, `body-only + weighted CE` reached `81.82 Acc / 66.21 Prec / 88.50 Rec / 65.96 F1` on the full test set
- on the same split, `body-only + plain CE` improved that further to `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1` at `7000`
- a later explicit rerun of the `body-only + plain CE` `7000` full-test eval reproduced that same `83.16 / 68.24 / 80.02 / 68.47` result
- a later `AdamW` cosine finetune from that same `10k` plain-CE checkpoint improved the practical result further:
  - verified retained best checkpoint at `27000`: `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
  - final `80000` checkpoint still remained strong: `90.64 Acc / 72.87 Prec / 93.19 Rec / 75.74 F1`
- adding back limited head context via `head-lite` did not improve the full-test score; its `7000` checkpoint reached only `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`
- the first practical DRF bridge on the same `1:1:2` body-only recipe peaked early and still underperformed the plain skeleton baseline; its best retained `2000` checkpoint reached only `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1` on the full test set
### Not reproducible with current evidence
- the papers' claimed skeleton-map baseline quality
- the papers' claimed DRF improvement magnitude
- the papers' qualitative response-map story as shown in the figures
### Most likely interpretation
- the papers are probably directionally correct
- but the skeleton-map and DRF pipelines are under-specified
- the missing implementation details are important enough that a faithful independent reproduction is not currently achievable from the paper text and released materials alone
- the `1:1:8` class ratio is not just a nuisance; it appears to be a major driver of the current skeleton/DRF failure mode
- on the easier `1:1:2` split, weighted CE is not currently the winning recipe; the best local full-test result so far came from plain CE
- `head-lite` may help the small fixed proxy subset, but that gain did not transfer to the full `TEST_SET`, so `body-only + plain CE` remains the best practical skeleton recipe
- once the practical `1:1:2` body-only plain-CE recipe was established, the branch still appeared underfit enough that optimizer/schedule mattered again. A later `AdamW` cosine finetune beat the earlier SGD bridge by a large margin at its retained best checkpoint, which means the earlier `83.16 / 68.47` result was a stable baseline but not the ceiling of this skeleton recipe
- DRF currently looks worse than the plain skeleton baseline not because the skeleton path is dead, but because the additional prior branch is not yet providing a selective or stable complement. The current local evidence points to three likely causes:
  - the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`, so PAV may be largely redundant in this setting
  - the current PGA/PAV path appears weakly selective in local diagnostics, so the prior is not clearly emphasizing a few clinically relevant parts
  - DRF peaks very early and then degrades, which suggests the added branch is making optimization less stable without improving the final decision boundary
## Recommended standard for future work in this repo
When a future change claims to improve DRF reproduction, it should satisfy all of the following:
1. beat the current best skeleton baseline on the fixed proxy protocol
2. remain stable across at least one full run, not just a short spike
3. state whether the change is:
   - paper-faithful
   - implementation-motivated
   - or purely empirical
4. avoid using silhouette success as evidence that the skeleton path is correct
## Practical bottom line
At the moment, this repo supports:
- faithful silhouette ScoNet experimentation
- plausible DRF implementation work
- structured debugging of the skeleton-map branch
At the moment, this repo does not yet support:
- claiming a successful independent reproduction of the DRF paper's quantitative results
- claiming that the papers' skeleton-map preprocessing is fully specified
- treating the papers' qualitative feature-response visualizations as reproduced
For practical model selection, the current conclusion is simpler:
- stop treating DRF as the default winner
- keep the practical mainline on `1:1:2`
- use the retained `body-only + plain CE` skeleton checkpoint family as the working solution
Current practical winner:
- model family: `ScoNet-MT-ske`
- split: `1:1:2`
- representation: `body-only`
- loss: plain CE + triplet
- optimizer path: later `AdamW` cosine finetune
- strongest verified practical checkpoint: the `AdamW` cosine finetune checkpoint at `27000`, with `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
That means:
- practical model development is in a good state
- paper-faithful DRF reproduction is not
The remaining work is no longer broad reproduction debugging. It is mostly optional refinement:
- confirm whether `body-only` really beats `full-body` under the same successful training recipe
- optionally retry DRF only after the strong practical skeleton baseline is fixed
- package and use the retained best checkpoint rather than continuing to churn the whole search space
## What We Can Say With High Confidence
### 1. The core training and evaluation stack works
Evidence:
- the silhouette ScoNet path behaves sensibly
- train/eval loops, checkpointing, resume, and standalone eval all work
- the strong practical skeleton result is reproducible from a saved checkpoint
Conclusion:
- the repo is not globally broken
- the main remaining issues are method-specific, not infrastructure-wide
### 2. The raw Scoliosis1K pose data is usable
Evidence:
- earlier dataset analysis showed high pose confidence and stable sequence lengths
- the skeleton branch eventually learns well on the easier practical split
Conclusion:
- the raw pose source is not the main blocker
### 3. The skeleton branch is learnable
Evidence:
- on `1:1:2`, the skeleton branch moved from poor early baselines to a strong retained checkpoint family
- the best verified practical result is:
  - `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
Conclusion:
- the skeleton path is not dead
- earlier failures on harder settings were not proof that skeleton input is fundamentally wrong
## What Is Reproducible
### 4. The silhouette ScoNet path is reproducible enough to trust
Evidence:
- the silhouette pipeline and trainer behave consistently
- strong silhouette checkpoints evaluate sensibly on their intended split family
Conclusion:
- silhouette ScoNet remains a valid sanity anchor for this repo
### 5. The high-level DRF idea is reproducible
Evidence:
- the DRF paper defines the method as `skeleton map + PAV + PGA`
- this repo now contains:
  - a DRF model
  - DRF-specific preprocessing
  - PAV generation
  - PGA integration
Conclusion:
- the architecture-level idea is implementable
- a plausible DRF implementation exists locally
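To make the attention side of that idea concrete, here is a generic channel-plus-spatial gating sketch driven by a prior vector. It mirrors the PGA idea only in spirit; every shape, weight, and the use of numpy rather than the repo's actual model code is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pga_like_gate(feat, pav, w_c, w_s):
    """Generic channel + spatial gating driven by a prior vector.

    feat: (C, H, W) feature map; pav: (P,) prior asymmetry vector.
    w_c: (C, P) maps the prior to per-channel gates.
    w_s: (1, C) maps channels to a single spatial gate map.
    This shows only the general mechanism, not the paper's actual module.
    """
    c_gate = sigmoid(w_c @ pav)              # (C,) channel attention
    feat = feat * c_gate[:, None, None]      # reweight channels by the prior
    s_gate = sigmoid(np.tensordot(w_s[0], feat, axes=([0], [0])))  # (H, W)
    return feat * s_gate[None, :, :]         # reweight spatial positions

rng = np.random.default_rng(0)
out = pga_like_gate(rng.random((4, 8, 8)), rng.random(6),
                    rng.random((4, 6)), rng.random((1, 4)))
```

The "weakly selective" diagnosis discussed later corresponds to gates that stay near a constant value, so the prior barely changes which channels or positions dominate.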
### 6. The PAV concept is reproducible
Evidence:
- PAV metrics were implemented and produced stable sequence-level signals
- the earlier analysis showed PAV still carries useful class signal
Conclusion:
- “we could not build the clinical prior” is not the main explanation anymore
## What Is Only Partially Reproducible
### 7. The skeleton-map branch is only partially specified by the papers
The papers define the representation conceptually, but not enough for quantitative reproduction from text alone.
Missing or under-specified details included:
- numeric Gaussian widths
- joint-vs-limb relative weighting
- crop and alignment policy
- resize/padding policy
- quantization/dtype behavior
- runtime transform assumptions
Why that matters:
- small rasterization and alignment changes moved results a lot
- many early failures came from details the paper did not pin down tightly enough
Conclusion:
- the paper gives the representation idea
- it does not fully specify the winning implementation
### 8. The visualization story is only partially reproducible
The papers cite visualization families, but do not specify enough extraction detail to reproduce the exact figures cleanly.
Missing detail included:
- which layer
- before or after temporal pooling
- exact normalization/rendering procedure
Conclusion:
- local visualization work is useful for debugging
- it should not be treated as paper-faithful evidence
## What Is Not Reproduced
### 9. The published DRF quantitative claim is not reproduced here
Paper-side numbers:
- `ScoNet-MT-ske`: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
- `DRF`: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`
Local practical DRF result on the current workable `1:1:2` path:
- best retained DRF checkpoint (`2000`) full test:
  - `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1`
Conclusion:
- DRF currently loses to the stronger plain skeleton baseline in this repo
- the published DRF advantage is not established locally
### 10. The paper-level `1:1:8` skeleton story is not reproduced here
What happened locally:
- `1:1:8` skeleton runs remained much weaker and more unstable
- the stronger practical result came from moving to `1:1:2`
Conclusion:
- the hard `1:1:8` regime is still unresolved here
- current local evidence says class distribution is a major part of the failure mode
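Deriving the easier ratio from the full pool can be sketched as seeded per-class subsampling. The class names and pool sizes below are hypothetical stand-ins, not this repo's actual partition code:

```python
import random

def subsample_to_ratio(ids_by_class, ratio_by_class, seed=0):
    """Subsample per-class ID lists toward a target ratio such as 1:1:2.

    ids_by_class: class name -> list of subject ids.
    ratio_by_class: class name -> integer share in the target ratio.
    The unit size is fixed by the most constrained class, and sampling
    is seeded so the derived split stays reproducible across runs.
    """
    rng = random.Random(seed)
    unit = min(len(ids) // ratio_by_class[name]
               for name, ids in ids_by_class.items())
    return {name: rng.sample(ids, unit * ratio_by_class[name])
            for name, ids in ids_by_class.items()}

# Hypothetical class names and pool sizes, standing in for the real data:
pool = {"positive": [f"p{i}" for i in range(100)],
        "neutral": [f"m{i}" for i in range(100)],
        "negative": [f"n{i}" for i in range(800)]}
sub = subsample_to_ratio(pool, {"positive": 1, "neutral": 1, "negative": 2})
```

Seeding matters here: an unseeded subsample would make every `1:1:2` result depend on which negatives happened to be drawn, which would muddy exactly the comparisons this audit relies on.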
## Practical Findings From The Search
### 11. Representation findings
Current local ranking:
- `body-only` is the best practical representation so far
- `head-lite` helped some small proxy runs but did not transfer to the full test set
- `full-body` has not yet beaten the best `body-only` checkpoint family
Concrete comparison:
- `body-only + plain CE` full test at `7000`:
  - `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1`
- `head-lite + plain CE` full test at `7000`:
  - `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`
Conclusion:
- the stable useful signal is mainly torso-centered
- adding limited head information did not improve full-test performance
### 12. Loss and optimizer findings
Current local ranking:
- on the practical `1:1:2` branch, `plain CE` beat `weighted CE`
- `SGD` produced the first strong baseline
- later `AdamW` cosine finetune beat that baseline substantially
- earlier `AdamW` multistep finetune was unstable and inferior
Conclusion:
- the current best recipe is not “AdamW from scratch”
- it is “strong SGD-style baseline first, then milder AdamW cosine finetune”
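The "milder AdamW cosine finetune" part of that recipe is mostly a schedule shape. A sketch of the cosine curve with warmup, using illustrative rather than the repo's actual hyperparameters:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0, warmup=0):
    """Cosine-decay learning rate with optional linear warmup.

    base_lr, warmup, and total_steps here are illustrative values,
    not the finetune run's actual configuration.
    """
    if warmup and step < warmup:
        return base_lr * (step + 1) / warmup          # linear warmup ramp
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

schedule = [cosine_lr(s, total_steps=80000, base_lr=1e-4, warmup=1000)
            for s in range(0, 80001, 8000)]
```

The curve ramps up briefly, then decays smoothly toward zero with no step discontinuities, which is the "milder" contrast with the earlier unstable multistep finetune.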
### 13. DRF-specific finding
Current local interpretation:
- DRF is not failing because the skeleton branch is dead
- it is failing because the extra prior branch is not yet adding a strong enough complementary signal
Most likely reasons:
- the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`
- the current PAV/PGA path looks weakly selective
- DRF tended to peak early and then degrade, suggesting less stable optimization than the plain baseline
Conclusion:
- DRF is still a research branch here
- it is not the current practical winner
## Important Evaluation Caveat
### 14. There was no test-sample gradient leakage
For the long finetune run:
- training batches came only from `TRAIN_SET`
- repeated evaluation ran under `torch.no_grad()`
- no `TEST_SET` samples were used in backprop or optimizer updates
Conclusion:
- there was no train/test mixing inside gradient descent
### 15. There was test-set model-selection leakage
For the long finetune run:
- full `TEST_SET` eval was run repeatedly during training
- best-checkpoint retention used `test_accuracy` and `test_f1`
- the final retained winner was chosen with test metrics
That means:
- the retained best checkpoint is real and reproducible
- but its metric should be interpreted as a test-guided selected result, not a pristine one-shot final estimate
Conclusion:
- acceptable for practical model selection
- not ideal for a publication-style clean generalization claim
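The distinction between the two selection modes can be shown in a few lines: selection is clean when the selection metric never touches the test split. The run history below is synthetic, chosen only to show how the two rules can disagree:

```python
def pick_best(checkpoints, key):
    """Return the checkpoint step whose `key` metric is highest."""
    return max(checkpoints, key=lambda c: c[key])["step"]

# Synthetic run history: per-checkpoint metrics on val and test splits.
history = [
    {"step": 7000,  "val_f1": 0.66, "test_f1": 0.68},
    {"step": 27000, "val_f1": 0.84, "test_f1": 0.89},
    {"step": 80000, "val_f1": 0.86, "test_f1": 0.76},
]
clean = pick_best(history, "val_f1")    # selection never sees test metrics
leaky = pick_best(history, "test_f1")   # selection uses test metrics directly
```

When the two rules disagree, the test-selected checkpoint is still a real, reproducible artifact, but its test metric is optimistically biased rather than a one-shot estimate, which is exactly the caveat stated above.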
## Bottom Line
The repo currently supports:
- strong practical scoliosis model development
- stable skeleton-map experimentation
- plausible DRF implementation work
The repo does not currently support:
- claiming an independent reproduction of the DRF paper's quantitative result
- claiming that the DRF paper fully specifies the winning skeleton-map preprocessing
- treating DRF as the default best model in this codebase
Practical bottom line:
- keep the `body-only` skeleton baseline as the mainline path
- keep the retained best checkpoint family as the working artifact
- treat DRF as an optional follow-up branch, not the current winner
Current strongest practical checkpoint:
- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
If future work claims to improve on this, it should:
1. beat the current retained best checkpoint on full-test macro-F1
2. be verifiable by standalone eval from a saved checkpoint
3. state clearly whether the change is paper-faithful or purely empirical