OpenGait/docs/scoliosis_reproducibility_audit.md
# Scoliosis1K Reproducibility Audit
This note records which parts of the ScoNet and DRF papers are currently reproducible in this repo, which parts are only partially reproducible, and which parts remain under-specified or unsupported by local evidence.
Ground truth policy for this audit:
- the papers are the methodological source of truth
- local code and local runs are the implementation evidence
- when the two diverge, this document states that explicitly
## Papers and local references
- ScoNet paper: [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex)
- DRF paper: [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- local run history: [scoliosis_training_change_log.md](scoliosis_training_change_log.md)
- current status note: [sconet-drf-status-and-training.md](sconet-drf-status-and-training.md)
## What is reproducible
### 1. The silhouette ScoNet pipeline is reproducible
Evidence:
- The ScoNet paper clearly states the standard `1:1:8` evaluation protocol and the SGD schedule in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex).
- The same tracked TeX source reports that multi-task ScoNet-MT is much stronger than single-task ScoNet, and documents the class-ratio study.
- In this repo, the standard silhouette ScoNet path is stable:
- the model/trainer/evaluator path is intact
- a strong silhouette checkpoint reproduces cleanly on the correct split family
Conclusion:
- the Scoliosis1K silhouette modality is usable
- the core OpenGait training and evaluation stack is usable for this task
- the repo is not globally broken
### 2. The high-level DRF architecture is reproducible
Evidence:
- The DRF paper defines the method as `skeleton map + PAV + PGA` in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- It defines:
- pelvis-centering and height normalization in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- two-channel skeleton maps in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- PAV metrics in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- PGA channel/spatial attention in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- This repo now has a functioning DRF model and DRF-specific preprocessing path implementing those ideas.
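The normalization steps named above can be sketched as follows. The joint indices and the per-sequence median height scale are illustrative assumptions; the paper does not pin down the keypoint layout, so adapt the indices to the dataset's skeleton format.

```python
import numpy as np

def normalize_pose(seq, pelvis_idx=0, head_idx=1, ankle_idx=2):
    """Pelvis-center and height-normalize a pose sequence (sketch).

    seq: (T, J, 2) array of 2D joint coordinates.
    The per-sequence median height is one plausible reading of the
    paper's "height normalization"; a per-frame scale is also possible.
    """
    seq = seq.astype(np.float64)
    # subtract the pelvis joint frame by frame
    centered = seq - seq[:, pelvis_idx:pelvis_idx + 1, :]
    # scale by a body-height estimate (head-to-ankle distance)
    heights = np.linalg.norm(centered[:, head_idx] - centered[:, ankle_idx], axis=-1)
    scale = np.median(heights) + 1e-8  # median is robust to occasional bad frames
    return centered / scale
```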
Conclusion:
- the paper is specific enough to implement a plausible DRF model family
- the architecture-level claim is reproducible
- the exact paper-level quantitative result is not yet reproducible
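For reference, a PGA-style block consistent with the architecture-level claim might look like the sketch below. The high-level shape (a PAV prior driving channel attention, followed by spatial attention over the feature map) follows the paper's description; the layer sizes, the 7x7 spatial kernel, and the fusion order are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class PGABlock(nn.Module):
    """Sketch of a PAV-guided channel + spatial attention block."""

    def __init__(self, channels: int, pav_dim: int):
        super().__init__()
        # PAV prior -> per-channel gate
        self.channel_gate = nn.Sequential(nn.Linear(pav_dim, channels), nn.Sigmoid())
        # pooled feature maps -> per-location gate
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x: torch.Tensor, pav: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) feature map, pav: (N, pav_dim) asymmetry prior
        x = x * self.channel_gate(pav)[:, :, None, None]          # channel attention
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)  # (N, 2, H, W)
        return x * self.spatial_gate(pooled)                      # spatial attention
```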
### 3. The PAV concept is reproducible
Evidence:
- The DRF paper defines 8 symmetric joint pairs and 3 asymmetry metrics in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- The local preprocessing implements those metrics and produces stable sequence-level PAVs.
- Local dataset analysis showed that the PAV carries useful signal, even under a simple probe.
Conclusion:
- PAV is not the main missing piece
- the main reproduction gap is not “we cannot build the clinical prior”
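A sketch of the PAV computation is below. The pair indices and the three per-pair statistics are illustrative stand-ins; the paper's exact 8 pairs and 3 metrics should be taken from the TeX source.

```python
import numpy as np

# Illustrative symmetric (left_idx, right_idx) pairs; the actual 8 pairs
# depend on the paper's keypoint layout.
SYMMETRIC_PAIRS = [(5, 6), (7, 8), (9, 10), (11, 12),
                   (13, 14), (15, 16), (2, 3), (17, 18)]

def pav_from_sequence(seq, pairs=SYMMETRIC_PAIRS):
    """Compute a sequence-level Pose Asymmetry Vector (sketch).

    seq: (T, J, 2) normalized joint coordinates.
    For each symmetric pair, three simple asymmetry statistics are
    averaged over time; the paper's three metrics may differ.
    """
    feats = []
    for l, r in pairs:
        dy = seq[:, l, 1] - seq[:, r, 1]                   # vertical offset
        dx = np.abs(seq[:, l, 0]) - np.abs(seq[:, r, 0])   # horizontal symmetry error
        ang = np.arctan2(dy, seq[:, l, 0] - seq[:, r, 0])  # tilt of the pair line
        feats.extend([dy.mean(), dx.mean(), ang.mean()])
    return np.asarray(feats)  # 8 pairs x 3 metrics = 24-dim PAV
```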
## What is only partially reproducible
### 4. The skeleton-map branch is reproducible only at the concept level
Evidence:
- The DRF paper describes the skeleton map as a dense, silhouette-like two-channel representation in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- It does not specify crucial rasterization details such as:
- numeric `sigma`
- joint-vs-limb relative weighting
- quantization and dtype
- crop policy
- resize policy
- whether alignment is per-frame or per-sequence
- Local runs show these details matter a lot:
- `sigma=8` skeleton runs were very poor
- smaller sigma and fixed limb/joint alignment improved results materially
- the best local skeleton baseline is still only `50.47 Acc / 48.63 Macro-F1`, far below the paper's `82.5 / 76.6` for ScoNet-MT-ske in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
Conclusion:
- the paper specifies the representation idea
- it does not specify enough to make the skeleton-map branch quantitatively reproducible from text alone
### 5. The visualization story is only partially reproducible
Evidence:
- The ScoNet paper cites the attention-transfer visualization family in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex).
- The DRF paper cites Zhou et al. CAM in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- Neither paper states:
- which layer is visualized
- whether visualization is before or after temporal pooling
- the exact normalization/rendering procedure
- Local attempts suggest that only certain intermediate layers produce qualitatively plausible maps, and even then the results are approximations rather than faithful reproductions.
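A minimal Zhou-et-al-style CAM makes explicit which choices the papers leave open: the tapped layer, the pooling stage, and the normalization are all free parameters here, and none is fixed by the text.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Zhou et al. style class activation map (sketch).

    features: (C, H, W) activations from some intermediate layer;
    fc_weights: (num_classes, C) final linear-layer weights.
    Which layer to tap, and whether before or after temporal pooling,
    are exactly the unspecified choices noted above.
    """
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)                  # keep positive evidence only
    denom = cam.max() if cam.max() > 0 else 1.0
    return cam / denom                          # normalize to [0, 1] for rendering
```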
Conclusion:
- qualitative visualization claims are only partially reproducible
- they should not be treated as strong evidence until the extraction procedure is specified better
## What is not reproducible from the paper and local materials alone
### 6. The paper-level ScoNet-MT-ske and DRF numbers are not reproducible yet
Evidence:
- The DRF paper reports:
- ScoNet-MT-ske: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
- DRF: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`
in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- The best local skeleton-map baseline so far is recorded in [scoliosis_training_change_log.md](scoliosis_training_change_log.md):
- `50.47 Acc / 69.31 Prec / 54.58 Rec / 48.63 F1`
- Local DRF runs are also well below the paper:
- `58.08 / 78.80 / 60.22 / 56.99`
- `51.67 / 72.37 / 56.22 / 50.92`
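When comparing local numbers against the paper's, the averaging convention matters. The local change log reports macro-averaged scores; the sketch below assumes unweighted macro averaging, which is one plausible reading of the paper's metrics.

```python
import numpy as np

def macro_metrics(y_true, y_pred, num_classes=3):
    """Accuracy plus macro-averaged precision/recall/F1 (sketch).

    The papers do not spell out their averaging convention; this
    assumes unweighted macro averaging over the classes.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = float((y_true == y_pred).mean())
    per_class = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        prec = tp / max(np.sum(y_pred == c), 1)   # guard empty predicted class
        rec = tp / max(np.sum(y_true == c), 1)    # guard empty true class
        f1 = 2 * prec * rec / max(prec + rec, 1e-12)
        per_class.append((prec, rec, f1))
    prec, rec, f1 = (float(np.mean(col)) for col in zip(*per_class))
    return acc, prec, rec, f1
```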
Conclusion:
- the current repo can reproduce the idea of DRF
- it cannot yet reproduce the paper's reported skeleton/DRF metrics
### 7. The missing author-side training details are still unresolved
Evidence:
- The author-side DRF stub referenced a `BaseModel_body` path that was not released in the original materials.
- A local reconstruction suggests that `BaseModel_body` was probably a thin wrapper, not the whole explanation for the metric gap.
- Even after matching the likely missing base-class contract more closely, the metric gap remained large.
Conclusion:
- the missing private code is probably not the only reason reproduction fails
- but the lack of released code still weakens the paper's reproducibility
### 8. The exact split accounting is slightly inconsistent
Evidence:
- The ScoNet and DRF papers describe the standard split as `745 train / 748 test` in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex) and [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- The released partition file matching the `1:1:8` class ratio in this repo is effectively `744 / 749`.
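The `744 / 749` count came from inspecting the released partition file. A check along these lines makes the count repeatable; it assumes an OpenGait-style JSON with `TRAIN_SET`/`TEST_SET` subject lists, which should be verified against the actual file layout in this repo.

```python
import json

def audit_split(partition_path):
    """Count train/test subjects in a partition file (sketch).

    Assumes a JSON layout with "TRAIN_SET"/"TEST_SET" subject lists;
    adjust the keys to the repo's actual partition format.
    """
    with open(partition_path) as f:
        part = json.load(f)
    train, test = part["TRAIN_SET"], part["TEST_SET"]
    overlap = set(train) & set(test)
    assert not overlap, f"train/test leak: {sorted(overlap)[:5]}"
    return len(train), len(test)
```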
Conclusion:
- this is probably a release/text bookkeeping mismatch, not the main source of failure
- but it is another example that the paper protocol is not perfectly audit-friendly
## Current strongest local conclusions
### Reproducible with high confidence
- silhouette ScoNet runs are meaningful
- the Scoliosis1K raw pose data does not appear obviously broken
- the OpenGait training/evaluation infrastructure is not the main problem
- PAV computation is not the main blocker
- the skeleton branch is learnable on the easier `1:1:2` split
- on `1:1:2`, `body-only + weighted CE` reached `81.82 Acc / 66.21 Prec / 88.50 Rec / 65.96 F1` on the full test set
- on the same split, `body-only + plain CE` improved that further to `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1` at iteration `7000`
- a later explicit rerun of the `body-only + plain CE` iteration-`7000` full-test eval reproduced the same `83.16 / 68.24 / 80.02 / 68.47` result
- adding back limited head context via `head-lite` did not improve the full-test score; its iteration-`7000` checkpoint reached only `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`
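The two recipes compared above differ only in the loss weighting. A minimal PyTorch illustration follows; the inverse-frequency weights for the `1:1:2` ratio are placeholders, and the actual weights used locally should be read from the run configs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8, 3)            # (batch, 3 scoliosis classes)
labels = torch.randint(0, 3, (8,))

# plain CE: every sample contributes equally
plain_ce = nn.CrossEntropyLoss()

# weighted CE: up-weight the two minority classes of a 1:1:2 ratio
class_weights = torch.tensor([2.0, 2.0, 1.0])  # illustrative inverse frequencies
weighted_ce = nn.CrossEntropyLoss(weight=class_weights)

loss_plain = plain_ce(logits, labels)
loss_weighted = weighted_ce(logits, labels)
```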
### Not reproducible with current evidence
- the paper's claimed skeleton-map baseline quality
- the paper's claimed DRF improvement magnitude
- the papers' qualitative response-map story as shown in the figures
### Most likely interpretation
- the papers are probably directionally correct
- but the skeleton-map and DRF pipelines are under-specified
- the missing implementation details are important enough that a faithful independent reproduction is not currently achievable from the paper text and released materials alone
- the `1:1:8` class ratio is not just a nuisance; it appears to be a major driver of the current skeleton/DRF failure mode
- on the easier `1:1:2` split, weighted CE is not currently the winning recipe; the best local full-test result so far came from plain CE
- `head-lite` may help the small fixed proxy subset, but that gain did not transfer to the full `TEST_SET`, so `body-only + plain CE` remains the best practical skeleton recipe
## Recommended standard for future work in this repo
When a future change claims to improve DRF reproduction, it should satisfy all of the following:
1. beat the current best skeleton baseline on the fixed proxy protocol
2. remain stable across at least one full run, not just a short spike
3. state whether the change is:
- paper-faithful
- implementation-motivated
- or purely empirical
4. avoid using silhouette success as evidence that the skeleton path is correct
## Practical bottom line
At the moment, this repo supports:
- faithful silhouette ScoNet experimentation
- plausible DRF implementation work
- structured debugging of the skeleton-map branch
At the moment, this repo does not yet support:
- claiming a successful independent reproduction of the DRF paper's quantitative results
- claiming that the paper's skeleton-map preprocessing is fully specified
- treating the papers' qualitative feature-response visualizations as reproduced