# Scoliosis1K Reproducibility Audit

This note records which parts of the ScoNet and DRF papers are currently reproducible in this repo, which parts are only partially reproducible, and which parts remain under-specified or unsupported by local evidence.

Ground truth policy for this audit:

- the papers are the methodological source of truth
- local code and local runs are the implementation evidence
- when the two diverge, this document states that explicitly

## Papers and local references

- ScoNet paper: [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex)
- DRF paper: [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- local run history: [scoliosis_training_change_log.md](scoliosis_training_change_log.md)
- current status note: [sconet-drf-status-and-training.md](sconet-drf-status-and-training.md)

## What is reproducible

### 1. The silhouette ScoNet pipeline is reproducible

Evidence:

- The ScoNet paper clearly states the standard `1:1:8` evaluation protocol and the SGD schedule in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex).
- The same paper reports that multi-task ScoNet-MT is much stronger than single-task ScoNet, including in the class-ratio study documented in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex).
- In this repo, the standard silhouette ScoNet path is stable:
  - the model/trainer/evaluator path is intact
  - a strong silhouette checkpoint reproduces cleanly on the correct split family

Conclusion:

- the Scoliosis1K silhouette modality is usable
- the core OpenGait training and evaluation stack is usable for this task
- the repo is not globally broken

### 2. The high-level DRF architecture is reproducible

Evidence:

- The DRF paper defines the method as `skeleton map + PAV + PGA` in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- It defines:
  - pelvis-centering and height normalization in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
  - two-channel skeleton maps in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
  - PAV metrics in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
  - PGA channel/spatial attention in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- This repo now has a functioning DRF model and DRF-specific preprocessing path implementing those ideas.

Conclusion:

- the paper is specific enough to implement a plausible DRF model family
- the architecture-level claim is reproducible
- the exact paper-level quantitative result is not yet reproducible

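
As an illustration of the pelvis-centering and height-normalization step listed above, the following is a minimal sketch under stated assumptions, not the paper's or this repo's confirmed implementation. The COCO-17 hip indices, the use of vertical joint extent as the height proxy, and the per-frame (rather than per-sequence) normalization are all assumptions; `normalize_pose` is a hypothetical helper name.

```python
import numpy as np

# Assumed joint layout: COCO-17 ordering, where 11 and 12 are the left and right hips.
L_HIP, R_HIP = 11, 12


def normalize_pose(joints: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Pelvis-center and height-normalize one frame of 2D joints.

    joints: array of shape (J, 2) holding (x, y) coordinates.
    Returns an array of the same shape with the hip midpoint at the origin
    and the vertical joint extent scaled to 1.
    """
    pelvis = 0.5 * (joints[L_HIP] + joints[R_HIP])  # midpoint of the two hips
    centered = joints - pelvis                      # pelvis moves to the origin
    # Assumption: vertical extent of the joints stands in for body height.
    height = centered[:, 1].max() - centered[:, 1].min()
    return centered / max(height, eps)


rng = np.random.default_rng(0)
frame = rng.uniform(0.0, 100.0, size=(17, 2))
normalized = normalize_pose(frame)
print(normalized.shape)
```

Applying the same function with a pelvis and height computed once per sequence would give a different alignment, which is one of the choices a reproduction has to make explicitly.
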
### 3. The PAV concept is reproducible

Evidence:

- The DRF paper defines 8 symmetric joint pairs and 3 asymmetry metrics in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- The local preprocessing implements those metrics and produces stable sequence-level PAVs.
- Local dataset analysis showed the PAV still carries useful signal, even with a simple probe.

Conclusion:

- PAV is not the main missing piece
- the main reproduction gap is not “we cannot build the clinical prior”

## What is only partially reproducible

### 4. The skeleton-map branch is reproducible only at the concept level

Evidence:

- The DRF paper describes the skeleton map as a dense, silhouette-like two-channel representation in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- It does not specify crucial rasterization details such as:
  - numeric `sigma`
  - joint-vs-limb relative weighting
  - quantization and dtype
  - crop policy
  - resize policy
  - whether alignment is per-frame or per-sequence
- Local runs show these details matter a lot:
  - `sigma=8` skeleton runs were very poor
  - smaller sigma and fixed limb/joint alignment improved results materially
  - the best local skeleton baseline is still only `50.47 Acc / 48.63 Macro-F1`, far below the paper's `82.5 / 76.6` for ScoNet-MT-ske in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)

Conclusion:

- the paper specifies the representation idea
- it does not specify enough to make the skeleton-map branch quantitatively reproducible from text alone

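
To make the sensitivity to `sigma` concrete, here is a minimal sketch of one possible joint rasterizer. It covers only a joint channel, and the `64x64` canvas, the max-combination of Gaussians, the example joint positions, and the `uint8` quantization are all illustrative assumptions rather than the paper's or this repo's confirmed settings; `render_joints` and `to_uint8` are hypothetical names.

```python
import numpy as np


def render_joints(joints_xy, size=(64, 64), sigma=2.0):
    """Render joints as a max-combined Gaussian heatmap with values in [0, 1]."""
    h, w = size
    ys, xs = np.mgrid[0:h, 0:w]  # pixel row and column coordinates
    heat = np.zeros((h, w), dtype=np.float32)
    for x, y in joints_xy:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, g.astype(np.float32))  # assumption: max, not sum
    return heat


def to_uint8(heat):
    """Quantize a [0, 1] heatmap to uint8 (one of several possible dtype choices)."""
    return np.round(heat * 255.0).astype(np.uint8)


# Three made-up joint positions, purely for illustration.
joints = [(20.0, 16.0), (44.0, 16.0), (32.0, 40.0)]
for s in (1.5, 8.0):
    m = render_joints(joints, sigma=s)
    # Fraction of the canvas above half intensity: a rough measure of blob size.
    print(s, float((m > 0.5).mean()))
```

On a small canvas a large `sigma` merges neighbouring joints into one blob, which is one plausible (but unconfirmed) reason the choice of `sigma` can change results so much.
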
### 5. The visualization story is only partially reproducible

Evidence:

- The ScoNet paper cites the attention-transfer visualization family in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex).
- The DRF paper cites Zhou et al. CAM in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- Neither paper states:
  - which layer is visualized
  - whether visualization is before or after temporal pooling
  - the exact normalization/rendering procedure
- Local attempts suggest that only certain intermediate layers produce qualitatively plausible maps, and even then the results are approximations rather than faithful reproductions.

Conclusion:

- qualitative visualization claims are only partially reproducible
- they should not be treated as strong evidence until the extraction procedure is specified better

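
For reference, the basic CAM computation from Zhou et al. is a class-weighted sum of the feature maps that feed global average pooling. The sketch below shows that computation on a single 2D feature tensor. Which layer to use, how to handle the temporal dimension, and the min-max normalization are assumptions here, since neither paper specifies them; `raw_cam` and `normalize_map` are hypothetical names.

```python
import numpy as np


def raw_cam(features, fc_weights, class_idx):
    """Class activation map as a weighted sum of feature channels.

    features:   (C, H, W) activations that feed global average pooling.
    fc_weights: (num_classes, C) weights of the linear classifier after pooling.
    """
    return np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)


def normalize_map(cam):
    """Min-max normalize to [0, 1] for rendering (an assumed rendering choice)."""
    span = cam.max() - cam.min()
    return (cam - cam.min()) / span if span > 0 else np.zeros_like(cam)


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3, 5))  # made-up activations
weights = rng.normal(size=(3, 4))      # made-up classifier weights
cam = raw_cam(features, weights, class_idx=1)
print(cam.shape, float(normalize_map(cam).max()))
```

A useful sanity check for any local extraction is that the spatial mean of the raw map equals the bias-free class logit computed from the pooled features; if the chosen layer does not feed the classifier through global average pooling, that identity no longer holds and the result is an approximation.
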
## What is not reproducible from the paper and local materials alone

### 6. The paper-level ScoNet-MT-ske and DRF numbers are not reproducible yet

Evidence:

- The DRF paper reports the following in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex):
  - ScoNet-MT-ske: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
  - DRF: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`
- The best local skeleton-map baseline so far is recorded in [scoliosis_training_change_log.md](scoliosis_training_change_log.md):
  - `50.47 Acc / 69.31 Prec / 54.58 Rec / 48.63 F1`
- Local DRF runs are also well below the paper (Acc / Prec / Rec / F1):
  - `58.08 / 78.80 / 60.22 / 56.99`
  - `51.67 / 72.37 / 56.22 / 50.92`

Conclusion:

- the current repo can reproduce the idea of DRF
- it cannot reproduce the paper’s reported skeleton/DRF metrics yet

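
When comparing these four-number summaries, it helps to be explicit about how they are computed. The sketch below derives accuracy and macro-averaged precision, recall, and F1 from a confusion matrix. That precision and recall are macro-averaged like the F1 is an assumption, and the confusion matrix is made up for illustration; it is not taken from any local run or from the paper.

```python
def macro_metrics(confusion):
    """Accuracy and macro-averaged precision/recall/F1.

    confusion[i][j] is the number of samples with true class i predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(k))
    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = confusion[c][c]
        predicted = sum(confusion[i][c] for i in range(k))  # column sum
        actual = sum(confusion[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "acc": correct / total,
        "prec": sum(precisions) / k,
        "rec": sum(recalls) / k,
        "f1": sum(f1s) / k,  # mean of per-class F1, not F1 of the macro averages
    }


# Hypothetical 3-class result on a 1:1:8 test distribution.
demo = [[5, 5, 0], [0, 10, 0], [0, 40, 40]]
print(macro_metrics(demo))
```

For this made-up matrix the macro-F1 (about 0.547) falls below both macro precision (about 0.727) and macro recall (about 0.667). The local runs listed above show the same ordering, which is what heavy over-prediction of one class tends to produce under macro averaging.
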
### 7. The missing author-side training details are still unresolved

Evidence:

- The author-side DRF stub referenced a `BaseModel_body` path that was not released in the original materials.
- A local reconstruction suggests that `BaseModel_body` was probably thin and does not by itself explain the metric gap.
- Even after matching the likely missing base-class contract more closely, the metric gap remained large.

Conclusion:

- the missing private code is probably not the only reason reproduction fails
- but the lack of released code still weakens the paper’s reproducibility

### 8. The exact split accounting is slightly inconsistent

Evidence:

- The ScoNet and DRF papers describe the standard split as `745 train / 748 test` in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex) and [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex).
- The released partition file matching the `1:1:8` class ratio in this repo is effectively `744 / 749`.

Conclusion:

- this is probably a release/text bookkeeping mismatch, not the main source of failure
- but it is another example that the paper protocol is not perfectly audit-friendly

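
Both accountings sum to 1493, which is consistent with a single ID sitting on the other side of the split. A check like the following sketch makes the count easy to re-verify. It assumes the partition is a JSON object with `TRAIN_SET` and `TEST_SET` ID lists; the `TRAIN_SET` key name and the tiny inline partition are assumptions for illustration, and `split_counts` is a hypothetical helper.

```python
import json


def split_counts(partition: dict) -> dict:
    """Count unique IDs per split and check for train/test leakage."""
    train = set(partition["TRAIN_SET"])
    test = set(partition["TEST_SET"])
    return {
        "train": len(train),
        "test": len(test),
        "overlap": len(train & test),  # should be 0 for a clean split
        "total": len(train | test),
    }


# Tiny made-up partition; a real check would json.load() the released file instead.
demo = json.loads('{"TRAIN_SET": ["00001", "00002"], "TEST_SET": ["00003"]}')
print(split_counts(demo))
```
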
## Current strongest local conclusions

### Reproducible with high confidence

- silhouette ScoNet runs are meaningful
- the Scoliosis1K raw pose data does not appear obviously broken
- the OpenGait training/evaluation infrastructure is not the main problem
- PAV computation is not the main blocker
- the skeleton branch is learnable on the easier `1:1:2` split
- on `1:1:2`, `body-only + weighted CE` reached `81.82 Acc / 66.21 Prec / 88.50 Rec / 65.96 F1` on the full test set
- on the same split, `body-only + plain CE` improved that further to `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1` at the `7000` checkpoint
- a later explicit rerun of the `body-only + plain CE` `7000` full-test eval reproduced that same `83.16 / 68.24 / 80.02 / 68.47` result
- adding back limited head context via `head-lite` did not improve the full-test score; its `7000` checkpoint reached only `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`

### Not reproducible with current evidence

- the paper’s claimed skeleton-map baseline quality
- the paper’s claimed DRF improvement magnitude
- the paper’s qualitative response-map story as shown in the figures

### Most likely interpretation

- the papers are probably directionally correct
- but the skeleton-map and DRF pipelines are under-specified
- the missing implementation details are important enough that a faithful independent reproduction is not currently achievable from the paper text and released materials alone
- the `1:1:8` class ratio is not just a nuisance; it appears to be a major driver of the current skeleton/DRF failure mode
- on the easier `1:1:2` split, weighted CE is not currently the winning recipe; the best local full-test result so far came from plain CE
- `head-lite` may help the small fixed proxy subset, but that gain did not transfer to the full `TEST_SET`, so `body-only + plain CE` remains the best practical skeleton recipe

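
To give a sense of how differently the two class ratios stress a weighted loss, the sketch below computes inverse-frequency class weights for each. Inverse-frequency weighting is an assumption used for illustration; this note does not state which weighting scheme the local weighted-CE runs actually used, and `inverse_frequency_weights` is a hypothetical helper.

```python
def inverse_frequency_weights(counts):
    """Per-class weights total / (num_classes * count), so a balanced set gives all 1.0."""
    total = sum(counts)
    k = len(counts)
    return [total / (k * c) for c in counts]


for ratio in ((1, 1, 8), (1, 1, 2)):
    weights = inverse_frequency_weights(ratio)
    print(ratio, [round(w, 3) for w in weights])
```

Under this scheme the `1:1:8` ratio weights each minority class 8 times more than the majority class (about 3.333 versus 0.417), while `1:1:2` gives only a 2x gap (about 1.333 versus 0.667). The correction weighted CE has to apply is therefore much more aggressive on the standard split.
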
## Recommended standard for future work in this repo

When a future change claims to improve DRF reproduction, it should satisfy all of the following:

1. beat the current best skeleton baseline on the fixed proxy protocol
2. remain stable across at least one full run, not just a short spike
3. state whether the change is:
   - paper-faithful
   - implementation-motivated
   - or purely empirical
4. avoid using silhouette success as evidence that the skeleton path is correct

## Practical bottom line

At the moment, this repo supports:

- faithful silhouette ScoNet experimentation
- plausible DRF implementation work
- structured debugging of the skeleton-map branch

At the moment, this repo does not yet support:

- claiming a successful independent reproduction of the DRF paper’s quantitative results
- claiming that the paper’s skeleton-map preprocessing is fully specified
- treating the paper’s qualitative feature-response visualizations as reproduced