OpenGait/docs/scoliosis_reproducibility_audit.md

Scoliosis1K Reproducibility Audit

This note records which parts of the ScoNet and DRF papers are currently reproducible in this repo, which parts are only partially reproducible, and which parts remain under-specified or unsupported by local evidence.

Ground truth policy for this audit:

  • the papers are the methodological source of truth
  • local code and local runs are the implementation evidence
  • when the two diverge, this document states that explicitly

Papers and local references

What is reproducible

1. The silhouette ScoNet pipeline is reproducible

Evidence:

  • The ScoNet paper clearly states the standard 1:1:8 evaluation protocol and the SGD schedule in arXiv-2407.05726v3-main.tex, and the same tracked TeX source documents the class-ratio study.
  • The same paper reports that multi-task ScoNet-MT is much stronger than single-task ScoNet, again in arXiv-2407.05726v3-main.tex.
  • In this repo, the standard silhouette ScoNet path is stable:
    • the model/trainer/evaluator path is intact
    • a strong silhouette checkpoint reproduces cleanly on the correct split family
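Since the 1:1:8 protocol is a class-ratio constraint, the split bookkeeping can be made concrete with a small helper. This is only a sketch of one way to build such a split; the class names, the `subsample_to_ratio` helper, and the seeding policy are all illustrative assumptions, and the paper's released split files remain the authority:

```python
import random

def subsample_to_ratio(samples_by_class, ratio, seed=0):
    """Subsample per-class pools to a target class ratio.

    `samples_by_class` maps class name -> list of sample ids, and
    `ratio` maps class name -> relative share, e.g. {"neg": 1,
    "neutral": 1, "pos": 8} for a 1:1:8 protocol (class names here
    are illustrative). The most limiting class pool sets the scale.
    """
    rng = random.Random(seed)
    scale = min(len(pool) // ratio[c] for c, pool in samples_by_class.items())
    split = {}
    for c, pool in samples_by_class.items():
        pool = sorted(pool)          # deterministic base order
        rng.shuffle(pool)            # seeded shuffle before truncation
        split[c] = pool[: scale * ratio[c]]
    return split
```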

Conclusion:

  • the Scoliosis1K silhouette modality is usable
  • the core OpenGait training and evaluation stack is usable for this task
  • the repo is not globally broken

2. The high-level DRF architecture is reproducible

Evidence:

Conclusion:

  • the paper is specific enough to implement a plausible DRF model family
  • the architecture-level claim is reproducible
  • the exact paper-level quantitative result is not yet reproducible

3. The PAV concept is reproducible

Evidence:

  • The DRF paper defines 8 symmetric joint pairs and 3 asymmetry metrics in arXiv-2509.00872v1-main.tex.
  • The local preprocessing implements those metrics and produces stable sequence-level PAVs.
  • Local dataset analysis showed the PAV still carries useful signal, even with a simple probe.
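Because the exact metric definitions live in the TeX source rather than in this note, a minimal sketch helps pin down what a pair-based asymmetry vector can look like. The joint indices, the three example metrics (height difference, lateral offset, pair distance), and the `pav` helper below are all illustrative assumptions, not the paper's exact definitions:

```python
import math

# Hypothetical symmetric (left_idx, right_idx) pairs in a COCO-like
# 17-joint layout; the paper's exact 8 pairs are not restated here.
SYMMETRIC_PAIRS = [(5, 6), (7, 8), (9, 10), (11, 12),
                   (13, 14), (15, 16), (1, 2), (3, 4)]

def pav(frame):
    """Compute a per-frame Pairwise Asymmetry Vector (illustrative).

    `frame` is a list of (x, y) joint coordinates. For each symmetric
    pair we emit three example asymmetry metrics; the paper's own
    three metrics may differ in detail.
    """
    xs = [p[0] for p in frame]
    mid_x = sum(xs) / len(xs)          # crude body-midline estimate
    vec = []
    for l, r in SYMMETRIC_PAIRS:
        (lx, ly), (rx, ry) = frame[l], frame[r]
        vec.append(ly - ry)                        # height asymmetry
        vec.append((lx - mid_x) + (rx - mid_x))    # lateral offset asymmetry
        vec.append(math.hypot(lx - rx, ly - ry))   # pair distance
    return vec
```

A sequence-level PAV would then aggregate these per-frame vectors (e.g. by mean and variance over time), which matches the "stable sequence-level PAVs" observation above.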

Conclusion:

  • PAV is not the main missing piece
  • the main reproduction gap is not “we cannot build the clinical prior”

What is only partially reproducible

4. The skeleton-map branch is reproducible only at the concept level

Evidence:

  • The DRF paper describes the skeleton map as a dense, silhouette-like two-channel representation in arXiv-2509.00872v1-main.tex.
  • It does not specify crucial rasterization details such as:
    • numeric sigma
    • joint-vs-limb relative weighting
    • quantization and dtype
    • crop policy
    • resize policy
    • whether alignment is per-frame or per-sequence
  • Local runs show these details matter a lot:
    • sigma=8 skeleton runs were very poor
    • smaller sigma and fixed limb/joint alignment improved results materially
    • the best local skeleton baseline is still only 50.47 Acc / 48.63 Macro-F1, far below the paper's 82.5 / 76.6 for ScoNet-MT-ske in arXiv-2509.00872v1-main.tex
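To make the sensitivity to these unspecified choices concrete, here is one plausible two-channel rasterization. Every numeric constant (canvas size, `sigma`, joint/limb weights) and the `rasterize_skeleton` helper itself are assumptions; the paper specifies none of them:

```python
import math

def rasterize_skeleton(joints, limbs, size=64, sigma=2.0,
                       joint_weight=1.0, limb_weight=0.5):
    """Render a two-channel skeleton map: channel 0 holds Gaussian
    blobs at joints, channel 1 holds Gaussian-profiled limb segments.
    All constants are illustrative guesses, not paper values.
    """
    jmap = [[0.0] * size for _ in range(size)]
    lmap = [[0.0] * size for _ in range(size)]
    radius = int(3 * sigma)

    def splat(grid, cx, cy, w):
        # max-composite a Gaussian blob centered at (cx, cy)
        for y in range(max(0, int(cy) - radius), min(size, int(cy) + radius + 1)):
            for x in range(max(0, int(cx) - radius), min(size, int(cx) + radius + 1)):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                grid[y][x] = max(grid[y][x], w * math.exp(-d2 / (2 * sigma ** 2)))

    for (x, y) in joints:
        splat(jmap, x, y, joint_weight)
    for a, b in limbs:                      # limb = pair of joint indices
        (x0, y0), (x1, y1) = joints[a], joints[b]
        steps = max(2, int(math.hypot(x1 - x0, y1 - y0)))
        for t in range(steps + 1):          # sample points along the bone
            f = t / steps
            splat(lmap, x0 + f * (x1 - x0), y0 + f * (y1 - y0), limb_weight)
    return jmap, lmap
```

Varying `sigma` in this sketch directly changes how diffuse the two channels are and how much joints and limbs overlap, which is consistent with the large local quality swings between sigma=8 and smaller sigmas.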

Conclusion:

  • the paper specifies the representation idea
  • it does not specify enough to make the skeleton-map branch quantitatively reproducible from text alone

5. The visualization story is only partially reproducible

Evidence:

  • The ScoNet paper cites the attention-transfer visualization family in arXiv-2407.05726v3-main.tex.
  • The DRF paper cites Zhou et al. CAM in arXiv-2509.00872v1-main.tex.
  • Neither paper states:
    • which layer is visualized
    • whether visualization is before or after temporal pooling
    • the exact normalization/rendering procedure
  • Local attempts suggest that only certain intermediate layers produce qualitatively plausible maps, and even then the results are approximations rather than faithful reproductions.
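For reference, plain CAM in the Zhou et al. sense reduces to a classifier-weighted channel sum followed by min-max normalization. The sketch below is only one plausible reading: which layer is tapped, whether it runs before or after temporal pooling, and the normalization are exactly the under-specified parts:

```python
def class_activation_map(features, fc_weights, cls):
    """Plain CAM: weight each feature channel by the classifier
    weight for `cls`, sum over channels, min-max normalize.

    `features` is [C][H][W] (some intermediate feature map) and
    `fc_weights` is [num_classes][C]. Both names and the choice of
    layer are assumptions for illustration.
    """
    C, H, W = len(features), len(features[0]), len(features[0][0])
    cam = [[sum(fc_weights[cls][c] * features[c][y][x] for c in range(C))
            for x in range(W)] for y in range(H)]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = (hi - lo) or 1.0              # guard against a flat map
    return [[(v - lo) / span for v in row] for row in cam]
```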

Conclusion:

  • qualitative visualization claims are only partially reproducible
  • they should not be treated as strong evidence until the extraction procedure is specified better

What is not reproducible from the paper and local materials alone

6. The paper-level ScoNet-MT-ske and DRF numbers are not reproducible yet

Evidence:

  • The DRF paper reports:
    • ScoNet-MT-ske: 82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1
    • DRF: 86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1 in arXiv-2509.00872v1-main.tex
  • The best local skeleton-map baseline so far is recorded in scoliosis_training_change_log.md:
    • 50.47 Acc / 69.31 Prec / 54.58 Rec / 48.63 F1
  • Local DRF runs are also well below the paper:
    • 58.08 / 78.80 / 60.22 / 56.99
    • 51.67 / 72.37 / 56.22 / 50.92

Conclusion:

  • the current repo can reproduce the idea of DRF
  • it cannot reproduce the paper's reported skeleton/DRF metrics yet

7. The missing author-side training details are still unresolved

Evidence:

  • The author-side DRF stub referenced a BaseModel_body path that was not released in the original materials.
  • A local reconstruction suggests that BaseModel_body was probably thin, not the whole explanation for the metric gap.
  • Even after matching the likely missing base-class contract more closely, the metric gap remained large.

Conclusion:

  • the missing private code is probably not the only reason reproduction fails
  • but the lack of released code still weakens the paper's reproducibility

8. The exact split accounting is slightly inconsistent

Evidence:

Conclusion:

  • this is probably a release/text bookkeeping mismatch, not the main source of failure
  • but it is another example that the paper protocol is not perfectly audit-friendly

Current strongest local conclusions

Reproducible with high confidence

  • silhouette ScoNet runs are meaningful
  • the Scoliosis1K raw pose data does not appear obviously broken
  • the OpenGait training/evaluation infrastructure is not the main problem
  • PAV computation is not the main blocker
  • the skeleton branch is learnable on the easier 1:1:2 split
  • on 1:1:2, body-only + weighted CE reached 81.82 Acc / 66.21 Prec / 88.50 Rec / 65.96 F1 on the full test set
  • on the same split, body-only + plain CE improved that further to 83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1 at 7000
  • a later explicit rerun of the body-only + plain CE 7000 full-test eval reproduced that same 83.16 / 68.24 / 80.02 / 68.47 result
  • a later AdamW cosine finetune from that same 10k plain-CE checkpoint improved the practical result further:
    • verified retained best checkpoint at 27000: 92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1
    • final 80000 checkpoint still remained strong: 90.64 Acc / 72.87 Prec / 93.19 Rec / 75.74 F1
  • adding back limited head context via head-lite did not improve the full-test score; its 7000 checkpoint reached only 78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1
  • the first practical DRF bridge on the same 1:1:2 body-only recipe peaked early and still underperformed the plain skeleton baseline; its best retained 2000 checkpoint reached only 80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1 on the full test set
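The weighted-vs-plain CE comparison above reduces to whether a per-class factor multiplies the per-sample loss. A minimal sketch of both variants; the actual class weights used in the local runs are not restated here, so `weights` below is a placeholder:

```python
import math

def cross_entropy(logits, target, weights=None):
    """Per-sample cross-entropy with optional class weights.

    With `weights=None` this is the plain CE that the audit found to
    work best on the 1:1:2 split; passing e.g. inverse-frequency
    weights gives the weighted-CE variant.
    """
    m = max(logits)                                   # log-sum-exp trick
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    loss = log_z - logits[target]
    return loss * (weights[target] if weights else 1.0)
```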

Not reproducible with current evidence

  • the papers' claimed skeleton-map baseline quality
  • the papers' claimed DRF improvement magnitude
  • the papers' qualitative response-map story as shown in the figures

Most likely interpretation

  • the papers are probably directionally correct
  • but the skeleton-map and DRF pipelines are under-specified
  • the missing implementation details are important enough that a faithful independent reproduction is not currently achievable from the paper text and released materials alone
  • the 1:1:8 class ratio is not just a nuisance; it appears to be a major driver of the current skeleton/DRF failure mode
  • on the easier 1:1:2 split, weighted CE is not currently the winning recipe; the best local full-test result so far came from plain CE
  • head-lite may help the small fixed proxy subset, but that gain did not transfer to the full TEST_SET, so body-only + plain CE remains the best practical skeleton recipe
  • once the practical 1:1:2 body-only plain-CE recipe was established, the branch still appeared underfit enough that optimizer/schedule mattered again: a later AdamW cosine finetune beat the earlier SGD bridge by a large margin at its retained best checkpoint, so the earlier 83.16 / 68.47 result was a stable baseline but not the ceiling of this skeleton recipe
  • DRF currently looks worse than the plain skeleton baseline not because the skeleton path is dead, but because the additional prior branch is not yet providing a selective or stable complement. The current local evidence points to three likely causes:
    • the body-only skeleton baseline already captures most of the useful torso signal on 1:1:2, so PAV may be largely redundant in this setting
    • the current PGA/PAV path appears weakly selective in local diagnostics, so the prior is not clearly emphasizing a few clinically relevant parts
    • DRF peaks very early and then degrades, which suggests the added branch is making optimization less stable without improving the final decision boundary

When a future change claims to improve DRF reproduction, it should satisfy all of the following:

  1. beat the current best skeleton baseline on the fixed proxy protocol
  2. remain stable across at least one full run, not just a short spike
  3. state whether the change is:
    • paper-faithful
    • implementation-motivated
    • or purely empirical
  4. avoid using silhouette success as evidence that the skeleton path is correct
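The first three criteria can be checked mechanically; criterion 4 is a reviewing discipline rather than a computation. A sketch of such a gate, where the metric keys, the stability threshold, and the `accept_drf_change` name are all illustrative choices:

```python
def accept_drf_change(candidate, baseline, full_run_curve,
                      provenance, min_stable_fraction=0.9):
    """Mechanical gate for criteria 1-3 above.

    `candidate`/`baseline` are metric dicts on the fixed proxy
    protocol; `full_run_curve` is the candidate's metric over one
    full run; `provenance` must name one of the three allowed
    motivations. The 0.9 stability threshold is an arbitrary guess.
    """
    beats_baseline = all(candidate[k] >= baseline[k] for k in baseline)
    peak = max(full_run_curve)
    # "stable" here means the run ends near its peak, not a short spike
    stable = full_run_curve[-1] >= min_stable_fraction * peak
    labeled = provenance in {"paper-faithful", "implementation-motivated",
                             "purely empirical"}
    return beats_baseline and stable and labeled
```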

Practical bottom line

At the moment, this repo supports:

  • faithful silhouette ScoNet experimentation
  • plausible DRF implementation work
  • structured debugging of the skeleton-map branch

At the moment, this repo does not yet support:

  • claiming a successful independent reproduction of the DRF paper's quantitative results
  • claiming that the paper's skeleton-map preprocessing is fully specified
  • treating the papers' qualitative feature-response visualizations as reproduced

For practical model selection, the current conclusion is simpler:

  • stop treating DRF as the default winner
  • keep the practical mainline on 1:1:2
  • use the retained body-only + plain CE skeleton checkpoint family as the working solution
  • the strongest verified practical checkpoint is the later AdamW cosine finetune checkpoint at 27000, with:
    • 92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1

That means the remaining work is no longer broad reproduction debugging. It is mostly optional refinement:

  • confirm whether body-only really beats full-body under the same successful training recipe
  • optionally retry DRF only after the strong practical skeleton baseline is fixed
  • package and use the retained best checkpoint rather than continuing to churn the whole search space