OpenGait/docs/scoliosis_reproducibility_audit.md

Scoliosis1K Reproducibility Audit

This note records which parts of the ScoNet and DRF papers are currently reproducible in this repo, which parts are only partially reproducible, and which parts remain under-specified or unsupported by local evidence.

Ground truth policy for this audit:

  • the papers are the methodological source of truth
  • local code and local runs are the implementation evidence
  • when the two diverge, this document states that explicitly

Papers and local references

What is reproducible

1. The silhouette ScoNet pipeline is reproducible

Evidence:

  • The ScoNet paper clearly states the standard 1:1:8 evaluation protocol and the SGD schedule in arXiv-2407.05726v3-main.tex, and the same tracked TeX source documents the class-ratio study.
  • The same paper reports that multi-task ScoNet-MT is much stronger than single-task ScoNet, again in arXiv-2407.05726v3-main.tex.
  • In this repo, the standard silhouette ScoNet path is stable:
    • the model/trainer/evaluator path is intact
    • a strong silhouette checkpoint reproduces cleanly on the correct split family
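Since the 1:1:8 protocol is a class-ratio constraint, the split bookkeeping can be made concrete with a small helper. This is only a sketch of one way to build such a split; the class names, the `subsample_to_ratio` helper, and the seeding policy are all illustrative assumptions, and the paper's released split files remain the authority:

```python
import random

def subsample_to_ratio(samples_by_class, ratio, seed=0):
    """Subsample per-class pools to a target class ratio.

    `samples_by_class` maps class name -> list of sample ids, and
    `ratio` maps class name -> relative share, e.g. {"neg": 1,
    "neutral": 1, "pos": 8} for a 1:1:8 protocol (class names here
    are illustrative). The most limiting class pool sets the scale.
    """
    rng = random.Random(seed)
    scale = min(len(pool) // ratio[c] for c, pool in samples_by_class.items())
    split = {}
    for c, pool in samples_by_class.items():
        pool = sorted(pool)          # deterministic base order
        rng.shuffle(pool)            # seeded shuffle before truncation
        split[c] = pool[: scale * ratio[c]]
    return split
```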

Conclusion:

  • the Scoliosis1K silhouette modality is usable
  • the core OpenGait training and evaluation stack is usable for this task
  • the repo is not globally broken

2. The high-level DRF architecture is reproducible

Evidence:

Conclusion:

  • the paper is specific enough to implement a plausible DRF model family
  • the architecture-level claim is reproducible
  • the exact paper-level quantitative result is not yet reproducible

3. The PAV concept is reproducible

Evidence:

  • The DRF paper defines 8 symmetric joint pairs and 3 asymmetry metrics in arXiv-2509.00872v1-main.tex.
  • The local preprocessing implements those metrics and produces stable sequence-level PAVs.
  • Local dataset analysis showed the PAV still carries useful signal, even with a simple probe.
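Because the exact metric definitions live in the TeX source rather than in this note, a minimal sketch helps pin down what a pair-based asymmetry vector can look like. The joint indices, the three example metrics (height difference, lateral offset, pair distance), and the `pav` helper below are all illustrative assumptions, not the paper's exact definitions:

```python
import math

# Hypothetical symmetric (left_idx, right_idx) pairs in a COCO-like
# 17-joint layout; the paper's exact 8 pairs are not restated here.
SYMMETRIC_PAIRS = [(5, 6), (7, 8), (9, 10), (11, 12),
                   (13, 14), (15, 16), (1, 2), (3, 4)]

def pav(frame):
    """Compute a per-frame Pairwise Asymmetry Vector (illustrative).

    `frame` is a list of (x, y) joint coordinates. For each symmetric
    pair we emit three example asymmetry metrics; the paper's own
    three metrics may differ in detail.
    """
    xs = [p[0] for p in frame]
    mid_x = sum(xs) / len(xs)          # crude body-midline estimate
    vec = []
    for l, r in SYMMETRIC_PAIRS:
        (lx, ly), (rx, ry) = frame[l], frame[r]
        vec.append(ly - ry)                        # height asymmetry
        vec.append((lx - mid_x) + (rx - mid_x))    # lateral offset asymmetry
        vec.append(math.hypot(lx - rx, ly - ry))   # pair distance
    return vec
```

A sequence-level PAV would then aggregate these per-frame vectors (e.g. by mean and variance over time), which matches the "stable sequence-level PAVs" observation above.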

Conclusion:

  • PAV is not the main missing piece
  • the main reproduction gap is not “we cannot build the clinical prior”

What is only partially reproducible

4. The skeleton-map branch is reproducible only at the concept level

Evidence:

  • The DRF paper describes the skeleton map as a dense, silhouette-like two-channel representation in arXiv-2509.00872v1-main.tex.
  • It does not specify crucial rasterization details such as:
    • numeric sigma
    • joint-vs-limb relative weighting
    • quantization and dtype
    • crop policy
    • resize policy
    • whether alignment is per-frame or per-sequence
  • Local runs show these details matter a lot:
    • sigma=8 skeleton runs were very poor
    • smaller sigma and fixed limb/joint alignment improved results materially
    • the best local skeleton baseline is still only 50.47 Acc / 48.63 Macro-F1, far below the paper's 82.5 / 76.6 for ScoNet-MT-ske in arXiv-2509.00872v1-main.tex
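To make the sensitivity to these unspecified choices concrete, here is one plausible two-channel rasterization. Every numeric constant (canvas size, `sigma`, joint/limb weights) and the `rasterize_skeleton` helper itself are assumptions; the paper specifies none of them:

```python
import math

def rasterize_skeleton(joints, limbs, size=64, sigma=2.0,
                       joint_weight=1.0, limb_weight=0.5):
    """Render a two-channel skeleton map: channel 0 holds Gaussian
    blobs at joints, channel 1 holds Gaussian-profiled limb segments.
    All constants are illustrative guesses, not paper values.
    """
    jmap = [[0.0] * size for _ in range(size)]
    lmap = [[0.0] * size for _ in range(size)]
    radius = int(3 * sigma)

    def splat(grid, cx, cy, w):
        # max-composite a Gaussian blob centered at (cx, cy)
        for y in range(max(0, int(cy) - radius), min(size, int(cy) + radius + 1)):
            for x in range(max(0, int(cx) - radius), min(size, int(cx) + radius + 1)):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                grid[y][x] = max(grid[y][x], w * math.exp(-d2 / (2 * sigma ** 2)))

    for (x, y) in joints:
        splat(jmap, x, y, joint_weight)
    for a, b in limbs:                      # limb = pair of joint indices
        (x0, y0), (x1, y1) = joints[a], joints[b]
        steps = max(2, int(math.hypot(x1 - x0, y1 - y0)))
        for t in range(steps + 1):          # sample points along the bone
            f = t / steps
            splat(lmap, x0 + f * (x1 - x0), y0 + f * (y1 - y0), limb_weight)
    return jmap, lmap
```

Varying `sigma` in this sketch directly changes how diffuse the two channels are and how much joints and limbs overlap, which is consistent with the large local quality swings between sigma=8 and smaller sigmas.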

Conclusion:

  • the paper specifies the representation idea
  • it does not specify enough to make the skeleton-map branch quantitatively reproducible from text alone

5. The visualization story is only partially reproducible

Evidence:

  • The ScoNet paper cites the attention-transfer visualization family in arXiv-2407.05726v3-main.tex.
  • The DRF paper cites Zhou et al. CAM in arXiv-2509.00872v1-main.tex.
  • Neither paper states:
    • which layer is visualized
    • whether visualization is before or after temporal pooling
    • the exact normalization/rendering procedure
  • Local attempts suggest that only certain intermediate layers produce qualitatively plausible maps, and even then the results are approximations rather than faithful reproductions.
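For reference, plain CAM in the Zhou et al. sense reduces to a classifier-weighted channel sum followed by min-max normalization. The sketch below is only one plausible reading: which layer is tapped, whether it runs before or after temporal pooling, and the normalization are exactly the under-specified parts:

```python
def class_activation_map(features, fc_weights, cls):
    """Plain CAM: weight each feature channel by the classifier
    weight for `cls`, sum over channels, min-max normalize.

    `features` is [C][H][W] (some intermediate feature map) and
    `fc_weights` is [num_classes][C]. Both names and the choice of
    layer are assumptions for illustration.
    """
    C, H, W = len(features), len(features[0]), len(features[0][0])
    cam = [[sum(fc_weights[cls][c] * features[c][y][x] for c in range(C))
            for x in range(W)] for y in range(H)]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = (hi - lo) or 1.0              # guard against a flat map
    return [[(v - lo) / span for v in row] for row in cam]
```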

Conclusion:

  • qualitative visualization claims are only partially reproducible
  • they should not be treated as strong evidence until the extraction procedure is specified better

What is not reproducible from the paper and local materials alone

6. The paper-level ScoNet-MT-ske and DRF numbers are not reproducible yet

Evidence:

  • The DRF paper reports:
    • ScoNet-MT-ske: 82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1
    • DRF: 86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1 in arXiv-2509.00872v1-main.tex
  • The best local skeleton-map baseline so far is recorded in scoliosis_training_change_log.md:
    • 50.47 Acc / 69.31 Prec / 54.58 Rec / 48.63 F1
  • Local DRF runs are also well below the paper:
    • 58.08 / 78.80 / 60.22 / 56.99
    • 51.67 / 72.37 / 56.22 / 50.92

Conclusion:

  • the current repo can reproduce the idea of DRF
  • it cannot reproduce the paper's reported skeleton/DRF metrics yet

7. The missing author-side training details are still unresolved

Evidence:

  • The author-side DRF stub referenced a BaseModel_body path that was not released in the original materials.
  • A local reconstruction suggests that BaseModel_body was probably thin, not the whole explanation for the metric gap.
  • Even after matching the likely missing base-class contract more closely, the metric gap remained large.

Conclusion:

  • the missing private code is probably not the only reason reproduction fails
  • but the lack of released code still weakens the paper's reproducibility

8. The exact split accounting is slightly inconsistent

Evidence:

Conclusion:

  • this is probably a release/text bookkeeping mismatch, not the main source of failure
  • but it is another example that the paper protocol is not perfectly audit-friendly

Current strongest local conclusions

Reproducible with high confidence

  • silhouette ScoNet runs are meaningful
  • the Scoliosis1K raw pose data does not appear obviously broken
  • the OpenGait training/evaluation infrastructure is not the main problem
  • PAV computation is not the main blocker
  • the skeleton branch is learnable on the easier 1:1:2 split
  • on 1:1:2, body-only + weighted CE reached 81.82 Acc / 66.21 Prec / 88.50 Rec / 65.96 F1 on the full test set
  • on the same split, body-only + plain CE improved that further to 83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1 at 7000
  • a later explicit rerun of the body-only + plain CE 7000 full-test eval reproduced that same 83.16 / 68.24 / 80.02 / 68.47 result
  • a later AdamW cosine finetune from that same 10k plain-CE checkpoint improved the practical result further:
    • verified retained best checkpoint at 27000: 92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1
    • final 80000 checkpoint still remained strong: 90.64 Acc / 72.87 Prec / 93.19 Rec / 75.74 F1
  • adding back limited head context via head-lite did not improve the full-test score; its 7000 checkpoint reached only 78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1
  • the first practical DRF bridge on the same 1:1:2 body-only recipe peaked early and still underperformed the plain skeleton baseline; its best retained 2000 checkpoint reached only 80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1 on the full test set
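The weighted-vs-plain CE comparison above reduces to whether a per-class factor multiplies the per-sample loss. A minimal sketch of both variants; the actual class weights used in the local runs are not restated here, so `weights` below is a placeholder:

```python
import math

def cross_entropy(logits, target, weights=None):
    """Per-sample cross-entropy with optional class weights.

    With `weights=None` this is the plain CE that the audit found to
    work best on the 1:1:2 split; passing e.g. inverse-frequency
    weights gives the weighted-CE variant.
    """
    m = max(logits)                                   # log-sum-exp trick
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    loss = log_z - logits[target]
    return loss * (weights[target] if weights else 1.0)
```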

Not reproducible with current evidence

  • the papers' claimed skeleton-map baseline quality
  • the papers' claimed DRF improvement magnitude
  • the papers' qualitative response-map story as shown in the figures

Most likely interpretation

  • the papers are probably directionally correct
  • but the skeleton-map and DRF pipelines are under-specified
  • the missing implementation details are important enough that a faithful independent reproduction is not currently achievable from the paper text and released materials alone
  • the 1:1:8 class ratio is not just a nuisance; it appears to be a major driver of the current skeleton/DRF failure mode
  • on the easier 1:1:2 split, weighted CE is not currently the winning recipe; the best local full-test result so far came from plain CE
  • head-lite may help the small fixed proxy subset, but that gain did not transfer to the full TEST_SET, so body-only + plain CE remains the best practical skeleton recipe
  • once the practical 1:1:2 body-only plain-CE recipe was established, the branch still appeared underfit enough that optimizer/schedule mattered again: a later AdamW cosine finetune beat the earlier SGD bridge by a large margin at its retained best checkpoint, so the earlier 83.16 / 68.47 result was a stable baseline but not the ceiling of this skeleton recipe
  • DRF currently looks worse than the plain skeleton baseline not because the skeleton path is dead, but because the additional prior branch is not yet providing a selective or stable complement. The current local evidence points to three likely causes:
    • the body-only skeleton baseline already captures most of the useful torso signal on 1:1:2, so PAV may be largely redundant in this setting
    • the current PGA/PAV path appears weakly selective in local diagnostics, so the prior is not clearly emphasizing a few clinically relevant parts
    • DRF peaks very early and then degrades, which suggests the added branch is making optimization less stable without improving the final decision boundary

When a future change claims to improve DRF reproduction, it should satisfy all of the following:

  1. beat the current best skeleton baseline on the fixed proxy protocol
  2. remain stable across at least one full run, not just a short spike
  3. state whether the change is:
    • paper-faithful
    • implementation-motivated
    • or purely empirical
  4. avoid using silhouette success as evidence that the skeleton path is correct
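The first three criteria can be checked mechanically; criterion 4 is a reviewing discipline rather than a computation. A sketch of such a gate, where the metric keys, the stability threshold, and the `accept_drf_change` name are all illustrative choices:

```python
def accept_drf_change(candidate, baseline, full_run_curve,
                      provenance, min_stable_fraction=0.9):
    """Mechanical gate for criteria 1-3 above.

    `candidate`/`baseline` are metric dicts on the fixed proxy
    protocol; `full_run_curve` is the candidate's metric over one
    full run; `provenance` must name one of the three allowed
    motivations. The 0.9 stability threshold is an arbitrary guess.
    """
    beats_baseline = all(candidate[k] >= baseline[k] for k in baseline)
    peak = max(full_run_curve)
    # "stable" here means the run ends near its peak, not a short spike
    stable = full_run_curve[-1] >= min_stable_fraction * peak
    labeled = provenance in {"paper-faithful", "implementation-motivated",
                             "purely empirical"}
    return beats_baseline and stable and labeled
```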

Practical bottom line

At the moment, this repo supports:

  • faithful silhouette ScoNet experimentation
  • plausible DRF implementation work
  • structured debugging of the skeleton-map branch

At the moment, this repo does not yet support:

  • claiming a successful independent reproduction of the DRF paper's quantitative results
  • claiming that the paper's skeleton-map preprocessing is fully specified
  • treating the papers' qualitative feature-response visualizations as reproduced

For practical model selection, the current conclusion is simpler:

  • stop treating DRF as the default winner
  • keep the practical mainline on 1:1:2
  • use the retained body-only + plain CE skeleton checkpoint family as the working solution
  • the strongest verified practical checkpoint is the later AdamW cosine finetune checkpoint at 27000, with:
    • 92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1

That means the remaining work is no longer broad reproduction debugging. It is mostly optional refinement:

  • confirm whether body-only really beats full-body under the same successful training recipe
  • optionally retry DRF only after the strong practical skeleton baseline is fixed
  • package and use the retained best checkpoint rather than continuing to churn the whole search space