OpenGait/docs/scoliosis_reproducibility_audit.md

Scoliosis1K Reproducibility Audit

This note is the current audit of what is and is not reproducible in the Scoliosis1K ScoNet/DRF work inside this repo.

It separates three questions that should not be mixed together:

  • is the repo/training stack itself working?
  • is the paper-level DRF claim reproducible?
  • what is the strongest practical model we have right now?

Related notes:

Primary references:

Executive Summary

Current audit conclusion:

  • the repo and training stack are working
  • the skeleton branch is learnable
  • the published DRF result is still not independently reproducible here
  • the best practical model in this repo is currently not DRF

Current practical winner:

  • model family: ScoNet-MT-ske
  • split: 1:1:2
  • representation: body-only
  • loss: plain CE + triplet
  • optimizer path: later AdamW cosine finetune
  • verified best retained checkpoint:
    • 92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1

That means:

  • practical model development is in a good state
  • paper-faithful DRF reproduction is not
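As a quick reference, the winning recipe can be captured in one place. The dict below is a hypothetical summary, not the repo's actual config schema; every key name is illustrative:

```python
# Hypothetical summary of the current practical winner recipe.
# Key names are illustrative, not the repo's real config schema.
winner_recipe = {
    "model": "ScoNet-MT-ske",
    "split": "1:1:2",
    "representation": "body-only",
    "loss": ["cross_entropy", "triplet"],  # plain CE + triplet, unweighted
    "schedule": [
        {"phase": "baseline", "optimizer": "SGD"},
        {"phase": "finetune", "optimizer": "AdamW", "lr_schedule": "cosine"},
    ],
}

def describe(recipe):
    """Render the recipe as a one-line human-readable summary."""
    phases = " -> ".join(p["optimizer"] for p in recipe["schedule"])
    return (f"{recipe['model']} on {recipe['split']} "
            f"({recipe['representation']}), {phases}")

print(describe(winner_recipe))
```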

What We Can Say With High Confidence

1. The core training and evaluation stack works

Evidence:

  • the silhouette ScoNet path behaves sensibly
  • train/eval loops, checkpointing, resume, and standalone eval all work
  • the strong practical skeleton result is reproducible from a saved checkpoint

Conclusion:

  • the repo is not globally broken
  • the main remaining issues are method-specific, not infrastructure-wide

2. The raw Scoliosis1K pose data is usable

Evidence:

  • earlier dataset analysis showed high pose confidence and stable sequence lengths
  • the skeleton branch eventually learns well on the easier practical split

Conclusion:

  • the raw pose source is not the main blocker

3. The skeleton branch is learnable

Evidence:

  • on 1:1:2, the skeleton branch moved from poor early baselines to a strong retained checkpoint family
  • the best verified practical result is:
    • 92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1

Conclusion:

  • the skeleton path is not dead
  • earlier failures on harder settings were not proof that skeleton input is fundamentally wrong

What Is Reproducible

4. The silhouette ScoNet path is reproducible enough to trust

Evidence:

  • the silhouette pipeline and trainer behave consistently
  • strong silhouette checkpoints evaluate sensibly on their intended split family

Conclusion:

  • silhouette ScoNet remains a valid sanity anchor for this repo

5. The high-level DRF idea is reproducible

Evidence:

  • the DRF paper defines the method as skeleton map + PAV + PGA
  • this repo now contains:
    • a DRF model
    • DRF-specific preprocessing
    • PAV generation
    • PGA integration

Conclusion:

  • the architecture-level idea is implementable
  • a plausible DRF implementation exists locally
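To make the architecture-level claim concrete, here is a minimal late-fusion sketch of the DRF idea: a skeleton-branch embedding concatenated with a PAV prior vector and projected to logits. All dimensions and names are assumptions, PGA attention is deliberately omitted, and this is not the repo's actual DRF wiring:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_drf(skeleton_embedding, pav_prior, w_fuse):
    """Minimal late-fusion sketch: concatenate the skeleton-branch
    embedding with the PAV prior vector and project to class logits.
    (The real DRF model also applies PGA attention; omitted here.)"""
    joint = np.concatenate([skeleton_embedding, pav_prior])
    return w_fuse @ joint  # shape: (num_classes,)

skel = rng.standard_normal(128)          # assumed skeleton embedding size
pav = rng.standard_normal(16)            # assumed PAV prior vector size
w = rng.standard_normal((2, 128 + 16))   # assumed binary screening head

logits = fuse_drf(skel, pav, w)
print(logits.shape)  # (2,)
```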

6. The PAV concept is reproducible

Evidence:

  • PAV metrics were implemented and produced stable sequence-level signals
  • the earlier analysis showed PAV still carries useful class signal

Conclusion:

  • “we could not build the clinical prior” is not the main explanation anymore
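For intuition only, one plausible sequence-level signal in the spirit of PAV is a normalized left/right asymmetry averaged over frames. The keypoint names and the specific metric below are illustrative assumptions, not the repo's actual PAV definition:

```python
def shoulder_asymmetry(frames):
    """One plausible sequence-level asymmetry signal in the spirit of PAV:
    mean left/right shoulder height difference, normalized by torso length.
    Each frame is a dict of (x, y) keypoints; names are assumptions."""
    values = []
    for f in frames:
        torso = abs(f["neck"][1] - f["hip_center"][1]) or 1.0
        values.append(abs(f["l_shoulder"][1] - f["r_shoulder"][1]) / torso)
    return sum(values) / len(values)

# Toy sequence: right shoulder consistently 5 units lower than left.
seq = [
    {"neck": (0, 100), "hip_center": (0, 0),
     "l_shoulder": (-10, 95), "r_shoulder": (10, 90)}
    for _ in range(4)
]
print(round(shoulder_asymmetry(seq), 3))  # 0.05
```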

What Is Only Partially Reproducible

7. The skeleton-map branch is only partially specified by the papers

The papers define the representation conceptually, but not enough for quantitative reproduction from text alone.

Missing or under-specified details included:

  • numeric Gaussian widths
  • joint-vs-limb relative weighting
  • crop and alignment policy
  • resize/padding policy
  • quantization/dtype behavior
  • runtime transform assumptions

Why that matters:

  • small rasterization and alignment changes moved results a lot
  • many early failures came from details the paper did not pin down tightly enough

Conclusion:

  • the paper gives the representation idea
  • it does not fully specify the winning implementation
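The kind of detail that is missing can be made concrete with a minimal joint-rasterization sketch. The `sigma` (Gaussian width) and `amplitude` (joint-vs-limb relative weighting) parameters below are exactly the knobs the papers leave unpinned; the grid size and coordinate handling stand in for the crop/resize policies that were also unspecified:

```python
import math

def rasterize_joint(h, w, cx, cy, sigma, amplitude=1.0):
    """Render one joint as a 2D Gaussian onto an h-by-w grid.
    sigma and amplitude are the under-specified knobs: small changes
    to rasterization details like these moved results a lot."""
    return [[amplitude * math.exp(-((x - cx) ** 2 + (y - cy) ** 2)
                                  / (2.0 * sigma ** 2))
             for x in range(w)] for y in range(h)]

heat = rasterize_joint(h=8, w=8, cx=4.0, cy=4.0, sigma=1.0)
print(round(heat[4][4], 3))  # peak value at the joint center
```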

8. The visualization story is only partially reproducible

The papers cite visualization families, but do not specify enough extraction detail to reproduce the exact figures cleanly.

Missing detail included:

  • which layer
  • before or after temporal pooling
  • exact normalization/rendering procedure

Conclusion:

  • local visualization work is useful for debugging
  • it should not be treated as paper-faithful evidence
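A minimal normalization step for rendering an activation map might look like the sketch below; which layer to tap, and whether to pool over time first, remains the part the papers do not specify:

```python
def normalize_for_display(fmap):
    """Min-max normalize a 2D activation map to [0, 1] for rendering.
    Which layer the map comes from, and whether temporal pooling was
    applied first, is exactly what the papers leave unspecified."""
    flat = [v for row in fmap for v in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # guard against a constant map
    return [[(v - lo) / scale for v in row] for row in fmap]

norm = normalize_for_display([[0.0, 2.0], [4.0, 8.0]])
print(norm)  # [[0.0, 0.25], [0.5, 1.0]]
```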

What Is Not Reproduced

9. The published DRF quantitative claim is not reproduced here

Paper-side numbers:

  • ScoNet-MT-ske: 82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1
  • DRF: 86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1

Local practical DRF result on the current workable 1:1:2 path:

  • best retained DRF checkpoint (2000) full test:
    • 80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1

Conclusion:

  • DRF currently loses to the stronger plain skeleton baseline in this repo
  • the published DRF advantage is not established locally

10. The paper-level 1:1:8 skeleton story is not reproduced here

What happened locally:

  • 1:1:8 skeleton runs remained much weaker and more unstable
  • the stronger practical result came from moving to 1:1:2

Conclusion:

  • the hard 1:1:8 regime is still unresolved here
  • current local evidence says class distribution is a major part of the failure mode

11. Representation findings

Current local ranking:

  • body-only is the best practical representation so far
  • head-lite helped some small proxy runs but did not transfer to the full test set
  • full-body has not yet beaten the best body-only checkpoint family

Concrete comparison:

  • body-only + plain CE full test at 7000:
    • 83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1
  • head-lite + plain CE full test at 7000:
    • 78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1

Conclusion:

  • the stable useful signal is mainly torso-centered
  • adding limited head information did not improve full-test performance
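A body-only filter is essentially a keypoint subset. The sketch below assumes COCO-17 ordering, where indices 0-4 are the head keypoints (nose, eyes, ears); the repo's actual index map may differ:

```python
# Hypothetical body-only filter assuming COCO-17 keypoint ordering,
# where indices 0-4 are nose/eyes/ears. The repo's index map may differ.
HEAD_INDICES = {0, 1, 2, 3, 4}

def body_only(keypoints):
    """Drop head keypoints from one frame's (x, y, conf) keypoint list."""
    return [kp for i, kp in enumerate(keypoints) if i not in HEAD_INDICES]

frame = [(float(i), float(i), 1.0) for i in range(17)]
print(len(body_only(frame)))  # 12 body keypoints remain
```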

12. Loss and optimizer findings

Current local ranking:

  • on the practical 1:1:2 branch, plain CE beat weighted CE
  • SGD produced the first strong baseline
  • later AdamW cosine finetune beat that baseline substantially
  • earlier AdamW multistep finetune was unstable and inferior

Conclusion:

  • the current best recipe is not “AdamW from scratch”
  • it is “strong SGD-style baseline first, then milder AdamW cosine finetune”
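The finetune half of that recipe is a standard cosine anneal. The sketch below gives only the schedule formula; the actual base and minimum learning rates of the winning run are not restated in this note, so the values shown are placeholders:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate for the finetune phase.
    base_lr should be 'milder' than the SGD-phase peak; the numeric
    values below are placeholders, not the winning run's settings."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

print(cosine_lr(0, 100, 1e-4))    # starts at base_lr
print(cosine_lr(100, 100, 1e-4))  # decays to min_lr
```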

13. DRF-specific finding

Current local interpretation:

  • DRF is not failing because the skeleton branch is dead
  • it is failing because the extra prior branch is not yet adding a strong enough complementary signal

Most likely reasons:

  • the body-only skeleton baseline already captures most of the useful torso signal on 1:1:2
  • the current PAV/PGA path looks weakly selective
  • DRF tended to peak early and then degrade, suggesting less stable optimization than the plain baseline

Conclusion:

  • DRF is still a research branch here
  • it is not the current practical winner

Important Evaluation Caveat

14. There was no test-sample gradient leakage

For the long finetune run:

  • training batches came only from TRAIN_SET
  • repeated evaluation ran under torch.no_grad()
  • no TEST_SET samples were used in backprop or optimizer updates

Conclusion:

  • there was no train/test mixing inside gradient descent

15. There was test-set model-selection leakage

For the long finetune run:

  • full TEST_SET eval was run repeatedly during training
  • best-checkpoint retention used test_accuracy and test_f1
  • the final retained winner was chosen with test metrics

That means:

  • the retained best checkpoint is real and reproducible
  • but its metric should be interpreted as a test-guided, selection-biased result, not a pristine one-shot final estimate

Conclusion:

  • acceptable for practical model selection
  • not ideal for a publication-style clean generalization claim
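The selection rule behind this caveat can be stated in a few lines. The history entries below are illustrative (only the retained winner's metrics come from this note); the point is that the selection key is itself the test metric, which makes the retained score optimistically biased:

```python
def retain_best(history):
    """Mimic the audit's retention rule: pick the checkpoint with the
    best (test_accuracy, test_f1) seen during training. Because the
    selection metric IS the test metric, the retained score is biased
    upward relative to a one-shot final evaluation."""
    return max(history, key=lambda h: (h["test_accuracy"], h["test_f1"]))

# Illustrative history; only the step-2000 metrics come from this note.
history = [
    {"step": 1000, "test_accuracy": 88.10, "test_f1": 84.00},
    {"step": 2000, "test_accuracy": 92.38, "test_f1": 88.70},
    {"step": 3000, "test_accuracy": 90.00, "test_f1": 86.50},
]
print(retain_best(history)["step"])  # 2000
```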

Bottom Line

The repo currently supports:

  • strong practical scoliosis model development
  • stable skeleton-map experimentation
  • plausible DRF implementation work

The repo does not currently support:

  • claiming an independent reproduction of the DRF paper's quantitative result
  • claiming that the DRF paper fully specifies the winning skeleton-map preprocessing
  • treating DRF as the default best model in this codebase

Practical bottom line:

  • keep the body-only skeleton baseline as the mainline path
  • keep the retained best checkpoint family as the working artifact
  • treat DRF as an optional follow-up branch, not the current winner

Current strongest practical checkpoint:

  • 92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1

If future work claims to improve on this, it should:

  1. beat the current retained best checkpoint on full-test macro-F1
  2. be verifiable by standalone eval from a saved checkpoint
  3. state clearly whether the change is paper-faithful or purely empirical
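
Criterion 1 is checkable with a small standalone helper. The sketch below assumes binary screening labels; swap in the repo's actual class set if it differs:

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Macro-F1 matching criterion 1: per-class F1 computed
    independently, then averaged without class weighting.
    The binary class set is an assumption."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

print(round(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]), 4))  # 0.7333
```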