# Scoliosis1K Reproducibility Audit
This note is the current audit of what is and is not reproducible in the Scoliosis1K ScoNet/DRF work inside this repo.
It separates three questions that should not be mixed together:
- is the repo/train stack itself working?
- is the paper-level DRF claim reproducible?
- what is the strongest practical model we have right now?
Related notes:
- [ScoNet and DRF: Status, Architecture, and Reproduction Notes](sconet-drf-status-and-training.md)
- [Scoliosis Training Change Log](scoliosis_training_change_log.md)
- [Scoliosis: Next Experiments](scoliosis_next_experiments.md)
- [DRF Author Checkpoint Compatibility Note](drf_author_checkpoint_compat.md)
Primary references:
- ScoNet paper: [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex)
- DRF paper: [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
## Executive Summary
Current audit conclusion:
- the repo and training stack are working
- the skeleton branch is learnable
- the published DRF result is still not independently reproducible here
- the best practical model in this repo is currently not DRF
Current practical winner:
- model family: `ScoNet-MT-ske`
- split: `1:1:2`
- representation: `body-only`
- loss: plain CE + triplet
- optimizer path: later `AdamW` cosine finetune
- verified best retained checkpoint:
- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
That means:
- practical model development is in a good state
- paper-faithful DRF reproduction is not
## What We Can Say With High Confidence
### 1. The core training and evaluation stack works
Evidence:
- the silhouette ScoNet path behaves sensibly
- train/eval loops, checkpointing, resume, and standalone eval all work
- the strong practical skeleton result is reproducible from a saved checkpoint
Conclusion:
- the repo is not globally broken
- the main remaining issues are method-specific, not infrastructure-wide
### 2. The raw Scoliosis1K pose data is usable
Evidence:
- earlier dataset analysis showed high pose confidence and stable sequence lengths
- the skeleton branch eventually learns well on the easier practical split
Conclusion:
- the raw pose source is not the main blocker
### 3. The skeleton branch is learnable
Evidence:
- on `1:1:2`, the skeleton branch moved from poor early baselines to a strong retained checkpoint family
- the best verified practical result is:
- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
Conclusion:
- the skeleton path is not dead
- earlier failures on harder settings were not proof that skeleton input is fundamentally wrong
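The `Acc / Prec / Rec / F1` quadruples quoted throughout this note can be recomputed from a per-class confusion matrix. A minimal sketch, assuming macro averaging over the three classes (matching the `full-test macro-F1` criterion in the Bottom Line); the counts below are hypothetical, not actual Scoliosis1K results:

```python
# Hypothetical 3-class confusion matrix, class order negative=0, neutral=1, positive=2.
# conf[i][j] = number of samples with true class i predicted as class j.
conf = [
    [50, 3, 2],
    [4, 30, 6],
    [1, 5, 40],
]

def macro_metrics(conf):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = conf[c][c]
        pred_c = sum(conf[i][c] for i in range(n))  # column sum: predicted as c
        true_c = sum(conf[c])                       # row sum: truly c
        p = tp / pred_c if pred_c else 0.0
        r = tp / true_c if true_c else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return correct / total, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

acc, prec, rec, f1 = macro_metrics(conf)
print(f"{acc*100:.2f} Acc / {prec*100:.2f} Prec / {rec*100:.2f} Rec / {f1*100:.2f} F1")
```

Any future eval claiming to beat the retained checkpoint should state whether its numbers use this macro averaging or a weighted variant, since the two diverge under the skewed class distribution.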
## What Is Reproducible
### 4. The silhouette ScoNet path is reproducible enough to trust
Evidence:
- the silhouette pipeline and trainer behave consistently
- strong silhouette checkpoints evaluate sensibly on their intended split family
Conclusion:
- silhouette ScoNet remains a valid sanity anchor for this repo
### 5. The high-level DRF idea is reproducible
Evidence:
- the DRF paper defines the method as `skeleton map + PAV + PGA`
- this repo now contains:
- a DRF model
- DRF-specific preprocessing
- PAV generation
- PGA integration
Conclusion:
- the architecture-level idea is implementable
- a plausible DRF implementation exists locally
### 6. The PAV concept is reproducible
Evidence:
- PAV metrics were implemented and produced stable sequence-level signals
- the earlier analysis showed PAV still carries useful class signal
Conclusion:
- “we could not build the clinical prior” is not the main explanation anymore
## What Is Only Partially Reproducible
### 7. The skeleton-map branch is only partially specified by the papers
The papers define the representation conceptually, but not enough for quantitative reproduction from text alone.
Missing or under-specified details included:
- numeric Gaussian widths
- joint-vs-limb relative weighting
- crop and alignment policy
- resize/padding policy
- quantization/dtype behavior
- runtime transform assumptions
Why that matters:
- small rasterization and alignment changes moved results a lot
- many early failures came from details the paper did not pin down tightly enough
Conclusion:
- the paper gives the representation idea
- it does not fully specify the winning implementation
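The sensitivity to unpinned rasterization detail can be made concrete with a toy joint-heatmap sketch. The grid size, σ values, and joint coordinates here are hypothetical illustrations, not the values used by the paper or by this repo's preprocessing:

```python
import math

def rasterize_joint(x, y, sigma, size=8):
    """Render one joint as an unnormalized 2D Gaussian on a size x size grid."""
    return [
        [math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2 * sigma ** 2))
         for c in range(size)]
        for r in range(size)
    ]

# The same joint at two plausible Gaussian widths: the peak location is
# identical, but the mass spread (what the network actually sees) differs.
narrow = rasterize_joint(3.0, 4.0, sigma=0.5)
wide = rasterize_joint(3.0, 4.0, sigma=2.0)
spread_narrow = sum(map(sum, narrow))
spread_wide = sum(map(sum, wide))
```

Here `spread_wide > spread_narrow`, which is the whole point: an unspecified σ (and likewise crop, resize, or dtype policy) silently changes the effective input representation even when the joint coordinates are identical.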
### 8. The visualization story is only partially reproducible
The papers cite visualization families, but do not specify enough extraction detail to reproduce the exact figures cleanly.
Missing detail included:
- which layer
- before or after temporal pooling
- exact normalization/rendering procedure
Conclusion:
- local visualization work is useful for debugging
- it should not be treated as paper-faithful evidence
## What Is Not Reproduced
### 9. The published DRF quantitative claim is not reproduced here
Paper-side numbers:
- `ScoNet-MT-ske`: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
- `DRF`: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`
Local practical DRF result on the current workable `1:1:2` path:
- best retained DRF checkpoint (`2000`) full test:
- `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1`
Conclusion:
- DRF currently loses to the stronger plain skeleton baseline in this repo
- the published DRF advantage is not established locally
### 9a. The author-provided DRF checkpoint is partially recoverable
This changed one important part of the audit:
- the author checkpoint itself is not unusable
- the earlier very poor local eval was mostly a compatibility failure
The recovered best author-checkpoint path is:
- config: `configs/drf/drf_author_eval_118_aligned_1gpu.yaml`
- result:
- `80.24 Acc / 76.73 Prec / 76.40 Rec / 76.56 F1`
This is still below the paper's DRF headline result:
- `86.0 / 84.1 / 79.2 / 80.8`
But it is far better than the earlier broken compat evals, which means:
- the weight file is real
- the stale author YAML is not a reliable runtime contract
The main causes were:
- split mismatch:
- checkpoint name says `118`
- provided YAML points to `112`
- class-order mismatch:
- author stub uses `negative=0, positive=1, neutral=2`
- repo evaluator assumes `negative=0, neutral=1, positive=2`
- legacy module naming mismatch:
- `attention_layer.*` vs `PGA.*`
- preprocessing/runtime mismatch:
- the checkpoint aligns much better with `Scoliosis1K-drf-pkl-118-aligned`
- it performs very badly on the local `118-paper` export
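The naming and class-order mismatches above are mechanical to patch once identified. A minimal sketch operating on a plain dict standing in for the checkpoint `state_dict`; the key names follow the `attention_layer.*` vs `PGA.*` mismatch, and the permutation maps the author class order (`negative=0, positive=1, neutral=2`) into the repo order (`negative=0, neutral=1, positive=2`):

```python
def remap_keys(state_dict, old_prefix="attention_layer.", new_prefix="PGA."):
    """Rename legacy module keys so they match the current model definition."""
    return {
        (new_prefix + k[len(old_prefix):]) if k.startswith(old_prefix) else k: v
        for k, v in state_dict.items()
    }

# Author order: negative=0, positive=1, neutral=2
# Repo order:   negative=0, neutral=1,  positive=2
# AUTHOR_TO_REPO[author_idx] = repo_idx
AUTHOR_TO_REPO = [0, 2, 1]

def remap_class_rows(rows):
    """Reorder per-class rows (e.g. classifier weights) into the repo class order."""
    repo_rows = [None] * len(rows)
    for author_idx, row in enumerate(rows):
        repo_rows[AUTHOR_TO_REPO[author_idx]] = row
    return repo_rows

# Toy example standing in for a real checkpoint.
ckpt = {"attention_layer.qkv.weight": [1.0], "backbone.conv1.weight": [2.0]}
fixed = remap_keys(ckpt)
```

The split and preprocessing mismatches cannot be patched this way; they have to be resolved by choosing the matching dataset export (`Scoliosis1K-drf-pkl-118-aligned`), which is what the recovered config does.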
Conclusion:
- the author checkpoint can be made meaningfully usable in this repo
- but the provided bundle still does not fully specify the original training/eval contract
### 10. The paper-level `1:1:8` skeleton story is not reproduced here
What happened locally:
- `1:1:8` skeleton runs remained much weaker and more unstable
- the stronger practical result came from moving to `1:1:2`
Conclusion:
- the hard `1:1:8` regime is still unresolved here
- current local evidence says class distribution is a major part of the failure mode
## Practical Findings From The Search
### 11. Representation findings
Current local ranking:
- `body-only` is the best practical representation so far
- `head-lite` helped some small proxy runs but did not transfer to the full test set
- `full-body` has not yet beaten the best `body-only` checkpoint family
Concrete comparison:
- `body-only + plain CE` full test at `7000`:
- `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1`
- `head-lite + plain CE` full test at `7000`:
- `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`
Conclusion:
- the stable useful signal is mainly torso-centered
- adding limited head information did not improve full-test performance
### 12. Loss and optimizer findings
Current local ranking:
- on the practical `1:1:2` branch, `plain CE` beat `weighted CE`
- `SGD` produced the first strong baseline
- later `AdamW` cosine finetune beat that baseline substantially
- earlier `AdamW` multistep finetune was unstable and inferior
Conclusion:
- the current best recipe is not “AdamW from scratch”
- it is “strong SGD-style baseline first, then milder AdamW cosine finetune”
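The "milder AdamW cosine finetune" stage can be sketched as a standard cosine decay from a reduced base LR applied on top of the SGD checkpoint. The numeric values are hypothetical, not the repo's actual config:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate, decaying from base_lr to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Hypothetical finetune horizon and base LR: small, smooth decay rather than
# the abrupt multistep drops that proved unstable in the earlier AdamW run.
TOTAL, BASE = 10_000, 1e-4
schedule = [cosine_lr(s, TOTAL, BASE) for s in (0, 5_000, 10_000)]
```

The smooth monotone decay (full LR at step 0, half at the midpoint, ~0 at the end) is the property that distinguishes this recipe from the multistep variant; the repo's own config values should be read from the YAML, not from this sketch.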
### 13. DRF-specific finding
Current local interpretation:
- DRF is not failing because the skeleton branch is dead
- it is failing because the extra prior branch is not yet adding a strong enough complementary signal
Most likely reasons:
- the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`
- the current PAV/PGA path looks weakly selective
- DRF tended to peak early and then degrade, suggesting less stable optimization than the plain baseline
Conclusion:
- DRF is still a research branch here
- it is not the current practical winner
## Important Evaluation Caveat
### 14. There was no test-sample gradient leakage
For the long finetune run:
- training batches came only from `TRAIN_SET`
- repeated evaluation ran under `torch.no_grad()`
- no `TEST_SET` samples were used in backprop or optimizer updates
Conclusion:
- there was no train/test mixing inside gradient descent
### 15. There was test-set model-selection leakage
For the long finetune run:
- full `TEST_SET` eval was run repeatedly during training
- best-checkpoint retention used `test_accuracy` and `test_f1`
- the final retained winner was chosen with test metrics
That means:
- the retained best checkpoint is real and reproducible
- but its metric should be interpreted as a test-guided selected result, not a pristine one-shot final estimate
Conclusion:
- acceptable for practical model selection
- not ideal for a publication-style clean generalization claim
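The distinction between points 14 and 15 can be made concrete: the clean protocol selects the retained checkpoint on a held-out validation metric and only then reads the test metric once. A minimal sketch with hypothetical per-checkpoint scores:

```python
# Hypothetical (val_f1, test_f1) pairs logged at successive checkpoints.
checkpoints = {
    1000: (0.80, 0.82),
    2000: (0.84, 0.85),
    3000: (0.83, 0.89),  # test peak that validation selection would not pick
}

# Leaky protocol (what the long finetune run did): retain by test_f1.
leaky_pick = max(checkpoints, key=lambda k: checkpoints[k][1])

# Clean protocol: retain by val_f1, then report that checkpoint's test_f1 once.
clean_pick = max(checkpoints, key=lambda k: checkpoints[k][0])
reported_test_f1 = checkpoints[clean_pick][1]
```

In this toy setup the leaky protocol retains checkpoint `3000` and reports `0.89`, while the clean protocol retains `2000` and reports `0.85`; only the latter number is a one-shot generalization estimate. This is why the retained `92.38` result should be labeled test-guided.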
## Bottom Line
The repo currently supports:
- strong practical scoliosis model development
- stable skeleton-map experimentation
- plausible DRF implementation work
The repo does not currently support:
- claiming an independent reproduction of the DRF paper's quantitative result
- claiming that the DRF paper fully specifies the winning skeleton-map preprocessing
- treating DRF as the default best model in this codebase
Practical bottom line:
- keep the `body-only` skeleton baseline as the mainline path
- keep the retained best checkpoint family as the working artifact
- treat DRF as an optional follow-up branch, not the current winner
- if DRF work continues, the author checkpoint on the aligned `118` path is now a much better starting point than scratch DRF
Current strongest practical checkpoint:
- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
If future work claims to improve on this, it should:
1. beat the current retained best checkpoint on full-test macro-F1
2. be verifiable by standalone eval from a saved checkpoint
3. state clearly whether the change is paper-faithful or purely empirical