feat: archive best scoliosis checkpoints

2026-03-11 10:23:38 +08:00
parent a0150c791f
commit fbc0696dc4
10 changed files with 489 additions and 4 deletions
@@ -163,6 +163,9 @@ Conclusion:
 - on `1:1:2`, `body-only + weighted CE` reached `81.82 Acc / 66.21 Prec / 88.50 Rec / 65.96 F1` on the full test set
 - on the same split, `body-only + plain CE` improved that further to `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1` at `7000`
 - a later explicit rerun of the `body-only + plain CE` `7000` full-test eval reproduced that same `83.16 / 68.24 / 80.02 / 68.47` result
+- a later `AdamW` cosine finetune from that same `10k` plain-CE checkpoint improved the practical result further:
+  - verified retained best checkpoint at `27000`: `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
+  - final `80000` checkpoint kept high accuracy and recall but lost precision and F1: `90.64 Acc / 72.87 Prec / 93.19 Rec / 75.74 F1`
 - adding back limited head context via `head-lite` did not improve the full-test score; its `7000` checkpoint reached only `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`
 - the first practical DRF bridge on the same `1:1:2` body-only recipe peaked early and still underperformed the plain skeleton baseline; its best retained `2000` checkpoint reached only `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1` on the full test set

@@ -180,6 +183,7 @@ Conclusion:
 - the `1:1:8` class ratio is not just a nuisance; it appears to be a major driver of the current skeleton/DRF failure mode
 - on the easier `1:1:2` split, weighted CE is not currently the winning recipe; the best local full-test result so far came from plain CE
 - `head-lite` may help the small fixed proxy subset, but that gain did not transfer to the full `TEST_SET`, so `body-only + plain CE` remains the best practical skeleton recipe
+- once the practical `1:1:2` body-only plain-CE recipe was established, the branch still appeared underfit enough that optimizer and schedule choices mattered again. A later `AdamW` cosine finetune beat the earlier SGD bridge by a large margin at its retained best checkpoint, so the earlier `83.16 Acc / 68.47 F1` result was a stable baseline, not the ceiling of this skeleton recipe
 - DRF currently looks worse than the plain skeleton baseline not because the skeleton path is dead, but because the additional prior branch is not yet providing a selective or stable complement. The current local evidence points to three likely causes:
  - the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`, so PAV may be largely redundant in this setting
  - the current PGA/PAV path appears weakly selective in local diagnostics, so the prior is not clearly emphasizing a few clinically relevant parts
@@ -210,3 +214,17 @@ At the moment, this repo does not yet support:
 - claiming a successful independent reproduction of the DRF paper’s quantitative results
 - claiming that the paper’s skeleton-map preprocessing is fully specified
 - treating the paper’s qualitative feature-response visualizations as reproduced
+
+For practical model selection, the current conclusion is simpler:
+
+- stop treating DRF as the default winner
+- keep the practical mainline on `1:1:2`
+- use the retained `body-only + plain CE` skeleton checkpoint family as the working solution
+- the strongest verified practical checkpoint is the later `AdamW` cosine finetune checkpoint at `27000`, with:
+  - `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
+
+That means the remaining work is no longer broad reproduction debugging. It is mostly optional refinement:
+
+- confirm whether `body-only` really beats `full-body` under the same successful training recipe
+- optionally retry DRF only after the strong practical skeleton baseline is locked in
+- package and use the retained best checkpoint rather than continuing to churn the whole search space
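
The "retained best checkpoint" rule in the notes above can be illustrated with a minimal sketch. It assumes checkpoints are ranked by full-test F1, which is what the quoted numbers suggest but the notes do not state outright. `EvalResult` and `pick_best` are hypothetical names for illustration, not helpers from this repo; the metric values are the two `AdamW` cosine finetune results quoted in the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One full-test evaluation of a saved checkpoint (hypothetical container)."""
    iteration: int
    acc: float
    prec: float
    rec: float
    f1: float


# Full-test numbers quoted in the notes for the AdamW cosine finetune.
ADAMW_FINETUNE = [
    EvalResult(27000, 92.38, 90.30, 87.39, 88.70),
    EvalResult(80000, 90.64, 72.87, 93.19, 75.74),
]


def pick_best(results, key="f1"):
    """Return the result with the highest `key`; ties go to the earlier iteration.

    Ranking by F1 is an assumption made for this sketch.
    """
    return max(results, key=lambda r: (getattr(r, key), -r.iteration))


if __name__ == "__main__":
    best = pick_best(ADAMW_FINETUNE)
    print(f"best by F1: iteration {best.iteration} ({best.f1:.2f} F1)")
```

Ranking the same two results by recall instead (`pick_best(ADAMW_FINETUNE, key="rec")`) would select the `80000` checkpoint, so the choice of selection metric is worth recording alongside the archived checkpoint.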