docs: update scoliosis reproducibility audit conclusion
# Scoliosis1K Reproducibility Audit
This note records which parts of the ScoNet and DRF papers are currently reproducible in this repo, which parts are only partially reproducible, and which parts remain under-specified or unsupported by local evidence.

It separates three questions that should not be mixed together:

- is the repo/training stack itself working?
- is the paper-level DRF claim reproducible?
- what is the strongest practical model we have right now?

Ground-truth policy for this audit:

- the papers are the methodological source of truth
- local code and local runs are the implementation evidence
- when the two diverge, this document states that explicitly

## Papers and local references

Primary references:

- ScoNet paper: [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex)
- DRF paper: [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- local run history: [scoliosis_training_change_log.md](scoliosis_training_change_log.md)
- current status note: [sconet-drf-status-and-training.md](sconet-drf-status-and-training.md)

Related notes:

- [ScoNet and DRF: Status, Architecture, and Reproduction Notes](sconet-drf-status-and-training.md)
- [Scoliosis Training Change Log](scoliosis_training_change_log.md)
- [Scoliosis: Next Experiments](scoliosis_next_experiments.md)

## Executive Summary

Current audit conclusion:

- the repo and training stack are working
- the skeleton branch is learnable
- the published DRF result is still not independently reproducible here
- the best practical model in this repo is currently not DRF

## What is reproducible

### 1. The silhouette ScoNet pipeline is reproducible

Evidence:

- the ScoNet paper states the standard `1:1:8` evaluation protocol and the SGD schedule clearly in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex), and the same tracked TeX source documents the class-ratio study
- the same paper reports that multi-task ScoNet-MT is much stronger than single-task ScoNet, including in the class-ratio study
- in this repo, the standard silhouette ScoNet path is stable:
  - the model/trainer/evaluator path is intact
  - a strong silhouette checkpoint reproduces cleanly on the correct split family

Conclusion:

- the Scoliosis1K silhouette modality is usable
- the core OpenGait training and evaluation stack is usable for this task
- the repo is not globally broken

### 2. The high-level DRF architecture is reproducible

Evidence:

- the DRF paper defines the method as `skeleton map + PAV + PGA` in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- the same source defines:
  - pelvis-centering and height normalization
  - two-channel skeleton maps
  - PAV metrics
  - PGA channel/spatial attention
- this repo now has a functioning DRF model and a DRF-specific preprocessing path implementing those ideas

Conclusion:

- the paper is specific enough to implement a plausible DRF model family
- the architecture-level claim is reproducible
- the exact paper-level quantitative result is not yet reproducible

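To make the preprocessing step concrete, here is a minimal sketch of pelvis-centering and height normalization. The joint indices (COCO-style: hips at 11/12, nose at 0, ankles at 15/16) and the head-to-ankle height proxy are illustrative assumptions, not the paper's exact definitions.

```python
# Sketch of pelvis-centering + height normalization for one pose frame.
# Joint indices and the height proxy are assumptions for illustration.

def normalize_pose(joints, pelvis_idx=(11, 12), head_idx=0, ankle_idx=(15, 16)):
    """joints: list of (x, y) keypoints. Returns a pelvis-centered,
    height-normalized copy of the frame."""
    # pelvis = midpoint of the two hip joints (assumption)
    lx, ly = joints[pelvis_idx[0]]
    rx, ry = joints[pelvis_idx[1]]
    px, py = (lx + rx) / 2.0, (ly + ry) / 2.0

    # height proxy = vertical span from head to mean ankle height (assumption)
    ay = (joints[ankle_idx[0]][1] + joints[ankle_idx[1]][1]) / 2.0
    height = abs(ay - joints[head_idx][1]) or 1.0  # guard against zero height

    # translate so the pelvis sits at the origin, then scale by body height
    return [((x - px) / height, (y - py) / height) for x, y in joints]
```

After this step, all frames share a common origin and scale, which is what makes per-frame vs per-sequence alignment choices (discussed below for the skeleton map) matter so much.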
### 3. The PAV concept is reproducible

Evidence:

- the DRF paper defines 8 symmetric joint pairs and 3 asymmetry metrics in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- the local preprocessing implements those metrics and produces stable sequence-level PAVs
- local dataset analysis showed the PAV still carries useful signal, even with a simple probe

Conclusion:

- PAV is not the main missing piece
- the main reproduction gap is not “we cannot build the clinical prior”

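The structure of a sequence-level PAV can be sketched as follows. The three per-pair metrics used here (vertical offset, distance-to-pelvis gap, lateral spread) are illustrative stand-ins for the paper's asymmetry metrics, and the hip indices are assumptions; only the shape of the computation (8 pairs × 3 metrics, averaged over frames) mirrors the description above.

```python
# Sketch of a sequence-level Postural Asymmetry Vector (PAV).
# The three per-pair metrics are illustrative, not the paper's exact formulas.
import math

def pair_asymmetry(left, right, pelvis):
    """Three asymmetry numbers for one symmetric joint pair in one frame."""
    dy = left[1] - right[1]                                    # vertical offset
    dl = math.dist(left, pelvis) - math.dist(right, pelvis)    # radial gap
    dx = abs(left[0] - pelvis[0]) - abs(right[0] - pelvis[0])  # lateral spread
    return (dy, dl, dx)

def sequence_pav(frames, pairs, pelvis_idx=(11, 12)):
    """Average per-frame metrics over a sequence -> 3 * len(pairs) values."""
    acc = [0.0] * (3 * len(pairs))
    for joints in frames:
        px = (joints[pelvis_idx[0]][0] + joints[pelvis_idx[1]][0]) / 2.0
        py = (joints[pelvis_idx[0]][1] + joints[pelvis_idx[1]][1]) / 2.0
        for k, (li, ri) in enumerate(pairs):
            m = pair_asymmetry(joints[li], joints[ri], (px, py))
            for j in range(3):
                acc[3 * k + j] += m[j]
    n = len(frames)
    return [v / n for v in acc]
```

A perfectly symmetric pose yields an all-zero PAV, so nonzero entries on real sequences are exactly the "useful signal" the local probe picked up.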
## What is only partially reproducible

### 4. The skeleton-map branch is reproducible only at the concept level

Evidence:

- the DRF paper describes the skeleton map as a dense, silhouette-like two-channel representation in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- it does not specify crucial rasterization details, such as:
  - the numeric `sigma`
  - joint-vs-limb relative weighting
  - quantization and dtype
  - crop policy
  - resize policy
  - whether alignment is per-frame or per-sequence
- local runs show these details matter a lot:
  - `sigma=8` skeleton runs were very poor
  - smaller sigma and fixed limb/joint alignment improved results materially
  - the best local skeleton baseline is still only `50.47 Acc / 48.63 Macro-F1`, far below the paper's `82.5 / 76.6` for ScoNet-MT-ske in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)

Conclusion:

- the paper specifies the representation idea
- it does not specify enough to make the skeleton-map branch quantitatively reproducible from text alone

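To show where those free parameters enter, here is a sketch of a two-channel rasterizer: Gaussian joint blobs in channel 0 and limb segments in channel 1. Everything configurable here (`sigma`, `limb_weight`, the max-vs-sum splat rule, the sampling density) is precisely a choice the paper leaves open, so none of these values should be read as the authors' settings.

```python
# Sketch of a two-channel skeleton-map rasterizer. `sigma`, `limb_weight`,
# and the sampling density are the free parameters the paper leaves open.
import math

def rasterize(joints, limbs, size=64, sigma=2.0, limb_weight=0.5):
    """joints: (x, y) in [0, 1]; limbs: index pairs. Returns [2][size][size]."""
    img = [[[0.0] * size for _ in range(size)] for _ in range(2)]

    def splat(ch, x, y, w):
        # paint a Gaussian blob of weight w centered at (x, y)
        cx, cy = x * (size - 1), y * (size - 1)
        r = int(3 * sigma)
        for yy in range(max(0, int(cy) - r), min(size, int(cy) + r + 1)):
            for xx in range(max(0, int(cx) - r), min(size, int(cx) + r + 1)):
                g = w * math.exp(-((xx - cx) ** 2 + (yy - cy) ** 2)
                                 / (2 * sigma ** 2))
                img[ch][yy][xx] = max(img[ch][yy][xx], g)  # max, not sum

    for x, y in joints:                      # channel 0: joints
        splat(0, x, y, 1.0)
    for a, b in limbs:                       # channel 1: sampled limb segments
        (xa, ya), (xb, yb) = joints[a], joints[b]
        steps = max(2, int(math.dist((xa, ya), (xb, yb)) * size))
        for t in (i / (steps - 1) for i in range(steps)):
            splat(1, xa + t * (xb - xa), ya + t * (yb - ya), limb_weight)

    return img
```

With `sigma=8` on a small canvas the blobs smear into each other, which is consistent with the very poor `sigma=8` runs noted above.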
### 5. The visualization story is only partially reproducible

Evidence:

- the ScoNet paper cites the attention-transfer visualization family in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex)
- the DRF paper cites Zhou et al. CAM in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- neither paper states:
  - which layer is visualized
  - whether visualization happens before or after temporal pooling
  - the exact normalization/rendering procedure
- local attempts suggest that only certain intermediate layers produce qualitatively plausible maps, and even then the results are approximations rather than faithful reproductions

Conclusion:

- qualitative visualization claims are only partially reproducible
- they should not be treated as strong evidence until the extraction procedure is specified better

## What is not reproducible from the paper and local materials alone

### 6. The paper-level ScoNet-MT-ske and DRF numbers are not reproducible yet

Evidence:

- the DRF paper reports, in [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex):
  - ScoNet-MT-ske: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
  - DRF: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`
- the best local skeleton-map baseline so far is recorded in [scoliosis_training_change_log.md](scoliosis_training_change_log.md):
  - `50.47 Acc / 69.31 Prec / 54.58 Rec / 48.63 F1`
- local DRF runs are also well below the paper:
  - `58.08 / 78.80 / 60.22 / 56.99`
  - `51.67 / 72.37 / 56.22 / 50.92`

Conclusion:

- the current repo can reproduce the idea of DRF
- it cannot yet reproduce the paper’s reported skeleton/DRF metrics

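For clarity about what the quadruples above mean, here is a sketch computing the Acc / Prec / Rec / F1 quadruple from label lists. Macro averaging (a simple mean over per-class values) is assumed here; the repo's evaluator may differ in edge-case handling.

```python
# Sketch of the Acc / macro-Prec / macro-Rec / macro-F1 quadruple used
# throughout this note. Macro averaging over classes is assumed.

def macro_metrics(y_true, y_pred, classes):
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    n = len(classes)
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n
```

Note that with a heavy class imbalance like `1:1:8`, accuracy can stay high while macro-F1 collapses, which is why both numbers are tracked above.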
### 7. The missing author-side training details are still unresolved

Evidence:

- the author-side DRF stub referenced a `BaseModel_body` path that was not released in the original materials
- a local reconstruction suggests that `BaseModel_body` was probably thin, and not the whole explanation for the metric gap
- even after matching the likely missing base-class contract more closely, the metric gap remained large

Conclusion:

- the missing private code is probably not the only reason reproduction fails
- but the lack of released code still weakens the paper’s reproducibility

### 8. The exact split accounting is slightly inconsistent

Evidence:

- the ScoNet and DRF papers describe the standard split as `745 train / 748 test` in [arXiv-2407.05726v3-main.tex](papers/arXiv-2407.05726v3-main.tex) and [arXiv-2509.00872v1-main.tex](papers/arXiv-2509.00872v1-main.tex)
- the released partition file matching the `1:1:8` class ratio in this repo is effectively `744 / 749`

Conclusion:

- this is probably a release/text bookkeeping mismatch, not the main source of failure
- but it is another example that the paper protocol is not perfectly audit-friendly

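A discrepancy like `745/748` vs `744/749` is cheap to audit mechanically. The sketch below assumes an OpenGait-style partition JSON with `"TRAIN_SET"` / `"TEST_SET"` ID lists; the actual keys and file layout in this repo may differ, so treat the format as hypothetical.

```python
# Sketch of a split-accounting check. The partition format (a dict with
# "TRAIN_SET" / "TEST_SET" ID lists) is an assumption; adjust the keys to
# whatever the released partition file actually uses.
import json

def split_counts(partition_json):
    """Return (train count, test count, sorted overlapping IDs)."""
    part = json.loads(partition_json)
    train, test = part["TRAIN_SET"], part["TEST_SET"]
    overlap = sorted(set(train) & set(test))
    return len(train), len(test), overlap
```

Comparing the returned counts against the paper's stated `745 / 748`, and checking that the overlap list is empty, turns the bookkeeping question into a one-line assertion.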
## Current strongest local conclusions

### Reproducible with high confidence

- silhouette ScoNet runs are meaningful
- the Scoliosis1K raw pose data does not appear obviously broken
- the OpenGait training/evaluation infrastructure is not the main problem
- PAV computation is not the main blocker
- the skeleton branch is learnable on the easier `1:1:2` split:
  - `body-only + weighted CE` reached `81.82 Acc / 66.21 Prec / 88.50 Rec / 65.96 F1` on the full test set
  - on the same split, `body-only + plain CE` improved that further to `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1` at `7000`
  - a later explicit rerun of the `body-only + plain CE` `7000` full-test eval reproduced that same `83.16 / 68.24 / 80.02 / 68.47` result
- a later `AdamW` cosine finetune from that same `10k` plain-CE checkpoint improved the practical result further:
  - verified retained best checkpoint at `27000`: `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`
  - the final `80000` checkpoint still remained strong: `90.64 Acc / 72.87 Prec / 93.19 Rec / 75.74 F1`
- adding back limited head context via `head-lite` did not improve the full-test score; its `7000` checkpoint reached only `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`
- the first practical DRF bridge on the same `1:1:2` body-only recipe peaked early and still underperformed the plain skeleton baseline; its best retained `2000` checkpoint reached only `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1` on the full test set

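The `AdamW` cosine finetune mentioned above follows the standard cosine annealing curve. A minimal sketch of that schedule is below; the base/min learning rates are purely illustrative, and only the `80000`-iteration horizon is taken from the run length recorded here.

```python
# Sketch of a cosine learning-rate decay as used by a cosine finetune.
# base_lr / min_lr are illustrative values, not the repo's actual config.
import math

def cosine_lr(step, total_steps=80000, base_lr=1e-4, min_lr=1e-6):
    """Anneal from base_lr at step 0 down to min_lr at total_steps."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

Because the curve stays near `base_lr` early and flattens near `min_lr` late, a finetune started from a strong checkpoint gets a long gentle tail, which matches why intermediate checkpoints like `27000` can outperform the final `80000` one.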
### Not reproducible with current evidence

- the paper’s claimed skeleton-map baseline quality
- the paper’s claimed DRF improvement magnitude
- the paper’s qualitative response-map story as shown in the figures

### Most likely interpretation

- the papers are probably directionally correct
- but the skeleton-map and DRF pipelines are under-specified
- the missing implementation details are important enough that a faithful independent reproduction is not currently achievable from the paper text and released materials alone
- the `1:1:8` class ratio is not just a nuisance; it appears to be a major driver of the current skeleton/DRF failure mode
- on the easier `1:1:2` split, weighted CE is not currently the winning recipe; the best local full-test result so far came from plain CE
- `head-lite` may help the small fixed proxy subset, but that gain did not transfer to the full `TEST_SET`, so `body-only + plain CE` remains the best practical skeleton recipe
- once the practical `1:1:2` body-only plain-CE recipe was established, the branch still appeared underfit enough that optimizer and schedule mattered again; a later `AdamW` cosine finetune beat the earlier SGD bridge by a large margin at its retained best checkpoint, which means the earlier `83.16 / 68.47` result was a stable baseline but not the ceiling of this skeleton recipe
- DRF currently looks worse than the plain skeleton baseline not because the skeleton path is dead, but because the additional prior branch is not yet providing a selective or stable complement; the current local evidence points to three likely causes:
  - the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`, so PAV may be largely redundant in this setting
  - the current PGA/PAV path appears weakly selective in local diagnostics, so the prior is not clearly emphasizing a few clinically relevant parts
  - DRF peaks very early and then degrades, which suggests the added branch is making optimization less stable without improving the final decision boundary

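The weighted-vs-plain CE distinction above is worth pinning down. A common weighting choice is inverse class frequency, sketched below for the `1:1:8` ratio; whether the repo uses exactly this normalization is an assumption, so the numbers only illustrate the mechanism.

```python
# Sketch contrasting plain and class-weighted cross-entropy on one example.
# Inverse-frequency weighting from the 1:1:8 ratio is one common choice;
# the repo's actual weighting scheme is not pinned down here.
import math

def cross_entropy(probs, target, weights=None):
    """Negative log-likelihood of the target class, optionally up-weighted."""
    w = 1.0 if weights is None else weights[target]
    return -w * math.log(probs[target])

# inverse-frequency weights for a 1:1:8 class ratio, normalized to mean 1
freq = [1, 1, 8]
inv = [sum(freq) / (len(freq) * f) for f in freq]

loss_plain = cross_entropy([0.2, 0.1, 0.7], target=0)
loss_weighted = cross_entropy([0.2, 0.1, 0.7], target=0, weights=inv)
```

The rare classes get up-weighted by roughly 3.3x while the majority class drops to ~0.42x; on the milder `1:1:2` split that correction is much smaller, which is consistent with plain CE winning there.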
## Recommended standard for future work in this repo

When a future change claims to improve DRF reproduction, it should satisfy all of the following:

1. beat the current best skeleton baseline on the fixed proxy protocol
2. remain stable across at least one full run, not just a short spike
3. state whether the change is:
   - paper-faithful
   - implementation-motivated
   - or purely empirical
4. avoid using silhouette success as evidence that the skeleton path is correct

## Practical bottom line

At the moment, this repo supports:

- faithful silhouette ScoNet experimentation
- plausible DRF implementation work
- structured debugging of the skeleton-map branch

At the moment, this repo does not yet support:

- claiming a successful independent reproduction of the DRF paper’s quantitative results
- claiming that the paper’s skeleton-map preprocessing is fully specified
- treating the paper’s qualitative feature-response visualizations as reproduced

For practical model selection, the current conclusion is simpler:

- stop treating DRF as the default winner
- keep the practical mainline on `1:1:2`
- use the retained `body-only + plain CE` skeleton checkpoint family as the working solution
- the strongest verified practical checkpoint is the later `AdamW` cosine finetune checkpoint at `27000`

Current practical winner:

- model family: `ScoNet-MT-ske`
- split: `1:1:2`
- representation: `body-only`
- loss: plain CE + triplet
- optimizer path: later `AdamW` cosine finetune
- verified best retained checkpoint: `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`

That means:

- practical model development is in a good state
- paper-faithful DRF reproduction is not

The remaining work is no longer broad reproduction debugging; it is mostly optional refinement:

- confirm whether `body-only` really beats `full-body` under the same successful training recipe
- optionally retry DRF only after the strong practical skeleton baseline is fixed
- package and use the retained best checkpoint rather than continuing to churn the whole search space

## What We Can Say With High Confidence

### 1. The core training and evaluation stack works

Evidence:

- the silhouette ScoNet path behaves sensibly
- train/eval loops, checkpointing, resume, and standalone eval all work
- the strong practical skeleton result is reproducible from a saved checkpoint

Conclusion:

- the repo is not globally broken
- the main remaining issues are method-specific, not infrastructure-wide

### 2. The raw Scoliosis1K pose data is usable

Evidence:

- earlier dataset analysis showed high pose confidence and stable sequence lengths
- the skeleton branch eventually learns well on the easier practical split

Conclusion:

- the raw pose source is not the main blocker

### 3. The skeleton branch is learnable

Evidence:

- on `1:1:2`, the skeleton branch moved from poor early baselines to a strong retained checkpoint family
- the best verified practical result is `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`

Conclusion:

- the skeleton path is not dead
- earlier failures on harder settings were not proof that skeleton input is fundamentally wrong

## What Is Reproducible

### 4. The silhouette ScoNet path is reproducible enough to trust

Evidence:

- the silhouette pipeline and trainer behave consistently
- strong silhouette checkpoints evaluate sensibly on their intended split family

Conclusion:

- silhouette ScoNet remains a valid sanity anchor for this repo

### 5. The high-level DRF idea is reproducible
|
||||
|
||||
Evidence:
|
||||
- the DRF paper defines the method as `skeleton map + PAV + PGA`
|
||||
- this repo now contains:
|
||||
- a DRF model
|
||||
- DRF-specific preprocessing
|
||||
- PAV generation
|
||||
- PGA integration
|
||||
|
||||
Conclusion:
|
||||
- the architecture-level idea is implementable
|
||||
- a plausible DRF implementation exists locally
|
||||
|
||||
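For intuition about the PGA integration point, here is a minimal channel-plus-spatial gating sketch on plain nested lists. Sigmoid gates from pooled statistics are a generic squeeze-excite-style stand-in, not the paper's exact PGA formulation, and the pooling choices here are assumptions.

```python
# Minimal sketch of channel + spatial gating in the spirit of a PGA-style
# block, on a C x H x W nested list. This is a generic stand-in, not the
# paper's exact PGA formulation.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pga_like_gate(feat):
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    # channel gate: one scalar per channel, from that channel's spatial mean
    ch_gate = [sigmoid(sum(v for row in c for v in row) / (H * W)) for c in feat]
    # spatial gate: one scalar per location, from the channel mean there
    sp_gate = [[sigmoid(sum(feat[c][y][x] for c in range(C)) / C)
                for x in range(W)] for y in range(H)]
    # apply both gates multiplicatively
    return [[[feat[c][y][x] * ch_gate[c] * sp_gate[y][x]
              for x in range(W)] for y in range(H)] for c in range(C)]
```

The "weakly selective" failure mode discussed elsewhere in this note corresponds to gates like these staying near-uniform, so the prior branch rescales everything instead of emphasizing a few clinically relevant parts.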
### 6. The PAV concept is reproducible

Evidence:

- PAV metrics were implemented and produced stable sequence-level signals
- the earlier analysis showed PAV still carries useful class signal

Conclusion:

- “we could not build the clinical prior” is not the main explanation anymore

## What Is Only Partially Reproducible

### 7. The skeleton-map branch is only partially specified by the papers

The papers define the representation conceptually, but not in enough detail for quantitative reproduction from text alone.

Missing or under-specified details include:

- numeric Gaussian widths
- joint-vs-limb relative weighting
- crop and alignment policy
- resize/padding policy
- quantization/dtype behavior
- runtime transform assumptions

Why that matters:

- small rasterization and alignment changes moved results a lot
- many early failures came from details the paper did not pin down tightly enough

Conclusion:

- the paper gives the representation idea
- it does not fully specify the winning implementation

### 8. The visualization story is only partially reproducible

The papers cite visualization families, but do not specify enough extraction detail to reproduce the exact figures cleanly.

Missing detail includes:

- which layer is visualized
- whether extraction happens before or after temporal pooling
- the exact normalization/rendering procedure

Conclusion:

- local visualization work is useful for debugging
- it should not be treated as paper-faithful evidence

## What Is Not Reproduced

### 9. The published DRF quantitative claim is not reproduced here

Paper-side numbers:

- `ScoNet-MT-ske`: `82.5 Acc / 81.4 Prec / 74.3 Rec / 76.6 F1`
- `DRF`: `86.0 Acc / 84.1 Prec / 79.2 Rec / 80.8 F1`

Local practical DRF result on the current workable `1:1:2` path:

- best retained DRF checkpoint (`2000`), full test: `80.21 Acc / 58.92 Prec / 59.23 Rec / 57.84 F1`

Conclusion:

- DRF currently loses to the stronger plain skeleton baseline in this repo
- the published DRF advantage is not established locally

### 10. The paper-level `1:1:8` skeleton story is not reproduced here

What happened locally:

- `1:1:8` skeleton runs remained much weaker and less stable
- the stronger practical result came from moving to `1:1:2`

Conclusion:

- the hard `1:1:8` regime is still unresolved here
- current local evidence says class distribution is a major part of the failure mode

## Practical Findings From The Search

### 11. Representation findings

Current local ranking:

- `body-only` is the best practical representation so far
- `head-lite` helped some small proxy runs but did not transfer to the full test set
- `full-body` has not yet beaten the best `body-only` checkpoint family

Concrete comparison (full test at `7000`):

- `body-only + plain CE`: `83.16 Acc / 68.24 Prec / 80.02 Rec / 68.47 F1`
- `head-lite + plain CE`: `78.07 Acc / 65.42 Prec / 80.50 Rec / 62.08 F1`

Conclusion:

- the stable useful signal is mainly torso-centered
- adding limited head information did not improve full-test performance

### 12. Loss and optimizer findings

Current local ranking:

- on the practical `1:1:2` branch, `plain CE` beat `weighted CE`
- `SGD` produced the first strong baseline
- a later `AdamW` cosine finetune beat that baseline substantially
- an earlier `AdamW` multistep finetune was unstable and inferior

Conclusion:

- the current best recipe is not “AdamW from scratch”
- it is “strong SGD-style baseline first, then a milder AdamW cosine finetune”

### 13. DRF-specific finding

Current local interpretation:

- DRF is not failing because the skeleton branch is dead
- it is failing because the extra prior branch is not yet adding a strong enough complementary signal

Most likely reasons:

- the body-only skeleton baseline already captures most of the useful torso signal on `1:1:2`
- the current PAV/PGA path looks weakly selective
- DRF tended to peak early and then degrade, suggesting less stable optimization than the plain baseline

Conclusion:

- DRF is still a research branch here
- it is not the current practical winner

## Important Evaluation Caveat

### 14. There was no test-sample gradient leakage

For the long finetune run:

- training batches came only from `TRAIN_SET`
- repeated evaluation ran under `torch.no_grad()`
- no `TEST_SET` samples were used in backprop or optimizer updates

Conclusion:

- there was no train/test mixing inside gradient descent

### 15. There was test-set model-selection leakage

For the long finetune run:

- the full `TEST_SET` eval was run repeatedly during training
- best-checkpoint retention used `test_accuracy` and `test_f1`
- the final retained winner was chosen with test metrics

That means:

- the retained best checkpoint is real and reproducible
- but its metric should be interpreted as a test-guided selected result, not a pristine one-shot final estimate

Conclusion:

- acceptable for practical model selection
- not ideal for a publication-style clean generalization claim

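A leakage-free version of the selection step in section 15 would rank checkpoints on a validation split carved out of `TRAIN_SET`, and read the `TEST_SET` metric once for the already-chosen winner. A minimal sketch, with hypothetical iteration numbers and scores:

```python
# Sketch of leakage-free checkpoint selection: rank by validation macro-F1
# only, then evaluate the single winner on TEST_SET once. The metric records
# below are illustrative, not real run results.

def select_checkpoint(val_scores):
    """val_scores: {checkpoint_iter: val_macro_f1}. Pick by validation only."""
    return max(val_scores, key=val_scores.get)

val_scores = {7000: 0.61, 27000: 0.68, 80000: 0.64}  # hypothetical numbers
winner = select_checkpoint(val_scores)
# only now would the full TEST_SET eval run, once, on `winner`
```

This costs one extra split but upgrades the retained checkpoint's metric from a test-guided selected result to a clean one-shot estimate.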
## Bottom Line

The repo currently supports:

- strong practical scoliosis model development
- stable skeleton-map experimentation
- plausible DRF implementation work

The repo does not currently support:

- claiming an independent reproduction of the DRF paper’s quantitative result
- claiming that the DRF paper fully specifies the winning skeleton-map preprocessing
- treating DRF as the default best model in this codebase

Practical bottom line:

- keep the `body-only` skeleton baseline as the mainline path
- keep the retained best checkpoint family as the working artifact
- treat DRF as an optional follow-up branch, not the current winner

Current strongest practical checkpoint:

- `92.38 Acc / 90.30 Prec / 87.39 Rec / 88.70 F1`

If future work claims to improve on this, it should:

1. beat the current retained best checkpoint on full-test macro-F1
2. be verifiable by standalone eval from a saved checkpoint
3. state clearly whether the change is paper-faithful or purely empirical