Add comprehensive knowledge base documentation across multiple domains

2026-02-12 14:36:37 +08:00
parent f754f6f383
commit 0fdd35bd78
8 changed files with 336 additions and 0 deletions
@@ -0,0 +1,22 @@
+# DATA PIPELINE KNOWLEDGE BASE
+
+## OVERVIEW
+`opengait/data/` converts preprocessed dataset trees into training/evaluation batches for all models.
+
+## WHERE TO LOOK
+| Task | Location | Notes |
+|------|----------|-------|
+| Dataset parsing + file loading | `dataset.py` | expects a partition JSON and `.pkl` sequence files |
+| Sequence sampling strategy | `collate_fn.py` | fixed/unfixed/all + ordered/unordered behavior |
+| Augmentations/transforms | `transform.py` | transform factories resolved from config |
+| Batch identity sampling | `sampler.py` | sampler types referenced from config |
+
+## CONVENTIONS
+- Dataset root layout is `id/type/view/*.pkl` after preprocessing.
+- `dataset_partition` JSON with `TRAIN_SET` / `TEST_SET` is required.
+- `sample_type` drives control flow (`fixed_unordered`, `all_ordered`, etc.) and shape semantics downstream.
+
+## ANTI-PATTERNS
+- Never pass non-`.pkl` sequence files (`dataset.py` raises a `ValueError`).
+- Don’t break the `batch_size` contract for triplet samplers: they expect a `[P, K]` list.
+- Don’t assume all models use identical feature counts; collate is feature-index sensitive.
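
The conventions and anti-patterns in this file can be checked before training with a short script. The following is a minimal sketch and not the code in `dataset.py` or `sampler.py`: the helpers `list_sequences` and `triplet_batch_total` are hypothetical names introduced here. It assumes the partition JSON maps `TRAIN_SET` and `TEST_SET` to lists of id directory names, and that `P` is the number of identities per batch and `K` the number of sequences per identity, as is usual for triplet sampling.

```python
import json
from pathlib import Path


def list_sequences(root, partition_path, split="TRAIN_SET"):
    """Collect (id, type, view, files) for each sequence directory in one split.

    Hypothetical helper: walks the id/type/view/*.pkl layout and rejects
    any non-.pkl file, mirroring the rule stated in the knowledge base.
    """
    partition = json.loads(Path(partition_path).read_text())
    for key in ("TRAIN_SET", "TEST_SET"):
        if key not in partition:
            raise KeyError(f"partition JSON is missing {key}")

    sequences = []
    for subject_id in sorted(partition[split]):
        subject_dir = Path(root) / subject_id
        if not subject_dir.is_dir():
            continue  # id listed in the partition but absent on disk
        for type_dir in sorted(p for p in subject_dir.iterdir() if p.is_dir()):
            for view_dir in sorted(p for p in type_dir.iterdir() if p.is_dir()):
                files = sorted(p for p in view_dir.iterdir() if p.is_file())
                bad = [p.name for p in files if p.suffix != ".pkl"]
                if bad:
                    raise ValueError(f"non-.pkl files in {view_dir}: {bad}")
                sequences.append(
                    (subject_id, type_dir.name, view_dir.name, files)
                )
    return sequences


def triplet_batch_total(batch_size):
    """Validate a [P, K] batch_size and return the sequences per batch (P * K)."""
    if not isinstance(batch_size, (list, tuple)) or len(batch_size) != 2:
        raise ValueError("triplet samplers expect batch_size as a [P, K] list")
    p, k = batch_size
    if not (isinstance(p, int) and isinstance(k, int) and p > 0 and k > 0):
        raise ValueError("P and K must be positive integers")
    return p * k
```

Running `list_sequences` over both splits before a long job surfaces stray files early, and `triplet_batch_total([8, 16])` gives the number of sequences each batch would hold under the `P * K` assumption.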