# DATA PIPELINE KNOWLEDGE BASE

## OVERVIEW

`opengait/data/` converts preprocessed dataset trees into training/evaluation batches for all models.

## WHERE TO LOOK

| Task | Location | Notes |
|------|----------|-------|
| Dataset parsing + file loading | `dataset.py` | expects partition json and `.pkl` sequence files |
| Sequence sampling strategy | `collate_fn.py` | fixed/unfixed/all + ordered/unordered behavior |
| Augmentations/transforms | `transform.py` | transform factories resolved from config |
| Batch identity sampling | `sampler.py` | sampler types referenced from config |

## CONVENTIONS

- Dataset root layout is `id/type/view/*.pkl` after preprocessing.
- `dataset_partition` JSON with `TRAIN_SET` / `TEST_SET` is required.
- `sample_type` drives control flow (`fixed_unordered`, `all_ordered`, etc.) and shape semantics downstream.

## ANTI-PATTERNS

- Never pass non-`.pkl` sequence files (`dataset.py` raises hard ValueError).
- Don’t violate expected `batch_size` semantics for triplet samplers (`[P, K]` list).
- Don’t assume all models use identical feature counts; collate is feature-index sensitive.
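
## EXAMPLE SKETCH

The conventions and anti-patterns above can be illustrated with a small standalone sketch. It is not the code in `dataset.py` or `sampler.py`. The helper names (`load_partition`, `index_sequences`, `triplet_batch_shape`, `make_demo_tree`) are hypothetical, as are the demo ids, types, and views. The sketch also assumes that `[P, K]` follows the usual triplet-sampling reading: `P` identities per batch and `K` sequences per identity.

```python
import json
import pickle
import tempfile
from pathlib import Path


def load_partition(path):
    """Read a partition JSON and check that both required keys exist."""
    with open(path) as f:
        partition = json.load(f)
    missing = {"TRAIN_SET", "TEST_SET"} - set(partition)
    if missing:
        raise KeyError(f"partition file lacks keys: {sorted(missing)}")
    return partition


def index_sequences(root, ids):
    """Walk root/id/type/view and list the .pkl files of each sequence.

    Returns a list of (id, type, view, [paths]) tuples. Any file that is
    not a .pkl raises ValueError, mirroring the anti-pattern noted above.
    """
    entries = []
    for subject in sorted(ids):
        subject_dir = Path(root) / subject
        if not subject_dir.is_dir():
            continue  # id listed in the partition but absent on disk
        for type_dir in sorted(d for d in subject_dir.iterdir() if d.is_dir()):
            for view_dir in sorted(d for d in type_dir.iterdir() if d.is_dir()):
                files = sorted(p for p in view_dir.iterdir() if p.is_file())
                bad = [p.name for p in files if p.suffix != ".pkl"]
                if bad:
                    raise ValueError(f"non-.pkl files in {view_dir}: {bad}")
                entries.append((subject, type_dir.name, view_dir.name, files))
    return entries


def triplet_batch_shape(batch_size):
    """Interpret batch_size as [P, K]; return (P, K, total sequences)."""
    if not (isinstance(batch_size, (list, tuple)) and len(batch_size) == 2):
        raise ValueError("triplet samplers expect batch_size as [P, K]")
    p, k = batch_size
    return p, k, p * k


def make_demo_tree(root, partition_path):
    """Create a tiny fake dataset (hypothetical ids/types/views) for testing."""
    for subject, view in (("001", "000"), ("002", "090")):
        view_dir = Path(root) / subject / "nm-01" / view
        view_dir.mkdir(parents=True)
        with open(view_dir / f"{view}.pkl", "wb") as f:
            pickle.dump([[0, 0], [0, 0]], f)  # placeholder frame data
    with open(partition_path, "w") as f:
        json.dump({"TRAIN_SET": ["001"], "TEST_SET": ["002"]}, f)
    return partition_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "data"
        partition = load_partition(make_demo_tree(root, Path(tmp) / "part.json"))
        for entry in index_sequences(root, partition["TRAIN_SET"]):
            print(entry[:3], [p.name for p in entry[3]])
        print(triplet_batch_shape([8, 16]))
```

Validating the partition keys and file extensions up front means a malformed dataset tree fails at indexing time rather than inside a collate or sampler call, where the error is harder to trace.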