pose_tracking_exp/research/conversation.md
crosstyan f1a2372b3c feat: initialize offline multiview pose tracking experiment
Set up pose_tracking_exp as a uv-managed Python package for offline multiview body tracking experiments.

This initial commit includes:

- the typed package scaffold, CLI entrypoints, and repo-local uv configuration

- scene and replay loaders for generic JSON replays and ActualTest parquet inputs

- ParaJumping payload conversion and RTMPose-to-body20 normalization

- a custom articulated tracker with tentative, active, and lost lifecycle handling

- RPT-backed proposal generation, camera convention handling, and multiview reprojection updates

- regression tests for normalization, camera conventions, ActualTest ingestion, seeding, and tracker smoke flows

- project documentation covering extrinsic formats and the ActualTest calibration caveat
2026-03-26 13:03:56 +08:00


N views (known camera parameters, including extrinsics), with D_n detections per view. The detections are unmatched with unknown identity; I could use a 2D tracker like SORT/ByteTrack, but that only gives relatively stable per-view 2D tracks and does not solve cross-view association. The 2D detector is noisy (some random noise, unknown distribution, due to the neural pose estimator or occlusion). We could somehow get a 3D detection (pick some annotation) in a naive way with DLT (refer: https://arxiv.org/html/2503.21692), but it's still noisy, and it's still keypoint-based with no kinematics (what's the word? SO(3)?). For a correct human pose representation, design some skeleton that could be fitted from points (assuming using IK?) and mapped back to keypoints with FK. But you still need a skeleton (beta) estimation, which once established shouldn't change unless tracking is lost. With that skeleton we could do gating, since the 3D detection is noisy. I need something robust: always assume the neural detection/DLT result is noisy, but we have all the history state. Don't use a sliding window; model the noise somehow instead. I personally believe sliding window is a bad/lazy design and should be avoided, but it's just my experience.


Good. With that context, here is the design I would actually build.

The main separation is this: the triangulated 3D keypoint cloud is not the persistent state. It is a proposal or measurement. The persistent state should be an articulated skeleton with fixed shape. In your terminology, the free 3D keypoint output is a joint-coordinate or kinematic-free representation in $\mathbb{R}^{3J}$; it is not an $SO(3)$ state. $SO(3)$ enters only when you represent joint rotations. RapidPoseTriangulation is essentially a proposal-style pipeline: it forms cross-view 2D pose pairs, filters them using previous 3D pose projections, triangulates pairwise 3D proposals, reprojects them, rejects high-error pairs, groups proposals in 3D, merges them, and only then optionally assigns tracks or clips joint motion. (arXiv)

So the long-lived track for person $p$ should be something like
$$ x_t^p = \big[T_t^p,\ q_t^p,\ \dot T_t^p,\ \dot q_t^p,\ \beta^p,\ \eta_t^p\big]. $$
Here $T_t^p \in SE(3)$ is root translation and orientation, $q_t^p$ are local joint rotations or reduced-DOF angles, $\beta^p$ is the subject shape or bone-length parameter, and $\eta_t^p$ is a latent reliability or noise state. The observable joints are derived by forward kinematics,
$$ X_{i,t}^p = FK_i(T_t^p,\ q_t^p,\ \beta^p), $$
and projected into each calibrated view by
$$ \hat u_{v,i,t}^p = \Pi_v(X_{i,t}^p). $$
That gives you exactly what you wanted earlier: reprojection, IK/FK consistency, fixed skeleton once established, and a place to put historical noise.
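As a minimal sketch of this state and the FK/projection maps (the dataclass fields, the toy two-bone chain, and all names here are illustrative choices of mine, not from any cited paper):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrackState:
    # Hypothetical articulated state x_t^p = [T, q, beta, eta] (velocities omitted)
    root_pos: np.ndarray    # T: root translation (3,)
    root_rot: np.ndarray    # root orientation as a 3x3 rotation matrix
    q: np.ndarray           # local joint angles, reduced-DOF sketch
    beta: np.ndarray        # bone lengths, frozen after establishment
    log_sigma2: np.ndarray  # eta: per-joint recursive noise state

def fk_chain(state: TrackState) -> np.ndarray:
    """Toy FK: a planar two-bone chain hanging off the root.
    Real code would walk the full kinematic tree with per-joint rotations."""
    pts = [state.root_pos.copy()]
    T, R = state.root_pos.copy(), state.root_rot
    angle = 0.0
    for theta, length in zip(state.q, state.beta):
        angle += theta
        local = np.array([np.sin(angle), -np.cos(angle), 0.0]) * length
        T = T + R @ local
        pts.append(T.copy())
    return np.stack(pts)  # (J, 3) joint positions X_i = FK_i(T, q, beta)

def project(K: np.ndarray, R: np.ndarray, t: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Pinhole projection u = Pi_v(X) for world points X of shape (J, 3)."""
    Xc = (R @ X.T).T + t           # world -> camera
    uv = (K @ Xc.T).T
    return uv[:, :2] / uv[:, 2:3]  # (J, 2) pixel coordinates
```

Bone lengths come only from `beta`, so the FK output cannot stretch a limb to explain a bad detection, which is the point of making the skeleton the persistent state.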

For association, I would split the problem in two. For already-existing tracks, do not solve full cross-view association from scratch each frame. Predict each 3D track forward, reproject it into every view, and associate detections to that predicted skeleton in each camera. Cross-view consistency then comes from the fact that the same 3D state must explain all matched views. For births only, use a within-frame geometric proposer: build cross-view hypotheses from stable core joints, triangulate, reproject, reject bad hypotheses, cluster in 3D, then initialize a new articulated track from the surviving cluster. RapidPose explicitly uses a core-joint subset early on—shoulders, hips, elbows, wrists, knees, ankles—to keep association cheap before triangulating all joints. If you want a published association front-end, Dong et al. use multi-way matching over 2D poses with geometry and appearance cues, while Part-Aware Measurement uses temporal consistency to associate current 2D poses with previous 3D skeletons and adds a filter for 2D outliers. (arXiv)
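The per-view association step for existing tracks could be sketched like this (greedy nearest-neighbor on mean joint distance with a fixed pixel gate; a real system would use Hungarian assignment and a Mahalanobis gate, and `gate_px` is an arbitrary illustrative value):

```python
import numpy as np

def associate_view(pred_uv, det_uv_list, gate_px=50.0):
    """Greedy 2D association of detections to predicted track reprojections.

    pred_uv: (T, J, 2) reprojected core joints of T predicted tracks in one view.
    det_uv_list: list of (J, 2) detections in the same view.
    Returns {track_idx: det_idx}; unmatched detections feed the birth proposer.
    """
    pairs = []
    for ti, p in enumerate(pred_uv):
        for di, d in enumerate(det_uv_list):
            cost = np.nanmean(np.linalg.norm(p - d, axis=-1))  # mean joint distance
            if cost < gate_px:
                pairs.append((cost, ti, di))
    pairs.sort()
    matches, used_t, used_d = {}, set(), set()
    for cost, ti, di in pairs:
        if ti not in used_t and di not in used_d:
            matches[ti] = di
            used_t.add(ti); used_d.add(di)
    return matches
```

Because each 3D track is reprojected into every view independently, cross-view consistency falls out of the shared 3D state rather than from a per-frame matching problem.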

The update should be recursive and articulated. I would not use a Kalman filter on raw DLT joints. I would do a one-step recursive MAP or iterated-EKF-style update on the articulated state:
$$ \hat x_t^p = \arg\min_x \|x-\bar x_t^p\|_{Q^{-1}}^2 + \sum_{(v,m)\in A_t^p}\sum_{i\in\mathcal J} \rho\!\left( \|u_{v,i,t}^{m}-\Pi_v(FK_i(x))\|_{R_{v,i,t}^{-1}}^2 \right) + \phi_{\text{limits}}(q) + \phi_{\text{shape}}(\beta). $$
Here $\bar x_t^p$ is the predicted state, $A_t^p$ is the set of associated detections, $\rho$ is a robust loss, and the priors enforce joint limits and fixed shape. This is fully online. No sliding window is required.

One important point: do not treat the same 2D detections and their DLT triangulation as two independent measurements unless you explicitly model the correlation. If the 3D DLT point was computed from those same 2D keypoints, then using both naively double-counts the evidence. The clean choice is to use raw multi-view 2D detections for the main update, and use DLT 3D only for proposal generation, coarse gating, or initialization. If you insist on using DLT in the update, make it a pseudo-measurement with covariance propagated from the 2D uncertainty.

For noise, assume the detector is heavy-tailed and heteroscedastic. That is the right place to be robust, instead of hiding the problem inside the pose state. The simplest useful model is Student-$t$ or a Gaussian scale mixture:
$$ u_{v,i,t} = \hat u_{v,i,t} + \epsilon_{v,i,t}, \qquad \epsilon_{v,i,t}\sim t_\nu(0,R_{v,i,t}). $$
Then update a per-joint, per-view scale recursively from the innovation
$$ r_{v,i,t}=u_{v,i,t}-\hat u_{v,i,t}, $$
for example with the Student-$t$ weight
$$ w_{v,i,t}=\frac{\nu+2}{\nu+r_{v,i,t}^\top S_{v,i,t}^{-1}r_{v,i,t}}, $$
and the forgetting update
$$ \log \sigma^2_{v,i,t} = \lambda \log \sigma^2_{v,i,t-1} + (1-\lambda)\log(\epsilon+\|r_{v,i,t}\|^2). $$
Then define
$$ R_{v,i,t}=R_{\min}+f(c_{v,i,t})+\sigma^2_{v,i,t}I, $$
where $c_{v,i,t}$ is the detector confidence. If you do use triangulated 3D proposals, add a geometry term that depends on ray intersection angle or triangulation condition number, because a narrow baseline should produce much larger depth uncertainty than a wide baseline.
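A direct sketch of these two innovation-driven updates (the values of `nu`, `lam`, and `eps` are illustrative defaults, not tuned):

```python
import numpy as np

def t_weight(r, S, nu=4.0):
    """Student-t IRLS weight w = (nu + d) / (nu + r^T S^-1 r), d = residual dim.
    Large innovations get down-weighted instead of corrupting the state."""
    d = r.shape[-1]
    m = float(r @ np.linalg.solve(S, r))
    return (nu + d) / (nu + m)

def update_log_var(log_var, r, lam=0.95, eps=1e-6):
    """Recursive per-joint, per-view scale: exponential forgetting in log-variance."""
    return lam * log_var + (1.0 - lam) * np.log(eps + float(r @ r))

# Illustrative usage on one joint in one view.
S = np.eye(2) * 4.0                  # innovation covariance for a 2D residual
r_small = np.array([1.0, 1.0])       # inlier-sized innovation
r_big = np.array([30.0, 30.0])       # outlier-sized innovation
w_small, w_big = t_weight(r_small, S), t_weight(r_big, S)
```

The weight feeds the robust update (as $\rho$'s IRLS weight), while the log-variance state feeds the next frame's $R_{v,i,t}$; neither requires keeping any past frames around.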

This is the non-window way to use history. The history lives in three places: the dynamic prior on $(T,q)$, the nearly-static shape $\beta$, and the recursive noise state $\eta$ or $\sigma^2$. That is not a sliding window. It is a proper state-space model with exponential forgetting.

For $\beta$, I would use a two-timescale rule. During track establishment, allow slow adaptation under a strong anthropometric prior. After the track is stable, freeze it. In practice I would not start with fully free per-bone lengths. I would use a low-dimensional shape model: global scale, leg-length ratio, arm-length ratio, torso scale, shoulder width, maybe hip width. Full free bone lengths tend to absorb detector bias and produce a skeleton that explains bad keypoints by deforming itself. Re-open $\beta$ only if the track is lost and later reinitialized, or if you have strong evidence over many good frames that the previous estimate was wrong.
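One possible shape of the two-timescale rule, as a sketch (the decreasing-gain blend, the prior weights, and the thresholds are all my illustrative choices):

```python
import numpy as np

class ShapeEstimator:
    """Two-timescale beta: slow adaptation under an anthropometric prior, then freeze."""
    def __init__(self, beta_prior, prior_strength=10.0, freeze_after=30):
        self.beta = np.asarray(beta_prior, dtype=float).copy()
        self.prior = np.asarray(beta_prior, dtype=float)
        self.prior_strength = prior_strength
        self.freeze_after = freeze_after
        self.n_good = 0
        self.frozen = False

    def update(self, beta_obs, good_frame=True):
        if self.frozen or not good_frame:
            return self.beta
        # Shrink the per-frame estimate toward the prior, then take a
        # decreasing-gain step: early frames adapt, later frames barely move.
        target = 0.9 * np.asarray(beta_obs, dtype=float) + 0.1 * self.prior
        gain = 1.0 / (self.prior_strength + self.n_good)
        self.beta = (1.0 - gain) * self.beta + gain * target
        self.n_good += 1
        if self.n_good >= self.freeze_after:
            self.frozen = True
        return self.beta
```

Only frames flagged as good (wide baseline, low reprojection error) count toward the freeze, so a burst of occluded frames cannot lock in a deformed skeleton.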

For gating, use Mahalanobis distance in image space on core joints:
$$ d^2_{p,v,m} = \sum_{i\in \mathcal J_{\text{core}}} (u_{v,i}^m-\hat u_{v,i}^p)^\top R_{v,i,t}^{-1} (u_{v,i}^m-\hat u_{v,i}^p). $$
Gate per joint or per limb, not only on the full body. If one wrist is bad, mark it missing; do not reject the entire person. A secondary 3D gate is useful only for proposal merging:
$$ d^2_{3D} = (\hat X^{DLT}-X^{pred})^\top (\Sigma_{DLT}+\Sigma_{pred})^{-1} (\hat X^{DLT}-X^{pred}). $$
Primary association should still be driven by the FK-reprojected articulated state.
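The per-joint gate can be sketched with a chi-square threshold on the 2-DOF Mahalanobis distance (the 99% chi-square value is standard; the function name and call shape are mine):

```python
import numpy as np

CHI2_2DOF_99 = 9.21  # 99% gate for a 2-DOF Mahalanobis distance

def gate_joints(det_uv, pred_uv, R_list, thresh=CHI2_2DOF_99):
    """Per-joint chi-square gating in image space.

    det_uv, pred_uv: (J, 2) arrays; R_list: per-joint (2, 2) innovation covs.
    Returns a boolean mask: True = joint accepted, False = mark missing.
    Gating per joint keeps one bad wrist from rejecting the whole person."""
    mask = np.zeros(len(det_uv), dtype=bool)
    for i, (u, u_hat, R) in enumerate(zip(det_uv, pred_uv, R_list)):
        r = u - u_hat
        d2 = float(r @ np.linalg.solve(R, r))
        mask[i] = d2 < thresh
    return mask
```

Summing the accepted joints' `d2` values recovers the full-body gate $d^2_{p,v,m}$ above when you need a track-level score as well.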

So the short design is:

- Proposal layer: per-frame cross-view geometric hypotheses from noisy 2D detections.
- State layer: articulated track ($SE(3)$ root + joint angles + $\beta$).
- Observation layer: direct multi-view 2D reprojection residuals.
- Noise layer: recursive per-joint, per-view heavy-tailed covariance adaptation.
- Association: predicted reprojection gating for existing tracks; triangulation/clustering only for births.
- Shape: estimate slowly, then freeze.

That is the version I would trust. It is robust, physically constrained, reprojectable, and it does not depend on a sliding window to behave well.


Yes. I would make it explicitly two-stage. RapidPoseTriangulation already has a fast within-frame proposal stage based on cross-view pairs, core-joint triangulation, reprojection filtering, 3D grouping, and only then an optional track-assignment stage; Part-Aware likewise uses the previous 3D skeletons to associate current 2D poses across views and time while filtering 2D outliers. Your “tentative then established” split is the right systems version of that idea. (arXiv)

For the tentative state, I would not use per-joint $SO(3)$ yet. I would use one root state plus a free 3D core-joint cloud. Concretely,
$$ x_t^{\text{tent}}= [c_t,\ \dot c_t,\ Y_t^{\text{core}},\ \dot Y_t^{\text{core}},\ s_t,\ \Sigma_t,\ q_t^{\text{track}}], $$
where $c_t$ is the pelvis/root position, $Y_t^{\text{core}}$ are root-centered 3D core joints, $s_t$ is just a coarse global scale, $\Sigma_t$ is uncertainty, and $q_t^{\text{track}}$ is an existence/quality score. If you need an orientation, use a temporary torso frame derived from hips and shoulders, but do not commit to IK yet. RapidPose uses shoulders, hips, elbows, wrists, knees, and ankles as its core set; for confirmation I would weight torso and proximal joints highest and treat distal joints as secondary evidence. That weighting choice is my recommendation, not something stated in the paper. (arXiv)

A short fixed buffer in the tentative stage makes sense. I would treat it as an evidence buffer, not as the main estimator. Promotion should require K-of-M good updates, minimum multi-view support, low normalized reprojection error, stable limb ratios, and non-degenerate triangulation geometry. A practical rule is to estimate only a low-dimensional $\beta$ at promotion time,
$$ \beta \approx [\text{global scale},\ \text{torso scale},\ \text{arm ratio},\ \text{leg ratio}], $$
from weighted median limb lengths over the tentative buffer, then soft-lock it for a grace period and hard-freeze it later. I would not freeze a fully free per-bone skeleton immediately on the first “confirmed” frame. Part-Aware's emphasis on temporal consistency for association and explicit 2D outlier filtering supports using history mainly for confirmation and robustness, not just smoothing. (arXiv)
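One way to sketch the promotion-time estimate and the K-of-M rule (weighted median per limb over the evidence buffer; function names and defaults are illustrative):

```python
import numpy as np

def estimate_beta(buffer_limb_lengths, weights):
    """Low-dimensional shape at promotion time from the tentative buffer.

    buffer_limb_lengths: (M, L) per-frame limb lengths over M buffered frames.
    weights: (M,) per-frame quality weights (view count, reprojection error, ...).
    Returns the weighted median per limb; medians resist outlier frames."""
    order = np.argsort(buffer_limb_lengths, axis=0)
    out = np.empty(buffer_limb_lengths.shape[1])
    for j in range(buffer_limb_lengths.shape[1]):
        idx = order[:, j]
        cw = np.cumsum(weights[idx])
        out[j] = buffer_limb_lengths[idx[np.searchsorted(cw, 0.5 * cw[-1])], j]
    return out

def should_promote(good_flags, k=5, m=8):
    """K-of-M rule over the most recent M frames of the evidence buffer."""
    recent = good_flags[-m:]
    return sum(recent) >= k
```

A `good_flags` entry would be true only when that frame also passed the multi-view-support and geometry checks, so promotion never rests on reprojection error alone.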

For deletion, do not go straight from “confirmed” to “dead.” Use four states: tentative, active, lost, dead. Tentative dies quickly if it is not confirmed within a small age budget or if it loses support for a couple of frames. Active moves to lost after a small number of consecutive misses. Lost keeps predicting forward, inflates covariance, and keeps trying reacquisition using the frozen $\beta$ and reprojected core joints as gates. Delete only when the lost age is too large, the predicted uncertainty becomes too large to be useful, or the predicted root has exited the capture volume for several frames. A scalar existence score is better than a pure frame-count rule:
$$ s_t=\lambda s_{t-1}+a\,n^{\text{views}}_t+b\,n^{\text{joints}}_t-c\,\bar e^{\text{repr}}_t-d\,\mathbf 1_{\text{miss}}. $$
Promote when $s_t>\tau_{\text{promote}}$, move to lost when $s_t<\tau_{\text{lost}}$, and delete when lost age or uncertainty crosses a bound. RapidPose's use of previous 3D poses for pair filtering and Part-Aware's use of previous 3D skeletons for 2D-3D association both support this “active/lost/reacquire” structure. (arXiv)
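The four-state lifecycle with the scalar existence score could be sketched as follows (all coefficients and thresholds are illustrative, not tuned values):

```python
from enum import Enum

class Phase(Enum):
    TENTATIVE = 0
    ACTIVE = 1
    LOST = 2
    DEAD = 3

class TrackLifecycle:
    """Existence-score state machine: tentative / active / lost / dead."""
    def __init__(self, lam=0.8, a=0.3, b=0.02, c=0.05, d=1.0,
                 tau_promote=1.2, tau_lost=0.3, max_lost_age=10):
        self.phase = Phase.TENTATIVE
        self.s = 0.0
        self.lost_age = 0
        self.p = dict(lam=lam, a=a, b=b, c=c, d=d, tau_promote=tau_promote,
                      tau_lost=tau_lost, max_lost_age=max_lost_age)

    def step(self, n_views, n_joints, mean_repr_err, missed):
        p = self.p
        # s_t = lam*s_{t-1} + a*views + b*joints - c*err - d*1[miss]
        self.s = (p['lam'] * self.s + p['a'] * n_views + p['b'] * n_joints
                  - p['c'] * mean_repr_err - p['d'] * float(missed))
        if self.phase == Phase.TENTATIVE and self.s > p['tau_promote']:
            self.phase = Phase.ACTIVE
        elif self.phase == Phase.ACTIVE and self.s < p['tau_lost']:
            self.phase = Phase.LOST
        elif self.phase == Phase.LOST:
            if self.s > p['tau_promote']:   # reacquired via frozen-beta gates
                self.phase = Phase.ACTIVE
                self.lost_age = 0
            else:
                self.lost_age += 1
                if self.lost_age > p['max_lost_age']:
                    self.phase = Phase.DEAD
        return self.phase
```

A real implementation would add the uncertainty and capture-volume exit conditions as extra deletion triggers alongside `max_lost_age`.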

On the DLT question: “not using DLT” does not mean “use a neural network.” There are three different operations here. First, algebraic triangulation: OpenCV's sfm::triangulatePoints is explicitly a DLT-based triangulator. Second, geometric or nonlinear triangulation: refine the point by minimizing reprojection error rather than keeping the raw algebraic DLT solution. OpenCV also exposes correctMatches, which implements the optimal-triangulation-style correction of correspondences under the epipolar constraint before triangulation. Third, articulated state fitting: do not reconstruct a separate 3D point cloud at all; solve directly for the skeleton state that best explains all 2D observations. (OpenCV Documentation)
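To make the first two operations concrete, here is a plain-NumPy sketch rather than the OpenCV calls named above: a multi-view DLT triangulator, then a Gauss-Newton refinement of the reprojection error starting from the DLT solution:

```python
import numpy as np

def dlt_triangulate(P_list, uv_list):
    """Algebraic (DLT) triangulation: stack the u x (P X) = 0 rows, solve by SVD."""
    A = []
    for P, (u, v) in zip(P_list, uv_list):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]

def refine_reprojection(P_list, uv_list, X0, iters=10):
    """Nonlinear step: Gauss-Newton on reprojection error, from the DLT init."""
    X = X0.copy()
    for _ in range(iters):
        r, J = [], []
        for P, uv in zip(P_list, uv_list):
            h = P @ np.append(X, 1.0)
            r.append(h[:2] / h[2] - uv)
            # Jacobian of the pinhole projection wrt the 3D point X
            Jp = (P[:2, :3] * h[2] - np.outer(h[:2], P[2, :3])) / h[2] ** 2
            J.append(Jp)
        r, J = np.concatenate(r), np.vstack(J)
        X = X - np.linalg.solve(J.T @ J, J.T @ r)
    return X
```

On noise-free input both agree; with noisy 2D input the refinement minimizes the geometrically meaningful reprojection error instead of the DLT algebraic error.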

For established tracks, the clean answer is the third one. Do not reconstruct an independent 3D joint cloud every frame. Solve the articulated state directly from all 2D views:
$$ \hat x_t = \arg\min_x \|x-\bar x_t\|_{Q^{-1}}^2 + \sum_{v=1}^{N}\sum_{i\in\mathcal J} w_{v,i,t}\, \rho\!\left( \left\| u_{v,i,t}-\Pi_v(FK_i(x,\beta)) \right\|^2 \right) + \phi_{\text{limits}}(q) + \phi_{\text{shape}}(\beta). $$
This is the same optimization pattern used by top-down body-model fitting methods: SMPL-X/SMPLify-X fits a parametric body model to 2D features and explicitly relies on a pose prior because 2D-to-3D is ambiguous, and KITRO likewise starts from 2D keypoints and refines the 3D model along the kinematic tree while discussing the depth ambiguity of pure 2D fitting. My extension is: in your calibrated multi-view setting, apply that fitting jointly across views, where the ambiguity is much smaller than in single-view. That multi-view claim is an inference from geometry, not a quoted result from those single-view papers.
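A toy version of this robust multi-view fit, using numeric Jacobians and IRLS Huber weights (an illustrative sketch only: the prior and joint-limit terms of the objective are omitted, and in the degenerate case where the "skeleton" is one free point, `fit_state` reduces to triangulation):

```python
import numpy as np

def huber_weight(r2, delta=4.0):
    """IRLS weight for a Huber loss, given the squared norm r2 of one 2D residual."""
    n = np.sqrt(r2)
    return 1.0 if n <= delta else delta / n

def make_proj(P):
    """Wrap a 3x4 projection matrix as a map X (J, 3) -> pixels (J, 2)."""
    def pi(X):
        h = (P @ np.hstack([X, np.ones((len(X), 1))]).T).T
        return h[:, :2] / h[:, 2:3]
    return pi

def fit_state(x0, fk, projections, obs, iters=15):
    """Robust Gauss-Newton on sum_v ||u_v - Pi_v(FK(x))||^2, numeric Jacobians.

    x0: previous track state as the initializer; fk: x -> (J, 3) joint positions;
    projections: per-view projection functions; obs: list of (J, 2) detections."""
    x = np.asarray(x0, dtype=float).copy()
    eps = 1e-5

    def residual(xv):
        Xj = fk(xv)
        return np.concatenate([(pi(Xj) - u).ravel()
                               for pi, u in zip(projections, obs)])

    for _ in range(iters):
        r = residual(x)
        J = np.stack([(residual(x + eps * e) - r) / eps
                      for e in np.eye(len(x))], axis=1)
        # one Huber weight per 2D joint residual, repeated over its (u, v) rows
        w = np.repeat([huber_weight(float(rr @ rr))
                       for rr in r.reshape(-1, 2)], 2)
        x = x - np.linalg.solve(J.T @ (w[:, None] * J), J.T @ (w * r))
    return x
```

In production the numeric Jacobian would be replaced by analytic FK derivatives along the kinematic tree, but the Gauss-Newton-with-warm-start structure is the same.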

So for real time, I would not try to ban DLT. I would confine it to births and cheap initialization. Tentative births can use DLT or another simple ray-based triangulation on a conservative core set, because that is fast and good enough to decide whether a hypothesis is real; RapidPose is essentially showing that this kind of proposal stage can be very cheap. Once the track is established, use the previous articulated state as the initializer and do a small number of Gauss-Newton or LM updates on the multi-view reprojection objective. That linear-init plus nonlinear-refine pattern is standard in vision; OpenCV's iterative solvePnP, for example, uses LM refinement with a DLT initializer for non-planar cases. That is only an analogy for the optimization pattern, not a claim that solvePnP solves human pose. (arXiv)

So the recommendation is very simple. Tentative stage: root-centered free core-joint state, short evidence buffer, no IK. Promotion: fit a low-dimensional $\beta$, soft-lock then freeze. Established stage: articulated IK/FK state updated directly from multi-view 2D reprojection, with recursive noise modeling and no sliding window. Deletion: active/lost/dead, driven by support and uncertainty, not just missed-frame count.

The next useful step is to write the exact state machine and per-frame update equations in implementation form.