feat: initialize offline multiview pose tracking experiment

Set up pose_tracking_exp as a uv-managed Python package for offline multiview body tracking experiments.

This initial commit includes:

- the typed package scaffold, CLI entrypoints, and repo-local uv configuration

- scene and replay loaders for generic JSON replays and ActualTest parquet inputs

- ParaJumping payload conversion and RTMPose-to-body20 normalization

- a custom articulated tracker with tentative, active, and lost lifecycle handling

- RPT-backed proposal generation, camera convention handling, and multiview reprojection updates

- regression tests for normalization, camera conventions, ActualTest ingestion, seeding, and tracker smoke flows

- project documentation covering extrinsic formats and the ActualTest calibration caveat
2026-03-26 13:03:56 +08:00
commit f1a2372b3c
27 changed files with 3791 additions and 0 deletions
> I have N views with known camera parameters, including extrinsics, and D_n detections per view. The detections are unmatched, with unknown ids; I could run a 2D tracker like SORT or ByteTrack per view, but that only gives relatively stable per-view 2D tracks and does not solve cross-view association. The 2D detector is noisy, with random noise of unknown distribution from the neural pose estimator or occlusion. We could somehow get a 3D detection by picking some association and triangulating, e.g. naively with DLT (refer: https://arxiv.org/html/2503.21692), but that is still noisy, and it is still a keypoint-based representation, not IK (what's the word? SO3?). For a correct human pose representation I want to design some skeleton that can be fit from points (assuming IK?) and mapped back to keypoints with FK. But you still need a skeleton (beta) estimation, which once established shouldn't change unless tracking is lost, and with that skeleton we could do gating, since the 3D detection is noisy. I need something robust, always assuming the neural detection / DLT result is noisy, that uses all the history state but without a sliding window; model the noise somehow? I personally believe a sliding window is a bad, lazy design and should be avoided, but that's just my experience.
---
Good. With that context, here is the design I would actually build.
The main separation is this: the triangulated 3D keypoint cloud is not the persistent state. It is a proposal or measurement. The persistent state should be an articulated skeleton with fixed shape. In your terminology, the free 3D keypoint output is a joint-coordinate or kinematic-free representation in (\mathbb{R}^{3J}); it is not an (SO(3)) state. (SO(3)) enters only when you represent joint rotations. RapidPoseTriangulation is essentially a proposal-style pipeline: it forms cross-view 2D pose pairs, filters them using previous 3D pose projections, triangulates pairwise 3D proposals, reprojects them, rejects high-error pairs, groups proposals in 3D, merges them, and only then optionally assigns tracks or clips joint motion. ([arXiv][1])
So the long-lived track for person (p) should be something like
[
x_t^p = \big[T_t^p,\ q_t^p,\ \dot T_t^p,\ \dot q_t^p,\ \beta^p,\ \eta_t^p\big].
]
Here (T_t^p\in SE(3)) is root translation and orientation, (q_t^p) are local joint rotations or reduced-DOF angles, (\beta^p) is the subject shape or bone-length parameter, and (\eta_t^p) is a latent reliability or noise state. The observable joints are derived by forward kinematics,
[
X_{i,t}^p = FK_i(T_t^p,q_t^p,\beta^p),
]
and projected into each calibrated view by
[
\hat u_{v,i,t}^p = \Pi_v(X_{i,t}^p).
]
That gives you exactly what you wanted earlier: reprojection, IK/FK consistency, fixed skeleton once established, and a place to put historical noise.
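To make the FK-then-project composition concrete, here is a minimal Python sketch (all names are illustrative, not from any library): a toy planar two-bone chain posed by a root transform, projected through a pinhole camera.

```python
import numpy as np

def fk_two_bone(R_root, t_root, angles, bone_lengths):
    """Toy FK_i: a planar two-bone chain (think hip -> knee -> ankle)
    posed in world space by a root rotation R_root and translation t_root."""
    pts = [np.zeros(3)]
    theta, x = 0.0, np.zeros(3)
    for a, L in zip(angles, bone_lengths):
        theta += a
        # the chain articulates in the local x-z plane
        x = x + L * np.array([np.sin(theta), 0.0, np.cos(theta)])
        pts.append(x.copy())
    return [R_root @ p + t_root for p in pts]

def project(K, X):
    """Pinhole projection Pi_v of a camera-frame point X (z > 0)."""
    u = K @ X
    return u[:2] / u[2]
```

In the real system FK would cover the full body tree and (\Pi_v) would include extrinsics and distortion; this only illustrates the composition (\Pi_v(FK_i(\cdot))).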
For association, I would split the problem in two. For already-existing tracks, do not solve full cross-view association from scratch each frame. Predict each 3D track forward, reproject it into every view, and associate detections to that predicted skeleton in each camera. Cross-view consistency then comes from the fact that the same 3D state must explain all matched views. For births only, use a within-frame geometric proposer: build cross-view hypotheses from stable core joints, triangulate, reproject, reject bad hypotheses, cluster in 3D, then initialize a new articulated track from the surviving cluster. RapidPose explicitly uses a core-joint subset early on—shoulders, hips, elbows, wrists, knees, ankles—to keep association cheap before triangulating all joints. If you want a published association front-end, Dong et al. use multi-way matching over 2D poses with geometry and appearance cues, while Part-Aware Measurement uses temporal consistency to associate current 2D poses with previous 3D skeletons and adds a filter for 2D outliers. ([arXiv][1])
The update should be recursive and articulated. I would not use a Kalman filter on raw DLT joints. I would do a one-step recursive MAP or iterated-EKF-style update on the articulated state:
[
\hat x_t^p
=
\arg\min_x
|x-\bar x_t^p|_{Q^{-1}}^2
+
\sum_{(v,m)\in A_t^p}\sum_{i\in\mathcal J}
\rho\!\left(
|u_{v,i,t}^{m}-\Pi_v(FK_i(x))|_{R_{v,i,t}^{-1}}^2
\right)
+
\phi_{\text{limits}}(q)
+
\phi_{\text{shape}}(\beta).
]
Here (\bar x_t^p) is the predicted state, (A_t^p) is the set of associated detections, (\rho) is a robust loss, and the priors enforce joint limits and fixed shape. This is fully online. No sliding window is required.
One important point: do not treat the same 2D detections and their DLT triangulation as two independent measurements unless you explicitly model the correlation. If the 3D DLT point was computed from those same 2D keypoints, then using both naively double-counts the evidence. The clean choice is to use raw multi-view 2D detections for the main update, and use DLT 3D only for proposal generation, coarse gating, or initialization. If you insist on using DLT in the update, make it a pseudo-measurement with covariance propagated from the 2D uncertainty.
For noise, assume the detector is heavy-tailed and heteroscedastic. That is the right place to be robust, instead of hiding the problem inside the pose state. The simplest useful model is Student-(t) or a Gaussian scale mixture:
[
u_{v,i,t} = \hat u_{v,i,t} + \epsilon_{v,i,t},
\qquad
\epsilon_{v,i,t}\sim t_\nu(0,R_{v,i,t}).
]
Then update a per-joint, per-view scale recursively from the innovation
[
r_{v,i,t}=u_{v,i,t}-\hat u_{v,i,t},
]
for example with
[
w_{v,i,t}=\frac{\nu+2}{\nu+r_{v,i,t}^\top S_{v,i,t}^{-1}r_{v,i,t}},
]
and
[
\log \sigma^2_{v,i,t}
=
\lambda \log \sigma^2_{v,i,t-1}
+
(1-\lambda)\log(\epsilon+|r_{v,i,t}|^2).
]
Then define
[
R_{v,i,t}=R_{\min}+f(c_{v,i,t})+\sigma^2_{v,i,t}I.
]
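The recursive weight and scale updates above can be sketched as follows (assuming scalar per-joint scales; `nu` and `lam` are illustrative values, not from the text):

```python
import numpy as np

def student_t_weight(r, S_inv, nu=4.0):
    """IRLS weight (nu + 2) / (nu + r^T S^{-1} r): large innovations
    are softly down-weighted instead of hard-rejected."""
    return (nu + 2.0) / (nu + float(r @ S_inv @ r))

def update_log_var(log_var, r, lam=0.9, eps=1e-6):
    """Exponential forgetting on the log innovation energy."""
    return lam * log_var + (1.0 - lam) * np.log(eps + float(r @ r))
```

A zero innovation keeps the weight near its maximum of ((\nu+2)/\nu); a gross outlier drives it toward zero without any hard threshold.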
If you do use triangulated 3D proposals, add a geometry term that depends on ray intersection angle or triangulation condition number, because a narrow baseline should produce much larger depth uncertainty than a wide baseline.
This is the non-window way to use history. The history lives in three places: the dynamic prior on ((T,q)), the nearly-static shape (\beta), and the recursive noise state (\eta) or (\sigma^2). That is not a sliding window. It is a proper state-space model with exponential forgetting.
For (\beta), I would use a two-timescale rule. During track establishment, allow slow adaptation under a strong anthropometric prior. After the track is stable, freeze it. In practice I would not start with fully free per-bone lengths. I would use a low-dimensional shape model: global scale, leg-length ratio, arm-length ratio, torso scale, shoulder width, maybe hip width. Full free bone lengths tend to absorb detector bias and produce a skeleton that explains bad keypoints by deforming itself. Re-open (\beta) only if the track is lost and later reinitialized, or if you have strong evidence over many good frames that the previous estimate was wrong.
For gating, use Mahalanobis distance in image space on core joints:
[
d^2_{p,v,m}
=
\sum_{i\in \mathcal J_{\text{core}}}
(u_{v,i}^m-\hat u_{v,i}^p)^\top
R_{v,i,t}^{-1}
(u_{v,i}^m-\hat u_{v,i}^p).
]
Gate per joint or per limb, not only on the full body. If one wrist is bad, mark it missing; do not reject the entire person. A secondary 3D gate is useful only for proposal merging:
[
d^2_{3D} =
(\hat X^{DLT}-X^{pred})^\top
(\Sigma_{DLT}+\Sigma_{pred})^{-1}
(\hat X^{DLT}-X^{pred}).
]
Primary association should still be driven by the FK-reprojected articulated state.
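Per-joint gating can be sketched as below (the chi-square gate value is a standard 2-DOF choice, my addition rather than something from the text):

```python
import numpy as np

CHI2_2DOF_99 = 9.21  # ~99% quantile of chi-square with 2 DOF

def gate_joints(u_obs, u_pred, R_list, tau=CHI2_2DOF_99):
    """Per-joint Mahalanobis gating in image space: each joint is accepted
    or marked missing on its own, so one bad wrist never rejects the person."""
    mask = []
    for u, u_hat, R in zip(u_obs, u_pred, R_list):
        r = u - u_hat
        mask.append(float(r @ np.linalg.solve(R, r)) <= tau)
    return mask
```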
So the short design is:
Proposal layer: per-frame cross-view geometric hypotheses from noisy 2D detections.
State layer: articulated track ((SE(3)) root pose, joint angles, and (\beta)).
Observation layer: direct multi-view 2D reprojection residuals.
Noise layer: recursive per-joint, per-view heavy-tailed covariance adaptation.
Association: predicted reprojection gating for existing tracks; triangulation/clustering only for births.
Shape: estimate slowly, then freeze.
That is the version I would trust. It is robust, physically constrained, reprojectable, and it does not depend on a sliding window to behave well.
[1]: https://arxiv.org/html/2503.21692v5 "RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond"
---
Yes. I would make it explicitly two-stage. RapidPoseTriangulation already has a fast within-frame proposal stage based on cross-view pairs, core-joint triangulation, reprojection filtering, 3D grouping, and only then an optional track-assignment stage; Part-Aware likewise uses the previous 3D skeletons to associate current 2D poses across views and time while filtering 2D outliers. Your “tentative then established” split is the right systems version of that idea. ([arXiv][1])
For the tentative state, I would not use per-joint (SO(3)) yet. I would use one root state plus a free 3D core-joint cloud. Concretely,
[
x_t^{\text{tent}}=
[c_t,\ \dot c_t,\ Y_t^{\text{core}},\ \dot Y_t^{\text{core}},\ s_t,\ \Sigma_t,\ q_t^{\text{track}}],
]
where (c_t) is pelvis/root position, (Y_t^{\text{core}}) are root-centered 3D core joints, (s_t) is just a coarse global scale, (\Sigma_t) is uncertainty, and (q_t^{\text{track}}) is an existence/quality score. If you need an orientation, use a temporary torso frame derived from hips and shoulders, but do not commit to IK yet. RapidPose uses shoulders, hips, elbows, wrists, knees, and ankles as its core set; for confirmation I would weight torso and proximal joints highest and treat distal joints as secondary evidence. That weighting choice is my recommendation, not something stated in the paper. ([arXiv][1])
A short fixed buffer in the tentative stage makes sense. I would treat it as an evidence buffer, not as the main estimator. Promotion should require K-of-M good updates, minimum multi-view support, low normalized reprojection error, stable limb ratios, and non-degenerate triangulation geometry. A practical rule is to estimate only a low-dimensional (\beta) at promotion time,
[
\beta \approx [\text{global scale},\ \text{torso scale},\ \text{arm ratio},\ \text{leg ratio}],
]
from weighted median limb lengths over the tentative buffer, then soft-lock it for a grace period and hard-freeze it later. I would not freeze a fully free per-bone skeleton immediately on the first “confirmed” frame. Part-Aware's emphasis on temporal consistency for association and explicit 2D outlier filtering supports using history mainly for confirmation and robustness, not just smoothing. ([arXiv][2])
For deletion, do not go straight from “confirmed” to “dead.” Use four states: tentative, active, lost, dead. Tentative dies quickly if it is not confirmed within a small age budget or if it loses support for a couple of frames. Active moves to lost after a small number of consecutive misses. Lost keeps predicting forward, inflates covariance, and keeps trying reacquisition using the frozen (\beta) and reprojected core joints as gates. Delete only when the lost age is too large, the predicted uncertainty becomes too large to be useful, or the predicted root has exited the capture volume for several frames. A scalar existence score is better than a pure frame-count rule:
[
s_t=\lambda s_{t-1}+a\,n^{\text{views}}_t+b\,n^{\text{joints}}_t-c\,\bar e^{\text{repr}}_t-d\,\mathbf 1_{\text{miss}}.
]
Promote when (s_t>\tau_{\text{promote}}), move to lost when (s_t<\tau_{\text{lost}}), and delete when lost age or uncertainty crosses a bound. RapidPose's use of previous 3D poses for pair filtering and Part-Aware's use of previous 3D skeletons for 2D-3D association both support this “active/lost/reacquire” structure. ([arXiv][1])
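The existence-score recursion above can be sketched as follows (the coefficients and (\lambda) are illustrative placeholders, not tuned values):

```python
def update_existence(s, n_views, n_joints, mean_repr_err, missed,
                     lam=0.9, a=0.1, b=0.02, c=0.05, d=0.4):
    """s_t = lam*s_{t-1} + a*views + b*joints - c*err - d*miss:
    support raises the score, reprojection error and misses lower it."""
    return lam * s + a * n_views + b * n_joints - c * mean_repr_err - d * missed
```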
On the DLT question: “not using DLT” does not mean “use a neural network.” There are three different operations here. First, algebraic triangulation: OpenCV's `sfm::triangulatePoints` is explicitly a DLT-based triangulator. Second, geometric or nonlinear triangulation: refine the point by minimizing reprojection error rather than keeping the raw algebraic DLT solution. OpenCV also exposes `correctMatches`, which implements the optimal-triangulation-style correction of correspondences under the epipolar constraint before triangulation. Third, articulated state fitting: do not reconstruct a separate 3D point cloud at all; solve directly for the skeleton state that best explains all 2D observations. ([OpenCV Documentation][3])
For established tracks, the clean answer is the third one. Do not reconstruct an independent 3D joint cloud every frame. Solve the articulated state directly from all 2D views:
[
\hat x_t
=
\arg\min_x
|x-\bar x_t|_{Q^{-1}}^2
+
\sum_{v=1}^{N}\sum_{i\in\mathcal J}
w_{v,i,t}\,
\rho\!\left(
\left|
u_{v,i,t}-\Pi_v(FK_i(x,\beta))
\right|^2
\right)
+
\phi_{\text{limits}}(q)
+
\phi_{\text{shape}}(\beta).
]
This is the same optimization pattern used by top-down body-model fitting methods: SMPL-X/SMPLify-X fits a parametric body model to 2D features and explicitly relies on a pose prior because 2D-to-3D is ambiguous, and KITRO likewise starts from 2D keypoints and refines the 3D model along the kinematic tree while discussing the depth ambiguity of pure 2D fitting. My extension is: in your calibrated multi-view setting, apply that fitting jointly across views, where the ambiguity is much smaller than in single-view. That multi-view claim is an inference from geometry, not a quoted result from those single-view papers.
So for real time, I would not try to ban DLT. I would confine it to births and cheap initialization. Tentative births can use DLT or another simple ray-based triangulation on a conservative core set because that is fast and good enough to decide whether a hypothesis is real; RapidPose is essentially showing that this kind of proposal stage can be very cheap. Once the track is established, use the previous articulated state as the initializer and do a small number of Gauss-Newton or LM updates on the multi-view reprojection objective. That linear-init plus nonlinear-refine pattern is standard in vision; OpenCV's iterative `solvePnP`, for example, uses LM refinement with a DLT initializer for non-planar cases. That is only an analogy for the optimization pattern, not a claim that `solvePnP` solves human pose. ([arXiv][1])
So the recommendation is very simple. Tentative stage: root-centered free core-joint state, short evidence buffer, no IK. Promotion: fit a low-dimensional (\beta), soft-lock then freeze. Established stage: articulated IK/FK state updated directly from multi-view 2D reprojection, with recursive noise modeling and no sliding window. Deletion: active/lost/dead, driven by support and uncertainty, not just missed-frame count.
The next useful step is to write the exact state machine and per-frame update equations in implementation form.
[1]: https://arxiv.org/html/2503.21692v5 "RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond"
[2]: https://arxiv.org/abs/2106.11589 "[2106.11589] Part-Aware Measurement for Robust Multi-View Multi-Human 3D Pose Estimation and Tracking"
[3]: https://docs.opencv.org/4.x/d0/dbd/group__triangulation.html "OpenCV: Triangulation"
Below is a concrete version that matches your two-stage idea.
The central choice is:
The tentative stage is allowed to use a short evidence buffer and DLT-style 3D core-pose proposals.
The established stage does not maintain a free 3D keypoint cloud. It maintains an articulated state and updates that state directly from multi-view 2D reprojection error. No extra neural network is required.
Notation first.
(t) is frame index.
(v \in {1,\dots,N}) indexes cameras.
(m) indexes 2D detections inside one camera.
(p) indexes tracks.
(i \in \mathcal J) indexes full-body joints.
(\mathcal C \subset \mathcal J) is the core joint set, for example pelvis, hips, shoulders, elbows, knees.
Each camera has known projection (\Pi_v(\cdot)).
Each 2D detection is
[
d_{v,t}^m = {(u_{v,i,t}^m,\ c_{v,i,t}^m)}_{i\in\mathcal J},
]
where (u_{v,i,t}^m \in \mathbb R^2) is the observed 2D joint and (c_{v,i,t}^m \in [0,1]) is its confidence.
For an articulated track, the skeleton model is
[
X_{i,t}^p = FK_i(T_t^p,\ q_t^p,\ \beta^p),
]
where (T_t^p \in SE(3)) is the root transform, (q_t^p) are joint angles, and (\beta^p) are bone-length or shape parameters. The predicted 2D image joint is
[
\hat u_{v,i,t}^p = \Pi_v(X_{i,t}^p).
]
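The per-view detection notation maps naturally onto a small container type (a sketch; field names are mine, not from the text):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Detection2D:
    """One detection d_{v,t}^m in camera v at frame t."""
    view: int
    keypoints: np.ndarray    # (J, 2) observed image joints u_{v,i,t}^m
    confidence: np.ndarray   # (J,) confidences c_{v,i,t}^m in [0, 1]

    def visible(self, c_min=0.3):
        """Indices of joints confident enough to enter association."""
        return np.flatnonzero(self.confidence >= c_min)
```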
I would use this lifecycle:
```mermaid
stateDiagram-v2
[*] --> Tentative
Tentative --> Active: confirm\nscore > τ_promote\nK hits in last L\nmean views >= V_min\nrepr err <= τ_repr\nbone CV <= τ_len
Tentative --> Dead: age > A_tent_max\nor misses > M_tent
Active --> Active: matched and updated
Active --> Lost: misses > M_lost\nor score < τ_lost
Lost --> Active: reacquired\nsufficient 2D support\nor strong proposal match
Lost --> Dead: lost_age > A_delete\nor trace(P) > τ_P\nor outside volume
Dead --> [*]
```
Now the states.
Tentative state:
[
x_{t}^{p,\mathrm{tent}}
=
\big(
r_t^p,\ \dot r_t^p,\ Z_t^p,\ \dot Z_t^p,\ B_t^p,\ s_t^p,\ a_t^p,\ m_t^p,\ \Sigma_{Z,t}^p
\big).
]
Here:
(r_t^p \in \mathbb R^3) is provisional root or pelvis position.
(Z_t^p = {Z_{i,t}^p}_{i\in\mathcal C}) is a root-centered 3D core pose, so (Z_{\text{pelvis},t}^p=0).
(B_t^p) is a short FIFO evidence buffer of the last (L) accepted core-pose proposals.
(s_t^p) is the tentative existence score.
(a_t^p) is age in frames.
(m_t^p) is consecutive miss count.
(\Sigma_{Z,t}^p) is uncertainty of the core-pose estimate.
Established state:
[
x_t^{p,\mathrm{act}}
=
\big(
T_t^p,\ q_t^p,\ \xi_t^p,\ \dot q_t^p,\ \beta^p,\ f_\beta^p,\ P_t^p,\ \sigma_{v,i,t}^{2,p},\ s_t^p,\ m_t^p,\ \ell_t^p
\big).
]
Here:
(T_t^p \in SE(3)) is root pose.
(q_t^p \in \mathbb R^d) is articulated pose.
(\xi_t^p \in \mathbb R^6) is root velocity in Lie algebra coordinates.
(\dot q_t^p) is angular velocity.
(\beta^p \in \mathbb R^B) is low-dimensional shape or bone-length state.
(f_\beta^p \in {0,1}) says whether (\beta) is frozen.
(P_t^p) is approximate state covariance.
(\sigma_{v,i,t}^{2,p}) is per-view, per-joint measurement noise.
(s_t^p) is track score.
(m_t^p) is consecutive miss count.
(\ell_t^p) is lost age.
The per-frame algorithm is:
```text
function UPDATE_FRAME(t, detections_by_view, tracks, cameras):
split tracks into Tentative, Active, Lost
# 1) Predict established tracks
for p in Active ∪ Lost:
p_pred <- PREDICT_ESTABLISHED(p)
# 2) Associate 2D detections to predicted established tracks
matches, unused_detections <- ASSOCIATE_ESTABLISHED(p_pred, detections_by_view, cameras)
# 3) Update Active tracks by direct articulated fitting from 2D
for p in Active:
M_p <- BUILD_MEASUREMENT_SET(p, matches[p], cameras)
if SUFFICIENT_SUPPORT(M_p):
p <- UPDATE_ESTABLISHED(p_pred, M_p, cameras)
p <- UPDATE_ACTIVE_SCORE(p, M_p)
p.miss <- 0
if SHOULD_FREEZE_BETA(p): p.f_beta <- 1
else:
p <- MARK_MISSED(p_pred)
p <- ACTIVE_TO_LOST_IF_NEEDED(p)
# 4) Reacquire Lost tracks
for p in Lost:
M_p <- BUILD_MEASUREMENT_SET(p, matches[p], cameras)
if SUFFICIENT_REACQ_SUPPORT(M_p):
p <- UPDATE_ESTABLISHED(p_pred, M_p, cameras)
p <- REACTIVATE(p)
else:
p <- PREDICT_ONLY(p_pred)
p.lost_age <- p.lost_age + 1
# 5) Build geometric proposals from unused detections
proposals <- BUILD_CORE_PROPOSALS(unused_detections, cameras)
clusters <- CLUSTER_PROPOSALS(proposals)
# 6) Update tentative tracks using proposal clusters
Tentative <- UPDATE_TENTATIVE_TRACKS(Tentative, clusters)
# 7) Promote mature tentative tracks to established
for z in Tentative:
if SHOULD_PROMOTE(z):
p0 <- PROMOTE_TO_ESTABLISHED(z, detections_by_view, cameras)
add p0 to Active
remove z from Tentative
elif SHOULD_DELETE_TENTATIVE(z):
remove z
# 8) Try proposal-based reacquisition for Lost tracks
for p in Lost:
c <- BEST_COMPATIBLE_CLUSTER(p, clusters)
if c exists and REACQ_COMPATIBLE(p, c):
p <- REACQUIRE_FROM_CLUSTER(p, c, detections_by_view, cameras)
move p to Active
# 9) Delete dead Lost tracks
for p in Lost:
if SHOULD_DELETE_LOST(p):
remove p
return Tentative, Active, Lost
```
Now the actual equations.
For tentative proposals, use only detections not already consumed by established tracks. For each camera pair ((v_1,v_2)) and detection pair ((a,b)), triangulate the visible core joints and build a provisional core pose
[
Y_{i,t}^{(v_1,a,v_2,b)} \in \mathbb R^3,\qquad i\in\mathcal C_{\mathrm{obs}}.
]
Then define the root
[
r_t = \text{pelvis}(Y_t),
]
and root-centered core joints
[
Z_{i,t} = Y_{i,t} - r_t.
]
Score that pair proposal with
[
E_{\mathrm{pair}}
=
\sum_{i\in\mathcal C_{\mathrm{obs}}}
\left[
\frac{|u_{v_1,i,t}^a-\Pi_{v_1}(Y_{i,t})|^2}{\sigma_{\mathrm{base}}^2(c_{v_1,i,t}^a)}
+
\frac{|u_{v_2,i,t}^b-\Pi_{v_2}(Y_{i,t})|^2}{\sigma_{\mathrm{base}}^2(c_{v_2,i,t}^b)}
\right]
+
\lambda_\theta \sum_{i\in\mathcal C_{\mathrm{obs}}}\frac{1}{\sin^2 \theta_i}
+
\lambda_{\mathrm{anthro}}\psi_{\mathrm{anthro}}(Z_t).
]
(\theta_i) is the angle between the two camera rays for joint (i). Small ray angle means bad geometry, so the penalty increases.
(\psi_{\mathrm{anthro}}(Z_t)) is a soft anthropometric sanity term. It can be very simple: reject impossible torso width, negative limb lengths, or absurd limb ratios.
Keep the proposal only if enough core joints are visible and (E_{\mathrm{pair}}) is below threshold.
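The pair-proposal step can be sketched with a plain two-view DLT plus the ray-angle check (a minimal version; `P1`, `P2` are 3x4 projection matrices and `c1`, `c2` camera centers):

```python
import numpy as np

def triangulate_dlt(P1, P2, u1, u2):
    """Two-view DLT: stack the cross-product constraints A X = 0
    and take the null-space direction via SVD."""
    A = np.vstack([
        u1[0] * P1[2] - P1[0],
        u1[1] * P1[2] - P1[1],
        u2[0] * P2[2] - P2[0],
        u2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def ray_angle(c1, c2, X):
    """Angle between the two viewing rays at X; a small angle means
    weak geometry and should be penalized (the 1/sin^2 term above)."""
    r1 = (X - c1) / np.linalg.norm(X - c1)
    r2 = (X - c2) / np.linalg.norm(X - c2)
    return np.arccos(np.clip(r1 @ r2, -1.0, 1.0))
```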
Then cluster proposals in 3D using
[
d_{\mathrm{prop}}(p,q)
=
\frac{|r^p-r^q|^2}{\tau_r^2}
+
\lambda_Z
\sum_{i\in\mathcal C}
\frac{|Z_i^p-Z_i^q|^2}{\tau_Z^2}.
]
Each cluster becomes one proposal centroid for the tentative layer.
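A greedy sketch of the clustering step (thresholds (\tau_r), (\tau_Z), and the gate value are illustrative):

```python
import numpy as np

def d_prop(rp, Zp, rq, Zq, tau_r=0.3, tau_z=0.15, lam=1.0):
    """Proposal distance: normalized root offset plus normalized
    root-centered core-pose offset."""
    return (np.sum((rp - rq) ** 2) / tau_r ** 2
            + lam * np.sum((Zp - Zq) ** 2) / tau_z ** 2)

def greedy_cluster(proposals, gate=4.0):
    """Each proposal joins the first cluster whose seed it is compatible
    with; otherwise it seeds a new cluster."""
    clusters = []  # lists of proposal indices
    for idx, (r, Z) in enumerate(proposals):
        for c in clusters:
            r0, Z0 = proposals[c[0]]
            if d_prop(r, Z, r0, Z0) <= gate:
                c.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters
```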
Tentative update is simple and robust. Associate each cluster centroid to an existing tentative track or start a new tentative track if no association exists. Update with the buffer, not with a global sliding smoother. The buffer is only a birth-confirmation device.
A useful tentative score is
[
s_t^p
=
\lambda_s s_{t-1}^p
+
a_1 n_{\mathrm{pairs},t}^p
+
a_2 n_{\mathrm{views},t}^p
-
a_3 \bar e_{\mathrm{repr},t}^p
-
a_4 \mathrm{CV}_{\mathrm{bone},t}^p
-
a_5 \mathbf 1_{\mathrm{miss}}.
]
Here:
(n_{\mathrm{pairs},t}^p) is the number of supporting cross-view pair proposals in the cluster.
(n_{\mathrm{views},t}^p) is the number of unique cameras contributing.
(\bar e_{\mathrm{repr},t}^p) is the mean reprojection error of the cluster.
(\mathrm{CV}_{\mathrm{bone},t}^p) is the coefficient of variation of a few stable limb lengths in the tentative buffer.
Promotion rule:
[
\text{promote if}
\quad
a_t^p \ge A_{\min},
\quad
\#{\text{hits in last }L} \ge K,
\quad
\overline{n_{\mathrm{views}}} \ge V_{\min},
\quad
\overline{e_{\mathrm{repr}}} \le \tau_{\mathrm{repr}},
\quad
\mathrm{CV}_{\mathrm{bone}} \le \tau_{\mathrm{len}},
\quad
s_t^p \ge \tau_{\mathrm{promote}}.
]
At promotion, estimate (\beta) once from the tentative buffer. Let (\mathcal L) be a set of skeleton limbs. For each limb (b=(i,j)),
[
\hat \ell_b
=
\operatorname{wmedian}_{\tau \in B_t^p}
|Z_{j,\tau}^p - Z_{i,\tau}^p|.
]
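The wmedian estimate can be sketched as below (a simple weight-balanced median, robust to a few bad frames in the buffer; function names are mine):

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cw = np.cumsum(w)
    return float(v[np.searchsorted(cw, 0.5 * cw[-1])])

def limb_length_estimate(buffer_Z, i, j, weights):
    """hat ell_b: weighted median over the tentative buffer of the
    per-frame distance between joints i and j."""
    lengths = [np.linalg.norm(Z[j] - Z[i]) for Z in buffer_Z]
    return weighted_median(lengths, weights)
```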
Then fit the low-dimensional skeleton shape
[
\hat\beta
=
\arg\min_\beta
\sum_{b\in\mathcal L}
\omega_b \big(\ell_b(\beta)-\hat\ell_b\big)^2
+
\lambda_\beta |\beta-\beta_0|_{\Sigma_\beta^{-1}}^2.
]
That gives a stable initial skeleton. Then initialize the articulated pose:
[
(\hat T,\hat q)
=
\arg\min_{T,q}
\sum_{i\in\mathcal C}
|FK_i(T,q,\hat\beta)-(\hat r+\hat Z_i)|^2
+
\lambda_{\mathrm{lim}}\phi_{\mathrm{lim}}(q).
]
After this, create the established track and allow (\beta) to move only slightly for a short grace period. Then freeze it:
[
\beta_t \leftarrow \beta_{t-1} + \eta_\beta \Delta\beta_t,
\qquad
0 < \eta_\beta \ll 1.
]
Once the track is stable, set (f_\beta=1) and stop updating (\beta).
For established tracks, prediction is recursive:
[
T_{t|t-1} = T_{t-1}\exp(\Delta t\,\hat \xi_{t-1}),
]
[
q_{t|t-1} = q_{t-1} + \Delta t\,\dot q_{t-1},
]
[
\beta_{t|t-1} = \beta_{t-1}.
]
Now the key point: the established stage does not triangulate a fresh 3D body and then fit to that. It updates the articulated state directly from all available 2D measurements.
First, associate detections per view. For predicted track (p) in camera (v), the detection cost is
[
C_{p,m}^{(v)}
=
\sum_{i\in\mathcal C}
\delta_{i}\,\omega_i\,
\frac{|u_{v,i,t}^m-\hat u_{v,i,t}^p|^2}{\sigma_{v,i,t}^{2,p}}
-
\lambda_{\mathrm{id}}\,\mathbf 1[\text{same 2D tracker id}],
]
with (\delta_i=1) only for visible and confidence-valid joints. Use Hungarian or another bipartite matcher independently per view after gating.
Then update the articulated state by minimizing
[
x_t^{p,*}
=
\arg\min_x
|x-\bar x_t^p|_{Q^{-1}}^2
+
\sum_{(v,m)\in\mathcal M_t^p}
\sum_{i\in\mathcal J}
\delta_{v,i,t}^p
\,w_{v,i,t}^p\,
\left|
u_{v,i,t}^m - \Pi_v(FK_i(x))
\right|_{R_{v,i,t}^{-1}}^2
+
\lambda_{\mathrm{lim}}\phi_{\mathrm{lim}}(q)
+
\lambda_{\beta}\phi_{\beta}(\beta).
]
Definitions:
(\bar x_t^p) is the predicted state.
(\mathcal M_t^p) is the set of matched detections for track (p) across cameras.
(\delta_{v,i,t}^p \in {0,1}) is the joint acceptance mask after joint-level gating.
(R_{v,i,t} = \sigma_{v,i,t}^{2,p} I_2).
(w_{v,i,t}^p) is a robust weight.
A convenient robust weight is Student-(t):
[
w_{v,i,t}^p
=
\frac{\nu+2}{\nu + \left|u_{v,i,t}^m-\hat u_{v,i,t}^p\right|_{R_{v,i,t}^{-1}}^2}.
]
Joint-level gating uses
[
d_{v,i,t}^{2,p}
=
\left(u_{v,i,t}^m-\hat u_{v,i,t}^p\right)^\top
S_{v,i,t}^{-1}
\left(u_{v,i,t}^m-\hat u_{v,i,t}^p\right),
]
and accept the joint if (d_{v,i,t}^{2,p} \le \tau_{\mathrm{joint}}).
In practice, solve the update with 1 to 3 Gauss-Newton or LM steps from the predicted state. Let the local increment be
[
\Delta z = [\delta\xi,\ \delta q,\ \delta\beta_{\mathrm{free}}],
]
where (\delta\xi) is the root pose increment in Lie algebra coordinates, (\delta q) is the joint-angle increment, and (\delta\beta_{\mathrm{free}}) is omitted if (\beta) is frozen.
At one linearized step,
[
H
=
Q^{-1}
+
\sum_{(v,m),i}
\delta_{v,i,t}^p\,w_{v,i,t}^p\,J_{v,i,t}^\top R_{v,i,t}^{-1} J_{v,i,t}
+
H_{\mathrm{prior}},
]
[
g
=
Q^{-1}(z-\bar z_t^p)
-
\sum_{(v,m),i}
\delta_{v,i,t}^p\,w_{v,i,t}^p\,J_{v,i,t}^\top R_{v,i,t}^{-1} r_{v,i,t}
+
g_{\mathrm{prior}},
]
[
\Delta z = -H^{-1} g.
]
Here (r_{v,i,t} = u_{v,i,t}^m-\hat u_{v,i,t}^p), and (J_{v,i,t}) is the Jacobian of the projected joint with respect to the state increment.
Then update
[
T \leftarrow T\exp(\widehat{\delta\xi}),\qquad
q \leftarrow q + \delta q,\qquad
\beta \leftarrow \beta + \delta\beta_{\mathrm{free}}.
]
Approximate posterior covariance for deletion and gating can be taken as
[
P_t^p \approx H^{-1}.
]
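One linearized step can be sketched generically (my naming; `residuals` yields (r, J, W) triples with r = u - û and J = ∂û/∂z, matching the sign convention of g above):

```python
import numpy as np

def gauss_newton_step(z, z_pred, Q_inv, residuals):
    """One MAP step: H = Q_inv + sum J^T W J,
    g = Q_inv (z - z_pred) - sum J^T W r, then z <- z - H^{-1} g."""
    H = Q_inv.astype(float).copy()
    g = Q_inv @ (z - z_pred)
    for r, J, W in residuals:
        H += J.T @ W @ J
        g -= J.T @ W @ r
    return z - np.linalg.solve(H, g), H
```

For a linear measurement a single step lands on the exact MAP solution; in the articulated case, 1 to 3 such steps from the predicted state play that role, with `H` doubling as the posterior information matrix (P \approx H^{-1}).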
The recursive noise model is important because your detector is noisy and heteroscedastic. I would use
[
\sigma_{v,i,t}^{2,p}
=
\operatorname{clip}
\left(
\lambda_\sigma \sigma_{v,i,t-1}^{2,p}
+
(1-\lambda_\sigma)
\Big[
\sigma_{\mathrm{base}}^2(c_{v,i,t})
+
\kappa \min(|r_{v,i,t}|^2, r_{\max}^2)
\Big],
\ \sigma_{\min}^2,\ \sigma_{\max}^2
\right).
]
So history enters through the recursive state and recursive noise, not through a sliding window.
For the active track score, use
[
s_t^p
=
\lambda_s s_{t-1}^p
+
b_1 n_{\mathrm{views},t}^{p,\mathrm{acc}}
+
b_2 \frac{n_{\mathrm{joints},t}^{p,\mathrm{acc}}}{|\mathcal J|}
-
b_3 \bar e_{\mathrm{repr},t}^p
-
b_4 \mathbf 1_{\mathrm{miss}}.
]
Then the lifecycle rules become explicit.
Active to Lost:
[
m_t^p > M_{\mathrm{lost}}
\quad\text{or}\quad
s_t^p < \tau_{\mathrm{lost}}.
]
Lost to Active:
either enough direct 2D support reappears,
[
n_{\mathrm{views},t}^{p,\mathrm{acc}} \ge V_{\mathrm{reacq}},
]
or a tentative proposal cluster matches the lost track prediction:
[
C_{\mathrm{reacq}}(p,c)
=
\frac{|r_c-\hat r_t^p|^2}{\tau_r^2}
+
\lambda_Z
\sum_{i\in\mathcal C}
\frac{|Z_{c,i}-\hat Z_{i,t}^p|^2}{\tau_Z^2}
\le \tau_{\mathrm{reacq}}.
]
Lost to Dead:
[
\ell_t^p > A_{\mathrm{delete}}
\quad\text{or}\quad
\operatorname{tr}(P_t^p) > \tau_P
\quad\text{or}\quad
\hat r_t^p \notin \text{working volume for } G \text{ frames}.
]
Tentative to Dead:
[
a_t^p > A_{\mathrm{tent,max}}
\quad\text{or}\quad
m_t^p > M_{\mathrm{tent}}.
]
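The four transition rules can be collected into one sketch (thresholds are illustrative; the Lost-to-Active reacquisition test is abbreviated to a boolean):

```python
from enum import Enum, auto

class S(Enum):
    TENTATIVE = auto()
    ACTIVE = auto()
    LOST = auto()
    DEAD = auto()

def step(state, *, score, misses, age, lost_age, trace_P, reacquired=False,
         tau_promote=3.0, tau_lost=0.5, M_lost=3, M_tent=2,
         A_tent_max=10, A_delete=30, tau_P=5.0):
    """Apply the Tentative/Active/Lost/Dead rules above for one frame."""
    if state is S.TENTATIVE:
        if score >= tau_promote:
            return S.ACTIVE
        if age > A_tent_max or misses > M_tent:
            return S.DEAD
    elif state is S.ACTIVE:
        if misses > M_lost or score < tau_lost:
            return S.LOST
    elif state is S.LOST:
        if reacquired:
            return S.ACTIVE
        if lost_age > A_delete or trace_P > tau_P:
            return S.DEAD
    return state
```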
So the design is:
Tentative stage: free 3D core pose, short evidence buffer, DLT allowed, cheap and conservative.
Promotion: estimate (\beta) once from stable buffer statistics, initialize articulated model, then freeze (\beta) after a short grace period.
Established stage: direct recursive articulated fitting from multi-view 2D reprojection, no per-frame free 3D cloud, no sliding window.
Lost stage: predict only, inflate uncertainty, try 2D reprojection-based reacquisition first, proposal-based reacquisition second, delete only when uncertainty or lost age becomes too large.
This directly answers your earlier question about “how do I get 3D without DLT?”:
In the established stage, you do not first reconstruct a standalone 3D pose. The 3D body is the current articulated state itself. You recover it by minimizing multi-view 2D reprojection error of the FK model. DLT is only a birth and reacquisition aid.