f1a2372b3c
Set up pose_tracking_exp as a uv-managed Python package for offline multiview body tracking experiments.

This initial commit includes:

- the typed package scaffold, CLI entrypoints, and repo-local uv configuration
- scene and replay loaders for generic JSON replays and ActualTest parquet inputs
- ParaJumping payload conversion and RTMPose-to-body20 normalization
- a custom articulated tracker with tentative, active, and lost lifecycle handling
- RPT-backed proposal generation, camera convention handling, and multiview reprojection updates
- regression tests for normalization, camera conventions, ActualTest ingestion, seeding, and tracker smoke flows
- project documentation covering extrinsic formats and the ActualTest calibration caveat
546 lines
14 KiB
Markdown
Below is a concrete version that matches your two-stage idea.

The central choice is:

The tentative stage is allowed to use a short evidence buffer and DLT-style 3D core-pose proposals.

The established stage does not maintain a free 3D keypoint cloud. It maintains an articulated state and updates that state directly from multi-view 2D reprojection error. No extra neural network is required.

Notation first.

- $t$ is the frame index.
- $v \in \{1,\dots,N\}$ indexes cameras.
- $m$ indexes 2D detections inside one camera.
- $p$ indexes tracks.
- $i \in \mathcal J$ indexes full-body joints.
- $\mathcal C \subset \mathcal J$ is the core joint set, for example pelvis, hips, shoulders, elbows, knees.

Each camera has a known projection $\Pi_v(\cdot)$.

Each 2D detection is

$$
d_{v,t}^m = \{(u_{v,i,t}^m,\ c_{v,i,t}^m)\}_{i\in\mathcal J},
$$

where $u_{v,i,t}^m \in \mathbb R^2$ is the observed 2D joint and $c_{v,i,t}^m \in [0,1]$ is its confidence.

For an articulated track, the skeleton model is

$$
X_{i,t}^p = \mathrm{FK}_i(T_t^p,\ q_t^p,\ \beta^p),
$$

where $T_t^p \in SE(3)$ is the root transform, $q_t^p$ are joint angles, and $\beta^p$ are bone-length or shape parameters. The predicted 2D image joint is

$$
\hat u_{v,i,t}^p = \Pi_v(X_{i,t}^p).
$$

I would use this lifecycle:

```mermaid
stateDiagram-v2
    [*] --> Tentative

    Tentative --> Active: confirm\nscore > τ_promote\nK hits in last L\nmean views >= V_min\nrepr err <= τ_repr\nbone CV <= τ_len
    Tentative --> Dead: age > A_tent_max\nor misses > M_tent

    Active --> Active: matched and updated
    Active --> Lost: misses > M_lost\nor score < τ_lost

    Lost --> Active: reacquired\nsufficient 2D support\nor strong proposal match
    Lost --> Dead: lost_age > A_delete\nor trace(P) > τ_P\nor outside volume

    Dead --> [*]
```

Now the states.

Tentative state:

$$
x_t^{p,\mathrm{tent}}
=
\big(
r_t^p,\ \dot r_t^p,\ Z_t^p,\ \dot Z_t^p,\ B_t^p,\ s_t^p,\ a_t^p,\ m_t^p,\ \Sigma_{Z,t}^p
\big).
$$

Here:

$r_t^p \in \mathbb R^3$ is the provisional root or pelvis position.

$Z_t^p = \{Z_{i,t}^p\}_{i\in\mathcal C}$ is a root-centered 3D core pose, so $Z_{\text{pelvis},t}^p=0$.

$\dot r_t^p$ and $\dot Z_t^p$ are the corresponding velocities.

$B_t^p$ is a short FIFO evidence buffer of the last $L$ accepted core-pose proposals.

$s_t^p$ is the tentative existence score.

$a_t^p$ is the age in frames.

$m_t^p$ is the consecutive miss count.

$\Sigma_{Z,t}^p$ is the uncertainty of the core-pose estimate.

Established state:

$$
x_t^{p,\mathrm{act}}
=
\big(
T_t^p,\ q_t^p,\ \xi_t^p,\ \dot q_t^p,\ \beta^p,\ f_\beta^p,\ P_t^p,\ \sigma_{v,i,t}^{2,p},\ s_t^p,\ m_t^p,\ \ell_t^p
\big).
$$

Here:

$T_t^p \in SE(3)$ is the root pose.

$q_t^p \in \mathbb R^d$ is the articulated pose (joint angles).

$\xi_t^p \in \mathbb R^6$ is the root velocity in Lie algebra coordinates.

$\dot q_t^p$ is the joint angular velocity.

$\beta^p \in \mathbb R^B$ is the low-dimensional shape or bone-length state.

$f_\beta^p \in \{0,1\}$ says whether $\beta$ is frozen.

$P_t^p$ is the approximate state covariance.

$\sigma_{v,i,t}^{2,p}$ is the per-view, per-joint measurement noise.

$s_t^p$ is the track score.

$m_t^p$ is the consecutive miss count.

$\ell_t^p$ is the lost age.

The per-frame algorithm is:

```text
function UPDATE_FRAME(t, detections_by_view, tracks, cameras):

    split tracks into Tentative, Active, Lost

    # 1) Predict established tracks
    for p in Active ∪ Lost:
        p_pred <- PREDICT_ESTABLISHED(p)

    # 2) Associate 2D detections to predicted established tracks
    matches, unused_detections <- ASSOCIATE_ESTABLISHED(p_pred, detections_by_view, cameras)

    # 3) Update Active tracks by direct articulated fitting from 2D
    for p in Active:
        M_p <- BUILD_MEASUREMENT_SET(p, matches[p], cameras)
        if SUFFICIENT_SUPPORT(M_p):
            p <- UPDATE_ESTABLISHED(p_pred, M_p, cameras)
            p <- UPDATE_ACTIVE_SCORE(p, M_p)
            p.miss <- 0
            if SHOULD_FREEZE_BETA(p): p.f_beta <- 1
        else:
            p <- MARK_MISSED(p_pred)
            p <- ACTIVE_TO_LOST_IF_NEEDED(p)

    # 4) Reacquire Lost tracks
    for p in Lost:
        M_p <- BUILD_MEASUREMENT_SET(p, matches[p], cameras)
        if SUFFICIENT_REACQ_SUPPORT(M_p):
            p <- UPDATE_ESTABLISHED(p_pred, M_p, cameras)
            p <- REACTIVATE(p)
        else:
            p <- PREDICT_ONLY(p_pred)
            p.lost_age <- p.lost_age + 1

    # 5) Build geometric proposals from unused detections
    proposals <- BUILD_CORE_PROPOSALS(unused_detections, cameras)
    clusters <- CLUSTER_PROPOSALS(proposals)

    # 6) Update tentative tracks using proposal clusters
    Tentative <- UPDATE_TENTATIVE_TRACKS(Tentative, clusters)

    # 7) Promote mature tentative tracks to established
    for z in Tentative:
        if SHOULD_PROMOTE(z):
            p0 <- PROMOTE_TO_ESTABLISHED(z, detections_by_view, cameras)
            add p0 to Active
            remove z from Tentative
        elif SHOULD_DELETE_TENTATIVE(z):
            remove z

    # 8) Try proposal-based reacquisition for Lost tracks
    for p in Lost:
        c <- BEST_COMPATIBLE_CLUSTER(p, clusters)
        if c exists and REACQ_COMPATIBLE(p, c):
            p <- REACQUIRE_FROM_CLUSTER(p, c, detections_by_view, cameras)
            move p to Active

    # 9) Delete dead Lost tracks
    for p in Lost:
        if SHOULD_DELETE_LOST(p):
            remove p

    return Tentative ∪ Active ∪ Lost
```

Now the actual equations.

For tentative proposals, use only detections not already consumed by established tracks. For each camera pair $(v_1,v_2)$ and detection pair $(a,b)$, triangulate the visible core joints and build a provisional core pose

$$
Y_{i,t}^{(v_1,a,v_2,b)} \in \mathbb R^3,\qquad i\in\mathcal C_{\mathrm{obs}}.
$$

Then define the root

$$
r_t = \text{pelvis}(Y_t),
$$

and root-centered core joints

$$
Z_{i,t} = Y_{i,t} - r_t.
$$

Score that pair proposal with

$$
E_{\mathrm{pair}}
=
\sum_{i\in\mathcal C_{\mathrm{obs}}}
\left[
\frac{\lVert u_{v_1,i,t}^a-\Pi_{v_1}(Y_{i,t})\rVert^2}{\sigma_{\mathrm{base}}^2(c_{v_1,i,t}^a)}
+
\frac{\lVert u_{v_2,i,t}^b-\Pi_{v_2}(Y_{i,t})\rVert^2}{\sigma_{\mathrm{base}}^2(c_{v_2,i,t}^b)}
\right]
+
\lambda_\theta \sum_{i\in\mathcal C_{\mathrm{obs}}}\frac{1}{\sin^2 \theta_i}
+
\lambda_{\mathrm{anthro}}\psi_{\mathrm{anthro}}(Z_t).
$$

$\theta_i$ is the angle between the two camera rays for joint $i$. A small ray angle means poorly conditioned triangulation, so the penalty grows as $\theta_i \to 0$.

$\psi_{\mathrm{anthro}}(Z_t)$ is a soft anthropometric sanity term. It can be very simple: penalize an impossible torso width, implausible limb lengths, or absurd limb ratios.

Keep the proposal only if enough core joints are visible and $E_{\mathrm{pair}}$ is below a threshold.

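The ray-angle penalty in $E_{\mathrm{pair}}$ can be sketched in isolation. The helper below is a minimal illustration, not the package's implementation: the function name and the choice to pass camera centers plus the triangulated point are assumptions. It computes $1/\sin^2\theta_i$ from the cross product of the two viewing rays.

```python
import math


def ray_angle_penalty(center_1, center_2, point):
    """Return 1 / sin^2(theta) for the rays from two camera centers to `point`.

    Hypothetical helper: all arguments are 3-element sequences in world
    coordinates.
    """
    ray_1 = [p - c for p, c in zip(point, center_1)]
    ray_2 = [p - c for p, c in zip(point, center_2)]
    # |a x b|^2 = |a|^2 |b|^2 sin^2(theta), so the ratio below is 1 / sin^2(theta).
    cross = (
        ray_1[1] * ray_2[2] - ray_1[2] * ray_2[1],
        ray_1[2] * ray_2[0] - ray_1[0] * ray_2[2],
        ray_1[0] * ray_2[1] - ray_1[1] * ray_2[0],
    )
    cross_sq = sum(c * c for c in cross)
    norm_sq = sum(a * a for a in ray_1) * sum(b * b for b in ray_2)
    if cross_sq == 0.0:
        # Parallel (or degenerate) rays carry no depth information.
        return math.inf
    return norm_sq / cross_sq


# Rays meeting at 90 degrees give the minimum penalty of 1.
wide = ray_angle_penalty((-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
# The same baseline seen from far away gives a much larger penalty.
narrow = ray_angle_penalty((-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 10.0))
```

Returning infinity for parallel rays means such a pair always fails the $E_{\mathrm{pair}}$ threshold, which matches the intent of rejecting bad geometry.
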
Then cluster proposals in 3D. For two proposals $p$ and $q$ (here both letters are plain proposal indices, not the track index or the joint angles), use

$$
d_{\mathrm{prop}}(p,q)
=
\frac{\lVert r^p-r^q\rVert^2}{\tau_r^2}
+
\lambda_Z
\sum_{i\in\mathcal C}
\frac{\lVert Z_i^p-Z_i^q\rVert^2}{\tau_Z^2}.
$$

Each cluster becomes one proposal centroid for the tentative layer.

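As a rough sketch of $d_{\mathrm{prop}}$, the function below evaluates the distance for two proposals stored as a root position plus a dictionary of root-centered core joints. The data layout, the function names, and the decision to sum only over joints present in both proposals are assumptions made for illustration.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length point sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def proposal_distance(root_p, core_p, root_q, core_q, tau_r, tau_z, lambda_z):
    """Evaluate d_prop for two proposals.

    `core_p` and `core_q` map joint names to root-centered 3D points.
    Assumption: joints missing from either proposal are skipped.
    """
    root_term = sq_dist(root_p, root_q) / tau_r**2
    shared = core_p.keys() & core_q.keys()
    core_term = sum(sq_dist(core_p[j], core_q[j]) for j in shared) / tau_z**2
    return root_term + lambda_z * core_term


d = proposal_distance(
    root_p=(0.0, 0.0, 0.0),
    core_p={"l_hip": (0.1, 0.0, 0.0), "r_hip": (-0.1, 0.0, 0.0)},
    root_q=(0.0, 0.0, 0.5),
    core_q={"l_hip": (0.1, 0.0, 0.2)},
    tau_r=0.5,
    tau_z=0.2,
    lambda_z=0.5,
)
```

Any threshold-based clustering (for example greedy merging of proposals whose distance falls below a cutoff) can be built on top of this function; the text above does not prescribe a specific clustering method.
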
Tentative update is simple and robust. Associate each cluster centroid to an existing tentative track or start a new tentative track if no association exists. Update with the buffer, not with a global sliding smoother. The buffer is only a birth-confirmation device.

A useful tentative score is

$$
s_t^p
=
\lambda_s s_{t-1}^p
+
a_1 n_{\mathrm{pairs},t}^p
+
a_2 n_{\mathrm{views},t}^p
-
a_3 \bar e_{\mathrm{repr},t}^p
-
a_4 \mathrm{CV}_{\mathrm{bone},t}^p
-
a_5 \mathbf 1_{\mathrm{miss}}.
$$

Here:

$n_{\mathrm{pairs},t}^p$ is the number of supporting cross-view pair proposals in the cluster.

$n_{\mathrm{views},t}^p$ is the number of unique cameras contributing.

$\bar e_{\mathrm{repr},t}^p$ is the mean reprojection error of the cluster.

$\mathrm{CV}_{\mathrm{bone},t}^p$ is the coefficient of variation of a few stable limb lengths in the tentative buffer.

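The score recursion is a one-liner once the inputs are available. The sketch below is illustrative only: the class and function names and the default weight values are assumptions, and the bone-length coefficient of variation is computed with the population standard deviation, which is one possible convention.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class TentativeScoreWeights:
    # Illustrative defaults, not tuned values.
    decay: float = 0.9    # lambda_s
    a_pairs: float = 0.2  # a_1
    a_views: float = 0.3  # a_2
    a_repr: float = 0.05  # a_3
    a_bone: float = 1.0   # a_4
    a_miss: float = 1.0   # a_5


def bone_length_cv(lengths):
    """Coefficient of variation (population std / mean) of one limb length."""
    return statistics.pstdev(lengths) / statistics.mean(lengths)


def update_tentative_score(prev_score, n_pairs, n_views, mean_repr_err,
                           bone_cv, missed, w=TentativeScoreWeights()):
    """One step of the tentative score: decayed history plus support minus penalties."""
    return (
        w.decay * prev_score
        + w.a_pairs * n_pairs
        + w.a_views * n_views
        - w.a_repr * mean_repr_err
        - w.a_bone * bone_cv
        - (w.a_miss if missed else 0.0)
    )


unit = TentativeScoreWeights(decay=0.5, a_pairs=1.0, a_views=1.0,
                             a_repr=1.0, a_bone=1.0, a_miss=1.0)
hit_score = update_tentative_score(2.0, 3, 2, 1.5, 0.5, False, unit)
miss_score = update_tentative_score(2.0, 0, 0, 0.0, 0.0, True, unit)
```
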
Promotion rule:

$$
\text{promote if}
\quad
a_t^p \ge A_{\min},
\quad
\#\{\text{hits in last } L\} \ge K,
\quad
\overline{n_{\mathrm{views}}} \ge V_{\min},
\quad
\overline{e_{\mathrm{repr}}} \le \tau_{\mathrm{repr}},
\quad
\mathrm{CV}_{\mathrm{bone}} \le \tau_{\mathrm{len}},
\quad
s_t^p \ge \tau_{\mathrm{promote}}.
$$

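Written as code, the promotion rule is a conjunction of six checks. The following sketch assumes the buffer statistics have already been computed by the caller; the type and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionThresholds:
    min_age: int           # A_min
    min_hits: int          # K
    min_mean_views: float  # V_min
    max_repr_err: float    # tau_repr
    max_bone_cv: float     # tau_len
    min_score: float       # tau_promote


def should_promote(age, hit_flags, mean_views, mean_repr_err, bone_cv, score, th):
    """Return True only if every promotion condition holds.

    `hit_flags` is a sequence of booleans for the last L frames
    (True means the tentative track received a proposal in that frame).
    """
    return (
        age >= th.min_age
        and sum(hit_flags) >= th.min_hits
        and mean_views >= th.min_mean_views
        and mean_repr_err <= th.max_repr_err
        and bone_cv <= th.max_bone_cv
        and score >= th.min_score
    )


th = PromotionThresholds(min_age=5, min_hits=4, min_mean_views=2.0,
                         max_repr_err=8.0, max_bone_cv=0.1, min_score=3.0)
stable = should_promote(6, [True, True, False, True, True], 2.5, 4.0, 0.05, 3.5, th)
wobbly = should_promote(6, [True, True, False, True, True], 2.5, 4.0, 0.20, 3.5, th)
```

Requiring all conditions at once keeps promotion conservative: a single inconsistent statistic, such as unstable bone lengths, is enough to hold a track in the tentative stage.
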
At promotion, estimate $\beta$ once from the tentative buffer. Let $\mathcal L$ be a set of skeleton limbs. For each limb $b=(i,j)$,

$$
\hat \ell_b
=
\operatorname{wmedian}_{\tau \in B_t^p}
\lVert Z_{j,\tau}^p - Z_{i,\tau}^p\rVert.
$$

Then fit the low-dimensional skeleton shape

$$
\hat\beta
=
\arg\min_\beta
\sum_{b\in\mathcal L}
\omega_b \big(\ell_b(\beta)-\hat\ell_b\big)^2
+
\lambda_\beta \lVert\beta-\beta_0\rVert_{\Sigma_\beta^{-1}}^2.
$$

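The weighted median used for $\hat\ell_b$ can be sketched as follows. This is a minimal version under stated assumptions: it returns the lower weighted median, the buffer is a list of dictionaries mapping joint names to root-centered points, and the per-entry weights (for instance proposal confidence) are supplied by the caller, since the text does not fix them.

```python
import math


def weighted_median(values, weights):
    """Lower weighted median: smallest value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= 0.5 * total:
            return value
    return pairs[-1][0]  # Defensive fallback for floating-point round-off.


def robust_limb_length(buffer, joint_i, joint_j, weights):
    """Weighted-median length of limb (joint_i, joint_j) over the tentative buffer."""
    lengths = [math.dist(pose[joint_i], pose[joint_j]) for pose in buffer]
    return weighted_median(lengths, weights)


buffer = [
    {"hip": (0.0, 0.0, 0.0), "knee": (0.0, 0.0, 0.4)},
    {"hip": (0.0, 0.0, 0.0), "knee": (0.0, 0.0, 0.5)},
    {"hip": (0.0, 0.0, 0.0), "knee": (0.0, 0.0, 2.0)},  # outlier with low weight
]
thigh = robust_limb_length(buffer, "hip", "knee", weights=[1.0, 1.0, 0.2])
```

The low-weight outlier at 2.0 does not move the estimate, which is the reason for preferring a weighted median over a weighted mean here.
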
That gives a stable initial skeleton. Then initialize the articulated pose:

$$
(\hat T,\hat q)
=
\arg\min_{T,q}
\sum_{i\in\mathcal C}
\lVert \mathrm{FK}_i(T,q,\hat\beta)-(\hat r+\hat Z_i)\rVert^2
+
\lambda_{\mathrm{lim}}\phi_{\mathrm{lim}}(q).
$$

After this, create the established track and allow $\beta$ to move only slightly for a short grace period, using a damped update:

$$
\beta_t \leftarrow \beta_{t-1} + \eta_\beta \Delta\beta_t,
\qquad
0 < \eta_\beta \ll 1.
$$

Once the track is stable, set $f_\beta=1$ and stop updating $\beta$.

For established tracks, prediction is recursive:

$$
T_{t|t-1} = T_{t-1}\exp(\Delta t\,\hat \xi_{t-1}),
$$

$$
q_{t|t-1} = q_{t-1} + \Delta t\,\dot q_{t-1},
$$

$$
\beta_{t|t-1} = \beta_{t-1}.
$$

Now the key point: the established stage does not triangulate a fresh 3D body and then fit to that. It updates the articulated state directly from all available 2D measurements.

First, associate detections per view. For predicted track $p$ in camera $v$, the detection cost is

$$
C_{p,m}^{(v)}
=
\sum_{i\in\mathcal C}
\delta_{i}\,\omega_i\,
\frac{\lVert u_{v,i,t}^m-\hat u_{v,i,t}^p\rVert^2}{\sigma_{v,i,t}^{2,p}}
-
\lambda_{\mathrm{id}} \mathbf 1[\text{same 2D tracker id}],
$$

with $\delta_i=1$ only for visible and confidence-valid joints. Use Hungarian or another bipartite matcher independently per view after gating.

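One entry of the per-view cost matrix can be computed as in the sketch below. The dictionary-based layout, the argument names, and the confidence cutoff used to decide $\delta_i$ are assumptions for illustration.

```python
def association_cost(predicted, detected, confidences, sigma_sq,
                     joint_weights, min_conf, id_bonus, same_id):
    """Cost of matching one detection to one predicted track in a single view.

    `predicted` and `detected` map core-joint names to (x, y) pixels.
    A joint contributes (delta_i = 1) only if it was detected with
    confidence of at least `min_conf`.
    """
    cost = 0.0
    for joint, (pred_x, pred_y) in predicted.items():
        if joint not in detected or confidences.get(joint, 0.0) < min_conf:
            continue  # delta_i = 0: joint is missing or not confidence-valid
        dx = detected[joint][0] - pred_x
        dy = detected[joint][1] - pred_y
        cost += joint_weights[joint] * (dx * dx + dy * dy) / sigma_sq[joint]
    if same_id:
        cost -= id_bonus  # lambda_id reward for a consistent 2D tracker id
    return cost


predicted = {"pelvis": (100.0, 100.0), "l_knee": (110.0, 150.0)}
detected = {"pelvis": (103.0, 104.0), "l_knee": (140.0, 190.0)}
confidences = {"pelvis": 0.9, "l_knee": 0.1}  # the knee is below the cutoff
sigma_sq = {"pelvis": 25.0, "l_knee": 25.0}
weights = {"pelvis": 1.0, "l_knee": 1.0}

plain = association_cost(predicted, detected, confidences, sigma_sq,
                         weights, min_conf=0.3, id_bonus=0.5, same_id=False)
with_id = association_cost(predicted, detected, confidences, sigma_sq,
                           weights, min_conf=0.3, id_bonus=0.5, same_id=True)
```

Filling a tracks-by-detections matrix with these costs gives the input for a bipartite matcher; `scipy.optimize.linear_sum_assignment` is one readily available option if SciPy is a dependency.
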
Then update the articulated state by minimizing

$$
x_t^{p,*}
=
\arg\min_x
\lVert x-\bar x_t^p\rVert_{Q^{-1}}^2
+
\sum_{(v,m)\in\mathcal M_t^p}
\sum_{i\in\mathcal J}
\delta_{v,i,t}^p\,w_{v,i,t}^p\,
\left\lVert
u_{v,i,t}^m - \Pi_v(\mathrm{FK}_i(x))
\right\rVert_{R_{v,i,t}^{-1}}^2
+
\lambda_{\mathrm{lim}}\phi_{\mathrm{lim}}(q)
+
\lambda_{\beta}\phi_{\beta}(\beta).
$$

Definitions:

$\bar x_t^p$ is the predicted state.

$\mathcal M_t^p$ is the set of matched detections for track $p$ across cameras.

$\delta_{v,i,t}^p \in \{0,1\}$ is the joint acceptance mask after joint-level gating.

$R_{v,i,t} = \sigma_{v,i,t}^{2,p} I_2$.

$w_{v,i,t}^p$ is a robust weight.

A convenient robust weight is Student-$t$:

$$
w_{v,i,t}^p
=
\frac{\nu+2}{\nu + \left\lVert u_{v,i,t}^m-\hat u_{v,i,t}^p\right\rVert_{R_{v,i,t}^{-1}}^2}.
$$

Joint-level gating uses

$$
d_{v,i,t}^{2,p}
=
\left(u_{v,i,t}^m-\hat u_{v,i,t}^p\right)^\top
S_{v,i,t}^{-1}
\left(u_{v,i,t}^m-\hat u_{v,i,t}^p\right),
$$

where $S_{v,i,t}$ is the innovation covariance of the predicted 2D joint (the measurement noise $R_{v,i,t}$ plus the projected state uncertainty). Accept the joint if $d_{v,i,t}^{2,p} \le \tau_{\mathrm{joint}}$.

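Both the gate and the robust weight depend only on a 2D residual and a 2x2 covariance, so they can be sketched without a linear algebra library. The function names are hypothetical, and the gate value 9.21 (roughly the 99% quantile of a chi-square distribution with two degrees of freedom) is only an example choice for $\tau_{\mathrm{joint}}$.

```python
def mahalanobis_sq_2d(residual, cov):
    """Squared Mahalanobis distance r^T cov^{-1} r for a 2D residual."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    rx, ry = residual
    # cov^{-1} = (1 / det) * [[d, -b], [-c, a]]
    return (rx * (d * rx - b * ry) + ry * (-c * rx + a * ry)) / det


def student_t_weight(d_sq, nu):
    """Student-t weight for a 2D measurement: (nu + 2) / (nu + d^2)."""
    return (nu + 2.0) / (nu + d_sq)


def accept_joint(d_sq, tau_joint=9.21):
    """Joint-level gate: keep the joint if its squared distance is small enough."""
    return d_sq <= tau_joint


residual = (3.0, 4.0)                      # pixels
isotropic = ((25.0, 0.0), (0.0, 25.0))     # sigma^2 = 25 on both axes
d_sq = mahalanobis_sq_2d(residual, isotropic)
weight = student_t_weight(d_sq, nu=4.0)
far_weight = student_t_weight(100.0, nu=4.0)
```

The weight decays smoothly with the residual, so gross outliers that slip through the gate still have limited influence on the fit.
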
In practice, solve the update with 1 to 3 Gauss-Newton or LM steps from the predicted state. Let the local increment be

$$
\Delta z = [\delta\xi,\ \delta q,\ \delta\beta_{\mathrm{free}}],
$$

where $\delta\xi$ is the root pose increment in Lie algebra coordinates, $\delta q$ is the joint-angle increment, and $\delta\beta_{\mathrm{free}}$ is omitted if $\beta$ is frozen.

At one linearized step,

$$
H
=
Q^{-1}
+
\sum_{(v,m),i}
\delta_{v,i,t}^p\,w_{v,i,t}^p\,J_{v,i,t}^\top R_{v,i,t}^{-1} J_{v,i,t}
+
H_{\mathrm{prior}},
$$

$$
g
=
Q^{-1}(z-\bar z_t^p)
-
\sum_{(v,m),i}
\delta_{v,i,t}^p\,w_{v,i,t}^p\,J_{v,i,t}^\top R_{v,i,t}^{-1} r_{v,i,t}
+
g_{\mathrm{prior}},
$$

$$
\Delta z = -H^{-1} g.
$$

Here $r_{v,i,t} = u_{v,i,t}^m-\hat u_{v,i,t}^p$, and $J_{v,i,t}$ is the Jacobian of the projected joint with respect to the state increment.

Then update

$$
T \leftarrow T\exp(\widehat{\delta\xi}),\qquad
q \leftarrow q + \delta q,\qquad
\beta \leftarrow \beta + \delta\beta_{\mathrm{free}}.
$$

Approximate posterior covariance for deletion and gating can be taken as

$$
P_t^p \approx H^{-1}.
$$

The recursive noise model is important because your detector is noisy and heteroscedastic. I would use

$$
\sigma_{v,i,t}^{2,p}
=
\operatorname{clip}
\left(
\lambda_\sigma \sigma_{v,i,t-1}^{2,p}
+
(1-\lambda_\sigma)
\Big[
\sigma_{\mathrm{base}}^2(c_{v,i,t})
+
\kappa \min(\lVert r_{v,i,t}\rVert^2, r_{\max}^2)
\Big],
\ \sigma_{\min}^2,\ \sigma_{\max}^2
\right).
$$

So history enters through the recursive state and recursive noise, not through a sliding window.

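The noise recursion is an exponential moving average with a clipped residual term and a clipped output. A minimal sketch follows; the function and argument names are assumptions, and the confidence-dependent base variance $\sigma_{\mathrm{base}}^2(c)$ is passed in as a number because its form is not specified above.

```python
import math


def update_noise_variance(prev_var, base_var, residual_sq, lam, kappa,
                          r_max_sq, var_min, var_max):
    """One step of the per-view, per-joint measurement-variance recursion.

    prev_var:    sigma^2 from the previous frame
    base_var:    sigma_base^2(c) for the current detection confidence
    residual_sq: squared reprojection residual |r|^2 of the current frame
    """
    # Cap the residual so a single gross outlier cannot blow up the variance.
    innovation = base_var + kappa * min(residual_sq, r_max_sq)
    blended = lam * prev_var + (1.0 - lam) * innovation
    return min(max(blended, var_min), var_max)


var = update_noise_variance(prev_var=4.0, base_var=2.0, residual_sq=100.0,
                            lam=0.5, kappa=0.1, r_max_sq=36.0,
                            var_min=1.0, var_max=25.0)
# With lam = 1 the previous variance is carried over, then clipped to var_max.
capped = update_noise_variance(prev_var=100.0, base_var=2.0, residual_sq=0.0,
                               lam=1.0, kappa=0.1, r_max_sq=36.0,
                               var_min=1.0, var_max=25.0)
```
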
For the active track score, use

$$
s_t^p
=
\lambda_s s_{t-1}^p
+
b_1 n_{\mathrm{views},t}^{p,\mathrm{acc}}
+
b_2 \frac{n_{\mathrm{joints},t}^{p,\mathrm{acc}}}{|\mathcal J|}
-
b_3 \bar e_{\mathrm{repr},t}^p
-
b_4 \mathbf 1_{\mathrm{miss}}.
$$

Then the lifecycle rules become explicit.

Active to Lost:

$$
m_t^p > M_{\mathrm{lost}}
\quad\text{or}\quad
s_t^p < \tau_{\mathrm{lost}}.
$$

Lost to Active: either enough direct 2D support reappears,

$$
n_{\mathrm{views},t}^{p,\mathrm{acc}} \ge V_{\mathrm{reacq}},
$$

or a tentative proposal cluster matches the lost track prediction:

$$
C_{\mathrm{reacq}}(p,c)
=
\frac{\lVert r_c-\hat r_t^p\rVert^2}{\tau_r^2}
+
\lambda_Z
\sum_{i\in\mathcal C}
\frac{\lVert Z_{c,i}-\hat Z_{i,t}^p\rVert^2}{\tau_Z^2}
\le \tau_{\mathrm{reacq}}.
$$

Lost to Dead:

$$
\ell_t^p > A_{\mathrm{delete}}
\quad\text{or}\quad
\operatorname{tr}(P_t^p) > \tau_P
\quad\text{or}\quad
\hat r_t^p \notin \text{working volume for } G \text{ frames}.
$$

Tentative to Dead:

$$
a_t^p > A_{\mathrm{tent,max}}
\quad\text{or}\quad
m_t^p > M_{\mathrm{tent}}.
$$

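These rules map onto a small state machine. The sketch below is one possible encoding, not the package's tracker: the class and field names are hypothetical, the promotion and reacquisition decisions are passed in as flags computed elsewhere, and `outside_volume` is assumed to be already debounced over $G$ frames by the caller.

```python
from dataclasses import dataclass
from enum import Enum


class TrackState(Enum):
    TENTATIVE = "tentative"
    ACTIVE = "active"
    LOST = "lost"
    DEAD = "dead"


@dataclass(frozen=True)
class LifecycleThresholds:
    m_lost: int       # M_lost
    tau_lost: float   # tau_lost
    a_delete: int     # A_delete
    tau_p: float      # tau_P
    a_tent_max: int   # A_tent,max
    m_tent: int       # M_tent


@dataclass
class TrackStatus:
    state: TrackState
    age: int = 0
    misses: int = 0
    score: float = 0.0
    lost_age: int = 0
    cov_trace: float = 0.0


def next_state(track, th, *, promote=False, reacquired=False, outside_volume=False):
    """Apply the lifecycle rules once and return the resulting state."""
    if track.state is TrackState.TENTATIVE:
        if promote:
            return TrackState.ACTIVE
        if track.age > th.a_tent_max or track.misses > th.m_tent:
            return TrackState.DEAD
    elif track.state is TrackState.ACTIVE:
        if track.misses > th.m_lost or track.score < th.tau_lost:
            return TrackState.LOST
    elif track.state is TrackState.LOST:
        # Reacquisition is checked before deletion, as in the per-frame algorithm.
        if reacquired:
            return TrackState.ACTIVE
        if (track.lost_age > th.a_delete or track.cov_trace > th.tau_p
                or outside_volume):
            return TrackState.DEAD
    return track.state


th = LifecycleThresholds(m_lost=5, tau_lost=0.5, a_delete=60, tau_p=10.0,
                         a_tent_max=30, m_tent=3)
drifting = TrackStatus(TrackState.ACTIVE, misses=6, score=2.0)
stale = TrackStatus(TrackState.LOST, lost_age=61)
```
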
So the design is:

Tentative stage: free 3D core pose, short evidence buffer, DLT allowed, cheap and conservative.

Promotion: estimate $\beta$ once from stable buffer statistics, initialize articulated model, then freeze $\beta$ after a short grace period.

Established stage: direct recursive articulated fitting from multi-view 2D reprojection, no per-frame free 3D cloud, no sliding window.

Lost stage: predict only, inflate uncertainty, try 2D reprojection-based reacquisition first, proposal-based reacquisition second, delete only when uncertainty or lost age becomes too large.

This directly answers your earlier question about “how do I get 3D without DLT?”:

In the established stage, you do not first reconstruct a standalone 3D pose. The 3D body is the current articulated state itself. You recover it by minimizing multi-view 2D reprojection error of the FK model. DLT is only a birth and reacquisition aid.