pose_tracking_exp/research/draft.md

Below is a concrete version that matches your two-stage idea.

The central choice is:

The tentative stage is allowed to use a short evidence buffer and DLT-style 3D core-pose proposals.

The established stage does not maintain a free 3D keypoint cloud. It maintains an articulated state and updates that state directly from multi-view 2D reprojection error. No extra neural network is required.

Notation first.

(t) is frame index.
(v \in \{1,\dots,N\}) indexes cameras.
(m) indexes 2D detections inside one camera.
(p) indexes tracks.
(i \in \mathcal J) indexes full-body joints.
(\mathcal C \subset \mathcal J) is the core joint set, for example pelvis, hips, shoulders, elbows, knees.

Each camera has known projection (\Pi_v(\cdot)).

Each 2D detection is
[
d_{v,t}^m = \{(u_{v,i,t}^m,\ c_{v,i,t}^m)\}_{i\in\mathcal J},
]
where (u_{v,i,t}^m \in \mathbb R^2) is the observed 2D joint and (c_{v,i,t}^m \in [0,1]) is its confidence.

For an articulated track, the skeleton model is
[
X_{i,t}^p = FK_i(T_t^p,\ q_t^p,\ \beta^p),
]
where (T_t^p \in SE(3)) is the root transform, (q_t^p) are joint angles, and (\beta^p) are bone-length or shape parameters. The predicted 2D image joint is
[
\hat u_{v,i,t}^p = \Pi_v(X_{i,t}^p).
]

I would use this lifecycle:

```mermaid
stateDiagram-v2
    [*] --> Tentative

    Tentative --> Active: confirm\nage >= A_min\nscore >= τ_promote\nK hits in last L\nmean views >= V_min\nrepr err <= τ_repr\nbone CV <= τ_len
    Tentative --> Dead: age > A_tent_max\nor misses > M_tent

    Active --> Active: matched and updated
    Active --> Lost: misses > M_lost\nor score < τ_lost

    Lost --> Active: reacquired\nsufficient 2D support\nor strong proposal match
    Lost --> Dead: lost_age > A_delete\nor trace(P) > τ_P\nor outside volume

    Dead --> [*]
```

Now the states.

Tentative state:
[
x_{t}^{p,\mathrm{tent}}
=
\big(
r_t^p,\ \dot r_t^p,\ Z_t^p,\ \dot Z_t^p,\ B_t^p,\ s_t^p,\ a_t^p,\ m_t^p,\ \Sigma_{Z,t}^p
\big).
]

Here:

(r_t^p \in \mathbb R^3) is provisional root or pelvis position.

(Z_t^p = \{Z_{i,t}^p\}_{i\in\mathcal C}) is a root-centered 3D core pose, so (Z_{\text{pelvis},t}^p=0).

(\dot r_t^p) and (\dot Z_t^p) are the corresponding velocity estimates.

(B_t^p) is a short FIFO evidence buffer of the last (L) accepted core-pose proposals.

(s_t^p) is the tentative existence score.

(a_t^p) is age in frames.

(m_t^p) is consecutive miss count.

(\Sigma_{Z,t}^p) is uncertainty of the core-pose estimate.

Established state:
[
x_t^{p,\mathrm{act}}
=
\big(
T_t^p,\ q_t^p,\ \xi_t^p,\ \dot q_t^p,\ \beta^p,\ f_\beta^p,\ P_t^p,\ \sigma_{v,i,t}^{2,p},\ s_t^p,\ m_t^p,\ \ell_t^p
\big).
]

Here:

(T_t^p \in SE(3)) is root pose.

(q_t^p \in \mathbb R^d) is articulated pose.

(\xi_t^p \in \mathbb R^6) is root velocity in Lie algebra coordinates.

(\dot q_t^p) is angular velocity.

(\beta^p \in \mathbb R^B) is low-dimensional shape or bone-length state.

(f_\beta^p \in \{0,1\}) says whether (\beta) is frozen.

(P_t^p) is approximate state covariance.

(\sigma_{v,i,t}^{2,p}) is per-view, per-joint measurement noise.

(s_t^p) is track score.

(m_t^p) is consecutive miss count.

(\ell_t^p) is lost age.

The per-frame algorithm is:

```text
function UPDATE_FRAME(t, detections_by_view, tracks, cameras):

  split tracks into Tentative, Active, Lost

  # 1) Predict established tracks (pred maps each track to its predicted state)
  for p in Active ∪ Lost:
      pred[p] <- PREDICT_ESTABLISHED(p)

  # 2) Associate 2D detections to predicted established tracks
  matches, unused_detections <- ASSOCIATE_ESTABLISHED(pred, detections_by_view, cameras)

  # 3) Update Active tracks by direct articulated fitting from 2D
  for p in Active:
      M_p <- BUILD_MEASUREMENT_SET(pred[p], matches[p], cameras)
      if SUFFICIENT_SUPPORT(M_p):
          p <- UPDATE_ESTABLISHED(pred[p], M_p, cameras)
          p <- UPDATE_ACTIVE_SCORE(p, M_p)
          p.miss <- 0
          if SHOULD_FREEZE_BETA(p): p.f_beta <- 1
      else:
          p <- MARK_MISSED(pred[p])
      p <- ACTIVE_TO_LOST_IF_NEEDED(p)

  # 4) Reacquire Lost tracks from direct 2D support
  for p in Lost:
      M_p <- BUILD_MEASUREMENT_SET(pred[p], matches[p], cameras)
      if SUFFICIENT_REACQ_SUPPORT(M_p):
          p <- UPDATE_ESTABLISHED(pred[p], M_p, cameras)
          p <- REACTIVATE(p)
          move p to Active
      else:
          p <- PREDICT_ONLY(pred[p])
          p.lost_age <- p.lost_age + 1

  # 5) Build geometric proposals from unused detections
  proposals <- BUILD_CORE_PROPOSALS(unused_detections, cameras)
  clusters  <- CLUSTER_PROPOSALS(proposals)

  # 6) Update tentative tracks using proposal clusters
  Tentative <- UPDATE_TENTATIVE_TRACKS(Tentative, clusters)

  # 7) Promote mature tentative tracks to established
  for z in Tentative:
      if SHOULD_PROMOTE(z):
          p0 <- PROMOTE_TO_ESTABLISHED(z, detections_by_view, cameras)
          add p0 to Active
          remove z from Tentative
      elif SHOULD_DELETE_TENTATIVE(z):
          remove z from Tentative

  # 8) Try proposal-based reacquisition for tracks that are still Lost
  for p in Lost:
      c <- BEST_COMPATIBLE_CLUSTER(p, clusters)
      if c exists and REACQ_COMPATIBLE(p, c):
          p <- REACQUIRE_FROM_CLUSTER(p, c, detections_by_view, cameras)
          move p to Active

  # 9) Delete dead Lost tracks
  for p in Lost:
      if SHOULD_DELETE_LOST(p):
          remove p from Lost

  return Tentative ∪ Active ∪ Lost
```

Now the actual equations.

For tentative proposals, use only detections not already consumed by established tracks. For each camera pair ((v_1,v_2)) and detection pair ((a,b)), triangulate the visible core joints and build a provisional core pose
[
Y_{i,t}^{(v_1,a,v_2,b)} \in \mathbb R^3,\qquad i\in\mathcal C_{\mathrm{obs}}.
]
Then define the root
[
r_t = \text{pelvis}(Y_t),
]
and root-centered core joints
[
Z_{i,t} = Y_{i,t} - r_t.
]
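As a rough illustration of the proposal step, here is a minimal NumPy sketch of two-view DLT triangulation followed by root-centering. The function names, the toy camera matrices, and the joint names are assumptions made for the example, not part of the design above; it also assumes the pelvis is among the triangulated joints.

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Linear (DLT) triangulation of one joint from two or more views.

    projections: 3x4 camera matrices (assumed known and calibrated).
    points_2d:   matching (u, v) pixel observations, one per view.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    # Homogeneous least squares: right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Pinhole projection of a 3D point with a 3x4 camera matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def root_centered(Y, root="pelvis"):
    """Split triangulated joints Y into root r and root-centered joints Z."""
    r = Y[root]
    return r, {name: X - r for name, X in Y.items()}

# Toy setup (hypothetical): two cameras with a 1 m baseline along x.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
truth = {"pelvis": np.array([0.2, 0.1, 4.0]),
         "l_shoulder": np.array([0.0, -0.4, 4.1])}

Y = {name: triangulate_dlt([P1, P2], [project(P1, X), project(P2, X)])
     for name, X in truth.items()}
r, Z = root_centered(Y)
```

With noise-free observations the triangulated joints match the ground truth up to numerical precision; with real detections the result is only a provisional proposal that still has to pass the scoring below.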

Score that pair proposal with
[
E_{\mathrm{pair}}
=
\sum_{i\in\mathcal C_{\mathrm{obs}}}
\left[
\frac{\|u_{v_1,i,t}^a-\Pi_{v_1}(Y_{i,t})\|^2}{\sigma_{\mathrm{base}}^2(c_{v_1,i,t}^a)}
+
\frac{\|u_{v_2,i,t}^b-\Pi_{v_2}(Y_{i,t})\|^2}{\sigma_{\mathrm{base}}^2(c_{v_2,i,t}^b)}
\right]
+
\lambda_\theta \sum_{i\in\mathcal C_{\mathrm{obs}}}\frac{1}{\sin^2 \theta_i}
+
\lambda_{\mathrm{anthro}}\psi_{\mathrm{anthro}}(Z_t).
]

(\theta_i) is the angle between the two camera rays for joint (i). Small ray angle means bad geometry, so the penalty increases.

(\psi_{\mathrm{anthro}}(Z_t)) is a soft anthropometric sanity term. It can be very simple: penalize impossible torso widths, degenerate (near-zero) limb lengths, or absurd limb ratios.

Keep the proposal only if enough core joints are visible and (E_{\mathrm{pair}}) is below threshold.
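The pair score can be sketched as follows. This is only an illustration under assumptions: camera centers are passed in explicitly, the anthropometric term is supplied as a precomputed scalar, and the `eps` guard against exactly parallel rays is an addition of this example.

```python
import numpy as np

def ray_angle_penalty(X, center_1, center_2, eps=1e-9):
    """1 / sin^2(theta) for the two camera rays through the 3D point X."""
    d1 = (X - center_1) / np.linalg.norm(X - center_1)
    d2 = (X - center_2) / np.linalg.norm(X - center_2)
    cross = np.cross(d1, d2)
    sin2 = float(cross @ cross)  # |d1 x d2|^2 equals sin^2(theta) for unit rays
    return 1.0 / max(sin2, eps)

def pair_energy(resid_1, resid_2, sigma2_1, sigma2_2, penalties,
                lam_theta, anthro=0.0, lam_anthro=1.0):
    """E_pair from per-joint 2D residuals (n, 2) and variances (n,) in both views."""
    reproj = (np.sum(np.sum(resid_1 ** 2, axis=1) / sigma2_1)
              + np.sum(np.sum(resid_2 ** 2, axis=1) / sigma2_2))
    return float(reproj + lam_theta * np.sum(penalties) + lam_anthro * anthro)

X = np.zeros(3)
# Perpendicular rays: the best case, penalty 1.
good = ray_angle_penalty(X, np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
# Nearly parallel rays (10 cm baseline at 10 m): the penalty is very large.
bad = ray_angle_penalty(X, np.array([0.0, 0.0, -10.0]), np.array([0.1, 0.0, -10.0]))

resid = np.array([[3.0, 4.0], [0.0, 0.0]])  # one joint off by 5 px in view 1
E = pair_energy(resid, np.zeros((2, 2)),
                np.array([25.0, 25.0]), np.array([25.0, 25.0]),
                [good, good], lam_theta=0.1)
```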

Then cluster proposals in 3D using
[
d_{\mathrm{prop}}(p,q)
=
\frac{\|r^p-r^q\|^2}{\tau_r^2}
+
\lambda_Z
\sum_{i\in\mathcal C}
\frac{\|Z_i^p-Z_i^q\|^2}{\tau_Z^2}.
]

Each cluster becomes one proposal centroid for the tentative layer.
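One possible way to turn this distance into clusters is a single greedy pass, sketched below. The greedy scheme, the default thresholds, and the assumption that every proposal carries all core joints (no masking of missing joints) are choices made for the example; any standard clustering method over (d_{\mathrm{prop}}) would fit the description equally well.

```python
import numpy as np

def proposal_distance(r_p, Z_p, r_q, Z_q, tau_r=0.3, tau_Z=0.2, lam_Z=1.0):
    """d_prop between two proposals; Z arrays have shape (n_core, 3)."""
    root_term = np.sum((r_p - r_q) ** 2) / tau_r ** 2
    pose_term = np.sum((Z_p - Z_q) ** 2) / tau_Z ** 2
    return float(root_term + lam_Z * pose_term)

def greedy_cluster(proposals, max_dist=1.0):
    """Each proposal joins the first cluster whose seed is within max_dist."""
    clusters = []
    for idx, (r, Z) in enumerate(proposals):
        for members in clusters:
            r0, Z0 = proposals[members[0]]
            if proposal_distance(r, Z, r0, Z0) <= max_dist:
                members.append(idx)
                break
        else:
            clusters.append([idx])  # no cluster was close enough: start a new one
    return clusters

def centroid(proposals, members):
    """Mean root and mean core pose of one cluster."""
    r = np.mean([proposals[k][0] for k in members], axis=0)
    Z = np.mean([proposals[k][1] for k in members], axis=0)
    return r, Z

Z0 = np.zeros((3, 3))
proposals = [(np.array([0.00, 0.0, 0.0]), Z0),
             (np.array([0.05, 0.0, 0.0]), Z0),
             (np.array([3.00, 0.0, 0.0]), Z0)]
clusters = greedy_cluster(proposals)
r_c, Z_c = centroid(proposals, clusters[0])
```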

The tentative update is deliberately simple. Associate each cluster centroid with an existing tentative track, or start a new tentative track if none matches. Update the track through its buffer (B_t^p), not through a global sliding smoother: the buffer is only a birth-confirmation device.

A useful tentative score is
[
s_t^p
=
\lambda_s s_{t-1}^p
+
a_1 n_{\mathrm{pairs},t}^p
+
a_2 n_{\mathrm{views},t}^p
-
a_3 \bar e_{\mathrm{repr},t}^p
-
a_4 \mathrm{CV}_{\mathrm{bone},t}^p
-
a_5 \mathbf 1_{\mathrm{miss}}.
]

Here:

(n_{\mathrm{pairs},t}^p) is the number of supporting cross-view pair proposals in the cluster.

(n_{\mathrm{views},t}^p) is the number of unique cameras contributing.

(\bar e_{\mathrm{repr},t}^p) is the mean reprojection error of the cluster.

(\mathrm{CV}_{\mathrm{bone},t}^p) is the coefficient of variation of a few stable limb lengths in the tentative buffer.
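A small sketch of the score recursion, to make the signs concrete. The coefficient values are placeholders chosen for the example, and passing zeros for the cluster statistics on a missed frame is an assumption of this sketch.

```python
def tentative_score(prev, n_pairs, n_views, mean_repr_err, bone_cv, missed,
                    lam_s=0.8, a=(0.2, 0.3, 0.05, 2.0, 1.0)):
    """One step of the tentative score; support adds, error and misses subtract."""
    a1, a2, a3, a4, a5 = a
    return (lam_s * prev
            + a1 * n_pairs
            + a2 * n_views
            - a3 * mean_repr_err
            - a4 * bone_cv
            - a5 * float(missed))

# A well-supported frame followed by a missed frame.
s1 = tentative_score(0.0, n_pairs=3, n_views=3, mean_repr_err=4.0,
                     bone_cv=0.05, missed=False)
s2 = tentative_score(s1, n_pairs=0, n_views=0, mean_repr_err=0.0,
                     bone_cv=0.0, missed=True)
```

Here `s1` rises on good multi-view support and `s2` drops below zero after a single miss, which is the conservative behavior the tentative stage is meant to have.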

Promotion rule:
[
\text{promote if}
\quad
a_t^p \ge A_{\min},
\quad
\#\{\text{hits in last }L\} \ge K,
\quad
\overline{n_{\mathrm{views}}} \ge V_{\min},
\quad
\overline{e_{\mathrm{repr}}} \le \tau_{\mathrm{repr}},
\quad
\mathrm{CV}_{\mathrm{bone}} \le \tau_{\mathrm{len}},
\quad
s_t^p \ge \tau_{\mathrm{promote}}.
]
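The promotion rule is a plain conjunction, which the following sketch spells out. The threshold values and the input layout (one hit flag per frame of the last-(L) window, plus per-hit view counts and reprojection errors) are assumptions for illustration.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class PromotionThresholds:
    # Placeholder values; tune them for the actual camera rig and detector.
    A_min: int = 5
    K: int = 4
    V_min: float = 2.5
    tau_repr: float = 8.0
    tau_len: float = 0.10
    tau_promote: float = 3.0

def should_promote(age, hits, views_on_hits, repr_err_on_hits, bone_cv, score,
                   th=PromotionThresholds()):
    """All six conditions must hold; lists cover the last-L window only."""
    if not views_on_hits:
        return False  # no accepted proposal in the window
    return (age >= th.A_min
            and sum(hits) >= th.K
            and mean(views_on_hits) >= th.V_min
            and mean(repr_err_on_hits) <= th.tau_repr
            and bone_cv <= th.tau_len
            and score >= th.tau_promote)

hits = [True, True, False, True, True, True]
views = [3, 3, 2, 3, 3]
errs = [4.0, 5.0, 6.0, 4.5, 5.5]
ok = should_promote(6, hits, views, errs, bone_cv=0.04, score=3.4)
unstable_bones = should_promote(6, hits, views, errs, bone_cv=0.25, score=3.4)
too_young = should_promote(3, hits, views, errs, bone_cv=0.04, score=3.4)
```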

At promotion, estimate (\beta) once from the tentative buffer. Let (\mathcal L) be a set of skeleton limbs. For each limb (b=(i,j)),
[
\hat \ell_b
=
\operatorname{wmedian}_{\tau \in B_t^p}
\|Z_{j,\tau}^p - Z_{i,\tau}^p\|.
]
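A minimal sketch of the weighted-median limb length. The source does not say what the weights are, so the per-entry weights here (for example, a proposal quality score) are an assumption, as is the lower-median convention used to break ties.

```python
import numpy as np

def weighted_median(values, weights):
    """Lower weighted median: smallest value whose cumulative weight reaches half the total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    return float(values[order][np.searchsorted(cum, 0.5 * cum[-1])])

def limb_length(buffer, i, j, weights):
    """Robust length of limb (i, j) over a buffer of root-centered core poses."""
    lengths = [np.linalg.norm(Z[j] - Z[i]) for Z in buffer]
    return weighted_median(lengths, weights)

# Four buffered poses; the last one has a badly triangulated knee and a low weight.
buffer = [{"hip": np.zeros(3), "knee": np.array([0.0, length, 0.0])}
          for length in (0.40, 0.42, 0.41, 0.90)]
thigh = limb_length(buffer, "hip", "knee", weights=[1.0, 1.0, 1.0, 0.2])
```

The outlier at 0.90 m has no effect on the estimate, which is the reason to prefer a median over a mean at this stage.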
Then fit the low-dimensional skeleton shape
[
\hat\beta
=
\arg\min_\beta
\sum_{b\in\mathcal L}
\omega_b \big(\ell_b(\beta)-\hat\ell_b\big)^2
+
\lambda_\beta \|\beta-\beta_0\|_{\Sigma_\beta^{-1}}^2.
]

That gives a stable initial skeleton. Then initialize the articulated pose:
[
(\hat T,\hat q)
=
\arg\min_{T,q}
\sum_{i\in\mathcal C}
\|FK_i(T,q,\hat\beta)-(\hat r+\hat Z_i)\|^2
+
\lambda_{\mathrm{lim}}\phi_{\mathrm{lim}}(q).
]

After this, create the established track and let (\beta) move only slightly during a short grace period, using a damped update:
[
\beta_t \leftarrow \beta_{t-1} + \eta_\beta \Delta\beta_t,
\qquad
0 < \eta_\beta \ll 1.
]
Once the track is stable, set (f_\beta=1) and stop updating (\beta).

For established tracks, prediction is recursive:
[
T_{t|t-1} = T_{t-1}\exp(\Delta t\,\hat \xi_{t-1}),
]
[
q_{t|t-1} = q_{t-1} + \Delta t\,\dot q_{t-1},
]
[
\beta_{t|t-1} = \beta_{t-1}.
]
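The prediction step can be sketched with a closed-form SE(3) exponential. The twist ordering (translation part first, rotation part second), the first-order small-angle branch, and the function names are assumptions of this example.

```python
import numpy as np

def hat3(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Exponential map for a twist xi = (rho, omega), returned as a 4x4 transform."""
    rho, omega = xi[:3], xi[3:]
    theta = np.linalg.norm(omega)
    W = hat3(omega)
    if theta < 1e-8:
        R = np.eye(3) + W            # first-order approximation for tiny rotations
        V = np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta ** 2
        C = (theta - np.sin(theta)) / theta ** 3
        R = np.eye(3) + A * W + B * (W @ W)   # Rodrigues formula
        V = np.eye(3) + B * W + C * (W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ rho
    return T

def predict_established(T, q, beta, xi, q_dot, dt):
    """Constant-velocity prediction; beta is carried over unchanged."""
    return T @ se3_exp(dt * xi), q + dt * q_dot, beta.copy()

T1, q1, beta1 = predict_established(
    np.eye(4), np.array([0.1, 0.2]), np.array([0.45]),
    xi=np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
    q_dot=np.array([1.0, -1.0]), dt=0.5)
T_rot = se3_exp(np.array([0.0, 0.0, 0.0, 0.0, 0.0, np.pi / 2]))
```

Right-multiplying by the exponential treats the root velocity as expressed in the body frame, which matches the form of the prediction equation above.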

Now the key point: the established stage does not triangulate a fresh 3D body and then fit to that. It updates the articulated state directly from all available 2D measurements.

First, associate detections per view. For predicted track (p) in camera (v), the detection cost is
[
C_{p,m}^{(v)}
=
\sum_{i\in\mathcal C}
\delta_{i}
\,\omega_i\,
\frac{\|u_{v,i,t}^m-\hat u_{v,i,t}^p\|^2}{\sigma_{v,i,t}^{2,p}}
-
\lambda_{\mathrm{id}} \mathbf 1[\text{same 2D tracker id}],
]
with (\delta_i=1) only for visible and confidence-valid joints. Use Hungarian or another bipartite matcher independently per view after gating.
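To make the per-view association concrete, here is a sketch of the cost and a gated matcher. The exhaustive search below is only a stand-in for Hungarian matching that is practical for a handful of tracks and detections; the gate handling (a gated-out pair costs the same as leaving the track unmatched) and all parameter values are assumptions of the example.

```python
import itertools
import numpy as np

def association_cost(u_det, u_pred, valid, omega, sigma2, same_id, lam_id=0.5):
    """C_{p,m} for one (track, detection) pair; joint arrays have length n_core."""
    sq = np.sum((u_det - u_pred) ** 2, axis=1)
    return float(np.sum(valid * omega * sq / sigma2)) - lam_id * float(same_id)

def match_one_view(cost, gate):
    """Minimum-cost one-to-one matching by exhaustive search, with gating."""
    cost = np.asarray(cost, dtype=float)
    n_tracks, n_dets = cost.shape
    if n_tracks > n_dets:
        # Solve the transposed problem and swap the indices back.
        return sorted((p, m) for m, p in match_one_view(cost.T, gate))
    capped = np.minimum(cost, gate)  # a gated-out pair costs as much as "no match"
    best_total, best_perm = np.inf, ()
    for perm in itertools.permutations(range(n_dets), n_tracks):
        total = sum(capped[p, m] for p, m in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    return [(p, m) for p, m in enumerate(best_perm) if cost[p, m] <= gate]

u_pred = np.array([[100.0, 200.0], [150.0, 260.0]])
u_det = u_pred + np.array([3.0, 4.0])          # every joint off by 5 px
c = association_cost(u_det, u_pred, valid=np.array([1.0, 0.0]),
                     omega=np.ones(2), sigma2=np.full(2, 25.0), same_id=True)

both = match_one_view([[1.0, 50.0], [40.0, 2.0]], gate=20.0)
one = match_one_view([[1.0, 50.0], [2.0, 60.0]], gate=20.0)
tall = match_one_view([[1.0, 50.0], [40.0, 2.0], [30.0, 30.0]], gate=20.0)
```

For larger problems a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` applied to the capped cost matrix would be the usual choice.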

Then update the articulated state by minimizing
[
x_t^{p,*}
=
\arg\min_x
\|x-\bar x_t^p\|_{Q^{-1}}^2
+
\sum_{(v,m)\in\mathcal M_t^p}
\sum_{i\in\mathcal J}
\delta_{v,i,t}^p
\,w_{v,i,t}^p\,
\left\|
u_{v,i,t}^m - \Pi_v(FK_i(x))
\right\|_{R_{v,i,t}^{-1}}^2
+
\lambda_{\mathrm{lim}}\phi_{\mathrm{lim}}(q)
+
\lambda_{\beta}\phi_{\beta}(\beta).
]

Definitions:

(\bar x_t^p) is the predicted state.

(\mathcal M_t^p) is the set of matched detections for track (p) across cameras.

(\delta_{v,i,t}^p \in \{0,1\}) is the joint acceptance mask after joint-level gating.

(R_{v,i,t} = \sigma_{v,i,t}^{2,p} I_2).

(w_{v,i,t}^p) is a robust weight.

A convenient robust weight is Student-(t):
[
w_{v,i,t}^p
=
\frac{\nu+2}{\nu + \left\|u_{v,i,t}^m-\hat u_{v,i,t}^p\right\|_{R_{v,i,t}^{-1}}^2}.
]

Joint-level gating uses
[
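d_{v,i,t}^{2,p}
=
\left(u_{v,i,t}^m-\hat u_{v,i,t}^p\right)^\top
S_{v,i,t}^{-1}
\left(u_{v,i,t}^m-\hat u_{v,i,t}^p\right),
]
and accept the joint if (d_{v,i,t}^{2,p} \le \tau_{\mathrm{joint}}). Here (S_{v,i,t}) is the (2\times 2) innovation covariance of the predicted joint, for example the state covariance projected into the image plus (R_{v,i,t}).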
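The gate and the robust weight act on the same 2D residual, so they can be sketched together. The default gate of 9.21 (roughly the 99% quantile of a chi-square distribution with two degrees of freedom) and (\nu = 4) are assumed values for illustration.

```python
import numpy as np

def joint_gate_and_weight(u_obs, u_pred, S, sigma2, tau_joint=9.21, nu=4.0):
    """Return (accepted, weight) for one joint in one view.

    S is the 2x2 gating covariance; sigma2 is the scalar measurement variance,
    so R = sigma2 * I_2 as in the definitions above.
    """
    r = np.asarray(u_obs, dtype=float) - np.asarray(u_pred, dtype=float)
    d2_gate = float(r @ np.linalg.solve(S, r))   # Mahalanobis distance under S
    accepted = d2_gate <= tau_joint
    d2_meas = float(r @ r) / sigma2              # squared residual under R^{-1}
    weight = (nu + 2.0) / (nu + d2_meas)         # Student-t weight for 2D residuals
    return accepted, weight

S = 25.0 * np.eye(2)
near_ok, near_w = joint_gate_and_weight([103.0, 204.0], [100.0, 200.0], S, sigma2=16.0)
far_ok, far_w = joint_gate_and_weight([130.0, 240.0], [100.0, 200.0], S, sigma2=16.0)
zero_ok, zero_w = joint_gate_and_weight([100.0, 200.0], [100.0, 200.0], S, sigma2=16.0)
```

Note that the weight is above 1 for very small residuals and decays smoothly for large ones, so a joint that barely passes the gate still contributes much less than a well-fitting one.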

In practice, solve the update with 1 to 3 Gauss-Newton or LM steps from the predicted state. Let the local increment be
[
\Delta z = [\delta\xi,\ \delta q,\ \delta\beta_{\mathrm{free}}],
]
where (\delta\xi) is the root pose increment in Lie algebra coordinates, (\delta q) is the joint-angle increment, and (\delta\beta_{\mathrm{free}}) is omitted if (\beta) is frozen.

At one linearized step,
[
H
=
Q^{-1}
+
\sum_{(v,m),i}
\delta_{v,i,t}^p\,w_{v,i,t}^p\,J_{v,i,t}^\top R_{v,i,t}^{-1} J_{v,i,t}
+
H_{\mathrm{prior}},
]
[
g
=
Q^{-1}(z-\bar z_t^p)
-
\sum_{(v,m),i}
\delta_{v,i,t}^p\,w_{v,i,t}^p\,J_{v,i,t}^\top R_{v,i,t}^{-1} r_{v,i,t}
+
g_{\mathrm{prior}},
]
[
\Delta z = -H^{-1} g.
]

Here (r_{v,i,t} = u_{v,i,t}^m-\hat u_{v,i,t}^p), and (J_{v,i,t}) is the Jacobian of the projected joint with respect to the state increment.

Then update
[
T \leftarrow T\exp(\widehat{\delta\xi}),\qquad
q \leftarrow q + \delta q,\qquad
\beta \leftarrow \beta + \delta\beta_{\mathrm{free}}.
]

Approximate posterior covariance for deletion and gating can be taken as
[
P_t^p \approx H^{-1}.
]
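The linearized step can be tried out on a deliberately reduced problem. In the sketch below the "state" is a single 3D point, standing in for the articulated state (so the FK map is the identity and there is no root or joint-angle parameterization), and the joint-limit and shape priors are dropped. The camera setup and all numeric values are assumptions; only the structure of (H), (g), and (\Delta z) follows the equations above.

```python
import numpy as np

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def projection_jacobian(P, X):
    """2x3 Jacobian of the pinhole projection with respect to the 3D point X."""
    x = P @ np.append(X, 1.0)
    M = P[:, :3]
    return (M[:2] * x[2] - np.outer(x[:2], M[2])) / x[2] ** 2

def gauss_newton_step(z, z_pred, Q_inv, cams, obs, sigma2, weights):
    """One step: H = Q^-1 + sum w J^T R^-1 J, g = Q^-1 (z - z_pred) - sum w J^T R^-1 r."""
    H = Q_inv.copy()
    g = Q_inv @ (z - z_pred)
    for P, u, s2, w in zip(cams, obs, sigma2, weights):
        J = projection_jacobian(P, z)
        r = u - project(P, z)            # residual: observed minus predicted
        H += w * (J.T @ J) / s2          # R = s2 * I_2
        g -= w * (J.T @ r) / s2
    dz = -np.linalg.solve(H, g)
    return z + dz, np.linalg.inv(H)      # updated state and approximate covariance

K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
def cam(tx, ty):
    return K @ np.hstack([np.eye(3), np.array([[tx], [ty], [0.0]])])

cams = [cam(0.0, 0.0), cam(-1.0, 0.0), cam(0.0, -1.0)]
X_true = np.array([0.2, 0.1, 4.0])
obs = [project(P, X_true) for P in cams]
z_pred = X_true + np.array([0.1, -0.05, 0.1])   # prediction is slightly off

z = z_pred.copy()
for _ in range(3):                              # 1 to 3 steps, as suggested above
    z, P_post = gauss_newton_step(z, z_pred, 1e-6 * np.eye(3), cams, obs,
                                  sigma2=[4.0] * 3, weights=[1.0] * 3)
```

With a weak motion prior and consistent observations, three steps are enough to pull the state back onto the true point in this toy case; in the real articulated problem the Jacobian would additionally chain through the FK model.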

The recursive noise model is important because your detector is noisy and heteroscedastic. I would use
[
\sigma_{v,i,t}^{2,p}
=
\operatorname{clip}
\left(
\lambda_\sigma \sigma_{v,i,t-1}^{2,p}
+
(1-\lambda_\sigma)
\Big[
\sigma_{\mathrm{base}}^2(c_{v,i,t})
+
\kappa \min(\|r_{v,i,t}\|^2, r_{\max}^2)
\Big],
\ \sigma_{\min}^2,\ \sigma_{\max}^2
\right).
]

So history enters through the recursive state and recursive noise, not through a sliding window.
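The noise recursion is easy to sketch in a few lines. The confidence-to-variance map `sigma_base_sq` is hypothetical (the source leaves (\sigma_{\mathrm{base}}^2(c)) unspecified), and all constants are placeholder values in pixel units.

```python
def sigma_base_sq(conf, std_low_conf=12.0, std_high_conf=2.0):
    """Hypothetical map: standard deviation falls linearly from 12 px (c=0) to 2 px (c=1)."""
    std = std_low_conf + (std_high_conf - std_low_conf) * conf
    return std * std

def update_sigma_sq(prev, conf, resid_sq, lam=0.9, kappa=0.5, r_max=30.0,
                    s2_min=1.0, s2_max=400.0):
    """Exponentially smoothed per-view, per-joint variance, clipped to [s2_min, s2_max]."""
    target = sigma_base_sq(conf) + kappa * min(resid_sq, r_max ** 2)
    blended = lam * prev + (1.0 - lam) * target
    return min(max(blended, s2_min), s2_max)

# A confident, well-fitting joint slowly tightens the noise estimate.
tight = update_sigma_sq(prev=16.0, conf=1.0, resid_sq=0.0)
# A low-confidence outlier pushes it up, but only as far as the upper clip.
loose = update_sigma_sq(prev=400.0, conf=0.0, resid_sq=1e6)
```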

For the active track score, use
[
s_t^p
=
\lambda_s s_{t-1}^p
+
b_1 n_{\mathrm{views},t}^{p,\mathrm{acc}}
+
b_2 \frac{n_{\mathrm{joints},t}^{p,\mathrm{acc}}}{|\mathcal J|}
-
b_3 \bar e_{\mathrm{repr},t}^p
-
b_4 \mathbf 1_{\mathrm{miss}}.
]

Then the lifecycle rules become explicit.

Active to Lost:
[
m_t^p > M_{\mathrm{lost}}
\quad\text{or}\quad
s_t^p < \tau_{\mathrm{lost}}.
]

Lost to Active:
either enough direct 2D support reappears,
[
n_{\mathrm{views},t}^{p,\mathrm{acc}} \ge V_{\mathrm{reacq}},
]
or a tentative proposal cluster matches the lost track prediction:
[
C_{\mathrm{reacq}}(p,c)
=
\frac{\|r_c-\hat r_t^p\|^2}{\tau_r^2}
+
\lambda_Z
\sum_{i\in\mathcal C}
\frac{\|Z_{c,i}-\hat Z_{i,t}^p\|^2}{\tau_Z^2}
\le \tau_{\mathrm{reacq}}.
]

Lost to Dead:
[
\ell_t^p > A_{\mathrm{delete}}
\quad\text{or}\quad
\operatorname{tr}(P_t^p) > \tau_P
\quad\text{or}\quad
\hat r_t^p \notin \text{working volume for } G \text{ frames}.
]

Tentative to Dead:
[
a_t^p > A_{\mathrm{tent,max}}
\quad\text{or}\quad
m_t^p > M_{\mathrm{tent}}.
]
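The four rule sets above can be collected into one transition function, sketched here. The threshold values are placeholders, the `promote` and `reacquired` flags are assumed to be computed elsewhere (by the promotion rule and the two reacquisition tests), and checking reacquisition before deletion is a choice of this sketch.

```python
from dataclasses import dataclass
from enum import Enum

class State(Enum):
    TENTATIVE = "Tentative"
    ACTIVE = "Active"
    LOST = "Lost"
    DEAD = "Dead"

@dataclass(frozen=True)
class Limits:
    # Placeholder thresholds, in frames unless noted otherwise.
    A_tent_max: int = 30
    M_tent: int = 3
    M_lost: int = 5
    tau_lost: float = 0.5    # score threshold
    A_delete: int = 60
    tau_P: float = 10.0      # covariance trace threshold
    G: int = 10

def next_state(state, *, age=0, misses=0, score=0.0, promote=False,
               reacquired=False, lost_age=0, trace_P=0.0, frames_outside=0,
               lim=Limits()):
    """Apply the lifecycle rules for one track and one frame."""
    if state is State.TENTATIVE:
        if promote:
            return State.ACTIVE
        if age > lim.A_tent_max or misses > lim.M_tent:
            return State.DEAD
    elif state is State.ACTIVE:
        if misses > lim.M_lost or score < lim.tau_lost:
            return State.LOST
    elif state is State.LOST:
        if reacquired:
            return State.ACTIVE
        if (lost_age > lim.A_delete or trace_P > lim.tau_P
                or frames_outside >= lim.G):
            return State.DEAD
    return state

s_promoted = next_state(State.TENTATIVE, age=8, promote=True)
s_lost = next_state(State.ACTIVE, misses=6, score=2.0)
s_kept = next_state(State.ACTIVE, misses=1, score=2.0)
s_back = next_state(State.LOST, reacquired=True, lost_age=100)
s_dead = next_state(State.LOST, lost_age=61)
```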

So the design is:

Tentative stage: free 3D core pose, short evidence buffer, DLT allowed, cheap and conservative.

Promotion: estimate (\beta) once from stable buffer statistics, initialize articulated model, then freeze (\beta) after a short grace period.

Established stage: direct recursive articulated fitting from multi-view 2D reprojection, no per-frame free 3D cloud, no sliding window.

Lost stage: predict only, inflate uncertainty, try 2D reprojection-based reacquisition first, proposal-based reacquisition second, delete only when uncertainty or lost age becomes too large.

This directly answers your earlier question about “how do I get 3D without DLT?”:

In the established stage, you do not first reconstruct a standalone 3D pose. The 3D body is the current articulated state itself. You recover it by minimizing multi-view 2D reprojection error of the FK model. DLT is only a birth and reacquisition aid.