forked from HQU-gxy/CVTH3PE
docs: Update mathematical notation and formatting in Chen et al. paper
- Improved LaTeX formatting for mathematical equations
- Added bold notation for vectors and matrices
- Corrected superscript and subscript formatting
- Enhanced readability of mathematical expressions in the paper
Most of the existing monocular solutions are designed for the single-person case.
To estimate multiple 3D poses from a monocular view, Mehta et al. [22] use location-maps [23] to infer 3D joint positions at the respective 2D joint pixel locations. Moon et al. [25] propose a root localization network to estimate the camera-centered coordinates of the human roots. Despite much recent progress in this area, monocular 3D pose estimation is inherently ambiguous, as multiple 3D poses can map to the same 2D joints. In practice, the mapping often deviates substantially, especially when occlusion or motion blur occurs in images.
On the other hand, multi-camera systems are becoming progressively available in applications such as sports analysis and video surveillance. Given images from multiple camera views, most previous methods [27, 29, 8, 2] are based on the 3D Pictorial Structure model (3DPS) [8], which discretizes 3-space with an $N \times N \times N$ grid and assigns each joint to one of the $N^3$ bins (hypotheses). Cross-view association and reconstruction are solved by minimizing the geometric error [16] between the estimated 3D poses and the 2D inputs over all hypotheses. Because they consider all joints of multiple people in all cameras simultaneously, these methods are generally computationally expensive due to the huge state space. Recent work by Dong et al. [13] proposes to solve the cross-view association problem at the body level first; 3DPS is subsequently applied to each cluster of 2D poses of the same person from different views. The state space is thereby reduced, as each person is processed individually. Nevertheless, the computational cost of this method's cross-view association is still too high to achieve real-time speed.

Figure 1: Multi-human multi-view 3D pose estimation. The triangles in the 3D view represent camera locations.

<!-- image -->
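To see why the 3DPS state space is so large, a back-of-envelope count helps. The sketch below uses purely illustrative values (the grid resolution, joint count, and person count are hypothetical, not taken from the cited papers):

```python
def dps_hypotheses(n_bins_per_axis, n_joints, n_people):
    """Count 3DPS joint hypotheses: each joint of each person may occupy
    any of the N^3 grid bins."""
    bins_per_joint = n_bins_per_axis ** 3
    return bins_per_joint, bins_per_joint * n_joints * n_people

# e.g. a 64^3 grid, 14 joints per skeleton, 4 people in the scene
bins, total = dps_hypotheses(64, 14, 4)
```

Even before any pairwise terms between joints are considered, the raw hypothesis count is in the millions, which is why reducing the state space per person, as in [13], is attractive.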
**Multi-view tracking for 3D pose estimation.** Multi-view tracking for 3D pose estimation is not a new topic in computer vision. However, combining the two tasks for fast and robust multi-human 3D pose estimation remains nontrivial, given the challenges mentioned above.
Markerless motion capture, which aims at 3D motion capture of a single person, has been studied for a decade [33, 14, 34]. Tracking in these early works was developed for joint localization and motion estimation. With recent progress in deep neural networks, temporal information has also been investigated with recurrent neural networks [30, 20] or convolutional neural networks [28] for single-view 3D pose estimation. However, these approaches are generally designed for well-aligned single-person cases, where the critical cross-view association problem is neglected.
As for the multi-human case, Belagiannis et al. [4] propose employing cross-view tracking results to assist 3D pose estimation under the framework of 3DPS. Their method introduces temporal consistency from an off-the-shelf cross-view tracker [5] to reduce the state space of 3DPS. This approach separates tracking and pose estimation into two tasks and runs at 1 fps, which is far from sufficient for time-critical applications. A very recent tracking approach [7] uses the estimated 3D poses as inputs to the tracker to improve tracking quality, but the pose estimation rarely benefits from the tracking results. Tang et al. [32] propose to jointly perform multi-view 2D tracking and pose estimation for 3D scene reconstruction. The 2D detections are associated using a ground-plane assumption, which is efficient but limits the accuracy. In contrast, we couple cross-view tracking and multi-human 3D pose estimation in a unified framework, making the two tasks benefit from each other in both accuracy and efficiency.
## 3. Method
Then, supposing there are $M$ detections $\{D_{i,t,c} \mid i = 1, \ldots, M\}$ in the
**Affinity measurement.** Given a pair of target and detection $(T_{t'}, D_{t,c})$, the affinity is measured from both 2D and 3D geometric correspondences:
$$A(T_{t'}, D_{t,c}) = \sum_{k=1}^{K} A_{2D}(\mathbf{x}^{k}_{t'',c}, \mathbf{x}^{k}_{t,c}) + A_{3D}(\mathbf{X}^{k}_{t'}, \mathbf{x}^{k}_{t,c})$$
where $\mathbf{x}^{k}_{t'',c}$ is the last matched joint $k$ of the target from camera $c$. For each type of human joint, the correspondence is computed independently; thus we omit the index $k$ in the following discussion for notational simplicity.
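The per-joint sum above can be sketched as follows. Here `a2d` and `a3d` are stand-ins for the $A_{2D}$ and $A_{3D}$ terms whose concrete forms follow, and the joint containers are hypothetical:

```python
def total_affinity(last_2d_joints, target_3d_joints, detected_joints, a2d, a3d):
    """Sum 2D and 3D correspondences over the K joint types (Equation 1).

    last_2d_joints:   per-joint last matched 2D positions of the target (x_{t'',c})
    target_3d_joints: per-joint 3D positions of the target (X_{t'})
    detected_joints:  per-joint detected 2D positions (x_{t,c})
    a2d, a3d:         callables computing the per-joint 2D and 3D terms
    """
    return sum(
        a2d(x_prev, x_det) + a3d(X_prev, x_det)
        for x_prev, X_prev, x_det in zip(last_2d_joints, target_3d_joints, detected_joints)
    )
```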
As shown in Figure 2a, the 2D correspondence is computed based on the distance between the detected joint $\mathbf{x}_{t,c}$ and the previously retained joint $\mathbf{x}_{t'',c}$ in camera coordinates:
$$
A_{2D}(\mathbf{x}_{t'',c}, \mathbf{x}_{t,c}) = w_{2D} \left(1 - \frac{\|\mathbf{x}_{t,c} - \mathbf{x}_{t'',c}\|}{\alpha_{2D}(t - t'')}\right) \cdot e^{-\lambda_a (t - t'')}
$$
come from the same person, and vice versa. The magnitude represents the confidence of the indication, which decreases exponentially as the time interval increases.
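A minimal sketch of this 2D term follows; the hyperparameter values for $w_{2D}$, $\alpha_{2D}$, and $\lambda_a$ are illustrative defaults, not the paper's actual settings:

```python
import math

def affinity_2d(x_prev, x_det, dt, w_2d=1.0, alpha_2d=100.0, lam_a=0.5):
    """2D correspondence: pixel distance between the detection x_det at time t
    and the last matched joint x_prev at time t'' (dt = t - t''), normalized by
    the velocity bound alpha_2d * dt and damped exponentially in dt."""
    dist = math.hypot(x_det[0] - x_prev[0], x_det[1] - x_prev[1])
    return w_2d * (1.0 - dist / (alpha_2d * dt)) * math.exp(-lam_a * dt)

# a nearby detection one time-step later yields a high (positive) affinity
score = affinity_2d((120.0, 80.0), (124.0, 83.0), dt=1.0)
```

A detection farther away than the bound $\alpha_{2D}(t - t'')$ makes the parenthesized factor negative, signalling that the pair is unlikely to be the same person.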
The 2D correspondence is the most basic affinity measurement, and it is the one exploited by single-view tracking methods. In order to track people across views, a 3D correspondence is introduced, as illustrated in Figure 2b. We suppose that the cameras are well calibrated and the projection matrix of camera $c$ is provided as $\mathbf{P}_c \in \mathbb{R}^{3 \times 4}$. We first back-project the detected 2D point $\mathbf{x}_{t,c}$ into 3-space as a ray:
$$
\tilde{\mathbf{X}}_t(\mu; \mathbf{x}_{t,c}) = \mathbf{P}_c^+ \tilde{\mathbf{x}}_{t,c} + \mu \tilde{\mathbf{X}}_c
$$
where $\mathbf{P}_c^+ \in \mathbb{R}^{4 \times 3}$ is the pseudo-inverse of $\mathbf{P}_c$ and $\mathbf{X}_c$ is the 3D location of the camera center. A superscript tilde denotes the corresponding homogeneous coordinates. The 3D correspondence is then defined as:
$$
A_{3D}(\mathbf{X}_{t'}, \mathbf{x}_{t,c}) = w_{3D} \left( 1 - \frac{d_t(\hat{\mathbf{X}}_t, \mathbf{X}_t(\mu))}{\alpha_{3D}} \right) \cdot e^{-\lambda_a (t - t')}
$$
$$
\hat{\mathbf{X}}_t = \mathbf{X}_{t'} + \mathbf{V}_{t'} \cdot (t - t')
$$
where $t \geq t'$ and $\mathbf{V}_{t'}$ is the 3D velocity estimated via a linear least-squares method.
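The 3D side of the measurement (ray back-projection, point-to-ray distance, and least-squares velocity) can be sketched as below. The toy camera matrix and parameter values are illustrative, and degenerate configurations (e.g. a back-projected point at infinity) are deliberately not handled:

```python
import numpy as np

def backproject_ray(P, x, cam_center):
    """Back-project pixel x through camera P (3x4) into a ray through the
    camera center; returns (origin, unit direction). P^+ applied to the
    homogeneous pixel gives one (here finite) point on the ray."""
    X_h = np.linalg.pinv(P) @ np.array([x[0], x[1], 1.0])
    point = X_h[:3] / X_h[3]
    d = point - cam_center
    return cam_center, d / np.linalg.norm(d)

def point_ray_distance(X, origin, direction):
    """Perpendicular distance from 3D point X to a ray with unit direction."""
    v = np.asarray(X, float) - origin
    return np.linalg.norm(v - (v @ direction) * direction)

def estimate_velocity(times, positions):
    """Per-axis linear least-squares fit X(t) ~ X0 + V * t; returns V."""
    A = np.stack([np.ones_like(times), times], axis=1)
    coef, *_ = np.linalg.lstsq(A, positions, rcond=None)
    return coef[1]

# toy camera: identity rotation, camera center at (0, 0, -1)
P = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 1.0]])
C = np.array([0.0, 0.0, -1.0])
origin, direction = backproject_ray(P, (0.0, 0.0), C)  # ray along +z
```

The perpendicular distance between the predicted joint $\hat{\mathbf{X}}_t$ and this ray is what the $d_t$ term above penalizes.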
Here, for the purpose of verifying the iterative processing strategy, we employ only geometric consistency in the affinity measurement for simplicity. This baseline formulation already achieves state-of-the-art performance for both human body association and 3D pose estimation, as we demonstrate in the experiments. The key contribution comes from Equation 4, where we match the detected 2D joints with targets directly in 3-space.
Compared with matching in pairs of views in the camera coordinates [13], our formulation has three advantages:

1. matching in 3-space is robust to partial occlusion and inaccurate 2D localization, as the 3D pose combines the information from multiple views;
2. motion estimation in 3-space is more feasible and reliable than in 2D camera coordinates;
3. the computational cost is significantly reduced, since only one comparison is required in 3-space for each pair of target and detection. To verify this, a quantitative comparison is conducted in the ablation study.
**Target update and initialization.** With the affinity measurement above, this section describes how we update and initialize targets in a particular iteration. First, we compute the affinity matrix between targets and detections using Equation 1 and solve the association problem via bipartite graph matching. Based on the association results, each detection is either assigned to a target or labeled as unmatched. In the former case, the 3D pose of the target is updated gradually with the new detection as 2D information is observed over time. Thus, 3D pose reconstruction in our framework is an incremental process, as detailed in Section 3.3.
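The association step can be sketched with a simple greedy matcher over the affinity matrix. The paper solves an optimal bipartite assignment; the greedy pass below is only a simplified stand-in that makes the matched/unmatched bookkeeping concrete:

```python
def associate(affinity, threshold=0.0):
    """Greedily pair targets (rows) with detections (columns): repeatedly pick
    the highest remaining affinity above `threshold`. Returns matched
    (target, detection) index pairs and the unmatched detection indices."""
    n_t = len(affinity)
    n_d = len(affinity[0]) if n_t else 0
    matches, used_t, used_d = [], set(), set()
    while True:
        best, bi, bj = threshold, -1, -1
        for i in range(n_t):
            if i in used_t:
                continue
            for j in range(n_d):
                if j in used_d:
                    continue
                if affinity[i][j] > best:
                    best, bi, bj = affinity[i][j], i, j
        if bi < 0:  # nothing left above the threshold
            break
        matches.append((bi, bj))
        used_t.add(bi)
        used_d.add(bj)
    unmatched = [j for j in range(n_d) if j not in used_d]
    return matches, unmatched

# detection 1 matches no target above the threshold, so it stays unmatched
# and may seed a new target
matches, unmatched = associate([[0.9, -0.4], [0.2, -0.1]])
```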