forked from HQU-gxy/CVTH3PE
docs: Update mathematical notation and formatting in Chen et al. paper
- Improved LaTeX formatting for mathematical equations
- Added bold notation for vectors and matrices
- Corrected superscript and subscript formatting
- Enhanced readability of mathematical expressions in the paper
Most of the existing monocular solutions are designed for the single-person case.
To estimate multiple 3D poses from a monocular view, Mehta et al. [22] use location-maps [23] to infer 3D joint positions at the respective 2D joint pixel locations. Moon et al. [25] propose a root localization network to estimate the camera-centered coordinates of the human roots. Despite much recent progress in this area, monocular 3D pose estimation is inherently ambiguous, as multiple 3D poses can map to the same 2D joints. In practice, the mapping often deviates substantially, especially when occlusion or motion blur occurs in images.
On the other hand, multi-camera systems are becoming progressively available in applications such as sports analysis and video surveillance. Given images from multiple camera views, most previous methods [27, 29, 8, 2] are based on the 3D Pictorial Structure model (3DPS) [8], which discretizes 3-space with an $N \times N \times N$ grid and assigns each joint to one of the $N^3$ bins (hypotheses). Cross-view association and reconstruction are solved by minimizing the geometric error [16] between the estimated 3D poses and the 2D inputs over all hypotheses. Because they consider all joints of multiple people in all cameras simultaneously, these methods are generally computationally expensive due to the huge state space. Recent work by Dong et al. [13] proposes to solve the cross-view association problem at the body level first; 3DPS is subsequently applied to each cluster of 2D poses of the same person from different views. The state space is thereby reduced, as each person is processed individually. Nevertheless, the computational cost of this method's cross-view association is still too high to achieve real-time speed.

Figure 1: Multi-human multi-view 3D pose estimation. The triangles in the 3D view represent camera locations.

<!-- image -->
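To see why the 3DPS state space is so large, a back-of-envelope count helps. The sketch below uses purely illustrative values (the grid resolution, joint count, and person count are hypothetical, not taken from the cited papers):

```python
def dps_hypotheses(n_bins_per_axis, n_joints, n_people):
    """Count 3DPS joint hypotheses: each joint of each person may occupy
    any of the N^3 grid bins."""
    bins_per_joint = n_bins_per_axis ** 3
    return bins_per_joint, bins_per_joint * n_joints * n_people

# e.g. a 64^3 grid, 14 joints per skeleton, 4 people in the scene
bins, total = dps_hypotheses(64, 14, 4)
```

Even before any pairwise terms between joints are considered, the raw hypothesis count is in the millions, which is why reducing the state space per person, as in [13], is attractive.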
**Multi-view tracking for 3D pose estimation.** Multi-view tracking for 3D pose estimation is not a new topic in computer vision. However, combining the two tasks for fast and robust multi-human 3D pose estimation remains nontrivial, given the challenges mentioned above.
Markerless motion capture, which aims at 3D motion capture of a single person, has been studied for a decade [33, 14, 34]. Tracking in these early works was developed for joint localization and motion estimation. With recent progress in deep neural networks, temporal information has also been investigated with recurrent neural networks [30, 20] or convolutional neural networks [28] for single-view 3D pose estimation. However, these approaches are generally designed for well-aligned single-person cases, where the critical cross-view association problem is neglected.
As for the multi-human case, Belagiannis et al. [4] propose employing cross-view tracking results to assist 3D pose estimation under the framework of 3DPS. Their method introduces temporal consistency from an off-the-shelf cross-view tracker [5] to reduce the state space of 3DPS. This approach separates tracking and pose estimation into two tasks and runs at 1 fps, which is far from sufficient for time-critical applications. A very recent tracking approach [7] uses the estimated 3D poses as inputs to the tracker to improve tracking quality, but the pose estimation rarely benefits from the tracking results. Tang et al. [32] propose to jointly perform multi-view 2D tracking and pose estimation for 3D scene reconstruction. The 2D detections are associated using a ground-plane assumption, which is efficient but limits the accuracy. In contrast, we couple cross-view tracking and multi-human 3D pose estimation in a unified framework, making the two tasks benefit from each other in both accuracy and efficiency.
## 3. Method
Then, supposing there are $M$ detections $\{D_{i,t,c} \mid i = 1, \ldots, M\}$ in the
**Affinity measurement.** Given a pair of target and detection $(T_{t'}, D_{t,c})$, the affinity is measured from both 2D and 3D geometric correspondences:
$$A(T_{t'}, D_{t,c}) = \sum_{k=1}^{K} A_{2D}(\mathbf{x}^{k}_{t'',c}, \mathbf{x}^{k}_{t,c}) + A_{3D}(\mathbf{X}^{k}_{t'}, \mathbf{x}^{k}_{t,c})$$
where $\mathbf{x}^{k}_{t'',c}$ is the last matched joint $k$ of the target from camera $c$. For each type of human joint, the correspondence is computed independently; thus we omit the index $k$ in the following discussion for notational simplicity.
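The per-joint sum above can be sketched as follows. Here `a2d` and `a3d` are stand-ins for the $A_{2D}$ and $A_{3D}$ terms whose concrete forms follow, and the joint containers are hypothetical:

```python
def total_affinity(last_2d_joints, target_3d_joints, detected_joints, a2d, a3d):
    """Sum 2D and 3D correspondences over the K joint types (Equation 1).

    last_2d_joints:   per-joint last matched 2D positions of the target (x_{t'',c})
    target_3d_joints: per-joint 3D positions of the target (X_{t'})
    detected_joints:  per-joint detected 2D positions (x_{t,c})
    a2d, a3d:         callables computing the per-joint 2D and 3D terms
    """
    return sum(
        a2d(x_prev, x_det) + a3d(X_prev, x_det)
        for x_prev, X_prev, x_det in zip(last_2d_joints, target_3d_joints, detected_joints)
    )
```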
As shown in Figure 2a, the 2D correspondence is computed based on the distance between the detected joint $\mathbf{x}_{t,c}$ and the previously retained joint $\mathbf{x}_{t'',c}$ in camera coordinates:
$$
A_{2D}(\mathbf{x}_{t'',c}, \mathbf{x}_{t,c}) = w_{2D} \left(1 - \frac{\|\mathbf{x}_{t,c} - \mathbf{x}_{t'',c}\|}{\alpha_{2D}(t - t'')}\right) \cdot e^{-\lambda_a (t - t'')}
$$
come from the same person, and vice versa. The magnitude represents the confidence of the indication, which decreases exponentially as the time interval increases.
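A minimal sketch of this 2D term follows; the hyperparameter values for $w_{2D}$, $\alpha_{2D}$, and $\lambda_a$ are illustrative defaults, not the paper's actual settings:

```python
import math

def affinity_2d(x_prev, x_det, dt, w_2d=1.0, alpha_2d=100.0, lam_a=0.5):
    """2D correspondence: pixel distance between the detection x_det at time t
    and the last matched joint x_prev at time t'' (dt = t - t''), normalized by
    the velocity bound alpha_2d * dt and damped exponentially in dt."""
    dist = math.hypot(x_det[0] - x_prev[0], x_det[1] - x_prev[1])
    return w_2d * (1.0 - dist / (alpha_2d * dt)) * math.exp(-lam_a * dt)

# a nearby detection one time-step later yields a high (positive) affinity
score = affinity_2d((120.0, 80.0), (124.0, 83.0), dt=1.0)
```

A detection farther away than the bound $\alpha_{2D}(t - t'')$ makes the parenthesized factor negative, signalling that the pair is unlikely to be the same person.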
The 2D correspondence is the most basic affinity measurement, and it is the one exploited by single-view tracking methods. In order to track people across views, a 3D correspondence is introduced, as illustrated in Figure 2b. We suppose that the cameras are well calibrated and the projection matrix of camera $c$ is provided as $\mathbf{P}_c \in \mathbb{R}^{3 \times 4}$. We first back-project the detected 2D point $\mathbf{x}_{t,c}$ into 3-space as a ray:
$$
\tilde{\mathbf{X}}_t(\mu; \mathbf{x}_{t,c}) = \mathbf{P}_c^+ \tilde{\mathbf{x}}_{t,c} + \mu \tilde{\mathbf{X}}_c
$$
where $\mathbf{P}_c^+ \in \mathbb{R}^{4 \times 3}$ is the pseudo-inverse of $\mathbf{P}_c$ and $\mathbf{X}_c$ is the 3D location of the camera center. A superscript tilde denotes the corresponding homogeneous coordinates. The 3D correspondence is then defined as:
$$
A_{3D}(\mathbf{X}_{t'}, \mathbf{x}_{t,c}) = w_{3D} \left( 1 - \frac{d_t(\hat{\mathbf{X}}_t, \mathbf{X}_t(\mu))}{\alpha_{3D}} \right) \cdot e^{-\lambda_a (t - t')}
$$
$$
\hat{\mathbf{X}}_t = \mathbf{X}_{t'} + \mathbf{V}_{t'} \cdot (t - t')
$$
where $t \geq t'$ and $\mathbf{V}_{t'}$ is the 3D velocity estimated via a linear least-squares method.
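The 3D side of the measurement (ray back-projection, point-to-ray distance, and least-squares velocity) can be sketched as below. The toy camera matrix and parameter values are illustrative, and degenerate configurations (e.g. a back-projected point at infinity) are deliberately not handled:

```python
import numpy as np

def backproject_ray(P, x, cam_center):
    """Back-project pixel x through camera P (3x4) into a ray through the
    camera center; returns (origin, unit direction). P^+ applied to the
    homogeneous pixel gives one (here finite) point on the ray."""
    X_h = np.linalg.pinv(P) @ np.array([x[0], x[1], 1.0])
    point = X_h[:3] / X_h[3]
    d = point - cam_center
    return cam_center, d / np.linalg.norm(d)

def point_ray_distance(X, origin, direction):
    """Perpendicular distance from 3D point X to a ray with unit direction."""
    v = np.asarray(X, float) - origin
    return np.linalg.norm(v - (v @ direction) * direction)

def estimate_velocity(times, positions):
    """Per-axis linear least-squares fit X(t) ~ X0 + V * t; returns V."""
    A = np.stack([np.ones_like(times), times], axis=1)
    coef, *_ = np.linalg.lstsq(A, positions, rcond=None)
    return coef[1]

# toy camera: identity rotation, camera center at (0, 0, -1)
P = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 1.0]])
C = np.array([0.0, 0.0, -1.0])
origin, direction = backproject_ray(P, (0.0, 0.0), C)  # ray along +z
```

The perpendicular distance between the predicted joint $\hat{\mathbf{X}}_t$ and this ray is what the $d_t$ term above penalizes.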
Here, for the purpose of verifying the iterative processing strategy, we employ only geometric consistency in the affinity measurement for simplicity. This baseline formulation already achieves state-of-the-art performance for both human body association and 3D pose estimation, as we demonstrate in the experiments. The key contribution comes from Equation 4, where we match the detected 2D joints with targets directly in 3-space.
Compared with matching in pairs of views in the camera coordinates [13], our formulation has three advantages:

1. matching in 3-space is robust to partial occlusion and inaccurate 2D localization, as the 3D pose combines the information from multiple views;
2. motion estimation in 3-space is more feasible and reliable than in 2D camera coordinates;
3. the computational cost is significantly reduced, since only one comparison is required in 3-space for each pair of target and detection. To verify this, a quantitative comparison is conducted in the ablation study.
**Target update and initialization.** With the affinity measurement above, this section describes how we update and initialize targets in a particular iteration. First, we compute the affinity matrix between targets and detections using Equation 1 and solve the association problem via bipartite graph matching. Based on the association results, each detection is either assigned to a target or labeled as unmatched. In the former case, the 3D pose of the target is updated gradually with the new detection as 2D information is observed over time. Thus, 3D pose reconstruction in our framework is an incremental process, as detailed in Section 3.3.
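The association step can be sketched with a simple greedy matcher over the affinity matrix. The paper solves an optimal bipartite assignment; the greedy pass below is only a simplified stand-in that makes the matched/unmatched bookkeeping concrete:

```python
def associate(affinity, threshold=0.0):
    """Greedily pair targets (rows) with detections (columns): repeatedly pick
    the highest remaining affinity above `threshold`. Returns matched
    (target, detection) index pairs and the unmatched detection indices."""
    n_t = len(affinity)
    n_d = len(affinity[0]) if n_t else 0
    matches, used_t, used_d = [], set(), set()
    while True:
        best, bi, bj = threshold, -1, -1
        for i in range(n_t):
            if i in used_t:
                continue
            for j in range(n_d):
                if j in used_d:
                    continue
                if affinity[i][j] > best:
                    best, bi, bj = affinity[i][j], i, j
        if bi < 0:  # nothing left above the threshold
            break
        matches.append((bi, bj))
        used_t.add(bi)
        used_d.add(bj)
    unmatched = [j for j in range(n_d) if j not in used_d]
    return matches, unmatched

# detection 1 matches no target above the threshold, so it stays unmatched
# and may seed a new target
matches, unmatched = associate([[0.9, -0.4], [0.2, -0.1]])
```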