# Homework 01: Survey & Design of 3D Multi-view Human Pose Estimation System

Survey existing multi-view single-person and multi-view multi-person human pose estimation systems/pipelines.

## Reference Systems

### [EasyMocap](https://chingswy.github.io/easymocap-public-doc/)

Produced by a team in China (Zhejiang University?).

- Github: [zju3dv/EasyMocap](https://github.com/zju3dv/EasyMocap)
- IK fitting: [easymocap/multistage/fitting.py](https://github.com/zju3dv/EasyMocap/blob/master/easymocap/multistage/fitting.py)
- Pose prior: [easymocap/multistage/gmm.py](https://github.com/zju3dv/EasyMocap/blob/master/easymocap/multistage/gmm.py)
- Tracking: [Fast and Robust Multi-Person 3D Pose Estimation from Multiple Views](https://arxiv.org/abs/1901.04111)
  - implementation: [easymocap/assignment/track.py](https://github.com/zju3dv/EasyMocap/blob/master/easymocap/assignment/track.py#L199); note that the assignment step is implemented in C++ (matchSVT, see [matchSVT.hpp](https://github.com/zju3dv/EasyMocap/blob/master/library/pymatch/include/matchSVT.hpp))
  - demo page: [mvmp.md](https://github.com/zju3dv/EasyMocap/blob/master/doc/mvmp.md) or [04 multiperson](https://chingswy.github.io/easymocap-public-doc/develop/04_multiperson.html)
- Visualization: [Open3D](https://www.open3d.org/) (see [apps/vis/vis_client.py](https://github.com/zju3dv/EasyMocap/blob/master/apps/vis/vis_client.py), a very plain, naive TCP socket client-server architecture)
  - see also [doc/realtime_visualization.md](https://github.com/zju3dv/EasyMocap/blob/master/doc/realtime_visualization.md)

> [!NOTE]
> It has its own obvious problems, too; finding out where they are can serve as an exercise for the students.

### [FreeMocap](https://freemocap.org)

> Free Motion Capture for Everyone

- Github: [freemocap/freemocap](https://github.com/freemocap/freemocap)
- YouTube: [@jonmatthis](https://www.youtube.com/@jonmatthis) (the author seems to be a university lecturer and shares lectures there)

A personal project, not large in scale.

### The [OpenSim](https://opensim.stanford.edu) family

A Stanford project; their architectural capability is presumably stronger than mine, but the goals differ from what I am trying to achieve (where exactly does the difference lie?).

#### [Pose2Sim](https://github.com/perfanalytics/pose2sim)

> Pose2Sim stands for "OpenPose to OpenSim", as it originally used OpenPose inputs (2D keypoints coordinates) and led to an OpenSim result (full-body 3D joint angles).

- Camera synchronization
- Multi-person identification
- Robust triangulation
- 3D coordinates filtering
- Marker augmentation

#### [OpenCap](https://www.opencap.ai)

Seems to attempt something similar to [Pose2Sim](https://github.com/perfanalytics/pose2sim), but the pipeline it offers can be realized with just two iPhones, so it leans more commercial.

> [!NOTE]
> My guess is that OpenCap chose the iPhone because ARKit has self-calibration capability (just a guess).

Paper: [OpenCap: Human movement dynamics from smartphone videos](https://doi.org/10.1371/journal.pcbi.1011462)

#### [Sports2D](https://github.com/davidpagnon/Sports2D)

Does something similar to [Splyza Motion](https://motion.products.splyza.com).

Single view.

> [!IMPORTANT]
> Note that the motion must lie in the sagittal or frontal plane.
>
> My guess is that depth is estimated only once (at the starting moment); in other words, motion toward or away from the camera along the viewing direction cannot be estimated.

> [!NOTE]
> Note that this only works with a single camera; otherwise there are no camera extrinsics, or put another way, nobody knows where the origin is. Single-camera implementations along the lines of [SMPLify](https://smplify.is.tue.mpg.de) only need to estimate depth.

## Caveats

The devil is in the details. Finishing the architecture diagram does not mean you are done: the real problems all hide in the connecting lines.

For every data flow in the system, you have to spell out:

- Where the data comes from, where it goes, and which stages it passes through in between (don't just draw one "big box").
- What exactly is transmitted: images? compressed video? tensors? keypoints? byte streams? How are the fields defined?
- How it is transmitted: sockets, shared memory, files, a message queue? What is your reason for choosing it? How are latency, throughput, and backpressure handled?
- How fast it is transmitted: at a fixed rate in Hz, or out of order with jitter? Who assigns the timestamps, and against which time base? How do you align the multiple cameras?
- Roughly how big it is: write out the bandwidth estimate, because it in turn decides whether you can use a given protocol at all (a back-of-the-envelope sketch follows this list).

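To make the last two bullets concrete, here is a minimal sketch of what a spelled-out answer could look like: a hypothetical wire format for one view's keypoint stream (every field name is illustrative, not taken from any of the surveyed systems) and the bandwidth arithmetic it implies.

```python
import struct
import time

# Hypothetical wire format for one view's 2D detections (illustrative only).
# Header: view_id (u16), frame_id (u32), t_capture_ns (u64, camera clock),
#         t_send_ns (u64, host monotonic clock), n_persons (u16).
HEADER = struct.Struct("<HIQQH")
N_KEYPOINTS = 133                               # COCO-WholeBody convention
PERSON = struct.Struct(f"<{N_KEYPOINTS * 3}f")  # x, y, conf per keypoint

def pack_frame(view_id: int, frame_id: int, t_capture_ns: int,
               persons: list[list[float]]) -> bytes:
    """Serialize one frame of 2D detections into a flat byte string."""
    buf = HEADER.pack(view_id, frame_id, t_capture_ns,
                      time.monotonic_ns(), len(persons))
    for kpts in persons:                        # 133 * (x, y, conf) floats
        buf += PERSON.pack(*kpts)
    return buf

# Back-of-the-envelope bandwidth: keypoints are cheap, raw video is not.
kpt_rate = (HEADER.size + PERSON.size) * 30 * 4  # 1 person, 30 fps, 4 views
raw_rate = 1920 * 1080 * 3 * 30 * 4              # raw RGB, same setup
print(f"keypoints: {kpt_rate / 1e6:.2f} MB/s")   # ~0.19 MB/s
print(f"raw video: {raw_rate / 1e9:.2f} GB/s")   # ~0.75 GB/s
```

The three-orders-of-magnitude gap between those two numbers is exactly the kind of estimate that decides whether 2D detection runs next to the camera or at the center.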

Also, don't forget: every system necessarily has a source and a sink; you still have to fill in each segment in between, and the finer the detail the better.

One step further up: is your system a single process, multiple processes on one machine, or inherently distributed? (The answer is often obvious.) When you eventually move to edge computing, what can be pushed down to the edge, and what must remain in the center?

One last reminder: the presentation layer is usually the endpoint, but the workload of 3D rendering/interaction is often equivalent to building half a (game) engine. Don't count on "just drawing it real quick".

> [!NOTE]
> Note: you may use LLMs to assist with the survey and the write-up, but you must be able to point out the evidence, assumptions, and uncertainties behind their conclusions; in other words, you must prove that you can recognize answers that "look plausible but do not actually hold".

## Reference Answer

An AI-generated architecture diagram; it does not represent my actual thinking at this moment, which changes every moment.

Can you fill in the details of every connecting line? (Reviewing it while writing this, I did find quite a few flaws in it, which I have no intention of fixing for now.)

```mermaid
flowchart TD
  %% =========================
  %% Multi-view 2D stage
  %% =========================
  subgraph VIEWS["Per-view input (cameras 1..N)"]
    direction LR
    C1["Cam 1\n2D detections: 133×2 (+conf)"] --> T1["2D latest tracking cache\n(view 1)"]
    C2["Cam 2\n2D detections: 133×2 (+conf)"] --> T2["2D latest tracking cache\n(view 2)"]
    C3["Cam 3\n2D detections: 133×2 (+conf)"] --> T3["2D latest tracking cache\n(view 3)"]
    C4["Cam 4\n2D detections: 133×2 (+conf)"] --> T4["2D latest tracking cache\n(view 4)"]
  end

  %% =========================
  %% Cross-view association
  %% =========================
  subgraph ASSOC["Cross-view data association (epipolar)"]
    direction TB
    EPI["Epipolar constraint\n(Sampson / point-to-epiline)"]:::core
    CYCLE["Cycle consistency / view-graph pruning"]:::core
    GROUP["Assemble per-target multi-view observation set\n{view_id → 133×2}"]:::core
    EPI --> CYCLE --> GROUP
  end

  T1 --> EPI
  T2 --> EPI
  T3 --> EPI
  T4 --> EPI

  %% =========================
  %% Geometry / lifting
  %% =========================
  subgraph GEOM["3D measurement construction"]
    direction TB
    RT["Camera models\nK, [R|t], SO(3)/SE(3)"]:::meta
    DLT["DLT / triangulation (init)"]:::core
    NN["Optional NN lifting / completion"]:::core
    BA["Optional reprojection refinement\n(1–5 iters)"]:::core
    Y["3D measurement y(t)\nJ×3 positions (+quality / R / cov)"]:::out
    RT --> DLT --> NN --> BA --> Y
  end

  GROUP --> DLT

  %% =========================
  %% Tracking filter + lifecycle
  %% =========================
  subgraph FILTER["Tracking filter (per target)"]
    direction TB
    GATE["Gating\n(Mahalanobis / per-joint + global)"]:::core
    IMM["IMM (motion model bank)\n(CV/CA or low/med/high Q)"]:::core
    PRED["Predict\nΔt, self-propagate"]:::core
    UPD["Update\nKF (linear)\nstate: [p(3J), v(3J)]"]:::core
    MISS["Miss handling & track lifecycle\n(tentative → confirmed → deleted)"]:::meta

    GATE --> IMM --> PRED --> UPD --> MISS
  end

  Y --> GATE

  %% Optional inertial fusion
  IMU["IMU (optional)"]:::meta --> INERT["EKF/UKF branch (optional)\nwhen augmenting state with orientation"]:::meta --> IMM

  %% =========================
  %% IK + optional feedback
  %% =========================
  subgraph IKSTAGE["IK stage (constraint / anatomy)"]
    direction TB
    IK["IK optimization target\n(minimize joint position error,\nadd bone length / joint limits)"]:::core
    FB["Optional feedback to filter\npseudo-measurement z_IK with large R"]:::meta
    IK --> FB
  end

  UPD --> IK
  FB -.-> GATE

  %% =========================
  %% SMPL / mesh fitting
  %% =========================
  subgraph SMPLSTAGE["SMPL / SMPL-X fitting"]
    direction TB
    VP["VPoser / pose prior"]:::core
    SMPL["SMPL(θ, β, root)\nfit to joints / reprojection"]:::core
    JR["JR: Joint Regressor\nmesh → joints (loop closure)"]:::core
    OUT["Outputs\nmesh + joints + pose params"]:::out
    VP --> SMPL --> JR --> OUT
    JR -. residual / reproject .-> SMPL
  end

  IK --> SMPL

  classDef core fill:#0b1020,stroke:#5eead4,color:#e5e7eb,stroke-width:1.2px;
  classDef meta fill:#111827,stroke:#93c5fd,color:#e5e7eb,stroke-dasharray: 4 3;
  classDef out fill:#052e2b,stroke:#34d399,color:#ecfeff,stroke-width:1.4px;
```

If the diagram does not render correctly, see [fig/fig.svg](fig/fig.svg)
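
The "Epipolar constraint (Sampson / point-to-epiline)" node above, sketched generically (this is textbook geometry, not any surveyed system's code; `F` is assumed to be the fundamental matrix of a calibrated view pair):

```python
import numpy as np

def sampson_distance(F: np.ndarray, x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """First-order geometric error of correspondences under F (3x3).

    x1, x2: (N, 2) pixel coordinates in view 1 / view 2.
    Returns (N,) Sampson distances; small values = epipolar-consistent.
    """
    ones = np.ones((len(x1), 1))
    p1 = np.hstack([x1, ones])                 # homogeneous points, view 1
    p2 = np.hstack([x2, ones])                 # homogeneous points, view 2
    Fp1 = p1 @ F.T                             # epipolar lines in view 2
    Ftp2 = p2 @ F                              # epipolar lines in view 1
    num = np.einsum("ij,ij->i", p2, Fp1) ** 2  # (x2^T F x1)^2
    den = Fp1[:, 0]**2 + Fp1[:, 1]**2 + Ftp2[:, 0]**2 + Ftp2[:, 1]**2
    return num / den
```

Averaging this over a few reliable keypoints and thresholding gives the pairwise affinities that feed the cycle-consistency pruning step.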
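
The "DLT / triangulation (init)" node, likewise as a generic sketch (standard linear triangulation; `projections` are the K[R|t] matrices from the "Camera models" node, and the function name is hypothetical):

```python
import numpy as np

def triangulate_dlt(projections: list[np.ndarray],
                    points2d: list[np.ndarray]) -> np.ndarray:
    """Triangulate one 3D point from >= 2 views via DLT.

    projections: list of (3, 4) camera matrices P_i = K_i [R_i | t_i].
    points2d:    list of (2,) pixel observations, one per view.
    Returns the (3,) point minimizing the algebraic error ||A X||.
    """
    rows = []
    for P, (u, v) in zip(projections, points2d):
        # Each view contributes two linear constraints on homogeneous X.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                 # (2 * n_views, 4)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                         # right null vector of A
    return X[:3] / X[3]                # dehomogenize
```

In practice each row would be weighted by the 2D detection confidence, which is one way the "+conf" channel from the per-view stage propagates into the 3D measurement quality.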
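
And the tracking filter: a minimal constant-velocity Kalman filter for a single joint, with the Mahalanobis gate from the "Gating" node (the noise magnitudes `q`, `r` and the gate threshold are placeholders; per the diagram, a real system would run an IMM bank of several such filters):

```python
import numpy as np

class JointCVFilter:
    """Constant-velocity Kalman filter for one joint, state [p(3), v(3)]."""

    def __init__(self, p0: np.ndarray, q: float = 1.0, r: float = 1e-2):
        self.x = np.concatenate([p0, np.zeros(3)])         # state mean
        self.P = np.eye(6)                                 # state covariance
        self.q, self.r = q, r                              # noise magnitudes
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # observe p only

    def predict(self, dt: float) -> None:
        F = np.eye(6)
        F[:3, 3:] = dt * np.eye(3)                         # p += v * dt
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.q * dt * np.eye(6)

    def gate(self, y: np.ndarray, chi2_thresh: float = 7.81) -> bool:
        """Mahalanobis gate on the innovation (chi^2, 3 dof, ~95%)."""
        S = self.H @ self.P @ self.H.T + self.r * np.eye(3)
        nu = y - self.H @ self.x
        return float(nu @ np.linalg.solve(S, nu)) < chi2_thresh

    def update(self, y: np.ndarray) -> None:
        S = self.H @ self.P @ self.H.T + self.r * np.eye(3)
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.x = self.x + K @ (y - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
```

Measurements that fail the gate feed the miss counter behind the tentative → confirmed → deleted lifecycle rather than corrupting the state.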