Homework 01: Survey & Design of a 3D Multi-view Human Pose Estimation System
Survey existing multi-view single-person and multi-view multi-person human pose estimation systems/pipelines.
Reference Systems
EasyMocap
From the Zhejiang University 3D vision group (zju3dv).
- Github: zju3dv/EasyMocap
- IK fitting: easymocap/multistage/fitting.py
- Pose prior: easymocap/multistage/gmm.py
- Tracking: Fast and Robust Multi-Person 3D Pose Estimation from Multiple Views
- Implementation: easymocap/assignment/track.py; note that the association itself is implemented in C++ (matchSVT, i.e. matchSVT.hpp). A loose sketch of the SVT matching idea follows this list.
- Demo page: mvmp.md (or "04 multiperson")
- Visualization: Open3D (see apps/vis/vis_client.py, a rather plain, naive TCP socket client-server architecture)
- See also: doc/realtime_visualization.md
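The C++ matchSVT referenced above implements the multi-way matching of the tracking paper. Below is a loose Python sketch of the core idea only (low-rank matching via singular value thresholding); the ADMM splitting, the constants, and the `views` bookkeeping here are illustrative assumptions, not EasyMocap's actual implementation:

```python
import numpy as np

def match_svt(affinity, views, lam=0.5, mu=64.0, n_iters=50):
    """Loose sketch of SVT-based multi-way matching (Dong et al., CVPR 2019):
    find a low-rank, cycle-consistent binary matrix close to the affinity.
    affinity: (M, M) symmetric affinity between all 2D detections
    views:    list of slices, the detections belonging to each camera"""
    X = affinity.copy()
    Y = np.zeros_like(X)                      # scaled dual variable (ADMM)
    for _ in range(n_iters):
        # SVT step: soft-threshold singular values -> low rank,
        # which is what encodes cycle consistency across views
        U, s, Vt = np.linalg.svd(X + Y, full_matrices=False)
        Q = U @ np.diag(np.maximum(s - lam / mu, 0.0)) @ Vt
        # Projection step: stay close to the affinity, clip to [0, 1],
        # and forbid matches between detections in the same view
        X = np.clip((affinity + mu * (Q - Y)) / (1.0 + mu), 0.0, 1.0)
        for g in views:
            X[g, g] = 0.0
        np.fill_diagonal(X, 1.0)
        Y = Y + X - Q                         # dual update
    return (X + X.T) / 2.0 > 0.5              # symmetrize, then binarize
```

The low-rank (SVT) step is what enforces cycle consistency across three or more views, which independent pairwise Hungarian matching cannot guarantee.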
Note
It has its own obvious problems; finding them is left as an exercise for the students.
FreeMocap
Free Motion Capture for Everyone
- Github: freemocap/freemocap
- YouTube: @jonmatthis, apparently a university lecturer who shares his lectures online.
A personal project, modest in scale.
The OpenSim Family
Stanford University projects; their architectural chops are probably stronger than mine, but their goals differ from what I am trying to achieve (where exactly does the difference lie?).
Pose2Sim
Pose2Sim stands for "OpenPose to OpenSim", as it originally used OpenPose inputs (2D keypoints coordinates) and led to an OpenSim result (full-body 3D joint angles).
- Camera synchronization
- Multi-person identification
- Robust triangulation (a minimal weighted-DLT sketch follows this list)
- 3D coordinates filtering
- Marker augmentation
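As a concrete point of reference for the robust-triangulation step, here is a minimal confidence-weighted DLT sketch; this is the textbook linear triangulation under assumed inputs, not Pose2Sim's actual code (its outlier rejection over view subsets is omitted):

```python
import numpy as np

def triangulate_dlt(points_2d, confidences, proj_mats):
    """Confidence-weighted linear triangulation of one joint.
    points_2d:   (V, 2) pixel coordinates, one row per view
    confidences: (V,)   2D detector confidences, used as row weights
    proj_mats:   (V, 3, 4) projection matrices K[R|t]"""
    rows = []
    for (u, v), w, P in zip(points_2d, confidences, proj_mats):
        # Each view contributes two rows of the homogeneous system A X = 0
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    # The solution is the right singular vector of the smallest singular value
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]
```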
OpenCap
Seems to attempt something similar to Pose2Sim, but ships a pipeline that works with just two iPhones; it leans more commercial.
Note
My guess is that OpenCap chose the iPhone because ARKit has self-calibration capability (just a guess).
Paper: OpenCap: Human movement dynamics from smartphone videos
Sports2D
Similar in scope to Splyza Motion.
Single-view.
Important
Note that the motion must lie in the sagittal or frontal plane.
My guess is that depth is estimated only once (at the start); in other words, motion toward or away from the camera along the line of sight cannot be estimated.
Note
Note that it has to be a single camera; otherwise there are no camera extrinsics, i.e. you don't know where the origin is. A single-camera, SMPLify-like implementation only needs to estimate depth (a sketch follows).
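A minimal sketch of that similar-triangles depth initialization, in the spirit of SMPLify's camera init; the keypoint names, the 0.5 m torso prior, and a focal length known in pixels are all assumptions:

```python
import numpy as np

def init_depth(kpts_2d, focal_px, torso_len_m=0.5):
    """Estimate camera distance from one frame via similar triangles:
    length_px / focal_px = length_m / depth_m  (pinhole model).
    kpts_2d: dict with hypothetical 'shoulder_mid' / 'hip_mid' pixel coords."""
    torso_px = np.linalg.norm(
        np.asarray(kpts_2d["shoulder_mid"]) - np.asarray(kpts_2d["hip_mid"])
    )
    return focal_px * torso_len_m / torso_px
```

If depth is estimated only once like this, subsequent motion along the optical axis goes unobserved, consistent with the sagittal/frontal-plane restriction above.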
Important
The devil is in the details. Finishing the architecture diagram does not mean you are done: the real problems all hide in the connecting lines.
For every data flow in the system, you have to spell out:
- Where the data comes from, where it goes, and which stages it passes through in between (don't just draw one big box).
- What exactly gets transmitted: images? Compressed video? Tensors? Keypoints? Byte streams? How are the fields defined?
- How it gets transmitted: sockets, shared memory, files, a message queue? Why that choice? How do you handle latency/throughput/backpressure?
- How fast: a fixed rate in Hz, or out-of-order with jitter? Who assigns the timestamps, and against which time base? How do you align multiple cameras?
- Roughly how big: write out the bandwidth estimate, because it in turn decides whether a given protocol is even usable (see the back-of-envelope sketch after this list).
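A back-of-envelope bandwidth sketch for the setup above, assuming 4 cameras at 1080p/30 fps and 133 whole-body keypoints per view (all numbers illustrative):

```python
# Back-of-envelope bandwidth estimate: 4 cameras, 1080p @ 30 fps,
# 133 whole-body keypoints per view. All numbers are illustrative.
N_CAMS, FPS = 4, 30

raw_rgb = N_CAMS * FPS * 1920 * 1080 * 3      # uncompressed RGB, bytes/s
h264 = N_CAMS * 8e6 / 8                       # ~8 Mbit/s per 1080p stream
keypoints = N_CAMS * FPS * 133 * 3 * 4        # (x, y, conf) as float32

for name, bps in [("raw RGB", raw_rgb), ("H.264", h264), ("keypoints", keypoints)]:
    print(f"{name:>9}: {bps / 1e6:8.2f} MB/s")
# raw RGB  :   746.50 MB/s  -> shared memory, capture-host only
# H.264    :     4.00 MB/s  -> fine over gigabit Ethernet
# keypoints:     0.19 MB/s  -> trivial for any socket or message queue
```

The three-plus orders of magnitude between raw frames and keypoints are exactly what decides where 2D detection has to run and what is allowed to cross the network.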
Also, don't forget: the system necessarily has sources and sinks; you still have to fill in every segment in between, and the finer-grained the better.
One step further up: are you a single process, multiple processes on one machine, or inherently distributed? (The answer is usually obvious.) When you later move to edge computing, what can be pushed down to the edge, and what must stay at the center?
One last reminder: the presentation layer is usually the endpoint, but the workload of 3D rendering/interaction is often equivalent to building half a (game) engine; don't count on "sketching it real quick".
Note
Note: you may use an LLM to assist with the survey and the writing, but you must be able to point out the evidence, assumptions, and uncertainties behind its conclusions; in other words, you must demonstrate that you can spot answers that are "plausible-looking but actually wrong".
Reference Answer
An AI-generated architecture diagram; it does not represent my actual thinking at this moment, which changes every moment.
Can you fill in the details of every single edge? (Re-reading this while writing, I did find quite a few flaws in it, but I have no intention of fixing them for now.)
```mermaid
flowchart TD
%% =========================
%% Multi-view 2D stage
%% =========================
subgraph VIEWS["Per-view input (cameras 1..N)"]
direction LR
C1["Cam 1\n2D detections: 133×2 (+conf)"] --> T1["2D latest tracking cache\n(view 1)"]
C2["Cam 2\n2D detections: 133×2 (+conf)"] --> T2["2D latest tracking cache\n(view 2)"]
C3["Cam 3\n2D detections: 133×2 (+conf)"] --> T3["2D latest tracking cache\n(view 3)"]
C4["Cam 4\n2D detections: 133×2 (+conf)"] --> T4["2D latest tracking cache\n(view 4)"]
end
%% =========================
%% Cross-view association
%% =========================
subgraph ASSOC["Cross-view data association (epipolar)"]
direction TB
EPI["Epipolar constraint\n(Sampson / point-to-epiline)"]:::core
CYCLE["Cycle consistency / view-graph pruning"]:::core
GROUP["Assemble per-target multi-view observation set\n{view_id → 133×2}"]:::core
EPI --> CYCLE --> GROUP
end
T1 --> EPI
T2 --> EPI
T3 --> EPI
T4 --> EPI
%% =========================
%% Geometry / lifting
%% =========================
subgraph GEOM["3D measurement construction"]
direction TB
RT["Camera models\nK, [R|t], SO(3)/SE(3)"]:::meta
DLT["DLT / triangulation (init)"]:::core
NN["Optional NN lifting / completion"]:::core
BA["Optional reprojection refinement\n(1–5 iters)"]:::core
Y["3D measurement y(t)\nJ×3 positions (+quality / R / cov)"]:::out
RT --> DLT --> NN --> BA --> Y
end
GROUP --> DLT
%% =========================
%% Tracking filter + lifecycle
%% =========================
subgraph FILTER["Tracking filter (per target)"]
direction TB
GATE["Gating\n(Mahalanobis / per-joint + global)"]:::core
IMM["IMM (motion model bank)\n(CV/CA or low/med/high Q)"]:::core
PRED["Predict\nΔt, self-propagate"]:::core
UPD["Update\nKF (linear)\nstate: [p(3J), v(3J)]"]:::core
MISS["Miss handling & track lifecycle\n(tentative → confirmed → deleted)"]:::meta
GATE --> IMM --> PRED --> UPD --> MISS
end
Y --> GATE
%% Optional inertial fusion
IMU["IMU (optional)"]:::meta --> INERT["EKF/UKF branch (optional)\nwhen augmenting state with orientation"]:::meta --> IMM
%% =========================
%% IK + optional feedback
%% =========================
subgraph IKSTAGE["IK stage (constraint / anatomy)"]
direction TB
IK["IK optimization target\n(minimize joint position error,\nadd bone length / joint limits)"]:::core
FB["Optional feedback to filter\npseudo-measurement z_IK with large R"]:::meta
IK --> FB
end
UPD --> IK
FB -.-> GATE
%% =========================
%% SMPL / mesh fitting
%% =========================
subgraph SMPLSTAGE["SMPL / SMPL-X fitting"]
direction TB
VP["VPoser / pose prior"]:::core
SMPL["SMPL(θ, β, root)\nfit to joints / reprojection"]:::core
JR["JR: Joint Regressor\nmesh → joints (loop closure)"]:::core
OUT["Outputs\nmesh + joints + pose params"]:::out
VP --> SMPL --> JR --> OUT
JR -. residual / reproject .-> SMPL
end
IK --> SMPL
classDef core fill:#0b1020,stroke:#5eead4,color:#e5e7eb,stroke-width:1.2px;
classDef meta fill:#111827,stroke:#93c5fd,color:#e5e7eb,stroke-dasharray: 4 3;
classDef out fill:#052e2b,stroke:#34d399,color:#ecfeff,stroke-width:1.4px;
```
If the diagram does not render correctly, see fig/fig.svg.
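For the cross-view association stage, here is a minimal sketch of the Sampson distance (a first-order point-to-epipolar-line error) used to score candidate correspondences; the fundamental matrix F between each camera pair is assumed to be precomputed from calibration:

```python
import numpy as np

def sampson_distance(x1, x2, F):
    """First-order geometric error of candidate correspondences (x1[i], x2[i])
    under the fundamental matrix F; small values support a match.
    x1, x2: (N, 2) pixel coordinates in view 1 and view 2; F: (3, 3)."""
    x1h = np.hstack([x1, np.ones((len(x1), 1))])   # homogeneous coordinates
    x2h = np.hstack([x2, np.ones((len(x2), 1))])
    Fx1 = x1h @ F.T                                 # epipolar lines in view 2
    Ftx2 = x2h @ F                                  # epipolar lines in view 1
    num = np.einsum("ij,ij->i", x2h, Fx1) ** 2      # (x2^T F x1)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den
```

Per-person affinities can then be formed by, say, averaging exp(-d/σ) over joints before the cycle-consistency step.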
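And for the tracking-filter stage, a per-joint constant-velocity Kalman filter with a chi-squared Mahalanobis gate; the state layout matches the diagram ([p, v] per joint), while the noise magnitudes are placeholders. An IMM, as in the diagram, would run a bank of these with different Q:

```python
import numpy as np

class JointKF:
    """Constant-velocity Kalman filter for one 3D joint: state [p(3), v(3)]."""
    def __init__(self, p0, q=1e-2, r=1e-3):
        self.x = np.concatenate([p0, np.zeros(3)])
        self.P = np.eye(6)
        self.Q, self.R = q * np.eye(6), r * np.eye(3)
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # observe position only

    def predict(self, dt):
        F = np.eye(6)
        F[:3, 3:] = dt * np.eye(3)      # p' = p + dt * v
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.Q

    def update(self, z, gate=9.0):
        S = self.H @ self.P @ self.H.T + self.R   # innovation covariance
        nu = z - self.H @ self.x                  # innovation
        d2 = nu @ np.linalg.solve(S, nu)          # squared Mahalanobis distance
        if d2 > gate:                             # ~chi2(3) gate: reject outlier
            return False
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + K @ nu
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return True
```

A rejected update feeds the miss-handling / track-lifecycle logic (tentative → confirmed → deleted) in the diagram.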