ZED SDK Architecture: Streaming vs Fusion API

Overview

The ZED SDK provides two distinct APIs for transmitting camera data over a network:

  1. Streaming API (enableStreaming) - Video streaming
  2. Fusion API (startPublishing) - Metadata publishing

These serve fundamentally different use cases and have different compute/bandwidth tradeoffs.

API Comparison

| Feature | Streaming API | Fusion API |
|---------|---------------|------------|
| Primary Use Case | Remote camera access | Multi-camera data fusion |
| Data Transmitted | Compressed video (H264/H265) | Metadata only (bodies, objects, poses) |
| Bandwidth per Camera | 10-40 Mbps | <100 Kbps |
| Edge Compute | Video encoding only (NVENC) | Full depth NN + tracking + detection |
| Host Compute | Full depth NN + tracking + detection | Lightweight fusion only |
| Synchronization | None | Time-synced + geometric calibration |
| 360° Coverage | No | Yes (fuses overlapping views) |
| Receiver API | `zed.open()` with `INPUT_TYPE::STREAM` | `fusion.subscribe()` |

Architecture Diagrams

Streaming API (Single Camera Remote Access)

```
┌─────────────────────┐                      ┌─────────────────────┐
│   Edge (Jetson)     │                      │   Host (Server)     │
│                     │                      │                     │
│  ┌───────────────┐  │    H264/H265 RTP     │  ┌───────────────┐  │
│  │  ZED Camera   │  │   (10-40 Mbps)       │  │  Decode       │  │
│  └───────┬───────┘  │ ───────────────────► │  │  (NVDEC)      │  │
│          │          │                      │  └───────┬───────┘  │
│  ┌───────▼───────┐  │                      │          │          │
│  │  NVENC        │  │                      │  ┌───────▼───────┐  │
│  │  Encode       │──┘                      │  │  Neural Depth │  │
│  │  (hardware)   │                         │  │  (NN on GPU)  │  │
│  └───────────────┘                         │  └───────┬───────┘  │
│                                            │          │          │
│                                            │  ┌───────▼───────┐  │
│                                            │  │  Tracking /   │  │
│                                            │  │  Detection    │  │
│                                            │  └───────┬───────┘  │
│                                            │          │          │
│                                            │  ┌───────▼───────┐  │
│                                            │  │  Point Cloud  │  │
│                                            │  └───────────────┘  │
└─────────────────────┘                      └─────────────────────┘
```

Edge: Lightweight (encode only)
Host: Heavy (NN depth + all processing)

Fusion API (Multi-Camera 360° Coverage)

```
┌─────────────────────┐
│   Edge #1 (Jetson)  │
│  ┌───────────────┐  │
│  │  ZED Camera   │  │
│  └───────┬───────┘  │     Metadata Only
│  ┌───────▼───────┐  │     (bodies, poses)
│  │  Neural Depth │  │     (<100 Kbps)        ┌─────────────────────┐
│  │  (NN on GPU)  │  │ ──────────────────────►│                     │
│  └───────┬───────┘  │                        │   Fusion Server     │
│  ┌───────▼───────┐  │                        │                     │
│  │  Body Track   │──┘                        │  ┌───────────────┐  │
│  └───────────────┘                           │  │  Subscribe    │  │
└─────────────────────┘                        │  │  to all       │  │
                                               │  │  cameras      │  │
┌─────────────────────┐                        │  └───────┬───────┘  │
│   Edge #2 (Jetson)  │                        │          │          │
│  ┌───────────────┐  │     Metadata Only      │  ┌───────▼───────┐  │
│  │  ZED Camera   │  │ ──────────────────────►│  │  Time Sync    │  │
│  └───────┬───────┘  │                        │  │  + Geometric  │  │
│  ┌───────▼───────┐  │                        │  │  Calibration  │  │
│  │  Neural Depth │  │                        │  └───────┬───────┘  │
│  └───────┬───────┘  │                        │          │          │
│  ┌───────▼───────┐  │                        │  ┌───────▼───────┐  │
│  │  Body Track   │──┘                        │  │  360° Fusion  │  │
│  └───────────────┘                           │  │  (merge views)│  │
└─────────────────────┘                        │  └───────────────┘  │
                                               │                     │
┌─────────────────────┐                        │  Lightweight GPU    │
│   Edge #3 (Jetson)  │     Metadata Only      │  requirements       │
│       ...           │ ──────────────────────►│                     │
└─────────────────────┘                        └─────────────────────┘
```

Each Edge: Heavy (NN depth + tracking)
Fusion Server: Lightweight (data fusion only)

Communication Modes

Streaming API

| Mode | Description |
|------|-------------|
| H264 | AVC encoding, wider GPU support |
| H265 | HEVC encoding, better compression, requires Pascal+ GPU |

Port: Even number (default 30000), uses RTP protocol.
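If H265 is unavailable on the encoding GPU, enabling the stream fails with an error code, so a sender can fall back to H264. A minimal sketch of that policy (the fallback logic is ours, not something the SDK mandates; the types and enums are the SDK's, and the function assumes an already-opened `sl::Camera`):

```cpp
#include <sl/Camera.hpp>

// Prefer H265, fall back to H264 on GPUs without HEVC NVENC support.
sl::ERROR_CODE enableStreamingWithFallback(sl::Camera &zed) {
    sl::StreamingParameters sp;
    sp.port = 30000;                          // must be an even number (RTP)
    sp.codec = sl::STREAMING_CODEC::H265;
    sl::ERROR_CODE err = zed.enableStreaming(sp);
    if (err != sl::ERROR_CODE::SUCCESS) {
        sp.codec = sl::STREAMING_CODEC::H264; // wider GPU support
        err = zed.enableStreaming(sp);
    }
    return err;
}
```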

Fusion API

| Mode | Description |
|------|-------------|
| INTRA_PROCESS | Same machine, shared memory (zero-copy) |
| LOCAL_NETWORK | Different machines, RTP over network |

Port: Default 30000, configurable per camera.
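A sketch of choosing between the two modes on the publishing side (assuming the SDK's `setForSharedMemory()` and `setForLocalNetwork()` helpers; the port value is illustrative):

```cpp
#include <sl/Fusion.hpp>

// Build CommunicationParameters for either Fusion transport mode.
sl::CommunicationParameters makeCommParams(bool same_machine) {
    sl::CommunicationParameters comm;
    if (same_machine)
        comm.setForSharedMemory();      // INTRA_PROCESS: zero-copy shared memory
    else
        comm.setForLocalNetwork(30000); // LOCAL_NETWORK: RTP on this port
    return comm;
}
```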

Bandwidth Requirements

Streaming (H265 Compressed Video)

| Resolution | FPS | Bitrate per Camera | 4 Cameras |
|------------|-----|--------------------|-----------|
| 2K | 15 | 7 Mbps | 28 Mbps |
| HD1080 | 30 | 11 Mbps | 44 Mbps |
| HD720 | 60 | 6 Mbps | 24 Mbps |
| HD1200 | 30 | ~12 Mbps | ~48 Mbps |

Fusion (Metadata Only)

| Data Type | Size per Frame | @ 30 FPS | 4 Cameras |
|-----------|----------------|----------|-----------|
| Body (18 keypoints) | ~2 KB | ~60 KB/s | ~240 KB/s |
| Object detection | ~1 KB | ~30 KB/s | ~120 KB/s |
| Pose/Transform | ~100 B | ~3 KB/s | ~12 KB/s |

Fusion uses 100-1000x less bandwidth than Streaming.

The Architectural Gap

What You CAN Do

| Scenario | API | Edge Computes | Host Receives |
|----------|-----|---------------|---------------|
| Remote camera access | Streaming | Video encoding | Video → computes depth/tracking |
| Multi-camera fusion | Fusion | Depth + tracking | Metadata only (bodies, poses) |
| Local processing | Direct | Everything | N/A (same machine) |

What You CANNOT Do

There is no ZED SDK mode for:

```
┌─────────────────────┐                      ┌─────────────────────┐
│   Edge (Jetson)     │                      │   Host (Server)     │
│                     │                      │                     │
│  ┌───────────────┐  │     Depth Map /      │  ┌───────────────┐  │
│  │  ZED Camera   │  │     Point Cloud      │  │  Receive      │  │
│  └───────┬───────┘  │                      │  │  Depth/PC     │  │
│          │          │         ???          │  └───────┬───────┘  │
│  ┌───────▼───────┐  │ ─────────────────X─► │          │          │
│  │  Neural Depth │  │   NOT SUPPORTED      │  ┌───────▼───────┐  │
│  │  (NN on GPU)  │  │                      │  │  Further      │  │
│  └───────┬───────┘  │                      │  │  Processing   │  │
│          │          │                      │  └───────────────┘  │
│  ┌───────▼───────┐  │                      │                     │
│  │  Point Cloud  │──┘                      │                     │
│  └───────────────┘                         │                     │
└─────────────────────┘                      └─────────────────────┘
```

❌ Edge computes depth → streams depth map → Host receives depth
❌ Edge computes point cloud → streams point cloud → Host receives point cloud

Why This Architecture?

1. Bandwidth Economics

Point cloud streaming would require significantly more bandwidth than video:

| Data Type | Size per Frame (HD1080) | @ 30 FPS |
|-----------|-------------------------|----------|
| Raw stereo video | ~12 MB | 360 MB/s |
| H265 compressed | ~46 KB | 11 Mbps |
| Depth map (16-bit) | ~4 MB | 120 MB/s |
| Point cloud (XYZ float) | ~25 MB | 750 MB/s |

Compressed depth/point cloud is lossy and still large (~50-100 Mbps).
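These per-frame sizes follow from simple arithmetic. A quick back-of-envelope check (our own arithmetic in decimal megabytes, not SDK output):

```cpp
#include <cstdio>

int main() {
    constexpr double px  = 1920.0 * 1080.0; // HD1080 pixel count
    constexpr double fps = 30.0;
    constexpr double MB  = 1e6;             // decimal megabyte

    constexpr double stereo  = px * 2 * 3;  // two RGB8 views   -> ~12.4 MB
    constexpr double depth16 = px * 2;      // 16-bit depth map -> ~4.1 MB
    constexpr double cloud   = px * 3 * 4;  // XYZ float32      -> ~24.9 MB

    std::printf("stereo: %5.1f MB/frame, %4.0f MB/s\n", stereo / MB, stereo * fps / MB);
    std::printf("depth:  %5.1f MB/frame, %4.0f MB/s\n", depth16 / MB, depth16 * fps / MB);
    std::printf("cloud:  %5.1f MB/frame, %4.0f MB/s\n", cloud / MB, cloud * fps / MB);
}
```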

2. Compute Distribution Philosophy

The ZED SDK follows a simple rule: compute entirely at the edge OR entirely at the host, never split between the two.

| Scenario | Solution |
|----------|----------|
| Low bandwidth, multi-camera | Fusion (edge computes all, sends metadata) |
| High bandwidth, single camera | Streaming (host computes all) |
| Same machine | INTRA_PROCESS (shared memory) |

3. Fusion API Design Goals

From Stereolabs documentation:

"The Fusion module is lightweight (in computation resources requirements) compared to the requirements for camera publishers."

The Fusion receiver is intentionally lightweight because:

  • It only needs to fuse pre-computed metadata
  • It handles time synchronization and geometric calibration
  • It can run on modest hardware while edges do heavy compute

4. Product Strategy

Stereolabs sells:

  • ZED cameras (hardware)
  • ZED Box (edge compute appliances)
  • ZED Hub (cloud management)

The Fusion API encourages purchasing ZED Boxes for edge compute rather than building custom streaming solutions.

Workarounds for Custom Point Cloud Streaming

If you need to stream point clouds from edge to host (outside the ZED SDK):

Option 1: Custom Compression + Streaming

```cpp
// On edge: compute point cloud, compress, send
sl::Mat point_cloud;
zed.retrieveMeasure(point_cloud, sl::MEASURE::XYZRGBA);

// Compress with Draco / a PCL octree (draco_compress is a placeholder
// for your own wrapper around the compression library)
std::vector<uint8_t> compressed = draco_compress(point_cloud);

// Send via ZeroMQ/gRPC/raw UDP (socket is your own transport object)
socket.send(compressed);
```
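`draco_compress` and `socket` above are placeholders. One concrete transport is ZeroMQ via the cppzmq bindings; the PUB socket and port below are illustrative choices of ours, not anything the ZED SDK prescribes:

```cpp
#include <zmq.hpp>
#include <cstdint>
#include <vector>

// Publish one compressed frame per call; bind the socket once at startup.
void publishFrame(zmq::socket_t &pub, const std::vector<uint8_t> &compressed) {
    pub.send(zmq::buffer(compressed), zmq::send_flags::dontwait);
}

int main() {
    zmq::context_t ctx(1);
    zmq::socket_t pub(ctx, zmq::socket_type::pub);
    pub.bind("tcp://*:5555"); // illustrative port
    // ... grab / retrieveMeasure / compress as above, then per frame:
    // publishFrame(pub, compressed);
}
```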

Option 2: Depth Map Streaming

```cpp
// On edge: get depth, compress as 16-bit PNG, send
// (DEPTH_U16_MM yields 16-bit millimeter depth; plain DEPTH is 32-bit
// float, which PNG cannot store directly)
sl::Mat depth;
zed.retrieveMeasure(depth, sl::MEASURE::DEPTH_U16_MM);

// Compress as lossless PNG (slMat2cvMat is the sl::Mat -> cv::Mat
// helper from the ZED OpenCV samples)
cv::Mat depth_cv = slMat2cvMat(depth);
std::vector<uint8_t> png;
cv::imencode(".png", depth_cv, png);

// Send via network (socket is your own transport object)
socket.send(png);
```
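A matching host-side sketch for Option 2, again assuming ZeroMQ as the transport; decoding with `cv::IMREAD_UNCHANGED` preserves the 16-bit depth values:

```cpp
#include <opencv2/opencv.hpp>
#include <zmq.hpp>

int main() {
    // On host: receive PNG bytes, decode back to a 16-bit depth image.
    zmq::context_t ctx(1);
    zmq::socket_t sub(ctx, zmq::socket_type::sub);
    sub.connect("tcp://192.168.1.100:5555"); // illustrative edge address
    sub.set(zmq::sockopt::subscribe, "");    // no topic filter

    while (true) {
        zmq::message_t msg;
        if (!sub.recv(msg)) continue;
        cv::Mat buf(1, static_cast<int>(msg.size()), CV_8UC1, msg.data());
        cv::Mat depth_mm = cv::imdecode(buf, cv::IMREAD_UNCHANGED);
        // depth_mm is CV_16UC1, values in millimeters; process further here.
    }
}
```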

Bandwidth Estimate for Custom Streaming

| Method | Compression | Bandwidth (HD1080 @ 30 FPS) |
|--------|-------------|------------------------------|
| Depth PNG (lossless) | ~4:1 | ~240 Mbps |
| Depth JPEG (lossy) | ~20:1 | ~48 Mbps |
| Point cloud Draco | ~10:1 | ~600 Mbps |

10 Gbps Ethernet comfortably handles 4 cameras with custom depth streaming (lossless PNG totals roughly 1 Gbps).

Recommendations

| Use Case | Recommended API |
|----------|-----------------|
| Single camera, remote development | Streaming |
| Multi-camera body tracking | Fusion |
| Multi-camera 360° coverage | Fusion |
| Custom point cloud pipeline | Manual (ZeroMQ + Draco) |
| Low latency, same machine | INTRA_PROCESS |

Code Examples

Streaming Sender (Edge)

```cpp
sl::StreamingParameters stream_params;
stream_params.codec = sl::STREAMING_CODEC::H265;
stream_params.bitrate = 12000; // Kbits/s
stream_params.port = 30000;    // must be an even number

zed.enableStreaming(stream_params);

while (running) {
    zed.grab(); // encodes and sends the frame as a side effect
}
```

Streaming Receiver (Host)

```cpp
sl::InitParameters init_params;
init_params.input.setFromStream("192.168.1.100", 30000);

zed.open(init_params);

sl::Mat depth, point_cloud;
while (running) {
    if (zed.grab() == sl::ERROR_CODE::SUCCESS) {
        // Full ZED SDK available - depth, tracking, etc.
        zed.retrieveMeasure(depth, sl::MEASURE::DEPTH);
        zed.retrieveMeasure(point_cloud, sl::MEASURE::XYZRGBA);
    }
}
```

Fusion Sender (Edge)

```cpp
// Body tracking requires positional tracking to be enabled first
zed.enablePositionalTracking();

sl::BodyTrackingParameters body_params;
zed.enableBodyTracking(body_params);

// Start publishing metadata
sl::CommunicationParameters comm_params;
comm_params.setForLocalNetwork(30000);
zed.startPublishing(comm_params);

sl::Bodies bodies;
while (running) {
    if (zed.grab() == sl::ERROR_CODE::SUCCESS) {
        zed.retrieveBodies(bodies); // computes locally and publishes
    }
}
```

Fusion Receiver (Host)

```cpp
sl::Fusion fusion;
sl::InitFusionParameters init_fusion_params;
fusion.init(init_fusion_params);

// Subscribe to each publisher by serial number, transport, and mounting pose
sl::CameraIdentifier cam1(serial_number);
fusion.subscribe(cam1, comm_params, pose);

// Fusion-side body tracking must be enabled before retrieving fused bodies
sl::BodyTrackingFusionParameters body_fusion_params;
fusion.enableBodyTracking(body_fusion_params);

sl::Bodies fused_bodies;
while (running) {
    if (fusion.process() == sl::FUSION_ERROR_CODE::SUCCESS) {
        fusion.retrieveBodies(fused_bodies); // already computed by the edges
    }
}
```

Summary

The ZED SDK architecture forces a choice:

  1. Streaming: Edge sends video → Host computes depth (NN inference on host)
  2. Fusion: Edge computes depth → Sends metadata only (no point cloud)

There is no built-in support for streaming computed depth maps or point clouds from edge to host. This is by design for bandwidth efficiency and to encourage use of ZED Box edge compute products.

For custom depth/point cloud streaming, you must implement your own compression and network layer outside the ZED SDK.