# ZED SDK Architecture: Streaming vs Fusion API

## Overview

The ZED SDK provides two distinct APIs for transmitting camera data over a network:

1. **Streaming API** (`enableStreaming`) - Video streaming
2. **Fusion API** (`startPublishing`) - Metadata publishing

These serve fundamentally different use cases and have different compute/bandwidth tradeoffs.

## API Comparison

| Feature | Streaming API | Fusion API |
|---------|---------------|------------|
| **Primary Use Case** | Remote camera access | Multi-camera data fusion |
| **Data Transmitted** | Compressed video (H264/H265) | Metadata only (bodies, objects, poses) |
| **Bandwidth per Camera** | 10-40 Mbps | <100 Kbps |
| **Edge Compute** | Video encoding only (NVENC) | Full depth NN + tracking + detection |
| **Host Compute** | Full depth NN + tracking + detection | Lightweight fusion only |
| **Synchronization** | None | Time-synced + geometric calibration |
| **360° Coverage** | No | Yes (fuses overlapping views) |
| **Receiver API** | `zed.open()` with `INPUT_TYPE::STREAM` | `fusion.subscribe()` |

## Architecture Diagrams

### Streaming API (Single Camera Remote Access)

```
┌─────────────────────┐                      ┌─────────────────────┐
│    Edge (Jetson)    │                      │    Host (Server)    │
│                     │                      │                     │
│  ┌───────────────┐  │   H264/H265 RTP      │  ┌───────────────┐  │
│  │  ZED Camera   │  │   (10-40 Mbps)       │  │    Decode     │  │
│  └───────┬───────┘  │ ───────────────────► │  │   (NVDEC)     │  │
│          │          │                      │  └───────┬───────┘  │
│  ┌───────▼───────┐  │                      │          │          │
│  │     NVENC     │  │                      │  ┌───────▼───────┐  │
│  │    Encode     │──┘                      │  │ Neural Depth  │  │
│  │  (hardware)   │  │                      │  │  (NN on GPU)  │  │
│  └───────────────┘  │                      │  └───────┬───────┘  │
│                     │                      │          │          │
│                     │                      │  ┌───────▼───────┐  │
│                     │                      │  │  Tracking /   │  │
│                     │                      │  │   Detection   │  │
│                     │                      │  └───────┬───────┘  │
│                     │                      │          │          │
│                     │                      │  ┌───────▼───────┐  │
│                     │                      │  │  Point Cloud  │  │
│                     │                      │  └───────────────┘  │
└─────────────────────┘                      └─────────────────────┘

Edge: Lightweight (encode only)
Host: Heavy (NN depth + all processing)
```

### Fusion API (Multi-Camera 360° Coverage)

```
┌─────────────────────┐
│  Edge #1 (Jetson)   │
│  ┌───────────────┐  │
│  │  ZED Camera   │  │
│  └───────┬───────┘  │   Metadata Only
│  ┌───────▼───────┐  │   (bodies, poses)
│  │ Neural Depth  │  │   (<100 Kbps)          ┌─────────────────────┐
│  │  (NN on GPU)  │  │ ──────────────────────►│                     │
│  └───────┬───────┘  │                        │    Fusion Server    │
│  ┌───────▼───────┐  │                        │                     │
│  │  Body Track   │──┘                        │  ┌───────────────┐  │
│  └───────────────┘  │                        │  │  Subscribe    │  │
└─────────────────────┘                        │  │  to all       │  │
                                               │  │  cameras      │  │
┌─────────────────────┐                        │  └───────┬───────┘  │
│  Edge #2 (Jetson)   │                        │          │          │
│  ┌───────────────┐  │   Metadata Only        │  ┌───────▼───────┐  │
│  │  ZED Camera   │  │ ──────────────────────►│  │  Time Sync    │  │
│  └───────┬───────┘  │                        │  │  + Geometric  │  │
│  ┌───────▼───────┐  │                        │  │  Calibration  │  │
│  │ Neural Depth  │  │                        │  └───────┬───────┘  │
│  └───────┬───────┘  │                        │          │          │
│  ┌───────▼───────┐  │                        │  ┌───────▼───────┐  │
│  │  Body Track   │──┘                        │  │  360° Fusion  │  │
│  └───────────────┘  │                        │  │ (merge views) │  │
└─────────────────────┘                        │  └───────────────┘  │
                                               │                     │
┌─────────────────────┐                        │  Lightweight GPU    │
│  Edge #3 (Jetson)   │   Metadata Only        │  requirements       │
│         ...         │ ──────────────────────►│                     │
└─────────────────────┘                        └─────────────────────┘

Each Edge: Heavy (NN depth + tracking)
Fusion Server: Lightweight (data fusion only)
```

## Communication Modes

### Streaming API

| Mode | Description |
|------|-------------|
| **H264** | AVC encoding, wider GPU support |
| **H265** | HEVC encoding, better compression, requires Pascal+ GPU |

Port: Even number (default 30000), uses the RTP protocol.

### Fusion API

| Mode | Description |
|------|-------------|
| **INTRA_PROCESS** | Same machine, shared memory (zero-copy) |
| **LOCAL_NETWORK** | Different machines, RTP over network |

Port: Default 30000, configurable per camera.

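As a sketch, the two Fusion communication modes are selected on `sl::CommunicationParameters` before publishing or subscribing. The method names below follow the ZED SDK 4 Fusion samples; treat them as assumptions if you are on a different SDK version:

```cpp
sl::CommunicationParameters comm_params;

// Same machine: shared-memory transport (zero-copy, INTRA_PROCESS)
comm_params.setForSharedMemory();

// Different machines: RTP over the local network on port 30000 (LOCAL_NETWORK)
comm_params.setForLocalNetwork(30000);

// The same object is used on both sides: the edge passes it to
// zed.startPublishing(), the host passes it to fusion.subscribe().
```
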
## Bandwidth Requirements

### Streaming (H265 Compressed Video)

| Resolution | FPS | Bitrate per Camera | 4 Cameras |
|------------|-----|--------------------|-----------|
| 2K | 15 | 7 Mbps | 28 Mbps |
| HD1080 | 30 | 11 Mbps | 44 Mbps |
| HD720 | 60 | 6 Mbps | 24 Mbps |
| HD1200 | 30 | ~12 Mbps | ~48 Mbps |

### Fusion (Metadata Only)

| Data Type | Size per Frame | @ 30 FPS | 4 Cameras |
|-----------|----------------|----------|-----------|
| Body (18 keypoints) | ~2 KB | ~60 KB/s | ~240 KB/s |
| Object detection | ~1 KB | ~30 KB/s | ~120 KB/s |
| Pose/Transform | ~100 B | ~3 KB/s | ~12 KB/s |

**Fusion uses 100-1000x less bandwidth than Streaming.**

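The metadata rates above follow directly from per-frame size × frame rate. This minimal sketch recomputes them from the table's own estimates (the ~2 KB body payload and ~100 B pose payload are the table's figures, not measured values):

```cpp
#include <cassert>
#include <cstdio>

// Metadata rate in KB/s = per-frame payload (bytes) * frame rate / 1000
constexpr int rate_kb_per_s(int bytes_per_frame, int fps) {
    return bytes_per_frame * fps / 1000;
}

int main() {
    int body = rate_kb_per_s(2000, 30);  // body skeleton, ~2 KB/frame
    int pose = rate_kb_per_s(100, 30);   // pose/transform, ~100 B/frame
    std::printf("body: %d KB/s (x4 cams: %d KB/s), pose: %d KB/s\n",
                body, 4 * body, pose);
    assert(body == 60 && 4 * body == 240);
    assert(pose == 3);
    return 0;
}
```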
## The Architectural Gap

### What You CAN Do

| Scenario | API | Edge Computes | Host Receives |
|----------|-----|---------------|---------------|
| Remote camera access | Streaming | Video encoding | Video → computes depth/tracking |
| Multi-camera fusion | Fusion | Depth + tracking | Metadata only (bodies, poses) |
| Local processing | Direct | Everything | N/A (same machine) |

### What You CANNOT Do

**There is no ZED SDK mode for:**

```
┌─────────────────────┐                      ┌─────────────────────┐
│    Edge (Jetson)    │                      │    Host (Server)    │
│                     │                      │                     │
│  ┌───────────────┐  │   Depth Map /        │  ┌───────────────┐  │
│  │  ZED Camera   │  │   Point Cloud        │  │   Receive     │  │
│  └───────┬───────┘  │                      │  │   Depth/PC    │  │
│          │          │        ???           │  └───────┬───────┘  │
│  ┌───────▼───────┐  │ ─────────────────X─► │          │          │
│  │ Neural Depth  │  │   NOT SUPPORTED      │  ┌───────▼───────┐  │
│  │  (NN on GPU)  │  │                      │  │   Further     │  │
│  └───────┬───────┘  │                      │  │  Processing   │  │
│          │          │                      │  └───────────────┘  │
│  ┌───────▼───────┐  │                      │                     │
│  │  Point Cloud  │──┘                      │                     │
│  └───────────────┘  │                      │                     │
└─────────────────────┘                      └─────────────────────┘

❌ Edge computes depth → streams depth map → Host receives depth
❌ Edge computes point cloud → streams point cloud → Host receives point cloud
```

## Why This Architecture?

### 1. Bandwidth Economics

Point cloud streaming would require significantly more bandwidth than video:

| Data Type | Size per Frame (HD1080) | @ 30 FPS |
|-----------|-------------------------|----------|
| Raw stereo video | ~12 MB | 360 MB/s |
| H265 compressed | ~46 KB | 11 Mbps |
| Depth map (16-bit) | ~4 MB | 120 MB/s |
| Point cloud (XYZ float) | ~12 MB | 360 MB/s |

Compressed depth/point cloud is lossy and still large (~50-100 Mbps).

### 2. Compute Distribution Philosophy

The ZED SDK follows one principle: **compute entirely at the edge OR entirely at the host, never split.**

| Scenario | Solution |
|----------|----------|
| Low bandwidth, multi-camera | Fusion (edge computes all, sends metadata) |
| High bandwidth, single camera | Streaming (host computes all) |
| Same machine | INTRA_PROCESS (shared memory) |

### 3. Fusion API Design Goals

From the Stereolabs documentation:

> "The Fusion module is **lightweight** (in computation resources requirements) compared to the requirements for camera publishers."

The Fusion receiver is intentionally lightweight because:

- It only needs to fuse pre-computed metadata
- It handles time synchronization and geometric calibration
- It can run on modest hardware while the edges do the heavy compute

### 4. Product Strategy

Stereolabs sells:

- **ZED cameras** (hardware)
- **ZED Box** (edge compute appliances)
- **ZED Hub** (cloud management)

The Fusion API encourages purchasing ZED Boxes for edge compute rather than building custom streaming solutions.

## Workarounds for Custom Point Cloud Streaming

If you need to stream point clouds from edge to host (outside the ZED SDK):

### Option 1: Custom Compression + Streaming

```cpp
// On edge: compute point cloud, compress, send
sl::Mat point_cloud;
zed.retrieveMeasure(point_cloud, MEASURE::XYZRGBA);

// Compress with Draco or PCL octree compression
// (draco_compress is a placeholder for your own wrapper around the codec)
std::vector<uint8_t> compressed = draco_compress(point_cloud);

// Send via ZeroMQ/gRPC/raw UDP (socket is your own transport object)
socket.send(compressed);
```

### Option 2: Depth Map Streaming

```cpp
// On edge: get depth, compress as 16-bit PNG, send
sl::Mat depth;
zed.retrieveMeasure(depth, MEASURE::DEPTH);  // 32-bit float, in the unit set at init

// PNG cannot store 32-bit float: convert to 16-bit (e.g., millimeters) first
cv::Mat depth_cv = slMat2cvMat(depth);       // helper from the Stereolabs samples
cv::Mat depth_mm;
depth_cv.convertTo(depth_mm, CV_16U, 1000.0);

// Compress as lossless PNG
std::vector<uint8_t> png;
cv::imencode(".png", depth_mm, png);

// Send via network (socket is your own transport object)
socket.send(png);
```

### Bandwidth Estimate for Custom Streaming

| Method | Compression | Bandwidth (HD1080@30fps) |
|--------|-------------|--------------------------|
| Depth PNG (lossless) | ~4:1 | ~240 Mbps |
| Depth JPEG (lossy) | ~20:1 | ~48 Mbps |
| Point cloud Draco | ~10:1 | ~100 Mbps |

**10 Gbps Ethernet could handle 4 cameras with custom depth streaming.**

## Recommendations

| Use Case | Recommended API |
|----------|-----------------|
| Single camera, remote development | Streaming |
| Multi-camera body tracking | Fusion |
| Multi-camera 360° coverage | Fusion |
| Custom point cloud pipeline | Manual (ZeroMQ + Draco) |
| Low latency, same machine | INTRA_PROCESS |

## Code Examples

### Streaming Sender (Edge)

```cpp
sl::StreamingParameters stream_params;
stream_params.codec = sl::STREAMING_CODEC::H265;
stream_params.bitrate = 12000;  // Kbps
stream_params.port = 30000;     // must be even (RTP)

zed.enableStreaming(stream_params);

while (running) {
    zed.grab();  // Encodes and sends the frame
}
```

### Streaming Receiver (Host)

```cpp
sl::InitParameters init_params;
init_params.input.setFromStream("192.168.1.100", 30000);

zed.open(init_params);

while (running) {
    if (zed.grab() == ERROR_CODE::SUCCESS) {
        // Full ZED SDK available - depth, tracking, etc.
        zed.retrieveMeasure(depth, MEASURE::DEPTH);
        zed.retrieveMeasure(point_cloud, MEASURE::XYZRGBA);
    }
}
```

### Fusion Sender (Edge)

```cpp
// Enable body tracking
zed.enableBodyTracking(body_params);

// Start publishing metadata
sl::CommunicationParameters comm_params;
comm_params.setForLocalNetwork(30000);
zed.startPublishing(comm_params);

while (running) {
    if (zed.grab() == ERROR_CODE::SUCCESS) {
        zed.retrieveBodies(bodies);  // Computes and publishes
    }
}
```

### Fusion Receiver (Host)

```cpp
sl::Fusion fusion;
fusion.init(init_fusion_params);

sl::CameraIdentifier cam1(serial_number);
fusion.subscribe(cam1, comm_params, pose);

while (running) {
    if (fusion.process() == FUSION_ERROR_CODE::SUCCESS) {
        fusion.retrieveBodies(fused_bodies);  // Already computed by the edges
    }
}
```

## Summary

The ZED SDK architecture forces a choice:

1. **Streaming**: Edge sends video → Host computes depth (NN inference on host)
2. **Fusion**: Edge computes depth → Sends metadata only (no point cloud)

There is **no built-in support** for streaming computed depth maps or point clouds from edge to host. This is by design, both for bandwidth efficiency and to encourage use of ZED Box edge compute products.

For custom depth/point cloud streaming, you must implement your own compression and network layer outside the ZED SDK.