# CUDA-OpenGL Interop: Jetson vs Desktop GPU Differences
## Summary
The CUDA-OpenGL interop issue, in which `cudaGraphicsMapResources` must be called every frame on modern desktop GPUs (RTX 3090/5070 Ti with CUDA 12.8) while permanent mapping works on Jetson (JetPack 5.1.2), is rooted in fundamental architectural differences between integrated GPUs (iGPUs) and discrete GPUs (dGPUs).
## Root Cause: Memory Architecture Differences
### Jetson (Tegra/iGPU) - Unified Memory Architecture
```
┌───────────────────────────────────────┐
│              Jetson SoC               │
│   ┌─────────┐      ┌─────────┐        │
│   │   CPU   │◄────►│   GPU   │        │
│   │  (ARM)  │      │ (iGPU)  │        │
│   └────┬────┘      └────┬────┘        │
│        │                │             │
│        └───────┬────────┘             │
│                ▼                      │
│           ┌──────────┐                │
│           │  Shared  │ Unified Memory │
│           │   DRAM   │  (Zero Copy)   │
│           └──────────┘                │
└───────────────────────────────────────┘
```
**Characteristics:**
- CPU and GPU share the **same physical memory** (SoC DRAM)
- **Unified Memory** - same pointer value works for both CPU and GPU
- **Zero-copy** possible - no data transfer needed
- OpenGL and CUDA can access the same memory without mapping/unmapping
- Memory is always "mapped" because there's only one memory space
### Desktop GPU (dGPU) - Separate Memory Architecture
```
┌─────────────┐      PCIe Bus       ┌─────────────┐
│   Host PC   │◄───────────────────►│  Discrete   │
│             │ (High bandwidth but │     GPU     │
│  ┌───────┐  │    with latency)    │  ┌───────┐  │
│  │  CPU  │  │                     │  │  GPU  │  │
│  └───┬───┘  │                     │  └───┬───┘  │
│      │      │                     │      │      │
│  ┌───▼───┐  │                     │  ┌───▼───┐  │
│  │  RAM  │  │                     │  │ VRAM  │  │
│  │(Host) │  │                     │  │(Device│  │
│  └───────┘  │                     │  │Memory)│  │
└─────────────┘                     │  └───────┘  │
                                    └─────────────┘
```
**Characteristics:**
- CPU and GPU have **separate physical memories** (Host RAM vs VRAM)
- Data must be **transferred** between host and device memory
- OpenGL texture/buffer lives in GPU memory
- CUDA must **explicitly map** the resource to get a device pointer
- **Mapping/Unmapping** is required to synchronize access between OpenGL and CUDA
## Why the Fix is Required on Desktop GPUs
### The Problem with Permanent Mapping on dGPU
When a resource is permanently mapped on desktop GPUs:
1. **CUDA holds the mapping** - thinks it owns the memory
2. **OpenGL tries to use the texture/buffer** for rendering
3. **Memory coherency issues** - CUDA writes may not be visible to OpenGL
4. **Synchronization race conditions** - undefined behavior, corruption, or crashes
### Why It Works on Jetson
On Jetson with unified memory:
1. **Same physical memory** - both CUDA and OpenGL access the same location
2. **No data transfer needed** - writes are immediately visible
3. **Cache coherency handled by hardware** - no explicit synchronization required
4. **Mapping is essentially a no-op** - just gives a pointer to shared memory
## The Correct Pattern (Applied in Our Fix)
```cpp
// Every frame:
// 1. Map resource for CUDA access
// - On dGPU: Sets up memory mapping, ensures coherency
// - On Jetson: Lightweight operation (already unified)
cudaGraphicsMapResources(1, &resource, stream);
// 2. Get pointer/array for CUDA to use
cudaGraphicsResourceGetMappedPointer(&ptr, &size, resource);
// or
cudaGraphicsSubResourceGetMappedArray(&arr, resource, 0, 0);
// 3. Copy/process data with CUDA, asynchronously on the same stream
cudaMemcpyAsync(ptr, data, size, cudaMemcpyHostToDevice, stream);
// (for a mapped array, use cudaMemcpy2DToArray instead)
// 4. Synchronize to ensure the copy completes before unmapping
cudaStreamSynchronize(stream);
// 5. Unmap so OpenGL can use it
// - On dGPU: Releases mapping, makes data available to OpenGL
// - On Jetson: Still good practice, minimal overhead
cudaGraphicsUnmapResources(1, &resource, stream);
// 6. OpenGL rendering can now safely use the resource
// glBindTexture(GL_TEXTURE_2D, texture);
// glDrawArrays(...);
```
## CUDA Version and Driver Changes
### CUDA 12.8 Behavior Changes
According to NVIDIA Developer Forums (2025):
> "CUDA randomly freezes when using cudaStreamSynchronize with OpenGL interop... `cudaGraphicsMapResources` needs to be called every frame"
This suggests that newer CUDA versions and drivers have **tightened synchronization requirements**:
- **Older CUDA/drivers** may have been more lenient with permanent mapping
- **CUDA 12.8+** enforces proper map/unmap cycles for correctness
- **Stricter memory ordering** between CUDA and OpenGL contexts
### Why JetPack 5.1.2 Still Works
- **Older driver branch** - Jetson drivers lag behind desktop
- **Unified memory hides the issue** - no actual data movement needed
- **Different driver implementation** - Tegra drivers optimized for unified memory path
## Performance Considerations
### Desktop GPU (dGPU)
| Approach | Overhead | Correctness |
|----------|----------|-------------|
| Permanent mapping | Low | **Broken on CUDA 12.8+** |
| Map/unmap per frame | Higher | **Correct** |
**Optimization for dGPU:**
- Use `cudaStreamSynchronize(0)` or explicit events
- Batch multiple copies if possible
- Consider using CUDA arrays directly if no OpenGL needed
### Jetson (iGPU)
| Approach | Overhead | Correctness |
|----------|----------|-------------|
| Permanent mapping | Minimal | Works (unified memory) |
| Map/unmap per frame | Minimal | Works (good practice) |
**Recommendation:** Use the same map/unmap pattern on both platforms for code portability, even though Jetson is more forgiving.
## References
### NVIDIA Documentation
1. **CUDA for Tegra** - https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html
> "Tegra's integrated GPU (iGPU) shares the same SoC DRAM with CPU... contrasting with dGPUs that have separate memory"
2. **CUDA Runtime API - OpenGL Interop** - https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OPENGL.html
3. **Unified Memory on Jetson** - https://forums.developer.nvidia.com/t/unified-memory-on-jetson-platforms/187448
### Community Discussions
1. **NVIDIA Developer Forums** (2023-2025)
- "cudaGraphicsMapResources each frame or just once"
- Users report CUDA 12.8 requires per-frame mapping
- Jetson works with permanent mapping due to unified memory
2. **CUDA Random Freezes with OpenGL** (2025)
- https://forums.developer.nvidia.com/t/cuda-randomly-freezes-when-using-cudastreamsynchronize-with-opengl-interop/318514
- Confirms stricter synchronization in newer CUDA versions
## Conclusion
The fix applied to `GLViewer.cpp` (map/unmap every frame) is **required for correctness on desktop GPUs** and is **good practice on Jetson**:
1. **Desktop RTX GPUs** - Separate VRAM requires explicit synchronization between CUDA and OpenGL
2. **Jetson iGPU** - Unified memory is more forgiving, but same pattern works
3. **CUDA 12.8+** - Stricter driver requirements enforce proper resource management
4. **Code portability** - Same code works correctly on both platforms
The architectural difference explains why the original code worked on Jetson but failed on the RTX 3090/5070 Ti with CUDA 12.8.