
CUDA-OpenGL Interop: Jetson vs Desktop GPU Differences

Summary

The CUDA-OpenGL interop issue, in which cudaGraphicsMapResources must be called every frame on modern desktop GPUs (RTX 3090/5070 Ti with CUDA 12.8) while permanent mapping still works on Jetson (JetPack 5.1.2), is rooted in a fundamental architectural difference: integrated GPUs (iGPU) share physical memory with the CPU, while discrete GPUs (dGPU) do not.

Root Cause: Memory Architecture Differences

Jetson (Tegra/iGPU) - Unified Memory Architecture

┌─────────────────────────────────────────┐
│           Jetson SoC                    │
│  ┌─────────┐      ┌─────────┐          │
│  │   CPU   │◄────►│   GPU   │          │
│  │  (ARM)  │      │ (iGPU)  │          │
│  └────┬────┘      └────┬────┘          │
│       │                │               │
│       └───────┬────────┘               │
│               ▼                        │
│         ┌──────────┐                   │
│         │ Shared   │  Unified Memory   │
│         │  DRAM    │  (Zero Copy)      │
│         └──────────┘                   │
└─────────────────────────────────────────┘

Characteristics:

  • CPU and GPU share the same physical memory (SoC DRAM)
  • Unified Memory - same pointer value works for both CPU and GPU
  • Zero-copy possible - no data transfer needed
  • OpenGL and CUDA can access the same memory without mapping/unmapping
  • Memory is always "mapped" because there's only one memory space
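The zero-copy behavior can be sketched with mapped pinned memory (a minimal illustration, not from GLViewer.cpp; on Jetson the two pointers below refer to the same SoC DRAM, whereas on a dGPU a real transfer would still be needed):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Allow mapping host allocations into the device address space
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float* host_ptr = nullptr;
    cudaHostAlloc(&host_ptr, 1024 * sizeof(float), cudaHostAllocMapped);

    float* dev_ptr = nullptr;
    cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0);

    // On Tegra, a kernel launched with dev_ptr reads and writes the very
    // memory the CPU sees through host_ptr; no copy, no PCIe traffic.
    std::printf("host %p / device %p\n", (void*)host_ptr, (void*)dev_ptr);

    cudaFreeHost(host_ptr);
    return 0;
}
```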

Desktop GPU (dGPU) - Separate Memory Architecture

┌─────────────┐         PCIe Bus         ┌─────────────┐
│   Host PC   │◄───────────────────────►│  Discrete   │
│             │   (High bandwidth but   │    GPU      │
│  ┌───────┐  │    with latency)        │  ┌───────┐  │
│  │  CPU  │  │                         │  │  GPU  │  │
│  └───┬───┘  │                         │  └───┬───┘  │
│      │      │                         │      │      │
│  ┌───▼───┐  │                         │  ┌───▼───┐  │
│  │  RAM  │  │                         │  │ VRAM  │  │
│  │(Host) │  │                         │  │(Device│  │
│  └───────┘  │                         │  │Memory)│  │
└─────────────┘                         │  └───────┘  │
                                        └─────────────┘

Characteristics:

  • CPU and GPU have separate physical memories (Host RAM vs VRAM)
  • Data must be transferred between host and device memory
  • OpenGL texture/buffer lives in GPU memory
  • CUDA must explicitly map the resource to get a device pointer
  • Mapping/Unmapping is required to synchronize access between OpenGL and CUDA

Why the Fix is Required on Desktop GPUs

The Problem with Permanent Mapping on dGPU

When a resource is permanently mapped on desktop GPUs:

  1. CUDA holds the mapping - thinks it owns the memory
  2. OpenGL tries to use the texture/buffer for rendering
  3. Memory coherency issues - CUDA writes may not be visible to OpenGL
  4. Synchronization race conditions - undefined behavior, corruption, or crashes

Why It Works on Jetson

On Jetson with unified memory:

  1. Same physical memory - both CUDA and OpenGL access the same location
  2. No data transfer needed - writes are immediately visible
  3. Cache coherency handled by hardware - no explicit synchronization required
  4. Mapping is essentially a no-op - just gives a pointer to shared memory

The Correct Pattern (Applied in Our Fix)

// Every frame:

// 1. Map resource for CUDA access
//    - On dGPU: Sets up memory mapping, ensures coherency
//    - On Jetson: Lightweight operation (already unified)
cudaGraphicsMapResources(1, &resource, stream);

// 2. Get pointer/array for CUDA to use
cudaGraphicsResourceGetMappedPointer(&ptr, &size, resource);
// or
cudaGraphicsSubResourceGetMappedArray(&arr, resource, 0, 0);

// 3. Copy/process data with CUDA (async on the interop stream, so the
//    synchronize in step 4 actually covers this copy)
cudaMemcpyAsync(ptr, data, size, cudaMemcpyHostToDevice, stream);

// 4. Synchronize to ensure copy completes
cudaStreamSynchronize(stream);

// 5. Unmap so OpenGL can use it
//    - On dGPU: Releases mapping, makes data available to OpenGL
//    - On Jetson: Still good practice, minimal overhead
cudaGraphicsUnmapResources(1, &resource, stream);

// 6. OpenGL rendering can now safely use the resource
//    glBindTexture(GL_TEXTURE_2D, texture);
//    glDrawArrays(...);

CUDA Version and Driver Changes

CUDA 12.8 Behavior Changes

According to NVIDIA Developer Forums (2025):

"CUDA randomly freezes when using cudaStreamSynchronize with OpenGL interop... cudaGraphicsMapResources needs to be called every frame"

This suggests that newer CUDA versions and drivers have tightened synchronization requirements:

  • Older CUDA/drivers may have been more lenient with permanent mapping
  • CUDA 12.8+ enforces proper map/unmap cycles for correctness
  • Stricter memory ordering between CUDA and OpenGL contexts

Why JetPack 5.1.2 Still Works

  • Older driver branch - Jetson drivers lag behind desktop
  • Unified memory hides the issue - no actual data movement needed
  • Different driver implementation - Tegra drivers optimized for unified memory path

Performance Considerations

Desktop GPU (dGPU)

| Approach            | Overhead | Correctness           |
|---------------------|----------|-----------------------|
| Permanent mapping   | Low      | Broken on CUDA 12.8+  |
| Map/unmap per frame | Higher   | Correct               |

Optimization for dGPU:

  • Use cudaStreamSynchronize(0) or explicit events
  • Batch multiple copies if possible
  • Consider using CUDA arrays directly if no OpenGL needed
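The event-based alternative in the first bullet can be sketched as follows (assuming the same resource/stream setup as the pattern above; copy_done is an illustrative name):

```cpp
// Use a CUDA event instead of a blocking full-stream wait: the CPU waits
// only for stream work recorded before the event.
cudaEvent_t copy_done;
cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);

cudaGraphicsMapResources(1, &resource, stream);
cudaGraphicsResourceGetMappedPointer(&ptr, &size, resource);

// Batch the frame's copies into the stream, then record one event.
cudaMemcpyAsync(ptr, data, size, cudaMemcpyHostToDevice, stream);
cudaEventRecord(copy_done, stream);

cudaEventSynchronize(copy_done);   // wait for the copies, nothing more
cudaGraphicsUnmapResources(1, &resource, stream);

cudaEventDestroy(copy_done);
```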

Jetson (iGPU)

| Approach            | Overhead | Correctness             |
|---------------------|----------|-------------------------|
| Permanent mapping   | Minimal  | Works (unified memory)  |
| Map/unmap per frame | Minimal  | Works (good practice)   |

Recommendation: Use the same map/unmap pattern on both platforms for code portability, even though Jetson is more forgiving.

References

NVIDIA Documentation

  1. CUDA for Tegra - https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html

    "Tegra's integrated GPU (iGPU) shares the same SoC DRAM with CPU... contrasting with dGPUs that have separate memory"

  2. CUDA Runtime API - OpenGL Interop - https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OPENGL.html

  3. Unified Memory on Jetson - https://forums.developer.nvidia.com/t/unified-memory-on-jetson-platforms/187448

Community Discussions

  1. NVIDIA Developer Forums (2023-2025)

    • "cudaGraphicsMapResources each frame or just once"
    • Users report CUDA 12.8 requires per-frame mapping
    • Jetson works with permanent mapping due to unified memory
  2. CUDA Random Freezes with OpenGL (2025)

Conclusion

The fix applied to GLViewer.cpp (map/unmap every frame) is required for correctness on desktop GPUs and is good practice on Jetson:

  1. Desktop RTX GPUs - Separate VRAM requires explicit synchronization between CUDA and OpenGL
  2. Jetson iGPU - Unified memory is more forgiving, but same pattern works
  3. CUDA 12.8+ - Stricter driver requirements enforce proper resource management
  4. Code portability - Same code works correctly on both platforms

The architectural difference explains why the original code worked on Jetson but failed on your RTX 3090/5070Ti with CUDA 12.8.