# CUDA-OpenGL Interop: Jetson vs Desktop GPU Differences

## Summary
On modern desktop GPUs (RTX 3090/5070 Ti with CUDA 12.8), `cudaGraphicsMapResources` must be called every frame, while permanent mapping still works on Jetson (JetPack 5.1.2). This CUDA-OpenGL interop difference is rooted in fundamental architectural differences between integrated GPUs (iGPU) and discrete GPUs (dGPU).
## Root Cause: Memory Architecture Differences

### Jetson (Tegra/iGPU) - Unified Memory Architecture
```
┌───────────────────────────────────────────┐
│                Jetson SoC                 │
│   ┌─────────┐          ┌─────────┐        │
│   │   CPU   │◄────────►│   GPU   │        │
│   │  (ARM)  │          │ (iGPU)  │        │
│   └────┬────┘          └────┬────┘        │
│        │                    │             │
│        └──────────┬─────────┘             │
│                   ▼                       │
│             ┌──────────┐                  │
│             │  Shared  │  Unified Memory  │
│             │   DRAM   │   (Zero Copy)    │
│             └──────────┘                  │
└───────────────────────────────────────────┘
```
Characteristics:
- CPU and GPU share the same physical memory (SoC DRAM)
- Unified Memory - same pointer value works for both CPU and GPU
- Zero-copy possible - no data transfer needed
- OpenGL and CUDA can access the same memory without mapping/unmapping
- Memory is always "mapped" because there's only one memory space (a zero-copy sketch follows below)
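For illustration, here is a minimal zero-copy sketch (not taken from the sample code) showing what this buys you: pinned, mapped host memory is written by the CPU and processed in place by the GPU. On Jetson both pointers refer to the same SoC DRAM.

```cpp
// Minimal zero-copy sketch for Tegra-style unified memory (illustrative only).
// cudaHostAllocMapped pins host memory and exposes it to the GPU; on Jetson
// the "device" pointer refers to the same SoC DRAM the CPU just wrote.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1024;
    float* hostPtr = nullptr;
    float* devPtr  = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);                  // before first CUDA call
    cudaHostAlloc((void**)&hostPtr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);  // same DRAM on Jetson

    for (int i = 0; i < n; ++i) hostPtr[i] = 1.0f;          // CPU writes
    scale<<<(n + 255) / 256, 256>>>(devPtr, n, 2.0f);       // GPU updates in place
    cudaDeviceSynchronize();

    std::printf("hostPtr[0] = %f\n", hostPtr[0]);           // CPU sees 2.0, no copies
    cudaFreeHost(hostPtr);
    return 0;
}
```

On a dGPU the same code runs, but every GPU access to that mapping travels over PCIe, which is one reason discrete GPUs prefer explicit copies into VRAM.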
### Desktop GPU (dGPU) - Separate Memory Architecture
```
┌─────────────┐         PCIe Bus        ┌─────────────┐
│   Host PC   │◄───────────────────────►│  Discrete   │
│             │  (High bandwidth,       │     GPU     │
│  ┌───────┐  │   but with latency)     │  ┌───────┐  │
│  │  CPU  │  │                         │  │  GPU  │  │
│  └───┬───┘  │                         │  └───┬───┘  │
│      │      │                         │      │      │
│  ┌───▼───┐  │                         │  ┌───▼───┐  │
│  │  RAM  │  │                         │  │ VRAM  │  │
│  │(Host) │  │                         │  │(Device│  │
│  └───────┘  │                         │  │Memory)│  │
└─────────────┘                         │  └───────┘  │
                                        └─────────────┘
```
Characteristics:
- CPU and GPU have separate physical memories (Host RAM vs VRAM)
- Data must be transferred between host and device memory
- OpenGL texture/buffer lives in GPU memory
- CUDA must explicitly map the resource to get a device pointer
- Mapping/Unmapping is required to synchronize access between OpenGL and CUDA (the one-time registration step is sketched below)
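Concretely, interop on a dGPU starts with a one-time registration of the OpenGL object with CUDA; only then can it be mapped each frame. A minimal sketch (function and variable names are illustrative, not from GLViewer.cpp):

```cpp
// One-time setup sketch: register an existing OpenGL texture with CUDA so
// that it can be mapped/unmapped every frame later on.
#include <GL/gl.h>
#include <cuda_gl_interop.h>
#include <cuda_runtime.h>
#include <cstdio>

static cudaGraphicsResource_t resource = nullptr;

bool registerTexture(GLuint tex) {
    // WriteDiscard tells the driver CUDA will overwrite the contents every
    // frame, so the previous contents never have to be preserved on a dGPU.
    cudaError_t err = cudaGraphicsGLRegisterImage(
        &resource, tex, GL_TEXTURE_2D, cudaGraphicsRegisterFlagsWriteDiscard);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGraphicsGLRegisterImage failed: %s\n",
                     cudaGetErrorString(err));
        return false;
    }
    return true;
}

void unregisterTexture() {
    if (resource) {
        cudaGraphicsUnregisterResource(resource);  // release at teardown
        resource = nullptr;
    }
}
```

The `cudaGraphicsRegisterFlagsWriteDiscard` flag is an assumption that CUDA fully rewrites the texture every frame; drop it if the previous contents must be preserved.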
## Why the Fix is Required on Desktop GPUs

### The Problem with Permanent Mapping on dGPU
When a resource is permanently mapped on desktop GPUs:
- CUDA holds the mapping - thinks it owns the memory
- OpenGL tries to use the texture/buffer for rendering
- Memory coherency issues - CUDA writes may not be visible to OpenGL
- Synchronization race conditions - undefined behavior, corruption, or crashes (see the anti-pattern sketch below)
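Schematically, the failing pattern looks like this (a simplified sketch, not the actual GLViewer.cpp code):

```cpp
// Anti-pattern sketch: the resource is mapped once at startup and the pointer
// is cached forever, so CUDA never hands ownership back to OpenGL.
#include <GL/gl.h>
#include <cuda_gl_interop.h>
#include <cuda_runtime.h>

static cudaGraphicsResource_t resource   = nullptr;  // registered elsewhere
static cudaStream_t           stream     = nullptr;  // default stream
static void*                  cachedPtr  = nullptr;
static size_t                 cachedSize = 0;

void initOnce() {
    // Mapped once and never unmapped: CUDA keeps ownership of the buffer.
    cudaGraphicsMapResources(1, &resource, stream);
    cudaGraphicsResourceGetMappedPointer(&cachedPtr, &cachedSize, resource);
}

void perFrame(const void* frameData) {
    // CUDA writes while OpenGL may be reading the very same buffer:
    // on CUDA 12.8+ desktop drivers this is a race (stale data, stalls, crashes).
    cudaMemcpyAsync(cachedPtr, frameData, cachedSize,
                    cudaMemcpyHostToDevice, stream);
    // ... OpenGL draw calls using the still-mapped resource follow here ...
}
```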
### Why It Works on Jetson
On Jetson with unified memory:
- Same physical memory - both CUDA and OpenGL access the same location
- No data transfer needed - writes are immediately visible
- Cache coherency handled by hardware - no explicit synchronization required
- Mapping is essentially a no-op - just gives a pointer to shared memory
### The Correct Pattern (Applied in Our Fix)

```cpp
// Every frame:

// 1. Map the resource for CUDA access
//    - On dGPU:   sets up the memory mapping and ensures coherency
//    - On Jetson: lightweight operation (memory is already unified)
cudaGraphicsMapResources(1, &resource, stream);

// 2. Get a pointer (buffer) or array (texture) for CUDA to use
cudaGraphicsResourceGetMappedPointer(&ptr, &size, resource);
// or
cudaGraphicsSubResourceGetMappedArray(&arr, resource, 0, 0);

// 3. Copy/process data with CUDA
//    (the array/texture path uses cudaMemcpy2DToArray instead of cudaMemcpy)
cudaMemcpy(ptr, data, size, cudaMemcpyHostToDevice);

// 4. Synchronize to ensure the copy completes
cudaStreamSynchronize(stream);

// 5. Unmap so OpenGL can use the resource
//    - On dGPU:   releases the mapping, makes the data available to OpenGL
//    - On Jetson: still good practice, minimal overhead
cudaGraphicsUnmapResources(1, &resource, stream);

// 6. OpenGL rendering can now safely use the resource
// glBindTexture(GL_TEXTURE_2D, texture);
// glDrawArrays(...);
```
## CUDA Version and Driver Changes

### CUDA 12.8 Behavior Changes

According to the NVIDIA Developer Forums (2025):

> "CUDA randomly freezes when using cudaStreamSynchronize with OpenGL interop... cudaGraphicsMapResources needs to be called every frame"
This suggests that newer CUDA versions and drivers have tightened synchronization requirements:
- Older CUDA/drivers may have been more lenient with permanent mapping
- CUDA 12.8+ enforces proper map/unmap cycles for correctness
- Stricter memory ordering between CUDA and OpenGL contexts (see the error-checking sketch after this list)
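Given the stricter behavior, it is worth checking the return code of every interop call rather than assuming success. A minimal, generic error-checking sketch (not part of the sample):

```cpp
// Generic CUDA error-checking wrapper: surfaces interop failures from the
// stricter drivers instead of silently rendering corrupted frames.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            std::fprintf(stderr, "%s failed at %s:%d: %s\n", #call, __FILE__, \
                         __LINE__, cudaGetErrorString(err_));                 \
            std::exit(EXIT_FAILURE);                                          \
        }                                                                     \
    } while (0)

// Per-frame usage:
// CUDA_CHECK(cudaGraphicsMapResources(1, &resource, stream));
// ... copy / kernel work ...
// CUDA_CHECK(cudaGraphicsUnmapResources(1, &resource, stream));
```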
### Why JetPack 5.1.2 Still Works
- Older driver branch - Jetson drivers lag behind desktop
- Unified memory hides the issue - no actual data movement needed
- Different driver implementation - Tegra drivers optimized for unified memory path
## Performance Considerations

### Desktop GPU (dGPU)
| Approach | Overhead | Correctness |
|---|---|---|
| Permanent mapping | Low | Broken on CUDA 12.8+ |
| Map/unmap per frame | Higher | Correct |
Optimization for dGPU:
- Use `cudaStreamSynchronize(0)` or explicit events (see the event-based sketch after this list)
- Batch multiple copies if possible
- Consider using CUDA arrays directly if no OpenGL is needed
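As an example of the "explicit events" option, here is a sketch that records an event after the upload so the host waits for (or could poll) just that copy; the event is assumed to be created once at init with `cudaEventCreate`, and all names are illustrative:

```cpp
// Event-based synchronization sketch: wait only for the upload copy,
// then hand the resource back to OpenGL.
#include <GL/gl.h>
#include <cuda_gl_interop.h>
#include <cuda_runtime.h>

void uploadFrame(cudaGraphicsResource_t resource, const void* frameData,
                 size_t frameBytes, cudaStream_t stream, cudaEvent_t copyDone) {
    void*  devPtr     = nullptr;
    size_t mappedSize = 0;

    cudaGraphicsMapResources(1, &resource, stream);
    cudaGraphicsResourceGetMappedPointer(&devPtr, &mappedSize, resource);

    cudaMemcpyAsync(devPtr, frameData, frameBytes,
                    cudaMemcpyHostToDevice, stream);

    cudaEventRecord(copyDone, stream);   // mark the end of the copy
    cudaEventSynchronize(copyDone);      // wait for the copy only

    cudaGraphicsUnmapResources(1, &resource, stream);
    // OpenGL can render from the resource after this point.
}
```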
### Jetson (iGPU)
| Approach | Overhead | Correctness |
|---|---|---|
| Permanent mapping | Minimal | Works (unified memory) |
| Map/unmap per frame | Minimal | Works (good practice) |
Recommendation: Use the same map/unmap pattern on both platforms for code portability, even though Jetson is more forgiving. A small RAII wrapper, sketched below, makes the pattern easy to enforce.
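One way to guarantee the pattern is applied consistently on both platforms is a small RAII guard that maps in its constructor and unmaps in its destructor (a sketch, not part of the sample):

```cpp
// RAII guard: maps on construction, unmaps on destruction, so every code path
// (including early returns) releases the resource back to OpenGL.
#include <GL/gl.h>
#include <cuda_gl_interop.h>
#include <cuda_runtime.h>

class ScopedMap {
public:
    ScopedMap(cudaGraphicsResource_t res, cudaStream_t stream)
        : res_(res), stream_(stream) {
        cudaGraphicsMapResources(1, &res_, stream_);
    }
    ~ScopedMap() { cudaGraphicsUnmapResources(1, &res_, stream_); }

    ScopedMap(const ScopedMap&) = delete;
    ScopedMap& operator=(const ScopedMap&) = delete;

    void* pointer(size_t* size) const {
        void* ptr = nullptr;
        cudaGraphicsResourceGetMappedPointer(&ptr, size, res_);
        return ptr;
    }

private:
    cudaGraphicsResource_t res_;
    cudaStream_t           stream_;
};

// Per-frame usage:
// {
//     ScopedMap map(resource, stream);
//     size_t size = 0;
//     void*  ptr  = map.pointer(&size);
//     cudaMemcpyAsync(ptr, data, size, cudaMemcpyHostToDevice, stream);
//     cudaStreamSynchronize(stream);
// }   // unmapped here; OpenGL may now render from the resource
```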
## References

### NVIDIA Documentation

- CUDA for Tegra - https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html
  - "Tegra's integrated GPU (iGPU) shares the same SoC DRAM with CPU... contrasting with dGPUs that have separate memory"
- CUDA Runtime API - OpenGL Interop - https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OPENGL.html
- Unified Memory on Jetson - https://forums.developer.nvidia.com/t/unified-memory-on-jetson-platforms/187448
### Community Discussions

- NVIDIA Developer Forums (2023-2025) - "cudaGraphicsMapResources each frame or just once"
  - Users report that CUDA 12.8 requires per-frame mapping
  - Jetson works with permanent mapping due to unified memory
- CUDA Random Freezes with OpenGL (2025) - https://forums.developer.nvidia.com/t/cuda-randomly-freezes-when-using-cudastreamsynchronize-with-opengl-interop/318514
  - Confirms stricter synchronization in newer CUDA versions
## Conclusion
The fix applied to GLViewer.cpp (map/unmap every frame) is required for correctness on desktop GPUs and is good practice on Jetson:
- Desktop RTX GPUs - Separate VRAM requires explicit synchronization between CUDA and OpenGL
- Jetson iGPU - Unified memory is more forgiving, but same pattern works
- CUDA 12.8+ - Stricter driver requirements enforce proper resource management
- Code portability - Same code works correctly on both platforms
The architectural difference explains why the original code worked on Jetson but failed on your RTX 3090/5070 Ti with CUDA 12.8.