# CUDA-OpenGL Interop: Jetson vs Desktop GPU Differences ## Summary The CUDA-OpenGL interop issue where `cudaGraphicsMapResources` is required every frame on modern desktop GPUs (RTX 3090/5070Ti with CUDA 12.8) but works with permanent mapping on Jetson (JetPack 5.1.2) is rooted in fundamental architectural differences between integrated GPUs (iGPU) and discrete GPUs (dGPU). ## Root Cause: Memory Architecture Differences ### Jetson (Tegra/iGPU) - Unified Memory Architecture ``` ┌─────────────────────────────────────────┐ │ Jetson SoC │ │ ┌─────────┐ ┌─────────┐ │ │ │ CPU │◄────►│ GPU │ │ │ │ (ARM) │ │ (iGPU) │ │ │ └────┬────┘ └────┬────┘ │ │ │ │ │ │ └───────┬────────┘ │ │ ▼ │ │ ┌──────────┐ │ │ │ Shared │ Unified Memory │ │ │ DRAM │ (Zero Copy) │ │ └──────────┘ │ └─────────────────────────────────────────┘ ``` **Characteristics:** - CPU and GPU share the **same physical memory** (SoC DRAM) - **Unified Memory** - same pointer value works for both CPU and GPU - **Zero-copy** possible - no data transfer needed - OpenGL and CUDA can access the same memory without mapping/unmapping - Memory is always "mapped" because there's only one memory space ### Desktop GPU (dGPU) - Separate Memory Architecture ``` ┌─────────────┐ PCIe Bus ┌─────────────┐ │ Host PC │◄───────────────────────►│ Discrete │ │ │ (High bandwidth but │ GPU │ │ ┌───────┐ │ with latency) │ ┌───────┐ │ │ │ CPU │ │ │ │ GPU │ │ │ └───┬───┘ │ │ └───┬───┘ │ │ │ │ │ │ │ │ ┌───▼───┐ │ │ ┌───▼───┐ │ │ │ RAM │ │ │ │ VRAM │ │ │ │(Host) │ │ │ │(Device│ │ │ └───────┘ │ │ │Memory)│ │ └─────────────┘ │ └───────┘ │ └─────────────┘ ``` **Characteristics:** - CPU and GPU have **separate physical memories** (Host RAM vs VRAM) - Data must be **transferred** between host and device memory - OpenGL texture/buffer lives in GPU memory - CUDA must **explicitly map** the resource to get a device pointer - **Mapping/Unmapping** is required to synchronize access between OpenGL and CUDA ## Why the Fix is Required on Desktop GPUs ### The Problem with Permanent Mapping on dGPU When a resource is permanently mapped on desktop GPUs: 1. **CUDA holds the mapping** - thinks it owns the memory 2. **OpenGL tries to use the texture/buffer** for rendering 3. **Memory coherency issues** - CUDA writes may not be visible to OpenGL 4. **Synchronization race conditions** - undefined behavior, corruption, or crashes ### Why It Works on Jetson On Jetson with unified memory: 1. **Same physical memory** - both CUDA and OpenGL access the same location 2. **No data transfer needed** - writes are immediately visible 3. **Cache coherency handled by hardware** - no explicit synchronization required 4. **Mapping is essentially a no-op** - just gives a pointer to shared memory ## The Correct Pattern (Applied in Our Fix) ```cpp // Every frame: // 1. Map resource for CUDA access // - On dGPU: Sets up memory mapping, ensures coherency // - On Jetson: Lightweight operation (already unified) cudaGraphicsMapResources(1, &resource, stream); // 2. Get pointer/array for CUDA to use cudaGraphicsResourceGetMappedPointer(&ptr, &size, resource); // or cudaGraphicsSubResourceGetMappedArray(&arr, resource, 0, 0); // 3. Copy/process data with CUDA cudaMemcpy(ptr, data, size, cudaMemcpyHostToDevice); // 4. Synchronize to ensure copy completes cudaStreamSynchronize(stream); // 5. Unmap so OpenGL can use it // - On dGPU: Releases mapping, makes data available to OpenGL // - On Jetson: Still good practice, minimal overhead cudaGraphicsUnmapResources(1, &resource, stream); // 6. OpenGL rendering can now safely use the resource // glBindTexture(GL_TEXTURE_2D, texture); // glDrawArrays(...); ``` ## CUDA Version and Driver Changes ### CUDA 12.8 Behavior Changes According to NVIDIA Developer Forums (2025): > "CUDA randomly freezes when using cudaStreamSynchronize with OpenGL interop... `cudaGraphicsMapResources` needs to be called every frame" This suggests that newer CUDA versions and drivers have **tightened synchronization requirements**: - **Older CUDA/drivers** may have been more lenient with permanent mapping - **CUDA 12.8+** enforces proper map/unmap cycles for correctness - **Stricter memory ordering** between CUDA and OpenGL contexts ### Why JetPack 5.1.2 Still Works - **Older driver branch** - Jetson drivers lag behind desktop - **Unified memory hides the issue** - no actual data movement needed - **Different driver implementation** - Tegra drivers optimized for unified memory path ## Performance Considerations ### Desktop GPU (dGPU) | Approach | Overhead | Correctness | |----------|----------|-------------| | Permanent mapping | Low | **Broken on CUDA 12.8+** | | Map/unmap per frame | Higher | **Correct** | **Optimization for dGPU:** - Use `cudaStreamSynchronize(0)` or explicit events - Batch multiple copies if possible - Consider using CUDA arrays directly if no OpenGL needed ### Jetson (iGPU) | Approach | Overhead | Correctness | |----------|----------|-------------| | Permanent mapping | Minimal | Works (unified memory) | | Map/unmap per frame | Minimal | Works (good practice) | **Recommendation:** Use the same map/unmap pattern on both platforms for code portability, even though Jetson is more forgiving. ## References ### NVIDIA Documentation 1. **CUDA for Tegra** - https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html > "Tegra's integrated GPU (iGPU) shares the same SoC DRAM with CPU... contrasting with dGPUs that have separate memory" 2. **CUDA Runtime API - OpenGL Interop** - https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OPENGL.html 3. **Unified Memory on Jetson** - https://forums.developer.nvidia.com/t/unified-memory-on-jetson-platforms/187448 ### Community Discussions 1. **NVIDIA Developer Forums** (2023-2025) - "cudaGraphicsMapResources each frame or just once" - Users report CUDA 12.8 requires per-frame mapping - Jetson works with permanent mapping due to unified memory 2. **CUDA Random Freezes with OpenGL** (2025) - https://forums.developer.nvidia.com/t/cuda-randomly-freezes-when-using-cudastreamsynchronize-with-opengl-interop/318514 - Confirms stricter synchronization in newer CUDA versions ## Conclusion The fix applied to `GLViewer.cpp` (map/unmap every frame) is **required for correctness on desktop GPUs** and is **good practice on Jetson**: 1. **Desktop RTX GPUs** - Separate VRAM requires explicit synchronization between CUDA and OpenGL 2. **Jetson iGPU** - Unified memory is more forgiving, but same pattern works 3. **CUDA 12.8+** - Stricter driver requirements enforce proper resource management 4. **Code portability** - Same code works correctly on both platforms The architectural difference explains why the original code worked on Jetson but failed on your RTX 3090/5070Ti with CUDA 12.8.