BREAKING CHANGE: ClientPublisher::open() signature changed from open(sl::InputType input, Trigger* ref, int sdk_gpu_id) to open(const std::string& ip, int port, Trigger* ref, int sdk_gpu_id) The ClientPublisher now receives camera streams over network using setFromStream() instead of opening local USB cameras. This allows the host to receive video from edge devices running enableStreaming(). Changes: - Use init_parameters.input.setFromStream(ip, port) for stream input - Keep body tracking and Fusion publishing (INTRA_PROCESS) unchanged - Add getSerialNumber() getter for logging/debugging - Improve error messages with context (ip:port) Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
186 lines
8.1 KiB
Markdown
186 lines
8.1 KiB
Markdown
# CUDA-OpenGL Interop: Jetson vs Desktop GPU Differences
|
|
|
|
## Summary
|
|
|
|
The CUDA-OpenGL interop issue where `cudaGraphicsMapResources` is required every frame on modern desktop GPUs (RTX 3090/5070Ti with CUDA 12.8) but works with permanent mapping on Jetson (JetPack 5.1.2) is rooted in fundamental architectural differences between integrated GPUs (iGPU) and discrete GPUs (dGPU).
|
|
|
|
## Root Cause: Memory Architecture Differences
|
|
|
|
### Jetson (Tegra/iGPU) - Unified Memory Architecture
|
|
|
|
```
|
|
┌─────────────────────────────────────────┐
|
|
│ Jetson SoC │
|
|
│ ┌─────────┐ ┌─────────┐ │
|
|
│ │ CPU │◄────►│ GPU │ │
|
|
│ │ (ARM) │ │ (iGPU) │ │
|
|
│ └────┬────┘ └────┬────┘ │
|
|
│ │ │ │
|
|
│ └───────┬────────┘ │
|
|
│ ▼ │
|
|
│ ┌──────────┐ │
|
|
│ │ Shared │ Unified Memory │
|
|
│ │ DRAM │ (Zero Copy) │
|
|
│ └──────────┘ │
|
|
└─────────────────────────────────────────┘
|
|
```
|
|
|
|
**Characteristics:**
|
|
- CPU and GPU share the **same physical memory** (SoC DRAM)
|
|
- **Unified Memory** - same pointer value works for both CPU and GPU
|
|
- **Zero-copy** possible - no data transfer needed
|
|
- OpenGL and CUDA can access the same memory without mapping/unmapping
|
|
- Memory is always "mapped" because there's only one memory space
|
|
|
|
### Desktop GPU (dGPU) - Separate Memory Architecture
|
|
|
|
```
|
|
┌─────────────┐ PCIe Bus ┌─────────────┐
|
|
│ Host PC │◄───────────────────────►│ Discrete │
|
|
│ │ (High bandwidth but │ GPU │
|
|
│ ┌───────┐ │ with latency) │ ┌───────┐ │
|
|
│ │ CPU │ │ │ │ GPU │ │
|
|
│ └───┬───┘ │ │ └───┬───┘ │
|
|
│ │ │ │ │ │
|
|
│ ┌───▼───┐ │ │ ┌───▼───┐ │
|
|
│ │ RAM │ │ │ │ VRAM │ │
|
|
│ │(Host) │ │ │ │(Device│ │
|
|
│ └───────┘ │ │ │Memory)│ │
|
|
└─────────────┘ │ └───────┘ │
|
|
└─────────────┘
|
|
```
|
|
|
|
**Characteristics:**
|
|
- CPU and GPU have **separate physical memories** (Host RAM vs VRAM)
|
|
- Data must be **transferred** between host and device memory
|
|
- OpenGL texture/buffer lives in GPU memory
|
|
- CUDA must **explicitly map** the resource to get a device pointer
|
|
- **Mapping/Unmapping** is required to synchronize access between OpenGL and CUDA
|
|
|
|
## Why the Fix is Required on Desktop GPUs
|
|
|
|
### The Problem with Permanent Mapping on dGPU
|
|
|
|
When a resource is permanently mapped on desktop GPUs:
|
|
|
|
1. **CUDA holds the mapping** - thinks it owns the memory
|
|
2. **OpenGL tries to use the texture/buffer** for rendering
|
|
3. **Memory coherency issues** - CUDA writes may not be visible to OpenGL
|
|
4. **Synchronization race conditions** - undefined behavior, corruption, or crashes
|
|
|
|
### Why It Works on Jetson
|
|
|
|
On Jetson with unified memory:
|
|
|
|
1. **Same physical memory** - both CUDA and OpenGL access the same location
|
|
2. **No data transfer needed** - writes are immediately visible
|
|
3. **Cache coherency handled by hardware** - no explicit synchronization required
|
|
4. **Mapping is essentially a no-op** - just gives a pointer to shared memory
|
|
|
|
## The Correct Pattern (Applied in Our Fix)
|
|
|
|
```cpp
|
|
// Every frame:
|
|
|
|
// 1. Map resource for CUDA access
|
|
// - On dGPU: Sets up memory mapping, ensures coherency
|
|
// - On Jetson: Lightweight operation (already unified)
|
|
cudaGraphicsMapResources(1, &resource, stream);
|
|
|
|
// 2. Get pointer/array for CUDA to use
|
|
cudaGraphicsResourceGetMappedPointer(&ptr, &size, resource);
|
|
// or
|
|
cudaGraphicsSubResourceGetMappedArray(&arr, resource, 0, 0);
|
|
|
|
// 3. Copy/process data with CUDA
|
|
cudaMemcpy(ptr, data, size, cudaMemcpyHostToDevice);
|
|
|
|
// 4. Synchronize to ensure copy completes
|
|
cudaStreamSynchronize(stream);
|
|
|
|
// 5. Unmap so OpenGL can use it
|
|
// - On dGPU: Releases mapping, makes data available to OpenGL
|
|
// - On Jetson: Still good practice, minimal overhead
|
|
cudaGraphicsUnmapResources(1, &resource, stream);
|
|
|
|
// 6. OpenGL rendering can now safely use the resource
|
|
// glBindTexture(GL_TEXTURE_2D, texture);
|
|
// glDrawArrays(...);
|
|
```
|
|
|
|
## CUDA Version and Driver Changes
|
|
|
|
### CUDA 12.8 Behavior Changes
|
|
|
|
According to NVIDIA Developer Forums (2025):
|
|
|
|
> "CUDA randomly freezes when using cudaStreamSynchronize with OpenGL interop... `cudaGraphicsMapResources` needs to be called every frame"
|
|
|
|
This suggests that newer CUDA versions and drivers have **tightened synchronization requirements**:
|
|
|
|
- **Older CUDA/drivers** may have been more lenient with permanent mapping
|
|
- **CUDA 12.8+** enforces proper map/unmap cycles for correctness
|
|
- **Stricter memory ordering** between CUDA and OpenGL contexts
|
|
|
|
### Why JetPack 5.1.2 Still Works
|
|
|
|
- **Older driver branch** - Jetson drivers lag behind desktop
|
|
- **Unified memory hides the issue** - no actual data movement needed
|
|
- **Different driver implementation** - Tegra drivers optimized for unified memory path
|
|
|
|
## Performance Considerations
|
|
|
|
### Desktop GPU (dGPU)
|
|
|
|
| Approach | Overhead | Correctness |
|
|
|----------|----------|-------------|
|
|
| Permanent mapping | Low | **Broken on CUDA 12.8+** |
|
|
| Map/unmap per frame | Higher | **Correct** |
|
|
|
|
**Optimization for dGPU:**
|
|
- Use `cudaStreamSynchronize(0)` or explicit events
|
|
- Batch multiple copies if possible
|
|
- Consider using CUDA arrays directly if no OpenGL needed
|
|
|
|
### Jetson (iGPU)
|
|
|
|
| Approach | Overhead | Correctness |
|
|
|----------|----------|-------------|
|
|
| Permanent mapping | Minimal | Works (unified memory) |
|
|
| Map/unmap per frame | Minimal | Works (good practice) |
|
|
|
|
**Recommendation:** Use the same map/unmap pattern on both platforms for code portability, even though Jetson is more forgiving.
|
|
|
|
## References
|
|
|
|
### NVIDIA Documentation
|
|
|
|
1. **CUDA for Tegra** - https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html
|
|
> "Tegra's integrated GPU (iGPU) shares the same SoC DRAM with CPU... contrasting with dGPUs that have separate memory"
|
|
|
|
2. **CUDA Runtime API - OpenGL Interop** - https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OPENGL.html
|
|
|
|
3. **Unified Memory on Jetson** - https://forums.developer.nvidia.com/t/unified-memory-on-jetson-platforms/187448
|
|
|
|
### Community Discussions
|
|
|
|
1. **NVIDIA Developer Forums** (2023-2025)
|
|
- "cudaGraphicsMapResources each frame or just once"
|
|
- Users report CUDA 12.8 requires per-frame mapping
|
|
- Jetson works with permanent mapping due to unified memory
|
|
|
|
2. **CUDA Random Freezes with OpenGL** (2025)
|
|
- https://forums.developer.nvidia.com/t/cuda-randomly-freezes-when-using-cudastreamsynchronize-with-opengl-interop/318514
|
|
- Confirms stricter synchronization in newer CUDA versions
|
|
|
|
## Conclusion
|
|
|
|
The fix applied to `GLViewer.cpp` (map/unmap every frame) is **required for correctness on desktop GPUs** and is **good practice on Jetson**:
|
|
|
|
1. **Desktop RTX GPUs** - Separate VRAM requires explicit synchronization between CUDA and OpenGL
|
|
2. **Jetson iGPU** - Unified memory is more forgiving, but same pattern works
|
|
3. **CUDA 12.8+** - Stricter driver requirements enforce proper resource management
|
|
4. **Code portability** - Same code works correctly on both platforms
|
|
|
|
The architectural difference explains why the original code worked on Jetson but failed on your RTX 3090/5070Ti with CUDA 12.8.
|