
CUDA-OpenGL Interop: Jetson vs Desktop GPU Differences

Summary

The CUDA-OpenGL interop issue, in which cudaGraphicsMapResources must be called every frame on modern desktop GPUs (RTX 3090/5070 Ti with CUDA 12.8) while permanent mapping still works on Jetson (JetPack 5.1.2), is rooted in a fundamental architectural difference: integrated GPUs (iGPU) share physical memory with the CPU, while discrete GPUs (dGPU) do not.

Root Cause: Memory Architecture Differences

Jetson (Tegra/iGPU) - Unified Memory Architecture

┌─────────────────────────────────────────┐
│           Jetson SoC                    │
│  ┌─────────┐      ┌─────────┐          │
│  │   CPU   │◄────►│   GPU   │          │
│  │  (ARM)  │      │ (iGPU)  │          │
│  └────┬────┘      └────┬────┘          │
│       │                │               │
│       └───────┬────────┘               │
│               ▼                        │
│         ┌──────────┐                   │
│         │ Shared   │  Unified Memory   │
│         │  DRAM    │  (Zero Copy)      │
│         └──────────┘                   │
└─────────────────────────────────────────┘

Characteristics:

  • CPU and GPU share the same physical memory (SoC DRAM)
  • Unified Memory - same pointer value works for both CPU and GPU
  • Zero-copy possible - no data transfer needed
  • OpenGL and CUDA can access the same memory without mapping/unmapping
  • Memory is always "mapped" because there's only one memory space
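The zero-copy behavior can be sketched with mapped pinned memory (a minimal illustration, not from GLViewer.cpp; on Jetson the two pointers below refer to the same SoC DRAM, whereas on a dGPU a real transfer would still be needed):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Allow mapping host allocations into the device address space
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float* host_ptr = nullptr;
    cudaHostAlloc(&host_ptr, 1024 * sizeof(float), cudaHostAllocMapped);

    float* dev_ptr = nullptr;
    cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0);

    // On Tegra, a kernel launched with dev_ptr reads and writes the very
    // memory the CPU sees through host_ptr; no copy, no PCIe traffic.
    std::printf("host %p / device %p\n", (void*)host_ptr, (void*)dev_ptr);

    cudaFreeHost(host_ptr);
    return 0;
}
```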

Desktop GPU (dGPU) - Separate Memory Architecture

┌─────────────┐         PCIe Bus         ┌─────────────┐
│   Host PC   │◄───────────────────────►│  Discrete   │
│             │   (High bandwidth but   │    GPU      │
│  ┌───────┐  │    with latency)        │  ┌───────┐  │
│  │  CPU  │  │                         │  │  GPU  │  │
│  └───┬───┘  │                         │  └───┬───┘  │
│      │      │                         │      │      │
│  ┌───▼───┐  │                         │  ┌───▼───┐  │
│  │  RAM  │  │                         │  │ VRAM  │  │
│  │(Host) │  │                         │  │(Device│  │
│  └───────┘  │                         │  │Memory)│  │
└─────────────┘                         │  └───────┘  │
                                        └─────────────┘

Characteristics:

  • CPU and GPU have separate physical memories (Host RAM vs VRAM)
  • Data must be transferred between host and device memory
  • OpenGL texture/buffer lives in GPU memory
  • CUDA must explicitly map the resource to get a device pointer
  • Mapping/Unmapping is required to synchronize access between OpenGL and CUDA

Why the Fix is Required on Desktop GPUs

The Problem with Permanent Mapping on dGPU

When a resource is permanently mapped on desktop GPUs:

  1. CUDA holds the mapping - thinks it owns the memory
  2. OpenGL tries to use the texture/buffer for rendering
  3. Memory coherency issues - CUDA writes may not be visible to OpenGL
  4. Synchronization race conditions - undefined behavior, corruption, or crashes

Why It Works on Jetson

On Jetson with unified memory:

  1. Same physical memory - both CUDA and OpenGL access the same location
  2. No data transfer needed - writes are immediately visible
  3. Cache coherency handled by hardware - no explicit synchronization required
  4. Mapping is essentially a no-op - just gives a pointer to shared memory

The Correct Pattern (Applied in Our Fix)

// Every frame:

// 1. Map resource for CUDA access
//    - On dGPU: Sets up memory mapping, ensures coherency
//    - On Jetson: Lightweight operation (already unified)
cudaGraphicsMapResources(1, &resource, stream);

// 2. Get pointer/array for CUDA to use
cudaGraphicsResourceGetMappedPointer(&ptr, &size, resource);
// or
cudaGraphicsSubResourceGetMappedArray(&arr, resource, 0, 0);

// 3. Copy/process data with CUDA (async on the interop stream, so the
//    synchronize in step 4 actually covers this copy)
cudaMemcpyAsync(ptr, data, size, cudaMemcpyHostToDevice, stream);

// 4. Synchronize to ensure copy completes
cudaStreamSynchronize(stream);

// 5. Unmap so OpenGL can use it
//    - On dGPU: Releases mapping, makes data available to OpenGL
//    - On Jetson: Still good practice, minimal overhead
cudaGraphicsUnmapResources(1, &resource, stream);

// 6. OpenGL rendering can now safely use the resource
//    glBindTexture(GL_TEXTURE_2D, texture);
//    glDrawArrays(...);

CUDA Version and Driver Changes

CUDA 12.8 Behavior Changes

According to NVIDIA Developer Forums (2025):

"CUDA randomly freezes when using cudaStreamSynchronize with OpenGL interop... cudaGraphicsMapResources needs to be called every frame"

This suggests that newer CUDA versions and drivers have tightened synchronization requirements:

  • Older CUDA/drivers may have been more lenient with permanent mapping
  • CUDA 12.8+ enforces proper map/unmap cycles for correctness
  • Stricter memory ordering between CUDA and OpenGL contexts

Why JetPack 5.1.2 Still Works

  • Older driver branch - Jetson drivers lag behind desktop
  • Unified memory hides the issue - no actual data movement needed
  • Different driver implementation - Tegra drivers optimized for unified memory path

Performance Considerations

Desktop GPU (dGPU)

| Approach            | Overhead | Correctness           |
|---------------------|----------|-----------------------|
| Permanent mapping   | Low      | Broken on CUDA 12.8+  |
| Map/unmap per frame | Higher   | Correct               |

Optimization for dGPU:

  • Use cudaStreamSynchronize(0) or explicit events
  • Batch multiple copies if possible
  • Consider using CUDA arrays directly if no OpenGL needed
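The event-based alternative in the first bullet can be sketched as follows (assuming the same resource/stream setup as the pattern above; copy_done is an illustrative name):

```cpp
// Use a CUDA event instead of a blocking full-stream wait: the CPU waits
// only for stream work recorded before the event.
cudaEvent_t copy_done;
cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);

cudaGraphicsMapResources(1, &resource, stream);
cudaGraphicsResourceGetMappedPointer(&ptr, &size, resource);

// Batch the frame's copies into the stream, then record one event.
cudaMemcpyAsync(ptr, data, size, cudaMemcpyHostToDevice, stream);
cudaEventRecord(copy_done, stream);

cudaEventSynchronize(copy_done);   // wait for the copies, nothing more
cudaGraphicsUnmapResources(1, &resource, stream);

cudaEventDestroy(copy_done);
```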

Jetson (iGPU)

| Approach            | Overhead | Correctness             |
|---------------------|----------|-------------------------|
| Permanent mapping   | Minimal  | Works (unified memory)  |
| Map/unmap per frame | Minimal  | Works (good practice)   |

Recommendation: Use the same map/unmap pattern on both platforms for code portability, even though Jetson is more forgiving.

References

NVIDIA Documentation

  1. CUDA for Tegra - https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html

    "Tegra's integrated GPU (iGPU) shares the same SoC DRAM with CPU... contrasting with dGPUs that have separate memory"

  2. CUDA Runtime API - OpenGL Interop - https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OPENGL.html

  3. Unified Memory on Jetson - https://forums.developer.nvidia.com/t/unified-memory-on-jetson-platforms/187448

Community Discussions

  1. NVIDIA Developer Forums (2023-2025)

    • "cudaGraphicsMapResources each frame or just once"
    • Users report CUDA 12.8 requires per-frame mapping
    • Jetson works with permanent mapping due to unified memory
  2. CUDA Random Freezes with OpenGL (2025)

Conclusion

The fix applied to GLViewer.cpp (map/unmap every frame) is required for correctness on desktop GPUs and is good practice on Jetson:

  1. Desktop RTX GPUs - Separate VRAM requires explicit synchronization between CUDA and OpenGL
  2. Jetson iGPU - Unified memory is more forgiving, but same pattern works
  3. CUDA 12.8+ - Stricter driver requirements enforce proper resource management
  4. Code portability - Same code works correctly on both platforms

The architectural difference explains why the original code worked on Jetson but failed on your RTX 3090/5070Ti with CUDA 12.8.