camera-extrinsic-play/note.md at dbe366208881e47520e6f72aaeafdd8ead835581


crosstyan 6406dbd19f Add comprehensive documentation on coordinate system conversions in note.md. Explain the necessity and order of transformations for world and camera coordinates, clarifying the distinction between Z-up to Y-up and OpenCV to OpenGL conversions.

2025-03-25 10:22:26 +08:00


I'll write down the transformation sequence using LaTeX notation. Let me break down the complete operation:

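The final transform handed to Three.js is the inverse of the full world-to-camera chain. Applied to a point P (given in OpenGL camera space, since the inverse maps camera space back into the world), it can be written as:

$$
P_{final} = (C_{GL} \cdot E \cdot W_{Y})^{-1} \cdot P
$$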

Where:

- $E$ is your original extrinsic matrix (world-to-camera transform)
- $W_{Y}$ is the Z-up to Y-up world conversion matrix
- $C_{GL}$ is the OpenCV to OpenGL camera conversion matrix
- $^{-1}$ denotes matrix inversion

Breaking down each matrix:


$$
W_{Y} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & -1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$

$$
C_{GL} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 \\
0 & 0 & -1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$
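As a quick sanity check, $W_{Y}$ can be applied to a couple of axis vectors. This is only a sketch: it uses plain row-major arrays rather than Three.js, and the helper name `apply` is made up for illustration. Note that `Matrix4.set()` takes its arguments in row-major order while `Matrix4.fromArray()` expects column-major data, so how these rows map onto the actual code is an assumption worth checking.

```javascript
// Row-major 4x4 matrix, written exactly as in the equation above.
const W_Y = [
  [1, 0, 0, 0],
  [0, 0, 1, 0],
  [0, -1, 0, 0],
  [0, 0, 0, 1],
];

// Multiply a 4x4 matrix by a homogeneous column vector [x, y, z, w].
function apply(m, v) {
  return m.map((row) => row.reduce((sum, a, j) => sum + a * v[j], 0));
}

// Direction vectors use w = 0, so a translation would not affect them.
const zUpAxis = [0, 0, 1, 0];  // "up" in the Z-up world
const yAxis = [0, 1, 0, 0];    // +Y in the Z-up world

const upAfter = apply(W_Y, zUpAxis); // lands on +Y, the Y-up "up" axis
const yAfter = apply(W_Y, yAxis);    // lands on -Z

console.log(upAfter, yAfter);
```

The first result confirms that the Z-up "up" axis ends up on +Y; the second shows that the old +Y axis is rotated onto -Z, so the matrix is a 90° rotation about X rather than a mirror.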

The order of operations (from right to left) is:

1. Convert world from Z-up to Y-up ($W_{Y}$)
2. Apply camera transform ($E$)
3. Convert camera space to OpenGL convention ($C_{GL}$)
4. Invert the whole transformation ($^{-1}$)

This matches our code where we write:

```javascript
finalTransform
  .multiply(cameraConversion)    // C_GL
  .multiply(worldToCamera)       // E
  .multiply(worldZupToYup)       // W_Y
  .invert()                      // ^{-1}
```
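To make the chain concrete, here is a minimal sketch in plain JavaScript, with row-major arrays standing in for `THREE.Matrix4`. The values in `worldToCamera` are invented for illustration (a 90° rotation about Z plus a translation), and the helper `invertRigid` assumes the product is a rigid transform, which holds when E is a rotation plus a translation.

```javascript
// Row-major 4x4 product a × b.
function matMul(a, b) {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, x, k) => sum + x * b[k][j], 0))
  );
}

// Inverse of a rigid transform [R | t] is [R^T | -R^T t].
function invertRigid(m) {
  const inv = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]];
  for (let i = 0; i < 3; i++) {
    for (let j = 0; j < 3; j++) {
      inv[i][j] = m[j][i];
      inv[i][3] -= m[j][i] * m[j][3];
    }
  }
  return inv;
}

const cameraConversion = [[1, 0, 0, 0], [0, -1, 0, 0], [0, 0, -1, 0], [0, 0, 0, 1]]; // C_GL
const worldZupToYup = [[1, 0, 0, 0], [0, 0, 1, 0], [0, -1, 0, 0], [0, 0, 0, 1]];     // W_Y
// Hypothetical extrinsic E: rotate 90° about Z, then translate by (1, 2, 3).
const worldToCamera = [[0, -1, 0, 1], [1, 0, 0, 2], [0, 0, 1, 3], [0, 0, 0, 1]];

// Same left-to-right order as the chained multiply() calls.
const chain = matMul(matMul(cameraConversion, worldToCamera), worldZupToYup);
const finalTransform = invertRigid(chain);

// chain × finalTransform should come out as the identity matrix.
console.log(matMul(chain, finalTransform));
```

A general 4x4 inverse would work as well; the rigid shortcut is used here only to keep the sketch short.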

Let me explain the reasoning behind this order:

1. First, $W_{Y}$ (Z-up to Y-up):
   - We need to convert the world coordinates BEFORE applying the camera transform
   - Because your original extrinsic matrix $E$ was defined in Z-up world coordinates
   - If we didn't do this first, the camera transform would be applied in the wrong coordinate system
2. Then, $E$ (your extrinsic matrix):
   - This transforms points from the (now Y-up) world space into camera space
   - The camera space at this point is still in OpenCV convention
3. Finally, $C_{GL}$ (OpenCV to OpenGL):
   - This converts the camera space from OpenCV convention to OpenGL convention
   - It must be applied AFTER the camera transform because it's converting the camera's view space
   - If we applied it before, we'd be incorrectly modifying the world coordinates

Think about it like this:

1. We first need to fix the world coordinate system to match Three.js ($W_{Y}$)
2. Then we can apply your camera's view of this world ($E$)
3. Finally, we adapt how the camera "sees" to match OpenGL conventions ($C_{GL}$)

If we did it in a different order:

- Applying $C_{GL}$ before $E$ would incorrectly transform the world coordinates instead of the camera space
- Applying $W_{Y}$ after $E$ would transform the camera space instead of the world coordinates
- The transformations would effectively be applied to the wrong spaces

The inversion at the end (^{-1}) is needed because Three.js expects camera-to-world transforms rather than world-to-camera transforms.
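Spelled out, inverting the product reverses the order of the factors. Assuming $E$ is a rigid transform with rotation $R$ and translation $t$ (an assumption about the calibration data, not something stated above), each inverse has a simple closed form:

$$
(C_{GL} \cdot E \cdot W_{Y})^{-1} = W_{Y}^{-1} \cdot E^{-1} \cdot C_{GL}^{-1},
\qquad
E^{-1} = \begin{bmatrix} R^{\top} & -R^{\top} t \\ 0 & 1 \end{bmatrix}
$$

Both conversion matrices are orthogonal, so $W_{Y}^{-1} = W_{Y}^{\top}$ and $C_{GL}^{-1} = C_{GL}$. Read from right to left, the inverted chain takes a point from OpenGL camera space back to OpenCV camera space, then through $E^{-1}$, then through $W_{Y}^{-1}$.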

Ah yes, you're absolutely right! I made a mistake in my explanation. In matrix multiplication, the operations are indeed applied from right to left.

When we write:


$$
P_{final} = (C_{GL} \cdot E \cdot W_{Y})^{-1} \cdot P
$$

The actual order of operations on point P (before inversion) would be:

1. $W_{Y} \cdot P$: Convert from Z-up to Y-up world coordinates
2. $E \cdot (W_{Y} \cdot P)$: Apply camera transform
3. $C_{GL} \cdot (E \cdot (W_{Y} \cdot P))$: Convert to OpenGL camera convention

In our code:

```javascript
finalTransform
  .multiply(cameraConversion)    // C_GL
  .multiply(worldToCamera)       // E
  .multiply(worldZupToYup)       // W_Y
```

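Three.js's `multiply()` post-multiplies: `a.multiply(b)` sets `a` to `a × b`, while `premultiply()` is the method that puts its argument on the left. Starting from an identity `finalTransform`, the chained calls therefore build $C_{GL} \cdot E \cdot W_{Y}$ in exactly the order they are written, which is why our code worked. Because the rightmost factor acts on a point first, $W_{Y}$ is still applied first and $C_{GL}$ last. It was my explanation of the order that was incorrect, not the code.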
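The difference between the two methods is easy to check with a small stand-in. As far as the Three.js documentation describes it, `Matrix4.multiply(m)` computes `this × m` and `Matrix4.premultiply(m)` computes `m × this`; the `TinyMatrix` class below is a hypothetical 2x2 imitation of that behaviour, not the real class.

```javascript
// Stand-in for THREE.Matrix4 with the same multiply/premultiply semantics
// (hypothetical class, 2x2 and row-major for brevity).
class TinyMatrix {
  constructor(rows) {
    this.rows = rows;
  }

  static product(a, b) {
    return a.map((row) =>
      b[0].map((_, j) => row.reduce((sum, x, k) => sum + x * b[k][j], 0))
    );
  }

  // this = this × m (post-multiplication)
  multiply(m) {
    this.rows = TinyMatrix.product(this.rows, m.rows);
    return this;
  }

  // this = m × this (pre-multiplication)
  premultiply(m) {
    this.rows = TinyMatrix.product(m.rows, this.rows);
    return this;
  }
}

const identity = () => new TinyMatrix([[1, 0], [0, 1]]);
const A = new TinyMatrix([[1, 1], [0, 1]]);
const B = new TinyMatrix([[2, 0], [0, 1]]);

const chained = identity().multiply(A).multiply(B);        // A × B = [[2, 1], [0, 1]]
const reversed = identity().premultiply(A).premultiply(B); // B × A = [[2, 2], [0, 1]]

console.log(chained.rows, reversed.rows);
```

Chaining `multiply()` keeps the written order in the product, whereas chaining `premultiply()` reverses it.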

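Reading the product from left to right, the factors are:

1. $C_{GL}$: Convert camera space to OpenGL convention
2. $E$: Transform from world to camera
3. $W_{Y}$: Convert world from Z-up to Y-up

Applied to a point, they act in the reverse order: $W_{Y}$ first, then $E$, then $C_{GL}$.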

Thank you for catching this! It's crucial to understand the correct order of operations, especially when dealing with coordinate system transformations.

Let me break this down more clearly:

Why do we need these conversions?

1. World Coordinate Conversion (Z-up to Y-up):
   - Your original data uses a Z-up coordinate system (Z is "up" in the real world)
   - Three.js uses a Y-up coordinate system (Y is "up" in Three.js)
   - We need to convert between these two world coordinate systems
2. Camera Coordinate Conversion (OpenCV to OpenGL):
   - Camera coordinate systems define how the camera "sees" the world
   - OpenCV uses a right-handed system where X points right, Y points down, and Z points forward from the camera
   - OpenGL/Three.js also uses a right-handed system, but Y points up and Z points backward, so the camera looks down its -Z axis
   - We need to convert between these camera conventions, which is why $C_{GL}$ flips both Y and Z
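A small numeric check of the camera conversion, assuming the usual axis conventions (OpenCV: +X right, +Y down, +Z forward; OpenGL and Three.js: +X right, +Y up, camera looking down -Z). The helper name `toOpenGL` and the sample point are made up for illustration.

```javascript
// C_GL flips the Y and Z axes of a camera-space point (row-major).
const C_GL = [
  [1, 0, 0, 0],
  [0, -1, 0, 0],
  [0, 0, -1, 0],
  [0, 0, 0, 1],
];

// Convert a homogeneous point from OpenCV to OpenGL camera coordinates.
function toOpenGL(pointCv) {
  return C_GL.map((row) => row.reduce((sum, a, j) => sum + a * pointCv[j], 0));
}

// A point 2 units ahead of the camera and 1 unit below its optical axis,
// in OpenCV camera coordinates (+Y is down, +Z is forward).
const pointCv = [0, 1, 2, 1];
const pointGl = toOpenGL(pointCv);

console.log(pointGl); // [0, -1, -2, 1]: below is -Y, ahead is -Z in OpenGL

// Applying the flip twice returns the original point: C_GL is its own inverse.
const roundTrip = toOpenGL(pointGl);
console.log(roundTrip);
```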

What's the difference?

- World conversion (Z-up to Y-up) affects how we represent points in the world
- Camera conversion (OpenCV to OpenGL) affects how the camera interprets what it sees

These are completely separate transforms that operate on different coordinate spaces:

- One operates on the world before the camera sees it
- The other operates on the camera's view of the world

Why this specific order?

The order matters because we're dealing with a transformation pipeline:

1. First, we want to convert the world from Z-up to Y-up (`worldCvt`)
   - This makes the world match Three.js's expectations
2. Then, apply the camera's extrinsic matrix (`Rt`)
   - This transforms world points into the camera's coordinate system
   - At this point, we're in OpenCV camera space
3. Finally, convert from OpenCV to OpenGL camera conventions (`cameraCvt`)
   - This converts the camera space to match Three.js's expectations

If we changed the order:

- Applying camera conversion before the camera transformation would incorrectly transform world points
- Applying world conversion after the camera transformation would transform points that are already in camera space
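A minimal numeric illustration of why the camera conversion has to sit to the left of the extrinsic. The `Rt` used here is hypothetical (a pure translation by (1, 2, 3)), chosen so the difference is easy to read off; the helper names are also made up.

```javascript
// Row-major 4x4 product a × b.
function matMul(a, b) {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, x, k) => sum + x * b[k][j], 0))
  );
}

// Multiply a 4x4 matrix by a homogeneous column vector.
function apply(m, v) {
  return m.map((row) => row.reduce((sum, a, j) => sum + a * v[j], 0));
}

const cameraCvt = [[1, 0, 0, 0], [0, -1, 0, 0], [0, 0, -1, 0], [0, 0, 0, 1]];
// Hypothetical extrinsic: no rotation, translation (1, 2, 3).
const Rt = [[1, 0, 0, 1], [0, 1, 0, 2], [0, 0, 1, 3], [0, 0, 0, 1]];

const worldOrigin = [0, 0, 0, 1];

// Flip applied after Rt, i.e. in camera space (the intended order).
const correct = apply(matMul(cameraCvt, Rt), worldOrigin);
// Flip applied before Rt, i.e. to the world point itself.
const swapped = apply(matMul(Rt, cameraCvt), worldOrigin);

console.log(correct); // [1, -2, -3, 1]
console.log(swapped); // [1, 2, 3, 1]
```

With the factors swapped, the axis flip hits the world origin (which it leaves unchanged) instead of the camera-space result, so the Y and Z signs never get converted.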

Think of it as a pipeline:

1. Fix the world coordinates (world conversion)
2. View the world through the camera (camera extrinsic)
3. Adjust how the camera interprets what it sees (camera conversion)

Because matrix multiplication is associative, the whole pipeline can be collapsed into a single matrix and applied in one operation. It is not commutative, though, so the order of the factors still determines the result.
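The collapsing step can be checked numerically: pushing a point through the three stages one at a time gives the same result as applying the pre-multiplied matrix once. This is a sketch with made-up values for `Rt` (a 90° rotation about X plus a translation) and an arbitrary test point.

```javascript
// Row-major 4x4 product a × b.
function matMul(a, b) {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, x, k) => sum + x * b[k][j], 0))
  );
}

// Multiply a 4x4 matrix by a homogeneous column vector.
function apply(m, v) {
  return m.map((row) => row.reduce((sum, a, j) => sum + a * v[j], 0));
}

const worldCvt = [[1, 0, 0, 0], [0, 0, 1, 0], [0, -1, 0, 0], [0, 0, 0, 1]];
const cameraCvt = [[1, 0, 0, 0], [0, -1, 0, 0], [0, 0, -1, 0], [0, 0, 0, 1]];
// Hypothetical extrinsic: rotate 90° about X, then translate by (4, 5, 6).
const Rt = [[1, 0, 0, 4], [0, 0, -1, 5], [0, 1, 0, 6], [0, 0, 0, 1]];

const point = [1, 2, 3, 1];

// Stage by stage: world conversion, then extrinsic, then camera conversion.
const stepByStep = apply(cameraCvt, apply(Rt, apply(worldCvt, point)));

// Collapsed once into a single matrix, then applied in one go.
const combined = matMul(matMul(cameraCvt, Rt), worldCvt);
const oneShot = apply(combined, point);

console.log(stepByStep, oneShot); // both are [5, -7, -9, 1]
```

Precomputing `combined` is what makes the single `finalTransform` matrix possible; the order inside the product still has to follow the pipeline.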

Matrix4.multiply
