Autonomous vehicles need to see by themselves and perceive their environment to make sense of it and make appropriate decisions when required.
To perceive their environment, autonomous vehicles collect different types of data with the help of sensors, such as cameras, LiDAR, RADAR, GPS, or ultrasonics.
This article will walk you through how to fuse raw data coming from cameras and LiDAR scanners. This is an essential part of the perception process in the field of self-driving cars.
We will see that it is all about linear algebra. If you find this article difficult to understand because of the linear algebra part, you can check 3Blue1Brown’s excellent Essence of Linear Algebra video series.
Without further ado, let’s get right into it!
What is Sensor Fusion
Sensor Fusion is the process of combining sensory data coming from multiple sources. It helps to reduce the uncertainty on the information compared as if the data sources were used individually.
We need to distinguish between Early Sensor Fusion and Late Sensor Fusion.
Early Sensor Fusion Vs. Late Sensor Fusion
Early Sensor Fusion is about fusing 3D point clouds with 2D images. Here, we do not combine the results of the detections, but instead, we combine the raw data, e.g., the pixels and the point clouds. Early Sensor Fusion is a 3D to 2D projection and is performed before the object detection. This is why it is called Early Fusion.
On the other hand, Late Sensor Fusion happens after object detection, and tracking is performed.
This article shows an example of Early Sensor Fusion implementation.
In this article, we describe Early Sensor Fusion with Stereo Cameras and a LiDAR Scanner.
Different Types of Sensors: Camera Vs. LiDAR Scanner
Cameras are affordable and can collect a lot of information but cannot collect information about the distance. Therefore, we need to use a LiDAR scanner that works like radar but with infrared instead.
How does a Camera works
A camera is a sensor used to capture a 3D point and convert it to a pixel in an image.
Source: Visual Fusion for Autonomous Cars - PyImageSearch
It is a two steps process:
- Convert the real-world point to the camera coordinates using EXTRINSIC parameters
- Convert the point from the camera to the pixel coordinates using INTRINSIC parameters.
On the one hand, extrinsic parameters are rotation and translation matrices used to convert information from the world frame to the camera frame.
On the other hand, intrinsic parameters are the internal camera parameters, such as the focal length, to convert these information into pixels.
Source: Visual Fusion for Autonomous Cars - PyImageSearch
These intrinsic and extrinsic matrices are automatically calculated using a process called Camera Calibration.
Every camera needs calibration. Calibration is the process of converting a Real-World 3D point with (x,y,z) coordinates to a 2D pixel with (x,y) coordinates.
We talk about extrinsic calibration when we refer to going from world coordinates to camera coordinates and intrinsic calibration when we refer to the process of going from camera coordinates to pixel coordinates.
The purpose of the calibration process is to rectify images and avoid image distortions.
Because each camera is different and has its parameters, we need to know the specific parameters of the camera we are working with for sensor fusion.
How does a LiDAR Scanner Works
LiDAR stands for Light Detection And Ranging. LiDAR works with laser beams in the infrared. In a way, a LiDAR is similar to a Radar in its principle but uses light instead. A LiDAR is also a more accurate and less noisy sensor than a Radar.
LiDARs are Time-Of-Flight sensors, meaning that they use the time it takes to hit a target and come back to measure the distance. They are helpful to get an accurate distance and depth measure of an obstacle.
The distance is calculated with the following equation:
- d = distance
- c = speed of light
- T = time of flight
On Wikipedia, the time of flight (ToF) is defined as the measurement of the time taken by an object, particle, or wave (be it acoustic, electromagnetic, etc.) to travel a distance through a medium.
LiDAR scanners generate point clouds with XYZ positions.
In this project, our LiDAR scanner is a Velodyne, a 360° rotating sensor, similar to the image above.
Data Origin of Our Early Sensor Fusion Project
The car used to collect data is a modified Volkswagen Passat B6 with the following sensors:
- 1 Inertial Navigation System (GPS/IMU): OXTS RT 3003
- 1 Laserscanner: Velodyne HDL-64E
- 2 Grayscale cameras, 1.4 Megapixels: Point Grey Flea 2 (FL2-14S3M-C)
- 2 Color cameras, 1.4 Megapixels: Point Grey Flea 2 (FL2-14S3C-C)
- 4 Varifocal lenses, 4-8 mm: Edmund Optics NT59-917
You can find more information about the data and the car used to collect it on the KITTI Vision Benchmark Suite website.
For this project, we need to pay attention to the following variables:
- Velo-To-Cam is a variable that we will call V2C — It gives the rotation and translation matrices from the Velodyne to the Left Grayscale camera.
- R0_rect is used in Stereo Vision to make the images co-planar, which is our case here.
- P2 is the matrix obtained after camera calibration. It contains the intrinsic and the extrinsic matrices.
How To Project a 3D Points Cloud on a 2D Image
In this project, we want to project a point cloud to an image by converting a point X in 3D into a Y point in 2D.
Visually, here is what we want to achieve:
It is mathematically translated as:
Y = P x R0 x R|T x X
- Y = point in 2D (pixels)
- P = intrinsic calibration matrix
- R0 = stereo rectification matrix
- R|T = Velodyne to camera matrix
- X = point in 3D
Enter The Matrix
Following the mathematical equation previously stated, you noticed that we need to multiply several matrices together. But what do they represent?
First, we need the intrinsic calibration matrix P. When you purchase a camera, it already comes pre-calibrated, and you do not have to perform this step. However, it is a required step for every camera.
Second, we need the stereo rectification matrix R0. This matrix is not required if we use a monocular camera. We need to perform this step because we are in a stereo setting with cameras working together.
Third, R|T is the transition matrix from the Velodyne to the camera frame. This is what allows us to transition data from one sensor to another.
Finally, we need to know our sensors’ exact position and coordinate system to project our points.
How To Perform Sensor Fusion Algorithm
To illustrate the theory shared above, let’s see how it works for one point.
First, we need to find the coordinates of one 2D point on the image and the coordinates of the 3D point from the points cloud generated by the LiDAR scanner.
To apply sensor fusion, what we need to do is to select the Points that are visible in the image, convert them from the LiDAR frame to the Camera Frame, and finally find a way to project the points from the Camera Frame to the Image Frame.
We take one of our images and its associated points cloud as follow:
In our calibrated files, we have the following matrices:
1P :[[7.215377e+02 0.000000e+00 6.095593e+02 4.485728e+01]2 [0.000000e+00 7.215377e+02 1.728540e+02 2.163791e-01]3 [0.000000e+00 0.000000e+00 1.000000e+00 2.745884e-03]]4-5RO [[ 0.9999239 0.00983776 -0.00744505]6 [-0.0098698 0.9999421 -0.00427846]7 [ 0.00740253 0.00435161 0.9999631 ]]8-9Velo 2 Cam [[ 7.533745e-03 -9.999714e-01 -6.166020e-04 -4.069766e-03]10 [ 1.480249e-02 7.280733e-04 -9.998902e-01 -7.631618e-02]11 [ 9.998621e-01 7.523790e-03 1.480755e-02 -2.717806e-01]]12-
Previously, we wrote the following mathematical equation to fuse the visual sensors:
Y(2D) = P x R0 x R|t x X(3D)
However, when looking at the dimensions, P is a 3x4 matrix, R0 is a 3x3 matrix, R|t (Velo 2 Cam) is a 3x4 matrix, and X is a 3x1 matrix.
We need to convert R0 as a 4x3 matrix and X as a 4x1 matrix by converting the points into homogeneous coordinates.
First, we need to convert R0 into homogeneous coordinates and multiply this value by P. Here is the result:
1[[73.70800018 6.42700005 2.71099997]]2[[ 7.25995070e+02 9.75088151e+00 6.04164924e+02 4.48572800e+01]3 [-5.84187278e+00 7.22248117e+02 1.69760552e+02 2.16379100e-01]4 [ 7.40252700e-03 4.35161400e-03 9.99963100e-01 2.74588400e-03]]
We have the point P with the following coordinates:
1[[73.70800018 6.42700005 2.71099997]]
And R0 with the following coordinates:
1[[ 7.25995070e+02 9.75088151e+00 6.04164924e+02 4.48572800e+01]2 [-5.84187278e+00 7.22248117e+02 1.69760552e+02 2.16379100e-01]3 [ 7.40252700e-03 4.35161400e-03 9.99963100e-01 2.74588400e-03]]
Then, we need to multiply the result of P x R0 by R|T, and finally to multiply by X. X needs to be converted to a 4 x 1 matrix. The output will be the homogeneous coordinates of our single point.
We have the following 3D point X:
1[[73.70800018 6.42700005 2.71099997]]
And Y in homogeneous coordinates:
1[[40176.41872366]2 [11292.90007164]3 [ 73.46372075]]
To get the pixel value, we need to divide by 73.46 to return to the euclidian space. Here is the result:
1[[73.70800018 6.42700005 2.71099997]]2Euclidean Pixels [[546.88788308 153.72077478]]
Let’s verify our result graphically:
Our point is actually in the tree. While it’s a bit approximative as we looked at our point manually, we know that it works. Let’s verify that we get the same thing with the code:
And here is the result with all the points:
This article showed you how to perform point pixel projection, which implements Early Sensor Fusion as it is completed before object detection.
Together we learned:
- How a camera works
- How a LiDAR scanner works
- How to perform early sensor fusion by projecting a 3D point on a 2D image
This is an essential step for self-driving cars and autonomous vehicles in general.
You can watch the demo on Youtube here.
If you liked this content, feel free to subscribe to my newsletter here to receive my updates.