Early Sensor Fusion For Autonomous Vehicles

Autonomous vehicles need to see by themselves and perceive their environment to make sense of it and make appropriate decisions when required.

TTo perceive their environment, autonomous vehicles such as self-driving cars collect different types of sensor data with the help of onboard sensors, such as cameras, LiDAR sensors, RADAR, GPS, or ultrasonics.

This article will walk you through the process of LiDAR and camera sensor fusion. This is one of the key components in building a perception system in the field of self-driving cars.

We will see that it is all about linear algebra. If you find this article difficult to understand because of the linear algebra part, you can check 3Blue1Brown’s excellent Essence of Linear Algebra video series.

Without further ado, let’s get right into it!

What is Sensor Fusion

Sensor Fusion is the process of combining sensory data coming from multiple sources. It helps to reduce the uncertainty on the information compared as if the data sources were used individually.

We need to distinguish between Early Sensor Fusion and Late Sensor Fusion.

Early Sensor Fusion Vs. Late Sensor Fusion For Automated Vehicles

Early Sensor Fusion is about fusing 3D point clouds with 2D images. Here, we do not combine the results of the detections, but instead, we combine the raw data, e.g., the pixels and the point clouds. Early Sensor Fusion is a 3D to 2D projection and is performed before the object detection. This is why it is called Early Fusion.

On the other hand, Late Sensor Fusion happens after object detection, and tracking is performed.

This article shows how to implement an Early Sensor Fusion algorithm before object detection.

In this article, we describe Early Sensor Fusion with Stereo Cameras and a LiDAR Scanner.

Different Types of Autonomy Sensors: Camera Vs. LiDAR Scanner

Camera sensors are affordable and can collect a lot of information but cannot collect information about the distance. Therefore, we need to use a LiDAR scanner that works like radar but with laser beams instead. The LiDAR is an important sensor in the perception pipeline of a self-driving car.

How does a Camera works

A camera is a sensor used to capture a 3D point and convert it to a pixel in an image.

Source: Visual Fusion for Autonomous Cars - PyImageSearch

It is a two steps process:

Convert the real-world point to the camera coordinates using EXTRINSIC parameters
Convert the point from the camera to the pixel coordinates using INTRINSIC parameters.

On the one hand, extrinsic parameters are rotation and translation matrices used to convert information from the world frame to the camera frame.

On the other hand, intrinsic parameters are the internal camera parameters, such as the focal length, to convert these information into pixels.

Source: Visual Fusion for Autonomous Cars - PyImageSearch

These intrinsic and extrinsic matrices are automatically calculated using a process called Camera Calibration.

Every camera needs calibration. Calibration is the process of converting a Real-World 3D point with (x,y,z) coordinates to a 2D pixel with (x,y) coordinates.

We talk about extrinsic calibration when we refer to going from world coordinates to camera coordinates and intrinsic calibration when we refer to the process of going from camera coordinates to pixel coordinates.

Source: LearnOpenCV

The purpose of the calibration process is to rectify images and avoid image distortions.

Because each camera is different and has its parameters, we need to know the specific parameters of the camera we are working with for sensor fusion.

How does a LiDAR Scanner Works

LiDAR stands for Light Detection And Ranging. LiDAR works with laser beams in the infrared. In a way, a LiDAR is similar to a Radar in its principle but uses light instead. A LiDAR is also a more accurate and less noisy sensor than a Radar.

LiDARs are Time-Of-Flight sensors, meaning that they use the time it takes to hit a target and come back to measure the distance. They are helpful to get an accurate distance and depth measure of an obstacle.

The distance is calculated with the following equation:

d = distance
c = speed of light
T = time of flight

On Wikipedia, the time of flight (ToF) is defined as the measurement of the time taken by an object, particle, or wave (be it acoustic, electromagnetic, etc.) to travel a distance through a medium.

LiDAR scanners generate point clouds with XYZ coordinates. If you are interested in point cloud data and how to work with it, in this article, I talk about PointNet, the first neural network that was taking 3d point cloud data directly as input.

In this project, our LiDAR scanner is a Velodyne, a 360° rotating sensor, similar to the image above.

Data Origin of Our Early Sensor Fusion Project

The car used to collect sensor data is a modified Volkswagen Passat B6 with multiple vision sensors:

1 Inertial Navigation System (GPS/IMU): OXTS RT 3003
1 Laserscanner: Velodyne HDL-64E
2 Grayscale cameras, 1.4 Megapixels: Point Grey Flea 2 (FL2-14S3M-C)
2 Color cameras, 1.4 Megapixels: Point Grey Flea 2 (FL2-14S3C-C)
4 Varifocal lenses, 4-8 mm: Edmund Optics NT59-917

You can find more information about the data and the car used to collect it on the KITTI Vision Benchmark Suite website.

For this project, we need to pay attention to the following variables:

Velo-To-Cam is a variable that we will call V2C — It gives the rotation and translation matrices from the Velodyne to the Left Grayscale camera.
R0_rect is used in Stereo Vision to make the images co-planar, which is our case here.
P2 is the matrix obtained after camera calibration. It contains the intrinsic and the extrinsic matrices.

How To Project a 3D Points Cloud on a 2D Image

In this project, we want to project a point cloud to an image by converting a point X in 3D into a Y point in 2D.

Visually, here is what we want to achieve:

It is mathematically translated as:

Y = P x R0 x R|T x X

Y = point in 2D (pixels)
P = intrinsic calibration matrix
R0 = stereo rectification matrix
R|T = Velodyne to camera matrix
X = point in 3D

Enter The Autonomous Driving Matrix

Following the mathematical equation previously stated, you noticed that we need to multiply several matrices together. But what do they represent?

First, we need the intrinsic calibration matrix P. When you purchase a camera, it already comes pre-calibrated, and you do not have to perform this step. However, it is a required step for every camera.

Second, we need the stereo rectification matrix R0. This matrix is not required if we use a monocular camera. We need to perform this step because we are in a stereo setting with cameras working together.

Third, R|T is the transition matrix from the Velodyne to the camera frame. This is what allows us to transition data from one sensor to another.

Finally, we need to know our sensors’ exact position and coordinate system to project our points.

How To Implement Sensor Fusion Algorithms For Autonomous Vehicles

To illustrate the theory shared above, let’s see how it works for one point.

First, we need to find the coordinates of one 2D point on the image and the coordinates of the 3D point from the points cloud generated by the LiDAR scanner.

To apply sensor fusion techniques, what we need to do is to select the points that are visible in the image, convert them from the LiDAR frame to the Camera Frame, and finally find a way to project the points from the Camera Frame to the Image Frame.

We take one of our camera images and its associated LiDAR point cloud as follow:

We take one of our images and its associated points cloud as follow:

In our calibrated files, we have the following matrices:

1P :[[7.215377e+02 0.000000e+00 6.095593e+02 4.485728e+01]
2 [0.000000e+00 7.215377e+02 1.728540e+02 2.163791e-01]
3 [0.000000e+00 0.000000e+00 1.000000e+00 2.745884e-03]]
4-
5RO [[ 0.9999239   0.00983776 -0.00744505]
6 [-0.0098698   0.9999421  -0.00427846]
7 [ 0.00740253  0.00435161  0.9999631 ]]
8-
9Velo 2 Cam [[ 7.533745e-03 -9.999714e-01 -6.166020e-04 -4.069766e-03]
10 [ 1.480249e-02  7.280733e-04 -9.998902e-01 -7.631618e-02]
11 [ 9.998621e-01  7.523790e-03  1.480755e-02 -2.717806e-01]]
12-

Previously, we wrote the following mathematical equation to fuse the vision sensors:

Y(2D) = P x R0 x R|t x X(3D)

However, when looking at the dimensions, P is a 3x4 matrix, R0 is a 3x3 matrix, R|t (Velo 2 Cam) is a 3x4 matrix, and X is a 3x1 matrix.

We need to convert R0 as a 4x3 matrix and X as a 4x1 matrix by converting the points into homogeneous coordinates.

First, we need to convert R0 into homogeneous coordinates and multiply this value by P. Here is the result:

1[[73.70800018  6.42700005  2.71099997]]
2[[ 7.25995070e+02  9.75088151e+00  6.04164924e+02  4.48572800e+01]
3 [-5.84187278e+00  7.22248117e+02  1.69760552e+02  2.16379100e-01]
4 [ 7.40252700e-03  4.35161400e-03  9.99963100e-01  2.74588400e-03]]

We have the point P with the following coordinates:

1[[73.70800018  6.42700005  2.71099997]]

And R0 with the following coordinates:

1[[ 7.25995070e+02  9.75088151e+00  6.04164924e+02  4.48572800e+01]
2 [-5.84187278e+00  7.22248117e+02  1.69760552e+02  2.16379100e-01]
3 [ 7.40252700e-03  4.35161400e-03  9.99963100e-01  2.74588400e-03]]

Then, we need to multiply the result of P x R0 by R|T, and finally to multiply by X. X needs to be converted to a 4 x 1 matrix. The output will be the homogeneous coordinates of our single point.

We have the following 3D point X:

1[[73.70800018  6.42700005  2.71099997]]

And Y in homogeneous coordinates:

1[[40176.41872366]
2 [11292.90007164]
3 [   73.46372075]]

To get the pixel value, we need to divide by 73.46 to return to the euclidian space. Here is the result:

1[[73.70800018  6.42700005  2.71099997]]
2Euclidean Pixels [[546.88788308 153.72077478]]

Let’s verify our result graphically:

Our point is actually in the tree. While it’s a bit approximative as we looked at our point manually, we know that it works. Let’s verify that we get the same thing with the code:

And here is the result with all the points:

Closing Thoughts on Early Sensor Fusion

This article showed you how to perform point pixel projection, which implements Early Sensor Fusion on raw data before object detection.

Together we learned:

How a camera works
How a LiDAR scanner works
How to perform early sensor fusion by projecting a 3D point on a 2D image

Sensor fusion algorithms are essential for self-driving cars and autonomous vehicles in general. They are involved in pedestrian detection, vehicle detection, in the context of static and dynamic obstacles detection. Last but not least, sensor fusion is also important for adaptive cruise control.

If you are interested to know more about sensor fusion algorithms and object detection with neural networks, feel free to check my next article here.

To build this project, I used Jérémy Cohen’s course on Visual Fusion.

You can watch the demo on Youtube here.

If you liked this content, feel free to subscribe to my newsletter here to receive my updates.