Navigate back to the homepage

Early Sensor Fusion For Autonomous Vehicles

Xavier Rigoulet
September 28th, 2021 · 6 min read

Autonomous vehicles need to see by themselves and perceive their environment to make sense of it and make appropriate decisions when required.

To perceive their environment, autonomous vehicles collect different types of data with the help of sensors, such as cameras, LiDAR, RADAR, GPS, or ultrasonics.

This article will walk you through how to fuse raw data coming from cameras and LiDAR scanners. This is an essential part of the perception process in the field of self-driving cars.

We will see that it is all about linear algebra. If you find this article difficult to understand because of the linear algebra part, you can check 3Blue1Brown’s excellent Essence of Linear Algebra video series.

Without further ado, let’s get right into it!

What is Sensor Fusion

Sensor Fusion is the process of combining sensory data coming from multiple sources. It helps to reduce the uncertainty on the information compared as if the data sources were used individually.

We need to distinguish between Early Sensor Fusion and Late Sensor Fusion.

Early Sensor Fusion Vs. Late Sensor Fusion

Early Sensor Fusion is about fusing 3D point clouds with 2D images. Here, we do not combine the results of the detections, but instead, we combine the raw data, e.g., the pixels and the point clouds. Early Sensor Fusion is a 3D to 2D projection and is performed before the object detection. This is why it is called Early Fusion.

On the other hand, Late Sensor Fusion happens after object detection, and tracking is performed.

This article shows an example of Early Sensor Fusion implementation.

In this article, we describe Early Sensor Fusion with Stereo Cameras and a LiDAR Scanner.

Different Types of Sensors: Camera Vs. LiDAR Scanner

Cameras are affordable and can collect a lot of information but cannot collect information about the distance. Therefore, we need to use a LiDAR scanner that works like radar but with infrared instead.

How does a Camera works

A camera is a sensor used to capture a 3D point and convert it to a pixel in an image.

Camera Schema
Source: Visual Fusion for Autonomous Cars - PyImageSearch

It is a two steps process:

  1. Convert the real-world point to the camera coordinates using EXTRINSIC parameters
  2. Convert the point from the camera to the pixel coordinates using INTRINSIC parameters.

On the one hand, extrinsic parameters are rotation and translation matrices used to convert information from the world frame to the camera frame.

On the other hand, intrinsic parameters are the internal camera parameters, such as the focal length, to convert these information into pixels.

Mathematical Conversion of Real World Point to Image
Source: Visual Fusion for Autonomous Cars - PyImageSearch

These intrinsic and extrinsic matrices are automatically calculated using a process called Camera Calibration.

Every camera needs calibration. Calibration is the process of converting a Real-World 3D point with (x,y,z) coordinates to a 2D pixel with (x,y) coordinates.

We talk about extrinsic calibration when we refer to going from world coordinates to camera coordinates and intrinsic calibration when we refer to the process of going from camera coordinates to pixel coordinates.

Image Rectification
Source: LearnOpenCV

The purpose of the calibration process is to rectify images and avoid image distortions.

Because each camera is different and has its parameters, we need to know the specific parameters of the camera we are working with for sensor fusion.

How does a LiDAR Scanner Works

LiDAR stands for Light Detection And Ranging. LiDAR works with laser beams in the infrared. In a way, a LiDAR is similar to a Radar in its principle but uses light instead. A LiDAR is also a more accurate and less noisy sensor than a Radar.

LiDAR Scanner

LiDARs are Time-Of-Flight sensors, meaning that they use the time it takes to hit a target and come back to measure the distance. They are helpful to get an accurate distance and depth measure of an obstacle.

The distance is calculated with the following equation:

Flight distance

  • d = distance
  • c = speed of light
  • T = time of flight

On Wikipedia, the time of flight (ToF) is defined as the measurement of the time taken by an object, particle, or wave (be it acoustic, electromagnetic, etc.) to travel a distance through a medium.

LiDAR scanners generate point clouds with XYZ positions.

In this project, our LiDAR scanner is a Velodyne, a 360° rotating sensor, similar to the image above.

Data Origin of Our Early Sensor Fusion Project

The car used to collect data is a modified Volkswagen Passat B6 with the following sensors:

  • 1 Inertial Navigation System (GPS/IMU): OXTS RT 3003
  • 1 Laserscanner: Velodyne HDL-64E
  • 2 Grayscale cameras, 1.4 Megapixels: Point Grey Flea 2 (FL2-14S3M-C)
  • 2 Color cameras, 1.4 Megapixels: Point Grey Flea 2 (FL2-14S3C-C)
  • 4 Varifocal lenses, 4-8 mm: Edmund Optics NT59-917

volgswagen Passat With Visual Sensors
Volgswagen Setup Diagram Top View

You can find more information about the data and the car used to collect it on the KITTI Vision Benchmark Suite website.

For this project, we need to pay attention to the following variables:

  • Velo-To-Cam is a variable that we will call V2C — It gives the rotation and translation matrices from the Velodyne to the Left Grayscale camera.
  • R0_rect is used in Stereo Vision to make the images co-planar, which is our case here.
  • P2 is the matrix obtained after camera calibration. It contains the intrinsic and the extrinsic matrices.

How To Project a 3D Points Cloud on a 2D Image

In this project, we want to project a point cloud to an image by converting a point X in 3D into a Y point in 2D.

Visually, here is what we want to achieve:

Sensor Fusion Result

It is mathematically translated as:

Y = P x R0 x R|T x X

  • Y = point in 2D (pixels)
  • P = intrinsic calibration matrix
  • R0 = stereo rectification matrix
  • R|T = Velodyne to camera matrix
  • X = point in 3D

Enter The Matrix

Following the mathematical equation previously stated, you noticed that we need to multiply several matrices together. But what do they represent?

First, we need the intrinsic calibration matrix P. When you purchase a camera, it already comes pre-calibrated, and you do not have to perform this step. However, it is a required step for every camera.

Second, we need the stereo rectification matrix R0. This matrix is not required if we use a monocular camera. We need to perform this step because we are in a stereo setting with cameras working together.

Third, R|T is the transition matrix from the Velodyne to the camera frame. This is what allows us to transition data from one sensor to another.

Finally, we need to know our sensors’ exact position and coordinate system to project our points.

How To Perform Sensor Fusion Algorithm

To illustrate the theory shared above, let’s see how it works for one point.

First, we need to find the coordinates of one 2D point on the image and the coordinates of the 3D point from the points cloud generated by the LiDAR scanner.

To apply sensor fusion, what we need to do is to select the Points that are visible in the image, convert them from the LiDAR frame to the Camera Frame, and finally find a way to project the points from the Camera Frame to the Image Frame.

We take one of our images and its associated points cloud as follow:

Points Cloud

In our calibrated files, we have the following matrices:

1P :[[7.215377e+02 0.000000e+00 6.095593e+02 4.485728e+01]
2 [0.000000e+00 7.215377e+02 1.728540e+02 2.163791e-01]
3 [0.000000e+00 0.000000e+00 1.000000e+00 2.745884e-03]]
5RO [[ 0.9999239 0.00983776 -0.00744505]
6 [-0.0098698 0.9999421 -0.00427846]
7 [ 0.00740253 0.00435161 0.9999631 ]]
9Velo 2 Cam [[ 7.533745e-03 -9.999714e-01 -6.166020e-04 -4.069766e-03]
10 [ 1.480249e-02 7.280733e-04 -9.998902e-01 -7.631618e-02]
11 [ 9.998621e-01 7.523790e-03 1.480755e-02 -2.717806e-01]]

Previously, we wrote the following mathematical equation to fuse the visual sensors:

Y(2D) = P x R0 x R|t x X(3D)

However, when looking at the dimensions, P is a 3x4 matrix, R0 is a 3x3 matrix, R|t (Velo 2 Cam) is a 3x4 matrix, and X is a 3x1 matrix.

We need to convert R0 as a 4x3 matrix and X as a 4x1 matrix by converting the points into homogeneous coordinates.

First, we need to convert R0 into homogeneous coordinates and multiply this value by P. Here is the result:

1[[73.70800018 6.42700005 2.71099997]]
2[[ 7.25995070e+02 9.75088151e+00 6.04164924e+02 4.48572800e+01]
3 [-5.84187278e+00 7.22248117e+02 1.69760552e+02 2.16379100e-01]
4 [ 7.40252700e-03 4.35161400e-03 9.99963100e-01 2.74588400e-03]]

We have the point P with the following coordinates:

1[[73.70800018 6.42700005 2.71099997]]

And R0 with the following coordinates:

1[[ 7.25995070e+02 9.75088151e+00 6.04164924e+02 4.48572800e+01]
2 [-5.84187278e+00 7.22248117e+02 1.69760552e+02 2.16379100e-01]
3 [ 7.40252700e-03 4.35161400e-03 9.99963100e-01 2.74588400e-03]]

Then, we need to multiply the result of P x R0 by R|T, and finally to multiply by X. X needs to be converted to a 4 x 1 matrix. The output will be the homogeneous coordinates of our single point.

We have the following 3D point X:

1[[73.70800018 6.42700005 2.71099997]]

And Y in homogeneous coordinates:

2 [11292.90007164]
3 [ 73.46372075]]

To get the pixel value, we need to divide by 73.46 to return to the euclidian space. Here is the result:

1[[73.70800018 6.42700005 2.71099997]]
2Euclidean Pixels [[546.88788308 153.72077478]]

Let’s verify our result graphically:

Single Point Cloud Coordinates
Single Point Image

Our point is actually in the tree. While it’s a bit approximative as we looked at our point manually, we know that it works. Let’s verify that we get the same thing with the code:

Programmatically Projected 3D Point on 2D Image

And here is the result with all the points:

Sensor Fusion Result

Closing Thoughts

This article showed you how to perform point pixel projection, which implements Early Sensor Fusion as it is completed before object detection.

Together we learned:

  • How a camera works
  • How a LiDAR scanner works
  • How to perform early sensor fusion by projecting a 3D point on a 2D image

This is an essential step for self-driving cars and autonomous vehicles in general.

To build this project, I used PyImageUniversity’s resources on Sensor Fusion by Jérémy Cohen from

You can watch the demo on Youtube here.

If you liked this content, feel free to subscribe to my newsletter here to receive my updates.

Join our email list and get notified about new content

Be the first to receive our latest content with the ability to opt-out at anytime. We promise to not spam your inbox or share your email with any third parties.

More articles from Digital Nuage

The Cost of Technical Debts in Machine Learning

Discover what is a technical debt in Machine Learning and some mitigation strategies

September 5th, 2021 · 9 min read

Machine Learning For Mental Healthcare Part 2

Discover how to preprocess text data for a nlp machine learning project

September 1st, 2021 · 10 min read
© 2021 Digital Nuage
Link to $