Discover how a car can assess the distance to another vehicle in real time.
In a previous article, I explained the process of projecting a 3D point cloud onto a 2D image.
In this article, we will go further by performing object detection with a deep learning model. In the end, we will have the distance between our car and the other vehicles in real time.
This technology is essential in the perception pipeline of autonomous vehicles. If the algorithm can detect the distance that separates us from an obstacle in front, it can adapt its speed and make appropriate decisions, such as an emergency brake if needed.
You can watch the final result on YouTube.
Let’s get right into it! But first, let me provide some context about sensor fusion, which comes in several types.
Combining the raw data from a camera with a LiDAR or any other sensor is called low-level sensor fusion. This is because the fusion happens on the raw data, before any object detection is performed.
If we finally perform sensor fusion after object detection on the camera and the LiDAR, we talk about mid-level sensor fusion.
Finally, if we have object tracking on both devices, we talk about high-level sensor fusion.
In this article, we talk about low-level fusion, which is also called early sensor fusion. Mid-level and high-level sensor fusions are what we call late sensor fusion.
In late sensor fusion, we perform 2D object detection on the image and 3D object detection on the LiDAR point cloud before fusing the two.
If early sensor fusion is about raw data, late sensor fusion is about objects.
It is important to note that early sensor fusion is nowadays preferred because it is safer. With early sensor fusion, we can build a security bubble around the car: even if the detector misses an object, the raw LiDAR points still reveal that something is there, so we can still stop the car.
On the other hand, late sensor fusion relies entirely on object detection; if the system fails to detect an object, there is nothing left to fuse and the whole pipeline fails.
To complete the process of early sensor fusion, we need to perform the three following steps:
- Project 3D point cloud onto 2D images
- 2D Object Detection
- Outlier removal
Since I already explained the first step of early fusion, I will only go through steps 2 and 3.
However, if you are not familiar with the projection step, I recommend reading that article first before coming back here.
The data used for this project were collected by a real car equipped with a rotating LiDAR and a stereo camera system with four cameras, as seen below:
The car is a modified Volkswagen Passat B6 with the following sensors:
- 1 Inertial Navigation System (GPS/IMU): OXTS RT 3003
- 1 Laserscanner: Velodyne HDL-64E
- 2 Grayscale cameras, 1.4 Megapixels: Point Grey Flea 2 (FL2-14S3M-C)
- 2 Color cameras, 1.4 Megapixels: Point Grey Flea 2 (FL2-14S3C-C)
- 4 Varifocal lenses, 4-8 mm: Edmund Optics NT59-917
You can find more information about the data and the car used to collect it on the KITTI Vision Benchmark Suite website.
3D point cloud projection onto a 2D image
As I mentioned earlier, I will not detail how to project a 3D point cloud onto a 2D image, but I believe it is necessary to show the final output before going further.
The image above shows the 3D point cloud projected onto the 2D image. The color of each 3D point changes depending on its distance from our sensor. This step is essential because combining different sensors reduces the uncertainty caused by each sensor's individual limitations.
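As a reminder, the projection step from the previous article can be sketched in a few lines of NumPy. This is a minimal sketch, not the full calibration pipeline: the 3x4 projection matrix `P` is assumed to have already been computed from the calibration files (as provided by KITTI).

```python
import numpy as np

def project_points(points_3d, P):
    """Project Nx3 LiDAR points into the image plane using a 3x4
    camera projection matrix P (assumed pre-computed from the
    KITTI calibration files)."""
    n = points_3d.shape[0]
    # Homogeneous coordinates: (N, 3) -> (N, 4)
    homogeneous = np.hstack([points_3d, np.ones((n, 1))])
    # Apply the projection: (3, 4) @ (4, N) -> (3, N)
    projected = P @ homogeneous.T
    # Normalize by depth to get pixel coordinates (u, v)
    uv = projected[:2] / projected[2]
    return uv.T, projected[2]  # pixel coordinates and depths
```

The returned depths are worth keeping around: they are the distances we will later read off for each detected vehicle.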
Next, we perform 2D object detection.
2D Object Detection with Deep Learning
Object detectors can be divided into two categories of algorithms: Region Proposal Detectors and One-Shot Detectors.
In a region-proposal detector such as R-CNN, around 2,000 candidate regions possibly containing an object are first generated (R-CNN uses selective search for this). Then, convolutional features are computed with a CNN for each proposal. Finally, each proposal is classified with the help of a linear SVM.
On the other hand, One-Shot Detectors do not need region proposals and directly regress bounding box locations. YOLO is an example of a One-Shot Detector and is our algorithm of choice for this project.
YOLO stands for You Only Look Once. This algorithm works by dividing the image into a grid, usually of dimensions 13x13. For each cell, the model predicts two boxes with their confidence scores. The model also predicts a class probability for each cell and combines the boxes with the class predictions. The final step is to apply Non-Maxima Suppression and thresholding.
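The Non-Maxima Suppression and thresholding steps can be sketched in NumPy. This is a minimal greedy version on boxes in `[x1, y1, x2, y2]` format; real implementations typically also apply it per class.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, score_thresh=0.5, iou_thresh=0.4):
    """Keep the highest-scoring boxes, dropping overlapping duplicates."""
    keep = []
    # Thresholding: discard low-confidence predictions first
    idx = np.where(scores >= score_thresh)[0]
    order = idx[np.argsort(scores[idx])[::-1]]  # best score first
    while order.size > 0:
        best = order[0]
        keep.append(best)
        if order.size == 1:
            break
        rest = order[1:]
        # Suppress remaining boxes that overlap the kept box too much
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep
```

In practice, a pre-trained YOLO model (for example via OpenCV's dnn module) outputs the raw boxes and scores that feed this post-processing.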
The YOLO architecture is a Deep Learning model with 24 convolutional layers.
Here is the result of the object detection:
Once we have the 2D object detection on the image, we need to fuse the point cloud and the bounding boxes.
Fuse Point Cloud and Bounding Boxes
At this stage, when we fuse the point cloud with the bounding boxes, here is our result:
Now, we need to remove the irrelevant points.
The first step is to remove the points outside the bounding boxes with a simple if-else statement.
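Assuming the projected points are an array of `(u, v)` pixel coordinates, this filter is a vectorized version of that if-else check:

```python
import numpy as np

def points_in_box(uv, box):
    """Boolean mask of projected points (N, 2) that fall inside a
    bounding box [x1, y1, x2, y2]."""
    return ((uv[:, 0] >= box[0]) & (uv[:, 0] <= box[2]) &
            (uv[:, 1] >= box[1]) & (uv[:, 1] <= box[3]))
```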
Next, we remove the points that are inside the bounding boxes but still irrelevant, because these points are a source of error.
This process is called outlier removal. The outliers are the points that fall inside the box but do not belong to the object.
This step is essential to ensure accuracy in estimating the distance between the obstacle and our vehicle.
There are several methods to perform it. A standard solution is to use a shrink factor. In other words, instead of considering the whole box, we consider only a part of it.
A common practice is to shrink the box by 10 to 15% so that only the most relevant points are kept.
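A minimal sketch of such a shrink factor, assuming boxes in `[x1, y1, x2, y2]` pixel format:

```python
def shrink_box(box, factor=0.1):
    """Shrink a bounding box [x1, y1, x2, y2] toward its center by
    `factor` (e.g. 0.1 = 10%) before keeping the points inside it."""
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * factor / 2  # shave half the margin off each side
    dy = (y2 - y1) * factor / 2
    return [x1 + dx, y1 + dy, x2 - dx, y2 - dy]
```

The shrunken box is then used in place of the original one when filtering the projected points.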
The image above shows that while the points are inside the bounding box, not all the points belong to the detected object.
Shrinking the box by 10% helps reduce the number of outliers, but some points that do not belong to the object remain.
Shrinking the size of the bounding box by 20% improves the result, but we still need to optimize further as unwanted points remain.
We can reduce their number further by using the sigma rule, which removes outliers based on how many standard deviations (sigma) they are from the mean.
In this case, “1-sigma” is one standard deviation from the norm (e.g., the mean or average), “2-sigma” is two standard deviations from the norm, and “3-sigma” represents three standard deviations from the norm.
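A sketch of the sigma rule applied to the depths of the points kept inside one box (assuming the depths come from the projection step):

```python
import numpy as np

def sigma_filter(depths, n_sigma=1):
    """Keep only the depths within n_sigma standard deviations
    of the mean depth inside the box."""
    mu, sigma = depths.mean(), depths.std()
    return depths[np.abs(depths - mu) <= n_sigma * sigma]
```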
The final step is to reduce the points remaining in each bounding box to a single reference distance.
Choosing the median, the average, the closest, or even a random point is possible. It is usually safer to select the closest point, since overestimating the distance to an obstacle is more dangerous than underestimating it.
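The distance selection can be sketched as follows; the function and method names are illustrative, not from a specific library:

```python
import numpy as np

def obstacle_distance(depths, method="closest"):
    """Reduce the filtered depths inside one bounding box to a single
    distance. 'closest' is the safest choice for braking decisions."""
    if method == "closest":
        return depths.min()
    if method == "median":
        return np.median(depths)
    if method == "average":
        return depths.mean()
    raise ValueError(f"unknown method: {method}")
```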
Finally, here is the final result:
Closing Thoughts on Early Sensor Fusion for Self-Driving Cars
This article and the previous one explained the complete process of early sensor fusion for self-driving cars.
We have learned the complete process of early sensor fusion, in this case, between a LiDAR and a camera. Once the 3D point cloud was projected onto the 2D image, we learned to perform 2D object detection using YOLO. Finally, we learned how to remove the outliers to ensure the accuracy of the output.
While this article illustrates the application of this technology with a LiDAR and cameras, it also works with other sensor combinations, such as a camera and a radar. This technology finds many applications in robotics, drones, augmented reality, and more.