Can we estimate the distance to a vehicle with cameras only?
In fact, it is possible with a stereo camera system. In such a system, we use two cameras to estimate the disparity and, from it, the depth. From there, we can estimate the distance to a vehicle.
You can visualize the final result of this project on the KITTI dataset below:

But what is the disparity?
Depth and Disparity Estimation
Simply put, a point or an object can be captured by two cameras mounted in a stereo fashion, but this point will have different coordinates in each image. The disparity is the distance between these two sets of coordinates. If we compute the disparity for each pixel of the image, the output is the disparity map.
Computing the disparity map is essential because it allows us to extract the depth of the image, which can be helpful to know how far away a vehicle is, or for other applications such as 3D reconstruction. If you are interested in 3D, I also wrote about 3D perception here and 3D deep learning here.
Once the disparity has been estimated, it is possible to estimate the depth. Once we have the depth, it becomes straightforward to calculate the distance. With an object detector, such as YOLO, we can detect the objects of interest and estimate their distance. But how does it work? Below is the diagram of the stereo camera model.
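To make the pipeline concrete, here is a minimal sketch of the last step: combining a detector's bounding box with a dense depth map to get a distance. The helper name, the toy depth map, and the choice of the median as the aggregation are all illustrative assumptions, not the project's exact code.

```python
import numpy as np

# Hypothetical helper: given a dense depth map (in meters) and a detection
# bounding box (x1, y1, x2, y2) from a detector such as YOLO, estimate the
# distance to the detected object as the median depth inside the box.
def distance_to_object(depth_map, box):
    x1, y1, x2, y2 = box
    roi = depth_map[y1:y2, x1:x2]   # depth values inside the box
    return float(np.median(roi))    # median is robust to background pixels

# Toy depth map: background at 50 m, a "vehicle" patch at 12 m.
depth = np.full((100, 200), 50.0)
depth[30:70, 80:140] = 12.0
print(distance_to_object(depth, (80, 30, 140, 70)))  # 12.0
```

The median is used rather than the mean so that background pixels caught inside the box do not skew the estimate.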
The depth Z is the distance between a point P in the real world and the camera. This diagram presents a stereo vision system with two parallel cameras, C and C’. The distance B between the cameras is called the baseline, f is the focal length, and x and x’ are the projections of P on the image planes of cameras C and C’.
By triangulation, we can compute the depth Z with the following formula, where (x - x’) is the disparity:

Z = (f × B) / (x - x’)
From the equation above, it is essential to note that depth and disparity have an inverse relationship with one another. In other words, the greater the depth, the smaller the disparity, and the smaller the depth, the greater the disparity.
Now, we know how to get the depth, provided that we know the disparity. It is a crucial step, as it opens the doors to 3D reconstruction and 3D computer vision in general. But wait! How do we compute the disparity?
Epipolar Geometry and Disparity Estimation
First, let’s look at a diagram of a stereo camera model where two cameras look at the same point X.
What is an Epipolar Line
The line OL - X represents the ray along which the left camera sees the point X, passing through its optical center OL. On the right camera, this ray is materialized by the line (eR - XR), called the epipolar line. Similarly, the line OR - X represents the ray seen by the right camera, and on the left camera it is materialized by the epipolar line (eL - XL). The goal is to find the corresponding point on the right image plane so that we can draw the line that will intersect at X and compute the disparity.
What is an Epipolar Plane
To generalize the previous explanation, the plane X, OL, OR, shown as a green triangle in the previous diagram, is called the epipolar plane.
Observation on the Epipolar Constraint
If the relative position of the two cameras is known, it is possible to test whether two points correspond to the same 3D point, thanks to the epipolar constraint. The epipolar constraint means that the projection of X on the right camera plane, xR, must be contained in the epipolar line eR - xR.
Finally, it is important to note that epipolar constraints can be described algebraically by the fundamental matrix.
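The constraint can be checked numerically. The sketch below assumes an idealized rectified stereo pair (identical cameras, identity intrinsics, pure horizontal translation), for which the fundamental matrix reduces to the skew-symmetric matrix of the translation vector, up to scale. It then verifies that a correct correspondence satisfies x'ᵀ F x = 0 while a wrong one does not.

```python
import numpy as np

# Idealized rectified pair: identity intrinsics, pure translation t = (tx, 0, 0).
# In that case the fundamental matrix is [t]_x (defined up to scale).
tx = 0.54  # any nonzero value works; F is only defined up to scale
F = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -tx],
              [0.0, tx,   0.0]])

# A correct match lies on the same image row, shifted by the disparity d.
x_left  = np.array([100.0, 40.0, 1.0])        # homogeneous pixel, left image
x_right = np.array([100.0 - 20.0, 40.0, 1.0]) # same row, disparity d = 20

# Epipolar constraint: x_right^T F x_left = 0 for a valid correspondence.
print(x_right @ F @ x_left)   # 0.0

# A candidate on a different row violates the constraint.
bad = np.array([80.0, 55.0, 1.0])
print(bad @ F @ x_left)       # nonzero
```

This also shows why rectification is useful in practice: with this F, epipolar lines are horizontal, so the search for a match reduces to scanning along the same row.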
Semi-Global Block Matching
There are different methods to estimate the disparity. One traditional method is semi-global block matching (SGBM). It is fast to implement, but it lacks accuracy and is computationally intensive. However, it can be a good starting point and, depending on your use case, it might already solve your problem.
Semi-global block matching is an intensity-based algorithm used to compute the dense disparity map from a pair of rectified stereo images. It works by analyzing the similarity between blocks of pixels along multiple directions.
If the results with the semi-global block matching algorithm are not good enough for your use case, the next step is to implement a deep learning solution, but it takes more time to develop as it is more complex. At the moment, it is a very active research area.
Closing Thought on Stereo Vision
In this article, we briefly learned that it is possible to estimate the distance with at least two cameras, despite having images in 2D, by estimating the disparity and the depth of the image.
If you are curious about this project, you can see the final result on YouTube, here.
To be continued