# What Is Stereo Disparity?

Stereopsis refers to the perception of depth, and thus 3D structure, obtained by observing a scene from two different vantage points. In nature, humans and quite a few other animals have two eyes positioned so that each observes the scene from a slightly different position, so two slightly different versions of the scene are projected onto the retinas. From these differences the visual cortex deduces the 3D structure of the scene being observed. In computer vision this concept is typically called stereo; stereo disparity is the difference in position of the same scene point between the two images, and it is used for 3D reconstruction of a scene. The image above shows a 3D reconstruction based on stereo disparity. This kind of technology can be used for building 3D scanners, amongst other things.

## Stereo Disparity Briefly

The following figure illustrates how stereopsis, or stereo disparity, is used in computer vision:

\( C_1 \) and \( C_2 \) are the 'left' and 'right' camera centres, \(X\) is the point of interest in the 3D world, while \(x\) and \(x^\prime\) are the images of \(X\) as seen in the respective cameras. In other words, this is just two pinhole cameras, one next to the other. Now, if \(x=[x\;y\;1]^T\) and \(x^\prime=[x^\prime\;y^\prime\;1]^T\), then stereo disparity is defined as the difference in horizontal position \(d=x-x^\prime\). In computer vision this is typically computed by trying to find, for each pixel in the left camera, the corresponding pixel in the right camera. Conversely, if we know the positions \(x\) and \(x^\prime\), simple trigonometry gives us the coordinates \( X = [X\;Y\;Z\;1]^T\) with respect to the left camera. This well-known process is called triangulation. It means that we know where each pixel lies in the 3D world with respect to the left camera centre, which is the kind of per-pixel 3D information that 3D scanners and time-of-flight cameras also deliver. The triangulation process is described in a little more depth below.

## Stereo Disparity Map

A stereo disparity map, which is inversely proportional to distance, is easy to visualize, and it tells us something about the 3D structure of the scene being observed. The following is an example of using stereo disparity and 3D reconstruction in the field of robotics.

In order for a robot to manipulate objects of interest, it needs to know where these are with respect to a known coordinate system and what these objects look like. Based on this we can calculate something called a grasping vector, but that is a different story and we'll leave it there for the time being.
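The inverse proportionality between disparity and distance mentioned above can be made concrete for a rectified camera pair: with a focal length \(f\) (in pixels) and a baseline \(B\) (distance between the camera centres), depth is \(Z = fB/d\). The following is a minimal sketch of that conversion; the function name, the parameter names and the numeric values are illustrative assumptions and do not come from any particular camera setup.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disparity=1e-6):
    """Convert a disparity map (pixels) into a depth map (metres), Z = f * B / d.

    Assumes rectified cameras and the convention d = x - x' >= 0 with the
    left camera as the reference. Pixels with (near) zero disparity are
    treated as infinitely far away.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > min_disparity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical values: 800 px focal length, 10 cm baseline.
disparity = np.array([[64.0, 32.0],
                      [16.0, 0.0]])
depth = disparity_to_depth(disparity, focal_px=800.0, baseline_m=0.1)
print(depth)
```

Halving the disparity doubles the depth, which is why depth resolution degrades quickly for distant objects.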

## Triangulation

We can think of a pixel in the image plane as being formed by a 'ray of light' described by a vector. For example, for the 'left' camera the vector would be \( \overline{XC_1}\). If we know the internal parameters of the cameras, we can 'backproject' these rays. If we also know the external parameters of the cameras (e.g. rotation \( R \) and translation \( T \) between the left and right cameras), then we can calculate where these backprojected rays intersect and thus obtain the 3D coordinates of the point of interest \(X\) in a metric space.
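As a rough illustration of the backprojection and intersection described above, here is a minimal sketch of midpoint triangulation. Because noisy rays rarely intersect exactly, it returns the midpoint of the shortest segment between the two rays. Several things here are assumptions of the sketch: the function name, the intrinsic matrices `K_left` and `K_right`, and the convention \(X_{right} = R X_{left} + T\) for the external parameters.

```python
import numpy as np

def triangulate_midpoint(x_left, x_right, K_left, K_right, R, T):
    """Triangulate one point, in left-camera coordinates, from a pixel pair.

    Assumed convention: X_right = R @ X_left + T, so the right camera centre
    expressed in the left camera frame is -R.T @ T.
    """
    # Backproject both pixels into ray directions in the left camera frame.
    d1 = np.linalg.inv(K_left) @ np.array([x_left[0], x_left[1], 1.0])
    d2 = R.T @ (np.linalg.inv(K_right) @ np.array([x_right[0], x_right[1], 1.0]))
    c1 = np.zeros(3)   # left camera centre
    c2 = -R.T @ T      # right camera centre

    # Find s, t minimizing |(c1 + s*d1) - (c2 + t*d2)| in the least-squares sense.
    A = np.stack([d1, -d2], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)

    # Midpoint of the closest points on the two rays.
    return 0.5 * ((c1 + s * d1) + (c2 + t * d2))

# Hypothetical setup: identical cameras, right camera 10 cm to the right.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
T = np.array([-0.1, 0.0, 0.0])

X = triangulate_midpoint((400.0, 200.0), (360.0, 200.0), K, K, R, T)
print(X)
```

In this example the two pixels are projections of the point \([0.2, -0.1, 2.0]\), and the disparity of 40 pixels agrees with \(fB/Z = 800 \cdot 0.1 / 2\).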

## Epipolar Rectification

In order to calculate the stereo disparity map we have to find the corresponding pixels \(x^\prime\) in the right camera for the pixels \(x\) observed in the left camera. The good news is that epipolar rectification simplifies this problem: after rectification, the corresponding pixel is found on the same horizontal line in the right image as the pixel in the left image. Not only does this considerably simplify the problem, it also reduces the computational complexity, since the search becomes one-dimensional.

The following two figures show the concept of epipolar rectification. In the figures below, the left-hand image is captured by the left camera and the right-hand image is captured by the right camera. The upper figure shows the non-rectified case, while the lower figure shows the same images after rectification.

The figure above shows non-rectified stereo images. As can be seen, corresponding pixels are not on the same horizontal lines (i.e. the vertical coordinate is not the same for both).

The figure above shows rectified stereo images. As can be seen, the corresponding pixels can now indeed be found on the same horizontal lines (i.e. the vertical coordinate is the same for both).
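To show why rectification helps, here is a minimal sketch of a one-dimensional correspondence search along a scanline. Note that this is plain block matching with a sum of squared differences cost, used only to illustrate the row-wise search; it is not the variational model described later. The function name, window size and test images are assumptions of the sketch.

```python
import numpy as np

def block_match_rows(left, right, max_disp, half_win=2):
    """Brute-force block matching on rectified images.

    For each left pixel (x, y), search the same row y of the right image at
    x' = x - d for d in [0, max_disp], and keep the d with the lowest SSD cost.
    Border pixels that cannot hold a full window are left at disparity 0.
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int64)
    for y in range(half_win, h - half_win):
        for x in range(half_win + max_disp, w - half_win):
            patch_l = left[y - half_win:y + half_win + 1,
                           x - half_win:x + half_win + 1]
            best_cost, best_d = np.inf, 0
            for d in range(max_disp + 1):
                xr = x - d  # candidate position in the right image, same row
                patch_r = right[y - half_win:y + half_win + 1,
                                xr - half_win:xr + half_win + 1]
                cost = np.sum((patch_l - patch_r) ** 2)
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp

# Synthetic test: the left image is the right image shifted by 3 pixels.
rng = np.random.default_rng(0)
right_img = rng.random((12, 24))
left_img = np.roll(right_img, 3, axis=1)
disp = block_match_rows(left_img, right_img, max_disp=5)
print(disp[2:-2, 7:-2])
```

Without rectification the inner loop would have to search a two-dimensional neighbourhood, or follow a slanted epipolar line, instead of a single row.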

# Computing Stereo Disparity

There are many different ways of finding these corresponding pixels. The one I explain below is a so-called variational model (based on the calculus of variations). For better insight into the model and how it is solved, have a look at my thesis. In variational disparity calculation, the energy functional describing the system is as follows:

\[ E(d) = \min_{d} \int_{\Omega} \Psi \Big( Edata(d)^2 \Big) dx + \alpha \int_{\Omega} \Psi \Big( Esmooth(d)^2 \Big) dx\]

where Edata and Esmooth are the data and the smoothness (energy) terms, d is the disparity, and \( \Psi(s^2)\) is a robust error function. The data term measures how well the 'model' fits the data, while the smoothness term regularizes the solution to be smooth. Edata and Esmooth are defined as follows:

\[\begin{split}Edata(d) = &I_L(x,y) - I_R(x+d,y) \\ Esmooth(d)^2 = &| \nabla d|^2 \end{split}\]

where \(I_{\{L,R\}}\) refers to the left or right stereo image, respectively, while \( \nabla = [\frac{\partial}{\partial x}, \frac{\partial}{\partial y}] \) is the gradient operator. In a sense, we are looking for a transformation, defined by \( d \), that moves/morphs the right image into the left image (this is the 'physical' interpretation of the data term), while simultaneously imposing smoothness on the solution (i.e. \(d\)).
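The two terms can be evaluated on a pixel grid roughly as follows. This is a minimal sketch assuming linear interpolation along each row for the warp \(I_R(x+d,y)\) and finite differences for the gradient; the function names and the test images are illustrative. With the data term written as above, \(d\) is the offset added to the left \(x\) coordinate to reach the matching pixel in the right image.

```python
import numpy as np

def data_residual(I_left, I_right, d):
    """Edata(d) = I_L(x, y) - I_R(x + d, y), with I_R sampled by row-wise
    linear interpolation. Sample positions outside the image are clamped
    to the border value by np.interp."""
    h, w = I_left.shape
    xs = np.arange(w, dtype=np.float64)
    warped = np.empty((h, w), dtype=np.float64)
    for y in range(h):
        warped[y] = np.interp(xs + d[y], xs, I_right[y])
    return I_left - warped

def smoothness_sq(d):
    """Esmooth(d)^2 = |grad d|^2 using finite differences."""
    dy, dx = np.gradient(d)
    return dx ** 2 + dy ** 2

# Synthetic test: horizontal intensity ramp, constant offset of -2 pixels.
I_right = np.tile(np.arange(8.0), (4, 1))
I_left = I_right - 2.0            # I_L(x) = x - 2 = I_R(x - 2)
d = np.full((4, 8), -2.0)

residual = data_residual(I_left, I_right, d)
smooth = smoothness_sq(d)
print(residual)
print(smooth)
```

With the correct \(d\), the residual vanishes wherever the warped position stays inside the image. The two leftmost columns are non-zero because their match falls outside the right image, which is a simple example of the outliers the robust error function has to handle.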

One possible error function is as follows:

\[ \Psi( s^2) = \sqrt{ s^2 + \epsilon^2 }\]

The purpose of the robust error function is to deal with 'outliers'. In the Edata case such outliers are, for example, occluded zones (i.e. image structures that are present/visible in only one of the images). In the Esmooth case the purpose of the robust error function is to make the solution piecewise smooth: we do not want to propagate values across object boundaries or, in other words, between regions residing on different disparity levels.
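The effect of this particular error function is easy to check numerically. The sketch below implements \(\Psi(s^2)\) and its derivative with respect to \(s^2\), \(\Psi^\prime(s^2) = 1/(2\sqrt{s^2+\epsilon^2})\); the value of \(\epsilon\) and the sample residuals are arbitrary choices for illustration.

```python
import numpy as np

def psi(s2, eps=1e-3):
    """Robust error function Psi(s^2) = sqrt(s^2 + eps^2)."""
    return np.sqrt(s2 + eps ** 2)

def psi_prime(s2, eps=1e-3):
    """Derivative of Psi with respect to its argument s^2."""
    return 0.5 / np.sqrt(s2 + eps ** 2)

residuals = np.array([0.1, 1.0, 10.0])
quadratic_cost = residuals ** 2      # grows quadratically with the residual
robust_cost = psi(residuals ** 2)    # grows roughly like |residual|
weights = psi_prime(residuals ** 2)  # large residuals get small weights

print(quadratic_cost)
print(robust_cost)
print(weights)
```

A residual of 10 costs 100 under a quadratic penalty but only about 10 under \(\Psi\), so a few occluded pixels or a disparity jump at an object boundary no longer dominate the energy.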

## Extended Variational Model for Disparity

The extended model contains an additional term, which is based on what might be known about the solution beforehand. Such knowledge can be, for example, that the sky is far away from the viewer (disparity 0 or near 0), or that roads are relatively flat surfaces, as are many other man-made structures like walls, tables and so on. This term allows context-related knowledge to be encoded in variational methods.

\[\begin{equation} \min_{d} \int_{\Omega} \Psi \Big( Edata(d)^2 \Big) dx + \int_{\Omega} \Psi \Big( (d_{sc}-d)^2 \Big) dx + \alpha \int_{\Omega} \Psi \Big( Esmooth(d)^2 \Big) dx \end{equation}\]

where \( d_{sc} \) is an a priori disparity map that in a sense 'guides' the disparity calculation. More information about constraining the solution, and results, can be found at PUBLICATIONS - CONSTRAINTS. An HTML version of a paper describing how disparity and optical flow fields can be constrained is available here.

## Resolving the Equation(s)

A necessary (but not sufficient) condition for a minimum is that the corresponding Euler-Lagrange equation(s) are satisfied. The following is the corresponding Euler-Lagrange equation in elliptic form.

\[ \Psi{\prime} \left( Edata^2 \right) Edata \frac{\partial I_R(x+d,y)}{\partial x} + \alpha DIV \left( \Psi{\prime} \left(Esmooth^2 \right) \nabla d \right) = 0 \]

where \( DIV \) is the divergence operator.
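For the extended model, applying the same derivation to the additional a priori term would contribute one more summand. Assuming the term enters the energy with unit weight, exactly as written in the extended functional, the Euler-Lagrange equation would read:

\[ \Psi{\prime} \left( Edata^2 \right) Edata \frac{\partial I_R(x+d,y)}{\partial x} + \Psi{\prime} \left( (d_{sc}-d)^2 \right) (d_{sc}-d) + \alpha DIV \left( \Psi{\prime} \left(Esmooth^2 \right) \nabla d \right) = 0 \]

The new summand pulls \(d\) towards \(d_{sc}\), with the robust weight \(\Psi{\prime}\) reducing that pull wherever the a priori map disagrees strongly with the solution.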

Because of the late linearization of the data term, the model copes with large displacements. However, this comes at a price: the energy functional may not be convex. Therefore, searching for a suitable minimizer becomes more difficult.

A suitable minimizer is searched for using a coarse-to-fine strategy, while non-linearities are dealt with using a fixed point scheme. Eventually, after discretization and linearization, we are left with a linear system of equations:

\[ Ax = y \]

where \(y\) is a column vector of length \( m \cdot n \) and \(A\) is a diagonally dominant sparse matrix of size \( (m \cdot n) \times (m \cdot n) \) (\(m\) and \(n\) refer to the dimensions of the input image).
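Diagonal dominance matters because it guarantees that simple iterative solvers converge. As an illustration only, here is a minimal Jacobi iteration on a small diagonally dominant system. The tridiagonal test matrix is a stand-in chosen for the sketch, not the matrix produced by the discretization above, and a dense array is used purely for brevity; a real system of this size would need sparse storage.

```python
import numpy as np

def jacobi(A, y, iterations=200):
    """Solve A x = y with Jacobi iteration.

    Converges when A is strictly diagonally dominant. Dense arrays are used
    here only to keep the sketch short.
    """
    A = np.asarray(A, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    diag = np.diag(A)
    off_diag = A - np.diag(diag)
    x = np.zeros_like(y)
    for _ in range(iterations):
        # Each unknown is updated from the previous estimate of its neighbours.
        x = (y - off_diag @ x) / diag
    return x

# Stand-in system: tridiagonal, diagonally dominant (4 on the diagonal, -1 beside it).
n = 6
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x_true = np.arange(1.0, n + 1.0)
y = A @ x_true

x_est = jacobi(A, y)
print(x_est)
```

Faster schemes such as Gauss-Seidel or successive over-relaxation follow the same pattern but reuse already updated values within each sweep.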