Published in Machine Vision and Applications

Disparity Disambiguation by Fusion of Signal- and Symbolic-Level Information

Jarno Ralli (1), Javier Díaz (1), Sinan Kalkan (2), Norbert Krüger (3), and Eduardo Ros (1)

jarno@ralli.fi, jdiaz@atc.ugr.es, skalkan@ceng.metu.edu.tr, norbert@mip.sdu.dk, eros@atc.ugr.es

(1) Departamento de Arquitectura y Tecnología de Computadores, Escuela Técnica Superior de Ingeniería Informática y de Telecomunicación, Universidad de Granada, Calle Periodista Daniel Saucedo Aranda s/n, E-18071 Granada, Spain
(2) KOVAN Research Lab, Dept. of Computer Engineering, Middle East Technical University, 06531 Ankara, Turkey
(3) Cognitive Vision Lab, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Niels Bohrs Alle 1, DK-5230 Odense M, Denmark
Abstract
We describe a method for resolving ambiguities in low-level disparity calculations in a stereo-vision scheme by using a recurrent mechanism that we call the signal-symbol loop. Due to the local nature of low-level processing, it is not always possible to estimate the correct disparity values at this level. Symbolic abstraction of the signal produces robust, high-confidence, multimodal image features which can be used to interpret the scene more accurately and therefore to disambiguate low-level interpretations by biasing the correct disparity. The fusion process is capable of producing more accurate, dense disparity maps than the low- and symbolic-level algorithms can produce independently. We therefore describe an efficient fusion scheme that allows symbolic- and low-level cues to complement each other, resulting in a more accurate and dense disparity representation of the scene.
1 Introduction
Visual perception is a complex process that transforms image signals into cognitive information. The complexity of the vision system is due to the multiple levels of abstraction that must be taken into account when interpreting an image scene. In order to understand the vision system better we can represent the different levels of the process schematically. Vision researchers tend to classify vision algorithms and representations into three levels: low (sensory/signal), middle (symbolic) and high (knowledge based) [28][3][22]. Low-level vision deals with local operations, such as spatio-temporal filters, used to extract low-level cues. In biological systems this is done by cells in the retina
and the primary visual cortex. From a set of basic spatio-temporal filters of different sizes and temporal characteristics, low-level vision models generate information about stereopsis, motion within the scene, local contrast and so on. Low-level operations on the image signal are local in nature and can produce several possible interpretations due to the lack of a more global scene interpretation. At middle-level vision, visual cues and segmentation mechanisms are integrated, thus allowing the efficient and constructive combination of different visual modalities (motion, stereo, orientation and so on) or the segmentation of abstracted information such as independently moving objects (IMOs) [27][5][19]. High-level vision is a cognitive processing stage, where scenes are interpreted via more specific sub-tasks, such as object recognition, prediction and comparison with already perceived scenarios. We use the terms low- or signal-level algorithm for algorithms that work at the signal level (i.e. on a pixel-wise representation) without trying to arrive at a higher-level description of the scene, and the terms middle- or symbolic-level for algorithms that arrive at a higher description of the scene using semantically meaningful, discrete symbolic descriptors. Even though reasoning based on the symbolic descriptors takes place at this level, we are still far from high-level processes where actual scene understanding happens.
In this paper we propose a disambiguation mechanism for creating coherent disparity estimations by fusing signal- and symbolic-level information, i.e. fusing estimations at different levels of abstraction within a cross-validation scheme. Several kinds of disambiguation mechanisms, both local and global, are used in disparity calculation algorithms [25], such as aggregation of evidence, search for salient features [14][13], combination of monocular and binocular cues [15] and so on. Where our work differs from earlier studies is that, before disambiguation, we arrive at a symbolic-level scene description using robust, cross-validated, biologically motivated, multimodal image features that we shall refer to as primitives in the rest of the paper [12][18][16]. Our main contributions in this paper are as follows. First, we show that the proposed disambiguation mechanism can greatly enhance the quality of the resulting disparity estimations: coherency is increased by accepting those feedback values that fit the evidence suggested by the data and by rejecting those that are not consistent with the low-level data-driven estimations. Secondly, we show that hardware implementations suffering from numerical restrictions also benefit from the proposed scheme.
Our system consists of two parallel data streams preceded by a process that transforms the signal into a harmonic representation [23]. By "harmonic representation" we mean a multichannel, band-pass representation of the image, achieved by filtering operations with complex-valued band-pass kernels. From this perspective the visual stimulus is represented locally by phase, energy and orientation on several scales [23]. The two parallel data streams are the following: a signal-level process that calculates the dense disparity map, and a symbolic-level process that arrives at a scene reconstruction using perceptual grouping constraints for the multimodal primitives.
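As an illustration of this representation, the following minimal sketch (illustrative only; function names and parameter values are our own, not those of our implementation) computes per-orientation local phase and energy by filtering with a bank of complex Gabor kernels:

```python
import numpy as np
from scipy import ndimage

def gabor_kernel(f0, theta, size=11, sigma=2.5):
    """Complex band-pass (Gabor) kernel: Gaussian envelope times a complex
    exponential with peak frequency f0 along orientation theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = x * np.cos(theta) + y * np.sin(theta)   # coordinate along the filter orientation
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.exp(2j * np.pi * f0 * u)

def harmonic_representation(image, f0=0.25, n_orient=7):
    """Filter 'image' with oriented complex Gabor kernels; the argument of the
    response gives local phase, its magnitude gives local energy."""
    image = image.astype(float)
    phase, energy = [], []
    for i in range(n_orient):
        g = gabor_kernel(f0, i * np.pi / n_orient)
        resp = ndimage.convolve(image, g.real) + 1j * ndimage.convolve(image, g.imag)
        phase.append(np.angle(resp))
        energy.append(np.abs(resp))
    return np.stack(phase), np.stack(energy)   # each of shape (n_orient, h, w)
```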
Fig. 1 illustrates the parallel data streams, both using a common harmonic representation with colour information, and the feedback signal from the symbolic level to the signal level. Due to the lack of a more general interpretation of the scene, the low-level process is prone to error when several different interpretations are possible. The symbolic-level process, on the other hand, generates a more robust and descriptive representation of the scene, capable of refining the estimates for a better overall semantic coherence. Coherent interpretation at this level is possible through semantic reasoning using concepts such as co-linearity, co-colority, co-planarity and so on [18].
The dense signal-level disparity is disambiguated by feeding the symbolic-level information back into the low-level process [10][11] and by biasing those low-level interpretations that are coherent with the symbolic level. Since the feedback takes place at several image scales (multi-scale), the sparse symbolic-level information is propagated spatially and is thus capable of 'guiding' the low-level process over a far greater area than its original density would allow [20]. We refer to the symbolic-level information used in the disambiguation as feedback maps. The system presented in this paper has only one 'direction' of feedback (from the symbolic to the signal level); extending the feedback in the other direction (signal to symbolic level) is left for future work.

Figure 1: Both the signal- and symbolic-level disparity calculation processes operate on the harmonic representation (plus colour) obtained by the harmonic transformation of the left and right images; the symbolic level feeds back recurrently into the signal level.
1.1 Signal-symbol Loop
As mentioned above, we use the concept of a signal-symbol loop as a feedback mechanism by which discrete symbolic descriptors obtained from the harmonic representation are fed back into the signal level so as to enhance the extraction of desired features. To the best of our knowledge the term 'signal-symbol loop' was first introduced in [11] to describe a way of dealing with three dilemmas that computer vision encounters when interpreting a scene. It is argued that such interpretations require the original signal to be turned into semantic tokens or symbols, which, however, involves a number of problems. The first problem (known as the interpretation/decision dilemma) is of particular relevance in the context of this paper. It deals with the need to interpret the input signal, which in turn requires binary decisions. These decisions concern, for example, setting thresholds for edge detection or the discrete selection of feature positions. Moreover, decisions about which features are relevant for a specific task often need to be made. Without making further assumptions about the input signal or the task in hand, these decisions are difficult to justify. Hence it is important that they be verified and guided by higher-level processes that operate on the symbolic level. In [11] it is argued that feedback mechanisms in terms of signal-symbol loops can moderate between the different levels of information and be used to enhance the image signal, to detect desired features and to disambiguate unclear interpretations of the local cues. In [10] a first example is given of the application of a signal-symbol loop, in the context of exploiting the regularity of rigid motion for edge detection. In this paper we give a further example by addressing the interaction of sparse and dense stereo via signal-symbol loops.
1.2 Hardware-based real-time low-level processing
Low-level stages (primitive extraction engines) can be implemented efficiently using special-purpose hardware such as reconfigurable devices [1][2][3]. However, in order to maximise on-chip parallel processing capabilities, only restricted fixed-point arithmetic is allowed in the model. Furthermore, the models are usually simplified in order to adapt better to the technological substrate in which they will be implemented. Therefore these kinds of low-level processing engines produce noisier results than their respective software implementations. In this work we study whether the signal-symbol fusion mechanism described in this paper helps to enhance system accuracy by constructively integrating higher-level information, thus allowing designs with lower resource requirements and power consumption (critical in embedded systems).
1.3 Structure of the Document
We proceed by describing briefly both the low- and symbolic-level algorithms, followed by a description of the fusion process. As the low-level algorithm we have chosen a method based on the phase component of a band-pass filtered image, due to the robustness of the phase information. We cannot overstress the fact that we are not trying to come up with a new stereo algorithm, but to validate our concept that, by fusing information from several different visual representation levels, more robust and meaningful interpretations can be achieved. After this we demonstrate quantitative results of the fusion process using several well-known stereo-images. Testing was done using both a software implementation and a simulation of a hardware (FPGA) implementation of the system. Due to the increasing interest, in the scientific community as well as in the commercial sector, in artificial vision systems solving complex tasks in real time, we feel that such results should be of interest to anyone implementing real-time vision systems on chip. After presenting the results we proceed to conclusions, future work and acknowledgements.
2 Method
In this section we describe the low-level method used for generating the disparity estimations, followed by a description of the symbolic-level process used for generating robust, sparse, feature-based disparities employing multimodal primitives. After the low- and symbolic-level algorithms have been covered, the fusion process is introduced.
2.1 Low-level algorithm description
For the dense, low-level disparity estimation we have used a method based on the phase component of band-pass filtered versions of the input stereo-images. Phase information was used for reasons of efficiency and stability [24][3][26][6][23]. Fleet and Jepson showed that the phase component is stable with respect to small geometric deformations [7], making phase more robust than amplitude for computing disparity based on binocular cues. If the cameras have almost identical orientations, and the baseline is not too big relative to the distance to the observed object, then the geometric deformations induced by motion parallax due to the change of viewpoint will be small. In this case the phase-based estimations can be expected to be accurate.
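For clarity, the core relation exploited by phase-difference methods can be written out explicitly (a standard identity, stated here under the assumption that the local phase varies approximately linearly with position):

$$d(\mathbf{x}) \;\approx\; \frac{\phi_l(\mathbf{x}) - \phi_r(\mathbf{x})}{2\pi f_0 \cos\theta},$$

where $\phi_l$ and $\phi_r$ are the local phases of the band-pass filtered left and right images, $f_0$ is the radial peak frequency of the filter and $\theta$ its orientation, so that $2\pi f_0 \cos\theta$ is the horizontal component of the peak frequency along the epipolar line.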
The model used is inspired by the optical-flow calculation model of Gautama and Van Hulle [8] and by the single-scale disparity calculation of Solari et al. [26]. The final model combines the advantages of both methods, using a coarse-to-fine multi-resolution computation scheme with warping [23]. In the chosen implementation, disparity is calculated from the phase difference between stereo-images filtered by a bank of seven Gabor filters with different orientations, without explicitly calculating the phase, thus rendering the method both hardware friendly and suitable for real-time computations [3][2]. In addition to the above-mentioned properties (density, efficiency and stability in the face of small geometric deformations), the phase-difference method works explicitly at sub-pixel accuracy. For a more detailed explanation of the algorithm, see Section 4.3. The stages of the algorithm are the following (a software sketch of the complete loop is given after Equation (1) below):
1. If on the coarsest scale, skip this stage; otherwise:
   - expansion of results to the current scale: $D^k(\mathbf{x}) = \mathrm{expand}(D^{k+1}(\mathbf{x}))$
   - warping of the right stereo-image: $I_r(\mathbf{x}) = \mathrm{warp}(I_r(\mathbf{x} + D^k(\mathbf{x})))$
2. Convolution of the input images $I^k_r(\mathbf{x})$ and $I^k_l(\mathbf{x})$ with the Gabor filters to obtain the Gabor filter responses. Each image scale is convolved with the same set of filters, tuned to seven different orientations and with a spatial frequency peak of 0.25.
3. Filtering out those responses that are below a given energy threshold: responses that do not tune well with the filters, corresponding to low energy, are considered unreliable and are thus filtered out.
4. Disparity calculation using the remaining responses (those that have not been filtered out). Since there are seven filters, each image position receives several disparity estimations.
5. Choosing the disparity estimation for each image position using a median filter, as indicated by (1), in order to obtain $D^k_{new}(\mathbf{x})$.
6. Merging of valid disparity estimations: $D^k(\mathbf{x}) = \mathrm{merge}(D^k(\mathbf{x}), D^k_{new}(\mathbf{x}))$.
7. If not at the final scale, return to step 1.
where $I^k_l(\mathbf{x})$ and $I^k_r(\mathbf{x})$ are the left and right stereo-images and $D^k(\mathbf{x})$ is the disparity map corresponding to scale (resolution) $k$ and position $\mathbf{x} = (x, y)$. The disparity estimation for each image position is chosen using a median filter, as indicated by (1):

$$D^k(\mathbf{x}) = \mathrm{median}\big(D^k_{\theta}(\mathbf{x}, dP; f_0)\big) \qquad (1)$$

where $D^k(\mathbf{x})$ is the final resulting disparity for each image position $\mathbf{x} = (x, y)$ and $D^k_{\theta}(\mathbf{x}, dP; f_0)$ are the disparity responses corresponding to filter orientation $\theta$ and scale $k$; $dP$ is the phase difference and $f_0$ is the peak frequency of the filter.
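To make stages 1-7 concrete, the following software sketch (a simplified analogue for illustration, not our FPGA or production code; it assumes power-of-two image sizes and reuses gabor_kernel from the earlier sketch) implements the coarse-to-fine loop with warping, energy filtering and the median choice of Equation (1); the merge step is simplified to a residual update wherever the new estimation is valid:

```python
import numpy as np
from scipy import ndimage
# gabor_kernel(f0, theta, ...) as defined in the earlier sketch

def disparity_one_scale(left, right, f0=0.25, n_orient=7, energy_thr=1.0):
    """Steps 2-5: per-orientation phase-difference disparity, energy
    filtering, then a median over orientations as in Equation (1)."""
    estimates = []
    for i in range(n_orient):
        theta = i * np.pi / n_orient
        fx = 2.0 * np.pi * f0 * np.cos(theta)        # horizontal peak frequency
        if abs(fx) < 1e-3:                           # near-vertical filters carry no horizontal shift
            continue
        g = gabor_kernel(f0, theta)
        rl = ndimage.convolve(left, g.real) + 1j * ndimage.convolve(left, g.imag)
        rr = ndimage.convolve(right, g.real) + 1j * ndimage.convolve(right, g.imag)
        dphase = np.angle(rl * np.conj(rr))          # phase difference in (-pi, pi]
        d = dphase / fx
        d[np.minimum(np.abs(rl), np.abs(rr)) < energy_thr] = np.nan  # step 3: energy filtering
        estimates.append(d)
    return np.nanmedian(np.stack(estimates), axis=0)  # step 5: median over orientations

def warp_rows(img, disp):
    """Step 1b: warp the right image horizontally by the current disparity."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    return ndimage.map_coordinates(img, [yy, xx + disp], order=1, mode='nearest')

def coarse_to_fine_disparity(left, right, n_scales=4):
    """Steps 1-7: multi-resolution loop with expansion, warping and merging."""
    pyr = [(left.astype(float), right.astype(float))]
    for _ in range(n_scales - 1):                    # crude pyramid; a real system low-passes first
        l, r = pyr[-1]
        pyr.append((l[::2, ::2], r[::2, ::2]))
    D = np.zeros_like(pyr[-1][0])
    for k in reversed(range(n_scales)):
        l, r = pyr[k]
        if k < n_scales - 1:                         # step 1: expand result to the current scale
            D = 2.0 * np.kron(D, np.ones((2, 2)))    # upscale size and disparity magnitude
        D_new = disparity_one_scale(l, warp_rows(r, D))
        D = D + np.nan_to_num(D_new)                 # step 6: merge valid estimations (residual update)
    return D
```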
2.1.1 Hardware implementation
In this section we describe how the low-level algorithm, without fusion at this stage, has been implemented in a hardware design. The reason for including this part is that we have studied the effectiveness of the proposed fusion scheme in a simulation of the hardware implementation; by combining this information with the results, the feasibility of implementing the fusion in hardware can be estimated. The hardware architecture was implemented on a Xilinx Virtex XC4VFX1000 FPGA using a high-level hardware description language (high-level HDL) which permits description of the functionality at the algorithmic level.
The system consists of two main stages:
1. Stage 1: rectification and image pyramid creation.
2. Stage 2: processing loop, from coarse to fine scale.
By rectification we mean stereo-rectification using epipolar geometry, and by image pyramid we refer to a multi-resolution strategy. The design aims at a finely pipelined circuit benefitting from the high parallelism of the FPGA. The initial processing circuits for the left and right images are replicated and work in parallel. Nevertheless, inside these processing blocks the work is done sequentially, combining stages where possible: image rectification and the first down-scaling are done simultaneously, as soon as enough rectified pixels are available. Once the image pyramids have been created, the processing loop starts from the coarsest scale, advancing towards the finest, by repeating the same block sequentially.
The main steps of the processing loop are:
1. Expansion of results to the next scale: $D^k(\mathbf{x}) = \mathrm{expand}(D^{k+1}(\mathbf{x}))$.
2. Warping of the input images as per the expanded disparity.
3. Disparity calculation for the current scale, $D^k_{new}(\mathbf{x})$.
4. Merging of the disparity estimations: $D^k(\mathbf{x}) = \mathrm{merge}(D^k_{new}(\mathbf{x}), D^k(\mathbf{x}))$.
The architecture works at a data rate of one pixel per clock cycle. Table 1 displays the amount of resources consumed by the implementation. The implementation uses a fixed-point representation; the number of bits used for representing fractions is given in Table 1, and the effect of this quantisation is illustrated in the sketch after the table.
Table 1: Implementation details for the Xilinx Virtex XC4VFX1000 FPGA.

  LUTs (50560)   Slice Flip Flops (50560)   Slices (25280)   DSP (128)   Block RAM (232)   Freq. MHz   Frac. bits
  15810          11693                      12464            80          16                60.0        2
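To see what the two fractional bits in Table 1 imply, the following sketch (illustrative only; not part of the hardware design) quantises disparity values onto the corresponding fixed-point grid: with 2 fractional bits, disparities are resolved in steps of $2^{-2} = 0.25$ pixels, i.e. a worst-case quantisation error of 0.125 pixels per value.

```python
import numpy as np

def to_fixed_point(x, frac_bits=2):
    """Quantise x onto a fixed-point grid with 'frac_bits' fractional bits
    (frac_bits=2, as in Table 1, gives steps of 0.25 pixel)."""
    scale = float(1 << frac_bits)
    return np.round(np.asarray(x) * scale) / scale

print(to_fixed_point([1.10, -0.37, 2.88]))   # -> [ 1.   -0.25  3.  ]
```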
2.2 Symbolic-level algorithm description
The multimodal visual primitives are local visual feature descriptors that were described in [12]. They are semantically and geometrically meaningful descriptions of local image patches, motivated by the hyper-columnar structures in V1 [9]. Primitives can be edge-like or homogeneous, and either 2D or 3D. In this work only edge-like primitives are relevant; for the other definitions the reader should consult [12].
An edge-like 2D primitive is defined by equation (2):

$$\pi = (\mathbf{x}, \theta, \omega, (c_l, c_m, c_r)), \qquad (2)$$

where $\mathbf{x}$ is the image position of the primitive; $\theta$ is the 2D orientation; $\omega$ represents the contrast transition; and $(c_l, c_m, c_r)$ is the representation of the colour, corresponding to the left ($c_l$), the middle ($c_m$) and the right side ($c_r$) of the primitive. Fig. 2 shows the extracted primitives for an example scene.
Figure 2: Extracted primitives (b) for the example image in (a). Magnified primitives in (d) and edge primitives in (c) for the marked region of interest in (b).
A 2D edge primitive $\pi$ is a 2D feature which can be used to find correspondences in a stereo framework to create 3D edge primitives (as introduced in [17]), the formula for which is given in equation (3):

$$\Pi = (\mathbf{X}, \Theta, \Omega, (c_l, c_m, c_r)), \qquad (3)$$

where $\mathbf{X}$ is the 3D position; $\Theta$ is the 3D orientation; $\Omega$ is the phase (i.e., contrast transition); and $(c_l, c_m, c_r)$ is the representation of the colour, corresponding to the left ($c_l$), the middle ($c_m$) and the right side ($c_r$) of the 3D primitive.
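In a software system such primitives amount to small records; a minimal sketch (field names are our own, chosen to mirror equations (2) and (3)) is:

```python
from dataclasses import dataclass
from typing import Tuple

RGB = Tuple[float, float, float]

@dataclass
class Primitive2D:
    """Edge-like 2D primitive, equation (2): pi = (x, theta, omega, (c_l, c_m, c_r))."""
    x: Tuple[float, float]        # image position
    theta: float                  # 2D orientation
    omega: float                  # contrast transition
    colors: Tuple[RGB, RGB, RGB]  # (c_l, c_m, c_r): left, middle, right side of the edge

@dataclass
class Primitive3D:
    """Edge-like 3D primitive, equation (3), built from stereo-matched 2D primitives."""
    X: Tuple[float, float, float]      # 3D position
    Theta: Tuple[float, float, float]  # 3D orientation
    Omega: float                       # phase (contrast transition)
    colors: Tuple[RGB, RGB, RGB]       # (c_l, c_m, c_r)
```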
2.3 Fusion process
The density of sparse algorithms is considerably lower than that of dense algorithms, typically well below 15%. Without propagating the sparse symbolic-level disparity before feeding it back at the signal level, such low-density maps would be able to disambiguate only locally, over a very limited area, making the improvements depend directly upon both the accuracy and the density of the symbolic disparity. In a multi-scale approach, however, fusion is done at each scale, meaning that the sparse disparity map has to be scaled down to match the downscaled image sizes. We scale the symbolic-level disparity map down using median filtering (ignoring positions that do not contain disparity values), which results in a natural propagation of the disparities. Nevertheless, before scaling down the symbolic-level disparity, it is densified by applying voting mask propagation (VMP) [20]. Ralli et al. show that significant densification of a sparse disparity is possible with mask propagation and a voting scheme, at the cost of only a very minor increase in error (cf. [20]). Once the disparity map provided by the symbolic-level algorithm has been densified by applying VMP and scaled down, it is used to disambiguate the interpretations generated by the low-level algorithm by biasing the corresponding nearest (most similar) value, thus maximising coherency. Pseudo-code for the fusion process is given in Algorithm 1.
Algorithm 1 If the difference between the most similar low- and symbolic-level values is above a given rejection threshold, then the symbolic-level value is discarded and the decision is made as indicated by Equation (1).

    if $\min_{\theta} |D_{sym}(\mathbf{x}) - D^{low}_{\theta}(\mathbf{x})| > thr$  OR  $D_{sym}(\mathbf{x}) = \emptyset$ then
        $D = \mathrm{median}(D^{low}_{\theta}(\mathbf{x}))$
    else
        $D = \mathrm{nearest}(D_{sym}(\mathbf{x}), D^{low}_{\theta}(\mathbf{x}))$
    end if
where $D_{sym}(\mathbf{x})$ is the symbolic-level disparity approximation, $D^{low}_{\theta}(\mathbf{x})$ are the energy-filtered, low-level disparity approximations per orientation, $thr$ is the rejection threshold, $\emptyset$ is the empty set and the function $\mathrm{nearest}(A, B)$ returns the value from $B$ that is nearest to $A$ (in the Euclidean sense). Therefore, if there is no symbolic-level disparity approximation, or the difference between the closest symbolic- and low-level disparities is greater than the rejection threshold, then the disparity is chosen 'normally' as defined by Equation (1). If the difference between the closest symbolic- and low-level disparities is below the rejection threshold, then the closest low-level disparity is chosen. This selection mechanism can be understood as biasing the closest low-level disparity value so that it has a higher likelihood of being chosen. Biasing could be done in other ways, for instance using a cost function or a reliability measure. Since the disparities provided by the symbolic level are based on multimodal visual primitives that already have local support [11][18], there is no need to aggregate local evidence in the fusion process when biasing the disparity calculated by the low level. If aggregation of evidence is needed, then, because the multimodal visual primitives arrive at a higher scene description, it should be carried out at that level by grouping the primitives into groups describing the same objects and/or object contours, for example [18]. Figs. 3 and 4 display the data flow without and with fusion.
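In code, the per-position decision of Algorithm 1 can be sketched as follows (our illustrative translation; variable names are our own, with None marking positions where no symbolic-level approximation exists):

```python
import numpy as np

def fuse_pixel(d_sym, d_low, thr):
    """Algorithm 1 at a single position x.
    d_sym : symbolic-level disparity at x, or None if no approximation exists
    d_low : energy-filtered low-level candidates D_low_theta(x), one per orientation
    thr   : rejection threshold"""
    if d_sym is None or min(abs(d - d_sym) for d in d_low) > thr:
        return float(np.median(d_low))               # fall back to Equation (1)
    return min(d_low, key=lambda d: abs(d - d_sym))  # bias: nearest low-level candidate wins

# e.g. candidates from the oriented filters at one pixel:
print(fuse_pixel(1.9, [1.2, 2.1, 5.0, 1.8], thr=0.5))  # -> 1.8  (nearest to the symbolic value)
print(fuse_pixel(9.0, [1.2, 2.1, 5.0, 1.8], thr=0.5))  # -> 1.95 (symbolic value rejected; median)
```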
3 Experiments
The proposed fusion model was tested both quantitatively, using well-known benchmark images from the Middlebury database (http://vision.middlebury.edu/stereo/data), and qualitatively, using images from the DRIVSCO project (http://www.pspc.dibe.unige.it/~drivsco/). In the Middlebury case the results are given in two different ways: