How Neural Networks Are Changing Advanced Driver Assistance Systems


Embedded convolutional neural networks (CNNs) now provide the performance
needed for real-time analysis of the streaming video from multiple cameras
on a car, and for determining what to react to and what to ignore. This
capability will change the future of Advanced Driver Assistance Systems (ADAS).

by Gordon Cooper, Embedded Vision Product Marketing Manager, Synopsys

With the increase in autonomous and semi-autonomous
vehicles, the role of embedded vision has never been greater.
Embedded vision gives an automobile a set of eyes, in the form
of multiple cameras and image sensors, and the neural networks
behind that vision are critical for the automobile to interpret
content from those images and react accordingly. To accomplish
this complex set of functions, embedded vision processors
must run algorithms based on neural networks, be hardware-optimized
for performance while achieving low power and small area, and have
robust tools to program the hardware efficiently.

The significant automotive safety improvements in the past
(e.g., shatter-resistant glass, three-point seatbelts, airbags), were
passive safety measures designed to minimize damage during
an accident. We now have technology that can actively help
the driver avoid crashing in the first place. Advanced Driver
Assistance Systems (ADAS) are behind the semi-autonomous
features we see today, and will help autonomous vehicles become
an everyday reality. Blind spot detection can alert a driver
who tries to move into an occupied lane. Lane departure
warning alerts the driver if the car is drifting outside its lane,
and Lane Keep Aid actively steers the car back into its own
lane. Pedestrian detection notifies the driver that pedestrians are
in front of or behind the car, and Automatic Emergency Braking
applies the brakes to avoid an accident or pedestrian injury.
As ADAS features are combined, we get closer to autonomous
vehicles—all enabled by convolutional neural networks (CNNs)
and high-performance vision processing.

Auto manufacturers are including more cameras in their cars,
as shown in Figure 1. A front facing camera can detect pedestrians
or other obstacles and with the right algorithms, assist
the driver in braking. A rear-facing camera – mandatory in the
United States for most new vehicles starting in 2018 – can save
lives by alerting the driver to objects behind the car, out of the
driver’s field of view. A camera in
the car's cockpit facing the driver
can identify and alert for distracted
driving. And most recently, adding
four to six additional cameras can
provide a 360-degree view around
the car.

Vision Processors +
CNN for Object Detection

Since the driver is already facing
forward, a front facing camera may
seem unnecessary. However, a front
facing camera that is consistently
faster than the driver in detecting
and alerting for obstacles is very
valuable. While an ADAS system
can physically react faster than a
human driver, it needs embedded
vision to provide real-time analysis
of the streaming video and know
what to react to.

Vision processors are based on
heterogeneous processing units.
That means the programming tasks
are divided into processing units
with different strengths. Most of
the code will be written in C or
C++ for a traditional, easy-to-program
32-bit scalar processor. The vector
DSP unit will perform most of the
computations, because its very large
instruction word can handle a lot
of parallel computations for pixel
processing of each incoming image.
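
The division of labor can be illustrated with a toy example: the scalar unit runs ordinary control code one element at a time, while the vector unit applies the same arithmetic to many pixels per instruction. The NumPy sketch below mimics that contrast (the 0.5x + 1 "pixel transform" is purely illustrative, not any actual EV-series operation):

```python
import numpy as np

frame = np.arange(16, dtype=np.float32).reshape(4, 4)  # toy 4x4 image

# Scalar-processor style: explicit control flow, one pixel per iteration
out_scalar = np.empty_like(frame)
for i in range(frame.shape[0]):
    for j in range(frame.shape[1]):
        out_scalar[i, j] = frame[i, j] * 0.5 + 1.0

# Vector-DSP style: the same arithmetic applied to all pixels at once
out_vector = frame * 0.5 + 1.0
```

Both paths compute the same result; the vectorized form is the software analogue of a very-long-instruction-word unit processing many pixels in parallel.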

Detecting a pedestrian in front
of a car is part of a broad class of
“object detection.” For each object
to be detected, traditional computer
vision algorithms were hand-crafted.
Examples of algorithms used
for detection include Viola-Jones
and more recently Histogram of
Oriented Gradients (HoG). The
HOG algorithm looks at the edge directions within an image to
try to describe objects. HOG was considered the state of the art
for pedestrian detection as late as 2014.
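
The core of HOG can be sketched in a few lines: compute each pixel's gradient, then accumulate a histogram of gradient orientations weighted by gradient magnitude. The following is a minimal single-cell sketch in NumPy; a real HOG pipeline adds block normalization and an SVM classifier on top:

```python
import numpy as np

def hog_cell_histogram(patch, bins=9):
    """Single HOG cell: bin each pixel's gradient orientation,
    weighted by its gradient magnitude."""
    gx = np.zeros_like(patch)
    gy = np.zeros_like(patch)
    gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]    # horizontal gradient
    gy[1:-1, :] = patch[2:, :] - patch[:-2, :]    # vertical gradient
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    hist = np.zeros(bins)
    for m, a in zip(mag.ravel(), ang.ravel()):
        hist[int(a // (180.0 / bins)) % bins] += m
    return hist

# A horizontal intensity ramp has purely horizontal gradients (0 degrees),
# so all the histogram weight lands in the first orientation bin.
patch = np.tile(np.arange(8.0), (8, 1))
hist = hog_cell_histogram(patch)
```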

Emergence of Neural Networks for Object Detection

CNNs are organized as a set of layers of artificial neurons,
each of which undertakes a series of operations and commu
nicates its results to adjacent layers. Each type of layer offers
different functions, e.g., input layers which take in the image
data, output layers which deliver the specified results (such as
recognition of objects in an image), and one or more hidden
layers between the input and output which help refine the network's
results.
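
A minimal forward pass through such a layer stack can be sketched in NumPy: an input image flows through a convolution, a ReLU non-linearity, a pooling step, and a fully connected output layer. The layer sizes and random weights here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, kernel):
    """Valid 2-D convolution: slide one small kernel across the image."""
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = rng.random((8, 8))                               # input layer: toy image
feat = np.maximum(conv2d(img, rng.random((3, 3))), 0)  # hidden: conv + ReLU, 6x6
pooled = feat.reshape(3, 2, 3, 2).max(axis=(1, 3))     # hidden: 2x2 max pool, 3x3
scores = pooled.ravel() @ rng.random((9, 2))           # output: two class scores
```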

Although the concept of neural networks, which are computer
systems modeled after the brain, has been around for a long
time, only recently have semiconductors
achieved the processor performance to
make them a practical reality. In 2012, a
CNN-based entry into the annual ImageNet
competition showed a significant
improvement in accuracy in the task of
image classification over the traditional
computer vision algorithms. Because of
the improved accuracy, the use of neural
network-based techniques for image classification,
detection and recognition has
been gaining momentum ever since.

The important breakthrough of deep
neural networks is that object detection
no longer has to be a hand-crafted
coding exercise. Deep neural networks
allow features to be learned automatically
from training examples. A neural
network is considered to be “deep” if
it has an input and output layer and
at least one hidden middle layer. Each
node is calculated from the weighted inputs
from multiple nodes in the previous
layer. CNNs are the current state-of-the
art for efficiently implementing deep neural networks for vision.
CNNs are more efficient because they reuse a lot of weights
across the image.
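
The efficiency gain from weight reuse is easy to quantify. Treating the numbers below as an illustrative comparison: a fully connected layer needs a separate weight for every input-output pixel pair, while a convolutional layer slides the same small kernel across every position:

```python
# Illustrative parameter counts for one layer on a 1280x1024 input
in_pixels = 1280 * 1024
out_pixels = 640 * 512

# Fully connected: one weight per (input pixel, output pixel) pair
fc_params = in_pixels * out_pixels      # hundreds of billions of weights

# Convolutional: one 3x3 kernel reused at every output position
conv_params = 3 * 3                     # nine shared weights

reuse_factor = fc_params // conv_params
```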

Early CNNs in the embedded space were implemented using a
GPU or using the vector DSP portion of a vision processor. However,
it’s helpful to look at the task performed in terms of three
different heterogeneous processing units.
Early implementations of CNNs in hardware had a limited
number of Multiply-Accumulator (MAC) units. For example,
Synopsys’s EV5x, the industry’s first programmable and configurable
vision processor IP core, implemented a CNN engine
with 64 MACs. Running at 500 MHz, the EV5x could produce
32 GMACs/s or 64 GOPs/s of performance (a multiply-accumulator
performs two operations in one instruction). That
was not enough performance to process an entire 1MP (1280
x 1024) frame or image. However, it was enough processing
power to perform a CNN on a portion of the image (say a 64×64
pixel patch). To process the entire image, a two-step process
for pedestrian detection was needed. The vector DSP would
perform a computationally intensive Region of Interest (ROI)
algorithm on each incoming image of the video stream. ROI
uses a sliding-window approach to identify candidates that could
be a pedestrian (ruling out, for example, portions of the sky).
Those “pedestrian” patches were then processed by the CNN to
determine whether each was in fact a pedestrian. CNN-based pedestrian
detection solutions have been shown to have better accuracy
than algorithms like HOG, and perhaps more importantly, it is
easier to retrain a CNN to look for a bicycle than it is to write
a new hand-crafted algorithm to detect a bicycle instead of a pedestrian.
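
The performance figures above follow directly from the MAC count and clock rate; a quick check, treating one multiply-accumulate as two operations as the article does:

```python
macs = 64                              # EV5x CNN engine MAC units
clock_hz = 500e6                       # 500 MHz

gmacs_per_s = macs * clock_hz / 1e9    # 32 GMAC/s
gops_per_s = gmacs_per_s * 2           # multiply + add = 2 ops -> 64 GOPS

# MAC operations available per frame at a 15 fps video rate
macs_per_frame = macs * clock_hz / 15
```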

Larger CNNs for Whole-Frame Object Detection

As embedded CNNs become more powerful, they no longer
are restricted to processing patches of the incoming image. Synopsys’s
latest vision processor, the EV6x, includes a CNN engine
with 880 MACs – a significant performance leap compared to its
predecessor. Running at 800MHz, this produces (880 x .8) = 704
GMACs/s or about 1400 GOPs/s. That performance is enough
to process an entire 1MP image using CNN. The vector DSP is
still valuable for pre-processing the images (e.g., reformatting
and pyramiding) and performing post-processing tasks like
non-maximum suppression (NMS). As shown in Figure 2, the
EV6x still has scalar, vector and CNN units for heterogeneous
processing. It was also designed with multicore features that
allow it to easily scale to multiple vision cores.
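
Non-maximum suppression, mentioned above as a post-processing task, collapses the many overlapping boxes a detector emits for one object into a single detection. A standard greedy NMS sketch in NumPy:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop any remaining box whose overlap (IoU) with it exceeds iou_thresh."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

# Two near-duplicate detections of one object, plus one distinct detection
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
```

Running `nms(boxes, scores)` keeps the first and third boxes and suppresses the near-duplicate second box.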

The benefit of processing the entire image frame is that CNN
can be trained to detect multiple objects. Now, instead of just
finding a pedestrian, the CNN graph can be trained to find a
bicycle, other automobiles, trucks, etc. To do that with an algorithm
like HOG would require hand-crafting a new algorithm for
each new object type.

Training and Deploying CNNs

As mentioned earlier, a CNN is not programmed. It is trained.
A deep learning framework, like Caffe or TensorFlow, will use
large data sets of images to train the CNN graph – refining
coefficients over multiple iterations – to detect specific features
in the image. Figure 3 shows the key components for CNN graph
training, where the training phase uses banks of GPUs in the
cloud for the significant amount of processing required.
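
What "refining coefficients over multiple iterations" means can be seen in miniature with plain gradient descent. The stand-in below trains a toy linear classifier on synthetic 2-D features rather than a real network in Caffe or TensorFlow, but the loop structure is the same: predict, measure error, nudge the weights, repeat:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "features" for two labeled classes (stand-in for image data)
x = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w = np.zeros(2)                              # coefficients to be refined
b = 0.0
for _ in range(200):                         # training iterations
    p = 1 / (1 + np.exp(-(x @ w + b)))       # forward pass: predictions
    grad_w = x.T @ (p - y) / len(y)          # gradient of cross-entropy loss
    grad_b = np.mean(p - y)
    w -= 0.1 * grad_w                        # refine the coefficients
    b -= 0.1 * grad_b

p = 1 / (1 + np.exp(-(x @ w + b)))
accuracy = np.mean((p > 0.5) == y)           # near-perfect on separable data
```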

The deployment – or “inference” – phase is executed on
the embedded system. Development tools, such as Synopsys’s
MetaWare EV Toolkit, take the 32-bit floating point weights or
coefficients output from the training phase and scale them to a
fixed point format. The goal is to use the smallest bit resolution
that still produces equivalent accuracy compared to the 32-bit
floating point output. Fewer bits in a multiply-accumulator
means less power required to calculate the CNN and smaller
die area (leading to lower cost) for the embedded solution.
Based on Synopsys calculations, 10-bit or higher resolution
is needed to assure the same accuracy as the 32-bit Caffe output
without graph retraining.
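
The float-to-fixed scaling step can be sketched as symmetric linear quantization: pick a scale so the largest weight maps to the biggest representable integer, round every weight onto that grid, and measure how much precision the narrower format costs. The scheme and weight distribution below are illustrative; MetaWare EV's actual procedure is not publicly specified here:

```python
import numpy as np

def quantize(weights, bits):
    """Scale float weights onto a symmetric signed fixed-point grid,
    then return the dequantized values for error measurement."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 511 for 10 bits
    scale = np.max(np.abs(weights)) / qmax
    codes = np.round(weights / scale).astype(np.int32)
    return codes * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, 1000).astype(np.float32)  # stand-in trained weights

err_8 = np.max(np.abs(quantize(w, 8) - w))
err_10 = np.max(np.abs(quantize(w, 10) - w))       # 4x finer grid than 8-bit
```

Each extra bit halves the grid spacing, which is why a 10-bit format can track the 32-bit floating-point output closely while an over-aggressive reduction loses accuracy.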

The MetaWare EV tools take the weights and the graph topology
(the structure of the convolutional, non-linearity, pooling,
and fully connected layers that exist in a CNN graph) and map
them onto the hardware of the dedicated CNN engine. Assuming
there are no special graph layers, the CNN is now “programmed”
to detect the objects that it’s been trained to detect.

To keep the size small, the CNN engine is optimized to execute
key CNN features such as 3×3 and 5×5 matrix multiplies,
but not so optimized that it becomes a hard-wired solution. It’s
important to be programmable to maintain flexibility. As CNNs
continue to evolve – new layer techniques or pooling methods
for example – the vector DSP can play another important role in
the vision processing. Since the vector DSP and CNN engine are
closely coupled in the Synopsys EV6x, it is easy to dispatch tasks
from the CNN to the vector DSP as needed. OpenVX runtime,
incorporated into the MetaWare EV tools, makes sure those
tasks are scheduled with other vector DSP processing requirements.

The vector DSP future-proofs the CNN engine.

Figure 4 shows the inputs and outputs of an embedded vision
processor. The streaming images from the car’s camera are fed
into the CNN engine that is preconfigured with the graph and
weights. The output of the CNN is a classification of the contents
of the image.

Scene Segmentation and Navigation

Up to now, we’ve discussed object classification of pedestrians
(or bicycles or cars or trucks) that can be used for collision
avoidance – an ADAS example. CNNs with high enough performance
can also be used for scene segmentation – labeling every
pixel in an image. The goal of scene segmentation
is less about identifying specific pixels than about identifying
the boundaries between types of objects in the scene. Knowing
where the road is compared to other objects in the scene provides
a great benefit to a car’s navigation and brings us one step
closer to autonomous vehicles.

One scene segmentation example, running on a DesignWare
EV61’s CNN, segmented the streaming images using 11 categories
of objects (road, sky, buildings, pedestrians, etc.). With
five channels of 1920×1080 images as input, the CNN, running
at 800MHz, achieved 18fps. Scene segmentation is difficult
for CNNs that don’t have the horsepower to process multiple
instances of the full images (frames).

Future Requirements for Vision Processors
in Automotive Vision

Vision processing solutions will need to scale as future demands
call for more processing performance. A 1MP image is a
reasonable resolution for existing cameras in automobiles. However,
more cameras are being added to the car and the demand
is growing from 1MP to 3MP or even 8MP cameras. The greater
a camera’s resolution, the farther away an object can be detected.
There are simply more bits to analyze to determine if an object,
such as a pedestrian, is ahead. The camera frame-rate (FPS) is
also important. The higher the frame rate, the lower the latency
and the more stopping distance remains available. For a 1MP RGB camera
running at 15 FPS, that would be 1280×1024 pixels/frame times
15 frames/second times three colors or about 59M bytes/second
to process. An 8MP image at 30fps will require 3264×2448
pixels/frame times 30 frames/second times three colors or about
720M bytes/second.
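
These bandwidth figures are straightforward to reproduce:

```python
def camera_bytes_per_second(width, height, fps, bytes_per_pixel=3):
    """Raw RGB data rate streaming from one camera."""
    return width * height * fps * bytes_per_pixel

one_mp = camera_bytes_per_second(1280, 1024, 15)    # ~59M bytes/s
eight_mp = camera_bytes_per_second(3264, 2448, 30)  # ~719M bytes/s
```

An 8MP camera at 30fps therefore demands roughly twelve times the processing bandwidth of a 1MP camera at 15fps, before any neural-network computation even begins.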

This extra processing performance can’t come with a
disproportionate spike in power or die area. Automobiles are
consumer items that have constant price pressures. Low power
is very important. Vision processor architectures have to be
optimized for power and yet still retain programmability.


As the requirements for ADAS in automotive applications
continue to grow, embedded vision and deep learning technology
will keep up. Object detection has evolved from small-scale
identification to full scenes with every pixel accounted for, and
flexibility will continue to be as important as performance,
power and area. Synopsys’ DesignWare EV6x Embedded Vision
Processors are fully programmable to address new graphs as
they are developed, and offer high performance in a small area
and with highly efficient power.

Author Bio:

Gordon Cooper is a Product Marketing Manager for Synopsys’
Embedded Vision Processor family. Gordon brings more than 20
years of experience in digital design, field applications and marketing
at Raytheon, Analog Devices, and NXP to the role. Gordon
also served as a Commanding Officer in the US Army Reserve,
including a tour in Kosovo. Gordon holds a Bachelor of Science
degree in Electrical Engineering from Clarkson University.

