27 Sep 2016

Deep Learning Challenges in Embedded Platforms

Liran Bar, Director of Product Marketing for CEVA’s Imaging & Vision DSP core product line, looks at how to overcome the challenges of deep learning in embedded systems.

The successful spread of artificial intelligence (AI) into everyday applications will depend on how easy it is to deploy deep neural networks in small, low-power devices rather than on large server networks.

In this post we look at ways to deal with those challenges.

GoogLeNet deep convolutional neural network

In 2014, Google entered the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with a network called GoogLeNet. It is an interesting case study because it is a 22-layer deep convolutional network that includes nine inception modules, creating a very rich and complex topology.

In the GoogLeNet network, the data on each connection in each layer can potentially go back and forth through DDR memory. Handling this in an embedded system is a challenge. The complex topology of the network must be divided into batches of layers to run on a DSP or dedicated hardware. We call this subnetwork division.

In our CEVA network generator tool, all analysis is done automatically without user intervention. The network is divided into subnetworks and each subnetwork runs on the DSP according to the execution order set by the network generator. For example, let’s take a look at the inception part of the GoogLeNet network after going through our network generator tool.

CEVA network generator tool

As the image above shows, the network generator created four subnetworks. Three of these subnetworks run at different execution times, but two can run in parallel on different cores. Additionally, the network generator is designed to create long layer sequences, which can potentially run entirely from internal memory.
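One simple way to picture subnetwork division is to cut the layer graph wherever it branches or merges, so that each subnetwork is a straight sequence of layers. The sketch below does this for a single inception-style module. It is an illustration under our own assumptions, not the algorithm used by the CEVA network generator; the function name `split_into_chains` and the layer names are hypothetical.

```python
from collections import defaultdict


def split_into_chains(edges):
    """Split a layer graph into maximal linear chains ("subnetworks").

    edges: list of (producer_layer, consumer_layer) pairs.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = []
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)
        for n in (src, dst):
            if n not in nodes:
                nodes.append(n)

    def starts_chain(n):
        # A layer continues a chain only if it has exactly one producer
        # and that producer feeds nothing else.
        return not (len(pred[n]) == 1 and len(succ[pred[n][0]]) == 1)

    chains = []
    for n in nodes:
        if not starts_chain(n):
            continue
        chain = [n]
        while len(succ[chain[-1]]) == 1:
            nxt = succ[chain[-1]][0]
            if len(pred[nxt]) != 1:
                break  # next layer merges several branches
            chain.append(nxt)
        chains.append(chain)
    return chains


# Hypothetical inception-style module: four branches joined by a concat layer.
inception_edges = [
    ("prev", "1x1"),
    ("prev", "3x3_reduce"), ("3x3_reduce", "3x3"),
    ("prev", "5x5_reduce"), ("5x5_reduce", "5x5"),
    ("prev", "pool"), ("pool", "pool_proj"),
    ("1x1", "concat"), ("3x3", "concat"),
    ("5x5", "concat"), ("pool_proj", "concat"),
]
for chain in split_into_chains(inception_edges):
    print(" -> ".join(chain))
```

In this toy graph the four branches come out as four independent chains. Because the branches do not depend on each other, a scheduler is free to run them one after another or on different cores.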

Overcoming the Challenges

Next, let’s take a look at methods designed to overcome some of the most significant challenges of deep learning in embedded platforms.

Reducing bandwidth

Because bandwidth is tightly constrained in embedded platforms, implementing convolutional neural networks will almost certainly run into bandwidth issues. These are caused either by loading the network filter weights or by transferring data from layer to layer.

Here are two rules that can help reduce the bandwidth significantly:

  1. Each output map is created by running the same filter at different positions in the input map. Relying on this rule, we can load the filter weights once and reuse them, avoiding unnecessary bandwidth usage.

  2. All outputs are calculated from the same input data. Applying this rule, the input can be loaded once and used for all the outputs, without accessing the DDR more than once.
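The effect of these two rules can be estimated with a rough count of the elements fetched from external memory for one convolutional layer. The model below is a simplified sketch, assuming stride 1, no padding, and an internal memory large enough to hold the weights and the input. The function `conv_ddr_traffic` and its parameters are hypothetical names chosen for illustration.

```python
def conv_ddr_traffic(in_ch, out_ch, height, width, k, reuse=True):
    """Count elements fetched from DDR for one conv layer (stride 1, no padding)."""
    out_h, out_w = height - k + 1, width - k + 1
    weights = out_ch * in_ch * k * k
    inputs = in_ch * height * width
    if reuse:
        # Rule 1: each filter is fetched once and applied at every position.
        # Rule 2: the input is fetched once and shared by all output maps.
        return weights + inputs
    # No reuse: every output element refetches its filter and its input window.
    per_output = 2 * in_ch * k * k
    return out_ch * out_h * out_w * per_output


with_reuse = conv_ddr_traffic(3, 8, 32, 32, 3, reuse=True)
without_reuse = conv_ddr_traffic(3, 8, 32, 32, 3, reuse=False)
print(with_reuse, without_reuse)  # 3288 388800
```

Even for this small layer, the naive schedule moves over a hundred times more data than the schedule that applies both rules.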

Multiply and Accumulate Utilization

A powerful feature of DSP architecture is the ability to perform single-cycle multiply-accumulate (MAC) instructions for intense computations. To maximize efficiency, it is beneficial to have a continuous sequence of MAC instructions. How to achieve this differs between two distinct cases:

  1. A low number of large input maps

  2. A high number of small input maps

In the first case, we prefer to complete the filter calculation for each input map before moving on to the next map. This way we benefit from overlapping filter windows, although at the edges of the map some MAC utilization is lost. As shown in the formula below, width and height are calculated first in this case. We call this approach local filter calculation.

Local Filter Calculation

Formula for local filter calculation, used for large maps
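The loop ordering behind local filter calculation can be sketched in plain Python. This is an illustrative reference loop, not CEVA's DSP implementation; the name `conv_local` and the nested-list data layout are assumptions. The point is the order of the loops: the spatial positions of one input map are finished before the next map is touched.

```python
def conv_local(inputs, filters):
    """Local filter calculation for one output map.

    inputs:  nested lists indexed [channel][row][col]
    filters: nested lists indexed [channel][i][j], one K x K filter per input map
    """
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    K = len(filters[0])
    out = [[0.0] * (W - K + 1) for _ in range(H - K + 1)]
    for c in range(C):                  # one input map at a time
        for y in range(H - K + 1):      # height and width first
            for x in range(W - K + 1):
                acc = 0.0
                for i in range(K):
                    for j in range(K):
                        acc += inputs[c][y + i][x + j] * filters[c][i][j]
                out[y][x] += acc        # accumulate this map's contribution
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ones = [[1, 1], [1, 1]]
print(conv_local([image, image], [ones, ones]))
# [[24.0, 32.0], [48.0, 56.0]]
```

With large maps the two inner spatial loops are long, so the MAC units stay busy within a single map and neighbouring windows reuse most of the same input data.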

In the second case, where there are many small input maps, the calculation should be performed across the maps. Different input maps are combined into one output map. Partial filter results are calculated, and at the end of the process all the partial results are summed into one result, which the linearity of the convolutional filter allows. As shown in the next formula, channels are calculated first. We call this approach cross map filter calculation.

Cross map filter calculation

Formula for cross map filter calculation, used for a large number of small maps (last layers)
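The cross map ordering can be sketched the same way. Again this is only an illustrative reference loop under our own assumptions; `conv_cross_map` and the nested-list layout are hypothetical. Here the channel loop is innermost, each filter tap keeps a partial result, and the partial results are summed at the end.

```python
def conv_cross_map(inputs, filters):
    """Cross map filter calculation for one output map.

    inputs:  nested lists indexed [channel][row][col]
    filters: nested lists indexed [channel][i][j], one K x K filter per input map
    """
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    K = len(filters[0])
    out = []
    for y in range(H - K + 1):
        row = []
        for x in range(W - K + 1):
            partial = [0.0] * (K * K)       # one partial result per filter tap
            for i in range(K):
                for j in range(K):
                    for c in range(C):      # channels first: long MAC run across maps
                        partial[i * K + j] += inputs[c][y + i][x + j] * filters[c][i][j]
            row.append(sum(partial))        # combine the partial results
        out.append(row)
    return out


# Three 1x1 input maps, as might be found in the last layers of a network.
maps = [[[1]], [[2]], [[3]]]
taps = [[[1]], [[1]], [[2]]]
print(conv_cross_map(maps, taps))  # [[9.0]]
```

When the maps are tiny, the spatial loops are too short to keep the MAC units busy, so the long loop over channels is what provides the continuous MAC sequence.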

Utilizing internal memory

To use the embedded resources efficiently, all the input maps should reside in internal memory and be loaded only once. But what if there is not enough memory to follow this rule? In this case we need to divide the input into tiles while still preserving the rule for each tile. After the division, we have the same number of input maps, but in tiles. The cost of this division is that the weights must be loaded once per tile, so weight traffic grows with the number of tiles.
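A back-of-the-envelope estimate shows how tiling trades input residency against weight traffic. The sketch below is a deliberately simplified model: it ignores the overlap needed between neighbouring tiles and the memory taken by the weights themselves. The function `tiled_weight_traffic` and the parameter `mem_elems` (internal memory capacity in elements) are hypothetical.

```python
import math


def tiled_weight_traffic(in_ch, out_ch, height, width, k, mem_elems):
    """Estimate the tile count and total weight elements loaded for one conv layer."""
    input_elems = in_ch * height * width
    # Split the input into as many tiles as needed to fit internal memory.
    tiles = max(1, math.ceil(input_elems / mem_elems))
    weight_elems = out_ch * in_ch * k * k
    # The full set of weights is reloaded for every tile.
    return tiles, tiles * weight_elems


print(tiled_weight_traffic(64, 128, 56, 56, 3, mem_elems=50_000))
# (5, 368640)
print(tiled_weight_traffic(64, 128, 56, 56, 3, mem_elems=300_000))
# (1, 73728)
```

In this example a memory that holds only a quarter of the input forces five tiles, and the weight traffic rises fivefold compared with the case where the whole input fits.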

All these problems and their solutions are clearly something the user would rather not deal with when implementing deep learning on an embedded platform. At CEVA, we believe a real-time system should handle them without the user’s involvement, or even awareness. This is a core responsibility of the CEVA deep neural network framework and the CEVA network generator.

What else can be done?

We’ve covered a few embedded algorithmic solutions that change the convolution calculation to our benefit. In addition to these, more can be done at the algorithmic level by understanding how neural networks work. Here are a few examples that use compression and prior knowledge to reduce bandwidth and improve performance:

  • Using compression algorithms such as Huffman coding

  • Working in a pipeline to save bandwidth

  • Identifying when some of the calculation can be skipped

  • Sharing data between calculations

  • Recognizing when the focus should be on the weights and when it should be on the map size, which is network dependent

  • Compressing and decompressing better over time (learning from frame-by-frame execution)
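As an illustration of the first item, Huffman coding gives shorter codes to values that occur more often, which helps when many quantized weights share a few values. The sketch below is a generic textbook Huffman coder, not CEVA's compression scheme; the function `huffman_code` and the sample weight values are assumptions made for the example.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return {symbol: bitstring} for a sequence of hashable symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical quantized weights in which zero dominates.
weights = [0] * 8 + [1] * 4 + [-1] * 2 + [2] * 2
code = huffman_code(weights)
total_bits = sum(len(code[w]) for w in weights)
print(code, total_bits)
```

For these 16 weights the Huffman code needs 28 bits, compared with 32 bits for a fixed 2-bit code, and the saving grows as the distribution becomes more skewed.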


As you can see, there is a lot that can be done on the technical side of deep convolutional neural networks for embedded systems. Once the challenges of deep learning in embedded systems have been overcome, many opportunities open up.

About the author

Liran Bar is Director of Product Marketing for CEVA’s Imaging & Vision DSP core product line. Liran has more than fifteen years of experience in the imaging semiconductor industry. He holds a B.Sc. in Electrical Engineering from Ben-Gurion University.

CEVA is the leading licensor of signal processing IP for a smarter, connected world. We partner with semiconductor companies and OEMs worldwide to create power-efficient, intelligent and connected devices for a range of end markets, including mobile, consumer, automotive, industrial and IoT. Our ultra-low-power IPs for vision, audio, communications and connectivity include comprehensive DSP-based platforms for LTE/LTE-A/5G baseband processing in handsets, infrastructure and machine-to-machine devices, computer vision and computational photography for any camera-enabled device, audio/voice/speech and ultra-low power always-on/sensing applications for multiple IoT markets. CEVA can be found at www.ceva-dsp.com

Radio-Electronics.com is operated and owned by Adrio Communications Ltd and edited by Ian Poole. All information is © Adrio Communications Ltd and may not be copied except for individual personal use.