27 Sep 2016
Deep Learning Challenges in Embedded Platforms
Liran Bar, Director of Product Marketing, CEVA, Imaging & Vision DSP core product line, looks at how to overcome the deep learning challenges in embedded systems.
The successful spread of artificial intelligence (AI) into everyday applications depends on how easy it is to deploy deep neural networks in small, low-power devices rather than in large server networks.
In this post we look at ways to deal with those challenges.
GoogLeNet deep convolutional neural network
In 2014, Google entered the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with a network called GoogLeNet. It is an interesting case study because it is a 22-layer deep convolutional network that includes nine inception modules, creating a very rich and complex topology.
In the GoogLeNet network, each connection in each layer can potentially go back and forth through DDR. Handling this in an embedded system is a challenge. The complex topology of the network must be divided into batches of layers to run on a DSP or dedicated hardware. We call this subnetwork division.
In our CEVA network generator tool, all analysis is done automatically without user intervention. The network is divided into subnetworks and each subnetwork runs on the DSP according to the execution order set by the network generator. For example, let’s take a look at the inception part of the GoogLeNet network after going through our network generator tool.
CEVA network generator tool
As you can see in the above image, the network generator created four subnetworks. Of these subnetworks, three run at different execution times, but two can run in parallel on different cores. Additionally, the network generator is designed to create long layer sequences, which can potentially stay entirely within internal memory.
Overcoming the Challenges
Next, let’s take a look at methods designed to overcome some of the most significant challenges of deep learning in embedded platforms.
Because bandwidth is tightly constrained in embedded platforms, implementing convolutional neural networks on them will inevitably run into bandwidth issues. These are caused either by loading the network filter weights or by transferring data from layer to layer.
Here are two rules that can help reduce the bandwidth significantly:
Each output map is created by running the same filter at different positions in the input map. Relying on this rule, the filter weights need to be loaded only once, which avoids a massive amount of redundant weight traffic.
Each output is calculated from the same input data. Applying this rule, the input can be loaded once and used for all the outputs, without going to DDR more than once.
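To get a feel for how much these two rules matter, here is a rough sketch that counts the elements fetched from DDR for a single convolutional layer. The cost model, the function name ddr_traffic and the layer dimensions are illustrative assumptions, not figures from the CEVA tools.

```python
def ddr_traffic(c_in, c_out, height, width, k, reuse_weights, reuse_inputs):
    """Count elements fetched from DDR for one conv layer (hypothetical cost model).

    c_in / c_out are the number of input / output maps, k is the filter size,
    and the convolution is assumed to be 'valid' (no padding, stride 1).
    """
    out_positions = (height - k + 1) * (width - k + 1)
    weights = c_in * c_out * k * k
    inputs = c_in * height * width

    # Rule 1: the same filter is used at every position, so load it once.
    weight_loads = weights if reuse_weights else weights * out_positions
    # Rule 2: every output map reads the same inputs, so load them once.
    input_loads = inputs if reuse_inputs else inputs * c_out
    return weight_loads + input_loads


naive = ddr_traffic(3, 8, 32, 32, 3, reuse_weights=False, reuse_inputs=False)
with_rules = ddr_traffic(3, 8, 32, 32, 3, reuse_weights=True, reuse_inputs=True)
print(naive, with_rules)
```

Under this simplified model the saving grows with both the map size and the number of output maps, which is why both rules are worth enforcing together.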
Multiply and Accumulate Utilization
A powerful feature of DSP architecture is the ability to perform single-cycle multiply-accumulate (MAC) instructions for intense computations. To maximize efficiency, it is beneficial to have a continuous sequence of MAC instructions. How to achieve this differs between two distinct cases:
A low number of large input maps
A high number of small input maps
In the first case, we prefer to complete the filter calculation for each input map before going on to the next map. This way we benefit from the overlap between neighboring filter positions, at the cost of some MAC utilization loss at the edges of the map. As shown in the formula below, width and height are calculated first in this case. We call this approach local filter calculation.
Formula for local filter calculation, used for large sized maps
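As a rough illustration of this loop ordering, the following sketch walks one input map at a time and covers its full height and width before moving on. The function name local_filter_conv and the sample data are assumptions made for illustration; the operation is a plain "valid" cross-correlation with stride 1, not CEVA's actual implementation.

```python
def local_filter_conv(input_maps, kernels):
    """Accumulate several input maps into one output map, one map at a time."""
    k = len(kernels[0])
    out_h = len(input_maps[0]) - k + 1
    out_w = len(input_maps[0][0]) - k + 1
    out = [[0] * out_w for _ in range(out_h)]
    for in_map, kernel in zip(input_maps, kernels):   # finish one map first
        for y in range(out_h):                        # height and width come
            for x in range(out_w):                    # before the next map
                acc = 0
                for dy in range(k):                   # unbroken run of MACs
                    for dx in range(k):
                        acc += in_map[y + dy][x + dx] * kernel[dy][dx]
                out[y][x] += acc
    return out


maps = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
kernels = [[[1, 0], [0, 1]],
           [[1, 1], [1, 1]]]
print(local_filter_conv(maps, kernels))  # [[10, 12], [16, 18]]
```

With large maps, the inner spatial loops are long, so the MAC sequence is rarely interrupted by a switch to another map.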
In the second case, where there is a large number of small input maps, the calculation should be performed across the maps. Different input maps are processed into one output map. Partial filter results are calculated per map, and at the end of the process all the partial results are summed into one result, which is possible because the convolutional filter is a linear operation. As shown in the next formula, channels are calculated first. We call this approach cross map filter calculation.
Formula for cross map filter calculation, used for large number of maps with small size (last layers)
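The channel-first ordering can be sketched as follows: for each output position, a partial result is computed from every input map, and the partial results are then summed. The function name cross_map_conv and the sample data are illustrative assumptions; the operation is a plain "valid" cross-correlation with stride 1.

```python
def cross_map_conv(input_maps, kernels):
    """Compute one output map by iterating across input maps at each position."""
    k = len(kernels[0])
    out_h = len(input_maps[0]) - k + 1
    out_w = len(input_maps[0][0]) - k + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            partials = []
            for in_map, kernel in zip(input_maps, kernels):  # channels first
                partial = 0
                for dy in range(k):
                    for dx in range(k):
                        partial += in_map[y + dy][x + dx] * kernel[dy][dx]
                partials.append(partial)
            row.append(sum(partials))      # sum the per-map partial results
        out.append(row)
    return out


small_maps = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
small_kernels = [[[1, 0], [0, 1]],
                 [[1, 1], [1, 1]]]
print(cross_map_conv(small_maps, small_kernels))  # [[10, 12], [16, 18]]
```

When the maps are small but numerous, as in the last layers of a network, the channel loop is the long one, so placing it next to the MAC loops keeps the MAC sequence continuous.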
Utilizing internal memory
To use the embedded resources efficiently, we must have all the input maps in the internal memory, and load them only once. But what if we don't have enough memory to preserve this rule? In this case we will need to divide the input into tiles, while still preserving the rule for each tile. After the division, we will have the same number of inputs, but in tiles. The cost of this division is that the number of weight loads grows with the number of tiles.
All these problems and their solutions are clearly something that the user would like to avoid dealing with when implementing deep learning on an embedded platform. At CEVA, we believe it should be a basic requirement that a real-time system handles this without the user's involvement, or even awareness. This is a core responsibility of the CEVA deep neural network framework and CEVA network generator.
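A minimal sketch of tile division along the rows of an input map is shown below. It assumes a "valid" convolution with stride 1, that neighboring tiles overlap by kernel - 1 rows so every output row is computed inside exactly one tile, and that the weights are reloaded once per tile. The function names and sizes are hypothetical.

```python
def split_into_tiles(height, kernel, max_rows):
    """Return (first_row, end_row_exclusive) input ranges of at most max_rows rows."""
    if max_rows < kernel:
        raise ValueError("a tile must hold at least one full filter window")
    out_rows = height - kernel + 1
    rows_per_tile = max_rows - kernel + 1      # output rows produced by one tile
    tiles = []
    for out_start in range(0, out_rows, rows_per_tile):
        out_end = min(out_start + rows_per_tile, out_rows)
        # Input rows needed for output rows [out_start, out_end).
        tiles.append((out_start, out_end + kernel - 1))
    return tiles


def weight_loads(num_weights, tiles):
    """Assumed cost model: the full weight set is fetched once per tile."""
    return num_weights * len(tiles)


tiles = split_into_tiles(height=10, kernel=3, max_rows=6)
print(tiles)                    # [(0, 6), (4, 10)]
print(weight_loads(216, tiles)) # 432
```

Smaller tiles fit into less internal memory but mean more tiles, so under this model the weight traffic rises as the tile size shrinks.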
What else can be done?
We've covered a few embedded algorithmic solutions that change the convolution calculation to our benefit. In addition to these, more can be done on the algorithmic level by understanding how neural networks work. Here are a few examples that use a compression approach and prior knowledge to reduce bandwidth and improve performance:
Using algorithms like Huffman coding
Work in a pipeline to save bandwidth
Identify when some of the calculation can be saved
Share data between calculations
Recognize when the focus should be on the weights and when it should be on the map size – network dependent
Compress and decompress better over time (learn from frame by frame execution)
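As one possible illustration of the first item, the sketch below computes Huffman code lengths for a set of quantized weight values and compares the compressed size with a fixed-width encoding. Applying Huffman coding to quantized weights is an assumption made here for illustration, as are the function name and the sample values.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, tie breaker, {symbol: depth in the tree}).
    heap = [(freq, i, {sym: 0}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)        # two least frequent subtrees
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]


quantized_weights = [0, 0, 0, 0, 1, 1, 2, 3]   # hypothetical 2-bit weights
lengths = huffman_code_lengths(quantized_weights)
huffman_bits = sum(lengths[w] for w in quantized_weights)
fixed_bits = 2 * len(quantized_weights)
print(lengths, huffman_bits, fixed_bits)
```

The gain depends on how skewed the value distribution is; for evenly distributed values, Huffman coding gives no saving over a fixed-width encoding.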
As you can see, there is a lot that can be done in the technical aspects of deep convolutional neural networks for embedded systems. Once the challenges of deep learning in embedded systems have been overcome, many opportunities open up.
About the author
Liran Bar is Director of Product Marketing, CEVA, Imaging & Vision DSP core product line. Liran has more than fifteen years of experience in the imaging semiconductor industry. He holds a B.Sc. in Electrical Engineering from Ben-Gurion University.
CEVA is the leading licensor of signal processing IP for a smarter, connected world. We partner with semiconductor companies and OEMs worldwide to create power-efficient, intelligent and connected devices for a range of end markets, including mobile, consumer, automotive, industrial and IoT. Our ultra-low-power IPs for vision, audio, communications and connectivity include comprehensive DSP-based platforms for LTE/LTE-A/5G baseband processing in handsets, infrastructure and machine-to-machine devices, computer vision and computational photography for any camera-enabled device, audio/voice/speech and ultra-low power always-on/sensing applications for multiple IoT markets. CEVA can be found at www.ceva-dsp.com