This is a self-archived, parallel-published version of this article in the publication archive of the University of Vaasa. It might differ from the original.

Please cite the original version:
Payvar, S., Khan, M., Stahl, R., Mueller-Gritschneder, D., & Boutellier, J. (2019). Neural Network-based Vehicle Image Classification for IoT Devices. IEEE International Workshop on Signal Processing Systems (SiPS), Nanjing, China, 2019, pp. 148-153. https://doi.org/10.1109/SiPS47522.2019.9020464

Version: Final draft (post-print, accepted manuscript)

©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Neural Network-based Vehicle Image Classification for IoT Devices

Saman Payvar, Unit of Computing Sciences, Tampere University, Tampere, Finland, saman.payvar@tuni.fi
Mir Khan, Unit of Computing Sciences, Tampere University, Tampere, Finland, mir.markhan@tuni.fi
Rafael Stahl, Chair of Electronic Design Automation, Technical University of Munich, Munich, Germany, r.stahl@tum.de
Daniel Mueller-Gritschneder, Chair of Electronic Design Automation, Technical University of Munich, Munich, Germany, daniel.mueller@tum.de
Jani Boutellier, Tampere University / University of Vaasa, Finland, jani.boutellier@tuni.fi

Abstract—Convolutional Neural Networks (CNNs) have provided unprecedented results in automatic image analysis and interpretation, an area with numerous applications in both consumer electronics and industry. However, the signal processing related to CNNs is computationally very demanding, which has prohibited their use on the smallest embedded computing platforms, to which many Internet of Things (IoT) devices belong. Fortunately, in recent years researchers have developed many approaches for optimizing the performance and shrinking the memory footprint of CNNs. This paper presents a neural-network-based image classifier that has been trained to classify vehicle images into four different classes. The neural network is optimized by a technique called binarization, and the resulting binarized network is deployed to an IoT-class processor core for execution. Binarization reduces the memory footprint of the CNN by around 95% and increases performance by more than 6×. Furthermore, we show that by utilizing a custom processor instruction, 'popcount', the performance of the binarized vehicle classifier can be increased by a further 2×, making the CNN-based image classifier suitable for the smallest embedded processors.

Index Terms—model compression, convolutional neural networks, image classification, internet-of-things

I. INTRODUCTION

Convolutional neural networks (CNNs) have enabled a significant advance in automatic image analysis, such as image classification [1], image segmentation [2], image captioning [3] and object detection [4].
Unfortunately, until recently the computational requirements of CNNs have restricted their use to server- or desktop-class computers, although their deployment to edge devices could open up a variety of new applications [5]. In the Internet of Things (IoT), the network edge refers to devices that are in immediate connection with the sensors that provide input data for the whole IoT system. Such an edge device can be a smartphone [6], or a tiny sensor node commonly equipped with less than a megabyte of RAM [7].

A CNN consists of a sequence of layers, of which the most common types are fully-connected layers and convolutional layers. Once a CNN has been trained [8], e.g. for image classification, the parameters and weights of the layers are fixed for deployment to a target device. On the target device, the process that evaluates given input data is called inference: the input data flows through the layers of the CNN, providing the requested output (e.g. a classification result) from the last layer.

In terms of computation, convolutional layers consist of repeated 2D convolutions, where the input data of the layer is convolved with 2D kernels of common sizes 5×5, 3×3 or 1×1 [9]. The computational effort of convolutional layers grows rapidly as the size of input images or kernels grows [10]. However, it has been well known for some time that 2D convolution can also be interpreted and computed as a 2D matrix multiplication [11]. The inference of a fully-connected layer is also commonly performed by 2D matrix multiplication.

Optimization of CNN processing can be performed by optimizing software, hardware, or both [12]. Examples of software-based optimizations are model compression [9][13] and reduction of arithmetic precision [14][12]. Software-based optimizations that target convolutional layers include separable convolution [15] and depthwise convolution [16], whereas fully-connected layers can be optimized by weight pruning [13]. All of these optimizations have some negative impact on CNN accuracy.

Reduction of arithmetic precision, on the other hand, is not limited to particular types of layers, but can be applied to the whole CNN. Arithmetic precision can be reduced from floating point to, e.g., 16-bit fixed point [12] with minimal degradation of CNN (classification) accuracy, or by extreme quantization down to two bits [17] or one bit [18][14] of weight precision. When the precision of the weights (and possibly also of the input data) is reduced to a single bit, the CNN is binarized. Binarization dramatically reduces the memory footprint of a CNN, as the original weights, which are normally expressed in 32-bit floating point, can each be represented with a single bit. This evidently has an impact on the CNN's accuracy [18].
However, besides shrinking the size of the network, binarization also enables CNN inference on devices that have no support for floating-point arithmetic, such as microcontrollers and FPGAs [19].

This paper presents a CNN for vehicle image classification [20] that has been binarized, including the weights of all layers as well as the input data, following the principles of our recent work [14]. However, unlike our recent work, which concentrated on CNN inference on graphics processing units, in this paper we focus on microcontroller-class devices that can be found on the edge nodes of an IoT system. As the target microcontroller we have selected PULPino [21], which is based on the open-source instruction-set architecture RISC-V [22] that is gaining interest in both academia and industry.

The contributions of this paper are as follows:
• Performance and memory footprint measurements of our binarized CNN-based image classifier on a RISC-V microcontroller, and
• Optimization of binarized CNN computations by means of the custom instruction 'popcount' found in a proposal for RISC-V instruction set extensions [23].

The structure of this paper is as follows: Section II introduces other works related to optimization of CNNs; Section III describes the PULPino microcontroller that we use as the target device for our image classifier; Section IV covers the structure and binarization process of our CNN; Section V presents our experimental results, and Section VI concludes the paper.

II. RELATED WORK

This section describes previous works related to the acceleration of CNNs, some of which also consider acceleration by hardware. Table I presents a summary of these works and the target platforms they consider.

TABLE I
RELATED NEURAL NETWORK OPTIMIZATION WORKS

Work                     Type     Optimization                              Platform
Courbariaux et al. [18]  SW only  Binarization (fc layers only)             NVidia GPU
Rastegari et al. [24]    SW only  Binarization (conv and fc layers)         64-bit CPU
Khan et al. [14]         SW only  Binarization (conv and fc layers)         NVidia and OpenCL GPUs
ESPRESSO [25]            SW only  Binarization (conv and fc layers)         NVidia GPU, CPU
Park et al. [26]         HW, SW   Zero skipping, data reuse (conv only)     NVidia GPU, GPU simulation
Conti et al. [27]        HW, SW   Binarization (conv and fc layers)         HW accelerator for MCUs
Proposed                 HW, SW   Binarization (conv and fc layers)         RISC-V MCU (simulation)

Binarized neural networks (BNN) were originally introduced in [18]: network weights and activations are restricted to +1 and −1, which enables replacing multiplications and additions with bit-wise operations. Experiments were performed on the MNIST and CIFAR-10 datasets. The authors demonstrate a speedup of 7× for a multi-layer perceptron network trained for MNIST handwritten digit classification. Experimental results are limited to GPU acceleration of binarized fully-connected layers. Somewhat later, the binarization optimization was extended to the large-scale ImageNet image classification challenge [24]. The authors of [24] concentrate on CPU targets and report up to 58× execution time reduction on 64-bit CPUs for binarized convolutional and fully-connected layers. Also, the authors claim an accuracy improvement of 16% compared to [18] in the ImageNet top-1 classification challenge.

Our previous work [14] was among the first to present GPU acceleration of both binarized convolutional and fully-connected layers. Experimental results are presented for two mobile GPUs (NVidia Jetson and ARM Mali-T860), as well as for a desktop GPU (NVidia GTX1080). The layer implementations have been written from scratch in OpenCL and CUDA and made available open source. Additionally, the accuracy impact of various input image binarization approaches is analyzed.

In [25], ESPRESSO, a self-contained library for binarized neural networks, is presented. The library provides layer implementations in C and CUDA for both CPU and NVidia GPU targets. ESPRESSO [25] uses an optimization called unrolling (similar to im2col used in our previous work [14] and in the proposed work) for reshaping tensors prior to computing convolution.
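To illustrate the unrolling/im2col idea referred to above, the following minimal C sketch reshapes a single-channel input image so that each column of the output matrix holds one K×K patch; a convolutional layer then reduces to a matrix product between the reshaped kernels and this patch matrix. This is only a sketch under simplifying assumptions (one channel, unit stride, no padding), and the function name and data layout are illustrative rather than those of the actual implementations in [14] or [25].

#include <stddef.h>

/* Reshape an (h x w) input so that column (y*ow + x) of the
   (k*k) x (oh*ow) output matrix holds the k x k patch whose
   top-left corner is at (y, x). */
void im2col(const float *in, size_t h, size_t w, size_t k, float *out)
{
    size_t oh = h - k + 1, ow = w - k + 1;   /* output spatial size */
    for (size_t y = 0; y < oh; y++)
        for (size_t x = 0; x < ow; x++)
            for (size_t ky = 0; ky < k; ky++)
                for (size_t kx = 0; kx < k; kx++)
                    out[(ky * k + kx) * (oh * ow) + (y * ow + x)] =
                        in[(y + ky) * w + (x + kx)];
}

After this reshaping, convolutional and fully-connected layers can be served by the same matrix multiplication routine, which is what makes the transformation attractive for binarized kernels as well.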
Optimization of CNN convolution operations is studied in [26]. The authors have observed that Winograd convolutions can involve a high number of multiplications by zero, especially if weight pruning (see, e.g., [13]) has been applied. This redundancy is avoided by skipping zero weights, both by a software-only and by a hardware-assisted approach. Additionally, the authors present a data reuse approach for reducing the number of additions. Both optimizations target NVidia GPUs.

In [27], the XNOR Neural Engine (XNE) is presented: a hardware accelerator for binary neural networks to be closely coupled with an MCU (microcontroller unit) system. The XNE is capable of executing both binarized convolutional and fully-connected layers. The authors provide post-layout results where the accelerator has been placed on the same chip and in the same clock domain as a RISC-V microcontroller that acts as the host processor for the accelerator.

The proposed work is similar to the work of Conti et al. [27] in the sense that both consider an IoT edge computing scenario, build on binarized CNNs, and consider RISC-V MCU cores. However, a substantial difference is that the XNE accelerator of [27] is a dedicated datapath for CNNs next to the MCU core, whereas our proposed solution builds on a basic microcontroller architecture with just one custom processor instruction ('popcount') for accelerating BNNs. Evidently, the specialized circuit of [27] can achieve much higher energy efficiency than our proposed solution, whereas our solution only requires a tiny modification to a basic RISC-V MCU system, and otherwise remains very generic and capable of accelerating other types of applications as well.

Fig. 1. From left to right, a 'bus', a 'normal car', a 'truck', and a 'van'.

III. THE PULPINO RISC-V PROCESSOR FOR IOT APPLICATIONS

RISC-V is an open-source instruction set architecture (ISA) that is gaining interest in both academia and industry [22]. The ISA is open and standardized, such that it is free to use for both academia and industry. To promote adoption, another goal was to design a modern ISA: it is designed in a modular way, providing a small base instruction set with optional extensions. Additionally, certain instruction opcodes are reserved for custom extensions. This flexibility allows designing RISC-V processors that are customized for special workloads, which makes the ISA interesting for specialized IoT devices. While the open standard refers only to the ISA itself and not to any particular microarchitecture, the community around RISC-V has provided many open-source cores.

An important motivation for open hardware is security, especially with the recent microarchitecture bugs Spectre and Meltdown appearing in popular media [28][29]. Kerckhoffs's principle and a long history of research suggest that open systems provide certain advantages over closed systems in terms of security [30][31][32].

The Parallel Ultra-Low-Power (PULP) project has developed several RISC-V-based microcontrollers that are suitable for IoT applications [21]. The PULPino is particularly suited for low-cost, low-power tasks, because it is a simple in-order single-core microcontroller with many configuration options. Due to these advantages, the custom processor used in this work was derived from the PULPino-based SoC (System-on-Chip).

IV. NEURAL NETWORK DESIGN

A. Network for Vehicle Classification

The neural network model we use is that of the vehicle classifier network presented in [20].
The network has five layers in total, starting with two convolutional layers, each with 32 output feature maps and a kernel size of 5×5. Each of the convolutional layers is followed by a 2×2 max-pooling operation. The second convolutional layer is followed by three fully-connected layers. The first fully-connected layer (the 3rd layer in the network) has 100 neurons, resulting in a weight shape of 24×24×32×100. The two layers that follow have shapes 100×100 and 100×4, in that order.

The dataset we use for training the network has 6555 photos of vehicles from four categories: bus, normal car, truck, and van. Each vehicle image is a full-color image of size 96×96. Example images from each class in the dataset are shown in Fig. 1. We split the data into a training set (80%), a validation set (10%) and a test set (10%). The test-set accuracies we report are those recorded at the point of best validation-set accuracy.

B. Neural Network Binarization

We implement a binarized version of the vehicle classifier network introduced in [20], reducing the precision of the CNN weights and activations to 1 bit. This concept was first introduced in [18], with reports of substantial reductions in model execution time and size. In this work, we replace all ReLU activations in the network with the sign function, which is given as

\mathrm{sign}(x) = \begin{cases} -1 & \text{if } x \leq 0 \\ +1 & \text{if } x > 0 \end{cases} \qquad (1)

We binarize the weights of the network using the sign function as well. During training, the gradient of the sign activation is explicitly defined to be the identity function in the backward pass, so that \partial\,\mathrm{sign}(x)/\partial x = 1. The full-precision (non-binarized) version of the network is trained using the RMSprop optimizer, and the binarized version is trained with the ADAM optimizer. For the binarized version of the network, only the binarized weights, each with a value of either −1 or +1, are used for inference on the target device. The network is trained from scratch with binarization in a separate training process. It would also have been possible to quantize the network to ternary values [17] (or even to higher 8- or 16-bit precision), but that would have multiplied the memory footprint of the solution compared to binarization.

We use the terms packing or bit-packing to denote the encapsulation of an array of 1-bit values (+1's and −1's) into one 32-bit unsigned integer. For example, if we wish to pack a vector x ∈ {−1,+1}^{32}, its packed representation x_p is given by

x_p = \sum_{i=0}^{31} (x_i + 1)\, 2^{i-1}, \qquad (2)

where x_i is the i-th element of x. This then allows operations such as vector summations and dot products to be performed using binary (bit manipulation) operations. The dot product, for example, can be represented as

a \cdot b = 32 - 2 \times \mathrm{popcount}(\mathrm{xor}(A, B)), \qquad (3)

where both A and B are 32-bit unsigned integers holding the packed representations of the vectors a, b ∈ {−1,+1}^{32}. The operation 'popcount' (also known as Hamming weight calculation) computes the number of bits set to 1, which can essentially simulate vector summation. The operation 'xor' in Eq. 3 is the bit-wise exclusive-or operation.
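As a concrete illustration of Eq. 2 and Eq. 3, the following minimal C sketch packs a {−1,+1} vector into one 32-bit word and evaluates the binarized dot product. The function names pack32 and bdot32 are illustrative, and __builtin_popcount() stands in for either the software or the hardware 'popcount' discussed in the next subsection.

#include <stdint.h>

/* Eq. 2: pack 32 values from {-1, +1} into one word;
   bit i is set iff x[i] == +1. */
uint32_t pack32(const int8_t x[32])
{
    uint32_t p = 0;
    for (int i = 0; i < 32; i++)
        if (x[i] > 0)
            p |= 1u << i;
    return p;
}

/* Eq. 3: dot product of two packed {-1, +1} vectors. Matching bit pairs
   contribute +1 and differing pairs -1, hence 32 - 2 * popcount(a XOR b). */
int32_t bdot32(uint32_t a, uint32_t b)
{
    return 32 - 2 * (int32_t) __builtin_popcount(a ^ b);
}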
C. Acceleration by Bit Manipulation Instructions

Looking at Eq. 3 we see that both 'xor' and 'popcount' are used in the inference of binarized CNNs to perform an operation that emulates multiplication on packed weights; this means that, for both fully-connected and convolutional layers, 'xor' and 'popcount' are in heavy use and offer a clear optimization target.

A hardware implementation of 'xor' can be found on any programmable processor, whereas a hardware implementation of 'popcount' is mostly available on graphics processing units or in CPU SIMD extensions such as ARM NEON. For our target processor, the PULPino microcontroller, the base ISA does not include 'popcount': this instruction is only present in the bit manipulation extension of RISC-V, which is still under development [23]. In our experiments, in cases where the target processor did not have a hardware instruction for 'popcount', the LLVM C language implementation¹ shown in Algorithm 1 was called through __builtin_popcount().

Algorithm 1 LLVM 'popcount', i.e. Hamming weight

int32_t popcountsi2(int32_t a)
{
    uint32_t x = (uint32_t) a;
    x = x - ((x >> 1) & 0x55555555);                 /* 2-bit field sums   */
    x = ((x >> 2) & 0x33333333) + (x & 0x33333333);  /* 4-bit field sums   */
    x = (x + (x >> 4)) & 0x0F0F0F0F;                 /* 8-bit field sums   */
    x = (x + (x >> 16));                             /* fold upper half    */
    return (x + (x >> 8)) & 0x0000003F;              /* final 6-bit count  */
}

¹ https://github.com/sifive/riscv-llvm/blob/master/compiler-rt/lib/builtins/popcountsi2.c

V. EXPERIMENTS

The experimental evaluation of this work consisted of two parts: 1) evaluating the effect of the software-based binarization optimization on our image classifier, and 2) evaluating the effect of the 'popcount' custom instruction on the binarized classifier. Unfortunately, as our ultimate target platform was the PULPino microcontroller for IoT devices, it was not possible to benchmark the original, non-binarized vehicle classifier on this device, as it has no hardware support for floating-point computations. Hence, it was necessary to use two different target platforms to complete our experiments; these platforms are summarized in Table II.

TABLE II
PLATFORMS USED FOR EXPERIMENTS

Tag      CPU                         Platform type               Compiler                       Operating System
A53      ARM Cortex A53 (1416 MHz)   Silicon SoC (Firefly)       g++ 5.4.0                      Linux 4.4
PULPino  PULPino (33 MHz)            Virtual prototype on ETISS  riscv32-unknown-elf-gcc 7.1.1  n/a

The ARM Cortex A53 core is a powerful mobile processor; in our experiments it was used under Linux for benchmarking a C language implementation of the original vehicle classifier [20], as well as the C language implementation of the binarized vehicle classifier. Experiments on the PULPino microcontroller platform were performed in a simulation environment, which is described next.

A. The ETISS Simulator

The RISC-V ISA is still in a phase of development; for example, the specification is not officially standardized yet. Still, the central components of the specification have matured and have been used to fabricate various chips, such as the SiFive FE310 SoC [33]. The application evaluated in this work, however, requires the bit manipulation instruction extension ('B extension') of the RISC-V ISA. This extension is still in active development [23] and not part of the current specification. Therefore, there is no RISC-V chip available that could be used for evaluating our results; however, an alternative way to estimate the performance gain achievable through custom instructions is simulation.

An RTL (Register-Transfer Level) hardware simulation would not be suitable for fast prototyping, as the microarchitecture would have to be modified to enable the execution of the chosen custom instructions. Additionally, for a time-consuming workload such as our CNN application, the RTL simulation time would be prohibitively high.

The Extensible Translating Instruction Set Simulator (ETISS) focuses on extensibility [34] to support fast prototyping. As ETISS already supports the standard RISC-V base instruction sets, contains a virtual prototype of the PULPino [21] SoC, and allows profiling the application execution time, the use of this simulator was a natural choice for our binarized image classifier application.
B. Implementation of the Popcount Instruction

As the PULPino virtual prototype of ETISS currently only supports the RISC-V base ISA, a temporary modification of the virtual prototype was required to enable profiling with support for 'popcount'. From ETISS execution traces it was discovered that the 'xori' instruction of the RISC-V base ISA remained almost unused throughout the whole execution of the binarized vehicle classifier. Therefore, in the PULPino virtual prototype the functional description of 'xori' was modified to provide alternative functionality, i.e. 'popcount', toggled by the value of the second instruction operand. In the software implementation of the binarized vehicle classifier, the calls to 'popcount' were then replaced with inline assembly calls to 'xori' with the specific operand value that invokes the 'popcount' behavior.
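To make the mechanism concrete, the following hypothetical sketch shows what such an inline-assembly wrapper can look like; the function name popcount_hw and the immediate value 0x7FF are placeholders, as the actual operand value that triggers the 'popcount' behavior in our modified virtual prototype is not reproduced here.

#include <stdint.h>

/* Hypothetical wrapper: on the modified PULPino virtual prototype, 'xori'
   with the reserved operand value executes popcount(x) instead of x ^ imm. */
static inline uint32_t popcount_hw(uint32_t x)
{
    uint32_t r;
    __asm__ volatile ("xori %0, %1, 0x7FF" : "=r" (r) : "r" (x));
    return r;
}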
C. Execution Time and Memory Footprint Analysis

Table III shows the experimental results for both the A53 and PULPino. From top to bottom, the table rows report execution time on the A53, execution time on PULPino, data memory footprint, PULPino instruction memory footprint, and CNN classification accuracy.

TABLE III
EXECUTION TIME, MEMORY FOOTPRINT AND ACCURACY

Application version     Baseline   Binarized   Bin+pop
Arithmetic              float32    int32       int32
A53 execution time      0.362 s    0.057 s     -
PULPino exec. time      -          2.62 s      1.18 s
Data memory             7.2 MB     369 kB      369 kB
PULPino instr. memory   -          21 kB       19 kB
Accuracy [14]           97.09%     92.52%      92.52%

Looking at the A53 results, it can be seen that binarization alone reduced the execution time by more than 80% and dropped the data memory usage by close to 95% when compared to the original floating-point C version.

Acceleration by the hardware 'popcount' instruction reduced the computation time of the binarized vehicle classifier by around 55% on the PULPino platform, and also reduced the instruction memory footprint by around 2 kB. The reason for the 55% reduction in execution time can be seen from Table IV, which shows the count of executed instructions on the PULPino platform for the binarized vehicle classifier with and without the hardware 'popcount' instruction: the code version that calls the hardware 'popcount' instruction executes around 55% fewer instructions. This is because, when there is no hardware support for 'popcount', the functionality must be implemented by means of several regular instructions, which can be seen in the increased execution counts of the 'srli', 'and', 'sub' and 'add' instructions for the binarized version without the hardware 'popcount' instruction. Algorithm 1 shows that exactly these instructions are needed for the software implementation of 'popcount'.

TABLE IV
NUMBER OF EXECUTED INSTRUCTIONS

Instruction      Binarized   Bin+pop
name             int32       int32
lw               8797430     8797417
lbu              272         272
addi             6372539     6354083
slli             2801668     2801668
popcount/xori²   4           3302052
srli             16510241    1
srai             4           4
ori              1           1
andi             3302062     14
sb               268         268
sh               4           4
sw               782165      782109
add              16704267    3496013
mul              0           0
sub              3670893     368845
sll              18632       18632
slt              2553032     2553032
xor              3302048     3302048
or               2451656     2451656
and              13208192    0
bne              3232555     3232555
blt              0           0
bge              370058      370058
bltu             4           4
jalr             39          39
jal              57          57
csrrw            1           1
Total            84078092    37830833

² 'popcount' implemented as alternative behavior of 'xori'

The accuracy results shown in Table III are identical to those of our previous work on binarization that targeted graphics processing units [14].

VI. CONCLUSIONS

In this paper we have presented a convolutional-neural-network-based vehicle image classifier that has been optimized for real-time execution and a small memory footprint by a technique called binarization. We show that by using 'popcount', a custom instruction in our target processor, the runtime of the binarized image classifier can be reduced by 55%. This result is important because 'popcount' has been proposed for inclusion in a standardized instruction set extension ('B extension') of the recently introduced open-source RISC-V instruction set architecture. Besides RISC-V, 'popcount' is already supported in graphics processing units and, e.g., in the NEON SIMD extension of ARM processors.

Our work shows that the software-based binarization transformation coupled with the hardware-based 'popcount' instruction yields an extremely powerful combination for optimizing the inference of convolutional neural networks. Together, the memory footprint is reduced by close to 95%, and execution time is reduced by an order of magnitude, while maintaining an acceptable loss in accuracy. As a result, image classification is performed in 1.18 seconds on the tiny 33 MHz RISC-V microcontroller, which is well suited for IoT applications.

As binarization inevitably reduces classification accuracy (most clearly on larger datasets), a potential step for improving accuracy would be the adoption of heterogeneous-bitwidth binarization [35]. This approach degrades accuracy considerably less than full binarization, already when on average 1.4 bits per weight are used [35].

ACKNOWLEDGMENT

This work was partially funded by the Academy of Finland project 309903 CoEfNet, and by the ITEA3 project 16018 COMPACT (Business Finland diary number 3098/31/2017, German ministry of education and research reference number 01IS17028).

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[2] J. Dai, K. He, Y. Li, S. Ren, and J. Sun, "Instance-sensitive fully convolutional networks," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 534–549.
[3] J. Johnson, A. Karpathy, and L. Fei-Fei, "DenseCap: Fully convolutional localization networks for dense captioning," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4565–4574.
[4] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7263–7271.
[5] G. Ananthanarayanan, P. Bahl, P. Bodík, K. Chintalapudi, M. Philipose, L. Ravindranath, and S. Sinha, "Real-time video analytics: The killer app for edge computing," Computer, vol. 50, no. 10, pp. 58–67, 2017.
[6] W. Shi and S. Dustdar, "The promise of edge computing," Computer, vol. 49, no. 5, pp. 78–81, 2016.
[7] M. Alioto and M. Shahghasemi, "The Internet of Things on its edge: Trends toward its tipping point," IEEE Consumer Electronics Magazine, vol. 7, no. 1, pp. 77–87, 2018.
[8] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, pp. 265–283.
[9] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
[10] J. Shen, Y. Huang, Z. Wang, Y. Qiao, M. Wen, and C. Zhang, "Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 2018, pp. 97–106.
[11] K. Chellapilla, S. Puri, and P. Simard, "High performance convolutional neural networks for document processing," in International Workshop on Frontiers in Handwriting Recognition, 2006.
[12] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in ACM/IEEE International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 243–254.
[13] M. Zhu and S. Gupta, "To prune, or not to prune: Exploring the efficacy of pruning for model compression," in International Conference on Learning Representations (ICLR) Workshops, 2018.
[14] M. Khan, H. Huttunen, and J. Boutellier, "Binarized convolutional neural networks for efficient inference on GPUs," in European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 682–686.
[15] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," in British Machine Vision Conference (BMVC). BMVA Press, 2014.
[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[17] F. Li, B. Zhang, and B. Liu, "Ternary weight networks," arXiv preprint arXiv:1605.04711, 2016.
[18] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[19] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "FINN: A framework for fast, scalable binarized neural network inference," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 2017, pp. 65–74.
[20] H. Huttunen, F. S. Yancheshmeh, and K. Chen, "Car type recognition with deep neural networks," in IEEE Intelligent Vehicles Symposium (IV). IEEE, 2016, pp. 1115–1120.
[21] A. Traber, F. Zaruba, S. Stucki, A. Pullini, G. Haugou, E. Flamand, F. K. Gurkaynak, and L. Benini, "PULPino: A small single-core RISC-V SoC," in RISC-V Workshop, 2016.
[22] The RISC-V Instruction Set Manual, RISC-V Foundation, 2017, version 2.2.
[23] RISC-V Bitmanip Extension, Clifford Wolf, 2019, version 0.37.
[24] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 525–542.
[25] F. Pedersoli, G. Tzanetakis, and A. Tagliasacchi, "Espresso: Efficient forward propagation for binary deep neural networks," in International Conference on Learning Representations (ICLR), 2018.
[26] H. Park, D. Kim, J. Ahn, and S. Yoo, "Zero and data reuse-aware fast convolution for deep neural networks on GPU," in International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). IEEE, 2016, pp. 1–10.
[27] F. Conti, P. D. Schiavone, and L. Benini, "XNOR neural engine: A hardware accelerator IP for 21.6-fJ/op binary neural network inference," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2940–2951, 2018.
[28] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, "Spectre attacks: Exploiting speculative execution," arXiv preprint arXiv:1801.01203, 2018.
[29] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg, "Meltdown," arXiv preprint arXiv:1801.01207, 2018.
[30] J.-H. Hoepman and B. Jacobs, "Increased security through open source," arXiv preprint arXiv:0801.3924, 2008.
[31] B. Witten, C. Landwehr, and M. Caloyannides, "Does open source improve system security?" IEEE Software, vol. 18, no. 5, pp. 57–61, 2001.
[32] C. Cowan, "Software security for open-source systems," IEEE Security & Privacy, vol. 99, no. 1, pp. 38–45, 2003.
[33] SiFive FE310-G000 Manual, SiFive, Inc., 2017, version v2p3.
[34] D. Mueller-Gritschneder, M. Dittrich, M. Greim, K. Devarajegowda, W. Ecker, and U. Schlichtmann, "The extendable translating instruction set simulator (ETISS) interlinked with an MDA framework for fast RISC prototyping," in International Symposium on Rapid System Prototyping (RSP). IEEE, 2017, pp. 79–84.
[35] J. Fromm, S. Patel, and M. Philipose, "Heterogeneous bitwidth binarization in convolutional neural networks," in Advances in Neural Information Processing Systems, 2018, pp. 4006–4015.