arXiv:1803.08607v1 [cs.CV] 22 Mar 2018

A Quantization-Friendly Separable Convolution for MobileNets

Tao Sheng (tsheng@qti.qualcomm.com), Chen Feng (chenf@qti.qualcomm.com), Shaojie Zhuo (shaojiez@qti.qualcomm.com), Xiaopeng Zhang (parker.zhang@gmail.com), Liang Shen (liang.shen@qti.qualcomm.com), Mickey Aleksic (maleksic@qti.qualcomm.com)
Qualcomm Technologies, Inc.

Abstract

As deep learning (DL) is being rapidly pushed to edge computing, researchers have invented various ways to make inference computation more efficient on mobile/IoT devices, such as network pruning, parameter compression, etc. Quantization, as one of the key approaches, can effectively offload the GPU and make it possible to deploy DL on a fixed-point pipeline. Unfortunately, not all existing network designs are friendly to quantization. For example, while the popular lightweight MobileNetV1 [1] successfully reduces parameter size and computation latency with separable convolution, our experiments show that its quantized models have a large accuracy gap against their floating-point counterparts. To resolve this, we analyzed the root cause of the quantization loss and proposed a quantization-friendly separable convolution architecture. Evaluated on the image classification task on the ImageNet2012 dataset, our modified MobileNetV1 model achieves a top-1 accuracy of 68.03% with 8-bit inference, almost closing the gap to the float pipeline.

Keywords: Separable Convolution, MobileNetV1, Quantization, Fixed-point Inference

1 Introduction

Quantization is crucial for DL inference on mobile/IoT platforms, which have very limited budgets for power and memory consumption. Such platforms often rely on fixed-point computational hardware blocks, such as Digital Signal Processors (DSPs), to achieve higher power efficiency than floating-point processors such as GPUs. For existing DL models, such as VGGNet [2], GoogleNet [3], and ResNet [4], quantization may not impact inference accuracy because of their over-parameterized design, but it would still be difficult to deploy those models on mobile platforms due to the large computation latency. Many lightweight networks, however, trade off accuracy for efficiency by replacing conventional convolution with depthwise separable convolution, as shown in Figure 1(a)(b). For example, the MobileNets proposed by Google drastically shrink parameter size and memory footprint, and thus are getting increasingly popular on mobile platforms.

Figure 1. Our proposed quantization-friendly separable convolution core layer design vs. the separable convolution in MobileNets and the standard convolution: (a) standard convolution; (b) MobileNet separable convolution (depthwise conv + BN + ReLU6, then pointwise conv + BN + ReLU6); (c) proposed quantization-friendly separable convolution (depthwise conv, then pointwise conv + BN + ReLU).

The downside is that the separable convolution core layer in MobileNetV1 causes a large quantization loss, resulting in a significant feature representation degradation in the 8-bit inference pipeline. To demonstrate the quantization issue, we selected the TensorFlow implementations of MobileNetV1 [6] and InceptionV3 [7], and compared their accuracy on the float pipeline against the 8-bit quantized pipeline. The results are summarized in Table 1. The top-1 accuracy of InceptionV3 drops only slightly after applying 8-bit quantization, while the accuracy loss is significant for MobileNetV1.

Table 1. Top-1 accuracy on the ImageNet2012 validation dataset

Networks      Float Pipeline   8-bit Pipeline   Comments
InceptionV3   78.00%           76.92%           Only standard convolution
MobileNetV1   70.50%           1.80%            Mainly separable convolution

There are a few ways to potentially address this issue. The most straightforward approach is quantization with more bits. For example, increasing from 8-bit to 16-bit could
boost the accuracy [14], but this is largely limited by the capability of target platforms. Alternatively, we could re-train the network to generate a dedicated quantized model for fixed-point inference. Google proposed a quantized training framework [5] co-designed with the quantized inference to minimize the loss of accuracy from quantization on inference models. The framework simulates quantization effects in the forward pass of training, whereas back-propagation still enforces the float pipeline. This re-training framework can reduce the quantization loss dedicatedly for the fixed-point pipeline, but at the cost of extra training, and the system needs to maintain multiple models for different platforms.

In this paper, we focus on a new architecture design for the separable convolution layer to build lightweight quantization-friendly networks. The proposed architecture requires only a single training pass in the float pipeline, and the trained model can then be deployed to different platforms with float or fixed-point inference pipelines with minimum accuracy loss. To achieve this, we look deep into the root causes of the accuracy degradation of MobileNetV1 in the 8-bit inference pipeline. Based on the findings, we propose a re-architected quantization-friendly MobileNetV1 that maintains a competitive accuracy in the float pipeline, but achieves a much higher inference accuracy in the quantized 8-bit pipeline. Our main contributions are:

1. We identified that batch normalization and ReLU6 are the major root causes of the quantization loss for MobileNetV1.
2. We proposed a quantization-friendly separable convolution, and empirically proved its effectiveness based on MobileNetV1 in both the float pipeline and the fixed-point pipeline.

2 Quantization Scheme and Loss Analysis

In this section, we explore the TensorFlow (TF) [8] 8-bit quantized MobileNetV1 model, and find the root cause of the accuracy loss in the fixed-point pipeline. Figure 2 shows a typical 8-bit quantized pipeline.
A TF 8-bit quantized model is directly generated from a pre-trained float model, where all weights are first quantized offline. During inference, any float input is quantized to an 8-bit unsigned value before being passed to a fixed-point runtime operation, such as QuantizedConv2d, QuantizedAdd, or QuantizedMul. These operations produce a 32-bit accumulated result, which is converted down to an 8-bit output through an activation re-quantization step. Note that this output will be the input to the next operation.

Figure 2. A fixed-point quantized pipeline: float32 inputs are quantized to uint8 (the float32 weights are quantized offline), the fixed-point op accumulates into int32, and re-quantization produces the uint8 output. The loss sources are the input quantization loss, the weight quantization loss, the saturation and clipping loss, and the activation re-quantization loss.

2.1 TensorFlow 8-bit Quantization Scheme

TensorFlow 8-bit quantization uses a uniform quantizer, in which all quantization steps are of equal size. Let x_float represent the float value of a signal x; the TF 8-bit quantized value, denoted x_quant8, is calculated as:

    x_quant8 = [x_float / Δx] − δx,                                      (1)

    Δx = (x_max − x_min) / (2^b − 1),   δx = [x_min / Δx],               (2)

where Δx represents the quantization step size; b is the bit-width, i.e., b = 8; and δx is the offset value such that the float value 0 is exactly represented. x_min and x_max are the min and max values of x in the float domain, and [·] represents the nearest rounding operation. In the TensorFlow implementation, it is defined as

    [x] = sgn(x) · ⌊|x| + 0.5⌋,                                          (3)

where sgn(x) is the sign of the signal x, and ⌊·⌋ represents the floor operation. Based on the definitions above, the accumulated result of a convolution operation is computed by:

    accum_float = Σ_i x_float,i · w_float,i
                = Σ_i Δx (x_quant8,i + δx) · Δw (w_quant8,i + δw)
                = Δx · Δw · accum_int32.                                 (4)

Finally, given known min and max values of the output, by combining equations (1) and (4), the re-quantized output can be calculated by scaling the accumulated result by Δx · Δw / Δ_output and then subtracting the output offset δ_output.
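As a concrete numeric illustration of equations (1)-(4), the scheme can be sketched in a few lines of Python (the helper names and example ranges below are our own, not from the TensorFlow source, and saturation of the uint8 values to [0, 255] is omitted for brevity):

```python
import math

def tf_round(x):
    """Nearest rounding of eq. (3): [x] = sgn(x) * floor(|x| + 0.5)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

def quant_params(x_min, x_max, bits=8):
    """Quantization step Delta_x and offset delta_x of eq. (2)."""
    step = (x_max - x_min) / (2 ** bits - 1)
    return step, tf_round(x_min / step)

def quantize(x_float, step, offset):
    """eq. (1): x_quant8 = [x_float / Delta_x] - delta_x."""
    return tf_round(x_float / step) - offset

def conv_accumulate(xs, ws, x_params, w_params):
    """eq. (4): the int32 accumulator of a quantized dot product;
    multiplying it by Delta_x * Delta_w recovers accum_float."""
    (sx, dx), (sw, dw) = x_params, w_params
    return sum((quantize(x, sx, dx) + dx) * (quantize(w, sw, dw) + dw)
               for x, w in zip(xs, ws))
```

For example, with an input range [0, 2.55] and a weight range [-1.28, 1.27], the accumulator scaled by Δx · Δw closely matches the float dot product; the re-quantization step of equation (5) then maps it back to uint8 using the output's own step and offset.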
    output_quant8 = [accum_float / Δ_output] − δ_output
                  = [(Δx · Δw / Δ_output) · accum_int32] − δ_output.     (5)

2.2 Metric for Quantization Loss

As depicted in Figure 2, there are five types of loss in the fixed-point quantized pipeline: the input quantization loss, the weight quantization loss, the runtime saturation loss, the activation re-quantization loss, and a possible clipping loss for certain non-linear operations, such as ReLU6. To better understand the contribution of each type of loss, we use the Signal-to-Quantization-Noise Ratio (SQNR), defined as the power of the unquantized signal x divided by the power of the quantization error n, as a metric to evaluate the quantization accuracy at each layer output:

    SQNR = 10 · log10( E(x²) / E(n²) )  in dB.                           (6)

Since the average magnitude of the input signal x is much larger than the quantization step size Δx, it is reasonable to
assume that the quantization error is zero-mean with a uniform distribution whose probability density function (PDF) integrates to 1 [10]. Therefore, for an 8-bit linear quantizer, the noise power can be calculated by

    E(n²) = ∫_{−Δx/2}^{Δx/2} (1/Δx) · n² dn = Δx² / 12.                  (7)

Substituting equations (2) and (7) into equation (6), we get

    SQNR = 58.92 − 10 · log10( (x_max − x_min)² / E(x²) )  in dB.        (8)

SQNR is tightly coupled with the signal distribution. From equation (8), it is obvious that SQNR is determined by two terms: the power of the signal x, and the quantization range. Therefore, increasing the signal power or decreasing the quantization range helps to increase the output SQNR.

2.3 Quantization Loss Analysis on MobileNetV1

2.3.1 Batch Normalization in the Depthwise Layer

As shown in Figure 1(b), a typical MobileNetV1 core layer consists of a depthwise convolution and a pointwise convolution, each of which is followed by a batch normalization [9] and a non-linear activation function. In the TensorFlow implementation, ReLU6 [11] is used as the non-linear activation function. Consider a layer input x = (x^(1), ..., x^(d)) with d channels and m elements in each channel within a mini-batch. The batch normalization transform in the depthwise convolution layer is applied on each channel independently, and can be expressed as

    y_i^(k) = γ^(k) · x̂_i^(k) + β^(k)
            = γ^(k) · (x_i^(k) − μ^(k)) / √((σ^(k))² + ε) + β^(k),
      i = 1, ..., m,  k = 1, ..., d,                                     (9)

where x̂_i^(k) represents the normalized value of x_i^(k) on channel k, μ^(k) and (σ^(k))² are the mean and variance over the mini-batch, and γ^(k) and β^(k) are the scale and shift. Note that ε is a given small constant value; in the TensorFlow implementation, ε = 0.0010000000475.

The batch normalization transform can be folded in the fixed-point pipeline. Let

    α^(k) = γ^(k) / √((σ^(k))² + ε),
    β̃^(k) = β^(k) − γ^(k) · μ^(k) / √((σ^(k))² + ε),                    (10)

then equation (9) can be reformulated as

    y_i^(k) = α^(k) · x_i^(k) + β̃^(k),   i = 1, ..., m,  k = 1, ..., d.  (11)

In the TensorFlow implementation, for each channel k, α^(k) can be combined with the weights and folded into the convolution operations to further reduce the computation cost.
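The per-channel folding of equations (10)-(11) can be illustrated as follows (a minimal sketch with our own function and variable names, not the TensorFlow code; it operates on flattened per-channel kernels):

```python
import math

TF_EPS = 0.0010000000475  # epsilon used in the TensorFlow implementation

def fold_batch_norm(weights, gamma, beta, mean, var, eps=TF_EPS):
    """Fold the per-channel batch norm of eq. (9) into the preceding
    depthwise convolution, per eq. (10)-(11).

    weights[k]: flat list of filter taps for channel k;
    gamma, beta, mean, var: per-channel batch-norm parameters.
    """
    folded_w, folded_b = [], []
    for k in range(len(weights)):
        alpha = gamma[k] / math.sqrt(var[k] + eps)   # alpha^(k), eq. (10)
        folded_w.append([w * alpha for w in weights[k]])
        folded_b.append(beta[k] - alpha * mean[k])   # beta~^(k), eq. (10)
    return folded_w, folded_b
```

The folding is exact: applying the folded kernel and bias reproduces batch normalization applied to the convolution output, channel by channel.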
Figure 3. An example of α values across the 32 channels of the first depthwise convolution layer of the MobileNetV1 float model. Six outliers (14.19, 14.84, 17.18, 21.52, 24.24 and 30.65) dominate the range, while the remaining α values are small (below 4).

Depthwise convolution is applied on each channel independently, but the min and max values used for weight quantization are taken collectively from all channels. An outlier in one channel can easily cause a huge quantization loss for the whole model due to the enlarged data range. Without correlation across channels, depthwise convolution is prone to producing all-zero values in one channel, leading to zero variance (σ^(k) = 0) for that specific channel. This is commonly observed in MobileNetV1 models. Referring to equation (10), the zero variance of channel k produces a very large value of α^(k) due to the small constant value of ε. Figure 3 shows the observed α values across the 32 channels of the first depthwise convolution layer in the MobileNetV1 float model. It is noticed that the 6 outliers of α caused by the zero-variance issue largely increase the quantization range. As a result, the quantization bits are wasted on preserving those large values, since they all correspond to all-zero-value channels, while the small α values corresponding to informative channels are not well preserved after quantization, which badly hurts the representation power of the model. From our experiments, without retraining, properly handling the zero-variance issue, by changing the variance of a channel with all-zero values to the mean value of the variances of the remaining channels in that layer, dramatically improves the top-1 accuracy of the quantized MobileNetV1 on the ImageNet2012 validation dataset from 1.80% to 45.73% in the TF8 inference pipeline.
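The zero-variance workaround described above can be sketched as follows (our own illustrative code, not the authors' implementation):

```python
import math

TF_EPS = 0.0010000000475  # epsilon from the TensorFlow implementation

def fix_zero_variance(variances, tol=1e-12):
    """Replace the variance of all-zero channels with the mean variance
    of the remaining channels of the layer (the workaround above)."""
    healthy = [v for v in variances if v > tol]
    mean_var = sum(healthy) / len(healthy)
    return [v if v > tol else mean_var for v in variances]

def alpha(gamma, var, eps=TF_EPS):
    """alpha^(k) of eq. (10); explodes to ~gamma/sqrt(eps) when var = 0."""
    return gamma / math.sqrt(var + eps)
```

With γ = 1, a zero-variance channel yields α = 1/√ε ≈ 31.6, on the order of the outliers in Figure 3; after substituting the mean variance, α falls back to the scale of the informative channels, shrinking the collective weight-quantization range.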
A standard convolution both filters and combines inputs into a new set of outputs in one step. In MobileNetV1, the depthwise separable convolution splits this into two layers, a depthwise layer for filtering and a pointwise layer for combining [1], thus drastically reducing computation and model size while preserving feature representations. Based on this principle, we can remove the non-linear operations, i.e., batch normalization and ReLU6, between the two layers, and let the network learn proper weights to handle the batch normalization transform directly. This procedure preserves all the feature representations, while making the model quantization-friendly. To further understand the per-layer output accuracy of the network,
we use SQNR, defined in equation (8), as a metric to observe the quantization loss in each layer. Figure 4 compares the averaged per-layer output SQNR of the original MobileNetV1 with α folded into the convolution weights (black curve) against that of a model that simply removes batch normalization and ReLU6 in all depthwise convolution layers (blue curve); batch normalization and ReLU6 are kept in all pointwise convolution layers. 1000 images are randomly selected from the ImageNet2012 validation dataset (one in each class). From our experiment, introducing batch normalization and ReLU6 between the depthwise convolution and the pointwise convolution in fact largely degrades the per-layer output SQNR.

Figure 4. A comparison of the averaged per-layer output SQNR (dB) of MobileNetV1 with different core layer designs, measured at the depthwise and pointwise convolution outputs of each layer (conv2d, dw1/pw1, ..., dw13/pw13): ReLU in all pointwise layers, ReLU6 in all pointwise layers, and the original MobileNet with α folded.

2.3.2 ReLU6 or ReLU

In this section, we still use SQNR as a metric to measure the effect of choosing different activation functions in all pointwise convolution layers. Note that for a linear quantizer, SQNR is higher when the signal distribution is more uniform, and lower otherwise. Figure 4 shows the averaged per-layer output SQNR of MobileNetV1 using ReLU and ReLU6 as different activation functions at all pointwise convolution layers. A huge SQNR drop is observed in the first pointwise convolution layer when using ReLU6. Based on equation (8), although ReLU6 helps to reduce the quantization range, the signal power also gets reduced by the clipping operation. Ideally, this should produce an SQNR similar to that of ReLU. However, clipping the signal x at early layers may have the side effect of distorting the signal distribution, making it less quantization-friendly, as a result of compensating for the clipping loss during training.
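The premise that a linear quantizer favors a more uniform signal distribution can be checked numerically with equation (6); the sketch below (our own illustration, with arbitrary synthetic signals) uniformly quantizes two signals over their own ranges and compares the resulting SQNR:

```python
import math
import random

def sqnr_db(signal, bits=8):
    """Empirical SQNR (eq. 6) after uniform quantization of `signal`
    over its own [min, max] range with 2^bits levels (eq. 1-2)."""
    lo, hi = min(signal), max(signal)
    step = (hi - lo) / (2 ** bits - 1)
    quantized = [lo + round((x - lo) / step) * step for x in signal]
    sig_pow = sum(x * x for x in signal) / len(signal)
    noise_pow = sum((x - q) ** 2 for x, q in zip(signal, quantized)) / len(signal)
    return 10 * math.log10(sig_pow / noise_pow)

random.seed(0)
uniform_sig = [random.uniform(-1.0, 1.0) for _ in range(50000)]
gauss_sig = [random.gauss(0.0, 1.0) for _ in range(50000)]
```

The uniform signal lands near the 48.1 dB predicted by equation (8) for its range-to-power ratio, while a Gaussian signal of the same length, whose quantizer spends most of its range on rare tail values, measures several dB lower.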
As we observed, this leads to a large SQNR drop from one layer to the next. Experimental results on the accuracy improvement from replacing ReLU6 with ReLU will be shown in Section 4.

2.3.3 L2 Regularization on Weights

Since SQNR is tightly coupled with the signal distribution, we further enable L2 regularization on the weights in all depthwise convolution layers during training. The L2 regularization penalizes weights with large magnitudes. Large weights could potentially increase the quantization range and make the weight distribution less uniform, leading to a large quantization loss. By enforcing a better weight distribution, a quantized model with an increased top-1 accuracy can be expected.

3 Quantization-Friendly Separable Convolution for MobileNets

Based on the quantization loss analysis in the previous section, we propose a quantization-friendly separable convolution framework for MobileNets. The goal is to solve the large quantization loss problem so that the quantized model can achieve accuracy similar to the float model, while no re-training is required for the fixed-point pipeline.

3.1 Architecture of the Quantization-Friendly Separable Convolution

Figure 1(b) shows the separable convolution core layer in the current MobileNetV1 architecture, in which a batch normalization and a non-linear activation operation are introduced between the depthwise convolution and the pointwise convolution. From our analysis, due to the nature of depthwise convolution, this architecture leads to a problematic quantization model. Therefore, in Figure 1(c), three major changes are made to make the separable convolution core layer quantization-friendly.

1. Batch normalization and ReLU6 are removed from all depthwise convolution layers. We believe that a separable convolution shall consist of a depthwise convolution followed directly by a pointwise convolution, without any non-linear operation between the two. This procedure not only preserves the feature representations well, but is also quantization-friendly.

2. All ReLU6 are replaced with ReLU in the remaining layers. In the TensorFlow implementation of MobileNetV1, ReLU6 is used as the non-linear activation function. However, we think 6 is a very arbitrary number.
Although [11] indicates that ReLU6 can encourage a model to learn sparse features earlier, clipping the signal at early layers may lead to a quantization-unfriendly signal distribution, and thus largely decreases the SQNR of the layer output.

3. L2 regularization on the weights in all depthwise convolution layers is enabled during training.

3.2 A Quantization-Friendly MobileNetV1 Model

The layer structure of the proposed quantization-friendly MobileNetV1 model is shown in Table 2, which follows the overall layer structure defined in [1]. The separable convolution core layer has been replaced with the quantization-friendly version as described in the previous section. This model still inherits the efficiency in terms of computational cost and model size, while achieving high accuracy on fixed-point processors.
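The proposed core layer of Figure 1(c) composes the two convolutions back to back, with the (folded, as in equation (11)) batch normalization and ReLU only after the pointwise convolution. A toy 1-D sketch (our own simplified stand-in for the 3x3 depthwise and 1x1 pointwise convolutions, not the actual model code):

```python
def depthwise_conv1d(x, kernels):
    """Per-channel filtering: x[k] is a 1-D channel, kernels[k] its kernel
    (a toy 1-D stand-in for the 3x3 depthwise convolution)."""
    out = []
    for ch, k in zip(x, kernels):
        out.append([sum(kj * ch[i + j] for j, kj in enumerate(k))
                    for i in range(len(ch) - len(k) + 1)])
    return out

def pointwise_conv1d(x, weights):
    """1x1 channel mixing: weights[o][c] maps input channel c to output o."""
    length = len(x[0])
    return [[sum(w * x[c][i] for c, w in enumerate(row))
             for i in range(length)] for row in weights]

def quant_friendly_separable(x, dw_kernels, pw_weights, alpha, beta):
    """Figure 1(c): depthwise -> pointwise -> folded BN (eq. 11) -> ReLU.
    No BN or ReLU6 between the two convolutions, and ReLU instead of ReLU6."""
    h = depthwise_conv1d(x, dw_kernels)          # filtering, per channel
    h = pointwise_conv1d(h, pw_weights)          # combining across channels
    h = [[alpha[o] * v + beta[o] for v in row]   # folded batch norm
         for o, row in enumerate(h)]
    return [[max(0.0, v) for v in row] for row in h]
```

The original core layer of Figure 1(b) would instead insert a batch normalization and ReLU6 between the two convolution calls, which is exactly what the analysis of Section 2.3 argues against.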
Table 2. Quantization-friendly modified MobileNetV1 (DC: depthwise convolution, PC: pointwise convolution, BN: batch normalization)

Input         Operator         Repeat   Stride
224x224x3     Conv2d+BN+ReLU   1        2
112x112x32    DC+PC+BN+ReLU    1        1
112x112x64    DC+PC+BN+ReLU    1        2
56x56x128     DC+PC+BN+ReLU    1        1
56x56x128     DC+PC+BN+ReLU    1        2
28x28x256     DC+PC+BN+ReLU    1        1
28x28x256     DC+PC+BN+ReLU    1        2
14x14x512     DC+PC+BN+ReLU    5        1
14x14x512     DC+PC+BN+ReLU    1        2
7x7x1024      DC+PC+BN+ReLU    1        2
7x7x1024      AvgPool          1        1
1x1x1024      Conv2d           1        1
1x1x1000      Softmax          1        1

4 Experimental Results

We train the proposed quantization-friendly MobileNetV1 float models using the TensorFlow training framework. We follow the same training hyperparameters as MobileNetV1, except that we use one Nvidia GeForce GTX TITAN X card and a batch size of 128 during training. The ImageNet2012 dataset is used for training and validation. Note that the training is only required for the float models.

The experimental results of applying each change to the original MobileNetV1 model, in both the float pipeline and the 8-bit quantized pipeline, are shown in Figure 5. In the float pipeline, our trained float model achieves a top-1 accuracy similar to that of the original MobileNetV1 TF model. In the 8-bit pipeline, by removing batch normalization and ReLU6 in all depthwise convolution layers, the top-1 accuracy of the quantized model is dramatically improved from 1.80% to 61.50%. In addition, by simply replacing ReLU6 with ReLU, the top-1 accuracy of 8-bit quantized inference is further improved to 67.80%. Furthermore, by enabling L2 regularization on the weights in all depthwise convolution layers during training, the overall accuracy of the 8-bit pipeline is improved by another 0.23%. From our experiments, the proposed quantization-friendly MobileNetV1 model achieves an accuracy of 68.03% in the 8-bit quantized pipeline, while maintaining an accuracy of 70.77% in the float pipeline for the same model.

5 Conclusion and Future Work

We proposed an effective quantization-friendly separable convolution architecture, and integrated it into MobileNets for image classification.
Without reducing the accuracy in the float pipeline, our proposed architecture shows a significant accuracy boost in the 8-bit quantized pipeline.

Figure 5. Top-1 accuracy with different core layer designs on the ImageNet2012 validation dataset

Core Layer Design                                   Float Pipeline   8-bit Pipeline
Original (BN+ReLU6 in all layers)                   70.50%           1.80%
Proposed: BN+ReLU6 removed from depthwise layers    70.55%           61.50%
Proposed: + ReLU6 replaced with ReLU                70.80%           67.80%
Proposed: + L2 regularizer on depthwise weights     70.77%           68.03%

To generalize this architecture, we will keep applying it to more networks based on separable convolution, e.g., MobileNetV2 [12] and ShuffleNet [13], and verify their fixed-point inference accuracy. We will also apply the proposed architecture to object detection and instance segmentation applications, and measure the power and latency of the proposed quantization-friendly MobileNets on device.

References

[1] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. Apr. 17, 2017, https://arxiv.org/abs/1704.04861.
[2] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. Sep. 4, 2014, https://arxiv.org/abs/1409.1556.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on CVPR, pages 1-9, 2015.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. Dec. 10, 2015, https://arxiv.org/abs/1512.03385.
[5] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. Dec. 15, 2017, https://arxiv.org/abs/1712.05877.
[6] Google TensorFlow MobileNetV1 Model. https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_1.0_224_float_2017_11_08.zip
[7] Google TensorFlow InceptionV3 Model. http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz
[8] Google TensorFlow Framework. https://www.tensorflow.org/
[9] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Feb. 11, 2015, https://arxiv.org/abs/1502.
[10] U. Zölzer. Digital Audio Signal Processing, Chapter 2. John Wiley & Sons, Dec. 15, 1997.
[11] A. Krizhevsky. Convolutional Deep Belief Networks on CIFAR-10. http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf
[12] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. Jan. 13, 2018, https://arxiv.org/abs/1801.04381.
[13] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. Dec. 7, 2017, https://arxiv.org/abs/1707.01083.
[14] J. Cheng, P. Wang, G. Li, Q. Hu, and H. Lu. Recent Advances in Efficient Computation of Deep Convolutional Neural Networks. Feb. 11, 2018, https://arxiv.org/abs/1802.00939.