Anchor Box Optimization for Object Detection

Yuanyi Zhong 1, Jianfeng Wang 2, Jian Peng 1, and Lei Zhang 2
1 University of Illinois at Urbana-Champaign
2 Microsoft Research
1 {yuanyiz2, jianpeng}@illinois.edu, 2 {jianfw, leizhang}@microsoft.com
arXiv:1812.00469v1 [cs.CV] 2 Dec 2018

Abstract

In this paper, we propose a general approach to optimize anchor boxes for object detection. Nowadays, anchor boxes are widely adopted in state-of-the-art detection frameworks. However, all these frameworks pre-define anchor box shapes in a heuristic way and fix the size during training. To improve the accuracy and reduce the effort of designing the anchor boxes, we propose to dynamically learn the shapes, which allows the anchors to automatically adapt to the data distribution and the network learning capability. The learning approach can be easily implemented with stochastic gradient descent and be plugged into any anchor box-based detection framework. The extra training cost is almost negligible and it has no impact on the inference time cost. Extensive experiments also demonstrate that the proposed anchor optimization method consistently achieves significant improvement (≥ 1% mAP absolute gain) over the baseline method on several benchmark datasets including Pascal VOC 07+12, MS COCO and Brainwash. Meanwhile, the robustness is also verified towards different anchor box initialization methods, which greatly simplifies the problem of anchor box design.

1. Introduction

Object detection plays an important role in many real applications, and recent years have seen great improvement in terms of speed and accuracy based on neural networks [18, 16, 17, 13, 11]. Many of these modern deep learning based detectors make use of anchor boxes (or default boxes), which serve as the initial guess of the bounding box. These anchor boxes are densely distributed across the output feature map, typically centered at each neuron of the feature map.
The neural network is trained to predict the position offset relative to the cell center (sometimes normalized by the anchor size) and the width/height offsets relative to the anchor box shape, as well as the classification confidence.

One of the critical factors is the design of the anchor width and the anchor height, and most approaches determine these values by ad-hoc heuristics. For instance, in Faster R-CNN [18], the anchor shapes cover 3 scales ($128^2$, $256^2$, $512^2$) and 3 aspect ratios (1:1, 1:2, 2:1). In SSD [13], the aspect ratios also include 1:3 and 3:1, with multiple scales for different feature maps. YOLO [15] has no anchor boxes, but the improved version YOLOv2 [16] incorporates the idea of anchor boxes to improve the accuracy, where the anchor shapes are obtained by k-means clustering on the sizes of the ground-truth bounding boxes.

When applying general object detectors to specific domains, the anchor shapes have to be manually modified to improve the accuracy. For text detection in [8], the aspect ratios also include 5:1 and 1:5, since text can be wider or taller than general objects. For face detection in [14, 24], the aspect ratio is only 1:1 since the face is roughly square.

Once the anchor shapes are determined, the sizes are fixed during training. This might be sub-optimal since it disregards the augmented data distribution in training, the characteristics of the neural network structure and the task. An improper design of the anchor sizes could lead to inferior performance in specific domains.

To address the issue, we propose a novel anchor optimization approach that can automatically learn the anchor shapes during training. This leaves the choice of anchor shapes entirely to network learning, so that the learned shapes can adapt better to the dataset, network and task without much human interference. The learning approach can be easily implemented with stochastic gradient descent and can be plugged into any anchor box based detection framework.
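As a small illustration of the hand-designed settings mentioned above, the following sketch enumerates the 3 scales × 3 aspect ratios quoted for Faster R-CNN. The helper name `anchor_shapes` is hypothetical, and the convention that each anchor has area scale² with ratio = width / height is an assumption made for this example rather than something stated in the text.

```python
import math

def anchor_shapes(scales, ratios):
    """Return (width, height) pairs, one per (scale, ratio) combination.

    Assumed convention: area = scale**2 and ratio = width / height.
    """
    shapes = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            shapes.append((w, h))
    return shapes

# Setting quoted in the text: scales 128, 256, 512 and ratios 1:1, 1:2, 2:1.
shapes = anchor_shapes(scales=[128, 256, 512], ratios=[1.0, 0.5, 2.0])
for w, h in shapes:
    print(f"{w:7.1f} x {h:7.1f}")
```

These nine shapes stay constant in conventional training; the approach proposed in this paper would instead treat each width and height as a trainable variable.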
To verify the ideas, we conduct extensive experiments on several benchmark datasets including Pascal VOC 07+12, MS COCO and Brainwash. The results strongly demonstrate that the optimized anchor boxes significantly improve the accuracy (≥ 1% mAP absolute gain) over the baseline method. Meanwhile, the robustness is also verified towards different anchor box initializations, which greatly simplifies the problem of how to design the anchor sizes. The main contributions of this paper are summarized as follows:

- We present a novel approach to optimize the anchor shapes during training, which, to the best of our knowledge, is the first to treat anchor shapes as trainable variables without modifying the inference network.
- We demonstrate through extensive experiments that the proposed anchor optimization method not only learns appropriate anchor shapes but also significantly boosts the detection accuracy of existing detectors.
- We also verify that the proposed method is robust to initialization, so the burden of handcrafting good anchor shapes for a specific dataset is greatly reduced.

The rest of the paper is organized as follows. In Sec. 2, we summarize the related work and present its relationship with our approach. In Sec. 3, we present the details of the optimized anchor boxes for object detection, which is followed by the experimental study in Sec. 4. Sec. 5 concludes the paper and discusses extensions of our work.

2. Related Work

Modern object detectors normally contain two heads: one for classification and the other for localization. The classification part predicts the class confidence, while the localization part predicts the bounding box coordinates. Based on how the location is predicted, we roughly categorize the related work into two branches: relative offset prediction based on some pre-defined anchor boxes [20, 13], and absolute offset prediction [15, 21, 7].

2.1. Relative Offset Prediction

The network predicts the offset relative to the pre-defined anchor boxes, which are also named default boxes [13] or priors [20]. These boxes serve as the initial guess of the bounding box position. The anchor shapes are fixed during training and the neural network learns to regress the relative offsets.
Assume $(\Delta^{(x)}, \Delta^{(y)}, \Delta^{(w)}, \Delta^{(h)})$ are the neural net outputs. One typical approach [18, 13] is to express the predicted bounding box as

$$\left(a^{(x)} + \Delta^{(x)} a^{(w)},\; a^{(y)} + \Delta^{(y)} a^{(h)},\; a^{(w)} \exp(\Delta^{(w)}),\; a^{(h)} \exp(\Delta^{(h)})\right)$$

where $a^{(w)}$ and $a^{(h)}$ are the pre-defined anchor width and height, and $a^{(x)}$ and $a^{(y)}$ are the anchor box center. The first two numbers represent the box center and the last two represent the bounding box width and height. Thus, one of the critical problems is how to design the anchor shape.

In Faster R-CNN [18], the anchor shapes are chosen with 3 scales ($128^2$, $256^2$, $512^2$) and 3 aspect ratios (1:1, 1:2, 2:1), yielding 9 different anchors at each output sliding window position. In the Single Shot MultiBox Detector (SSD) [13], the anchor boxes also have several scales on different feature map levels, and the aspect ratios include 1:3 and 3:1 as well as 1:1, 1:2 and 2:1. In YOLO [15], the network predicts the absolute offset and has no anchor boxes, but the improved version YOLOv2 [16] incorporates the idea of anchor boxes to improve the accuracy. Its anchor shapes are not handcrafted, but are the k-means centroids with IoU as the similarity criterion. The utilization of anchors has greatly improved deep learning based object detection performance in recent years.

When a general object detection framework is applied to specific problems, the anchor sizes have to be revisited and modified accordingly. For example, for text detection in [8], the aspect ratios include 5:1 and 1:5 as well as 1:1, 1:2, 2:1, 1:3 and 3:1, since text can be wider or taller than general objects. For face detection in [14, 24], the aspect ratios only include 1:1 since the face is roughly square. For pedestrian detection in [23], a ratio of 0.41 based on [2] is adopted for the anchor box. As suggested in [23], inappropriate anchor boxes could be noisy and degrade the accuracy.

To ease the effort of anchor shape design, the most relevant work might be MetaAnchor [22].
Leveraging neural network weight prediction, the anchors are modeled as functions implemented by an extra neural network and computed from customized prior boxes. The mechanism is shown to be robust to anchor settings and bounding box distributions, compared to the predefined fixed anchor scheme. However, the method involves an extra network to predict the weights of another neural network, resulting in extra training effort and inference time cost, and it also needs a set of customized prior boxes chosen by hand. Comparatively, our method can be easily embedded into any detection framework without an extra network, and it has negligible impact on the training time/space cost and no impact on the inference time.

2.2. Absolute Offset Prediction

Another research effort is to directly predict the absolute location values rather than the position and size relative to pre-defined anchor boxes. YOLO [15] belongs to this spectrum but was improved by YOLOv2 [16] with an anchor-based approach. In DeNet [21], the network outputs the confidence of each neuron belonging to one of the bounding box corners, and then collects the candidate boxes by Directed Sparse Sampling. More recently, CornerNet [7] proposed detecting objects as top-left and bottom-right keypoint pairs, and introduced the corner pooling operation to better localize corners. While these two anchor-free methods form a promising future research direction, anchor-based methods still achieve the best accuracy on the public benchmarks.

3. Proposed Approach

We first present an overview of existing anchor-based object detection frameworks, and then describe the proposed optimization techniques in detail.

3.1. Object Detection Overview

In state-of-the-art object detection frameworks, the training procedure is normally formulated as an empirical minimization problem over a combination of the bounding box localization loss and the classification loss.

3.1.1 Localization Loss

For one feature map with $A$ different anchor shapes from the network, each spatial location corresponds to $A$ anchor boxes centered at the cell. Thus the total number of anchor boxes is $N = A \times H_f \times W_f$, where $H_f$ and $W_f$ are the feature map height and width, respectively. Stacking all the anchor boxes, we denote by $a_i = (a_i^{(x)}, a_i^{(y)}, a_i^{(w)}, a_i^{(h)})$ the $i$-th ($i \in \{1, \dots, N\}$) anchor box, where $a_i^{(x)}$ and $a_i^{(y)}$ represent the center of the box and $a_i^{(w)}$ and $a_i^{(h)}$ represent the width and height, respectively. For multiple feature maps as in [11, 13], we can use similar notation to represent all the anchor boxes stacked together. Note that since we have $A$ different anchor shapes, $a_i^{(w)}$ and $a_i^{(h)}$ can take $A$ different values instead of $N$ different values. The anchor centers $a_i^{(x)}$ and $a_i^{(y)}$ are normally linearly related to the spatial location in the feature map. The shapes $a_i^{(w)}$ and $a_i^{(h)}$ are pre-defined and remain constant during training in existing work.

Let $\Delta_i = (\Delta_i^{(x)}, \Delta_i^{(y)}, \Delta_i^{(w)}, \Delta_i^{(h)})$ be the network output for the $i$-th anchor box. Then, the localization loss aligns the network prediction with the ground-truth bounding box coordinates with respect to the anchor box. Specifically, the loss for the $i$-th anchor box could be written as

$$L_i^{\text{loc}} = \sum_{j} \delta_{i,j}\, L(\Delta_i; a_i, g_j), \tag{1}$$

where $g_j = (g_j^{(x)}, g_j^{(y)}, g_j^{(w)}, g_j^{(h)})$ is the $j$-th ground-truth box and $\delta_{i,j}$ measures how much the $i$-th anchor should be responsible for the $j$-th ground-truth.
The value of $\delta_{i,j}$ is usually restricted to a discrete value in $\{0, 1\}$, in which 1 indicates that the $i$-th anchor box is responsible for the $j$-th ground-truth box. For example in [18, 11, 13], $\delta_{i,j}$ is 1 if the IoU between the anchor box and the ground-truth box is larger than a threshold (e.g. 0.5) or the anchor box is the one with the largest overlap with the ground-truth. In YOLOv2 [16], $\delta_{i,j}$ is 1 if the anchor box and the ground-truth are located at the same spatial location and the anchor box is the one with the largest IoU with the ground-truth box.

The form of the localization loss could be the $L_2$ distance [16], or the smoothed $L_1$ loss (also known as Huber loss) [18, 13]. Taking the $L_2$ loss as the example, the loss $L(\Delta_i; a_i, g_j)$ can be written as the sum of $L_{i,j}^{(x,y)}$ and $L_{i,j}^{(w,h)}$ with

$$L_{i,j}^{(x,y)} = \left(\Delta_i^{(x)} + a_i^{(x)} - g_j^{(x)}\right)^2 + \left(\Delta_i^{(y)} + a_i^{(y)} - g_j^{(y)}\right)^2 \tag{2}$$

$$L_{i,j}^{(w,h)} = \left(\Delta_i^{(w)} + \hat{a}_i^{(w)} - \hat{g}_j^{(w)}\right)^2 + \left(\Delta_i^{(h)} + \hat{a}_i^{(h)} - \hat{g}_j^{(h)}\right)^2 \tag{3}$$

$$\hat{a}_i^{(w)} \triangleq \log(a_i^{(w)}), \quad \hat{a}_i^{(h)} \triangleq \log(a_i^{(h)}) \tag{4}$$

$$\hat{g}_j^{(w)} \triangleq \log(g_j^{(w)}), \quad \hat{g}_j^{(h)} \triangleq \log(g_j^{(h)}) \tag{5}$$

The width and height use the log encoding scheme because their values should always be positive. Note that the anchor width and height appear explicitly in the wh-loss term of Eqn. 3. This enables direct gradient computation with respect to $a_i^{(w)}$ and $a_i^{(h)}$, which is the key of our anchor optimization method and will be detailed in Sec. 3.2.

3.1.2 Classification Loss

For each anchor box, the network also outputs a confidence score to identify which class it belongs to. In training, the cross entropy loss is normally employed, e.g. in [18, 13, 15, 16]. One improved version is the focal loss [11], which focuses on the class imbalance issue. To handle the background class, one can use an extra background class in the cross entropy loss, e.g. in [13, 18]. Another approach is to learn a class-agnostic objectness score to identify whether there is an object, e.g. in YOLOv2 [16] and the RPN of Faster R-CNN [18].

3.2. Anchor Box Optimization

By combining the localization loss and the classification loss, we can write the optimization problem as

$$\min_{\theta} L(\theta) \tag{6}$$

where $\theta$ denotes the neural network parameters. In existing methods, the anchor shapes are treated as constants. For all the $N$ anchor boxes $a_i$, we extract all the distinct anchor shapes and denote them by $\{(s_k^{(w)}, s_k^{(h)})\}_{k=1}^{A}$. We propose to treat them as learnable variables in the optimization problem of Eqn. 7:

$$\min_{\theta,\, \{(s_k^{(w)}, s_k^{(h)})\}_{k=1}^{A}} L\left(\theta, \{(s_k^{(w)}, s_k^{(h)})\}_{k=1}^{A}\right) \tag{7}$$
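To make the $L_2$ localization loss of Eqns. 2 to 5 concrete, here is a minimal NumPy sketch for a single anchor / ground-truth pair. The function name `l2_loc_loss` and the toy box values are assumptions made for illustration; they are not taken from the paper or from any released code.

```python
import numpy as np

def l2_loc_loss(delta, anchor, gt):
    """L2 localization loss (Eqns. 2-5) for one anchor / ground-truth pair.

    delta  : network outputs (dx, dy, dw, dh)
    anchor : anchor box (x, y, w, h)
    gt     : ground-truth box (x, y, w, h)
    """
    dx, dy, dw, dh = delta
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    # Eqn. 2: center term, offsets are added to the anchor center.
    loss_xy = (dx + ax - gx) ** 2 + (dy + ay - gy) ** 2
    # Eqns. 3-5: width/height term in log space; the anchor shape appears explicitly.
    loss_wh = (dw + np.log(aw) - np.log(gw)) ** 2 + (dh + np.log(ah) - np.log(gh)) ** 2
    return loss_xy + loss_wh

anchor = (0.5, 0.5, 2.0, 3.0)   # toy values
gt = (0.7, 0.4, 4.0, 3.0)       # toy values

# Offsets that reproduce the ground truth exactly give (numerically) zero loss.
perfect = (0.2, -0.1, np.log(2.0), 0.0)
print(l2_loc_loss(perfect, anchor, gt))

# With zero offsets the residual depends on how well the anchor shape fits the box.
print(l2_loc_loss((0.0, 0.0, 0.0, 0.0), anchor, gt))
```

Because `np.log(aw)` and `np.log(ah)` enter the loss directly, a framework with automatic differentiation can treat them as parameters, which is the observation the anchor optimization builds on.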

Figure 1. An illustration of the anchor optimization process. The localization loss minimizes the error between the ground-truth bounding box and the predicted offset relative to the anchor box. The error is back-propagated to the anchor shapes as well as the CNN parameters to automatically learn the anchor sizes. The anchor shapes are warmed up by the online clustering and the soft assignment, with batch normalization without shifting.

The optimal loss value of Eqn. 7 is guaranteed to be no higher than that of Eqn. 6, since the set of learnable variables is enlarged (and so is the feasible solution set). The anchor shape values can be adjusted with the goal of lowering the overall loss value. Moreover, with the learned anchor shapes, the magnitudes of the offsets (residuals) become smaller, which might make the regression problem easier. The key idea is summarized in Fig. 1.

Following common practice, we use back-propagation to solve the optimization problem. Instead of learning $s_k^{(w)}$ and $s_k^{(h)}$, we learn $\hat{s}_k^{(w)} \triangleq \log(s_k^{(w)})$ and $\hat{s}_k^{(h)} \triangleq \log(s_k^{(h)})$ because of equivalence and simplicity. For one training image, the derivative of the loss function with respect to $\hat{s}_k^{(w)}$ can be computed as

$$\frac{\partial L}{\partial \hat{s}_k^{(w)}} = \sum_{i,j} \delta_{i,j} \frac{\partial L_{i,j}^{(w,h)}}{\partial \hat{s}_k^{(w)}} \tag{8}$$

$$= 2 \sum_{i,j} \delta_{i,j} \left(\Delta_i^{(w)} + \hat{a}_i^{(w)} - \hat{g}_j^{(w)}\right) \delta\!\left(\hat{a}_i^{(w)} = \hat{s}_k^{(w)}\right), \tag{9}$$

where $\delta(\hat{a}_i^{(w)} = \hat{s}_k^{(w)})$ is 1 if $\hat{a}_i^{(w)}$ corresponds to $\hat{s}_k^{(w)}$ and 0 otherwise. Similarly, we can obtain the derivative with respect to the anchor height $\hat{s}_k^{(h)}$.

In one training iteration of the mini-batch stochastic gradient descent algorithm, we first assign the ground-truth boxes to the anchors, i.e. compute $\delta_{i,j}$. Then, with $\delta_{i,j}$ fixed, we back-propagate the error signal to all remaining parameters including the anchor shapes. To calculate the variables $\delta_{i,j}$, we normally use the IoU as the metric [18, 13, 16].
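The gradient of the width term with respect to the log anchor shapes (Eqns. 8 and 9) can be checked numerically. The sketch below is an illustrative NumPy version under stated assumptions: the names `wh_loss`, `wh_grad` and `shape_idx` (the index $k$ of the shape used by anchor $i$) are hypothetical, the data are random, and the assignment matrix is held fixed as described above.

```python
import numpy as np

def wh_loss(s_hat_w, delta_w, g_hat_w, shape_idx, assign):
    """Width part of the loss: sum_ij assign_ij * (Delta_i + a_hat_i - g_hat_j)^2."""
    a_hat = s_hat_w[shape_idx]                      # a_hat_i = s_hat_k for the shape k of anchor i
    resid = delta_w[:, None] + a_hat[:, None] - g_hat_w[None, :]
    return np.sum(assign * resid ** 2)

def wh_grad(s_hat_w, delta_w, g_hat_w, shape_idx, assign):
    """Analytic gradient following Eqn. 9, accumulated per distinct anchor shape."""
    a_hat = s_hat_w[shape_idx]
    resid = delta_w[:, None] + a_hat[:, None] - g_hat_w[None, :]
    per_anchor = 2.0 * np.sum(assign * resid, axis=1)   # sum over ground-truth boxes j
    grad = np.zeros_like(s_hat_w)
    np.add.at(grad, shape_idx, per_anchor)               # indicator: anchor i uses shape k
    return grad

rng = np.random.default_rng(0)
A, N, M = 3, 12, 4                                   # shapes, anchors, ground-truth boxes (toy sizes)
s_hat_w = rng.normal(size=A)
shape_idx = np.arange(N) % A
delta_w = rng.normal(size=N)
g_hat_w = rng.normal(size=M)
assign = (rng.random((N, M)) < 0.3).astype(float)    # fixed 0/1 assignment delta_ij

analytic = wh_grad(s_hat_w, delta_w, g_hat_w, shape_idx, assign)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.array([
    (wh_loss(s_hat_w + eps * np.eye(A)[k], delta_w, g_hat_w, shape_idx, assign)
     - wh_loss(s_hat_w - eps * np.eye(A)[k], delta_w, g_hat_w, shape_idx, assign)) / (2 * eps)
    for k in range(A)
])
print(np.allclose(analytic, numeric, atol=1e-5))
```

In practice an autograd framework would produce this gradient automatically once the log shapes are declared as parameters; the explicit version only serves to show which terms contribute.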
If we use the $L_2$ distance in the log space of width and height as the distance metric to compute $\delta_{i,j}$, the method aligns more closely with the loss. Empirically, we find that using the $L_2$ distance or the IoU results in similar performance and anchor shapes.

To further facilitate automatic learning of anchor shapes, we introduce the following three training techniques.

3.2.1 Online Clustering Warm-Up

Motivated by the k-means approach in [16], we augment the loss function with an extra online clustering term during the early stage of training. This term minimizes the squared $L_2$ distance between the anchor shapes and the ground-truth box shapes and can be written as

$$L_{\text{aug}} = L + \lambda \frac{1}{2N} \sum_{i,j} \delta_{i,j} T_{i,j}, \tag{10}$$

$$T_{i,j} \triangleq \left(\hat{a}_i^{(w)} - \hat{g}_j^{(w)}\right)^2 + \left(\hat{a}_i^{(h)} - \hat{g}_j^{(h)}\right)^2, \tag{11}$$

where $N$ here is $\sum_{i,j} \delta_{i,j}$ and serves as normalization (it is not the total anchor count of Sec. 3.1.1). The coefficient $\lambda$ is linearly annealed from 1 to 0 during the early stage of training (the first 20% of the iterations in our experiments) to kick off the learning of anchors.

The underlying idea is that the k-means approach could serve as a good starting point. This makes the network more robust to the initialization and faster to converge. In the early training stage, the clustering term quickly tunes the anchors to (near) k-means centroids. Then, the original loss $L$ in Eqn. 7 begins to show more influence. Hence, the anchor shapes adapt more closely to the data distribution and the network predictions, following gradients coming from the original loss term.

The derivatives of the augmented loss in Eqn. 10 with respect to $\hat{s}_k^{(w)}$ and $\hat{s}_k^{(h)}$ are

$$\frac{\partial L_{\text{aug}}}{\partial \hat{s}_k^{(w)}} = \frac{\partial L}{\partial \hat{s}_k^{(w)}} + \frac{\lambda}{N} \sum_{i,j} \delta_{i,j} \left(\hat{a}_i^{(w)} - \hat{g}_j^{(w)}\right) \delta\!\left(\hat{a}_i^{(w)} = \hat{s}_k^{(w)}\right)$$

$$\frac{\partial L_{\text{aug}}}{\partial \hat{s}_k^{(h)}} = \frac{\partial L}{\partial \hat{s}_k^{(h)}} + \frac{\lambda}{N} \sum_{i,j} \delta_{i,j} \left(\hat{a}_i^{(h)} - \hat{g}_j^{(h)}\right) \delta\!\left(\hat{a}_i^{(h)} = \hat{s}_k^{(h)}\right)$$

3.2.2 Soft Assignment Warm-Up

In some extreme situations, the ground-truth bounding boxes could be very small or very large, so that only one anchor box is activated. All other anchor boxes are never used, even with the online clustering term. To address the issue, we propose to adopt a soft assignment approach at the early training stage. That is,

$$\delta_{i,j} = \text{Softmax}\left(-\text{dist}(a_i, g_j)/T\right), \tag{12}$$

where Softmax is the softmax function over all anchor boxes at the same spatial cell. The temperature $T$ is annealed from 2 to 0 in the first few training steps (1500 iterations in the experiments). With non-zero assignment values, all anchor shapes can join the learning procedure. After the warm-up, the scheme falls back to the original assignment. In ordinary training tasks, we find this technique has almost no effect on the accuracy, but in specific task domains it resolves the issue and improves the accuracy.

3.2.3 Batch Normalization without Shifting

With the online clustering term in Eqn. 11, the network output tends to have a zero mean, potentially following a Gaussian distribution. To further reduce the learning difficulty, we apply batch normalization [5] to the outputs $\Delta^{(w)}$ and $\Delta^{(h)}$ without the shifting parameters. That is, the network output is first normalized to zero mean and unit variance, followed by a scaling operation without the shift operation. This enforces the zero mean distribution and makes the network converge faster.

4. Experiments

We first present the implementation details and then the extensive experimental results on the widely used Pascal VOC 07+12 [3] and MS-COCO Challenge 2017 object detection datasets [12], along with a head detection dataset named Brainwash [19], to demonstrate the effectiveness of the proposed anchor optimization method.

4.1. Implementation Details

Since the proposed approach for optimizing anchors is quite general, it can be applied to most anchor-based object detection frameworks.
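As a hedged illustration of the two warm-up schedules in Sec. 3.2.1 and Sec. 3.2.2, the sketch below implements a linearly annealed clustering weight $\lambda$ and the temperature-controlled soft assignment of Eqn. 12. All function names are hypothetical; the linear shape of the temperature decay and the fallback to a hard arg-min assignment at $T = 0$ are assumptions, since the text only states the start and end values.

```python
import numpy as np

def clustering_weight(step, total_steps, warmup_frac=0.2):
    """lambda in Eqn. 10: linearly annealed from 1 to 0 over the first 20% of training."""
    warmup_steps = warmup_frac * total_steps
    return max(0.0, 1.0 - step / warmup_steps)

def temperature(step, warmup_steps=1500, t_start=2.0):
    """T in Eqn. 12: annealed from 2 to 0 (linear decay is an assumption)."""
    return max(0.0, t_start * (1.0 - step / warmup_steps))

def assignment(dist, temp):
    """Soft assignment over the anchors of one cell: softmax(-dist / T).

    dist holds the distance of each anchor to one ground-truth box.
    For T <= 0 this falls back to a hard assignment to the closest anchor.
    """
    dist = np.asarray(dist, dtype=float)
    if temp <= 0.0:
        hard = np.zeros_like(dist)
        hard[np.argmin(dist)] = 1.0
        return hard
    logits = -dist / temp
    logits -= logits.max()          # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

dist = [0.1, 1.0, 3.0]              # toy distances for A = 3 anchors
soft = assignment(dist, temperature(0))
hard = assignment(dist, temperature(1500))
print(soft, hard)
print(clustering_weight(3000, 30000))
```

With the soft weights every anchor in the cell receives a non-zero share of the gradient early on, which is the stated purpose of the warm-up.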
We choose the YOLOv2 [16] framework as the testbed to demonstrate the effectiveness. Extensions to other detectors, such as the RPN of Faster R-CNN [18], SSD [13], Feature Pyramid Network (FPN) [9], and RetinaNet [11], should be straightforward.

YOLOv2 is one of the typical one-stage detectors, which maps the input image to a feature map by a convolutional neural network and infers the bounding box relative offsets and the classification results based on the feature map. The network consists of a DarkNet-19 backbone CNN pretrained on the ImageNet classification dataset, and several convolutional detection heads. With $A = 5$ anchor shapes, the last convolutional layer outputs a feature map of $5 \times (4 + 1 + C)$ channels, corresponding to 4 coordinate regression outputs ($\Delta_i$), 1 class-agnostic objectness score, and $C$ category scores for each anchor box.

We also employ the same data augmentation techniques as in YOLOv2, including random jittering, scaling, and random hue, exposure and saturation changes of the image. The same loss weights are used to balance the localization loss, the objectness loss and the classification loss.

During testing, an image is resized to a specified size (e.g. 416-by-416 pixels), and then fed into the detection network. For each anchor box $a_i$ and the corresponding output $\Delta_i$, the output bounding box is

$$\left(a_i^{(x)} + \Delta_i^{(x)},\; a_i^{(y)} + \Delta_i^{(y)},\; a_i^{(w)} \exp\{\Delta_i^{(w)}\},\; a_i^{(h)} \exp\{\Delta_i^{(h)}\}\right)$$

with the score being the product of the objectness score and the conditional classification score. The final prediction results are the top-$k$ (typically $k = 300$) candidate boxes sorted by the box scores, after class-specific Non-Maximum Suppression (NMS) with an IoU threshold of 0.45. We implement the approach on Caffe [6].

4.2. Experiment Results

4.2.1 PASCAL VOC

The PASCAL VOC dataset [3] contains box annotations over 20 object categories. We adopt the commonly used 07+12 train/test split, where the VOC 2007 trainval (5k images) and VOC 2012 trainval (11k images) are used as the training set, and VOC 2007 test (4952 images) is used as the testing set.
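The test-time procedure of Sec. 4.1 (decoding each anchor and its output into a box, then greedy NMS and top-$k$ selection) can be sketched as follows. This is a simplified single-class illustration with hypothetical helper names and toy boxes, not the Caffe implementation used in the paper.

```python
import numpy as np

def decode(anchor, delta):
    """Decode (x, y, w, h) from an anchor and the network output as in Sec. 4.1."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = delta
    return (ax + dx, ay + dy, aw * np.exp(dw), ah * np.exp(dh))

def to_corners(box):
    """Convert a center-format box (x, y, w, h) to corners (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def iou(a, b):
    """IoU of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.45, top_k=300):
    """Greedy NMS for one class; returns indices of kept boxes, best score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(int(i))
        if len(keep) == top_k:
            break
    return keep

decoded = decode(anchor=(8.0, 8.0, 4.0, 4.0), delta=(0.5, -0.5, 0.0, np.log(2.0)))
print(decoded, to_corners(decoded))

# Toy corner-format boxes: the second overlaps the first heavily and is suppressed.
boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = np.array([0.9, 0.8, 0.7])
keep = nms(boxes, scores)
print(keep)
```

Note that the anchor shapes enter only through `decode`, so learned anchors change the constants used at inference but add no computation.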
The model training is done in 30,000 iterations of SGD (momentum = 0.9) with mini-batch size 64 evenly divided onto 4 GPUs. The learning rate follows a step-wise schedule: (0-100, 1e-4), (100-15,000, 1e-3), (15,000-27,000, 1e-4), (27,000-30,000, 1e-5). The training image size is set to 416 or 544 to match the test size. The anchor shapes are initialized by three methods to study the robustness.

1. uniform: The anchor shapes are chosen uniformly, i.e. [(3, 3), (3, 9), (9, 9), (9, 3), (6, 6)] × stride, with the stride being 32 here.

2. identical: All 5 anchor boxes are identical and initialized as (5, 5) × stride.
3. k-means: The values are borrowed from the open source code of YOLOv2 (https://github.com/pjreddie/darknet), which performs k-means clustering on the ground-truth box shapes with the IoU as the metric.

Table 1. Detection results on the Pascal VOC 2007 test set, trained on the VOC 07+12 trainval sets. Size represents the shorter edge of the test image. mAP.5 stands for mean average precision at IoU 0.5. AP for each class is also reported.

| Method | Size | mAP.5 | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster rcnn vgg [18] | 600 | 73.2 | 76.5 | 79.0 | 70.9 | 65.5 | 52.1 | 83.1 | 84.7 | 86.4 | 52.0 | 81.9 | 65.7 | 84.8 | 84.6 | 77.5 | 76.7 | 38.8 | 73.6 | 73.9 | 83.0 | 72.6 |
| Faster rcnn res [4] | 600 | 76.4 | 79.8 | 80.7 | 76.2 | 68.3 | 55.9 | 85.1 | 85.3 | 89.8 | 56.7 | 87.8 | 69.4 | 88.3 | 88.9 | 80.9 | 78.4 | 41.7 | 78.6 | 79.8 | 85.3 | 72.0 |
| SSD512 [13] | 512 | 76.8 | 82.4 | 84.7 | 78.4 | 73.8 | 53.2 | 86.2 | 87.5 | 86.0 | 57.8 | 83.1 | 70.2 | 84.9 | 85.2 | 83.9 | 79.7 | 50.3 | 77.9 | 73.9 | 82.5 | 75.3 |
| YOLOv2 [16] | 416 | 76.8 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| YOLOv2 [16] | 544 | 78.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Baseline (identical) | 416 | 75.76 | 75.6 | 84.2 | 77.0 | 63.0 | 47.3 | 82.8 | 84.1 | 90.6 | 55.2 | 80.8 | 72.5 | 86.3 | 87.4 | 84.6 | 75.9 | 48.0 | 79.1 | 77.2 | 85.8 | 77.9 |
| Baseline (uniform) | 416 | 76.32 | 75.9 | 83.7 | 75.1 | 64.1 | 50.5 | 84.3 | 83.9 | 91.4 | 57.7 | 81.8 | 73.7 | 88.6 | 88.0 | 83.8 | 77.1 | 47.6 | 77.0 | 78.4 | 88.1 | 75.8 |
| Baseline (k-means) | 416 | 76.83 | 76.9 | 85.1 | 76.3 | 63.8 | 46.8 | 83.6 | 83.4 | 91.4 | 56.4 | 84.8 | 77.3 | 88.5 | 88.2 | 83.5 | 77.2 | 50.3 | 80.2 | 81.2 | 86.6 | 75.3 |
| Baseline (k-means) | 544 | 79.45 | 77.5 | 87.2 | 80.1 | 66.5 | 56.1 | 85.3 | 86.2 | 89.7 | 63.0 | 88.6 | 76.5 | 88.0 | 91.0 | 87.9 | 81.9 | 53.8 | 84.9 | 79.5 | 86.1 | 79.5 |
| Opt (identical) | 416 | 78.01 | 77.8 | 86.6 | 78.3 | 67.6 | 50.6 | 85.1 | 85.1 | 91.6 | 59.1 | 82.3 | 78.0 | 88.5 | 90.2 | 86.2 | 79.0 | 53.1 | 81.4 | 81.9 | 89.5 | 76.3 |
| Opt (uniform) | 416 | 77.95 | 78.2 | 87.3 | 75.3 | 67.2 | 52.9 | 86.3 | 85.3 | 90.5 | 56.2 | 84.1 | 76.9 | 89.7 | 89.5 | 85.9 | 78.9 | 50.2 | 79.2 | 82.1 | 87.1 | 76.9 |
| Opt (k-means) | 416 | 77.99 | 76.3 | 87.4 | 77.6 | 66.6 | 52.0 | 85.3 | 85.0 | 91.5 | 57.5 | 83.6 | 77.1 | 88.6 | 90.6 | 84.7 | 78.4 | 50.2 | 82.7 | 80.3 | 87.1 | 76.6 |
| Opt (k-means) | 544 | 80.69 | 75.8 | 88.3 | 79.4 | 66.8 | 56.9 | 88.5 | 87.9 | 89.6 | 62.4 | 88.8 | 75.4 | 89.0 | 90.9 | 88.7 | 83.2 | 51.1 | 84.7 | 73.2 | 86.6 | 80.3 |

Figure 2. Pascal VOC anchors and box distribution in log scale. The red star markers show the learned anchor shapes. Underlying the markers is the kernel density of the bounding box width and height with the image resized to 416 × 416. Darker color means higher density. Around the figure are the marginal distributions of log(w) and log(h).

The results are shown in Table 1. We also list Faster R-CNN and SSD results in the table for completeness. Note that we are not targeting the best accuracy but mainly the effectiveness of the proposed approach. In the Baseline (*) rows of the table, the anchor shapes are fixed as in conventional detection model training, whereas in the Opt (*) rows, the anchor shapes are optimized with our proposed method of Sec. 3.

From the results, our anchor optimization method consistently produces better results than the baselines. Our re-implementations of YOLOv2 attain similar or better performance compared to what the original paper reports (comparing Baseline (k-means) with YOLOv2). The proposed anchor learning method further boosts the performance by about 1.2 points or more in absolute mAP (1.16 to 2.25 points across the settings in Table 1). For example, with k-means initialization and 544 as the image size, the baseline achieves 79.45% mAP, while our method boosts the accuracy to 80.69%, a 1.24 point improvement.

Furthermore, different anchor shape initializations achieve similar accuracy. Within our proposed approach with different initializations, the accuracy difference between the best and the worst is only 0.06 point for the size of 416, suggesting that our method is very robust to different initial anchor configurations. Hence, the manual choice of appropriate initial anchor shapes becomes less critical with our method.
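For reference, the uniform and identical initializations and the step-wise learning rate schedule of the VOC experiments can be written down in a few lines. The sketch below is illustrative only: the names are hypothetical, and treating each schedule interval as half-open (the new rate applies from the boundary step onward) is an assumption, since the text does not specify boundary handling.

```python
STRIDE = 32

# Initial anchor shapes in units of feature-map cells, as listed in the text.
UNIFORM = [(3, 3), (3, 9), (9, 9), (9, 3), (6, 6)]
IDENTICAL = [(5, 5)] * 5

def to_pixels(shapes, stride=STRIDE):
    """Scale anchor shapes from cell units to input-image pixels."""
    return [(w * stride, h * stride) for w, h in shapes]

# (end_step, learning_rate) pairs for the 30,000-iteration VOC schedule.
VOC_SCHEDULE = [(100, 1e-4), (15000, 1e-3), (27000, 1e-4), (30000, 1e-5)]

def learning_rate(step, schedule=VOC_SCHEDULE):
    """Return the learning rate for a step; intervals are assumed half-open [start, end)."""
    for end_step, lr in schedule:
        if step < end_step:
            return lr
    return schedule[-1][1]

print(to_pixels(UNIFORM))
print(to_pixels(IDENTICAL))
print([learning_rate(s) for s in (0, 100, 15000, 27000, 29999)])
```

The identical setting starts all five anchors at the same point, so any diversity in the learned shapes comes from the assignment and the gradients during training.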
Note that for the identical setting, although the anchor sizes are the same at the beginning, they can be optimized to different values since different anchors are responsible for different ground-truth boxes during training.

Figure 2 illustrates the uniform anchors, the k-means anchors, the learned anchors (with k-means initialization), and the ground-truth box shape distribution in the log w-log h plane. We observe that both the learned anchors and the k-means anchors align closely with the underlying ground-truth box distribution, which intuitively explains why they produce better performance than the uniform anchors. The learned anchors spread more broadly and are slightly smaller than the k-means anchors. The reason might be that small bounding boxes are relatively hard to regress, so the network pushes the anchors to focus more on small objects to lower the loss. This indicates that the anchor optimization process is more than merely clustering. It is also able to adapt the anchor shapes to the data augmentation and the network regression capability to improve the accuracy.

4.2.2 MS COCO

Table 2. Detection results on MS COCO val. Average Precisions (AP) at different IoU thresholds and different box scales (Small, Medium, Large at IoU 0.5) are reported.

| Method | AP.5:.95 | AP.5 | AP.75 | AP.5 S | AP.5 M | AP.5 L |
|---|---|---|---|---|---|---|
| Baseline (uniform) | 21.90 | 42.06 | 20.57 | 2.27 | 29.04 | 57.10 |
| Baseline (k-means) | 23.45 | 43.87 | 22.84 | 2.55 | 31.00 | 59.43 |
| Opt (identical) | 24.43 | 45.07 | 24.05 | 3.03 | 32.24 | 60.61 |
| Opt (uniform) | 24.55 | 45.33 | 24.04 | 3.11 | 32.52 | 60.70 |
| Opt (k-means) | 24.47 | 45.07 | 23.74 | 3.06 | 32.65 | 60.09 |

We adopt the frequently used COCO [12] 2017 Detection Challenge train/val splits, where the training set has 115K images, the val set has 5K images, and the test-dev set contains about 20K images whose box annotations are not publicly available. The dataset contains 80 object categories.

We use training configurations similar to the VOC experiments. The mini-batch size is 64, evenly split across 4 GPUs. The momentum of SGD is set to 0.9. Since the COCO dataset has substantially more images than VOC, we increase the number of iterations to 100,000, and set the learning rate schedule to (0-1,000, 1e-4), (1,000-80,000, 1e-3), (80,000-90,000, 1e-4), (90,000-100,000, 1e-5). The training and testing image sizes are both set to 544 in all experiments.

Since the bounding box annotations of test-dev are not exposed, we upload our detection results to the official COCO evaluation server (https://competitions.codalab.org/competitions/5181) to retrieve the scores. The results on the val set are shown in Table 2, and the results on the test-dev set are in Table 3.
In Table 3, AP.5:.95 denotes the mean of AP evaluated at IoU thresholds evenly distributed between 0.5 and 0.95; AR denotes the average recall rate. Compared to the original YOLOv2 results, our re-implementation even achieves a higher accuracy, with AP.5:.95 increased from 21.6% to 24.0% and AP.5 increased from 44.0% to 44.9%. When equipped with the proposed anchor optimization method, the accuracy is further improved by 1 point, with AP.5:.95 reaching 25.0% and AP.5 reaching 45.9%. Similar improvements can also be observed in the results on the val split. This strongly demonstrates the superiority of the anchor optimization in achieving higher accuracy.

Figure 3. MS COCO anchors and box distribution in log scale. Underlying the markers is the kernel density of the ground-truth bounding box width and height with the image resized to 544 × 544. Around the figure are the marginal distributions of log(w) and log(h).

Meanwhile, the baseline approach without anchor optimization is quite sensitive to the anchor shapes. On val, the k-means initialization achieves 23.45%, while the uniform initialization achieves 21.90%, a 1.55 point difference. Comparatively, our optimization approach is more robust and the difference between the highest (24.55 on val) and the lowest (24.43 on val) is only 0.12. On test-dev, different initialization methods achieve the same AP.5:.95 (25.0), which further verifies the robustness towards different initialization methods. The learned anchors with different initializations are shown in Table 4, and we can easily observe that the anchor shapes are quite similar even though the initial values are different.

Figure 3 shows the learned anchor shapes against the uniform and the k-means anchors. The learned anchors nicely cover the ground-truth bounding box distribution. They also tend to be slightly smaller than the original k-means values, which could help small object detection since large objects are relatively easy to detect. This can also be verified from Table 3. Taking the k-means initialization as an instance, the gains for small objects (from 4.4% to 5.7%) and medium objects (from 24.6% to 26.6%) are high, while the accuracy for large objects is even sacrificed a little (from 40.9% to 40.8%).

Table 3. Detection results from the evaluation server on MS COCO test-dev. AP means average precision, AR means average recall. AP.5:.95 is the mean of AP at IoU 0.5:0.05:0.95. Subscripts S, M and L correspond to small, medium and large bounding boxes respectively.

| Method | AP.5:.95 | AP.5 | AP.75 | AP S | AP M | AP L | AR 1 | AR 10 | AR 100 | AR S | AR M | AR L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster RCNN vgg [18] | 21.9 | 42.7 | - | - | - | - | - | - | - | - | - | - |
| Faster RCNN [1] | 24.2 | 45.3 | 23.5 | 7.7 | 26.4 | 37.1 | 23.8 | 34.0 | 34.6 | 12.0 | 38.5 | 54.4 |
| SSD512 [13] | 26.8 | 46.5 | 27.8 | 9.0 | 28.9 | 41.9 | 24.8 | 37.5 | 39.8 | 14.0 | 43.5 | 59.0 |
| YOLOv2 [16] | 21.6 | 44.0 | 19.2 | 5.0 | 22.4 | 35.5 | 20.7 | 31.6 | 33.3 | 9.8 | 36.5 | 54.4 |
| Baseline (uniform) | 22.4 | 42.5 | 21.4 | 4.4 | 21.5 | 38.8 | 21.5 | 31.2 | 32.1 | 7.3 | 32.9 | 57.4 |
| Baseline (k-means) | 24.0 | 44.9 | 23.3 | 4.4 | 24.6 | 40.9 | 22.4 | 33.0 | 34.1 | 7.6 | 37.2 | 58.4 |
| Opt (identical) | 25.0 | 45.8 | 24.5 | 5.7 | 26.5 | 40.4 | 23.3 | 34.4 | 35.6 | 9.5 | 39.4 | 58.8 |
| Opt (uniform) | 25.0 | 45.8 | 24.3 | 5.9 | 26.1 | 40.8 | 23.3 | 34.4 | 35.6 | 9.5 | 39.0 | 59.1 |
| Opt (k-means) | 25.0 | 45.9 | 24.7 | 5.7 | 26.6 | 40.8 | 23.3 | 34.4 | 35.6 | 9.5 | 39.6 | 58.8 |

Table 4. Learned anchors from different initializations on COCO with image size 544.

| Init | $s_1^{(w)}, s_1^{(h)}$ | $s_2^{(w)}, s_2^{(h)}$ | $s_3^{(w)}, s_3^{(h)}$ | $s_4^{(w)}, s_4^{(h)}$ | $s_5^{(w)}, s_5^{(h)}$ |
|---|---|---|---|---|---|
| identical | 5.8, 6.7 | 17.4, 20.1 | 44.8, 45.8 | 108, 99.2 | 241, 237 |
| uniform | 5.8, 6.8 | 17.4, 20.5 | 44.8, 45.8 | 106, 101 | 245, 237 |
| k-means | 5.7, 6.7 | 16.9, 20.1 | 43.8, 44.8 | 104, 98.9 | 241, 230 |

4.2.3 Brainwash

Brainwash is a head detection dataset introduced in [19], which has about 10k images for training, about 500 images for validation and 484 images for testing. The images show indoor scenes where people come and go, captured with a surveillance camera.

We train the detection model for 10,000 steps, with learning rate schedule (0-100, 1e-4), (100-5,000, 1e-3), (5,000-9,000, 1e-4), (9,000-10,000, 1e-5). No random scaling augmentation is used since the camera is still, while other kinds of data augmentation remain unchanged. The image crop size during training is set to 320, and the test image size is chosen as 640. We still employ 5 anchor shapes. No classification loss is applied since there is only one category (head).
We report AP.5 as the performance criterion in Table 5. The baseline result with the anchor shapes from COCO is also presented. The k-means anchors are computed in a similar way as in YOLOv2. Since the head bounding boxes are much smaller than those of the VOC or COCO datasets, we find that with the anchor shapes from the COCO settings only one anchor shape is activated throughout the training and the remaining anchor shapes never get used. In this case, the neural network also needs to predict large deviations for $\Delta^{(w)}$ and $\Delta^{(h)}$ to fit all the ground-truth boxes. This means it is sub-optimal to use the anchor shapes from other domains. In comparison, the proposed anchor learning method can adjust the anchor shapes quickly to cover the ground-truth bounding boxes well. From the results, we observe that Opt (*) consistently outperforms the baselines by a large margin, demonstrating the effectiveness of the proposed method. Even with k-means as the initialized anchor shapes, our approach still improves the accuracy by 1.2 points (from 78.98% to 80.18%).

Table 5. Detection results on the Brainwash dataset. Test image size is 640. AP.5 is the average precision with IoU threshold 0.5.

| Method | Size | AP.5 |
|---|---|---|
| Baseline (coco) | 640 | 77.96 |
| Baseline (uniform) | 640 | 78.03 |
| Baseline (k-means) | 640 | 78.98 |
| Opt (identical) | 640 | 79.85 |
| Opt (uniform) | 640 | 79.86 |
| Opt (k-means) | 640 | 80.18 |

5. Conclusion

In this paper, we have introduced an anchor optimization method which can be employed in most existing anchor-based object detection frameworks to automatically learn the anchor shapes during training. The learned anchors are better suited for the specific data and network structure and can produce better accuracy. We demonstrated the effectiveness of the proposed method based on the popular one-stage object detection framework YOLOv2. Extensive experiments on the Pascal VOC, MS COCO and Brainwash benchmark datasets show superior detection accuracy of our proposed method over the baseline.
We also show that the anchor optimization method is robust to initialization (identical, uniform, k-means), and hence the need for careful handcrafting of anchor shapes to reach good performance is greatly alleviated. Moreover, the proposed method is quite general. The same method can also be applied to other anchor-box-based one-stage methods such as SSD [13] and RetinaNet [11], and to two-stage methods to improve the region proposals. The method is independent of improvements such as Feature Pyramid Networks (FPN) [10] and thus can potentially be combined with them to further boost performance. Our work solves the problem of optimizing anchor shapes, but not the number of anchors, which would be an interesting topic to study. Finally, theoretical work on why and how the anchor mechanism works better than plain regression would also be very valuable to the field.

References

[1] COCO: Common objects in context. http://mscoco.org/dataset/#detections-leaderboard. Accessed: 2018-11-10.
[2] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34:743-761, 2012.
[3] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88:303-338, 2009.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
[5] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[6] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[7] H. Law and J. Deng. CornerNet: Detecting objects as paired keypoints. CoRR, abs/1808.01244, 2018.
[8] M. Liao, B. Shi, and X. Bai. TextBoxes++: A single-shot oriented scene text detector. IEEE Trans. Image Processing, 27(8):3676-3690, 2018.
[9] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017 [10], pages 936-944.
[10] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection.
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936-944, 2017.
[11] T.-Y. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[12] T.-Y. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[13] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[14] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis. SSH: Single stage headless face detector. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 4885-4894, 2017.
[15] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779-788, 2016.
[16] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517-6525, 2017.
[17] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.
[18] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:1137-1149, 2015.
[19] M. A. Russell Stewart and A. Y. Ng. End-to-end people detection in crowded scenes. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2325-2333, 2016.
[20] C. Szegedy, S. E. Reed, D. Erhan, and D. Anguelov. Scalable, high-quality object detection. CoRR, abs/1412.1441, 2014.
[21] L. Tychsen-Smith and L. Petersson. DeNet: Scalable real-time object detection with directed sparse sampling.
2017 IEEE International Conference on Computer Vision (ICCV), pages 428-436, 2017.
[22] T. Yang, X. Zhang, W. Zhang, and J. Sun. MetaAnchor: Learning to detect objects with customized anchors. NIPS, abs/1807.00980, 2018.
[23] L. Zhang, L. Lin, X. Liang, and K. He. Is Faster R-CNN doing well for pedestrian detection? In European Conference on Computer Vision, pages 443-457. Springer, 2016.
[24] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S^3FD: Single shot scale-invariant face detector. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 192-201, 2017.