Vadim V. Romanuke * (Professor, Polish Naval Academy, Gdynia, Poland)


Electrical, Control and Communication Engineering, 2018, vol. 14, no. 1
doi: 10.2478/ecce…

An Attempt of Finding an Appropriate Number of Convolutional Layers in CNNs Based on Benchmarks of Heterogeneous Datasets

Vadim V. Romanuke * (Professor, Polish Naval Academy, Gdynia, Poland)

Abstract. An attempt of finding an appropriate number of convolutional layers in convolutional neural networks is made. The benchmark datasets are CIFAR-10, NORB and EEACL26, whose diversity and heterogeneousness must serve for a general applicability of a rule presumed to yield that number. The rule is drawn from the best performances of convolutional neural networks built with 2 to 12 convolutional layers. It does not give an exact best number of convolutional layers but is the result of a short process of trying a few versions of such numbers. For small images (like those in CIFAR-10), the initial number is 4. For datasets that have a few tens of image categories and more, initially setting five to eight convolutional layers is recommended depending on the complexity of the dataset. The fuzziness in the rule is not removable because of the required diversity and heterogeneousness.

Keywords: Convolutional neural networks; Convolutional layers; Error rate; Hyperparameters; Performance.

I. THE PROBLEM OF AN APPROPRIATE NUMBER OF CONVOLUTIONAL LAYERS

In machine learning for image recognition, the convolutional layer (ConvL) is the core building block of a convolutional neural network (CNN). A ConvL is a set of learnable filters, which are actually three-dimensional matrices, to which a bias vector is attached [1], [2]. The parameters of a ConvL, called hyperparameters, are as follows [2], [3]:

1. Height F_height of the filter (size along the vertical axis). Integer F_height must be positive.
2. Width F_width of the filter (the horizontal axis). Integer F_width must be positive, and commonly F_width = F_height [4], [5].
3. Depth K of the filter. The depth of the filter of the first ConvL is equal to the number of color channels in the input image. The depth of the filter of a subsequent ConvL is equal to the number of filters of the antecedent ConvL [6].
4. Stride s. Integer s must be positive; it controls how depth columns are allocated around the spatial dimensions (width and height). Often s = 1, so a new depth column of neurons is allocated to spatial positions only one spatial unit apart [7].
5. Zero-padding p. Integer p must be non-negative; it allows preserving exactly the spatial size of the output volumes [2], [5], [8].

All these hyperparameters are set by rules of thumb [2], [7]. Moreover, when a CNN architecture is built, the number of ConvLs N (a positive integer) is set just by experience. Thus, setting the integer N appropriately is an open issue. Resolving it can significantly improve performance.

II. BACKGROUND AND MOTIVATION

It is believed that the complexity of an image recognition problem (IRP) is associated with the number of ConvLs. The complexity of IRPs issues from the number of image categories, the number of features (dimensionality), the influence of color, the influence of chrominance, and the diversity of images labelled as belonging to the same category [9], [10]. Naively, the more complex IRPs may need a greater N. This has, however, not been proved yet. Moreover, it is unknown whether this is provable or not [11].

Unlike the hyperparameters of a ConvL, the number of ConvLs is not limited from above [1], [2], [6], [7]. If the hyperparameters are selected appropriately, N should be varied starting from 2 up to some integer N_max, at which the effectiveness of CNNs is lower than at a smaller N. The effectiveness means performance and operation speed (computational rate) [1], [2], [5], [10], [12], [13]. Obviously, the computational rate decreases (at least slightly) as N increases, so this is a constraint preventing the assignment of a great N [6], [7], [9]. For instance, the position of the runner-up in ILSVRC 2014 was taken by the CNN that became known as VGGNet [11], [14], containing 16 ConvLs.
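The hyperparameters listed in Section I fix both the spatial output size of a ConvL and its number of learnable parameters, which is one reason why a greater N costs memory and computation. The following is a minimal sketch, assuming the standard relations (output size floor((W - F + 2p) / s) + 1 along each axis, and F_height x F_width x K weights plus one bias per filter). The function names and the example values (64 filters of size 5 x 5 applied to a 32 x 32 three-channel image) are illustrative and are not taken from the paper.

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial output size of a ConvL along one axis.

    w: input size, f: filter size, s: stride, p: zero-padding.
    Uses the standard relation floor((w - f + 2p) / s) + 1.
    """
    if f <= 0 or s <= 0 or p < 0:
        raise ValueError("f and s must be positive, p must be non-negative")
    if w + 2 * p < f:
        raise ValueError("filter does not fit into the padded input")
    return (w - f + 2 * p) // s + 1


def conv_param_count(f_height, f_width, k, n_filters):
    """Learnable parameters of a ConvL: weights of all filters plus one bias each."""
    return f_height * f_width * k * n_filters + n_filters


if __name__ == "__main__":
    # Illustrative values only (not from the paper).
    print(conv_output_size(32, 5))            # 28: no padding shrinks the image
    print(conv_output_size(32, 5, s=1, p=2))  # 32: p = (f - 1) / 2 preserves the size
    print(conv_param_count(5, 5, 3, 64))      # 4864 = 5*5*3*64 weights + 64 biases
```

With s = 1 and an odd filter size, choosing p = (F - 1) / 2 keeps the spatial size unchanged, which is the size-preserving role of zero-padding described in item 5.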
A downside of VGGNet is that it is very expensive to evaluate and uses much more memory and parameters (a MATLAB .mat file of VGGNet has a size on the order of a gigabyte). But if some ConvLs nearest to the VGGNet output layer are removed, the performance is still the same and the number of necessary parameters is significantly reduced [11], [15], [16].

III. A GOAL FOR FINDING A RULE OF APPROPRIATELY SETTING THE NUMBER OF CONVLS

The goal is to find a rule for appropriately setting the integer N regarding the number of image categories and the dimensionality of an IRP. In other words, once an IRP is given with its number of image categories and image size, the rule must yield a certain integer N or a few versions of this number. In the worst case, an integer interval for an appropriate number of ConvLs should be formed. For stating the rule, four tasks need to be accomplished:

1. To form a variety of IRPs for benchmarking.
2. To test the IRPs on an admissible interval of integers N.
3. To establish the correspondence of the best performance to N.
4. To formalise the correspondence as a rule.

The rule will allow rationally constructing a pivot of CNNs, which is a sequence of ConvLs. Having the pivot, the remaining parts of the CNNs (pooling layers, ReLUs, DropOut layers, normalisation layers) are allocated more easily. This would be a profound contribution to the theory of CNNs for making image recognition more effective.

* romanukevadimv@gmail.com
© 2018 Vadim V. Romanuke. This is an open access article licensed under the Creative Commons Attribution License, in the manner agreed with Sciendo.

IV. IRPS FOR BENCHMARKING

The rule is expected to be generally acceptable for a wide range of IRPs. That is, it must be generalisable. To prevent the rule from overfitting (this is a meta-overfitting to a group of IRPs, an extension of the common overfitting to training sets), the benchmark IRPs should be dissimilar. Thus, the datasets with their entries should satisfy a requirement of dissimilarity in the following:

1) the number of image categories;
2) the number of color channels;
3) the initial image size;
4) the origination of the image content;
5) the types of objects to be recognised.

These five dissimilarities ensure diversity and heterogeneousness of IRPs. However, this is not sufficient for benchmarking, since, for instance, the ImageNet dataset is too huge for statistical research. Therefore, an additional requirement is that the size of the benchmark should be moderate. This implies a medium image size (not larger than 128 pixels) as well as a fairly small number of image categories (a few tens at the most).

There are three datasets that completely satisfy these requirements: CIFAR-10 (Fig. 1), NORB (Fig. 2), EEACL26 (Fig. 3). Although CIFAR-10 has only 10 image categories, the diversity of its entries is the highest. The image categories labelled as airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck are diverse themselves. CIFAR-10 consists of 60000 images, where each category is represented with 6000 entries.

Fig. 1. A subset of the CIFAR-10 dataset consisting of color images whose original size is 32 × 32 in each of the three color channels [6], [9], [10]. The diversity of its entries is the highest as the dataset is heterogeneous itself.

The NORB dataset consists of images representing fifty toys belonging to five generic categories (four-legged animals, human figures, airplanes, trucks and cars). Although NORB has only six image categories, including one image background category, the diversity of its entries is rather high. The objects were originally imaged by two cameras at six sets of lighting conditions, nine elevations, and eighteen azimuths. Then they were jittered and cluttered by random perturbation of position, scaling, and varying brightness and contrast. The disparities were adjusted and randomly picked so that the objects appeared placed on highly textured horizontal surfaces at a small random distance from those surfaces. In addition, a randomly picked distracting object was placed at the periphery of the image.

Fig. 2. A subset of the NORB dataset consisting of 108 × 108 8-bit greyscale images [6]. The diversity here is high but NORB has only six image categories.

A far lighter and easier dataset is EEACL26, which represents images of enlarged capital letters of the English alphabet. It has 26 categories, and it is a completely artificial dataset. Hence it is scalable: as many images can be generated as needed, and their size is adjustable. There are three types of distortion: scaling, rotation, and shifting. The intensity of these distortions is regulated with their magnitudes. Fig. 3 shows a moderate intensity of the distortions. At such intensity, 2000 entries per letter are enough for training and validating [13], [17], [18].

V. ADMISSIBILITY OF INTEGERS N

Admissibility here implies rationality and reasonability, i.e. testing the IRPs on an admissible interval of integers N must expose the best performance as well as a moderate one, while the worse performance is expected closer to the endpoints of the interval. Setting a single ConvL is obviously inappropriate (there would not have been any convolution), so let N = 2 be the left endpoint of the interval for the worst-case reference. The maximum integer N_max depends on the IRP and its image size. The entries of CIFAR-10 are recognised successfully by four to six ConvLs for any of the image sizes considered. The same goes for NORB. For successful training on the EEACL26 dataset, some versions of CNNs have only three ConvLs [3].

Eventually, the number of ConvLs is also adjusted with the number of pooling layers which follow the ConvLs. Hence, let N_max = 8 for 32 × 32 images by applying no resizing for CIFAR-10 and downsampling the NORB and EEACL26 entries. Then let N_max = 9 for 48 × 48 images and N_max = 10 for 64 × 64 images by upsampling the CIFAR-10 entries and downsampling the NORB and EEACL26 entries. It is appropriate to set N_max = 11 for 96 × 96 images. Separately, N_max = 12 for the original 108 × 108 NORB images. All the versions of CNN architecture to be tested are shown as binary combinations in Table I, where the pooling (2 × 2 subsampling) is indicated with ones, and zeros indicate that a ConvL is not followed by a pooling layer [9], [19], [20].
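The binary encoding of pooling described above can be made concrete with a short sketch. It assumes stride 1 and no zero-padding in every ConvL and non-overlapping 2 × 2 subsampling; these assumptions, the function names, and the example filter sizes are hypothetical and are not values read from Table I. The sketch traces the spatial size through a stack of ConvLs and lists which pooling masks keep the feature maps from vanishing.

```python
from itertools import product


def trace_spatial_sizes(image_size, filter_sizes, pool_mask):
    """Return the spatial size after each ConvL (and its optional pooling).

    pool_mask[i] == 1 means the i-th ConvL is followed by 2x2 subsampling,
    0 means it is not. Assumes stride 1 and no zero-padding (an assumption
    of this sketch, not a statement about the paper's CNNs).
    """
    if len(filter_sizes) != len(pool_mask):
        raise ValueError("one mask bit is required per ConvL")
    sizes = []
    w = image_size
    for f, pooled in zip(filter_sizes, pool_mask):
        w = w - f + 1              # valid convolution, stride 1
        if pooled:
            w //= 2                # 2x2 subsampling
        if w < 1:
            raise ValueError("feature map vanished")
        sizes.append(w)
    return sizes


def feasible_masks(image_size, filter_sizes):
    """Enumerate all binary pooling masks that keep every feature map non-empty."""
    masks = []
    for mask in product((0, 1), repeat=len(filter_sizes)):
        try:
            trace_spatial_sizes(image_size, filter_sizes, mask)
        except ValueError:
            continue
        masks.append(mask)
    return masks


if __name__ == "__main__":
    filters = [5, 3, 3, 2]  # hypothetical filter sizes for N = 4
    print(trace_spatial_sizes(32, filters, (1, 1, 0, 0)))  # [14, 6, 4, 3]
    print((1, 1, 1, 1) in feasible_masks(32, filters))     # False
```

A check of this kind shows why the number of ConvLs and the number of pooling layers have to be adjusted together: on a small image, pooling after every ConvL exhausts the spatial size before the last layer is reached.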

Fig. 3. A subset of the EEACL26 dataset consisting of 8-bit greyscale images created from originally monochrome 60 × 80 images [9]. Unlike CIFAR-10 or NORB, EEACL26 images are extremely simple; however, they fall into 26 classes.

TABLE I
VERSIONS OF CNN ARCHITECTURE TO BE TESTED ON THE DATASETS

[Table I lists 45 versions with N from 2 to 12. Columns: version number (#); CNN architecture as a binary combination, with N in parentheses; sizes of the ConvL filters (in order of numbering from the CNN input); image size (dimension); datasets. The table entries are not recoverable.]
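The numbering of the 45 versions in Table I appears to follow a simple pattern: versions are ordered by N first and by image size second, with each image size dropping out once N exceeds its N_max. The sketch below assumes exactly this ordering (an inference, not something the paper states) and uses the N_max values of Section V. Under that assumption it reproduces the version sets Q(32), Q(96) and Q(108) quoted in Section VI. The names `N_MAX`, `version_table` and `version_set` are illustrative.

```python
# N_max per image size, as read from Section V.
N_MAX = {32: 8, 48: 9, 64: 10, 96: 11, 108: 12}


def version_table(n_max=N_MAX, n_min=2):
    """Map version number -> (N, image size), numbering by N, then by image size.

    The ordering is an assumption of this sketch.
    """
    versions = {}
    number = 0
    for n in range(n_min, max(n_max.values()) + 1):
        for w in sorted(n_max):
            if n <= n_max[w]:
                number += 1
                versions[number] = (n, w)
    return versions


def version_set(w, versions):
    """Version numbers tested at image size w, i.e. the set Q(w)."""
    return sorted(k for k, (_, size) in versions.items() if size == w)


if __name__ == "__main__":
    table = version_table()
    print(len(table))              # 45
    print(version_set(32, table))  # [1, 6, 11, 16, 21, 26, 31]
    print(version_set(96, table))  # [4, 9, 14, 19, 24, 29, 34, 38, 41, 43]
```

The same call with w = 108 gives the eleven versions with N from 2 to 12 that are available only for the original NORB image size.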

The listed architectures are close to being quasi-optimal for the corresponding N. For accelerating the training processes, a single ReLU is inserted before the last ConvL, without DropOut layers [21], [22]. Although this would impair generalisation, our task is to obtain consistent statistics on performance. The performance consistency implies a good enough differentiation of error rate over various versions of CNN architecture (see Table I), which must help in finding the most appropriate integer(s) N.

VI. EXTRACTION OF INTEGERS N CORRESPONDING TO THE BEST PERFORMANCE

It takes a few epochs to obtain a sufficiently discriminated performance. Consider the error rate for an IRP with image size W × W, obtained by a given CNN architecture version (numbered as in the first column of Table I) after a given epoch. For comparing among IRPs, this performance is normalised [9] to either (1) or (2) over Q(W), the set of the versions for the given IRP and the given image size.

[Equations (1) and (2) are not recoverable. Both normalise the error rate over the versions in Q(W); (2) uses the final-epoch error rate.]

For instance, Q(32) = {1, 6, 11, 16, 21, 26, 31} and Q(96) = {4, 9, 14, 19, 24, 29, 34, 38, 41, 43} are the sets for researching the minimum and maximum size of CIFAR-10 images. The sets are the same for EEACL26. The NORB dataset is researched in a wider range, from Q(32) = {1, 6, 11, 16, 21, 26, 31} to Q(108) = {5, 10, 15, 20, 25, 30, 35, 39, 42, 44, 45}.

Figures 4-6 show the normalised error rates (1) polylined for trend comparisons along the axis of versions, which for each dataset is restricted to a subset of Q(W) denoted Q̂(W). The final-epoch normalised error rates (2) are polylined in Figures 7-9 by the same axes. A similarity between a dataset's polylines holds. However, the polylines of final-epoch performance (2) look more scattered.

Fig. 4. The normalised error rates (1) for CIFAR-10. The best performance is observed at four ConvLs, except for the largest image size, for which the best performance corresponds to five ConvLs.

Fig. 5. The normalised error rates (1) for NORB. The best performance is observed at five ConvLs, except for the smallest image size, where the best performance is provided by four ConvLs.

Fig. 6. The normalised error rates (1) for EEACL26. The best performance is observed at five ConvLs, except for the smallest images, where the best performance is provided by three or four ConvLs.

Fig. 7. The final-epoch normalised error rates (2) for CIFAR-10. The best N for W = 32 is 4, the best N for W = 96 is 6, and N = 5 fits for the rest of the cases.

Fig. 8. The final-epoch normalised error rates (2) for NORB. Unexpectedly, N = 5 fits for W ∈ {64, 96, 108} whereas the smaller images prefer N = 5.

Fig. 9. The final-epoch normalised error rates (2) for EEACL26. For W = 48, two minima exist, so the appropriateness of ConvLs is similar to that in Fig. …

An apparent tendency that can be seen in Figs. 4-9 lies in the risk of CNN training failure when the number of ConvLs is increased. Too primitive architectures (consisting of only two ConvLs) do not work either. However, drawing a distinct conclusion from these polylines is hardly possible. So, further averaging is needed. This will not concern the size W = 108. As the three datasets' sets Q̂(W) are pairwise different (but, perhaps fortunately, not disjoint), the average performance of the three IRPs is to be viewed in the following form (Fig. 10): average (3) is the arithmetic mean of the three datasets' normalised error rates (1), and average (4) is the arithmetic mean of their final-epoch normalised error rates (2). Both are taken for W ∈ {32, 48, 64, 96} over the versions common to the three sets Q̂(W), which is condition (5). For the dataset of the largest image size, formally, the averages (6) and (7) coincide with the NORB error rates (1) and (2) at W = 108 over Q̂(108). Data (6) and (7) being a segment longer than the rest, they are taken back from Figures 5 and 8, respectively.

Fig. 10. The average performance of the three IRPs by (3) and (4), wherein only three common CNN architectures constitute an argument axis for each of the eight polylines. In the vertical direction, there are not more than two points above the same CNN architecture version.

Except for the image size of 48, and for W = 108 (only by final-epoch performance), all of these polylines increase (they are two-segment lines, except for (6), which consists of three segments in Fig. 5). Although Figure 10 only deals with the dimensionality of an IRP, it gives us a straight conclusion that IRPs of a higher dimensionality require more ConvLs. Nevertheless, the appropriate number of ConvLs for such IRPs is not much greater than that for lower dimensionalities: with the image size increased three times (from 32 up to 96), the appropriate N does not change more than from 4 to 6 (if all the polylines are considered). Moreover, considering only the eight polylines in Figure 10, the appropriate N is just 5 for any image size, except for the smallest images, where the appropriate N is 4 (see e.g. [9]).

VII. THE RULE FOR AN APPROPRIATE N

Apparently, as the image size increases, we may need more ConvLs. Then, however, the appropriate N should always be increased only slightly to prevent the risk of CNN training failure. Setting seven ConvLs for the benchmarked datasets has adverse consequences.

How does the number of image categories (classes) influence the appropriateness of N? Table II, which contains the integers N that correspond to the error rate minima (in Figures 4-10), helps us see this. As can be easily seen, the dependence of the appropriate integer N on the number of classes is hardly perceptible. It rather depends on the complexity of the IRP, and the number of classes is only one of the components of that complexity.

TABLE II
THE APPROPRIATE NUMBER OF CONVLS THAT CORRESPONDS TO THE ERROR RATE MINIMA IN FIGURES 4-10

[Table II is arranged by image size W (rows) and by dataset in order of increasing number of classes (NORB, CIFAR-10, EEACL26), with one column for error rate (1) and one for error rate (2) per dataset. The table entries are not recoverable.]
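The trial process behind the rule (start from an initial number of ConvLs and add layers while the error rate keeps dropping) can be sketched as a simple loop. This is a minimal illustration, not the paper's procedure: `error_rate` is a hypothetical callable that would train and validate a CNN with n ConvLs, the cap `n_limit` is an assumption borrowed from the largest N tested here, and the error rates in the example are invented for demonstration only.

```python
def find_appropriate_n(error_rate, n_start=4, n_limit=12):
    """Increase the number of ConvLs until the error rate stops improving.

    error_rate: hypothetical callable, n -> validation error rate of a CNN
                built with n ConvLs.
    n_start:    initial number of ConvLs suggested by the rule.
    n_limit:    upper bound on N (an assumption of this sketch).
    Returns the best (n, error) pair found before the first deterioration.
    """
    best_n = n_start
    best_err = error_rate(n_start)
    for n in range(n_start + 1, n_limit + 1):
        err = error_rate(n)
        if err >= best_err:   # performance starts deteriorating: stop
            break
        best_n, best_err = n, err
    return best_n, best_err


if __name__ == "__main__":
    # Invented error rates, for illustration only.
    toy_rates = {4: 0.30, 5: 0.25, 6: 0.27, 7: 0.26}
    print(find_appropriate_n(toy_rates.__getitem__, n_start=4, n_limit=7))
    # (5, 0.25)
```

Stopping at the first deterioration keeps the number of trained CNNs small, at the price of missing a later minimum when the error rate is not unimodal in N (two minima are reported for one of the polylines in Section VI).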

Hence, the rule for an appropriate N in CNNs is to try fewer ConvLs (an initial number) and then increase the number of ConvLs until the CNN performance starts deteriorating. For small images (like those in CIFAR-10), that initial number is 4. For much more complex IRPs (in particular, ones with a few tens of image categories and more), it is recommended to initially set N = 5. Definitely, the initial number of ConvLs for IRPs with a few thousand image categories is recommended to be set at 6, 7 or 8. Starting with as many as 10 ConvLs is not recommended.

VIII. CONCLUSION

The attempt of finding an appropriate number of ConvLs in CNNs has been based on benchmarks of heterogeneous datasets. The heterogeneousness is principally needed for ensuring applicability of the appropriateness rule. Generally, the rule cannot give an exact number of ConvLs or even a few versions of this number outright. The rule is rather a short process of trying a few versions of N, starting from N = 4 for datasets whose image size is less than 100 and whose number of image categories is a few tens. In other cases, N ∈ {5, 6, 7, 8} at the beginning, where the greater N corresponds to IRPs with a higher degree of complexity [23]. It seems that such fuzziness in the rule is not removable because of the required diversity and heterogeneousness of IRPs.

REFERENCES

[1] H. H. Aghdam and E. J. Heravi, Guide to Convolutional Neural Networks: A Practical Application to Traffic-Sign Detection and Classification. Cham, Switzerland: Springer.
[2] A. Gibson and J. Patterson, Deep Learning: A Practitioner's Approach. O'Reilly Media, 2017.
[3] S. Srinivas, R. K. Sarvadevabhatla, K. R. Mopuri, N. Prabhu, S. S. S. Kruthiventi, and R. V. Babu, "Chapter 2 - An Introduction to Deep Convolutional Neural Nets for Computer Vision," in Deep Learning for Medical Image Analysis, S. K. Zhou, H. Greenspan, and D. Shen, Eds. Academic Press, 2017.
[4] V. Andrearczyk and P. F. Whelan, "Using Filter Banks in Convolutional Neural Networks for Texture Classification," Pattern Recognition Letters, vol. 84, Dec.
[5] Z. Liao and G. Carneiro, "A Deep Convolutional Neural Network Module that Promotes Competition of Multiple-Size Filters," Pattern Recognition, vol. 71.
[6] D. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, "Flexible, High Performance Convolutional Neural Networks for Image Classification," in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, vol. 2, 2011.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification With Deep Convolutional Neural Networks," Communications of the ACM, vol. 60, iss. 6, pp. 84-90.
[8] J. Mutch and D. G. Lowe, "Object Class Recognition and Localization Using Sparse Features With Limited Receptive Fields," International Journal of Computer Vision, vol. 80, iss. 1.
[9] V. V. Romanuke, "Appropriate Number and Allocation of ReLUs in Convolutional Neural Networks," Research Bulletin of the National Technical University of Ukraine "Kyiv Polytechnic Institute", no. 1, pp. 69-78.
[10] P. Date, J. A. Hendler, and C. D. Carothers, "Design Index for Deep Neural Networks," Procedia Computer Science, vol. 88, pp. 131-138.
[11] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," Computer Vision and Pattern Recognition, 2015.
[12] V. V. Romanuke, "Boosting Ensembles of Heavy Two-Layer Perceptrons for Increasing Classification Accuracy in Recognizing Shifted-Turned-Scaled Flat Images With Binary Features," Journal of Information and Organizational Sciences, vol. 39, no. 1, pp. 75-84, 2015.
[13] V. V. Romanuke, "Two-Layer Perceptron for Classifying Flat Scaled-Turned-Shifted Objects by Additional Feature Distortions in Training," Journal of Uncertain Systems, vol. 9, no. 4, 2015.
[14] P. K. Rhee, E. Erdenee, S. D. Kyun, M. U. Ahmed, and S. Jin, "Active and Semi-Supervised Learning for Object Detection With Imperfect Data," Cognitive Systems Research, vol. 45.
[15] P. Tang, H. Wang, and S. Kwong, "G-MS2F: GoogLeNet Based Multi-Stage Feature Fusion of Deep CNN for Scene Recognition," Neurocomputing, vol. 225, pp. 188-197.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper With Convolutions," Computer Vision and Pattern Recognition, 2014.
[17] V. V. Romanuke, "Classifying Scaled-Turned-Shifted Objects With Optimal Pixel-to-Scale-Turn-Shift Standard Deviations Ratio in Training 2-Layer Perceptron on Scaled-Turned-Shifted 4800-Featured Objects Under Normally Distributed Feature Distortion," Electrical, Control and Communication Engineering, vol. 13, iss. 1.
[18] V. V. Romanuke, "Classification Error Percentage Decrement of Two-Layer Perceptron for Classifying Scaled Objects on the Pattern of Monochrome 60-by-80-Images of 26 Alphabet Letters by Training With Pixel-Distorted Scaled Images," Scientific Bulletin of Chernivtsi National University of Yuriy Fedkovych. Series: Computer Systems and Components, vol. 4, iss. 3, 2013.
[19] M. Sun, Z. Song, X. Jiang, J. Pan, and Y. Pang, "Learning Pooling for Convolutional Neural Network," Neurocomputing, vol. 224.
[20] D. Scherer, A. Müller, and S. Behnke, "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition," in International Conference on Artificial Neural Networks (ICANN 2010), pp. 92-101.
[21] S. Lai, L. Jin, and W. Yang, "Toward High-Performance Online HCCR: A CNN Approach With DropDistortion, Path Signature and Spatial Stochastic Max-Pooling," Pattern Recognition Letters, vol. 89.
[22] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks From Overfitting," Journal of Machine Learning Research, vol. 15, 2014.
[23] L. P. F. Garcia, A. C. P. L. F. de Carvalho, and A. C. Lorena, "Effect of Label Noise in the Complexity of Classification Problems," Neurocomputing, vol. 160, pp. 108-119.

Vadim V. Romanuke was born in 1979. He received his higher education in 2001. In 2006, he received the degree of Candidate of Technical Sciences in Mathematical Modelling and Computational Methods. His candidate dissertation suggested a way of increasing the interference noise immunity of data transferred over radio systems. Mr. Romanuke received his degree of Doctor of Technical Sciences in Mathematical Modelling and Computational Methods in 2014. His Doctor-of-Science dissertation solved the problem of increasing the efficiency of the identification of models for multistage technical control and run-in under multivariate uncertainties of their parameters and relationships. In 2016, he received the status of Full Professor.
Mr. Romanuke is a Professor at the Faculty of Navigation and Naval Weapons at the Polish Naval Academy. His research interests concern decision-making, game theory, statistical approximation, and control engineering based on statistical correspondence. Vadim Romanuke has good programming skills in MATLAB. For practical implementations, Mr. Romanuke uses Python. Also, he directs a branch of fitting statistical approximators at the Centre of Parallel Computations managed by Khmelnitskiy National University (Ukraine).
Address for correspondence: 69 Śmidowicza Street, Gdynia, Poland
E-mail: romanukevadimv@gmail.com
