The Comparison of Selected Audio Features and Classification Techniques in the Task of the Musical Instrument Recognition


POSTER 2016, PRAGUE MAY 24

The Comparison of Selected Audio Features and Classification Techniques in the Task of the Musical Instrument Recognition

Miroslav MALÍK, Richard ORJEŠEK

Dept. of Telecommunications and Multimedia, University of Žilina, Univerzitná 1, 010 26 Žilina, Slovakia

miroslav.malik@fel.uniza.sk, richard.orjesek@fel.uniza.sk

Abstract. We present a comparative evaluation of the classification of 13 types of European orchestral musical instruments by the classification methods k-Nearest Neighbors, Gaussian mixture model, artificial neural network and its improvement, the Dropout ANN. The main objective was to investigate the recognition capabilities of these methods with an application of several audio features, namely MFCC, LPC, LSP and derived features, which have been tested independently and in pairwise combinations of the best-performing features. Using the mentioned features, the best success rate of 92% has been achieved.

Keywords: Recognition, classification, musical instruments, kNN, GMM, ANN, audio features.

1. Introduction

Humankind has perhaps been accompanied by music throughout its history. This miscellaneous sequence of tones with various rhythm, tempo and style can abstractly express the author's feelings. Music also involves characteristic cultural and local elements. Musical instruments are varied in a similar way; they play a big part in musical creation, where a significant fraction of works could not exist without them. Nowadays, a huge amount of digital musical records is available and the amount still increases, so automatic musical instrument recognition is one of the tools which can help with data search. Further possibilities of application are in the automatic annotation of audio content, structural coding, or software for musicians. In the sixties of the last century, scientists began to analyze the audio properties of musical instruments.
The first attempts at musical instrument recognition showed up in the nineties, mainly due to the wider availability and development of information technologies. These early systems were capable of recognizing only a small number of musical instruments, often with a limited tonal range. Later, K. Martin and Y. Kim created a system working with isolated tones of 14 instruments over their full tonal range. The k-Nearest Neighbors algorithm proved itself the best classification technique, accompanied by Fisher's discriminant analysis for data reduction and a hierarchical classification architecture, which selects the parent class first and then continues to select a particular instrument. The dataset was split with a ratio of 70/30 into training/testing samples. For 5 instrument classes, the system reached an accuracy of 93%, and 72% accuracy for particular instruments [2]. The other major research in the recognition of musical instruments and audio features belongs to A. Eronen [3]. His system worked with 30 orchestral instruments and reached an accuracy of 94% at the instrument-class level. However, in the commercial area, musical instrument recognition remains behind its affiliated sector, speech recognition. This situation may be caused by the fact that speech contains much valuable information, often more simply describable than music.

2. Audio features

To analyze an audio record from the viewpoint of its content, it is necessary to describe the audio signal by a certain group of parameters which represent its properties sufficiently precisely. These parameters are called audio features and, in general, we can divide them into 3 basic groups:

- temporal
- spectral
- statistical

From the input audio signal samples, we can directly obtain temporal features, such as coefficients of FIR, IIR or LPC filters, or the zero-crossing rate.
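As a minimal illustration of a temporal feature, the zero-crossing rate can be computed directly from the signal samples. This is a sketch under my own assumptions (numpy, an arbitrary sine test signal), not the paper's exact extractor:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    # treat exact zeros as positive so they do not create spurious crossings
    signs[signs == 0] = 1
    return np.mean(signs[:-1] != signs[1:])

# a 100 Hz sine sampled at 8 kHz crosses zero about 200 times per second
t = np.arange(8000) / 8000.0
zcr = zero_crossing_rate(np.sin(2 * np.pi * 100 * t))
```

The rate is already normalized by the frame length, so it is comparable across frames of different sizes.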
To obtain spectral information, we must transform the audio signal by one of many transformations, most often the Fourier transform, the discrete cosine transform or the wavelet transform. In this manner we can acquire spectral features, such as FFT coefficients, cepstral coefficients or the spectral centroid. The mean value of the signal energy, the skewness coefficient and the spectral slope can be mentioned as typical representatives of statistical parameters, which describe the signal properties in terms of statistics. Nowadays, a big number of audio features exists, and the classification score depends on the discrimination capability of the selected audio features. For our experiment, the features mentioned below have been selected. The parameterization of the audio recordings has been performed using the freely available tool openSMILE [5], applied on frames of 30 ms length with half overlap, using the Hamming window function.

2.1 MFCC

The biggest employment in the field of audio processing surely belongs to the Mel frequency cepstral coefficients. The main area of usage of the MFCC is speech processing (speech and speaker recognition, authentication and verification of the speaker, etc.), but the MFCC are also widely used in the semantic analysis of audio content, such as recognition of common sound events, musical genre recognition and plenty of other applications. The algorithm of the MFCC consists of the following steps. The input signal is at first filtered using a pre-emphasis filter, which corrects possibly weakened higher frequencies in the signal caused by the signal path. Then the signal is transformed into the spectral domain, typically by the FFT, and the transformed signal enters a filter bank with the Mel frequency division. Using the logarithm operation, the signal of each Mel frequency band is non-linearly transformed to decrease the dynamic range of values, which reduces the sensitivity of the frequency estimations. Finally, the MFCC are obtained by an inverse transformation; the discrete cosine transform can be used as well as the inverse Fourier transform.
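The pipeline just described (pre-emphasis, FFT, Mel filter bank, logarithm, DCT) can be sketched as follows. The frame length, filter count and coefficient count are illustrative defaults of my own, not the exact openSMILE configuration used in the paper:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=44100, n_filt=26, n_ceps=13):
    # 1) pre-emphasis boosts the higher frequencies
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    # 2) window and transform to the spectral domain (power spectrum)
    spectrum = np.abs(np.fft.rfft(emphasized * np.hamming(len(frame)))) ** 2
    # 3) triangular filter bank spaced linearly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filt + 2)
    bins = np.floor((len(frame) + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, len(spectrum)))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 4) logarithm compresses the dynamic range, 5) DCT decorrelates the bands
    log_energies = np.log(fbank @ spectrum + 1e-10)
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    return dct @ log_energies

coeffs = mfcc(np.random.randn(1323))  # one 30 ms frame at 44.1 kHz
```

A production extractor would add liftering and energy normalization; the sketch keeps only the five core steps named above.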
To the MFCC are also added the dynamical features of the signal describing temporal changes of the spectrum, which are very important in the human perception of sound: the Δ coefficients and the ΔΔ coefficients.

2.2 LPC

Another method originally developed for speech processing applications that accomplishes very good results also in musical instrument recognition is the method of linear prediction coefficients. The LPC are obtained using linear prediction, which is based on the simple assumption that the n-th sample of the audio signal can be replaced by a linear combination of the Q previous samples:

s(n) = \sum_{i=1}^{Q} a_i s(n - i).   (1)

For the calculation of the prediction coefficients, the autocorrelation method and the Levinson-Durbin algorithm are used most frequently. With a sufficiently high order of the predictor, it is possible to describe the spectral envelope of the audio signal very accurately. Over the years, many modifications of linear prediction which take into account the human perception of sound have been successively proposed. For this purpose, new features were designed, such as the perceptual linear prediction coefficients (PLP), the linear prediction cepstral coefficients (LPCC) or the perceptual linear prediction cepstral coefficients (PLPCC). The PLP are derived from the LPC by an application of the Mel-scaled filter bank, similarly to the MFCC. By the cepstral transformation, as in the case of the MFCC, the LPCC are obtained from the LPC. Using the combination of the Mel-scaled filter bank together with the cepstral analysis applied on the LPC, we obtain the PLPCC. Another derived group of features, typically used in speech coding, are the LSP coefficients (Line Spectral Pairs), which are gained by decomposing the predictor into a symmetric and an antisymmetric part signifying the zeros and poles of the LP filter.

2.3 Formants

A formant is a concentration of acoustic energy around a particular frequency in the audio wave.
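The prediction coefficients of the linear-prediction relation above can be obtained from the autocorrelation sequence with the Levinson-Durbin recursion, as in this sketch (the test signal and predictor order are my own illustrations; note the sign convention of the polynomial form, commented below):

```python
import numpy as np

def lpc(signal, order):
    """LP coefficients via the autocorrelation method and Levinson-Durbin."""
    n = len(signal)
    # biased autocorrelation r[0..order]
    r = np.array([signal[:n - k] @ signal[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]  # prediction error energy
    for i in range(1, order + 1):
        # reflection coefficient from the current residual
        acc = r[i] + a[1:i] @ r[1:i][::-1]
        k = -acc / err
        a[1:i] = a[1:i] + k * a[1:i][::-1]
        a[i] = k
        err *= 1.0 - k * k
    # polynomial form A(z) = 1 + sum a[j] z^-j; the predictor coefficients
    # a_i of the relation s(n) = sum a_i s(n-i) are -a[1:] in this convention
    return a
```

Fitting a second-order predictor to a synthetic second-order autoregressive signal recovers the generating coefficients, which is a convenient sanity check.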
In connection with the speech signal, formants represent an indication of the resonant frequencies of the vocal tract model; a similar analogy can be applied to musical instruments too. With formants, the spectral envelope can be described similarly as with the LPC.

2.4 Spectral coefficients

As mentioned above, by application of a suitable spectral transformation, the original temporal signal can be transferred into the spectral domain. In our tests, the FFT coefficients have been used, filtered in octave bands and supplemented by the following coefficients derived from the FFT:

- Spectral centroid - indicates the region with the biggest density of the frequency representation in the audio signal, often called "brightness".
- Spectral flux - reports temporal changes in the signal spectra.
- Spectral kurtosis - describes the sharpness of the signal spectra.
- Spectral roll-off - determines a threshold below which the biggest part of the signal energy resides.
- Spectral skewness - a statistical measure of the asymmetry of the probability distribution of the audio signal spectrum.
- Spectral slope - characterizes the loss of the signal's energy at higher frequencies.

- Spectral spread - represents the instantaneous bandwidth around the signal's spectral centroid.

3. Classification

There are several classification techniques used for the classification of audio content. They are based on comparing the similarities between unknown input audio files and known sounds. In the past, sound processing used an intuitive comparison of feature vector patterns. Current studies of acoustical properties favor statistical models, because they provide more flexible probabilistic results. The most common methods of classification are based on the Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), k-Nearest Neighbors (kNN), Artificial Neural Networks (ANNs), Vector Quantization (VQ) and Support Vector Machines (SVM).

3.1 Artificial Neural Networks

The main concept of artificial neural networks has been inspired by biological neural networks. An ANN is built from mathematical neurons, primitive units, where each unit processes weighted input signals and generates an output. The neural network is a topological arrangement of individual neurons into a structure communicating through oriented weighted interconnections. Each artificial neural network is defined by the type of neurons, the topological arrangement and the strategy of adaptation in the training procedure. The basic ideas of the concept and arrangement of a neural network are shown on the most used feed-forward neural network in Fig. 1. A schematic model of a single neuron is shown in Fig. 2. We can split the neuron into several parts: the synapses rated with weights w, which feed the input x into the neuron; the body, where the inner potential of the neuron z is obtained; the block with a transfer function f; and finally, the output of the neuron y. Into the body of the neuron also enters a value b, called the bias.

Fig. 2. Schematic model of a single neuron.
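The single-neuron model of Fig. 2 (weighted inputs, bias, transfer function) can be sketched as follows; the logistic sigmoid is an assumed choice of transfer function, and the layer helper simply repeats the same computation with a weight matrix:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(w, x, b, f=sigmoid):
    # inner potential: weighted sum of the inputs plus the bias
    z = np.dot(w, x) + b
    # the transfer function produces the neuron output
    return f(z)

def layer_output(W, x, b, f=sigmoid):
    """A whole layer: one row of W per neuron, applied to the same input x."""
    return f(W @ x + b)
```

Stacking `layer_output` calls, each feeding the next, yields the feed-forward network of Fig. 1.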
The inner potential and the output signal of the simple neuron can be computed by the following equations:

z = \sum_{i=1}^{n} w_i x_i + b,   (2)

y = f(z),   (3)

where n is the number of inputs. Multilayer neural networks have, besides the input and the output layer, at least one hidden layer. The number of neurons in the hidden layers can differ and is selected according to the character of the task to be solved. The input and output layers are defined in the same way as for a single-layer neural network, but the output is influenced by the hidden layers:

y_j^k = f\left( \sum_{i=0}^{n} w_{ji}^k x_i + b_j^k \right),   (4)

where n represents not only the dimension of the input vector but, in general, for the k-th layer it is defined as the number of neurons in the previous (k-1)-th layer.

Fig. 1. Multilayer feed-forward neural network.

Fig. 1 shows that the individual neurons are arranged into several layers. Neurons of the same layer are not connected to each other, but they are usually fully connected to all neurons of the neighboring layers. The connections between neurons represent paths for the signal's propagation. They are oriented and rated with weights, which modify the intensity of the passing signal. A zero weight represents a connection that does not exist. The first layer of the neural network is called the input layer and the last layer is called the output layer; the other layers of the ANN are called hidden layers.

3.2 Neural network training

The training process of a neural network represents the adaptation of the network's parameters and synaptic weights so as to minimize an error function. This error function constitutes the difference between the output of the ANN and the expected output. The basic algorithm used in neural network learning is back propagation (BP). The term "back" means that, compared to the input signal, which moves through the ANN from the input layer to the output layer, the error moves through the ANN in the opposite direction, and can be computed as:

\bar{e} = \bar{y}_d - \bar{y},   (5)

where \bar{e} represents the error vector and \bar{y}_d the desired output. The squared error for the ANN is obtained by:

\varepsilon = \sum_{j=1}^{N} e_j^2.   (6)

This squared error computation is based on a single input. It would be optimal if we were able to express the mean squared error over all possible inputs, but we do not have these inputs; we have only the training set of training pairs. In this case we can express the mean squared error as:

\bar{\varepsilon} = E\{\varepsilon\}   (7)

over all ε. By minimizing this function, for example with the gradient descent method, we would obtain the best possible classification in the sense of the mean squared error for the training set. The mean squared error is a function of the weights w, and the minimization process can be realized via adaptation of these weights. For the weight modifications we can write:

w(t+1) = w(t) - \mu \nabla \bar{\varepsilon},   (8)

where μ represents the learning rate. The gradient \nabla \bar{\varepsilon} can be written as:

\nabla \bar{\varepsilon} = \nabla_{w(t)} \bar{\varepsilon}.   (9)

The exact calculation of the gradient is not practical due to the large number of elements in the training set and the large number of network weights; unfortunately, it is very difficult even for less extensive networks. Therefore, it is replaced by the calculation of a sequence of partial gradients, where each partial gradient, obtained in one learning step, approximates the value of the gradient over the whole training set:

\nabla_{w(t)} \bar{\varepsilon} \approx \nabla_{w(t)} \varepsilon(t).   (10)

One of the complications of the BP algorithm is that it can be trapped in a local minimum. Some improvements of the BP have been developed that help to avoid this issue. One modification of the BP algorithm is the implementation of a parameter η into the adaptation process, where the parameter η is initially chosen close to 1 and then decreases towards 0 during the adaptation process. If the error rate increases, it is recommended to modify the parameter η:

w(t+1) = w(t) + (1-\eta) \Delta w(t) + \eta \Delta w(t-1).   (11)

For η = 0, the algorithm is the same as the basic BP algorithm. To prevent overfitting, a method called dropout, described in [9], can be used.
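The weight-update rules above can be illustrated on a toy problem: one linear neuron fitted by gradient descent on the mean squared error, with the η-blended update applied at each step. The task, rates and iteration count are arbitrary choices of my own, not the paper's training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy task: a single linear neuron y = w*x fitted to targets y_d = 2*x
x = rng.uniform(-1, 1, 200)
y_d = 2.0 * x

w, mu, eta = 0.0, 0.1, 0.5
dw_prev = 0.0
for _ in range(100):
    y = w * x                                 # network output
    e = y_d - y                               # error vector
    grad = -2.0 * np.mean(e * x)              # gradient of the mean squared error
    dw = -mu * grad                           # plain gradient-descent step
    w = w + (1.0 - eta) * dw + eta * dw_prev  # blend with the previous step
    dw_prev = dw
```

After the loop, `w` settles near the true slope 2.0; the blending term smooths the step sequence, which is the intuition behind the η modification described above.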
3.3 Gaussian mixture model

The density in component models is expressed as a linear combination of density functions. A model with M components can be written in the form:

p(x) = \sum_{j=1}^{M} P(j)\, p(x|j),   (12)

where the parameters P(j) are called the mixture coefficients and represent the probabilities of the j-th component, and p(x|j) represents the density function of the j-th component. The condition is that each density function must not be negative and should integrate to 1 over its entire domain. The limitation of the component coefficients,

\sum_{j=1}^{M} P(j) = 1,   (13)

0 \le P(j) \le 1,   (14)

ensures that the model represents a probability function. The component model is generative and can be used as a process of generating samples from the represented density: one of the j components is randomly selected with probability P(j), and the rest depends on the chosen form of the density components. To maximize the likelihood of the data under the GMM, the Expectation Maximization (EM) algorithm is used. EM recasts the problem as the equivalent minimization of the negative log-likelihood of the data set using the relation:

E = -\mathcal{L}(\Theta) = -\sum_{n=1}^{N} \log p(x_n),   (15)

which is regarded as an error function [6]; Θ represents the set of parameters of the GMM which needs to be found. After a proper training of the model, the GMM becomes a very efficient and fast tool for classification, which is computationally inexpensive. A disadvantage may be the absence of higher-order signal information.

3.4 k-Nearest Neighbors

The main principle of the kNN algorithm is very simple and is based on a comparison of data distances in the selected feature space: more similar data lie closer together than less similar data. The input data of the algorithm are divided into training and test data. The training data are data sets divided into groups, where each group characterizes one class.
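The neighbor-voting principle just introduced — assign a test vector to the class most common among its k closest training vectors — can be sketched as follows. Euclidean distance, the helper name and the toy two-class data are assumptions of this illustration:

```python
import numpy as np
from collections import Counter

def knn_classify(test_vec, train_vecs, train_labels, k=3):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    dists = np.linalg.norm(train_vecs - test_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy 2-D feature space with two instrument classes
train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
labels = ["violin", "violin", "flute", "flute"]
```

Note that all training vectors take part in every query, which is exactly the memory and lookup cost discussed below.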
The kNN searches, for each element of the test data, a certain distance neighborhood containing k training elements (the reference data), and on the basis of certain criteria, most often the majority rule, the algorithm assigns the tested element to one of the classes. Despite the simplicity of the algorithm, this method gives good results and is also used as a verification method. Additional advantages of the kNN include ease of implementation and flexibility. The main disadvantage is considered to be the fact that the training data must be stored in memory and that the kNN does not create a compact model from the training data; this saves some training time, but the comparison itself is time-dependent on the size of the database of training elements [6].

4. Dataset

The dataset of audio recordings of musical instruments used for our experiments is based primarily on the database of the Electronic Music Studios at the University of Iowa. It contains recordings of particular string and woodwind musical instruments that play individual notes of the chromatic scale across the full range of the instrument, including various playing techniques typical for some instruments, for example arco, pizzicato, or vibrato. These audio recordings also involve various dynamics of the played tone (pp, mf and ff), and thus the sound samples represent the entire dynamic structure of the selected musical instruments. The audio samples were recorded mostly by the condenser microphone Neumann KM 84 with cardioid characteristics, in the anechoic chamber of the Wendell Johnson Speech and Hearing Centre. The sampling frequency is 44.1 kHz and the bit depth is 16 bits. Overall, 13 western orchestral musical instrument classes have been used for training and testing, with their durations, the number of clips and the number of classes listed in Tab. 1.

Tab. 1. Composition of the musical instruments dataset.

5. Results

The entire database was divided into two parts, similarly as in [2], in the ratio of 70 to 30 %, where the bigger part was used as the training data and the smaller part for testing. We tested all the audio features mentioned above independently, and in the combinations of the MFCC with the LPC and the LPC with the LSP, with four classification algorithms: the kNN, the GMM, a simple multilayer ANN and the Dropout ANN. The solo features PLP, PLPCC, formants and spectral coefficients did not reach good results, and neither did the combination of the MFCC with the LPC, so only the remaining audio features and the combination of the LPC with the LSP are listed in Tab. 2. Informational results of the absent features for the kNN and the GMM can be found in [1].

Tab. 2. The classification results.

The success rate was determined by the F-measure. The F-measure (also F-score or F1-score) is a statistical measure of a test's precision frequently used, for example, in data retrieval or machine learning. It is defined by the equation:

F = \frac{2 \cdot P \cdot R}{P + R},   (16)

where P (Precision) is the number of correct positive results divided by the number of all positive results, R (Recall) represents the number of correct positive results divided by the number of positive results that should have been returned, and F is then a weighted average of P and R.

6. Conclusion

The aim of the experiment was to put to the test the ability to recognize selected orchestral musical instruments using four classification methods: the kNN, the GMM, the ANN and the Dropout ANN. These methods achieved the best final score for the combination of the audio features LPC and LSP. These features outperformed the MFCC coefficients, the most commonly used in practice, and therefore they appear to be suitable for use in the processing not only of speech but also of musical instruments. The ANN and its improvement, the Dropout ANN, proved to be the best methods for musical instrument recognition. The ANNs also scored the best results in the time demands of the training procedure. In the future, the recognition model could be extended by a hierarchical distribution of musical instruments, and other improved ANN algorithms could also be employed.

References

[1] MALÍK, M., CHMULÍK, M., TICHÁ, D. Musical Instrument Recognition Using Selected Audio Features. In Proc. of the 11th Conf. Transcom. Žilina (Slovakia), 2015, p. 33-38.

[2] MARTIN, K. D., KIM, Y. E. Musical Instrument Identification: A Pattern-Recognition Approach. In Proc. of the 136th Meeting of the Acoustical Society of America. USA, 1998.

[3] ERONEN, A., KLAPURI, A. Musical Instrument Recognition Using Cepstral Coefficients and Temporal Features. In Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing. Istanbul (Turkey), 2000, p. 753-756.

[4] ERONEN, A. Comparison of Features for Musical Instrument Recognition. In Proc. of the Int. Workshop on Applications of Signal Processing to Audio and Acoustics. NY (USA), 2001, p. 19-22.

[5] openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor. Online: http://www.audeering.com/research/opensmile

[6] THEODORIDIS, S., KOUTROUMBAS, K. Pattern Recognition. 2nd ed. Elsevier Academic Press, 2003.

[7] BENADE, A. H. Fundamentals of Musical Acoustics. 2nd ed. Dover, 1990.

[8] KOSTEK, B. Perception-Based Data Processing in Acoustics: Applications to Music Information Retrieval and Psychophysiology of Hearing. Springer, 2005.

[9] SRIVASTAVA, N., HINTON, G., KRIZHEVSKY, A. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. In Journal of Machine Learning Research, 2014, vol. 15, p. 1929-1958.

About Authors...

Miroslav MALÍK was born in Žilina, Slovakia, in 1990. He finished his MSc. degree at the University of Žilina, Faculty of Electrical Engineering, Department of Telecommunications and Multimedia, in 2014. Currently he is a doctoral student at the above-mentioned department. His research area involves acoustics, audio features, and machine learning techniques for musical instrument recognition and music emotion detection.

Richard ORJEŠEK was born in Banská Bystrica, Slovakia, in 1991. He finished his MSc. degree at the University of Žilina, Faculty of Electrical Engineering, Department of Telecommunications and Multimedia, in 2015. Currently he is a doctoral student at the above-mentioned department and a 2nd MSc. student at the Faculty of Management Science and Informatics. His research area includes machine learning techniques for audio and image processing purposes.