Automatic Identification of Instrument Type in Music Signal using Wavelet and MFCC

Arijit Ghosal (CSE Dept., Institute of Technology and Marine Engg., 24 Parganas (South), West Bengal, India; ghosal.arijit@yahoo.com), Rudrasis Chakraborty (Indian Statistical Institute, Kolkata, India; rudrasischa@gmail.com), Bibhas Chandra Dhara (IT Dept., Jadavpur University, Kolkata, India; bcdhara@gmail.com) and Sanjoy Kumar Saha (CSE Dept., Jadavpur University, Kolkata, India; sks_ju@yahoo.co.in)

Abstract. In this work, we present a simple but novel scheme for automatic identification of the type of instrument present in a music signal. A hierarchical approach has been devised by observing the characteristics of different types of instruments, and suitable features are deployed at each stage. In the first stage, wavelet based features are used to divide the instruments into two groups, which are then classified using MFCC based features at the second stage. RANSAC has been used to classify the data. The proposed system, unlike previous systems, relies on features of very low dimensionality.

Key words: Audio Classification, Instrument Identification, MFCC, Music Retrieval, RANSAC, Wavelet Feature

1 Introduction

An efficient audio classification system can serve as the foundation for various applications like audio indexing, content based audio retrieval and music genre classification. In a music retrieval system, at the first level it is necessary to classify signals as music without voice, i.e. instrumental, and music with voice, i.e. song. A few works [1, 2] have been reported in this direction. At subsequent stages, further sub-classification can be carried out. Automatic recognition of an instrument or its type, like string, woodwind or keyboard, is an important issue in dealing with instrument signals. In several works like [3], isolated musical notes have been considered as input to the system.
But, in the signal arising out of a performance, the notes are not separated [4]. On the other hand, recognition of musical instruments in polyphonic, multi-instrumental music is a difficult challenge, and a successful recognition system for single-instrument music may help in addressing that case [4].
Lecture Notes in Computer Science: Ghosal, Chakraborty, Dhara, Saha

A comprehensive study made by Deng [5] indicates that a wide variety of features and classification schemes have been reported by researchers. Mel Frequency Cepstral Coefficients (MFCC) have been used in different manners in a number of systems. Brown et al. [7] relied on MFCC, spectral centroid and autocorrelation coefficients, and adopted Bayes decision rules for classification. Agostini et al. [8] dealt with timbre classification based on spectral features. A set of 62-dimensional temporal, spectral, harmonic and perceptual features is used by Livshin et al. [4], and k-NN classification is tried for recognition. Kaminskyj et al. [9] initially considered 710 features including MFCC, RMS, spectral centroid and amplitude envelope, and reduced the dimensionality by performing PCA; finally, a k-NN classifier is used. The branch and bound search technique and non-negative matrix factorization have been tried by Benetos et al. [6] for feature selection and classification respectively. Past studies reveal that different schemes have tried various combinations of high-dimensional features and classification techniques. Still, instrument recognition, even for a single-instrument signal, remains an open issue. In this work, we classify instrumental signals based on the instrument type.

The paper is organized as follows. The brief introduction is followed by the description of the proposed methodology in Section 2. Experimental results and concluding remarks are put in Sections 3 and 4 respectively.

2 Proposed Methodology

The proposed scheme deals with recorded signals of a single instrument. A hierarchical framework is presented to classify the signal according to the type of instrument used in generating the music. Instruments are commonly categorized as String (Violin, Guitar etc.), Woodwind (Flute, Saxophone etc.), Percussion (Drum, Tabla etc.)
and Keyboard (Piano, Organ etc.). Sounds produced by different instruments bear different acoustics. The sound envelope produced by a note may reflect the signature of the instrument. The shape of the envelope is determined by the variation in sound level of the note and represents the timbral characteristics. The envelope includes attack, i.e. the time from silence to peak; sustain, i.e. the length of time for which the amplitude level is maintained; and decay, i.e. the time over which the sound fades to silence. As it is difficult to isolate a note in a continuous signal, higher level features are designed that can exploit the underlying characteristics. In our effort, we try to deal with a small number of features and rely on the basic perception of the sound generated by the instruments. As we perceive, sound generated by a string or percussion instrument persists longer, gradually fading away completely, whereas this is not so for a conventional keyboard or woodwind type instrument. This observation has motivated us to classify the signals into two groups at the first stage: the first group consists of keyboard and woodwind, whereas the second group consists of string and percussion. At the subsequent level, we take up the task of classifying the individual groups. In the following subsections we discuss the features and the classification technique that we have used.
2.1 Extraction of Features

At the first level of classification we have opted for features that can reflect the difference in the sound envelopes of the two groups of instruments, as discussed earlier. Basically, the envelope is formed by the variation in amplitude, which has motivated us to look for wavelet based features. The audio signal is decomposed following the Haar wavelet transform [10].

[Fig. 1. Schematic diagram for wavelet decomposition]

As shown in Fig. 1, a signal is first decomposed into low (L1) and high (H1) bands. The low band is successively decomposed, giving rise to L2 and H2, and so on. In general, the high band contains the variation details at each level. Wavelet decomposed signals (after the 3rd level of decomposition) for different types of instruments are shown in Fig. 2. The sustain phase of the audio envelope is mostly reflected in the low band. On the other hand, the amplitude variation during attack and decay has substantial impact on the high bands. A fast attack or decay gives rise to a sharp change in amplitude in the high band, whereas a steady rise or fall is reflected by uniform amplitude in the high bands. As it appears in Fig. 2, the high bands show discriminating characteristics for the two groups of instruments: there is uniform variation of amplitude for the first group, whereas for the second group a noticeable phase of uniform amplitude without much variation is observed.

[Fig. 2. (a) Signals of keyboard, woodwind, string and percussion instruments; (b) the corresponding signals after wavelet decomposition]
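As a concrete illustration, the three-level decomposition of Fig. 1 can be sketched as follows. This is a minimal NumPy version of the orthonormal Haar transform; a library such as PyWavelets provides an equivalent `wavedec`, but the hand-rolled form makes the band structure (H1, H2, H3, L3) explicit. All function names here are ours, for illustration only.

```python
import numpy as np

def haar_step(signal):
    """One level of Haar decomposition: returns (low, high) bands."""
    n = len(signal) - len(signal) % 2            # truncate to even length
    pairs = signal[:n].reshape(-1, 2)
    low = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # approximation (L)
    high = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # detail (H)
    return low, high

def haar_decompose(signal, levels=3):
    """Successively decompose the low band, as in Fig. 1,
    yielding the bands H1, H2, ..., and the final low band."""
    bands = {}
    low = np.asarray(signal, dtype=float)
    for k in range(1, levels + 1):
        low, high = haar_step(low)
        bands["H%d" % k] = high
    bands["L%d" % levels] = low
    return bands
```

Each level halves the number of samples, so for an input of 800 samples the band lengths are H1: 400, H2: 200, H3: 100 and L3: 100.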
Features are computed based on short time energy (STE) for the decomposed signals in the H1, H2, H3 and L3 bands. For each band, the signal is first divided into frames of 400 samples and the STE of each frame is computed. Finally, the average and standard deviation of the STE over all frames in the band are taken, forming an 8-dimensional feature vector.

[Fig. 3. MFCC plots for the instrument signals shown in Fig. 2: (a) keyboard, (b) woodwind, (c) string and (d) percussion]

For the second stage, in order to discriminate the instrument types within the groups, we have considered Mel Frequency Cepstral Coefficients (MFCC) as the features. As the instruments within each group differ in terms of the distribution of spectral power, we have considered 13-dimensional MFCC features. The steps for computing the features are the same as elaborated in [11]. Features are obtained by taking the average of the first 13 coefficients obtained for each frame. The plots of the MFCC coefficients for different signals are shown in Fig. 3. They clearly show that the plots for keyboard and woodwind are quite distinctive, and the same is observed for string and percussion instruments.

2.2 Classification

The variety in the audio database under consideration makes the task of classification critical. The variation even within a class poses a problem for NN based classification. For SVM, the tuning of parameters for optimal performance is very critical. This has motivated us to look for a robust estimator capable of handling the diversity of the data while still modeling it satisfactorily. RANdom SAmple Consensus (RANSAC) appears as a suitable alternative to fulfill this requirement. RANSAC [12] is an iterative method to estimate the parameters of a certain model from a set of data contaminated by a large number of outliers.
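The 8-dimensional first-stage feature (mean and standard deviation of STE per band) can be sketched as follows, assuming the four bands H1, H2, H3 and L3 are already available as arrays and using the frame length of 400 samples stated in the text. The 13-dimensional second-stage vector would analogously average per-frame MFCCs, e.g. as produced by a library such as librosa; we omit that here to keep the sketch dependency-free.

```python
import numpy as np

def short_time_energy(band, frame_len=400):
    """STE of each non-overlapping frame of `frame_len` samples:
    the sum of squared sample values within the frame."""
    n = len(band) - len(band) % frame_len        # drop the ragged tail
    frames = band[:n].reshape(-1, frame_len)
    return np.sum(frames ** 2, axis=1)

def wavelet_ste_feature(bands):
    """8-D first-stage feature: (mean, std) of the STE sequence
    for each of the bands H1, H2, H3 and L3."""
    feat = []
    for name in ("H1", "H2", "H3", "L3"):
        ste = short_time_energy(bands[name])
        feat.extend([ste.mean(), ste.std()])
    return np.array(feat)
```

Since STE values are sums of squares, every component of the resulting vector is non-negative.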
The major strength of RANSAC over other estimators lies in the fact that the estimation is based on the inliers, i.e. the points whose distribution can be explained by a set of model parameters. It can produce a reasonably good model provided the data set contains a sizable proportion of inliers; indeed, RANSAC can work satisfactorily even when outliers amount to 50% of the entire data set [13]. Classically, RANSAC is an estimator of the parameters of a model from a given data set. In this work, the evolved model has been used for classification.
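The paper does not spell out which model RANSAC fits to each class, so the following is only an illustrative sketch of the general idea of classifying with robustly estimated per-class models: each class is summarized by a robust prototype (here, the mean of the largest consensus set), and a test vector is assigned to the class whose prototype is nearest. The centroid model and all names below are our assumptions, not the authors' method.

```python
import numpy as np

def ransac_centroid(X, n_iter=100, sample_size=3, tol=None, seed=0):
    """Robust class prototype via RANSAC: repeatedly fit a candidate
    center from a tiny random sample, count the points it explains
    (inliers within `tol`), and keep the largest consensus set.
    The 'model' is just a centroid -- an illustrative choice."""
    rng = np.random.RandomState(seed)
    if tol is None:                               # crude data-driven tolerance
        tol = np.median(np.linalg.norm(X - X.mean(0), axis=1))
    best_inliers, best_count = X, -1
    for _ in range(n_iter):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        center = X[idx].mean(0)
        d = np.linalg.norm(X - center, axis=1)
        inliers = X[d < tol]
        if len(inliers) > best_count:
            best_count, best_inliers = len(inliers), inliers
    return best_inliers.mean(0)                   # refit on the consensus set

def classify(x, prototypes):
    """Assign x to the class whose RANSAC prototype is nearest."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))
```

Because the prototype is refit only on the consensus set, a handful of grossly atypical training files does not drag the class model toward them, which is the robustness property the text appeals to.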
3 Experimental Result

In order to carry out the experiment, we have prepared a database consisting of 334 instrumental files: 86 files correspond to keyboard instruments like piano and organ; 82 files correspond to woodwind instruments like flute and saxophone; string instruments like guitar, violin and sitar contribute 84 files; and the remaining 82 files represent percussion instruments like drum and tabla. The database thus reflects appreciable variety in each class of instrument. Each file contains around 40-45 seconds of audio, sampled at 22050 Hz with 16-bit mono samples.

Table 1. Classification Accuracy (in %) at First Stage

  Scheme    Keyboard and Woodwind    String and Percussion
  MLP               81.95                    85.94
  SVM               88.40                    85.54
  RANSAC            91.50                    92.67

Table 2. Classification Accuracy (in %) at Second Stage

  Scheme    Keyboard    Woodwind    String    Percussion
  MLP          81.40       76.74     71.43       75.61
  SVM          82.55       79.26     73.80       90.69
  RANSAC       87.21       85.37     84.52       89.02

Tables 1 and 2 show the performance of the proposed scheme at the two stages. We have used 50% of the data of each class as the training set and the remaining data for testing. The experiment is then repeated with the training and test sets swapped, and the average accuracy is shown in the tables. For the MLP, there are 8 and 13 nodes in the input layer at the first and second stage respectively, and the number of output nodes is 2; we have used a single hidden layer with 6 and 8 internal nodes at the first and second stage respectively. For SVM we have used an RBF kernel. The tables clearly show that the performance of RANSAC based classification (with default parameter settings) is better.
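The evaluation protocol described above (train on one half, test on the other, swap the halves, average the two accuracies) can be sketched as below. The nearest-class-mean stand-in classifier is our illustrative assumption, not one of the MLP/SVM/RANSAC classifiers compared in the tables.

```python
import numpy as np

def two_fold_accuracy(X, y, fit, predict, seed=0):
    """Train on one random half, test on the other, swap, average."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    half = len(X) // 2
    folds = (order[:half], order[half:])
    accs = []
    for tr, te in (folds, folds[::-1]):          # original split, then swapped
        model = fit(X[tr], y[tr])
        accs.append(np.mean(predict(model, X[te]) == y[te]))
    return float(np.mean(accs))

# Stand-in classifier: nearest class mean (illustrative only).
def fit_centroids(X, y):
    return {c: X[y == c].mean(0) for c in np.unique(y)}

def predict_centroids(model, X):
    classes = list(model)
    cents = np.stack([model[c] for c in classes])
    d = np.linalg.norm(X[:, None, :] - cents[None], axis=2)
    return np.array(classes)[d.argmin(1)]
```

On well-separated synthetic classes this protocol reports near-perfect accuracy; with the real feature vectors, any of the compared classifiers can be dropped in through the `fit`/`predict` arguments.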
4 Conclusion

We have presented a hierarchical scheme for automatic identification of the instrument type in a music signal. Unlike other systems, the proposed system works with features that are simple and of very low dimensionality. Wavelet based features categorize the instruments into two groups and, finally, MFCC based features classify the individual instrument classes within each group. RANSAC has been utilized as a classification tool and is quite robust in handling the variety of the data. Experimental results also indicate the effectiveness of this simple but novel scheme.

Acknowledgment

The work is partially supported by the facilities created under the DST-PURSE program in the Computer Science and Engineering Department of Jadavpur University.

References

1. Zhang, T., Kuo, C.C.J.: Content-based Audio Classification and Retrieval for Audiovisual Data Parsing. Kluwer Academic (2001)
2. Ghosal, A., Chakraborty, R., Dhara, B.C., Saha, S.K.: Instrumental/song classification of music signal using RANSAC. In: 3rd Intl. Conf. on Electronic Computer Technology, India, IEEE CS Press (2011)
3. Herrera, P., Peeters, G., Dubnov, S.: Automatic classification of musical instrument sounds. Journal of New Music Research (2000)
4. Livshin, A.A., Rodet, X.: Musical instrument identification in continuous recordings. In: Intl. Conf. on Digital Audio Effects (2004) 222-226
5. Deng, J.D., Simmermacher, C., Cranefield, S.: A study on feature analysis for musical instrument classification. IEEE Trans. on Systems, Man and Cybernetics, Part B 38 (2008) 429-438
6. Benetos, E., Kotti, M., Kotropoulos, C.: Musical instrument classification using non-negative matrix factorization algorithms and subset feature selection. In: ICASSP (2006)
7. Brown, J.C., Houix, O., McAdams, S.: Feature dependence in the automatic identification of musical woodwind instruments. Journal of the Acoustical Society of America 109 (2001) 1064-1072
8.
Agostini, G., Longari, M., Pollastri, E.: Musical instrument timbres classification with spectral features. EURASIP Journal on Applied Signal Processing (2003) 5-14
9. Kaminskyj, I., Czaszejko, T.: Automatic recognition of isolated monophonic musical instrument sounds using kNNC. J. Intell. Inf. Syst. 24 (2005) 199-221
10. Gonzalez, R.C., Woods, R.E.: Digital Image Processing (3rd Edition). Prentice-Hall Inc., NJ, USA (2006)
11. Rabiner, L.R., Juang, B.H.: Fundamentals of Speech Recognition. Prentice-Hall (1993)
12. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (1981) 381-395
13. Zuliani, M., Kenney, C.S., Manjunath, B.S.: The multiRANSAC algorithm and its application to detect planar homographies. In: IEEE Intl. Conf. on Image Processing (2005)