AUDIO CLASSIFICATION USING SEMANTIC TRANSFORMATION AND CLASSIFIER ENSEMBLE


6th International WOCMAT & New Media Conference 2010, YZU, Taoyuan, Taiwan, November 2-3, 2010

AUDIO CLASSIFICATION USING SEMANTIC TRANSFORMATION AND CLASSIFIER ENSEMBLE

Ju-Chiang Wang, Hung-Yi Lo, Shyh-Kang Jeng, and Hsin-Min Wang
Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan
Institute of Information Science, Academia Sinica, Taipei, Taiwan
E-mail: asriver@iis.sinica.edu.tw, hungyi@iis.sinica.edu.tw, skjeng@cc.ee.ntu.edu.tw, whm@iis.sinica.edu.tw

ABSTRACT

This paper presents our winning audio classification system in MIREX 2010. Our system is implemented as follows. First, in the training phase, frame-based 70-dimensional feature vectors are extracted from each training audio clip by MIRToolbox. Next, the Posterior Weighted Bernoulli Mixture Model (PWBMM) is applied to transform the frame-decomposed feature vectors of a training song into a fixed-dimensional semantic vector representation based on pre-defined music tags; this procedure is called Semantic Transformation. Finally, for each class, the semantic vectors of the associated training clips are used to train an ensemble classifier consisting of SVM and AdaBoost classifiers. In the classification phase, a testing audio clip is first represented by a semantic vector, and then the class with the highest score is selected as the final output. Our system was ranked first out of 36 submissions in the MIREX 2010 audio mood classification task.

1. INTRODUCTION

Automatic music classification is a very important topic in the music information retrieval (MIR) field. It was first addressed by Tzanetakis et al., who worked on automatic musical genre classification of audio signals in 2001 [1]. After ten years of development, many kinds of audio classification datasets have been created, with category definitions and class labels corresponding to sets of audio examples. In addition, many approaches have been proposed for classifying music data according to genre [1, 2], mood [3, 4], or artist [5, 6]. The Music Information Retrieval Evaluation eXchange (MIREX), an annual MIR algorithm competition held jointly with ISMIR, began evaluating audio classification tasks in 2005.

In the audio classification field, a fixed number of categories or classes is usually pre-defined by experts for a given application task. In general, these categories or classes should be definite and as mutually exclusive as possible. However, when most people listen to a song they have never heard before, they usually form certain musical impressions in their minds, even though they may not be able to name the exact musical category of the song. These musical impressions, inspired by direct auditory cues, can be described by general words such as exciting, noisy, fast, male vocal, drum, and guitar. We believe that the co-occurrence of such musical impressions or concepts may indicate the membership of a song in a specific audio class. Therefore, in this study, we explore the relationship between general tag words and specific categories. Since people tend to mentally tag a piece of music with specific words when they listen to it, music tags are a natural way to describe general musical concepts. The tags can cover different types of musical information, such as genre, mood, and instrumentation. Therefore, we believe that knowledge of pre-generated music tags in one music dataset can help the classification of another music dataset.
In other words, we can first train a music tagging system to recognize the musical concepts of a song in terms of semantic tags, and the music classification system can then classify the song into specific classes based on this semantic representation.

Figure 1 shows an overview of our music classification system. There are two layers in our system, i.e., semantic transformation (ST) and ensemble classification (EC). In the training phase of the ST layer, we first extract audio features reflecting various types of musical characteristics, including dynamics, spectral, timbre, and tonal features, from the training audio clips. Next, we apply the Posterior Weighted Bernoulli Mixture Model (PWBMM) [7] to automatically tag the clips. The PWBMM performed very well in terms of the tag-based area under the receiver operating characteristic curve (AUC-ROC) in the MIREX 2010 audio tag classification task [8]. The AUC-ROC of the tag affinity output is an important way to evaluate whether the tagging predictions tend in the correct direction; therefore, we have reasonable confidence in applying the PWBMM in the music tagging step of our system. The PWBMM is trained on the MajorMiner dataset crawled from the website of the MajorMiner music tagging game (http://majorminer.org/). The dataset contains 2,472 ten-second audio clips and their associated tags. As shown in Table 1, we select 45 tags to define the semantic space. In other words, a song is transformed by ST into a 45-dimensional semantic vector over the pre-defined tags based on the tagging procedure.

In the MajorMiner dataset, each tag given to a music clip is associated with a count of how many times it was applied. These counts are also modeled by the PWBMM and have been shown to improve the performance of music tag annotation [7].

In the training phase of the EC layer, for each class, the associated training audio clips, each represented by a 45-dimensional semantic vector, are used to train an ensemble classifier consisting of support vector machine (SVM) and AdaBoost classifiers. In the final classification phase, a testing audio clip is assigned to the class with the highest output score.

Figure 1. The flowchart of our audio classification system: audio clips are converted by MIRToolbox 1.3 into 70-dimensional frame-based feature vectors, transformed by the PWBMM (fitted by maximum likelihood on the MajorMiner dataset of 2,472 ten-second clips with 45 pre-defined tags) into a semantic representation over tags T1-T45, and finally scored for each class by a calibrated probability ensemble of SVM and AdaBoost classifiers.

The remainder of this paper is organized as follows. In Section 2, we describe the music features used in this work. In Section 3, we present how to apply the PWBMM for music semantic representation, and in Section 4, we present our ensemble classification method. We introduce the MIREX 2010 audio train/test: mood classification task and discuss the results in Section 5. Finally, the conclusion is given in Section 6.

Table 1. The 45 tags used in our music classification system.
metal, instrumental, horns, piano, guitar, ambient, saxophone, house, loud, bass, fast, keyboard, rock, noise, british, solo, electronica, beat, 80s, dance, strings, drum machine, jazz, pop, r&b, female, electronic, voice, rap, male, trumpet, distortion, quiet, techno, drum, funk, acoustic, vocal, organ, soft, country, hip hop, synth, slow, punk

2. MUSIC FEATURE EXTRACTION

We use MIRToolbox 1.3 (http://www.jyu.fi/music/coe/materials/mirtoolbox) for music feature extraction [9]. As shown in Table 2, four types of features are used in our system: dynamics, spectral, timbre, and tonal features. To ensure the alignment of the different features within a vector and prevent any mismatch, all features are extracted from the same fixed-size short-time frame. Given a song, a sequence of 70-dimensional feature vectors is extracted with a 50 ms frame size and a 0.5 hop shift (i.e., 50% overlap between adjacent frames).

Table 2. The music features used in the 70-dimensional frame-based vector.
Type       Features                                                         Dim
dynamics   rms                                                               1
spectral   centroid, spread, skewness, kurtosis, entropy, flatness,
           rolloff at 85%, rolloff at 95%, brightness, roughness,
           irregularity, zero crossing rate, spectral flux                  13
timbre     MFCC (13), delta MFCC (13), delta-delta MFCC (13)                39
tonal      key clarity, key mode possibility, HCDF, chroma peak,
           chroma centroid, chroma (12)                                     17
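The frame-based analysis described above can be approximated in Python. The following is a minimal sketch using librosa as a stand-in for MIRToolbox; the file path is a placeholder and only a subset of the feature families in Table 2 is computed, so it does not reproduce the exact 70-dimensional vector used in our system.

```python
import numpy as np
import librosa

# Load a clip and set up ~50 ms frames with a 0.5 hop shift, mirroring Section 2.
y, sr = librosa.load("clip.wav", sr=22050)       # placeholder path
frame = int(0.050 * sr)                           # ~50 ms -> 1102 samples at 22.05 kHz
hop = frame // 2                                  # 0.5 hop shift

# A few of the feature families listed in Table 2 (not the full 70 dimensions).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=frame, hop_length=hop)
d_mfcc = librosa.feature.delta(mfcc)
dd_mfcc = librosa.feature.delta(mfcc, order=2)
rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame, hop_length=hop)
zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)
chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=frame, hop_length=hop)

# Stack into one (n_frames, n_features) matrix of frame-based feature vectors.
X = np.vstack([mfcc, d_mfcc, dd_mfcc, rms, centroid, zcr, chroma]).T
print(X.shape)
```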

3. POSTERIOR WEIGHTED BERNOULLI MIXTURE MODEL

The PWBMM-based music tagging system consists of two steps. First, it converts the frame-based feature vectors of a song into a fixed-dimensional vector (a Gaussian Mixture Model (GMM) posterior representation). Then, a Bernoulli Mixture Model (BMM) [10] predicts the scores over the 45 music tags for the song.

3.1. GMM Posterior Representation

Before training the GMM, the feature vectors from all training audio clips are normalized to have a mean of 0 and a standard deviation of 1 in each dimension. Then, the GMM is fitted using the expectation-maximization (EM) algorithm. The generation of the GMM posterior representation can be viewed as a process of soft tokenization against a music background model. We regard a latent music class as a latent variable z_k, k ∈ {1, 2, ..., K}, corresponding to the k-th Gaussian component with mixture weight \pi_k, mean vector \mu_k, and covariance matrix \Sigma_k in the GMM. With the GMM, we can describe how likely a given feature vector x belongs to a latent music class by the posterior probability of that latent music class:

P(z_k \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{i=1}^{K} \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)} .   (1)

Given a song s, by assuming that each frame contributes equally to the song, the posterior probability of a certain latent music class can be computed by

P(z_k \mid s) = \frac{1}{N_s} \sum_{n=1}^{N_s} P(z_k \mid x_n) ,   (2)

where x_n is the feature vector of the n-th frame of song s and N_s is the number of frames in song s.

3.2. Bernoulli Mixture Model

Assume that we have a training music corpus with J audio clips, each denoted as s_j, j = 1, ..., J, and with associated tag counts c_{jw}, w = 1, ..., W. The tag counts are positive integers indicating the number of times that tag t_w has been assigned to clip s_j. The binary random variable y_{jw}, with y_{jw} ∈ {0, 1}, represents the event of tag t_w applying to song s_j.

3.2.1. Generative Process

The generative process of the BMM has two steps. First, a latent class z_k, k = 1, ..., K, is chosen from song s_j's class weight vector \theta_j:

p(z_k \mid \theta_j) = \theta_{jk} ,   (3)

where \theta_{jk} is the weight of the k-th latent class. Second, a value of the discrete variable y_{jw} is selected based on the following conditional probabilities:

p(y_{jw} = 1 \mid z_k, \beta) = \beta_{kw} , \quad p(y_{jw} = 0 \mid z_k, \beta) = 1 - \beta_{kw} .   (4)

The conditional probability that models whether clip s_j has tag t_w is thus a Bernoulli distribution with parameter \beta_{kw} for the k-th class. The complete joint distribution over y and z is described with the model parameter \beta and the weight matrix \Theta, whose j-th row is the vector \theta_j of clip s_j:

p(\mathbf{y}, \mathbf{z} \mid \beta, \Theta) = \prod_{j=1}^{J} p(\mathbf{y}_j, \mathbf{z}_j \mid \beta, \theta_j) = \prod_{j=1}^{J} \prod_{w=1}^{W} \theta_{j z_{jw}} \, \beta_{z_{jw} w}^{\,y_{jw}} \, (1 - \beta_{z_{jw} w})^{\,1 - y_{jw}} .   (5)

The marginal log-likelihood of the music corpus can be expressed as

\log p(\mathbf{y} \mid \beta, \Theta) = \sum_{j=1}^{J} \sum_{w=1}^{W} \log \sum_{k=1}^{K} \theta_{jk} \, \beta_{kw}^{\,y_{jw}} \, (1 - \beta_{kw})^{\,1 - y_{jw}} .   (6)

3.2.2. Model Inference by the EM Algorithm

The BMM can be fitted with respect to the parameter \beta and the weight matrix \Theta by maximum-likelihood (ML) estimation. By linking the latent class of the BMM with the latent music class of the GMM described in Section 3.1, the posterior probability in Eq. (2) can be viewed as the class weight, i.e., \theta_{jk} = P(z_k \mid s_j). Therefore, we only need to estimate \beta, which corresponds to the probability that a tag occurs given a latent music class. We apply the EM algorithm to maximize the corpus-level log-likelihood in Eq. (6) in the presence of the latent variable z. In the E-step, given the clip-level weight matrix \Theta and the model parameter \beta, the posterior probability of each latent class can be computed by

\gamma_{jw}(z_k) = p(z_k \mid y_{jw}, \beta, \Theta) = \begin{cases} \dfrac{\theta_{jk} \beta_{kw}}{\sum_{i=1}^{K} \theta_{ji} \beta_{iw}} & \text{for } y_{jw} = 1, \\ \dfrac{\theta_{jk} (1 - \beta_{kw})}{\sum_{i=1}^{K} \theta_{ji} (1 - \beta_{iw})} & \text{for } y_{jw} = 0. \end{cases}   (7)

In the M-step, the update rule for \beta is as follows:

\beta_{kw} = \frac{\sum_{j:\, y_{jw}=1} \gamma_{jw}(z_k)}{\sum_{j=1}^{J} \gamma_{jw}(z_k)} .   (8)

From the tag counts of the music corpus, we know that there exist different levels of relationship between a clip and a tag. If clip s_j has a more-than-one tag count c_{jw} for tag t_w, we can make song s_j contribute to \beta_{kw} c_{jw} times rather than only once in each iteration of EM. This leads to a new update rule for \beta:

\beta_{kw} = \frac{\sum_{j:\, y_{jw}=1} c_{jw} \, \gamma_{jw}(z_k)}{\sum_{j:\, y_{jw}=1} c_{jw} \, \gamma_{jw}(z_k) + \sum_{j:\, y_{jw}=0} \gamma_{jw}(z_k)} .   (9)

3.2.3. Semantic Transformation with PWBMM

The w-th component of the semantic vector v of a given clip s is computed as the conditional probability of y_w = 1 given \theta and \beta:

v_w = p(y_w = 1 \mid \beta, \theta) = \sum_{k=1}^{K} \theta_k \, p(y_w = 1 \mid z_k, \beta) = \sum_{k=1}^{K} \theta_k \beta_{kw} .   (10)

For the ensemble classification layer, given an audio clip s_m, m = 1, 2, ..., M, its semantic representation is generated in the same way. First, a sequence of music feature vectors is extracted from s_m. Second, the vector sequence is transformed into a fixed-dimensional posterior weight vector \theta_m via Eq. (2). Third, the weight vector \theta_m is transformed into a fixed-dimensional semantic vector v_m via Eq. (10).
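A compact sketch of these steps might look as follows in Python. It assumes the frame features have already been extracted and normalized, uses scikit-learn's GaussianMixture for the background model, and illustrates Eqs. (2), (7), (9), and (10) under those assumptions rather than reproducing the authors' implementation (the number of mixture components in the usage comment is arbitrary).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def song_weights(gmm: GaussianMixture, frames: np.ndarray) -> np.ndarray:
    """Eq. (2): average the frame-level latent-class posteriors of Eq. (1)."""
    return gmm.predict_proba(frames).mean(axis=0)             # shape (K,)

def fit_beta(theta: np.ndarray, counts: np.ndarray, n_iter: int = 50) -> np.ndarray:
    """EM for the tag Bernoulli parameters beta (Eqs. (7) and (9)).
    theta:  (J, K) clip-level class weights, held fixed during EM
    counts: (J, W) tag counts; 0 means the tag was never applied
    """
    J, K = theta.shape
    W = counts.shape[1]
    y = (counts > 0).astype(float)                             # binary tag events y_jw
    beta = np.full((K, W), 0.5)
    for _ in range(n_iter):
        # E-step (Eq. 7): responsibilities for y=1 and y=0, shape (J, K, W).
        num1 = theta[:, :, None] * beta[None, :, :]
        num0 = theta[:, :, None] * (1.0 - beta[None, :, :])
        g1 = num1 / np.maximum(num1.sum(axis=1, keepdims=True), 1e-12)
        g0 = num0 / np.maximum(num0.sum(axis=1, keepdims=True), 1e-12)
        # M-step (Eq. 9): count-weighted positive evidence vs. negative evidence.
        pos = (y * counts)[:, None, :] * g1
        neg = (1.0 - y)[:, None, :] * g0
        beta = pos.sum(axis=0) / np.maximum((pos + neg).sum(axis=0), 1e-12)
    return beta

def semantic_vector(theta_s: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Eq. (10): v_w = sum_k theta_k * beta_kw, the 45-dim semantic vector."""
    return theta_s @ beta                                      # shape (W,)

# Usage sketch: X_frames is a list of per-clip (n_frames, 70) arrays (already
# z-score normalized), tag_counts is a (J, 45) count matrix from MajorMiner.
# gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(np.vstack(X_frames))
# theta = np.array([song_weights(gmm, f) for f in X_frames])
# beta = fit_beta(theta, tag_counts)
# v = semantic_vector(song_weights(gmm, new_clip_frames), beta)
```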

4. THE ENSEMBLE CLASSIFICATION METHOD

Assume that we have G classes for the audio classification task and that all the classes are independent. We train G binary ensemble classifiers, denoted as C_g, g = 1, 2, ..., G, one for each class. Each ensemble classifier C_g calculates a final score by combining the outputs of two sub-classifiers: an SVM and an AdaBoost classifier.

4.1. Support Vector Machine

The SVM finds a separating surface with a large margin between training samples of two classes in a high-dimensional feature space implicitly introduced by a computationally efficient kernel mapping. The large margin implies good generalization ability in theory. In this work, we exploit a linear SVM classifier f(v) of the following form:

f(\mathbf{v}) = \sum_{w=1}^{W} \lambda_w v_w + b ,   (11)

where v_w is the w-th component of the semantic vector v of a testing clip; \lambda_w and b are parameters trained from (v_m, l_{mg}), m = 1, ..., M, where v_m is the semantic vector of the m-th training clip and l_{mg} ∈ {1, 0} is the g-th class label of the m-th training clip; and W is the dimension of the semantic vector. The advantage of the linear SVM is its training efficiency, and recent literature has shown that its prediction performance is comparable to that of a non-linear SVM. Its single cost parameter is determined by cross-validation.

4.2. AdaBoost

Boosting is a method of building a highly accurate classifier by combining several base classifiers, even though each of them is only moderately accurate. We use decision stumps as the base learner. The decision function of the boosting classifier takes the following form:

g(\mathbf{v}) = \sum_{t=1}^{T} \alpha_t h_t(\mathbf{v}) ,   (12)

where \alpha_t is set as suggested in [5]. The model selection procedure can be carried out efficiently, since we can iteratively increase the number of base learners and stop when the generalization ability on the validation set no longer improves.

4.3. Calibrated Probability Scores and Probability Ensemble

The ensemble classifier averages the scores of the two sub-classifiers, i.e., the SVM and AdaBoost. However, since the sub-classifiers for different classes are trained independently, their raw scores are not comparable. Therefore, we transform the raw score of each sub-classifier into a probability score with a sigmoid function [7]:

\Pr(l = 1 \mid \mathbf{v}) = \frac{1}{1 + \exp(A f + B)} ,   (13)

where f is the raw score of a sub-classifier, and A and B are learned by solving a regularized maximum-likelihood problem [8].
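The per-class ensemble of Sections 4.1-4.3 can be sketched with scikit-learn. The snippet below is only an approximation of the approach under stated assumptions: CalibratedClassifierCV's sigmoid option stands in for the Platt-style calibration of Eq. (13), and the cost parameter and number of boosting rounds are arbitrary placeholders rather than the values we actually selected.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import CalibratedClassifierCV

def train_class_ensemble(V: np.ndarray, y_binary: np.ndarray):
    """Train the two sigmoid-calibrated sub-classifiers for one class C_g.
    V: (M, 45) semantic vectors; y_binary: (M,) one-vs-rest labels l_mg."""
    svm = CalibratedClassifierCV(LinearSVC(C=1.0), method="sigmoid", cv=3)
    stumps = DecisionTreeClassifier(max_depth=1)               # decision stumps
    ada = CalibratedClassifierCV(
        AdaBoostClassifier(stumps, n_estimators=100), method="sigmoid", cv=3)
    return svm.fit(V, y_binary), ada.fit(V, y_binary)

def classify(ensembles, v: np.ndarray) -> int:
    """Average the calibrated SVM/AdaBoost probabilities per class, take argmax."""
    v = v.reshape(1, -1)
    scores = [0.5 * (svm.predict_proba(v)[0, 1] + ada.predict_proba(v)[0, 1])
              for svm, ada in ensembles]
    return int(np.argmax(scores))

# Usage sketch: one binary ensemble per class, then pick the best-scoring class.
# ensembles = [train_class_ensemble(V_train, (labels == g).astype(int))
#              for g in range(n_classes)]
# predicted = classify(ensembles, semantic_vector_of_test_clip)
```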
As the sub-classifier outputs have been calibrated into probability scores, the classifier ensemble for a specific class is formed by averaging the probability scores of the associated SVM and AdaBoost sub-classifiers, and the probability scores of the classifiers for different classes become comparable. The class with the highest output score is assigned to a testing music clip.

4.4. Cross-Validation

We first perform inner cross-validation on the training set to determine the cost parameter C of the linear SVM and the number of base learners in AdaBoost. Then, we retrain the classifiers with the complete training set and the selected parameters. We use the AUC-ROC as the model selection criterion.
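As an illustration of this inner model selection step, a grid search over the SVM cost and the number of AdaBoost rounds with AUC-ROC scoring could look like the following sketch; the parameter grids are placeholders, not the values actually searched in our submission.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def select_parameters(V_train, y_binary):
    """Inner 3-fold CV on the training set, with AUC-ROC as the selection criterion."""
    svm_search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]},
                              scoring="roc_auc", cv=3)
    ada_search = GridSearchCV(
        AdaBoostClassifier(DecisionTreeClassifier(max_depth=1)),
        {"n_estimators": [50, 100, 200, 400]},
        scoring="roc_auc", cv=3)
    svm_search.fit(V_train, y_binary)
    ada_search.fit(V_train, y_binary)
    return svm_search.best_params_["C"], ada_search.best_params_["n_estimators"]
```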

5. MIREX 2010 AUDIO TRAIN/TEST: MUSIC MOOD CLASSIFICATION

We submitted the audio classification system described above to the MIREX 2010 Audio Train/Test tasks. For reasons unknown to us, only the evaluation results on the music mood dataset were reported (this also happened to some other teams), although we believe our system can be adapted to any kind of audio classification dataset. In the following discussion, this system is denoted as WLJW2. We also submitted a simple system (WLJW) as a baseline. In WLJW, an audio clip is represented by the mean vector of all its frame-based feature vectors, and a simple quadratic classifier [15] is trained for each class.

5.1. The Music Mood Dataset

The music mood dataset [4] was first used in MIREX 2007. It contains 600 30-second audio clips in 22,050 Hz mono WAV format selected from the APM collection (http://www.apmmusic.com/). The corresponding five mood categories, each containing 120 clips, are shown in Table 3. The mood class of each audio clip was labeled by human judges using the Evalutron 6000 system [16].

Table 3. The five mood categories and their component adjectives.
Class 1: passionate, rousing, confident, boisterous, rowdy
Class 2: rollicking, cheerful, fun, sweet, amiable/good natured
Class 3: literate, poignant, wistful, bittersweet, autumnal, brooding
Class 4: humorous, silly, campy, quirky, whimsical, witty, wry
Class 5: aggressive, fiery, tense/anxious, intense, volatile, visceral

5.2. Evaluation Results

MIREX uses three-fold cross-validation to evaluate the submitted systems. In each fold, one subset is selected as the test set and the remaining two subsets serve as the training set. The performance is summarized in Table 4 [7]. The summary accuracy is the average accuracy over the three folds. The best value in each column is marked with an asterisk.

Table 4. The performance of all submissions on the music mood dataset.
Submission Code   Summary Accuracy   Fold 0    Fold 1    Fold 2
WLJW              0.5383             0.590     0.500     0.525
WLJW2             0.6417*            0.735*    0.595     0.595
BMPE2             0.5467             0.585     0.505     0.550
BRPC              0.5867             0.645     0.575     0.540
BRPC2             0.5900             0.695     0.550     0.525
CH                0.6300             0.705     0.615     0.570
CH2               0.6300             0.725     0.605     0.560
CH3               0.6350             0.710     0.640*    0.555
CH4               0.6267             0.710     0.615     0.555
FCY               0.6017             0.710     0.540     0.555
FCY2              0.5950             0.685     0.550     0.550
FE                0.6083             0.690     0.555     0.580
GP                0.6317             0.695     0.565     0.635*
GR                0.6067             0.685     0.570     0.565
HE                0.5417             0.580     0.520     0.525
JR                0.4633             0.480     0.435     0.475
JR2               0.5117             0.535     0.520     0.480
JR3               0.4683             0.475     0.475     0.455
JR4               0.5117             0.560     0.510     0.465
MBP               0.5400             0.585     0.530     0.505
MP2               0.3617             0.200     0.385     0.500
MW                0.5400             0.600     0.520     0.500
RJ                0.5483             0.570     0.555     0.520
RJ2               0.5017             0.505     0.495     0.505
R                 0.5483             0.595     0.520     0.530
R2                0.4767             0.515     0.450     0.465
RRS               0.6167             0.695     0.595     0.560
SSP               0.6383             0.665     0.630     0.620
TN                0.5550             0.650     0.515     0.500
TN2               0.4858             0.540     0.430     0.480
TN4               0.5750             0.645     0.540     0.540
TS                0.6100             0.705     0.575     0.550
WLB               0.5550             0.605     0.535     0.525
WLB2              0.5767             0.625     0.550     0.555
WLB3              0.6300             0.690     0.615     0.585
WLB4              0.6300             0.705     0.600     0.585

Our system WLJW2 was ranked first out of the 36 submissions in terms of summary accuracy. The summary accuracy of WLJW2 is 10.34% higher (in absolute terms) than that of our baseline system WLJW.
The results demonstrate that semantic transformation and classifier ensemble indeed enhance audio classification performance. MIREX also performed significance tests, and the results are shown in Figure 2. Figure 3 shows the overall class-pair confusion matrix of WLJW2. According to the confusion matrix, our system shows high confidence in classes 3 and 5, whose accuracies are 83.33% and 88.33%, respectively.

Figure 2. Significance tests on accuracy per fold by Friedman's ANOVA / Tukey-Kramer HSD [7].

Figure 3. The overall confusion matrix of WLJW2.
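For reference, a MIREX-style three-fold evaluation (per-fold accuracy, summary accuracy, and a confusion matrix like the one in Figure 3) can be sketched as follows. This is a generic scikit-learn illustration, not the official MIREX evaluation code, and the training and prediction callables are assumed to wrap the system described above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix

def threefold_evaluation(V, labels, train_fn, predict_fn):
    """Three-fold cross-validation: per-fold accuracy, their average (the summary
    accuracy), and the confusion matrix accumulated over all test folds."""
    classes = np.unique(labels)
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    fold_acc = []
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for train_idx, test_idx in skf.split(V, labels):
        model = train_fn(V[train_idx], labels[train_idx])
        pred = predict_fn(model, V[test_idx])
        fold_acc.append(accuracy_score(labels[test_idx], pred))
        cm += confusion_matrix(labels[test_idx], pred, labels=classes)
    return fold_acc, float(np.mean(fold_acc)), cm
```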

6. CONCLUSIONS

In this paper, we have presented a music classification system that integrates two layers of prediction: semantic transformation and ensemble classification. The semantic transformation provides, for a given audio clip, a musically conceptual representation that matches human auditory perception to some extent. The robust ensemble classifier then carries out the final classification step. The results of the MIREX evaluation task show that our system achieves very good performance compared with other systems.

7. ACKNOWLEDGEMENTS

This work was supported in part by the Taiwan e-Learning and Digital Archives Program (TELDAP) sponsored by the National Science Council of Taiwan under Grant NSC99-2631-H-001-020.

8. REFERENCES

[1] G. Tzanetakis, G. Essl, and P. Cook, "Automatic Musical Genre Classification of Audio Signals," ISMIR, 2001.
[2] T. Li, M. Ogihara, and Q. Li, "A Comparative Study on Content-Based Music Genre Classification," ACM SIGIR, 2003.
[3] D. Liu, L. Lu, and H.-J. Zhang, "Automatic Mood Detection from Acoustic Music Data," ISMIR, 2003.
[4] X. Hu, J. S. Downie, C. Laurier, M. Bay, and A. F. Ehmann, "The 2007 MIREX Audio Mood Classification Task: Lessons Learned," ISMIR, 2008.
[5] D. Ellis, B. Whitman, A. Berenzweig, and S. Lawrence, "The Quest for Ground Truth in Musical Artist Similarity," ISMIR, 2002.
[6] T. Li and M. Ogihara, "Music Artist Style Identification by Semi-supervised Learning from both Lyrics and Content," ACM MM, 2004.
[7] J.-C. Wang, H.-S. Lee, S.-K. Jeng, and H.-M. Wang, "Posterior Weighted Bernoulli Mixture Model for Music Tag Annotation and Retrieval," APSIPA ASC, 2010.
[8] MIREX 2010 Results: Audio Tag Affinity Estimation, Submission Code: WLJW3, Name: Adaptive PWBMM, http://nema.lis.illinois.edu/nema_out/mirex2010/results/atg/subtask2_report/aff/
[9] O. Lartillot and P. Toiviainen, "A Matlab Toolbox for Musical Feature Extraction from Audio," DAFx, 2007.
[10] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[11] Y. Freund and R. E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[12] H.-Y. Lo, J.-C. Wang, and H.-M. Wang, "Homogeneous Segmentation and Classifier Ensemble for Audio Tag Annotation and Retrieval," ICME, 2010.
[13] J. Platt, "Probabilistic Outputs for Support Vector Machines and Comparison to Regularized Likelihood Methods," Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, 1999.
[14] H.-T. Lin, C.-J. Lin, and R.-C. Weng, "A Note on Platt's Probabilistic Outputs for Support Vector Machines," Machine Learning, vol. 68, no. 3, pp. 267-276, 2007.
[15] W. J. Krzanowski, Principles of Multivariate Analysis: A User's Perspective, New York: Oxford University Press, 1988.
[16] A. A. Gruzd, J. S. Downie, M. C. Jones, and J. H. Lee, "Evalutron 6000: Collecting Music Relevance Judgments," ACM JCDL, 2007.
[17] MIREX 2010 Results: Audio Mood Classification, http://nema.lis.illinois.edu/nema_out/9ba5c8-9fcf-4029-95eb-5ed56cfb5f/results/evaluation/index.html