D3.4.1 Music Similarity Report


Abstract: The goal of Work Package 3 is to take the features and metadata provided by Work Package 2 and provide the technology needed for the intelligent structuring, presentation, and use of large music collections. This deliverable is about audio and web-based similarity measures, novelty detection (which we demonstrate to be a useful tool to combine with similarity), and first outcomes applying mid-level WP2 descriptors (including preliminary versions of D2.1.2). We present improvements of the similarity measures presented in the previous WP3 deliverables. The outcomes will serve as the foundation for D3.5.1, D3.6.2, and the prototypes, in particular for the recommender and organizer.

Version 1.0
Date: May 2005
Editor: E. Pampalk
Contributors: A. Flexer, E. Pampalk, M. Schedl, J. Bello, and C. Harte
Reviewers: G. Widmer and P. Herrera

Contents

1 Introduction
2 Audio-based Similarity
   Data
   Hidden Markov Models for Spectral Similarity: Methods; Results (Comparing log-likelihoods directly, Genre Classification); Discussion
   Spectral Similarity Combined with Complementary Information: Spectral Similarity; Fluctuation Patterns; Combination; Genre Classification; Conclusions
   Summary & Recommendations
3 Web-Based Similarity
   Web Mining by Co-occurrence Analysis
   Experiments and Evaluation: Intra-/Intergroup-Similarities; Classification with k-nearest Neighbors
   Conclusions & Recommendations
4 Novelty Detection and Similarity
   Data
   Methods: Music Similarity; Algorithms for novelty detection
   Results
   Discussion
5 Chroma-Complexity Similarity
   Chromagram Calculation: Chromagram Tuning; Chromagram Processing
   Chroma Complexity
   ChromaVisu Tool
   Results: Jazz (Dave Brubeck Quartet); Classic Orchestra; Classic Piano; Dance; Hip Hop; Pop
   Discussion & Conclusions
6 Conclusions and Future Work

1 Introduction

Overall goal of Work Package 3: The goal of Work Package 3 (WP3) is to take the features and meta-data provided by Work Package 2 (WP2) and provide the technology needed for the intelligent structuring, presentation, and use (query processing and retrieval) of large music collections. This general goal can be broken down into two major task groups: the automatic structuring and organisation of large collections of digital music, and intelligent music retrieval in such structured music spaces.

Role of Similarity within SIMAC: Similarity measures are a key technology in SIMAC. They are the foundation of the deliverables D3.5.1 (music collection structuring and navigation module) and D3.6.2 (module for retrieval by similarity and semantic descriptors). They enable core functionalities of the organizer and recommender prototypes. Without similarity, functions such as playlist generation, organization and visualization, hierarchical structuring, retrieval, and recommendations cannot be implemented. In fact, the importance of similarity goes beyond its role in the prototypes. A similarity measure can be licensed as is, and can easily find its way into online music stores or mobile audio players. Thus, it is highly recommended to continue the development and improvement of similarity measures throughout the SIMAC project beyond D3.4.1.

Definition of Similarity in D3.4.1: There are many aspects of similarity (timbre, harmony, rhythm, etc.), and there are different sources from which these can be computed (audio, web pages, lyrics, etc.). Most of all, similarity is a perception which depends on the listener's point of view and context. Within SIMAC any important dimension of perceived similarity is useful. However, the main parts of this deliverable define similarity as the concept which pieces within a genre (or subgenre) have in common. As already pointed out in the previous WP3 deliverables, the reason for this is that it allows highly efficient (i.e. fast and cheap) evaluations of the similarity measures (since genre labels for artists are readily available).

Evaluation Procedures: In this deliverable we primarily use nearest neighbor classifiers (and genre classification) to evaluate the similarity measures. The idea is that pieces within the same genre should be very close to each other. In addition we use inter- and intra-group distances as described in Section 3.2.1. These are particularly useful in understanding how well each group is modeled by the similarity measure. In Chapter 2 we compute the log-likelihood of the self-similarity within a song and use it to evaluate similarity measures. In particular, the first half of each song is compared to the second half; the idea is that a good similarity measure would recognize these to be highly similar. In Section 4.3 we use receiver operating characteristics (ROC) to measure the tradeoff between sensitivity and specificity. Throughout this deliverable (and in particular in Chapter 5) we use illustrations to demonstrate characteristics of the similarity measures.

Specific Context of D3.4.1 in SIMAC: D3.4.1 is built on the code and ideas from WP2 (i.e. the WP2 feature deliverables and preliminary versions of D2.1.2) and uses the findings of the previous WP3 deliverables. In particular, a large part of those deliverables covers the similarity measures (including their implementations) used here. The recommendations of D3.4.1 are the foundation for D3.5.1 and for the organizer and recommender prototypes.

Relationship to D3.1.1 and D3.2.1: In these previous deliverables of WP3 we presented a literature review of similarity measures, implementations (MA Toolbox), and extensive evaluations thereof based on genre classification. In this deliverable we present improvements of these results and recommendations for the implementations of the prototypes. Topics covered in detail in these previous deliverables are only repeated if they are necessary to define the context. Thus, this deliverable is not self-contained but rather an add-on to D3.1.1 and D3.2.1.

Outcomes of D3.4.1: There are five main outcomes of this deliverable. The following five chapters are structured according to these.

A. Recommendations for audio-based similarity: We report on findings using HMMs and using combinations of different similarity measures. We show that a combination of different approaches can improve genre classification performance by up to 14% on average (measured on four different collections).

B. Recommendations for web-based similarity: We report a simpler alternative to the previously presented approach, based on co-occurrences and the number of pages retrieved by Google. We show that, depending on the size of the music collection, different approaches are preferable.

C. We demonstrate how novelty detection can be combined with similarity measures to improve the performance. We show that, using simple techniques, genre classification or playlist generation can be improved.

D. We report first results using WP2 outcomes for similarity. In particular, we present a general approach to using musical complexity for similarity computations, applied here to chroma patterns. Our preliminary results demonstrate possible applications of mid-level descriptors developed in WP2.

E. We give an outlook on topics to pursue in the remainder of the SIMAC project and the necessary next steps.

2 Audio-based Similarity

Advantages and Limitations of Audio-based Similarity: Audio-based similarity is cheap and fast. Computing the similarity of two pieces can be done within seconds. The similarity can be computed between artists, songs, or even below the song level (e.g. between segments of different pieces). The main limitation is the quality of the computed similarity. For example, the presence and the expressiveness of a singing voice, or of instruments such as electric guitars, are not modeled appropriately. In general, the meaning (or the message) of a piece of music (including, for example, emotions) will, as far as we can foresee, remain incomprehensible to the computer. Furthermore, the audio signal does not contain cultural information. Both the history and the social context of a piece of music are not accessible. In fact, as we will discuss in the next chapter, these are better extracted through analysis of contents on the Internet.

The remainder of this chapter is organized as follows:

A. Description of the four data sets which we used for evaluation, one of which was used as the training set for the ISMIR 2004 genre classification contest.

B. Results on modeling temporal information for spectral similarity. Our results show that temporal information improves the performance of the similarity within a song. However, this improvement does not appear to be significant when measured in a genre classification task.

C. Results on combining different approaches. In particular, we combine the spectral similarity (which we have shown in D3.1.1 to outperform other approaches in various tasks) with information gathered from fluctuation patterns. On average (using the four collections) the improvement is about 14% for genre classification.

D. We summarize our findings and make recommendations for the prototypes.

2.1 Data

For our experiments we use four music collections with a total of almost 6000 pieces. Details are given in Tables 2.1 and 2.2. For the evaluation (especially to avoid overfitting) it is important that the collections are structured differently and have different types of contents.

Table 2.1: Statistics of the four collections. The columns give the number of genres, artists, and tracks per collection, as well as the minimum and maximum number of artists per genre and tracks per genre. The rows are In-House Small (DB-S), In-House Large (DB-L), Magnatune Small (DB-MS), and Magnatune Large (DB-ML).

Table 2.2: List of genres for each collection.
DB-S: alternative, blues, classic orchestra, classic piano, dance, eurodance, happy sound, hard pop, hip hop, mystera, pop, punk rock, rock, rock & roll, romantic dinner, talk
DB-L: a cappella, acid jazz, blues, bossa nova, celtic, death metal, DnB, downtempo, electronic, euro-dance, folk-rock, German hip hop, hard core rap, heavy metal/thrash, Italian, jazz, jazz guitar, melodic metal, punk, reggae, trance, trance2
DB-MS: classical, electronic, jazz/blues, metal/punk, pop/rock, world
DB-ML: ambient, classical, electronic, jazz, metal, new age, pop, punk, rock, world

DB-S: The smallest collection consists of 100 pieces. We have previously used it in [26]; however, we removed all classes consisting of one artist only. The categories are not strictly genres (e.g. one of them is romantic dinner music). Furthermore, the collection also includes one non-music category, namely speech (German cabaret). This collection has a very good (i.e. low) ratio of tracks per artist. However, due to its size the results need to be treated with caution.

DB-L: The second largest collection has mainly been organized according to genre/artist/album. Thus, all pieces from an artist (and album) are assigned to the same genre, which is a questionable but common practice. Only two pieces overlap between DB-L and DB-S, namely Take Five and Blue Rondo by the Dave Brubeck Quartet. The genres are user defined and inconsistent. In particular, there are two different definitions of trance. Furthermore, there are overlaps, for example, jazz and jazz guitar, heavy metal and death metal, etc.

DB-MS: This collection is a subset of DB-ML and has been used as the training set for the ISMIR 2004 genre classification contest. The music originates from Magnatune and is available via Creative Commons; UPF/MTG arranged with Magnatune a free use for research purposes. Although we have a larger set from the same source, we use this subset to compare our results to those of the ISMIR 2004 contest. The genre labels are given on the Magnatune website. The collection is very unbalanced: most pieces belong to the genre classical, and a large number of pieces in world sound like classical music. Some of the original Magnatune classes were merged by UPF/MTG due to ambiguities and the small number of tracks in some of the genres.

DB-ML: This is the largest set in our experiments. DB-MS is a subset of this collection. The genres are also very unbalanced. The number of artists is not much higher than in DB-MS, so the number of tracks per artist is very high. The genres which were merged for the ISMIR contest are separated.

2.2 Hidden Markov Models for Spectral Similarity

This section deals with modeling temporal aspects to improve spectral similarity. The work presented in this section has been submitted to a conference [12].

As shown in D3.1.1, the approach to music similarity based on spectral similarity pioneered by [20] and [1] (and later refined in [2]) outperformed all other alternatives. In the following we will refer to it as AP. For a given music collection of S songs, each belonging to one of G music genres, it consists of the following basic steps:

- for each song, divide the raw audio data into overlapping frames of short duration (around 25ms)
- compute Mel Frequency Cepstrum Coefficients (MFCCs) for each frame (up to 20)
- train a Gaussian Mixture Model (GMM, number of mixtures up to 50) for each of the songs
- compute a similarity matrix between all songs using the likelihood of a song given a GMM
- based on the genre information, do k-nearest neighbor classification using the similarity matrix

The last step, genre classification, can be seen as a form of evaluation. Since usually no ground truth with respect to music similarity exists, each song is labeled as belonging to a music genre using e.g. music expert advice. High genre classification results indicate good similarity measures.

This approach based on GMMs disregards the temporal order of the frames, i.e. to the algorithm it makes no difference whether the frames in a song are ordered in time or whether this order is completely reversed or scrambled. Research on the perception of musical timbre of single musical instruments clearly shows that temporal aspects of the audio signals play a crucial role (see e.g. [15]). Aspects like spectral fluctuation, or the attack or decay of an event, cannot be modelled without respecting the temporal order of the audio signals. A natural way to incorporate temporal context into the above described framework is the use of Hidden Markov Models (HMMs) instead of GMMs. HMMs trained on MFCCs have already been used for music summarization ([19; 3; 30]) and genre classification [2], but with rather limited success.

For the experiments reported in this section we use the DB-MS collection. We divide the raw audio data into overlapping frames of short duration and use Mel Frequency Cepstrum Coefficients (MFCCs) to represent the spectrum of each frame. The frame size for the computation of the MFCCs was 23.2ms (512 samples), with a hop size of 11.6ms (256 samples) for the overlap of frames. Although improved results have been reported with up to 20 MFCCs [2], we used only the first 8 MFCCs for all our experiments to limit the computational burden. In order to allow modeling of a bigger temporal context we also used so-called texture windows [36]: we computed means and variances of the MFCCs across the following numbers of frames and used them as alternative input to the models: 22 frames, hop size 11 (510.4ms, 255.2ms); 10 frames, hop size 5 (232ms, 116ms); 10 frames, hop size 2 (232ms, 46.4ms). This means that if a texture window is being used, after preprocessing a single data point x_t is a 16-dimensional vector (8 mean MFCCs plus 8 variances across MFCCs) instead of an 8-dimensional vector if no texture window is used.
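As an illustration of the texture-window preprocessing described above, the following is a minimal sketch in Python/NumPy (not the original Matlab implementation); the `mfccs` array and the helper name are our own placeholders:

```python
import numpy as np

def texture_windows(mfccs, win_frames=22, hop_frames=11):
    """Summarize short-time MFCC frames into texture windows.

    mfccs: array of shape (num_frames, num_coeffs), e.g. the first 8 MFCCs.
    Returns an array of shape (num_windows, 2 * num_coeffs): per window the
    mean and variance of each coefficient, as described in the text.
    """
    frames, coeffs = mfccs.shape
    out = []
    for start in range(0, frames - win_frames + 1, hop_frames):
        block = mfccs[start:start + win_frames]
        out.append(np.concatenate([block.mean(axis=0), block.var(axis=0)]))
    return np.array(out)

# Example: 1000 frames of 8 MFCCs -> 16-dimensional texture-window vectors
x = texture_windows(np.random.randn(1000, 8))
print(x.shape)  # (89, 16) for win 22, hop 11
```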

Methods

A Gaussian Mixture Model (GMM) models the density of the input data by a mixture model of the form

p_{GMM}(x) = \sum_{m=1}^{M} P_m \, N[x, \mu_m, U_m]    (2.1)

where P_m is the mixture coefficient of the m-th mixture, N is the normal density, and \mu_m and U_m are the mean vector and covariance matrix of the m-th mixture. The log-likelihood function is given by

L_{GMM} = \frac{1}{T} \sum_{t=1}^{T} \log(p_{GMM}(x_t))    (2.2)

for a data set containing T data points. This function is maximized both with respect to the mixing coefficients P_m and with respect to the parameters of the Gaussian basis functions using Expectation-Maximization (see e.g. [8]).

Hidden Markov Models (HMMs) [32] allow the analysis of non-stationary multivariate time series by modeling both the probability density functions of locally stationary multivariate data and the transition probabilities between these stable states. If the probability density functions are modelled with mixtures of Gaussians, HMMs can be seen as GMMs plus transition probabilities. An HMM can be characterized as having a finite number N of states Q:

Q = \{q_1, q_2, \ldots, q_N\}    (2.3)

A new state q_j is entered based upon a transition probability distribution A which depends on the previous state (the Markovian property):

A = \{a_{ij}\}, \quad a_{ij} = P(q_j(t) \mid q_i(t-1))    (2.4)

where t = 1, \ldots, T is a time index with T being the length of the observation sequence. After each transition an observation output symbol is produced according to a probability distribution B which depends on the current state. Although the classical HMM uses a set of discrete symbols as observation output, [32] already discuss the extension to continuous observation symbols. We use a Gaussian Observation Hidden Markov Model (GOHMM) where the observation symbol probability distribution for state j is given by a mixture of Gaussians:

B = \{b_j(x)\}, \quad b_j(x) = p_{GMM_j}(x)    (2.5)

where p_{GMM_j}(x) is the density as defined for a mixture of Gaussians in Equ. 2.1.
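For concreteness, a minimal sketch of Equations 2.1 and 2.2 above, using scikit-learn's GaussianMixture as a stand-in for the Netlab/MA Toolbox implementations actually used in the deliverable; the song arrays and the symmetrized distance at the end are illustrative assumptions, not the exact experimental setup:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder MFCC (or texture-window) matrices for two songs, shape (T, dim)
song_a = np.random.randn(5000, 8)
song_b = np.random.randn(5000, 8)

# Train one GMM per song (diagonal covariances, as in the experiments)
gmm_a = GaussianMixture(n_components=30, covariance_type='diag').fit(song_a)
gmm_b = GaussianMixture(n_components=30, covariance_type='diag').fit(song_b)

# Eq. 2.2: score() returns the mean per-frame log-likelihood (1/T) * sum log p(x_t)
L_a_given_b = gmm_b.score(song_a)
L_b_given_a = gmm_a.score(song_b)

# A simple symmetric (dis)similarity between the two songs
distance_ab = -(L_a_given_b + L_b_given_a)
print(distance_ab)
```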

The Expectation-Maximization (EM) algorithm is used to train the GOHMM, thereby estimating the parameter sets A and B. The log-likelihood function is given by

L_{HMM} = \frac{1}{T} \sum_{t=1}^{T} \left[ \log(b_{q_t}(x_t)) + \log(a_{q_{t-1} q_t}) \right]    (2.6)

for an observation sequence of length t = 1, \ldots, T, with q_1, \ldots, q_T being the most likely state sequence and q_0 a start state. The forward algorithm is used to identify the most likely state sequences corresponding to a particular time series and enables the computation of the log-likelihoods. Full details of the algorithms can be found in [32].

Figure 2.1: Duration probability densities p(d) (y-axis) for durations d (x-axis) in seconds for different combinations of window and hop sizes: line (1) win 23.2ms, hop 11.6ms; line (2) win 232ms, hop 46.4ms; line (3) win 232ms, hop 116ms; line (4) win 510.4ms, hop 255.2ms.

It is informative to have a closer look at how the transition probabilities influence the state sequence characteristics. The inherent duration probability density p_i(d) associated with state q_i, with self-transition coefficient a_{ii}, is of the form

p_i(d) = (a_{ii})^{d-1} (1 - a_{ii})    (2.7)

This is the probability of d consecutive observations in state q_i, i.e. the duration probability of staying d times in one of the locally stationary states modeled with a mixture of Gaussians. As [31] noted, this exponential state duration density is not optimal for a lot of physical signals. The duration of a single data point in our case depends on the window length win of the frame used for computing the MFCCs, or on the size of the texture window, as well as on the hop size hop. The length l of staying in the same state, expressed in msec, is then

l = (d - 1) \cdot hop + win    (2.8)

with hop and win given in msec.
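To make Equations 2.7 and 2.8 concrete, a small illustrative sketch (Python/NumPy); the self-transition value 0.99 follows the value used for Figure 2.1, everything else is an assumption:

```python
import numpy as np

def duration_density(a_ii, d):
    """Eq. 2.7: probability of exactly d consecutive observations in state q_i."""
    return a_ii ** (d - 1) * (1.0 - a_ii)

def duration_ms(d, hop_ms, win_ms):
    """Eq. 2.8: length of staying d steps in the same state, in milliseconds."""
    return (d - 1) * hop_ms + win_ms

a_ii = 0.99                     # self-transition coefficient as in Fig. 2.1
d = np.arange(1, 501)           # number of consecutive observations
p = duration_density(a_ii, d)
length_s = duration_ms(d, hop_ms=11.6, win_ms=23.2) / 1000.0

# e.g. probability mass of staying longer than 5 seconds in one state
print(p[length_s > 5.0].sum())
```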

Fig. 2.1 gives duration probability densities for all different combinations of hop and win used for preprocessing, with a_{ii} set to 0.99 (which is a reasonable choice for audio data). One can see that whereas for hop = 11.6 and win = 23.2 the duration probability at five seconds is already almost zero, there still is an albeit small probability for durations of up to 120 seconds for the larger window and hop sizes. Our choice of different frame sizes and texture windows thus seems to guarantee a range of different duration probabilities. The shorter the state durations in HMMs are, the more often the state sequence will switch from state to state and the less clear the boundaries between the mixtures of Gaussians of the individual states will be. Therefore, with shorter state durations the HMMs will be more akin to GMMs in their modeling behavior.

An important open issue is the model topology of the HMM. Looking again at the work by [32] on speech analysis, we can see that the standard model for isolated word recognition is a left-to-right HMM. No transitions are allowed to states whose indices are lower than the current state, i.e. as time increases the state index increases. This has been found to account well for the modeling of words, which rarely have repeating vowels or sounds. For songs, a fully connected, so-called ergodic HMM seems to be more suitable than the constrained left-to-right model. After all, repeating patterns seem to be an integral part of music. Therefore it makes sense to allow states to be entered more than once and hence use ergodic HMMs.

There is a small number of papers describing applications of HMMs to the modeling of some form of spectral similarity. [19] compare HMMs and static clustering for music summarization. Fully ergodic HMMs with five to twelve states of single Gaussians are trained on the first 13 MFCCs (computed from 25.6ms overlapping windows). Key phrases are chosen based on state frequencies and evaluated in a user study. Clustering performs best and HMMs do not even surpass the performance of a random algorithm. [3] use fully ergodic three-state HMMs with single Gaussians per state trained on the first ten MFCCs (computed from 30ms overlapping windows) for segmentation of songs into chorus, verse, etc. The authors found little improvement over using static k-means clustering for the problem. The same approach is used as part of a bigger system for audio thumbnailing in [4]. [30] also compare HMMs and k-means clustering for music audio summary generation. The authors report achieving smoother state jumps using HMMs. [2] report genre classification experiments using HMMs with numbers of states ranging from 3 to 30, where the states are mixtures of four Gaussians. For their genre classification task the best HMM is the one with 12 states. Its performance is slightly worse than that of a GMM with a mixture of 50. The authors do not give any detail about the topology of the HMM, i.e. whether it is a fully ergodic one or one with left-to-right topology. It is also unclear whether they use full covariance matrices for the mixtures of Gaussians. From the graph in their paper (Figure 6) it is evident that HMMs with numbers of states ranging from 4 to 25 perform at a very comparable level in terms of genre classification accuracy.

HMMs have also been used successfully for audio fingerprinting (see e.g. [5]). There, HMMs with tailor-made topologies trained on MFCCs are used to fully represent each detail of a song in a huge database. The emphasis is on the exact identification of a specific song and not on generalization to songs with similar characteristics.

Results

Table 2.3: Overview of all types of models used and results achieved. Columns: index of model (nr), model type (model), number of states (states), size of mixture (mix), window size (win), hop size (hop), texture window (tex), degrees of freedom (df), mean log-likelihood (likeli), number of HMM-based log-likelihoods bigger than GMM-based log-likelihoods (H > G), z-statistic (z), mean accuracy (acc), standard deviation (stddev), t-statistic (t). The twelve rows alternate between HMMs (odd model numbers) and GMMs (even model numbers); models 1-6 do not use a texture window (tex = n), models 7-12 do (tex = y).

For our experiments with GMMs and HMMs we used the following parameters (abbreviations correspond to those used in Table 2.3):

- preprocessing: we used combinations of window (win) and hop sizes (hop) and texture windows (tex set to yes ('y') or no ('n'))
- topology: 3, 6 and 10 state ergodic (fully connected) HMMs with mixtures of 1, 3 or 5 Gaussians per state, and GMMs with mixtures of 9, 10 or 30 Gaussians (see states and mix in Table 2.3 for the combinations used); the Gaussians use diagonal covariance matrices for both HMMs and GMMs
- computation of similarity: similarity is computed using Equ. 2.6 for HMMs and Equ. 2.2 for GMMs

The combinations of the parameters states, mix, win, hop and tex used for this study yielded twelve different model classes: six types of HMMs and six types of GMMs. We made sure to employ comparable types of GMMs and HMMs by having comparable degrees of freedom for pairs of model classes: HMM (states 10, mix 1) vs. GMM (mix 10), HMM (states 3, mix 3) vs. GMM (mix 9), HMM (states 6, mix 5) vs. GMM (mix 30). The degrees of freedom (number of free parameters) for GMMs and HMMs are

df_{GMM} = mix \cdot dim(x)    (2.9)

df_{HMM} = states \cdot mix \cdot dim(x) + states^2    (2.10)

with dim(x) being the dimensionality of the input vectors. Column df in Table 2.3 gives the degrees of freedom for all types of models. With the first column nr indexing the different models, odd-numbered models are always HMMs and the next even-numbered model is always the associated GMM. The difference in degrees of freedom between two associated types of GMMs and HMMs is always the number of transition probabilities (states^2).
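A small sketch of the degree-of-freedom bookkeeping in Equations 2.9 and 2.10, reproducing the matched pairs listed above for 8-dimensional input (Python; the helper names are our own):

```python
def df_gmm(mix, dim):
    # Eq. 2.9: free parameters counted for a GMM
    return mix * dim

def df_hmm(states, mix, dim):
    # Eq. 2.10: per-state mixtures plus the states x states transition matrix
    return states * mix * dim + states ** 2

dim = 8  # first 8 MFCCs (no texture window)
pairs = [((10, 1), 10), ((3, 3), 9), ((6, 5), 30)]  # (HMM states, mix) vs. GMM mix
for (states, mix), gmm_mix in pairs:
    print(f"HMM({states},{mix}): df={df_hmm(states, mix, dim)}  "
          f"GMM({gmm_mix}): df={df_gmm(gmm_mix, dim)}")
# -> 180 vs. 80, 81 vs. 72, 276 vs. 240; the difference is always states**2
```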

Comparing log-likelihoods directly

The first line of experiments compares goodness-of-fit criteria (log-likelihoods) between songs and models in order to explore which type of model best describes the data. Out-of-sample log-likelihoods were computed in the following way:

- train HMMs and GMMs for each of the twelve model types for each of the songs in the training set, using only the first half of each song
- use the second half of each song to compute the log-likelihoods L_{HMM} and L_{GMM}

This yielded S = 729 log-likelihoods for each of the twelve model types. Average log-likelihoods per model type are given in column likeli in Table 2.3. Since the absolute values of the log-likelihoods very much depend on the type of songs used, it is much more informative to compare log-likelihoods on a song-by-song basis. In Fig. 2.2, histogram plots of the differences of log-likelihoods L_i - L_{i+1} between associated model types are shown:

L_i - L_{i+1} = L_{HMM(i)} - L_{GMM(i+1)}    (2.11)

with HMM(i) being an HMM of model type index nr = i, GMM(i+1) being the associated GMM of model type index nr = i+1, and i = 1, 3, 5, 7, 9, 11.

The differences L_i - L_{i+1} are computed for all the S = 729 songs before doing the histogram plots. As can be seen in Fig. 2.2, except for one histogram plot the majority of HMM models show a better goodness-of-fit of the data than their associated GMMs (i.e. their log-likelihoods are higher for most of the songs). The only exception is the comparison of model types 1 and 2 (HMM (states 10, mix 1) vs. GMM (mix 10)), which is interesting because in this case the HMMs have the biggest advantage in terms of degrees of freedom (180 vs. 80) over the GMMs of all the comparisons. This is due to the fact that this type of HMM has the highest number of states (states = 10). But it also has only a single Gaussian per state to model the probability density functions. Experiments on isolated word recognition in speech analysis [32] have shown that small sizes of the mixtures of Gaussians used in HMMs do not catch the full detail of the emission probabilities, which often are not Gaussian at all. Mixtures of five Gaussians with diagonal covariances per state have been found to be a good choice.

Figure 2.2: Histogram plots of differences in log-likelihood between associated models (one panel per pair of associated model types).

Finding a correct statistical test for comparing likelihoods of so-called non-nested models is far from trivial (see e.g. [23] or [14]). HMMs and GMMs are non-nested models because one is not just a subset of the other, as would e.g. be the case with a mixture of five Gaussians compared to a mixture of six Gaussians. What makes the models non-nested is the fact that it is not clear how to weigh the parameter of a transition probability a_{ij} against, say, a mean \mu_m of a Gaussian. Nevertheless, it is correct to compare the log-likelihoods since we use out-of-sample estimates, which automatically punishes overfitting due to excessive free parameters. It is just the distribution characteristics of the log-likelihoods which are hard to describe. Therefore we resorted to the distribution-free sign test, which relies only on the rank of results (see e.g. [34]).

Let C_I be the score under condition I and C_{II} the score under condition II; then the null hypothesis tested by the sign test is

H_0: \; p(C_I > C_{II}) = p(C_I < C_{II}) = \frac{1}{2}    (2.12)

In our case the two scores C_I and C_{II} are the matched pairs of log-likelihoods for a song given the associated models HMM_I and GMM_{II}. If c is the number of times that C_I > C_{II} and the number of matched pairs N is greater than 25, then the sampling distribution is the normal distribution with

z = \frac{c - \frac{1}{2}N}{\frac{1}{2}\sqrt{N}}    (2.13)

Column H > G in Table 2.3 gives the count c of HMM-based log-likelihoods being bigger than GMM-based log-likelihoods for all pairs of associated model types. Column z gives the corresponding z-values obtained using Equ. 2.13. All z-values are highly significant at the 99% error level since all z > z_{99} = 2.58. Therefore HMMs always describe the data better than their associated GMMs, with the exception of the comparison of model types 1 and 2 (HMM (states 10, mix 1) vs. GMM (mix 10)).

To counter the argument that the superior performance of the HMMs is due to their extra number of degrees of freedom (i.e. the number of transition probabilities, see column df in Table 2.3), we also compared the smallest type of HMMs (model nr 3: HMM (states 3, mix 3), df = 81) with the biggest type of GMMs (model nr 6: GMM (mix 30), df = 240). This comparison yielded a count c (H > G) of 635 and a z-value z > z_{99} = 2.58, again being highly significant. We conclude that it is not the sheer number of degrees of freedom in the models but the quality of the free parameters which decides which type of model better fits the data. After all, the degrees of freedom of the HMMs in our last comparison are outnumbered three times by those of the GMMs.

Genre Classification

The second line of experiments compares genre classification results. In a 10-fold cross validation we did the following:

- train HMMs and GMMs for each of the twelve model types for each of the songs in the training set (the nine training folds), this time using the complete songs
- for each of the model types, compute a similarity matrix between all songs using the log-likelihood of a song given an HMM or a GMM (L_{HMM} and L_{GMM})
- based on the genre information, do one-nearest-neighbor classification for all songs in the test fold using the similarity matrices (sketched below)
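A minimal sketch of the nearest-neighbor step at the end of this procedure (Python/NumPy; `dist`, `genres`, and the fold split are toy placeholders, not the actual experimental data):

```python
import numpy as np

def one_nn_accuracy(dist, genres, test_idx, train_idx):
    """1-NN genre classification from a precomputed distance matrix."""
    correct = 0
    for i in test_idx:
        nearest = train_idx[np.argmin(dist[i, train_idx])]
        correct += genres[i] == genres[nearest]
    return correct / len(test_idx)

# Toy example: 729 songs, random distances and labels, one of ten folds as test set
rng = np.random.default_rng(0)
n = 729
dist = rng.random((n, n))
genres = rng.integers(0, 6, size=n)
folds = np.array_split(rng.permutation(n), 10)
test_idx = folds[0]
train_idx = np.concatenate(folds[1:])
print(one_nn_accuracy(dist, genres, test_idx, train_idx))
```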

Average accuracies and standard deviations across the ten folds of the cross validation are given in columns acc and stddev in Table 2.3. Looking at the results one can see that the achieved accuracies range from around 73% to around 78%, with standard deviations of up to 5%. We compared the accuracy results of associated model types in a series of paired t-tests (model nr 1 vs. nr 2, ..., nr 11 vs. nr 12). The resulting t-values are given in column t in Table 2.3. None of the t-values is significant at the 99% error level since all t < t_{(99, df=9)} = 3.25 (the same holds true at the 95% error level). Even the biggest difference in accuracy (between model type nr 4, GMM (mix 9), acc = 73.38, and model type nr 6, GMM (mix 30), acc = 78.19) is not significant: t = 0.43 < t_{(99, df=9)} = 3.25 (the same holds true at the 95% error level). We therefore conclude that there is no significant difference in genre classification performance between any of the twelve model types. They all perform at the same level of accuracy.

Discussion

There are two main results: (i) HMMs describe the spectral similarity of songs better than the standard technique of GMMs. The comparison of log-likelihoods clearly shows that HMMs allow a better fit of the data. This holds not only when looking at competing models with comparable numbers of degrees of freedom, but also for GMMs with numbers of parameters that are much larger than those of the HMMs. The only outlier in this respect is model type 1 (HMM (states 10, mix 1)). But, as discussed in the previous section, this is probably due to the poor choice of single Gaussians for modeling the emission probabilities. (ii) HMMs perform at the same level as GMMs when used for spectral-similarity-based genre classification. There is no significant gain in terms of classification accuracy.

Genre classification is of course a rather indirect way of measuring differences between alternative similarity models. The human error in classifying some of the songs already gives rise to a certain percentage of misclassification, and inter-rater reliability between a number of music experts is far from perfect for genre classification.

Although we believe this is the most comprehensive study on using HMMs for spectral similarity of songs so far, there is of course a lot still to be done. Two possible routes for further improvement come to mind: the topology of the HMMs and the handling of the state duration. Choosing a topology for an HMM still is more of an art than a science (see e.g. [10] for a discussion). Our limited set of examined combinations of numbers of states and sizes of mixtures could be extended. One should, however, notice that too large numbers for these parameters quickly lead to numerical problems due to insufficient training data. We also have not yet tried out left-to-right models. With our choice of different frame sizes and texture windows we tried to explore a range of different state duration densities.

There are of course a number of alternative and possibly more principled ways of doing this. The usage of so-called explicit state duration modeling could be explored: a duration parameter d per HMM state is added, and upon entering a state q_i a duration d_i is chosen according to a state duration density p(d_i). Formulas are given in [32]. Another idea is to use an array of n states with identical self-transition probabilities where it is enforced to pass each state at least once. This gives rise to more flexible, so-called Erlang duration density distributions (see [10]). An altogether different approach to representing the dynamical nature of audio signals is the computation of dynamic features, substituting the MFCCs with features that already code some temporal information (e.g. autocorrelation or reflection coefficients). Examples can be found in [32]. Some of these ideas might be able to further improve the modeling of songs by HMMs, but it is not clear whether this will also help the genre classification performance.

2.3 Spectral Similarity Combined with Complementary Information

In this section we demonstrate how the performance of the AP spectral similarity can be improved. In particular, we combine it with complementary information taken from fluctuation patterns (which describe loudness fluctuations over time) and two new descriptors derived from them. The work presented in this section has been submitted to a conference [27]. To evaluate the results we use the four music collections described previously. Compared to the winning algorithm of the ISMIR 04 genre classification contest, our findings show improvements of up to 41% (12 percentage points) on one of the collections, while the results on the contest training set (using the same evaluation procedure as in the contest) increased by merely 2 percentage points. One of our main observations is that evaluating on only one music collection (rather than several with different structures and contents) can lead to overfitting. Another observation is the need to distinguish between artist identification and genre classification. Furthermore, our findings confirm those of Aucouturier and Pachet [2], who suggest the existence of a glass ceiling which cannot be surpassed without taking higher-level cognitive processing into account.

Spectral Similarity

We use the same spectral similarity described in the previous section on HMMs. We used the implementations in the MA Toolbox [26] and the Netlab Toolbox for Matlab. From the 22050Hz mono audio signals, two minutes from the center are used for further analysis. The signal is chopped into frames with a length of 512 samples (about 23ms) with 50% overlap. The average energy of each frame's spectrum is subtracted. The 40 Mel frequency bands (in the range of 20Hz to 16kHz) are represented by the first 20 MFCC coefficients. For clustering we use a Gaussian Mixture Model with 30 clusters, trained using expectation maximization (after k-means initialization). The cluster model similarity is computed with Monte Carlo sampling using a fixed sample size. The classifier in the experiments described below computes the distances of each piece in the test set to all pieces in the training set. The genre of the closest neighbor in the training set is used as prediction (nearest neighbor classifier).

Fluctuation Patterns

Fluctuation Patterns (FPs) describe loudness fluctuations in 20 frequency bands [25; 29]. They describe characteristics of the audio signal which are not captured by the spectral similarity measure. First, the audio signal is cut into 6-second sequences. We use the center 2 minutes from each piece of music and cut them into non-overlapping sequences. For each of these sequences a psychoacoustic spectrogram, namely the Sonogram, is computed. For the loudness curve in each frequency band an FFT is applied to describe the amplitude modulation of the loudness. From the FPs we extract two new descriptors. The first one, which we call Focus, describes how distinctive the fluctuations at specific frequencies are. The second one, which we call Gravity, is related to the overall perceived tempo.

Sone

Each 6-second sequence is cut into overlapping frames with a length of 46ms. For each frame the FFT is computed. The frequency bins are weighted according to a model of the outer and middle ear to emphasize frequencies around 3-4kHz and suppress very low or high frequencies. The FFT frequency bins are grouped into frequency bands according to the critical-band rate scale with the unit Bark [40]. A model for spectral masking is applied to smooth the spectrum. Finally, the loudness is computed with a non-linear function. We normalize the loudness of each piece such that the peak loudness is the same for all pieces.

Fluctuation Patterns

Given a 6-second Sonogram we compute the amplitude modulation of the loudness in each of the 20 frequency bands using an FFT. The amplitude modulation coefficients are weighted based on the psychoacoustic model of the fluctuation strength [11]. This modulation has different effects on our hearing sensation depending on the frequency: the sensation of fluctuation strength is most intense around 4Hz and gradually decreases up to a modulation frequency of 15Hz. The FPs analyze modulations up to 10Hz. To emphasize certain patterns, a gradient filter (over the modulation frequencies) and a Gaussian filter (over the frequency bands and the modulation frequencies) are applied. Finally, for each piece the median of all FPs representing a 6-second sequence is computed. This final FP is a matrix with 20 rows (frequency bands) and 60 columns (modulation frequencies). Two pieces are compared by interpreting their FP matrices as 1200-dimensional vectors and computing the Euclidean distance. An implementation of the FPs is available in the MA Toolbox [26]. Figure 2.3 shows some examples of FPs. The vertical lines indicate reoccurring periodic beats. The song Spider by Flex, which is a typical example of the genre eurodance, has the strongest vertical lines.

Focus

The Focus (FP.F) describes the distribution of the energy in the FP. In particular, FP.F is low if the energy is focused in small regions of the FP, and high if the energy is spread out over the whole FP. FP.F is computed as the mean value of all values in the FP matrix, after normalizing the FP such that the maximum value equals 1. The distance between two pieces of music is computed as the absolute difference between their FP.F values. Figure 2.3 shows five example histograms of the values in the FPs and the mean thereof (as a vertical line). Black Jesus by Everlast (belonging to the genre alternative) has the highest FP.F value (0.42). The song has a strong focus on guitar chords and vocals, while the drums are hardly noticeable. The song Spider by Flex (belonging to eurodance) has the lowest FP.F value (0.16). Most of the song's energy is in the strong periodic beats. Figure 2.4 shows the distribution of FP.F over different genres. The values have a large deviation and the overlap between quite different genres is significant. Electronic has the lowest values while punk/metal has the highest. The amount of overlap is an important factor for the quality of the descriptor. As we will see later, in the optimal combination of all similarity sources FP.F has the smallest contribution.
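A minimal sketch of the Focus descriptor as described above (Python/NumPy; `fp` stands for a precomputed 20x60 Fluctuation Pattern matrix, which in the deliverable comes from the MA Toolbox):

```python
import numpy as np

def fp_focus(fp):
    """Focus (FP.F): mean of the FP after scaling its maximum to 1.

    Low values mean the energy is concentrated in a few regions of the FP
    (e.g. strong periodic beats); high values mean it is spread out.
    """
    fp = np.asarray(fp, dtype=float)
    return (fp / fp.max()).mean()

def focus_distance(fp_a, fp_b):
    """Distance between two pieces: absolute difference of their FP.F values."""
    return abs(fp_focus(fp_a) - fp_focus(fp_b))

# Toy example with two random 20x60 "fluctuation patterns"
rng = np.random.default_rng(1)
print(focus_distance(rng.random((20, 60)), rng.random((20, 60)) ** 3))
```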

Figure 2.3: Visualization of the features (cluster model CM, FP, FP.F, and FP.G) for five songs: Take Five (Brubeck et al.), Spider (Flex), Surfin USA (Beach Boys), Crazy (Spears), and Black Jesus (Everlast). On the y-axis of the cluster model (CM) is the loudness (dB-SPL), on the x-axis are the Mel frequency bands; the plots show the 30 centers and their variances on top of each other. On the y-axis of the FP are the Bark frequency bands, on the x-axis is the modulation frequency (in the range 0-10Hz). The y-axis of the FP.F histogram plots shows the counts, the x-axis the values of the FP (from 0 to 1). The y-axis of the FP.G plot is the sum of values per FP column, the x-axis is the modulation frequency (0-10Hz).

Gravity

The Gravity (FP.G) describes the center of gravity (CoG) of the FP on the modulation frequency axis. Given 60 modulation frequency bins (linearly spaced in the range of 0-10Hz), the CoG usually lies between the 20th and the 30th bin and is computed as

CoG = \frac{\sum_j j \sum_i FP_{ij}}{\sum_{ij} FP_{ij}}    (2.14)

where FP is the matrix, i is the index of the frequency band, and j the index of the modulation frequency. We compute FP.G by subtracting the theoretical mean of the fluctuation model (which is around the 31st bin) from the CoG. Low values indicate that the piece might be perceived as slow. However, FP.G is not intended to model the perception of tempo; effects such as vibrato or tremolo are also reflected in the FP. The distance between two pieces of music is computed as the absolute difference between their FP.G values.
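In the same spirit, a sketch of the Gravity descriptor of Equation 2.14 (Python/NumPy; the value 31 for the theoretical mean follows the text above, the rest is illustrative):

```python
import numpy as np

def fp_gravity(fp, theoretical_mean=31.0):
    """Gravity (FP.G): center of gravity of the FP over the modulation axis.

    fp: matrix with frequency bands as rows (index i) and 60 modulation
    frequency bins as columns (index j), cf. Eq. 2.14.
    """
    fp = np.asarray(fp, dtype=float)
    j = np.arange(1, fp.shape[1] + 1)          # modulation-frequency bin index
    cog = (j * fp.sum(axis=0)).sum() / fp.sum()
    return cog - theoretical_mean               # low values: piece may feel slow

def gravity_distance(fp_a, fp_b):
    return abs(fp_gravity(fp_a) - fp_gravity(fp_b))

rng = np.random.default_rng(2)
print(gravity_distance(rng.random((20, 60)), rng.random((20, 60))))
```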

Figure 2.3 also shows the sum of the values in the FP over the frequency bands (i.e. the sum over the rows of the FP matrix) with the CoGs marked by a vertical line. Spider by Flex has the highest value (-5.0), while the lowest value (-6.4) is computed for Take Five by the Dave Brubeck Quartet and Surfin USA by the Beach Boys. Figure 2.4 shows the distribution of FP.G over different genres. The values have a smaller deviation compared to FP.F and there is less overlap between different genres. Classical and a cappella have the lowest values, while electronic, metal, and punk have the highest values.

Figure 2.4: Boxplots showing the distribution of the descriptors FP.F and FP.G per genre on two music collections: (a) DB-MS (classical, electronic, jazz/blues, metal/punk, pop/rock, world) and (b) DB-L (a selection of genres: a cappella, death metal, electronic, jazz, jazz guitar, punk). A description of the collections can be found in Section 2.1. The boxes have lines at the lower quartile, median, and upper quartile values. The whiskers show the extent of the rest of the data (the maximum length is 1.5 times the inter-quartile range). Data beyond the ends of the whiskers are marked with plus signs.

Combination

To combine the distance matrices obtained with the four above-mentioned approaches we use a linear combination, similar to the idea used for the aligned Self-Organizing Maps (SOMs) [28]. Before combining the distances we normalize each of the four distances such that the standard deviation of all pairwise distances within a music collection equals 1. In contrast to the aligned SOMs we do not rely on the user to set the optimum weights for the linear combination; instead we automatically optimize the weights for genre classification.
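A sketch of this combination step (Python/NumPy; the four distance matrices and the weight setting are placeholders, the normalization follows the description above):

```python
import numpy as np

def normalize_distances(d):
    """Scale a distance matrix so the std. dev. of all pairwise distances is 1."""
    d = np.asarray(d, dtype=float)
    off_diag = d[~np.eye(len(d), dtype=bool)]
    return d / off_diag.std()

def combine(distance_matrices, weights):
    """Weighted linear combination of normalized distance matrices."""
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    return sum(w * normalize_distances(d)
               for w, d in zip(weights, distance_matrices))

# Toy example: AP, FP, FP.F, FP.G distances for 50 pieces with one weight setting
rng = np.random.default_rng(3)
mats = [rng.random((50, 50)) for _ in range(4)]
mats = [(m + m.T) / 2 for m in mats]            # make them symmetric
combined = combine(mats, weights=[65, 15, 5, 15])
print(combined.shape)
```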

Genre Classification

We evaluate the genre classification performance on the four music collections to find the optimum weights for the combination of the different similarity sources. We use a nearest neighbor classifier and leave-one-out cross validation for the evaluation. The accuracies are computed as the ratio of correctly classified tracks to the total number of tracks (without normalizing the accuracies with respect to the different class probabilities). Genre classification is not the best choice to evaluate the performance of a similarity measure; however, unlike listening tests, it is very fast and cheap.

In contrast to the ISMIR 2004 genre contest we apply an artist filter. In particular, we ensure that all pieces of an artist are either in the training set or in the test set (a minimal sketch of this filter is given below). Otherwise we would be measuring the artist identification performance, since all pieces of an artist are in the same genre (in all of the collections we use). The resulting performance is significantly worse. For example, on the ISMIR 2004 genre classification training set (using the same algorithm we submitted last year) we get 79% accuracy without the artist filter and only 64% with the artist filter. The difference is even bigger on a large in-house collection where (using the same algorithm) we get 71% without the artist filter and only 27% with it. In the results described below we always use an artist filter if not stated otherwise.

The four music collections we use were described in Section 2.1. In the remainder of this section, results using only one similarity source are presented first. Second, pairwise combinations with spectral similarity (AP) are evaluated. Third, all four sources are combined. Finally, the performance on all collections is evaluated to avoid overfitting.

Individual Performance

The performances using one similarity source are given in Figure 2.5 in the first column (only spectral similarity, AP) and in the last column (only the respective similarity source). AP clearly performs best, followed by FP. The performance of FP.F is extremely poor on DB-S, while it is equal to FP.G on DB-L. For DB-MS without the artist filter we obtain 79% using only AP (this is the same performance as obtained on the ISMIR 04 genre contest test set, which indicates that there was no overfitting on the data). Using only FP we obtain 66% accuracy, which is very close to the 67% Kris West's submission achieved. The accuracy for FP.F is 30% and 43% for FP.G. Always guessing that a piece is classical gives 44% accuracy. Thus, the performance of FP.F is significantly below the random guessing baseline.
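The artist filter mentioned above can be sketched as follows (Python/NumPy; all data are toy placeholders):

```python
import numpy as np

def nn_accuracy_artist_filter(dist, genres, artists):
    """Leave-one-out 1-NN genre classification with an artist filter.

    For each query piece, all pieces by the same artist are excluded from the
    set of potential neighbors, so artist identification cannot inflate the
    genre classification accuracy.
    """
    n = len(genres)
    correct = 0
    for i in range(n):
        allowed = np.flatnonzero(artists != artists[i])   # artist filter
        nearest = allowed[np.argmin(dist[i, allowed])]
        correct += genres[i] == genres[nearest]
    return correct / n

# Toy data: 100 pieces, 25 artists, 5 genres
rng = np.random.default_rng(4)
dist = rng.random((100, 100))
artists = rng.integers(0, 25, size=100)
genres = artists % 5          # all pieces of an artist share one genre
print(nn_accuracy_artist_filter(dist, genres, artists))
```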

Figure 2.5: Results for combining AP with one of the other sources (FP, FP.F, FP.G) on the four collections: (a) DB-S, (b) DB-L, (c) DB-MS, (d) DB-ML. All values are given in percent. The values on the x-axis are the mixing coefficients. For example, the fourth column in the second row is the accuracy for combining 70% AP with 30% FP.F.

Combining Two

The results for combining AP with one of the other sources are given in Figure 2.5. The main findings are that combining AP with FP or FP.G performs better than combining AP with FP.F (except for 10% FP.F and 90% AP on DB-MS). For all collections a combination can be found which improves the performance. However, the improvements on the Magnatune collection are marginal. The smooth changes of the accuracy with respect to the mixing coefficient are an indicator that the approach is relatively robust (within each collection).

Figure 2.6: Results for combining all similarity sources (AP, FP, FP.F, FP.G) on the four collections: (a) DB-S, (b) DB-L, (c) DB-MS, (d) DB-ML. A total of 270 combinations are summarized in each table. All values are given in percent. The mixing coefficients for AP (the first row) are given above the table, for all other rows below. For each entry in the table the highest accuracy over all matching combinations is given. For example, the second row, third column depicts the highest accuracy obtained from all possible combinations with 10% FP; the unspecified 90% can have any combination of mixing coefficients, e.g. 90% AP, or 80% AP and 10% FP.G, etc.

Table 2.4: Overall performance on all collections. The columns give the rank, the mixing weights for AP, FP, FP.F, and FP.G (in percent), the classification accuracies on DB-S, DB-L, DB-MS, and DB-ML (rounded, in percent), and the score.

Combining All

Figure 2.6 shows the accuracies obtained when all similarity sources are combined. There are a total of 270 possible combinations, using a step size of 5 percentage points and limiting the mixing coefficients of the sources other than AP to 0-50% each (with AP making up the remainder). Analogously to the previous results, FP.F has the weakest performance and the improvements for the Magnatune collection are not very exciting. As in Figure 2.5, the smooth changes of the accuracy with respect to the mixing coefficients are an indicator of the robustness of the approach (within each collection). Without the artist filter the combinations on DB-MS reach a maximum of 81% (compared to 79% using only AP).

It is clearly noticeable that the results on the collections are quite different. For example, for DB-S, using as little AP as possible (highest values around 45-50%) and a lot of FP.G (highest values around 25-40%) gives the best results. On the other hand, for the DB-MS collection the best results are obtained using 90% AP and only 5% FP.G. These deviations indicate overfitting; thus we analyze the performances across collections in the next section.

Overall Performance

To study overfitting we compute the relative performance gain compared to the AP baseline (i.e. using only AP). We compute the score (which we want to maximize) as the average of these gains over the four collections.
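A sketch of this scoring (Python/NumPy; the accuracy values are placeholders, not the numbers from Table 2.4):

```python
import numpy as np

def score(accuracies, baselines):
    """Average relative gain over the AP-only baseline across collections."""
    accuracies = np.asarray(accuracies, dtype=float)
    baselines = np.asarray(baselines, dtype=float)
    return np.mean(accuracies / baselines - 1.0)

# Placeholder accuracies for one weight combination on DB-S, DB-L, DB-MS, DB-ML
print(score([0.58, 0.32, 0.66, 0.40], baselines=[0.52, 0.27, 0.64, 0.38]))
```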

The results are given in Table 2.4. The worst combination (using 50% AP and 50% FP.F) yields a score of -0.15 (that is, on average, the accuracy using this combination is 15% lower compared to using 100% AP). There are a total of 247 combinations which perform better than the AP baseline. Almost all of the 22 combinations that fall below AP have a large contribution of FP.F. The best score is 14% above the baseline. The ranges of the top 10 ranked combinations are 55-75% AP, 5-20% FP, 5-10% FP.F, and 10-30% FP.G.

Without the artist filter, for DB-MS the top three ranked combinations from Table 2.4 have the accuracies 1: 79%, 2: 78%, 3: 79% (the AP baseline is 79%, the best possible combination yields 81%). For the DB-S collection without the artist filter the AP baseline is 52% and the top three ranked combinations have the accuracies 1: 63%, 2: 61%, 3: 62% (the best possible score achieved through combination is 64%). This is another indication that genre classification and artist identification are not the same type of problem. Thus, it is necessary to ensure that all pieces from an artist (if all pieces from an artist belong to the same genre) are either in the training or in the test set.

Figure 2.7: Individual relative performance for each collection (DB-S, DB-L, DB-MS, DB-ML) and their average, ranked (x-axis) by score (y-axis).

Figure 2.7 shows the relative performance of all combinations ranked by their score. As can be seen, there are significant deviations. In several cases a combination performs well on one collection and poorly on another.

This indicates that there is a large potential for overfitting if the necessary precautions are not taken (such as using several different music collections). However, another observation is that, although there is a high variance, the performance stays above the baseline for most of the combinations and there is a common trend. Truly reliable results would require further testing on additional collections.

Conclusions

In this section we have presented an approach to improve audio-based music similarity and genre classification. We have combined spectral similarity with three additional information sources based on Fluctuation Patterns. In particular, we have presented two new descriptors and a series of experiments evaluating the combinations. Although we obtained an average performance increase of 14%, our findings confirm the glass ceiling observed in [2]. Preliminary results with a larger number of descriptors indicate that the performance per collection can only be further improved by up to 1-2 percentage points. However, the danger of overfitting is imminent.

Our results show that there is a significant difference in the overall performance if pieces from the same artist are in both the test and training sets. We believe this shows the necessity of using an artist filter to evaluate genre classification performance (if all pieces from an artist are assigned to the same genre). Furthermore, the deviations between the collections suggest that it is necessary to use different collections to avoid overfitting.

One possible future direction is to focus on developing similarity measures for specific music collections (analogously to developing specialized classifiers able to distinguish only two genres). However, combining audio-based approaches with information from different sources (such as the web), or modeling the cognitive process of music listening, are more likely to help us get beyond the glass ceiling.

2.4 Summary & Recommendations

In this chapter we have followed two paths. The motivation for the first one is that spectral similarity as we use it does not capture many aspects of the audio signal which are very important for the perception of timbre (such as the attack or decay). Although we were able to show that using HMMs allows us to better model a song, we do not recommend their use in the SIMAC prototypes, primarily because of the drastic increase in computation time. Furthermore, in terms of genre classification (which is not the best choice for evaluation) the performance does not improve significantly. However, applying HMMs to model temporal aspects for spectral similarity appears to be an interesting direction for future research.

The second path we followed in this chapter was to combine what we knew works best with other approaches. As a result we have found a combination which significantly improves the results on some of the collections we used for evaluation. We recommend the usage of this combination, as described in detail above, for the subsequent deliverables and the prototypes. The implementation of the fluctuation patterns and the spectral similarity is available in the MA Toolbox for Matlab.

3 Web-Based Similarity

In this chapter we propose an alternative, which we have published in [33], to the web-based similarity measure described in detail in the previous deliverables. The similarity measure operates on artist names, based on the results of Google queries. Co-occurrences of artist names on web pages are analyzed to measure how often two artists are mentioned together on the same web page. We estimate conditional probabilities using the extracted page counts. These conditional probabilities give a similarity measure which is evaluated using a data set containing 224 artists from 14 genres. For evaluation, we use two different methods: intra-/intergroup similarities and k-nearest-neighbor classification. Furthermore, a confidence filter and combinations of the results gained from three different query settings are tested. It is shown that these enhancements can raise the performance of the web-based similarity measure. Comparing the results to those of similar approaches shows that our approach, though being quite simple, performs well and can be used as a similarity measure that incorporates social knowledge.

The approach is similar to the one presented in [38]. The main difference is that we calculate the complete distance matrix. This offers additional information since we can also predict which artists are not similar. Such information is necessary, for example, when it comes to creating playlists that incorporate a broad variety of different music styles. Moreover, in [38], artists are extracted from Listmania!, which uses the database of the web shop Amazon. The number of artists in this database is obviously smaller than the number of artist-related web pages indexed by Google; for example, most local artists or artists without a record deal are not contained. Thus, the approach of [38] cannot be used for such artists.

A shortcoming of the co-occurrence approach is that creating a complete distance matrix has quadratic computational complexity in the number of artists. Despite this fact, the approach is quite fast for small- and medium-sized collections with some hundreds of artists, since it is very simple and does not rely on extracting and weighting hundreds of thousands of words like the tf-idf approach of [18]. Moreover, using heuristics could reduce the computational complexity.

3.1 Web Mining by Co-occurrence Analysis

Since our similarity measure is based on artist co-occurrences, we need to count how often artist names are mentioned together on the same web page. To obtain these page counts, the search engine Google was used. Google has been chosen for the experiments because it is the most popular search engine at the moment. Furthermore, investigations of different search engines showed that Google yields the best results for musical web crawling [18].

Given a list of artist names, we use Google to estimate the number of web pages containing each artist and each pair of artists. Since we are not interested in the content of the found web pages, but only in their number, the search is restricted to display only the top-ranked page. In fact, the only information we use is the page count that is returned by Google. This raises performance and limits web traffic. The outcome of this procedure is a symmetric matrix C, where element c_ij gives the number of web pages containing the artist with index i together with the one indexed by j. The diagonal elements c_ii give the total number of web pages containing artist i.

Based on the page count matrix C, we then use relative frequencies to calculate a conditional probability matrix P as follows. Given two events a_i (artist with index i is mentioned on a web page) and a_j (artist with index j is mentioned on a web page), we estimate the conditional probability p_ij (the probability for artist j to be found on a web page that is known to contain artist i) as shown in Formula 3.1.

p_ij = p(a_j | a_i) = c_ij / c_ii    (3.1)

Obviously, P is not symmetric. Since we need a symmetric similarity function in order to use k-NN, we compute a symmetric equivalent P_s by simply calculating the arithmetic mean of p_ij and p_ji for every pair of artists i and j.
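The following minimal sketch shows how the conditional probability matrix P and its symmetric version P_s can be derived from a page-count matrix C; the page counts themselves are assumed to have been retrieved beforehand by issuing the queries and reading the reported result counts.

import numpy as np

def cooccurrence_similarity(C):
    """Turn a page-count matrix C into the conditional probability matrix P
    and its symmetric equivalent P_s.

    C[i, j] = number of pages mentioning artists i and j together;
    C[i, i] = number of pages mentioning artist i at all.
    """
    C = np.asarray(C, dtype=float)
    diag = np.diag(C).copy()
    diag[diag == 0] = 1.0              # avoid division by zero for unknown artists
    P = C / diag[:, np.newaxis]        # p_ij = c_ij / c_ii (Formula 3.1)
    P_s = 0.5 * (P + P.T)              # symmetric version used for k-NN
    return P, P_s

# Example with three artists (page counts are made up for illustration):
C = np.array([[1000,  40,   2],
              [  40, 500,   5],
              [   2,   5, 800]])
P, P_s = cooccurrence_similarity(C)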

Addressing the problem of finding only music-related web pages, we used three different query settings:

"artist1" "artist2" music
"artist1" "artist2" music review
allintitle: "artist1" "artist2"

The first one, in the following abbreviated as M, searches only for web pages containing the two artist names as exact phrases and the word music. The second one, which has already been used in [37], restricts the search to pages containing the additional terms music and review. This setting, abbreviated as MR, was used to compare our results to those of [18]. The third setting (allintitle) only takes into consideration web pages containing the two artists in their title. It is the most limiting setting, and the resulting page count matrices are quite sparse. However, our evaluation showed that this setting performs quite well on the k-NN classification task and can be used successfully in combination with M or MR.

3.2 Experiments and Evaluation

We conducted our experiments on the data set already used in [18]. It comprises 14 quite general and well-known genres with 16 assigned artists each. A complete list can be found on the Internet (footnote 1: list 224.pdf). Two different evaluation methods were used: ratios between intra- and intergroup-similarities and hold-out experiments using k-NN classification.

3.2.1 Intra-/Intergroup-Similarities

This evaluation method is used to estimate how well the given genres are distinguished by our similarity measure P. For each genre, the ratio between the average intragroup-probability and the average intergroup-probability is calculated. The higher this ratio, the better the differentiation of the respective genre. The average intragroup-probability for a genre g is the probability that two arbitrarily chosen artists a and b from genre g co-occur on a web page that is known to contain either artist a or b. The average intergroup-probability for a genre g is the probability that two arbitrarily chosen artists a (from genre g) and b (from any other genre) co-occur on a web page that is known to contain either artist a or b. Thus, the average intragroup-probability gives the probability that two artists from the same genre co-occur. The average intergroup-probability gives the probability that an artist from genre g co-occurs with an artist not from genre g.

Let A be the set of all artists and A_g the set of artists assigned to genre g. Formally, the average intra- and intergroup-probabilities are given by Equations 3.2 and 3.3, where |A_g| is the cardinality of A_g and A \ A_g is the set A without the elements contained in A_g.

intra_g = ( Σ_{a1 ∈ A_g} Σ_{a2 ∈ A_g, a2 ≠ a1} p_{a1 a2} ) / ( |A_g|^2 - |A_g| )    (3.2)

inter_g = ( Σ_{a1 ∈ A_g} Σ_{a2 ∈ A \ A_g} p_{a1 a2} ) / ( |A \ A_g| · |A_g| )    (3.3)
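A small sketch of how these ratios can be computed from the probability matrix (the artist-to-genre assignments and the matrix itself are assumed to be given):

import numpy as np

def intra_inter_ratio(P, genre_of, g):
    """Compute intra_g, inter_g (Equations 3.2 and 3.3) and their ratio.

    P        : artist-to-artist probability matrix (P or its symmetric version P_s)
    genre_of : array with the genre label of each artist
    g        : the genre to evaluate
    """
    genre_of = np.asarray(genre_of)
    in_g = np.where(genre_of == g)[0]
    out_g = np.where(genre_of != g)[0]

    # average probability over all ordered pairs of distinct artists within genre g
    sub = P[np.ix_(in_g, in_g)]
    intra = (sub.sum() - np.trace(sub)) / (len(in_g) ** 2 - len(in_g))
    # average probability that an artist from g co-occurs with an artist outside g
    inter = P[np.ix_(in_g, out_g)].sum() / (len(in_g) * len(out_g))
    return intra, inter, intra / inter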

Obviously, the ratio intra_g / inter_g should be at least greater than 1.0 if the similarity measure is to be of any use.

Results and Discussion

Table 3.1 shows the results of evaluating our co-occurrence approach with this first evaluation method. It can be seen that the allintitle-setting yields the best results, as the average intergroup-similarities are very low. Hence, nearly no artists from different genres occur together in the title of the same web page. Especially for the genres Jazz and Classical, the results are excellent. However, for Alternative Rock/Indie and Electronica, the ratios are quite low. This can be explained by the low average intragroup-similarities for these genres. Thus, artists belonging to these genres are seldom mentioned together in titles. Analyzing the page count matrices revealed that the allintitle-setting yields good results if web pages containing artists from the same genre in their title are found. If not, the results are obviously quite bad. This observation motivated us to conduct experiments with confidence filters and combinations of the allintitle-setting with M and MR. These experiments are described in detail in the next section. Moreover, Table 3.1 shows that, aside from Classical, Blues is distinguished quite well. Also remarkable is the very bad result for Folk music in the MR-setting. This may be explained by intersections with other genres, e.g. Country.

The approach presented in [39] was tested on the list of artists already used in [18]. The results, which are visualized in Table 3.2, are slightly worse than the results using our approach on the same data set. An explanation for this is that we use an asymmetric similarity measure that, for each pair of artists (artist1 and artist2), incorporates probability estimations for artist1 being mentioned on web pages containing artist2 and for artist2 appearing on web pages of artist1. This additional information is lost when using the normalization method proposed in [39].

In Table 3.3, the evaluation results for the approach of [18], again using exactly the same list of artists, are depicted. To obtain them, the distances between the feature vectors gained from the tf idf calculations are computed for every pair of artists. This gives a complete similarity matrix. Since most of the query settings used in [18] differ from ours, we can only compare the results of the MR-setting. Taking a closer look at the results shows that tf idf performs better for eight genres, while our approach performs better for six genres. However, the mean of the ratios is better for our approach because of the high value for the genre Classical. A possible explanation is that web pages concerning classical artists often also contain words which are used on pages of other genres' artists. In contrast, classical artist names seem to be mentioned only together

with other artists belonging to the same genre, which is reflected by the very high ratios of our approach for this genre.

Figure 3.1: Accuracies in percent for single and combined similarity measures using 9-NN t15-validation and the confidence filter. The combined results are depicted as dotted lines. It is remarkable that the high values for the allintitle-accuracies come along with up to 18% of unpredictable artists. All other measures (single and combined) leave no data items unpredicted.

3.2.2 Classification with k-Nearest Neighbors

The second set of evaluation experiments was conducted to show how well our similarity measure works for classifying artists into genres. For this purpose, the widely used technique of k-Nearest Neighbors was chosen. This technique simply uses, for prediction, the k data items that have minimal distance to the item that is to be classified. The most frequent class among these k data items is predicted for the unclassified data item. For the partitioning of the complete data set into training set and test set, we used different settings, referred to as tx, where x is the number of data items from each genre that are assigned to the training set. In a t15-setting, for example, 15 artists from each genre are used for training and one remains for testing.

Figure 3.2: Accuracies in percent for different combinations of the three settings (allintitle, M, MR) and different training set sizes. 9-NN classification was used.

For measuring the distances between two data items, we use the similarities given by the symmetric probability matrix P_s. We ran all experiments several times to minimize the influence of statistical outliers on the overall results. The accuracy, in the following used for measuring performance, is defined as the percentage of correctly classified data items over all classified data items in the test set. Since the usage of confidence filters may result in unclassified data items, we introduce the prediction rate, which we define as the percentage of classified data items in the complete test set.

In a first test with setting t8, k-NN with k = 9 performed best, so we simply used 9-NN for classification in the subsequent experiments. It is not surprising that values around 8 perform best in a t8-setting, because in this case the number of data items from the training set that are used for prediction equals the number of data items chosen from each class to represent the class in the training set. The t8-setting without any confidence filter gives accuracies of about 69% for M, about 59% for MR, and about 74% for allintitle. Using setting t15, these results can be improved for M (about 75% using 9-NN) and for allintitle (about 80% using 6-NN). For MR, no remarkable improvement could

be achieved.

Figure 3.3: Accuracy plotted against prediction rate for different training set sizes and 9-NN classification. Only the uncombined allintitle-setting was used for this plot.

In the case that no confidence filter is used, as in the first tests described above, a random genre is predicted for the artist to be classified if his/her similarity to all artists in the training set is zero. Due to the sparseness of its similarity matrix, this problem mainly concerns the allintitle-measure. To overcome the problem and benefit from the good performance of the allintitle-measure while also addressing the sparseness of the respective similarity matrix, we tried out some confidence filters to combine the similarity measures that use the three different query settings. The basic idea is to use the allintitle-measure if the confidence in its results is high enough. If not, the M- or MR-measure is used to classify an unknown data item. We experimented with confidence filters using mathematical properties of the distances between the unclassified data item and its nearest neighbors. The best results, however, were achieved with a very simple approach based on counting the number of elements with a probability/similarity of zero in the set of the nearest neighbors. If this number exceeds a given threshold, the respective data item is not classified with the allintitle-measure, but the M- or MR-measure is used instead.
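A minimal sketch of this confidence-filtered classification, assuming the symmetric similarity matrices for the allintitle and fallback settings as well as the genre labels are given (function and variable names are illustrative only):

import numpy as np
from collections import Counter

def knn_with_confidence_filter(sims_allintitle, sims_fallback, labels,
                               train_idx, query_idx, k=9, threshold=2):
    """9-NN genre prediction with a simple zero-similarity confidence filter.

    If more than `threshold` of the k nearest training items have zero
    similarity to the query under the allintitle measure, fall back to the
    M (or MR) measure instead.
    """
    def predict(sims):
        s = sims[query_idx, train_idx]
        nearest = np.argsort(-s)[:k]              # k most similar training items
        votes = Counter(labels[train_idx[i]] for i in nearest)
        return votes.most_common(1)[0][0], s[nearest]

    genre, neighbor_sims = predict(sims_allintitle)
    if np.sum(neighbor_sims == 0) > threshold:    # not enough allintitle evidence
        genre, _ = predict(sims_fallback)
    return genre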

Figure 3.4: Confusion matrix for the results averaged over all runs using 9-NN t15-validation. The confidence filter was applied to the allintitle-setting. The values are the average accuracies in percent.

Using this method, only artists that co-occur at least with some others in the title of some web pages are classified with allintitle. On the other hand, if not enough information for a certain artist is available in the allintitle-results, MR or M is used instead. These two measures usually give enough information for prediction. Indeed, their prediction rates equal 100% for the data set used for our evaluations. This is also manifested in Figure 3.1, which shows that the accuracies for MR and M are independent of the threshold for the confidence filter.

Results and Discussion

We already mentioned the classification accuracies of up to 80% for uncombined measures. Since we wanted to analyze to what extent the performance can be improved when using combinations, we conducted t15-validations using either a single measure or

combinations of allintitle with MR and M. The results are shown in Figure 3.1. Along the abscissa, the influence of different thresholds for the confidence filter can be seen. The falling accuracies for allintitle with rising threshold values confirm our assumption that the performance of the allintitle-measure depends strongly on the availability of enough information. It is important to note that the uncombined allintitle-measure does not always make a prediction when using the confidence filter, cf. also Figure 3.3. Remarkable are the very high accuracies (fraction of correctly classified artists among classifiable artists) of up to 89.5% for allintitle with a threshold value of 2. However, in this setting, 14% of the artists cannot be classified. Taking a closer look at the MR- and M-settings shows that they reach accuracies of about 54% and 75% respectively, and that these results are independent of the threshold for the confidence filter. In fact, MR and M, at least for the used data set, always provide enough information for prediction. Combining the measures by taking allintitle as the primary one and, if no prediction with it is possible, MR or M as fallback also combines the advantages of high accuracies and high prediction rates. Indeed, using the combination allintitle+M gives accuracies of 85% at 100% prediction rate. Since the accuracies for M are much higher than for MR, the combination of allintitle with M yields better results than with MR. Compared to the k-NN results of [18], these accuracies are at least equal, although the co-occurrence approach is much simpler than the tf idf approach. However, the single MR-setting performs quite poorly with our approach. This can be explained by the fact that web pages containing music reviews seldom mention other artists, but usually compare a new album to earlier ones by the same artist.

In addition, we were interested in the number of artists needed to define a genre adequately. For this reason, we ran some experiments using different training set sizes. In Figure 3.2, the results of these experiments for 9-NN classification using the combinations allintitle+M and allintitle+MR are depicted. It was observed that t15 and t8 again provide very high accuracies of up to 85% and 78% respectively. Examining the results of the t4- and t2-settings reveals much lower accuracies. These results are remarkably worse than those of [18] for the same settings (61% for t4 with our approach using 9-NN vs. 76% with the tf idf approach using 7-NN and the additional search keywords music genre style; 35% for t2 with our approach vs. 43% with the tf idf approach using 7-NN and the same additional keywords). In these two settings, the additional information used by the tf idf approach seems to be highly valuable. As a final remark on Figure 3.2, we want to point out that the prediction rate for all depicted experiments is 100%.

As already mentioned, the uncombined allintitle-setting using the confidence filter does not always yield a prediction. To analyze the trade-off between accuracy and prediction rate, we plotted these properties for the allintitle-setting in Figure 3.3. This

figure shows that, in general, an increase in accuracy goes along with a decrease in prediction rate. However, an increase in prediction rate accompanied by a slight increase in accuracy, which yields the maximum accuracy values, can be seen at the beginning of each plot. The highest accuracies obtained for the different settings are 89% for t15 (86% prediction rate), 84% for t8 (59% prediction rate), 64% for t4 (34% prediction rate), and 35% for t2 (10% prediction rate). These maximum accuracy values are usually achieved with a threshold of 1 or 2 for the confidence filter. It seems that restricting the number of allowed zero-distance elements in the set of the nearest neighbors to 0 is counterproductive, since it decreases the prediction rate without increasing the accuracy.

Finally, to investigate which genres are likely to be confused with others, we calculated a confusion matrix, cf. Figure 3.4. It can be seen that the genres Jazz, Blues, Reggae, and Classical are perfectly distinguished. Heavy Metal/Hard Rock, Electronica, and Rock n Roll also show very high accuracies of about 95%. For Country, Folk, RnB/Soul, Punk, Rap, and Pop, accuracies between 83% and 89% are achieved. In comparison with the results of [18], where Pop achieved only 80%, we reach 88% for this genre. In contrast, our results for the genre Alternative Rock/Indie are very bad (about 50%). A more precise analysis reveals that this genre is often confused with Electronica, which may be explained by some artists producing music of different styles (over time), like Depeche Mode in Alternative Rock/Indie or Moby and Massive Attack in Electronica. Depeche Mode, for example, was a pioneer of Synthesizer-Pop in the 1980s.

3.3 Conclusions & Recommendations

In this section we presented an artist similarity measure based on co-occurrences of artist names on web pages. We used three different query settings (M, MR, and allintitle) to retrieve page counts from the search engine Google. Experiments showed that the allintitle-setting provides high accuracies with k-Nearest Neighbors classification. High prediction rates, however, are achieved with the M-setting. In order to exploit the advantages of both settings, the two measures were combined using a simple threshold-based confidence filter. We showed that this combination gives accuracies of up to 85% at 100% prediction rate (no unclassified artists). These results are at least equal to those presented in [18] when using a sufficient number of training samples from each genre. In [18], however, a much more complex approach, tf idf, is used. For scenarios with only very few artists available to define a genre, the tf idf approach performs better due to its extensive use of additional information. In contrast, less information is used in the approach presented in [39]. Our approach differs from that of Zadel and Fujinaga, among other things, in that they use a symmetric similarity measure and a different

normalization method. As a result, their approach performs slightly worse than ours.

Further research may focus on the combination of web-based and signal-based data to raise the performance of similarity measures, or to enrich signal-based approaches with cultural metadata from the Internet. Since the data set used for evaluation contains quite general genres and well-known artists, it would be interesting to test our approach on a more specific data set with a more fine-grained genre taxonomy. Finally, heuristics that reduce the computational complexity of our approach should be tested. This would enable us to also process large artist lists.

For the SIMAC prototypes we recommend, depending on the number of artists in the collection, either the tf idf approach (if the number of artists is beyond 100) or the simpler co-occurrence approach (if it is below). If it is not clear whether the number of artists will increase at a later point in time, preference should be given to tf idf.

Table 3.1: Results of the evaluation of intra-/intergroup-similarities using our co-occurrence measure. On the left, the results for the queries using the additional keywords +music+review are shown. The middle columns show the results for the queries with additional +music. The rightmost columns show the results for the queries only taking into account web pages with the artists in their title. For each genre (Country, Folk, Jazz, Blues, RnB/Soul, Heavy Metal/Hard Rock, Alternative Rock/Indie, Punk, Rap/Hip-Hop, Electronica, Reggae, Rock n Roll, Pop, Classical, and the mean), the average intragroup-probability, the average intergroup-probability, and the ratio between these two probabilities are given. The higher the ratio, the better the differentiation of the respective genre.

Table 3.2: Results of the evaluation based on intra-/intergroup-similarities using the relatedness measure according to [39], for the same query settings and genres as in Table 3.1.

Table 3.3: Results of the evaluation based on intra-/intergroup-similarities using the tf idf approach according to [18] (keywords music, review), for the same genres as in Table 3.1.

4. Novelty Detection and Similarity

This chapter presents novelty detection as a tool in MIR to improve the performance of similarity measures. The work presented in this chapter has been submitted to a conference [13].

Novelty detection is the identification of new or unknown data that a machine learning system is not aware of during training (see [22] for a review). It is a fundamental requirement for every good machine learning system to automatically identify data from regions not covered by the training data, since in this case no reasonable decision can be made. In the field of music information retrieval the problem of novelty detection has so far been ignored. For music information retrieval, the notion of central importance is musical similarity. Proper modeling of similarity enables automatic structuring and organization of large collections of digital music, and intelligent music retrieval in such structured music spaces. This can be utilized for numerous different applications: genre classification, playlist generation, music recommendation, etc. What all these different systems lack so far is the ability to decide when a new piece of data is too dissimilar for making a decision. Consider, for example, the following user scenario: a user has on her hard drive a collection of songs classified into the three genres "hip hop", "punk" and "death metal"; given a new song from a genre not yet covered by the collection (say, a reggae song), the system should mark this song as novel, and therefore as needing manual processing, instead of automatically and falsely classifying it into one of the three already existing genres (e.g. "hip hop"). Another example is the automatic exclusion of songs from playlists because they do not fit the overall flavor of the majority of the list. Novelty detection could also be utilized to recommend new types of music different from a given collection if users are longing for a change.

4.1 Data

For the experiments presented in this chapter we used the DB-ML collection as described in Section 2.1. From the 22050Hz mono audio signals, two minutes from the center of each song are used for further analysis. We divide the raw audio data into overlapping frames of short duration and use Mel Frequency Cepstrum Coefficients (MFCCs) to represent the spectrum of each frame. The frame size for the computation of the MFCCs in our experiments was 23.2ms (512 samples), with a hop size of 11.6ms (256 samples) for the overlap of frames. The average energy of each frame's spectrum was subtracted. We used the first 20 MFCCs for all our experiments.
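As an illustration, the following sketch computes such an MFCC representation with the librosa library; it is only an approximation of the described setup, since the exact implementation used for the experiments (mel filter bank, energy subtraction) may differ.

import librosa
import numpy as np

def extract_mfccs(path):
    """Load a mono 22050 Hz signal, keep two minutes around the center,
    and compute 20 MFCCs with a 512-sample window and 256-sample hop."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    center = len(y) // 2
    half = 60 * sr                      # one minute on each side of the center
    y = y[max(0, center - half):center + half]
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                 n_fft=512, hop_length=256)
    # subtracting the mean of the first coefficient roughly corresponds to
    # removing the average frame energy (an assumption, not the exact step used)
    mfccs[0] -= mfccs[0].mean()
    return mfccs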

4.2 Methods

4.2.1 Music Similarity

The approach presented in this chapter can be applied to any type of music similarity. For the experiments presented here we use the spectral similarity as described in Chapter 2.

4.2.2 Algorithms for novelty detection

Ratio-reject: The first reject rule is based on density information about the training data captured in the similarity matrix. An indication of the local densities can be gained by comparing the distance between a test object X and its nearest neighbor in the training set, NN^tr(X), with the distance between this NN^tr(X) and its own nearest neighbor in the training set, NN^tr(NN^tr(X)) [35]. The object is regarded as novel if the first distance is much larger than the second distance. Using the ratio

ρ(X) = d(X, NN^tr(X)) / d(NN^tr(X), NN^tr(NN^tr(X)))    (4.1)

we reject X if

ρ(X) > E[ρ(X^tr)] + s · std(ρ(X^tr))    (4.2)

with E[ρ(X^tr)] being the mean of all quotients ρ(X^tr) inside the training set and std(ρ(X^tr)) the corresponding standard deviation (i.e. we assume that the ρ(X^tr) have a normal distribution). Parameter s can be used to change the probability threshold for rejection. Setting s = 3 means that we reject a new object X if its ratio ρ(X) is larger than the mean ρ within the training set plus three times the corresponding standard deviation. In this case a new object is rejected because the probability of its distance ratio ρ(X) is less than 1% when compared to the distribution of ρ(X^tr). Setting s = 2 rejects objects less probable than 5%, s = 1 less probable than 32%, etc.
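A minimal sketch of the Ratio-reject rule, assuming a full distance matrix between training objects and a distance vector from the test object to all training objects:

import numpy as np

def ratio_reject(dist_train, dist_test_to_train, s=2.0):
    """Ratio-reject novelty test (Equations 4.1 and 4.2).

    dist_train         : (n, n) distance matrix between training objects
    dist_test_to_train : (n,) distances from the test object to the training objects
    Returns True if the test object should be rejected as novel.
    """
    n = dist_train.shape[0]
    D = dist_train.astype(float).copy()
    np.fill_diagonal(D, np.inf)          # ignore self-distances

    nn = D.argmin(axis=1)                # nearest training neighbor of each training object
    d_to_nn = D[np.arange(n), nn]        # d(x, NN(x)) for every training object x
    rho_train = d_to_nn / d_to_nn[nn]    # d(NN(x), NN(NN(x))) = d_to_nn[nn[x]]

    j = dist_test_to_train.argmin()      # nearest training neighbor of the test object
    rho_test = dist_test_to_train[j] / d_to_nn[j]

    return rho_test > rho_train.mean() + s * rho_train.std()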

Knn-reject: It is possible to directly use nearest neighbor classification to reject new data with a higher risk of being misclassified [17]: reject X if not

g(NN1^tr(X)) = g(NN2^tr(X)) = ... = g(NNk^tr(X))    (4.3)

with NNi^tr(X) being the i-th nearest neighbor of X in the training set, g() a function which gives the genre information for a song, and i = 1, ..., k. A new object X is rejected if the k nearest neighbors do not agree on its classification. This approach will work for novelty detection if new objects X induce high confusion in the classifier. The higher the value of k, the more objects will be rejected.

4.3 Results

To evaluate the two novelty detection approaches described in Section 4.2.2 we use the evaluation procedure shown as pseudo-code in Table 4.1. First we set aside all songs belonging to a genre g as new songs ([new,data]=separate(alldata,g)), which yields data sets new and data (all songs not belonging to genre g). Then we do a ten-fold cross-validation using data and new: we randomly split data into train and test folds ([train,test] = split(data,c)), with train always consisting of 90% and test of 10% of data. We compute the percentage of new songs which are rejected as being novel (novel_reject(g,c) = novel(new)) and do the same for the test songs (test_reject(g,c) = novel(test)). Finally we compute the accuracy of the nearest neighbor classification on test data that has not been rejected as being novel (accuracy(g,c) = classify(test(not test_reject))). The evaluation procedure gives G × C (22 × 10) matrices of novel_reject, test_reject and accuracy for each parameterization of the novelty detection approaches.

Table 4.1: Outline of Evaluation Procedure

for g = 1 : G
    [new,data] = separate(alldata,g)
    for c = 1 : 10
        [train,test] = split(data,c)
        novel_reject(g,c) = novel(new)
        test_reject(g,c)  = novel(test)
        accuracy(g,c)     = classify(test(not test_reject))
    end
end
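The following sketch is a Python rendering of this evaluation loop, using the Knn-reject rule as the novelty test; the split, distance handling, and classifier are simplified placeholders rather than the exact implementation used for the experiments.

import numpy as np

def knn_reject(dist_to_train, train_labels, k=3):
    """Knn-reject (Equation 4.3): reject if the k nearest training songs disagree."""
    nearest = np.argsort(dist_to_train)[:k]
    return len(set(train_labels[nearest])) > 1

def nn_classify(dist_to_train, train_labels):
    return train_labels[np.argmin(dist_to_train)]

def evaluate(D, labels, genres, k=3, folds=10, seed=0):
    """Leave-one-genre-out evaluation as outlined in Table 4.1.
    D is a full song-to-song distance matrix, labels gives the genre of each song."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    results = []
    for g in genres:
        new_idx = np.where(labels == g)[0]           # songs of the held-out genre
        data_idx = np.where(labels != g)[0]
        for c in range(folds):
            perm = rng.permutation(data_idx)
            n_test = len(perm) // 10                 # 10% test, 90% train
            test_idx, train_idx = perm[:n_test], perm[n_test:]
            tl = labels[train_idx]
            novel_rej = np.mean([knn_reject(D[i, train_idx], tl, k) for i in new_idx])
            test_rej = np.array([knn_reject(D[i, train_idx], tl, k) for i in test_idx])
            kept = test_idx[~test_rej]
            acc = (np.mean([nn_classify(D[i, train_idx], tl) == labels[i] for i in kept])
                   if len(kept) else np.nan)
            results.append((g, c, novel_rej, test_rej.mean(), acc))
    return results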

The results for novelty detection based on the Ratio-reject and the Knn-reject rules are given in Figs. 4.1 and 4.2 as Receiver Operating Characteristic (ROC) curves [24]. To obtain an ROC curve, the fraction of false positives (object is not novel but is rejected, in our case test_reject) is plotted versus the fraction of true positives (object is novel and correctly rejected, in our case novel_reject). An ROC curve shows the trade-off between how sensitive and how specific a method is. Any increase in sensitivity will be accompanied by a decrease in specificity. If a method becomes more sensitive towards novel objects it will reject more of them, but at the same time it will also become less specific and falsely reject more non-novel objects. Consequently, the closer a curve follows the left-hand border and then the top border of the ROC space, the more accurate the method is. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the method. We plot the mean test_reject versus the mean novel_reject for falling values of s (Ratio-reject) and growing values of k (Knn-reject). In addition, the mean accuracy for each of the different values of s and k is depicted as a separate curve. All means are computed across all corresponding values. The accuracy without any rejection due to novelty detection is 70%.

Figure 4.1: Ratio-reject ROC, mean test_reject vs. novel_reject (circles, solid line) and accuracy (diamonds, broken line) for no rejection and s=5,3,2,1,0.

Ratio-reject: The results for novelty detection based on the Ratio-reject rule are given in Fig. 4.1. With the probability threshold for rejection set to s = 2 (rejection because the data is less probable than 5%), the accuracy rises up to 79%, while 19% of the test songs are falsely rejected as being novel and therefore not classified at all, and 42% of the new songs are rejected correctly. If one is willing to lower the threshold to s = 0 (rejection because the data is less probable than 50%), the accuracy is at 92%, with already 49% of the test songs rejected erroneously and 84% of the new songs rejected

correctly.

Figure 4.2: Knn-reject ROC, mean test_reject vs. novel_reject (circles, solid line) and accuracy (diamonds, broken line) for k=1 (no rejection) and k=2,3,4,5,6,7,8,9,10,20.

Knn-reject: The results for novelty detection based on the Knn-reject rule are given in Fig. 4.2. If k is set to 2, the accuracy rises up to 89%, while 35% of the test songs are wrongly rejected as being novel and therefore not classified at all, and 65% of the new songs are rejected correctly. With k = 3 the accuracy values start to saturate at 95%, with already 49% of the test songs rejected erroneously and 81% of the new songs rejected correctly.

4.4 Discussion

We have presented two approaches to novelty detection, where the first (Ratio-reject) is based directly on the distance matrix and does not, contrary to Knn-reject, need the genre labels. When comparing the two ROC curves given in Figs. 4.1 and 4.2 it can be seen that both approaches work approximately equally well. E.g., the performance of the Ratio-reject rule with s = 1 resembles that of the Knn-reject rule with k = 2. The same holds for s = 0 and k = 3. Also the increase in accuracy is comparable for both methods. Depending on how much specificity one is willing to sacrifice, the accuracy can be increased from 70% to well above 90%. Looking at both ROC curves, we would like to state that they indicate quite fair accuracy of both novelty detection methods.

When judging genre classification results, it is important to remember that the human

error in classifying some of the songs already gives rise to a certain percentage of misclassification. Inter-rater reliability between a number of music experts is usually far from perfect for genre classification. Given that the genres for our data set are user and not expert defined, and therefore even more problematic, it is not surprising that there is a considerable decrease in specificity for both methods.

Of course there is still room for improvement in novelty detection for music similarity. The two presented methods are a first attempt to tackle the problem and could probably be improved themselves. One could change the Knn-reject rule given in Equation 4.3 by introducing a weighting scheme which puts more emphasis on closer than on distant neighbors. Then there is a whole range of alternative methods which could be explored: probabilistic approaches (see e.g. [7]), Bayesian methods [21] and neural network based techniques (see [22] for an overview).

Finally we would like to comment that whereas the Knn-reject rule is bound to the genre classification framework, Ratio-reject is not. Knn-reject probably is the method of choice if classification is the main interest. Any algorithm that is able to find a range of nearest neighbors in a database of songs can be used together with the Knn-reject rule. Ratio-reject, on the other hand, has an even wider applicability. It is a general method to detect novel songs given a similarity matrix of songs. Since it does not need genre information it could be used for anything from playlist generation and music recommendation to music organization and visualization.

5. Chroma-Complexity Similarity

In this chapter we use the chromagram implementation developed by Chris Harte at QMUL to compute descriptors for similarity measures which could be useful for playlist generation and related tasks. We briefly review the chromagram implementation based on the constant Q transform and how we use it to compute a measure for chroma complexity. (Note that chroma complexity is closely related to chord complexity.) We discuss possibilities for further development and how the prototypes can benefit from the descriptors developed in WP2. The general approach we apply can be applied to any similar mid-level representation and thus opens the way for further integration of WP2 results in WP3.

The following description of the chromagram calculation and the chromagram tuning has been copied from a paper [6] which was submitted to a conference and will be part of a later deliverable. The remaining work was part of the collaboration between Chris Harte and Juan Bello from QMUL and Elias Pampalk from OFAI visiting QMUL.

5.1 Chromagram Calculation

A standard approach to modeling pitch perception is as a function of two attributes: height and chroma. Height relates to the perceived pitch increase that occurs as the frequency of a sound increases. Chroma, on the other hand, relates to the perceived circularity of pitched sounds from one octave to the other. The musical intuitiveness of the chroma makes it an ideal feature representation for note events in music signals. A temporal sequence of chromas results in a time-frequency representation of the signal known as a chromagram.

A common method for chromagram generation is the constant Q transform [9]. It is a spectral analysis where frequency domain channels are not linearly spaced, as in FFT-based analysis, but logarithmically spaced, thus closely resembling the frequency resolution of the human ear. The constant Q transform X_cq of a temporal signal x(n) can be calculated as:

X_cq(k) = Σ_{n=0}^{N(k)-1} w(n,k) x(n) e^{-j 2π f_k n}    (5.1)

where both the analysis window w(n,k) and its length N(k) are functions of the bin position k. The center frequency f_k of the k-th bin is defined according to the frequencies

of the equal-tempered scale such that:

f_k = 2^(k/β) f_min    (5.2)

where β is the number of bins per octave, thus defining the resolution of the analysis, and f_min defines the starting point of the analysis in frequency. From the constant Q spectrum X_cq, the chroma for a given frame can then be calculated as:

Chroma(b) = Σ_{m=0}^{M} |X_cq(b + mβ)|    (5.3)

where b ∈ [1, β] is the chroma bin number, and M is the total number of octaves in the constant Q spectrum. We downsample the signal to 11025Hz, use β = 36, and perform the analysis between f_min = 98Hz and f_max = 5250Hz. The resulting window length and hop size are 8192 and 1024 samples respectively.

5.2 Chromagram Tuning

Real-world recordings are often not perfectly tuned, and slight differences between the tuning of a piece and the expected position of energy peaks in the chroma representation can have an important influence on the estimation of chords. The 36-bin per octave resolution is intended to clearly map spectral components to a particular semitone regardless of the tuning of the recording. Each note in the octave is mapped to 3 bins in the chroma, such that a bias towards a particular bin (i.e. sharpening or flattening of notes in the recording) can be spotted and corrected. To do this we use a simpler version of the tuning algorithm proposed in [16]. The algorithm starts by picking all peaks in the chromagram. The resulting peak positions are quadratically interpolated and mapped to the [1.5, 3.5] range. A histogram is generated from this data, such that skewness in the distribution is indicative of a particular tuning. A corrective factor is calculated from the distribution and applied to the chromagram by means of a circular shift. Finally, the tuned chromagram is low-pass filtered to eliminate sharp edges.
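A minimal sketch of the folding and tuning steps described above, assuming a magnitude constant-Q spectrogram with 36 bins per octave is already available (the peak-picking details of the tuning algorithm in [16] are left out and only the final circular shift is shown):

import numpy as np

def fold_to_chroma(cq_mag, bins_per_octave=36):
    """Fold a magnitude constant-Q spectrogram (bins x frames) into a
    chromagram by summing corresponding bins across octaves (Equation 5.3)."""
    n_bins, n_frames = cq_mag.shape
    chroma = np.zeros((bins_per_octave, n_frames))
    for b in range(n_bins):
        chroma[b % bins_per_octave] += cq_mag[b]
    return chroma

def apply_tuning_shift(chroma, shift_bins):
    """Correct a global mistuning by circularly shifting the chroma bins.
    shift_bins would be estimated from the skewness of the peak-position
    histogram as described in Section 5.2."""
    return np.roll(chroma, -shift_bins, axis=0)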

5.3 Chromagram Processing

To emphasize certain patterns in the chromagram and to remove temporal variations we use several filters. (1) We use a Gaussian filter over time. This window is very large and removes variations within 50ms. This helps reduce the impact of, for example, the broad spectrum of sharp attacks. (2) We use a loudness normalization to remove the impact of the changing loudness levels in different sections of the piece. (3) We use gradient filters to emphasize horizontal lines in the chromagram. (4) We smooth the chromagram over the tone scale to a resolution of about one semitone (i.e. 12 bins instead of 36). This smoothing is done circularly, in accordance with the distance between semitones. The results are the chromagrams displayed in the figures of this chapter.

5.4 Chroma Complexity

Depending on the number of chords and their similarity, the patterns which appear in a chromagram might be very complex or very simple. To measure this we use clustering. In particular, the chromagram is clustered with k-means, finding groups of similar chroma patterns. The clustering algorithm starts with 8 clusters. If two clusters are very similar, the groups are merged. This is repeated until convergence or until only 2 groups are left. The similarity is measured using a heuristic for the perceptual similarity of two patterns. To avoid getting stuck in local minima, the clustering is repeated several times (with different initializations). (The time resolution of the chromagram is very low, thus the computation time for clustering is negligible.) This is not the optimal choice for several reasons. Alternatives include using, for example, the Bayesian information criterion, or avoiding quantization in the first place. In the following we give some examples to illustrate the chroma complexity and to show that there are general tendencies for genres which make the approach interesting for tasks such as genre classification or playlist generation. (A code sketch of this clustering procedure is given below, after the description of the ChromaVisu tool.)

ChromaVisu Tool

To study the chromagram patterns we developed a Matlab tool to visualize the patterns while listening to the corresponding music. A screenshot is shown in Figure 5.1. The six main components are:

A. The large circle in the upper left is the current chroma pattern (i.e. the pattern associated with the part of the song just playing).

B. To the right is the mean over all patterns, which for many types of music is a useful indicator of the key.

C. To the right are up to eight different chroma patterns which occur frequently. The number above each pattern indicates how often it occurs (percentage). The number of different patterns is determined automatically as described above and is the measure for the chroma complexity.

Figure 5.1: ChromaVisu: a tool to study chroma pattern complexity.

D. Beneath is a fuzzy segmentation (cluster assignment) indicating when which of the eight patterns is active. Each line represents one cluster (with the most frequent cluster in the first row). White means that the cluster is a good match, black a very poor match. If none of the clusters is a good match, the last line (which does not represent a cluster) is white. Modeling the repetitions and the structure which immediately become apparent is the primary target for further work on chroma complexity.

E. Just below the cluster assignment is a slider which helps track the current position in the song.

F. Below it is the chromagram in a flat representation. The first five rows and the last five rows are repetitions to help recognize patterns on the boundaries.
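The following is the announced sketch of the chroma-complexity measure of Section 5.4; Euclidean distance is used as a stand-in for the perceptual pattern-similarity heuristic, and the merging of similar clusters is simplified to merging nearby centroids.

import numpy as np
from scipy.cluster.vq import kmeans2

def chroma_complexity(chroma, max_clusters=8, min_clusters=2,
                      merge_threshold=0.2, restarts=5):
    """Estimate chroma complexity as the number of distinct chroma patterns.

    chroma : (12, n_frames) smoothed chromagram; each column is one pattern.
    The frame-wise patterns are clustered with k-means; clusters whose
    centroids are closer than merge_threshold are merged, down to at most
    min_clusters. The clustering is repeated to reduce the influence of
    unlucky initializations.
    """
    X = np.asarray(chroma, dtype=float).T
    counts = []
    for _ in range(restarts):
        centroids, _ = kmeans2(X, max_clusters, minit='++')
        k = len(centroids)
        while k > min_clusters:
            d = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
            np.fill_diagonal(d, np.inf)
            i, j = np.unravel_index(d.argmin(), d.shape)
            if d[i, j] > merge_threshold:      # no two clusters are "very similar"
                break
            centroids[i] = 0.5 * (centroids[i] + centroids[j])
            centroids = np.delete(centroids, j, axis=0)
            k -= 1
        counts.append(k)
    return int(np.median(counts))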


More information

Music Synchronization. Music Synchronization. Music Data. Music Data. General Goals. Music Information Retrieval (MIR)

Music Synchronization. Music Synchronization. Music Data. Music Data. General Goals. Music Information Retrieval (MIR) Advanced Course Computer Science Music Processing Summer Term 2010 Music ata Meinard Müller Saarland University and MPI Informatik meinard@mpi-inf.mpg.de Music Synchronization Music ata Various interpretations

More information

THE importance of music content analysis for musical

THE importance of music content analysis for musical IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2007 333 Drum Sound Recognition for Polyphonic Audio Signals by Adaptation and Matching of Spectrogram Templates With

More information

A Framework for Segmentation of Interview Videos

A Framework for Segmentation of Interview Videos A Framework for Segmentation of Interview Videos Omar Javed, Sohaib Khan, Zeeshan Rasheed, Mubarak Shah Computer Vision Lab School of Electrical Engineering and Computer Science University of Central Florida

More information

Limitations of interactive music recommendation based on audio content

Limitations of interactive music recommendation based on audio content Limitations of interactive music recommendation based on audio content Arthur Flexer Austrian Research Institute for Artificial Intelligence Vienna, Austria arthur.flexer@ofai.at Martin Gasser Austrian

More information

Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors

Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors Priyanka S. Jadhav M.E. (Computer Engineering) G. H. Raisoni College of Engg. & Mgmt. Wagholi, Pune, India E-mail:

More information

EVALUATION OF FEATURE EXTRACTORS AND PSYCHO-ACOUSTIC TRANSFORMATIONS FOR MUSIC GENRE CLASSIFICATION

EVALUATION OF FEATURE EXTRACTORS AND PSYCHO-ACOUSTIC TRANSFORMATIONS FOR MUSIC GENRE CLASSIFICATION EVALUATION OF FEATURE EXTRACTORS AND PSYCHO-ACOUSTIC TRANSFORMATIONS FOR MUSIC GENRE CLASSIFICATION Thomas Lidy Andreas Rauber Vienna University of Technology Department of Software Technology and Interactive

More information

AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC

AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC A Thesis Presented to The Academic Faculty by Xiang Cao In Partial Fulfillment of the Requirements for the Degree Master of Science

More information

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 AN HMM BASED INVESTIGATION OF DIFFERENCES BETWEEN MUSICAL INSTRUMENTS OF THE SAME TYPE PACS: 43.75.-z Eichner, Matthias; Wolff, Matthias;

More information

Supervised Musical Source Separation from Mono and Stereo Mixtures based on Sinusoidal Modeling

Supervised Musical Source Separation from Mono and Stereo Mixtures based on Sinusoidal Modeling Supervised Musical Source Separation from Mono and Stereo Mixtures based on Sinusoidal Modeling Juan José Burred Équipe Analyse/Synthèse, IRCAM burred@ircam.fr Communication Systems Group Technische Universität

More information

Music Information Retrieval Community

Music Information Retrieval Community Music Information Retrieval Community What: Developing systems that retrieve music When: Late 1990 s to Present Where: ISMIR - conference started in 2000 Why: lots of digital music, lots of music lovers,

More information

Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection

Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection Kadir A. Peker, Ajay Divakaran, Tom Lanning Mitsubishi Electric Research Laboratories, Cambridge, MA, USA {peker,ajayd,}@merl.com

More information

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring 2009 Week 6 Class Notes Pitch Perception Introduction Pitch may be described as that attribute of auditory sensation in terms

More information

Automatic Music Clustering using Audio Attributes

Automatic Music Clustering using Audio Attributes Automatic Music Clustering using Audio Attributes Abhishek Sen BTech (Electronics) Veermata Jijabai Technological Institute (VJTI), Mumbai, India abhishekpsen@gmail.com Abstract Music brings people together,

More information

Combination of Audio & Lyrics Features for Genre Classication in Digital Audio Collections

Combination of Audio & Lyrics Features for Genre Classication in Digital Audio Collections 1/23 Combination of Audio & Lyrics Features for Genre Classication in Digital Audio Collections Rudolf Mayer, Andreas Rauber Vienna University of Technology {mayer,rauber}@ifs.tuwien.ac.at Robert Neumayer

More information

Music Mood Classification - an SVM based approach. Sebastian Napiorkowski

Music Mood Classification - an SVM based approach. Sebastian Napiorkowski Music Mood Classification - an SVM based approach Sebastian Napiorkowski Topics on Computer Music (Seminar Report) HPAC - RWTH - SS2015 Contents 1. Motivation 2. Quantification and Definition of Mood 3.

More information

Analysis of local and global timing and pitch change in ordinary

Analysis of local and global timing and pitch change in ordinary Alma Mater Studiorum University of Bologna, August -6 6 Analysis of local and global timing and pitch change in ordinary melodies Roger Watt Dept. of Psychology, University of Stirling, Scotland r.j.watt@stirling.ac.uk

More information

EE373B Project Report Can we predict general public s response by studying published sales data? A Statistical and adaptive approach

EE373B Project Report Can we predict general public s response by studying published sales data? A Statistical and adaptive approach EE373B Project Report Can we predict general public s response by studying published sales data? A Statistical and adaptive approach Song Hui Chon Stanford University Everyone has different musical taste,

More information

Music Similarity and Cover Song Identification: The Case of Jazz

Music Similarity and Cover Song Identification: The Case of Jazz Music Similarity and Cover Song Identification: The Case of Jazz Simon Dixon and Peter Foster s.e.dixon@qmul.ac.uk Centre for Digital Music School of Electronic Engineering and Computer Science Queen Mary

More information

Topics in Computer Music Instrument Identification. Ioanna Karydi

Topics in Computer Music Instrument Identification. Ioanna Karydi Topics in Computer Music Instrument Identification Ioanna Karydi Presentation overview What is instrument identification? Sound attributes & Timbre Human performance The ideal algorithm Selected approaches

More information

Lyrics Classification using Naive Bayes

Lyrics Classification using Naive Bayes Lyrics Classification using Naive Bayes Dalibor Bužić *, Jasminka Dobša ** * College for Information Technologies, Klaićeva 7, Zagreb, Croatia ** Faculty of Organization and Informatics, Pavlinska 2, Varaždin,

More information

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed,

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed, VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS O. Javed, S. Khan, Z. Rasheed, M.Shah {ojaved, khan, zrasheed, shah}@cs.ucf.edu Computer Vision Lab School of Electrical Engineering and Computer

More information

A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations in Audio Forensic Authentication

A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations in Audio Forensic Authentication Proceedings of the 3 rd International Conference on Control, Dynamic Systems, and Robotics (CDSR 16) Ottawa, Canada May 9 10, 2016 Paper No. 110 DOI: 10.11159/cdsr16.110 A Parametric Autoregressive Model

More information

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS Andrew N. Robertson, Mark D. Plumbley Centre for Digital Music

More information

2. AN INTROSPECTION OF THE MORPHING PROCESS

2. AN INTROSPECTION OF THE MORPHING PROCESS 1. INTRODUCTION Voice morphing means the transition of one speech signal into another. Like image morphing, speech morphing aims to preserve the shared characteristics of the starting and final signals,

More information

Singer Recognition and Modeling Singer Error

Singer Recognition and Modeling Singer Error Singer Recognition and Modeling Singer Error Johan Ismael Stanford University jismael@stanford.edu Nicholas McGee Stanford University ndmcgee@stanford.edu 1. Abstract We propose a system for recognizing

More information

Transcription of the Singing Melody in Polyphonic Music

Transcription of the Singing Melody in Polyphonic Music Transcription of the Singing Melody in Polyphonic Music Matti Ryynänen and Anssi Klapuri Institute of Signal Processing, Tampere University Of Technology P.O.Box 553, FI-33101 Tampere, Finland {matti.ryynanen,

More information

A STATISTICAL VIEW ON THE EXPRESSIVE TIMING OF PIANO ROLLED CHORDS

A STATISTICAL VIEW ON THE EXPRESSIVE TIMING OF PIANO ROLLED CHORDS A STATISTICAL VIEW ON THE EXPRESSIVE TIMING OF PIANO ROLLED CHORDS Mutian Fu 1 Guangyu Xia 2 Roger Dannenberg 2 Larry Wasserman 2 1 School of Music, Carnegie Mellon University, USA 2 School of Computer

More information

Audio Feature Extraction for Corpus Analysis

Audio Feature Extraction for Corpus Analysis Audio Feature Extraction for Corpus Analysis Anja Volk Sound and Music Technology 5 Dec 2017 1 Corpus analysis What is corpus analysis study a large corpus of music for gaining insights on general trends

More information

Effects of acoustic degradations on cover song recognition

Effects of acoustic degradations on cover song recognition Signal Processing in Acoustics: Paper 68 Effects of acoustic degradations on cover song recognition Julien Osmalskyj (a), Jean-Jacques Embrechts (b) (a) University of Liège, Belgium, josmalsky@ulg.ac.be

More information

Algebra I Module 2 Lessons 1 19

Algebra I Module 2 Lessons 1 19 Eureka Math 2015 2016 Algebra I Module 2 Lessons 1 19 Eureka Math, Published by the non-profit Great Minds. Copyright 2015 Great Minds. No part of this work may be reproduced, distributed, modified, sold,

More information

Polyphonic Audio Matching for Score Following and Intelligent Audio Editors

Polyphonic Audio Matching for Score Following and Intelligent Audio Editors Polyphonic Audio Matching for Score Following and Intelligent Audio Editors Roger B. Dannenberg and Ning Hu School of Computer Science, Carnegie Mellon University email: dannenberg@cs.cmu.edu, ninghu@cs.cmu.edu,

More information

Modeling memory for melodies

Modeling memory for melodies Modeling memory for melodies Daniel Müllensiefen 1 and Christian Hennig 2 1 Musikwissenschaftliches Institut, Universität Hamburg, 20354 Hamburg, Germany 2 Department of Statistical Science, University

More information

Lecture 11: Chroma and Chords

Lecture 11: Chroma and Chords LN 4896 MUSI SINL PROSSIN Lecture 11: hroma and hords 1. eatures for Music udio 2. hroma eatures 3. hord Recognition an llis ept. lectrical ngineering, olumbia University dpwe@ee.columbia.edu http://www.ee.columbia.edu/~dpwe/e4896/

More information

Laboratory Assignment 3. Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB

Laboratory Assignment 3. Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB Laboratory Assignment 3 Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB PURPOSE In this laboratory assignment, you will use MATLAB to synthesize the audio tones that make up a well-known

More information

Automatic Music Similarity Assessment and Recommendation. A Thesis. Submitted to the Faculty. Drexel University. Donald Shaul Williamson

Automatic Music Similarity Assessment and Recommendation. A Thesis. Submitted to the Faculty. Drexel University. Donald Shaul Williamson Automatic Music Similarity Assessment and Recommendation A Thesis Submitted to the Faculty of Drexel University by Donald Shaul Williamson in partial fulfillment of the requirements for the degree of Master

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox 1803707 knoxm@eecs.berkeley.edu December 1, 006 Abstract We built a system to automatically detect laughter from acoustic features of audio. To implement the system,

More information

Musical Instrument Identification Using Principal Component Analysis and Multi-Layered Perceptrons

Musical Instrument Identification Using Principal Component Analysis and Multi-Layered Perceptrons Musical Instrument Identification Using Principal Component Analysis and Multi-Layered Perceptrons Róisín Loughran roisin.loughran@ul.ie Jacqueline Walker jacqueline.walker@ul.ie Michael O Neill University

More information

Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility

Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility Karim M. Ibrahim (M.Sc.,Nile University, Cairo, 2016) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE DEPARTMENT

More information

Psychoacoustic Evaluation of Fan Noise

Psychoacoustic Evaluation of Fan Noise Psychoacoustic Evaluation of Fan Noise Dr. Marc Schneider Team Leader R&D - Acoustics ebm-papst Mulfingen GmbH & Co.KG Carolin Feldmann, University Siegen Outline Motivation Psychoacoustic Parameters Psychoacoustic

More information

TOWARD AN INTELLIGENT EDITOR FOR JAZZ MUSIC

TOWARD AN INTELLIGENT EDITOR FOR JAZZ MUSIC TOWARD AN INTELLIGENT EDITOR FOR JAZZ MUSIC G.TZANETAKIS, N.HU, AND R.B. DANNENBERG Computer Science Department, Carnegie Mellon University 5000 Forbes Avenue, Pittsburgh, PA 15213, USA E-mail: gtzan@cs.cmu.edu

More information

WE ADDRESS the development of a novel computational

WE ADDRESS the development of a novel computational IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010 663 Dynamic Spectral Envelope Modeling for Timbre Analysis of Musical Instrument Sounds Juan José Burred, Member,

More information

REpeating Pattern Extraction Technique (REPET): A Simple Method for Music/Voice Separation

REpeating Pattern Extraction Technique (REPET): A Simple Method for Music/Voice Separation IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 1, JANUARY 2013 73 REpeating Pattern Extraction Technique (REPET): A Simple Method for Music/Voice Separation Zafar Rafii, Student

More information

Timing In Expressive Performance

Timing In Expressive Performance Timing In Expressive Performance 1 Timing In Expressive Performance Craig A. Hanson Stanford University / CCRMA MUS 151 Final Project Timing In Expressive Performance Timing In Expressive Performance 2

More information

Feature-Based Analysis of Haydn String Quartets

Feature-Based Analysis of Haydn String Quartets Feature-Based Analysis of Haydn String Quartets Lawson Wong 5/5/2 Introduction When listening to multi-movement works, amateur listeners have almost certainly asked the following situation : Am I still

More information

Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis

Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis Fengyan Wu fengyanyy@163.com Shutao Sun stsun@cuc.edu.cn Weiyao Xue Wyxue_std@163.com Abstract Automatic extraction of

More information