
Automatic Music Similarity Assessment and Recommendation

A Thesis Submitted to the Faculty of Drexel University by Donald Shaul Williamson in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering

June 2007

Dedications

This is dedicated to all of my family members and friends who encouraged me to strive for the best.

Acknowledgments

I would like to thank my advisor, Dr. Youngmoo Kim, and all of the members of the Music Entertainment and Technology (MET) Lab for all of their guidance and assistance.

Table of Contents

List of Tables
List of Figures
Abstract

1. Introduction
   1.1 Scope of Research
   1.2 Music Similarity Measures
      1.2.1 Subjective User Measures
      1.2.2 Quantitative Measures for Machine Learning
   1.3 Artist and Genre Classification

2. Music Information Retrieval (MIR) Introduction
   2.1 Established Features
   2.2 Comparing MFCCs to Other Potential Features (Spectrogram, Time-Domain)
   2.3 Classification Method
   2.4 Previous Approaches
      2.4.1 Quantitative Machine-Learning Approaches
      2.4.2 Subjective Approaches
      2.4.3 Evaluation of Machine-Learning Approaches (MIREX 2006)

3. Methodology
   3.1 Mel-Frequency Cepstral Coefficients (MFCCs)
   3.2 MFCCs Computed from Different MP3 Encodings
   3.3 Kullback-Leibler Divergence
   3.4 Preliminary Work
      3.4.1 Feature Computation
      3.4.2 Preliminary Results
   3.5 Music Similarity
      Similarity Matrix
      Multidimensional Scaling
      Visualizations
      System Evaluation

Music Similarity Results
   Evidence of Album Effect
   Survey Analysis

Conclusion

List of References

List of Tables

1. Comparison of prior research approaches for various MIR tasks
2. Results for automatic genre classification
3. Average deviation from the average slider rating for each survey participant
4. Slope and intercept values for the lines in the slider ratings versus similarity features image

List of Figures

1. Flowchart of supervised-learning approach used in MIR
2. Mel-Scale versus Frequency Scale (from [15])
3. Time-domain signal comparison for two different songs
4. Comparison of spectrograms for two different audio signals
5. Comparison of MFCC-Spectrograms for two audio signals
6. Graphical Representation of Kullback-Leibler Divergence. The top image shows two Gaussians with similar parameters, resulting in a low KL divergence. The opposite is true for the Gaussians in the bottom image.
7. Flowchart for computing MFCCs from an audio signal
8. Image of 20-dimensional MFCCs computed from a time-domain signal
9. Supervised-learning Approach for Artist and Genre Classification
10. KL Divergence between artist testing (x-axis) and training (y-axis) models. The lighter the box, the smaller the KL divergence value between the two models.
11. Song Similarity Approach
12. Survey used to assess the effectiveness of our song similarity system
13. Similarity between songs in dataset displayed using MDS
14. MDS plot of song by Bryan Adams and its visual comparison with other songs
15. Comparison of re-mastered songs with the originals using MDS
16. Comparison of pre- and post-mastered songs using MDS
17. Correlation between KL divergence values and slider ratings
18. Preliminary results from the calibration survey
19. Calibration survey response after user normalization
20. Correlation between KL divergence values and slider ratings after user normalization
21. Comparison between similarity systems. Left: other systems (from [7]). Right: our approach.
22. Correlation between selected similarity features and the slider rating values
23. Correlation between KL divergence values and similarity features
24. Illustration of correlation between KL divergence, slider rating values, and similarity features
25. Comparison of normalized distributions for each candidate song type. The number shown at the top of each bar, at a particular slider value, is the total number of songs that were given that rating value.
26. Comparison of features by artist type

Abstract

Automatic Music Similarity Assessment and Recommendation
Donald Shaul Williamson
Dr. Youngmoo Kim, Ph.D., Advisor

The growth of digital music has caused a new set of problems pertaining to recommending music to individuals and organizing music on electronic devices. MP3 players such as iPods have exacerbated the task of organizing music due to the enormous number of songs that a single device may contain. iPods can carry thousands of songs, making it extremely tedious and time-consuming to perform the simple task of playlist generation. Most users simply place their iPods on shuffle mode, allowing the device to randomly determine which songs are played. This random mode of listening is not what we usually desire, since it does not give the listener the opportunity to select a particular style or kind of music. This project explores a method that uses a computer algorithm to assess song similarity, where the degree of similarity is based solely on the sound waveform of various songs. By extracting acoustic features that represent the timbre, or sound quality, of a song, similarity is assessed by comparing the quantitative distance between features. The similarity between songs is also visually revealed through a computer interface, where similar songs are clustered together and dissimilar-sounding songs are placed further apart. This visualization can serve as a means for recommending music based on similarities. Human subjects have also rated the similarity between songs using a predefined rating system. A final evaluation of this algorithm consists of comparing the data generated from human subjects to the data generated by the automatic song similarity algorithm.

1. Introduction

During the last decade, there has been a tremendous growth in the amount of commercially available digital music. The beginning of the digital music era was marked by the advent of the compact disc (CD). CDs, unlike commercial and recordable cassette tapes, provided a higher sound quality that deteriorated at a much slower rate, ultimately allowing users to enjoy their musical content for a longer period of time. Portable players and stereos were no longer the only means of listening to audio; computers were capable of playing music from CDs and could be used to listen to tunes from your favorite artist. Over time, compact-disc recordables (CD-Rs) and CD-recordable drives (burners) were invented, which made it possible to copy commercial CDs onto blank discs. CD-Rs were blank compact discs that, with a CD burner, could be used to store audio content. They differ from commercial CDs, which are preloaded with audio content and are non-recordable. CD burners were crucial to the growth and exchange of digital music after peer-to-peer software providers such as Napster, and later services such as Limewire and Kazaa, made it possible to download free music directly onto your computer from other users who shared their music collections; the Limewire and Kazaa applications were not illegal in themselves, but the manner in which they were typically used was. More recently, Apple's iTunes has dominated the music downloading industry, making it extremely easy for individuals to legally obtain digital music files from almost any artist and any time period for 99 cents per song. The music downloaded from iTunes usually ends up on an iPod or another mp3 player, which are capable of containing hundreds or thousands of songs. A typical CD can store 700 MB of music, while the amount of memory on mp3 players ranges from 128 MB to 60 GB. When a CD is loaded into a computer, it usually takes some time before the artist, album, and song title information is displayed. This occurs because the tagging information is not present on the CD itself; it needs to be downloaded from a database over the Internet. This database (CDDB) [14] has been manually developed, requiring human involvement to type the information for each artist and their respective albums. Since the CD contains information such as the number of tracks and track lengths, the database determines the contents of the CD, and thus the correct metadata tags, by comparing this track information with the information that the database system contains. Sometimes, though, the database can only narrow down the list of possible albums, and then the user has to select the appropriate one.

Nevertheless, since humans are involved in this process, there is inevitably an associated human input error that could be avoided if there were a method that took into account the audio information from the music, rather than the CD content. More importantly, there are also no standards defining the CD metadata. For instance, if an artist changes their stage name, how should their past and present music be organized? Should "The" be included in the artist name (e.g., "Beatles" versus "The Beatles")? In essence, the current music tagging system is flawed because it depends on human input and lacks a standardized database system, and so it functions only somewhat accurately. The explosion of mp3s and iPods has also created an overabundance of musical choices, and the act of organizing one's music has become frustrating and laborious. Since mp3 players are capable of carrying thousands of songs, the amount of time needed to manually traverse every song in the collection to generate appropriate playlists is exceedingly long. In most cases, if the size of the musical collection is enormous, instead of generating a playlist manually, individuals will just set the playback mode to shuffle, allowing the mp3 player to randomly determine what music is played. Instead of randomly shuffling through the music collection, it may be desired that only songs that sound similar to a particular song or artist are played, or only music for specific occasions: workout music, romantic music, or any other mood-specific playlist. This requires more than just artist or genre tags, because artists produce songs with a wide range of moods and the concept of genre is vague; in many cases, songs with the same genre tag can sound completely different. Genre is a subjective category because most people have different interpretations of its meaning. For instance, individuals from the 1960s may associate Rhythm and Blues (R&B) music with artists such as The Temptations, Smokey Robinson, and Gladys Knight, while today's youth may define R&B based on artists such as Beyonce, Mary J. Blige, and Usher. While the musical content between these sets of artists differs, these artists are grouped together because of the vagueness associated with genre tags. This again goes back to the lack of standards used for tagging songs based on information from a database. Also, the large musical collections of some individuals may consist only of songs that are tagged with the same genre. These individuals would most likely have to generate playlists manually or randomly shuffle

through their collection to listen to various songs. If playlists were generated from acoustical features rather than genre or any other metadata, the associated labeling inaccuracies could possibly be avoided. The growth of digital music has caused another problem amongst music consumers: how does one select new music based on its comparison with artists or songs that are already familiar? Since there is such a large selection of music, how can one recommend new or unfamiliar music based on user preferences? Current recommendation systems incorporate user interaction by means of a method known as collaborative filtering, in which the other musical purchases of users who purchased a specific album or song are displayed as recommendations [16]. This is not necessarily a good means of recommending music based on the sound of the song; rather, it reflects the taste and preferences of a particular group of buyers.

1.1 Scope of Research

Over the past decade or so, computers have allowed consumers to listen to, store, and purchase new musical content; however, they are currently not capable of organizing and recognizing a musical piece based on its sound. Computers still relate to sound as an abstract series of numbers and not as humans do: as instruments, performers, and songs. The goal of this research project is to use signal processing techniques and machine learning methods to develop an automatic music similarity assessment system that determines the similarity amongst songs from various artists. Given a digital database of music, the system should extract meaningful features from each artist and song. These features should represent the perceptual cues of the human auditory system that allow us to perceive that two songs are similar. Therefore, every song should have its own perceptual signature that distinguishes it from all other songs. Even cover songs or live performances should differ from the original studio versions. A meaningful way of comparing song signatures is needed, one that is capable of quantifying how alike or how different two songs are. A distance measure will be used to determine the similarity between the feature sets, culminating in a one-dimensional value

between every pair of songs in the dataset. The types of features and the distance measures will be discussed in later sections. Some preliminary definitions follow.

1.2 Music Similarity Measures

Music similarity is the process of comparing two different songs or two different artists and determining how close they are to each other in terms of sound or acoustics. The actual process of determining the similarity of two songs is to some extent ill-defined, since there are many ways for two songs or artists to sound similar. Vocal characteristics, instrumentation, or other acoustical measures could be used to classify the similarity between songs. Similarity could also be determined from mood, tempo, or rhythm. In general, two methods have been used in practice to assess musical similarity between two artists or songs: subjective measures and machine-learning methods.

1.2.1 Subjective User Measures

Subjective measures of music similarity rely on user interaction. Users typically define how similar songs are by using a rating system or scale. A user will listen to two different songs and rate their similarity based on a pre-defined numerical scale, or simply by stating whether they are similar. Subjective measures are appropriately termed because they deal with the musical preferences of an individual. The measures that individuals use to classify the similarities between two different musical pieces may, and in most cases will, vary amongst different users. For instance, one user may classify similarity based on the tempo or rhythm of the songs, while another may use the vocal characteristics of the artist or the usage of certain instruments. A song similarity rating also depends on the user's familiarity with the musical selection. If a user generally listens to a particular style or genre of music, they are most likely more familiar with the intricacies and nuances of that type of music. They are aware of how instruments are used differently within that style of music and also how the intensity of the sound or the vocal qualities differ. A user in this circumstance might provide a more accurate assessment of similarity. However, this type of user may know too much about this form of music and feel that nothing sounds similar to a particular song. On the other hand, a

person who is not very familiar with the musical content may do one of two things: they may say that all music in this style sounds the same, or they may feel that they do not understand the music and therefore are not capable of accurately assessing it for similarity. For instance, someone who is not familiar with classic rock may feel that all of the songs are very intense, consisting of a lot of heavy guitar and loud drums, and thus may rate every song in this style as being identical. Also, individuals who are not familiar with rap or hip hop music may not realize that within this genre there are various styles, such as club, conscious, southern, or even hardcore, that all produce songs that sound different from each other. Whatever the case, subjective measures fail to provide a feature that consistently and accurately determines the similarity of music, because they are user dependent and users define similarity differently [1]. This motivates a method that assigns a consistent feature to musical pieces, one that describes the overall sound quality, or timbre, of a song.

1.2.2 Quantitative Measures for Machine Learning

Machine-learning techniques use computers to determine the similarities amongst music. Many algorithms employ a supervised-learning method where the data is separated into a training set and a testing set. Temporal and/or spectral features are then extracted from each song or from each group of songs that represents an artist. Again, these features define either the sound of that song or the sound of a particular artist. Different researchers use different quantitative features to represent their data, and many often use multiple features. The type of feature in many cases depends on the application. In the music similarity case, the extracted features are supposed to represent a computerized perceptual component of the digital music; that is, they are supposed to give a numerical value to a feature that mimics how humans interpret sound through the sense of hearing. In many cases, the feature sets are very large and consume a sizable volume of computer memory. Therefore, statistical models of the extracted features are often used to represent a song or artist. Once the features are extracted from each song or artist, the features of the testing models are compared to the features of the training models. A distance measure is used to determine how closely the features, and thus the artists or songs, are related. Again, there is a multitude of distance measures, and each one has its advantages and disadvantages; the quality of their performance depends on the application.

1.3 Artist and Genre Classification

The difference between classifying musical artist or song similarities and classifying the artist or genre of a song is very subtle. In most automated implementations, both employ a supervised-learning technique where there is some pre-defined training data that represents a particular group, i.e., an artist, a genre, or a particular style of music. Features are extracted from each song in the group and combined to get an overall representation of that category. Testing groups (artist, genre, etc.) are represented in the same manner, but the group category is not known beforehand. The category of the group is estimated by comparing the testing model to all of the training models through some distance or divergence measure. Usually, the testing model is assigned the category of the training model that produces the smallest distance between itself and the testing model, as sketched in the example below. Even though the process of assessing the similarity between song models is relatively new, researchers have been able to develop a multitude of algorithms that have improved the capabilities of computers in this area. Many of the current musical similarity approaches, along with artist and genre classification techniques, are discussed in subsequent chapters.
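As a minimal illustration (my own sketch, not code from this thesis), the snippet below applies the rule just described: a test model receives the category of the training model at the smallest distance. The Euclidean distance between mean feature vectors is a hypothetical placeholder; Section 3.3 introduces the KL divergence actually used in this work.

```python
# Minimal sketch of the nearest-training-model rule described above.
# The distance function here is a placeholder; the thesis itself uses the
# KL divergence between Gaussian models (Section 3.3).
import numpy as np

def classify(test_features, training_models, distance):
    """training_models: dict mapping category name -> feature matrix."""
    distances = {category: distance(test_features, model)
                 for category, model in training_models.items()}
    return min(distances, key=distances.get)   # smallest distance wins

def mean_distance(a, b):
    # Euclidean distance between the mean feature vectors of two models.
    return float(np.linalg.norm(a.mean(axis=1) - b.mean(axis=1)))

# Toy example with random 20-dimensional "feature" matrices.
rng = np.random.default_rng(0)
training = {"rock": rng.standard_normal((20, 500)),
            "rap": rng.standard_normal((20, 500)) + 1.0}
test = rng.standard_normal((20, 300)) + 1.0
print(classify(test, training, mean_distance))   # prints "rap"
```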

2. Music Information Retrieval (MIR) Introduction

Music Information Retrieval is the process of developing computer algorithms to extract perceptual and physical representations of music. Some of the perceptual algorithms aim to determine the mood, tempo, and melody of musical pieces. Algorithms that quantify the artist, genre, and score of a song are examples of physical MIR tasks. Whatever the task, one general approach is often used to extract these representations. Many researchers have employed a supervised-learning technique, where knowledge of some pre-defined data is used to acquire knowledge about unknown data (Figure 1). For MIR, a large database of songs is first sub-divided into a pre-defined (training) set and an unknown (testing) set. The pre-defined data is grouped into known categories, i.e., by artist, mood, genre, etc., and features are extracted from the different groups to provide an overall representation of each category. The same features are extracted from the unknown testing set, and the songs of the testing set are then classified into a specific known category by comparing them to the training dataset. A popular feature that is generally used to represent music models is the set of Mel-Frequency Cepstral Coefficients (MFCCs). They provide a good approximation of how humans process sound information, and these features are generally compared to each other using the Kullback-Leibler (KL) divergence.

Figure 1: Flowchart of supervised-learning approach used in MIR

2.1 Established Features

Mel-Frequency Cepstral Coefficients are the multidimensional timbral features of choice in most artist or music similarity classifications [9]. They provide a spectral representation of sound that mimics how humans process acoustical information. Humans do not process all sound frequencies in the same way; we

can recognize and distinguish between certain frequencies better than others. Our ears can hear frequencies in the 20 Hz to 20,000 Hz range. Within this range, humans resolve the lower frequencies much better than the higher ones; i.e., we have higher frequency resolution at lower frequencies than at higher frequencies. Instead of hearing frequencies on a linear scale, the human auditory system processes frequencies on a roughly logarithmic scale. The mel scale is a relatively good representation of the manner in which humans process sound energy, because it groups frequencies on a roughly logarithmic scale (Figure 2). The full set of MFCCs computed for a song is in general very large; thus, each song is modeled as a Gaussian distribution of the MFCC features.

Figure 2: Mel-Scale versus Frequency Scale (from [15])

2.2 Comparing MFCCs to Other Potential Features (Spectrogram, Time-Domain)

MFCCs were first developed for the problem of automatic speech recognition, and it was later determined that they could be beneficial in the field of music information retrieval and music similarity [9]. They are also used in this research area because they provide better and more distinguishable features than other feature types. For instance, the time-domain signal of a song does not present particularly useful discriminating information. Comparing two time-domain signals, the two waveforms more than likely look very similar to each other, and neither has an attribute that differentiates

itself from the other. A time-domain comparison is displayed below, where the waveforms for two different songs are shown (Figure 3). As can be seen, nothing noticeably stands out that differentiates the two signals.

Figure 3: Time-domain signal comparison for two different songs.

The spectrogram is another feature that could potentially be used to characterize a song, but the short-time spectrum consists of hundreds of linearly spaced frequency channels that contain redundant information (Figure 4). This redundant information makes it difficult to distinguish between two musical pieces, because both contain much of the same information. This is one reason why, in general, spectrograms have not been directly used in music similarity algorithms and similar tasks. Also, unlike MFCCs, spectrograms do not accurately model the human auditory system; they treat all frequencies equally.

Figure 4: Comparison of spectrograms for two different audio signals
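For illustration only (assumed sampling rate and window settings, with random noise standing in for audio), the following sketch shows the hundreds of linearly spaced channels produced by a short-time spectrum:

```python
# Illustration: the short-time spectrum has hundreds of linearly spaced
# frequency channels (assumed parameters, noise standing in for a song).
import numpy as np
from scipy.signal import spectrogram

sr = 22050                                    # assumed sampling rate
x = np.random.randn(5 * sr)                   # stand-in for a song waveform
freqs, times, S = spectrogram(x, fs=sr, nperseg=512, noverlap=256)
print(S.shape)              # (257, n_frames): 257 linearly spaced channels
print(freqs[1] - freqs[0])  # constant channel spacing of about 43 Hz
```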

Comparing the MFCCs of these two signals, the differences between them are easily noticed. Figure 5 shows the MFCC spectrograms of the time-domain signals of Figure 3. The MFCC spectrogram is computed by first calculating the MFCC matrix of the time-domain signal, then calculating the inverse MFCC of this matrix, resulting in a new time-domain signal. The re-computed time-domain signal is not the same as the initial input signal; information is lost when transforming from a time signal to MFCCs. The spectrogram of the newly computed time-domain signal is then taken. As can be seen in Figure 5, the information is much smoother, and the content at the lower frequencies is stretched to enable better discrimination.

Figure 5: Comparison of MFCC-Spectrograms for two audio signals

2.3 Classification Method

Once the features are calculated for each artist or song model, the distance between these models must be determined. Before this occurs, the MFCC features for each song are represented as single-Gaussian distributions, where the mean vector and covariance matrix of the MFCC features are computed over time. The Gaussian statistics are multidimensional because the MFCC features are multidimensional. The song models (which are Gaussian distributions of MFCC data) are then compared to each other via a dissimilarity measure. The Kullback-Leibler (KL) divergence is a common way of determining the relative similarity between two single-Gaussian distributions [3]. A small divergence value signifies that the Gaussian distributions are similar; the opposite is true for large divergence values. A graphical visualization of the one-dimensional case of the KL divergence calculation is shown below. In each image, a plot of two Gaussian distributions is shown. The top image displays two distributions with similar mean and covariance values, thus producing a small KL value. The KL divergence value between the two song distributions in the bottom image would be higher because their means and variances differ significantly. More details on the calculation of KL divergence are provided in Section 3.3.

Figure 6: Graphical Representation of Kullback-Leibler Divergence. The top image shows two Gaussians with similar parameters, resulting in a low KL divergence. The opposite is true for the Gaussians in the bottom image.
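A small numerical illustration of this behavior (my own sketch, using the standard closed-form KL divergence between one-dimensional Gaussians, not code from the thesis):

```python
# Closed-form KL divergence between two 1-D Gaussians N(mu1, s1^2) and
# N(mu2, s2^2), illustrating the behavior depicted in Figure 6.
import numpy as np

def kl_1d(mu1, s1, mu2, s2):
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_1d(0.0, 1.0, 0.1, 1.1))   # similar parameters   -> small divergence (~0.01)
print(kl_1d(0.0, 1.0, 3.0, 0.4))   # different parameters -> large divergence (~30)
```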

2.4 Previous Approaches

2.4.1 Quantitative Machine-Learning Approaches

Over the past decade, a large amount of research in the areas of artist classification and music similarity has been accomplished. Most algorithms for these two tasks have incorporated a supervised-learning approach. For artist similarity, the MFCCs for each song are computed, and these features are grouped together into one large matrix for each artist. If the goal is to compute the similarity between two songs, the MFCCs are computed for each song but are not grouped together by artist. At this point, the researchers' approaches tend to distinguish themselves from one another: many model the MFCCs differently, and they often use different distance measures to compare models (Table 1).

Table 1: Comparison of prior research approaches for various MIR tasks.

Researcher       | Goal              | Feature(s)                        | Classifier
Aucouturier      | Song Similarity   | MFCCs (GMM)                       | Likelihood/Sampling
Berenzweig       | Artist Similarity | Clusters of MFCCs (GMM)           | KL divergence and others
Logan            | Artist Similarity | Clusters of MFCCs                 | Earth-Mover's Distance
Mandel and Ellis | Song Similarity   | MFCCs (GMM)                       | KL divergence and others
Pampalk          | Song Similarity   | MFCCs, Tempo, Loudness Modulation | KL divergence
Pohle            | Song Similarity   | MFCCs                             | KL divergence

Aucouturier (2002)

In [1], three Gaussians were used to model eight-dimensional Mel-Frequency Cepstral Coefficients. The goal was to obtain a timbre measure that is independent of pitch [1]. The distribution of each song's MFCCs is modeled as a mixture of Gaussian distributions (GMM) over the space of all MFCCs. The similarity between songs is computed using two different distance measures, likelihood and sampling. The likelihood approach computes the probability that the given MFCCs of test song A are generated by the model of song B. With the sampling approach, one GMM in the feature set is sampled and the likelihoods of the samples given the other GMMs are computed. The distances are finally normalized between 0 and 1.

Berenzweig (2003)

In [2], Berenzweig aimed to develop an automatic algorithm that is capable of assessing similarity amongst artists. Berenzweig asked whether the similarities established using machine-learning approaches are consistent with the results from subjective measures. After computing the Mel-Frequency Cepstral Coefficients, Berenzweig used two methods to define the training model for each artist. One method incorporates K-means clustering of the data points to generate a Gaussian mixture model (GMM) representation of the clusters. In essence, after the MFCCs from each artist are grouped together, this approach groups the entire data set of each artist into K clusters, where each cluster is represented by its mean, covariance, and weight. A second approach uses standard Expectation-Maximization (EM) re-estimation of the GMM parameters initialized from the K-means clustering [2]. Once the feature space for each artist is modeled, several different distance measures were used to determine the similarity between artists. The centroid distance measure was used first, which is the Euclidean distance between the overall means of two models. The Earth-Mover's Distance (EMD) was also explored. This distance measure calculates the cost of moving probability mass between mixture components to make them equivalent. Finally, the Asymptotic Likelihood Approximation (ALA) to the Kullback-Leibler divergence between GMMs was used to determine the similarity between feature spaces. This measure segments each feature space and assumes only one Gaussian component dominates in each region.

Once the training and testing models have been generated, Berenzweig used the above-mentioned distance measures to calculate the similarity between a testing model and each training model. The results were stored in a square N x N similarity matrix, where each entry gives the similarity between a particular pair of artists.

Logan (2004)

Like the other approaches, the goal of the work presented in [4] by Logan was to design a system that determined similarity strictly based on the audio content of a song. In this approach, each song was first divided into time frames of equal length. A spectral representation, the MFCCs, was then computed for each time frame. Given a sequence of spectral frames, Logan then grouped the frames into clusters based on the similarity between frames. The number of clusters for each song can either be fixed, meaning each song is grouped into K clusters, or it can depend on the particular song. In this particular case, the sequences of coefficients are clustered into 16 groups. Each cluster is then defined by its Gaussian statistics, i.e., mean and covariance, along with its weight. These values define the signature for a song, where each song has its own signature. The Earth-Mover's Distance (EMD) is then used to compare the signatures of two different songs.

Mandel and Ellis (2005)

The music classification approach of Mandel and Ellis is similar to the above approach; however, they use support-vector machine (SVM) active learning as the classifier for their system. The user presents the system with a song of interest, and the algorithm uses SVMs to develop a classifier based on that query song. A support vector machine attempts to find the hyperplane that separates two classes of data [3][8]. This hyperplane is called the maximum-margin hyperplane because it maximizes the distance to the closest points from each class. The system returns six songs that the user then labels as being similar or dissimilar to the seed song. The system then presents the user with a refined list of six songs, and the user can again label those songs as similar or dissimilar. This process is complete once the user is satisfied that the six songs are similar to the initial song of interest.

The points of each class that the SVM tries to separate are represented by different versions of MFCCs. After computing the MFCCs for each song, they either modeled the cluster of points with one mean and covariance, or they used a Gaussian mixture model where multiple Gaussians represent a cluster of MFCC points. They determined that the best-fit GMM represented a cluster of MFCCs with 50 to 100 Gaussians. Mandel and Ellis used various distance measures to separate songs from each other: the Kullback-Leibler divergence, a non-probabilistic distance measure, the GMM posterior, and two implementations of the Fisher kernel. When a song was modeled as a single Gaussian distribution, the KL divergence or the non-probabilistic distance measure was used to compare two songs for similarity. A modified form of KL divergence was also used when songs were modeled with multiple Gaussians. The GMM posterior measures the posterior probability of each Gaussian in a mixture model given the frames from each song. The Fisher kernel is described as a method for summarizing the influence of the parameters of a generative model on a collection of samples from that model [3]. Mandel and Ellis used the means of the Gaussians in the global GMM as the parameters. The Fisher kernel methods are based on the partial derivatives of the log likelihood of the song with respect to each Gaussian mean. For their two implementations of the Fisher kernel, a song was modeled using the full Fisher kernel and the magnitudes of the Fisher kernel, respectively.

Pampalk (2006)

In [5], Pampalk envisioned creating a system that was capable of rating the similarity between songs. Similar to some of the other researchers, Pampalk used MFCCs to represent a given song. However, instead of computing MFCCs over the entire duration of a song, MFCCs were computed for a maximum of two minutes from the center of the song. Using this portion of a song, the signal is divided into time frames of length 23 ms, and 19-dimensional MFCCs were computed for each frame. A single Gaussian distribution of the MFCC frames was then used to represent a test song. Two songs were compared to each other using a symmetric version of the Kullback-Leibler divergence. Besides MFCCs, Pampalk also used other features to represent a given song. Specifically, he used three features that are computed from the fluctuation patterns of a given song. Fluctuation patterns are two-

dimensional matrices that describe the modulation of loudness amplitudes per frequency band; they essentially describe the periodic beats. Two features that are extracted from the fluctuation patterns are the gravity and the bass. Gravity corresponds to the tempo of a song; it is the center of gravity of the fluctuation pattern along the modulation frequency axis. The bass is the strength of the lower frequency bands at higher modulation frequencies [5]. For each of these three features, the distance between two songs is based on the absolute difference between the respective feature values. These three features, along with the single-Gaussian MFCC model, mean that four features in total define a song, and the overall similarity of two songs is based on a linear combination of the four different distance values.

Pohle (2006)

Pohle used a technique similar to those of the other researchers who planned to develop a music similarity system. In [6], Pohle describes how MFCCs are used to represent a song and how the Kullback-Leibler divergence is used to compare two songs for similarity. Unlike the other researchers, Pohle computes the MFCCs for songs differently. An input waveform sampled at 22 kHz is converted to mono and first divided into frames of 512 samples with 50 percent overlap between frames. The first and last 30 seconds of a song are discarded. A number of nonconsecutive frames corresponding to a song length of 2 minutes are randomly selected and used for feature extraction. Each song is represented by a single Gaussian distribution of the MFCCs. Using this approach, Pohle was able to generate a distance matrix that contains the distance between a seed song and all the other songs in the database.

2.4.2 Subjective Approaches

Probably the most important step in the music similarity and classification process is evaluating the effectiveness of each algorithm. Different but broadly similar evaluation processes were used by each researcher. In most cases, the test songs or artists are compared to a reference model, better known as the training model. The artists of the test songs are generally known, so it is simple to determine whether each test song was correctly classified with respect to the training data set. There is a distinction between classifying the artist of a test model and determining the similarity between models in the data set. Initially, the distance values between a test model and each training model are computed, and the test model is classified based on the smallest distance value. However, music

similarity incorporates these distance values between models to generate a similarity matrix, which is then evaluated. In [2], a reference similarity matrix was used, where each row of the matrix is sorted by decreasing similarity, yielding a list of the N most similar songs for each song. Another approach in the music similarity evaluation process compares the results from the algorithm with the results of a subjective survey. Individuals are given the task of comparing a certain number of songs with a test song and rating how similar each song is to the test song. The comparisons from all users are averaged to give an overall similarity rating between all songs. The results from the machine-learning similarity algorithms are then compared to the similarity results from the user surveys. The most common approach to subjectively measuring the similarity between music is to incorporate user opinion by means of a survey. Through surveying, subtle relationships between songs can be uncovered that may be difficult to detect from the audio signal [2]. In [2], Berenzweig conducted a survey on musicseer.com with the purpose of obtaining user input. The music database consisted of 400 popular artists. Each artist had 10 candidate artists that were compared with it. The user was given the task of answering which of the 10 artists is most similar to the test artist. Similarly, Aucouturier conducted a survey where users were given a target song S and two test songs A and B. The user then had to decide which test song is closest to S. The pair of songs A and B were not chosen at random, but deliberately selected, because randomly chosen songs would most likely not have any similarity. Both A and B were chosen so that A and S are close to one another according to their own measure, while B and S are more distinct. A similarity matrix was subsequently made, where the stored values depended on the ratings from the users.

Expert Opinion

Ground-truth similarity matrices were also produced based on expert opinions. The All Music Guide is a professional service that provides information about almost every possible artist. They provide album information, genre information, and even information on which artists are similar. Using these similarity measures, Aucouturier was able to generate a graph that shows which artists are similar to each other in his music dataset. From there, he was able to create a similarity matrix by computing the path length between

artists in the graph, where nodes are artists and there is an edge between two artists if the All Music Guide considered them similar. Similarly, Berenzweig used this approach to generate a similarity matrix and ultimately used this matrix to evaluate his machine-learning approach.

Pandora

The Music Genome Project of the Pandora website uses a similar kind of expert assessment in determining similarity. They used a group of musicians to determine a similarity metric between artists based on melody, rhythm, lyrics, instrumentation, and vocal characteristics. After an artist name or song title is entered into their system, Pandora creates a station of artists or songs that are similar. The user is then able to indicate whether the songs in the station are similar or dissimilar to the target song.

2.4.3 Evaluation of Machine-Learning Approaches (MIREX 2006)

Subjective measures are fairly often used to evaluate quantitative systems. In general, the results from an automatic system are compared to the similarity results from a user survey. One such system that has evaluated a number of automatic similarity systems is the Evalutron 6000, developed for the Music Information Retrieval Evaluation eXchange (MIREX) 2006 similarity competition. The Evalutron 6000 surveyed a variety of individuals to rate song similarities using various scales. With this system, a group of roughly 20 individuals was required to rate the similarities between a test song and approximately 30 candidate songs [7]. Two similarity scales were used to determine the similarity between each candidate song and its test song. One scale used a general rating system, known as the Broad Category score, where the songs were rated as not similar, somewhat similar, or very similar. A numerical scale (Fine score) was also used, where a value between 0 and 10, in increments of 0.1, was used to define similarity. The Broad Category and Fine scores are supposed to represent the same similarity rating but in different manners. This rating system is supposed to provide a ground truth for similarities between models. Individuals who participated in the MIREX 2006 similarity competition compared similarity assessments from their various similarity algorithms with the ground-truth data set. A system is deemed successful if its similarity scores closely match the similarity scores from the user-developed similarity survey.

Six researchers participated in the Audio Music Similarity and Retrieval task, which was held during the fall of 2006 at the International Symposium on Music Information Retrieval (ISMIR) conference in Victoria, British Columbia, Canada. Each participant was required to submit an automatic computer algorithm that assessed the similarity between 5000 songs from various artists, genres, and eras. All participants used the same 5000 songs, which were extracted from the uspop, uscrap, and cover song collections. The six participants, two of whose algorithms are described in the previous approaches section, were Elias Pampalk, Tim Pohle, Victor Soares, Thomas Lidy and Andreas Rauber, and Kris West. Each algorithm was required to produce a 5000 x 5000 similarity matrix that gives the similarity between every pair of songs in the data set. To compare the algorithms, 60 songs were selected at random from this set of 5000, and the 5 most highly ranked songs for each song query were extracted and stored. The 5 songs were not from the same artist as the query song, and all cover songs that matched the query song were filtered out. Also, the query song itself could not be contained in the 5 songs. The 60 query songs were the same for each algorithm, but the 5 songs for each query (totaling 300 songs) may have differed; some algorithms may have returned some of the same songs as being similar to a query song. From there, human subjects rated each query song for similarity to its candidate songs. Three different graders evaluated each query song. The six systems were evaluated by computing the correlation between the user-graded responses and each algorithm's results. Six scores were given for each algorithm: Fine, PSum, WCsum, SDsum, Greater0, and Greater1 [7]. The Fine score consists of summing all of the fine-grained human similarity slider ratings (0-10) between each query and its candidate songs. The PSum score sums all of the human broad-category similarity decisions, where a value of 0 was given when Not Similar was checked, a value of 1 was given when Somewhat Similar was selected, and a value of 2 was given when Very Similar was selected. The WCsum score is the same as the PSum score; however, a value of 3 was given when Very Similar was selected. Like the WCsum score, the SDsum gives a larger value when two songs are said to be very similar, but the value is 4. The Greater0 and Greater1 scores differ from the others in that they assign a value of either 0 or 1 for the broad-category rating. For Greater0, a value of 1 is given when two songs are rated as being somewhat similar or very similar, while not similar ratings receive a value of 0. The Greater1 score gives a value of 1 only when the two songs are

said to be very similar. The various scores for each system are computed, but each score is normalized to a range of 0 to 1.
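The scoring rules above can be summarized in a short sketch (my own reconstruction of the description, not MIREX code); normalizing each score by its maximum attainable value is an assumption, since the text states only that scores are scaled to the 0-to-1 range.

```python
# Sketch of the six MIREX-style scores computed from one set of ratings.
# FINE holds fine slider ratings (0-10); BROAD holds broad-category labels:
# "NS" = Not Similar, "SS" = Somewhat Similar, "VS" = Very Similar.
FINE = [7.5, 2.0, 9.1]
BROAD = ["VS", "NS", "SS"]

def mirex_scores(fine, broad):
    maps = {                                   # broad-category value assignments
        "PSum":     {"NS": 0, "SS": 1, "VS": 2},
        "WCsum":    {"NS": 0, "SS": 1, "VS": 3},
        "SDsum":    {"NS": 0, "SS": 1, "VS": 4},
        "Greater0": {"NS": 0, "SS": 1, "VS": 1},
        "Greater1": {"NS": 0, "SS": 0, "VS": 1},
    }
    scores = {"Fine": sum(fine)}
    for name, mapping in maps.items():
        scores[name] = sum(mapping[b] for b in broad)
    # Assumed normalization: divide by the maximum possible value of each score.
    maxima = {"Fine": 10 * len(fine), "PSum": 2 * len(broad), "WCsum": 3 * len(broad),
              "SDsum": 4 * len(broad), "Greater0": len(broad), "Greater1": len(broad)}
    return {name: value / maxima[name] for name, value in scores.items()}

print(mirex_scores(FINE, BROAD))
```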

3. Methodology

3.1 Mel-Frequency Cepstral Coefficients (MFCCs)

MFCCs are the features we used to represent a song model. They are computed directly from an audio waveform using the approach in Figure 7.

Figure 7: Flowchart for computing MFCCs from an audio signal.

The audio signal is first divided into frames of equal length. In our case, the frame length was either 23 milliseconds (genre classifier) or 12.5 milliseconds (artist classifier). These frame lengths were based on the defaults used by the MATLAB routines we used to calculate MFCCs; different MFCC calculation approaches were used for the genre and artist classifiers. Successive frames overlap the same segment of the time-domain signal by 50%. This is done to smooth the transition from frame to frame. A Hanning window, which smooths the frame edges, is then applied to each frame. The Discrete Fourier Transform (DFT) is then taken for each frame. These steps essentially compute the spectrogram of the time-domain signal. The spectrogram gives a frequency-versus-time visualization of the input signal; the cause of the variations of the frequencies over time can be determined by listening to the song concurrently with the spectrogram visualization. An FFT length of 512 samples was used to balance the time and frequency resolution of the signal. Next, the logarithm of the amplitude spectrum is taken for each frame, and then the mel scale is applied to warp the linear frequency scale. Taking the logarithm of the amplitude spectrum provides information about the spectral envelope of the signal, and computing this over all frames shows how the spectral envelope varies with time. The mel scale is a nearly logarithmic scale that gives a higher emphasis to the lower frequencies. It is based on a mapping between actual frequency and the perceived pitch of the human auditory system [9]. The mel scale sections a range of frequency values into a number of bins. The number of bins used to adjust the frequency scale can vary and depends on the application. The bins are logarithmically spaced, and the number of bins defines the dimensionality of the MFCC feature. In

order to account for both the slowly varying and the more rapidly varying parts of the spectral envelope, which capture timbre and pitch measures, this work incorporates 20-dimensional Mel-Frequency Cepstral Coefficients. The timbre of a song can be represented by the first 8 bins of an MFCC vector, while the last 10 bins can represent the pitch of a song, as demonstrated by Berenzweig et al. in [2]. The first channel, or bin, of the MFCC matrix is the overall energy of the signal. It generally describes how the loudness of the song varies with time. The final step in computing the MFCCs of a signal consists of taking the Discrete Cosine Transform (DCT) to reduce the number of parameters in the system and to remove the correlation between the mel-spectral vectors calculated for each frame [9]. Figure 8 shows a visualization of an MFCC matrix computed from one song waveform. The size of this matrix is 20 x N, where N is the number of frames. The first, or top, bin shown in Figure 8 represents the energy of the waveform, which is fairly constant over time. Each of the other rows of the matrix represents one mel-scale channel, and one can see how each bin varies across time frames.

Figure 8: Image of 20-dimensional MFCCs computed from a time-domain signal
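The steps of Figure 7 can be sketched as follows. This is a NumPy/SciPy illustration, not the MATLAB toolbox code used in this work; the sampling rate, the triangular filterbank construction, and the common mel-then-log ordering are assumptions made for the sketch.

```python
# Illustrative MFCC pipeline: frame, window, FFT, mel warping, log, DCT.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale (assumed design)."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_bands + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(1, n_bands + 1):
        lo, mid, hi = bins[b - 1], bins[b], bins[b + 1]
        for k in range(lo, mid):
            fb[b - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[b - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb

def mfcc(signal, sr=22050, n_fft=512, hop=256, n_bands=20):
    window = np.hanning(n_fft)                               # Hanning window per frame
    fb = mel_filterbank(n_bands, n_fft, sr)
    frames = []
    for start in range(0, len(signal) - n_fft, hop):         # 50% frame overlap
        spec = np.abs(np.fft.rfft(signal[start:start + n_fft] * window))
        logmel = np.log(fb @ spec + 1e-10)                   # mel-warped, then log
        frames.append(dct(logmel, type=2, norm='ortho'))     # decorrelate with DCT
    return np.array(frames).T                                # 20 x N matrix

# Example: two seconds of noise stands in for an audio waveform.
print(mfcc(np.random.randn(2 * 22050)).shape)                # (20, N)
```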

3.2 MFCCs Computed from Different MP3 Encodings

The MFCCs were computed on songs with different encodings. Specifically, features were extracted from wav and mp3 files, where some of the mp3 files were encoded using the LAME encoder and others with Apple's iTunes mp3 encoder. The bit rate varied from song to song: most of the songs were encoded at 128 kbps, and a few at either 192 kbps or 164 kbps. Nevertheless, a study has shown [10] that these different parameters should not substantially alter the resulting MFCCs; the MFCCs for the different encodings should be approximately the same. In [10], Sigurdsson studies the effect that different mp3 encodings have on the computation of MFCCs and thus how they may affect music information retrieval performance. Mp3 encodings with different sampling and bit rates were used to compute MFCCs, and they were evaluated against the MFCCs computed from wav audio files; wav files contain little to no loss of information as compared to mp3 files. The bit rates used were 320, 128, and 64 kbps. Sampling rates of 44.1 kHz or 22.05 kHz were used in this comparison. The squared Pearson correlation coefficient was used to compare the MFCCs generated from the wav files to the MFCCs generated from the different mp3 encodings. The Pearson correlation coefficient r_xy for two variables x and y is a measure of the correlation between them given a linear model and Gaussian noise [10]. Sigurdsson uses the squared correlation r_xy^2, which indicates the percentage of variation in the data that is explained. A squared correlation value of 1 signifies that the two models are identical. The correlation between the two models lessens as the squared Pearson correlation coefficient becomes smaller. When the sampling rate was 44.1 kHz and the bit rate 320 kbps, it was determined that the correlation between the MFCCs from the wav file and the MFCCs from the mp3 was approximately equal to 1, signifying little or no loss of information. With the same sampling rate but a lower bit rate of 128 kbps, Sigurdsson determined that the squared correlation between the mp3 MFCCs and the wav MFCCs decreased slightly, from 1 to approximately 0.95, for the first 15 MFCCs. They found that higher MFCC channels are less robust to MP3 encodings of music, which is shown by a lower sample correlation. At a 64 kbps bit rate and a sampling rate of 44.1 kHz, it was determined that the correlation had significantly decreased. The robustness of the MFCCs can be improved by using the lower 22.05 kHz sampling rate instead. This occurs because higher frequencies are more expensive to code and deviate more from the original. Therefore, since in our case the lower sampling rate is used, the MFCCs obtained from mp3 encodings should not vary significantly from the MFCCs generated from the wav files.
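The comparison in [10] can be mimicked with a few lines (my own sketch with assumed data shapes and toy data, not the study's code): compute the squared Pearson correlation between corresponding MFCC channels from a wav file and from an mp3 encoding of the same song.

```python
# Squared Pearson correlation per MFCC channel, comparing "wav" and "mp3"
# versions of the same features (toy data; shapes are assumptions).
import numpy as np

def squared_channel_correlations(mfcc_wav, mfcc_mp3):
    """Both inputs are d x N MFCC matrices for the same song."""
    r2 = []
    for wav_ch, mp3_ch in zip(mfcc_wav, mfcc_mp3):
        r = np.corrcoef(wav_ch, mp3_ch)[0, 1]
        r2.append(r ** 2)                      # r^2 near 1 => channels agree
    return np.array(r2)

# Toy example: the "mp3" channels equal the wav channels plus noise, with
# more noise in the higher channels (less robust, as reported in [10]).
rng = np.random.default_rng(0)
wav = rng.standard_normal((20, 2000))
mp3 = wav + rng.standard_normal((20, 2000)) * np.linspace(0.05, 0.8, 20)[:, None]
print(squared_channel_correlations(wav, mp3).round(2))
```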

3.3 Kullback-Leibler Divergence

The KL divergence between two single-Gaussian distributions p and q is defined as

D_KL(p || q) = ∫ p(x) log [ p(x) / q(x) ] dx.

As can be seen from this equation, the KL divergence between two distributions is based on the entropy of those distributions. Rewriting the equation, the KL divergence reduces to the difference between the cross-entropy of the p- and q-distributions and the entropy of distribution p:

D_KL(p || q) = H(p, q) - H(p).

Entropy is a measure of what is unknown. This simplification shows that the KL divergence is nothing more than the relative entropy between the p- and q-distributions, and the properties of entropy require that it always be non-negative. Since the distributions p and q are both normally distributed, the KL divergence between them can be simplified by substituting the Gaussian probability density functions into the definition above. The simplified expression is a function of only the means and covariances of the two distributions, where d is the dimensionality of the MFCC features [3]:

D_KL(p || q) = (1/2) [ log( |Σ_q| / |Σ_p| ) + tr( Σ_q^{-1} Σ_p ) + (μ_p - μ_q)^T Σ_q^{-1} (μ_p - μ_q) - d ].

The KL divergence does not directly give the distance between two feature models, since it is a divergence and not a distance measure: the KL divergence between distributions p and q is in general not the same as the KL divergence between distributions q and p. In order to determine the similarity between two song feature models, a closed-form symmetrized version of the KL divergence is calculated. This is accomplished by summing the KL divergence between distributions p and q with the KL divergence between distributions q and p:

D(p, q) = D_KL(p || q) + D_KL(q || p).

This final equation is used in our system to quantify the similarity between two song or artist models.
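A sketch of the closed-form divergence and its symmetrized version (my own NumPy rendering of the equations above, not the MATLAB implementation used in this work):

```python
# Closed-form KL divergence between two multivariate Gaussians and the
# symmetrized version, applied to toy 20-dimensional "MFCC" song models.
import numpy as np

def kl_gauss(mu_p, cov_p, mu_q, cov_q):
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_p - mu_q
    return 0.5 * (np.linalg.slogdet(cov_q)[1] - np.linalg.slogdet(cov_p)[1]
                  + np.trace(cov_q_inv @ cov_p)
                  + diff @ cov_q_inv @ diff
                  - d)

def symmetric_kl(mu_p, cov_p, mu_q, cov_q):
    # Sum of the two directed divergences, as in the final equation above.
    return kl_gauss(mu_p, cov_p, mu_q, cov_q) + kl_gauss(mu_q, cov_q, mu_p, cov_p)

# Toy song models: mean vector and covariance of random 20 x N "MFCC" matrices.
rng = np.random.default_rng(0)
a = rng.standard_normal((20, 1000))
b = rng.standard_normal((20, 1000)) + 0.5
print(symmetric_kl(a.mean(axis=1), np.cov(a), b.mean(axis=1), np.cov(b)))
```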

3.4 Preliminary Work

Our initial research task was to classify the genre or artist of a musical piece. A supervised-learning approach was employed to classify the artist of a group of songs or the genre of an individual song, respectively. The actual approach, shown in Figure 9, is very similar to many of the approaches of previous researchers and was initially based on the approach of Mandel and Ellis [3][8]; the main difference is in the classification stage of the algorithm.

Figure 9: Supervised-learning Approach for Artist and Genre Classification

The initial step in any classifier generation is to compile the data used to train the classifier. To develop a classifier, we first had to define the genres that would be used and gather a data set that would be divided into training and testing sets. The training data consists of fifty-seven songs from five different genres: Rap, Rhythm and Blues, Rock, Pop, and Gospel. Each song in the training set was an mp3 version of the song that was downsampled and converted to mono. In the artist classification task, the initial dataset differed significantly from the dataset used for genre classification. The complete dataset consisted of songs from 18 different artists. Each artist had approximately 5 albums included in the dataset. The songs from each artist used in this system were gathered from the uspop2002 music database [19]. Most of the artists in the dataset can be categorized as being in the Rock or Pop genres, and most of them were most popular in the 1970s, 1980s, or early 1990s. As in the genre classification case, the entire song was used and MFCC features were extracted from each song. Different versions of mp3 encodings were

used in this song database, but as discussed in Section 3.2 this has little effect on the performance of the classification algorithm. Before extracting features and modeling the training dataset, each song was first downsampled to the lower sampling rate and converted to mono.

3.4.1 Feature Computation

Using the MFCC function found in the MATLAB Auditory Toolbox [reference], thirteen MFCCs were generated for each frame of each song, resulting in a 13 x N matrix (N: number of frames) for each song in the genre classification database. The MFCC matrices from the songs of a genre were concatenated to form a 13 x L matrix; L is equal to the number of songs that make up a genre, k, multiplied by the number of frames in each song, N (L = k*N). This was done for every genre in the data set, resulting in five MFCC-genre matrices. The features of the training data were normalized by first computing the Gaussian statistics over all of the data. The multidimensional mean was then subtracted from each MFCC frame, and this result was divided by the variance. After normalizing the data set, each genre was modeled as a single Gaussian probability distribution. Therefore, each genre was defined by a mean vector (size 13 x 1) and a covariance matrix (size 13 x 13), generated by taking the mean and covariance of the MFCC-genre matrix over time. The data used to test each classifier consists of a total of 20 songs: four songs from each genre. Again, using the built-in MFCC command, the MFCCs of each song are generated, resulting in twenty different 13 x N MFCC-song matrices. The MFCCs for the testing dataset are normalized with the mean and variance of the training dataset. Like the MFCCs used for genre classification, the MFCCs used to classify the artist of a song are multidimensional matrices. However, using the melfcc function in MATLAB, 20-dimensional MFCCs were computed for each song, resulting in a 20 x N matrix of MFCCs for each song. For both datasets, the MFCCs for each artist were grouped together in one large matrix, resulting in 36 MFCC matrices (18 for the artists in the training set and another 18 for the artists in the testing set). Note that the MFCCs for each song were stored in separate matrices and were not discarded. Each artist was modeled as a single-Gaussian distribution with a mean vector of size 20 x 1 and a full covariance matrix of size 20 x 20.
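The modeling procedure can be sketched as follows (a NumPy stand-in with toy data, not the MATLAB code used here): concatenate the per-class MFCC matrices, normalize with the training-set mean and variance, and summarize each class by a mean vector and covariance matrix. These single-Gaussian models are then compared with the symmetric KL divergence of Section 3.3.

```python
# Sketch of building single-Gaussian genre/artist models from concatenated
# MFCC matrices, with the training-set normalization described above.
import numpy as np

def build_models(mfccs_by_class):
    """mfccs_by_class: dict mapping class name -> list of (d x N) MFCC matrices."""
    all_frames = np.hstack([m for mats in mfccs_by_class.values() for m in mats])
    mean = all_frames.mean(axis=1, keepdims=True)            # global Gaussian statistics
    var = all_frames.var(axis=1, keepdims=True)
    models = {}
    for name, mats in mfccs_by_class.items():
        frames = (np.hstack(mats) - mean) / var              # normalize as in the text
        models[name] = (frames.mean(axis=1), np.cov(frames)) # mean vector, covariance
    return models, mean, var

# Toy usage with random "MFCC" matrices standing in for real features.
rng = np.random.default_rng(1)
data = {"Rock": [rng.standard_normal((13, 400)) for _ in range(3)],
        "Rap":  [rng.standard_normal((13, 400)) + 0.5 for _ in range(3)]}
models, mean, var = build_models(data)
print(models["Rock"][0].shape, models["Rock"][1].shape)      # (13,) (13, 13)
```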

Note that the MFCCs for each song were stored in separate matrices and were not discarded. Each artist was modeled as a single-Gaussian distribution with a mean vector of size 20 x 1 and a full covariance matrix of size 20 x 20.

3.4.2 Preliminary Results

Using the previously described methods for genre classification, the genre category of twenty different songs was calculated. The twenty songs consisted of four songs from each genre (Rap, R&B, Gospel, Rock, and Pop). The supervised-learning genre classifier correctly classified 60% of the testing songs. Table 2 shows the actual genre of each testing artist and the genre that our system assigned to each testing song.

Table 2: Results for automatic genre classification (Kullback-Leibler classifier)

Test Artist           Actual Genre    Classified Genre
AZ                    Rap             Rock
Big L                 Rap             Rap
Canibus               Rap             Rap
Da Band               Rap             Rap
Avant feat. Olivia    R&B             R&B
Az Yet                R&B             R&B
Carl Thomas           R&B             Rock
Dionne Warwick        R&B             R&B
Bishop Eddie Long     Gospel          Rock
Favor                 Gospel          R&B
Hezekiah Walker       Gospel          Gospel
Gospel Remixes        Gospel          R&B
MatchBox 20           Rock            Rock
Men at Work           Rock            Gospel
The Beach Boys        Rock            Rock
The Beatles           Rock            Rock
98 Degrees            Pop             R&B
Backstreet Boys       Pop             Pop
Christina Aguilera    Pop             Pop
Jessica Simpson       Pop             Rock

Using the same approach as in the genre classifier case, the artist classifier performed substantially better than the genre classifier. Specifically, of the 18 artists represented in the testing dataset, 72% (13 of 18) of the testing models were classified as the correct artist.

An illustration of the KL divergence values between each artist testing model and the training models is shown in Figure 10. By showing the intensity of the KL values, the image depicts how each testing model compares with each training model. Each column of the figure represents a testing model, and a lighter box indicates a lower KL divergence value between the two models. Ideally, there would be a sequence of light boxes along the diagonal of the image, indicating that all testing models were correctly classified.

Figure 10: KL Divergence between artist testing (x-axis) and training (y-axis) models. The lighter the box, the smaller the KL divergence value between the two models.

There are several reasons why this classifier is not performing as desired. One reason is that we model the classes in both the training and testing sets with a single Gaussian distribution. In most cases, the MFCCs that define a song or an artist are clustered into separate regions, which means the song or artist features could be better represented by a collection of Gaussians capable of capturing the different clusters within each MFCC feature set.

Using Gaussian Mixture Models (GMMs) instead of single Gaussians may improve the classification results, but a different distance measure between models would then be needed, because the Kullback-Leibler divergence does not have a closed-form expression for mixtures of Gaussians.

Another reason for these results concerns the supervised-learning approach itself, in which a database is divided into training and testing sets. This approach assumes that a class model can be completely represented by a particular combination of songs from an artist or genre. This raises the questions of how the defining songs for a training model should be selected, and whether it is even possible to completely characterize a model with a finite set of songs. This is known as the ground-truth issue of the supervised-learning methodology [17]. The main problem is that if the testing and training models were re-arranged, the classification results would more than likely change, and there is no obvious way to automatically determine which models will produce the most accurate results. There do not appear to be direct answers to the questions associated with establishing a ground truth, which is one reason why using a supervised-learning approach to classify a testing model is extremely difficult.

Although these results are by no means indicative of an automatic artist or genre classification system that is ready to be distributed to consumers, they are comparable with other systems that have tackled this area of research. In [3], an automatic artist classification system was developed and its accuracy was tested using various features to represent an artist. Overall, using six different features, Mandel was able to correctly classify the artist of a testing model only 32% to 72% of the time. The accuracy of our system and of systems from other researchers shows that there is a long way to go before artist and genre classifiers are ready for practical use.

3.5 Music Similarity

The inherent problem with the supervised-learning approach for artist and genre classification lies in accurately defining the training set, and in whether this is actually possible. The training set implies that an artist is defined by a particular group of songs. How can one say that these are the songs that are most important to that artist? How can one be sure that the correct songs or albums were used?

This brings us to the problem of establishing a ground-truth dataset, and to the question of whether it is in fact possible to define the ground truth for a particular genre or artist.

The vagueness associated with genres led us to believe that a genre classifier may not be that useful. Since many people have their own definitions of genres, we concluded that even if we developed a genre classifier, some individuals might not agree with the classifier's results. We therefore decided to revise our initial goals and work toward an automatic algorithm that assesses the similarity between two songs. All listeners, regardless of their opinions about genre, can benefit from a system that is capable of quantifying song similarity.

Song or music similarity only slightly differs from artist or genre classification. The main difference is that, for music similarity, there is no need to establish a ground truth that defines a particular artist or genre; instead, each song is defined by its own features, which in our case are Mel-frequency cepstral coefficients (MFCCs). The dataset was still divided into a training set and a testing set for the evaluation of our algorithm, which will be explained later.

The aim of this music similarity algorithm is to employ a hybrid supervised-learning technique that quantifies the similarity between all song models. An artist classifier is trained using a set of pre-labeled data to determine how individual songs are grouped together. Our initial dataset consisted of 1277 songs from 20 different artists (two artists were added to the dataset used for artist classification). On average, each artist was represented by songs from 5 albums, which may be studio recordings, live performances, or greatest-hits albums. The genres of the artists are: Soft Rock, Hard Rock, Classic Rock, 80s Pop, 90s Pop, Rap/Hip Hop, and Rhythm & Blues. The pre-labeled (training) data consisted of 3 albums from each of the 20 artists in the dataset, totaling 747 songs. The unlabeled (testing) set consisted of the remaining albums from each artist, approximately 2 per artist, for a total of 530 songs. The supervised-learning approach is used to classify a given test song: each test song in the set is compared with the artist-feature sets of the training data.
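To make the distance measure at the heart of both the classification and similarity stages concrete, here is a minimal sketch (not the thesis implementation, which used MATLAB): each song is modeled as a single Gaussian over its MFCC frames, and two models are compared with the symmetrized Kullback-Leibler divergence, which has a closed form for single Gaussians. The librosa library, the 22,050 Hz rate (librosa's default, not necessarily the thesis setting), and the file path and artist_models dictionary in the usage comment are illustrative assumptions.

import numpy as np
import librosa

def song_model(path, n_mfcc=20, sr=22050):
    """Return the (mean, covariance) Gaussian model of one song's MFCC frames."""
    y, sr = librosa.load(path, sr=sr, mono=True)            # downsample and convert to mono
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, number of frames)
    return mfcc.mean(axis=1), np.cov(mfcc)                  # mean vector, full covariance matrix

def kl_gaussian(p, q):
    """Closed-form KL(p || q) for multivariate Gaussians p = (mu, cov) and q = (mu, cov)."""
    (mu_p, cov_p), (mu_q, cov_q) = p, q
    d = mu_p.shape[0]
    inv_q = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(inv_q @ cov_p) + diff @ inv_q @ diff - d + logdet_q - logdet_p)

def sym_kl(p, q):
    """Symmetrized KL divergence used as the song-to-song (or song-to-model) distance."""
    return kl_gaussian(p, q) + kl_gaussian(q, p)

# Example usage (hypothetical paths and artist models):
# artist_models = {"Artist A": (mu_a, cov_a), "Artist B": (mu_b, cov_b)}
# test = song_model("some_test_song.mp3")
# best_artist = min(artist_models, key=lambda a: sym_kl(test, artist_models[a]))

Classification then amounts to choosing the artist model with the smallest symmetrized KL divergence to the test song, and the same distance can be reused between individual song models to assess similarity.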

The process of this music similarity model begins with extracting features, again MFCCs, from the pre-labeled training dataset. For the artist-classification stage, the features from every song by a given artist are grouped together into one feature representation, so there is a feature set for each artist in the dataset, for a total of 20 artist-feature sets. During classification, the distance between each artist model and each test song is computed, and these results are later extended to build a similarity matrix. The classifying artist for each test song is computed and stored, and each test song is classified to determine whether our method can accurately distinguish between the various artists in the dataset. Besides classifying the testing songs, a distance is also computed between every pair of songs, both training and testing, in the dataset. This gives our algorithm the ability to assess overall song similarity between every song in the set regardless of artist. The procedure for assessing similarity between models is displayed in Figure 11.

Figure 11: Song Similarity Approach

3.5.1 Similarity Matrix

After computing and storing the classifying artist for each test song, the next step was to determine the similarity between every pair of songs in the entire dataset. This was accomplished by generating a square N x N matrix that stores the similarity values between all songs, where the similarity value is the symmetrized KL divergence between two songs. Since the data was subdivided into training and testing sets, the similarity matrix was assembled from four separate sub-matrices that were then concatenated together.

The four sub-matrices were generated by (1) computing the KL divergence values between all of the songs in the testing set and all of the songs in the training set; (2) computing the KL divergence values between the testing-set songs and themselves; (3) computing the KL divergence values between the training set and the testing set; and (4) computing the KL divergence values between the training-set songs and themselves. Note that the matrices computed for (1) and (3) are transposes of each other. The matrices in (2) and (4) are both square, of sizes 550 by 550 and 767 by 767, respectively; the average song from each artist is also stored in both cases, resulting in an additional 20 comparisons for both matrices. The average song is not an actual song from an artist; rather, it is represented by the mean and covariance over all songs from that artist in the training dataset. The final similarity matrix was formed by concatenating the four matrices into one 1317 by 1317 square matrix.

3.5.2 Multidimensional Scaling Visualizations

One method of showing the similarity between the songs is to perform multidimensional scaling (MDS) on the square similarity matrix. MDS is a set of statistical techniques for exploring similarities or differences in data, often displayed visually [11]. MDS is used to scale high-dimensional data into a lower dimensionality so that it can be visually displayed. The dimensionality of the similarity matrix is considered high because each song model has N distance values (N is the total number of songs), and N is usually very large. MDS is performed on a square distance matrix, which consists of the distances between all models in the dataset.

Multidimensional scaling works by performing eigenanalysis related to the distance matrix. The first step is to transform the distance matrix into a cross-product matrix: since distance matrices are not necessarily positive semi-definite, eigen-decomposition cannot be performed on them directly, but it can be performed on the cross-product matrix. A distance matrix D is transformed into a cross-product matrix S through matrix multiplication with a centering matrix C:

S = -(1/2) C D C^T

The centering matrix provides a reference origin for the distances between the data models. It is based on a mass vector m that provides a weight for each element of the distance matrix; this weight can be used to give more or less significance to particular elements of the distance matrix.

All elements of the mass vector are positive and sum to one. After the mass vector is defined, the square centering matrix is obtained as

C = I - 1 m^T

where I is the identity matrix and 1 is a column vector of ones (with equal masses, m_i = 1/N, this reduces to the usual double-centering of classical MDS). No information is lost in calculating the cross-product matrix; the distance matrix can be recovered from it. Once the cross-product matrix S has been obtained, eigen-decomposition can be performed on it, giving the relationship between the eigenvalues, the eigenvectors, and the cross-product matrix:

S = U Λ U^T

where U is the matrix of eigenvectors and Λ is a diagonal matrix of eigenvalues. Finally, the scores F are computed by projecting the rows onto the principal components of the analysis of the cross-product matrix:

F = U Λ^(1/2)

The score matrix is the final quantity needed to visually display the similarity between the models in the distance matrix.

3.5.3 System Evaluation

The accuracy of this similarity system will be determined by comparing the results of the machine-learning similarity approach with a user-dependent surveying system. Similar to the Evalutron 6000 system, a set of test songs will be compared with a certain number of candidate songs. Users will then rate the similarity between the test songs and their candidates, while also explaining what makes the two comparison songs similar and/or different.

User Survey

The ultimate success of this algorithm, or any other, is contingent upon the agreement of user opinion with the final music similarity results. It would be convenient to believe that our system alone is capable of accurately defining the similarity between a seed song and all other possible songs, but since users will ultimately benefit from this research, it is imperative that they be involved in the process.

That being said, our automatic music similarity assessment algorithm has been evaluated by various individuals to quantify the system's strengths and shortcomings.

Before users could evaluate our system, we first had to determine which seed songs would be used for evaluation. The seed songs that were rated against various candidate songs were determined by selecting a set of songs from each artist. We wanted to ensure that a large portion of our test songs, from various artists, was rated; we did not want each user to rate only songs from the same artist, because this could bias the similarity results toward artist identity rather than the songs themselves. With this in mind, each user was assigned a set of seed songs from a particular artist. Since our system at the time included songs from 20 different artists, we searched for 20 users, each somewhat familiar with the songs of a particular artist. We were able to gather a wide range of individuals, including graduate students, undergraduate students, professors, and some corporate professionals.

After gathering the group of users and assigning each of them an artist, the next task was to determine which songs from each artist would be used as seed songs and how to determine the prospective candidate songs. The seed songs for each artist were the 15 testing songs from that artist with the closest nearest neighbors; in other words, of all of the testing songs from a particular artist, the 15 songs with the shortest distance to their closest neighbor were used as that artist's seed songs. These songs were used because we did not want to use a seed song that our system already considers vastly different from its nearest neighbor.

After extracting the 15 songs for each artist, the next task was to determine which candidate songs to use and how many candidate songs would be compared with a single seed song. The number of candidate songs for each seed song depends on the artist classification of the seed song. Recall that each testing song in the database was compared with all of the artist training models and classified as one of the training artists. Regardless of whether the artist was classified correctly, a seed song was initially assigned 5 candidate songs. The first two candidates were the two closest songs to the seed song regardless of artist. If the seed song was incorrectly classified, the remaining 3 songs were the two songs that define the classifying artist and the song from the classifying artist that is closest to the seed song.

The two songs that define the classifying artist were initially included in the survey to give the user a sense of the style and type of music from that artist. The defining songs were determined by first computing the average song for each artist in the training set: all of the MFCC frames from every training song by a particular artist were concatenated, and the mean over time of the frames and the covariance of this matrix were computed and stored as the defining representation of that artist. Note that this is not an actual song but an approximation represented by a mean vector and a covariance matrix. The two defining songs of an artist are then the two songs that are closest, in terms of KL divergence, to the average song.

After an initial trial survey with individuals in our lab group, we realized that the survey was extremely long and needed to be modified. We therefore decided to base the candidate songs on 3 conditions: a song from the artist of the seed song, a song from any other artist (excluding the seed song's artist), and a song from the classifying artist. This was a far better implementation of the survey, because similarity would not be based solely on songs from the same artist, and it eliminated the trivial comparisons between a seed song and the two defining songs of the classifying artist. In all cases, these 3 candidates were the closest songs from their respective groups to the seed song. To summarize, each user rated the similarity between the 15 seed songs from an artist and all of their comparison songs; the number of candidate songs depends on the classification of the seed song, with a maximum of 45 comparisons per user (15 seed songs x 3 candidates).

Excel Survey Sheet

After generating the list of all seed songs and their prospective candidate songs for each artist, a Microsoft Excel sheet (Figure 12) was developed for each artist to store the users' survey responses. On each Excel sheet, the first comparison was between the two defining songs of the artist. For all comparisons, similarity was assessed using 4 different measures. The first measure was a simple slider rating scale that assesses similarity on a scale from 0 to 10 in increments of 1, with 0 meaning no similarity between the two songs at all and 10 meaning perfect similarity.

Figure 12: Survey used to assess the effectiveness of our song similarity system

Users were also asked to specify what was similar and what was different between the two comparison songs. In both cases, the user had to indicate whether the two songs had similarities and/or differences in timbre, instrumentation, the vocal qualities of the performer, the mood of the song, the tempo, and the intensity. If none of these categories fit, or if something else also described their relation, users could specify that relationship. To further explain each category: timbre is defined as the overall sound quality of a musical piece. Instrumentation describes which instruments were used and how they were used. Mood classifies a song as, for example, sad and depressing or upbeat and energetic. Vocal qualities indicate, for instance, whether the singer has a low-pitched bass voice or a soulful, country-sounding voice. Tempo is fairly self-explanatory: slow-, medium-, or fast-paced. Intensity describes the overall loudness or energy of the song; a high-intensity example would be a heavy metal song with loud drums and electric guitar, whereas a low-intensity song would sound soft and relaxing. The last measure used to rate similarity asks whether the candidate song could have been created or performed by, or otherwise associated with, the artist of the seed song.

iTunes and Web-based Music Server

The music was made available to the survey participants through an iTunes playlist or an online password-protected website. Users who had access to the Music and Entertainment Technology (MET) Lab's local subnet of the Drexel University network could see the playlist for a particular artist via iTunes' shared-playlist capability. This method did not infringe upon copyright protection laws, since users could only play the music and not copy it to their local machines. For individuals who did not have access to our particular subnet of Drexel University's computer network, we made each artist's playlist available through a password-protected website. Survey participants who chose to listen to the music in this manner could log on to the site and click on the songs for comparison. Again, since the site was password-protected and users could not download the songs, copyright was not infringed.

4. Music Similarity Results

The automatic music similarity assessment system can be evaluated in several ways. One form of evaluation is to visually show how songs from various artists map against each other. Using multidimensional scaling (MDS), the similarity between songs can be displayed in a three-dimensional space. After the square similarity matrix between all songs in the dataset was generated and MDS was performed on it, the following similarity visualization was created.

Figure 13: Similarity between songs in dataset displayed using MDS

This first illustration shows how all of the songs map against each other and highlights the distribution of songs from each artist. The songs from each artist tend to cluster together, but there is definite overlap between songs from different artists. Within this plot, the song title and artist of each model are also displayed. By displaying this information, it is possible to visually determine which songs our system considers similar (points that are close to each other) and dissimilar (points that are far apart). Whether an individual agrees with these judgments is another issue, but the MDS plot allows a person to visually assess the similarity between two songs in a three-dimensional space.

Currently, it is not known what the three dimensions of the MDS plot represent; further research will have to be conducted to make that determination.

Another MDS plot that can be used to assess the similarity between songs is based on the artist classification of each testing song (Figure 14). For a testing song from a particular artist, the corresponding training model of that artist is also displayed in the image, making it possible to visually determine how closely related the testing song and the artist-training model are to each other. The average song of the artist-training model is also displayed.

Figure 14: MDS plot of song by Bryan Adams and its visual comparison with other songs

Along with the artist-training model of the actual artist of the testing song, the artist-training model of the classifying artist is also depicted in this plot if the testing song was incorrectly classified. This makes it possible to see where a testing song maps against the training model of the classifying artist as well as that of the actual artist. The figure helps explain why certain testing songs are misclassified.

This is shown visually by comparing the location of the testing song with the locations of the average songs of the training models of the actual and classifying artists, and also by comparing the location of the testing song with the locations of the points belonging to the training models of the actual artist and the classifying artist.

4.1 Evidence of Album Effect

Our music dataset contains some original recordings of hit songs as well as re-releases of those songs on greatest-hits albums. These re-releases are exactly the same versions as the original recordings, but they have been re-mastered to fit the characteristics of the new album on which they appear. Mastering consists of audio processing (e.g., equalization, dynamic range manipulation, spatial sound effects) that is applied to all of the individual songs of an album after they have been mixed, to provide a consistent sound quality across the album [12]. One would expect that, since these two versions are exactly the same recording, our system would determine that they are identical. This, however, is not the case, and it can be shown visually with an MDS plot. The MDS plots below show the effect that the re-mastering process has on songs and help demonstrate the existence of the album effect. The plots show the point that represents the original recording and how its location in the song space changes after it is re-mastered. The two plots depict how duplicate songs map against each other for songs by the group U2 (Figure 15) and for a set of pre- and post-mastered songs from a compilation album produced by Drexel University's record label, Mad Dragon Records (Figure 16). The numbers identify the song title, whereas the color identifies the album the song appears on.

Figure 15: Comparison of re-mastered songs with the originals using MDS

Figure 16: Comparison of pre- and post-mastered songs using MDS

In both cases the (re-)mastered songs move substantially in the projected song space, with a tendency toward clustering of songs by album [12]. Since songs cluster together by album, this could be negatively affecting both the artist classification and the song similarity systems: instead of grouping songs by artist, our system may effectively be grouping them by album.

The music similarity assessment algorithm is affected because songs from the same album will tend to produce lower KL divergence values, since they are clustered together. Songs from the same album may not actually sound similar to each other, but because they were mastered in a similar manner, the system will judge them to be similar.

4.2 Survey Analysis

The most useful way to evaluate our song similarity algorithm is to incorporate user feedback. The user survey provides critical information that can be used to improve our system while also highlighting some of its strengths. As previously mentioned, twenty different surveys were completed by twenty different participants, and from their responses we carried out several analyses of the system's performance.

One such analysis was to determine the correlation between the slider ratings, with which users gave a fine-grained score to each pair of comparison songs, and the Kullback-Leibler divergence values that our system used to determine song similarity. Since our system judges similarity by the magnitude of the KL divergence between two song models, we would hope that these magnitudes correlate with the slider ratings. Specifically, a small KL divergence between two models signifies that they are similar and should correspond to a high value on the slider scale; similarly, a large KL divergence, signifying that the two models are dissimilar, should correspond to a low slider rating. The overall correlation between the KL divergence values and the slider ratings of each user is shown graphically in Figure 17. Ideally, the data would be grouped around a negatively sloping line running from the top left corner of the figure to the bottom right corner. This, however, is not the case, and the plot does not depict an ideal correlation between the KL divergence values and the user ratings.
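As an aside, the kind of best-fit line referred to in Figures 17 and 20 can be obtained with an ordinary least-squares fit. The thesis does not specify how its fit was computed, so the following is only a sketch of one reasonable approach, with randomly generated placeholder values standing in for the actual survey responses.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder data standing in for the real survey results (hypothetical values):
kl_values = rng.uniform(0, 60, size=200)                  # symmetrized KL per rated song pair
slider_ratings = np.clip(8 - 0.05 * kl_values + rng.normal(0, 2, size=200), 0, 10)

# Ordinary least-squares fit of slider rating against KL divergence.
fit = stats.linregress(kl_values, slider_ratings)
print(f"slope={fit.slope:.3f}  intercept={fit.intercept:.3f}  r={fit.rvalue:.3f}")

The sign and magnitude of the fitted slope (and the correlation coefficient r) then summarize how strongly the slider ratings decrease as the KL divergence grows.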

Figure 17: Correlation between KL divergence values and slider ratings. The slope of the estimate of the best-fit line is

There are several possible reasons for this outcome. One reason lies in the manner in which the user survey was constructed. In this system, a testing song was compared with a song from the same artist that had a small KL divergence to it, with a song from a different artist that had a small KL divergence to it, and with the song from the classifying artist that produced the smallest KL divergence to the testing song. All of these candidates therefore have small KL divergence values with respect to the testing song. Presenting a user only with songs that should all be similar to the testing song is quite possibly a flawed way of gathering feedback, because users may assess similarity relative to a prior comparison. For instance, if a user gives a value of 7 to the first comparison and then feels that the second pair is less similar than the first, they may give the second comparison a 4 or 5 simply because it is less similar than the initial comparison. Instead of using only similar songs, it might be more informative to use three candidates with very different KL divergence values to the testing song: one with a small KL divergence, and the other two with moderate and large values, respectively.

This would remove any relative rating scheme that users subconsciously apply when evaluating similarity. Berenzweig used a similar approach for his survey assessment and obtained moderate results, as discussed in [2].

Surveys are by definition subjective measures, because different participants use their own knowledge and background to respond to the questions. Since users may judge similarity differently, it may be necessary to first quantify how users rate the same material; if there are strong differences between their responses, the results may need to be normalized against each other. For this reason, a second survey was given to each participant, this time with the same song-pair comparisons for every user. This was done to determine whether the participants rate the same songs similarly and, if not, to take steps to account for the differences. In this calibration survey, users were asked to rate the similarity of 7 song pairs (14 songs in total) in the same way as in the survey previously given to them. After gathering the surveys, we analyzed them to determine whether normalization of the original survey responses was necessary. Figure 18 shows the rating from each participant for each song pair, with the mean rating for each comparison displayed in black. As can be seen, some users deviate significantly from the average value for certain comparison pairs.

Figure 18: Preliminary results from the calibration survey

The figure shows that, for some song pairs, certain users deviate significantly from the average value, which means that normalization of the survey responses is needed. This was accomplished by computing, for each survey participant, the average deviation from the mean rating over all song-comparison pairs (Table 3).

Table 3: Average deviation from the average slider rating for each survey participant (columns: Survey Participant, Average Deviation from the Mean)

This average deviation was then subtracted from the corresponding user's response at each song-pair comparison. The next plot, Figure 19, shows how the normalized slider ratings for the second survey compare. As can be seen, the responses from all of the users now follow the same trend.

Figure 19: Calibration survey response after user normalization

The final step is to subtract this average deviation from each corresponding user's responses to the initial survey, in order to normalize the user ratings. The resulting correlation between the KL divergence values and the normalized slider ratings is shown in Figure 20. Compared with the initial correlation plot in Figure 17, this figure shows that even with user normalization there is little improvement in the correlation between KL values and slider ratings. We therefore decided to perform further analysis on the raw (non-normalized) slider ratings.
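For illustration, the per-user calibration just described can be sketched as follows: each participant's average deviation from the mean rating on the shared calibration pairs is subtracted from that participant's original survey ratings. This is not the thesis code; the small rating arrays are hypothetical placeholders standing in for the actual calibration and survey responses.

import numpy as np

# Placeholder calibration data: rows = participants, columns = the 7 shared song pairs.
calib = np.array([
    [7, 5, 2, 8, 4, 6, 3],
    [9, 6, 4, 9, 6, 8, 5],
    [5, 4, 1, 6, 3, 5, 2],
], dtype=float)

pair_means = calib.mean(axis=0)                  # mean rating for each calibration pair
user_bias = (calib - pair_means).mean(axis=1)    # average deviation per participant (as in Table 3)

# Normalize each participant's original survey ratings by removing their bias.
main_ratings = np.array([
    [6, 7, 3],   # participant 0's ratings of their seed/candidate pairs
    [8, 9, 5],   # participant 1
    [4, 5, 2],   # participant 2
], dtype=float)
normalized = main_ratings - user_bias[:, None]
print(normalized)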

Figure 20: Correlation between KL divergence values and slider ratings after user normalization. The slope of the best-fit line is

Another possible reason for the lack of correlation between the KL divergence values and the slider ratings is that the songs in each pair simply are not similar to each other. This would mean that there is a major flaw in our system and that drastic changes would be needed to improve these results. This possibility cannot be dismissed, but other studies by different researchers have returned similar results [7]. In particular, in the music similarity task held in conjunction with the ISMIR 2006 conference, six different research teams with six different similarity algorithms all returned comparable results [7]. Figure 21 shows the average slider rating given to song pairs for all six systems, along with the distribution of slider ratings for all song comparisons in our system. The average slider ratings for their systems lie approximately between 2.6 and 4, whereas our algorithm produced an average slider rating of approximately 3.5 (Figure 21). This average value is currently the only way for our system to be evaluated against other, similar algorithms.

Figure 21: Comparison between similarity systems. Left: other systems (from [7]). Right: our approach.

Even though ideal agreement between the KL divergence values and the user slider ratings was not obtained, the survey did provide valuable insight into the types of features our system uses to determine similarity and into how users themselves judge similarity. On the survey, users had the option of selecting what was similar and what was different between the two songs in each pair. The available check boxes were: timbre, instrumentation, vocal qualities, tempo, mood, and intensity, each defined in more detail in the user survey section above. From these responses, we analyzed the correlation between which features were selected and the corresponding slider values; the similarity check boxes make it possible to determine which types of features users rely on when rating similarity. One useful illustration is the correlation between the selected features and the slider values, shown in Figure 22. This type of illustration allows us to determine which features are associated with ratings of high similarity (slider rating of 10) and low similarity (slider rating around 0).
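The kind of analysis behind Figure 22 and Table 4 can be sketched as follows: for each similarity checkbox, compute how often it was selected at each slider value, then fit a line of occurrence fraction versus slider rating. This is only an illustration of the idea; the flags and ratings below are randomly generated placeholders, not the thesis data.

import numpy as np

features = ["timbre", "instrumentation", "vocal qualities", "tempo", "mood", "intensity"]
rng = np.random.default_rng(1)

# Placeholder survey responses: a slider value (0-10) and checkbox flags per comparison.
n = 300
sliders = rng.integers(0, 11, size=n)
checks = {f: rng.random(n) < (0.1 + 0.05 * sliders) for f in features}  # fake checkbox flags

for f in features:
    # Fraction of comparisons at each slider value where this feature was checked.
    frac = np.array([checks[f][sliders == s].mean() if np.any(sliders == s) else 0.0
                     for s in range(11)])
    slope, intercept = np.polyfit(np.arange(11), frac, 1)  # least-squares line, as in Table 4
    print(f"{f:16s} slope={slope:+.3f}  intercept={intercept:.3f}")

A feature whose fitted line approaches a slope of 0.1 and an intercept of 0 is selected roughly in proportion to the slider rating, which is the "ideal trend" discussed below.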

Figure 22: Correlation between selected similarity features and the slider rating values

The plot shows that, at every slider value, instrumentation is the most frequently selected feature. This does not mean that users view instrumentation as the most important feature for rating similarity; it simply means that users often judge the instrumentation of the two songs to be similar, since instrumentation is checked fairly often even for pairs given low slider values. The feature(s) that best describe how individuals quantify similarity would ideally follow a line with a slope of 0.1 and a y-intercept of zero: such a feature would never be checked for pairs rated 0 and would be checked for every pair rated 10, so its selection rate would track the similarity rating directly. Table 4 shows the slopes and y-intercepts of all of the lines in Figure 22.

Table 4: Slope and intercept values for the lines in the slider ratings versus similarity features plot (columns: Feature, Slope, Y-Intercept; rows: Timbre, Instrumentation, Vocal Qualities, Tempo, Mood, Intensity)

The features that best fit this ideal trend are timbre and vocal qualities. This suggests that future features should take into account the timbre and the vocal characteristics of the performer. This is an extremely useful finding, because the purpose of this research is to gain a better understanding of how humans rate song similarity, and a feature that better mimics this behavior may radically improve song similarity results. However, this is easier said than done, because extracting the vocals of a song by separating them from the accompaniment is an entirely different research area that still needs further development.

Another useful analysis is to determine how the selected features vary with the KL divergence values. Figure 23 shows how the reported similarities between songs vary as the KL divergence increases.

Figure 23: Correlation between KL divergence values and similarity features

It shows that for low KL divergence values (ideally, greater similarity between two songs), the feature most often selected to describe similarity is the vocal qualities of the performer. This reiterates that features that better capture the vocal qualities of a song should be used to represent song models. The other similarity features that also correspond to song similarity are timbre and mood. These results imply that MFCC features, together with the KL divergence, capture the vocal qualities, timbre, and mood of the songs. After comparing the percentage of occurrence of each feature against both the slider ratings and the Kullback-Leibler divergence values, we found it useful to display the correlation between the slider ratings, the KL divergence values, and the similarity features on a single three-dimensional plot, shown below.

Figure 24: Illustration of correlation between KL divergence, slider rating values, and similarity features

For all of the feature plots, Figure 24 shows that there is a strong correlation between the KL divergence values and the user slider ratings for KL values below 30. The features with the strongest correlation between slider ratings and KL values are timbre and vocals, as indicated by their magnitudes at high slider ratings. This suggests that the KL divergence is better at picking out timbre and vocal similarities than the other similarity attributes.

As previously mentioned, the candidate songs that were compared with a seed song either (a) came from the same artist as the seed song, (b) came from the classifying artist determined in the pre-processing step, or (c) came from any other artist. It is useful to examine how the slider ratings vary across these different types of candidate songs. Ideally, one would expect that, for two songs from the same artist, the percentage of occurrence would increase as the slider value increases. This is borne out in Figure 25, which also shows how the percentage of occurrence for the classifying and non-classifying artists varies with the slider value.
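The breakdown shown in Figure 25 can be sketched as follows: at each slider value, compute the share of ratings contributed by each candidate-song type. Again, this is only an illustrative sketch with randomly generated placeholder data, not the thesis code or data, and the exact normalization used in Figure 25 may differ.

import numpy as np

types = ["same artist", "classifying artist", "other artist"]
rng = np.random.default_rng(2)

# Placeholder ratings: a 0-10 slider value and a candidate type for each comparison.
n = 400
sliders = rng.integers(0, 11, size=n)
cand_type = rng.choice(types, size=n)

for s in range(11):
    mask = sliders == s
    total = int(mask.sum())
    if total == 0:
        continue
    # Fraction of the ratings at this slider value that came from each candidate type.
    shares = {t: float(np.mean(cand_type[mask] == t)) for t in types}
    print(s, total, {t: round(v, 2) for t, v in shares.items()})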

Figure 25: Comparison of normalized distributions for each candidate song type. The number shown at the top of each bar, at a particular slider value, is the total number of songs that were given that rating value.

Specifically, Figure 25 shows that at lower slider values the percentage of occurrences belonging to the same artist is the lowest, whereas non-classifying artists contribute songs with low slider ratings to the same seed songs more often than the other types of candidates. Likewise, as the slider value increases, the percentage of same-artist songs increases while the percentage for the non-classifying artists decreases. The percentage of occurrence for the classifying artist stays fairly constant as the slider value increases; however, at high slider values it is much higher than the percentage of occurrence for the non-classifying artists. This illustration points out a few things: (1) users more often than not believe that songs from the same artist are most similar, and (2) even though our system does not always correctly classify the artist of a song, users feel that the classifying artist is still more similar to the seed song at higher slider values (greater similarity). This result is also illustrated in Figure 26, which shows how the percentage of occurrence for each artist type varies with the similarity feature. The percentage of occurrence for the system-determined artist is higher than the percentage of occurrence for any other artist for each similarity feature. This means that if it was desired to find a similar song to some seed song, where the similar song must be selected from the system-determined artist or from


More information

Music Segmentation Using Markov Chain Methods

Music Segmentation Using Markov Chain Methods Music Segmentation Using Markov Chain Methods Paul Finkelstein March 8, 2011 Abstract This paper will present just how far the use of Markov Chains has spread in the 21 st century. We will explain some

More information

Unifying Low-level and High-level Music. Similarity Measures

Unifying Low-level and High-level Music. Similarity Measures Unifying Low-level and High-level Music 1 Similarity Measures Dmitry Bogdanov, Joan Serrà, Nicolas Wack, Perfecto Herrera, and Xavier Serra Abstract Measuring music similarity is essential for multimedia

More information

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Aric Bartle (abartle@stanford.edu) December 14, 2012 1 Background The field of composer recognition has

More information

MPEG-7 AUDIO SPECTRUM BASIS AS A SIGNATURE OF VIOLIN SOUND

MPEG-7 AUDIO SPECTRUM BASIS AS A SIGNATURE OF VIOLIN SOUND MPEG-7 AUDIO SPECTRUM BASIS AS A SIGNATURE OF VIOLIN SOUND Aleksander Kaminiarz, Ewa Łukasik Institute of Computing Science, Poznań University of Technology. Piotrowo 2, 60-965 Poznań, Poland e-mail: Ewa.Lukasik@cs.put.poznan.pl

More information

Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors

Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors Priyanka S. Jadhav M.E. (Computer Engineering) G. H. Raisoni College of Engg. & Mgmt. Wagholi, Pune, India E-mail:

More information

EE391 Special Report (Spring 2005) Automatic Chord Recognition Using A Summary Autocorrelation Function

EE391 Special Report (Spring 2005) Automatic Chord Recognition Using A Summary Autocorrelation Function EE391 Special Report (Spring 25) Automatic Chord Recognition Using A Summary Autocorrelation Function Advisor: Professor Julius Smith Kyogu Lee Center for Computer Research in Music and Acoustics (CCRMA)

More information

Aalborg Universitet. Feature Extraction for Music Information Retrieval Jensen, Jesper Højvang. Publication date: 2009

Aalborg Universitet. Feature Extraction for Music Information Retrieval Jensen, Jesper Højvang. Publication date: 2009 Aalborg Universitet Feature Extraction for Music Information Retrieval Jensen, Jesper Højvang Publication date: 2009 Document Version Publisher's PDF, also known as Version of record Link to publication

More information

SONG-LEVEL FEATURES AND SUPPORT VECTOR MACHINES FOR MUSIC CLASSIFICATION

SONG-LEVEL FEATURES AND SUPPORT VECTOR MACHINES FOR MUSIC CLASSIFICATION SONG-LEVEL FEATURES AN SUPPORT VECTOR MACHINES FOR MUSIC CLASSIFICATION Michael I. Mandel and aniel P.W. Ellis LabROSA, ept. of Elec. Eng., Columbia University, NY NY USA {mim,dpwe}@ee.columbia.edu ABSTRACT

More information

Automatic Classification of Instrumental Music & Human Voice Using Formant Analysis

Automatic Classification of Instrumental Music & Human Voice Using Formant Analysis Automatic Classification of Instrumental Music & Human Voice Using Formant Analysis I Diksha Raina, II Sangita Chakraborty, III M.R Velankar I,II Dept. of Information Technology, Cummins College of Engineering,

More information

Musical Hit Detection

Musical Hit Detection Musical Hit Detection CS 229 Project Milestone Report Eleanor Crane Sarah Houts Kiran Murthy December 12, 2008 1 Problem Statement Musical visualizers are programs that process audio input in order to

More information

The song remains the same: identifying versions of the same piece using tonal descriptors

The song remains the same: identifying versions of the same piece using tonal descriptors The song remains the same: identifying versions of the same piece using tonal descriptors Emilia Gómez Music Technology Group, Universitat Pompeu Fabra Ocata, 83, Barcelona emilia.gomez@iua.upf.edu Abstract

More information

Robert Alexandru Dobre, Cristian Negrescu

Robert Alexandru Dobre, Cristian Negrescu ECAI 2016 - International Conference 8th Edition Electronics, Computers and Artificial Intelligence 30 June -02 July, 2016, Ploiesti, ROMÂNIA Automatic Music Transcription Software Based on Constant Q

More information

Data Driven Music Understanding

Data Driven Music Understanding Data Driven Music Understanding Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Engineering, Columbia University, NY USA http://labrosa.ee.columbia.edu/ 1. Motivation:

More information

Time Series Models for Semantic Music Annotation Emanuele Coviello, Antoni B. Chan, and Gert Lanckriet

Time Series Models for Semantic Music Annotation Emanuele Coviello, Antoni B. Chan, and Gert Lanckriet IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 5, JULY 2011 1343 Time Series Models for Semantic Music Annotation Emanuele Coviello, Antoni B. Chan, and Gert Lanckriet Abstract

More information

Music Information Retrieval

Music Information Retrieval CTP 431 Music and Audio Computing Music Information Retrieval Graduate School of Culture Technology (GSCT) Juhan Nam 1 Introduction ü Instrument: Piano ü Composer: Chopin ü Key: E-minor ü Melody - ELO

More information

IEEE TRANSACTIONS ON MULTIMEDIA, VOL. X, NO. X, MONTH Unifying Low-level and High-level Music Similarity Measures

IEEE TRANSACTIONS ON MULTIMEDIA, VOL. X, NO. X, MONTH Unifying Low-level and High-level Music Similarity Measures IEEE TRANSACTIONS ON MULTIMEDIA, VOL. X, NO. X, MONTH 2010. 1 Unifying Low-level and High-level Music Similarity Measures Dmitry Bogdanov, Joan Serrà, Nicolas Wack, Perfecto Herrera, and Xavier Serra Abstract

More information

Lecture 9 Source Separation

Lecture 9 Source Separation 10420CS 573100 音樂資訊檢索 Music Information Retrieval Lecture 9 Source Separation Yi-Hsuan Yang Ph.D. http://www.citi.sinica.edu.tw/pages/yang/ yang@citi.sinica.edu.tw Music & Audio Computing Lab, Research

More information

Laboratory Assignment 3. Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB

Laboratory Assignment 3. Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB Laboratory Assignment 3 Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB PURPOSE In this laboratory assignment, you will use MATLAB to synthesize the audio tones that make up a well-known

More information

WE ADDRESS the development of a novel computational

WE ADDRESS the development of a novel computational IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010 663 Dynamic Spectral Envelope Modeling for Timbre Analysis of Musical Instrument Sounds Juan José Burred, Member,

More information

Measuring Playlist Diversity for Recommendation Systems

Measuring Playlist Diversity for Recommendation Systems Measuring Playlist Diversity for Recommendation Systems Malcolm Slaney Yahoo! Research Labs 701 North First Street Sunnyvale, CA 94089 malcolm@ieee.org Abstract We describe a way to measure the diversity

More information

Voice & Music Pattern Extraction: A Review

Voice & Music Pattern Extraction: A Review Voice & Music Pattern Extraction: A Review 1 Pooja Gautam 1 and B S Kaushik 2 Electronics & Telecommunication Department RCET, Bhilai, Bhilai (C.G.) India pooja0309pari@gmail.com 2 Electrical & Instrumentation

More information

Detecting Musical Key with Supervised Learning

Detecting Musical Key with Supervised Learning Detecting Musical Key with Supervised Learning Robert Mahieu Department of Electrical Engineering Stanford University rmahieu@stanford.edu Abstract This paper proposes and tests performance of two different

More information

Topics in Computer Music Instrument Identification. Ioanna Karydi

Topics in Computer Music Instrument Identification. Ioanna Karydi Topics in Computer Music Instrument Identification Ioanna Karydi Presentation overview What is instrument identification? Sound attributes & Timbre Human performance The ideal algorithm Selected approaches

More information

Computational Modelling of Harmony

Computational Modelling of Harmony Computational Modelling of Harmony Simon Dixon Centre for Digital Music, Queen Mary University of London, Mile End Rd, London E1 4NS, UK simon.dixon@elec.qmul.ac.uk http://www.elec.qmul.ac.uk/people/simond

More information

Pitch. The perceptual correlate of frequency: the perceptual dimension along which sounds can be ordered from low to high.

Pitch. The perceptual correlate of frequency: the perceptual dimension along which sounds can be ordered from low to high. Pitch The perceptual correlate of frequency: the perceptual dimension along which sounds can be ordered from low to high. 1 The bottom line Pitch perception involves the integration of spectral (place)

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox Final Project (EECS 94) knoxm@eecs.berkeley.edu December 1, 006 1 Introduction Laughter is a powerful cue in communication. It communicates to listeners the emotional

More information

Release Year Prediction for Songs

Release Year Prediction for Songs Release Year Prediction for Songs [CSE 258 Assignment 2] Ruyu Tan University of California San Diego PID: A53099216 rut003@ucsd.edu Jiaying Liu University of California San Diego PID: A53107720 jil672@ucsd.edu

More information

HUMANS have a remarkable ability to recognize objects

HUMANS have a remarkable ability to recognize objects IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 9, SEPTEMBER 2013 1805 Musical Instrument Recognition in Polyphonic Audio Using Missing Feature Approach Dimitrios Giannoulis,

More information

Extracting Information from Music Audio

Extracting Information from Music Audio Extracting Information from Music Audio Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Engineering, Columbia University, NY USA http://labrosa.ee.columbia.edu/

More information

MEL-FREQUENCY cepstral coefficients (MFCCs)

MEL-FREQUENCY cepstral coefficients (MFCCs) IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 4, MAY 2009 693 Quantitative Analysis of a Common Audio Similarity Measure Jesper Højvang Jensen, Member, IEEE, Mads Græsbøll Christensen,

More information

Music Information Retrieval with Temporal Features and Timbre

Music Information Retrieval with Temporal Features and Timbre Music Information Retrieval with Temporal Features and Timbre Angelina A. Tzacheva and Keith J. Bell University of South Carolina Upstate, Department of Informatics 800 University Way, Spartanburg, SC

More information

Can Song Lyrics Predict Genre? Danny Diekroeger Stanford University

Can Song Lyrics Predict Genre? Danny Diekroeger Stanford University Can Song Lyrics Predict Genre? Danny Diekroeger Stanford University danny1@stanford.edu 1. Motivation and Goal Music has long been a way for people to express their emotions. And because we all have a

More information

Analytic Comparison of Audio Feature Sets using Self-Organising Maps

Analytic Comparison of Audio Feature Sets using Self-Organising Maps Analytic Comparison of Audio Feature Sets using Self-Organising Maps Rudolf Mayer, Jakob Frank, Andreas Rauber Institute of Software Technology and Interactive Systems Vienna University of Technology,

More information

ONLINE ACTIVITIES FOR MUSIC INFORMATION AND ACOUSTICS EDUCATION AND PSYCHOACOUSTIC DATA COLLECTION

ONLINE ACTIVITIES FOR MUSIC INFORMATION AND ACOUSTICS EDUCATION AND PSYCHOACOUSTIC DATA COLLECTION ONLINE ACTIVITIES FOR MUSIC INFORMATION AND ACOUSTICS EDUCATION AND PSYCHOACOUSTIC DATA COLLECTION Travis M. Doll Ray V. Migneco Youngmoo E. Kim Drexel University, Electrical & Computer Engineering {tmd47,rm443,ykim}@drexel.edu

More information

EVALUATION OF FEATURE EXTRACTORS AND PSYCHO-ACOUSTIC TRANSFORMATIONS FOR MUSIC GENRE CLASSIFICATION

EVALUATION OF FEATURE EXTRACTORS AND PSYCHO-ACOUSTIC TRANSFORMATIONS FOR MUSIC GENRE CLASSIFICATION EVALUATION OF FEATURE EXTRACTORS AND PSYCHO-ACOUSTIC TRANSFORMATIONS FOR MUSIC GENRE CLASSIFICATION Thomas Lidy Andreas Rauber Vienna University of Technology Department of Software Technology and Interactive

More information

Recognising Cello Performers Using Timbre Models

Recognising Cello Performers Using Timbre Models Recognising Cello Performers Using Timbre Models Magdalena Chudy and Simon Dixon Abstract In this paper, we compare timbre features of various cello performers playing the same instrument in solo cello

More information

Outline. Why do we classify? Audio Classification

Outline. Why do we classify? Audio Classification Outline Introduction Music Information Retrieval Classification Process Steps Pitch Histograms Multiple Pitch Detection Algorithm Musical Genre Classification Implementation Future Work Why do we classify

More information

TOWARD UNDERSTANDING EXPRESSIVE PERCUSSION THROUGH CONTENT BASED ANALYSIS

TOWARD UNDERSTANDING EXPRESSIVE PERCUSSION THROUGH CONTENT BASED ANALYSIS TOWARD UNDERSTANDING EXPRESSIVE PERCUSSION THROUGH CONTENT BASED ANALYSIS Matthew Prockup, Erik M. Schmidt, Jeffrey Scott, and Youngmoo E. Kim Music and Entertainment Technology Laboratory (MET-lab) Electrical

More information

Recognising Cello Performers using Timbre Models

Recognising Cello Performers using Timbre Models Recognising Cello Performers using Timbre Models Chudy, Magdalena; Dixon, Simon For additional information about this publication click this link. http://qmro.qmul.ac.uk/jspui/handle/123456789/5013 Information

More information

UNIVERSITY OF DUBLIN TRINITY COLLEGE

UNIVERSITY OF DUBLIN TRINITY COLLEGE UNIVERSITY OF DUBLIN TRINITY COLLEGE FACULTY OF ENGINEERING & SYSTEMS SCIENCES School of Engineering and SCHOOL OF MUSIC Postgraduate Diploma in Music and Media Technologies Hilary Term 31 st January 2005

More information

Reconstruction of Ca 2+ dynamics from low frame rate Ca 2+ imaging data CS229 final project. Submitted by: Limor Bursztyn

Reconstruction of Ca 2+ dynamics from low frame rate Ca 2+ imaging data CS229 final project. Submitted by: Limor Bursztyn Reconstruction of Ca 2+ dynamics from low frame rate Ca 2+ imaging data CS229 final project. Submitted by: Limor Bursztyn Introduction Active neurons communicate by action potential firing (spikes), accompanied

More information

Tempo and Beat Tracking

Tempo and Beat Tracking Tutorial Automatisierte Methoden der Musikverarbeitung 47. Jahrestagung der Gesellschaft für Informatik Tempo and Beat Tracking Meinard Müller, Christof Weiss, Stefan Balke International Audio Laboratories

More information

Creating a Feature Vector to Identify Similarity between MIDI Files

Creating a Feature Vector to Identify Similarity between MIDI Files Creating a Feature Vector to Identify Similarity between MIDI Files Joseph Stroud 2017 Honors Thesis Advised by Sergio Alvarez Computer Science Department, Boston College 1 Abstract Today there are many

More information

An ecological approach to multimodal subjective music similarity perception

An ecological approach to multimodal subjective music similarity perception An ecological approach to multimodal subjective music similarity perception Stephan Baumann German Research Center for AI, Germany www.dfki.uni-kl.de/~baumann John Halloran Interact Lab, Department of

More information

A Survey of Audio-Based Music Classification and Annotation

A Survey of Audio-Based Music Classification and Annotation A Survey of Audio-Based Music Classification and Annotation Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang IEEE Trans. on Multimedia, vol. 13, no. 2, April 2011 presenter: Yin-Tzu Lin ( 阿孜孜 ^.^)

More information

Comparison Parameters and Speaker Similarity Coincidence Criteria:

Comparison Parameters and Speaker Similarity Coincidence Criteria: Comparison Parameters and Speaker Similarity Coincidence Criteria: The Easy Voice system uses two interrelating parameters of comparison (first and second error types). False Rejection, FR is a probability

More information

A Language Modeling Approach for the Classification of Audio Music

A Language Modeling Approach for the Classification of Audio Music A Language Modeling Approach for the Classification of Audio Music Gonçalo Marques and Thibault Langlois DI FCUL TR 09 02 February, 2009 HCIM - LaSIGE Departamento de Informática Faculdade de Ciências

More information

STRUCTURAL CHANGE ON MULTIPLE TIME SCALES AS A CORRELATE OF MUSICAL COMPLEXITY

STRUCTURAL CHANGE ON MULTIPLE TIME SCALES AS A CORRELATE OF MUSICAL COMPLEXITY STRUCTURAL CHANGE ON MULTIPLE TIME SCALES AS A CORRELATE OF MUSICAL COMPLEXITY Matthias Mauch Mark Levy Last.fm, Karen House, 1 11 Bache s Street, London, N1 6DL. United Kingdom. matthias@last.fm mark@last.fm

More information

Features for Audio and Music Classification

Features for Audio and Music Classification Features for Audio and Music Classification Martin F. McKinney and Jeroen Breebaart Auditory and Multisensory Perception, Digital Signal Processing Group Philips Research Laboratories Eindhoven, The Netherlands

More information