UNIVERSITY OF MIAMI FROST SCHOOL OF MUSIC A METRIC FOR MUSIC SIMILARITY DERIVED FROM PSYCHOACOUSTIC FEATURES IN DIGITAL MUSIC SIGNALS.


UNIVERSITY OF MIAMI
FROST SCHOOL OF MUSIC

A METRIC FOR MUSIC SIMILARITY DERIVED FROM PSYCHOACOUSTIC FEATURES IN DIGITAL MUSIC SIGNALS

By Kurt Jacobson

A Research Project Submitted to the Faculty of the University of Miami in partial fulfillment of the requirements for the degree of Master of Science in Music Engineering Technology

Coral Gables, FL
April 2006

UNIVERSITY OF MIAMI

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Music Engineering Technology

A METRIC FOR MUSIC SIMILARITY DERIVED FROM PSYCHOACOUSTIC FEATURES IN DIGITAL MUSIC SIGNALS

Kurt Jacobson

Approved:

Ken C. Pohlmann, Professor, Music Engineering
Dr. Edward P. Asmus, Associate Dean, Graduate Studies
Fred DeSena, Asst. Prof., Music Theory and Composition
Dr. James Shelley, VP, Academic and Research Systems

JACOBSON, KURT (M.S. in Music Engineering Technology) (April 2006)
A Metric for Music Similarity Derived from Psychoacoustic Features in Digital Music Signals

Abstract of a Master's Research Project at the University of Miami. Research project supervised by Professor Ken Pohlmann.

From purchase to playback, digital music formats are becoming the pervasive mode of music consumption. Technologies like perceptual audio encoding and peer-to-peer networking have enabled even casual enthusiasts to amass large digital music collections. New online music services offer customers millions of song titles for download. Portable digital music players allow listeners to carry thousands of music files in their front pocket. At every level, the amount of digital music content available for consumption is growing to nearly unmanageable proportions. Finding better ways to organize, index, and search digital music collections is the focus of content-based music information retrieval (MIR). A diverse body of research, MIR deals with problems like automatic genre classification, automatic song summarization, and music similarity quantification, among others. This work describes a system for deriving music similarity measures from a set of music signals using digital signal processing techniques. The system employs three distinct dimensions of similarity (timbral similarity, rhythmic similarity, and structural similarity) to place individual songs in a music similarity space. The system is tested on a set of popular music files obtained from the itunes online music store as well as other music collections. Multidimensional scaling of the resulting similarity data is used to visualize song files in the music similarity space and to calibrate the system to estimate genre boundaries. Also, an intelligent jukebox application is implemented that generates playlists based on the system's music similarity measures.

Acknowledgements

I would like to thank my family and friends for their support, especially my mother for proofreading. I would like to thank Ken Pohlmann, Colby Lieder, and James Shelley for their direction and assistance with this work, and Fred DeSena for agreeing to join my thesis committee on such short notice. I would also like to acknowledge and thank Jason Haft for sharing his insightful thoughts on music similarity and Etienne Handman of the Music Genome Project for contributing independent music similarity data for benchmarking this system. I would also like to thank my peers Ben Fields and Nicolas Betancur for their assistance with this work.

Table of Contents

Introduction
1. Music Similarity Defined
  1.1 The Music Genre
  1.2 Psychoacoustic Features
    Melodic Content
    Rhythmic Patterns
    Timbre Features
    Song Structure
2. Design Goals for a Music Similarity System
3. Literature Review
  3.1 MPEG-7
  3.2 Mel-Frequency Cepstral Analysis
  3.3 K-means Clustering
  3.4 Beat Spectrum Analysis
  3.5 Automatic Audio Segmentation
4. Implementation
  4.1 Feature Extraction System
    Tempo Calculation
    MFCC Analysis
    Timbre Model
    Rhythm Model
    Structure Model
  4.2 SimMetrix Model Comparison
    Tempo Similarity
    Timbre Model Similarity
    Rhythm Model Similarity
    Structure Model Similarity
    Combined Model Similarity
5. Experiments
  5.1 Preliminary Tests
    Assumptions about the riddim
    Small-Scale Experiment Design
    Preliminary Test Results
  5.2 Larger-scale Test
    itunes Top Ten Test Set
    Additional Test Sets
  Results
    The Same Song Title Sanity-Check
    Multi-dimensional Scaling
    Music Genome Project
  SimMetricPlayer
6. Discussion
  Future Work
Conclusions
References
Appendix A
Appendix B
Appendix C

List of Figures and Tables

Figures 3.1, 3.2, 4.1-4.6, 6.1a-6.1d
Tables 5.1, A, B.1, B.2, B.3

Introduction

Digital audio has become nearly ubiquitous. The combination of increasing storage capacity, robust perceptual codec technology, and efficient networking protocols has led to a music content explosion. Technologies like mp3 and peer-to-peer networking have enabled even the casual music enthusiast to amass large digital audio collections. Navigating such large stores of digital audio can be daunting. While text-based queries can sort audio files based on manually entered metadata (i.e. ID3 tags), this approach has some significant disadvantages: manually entering metadata is time-consuming, the metadata could be entered incorrectly or it could be incomplete, and metadata descriptions are limited to pre-defined fields. Some of these problems are addressed by recent technologies that retrieve metadata from the internet automatically (Windows Media Player, CDDB, etc.), but the underlying musical content retrieval issues remain. Although text-based searching methods are well developed, text-based descriptions of digital music fail to significantly describe the musical content. The artist name and genre are often the only text-based information available to describe the musical content of a digital audio file. This provides a weak foundation for applying text-based information retrieval methods to large digital music collections. The prevalence of digital music content and the shortcomings of text-based retrieval methods have motivated a significant amount of research in content-based music information retrieval (MIR), an approach that utilizes some characteristics of the musical signal rather than text-based metadata.

A diverse group of musicians, librarians, researchers, and software developers have contributed to this effort, producing a wide array of theories and techniques related to MIR. The first annual International Conference on Music Information Retrieval (ISMIR), held in October of 2000, provided an international forum for those involved in work on accessing digital music content. The Moving Picture Experts Group (MPEG), a body of the International Organization for Standardization (ISO), has developed a substantial international standard for media content description known as MPEG-7. Although not as widely implemented as other MPEG technologies, MPEG-7 provides a robust and flexible framework for content-based retrieval of digital media. Simply put, these efforts aim to enable more efficient indexing and searching of media content, whether music, images, or video. Such efforts are of interest to education, academia, entertainment, and industry. One approach to MIR is the audio-based music similarity measure. Audio-based music similarity measures could be applied to MIR in a number of ways, including automatic playlist generation, recommendation of unknown song titles or artists, organization and visualization of music collections, and music retrieval by example. Providing a consumer who purchases a given song with recommendations of similar songs is a very common scenario in music distribution. In current systems, this is almost always accomplished using accumulated sales data. If many consumers have purchased both song title A and song title B, and a given consumer decides to purchase song title A, title B is automatically recommended. Such a system is easy to implement and undeniably effective as a sales mechanism.

However, such systems are not content-based and tend to neglect less popular song titles, discouraging diversity and even hindering the advancement of new artists. Automatic playlist generation is another significant application of music similarity measures. Generating playlists based on music similarity information enables intelligent jukebox applications, where the user would select an initial song and the application would automatically play similar songs. Such an application is developed as a means of subjectively evaluating the music similarity system described here. As a contribution to the field of music information retrieval and as the foundation for an intelligent jukebox application, this work describes a system for quantifying music similarity between a set of songs based on the psychoacoustic features present in the digital signals corresponding to that set of songs. The proposed system combines several previously developed methods for quantifying music similarity, including the works of Logan [8], Aucouturier [5], Pampalk [6], and Foote [15]. A timbre model for music similarity is implemented based on [5, 6, 8]. The timbre model also uses parts of the open-source MA Toolbox [19]. A rhythm model for music similarity is implemented based on [15, 16]. A new method for modeling song structures is introduced as well. The similarity measures of these three models are combined to get an overall similarity distance between digital music signals.

1 - Music Similarity Defined

Before developing a system for quantifying music similarity, we must ask: what is it that really makes distinct pieces of music similar? When comparing pieces of music, a listener forms a judgment about similarity based on the psychoacoustic features of the set of songs in question. These features include short-time temporal variations (rhythm), spectral content (timbre), the changes in spectral content (melody, harmony), and overall temporal variations (song structure). Of course, the perception of sound, particularly the perception of musical sound, is a very complex phenomenon, and this list of psychoacoustic features is by no means complete. However, after soliciting the opinions of a wide variety of professional musicians, music engineers, and music enthusiasts, there seems to be a general consensus that this list includes the more salient psychoacoustic features for determining music similarity. Let us then define music similarity as the product of an individual's personal taste and the psychoacoustic features present in a set of distinct musical signals. Of course, music similarity is largely an abstraction and notions of music similarity are subjective. However, there is usually at least a loose consensus among individuals as to which songs are similar and which artists are similar. This is evidenced by the fact that music critics almost always describe a particular artist or song in terms of similar artists or songs. Statements like "Blind Melon sounds like Lynyrd Skynyrd meets the Grateful Dead" are common and can generally be agreed upon. This suggests that, at least in some broad sense, there exists a ground truth for music similarity, independent of perception. Researchers have proposed a variety of methods for obtaining such ground truth data, but with only moderate success.

If we neglect the individual's personal taste as a contributing factor, music similarity can be considered a multidimensional model, derived from the psychoacoustic features present in a set of songs. We will treat music similarity as such in this work.

1.1 Musical Genres

The music genre is the most common manifestation of music similarity. Songs that share a certain style or basic musical language are considered to be of the same genre. Although music genre implies music similarity and vice versa, it is important to note that musical genre is not identical to musical similarity. Music genres can be based on time period (Baroque Music), geographical origin (Cuban Music), or even media format (videogame music). While these genres reflect similarity, they do not necessarily reflect musical similarity. Therefore, music similarity cannot be defined solely in terms of genres; however, the music genre provides important clues to music similarity. Any measure of music similarity should adhere to at least some of the inter-song relationships established by music genres and, to a greater degree, music sub-genres. At the same time, a measure of music similarity should find meaningful relationships in songs across genres.

1.2 Psychoacoustic Features

Putting aside the pigeonhole approach of assigning song titles to music genres, consider again that music similarity is a function of psychoacoustic features. Research in psychoacoustics has shown that certain aspects of an audio signal are more salient than others. Therefore it is reasonable to assume that certain psychoacoustic features of a set of music signals are more relevant in determining music similarity.

Let us consider more closely the psychoacoustic features mentioned above: short-time temporal variations (rhythm), spectral content (timbre), the changes in spectral content (melody, harmony), and overall temporal variations (song structure).

Melodic and Harmonic Features

In part, music similarity is a function of melodic and harmonic content. This is perhaps most true with respect to trained musicians, who actively recognize the melodic and harmonic relationships in a piece of music. However, a listener with no musical training is still sensitive to the harmonic content of a song, consciously or unconsciously. Distinct harmonic relationships tend to evoke distinct perceptions of mood or emotion. Musicologists have long argued that the mode and the tonic of a musical piece relate to the feelings and moods evoked by that piece. While there exists a large body of work on pitch tracking and harmonic analysis of digital music signals, the contribution of melodic and harmonic features to music similarity is essentially neglected here; however, possible modeling techniques are discussed.

Rhythmic Features

Music similarity is also a function of rhythm. In Electronic Dance music, rhythm tends to be one of the most salient features. It is not surprising that subgenre divisions of Dance music correspond to rhythmic style: Trance, House, DnB, etc. Even in a broader scope, intuition holds that music similarity is, in part, a function of rhythmic style.

There exist several methods for modeling the rhythm or temporal characteristics of a digital signal, including [16] and [18]. The system developed here uses a specialized autocorrelation method first proposed by Foote in [16].

Timbral Features

Timbre is another dimension of music similarity. Timbre is commonly defined as the attribute which allows a listener to discriminate two sounds with the same pitch and loudness. The voicing, the instrumentation, the singing style, and many other elements contribute to the overall timbre of a song. From a signal processing perspective, timbre is more difficult to define. It is clear that the timbre of an audio signal is somehow related to its spectral content. This relationship is not as deterministic as the relationship between an audio signal's perceived pitch and its spectral content, where a certain fundamental frequency corresponds to a specific pitch. There does not exist a scale of timbres as there exists a scale of pitches. However, timbre is still a useful concept in audio signal processing. Timbre modeling techniques have been applied to automatic instrument identification by Brown [11], content-based retrieval of audio samples by MuscleFish Audio (1996), and even content-based music information retrieval [5, 6, 8, 9]. The MPEG-7 standard for multimedia content description even provides a set of descriptors and higher-level description schemes for modeling the timbre of an audio signal. Specifically, the standard includes InstrumentTimbre, HarmonicInstrumentTimbre, and PercussiveInstrumentTimbre as frameworks for modeling the timbre of a given audio signal.

These models, as well as other descriptors and description schemes in MPEG-7, can be used to evaluate music similarity in the timbral dimension. Beyond the MPEG-7 standard, the method of Mel-frequency cepstral analysis has also been associated with musical timbre. Initially used in speech processing, Mel-frequency cepstral coefficients (MFCC) are used throughout the system developed here and will be described in section 3.2.

Song Structure Features

Another aspect of music similarity is song structure. The phrasing and the changes in a set of songs contribute to music similarity. An electronic dance tune with simple phrasing and only one break-down or change is more similar to another dance tune with a similar structure than to an experimental electronic tune with complicated phrasing and many changes. This example illustrates how song structure can reflect music similarity (or dissimilarity) within the broad genre of electronic music. However, song structure tends to follow genre divisions: Hip-hop and Dance music have fewer changes and a simpler structure than Jazz or Orchestral music. While methods for automatic audio segmentation can be used to model song structure as in [13, 17], these techniques have not been applied to music similarity measures.

2. Design Goals for a Music Similarity System

If we neglect the individual's personal taste and assume music similarity is a function of psychoacoustic features, it is possible to design a system that automatically models music similarity for a set of songs. The goal of this work is to develop and test such a system. The system should rely only on psychoacoustic features extracted from the music signals in question. To facilitate scalability, the system should be divided into two parts: one part to extract, model, and store the psychoacoustic features of a given music signal, and a second part to compare the stored models of two distinct music signals to derive a music similarity measure. Let us call part one a Feature Extractor or FeatX, and part two a Similarity Metric calculator or SimMetrix. Such a configuration allows for songs to be added to or deleted from the test set at any time. The FeatX process need only run once on a given song, so the most processor-intensive work should be done there. The SimMetrix part of the system should be of lower complexity so comparisons between songs can be made quickly (assuming FeatX has already annotated the songs). The system should make use of the psychoacoustic features identified in the previous section (due to time constraints, melodic content will be neglected). For each song, FeatX will extract and store a model for timbre, rhythm, and song structure. SimMetrix will compare the models between songs and calculate a similarity distance measure.

3. Literature Review

3.1 MPEG-7 Descriptors

Since 1998, the ISO body known as MPEG has been developing a standard for multimedia content description called MPEG-7. The standard applies to digital video, audio, and images. Part 4 of the standard is dedicated to audio and specifies a rich set of description tools pertaining to audio content. These include low-level feature Descriptors (Ds), like AudioSpectrumEnvelopeType and LogAttackTimeType, that directly describe features of the audio signal, as well as higher-level Description Schemes (DSs), like AudioSignatureType and InstrumentTimbreType, which combine low-level features to form more abstract description schemes [1]. Any standard MPEG-7 description relies on three main components: Descriptors (Ds), Description Schemes (DSs), and the Description Definition Language (DDL). Descriptors are representations of distinctive characteristics, or features, of the media data. Description Schemes specify the structure and semantics of the relationships between their components, which may be either Descriptors and/or Description Schemes. Both Descriptors and Description Schemes are expressed using the Description Definition Language (DDL). The DDL is based on XML Schema. One Descriptor of particular interest here is the AudioBPMType. It is intended to describe the frequency of beats in an audio signal representing musical content. The beat frequency information is given in units of beats per minute (bpm), together with optional weights indicating the reliability of this measurement. This basic description of tempo is quite objective, and derived measures can be checked against a software or hardware BPM calculator, a device that allows the user to tap along with the music to get the BPM tempo.

Knowing the tempo of a music signal is also useful in creating dance music playlists or DJ-style mixes of music. Although the actual tempos of a given set of songs undoubtedly affect their similarity, tempo measures themselves provide little help. Of course, tempo measures are insufficient for determining music similarity, but more importantly, tempo measures are error prone. Even the most accurate tempo measures can be off by a factor of two because of the half-time / double-time effect. Tempo analysis for a song that actually has a presto tempo may result in a largo tempo measure because the algorithm was counting half-time. However, the AudioBPMType is implemented in this system as a starting point and as a guide for storing model information in xml format.

3.2 Mel-Frequency Cepstral Analysis

Although MPEG-7 provides a vast array of content description methods for audio, there are some very interesting techniques not included in the standard. One notable omission is that of Mel-frequency cepstral coefficient (MFCC) analysis. Several groups of researchers, including Logan and Aucouturier, have explored using MFCCs as a means to describe the timbre of music signals with some promising results. Some research even indicates that MFCCs outperform MPEG-7 implementations for general recognition tasks [10]. MFCCs are derived from the discrete cosine transform (DCT) or Fourier Transform of the log amplitude of the Mel-frequency spectrum of an audio signal.

The Mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. Quantitatively, to convert from f hertz to m mel:

  m = 1127 * ln(1 + f / 700)    (1)

To perform Mel-scale frequency warping, a Mel filter bank like the one in Figure 3.1 is applied. Applying the Mel filter bank positions spectral information logarithmically instead of linearly. This approach more closely approximates the human auditory system's response. Conceptually, MFCCs can be thought of as information about the rate of change in the different perceptual spectral bands. Since the method was first proposed as a front-end to a word recognition system by Davis and Mermelstein in 1980, MFCC analysis has been one of the most successful techniques in speech processing.

Figure 3.1. A typical Mel filter bank used in Mel-Frequency Cepstral Coefficient (MFCC) analysis.
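As a rough illustration of this warping, the following Python sketch builds a triangular Mel filter bank from the Hz-to-mel conversion of equation (1). It is not the MATLAB implementation from Appendix C, and the filter count and frequency range are illustrative choices rather than values taken from the text.

import numpy as np

def hz_to_mel(f):
    # Hz-to-mel conversion, natural-log form of equation (1).
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters=36, n_fft=1024, sr=22050, fmin=0.0, fmax=11025.0):
    # Triangular filters spaced evenly on the mel scale (cf. Figure 3.1).
    # n_filters, fmin, and fmax are assumptions made for this sketch.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb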

MFCC analysis has also proven useful in other audio applications. Automatic instrument identification tasks have been performed using MFCC analysis by Brown [11]. This motivates the use of MFCC analysis to model musical timbre. Some recent research has focused on using MFCC analysis as a means of measuring timbral similarity, including the work of Aucouturier [5], Logan [8], and Pampalk [6].

3.3 K-Means Clustering

In the MFCC analysis of an entire song, a set of coefficients is generated for every frame of audio. Two main statistical modeling approaches have been proposed to model a music signal's timbre from frames of MFCCs. Gaussian Mixture Models (GMMs) have been suggested by Aucouturier to model the timbre of a song by a probability density function describing the spectral distribution across all MFCC frames [5]. However, Logan has shown that K-means clustering provides comparable performance with less complexity [8]. The K-means algorithm starts by partitioning the input points into k initial sets. It then calculates the centroid of each set. A new partition is constructed by associating each point with the closest centroid. The centroids are then recalculated for the new clusters, and the algorithm repeats these two steps in alternation until convergence, which is obtained when the points no longer switch clusters (or, alternatively, when the centroids no longer change). The result of K-means clustering is a set of K clusters characterized by the mean, covariance, and weight of each cluster.
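A minimal Python sketch of this alternating procedure follows, returning the per-cluster means, diagonal covariances, and weights described above. The value of k, the iteration cap, and the random initialization are illustrative assumptions, not details of the MA Toolbox routine used later in the implementation.

import numpy as np

def kmeans(points, k=10, max_iter=100, seed=0):
    # Plain K-means: alternate the assignment and update steps described above
    # until no point switches clusters.
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # convergence: no point switched clusters
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # Summarize each cluster by its mean, diagonal covariance, and weight.
    covariances = np.array([points[labels == j].var(axis=0) if np.any(labels == j)
                            else np.zeros(points.shape[1]) for j in range(k)])
    weights = np.bincount(labels, minlength=k) / len(points)
    return centroids, covariances, weights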

K-means cluster models are compared to each other using a technique called the Earth Mover's Distance (EMD). The EMD between two songs provides a measure of timbre similarity. EMD is described in detail in section 4.2.

3.4 Beat Spectrum

Given that the cluster model approach to timbre description using MFCCs ignores all temporal information, it is necessary to consider additional methods for describing the rhythmic qualities of a musical signal. The BPM measure is only an indication of tempo, and gives no clues to the rhythmic feel or style of beat present in the song. However, methods for quantifying the rhythmic similarity of a set of music signals have been developed.

Figure 3.2. Self-similarity matrix for a 10 s segment of Bob Marley's No More Trouble. Both axes are in terms of frame number. The checkerboard patterns indicate rhythmic repetition in the music signal.

Most notable is the idea of the Beat Spectrum proposed by Foote in [15]. The Beat Spectrum is calculated from an audio signal using three basic steps:

1) Parameterization: The audio is parameterized using some spectral representation. MFCC frames are actually well-suited to this task.
2) Self-similarity: Some distance measure is used to calculate the similarity between every frame of audio. These measures are embedded in a 2-D self-similarity matrix. The visualization of such a matrix is shown in Figure 3.2.
3) Summation: The Beat Spectrum results from finding the periodicities in the similarity matrix using diagonal sums or autocorrelation.

The result is a vector that models the rhythmic patterns in a music signal as a function of time. The Beat Spectrum vectors can be stored and compared directly using any standard distance measure; however, Foote's research suggests that the cosine distance measure is most appropriate [15, 16].

3.5 Automatic Audio Segmentation

Methods for automatically segmenting audio have been developed by Tzanetakis [13], Foote [17], and others. While these methods have been developed with the application of browsing and annotating in mind, they could be applied to model the overall song structure of a music signal. The segmentation technique presented in [17] is used here as a front-end for modeling song structure similarity.

4. Implementation: SimMetrix System Design

Using the MATLAB programming environment, a system called FeatX was developed to extract and store the tempo, timbre model, rhythm model, and structure model for a digital music file. The various models and ID3 tag information are stored in an MPEG-7 compliant xml file (named songfilename.xml).

Figure 4.1. A high-level block diagram describing the FeatX system, which generates models for musical timbre, rhythm, and musical structure and stores them in xml format. The bpm measure is not shown.

To compare music files based on the song models stored in xml meta-files, SimMetrix was developed. Different methods are used to compute appropriate similarity distances for each of the three models. These methods are described in detail later. Each song-to-song comparison results in three similarity distances, one for each model.

The distances can be weighted and combined to plot music files in a similarity space. Techniques like multidimensional scaling (MDS) allow for music similarity visualization in two, three, or four dimensions. The source code is included in Appendix C.

4.1 Feature Extraction with FeatX

The feature extraction process begins by searching the target directory for all readable audio files. At this time FeatX only supports .wav and .mp3 formats. The system has five main processes:

4.1.1 Tempo Calculation with getbpm()

To derive the tempo of a music signal the MPEG-7 AudioBPMType Descriptor is used [2]. In FeatX this Descriptor is encapsulated in the function getbpm(), which is passed the actual music signal along with some parameters. The incoming signal is decomposed and pre-processed into a number of spectral bands with transition frequencies of 200 Hz, 400 Hz, 800 Hz, 1600 Hz, and 3200 Hz, respectively. The following processing steps are carried out for each frequency band:

- The band-limited signal is derived from the input signal by means of bandpass filtering (lowpass filtering for the first frequency band, highpass filtering for the last frequency band).
- The band-limited signal is two-way rectified (i.e. the absolute values are taken) and smoothed over time with a time constant around 100 ms to calculate an envelope signal. At this point, the signal may be decimated in order to reduce computational complexity.
- The envelope signal is differentiated (i.e. the differences between subsequent samples are calculated) and the result is limited to non-negative values, corresponding to the onset portions of the signal. Each differentiated envelope is normalized by its maximum value.

- The biased, normalized autocorrelation function (ACF) is calculated for all lag values up to a maximum value which corresponds to a minimum detectable beat frequency value (note: in this context, "normalized" means that the function is scaled to reach a maximum value of 1.0 for lag = 0; "biased" means that the so-called auto-correlation method is used to estimate the correlation coefficients rather than the covariance method). The relation between beat frequency values (in bpm) and lag is given by:

  BeatFreq = 60 / (f_s * t_lag), for t_lag > 0    (2)

Since using integer lag values leads to a limitation in representable bpm values, further refinement can be achieved by using appropriate (e.g. quadratic) means for interpolation.
- A weighting factor for each frequency band is determined, e.g. as the ratio between the maximum value and the mean value of the bandwise ACF within the range of relevant lags, reduced by one.
- A combined envelope is then computed by summing all individual (differentiated and normalized) envelopes weighted by their respective weighting factors.
- Next, the combined envelope autocorrelation function (CEACF) is calculated from the combined envelope in the same way as described above for the ACF calculation in individual frequency bands. A reliability measure may be extracted in the same manner as described before for the weighting factor for individual frequency bands.
- Next, all meaningful local maximum peaks in the CEACF are detected within the range of meaningful lags, i.e. the lag range corresponding to the permissible range of musical beat frequencies (e.g. 60 BPM to 200 BPM). Each peak value, weighted by the summary reliability, is stored in a result vector together with its corresponding BPM value. Each weighted peak indicates the possibility of the corresponding BPM value to represent the correct beat frequency value.
- A final estimation stage decides which of the detected BPM values will eventually be returned as the beat frequency. If there are BPM values detected that fit into a tolerance range of already stored BPM values (BPM class), the corresponding relative peaks are added. If there are BPM values detected that are not within an already stored BPM class, these values and their corresponding relative peaks are inserted into the vector and define a new BPM class.

Each BPM class represents a plausible BPM estimate at the BPM value averaged across its members. This process is repeated twice for each audio file: once for a frame near the beginning of the song and once seven seconds into the song. Two BPM values and two weight values are stored in an xml file with the same name as the audio file, as described in [2].

4.1.2 MFCC Analysis with getmfcc()

The next process is the extraction of the MFCCs. The audio signal is subsampled to 22.05 kHz to save on computation. This of course has the effect of cutting the signal's spectral content above 11 kHz. But imagine an even more dramatic low-pass filtering of a set of audio signals with a cutoff around 8 kHz. Although the audio quality would be diminished, the music would still be intelligible and a human listener would still perceive the same music similarity relationships. It is a reasonable assumption that the most salient features for music similarity are present in the lower frequencies, so this type of sub-sampling is common in music information retrieval processes. The sub-sampled signal is windowed with a Hanning window of length 1024 and 50% overlap. A Fast Fourier Transform (FFT) is performed on each frame using the Matlab fft routine. A Mel filter-bank is used to perform the frequency warping to the Mel-frequency scale. The filter-bank, described in terms of its magnitude response, is multiplied by the scaled output of the fft to derive a Mel-frequency spectrum. The Discrete Cosine Transform (DCT) of the logarithm of the magnitude of the Mel-frequency spectrum results in the vector of MFCCs. Twenty MFCCs were used, including the first coefficient.
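The per-frame computation just described can be summarized in a short Python sketch; the actual implementation is the MATLAB code in Appendix C. The frame length, hop size, and coefficient count follow the text, and the sketch reuses the illustrative mel_filterbank() defined earlier.

import numpy as np
from scipy.fftpack import dct

def mfcc_frames(x, sr=22050, frame_len=1024, hop=512, n_coeffs=20, fb=None):
    # MFCCs per frame: Hann window -> FFT magnitude -> mel filter bank ->
    # log -> DCT, keeping the first n_coeffs coefficients.
    # `fb` is a mel filter bank, e.g. from mel_filterbank() above.
    if fb is None:
        fb = mel_filterbank(n_fft=frame_len, sr=sr)
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))       # magnitude spectrum
        mel_spectrum = fb @ spectrum                 # warp to mel bands
        log_mel = np.log(mel_spectrum + 1e-10)       # log magnitude
        frames.append(dct(log_mel, type=2, norm='ortho')[:n_coeffs])
    return np.array(frames)                          # shape: (n_frames, n_coeffs)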

4.1.3 TimbreModel()

To store the MFCCs, a statistical model is constructed. The K-means approach described earlier is used to cluster the MFCCs of a given music signal into 10 clusters (this number was chosen as a compromise between the suggestions of Logan [8] and Aucouturier [5]).

Figure 4.2. The TimbreModel for Your Man. The ten means are indicated by the fine dark lines and the covariances are indicated by the gray shading.

The ma_kmeans algorithm is used from the open-source MA Toolbox developed by Pampalk [19]. All the parameters returned by the k-means algorithm are appended to the xml file in the type TimbreModel; the syntax is shown below:

<SoundModel xsi:type="timbremodel" ClusterMethod="kmeans" CovarianceType="diag">
  <SeriesOfVectors nin="20" totalnumcentres="10" nwts="410">
    <Means> (10x20 matrix of floats) </Means>
    <Covariances> (10x20 matrix of floats) </Covariances>
    <priors> (1x10 matrix of floats) </priors>
  </SeriesOfVectors>
</SoundModel>

The TimbreModel type includes the means (centers), covariances, and weights of the k-means clustering of the MFCC frames. Various studies indicate that this type of statistical description of MFCC distributions across a music signal provides a good model for the timbre of an entire song [8, 9].

4.1.4 RhythmModel()

To model the rhythm of the musical signals in question, the beat spectrum technique developed by Foote in [15] is implemented. The MFCC representation of the audio signal is used again to calculate the RhythmModel. To minimize computation time, only a 1000x1000 frame self-similarity matrix is used. This corresponds to about 10 seconds of audio taken about 10 seconds into the song. Because the matrix is symmetric, only half of the self-similarity matrix is calculated. The pdist Matlab function is used to calculate the cosine distance between the MFCC vectors of frames i and j for the self-similarity matrix S:

  S(i, j) = 1 - [ sum_k MFCC_k(i) * MFCC_k(j) ] / [ sqrt(sum_k MFCC_k(i)^2) * sqrt(sum_k MFCC_k(j)^2) ]    (3)

Because the beat spectrum is a function of lag time, l = i - j, and only a finite lag time needs to be considered, S(i, j) only needs to be calculated for l_max > i - j. This saves considerable computation time.

Once S(i, j) has been calculated, the beat spectrum B(l) can be found by simply summing the diagonals of S:

  B(l) = sum_{k in R} S(k, k + l)    (4)

Here, B(0) is the sum along the main diagonal across the range R for which S has been solved, B(1) is the sum along the first super-diagonal, B(2) the second super-diagonal, and so on. The RhythmModel is stored in xml as follows:

<SoundModel xsi:type="rhythmmodel" DistanceType="cosine">
  <SeriesOfVectors totalnum="25">
    <Bl> (1x25 of floats) </Bl>
  </SeriesOfVectors>
</SoundModel>

The RhythmModel is adapted from Foote's Beat Spectrum method. The RhythmModels for two songs are shown in Figure 4.3. Axel F is an electronic dance tune and its repetitive four-on-the-floor rhythm creates regular peaks in the Beat Spectrum. The jazz classic Take Five creates a less regular Beat Spectrum, but the 5/4 time signature is still apparent.

Figure 4.3. Rhythm models for Axel F and Take Five. Notice the strong repetitive peaks in the four-on-the-floor dance music and the 5/4 beat structure of Take Five.
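As an illustration of equations (3) and (4), the following Python sketch computes a beat spectrum directly from a block of MFCC frames. The 25-lag limit mirrors the stored RhythmModel length, but the sketch is an illustrative re-expression rather than the MATLAB code of Appendix C.

import numpy as np

def beat_spectrum(mfcc, max_lag=25):
    # Beat spectrum from MFCC frames: cosine-distance self-similarity
    # (equation (3)) followed by diagonal sums (equation (4)). Only lags up
    # to max_lag are needed, so only a band of S near the diagonal is used.
    n = len(mfcc)
    norms = np.linalg.norm(mfcc, axis=1) + 1e-10
    B = np.zeros(max_lag)
    for l in range(max_lag):
        # S(k, k+l) summed over all valid k, per equation (4).
        num = np.sum(mfcc[:n - l] * mfcc[l:], axis=1)
        S_diag = 1.0 - num / (norms[:n - l] * norms[l:])
        B[l] = S_diag.sum()
    return B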

Figure 4.4. The lo-resolution self-similarity matrix and novelty index for Axel F. Note the changes suggested by the self-similarity matrix are reflected as spikes in the novelty index. The spikes in the novelty index around 1.5 minutes correspond to a break-down section in the song.

4.1.5 SongStructure()

To model a song's musical structure, a modified version of the self-similarity matrix in getrhythmmodel() is created. A self-similarity matrix is created for the entire song, skipping some integer number of frames. The number of frames skipped is the subframerate (usually 16). This sub-sampling of frames has the effect of low-pass filtering the self-similarity matrix, eliminating more rapid temporal changes, as well as reducing the required computing time. The lo-resolution self-similarity matrix, S_lo-res, is calculated as in getrhythmmodel(), using a cosine distance between MFCC vectors. Scaling S_lo-res to gray-scale values produces an image of the song's overall structure. The S_lo-res image for Axel F can be seen in Figure 4.4. Note the lighter cross-like area in the center of the image.

Figure 4.5. A Gaussian-tapered checkerboard kernel used for audio segmentation. The kernel is correlated along the main diagonal of the self-similarity matrix S_lo-res to produce the novelty index Nv, as shown in Figure 4.4.

This area corresponds to a break-down section in this repetitive, four-on-the-floor dance tune. To find points where a given song changes, a Gaussian Checkerboard kernel is correlated across the main diagonal (see Figure 4.5). This is a method developed by Foote for automatic audio segmentation and automatic thumbnailing of music. The kernel consists of four blocks in a square checkerboard configuration. The top-left and bottom-right blocks are matrices of 1's and the top-right and bottom-left blocks are matrices of -1's. The entire checkerboard square is multiplied by a Gaussian window, resulting in the Gaussian Checkerboard kernel shown in Figure 4.5. The correlation of the kernel across the main diagonal of S_lo-res results in a novelty index, Nv, as shown in Figure 4.4. The novelty index Nv constitutes the StructureModel and it is stored in an xml file as follows:

<SoundModel xsi:type="structuremodel" DistanceType="cosine">
  <SeriesOfVectors totalnum="N" subframerate="16">
    <Nv> (1xN of floats) </Nv>
  </SeriesOfVectors>
</SoundModel>

The size of Nv is determined by the length of the music signal in question. A longer song will have a longer Nv. Preliminary experiments indicated that non-overlapping Hann windows of length 512, with a subframerate of 16, and a 32x32 kernel produce the most meaningful StructureModels. This results in vector lengths between 1700 and 2800 for the songs in the itunes test set.
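A small Python sketch of this segmentation step follows: it builds a Gaussian-tapered checkerboard kernel and slides it along the diagonal of the lo-resolution self-similarity matrix. The 32x32 size follows the text, while the Gaussian taper width is an illustrative assumption.

import numpy as np

def gaussian_checkerboard(size=32):
    # Checkerboard kernel (+1 / -1 quadrants) tapered by a 2-D Gaussian,
    # as in Figure 4.5. `size` must be even.
    half = size // 2
    quadrant_sign = np.ones((size, size))
    quadrant_sign[:half, half:] = -1.0
    quadrant_sign[half:, :half] = -1.0
    t = np.linspace(-1.0, 1.0, size)
    gauss = np.exp(-4.0 * t ** 2)  # taper width is an illustrative choice
    return quadrant_sign * np.outer(gauss, gauss)

def novelty_index(S_lores, kernel):
    # Correlate the kernel along the main diagonal of the lo-resolution
    # self-similarity matrix to obtain the novelty index Nv.
    size = kernel.shape[0]
    half = size // 2
    n = S_lores.shape[0]
    Nv = np.zeros(n)
    for i in range(half, n - half):
        patch = S_lores[i - half:i + half, i - half:i + half]
        Nv[i] = np.sum(patch * kernel)
    return Nv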

Although the StructureModel could be represented more compactly (this will be shown later), the entire novelty vector is stored to facilitate experimentation with different methods for comparing StructureModels.

4.2 SimMetrix Model Comparison

Feature vectors describing a musical signal only become useful if they can provide some additional functionality for comparison or search. A high-level block diagram of the SimMetrix system can be seen in Figure 4.6. The system operates on a query music file and a target directory containing music files and their FeatX-generated xml metadata. The system can also calculate similarities between every song in a target directory and create a square matrix of inter-song distances, SM.

Figure 4.6. A high-level block diagram describing the SimMetrix system, which calculates the music similarity between songs p and q from the corresponding xml files. The bpm measure is not shown.

SimMetrix generates such a matrix for TimbreModel similarity, RhythmModel similarity, and StructureModel similarity. A tempo similarity method is included for completeness. Each model requires a different distance calculation.

4.2.1 Tempo similarity

Tempo similarity is the simplest distance calculation because it involves simply taking the absolute value of the difference of the query file BPM and the target file BPM. Since BPM values are frequently off by a factor of two, it is best to take the minimum value of the following:

  dBPM = min_{n = [-1, 0, 1]} | BPM_q - 2^n * BPM_p |    (5)
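A one-line Python sketch of equation (5), allowing half-time and double-time matches:

def tempo_distance(bpm_q, bpm_p):
    # Octave-tolerant tempo distance of equation (5): compare the query BPM
    # against the target BPM at half, equal, and double time.
    return min(abs(bpm_q - (2.0 ** n) * bpm_p) for n in (-1, 0, 1))

For example, tempo_distance(170, 85) evaluates to 0, so a double-time estimate does not penalize the comparison.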

4.2.2 TimbreModel similarity

The TimbreModel similarity requires a more complicated distance measure. Again, the open-source MA Toolbox [19] is used, this time to implement the Earth Mover's Distance (EMD). EMD calculates the minimum amount of work required to transform one model into another. Considering the clusters as piles of earth, we are interested in how much earth (or probability mass) we need to move to transform model TimbreModel P into model TimbreModel Q. Let P = {(µ_p1, σ_p1, w_p1), ..., (µ_pm, σ_pm, w_pm)} be the model for song P with m clusters, where µ_pi, σ_pi, and w_pi are the mean, covariance, and weight of cluster i. Similarly, let Q = {(µ_q1, σ_q1, w_q1), ..., (µ_qn, σ_qn, w_qn)} be the model for the query song. We can calculate the distance between clusters by

  d_{pi,qj} = σ_pi / σ_qj + σ_qj / σ_pi + (µ_pi - µ_qj)^2 * (1 / σ_pi + 1 / σ_qj)    (6)

Let f_{pi,qj} be the flow between p_i and q_j. This flow reflects the cost of moving probability mass from one cluster to another. We solve for all f_{pi,qj} that minimize the overall cost W defined by

  W = sum_{i=1..m} sum_{j=1..n} d_{pi,qj} * f_{pi,qj}    (7)

That is, we seek the cheapest way to transform signal P to signal Q. This can be formulated as a linear programming task for which efficient solutions exist. Having solved for all f_{pi,qj}, the EMD is then calculated as

  EMD(p, q) = [ sum_{i=1..m} sum_{j=1..n} d_{pi,qj} * f_{pi,qj} ] / [ sum_{i=1..m} sum_{j=1..n} f_{pi,qj} ]    (8)

If P and Q are very different TimbreModels, the EMD will be large. If P and Q are very similar TimbreModels, the EMD will be small. If P and Q are the same TimbreModel, the EMD will be zero. Because EMD values for very different TimbreModels can be arbitrarily large, a maximum limit is set. Preliminary experiments indicated that for most TimbreModel pairs, EMD <= 1000. However, for some pairs EMD >> 1000. To compensate for this, a special normalization process is applied:

  if EMD(p, q) > 1000, then EMD(p, q) = 1000
  SM_timbre(p, q) = EMD(p, q) / 1000    (9)

This normalizes the range for TimbreModel similarity to 0 <= SM_timbre <= 1, conforming to the other model similarities.
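To make the linear-programming formulation concrete, here is a Python sketch that computes the ground distances of equation (6) and solves the transportation problem of equations (7) and (8) with scipy.optimize.linprog. It assumes diagonal covariances and cluster weights that each sum to one; it is a stand-in for the MA Toolbox EMD routine actually used, not a copy of it.

import numpy as np
from scipy.optimize import linprog

def cluster_distance(mu_p, var_p, mu_q, var_q):
    # Ground distance between two diagonal-Gaussian clusters (equation (6)),
    # summed over the MFCC dimensions.
    return np.sum(var_p / var_q + var_q / var_p
                  + (mu_p - mu_q) ** 2 * (1.0 / var_p + 1.0 / var_q))

def emd(means_p, vars_p, w_p, means_q, vars_q, w_q):
    # Earth Mover's Distance between two TimbreModels, posed as a
    # transportation linear program (equations (7) and (8)).
    # Assumes sum(w_p) == sum(w_q) == 1.
    m, n = len(w_p), len(w_q)
    d = np.array([[cluster_distance(means_p[i], vars_p[i], means_q[j], vars_q[j])
                   for j in range(n)] for i in range(m)])
    c = d.ravel()                            # minimize sum_ij d_ij * f_ij
    A_eq, b_eq = [], []
    for i in range(m):                       # all mass leaving cluster p_i
        row = np.zeros(m * n)
        row[i * n:(i + 1) * n] = 1.0
        A_eq.append(row)
        b_eq.append(w_p[i])
    for j in range(n):                       # all mass arriving at cluster q_j
        row = np.zeros(m * n)
        row[j::n] = 1.0
        A_eq.append(row)
        b_eq.append(w_q[j])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun / np.sum(w_p)             # equation (8); denominator is 1 here

The normalization of equation (9) would then clip the returned value at 1000 and divide by 1000 to obtain SM_timbre.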

4.2.3 RhythmModel similarity

The RhythmModel similarity can be found quite easily by computing the cosine distance between two RhythmModels:

  SM_rhythm(p, q) = 1 - [ sum B_p * B_q ] / [ sqrt(sum B_p^2) * sqrt(sum B_q^2) ]    (10)

where B_p and B_q are the RhythmModels for songs p and q, respectively. The cosine distance is inherently normalized to one. This is implemented based on Foote's work in [16].

4.2.4 StructureModel similarity

The StructureModel similarity calculation is based on the number of changes detected in a music signal and their locations relative to the length of the song. The number of significant changes, C_p, in song p can be determined from Nv_p, the novelty index:

  if (Nv_p(i) > Nv_threshold and Nv_p(i-1) <= Nv_threshold):
      C_p = C_p + 1
      rl_p(j) = i / length(Nv_p)    (11)
      j = j + 1

where Nv_threshold is some constant and Nv_p(i-1) refers to the previous value of the novelty index. In this way, only the positive crossings of Nv_threshold increment C_p. This is a fairly accurate method for finding changes in an audio signal. Note that rl_p records the normalized locations of the changes. Should rl_p = 0.5, this would indicate a change in the middle of the song.
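The two calculations above fit in a few lines of Python; this sketch mirrors equations (10) and (11), with the novelty threshold left as a free parameter rather than a value taken from the experiments.

import numpy as np

def rhythm_distance(B_p, B_q):
    # Cosine distance between two beat-spectrum RhythmModels (equation (10)).
    return 1.0 - np.dot(B_p, B_q) / (np.linalg.norm(B_p) * np.linalg.norm(B_q))

def count_changes(Nv, threshold):
    # Count positive crossings of the novelty threshold and record their
    # normalized locations rl (equation (11)).
    C, rl = 0, []
    for i in range(1, len(Nv)):
        if Nv[i] > threshold and Nv[i - 1] <= threshold:
            C += 1
            rl.append(i / len(Nv))
    return C, np.array(rl)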

The mean of rl_p is taken to get µl_p. To compare the StructureModel of song p with that of song q, C_p and C_q are considered magnitudes, while µl_p and µl_q are taken to be the corresponding angles. The StructureModel similarity between songs p and q is calculated as a Euclidean distance between the resulting vectors:

  SM_structure(p, q) = sqrt[ (C_p * cos(θ * µl_p) - C_q * cos(θ * µl_q))^2 + (C_p * sin(θ * µl_p) - C_q * sin(θ * µl_q))^2 ]    (12)

Here, θ represents some maximum angle. The lower θ is set, the less impact the relative location of changes has on StructureModel similarity. Preliminary tests indicate that θ = π/4 is an appropriate value. This allows songs with different change distributions, but identical numbers of changes, to still be similar. SM_structure is normalized to one by dividing by the maximum value of SM_structure. This normalization conforms to the other similarity measures.

4.2.5 Combined Model Similarity

To derive one matrix of similarity distances SM_total that combines all three model similarities, the following equation is used:

  SM_total = w_timbre * SM_timbre + w_rhythm * SM_rhythm + w_structure * SM_structure    (13)

The weights reflect the relative importance of each model to overall music similarity. Initial values were chosen as w_timbre = 0.5, w_rhythm = 0.4, and w_structure = 0.1. The weights are adjusted experimentally as described in the experiments that follow.
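A short Python sketch of equations (12) and (13) follows; the default angle of π/4 and the initial weights are taken from the text, and the per-pair normalization of SM_structure is left to the caller.

import numpy as np

def structure_distance(C_p, mul_p, C_q, mul_q, theta=np.pi / 4):
    # StructureModel distance of equation (12): each song becomes a vector
    # with magnitude C (number of changes) and angle theta * mean location.
    dx = C_p * np.cos(theta * mul_p) - C_q * np.cos(theta * mul_q)
    dy = C_p * np.sin(theta * mul_p) - C_q * np.sin(theta * mul_q)
    return np.hypot(dx, dy)

def combined_similarity(sm_timbre, sm_rhythm, sm_structure,
                        w_timbre=0.5, w_rhythm=0.4, w_structure=0.1):
    # Weighted combination of the three normalized distances (equation (13)).
    return (w_timbre * sm_timbre + w_rhythm * sm_rhythm
            + w_structure * sm_structure)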

5 - Experiments

5.1 Preliminary Experiments

During the development process for FeatX and SimMetrix, some small-scale experiments were designed to quickly assess the systems' performance. These tests were not at all rigorous, but still provided direction for improving the system.

5.1.1 Assumptions About The Riddim

One of the biggest challenges in music information retrieval lies in evaluating a MIR system's performance. Because musical tastes and genre descriptions are rather subjective, it is difficult to objectively grade any music similarity system. Web surveys, probing P2P networks, and text-mining online music guides are methods that have been proposed as a means to obtain ground truth data on music similarity. Time constraints dictate a simpler approach for initial experiments. It is commonplace in some musical genres for several different artists to use the same backing music or riddim. This is especially common in modern Dancehall and Reggae music. Dozens of different vocalists may release different song titles, with unique vocal melodies and arrangements, but all using the same riddim. One example is the Diesel Riddim, which has been used by artists like Beenie Man, Elephant Man, Lexxus, Captain Barkey, and many others. Intuition suggests SimMetrix should score song titles on the same riddim as similar; if not, the system fails.

5.1.2 Small-Scale Experiment Design

A total of 30 music files were included in the initial test set. The files were selected from my personal digital music collection. Of the small-scale test set, six songs used the Soprano's Riddim, five songs used the Diesel Riddim, and two songs used the Bad Road Riddim. The rest of the songs were selected from Reggae, Hip-Hop, Rock, and Jazz Vocalist genres. Xml files were generated for each file using FeatX.

5.1.3 Preliminary Test Results

Three query songs were chosen: Elephant Man's Passa Passa on the Diesel Riddim, Alavode's Burn Dem on the Soprano's Riddim, and Billie Holiday's God Bless the Child as a control. Each query was run separately. Averages across all songs of the same riddim and across all other songs were calculated. These results are presented in Table 5.1. The results of these initial experiments suggest that all three models are useful in measuring music similarity.

Table 5.1. The average distances for each similarity parameter (average tempo distance, average cluster distance, and average Beat Spectrum distance) for three different query songs, with averages reported per group of songs (songs on the Diesel Riddim, songs on the Soprano's Riddim, other songs, or all songs). The query songs are Elephant Man (on Diesel), Alavode (on Soprano's), and Billie Holiday's God Bless the Child. These results seem to indicate that songs which use the same riddim have considerably smaller distances across all parameters. Note this is prior to normalization.

The intra-riddim similarity distances should be smaller than the cross-riddim similarity distances. In other words, songs on the same riddim should be rated as most similar. This was true for both the TimbreModel and the RhythmModel. This small-scale experiment served as a sanity check for the system. Given that intra-riddim similarity distances were smallest, the system passes.

5.2 Full-scale Experiment

5.2.1 itunes Top Ten Test Set

A more thorough test of the system would require a larger pool of test songs not bound by any single individual's personal collection. To create such a test pool of digital music files, the top ten rated songs in eleven different genres were purchased from the itunes music store. The genres were selected arbitrarily and included Hip-hop/Rap, Classical, Pop, World, Jazz, Dance, Electronic, Country, Blues, Alternative, and R&B/Soul. The itunes top ten ratings are based on sales data. To date, over 980 million songs have been purchased since the service first launched on April 28, 2003. There are currently itunes stores available in the United States, United Kingdom, France, Germany, Austria, Belgium, Finland, Greece, Ireland, Italy, Luxembourg, the Netherlands, Portugal, Spain, Canada, Denmark, Norway, Sweden, Switzerland, Japan, and Australia. Given the popularity and broad scope of the itunes music store, the genre distinctions applied to the test songs can be used as a benchmark for the music similarity system. A useful system should at least loosely identify genre boundaries defined by the itunes music store.


2 2. Melody description The MPEG-7 standard distinguishes three types of attributes related to melody: the fundamental frequency LLD associated to a t MPEG-7 FOR CONTENT-BASED MUSIC PROCESSING Λ Emilia GÓMEZ, Fabien GOUYON, Perfecto HERRERA and Xavier AMATRIAIN Music Technology Group, Universitat Pompeu Fabra, Barcelona, SPAIN http://www.iua.upf.es/mtg

More information

Music Radar: A Web-based Query by Humming System

Music Radar: A Web-based Query by Humming System Music Radar: A Web-based Query by Humming System Lianjie Cao, Peng Hao, Chunmeng Zhou Computer Science Department, Purdue University, 305 N. University Street West Lafayette, IN 47907-2107 {cao62, pengh,

More information

The song remains the same: identifying versions of the same piece using tonal descriptors

The song remains the same: identifying versions of the same piece using tonal descriptors The song remains the same: identifying versions of the same piece using tonal descriptors Emilia Gómez Music Technology Group, Universitat Pompeu Fabra Ocata, 83, Barcelona emilia.gomez@iua.upf.edu Abstract

More information

MPEG-7 AUDIO SPECTRUM BASIS AS A SIGNATURE OF VIOLIN SOUND

MPEG-7 AUDIO SPECTRUM BASIS AS A SIGNATURE OF VIOLIN SOUND MPEG-7 AUDIO SPECTRUM BASIS AS A SIGNATURE OF VIOLIN SOUND Aleksander Kaminiarz, Ewa Łukasik Institute of Computing Science, Poznań University of Technology. Piotrowo 2, 60-965 Poznań, Poland e-mail: Ewa.Lukasik@cs.put.poznan.pl

More information

Enhancing Music Maps

Enhancing Music Maps Enhancing Music Maps Jakob Frank Vienna University of Technology, Vienna, Austria http://www.ifs.tuwien.ac.at/mir frank@ifs.tuwien.ac.at Abstract. Private as well as commercial music collections keep growing

More information

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University Week 14 Query-by-Humming and Music Fingerprinting Roger B. Dannenberg Professor of Computer Science, Art and Music Overview n Melody-Based Retrieval n Audio-Score Alignment n Music Fingerprinting 2 Metadata-based

More information

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval DAY 1 Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval Jay LeBoeuf Imagine Research jay{at}imagine-research.com Rebecca

More information

Voice & Music Pattern Extraction: A Review

Voice & Music Pattern Extraction: A Review Voice & Music Pattern Extraction: A Review 1 Pooja Gautam 1 and B S Kaushik 2 Electronics & Telecommunication Department RCET, Bhilai, Bhilai (C.G.) India pooja0309pari@gmail.com 2 Electrical & Instrumentation

More information

The Research of Controlling Loudness in the Timbre Subjective Perception Experiment of Sheng

The Research of Controlling Loudness in the Timbre Subjective Perception Experiment of Sheng The Research of Controlling Loudness in the Timbre Subjective Perception Experiment of Sheng S. Zhu, P. Ji, W. Kuang and J. Yang Institute of Acoustics, CAS, O.21, Bei-Si-huan-Xi Road, 100190 Beijing,

More information

CTP431- Music and Audio Computing Music Information Retrieval. Graduate School of Culture Technology KAIST Juhan Nam

CTP431- Music and Audio Computing Music Information Retrieval. Graduate School of Culture Technology KAIST Juhan Nam CTP431- Music and Audio Computing Music Information Retrieval Graduate School of Culture Technology KAIST Juhan Nam 1 Introduction ü Instrument: Piano ü Genre: Classical ü Composer: Chopin ü Key: E-minor

More information

Music Structure Analysis

Music Structure Analysis Lecture Music Processing Music Structure Analysis Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Book: Fundamentals of Music Processing Meinard Müller Fundamentals

More information

Topic 10. Multi-pitch Analysis

Topic 10. Multi-pitch Analysis Topic 10 Multi-pitch Analysis What is pitch? Common elements of music are pitch, rhythm, dynamics, and the sonic qualities of timbre and texture. An auditory perceptual attribute in terms of which sounds

More information

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring 2009 Week 6 Class Notes Pitch Perception Introduction Pitch may be described as that attribute of auditory sensation in terms

More information

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM A QUER B EAMPLE MUSIC RETRIEVAL ALGORITHM H. HARB AND L. CHEN Maths-Info department, Ecole Centrale de Lyon. 36, av. Guy de Collongue, 69134, Ecully, France, EUROPE E-mail: {hadi.harb, liming.chen}@ec-lyon.fr

More information

Robert Alexandru Dobre, Cristian Negrescu

Robert Alexandru Dobre, Cristian Negrescu ECAI 2016 - International Conference 8th Edition Electronics, Computers and Artificial Intelligence 30 June -02 July, 2016, Ploiesti, ROMÂNIA Automatic Music Transcription Software Based on Constant Q

More information

Music Segmentation Using Markov Chain Methods

Music Segmentation Using Markov Chain Methods Music Segmentation Using Markov Chain Methods Paul Finkelstein March 8, 2011 Abstract This paper will present just how far the use of Markov Chains has spread in the 21 st century. We will explain some

More information

Voice Controlled Car System

Voice Controlled Car System Voice Controlled Car System 6.111 Project Proposal Ekin Karasan & Driss Hafdi November 3, 2016 1. Overview Voice controlled car systems have been very important in providing the ability to drivers to adjust

More information

Recognising Cello Performers using Timbre Models

Recognising Cello Performers using Timbre Models Recognising Cello Performers using Timbre Models Chudy, Magdalena; Dixon, Simon For additional information about this publication click this link. http://qmro.qmul.ac.uk/jspui/handle/123456789/5013 Information

More information

IMPROVING RHYTHMIC SIMILARITY COMPUTATION BY BEAT HISTOGRAM TRANSFORMATIONS

IMPROVING RHYTHMIC SIMILARITY COMPUTATION BY BEAT HISTOGRAM TRANSFORMATIONS 1th International Society for Music Information Retrieval Conference (ISMIR 29) IMPROVING RHYTHMIC SIMILARITY COMPUTATION BY BEAT HISTOGRAM TRANSFORMATIONS Matthias Gruhne Bach Technology AS ghe@bachtechnology.com

More information

Music Genre Classification

Music Genre Classification Music Genre Classification chunya25 Fall 2017 1 Introduction A genre is defined as a category of artistic composition, characterized by similarities in form, style, or subject matter. [1] Some researchers

More information

Recognising Cello Performers Using Timbre Models

Recognising Cello Performers Using Timbre Models Recognising Cello Performers Using Timbre Models Magdalena Chudy and Simon Dixon Abstract In this paper, we compare timbre features of various cello performers playing the same instrument in solo cello

More information

2. AN INTROSPECTION OF THE MORPHING PROCESS

2. AN INTROSPECTION OF THE MORPHING PROCESS 1. INTRODUCTION Voice morphing means the transition of one speech signal into another. Like image morphing, speech morphing aims to preserve the shared characteristics of the starting and final signals,

More information

How to Obtain a Good Stereo Sound Stage in Cars

How to Obtain a Good Stereo Sound Stage in Cars Page 1 How to Obtain a Good Stereo Sound Stage in Cars Author: Lars-Johan Brännmark, Chief Scientist, Dirac Research First Published: November 2017 Latest Update: November 2017 Designing a sound system

More information

Tempo and Beat Tracking

Tempo and Beat Tracking Tutorial Automatisierte Methoden der Musikverarbeitung 47. Jahrestagung der Gesellschaft für Informatik Tempo and Beat Tracking Meinard Müller, Christof Weiss, Stefan Balke International Audio Laboratories

More information

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 04, April -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 MUSICAL

More information

Music Information Retrieval

Music Information Retrieval CTP 431 Music and Audio Computing Music Information Retrieval Graduate School of Culture Technology (GSCT) Juhan Nam 1 Introduction ü Instrument: Piano ü Composer: Chopin ü Key: E-minor ü Melody - ELO

More information

Music Complexity Descriptors. Matt Stabile June 6 th, 2008

Music Complexity Descriptors. Matt Stabile June 6 th, 2008 Music Complexity Descriptors Matt Stabile June 6 th, 2008 Musical Complexity as a Semantic Descriptor Modern digital audio collections need new criteria for categorization and searching. Applicable to:

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

Motion Video Compression

Motion Video Compression 7 Motion Video Compression 7.1 Motion video Motion video contains massive amounts of redundant information. This is because each image has redundant information and also because there are very few changes

More information

Features for Audio and Music Classification

Features for Audio and Music Classification Features for Audio and Music Classification Martin F. McKinney and Jeroen Breebaart Auditory and Multisensory Perception, Digital Signal Processing Group Philips Research Laboratories Eindhoven, The Netherlands

More information

Audio Feature Extraction for Corpus Analysis

Audio Feature Extraction for Corpus Analysis Audio Feature Extraction for Corpus Analysis Anja Volk Sound and Music Technology 5 Dec 2017 1 Corpus analysis What is corpus analysis study a large corpus of music for gaining insights on general trends

More information

HUMAN PERCEPTION AND COMPUTER EXTRACTION OF MUSICAL BEAT STRENGTH

HUMAN PERCEPTION AND COMPUTER EXTRACTION OF MUSICAL BEAT STRENGTH Proc. of the th Int. Conference on Digital Audio Effects (DAFx-), Hamburg, Germany, September -8, HUMAN PERCEPTION AND COMPUTER EXTRACTION OF MUSICAL BEAT STRENGTH George Tzanetakis, Georg Essl Computer

More information

MODELS of music begin with a representation of the

MODELS of music begin with a representation of the 602 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010 Modeling Music as a Dynamic Texture Luke Barrington, Student Member, IEEE, Antoni B. Chan, Member, IEEE, and

More information

Music Emotion Recognition. Jaesung Lee. Chung-Ang University

Music Emotion Recognition. Jaesung Lee. Chung-Ang University Music Emotion Recognition Jaesung Lee Chung-Ang University Introduction Searching Music in Music Information Retrieval Some information about target music is available Query by Text: Title, Artist, or

More information

Timing In Expressive Performance

Timing In Expressive Performance Timing In Expressive Performance 1 Timing In Expressive Performance Craig A. Hanson Stanford University / CCRMA MUS 151 Final Project Timing In Expressive Performance Timing In Expressive Performance 2

More information

Full Disclosure Monitoring

Full Disclosure Monitoring Full Disclosure Monitoring Power Quality Application Note Full Disclosure monitoring is the ability to measure all aspects of power quality, on every voltage cycle, and record them in appropriate detail

More information

Transcription of the Singing Melody in Polyphonic Music

Transcription of the Singing Melody in Polyphonic Music Transcription of the Singing Melody in Polyphonic Music Matti Ryynänen and Anssi Klapuri Institute of Signal Processing, Tampere University Of Technology P.O.Box 553, FI-33101 Tampere, Finland {matti.ryynanen,

More information

A repetition-based framework for lyric alignment in popular songs

A repetition-based framework for lyric alignment in popular songs A repetition-based framework for lyric alignment in popular songs ABSTRACT LUONG Minh Thang and KAN Min Yen Department of Computer Science, School of Computing, National University of Singapore We examine

More information

EE391 Special Report (Spring 2005) Automatic Chord Recognition Using A Summary Autocorrelation Function

EE391 Special Report (Spring 2005) Automatic Chord Recognition Using A Summary Autocorrelation Function EE391 Special Report (Spring 25) Automatic Chord Recognition Using A Summary Autocorrelation Function Advisor: Professor Julius Smith Kyogu Lee Center for Computer Research in Music and Acoustics (CCRMA)

More information

Effects of acoustic degradations on cover song recognition

Effects of acoustic degradations on cover song recognition Signal Processing in Acoustics: Paper 68 Effects of acoustic degradations on cover song recognition Julien Osmalskyj (a), Jean-Jacques Embrechts (b) (a) University of Liège, Belgium, josmalsky@ulg.ac.be

More information

ISMIR 2008 Session 2a Music Recommendation and Organization

ISMIR 2008 Session 2a Music Recommendation and Organization A COMPARISON OF SIGNAL-BASED MUSIC RECOMMENDATION TO GENRE LABELS, COLLABORATIVE FILTERING, MUSICOLOGICAL ANALYSIS, HUMAN RECOMMENDATION, AND RANDOM BASELINE Terence Magno Cooper Union magno.nyc@gmail.com

More information

Perceptual dimensions of short audio clips and corresponding timbre features

Perceptual dimensions of short audio clips and corresponding timbre features Perceptual dimensions of short audio clips and corresponding timbre features Jason Musil, Budr El-Nusairi, Daniel Müllensiefen Department of Psychology, Goldsmiths, University of London Question How do

More information

WE ADDRESS the development of a novel computational

WE ADDRESS the development of a novel computational IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010 663 Dynamic Spectral Envelope Modeling for Timbre Analysis of Musical Instrument Sounds Juan José Burred, Member,

More information

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS Andrew N. Robertson, Mark D. Plumbley Centre for Digital Music

More information

MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES

MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES Jun Wu, Yu Kitano, Stanislaw Andrzej Raczynski, Shigeki Miyabe, Takuya Nishimoto, Nobutaka Ono and Shigeki Sagayama The Graduate

More information

Music Source Separation

Music Source Separation Music Source Separation Hao-Wei Tseng Electrical and Engineering System University of Michigan Ann Arbor, Michigan Email: blakesen@umich.edu Abstract In popular music, a cover version or cover song, or

More information

Creating a Feature Vector to Identify Similarity between MIDI Files

Creating a Feature Vector to Identify Similarity between MIDI Files Creating a Feature Vector to Identify Similarity between MIDI Files Joseph Stroud 2017 Honors Thesis Advised by Sergio Alvarez Computer Science Department, Boston College 1 Abstract Today there are many

More information

Analysis, Synthesis, and Perception of Musical Sounds

Analysis, Synthesis, and Perception of Musical Sounds Analysis, Synthesis, and Perception of Musical Sounds The Sound of Music James W. Beauchamp Editor University of Illinois at Urbana, USA 4y Springer Contents Preface Acknowledgments vii xv 1. Analysis

More information

Music Information Retrieval Community

Music Information Retrieval Community Music Information Retrieval Community What: Developing systems that retrieve music When: Late 1990 s to Present Where: ISMIR - conference started in 2000 Why: lots of digital music, lots of music lovers,

More information

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Aric Bartle (abartle@stanford.edu) December 14, 2012 1 Background The field of composer recognition has

More information

A Framework for Segmentation of Interview Videos

A Framework for Segmentation of Interview Videos A Framework for Segmentation of Interview Videos Omar Javed, Sohaib Khan, Zeeshan Rasheed, Mubarak Shah Computer Vision Lab School of Electrical Engineering and Computer Science University of Central Florida

More information

Book: Fundamentals of Music Processing. Audio Features. Book: Fundamentals of Music Processing. Book: Fundamentals of Music Processing

Book: Fundamentals of Music Processing. Audio Features. Book: Fundamentals of Music Processing. Book: Fundamentals of Music Processing Book: Fundamentals of Music Processing Lecture Music Processing Audio Features Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Meinard Müller Fundamentals

More information

Speech and Speaker Recognition for the Command of an Industrial Robot

Speech and Speaker Recognition for the Command of an Industrial Robot Speech and Speaker Recognition for the Command of an Industrial Robot CLAUDIA MOISA*, HELGA SILAGHI*, ANDREI SILAGHI** *Dept. of Electric Drives and Automation University of Oradea University Street, nr.

More information

Musical Hit Detection

Musical Hit Detection Musical Hit Detection CS 229 Project Milestone Report Eleanor Crane Sarah Houts Kiran Murthy December 12, 2008 1 Problem Statement Musical visualizers are programs that process audio input in order to

More information

Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting

Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting Dalwon Jang 1, Seungjae Lee 2, Jun Seok Lee 2, Minho Jin 1, Jin S. Seo 2, Sunil Lee 1 and Chang D. Yoo 1 1 Korea Advanced

More information

A Categorical Approach for Recognizing Emotional Effects of Music

A Categorical Approach for Recognizing Emotional Effects of Music A Categorical Approach for Recognizing Emotional Effects of Music Mohsen Sahraei Ardakani 1 and Ehsan Arbabi School of Electrical and Computer Engineering, College of Engineering, University of Tehran,

More information

CONCATENATIVE SYNTHESIS FOR NOVEL TIMBRAL CREATION. A Thesis. presented to. the Faculty of California Polytechnic State University, San Luis Obispo

CONCATENATIVE SYNTHESIS FOR NOVEL TIMBRAL CREATION. A Thesis. presented to. the Faculty of California Polytechnic State University, San Luis Obispo CONCATENATIVE SYNTHESIS FOR NOVEL TIMBRAL CREATION A Thesis presented to the Faculty of California Polytechnic State University, San Luis Obispo In Partial Fulfillment of the Requirements for the Degree

More information

Semi-supervised Musical Instrument Recognition

Semi-supervised Musical Instrument Recognition Semi-supervised Musical Instrument Recognition Master s Thesis Presentation Aleksandr Diment 1 1 Tampere niversity of Technology, Finland Supervisors: Adj.Prof. Tuomas Virtanen, MSc Toni Heittola 17 May

More information

Predicting Time-Varying Musical Emotion Distributions from Multi-Track Audio

Predicting Time-Varying Musical Emotion Distributions from Multi-Track Audio Predicting Time-Varying Musical Emotion Distributions from Multi-Track Audio Jeffrey Scott, Erik M. Schmidt, Matthew Prockup, Brandon Morton, and Youngmoo E. Kim Music and Entertainment Technology Laboratory

More information

Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection

Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection Kadir A. Peker, Ajay Divakaran, Tom Lanning Mitsubishi Electric Research Laboratories, Cambridge, MA, USA {peker,ajayd,}@merl.com

More information

Fast Ethernet Consortium Clause 25 PMD-EEE Conformance Test Suite v1.1 Report

Fast Ethernet Consortium Clause 25 PMD-EEE Conformance Test Suite v1.1 Report Fast Ethernet Consortium Clause 25 PMD-EEE Conformance Test Suite v1.1 Report UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 +1-603-862-0090 Consortium Manager: Peter Scruton pjs@iol.unh.edu +1-603-862-4534

More information

However, in studies of expressive timing, the aim is to investigate production rather than perception of timing, that is, independently of the listene

However, in studies of expressive timing, the aim is to investigate production rather than perception of timing, that is, independently of the listene Beat Extraction from Expressive Musical Performances Simon Dixon, Werner Goebl and Emilios Cambouropoulos Austrian Research Institute for Artificial Intelligence, Schottengasse 3, A-1010 Vienna, Austria.

More information

Citation for published version (APA): Jensen, K. K. (2005). A Causal Rhythm Grouping. Lecture Notes in Computer Science, 3310,

Citation for published version (APA): Jensen, K. K. (2005). A Causal Rhythm Grouping. Lecture Notes in Computer Science, 3310, Aalborg Universitet A Causal Rhythm Grouping Jensen, Karl Kristoffer Published in: Lecture Notes in Computer Science Publication date: 2005 Document Version Early version, also known as pre-print Link

More information

Computational Modelling of Harmony

Computational Modelling of Harmony Computational Modelling of Harmony Simon Dixon Centre for Digital Music, Queen Mary University of London, Mile End Rd, London E1 4NS, UK simon.dixon@elec.qmul.ac.uk http://www.elec.qmul.ac.uk/people/simond

More information

Music Information Retrieval with Temporal Features and Timbre

Music Information Retrieval with Temporal Features and Timbre Music Information Retrieval with Temporal Features and Timbre Angelina A. Tzacheva and Keith J. Bell University of South Carolina Upstate, Department of Informatics 800 University Way, Spartanburg, SC

More information

Lab 5 Linear Predictive Coding

Lab 5 Linear Predictive Coding Lab 5 Linear Predictive Coding 1 of 1 Idea When plain speech audio is recorded and needs to be transmitted over a channel with limited bandwidth it is often necessary to either compress or encode the audio

More information

DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS

DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS Item Type text; Proceedings Authors Habibi, A. Publisher International Foundation for Telemetering Journal International Telemetering Conference Proceedings

More information

Research Article. ISSN (Print) *Corresponding author Shireen Fathima

Research Article. ISSN (Print) *Corresponding author Shireen Fathima Scholars Journal of Engineering and Technology (SJET) Sch. J. Eng. Tech., 2014; 2(4C):613-620 Scholars Academic and Scientific Publisher (An International Publisher for Academic and Scientific Resources)

More information

A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations in Audio Forensic Authentication

A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations in Audio Forensic Authentication Proceedings of the 3 rd International Conference on Control, Dynamic Systems, and Robotics (CDSR 16) Ottawa, Canada May 9 10, 2016 Paper No. 110 DOI: 10.11159/cdsr16.110 A Parametric Autoregressive Model

More information

Data will be analysed based upon actual screen size, but may be presented if necessary in three size bins : Screen size category Medium (27 to 39 )

Data will be analysed based upon actual screen size, but may be presented if necessary in three size bins : Screen size category Medium (27 to 39 ) Mapping Document Country: Technology: Sub Category: All Introduction The first stage in the Mapping and Benchmarking process is the definition of the products, i.e. clearly setting the boundaries that

More information

STRUCTURAL CHANGE ON MULTIPLE TIME SCALES AS A CORRELATE OF MUSICAL COMPLEXITY

STRUCTURAL CHANGE ON MULTIPLE TIME SCALES AS A CORRELATE OF MUSICAL COMPLEXITY STRUCTURAL CHANGE ON MULTIPLE TIME SCALES AS A CORRELATE OF MUSICAL COMPLEXITY Matthias Mauch Mark Levy Last.fm, Karen House, 1 11 Bache s Street, London, N1 6DL. United Kingdom. matthias@last.fm mark@last.fm

More information