
Audio-based Music Segmentation Using Multiple Features

Pedro Girão Antunes

Dissertation submitted for obtaining the degree of Master in Electrical and Computer Engineering

Jury
President: Doutor Carlos Filipe Gomes Bispo
Supervisor: Doutor David Manuel Martins de Matos
Members: Doutora Isabel Maria Martins Trancoso; Doutor Thibault Nicolas Langlois

December 2011

Acknowledgements

I would like to express my gratitude to my supervisor, David Matos, and to Carlos Rosão for his contribution. Thanks also to my family, especially my parents Ana and António, my brother Francisco, and my aunt Maria do Carmo; and to my friends, especially Mariana Fontes, Gonçalo Paiva, João Fonseca, Bernardo Lopes, Catarina Vazconcelos, Pedro Mendes, João Devesa, Luis Nunes, Manuel Dordio and Miguel Pereira.

Lisboa, December 13, 2011
Pedro Girão Antunes

Resumo

Structural segmentation based on the musical audio signal is a growing research area. It aims to segment a piece of music into structurally meaningful parts, or high-level segments. Among many applications, it offers great potential for improving the acoustic and musicological understanding of a piece of music. This thesis describes a method for automatically locating points of change in music, the boundaries between segments, based on a two-dimensional representation of the piece, the SDM (Self Distance Matrix), and on audio onsets. The features used to compute the SDM are the MFCCs, the chromagram and the rhythmogram, which are also combined. The audio onsets are determined using several state-of-the-art methods. Their use rests on the assumption that every segment boundary must be an audio onset. Essentially, the SDM is used to determine which of the detected onsets are moments of segment change. To that end, a checkerboard kernel is applied along the diagonal of the SDM, yielding a function whose peaks are considered candidate boundary instants. The selected instants are the audio onsets closest to the detected peaks. The implementation of the method relies on Matlab and several toolboxes. The results, obtained for a corpus of 50 songs, are comparable with the state of the art.

Abstract

Structural segmentation based on the musical audio signal is a growing area of investigation. It aims to segment a piece of music into structurally significant parts, or higher-level segments. Among many applications, it offers great potential for improving the acoustic and musicological modeling of a piece of music. This thesis describes a method for automatically locating points of change in the music, based on a two-dimensional representation of the piece, the SDM (Self Distance Matrix), and on the detection of audio onsets. The features used for the computation of the SDM are the MFCCs, the chromagram and the rhythmogram, which are also combined. The audio onsets are determined using distinct state-of-the-art methods; they are used under the assumption that every segment-changing moment must be an audio onset. Essentially, the SDM is used to determine which of the detected onsets are moments of segment change. To do so, a checkerboard kernel with radial smoothing is applied along the diagonal of the SDM, yielding a novelty-score function whose peaks are considered candidate instants. The selected instants are the audio onsets closest to the detected peaks. The application of the method relies on Matlab and several toolboxes. Our results, obtained for a corpus of 50 songs, are comparable with the state of the art.


Contents

1 Introduction
  1.1 Music - Audio Signal
  1.2 MIR Audio-based Approaches
  1.3 Automatic Music Structural Segmentation
    1.3.1 Feature extraction
      1.3.1.1 Timbre Features
      1.3.1.2 Pitch related Features
      1.3.1.3 Rhythmic Features
    1.3.2 Techniques
  1.4 Objective
  1.5 Document Structure

2 Music Structure Analysis
  2.1 Structural Segmentation Types of Approaches
    2.1.1 Novelty-based Approaches
    2.1.2 State Approaches
    2.1.3 Sequence Approaches
  2.2 Segment Boundaries and Note Onsets
  2.3 Summary

3 Method
  3.1 Extracted Features
    3.1.1 Window of Analysis
    3.1.2 Mel Frequency Cepstral Coefficients
    3.1.3 Chromagram
    3.1.4 Rhythmogram
  3.2 Segment Boundaries Detection
    3.2.1 Self Distance Matrix
    3.2.2 Checkerboard Kernel Correlation
    3.2.3 Peak Selection
  3.3 Mixing Features
  3.4 Note Onsets
  3.5 Summary

4 Evaluation and Discussion of the Results
  4.1 Corpus and Groundtruth
  4.2 Baseline Results
  4.3 Feature Window of Analysis
  4.4 SDM Distance Measure
  4.5 Note Onsets
  4.6 Mixing Features
  4.7 Discussion
  4.8 Summary

5 Conclusion
  5.1 Conclusion
  5.2 Contributions
  5.3 Future Work


List of Figures

1.1 Signals Spectrum
1.2 Musical Score
1.3 Audio Signal
1.4 MIR Tasks
1.5 Features Representation
2.1 HMM
2.2 SDM Sequence
3.1 Chroma Helix
3.2 Rhythmogram
3.3 Flowchart of the method implemented
3.4 MFCC SDM
3.5 Checkerboard Kernel
3.6 Novelty-score Computation
3.7 Novelty-score


List of Tables

3.1 State of the Art Works and Features
4.1 Method Baseline Setup
4.2 Corpus
4.3 Baseline Average F-measure Results
4.4 Average Results - Window Size Experiment
4.5 Average Results - Distance Measure Experiment
4.6 Average Results - Onsets Experiment
4.7 Best Sum of SDMs Results
4.8 Best SVD Results
4.9 Best Intersection Results
4.10 Average Results - Feature Mixture Experiment
4.11 Average Results - Feature Mixture Experiment
4.12 Method Best Setup
4.13 Best Result
4.14 State of the Art Results
4.15 MIREX Boundary recovery results


Nomenclature

abs: Absolute value
AT: Automatically generated boundaries
C_k: Gaussian tapered checkerboard kernel
d_s: Distance measure function
F: F-measure
GT: Groundtruth boundary annotations
N: Novelty-score function
P: Precision
r: Correlation coefficient
R: Recall
v: Feature vector
w_s: Window size
w_t: Groundtruth threshold


1 Introduction

The expansion of music in digital format, due to the growing efficiency of compression algorithms, led to the massification of music consumption. This phenomenon led to the creation of a new research field called music information retrieval (MIR). Information retrieval (IR) is the science of retrieving, from a collection of items, a subset that serves some defined purpose; in this case it is applied to music. The goal of this chapter is to present the context in which this thesis has been developed, including the motivation for this work, some practical aspects related to automatic audio segmentation and, finally, a summary of the work carried out and how it is organized.

1.1 Music - Audio Signal

In an objective and simple way, music can be defined as the art of arranging sounds and silences in time. Any sound can be described as a combination of sine waves, each with its own frequency of vibration, amplitude, and phase. In particular, the sounds produced by musical instruments are the result of the combination of different frequencies that are all integer multiples of a fundamental frequency; these are called harmonics (figure 1.1). The perception of this fundamental frequency is called pitch, which is one of the characterizing elements of a sound alongside loudness (related to the amplitude of the signal) and timbre. Typically, humans cannot perceive the harmonics as separate notes. Instead, a musical note composed of many harmonically related frequencies is perceived as one sound, where the relative strengths of the individual harmonic frequencies give the timbre of that sound. In polyphonic music, sound is composed of various instruments that interact through time, all together composing the diverse dimensions of music. The main musical dimensions of interest for music retrieval are:

Timbre can be simply defined as everything about a sound which is neither loudness nor pitch (Erickson 1975). As an example, it is what is different about the same tone performed on an acoustic guitar and on a flute.

Figure 1.1: Some periodic audio signals (left) and their frequency-domain counterparts (right). The first signal is a simple sine wave used to tune musical instruments. The subsequent signals present a growing complexity relative to the sine wave; their harmonics are the peaks in the frequency plot, evident for the violin and the flute.

Rhythm is the arrangement of sounds and silences in time. It is related to the periodic repetition of a temporal pattern of onsets. The perception of rhythm is closely related to the sound onsets alone, so the sounds can be unpitched, as percussion instrument sounds are, for example.

Melody is a linear succession of musical tones which is perceived as a single entity. Usually the tones have similar timbre and a recognizable pitch within a small frequency range.

Harmony is the conjugation of diverse pitches sounding simultaneously. Harmony can be conveyed by polyphonic instruments, by a group of monophonic instruments, or may be indirectly implied by the melody.

Structure is on a different level from the previous dimensions, as it covers them all. Structure, or musical form, relates to the way the previous dimensions create patterns, making structural segments that repeat themselves in some way, like the chorus, the verse, and so on.

Music can be represented in a symbolic way, as a musical score, used by musicians to read and write music (figure 1.2). Another form of representation, and the more common one, is the auditory representation as a waveform (e.g., WAV, MP3, etc.) (figure 1.3). It is on this representation that most MIR research is based; such methods are called audio-based approaches.
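To make the relation between harmonics and timbre concrete, the following Python sketch (an illustration only, not part of the thesis) synthesizes a one-second tone at 440 Hz from five harmonics. The harmonic weights are arbitrary values chosen for the example; changing them changes the timbre while the perceived pitch stays the same.

    import numpy as np

    sr = 22050                        # sample rate (Hz)
    t = np.arange(sr) / sr            # one second of time samples
    f0 = 440.0                        # fundamental frequency (A4)

    # Relative strengths of the first five harmonics (integer multiples
    # of f0); these arbitrary weights shape the timbre of the tone.
    weights = [1.0, 0.5, 0.3, 0.2, 0.1]
    tone = sum(w * np.sin(2 * np.pi * (k + 1) * f0 * t)
               for k, w in enumerate(weights))
    tone /= np.abs(tone).max()        # normalize the amplitude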

Figure 1.2: A musical score sample of the famous song Hey Jude by The Beatles.

Figure 1.3: Audio signal from the song Northern Sky by Nick Drake.

1.2 MIR Audio-based Approaches

The main idea underlying content-based approaches is that a document can be described by a set of features that are directly computed from its content, in this case audio. Despite the existence of metadata (author name, work title, genre classification and so on), the basic assumption behind audio-based approaches is that metadata may be unsuitable, unreliable, or missing. On one hand, relying only on the information within the music is advantageous because that is generally the only information available. On the other hand, it presents many difficulties due to the heterogeneity and complexity of musical data. Listening to music, we humans can easily perceive a variety of events: the progression of harmonies and the melodic cadences (although we might not be able to name them), changes of instrumentation, the presence of drum fills, the presence of vocals, etc. We can perceive many events in music and, even without formal musical training, by identifying repetitions and abrupt changes, we can perceive structure. Over the past decade, MIR as a research field has grown significantly. Given the multidisciplinary nature of the field, it brings together experts from many different areas of research: signal processing, database research,

machine learning, musicology, perception, psychology, sociology, etc. Figure 1.4 presents some examples of MIR tasks and their level. Note that the objectivity of a task tends to be inversely proportional to its level. This thesis focuses on the structural segmentation task.

Figure 1.4: Some MIR tasks organized by level.

1.3 Automatic Music Structural Segmentation

Every piece of music has an overall plan or structure, called the form of the music. Musical forms offer a great range of complexity. For example, most occidental pop music tends to be short and simple, often built upon repetition; on the other hand, classical music traditions around the world tend to encourage longer, more complex forms. Note that, from an abstract point of view, structure is closely related to the human perception of it. For instance, most occidental people can easily distinguish the verse from the chorus of a pop song, but will have trouble recognizing what is going on in a piece of Chinese traditional music. Furthermore, classical music forms may be difficult to recognize without the familiarity that comes from study or repeated hearings. Regarding pop music, modern production techniques often use copy and paste to clone multiple segments of the same type, and even to clone components within a segment. This obviously facilitates the work of automatic segmentation, and thus good results are obtained on this kind of music. The structural segmentation task can be divided into three problems:

Determine the segment boundaries: the beginning and ending instants of each segment;

Determine the recurrent form: grouping the segments that are occurrences of the same musical part. They can be repetitions of the exact same segment or slight variations, depending on the music genre.

The groups are often specified by letters A, B, C...; each group of segments is called a part;

Determine the part label: for example, the chorus, the verse, the intro, etc.

The second and third problems are similar: they are basically distance measurements. The third one normally depends on the second, so it is considered less important in the scope of this thesis, also because of the extreme difficulty it presents. Some work has been done on this particular problem, for example by Paulus (2010). Furthermore, there are some methods focused only on the detection of the chorus, for example Goto (2006). The second problem is more commonly addressed. In some cases, it follows the first one, i.e., after determining the segment boundaries, each piece of music standing between two boundaries is considered to be a segment; segments are then grouped by applying a distance measure. An example of this method is Cooper and Foote (2003). Others address directly the problem of determining the parts, which is generally done using clustering algorithms or Hidden Markov Models (HMM). The main idea underlying these methods is that music is made of repetition and, in that sense, the states of the HMM represent the different parts. Note that these methods also determine the segment boundaries: by determining the structural parts, the boundary instants are implicitly determined. Finally, the first problem is the one addressed by this thesis. One example of work addressing this problem is that carried out by Foote (2000), following his work on a two-dimensional representation of a musical signal, the Self-similarity Matrix (SSM) (Foote 1999), one of the most important breakthroughs in the structural segmentation task. Other works include the one by Tzanetakis and Cook (1999). In chapter 2, the state of the art approaches are presented in more detail. Knowledge of the structure has various useful practical applications, for example: audio browsing, i.e., besides browsing an album through songs, it could also be possible to browse a song through segments; a starting point for other MIR tasks, including music summarization (automatic selection of short representative audio thumbnails), music recommendation (recommending songs with similar structure), genre classification, etc.; and even assistance in musicological studies, for example, studying the musical structure of songs from a determined culture or time, or the structure of songs that were in the top charts of the last decades. All the procedures start with a feature extraction step, where the audio stream is split into a number of frames from which feature vectors are calculated. Since the audio stream samples themselves do not provide

relevant information, feature extraction is essential. Even more essential is to understand the meaning of the extracted features, i.e., what they represent regarding the musical dimensions. The subsequent steps depend on the procedure and on the goals to be reached (summarization, chorus detection, segment boundaries detection, etc.); however, they are limited by the extracted features and what they represent. The feature extraction step therefore plays a central role in any MIR procedure.

1.3.1 Feature extraction

Feature extraction is essential for any music information retrieval system, in particular when detecting segment boundaries. In general, humans can easily perceive segment boundaries in popular music that is familiar to them. But what information contained in a musical signal is important to perceive that event? According to the experiments of Bruderer et al. on the human perception of structural boundaries in popular music (Bruderer et al. 2006), global structure (repetition, break), change in timbre, change in level and change in rhythm represent the main perceptual cues responsible for the perception of boundaries in music. Therefore, in order to optimize the detection of such boundaries, the extracted features shall roughly represent the referred perceptual cues. Considering the perceptual cues and the presented musical dimensions, the musical signal is generally summarized in three dimensions: the timbre, the tonal part (pitch related: harmony and melody) and the rhythm. The features used in our method are presented in more detail in chapter 3.

1.3.1.1 Timbre Features

Perceptually, timbre is one of the most important dimensions in a piece of music. Its importance relative to other musical dimensions can be easily understood by the fact that anyone can recognize familiar instruments, even without conscious thought, and people are able to do it with much less effort and much more accuracy than when recognizing harmonies or scales. As determined by Terasawa et al. (2005), Mel-frequency cepstral coefficients (MFCC) are a good model for the perceptual timbre space. The MFCC is well known as a front-end for speech recognition systems. The first part of figure 1.5 represents a 40-dimensional MFCC vector over time. In addition to the use of MFCCs, in order to complete the timbre information of the musical signal, the computation of the spectral centroid, spectral spread and spectral slope can also be useful (Kaiser and Sikora 2010).

Figure 1.5: Representation of various features as well as the segment boundaries groundtruth (dashed lines). The first corresponds to the MFCCs, the second to the chromagram and the third to the rhythmogram.

As an alternative to the use of MFCCs, Levy and Sandler (2008) use the AudioSpectrumEnvelope, AudioSpectrumProjection and SoundModel descriptors of the MPEG-7 standard. Another alternative feature is Perceptual Linear Prediction (PLP) (Hermansky 1990), used by Jensen (2007).

1.3.1.2 Pitch related Features

Pitch, upon which harmonic and melodic sequences are built, represents an important musical dimension. One example of its importance to human perception are cover songs: covers usually preserve harmony and melody while using a different set of musical instruments, thus altering the timbre information of the song, yet they are usually accurately recognized by people. In the context of music structural segmentation, chroma features represent the most powerful representation for describing harmonic information (Müller 2007). The most important advantage of chroma features is their robustness to changes in timbre. A similar feature is the Pitch Class Profile coefficients (PCP) (Gómez 2006), used by Shiu et al. (2006).

1.3.1.3 Rhythmic Features

Rhythmic features are among the less used in the task of music structural segmentation, despite the perceptual cue, change in rhythm, identified in the Bruderer et al. study. In fact, Paulus and Klapuri (2008) noted that the use of rhythmic information in addition to timbre and harmonic features provides useful information for structure analysis. The rhythmic content of a musical signal can be described with a rhythmogram, as introduced by Jensen (2004) (third part of figure 1.5). It is comparable to a spectrogram but, instead of representing the frequency spectrum of the signal, it represents its rhythmic content.

1.3.2 Techniques

Some techniques were already referred to in the beginning of this section; they are presented in more detail in chapter 2:

Self Distance Matrix: The Self Distance Matrix (SDM) compares the feature vectors with each other, using some determined distance measure (for example, the Euclidean) (Foote 1999).

Hidden Markov Models: The use of an HMM to represent music assumes that each state represents some musical information, thus defining a musical alphabet where each state represents a letter.

Clustering: The idea underlying the use of clusters to represent music is that different segments are represented by different clusters.

Time difference: Using the time differential of the feature vector, large differences would indicate sudden transitions, and thus possible segment boundaries.

Cost Function: The cost function determines the cost of a given segment, such that segments whose composing frames have a high degree of self-similarity have a low cost.

1.4 Objective

The goal of this thesis is to perform structural segmentation of audio stream files, that is, to identify the instants of segment change, the boundaries between segments. The computed boundaries are then compared with manually annotated ones in order to evaluate their quality.

1.5 Document Structure

Having presented the context in which this thesis has been developed, including the motivation for this work and some practical aspects related to automatic audio segmentation, the remainder of this document is organized as follows: Chapter 2 introduces the state of the art approaches. Chapter 3 introduces the used features, followed by the presentation of the implemented method and each used tool. Chapter 4 introduces the final results, their discussion and a comparison with the state of the art. Chapter 5 introduces the conclusions and future work.


2 Music Structure Analysis

Music is structured, generally respecting some rules that vary with the genre of the music. Music can be divided into many genres in many different ways, and each genre can also be divided into a variety of styles. For instance, the Pop/Rock genre includes over 50 different styles [1], most of them extremely different (for example, Death Metal and Country Rock). So, even if there is controversy over the way music genres are divided, the diversity of sounds across genres is unquestionable. In that sense, achieving the capability to adapt to such a variety of sounds presents the major difficulty for automatic segmentation approaches. The goal of this chapter is to introduce the state of the art approaches to the problem of structural segmentation in music. They are organized in three sets, as proposed by Paulus et al. (2010): novelty-based approaches, state approaches and sequence approaches. Additionally, the chapter discusses the relation between segment boundaries and note onsets.

2.1 Structural Segmentation Types of Approaches

The various techniques used so far to solve the structural segmentation problem can be grouped according to their paradigm. Peeters (2004) considered dividing the approaches into two sets: sequence approaches and state approaches. The sequence approaches consider that there are sequences of events that are repeated several times in a given piece of music. The state approaches consider the musical audio signal to be a succession of states, where each state produces some part of the signal. Paulus et al. (2010), on the other hand, suggested dividing the methods into three main sets: novelty-based approaches, homogeneity-based approaches and repetition-based approaches. In fact, the homogeneity-based approaches are basically the same as the state approaches defined by Peeters, and the repetition-based approaches are the sequence approaches. The third set proposed by Paulus, the novelty-based approaches, can be seen as a front-end for one of the other approaches or both. The goal of this section is to introduce each one of the three sets of approaches, as well as the state of the

[1] http://www.allmusic.com/explore/genre/poprock-d20

art methods of each, starting with the novelty-based approaches, followed by the state approaches and finally the sequence approaches.

2.1.1 Novelty-based Approaches

The goal of the novelty-based approaches is to locate the instants where changes occur in a song, usually referred to as segment boundaries. Knowing those, segments can be defined between them. The most common way of doing so is using a Self-Distance Matrix (SDM). The SDM is computed as follows:

$$\mathrm{SDM}(i, j) = d_s(v_i, v_j), \quad i, j = 1, \dots, n \quad (2.1)$$

where d_s represents a distance measure (for example, the Euclidean distance), v represents a feature vector, and i and j are the frame numbers, where a frame is the smallest piece of music used. Correlating a checkerboard kernel (figure 3.5) along the diagonal of the SDM yields a novelty-score function. The peaks of the novelty score represent candidate boundaries between segments. This method was first introduced by Foote (2000); more about it is presented in chapter 3. Another method to detect boundaries was proposed by Tzanetakis and Cook (1999), using the time differential of the feature vector, defined as the Mahalanobis distance:

$$\Delta_i = \sqrt{(v_i - v_{i-1})^T \,\Sigma^{-1}\, (v_i - v_{i-1})} \quad (2.2)$$

where Σ is an estimate of the feature covariance matrix, calculated from the training data, and i is the frame number. This measure is related to the Euclidean distance but takes into account the variances of and correlations among features. Large differences would indicate sudden transitions, and thus possible boundaries. More recently, Jensen (2007) proposed a method where boundaries are detected using a cost function. This cost function determines the cost of a given segment, such that segments whose composing frames have a high degree of self-similarity have a low cost.
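As a minimal sketch of the time-differential technique of eq. (2.2) in Python (a reimplementation for illustration, not the original code), the covariance here is estimated from the analyzed features themselves, whereas the original method estimates it from training data:

    import numpy as np

    def mahalanobis_novelty(V):
        """V: (n_frames, n_dims) feature matrix. Returns the distance of
        eq. (2.2) between each frame and its predecessor."""
        sigma_inv = np.linalg.pinv(np.cov(V, rowvar=False))
        diffs = np.diff(V, axis=0)                    # v_i - v_{i-1}
        return np.sqrt(np.einsum('ij,jk,ik->i', diffs, sigma_inv, diffs))

    # Frames where the returned values spike are candidate boundaries.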

2.1.2 State Approaches

Figure 2.1: Representation of a simple HMM. Taken from Dannenberg and Goto (2008).

This kind of approach considers the music audio signal as a succession of states. The most notable methods in this set are the ones based on Hidden Markov Models (HMM) (Rabiner 1989). Using an HMM, the concept of state is taken more explicitly: it is assumed that each musical excerpt is represented by a state in the HMM. This way a musical alphabet is defined, where each musical excerpt (each state) represents a letter (what is referred to here as a musical excerpt can be one frame or a group of frames, depending on the approach). Time advances in discrete steps corresponding to feature vectors; transitions from one state to the next are modeled by a probabilistic distribution that only depends on the current state. This forms a Markov model that generates a sequence of states. Note that the states are hidden because only the feature vectors are observable. Another probability function models the generation of a given feature vector from a given state (figure 2.1). The features are then decoded using the Viterbi algorithm and the most likely sequence of states is determined. The first approaches to use this method (Aucouturier and M. Sandler 2001) (Chu and Logan 2000) (Peeters and Rodet 2002) were initially implemented using a small number of states, under the assumption that each state would represent one part (verse, chorus, etc.). Although this model had a certain appeal, it did not work very well because the result was often temporally fragmented. Considering the analogy used before, in this case different letters would represent different segments. Levy and Sandler (2008) used the same method with much better results: using a larger number of states, then calculating histograms of the states with a sliding window over the entire sequence of states. Their assumption was that each segment type is characterized by a particular distribution of states, because, roughly, each kind of segment contains similar music. In order to implement such an assumption, clustering algorithms are applied to the histograms, where each cluster corresponds to a particular part. Considering the analogy, in this case segments would be composed of sets of letters, i.e., words: a particular part would correspond to a particular word.
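The following Python sketch outlines this state-histogram idea using the third-party hmmlearn and scikit-learn packages (not the tools used in the cited works); the number of states, the histogram window and the number of parts are illustrative assumptions:

    import numpy as np
    from hmmlearn import hmm              # assumed installed: pip install hmmlearn
    from sklearn.cluster import KMeans

    def part_labels(V, n_states=40, win=30, n_parts=4):
        """V: (n_frames, n_dims) features. Decode a large-state HMM, then
        cluster sliding-window histograms of the state sequence."""
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type='diag', n_iter=50)
        model.fit(V)
        states = model.predict(V)         # most likely state per frame
        hists = np.array([np.bincount(states[i:i + win], minlength=n_states)
                          for i in range(len(states) - win + 1)])
        # Each cluster of state histograms is taken as one structural part.
        return KMeans(n_clusters=n_parts, n_init=10).fit_predict(hists)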

Another common approach is based on clustering instead of HMMs. In Cooper and Foote (2003), clustering is used to determine the most frequent segment, where the segments were determined using the novelty-score peaks. Goodwin and Laroche (2004) used an algorithm that performs segmentation and clustering at the same time.

2.1.3 Sequence Approaches

Figure 2.2: Representation of parallel stripes. The bottom row is a zoom of the top one; note that, from left to right, the matrix is being processed as described in the main text. Figure taken from Müller (2007).

The sequence approaches consider the music audio signal as a repetition of sequences of events. This set of approaches relies mainly on the detection of diagonal stripes parallel to the matrix main diagonal (figure 2.2). These stripes represent similar sequences of features, as first verified by Foote (1999). The diagonal stripes, when present, can easily be detected by humans in the SDM. However, the same is not true for automatic detection, due to varied distortions of the musical signal, for example in dynamics (e.g., a ritardando). In order to facilitate the detection of such stripes, several authors propose low-pass filtering along the diagonal to smooth the SDM. Peeters (2007), in addition, proposed a high-pass filter in the direction perpendicular to the stripes, to enhance them. Others proposed enhancing methods employing multiple iterations of erosion and dilation filtering along the diagonals (Lu et al. 2004); at this point, the discovery of music repetition turned into an image processing task. Goto (2006) proposed the use of a time-lag matrix, where the coordinates of the system are changed so that stripes appear horizontally or vertically and can be easily detected. Shiu et al. (2006) proposed the use of the Viterbi algorithm to detect the diagonal stripes of musical parts that present a weaker similarity value, for example verses.
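A minimal sketch of the low-pass filtering idea in Python: averaging the SDM along the main-diagonal direction with a small diagonal kernel (the kernel length is an arbitrary choice for the example):

    import numpy as np
    from scipy.ndimage import convolve

    def smooth_diagonals(sdm, length=9):
        """Average each SDM entry with its neighbours along the diagonal
        direction, making repetition stripes easier to detect."""
        kernel = np.eye(length) / length      # 45-degree averaging kernel
        return convolve(sdm, kernel, mode='nearest')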

These stripe-detection approaches somehow fail in the basic assumption that the stripes are parallel to the main diagonal. Furthermore, although the detection of sequence repetition represents a great improvement to musical structure analysis, it is usually not enough to represent the whole higher-level structural segmentation, as it requires a part to occur at least twice to be found. Accordingly, the combination of state approaches with sequence approaches appears to be the most reasonable option. A good example of the combination of both approaches is the work done by Paulus and Klapuri (2009).

2.2 Segment Boundaries and Note Onsets

Note onsets are defined as the starts of musical notes, not only pitched notes but also unpitched, rhythmic ones. In monophonic music a note onset is well defined, as is its duration; in polyphonic music, however, the note onsets of various instruments overlap. This makes them more difficult to identify, both automatically and perceptually. A variety of methods to detect note onsets are presented by Rosão and Ribeiro (2011). Considering the segment boundary detection task, it is our belief that note onsets can be used to validate the segment boundaries. The assumption is that any segment is defined between note onsets, so any segment must start at a note onset. In that sense, note onsets are seen as the events that trigger the segment change, and not only the segment change but every other event in music: in the extreme, without note onsets there is absence of sound.

2.3 Summary

In this chapter the state of the art approaches were introduced according to the division proposed by Paulus et al. (2010): novelty-based approaches, state approaches and sequence approaches. The first set is focused on the detection of segment boundaries and is generally used as a front-end for one of the other approaches. The second set considers the musical audio signal to be a succession of states, where each state produces some part of the signal. The last set considers that there are sequences of events repeated several times in a given piece of music. To finalize the chapter, we considered note onsets to be the events that trigger segment changes.


3 Method

Considering the introduced sets of methods, the implemented method belongs to the novelty-based approaches: it is focused on determining the segment boundaries. The goal of this chapter is to introduce the method developed to solve the problem of segmenting audio music streams, describing each used tool. It starts by considering the features collected from the audio stream and how they were mixed, followed by the introduction of the actual method.

3.1 Extracted Features

The extraction of features is a very important step in any MIR system. Table 3.1 shows the features used in some structural segmentation works. In our case, the extracted features are an attempt to represent the three main musical dimensions: timbre, tonal (harmony and melody) and rhythm. In this section we introduce the extracted features and their mixture; before that, we consider the windows of analysis used to collect those features.

3.1.1 Window of Analysis

The audio stream is first downsampled to 22050 Hz, since this number of samples is sufficient. The samples are then grouped in windows, or frames. In music structure segmentation, comparing frames with each other is a usual task, as is evident in the SDM. Such a task can represent heavy computation, depending on the number of frames used. Generally, larger frame lengths are used (0.1-1 s) compared with most audio content analysis (0.01-0.1 s). This reduces the number of frames in a song, thus reducing the SDM size. Moreover, a larger frame length yields a coarser temporal resolution which, according to Peeters, represents something musically more meaningful (Peeters 2004). Some proposed methods use variable-length frames instead of fixed ones. This has two benefits: tempo invariance, which means that a melody with some tempo fluctuation relative to another with the same pitch progression can be successfully matched; and sharper feature differences, preventing sound

Authors | Task | Features
Goto (2006) | Chorus Detection | Chroma
Jensen (2007) | Music Structure | Perceptual Linear Prediction (PLP), Chroma and Rhythmogram
Kaiser and Sikora (2010) | Music Structure | 13 MFCCs, spectral centroid, spectral slope and spectral spread
Levy and Sandler (2008) | Music Structural Segmentation | AudioSpectrumEnvelope, AudioSpectrumProjection and SoundModel descriptors of the MPEG-7 standard
Paulus and Klapuri (2009) | Music Structural Segmentation | 12 MFCCs (excluding the 0th), Chroma and Rhythmogram
Peeters (2007) | Music Structural Segmentation | 13 MFCCs (excluding the 0th), 12 Spectral Contrast coefficients and Pitch Class Profile coefficients
Peiszer et al. (2008) | Music Structural Segmentation | 40 MFCCs
Shiu et al. (2006) | Similar Segment Identification | Pitch Class Profile coefficients
Turnbull and Lanckriet (2007) | Music Structural Segmentation | MFCCs and Chroma

Table 3.1: Compilation of works and features used.

events' respective features from spreading to other frames. Peiszer et al. (2008), for example, used the note onsets to set the window sizes. In our case, in order to accomplish sharper feature differences, the size of the windows is determined from the bpm. The bpm is determined using the function mirtempo() from MIRtoolbox (Lartillot 2011), which estimates the tempo by detecting periodicities in the onset detection curve; in this case the onsets are determined using the function mironsets(), also from MIRtoolbox. The mirtempo() function is quite accurate. We make the assumption that the tempo in pop music is constant. The window size is determined as follows:

$$w_s = \frac{1}{2 \cdot \frac{\mathrm{bpm}}{60}} \quad (3.1)$$

This yields window sizes between 0.15 s and 0.3 s, which is equivalent to bpm between 100 and 200, depending on the song. We used no overlapping of windows, except for the rhythmogram. The impact of using a variable window size compared to a fixed one is discussed with actual evaluation values in the next chapter.
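In Python, eq. (3.1) amounts to half a beat period; a minimal sketch (the bpm value would come from a tempo estimator such as mirtempo(), which is not reproduced here):

    def window_size(bpm):
        """Eq. (3.1): window length in seconds, half of one beat period."""
        return 60.0 / (2.0 * bpm)

    print(window_size(100))   # 0.30 s
    print(window_size(200))   # 0.15 s
    frame_length = int(round(window_size(120) * 22050))  # samples at 22050 Hz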

3.1.2 Mel Frequency Cepstral Coefficients

The MFCCs are extensively used to represent timbre in music. We used 40 MFCCs calculated using a filter bank composed of linear and logarithmic filters to model loudness compression, in order to simulate the characteristics of the human auditory system. The coefficients are obtained by taking the discrete cosine transform (DCT) of the log-power spectrum on the Mel-frequency scale. Most authors do not use more than 20 coefficients. However, some tests were made with 40 coefficients, and the final results show that the increase in the number of coefficients has a significant influence (around 5% in our case). This was also verified by Peiszer et al. (2008), as well as by Santos (2010), who also used 40 MFCCs. The first part of figure 1.5 represents a 40-dimensional MFCC vector over time.

3.1.3 Chromagram

Figure 3.1: Pitch represented by two dimensions: height, moving vertically in octaves, and chroma, or pitch class, determining the rotation position within the helix. Taken from Gómez (2006).

The chroma refers to the 12 traditional pitch classes (the 12 semitones) {C, C#, D, ..., A#, B}. As pitch repeats itself every octave (12 semitones), a pitch class is defined to be the set of all pitches that share the same chroma, as represented in figure 3.1. For example, the pitch class corresponding to the chroma F is the set {..., F0, F1, F2, ...}, where 0, 1 and 2 denote the pitch F of each octave. Therefore, the chroma representation is a 12-dimensional vector, where each dimension is the respective chroma of the signal. The second part of figure 1.5 shows a chromagram: the chroma represented over time.

We used the Müller and Ewert (2011) method to extract the chroma features. First, the pitch values are determined using a bank of 88 filters, each centered on one pitch, from A0 to C8. The chroma vector is then calculated simply by adding the pitches that correspond to the same chroma.

3.1.4 Rhythmogram

The rhythmogram was first presented by Jensen (2004). It is computed by determining the autocorrelation of the note onsets over intervals of 2 s, using a millisecond scale, which produces a vector of dimension 200 (figure 3.2). Unlike the other two extracted features, the rhythmogram is calculated using a window of analysis of 2 s and a hop size of w_s. This way, the rhythmogram has the same number of samples per song as the MFCCs and the chromagram. We used 4 different onset sources: one taken from Peiszer et al. (2008), which came from a beat tracker; another by Rosão (2011), based on the Spectral Flux (Bello et al. 2005); and the other two using the MIRToolbox function mironsets() (Lartillot 2011), one using the envelope of the signal and the other using the Spectral Flux as well. The first onsets are very few compared with the other three, which suggests that there must have been some selection; that fact is adverse to the usefulness of the rhythmogram. As shown in figure 3.2 (c), this rhythmogram presents too little information; on the other hand, the rhythmograms from the other note onsets, figure 3.2 (a), (b) and (d), convey much more information. This is reflected in the final results, as will be shown in the next chapter.

Figure 3.2: Rhythmograms computed using different note onsets. a) Rosão; b) mironsets() using spectral flux; c) from Peiszer et al. (2008); and d) mironsets() using the envelope.
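The sketch below approximates the three feature extractors in Python with the librosa library; the thesis used Matlab toolboxes instead, the hop here is fixed rather than tempo-dependent, and the rhythmogram lags are counted in onset-curve samples rather than the 200 millisecond-scale lags described above. The file name is a placeholder.

    import numpy as np
    import librosa

    y, sr = librosa.load('song.wav', sr=22050)    # placeholder file name
    hop = 4096                                    # fixed hop (~0.19 s at 22050 Hz)

    # Timbre and tonal features: 40 MFCCs and a 12-bin chromagram.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)

    # Rhythmogram: normalized autocorrelation of the onset-strength
    # curve over 2 s windows, one column per hop (after Jensen 2004).
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)
    lags = int(2.0 * sr / hop)                    # onset-curve samples in 2 s
    cols = []
    for i in range(len(onset_env) - lags):
        seg = onset_env[i:i + lags]
        ac = np.correlate(seg, seg, mode='full')[lags - 1:]
        cols.append(ac / (ac[0] + 1e-9))
    rhythmogram = np.array(cols).T                # lag x time, cf. figure 3.2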

3.2 Segment Boundaries Detection

In this section the algorithm is presented. Figure 3.3 represents a flowchart of the method, implemented using Matlab. The method is based on the approach by Foote (2000), where he first introduced the novelty-score function. First, the SDM is computed; then a novelty-score function is calculated from it; and finally the peaks of that function are determined as candidates for segment boundaries. In the following, each of the steps of the algorithm is presented.

3.2.1 Self Distance Matrix

The SDM is determined by eq. (2.1). The distance measure used was the Manhattan distance, as it is known to perform well when dealing with high-dimensionality data (Aggarwal et al. 2001), which is the case. In fact, experiments made with the Euclidean and the Cosine distances showed that there is not much difference in performance (chapter 4). The SDM presents some characteristics. First, since every frame is similar to itself, the matrix diagonal will be zero. Furthermore, assuming that the distance measure is symmetric, the matrix will be symmetric as well. The SDM can be visualized using a gray-scale image where similar frames are presented as black and infinitely different ones as white, or the other way round; this permits a somewhat useful visual representation of a piece of music (figure 3.4). The rectangular structures present in the matrix represent the structural elements present in a song. In order to detect them, a checkerboard kernel (figure 3.5) is correlated along the matrix diagonal.

3.2.2 Checkerboard Kernel Correlation

A checkerboard kernel is presented in figure 3.5. Such a kernel is correlated along the matrix diagonal according to the novelty-score function:

$$N(i) = \mathrm{abs}\big(r\big(C_k,\ \mathrm{SDM}_i\big)\big) \quad (3.2)$$

where C_k denotes a Gaussian tapered checkerboard kernel of size k, radially symmetric and centered on (0, 0), and SDM_i is the k x k submatrix of the SDM centered on the diagonal point (i, i), i.e., the entries SDM(i+m, i+n) for m, n = -k/2, ..., k/2. Figure 3.6 illustrates the novelty-score computation.

Figure 3.3: Flowchart of the method implemented.

Figure 3.4: The MFCC SDM for the song Northern Sky by Nick Drake.

Figure 3.5: Checkerboard kernel with a size of 96 (k = 96).

Figure 3.6: Illustration of the novelty-score computation.

In eq. (3.2), abs() denotes the absolute value and r() the correlation coefficient, which is computed as follows:

$$r = \frac{\sum_m \sum_n (A_{mn} - \bar{A})(B_{mn} - \bar{B})}{\sqrt{\Big(\sum_m \sum_n (A_{mn} - \bar{A})^2\Big)\Big(\sum_m \sum_n (B_{mn} - \bar{B})^2\Big)}} \quad (3.3)$$

where A and B represent the Gaussian tapered checkerboard kernel matrix and the subset of the SDM, respectively, and Ā and B̄ are the respective scalar means. This computation of N(i) is slightly different from the one presented in Foote (2000) and yielded better final results, which can be justified by the fact that the computation of the correlation takes into account the mean values of both matrices, thus eliminating eventual noise.

3.2.3 Peak Selection

The peaks of the novelty-score function are determined simply by detecting the sign changes (positive to negative) of its derivative. Generally, the number of peaks detected is far above the number of segment boundaries present in an average 3-minute pop song, so some selection is needed. One way of doing so is to analyze the function using windows of 6 s with half overlap: in each window the highest local maximum, if any, is chosen. This is done under the assumption that there are no segments smaller than 6 s. Figure 3.7 shows a novelty-score peak selection using this method. Another way is to define a threshold and eliminate the peaks beneath it. This approach was tested, but with unsatisfying results, due to the fact that most of the top peaks are not actual boundaries; instead, lower local maxima are. To face this problem, an average-weighted threshold was tested, which obtained better results than the constant threshold, but still below the results obtained with the window approach.

Figure 3.7: The novelty-score from the SDM of figure 3.4. The groundtruth is represented by the red dashed vertical lines and the automatically generated boundaries by the red crosses.
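The pipeline of sections 3.2.1-3.2.3 can be condensed into the following Python sketch (a reimplementation under the equations above, not the thesis's Matlab code); the Gaussian taper width and the peak-picking details are illustrative choices:

    import numpy as np
    from scipy.spatial.distance import cdist

    def gaussian_checkerboard(k):
        """Gaussian tapered checkerboard kernel C_k (cf. figure 3.5)."""
        ax = np.arange(-(k // 2), k // 2)
        xx, yy = np.meshgrid(ax, ax)
        taper = np.exp(-(xx ** 2 + yy ** 2) / (2 * (k / 4.0) ** 2))
        return np.sign(xx) * np.sign(yy) * taper

    def novelty_score(V, k=96):
        """Eqs. (2.1), (3.2), (3.3): Manhattan SDM, then |r| between C_k
        and the k x k SDM submatrix centered on each diagonal point."""
        sdm = cdist(V, V, metric='cityblock')
        C = gaussian_checkerboard(k)
        A = C - C.mean()
        h, n = k // 2, len(sdm)
        N = np.zeros(n)
        for i in range(h, n - h):
            sub = sdm[i - h:i + h, i - h:i + h]
            B = sub - sub.mean()
            denom = np.sqrt((A ** 2).sum() * (B ** 2).sum()) + 1e-12
            N[i] = abs((A * B).sum() / denom)
        return N

    def select_peaks(N, frame_dur, min_seg=6.0):
        """Section 3.2.3: keep the highest local maximum per 6 s window
        (half overlap), assuming no segment is shorter than 6 s."""
        maxima = np.flatnonzero((N[1:-1] > N[:-2]) & (N[1:-1] >= N[2:])) + 1
        win = max(2, int(round(min_seg / frame_dur)))
        peaks = set()
        for start in range(0, len(N), win // 2):
            cand = maxima[(maxima >= start) & (maxima < start + win)]
            if cand.size:
                peaks.add(int(cand[np.argmax(N[cand])]))
        return sorted(peaks)

Here frame_dur is the frame duration w_s in seconds, so the returned peak indices can be converted to boundary instants by multiplying by w_s.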

3.3 Mixing Features

The idea underlying mixing features is to use information from different musical dimensions so that they complement each other. In that sense, mixing features seems a perfectly justified operation, and even a simple one to perform, but in practice it is not. The first basic idea used to combine features was to validate the boundaries by intersecting the novelty-score peaks of the three different features alone. Every instant that was repeated at least twice across two different features, within a threshold of 1.5 s, was considered. The intersection was done in three different ways, each one taking as reference one of the three features, i.e., first the MFCC selected peaks are compared to the chromagram and rhythmogram selected peaks, and peaks are discarded if they are not repeated at least once; then the same is done for the rhythmogram and for the chromagram. Note that this can also be viewed as a peak selection process, and not a feature mixture per se, since the idea of different features complementing one another is not present in this approach.

The second idea was to sum the SDMs before the computation of the novelty-score function, as follows:

$$\mathrm{SDM}(M+R) = \alpha\,\mathrm{SDM}(\mathrm{MFCC}) + \mathrm{SDM}(\mathrm{Rhythmogram}) \quad (3.4)$$
$$\mathrm{SDM}(C+R) = \beta\,\mathrm{SDM}(\mathrm{Chroma}) + \mathrm{SDM}(\mathrm{Rhythmogram}) \quad (3.5)$$
$$\mathrm{SDM}(M+C) = \mathrm{SDM}(\mathrm{MFCC}) + \sigma\,\mathrm{SDM}(\mathrm{Chroma}) \quad (3.6)$$
$$\mathrm{SDM}(M+C+R) = \alpha\,\mathrm{SDM}(\mathrm{MFCC}) + \beta\,\mathrm{SDM}(\mathrm{Chroma}) + \mathrm{SDM}(\mathrm{Rhythmogram}) \quad (3.7)$$

where the feature of each SDM is given in brackets, and M, C and R stand for MFCC, Chromagram and Rhythmogram, respectively. The coefficients α, β and σ are computed as follows:

$$\alpha = \frac{\mathrm{mean}(\mathrm{SDM}(\mathrm{Rhythmogram}))}{\mathrm{mean}(\mathrm{SDM}(\mathrm{MFCC}))} \quad (3.8)$$
$$\beta = \frac{\mathrm{mean}(\mathrm{SDM}(\mathrm{Rhythmogram}))}{\mathrm{mean}(\mathrm{SDM}(\mathrm{Chroma}))} \quad (3.9)$$
$$\sigma = \frac{\mathrm{mean}(\mathrm{SDM}(\mathrm{MFCC}))}{\mathrm{mean}(\mathrm{SDM}(\mathrm{Chroma}))} \quad (3.10)$$

where the operation mean() determines the mean value of the matrix. The purpose is to balance the factors of the sum, trying to give the same weight to each one. Finally, the third idea was to use a dimensionality reduction method on the concatenated feature vector, combining features in groups of two and three. This created new feature vectors, which were then used to compute the SDM and run the remainder of the method. To that end, the Singular Value Decomposition (SVD) was used. The SVD is based on a theorem from linear algebra which says that a rectangular matrix M (which in this case represents the feature vectors) can be broken down into the product of three matrices: an orthogonal matrix U, a diagonal matrix S, and the transpose of an orthogonal matrix V. The decomposition is usually presented as:

$$M_{mn} = U_{mm} S_{mn} V_{nn}^{T} \quad (3.11)$$

The diagonal of S presents a descending curve, representing the decreasing contribution of each component; accordingly, only the first n vectors from V_{nn}^T are used, meaning that the ones left unused are useless, or even adverse (noise), for further computations. The results for each hypothesis are presented and discussed in the next chapter.
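Both mixing strategies reduce to a few lines of Python; a sketch under the equations above (the thesis's Matlab implementation is not reproduced here, and n_keep is an illustrative parameter):

    import numpy as np

    def mix_sdms(sdm_m, sdm_c, sdm_r):
        """Weighted sum of eq. (3.7), with the balancing coefficients of
        eqs. (3.8) and (3.9)."""
        alpha = sdm_r.mean() / sdm_m.mean()
        beta = sdm_r.mean() / sdm_c.mean()
        return alpha * sdm_m + beta * sdm_c + sdm_r

    def svd_reduce(features, n_keep):
        """Third idea: concatenate the per-frame feature vectors and keep
        the projection onto the first n_keep singular directions, eq. (3.11)."""
        M = np.concatenate(features, axis=1)       # frames x stacked dims
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        return M @ Vt[:n_keep].T                   # equals U[:, :n_keep] * S[:n_keep]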