Music Segmentation Using Markov Chain Methods

Paul Finkelstein

March 8, 2011

Abstract

This paper presents just how far the use of Markov chains has spread in the 21st century. We explain some of the techniques being used to segment audio, including the Wolff-Gibbs algorithm, Hidden Markov Models, and prior distributions. Finally, we discuss the possibilities for improvement in segmentation and the various uses for such a process.

1 Introduction

Music segmentation is a process by which specialists in digital music extract data from sound waves and use the base units of sound (pitch, rhythm, and volume) to separate the segments of a song from one another. There are many different types of segments, as is exhibited in Figure 1.
Fig. 1 Examples of different types of segmentation

The level of segmentation we will be looking for is of the highest order, as shown in the top line of the figure. The smaller chunks, though, inform us when separating the song into its largest chunks. Our goal will be to separate a popular song into its introduction, verse(s), chorus(es), bridge, transitions, and conclusion.

2 Spectrogram Creation

Previous attempts at music segmentation involved segmenting by spectral shape, segmenting by harmony, and segmenting by pitch and rhythm. While these methods exhibited some amount of success, they generally resulted in over-segmentation. The first step of every processing chain is to take the sinusoidal waveform of the sample audio and convert it into a spectrogram from which information can more easily be extracted. We will use a constant-Q transform to map the frequencies of Western harmony, which are geometrically spaced. A constant-Q transform, as opposed to a discrete Fourier transform, provides a log-frequency representation of the spectral components and makes the process of recognizing timbre, instrument, and speech differences easier. The constant Q is determined by the sampling rate (usually in frames per second) and the number of pitches desired
within a single octave. A constant-Q, logarithmically banded spectrogram with 1/12-octave-wide bands will provide the best representation of pitch, since Western harmony is based on 12 equally spaced tones within every octave. Let z_m(n) denote the mth band of the nth short-term log-frequency power spectrum in the sequence, with M bands in total,

    u(n) = ( Σ_{m=1}^{M} (log z_m(n))^2 )^{1/2}   and   w_m(n) = log z_m(n) / u(n).

The M sequences w_m(n) are then collected into an array X after subtracting the mean of each band to make sure each row sums to zero. This is why this method is called mean-based clustering.

Fig. 2 Example of the logarithmically banded constant-Q spectrogram

3 Hidden Markov Models

The next step is to construct a song-specific Hidden Markov Model (HMM) to perform the actual segmentation. An HMM is a Markov chain whose state space Ω is hidden (in the case of segmentation, Ω is the various types of segments), but which produces a string of known observations as the chain runs (the audio data retrieved from the spectrogram). Using the string of observations, the HMM can train itself using Baum-Welch training to decide the most probable state sequence for the given observations. Left to its own devices, the HMM will segment a song into clusters like this:
Fig. 3 Segmentation by clustering

Each different colored rectangle in Figure 3 is a segment. While the song is certainly starting to take shape, it exhibits over-segmentation because the HMM is not trained to cluster on a large time scale. Given the sequence {a,a,a,a,b,b,c,d,c,d,c,d,c,d}, the HMM will segment it as (4a, 2b, c, d, c, d, c, d, c, d). It is clear, though, that the repeated c,d section is probably a single 4(cd) segment rather than eight distinct segments. Therefore, we must find a way to reduce the amount of fragmentation.

4 Temporal Coherence

A common tool in Bayesian statistics is the prior distribution. A prior can be decided arbitrarily or through an informed process. Our prior for segment duration needs to favor longer segments so as to avoid over-segmentation (Fig. 3), a method first described by Abdallah, Rhodes, Sandler, and Casey in 2005. This can be modeled fairly well by an inverse-gamma discrete-time duration model

    p_D(d) = IG(τ⁻¹ f d; γ) / Σ_{l=1}^{L} IG(τ⁻¹ f l; γ),

where f is the frame rate, γ is a shape parameter, and τ is a scale factor representing the most likely segment length, which was determined to be 20 seconds in a popular song.
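As a rough numerical illustration of the duration prior above, the following sketch evaluates an unnormalised inverse-gamma density on a grid of candidate durations and normalises it. The shape parameter value and grid length here are assumptions chosen for demonstration, not the paper's exact settings:

```python
import numpy as np

# Illustrative sketch of the inverse-gamma duration prior. The values of
# gamma and max_len are assumptions for demonstration purposes only.
def duration_prior(tau=20.0, gamma=2.0, max_len=120):
    """Discrete prior over segment durations 1..max_len seconds."""
    d = np.arange(1.0, max_len + 1.0)
    scale = tau * (gamma + 1.0)  # places the mode of the density at tau
    dens = d ** (-(gamma + 1.0)) * np.exp(-scale / d)  # unnormalised IG density
    return d, dens / dens.sum()

d, p = duration_prior()
print(int(d[np.argmax(p)]))  # most likely duration: 20 seconds
```

With τ = 20 the prior peaks at 20-second segments, matching Fig. 4, while still assigning mass to longer durations so that large blocks are not ruled out.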
Fig. 4 The resulting prior distribution, τ = 20

Now, using a Gibbs sampler informed by the durational prior, we can update and expand the segments in Fig. 3. To do this, we will use a Wolff-Gibbs algorithm with block updates. This algorithm can simulate Ising, Potts, and XY systems while avoiding the critical slowing down of a normal Gibbs sampler. We will converge upon the results we are seeking with the following steps:

1.) Choose a seed site, based on the temperature T and the current configuration, so that the proposed step will be accepted with probability 1.
2.) Expand the block left and right into a band of contiguous sites.
3.) Stop when a boundary is reached, or with probability α after each step (one frame of the audio), based on the duration prior.

5 Results

After running the algorithm, the resulting segmentation proved to be much more accurate.

Fig. 5 (a) The final machine segmentation; (b) the ground-truth annotation, as determined by expert listeners
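Steps 2 and 3 of the block update can be sketched as follows. This is an illustrative simplification, assuming segment labels are stored per audio frame and a fixed stop probability α; it is not the authors' actual sampler, and the seed-selection step is taken as given:

```python
import random

# Sketch of block growth (steps 2 and 3 above); seed selection (step 1)
# is assumed to have already happened.
def grow_block(labels, seed, alpha, rng=random.random):
    """Grow a contiguous block around `seed`.

    labels: per-frame segment labels; alpha: per-frame stop probability,
    informed by the duration prior. Returns inclusive (left, right) bounds.
    """
    left = right = seed
    # Step 2: expand leftwards through frames sharing the seed's label.
    while left > 0 and labels[left - 1] == labels[seed]:
        if rng() < alpha:  # Step 3: stochastic stop after each frame
            break
        left -= 1
    # Expand rightwards until a segment boundary or a stochastic stop.
    while right < len(labels) - 1 and labels[right + 1] == labels[seed]:
        if rng() < alpha:
            break
        right += 1
    return left, right

print(grow_block(list("aaaabbccccdd"), seed=6, alpha=0.0))  # (6, 9)
```

With α = 0 the block always expands to the existing segment boundaries; larger α, driven by the duration prior, truncates the block sooner.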
We compare the experimental results to the ground truth by computing the overlap between the segments in the two graphs. This is known as the Directional Hamming Distance (DHD). For every section S_i^M in the machine segmentation, find the section of maximum overlap S_k^G in the ground-truth segmentation, and sum over every section the frames that fall outside that maximal overlap:

    d_GM = Σ_{S_i^M} Σ_{S_j^G ≠ S_k^G} |S_i^M ∩ S_j^G|.

We can use the DHD from ground truth to machine and from machine to ground truth to come up with metrics for precision P and recall R. The results of the experiment for many songs looked like this:

Fig. 6 Success of the temporal coherence model, with the better results tending towards (1, 1)
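A minimal sketch of the directional Hamming distance, assuming frame-level segment labels for both segmentations (a representational choice made here for simplicity; the paper works with segment boundaries):

```python
from collections import Counter

def directional_hamming(machine, truth):
    """Count frames of each machine segment that fall outside its
    maximally overlapping ground-truth segment."""
    assert len(machine) == len(truth)
    dist = 0
    for seg in set(machine):
        # ground-truth labels of the frames covered by this machine segment
        overlap = Counter(t for m, t in zip(machine, truth) if m == seg)
        # everything outside the best-matching truth segment counts as error
        dist += sum(overlap.values()) - max(overlap.values())
    return dist

m = [0, 0, 0, 0, 1, 1, 1, 1]  # machine segmentation
g = [0, 0, 0, 1, 1, 1, 2, 2]  # ground-truth segmentation
print(directional_hamming(m, g))  # 3 mismatched frames
```

One way to obtain the precision- and recall-style scores mentioned above is to normalise each directional distance by the total number of frames and subtract from one, computing the distance in both directions.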
6 Conclusions and Implications

While Abdallah et al. have found a way to segment popular music with great accuracy, the real challenge and academic interest is in the analysis of classical music. Consider a database of hundreds of recordings of Grieg's First Piano Concerto. If we could machine-segment those recordings, we could actually quantify the significance of particular recordings and performers in setting tempo and expression standards. This tool could be of great use to musicologists. One could also track trends over the history of music. For instance, we could track average segment length from classical music up to the music of today and test whether attention spans for musical ideas have diminished over the past 100 years. To make these applications possible, it will be necessary to develop a different model for segmentation. Temporal coherence won't suffice because of the great variation in durational expectations in classical music. It will become necessary to better hone the durational prior, but there may be too many variables in classical music for this to be possible.

References

[1] Abdallah, S. (2006). "Using duration models to reduce fragmentation in audio segmentation." Machine Learning (0885-6125), 65(2-3), p. 485.
[2] Aucouturier, J.-J. (2005). ""The way it Sounds": timbre models for analysis and retrieval of music signals." IEEE Transactions on Multimedia (1520-9210), 7(6), p. 1028.
[3] Häggström, Olle. Finite Markov Chains and Algorithmic Applications. Cambridge: Cambridge UP, 2008. Print.
[4] Rhodes, C. (2006). "A Markov-Chain Monte-Carlo Approach to Musical Audio Segmentation." Acoustics, Speech, and Signal Processing (ICASSP), International Conference on (1520-6149), 5, p. V.