A probabilistic framework for audio-based tonal key and chord recognition

Benoit Catteau 1, Jean-Pierre Martens 1, and Marc Leman 2

1 ELIS - Electronics & Information Systems, Ghent University, Gent (Belgium), Benoit.Catteau@elis.UGent.be
2 IPEM - Department of Musicology, Ghent University, Gent (Belgium), Marc.Leman@UGent.be

Abstract. A unified probabilistic framework for audio-based chord and tonal key recognition is described and evaluated. The proposed framework embodies an acoustic observation likelihood model and key & chord transition models. It is shown how to conceive these models and how to use music theory to link key/chord transition probabilities to perceptual similarities between keys/chords. The advantage of a theory-based model is that it does not require any training and, consequently, that its performance is not affected by the quality of the available training data.

1 Introduction

Tonal key and chord recognition from audio are important steps towards the construction of a mid-level representation of Western tonal music for e.g. Music Information Retrieval (MIR) applications. A straightforward approach to key recognition (e.g. Pauws (2004), Leman (2000)) is to represent the acoustic observations and the keys by chroma vectors and chroma profiles respectively, and to use an ad hoc distance measure to assess how well the observations match a suggested key profile. Well-known profiles are the Krumhansl and Kessler (1982) and Temperley (1999) profiles, and a popular distance measure is the cosine distance. The classical approach to chord recognition performs key detection before chord recognition. Recently, however, Shenoy and Wang (2005) proposed a 3-step algorithm that performs chord detection first, then key detection, and finally chord enhancement on the basis of high-level knowledge and key information.
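The profile-matching approach just described can be sketched as follows. The profile values below are the commonly quoted Krumhansl-Kessler major-key probe-tone ratings; treat them and the helper names as an illustrative sketch, not as the exact implementation of the cited systems.

```python
import numpy as np

# Commonly quoted Krumhansl-Kessler probe-tone profile for a major key,
# anchored on the tonic at index 0 (cf. Krumhansl and Kessler (1982)).
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

def cosine_similarity(a, b):
    """Cosine similarity between two chroma vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_major_key(chroma):
    """Return the tonic (0..11) of the major key whose rotated profile
    best matches the observed chroma vector under the cosine measure."""
    scores = [cosine_similarity(chroma, np.roll(KK_MAJOR, t))
              for t in range(12)]
    return int(np.argmax(scores))
```

A full key finder would also score the 12 rotations of a minor-key profile and pick the overall best match.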
Our point of departure is that tonal key and chord recognition should preferably be accomplished simultaneously on the basis of a unified probabilistic framework. We propose a segment-based framework that extends frame-based approaches such as the HMM framework for chord detection of Bello and Pickens (2005).
In the subsequent sections we provide a general outline (Section 2) and a detailed description (Sections 3 and 4) of our approach, as well as an experimental evaluation (Section 5) of our current implementation of this approach.

2 General outline of the approach

Before introducing our probabilistic framework, we recall some basics about the links between notes, chords and keys in Western tonal music. The pitch of a periodic sound is usually mapped to a pitch class (a chroma) collecting all the pitches that are in an octave relation to each other. Chromas are represented on a log-frequency scale of one octave long, and this chromatic scale is divided into 12 equal intervals, the borders of which are labeled as notes: A, As, B, .., Gs. A tonal key is represented by 7 eligible notes selected from the set of 12. Characteristics of a key are its tonic (the note with the lowest chroma) and the mode (major, minor harmonic, ..) that was used to select the 7 notes starting from the tonic. A chord refers to a stack of three (= triad) or more notes sounding together during some time. It can be represented by a 12-bit binary chromatic vector with ones on the chord note positions and zeroes on the remaining positions. This vector leads to a unique chord label as soon as the key is available.

Having explained the links between keys, chords and notes, we can now present our probabilistic framework. We suppose that an acoustic front-end has converted the audio into a sequence of N events which are presumed to represent individual chords. Each event is characterized by an acoustic observation vector x_n, and the whole observation sequence is denoted as X = {x_1, .., x_N}. The aim is now to assign key labels k_n and chord chroma vectors c_n to these events. More precisely, we seek the sequence pair (\hat{K}, \hat{C}) that maximizes the posterior probability P(K, C | X).
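As a small illustration of the chord encoding described above, a triad can be written as a 12-bit binary chromatic vector. The note-to-index mapping with C = 0 is a convention of this sketch (the paper only fixes the chromatic labels A, As, .., Gs).

```python
# Map note names to chromatic positions; C = 0 is a convention of this
# sketch, not something the paper prescribes.
NOTE_INDEX = {'C': 0, 'Cs': 1, 'D': 2, 'Ds': 3, 'E': 4, 'F': 5,
              'Fs': 6, 'G': 7, 'Gs': 8, 'A': 9, 'As': 10, 'B': 11}

def chord_chroma_vector(notes):
    """12-bit binary chromatic vector: ones on the chord note positions,
    zeroes elsewhere."""
    v = [0] * 12
    for name in notes:
        v[NOTE_INDEX[name]] = 1
    return v

# A C-major triad (C, E, G):
c_major = chord_chroma_vector(['C', 'E', 'G'])
```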
By applying Bayes' law, and by noting that the prior probability P(X) is independent of (K, C), one comes to the conclusion that the problem can also be formulated as

\hat{K}, \hat{C} = \arg\max_{K,C} P(K, C, X) = \arg\max_{K,C} P(K, C) \, P(X | K, C)    (1)

By sticking to two key modes, namely the major and minor harmonic mode, and by only examining the 4 most important triads (major, minor, augmented and diminished) per tonic, we achieve that only 48 chord vectors and 24 keys per event have to be tested. If we can then assume that the acoustic likelihood P(X | K, C) can be factorized as

P(X | K, C) = \prod_{n=1}^{N} P(x_n | k_n, c_n)    (2)

and if P(K, C) can be modeled by the following bigram music model
P(K, C) = \prod_{n=1}^{N} P(k_n, c_n | k_{n-1}, c_{n-1})    (3)

then it is straightforward to show that the problem can be reformulated as

\hat{K}, \hat{C} = \arg\max_{K,C} \prod_{n=1}^{N} P(k_n | k_{n-1}, c_{n-1}) \, P(c_n | k_{n-1}, c_{n-1}, k_n) \, P(x_n | k_n, c_n)    (4)

The solution can be found by means of a Dynamic Programming search. In the subsequent two sections we describe the front-end that was used to construct the acoustic observations and the models that were developed to compute the probabilities involved.

3 The acoustic front-end

The objective of the acoustic front-end is to segment the audio into chord and rest intervals and to create a chroma vector for each chord interval.

3.1 Frame-by-frame analysis

The front-end first performs a frame-by-frame short-time power spectrum (STPS) analysis. The frames are 150 ms long and two subsequent frames overlap by 130 ms. The frames are Hamming windowed and the STPS is computed in 1024 points equidistantly spaced on a linear frequency scale. The STPS is then mapped to a log-frequency spectrum comprising 84 samples: 7 octaves (between the MIDI notes C1 and C8) and 12 samples per octave. By convolving this spectrum with a Hamming window of one octave wide, one obtains a so-called background spectrum. Subtracting this from the original spectrum leads to an enhanced log-frequency spectrum. By means of sub-harmonic summation (Terhardt et al. (1982)), the latter is converted to a sub-harmonic sum spectrum T(i), i = 0, .., 83, which is finally folded into one octave to yield the components of the chroma vector x of the analyzed frame:

x_m = \sum_{j=0}^{6} T(12j + m),  m = 0, .., 11    (5)

3.2 Segmentation

The chroma vectors of the individual frames are used to perform a segmentation of the audio signal. A frame can either be appended to a previously started event or it can be assigned to a new event.
The latter happens if the absolute value of the correlation between consecutive chroma vectors drops below a certain threshold.
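A minimal sketch of this segmentation rule follows; the threshold value is an assumption of the sketch, since the paper does not state the one it uses.

```python
import numpy as np

def segment_frames(chroma_frames, threshold=0.8):
    """Group consecutive frames into events. A new event is started
    whenever the absolute value of the correlation between consecutive
    chroma vectors drops below the threshold. The value 0.8 is
    illustrative, not taken from the paper."""
    events = [[0]]  # first frame opens the first event
    for n in range(1, len(chroma_frames)):
        r = np.corrcoef(chroma_frames[n - 1], chroma_frames[n])[0, 1]
        if abs(r) < threshold:
            events.append([])  # correlation dropped: start a new event
        events[-1].append(n)
    return events
```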
On the basis of its mean frame energy, each event is labeled as chord or rest. For each chord, a chroma vector is computed by first taking the mean chroma vector over its frames, and by then normalizing this mean vector so that its elements sum up to 1.

4 Modeling the probabilities

For solving Equation 4, one needs good models for the observation likelihoods P(x_n | k_n, c_n), the key transition probabilities P(k_n | k_{n-1}, c_{n-1}) and the chord transition probabilities P(c_n | k_{n-1}, c_{n-1}, k_n).

4.1 Modeling the observation likelihoods

The observation likelihood expresses how well the observations support a proposed chord hypothesis. Although the vector components sum up to one, we assume that the dependencies among them are weak and propose to use the following model:

P(x_n | k_n, c_n) = \prod_{m=0}^{11} P(x_{nm} | c_{nm}),  \sum_{m=0}^{11} x_{nm} = 1    (6)

In its most simple form this model requires two statistical distributions: P(x | c = 1) and P(x | c = 0) (x and c denote individual notes here). We have chosen

P(x | 0) = G_0 (e^{-x^2 / 2\sigma^2} + P_0),  x \in (0, 1)    (7)
P(x | 1) = G_1 (e^{-(x - X)^2 / 2\sigma^2} + P_0),  x \in (0, X)    (8)
         = G_1 (1 + P_0),  x \in (X, 1)    (9)

(see Figure 1), with G_0 and G_1 being normalization factors. The offset P_0 must preserve some evidence in case an expected large x_{nm} is missing or an unexpected large x_{nm} (e.g. caused by an odd harmonic of the pitch) is present. In our experiments X and \sigma were kept fixed to 0.33 and 0.13 respectively (these values seem to explain the observation statistics).

4.2 Modeling the key transition probabilities

Normally it would take a large chord- and key-annotated music corpus to determine appropriate key and chord transition probabilities. However, we argue that (1) transitions between similar keys/chords are more likely to occur than transitions between less similar keys/chords, and (2) chords comprising the key tonic or fifth are more likely to appear than others. We therefore
propose to retrieve the requested probabilities from music theory and to avoid the need for a labeled training database.

[Fig. 1. Distributions (without normalization factors) to model the observation likelihoods of x, given that the note chroma vector contains a c = 1 or a c = 0.]

Lerdahl (2001) has proposed a three-dimensional representation of the tonal space and a scheme for quantizing the perceptual differences between chords as well as keys. Lerdahl distinguishes five note levels, namely the chromatic, diatonic, triadic, fifth and tonic levels, and he accumulates the differences observed at all these levels in a distance metric. If we can assume that in the case of a key modulation the probability of k_n is dominated by the distance d(k_n, k_{n-1}) emerging from Lerdahl's theory, then we can propose the following model:

P(k_n | k_{n-1}, c_{n-1}) = P_{0s},  k_n = k_{n-1}    (10)
                          = \beta_s e^{-d(k_n, k_{n-1}) / d_s},  k_n \neq k_{n-1}    (11)

with \beta_s being a normalization factor and d_s = 15 the mean distance between keys. By changing P_{0s} we can control the chance of hypothesizing a key modulation.

4.3 Modeling the chord transition probabilities

For computing these probabilities we rely on the distances between diatonic chords (= chords solely composed of notes that fit into the key) as they follow from Lerdahl's theory, and on the tonicity of the chord. Reserving some probability mass for transitions to non-diatonic chords, we obtain

P(c_n | c_{n-1}, k_n, k_{n-1}) = P_{0c},  c_n = non-diatonic in k_n    (12)
                               = \beta_c e^{-d(c_n, c_{n-1}) / d_c} g(c_n, k_n),  c_n = diatonic in k_n    (13)

as a model. \beta_c is a normalization factor, d_c = 6 (the mean distance between chord vectors), and g(c_n, k_n) is a factor that favors chords comprising the key tonic (g = 1.5) or fifth (g = 1.25) over others (g = 1). By changing P_{0c} we can control the chance of hypothesizing a non-diatonic chord.
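The probability models of Eqs. (6)-(11) can be sketched as follows, using the parameter values quoted above (X = 0.33, sigma = 0.13, d_s = 15, P_0 = 0.1, P_0s = 0.4). This is an illustrative sketch: the normalization factors G_0, G_1 and beta_s are omitted, and the Lerdahl key distance is a placeholder argument that a full implementation would look up in a precomputed table.

```python
import math

# Parameter values taken from the paper; normalization factors omitted.
X_PAR, SIGMA = 0.33, 0.13
P0 = 0.1               # offset preserving evidence for unexpected notes
D_S = 15.0             # mean Lerdahl distance between keys

def note_likelihood(x, c):
    """Unnormalized P(x | c) per Eqs. (7)-(9), with c = 0 or 1."""
    if c == 0:
        return math.exp(-x * x / (2 * SIGMA ** 2)) + P0
    if x <= X_PAR:
        return math.exp(-(x - X_PAR) ** 2 / (2 * SIGMA ** 2)) + P0
    return 1.0 + P0    # flat part of P(x | 1) for x > X

def observation_likelihood(x_vec, chord_vec):
    """Eq. (6): product of per-note likelihoods over the 12 chromas."""
    p = 1.0
    for x, c in zip(x_vec, chord_vec):
        p *= note_likelihood(x, c)
    return p

def key_transition_prob(d_keys, same_key, p_0s=0.4):
    """Eqs. (10)-(11), without the normalization factor beta_s;
    d_keys stands for the Lerdahl distance d(k_n, k_{n-1})."""
    return p_0s if same_key else math.exp(-d_keys / D_S)
```

These scores, together with the chord transition model of Eqs. (12)-(13), are what the Dynamic Programming search of Eq. (4) would maximize over all (key, chord) sequences.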
5 Experimental results

For parameter tuning and system evaluation we have used four databases.

Cadences. A set of 144 files: 3 classical cadences times 24 keys (12 major and 12 minor keys) times 2 synthesis methods (Shepard tones and MIDI-to-wave).

Modulations. A set of 20 files: 10 chord sequences of length 9 (copied from Krumhansl and Kessler (1982)) times 2 synthesis methods. All sequences start in C major or C minor, and on music-theoretical grounds a unique key can be assigned to each chord. Eight sequences show a key modulation at position 5; the other two do not, but they explore chords on various degrees.

Real audio. A set of 10 polyphonic audio fragments (60 seconds) from 10 different songs (see Table 1). Each fragment was chord and key labeled.

MIREX. A set of 96 MIDI-to-wave synthesized fragments, compiled as a training database for the systems participating in the MIREX-2005 key detection contest. Each fragment was supplied with one key label. In case of modulation, it is supposed to represent the dominant key for that fragment.

     Artist          Title                  Key
 1   CCR             Proud Mary             D Major
 2   CCR             Who'll stop the rain   G Major
 3   CCR             Bad moon rising        D Major
 4   America         Horse with no name     E Minor
 5   Dolly Parton    Jolene                 Cs Minor
 6   Toto Cutugno    L'Italiano             A Minor
 7   Iggy Pop        The passenger          A Minor
 8   Marco Borsato   Dromen zijn bedrog     C Minor
 9   Live            I Alone                Gb Major / Eb Major
10   Ian McCulloch   Sliding                C Major

Table 1. The test songs and their keys

5.1 Free parameter tuning

In order to tune the free parameters (P_0, P_{0s}, P_{0c}) we worked on all the cadences and modulation sequences and on one song from the real audio database. Since P_{0s} and P_{0c} were anticipated to be the most critical parameters, we explored them first in combination with P_0 = 0.1. There is a reasonably large area in the (P_{0s}, P_{0c})-plane where the performances on all the tuning data are good and stable (0.3 < P_{0s} < 0.5 and 0 <= P_{0c} < 0.2).
We chose P_{0s} = 0.4 and P_{0c} = 0.15 to get a fair chance of selecting key modulations and non-diatonic chords when present in the audio. For these values we got 100%, 96.7% and 92.1% correct key labels for the cadences, the modulation sequences and the song, respectively. The corresponding correct chord label percentages were 100%, 93.8% and 73.7%. Changing P_0 did not cause any further improvement.
5.2 System evaluation

Real audio. For real audio we have measured the percentages of deleted reference chords (D), inserted chords (I), frames with the correct key label (C_k) and frames with the correct chord label (C_c). We obtained D = 4.3%, I = 82%, C_k = 51.2% and C_c = 75.7%. An illustration of the reference and the computed labels for song 1 is shown in Figure 2.

[Fig. 2. Annotated (left) and computed (right) chords (top) and keys (bottom) for song 1. The grey zones refer to major and the black ones to minor labels.]

A first observation is that our system produces a lot of chord insertions. This must be investigated in more detail, but possibly the annotator discarded some of the short chord changes. A second observation is that the key accuracy is rather low. However, a closer analysis showed that more than 60% of the key errors were confusions between a minor key and its relative major. Another 15% were confusions between keys whose tonics differ by a fifth. By applying a weighted error measure as recommended by MIREX (weights of 0.3 for a minor to relative major confusion, 0.5 for a tonic difference of a fifth, and 1 otherwise) we obtain a key accuracy of 75.5%. Our chord recognition results seem to be very good. Without chord enhancement on the basis of high-level musical knowledge (knowledge that could also be applied to our system outputs), Shenoy and Wang (2005) report a chord accuracy of 48%. Although there are differences in the data set, the assumptions made by the system (e.g.
fixed key) and the evaluation procedure, we believe that the above figure supports our claim that simultaneous chord and key labeling can outperform a cascaded approach.

MIREX data. Since we did not participate in the MIREX contest, we only had access to the MIREX training set and not to the evaluation set. However,
since we did not perform any parameter tuning on this set, we believe that the results of our system on the MIREX training set are representative of those we would attain on the MIREX evaluation set. Using the recommended MIREX evaluation approach we obtained a key accuracy of 83%. The best result reported in the MIREX contest (İzmirli (2005)) was 89.5%. We hope that by further refining our models we will soon be able to bridge the gap with that performance.

6 Summary and conclusion

We have proposed a segment-based probabilistic framework for the simultaneous recognition of chords and keys. The framework incorporates a novel observation likelihood model and key & chord transition models that were not trained but derived from the tonal space theory of Lerdahl. Our system was evaluated on real audio fragments and on MIDI-to-wave synthesized chord sequences (MIREX-2005 contest data). Real audio appears hard to process correctly, but our system nevertheless appears to outperform the advanced chord labeling systems that have recently been developed by others. The key labeling results for the MIREX data are also very good and already close to the best results previously reported for these data.

References

BELLO JP, PICKENS J (2005): A robust mid-level representation for harmonic content in music signals. In Procs 6th Int. Conference on Music Information Retrieval (ISMIR 2005). London, 304-311.
İZMIRLI Ö (2005): Tonal similarity from audio using a template based attractor model. In Procs 6th Int. Conference on Music Information Retrieval (ISMIR 2005). London, 540-545.
KRUMHANSL C, KESSLER E (1982): Tracing the Dynamic Changes in Perceived Tonal Organization in a Spatial Representation of Musical Keys. Psychological Review, 89, 334-368.
LEMAN M (2000): An auditory model of the role of short-term memory in probe-tone ratings. Music Perception, 17, 435-464.
LERDAHL F (2001): Tonal Pitch Space. Oxford University Press, New York.
PAUWS S (2004): Musical Key Extraction from Audio. In Procs 5th Int. Conference on Music Information Retrieval (ISMIR 2004). Barcelona, 96-99.
SHENOY A, WANG Y (2005): Key, Chord, and Rhythm Tracking of Popular Music Recordings. Computer Music Journal, 29(3), 75-86.
TEMPERLEY D (1999): What's Key for Key? The Krumhansl-Schmuckler Key-Finding Algorithm Reconsidered. Music Perception, 17(1), 65-100.
TERHARDT E, STOLL G, SEEWANN M (1982): Algorithm for extraction of pitch and pitch salience from complex tonal signals. J. Acoust. Soc. Am., 71, 679-688.