A probabilistic framework for audio-based tonal key and chord recognition


Benoit Catteau (1), Jean-Pierre Martens (1), and Marc Leman (2)

(1) ELIS - Electronics & Information Systems, Ghent University, Gent (Belgium), Benoit.Catteau@elis.UGent.be
(2) IPEM - Department of Musicology, Ghent University, Gent (Belgium), Marc.Leman@UGent.be

Abstract. A unified probabilistic framework for audio-based chord and tonal key recognition is described and evaluated. The proposed framework embodies an acoustic observation likelihood model and key & chord transition models. It is shown how to conceive these models and how to use music theory to link key/chord transition probabilities to perceptual similarities between keys/chords. The advantage of a theory-based model is that it does not require any training, and consequently, that its performance is not affected by the quality of the available training data.

1 Introduction

Tonal key and chord recognition from audio are important steps towards the construction of a mid-level representation of Western tonal music for, e.g., Music Information Retrieval (MIR) applications.

A straightforward approach to key recognition (e.g. Pauws (2004), Leman (2000)) is to represent the acoustic observations and the keys by chroma vectors and chroma profiles respectively, and to use an ad hoc distance measure to assess how well the observations match a suggested key profile. Well-known profiles are the Krumhansl and Kessler (1982) and Temperley (1999) profiles, and a popular distance measure is the cosine distance.

The classical approach to chord recognition is one of key detection before chord recognition. Recently however, Shenoy and Wang (2005) proposed a 3-step algorithm performing chord detection first, key detection next, and finally chord enhancement on the basis of high-level knowledge and key information.

Our point of departure is that tonal key and chord recognition should preferably be accomplished simultaneously on the basis of a unified probabilistic framework. We will propose a segment-based framework that extends e.g. the frame-based HMM framework for chord detection proposed by Bello and Pickens (2005).
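As a concrete illustration of this baseline (not of the framework proposed below), matching a chroma vector against the 24 rotated key profiles under the cosine measure takes only a few lines. The sketch uses the commonly cited Krumhansl-Kessler profile values:

```python
import numpy as np

# Krumhansl-Kessler probe-tone profiles (commonly cited values),
# indexed from the tonic upward in semitones.
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_key(chroma):
    """Return the key (tonic 0..11, mode) whose rotated profile is
    closest to the observed chroma vector under the cosine measure."""
    scores = {}
    for tonic in range(12):
        scores[(tonic, 'major')] = cosine(chroma, np.roll(KK_MAJOR, tonic))
        scores[(tonic, 'minor')] = cosine(chroma, np.roll(KK_MINOR, tonic))
    return max(scores, key=scores.get)
```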

In the subsequent sections we provide a general outline (Section 2) and a detailed description (Sections 3 and 4) of our approach, as well as an experimental evaluation (Section 5) of our current implementation of this approach.

2 General outline of the approach

Before introducing our probabilistic framework, we want to recall some basics about the links between notes, chords and keys in Western tonal music. The pitch of a periodic sound is usually mapped to a pitch class (a chroma) collecting all the pitches that are in an octave relation to each other. Chromas are represented on a log-frequency scale of one octave long, and this chromatic scale is divided into 12 equal intervals, the borders of which are labeled as notes: A, As, B, .., Gs. A tonal key is represented by 7 eligible notes selected from the set of 12. Characteristics of a key are its tonic (the note with the lowest chroma) and the mode (major, minor harmonic, ..) that was used to select the 7 notes starting from the tonic. A chord refers to a stack of three (= triad) or more notes sounding together during some time. It can be represented by a 12-bit binary chromatic vector with ones on the chord note positions and zeroes on the remaining positions. This vector leads to a unique chord label as soon as the key is available.

Having explained the links between keys, chords and notes, we can now present our probabilistic framework. We suppose that an acoustic front-end has converted the audio into a sequence of N events which are presumed to represent individual chords. Each event is characterized by an acoustic observation vector x_n, and the whole observation sequence is denoted as X = {x_1, .., x_N}. The aim is now to assign key labels k_n and chord chroma vectors c_n to these events. More precisely, we seek the sequence pair (\hat{K}, \hat{C}) that maximizes the posterior probability P(K, C | X). By applying Bayes' law, and by noting that P(X) is independent of (K, C), one comes to the conclusion that the problem can also be formulated as

    \hat{K}, \hat{C} = \arg\max_{K,C} P(K, C, X) = \arg\max_{K,C} P(K, C) \, P(X | K, C)    (1)

By sticking to two key modes, namely the major and the minor harmonic mode, and by only examining the 4 most important triads (major, minor, augmented and diminished) per tonic, we achieve that only 48 chord vectors and 24 keys have to be tested per event. We now assume that the acoustic likelihood P(X | K, C) can be factorized over events as

    P(X | K, C) = \prod_{n=1}^{N} P(x_n | k_n, c_n)    (2)
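To make the 48-chord search space concrete, the candidate chords can be enumerated as binary chromatic vectors, one per (tonic, triad type) pair. The interval patterns below are standard triad spellings, not values taken from the paper:

```python
import numpy as np

# Semitone patterns of the four triad types considered per tonic.
TRIAD_INTERVALS = {
    'major':      (0, 4, 7),
    'minor':      (0, 3, 7),
    'augmented':  (0, 4, 8),
    'diminished': (0, 3, 6),
}

def chord_templates():
    """Enumerate all 48 binary chromatic vectors (12 tonics x 4 triads)."""
    templates = {}
    for tonic in range(12):
        for name, pattern in TRIAD_INTERVALS.items():
            vec = np.zeros(12, dtype=int)
            for interval in pattern:
                vec[(tonic + interval) % 12] = 1   # chord note positions
            templates[(tonic, name)] = vec
    return templates
```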

We further assume that P(K, C) can be modeled by the bigram music model

    P(K, C) = \prod_{n=1}^{N} P(k_n, c_n | k_{n-1}, c_{n-1})    (3)

Under these two assumptions it is straightforward to show that the problem can be reformulated as

    \hat{K}, \hat{C} = \arg\max_{K,C} \prod_{n=1}^{N} P(k_n | k_{n-1}, c_{n-1}) \, P(c_n | k_{n-1}, c_{n-1}, k_n) \, P(x_n | k_n, c_n)    (4)

The solution can be found by means of a Dynamic Programming search. In the subsequent two sections we describe the front-end that was used to construct the acoustic observations and the models that were developed to compute the probabilities involved.

3 The acoustic front-end

The objective of the acoustic front-end is to segment the audio into chord and rest intervals and to create a chroma vector for each chord interval.

3.1 Frame-by-frame analysis

The front-end first performs a frame-by-frame short-time power spectrum (STPS) analysis. The frames are 150 ms long and two subsequent frames overlap by 130 ms. The frames are Hamming windowed and the STPS is computed in 1024 points equidistantly spaced on a linear frequency scale. The STPS is then mapped to a log-frequency spectrum comprising 84 samples: 7 octaves (between the MIDI notes C1 and C8) and 12 samples per octave. By convolving this spectrum with a Hamming window of one octave wide, one obtains a so-called background spectrum. Subtracting this background from the original spectrum leads to an enhanced log-frequency spectrum. By means of sub-harmonic summation (Terhardt et al. (1982)), the latter is converted to a sub-harmonic sum spectrum T(i), i = 0, .., 83, which is finally folded into one octave to yield the components of the chroma vector x of the analyzed frame:

    x_m = \sum_{j=0}^{6} T(12j + m),    m = 0, .., 11    (5)

3.2 Segmentation

The chroma vectors of the individual frames are used to perform a segmentation of the audio signal. A frame can either be appended to a previously started event or it can be assigned to a new event. The latter happens if the absolute value of the correlation between consecutive chroma vectors drops below a certain threshold.

On the basis of its mean frame energy, each event is labeled as a chord or a rest, and for each chord a chroma vector is computed by first taking the mean chroma vector over its frames and by then normalizing this mean vector so that its elements sum up to 1.
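A compact sketch of the front-end's final steps, assuming the 84-bin sub-harmonic sum spectrum T is already computed per frame; the correlation threshold is illustrative, since the paper does not state its value:

```python
import numpy as np

def fold_to_chroma(T):
    """Fold an 84-bin sub-harmonic sum spectrum (7 octaves x 12 bins)
    into a 12-dimensional chroma vector, as in Equation (5)."""
    return T.reshape(7, 12).sum(axis=0)

def segment(frames, threshold=0.9):
    """Group consecutive frame chroma vectors into events. A new event
    starts when |correlation| between consecutive frames drops below
    the threshold (illustrative value; not specified in the paper)."""
    events, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        r = abs(np.corrcoef(prev, cur)[0, 1])
        if r < threshold:
            events.append(current)
            current = []
        current.append(cur)
    events.append(current)
    return events

def event_chroma(event_frames):
    """Mean chroma over the event's frames, normalized to sum to 1."""
    m = np.mean(event_frames, axis=0)
    return m / m.sum()
```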

4 Modeling the probabilities

For solving Equation (4), one needs good models for the observation likelihoods P(x_n | k_n, c_n), the key transition probabilities P(k_n | k_{n-1}, c_{n-1}) and the chord transition probabilities P(c_n | k_{n-1}, c_{n-1}, k_n).

4.1 Modeling the observation likelihoods

The observation likelihood expresses how well the observations support a proposed chord hypothesis. Although the vector components sum up to one, we assume only weak dependencies among them and propose to use the following model:

    P(x_n | k_n, c_n) = \prod_{m=0}^{11} P(x_{nm} | c_{nm}),    \sum_{m=0}^{11} x_{nm} = 1    (6)

In its most simple form this model requires two statistical distributions: P(x | c = 1) and P(x | c = 0) (x and c denote individual note components here). We have chosen

    P(x | 0) = G_o \, (e^{-x^2 / 2\sigma^2} + P_o),    x \in (0, 1)    (7)
    P(x | 1) = G_1 \, (e^{-(x - X)^2 / 2\sigma^2} + P_o),    x \in (0, X)    (8)
             = G_1 \, (1 + P_o),    x \in (X, 1)    (9)

(see Figure 1) with G_o and G_1 being normalization factors. The offset P_o must preserve some evidence in case an expected large x_{nm} is missing or an unexpected large x_{nm} (e.g. caused by an odd harmonic of the pitch) is present. In our experiments X and \sigma were kept fixed at 0.33 and 0.13 respectively (these values seem to explain the observation statistics).

[Fig. 1. Distributions (without normalization factors) used to model the observation likelihoods of x given that the note chroma vector contains a c = 1 or a c = 0.]
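The likelihood model can be sketched directly from Equations (6)-(9). The normalization factors G_o and G_1 are obtained numerically here, which is one plausible reading; the paper does not spell out their computation:

```python
import numpy as np

X_PEAK, SIGMA, P_O = 0.33, 0.13, 0.1   # values used in the paper (P_o tuned)

def p_given_0(x):
    """Unnormalized P(x | c = 0), Equation (7)."""
    x = np.asarray(x, dtype=float)
    return np.exp(-x**2 / (2 * SIGMA**2)) + P_O

def p_given_1(x):
    """Unnormalized P(x | c = 1), Equations (8)-(9): a half-Gaussian
    rising towards X, flat above X."""
    x = np.asarray(x, dtype=float)
    return np.where(x < X_PEAK,
                    np.exp(-(x - X_PEAK)**2 / (2 * SIGMA**2)) + P_O,
                    1.0 + P_O)

# Normalize numerically so each density integrates to 1 over (0, 1).
_grid = np.linspace(0.0, 1.0, 10001)
G0 = 1.0 / np.trapz(p_given_0(_grid), _grid)
G1 = 1.0 / np.trapz(p_given_1(_grid), _grid)

def log_likelihood(chroma, template):
    """log P(x_n | k_n, c_n) for one event, Equation (6): a product of
    per-component densities selected by the binary chord template."""
    dens = np.where(template == 1, G1 * p_given_1(chroma), G0 * p_given_0(chroma))
    return float(np.sum(np.log(dens)))
```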

4.2 Modeling the key transition probabilities

Normally it would take a large chord and key annotated music corpus to determine appropriate key and chord transition probabilities. However, we argue that (1) transitions between similar keys/chords are more likely to occur than transitions between less similar keys/chords, and (2) chords comprising the key tonic or fifth are more likely to appear than others. We therefore propose to retrieve the requested probabilities from music theory and to avoid the need for a labeled training database.

Lerdahl (2001) has proposed a three-dimensional representation of the tonal space and a scheme for quantifying the perceptual differences between chords as well as keys. Lerdahl distinguishes five note levels, namely the chromatic, diatonic, triadic, fifth and tonic levels, and he accumulates the differences observed at all these levels in a distance metric. If we can assume that in the case of a key modulation the probability of k_n is dominated by the distance d(k_n, k_{n-1}) emerging from Lerdahl's theory, then we can propose the following model:

    P(k_n | k_{n-1}, c_{n-1}) = P_{os},    k_n = k_{n-1}    (10)
                              = \beta_s \, e^{-d(k_n, k_{n-1}) / d_s},    k_n \neq k_{n-1}    (11)

with \beta_s being a normalization factor and d_s = 15 the mean distance between keys. By changing P_{os} we can control the chance of hypothesizing a key modulation.

4.3 Modeling the chord transition probabilities

For computing these probabilities we rely on the distances between diatonic chords (= chords solely composed of notes that fit into the key) as they follow from Lerdahl's theory, and on the tonicity of the chord. Reserving some probability mass for transitions to non-diatonic chords, we obtain the model

    P(c_n | c_{n-1}, k_n, k_{n-1}) = P_{oc},    c_n non-diatonic in k_n    (12)
                                   = \beta_c \, e^{-d(c_n, c_{n-1}) / d_c} \, g(c_n, k_n),    c_n diatonic in k_n    (13)

where \beta_c is a normalization factor, d_c = 6 (the mean distance between chord vectors) and g(c_n, k_n) is a factor that favors chords comprising the key tonic (g = 1.5) or fifth (g = 1.25) over others (g = 1). By changing P_{oc} we can control the chance of hypothesizing a non-diatonic chord.
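Both transition models are direct transcriptions of Equations (10)-(13), provided the Lerdahl distances are available. In the sketch below, key_distance, chord_distance, is_diatonic and g_factor are placeholders for those theory-derived quantities, and the normalization factors beta_s and beta_c are passed in rather than derived, since the paper does not detail their computation:

```python
import math

D_S, D_C = 15.0, 6.0      # mean key / chord distances from Lerdahl's space
P_OS, P_OC = 0.4, 0.15    # self-transition / non-diatonic mass, Section 5.1

def key_transition(k_prev, k, key_distance, beta_s=1.0):
    """P(k_n | k_{n-1}), Equations (10)-(11). `key_distance` stands in
    for Lerdahl's key distance d(., .)."""
    if k == k_prev:
        return P_OS
    return beta_s * math.exp(-key_distance(k, k_prev) / D_S)

def chord_transition(c_prev, c, k, chord_distance, is_diatonic,
                     g_factor, beta_c=1.0):
    """P(c_n | c_{n-1}, k_n), Equations (12)-(13). `g_factor` returns
    1.5 for chords comprising the key tonic, 1.25 for the fifth,
    and 1 otherwise."""
    if not is_diatonic(c, k):
        return P_OC
    return beta_c * math.exp(-chord_distance(c, c_prev) / D_C) * g_factor(c, k)
```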

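With the observation and transition models in place, the maximization of Equation (4) is a Viterbi-style Dynamic Programming search over the 24 x 48 (key, chord) states per event. A minimal brute-force sketch, assuming log-domain wrappers around the models above:

```python
import math

def decode(events, keys, chords, log_obs, log_key_tr, log_chord_tr):
    """Viterbi search maximizing Equation (4). `events` holds the event
    chroma vectors; the three callables are log-domain wrappers around
    the observation and transition models."""
    states = [(k, c) for k in keys for c in chords]   # 24 x 48 states
    delta = {s: log_obs(events[0], *s) for s in states}
    backpointers = []
    for x in events[1:]:
        new_delta, pointers = {}, {}
        for k, c in states:
            # Brute-force maximization over all predecessor states.
            best_prev, best = None, -math.inf
            for kp, cp in states:
                score = (delta[(kp, cp)]
                         + log_key_tr(kp, cp, k)        # log P(k_n | k_{n-1}, c_{n-1})
                         + log_chord_tr(cp, kp, k, c))  # log P(c_n | c_{n-1}, k_{n-1}, k_n)
                if score > best:
                    best_prev, best = (kp, cp), score
            new_delta[(k, c)] = best + log_obs(x, k, c)
            pointers[(k, c)] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(delta, key=delta.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))   # one (key, chord) label per event
```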
5 Experimental results

For parameter tuning and system evaluation we have used four databases.

Cadences. A set of 144 files: 3 classical cadences times 24 keys (12 major and 12 minor keys) times 2 synthesis methods (Shepard tones and MIDI-to-wave).

Modulations. A set of 20 files: 10 chord sequences of length 9 (copied from Krumhansl and Kessler (1982)) times 2 synthesis methods. All sequences start in C major or C minor, and on music theoretical grounds a unique key can be assigned to each chord. Eight sequences show a key modulation at position 5; the other two do not, but they explore chords on various degrees.

Real audio. A set of 10 polyphonic audio fragments (60 seconds) from 10 different songs (see Table 1). Each fragment was chord and key labeled.

MIREX. A set of 96 MIDI-to-wave synthesized fragments, compiled as a training database for the systems participating in the MIREX-2005 key detection contest. Each fragment was supplied with one key label. In case of modulation it is supposed to represent the dominant key for that fragment.

       Artist          Title                  Key
    1  CCR             Proud Mary             D Major
    2  CCR             Who'll stop the rain   G Major
    3  CCR             Bad moon rising        D Major
    4  America         Horse with no name     E Minor
    5  Dolly Parton    Jolene                 Cs Minor
    6  Toto Cutugno    L'Italiano             A Minor
    7  Iggy Pop        The passenger          A Minor
    8  Marco Borsato   Dromen zijn bedrog     C Minor
    9  Live            I Alone                Gb Major / Eb Major
    10 Ian McCulloch   Sliding                C Major

Table 1. The test songs and their keys.

5.1 Free parameter tuning

In order to tune the free parameters (P_o, P_{os}, P_{oc}) we worked on all the cadences and modulation sequences and one song from the real audio database. Since P_{os} and P_{oc} were anticipated to be the most critical parameters, we explored them first in combination with P_o = 0.1. There is a reasonably large area in the (P_{os}, P_{oc})-plane where the performances on all the tuning data are good and stable (0.3 < P_{os} < 0.5 and 0 <= P_{oc} < 0.2). We have chosen P_{os} = 0.4 and P_{oc} = 0.15 to get a fair chance of selecting key modulations and non-diatonic chords when present in the audio. For these values we got 100%, 96.7% and 92.1% correct key labels for the cadences, the modulation sequences and the song, respectively. The corresponding correct chord label percentages were 100%, 93.8% and 73.7%. Changing P_o did not cause any further improvement.
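The exploration of the (P_os, P_oc)-plane amounts to a plain grid search. A hedged sketch, where evaluate is a hypothetical helper (not from the paper) that runs the recognizer on the tuning data and returns its label accuracy:

```python
import numpy as np

def grid_search(evaluate, p_o=0.1):
    """Scan the (P_os, P_oc)-plane with P_o fixed, as in Section 5.1.
    `evaluate(p_o, p_os, p_oc)` is assumed to return the accuracy on
    the tuning data (hypothetical helper)."""
    best = (None, -1.0)
    for p_os in np.arange(0.1, 0.7, 0.05):
        for p_oc in np.arange(0.0, 0.3, 0.05):
            acc = evaluate(p_o, p_os, p_oc)
            if acc > best[1]:
                best = ((p_os, p_oc), acc)
    return best   # ((P_os, P_oc), accuracy)
```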

5.2 System evaluation

Real audio. For real audio we have measured the percentages of deleted reference chords (D), inserted chords (I), frames with the correct key label (C_k) and frames with the correct chord label (C_c). We obtained D = 4.3%, I = 82%, C_k = 51.2% and C_c = 75.7%. An illustration of the reference and computed labels for song 1 is shown in Figure 2.

[Fig. 2. Annotated (left) and computed (right) chords (top) and keys (bottom) for song 1. The grey zones refer to major and the black ones to minor labels.]

A first observation is that our system produces a lot of chord insertions. This must be investigated in more detail, but possibly the annotator discarded some of the short chord changes. A second observation is that the key accuracy is rather low. However, a closer analysis showed that more than 60% of the key errors were confusions between a minor key and its relative major. Another 15% were confusions between keys whose tonics differ by a fifth. By applying a weighted error measure as recommended by MIREX (weights of 0.3 for a minor to relative major confusion, 0.5 for a tonic difference of a fifth, and 1 otherwise) we obtain a key accuracy of 75.5%.

Our chord recognition results seem to be very good. Without chord enhancement on the basis of high-level musical knowledge (this knowledge can also be applied to our system outputs), Shenoy and Wang (2005) report a chord accuracy of 48%. Although there are differences in the data set, the assumptions made by the system (e.g. a fixed key) and the evaluation procedure, we believe that the above figure supports our claim that simultaneous chord and key labeling can outperform a cascaded approach.
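The weighted measure can be written down directly. A minimal sketch, assuming keys are (tonic pitch class, mode) pairs and using the error weights quoted above:

```python
def key_error_weight(ref, hyp):
    """Error weight for one frame: 0 if correct, 0.3 for a confusion
    between a minor key and its relative major, 0.5 for tonics a fifth
    apart, 1 otherwise (weights as quoted from MIREX)."""
    (rt, rm), (ht, hm) = ref, hyp
    if ref == hyp:
        return 0.0
    if rm != hm and ((rm == 'minor' and ht == (rt + 3) % 12) or
                     (rm == 'major' and ht == (rt + 9) % 12)):
        return 0.3
    if (ht - rt) % 12 in (5, 7):   # tonics a fifth apart (either direction)
        return 0.5
    return 1.0

def weighted_key_accuracy(refs, hyps):
    """Frame-level key accuracy under the weighted error measure."""
    return 1.0 - sum(map(key_error_weight, refs, hyps)) / len(refs)
```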

MIREX data. Since we did not participate in the MIREX contest, we only had access to the MIREX training set and not to the evaluation set. However, since we did not perform any parameter tuning on this set, we believe that the results of our system on the MIREX training set are representative of those we would be able to attain on the MIREX evaluation set. Using the recommended MIREX evaluation approach we obtained a key accuracy of 83%. The best result reported in the MIREX contest (İzmirli (2005)) was 89.5%. We hope that by further refining our models we will soon be able to bridge the gap with that performance.

6 Summary and conclusion

We have proposed a segment-based probabilistic framework for the simultaneous recognition of chords and keys. The framework incorporates a novel observation likelihood model and key & chord transition models that were not trained but derived from the tonal space theory of Lerdahl. Our system was evaluated on real audio fragments and on MIDI-to-wave synthesized chord sequences (MIREX-2005 contest data). Apparently, real audio is hard to process correctly, but our system nevertheless appears to outperform the advanced chord labeling systems that have recently been developed by others. The key labeling results for the MIREX data are also very good and already close to the best results previously reported for these data.

References

BELLO J.P. and PICKENS J. (2005): A Robust Mid-level Representation for Harmonic Content in Music Signals. In: Procs 6th Int. Conference on Music Information Retrieval (ISMIR 2005), London, 304-311.

İZMIRLI, Ö. (2005): Tonal Similarity from Audio Using a Template Based Attractor Model. In: Procs 6th Int. Conference on Music Information Retrieval (ISMIR 2005), London, 540-545.

KRUMHANSL C. and KESSLER E. (1982): Tracing the Dynamic Changes in Perceived Tonal Organization in a Spatial Representation of Musical Keys. Psychological Review, 89, 334-368.

LEMAN M. (2000): An Auditory Model of the Role of Short-term Memory in Probe-tone Ratings. Music Perception, 17, 435-464.

LERDAHL F. (2001): Tonal Pitch Space. Oxford University Press, New York.

PAUWS S. (2004): Musical Key Extraction from Audio. In: Procs 5th Int. Conference on Music Information Retrieval (ISMIR 2004), Barcelona, 96-99.

SHENOY A. and WANG Y. (2005): Key, Chord, and Rhythm Tracking of Popular Music Recordings. Computer Music Journal, 29(3), 75-86.

TEMPERLEY D. (1999): What's Key for Key? The Krumhansl-Schmuckler Key-Finding Algorithm Reconsidered. Music Perception, 17(1), 65-100.

TERHARDT E., STOLL G. and SEEWANN M. (1982): Algorithm for Extraction of Pitch and Pitch Salience from Complex Tonal Signals. J. Acoust. Soc. Am., 71, 679-688.