Chord Recognition with Stacked Denoising Autoencoders


Chord Recognition with Stacked Denoising Autoencoders
Author: Nikolaas Steenbergen
Supervisors: Prof. Dr. Theo Gevers, Dr. John Ashley Burgoyne
A thesis submitted in fulfilment of the requirements for the degree of Master of Science in Artificial Intelligence in the Faculty of Science, July 2014

Abstract

In this thesis I propose two different approaches for chord recognition based on stacked denoising autoencoders working directly on the FFT. These approaches do not use any intermediate targets such as pitch class profiles/chroma vectors or the Tonnetz, in an attempt to remove any restrictions that might be imposed by such an interpretation. It is shown that these systems can significantly outperform a reference system based on state-of-the-art features. The first approach computes chord probabilities directly from an FFT excerpt of the audio data. In the second approach, two additional inputs, filtered with a median filter over different time spans, are added to the input. Hereafter, in both systems, a hidden Markov model is used to perform temporal smoothing after pre-classifying chords. It is shown that using several different temporal resolutions can increase the classification ability in terms of weighted chord symbol recall. All algorithms are tested in depth on the Beatles Isophonics and Billboard datasets, on a restricted chord vocabulary containing major and minor chords and on an extended chord vocabulary containing major, minor, 7th and inverted chord symbols. In addition to presenting the weighted chord symbol recall, a post-hoc Friedman multiple-comparison test for the statistical significance of performance differences is also conducted.

Acknowledgements

I would like to thank Theo Gevers and John Ashley Burgoyne for supervising my thesis. Thanks to Ashley Burgoyne for his thorough and helpful advice and guidance. Thanks to Amogh Gudi for all the fruitful discussions about deep learning techniques while lifting weights and sweating in the gym. Special thanks to my parents, Brigitte and Christiaan Steenbergen, and my brothers Alexander and Florian; without their help, support and love, I would not be where I am now.

Contents

1 Introduction
2 Musical Background
   Notes and Pitch
   Chords
   Other Structures in Music
3 Related Work
   Preprocessing / Features
      PCP / Chroma Vector Calculation
      Minor Pitch Changes
      Percussive Noise Reduction
      Repeating Patterns
      Harmonic / Enhanced Pitch Class Profile
      Modelling Human Loudness Perception
      Tonnetz / Tonal Centroid
   Classification
      Template Approaches
      Data-Driven Higher Context Models
4 Stacked Denoising Autoencoders
   Autoencoders
   Autoencoders and Denoising
   Training Multiple Layers
   Dropout
5 Chord Recognition Systems
   Comparison System
      Basic Pitch Class Profile Features
      Comparison System: Simplified Harmony Progression Analyzer
      Harmonic Percussive Sound Separation
      Tuning and Loudness-Based PCPs
      HMMs
   Stacked Denoising Autoencoders for Chord Recognition
      Preprocessing of Features for Stacked Denoising Autoencoders
      Stacked Denoising Autoencoders for Chord Recognition
      Multi-Resolution Input for Stacked Denoising Autoencoders
6 Results
   Reduction of Chord Vocabulary
   Score Computation: Weighted Chord Symbol Recall
   Training
   Systems Setup
   Significance Testing
   Beatles Dataset
      Restricted Major-Minor Chord Vocabulary
      Extended Chord Vocabulary
   Billboard Dataset
      Restricted Major-Minor Chord Vocabulary
      Extended Chord Vocabulary
   Weights
7 Discussion
   Performance on the Different Datasets
   SDAE
   MR-SDAE
   Weights
   Extensions
8 Conclusion
A Joint Optimization
   A.1 Basic System Outline
   A.2 Gradient of the Hidden Markov Model
   A.3 Adjusting Neural Network Parameters
   A.4 Updating HMM Parameters
   A.5 Neural Network
   A.6 Hidden Markov Model
   A.7 Combined Training
   A.8 Joint Optimization
   A.9 Joint Optimization: Possible Interpretation

List of Figures

1. Piano keyboard and MIDI note range
2. Conventional autoencoder training
3. Denoising autoencoder training
4. Stacked denoising autoencoder training
5. SDAE for chord recognition
6. MR-SDAE for chord recognition
7. Post-hoc multiple-comparison Friedman tests for the Beatles restricted chord vocabulary
8. Whisker plot for the Beatles restricted chord vocabulary
9. Post-hoc multiple-comparison Friedman tests for the Beatles extended chord vocabulary
10. Whisker plot for the Beatles extended chord vocabulary
11. Post-hoc multiple-comparison Friedman tests for the Billboard restricted chord vocabulary
12. Post-hoc multiple-comparison Friedman tests for the Billboard extended chord vocabulary
13. Visualization of weights of the input layer of the SDAE
14. Plot of sum of absolute values for the input layer of the SDAE
15. Absolute training error for joint optimization
16. Classification performance of joint optimization while training

List of Tables

1. Semitone steps and intervals
2. Intervals and chords
3. WCSR for the Beatles restricted chord vocabulary
4. WCSR for the Beatles extended chord vocabulary
5. WCSR for the Billboard restricted chord vocabulary
6. WCSR for the Billboard extended chord vocabulary
7. Results for chord recognition in MIREX

1 Introduction

The increasing amount of digitized music available online has given rise to demand for automatic analysis methods. A new subfield of information retrieval has emerged that concerns itself only with music: music information retrieval (MIR). Music information retrieval spans different subcategories, from analyzing features of a music piece (e.g., beat detection, symbolic melody extraction, and audio tempo estimation) to exploring human input methods (like query by tapping or query by singing/humming) to music clustering and recommendation (like mood detection or cover song identification).

Automatic chord estimation is one of the open challenges in MIR. Chord estimation (or recognition) describes the process of extracting musical chord labels from digitally encoded music pieces. Given an audio file, the specific chord symbol and its temporal position and duration have to be determined automatically. The main evaluation programme for MIR is the annual Music Information Retrieval Evaluation eXchange (MIREX) challenge. It consists of challenges in different sub-tasks of MIR, including chord recognition. Often improving one task can influence the performance in other tasks, e.g., finding a better beat estimate can improve the performance of finding the temporal positions of chord changes, or improve the task of querying by tapping. The same is the case for chord recognition. It can improve the performance of cover song identification, in which, starting from an input song, cover songs are retrieved: chord information is a useful if not vital feature for discrimination. Chord progressions also have an influence on the mood transmitted through music. Thus being able to retrieve the chords used in a music piece accurately could also be helpful for mood categorization, e.g., for personalized Internet radios.

Chord recognition is also valuable in itself. It can aid musicologists as well as hobby and professional musicians in transcribing songs. There is a great demand for chord transcriptions of well-known and also lesser-known songs. This manifests itself in the many Internet pages that hold manual transcriptions of songs, especially for guitar (e.g., Ultimate Guitar, 911Tabs, GuitarTabs). Unfortunately, these mostly contain transcriptions only of the most popular songs, and often several different versions of the same song exist. Furthermore, they are not guaranteed to be correct. Chord recognition is a difficult task which requires a lot of practice even for humans.

2 Musical Background

In this section I give an overview of important musical terms and concepts used later in this thesis. I first describe how musical notes relate to physical sound waves in section 2.1, then how chords relate to notes in section 2.2, and finally other aspects of music that play a role in automatic chord recognition in section 2.3.

2.1 Notes and Pitch

Pitch describes the perceived frequency of a sound. In Western tonality pitches are labelled by the letters A to G. The transcription of a musically relevant pitch and its duration is called a note. Pitches can be ordered by frequency, whereby a pitch is said to be higher if the corresponding frequency is higher. The human auditory system works on a logarithmic scale, which also manifests itself in music: musical pitches are ordered in octaves, repeating the note names, usually denoted in ascending order from C to B: C, D, E, F, G, A, B. We can denote different octave relationships with an additional number added as a subscript to the symbol described previously, so a pitch A0 is one octave lower than the corresponding pitch A1 one octave above. Two pitches one octave apart double in corresponding frequency. Humans typically perceive those two pitches as the same pitch (Shepard, 1964).

In music an octave is split into twelve roughly equal semitones. By definition each of the letters C to B are two semitone steps apart, excepting the steps from E to F and B to C, which are only one semitone apart. To denote the notes that lie in between the named letters, the additional symbols ♯ for a semitone step in the increasing and ♭ for a step in the decreasing frequency direction are used. For example, we can describe the musically relevant pitch between C and D both as C♯ and D♭. Because this system only defines the relationship between pitches, we need a reference frequency. In modern Western tonality the reference frequency of A4 at 440 Hz is standard (Sikora, 2003). In practice slight deviations from this reference tuning may occur, e.g., due to instrument mistuning or similar. This reference pitch thus defines the respective frequencies of the other notes implicitly through the octave and semitone relationships. We may compute the corresponding frequencies for all other notes given a reference pitch with the following equation:

$f_n = 2^{n/12} f_r$,  (1)

where $f_n$ is the frequency of the pitch $n$ semitone steps from the reference pitch with frequency $f_r$.

The human ear can perceive a frequency range of approximately 20 Hz to 20,000 Hz. In practice this frequency range is not fully used in music. For example the MIDI standard, which is more than sufficient for musical purposes in terms of octave range, covers only notes in semitone steps from C-1, corresponding to about 8.17 Hz, to G9, which is approximately 12,543.85 Hz. A standard piano keyboard covers the range from A0 at 27.5 Hz to C8 at approximately 4,186 Hz. Figure 1 depicts a standard piano keyboard in relation to the range of frequencies of MIDI standard notes, with the corresponding physical sound frequencies indicated.
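To make equation (1) concrete, the short sketch below (not part of the original thesis; it assumes the standard reference of A4 = 440 Hz) computes the frequencies of a few pitches from their semitone distance to the reference.

```python
# Minimal sketch of equation (1): f_n = 2^(n/12) * f_r, with A4 = 440 Hz as the reference pitch.
def semitone_to_freq(n, f_ref=440.0):
    """Frequency of the pitch n semitone steps away from the reference pitch."""
    return 2.0 ** (n / 12.0) * f_ref

if __name__ == "__main__":
    for name, n in [("A4", 0), ("A5 (one octave up)", 12), ("C5", 3), ("A3", -12)]:
        print(f"{name}: {semitone_to_freq(n):.2f} Hz")
```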

Figure 1: Piano keyboard and MIDI note range. White keys depict the range of the standard piano for the notes that are described by letters. Black keys deviate a semitone from a note described by a letter. The gray area depicts the extension of the MIDI standard beyond the note range of a piano.

2.2 Chords

For the purpose of this thesis we define a chord as three or more notes played simultaneously. The distance in frequency between two notes is called an interval. In a musical context we can describe an interval as the number of semitone steps two notes are apart (Sikora, 2003). A chord consists of a root note, usually the lowest note in terms of frequency. The interval relationship of the other notes played at the same time defines the chord type. Thus a chord can be defined as a root note and a type. In the following we use the notation <root-note>:<chord-type>, proposed by Harte (2010). We can refer to the notes in musical intervals in order of ascending frequency as: root note, third, fifth, and, if there is a fourth note, seventh. Table 1 shows the intervals for the chords considered in this thesis and the semitone distances for those intervals. The root note and fifth have fixed intervals. For the seventh and third, we differentiate between major and minor intervals, differing by one semitone step.

For this thesis we restrict ourselves to two different chord vocabularies to be recognized, the first one containing only major and minor chord types. Both major and minor chords consist of three notes: the root note, the third and the fifth. The interval between root note and third distinguishes major and minor chord types (see tables 1 and 2): a major chord contains a major third, while the minor chord contains a minor third. We distinguish between twelve root notes for each chord type, for a total of 24 possible chords. Burgoyne et al. (2011) propose a dataset which contains songs from the Billboard charts from the 1950s through the 1990s. This major-minor chord vocabulary accounts for 65% of the chords in that dataset. We can extend this chord vocabulary to take into account 83% of the chord types in the Billboard dataset by including variants of the seventh chords, adding an optional fourth note to a chord. Hereby, in addition to simple major and minor chords, we add 7th, major 7th and minor 7th chord types to our chord-type vocabulary. Major 7th chords and minor 7th chords are essentially major and minor chords, whereby the added fourth note has the interval major seventh and minor seventh respectively.

In addition to different chord types, it is possible to change the frequency order of the notes for different intervals by pulling one note below the root note in terms of frequency. This is called chord inversion. Thus our extended chord vocabulary containing major, minor, 7th, major 7th and minor 7th chords also contains all possible inversions. We can denote this through an additional identifier in our chord syntax: <root-note>:<chord-type>/<inversion-identifier>, where the inversion identifier can be either the 3, 5, or 7 played below the root note. For example, E:maj7/7 would be a major 7th chord consisting of the root note E, a major third, a fifth, and a major seventh, with the major seventh played below the root note in terms of frequency.

    interval         semitone steps
    root note        0
    minor third      3
    major third      4
    fifth            7
    minor seventh    10
    major seventh    11

Table 1: Semitone steps and intervals.

    chord type    intervals
    major         1, 3, 5
    minor         1, ♭3, 5
    7             1, 3, 5, ♭7
    major7        1, 3, 5, 7
    minor7        1, ♭3, 5, ♭7

Table 2: Intervals and chords. The root note is denoted as 1, the third as 3, the fifth as 5 and the seventh as 7; ♭ marks a minor (flattened) interval.

It is possible, however, that in parts of a song no instruments, or only non-harmonic instruments (e.g., percussion), are playing. To be able to interpret this case we define an additional non-chord symbol, adding one symbol to the 24 different chord symbols of the restricted chord vocabulary and leaving us with 25 different symbols. The extended chord vocabulary contains major, minor, 7th, major 7th and minor 7th chord types (depicted in table 2) and all possible inversions. So, for each root note, this leaves us with three different chord symbols each for major and minor, and four different chord symbols each for the seventh chord types, thus 216 different symbols plus an additional non-chord symbol. Furthermore, we assume that chords cannot overlap, although this is not strictly true, for example due to reverb, multiple instruments playing chords, etc. In practice, however, this overlap is negligible and reverb is often not that long. Thus we regard a chord as a continuous entity with a designated start point, end point and chord symbol (either consisting of the root note, chord type and inversion, or a non-chord symbol).

2.3 Other Structures in Music

A musical piece has several other components, some contributing additional harmonic content, for example vocals, which might also carry a linguistically interpretable message. Since a music piece has an overall harmonic structure and an inherent set of music-theoretical harmonic rules, this information also influences the chords played at any time and vice versa, but does not necessarily contribute to the chord played directly. The duration and the start and end point in time of a chord are influenced by rhythmic instruments, such as percussion. These do not contribute to the harmonic content of a music piece but nonetheless are interdependent with the other instruments in terms of timing, and thus with the beginning and end of a chord. These additional harmonic and non-harmonic components occupy the same frequency range as the components that directly contribute to the chord played. From this viewpoint, if we do not explicitly take these additional components into account, we are dealing with the additional task of filtering out the noise they introduce, on top of the task of recognizing the chords themselves.
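As a quick cross-check of the vocabulary sizes given above, the sketch below (illustrative only; the enumeration helpers are my own, with chord labels loosely following the Harte (2010) syntax) builds both vocabularies and counts 25 and 217 symbols respectively.

```python
# Illustrative enumeration of the two chord vocabularies (helper names are hypothetical).
ROOTS = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

# chord type -> chord notes usable as inversion identifiers besides root position
CHORD_TYPES = {
    "maj":  ["3", "5"],
    "min":  ["b3", "5"],
    "7":    ["3", "5", "b7"],
    "maj7": ["3", "5", "7"],
    "min7": ["b3", "5", "b7"],
}

def vocabulary(types):
    symbols = ["N"]                                   # the non-chord symbol
    for root in ROOTS:
        for ctype, inversions in types.items():
            symbols.append(f"{root}:{ctype}")         # root position
            symbols += [f"{root}:{ctype}/{inv}" for inv in inversions]
    return symbols

restricted = vocabulary({"maj": [], "min": []})       # major/minor only, no inversions
extended = vocabulary(CHORD_TYPES)                    # all five types with all inversions
print(len(restricted), len(extended))                 # -> 25 217
```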

3 Related Work

Most musical chord estimation methods can broadly be divided into two subprocesses: preprocessing of features from wave-file data, and higher-level classification of those features into chords. I first describe in section 3.1 the preprocessing steps applied to the raw waveform data, as well as the extensions and refinements of its computation steps that take more properties of waveform music data into account. An overview of higher-level classification, organized by the methods applied, is given in section 3.2. These approaches differ not only in the methods per se, but also in what kind of musical context they take into account for the final classification. More recent methods take more musical context into account and seem to perform better. Since the methods proposed in this thesis are based on machine learning, I have decided to organize the description of other higher-level classification approaches from a technical perspective rather than from a music-theoretical perspective.

3.1 Preprocessing / Features

The most common preprocessing step for feature extraction from waveform data is the computation of so-called pitch class profiles (PCPs), a human-perception-based concept coined by Shepard (1964). He conducted a human perceptual study in which he found that humans are able to perceive notes that are in octave relation as equivalent. A similar representation can be computed from waveform data for chord recognition. A PCP in a music-computational sense is a representation of the frequency spectrum wrapped into one musical octave, thus an aggregated 12-dimensional vector of the energy of the respective input frequencies. This is often called a chroma vector. A sequence of chroma vectors over time is called a chromagram. The terms PCP and chroma vector are used interchangeably in the chord recognition literature. It should be noted, however, that only the physical sound energy is aggregated: this is not purely music-harmonic information. Thus the chromagram may contain additional non-harmonic noise, such as drums, harmonic overtones and transient noise. In the following I give an overview of the basics of calculating the chroma vector and of the different extensions proposed to improve the quality of these features.

3.1.1 PCP / Chroma Vector Calculation

In order to compute a chroma vector, the input signal is broken into frames and converted to the frequency domain, which is most often done through a discrete Fourier transform (DFT), using a window function to reduce spectral leakage. Harris (1978) compares 23 different window functions and finds that the performance depends very much on the properties of the data. Since musical data is not homogeneous, there is no single best-performing windowing function. Different window functions have been used in the literature, and often the specific window function is not stated. Khadkevich and Omologo (2009a) compare the performance impact of using Hanning, Hamming and Blackman windowing functions on musical waveform data applied to the chord estimation domain. They state that the results are very similar for those three types.

However, the Hamming window performed slightly better for window lengths of 1024 and 2048 samples (for the sampling rate used in their study), which are the most common lengths in automatic chord recognition systems today.

To convert from the Fourier domain to a chroma vector, two different methods are used. Wakefield (1999) sums the energies of the frequencies in Fourier space closest to the pitch of a chroma vector bin (and its octave multiples), aggregating the energy in a discrete mapping from the spectral frequency domain to the corresponding chroma vector bin and thus converting the input directly to a chroma vector. Brown (1991) developed the so-called constant-q transform, using a kernel matrix multiplication to convert the DFT spectrogram into logarithmic frequency space. Each bin of the logarithmic frequency representation corresponds to the frequency of a musical note. After conversion into the logarithmic frequency domain, we can then simply sum up the respective bins to obtain the chroma vector representation. For both methods the aggregated sound energy in the chroma vector is usually normalized, either to sum to one or with respect to the maximum energy in a single bin. Both methods lead to similar results and are used in the current literature.

3.1.2 Minor Pitch Changes

In Western tonal music, instruments are tuned to the reference frequency of A4 above middle C (MIDI note 69), whose standard frequency is 440 Hz. In some cases the tuning of the instruments can deviate slightly, usually by less than a quarter-tone from this standard tuning (Mauch, 2010). Most humans are unable to determine an absolute pitch height without a reference pitch. We can hear the mistuning of one instrument with some practice, but it is difficult to determine a slight deviation of all instruments from the usual reference frequency described above. The bins of the chroma vectors are relative to a fixed pitch, thus minor deviations in the input will affect their quality. Minor deviations of the reference pitch can be taken into account by shifting the pitch of the chromagram bins. Several different methods have been proposed. Harte and Sandler (2005) use a chroma vector with 36 bins, 3 per semitone. Computing a histogram of energies with respect to frequency for one chroma vector and for the whole song, and examining the peak positions in the extended chroma vector, enables them to estimate the true tuning and derive a 12-bin chroma vector, under the assumption that the tuning does not deviate during the piece of music. This takes a slightly changed reference frequency into account. Gómez (2006) first restricts the input frequencies to the range from 100 to 5000 Hz to reduce the search space and to remove additional overtone and percussive noise. She uses a weighting function which aggregates spectral peaks not to one, but to several chromagram bins. The spectral energy contributions of these bins are weighted according to a squared cosine distance in frequency. Dressler and Streich (2007) treat minor tuning differences as an angle and use circular statistics to compensate for minor pitch shifts, which was later adapted by Mauch and Dixon (2010b). Minor tuning differences are quite prominent in Western tonal music, and adjusting the chromagram can lead to a performance increase, such that several other systems make use of one of the former methods, e.g., Papadopoulos and Peeters (2007, 2008), Reed et al. (2009), Khadkevich and Omologo (2009a), and Oudre et al. (2009).
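The following sketch illustrates the direct FFT-bin-to-chroma aggregation in the spirit of Wakefield (1999) as described in section 3.1.1 (a simplified illustration; the window, frequency range and normalization chosen here are assumptions, not the settings of any cited system).

```python
# Minimal sketch of mapping the magnitude spectrum of one frame to a 12-bin chroma vector.
import numpy as np

def chroma_from_frame(frame, sr, fmin=55.0, fmax=2000.0, ref=440.0):
    """Aggregate spectral magnitudes into 12 pitch-class bins (bin 0 = pitch class of the reference)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    chroma = np.zeros(12)
    for f, mag in zip(freqs, spectrum):
        if fmin <= f <= fmax:
            # semitone distance from the reference pitch, wrapped into one octave
            pitch_class = int(round(12 * np.log2(f / ref))) % 12
            chroma[pitch_class] += mag
    return chroma / (chroma.sum() + 1e-12)   # L1 normalization

# Usage: a 440 Hz sine should put most energy into the bin of the reference pitch class (A).
sr = 11025
t = np.arange(1024) / sr
print(np.argmax(chroma_from_frame(np.sin(2 * np.pi * 440 * t), sr)))  # -> 0
```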

3.1.3 Percussive Noise Reduction

Music audio often contains noise that cannot directly be used for chord recognition, such as transient or percussive noise. Percussive and transient noise is normally short, in contrast to harmonic components, which are rather stable over time. A simple way to reduce this noise is to smooth subsequent chroma vectors through filtering or averaging. Different filters have been proposed. Some researchers, e.g., Peeters (2006), Khadkevich and Omologo (2009b), and Mauch et al. (2008), use a median filter over time, after tuning and before aggregating the chroma vectors, to remove transient noise. Gómez (2006) uses several different filtering methods and derivatives based on a method developed by Bonada (2000) to detect transient noise, and leaves a window of 50 ms before and after transient noise out of the chroma vector calculation, reducing the input space. Catteau et al. (2007) calculate a background spectrum by convolving the log-frequency spectrum with a Hamming window with a length of one octave, which they subtract from the original chroma vector to reduce noise. Because there are methods to estimate the beat from the audio signal (Ellis, 2007), and chord changes are more likely to appear on these metric positions, several systems aggregate or filter the chromagram only in between those detected beats. Ni et al. (2012) use a so-called harmonic percussive sound separation algorithm described in Ono et al. (2008), which attempts to split the audio signal into percussive and harmonic components. After that they use the median chroma feature vector as the representation for the complete chromagram between two beats. A similar approach is used by Weil et al. (2009), who also use a beat tracking algorithm and average the chromagram between two consecutive beats. Glazyrin and Klepinin (2012) calculate a beat-synchronous smoothed chromagram and propose a modified Prewitt filter, borrowed from edge detection in image recognition, to suppress non-harmonic spectral components.

3.1.4 Repeating Patterns

Musical pieces have a very repetitive structure, e.g., in popular music higher-level structures such as verse and chorus are repeated, and usually those are themselves repetitions of different harmonic (chord) patterns. These structures can be exploited to improve the chromagram by recognizing and averaging or filtering those repetitive parts to remove local deviations. Repetitive parts can also be estimated and used later in the classification step to increase performance. Mauch et al. (2009) first perform a beat estimation and smooth the chroma vectors in a prefiltering step. Then a frame-by-frame similarity matrix is computed from the beat-synchronous chromagram and the song is segmented into an estimate of verse and chorus. This information is used to average the beat-synchronous chromagram. Since beat estimation is a current research topic itself and often does not work perfectly, there might be errors in the beat positions. Cho and Bello (2011) argue that it is advantageous to use recurrence plots with a simple threshold operation to find similarities at the chord level for later averaging, thus leaving out the segmentation of the song into chorus and verse as well as beat detection. Glazyrin and Klepinin (2012) build upon and alter the system of Cho and Bello. They use a normalized self-similarity matrix on the computed chroma vectors, using Euclidean distance as a comparison measure.
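A minimal illustration of the median-filter smoothing of a chromagram over time described in section 3.1.3 is sketched below (the filter length is an arbitrary assumption, not a value taken from the cited papers).

```python
# Sketch of temporal median-filter smoothing of a chromagram to suppress transient noise.
import numpy as np
from scipy.ndimage import median_filter

def smooth_chromagram(chromagram, length=9):
    """chromagram: array of shape (n_frames, 12); median filter along the time axis only."""
    return median_filter(chromagram, size=(length, 1), mode="nearest")

# Usage: a single noisy frame (e.g., a drum hit) is largely removed by the filter.
chroma = np.tile(np.eye(12)[0], (50, 1))   # constant harmonic energy pattern
chroma[25] = np.random.rand(12)            # one transient, noisy frame
print(np.allclose(smooth_chromagram(chroma)[25], chroma[0]))  # -> True
```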

3.1.5 Harmonic / Enhanced Pitch Class Profile

One problem in the computation of PCPs in general is to find an interpretation for overtones (energy at integer multiples of the fundamental frequency), since these might generate energy at frequencies that contribute to chroma vector bins other than the actual notes of the respective chord. For example, the overtones of A4 (440 Hz) are at 880 Hz and 1320 Hz, the latter being close to E6 (MIDI note 88) at approximately 1318.5 Hz. Several different ways to address this have been proposed. In most cases the frequency range that is taken into account is restricted, e.g., to approximately 100 Hz to 5000 Hz (Lee, 2006; Gómez, 2006); most of the harmonic content is contained in this interval. Lee (2006) refines the chroma vector by computing the so-called harmonic product spectrum, in which the product of the energy at octave multiples (up to a certain number) is calculated for each bin. The chromagram is then computed on the basis of this harmonic product spectrum. He states that multiplying the fundamental frequency with its octave multiples can decrease noise on notes that are not contained in the original piece of music. Additionally he finds a reduction of the noise induced by false harmonics compared to conventional chromagram calculation. Gómez (2006) proposes an aggregation function for the computation of the chroma vector in which the energy of the frequency multiples is summed, but first weighted by a decay factor which depends on the multiple. Mauch and Dixon (2010a) use a non-negative least-squares method to find a linear combination of note profiles in a dictionary matrix to compute a log-frequency representation similar to the constant-q transform mentioned earlier.

3.1.6 Modelling Human Loudness Perception

Human loudness perception is not directly proportional to the power or amplitude spectrum (Ni et al., 2012), thus the different representations described above do not model human perception accurately. Ni et al. (2012) describe a method to incorporate this through a log10 scale for the sound power with respect to frequency. Pauws (2004) uses a tangential weighting function to achieve a similar goal for key detection. An improvement in the quality of the resulting chromagram compared to non-loudness-weighted methods is reported.

3.1.7 Tonnetz / Tonal Centroid

Another representation of harmonics is the so-called Tonnetz, which is attributed to Euler in the 18th century. It is a planar representation of musical notes on a 6-dimensional polytope, where pitch relations are mapped onto its vertices. Close musical harmonic relations (e.g., fifths and thirds) have a small Euclidean distance. Harte et al. (2006) describe a way to compute a Tonnetz from a 12-bin chroma vector, and report a performance increase for a harmonic change detection function compared to standard methods. Humphrey et al. (2012) use a convolutional neural network to model a projection function from waveform (FFT) input to a Tonnetz. They perform experiments on the task of chord recognition with a Gaussian mixture model, and report that the Tonnetz output representation outperforms state-of-the-art chroma vectors.

3.2 Classification

The majority of chord recognition systems compute a chromagram using one or a combination of the methods described above. Early approaches use predefined chord templates and compare them with the computed frame-wise chroma features from audio pieces, which are then classified. With the supply of more and more hand-annotated data, more data-driven learning approaches have been developed. The most prominent data-driven model adopted is taken from speech recognition: the hidden Markov model (HMM). Bayesian networks, which are a generalization of HMMs, are also used frequently. Recent approaches propose to take more musical context into account to increase performance, such as local key, bass note, beat and song structure segmentation. Although most chord recognition systems rely on the computation of single chroma vectors, more recent approaches compute two chroma vectors for each frame: a bass and a treble chromagram (differing in frequency range), as it is reasoned that the sequence of bass notes has an important role in the harmonic development of a song and can colour the treble chromagrams due to harmonics.

3.2.1 Template Approaches

The chroma vector, as an estimate of the harmonic content of a frame of a music piece, should contain peaks at the bins that correspond to the chord notes played. Chord template approaches use chroma-vector-like templates. These can be either predefined through expert knowledge, or learned from data. Those templates are then compared, using a fitting function, with the computed chroma vector of each frame. The frame is then classified as the chord symbol corresponding to the best-fitting template. The first research paper explicitly concerned with chord recognition is by Fujishima (1999), which constitutes a non-machine-learning system. Fujishima first computes simple chroma vectors as described above. He then uses predefined 12-dimensional binary chord patterns (1 for notes present in the chord and 0 for notes not present) and computes the inner product with the chroma vector. For real-world chord estimation, the set of chords consists of schemata for triadic harmonic events, and to some extent more complex chords such as sevenths and ninths. Fujishima's system was only used on synthesized sound data, however. Binary chord templates with an enhanced chroma vector using harmonic overtone suppression were used by Lee (2006). Other groups use a more elaborate chromagram with tuning correction (36 bins) for minor pitch changes and a reduced set of chord types to be recognized (Harte and Sandler, 2005; Oudre et al., 2009). Oudre et al. (2011) extend the methods already mentioned by comparing different filtering methods, as described in section 3.1.3, and different measures of fit (Euclidean distance, Kullback-Leibler divergence and Itakura-Saito divergence) to select the most suitable chord template. They also take harmonic overtones of chord notes into account, such that bins in the templates for notes not occurring in the chord do not necessarily have to be zero. Glazyrin and Klepinin (2012) use quasi-binary chord templates, in which the tonic and the 5th are enhanced and the template is normalized afterwards. The templates are compared to smoothed and fine-tuned chroma vectors.
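The following sketch illustrates the binary-template matching idea in the spirit of Fujishima (1999) described above (a simplified illustration restricted to major and minor triads; the scoring and template set are not those of any particular cited system).

```python
# Minimal sketch of binary chord-template matching with an inner product.
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def template(root, intervals):
    """Binary 12-bin template with ones at the chord notes."""
    t = np.zeros(12)
    t[[(root + i) % 12 for i in intervals]] = 1.0
    return t

# Major (0,4,7) and minor (0,3,7) triads for every root: 24 templates.
templates = {f"{PITCH_CLASSES[r]}:{name}": template(r, iv)
             for r in range(12)
             for name, iv in [("maj", (0, 4, 7)), ("min", (0, 3, 7))]}

def classify(chroma):
    """Return the chord label whose template has the largest inner product with the chroma vector."""
    return max(templates, key=lambda label: np.dot(templates[label], chroma))

# Usage: an idealized C-major chroma vector.
print(classify(template(0, (0, 4, 7))))  # -> "C:maj"
```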

Chord templates do not have to be in the form of chroma vectors. They can also be modelled as a Gaussian, or as a mixture of Gaussians as used by Humphrey et al. (2012), in order to get a probabilistic estimate of a chord likelihood. To eliminate short spurious chords that only last a few frames, they use a Viterbi decoder. They do not use the chroma vector for classification, but a Tonnetz as described in section 3.1.7. The transformation function is learned from data by a convolutional neural network. It should be noted that basically all chord template approaches can model chord probabilities that can in turn be used as input for higher-level classification methods or for temporal smoothing such as hidden Markov models, described in section 3.2.2, as shown by Papadopoulos and Peeters (2007).

3.2.2 Data-Driven Higher Context Models

The recent increase in the availability of hand-annotated data for chord recognition has spawned new machine-learning-based methods. In the chord-recognition literature, different approaches have been proposed, from neural networks, to systems adopted from speech recognition, to support vector machines and others. More recent machine learning systems seem to capture more and more of the context of music. In this section I describe the higher-level classification models found in the literature, organized by the machine learning methods used.

Neural Networks. Su and Jeng (2001) try to model the human auditory system with artificial neural networks. They perform a wavelet transform (as an analogy to the ear) and feed the output into a neural network (as an analogy to the cerebrum) for classification. They use a self-organizing map to determine the style of chord and the tonality (C, C#, etc.). The system was tested on classical music to recognize 4 different chord types (major, minor, augmented, and diminished). Zhang and Gerhard (2008) propose a system based on neural networks to detect basic guitar chords and their voicings (inversions) with the help of a voicing vector and a chromagram. The neural network in this case is first trained to identify and output the basic chords; a later post-processing step determines the voicing. Osmalsky et al. (2012) build a database with several different instruments playing single chords individually, part of it recorded in a noisy and part of it in a noise-free environment. They use a feed-forward neural net with a chroma vector as input to classify 10 different chords and experiment with different subsets of their training set.

HMM. Neural networks do not take time dependencies between subsequent inputs into account. In music pieces there is a strong interdependency between subsequent chords, which renders a classification of chords for a whole music piece difficult to model based solely on neural networks. Since template- and neural-net-based approaches do not explicitly take the temporal properties of music into account, a widely adopted method is to use a hidden Markov model, which has proven to be a good tool in the related field of speech recognition. The chroma vector is treated as the observation, which can be modelled by different probability distributions, and the states of the HMM are the chord symbols to be extracted. Sheh and Ellis (2003) pioneered HMMs for real-world chord recognition. They propose that the emission distribution be a single Gaussian with 24 dimensions, trained from data with expectation maximization.

Burgoyne et al. (2007) state that a mixture of Gaussians is more suitable as the emission distribution. They also compare the use of Dirichlet distributions as the emission distribution and conditional random fields as the higher-level classifier. HMMs are used with slightly different chromagram computations and training initialisations according to prior music-theoretic knowledge by Bello and Pickens (2005). Lee (2006) builds upon the systems of Bello and Pickens and of Sheh and Ellis, generates training data from symbolic files (MIDI) and uses an HMM for chord extraction. Papadopoulos and Peeters (2007) compare several different methods of determining the parameters of the HMM and the observation probabilities. They conclude that a template-based approach combined with an HMM with a cognitive-based transition matrix shows the best performance. Later, Papadopoulos and Peeters (2008, 2011) propose an HMM approach focusing on (and extracting) beat estimates to take musical beat addition, beat deletion or changes in meter into account and thereby enhance recognition performance. Ueda et al. (2010) use harmonic percussive sound separation chromagram features and an HMM for classification. Chen et al. (2012) cluster song-level duration histograms to take time duration explicitly into account in a so-called duration-explicit HMM. Ni et al. (2012) present the best-performing system of the 2012 MIREX challenge in chord estimation. It works on the basis of an HMM, bass and treble chroma, and beat and key detection.

Structured SVM. Weller et al. (2009) compare the performance of HMMs and support vector machines (SVMs) for chord recognition and achieve state-of-the-art results using support vector machines.

n-grams. Language and music are closely related. Both spoken language and music rely on audio data. Thus it makes sense to apply spoken-language-recognition approaches to music analysis and chord recognition. A dominant approach for language recognition is the n-gram model. A bigram model (n = 2) is essentially a hidden Markov model, in which one state only depends on the previous one. Cheng et al. (2008) compare 2-, 3-, and 4-grams, thus making one chord dependent on multiple previous chords. They use it for song similarity after a chord recognition step. In their experiments the simple 3- and 4-grams outperform the basic HMM system of Harte and Sandler (2005); they state that n-grams are able to learn the basic rules of chord progressions from hand-annotated data. Scholz et al. (2009) use a 5-gram model and compare different smoothing techniques, finding that modelling more complex chords with 7ths and 9ths should be possible with n-grams. They do not state how the features are computed and interpreted.

Dynamic Bayesian Networks. Musical chords develop meaning in their interplay with other characteristics of a music piece, such as bass note, beat and key: they cannot be viewed as an isolated entity. These interdependencies are difficult to model with a standard HMM approach. Bayesian networks are a generalization of HMMs in which the musical context can be modelled more intuitively. Bayesian networks give the opportunity to model interdependencies simultaneously, creating a more sound model of music pieces from a music-theoretic perspective.

Another advantage of a Bayesian network is that it can directly extract multiple types of information, which may not be a priority for the task of chord recognition, but is an advantage for the extended task of general transcription of music pieces. Cemgil et al. (2006) were among the first to introduce Bayesian networks for music computation. They do not apply their system to chord recognition but to polyphonic music transcription (transcription on a note-by-note basis). They implement a special case of the switching Kalman filter. Mauch (2010) and Mauch and Dixon (2010b) make use of a Bayesian network and incorporate beat detection, bass note and key estimation. The observations of the Bayesian network in their system are treble and bass chromagrams. Dixon et al. (2011) compare a similar system to a logic-based system.

Deep Learning Techniques. Deep learning techniques have beaten the state of the art in several benchmark problems in recent years, although for the task of chord recognition they are a relatively unexplored method. There are three recent publications using deep learning techniques. Humphrey and Bello (2012) call for a change in the conventional approach of using a variation of the chroma vector and a higher-level classifier, since they state that recent improvements seem to bring only diminishing returns. They present a system consisting of a convolutional neural network with several layers, trained to learn a Tonnetz from a constant-q-transformed FFT, which is subsequently classified with a Gaussian mixture model. Boulanger-Lewandowski et al. (2013) make use of deep learning techniques with recurrent neural networks. They use different techniques, including a Viterbi-like algorithm from HMMs and beam search, to take temporal information into account. They report upper-bound results comparable to the state of the art using the Beatles Isophonics dataset (see section 6.5 for a dataset description) for training and testing. Glazyrin (2013) uses stacked denoising autoencoders with a 72-bin constant-q transform input, trained to output chroma vectors. A self-similarity algorithm is applied to the neural network output, which is later classified with a deterministic algorithm, similar to the template approaches mentioned above.

4 Stacked Denoising Autoencoders

In this section I give a description of the theoretical background of the stacked denoising autoencoders used for the two chord recognition systems in this thesis, following Vincent et al. (2010). First a definition of autoencoders and their training method is given in section 4.1, then it is described how this can be extended to form a denoising autoencoder in section 4.2. We can stack denoising autoencoders to train them in an unsupervised manner and possibly obtain a useful higher-level data abstraction by training several layers, which is described in section 4.3.

4.1 Autoencoders

Autoencoders or autoassociators try to find an encoding of given data in the hidden layers. Similar to Vincent et al. (2010) we define the following. We assume a supervised learning scenario with a training set of $n$ tuples of inputs $x$ and targets $t$, $D_n = \{(x_1, t_1), \dots, (x_n, t_n)\}$, where $x \in \mathbb{R}^d$ if the input is real-valued, or $x \in [0, 1]^d$. Our goal is to infer a new, higher-level representation $y$ of $x$. The new representation again is $y \in \mathbb{R}^{d'}$ or $y \in [0, 1]^{d'}$, depending on whether a real-valued or binary representation is assumed.

Encoder. A deterministic mapping $f_\theta$ that transforms the input $x$ into a hidden representation $y$ is called an encoder. It can be described as follows:

$y = f_\theta(x) = s(Wx + b)$,  (2)

where $\theta = \{W, b\}$, $W$ is a $d' \times d$ weight matrix and $b$ an offset (or bias) vector of dimension $d'$. The function $s(x)$ is a non-linear mapping, e.g., a sigmoid activation function $1/(1+e^{-x})$. The output $y$ is called the hidden representation.

Decoder. A deterministic mapping $g_{\theta'}$ that maps the hidden representation $y$ back to input space by constructing a vector $z = g_{\theta'}(y)$ is called a decoder. Typically this is either a linear mapping:

$z = g_{\theta'}(y) = W'y + b'$,  (3)

or a mapping followed by a non-linearity:

$z = g_{\theta'}(y) = s(W'y + b')$,  (4)

where $\theta' = \{W', b'\}$, $W'$ is a $d \times d'$ weight matrix and $b'$ an offset (or bias) vector of dimension $d$. Often the restriction $W' = W^T$ is imposed on the weights. $z$ can be regarded as an approximation of the original input data $x$, reconstructed from the hidden representation $y$.
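As a concrete reading of equations (2)-(4), the sketch below implements the encoder and decoder mappings in numpy with tied weights W' = W^T (the shapes, initialization and sigmoid choice are illustrative assumptions, not the thesis configuration).

```python
# Minimal numpy sketch of the encoder/decoder mappings in equations (2)-(4) with tied weights.
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden = 512, 128                        # input and hidden dimensionality (illustrative)
W = rng.normal(scale=0.01, size=(d_hidden, d))
b = np.zeros(d_hidden)                        # encoder bias
b_prime = np.zeros(d)                         # decoder bias

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x):                                # y = f_theta(x) = s(Wx + b)
    return sigmoid(W @ x + b)

def decode(y):                                # z = g_theta'(y) = s(W'y + b') with W' = W^T
    return sigmoid(W.T @ y + b_prime)

x = rng.random(d)                             # a real-valued input vector
z = decode(encode(x))                         # reconstruction of x
print(z.shape, float(np.mean((x - z) ** 2)))  # reconstruction error before any training
```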

Figure 2: Conventional autoencoder training. A vector x from the training set is projected by f_θ(x) to the hidden representation y, and hereafter projected back to input space using g_θ'(y) to compute z. The loss function L(x, z) is calculated and used as the training objective for minimization.

Training. The idea behind such a model is to obtain a good hidden representation y, from which the decoder is able to reconstruct the original input as closely as possible. It can be shown that finding the optimal parameters for such a model can be viewed as maximizing a lower bound on the mutual information between the input and the hidden representation in the first layer (Vincent et al., 2010). To estimate the parameters we define a loss function. For binary input $x \in [0,1]^d$ this can be the cross-entropy:

$L(x, z) = -\sum_{k=1}^{d} \left[ x_k \log(z_k) + (1 - x_k) \log(1 - z_k) \right]$,  (5)

or, for real-valued input $x \in \mathbb{R}^d$, the squared error objective:

$L(x, z) = \lVert x - z \rVert^2$.  (6)

Since we use real-valued input data, this squared error objective is used as the loss function in this thesis. Given this loss function we want to minimize the average loss (Vincent et al., 2008):

$\theta^*, \theta'^* = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L\big(x^{(i)}, z^{(i)}\big) = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L\Big(x^{(i)}, g_{\theta'}\big(f_\theta(x^{(i)})\big)\Big)$,  (7)

where $\theta^*$ and $\theta'^*$ denote the optimal parameters of the encoding and decoding functions (which might be tied) for which the loss function is minimized, and $n$ is the number of training samples. This minimization can be performed iteratively by backpropagation. Figure 2 visualizes the training procedure for an autoencoder.

If the hidden representation y is of the same dimensionality as the input x, it is trivial to construct a mapping that yields zero reconstruction error: the identity mapping. Obviously this constitutes a problem, since merely learning the identity mapping does not lead to any higher level of abstraction.

To evade this problem a bottleneck is introduced, for example by using fewer nodes for the hidden representation, thus reducing its dimensions. It is also possible to impose a penalty on the network activations to form a bottleneck, and thus train a sparse network. These additional restrictions force the neural network to focus on the most informative parts of the data, leaving out noisy, uninformative parts. Several layers can be trained in a greedy manner to achieve a yet higher level of abstraction.

Enforcing Sparsity. To prevent autoencoders from learning the identity mapping, we can penalize activation. This is described by Hinton (2010) for restricted Boltzmann machines, but can be used for autoencoders as well. The general idea is that nodes which fire very frequently are less informative, i.e., a node that is always active does not add any useful information and could be left out. We can enforce sparsity by adding a penalty term for large average activations over the whole dataset to the backpropagated error. We can compute the average activation of hidden unit $j$ over all training samples with:

$\hat{p}_j = \frac{1}{n} \sum_{i=1}^{n} f_\theta^j(x^{(i)})$.  (8)

In this thesis the following addition to the loss function is used, which is derived from the KL divergence:

$L_p = \beta \sum_{j=1}^{h} \left( p \log \frac{p}{\hat{p}_j} + (1 - p) \log \frac{1 - p}{1 - \hat{p}_j} \right)$,  (9)

where $\hat{p}_j$ is the average activation of hidden unit $j$ over the complete training set, $n$ is the number of training samples, $p$ is a target activation parameter and $\beta$ a penalty weighting parameter, all specified beforehand. The bound $h$ is the number of hidden nodes. For a sigmoidal activation function $p$ is usually set to a value close to zero. A frequent setting for $\beta$ is 0.1. This ensures that units will have a large activation only on a limited number of training samples and otherwise have an activation close to zero. We now simply add this weighted activation error term to the loss $L(x, z)$ described above.
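The sparsity penalty of equations (8) and (9) can be computed as sketched below (a minimal illustration; the target activation p = 0.05 is an assumed placeholder, since only "close to zero" is specified above, while β = 0.1 follows the text).

```python
# Sketch of the sparsity penalty: average hidden activations compared against a target p.
import numpy as np

def sparsity_penalty(hidden_activations, p=0.05, beta=0.1):
    """hidden_activations: array of shape (n_samples, n_hidden) with sigmoid outputs in (0, 1)."""
    p_hat = hidden_activations.mean(axis=0)          # equation (8): average activation per unit
    p_hat = np.clip(p_hat, 1e-8, 1 - 1e-8)           # numerical safety
    kl = p * np.log(p / p_hat) + (1 - p) * np.log((1 - p) / (1 - p_hat))
    return beta * kl.sum()                           # equation (9), added to the reconstruction loss

activations = np.random.default_rng(1).random((1000, 64)) * 0.5
print(sparsity_penalty(activations))                 # large when units are far from the target p
```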

4.2 Autoencoders and Denoising

Vincent et al. (2010) propose another training criterion in addition to the bottleneck. They state that an autoencoder can also be trained to clean a partially corrupted input, which is called denoising. If noisy input is assumed, it can be beneficial to corrupt (parts of) the input of the autoencoder during training and use the uncorrupted input as the target. The autoencoder is hereby encouraged to reconstruct a clean version of the corrupted input. This can make the hidden representation of the input more robust to noise, and can potentially lead to a better higher-level abstraction of the input data. Vincent et al. (2010) state that different types of noise can be considered: masking noise, i.e., setting a random fraction of the input to 0; salt-and-pepper noise, i.e., setting a random fraction of the input to either 0 or 1; and, especially for real-valued input, isotropic additive Gaussian noise, i.e., adding noise drawn from a Gaussian distribution to the input.

To achieve this, we corrupt the initial input $x$ into $\tilde{x}$ according to a stochastic mapping $\tilde{x} \sim q_D(\tilde{x} \mid x)$. This corrupted input is then projected to the hidden representation as described before by means of $y = f_\theta(\tilde{x}) = s(W\tilde{x} + b)$. Then we can reconstruct $z = g_{\theta'}(y)$. The parameters $\theta$ and $\theta'$ are trained to minimize the average reconstruction error between the output $z$ and the uncorrupted input $x$; in contrast to conventional autoencoders, $z$ is now a deterministic function of $\tilde{x}$ rather than of $x$. For our purpose, using additive Gaussian noise, we can train the denoising autoencoder with the squared error loss function $L(x, z) = \lVert x - z \rVert^2$. Parameters can be initialized at random and then optimized by backpropagation. Figure 3 depicts the training of a denoising autoencoder.

Figure 3: A vector x from the training set is corrupted by q_D and converted to the hidden representation y. The loss function L(x, z) is calculated from the output and the uncorrupted input and used for training.
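A minimal sketch of one denoising training step with additive Gaussian corruption and the squared-error loss is given below (single example, tied weights, plain gradient descent; the learning rate and noise level are assumptions, and a real implementation would use minibatches and proper stopping criteria).

```python
# Sketch of a single denoising-autoencoder gradient step with additive Gaussian corruption.
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden, lr, sigma = 256, 64, 0.01, 0.1
W = rng.normal(scale=0.01, size=(d_hidden, d))
b, b_prime = np.zeros(d_hidden), np.zeros(d)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_step(x):
    """One gradient step on the denoising objective for a single example x (shape (d,))."""
    global W, b, b_prime
    x_tilde = x + rng.normal(scale=sigma, size=x.shape)   # corruption: x~ drawn from q_D(x~|x)
    y = sigmoid(W @ x_tilde + b)                          # hidden representation of corrupted input
    z = W.T @ y + b_prime                                 # linear decoder with tied weights W' = W^T
    delta_z = 2 * (z - x)                                 # gradient of ||x - z||^2 w.r.t. z
    delta_y = (W @ delta_z) * y * (1 - y)                 # backpropagate through the sigmoid encoder
    W -= lr * (np.outer(delta_y, x_tilde) + np.outer(y, delta_z))
    b -= lr * delta_y
    b_prime -= lr * delta_z
    return float(np.sum((x - z) ** 2))

x = rng.random(d)
print([round(train_step(x), 3) for _ in range(3)])        # error typically decreases on this example
```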

4.3 Training Multiple Layers

If we want to train deep networks (or initialize their parameters for later supervised backpropagation), we need a way to extend the approach from a single layer, as described in the previous sections, to multiple layers. As described by Vincent et al. (2010), this can easily be achieved by repeating the process for each layer separately. Figure 4 depicts such greedy layer-wise training. First we propagate the input x through the already trained layers; note that we do not use additional corruption noise yet. Next we use the uncorrupted hidden representation of the previous layer as input for the layer we are about to train, and train this specific layer as described in the previous sections. The input to the layer to be trained is first corrupted by $q_D$ and then projected into latent space using $f^{(2)}_\theta$. We then project it back to the input space of that specific layer with $g^{(2)}_{\theta'}$. Using an error function L, we can optimize the projection functions with respect to the defined error, and therefore possibly obtain a useful higher-level representation. This process can be repeated several times to initialize a deep neural network structure, circumventing the usual problems that arise when initializing deep networks at random and then applying backpropagation, as shown in Vincent et al. (2010). Next we can apply a classifier to the output of this deep neural network trained to suppress noise. Alternatively, we can add another layer of hidden nodes for classification purposes on top of the previously unsupervised-trained network structure and apply standard backpropagation to fine-tune the network weights according to our supervised training targets t.

Figure 4: Training of several layers in a greedy unsupervised manner. The input is propagated without corruption. To train an additional layer, the output of the first layer is corrupted by q_D and the weights are adjusted with f^(2)_θ and g^(2)_θ' using the respective loss function. After training for this layer is completed, we can train subsequent layers.

Dropout. Hinton et al. (2012) were able to improve performance on several other recognition tasks, including MNIST for handwritten digit recognition and TIMIT, a database for speech recognition, by randomly omitting a fraction of the hidden nodes from training for each sample. This is in essence training a different model for each training sample, with an iteration on that one training sample only. According to Hinton et al. this prevents the network from overfitting. In the testing phase we make use of the complete network again. Thus what we are effectively doing with dropout is averaging: averaging many models trained on one training sample each. This has yielded improvements in different modelling tasks (Hinton et al., 2012).
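The dropout idea described above can be sketched as follows (illustrative only; the test-time rescaling by the keep probability is a common implementation detail that is not spelled out in the text above).

```python
# Minimal sketch of dropout on one hidden layer: omit random units during training,
# use the full (rescaled) layer at test time.
import numpy as np

rng = np.random.default_rng(0)

def hidden_forward(x, W, b, drop_prob=0.5, training=True):
    h = np.tanh(W @ x + b)
    if training:
        mask = rng.random(h.shape) >= drop_prob   # randomly omit units for this sample
        return h * mask
    return h * (1.0 - drop_prob)                  # test time: all units, scaled by keep probability

W, b = rng.normal(size=(16, 8)), np.zeros(16)
x = rng.random(8)
print(hidden_forward(x, W, b).round(2))           # some units zeroed out during training
print(hidden_forward(x, W, b, training=False).round(2))
```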

5 Chord Recognition Systems

In this section I describe the structure of three different approaches to classifying chords.

1. We first describe the structure of a comparison system: a simplified version of the Harmony Progression Analyzer as proposed by Ni et al. (2012). The features computed can be considered state of the art. We discard, however, additional context information like key, bass and beat tracking, since the neural network approaches developed in this thesis do not take this into account (although it should be noted that in principle the approaches developed in this thesis could be extended to take this additional context information into account as well). The simplified version of the Harmony Progression Analyzer will serve as a reference system for performance comparison.

2. A neural network initialized by stacked denoising autoencoder pretraining, with later backpropagation fine-tuning, can be applied to an excerpt of the FFT to estimate chord probabilities directly, which can then be smoothed with the help of an HMM to take temporal information into account (a minimal decoding sketch is given below). We substitute the emission probabilities with the output of the stacked denoising autoencoders.

3. This approach can be extended by adding filtered versions of the FFT over different time spans to the input. We extend the input to include two additional vectors, median-smoothed over different time spans. Here again additional temporal smoothing is applied in a post-classification process.

In section 5.1 we describe the comparison system and briefly the key ideas incorporated in the computation of state-of-the-art features. Since the two other approaches described in this thesis make use of stacked denoising autoencoders that interpret the FFT directly, we describe beneficial pre-processing steps in section 5.2.1. In section 5.2.2 we describe a stacked denoising autoencoder approach for chord recognition in which the outputs are chord symbol probabilities directly, and in section 5.2.3 we propose an extension of this approach, inspired by a system developed for face recognition and phone recognition by Tang and Mohamed (2012) using a so-called multi-resolution deep belief network, and apply it to chord recognition with stacked denoising autoencoders. Appendix A describes the theoretical foundation of a joint optimization of the HMM and neural network for chord recognition.

5.1 Comparison System

In this section we describe a basic comparison system for the other approaches implemented. It reflects the structure of most current approaches and uses state-of-the-art features for chord recognition. Most recent chord recognition systems rely on an improved computation of the PCP vector and take extra information into account, such as bass notes or key information. This extra information is usually incorporated into a more elaborate higher-level framework, such as multiple HMMs or a Bayesian network.
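The HMM-based temporal smoothing used in approaches 2 and 3 above can be made concrete with a minimal Viterbi decoding sketch over frame-wise chord probabilities (toy transition and emission values; the thesis systems learn these quantities from data).

```python
# Sketch of Viterbi decoding of frame-wise chord probabilities into a smoothed chord sequence.
import numpy as np

def viterbi(frame_probs, transition, initial):
    """frame_probs: (n_frames, n_chords) per-frame chord probabilities used as emission scores."""
    n_frames, n_chords = frame_probs.shape
    log_delta = np.log(initial) + np.log(frame_probs[0])
    backpointers = np.zeros((n_frames, n_chords), dtype=int)
    for t in range(1, n_frames):
        scores = log_delta[:, None] + np.log(transition)   # (from_state, to_state)
        backpointers[t] = np.argmax(scores, axis=0)
        log_delta = scores[backpointers[t], np.arange(n_chords)] + np.log(frame_probs[t])
    path = [int(np.argmax(log_delta))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]

# Toy example with 3 chord symbols: a strong self-transition bias smooths out the spurious frame.
probs = np.array([[.8, .1, .1], [.7, .2, .1], [.2, .6, .2], [.8, .1, .1], [.7, .2, .1]])
A = np.full((3, 3), 0.05) + np.eye(3) * 0.85
print(viterbi(probs, A, initial=np.ones(3) / 3))            # -> [0, 0, 0, 0, 0]
```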

The comparison system consists of the computation of state-of-the-art PCP vectors for all frames, but uses only a single HMM for the later classification and temporal alignment of chords, which allows for a fairer comparison with the stacked denoising autoencoder approaches. The basic computation steps described in the following are used in the approach described by Ni et al. (2012). They split the computation of features into a bass chromagram and a treble chromagram, and track them with two additional HMMs. The computed frames are aligned according to a beat estimate. To make this more elaborate system comparable, we compute only one chromagram containing both bass and treble, use a single HMM for temporal smoothing, and do not align frames according to a beat estimate. We first describe, in section 5.1.1, the very basic PCP features that have been used predominantly in chord recognition for 15 years; hereafter, in section 5.1.2, we describe the extensions of the basic PCP used in the comparison system.

5.1.1 Basic Pitch Class Profile Features

The basic pipeline for computing a pitch class profile as a feature for chord recognition consists of three steps:

1. The signal is projected from the time to the frequency domain through a Fourier transform. Often files are downsampled to 11,025 Hz to allow for faster computation; this is also done in the reference system. The range of frequencies is restricted through filtering, to analyse only frequencies below, e.g., 4000 Hz (about the range of the keyboard of a piano, see figure 1) or similar, since other frequencies carry less information about the chord notes played and introduce more noise to the signal. In the reference system a frequency range starting at approximately 55 Hz is used, as proposed in the original system (Ni et al., 2012).

2. The second step consists of a constant-q transform, which projects the amplitude of the signal in linear frequency space to a logarithmic representation of the signal amplitude, in which each constant-q transform bin represents the spectral energy with respect to the frequency of a musical note.

3. In a third step the bins representing one musical note and its octave multiples are summed, and the resulting vector is sometimes normalized.

In the following we describe the constant-q transform and the computation of the PCP in more detail.

Constant-Q transform. After converting the signal from the time to the frequency domain through a discrete or fast Fourier transform, we can apply an additional transform to make the frequency bins logarithmically spaced. This transform can be viewed as a set of filters in the time domain, which filter a frequency band according to a logarithmic scaling of the center frequencies of the constant-q bins. Originally it was proposed as an additional term in the Fourier transform, but it has been shown by Brown and Puckette (1992) to be computationally more efficient to filter the signal in Fourier space, thus applying the set of filters transformed into Fourier space to the signal also in Fourier space.

Constant-Q transform

After converting the signal from the time to the frequency domain through a discrete or fast Fourier transform, we can apply an additional transform to make the frequency bins logarithmically spaced. This transform can be viewed as a set of filters in the time domain, which filter a frequency band according to a logarithmic scaling of the center frequencies of the constant-Q bins. Originally it was proposed as an additional term in the Fourier transform, but it has been shown by Brown and Puckette (1992) to be computationally more efficient to filter the signal in Fourier space, thus applying the set of filters, transformed into Fourier space, to the signal also in Fourier space. This can be realized with a matrix multiplication. This transformation process to logarithmically spaced bins is called the constant-Q transform (Brown, 1991). The name stems from the factor Q, which describes the relationship between the center frequency of each filter and the filter width:

Q = f_k / Δf_k,

where Q is a so-called quality factor which stays constant, f_k is the center frequency and Δf_k the width of the filter. We can choose the filters such that they filter out the energy contained in musically relevant frequencies (i.e., frequencies corresponding to musical notes):

f_{k_cq} = (2^{1/B})^{k_cq} · f_min,    (10)

where f_min is the frequency of the lowest musical note to be filtered, f_{k_cq} the center frequency corresponding to constant-Q bin k_cq, and B the number of constant-Q frequency bins per octave, usually B = 12 (one bin per semitone). Setting Q = 1 / (2^{1/B} − 1) establishes the link between musically relevant frequencies and the filter widths of our filterbank. Different types of filters can be used to aggregate the energy in relevant frequencies and to reduce spectral leakage. For the comparison system we make use of a Hamming window, as described by Brown and Puckette (1992):

w(n, f_{k_cq}) = 0.54 + 0.46 cos(2πn / M(f_{k_cq})),    (11)

where n = −M(f_{k_cq})/2, ..., M(f_{k_cq})/2 − 1, and M(f_{k_cq}) is the window size, computable from Q, the center frequency f_{k_cq} of constant-Q bin k_cq, and the sampling rate f_s of the input signal (Brown, 1991):

M(f_{k_cq}) = Q · f_s / f_{k_cq}.    (12)

We can now compute the filters and thus the respective sound power in the signal filtered according to a musically relevant set of center frequencies. Instead of applying these filters in the time domain, it is computationally more efficient to do so in the spectral domain, by projecting the window functions into Fourier space first. We can then apply the filters through a matrix multiplication in frequency space. As denoted by Brown and Puckette (1992), for bin k_cq of the constant-Q transform we can write:

X_cq[k_cq] = (1/N) Σ_{k=0}^{N−1} X[k] K[k, k_cq],    (13)

where k_cq describes the constant-Q transform bin, X[k] the signal amplitude at bin k in the Fourier domain, N the number of Fourier bins and K[k, k_cq] the value of the Fourier transform of our filter w(n, f_{k_cq}) for constant-Q bin k_cq at Fourier bin k. Choosing the right minimum frequency and quality factor will result in constant-Q bins corresponding to harmonically relevant frequencies. Having transformed the linearly spaced amplitude per frequency to musically spaced constant-Q transform bins, we can now continue to aggregate notes that are one octave apart, hereby reducing the dimension of the feature vector significantly.
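Before moving on, the constant-Q computation just described can be made concrete with a minimal NumPy sketch. The sampling rate, FFT length and note range below are illustrative assumptions, not necessarily the values of the reference implementation, and the conjugate of the kernel is used in the product, following Brown and Puckette (1992).

```python
import numpy as np

def constant_q_kernel(fs=11025, f_min=55.0, bins_per_octave=12, n_octaves=5, n_fft=8192):
    """Spectral kernel K[k, k_cq]: FFTs of Hamming-windowed complex filters."""
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)         # quality factor
    n_bins = bins_per_octave * n_octaves
    kernel = np.zeros((n_fft, n_bins), dtype=complex)
    for k_cq in range(n_bins):
        f_k = f_min * 2.0 ** (k_cq / bins_per_octave)         # center frequency, eq. (10)
        M = int(round(Q * fs / f_k))                           # window length, eq. (12)
        n = np.arange(M) - M // 2
        window = 0.54 + 0.46 * np.cos(2 * np.pi * n / M)       # Hamming window, eq. (11)
        temporal = np.zeros(n_fft, dtype=complex)
        temporal[:M] = window * np.exp(2j * np.pi * f_k * n / fs) / M
        kernel[:, k_cq] = np.fft.fft(temporal)                 # filter projected into Fourier space
    return kernel

def constant_q(frame_fft, kernel):
    """Constant-Q magnitudes of one FFT frame via a matrix multiplication, eq. (13)."""
    return np.abs(frame_fft @ np.conj(kernel)) / kernel.shape[0]
```

A frame would be used as `constant_q(np.fft.fft(signal_frame, n=8192), kernel)`; precomputing the kernel once is what makes the per-frame cost a single matrix multiplication.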

PCP Aggregation

Shepard's (1964) experiments on the human perception of music suggest that humans can perceive notes one octave apart as belonging to the same group of notes, known as pitch classes. Given these results, we compute pitch class profiles based on the signal energy in the logarithmic spectral space. As described by Lee (2006):

PCP[k] = Σ_{m=0}^{N_cq − 1} X_cq(k + mB),    (14)

where k = 1, 2, ..., B is the index of the PCP bin and N_cq is the number of octaves in the frequency range of the constant-Q transform. Usually B = 12, so that one bin is computed for each musical note in one octave. For pre-processing, e.g., correction of minor tuning differences, B = 24 or B = 36 are also sometimes used. Hereafter the resulting vector is usually normalized, typically with respect to the L1, L2 or L∞ norm.

5.1.2 Comparison System: Simplified Harmony Progression Analyzer

In this section I describe the refinements made to the very basic chromagram computation defined above. The state-of-the-art system proposed by Ni et al. (2012) takes additional context into account. They state that tracking the key and the bass line provides important context and useful additional information for recognizing musical chords. For a more accurate comparison with the stacked denoising autoencoder approaches, which cannot easily take such context into account, we discard the musical key, bass and beat information that is used by Ni et al. We compute the features with the code that is freely available from their website and adjust it to a fixed step size of 1024 samples (approximately 0.09 s per frame at the downsampled rate), instead of a beat-aligned step size. In addition to a so-called harmonic percussive sound separation algorithm, as described by Ono et al. (2008), which attempts to split the signal into a harmonic and a percussive part, Ni et al. implement a loudness-based PCP vector and correct for minor tuning deviations.
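Before turning to those refinements, the plain octave aggregation of equation (14) and the subsequent normalization can be sketched as follows; the bin count and the choice of norm here are assumptions for illustration.

```python
import numpy as np

def pcp_from_cq(cq_magnitudes, bins_per_octave=12, norm="L2"):
    """Fold constant-Q magnitudes into a pitch class profile, eq. (14)."""
    n_octaves = len(cq_magnitudes) // bins_per_octave
    folded = cq_magnitudes[:n_octaves * bins_per_octave].reshape(n_octaves, bins_per_octave)
    pcp = folded.sum(axis=0)                        # sum bins that are whole octaves apart
    if norm == "L1":
        denom = np.sum(np.abs(pcp))
    elif norm == "L2":
        denom = np.sqrt(np.sum(pcp ** 2))
    else:                                           # L-infinity
        denom = np.max(np.abs(pcp))
    return pcp / denom if denom > 0 else pcp
```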

Harmonic Percussive Sound Separation

Ono et al. (2008) describe a method to discriminate between the percussive contribution to the Fourier transform and the harmonic one. This can be achieved by exploiting the fact that percussive sounds most often manifest themselves as bursts of energy spanning a wide range of frequencies but lasting only a limited time, whereas harmonic components span a limited frequency range but are more stable over time. Ono et al. present a way to estimate the percussive and harmonic parts of the signal contribution in Fourier space as an optimization problem which can be solved iteratively. Let F_{h,i} be the short-time Fourier transform of an audio signal f(t) and W_{h,i} = |F_{h,i}|^2 its power spectrogram. We minimize the L2 norm of the power spectrogram gradients, J(H, P), with H_{h,i} the harmonic component and P_{h,i} the percussive component, h the frequency bin and i the time index in Fourier space:

J(H, P) = (1 / 2σ_H^2) Σ_{h,i} (H_{h,i−1} − H_{h,i})^2 + (1 / 2σ_P^2) Σ_{h,i} (P_{h−1,i} − P_{h,i})^2,    (15)

subject to the constraints that

H_{h,i} + P_{h,i} = W_{h,i},    (16)
H_{h,i} ≥ 0,    (17)
P_{h,i} ≥ 0,    (18)

where W_{h,i} is the original power spectrogram, as described above, and σ_H and σ_P are parameters controlling the smoothness of the harmonic component over time and of the percussive component over frequency, respectively. Details of the iterative optimization procedure can be found in the original paper.
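To illustrate the idea (this is not the exact update rule derived by Ono et al., whose procedure is in the original paper), a simplified alternating scheme that encourages H to be smooth along time and P to be smooth along frequency, while keeping H + P = W, might look like this:

```python
import numpy as np

def simple_hpss(W, n_iter=30, kernel=17):
    """Illustrative harmonic/percussive split of a power spectrogram W (freq x time).

    A simplified stand-in for the iterative procedure of Ono et al. (2008):
    H is smoothed along time, P along frequency, and the constraint H + P = W
    is re-enforced after every iteration.
    """
    H = 0.5 * W
    P = 0.5 * W
    pad = kernel // 2
    for _ in range(n_iter):
        # moving-average smoothing of H along the time axis
        Hp = np.pad(H, ((0, 0), (pad, pad)), mode="edge")
        H = np.mean([Hp[:, k:k + W.shape[1]] for k in range(kernel)], axis=0)
        # moving-average smoothing of P along the frequency axis
        Pp = np.pad(P, ((pad, pad), (0, 0)), mode="edge")
        P = np.mean([Pp[k:k + W.shape[0], :] for k in range(kernel)], axis=0)
        # renormalize so that H + P = W while both stay non-negative
        total = H + P + 1e-12
        H = W * H / total
        P = W * P / total
    return H, P
```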

Tuning and Loudness-Based PCPs

Here we describe further refinements of the PCP vector: first how to take minor deviations (less than a semitone) from the reference tuning into account, and then an addition proposed by Ni et al. (2012) to model human loudness perception.

Tuning

To take minor pitch shifts in the tuning of a specific song into account, the features are fine-tuned as described by Harte and Sandler (2005). Instead of computing a 12-bin chromagram directly, we can compute multiple bins for each semitone, as described in section 5.1.1, by setting B > 12 (e.g., B = 36). We can then compute a histogram of sound power peaks with respect to frequency and select a subset of constant-Q bins to compute the PCP vectors, shifting our reference tuning according to the small deviations of a song.

Loudness-Based PCPs

Since human loudness perception is not linear with respect to frequency, Ni et al. (2012) propose a loudness weighting function. First we compute a sound power level matrix:

L_{s,t} = 10 log_{10}( |X_{s,t}|^2 / p_ref ),    s = 1, ..., S,  t = 1, ..., T,    (19)

where p_ref indicates the fundamental reference power and X_{s,t} the constant-Q transform of our input signal as described in the previous section (s denoting the constant-Q transform bin and t the time). They propose to use A-weighting (Talbot-Smith, 2001), in which a frequency-dependent value is added. An approximation to the human sensitivity of loudness perception with respect to frequency is then given by:

L'_{s,t} = L_{s,t} + A(f_s),    s = 1, ..., S,  t = 1, ..., T,    (20)

where

A(f_s) = 20 log_{10}( R_A(f_s) ) + 2.0,    (21)

and

R_A(f_s) = 12194^2 · f_s^4 / ( (f_s^2 + 20.6^2) · √((f_s^2 + 107.7^2)(f_s^2 + 737.9^2)) · (f_s^2 + 12194^2) ).    (22)

Having calculated this, we can proceed to compute the pitch class profiles as described above, using L'_{s,t}. Ni et al. normalize the loudness-based PCP vector after aggregation according to:

X'_{p,t} = ( X_{p,t} − min_p X_{p,t} ) / ( max_p X_{p,t} − min_p X_{p,t} ),    (23)

where X_{p,t} denotes the value of PCP bin p at time t. Ni et al. state that due to this normalization, specifying the reference sound power level p_ref is not necessary.
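A small sketch of this loudness weighting follows. The A-weighting constants are the standard ones from the literature, assumed here to match equations (21) and (22), and p_ref is arbitrary because of the normalization in equation (23).

```python
import numpy as np

def a_weighting_db(f):
    """Standard A-weighting curve in dB for frequencies f in Hz, cf. eq. (21)-(22)."""
    f2 = np.asarray(f, dtype=float) ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * np.log10(ra) + 2.0

def loudness_weighted(cq_power, center_freqs, p_ref=1.0):
    """Sound power levels (eq. 19) plus A-weighting (eq. 20), per constant-Q bin and frame."""
    levels = 10.0 * np.log10(cq_power / p_ref + 1e-12)        # eq. (19), shape (S, T)
    return levels + a_weighting_db(center_freqs)[:, None]     # eq. (20), broadcast over time
```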

HMMs

In this section we give a brief overview of the hidden Markov model (HMM), as far as it is important for this thesis. It is a widely used model for speech as well as chord recognition. A musical song is highly structured in time: certain chord sequences and transitions are more common than others. PCP features, however, do not take any time dependencies into account by themselves, so a temporal alignment can increase the performance of a chord recognition system. Additionally, since we compute the PCP features from the amplitude of the signal alone, which is noisy with regard to chord information due to percussion, transient noise and other sources, the resulting feature vector is not clean. HMMs in turn are designed to deal with noisy data, which adds another argument for using HMMs for temporal smoothing.

Definition

There exist several variants of HMMs. For our comparison system we restrict ourselves to an HMM with a single Gaussian emission distribution for each state. For the stacked denoising autoencoders we use the output of the autoencoders directly as a chord estimate and as emission probability. An HMM with a Gaussian emission probability is a so-called continuous-density HMM. It is capable of interpreting multidimensional, real-valued input such as the PCP vectors we use as features, described above in section 5.1. An HMM estimates the probability of a sequence of latent states corresponding to a sequence of lower-level observations. As described by Rabiner (1989), an HMM can be defined as a 5-tuple consisting of:

1. N, the number of states in the model.

2. M, the number of distinct observations, which in the case of a continuous-density HMM is infinite.

3. A = {a_ij}, the state transition probability distribution, where a_ij = P(q_{t+1} = S_j | q_t = S_i), 1 ≤ i, j ≤ N, and q_t denotes the current state at time t. If the HMM is ergodic (i.e., all transitions from every state to every state are possible), then a_ij > 0 for all i and j. The transition probabilities satisfy the stochastic constraints Σ_{j=1}^{N} a_ij = 1 for 1 ≤ i ≤ N.

4. B = {b_j(O)}, the set of observation probabilities, which in our case is infinite. b_j(O_t) = P(O_t | q_t = S_j) is the observation probability in state j, with 1 ≤ j ≤ N, for observation O_t at time t. If we assume a continuous-density HMM, i.e., a real-valued, possibly multidimensional input, we can use a (mixture of) Gaussian distributions for the probability distribution b_j(O):

b_j(O_t) = Σ_{m=1}^{M} Z_jm N(O_t; µ_jm, Σ_jm),    with 1 ≤ j ≤ N,

where O_t is the input vector at time t, Z_jm the mixture weight (coefficient) of the m-th mixture component in state j, and N(O; µ_jm, Σ_jm) the Gaussian probability density function with mean vector µ_jm and covariance matrix Σ_jm for state j and component m.

5. π = {π_i}, where π_i = P(q_1 = S_i), with 1 ≤ i ≤ N. This is the initial state probability.

Parameter Estimation

We define the states to be the 24 chord symbols plus the non-chord symbol for the simple major-minor chord discrimination task, and 217 different symbols for the extended chord vocabulary, including major, minor, 7th and inverted chords and the non-chord symbol. The features in the case of the baseline system are computed as a 12-bin PCP vector, with a single Gaussian as emission model for the HMM. In the case of the stacked denoising autoencoder systems, we can use the output of the networks directly as emission probabilities. Since we are dealing with a fully annotated dataset, it is trivial to estimate the initial state probabilities and the transitions by computing relative frequencies with the help of the supplied ground truth. In the case of a Gaussian emission model, we can estimate the parameters from training data with the EM algorithm (McLachlan et al., 2004).

Likelihood of a Sequence

To compute the likelihood that given observations belong to a certain chord sequence, we can compute:

P(q_1, q_2, ..., q_T, O_1, O_2, ..., O_T | λ) = π_{q_1} b_{q_1}(O_1) Π_{t=2}^{T} a_{q_{t−1}, q_t} b_{q_t}(O_t),    (24)

where π_{q_1} is the initial probability of the state at time 1, b_{q_1}(O_1) the emission probability of the first observation, a_{q_{t−1}, q_t} the transition probability from the state at time t−1 to the state at time t, and b_{q_t}(O_t) the emission probability of observation O_t at time t. λ denotes the parameters of our HMM. The most likely sequence of hidden states for given observations can be computed efficiently with the help of the Viterbi algorithm (see Rabiner, 1989, for details).
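A minimal Viterbi decoder in log space is sketched below. It works on a generic emission matrix, so it can take either Gaussian log-likelihoods (comparison system) or, later, logarithms of network outputs; the interface itself is an illustrative assumption, not the implementation used in the experiments.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence given log initial probabilities (N,),
    log transition matrix (N, N) and log emission matrix (T, N)."""
    T, N = log_B.shape
    delta = np.zeros((T, N))
    backptr = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A            # scores[i, j]: from state i to state j
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores[backptr[t], np.arange(N)] + log_B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):                        # backtrack the best path
        path[t] = backptr[t + 1, path[t + 1]]
    return path
```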

5.2 Stacked Denoising Autoencoders for Chord Recognition

A piece of music contains additional non-harmonic information, or harmonic information that does not directly contribute to the chord played at a certain time in the song. This can be considered noise for the objective of estimating the correct chord progression of a song. Since stacked denoising autoencoders are trained to reduce artificially added noise, they seem to be a suitable choice for application to noisy data, and they have been shown to achieve state-of-the-art performance on several benchmark tests, including audio genre classification (Vincent et al., 2010). Moreover, deep learning architectures can be partly trained in an unsupervised manner, which might prove useful for a field like chord recognition, since there is a huge amount of unlabeled digitized musical data available, but only a very limited fraction of it is annotated.

In this section I describe two systems relying on stacked denoising autoencoders for chord recognition. The preprocessing of the input data follows the same basic steps for the two stacked denoising autoencoder approaches, described in section 5.2.1. All approaches make use of an HMM to smooth and interpret the neural network output as a post-classification step. Since the chord ground truth is given, we are also able to calculate a perfect PCP and train stacked denoising autoencoders to approximate it from the given FFT input. A description of how to apply a joint optimization procedure for the HMM and neural network for chord recognition, taken from speech recognition, is given in appendix A (this did not yield any further improvements, however). Furthermore, it is possible to train a stacked denoising autoencoder to model chord probabilities directly, which are then smoothed by an HMM, as described in section 5.2.2. Hereafter, in section 5.2.3, I propose an extension to this approach in which the input of the stacked denoising autoencoders is extended to cover multiple resolutions, smoothed over different time spans.

5.2.1 Preprocessing of Features for Stacked Denoising Autoencoders

In all approaches described below, we apply the stacked denoising autoencoders directly to the Fourier-transformed signal. This minimizes the preprocessing steps and the restrictions they impose, but some preprocessing of the input can still increase performance:

1. To restrict the search space, only the first 1500 FFT bins are used. This restricts the frequency range to approximately 0 to 3000 Hz. Most of the frequencies emitted by harmonic instruments are still contained in this interval.

2. Since values taken directly from the FFT contain high-energy peaks, we apply a square root compression, as done by Boulanger-Lewandowski et al. (2013) for deep belief networks.

3. We then normalize the FFT frames according to the L2 norm in a final preprocessing step.
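A minimal sketch of these three steps (the bin count is the one given above; everything about the surrounding pipeline is assumed):

```python
import numpy as np

def preprocess_frame(fft_magnitudes, n_bins=1500):
    """Truncate, compress and normalize one FFT magnitude frame."""
    x = np.abs(fft_magnitudes[:n_bins])    # step 1: keep only the first 1500 bins
    x = np.sqrt(x)                          # step 2: square-root compression
    norm = np.linalg.norm(x)                # step 3: L2 normalization
    return x / norm if norm > 0 else x
```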

34 5.2.2 Stacked Denoising Autoencoders for Chord Recognition chord symbols SDAE single frame preprocessing FFT input, one time frame Figure 5: Stacked denoising autoencoder for chord recognition, single resolution. Humphrey et al. (2012) state that the performance of chord recognition systems has not improved significantly recently, and suggest that one reason could be the widespread usage of PCP features. They try to find a different representation by modelling a Tonnetz under usage of convolutional neural networks. Cho and Bello (2014), who evaluate the influence on performance of different parts of chord recognition systems, also come to the conclusion that the choice of feature computation has a great influence on the overall performance and suggest the exploration of other types of features differing from the PCP. A nice property of deep learning approaches is that they are often able to find a higher level representation of the input data by themselves and do not rely on predefined feature computation. When classifying data, we can train a neural network to output pseudoprobabilities for each class given an input. This is done through a final logistic regression layer (or softmax) for the output of the neural network. We use a softmax output and a 1-of-K encoding, such that we have K outputs, each of which can be interpreted as a probability of a certain chord being played. Thus we can use the output of a 1-of-K encoding softmax output layer neural network directly as substitute for the emission probability of the HMM and further process it with temporal smoothing to compute a final chord symbol output. Since deep learning provides us with a powerful strategy for neural network training, we are able to discard all steps of the conventional PCP vector computation and restrictions that might be imposed by them apart from the FFT and train the network to classify chords. This differs from previous approaches 33
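Concretely, the connection to the HMM can be sketched as follows; the `predict_proba` interface, returning a frame-by-chord matrix of softmax outputs, is a hypothetical placeholder rather than the actual toolbox call.

```python
import numpy as np

def emission_log_probs(frames, network):
    """Turn frame-wise softmax outputs of the network into HMM log emission scores."""
    probs = network.predict_proba(frames)   # hypothetical interface: (T, K) softmax outputs
    return np.log(probs + 1e-12)            # used as log_B in a Viterbi decoder
```

The resulting matrix simply takes the place of the Gaussian log-likelihoods in the Viterbi decoder sketched earlier.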

This approach differs from previous ones such as Boulanger-Lewandowski et al. (2013) and Glazyrin (2013), who use deep learning techniques but still model PCPs, either as an intermediate target or as the output of the neural network. Figure 5 depicts the processing pipeline of the system. This system, with a single input frame, is referred to as the stacked denoising autoencoder (SDAE).

5.2.3 Multi-Resolution Input for Stacked Denoising Autoencoders

Figure 6: Stacked denoising autoencoder for chord recognition, multi-resolution.

Glazyrin (2013), who uses stacked denoising autoencoders (with and without recurrent layers) to estimate PCP vectors from the constant-Q transform, states that he suspects it to be beneficial to take multiple subsequent frames into account, but also writes that informal experiments did not show any improvement in recognition performance. Boulanger-Lewandowski et al. (2013) also make use of a recurrent layer with a deep belief network to take temporal information into account before additional (HMM) smoothing. Both approaches thus reason that it might be beneficial to take temporal information into account before using an HMM as a final computation step. We can find a similar paradigm in Tang and Mohamed (2012), used with deep learning. They propose a system in which images of faces are analyzed by a deep belief network. In addition to the original image, they propose extending the input with differently subsampled versions of the image for face recognition, and report improved performance over a single-resolution input.

They also report improved performance when extending the classifier input to several inputs with different subsampling ranges, applied to phone recognition with temporal smoothing by deep belief networks on the TIMIT dataset.

The system proposed in this thesis is likewise designed to take additional temporal information into account before the HMM post-processing. Following the intuition of Glazyrin and the idea of Tang and Mohamed, we extend the input of the stacked denoising autoencoder by computing two different time resolutions of the FFT and concatenating them with the original input of the stacked denoising autoencoders. In addition to the original FFT vector, we apply a median filter over two different ranges of subsequent frames around the current frame. After median filtering, each vector is preprocessed as described in section 5.2.1. We then join the resulting vectors and use them as frame-wise input for the stacked denoising autoencoders.

Cho and Bello (2014) conduct experiments to evaluate the influence on performance of the most prevalent constituents of chord recognition systems. They find that pre-smoothing has a significant impact on chord recognition performance in their experiments. They state that through filtering we can eliminate or reduce transient noise, which is generated by short bursts of energy such as percussive instruments, although this has the disadvantage of also smearing chord boundaries. In the proposed system, however, we supply both the original input, in which the chord boundaries are sharp but which contains transient noise, and versions that are smoothed. Cho and Bello (2014) compare average filtering and median filtering and find that there is little to no difference in terms of recognition performance. We use a median filter instead of an average filter since it is the prevalent approach in chord recognition; median filters are applied in several other approaches, e.g., Peeters (2006) or Khadkevich and Omologo (2009b), to reduce transient noise. The stacked denoising autoencoders are again trained to output chord probabilities by fine-tuning with traditional backpropagation. In the following we refer to this as a multi-resolution stacked denoising autoencoder (MR-SDAE). Figure 6 illustrates the processing pipeline of the MR-SDAE.
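A sketch of how such a multi-resolution input could be assembled is given below. The filter spans of ±3 and ±9 frames are the ones reported in the experimental setup later on; everything else (array shapes, bin count) is an illustrative assumption.

```python
import numpy as np
from scipy.ndimage import median_filter

def multi_resolution_input(fft_frames, spans=(3, 9), n_bins=1500):
    """Concatenate each frame with median-smoothed versions over wider time spans.

    fft_frames: array of shape (T, n_fft_bins); returns an array of shape (T, 3 * n_bins).
    """
    views = [fft_frames]
    for span in spans:
        # median over the current frame plus `span` frames on each side (time axis only)
        views.append(median_filter(fft_frames, size=(2 * span + 1, 1), mode="nearest"))
    processed = []
    for view in views:
        x = np.sqrt(np.abs(view[:, :n_bins]))                 # truncate + square-root compression
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        processed.append(x / np.maximum(norms, 1e-12))        # frame-wise L2 normalization
    return np.concatenate(processed, axis=1)
```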

6 Results

Finding suitable training and testing sets for chord estimation is difficult, because transcribing the chords of a song requires a significant amount of training, even for humans. Only experts are able to transcribe chord progressions accurately and in full detail. Furthermore, most musical pieces are subject to strict copyright laws. This poses the problem that ground truth and audio data are delivered separately. Different recordings of the same song might not fit the available ground truth exactly due to minor temporal deviations. There are, fortunately, tools to align ground truth data and audio files. For the following experiments, Dan Ellis's AUDFPRINT tool was used to align audio files with the publicly available ground truth.

We report results on two different datasets: a transcription of 180 Beatles songs, and the publicly available part of the McGill Billboard dataset, containing 740 songs. The Beatles dataset has been available for several years, and as other training data is scarce, many algorithms published in the MIREX challenge have been pretrained on this dataset. Because of the same scarcity of good data, the MIREX challenge has also used the Beatles dataset (with a small number of additional songs) to evaluate the performance of chord recognition algorithms, and thus the official results on the Beatles dataset might be biased. We report cross-validation performance, in which we train the algorithm on a subset of the data and test it on the remaining unseen part. This we repeat ten times for different subsets of the dataset, and report the average performance and 95% confidence interval, to give an estimate of how the proposed methods might perform on unseen data. However, the Beatles dataset is composed of songs by a single group of musicians, which itself might bias the results, since musical groups tend to have a certain style of playing music. Therefore we also conduct experiments on the Billboard dataset, which is not restricted to one group of musicians but contains popular songs sampled from the Billboard Hot 100 charts starting in 1958. Additionally, the Billboard dataset contains more songs, thus providing us with more training examples. To compare the proposed methods to other methods, we use the training and testing split of the MIREX 2012 challenge, a subset of the McGill Billboard dataset that was unpublished before 2012 but is available now. Although there are more recent results on the Billboard dataset (MIREX 2013), the test set ground truth for that part of the dataset has not yet been released.

Deep learning neural network training was implemented with the help of Palm's deep learning MATLAB toolbox (Palm, 2012). HMM smoothing was realized with functions from Kevin Murphy's Bayes Net MATLAB toolbox. Computation of the state-of-the-art features was done using Ni et al.'s code.

In the following, I first explain how we measure the performance of the algorithms, in section 6.2. Training algorithms to learn the set of all possible chords is infeasible at this point in time, due to the number of possible chords and the relative frequencies with which chords appear in the publicly available datasets. Certain chords appear in popular songs more frequently than others.

We therefore train the algorithms to recognize a set of chord symbols containing only major and minor chords, which we call the restricted chord vocabulary, and a set containing major, minor, 7th and inverted chords, which we call the extended chord vocabulary. In section 6.1, I describe how to interpret chords that are not part of these sets. Results for both chord symbol subsets on the Beatles dataset are reported in section 6.5 for the reference system, SDAE and MR-SDAE. Results for both subsets on the Billboard dataset are reported in section 6.6. The results of other algorithms submitted to MIREX 2013 for the Billboard test set used in this thesis are stated in a later section.

6.1 Reduction of Chord Vocabulary

As described in section 2.2, the chords considered in this thesis consist of three or four notes with distinct interval relationships to the root note. We have a certain set of chord symbols in each of the two chord symbol sets: the first contains only major and minor chords with three notes, the second extends this set with 7th and inverted chords. For the Billboard dataset these two subsets are already supplied. For the Beatles dataset, we need to reduce the chords in the ground truth to match the chord symbol sets we want to recognize, since those are fully detailed transcriptions which contain chord symbols not in our defined subsets. Some chords are an extension of other chords; e.g., C:maj7 can be seen as an extension of C:maj, since the former contains the same notes as the latter plus an additional fourth note at interval 7 above the root note C. We thus reduce all other chords in the ground truth according to the following set of rules:

1. If the ground truth chord symbol is in the subset of chord symbols to be recognized, leave it unchanged.

2. If a subset of its notes matches a chord symbol in the recognition set, use that symbol instead of the original ground truth symbol (e.g., C:maj7 is mapped to C:maj for the restricted vocabulary).

3. If no subset of the chord's notes matches a symbol in the recognition set, denote it as a non-chord (e.g., C:dim is mapped to the non-chord symbol).
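A sketch of this reduction as a lookup on chord quality; the quality table below is a small, assumed excerpt for the restricted vocabulary, not the full mapping used in the experiments.

```python
# Illustrative reduction of detailed chord labels to a restricted major/minor vocabulary.
REDUCE_TO_MAJMIN = {
    "maj": "maj", "maj7": "maj", "7": "maj",     # qualities containing a major triad
    "min": "min", "min7": "min",                  # qualities containing a minor triad
    "dim": None, "aug": None,                     # no major or minor triad subset
}

def reduce_label(label):
    """Map e.g. 'C:maj7' -> 'C:maj', 'C:dim' -> 'N' (non-chord), 'N' -> 'N'."""
    if label == "N":
        return "N"
    root, _, quality = label.partition(":")
    quality = quality or "maj"                    # bare root labels treated as major (assumption)
    reduced = REDUCE_TO_MAJMIN.get(quality, None)
    return f"{root}:{reduced}" if reduced else "N"
```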

6.2 Score Computation

The results reported use a method of measurement that has been proposed by Harte (2010) and Mauch (2010): the weighted chord symbol recall (WCSR). In the following, a description of how it is computed is provided.

6.2.1 Weighted Chord Symbol Recall

Most chord recognition algorithms, including the ones proposed here, work on a discretized input space, but the ground truth is measured in continuous segments with a start time, an end time and a distinct chord symbol, so we need a measure to estimate the performance of any proposed algorithm. This could be achieved by simply discretizing the ground truth according to the discretization of the estimation and then performing a frame-wise comparison. However, Harte (2010) and Mauch (2010) propose a more accurate measure. The frame-wise comparison can be enhanced by computing the relative overlap of matching chord segments between the continuous-time ground truth and the frame-wise estimation of chord symbols by the recognition system. This is called chord symbol recall (CSR):

CSR = ( Σ_{i,j} |S^A_i ∩ S^E_j| ) / ( Σ_i |S^A_i| ),    (25)

where S^A_i is one segment of the hand-annotated ground truth, S^E_j one segment of the machine estimation, and the sum in the numerator runs only over pairs of segments carrying the same chord symbol. The test set for musical chord recognition usually contains several songs, each of which has a different length and contains a different number of chords. We can thus extend the CSR to a corpus of songs by summing the results for each song, weighted by its length. This is the weighted chord symbol recall (WCSR), used for evaluating performance on a corpus containing several songs:

WCSR = ( Σ_{i=1}^{N} L_i · CSR_i ) / ( Σ_{i=1}^{N} L_i ),    (26)

where L_i is the length of song i and CSR_i the chord symbol recall between the machine estimation and the hand-annotated segments of song i.
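A sketch of the corpus-level computation, assuming per-song lists of (start, end, label) segments for both annotation and estimation:

```python
def csr(annotated, estimated):
    """Chord symbol recall, eq. (25): overlap of equally labelled segments
    divided by the total annotated duration. Segments are (start, end, label) tuples."""
    overlap = sum(
        max(0.0, min(a_end, e_end) - max(a_start, e_start))
        for a_start, a_end, a_label in annotated
        for e_start, e_end, e_label in estimated
        if a_label == e_label
    )
    total = sum(a_end - a_start for a_start, a_end, _ in annotated)
    return overlap / total if total > 0 else 0.0

def wcsr(songs):
    """Weighted chord symbol recall, eq. (26), over (annotated, estimated) pairs per song."""
    lengths = [sum(a_end - a_start for a_start, a_end, _ in ann) for ann, _ in songs]
    scores = [csr(ann, est) for ann, est in songs]
    return sum(l * s for l, s in zip(lengths, scores)) / sum(lengths)
```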

6.3 Training Systems Setup

In the course of the experiments, the following parameters were found to be suitable. The stacked denoising autoencoders are trained with 30 iterations of unsupervised training with additive Gaussian noise of variance 0.2 and a fraction of corrupted inputs of 0.7. The autoencoders have two hidden layers with 800 hidden nodes each and a sigmoid activation function; the output layer contains as many nodes as there are chord symbols. To enforce sparsity, an activation penalty weighting of β = 0.1 and a target activation of p = 0.05 are used. The dropout rate is set to 0.5, and batch training with a batch size of 100 samples is used. The learning rate is set to 1 and the momentum to 0.5. For the MR-SDAE, the previous and subsequent 3 frames are used for the second input vector, and the previous and subsequent 9 frames for the third input vector. Due to memory restrictions, only a subset of the frames of the complete training set is used for training the stacked denoising autoencoder based systems. 10% of the training data is set aside for validation during training. Additionally, I extended Palm's deep learning library with an early stopping mechanism, which stops supervised training when the performance on the validation set does not improve for 20 iterations, or else after 500 iterations, to restrict computation time. It then returns the best-performing weight configuration according to the training validation. For the comparison system, since not all chords of the extended chord vocabulary are included in all datasets, missing chords are substituted with the mean PCP vector of the training set. Malformed covariance matrices are corrected by adding a small amount of random noise.

6.4 Significance Testing

Similar to Mauch and Dixon (2010a), a Friedman multiple comparison test is used to test for significant differences in performance between the proposed algorithms and the reference system. This tests the performance of the different algorithms on a song level, and thus differs from the WCSR, which takes the song length into account in the final score. The Friedman multiple comparison test measures the statistical significance of ranks, thus indicating whether an algorithm outperforms another with statistical significance on a song level, without regard to the WCSR of the songs in general. For the purpose of testing for statistical significance, we select one fold of the cross-validation on the Beatles dataset on which the performance is close to the mean, and one test run for the Billboard dataset which is likewise close to the mean for the SDAE-based approaches. All plots for the post hoc Friedman multiple comparison test show the mean rank and the 95% confidence interval in terms of ranks.

6.5 Beatles Dataset

The Beatles Isophonics dataset contains songs by the Beatles and Zweieck. We only use the Beatles songs for evaluating the performance of the algorithms, since it is difficult to come by the audio data of the Zweieck songs. The Beatles-only subset of this dataset consists of 180 songs. In sections 6.5.1 and 6.5.2, the results for the restricted and extended chord vocabulary are reported for the comparison system, SDAE and MR-SDAE. The cross-validation performance across ten folds is shown. We partition the dataset into ten subsets, of which we use one for testing and nine for training. For the first fold we use every tenth song from the Beatles dataset starting from the first, as ordered in the ground truth, for the second fold every tenth song starting from the second, and so on. We train ten different models, one for each testing partition. Since we use an HMM smoothing step, for the neural network approaches we show raw results without HMM smoothing as well as the final performance of the systems with temporal smoothing. The reference system uses the HMM even for classification, and thus we report only a single final performance statistic for it. All results are reported as WCSR, as described above and as used in the MIREX challenge. Since there are ten different results, one for testing on each partition, I report the average WCSR as well as a 95% confidence interval of the aggregated results. To get an insight into the distribution of the performance results, I also plot box-and-whisker diagrams. Finally, I perform Friedman multiple comparison tests for statistical significance across algorithms. Since the implementation of the learning algorithms in MATLAB is memory intensive, I subsample the training partitions for the SDAEs: for SDAE, I use every 3rd frame for training, and for MR-SDAE, every 4th frame.

6.5.1 Restricted Major-Minor Chord Vocabulary

Friedman Multiple Comparison Tests

Values are computed on fold five of the Beatles dataset, which yields a result close to the mean performance for all algorithms tested. In figure 7 the results of the post hoc Friedman multiple comparison tests for all systems, smoothed and unsmoothed, on the restricted chord vocabulary task are depicted. The algorithms showed significantly different performance.

Figure 7: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Beatles dataset on the restricted chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).

Whisker Plot and Mean Performance

In this section results for the proposed algorithms, SDAE and MR-SDAE, and the reference system on the reduced major-minor chord symbol recognition task are presented. Figure 8 depicts a box-and-whisker diagram for the performance of the algorithms with and without temporal smoothing and the performance of the reference system. The upper and lower whiskers depict the maximum and minimum performance over all results of the ten-fold cross-validation, while the upper and lower boundaries of the boxes represent the upper and lower quartiles. The median of all runs is shown as a dotted line inside the box. The average WCSR together with 95% confidence intervals over folds, before and after temporal smoothing, can be found in table 3.

Figure 8: Results for the simplified HPA, SDAE and MR-SDAE for the restricted chord vocabulary 10-fold cross-validation on the Beatles dataset, with and without HMM smoothing (y-axis: WCSR in %). Results after smoothing are highlighted in bold.

System      Not smoothed      Smoothed
S-HPA                         ± 9.32
SDAE        ±                 ± 7.41
MR-SDAE     ±                 ± 7.92

Table 3: Average WCSR for the restricted chord vocabulary on the Beatles dataset, smoothed and unsmoothed, with 95% confidence intervals.

Summary

In the Friedman multiple comparison test in figure 7, we observe that the mean ranks of the post-smoothing SDAE and MR-SDAE are significantly higher than the mean rank of the reference system (S-HPA), and also that smoothing significantly improves the performance. Mean ranks for SDAE and MR-SDAE without smoothing are lower than that of the reference system, although not significantly. MR-SDAE has a slightly higher mean rank compared to SDAE, but not significantly.

In figure 8 we can observe that the pre- and post-smoothing SDAE and MR-SDAE distributions are negatively skewed, whereas the S-HPA distribution is skewed positively. The skewness of the distributions does not change much for SDAE and MR-SDAE before versus after smoothing; smoothing does, however, improve the performance in general. In table 3, we can see that MR-SDAE slightly outperforms SDAE in mean performance and that both achieve higher mean performance than the reference system after HMM smoothing.

The means of the results before HMM smoothing for SDAE and MR-SDAE are, however, lower.

6.5.2 Extended Chord Vocabulary

Friedman Multiple Comparison Tests

Again, values are computed for fold five of the Beatles dataset. In figure 9 the results of the post hoc Friedman multiple comparison tests for all systems, smoothed and unsmoothed, on the extended chord vocabulary task are depicted. The algorithms showed significantly different performance.

Figure 9: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Beatles dataset on the extended chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).

Whisker Plots and Means

Similar to above, we depict box-and-whisker diagrams for the unsmoothed and smoothed results of the ten-fold cross-validation on the extended chord symbol set in figure 10. Table 4 shows the average WCSR and 95% confidence interval over folds for smoothed and unsmoothed results.

Figure 10: Whisker plot for the simplified HPA, SDAE, and MR-SDAE using the extended chord vocabulary and 10-fold cross-validation on the Beatles dataset, with and without smoothing (y-axis: WCSR in %). Results after smoothing are highlighted in bold.

System      Not smoothed      Smoothed
S-HPA                         ± 7.89
SDAE        ±                 ± 7.37
MR-SDAE     ±                 ± 6.81

Table 4: Average WCSR for the simplified HPA, SDAE and MR-SDAE using the extended chord vocabulary on the Beatles dataset, smoothed and unsmoothed, with 95% confidence intervals.

Summary

In the Friedman multiple comparison tests in figure 9, we can observe that again the post-smoothing performance in terms of ranks of the SDAE and MR-SDAE is significantly better than that of the reference system. In comparison to the restricted chord vocabulary recognition task, the margin is even larger. A peculiar thing to note is that with the extended chord vocabulary, the pre-smoothing performance of MR-SDAE is not significantly worse than the post-smoothing performance of either SDAE-based chord recognition system. SDAE shows lower mean ranks before smoothing than the reference system, and MR-SDAE seems to perform slightly better than the reference system, although not significantly so, before smoothing. In figure 10, we can see negatively skewed distributions of the cross-validation results for SDAE and MR-SDAE, similar to the restricted chord vocabulary setting. Again we observe that the skewness of the distributions does not change much after smoothing, but we can observe an increase in performance.

However, in the extended chord vocabulary task, the medians of the SDAE and MR-SDAE are higher than that of the reference system, with values even higher than the best performance of the reference system. On the extended chord vocabulary, the reference system does not show a distinct skew. The better performance is also reflected in table 4, where the proposed systems achieve higher means before and after HMM smoothing compared to the reference system.

6.6 Billboard Dataset

The McGill Billboard dataset consists of songs randomly sampled from the Billboard Hot 100 charts starting in 1958. The dataset currently contains 740 songs, of which we separate 160 songs for testing and use the remainder for training the algorithms. The selected test set corresponds to the official test set of the MIREX 2012 challenge. Although there are results for algorithms in the MIREX 2013 challenge on the Billboard dataset, the ground truth of that specific test set has not been publicly released at this point in time. As with the Beatles dataset, the audio files are not publicly available, but there are several different audio recordings of the songs in the dataset; we again use Dan Ellis's AUDFPRINT tool to align the audio data with the ground truth. For this dataset the ground truth is already available in the right format for the restricted major-minor chord vocabulary and the extended 7th and inverted chord vocabulary, so we do not need to reduce the chords ourselves. Since the Billboard dataset is much larger than the Beatles dataset, we sample every 8th frame for the SDAE training and every 16th for the MR-SDAE. The algorithms were run five times.

6.6.1 Restricted Major-Minor Chord Vocabulary

Friedman Multiple Comparison Tests

In figure 11 the results of the post hoc Friedman multiple comparison test for the Billboard restricted chord vocabulary task, for the reference system and the smoothed and unsmoothed SDAE and MR-SDAE, are presented. The algorithms showed significantly different performance.

Figure 11: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Billboard dataset on the restricted chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).

Mean Performance

In this section the results on the MIREX 2012 test partition of the Billboard dataset for the restricted major-minor chord vocabulary are presented. Table 5 shows the performance of the SDAE systems with and without smoothing. Since we do not perform cross-validation on this dataset and the comparison system does not have any randomized initialization, we report the 95% confidence interval for the SDAEs only, with respect to multiple random initializations (note that these are not directly comparable to the confidence intervals over cross-validation folds reported for the Beatles dataset).

System      Not smoothed      Smoothed
S-HPA
SDAE        ±                 ± 0.31
MR-SDAE     ±                 ± 0.40

Table 5: Average WCSR for the restricted chord vocabulary on the MIREX 2012 Billboard test set, smoothed and unsmoothed, with 95% confidence intervals where applicable.

Summary

Figure 11, depicting the Friedman multiple comparison test for significance, reveals that in the Billboard restricted chord vocabulary task, the reference system does not perform significantly worse than the post-smoothing SDAE and MR-SDAE. It is also notable that in this setting the pre-smoothing MR-SDAE significantly outperforms the pre-smoothing SDAE. Similar to the restricted chord vocabulary task on the Beatles dataset, the means before smoothing on the Billboard dataset are lower than those of the reference system. However, we can still observe a better pre-smoothing mean performance for MR-SDAE in comparison with SDAE. Comparing mean performance after HMM smoothing, we see no significant differences.

6.6.2 Extended Chord Vocabulary

Friedman Multiple Comparison Tests

In figure 12 the results of the post hoc Friedman multiple comparison test for the Billboard extended chord vocabulary task, for the reference system and the smoothed and unsmoothed SDAE and MR-SDAE, are presented. The algorithms showed significantly different performance.

Figure 12: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Billboard dataset on the extended chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).

Mean Performance

Table 6 depicts the performance of the reference system and the SDAE systems on the extended chord vocabulary containing major, minor, 7th and inverted chord symbols. Again, no confidence interval is reported for the reference system, since there is no random component and the results are the same over multiple runs.

System      Not smoothed      Smoothed
S-HPA
SDAE        ±                 ± 0.32
MR-SDAE     ±                 ± 0.50

Table 6: Average WCSR for the extended chord vocabulary on the MIREX 2012 Billboard test set, smoothed and unsmoothed, with 95% confidence intervals where applicable.

Summary

The Friedman multiple comparison test in figure 12 again shows significantly better performance for the post-smoothing SDAE systems in comparison to their pre-smoothing performance, and also to the reference system. MR-SDAE again seems to achieve a higher mean rank in comparison with SDAE, although this is not statistically significant. In terms of mean performance in WCSR, depicted in table 6, the pre-smoothing figures for SDAE and MR-SDAE are higher than those of the reference system. Again MR-SDAE outperforms SDAE in mean WCSR. The same is the case after smoothing: MR-SDAE slightly outperforms SDAE, and both perform better than the reference system.

6.7 Weights

In this section we visualize the weights of the input layer of the neural network trained on the Beatles dataset. Figure 13 shows an excerpt of the input layer of the neural network, with weights depicted as a grayscale image, where black denotes negative weights and white corresponds to positive weights. In figure 14 the sum of the absolute values of all weights for each FFT input is plotted. The vertical lines depict FFT bins that correspond to musically important frequencies, i.e., musical notes.

Figure 13: Excerpt of the weights of the input layer. Black denotes negative weights, and white positive (axes: hidden nodes vs. weights for inputs).

Figure 14: Sum of absolute values for each input (FFT bin) of the trained neural network. Vertical gray lines indicate bins of the FFT that correspond to musically relevant frequencies.


10 Visualization of Tonal Content in the Symbolic and Audio Domains 10 Visualization of Tonal Content in the Symbolic and Audio Domains Petri Toiviainen Department of Music PO Box 35 (M) 40014 University of Jyväskylä Finland ptoiviai@campus.jyu.fi Abstract Various computational

More information

Music Segmentation Using Markov Chain Methods

Music Segmentation Using Markov Chain Methods Music Segmentation Using Markov Chain Methods Paul Finkelstein March 8, 2011 Abstract This paper will present just how far the use of Markov Chains has spread in the 21 st century. We will explain some

More information

A repetition-based framework for lyric alignment in popular songs

A repetition-based framework for lyric alignment in popular songs A repetition-based framework for lyric alignment in popular songs ABSTRACT LUONG Minh Thang and KAN Min Yen Department of Computer Science, School of Computing, National University of Singapore We examine

More information

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM A QUER B EAMPLE MUSIC RETRIEVAL ALGORITHM H. HARB AND L. CHEN Maths-Info department, Ecole Centrale de Lyon. 36, av. Guy de Collongue, 69134, Ecully, France, EUROPE E-mail: {hadi.harb, liming.chen}@ec-lyon.fr

More information

Outline. Why do we classify? Audio Classification

Outline. Why do we classify? Audio Classification Outline Introduction Music Information Retrieval Classification Process Steps Pitch Histograms Multiple Pitch Detection Algorithm Musical Genre Classification Implementation Future Work Why do we classify

More information

Music Alignment and Applications. Introduction

Music Alignment and Applications. Introduction Music Alignment and Applications Roger B. Dannenberg Schools of Computer Science, Art, and Music Introduction Music information comes in many forms Digital Audio Multi-track Audio Music Notation MIDI Structured

More information

A CLASSIFICATION-BASED POLYPHONIC PIANO TRANSCRIPTION APPROACH USING LEARNED FEATURE REPRESENTATIONS

A CLASSIFICATION-BASED POLYPHONIC PIANO TRANSCRIPTION APPROACH USING LEARNED FEATURE REPRESENTATIONS 12th International Society for Music Information Retrieval Conference (ISMIR 2011) A CLASSIFICATION-BASED POLYPHONIC PIANO TRANSCRIPTION APPROACH USING LEARNED FEATURE REPRESENTATIONS Juhan Nam Stanford

More information

Studying the effects of bass estimation for chord segmentation in pop-rock music

Studying the effects of bass estimation for chord segmentation in pop-rock music Studying the effects of bass estimation for chord segmentation in pop-rock music Urbez Capablo Riazuelo MASTER THESIS UPF / 2014 Master in Sound and Music Computing Master thesis supervisor: Dr. Perfecto

More information

Automatic Piano Music Transcription

Automatic Piano Music Transcription Automatic Piano Music Transcription Jianyu Fan Qiuhan Wang Xin Li Jianyu.Fan.Gr@dartmouth.edu Qiuhan.Wang.Gr@dartmouth.edu Xi.Li.Gr@dartmouth.edu 1. Introduction Writing down the score while listening

More information

Query By Humming: Finding Songs in a Polyphonic Database

Query By Humming: Finding Songs in a Polyphonic Database Query By Humming: Finding Songs in a Polyphonic Database John Duchi Computer Science Department Stanford University jduchi@stanford.edu Benjamin Phipps Computer Science Department Stanford University bphipps@stanford.edu

More information

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University Week 14 Query-by-Humming and Music Fingerprinting Roger B. Dannenberg Professor of Computer Science, Art and Music Overview n Melody-Based Retrieval n Audio-Score Alignment n Music Fingerprinting 2 Metadata-based

More information

Speech To Song Classification

Speech To Song Classification Speech To Song Classification Emily Graber Center for Computer Research in Music and Acoustics, Department of Music, Stanford University Abstract The speech to song illusion is a perceptual phenomenon

More information

Music Representations

Music Representations Lecture Music Processing Music Representations Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Book: Fundamentals of Music Processing Meinard Müller Fundamentals

More information

Statistical Modeling and Retrieval of Polyphonic Music

Statistical Modeling and Retrieval of Polyphonic Music Statistical Modeling and Retrieval of Polyphonic Music Erdem Unal Panayiotis G. Georgiou and Shrikanth S. Narayanan Speech Analysis and Interpretation Laboratory University of Southern California Los Angeles,

More information

AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC

AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC A Thesis Presented to The Academic Faculty by Xiang Cao In Partial Fulfillment of the Requirements for the Degree Master of Science

More information

SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION

SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION th International Society for Music Information Retrieval Conference (ISMIR ) SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION Chao-Ling Hsu Jyh-Shing Roger Jang

More information

Music Information Retrieval with Temporal Features and Timbre

Music Information Retrieval with Temporal Features and Timbre Music Information Retrieval with Temporal Features and Timbre Angelina A. Tzacheva and Keith J. Bell University of South Carolina Upstate, Department of Informatics 800 University Way, Spartanburg, SC

More information

Audio Feature Extraction for Corpus Analysis

Audio Feature Extraction for Corpus Analysis Audio Feature Extraction for Corpus Analysis Anja Volk Sound and Music Technology 5 Dec 2017 1 Corpus analysis What is corpus analysis study a large corpus of music for gaining insights on general trends

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

Content-based music retrieval

Content-based music retrieval Music retrieval 1 Music retrieval 2 Content-based music retrieval Music information retrieval (MIR) is currently an active research area See proceedings of ISMIR conference and annual MIREX evaluations

More information

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST)

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST) Computational Models of Music Similarity 1 Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST) Abstract The perceived similarity of two pieces of music is multi-dimensional,

More information

Pitch Perception and Grouping. HST.723 Neural Coding and Perception of Sound

Pitch Perception and Grouping. HST.723 Neural Coding and Perception of Sound Pitch Perception and Grouping HST.723 Neural Coding and Perception of Sound Pitch Perception. I. Pure Tones The pitch of a pure tone is strongly related to the tone s frequency, although there are small

More information

Music Similarity and Cover Song Identification: The Case of Jazz

Music Similarity and Cover Song Identification: The Case of Jazz Music Similarity and Cover Song Identification: The Case of Jazz Simon Dixon and Peter Foster s.e.dixon@qmul.ac.uk Centre for Digital Music School of Electronic Engineering and Computer Science Queen Mary

More information

Analysing Musical Pieces Using harmony-analyser.org Tools

Analysing Musical Pieces Using harmony-analyser.org Tools Analysing Musical Pieces Using harmony-analyser.org Tools Ladislav Maršík Dept. of Software Engineering, Faculty of Mathematics and Physics Charles University, Malostranské nám. 25, 118 00 Prague 1, Czech

More information

AUTOMASHUPPER: AN AUTOMATIC MULTI-SONG MASHUP SYSTEM

AUTOMASHUPPER: AN AUTOMATIC MULTI-SONG MASHUP SYSTEM AUTOMASHUPPER: AN AUTOMATIC MULTI-SONG MASHUP SYSTEM Matthew E. P. Davies, Philippe Hamel, Kazuyoshi Yoshii and Masataka Goto National Institute of Advanced Industrial Science and Technology (AIST), Japan

More information

Detecting Musical Key with Supervised Learning

Detecting Musical Key with Supervised Learning Detecting Musical Key with Supervised Learning Robert Mahieu Department of Electrical Engineering Stanford University rmahieu@stanford.edu Abstract This paper proposes and tests performance of two different

More information

Music Source Separation

Music Source Separation Music Source Separation Hao-Wei Tseng Electrical and Engineering System University of Michigan Ann Arbor, Michigan Email: blakesen@umich.edu Abstract In popular music, a cover version or cover song, or

More information

Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE, and Bryan Pardo, Member, IEEE

Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE, and Bryan Pardo, Member, IEEE IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 6, OCTOBER 2011 1205 Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE,

More information

Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification

Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification 1138 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 6, AUGUST 2008 Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification Joan Serrà, Emilia Gómez,

More information

PHYSICS OF MUSIC. 1.) Charles Taylor, Exploring Music (Music Library ML3805 T )

PHYSICS OF MUSIC. 1.) Charles Taylor, Exploring Music (Music Library ML3805 T ) REFERENCES: 1.) Charles Taylor, Exploring Music (Music Library ML3805 T225 1992) 2.) Juan Roederer, Physics and Psychophysics of Music (Music Library ML3805 R74 1995) 3.) Physics of Sound, writeup in this

More information

Subjective Similarity of Music: Data Collection for Individuality Analysis

Subjective Similarity of Music: Data Collection for Individuality Analysis Subjective Similarity of Music: Data Collection for Individuality Analysis Shota Kawabuchi and Chiyomi Miyajima and Norihide Kitaoka and Kazuya Takeda Nagoya University, Nagoya, Japan E-mail: shota.kawabuchi@g.sp.m.is.nagoya-u.ac.jp

More information

A Survey of Audio-Based Music Classification and Annotation

A Survey of Audio-Based Music Classification and Annotation A Survey of Audio-Based Music Classification and Annotation Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang IEEE Trans. on Multimedia, vol. 13, no. 2, April 2011 presenter: Yin-Tzu Lin ( 阿孜孜 ^.^)

More information

NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING

NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING Zhiyao Duan University of Rochester Dept. Electrical and Computer Engineering zhiyao.duan@rochester.edu David Temperley University of Rochester

More information

A Study on Music Genre Recognition and Classification Techniques

A Study on Music Genre Recognition and Classification Techniques , pp.31-42 http://dx.doi.org/10.14257/ijmue.2014.9.4.04 A Study on Music Genre Recognition and Classification Techniques Aziz Nasridinov 1 and Young-Ho Park* 2 1 School of Computer Engineering, Dongguk

More information

JOINT BEAT AND DOWNBEAT TRACKING WITH RECURRENT NEURAL NETWORKS

JOINT BEAT AND DOWNBEAT TRACKING WITH RECURRENT NEURAL NETWORKS JOINT BEAT AND DOWNBEAT TRACKING WITH RECURRENT NEURAL NETWORKS Sebastian Böck, Florian Krebs, and Gerhard Widmer Department of Computational Perception Johannes Kepler University Linz, Austria sebastian.boeck@jku.at

More information

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval DAY 1 Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval Jay LeBoeuf Imagine Research jay{at}imagine-research.com Rebecca

More information

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 04, April -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 MUSICAL

More information

ON FINDING MELODIC LINES IN AUDIO RECORDINGS. Matija Marolt

ON FINDING MELODIC LINES IN AUDIO RECORDINGS. Matija Marolt ON FINDING MELODIC LINES IN AUDIO RECORDINGS Matija Marolt Faculty of Computer and Information Science University of Ljubljana, Slovenia matija.marolt@fri.uni-lj.si ABSTRACT The paper presents our approach

More information

DOWNBEAT TRACKING WITH MULTIPLE FEATURES AND DEEP NEURAL NETWORKS

DOWNBEAT TRACKING WITH MULTIPLE FEATURES AND DEEP NEURAL NETWORKS DOWNBEAT TRACKING WITH MULTIPLE FEATURES AND DEEP NEURAL NETWORKS Simon Durand*, Juan P. Bello, Bertrand David*, Gaël Richard* * Institut Mines-Telecom, Telecom ParisTech, CNRS-LTCI, 37/39, rue Dareau,

More information

MUSIC CONTENT ANALYSIS : KEY, CHORD AND RHYTHM TRACKING IN ACOUSTIC SIGNALS

MUSIC CONTENT ANALYSIS : KEY, CHORD AND RHYTHM TRACKING IN ACOUSTIC SIGNALS MUSIC CONTENT ANALYSIS : KEY, CHORD AND RHYTHM TRACKING IN ACOUSTIC SIGNALS ARUN SHENOY KOTA (B.Eng.(Computer Science), Mangalore University, India) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE

More information

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? NICHOLAS BORG AND GEORGE HOKKANEN Abstract. The possibility of a hit song prediction algorithm is both academically interesting and industry motivated.

More information

A Discriminative Approach to Topic-based Citation Recommendation

A Discriminative Approach to Topic-based Citation Recommendation A Discriminative Approach to Topic-based Citation Recommendation Jie Tang and Jing Zhang Department of Computer Science and Technology, Tsinghua University, Beijing, 100084. China jietang@tsinghua.edu.cn,zhangjing@keg.cs.tsinghua.edu.cn

More information

arxiv: v1 [cs.sd] 8 Jun 2016

arxiv: v1 [cs.sd] 8 Jun 2016 Symbolic Music Data Version 1. arxiv:1.5v1 [cs.sd] 8 Jun 1 Christian Walder CSIRO Data1 7 London Circuit, Canberra,, Australia. christian.walder@data1.csiro.au June 9, 1 Abstract In this document, we introduce

More information

Neural Network for Music Instrument Identi cation

Neural Network for Music Instrument Identi cation Neural Network for Music Instrument Identi cation Zhiwen Zhang(MSE), Hanze Tu(CCRMA), Yuan Li(CCRMA) SUN ID: zhiwen, hanze, yuanli92 Abstract - In the context of music, instrument identi cation would contribute

More information

Singer Recognition and Modeling Singer Error

Singer Recognition and Modeling Singer Error Singer Recognition and Modeling Singer Error Johan Ismael Stanford University jismael@stanford.edu Nicholas McGee Stanford University ndmcgee@stanford.edu 1. Abstract We propose a system for recognizing

More information

Measurement of overtone frequencies of a toy piano and perception of its pitch

Measurement of overtone frequencies of a toy piano and perception of its pitch Measurement of overtone frequencies of a toy piano and perception of its pitch PACS: 43.75.Mn ABSTRACT Akira Nishimura Department of Media and Cultural Studies, Tokyo University of Information Sciences,

More information

Music Radar: A Web-based Query by Humming System

Music Radar: A Web-based Query by Humming System Music Radar: A Web-based Query by Humming System Lianjie Cao, Peng Hao, Chunmeng Zhou Computer Science Department, Purdue University, 305 N. University Street West Lafayette, IN 47907-2107 {cao62, pengh,

More information

HST 725 Music Perception & Cognition Assignment #1 =================================================================

HST 725 Music Perception & Cognition Assignment #1 ================================================================= HST.725 Music Perception and Cognition, Spring 2009 Harvard-MIT Division of Health Sciences and Technology Course Director: Dr. Peter Cariani HST 725 Music Perception & Cognition Assignment #1 =================================================================

More information

Music Genre Classification

Music Genre Classification Music Genre Classification chunya25 Fall 2017 1 Introduction A genre is defined as a category of artistic composition, characterized by similarities in form, style, or subject matter. [1] Some researchers

More information

Appendix A Types of Recorded Chords

Appendix A Types of Recorded Chords Appendix A Types of Recorded Chords In this appendix, detailed lists of the types of recorded chords are presented. These lists include: The conventional name of the chord [13, 15]. The intervals between

More information

Tempo and Beat Tracking

Tempo and Beat Tracking Tutorial Automatisierte Methoden der Musikverarbeitung 47. Jahrestagung der Gesellschaft für Informatik Tempo and Beat Tracking Meinard Müller, Christof Weiss, Stefan Balke International Audio Laboratories

More information

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION Halfdan Rump, Shigeki Miyabe, Emiru Tsunoo, Nobukata Ono, Shigeki Sagama The University of Tokyo, Graduate

More information

OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES

OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES Vishweshwara Rao and Preeti Rao Digital Audio Processing Lab, Electrical Engineering Department, IIT-Bombay, Powai,

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox Final Project (EECS 94) knoxm@eecs.berkeley.edu December 1, 006 1 Introduction Laughter is a powerful cue in communication. It communicates to listeners the emotional

More information

LEARNING AUDIO SHEET MUSIC CORRESPONDENCES. Matthias Dorfer Department of Computational Perception

LEARNING AUDIO SHEET MUSIC CORRESPONDENCES. Matthias Dorfer Department of Computational Perception LEARNING AUDIO SHEET MUSIC CORRESPONDENCES Matthias Dorfer Department of Computational Perception Short Introduction... I am a PhD Candidate in the Department of Computational Perception at Johannes Kepler

More information

Music Structure Analysis

Music Structure Analysis Lecture Music Processing Music Structure Analysis Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Book: Fundamentals of Music Processing Meinard Müller Fundamentals

More information

Improving Frame Based Automatic Laughter Detection

Improving Frame Based Automatic Laughter Detection Improving Frame Based Automatic Laughter Detection Mary Knox EE225D Class Project knoxm@eecs.berkeley.edu December 13, 2007 Abstract Laughter recognition is an underexplored area of research. My goal for

More information

Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series

Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series -1- Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series JERICA OBLAK, Ph. D. Composer/Music Theorist 1382 1 st Ave. New York, NY 10021 USA Abstract: - The proportional

More information

The Intervalgram: An Audio Feature for Large-scale Melody Recognition

The Intervalgram: An Audio Feature for Large-scale Melody Recognition The Intervalgram: An Audio Feature for Large-scale Melody Recognition Thomas C. Walters, David A. Ross, and Richard F. Lyon Google, 1600 Amphitheatre Parkway, Mountain View, CA, 94043, USA tomwalters@google.com

More information