Chord Recognition with Stacked Denoising Autoencoders


Chord Recognition with Stacked Denoising Autoencoders
Author: Nikolaas Steenbergen
Supervisors: Prof. Dr. Theo Gevers, Dr. John Ashley Burgoyne
A thesis submitted in fulfilment of the requirements for the degree of Master of Science in Artificial Intelligence in the Faculty of Science, July 2014

Abstract

In this thesis I propose two different approaches for chord recognition based on stacked denoising autoencoders working directly on the FFT. These approaches do not use any intermediate targets such as pitch class profiles/chroma vectors or the Tonnetz, in an attempt to remove any restrictions that might be imposed by such an interpretation. It is shown that these systems can significantly outperform a reference system based on state-of-the-art features. The first approach computes chord probabilities directly from an FFT excerpt of the audio data. In the second approach, two additional inputs, filtered with a median filter over different time spans, are added to the input. Hereafter, in both systems, a hidden Markov model is used to perform temporal smoothing after pre-classifying chords. It is shown that using several different temporal resolutions can increase the classification ability in terms of weighted chord symbol recall. All algorithms are tested in depth on the Beatles Isophonics and Billboard datasets, on a restricted chord vocabulary containing major and minor chords and on an extended chord vocabulary containing major, minor, 7th and inverted chord symbols. In addition to presenting the weighted chord symbol recall, a post-hoc Friedman multiple-comparison test for the statistical significance of performance differences is also conducted.

Acknowledgements

I would like to thank Theo Gevers and John Ashley Burgoyne for supervising my thesis. Thanks to Ashley Burgoyne for his thorough and helpful advice and guidance. Thanks to Amogh Gudi for all the fruitful discussions about deep learning techniques while lifting weights and sweating in the gym. Special thanks to my parents, Brigitte and Christiaan Steenbergen, and my brothers Alexander and Florian; without their help, support and love, I would not be where I am now.

Contents

1 Introduction
2 Musical Background
   Notes and Pitch
   Chords
   Other Structures in Music
3 Related Work
   Preprocessing / Features
      PCP / Chroma Vector Calculation
      Minor Pitch Changes
      Percussive Noise Reduction
      Repeating Patterns
      Harmonic / Enhanced Pitch Class Profile
      Modelling Human Loudness Perception
      Tonnetz / Tonal Centroid
   Classification
      Template Approaches
      Data-Driven Higher Context Models
4 Stacked Denoising Autoencoders
   Autoencoders
   Autoencoders and Denoising
   Training Multiple Layers
   Dropout
5 Chord Recognition Systems
   Comparison System
      Basic Pitch Class Profile Features
      Comparison System: Simplified Harmony Progression Analyzer
      Harmonic Percussive Sound Separation
      Tuning and Loudness-Based PCPs
      HMMs
   Stacked Denoising Autoencoders for Chord Recognition
      Preprocessing of Features for Stacked Denoising Autoencoders
      Stacked Denoising Autoencoders for Chord Recognition
      Multi-Resolution Input for Stacked Denoising Autoencoders
6 Results
   Reduction of Chord Vocabulary
   Score Computation: Weighted Chord Symbol Recall
   Training
   Systems Setup
   Significance Testing
   Beatles Dataset
      Restricted Major-Minor Chord Vocabulary
      Extended Chord Vocabulary
   Billboard Dataset
      Restricted Major-Minor Chord Vocabulary
      Extended Chord Vocabulary
   Weights
7 Discussion
   Performance on the Different Datasets
   SDAE
   MR-SDAE
   Weights
   Extensions
8 Conclusion
A Joint Optimization
   A.1 Basic System Outline
   A.2 Gradient of the Hidden Markov Model
   A.3 Adjusting Neural Network Parameters
   A.4 Updating HMM Parameters
   A.5 Neural Network
   A.6 Hidden Markov Model
   A.7 Combined Training
   A.8 Joint Optimization
   A.9 Joint Optimization: Possible Interpretation

List of Figures

1. Piano keyboard and MIDI note range
2. Conventional autoencoder training
3. Denoising autoencoder training
4. Stacked denoising autoencoder training
5. SDAE for chord recognition
6. MR-SDAE for chord recognition
7. Post-hoc multiple-comparison Friedman tests for the Beatles restricted chord vocabulary
8. Whisker plot for the Beatles restricted chord vocabulary
9. Post-hoc multiple-comparison Friedman tests for the Beatles extended chord vocabulary
10. Whisker plot for the Beatles extended chord vocabulary
11. Post-hoc multiple-comparison Friedman tests for the Billboard restricted chord vocabulary
12. Post-hoc multiple-comparison Friedman tests for the Billboard extended chord vocabulary
13. Visualization of weights of the input layer of the SDAE
14. Plot of sum of absolute values for the input layer of the SDAE
15. Absolute training error for joint optimization
16. Classification performance of joint optimization while training

List of Tables

1. Semitone steps and intervals
2. Intervals and chords
3. WCSR for the Beatles restricted chord vocabulary
4. WCSR for the Beatles extended chord vocabulary
5. WCSR for the Billboard restricted chord vocabulary
6. WCSR for the Billboard extended chord vocabulary
7. Results for chord recognition in MIREX

1 Introduction

The increasing amount of digitized music available online has given rise to demand for automatic analysis methods. A new subfield of information retrieval has emerged that concerns itself only with music: music information retrieval (MIR). Music information retrieval spans different subcategories, from analyzing features of a music piece (e.g., beat detection, symbolic melody extraction, and audio tempo estimation) to exploring human input methods (like query by tapping or query by singing/humming) to music clustering and recommendation (like mood detection or cover song identification).

Automatic chord estimation is one of the open challenges in MIR. Chord estimation (or recognition) describes the process of extracting musical chord labels from digitally encoded music pieces. Given an audio file, the specific chord symbol and its temporal position and duration have to be determined automatically. The main evaluation programme for MIR is the annual Music Information Retrieval Evaluation eXchange (MIREX) challenge. It consists of challenges in different sub-tasks of MIR, including chord recognition. Often improving one task can influence the performance in other tasks, e.g., finding a better beat estimate can improve the performance of finding the temporal positions of chord changes, or improve the task of querying by tapping. The same is the case for chord recognition. It can improve the performance of cover song identification, in which, starting from an input song, cover songs are retrieved: chord information is a useful if not vital feature for discrimination. Chord progressions also have an influence on the mood transmitted through music. Thus being able to retrieve the chords used in a music piece accurately could also be helpful for mood categorization, e.g., for personalized Internet radios.

Chord recognition is also valuable in itself. It can aid musicologists as well as hobby and professional musicians in transcribing songs. There is a great demand for chord transcriptions of well-known and also lesser-known songs. This manifests itself in the many Internet pages that hold manual transcriptions of songs, especially for guitar (e.g., Ultimate Guitar, 911Tabs, GuitarTabs). Unfortunately, these mostly contain transcriptions only of the most popular songs, and often several different versions of the same song exist. Furthermore, they are not guaranteed to be correct. Chord recognition is a difficult task which requires a lot of practice even for humans.

2 Musical Background

In this section I give an overview of important musical terms and concepts used later in this thesis. I first describe how musical notes relate to physical sound waves in section 2.1, then how chords relate to notes in section 2.2, and finally other aspects of music that play a role in automatic chord recognition in section 2.3.

2.1 Notes and Pitch

Pitch describes the perceived frequency of a sound. In Western tonality pitches are labelled by the letters A to G. The transcription of a musically relevant pitch and its duration is called a note. Pitches can be ordered by frequency, whereby a pitch is said to be higher if the corresponding frequency is higher. The human auditory system works on a logarithmic scale, which also manifests itself in music: musical pitches are ordered in octaves, repeating the note names, usually denoted in ascending order from C to B: C, D, E, F, G, A, B. We can denote different octave relationships with an additional number added as a subscript to the symbol described previously, so a pitch A0 is one octave lower than the corresponding pitch A1 one octave above. Two pitches one octave apart double in corresponding frequency. Humans typically perceive those two pitches as the same pitch (Shepard, 1964).

In music an octave is split into twelve roughly equal semitones. By definition each of the letters C to B are two semitone steps apart, excepting the steps from E to F and B to C, which are only one semitone apart. To denote the notes that lie in between the named letters, the additional symbols ♯ for a semitone step in the increasing and ♭ for a step in the decreasing frequency direction are used. For example, we can describe the musically relevant pitch between C and D both as C♯ and D♭. Because this system only defines the relationship between pitches, we need a reference frequency. In modern Western tonality the reference frequency of A4 at 440 Hz is standard (Sikora, 2003). In practice slight deviations from this reference tuning may occur, e.g., due to instrument mistuning or similar. This reference pitch thus defines the respective frequencies of the other notes implicitly through the octave and semitone relationships. We may compute the corresponding frequencies for all other notes given a reference pitch with the following equation:

$f_n = 2^{n/12} f_r$,  (1)

where $f_n$ is the frequency of the pitch $n$ semitone steps from the reference pitch with frequency $f_r$.

The human ear can perceive a frequency range of approximately 20 Hz to 20,000 Hz. In practice this frequency range is not fully used in music. For example the MIDI standard, which is more than sufficient for musical purposes in terms of octave range, covers only notes in semitone steps from C-1, corresponding to about 8.17 Hz, to G9, which is approximately 12,543.85 Hz. A standard piano keyboard covers the range from A0 at 27.5 Hz to C8 at approximately 4,186 Hz. Figure 1 depicts a standard piano keyboard in relation to the range of frequencies of MIDI standard notes, with the corresponding physical sound frequencies indicated.
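To make equation (1) concrete, the short sketch below (not part of the original thesis; it assumes the standard reference of A4 = 440 Hz) computes the frequencies of a few pitches from their semitone distance to the reference.

```python
# Minimal sketch of equation (1): f_n = 2^(n/12) * f_r, with A4 = 440 Hz as the reference pitch.
def semitone_to_freq(n, f_ref=440.0):
    """Frequency of the pitch n semitone steps away from the reference pitch."""
    return 2.0 ** (n / 12.0) * f_ref

if __name__ == "__main__":
    for name, n in [("A4", 0), ("A5 (one octave up)", 12), ("C5", 3), ("A3", -12)]:
        print(f"{name}: {semitone_to_freq(n):.2f} Hz")
```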

Figure 1: Piano keyboard and MIDI note range. White keys depict the range of the standard piano for the notes that are described by letters. Black keys deviate a semitone from a note described by a letter. The gray area depicts the extension of the MIDI standard beyond the note range of a piano.

2.2 Chords

For the purpose of this thesis we define a chord as three or more notes played simultaneously. The distance in frequency between two notes is called an interval. In a musical context we can describe an interval as the number of semitone steps two notes are apart (Sikora, 2003). A chord consists of a root note, usually the lowest note in terms of frequency. The interval relationship of the other notes played at the same time defines the chord type. Thus a chord can be defined as a root note and a type. In the following we use the notation <root-note>:<chord-type>, proposed by Harte (2010). We can refer to the notes in musical intervals in order of ascending frequency as: root note, third, fifth, and, if there is a fourth note, seventh. Table 1 shows the intervals for the chords considered in this thesis and the semitone distances for those intervals. The root note and fifth have fixed intervals. For the seventh and third, we differentiate between major and minor intervals, differing by one semitone step.

For this thesis we restrict ourselves to two different chord vocabularies to be recognized, the first one containing only major and minor chord types. Both major and minor chords consist of three notes: the root note, the third and the fifth. The interval between root note and third distinguishes major and minor chord types (see tables 1 and 2): a major chord contains a major third, while the minor chord contains a minor third. We distinguish between twelve root notes for each chord type, for a total of 24 possible chords. Burgoyne et al. (2011) propose a dataset which contains songs from the Billboard charts from the 1950s through the 1990s. This major-minor chord vocabulary accounts for 65% of the chords in that dataset. We can extend this chord vocabulary to take into account 83% of the chord types in the Billboard dataset by including variants of the seventh chords, adding an optional fourth note to a chord. Hereby, in addition to simple major and minor chords, we add 7th, major 7th and minor 7th chord types to our chord-type vocabulary. Major 7th chords and minor 7th chords are essentially major and minor chords, whereby the added fourth note has the interval major seventh and minor seventh respectively.

In addition to different chord types, it is possible to change the frequency order of the notes for different intervals by pulling one note below the root note in terms of frequency. This is called chord inversion. Thus our extended chord vocabulary containing major, minor, 7th, major 7th and minor 7th chords also contains all possible inversions. We can denote this through an additional identifier in our chord syntax: <root-note>:<chord-type>/<inversion-identifier>, where the inversion identifier can be either the 3, 5, or 7 played below the root note. For example, E:maj7/7 would be a major 7th chord consisting of the root note E, a major third, a fifth, and a major seventh, with the major seventh played below the root note in terms of frequency.

    interval         semitone steps
    root note        0
    minor third      3
    major third      4
    fifth            7
    minor seventh    10
    major seventh    11

Table 1: Semitone steps and intervals.

    chord type    intervals
    major         1, 3, 5
    minor         1, ♭3, 5
    7             1, 3, 5, ♭7
    major7        1, 3, 5, 7
    minor7        1, ♭3, 5, ♭7

Table 2: Intervals and chords. The root note is denoted as 1, the third as 3, the fifth as 5 and the seventh as 7; ♭ marks a minor (flattened) interval.

It is possible, however, that in parts of a song no instruments, or only non-harmonic instruments (e.g., percussion), are playing. To be able to interpret this case we define an additional non-chord symbol, adding one symbol to the 24 different chord symbols of the restricted chord vocabulary and leaving us with 25 different symbols. The extended chord vocabulary contains major, minor, 7th, major 7th and minor 7th chord types (depicted in table 2) and all possible inversions. So, for each root note, this leaves us with three different chord symbols each for major and minor, and four different chord symbols each for the seventh chord types, thus 216 different symbols plus an additional non-chord symbol. Furthermore, we assume that chords cannot overlap, although this is not strictly true, for example due to reverb, multiple instruments playing chords, etc. In practice, however, this overlap is negligible and reverb is often not that long. Thus we regard a chord as a continuous entity with a designated start point, end point and chord symbol (either consisting of the root note, chord type and inversion, or a non-chord symbol).

2.3 Other Structures in Music

A musical piece has several other components, some contributing additional harmonic content, for example vocals, which might also carry a linguistically interpretable message. Since a music piece has an overall harmonic structure and an inherent set of music-theoretical harmonic rules, this information also influences the chords played at any time and vice versa, but does not necessarily contribute to the chord played directly. The duration and the start and end point in time of a chord are influenced by rhythmic instruments, such as percussion. These do not contribute to the harmonic content of a music piece but nonetheless are interdependent with the other instruments in terms of timing, and thus with the beginning and end of a chord. These additional harmonic and non-harmonic components occupy the same frequency range as the components that directly contribute to the chord played. From this viewpoint, if we do not explicitly take these additional components into account, we are dealing with the additional task of filtering out the noise they introduce, on top of the task of recognizing the chords themselves.
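As a quick cross-check of the vocabulary sizes given above, the sketch below (illustrative only; the enumeration helpers are my own, with chord labels loosely following the Harte (2010) syntax) builds both vocabularies and counts 25 and 217 symbols respectively.

```python
# Illustrative enumeration of the two chord vocabularies (helper names are hypothetical).
ROOTS = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

# chord type -> chord notes usable as inversion identifiers besides root position
CHORD_TYPES = {
    "maj":  ["3", "5"],
    "min":  ["b3", "5"],
    "7":    ["3", "5", "b7"],
    "maj7": ["3", "5", "7"],
    "min7": ["b3", "5", "b7"],
}

def vocabulary(types):
    symbols = ["N"]                                   # the non-chord symbol
    for root in ROOTS:
        for ctype, inversions in types.items():
            symbols.append(f"{root}:{ctype}")         # root position
            symbols += [f"{root}:{ctype}/{inv}" for inv in inversions]
    return symbols

restricted = vocabulary({"maj": [], "min": []})       # major/minor only, no inversions
extended = vocabulary(CHORD_TYPES)                    # all five types with all inversions
print(len(restricted), len(extended))                 # -> 25 217
```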

3 Related Work

Most musical chord estimation methods can broadly be divided into two subprocesses: preprocessing of features from wave-file data, and higher-level classification of those features into chords. I first describe in section 3.1 the preprocessing steps applied to the raw waveform data, as well as the extensions and refinements of its computation steps that take more properties of waveform music data into account. An overview of higher-level classification, organized by the methods applied, is given in section 3.2. These approaches differ not only in the methods per se, but also in what kind of musical context they take into account for the final classification. More recent methods take more musical context into account and seem to perform better. Since the methods proposed in this thesis are based on machine learning, I have decided to organize the description of other higher-level classification approaches from a technical perspective rather than from a music-theoretical perspective.

3.1 Preprocessing / Features

The most common preprocessing step for feature extraction from waveform data is the computation of so-called pitch class profiles (PCPs), a human-perception-based concept coined by Shepard (1964). He conducted a human perceptual study in which he found that humans are able to perceive notes that are in octave relation as equivalent. A similar representation can be computed from waveform data for chord recognition. A PCP in a music-computational sense is a representation of the frequency spectrum wrapped into one musical octave, thus an aggregated 12-dimensional vector of the energy of the respective input frequencies. This is often called a chroma vector. A sequence of chroma vectors over time is called a chromagram. The terms PCP and chroma vector are used interchangeably in the chord recognition literature. It should be noted, however, that only the physical sound energy is aggregated: this is not purely music-harmonic information. Thus the chromagram may contain additional non-harmonic noise, such as drums, harmonic overtones and transient noise. In the following I give an overview of the basics of calculating the chroma vector and of the different extensions proposed to improve the quality of these features.

3.1.1 PCP / Chroma Vector Calculation

In order to compute a chroma vector, the input signal is broken into frames and converted to the frequency domain, which is most often done through a discrete Fourier transform (DFT), using a window function to reduce spectral leakage. Harris (1978) compares 23 different window functions and finds that the performance depends very much on the properties of the data. Since musical data is not homogeneous, there is no single best-performing windowing function. Different window functions have been used in the literature, and often the specific window function is not stated. Khadkevich and Omologo (2009a) compare the performance impact of using Hanning, Hamming and Blackman windowing functions on musical waveform data applied to the chord estimation domain. They state that the results are very similar for those three types.

However, the Hamming window performed slightly better for window lengths of 1024 and 2048 samples (for the sampling rate used in their study), which are the most common lengths in automatic chord recognition systems today.

To convert from the Fourier domain to a chroma vector, two different methods are used. Wakefield (1999) sums the energies of the frequencies in Fourier space closest to the pitch of a chroma vector bin (and its octave multiples), aggregating the energy in a discrete mapping from the spectral frequency domain to the corresponding chroma vector bin and thus converting the input directly to a chroma vector. Brown (1991) developed the so-called constant-q transform, using a kernel matrix multiplication to convert the DFT spectrogram into logarithmic frequency space. Each bin of the logarithmic frequency representation corresponds to the frequency of a musical note. After conversion into the logarithmic frequency domain, we can then simply sum up the respective bins to obtain the chroma vector representation. For both methods the aggregated sound energy in the chroma vector is usually normalized, either to sum to one or with respect to the maximum energy in a single bin. Both methods lead to similar results and are used in the current literature.

3.1.2 Minor Pitch Changes

In Western tonal music, instruments are tuned to the reference frequency of A4 above middle C (MIDI note 69), whose standard frequency is 440 Hz. In some cases the tuning of the instruments can deviate slightly, usually by less than a quarter-tone from this standard tuning (Mauch, 2010). Most humans are unable to determine an absolute pitch height without a reference pitch. We can hear the mistuning of one instrument with some practice, but it is difficult to determine a slight deviation of all instruments from the usual reference frequency described above. The bins of the chroma vectors are relative to a fixed pitch, thus minor deviations in the input will affect their quality. Minor deviations of the reference pitch can be taken into account by shifting the pitch of the chromagram bins. Several different methods have been proposed. Harte and Sandler (2005) use a chroma vector with 36 bins, 3 per semitone. Computing a histogram of energies with respect to frequency for one chroma vector and for the whole song, and examining the peak positions in the extended chroma vector, enables them to estimate the true tuning and derive a 12-bin chroma vector, under the assumption that the tuning does not deviate during the piece of music. This takes a slightly changed reference frequency into account. Gómez (2006) first restricts the input frequencies to the range from 100 to 5000 Hz to reduce the search space and to remove additional overtone and percussive noise. She uses a weighting function which aggregates spectral peaks not to one, but to several chromagram bins. The spectral energy contributions of these bins are weighted according to a squared cosine distance in frequency. Dressler and Streich (2007) treat minor tuning differences as an angle and use circular statistics to compensate for minor pitch shifts, which was later adapted by Mauch and Dixon (2010b). Minor tuning differences are quite prominent in Western tonal music, and adjusting the chromagram can lead to a performance increase, such that several other systems make use of one of the former methods, e.g., Papadopoulos and Peeters (2007, 2008), Reed et al. (2009), Khadkevich and Omologo (2009a), and Oudre et al. (2009).
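The following sketch illustrates the direct FFT-bin-to-chroma aggregation in the spirit of Wakefield (1999) as described in section 3.1.1 (a simplified illustration; the window, frequency range and normalization chosen here are assumptions, not the settings of any cited system).

```python
# Minimal sketch of mapping the magnitude spectrum of one frame to a 12-bin chroma vector.
import numpy as np

def chroma_from_frame(frame, sr, fmin=55.0, fmax=2000.0, ref=440.0):
    """Aggregate spectral magnitudes into 12 pitch-class bins (bin 0 = pitch class of the reference)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    chroma = np.zeros(12)
    for f, mag in zip(freqs, spectrum):
        if fmin <= f <= fmax:
            # semitone distance from the reference pitch, wrapped into one octave
            pitch_class = int(round(12 * np.log2(f / ref))) % 12
            chroma[pitch_class] += mag
    return chroma / (chroma.sum() + 1e-12)   # L1 normalization

# Usage: a 440 Hz sine should put most energy into the bin of the reference pitch class (A).
sr = 11025
t = np.arange(1024) / sr
print(np.argmax(chroma_from_frame(np.sin(2 * np.pi * 440 * t), sr)))  # -> 0
```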

3.1.3 Percussive Noise Reduction

Music audio often contains noise that cannot directly be used for chord recognition, such as transient or percussive noise. Percussive and transient noise is normally short, in contrast to harmonic components, which are rather stable over time. A simple way to reduce this noise is to smooth subsequent chroma vectors through filtering or averaging. Different filters have been proposed. Some researchers, e.g., Peeters (2006), Khadkevich and Omologo (2009b), and Mauch et al. (2008), use a median filter over time, after tuning and before aggregating the chroma vectors, to remove transient noise. Gómez (2006) uses several different filtering methods and derivatives based on a method developed by Bonada (2000) to detect transient noise, and leaves a window of 50 ms before and after transient noise out of the chroma vector calculation, reducing the input space. Catteau et al. (2007) calculate a background spectrum by convolving the log-frequency spectrum with a Hamming window with a length of one octave, which they subtract from the original chroma vector to reduce noise. Because there are methods to estimate the beat from the audio signal (Ellis, 2007), and chord changes are more likely to appear on these metric positions, several systems aggregate or filter the chromagram only in between those detected beats. Ni et al. (2012) use a so-called harmonic percussive sound separation algorithm described in Ono et al. (2008), which attempts to split the audio signal into percussive and harmonic components. After that they use the median chroma feature vector as the representation for the complete chromagram between two beats. A similar approach is used by Weil et al. (2009), who also use a beat tracking algorithm and average the chromagram between two consecutive beats. Glazyrin and Klepinin (2012) calculate a beat-synchronous smoothed chromagram and propose a modified Prewitt filter, borrowed from edge detection in image recognition, to suppress non-harmonic spectral components.

3.1.4 Repeating Patterns

Musical pieces have a very repetitive structure, e.g., in popular music higher-level structures such as verse and chorus are repeated, and usually those are themselves repetitions of different harmonic (chord) patterns. These structures can be exploited to improve the chromagram by recognizing and averaging or filtering those repetitive parts to remove local deviations. Repetitive parts can also be estimated and used later in the classification step to increase performance. Mauch et al. (2009) first perform a beat estimation and smooth the chroma vectors in a prefiltering step. Then a frame-by-frame similarity matrix is computed from the beat-synchronous chromagram and the song is segmented into an estimate of verse and chorus. This information is used to average the beat-synchronous chromagram. Since beat estimation is a current research topic itself and often does not work perfectly, there might be errors in the beat positions. Cho and Bello (2011) argue that it is advantageous to use recurrence plots with a simple threshold operation to find similarities at the chord level for later averaging, thus leaving out the segmentation of the song into chorus and verse as well as beat detection. Glazyrin and Klepinin (2012) build upon and alter the system of Cho and Bello. They use a normalized self-similarity matrix on the computed chroma vectors, using Euclidean distance as a comparison measure.
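A minimal illustration of the median-filter smoothing of a chromagram over time described in section 3.1.3 is sketched below (the filter length is an arbitrary assumption, not a value taken from the cited papers).

```python
# Sketch of temporal median-filter smoothing of a chromagram to suppress transient noise.
import numpy as np
from scipy.ndimage import median_filter

def smooth_chromagram(chromagram, length=9):
    """chromagram: array of shape (n_frames, 12); median filter along the time axis only."""
    return median_filter(chromagram, size=(length, 1), mode="nearest")

# Usage: a single noisy frame (e.g., a drum hit) is largely removed by the filter.
chroma = np.tile(np.eye(12)[0], (50, 1))   # constant harmonic energy pattern
chroma[25] = np.random.rand(12)            # one transient, noisy frame
print(np.allclose(smooth_chromagram(chroma)[25], chroma[0]))  # -> True
```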

3.1.5 Harmonic / Enhanced Pitch Class Profile

One problem in the computation of PCPs in general is to find an interpretation for overtones (energy at integer multiples of the fundamental frequency), since these might generate energy at frequencies that contribute to chroma vector bins other than the actual notes of the respective chord. For example, the overtones of A4 (440 Hz) are at 880 Hz and 1320 Hz, the latter being close to E6 (MIDI note 88) at approximately 1318.5 Hz. Several different ways to address this have been proposed. In most cases the frequency range that is taken into account is restricted, e.g., to approximately 100 Hz to 5000 Hz (Lee, 2006; Gómez, 2006); most of the harmonic content is contained in this interval. Lee (2006) refines the chroma vector by computing the so-called harmonic product spectrum, in which the product of the energy at octave multiples (up to a certain number) is calculated for each bin. The chromagram is then computed on the basis of this harmonic product spectrum. He states that multiplying the fundamental frequency with its octave multiples can decrease noise on notes that are not contained in the original piece of music. Additionally he finds a reduction of the noise induced by false harmonics compared to conventional chromagram calculation. Gómez (2006) proposes an aggregation function for the computation of the chroma vector in which the energy of the frequency multiples is summed, but first weighted by a decay factor which depends on the multiple. Mauch and Dixon (2010a) use a non-negative least-squares method to find a linear combination of note profiles in a dictionary matrix to compute a log-frequency representation similar to the constant-q transform mentioned earlier.

3.1.6 Modelling Human Loudness Perception

Human loudness perception is not directly proportional to the power or amplitude spectrum (Ni et al., 2012), thus the different representations described above do not model human perception accurately. Ni et al. (2012) describe a method to incorporate this through a log10 scale for the sound power with respect to frequency. Pauws (2004) uses a tangential weighting function to achieve a similar goal for key detection. An improvement in the quality of the resulting chromagram compared to non-loudness-weighted methods is reported.

3.1.7 Tonnetz / Tonal Centroid

Another representation of harmonics is the so-called Tonnetz, which is attributed to Euler in the 18th century. It is a planar representation of musical notes on a 6-dimensional polytope, where pitch relations are mapped onto its vertices. Close musical harmonic relations (e.g., fifths and thirds) have a small Euclidean distance. Harte et al. (2006) describe a way to compute a Tonnetz from a 12-bin chroma vector, and report a performance increase for a harmonic change detection function compared to standard methods. Humphrey et al. (2012) use a convolutional neural network to model a projection function from waveform (FFT) input to a Tonnetz. They perform experiments on the task of chord recognition with a Gaussian mixture model, and report that the Tonnetz output representation outperforms state-of-the-art chroma vectors.

3.2 Classification

The majority of chord recognition systems compute a chromagram using one or a combination of the methods described above. Early approaches use predefined chord templates and compare them with the computed frame-wise chroma features from audio pieces, which are then classified. With the supply of more and more hand-annotated data, more data-driven learning approaches have been developed. The most prominent data-driven model adopted is taken from speech recognition: the hidden Markov model (HMM). Bayesian networks, which are a generalization of HMMs, are also used frequently. Recent approaches propose to take more musical context into account to increase performance, such as local key, bass note, beat and song structure segmentation. Although most chord recognition systems rely on the computation of single chroma vectors, more recent approaches compute two chroma vectors for each frame: a bass and a treble chromagram (differing in frequency range), as it is reasoned that the sequence of bass notes has an important role in the harmonic development of a song and can colour the treble chromagrams due to harmonics.

3.2.1 Template Approaches

The chroma vector, as an estimate of the harmonic content of a frame of a music piece, should contain peaks at the bins that correspond to the chord notes played. Chord template approaches use chroma-vector-like templates. These can be either predefined through expert knowledge, or learned from data. Those templates are then compared, using a fitting function, with the computed chroma vector of each frame. The frame is then classified as the chord symbol corresponding to the best-fitting template. The first research paper explicitly concerned with chord recognition is by Fujishima (1999), which constitutes a non-machine-learning system. Fujishima first computes simple chroma vectors as described above. He then uses predefined 12-dimensional binary chord patterns (1 for notes present in the chord and 0 for notes not present) and computes the inner product with the chroma vector. For real-world chord estimation, the set of chords consists of schemata for triadic harmonic events, and to some extent more complex chords such as sevenths and ninths. Fujishima's system was only used on synthesized sound data, however. Binary chord templates with an enhanced chroma vector using harmonic overtone suppression were used by Lee (2006). Other groups use a more elaborate chromagram with tuning correction (36 bins) for minor pitch changes and a reduced set of chord types to be recognized (Harte and Sandler, 2005; Oudre et al., 2009). Oudre et al. (2011) extend the methods already mentioned by comparing different filtering methods, as described in section 3.1.3, and different measures of fit (Euclidean distance, Kullback-Leibler divergence and Itakura-Saito divergence) to select the most suitable chord template. They also take harmonic overtones of chord notes into account, such that bins in the templates for notes not occurring in the chord do not necessarily have to be zero. Glazyrin and Klepinin (2012) use quasi-binary chord templates, in which the tonic and the 5th are enhanced and the template is normalized afterwards. The templates are compared to smoothed and fine-tuned chroma vectors.
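The following sketch illustrates the binary-template matching idea in the spirit of Fujishima (1999) described above (a simplified illustration restricted to major and minor triads; the scoring and template set are not those of any particular cited system).

```python
# Minimal sketch of binary chord-template matching with an inner product.
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def template(root, intervals):
    """Binary 12-bin template with ones at the chord notes."""
    t = np.zeros(12)
    t[[(root + i) % 12 for i in intervals]] = 1.0
    return t

# Major (0,4,7) and minor (0,3,7) triads for every root: 24 templates.
templates = {f"{PITCH_CLASSES[r]}:{name}": template(r, iv)
             for r in range(12)
             for name, iv in [("maj", (0, 4, 7)), ("min", (0, 3, 7))]}

def classify(chroma):
    """Return the chord label whose template has the largest inner product with the chroma vector."""
    return max(templates, key=lambda label: np.dot(templates[label], chroma))

# Usage: an idealized C-major chroma vector.
print(classify(template(0, (0, 4, 7))))  # -> "C:maj"
```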

Chord templates do not have to be in the form of chroma vectors. They can also be modelled as a Gaussian, or as a mixture of Gaussians as used by Humphrey et al. (2012), in order to get a probabilistic estimate of a chord likelihood. To eliminate short spurious chords that only last a few frames, they use a Viterbi decoder. They do not use the chroma vector for classification, but a Tonnetz as described in section 3.1.7. The transformation function is learned from data by a convolutional neural network. It should be noted that basically all chord template approaches can model chord probabilities that can in turn be used as input for higher-level classification methods or for temporal smoothing such as hidden Markov models, described in section 3.2.2, as shown by Papadopoulos and Peeters (2007).

3.2.2 Data-Driven Higher Context Models

The recent increase in the availability of hand-annotated data for chord recognition has spawned new machine-learning-based methods. In the chord-recognition literature, different approaches have been proposed, from neural networks, to systems adopted from speech recognition, to support vector machines and others. More recent machine learning systems seem to capture more and more of the context of music. In this section I describe the higher-level classification models found in the literature, organized by the machine learning methods used.

Neural Networks. Su and Jeng (2001) try to model the human auditory system with artificial neural networks. They perform a wavelet transform (as an analogy to the ear) and feed the output into a neural network (as an analogy to the cerebrum) for classification. They use a self-organizing map to determine the style of chord and the tonality (C, C#, etc.). The system was tested on classical music to recognize 4 different chord types (major, minor, augmented, and diminished). Zhang and Gerhard (2008) propose a system based on neural networks to detect basic guitar chords and their voicings (inversions) with the help of a voicing vector and a chromagram. The neural network in this case is first trained to identify and output the basic chords; a later post-processing step determines the voicing. Osmalsky et al. (2012) build a database with several different instruments playing single chords individually, part of it recorded in a noisy and part of it in a noise-free environment. They use a feed-forward neural net with a chroma vector as input to classify 10 different chords and experiment with different subsets of their training set.

HMM. Neural networks do not take time dependencies between subsequent inputs into account. In music pieces there is a strong interdependency between subsequent chords, which renders a classification of chords for a whole music piece difficult to model based solely on neural networks. Since template- and neural-net-based approaches do not explicitly take the temporal properties of music into account, a widely adopted method is to use a hidden Markov model, which has proven to be a good tool in the related field of speech recognition. The chroma vector is treated as the observation, which can be modelled by different probability distributions, and the states of the HMM are the chord symbols to be extracted. Sheh and Ellis (2003) pioneered HMMs for real-world chord recognition. They propose that the emission distribution be a single Gaussian with 24 dimensions, trained from data with expectation maximization.

Burgoyne et al. (2007) state that a mixture of Gaussians is more suitable as the emission distribution. They also compare the use of Dirichlet distributions as the emission distribution and conditional random fields as the higher-level classifier. HMMs are used with slightly different chromagram computations and training initialisations according to prior music-theoretic knowledge by Bello and Pickens (2005). Lee (2006) builds upon the systems of Bello and Pickens and of Sheh and Ellis, generates training data from symbolic files (MIDI) and uses an HMM for chord extraction. Papadopoulos and Peeters (2007) compare several different methods of determining the parameters of the HMM and the observation probabilities. They conclude that a template-based approach combined with an HMM with a cognitive-based transition matrix shows the best performance. Later, Papadopoulos and Peeters (2008, 2011) propose an HMM approach focusing on (and extracting) beat estimates to take musical beat addition, beat deletion or changes in meter into account and thereby enhance recognition performance. Ueda et al. (2010) use harmonic percussive sound separation chromagram features and an HMM for classification. Chen et al. (2012) cluster song-level duration histograms to take time duration explicitly into account in a so-called duration-explicit HMM. Ni et al. (2012) present the best-performing system of the 2012 MIREX challenge in chord estimation. It works on the basis of an HMM, bass and treble chroma, and beat and key detection.

Structured SVM. Weller et al. (2009) compare the performance of HMMs and support vector machines (SVMs) for chord recognition and achieve state-of-the-art results using support vector machines.

n-grams. Language and music are closely related. Both spoken language and music rely on audio data. Thus it makes sense to apply spoken-language-recognition approaches to music analysis and chord recognition. A dominant approach for language recognition is the n-gram model. A bigram model (n = 2) is essentially a hidden Markov model, in which one state only depends on the previous one. Cheng et al. (2008) compare 2-, 3-, and 4-grams, thus making one chord dependent on multiple previous chords. They use it for song similarity after a chord recognition step. In their experiments the simple 3- and 4-grams outperform the basic HMM system of Harte and Sandler (2005); they state that n-grams are able to learn the basic rules of chord progressions from hand-annotated data. Scholz et al. (2009) use a 5-gram model and compare different smoothing techniques, finding that modelling more complex chords with 7ths and 9ths should be possible with n-grams. They do not state how the features are computed and interpreted.

Dynamic Bayesian Networks. Musical chords develop meaning in their interplay with other characteristics of a music piece, such as bass note, beat and key: they cannot be viewed as an isolated entity. These interdependencies are difficult to model with a standard HMM approach. Bayesian networks are a generalization of HMMs in which the musical context can be modelled more intuitively. Bayesian networks give the opportunity to model interdependencies simultaneously, creating a more sound model of music pieces from a music-theoretic perspective.

Another advantage of a Bayesian network is that it can directly extract multiple types of information, which may not be a priority for the task of chord recognition, but is an advantage for the extended task of general transcription of music pieces. Cemgil et al. (2006) were among the first to introduce Bayesian networks for music computation. They do not apply their system to chord recognition but to polyphonic music transcription (transcription on a note-by-note basis). They implement a special case of the switching Kalman filter. Mauch (2010) and Mauch and Dixon (2010b) make use of a Bayesian network and incorporate beat detection, bass note and key estimation. The observations of the Bayesian network in their system are treble and bass chromagrams. Dixon et al. (2011) compare a similar system to a logic-based system.

Deep Learning Techniques. Deep learning techniques have beaten the state of the art in several benchmark problems in recent years, although for the task of chord recognition they are a relatively unexplored method. There are three recent publications using deep learning techniques. Humphrey and Bello (2012) call for a change in the conventional approach of using a variation of the chroma vector and a higher-level classifier, since they state that recent improvements seem to bring only diminishing returns. They present a system consisting of a convolutional neural network with several layers, trained to learn a Tonnetz from a constant-q-transformed FFT, which is subsequently classified with a Gaussian mixture model. Boulanger-Lewandowski et al. (2013) make use of deep learning techniques with recurrent neural networks. They use different techniques, including a Viterbi-like algorithm from HMMs and beam search, to take temporal information into account. They report upper-bound results comparable to the state of the art using the Beatles Isophonics dataset (see section 6.5 for a dataset description) for training and testing. Glazyrin (2013) uses stacked denoising autoencoders with a 72-bin constant-q transform input, trained to output chroma vectors. A self-similarity algorithm is applied to the neural network output, which is later classified with a deterministic algorithm, similar to the template approaches mentioned above.

4 Stacked Denoising Autoencoders

In this section I give a description of the theoretical background of the stacked denoising autoencoders used for the two chord recognition systems in this thesis, following Vincent et al. (2010). First a definition of autoencoders and their training method is given in section 4.1, then it is described how this can be extended to form a denoising autoencoder in section 4.2. We can stack denoising autoencoders to train them in an unsupervised manner and possibly obtain a useful higher-level data abstraction by training several layers, which is described in section 4.3.

4.1 Autoencoders

Autoencoders or autoassociators try to find an encoding of given data in the hidden layers. Similar to Vincent et al. (2010) we define the following. We assume a supervised learning scenario with a training set of $n$ tuples of inputs $x$ and targets $t$, $D_n = \{(x_1, t_1), \dots, (x_n, t_n)\}$, where $x \in \mathbb{R}^d$ if the input is real-valued, or $x \in [0, 1]^d$. Our goal is to infer a new, higher-level representation $y$ of $x$. The new representation again is $y \in \mathbb{R}^{d'}$ or $y \in [0, 1]^{d'}$, depending on whether a real-valued or binary representation is assumed.

Encoder. A deterministic mapping $f_\theta$ that transforms the input $x$ into a hidden representation $y$ is called an encoder. It can be described as follows:

$y = f_\theta(x) = s(Wx + b)$,  (2)

where $\theta = \{W, b\}$, $W$ is a $d' \times d$ weight matrix and $b$ an offset (or bias) vector of dimension $d'$. The function $s(x)$ is a non-linear mapping, e.g., a sigmoid activation function $1/(1+e^{-x})$. The output $y$ is called the hidden representation.

Decoder. A deterministic mapping $g_{\theta'}$ that maps the hidden representation $y$ back to input space by constructing a vector $z = g_{\theta'}(y)$ is called a decoder. Typically this is either a linear mapping:

$z = g_{\theta'}(y) = W'y + b'$,  (3)

or a mapping followed by a non-linearity:

$z = g_{\theta'}(y) = s(W'y + b')$,  (4)

where $\theta' = \{W', b'\}$, $W'$ is a $d \times d'$ weight matrix and $b'$ an offset (or bias) vector of dimension $d$. Often the restriction $W' = W^T$ is imposed on the weights. $z$ can be regarded as an approximation of the original input data $x$, reconstructed from the hidden representation $y$.
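As a concrete reading of equations (2)-(4), the sketch below implements the encoder and decoder mappings in numpy with tied weights W' = W^T (the shapes, initialization and sigmoid choice are illustrative assumptions, not the thesis configuration).

```python
# Minimal numpy sketch of the encoder/decoder mappings in equations (2)-(4) with tied weights.
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden = 512, 128                        # input and hidden dimensionality (illustrative)
W = rng.normal(scale=0.01, size=(d_hidden, d))
b = np.zeros(d_hidden)                        # encoder bias
b_prime = np.zeros(d)                         # decoder bias

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x):                                # y = f_theta(x) = s(Wx + b)
    return sigmoid(W @ x + b)

def decode(y):                                # z = g_theta'(y) = s(W'y + b') with W' = W^T
    return sigmoid(W.T @ y + b_prime)

x = rng.random(d)                             # a real-valued input vector
z = decode(encode(x))                         # reconstruction of x
print(z.shape, float(np.mean((x - z) ** 2)))  # reconstruction error before any training
```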

Figure 2: Conventional autoencoder training. A vector x from the training set is projected by f_θ(x) to the hidden representation y, and hereafter projected back to input space using g_θ'(y) to compute z. The loss function L(x, z) is calculated and used as the training objective for minimization.

Training. The idea behind such a model is to obtain a good hidden representation y, from which the decoder is able to reconstruct the original input as closely as possible. It can be shown that finding the optimal parameters for such a model can be viewed as maximizing a lower bound on the mutual information between the input and the hidden representation in the first layer (Vincent et al., 2010). To estimate the parameters we define a loss function. For binary input $x \in [0,1]^d$ this can be the cross-entropy:

$L(x, z) = -\sum_{k=1}^{d} \left[ x_k \log(z_k) + (1 - x_k) \log(1 - z_k) \right]$,  (5)

or, for real-valued input $x \in \mathbb{R}^d$, the squared error objective:

$L(x, z) = \lVert x - z \rVert^2$.  (6)

Since we use real-valued input data, this squared error objective is used as the loss function in this thesis. Given this loss function we want to minimize the average loss (Vincent et al., 2008):

$\theta^*, \theta'^* = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L\big(x^{(i)}, z^{(i)}\big) = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L\Big(x^{(i)}, g_{\theta'}\big(f_\theta(x^{(i)})\big)\Big)$,  (7)

where $\theta^*$ and $\theta'^*$ denote the optimal parameters of the encoding and decoding functions (which might be tied) for which the loss function is minimized, and $n$ is the number of training samples. This minimization can be performed iteratively by backpropagation. Figure 2 visualizes the training procedure for an autoencoder.

If the hidden representation y is of the same dimensionality as the input x, it is trivial to construct a mapping that yields zero reconstruction error: the identity mapping. Obviously this constitutes a problem, since merely learning the identity mapping does not lead to any higher level of abstraction.

To evade this problem a bottleneck is introduced, for example by using fewer nodes for the hidden representation, thus reducing its dimensions. It is also possible to impose a penalty on the network activations to form a bottleneck, and thus train a sparse network. These additional restrictions force the neural network to focus on the most informative parts of the data, leaving out noisy, uninformative parts. Several layers can be trained in a greedy manner to achieve a yet higher level of abstraction.

Enforcing Sparsity. To prevent autoencoders from learning the identity mapping, we can penalize activation. This is described by Hinton (2010) for restricted Boltzmann machines, but can be used for autoencoders as well. The general idea is that nodes which fire very frequently are less informative, i.e., a node that is always active does not add any useful information and could be left out. We can enforce sparsity by adding a penalty term for large average activations over the whole dataset to the backpropagated error. We can compute the average activation of hidden unit $j$ over all training samples with:

$\hat{p}_j = \frac{1}{n} \sum_{i=1}^{n} f_\theta^j(x^{(i)})$.  (8)

In this thesis the following addition to the loss function is used, which is derived from the KL divergence:

$L_p = \beta \sum_{j=1}^{h} \left( p \log \frac{p}{\hat{p}_j} + (1 - p) \log \frac{1 - p}{1 - \hat{p}_j} \right)$,  (9)

where $\hat{p}_j$ is the average activation of hidden unit $j$ over the complete training set, $n$ is the number of training samples, $p$ is a target activation parameter and $\beta$ a penalty weighting parameter, all specified beforehand. The bound $h$ is the number of hidden nodes. For a sigmoidal activation function $p$ is usually set to a value close to zero. A frequent setting for $\beta$ is 0.1. This ensures that units will have a large activation only on a limited number of training samples and otherwise have an activation close to zero. We now simply add this weighted activation error term to the loss $L(x, z)$ described above.
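The sparsity penalty of equations (8) and (9) can be computed as sketched below (a minimal illustration; the target activation p = 0.05 is an assumed placeholder, since only "close to zero" is specified above, while β = 0.1 follows the text).

```python
# Sketch of the sparsity penalty: average hidden activations compared against a target p.
import numpy as np

def sparsity_penalty(hidden_activations, p=0.05, beta=0.1):
    """hidden_activations: array of shape (n_samples, n_hidden) with sigmoid outputs in (0, 1)."""
    p_hat = hidden_activations.mean(axis=0)          # equation (8): average activation per unit
    p_hat = np.clip(p_hat, 1e-8, 1 - 1e-8)           # numerical safety
    kl = p * np.log(p / p_hat) + (1 - p) * np.log((1 - p) / (1 - p_hat))
    return beta * kl.sum()                           # equation (9), added to the reconstruction loss

activations = np.random.default_rng(1).random((1000, 64)) * 0.5
print(sparsity_penalty(activations))                 # large when units are far from the target p
```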

4.2 Autoencoders and Denoising

Vincent et al. (2010) propose another training criterion in addition to the bottleneck. They state that an autoencoder can also be trained to clean a partially corrupted input, which is called denoising. If noisy input is assumed, it can be beneficial to corrupt (parts of) the input of the autoencoder during training and use the uncorrupted input as the target. The autoencoder is hereby encouraged to reconstruct a clean version of the corrupted input. This can make the hidden representation of the input more robust to noise, and can potentially lead to a better higher-level abstraction of the input data. Vincent et al. (2010) state that different types of noise can be considered: masking noise, i.e., setting a random fraction of the input to 0; salt-and-pepper noise, i.e., setting a random fraction of the input to either 0 or 1; and, especially for real-valued input, isotropic additive Gaussian noise, i.e., adding noise drawn from a Gaussian distribution to the input.

To achieve this, we corrupt the initial input $x$ into $\tilde{x}$ according to a stochastic mapping $\tilde{x} \sim q_D(\tilde{x} \mid x)$. This corrupted input is then projected to the hidden representation as described before by means of $y = f_\theta(\tilde{x}) = s(W\tilde{x} + b)$. Then we can reconstruct $z = g_{\theta'}(y)$. The parameters $\theta$ and $\theta'$ are trained to minimize the average reconstruction error between the output $z$ and the uncorrupted input $x$; in contrast to conventional autoencoders, $z$ is now a deterministic function of $\tilde{x}$ rather than of $x$. For our purpose, using additive Gaussian noise, we can train the denoising autoencoder with the squared error loss function $L(x, z) = \lVert x - z \rVert^2$. Parameters can be initialized at random and then optimized by backpropagation. Figure 3 depicts the training of a denoising autoencoder.

Figure 3: A vector x from the training set is corrupted by q_D and converted to the hidden representation y. The loss function L(x, z) is calculated from the output and the uncorrupted input and used for training.
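A minimal sketch of one denoising training step with additive Gaussian corruption and the squared-error loss is given below (single example, tied weights, plain gradient descent; the learning rate and noise level are assumptions, and a real implementation would use minibatches and proper stopping criteria).

```python
# Sketch of a single denoising-autoencoder gradient step with additive Gaussian corruption.
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden, lr, sigma = 256, 64, 0.01, 0.1
W = rng.normal(scale=0.01, size=(d_hidden, d))
b, b_prime = np.zeros(d_hidden), np.zeros(d)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_step(x):
    """One gradient step on the denoising objective for a single example x (shape (d,))."""
    global W, b, b_prime
    x_tilde = x + rng.normal(scale=sigma, size=x.shape)   # corruption: x~ drawn from q_D(x~|x)
    y = sigmoid(W @ x_tilde + b)                          # hidden representation of corrupted input
    z = W.T @ y + b_prime                                 # linear decoder with tied weights W' = W^T
    delta_z = 2 * (z - x)                                 # gradient of ||x - z||^2 w.r.t. z
    delta_y = (W @ delta_z) * y * (1 - y)                 # backpropagate through the sigmoid encoder
    W -= lr * (np.outer(delta_y, x_tilde) + np.outer(y, delta_z))
    b -= lr * delta_y
    b_prime -= lr * delta_z
    return float(np.sum((x - z) ** 2))

x = rng.random(d)
print([round(train_step(x), 3) for _ in range(3)])        # error typically decreases on this example
```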

4.3 Training Multiple Layers

If we want to train deep networks (or initialize their parameters for later supervised backpropagation), we need a way to extend the approach from a single layer, as described in the previous sections, to multiple layers. As described by Vincent et al. (2010), this can easily be achieved by repeating the process for each layer separately. Figure 4 depicts such greedy layer-wise training. First we propagate the input x through the already trained layers; note that we do not use additional corruption noise yet. Next we use the uncorrupted hidden representation of the previous layer as input for the layer we are about to train, and train this specific layer as described in the previous sections. The input to the layer to be trained is first corrupted by $q_D$ and then projected into latent space using $f^{(2)}_\theta$. We then project it back to the input space of that specific layer with $g^{(2)}_{\theta'}$. Using an error function L, we can optimize the projection functions with respect to the defined error, and therefore possibly obtain a useful higher-level representation. This process can be repeated several times to initialize a deep neural network structure, circumventing the usual problems that arise when initializing deep networks at random and then applying backpropagation, as shown in Vincent et al. (2010). Next we can apply a classifier to the output of this deep neural network trained to suppress noise. Alternatively, we can add another layer of hidden nodes for classification purposes on top of the previously unsupervised-trained network structure and apply standard backpropagation to fine-tune the network weights according to our supervised training targets t.

Figure 4: Training of several layers in a greedy unsupervised manner. The input is propagated without corruption. To train an additional layer, the output of the first layer is corrupted by q_D and the weights are adjusted with f^(2)_θ and g^(2)_θ' using the respective loss function. After training for this layer is completed, we can train subsequent layers.

Dropout. Hinton et al. (2012) were able to improve performance on several other recognition tasks, including MNIST for handwritten digit recognition and TIMIT, a database for speech recognition, by randomly omitting a fraction of the hidden nodes from training for each sample. This is in essence training a different model for each training sample, with an iteration on that one training sample only. According to Hinton et al. this prevents the network from overfitting. In the testing phase we make use of the complete network again. Thus what we are effectively doing with dropout is averaging: averaging many models trained on one training sample each. This has yielded improvements in different modelling tasks (Hinton et al., 2012).
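The dropout idea described above can be sketched as follows (illustrative only; the test-time rescaling by the keep probability is a common implementation detail that is not spelled out in the text above).

```python
# Minimal sketch of dropout on one hidden layer: omit random units during training,
# use the full (rescaled) layer at test time.
import numpy as np

rng = np.random.default_rng(0)

def hidden_forward(x, W, b, drop_prob=0.5, training=True):
    h = np.tanh(W @ x + b)
    if training:
        mask = rng.random(h.shape) >= drop_prob   # randomly omit units for this sample
        return h * mask
    return h * (1.0 - drop_prob)                  # test time: all units, scaled by keep probability

W, b = rng.normal(size=(16, 8)), np.zeros(16)
x = rng.random(8)
print(hidden_forward(x, W, b).round(2))           # some units zeroed out during training
print(hidden_forward(x, W, b, training=False).round(2))
```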

5 Chord Recognition Systems

In this section I describe the structure of three different approaches to classifying chords.

1. We first describe the structure of a comparison system: a simplified version of the Harmony Progression Analyzer as proposed by Ni et al. (2012). The features computed can be considered state of the art. We discard, however, additional context information like key, bass and beat tracking, since the neural network approaches developed in this thesis do not take this into account (although it should be noted that in principle the approaches developed in this thesis could be extended to take this additional context information into account as well). The simplified version of the Harmony Progression Analyzer will serve as a reference system for performance comparison.

2. A neural network initialized by stacked denoising autoencoder pretraining, with later backpropagation fine-tuning, can be applied to an excerpt of the FFT to estimate chord probabilities directly, which can then be smoothed with the help of an HMM to take temporal information into account (a minimal decoding sketch is given below). We substitute the emission probabilities with the output of the stacked denoising autoencoders.

3. This approach can be extended by adding filtered versions of the FFT over different time spans to the input. We extend the input to include two additional vectors, median-smoothed over different time spans. Here again additional temporal smoothing is applied in a post-classification process.

In section 5.1 we describe the comparison system and briefly the key ideas incorporated in the computation of state-of-the-art features. Since the two other approaches described in this thesis make use of stacked denoising autoencoders that interpret the FFT directly, we describe beneficial pre-processing steps in section 5.2.1. In section 5.2.2 we describe a stacked denoising autoencoder approach for chord recognition in which the outputs are chord symbol probabilities directly, and in section 5.2.3 we propose an extension of this approach, inspired by a system developed for face recognition and phone recognition by Tang and Mohamed (2012) using a so-called multi-resolution deep belief network, and apply it to chord recognition with stacked denoising autoencoders. Appendix A describes the theoretical foundation of a joint optimization of the HMM and neural network for chord recognition.

5.1 Comparison System

In this section we describe a basic comparison system for the other approaches implemented. It reflects the structure of most current approaches and uses state-of-the-art features for chord recognition. Most recent chord recognition systems rely on an improved computation of the PCP vector and take extra information into account, such as bass notes or key information. This extra information is usually incorporated into a more elaborate higher-level framework, such as multiple HMMs or a Bayesian network.
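The HMM-based temporal smoothing used in approaches 2 and 3 above can be made concrete with a minimal Viterbi decoding sketch over frame-wise chord probabilities (toy transition and emission values; the thesis systems learn these quantities from data).

```python
# Sketch of Viterbi decoding of frame-wise chord probabilities into a smoothed chord sequence.
import numpy as np

def viterbi(frame_probs, transition, initial):
    """frame_probs: (n_frames, n_chords) per-frame chord probabilities used as emission scores."""
    n_frames, n_chords = frame_probs.shape
    log_delta = np.log(initial) + np.log(frame_probs[0])
    backpointers = np.zeros((n_frames, n_chords), dtype=int)
    for t in range(1, n_frames):
        scores = log_delta[:, None] + np.log(transition)   # (from_state, to_state)
        backpointers[t] = np.argmax(scores, axis=0)
        log_delta = scores[backpointers[t], np.arange(n_chords)] + np.log(frame_probs[t])
    path = [int(np.argmax(log_delta))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]

# Toy example with 3 chord symbols: a strong self-transition bias smooths out the spurious frame.
probs = np.array([[.8, .1, .1], [.7, .2, .1], [.2, .6, .2], [.8, .1, .1], [.7, .2, .1]])
A = np.full((3, 3), 0.05) + np.eye(3) * 0.85
print(viterbi(probs, A, initial=np.ones(3) / 3))            # -> [0, 0, 0, 0, 0]
```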

The comparison system consists of the computation of state-of-the-art PCP vectors for all frames, but uses only a single HMM for the later classification and temporal alignment of chords, which allows for a fairer comparison with the stacked denoising autoencoder approaches. The basic computation steps described in the following are used in the approach described by Ni et al. (2012). They split the computation of features into a bass chromagram and a treble chromagram, and track them with two additional HMMs. The computed frames are aligned according to a beat estimate. To make this more elaborate system comparable, we compute only one chromagram containing both bass and treble, use a single HMM for temporal smoothing, and do not align frames according to a beat estimate. We first describe, in section 5.1.1, the very basic PCP features that have been used predominantly in chord recognition for 15 years; hereafter, in section 5.1.2, we describe the extensions of the basic PCP used in the comparison system.

5.1.1 Basic Pitch Class Profile Features

The basic pipeline for computing a pitch class profile as a feature for chord recognition consists of three steps:

1. The signal is projected from the time to the frequency domain through a Fourier transform. Often files are downsampled to 11,025 Hz to allow for faster computation; this is also done in the reference system. The range of frequencies is restricted through filtering, to analyse only frequencies below, e.g., 4000 Hz (about the range of the keyboard of a piano, see figure 1) or similar, since other frequencies carry less information about the chord notes played and introduce more noise to the signal. In the reference system a frequency range starting at approximately 55 Hz is used, as proposed in the original system (Ni et al., 2012).

2. The second step consists of a constant-q transform, which projects the amplitude of the signal in linear frequency space to a logarithmic representation of the signal amplitude, in which each constant-q transform bin represents the spectral energy with respect to the frequency of a musical note.

3. In a third step the bins representing one musical note and its octave multiples are summed, and the resulting vector is sometimes normalized.

In the following we describe the constant-q transform and the computation of the PCP in more detail.

Constant-Q transform. After converting the signal from the time to the frequency domain through a discrete or fast Fourier transform, we can apply an additional transform to make the frequency bins logarithmically spaced. This transform can be viewed as a set of filters in the time domain, which filter a frequency band according to a logarithmic scaling of the center frequencies of the constant-q bins. Originally it was proposed as an additional term in the Fourier transform, but it has been shown by Brown and Puckette (1992) to be computationally more efficient to filter the signal in Fourier space, thus applying the set of filters transformed into Fourier space to the signal also in Fourier space.

Constant-Q transform

After converting the signal from the time to the frequency domain through a discrete or fast Fourier transform, we can apply an additional transform to make the frequency bins logarithmically spaced. This transform can be viewed as a set of filters in the time domain, which filter a frequency band according to a logarithmic scaling of the center frequencies of the constant-Q bins. Originally it was proposed as an additional term in the Fourier transform, but it has been shown by Brown and Puckette (1992) to be computationally more efficient to filter the signal in Fourier space, thus applying the set of filters, transformed into Fourier space, to the signal also in Fourier space. This can be realized with a matrix multiplication. This transformation process to logarithmically spaced bins is called the constant-Q transform (Brown, 1991). The name stems from the factor Q, which describes the relationship between the center frequency of each filter and the filter width:

Q = f_k / Δf_k,

where Q is a so-called quality factor which stays constant, f_k is the center frequency and Δf_k the width of the filter. We can choose the filters such that they filter out the energy contained in musically relevant frequencies (i.e., frequencies corresponding to musical notes):

f_{k_cq} = (2^{1/B})^{k_cq} · f_min,    (10)

where f_min is the frequency of the lowest musical note to be filtered, f_{k_cq} the center frequency corresponding to constant-Q bin k_cq, and B the number of constant-Q frequency bins per octave, usually B = 12 (one bin per semitone). Setting Q = 1 / (2^{1/B} − 1) establishes the link between musically relevant frequencies and the filter widths of our filterbank. Different types of filters can be used to aggregate the energy in relevant frequencies and to reduce spectral leakage. For the comparison system we make use of a Hamming window, as described by Brown and Puckette (1992):

w(n, f_{k_cq}) = 0.54 + 0.46 cos(2πn / M(f_{k_cq})),    (11)

where n = −M(f_{k_cq})/2, ..., M(f_{k_cq})/2 − 1, and M(f_{k_cq}) is the window size, computable from Q, the center frequency f_{k_cq} of constant-Q bin k_cq, and the sampling rate f_s of the input signal (Brown, 1991):

M(f_{k_cq}) = Q · f_s / f_{k_cq}.    (12)

We can now compute the filters and thus the respective sound power in the signal filtered according to a musically relevant set of center frequencies. Instead of applying these filters in the time domain, it is computationally more efficient to do so in the spectral domain, by projecting the window functions into Fourier space first. We can then apply the filters through a matrix multiplication in frequency space. As denoted by Brown and Puckette (1992), for bin k_cq of the constant-Q transform we can write:

X_cq[k_cq] = (1/N) Σ_{k=0}^{N−1} X[k] K[k, k_cq],    (13)

where k_cq describes the constant-Q transform bin, X[k] the signal amplitude at bin k in the Fourier domain, N the number of Fourier bins and K[k, k_cq] the value of the Fourier transform of our filter w(n, f_{k_cq}) for constant-Q bin k_cq at Fourier bin k. Choosing the right minimum frequency and quality factor will result in constant-Q bins corresponding to harmonically relevant frequencies. Having transformed the linearly spaced amplitude per frequency to musically spaced constant-Q transform bins, we can now continue to aggregate notes that are one octave apart, hereby reducing the dimension of the feature vector significantly.
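Before moving on, the constant-Q computation just described can be made concrete with a minimal NumPy sketch. The sampling rate, FFT length and note range below are illustrative assumptions, not necessarily the values of the reference implementation, and the conjugate of the kernel is used in the product, following Brown and Puckette (1992).

```python
import numpy as np

def constant_q_kernel(fs=11025, f_min=55.0, bins_per_octave=12, n_octaves=5, n_fft=8192):
    """Spectral kernel K[k, k_cq]: FFTs of Hamming-windowed complex filters."""
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)         # quality factor
    n_bins = bins_per_octave * n_octaves
    kernel = np.zeros((n_fft, n_bins), dtype=complex)
    for k_cq in range(n_bins):
        f_k = f_min * 2.0 ** (k_cq / bins_per_octave)         # center frequency, eq. (10)
        M = int(round(Q * fs / f_k))                           # window length, eq. (12)
        n = np.arange(M) - M // 2
        window = 0.54 + 0.46 * np.cos(2 * np.pi * n / M)       # Hamming window, eq. (11)
        temporal = np.zeros(n_fft, dtype=complex)
        temporal[:M] = window * np.exp(2j * np.pi * f_k * n / fs) / M
        kernel[:, k_cq] = np.fft.fft(temporal)                 # filter projected into Fourier space
    return kernel

def constant_q(frame_fft, kernel):
    """Constant-Q magnitudes of one FFT frame via a matrix multiplication, eq. (13)."""
    return np.abs(frame_fft @ np.conj(kernel)) / kernel.shape[0]
```

A frame would be used as `constant_q(np.fft.fft(signal_frame, n=8192), kernel)`; precomputing the kernel once is what makes the per-frame cost a single matrix multiplication.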

PCP Aggregation

Shepard's (1964) experiments on the human perception of music suggest that humans can perceive notes one octave apart as belonging to the same group of notes, known as pitch classes. Given these results, we compute pitch class profiles based on the signal energy in the logarithmic spectral space. As described by Lee (2006):

PCP[k] = Σ_{m=0}^{N_cq − 1} X_cq(k + mB),    (14)

where k = 1, 2, ..., B is the index of the PCP bin and N_cq is the number of octaves in the frequency range of the constant-Q transform. Usually B = 12, so that one bin is computed for each musical note in one octave. For pre-processing, e.g., correction of minor tuning differences, B = 24 or B = 36 are also sometimes used. Hereafter the resulting vector is usually normalized, typically with respect to the L1, L2 or L∞ norm.

5.1.2 Comparison System: Simplified Harmony Progression Analyzer

In this section I describe the refinements made to the very basic chromagram computation defined above. The state-of-the-art system proposed by Ni et al. (2012) takes additional context into account. They state that tracking the key and the bass line provides important context and useful additional information for recognizing musical chords. For a more accurate comparison with the stacked denoising autoencoder approaches, which cannot easily take such context into account, we discard the musical key, bass and beat information that is used by Ni et al. We compute the features with the code that is freely available from their website and adjust it to a fixed step size of 1024 samples (approximately 0.09 s per frame at the downsampled rate), instead of a beat-aligned step size. In addition to a so-called harmonic percussive sound separation algorithm, as described by Ono et al. (2008), which attempts to split the signal into a harmonic and a percussive part, Ni et al. implement a loudness-based PCP vector and correct for minor tuning deviations.
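Before turning to those refinements, the plain octave aggregation of equation (14) and the subsequent normalization can be sketched as follows; the bin count and the choice of norm here are assumptions for illustration.

```python
import numpy as np

def pcp_from_cq(cq_magnitudes, bins_per_octave=12, norm="L2"):
    """Fold constant-Q magnitudes into a pitch class profile, eq. (14)."""
    n_octaves = len(cq_magnitudes) // bins_per_octave
    folded = cq_magnitudes[:n_octaves * bins_per_octave].reshape(n_octaves, bins_per_octave)
    pcp = folded.sum(axis=0)                        # sum bins that are whole octaves apart
    if norm == "L1":
        denom = np.sum(np.abs(pcp))
    elif norm == "L2":
        denom = np.sqrt(np.sum(pcp ** 2))
    else:                                           # L-infinity
        denom = np.max(np.abs(pcp))
    return pcp / denom if denom > 0 else pcp
```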

Harmonic Percussive Sound Separation

Ono et al. (2008) describe a method to discriminate between the percussive contribution to the Fourier transform and the harmonic one. This can be achieved by exploiting the fact that percussive sounds most often manifest themselves as bursts of energy spanning a wide range of frequencies but lasting only a limited time, whereas harmonic components span a limited frequency range but are more stable over time. Ono et al. present a way to estimate the percussive and harmonic parts of the signal contribution in Fourier space as an optimization problem which can be solved iteratively. Let F_{h,i} be the short-time Fourier transform of an audio signal f(t) and W_{h,i} = |F_{h,i}|^2 its power spectrogram. We minimize the L2 norm of the power spectrogram gradients, J(H, P), with H_{h,i} the harmonic component and P_{h,i} the percussive component, h the frequency bin and i the time index in Fourier space:

J(H, P) = (1 / 2σ_H^2) Σ_{h,i} (H_{h,i−1} − H_{h,i})^2 + (1 / 2σ_P^2) Σ_{h,i} (P_{h−1,i} − P_{h,i})^2,    (15)

subject to the constraints that

H_{h,i} + P_{h,i} = W_{h,i},    (16)
H_{h,i} ≥ 0,    (17)
P_{h,i} ≥ 0,    (18)

where W_{h,i} is the original power spectrogram, as described above, and σ_H and σ_P are parameters controlling the smoothness of the harmonic component over time and of the percussive component over frequency, respectively. Details of the iterative optimization procedure can be found in the original paper.
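To illustrate the idea (this is not the exact update rule derived by Ono et al., whose procedure is in the original paper), a simplified alternating scheme that encourages H to be smooth along time and P to be smooth along frequency, while keeping H + P = W, might look like this:

```python
import numpy as np

def simple_hpss(W, n_iter=30, kernel=17):
    """Illustrative harmonic/percussive split of a power spectrogram W (freq x time).

    A simplified stand-in for the iterative procedure of Ono et al. (2008):
    H is smoothed along time, P along frequency, and the constraint H + P = W
    is re-enforced after every iteration.
    """
    H = 0.5 * W
    P = 0.5 * W
    pad = kernel // 2
    for _ in range(n_iter):
        # moving-average smoothing of H along the time axis
        Hp = np.pad(H, ((0, 0), (pad, pad)), mode="edge")
        H = np.mean([Hp[:, k:k + W.shape[1]] for k in range(kernel)], axis=0)
        # moving-average smoothing of P along the frequency axis
        Pp = np.pad(P, ((pad, pad), (0, 0)), mode="edge")
        P = np.mean([Pp[k:k + W.shape[0], :] for k in range(kernel)], axis=0)
        # renormalize so that H + P = W while both stay non-negative
        total = H + P + 1e-12
        H = W * H / total
        P = W * P / total
    return H, P
```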

Tuning and Loudness-Based PCPs

Here we describe further refinements of the PCP vector: first how to take minor deviations (less than a semitone) from the reference tuning into account, and then an addition proposed by Ni et al. (2012) to model human loudness perception.

Tuning

To take minor pitch shifts in the tuning of a specific song into account, the features are fine-tuned as described by Harte and Sandler (2005). Instead of computing a 12-bin chromagram directly, we can compute multiple bins for each semitone, as described in section 5.1.1, by setting B > 12 (e.g., B = 36). We can then compute a histogram of sound power peaks with respect to frequency and select a subset of constant-Q bins to compute the PCP vectors, shifting our reference tuning according to the small deviations of a song.

Loudness-Based PCPs

Since human loudness perception is not linear with respect to frequency, Ni et al. (2012) propose a loudness weighting function. First we compute a sound power level matrix:

L_{s,t} = 10 log_{10}( |X_{s,t}|^2 / p_ref ),    s = 1, ..., S,  t = 1, ..., T,    (19)

where p_ref indicates the fundamental reference power and X_{s,t} the constant-Q transform of our input signal as described in the previous section (s denoting the constant-Q transform bin and t the time). They propose to use A-weighting (Talbot-Smith, 2001), in which a frequency-dependent value is added. An approximation to the human sensitivity of loudness perception with respect to frequency is then given by:

L'_{s,t} = L_{s,t} + A(f_s),    s = 1, ..., S,  t = 1, ..., T,    (20)

where

A(f_s) = 20 log_{10}( R_A(f_s) ) + 2.0,    (21)

and

R_A(f_s) = 12194^2 · f_s^4 / ( (f_s^2 + 20.6^2) · √((f_s^2 + 107.7^2)(f_s^2 + 737.9^2)) · (f_s^2 + 12194^2) ).    (22)

Having calculated this, we can proceed to compute the pitch class profiles as described above, using L'_{s,t}. Ni et al. normalize the loudness-based PCP vector after aggregation according to:

X'_{p,t} = ( X_{p,t} − min_p X_{p,t} ) / ( max_p X_{p,t} − min_p X_{p,t} ),    (23)

where X_{p,t} denotes the value of PCP bin p at time t. Ni et al. state that due to this normalization, specifying the reference sound power level p_ref is not necessary.
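A small sketch of this loudness weighting follows. The A-weighting constants are the standard ones from the literature, assumed here to match equations (21) and (22), and p_ref is arbitrary because of the normalization in equation (23).

```python
import numpy as np

def a_weighting_db(f):
    """Standard A-weighting curve in dB for frequencies f in Hz, cf. eq. (21)-(22)."""
    f2 = np.asarray(f, dtype=float) ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * np.log10(ra) + 2.0

def loudness_weighted(cq_power, center_freqs, p_ref=1.0):
    """Sound power levels (eq. 19) plus A-weighting (eq. 20), per constant-Q bin and frame."""
    levels = 10.0 * np.log10(cq_power / p_ref + 1e-12)        # eq. (19), shape (S, T)
    return levels + a_weighting_db(center_freqs)[:, None]     # eq. (20), broadcast over time
```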

HMMs

In this section we give a brief overview of the hidden Markov model (HMM), as far as it is important for this thesis. It is a widely used model for speech as well as chord recognition. A musical song is highly structured in time: certain chord sequences and transitions are more common than others. PCP features, however, do not take any time dependencies into account by themselves, so a temporal alignment can increase the performance of a chord recognition system. Additionally, since we compute the PCP features from the amplitude of the signal alone, which is noisy with regard to chord information due to percussion, transient noise and other sources, the resulting feature vector is not clean. HMMs in turn are designed to deal with noisy data, which adds another argument for using HMMs for temporal smoothing.

Definition

There exist several variants of HMMs. For our comparison system we restrict ourselves to an HMM with a single Gaussian emission distribution for each state. For the stacked denoising autoencoders we use the output of the autoencoders directly as a chord estimate and as emission probability. An HMM with a Gaussian emission probability is a so-called continuous-density HMM. It is capable of interpreting multidimensional, real-valued input such as the PCP vectors we use as features, described above in section 5.1. An HMM estimates the probability of a sequence of latent states corresponding to a sequence of lower-level observations. As described by Rabiner (1989), an HMM can be defined as a 5-tuple consisting of:

1. N, the number of states in the model.

2. M, the number of distinct observations, which in the case of a continuous-density HMM is infinite.

3. A = {a_ij}, the state transition probability distribution, where a_ij = P(q_{t+1} = S_j | q_t = S_i), 1 ≤ i, j ≤ N, and q_t denotes the current state at time t. If the HMM is ergodic (i.e., all transitions from every state to every state are possible), then a_ij > 0 for all i and j. The transition probabilities satisfy the stochastic constraints Σ_{j=1}^{N} a_ij = 1 for 1 ≤ i ≤ N.

4. B = {b_j(O)}, the set of observation probabilities, which in our case is infinite. b_j(O_t) = P(O_t | q_t = S_j) is the observation probability in state j, with 1 ≤ j ≤ N, for observation O_t at time t. If we assume a continuous-density HMM, i.e., a real-valued, possibly multidimensional input, we can use a (mixture of) Gaussian distributions for the probability distribution b_j(O):

b_j(O_t) = Σ_{m=1}^{M} Z_jm N(O_t; µ_jm, Σ_jm),    with 1 ≤ j ≤ N,

where O_t is the input vector at time t, Z_jm the mixture weight (coefficient) of the m-th mixture component in state j, and N(O; µ_jm, Σ_jm) the Gaussian probability density function with mean vector µ_jm and covariance matrix Σ_jm for state j and component m.

5. π = {π_i}, where π_i = P(q_1 = S_i), with 1 ≤ i ≤ N. This is the initial state probability.

Parameter Estimation

We define the states to be the 24 chord symbols plus the non-chord symbol for the simple major-minor chord discrimination task, and 217 different symbols for the extended chord vocabulary, including major, minor, 7th and inverted chords and the non-chord symbol. The features in the case of the baseline system are computed as a 12-bin PCP vector, with a single Gaussian as emission model for the HMM. In the case of the stacked denoising autoencoder systems, we can use the output of the networks directly as emission probabilities. Since we are dealing with a fully annotated dataset, it is trivial to estimate the initial state probabilities and the transitions by computing relative frequencies with the help of the supplied ground truth. In the case of a Gaussian emission model, we can estimate the parameters from training data with the EM algorithm (McLachlan et al., 2004).

Likelihood of a Sequence

To compute the likelihood that given observations belong to a certain chord sequence, we can compute:

P(q_1, q_2, ..., q_T, O_1, O_2, ..., O_T | λ) = π_{q_1} b_{q_1}(O_1) Π_{t=2}^{T} a_{q_{t−1}, q_t} b_{q_t}(O_t),    (24)

where π_{q_1} is the initial probability of the state at time 1, b_{q_1}(O_1) the emission probability of the first observation, a_{q_{t−1}, q_t} the transition probability from the state at time t−1 to the state at time t, and b_{q_t}(O_t) the emission probability of observation O_t at time t. λ denotes the parameters of our HMM. The most likely sequence of hidden states for given observations can be computed efficiently with the help of the Viterbi algorithm (see Rabiner, 1989, for details).
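A minimal Viterbi decoder in log space is sketched below. It works on a generic emission matrix, so it can take either Gaussian log-likelihoods (comparison system) or, later, logarithms of network outputs; the interface itself is an illustrative assumption, not the implementation used in the experiments.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence given log initial probabilities (N,),
    log transition matrix (N, N) and log emission matrix (T, N)."""
    T, N = log_B.shape
    delta = np.zeros((T, N))
    backptr = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A            # scores[i, j]: from state i to state j
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores[backptr[t], np.arange(N)] + log_B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):                        # backtrack the best path
        path[t] = backptr[t + 1, path[t + 1]]
    return path
```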

5.2 Stacked Denoising Autoencoders for Chord Recognition

A piece of music contains additional non-harmonic information, or harmonic information that does not directly contribute to the chord played at a certain time in the song. This can be considered noise for the objective of estimating the correct chord progression of a song. Since stacked denoising autoencoders are trained to reduce artificially added noise, they seem to be a suitable choice for application to noisy data, and they have been shown to achieve state-of-the-art performance on several benchmark tests, including audio genre classification (Vincent et al., 2010). Moreover, deep learning architectures can be partly trained in an unsupervised manner, which might prove useful for a field like chord recognition, since there is a huge amount of unlabeled digitized musical data available, but only a very limited fraction of it is annotated.

In this section I describe two systems relying on stacked denoising autoencoders for chord recognition. The preprocessing of the input data follows the same basic steps for the two stacked denoising autoencoder approaches, described in section 5.2.1. All approaches make use of an HMM to smooth and interpret the neural network output as a post-classification step. Since the chord ground truth is given, we are also able to calculate a perfect PCP and train stacked denoising autoencoders to approximate it from the given FFT input. A description of how to apply a joint optimization procedure for the HMM and neural network for chord recognition, taken from speech recognition, is given in appendix A (this did not yield any further improvements, however). Furthermore, it is possible to train a stacked denoising autoencoder to model chord probabilities directly, which are then smoothed by an HMM, as described in section 5.2.2. Hereafter, in section 5.2.3, I propose an extension to this approach in which the input of the stacked denoising autoencoders is extended to cover multiple resolutions, smoothed over different time spans.

5.2.1 Preprocessing of Features for Stacked Denoising Autoencoders

In all approaches described below, we apply the stacked denoising autoencoders directly to the Fourier-transformed signal. This minimizes the preprocessing steps and the restrictions they impose, but some preprocessing of the input can still increase performance:

1. To restrict the search space, only the first 1500 FFT bins are used. This restricts the frequency range to approximately 0 to 3000 Hz. Most of the frequencies emitted by harmonic instruments are still contained in this interval.

2. Since values taken directly from the FFT contain high-energy peaks, we apply a square root compression, as done by Boulanger-Lewandowski et al. (2013) for deep belief networks.

3. We then normalize the FFT frames according to the L2 norm in a final preprocessing step.
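A minimal sketch of these three steps (the bin count is the one given above; everything about the surrounding pipeline is assumed):

```python
import numpy as np

def preprocess_frame(fft_magnitudes, n_bins=1500):
    """Truncate, compress and normalize one FFT magnitude frame."""
    x = np.abs(fft_magnitudes[:n_bins])    # step 1: keep only the first 1500 bins
    x = np.sqrt(x)                          # step 2: square-root compression
    norm = np.linalg.norm(x)                # step 3: L2 normalization
    return x / norm if norm > 0 else x
```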

34 5.2.2 Stacked Denoising Autoencoders for Chord Recognition chord symbols SDAE single frame preprocessing FFT input, one time frame Figure 5: Stacked denoising autoencoder for chord recognition, single resolution. Humphrey et al. (2012) state that the performance of chord recognition systems has not improved significantly recently, and suggest that one reason could be the widespread usage of PCP features. They try to find a different representation by modelling a Tonnetz under usage of convolutional neural networks. Cho and Bello (2014), who evaluate the influence on performance of different parts of chord recognition systems, also come to the conclusion that the choice of feature computation has a great influence on the overall performance and suggest the exploration of other types of features differing from the PCP. A nice property of deep learning approaches is that they are often able to find a higher level representation of the input data by themselves and do not rely on predefined feature computation. When classifying data, we can train a neural network to output pseudoprobabilities for each class given an input. This is done through a final logistic regression layer (or softmax) for the output of the neural network. We use a softmax output and a 1-of-K encoding, such that we have K outputs, each of which can be interpreted as a probability of a certain chord being played. Thus we can use the output of a 1-of-K encoding softmax output layer neural network directly as substitute for the emission probability of the HMM and further process it with temporal smoothing to compute a final chord symbol output. Since deep learning provides us with a powerful strategy for neural network training, we are able to discard all steps of the conventional PCP vector computation and restrictions that might be imposed by them apart from the FFT and train the network to classify chords. This differs from previous approaches 33
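Concretely, the connection to the HMM can be sketched as follows; the `predict_proba` interface, returning a frame-by-chord matrix of softmax outputs, is a hypothetical placeholder rather than the actual toolbox call.

```python
import numpy as np

def emission_log_probs(frames, network):
    """Turn frame-wise softmax outputs of the network into HMM log emission scores."""
    probs = network.predict_proba(frames)   # hypothetical interface: (T, K) softmax outputs
    return np.log(probs + 1e-12)            # used as log_B in a Viterbi decoder
```

The resulting matrix simply takes the place of the Gaussian log-likelihoods in the Viterbi decoder sketched earlier.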

This approach differs from previous ones such as Boulanger-Lewandowski et al. (2013) and Glazyrin (2013), who use deep learning techniques but still model PCPs, either as an intermediate target or as the output of the neural network. Figure 5 depicts the processing pipeline of the system. This system, with a single input frame, is referred to as the stacked denoising autoencoder (SDAE).

5.2.3 Multi-Resolution Input for Stacked Denoising Autoencoders

Figure 6: Stacked denoising autoencoder for chord recognition, multi-resolution.

Glazyrin (2013), who uses stacked denoising autoencoders (with and without recurrent layers) to estimate PCP vectors from the constant-Q transform, states that he suspects it to be beneficial to take multiple subsequent frames into account, but also writes that informal experiments did not show any improvement in recognition performance. Boulanger-Lewandowski et al. (2013) also make use of a recurrent layer with a deep belief network to take temporal information into account before additional (HMM) smoothing. Both approaches thus reason that it might be beneficial to take temporal information into account before using an HMM as a final computation step. We can find a similar paradigm in Tang and Mohamed (2012), used with deep learning. They propose a system in which images of faces are analyzed by a deep belief network. In addition to the original image, they propose extending the input with differently subsampled versions of the image for face recognition, and report improved performance over a single-resolution input.

They also report improved performance when extending the classifier input to several inputs with different subsampling ranges, applied to phone recognition with temporal smoothing by deep belief networks on the TIMIT dataset.

The system proposed in this thesis is likewise designed to take additional temporal information into account before the HMM post-processing. Following the intuition of Glazyrin and the idea of Tang and Mohamed, we extend the input of the stacked denoising autoencoder by computing two different time resolutions of the FFT and concatenating them with the original input of the stacked denoising autoencoders. In addition to the original FFT vector, we apply a median filter over two different ranges of subsequent frames around the current frame. After median filtering, each vector is preprocessed as described in section 5.2.1. We then join the resulting vectors and use them as frame-wise input for the stacked denoising autoencoders.

Cho and Bello (2014) conduct experiments to evaluate the influence on performance of the most prevalent constituents of chord recognition systems. They find that pre-smoothing has a significant impact on chord recognition performance in their experiments. They state that through filtering we can eliminate or reduce transient noise, which is generated by short bursts of energy such as percussive instruments, although this has the disadvantage of also smearing chord boundaries. In the proposed system, however, we supply both the original input, in which the chord boundaries are sharp but which contains transient noise, and versions that are smoothed. Cho and Bello (2014) compare average filtering and median filtering and find that there is little to no difference in terms of recognition performance. We use a median filter instead of an average filter since it is the prevalent approach in chord recognition; median filters are applied in several other approaches, e.g., Peeters (2006) or Khadkevich and Omologo (2009b), to reduce transient noise. The stacked denoising autoencoders are again trained to output chord probabilities by fine-tuning with traditional backpropagation. In the following we refer to this as a multi-resolution stacked denoising autoencoder (MR-SDAE). Figure 6 illustrates the processing pipeline of the MR-SDAE.
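A sketch of how such a multi-resolution input could be assembled is given below. The filter spans of ±3 and ±9 frames are the ones reported in the experimental setup later on; everything else (array shapes, bin count) is an illustrative assumption.

```python
import numpy as np
from scipy.ndimage import median_filter

def multi_resolution_input(fft_frames, spans=(3, 9), n_bins=1500):
    """Concatenate each frame with median-smoothed versions over wider time spans.

    fft_frames: array of shape (T, n_fft_bins); returns an array of shape (T, 3 * n_bins).
    """
    views = [fft_frames]
    for span in spans:
        # median over the current frame plus `span` frames on each side (time axis only)
        views.append(median_filter(fft_frames, size=(2 * span + 1, 1), mode="nearest"))
    processed = []
    for view in views:
        x = np.sqrt(np.abs(view[:, :n_bins]))                 # truncate + square-root compression
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        processed.append(x / np.maximum(norms, 1e-12))        # frame-wise L2 normalization
    return np.concatenate(processed, axis=1)
```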

6 Results

Finding suitable training and testing sets for chord estimation is difficult, because transcribing the chords of a song requires a significant amount of training, even for humans. Only experts are able to transcribe chord progressions accurately and in full detail. Furthermore, most musical pieces are subject to strict copyright laws. This poses the problem that ground truth and audio data are delivered separately. Different recordings of the same song might not fit the available ground truth exactly due to minor temporal deviations. There are, fortunately, tools to align ground truth data and audio files. For the following experiments, Dan Ellis's AUDFPRINT tool was used to align audio files with the publicly available ground truth.

We report results on two different datasets: a transcription of 180 Beatles songs, and the publicly available part of the McGill Billboard dataset, containing 740 songs. The Beatles dataset has been available for several years, and as other training data is scarce, many algorithms published in the MIREX challenge have been pretrained on this dataset. Because of the same scarcity of good data, the MIREX challenge has also used the Beatles dataset (with a small number of additional songs) to evaluate the performance of chord recognition algorithms, and thus the official results on the Beatles dataset might be biased. We report cross-validation performance, in which we train the algorithm on a subset of the data and test it on the remaining unseen part. This we repeat ten times for different subsets of the dataset, and report the average performance and 95% confidence interval, to give an estimate of how the proposed methods might perform on unseen data. However, the Beatles dataset is composed of songs by a single group of musicians, which itself might bias the results, since musical groups tend to have a certain style of playing music. Therefore we also conduct experiments on the Billboard dataset, which is not restricted to one group of musicians but contains popular songs sampled from the Billboard Hot 100 charts starting in 1958. Additionally, the Billboard dataset contains more songs, thus providing us with more training examples. To compare the proposed methods to other methods, we use the training and testing split of the MIREX 2012 challenge, a subset of the McGill Billboard dataset that was unpublished before 2012 but is available now. Although there are more recent results on the Billboard dataset (MIREX 2013), the test set ground truth for that part of the dataset has not yet been released.

Deep learning neural network training was implemented with the help of Palm's deep learning MATLAB toolbox (Palm, 2012). HMM smoothing was realized with functions from Kevin Murphy's Bayes Net MATLAB toolbox. Computation of the state-of-the-art features was done using Ni et al.'s code.

In the following, I first explain how we measure the performance of the algorithms, in section 6.2. Training algorithms to learn the set of all possible chords is infeasible at this point in time, due to the number of possible chords and the relative frequencies with which chords appear in the publicly available datasets. Certain chords appear in popular songs more frequently than others.

We therefore train the algorithms to recognize a set of chord symbols containing only major and minor chords, which we call the restricted chord vocabulary, and a set containing major, minor, 7th and inverted chords, which we call the extended chord vocabulary. In section 6.1, I describe how to interpret chords that are not part of these sets. Results for both chord symbol subsets on the Beatles dataset are reported in section 6.5 for the reference system, SDAE and MR-SDAE. Results for both subsets on the Billboard dataset are reported in section 6.6. The results of other algorithms submitted to MIREX 2013 for the Billboard test set used in this thesis are stated in a later section.

6.1 Reduction of Chord Vocabulary

As described in section 2.2, the chords considered in this thesis consist of three or four notes with distinct interval relationships to the root note. We have a certain set of chord symbols in each of the two chord symbol sets: the first contains only major and minor chords with three notes, the second extends this set with 7th and inverted chords. For the Billboard dataset these two subsets are already supplied. For the Beatles dataset, we need to reduce the chords in the ground truth to match the chord symbol sets we want to recognize, since those are fully detailed transcriptions which contain chord symbols not in our defined subsets. Some chords are an extension of other chords; e.g., C:maj7 can be seen as an extension of C:maj, since the former contains the same notes as the latter plus an additional fourth note at interval 7 above the root note C. We thus reduce all other chords in the ground truth according to the following set of rules:

1. If the ground truth chord symbol is in the subset of chord symbols to be recognized, leave it unchanged.

2. If a subset of its notes matches a chord symbol in the recognition set, use that symbol instead of the original ground truth symbol (e.g., C:maj7 is mapped to C:maj for the restricted vocabulary).

3. If no subset of the chord's notes matches a symbol in the recognition set, denote it as a non-chord (e.g., C:dim is mapped to the non-chord symbol).
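A sketch of this reduction as a lookup on chord quality; the quality table below is a small, assumed excerpt for the restricted vocabulary, not the full mapping used in the experiments.

```python
# Illustrative reduction of detailed chord labels to a restricted major/minor vocabulary.
REDUCE_TO_MAJMIN = {
    "maj": "maj", "maj7": "maj", "7": "maj",     # qualities containing a major triad
    "min": "min", "min7": "min",                  # qualities containing a minor triad
    "dim": None, "aug": None,                     # no major or minor triad subset
}

def reduce_label(label):
    """Map e.g. 'C:maj7' -> 'C:maj', 'C:dim' -> 'N' (non-chord), 'N' -> 'N'."""
    if label == "N":
        return "N"
    root, _, quality = label.partition(":")
    quality = quality or "maj"                    # bare root labels treated as major (assumption)
    reduced = REDUCE_TO_MAJMIN.get(quality, None)
    return f"{root}:{reduced}" if reduced else "N"
```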

6.2 Score Computation

The results reported use a method of measurement that has been proposed by Harte (2010) and Mauch (2010): the weighted chord symbol recall (WCSR). In the following, a description of how it is computed is provided.

6.2.1 Weighted Chord Symbol Recall

Most chord recognition algorithms, including the ones proposed here, work on a discretized input space, but the ground truth is measured in continuous segments with a start time, an end time and a distinct chord symbol, so we need a measure to estimate the performance of any proposed algorithm. This could be achieved by simply discretizing the ground truth according to the discretization of the estimation and then performing a frame-wise comparison. However, Harte (2010) and Mauch (2010) propose a more accurate measure. The frame-wise comparison can be enhanced by computing the relative overlap of matching chord segments between the continuous-time ground truth and the frame-wise estimation of chord symbols by the recognition system. This is called chord symbol recall (CSR):

CSR = ( Σ_{i,j} |S^A_i ∩ S^E_j| ) / ( Σ_i |S^A_i| ),    (25)

where S^A_i is one segment of the hand-annotated ground truth, S^E_j one segment of the machine estimation, and the sum in the numerator runs only over pairs of segments carrying the same chord symbol. The test set for musical chord recognition usually contains several songs, each of which has a different length and contains a different number of chords. We can thus extend the CSR to a corpus of songs by summing the results for each song, weighted by its length. This is the weighted chord symbol recall (WCSR), used for evaluating performance on a corpus containing several songs:

WCSR = ( Σ_{i=1}^{N} L_i · CSR_i ) / ( Σ_{i=1}^{N} L_i ),    (26)

where L_i is the length of song i and CSR_i the chord symbol recall between the machine estimation and the hand-annotated segments of song i.
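A sketch of the corpus-level computation, assuming per-song lists of (start, end, label) segments for both annotation and estimation:

```python
def csr(annotated, estimated):
    """Chord symbol recall, eq. (25): overlap of equally labelled segments
    divided by the total annotated duration. Segments are (start, end, label) tuples."""
    overlap = sum(
        max(0.0, min(a_end, e_end) - max(a_start, e_start))
        for a_start, a_end, a_label in annotated
        for e_start, e_end, e_label in estimated
        if a_label == e_label
    )
    total = sum(a_end - a_start for a_start, a_end, _ in annotated)
    return overlap / total if total > 0 else 0.0

def wcsr(songs):
    """Weighted chord symbol recall, eq. (26), over (annotated, estimated) pairs per song."""
    lengths = [sum(a_end - a_start for a_start, a_end, _ in ann) for ann, _ in songs]
    scores = [csr(ann, est) for ann, est in songs]
    return sum(l * s for l, s in zip(lengths, scores)) / sum(lengths)
```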

6.3 Training Systems Setup

In the course of the experiments, the following parameters were found to be suitable. The stacked denoising autoencoders are trained with 30 iterations of unsupervised training with additive Gaussian noise of variance 0.2 and a fraction of corrupted inputs of 0.7. The autoencoders have two hidden layers with 800 hidden nodes each and a sigmoid activation function; the output layer contains as many nodes as there are chord symbols. To enforce sparsity, an activation penalty weighting of β = 0.1 and a target activation of p = 0.05 are used. The dropout rate is set to 0.5, and batch training with a batch size of 100 samples is used. The learning rate is set to 1 and the momentum to 0.5. For the MR-SDAE, the previous and subsequent 3 frames are used for the second input vector, and the previous and subsequent 9 frames for the third input vector. Due to memory restrictions, only a subset of the frames of the complete training set is used for training the stacked denoising autoencoder based systems. 10% of the training data is set aside for validation during training. Additionally, I extended Palm's deep learning library with an early stopping mechanism, which stops supervised training when the performance on the validation set does not improve for 20 iterations, or else after 500 iterations, to restrict computation time. It then returns the best-performing weight configuration according to the training validation. For the comparison system, since not all chords of the extended chord vocabulary are included in all datasets, missing chords are substituted with the mean PCP vector of the training set. Malformed covariance matrices are corrected by adding a small amount of random noise.

6.4 Significance Testing

Similar to Mauch and Dixon (2010a), a Friedman multiple comparison test is used to test for significant differences in performance between the proposed algorithms and the reference system. This tests the performance of the different algorithms on a song level, and thus differs from the WCSR, which takes the song length into account in the final score. The Friedman multiple comparison test measures the statistical significance of ranks, thus indicating whether an algorithm outperforms another with statistical significance on a song level, without regard to the WCSR of the songs in general. For the purpose of testing for statistical significance, we select one fold of the cross-validation on the Beatles dataset on which the performance is close to the mean, and one test run for the Billboard dataset which is likewise close to the mean for the SDAE-based approaches. All plots for the post hoc Friedman multiple comparison test show the mean rank and the 95% confidence interval in terms of ranks.

6.5 Beatles Dataset

The Beatles Isophonics dataset contains songs by the Beatles and Zweieck. We only use the Beatles songs for evaluating the performance of the algorithms, since it is difficult to come by the audio data of the Zweieck songs. The Beatles-only subset of this dataset consists of 180 songs. In sections 6.5.1 and 6.5.2, the results for the restricted and extended chord vocabulary are reported for the comparison system, SDAE and MR-SDAE. The cross-validation performance across ten folds is shown. We partition the dataset into ten subsets, of which we use one for testing and nine for training. For the first fold we use every tenth song from the Beatles dataset starting from the first, as ordered in the ground truth, for the second fold every tenth song starting from the second, and so on. We train ten different models, one for each testing partition. Since we use an HMM smoothing step, for the neural network approaches we show raw results without HMM smoothing as well as the final performance of the systems with temporal smoothing. The reference system uses the HMM even for classification, and thus we report only a single final performance statistic for it. All results are reported as WCSR, as described above and as used in the MIREX challenge. Since there are ten different results, one for testing on each partition, I report the average WCSR as well as a 95% confidence interval of the aggregated results. To get an insight into the distribution of the performance results, I also plot box-and-whisker diagrams. Finally, I perform Friedman multiple comparison tests for statistical significance across algorithms. Since the implementation of the learning algorithms in MATLAB is memory intensive, I subsample the training partitions for the SDAEs: for SDAE, I use every 3rd frame for training, and for MR-SDAE, every 4th frame.

6.5.1 Restricted Major-Minor Chord Vocabulary

Friedman Multiple Comparison Tests

Values are computed on fold five of the Beatles dataset, which yields a result close to the mean performance for all algorithms tested. In figure 7 the results of the post hoc Friedman multiple comparison tests for all systems, smoothed and unsmoothed, on the restricted chord vocabulary task are depicted. The algorithms showed significantly different performance.

Figure 7: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Beatles dataset on the restricted chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).

Whisker Plot and Mean Performance

In this section results for the proposed algorithms, SDAE and MR-SDAE, and the reference system on the reduced major-minor chord symbol recognition task are presented. Figure 8 depicts a box-and-whisker diagram for the performance of the algorithms with and without temporal smoothing and the performance of the reference system. The upper and lower whiskers depict the maximum and minimum performance over all results of the ten-fold cross-validation, while the upper and lower boundaries of the boxes represent the upper and lower quartiles. The median of all runs is shown as a dotted line inside the box. The average WCSR together with 95% confidence intervals over folds, before and after temporal smoothing, can be found in table 3.

Figure 8: Results for the simplified HPA, SDAE and MR-SDAE for the restricted chord vocabulary 10-fold cross-validation on the Beatles dataset, with and without HMM smoothing (y-axis: WCSR in %). Results after smoothing are highlighted in bold.

System      Not smoothed      Smoothed
S-HPA                         ± 9.32
SDAE        ±                 ± 7.41
MR-SDAE     ±                 ± 7.92

Table 3: Average WCSR for the restricted chord vocabulary on the Beatles dataset, smoothed and unsmoothed, with 95% confidence intervals.

Summary

In the Friedman multiple comparison test in figure 7, we observe that the mean ranks of the post-smoothing SDAE and MR-SDAE are significantly higher than the mean rank of the reference system (S-HPA), and also that smoothing significantly improves the performance. Mean ranks for SDAE and MR-SDAE without smoothing are lower than that of the reference system, although not significantly. MR-SDAE has a slightly higher mean rank compared to SDAE, but not significantly.

In figure 8 we can observe that the pre- and post-smoothing SDAE and MR-SDAE distributions are negatively skewed, whereas the S-HPA distribution is skewed positively. The skewness of the distributions does not change much for SDAE and MR-SDAE before versus after smoothing; smoothing does, however, improve the performance in general. In table 3, we can see that MR-SDAE slightly outperforms SDAE in mean performance and that both achieve higher mean performance than the reference system after HMM smoothing.

The means of the results before HMM smoothing for SDAE and MR-SDAE are, however, lower.

6.5.2 Extended Chord Vocabulary

Friedman Multiple Comparison Tests

Again, values are computed for fold five of the Beatles dataset. In figure 9 the results of the post hoc Friedman multiple comparison tests for all systems, smoothed and unsmoothed, on the extended chord vocabulary task are depicted. The algorithms showed significantly different performance.

Figure 9: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Beatles dataset on the extended chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).

Whisker Plots and Means

Similar to above, we depict box-and-whisker diagrams for the unsmoothed and smoothed results of the ten-fold cross-validation on the extended chord symbol set in figure 10. Table 4 shows the average WCSR and 95% confidence interval over folds for smoothed and unsmoothed results.

Figure 10: Whisker plot for the simplified HPA, SDAE, and MR-SDAE using the extended chord vocabulary and 10-fold cross-validation on the Beatles dataset, with and without smoothing (y-axis: WCSR in %). Results after smoothing are highlighted in bold.

System      Not smoothed      Smoothed
S-HPA                         ± 7.89
SDAE        ±                 ± 7.37
MR-SDAE     ±                 ± 6.81

Table 4: Average WCSR for the simplified HPA, SDAE and MR-SDAE using the extended chord vocabulary on the Beatles dataset, smoothed and unsmoothed, with 95% confidence intervals.

Summary

In the Friedman multiple comparison tests in figure 9, we can observe that again the post-smoothing performance in terms of ranks of the SDAE and MR-SDAE is significantly better than that of the reference system. In comparison to the restricted chord vocabulary recognition task, the margin is even larger. A peculiar thing to note is that with the extended chord vocabulary, the pre-smoothing performance of MR-SDAE is not significantly worse than the post-smoothing performance of either SDAE-based chord recognition system. SDAE shows lower mean ranks before smoothing than the reference system, and MR-SDAE seems to perform slightly better than the reference system, although not significantly so, before smoothing. In figure 10, we can see negatively skewed distributions of the cross-validation results for SDAE and MR-SDAE, similar to the restricted chord vocabulary setting. Again we observe that the skewness of the distributions does not change much after smoothing, but we can observe an increase in performance.

However, in the extended chord vocabulary task, the medians of the SDAE and MR-SDAE are higher than that of the reference system, with values even higher than the best performance of the reference system. On the extended chord vocabulary, the reference system does not show a distinct skew. The better performance is also reflected in table 4, where the proposed systems achieve higher means before and after HMM smoothing compared to the reference system.

6.6 Billboard Dataset

The McGill Billboard dataset consists of songs randomly sampled from the Billboard Hot 100 charts starting in 1958. The dataset currently contains 740 songs, of which we separate 160 songs for testing and use the remainder for training the algorithms. The selected test set corresponds to the official test set of the MIREX 2012 challenge. Although there are results for algorithms in the MIREX 2013 challenge on the Billboard dataset, the ground truth of that specific test set has not been publicly released at this point in time. As with the Beatles dataset, the audio files are not publicly available, but there are several different audio recordings of the songs in the dataset; we again use Dan Ellis's AUDFPRINT tool to align the audio data with the ground truth. For this dataset the ground truth is already available in the right format for the restricted major-minor chord vocabulary and the extended 7th and inverted chord vocabulary, so we do not need to reduce the chords ourselves. Since the Billboard dataset is much larger than the Beatles dataset, we sample every 8th frame for the SDAE training and every 16th for the MR-SDAE. The algorithms were run five times.

6.6.1 Restricted Major-Minor Chord Vocabulary

Friedman Multiple Comparison Tests

In figure 11 the results of the post hoc Friedman multiple comparison test for the Billboard restricted chord vocabulary task, for the reference system and the smoothed and unsmoothed SDAE and MR-SDAE, are presented. The algorithms showed significantly different performance.

Figure 11: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Billboard dataset on the restricted chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).

Mean Performance

In this section the results on the MIREX 2012 test partition of the Billboard dataset for the restricted major-minor chord vocabulary are presented. Table 5 shows the performance of the SDAE systems with and without smoothing. Since we do not perform cross-validation on this dataset and the comparison system does not have any randomized initialization, we report the 95% confidence interval for the SDAEs only, with respect to multiple random initializations (note that these are not directly comparable to the confidence intervals over cross-validation folds reported for the Beatles dataset).

System      Not smoothed      Smoothed
S-HPA
SDAE        ±                 ± 0.31
MR-SDAE     ±                 ± 0.40

Table 5: Average WCSR for the restricted chord vocabulary on the MIREX 2012 Billboard test set, smoothed and unsmoothed, with 95% confidence intervals where applicable.

Summary

Figure 11, depicting the Friedman multiple comparison test for significance, reveals that in the Billboard restricted chord vocabulary task, the reference system does not perform significantly worse than the post-smoothing SDAE and MR-SDAE. It is also notable that in this setting the pre-smoothing MR-SDAE significantly outperforms the pre-smoothing SDAE. Similar to the restricted chord vocabulary task on the Beatles dataset, the means before smoothing on the Billboard dataset are lower than those of the reference system. However, we can still observe a better pre-smoothing mean performance for MR-SDAE in comparison with SDAE. Comparing mean performance after HMM smoothing, we see no significant differences.

6.6.2 Extended Chord Vocabulary

Friedman Multiple Comparison Tests

In figure 12 the results of the post hoc Friedman multiple comparison test for the Billboard extended chord vocabulary task, for the reference system and the smoothed and unsmoothed SDAE and MR-SDAE, are presented. The algorithms showed significantly different performance.

Figure 12: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Billboard dataset on the extended chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).

Mean Performance

Table 6 depicts the performance of the reference system and the SDAE systems on the extended chord vocabulary containing major, minor, 7th and inverted chord symbols. Again, no confidence interval is reported for the reference system, since there is no random component and the results are the same over multiple runs.

System      Not smoothed      Smoothed
S-HPA
SDAE        ±                 ± 0.32
MR-SDAE     ±                 ± 0.50

Table 6: Average WCSR for the extended chord vocabulary on the MIREX 2012 Billboard test set, smoothed and unsmoothed, with 95% confidence intervals where applicable.

Summary

The Friedman multiple comparison test in figure 12 again shows significantly better performance for the post-smoothing SDAE systems in comparison to their pre-smoothing performance, and also to the reference system. MR-SDAE again seems to achieve a higher mean rank in comparison with SDAE, although this is not statistically significant. In terms of mean performance in WCSR, depicted in table 6, the pre-smoothing figures for SDAE and MR-SDAE are higher than those of the reference system. Again MR-SDAE outperforms SDAE in mean WCSR. The same is the case after smoothing: MR-SDAE slightly outperforms SDAE, and both perform better than the reference system.

6.7 Weights

In this section we visualize the weights of the input layer of the neural network trained on the Beatles dataset. Figure 13 shows an excerpt of the input layer of the neural network, with weights depicted as a grayscale image, where black denotes negative weights and white corresponds to positive weights. In figure 14 the sum of the absolute values of all weights for each FFT input is plotted. The vertical lines depict FFT bins that correspond to musically important frequencies, i.e., musical notes.

Figure 13: Excerpt of the weights of the input layer. Black denotes negative weights, and white positive (axes: hidden nodes vs. weights for inputs).

Figure 14: Sum of absolute values for each input (FFT bin) of the trained neural network. Vertical gray lines indicate bins of the FFT that correspond to musically relevant frequencies.


10 Visualization of Tonal Content in the Symbolic and Audio Domains 10 Visualization of Tonal Content in the Symbolic and Audio Domains Petri Toiviainen Department of Music PO Box 35 (M) 40014 University of Jyväskylä Finland ptoiviai@campus.jyu.fi Abstract Various computational

More information

Music Segmentation Using Markov Chain Methods

Music Segmentation Using Markov Chain Methods Music Segmentation Using Markov Chain Methods Paul Finkelstein March 8, 2011 Abstract This paper will present just how far the use of Markov Chains has spread in the 21 st century. We will explain some

More information

A repetition-based framework for lyric alignment in popular songs

A repetition-based framework for lyric alignment in popular songs A repetition-based framework for lyric alignment in popular songs ABSTRACT LUONG Minh Thang and KAN Min Yen Department of Computer Science, School of Computing, National University of Singapore We examine

More information

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM A QUER B EAMPLE MUSIC RETRIEVAL ALGORITHM H. HARB AND L. CHEN Maths-Info department, Ecole Centrale de Lyon. 36, av. Guy de Collongue, 69134, Ecully, France, EUROPE E-mail: {hadi.harb, liming.chen}@ec-lyon.fr

More information

Outline. Why do we classify? Audio Classification

Outline. Why do we classify? Audio Classification Outline Introduction Music Information Retrieval Classification Process Steps Pitch Histograms Multiple Pitch Detection Algorithm Musical Genre Classification Implementation Future Work Why do we classify

More information

Music Alignment and Applications. Introduction

Music Alignment and Applications. Introduction Music Alignment and Applications Roger B. Dannenberg Schools of Computer Science, Art, and Music Introduction Music information comes in many forms Digital Audio Multi-track Audio Music Notation MIDI Structured

More information

A CLASSIFICATION-BASED POLYPHONIC PIANO TRANSCRIPTION APPROACH USING LEARNED FEATURE REPRESENTATIONS

A CLASSIFICATION-BASED POLYPHONIC PIANO TRANSCRIPTION APPROACH USING LEARNED FEATURE REPRESENTATIONS 12th International Society for Music Information Retrieval Conference (ISMIR 2011) A CLASSIFICATION-BASED POLYPHONIC PIANO TRANSCRIPTION APPROACH USING LEARNED FEATURE REPRESENTATIONS Juhan Nam Stanford

More information

Studying the effects of bass estimation for chord segmentation in pop-rock music

Studying the effects of bass estimation for chord segmentation in pop-rock music Studying the effects of bass estimation for chord segmentation in pop-rock music Urbez Capablo Riazuelo MASTER THESIS UPF / 2014 Master in Sound and Music Computing Master thesis supervisor: Dr. Perfecto

More information

Automatic Piano Music Transcription

Automatic Piano Music Transcription Automatic Piano Music Transcription Jianyu Fan Qiuhan Wang Xin Li Jianyu.Fan.Gr@dartmouth.edu Qiuhan.Wang.Gr@dartmouth.edu Xi.Li.Gr@dartmouth.edu 1. Introduction Writing down the score while listening

More information

Query By Humming: Finding Songs in a Polyphonic Database

Query By Humming: Finding Songs in a Polyphonic Database Query By Humming: Finding Songs in a Polyphonic Database John Duchi Computer Science Department Stanford University jduchi@stanford.edu Benjamin Phipps Computer Science Department Stanford University bphipps@stanford.edu

More information

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University Week 14 Query-by-Humming and Music Fingerprinting Roger B. Dannenberg Professor of Computer Science, Art and Music Overview n Melody-Based Retrieval n Audio-Score Alignment n Music Fingerprinting 2 Metadata-based

More information

Speech To Song Classification

Speech To Song Classification Speech To Song Classification Emily Graber Center for Computer Research in Music and Acoustics, Department of Music, Stanford University Abstract The speech to song illusion is a perceptual phenomenon

More information

Music Representations

Music Representations Lecture Music Processing Music Representations Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Book: Fundamentals of Music Processing Meinard Müller Fundamentals

More information

Statistical Modeling and Retrieval of Polyphonic Music

Statistical Modeling and Retrieval of Polyphonic Music Statistical Modeling and Retrieval of Polyphonic Music Erdem Unal Panayiotis G. Georgiou and Shrikanth S. Narayanan Speech Analysis and Interpretation Laboratory University of Southern California Los Angeles,

More information

AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC

AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC A Thesis Presented to The Academic Faculty by Xiang Cao In Partial Fulfillment of the Requirements for the Degree Master of Science

More information

SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION

SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION th International Society for Music Information Retrieval Conference (ISMIR ) SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION Chao-Ling Hsu Jyh-Shing Roger Jang

More information

Music Information Retrieval with Temporal Features and Timbre

Music Information Retrieval with Temporal Features and Timbre Music Information Retrieval with Temporal Features and Timbre Angelina A. Tzacheva and Keith J. Bell University of South Carolina Upstate, Department of Informatics 800 University Way, Spartanburg, SC

More information

Audio Feature Extraction for Corpus Analysis

Audio Feature Extraction for Corpus Analysis Audio Feature Extraction for Corpus Analysis Anja Volk Sound and Music Technology 5 Dec 2017 1 Corpus analysis What is corpus analysis study a large corpus of music for gaining insights on general trends

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

Content-based music retrieval

Content-based music retrieval Music retrieval 1 Music retrieval 2 Content-based music retrieval Music information retrieval (MIR) is currently an active research area See proceedings of ISMIR conference and annual MIREX evaluations

More information

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST)

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST) Computational Models of Music Similarity 1 Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST) Abstract The perceived similarity of two pieces of music is multi-dimensional,

More information

Pitch Perception and Grouping. HST.723 Neural Coding and Perception of Sound

Pitch Perception and Grouping. HST.723 Neural Coding and Perception of Sound Pitch Perception and Grouping HST.723 Neural Coding and Perception of Sound Pitch Perception. I. Pure Tones The pitch of a pure tone is strongly related to the tone s frequency, although there are small

More information

Music Similarity and Cover Song Identification: The Case of Jazz

Music Similarity and Cover Song Identification: The Case of Jazz Music Similarity and Cover Song Identification: The Case of Jazz Simon Dixon and Peter Foster s.e.dixon@qmul.ac.uk Centre for Digital Music School of Electronic Engineering and Computer Science Queen Mary

More information

Analysing Musical Pieces Using harmony-analyser.org Tools

Analysing Musical Pieces Using harmony-analyser.org Tools Analysing Musical Pieces Using harmony-analyser.org Tools Ladislav Maršík Dept. of Software Engineering, Faculty of Mathematics and Physics Charles University, Malostranské nám. 25, 118 00 Prague 1, Czech

More information

AUTOMASHUPPER: AN AUTOMATIC MULTI-SONG MASHUP SYSTEM

AUTOMASHUPPER: AN AUTOMATIC MULTI-SONG MASHUP SYSTEM AUTOMASHUPPER: AN AUTOMATIC MULTI-SONG MASHUP SYSTEM Matthew E. P. Davies, Philippe Hamel, Kazuyoshi Yoshii and Masataka Goto National Institute of Advanced Industrial Science and Technology (AIST), Japan

More information

Detecting Musical Key with Supervised Learning

Detecting Musical Key with Supervised Learning Detecting Musical Key with Supervised Learning Robert Mahieu Department of Electrical Engineering Stanford University rmahieu@stanford.edu Abstract This paper proposes and tests performance of two different

More information

Music Source Separation

Music Source Separation Music Source Separation Hao-Wei Tseng Electrical and Engineering System University of Michigan Ann Arbor, Michigan Email: blakesen@umich.edu Abstract In popular music, a cover version or cover song, or

More information

Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE, and Bryan Pardo, Member, IEEE

Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE, and Bryan Pardo, Member, IEEE IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 6, OCTOBER 2011 1205 Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE,

More information

Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification

Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification 1138 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 6, AUGUST 2008 Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification Joan Serrà, Emilia Gómez,

More information

PHYSICS OF MUSIC. 1.) Charles Taylor, Exploring Music (Music Library ML3805 T )

PHYSICS OF MUSIC. 1.) Charles Taylor, Exploring Music (Music Library ML3805 T ) REFERENCES: 1.) Charles Taylor, Exploring Music (Music Library ML3805 T225 1992) 2.) Juan Roederer, Physics and Psychophysics of Music (Music Library ML3805 R74 1995) 3.) Physics of Sound, writeup in this

More information

Subjective Similarity of Music: Data Collection for Individuality Analysis

Subjective Similarity of Music: Data Collection for Individuality Analysis Subjective Similarity of Music: Data Collection for Individuality Analysis Shota Kawabuchi and Chiyomi Miyajima and Norihide Kitaoka and Kazuya Takeda Nagoya University, Nagoya, Japan E-mail: shota.kawabuchi@g.sp.m.is.nagoya-u.ac.jp

More information

A Survey of Audio-Based Music Classification and Annotation

A Survey of Audio-Based Music Classification and Annotation A Survey of Audio-Based Music Classification and Annotation Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang IEEE Trans. on Multimedia, vol. 13, no. 2, April 2011 presenter: Yin-Tzu Lin ( 阿孜孜 ^.^)

More information

NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING

NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING Zhiyao Duan University of Rochester Dept. Electrical and Computer Engineering zhiyao.duan@rochester.edu David Temperley University of Rochester

More information

A Study on Music Genre Recognition and Classification Techniques

A Study on Music Genre Recognition and Classification Techniques , pp.31-42 http://dx.doi.org/10.14257/ijmue.2014.9.4.04 A Study on Music Genre Recognition and Classification Techniques Aziz Nasridinov 1 and Young-Ho Park* 2 1 School of Computer Engineering, Dongguk

More information

JOINT BEAT AND DOWNBEAT TRACKING WITH RECURRENT NEURAL NETWORKS

JOINT BEAT AND DOWNBEAT TRACKING WITH RECURRENT NEURAL NETWORKS JOINT BEAT AND DOWNBEAT TRACKING WITH RECURRENT NEURAL NETWORKS Sebastian Böck, Florian Krebs, and Gerhard Widmer Department of Computational Perception Johannes Kepler University Linz, Austria sebastian.boeck@jku.at

More information

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval DAY 1 Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval Jay LeBoeuf Imagine Research jay{at}imagine-research.com Rebecca

More information

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 04, April -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 MUSICAL

More information

ON FINDING MELODIC LINES IN AUDIO RECORDINGS. Matija Marolt

ON FINDING MELODIC LINES IN AUDIO RECORDINGS. Matija Marolt ON FINDING MELODIC LINES IN AUDIO RECORDINGS Matija Marolt Faculty of Computer and Information Science University of Ljubljana, Slovenia matija.marolt@fri.uni-lj.si ABSTRACT The paper presents our approach

More information

DOWNBEAT TRACKING WITH MULTIPLE FEATURES AND DEEP NEURAL NETWORKS

DOWNBEAT TRACKING WITH MULTIPLE FEATURES AND DEEP NEURAL NETWORKS DOWNBEAT TRACKING WITH MULTIPLE FEATURES AND DEEP NEURAL NETWORKS Simon Durand*, Juan P. Bello, Bertrand David*, Gaël Richard* * Institut Mines-Telecom, Telecom ParisTech, CNRS-LTCI, 37/39, rue Dareau,

More information

MUSIC CONTENT ANALYSIS : KEY, CHORD AND RHYTHM TRACKING IN ACOUSTIC SIGNALS

MUSIC CONTENT ANALYSIS : KEY, CHORD AND RHYTHM TRACKING IN ACOUSTIC SIGNALS MUSIC CONTENT ANALYSIS : KEY, CHORD AND RHYTHM TRACKING IN ACOUSTIC SIGNALS ARUN SHENOY KOTA (B.Eng.(Computer Science), Mangalore University, India) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE

More information

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? NICHOLAS BORG AND GEORGE HOKKANEN Abstract. The possibility of a hit song prediction algorithm is both academically interesting and industry motivated.

More information

A Discriminative Approach to Topic-based Citation Recommendation

A Discriminative Approach to Topic-based Citation Recommendation A Discriminative Approach to Topic-based Citation Recommendation Jie Tang and Jing Zhang Department of Computer Science and Technology, Tsinghua University, Beijing, 100084. China jietang@tsinghua.edu.cn,zhangjing@keg.cs.tsinghua.edu.cn

More information

arxiv: v1 [cs.sd] 8 Jun 2016

arxiv: v1 [cs.sd] 8 Jun 2016 Symbolic Music Data Version 1. arxiv:1.5v1 [cs.sd] 8 Jun 1 Christian Walder CSIRO Data1 7 London Circuit, Canberra,, Australia. christian.walder@data1.csiro.au June 9, 1 Abstract In this document, we introduce

More information

Neural Network for Music Instrument Identi cation

Neural Network for Music Instrument Identi cation Neural Network for Music Instrument Identi cation Zhiwen Zhang(MSE), Hanze Tu(CCRMA), Yuan Li(CCRMA) SUN ID: zhiwen, hanze, yuanli92 Abstract - In the context of music, instrument identi cation would contribute

More information

Singer Recognition and Modeling Singer Error

Singer Recognition and Modeling Singer Error Singer Recognition and Modeling Singer Error Johan Ismael Stanford University jismael@stanford.edu Nicholas McGee Stanford University ndmcgee@stanford.edu 1. Abstract We propose a system for recognizing

More information

Measurement of overtone frequencies of a toy piano and perception of its pitch

Measurement of overtone frequencies of a toy piano and perception of its pitch Measurement of overtone frequencies of a toy piano and perception of its pitch PACS: 43.75.Mn ABSTRACT Akira Nishimura Department of Media and Cultural Studies, Tokyo University of Information Sciences,

More information

Music Radar: A Web-based Query by Humming System

Music Radar: A Web-based Query by Humming System Music Radar: A Web-based Query by Humming System Lianjie Cao, Peng Hao, Chunmeng Zhou Computer Science Department, Purdue University, 305 N. University Street West Lafayette, IN 47907-2107 {cao62, pengh,

More information

HST 725 Music Perception & Cognition Assignment #1 =================================================================

HST 725 Music Perception & Cognition Assignment #1 ================================================================= HST.725 Music Perception and Cognition, Spring 2009 Harvard-MIT Division of Health Sciences and Technology Course Director: Dr. Peter Cariani HST 725 Music Perception & Cognition Assignment #1 =================================================================

More information

Music Genre Classification

Music Genre Classification Music Genre Classification chunya25 Fall 2017 1 Introduction A genre is defined as a category of artistic composition, characterized by similarities in form, style, or subject matter. [1] Some researchers

More information

Appendix A Types of Recorded Chords

Appendix A Types of Recorded Chords Appendix A Types of Recorded Chords In this appendix, detailed lists of the types of recorded chords are presented. These lists include: The conventional name of the chord [13, 15]. The intervals between

More information

Tempo and Beat Tracking

Tempo and Beat Tracking Tutorial Automatisierte Methoden der Musikverarbeitung 47. Jahrestagung der Gesellschaft für Informatik Tempo and Beat Tracking Meinard Müller, Christof Weiss, Stefan Balke International Audio Laboratories

More information

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION Halfdan Rump, Shigeki Miyabe, Emiru Tsunoo, Nobukata Ono, Shigeki Sagama The University of Tokyo, Graduate

More information

OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES

OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES Vishweshwara Rao and Preeti Rao Digital Audio Processing Lab, Electrical Engineering Department, IIT-Bombay, Powai,

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox Final Project (EECS 94) knoxm@eecs.berkeley.edu December 1, 006 1 Introduction Laughter is a powerful cue in communication. It communicates to listeners the emotional

More information

LEARNING AUDIO SHEET MUSIC CORRESPONDENCES. Matthias Dorfer Department of Computational Perception

LEARNING AUDIO SHEET MUSIC CORRESPONDENCES. Matthias Dorfer Department of Computational Perception LEARNING AUDIO SHEET MUSIC CORRESPONDENCES Matthias Dorfer Department of Computational Perception Short Introduction... I am a PhD Candidate in the Department of Computational Perception at Johannes Kepler

More information

Music Structure Analysis

Music Structure Analysis Lecture Music Processing Music Structure Analysis Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Book: Fundamentals of Music Processing Meinard Müller Fundamentals

More information

Improving Frame Based Automatic Laughter Detection

Improving Frame Based Automatic Laughter Detection Improving Frame Based Automatic Laughter Detection Mary Knox EE225D Class Project knoxm@eecs.berkeley.edu December 13, 2007 Abstract Laughter recognition is an underexplored area of research. My goal for

More information

Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series

Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series -1- Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series JERICA OBLAK, Ph. D. Composer/Music Theorist 1382 1 st Ave. New York, NY 10021 USA Abstract: - The proportional

More information

The Intervalgram: An Audio Feature for Large-scale Melody Recognition

The Intervalgram: An Audio Feature for Large-scale Melody Recognition The Intervalgram: An Audio Feature for Large-scale Melody Recognition Thomas C. Walters, David A. Ross, and Richard F. Lyon Google, 1600 Amphitheatre Parkway, Mountain View, CA, 94043, USA tomwalters@google.com

More information