Computational Rhythm Similarity Development and Verification Through Deep Networks and Musically Motivated Analysis

NEW YORK UNIVERSITY

Computational Rhythm Similarity Development and Verification Through Deep Networks and Musically Motivated Analysis

by Tlacael Esparza

Submitted in partial fulfillment of the requirements for the Master of Music in Music Technology in the Department of Music and Performing Arts Professions, The Steinhardt School, New York University.

Advisor: Juan Bello

January 2014

NEW YORK UNIVERSITY
Abstract
Steinhardt, Master of Music
by Tlacael Esparza

In developing computational measures of rhythmic similarity in music, validation methods typically rely on proxy classification tasks on common datasets, equating rhythm similarity to genre. In this paper, a novel state-of-the-art system for rhythm similarity is proposed that leverages deep network architectures for feature learning and classification, using this standard approach of genre classification on a well-known dataset for validation. In addressing this method of validation, an extensive cross-disciplinary analysis of the performance of this system is undertaken. In addition to analyses through MIR, machine learning and statistical methods, a detailed study of both the results and the dataset is performed from a musicological perspective, delving into the musical, historical and cultural specifics that impact the system. Through this study, insights are gained in further gauging the abilities of this measure of rhythm similarity beyond classification accuracy, as well as a deeper understanding of this system design and validation approach as a musically meaningful exercise.

Acknowledgements

I would like to thank Professor Juan Bello for his guidance, encouragement and dedication to my education, and Eric Humphrey, without whom I would have been lost in a deep network somewhere. Many people have helped me along the way with this work and I am very grateful for their time and generosity. These include: Uri Nieto, Mary Farbood, Adriano Santos and Professor Larry Crook, as well as Carlos Silla and Alessandro Koerich for the Latin Music Dataset and their insights into the data collection process. And most importantly, thanks to my family and my fiancée, Ashley Reeb, for their unwavering emotional, spiritual, intellectual and financial support.

Contents

Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
2 Explication and Literature Review
  2.1 Computational Music Similarity Measures
  2.2 Rhythm Similarity
    2.2.1 Onset Patterns
  2.3 Machine Learning
    2.3.1 Deep Networks for Feature Learning and Classification
3 Approach
  3.1 Onset Patterns Implementation
  3.2 Deep Network Implementation
    3.2.1 Applying the Deep Network
  3.3 Analytic Approach
4 System Configuration and Results
  4.1 Dataset
  4.2 Methodology
  4.3 OP Parameter Tuning
    4.3.1 Periodicity Resolution
    4.3.2 Frequency Resolution
  4.4 Deep Network Parameterization
    4.4.1 Layer Width
    4.4.2 Network Depth
  4.5 Optimal System Configuration
5 OP Rhythm Similarity Analysis
  5.1 Tempo Dependence
  5.2 Fine Grain Rhythmic Similarity
6 Dataset Observations
  6.1 Ground Truths
  6.2 Artist/Chronological Distribution
  6.3 Brazilian Skew
7 System Verification Issues
  7.1 Rhythm-Genre Connection
  7.2 Inter-genre Influence
  7.3 Rhythm as Genre Signifier
8 Conclusions
Bibliography

List of Figures

2.1 Extraction of Onset Patterns (OP) from the audio signal.
4.1 Effect of P on classification. The highest result is highlighted in blue, while significantly different results are in red.
4.2 Effect of F on classification. The highest result is highlighted in blue, while significantly different results are in red.
4.3 Mean comparison of ANOVA tests on network layer complexity in a 2-layer architecture shows significantly lower results for small M.
4.4 Top: Progression from input to output shows an increasingly compact genre representation. Bottom: Progression from input to output shows increasingly distant classes.
5.1 Gaussian-modeled tempo distributions by genre in the LMD.
6.1 Left: OP of a modern Sertaneja track. Right: OP of a Tango recording from 1917.
6.2 Top: Geographical spread of Bolero vs. Brazilian genres. Bottom: Detail of geographical spread of Brazilian genres.

List of Tables

2.1 Summary of main approaches in the literature for computational rhythm similarity.
4.1 ANOVA results for classification scores with varying P values show that periodicity resolution is a significant factor.
4.2 Classification accuracies for different features on the LMD.
4.3 Classification accuracies by genre, ordered from highest classification score to lowest, show Brazilian genres generally performing worse than the rest.
4.4 Confusion matrix shows classification affinities between Sertaneja and several other genres.
4.5 Comparison of different classifiers on OP data. The proposed system outperforms all others by a margin of 2.23%.
5.1 Results of binary logistic regression with classification success as the dependent variable and BPM and density as inputs show density is significant while BPM is not.
5.2 Hosmer & Lemeshow test shows BPM and density data to be poor predictors of classification success.
5.3 Feel breakdown by genre showing the percentage of tracks in each genre that are swung.
5.4 Comparison of actual genre feel versus predicted genre feel for LMD classification results.
6.1 LMD infometrics.

Chapter 1
Introduction

A fundamental goal in the field of music information retrieval (MIR) is to extract musically meaningful information from digitized audio signals through computational methods. This, in its vagueness and breadth, describes most MIR tasks. In practice, and with the field still in its relative infancy, these tasks have often simplified to extracting musical feature representations that highlight basic characteristics like pitch, harmony, melody, tempo, timbre, structure and rhythm, among others. With the assumption that complex musical features such as mood or genre are signified by sets of fundamental musical attributes, it is hoped that these more abstract characteristics can be identified through combinations of these methods [1, 2].

There are many motivations for, as well as current successful applications of, this work. Pitch and beat tracking algorithms have found widespread use in digital audio workstations such as Pro Tools and Ableton Live, enabling pitch and beat correction, tempo estimation, and time-stretching of recorded audio. These functionalities have been used to great effect in current popular music and are often audibly detectable, as with the music of the artist T-Pain, known for heavy use of auto-tuning software. Beyond music production, these computational methods can be leveraged to analyze and annotate the ever-growing and intractably large collections of music that the digital age has enabled. With an estimated 75,000 official album releases in 2010 alone as an indication of scale [3], and with digital transmission the primary means of maintaining and consuming this music, a computational approach to annotating and cataloging these collections is highly desirable. Indeed, new digital-era companies that serve streams of music to users on demand, such as Spotify, SoundCloud and Pandora, have begun to employ many MIR methods (and researchers) for genre detection, playlist generation and music recommendation, among other services.

A main objective of this thesis is to examine and further develop computational methods of measuring rhythm similarity in music signals. The importance of rhythm to music almost needs no mention. Under composer Edgard Varèse's generous definition of music as "organized sound", rhythm remains fundamental in that time is one of the few dimensions along which sound can be organized. And so, with the goal of the MIR community to fully parse musical content through computational means, contending with rhythm is an important step in this endeavor. Combining previous research on rhythm from the MIR literature with advances in machine learning, this work presents a state-of-the-art system for measuring rhythm similarity.

In the hope of anchoring this abstracted computational process to its stated goal of extracting musically, and specifically rhythmically, meaningful information, this work makes a concerted effort, beyond what is common in the literature, to analyze not only the results, but also the dataset used and the system's design from a multi-disciplinary perspective. Using a standard MIR verification scheme on a well-known dataset, the Latin Music Dataset [4], through statistical analyses on dataset metadata, analysis of the dataset from a musical, cultural and historical perspective, and scrutiny of the basic assumptions built into the design, it is hoped that the system's musical relevance can be understood with greater clarity beyond the common classification-score measuring stick. Further, with a great deal of personal interest and domain knowledge in the subject (rhythm) and the specific area of application for this work (Latin rhythms/music), it is hoped that this research approach will provide a useful and elucidating look into the analysis of computational rhythm similarity measures, and also act as an encouragement to take on this level of scrutiny in developing computational methods for music applications in general.

Chapter 2
Explication and Literature Review

Much of MIR research has largely followed a standard and persistent design model in developing novel methods for parsing music signals. This model comprises two steps: one, a hand-crafted feature extraction stage that aims to single out a given musical characteristic from an input signal; and two, a semantic interpretation or classification stage applying some function that allows the mapping of this feature to a symbolic representation. This chapter surveys previous work in developing both of these system components as they relate to the current task of measuring rhythm similarity. Sections 2.1 and 2.2 summarize standard approaches to feature extraction and review the various attempts at characterizing rhythm through feature design, highlighting the development of Onset Patterns, which serve as a jumping-off point for further research. Section 2.3 reviews improvements to feature extraction methods through the use of sophisticated machine learning algorithms, pointing to a blurring in the distinction between feature extraction and classification and setting the direction for this research.

2.1 Computational Music Similarity Measures

Though some feature extraction methods produce easily interpretable, musically relevant representations of the signal directly, certain feature representations are imbued with musical meaning only as measures of similarity. For instance, the output of a tempo detection algorithm, which looks for strong sub-sonic frequencies, can be interpreted easily by looking for a global maximum in the representation, revealing a beats-per-minute value, a standard unit of tempo. Conversely, Mel-Frequency Cepstral Coefficients (MFCCs), widely used as a measure of timbre, are not musically interpretable on their own, but,

paired with a distance metric, can be used to identify sounds based on distance to a known example, a common application of which is instrument identification [5, 6]. In this paradigm of measuring similarity, musical facets can be seen as existing in some multi-dimensional space where similar representations are grouped closely together, and the feature extraction algorithm is a mathematical projection of a given signal into one of these spaces. Through this approach, a posteriori-defined properties such as rhythm and structure, esoteric qualities such as timbre, and complex characteristics such as mood can be inferred based on their distance to labelled examples in their respective feature space. In this way, classification supplies semantic meaning and a perceptually meaningful framework for analyzing these more complicated features. As an example of this, the previously mentioned quality of sound referred to as timbre is not easily defined, difficult to conceptualize, and its manifestations difficult to describe with language. However, agreeing that timbre is the feature that distinguishes the sound of one instrument from another, we can discuss timbre similarity through the task of matching multiple recordings of the same instrument, defining it in finite terms (i.e. the timbre of a flute vs. the timbre of a horn).

One of the major obstacles to this approach is the necessity of labeled datasets. The development, verification and interpretation of these algorithms rely on classification tasks on pre-labeled examples; without an example of flute timbre to match with, an unlabeled signal cannot be identified with this characteristic. Ideally, when developing new similarity features, the verification dataset suits the task well by representing the desired musical feature homogeneously within a given class, but datasets with feature-similarity-based ground truths can be expensive and time-consuming to produce. To address this, the MIR community actively compiles and shares labeled datasets for these purposes; examples of widely used datasets include labeled audio of monophonic instruments (McGill University Master Samples), audio accompanied by various descriptive tags (Magnatagatune) and many datasets divided along genre membership (LMD, Ballroom Dance, Turkish Music). But in practice, this has often led to the use of datasets not created specifically for the given similarity measure of concern, employing a proxy identification with some other more easily identifiable characteristic. This includes the very common use of genre as a proxy for texture and rhythm similarity (which this thesis research employs knowingly) [7-12], and cover song identification for harmonic and melodic similarity [13-15]. Implicit in this approach is the assumption that ground truths in these datasets correlate strongly enough with the musical characteristic being measured to provide meaningful classification results and system verification.
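To make the similarity-as-distance paradigm concrete, the following is a minimal sketch of timbre matching by nearest neighbor in MFCC space. It assumes the librosa library for feature extraction, and the file names and helper functions are invented for illustration; this is the general approach described above, not a specific system from the literature.

```python
import numpy as np
import librosa

def timbre_vector(path):
    """Summarize a recording's timbre as the mean of its MFCC frames."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
    return mfcc.mean(axis=1)

def nearest_label(query, labeled_examples):
    """Label a query by its Euclidean distance to known examples in feature space."""
    return min(labeled_examples,
               key=lambda lbl: np.linalg.norm(query - labeled_examples[lbl]))

# Hypothetical usage: identify an instrument by proximity to labeled timbres.
# examples = {"flute": timbre_vector("flute.wav"), "horn": timbre_vector("horn.wav")}
# print(nearest_label(timbre_vector("unknown.wav"), examples))
```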

2.2 Rhythm Similarity

Rhythm is a complex and broad musical concept with varying definitions in different contexts. Though rhythm exists in music on various time-scales and can describe anything from the timing of a melody to textural shifts and large-scale events, in this paper (and in the MIR literature on the subject), rhythm is taken to refer to regularly repeating sound events in time on the musical measure level (approximately 2-8 seconds); that is, rhythm as those looping musical phrases that a percussionist or a bass player in a dance band might play.

In the MIR literature, analyzing rhythmic similarity is distinct from rhythm description or transcription tasks. Where the latter seek to transform a musical signal into symbolic annotations or describe it directly in some manner, the former is concerned only with isolating rhythm as invariant across different instances, often using highly abstracted representations. Although [16] provides a framework for understanding rhythm similarity with symbolic sequences, for the rapidly growing body of recorded audio this kind of analysis is not applicable for several reasons: the vast majority of audio recordings typically do not have this level of annotation; providing this information by hand is time-consuming; and computational methods of annotation remain ineffective [17]. Hence, signal-based methods for rhythm analysis are highly desirable.

From a conceptual perspective, isolating this level of rhythm as an invariance requires removing pitch, tempo and timbre dependence so that a rhythm played on two different instruments, using different pitches and at different speeds, will be recognized as the same. However, previous approaches tailor this list according to the intended application and sometimes include additional dependencies to be removed: phase, referring to the position of the start of a repeating rhythm; and temporal variance, referring to the evolution of a rhythm over longer time frames. Removing phase and temporal variance is a practical consideration specific to signal processing concerns; though a human can often easily recognize the beginning of a rhythmic pattern based on larger musical context, recognizing this computationally has been shown to be problematic [18], and when analyzing a signal, there is no guarantee that the beginning of the signal will correspond to the beginning of a repeating rhythm. Similarly, for track-wise classification, temporal invariance works towards minimizing the effects of portions of audio where there is no discernible rhythm or where changes in rhythm are not representative of the track on the whole.

Aside from a handful of intuitively motivated rhythm similarity systems that extract unit rhythm representations and preserve phase by employing often complicated heuristics to deduce the beginning of a phrase [19-21], most designs remove phase and take a

more abstracted approach. Though differing in important ways, they typically follow a common script: 1) calculate a novelty function from the signal, removing pitch-level frequencies and highlighting rhythmic content; 2) produce a periodicity and/or rhythmic decomposition of this novelty function by analyzing consecutive rhythm-phrase-length windows (typically 4 to 12 seconds), capturing local rhythm content on this scale; 3) transform this local representation by warping, shifting, resampling or normalizing to remove tempo dependence; 4) aggregate local representations over time to produce a track-wise rhythm representation, removing temporal dependence. Table 2.1 shows a summary of these four steps for each of the main approaches in the literature. In this table, the affected dimension (in parentheses under each column) refers to the musical dimension that each stage acts to either preserve or remove.

Though all of these methods remove pitch content in the novelty function calculation, the Scale Transform implemented in [11, 22, 23] and the Fluctuation and Onset Patterns implemented in [10, 11, 24] do preserve some level of timbre through multi-band novelty function representations. Most approaches produce local rhythm representations by using the Auto-Correlation Function (ACF) or Discrete Fourier Transform (DFT). [12] notes that these functions are beneficial for their preservation of the sequential order of rhythmic events, but they also remove phase as a periodicity representation, where only rhythm-level frequencies are coded. Rhythm Patterns [9, 25] diverge from this approach by including, in addition to periodicity analysis, Inter-Onset Intervals (IOI), which encode the spaces between onsets in the novelty function, and Rhythm Patterns, which are bar-length representations of the novelty function. This is a robust approach but relies on unreliable heuristics for extracting the downbeat used to determine the Rhythm Pattern. All of the approaches make some effort to remove temporal variance through temporal aggregation over all frames.

Approach                                | Novelty (Pitch/Timbre) | Local Rhythm (Phase)      | Scaling/Morphing (Tempo)                 | Aggregation (Temporal)
Beat histogram [26-29]                  | single                 | ACF                       | log-lag + shift detection, sub-sampling  | histogram
Rhythm patterns [9, 25]                 | single                 | rhythm patterns, ACF, IOI | bar-length normalization                 | k-means, histogram
Hybrid ACF/DFT [12, 30]                 | single                 | DFT, ACF and hybrids      | resampling with local tempo              | mean
Scale transform [11, 22, 23]            | single, multiband      | ACF, DFT                  | scale transform                          | mean
Fluctuation/Onset patterns [10, 11, 24] | multiband              | DFT                       | log-frequency + subsampling              | mean

Table 2.1: Summary of main approaches in the literature for computational rhythm similarity.

The biggest divergences in these designs can be seen in the various methods for removing tempo sensitivity from the representation. Noting that relative rhythmic structure can be compared more easily as a shift on a log scale versus a stretch on a linear scale, a log-lag mapping in the Beat Histogram [28, 29] or a log-frequency mapping in the Onset Pattern [10] allows for reduced sensitivity to tempo changes, where only large tempo differences are noticeable. In [10, 29], the effect of tempo is further reduced by sub-sampling in the log-lag/frequency domain to produce a coarser representation. [28]

employs a shift in the log-lag domain to obtain a fully tempo-insensitive representation, but this relies on determining the proper shift value, which is prone to errors (a numerical sketch of this log-axis behavior follows at the end of this section). Subject to similar problems are the methods employed in calculating the Hybrid [12] and, as mentioned before, Rhythm Pattern [9] representations, which rely on determining tempo and bar boundaries for tempo normalization and bar-length pattern identification. The octave errors common to tempo estimation algorithms are problematic here, leading to inconsistencies in rhythm representations for these methods. [22, 23] offer a robust, fully tempo-invariant approach that takes the scale transform of the ACF, resulting in a scale-invariant version of the already shift-invariant ACF, obviating the need to determine a shift amount to correct for the shift introduced by log-lag mapping.

Though the Beat Histogram and Hybrid ACF/DFT, if applied successfully, do result in a fully pitch-, tempo-, timbre- and phase-invariant rhythm representation, these are less useful when tasked with measuring rhythm similarity in the context of general similarity in multi-instrumental recorded music. Indeed, with most of these methods performing verification through genre identification on standard dance-based datasets, better classification success has been obtained with the Onset Pattern [10, 11] and the Scale Transform [22, 23], which both preserve some level of timbre dependence through a multi-band representation. This makes sense when the question is not "are these rhythms the same?" but rather "do these two tracks sound similar from a rhythmic perspective?", where the listener looks not only for similar rhythms but for similar orchestrations of those rhythms. While the former might be more conceptually pure with respect to rhythm similarity, the latter is more amenable to a genre classification task and to use as a tool in measuring general music similarity.

It merits repeating here that nearly all of the rhythm similarity studies mentioned above employ genre identification as a verification method. Recalling the common use of already available datasets in lieu of ones tailored for the task - in this case, a dataset labeled according to a specifically defined understanding of rhythm similarity - this has been a common and generally accepted practice in rhythm similarity research. In using dance-based datasets (LMD [11], Ballroom Dance [9-12, 25], Turkish Music [22, 23]), the underlying assumption behind this practice is not only that a rhythm can be reliably associated with a specific genre, but also that a given genre has a representative rhythm, justifying a bijective mapping from one to the other.
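The log-scale observation above can be demonstrated numerically. In the sketch below, all data is synthetic (Gaussian periodicity peaks invented for illustration; nothing comes from the thesis implementation): a 25% tempo change, a multiplicative stretch of periodicity frequencies, becomes a constant additive shift after mapping onto a log2 axis, recoverable by cross-correlation.

```python
import numpy as np

freqs = np.linspace(0.5, 16.0, 2048)          # rhythm-level frequencies in Hz
peaks = np.array([1.0, 2.0, 4.0])             # periodicities of a pattern at tempo A

def periodicity_spectrum(peak_freqs):
    """Toy periodicity spectrum: a Gaussian bump at each periodicity peak."""
    return sum(np.exp(-0.5 * ((freqs - f) / 0.05) ** 2) for f in peak_freqs)

spec_a = periodicity_spectrum(peaks)
spec_b = periodicity_spectrum(1.25 * peaks)   # same rhythm, 25% faster

# Resample both onto a log2-frequency axis: the multiplicative stretch is now
# an additive shift of roughly log2(1.25) along the axis.
log_axis = np.linspace(np.log2(0.5), np.log2(16.0), 2048)
log_a = np.interp(2.0 ** log_axis, freqs, spec_a)
log_b = np.interp(2.0 ** log_axis, freqs, spec_b)

xcorr = np.correlate(log_a, log_b, mode="full")
shift_bins = abs(xcorr.argmax() - (len(log_a) - 1))
bin_width = log_axis[1] - log_axis[0]
print(shift_bins * bin_width, np.log2(1.25))  # approximately equal
```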

2.2.1 Onset Patterns

Taking the perspective that a timbre-sensitive approach to rhythm similarity is desirable for application to multi-instrumental music signals, and noting the importance of reducing reliance on error-prone heuristics in the design, the Onset Pattern and the Scale Transform stand out as promising approaches. The primary difference between these two lies in their approach to tempo invariance, where the Scale Transform achieves full tempo invariance and the Onset Pattern shows invariance only for local tempo changes. As [11] effectively shows, tempo can be an important and identifying characteristic for certain genres. Although the motivation here is not genre identification, this suggests that perhaps tempo is also important for the perception of rhythm similarity. If two songs have the same rhythm but have very different tempos, to the point that they produce a different effect on the ear, this becomes a characteristic worth tracking. With this in mind, the Onset Pattern, which encodes only relatively large differences in tempo, is especially promising for further development as a general measure of rhythm similarity in music.

The computation of Onset Patterns (OP), as first described in [10] and refined in [11], is relatively straightforward and follows the signal pipeline mentioned above. As illustrated in Figure 2.1: 1) the signal is transformed to the time-frequency domain, processed to produce a novelty function through spectral flux, mean removal and half-wave rectification, and sub-sampled to produce log-spaced frequency sub-bands; 2) log2-frequency DFTs are applied to these sub-bands over 6-8 second windows to produce a periodicity representation; 3) each frame is subsampled in the frequency and periodicity dimensions to generalize the representation; and 4) frames are aggregated to produce a track-level representation. However, not detailed in these steps is the ordering of pooling stages, important to [10]'s design, that act to summarize multi-band information into a smaller representation. In particular, pooling occurs in the frequency dimension before and after calculating periodicity. Also left out is a normalization step to correct for artifacts from the various log-compression or pooling steps. However, justifications for these design choices, as well as the implementation of this normalization step, are left unclear in the original paper.

Figure 2.1: Extraction of Onset Patterns (OP) from the audio signal.

[11] refines this process by systematically testing different designs and parameters. Of particular note in its findings is the importance of window size in the periodicity calculation and the negligible effect of the specific ordering of pooling steps. With an 8-second

long window (versus 6 seconds in [10]), a single pooling stage can be applied at the end with no effect on overall efficacy. Through this exhaustive search, [11] was able to improve OP performance beyond the original design. However, these results are based on necessarily limited parameter testing, constrained by time and feasibility and largely reliant on ignoring possible effects of interaction between parameter choices, highlighting the difficulties in optimizing feature extractions.

2.3 Machine Learning

Until recently, MIR research has taken the approach of designing algorithms to extract some explicit musical feature, using simple data models and distance measures for verification against ground truths (e.g. [10, 11]'s use of K-Nearest-Neighbor models with a Euclidean distance on OP features). However, for more complex musical characteristics, some in the field are turning their focus away from feature design to more sophisticated classification models and machine learning algorithms such as support vector machines [31-33], multi-layer perceptrons [34-36], and more recently deep network architectures [37, 38]. With the standardization of many feature designs such as chroma and MFCCs, among many others, these more advanced machine learning methods have been used to squeeze performance from these features or to extract more complex characteristics from sets of features. In this line of thought, rather than relying on some specific feature extraction method, the task is couched in terms of a data classification problem, which allows for leveraging learning algorithms to extract the relevant information based on a desired outcome.

2.3.1 Deep Networks for Feature Learning and Classification

[39] advocates giving learning algorithms, in particular deep network architectures, a more fundamental role in system development; with a sufficiently sophisticated learning algorithm, an optimally designed feature can be automatically extracted from a minimally processed input signal. This has the potential to solve several problems that have plagued MIR research for over a decade. Besides obviating the need to spend time rigorously testing algorithms in search of optimal designs and parameters, more importantly, it has the potential to capture musical characteristics that would otherwise be too complex or abstruse to formulate within a feature extraction signal chain.

Hand-crafted algorithms are necessarily limited by our own perceptual and technical abilities, and the approach that relies on these alone to explore the function space of

signal-to-feature mappings limits the range of possible solutions. As initially demonstrated in [40] for music information retrieval, deep network architectures can be used to this end for their ability to model high-order functions through system depth. By cascading multiple affine transformation layers of simpler nonlinear functions, they allow for a system complexity sufficient to model abstract musical characteristics. As [39] argues, using deep architectures to learn features for MIR follows naturally from the observation that many successful designed features in the literature can be described in terms of deep architectures themselves, combining multiple steps of affine transformations, non-linearities and pooling. Taking the now standard calculation of MFCCs as an example, the steps include: Mel-scale filterbank transform and discrete cosine projection (affine transformations); and log-scaling and complex modulus (nonlinearities). (A sketch at the end of this chapter makes this decomposition concrete.) Hence, from this perspective, the primary difference between feature designs is the choice of parameters. Further, given that these parameters can be optimized for a given task with deep networks, not only is it possible to learn better designs for features such as MFCCs, but this points to the prospect of learning better features altogether that are unconstrained by the specifics of implementation. In the two-step paradigm described above, the distinction between feature extraction and classification here becomes obscured: step one is reduced to preparing the data for input to step two, a deep network where each layer is a progressively more refined feature representation and the final output layer performs classification.

Deep architectures have found strong use in problems of feature learning for machine vision [41-44], but there has been relatively little research into this approach within the MIR community. Although SVMs, as well as the other more sophisticated learning algorithms mentioned above, have been used to improve classification rates for designed features, the efforts to learn the features themselves have been few. The initial successful uses of deep networks for music retrieval tasks in [40] and [45] show that learned features outperform MFCCs for genre classification and that sophisticated temporal pooling methods can be learned to incorporate multi-scale information for better results. Further use of deep networks in [38] shows that Convolutional Neural Nets, a specialized ANN deep network, can be successfully used for the task of chord recognition by extracting chord representations directly from several seconds of tiled pitch spectra. The positive results these approaches achieve are encouraging and justify further research into deep networks for feature design tasks such as rhythm similarity.

It is important to note that in these supervised learning schemes, the data used in training and classification plays a more fundamentally important role in feature design. With hand-crafted features, designs are based on some idealized concept of a given musical feature (i.e. tempo, timbre, pitch), and classification tasks serve merely as validation of

the design. However, if the feature itself is learned in the process of supervised training of a classification model, it is necessarily shaped by the relationship between class labels and signal attributes in the dataset used for training. This is a positive characteristic of the approach since, as mentioned, it unhinges the perceived musical characteristic from a pre-determined algorithm, but it also requires care and scrutiny when creating datasets or using pre-existing ones, as is common practice. Although research in unsupervised deep learning networks shows promise in reducing the reliance on large datasets [46], this work only considers fully-supervised methods.
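As a concrete illustration of the argument from [39] recounted above, the sketch below rewrites the standard MFCC chain as two fixed "layers", each an affine transformation with a point-wise nonlinearity between them. The librosa/scipy calls and parameter values are illustrative assumptions, not the thesis's code.

```python
import numpy as np
import scipy.fft
import librosa

def mfcc_as_deep_net(power_spec, sr=22050, n_fft=1024, n_mels=40, n_mfcc=13):
    """MFCCs recast as a two-layer net: affine -> log nonlinearity -> affine."""
    # 'Layer 1' affine: the Mel filterbank is a fixed weight matrix.
    W_mel = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel = W_mel @ power_spec                    # shape (n_mels, n_frames)
    log_mel = np.log(mel + 1e-10)               # point-wise nonlinearity
    # 'Layer 2' affine: the discrete cosine projection is another fixed matrix.
    W_dct = scipy.fft.dct(np.eye(n_mels), axis=0, norm="ortho")[:n_mfcc]
    return W_dct @ log_mel                      # shape (n_mfcc, n_frames)

# Hypothetical usage on a power spectrogram:
# y, sr = librosa.load("track.wav", sr=22050)
# S = np.abs(librosa.stft(y, n_fft=1024)) ** 2
# mfccs = mfcc_as_deep_net(S)
```

Viewed this way, a learned network differs from the MFCC chain mainly in that its weight matrices are fit to data rather than fixed by design.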

Chapter 3
Approach

Based on the observations discussed in the previous chapter, this chapter presents a novel variation of the Onset Pattern approach. By treating the pooling and normalization stages of feature extraction as layers of a deep learning network, these stages can be optimized for the task of genre classification. In this way, the post-processing and pooling steps that are infeasible to optimize manually can be learned as an extension of the Onset Pattern feature in this deep architecture context. Once trained, this transformation is applied independently to all track-wise onset patterns and the outputs are averaged over time, yielding a summary representation for an entire track.

3.1 Onset Patterns Implementation

The OP calculation here generally follows the processes outlined in [10] and [11], but for this application, the calculation is simplified by removing several post-processing steps. Operating on mono recordings sampled at 22050 Hz, log2-frequency DFTs are taken over 1024-sample windows with a hop size of 256 samples. Frequencies span six octaves beginning at 150 Hz. The frequency resolution of this transform is kept variable to test optimal resolution levels in later experiments. Multi-band novelty functions are generated by computing spectral flux, removing the mean and half-wave rectifying the result. From here, eight-second-long windows of these novelty functions are analyzed at 0.5-second intervals to extract a periodicity spectrum by applying another log2-DFT spanning five octaves beginning at 0.5 Hz. This corresponds to a beats-per-minute range of 30 to 960 BPM, referred to here as the periodicity range. As with the log2-DFT used in the frequency multi-band calculation, the periodicity resolution is left as a variable. This gives a frame matrix with dimensions (F, P), where F is the number of frequency bins and P is the number of periodicity bins.
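The following condenses Section 3.1 into a runnable sketch. The exact log2-DFTs of the original are approximated here (a linear-frequency STFT interpolated onto log2-spaced bins, and a periodicity DFT evaluated directly on a log2-spaced grid), but the stated constants are preserved: 1024-sample windows, 256-sample hop, six octaves from 150 Hz, 8-second periodicity windows every 0.5 seconds, and five periodicity octaves from 0.5 Hz.

```python
import numpy as np
import librosa

def onset_patterns(y, sr=22050, F=240, P=15):
    """Sketch of Section 3.1: novelty functions -> periodicity spectra -> (F, P) frames.
    Expects at least ~8 seconds of mono audio at 22050 Hz."""
    # 1) Time-frequency transform, mapped onto F log2-spaced bins over six
    #    octaves beginning at 150 Hz.
    S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
    lin_f = librosa.fft_frequencies(sr=sr, n_fft=1024)
    log_f = 150.0 * 2.0 ** np.linspace(0, 6, F)
    bands = np.array([np.interp(log_f, lin_f, frame) for frame in S.T]).T  # (F, T)

    # 2) Multi-band novelty: spectral flux, mean removal, half-wave rectification.
    flux = np.diff(bands, axis=1)
    novelty = np.maximum(flux - flux.mean(axis=1, keepdims=True), 0.0)

    # 3) Periodicity: log2-spaced DFT over 8 s windows every 0.5 s, five
    #    octaves from 0.5 Hz (30-960 BPM), P bins.
    fps = sr / 256.0                                   # novelty frame rate
    win, hop = int(8 * fps), int(0.5 * fps)
    periods = 0.5 * 2.0 ** np.linspace(0, 5, P)        # Hz
    t = np.arange(win) / fps
    basis = np.exp(-2j * np.pi * periods[:, None] * t[None, :])  # (P, win)

    frames = []
    for start in range(0, novelty.shape[1] - win + 1, hop):
        seg = novelty[:, start:start + win]
        frames.append(np.abs(seg @ basis.T))           # (F, P) magnitudes
    return np.stack(frames)                            # (n_frames, F, P)
```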

3.2 Deep Network Implementation

For feature learning and classification, this research makes heavy use of Eric Humphrey's in-development deep learning network Python libraries, informally presented in [47]. Formally, deep networks transform an input $Z_1$ into an output $Z_L$ through composition with nonlinear functions $f_l(\cdot|\theta_l)$, where $l \in \{1, \dots, L\}$ and $L$ indicates the total layer depth. For each layer, $Z_{l-1}$ is the input to function $f_l$ with parameters $\theta_l$. The network is composed of affine transformations, or fully-connected layers, where the outputs from one layer are distributed fully over the inputs to the next layer. Precisely:

$$F(Z_1) = f_L(\dots f_2(f_1(Z_1|\theta_1)|\theta_2)\dots|\theta_L) \qquad (3.1)$$

where $F = [f_1, f_2, \dots, f_L]$ is the set of layer functions, $\Theta = [\theta_1, \theta_2, \dots, \theta_L]$ is the corresponding set of layer parameters, the output of one layer is passed as the input to the next as $f_l(Z_l) = Z_{l+1}$, and the overall depth of the network is given by $L$. Each layer $f_l$ is a fully-connected, or affine, transformation, defined by the following:

$$f_l(Z_l|\theta_l) = h(W_l Z_l + b_l), \quad \theta_l = [W_l, b_l] \qquad (3.2)$$

Here, the input $Z_l$ is flattened to a column vector of length $N_l$ and the dot product is computed with a weight matrix $W_l$ of shape $(M_l, N_l)$, followed by an additive vector bias term $b_l$ of length $M_l$. Note that an affine layer transforms an $N_l$-dimensional input to an $M_l$-dimensional output, referred to as the width of the layer. The final operation is a point-wise nonlinearity, $h(\cdot)$, defined here as $\tanh(\cdot)$, which is bounded on $(-1, 1)$.

When used as a classification system, the first $L-1$ layers of a deep network can be viewed as feature extractors, and the last layer, $f_L$, is simply a linear classifier. This output can be forced to behave as a probability mass function for membership to a given class by making the length of $Z_L$ match the number of classes and by constraining the $L_1$-norm of the output to equal 1. This probability mass function $P(\cdot)$ for an input $Z_1$ is achieved by applying the softmax operation $\sigma(\cdot)$ to the output of the network, $Z_L$, defined as follows:

$$P(Z_1) = \sigma(Z_L) = \frac{\exp(Z_L)}{\sum_{m=1}^{M_L} \exp(Z_L[m])} \qquad (3.3)$$
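The following is a compact NumPy rendering of equations 3.1-3.3 (the actual experiments use Humphrey's library [47]; this sketch only mirrors the math). Here `params` is a hypothetical list of (W_l, b_l) pairs; hidden layers apply tanh, and the final layer is the linear classifier plus softmax described above.

```python
import numpy as np

def affine_layer(Z, W, b):
    """Equation 3.2: f_l(Z | W_l, b_l) = tanh(W_l Z + b_l)."""
    return np.tanh(W @ Z + b)

def softmax(z):
    """Equation 3.3: turn the final pre-activation Z_L into class probabilities."""
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

def forward(Z1, params):
    """Equation 3.1: compose the layers; the last layer is linear + softmax."""
    Z = np.ravel(Z1)                   # flatten the input to a vector (length N_1)
    for W, b in params[:-1]:           # the first L-1 layers extract features
        Z = affine_layer(Z, W, b)
    W_L, b_L = params[-1]              # final layer: linear classifier
    return softmax(W_L @ Z + b_L)      # P(Z_1): probability mass over classes

# Example shapes for the two-layer network of Chapter 4 (N_1 = 3600, M_1 = 2048,
# M_L = 10 classes):
# rng = np.random.default_rng(0)
# params = [(0.01 * rng.standard_normal((2048, 3600)), np.zeros(2048)),
#           (0.01 * rng.standard_normal((10, 2048)), np.zeros(10))]
# probs = forward(rng.standard_normal((240, 15)), params)
```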

In this supervised learning implementation, the output $Z_L$ of this final layer is used to make a prediction, where the most likely class is determined by $\arg\max(P(Z_1))$, which, with a provided target value $y$, can be combined into a loss function. With the network defined as a probability mass function for class membership, it can be trained by iteratively minimizing this loss function using the negative log-likelihood of the correct class for a set of $K$ observations:

$$\mathcal{L} = -\sum_{k=0}^{K} \log(P(X_k = Y_k)) \qquad (3.4)$$

where $X_k$ and $Y_k$ are the input data and corresponding class label, respectively, of the $k$-th observation. This loss function can then be minimized through gradient descent, which iteratively searches for the minimum value of the loss function. Here, gradients are computed with $K > 1$, but much smaller than the total number of observations, by sampling data points from the training set and averaging the loss over the batch. Specifically, the update rule for $\Theta$ is defined as its difference with the gradient of the scalar loss $\mathcal{L}$ with respect to the parameters, weighted by the learning rate $\eta$, given by the following:

$$\Theta \leftarrow \Theta - \eta \frac{\partial \mathcal{L}}{\partial \Theta} \qquad (3.5)$$

$K = 100$ is used, where the observations are drawn uniformly from each class, i.e. 10 observations of each genre, and a constant learning rate of $\eta = 0.1$. Learning proceeded for 3,000 iterations without early stopping or model selection. Note that all input data is preprocessed before input to the network to have zero mean and unit variance. This is done by calculating the mean and standard deviation over all data points, and was shown to significantly improve system performance.

3.2.1 Applying the Deep Network

Unlike previous classification schemes for rhythm similarity methods, track-level aggregation is held off until after frame-wise classification. Here, the deep network is applied independently to a time-series of onset patterns, producing a posteriorgram. Though there are alternative statistics that could be explored, such as the median or maximum, mean aggregation is taken for each class prediction over the entire track.
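Below is a sketch of this training procedure for the single-layer case (multi-class logistic regression, the baseline classifier of Section 4.2); the deeper networks compute the same update via backpropagation through each layer. The class-balanced batching (K = 100, 10 per genre), learning rate η = 0.1, iteration count and input standardization follow the text; initialization and helper names are illustrative assumptions.

```python
import numpy as np

def train_baseline(X, y, n_classes=10, eta=0.1, K=100, iters=3000, seed=0):
    """Minibatch SGD on the negative log-likelihood (eqs. 3.4-3.5), single layer.
    X: (n_obs, n_dims) inputs; y: integer class labels in [0, n_classes)."""
    rng = np.random.default_rng(seed)
    # Standardize over all data points: zero mean, unit variance (as in the text).
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-8
    X = (X - mu) / sd
    W = 0.01 * rng.standard_normal((n_classes, X.shape[1]))
    b = np.zeros(n_classes)
    for _ in range(iters):
        # Class-balanced batch: K/n_classes observations drawn from each class.
        idx = np.concatenate([rng.choice(np.flatnonzero(y == c), K // n_classes)
                              for c in range(n_classes)])
        Z = X[idx] @ W.T + b
        P = np.exp(Z - Z.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        P[np.arange(len(idx)), y[idx]] -= 1.0        # dL/dZ = P - onehot(y)
        W -= eta * (P.T @ X[idx]) / len(idx)         # eq. 3.5 update for W
        b -= eta * P.mean(axis=0)                    # eq. 3.5 update for b
    return W, b, (mu, sd)

def track_posterior(op_frames, W, b, stats):
    """Section 3.2.1: classify each OP frame, then mean-aggregate the posteriorgram."""
    mu, sd = stats
    Z = ((op_frames.reshape(len(op_frames), -1) - mu) / sd) @ W.T + b
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P.mean(axis=0)   # track-level class probabilities; argmax is the label
```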

3.3 Analytic Approach

Chapter 2 highlights two connected issues that have prompted the analysis and discussion approach taken in this research. The first issue concerns the practice of genre identification as a proxy task for rhythm similarity. As mentioned, genre classification is the de facto proxy task for verifying rhythm similarity measures, and there remains a dearth in the literature of: 1) in-depth analysis of the suitability of genre for the given feature; 2) informed explications of the assumptions made in system design; and 3) proper examination of classification results that fully takes into account the contents of the dataset used. The facile assumptions made in system verification and the face-value interpretations of classification results commonly accepted belie either a general lack of commitment to, or naiveté about, musical relevance among researchers. As explored in [48] and stated confidently enough to be used as its title: "Classification Accuracy is Not Enough".

The second issue concerns the effect of the dataset on learned features in rhythm similarity. As discussed at the end of Section 2.3.1, in a deep network, features are learned based on provided labeled training examples. Hence, the feature's characteristics are molded by the class representations in the dataset. Though desirable if working with an ideal dataset for the task, in the case of this research, which uses genre membership as a proxy for rhythm similarity, there may be unintended (i.e. not rhythmically relevant) influences on the feature representation.

In an effort to better understand the musical significance of this rhythm similarity research beyond classification score, and in an attempt to account for these various factors, a multi-disciplinary approach is taken here to examine the results, the dataset, and the system design. In addition to standard machine learning, MIR and statistical analysis methods, results are examined through rhythmic, musico-cultural and historical analyses, employing personal domain knowledge and borrowing heavily from the related musicological literature.

Chapter 4
System Configuration and Results

4.1 Dataset

In keeping with standard methods, a genre classification task is used to evaluate this measure of rhythm similarity, utilizing the well-known Latin Music Dataset (LMD). The LMD is a collection of Latin dance music comprising 3216 tracks¹, split into 10 distinct genres: Axé, Bachata, Bolero, Forró, Gaúcha, Merengue, Pagode, Salsa, Sertaneja and Tango. The LMD is used here for several reasons: for this dance-based dataset, genre is assumed to serve as a good proxy for rhythm; the size of the LMD compares favorably to other, smaller dance-based datasets such as the Ballroom set, a requisite for supervised deep-learning tasks; and, perhaps most importantly, this research stems from a deeper interest in Latin music in general. Based on the idea that domain knowledge is important to the development and analysis of computational music similarity measures, personal knowledge of and interest in the subject is leveraged for the analyses in Chapters 5-7. Though the LMD provides full recordings, many of the tracks are from live shows and contain non-musical segments (e.g. clapping, spoken introductions). To reduce this noise, only the middle 30 seconds of each track are used for analysis.

¹ Though the original LMD has 3,227 total recordings, duplicates and tracks that were too short in duration for analysis have been removed.

4.2 Methodology

The following experiments seek to identify the optimal system configuration for genre classification on the LMD. These experiments are broken into two parts: the first concerns the resolution of the OP, and the second concerns the complexity of the feature-learning stages of the network. For the OP, the best general feature space is desired: one that is maximally informative while avoiding over-representation, which can slow down, and even hinder, classification. Various OP resolutions are examined by testing values for frequency bins (F) and periodicity bins (P) as independent factors. Subsequent network tests seek to design a network that appropriately fits the complexity of the task. System complexity is determined by layer depth (L) and layer output size (M); several combinations of values for these parameters are examined. For baseline classification, the system defined in Section 3.2 with a single-layer network is used, which is simply multi-class linear regression. This is the classifier used for all OP parameter tests. Scores for all classification tests are averaged over 10 cross-validated folds, stratified against genre.
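A sketch of this evaluation protocol, under the assumption of scikit-learn for the stratified folds; `fit_fn` and `predict_fn` are hypothetical stand-ins for training and applying the classifier.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def middle_30s(y, sr=22050):
    """Keep only the middle 30 seconds of a track (the Section 4.1 noise reduction)."""
    n = 30 * sr
    start = max(0, (len(y) - n) // 2)
    return y[start:start + n]

def crossval_accuracy(features, labels, fit_fn, predict_fn, n_folds=10, seed=0):
    """Mean accuracy over 10 folds, stratified against genre (Section 4.2)."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(features, labels):
        model = fit_fn(features[train_idx], labels[train_idx])
        predictions = predict_fn(model, features[test_idx])
        scores.append(np.mean(predictions == labels[test_idx]))
    return float(np.mean(scores))
```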

4.3 OP Parameter Tuning

Initial tests begin on an OP with F = 30 and P = 25, based on results in [11], taking the minimal dimensions that were shown to perform well.

4.3.1 Periodicity Resolution

Over the seven tested OP configurations, with P in the range [5, 100], P = 15 provides the best results. An analysis of variance test on classification scores shows that periodicity resolution plays a significant role in the outcome. This is indicated by a Prob>F value less than 0.05, as can be seen in Table 4.1. After applying a Tukey HSD adjustment, a comparison of means (Figure 4.1) presents a clear trend, with significantly lower scores for OPs with either too few or too many periodicity bins and the maximum classification rate obtained with P = 15. These tests, showing better results with fewer dimensions, differ from results in [11], but this disparity most likely arises from differences in data and classification strategy.

Source  | SS      | df | MS      | F    | Prob>F
Columns | 186.101 | 6  | 31.0168 | 9.21 | 3.11E-07
Error   | 212.231 | 63 | 3.3688  |      |
Total   | 398.332 | 69 |         |      |

Table 4.1: ANOVA results for classification scores with varying P values show that periodicity resolution is a significant factor.
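This test procedure can be reproduced in outline with scipy and statsmodels. The fold scores below are synthetic placeholders (the per-fold accuracies are not published in the text); only the group structure, seven P values with 10 folds each, matching the df = 6 and 63 of Table 4.1, is taken from the thesis.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
P_values = [5, 10, 15, 25, 50, 75, 100]
# Invented stand-ins for the 10 fold-wise accuracies per P value
# (7 groups x 10 folds gives the df = 6 / 63 structure of Table 4.1).
rough_means = {5: 83.0, 10: 87.5, 15: 88.3, 25: 87.8, 50: 87.0, 75: 86.0, 100: 84.5}
scores = {p: rng.normal(rough_means[p], 1.8, size=10) for p in P_values}

# One-way ANOVA: is periodicity resolution a significant factor?
F, prob = f_oneway(*scores.values())
print(f"F = {F:.2f}, Prob>F = {prob:.2e}")

# Tukey HSD adjustment for the pairwise mean comparisons of Figure 4.1.
flat = np.concatenate([scores[p] for p in P_values])
groups = np.repeat(P_values, 10)
print(pairwise_tukeyhsd(flat, groups, alpha=0.05))
```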

Figure 4.1: Effect of P on classification. The highest result is highlighted in blue, while significantly different results are in red.

4.3.2 Frequency Resolution

Setting P = 15 based on the above, F values are then tested in the range [18, 300]. An ANOVA test on these results shows a significant effect for this parameter, with Prob>F = 0.001, and, as can be seen in Figure 4.2, accuracy rates go up with higher frequency resolution, leveling out for $F \ge 240$. Results in [11] show minor but statistically insignificant improvements from increasing the OP frequency resolution; this is consistent with results here for $F \le 120$ and does not preclude the higher scores seen for $F > 120$. Based on these tests, going forward OPs are calculated with F = 240 and P = 15.

Figure 4.2: Effect of F on classification. The highest result is highlighted in blue, while significantly different results are in red.

4.4 Deep Network Parameterization

With optimal parameters for this feature set in place, the next step is finding the best network architecture for this data. Returning to the notation of Section 3.2, choices of layer width, $M_l$ for $l \le L-1$, and network depth, $L$, are explored here. Note that the input and output dimensionality are fixed as $N_1 = 240 \times 15$ and $M_L = 10$, due to the previous discussion and the number of classes in the dataset, respectively.

4.4.1 Layer Width

This parameter search begins with a two-layer network (L = 2), sweeping the width of the first layer, $M_1$, over increasing powers of 2 in the range [16, 8192]. Results demonstrate a performance pivot around $M_1 = 128$, achieving a maximum accuracy at $M_1 = 2048$ but otherwise insignificant variation for $M_1 \ge 128$. An ANOVA on these results shows significance for this factor (Prob>F = 0.015), but Figure 4.3 indicates minimal impact for $M_1 \ge 128$.

Figure 4.3: Mean comparison of ANOVA tests on network layer complexity in a 2-layer architecture shows significantly lower results for small M.

4.4.2 Network Depth

Based on the above, deeper architectures are considered by setting $M_l = 2048$ for $l \le L-1$ and incrementally adding layers to a maximum depth of L = 6. This fails to show any significant changes in accuracy, with an ANOVA test revealing a Prob>F of 0.3684, greater than the null-hypothesis threshold of 0.05. Importantly, while only a limited number of interactions between depth and width are explored, independently varying $L$ or $M_l$ over various values shows no significant difference provided $M_l \ge 128$, consistent with previous findings.

4.5 Optimal System Configuration

Further tests continue with a two-layer architecture ($L = 2$, $M_1 = 2048$), based on the parameters used for the best score in Figure 4.3, expressed completely by the following:

$$P(X_1) = \sigma(f_2(f_1(X_1|\theta_1)|\theta_2)) \qquad (4.1)$$

For clarity, the dimensionality of the first layer, $f_1$, is given by $(M_1 = 2048, N_1 = 3600)$, and the dimensionality of the second by $(M_2 = 10, N_2 = 2048)$.

Feature                        | Accuracy (%)
LPQ (Texture Descriptors) [49] | 80.78
OP (Holzapfel) [11]            | 81.80
Mel Scale Zoning [50]          | 82.33
OP (Proposed)                  | 91.32

Table 4.2: Classification accuracies for different features on the LMD.

Genre     | Total per Genre | Correctly Predicted | % Correct
Merengue  | 314             | 309                 | 98.41
Tango     | 407             | 400                 | 98.28
Bachata   | 312             | 304                 | 97.44
Pagode    | 306             | 288                 | 94.12
Salsa     | 309             | 286                 | 92.56
Axé       | 313             | 284                 | 90.73
Bolero    | 314             | 278                 | 88.54
Gaúcha    | 309             | 264                 | 85.44
Forró     | 312             | 260                 | 83.33
Sertaneja | 320             | 264                 | 82.50
Total     | 3216            | 2937                | 91.32

Table 4.3: Classification accuracies by genre, ordered from highest classification score to lowest, show Brazilian genres generally performing worse than the rest.

With this configuration, classification on the LMD yielded a peak average score of 91.32%, which surpasses previous attempts at genre classification on this dataset. Table 4.2 shows the proposed approach outperforming the others by a margin of more than 8%.

One trend that is immediately apparent in the results is a difficulty in classifying Brazilian genres. Table 4.3, with genre-wise classification accuracies ordered from highest to lowest, shows Axé, Gaúcha, Forró and Sertaneja, all Brazilian genres, occupying four of the five bottom slots. Also, when looking at class-by-class confusions, as shown in Table 4.4, certain affinities between genres are apparent. The lowest-scoring Sertaneja has the majority of its false tags predicted as Bolero, but also many predicted as Gaúcha and Forró, while the next three lowest-performing classes, Gaúcha, Forró and Bolero, have most of their false tags predicted as Sertaneja. These trends in class confusions will be expanded on in subsequent chapters.

The increase in accuracy over previous attempts may be partially explained by differences in methodology (i.e. aggregation strategies, signal noise reduction, etc.), but the strength of this deep-network strategy for classification plays a significant role here. Its effect can be seen in Table 4.5: comparing the proposed approach to simpler classification methods on the same OP input, the former outperforms the rest by a margin of 2.23%.