Analysis of Trumpet Tone Quality Using Machine Learning and Audio Feature Selection


Analysis of Trumpet Tone Quality Using Machine Learning and Audio Feature Selection

Trevor Alexander Knight

Master of Engineering
Electrical and Computer Engineering
McGill University, Montreal, Quebec
December 2011

A thesis submitted to McGill University in partial fulfilment of the requirements for the degree of Master of Engineering.

© Trevor Alexander Knight

ACKNOWLEDGEMENTS

The infinite kindness and wisdom of Ichiro Fujinaga was indispensable for this thesis. Finn Upham provided ideas, feedback, and data analysis that were crucial to the initial experiment and its publication, which forms the basis of Chapter 4. The patient guidance and knowledge of J. Ashley Burgoyne was incredibly helpful for properly executing feature selection. The advice and supervision of Mark Coates was instrumental in making this thesis possible. Guillaume Boutard and Mathieu Bergeron provided help with translating the abstract. The authors of jMIR, WEKA, and the Timbre Toolbox created great research tools that were extremely useful for this project. None of the work would have been possible without the existence and support of the Centre for Interdisciplinary Research in Music Media and Technology (CIRMMT) and the help of the technical staff, Julien Boissinot, Harold Kilianski, and Yves Méthot.

ABSTRACT

This work examines which audio features, the components of recorded sound, are most relevant to trumpet tone quality by using classification and feature selection. A total of 10 trumpet players with a variety of experience levels were recorded playing the same notes under the same conditions. Twelve musical instrumentalists listened to the notes and provided subjective ratings of the tone quality on a seven-point Likert scale to provide training data for classification. The initial experiment verified that there is statistical agreement between human raters on tone quality and that it was possible to train a support vector machine (SVM) classifier to identify different levels of tone quality, with 72% classification accuracy with the notes split into two classes and 46% when using seven classes. In the main experiment, different types of feature selection algorithms were applied to the 164 possible audio features to select high-performing subsets. The baseline set of all 164 audio features obtained a classification accuracy of 58.9% with seven classes tested with cross-validation. Ranking, sequential floating forward selection, and genetic search produced accuracies of 43.8%, 53.6%, and 59.6% with 20, 21, and 74 features, respectively. Future work in this field could focus on more nuanced interpretations of tone quality or on the applicability to other instruments.

ABRÉGÉ

Ce travail examine les caractéristiques acoustiques, c.-à-d. les composantes de l'enregistrement sonore, les plus pertinentes pour la qualité du timbre de trompette à l'aide de la classification automatique et de la sélection de caractéristiques. Un total de 10 joueurs de trompette de niveau varié, jouant les mêmes notes dans les mêmes conditions, a été enregistré. Douze instrumentistes ont écouté les enregistrements et ont fourni des évaluations subjectives de la qualité du timbre sur une échelle de Likert à sept points afin de fournir des données d'entraînement du système de classification. La première expérience a vérifié qu'il existe une corrélation statistique entre les évaluateurs humains sur la qualité du timbre et qu'il était possible de former un système de classification de type machine à vecteurs de support pour identifier les différents niveaux de qualité du timbre, avec une précision de classification de 72% pour les notes divisées en deux classes et de 46% lors de l'utilisation de sept classes. Dans l'expérience principale, différents types d'algorithmes de sélection de caractéristiques ont été appliqués aux 164 caractéristiques audio possibles pour sélectionner les sous-ensembles les plus performants. L'ensemble de toutes les 164 caractéristiques audio a obtenu une précision de classification de 58,9% avec sept classes testées par validation croisée. Les algorithmes de ranking, de sequential floating forward selection et de recherche génétique produisent une précision respective de 43,8%, 53,6% et 59,6% avec 20, 21 et 74 caractéristiques. Les futurs travaux dans ce domaine pourraient se concentrer sur des interprétations plus nuancées de la qualité du timbre ou sur l'applicabilité à d'autres instruments.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
ABRÉGÉ
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

1 Introduction
  1.1 Overview

2 Classification, Regression, and Features
  2.1 Classification and regression
  2.2 Audio features
    2.2.1 Types of features
    2.2.2 Implementation considerations
    2.2.3 Software
  2.3 Feature selection
    2.3.1 Wrapper versus filter methods
    2.3.2 Examples of wrapper algorithms
    2.3.3 Evaluating feature selection

3 Related Work
  3.1 Related literature on timbre
  3.2 Related literature on trumpet tone quality
  3.3 Previous machine learning applied to audio signals

4 Initial Experiment
  4.1 Recordings
  4.2 Ratings
  4.3 Audio features
  4.4 Classification
    4.4.1 Classifier choice
    4.4.2 Data subsets tested
    4.4.3 Other tests
  4.5 Results
  4.6 Discussion

5 The Main Experiment
  5.1 Changes to the data collection
  5.2 New dataset
  5.3 Audio feature extraction
  5.4 Verifying the results of the initial experiment

6 Feature Selection and Classification Results with those Features
  6.1 Slight modification to the classifier
  6.2 Selection and test methods
  6.3 Features and feature sets
    6.3.1 Baseline feature sets
    6.3.2 Ranking
    6.3.3 Sequential floating forward selection with AIC
    6.3.4 Genetic search algorithm
    6.3.5 Other tests
  6.4 Discussion
    6.4.1 Results from the baseline feature sets
    6.4.2 Results from ranking with Relief
    6.4.3 Results from sequential floating forward selection
    6.4.4 Results from genetic search
    6.4.5 Global lessons
    6.4.6 Selected features

7 Conclusions and Future Work

A Trumpet Players and Trumpets

B Features extracted
  B.1 Initial Experiment
  B.2 Main Experiment

REFERENCES
INDEX

LIST OF TABLES

4-1  Initial two-class results
4-2  Confusion matrix for two classes
4-3  Confusion matrix for two classes
4-4  Confusion matrix for two classes
4-5  Initial three-class results
4-6  Confusion matrix for three classes
4-7  Initial seven-class results
4-8  Confusion matrix for seven classes
4-9  Leave-one-player-out test
4-10 Performer identification confusion matrix
5-1  Spearman correlation of ratings
5-2  Full dataset two-class results
5-3  Confusion matrix for two classes
5-4  Confusion matrix for two classes
5-5  Confusion matrix for two classes
5-6  Initial three-class results
5-7  Confusion matrix for three classes
5-8  Seven-class results for the full dataset
5-9  Confusion matrix for seven classes
5-10 Confusion matrix for player identification of all players
5-11 A comparison of the classifier accuracy in the initial dataset and the full dataset
6-1  Abbreviations of input representations for feature extraction
6-2  Baseline classification results
6-3  Leave-player-out accuracy
6-4  Features selected by Relief ranking
6-5  Classification results from Relief ranking
6-6  Sequential floating forward selected features from all possible
6-7  Sequential floating forward selected features from audio features
6-8  Summary of classification results
A-1  List of trumpet players recorded

LIST OF FIGURES

4-1 Initial rating interface
4-2 Histogram for the initial experiment
4-3 Each player's contribution for the initial experiment
4-4 Two-class distribution from initial experiment
5-1 Final rating interface
5-2 Histogram for the full dataset
5-3 Distribution of ratings for each player
5-4 Distribution of ratings from each rater
5-5 Comparison of distributions of the training and test sets

LIST OF ABBREVIATIONS

ARFF: attribute-relation file format (a WEKA file format)
CIRMMT: Centre for Interdisciplinary Research in Music Media and Technology
DSP: digital signal processing
ERB: equivalent rectangular bandwidth
FFT: fast Fourier transform
GUI: graphical user interface
ISMIR: International Society for Music Information Retrieval
kNN: k-nearest neighbour
MDS: multidimensional scaling
MIR: music information retrieval
MIREX: Music Information Retrieval Evaluation exchange
MPEG: Moving Picture Experts Group (a standards organization)
PCA: principal component analysis
SFFS: sequential floating forward selection
STFT: short-time Fourier transform
SVM: support vector machine
WEKA: Waikato Environment for Knowledge Analysis

CHAPTER 1
Introduction

1.1 Overview

Timbre is frequently defined as the difference between two sounds of the same pitch and loudness. While the physical phenomena that contribute to pitch and loudness are well established, the components of timbre are enigmatic. Pitch is related to the frequency of a sound and describes how high (high frequency) or low (low frequency) a sound is [1]. Loudness refers to the amplitude or strength of a sound. By being defined by what it is not, timbre becomes a vague term that accounts for many differences in the perceptions of a sound and can be difficult to understand [2]. Tone quality is a more specific term that refers to a subjective, qualitative timbre for a specific instrument, that is to say, how good a performance sounds. Although expert musicians and music teachers can identify good or bad tone quality and even provide some descriptors or causes of good tone quality [3, 4, 5], what constitutes good or bad tone within the data of a recorded audio signal is largely unknown. The research presented here combines several research fields for an investigation into what audio features, or descriptors of a recorded signal, are relevant for assessing tone quality. This knowledge could be applied to several fields, for example to research on human perception of sound to investigate preferences for pleasant or unpleasant sound.

More directly, the results of this work could be used to give feedback to student musicians on their tone quality by analyzing it during performance. This research represents a first step towards these goals. In order to do this, however, the research focuses only on long tones from a trumpet to provide a meaningful but manageable dataset with which to begin tone-quality feature research.

The general method used for this research was as follows: ten trumpet players with a wide range of experience and training were recorded playing the same twelve long tones. Then subjective judgements on the tone quality of each note were collected from expert brass players. These data were used as labels to train classifiers and to perform feature selection in order to analyze the data.

The following chapter provides the technical background on classification, audio features, and audio feature selection. Chapter 3 examines the basics of timbre and tone quality, previous research in timbre and tone quality, and applications of classifiers to audio. Chapter 4 describes the initial experiment to examine whether humans agreed on tone quality and whether it was possible to train a classifier to identify tone quality. Chapter 5 describes the collection of a much larger dataset and the confirmation of the initial experiment's results. Chapter 6 describes several experiments to determine the best features for tone quality and discusses the results. Lastly, Chapter 7 discusses conclusions of the work and future directions.

CHAPTER 2
Classification, Regression, and Features

2.1 Classification and regression

Classification is the application of computer algorithms to assign classes or labels. For example, in music information retrieval, genre classification is a common problem: given an audio recording, the algorithm seeks to assign a classification to it such as rock or classical. A related machine learning concept is regression, which seeks to assign a numeric value rather than a discrete class. Regression problems can also be called continuous-class problems, as most of the principles are the same as classification barring the nature of the label data. An explanation of the specialized vocabulary of classification at this point is both necessary and provides an overview of the basics of classification.

Classes. The classes or labels are the possible results of the classification algorithm. For example, in genre classification, classes could be names such as classical, rock, or solo cello. In this research, class labels are such things as good tone quality or bad tone quality. The full breakdown of classes in this research is explained later.

Instances. In the parlance of classification, instances refer to each example used. In genre classification, each song would be an instance. In this research, each recorded note is a separate instance.

Features. Each instance is represented by a set of features. This is the only data that is given to the classifier to make a decision. In genre classification and in this research, each instance is recorded audio, but the recording itself is not processed by the classifier. Instead, a set of numerical descriptors of the content of the audio, known as audio features, is given to the classifier. In this research, for example, one feature might be the average pitch.

Classifier. The classifier is an algorithm that attempts to assign classes to new, unseen instances based on instances whose correct classification is known.

Training set. For any classifier to function, it needs to first have several examples of correct classification on which to base the new predictions. The training set is that set.

Test set. The test set is the set of instances whose labels are unknown to the classifier, but known to the researcher. As an easily executed and common measure of classifier performance, the success rate of the classifier can be determined by comparing the assigned classes and the known, correct classes.

Cross-validation. Cross-validation is a technique used to test the performance of a classifier while allowing the researcher to use all of a dataset in the roles of both training and testing. In order to do this, all of the experimental data is first randomly assigned to a fixed number of groups or folds. One of the folds is then left out to be used as the test set and all the other folds are used as training data. Each fold is successively left out to be used as the test set and, in this way, each instance is used exactly once as testing data but never at the same time that it is a training instance.
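As an illustration, the following minimal sketch shows five-fold cross-validation of a linear SVM in scikit-learn; the feature matrix X and labels y here are randomly generated stand-ins, not the data from this thesis.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))    # 100 instances, 16 features
y = rng.integers(0, 2, size=100)  # two classes

# cv=5 assigns the data to five folds; each fold serves exactly once
# as the test set while the remaining four are used for training
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)
print(scores.mean())              # average success rate across the folds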

Feature selection. Feature selection is the process of choosing a subset of all possible features to use in a classification task. Feature selection can potentially both reduce the memory needed to store large feature sets and improve classifier accuracy. Even for well-defined problems, however, selecting features manually or automatically can be complicated.

2.2 Audio features

Audio feature extraction is the process of extracting information from the data points of an audio file using digital signal processing (DSP). It is a commonly used first step for audio analysis in the field of music information retrieval (MIR), as it reduces large audio files to simple numeric descriptions of content. Audio features can also be referred to as parameters or descriptors, particularly in the field of synthesis analysis and in the MPEG-7 audio standard, respectively. Audio feature extraction is a crucial first step for classification as it allows comparison of audio files. The actual data points of two recordings could vary greatly depending on the specifics of the analog-to-digital conversion even if the source material was the same. Audio features allow higher-level descriptions of the content, and therefore similar audio will have similar features.

2.2.1 Types of features

There are several ways to categorize types of features. First is the dimensionality of the extracted feature. For example, the sample rate is a single scalar for the whole file, or a global feature, while the energy of the signal is a time-varying feature that has one point for each window along the whole file but is one-dimensional. The short-time Fourier transform (STFT) coefficients are time-varying but also multi-dimensional, as there is a vector of numbers for every window.

These different types of features have different output forms, which will be more efficiently processed and stored by different software, a reason to be aware of them. As well, features can be grouped into three conceptual categories: features, metafeatures, and aggregators [6]. Features are extracted from the signal itself. Metafeatures take base features and produce a new feature, for example, the running average of pitch. Metafeatures are still time-dependent vectors. Aggregators condense and combine one or more feature arrays down to smaller dimensionality. For example, mean, median, standard deviation, and inter-quartile range are simple aggregators. Awareness of dependencies between features such as these allows more efficient computation. Lastly, features can be grouped based on their equivocality. Many audio features have one clear, accepted definition, equation, or algorithm for their extraction, for example, energy or RMS amplitude. Other features have several acceptable extraction methods with varying results. For example, note onsets are an audio feature with several methods for labelling them, with varying degrees of success. In fact, as evidenced by the annual Music Information Retrieval Evaluation exchange (MIREX), different methods perform better or worse for different classes of instruments. In order to use audio features for music information retrieval tasks, one must be aware of the decisions made in the implementation of the feature extraction software that may impact results.
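As a concrete sketch of these categories, the NumPy fragment below computes a time-varying base feature (RMS energy per window), a metafeature (its running average), and two aggregators (median and inter-quartile range); the signal is synthetic and the window length is an arbitrary choice for illustration.

import numpy as np

signal = np.random.default_rng(1).normal(size=48000)      # 1 s at 48 kHz
frames = signal[: len(signal) // 512 * 512].reshape(-1, 512)

rms = np.sqrt((frames ** 2).mean(axis=1))                 # base feature
running_avg = np.convolve(rms, np.ones(8) / 8,            # metafeature:
                          mode="valid")                   # running average
median = np.median(rms)                                   # aggregators:
iqr = np.subtract(*np.percentile(rms, [75, 25]))          # single numbers

Computing the metafeature and both aggregators from the one stored rms array, rather than re-reading the audio, is the kind of recycling of basic features performed by jAudio (Section 2.2.3).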

2.2.2 Implementation considerations

While many of the parameters of feature extraction software have default values to allow quick and general-purpose use, to obtain the best results, a minimum knowledge of the effects of the different settings is recommended.

Input representations

At the most basic level is the representation or manipulation of the signal from which features will be extracted. For example, in the Timbre Toolbox [7], discussed in Section 2.2.3, there are seven different input representations that can be used for extracting features: the audio signal itself, the temporal energy envelope, the magnitudes of the short-time Fourier transform (STFT) coefficients, the power of the STFT, the equivalent rectangular bandwidth (ERB) gammatone, the ERB FFT, and sinusoidal harmonic partials. In their paper, Peeters et al. [7] discuss the relative merits of these representations but find that when the same end features are extracted from different input representations, the features are still highly correlated and therefore likely redundant.

Window considerations

As mentioned above, there are global audio features that provide a single summary number for the whole duration of audio. For example, the duration, the sample rate, or the file size would each be a single number. Many features, however, return a value that changes over time; these are called time-varying features. These values are extracted using a moving window (also called a frame) for analysis. This means that only a portion of the signal is analyzed at a time, and a value, dependent only on the data within the window, is produced for each window position.

As a result, audio feature extraction software typically requires three parameters to be set regarding the window. First is the window shape or function, which determines the way the audio signal is cut into smaller portions for analysis. It is a scaling factor applied to the points within the window. For example, a rectangular window has uniform weighting across the whole window and merely provides a section of data of the given length. A rectangular window may be suitable for some algorithms, but for any audio feature that operates in the frequency domain, a rectangular window causes spectral leakage and therefore distorts the spectral content of the signal. Other window functions exist that attempt to trade off between avoiding spectral leakage and appropriate dynamic contrast [8]. The most common windows are the Hann and Hamming windows, which balance these two concerns. They also taper towards zero at the ends to avoid discontinuities, and when windows are appropriately recombined, they allow perfect reconstruction of an audio signal.

The window size or length is the number of sample points in each window. Because of the computational savings of fast Fourier transform (FFT) algorithms over other spectral transforms, and because the FFT is most efficient with numbers of samples that are powers of two (128, 256, 512, etc.), window lengths are often a power of two. When choosing a window length, there is a distinct trade-off between spectral resolution and temporal accuracy. A longer window often allows greater resolution or accuracy in the feature, but at the expense of temporal resolution. For example, a larger window provides greater spectral resolution, which is good for precise measurement of frequencies.

The large window, however, would do poorly at localizing temporal events like onsets, providing too long a window to place them accurately.

The window jump or increment is the number of data points the window moves over each time. For some features, the overlapped windows need to sum to a constant over the length of the audio. For triangular, Hann, and Hamming windows, but not trapezoidal windows, the window jump needs to be half of the window length to uniformly represent the original signal, that is to say, to give perfect reconstruction as previously discussed.

Summary statistics

To be used for classification or regression, vectors of time-varying features need to be summarized into single-number features. Mean, median, standard deviation, interquartile range, minimum, and maximum are all common choices. Mean, standard deviation, minimum, and maximum, however, are susceptible to corruption by outliers, silence, or noisy representations. Median and interquartile range are therefore preferred representations of the central tendency and variability of a feature [7].
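The fragment below ties these considerations together: a Hann window of power-of-two length is moved by half the window length each step, one spectral centroid value is computed per window, and the resulting time-varying feature is summarized with the median and interquartile range. It is a generic sketch, not the code of any of the toolboxes discussed in the next section.

import numpy as np

win_len, hop = 1024, 512          # power-of-two length, half-length jump
window = np.hanning(win_len)      # Hann window, tapering to zero at the ends
signal = np.random.default_rng(2).normal(size=48000)

spectra = []
for start in range(0, len(signal) - win_len + 1, hop):
    frame = signal[start:start + win_len] * window
    spectra.append(np.abs(np.fft.rfft(frame)))   # magnitude spectrum per frame
spectra = np.array(spectra)       # time-varying, multi-dimensional feature

freqs = np.fft.rfftfreq(win_len, d=1 / 48000)
centroid = (spectra * freqs).sum(axis=1) / spectra.sum(axis=1)
print(np.median(centroid),        # preferred summary statistics
      np.subtract(*np.percentile(centroid, [75, 25])))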

2.2.3 Software

There are several open-source audio programs that provide an extensible framework for extracting audio features. Given the several options available, however, choosing one can be difficult. Aspects to consider include which features are already available for that framework, the method of use, the output format, platform compatibility, and computational efficiency.

Marsyas was one of the original feature extraction libraries; it is written in C++ and is therefore a good option for programmers who prefer C++ [9]. It is also compatible with the attribute-relation file format (ARFF), which is part of a well-known machine learning suite called the Waikato Environment for Knowledge Analysis (WEKA) [10].

Vamp plugins are a set of open-source feature extraction libraries which are accessed and used by two pieces of software: Sonic Visualiser and Sonic Annotator [11]. Sonic Visualiser is a GUI application that allows the playback and analysis of one or two audio files and quick visualization of the outputted features. The advantage of this software is the ease of testing and validating different parameters of a feature extractor or roughly comparing results. Sonic Annotator is the command-line method of using Vamp plugins. Both, however, are not very computationally efficient and are therefore slow when processing large batches of data, because they do not intelligently reuse low-level calculations, like the STFT, between final features.

jAudio [12] does, however, allow the efficient creation of metafeatures and aggregators by recycling basic features. Additionally, because it is written in Java, it is cross-platform compatible. It also allows use from a GUI or the command line and, again, supports WEKA's ARFF. As well, it is compatible with the other aspects of the jMIR suite [8], allowing audio features to be combined with other data or classified using the Autonomous Classification Engine (ACE).

Yaafe (Yet Another Audio Feature Extractor) is a relatively new feature extractor that aims to provide simplicity of use and efficiency for large operations [13]. Its output format, HDF5, is an efficient binary storage format, especially when compared to verbose formats such as XML. It also offers programming interfaces for Matlab and Python interaction.

Perhaps the newest, with its initial publication released in November 2011, is the Timbre Toolbox [7]. The Timbre Toolbox is for Matlab and is aimed at analyzing individual sounds or notes (as opposed to recordings of songs), specifically for timbral analysis. As well, it offers a choice of input representations and summary statistics.

Lastly, the Matlab MIR Toolbox is a set of feature extractors and tools for working with audio in Matlab [14]. This library would be a good choice for someone seeking to stay within the Matlab environment and looking for features capable of analyzing full songs, as well as other MIR tools.

2.3 Feature selection

There are many potential uses for audio features in music information retrieval, as any task that manipulates or uses information from audio must use feature extraction in some way. It is naturally important to choose the audio features that are most helpful or applicable to the problem at hand. Feature selection is the process of choosing a subset of features from all available features for use on a particular problem. While discussion of feature suitability is common in several classification fields such as instrument recognition [15], timbre recognition [16], and speaker recognition [17], there exists no literature on features that are appropriate for determining the tone quality of a performance.

There are several reasons feature selection algorithms are useful. First of all, it is often not obvious which audio features out of all possible features are well suited to a particular problem. Furthermore, using noisy, spurious, and redundant features in a classifier has been shown to decrease accuracy [18, 19]. Lastly, finding and using just the relatively small subset of features suited to a problem reduces the amount of processing time needed to extract those features, the storage space of the features, and, often, the amount of training data required for a classifier, avoiding the so-called curse of dimensionality.

2.3.1 Wrapper versus filter methods

There are several different ways to approach feature selection. One main way to differentiate methods is by contrasting filter and wrapper methods [20]. Filter methods attempt to find the most suitable feature set, often by examining the features' relevance to the output class, regardless of the classifier to be used. Kohavi and John [20] detail several commonly used definitions of relevance and outline reasons why they fail in situations with models more complicated than simple correlation of a feature with the output. Furthermore, redundant features are either all selected or all ignored, depending on the definition of relevancy, and neither situation is ideal: in one case, too many features are included, limiting the effectiveness of feature selection; in the other, useful information is discarded.

The simplest filter methods find a coefficient of correlation between a given feature and the output. After obtaining the correlation coefficient for each feature, the experimenter can use the most highly correlated features. A slightly more sophisticated algorithm is known as the Relief method [21]. Roughly, Relief aims to find the individual contribution of each feature within small areas of the input space. It first picks a random instance in the dataset and finds the nearest neighbours of that instance, that is to say, the instances with the most similar features. The algorithm iterates through each feature of the randomly chosen instance and its neighbours to see how well it predicts any class change between the instances. The weight or coefficient of the feature is increased if it predicts the class of the neighbours well and decreased when it does not. Theoretically, this should identify the features that are best at local discrimination while decreasing the amount of redundancy in the chosen features.
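A bare-bones version of the Relief weighting idea might look like the following; the sketch assumes binary classes and uses one nearest hit and one nearest miss per sampled instance, whereas practical implementations (e.g., ReliefF) add refinements such as multiple neighbours and missing-value handling.

import numpy as np

def relief(X, y, n_samples=100, rng=np.random.default_rng(3)):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)   # distance to every instance
        dists[i] = np.inf                      # exclude the instance itself
        hit = np.argmin(np.where(y == y[i], dists, np.inf))   # nearest same class
        miss = np.argmin(np.where(y != y[i], dists, np.inf))  # nearest other class
        # reward features that separate the classes locally, and penalize
        # features that differ between same-class neighbours
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_samples                       # one weight per feature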

In this research, there are almost certainly redundant features in the dataset, as the idea was to select the optimal features from as broad a pool of features as possible, many of which will correlate with one another. Secondly, there are likely interactions between features that produce good or bad tone quality, such as two features that indicate bad tone quality when present together but not individually. The possibility of such a situation is enough to discount selection algorithms that would fail to account for it. Lastly, for better or worse, filter algorithms do not consider the peculiarities or biases of the classifier being used, such as its similarity metrics.

Wrapper methods, on the other hand, evaluate the suitability of a feature subset by using the results from training and testing a particular classifier as the performance metric, trying to find the subset with the best results. In this sense, they wrap around the classifier of choice. Wrapper methods are therefore better suited here, both because they are able to capture more complicated interactions between features without having to explicitly quantify the contributions of features, and because the test metric is actual classifier results.

One potential solution to the problem of finding the best subset would be to simply test all possible subsets. With n features, however, there are 2^n possible subsets of features, and testing each with k-fold cross-validation means the number of classifier train/test cycles is k·2^n. Assuming a single training and testing cycle of the classifier takes one second, testing all subsets of only 14 features with five-fold cross-validation would take about a day (5 × 2^14 ≈ 82,000 seconds). Wrapper methods seek a way to avoid testing all possible subsets of features, with each algorithm specifying a way to find well-performing subsets. In general, a wrapper algorithm starts with a population of subsets of features, tests all the subsets of the population, and then picks a new population. The method of selecting a new population and the termination condition are what differentiate wrapper methods. There are three general types of wrapper methods: forward selection, backward elimination, and genetic algorithms [22].

2.3.2 Examples of wrapper algorithms

For sequential forward selection, or simply forward selection, the subsets of the initial population contain only one feature and every single feature is represented, giving an initial population of n. The second-generation population consists of subsets of two features each: the one best-performing feature from the first generation plus one of the n − 1 remaining features. It continues like this, progressively adding features to the best subset of the previous generation. It terminates with the last population, which consists of one set of all features. The overall best-performing subset seen throughout the run is the winner at the end. In total, n(n + 1)/2 subsets are tested, a reduction from 2^n.
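The following sketch expresses this greedy forward search wrapped around a linear SVM, with cross-validated accuracy as the fitness of each subset; X and y are assumed to be an already-extracted feature matrix and label vector.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_select(X, y, cv=5):
    remaining = list(range(X.shape[1]))
    chosen, best_overall = [], ([], 0.0)
    while remaining:
        # test the best previous subset plus each remaining feature
        scores = [(cross_val_score(SVC(kernel="linear"),
                                   X[:, chosen + [f]], y, cv=cv).mean(), f)
                  for f in remaining]
        acc, best_f = max(scores)
        chosen.append(best_f)
        remaining.remove(best_f)
        if acc > best_overall[1]:
            best_overall = (list(chosen), acc)  # track the overall winner
    return best_overall                         # (subset, accuracy)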

Backward elimination is conceptually similar but in reverse: it starts with all features and tests removing one feature from the best subset each generation. That is to say, the initial population is only one set, consisting of all n features. The second generation is n subsets, each with one feature removed (i.e., n − 1 features each). The best-performing subset is used as the basis for the next generation of n − 1 subsets, now with n − 2 features each. Again, this continues until the population consists of individual features, and the subset with the best overall classification result is chosen.

There are two variants that combine forward and backward search. A method known as plus-l-take-away-r, or (l, r)-search, combines the two methods to avoid local optima and to remove features that were once beneficial but are no longer necessary in a larger subset [23]. The method proceeds like forward selection for l steps, then backward elimination for r steps. The two values cannot be equal, but if l is larger than r, the algorithm proceeds like forward selection, and otherwise like backward elimination. The algorithm terminates at a given number of required features.

A second variant is sequential floating forward selection (SFFS). It proceeds like forward selection; however, after each forward step, a backward elimination step is tested. If it improves the classifier, the elimination is kept, and potentially another backward step is taken. Backward steps continue like this until they no longer improve the classifier, at which point another forward step is taken. In general, the algorithm proceeds forward until the set number of features is reached.

Lastly, genetic algorithms are inspired by natural selection and the process of genetic exchange in living things. The genetic analogy is easier to conceptualize if the subsets are thought of as binary vectors with one number for each feature representing whether that feature is included or not. In the context of genetic search, these vectors are called chromosomes. In a standard genetic algorithm [24], the initial population is composed of random chromosomes, the number of which is set by the experimenter. As before, every subset or chromosome in the population is tested in the classifier, determining a classification result, or fitness in the biological analogy. To generate the next generation, two chromosomes are selected as parents based on their fitness, that is to say, more fit chromosomes are more likely to be chosen; the selection probability of the most fit is a parameter set by the user. The two parents may then exchange sections of their chromosomes, again based on a probability set by the user. Lastly, any of the single genes (whether a feature is used or not) may be flipped based on a mutation probability. The result of these operations is two chromosomes in the offspring generation, and the process is repeated until the offspring generation has the same number of chromosomes as the parent population. The algorithm terminates after a set number of generations, and, as before, the chromosome (or subset) with the best performance from all those tested is the subset selected.
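A compact version of this scheme is sketched below; evaluate() is a placeholder that should return classifier accuracy for a binary feature mask, and the population size, crossover probability, and mutation probability are illustrative values rather than the settings used later in this thesis.

import numpy as np

def genetic_search(evaluate, n_features, pop_size=20, generations=30,
                   p_cross=0.6, p_mutate=0.01, rng=np.random.default_rng(4)):
    pop = rng.integers(0, 2, size=(pop_size, n_features))  # random chromosomes
    best = (None, -1.0)
    for _ in range(generations):
        fitness = np.array([evaluate(c) for c in pop])
        if fitness.max() > best[1]:                # track the overall winner
            best = (pop[fitness.argmax()].copy(), fitness.max())
        probs = fitness / fitness.sum()            # fitness-proportional selection
        children = []
        while len(children) < pop_size:
            a, b = pop[rng.choice(pop_size, size=2, p=probs)].copy()
            if rng.random() < p_cross:             # one-point crossover
                cut = rng.integers(1, n_features)
                a[cut:], b[cut:] = b[cut:].copy(), a[cut:].copy()
            for c in (a, b):                       # bitwise mutation
                flip = rng.random(n_features) < p_mutate
                c[flip] = 1 - c[flip]
                children.append(c)
        pop = np.array(children[:pop_size])
    return best                                    # (best chromosome, fitness)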

2.3.3 Evaluating feature selection

Similar to training and testing on different sets of data using cross-validation, a feature selection algorithm must be applied to one dataset and evaluated on another. If feature selection is merely applied to one dataset and the cross-validation results for it are reported, the selected features can only be assumed to be the best features for that dataset; the results are not indicative of the utility of those features on new data, which is presumably the purpose of training a classifier.

CHAPTER 3
Related Work

This thesis builds on previous work in two distinct fields: timbre and tone-quality research, and applied machine learning. As timbre is defined by what it is not, many previous studies have attempted to find concrete components or factors of timbre in recorded audio. As far back as the mid-19th century, researchers have proposed and tested theories on such components of timbre. Furthermore, tone quality, as a specific type of timbre, is discussed both in the timbre literature and in instrument pedagogy literature. While the pedagogy literature discusses the effects of the physical actions of the performer on the sound, the timbre research attempts to scientifically determine the components of tone quality. As well, the field of machine learning has been applied to recorded audio for various classification tasks, such as identifying composer or genre. Previous research in this field has examined the efficacy of such classifiers and their features.

3.1 Related literature on timbre

Tone quality, as mentioned previously, is a subjective, instrument-specific timbre, and any attempt at understanding tone quality must start with understanding timbre. Because timbre is defined by what it is not (pitch, duration, and loudness), it encompasses nearly all possible changes in audio signals and is not easy to quantify or analyze. The first attempt to understand what makes two sounds perceptually different came from the work of Georg Ohm, which was expanded by Hermann Helmholtz in 1863 [25].

Helmholtz explained that the perception of steady-state sound was dependent on the blending of harmonics, but conceded that the attack and decay of the sound was very important in distinguishing, for example, between the pronunciation of the letters B, D, and G. Indeed, Saldanha and Corso demonstrated experimentally that the initial transients and vibrato aided the identification of instruments [26].

To attempt to determine the primary components of timbre, some researchers have used a technique known as multidimensional scaling (MDS) [27, 28, 29]. In general, multidimensional scaling presents a series of pairs of stimuli to participants and asks them to rate the similarity between the two. The similarity data are then used as distances between the sounds, and optimization is used to place the sounds in a timbre space [30]. John Grey [28] used synthesized instrument sounds based on the time-varying amplitude and frequency of actual instrument recordings. While this method allows precise control and knowledge of the content of the sound, it ensures that the only differences between the sounds are those that are pre-specified by the synthesis parameters and capable of being made by such a synthesis method. Nevertheless, mapping distances to three dimensions gave the most interpretable result, yielding dimensions relating to the distribution of energy in the spectrum, the synchronization of the attacks and decays of the upper harmonics as well as their temporal fluctuation, and the presence of low-amplitude, high-frequency energy in the initial attack segment.

Indeed, most of these studies have chosen two or three dimensions for a timbre space, and, in general, the studies have found three similar dimensions of timbre: spectral content, temporal changes in the spectrum, and the nature of the attack, or the first part of the sound.

These three dimensions are not completely orthogonal, however, in theory or in practice: the perception of one affects the perception of the others, and, practically, an instrumentalist cannot change one independently of the others. These types of experiments would not be sufficient for the goal of this research, as they seek to find features that correlate well with the two or three dimensions of the timbre space. While the tone-quality scale of the collected data is linear, tone quality is unlikely to have a linear direction in a timbre space, as there are likely several causes of good or bad tone quality that change along the tone-quality scale, as well as meaningful combinations of them.

In a recent work, Peeters et al. [7] took a different approach to learning the dimensionality and components of timbre. They extracted audio features from 6307 instrument samples from all families of instruments and performed clustering analysis using the correlation between features. They found 10 statistically separate clusters of audio features, and three of the major clusters of features are descriptors that seem to align with the three dimensions typically found in MDS experiments. On the other hand, that leaves seven clusters of features undiscovered by MDS experiments. These studies, however, attempt to find differences in timbre between distinct sounds or instruments, rather than differences in timbre between performances on the same instrument.

3.2 Related literature on trumpet tone quality

In 1964, Jean-Claude Risset examined the temporal changes in the amplitudes of trumpet harmonics over the course of a note, making profiles of several notes [31]. He then used this information to synthesize new tones. According to the paper, the new tones were indistinguishable from the originals. He also noted that the relative amplitudes of the harmonics varied with the loudness of the playing and that greater amplitude in the upper harmonics was key to the brassy sound [32]. These studies give an idea of how to reproduce a sound, but they give no idea of the allowable variations in amplitudes for good or bad tone quality, or of how to extrapolate to notes of arbitrary length.

Madsen and Geringer have, however, examined preferences for good and bad tone quality in trumpet performance [33]. The good and bad tone qualities were created by asking an expert musician to intentionally play with good and bad tone quality and then manually selecting one recording of each. Though the two tone qualities were audibly distinguishable when presented without accompaniment, the only analyzed difference described by the paper was a change in the amplitude of the second fundamental, giving very little insight into the differences between the two. In a different study, the authors used an equalizer to amplify or dampen the third through eleventh harmonics of recorded tones to be rated in tone quality [34]. For the brass instruments, a darker tone, caused by dampened harmonics, was judged to have a lower tone quality than the standard or brightened conditions. This, however, is an artificial difference created between two tones by manipulation rather than a true difference in performance, and the correlation with actual changes in tone quality is unknown.

Lastly, literature aimed at student musicians and their teachers may shed some light on what constitutes good tone. For the trumpet, tone quality is supposed to be a product of the balance and coordination of the embouchure, the oral cavity, and the air stream [3]. Literature on instrument pedagogy also discusses the importance of the attack to tone quality. In his instructional book on the trumpet, Delbert Dale says that "the actual sound of the attack (the moment the sound bursts out of the instrument) has a great deal to do with the sound of the remainder of the tone at least to the listener" [4]. He also acknowledges that no two persons have the same or even similar tonal ideals and that the standard for good and bad tone quality varies, but some common tone-quality problems, such as a shrill, piercing quality in the upper register and a fuzzy, unclear tone in the lower register, have been identified by trumpet experts.

3.3 Previous machine learning applied to audio signals

Machine learning is a field of computer science that seeks to have a computer complete a task without explicitly programming it how to do so. It is a common solution to problems where the input data is of high dimensionality or where humans are able to complete a task but are unable to explain how. In the field of music information retrieval (MIR), it has been used to address several problems with those characteristics.

For example, one task is genre classification, which seeks to assign labels such as jazz or rock to audio samples. It is a particularly difficult task because the features that differentiate between these genres, such as instrumentation or style, are difficult to quantify or select, and the ground truth is so culturally based.

Nonetheless, at the annual Music Information Retrieval Evaluation exchange (MIREX), held as part of the International Society for Music Information Retrieval (ISMIR) Conference, researchers submit algorithms for genre classification in several categories, and top algorithms routinely score over 70 percent accuracy. Any classifier algorithm is permitted, and support vector machines (SVMs) are frequently used in submissions.

Herrera-Boyer et al. [35] present an overview of different classifiers applied to audio tasks such as instrument or instrument-family identification, finding melodically similar pieces, or separating sound sources. As well, Mandel and Ellis [36] examined two popular choices in audio classification tasks: SVMs compared with k-nearest neighbour (kNN) algorithms for the classifier, and time-varying features compared with global features. They found that global features and an SVM classifier outperformed any other combination of features and classifier. As kNNs and SVMs are two commonly used classifier algorithms, we will now briefly discuss and contrast them [37].

The principles of a kNN classifier are rather simple. Rather than having any sort of training phase, a kNN classifier simply stores all the labelled training data. When presented with a query instance, the classifier identifies the k closest training points in feature space and assigns the majority class to the query point. The only two parameters to select are the value of k and the distance metric for determining the relevant neighbours.

The definition of the distance metric and the preprocessing of the features (such as by normalizing them), however, are crucial to the efficacy of a kNN classifier.

SVM classifiers, on the other hand, have several parameters and create a model by training, without needing to store all points. The goal of the training phase is to define a possibly non-linear boundary in the input feature space, dividing it into two classes. For multiple classes, more than one SVM can be trained, one for each boundary. The goal of SVM training is to define a boundary between the two classes that maximizes the margin, the distance between the boundary and the closest point from each class, therefore evenly dividing the two classes. During training, the boundary is iteratively adjusted so that the nearest misclassified point is correctly classified. These points become the support vectors that define the boundary and are the only points necessary to define the classification model.

In the event of non-linearly-separable data, meaning the data cannot be separated by any line or multi-dimensional plane, a class of functions known as kernels can be used to manipulate the input features and/or increase their dimensionality to the point that the instances are separable. In actuality, however, the data is often not completely separable with or without a kernel, or it would not be desirable to separate it completely. Like other machine learning algorithms, SVMs can be trained to overfit the dataset, destroying the applicability of the trained model to new instances. Soft-margin SVMs can be used, which allow misclassified instances and instances within the margin at a penalty. This technique allows a tuning of the complexity and generality of the model.

Generally, the parameters of the algorithm control the width of the margin around the boundary and the penalty for misclassified points inside and outside the margin. Finally, when the classifier is applied, new data is classified according to which side of the boundary it lies on.
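The contrast between the two classifiers can be seen in a few lines of scikit-learn; the data here is synthetic, standing in for audio feature vectors, and the parameter values (k = 5, C = 1.0, an RBF kernel) are illustrative rather than the settings used in this thesis.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 40))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# normalization matters for kNN's distance metric, as noted above;
# C controls the soft margin's penalty for misclassified points
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

for name, clf in (("kNN", knn), ("SVM", svm)):
    print(name, cross_val_score(clf, X, y, cv=5).mean())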

CHAPTER 4
Initial Experiment

In order to proceed with the main experiment of this thesis and perform feature selection for tone quality, it was necessary to verify that the experiment could be performed using what would be perceptual, and therefore inherently subjective, data. That poses two questions to be answered. First, given that the data will be subjective, is there statistical agreement between raters of tone quality, at least with regard to trumpets? Secondly, if it is possible to obtain human agreement on tone quality, could a classifier be trained using just audio features to predict the human ratings with any level of accuracy? This initial experiment indicated that the answer to both of these questions is positive.

To provide data to test these hypotheses, trumpet players were recorded playing long tones in isolation. Brass players provided ratings of the tone quality of each note, and classifiers were trained using subsets of the data. Each step of this process is explained in the following sections, and then the results are presented and discussed. The content of this chapter was originally presented at ISMIR 2011 [38].

4.1 Recordings

Four trumpet players were recorded for this phase of the experiment. They were intentionally selected to have a large variety of ability levels in order to provide wide, gross differences in tone quality, as opposed to smaller, more nuanced differences in tone quality, the suitability of which would depend on musical context or genre.

At the low end, there was a trombone player who had never performed on the trumpet but played the trumpet for this experiment. There were two players who had had several years of trumpet instruction but had only occasionally practised and performed in the last few years. Lastly, there was a fourth-year undergraduate trumpet-performance major who practised or performed on a daily basis. Anonymized details of all trumpet players recorded for this thesis are given in Appendix A.

The recordings were made in the Performance and Recording Lab of the Centre for Interdisciplinary Research in Music Media and Technology (CIRMMT) at McGill University. All trumpet players were recorded with the same microphone (DPA 4011-TL, Alleroed, Denmark) and digital recorder (Sound Devices 744T, Reedsburg, Wisconsin) at a bit depth of 24 and a sample rate of 48 kHz.

The recorded material consisted of twelve half notes separated by half rests. The twelve notes were divided into three continuous spans of four notes, each span using the same valve combinations (1+2, 1, 2, open): the low range (A3, B♭3, B3, C4), the mid range (E4, F4, F♯4, G4), and the high range (E5, F5, F♯5, G5). As well, because of the difficulty of playing at different dynamic levels in different parts of the range [4] and the changes in tone across different parts of the range [39], each player recorded the twelve notes at piano, mezzo-forte, and fortissimo. For consistency of duration, before the recording of each line, four clicks of a metronome at 60 beats per minute were played for the trumpet player.

With the exception of the trombone player, the three other players recorded the same material twice: once on their own trumpet and mouthpiece and once on a control trumpet and mouthpiece.

For details on the makes and models of the trumpets, see Appendix A. Additionally, the trombone player was unable to play the highest four notes. This gives a contribution of 12 × 3 × 2 = 72 notes for each of the three trumpet players and 8 × 3 = 24 notes from the trombone player. One note was discarded because of a computer error, for a total dataset of 239 notes.

4.2 Ratings

The next step was to collect information on the tone quality of each of the notes. This information was provided by human raters giving ratings on a 7-point Likert scale, with 1 labelled worst, 7 labelled best, and the other points unlabelled. Five brass players (three trumpet players, one trombone player, and one French horn player) provided ratings. Each rater sat in a quiet room and made ratings using the computer interface shown in Figure 4-1, listening to the playback from the computer through headphones (Bose QuietComfort 3, Framingham, MA). The raters were instructed to listen to each note as many times as they wanted and to make a subjective rating of the note using anything they could hear and any criteria they deemed important, including their knowledge of brass instruments and the dynamic level. The notes were presented in three blocks (all of the piano notes, then the mezzo-forte notes, and then the fortissimo notes) but were randomized within each block.

Spearman's rank correlation coefficient, ρ, is a test of inter-rater reliability with ordinal data. For more than two raters, the correlation coefficient is found by averaging the correlations between all pairs of raters. In this case, the average ρ for all raters was 0.50, a statistically significant result.
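This averaged pairwise statistic can be computed with SciPy as below; the ratings array is a hypothetical (notes × raters) matrix of Likert scores, not the actual collected data.

import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

ratings = np.random.default_rng(6).integers(1, 8, size=(239, 5))
rhos = [spearmanr(ratings[:, i], ratings[:, j]).correlation
        for i, j in combinations(range(ratings.shape[1]), 2)]
print(np.mean(rhos))   # inter-rater reliability: average pairwise rho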

[Figure 4-1: Rating interface for the initial experiment]

All notes were represented by their average rating across all five raters, giving a decimal value between 1 and 7. The distribution of notes in the dataset is shown in Figure 4-2. The contribution of each player to the dataset is shown in Figure 4-3.

4.3 Audio features

As mentioned previously, jAudio was used for extracting features. Although papers have described audio features to use for timbre recognition [40, 16], and tone quality is related to timbre, in order to avoid discarding useful features and to be inclusive about the definition of tone quality, this experiment used a large selection of extracted audio features, totalling 46 features, 3 of which were multidimensional.

[Figure 4-2: Histogram of notes in the dataset, by averaged rating across all five raters.]

[Figure 4-3: A histogram of the notes in the dataset showing the contribution from each of the players.]

This feature set was selected from all available features in jAudio by exclusion, removing only features that clearly would bias the classifier and/or would have no impact on tone quality. For example, the duration of a note should have little to do with tone quality but may be well correlated with the player of a note, providing information to the classifier that could allow it to cheat.

4.4 Classification

4.4.1 Classifier choice

ACE (Autonomous Classification Engine) 2.0 is a software program used for training, testing, and running classifiers and, along with jAudio, is part of the jMIR software suite [41]. It was used throughout this initial experiment for these purposes. It uses WEKA machine learning algorithms [10]. An automated classifier-selection algorithm built into ACE was used to experiment with different classifiers, including k-nearest neighbour, support vector machines (SVMs), several types of decision trees, and neural networks, on two random subsets of the data. The parameters used by ACE were fixed in advance [42]. SVMs performed best on these subsets, suggesting their use as a consistent and well-rounded general-purpose classifier for the following experiments. For this reason, and because of their dominance in the related literature and the relative interchangeability of classifier algorithms, SVMs were used throughout this study. The SVM implementation used a complexity variable of one and a simple linear kernel. In multi-class situations, however, SVMs do not encode an ordering of classes, which makes the task slightly more difficult in the three- and seven-class problems discussed below. In these cases, a meta-classifier was used to train multiple SVMs to differentiate the classes. The method is described when introduced below in Section 4.4.2.

4.4.2 Data subsets tested

Different groupings of the notes were used to test the accuracy of the classifiers, including two, three, and seven classes. While the judgements from the five raters were only integer values, each note was represented by a single average rating across all the raters, which was therefore often a decimal number. The notes were assigned to classes based on this average rating. The purpose of dividing the data into several subsets was to test whether classification was possible at all, and the effect of the level of resolution on classifier accuracy. While these groups were given labels such as good and bad, the labels do not necessarily correspond to divisions in the musical concept of tone quality.

Two-class problems were evaluated for three different groupings with increasing inclusion of data, to see how much the accuracy of the classifier declined as the division between the two groups decreased. The first grouping takes just the extremes of the data: the good class has only average ratings above 5.5 and the bad class has average ratings below 2.5, excluding all points in between. The data divisions were the same in training and testing data. The second grouping is more inclusive, using all data below 3.5 for bad and above 4.5 for good, again excluding data in between. The last grouping includes all the data, split at the median rating, 4.6. The distribution of this last subset is shown in Figure 4-4.

Next, a grouping of three classes was also evaluated, splitting the data approximately into three even groups: notes with a rating below 4.2, notes rated 5.2 or above, and the notes in between. Lastly, rounding the averaged ratings to the nearest integer produced seven classes of data with labels 1 to 7. The distribution of these classes is the same as that seen in the histogram of Figure 4-3. All of these subsets were tested with five-fold cross-validation, giving an average success rate.
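The binning itself is straightforward; the sketch below shows how averaged ratings might be mapped to the two-, three-, and seven-class labelings described above, with avg standing in for the vector of per-note mean ratings.

import numpy as np

avg = np.random.default_rng(7).uniform(1, 7, size=239)  # stand-in ratings

# median split into two classes, keeping every note
two_class = np.where(avg >= np.median(avg), "good", "bad")

# three classes with the fixed cut points 4.2 and 5.2
three_class = np.digitize(avg, bins=[4.2, 5.2])  # 0 = bad, 1 = mid, 2 = good

# seven classes by rounding to the nearest integer rating
seven_class = np.clip(np.rint(avg), 1, 7).astype(int)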

[Figure 4-4: The distribution of notes in the dataset with two classes, using all the notes and split on the median value.]

4.4.3 Other tests

In order to delve deeper into the workings of the classifier and to get an idea of whether the results were robust to unseen players, a classifier was trained and tested using a leave-one-player-out methodology. That is to say, the classifier was trained using the data from three of the players and tested on the fourth. Because of the dominance of Player 1 in the low end of the ratings, however, this method was tested both with and without Player 1.

Lastly, a classifier was trained and tested on its ability to identify a player. Each note was labelled not with a quality rating but with just a player number. The accuracy was then again tested with five-fold cross-validation.
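In scikit-learn terms, this amounts to grouping the cross-validation folds by player, as in the sketch below; X, y, and the per-note players vector are hypothetical stand-ins for the extracted features, quality classes, and player labels.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(8)
X = rng.normal(size=(239, 46))          # 239 notes, 46 features
y = rng.integers(0, 2, size=239)        # stand-in quality classes
players = rng.integers(1, 5, size=239)  # which of four players played each note

# each split holds out every note from one player and trains on the rest
scores = cross_val_score(SVC(kernel="linear"), X, y,
                         groups=players, cv=LeaveOneGroupOut())
print(scores)                           # one accuracy per held-out player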

The summary is given in Table 4 1, and the three confusion matrices are shown in Tables 4 2, 4 3, and 4 4.

Table 4 1: The classifier accuracy when splitting the data into two classes, using three different levels of inclusion.

             Bad                Good               Average
             Range    Number    Range    Number    Success (%)

Table 4 2: The confusion matrix for the two-class problem, using just the extremes of the data. The correct classes are given by the row labels.

           Bad    Good
    Bad     19       0
    Good     2      45

Table 4 3: The confusion matrix for the two-class problem, using the medium level of inclusion. The correct classes are given by the row labels.

           Bad    Good
    Bad
    Good

In the three-class problem, an SVM classifier correctly labelled 54.0 percent of the tones. The breakdown of the classes is shown in Table 4 5 and the confusion matrix in Table 4 6. Lastly, with seven classes, a classifier was still able to obtain an average success of 46.0 percent. The breakdown of the classes is shown in Table 4 7 and the confusion matrix for this result is shown in Table 4 8.
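As a quick sanity check, the 96.9 percent figure for the extremes grouping follows directly from the confusion matrix of Table 4 2; the short calculation below (illustrative code, not from the thesis) confirms it.

```python
# Worked check of the extremes-grouping accuracy against Table 4-2.
import numpy as np

confusion = np.array([[19, 0],    # true "bad": 19 correct, 0 misclassified
                      [2, 45]])   # true "good": 2 misclassified, 45 correct

accuracy = np.trace(confusion) / confusion.sum()   # 64 / 66
print(f"{accuracy:.2%}")  # -> 96.97%, matching the 96.9 percent reported
```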

Table 4 4: The confusion matrix for the two-class problem, using all the data, splitting on the median value. The correct classes are given by the row labels.

           Bad    Good
    Bad
    Good

Table 4 5: The classifier accuracy when splitting the data into three roughly even classes.

             Bad                Middle             Good               Average
             Range    Number    Range    Number    Range    Number    Success (%)

In the player identification task, a classifier correctly identified the player in 88.3 percent of the instances. When testing with the leave-one-player-out method, the success rate of the classifier dropped across all tests. A summary of these results is given in Table 4 9.

4.6 Discussion

SVMs show a surprising ability to discriminate between classes based on the extracted features with two, three, and seven classes. Even with seven classes, a classifier was able to identify the correct class 46% of the time, which is better than the outcome expected from chance or from always picking the most common class (36%). This shows promise for the possibility of training a classifier to give automatic feedback on student musicians' performances. There are, however, severe limitations to this data set. Because there are only four players in the data set, each with a distinct distribution of notes in the ratings histogram, there may be latent features unrelated to performance quality that can help narrow the selection of class and improve classifier success.

Table 4 6: The confusion matrix for the three-class problem. The correct classes are given by the row labels.

           Bad    Mid    Good
    Bad
    Mid
    Good

Table 4 7: The numbers in each class and classifier accuracy when rounding the data to the nearest integer, forming seven classes.

    Class:               1    2    3    4    5    6    7
    Avg. Success (%):
    Number:

This hypothesis is bolstered by the high success in the performer identification task. For comparison, in a previous study, a one-note attempt at identifying the correct performer out of three possible performers gave at best a 43 percent success rate [43]. The results for the leave-one-player-out task decreased sharply compared to the results using all players and testing with cross-validation. Each player had a distinct distribution of notes in the dataset, with Player 1 dominating the low end and Player 3 dominating the high end. For this reason, it may be easier to identify different performers than to place notes into tone-quality categories. A few other factors may have allowed high classification accuracy with cross-validation but not with leave-one-player-out. For example, in the seven-class classification, class one was identified with a high level of accuracy (eight out of 11), whereas none of class two was identified correctly. Mathematically, for a note to be rated class one, there had to be a high level of agreement among the raters, with at least three of the raters giving it a one.

Table 4 8: The confusion matrix for the seven-class problem. The correct classes are given by the row labels.

Table 4 9: Correct classification percentage for several tests of the leave-one-player-out method.

    Player left out:             Success (%)        Average Success (%)
    7 classes (w/ player 1):
    classes (split at 4.6):
    classes:
    classes:
    classes (w/o player 1):

This suggests that it was an unusually bad note, likely with distinctive characteristics unique to class one. Overall, these results are promising but warrant a more careful examination, which is conducted throughout the rest of this thesis.

Table 4 10: The player identification confusion matrix. The correct player identifications are given by the row labels.

CHAPTER 5
The Main Experiment

Since the initial experiment confirmed the hypotheses that humans reasonably agree on tone quality and that it is possible for a classifier to identify tone quality, a larger data collection was undertaken. This chapter describes the main experiment, with an increased dataset and small changes to the method.

5.1 Changes to the data collection

The recordings for this part of the experiment were made in the same way as in the initial experiment, as explained in Section 4.1: in the same room, with the same recording hardware, and with the same tempo, notes, dynamic levels, and reference trumpet. Six additional trumpet players were added to the experiment, again attempting to capture a range of experience, from beginning or out-of-practice trumpet players to those with many years of continuous study. They are Players E through J in the table of Appendix A; the four players from the initial experiment are labelled A through D. The 348 new notes from these recordings were segmented in the same manner and combined with the notes from the original four players for a total of 588 notes in the dataset. While all the recordings were made in the same manner, there were small differences in recording levels between players because of differences in the players' styles, in their interpretations of the dynamic levels, and in their placement relative to the microphone.

For these reasons, the amplitudes of the notes were normalized to the same peak level so that differences in dynamics would not affect the judgements of the raters or provide any bias to the classifier. If not normalized, the small changes in dynamic could be used to identify the player or otherwise provide information unrelated to the tone quality; disparities in loudness could likewise affect the raters' perception of the notes. Because timbre is affected by dynamic level, however, the original recording dynamic (piano, mezzo-forte, or fortissimo) was added as a nominal feature for each note.

Labelling was done in a similar manner to the initial experiment, but with a total of 12 raters: six trumpet players, three other brass players, and three other instrumentalists (guitar, clarinet, and drums). As well, rather than presenting the notes in three blocks of all the piano notes, all the mezzo-forte notes, and then all the fortissimo notes as before, the notes were presented in nine blocks based on the three dynamic levels and the three ranges, for example "mezzo-forte, low" or "piano, mid". The order of the nine blocks was randomized for each participant, as was the order of the notes within each block. The participants were encouraged by a prompt in the testing software to take a break between blocks. The slightly changed interface is shown in Figure 5 1. The ratings from the original five raters were not used in this larger dataset because the notes had not been normalized before the initial experiment's ratings were collected.

5.2 New dataset

As in the initial experiment, the average Spearman's rank correlation coefficient, ρ, was significantly greater than 0 for all raters (p < 0.01). The average correlation coefficients of the raters are given in Table 5 1.
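Agreement of this kind can be summarized as each rater's average Spearman correlation with the other raters. A small sketch, using invented toy ratings rather than the actual data (the thesis collected 1-7 ratings from 12 raters over 588 notes), is given below.

```python
# Illustrative sketch: average pairwise Spearman's rho per rater (toy data).
from scipy.stats import spearmanr

ratings_by_rater = [
    [4, 2, 6, 5, 1, 7, 3],   # rater 1's ratings of the same notes
    [5, 1, 6, 4, 2, 7, 3],   # rater 2
    [4, 2, 7, 5, 1, 6, 3],   # rater 3
]

# For each rater, average their rank correlation with every other rater.
for i, this_rater in enumerate(ratings_by_rater):
    rhos = [spearmanr(this_rater, other)[0]   # rho from the (rho, p) result
            for j, other in enumerate(ratings_by_rater) if j != i]
    print(f"rater {i + 1}: mean rho = {sum(rhos) / len(rhos):.2f}")
```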

Figure 5 1: The modified rating interface for the second round of data collection.

The distribution of the complete dataset of 588 notes can be seen in Figure 5 2. The median and range of the ratings of the notes contributed by each player can be seen in Figure 5 3, and Figure 5 4 shows the median and range of the ratings contributed by each of the raters. It can be seen that all but Raters 1 and 5 used the full range of ratings and that eight of the raters have identical ranges and quartiles.
