
The Single Hidden Layer Neural Network Based Classifiers for Han Chinese Folk Songs

Sui Sin Khoo

A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

Faculty of Engineering and Industrial Sciences
Swinburne University of Technology
Australia

2013


Abstract

This thesis investigates the application of several powerful machine learning techniques to music classification using a symbolic database of folk songs: the Essen Folksong Collection. First, a meaningful and representative theory-based method of encoding Chinese folk songs, called the musical feature density map (MFDMap), is developed to enable efficient classification by machines. This encoding method effectively encapsulates useful musical information in a form that is readable by machines and, at the same time, easily interpreted by humans. The encoding will aid ethnomusicologists in future folk song research.

The extreme learning machine (ELM), an extremely fast machine learning algorithm that utilizes the structure of single-hidden layer feedforward neural networks (SLFNs), is employed as the machine classifier. This algorithm is capable of performing at very high speed and has good generalization performance. The application of the ELM classifier and its enhanced variant, the regularized extreme learning machine (R-ELM), to real-world multi-class folk song classification is examined in this thesis. The effectiveness of the MFDMap encoding technique combined with the ELM classifiers for multi-class folk song classification is verified.

The finite impulse response extreme learning machine (FIR-ELM) is a relatively new learning algorithm. It is powerful in the sense that robustness is built into the design of both the input weights and the output weights. The algorithm can effectively remove input disturbances and undesired frequency components in the input data. The capability of the FIR-ELM in solving complex real-world multi-class classification is examined in this thesis. The MFDMap performed more effectively with the FIR-ELM: the classification accuracy using the FIR-ELM is significantly better than that of both the ELM and the R-ELM.

The folk song classification techniques proposed in this thesis are further investigated on different data samples. These techniques are also applied to European folk songs, from a culture that is very different from the Chinese culture, to investigate the flexibility of the learning machines. In addition, the roles and relationships of four music elements: solfege, interval, duration and duration ratio, are investigated.

Acknowledgement

I would like to express my gratitude to my supervisor, Professor Zhihong Man, who has given me both the guidance and the courage to pursue the work in this thesis. Special thanks for his patience with my slow responses and for his advice that led me along the path. I would also like to express my utmost gratitude to my parents for their loving and constant support, interest and encouragement that have led me to this point in my life. I would love to express my deepest appreciation to my dearest brother, who has led and inspired me along my way. A sweet thank you to Aiji, Kevin, Fei Siang, Hai, and Tuan Do for all the laughter and companionship I received during my years of research at Swinburne.


Declaration

This is to certify that:

1. This thesis contains no material which has been accepted for the award to the candidate of any other degree or diploma, except where due reference is made in the text of the examinable outcome.
2. To the best of the candidate's knowledge, this thesis contains no material previously published or written by another person, except where due reference is made in the text of the examinable outcome.
3. The work is based on joint research and publications; the relative contributions of the respective authors are disclosed.

Sui Sin Khoo, 2013


Table of Contents

ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS

1. INTRODUCTION
   Motivation
   Contribution
   Organization of the Thesis

2. LITERATURE REVIEW
   Artificial Neural Network
      McCulloch-Pitts Threshold Processing Unit
      Rosenblatt's Perceptron
      Multi-Layer Perceptron
      Learning Algorithms
      Extreme Learning Machine
   Music Representations
      Audio Format
      Symbolic Format
   Discussion

3. MUSIC REPRESENTATION AND THE MUSICAL FEATURE DENSITY MAP
   Ethnomusicology Background on Geographical Based Han Chinese Folk Song Classification
      Rationale for the Choice of the Five Classes
   Music Data Set
      The Essen Folksong Collection
      The **Kern Representation
      An Example of Han Chinese Folk Song in **Kern Format
      Assumptions in **Kern Version of the Essen Folksong Collection
   Music Elements and Encoding
      Pitch Elements
      Duration Elements
   The Musical Feature Density Map
      Advantage of the Musical Feature Density Map
      Future Enhancement to the Musical Feature Density Map

4. THE EXTREME LEARNING MACHINE FOLK SONG CLASSIFIER
   Introduction
   Extreme Learning Machine
   Regularized Extreme Learning Machine
   Experiment Design and Setting
      Data Pre-Processing and Post-Processing
      Parameter Setting
   Experiment Results
   Discussion
   Conclusion

5. THE FINITE IMPULSE RESPONSE EXTREME LEARNING MACHINE FOLK SONG CLASSIFIER
   Introduction
   Finite Impulse Response Extreme Learning Machine
   Experiment Design and Setting
      Data Pre-Processing and Post-Processing
      Parameter Setting
   Experiment Results
   Discussion
   Conclusion

6. A TWO-CASE EUROPEAN FOLK SONG CLASSIFICATION
   Introduction
   Experiment Design and Setting
      The Musical Feature Density Map
      Data Set
      Parameter Setting
   Experiment Results
   Discussion
   Conclusion

7. CONCLUSION
   Summary
   Future Works

REFERENCES
APPENDIX A. FOLK SONG CLASSIFICATION USING AUDIO REPRESENTATION
LIST OF PUBLICATIONS


List of Figures

2.1 An example of a Threshold Processing Unit
2.2 A single-hidden layer feedforward neural network
2.3 The flow diagram of the construction of a beat histogram
3.1 Map of the three main rivers: the Yellow River, the Yangtze River and the Pearl River
3.2 Map of the regions in China, with the five classes studied in this thesis highlighted
3.3 The musical score of a Jiangsu folk song, Si Ji Ge
3.4 A **kern representation of the Jiangsu folk song Si Ji Ge
3.5 An example of a Jiangsu folk song encoded using the solfege representation
3.6 An example of a Jiangsu folk song encoded using the interval representation
3.7 The seven most commonly used durations
3.8 Examples of tie notes and their equivalence in duration
3.9 Examples of dotted notes and their equivalence in duration
3.10 Examples of triplets and their equivalence in duration
3.11 An example of a Jiangsu folk song encoded using the duration representation
3.12 An example of a Jiangsu folk song encoded using the duration ratio representation
3.13 The flow chart for constructing a MFDMap
3.14 The music score and the encoded solfege, interval, duration and duration ratio representations (Steps 1 to 4 in constructing the Case 1 MFDMap)
3.15 The Case 1 MFDMap for the Shanxi folk song Zou Xi Kou
3.16 The music score and the encoded solfege, interval, duration and duration ratio representations (Steps 1 to 4 in constructing the Case 2 MFDMap, rests omitted)
3.17 The Case 2 MFDMap for the Shanxi folk song Zou Xi Kou
3.18 Example of a Class 1 folk song using the windowing method
3.19 Example of a Class 2 folk song using the windowing method
3.20 Example of a Class 3 folk song using the windowing method
3.21 Example of a Class 4 folk song using the windowing method
3.22 Example of a Class 5 folk song using the windowing method
3.23 Example of a Class 1 folk song using the Case 1 MFDMap
3.24 Example of a Class 2 folk song using the Case 1 MFDMap
3.25 Example of a Class 3 folk song using the Case 1 MFDMap
3.26 Example of a Class 4 folk song using the Case 1 MFDMap
3.27 Example of a Class 5 folk song using the Case 1 MFDMap
3.28 Example of a Class 1 folk song using the Case 2 MFDMap
3.29 Example of a Class 2 folk song using the Case 2 MFDMap
3.30 Example of a Class 3 folk song using the Case 2 MFDMap
3.31 Example of a Class 4 folk song using the Case 2 MFDMap
3.32 Example of a Class 5 folk song using the Case 2 MFDMap
– A single-hidden layer feedforward neural network
– A single hidden layer neural network with linear neurons and time-delay elements
– An example of raw musical data of an Austrian folk song
– An example of raw musical data of a German folk song
– An example of a MFDMap of an Austrian folk song
– An example of a MFDMap of a German folk song
– The FIR-ELM network structure with linear neurons and time-delay elements
– Classification accuracy of the low-pass FIR-ELM with 100 hidden neurons (MFDMap: interval, duration and duration ratio)
– Classification accuracy of the four-filter FIR-ELM with cutoff frequency 0.1 (MFDMap: interval, duration and duration ratio)

List of Tables

3.1 The solfege encoding reference table (all tonics start within the principal octave)
3.2 List of durations and their encoded representations
3.3 List of encoded music representations and their respective occurrence percentages (Steps 5 to 7 in constructing the Case 1 MFDMap)
3.4 List of encoded music representations and their respective occurrence percentages (Steps 5 to 7 in constructing the Case 2 MFDMap, rests omitted)
3.5 Selected list of reduced MFDMaps and their respective lists of features (Case 1, notes and rests)
3.6 Selected list of reduced MFDMaps and their respective lists of features (Case 2, only notes)
– Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with the original map size
– Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = …
4.11 Classification accuracy of the ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with the original map size
– Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Classification accuracy of the R-ELM classifier using Case 1 and Case 2 MFDMap with x = …
– Confusion matrix for Case 1 MFDMap with map size 71 (x = 15) at 8000 hidden neurons, using the ELM classifier
– Confusion matrix for Case 2 MFDMap with map size 63 (x = 15) at 8000 hidden neurons, using the ELM classifier
– Confusion matrix for Case 1 MFDMap with map size 121 (x = 3) at 3000 hidden neurons, using the R-ELM classifier
– Confusion matrix for Case 2 MFDMap with map size 63 (x = 15) at 5000 hidden neurons, using the R-ELM classifier
5.1 Classification accuracy using Case 1 MFDMap with the original map size (map size = 172, ω_c = 0.6, d/γ = 0.001)
5.2 Classification accuracy using Case 1 MFDMap with x = 3 (map size = 121, ω_c = 0.6, d/γ = 0.001)
5.3 Classification accuracy using Case 1 MFDMap with x = 5 (map size = 101, ω_c = 0.6, d/γ = 0.001)
5.4 Classification accuracy using Case 1 MFDMap with x = 10 (map size = 81, ω_c = 0.6, d/γ = 0.001)
5.5 Classification accuracy using Case 1 MFDMap with x = 15 (map size = 71, ω_c = 0.6, d/γ = 0.001)
5.6 Classification accuracy using Case 1 MFDMap with x = 20 (map size = 63, ω_c = 0.6, d/γ = 0.001)
5.7 Classification accuracy using Case 1 MFDMap with x = 30 (map size = 55, ω_c = 0.6, d/γ = 0.001)
5.8 Classification accuracy using Case 1 MFDMap with x = 40 (map size = 47, ω_c = 0.6, d/γ = 0.001)
5.9 Classification accuracy using Case 1 MFDMap with x = 50 (map size = 40, ω_c = 0.6, d/γ = 0.001)
5.10 Classification accuracy using Case 2 MFDMap with the original map size (map size = 145, ω_c = 0.6, d/γ = 0.001)
5.11 Classification accuracy using Case 2 MFDMap with x = 3 (map size = 102, ω_c = 0.6, d/γ = 0.001)
5.12 Classification accuracy using Case 2 MFDMap with x = 5 (map size = 88, ω_c = 0.6, d/γ = 0.001)
5.13 Classification accuracy using Case 2 MFDMap with x = 10 (map size = 73, ω_c = 0.6, d/γ = 0.001)
5.14 Classification accuracy using Case 2 MFDMap with x = 15 (map size = 63, ω_c = 0.6, d/γ = 0.001)
5.15 Classification accuracy using Case 2 MFDMap with x = 20 (map size = 58, ω_c = 0.6, d/γ = 0.001)
5.16 Classification accuracy using Case 2 MFDMap with x = 30 (map size = 49, ω_c = 0.6, d/γ = 0.001)
5.17 Classification accuracy using Case 2 MFDMap with x = 40 (map size = 44, ω_c = 0.6, d/γ = 0.001)
5.18 Classification accuracy using Case 2 MFDMap with x = 50 (map size = 37, ω_c = 0.6, d/γ = 0.001)
5.19 Confusion matrix for Case 1 MFDMap with x = 15 at 500 hidden neurons (map size = 71, ω_c = 0.6, d/γ = 0.001)
5.20 Confusion matrix for Case 2 MFDMap with x = 15 at 500 hidden neurons (map size = 63, ω_c = 0.6, d/γ = 0.001)

5.21 Classification accuracy of the RPROP, ELM, R-ELM, FIR-ELM and SVM classifier
6.1 The fifteen MFDMaps
6.2 Classification accuracy (%) using one music element in the MFDMap
6.3 Classification accuracy (%) using two music elements in the MFDMap
6.4 Classification accuracy (%) using three music elements in the MFDMap
6.5 Confusion matrix for the MFDMap using interval, duration and duration ratio elements
6.6 Classification accuracy of the RPROP, ELM, R-ELM, FIR-ELM and SVM classifier
A.1 Classification accuracy (%) of the RPROP classifier using median
A.2 Classification accuracy (%) of the RPROP classifier using mean
A.3 Classification accuracy (%) of the RPROP classifier using variance
A.4 Classification accuracy (%) of the RPROP classifier using median and mean
A.5 Classification accuracy (%) of the RPROP classifier using median and variance
A.6 Classification accuracy (%) of the RPROP classifier using mean and variance
A.7 Classification accuracy (%) of the RPROP classifier using median, mean and variance
A.8 Classification accuracy (%) of the ELM classifier using median
A.9 Classification accuracy (%) of the ELM classifier using mean
A.10 Classification accuracy (%) of the ELM classifier using variance
A.11 Classification accuracy (%) of the ELM classifier using median and mean
A.12 Classification accuracy (%) of the ELM classifier using median and variance
A.13 Classification accuracy (%) of the ELM classifier using mean and variance
A.14 Classification accuracy (%) of the ELM classifier using median, mean and variance
A.15 Classification accuracy (%) of the low-pass FIR-ELM classifier using median
A.16 Classification accuracy (%) of the low-pass FIR-ELM classifier using mean
A.17 Classification accuracy (%) of the low-pass FIR-ELM classifier using variance
A.18 Classification accuracy (%) of the low-pass FIR-ELM classifier using median and mean
A.19 Classification accuracy (%) of the low-pass FIR-ELM classifier using median and variance
A.20 Classification accuracy (%) of the low-pass FIR-ELM classifier using mean and variance
A.21 Classification accuracy (%) of the low-pass FIR-ELM classifier using median, mean and variance
A.22 Classification accuracy (%) of the high-pass FIR-ELM classifier using median
A.23 Classification accuracy (%) of the high-pass FIR-ELM classifier using mean
A.24 Classification accuracy (%) of the high-pass FIR-ELM classifier using variance
A.25 Classification accuracy (%) of the high-pass FIR-ELM classifier using median and mean
A.26 Classification accuracy (%) of the high-pass FIR-ELM classifier using median and variance
A.27 Classification accuracy (%) of the high-pass FIR-ELM classifier using mean and variance
A.28 Classification accuracy (%) of the high-pass FIR-ELM classifier using median, mean and variance
A.29 Classification accuracy (%) of the band-pass FIR-ELM classifier using median
A.30 Classification accuracy (%) of the band-pass FIR-ELM classifier using mean
A.31 Classification accuracy (%) of the band-pass FIR-ELM classifier using variance
A.32 Classification accuracy (%) of the band-pass FIR-ELM classifier using median and mean
A.33 Classification accuracy (%) of the band-pass FIR-ELM classifier using median and variance
A.34 Classification accuracy (%) of the band-pass FIR-ELM classifier using mean and variance
A.35 Classification accuracy (%) of the band-pass FIR-ELM classifier using median, mean and variance
A.36 Classification accuracy (%) of the band-stop FIR-ELM classifier using median
A.37 Classification accuracy (%) of the band-stop FIR-ELM classifier using mean
A.38 Classification accuracy (%) of the band-stop FIR-ELM classifier using variance
A.39 Classification accuracy (%) of the band-stop FIR-ELM classifier using median and mean
A.40 Classification accuracy (%) of the band-stop FIR-ELM classifier using median and variance
A.41 Classification accuracy (%) of the band-stop FIR-ELM classifier using mean and variance
A.42 Classification accuracy (%) of the band-stop FIR-ELM classifier using median, mean and variance

List of Acronyms

ANN      artificial neural network
BH       beat histogram
BP       backpropagation
bpm      beats-per-minute
DFT      discrete Fourier transform
DWT      discrete wavelet transform
ELM      extreme learning machine
ERM      empirical risk minimization
EsAC     Essen Associative Code
FFT      fast Fourier transform
FIR      finite impulse response
FIR-ELM  finite impulse response extreme learning machine
FNN      feedforward neural network
FPH      folded pitch histogram
LPC      linear predictive coding
MFCC     Mel-frequency cepstral coefficients
MIDI     Musical Instrument Digital Interface
MLP      multi-layer perceptron
MFDMap   musical feature density map
OSC      Open Sound Control
PH       pitch histogram
R-ELM    regularized extreme learning machine
RMS      root mean square
RPROP    resilient propagation
SACF     summary enhanced autocorrelation function
SC       spectral centroid
SF       spectral flux
SR       spectral roll-off
SLFN     single-hidden layer feedforward neural network
SRM      structural risk minimization
SVM      support vector machine
TPU      threshold processing unit
UPH      unfolded pitch histogram
ZC       zero-crossing

Chapter 1

Introduction

This thesis investigates the application of a few powerful learning machines in music classification using a symbolic database of folk songs: the Essen Folksong Collection [1]. A meaningful and representative theory-based method of encoding Chinese folk songs is first developed to enable classification by learning machines. This encoding will aid ethnomusicologists in future folk song research, and the superiority of the finite impulse response extreme learning machine (FIR-ELM) for music classification is confirmed.

1.1 Motivation

The single-hidden layer feedforward neural network (SLFN) is the simplest and most popular structure of multi-layer perceptrons. It has been shown in [2-3] that SLFNs with any continuous bounded non-linear activation function, or any arbitrary (continuous or non-continuous) bounded activation function that has unequal limits at infinities, can approximate any continuous function and implement any classification application with a sufficiently large number of hidden neurons. Such an architecture has vast applications, particularly in pattern recognition.

In recent years, an emerging technology called the extreme learning machine (ELM) [4] has been attracting attention within the machine learning domain.

The ELM is a learning algorithm designed specifically for single-hidden layer feedforward neural networks. Unlike conventional gradient descent-based algorithms, its main attractions are very fast learning speed and good generalization performance. The ELM has wide applications in the pattern classification domain; some examples are handwritten character recognition [5-6], classification of bioinformatics datasets [7-11], financial credit scoring [12-13], internet-based information processing [14-17] and music genre classification [18].

The finite impulse response extreme learning machine (FIR-ELM), an enhanced variant of the ELM that has been theoretically proven to greatly improve its robustness, was proposed in [19]. The FIR-ELM is also designed for single-hidden layer feedforward neural networks. This algorithm adopts the concept of the FIR filter in the design of the hidden layer of the neural network to effectively remove input disturbances and undesired frequency components. This modification greatly improves the robustness of the original ELM algorithm, especially in handling noisy data. In addition, an objective function that includes both the weighted sum of the output error squares and the weighted sum of the output weight squares is minimized in the output weight space of the neural network to compute a set of optimal output weights, further improving robustness. This new algorithm was employed for a real-world binary classification task on a bioinformatics dataset in [20]. However, until now there has been no application of such an algorithm to any real-world multi-class classification.

Chinese folk songs are an important part of Chinese culture and a valuable source for humanities research. They reflect the history, society, customs, tradition and everyday life of the nation. They are the faithful companion of the people in their daily life, as a form of entertainment and as an accompaniment to labour. They serve as a medium to transfer and exchange knowledge and information, to express feelings, thoughts and emotions, to communicate and to entertain.

Chinese folk songs have had a significant influence on the development of other forms of traditional music, including traditional dance music, opera, instrumental music and quyi (曲艺) [21]. Many instrumental and dance pieces are adapted or rearranged from folk songs.

Chinese folk songs also have an active influence on court music, religious music and cultivated music. In addition, many contemporary composers produce works that use folk songs or components of folk songs as the theme, and works that reflect great influence from folk songs. Chinese folk songs are unquestionably a very important asset of humanity. This thesis intends to contribute to preserving and sustaining this important art.

1.2 Contribution

The main contributions of this thesis are summarized as follows:

1. A novel music encoding method is developed for encoding Chinese folk songs. This encoding method utilizes the symbolic representations of the musical elements and enables music to be represented in a manner that is as close to human perception as possible.
2. The ELM technique is successfully implemented for folk song classification using a real-world Han Chinese folk song data set.
3. The FIR-ELM, an improved version of the ELM, gives a better outcome in solving folk song classification, and the capability of such an algorithm in multi-class classification is verified. In addition, a potentially useful method of encoding songs is demonstrated, which may be helpful in future ethnomusicology research on Chinese folk songs.
4. The developed song encoding technique and the machine learning based classification algorithms are then applied to European folk songs, and their performance and usefulness are successfully verified.

1.3 Organization of the Thesis

The main contents of this thesis are organized as follows:

Chapter 2 presents a brief overview of artificial neural networks (ANNs), focusing on SLFNs and the conventional learning algorithms developed for this network structure. A brief review of the techniques used for representing music in machine classification is included.

Chapter 3 discusses the ethnomusicology background for geographically based Han Chinese folk song classification, and the format and musical contents of the real-world data set employed for the research in this thesis. The music elements employed to characterize each class of folk songs and their respective methods of representation are presented. Finally, the novel technique of developing a feature map to meaningfully represent folk songs for machine classification, without loss of musical meaning, is proposed.

Chapter 4 presents an outline of the extreme learning machine (ELM) and the regularized extreme learning machine (R-ELM) algorithms, followed by a detailed description of the experiment design and settings for the implementation of machine classification. This chapter also includes a careful discussion of the technique of automatic classification for Chinese folk songs.

Chapter 5 investigates the capability of a new robust algorithm called the finite impulse response extreme learning machine (FIR-ELM) on multi-class classification problems. At the same time, the enhancement to the performance of automatic classification for Han Chinese folk songs is tested using this algorithm in a series of different experiments.

Chapter 6 presents a two-case European folk song classification task using the conclusions derived in Chapter 5, to further investigate the success rate of the technique on folk songs of other cultures.

Chapter 7 concludes the research activities in this thesis and presents a summary of the findings. Some suggestions for future work are included in this chapter.


Chapter 2

Literature Review

This chapter presents a brief overview of artificial neural networks, focusing particularly on the structure of single-hidden layer feedforward neural networks and the conventional learning algorithms applicable to this network structure. A brief review of the techniques for representing music in machine classification is also included.

2.1 Artificial Neural Network

The human brain is a highly complex system. It is capable of performing parallel computation in a non-linear manner. Neurons in the human brain can be organized to perform multiple tasks such as pattern recognition, motor control and perception. Artificial neural networks (ANNs) mimic the organization and functionality of the human brain. The work on ANNs, commonly referred to simply as neural networks, is vast and usually mimics the natural behaviour and phenomena of such a thinking system. In order to be capable of performing complex tasks, neural networks employ a massive interconnection of simple computing cells referred to as neurons or processing units. A good definition of a neural network is as follows [22]:

A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:

1. Knowledge is acquired by the network through a learning process.
2. Interneuron connection strengths known as synaptic weights are used to store the knowledge.

The learning algorithm is the procedure used to perform the learning process; it functions by modifying the synaptic weights in the network in an orderly fashion so as to achieve a desired design objective.

McCulloch-Pitts Threshold Processing Unit

The McCulloch-Pitts Threshold Processing Unit (TPU) is a concept developed by McCulloch and Pitts [22-24] in 1943. It can be considered the initial structure of the ANN. This model only takes binary inputs (0 or 1), each of which is connected to a fixed weight, together with a bias term. The output of the model, also in binary form, is obtained by multiplying the inputs with the weights, summing, adding the bias, and passing the result through a threshold activation function. An example of the TPU model is shown in Figure 2.1. The computation of the TPU is as follows. For a sample input data vector x = (x_1, x_2, …, x_n), the output of the model is

    y = g(x) = g( Σ_{i=1}^{n} w_i x_i + b )    (2.1)

where w_i is the weight connecting the ith input, b is the threshold term and g(x) is the threshold activation function. In the TPU, the weights are decimal numbers which rank the relative importance of each input, and the threshold is a small value that has the effect of applying an affine transformation to the output y. The threshold activation function g(x) is defined as follows:

    g(x) = { 1,  x ≥ 0
           { 0,  x < 0    (2.2)

Figure 2.1: An example of a Threshold Processing Unit.
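As a rough illustration only (not code from the thesis), the following NumPy sketch implements the TPU computation of (2.1) and (2.2); the weights and bias are hypothetical values, chosen here so that the unit realizes a logical AND of its two binary inputs.

```python
import numpy as np

def tpu(x, w, b):
    """Threshold processing unit: weighted sum plus bias, then threshold (2.1)-(2.2)."""
    a = np.dot(w, x) + b          # linear combination of the binary inputs
    return 1 if a >= 0 else 0     # threshold activation g(x) of (2.2)

# Hypothetical parameters realizing a logical AND of two inputs.
w = np.array([1.0, 1.0])
b = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", tpu(np.array(x, dtype=float), w, b))
```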

Rosenblatt's Perceptron

The first perceptron was developed by Frank Rosenblatt [25-26] in 1958. It is the first model proposed for learning with a teacher (supervised learning), and it is the simplest form of a neural network used for classification of patterns that are linearly separable. The perceptron resembles the structure of a TPU: it has a single neuron, weights and a bias. Unlike the TPU, however, the perceptron has adjustable synaptic weights. In order to tune the synaptic weights, an error-correction rule known as the perceptron convergence algorithm was developed; the synaptic weights are adjusted on an iteration-by-iteration basis.

In the perceptron model, the input vector is defined as x(n) = [+1, x_1(n), x_2(n), …, x_m(n)]^T, where +1 is the fixed input associated with the bias term b, and n denotes the time-step in applying the algorithm. Correspondingly, the weight vector is defined as w(n) = [b, w_1(n), w_2(n), …, w_m(n)]^T. The linear combiner output of the neuron can then be written in compact form as

    y(n) = Σ_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)    (2.3)

In order for the perceptron to function properly, the two classes C_1 and C_2 must be linearly separable. This means that the patterns must be sufficiently separated from each other so that the decision surface consists of a hyperplane, i.e. there exists a weight vector w such that

    w^T x > 0  for every input vector x belonging to class C_1
    w^T x ≤ 0  for every input vector x belonging to class C_2    (2.4)

The error-correction algorithm for adapting the weights of the perceptron can be summarized as follows [26]:

1. Initialization. Set w(0) = 0. Then perform the following computations for time-step n = 1, 2, …
2. Activation. At time-step n, activate the perceptron by applying the continuous-valued input vector x(n) and the desired response d(n).
3. Computation of actual response. Compute the actual response y(n) of the perceptron as
       y(n) = sgn[w^T(n) x(n)]    (2.5)
   where sgn(·) is the signum function.
4. Adaptation of weight vector. Update the weight vector of the perceptron to obtain
       w(n + 1) = w(n) + η[d(n) − y(n)] x(n)    (2.6)
   where
       d(n) = { +1,  if x(n) belongs to class C_1
              { −1,  if x(n) belongs to class C_2    (2.7)
   η is the learning-rate parameter, a positive constant limited to the range 0 < η ≤ 1, and the difference d(n) − y(n) plays the role of the error signal.
5. Continuation. Increase time-step n by one and go back to step 2.
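A minimal sketch of the perceptron convergence algorithm above, on a hypothetical linearly separable toy set; this example is illustrative and not taken from the thesis.

```python
import numpy as np

def train_perceptron(X, d, eta=0.5, epochs=20):
    """Error-correction learning of (2.5)-(2.7).

    X holds one input vector per row; d holds desired responses in {+1, -1}.
    A constant +1 is prepended to each input so that w[0] plays the role of b.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # x(n) = [+1, x_1, ..., x_m]^T
    w = np.zeros(Xb.shape[1])                       # step 1: w(0) = 0
    for _ in range(epochs):
        for x_n, d_n in zip(Xb, d):                 # steps 2-5, one sample per time-step
            y_n = 1 if w @ x_n >= 0 else -1         # actual response (2.5)
            w += eta * (d_n - y_n) * x_n            # weight adaptation (2.6)
    return w

# Hypothetical classes separated by the line x_1 + x_2 = 1.
X = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.2], [-1.0, 0.5]])
d = np.array([1, 1, -1, -1])
print(train_perceptron(X, d))
```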

Multi-Layer Perceptron

The multi-layer perceptron (MLP) [22] extends the capability of Rosenblatt's perceptron from classifying linearly separable patterns to solving non-linear problems. The typical structure of an MLP consists of one or more hidden layers between the input layer and the output layer, forming a cascade of perceptrons. The input signals are propagated through the network in a forward direction, on a layer-by-layer basis; due to this direction of signal propagation, the MLP is also known as the multi-layer feedforward neural network (FNN). The structure of an MLP can be of any size. The simplest MLP is the single-hidden layer feedforward neural network (SLFN), whose structure consists of only one hidden layer besides the input and output layers.

Single-Hidden Layer Feedforward Neural Network

A single-hidden layer feedforward neural network has three network layers: an input layer, a hidden layer and an output layer. The input layer consists of sensory units called input neurons that receive activation signals from an external source and supply the respective elements of the activation pattern (input vector) to the neurons in the second layer, i.e. the hidden layer. The hidden layer consists of computational nodes called hidden neurons and serves to intervene between the input layer and the output layer. It acts as a pre-processor that receives the input pattern from the input layer and projects it into the feature space, so that the features can be more easily separated. Finally, the output layer, which consists of computational nodes called output neurons, receives the pre-processed pattern from the hidden layer and performs further computation to produce the set of output signals that constitutes the overall response of the neural network to the activation pattern supplied by the input neurons. It is to be noted that the input neurons are non-computational nodes; they simply receive activation signals and supply them to the hidden layer for computation.

The network structure of a single-hidden layer feedforward neural network is shown in Figure 2.2. In this neural network, there are n input neurons, Ñ hidden neurons and m output neurons. The analytic function corresponding to the SLFN in Figure 2.2 can be written as follows. The output of the jth hidden neuron is obtained by first forming a weighted linear combination of all n input values and adding a bias, giving

    a_j = Σ_{i=1}^{n} w_{ji} x_i + b_j    (2.8)

for j = 1, 2, 3, …, Ñ, with w_{ji} the weight connecting the ith input to the jth hidden neuron and b_j the bias term for the jth hidden neuron.

Figure 2.2: A single-hidden layer feedforward neural network.

Then, the linear sum in (2.8) is transformed using a non-linear activation function g(x) to give the activated output

    y_j = g(a_j)    (2.9)

The final outputs of the neural network are obtained by transforming the activations of the hidden neurons using a second layer of processing elements, i.e. the output neurons in the output layer. Thus, for each output neuron k, a linear combination of the outputs of the hidden neurons is formed to give

    a_k = Σ_{j=1}^{Ñ} β_{kj} y_j + b_k    (2.10)

for k = 1, 2, 3, …, m, where β_{kj} is the output weight connecting the jth hidden neuron to the kth output and b_k is the bias term for the kth output. Similarly, an activation function is applied to the linear sum in (2.10) to give the final output

    o_k = g̃(a_k)    (2.11)

The notation g̃(x) is used to emphasize that the activation function for the output layer need not be the same as the activation function for the hidden layer. Often the two differ, because the output neurons perform different roles from the hidden neurons; in most cases, a linear activation function is used for the output layer instead of a non-linear one.
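The forward computation of (2.8)-(2.11) can be sketched in a few lines of NumPy; the sigmoid hidden activation, the linear output layer and all sizes below are hypothetical choices for illustration, not specifics from the thesis.

```python
import numpy as np

def slfn_forward(x, W, b_hidden, beta, b_out):
    """SLFN forward pass: sigmoid hidden neurons, linear output neurons."""
    a = W @ x + b_hidden            # hidden pre-activations a_j, (2.8)
    y = 1.0 / (1.0 + np.exp(-a))    # non-linear g(x), here a sigmoid, (2.9)
    return beta @ y + b_out         # linear output layer, (2.10)-(2.11)

# Hypothetical sizes: n = 4 inputs, 6 hidden neurons, m = 3 outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W, b_hidden = rng.normal(size=(6, 4)), rng.normal(size=6)
beta, b_out = rng.normal(size=(3, 6)), rng.normal(size=3)
print(slfn_forward(x, W, b_hidden, beta, b_out))
```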

Learning Algorithms

The functionality of a multi-layer perceptron lies in its capability to learn a suitable mapping from a given data set. The efficiency of the MLP mainly depends on the learning algorithm, which determines the ideal adjustments and settings of the parameters of the MLP. There are two broad categories of learning algorithm: supervised learning and unsupervised learning.

Supervised learning is also known as active learning [22], where an external teacher is supplied. The role of the teacher is to provide the desired (target) response for a training vector, so that the network can learn a good mapping of the input-output patterns. The desired response represents the optimum action to be performed by the neural network. The network parameters are adjusted under the combined influence of the training vector and the error signal (defined as the difference between the actual response of the network and the desired response). The adjustment continues iteratively in a step-by-step fashion, with the aim that the neural network will eventually emulate the teacher.

Unsupervised learning, also known as self-organized learning [22], is the opposite of supervised learning: no teacher is present. Rather, the parameters of the network are optimized with respect to a task-independent measure of the input, and an internal representation of the input is formed without influence from any external source. The techniques employed in this thesis are supervised learning techniques; hence, unsupervised learning is not discussed further.

Gradient Descent-Based Algorithms

The most popular, and also one of the simplest, learning algorithms for MLPs is the gradient descent method (also known as steepest descent). In the gradient descent method, the network learning process starts with an initial random weight vector. The weight vector is then iteratively updated in steps such that, at each step, it moves a short distance in the direction of the negative gradient (i.e. the direction of greatest rate of decrease) of the error surface. At each successive step the value of the error function E decreases, eventually leading to a weight vector at which

    ∇E = 0    (2.12)

The error function, typically the mean sum of squares, is defined as

    E = (1/m) Σ_{k=1}^{m} (o_k − t_k)²    (2.13)

where m is the number of outputs, o_k is the actual neural network response of the kth output neuron in (2.10) and (2.11), and t_k is the corresponding target for a particular input pattern x_n. In order to reduce the error value E, the network weights are updated as follows:

    w_{ji,new} = w_{ji,old} + Δw_{ji}    (2.14)

where w_{ji} is the weight connecting the ith input to the jth hidden neuron and

    Δw_{ji} = −η ∂E/∂w_{ji}    (2.15)

with η the learning rate parameter of the gradient descent algorithm. The output weights are updated using analogous expressions.
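A minimal sketch of the update rule (2.13)-(2.15) for a single linear neuron, with the analytic gradient written out for this simple model; the data and target mapping are hypothetical, not from the thesis.

```python
import numpy as np

def gradient_descent_step(w, X, t, eta=0.1):
    """One application of (2.14)-(2.15) with E the mean squared error (2.13)."""
    o = X @ w                              # outputs of a single linear neuron
    E = np.mean((o - t) ** 2)              # error function (2.13)
    grad = 2.0 / len(t) * X.T @ (o - t)    # dE/dw for this linear model
    return w - eta * grad, E               # move a short step against the gradient

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
t = X @ np.array([1.0, -2.0, 0.5])         # hypothetical target mapping
w = np.zeros(3)
for _ in range(200):
    w, E = gradient_descent_step(w, X, t)
print(w, E)                                # w approaches [1, -2, 0.5], E approaches 0
```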

The main advantage of gradient descent-based methods is the relatively simple computation of the algorithm. However, although the optimization always arrives at a minimum, depending on the initial starting point this may be a local minimum instead of the global minimum. Unfortunately, once the algorithm converges to a minimum there is no further way to decrease the error value, and the optimization process has to be restarted; the solution obtained will therefore often be non-optimal. The other problem with gradient descent-based methods is the long learning time. As the optimization is performed in an iterative, step-by-step manner, these methods require a long learning time; in addition, the number of parameters that need tuning also leads to a time-consuming process.

Discriminant-Based Algorithms

The discriminant-based algorithms are fairly different from the gradient descent algorithms. Unlike the gradient descent algorithms, which approximate the parameters of the feature probability distribution, the discriminant-based algorithms focus on finding the discriminants that separate members of different classes, by estimating these discriminants directly. The support vector network, also known as the support vector machine (SVM) [27-28], is one of the most popular algorithms in this group. It is an alternative method proposed to overcome the problems of gradient descent algorithms. Unlike the gradient descent algorithms, the SVM is a non-probabilistic binary linear classifier which utilizes Lagrange multipliers in its output weight optimization. The SVM was initially designed to solve binary classification problems. It works by non-linearly mapping the input vectors to a very high-dimensional feature space in which a linear decision surface can then be constructed. Unlike in the gradient descent algorithms, the non-linear mapping function of the SVM is decided based on a priori knowledge, and the output layer decision surface is then computed using an optimization method.

One of the drawbacks of the SVM is the complexity of the optimization procedure and the high degrees of the polynomials used for forming the decision surfaces, which lead to a considerably long learning time. The running times of the state-of-the-art SVM learning algorithms scale approximately quadratically with the number of training samples. In addition, as the SVM is designed for binary classification, in order to solve a multi-class classification problem the algorithm has to break the single multi-class problem down into multiple binary classification problems.

Extreme Learning Machine

The major bottlenecks of the gradient descent-based feedforward neural networks described above are the very slow learning speed and the issue of converging to local minima. It has been shown in [29] and [30] that single-hidden layer feedforward neural networks with N hidden neurons and arbitrarily chosen input weights (the weights connecting the input layer to the hidden layer) can learn N distinct observations with arbitrarily small error. This method has been proved to produce good generalization performance and extremely fast learning speed on both artificial and real applications in [31]. It has been further proved in [32] that SLFNs with arbitrarily assigned input weights and hidden layer biases, and with almost any non-zero activation function, can universally approximate any continuous function on any compact input set.

The extreme learning machine (ELM), an emerging learning algorithm that utilizes the structure of a single-hidden layer feedforward neural network, has been proved to overcome the limitations of both the gradient descent-based algorithms and the support vector machine through its technique of parameter assignment, and to outperform both algorithms [33]. Unlike conventional gradient descent-based algorithms, the ELM randomly assigns the input weights and hidden layer biases and deterministically computes the optimal output weights using the generalized inverse of the hidden layer outputs. Hence, the ELM's learning speed can be many times faster than that of conventional gradient descent-based algorithms, while obtaining better performance. In addition, the generalized inverse operation allows the ELM to reach the smallest training error and the smallest norm of weights.

The ELM uses the network structure shown in Figure 2.2, i.e. a single-hidden layer feedforward neural network. For a dataset with N distinct samples {(X, T) | X = [x_1, x_2, …, x_N], T = [t_1, t_2, …, t_N]}, where x_i = [x_i1, x_i2, …, x_in]^T ∈ R^n is the input vector and t_i = [t_i1, t_i2, …, t_im]^T ∈ R^m is the target vector, the SLFN with Ñ hidden neurons can be written as

    Σ_{j=1}^{Ñ} β_j g(w_j · x_i + b_j) = o_i    (2.16)

for i = 1, 2, …, N, where β_j = [β_j1, β_j2, …, β_jm]^T is the output weight vector connecting the jth hidden neuron to the output neurons, w_j = [w_j1, w_j2, …, w_jn]^T is the input weight vector connecting the input neurons to the jth hidden neuron, b_j is the bias of the jth hidden neuron, w_j · x_i denotes the inner product of w_j and x_i, g(x) is the activation function, and o_i = [o_i1, o_i2, …, o_im]^T ∈ R^m is the output vector with respect to the input vector x_i.

It is to be noted that the output neurons are linear, i.e. the activation function of the output neurons is a linear function. For the SLFN with Ñ hidden neurons and activation function g(x) to approximate the N data samples with zero error, there exist β_j, w_j and b_j such that

    Σ_{j=1}^{Ñ} β_j g(w_j · x_i + b_j) = t_i    (2.17)

for i = 1, 2, …, N. (2.17) can then be written compactly in matrix form as Hβ = T, where

    H(w_1, …, w_Ñ, b_1, …, b_Ñ, x_1, …, x_N) =
        [ g(w_1 · x_1 + b_1)   g(w_2 · x_1 + b_2)   …   g(w_Ñ · x_1 + b_Ñ) ]
        [ g(w_1 · x_2 + b_1)   g(w_2 · x_2 + b_2)   …   g(w_Ñ · x_2 + b_Ñ) ]
        [         ⋮                    ⋮             ⋱          ⋮          ]
        [ g(w_1 · x_N + b_1)   g(w_2 · x_N + b_2)   …   g(w_Ñ · x_N + b_Ñ) ]   (N × Ñ)    (2.18)

    β = [ β_1^T
          β_2^T
            ⋮
          β_Ñ^T ]   (Ñ × m)    (2.19)

    T = [ t_1^T
          t_2^T
           ⋮
          t_N^T ]   (N × m)    (2.20)

The output weight matrix β of the SLFN is then computed as follows:

    β = (H^T H)^{−1} H^T T    (2.21)
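The whole training procedure fits in a few lines, which is the source of the ELM's speed. The sketch below is illustrative only (hypothetical data, sizes and names, not code from the thesis); it uses a sigmoid activation and computes the output weights with the Moore-Penrose pseudoinverse np.linalg.pinv, a numerically safer equivalent of the normal-equations form (2.21).

```python
import numpy as np

def elm_train(X, T, n_hidden, rng):
    """ELM: random input weights and biases, output weights by pseudoinverse."""
    W = rng.normal(size=(n_hidden, X.shape[1]))   # random input weights w_j
    b = rng.normal(size=n_hidden)                 # random hidden biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))      # hidden layer output matrix (2.18)
    beta = np.linalg.pinv(H) @ T                  # minimum norm least squares solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta                               # network outputs O = H beta

# Hypothetical 3-class problem with one-hot target rows t_i.
rng = np.random.default_rng(2)
X = rng.normal(size=(90, 5))
labels = rng.integers(0, 3, size=90)
T = np.eye(3)[labels]
W, b, beta = elm_train(X, T, n_hidden=40, rng=rng)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
print("training accuracy:", (pred == labels).mean())
```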

Regularized Extreme Learning Machine

Although the ELM greatly improves on the performance of conventional gradient descent-based algorithms, the design of the output layer weights in the ELM gives rise to an issue. As the output weights are determined through a generalized inverse of the hidden layer output matrix, this minimum norm least squares solution is an empirical risk minimization (ERM) operation, which tends to result in an overfitted model, especially if the training set is not sufficiently large.

Deng, Zheng and Chen [34] proposed to overcome this drawback by introducing a regularization term into the ELM algorithm. A weight factor γ for the empirical risk is inserted to regularize the proportion between the empirical risk ‖ε‖² and the structural risk ‖β‖². Their improved algorithm is called the regularized extreme learning machine (R-ELM). In the R-ELM algorithm, the output weights are calculated by minimizing both the weighted sum of the output error squares and the sum of the output weight squares of the SLFN:

    minimize    γ‖ε‖² + ‖β‖²    (2.22)
    subject to  ε = O − T = Hβ − T    (2.23)

The problem is solved by using the method of Lagrange multipliers:

    L = (γ/2) Σ_{i=1}^{N} Σ_{j=1}^{m} ε_ij² + (1/2) Σ_{i=1}^{Ñ} Σ_{j=1}^{m} β_ij²
        − Σ_{k=1}^{N} Σ_{p=1}^{m} λ_kp (h_k^T β_p − T_kp − ε_kp)    (2.24)

where ε_ij is the ijth element of the error matrix ε, β_ij is the ijth element of the output weight matrix β, T_ij is the ijth element of the output data matrix T, h_k is the vector formed by the kth row of the hidden layer output matrix H, β_p is the pth column of the output weight matrix β, λ_kp is the kpth Lagrange multiplier and γ is the constant parameter used to adjust the empirical risk. Differentiating L in (2.24) with respect to β_ij and ε_ij and setting the derivatives equal to zero gives

    ∂L/∂β = 0  ⟹  β = H^T λ    (2.25)

and

    ∂L/∂ε = 0  ⟹  λ = −γε    (2.26)

Considering the constraint in (2.23), (2.26) can be expressed as

    λ = −γ(Hβ − T)    (2.27)

Using (2.27) in (2.25) leads to the computation of the output weight matrix β of the SLFN:

    β = (I/γ + H^T H)^{−1} H^T T    (2.28)
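Relative to the ELM sketch above, only the output weight computation changes; a hedged illustration of (2.28), using np.linalg.solve instead of forming the inverse explicitly, with hypothetical shapes.

```python
import numpy as np

def relm_output_weights(H, T, gamma):
    """Regularized output weights of (2.28): beta = (I/gamma + H^T H)^(-1) H^T T."""
    A = np.eye(H.shape[1]) / gamma + H.T @ H
    return np.linalg.solve(A, H.T @ T)     # solving A beta = H^T T is more stable

# Hypothetical shapes matching the ELM sketch: 90 samples, 40 hidden neurons, 3 classes.
rng = np.random.default_rng(3)
H = rng.normal(size=(90, 40))
T = np.eye(3)[rng.integers(0, 3, size=90)]
print(relm_output_weights(H, T, gamma=100.0).shape)   # (40, 3)
```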

In this thesis, the application of the single-hidden layer feedforward neural network, trained with the extreme learning machine technique, as the folk song classifier is investigated. The discussion of this technique for folk song classification is presented in Chapter 4 and Chapter 5.

2.2 Music Representations

As computer technology has become more sophisticated and accessible, interest in involving machines in music classification has flourished. Automatic music classification consists of using a machine to obtain useful features from music, and using these features to identify which of a set of classes a new piece of music most likely belongs to.

The two main formats of digital representation of music are the audio format and the symbolic format. In the audio format, music is represented in the form of raw audio signals; there is no explicit information about the music notes, voicing and phrasing, nor any musical symbols or tags. WAV and MP3 files are the most commonly used audio representations [35]. The symbolic format, on the other hand, uses symbols and notations with direct musical meaning to model the visual aspects of a music score, together with audio information or annotations related to the music piece. Symbolic representations contain information about what is to be played and how. Some commonly used symbolic representations are MIDI, Humdrum, abc and MusicXML [36]. In music classification, the choice of format usually depends on the availability of the data samples.

Audio Format

In music classification using audio data, the features used to characterize each class are constructed from information directly derived from audio signal properties. These features are usually referred to as low-level features, as they do not provide direct and precise information about the musical context and content. This information is usually obtained by performing feature extraction on a fixed-size segment of the audio signal called a window or frame. A window can contain audio samples ranging from a few milliseconds to seconds, and sometimes even minutes. While most features extracted from an audio signal are based on short windows, longer windows can be used if information on large-scale structure is desired. A music audio signal is usually segmented into many overlapping windows in order to increase time localization. The distance between the starts of two overlapping windows is usually called the hop size. Although there is no fixed standard for the hop size, the common hop size applied in music classification is half the size of the analysis window.

In the following, a summarized list of common audio features employed in [18, 37-59] is briefly discussed. The first part of the discussion focuses on features extracted in the time domain. The second part discusses features that are extracted in the frequency domain using the discrete Fourier transform (DFT) technique. In order to allow the combination of features from both the time domain and the frequency domain, the analysis windows in both cases are usually made the same size, although this is not compulsory. Finally, in the third part, two high-level features that can be extracted from the audio signal are discussed.

Common Audio Features Extracted in Time Domain

Features derived from an audio signal in the time domain are usually calculated directly from the sequence of samples. While most features described in this part are extracted from individual short windows, usually on a time scale ranging from 10 to 40 milliseconds (to be consistent with the window size applied for feature extraction in the frequency domain), some features are calculated over a collection of consecutive short windows, in order to capture how the signal changes over time. To define the features mathematically, let the music signal be denoted x, and let the tth analysis window, constructed by taking N samples at a time from the music signal x with hop size h, be denoted x_t:

    x_t[n] = { x[n + h(t − 1)],  0 ≤ n ≤ N − 1
             { 0,                otherwise    (2.29)

For non-overlapping analysis windows, the hop size h is equal to the window size N.

Root Mean Square (RMS)

The root mean square is a measure of the power in the music signal. It is often used as a loudness feature in audio-based music classification. The RMS is defined as follows:

    RMS_t = √( (1/N) Σ_{n=0}^{N−1} x_t[n]² )    (2.30)

Fraction of Low Energy Windows

The fraction of low energy windows measures the fraction of analysis windows, within a set of consecutive windows, whose root mean square value is below some threshold. The common calculation uses the average RMS of the set of windows under consideration as the threshold value. This feature gives an indication of the fraction of silence, or near silence, in the segment of signal under consideration. Therefore, music with little silence, for example music with high instrumental activity, will have a low fraction of low energy windows.
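A small illustrative sketch (hypothetical signal, not from the thesis) of the framing in (2.29), the RMS of (2.30), and the low energy fraction with the average RMS as the threshold:

```python
import numpy as np

def rms(frame):
    """Root mean square of one analysis window, (2.30)."""
    return np.sqrt(np.mean(frame ** 2))

def low_energy_fraction(x, N, h):
    """Fraction of windows whose RMS lies below the average RMS of the set."""
    frames = [x[s:s + N] for s in range(0, len(x) - N + 1, h)]   # windows x_t, (2.29)
    values = np.array([rms(f) for f in frames])
    return float(np.mean(values < values.mean()))

# A tone with a near-silent middle section raises the low energy fraction.
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
x = np.sin(2 * np.pi * 220 * t)
x[3000:5000] *= 0.01
print(low_energy_fraction(x, N=256, h=128))
```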

Zero-Crossing (ZC)

The zero-crossing count is the number of times the waveform changes sign within a given music frame of length N; in other words, it is the number of times the signal passes the zero midpoint of the signal range. It is used as an indication of noisiness, as signals with no DC component tend to cross the midpoint more often. The zero-crossing count is highly correlated with the spectral centroid for clean (non-noisy) signals. It is computed as

    ZC_t = Σ_{n=1}^{N−1} | sign(x_t[n]) − sign(x_t[n − 1]) |    (2.31)

where

    sign(x_t[n]) = { 1,  x_t[n] ≥ 0
                   { 0,  x_t[n] < 0,    0 ≤ n ≤ N − 1    (2.32)
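A sketch of (2.31)-(2.32) on a hypothetical pure tone; with sign values in {0, 1}, summing the absolute differences counts the sign changes directly.

```python
import numpy as np

def zero_crossings(frame):
    """Zero-crossing count of one window, per (2.31)-(2.32)."""
    s = (frame >= 0).astype(int)            # sign(x_t[n]) in {0, 1}, (2.32)
    return int(np.sum(np.abs(np.diff(s))))  # number of sign changes, (2.31)

# A 440 Hz sine sampled at 8 kHz for one second crosses zero about 880 times.
t = np.arange(8000) / 8000.0
print(zero_crossings(np.sin(2 * np.pi * 440 * t)))
```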

Linear Predictive Coding (LPC)

Linear predictive coding is a method initially developed to analyze and encode human speech signals. The LPC works by first estimating the formants (the spectral bands corresponding to the resonance frequencies of the human vocal tract), then performing inverse filtering to remove the effects of these formants from the speech signal, and finally estimating the intensity and frequency of the residue (the remaining signal after the subtraction). The result is a vector of values describing the intensity and frequency of the residue, the formants and the residue signal. This vector can be used to recreate speech: the intensity and frequency of the residue and the residue signal are used to create the source signal, the formants are used to create a filter, and speech is produced by running the source signal through the filter. A detailed explanation of the LPC can be found in [60].

The most important aspect of the LPC is that it allows a music sample to be approximated as a linear combination of previous samples. The unique set of predictor coefficients is determined by minimizing the sum of the squared differences between the actual signal and the predicted signal. Different approaches can be used for the minimization, such as the autocorrelation method, the covariance method and the lattice method. One common application of the LPC in music is identifying instrument types.

Common Audio Features Extracted in Frequency Domain

In order to extract features from a music audio signal in the frequency domain, the audio signal is first segmented into overlapping, very short analysis frames on a time scale of 10 to 40 milliseconds, over which the signal is considered stationary. The overlap step size is usually within the range of 5 to 20 milliseconds. Each of these analysis frames is then multiplied by a windowing function. The windowing function preserves the continuity of the first and last samples in an analysis frame and reduces the problem of spectral leakage, which refers to power being assigned to frequency components that are not actually present in the signal being analyzed. There are many windowing functions, but the most commonly used in music classification are the Hamming window and the Hann window. If the music signal in the tth frame is denoted

    x_t[n] = { x[n + h(t − 1)],  0 ≤ n ≤ N − 1
             { 0,                otherwise    (2.33)

where h is the hop size and N is the number of samples within a frame, then the signal after applying the windowing function is

    x_t^w[n] = x_t[n] w[n]    (2.34)

where, for the Hamming window,

    w[n] = 0.54 − 0.46 cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1    (2.35)

and for the Hann window,

    w[n] = 0.5 − 0.5 cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1    (2.36)
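For illustration, the two windows of (2.35) and (2.36) written out directly; this is a sketch with a dummy frame, and NumPy also ships the equivalent np.hamming and np.hanning helpers.

```python
import numpy as np

def hamming(N):
    """Hamming window of (2.35)."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def hann(N):
    """Hann window of (2.36)."""
    n = np.arange(N)
    return 0.5 - 0.5 * np.cos(2 * np.pi * n / (N - 1))

frame = np.ones(512)               # a dummy analysis frame x_t[n]
windowed = frame * hamming(512)    # x_t^w[n] = x_t[n] w[n], per (2.34)
print(windowed[0], windowed[255])  # tapered at the edges, near 1 in the middle
```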

Finally, after applying the windowing function, the fast Fourier transform (FFT), an optimized version of the DFT, is performed on each analysis frame to obtain the magnitude frequency response. A detailed discussion of the Fourier transform can be found in [61]. In the following, M_t[n] is the magnitude spectrum of the Fourier transform at frequency bin n, out of N bins, for Fourier analysis frame t.

Spectral Centroid (SC)

The spectral centroid (SC) is a measure of the spectral brightness of a music signal. Higher centroid values correspond to brighter textures with more high frequencies. The spectral centroid is usually used to characterize the timbre of musical instruments. It is defined as the center of gravity of the magnitude spectrum of the Fourier transform:

    SC_t = ( Σ_{n=0}^{N−1} M_t[n] · n ) / ( Σ_{n=0}^{N−1} M_t[n] )    (2.37)

Spectral Roll-off (SR)

The spectral roll-off measures the spectral shape and indicates how much of the energy is concentrated in the lower frequencies. It is usually used in speech analysis to differentiate between voiced and unvoiced speech. In music analysis, it is used as a feature to characterize the timbre of musical instruments. It is defined as the frequency value SR_t below which 85% of the magnitude distribution resides (any number can be used, but 85% is the typical value):

    Σ_{n=0}^{SR_t} M_t[n] = 0.85 Σ_{n=0}^{N−1} M_t[n]    (2.38)
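A hedged sketch of (2.37) and (2.38), applied to one Hamming-windowed frame of a hypothetical tone:

```python
import numpy as np

def spectral_centroid(M):
    """Spectral centroid (2.37): magnitude-weighted mean frequency bin."""
    n = np.arange(len(M))
    return np.sum(M * n) / np.sum(M)

def spectral_rolloff(M, fraction=0.85):
    """Smallest bin below which `fraction` of the magnitude sum resides, (2.38)."""
    c = np.cumsum(M)
    return int(np.searchsorted(c, fraction * c[-1]))

x = np.sin(2 * np.pi * 50 * np.arange(1024) / 1024.0)   # tone near bin 50
M = np.abs(np.fft.rfft(x * np.hamming(1024)))           # magnitude spectrum M_t[n]
print(spectral_centroid(M), spectral_rolloff(M))
```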

Spectral Flux (SF)

The spectral flux measures the amount of local spectral change in the signal. It is computed by calculating the change in the normalized magnitude spectrum, N_t[n], between successive frames:

$$SF_t = \sum_{n=0}^{N-1} \left( N_t[n] - N_{t-1}[n] \right)^2. \tag{2.39}$$

The spectral flux is another feature used for characterizing the timbre of musical instruments.
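A short sketch of (2.39) over a sequence of frames (illustrative only, not thesis code; the spectra are L2-normalized before differencing):

```python
# Sketch of the spectral flux between successive frames, Eq. (2.39).
import numpy as np

def spectral_flux(frames: np.ndarray) -> np.ndarray:
    """frames: 2-D array (num_frames, N) of windowed time-domain frames."""
    M = np.abs(np.fft.rfft(frames, axis=1))          # magnitude spectra
    Nt = M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-12)
    return np.sum(np.diff(Nt, axis=0) ** 2, axis=1)  # one value per frame pair
```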

Mel-Frequency Cepstral Coefficients (MFCC)

The Mel-frequency cepstral coefficients are among the most widely used features in both speech recognition and audio-based music classification. The MFCC takes into account the human perception sensitivity with respect to frequencies. It is computed as follows: (1) take the log-amplitude of the magnitude spectrum; (2) group and smooth the frequency bins according to the perceptually motivated Mel-frequency scaling; (3) apply the discrete cosine transform to de-correlate the resulting feature vectors. Typically, 13 coefficients are used in speech analysis. Tzanetakis and Cook [38] discovered that the first five coefficients provide the best performance in music genre classification. Further details of the MFCC are presented in [62]. The Mel-frequency cepstral coefficient is also a feature for timbre-based representation.

High-Level Features Extracted from Audio

Although low-level features are useful, in most cases they are not representative in applications where high-level music features such as pitch and rhythmic patterns are required. Extracting high-level information from audio signals is less straightforward and less accurate than from symbolic musical data. Nonetheless, under the assumption that imperfections in the extracted information can be averaged out in broad high-level representations, it is possible to derive some useful high-level information from audio signals. The two main high-level representations that can be constructed from audio signals are the pitch histogram and the beat histogram.

Pitch Histograms

Tzanetakis and Cook [38] proposed a technique for deriving pitch information from sound signals by constructing a variety of different pitch histograms. The pitch content feature detection algorithm employed to construct the pitch histograms is based on the multi-pitch detection algorithm described by Tolonen and Karjalainen [63]. In their algorithm, the sound signal is first decomposed into two frequency bands: below 1000 Hz and above 1000 Hz. Amplitude envelopes are then extracted for each frequency band. The envelope extraction is performed by applying half-wave rectification and low-pass filtering to the signals. The extracted envelopes are summed, and an enhanced autocorrelation function called the summary enhanced autocorrelation function (SACF) is then computed in order to reduce the effect of the integer multiples of the peak frequencies on the multiple pitch detection. The prominent peaks of the SACF are treated as the main pitches of a corresponding short segment of the sound signal. The three dominant peaks of the SACF are then accumulated into a pitch histogram (PH) over the entire sound file. Next, the frequencies corresponding to each histogram peak are converted to musical pitches such that each bin of the PH corresponds to a musical note with a specific pitch; for example, the musical note name A4 is equivalent to 440 Hz. The musical pitches are labeled using the MIDI note numbering scheme, where the conversion from frequency to MIDI note number can be performed using the following equation:

$$n = 12\log_2\left(\frac{f}{440}\right) + 69 \tag{2.40}$$

where f is the frequency in Hertz and n is the histogram bin (MIDI note number). There are two versions of the pitch histogram proposed in [38]: the unfolded pitch histogram (UPH) and the folded pitch histogram (FPH).

The UPH is constructed directly from (2.40). The FPH method discards the octave information of a note and groups notes according to pitch classes. In the FPH, the octave information of all notes is normalized to a single octave using the mapping equation:

$$c = n \bmod 12 \tag{2.41}$$

where c is the folded histogram bin (i.e. the pitch class or chroma value) and n is the unfolded histogram bin (MIDI note number). The main difference between the UPH and the FPH is that the unfolded pitch histogram contains information about the pitch range of a musical piece while the folded pitch histogram contains information regarding the pitch classes or harmonic content of the music. The FPH method is similar to the chroma-based representation employed in [64] for audio thumbnailing. A detailed explanation of the chroma and height dimensions of musical pitch can be found in [65], and the relation of musical scales to frequency is discussed in [66].

A variant of the FPH called the circle of fifths histogram is designed where adjacent histogram bins are spaced a fifth apart rather than a semitone as in the original FPH. The authors of [38] believed that the distances between the adjacent bins in the new variant are better suited for expressing tonal music relations (tonic-dominant) and that the extracted features result in better classification accuracy. The mapping from the original FPH to the new circle of fifths histogram can be achieved by

$$c' = (7 \times c) \bmod 12 \tag{2.42}$$

where c' is the new circle of fifths histogram bin after the mapping and c is the original folded histogram bin. The number 7 corresponds to the seven semitones of the music interval of a fifth.
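The three mappings in (2.40)-(2.42) are compact enough to show in a few lines of Python (an illustration written for this discussion, not code from the thesis):

```python
# Sketch of the pitch mappings in Eqs. (2.40)-(2.42).
import math

def midi_note(f: float) -> int:
    """Unfolded bin: MIDI note number, with A4 = 440 Hz = note 69."""
    return round(12 * math.log2(f / 440.0) + 69)

def pitch_class(n: int) -> int:
    """Folded bin: c = n mod 12, discarding octave information."""
    return n % 12

def circle_of_fifths(c: int) -> int:
    """Reordered bin: c' = (7 * c) mod 12, adjacent bins a fifth apart."""
    return (7 * c) % 12

n = midi_note(440.0)              # 69 (A4)
c = pitch_class(n)                # 9  (pitch class A)
print(n, c, circle_of_fifths(c))  # 69 9 3
```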

There are many useful features that can be calculated from the pitch histograms. For example, the difference between the lowest pitch and the highest pitch in a pitch histogram can indicate the pitch range. The bin label of the pitch class histogram with the highest amplitude may indicate the primary key of the piece, or at least the dominant. The interval between the two strongest pitches of the folded pitch class histogram can give an indication of the centrality of tonality in a piece.

Beat Histograms

A common automatic beat detector structure consists of signal decomposition into frequency bands using a filterbank, followed by an envelope extraction step and finally a periodicity detection algorithm to detect the lags at which the signal's envelope is most similar to itself. The process of beat detection is similar to pitch detection except that it is done on a larger time scale (approximately 0.5 to 1.5 seconds for beat detection compared to 2 to 50 milliseconds for pitch). The concept of constructing a histogram of time intervals between note onsets, which gives some overall information about the rhythmic patterns in the signal as a whole, was first promoted by Tzanetakis and Cook [38]. In their approach, the features used to represent the rhythmic structure of a piece of music are based on the most salient periodicities of the sound signal. Figure 2.3 shows the flow diagram of the construction of a beat histogram (BH) [38].

For constructing the beat histogram, the sound signal is first decomposed into a number of octave frequency bands using the discrete wavelet transform method. Then, the time domain amplitude envelope of each band is extracted by applying full-wave rectification, low-pass filtering and downsampling to each octave frequency band, followed by a mean removal. These envelopes are then summed together and the autocorrelation of the sum is computed. The dominant peaks of the autocorrelation function each correspond to the various periodicities of the signal's envelope. The peaks obtained are then accumulated over a whole sound file to build the beat histogram. The histogram bins in the BH each correspond to a peak lag, i.e. the beat period in beats-per-minute (bpm). When compiling the beat histogram, instead of adding one to a bin, the amplitude of each peak is added to the bin. With this method, if the signal is very similar to itself (strong beat), the histogram peaks will be higher. The equations used in each step of the beat analysis algorithm [38] are listed below. In the equations, x is the sound signal and n = 1, 2, ..., N, where N is the total number of samples in the signal.

Figure 2.3: The flow diagram of the construction of a beat histogram.

Full wave rectification

$$y[n] = \left| x[n] \right| \tag{2.43}$$

Full wave rectification is applied to extract the temporal envelope of the sound signal rather than the time domain signal.

Low pass filtering

$$y[n] = (1-\alpha)\, x[n] + \alpha\, y[n-1] \tag{2.44}$$

is a one-pole filter with an alpha value (α) of 0.99. It is used to smooth the envelope.

Downsampling

$$y[n] = x[kn] \tag{2.45}$$

where k = 16 is used in [38]. Due to the large periodicities of beat analysis, the objective of applying downsampling is to reduce the computation time of the autocorrelation without affecting the performance of the algorithm.

Mean removal

$$y[n] = x[n] - E[x[n]] \tag{2.46}$$

Mean removal is used to center the signal at zero for the autocorrelation stage.

Autocorrelation

$$y[\mathrm{lag}] = \frac{1}{N} \sum_{n=1}^{N} Y[n]\, Y[n - \mathrm{lag}] \tag{2.47}$$

where lag is the number of samples of delay. The autocorrelation is calculated for all integer values of lag, subject to 0 ≤ lag < N. Y is the outcome of pre-processing the sound signal, which includes the full wave rectification, low-pass filtering, downsampling and mean removal. Autocorrelation is a technique that involves comparing a signal with versions of itself delayed by successive intervals, which yields the relative strength of different periodicities within the signal. In music processing, autocorrelation allows one to find the relative strength of different rhythmic pulses. The calculation of the autocorrelation results in a histogram where each bin corresponds to a different lag time. Since the sampling rate of the signal is known, the histogram can provide an indication of the relative importance of the time intervals that pass between strong peaks.
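The per-band pre-processing chain and the autocorrelation in (2.43)-(2.47) can be sketched as follows (illustrative code written for this discussion, not from the thesis; the peak picking over the resulting autocorrelation is omitted):

```python
# Sketch of the beat-analysis pre-processing for one octave band,
# following Eqs. (2.43)-(2.47), with alpha = 0.99 and k = 16 as in [38].
import numpy as np

def beat_envelope_autocorr(band: np.ndarray, alpha: float = 0.99, k: int = 16):
    y = np.abs(band)                          # (2.43) full-wave rectification
    env = np.empty_like(y)                    # (2.44) one-pole low-pass filter
    prev = 0.0
    for i, v in enumerate(y):
        prev = (1 - alpha) * v + alpha * prev
        env[i] = prev
    env = env[::k]                            # (2.45) downsampling by k
    env = env - env.mean()                    # (2.46) mean removal
    N = len(env)                              # (2.47) autocorrelation over lags
    acf = np.array([np.dot(env[lag:], env[:N - lag]) / N for lag in range(N)])
    return acf                                # peaks index candidate beat lags
```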

High-level rhythmic information can be derived from a beat histogram. For example, the number of strong peaks can provide some measure of rhythmic sophistication. The periods of the highest peaks can provide good information about the tempo of an audio signal. The ratios between the highest peaks, in terms of both amplitude and period, can give metrical insights and an indication as to whether a signal is likely polyrhythmic or not. The sum of the histogram as a whole can give an indication of beat strength. The proportional collective strength of low-level bins can give an indication of the degree of rubato, or rhythmic looseness.

Symbolic Format

Musical information is represented in an essentially different way in symbolic musical file formats than in audio files. Unlike audio files, which store a digital approximation of the actual sound signals, symbolic files store higher-level notions about music rather than a direct representation of the sound. For example, an audio file will store an approximation of the actual sound waves produced by a singer singing the Jasmine Flower (a Jiangsu folk song), but a symbolic file will store information such as the absolute pitch of each note sung by the singer, the instrument used to produce the sound (in this case, the human voice) and the duration of each note.

The symbolic representation of music can exist in many formats: for example, the physical forms of written or printed sheet music, holes punched in player rolls and keypunched cards and, of course, the digital files of the modern age such as MIDI [67-68], Open Sound Control [69-71], GUIDO [72], Humdrum [73] and MusicXML [74-75]. A short overview of some commonly encountered symbolic music file formats is presented below. A good overview of symbolic formats (except MIDI) can be found in [36]. Dannenberg presented a useful survey of symbolic music representation in [76].

Symbolic Music File Formats

In general, digital symbolic music file formats can be divided into three broad categories: (1) formats for communicating performance information between controllers, computers and synthesizers; (2) formats for representing musical scores and associated visual formatting information; and (3) formats for facilitating theoretical and musicological analysis [77].

Formats for communicating information between controllers, computers and synthesizers

The most well-known format in this group is the MIDI (Musical Instrument Digital Interface) format [68]. MIDI is a technical standard that describes a set of protocols, digital interfaces and connectors that allow a wide variety of electronic musical instruments, computers and other related devices to connect and communicate with each other. Due to its popularity, a very large amount of music of many kinds is stored in this format. Subsequently, a large portion of the music classification research that employs symbolic music representation uses MIDI files as the research data set.

Open Sound Control (OSC), developed by Wright and Freed [69-71], is a successor of the MIDI format. It is a real-time, performance-oriented symbolic file format that is widely recognized as technically superior to its predecessor, the MIDI format. Some advantages of OSC include improved time resolution, explicit compatibility with modern networking technology and improved general flexibility.

Formats for representing musical scores and associated visual formatting information

The most commonly used file formats within this group are the file formats of the two leading score editing applications: Finale (.mus format) and Sibelius (.sib format). Both applications are commercial software and the details of the representation of the file formats are not published. One needs to purchase the software in order to read or write files in these formats. This limitation greatly reduces the research value of these file formats. Nonetheless, there are some research-oriented formats that can be used for representing musical scores. Two of the more well-known formats are GUIDO [72] and LilyPond [78]. Both of them are text-based formats.

MusicXML [74-75] is an XML-based file format for representing Western musical notation. Although the format is proprietary, it can be freely used under a Public License.

MusicXML has achieved relatively high popularity owing to its adoption by a variety of commercial and non-commercial music notation programs such as Finale, Sibelius, MuseScore (musescore.org), SmartScore, Steinberg Cubase and Rosegarden. MusicXML can serve as an intermediate file format for transferring data between .mus and .sib files.

Formats intended for facilitating theoretical and musicological analysis

The most prominent file formats in this category are the formats associated with the Humdrum Toolkit [79]. Among them, the most popular and most general file format is the **kern format [80]. Some of the many Humdrum file formats are designed to represent more specialized music types, such as the **bhatk format for transcribing Hindustani music, the **hildegard format for German neume manuscripts and the **koto format for koto (a traditional Japanese stringed musical instrument that is similar to the Chinese zheng) tablature. Humdrum also facilitates translation to and from MIDI data.

Benefits of Using Symbolic Music File Formats

To date, more music classification research has been performed using audio files than symbolic music files. This is largely due to the growth of commercial music information retrieval applications and users that are much more interested in processing audio files. In general, this body of research views music as a type of sound rather than in its other contexts beyond the sound perspective. Nonetheless, some examples of automatic music classification using symbolic data can be seen in [81-91].

On the other hand, musicologists and music theorists usually prefer symbolic musical representations. The main reason is that features extracted from audio data

generally have little intuitive meaning to humans. For example, features such as the zero-crossing rate and the Mel-frequency cepstral coefficients extracted over a sequence of audio windows, although useful for automatic music classification, are unlikely to give any insight or inspiration to music theorists. Conversely, features extracted from symbolic data, such as those related to the key and rhythm of a piece of music, are much more straightforward and meaningful to humans. These features often provide useful insights into music.

The main advantage of symbolic data is that it provides much more immediate and reliable access to musically meaningful information than audio data. Since the fundamental elements in symbolic files are usually a precise representation of musical notation while the fundamental elements in audio files are typically sound samples, it is much easier to extract high-level music information with high accuracy from symbolic files than from audio files. In addition, some symbolic file formats such as MIDI are usually more compact than audio recordings. This makes storing, processing and transmitting much faster and easier. Furthermore, it is much easier to correct and edit symbolic files than audio recordings, and the corrections can be made more accurately.

The existing optical music recognition techniques such as [92-94] and software such as SmartScore, OpenOMR (sourceforge.net/projects/openomr), SharpEye and Gamera (gamera.informatik.hsnr.de) allow printed or written scores to be processed into symbolic file formats from which music features can then be extracted. This is particularly useful for cases in which an audio recording of a music score does not exist or is hard to obtain. From a musicological perspective, it is better to use features extracted from music scores than from audio recordings as it eliminates potential performance biases and errors. This enables analysis to be based entirely on the artifact provided by the composer.

Symbolic Features Extracted from MIDI

Among the digital symbolic representations, MIDI is the most widely used format in automatic music classification research, largely owing to its popularity and hence the availability of data. MIDI, short for Musical Instrument Digital Interface, is an encoding system used to represent, transfer and store musical information. Information is represented as sequences of instructions called MIDI messages. Each MIDI message corresponds to either an event or a change in a control parameter. The details of MIDI and its specifications will not be covered in this thesis, but many books on MIDI are available; for example, [67] can be consulted for further details. Also, the official web site of the MIDI Manufacturers Association provides comprehensive information and documentation on MIDI.

An extensive list of features extracted from MIDI can be grouped into seven categories [84]: pitch based, melody based, chord based, rhythm based, instrumentation based, musical texture based and dynamics based. The derived features can then be compiled into relevant feature vectors for music classification.

Pitch Based Features

Three pitch histograms are constructed in [84] based on the technique proposed in [37-38]. The first histogram is the basic pitch histogram, which consists of 128 bins, one for each MIDI pitch. The magnitude of each bin corresponds to the number of times a Note On event occurred at a particular pitch. This histogram gives an insight into the range and spread of notes in a music piece.

The second histogram is the pitch class histogram, which has 12 bins, one for each of the twelve pitch classes. The magnitude of each bin corresponds to the number of times a Note On event occurred for a particular pitch class. This histogram gives insights into the types of scales used and the amount of transposition that was present.

The third histogram is the fifths pitch histogram, which consists of 12 bins. The bins are a reordered sequence of the bins in the pitch class histogram where the adjacent bins are a perfect fifth apart rather than a semitone apart.

The pitch based features proposed in [84] are the most common pitch prevalence, the most common pitch class prevalence, the relative strength of top pitches, the relative strength of top pitch classes, the interval between strongest pitches, the interval between strongest pitch classes, the number of common pitches, the pitch variety, the pitch class variety, the pitch range, the most common pitch, the primary register, the importance of bass register, the importance of middle register, the importance of higher register, the most common pitch class, the dominant spread, the strong tonal centres, the basic pitch histogram, the pitch class distribution, the fifths pitch histogram, the quality, the glissando prevalence, the average range of glissandos, the vibrato prevalence and the prevalence of micro-tones.

Melody Based Features

The pitch based features discussed above do not reflect information relating to the order in which pitches are played. The melody is a very important part of how humans reflect on the music that they hear. To capture this, statistics about melodic motion and intervals are used. A melodic interval histogram is proposed in [84] where each bin of the histogram is labeled with a number indicating the number of semitones separating sequentially adjacent notes in a given channel. The magnitude of each bin indicates the fraction of all melodic intervals that correspond to the melodic interval of the given bin. Features are then derived from this histogram (a small sketch of these histogram constructions follows the list below). The list of melody based features includes the melodic interval histogram, the average melodic interval, the most common melodic interval, the distance between most common melodic intervals, the most common melodic interval prevalence, the relative strength of most common intervals, the number of common melodic intervals, the amount of arpeggiation, the repeated notes, the chromatic motion, the stepwise motion, the melodic thirds, the melodic fifths, the melodic tritones, the melodic octaves, the embellishment, the direction of motion, the duration of melodic arcs, the size of melodic arcs and the melodic pitch variety.
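A rough sketch of these histogram constructions (an illustration written for this discussion, not the implementation of [84]; the flat list of Note On pitch numbers is a hypothetical simplification of real MIDI parsing):

```python
# Illustrative sketch: the basic (128-bin), pitch class (12-bin), fifths
# (12-bin) and melodic interval histograms from a hypothetical list of
# MIDI Note On pitch numbers for a single channel.
import numpy as np

def pitch_histograms(pitches):
    basic = np.zeros(128)
    for p in pitches:
        basic[p] += 1
    pitch_class = np.array([basic[c::12].sum() for c in range(12)])
    fifths = pitch_class[(7 * np.arange(12)) % 12]   # adjacent bins a fifth apart
    return basic, pitch_class, fifths

def melodic_interval_histogram(pitches, max_interval=24):
    """Fraction of melodic intervals (in semitones) between adjacent notes."""
    intervals = np.abs(np.diff(pitches))
    hist = np.bincount(np.minimum(intervals, max_interval),
                       minlength=max_interval + 1)
    return hist / max(len(intervals), 1)

notes = [60, 62, 64, 67, 69, 67, 64, 62, 60]  # a simple pentatonic phrase
basic, pc, fifths = pitch_histograms(notes)
print(melodic_interval_histogram(notes)[:5])
```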

Chord Based Features

Musical chords are created by having different notes played simultaneously. Some techniques of chord analysis presented in Rowe [95] are adopted to design the chord based features in [84]. Two histograms are constructed for the chord based features. The first histogram is the vertical interval histogram, which consists of bins labeled with different vertical intervals. The magnitude of each bin in the histogram is the sum of all vertical intervals that are sounded at each tick. The second histogram is the chord type histogram. In this histogram, each bin is labeled with one of the following types of chords: two pitch class chord, major triad, minor triad, other triad, diminished, augmented, dominant seventh, major seventh, minor seventh, other chord with four pitch classes and chord with more than four pitch classes.

The chord based features proposed are the vertical intervals, the chord types, the most common vertical interval, the second most common vertical interval, the distance between the two most common vertical intervals, the prevalence of the most common vertical interval, the prevalence of the second most common vertical interval, the ratio of prevalence of the two most common vertical intervals, the average number of simultaneous pitch classes, the variability of the number of simultaneous pitch classes, the minor major ratio, the perfect vertical intervals, the unisons, the vertical minor seconds, the vertical thirds, the vertical fifths, the vertical tritones, the vertical octaves, the vertical dissonance ratio, the partial chords, the minor major triad ratio, the standard triads, the diminished and augmented triads, the dominant seventh chords, the seventh chords, the complex chords, the non-standard chords and the chord duration.

Rhythm Based Features

Studies such as [95-98] emphasize that rhythm plays a very important role in many types of music. In defining the rhythm based features, a beat histogram is constructed using the technique proposed in [37-38]. However, instead of using the not-quite-accurate beat information derived from audio signals, the precise representation of beat information in

MIDI is employed to construct the beat histogram. The rhythm based features employed are then derived from the beat histogram.

The rhythm based features that are derived from the beat histogram include the strongest rhythmic pulse, the second strongest rhythmic pulse, the harmonicity of the two strongest rhythmic pulses, the strength of the strongest rhythmic pulse, the strength of the second strongest rhythmic pulse, the strength ratio of the two strongest rhythmic pulses, the combined strength of the two strongest rhythmic pulses, the number of strong pulses, the number of moderate pulses, the number of relatively strong pulses, the rhythmic looseness, the polyrhythms, the rhythmic variability and the beat histogram itself.

There are other rhythmic features that are not derived from the beat histogram. These are the note density, the note density variability, the average note duration, the variability of note duration, the maximum note duration, the minimum note duration, the staccato incidence, the average time between attacks, the variability of time between attacks, the average time between attacks for each voice, the average variability of time between attacks for each voice, the incidence of complete rests, the maximum complete rest duration, the average rest duration per voice, the average variability of rest duration across voices, the initial tempo, the initial time signature, the compound or simple meter, the triple meter, the quintuple meter and the change of meter.

Instrumentation Based Features

This group of features utilizes the capability of the General MIDI (level 1) specification, which allows recordings to make use of 128 pitched-instrument patches and a further 47 percussion instruments in the Percussion Key Map. The instrumentation based features that are proposed include the presence of pitched instruments, the presence of unpitched instruments, the note prevalence of pitched instruments, the note prevalence of unpitched instruments, the time prevalence of pitched instruments, the variability of note prevalence of pitched instruments, the variability of note prevalence of unpitched instruments, the number of pitched instruments, the number of unpitched instruments, the percussion prevalence, the string keyboard fraction, the acoustic guitar fraction, the electric guitar fraction, the violin fraction, the saxophone fraction, the brass fraction, the

woodwinds fraction, the orchestral strings fraction, the string ensemble fraction and the electric instrument fraction.

Musical Texture Based Features

The musical texture based features make use of the fact that MIDI notes can be assigned to different channels and to different tracks, thus making it possible to segregate the notes belonging to different voices. The texture related features include the maximum number of independent voices, the average number of independent voices, the variability of the number of independent voices, the voice equality number of notes, the voice equality note duration, the voice equality dynamics, the voice equality melodic leaps, the voice equality range, the importance of the loudest voice, the relative range of the loudest voice, the relative range isolation of the loudest voice, the range of the highest line, the relative note density of the highest line, the relative note durations of the lowest line, the melodic intervals in the lowest line, the simultaneity, the variability of simultaneity, the voice overlap, the parallel motion and the voice separation.

Dynamic Based Features

In music, the dynamics usually refer to the loudness of a piece. In [84], the dynamic refers to the velocity values scaled by volume channel messages:

$$\text{note dynamic} = \text{note velocity} \times \frac{\text{channel volume}}{127}. \tag{2.48}$$

All dynamic features in [84] use relative measures rather than absolute measures, as the default volume and velocity values set by different sequencers vary. The list of dynamic based features includes the overall dynamic range, the variation of dynamics, the variation of dynamics in each voice and the average note to note dynamics change.
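A one-line computation, shown here for concreteness (illustrative code, not from [84]):

```python
# Sketch of the relative note dynamic of Eq. (2.48): note-on velocity
# (0-127) scaled by the current channel volume (0-127).
def note_dynamic(velocity: int, channel_volume: int) -> float:
    return velocity * channel_volume / 127.0

print(note_dynamic(100, 90))  # about 70.9 on the 0-127 scale
```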

It can be seen in [84] that an extensive list of potential features can be extracted from MIDI data. However, not all of them are applicable to the various existing types of music. In addition, depending on the author of the MIDI files, some data necessary to derive certain features might not be available in the files. Apart from that, not all symbolic file formats are as compact as MIDI files. Hence, depending on the application of the task and the file formats available, a varied number and variety of features will be derived and employed in the research. In short, there is no standard list of features; the choice of features is dependent on the availability and capability of the data and the characteristics of the music under investigation. However, in any case, the use of pitch and rhythmic information is inevitable.

2.3 Discussion

The two main components of folk song classification are the machine classifier and the music encoding. In this thesis, the single-hidden layer feedforward neural network is employed as the machine classifier for the folk song classification research. However, neither the gradient descent-based learning algorithm nor the discriminant-based learning algorithm is employed for the SLFN. Instead, a superior technique that has been proven capable of overcoming the drawbacks of both types of algorithms, called the extreme learning machine, is employed, and its performance in multi-class classification tasks, particularly folk song classification, is examined and verified. Nevertheless, the classification performance of the gradient descent-based learning algorithm and the support vector machine will be included for comparison.

As mentioned in Chapter 1, this thesis intends to contribute a part in preserving and sustaining the art of Chinese folk songs. Hence, during the data collection, efforts were made to obtain data in a format that facilitates ethnomusicological analysis. Since folk songs in the Essen Folksong Collection [1] are documented using the kern representation, this database is employed for achieving such a purpose.

It is to be noted that many symbolic music formats can be easily converted into audio formats. There are many commercial converters for converting MIDI to MP3 or MIDI to WAV. In addition, symbolic formats other than MIDI usually provide tools for conversion into standard MIDI files. For example, hum2mid (extra.humdrum.org/man/hum2mid/) is a program for converting kern files into standard MIDI files.

Preliminary investigations were performed to examine the performance of folk song classification using an audio representation technique. The results are recorded in the Appendix. The overall result for the research using the audio representation is below average. This preliminary research suggests that a more appropriate and efficient music representation technique is required for encoding the Han Chinese folk songs for machine classification.

Chapter 3 Music Representation and the Musical Feature Density Map

As presented in Chapter 2, there are two main approaches to music representation: audio and symbolic. In most cases, the choice of the music representation is greatly dependent on the format of the available data set. In this thesis, the Essen Folksong Collection [1] is employed as the data set for the research. Folk songs in this database are recorded in the kern format. Hence, in this thesis, discussions are framed around the symbolic approach to music representation, i.e. from a high-level, musicological point of view instead of the low-level audio signal view.

This chapter begins with a discussion of the ethnomusicology background of the research topic, geographical based Han Chinese folk song classification, followed by a discussion of the music database employed for the research. Then, the music elements employed to define the different classes of Han Chinese folk songs are discussed. Finally, a novel encoding method called the musical feature density map (MFDMap) is proposed for encoding useful music information. The MFDMap is designed to incorporate the ethnomusicology theory into the structure of the music feature vector.

3.1 Ethnomusicology Background on Geographical Based Han Chinese Folk Song Classification

The Han Chinese is an ethnic group native to East Asia and is the largest ethnic group in China. Most scholars use the general word Chinese to refer to the Han Chinese. However, there is considerable linguistic, customary and social diversity among the subgroups of the Han, mostly due to historical events, geographical conditions and the assimilation of various regional ethnicities and tribes. As discussed in Chapter 1, folk songs are an important part of traditional Chinese music. They reflect the ideals and emotions of the common people and illustrate their customs and social life over thousands of years of Chinese history.

There are two major classification systems for the study of Han folk songs [99]: (i) according to the place of origin (geographical based) and (ii) according to the occasion on which they are sung (song type based). The study of the classification of Han folk songs according to geographical factors falls into the first group. It was pioneered by two prominent ethnomusicologists, Jing Miao and Jianzhong Qiao [100], in the late 1980s. As a whole, geographical factors such as the environment, weather and landscape structure determine the social and economic activities. Subsequently, the social and economic structures influence the development trends and characteristics of the cultures. Naturally, these cultural elements are then reflected in the folk songs. Therefore, geographical based classification of Han folk songs is not a meaningless and unproven task.

In their research [100], Miao and Qiao suggest that due to factors such as intermarriage, social exchanges and business communications, the cultural practices in neighbouring regions are usually very similar and closely related. Therefore, there are many similar music elements exhibited in the folk songs originating from these places. This suggests that these closely related regions should be grouped instead of being viewed individually. However, there is usually no single way of drawing the boundaries. Depending on the level of understanding, the point of view and the amount of available information, Han folk songs can be partitioned into any number of categories. For example, from the world viewpoint, Han folk songs are placed in the

eastern group (as opposed to the western). If the viewpoint is narrowed down to just the Han culture, these folk songs can be broadly placed into any other number of categories. If the viewpoint is to be further narrowed down (to the maximum extent), each folk song is unique and has a unique texture and style. In other words, any fashion of division has its relativity and should be used as a reference instead of an ultimate convention.

In addition, Miao and Qiao [100] also emphasized that the features used to identify and define a particular class label might not be universally demonstrated in all songs that originate from that class. In many cases, as a result of population migration, social transformations, revolutions, wars, social exchanges and other historical factors, some folk songs were mutated, propagated or migrated. During the process, these folk songs lost some or all of their originality while adapting to other influences. Hence, in the classification of folk songs according to the geographical region of origin, it is fairly common to have candidates that do not exhibit characteristics similar to others in the same class, or folk songs that exhibit characteristics that belong to more than one class. Miao and Qiao highlighted that, in many cases, the research outcomes can only be applied to the typical examples. As a result, the features used to define a class in geographical-based folk song classification can only be an approximation.

In order to derive useful and meaningful attributes to describe each geographical class, it is important to understand the factors that contribute to the forming of the musical style in the folk songs. As mentioned previously, folk songs reflect the culture of the people; hence it is important to understand the culture of a population in a certain geographical region. It is commonly said that human civilization originated around river basins. In their study [100], Miao and Qiao indicate that the geographical structure of the Han folk song culture can be divided into two broad regions, the north and the south, each of which is closely associated with one of the two main rivers in China: the Huang He (黄河, Yellow River) in the north and the Chang Jiang (长江, Yangtze River) in the south. In the north, the Huang He basin is further divided into two parts: the plains in the east and the plateaus in the west. Due to a more complex geographical structure, the south region is divided into more chunks. The Chang Jiang basin is divided into three regions: east, center and west. The east is mainly plains, the center regions comprise mountainous areas, hilly areas and lake areas, and the west is mainly plateaus. There is

another significant river in the south called the Zhu Jiang (珠江, Pearl River) that also plays an important role in forming the folk song culture. Besides those regions that are situated on the river basins, the regions that are situated between them are classified as the transitional areas. A map of the three main rivers is shown in Figure 3.1.

Figure 3.1: Map of the three main rivers: the Yellow River, the Yangtze River and the Pearl River.

The east of the Huang He basin covers regions such as Hebei, Shandong, Liaoning, Jilin and Heilongjiang. These regions have fertile land, rich natural resources, convenient transportation and a prosperous economy. Agriculture, forestry, animal husbandry and fishery are active in these regions. The natural environments and the various economic activities create diversity in the society of these regions. In addition, the active trading that happened in these regions gives rise to external influences on their folk songs. There are substantial numbers of folk songs in these regions that were dispersed from the northwest and southwest regions. The common forms of folk songs in these regions are xiaodiao (ditty) and haozi (work song). Xiaodiao is usually

sung as a form of entertainment or as folk art performances. The melody is usually well organized and very decorative. Haozi are sung during collective physical work. They usually have fast and powerful rhythms that synchronize with the movements of the laborers. Folk songs from these regions usually have intervallic jumps of a fifth, sixth or seventh.

The northwest regions (west of the Huang He basin), such as Shaanxi, Shanxi, Ningxia and Gansu, have large areas covered by the Loess Plateau (also known as the Huangtu Plateau). These regions are sparsely populated, have many gullies and ravines, and have many areas that are difficult to access. Unlike the northeast regions, the land here is not suitable for agriculture and people have to travel to other regions for jobs. Also, the land structure leads to problems in building a good transportation system. The means of transportation are the horse and the donkey. This gives rise to a unique social class, the porter, who is responsible for transporting the local products for trading. The jobs oblige the people to travel long distances through rugged and remote mountain roads. While traveling, they sing songs to relieve tiredness and also as a form of self-entertainment. These songs usually have free rhythm and a bold, unrestrained, dark and long-drawn-out texture along with a hint of misery and gloom. This style of folk song is usually unique to the plateau land structure and very uncommon in plains and watery regions such as those in the east of the Chang Jiang basin. The common form of folk songs is the shange (mountain song). Shange are songs sung in open areas like the mountains or open fields. Some shange are sung while working but, unlike haozi, the associated physical movements are usually minimal and less intense. The interval distances of a fourth (especially the perfect fourth) and a second (especially the major second) are the common representatives of the style of folk songs in these regions. They can effectively express the dreary, desolate and sorrowful mood of the plateau.

The southwest regions, including Sichuan, Guizhou, Yunnan and the northwest part of Guangxi, are where the majority of the Han people reside. These regions are located on the western part of the Chang Jiang basin and have a land structure similar to the northwest regions, which is mainly plateaus. Unlike the dry and windy climate in the northwest, these southwest regions fall in the temperate and subtropical climate zones with sufficient rainfall throughout the year. Rice is one of the main crops in these regions. The most popular form of folk songs in these regions is the shange. Most

shange from these southwest regions are lyrical. Some of them are love songs and many of the lyrics in these songs include words that picture beautiful scenes of the villages and landscapes. Chuanfu haozi (boatman work song) is also very common in Sichuan. Gewu xiaodiao (dancing ditty) is popular in Yunnan and Guizhou. Folk songs in these regions usually have a small pitch range and small intervals. It should be noted that since ancient times, these regions in the southwest of China were populated by peoples from many different ethnic groups. The influence of non-Han materials in the Han folk songs is bound to be common. In addition, some folk songs are commonly shared among the Han and non-Han peoples.

The regions around the Zhu Jiang basin include the majority of Guangdong (except the non-Han areas), the southern part of Guangxi and Hainan. The climate here belongs to the subtropical zone. These regions are surrounded by islands and harbours on the southern part. Fishery is very active and many forms of folk songs are common in these regions: gewu xiaodiao, yuge (fishermen song), shange, haozi and xiaodiao. The folk songs in these regions focus a lot on the life of the fishermen and farmers (the two main occupations in these regions). The pitch range used in folk songs originating from these regions is usually slightly more than an octave. Sol and re are fairly commonly used, and interval distances of a fifth, sixth and seventh are common among folk songs in these regions.

The southeast regions such as Jiangsu, Zhejiang and Anhui are on the plain region of the Chang Jiang basin. These regions have a mild climate, rich resources and adequate rainfall, and are a suitable area for growing rice. Many forms of folk song circulate around these regions. Among them are tiange (farm field song), xiaodiao, haozi, yuge, shange and chage (tea song). Tiange is usually sung by farmers when working in the rice fields to create a lively atmosphere and to make the work less tiresome. Similarly, chage is sung during tea-picking. Xiaodiao is the most popular and most representative form of folk songs from these regions, and has great influence on folk songs in other regions of China. Folk songs in the southeast usually proceed in stepwise movement. It is also a common feature to insert a big interval in the stepwise progression of folk songs. The interval is usually a minor sixth (especially mi to do) or a perfect octave. Most folk songs in these regions, especially Jiangsu, follow closely the pentatonic scale (i.e. fa and ti rarely occur).

This section does not include all regions in China. The regions that are left out are mainly within the transitional zone, which is not within the focus of this thesis. A thorough analysis and discussion of geographical based classification of folk songs is presented in [100].

Rationale for the Choice of the Five Classes

This thesis focuses on Han folk songs from five classes: Dongbei (东北, comprising Liaoning, Jilin and Heilongjiang), Shanxi (山西), Sichuan (四川), Guangdong (广东) and Jiangsu (江苏). A few factors were taken into consideration when selecting the classes for the research.

1. The five classes selected are all within the main regions of the folk song culture highlighted in the previous section and also in [100]. Dongbei is part of the plains located east of the Huang He basin while Shanxi is in the west of the Huang He basin. Sichuan is on the plateau in the west of the Chang Jiang basin and Jiangsu, on the other hand, is in the east. Finally, Guangdong is located on the Zhu Jiang basin. These classes are highlighted in Figure 3.2.

2. In [100], the authors point out that folk songs from neighbouring regions usually possess similar characteristics and texture. This is generally due to the similar customs, social structures and practices, and other cultural activities that are shared among people in those areas. These similarities usually result from communications and social exchanges among the people. However, mountains and rivers usually act as natural barriers that break off communication and hence naturally encourage the growth of different cultures. The five classes selected are geographically reasonably far apart from each other. Hence, it is practical to categorize them as separate classes. However, as mentioned earlier, the migration of people and the propagation of popular folk tunes are still a concern causing similarity between folk songs from different regions. In other words, although each of these five classes can be regarded as geographically

separate, it is unavoidable to have some folk songs related to more than one class.

Figure 3.2: Map of the regions in China with the five classes studied in this thesis highlighted.

3. Another concern when selecting the five classes for the research is the size of the sample data. When there is more than one choice, the region with the largest number of samples is employed. For example, Shanxi, Gansu, Ningxia and Shaanxi are all located within the western part of the Huang He basin and are all considered as having a similar folk song colour in [100]. Hence, when deciding the candidate region for research, the region with the largest data sample is used. It is important to note that, even though these regions fall within the same colour area, each of them still possesses differences within itself. In other words, they are similar on a broader perspective but dissimilar in the


More information

Laboratory Assignment 3. Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB

Laboratory Assignment 3. Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB Laboratory Assignment 3 Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB PURPOSE In this laboratory assignment, you will use MATLAB to synthesize the audio tones that make up a well-known

More information

LSTM Neural Style Transfer in Music Using Computational Musicology

LSTM Neural Style Transfer in Music Using Computational Musicology LSTM Neural Style Transfer in Music Using Computational Musicology Jett Oristaglio Dartmouth College, June 4 2017 1. Introduction In the 2016 paper A Neural Algorithm of Artistic Style, Gatys et al. discovered

More information

Outline. Why do we classify? Audio Classification

Outline. Why do we classify? Audio Classification Outline Introduction Music Information Retrieval Classification Process Steps Pitch Histograms Multiple Pitch Detection Algorithm Musical Genre Classification Implementation Future Work Why do we classify

More information

Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis

Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis Fengyan Wu fengyanyy@163.com Shutao Sun stsun@cuc.edu.cn Weiyao Xue Wyxue_std@163.com Abstract Automatic extraction of

More information

Neural Network for Music Instrument Identi cation

Neural Network for Music Instrument Identi cation Neural Network for Music Instrument Identi cation Zhiwen Zhang(MSE), Hanze Tu(CCRMA), Yuan Li(CCRMA) SUN ID: zhiwen, hanze, yuanli92 Abstract - In the context of music, instrument identi cation would contribute

More information

Distortion Analysis Of Tamil Language Characters Recognition

Distortion Analysis Of Tamil Language Characters Recognition www.ijcsi.org 390 Distortion Analysis Of Tamil Language Characters Recognition Gowri.N 1, R. Bhaskaran 2, 1. T.B.A.K. College for Women, Kilakarai, 2. School Of Mathematics, Madurai Kamaraj University,

More information

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Aric Bartle (abartle@stanford.edu) December 14, 2012 1 Background The field of composer recognition has

More information

Feature-Based Analysis of Haydn String Quartets

Feature-Based Analysis of Haydn String Quartets Feature-Based Analysis of Haydn String Quartets Lawson Wong 5/5/2 Introduction When listening to multi-movement works, amateur listeners have almost certainly asked the following situation : Am I still

More information

DIGITAL COMMUNICATION

DIGITAL COMMUNICATION 10EC61 DIGITAL COMMUNICATION UNIT 3 OUTLINE Waveform coding techniques (continued), DPCM, DM, applications. Base-Band Shaping for Data Transmission Discrete PAM signals, power spectra of discrete PAM signals.

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox Final Project (EECS 94) knoxm@eecs.berkeley.edu December 1, 006 1 Introduction Laughter is a powerful cue in communication. It communicates to listeners the emotional

More information

Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video

Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video Mohamed Hassan, Taha Landolsi, Husameldin Mukhtar, and Tamer Shanableh College of Engineering American

More information

TERRESTRIAL broadcasting of digital television (DTV)

TERRESTRIAL broadcasting of digital television (DTV) IEEE TRANSACTIONS ON BROADCASTING, VOL 51, NO 1, MARCH 2005 133 Fast Initialization of Equalizers for VSB-Based DTV Transceivers in Multipath Channel Jong-Moon Kim and Yong-Hwan Lee Abstract This paper

More information

CSC475 Music Information Retrieval

CSC475 Music Information Retrieval CSC475 Music Information Retrieval Monophonic pitch extraction George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 32 Table of Contents I 1 Motivation and Terminology 2 Psychacoustics 3 F0

More information

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS Andrew N. Robertson, Mark D. Plumbley Centre for Digital Music

More information

Music Genre Classification and Variance Comparison on Number of Genres

Music Genre Classification and Variance Comparison on Number of Genres Music Genre Classification and Variance Comparison on Number of Genres Miguel Francisco, miguelf@stanford.edu Dong Myung Kim, dmk8265@stanford.edu 1 Abstract In this project we apply machine learning techniques

More information

MindMouse. This project is written in C++ and uses the following Libraries: LibSvm, kissfft, BOOST File System, and Emotiv Research Edition SDK.

MindMouse. This project is written in C++ and uses the following Libraries: LibSvm, kissfft, BOOST File System, and Emotiv Research Edition SDK. Andrew Robbins MindMouse Project Description: MindMouse is an application that interfaces the user s mind with the computer s mouse functionality. The hardware that is required for MindMouse is the Emotiv

More information

2. Problem formulation

2. Problem formulation Artificial Neural Networks in the Automatic License Plate Recognition. Ascencio López José Ignacio, Ramírez Martínez José María Facultad de Ciencias Universidad Autónoma de Baja California Km. 103 Carretera

More information

Analysis of local and global timing and pitch change in ordinary

Analysis of local and global timing and pitch change in ordinary Alma Mater Studiorum University of Bologna, August -6 6 Analysis of local and global timing and pitch change in ordinary melodies Roger Watt Dept. of Psychology, University of Stirling, Scotland r.j.watt@stirling.ac.uk

More information

Color Image Compression Using Colorization Based On Coding Technique

Color Image Compression Using Colorization Based On Coding Technique Color Image Compression Using Colorization Based On Coding Technique D.P.Kawade 1, Prof. S.N.Rawat 2 1,2 Department of Electronics and Telecommunication, Bhivarabai Sawant Institute of Technology and Research

More information

RECOMMENDATION ITU-R BT (Questions ITU-R 25/11, ITU-R 60/11 and ITU-R 61/11)

RECOMMENDATION ITU-R BT (Questions ITU-R 25/11, ITU-R 60/11 and ITU-R 61/11) Rec. ITU-R BT.61-4 1 SECTION 11B: DIGITAL TELEVISION RECOMMENDATION ITU-R BT.61-4 Rec. ITU-R BT.61-4 ENCODING PARAMETERS OF DIGITAL TELEVISION FOR STUDIOS (Questions ITU-R 25/11, ITU-R 6/11 and ITU-R 61/11)

More information

Research Article. ISSN (Print) *Corresponding author Shireen Fathima

Research Article. ISSN (Print) *Corresponding author Shireen Fathima Scholars Journal of Engineering and Technology (SJET) Sch. J. Eng. Tech., 2014; 2(4C):613-620 Scholars Academic and Scientific Publisher (An International Publisher for Academic and Scientific Resources)

More information

CHAPTER-9 DEVELOPMENT OF MODEL USING ANFIS

CHAPTER-9 DEVELOPMENT OF MODEL USING ANFIS CHAPTER-9 DEVELOPMENT OF MODEL USING ANFIS 9.1 Introduction The acronym ANFIS derives its name from adaptive neuro-fuzzy inference system. It is an adaptive network, a network of nodes and directional

More information

A Music Retrieval System Using Melody and Lyric

A Music Retrieval System Using Melody and Lyric 202 IEEE International Conference on Multimedia and Expo Workshops A Music Retrieval System Using Melody and Lyric Zhiyuan Guo, Qiang Wang, Gang Liu, Jun Guo, Yueming Lu 2 Pattern Recognition and Intelligent

More information

DISTRIBUTION STATEMENT A 7001Ö

DISTRIBUTION STATEMENT A 7001Ö Serial Number 09/678.881 Filing Date 4 October 2000 Inventor Robert C. Higgins NOTICE The above identified patent application is available for licensing. Requests for information should be addressed to:

More information

Introduction to Signal Processing D R. T A R E K T U T U N J I P H I L A D E L P H I A U N I V E R S I T Y

Introduction to Signal Processing D R. T A R E K T U T U N J I P H I L A D E L P H I A U N I V E R S I T Y Introduction to Signal Processing D R. T A R E K T U T U N J I P H I L A D E L P H I A U N I V E R S I T Y 2 0 1 4 What is a Signal? A physical quantity that varies with time, frequency, space, or any

More information

Predicting the immediate future with Recurrent Neural Networks: Pre-training and Applications

Predicting the immediate future with Recurrent Neural Networks: Pre-training and Applications Predicting the immediate future with Recurrent Neural Networks: Pre-training and Applications Introduction Brandon Richardson December 16, 2011 Research preformed from the last 5 years has shown that the

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox 1803707 knoxm@eecs.berkeley.edu December 1, 006 Abstract We built a system to automatically detect laughter from acoustic features of audio. To implement the system,

More information

EE391 Special Report (Spring 2005) Automatic Chord Recognition Using A Summary Autocorrelation Function

EE391 Special Report (Spring 2005) Automatic Chord Recognition Using A Summary Autocorrelation Function EE391 Special Report (Spring 25) Automatic Chord Recognition Using A Summary Autocorrelation Function Advisor: Professor Julius Smith Kyogu Lee Center for Computer Research in Music and Acoustics (CCRMA)

More information

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed,

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed, VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS O. Javed, S. Khan, Z. Rasheed, M.Shah {ojaved, khan, zrasheed, shah}@cs.ucf.edu Computer Vision Lab School of Electrical Engineering and Computer

More information

Using Deep Learning to Annotate Karaoke Songs

Using Deep Learning to Annotate Karaoke Songs Distributed Computing Using Deep Learning to Annotate Karaoke Songs Semester Thesis Juliette Faille faillej@student.ethz.ch Distributed Computing Group Computer Engineering and Networks Laboratory ETH

More information

Efficient Implementation of Neural Network Deinterlacing

Efficient Implementation of Neural Network Deinterlacing Efficient Implementation of Neural Network Deinterlacing Guiwon Seo, Hyunsoo Choi and Chulhee Lee Dept. Electrical and Electronic Engineering, Yonsei University 34 Shinchon-dong Seodeamun-gu, Seoul -749,

More information

Release Year Prediction for Songs

Release Year Prediction for Songs Release Year Prediction for Songs [CSE 258 Assignment 2] Ruyu Tan University of California San Diego PID: A53099216 rut003@ucsd.edu Jiaying Liu University of California San Diego PID: A53107720 jil672@ucsd.edu

More information

Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors

Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors Classification of Musical Instruments sounds by Using MFCC and Timbral Audio Descriptors Priyanka S. Jadhav M.E. (Computer Engineering) G. H. Raisoni College of Engg. & Mgmt. Wagholi, Pune, India E-mail:

More information

Analysis, Synthesis, and Perception of Musical Sounds

Analysis, Synthesis, and Perception of Musical Sounds Analysis, Synthesis, and Perception of Musical Sounds The Sound of Music James W. Beauchamp Editor University of Illinois at Urbana, USA 4y Springer Contents Preface Acknowledgments vii xv 1. Analysis

More information

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION ULAŞ BAĞCI AND ENGIN ERZIN arxiv:0907.3220v1 [cs.sd] 18 Jul 2009 ABSTRACT. Music genre classification is an essential tool for

More information

Indexing local features. Wed March 30 Prof. Kristen Grauman UT-Austin

Indexing local features. Wed March 30 Prof. Kristen Grauman UT-Austin Indexing local features Wed March 30 Prof. Kristen Grauman UT-Austin Matching local features Kristen Grauman Matching local features? Image 1 Image 2 To generate candidate matches, find patches that have

More information

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Kazuyoshi Yoshii, Masataka Goto and Hiroshi G. Okuno Department of Intelligence Science and Technology National

More information

Gender and Age Estimation from Synthetic Face Images with Hierarchical Slow Feature Analysis

Gender and Age Estimation from Synthetic Face Images with Hierarchical Slow Feature Analysis Gender and Age Estimation from Synthetic Face Images with Hierarchical Slow Feature Analysis Alberto N. Escalante B. and Laurenz Wiskott Institut für Neuroinformatik, Ruhr-University of Bochum, Germany,

More information

Objectives. Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath

Objectives. Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath Objectives Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath In the previous chapters we have studied how to develop a specification from a given application, and

More information

1) New Paths to New Machine Learning Science. 2) How an Unruly Mob Almost Stole. Jeff Howbert University of Washington

1) New Paths to New Machine Learning Science. 2) How an Unruly Mob Almost Stole. Jeff Howbert University of Washington 1) New Paths to New Machine Learning Science 2) How an Unruly Mob Almost Stole the Grand Prize at the Last Moment Jeff Howbert University of Washington February 4, 2014 Netflix Viewing Recommendations

More information

Automatic Music Clustering using Audio Attributes

Automatic Music Clustering using Audio Attributes Automatic Music Clustering using Audio Attributes Abhishek Sen BTech (Electronics) Veermata Jijabai Technological Institute (VJTI), Mumbai, India abhishekpsen@gmail.com Abstract Music brings people together,

More information

Enabling editors through machine learning

Enabling editors through machine learning Meta Follow Meta is an AI company that provides academics & innovation-driven companies with powerful views of t Dec 9, 2016 9 min read Enabling editors through machine learning Examining the data science

More information

About Giovanni De Poli. What is Model. Introduction. di Poli: Methodologies for Expressive Modeling of/for Music Performance

About Giovanni De Poli. What is Model. Introduction. di Poli: Methodologies for Expressive Modeling of/for Music Performance Methodologies for Expressiveness Modeling of and for Music Performance by Giovanni De Poli Center of Computational Sonology, Department of Information Engineering, University of Padova, Padova, Italy About

More information

Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes

Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes hello Jay Biernat Third author University of Rochester University of Rochester Affiliation3 words jbiernat@ur.rochester.edu author3@ismir.edu

More information

Noise (Music) Composition Using Classification Algorithms Peter Wang (pwang01) December 15, 2017

Noise (Music) Composition Using Classification Algorithms Peter Wang (pwang01) December 15, 2017 Noise (Music) Composition Using Classification Algorithms Peter Wang (pwang01) December 15, 2017 Background Abstract I attempted a solution at using machine learning to compose music given a large corpus

More information

Error Resilience for Compressed Sensing with Multiple-Channel Transmission

Error Resilience for Compressed Sensing with Multiple-Channel Transmission Journal of Information Hiding and Multimedia Signal Processing c 2015 ISSN 2073-4212 Ubiquitous International Volume 6, Number 5, September 2015 Error Resilience for Compressed Sensing with Multiple-Channel

More information

Adaptive decoding of convolutional codes

Adaptive decoding of convolutional codes Adv. Radio Sci., 5, 29 214, 27 www.adv-radio-sci.net/5/29/27/ Author(s) 27. This work is licensed under a Creative Commons License. Advances in Radio Science Adaptive decoding of convolutional codes K.

More information

Melodic Pattern Segmentation of Polyphonic Music as a Set Partitioning Problem

Melodic Pattern Segmentation of Polyphonic Music as a Set Partitioning Problem Melodic Pattern Segmentation of Polyphonic Music as a Set Partitioning Problem Tsubasa Tanaka and Koichi Fujii Abstract In polyphonic music, melodic patterns (motifs) are frequently imitated or repeated,

More information

Inverse Filtering by Signal Reconstruction from Phase. Megan M. Fuller

Inverse Filtering by Signal Reconstruction from Phase. Megan M. Fuller Inverse Filtering by Signal Reconstruction from Phase by Megan M. Fuller B.S. Electrical Engineering Brigham Young University, 2012 Submitted to the Department of Electrical Engineering and Computer Science

More information

Soft Computing Approach To Automatic Test Pattern Generation For Sequential Vlsi Circuit

Soft Computing Approach To Automatic Test Pattern Generation For Sequential Vlsi Circuit Soft Computing Approach To Automatic Test Pattern Generation For Sequential Vlsi Circuit Monalisa Mohanty 1, S.N.Patanaik 2 1 Lecturer,DRIEMS,Cuttack, 2 Prof.,HOD,ENTC, DRIEMS,Cuttack 1 mohanty_monalisa@yahoo.co.in,

More information

RECOMMENDATION ITU-R BT Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios

RECOMMENDATION ITU-R BT Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios ec. ITU- T.61-6 1 COMMNATION ITU- T.61-6 Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios (Question ITU- 1/6) (1982-1986-199-1992-1994-1995-27) Scope

More information

Multirate Digital Signal Processing

Multirate Digital Signal Processing Multirate Digital Signal Processing Contents 1) What is multirate DSP? 2) Downsampling and Decimation 3) Upsampling and Interpolation 4) FIR filters 5) IIR filters a) Direct form filter b) Cascaded form

More information

Algorithmic Music Composition

Algorithmic Music Composition Algorithmic Music Composition MUS-15 Jan Dreier July 6, 2015 1 Introduction The goal of algorithmic music composition is to automate the process of creating music. One wants to create pleasant music without

More information

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University Week 14 Query-by-Humming and Music Fingerprinting Roger B. Dannenberg Professor of Computer Science, Art and Music Overview n Melody-Based Retrieval n Audio-Score Alignment n Music Fingerprinting 2 Metadata-based

More information

Design of Fault Coverage Test Pattern Generator Using LFSR

Design of Fault Coverage Test Pattern Generator Using LFSR Design of Fault Coverage Test Pattern Generator Using LFSR B.Saritha M.Tech Student, Department of ECE, Dhruva Institue of Engineering & Technology. Abstract: A new fault coverage test pattern generator

More information

THE importance of music content analysis for musical

THE importance of music content analysis for musical IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2007 333 Drum Sound Recognition for Polyphonic Audio Signals by Adaptation and Matching of Spectrogram Templates With

More information

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Jana Eggink and Guy J. Brown Department of Computer Science, University of Sheffield Regent Court, 11

More information

Automatic LP Digitalization Spring Group 6: Michael Sibley, Alexander Su, Daphne Tsatsoulis {msibley, ahs1,

Automatic LP Digitalization Spring Group 6: Michael Sibley, Alexander Su, Daphne Tsatsoulis {msibley, ahs1, Automatic LP Digitalization 18-551 Spring 2011 Group 6: Michael Sibley, Alexander Su, Daphne Tsatsoulis {msibley, ahs1, ptsatsou}@andrew.cmu.edu Introduction This project was originated from our interest

More information

Automatic music transcription

Automatic music transcription Music transcription 1 Music transcription 2 Automatic music transcription Sources: * Klapuri, Introduction to music transcription, 2006. www.cs.tut.fi/sgn/arg/klap/amt-intro.pdf * Klapuri, Eronen, Astola:

More information

A Framework for Segmentation of Interview Videos

A Framework for Segmentation of Interview Videos A Framework for Segmentation of Interview Videos Omar Javed, Sohaib Khan, Zeeshan Rasheed, Mubarak Shah Computer Vision Lab School of Electrical Engineering and Computer Science University of Central Florida

More information

DeepID: Deep Learning for Face Recognition. Department of Electronic Engineering,

DeepID: Deep Learning for Face Recognition. Department of Electronic Engineering, DeepID: Deep Learning for Face Recognition Xiaogang Wang Department of Electronic Engineering, The Chinese University i of Hong Kong Machine Learning with Big Data Machine learning with small data: overfitting,

More information

Improving Frame Based Automatic Laughter Detection

Improving Frame Based Automatic Laughter Detection Improving Frame Based Automatic Laughter Detection Mary Knox EE225D Class Project knoxm@eecs.berkeley.edu December 13, 2007 Abstract Laughter recognition is an underexplored area of research. My goal for

More information

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION Halfdan Rump, Shigeki Miyabe, Emiru Tsunoo, Nobukata Ono, Shigeki Sagama The University of Tokyo, Graduate

More information

Chapter 1. Introduction to Digital Signal Processing

Chapter 1. Introduction to Digital Signal Processing Chapter 1 Introduction to Digital Signal Processing 1. Introduction Signal processing is a discipline concerned with the acquisition, representation, manipulation, and transformation of signals required

More information

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring 2009 Week 6 Class Notes Pitch Perception Introduction Pitch may be described as that attribute of auditory sensation in terms

More information