Hidden Markov Model based dance recognition

Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić
University of Zagreb, Faculty of Electrical Engineering and Computing
Unska 3, 10000 Zagreb, Croatia
E-mail: {dragutin.hrenek,nenad.miksa,robert.perica,pavle.prentasic,boris.trubic}@fer.hr

Abstract

In this paper we describe a dance classification system for compositions written in MIDI format. The system recognizes the following dances: tango, polka, mazurka, waltz, cha-cha-cha and march. The rhythmic structure of a dance is a finite sequence of notes of specified durations that repeats throughout the composition, so we hypothesise that the probability of occurrence of a given note duration depends on the duration of the note before it. Hence the classifier is implemented using Hidden Markov Models. The models are used in two basic forms: the first assumes discrete note durations, and the second assumes that note durations follow a normal distribution. The system was tested using examples generated from dance prototypes with added Gaussian noise, as well as human-played examples. The results obtained with both kinds of examples are comparable. The system was implemented in Matlab.

I. INTRODUCTION

Upon hearing a certain sequence of notes or rhythm, a dance expert or even a dance enthusiast immediately thinks of the type of dance or movement that best fits the music, and thus easily recognizes the type of dance or music being played. A computer cannot do the same with such ease, as it is unable to focus on a specific musical instrument in an audio recording.

The MIDI (Musical Instrument Digital Interface) format is commonly used in music production, alongside mp3, wave and similar formats. MIDI is a protocol by which a computer communicates with external devices, such as keyboards.
The protocol is based on the exchange of messages between the device and the computer. Those messages can be saved in a file and interpreted later as audio or as musical notation. The protocol is a standard used by all musical instruments and music software, but most devices and programs do not follow the protocol specification exactly. It is therefore common that notation written in one program and saved in MIDI format corresponds poorly to the original when it is opened in another program. The problem lies in the fact that each note can be written in MIDI format in various ways, which increases the possibility of misinterpreting the recording.

One of the most common notes is the quarter note. In a 4/4 measure it lasts one beat. For example, if the tempo is 120 and the measure is 4/4, there should be 120 quarter notes in one minute of recording, so every quarter note should last exactly half a second. But this holds only on average. Let us assume that each quarter note lasts 100 ticks of the clock. Then each quaver should last 50 ticks and each dotted quaver 75 ticks, since a dot increases the duration of a note by 50%. For example, a dotted quarter note lasts as long as a quarter note and a quaver together. If the music is played by a human, the quarter note lasts 100 ticks only on average; it can last a bit more or less, e.g. 102 ticks or 85 ticks, depending on the melody phrasing and other factors. Music notation software often adds noise to note durations when saving in MIDI format so that the recording sounds as if it had been played by a human. This makes the correct interpretation of a note difficult for the computer: for example, a note that lasts 85 ticks is closer to a dotted quaver than to a quarter note. Thus, the quarter note is often not a real quarter note.
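The tick arithmetic in the example above can be checked with a few lines of Python. This is only an illustrative sketch: the 100-tick quarter note comes from the text, while the function and table names are hypothetical and not part of the described system.

```python
# Nominal durations in ticks, assuming (as in the text) that a quarter
# note lasts 100 ticks. All names here are illustrative.
QUARTER_TICKS = 100


def dotted(ticks):
    """A dot lengthens a note by 50%."""
    return ticks * 1.5


NOMINAL = {
    "quaver": QUARTER_TICKS / 2,                 # 50 ticks
    "dotted quaver": dotted(QUARTER_TICKS / 2),  # 75 ticks
    "quarter": float(QUARTER_TICKS),             # 100 ticks
    "dotted quarter": dotted(QUARTER_TICKS),     # 150 ticks
}


def nearest_note(ticks):
    """Return the note type whose nominal duration is closest to `ticks`."""
    return min(NOMINAL, key=lambda name: abs(NOMINAL[name] - ticks))


def seconds_per_quarter(tempo_bpm):
    """Duration of one quarter note in seconds at the given tempo."""
    return 60.0 / tempo_bpm


print(seconds_per_quarter(120))  # 0.5
print(nearest_note(102))         # quarter
print(nearest_note(85))          # dotted quaver
```

A played quarter note of 85 ticks is 10 ticks away from the dotted quaver but 15 ticks away from the quarter note, which is exactly the ambiguity described above.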
Such deviations are the reason why notes are interpreted differently by different programs. Because of this, classification of dances by rhythmic patterns obtained from MIDI files is a challenging problem. In this paper we present methods that enable the computer to recognize dances based on human-labeled examples.

The next section gives an overview of previous work on the described problem. The third section describes a method for classifying musical pieces with Hidden Markov Models. The fourth section describes the classification results. The fifth section concludes the paper and discusses future work.

II. PREVIOUS WORK

The described problem is closely related to the problem of detecting the rhythmic structure of a musical piece. Takeda et al. define the problem as a search for a sequence of states in a probabilistic model [1]. Since the states are represented with Hidden Markov Models, the most probable sequence of states can be found with the well-known Viterbi algorithm [2]. The rhythmic structure is therefore determined by the most probable sequence of states found by the Viterbi algorithm for the given sequence of observations. This method is good for finding a specific rhythmic structure, but it is impractical for classifying rhythmic structures.

In [3], a system for extracting musical features from MIDI recordings is described. The system consists of several subsystems that carry out the following tasks: identifying basic musical objects (notes, pauses, chords, etc.), searching for the accent on each musical object, rhythm recognition, rhythm tracking and note discretization. The rhythmic structure of the piece is recognized by looking into the time
interval that consists of a certain number of notes. This time interval is determined in advance for each potential rhythmic structure to be recognized. The actual notes in that interval are then compared to the expected notes, and the classification is performed on that basis. This method is not practical for our problem, as it does not give good results.

In [4], methods for note duration discretization and for detection and tracking of rhythm are presented. The rhythm detection in that paper is based on Hidden Markov Models in which each state of the model represents the moment at which a note has been played. This enables the modeling of different moments at which a note can appear. The method is very useful for converting MIDI recordings into printable musical notation.

III. METHOD DESCRIPTION

It is a general trend to use Hidden Markov Models (HMMs) for pattern recognition problems in which the patterns are time-dependent signals, as for example in speech recognition [2]. MIDI signals are time-dependent signals, and they are a more abstract way of representing music in a computer. It is much easier to extract note characteristics from a MIDI recording than from mp3 or wave. Hence, we consider HMMs a good choice for classifying musical pieces recorded in MIDI format.

The idea behind HMMs assumes the existence of a set of states $Q = \{q_i\}_{i=1}^{N}$, where $N$ is the number of states. For each state we define the probabilities of transition from the current state into all other states and the probability of staying in the current state. Furthermore, for each state we define its a priori probability (prior), i.e. the probability that the system will start in this state. Besides the set of states, there exists a set of possible outputs of the system $V = \{v_j\}_{j=1}^{M}$, where $M$ is the number of possible outputs.
For each state of the system, we define the probability that the system will generate a certain output while being in that state. All that can be formally written in the following way: a Hidden Markov Model $\lambda$ is a tuple $\lambda = (\Lambda, B, \Pi)$, where $\Lambda$ is a transition probability matrix, $B$ is an output probability matrix and $\Pi$ is a vector of priors. The elements of the matrix $\Lambda$ are $a_{ij}$ and represent the probability of transition from state $i$ to state $j$, i.e.

$$a_{ij} = P(q_{t+1} = j \mid q_t = i)$$

The elements of the matrix $B$ are $b_{ij}$ and represent the probability that the output $j$ will be generated while the system is in state $i$, i.e.

$$b_{ij} = P(\text{output} = v_j \mid q_t = i)$$

The elements of the vector $\Pi$ are $\pi_i$ and represent the probability that the system will start in state $i$, i.e.

$$\pi_i = P(q_1 = i)$$

Given this definition, it is convenient to represent an HMM as a directed graph. The vertices of the graph represent the states of the HMM and the outputs of the system, while the edges represent possible transitions between states and possible outputs of the system for each state. The weights of the edges represent probabilities. An example of an HMM is shown in figure 1.

Figure 1. An example of a Hidden Markov Model. X represents the states, Y represents the possible outputs of the system, a represents the transition probabilities and b represents the probabilities of outputs in each state.

The possible outputs of the system can be continuous too. In that case, for each state we have to model the probability distribution which generates the outputs of the system in that state, e.g. a Gaussian distribution. In general, it is possible to model a different probability distribution function for each state, but it is common to use the same probability distribution function in all states with different parameters. This simplifies the usage of the model and the learning algorithm. The possible outputs of the system depend on the problem we try to model using HMMs.
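As a concrete illustration of the tuple defined above, the following Python sketch stores a small discrete HMM as plain lists and computes the probability that the model generates a given output sequence, using the forward recursion from [2]. It is a minimal sketch under stated assumptions: the two-state model and all of its numbers are made up, the matrix written $\Lambda$ in the text is called `A` in the code, and the system described in this paper relies on a Matlab toolbox rather than on this code.

```python
# A toy discrete HMM; every number below is a made-up illustration.
A = [[0.7, 0.3],   # A[i][j] = P(q_{t+1} = j | q_t = i)
     [0.4, 0.6]]
B = [[0.9, 0.1],   # B[i][k] = P(output = v_k | q_t = i)
     [0.2, 0.8]]
PI = [0.5, 0.5]    # PI[i]   = P(q_1 = i)


def is_row_stochastic(rows, tol=1e-9):
    """Every row must be non-negative and sum to one."""
    return all(min(r) >= 0.0 and abs(sum(r) - 1.0) < tol for r in rows)


def forward_likelihood(A, B, PI, obs):
    """P(obs | model) by the forward recursion; obs holds output indices."""
    n = len(PI)
    # Initialisation: start in state i and emit the first output.
    alpha = [PI[i] * B[i][obs[0]] for i in range(n)]
    # Induction: move to state j, then emit the next output.
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: sum over the final state.
    return sum(alpha)


assert is_row_stochastic(A) and is_row_stochastic(B) and is_row_stochastic([PI])
print(forward_likelihood(A, B, PI, [0, 1, 1]))
```

The recursion costs on the order of $N^2 T$ operations for $N$ states and $T$ outputs, instead of summing over all $N^T$ state sequences explicitly.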
Unlike the set of outputs, the number of states is a parameter of the model and thus influences the complexity of learning.

A. Data preparation and feature selection

Our system recognizes dances by their rhythmic structures. A rhythmic structure is a sequence of notes of certain durations, i.e. the alternation of sound and silence in time. Examples of rhythmic structures that our classifier can recognize are shown in figure 2. The duration of a note is the only feature used by our classifier, as it is the only feature required to describe the rhythmic structure of a dance.

B. Note discretization

When we want to classify musical pieces with notes represented by their class, we first need to perform note discretization, i.e. classify each note into a class of notes: is it a quaver, a quarter note, a half note, etc.

Figure 2. Rhythmic structures of dances recognizable by our system: (a) tango rhythm, (b) polka rhythm, (c) mazurka rhythm, (d) waltz rhythm, (e) cha-cha-cha rhythm, (f) march rhythm.

For discretization of notes we use a modified k Nearest Neighbours (knn) classifier, which determines the type of a note based on its duration and on examples read from the learning database. This means that the classifier reads the duration of a note from a MIDI file and then determines whether it is the duration of a quaver, a quarter note, a half note, etc. Every type of note has its own identification number, or index, which is then used as a feature in the HMM based classifier: semiquaver has index 1, dotted semiquaver 2, quaver 3, dotted quaver 4, quarter note 5, dotted quarter note 6, half note 7, dotted half note 8 and whole note 9. These discrete notes are then used for learning the Hidden Markov Models.

The classifier used for note discretization is not a usual knn classifier. It works in the following way: every note duration to be discretized is first compared with the learning examples by calculating the differences between the duration of the note and the durations of all notes in the learning set. Our learning set has 100 examples for each note type. Next, all examples for which the absolute value of this difference is minimal, and therefore mutually equal, are chosen. After that, the note is assigned to the class that is most frequent among the chosen notes. For example, let us classify a note that has a duration of 0.9245. We calculate the differences between that duration and the durations of all notes in the learning set. Then we observe the absolute values of the calculated differences.
Let us assume that the notes with the minimal absolute difference belong to the classes {6, 6, 6, 6, 5, 7}. Since class 6 is the most frequent among these closest examples, the note is assigned to class 6, which represents the dotted quarter note.

C. Learning the note classifier

For classifying the notes we used a Hidden Markov Model based method, as described in the third section. In this subsection we explain how the classifier is learned; the classification of a new example is described afterwards.

The learning processes for discrete and continuous note durations are similar. In both cases we use the Maximum Likelihood criterion: we want to determine the parameters of the Hidden Markov Model in such a way that the probability that the model generates the learning examples is maximal. Unfortunately, the solution of this maximization problem cannot be found in closed form, so we need to use iterative methods. This can be done in various ways, for example with the Baum-Welch algorithm or with gradient descent optimization, as explained in [5]. Instead of the Maximum Likelihood criterion, it is possible to use the Maximum Mutual Information criterion, which also requires gradient descent optimization [5]. We learn our classifier with the Maximum Likelihood criterion because a method for iterative maximization of this criterion is already implemented in a Hidden Markov Model toolbox for Matlab (footnote 1).

The learning algorithm is stopped when it converges or when it exceeds the maximum allowed number of iterations, which in our case was 60. While learning, we record the log-likelihood in each iteration and show how it grows until it reaches its maximum. The plot of the growth of the log-likelihood is shown in figure 3. We have trained a separate HMM for each dance, i.e.
each HMM generates the rhythmic structure of the dance it represents with maximum likelihood. In the case of continuous note durations, the output probabilities of each state of the model are represented by a Gaussian distribution with parameters $\mu_i$ and $\sigma_i^2$, i.e. the mean and the variance. The transition probabilities, the priors and the Gaussian distribution parameters are determined by the learning algorithm from the training examples. The number of states of each HMM is determined with 3-fold cross-validation using 60 examples. We have determined that the HMMs that represent tango, polka, cha-cha-cha and march should have three states, the HMM that represents mazurka should have four states, and the HMM that represents waltz should have five states.

The interpretation of the parameters $\mu_i$ and $\sigma_i^2$ is straightforward: they determine the mean value of a note's duration and the variance around the mean. The interpretation of the other parameters, such as the number of states and the transition probabilities, is less intuitive. The prior can be interpreted as the probability that a note is the first one in the rhythmic structure. The number of states of an HMM can be interpreted as the number of different notes in a rhythmic structure. The transition probabilities between states can represent the probabilities that a certain note will appear after another note in a rhythmic structure. For example, in this interpretation $a_{37}$ represents the probability that a half note will appear after a quaver, where index 3 denotes the quaver and index 7 the half note.

Footnote 1: http://www.cs.ubc.ca/~murphyk/software/hmm/hmm.html

Figure 3. The growth of the log-likelihood in iterations of the learning algorithm.

D. Classification of dances

After learning the HMMs for each dance, the classification of a new example is simple and intuitive. For each HMM we calculate the likelihood that the model will generate the given example. We then classify the example into the dance category for which the calculated likelihood is maximal. We calculate the likelihood of generating the given example with the forward algorithm described in [2]. If the likelihoods of generating the example are the same for all HMMs, the example is not classified. Let us show this on an example. Let the rhythmic structure we want to classify be given by

$$X = [\, 0.9245 \;\; 0.9440 \;\; 0.6120 \,]$$

The likelihoods of the HMMs for the given example are as follows:

$P(X = \text{tango}) = 0.0909$
$P(X = \text{polka}) = 8.9747 \cdot 10^{-22}$
$P(X = \text{mazurka}) = 2.3991 \cdot 10^{-12}$
$P(X = \text{waltz}) = 0.0199$
$P(X = \text{cha-cha-cha}) = 6.4525 \cdot 10^{-9}$
$P(X = \text{march}) = 4.9277 \cdot 10^{-17}$

We classify the given example as tango, since the likelihood that the example is tango is maximal.

In the case of discrete note durations, we first have to discretize the example and then classify it. The classification procedure is the same as in the case of continuous note durations, except that the output symbols of each state of the HMM are discrete-valued, so we do not assume a theoretical distribution that would generate the output examples.

IV. RESULTS

A. Data set

Our system discriminates six different dances: tango, polka, mazurka, waltz, cha-cha-cha and march, but it is easy to add more dance types. Based on the rhythmic structures that are available on Wikipedia and shown in figure 2, we generated the learning examples in two ways: synthetically from the prototypes, and by playing the rhythms on a keyboard with a MIDI interface.

The synthetic examples were generated in the following way. We assumed that the duration of the quarter note is 120 ticks. Based on that assumption we calculated the durations of the other notes and generated the prototypes of the rhythmic structures of the dances according to the rhythms displayed in figure 2. We then added Gaussian noise to the prototypes in order to get synthetic examples. The mean and the variance of the Gaussian noise were changed randomly in order to get examples that are as heterogeneous as possible.

Besides the synthetically generated examples, we played the rhythms displayed in figure 2 on a keyboard with a MIDI interface, which can be used to load the played notes into the computer. For each dance we played 70 examples that were used exclusively for learning and validation, whilst for cross-validation we used an additional 30 examples. As these examples were really played, they represent a real situation in which the note duration may not obey the Gaussian distribution that was assumed while synthetically generating the examples. This will also show whether the assumption that note durations obey the Gaussian distribution was correct.

B. Classification results

The classifier was tested in various ways.
First we used the synthetically generated examples with continuous note durations. We generated 50 examples for learning and 100 examples for testing, in the way described above. The parameters of the additive Gaussian noise were as follows: for each example, the variance was randomly selected from the interval [0, 2] and the mean from the interval [-5, 5]. This means that for each example we first generated the parameters of the additive Gaussian noise, then generated the noise and finally added it to the prototype of the dance's rhythmic structure. The results of the first experiment are given in table I. The rows of the table represent the dance chosen by the classifier, and the columns represent the real dance type.
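The generation of synthetic examples described in this subsection can be sketched in Python as follows. Several details are assumptions made only for illustration: the noise parameters are drawn uniformly from the stated intervals, the noise is expressed in ticks, and the prototype below is a hypothetical placeholder rather than one of the rhythms in figure 2.

```python
import random

QUARTER = 120  # ticks per quarter note, as assumed in the data set description

# Hypothetical placeholder prototype (quarter, quaver, quaver, quarter,
# quarter); it is NOT one of the dance rhythms from figure 2.
PROTOTYPE = [QUARTER, QUARTER / 2, QUARTER / 2, QUARTER, QUARTER]


def noisy_example(prototype, rng):
    """Draw the noise parameters once per example, then perturb each note."""
    mean = rng.uniform(-5.0, 5.0)     # mean selected from [-5, 5]
    variance = rng.uniform(0.0, 2.0)  # variance selected from [0, 2]
    std = variance ** 0.5             # gauss() expects a standard deviation
    return [duration + rng.gauss(mean, std) for duration in prototype]


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
learning_set = [noisy_example(PROTOTYPE, rng) for _ in range(50)]
testing_set = [noisy_example(PROTOTYPE, rng) for _ in range(100)]
print(len(learning_set), len(testing_set))  # 50 100
```

Note that `random.gauss` takes a standard deviation, so the variance drawn from [0, 2] has to be converted with a square root before use.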
Table I
CONFUSION MATRIX FOR CLASSIFIER WHICH USES SYNTHETICALLY GENERATED EXAMPLES AND CONTINUOUS-VALUED DURATION OF NOTES
(rows: decision of the classifier; columns: real dance type)

              Tango  Polka  Mazurka  Waltz  Cha-cha-cha  March
Tango           100      0        0      0            0      0
Polka             0      1        0      0            0      0
Mazurka           0      0      100      0            0      0
Waltz             0      0        0    100           21      0
Cha-cha-cha       0      0        0      0           79      0
March             0     99        0      0            0    100

The accuracy of the classification is 93.33%, and precision and recall are 80%. The F1 micro and F1 macro values are as follows:

$$F_1^{\text{micro}} = 80\%, \qquad F_1^{\text{macro}} = 74.61\%$$

We can notice that the classifier discriminates poorly between polka and march. The great similarity between those dances is the main reason for this behaviour: the rhythmic structures of those dances in figure 2 show very similar quaver and semiquaver patterns. Semiquavers and quavers are very short notes, so it is difficult to discriminate them, especially if the examples have a large variance and large deviations of the mean note duration. The poorer discrimination of waltz from cha-cha-cha is a consequence of noise in the examples.

We can see that the classifier is much more accurate and precise than randomly picking a dance. The accuracy of a random pick would, on average, be equal to the probability of randomly picking the correct dance. As the system discriminates six dances, the accuracy of a random choice would be $1/6 \approx 16.67\%$.

If we discretize the examples from the previous experiment, as described earlier, and then use our classifier in the discrete domain, the results become even better (see table II). The classification accuracy increases to 95.78%, and precision and recall to 87.33%. The F1 micro and F1 macro values are as follows:

$$F_1^{\text{micro}} = 87.33\%, \qquad F_1^{\text{macro}} = 85.57\%$$

The classifier now discriminates polka from march better, but most of the polka examples are still misclassified, which can be explained by the great similarity of the theoretical rhythmic structures of polka and march. In a realistic situation, few musical pieces have the exact theoretical rhythmic structure.
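The values reported for table I can be recomputed from the confusion matrix itself. The sketch below rests on assumptions that are not stated in the text but that reproduce the reported numbers: "accuracy" is taken as the per-dance accuracy (true positives plus true negatives over all examples) averaged over the six dances, precision and recall are micro-averaged, and F1 macro is the mean of the per-dance F1 values. The function name is hypothetical.

```python
# Table I: rows = decision of the classifier, columns = real dance type,
# both in the order tango, polka, mazurka, waltz, cha-cha-cha, march.
TABLE_I = [
    [100, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 100, 0, 0, 0],
    [0, 0, 0, 100, 21, 0],
    [0, 0, 0, 0, 79, 0],
    [0, 99, 0, 0, 0, 100],
]


def metrics(cm):
    """Return (accuracy, precision, recall, F1 micro, F1 macro)."""
    n = len(cm)
    total = sum(map(sum, cm))
    tp = [cm[k][k] for k in range(n)]
    # False positives: rest of the row (decided k, but really another dance).
    fp = [sum(cm[k]) - tp[k] for k in range(n)]
    # False negatives: rest of the column (really k, but decided otherwise).
    fn = [sum(cm[r][k] for r in range(n)) - tp[k] for k in range(n)]
    accuracy = sum((total - fp[k] - fn[k]) / total for k in range(n)) / n
    precision = sum(tp) / (sum(tp) + sum(fp))
    recall = sum(tp) / (sum(tp) + sum(fn))
    f1_micro = 2 * precision * recall / (precision + recall)
    f1_macro = sum(2 * tp[k] / (2 * tp[k] + fp[k] + fn[k]) for k in range(n)) / n
    return accuracy, precision, recall, f1_micro, f1_macro


names = ("accuracy", "precision", "recall", "F1 micro", "F1 macro")
for name, value in zip(names, metrics(TABLE_I)):
    print(f"{name}: {100 * value:.2f}%")
# accuracy: 93.33%
# precision: 80.00%
# recall: 80.00%
# F1 micro: 80.00%
# F1 macro: 74.61%
```

This sketch does not model examples that the classifier refuses to classify; such examples would have to be added to the false negatives separately.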
To evaluate the classifier on realistic data, we tested it with examples that we played ourselves on a keyboard with a MIDI interface. In the following experiments, we used 25 examples of each dance for learning the classifier and 45 examples for testing the classification. We first carried out an experiment with continuous-valued note durations. The results of the experiment are shown in table III. The classification accuracy is 96.91%, and the precision and recall are 90.74%. The F1 micro and F1 macro values are as follows:

$$F_1^{\text{micro}} = 90.74\%, \qquad F_1^{\text{macro}} = 89.97\%$$

We can now notice that the classifier discriminates waltz from cha-cha-cha better, which is the consequence of better learning and testing examples. If we use 30 examples for learning instead of 25, the precision and accuracy of classification rise to 100%.

In the last experiment, we used the same examples as in the previous experiment, but before classification we discretized them and used the classifier in the discrete domain. The results of the experiment are shown in table IV. The classification accuracy has risen to 99.32%, and the precision to 98.50%. The F1 micro and F1 macro values are as follows:

$$F_1^{\text{micro}} = 97.95\%, \qquad F_1^{\text{macro}} = 97.95\%$$

The recall is 97.41% because not all examples have been classified: the classifier refused to classify one example of tango and two examples of waltz, so those examples were counted as false negatives when calculating the F values and recall. This lowered the recall of the classification. We can notice that in the last experiment the discrimination between polka and march has improved considerably. This tells us that this type of classification is the best for general use. Refusal to classify is often regarded as better than misclassification, as it allows people to manually classify the examples that the classifier refused to classify.

Table II
CONFUSION MATRIX FOR CLASSIFIER WHICH USES SYNTHETICALLY GENERATED EXAMPLES AND DISCRETE-VALUED NOTE DURATIONS
(rows: decision of the classifier; columns: real dance type)

              Tango  Polka  Mazurka  Waltz  Cha-cha-cha  March
Tango           100      0        0      0            0      0
Polka             0     27        0      0            0      3
Mazurka           0      0      100      0            0      0
Waltz             0      0        0    100           21      0
Cha-cha-cha       0      0        0      0           79      0
March             0     73        0      0            0     97

Table III
CONFUSION MATRIX FOR CLASSIFIER WHICH USES HUMAN-PLAYED EXAMPLES AND CONTINUOUS-VALUED NOTE DURATIONS
(rows: decision of the classifier; columns: real dance type)

              Tango  Polka  Mazurka  Waltz  Cha-cha-cha  March
Tango            45      0        0      0            0      0
Polka             0     20        0      0            0      0
Mazurka           0      0       45      0            0      0
Waltz             0      0        0     45            0      0
Cha-cha-cha       0      0        0      0           45      0
March             0     25        0      0            0     45

Table IV
CONFUSION MATRIX FOR CLASSIFIER WHICH USES HUMAN-PLAYED EXAMPLES AND DISCRETE-VALUED NOTE DURATIONS
(rows: decision of the classifier; columns: real dance type)

              Tango  Polka  Mazurka  Waltz  Cha-cha-cha  March
Tango            44      0        0      0            0      0
Polka             0     45        0      0            0      4
Mazurka           0      0       45      0            0      0
Waltz             0      0        0     43            0      0
Cha-cha-cha       0      0        0      0           45      0
March             0      0        0      0            0     41

V. CONCLUSION

In this paper we described a system which is able to recognize dances based on MIDI recordings of the pieces. This enables enthusiasts who do not understand musical notation to recognize their favourite dances. This is an interesting problem because MIDI recordings are usually played by humans, so it is not possible to determine the type of a note or the rhythmic structure with full certainty. We made a classifier which can classify examples with both discrete- and continuous-valued note durations. The classifier is based on Hidden Markov Models. The discretization of notes was made with the modified knn classifier. The classifier was trained with synthetically generated and human-played examples, with both continuous- and discrete-valued note durations. The best classification rates were achieved with human-played examples with discrete-valued note durations.

We have accomplished everything planned, although future expansions of the system are possible. For example, it would be possible to create a classifier which would automatically find the characteristic rhythmic structure of a piece during the training phase. With a greater number of examples, such a classifier would be better at learning specific music pieces, in contrast to our work, where we used theoretical rhythmic structures.

REFERENCES

[1] H. Takeda, N. Saito, T. Otsuki, M. Nakai, H. Shimodaira, and S. Sagayama, "Hidden Markov model for automatic transcription of MIDI signals," in Multimedia Signal Processing, 2002 IEEE Workshop on. IEEE, 2003, pp. 428-431.
[2] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[3] E. Cambouropoulos, "From MIDI to traditional musical notation," in Proceedings of the AAAI Workshop on Artificial Intelligence and Music: Towards Formal Models for Composition, Performance and Analysis, vol. 30, 2000.
[4] M. Hamanaka, M. Goto, H. Asoh, and N. Otsu, "A learning-based quantization: Estimation of onset times in a musical score," in Proceedings of the 5th World Multi-conference on Systemics, Cybernetics and Informatics (SCI 2001), vol. 10, 2001, pp. 374-379.
[5] N. Warakagoda, A Hybrid ANN-HMM ASR System with NN-based Adaptive Preprocessing. Institutt for teleteknikk, NTH, 1994.