Shasha Zhang, Art College, Jinggangshan University, Ji'an 343009, Jiangxi, China


doi: /

Intelligent Recognition Model for Music Emotion

Shasha Zhang
Art College, Jinggangshan University, Ji'an 343009, Jiangxi, China

Abstract

This paper applies intelligent methods to a relatively detailed analysis of the technologies relevant to music emotion recognition. Through research on, and improvement of, the recognition of the main melody of music, it analyzes the MIDI file format in depth and extracts the characteristic parameters of the notes of the various melodies and audio tracks of MIDI music files (time value, pitch, sound intensity, etc.). Based on statistics of the pitch intervals (absolute sound intensity differences) of each audio track and note series, an improved BP neural network algorithm is used to construct the music feature space model. On this basis, the paper establishes an automatic recognition model based on the BP neural network algorithm and, finally, verifies through a worked instance the effectiveness of the BP network design in recognizing music emotion.

Keywords: Music Emotion, Main Melody Recognition, Intelligent Recognition, BP Neural Network.

1. INTRODUCTION

With the thorough research on and wide application of computer technology in the multimedia field, the multimedia business has developed rapidly and become one of the fastest growing, largest-scale industries of the 21st century. Content-based multimedia information processing is an important research topic in this field; especially since the introduction of animation, audio, video and other dynamic media, multimedia technology has greatly enriched the expression of human emotions (Agustus, Mahoney and Downey, et al., 2015; Wu, Zhong and Horner, et al., 2014).
Computer music is an important component of multimedia technology. Many practical results have been achieved so far: the digital encoding, digital compression and digital storage of music signals have developed rapidly, promoting the popularization and application of VCD, digital broadcasting and multimedia, and demonstrating broad market prospects. However, the higher goal of computer music is to use the computer to simulate human emotion recognition and creative intelligence for music, which involves music theory, psychology, artificial intelligence, information processing, pattern recognition and other disciplines, and is therefore of great difficulty (Kim and André, 2008; Yang and Lee, 2004). The content of emotion computing research includes the real-time acquisition and modeling of dynamic emotional information in three-dimensional space; emotion recognition and understanding based on multimodal and dynamic timing characteristics; information integration theory and methodology; the theory of automatic emotion generation and multimodal emotional expression; and the establishment of large-scale dynamic emotion databases based on physiological and behavioral characteristics (Fritz, Jentschke and Gosselin, et al., 2009; Gosselin, Peretz and Johnsen, et al., 2014). So far, there is almost no automatic generation model for artificial emotion that both conforms to the laws of human emotion and is suitable for machine implementation. Research in the field of music emotion is therefore of great theoretical and practical significance (Juslin and Laukka, 2003).
Overall, the general analysis process of music emotion recognition includes three main aspects: first, analyze the music feature space and build the corresponding feature model (Balkwill, Thompson and Matsunaga, 2004); second, explore the emotion space and build the appropriate emotion model (Peretz, Gagnon and Bouchard, 1998); and third, based on the sample space of music emotion, build the recognition model for music emotion. On the basis of relatively accurate music recognition, applications of music emotion can be proposed in artificial intelligence, such as musical fountains and stage lighting designed around the emotion expressed by the music, and in music retrieval, intelligent composition and other fields (Han, Rho and Jun, et al., 2010).

2. MUSIC EMOTION REPRESENTATION MODEL

2.1 Music Emotion Analysis Model

Generally speaking, the analysis model for music emotion recognition (Gosselin, Peretz and Noulhiane, et al., 2015) is shown in Figure 1.

Figure 1. Music Emotion Analysis Model Diagram

The model mainly includes three aspects: the first is the feature model in the music feature space, the second is the emotion model in the emotion space, and the third is the music emotion recognition model based on the sample space. Among them, the establishment of the emotion model is the first step of music emotion analysis; it completes the construction of the music emotion space and is the fundamental premise for the next two steps, laying the foundation for the recognition model.

Figure 2. Architecture of the Fuzzy Emotion Model Music Emotion Linguistic Variables

The recognition model of music emotion can be divided into two levels: one is the mental model of music emotion, which belongs to the cognitive class of models; the other is the computational model of music emotion, which belongs to the analytic class. Mental models explore human emotional characteristics mainly from the psychological point of view; the classic examples are the Hevner model and the Thayer model. The computational model performs more in-depth analysis from the perspective of computer analysis of music emotion. Current computational models of music emotion are all based on linguistic values, including the linguistic value calculation model based on fuzzy set theory and the linguistic value calculation model based on semantic similarity relationships (Koelsch, Fritz and Müller, et al., 2009).

2.2 Computational Model for Music Emotion

Computational Model Based on Fuzzy Theory

Based on fuzzy set theory, this research takes the linguistic value as the basic unit of human natural language; by certain grammar rules (such as mood operators) the composite linguistic value can be

obtained, while the semantics of a linguistic value can be characterized by fuzzy membership functions. The basic features of this model are as follows:
1. The emotion model is established on the basis of the Hevner emotion ring;
2. Each emotion uses a degree of membership to indicate its strength;
3. At a certain moment, the intensity of one emotion may be greater than that of all the other emotions; this emotion then represents the major feature of the entire emotional state at that moment and is called the dominant emotion;
4. A flowing emotion chain is constructed on the timeline.

The emotion computational model based on fuzzy theory is a kind of emotion model in the digital art field that combines the Hevner emotion ring and fuzzy logic language. The architecture of its emotion linguistic variables is shown in Figure 2. So far, the methodology of linguistic variables based on fuzzy theory is almost the standard method for linguistic computational models, such as the group decision model based on numerical and language information and the multi-granularity linguistic information fusion model. These linguistic computational models use fuzzy membership functions to represent the semantics of the linguistic values and apply fuzzy logic operators as linguistic aggregation operators.
In summary, the fuzzy computational model based on the concept of linguistic variables is quite successful.

Computational Model Based on Semantic Similarity

The computational model based on fuzzy theory integrates the fuzzy characteristics of the connotation of music emotion. At the same time, it benefits from the fact that a fuzzy system does not require a precise mathematical model, which facilitates the use of human experience and knowledge, and it has nonlinearity, robustness and other advantages. It uses fuzzy sets to describe music features and performs analysis and recognition by fuzzy logic and fuzzy reasoning, which can be said to be closer to people's cognitive process for music. Nevertheless, the fuzzy computational model still has limitations. To overcome them, the linguistic computational model based on semantic similarity relations was developed. This model emphasizes the semantic similarity relationship between linguistic values and assumes that this similarity relationship is a basic characteristic of human linguistic cognition, in line with the behavioral patterns of music emotion recognition. However, when this model defines logical operations between different linguistic expressions, it specifies that different expressions have the same priority, which may differ somewhat from the actual situation. Liu Tao et al. further developed and improved this basis, applied the linguistic computational model based on semantic similarity to the study of music emotion, and defined the music emotion linguistic value model and the music emotion vectors as follows:

Definition 1 (Linguistic Model): The pair <LA, R> represents the linguistic model:

LA = {L1, L2, L3, ..., Ln}    (1)

R = (rij)n×n, rij ∈ [0, 1], i, j = 1, 2, ..., n    (2)

where LA is a set constituted by finite linguistic values, R is the fuzzy similarity relation defined on LA, and n is the number of elements in the set of linguistic values. The element rij of R represents the degree of semantic overlap between the two linguistic values Li and Lj. Obviously the fuzzy relation matrix R is symmetric, meeting the two properties rij = rji and rii = 1.

For the study of music emotion, this model sets the basic linguistic value set LA to the signature emotion words of the eight subclasses of the emotion space: LAoM = {Sacred, Sorrow, Longing, Lyrical, Light, Joyous, Enthusiastic, Vigorous}, abbreviated as LAoM = {LAoMi}, i = 1, 2, ..., 8, and uses r(Music, LAoMi) to represent the degree of similarity between the music and the i-th element of the linguistic value set.

Definition 2 (Music Emotion Vector): For music with independent emotion semantics, its emotional connotation is defined as an eight-dimensional vector E over the Hevner emotion ring. Element ei = r(Music, LAoMi) represents the semantic similarity between the music and each sub-emotion linguistic value, with values from 0 to 1 representing the degree of similarity. The vector is called the music emotion vector E:

E = (r(Music, LAoM1), ..., r(Music, LAoMi), ..., r(Music, LAoM8))    (3)

The sub-emotion with the largest value is defined as the dominant emotion of the music, Edon:

Edon = max(ei), i = 1, 2, ..., 8

where max() selects the LAoM corresponding to the maximum value. For example, if, through a certain process of reasoning, the emotion of a piece of music M is expressed as (0.2, 0.6, 0.9, 0.4, 0.3, 0.1, 0.0, 0.1), then its dominant emotion is "longing" with an emotional semantic similarity of 0.9. The music also contains the emotional connotation of "sorrow", though to a relatively low degree, 0.6, which can be called a secondary emotion. The emotional connotation of music M is therefore described as "yearning very much, and somewhat sorrowful". Of course, some music has several dominant emotions and secondary emotions.
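As a minimal illustration of Definition 2, the dominant and secondary emotions of a piece can be read directly off its emotion vector. The Python sketch below uses a hypothetical secondary-emotion threshold (an assumption; the text only calls the secondary value 0.6 "relatively low") and reproduces the worked example for music M:

```python
# Sketch of Definition 2: the eight-dimensional music emotion vector and its
# dominant emotion. Labels follow the LAoM set defined above.
LAOM = ["Sacred", "Sorrow", "Longing", "Lyrical",
        "Light", "Joyous", "Enthusiastic", "Vigorous"]

def dominant_emotion(E, secondary_threshold=0.5):
    """Return the dominant emotion and any secondary emotions of a piece.

    E is the emotion vector (e_1, ..., e_8) with e_i = r(Music, LAoM_i) in [0, 1].
    The threshold for calling an emotion "secondary" is an assumption.
    """
    i_max = max(range(8), key=lambda i: E[i])      # Edon = max(e_i)
    dominant = LAOM[i_max]
    secondary = [LAOM[i] for i in range(8)
                 if i != i_max and E[i] >= secondary_threshold]
    return dominant, secondary

# The worked example from the text: music M with vector E
E_M = (0.2, 0.6, 0.9, 0.4, 0.3, 0.1, 0.0, 0.1)
dom, sec = dominant_emotion(E_M)
print(dom, sec)  # Longing ['Sorrow']
```

As in the text, the dominant emotion of M is "longing" (0.9), with "sorrow" (0.6) as a secondary emotion.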

The linguistic value model proposed by Liu Tao et al. makes developments mainly in two aspects: first, it defines the music emotion vector and its similarity measurement rule through the semantic similarity relationship; second, it expands the logic of the emotion vector expression and its rules of addition and subtraction, so as to enhance its knowledge representation capacity. It is precisely on these two points that the model's reasoning about music emotion can yield the integration of a number of complicated dominant and secondary emotions, which is fully consistent with the complex character of music emotion. A direct test method is applied to obtain the similarity matrix shown in Table 1.

Table 1. Similarity Matrix Obtained by Direct Testing of the Similarity Emotion Model
(rows and columns in the order: Sacred, Sorrow, Longing, Lyrical, Light, Joyous, Enthusiastic, Vigorous)

3. AUTOMATIC RECOGNITION MODEL FOR MUSIC EMOTION

3.1 Automatic Recognition Model Based on Neural Network

Pre-processing of Data

(1) Extraction of elements. Three elements, pitch, sound intensity and sound length, are extracted; they constitute the essential elements of the music feature space. In accordance with the main-melody emotion feature vector model, they are expanded into an 8-dimensional vector space with the elements (pitch register, intensity, intensity stability, intensity orientation, melody orientation, pitch stability, interval stability, interval span).
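A sketch of how such feature components might be computed from a parsed MIDI note series follows. The paper does not give closed formulas, so every statistic below (means, standard deviations, pitch range over the note series) is an illustrative assumption only, and the two intensity-stability/orientation components are omitted:

```python
# Illustrative sketch of a melody feature vector: (pitch register, intensity,
# melody orientation, pitch stability, interval stability, interval span).
# The exact definitions are assumptions, not the paper's formulas.
from statistics import mean, stdev

def melody_features(notes):
    """notes: list of (pitch, velocity, duration) tuples from one MIDI track."""
    pitches = [p for p, _, _ in notes]
    velocities = [v for _, v, _ in notes]
    intervals = [b - a for a, b in zip(pitches, pitches[1:])]  # signed steps
    return (
        mean(pitches),                 # pitch register (average pitch height)
        mean(velocities),              # intensity (average velocity)
        mean(intervals),               # melody orientation (upward vs. downward)
        stdev(pitches),                # pitch stability proxy (pitch spread)
        stdev(intervals),              # interval stability proxy (interval spread)
        max(pitches) - min(pitches),   # interval span (pitch range)
    )

# A made-up four-note melody fragment
notes = [(60, 80, 0.5), (64, 82, 0.5), (67, 78, 1.0), (65, 75, 0.5)]
print(melody_features(notes))
```

Each MIDI track would yield one such vector; the next step in the text reduces it to the six most discriminative components.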
However, since in the large sample set the amplitude of the same song shows a uniform distribution, the intensity stability and intensity orientation of music of different emotional categories do not discriminate much and contribute relatively little to music emotion classification; in other words, the outline of the emotional connotation of music is characterized by a small number of elements that play relatively important roles. The eight-dimensional feature space is therefore streamlined to six dimensions by removing the two components of intensity stability and intensity orientation.

(2) Establishment of samples. Through professional and popular music websites, the Shanghai Conservatory of Music Electronic Music Library and other channels and resources, a total of 580 pieces of MIDI music were obtained. Then, according to the typical emotional expression of the music and the production quality of the MIDI files, 190 songs were carefully selected, including 48 in the sorrow class, 51 in the joyous class, 50 in the sacred class, and 41 in the longing class. Their approximate distribution is shown in Table 2.

Table 2. Experimental Music Statistics
Music Type | Quantity | Demonstration Music
Film and television music | 32 | "Knife Man" theme song, "Princess Pearl", etc.
Popular music | 51 | "Beautiful Mood", "If the Cloud Knows", etc.
Classic music | 16 | "Hungarian Dance", "William Tell Overture", etc.
Religious music | 43 | anthems, choir teaching music, etc.
Chinese folk music | 48 | "Happiness", "Fengyang Drum Dance", "Happy New Year", etc.

(3) Screening of samples. For the experimental samples established so far, due to the inevitable error in measurement, if they are adopted directly as test data the classification result may be somewhat rough.
In fact, if 15 samples are selected at random from each of the four emotions directly as training data, with the rest as test data, the correct classification rate obtained by the neural network is only about 50%, which is caused by the relatively large error in the data. Screening of the samples is therefore necessary. Take the element of

pitch register in the sacred class of music emotion to illustrate the screening. For judging outliers in the data, the quantile is selected instead of the mean or variance, because outliers affect the mean and the variance but do not disturb the quantiles. For the 44 sets of data in the sacred class, a box plot is made, as shown in Figure 3: the box encloses the data between the 25% and 75% quantiles, the length of each whisker is 1.5 times the quartile deviation, the vertical axis is the pitch register value, and the horizontal axis is the pitch value corresponding to the box midline. As can be seen, some data still lie beyond the whiskers at both ends, relatively far from the group; these are the outliers. Making the same box plot for the other three classes of emotion data also reveals some outliers. The samples are therefore examined one by one: if any component of a sample vector is identified as an outlier, the sample is marked; after all samples have been examined, the marked samples are removed, so that the remaining data is relatively concentrated. An outlier is only marked when encountered, not removed immediately, so that the screening result does not depend on the order of examination and as many samples as possible are retained.

Figure 3. Sacred Emotion Data Distribution Box Plot

(4) Selection of the training data. Although removing the outliers yields four groups of data with smaller error than before, the choice of training data directly affects the results of the calculation; the training data is expected to reflect the most primary characteristics of the four emotion categories as much as possible. It is therefore necessary to select a portion of representative data from the screened data.
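The box-plot screening of outliers described above can be sketched as follows; the 1.5x interquartile whiskers match the box-plot construction in the text, while the sample data below is made up:

```python
# Sketch of the screening step: mark a sample whenever ANY one of its
# components falls outside the box-plot whiskers (1.5 x the interquartile
# range beyond the 25%/75% quantiles), then remove all marked samples at
# once, so the result does not depend on the order of inspection.
import numpy as np

def screen_outliers(samples):
    """samples: (n_samples, n_features) array; returns the retained rows."""
    x = np.asarray(samples, dtype=float)
    q1 = np.percentile(x, 25, axis=0)
    q3 = np.percentile(x, 75, axis=0)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # A row is marked if any component is an outlier in its own dimension.
    marked = ((x < lo) | (x > hi)).any(axis=1)
    return x[~marked]

# Made-up two-feature samples; the last row deviates strongly in feature 0.
data = [[60, 70], [62, 71], [61, 69], [63, 72], [95, 70]]
print(screen_outliers(data).shape)  # (4, 2)
```

Marking first and removing afterwards is what makes the screening order-independent, as the text requires.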
For the four categories of emotions, calculate their respective center vectors Si, i = 1, 2, 3, 4; then, using the Euclidean distance, calculate the distance dij of each element from the center vector of its emotion category. The smaller dij is, the closer the sample is to the center vector, that is, the better it reflects the main characteristics of that emotion. Therefore, for each i, the 15 samples with the smallest dij are chosen as training data, giving 60 training samples in total.

Design of the Network

(1) Characteristic data of the sample input: take the 60 training samples obtained above; each sample is a six-dimensional vector.

(2) Target samples: to verify the experimental model more intuitively, the mapping of the target sample space, that is, the emotion representation space, is here taken as a four-dimensional vector, selecting the four emotions with relatively large degrees of distinction: sacred, joyous, sorrow and longing. The four-dimensional vectors (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) and (0, 0, 0, 1) correspond to the four emotions sacred, joyous, longing and sorrow respectively.

(3) Structure of the network: according to the Kolmogorov theorem, an N-(2N+1)-M three-layer BP network is adopted as the classifier, where N is the number of components of the input feature vector and M is the total number of output categories. Following the usual design, the transfer function of the intermediate-layer neurons is the S-type tangent function, and the transfer function of the output-layer neurons is the S-type logarithmic function. The S-type logarithmic function is selected because it is a 0-1 function, which exactly meets the output requirements of the classifier.
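The center-vector selection of the 60 training samples, described at the start of this design, can be sketched as follows (a minimal sketch; the screened per-class sample arrays are assumed inputs):

```python
# Sketch of the training-data selection: compute each emotion class's center
# vector S_i, then keep, per class, the samples with the smallest Euclidean
# distance d_ij to that center (15 per class in the paper, 60 in total).
import numpy as np

def select_training(classes, per_class=15):
    """classes: dict mapping emotion label -> (n_i, d) array of screened samples."""
    training = {}
    for label, x in classes.items():
        x = np.asarray(x, dtype=float)
        center = x.mean(axis=0)                 # center vector S_i
        d = np.linalg.norm(x - center, axis=1)  # distances d_ij to the center
        nearest = np.argsort(d)[:per_class]     # most representative rows
        training[label] = x[nearest]
    return training

# Tiny made-up example with two features and per_class=2
classes = {"sacred": [[0, 0], [1, 1], [10, 10]]}
picked = select_training(classes, per_class=2)
print(picked["sacred"].tolist())  # [[1.0, 1.0], [0.0, 0.0]]
```

The row nearest the class mean is kept first, mirroring the text's claim that small dij means a more representative sample.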

The S-type tangent and S-type logarithmic function graphs are shown in Figures 4(1) and 4(2) respectively:

Figure 4(1). S-type Tangent Function    Figure 4(2). S-type Logarithmic Function

Parameter Setting

Set the maximum number of cycles to 100, the training error to , and the training function to the Levenberg-Marquardt BP training function. The learning function is the momentum gradient descent learning function, and the performance function is the mean square error function.

Introduction to the Levenberg-Marquardt BP method: let d denote the search direction and J the Jacobian matrix; the algorithm is as follows:

(1) Take an initial point x^1 and damping factor v^1, and set the precision ε; let k = 1.
(2) If ||J(x^k)^T r(x^k)|| ≤ ε, stop. Otherwise solve the linear equations (J(x^k)^T J(x^k) + v^k I) d^k = -J(x^k)^T r(x^k) to obtain d^k.
(3) Set x^{k+1} = x^k + d^k, and calculate the ratio
    γ^k = (f(x^{k+1}) - f(x^k)) / ((J(x^k)^T r(x^k))^T d^k + (1/2) (d^k)^T J(x^k)^T J(x^k) d^k).
(4) If γ^k < 0.25, set v^{k+1} = 4v^k; if γ^k > 0.75, set v^{k+1} = v^k / 2; otherwise set v^{k+1} = v^k.
(5) Set k = k + 1 and go to (2).

Result of Recognition

First the BP network is trained; the operating result and specific training process of the BP network in MATLAB are shown in Figure 5. At the 63rd iteration step, the network training error falls below the set training error, so the network converges.
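The Levenberg-Marquardt iteration above can be sketched in Python for a generic least-squares objective f(x) = (1/2)||r(x)||^2. The residual/Jacobian pair in the usage example is a toy assumption for demonstration, not the BP-network training itself (MATLAB's trainlm performs that internally):

```python
# Minimal sketch of the Levenberg-Marquardt steps (1)-(5) described above.
import numpy as np

def levenberg_marquardt(r, J, x, v=1.0, eps=1e-8, max_iter=100):
    for _ in range(max_iter):
        Jx, rx = J(x), r(x)
        g = Jx.T @ rx                              # gradient J^T r
        if np.linalg.norm(g) <= eps:
            break                                  # step (2): converged
        A = Jx.T @ Jx + v * np.eye(len(x))
        d = np.linalg.solve(A, -g)                 # step (2): solve for d
        f_old = 0.5 * rx @ rx
        f_new = 0.5 * r(x + d) @ r(x + d)
        pred = g @ d + 0.5 * d @ (Jx.T @ Jx) @ d   # predicted decrease
        gamma = (f_new - f_old) / pred             # step (3): gain ratio
        if gamma < 0.25:
            v *= 4.0                               # step (4): poor model fit
        elif gamma > 0.75:
            v *= 0.5                               # step (4): good model fit
        if f_new < f_old:
            x = x + d                              # accept the step
    return x

# Toy usage: minimize ||x - [1, 2]||^2 with residual r(x) = x - target.
target = np.array([1.0, 2.0])
x_opt = levenberg_marquardt(lambda x: x - target,
                            lambda x: np.eye(2),
                            np.zeros(2))
print(np.round(x_opt, 4))  # close to [1. 2.]
```

The damping update in step (4) is what lets the method interpolate between gradient descent (large v) and Gauss-Newton (small v).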

Figure 5. BP Network Training Process

Note: the test environment is the Windows Vista system with the MATLAB 7.0 platform.

The BP network training process is thus complete. 131 sets of data are selected for the test: the first 22 sample emotions are sacred, the next 37 joyous, the next 39 longing, and the last 33 sorrow. The determination results are shown in Figure 6. In the BP network, four different four-element column vectors represent the four corresponding music emotions: [1,0,0,0] represents sacred; [0,1,0,0] represents joyous; [0,0,1,0] represents longing; [0,0,0,1] represents sorrow. As the figure shows, since the recognition model cannot achieve 100% accuracy, each emotion shows some misjudgments.

Figure 6. Recognition Result of the Neural Network

4. EXPERIMENTAL COMPARISON

We compare the algorithm proposed in this paper with an automatic recognition model based on statistical classification. The statistical model likewise pre-processes the sample data, then adopts the Fisher discriminant method to construct the music emotion recognition model in MATLAB and classifies the same 131 samples; its recognition results are shown in Figure 7.

Figure 7. Recognition Result of the Statistical Classification
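The decoding of the network's four-element column outputs into emotion labels, and the accuracy figures compared below, can be sketched as follows; the 131-sample layout follows the text, while the network outputs in the usage example are hypothetical:

```python
# Sketch of output decoding and scoring for the 131-sample test
# (22 sacred, 37 joyous, 39 longing, 33 sorrow, as in the text).
import numpy as np

LABELS = ["sacred", "joyous", "longing", "sorrow"]
true_labels = (["sacred"] * 22 + ["joyous"] * 37 +
               ["longing"] * 39 + ["sorrow"] * 33)

def decode(outputs):
    """outputs: (n_samples, 4) array of network activations in [0, 1].
    Each row is mapped to the label of its largest component."""
    return [LABELS[i] for i in np.argmax(outputs, axis=1)]

def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical activations for two test samples
preds = decode(np.array([[0.9, 0.1, 0.0, 0.0],
                         [0.1, 0.2, 0.8, 0.3]]))
print(preds)  # ['sacred', 'longing']
```

Scoring the decoded labels against true_labels in this way yields accuracy figures of the kind reported in Figure 8.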

Comparing the two recognition models and collecting statistics on the music emotion recognition results obtained by the BP network model and the statistical classification model gives the comparison shown in Figure 8.

Figure 8. Comparison of the Impact of Pre-processing of Statistical Classification Data on the Experimental Results

As can be seen, the neural network classification result after data processing (accuracy of 80.9%) is significantly better than the result before data processing (71%). Although the data has been processed, actual error remains inevitable; an accuracy of 80.9% is therefore a quite good recognition result. On the other hand, after the data is processed, the classification result of the Fisher discriminant method is about 2% higher than before processing. The following conclusions can be drawn:

(1) Since there is error in the data, the data points of the various emotions are interwoven in the vector space. Because projection is a linear transformation, projecting the data onto a vector with the Fisher discriminant method has a considerable impact on its accuracy.

(2) The Fisher discriminant method is in fact suited to linearly separable problems, which makes it similar to a linear neural network, whereas the BP network can also handle complicated nonlinear problems very well; as long as the data error is not too large, the classification results are satisfactory.

(3) Regarding fault tolerance, an error in the data has a certain influence on the Fisher classification method but almost no impact on the BP network, because the destruction of a few neurons does not affect the characteristics of the overall network; for data that contains noise, the BP network is therefore more competent.

(4) The BP network has relatively strong adaptive capacity.
(5) The neural network adopts fuzzy logic to simulate the way people think; its application to fuzzy classification problems such as emotion classification therefore has more reference value than statistical methods.

(6) Regarding computation time, since the neural network is a numerical iterative method, it is inevitably slower than statistical classification, but for problems whose scale is not particularly large, the time is acceptable.

5. CONCLUSION

From the perspective of emotion computing, this paper analyzed emotion recognition and its information processing characteristics with artificial neural networks, and proposed the research idea of applying a BP network to music emotion classification and recognition. The paper analyzed the acoustic principles and composition of MIDI music files as the basis for extracting the basic music features, and summarized the development, features, models, operating principles and learning algorithms of neural networks as the basic recognition method. On this basis, it extracted the melody characteristic parameters of the MIDI music files, designed a BP network model suitable for music emotion recognition, realized music emotion recognition based on the BP network, and verified the algorithm on the samples.

REFERENCES

Agustus J.L., Mahoney C.J., Downey L.E., et al. (2015) Functional MRI of Music Emotion Processing in Frontotemporal Dementia, Annals of the New York Academy of Sciences, 1337(1).
Balkwill L.L., Thompson W.F., Matsunaga R.I. (2004) Recognition of Emotion in Japanese, Western, and Hindustani Music by Japanese Listeners, Japanese Psychological Research, 46(4).
Fritz T., Jentschke S., Gosselin N., et al. (2009) Universal Recognition of Three Basic Emotions in Music, Current Biology, 19(7).

Gosselin N., Peretz I., Johnsen E., et al. (2014) Amygdala Damage Impairs Emotion Recognition from Music, Neuropsychologia, 52(2).
Gosselin N., Peretz I., Noulhiane M., et al. (2015) Impaired Recognition of Scary Music Following Unilateral Temporal Lobe Excision, Brain, 138(3).
Han B.J., Rho S., Jun S., et al. (2010) Music Emotion Classification and Context-based Music Recommendation, Multimedia Tools and Applications, 47(3).
Juslin P.N., Laukka P. (2003) Communication of Emotions in Vocal Expression and Music Performance: Different Channels, Same Code, Psychological Bulletin, 129(5).
Kim J., André E. (2008) Emotion Recognition Based on Physiological Changes in Music Listening, IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(12).
Koelsch S., Fritz T., Müller K., et al. (2009) Investigating Emotion with Music: An fMRI Study, Human Brain Mapping, 30(3).
Peretz I., Gagnon L., Bouchard B. (1998) Music and Emotion: Perceptual Determinants, Immediacy, and Isolation after Brain Damage, Cognition, 68(2).
Wu B., Zhong E., Horner A., et al. (2014) Music Emotion Recognition by Multi-label Multi-layer Multi-instance Multi-view Learning, in Proceedings of the ACM International Conference on Multimedia.
Yang D., Lee W.S. (2004) Disambiguating Music Emotion Using Software Agents, in Proceedings of ISMIR.


More information

A Case Based Approach to the Generation of Musical Expression

A Case Based Approach to the Generation of Musical Expression A Case Based Approach to the Generation of Musical Expression Taizan Suzuki Takenobu Tokunaga Hozumi Tanaka Department of Computer Science Tokyo Institute of Technology 2-12-1, Oookayama, Meguro, Tokyo

More information

MELODIC AND RHYTHMIC CONTRASTS IN EMOTIONAL SPEECH AND MUSIC

MELODIC AND RHYTHMIC CONTRASTS IN EMOTIONAL SPEECH AND MUSIC MELODIC AND RHYTHMIC CONTRASTS IN EMOTIONAL SPEECH AND MUSIC Lena Quinto, William Forde Thompson, Felicity Louise Keating Psychology, Macquarie University, Australia lena.quinto@mq.edu.au Abstract Many

More information

Hidden Markov Model based dance recognition

Hidden Markov Model based dance recognition Hidden Markov Model based dance recognition Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić University of Zagreb, Faculty of Electrical Engineering and Computing Unska 3,

More information

MUSI-6201 Computational Music Analysis

MUSI-6201 Computational Music Analysis MUSI-6201 Computational Music Analysis Part 9.1: Genre Classification alexander lerch November 4, 2015 temporal analysis overview text book Chapter 8: Musical Genre, Similarity, and Mood (pp. 151 155)

More information

Supervised Learning in Genre Classification

Supervised Learning in Genre Classification Supervised Learning in Genre Classification Introduction & Motivation Mohit Rajani and Luke Ekkizogloy {i.mohit,luke.ekkizogloy}@gmail.com Stanford University, CS229: Machine Learning, 2009 Now that music

More information

Experiments on musical instrument separation using multiplecause

Experiments on musical instrument separation using multiplecause Experiments on musical instrument separation using multiplecause models J Klingseisen and M D Plumbley* Department of Electronic Engineering King's College London * - Corresponding Author - mark.plumbley@kcl.ac.uk

More information

Speech and Speaker Recognition for the Command of an Industrial Robot

Speech and Speaker Recognition for the Command of an Industrial Robot Speech and Speaker Recognition for the Command of an Industrial Robot CLAUDIA MOISA*, HELGA SILAGHI*, ANDREI SILAGHI** *Dept. of Electric Drives and Automation University of Oradea University Street, nr.

More information

MUSICAL INSTRUMENT RECOGNITION WITH WAVELET ENVELOPES

MUSICAL INSTRUMENT RECOGNITION WITH WAVELET ENVELOPES MUSICAL INSTRUMENT RECOGNITION WITH WAVELET ENVELOPES PACS: 43.60.Lq Hacihabiboglu, Huseyin 1,2 ; Canagarajah C. Nishan 2 1 Sonic Arts Research Centre (SARC) School of Computer Science Queen s University

More information

Music Mood. Sheng Xu, Albert Peyton, Ryan Bhular

Music Mood. Sheng Xu, Albert Peyton, Ryan Bhular Music Mood Sheng Xu, Albert Peyton, Ryan Bhular What is Music Mood A psychological & musical topic Human emotions conveyed in music can be comprehended from two aspects: Lyrics Music Factors that affect

More information

LSTM Neural Style Transfer in Music Using Computational Musicology

LSTM Neural Style Transfer in Music Using Computational Musicology LSTM Neural Style Transfer in Music Using Computational Musicology Jett Oristaglio Dartmouth College, June 4 2017 1. Introduction In the 2016 paper A Neural Algorithm of Artistic Style, Gatys et al. discovered

More information

Expressive performance in music: Mapping acoustic cues onto facial expressions

Expressive performance in music: Mapping acoustic cues onto facial expressions International Symposium on Performance Science ISBN 978-94-90306-02-1 The Author 2011, Published by the AEC All rights reserved Expressive performance in music: Mapping acoustic cues onto facial expressions

More information

A Framework for Segmentation of Interview Videos

A Framework for Segmentation of Interview Videos A Framework for Segmentation of Interview Videos Omar Javed, Sohaib Khan, Zeeshan Rasheed, Mubarak Shah Computer Vision Lab School of Electrical Engineering and Computer Science University of Central Florida

More information

The Design of Teaching Experiment System Based on Virtual Instrument Technology. Dayong Huo

The Design of Teaching Experiment System Based on Virtual Instrument Technology. Dayong Huo 3rd International Conference on Management, Education, Information and Control (MEICI 2015) The Design of Teaching Experiment System Based on Virtual Instrument Technology Dayong Huo Department of Physics,

More information

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Kazuyoshi Yoshii, Masataka Goto and Hiroshi G. Okuno Department of Intelligence Science and Technology National

More information

Analysis of local and global timing and pitch change in ordinary

Analysis of local and global timing and pitch change in ordinary Alma Mater Studiorum University of Bologna, August -6 6 Analysis of local and global timing and pitch change in ordinary melodies Roger Watt Dept. of Psychology, University of Stirling, Scotland r.j.watt@stirling.ac.uk

More information

Detection of Panoramic Takes in Soccer Videos Using Phase Correlation and Boosting

Detection of Panoramic Takes in Soccer Videos Using Phase Correlation and Boosting Detection of Panoramic Takes in Soccer Videos Using Phase Correlation and Boosting Luiz G. L. B. M. de Vasconcelos Research & Development Department Globo TV Network Email: luiz.vasconcelos@tvglobo.com.br

More information

Type-2 Fuzzy Logic Sensor Fusion for Fire Detection Robots

Type-2 Fuzzy Logic Sensor Fusion for Fire Detection Robots Proceedings of the 2 nd International Conference of Control, Dynamic Systems, and Robotics Ottawa, Ontario, Canada, May 7 8, 2015 Paper No. 187 Type-2 Fuzzy Logic Sensor Fusion for Fire Detection Robots

More information

Outline. Why do we classify? Audio Classification

Outline. Why do we classify? Audio Classification Outline Introduction Music Information Retrieval Classification Process Steps Pitch Histograms Multiple Pitch Detection Algorithm Musical Genre Classification Implementation Future Work Why do we classify

More information

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Introduction In this project we were interested in extracting the melody from generic audio files. Due to the

More information

Singer Traits Identification using Deep Neural Network

Singer Traits Identification using Deep Neural Network Singer Traits Identification using Deep Neural Network Zhengshan Shi Center for Computer Research in Music and Acoustics Stanford University kittyshi@stanford.edu Abstract The author investigates automatic

More information

DISTRIBUTION STATEMENT A 7001Ö

DISTRIBUTION STATEMENT A 7001Ö Serial Number 09/678.881 Filing Date 4 October 2000 Inventor Robert C. Higgins NOTICE The above identified patent application is available for licensing. Requests for information should be addressed to:

More information

Opening musical creativity to non-musicians

Opening musical creativity to non-musicians Opening musical creativity to non-musicians Fabio Morreale Experiential Music Lab Department of Information Engineering and Computer Science University of Trento, Italy Abstract. This paper gives an overview

More information

Automatic Polyphonic Music Composition Using the EMILE and ABL Grammar Inductors *

Automatic Polyphonic Music Composition Using the EMILE and ABL Grammar Inductors * Automatic Polyphonic Music Composition Using the EMILE and ABL Grammar Inductors * David Ortega-Pacheco and Hiram Calvo Centro de Investigación en Computación, Instituto Politécnico Nacional, Av. Juan

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

MODELING MUSICAL MOOD FROM AUDIO FEATURES AND LISTENING CONTEXT ON AN IN-SITU DATA SET

MODELING MUSICAL MOOD FROM AUDIO FEATURES AND LISTENING CONTEXT ON AN IN-SITU DATA SET MODELING MUSICAL MOOD FROM AUDIO FEATURES AND LISTENING CONTEXT ON AN IN-SITU DATA SET Diane Watson University of Saskatchewan diane.watson@usask.ca Regan L. Mandryk University of Saskatchewan regan.mandryk@usask.ca

More information

Brain.fm Theory & Process

Brain.fm Theory & Process Brain.fm Theory & Process At Brain.fm we develop and deliver functional music, directly optimized for its effects on our behavior. Our goal is to help the listener achieve desired mental states such as

More information

Normalized Cumulative Spectral Distribution in Music

Normalized Cumulative Spectral Distribution in Music Normalized Cumulative Spectral Distribution in Music Young-Hwan Song, Hyung-Jun Kwon, and Myung-Jin Bae Abstract As the remedy used music becomes active and meditation effect through the music is verified,

More information

Research on the Development of Education Level of University Sports Aesthetics Based on AHP

Research on the Development of Education Level of University Sports Aesthetics Based on AHP OPEN ACCESS EURASIA Journal of Mathematics Science and Technology Education ISSN: 1305-8223 (online) 1305-8215 (print) 2017 13(8):5133-5140 DOI: 10.12973/eurasia.2017.00988a Research on the Development

More information

Music Information Retrieval with Temporal Features and Timbre

Music Information Retrieval with Temporal Features and Timbre Music Information Retrieval with Temporal Features and Timbre Angelina A. Tzacheva and Keith J. Bell University of South Carolina Upstate, Department of Informatics 800 University Way, Spartanburg, SC

More information

The Design of Efficient Viterbi Decoder and Realization by FPGA

The Design of Efficient Viterbi Decoder and Realization by FPGA Modern Applied Science; Vol. 6, No. 11; 212 ISSN 1913-1844 E-ISSN 1913-1852 Published by Canadian Center of Science and Education The Design of Efficient Viterbi Decoder and Realization by FPGA Liu Yanyan

More information

Wipe Scene Change Detection in Video Sequences

Wipe Scene Change Detection in Video Sequences Wipe Scene Change Detection in Video Sequences W.A.C. Fernando, C.N. Canagarajah, D. R. Bull Image Communications Group, Centre for Communications Research, University of Bristol, Merchant Ventures Building,

More information

Keywords Separation of sound, percussive instruments, non-percussive instruments, flexible audio source separation toolbox

Keywords Separation of sound, percussive instruments, non-percussive instruments, flexible audio source separation toolbox Volume 4, Issue 4, April 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Investigation

More information

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed,

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed, VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS O. Javed, S. Khan, Z. Rasheed, M.Shah {ojaved, khan, zrasheed, shah}@cs.ucf.edu Computer Vision Lab School of Electrical Engineering and Computer

More information

Characterization and improvement of unpatterned wafer defect review on SEMs

Characterization and improvement of unpatterned wafer defect review on SEMs Characterization and improvement of unpatterned wafer defect review on SEMs Alan S. Parkes *, Zane Marek ** JEOL USA, Inc. 11 Dearborn Road, Peabody, MA 01960 ABSTRACT Defect Scatter Analysis (DSA) provides

More information

Subjective evaluation of common singing skills using the rank ordering method

Subjective evaluation of common singing skills using the rank ordering method lma Mater Studiorum University of ologna, ugust 22-26 2006 Subjective evaluation of common singing skills using the rank ordering method Tomoyasu Nakano Graduate School of Library, Information and Media

More information

Interactive Classification of Sound Objects for Polyphonic Electro-Acoustic Music Annotation

Interactive Classification of Sound Objects for Polyphonic Electro-Acoustic Music Annotation for Polyphonic Electro-Acoustic Music Annotation Sebastien Gulluni 2, Slim Essid 2, Olivier Buisson, and Gaël Richard 2 Institut National de l Audiovisuel, 4 avenue de l Europe 94366 Bry-sur-marne Cedex,

More information

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring 2009 Week 6 Class Notes Pitch Perception Introduction Pitch may be described as that attribute of auditory sensation in terms

More information

Deep Neural Networks Scanning for patterns (aka convolutional networks) Bhiksha Raj

Deep Neural Networks Scanning for patterns (aka convolutional networks) Bhiksha Raj Deep Neural Networks Scanning for patterns (aka convolutional networks) Bhiksha Raj 1 Story so far MLPs are universal function approximators Boolean functions, classifiers, and regressions MLPs can be

More information

Robert Alexandru Dobre, Cristian Negrescu

Robert Alexandru Dobre, Cristian Negrescu ECAI 2016 - International Conference 8th Edition Electronics, Computers and Artificial Intelligence 30 June -02 July, 2016, Ploiesti, ROMÂNIA Automatic Music Transcription Software Based on Constant Q

More information

Detection and demodulation of non-cooperative burst signal Feng Yue 1, Wu Guangzhi 1, Tao Min 1

Detection and demodulation of non-cooperative burst signal Feng Yue 1, Wu Guangzhi 1, Tao Min 1 International Conference on Applied Science and Engineering Innovation (ASEI 2015) Detection and demodulation of non-cooperative burst signal Feng Yue 1, Wu Guangzhi 1, Tao Min 1 1 China Satellite Maritime

More information

Color Image Compression Using Colorization Based On Coding Technique

Color Image Compression Using Colorization Based On Coding Technique Color Image Compression Using Colorization Based On Coding Technique D.P.Kawade 1, Prof. S.N.Rawat 2 1,2 Department of Electronics and Telecommunication, Bhivarabai Sawant Institute of Technology and Research

More information

LOW-COMPLEXITY BIG VIDEO DATA RECORDING ALGORITHMS FOR URBAN SURVEILLANCE SYSTEMS

LOW-COMPLEXITY BIG VIDEO DATA RECORDING ALGORITHMS FOR URBAN SURVEILLANCE SYSTEMS LOW-COMPLEXITY BIG VIDEO DATA RECORDING ALGORITHMS FOR URBAN SURVEILLANCE SYSTEMS Ling Hu and Qiang Ni School of Computing and Communications, Lancaster University, LA1 4WA, UK ABSTRACT Big Video data

More information

Sudhanshu Gautam *1, Sarita Soni 2. M-Tech Computer Science, BBAU Central University, Lucknow, Uttar Pradesh, India

Sudhanshu Gautam *1, Sarita Soni 2. M-Tech Computer Science, BBAU Central University, Lucknow, Uttar Pradesh, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Artificial Intelligence Techniques for Music Composition

More information

Analysis on the Value of Inner Music Hearing for Cultivation of Piano Learning

Analysis on the Value of Inner Music Hearing for Cultivation of Piano Learning Cross-Cultural Communication Vol. 12, No. 6, 2016, pp. 65-69 DOI:10.3968/8652 ISSN 1712-8358[Print] ISSN 1923-6700[Online] www.cscanada.net www.cscanada.org Analysis on the Value of Inner Music Hearing

More information

Speech Recognition and Signal Processing for Broadcast News Transcription

Speech Recognition and Signal Processing for Broadcast News Transcription 2.2.1 Speech Recognition and Signal Processing for Broadcast News Transcription Continued research and development of a broadcast news speech transcription system has been promoted. Universities and researchers

More information

Chords not required: Incorporating horizontal and vertical aspects independently in a computer improvisation algorithm

Chords not required: Incorporating horizontal and vertical aspects independently in a computer improvisation algorithm Georgia State University ScholarWorks @ Georgia State University Music Faculty Publications School of Music 2013 Chords not required: Incorporating horizontal and vertical aspects independently in a computer

More information

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 AN HMM BASED INVESTIGATION OF DIFFERENCES BETWEEN MUSICAL INSTRUMENTS OF THE SAME TYPE PACS: 43.75.-z Eichner, Matthias; Wolff, Matthias;

More information

Automatic Piano Music Transcription

Automatic Piano Music Transcription Automatic Piano Music Transcription Jianyu Fan Qiuhan Wang Xin Li Jianyu.Fan.Gr@dartmouth.edu Qiuhan.Wang.Gr@dartmouth.edu Xi.Li.Gr@dartmouth.edu 1. Introduction Writing down the score while listening

More information

Journal Citation Reports on the Web. Don Sechler Customer Education Science and Scholarly Research

Journal Citation Reports on the Web. Don Sechler Customer Education Science and Scholarly Research Journal Citation Reports on the Web Don Sechler Customer Education Science and Scholarly Research don.sechler@thomsonreuters.com Introduction JCR distills citation trend data for over 10,000 journals from

More information

Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis

Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis Fengyan Wu fengyanyy@163.com Shutao Sun stsun@cuc.edu.cn Weiyao Xue Wyxue_std@163.com Abstract Automatic extraction of

More information

Voice & Music Pattern Extraction: A Review

Voice & Music Pattern Extraction: A Review Voice & Music Pattern Extraction: A Review 1 Pooja Gautam 1 and B S Kaushik 2 Electronics & Telecommunication Department RCET, Bhilai, Bhilai (C.G.) India pooja0309pari@gmail.com 2 Electrical & Instrumentation

More information

Predicting Performance of PESQ in Case of Single Frame Losses

Predicting Performance of PESQ in Case of Single Frame Losses Predicting Performance of PESQ in Case of Single Frame Losses Christian Hoene, Enhtuya Dulamsuren-Lalla Technical University of Berlin, Germany Fax: +49 30 31423819 Email: hoene@ieee.org Abstract ITU s

More information

The Inspiration of Folk Fine Arts based on Common Theoretical Model to Modern Art Design

The Inspiration of Folk Fine Arts based on Common Theoretical Model to Modern Art Design Abstract The Inspiration of Folk Fine Arts based on Common Theoretical Model to Modern Art Design Wenquan Wang Yanan University Art Institute of LuXun, Yan an 716000, China Cultural connotation and humanity

More information

Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes

Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes hello Jay Biernat Third author University of Rochester University of Rochester Affiliation3 words jbiernat@ur.rochester.edu author3@ismir.edu

More information

Machine Vision System for Color Sorting Wood Edge-Glued Panel Parts

Machine Vision System for Color Sorting Wood Edge-Glued Panel Parts Machine Vision System for Color Sorting Wood Edge-Glued Panel Parts Q. Lu, S. Srikanteswara, W. King, T. Drayer, R. Conners, E. Kline* The Bradley Department of Electrical and Computer Eng. *Department

More information

A Novel Video Compression Method Based on Underdetermined Blind Source Separation

A Novel Video Compression Method Based on Underdetermined Blind Source Separation A Novel Video Compression Method Based on Underdetermined Blind Source Separation Jing Liu, Fei Qiao, Qi Wei and Huazhong Yang Abstract If a piece of picture could contain a sequence of video frames, it

More information

6.UAP Project. FunPlayer: A Real-Time Speed-Adjusting Music Accompaniment System. Daryl Neubieser. May 12, 2016

6.UAP Project. FunPlayer: A Real-Time Speed-Adjusting Music Accompaniment System. Daryl Neubieser. May 12, 2016 6.UAP Project FunPlayer: A Real-Time Speed-Adjusting Music Accompaniment System Daryl Neubieser May 12, 2016 Abstract: This paper describes my implementation of a variable-speed accompaniment system that

More information

Automatic Rhythmic Notation from Single Voice Audio Sources

Automatic Rhythmic Notation from Single Voice Audio Sources Automatic Rhythmic Notation from Single Voice Audio Sources Jack O Reilly, Shashwat Udit Introduction In this project we used machine learning technique to make estimations of rhythmic notation of a sung

More information

DEVELOPMENT OF MIDI ENCODER "Auto-F" FOR CREATING MIDI CONTROLLABLE GENERAL AUDIO CONTENTS

DEVELOPMENT OF MIDI ENCODER Auto-F FOR CREATING MIDI CONTROLLABLE GENERAL AUDIO CONTENTS DEVELOPMENT OF MIDI ENCODER "Auto-F" FOR CREATING MIDI CONTROLLABLE GENERAL AUDIO CONTENTS Toshio Modegi Research & Development Center, Dai Nippon Printing Co., Ltd. 250-1, Wakashiba, Kashiwa-shi, Chiba,

More information

Music Segmentation Using Markov Chain Methods

Music Segmentation Using Markov Chain Methods Music Segmentation Using Markov Chain Methods Paul Finkelstein March 8, 2011 Abstract This paper will present just how far the use of Markov Chains has spread in the 21 st century. We will explain some

More information

Topics in Computer Music Instrument Identification. Ioanna Karydi

Topics in Computer Music Instrument Identification. Ioanna Karydi Topics in Computer Music Instrument Identification Ioanna Karydi Presentation overview What is instrument identification? Sound attributes & Timbre Human performance The ideal algorithm Selected approaches

More information

Design of Fault Coverage Test Pattern Generator Using LFSR

Design of Fault Coverage Test Pattern Generator Using LFSR Design of Fault Coverage Test Pattern Generator Using LFSR B.Saritha M.Tech Student, Department of ECE, Dhruva Institue of Engineering & Technology. Abstract: A new fault coverage test pattern generator

More information

Comparison of Dictionary-Based Approaches to Automatic Repeating Melody Extraction

Comparison of Dictionary-Based Approaches to Automatic Repeating Melody Extraction Comparison of Dictionary-Based Approaches to Automatic Repeating Melody Extraction Hsuan-Huei Shih, Shrikanth S. Narayanan and C.-C. Jay Kuo Integrated Media Systems Center and Department of Electrical

More information

TERRESTRIAL broadcasting of digital television (DTV)

TERRESTRIAL broadcasting of digital television (DTV) IEEE TRANSACTIONS ON BROADCASTING, VOL 51, NO 1, MARCH 2005 133 Fast Initialization of Equalizers for VSB-Based DTV Transceivers in Multipath Channel Jong-Moon Kim and Yong-Hwan Lee Abstract This paper

More information

Adaptive Key Frame Selection for Efficient Video Coding

Adaptive Key Frame Selection for Efficient Video Coding Adaptive Key Frame Selection for Efficient Video Coding Jaebum Jun, Sunyoung Lee, Zanming He, Myungjung Lee, and Euee S. Jang Digital Media Lab., Hanyang University 17 Haengdang-dong, Seongdong-gu, Seoul,

More information

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 04, April -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 MUSICAL

More information

Detecting and Analyzing System for the Vibration Comfort of Car Seats Based on LabVIEW

Detecting and Analyzing System for the Vibration Comfort of Car Seats Based on LabVIEW Detecting and Analyzing System for the Vibration Comfort of Car Seats Based on LabVIEW Ying Qiu Key Laboratory of Conveyance and Equipment, Ministry of Education School of Mechanical and Electronical Engineering,

More information

Brain-Computer Interface (BCI)

Brain-Computer Interface (BCI) Brain-Computer Interface (BCI) Christoph Guger, Günter Edlinger, g.tec Guger Technologies OEG Herbersteinstr. 60, 8020 Graz, Austria, guger@gtec.at This tutorial shows HOW-TO find and extract proper signal

More information

Formalizing Irony with Doxastic Logic

Formalizing Irony with Doxastic Logic Formalizing Irony with Doxastic Logic WANG ZHONGQUAN National University of Singapore April 22, 2015 1 Introduction Verbal irony is a fundamental rhetoric device in human communication. It is often characterized

More information

TOWARD AN INTELLIGENT EDITOR FOR JAZZ MUSIC

TOWARD AN INTELLIGENT EDITOR FOR JAZZ MUSIC TOWARD AN INTELLIGENT EDITOR FOR JAZZ MUSIC G.TZANETAKIS, N.HU, AND R.B. DANNENBERG Computer Science Department, Carnegie Mellon University 5000 Forbes Avenue, Pittsburgh, PA 15213, USA E-mail: gtzan@cs.cmu.edu

More information

Various Artificial Intelligence Techniques For Automated Melody Generation

Various Artificial Intelligence Techniques For Automated Melody Generation Various Artificial Intelligence Techniques For Automated Melody Generation Nikahat Kazi Computer Engineering Department, Thadomal Shahani Engineering College, Mumbai, India Shalini Bhatia Assistant Professor,

More information

HUMMING METHOD FOR CONTENT-BASED MUSIC INFORMATION RETRIEVAL

HUMMING METHOD FOR CONTENT-BASED MUSIC INFORMATION RETRIEVAL 12th International Society for Music Information Retrieval Conference (ISMIR 211) HUMMING METHOD FOR CONTENT-BASED MUSIC INFORMATION RETRIEVAL Cristina de la Bandera, Ana M. Barbancho, Lorenzo J. Tardón,

More information

Computational Modelling of Harmony

Computational Modelling of Harmony Computational Modelling of Harmony Simon Dixon Centre for Digital Music, Queen Mary University of London, Mile End Rd, London E1 4NS, UK simon.dixon@elec.qmul.ac.uk http://www.elec.qmul.ac.uk/people/simond

More information

Smart Traffic Control System Using Image Processing

Smart Traffic Control System Using Image Processing Smart Traffic Control System Using Image Processing Prashant Jadhav 1, Pratiksha Kelkar 2, Kunal Patil 3, Snehal Thorat 4 1234Bachelor of IT, Department of IT, Theem College Of Engineering, Maharashtra,

More information

Retrieval of textual song lyrics from sung inputs

Retrieval of textual song lyrics from sung inputs INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Retrieval of textual song lyrics from sung inputs Anna M. Kruspe Fraunhofer IDMT, Ilmenau, Germany kpe@idmt.fraunhofer.de Abstract Retrieving the

More information

ONE SENSOR MICROPHONE ARRAY APPLICATION IN SOURCE LOCALIZATION. Hsin-Chu, Taiwan

ONE SENSOR MICROPHONE ARRAY APPLICATION IN SOURCE LOCALIZATION. Hsin-Chu, Taiwan ICSV14 Cairns Australia 9-12 July, 2007 ONE SENSOR MICROPHONE ARRAY APPLICATION IN SOURCE LOCALIZATION Percy F. Wang 1 and Mingsian R. Bai 2 1 Southern Research Institute/University of Alabama at Birmingham

More information

SDR Implementation of Convolutional Encoder and Viterbi Decoder

SDR Implementation of Convolutional Encoder and Viterbi Decoder SDR Implementation of Convolutional Encoder and Viterbi Decoder Dr. Rajesh Khanna 1, Abhishek Aggarwal 2 Professor, Dept. of ECED, Thapar Institute of Engineering & Technology, Patiala, Punjab, India 1

More information

DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS

DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS Item Type text; Proceedings Authors Habibi, A. Publisher International Foundation for Telemetering Journal International Telemetering Conference Proceedings

More information

INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR NPTEL ONLINE CERTIFICATION COURSE. On Industrial Automation and Control

INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR NPTEL ONLINE CERTIFICATION COURSE. On Industrial Automation and Control INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR NPTEL ONLINE CERTIFICATION COURSE On Industrial Automation and Control By Prof. S. Mukhopadhyay Department of Electrical Engineering IIT Kharagpur Topic Lecture

More information

CS 591 S1 Computational Audio

CS 591 S1 Computational Audio 4/29/7 CS 59 S Computational Audio Wayne Snyder Computer Science Department Boston University Today: Comparing Musical Signals: Cross- and Autocorrelations of Spectral Data for Structure Analysis Segmentation

More information

Embodied music cognition and mediation technology

Embodied music cognition and mediation technology Embodied music cognition and mediation technology Briefly, what it is all about: Embodied music cognition = Experiencing music in relation to our bodies, specifically in relation to body movements, both

More information

Reducing False Positives in Video Shot Detection

Reducing False Positives in Video Shot Detection Reducing False Positives in Video Shot Detection Nithya Manickam Computer Science & Engineering Department Indian Institute of Technology, Bombay Powai, India - 400076 mnitya@cse.iitb.ac.in Sharat Chandran

More information