VECTOR REPRESENTATION OF EMOTION FLOW FOR POPULAR MUSIC

Chia-Hao Chung and Homer Chen
National Taiwan University
Emails: {b99505003, homer}@ntu.edu.tw

ABSTRACT

The flow of emotion expressed by music through time is a useful feature for music information indexing and retrieval. In this paper, we propose a novel vector representation of emotion flow for popular music. It exploits the repetitive verse-chorus structure of popular music and connects a verse (represented by a point) and its corresponding chorus (another point) in the valence-arousal emotion plane. The proposed vector representation gives users a visual snapshot of the emotion flow of a popular song in an intuitive and instant manner, more effective than the point and curve representations of music emotion flow. Because many other genres also have repetitive music structure, the vector representation has a wide range of applications.

Index Terms: Affective content, emotion flow, music emotion representation, music structure.

1. INTRODUCTION

It is commonly agreed that music listening is an appealing experience for most people because music evokes emotion in listeners. As the emotion conveyed by music is central to music listening, there is a strong need for effective extraction and representation of music emotion from the music organization and retrieval perspective. This paper focuses on music emotion representation.

A typical approach to music emotion representation condenses the entire emotion flow of a song into a single emotion. This approach is adopted by most music emotion recognition (MER) systems [1]-[3]. It works by selecting a certain segment from the song and mapping the musical features extracted from the segment to a single emotion. The emotion representation is either a label, such as happy, angry, sad, or relaxed, or the coordinates of a point in, for example, the valence-arousal (VA) emotion plane [4]. The former is a categorical representation, while the latter is a dimensional representation [5]. A user can query songs through either form of single-point music emotion representation, and a music retrieval system responds to the query with songs that match the emotion specified by the user [6], [7].

However, the emotion of a music piece varies as the piece unfolds in time [8]. This dynamic nature has not been fully explored for music emotion representation, perhaps because the emotion flow of music is difficult to qualify or quantify in data collection and model training [1]. The work that comes closest is music emotion tracking [9]-[12], which generates a sequence of points at regular intervals to form an affect curve in the emotion plane [13]. Four examples are shown in Fig. 1, where each curve is generated by dividing a full song into 30-second segments with a 10-second hop size and predicting the VA values of all segments. Each curve depicts the emotion of a song from the beginning to the end. We can see that the variation of music emotion can be quite complex and that a point representation cannot properly capture the dynamics of music emotion.

Fig. 1. Affect curves of four songs in the VA plane, where diamonds indicate the beginning and circles indicate the end of the songs. The black curve is "Smells Like Teen Spirit" by Nirvana. The blue curve is "Are We the Waiting" by Green Day. The green curve is "Dying in the Sun" by The Cranberries. The red curve is "Barriers" by Aereogramme.
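The affect curves in Fig. 1 are obtained by sliding a 30-second window with a 10-second hop over a song and predicting one VA point per segment. The following is a minimal sketch of that segmentation loop; the predict_va callable stands in for the trained valence and arousal regressors described later in Section 4, and its interface is an assumption rather than part of the paper.

```python
import numpy as np

def affect_curve(samples, sr, predict_va, win_s=30.0, hop_s=10.0):
    """Slide a 30-second window with a 10-second hop over a song and predict
    one (valence, arousal) point per segment, forming the affect curve."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    points = []
    for start in range(0, max(len(samples) - win, 0) + 1, hop):
        segment = samples[start:start + win]
        points.append(predict_va(segment, sr))  # assumed to return (valence, arousal) in [-1, 1]
    return np.asarray(points)                   # shape: (num_segments, 2)
```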
The representation of emotion flow for music should be easy to visualize, yet sufficiently informative to convey the dynamics of music emotion. The conventional point representation of music emotion is the simplest one; however, it does not carry any dynamic information about music emotion. On the other hand, the affect curve can fairly show the dynamics of music emotion, but it is too complex for users to specify. Clearly, simplicity and informativeness are two competing criteria, and a certain degree of tradeoff between them is necessary in practice.

It has been reported that the emotion expressed by a music piece is related to music structure. Schubert et al. [14] showed that music emotion flow can be attributed to changes of music structure. Yang et al. [15] reported that the boundaries between contrasting segments of a music piece exhibit rapid changes of VA values. Wang et al. [16] showed that exploiting the music structure of popular music for segment selection improves the performance of an MER system.

Fig. 2. (a) Music structure of "Smells Like Teen Spirit" by Nirvana. (b) The arousal values and (c) the valence values of all 30-second segments of the song.

For popular music, the music structure usually consists of a number of repetitive musical sections [17]. Each musical section refers to a song segment that has its own musical role, such as verse or chorus. As shown in Fig. 2, popular music typically has a repetitive verse-chorus structure, and its emotion flow changes significantly during the transition between verse and chorus sections. The burgeoning evidence of the strong relation between music structure and emotion flow motivates us to develop an effective representation of emotion flow for music retrieval.

The proposed emotion flow representation of a song is a vector in the VA emotion plane, pointing from the emotion of a verse to the emotion of its corresponding chorus. This representation is simple and intuitive, which is made possible by exploiting the repetitive property of the music structure of popular music. We focus on popular music in this paper because it has perhaps the largest user base on a daily basis and because its structure normally falls within a finite set of well-known patterns [18]-[22]. In summary, the primary contributions of this paper include:

- A study on the music structure of popular music, such as pop, R&B, and rock songs, conducted to demonstrate the repetitive property of the music structure of popular music (Section 2).
- A novel vector representation of emotion flow for popular music, together with a comprehensive comparison against the point and curve representations (Sections 3 and 4).
- A performance study demonstrating the accuracy and effectiveness of the vector representation in capturing the emotion flow of a song (Section 5).

2. MUSIC STRUCTURE OF POPULAR MUSIC

Music is an art form of organized sounds. A popular song can be divided into a number of musical sections, such as introduction (intro), verse, chorus, bridge, instrumental solo, and ending (outro) [18]. Such sections are arranged (possibly repeatedly) in a particular pattern referred to as the musical form. Recovering the musical form is called music structure analysis and can be considered a segmentation process that detects the temporal position and duration of each section [19]. Here, we briefly review the common musical sections and their musical roles.

Intro and outro indicate the beginning and ending sections, respectively, of a song and usually contain only instrumental sounds without singing voice and lyrics. However, not every song has an intro or outro. For example, composers may place a verse or a chorus at the beginning or the end of a song to make the song sound special. The sections corresponding to verse or chorus normally express a flow of emotion as the music unfolds. The verse usually has low energy, and it is where the story of the song is narrated. Compared to the verse, the chorus is emotive and leaves a significant impression on listeners [20].
Other structural elements, such as the bridge and the instrumental solo, are optional and function as transitional sections that avoid monotonous composition and make the song more colorful. A bridge is a transition between other types of sections, and an instrumental solo is a transitional section consisting predominantly of instrumental sounds.

To investigate music structure, we conduct an analysis of NTUMIR-60, a dataset consisting of 60 English popular songs [23]. Because state-of-the-art automatic music structure analysis is not as accurate as expected [19], [21], we perform the analysis manually. The results are shown in Table 1. We can see that verse and chorus indeed make up a large portion of a song and on average appear 3.13 and 2.37 times per song, respectively. This is consistent with the findings of musicologists that verse and chorus constitute a widely used musical form (the verse-chorus form) for songwriters of popular music [20]. It also suggests that verse and chorus are the most memorable sections of a song [22] and represent its main affective content. The corresponding emotion flow gives listeners an affective sensation.

Table 1. Music structure statistics of the 60 English popular songs of the NTUMIR-60 dataset.

                        Intro   Verse   Chorus   Others   Outro
  Times per song        0.93    3.13    2.37     1.28     0.48
  Proportion to song    0.09    0.44    0.29     0.11     0.07
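Statistics such as those in Table 1 can be computed directly from structural annotations by counting section occurrences per song and summing section durations relative to total song length. The sketch below illustrates the computation under an assumed annotation format of (label, start, end) tuples per song; it is not the authors' tooling.

```python
from collections import defaultdict

# Hypothetical annotation format: one list of (label, start_sec, end_sec) per song.
songs = {
    "song_001": [("intro", 0, 12), ("verse", 12, 45), ("chorus", 45, 75),
                 ("verse", 75, 105), ("chorus", 105, 140), ("outro", 140, 150)],
    # ... remaining annotated songs
}

counts = defaultdict(float)     # total occurrences of each section label
durations = defaultdict(float)  # total duration (seconds) of each section label
total_duration = 0.0

for sections in songs.values():
    total_duration += sections[-1][2] - sections[0][1]
    for label, start, end in sections:
        counts[label] += 1
        durations[label] += end - start

num_songs = len(songs)
for label in counts:
    print(f"{label:8s}  times/song = {counts[label] / num_songs:.2f}  "
          f"proportion = {durations[label] / total_duration:.2f}")
```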

3. MUSIC EMOTION REPRESENTATION

In either the categorical or the dimensional approach, the typical representation of music emotion describes the affective content of a song by a single emotion. The categorical approach describes emotion using a finite number of discrete affective terms [24], [25], whereas the dimensional approach defines emotion in a continuous space, such as the VA plane [26], [27]. In this section, we first review the point and curve representations in the dimensional approach and then present the vector representation in detail.

3.1. Point and curve representation

In the dimensional approach, the VA values of a music segment can be predicted from the features extracted from the segment through a regression formulation of the MER problem [26]. The emotion of the music segment is then represented by a point in the VA plane. Given all the music segments of a song, one may select one of them to represent the whole song. This gives rise to the single-point representation of music emotion in the VA plane, and a user only has to specify the coordinates of a point in the VA plane to retrieve the corresponding song. Although this method provides an intuitive way for music retrieval, as discussed in Section 1, it is impossible to represent the emotion flow of a whole song by a single point in the VA plane. In addition, which music segment really represents the entire song is difficult to determine automatically.

By dividing a song into a number of segments and predicting the VA values of each segment [11], [12], the collection of VA points forms an affect curve of the song in the VA plane. One may also represent the valence and arousal of the song separately, each as a function of time. Although such affect curves can indeed show the emotion flow of a song, the representation is too complex to be adopted in a music retrieval system, because most users are unable to precisely specify the affect curve of a song, even a familiar one. In addition, how to measure the similarity (or distance) between two affect curves of different lengths is an open issue. Therefore, a simpler approach is desirable.

3.2. Vector representation

By exploiting the repetitive property of the music structure of popular music, we can represent the characteristics of emotion flow in a much simpler way than the affect curve representation. As discussed in Section 2, the verse-chorus form is a common music structure of popular music and has a strong relation to the emotion flow of a song. Therefore, we leverage it to construct the emotion flow representation of a song. The resulting representation is a vector pointing from a verse to its corresponding chorus in the VA emotion plane, as illustrated in Fig. 3.

Fig. 3. Illustration of the vector representation of music emotion flow. The two terminals of the vector represent a verse and its corresponding chorus in the VA plane.

Besides the positional information of the verses and choruses in the VA plane, the vector representation indicates the direction and strength of the emotion flow of a song. Therefore, the vector representation is more informative than the point representation. Since the two terminals of a vector represent the emotions of a verse and its corresponding chorus, this representation is more intuitive and simpler to use than the affect curve, which does not explicitly present the structural information of a song. Indeed, the vector representation expresses the main emotion flow of a song as characterized by the verse-chorus form. Table 2 shows a qualitative comparison of the point representation, the affect curve representation, and the proposed vector representation. We can see that the vector representation of emotion flow is novel, simple, and intuitive.

Table 2. A comparison of the point, the curve, and the proposed vector representations.*

                             Point    Curve    Vector
  Locational information       X        X        X
  Dynamic information                   X        X
  Structural information                         X
  Complexity                  Low      High    Medium

  * A checked box (X) means yes.
Users can easily search for songs by specifying a vector in the VA plane as the query, and a music retrieval system can quickly respond to the query according to the proximity of each candidate song to the query vector. In practice, a set of candidate songs can be generated and ordered by proximity when presented to the user. With this representation of music emotion flow, many innovative music retrieval mechanisms can be developed to match the needs of a specific application.

Although we focus on popular music in this paper, the repetitive property of music structure can also be found in other genres, such as the sonata form and the rondo form of classical music [28]. The vector representation is well suited to visualizing the emotion flow of such music as well.
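To make the proximity-based ranking of Section 3.2 concrete, here is a minimal sketch that orders catalog songs by a distance to a query emotion vector. Combining endpoint distance with angular difference, and the 50/50 weighting, are illustrative assumptions; the paper does not prescribe a particular retrieval metric.

```python
import numpy as np

def vector_distance(q, v, w_angle=0.5):
    """Distance between two emotion vectors, each given as (verse_point, chorus_point)
    in the VA plane. Combines endpoint (Euclidean) distance with angular difference;
    the weighting is an illustrative choice, not the paper's specification."""
    q_verse, q_chorus = map(np.asarray, q)
    v_verse, v_chorus = map(np.asarray, v)
    endpoint = np.linalg.norm(q_verse - v_verse) + np.linalg.norm(q_chorus - v_chorus)
    dq, dv = q_chorus - q_verse, v_chorus - v_verse
    cos_sim = np.dot(dq, dv) / (np.linalg.norm(dq) * np.linalg.norm(dv) + 1e-9)
    return (1 - w_angle) * endpoint + w_angle * (1 - cos_sim)

def rank_songs(query_vector, catalog):
    """catalog: {song_id: (verse_point, chorus_point)}; returns ids sorted by proximity."""
    return sorted(catalog, key=lambda sid: vector_distance(query_vector, catalog[sid]))
```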

4. IMPLEMENTATION

The MER system described in [26] serves as the platform to generate the VA values of musical sections (segments). The MER system consists of two main steps, as shown in Fig. 4. The first step performs regression model training, and the second step takes musical sections as inputs and generates their VA values. The details of regression model training and vector representation generation are described in this section.

Fig. 4. Overview of an MER system.

4.1. Regression model training

Adopting the dimensional approach for MER, we define valence and arousal as real values in [-1, 1] and formulate the prediction of VA values as a regression problem. Denote the input training data by (x_i, y_i), where 1 ≤ i ≤ N, x_i is the feature vector of the ith input data, and y_i is the real value to be predicted for the ith input data. A regression model (regressor) is trained by minimizing the mean squared difference between the prediction and the annotated value [26].

The NTUMIR-60 dataset, composed of 60 English popular songs, is used for training and testing. For fair comparison, each song is converted to a uniform format (22,050 Hz, 16-bit, mono PCM WAV) and normalized to the same volume level. Then each song is trimmed manually to a 30-second segment for the subjective test and the feature extraction. In the subjective test, each segment is annotated by 40 participants, and the mean of the annotated VA values is used as the ground truth of the segment. The MIRToolbox [29] is then applied to extract 177 features covering five types of acoustic features: two dynamic features (the mean and standard deviation of root-mean-square energy), five rhythmic features (fluctuation peak, fluctuation centroid, tempo, pulse clarity, and event density), 142 spectral features (the mean and standard deviation of centroid, brightness, spread, skewness, kurtosis, rolloff 85%, rolloff 95%, entropy, flatness, roughness, irregularity, 20 MFCCs, 20 delta MFCCs, and 20 delta-delta MFCCs), six timbre features (the mean and standard deviation of zero-crossing rate, low energy, and spectral flux), and 22 tonal features (a 12-bin chromagram concatenated with the mean and standard deviation of chromagram peak, chromagram centroid, key clarity, HCDF, and mode). The quality of NTUMIR-60 for MER is evaluated and reported in [23].

The regression models of arousal and valence are trained independently. For accuracy, support vector regression (SVR) [30], [31] with a radial basis kernel function is adopted to train the regressors. A grid search is applied to find the best kernel parameter γ and the best penalty parameter C [32], where γ ∈ {10^-4, 10^-3, 10^-2, 10^-1} and C ∈ {1, 10^1, 10^2, 10^3, 10^4}. To evaluate the performance of the regressors, ten-fold cross validation is conducted: the whole dataset is randomly divided into 10 parts, nine of them used for training and the remaining one for testing, and the process is repeated 50 times. The average performance in terms of the R-squared value [33] is 0.21 for valence and 0.76 for arousal. These results are comparable to those reported in previous work [23], [26].
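The training procedure of Section 4.1 maps readily onto standard tooling. The sketch below uses scikit-learn as a stand-in for the LIBSVM-based setup in the paper: an RBF-kernel SVR with the same grid of γ and C values, evaluated by R² under repeated ten-fold cross validation. The feature matrix X and annotation vector y (and the file names) are assumptions.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score

# X: (60, 177) feature matrix, y: 60 mean VA annotations (valence or arousal),
# assumed to be precomputed as described in Section 4.1 (hypothetical file names).
X = np.load("features.npy")
y = np.load("valence_labels.npy")

param_grid = {
    "gamma": [1e-4, 1e-3, 1e-2, 1e-1],
    "C": [1, 1e1, 1e2, 1e3, 1e4],
}

# Grid search over the RBF-kernel SVR hyperparameters, selecting by R^2.
search = GridSearchCV(SVR(kernel="rbf"), param_grid, scoring="r2", cv=10)
search.fit(X, y)

# Repeated ten-fold cross validation of the selected model (the paper repeats the
# random ten-fold split 50 times; 5 repeats x 10 folds here as an illustrative stand-in).
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(search.best_estimator_, X, y, scoring="r2", cv=cv)
print(f"best params: {search.best_params_}, mean R^2: {scores.mean():.2f}")
```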
4.2. Generating the vector representation

The audio segmentation method proposed in [34] is applied to segment each song of the NTUMIR-60 dataset. All verses and choruses are manually selected from each song based on the segmentation result, and their VA values are estimated independently. In our current implementation, the vector representation of a song in the VA plane is generated by connecting the point representing the average verse with the point representing the average chorus.

Fig. 5. The proposed vector representation provides an intuitive visualization of music emotion flow in the VA plane. This chart shows the emotion flows of all the songs in the NTUMIR-60 dataset. Each blue diamond represents the emotion of the verses, and each red circle represents the emotion of the choruses, connected to the corresponding verses by a line segment.

Fig. 5 shows the resulting vector representations of all songs of the NTUMIR-60 dataset. We can see that each vector clearly describes the emotion flow of a song. For example, a vector in the first quadrant pointing to the upper right corner indicates that the corresponding song drives listeners toward a positive and exciting feeling, whereas a vector in the second quadrant pointing toward the upper left corner indicates that the song it represents drives listeners toward a negative and aggressive mood. We also see that, for most songs, the arousal value of the representative chorus is higher than that of the corresponding verse; that is, the emotion vectors usually point upward. This reflects the fact that the chorus is typically more exciting than its corresponding verse [20].
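A minimal sketch of the vector construction in Section 4.2: given the VA points predicted for the annotated verse and chorus segments of a song, average each group and connect the two centroids. The example VA values are hypothetical, and the per-segment predictions are assumed to come from regressors like those sketched earlier.

```python
import numpy as np

def emotion_vector(verse_points, chorus_points):
    """verse_points, chorus_points: lists of (valence, arousal) predictions for the
    verse and chorus segments of one song. Returns (tail, head) of the emotion vector,
    i.e. the average verse point and the average chorus point in the VA plane."""
    tail = np.mean(np.asarray(verse_points), axis=0)
    head = np.mean(np.asarray(chorus_points), axis=0)
    return tail, head

# Example: two verses and two choruses of a hypothetical song.
tail, head = emotion_vector([(-0.1, 0.0), (0.0, 0.1)], [(0.2, 0.5), (0.3, 0.6)])
print("verse centroid:", tail, "chorus centroid:", head, "flow:", head - tail)
```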

5. EVALUATION

An experiment is conducted to evaluate the effectiveness of the proposed vector representation of music emotion flow in comparison with two ad hoc methods. The effectiveness of a method is measured in terms of the approximation error between the method and the emotion flow of a song. All songs of the NTUMIR-60 dataset are considered in this experiment.

As discussed in Section 1, the emotion flow of a song is difficult for a subject to specify; therefore, we use the affect curve generated by MER as the ground truth. Specifically, the affect curve of each song is generated by dividing the full song into 30-second segments with a 10-second hop size and predicting the VA values of all segments. Then, a k-means algorithm [35] is applied to partition the collection of VA points into two clusters. The center points of these two clusters are used as the reference to calculate the approximation error of the proposed vector representation and to compare it with that of the two ad hoc methods. The first ad hoc method randomly selects two 30-second segments from a song and constructs a vector representation from them. The second ad hoc method selects the first segment from the 30th to the 60th second of a song and the second segment from the last 60th to the last 30th second of the song. The VA values of the two selected segments are predicted independently.

Two distance measures are considered: Euclidean distance and cosine similarity [36]. The former is applied to compute the difference in length between two vectors, and the latter is applied to compute their angular difference. The experimental results are shown in Table 3. Note that the process of randomly selecting two segments from a song is repeated 100 times, and the average results are presented in the first column of Table 3. Compared with the two ad hoc methods, the vector representation has the smallest approximation error in both Euclidean distance and cosine distance. This shows the effectiveness of the vector representation in capturing the emotion flow of popular music.

Table 3. Results of Euclidean and cosine distances between the ground truth and three different approaches.

                         Random   F30L30 (1)   Vector
  Euclidean distance      0.10      0.10        0.07
  Cosine distance (2)     0.21      0.20        0.14

  (1) F30L30 means that the first segment is from the 30th to the 60th second and the second segment is from the last 60th to the last 30th second of a song.
  (2) Cosine distance is defined as 1 minus cosine similarity.

In Fig. 6, the vector representation of the emotion flow of each song is plotted together with the affect curve of the song and the emotion of each verse and chorus identified for the song. We can see that most vectors are located in the repetitive region of the affect curves. The dangling parts of an affect curve normally correspond to the intro and outro sections of the song and hence are of no concern. We can also see that the verses are located on one side of the affect curve of a song while the choruses are located on the other side. Thus, using the average verse and average chorus for the vector representation can effectively characterize the affect curve and the emotion flow.

Fig. 6. Most vectors (represented by a diamond-circle pair) generated by our method are in the repetitive region of the affect curves (shown in grey). The hollow diamond represents the emotion of a verse, and the hollow circle represents the emotion of a chorus of a song.
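The evaluation in Section 5 can be sketched as follows: cluster the affect-curve points into two groups, take the cluster centers as the reference vector, and score a candidate vector by Euclidean and cosine distances. The paper does not state how the unordered cluster centers are matched to the verse and chorus ends, so the nearest-endpoint matching below is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cosine

def ground_truth_vector(affect_curve):
    """affect_curve: (num_segments, 2) VA points of a full song.
    Returns the two k-means cluster centers used as the reference vector."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(affect_curve)
    return km.cluster_centers_          # shape (2, 2); order is arbitrary

def approximation_error(candidate, centers):
    """candidate: (tail, head) of a candidate emotion vector.
    Matches the unordered cluster centers to the candidate endpoints by nearest
    distance (an assumption), then reports the Euclidean and cosine distances
    between the candidate and reference flow vectors."""
    tail, head = map(np.asarray, candidate)
    c0, c1 = centers
    if np.linalg.norm(tail - c0) + np.linalg.norm(head - c1) > \
       np.linalg.norm(tail - c1) + np.linalg.norm(head - c0):
        c0, c1 = c1, c0
    v_cand, v_ref = head - tail, c1 - c0
    euclidean = np.linalg.norm(v_cand - v_ref)
    cos_dist = cosine(v_cand, v_ref)    # 1 minus cosine similarity
    return euclidean, cos_dist
```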

6. CONCLUSION

In this paper, we have investigated the repetitive property of music structure and described a novel approach that represents the emotion flow of popular music by a vector in the VA plane. The vector emerges from a representative verse of a song and ends at the corresponding chorus. We have also compared the proposed vector representation with the point and curve representations of music emotion and shown that the proposed method is an intuitive and effective representation of emotion flow for popular music. This property of our method is supported by experimental results.

This work is motivated by the increasing need for effective music content representation and analysis in response to the explosive growth of content. With the proposed vector representation, the proximity of emotion flow between two songs can be easily measured, which is essential to music retrieval, and many innovative music retrieval applications can be developed.

REFERENCES

[1] Y.-H. Yang and H. H. Chen, Music Emotion Recognition, CRC Press, 2011.
[2] Y.-H. Yang and H. H. Chen, "Machine recognition of music emotion: A review," ACM Trans. Intell. Syst. Technol., vol. 3, no. 3, article 40, 2012.
[3] Y. E. Kim, E. M. Schmidt, R. Migneco, B. G. Morton, P. Richardson, J. Scott, J. A. Speck, and D. Turnbull, "Music emotion recognition: A state of the art review," in Proc. 11th Int. Soc. Music Inform. Retrieval Conf., pp. 255-266, Utrecht, Netherlands, 2010.
[4] J. A. Russell, "A circumplex model of affect," J. Pers. Soc. Psychol., vol. 39, no. 6, pp. 1161-1178, 1980.
[5] T. Eerola and J. K. Vuoskoski, "A comparison of the discrete and dimensional models of emotion in music," Psychol. Music, vol. 39, no. 1, pp. 18-49, 2010.
[6] X. Zhu, Y.-Y. Shi, H.-G. Kim, and K.-W. Eom, "An integrated music recommendation system," IEEE Trans. Consum. Electron., vol. 53, no. 2, pp. 917-925, 2006.
[7] Y.-H. Yang, Y.-C. Lin, H.-T. Cheng, and H. H. Chen, "Mr. Emo: Music retrieval in the emotion plane," in Proc. ACM Multimedia, pp. 1003-1004, Vancouver, Canada, 2008.
[8] E. Schubert, "Measurement and time series analysis of emotion in music," Ph.D. dissertation, School of Music & Music Education, University of New South Wales, Sydney, Australia, 1999.
[9] L. Lu, D. Liu, and H.-J. Zhang, "Automatic mood detection and tracking of music audio signals," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 1, pp. 5-18, 2006.
[10] M. D. Korhonen, D. A. Clausi, and M. E. Jernigan, "Modeling emotional content of music using system identification," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 36, no. 3, pp. 588-599, 2006.
[11] R. Panda and R. P. Paiva, "Using support vector machines for automatic mood tracking in audio music," in Audio Engineering Soc. Convention 130, London, UK, 2011.
[12] E. M. Schmidt, D. Turnbull, and Y. E. Kim, "Feature selection for content-based, time-varying musical emotion regression," in Proc. ACM Int. Conf. Multimedia Inform. Retrieval, pp. 267-274, Philadelphia, USA, 2010.
[13] A. Hanjalic and L.-Q. Xu, "Affective video content representation and modeling," IEEE Trans. Multimedia, vol. 7, no. 1, pp. 143-154, 2005.
[14] E. Schubert, S. Ferguson, N. Farrar, D. Taylor, and G. E. McPherson, "Continuous response to music using discrete emotion faces," in Proc. 9th Int. Symp. Computer Music Modelling and Retrieval, pp. 1-17, London, UK, 2012.
[15] Y.-H. Yang, C.-C. Liu, and H. H. Chen, "Music emotion classification: A fuzzy approach," in Proc. ACM Multimedia, pp. 81-84, Santa Barbara, USA, 2006.
[16] X. Wang, Y. Wu, X. Chen, and D. Yang, "Enhance popular music emotion regression by importing structure information," in Proc. Asia-Pacific Signal and Inform. Process. Association Annu. Summit and Conf., pp. 1-4, Kaohsiung, Taiwan, 2013.
[17] B. Horner and T. Swiss, Key Terms in Popular Music and Culture, Blackwell Publishing, 1999.
[18] N. C. Maddage, C. Xu, M. S. Kankanhalli, and X. Shao, "Content-based music structure analysis with applications to music semantics understanding," in Proc. ACM Multimedia, pp. 112-119, New York, USA, 2004.
[19] J. Paulus, M. Müller, and A. Klapuri, "Audio-based music structure analysis," in Proc. 11th Int. Soc. Music Inform. Retrieval Conf., pp. 625-636, Utrecht, Netherlands, 2010.
[20] D. Christopher, "Rockin' out: Expressive modulation in verse-chorus form," Music Theory Online, vol. 17, 2011.
[21] J. B. L. Smith, C.-H. Chuan, and E. Chew, "Audio properties of perceived boundaries in music," IEEE Trans. Multimedia, vol. 16, no. 5, pp. 1219-1228, 2014.
[22] M. Cooper and J. Foote, "Summarizing popular music via structural similarity analysis," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 127-130, New Paltz, NY, USA, 2003.
[23] Y.-H. Yang, Y.-F. Su, Y.-C. Lin, and H. H. Chen, "Music emotion recognition: The role of individuality," in Proc. ACM Int. Workshop on Human-centered Multimedia, pp. 13-21, Augsburg, Germany, 2007.
[24] X. Hu, J. S. Downie, C. Laurier, M. Bay, and A. F. Ehmann, "The 2007 MIREX audio mood classification task: Lessons learned," in Proc. 9th Int. Conf. Music Inform. Retrieval, pp. 462-467, Philadelphia, USA, 2008.
[25] C. Laurier, J. Grivolla, and P. Herrera, "Multimodal music mood classification using audio and lyrics," in Proc. IEEE 7th Int. Conf. Machine Learning and Applications, pp. 688-693, San Diego, California, USA, 2008.
[26] Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H. H. Chen, "A regression approach to music emotion recognition," IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 2, pp. 448-457, 2008.
[27] E. M. Schmidt and Y. E. Kim, "Projection of acoustic features to continuous valence-arousal mood labels via regression," in Proc. 10th Int. Soc. Music Inform. Retrieval Conf., Kobe, Japan, 2009.
[28] M. Hickey, "Assessment rubrics for music composition," Music Educators Journal, vol. 85, no. 4, pp. 26-33, 1999.
[29] O. Lartillot and P. Toiviainen, "A MATLAB toolbox for musical feature extraction from audio," in Proc. Int. Conf. Digital Audio Effects, pp. 237-244, Bordeaux, France, 2007.
[30] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Stat. Comput., vol. 14, no. 3, pp. 199-222, 2004.
[31] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, article 27, 2011.
[32] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, "A practical guide to support vector classification," Technical report, National Taiwan University, 2010. Available at: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
[33] A. Sen and M. S. Srivastava, Regression Analysis: Theory, Methods, and Applications, Springer Science & Business Media, 1990.
[34] J. Foote and M. Cooper, "Media segmentation using self-similarity decomposition," in Proc. SPIE Storage and Retrieval for Multimedia Databases, vol. 5021, pp. 167-175, 2003.
[35] S. P. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inform. Theory, vol. 28, no. 2, pp. 129-137, 1982.
[36] L. Lee, "Measures of distributional similarity," in Proc. 37th Annu. Meeting of the Association for Computational Linguistics, pp. 25-32, 1999.