Mood Tracking of Radio Station Broadcasts


Jacek Grekow
Faculty of Computer Science, Bialystok University of Technology,
Wiejska 45A, Bialystok 15-351, Poland
j.grekow@pb.edu.pl

Abstract. This paper presents an example of a system for the analysis of emotions contained within radio broadcasts. We prepared training data, performed feature extraction, and built classifiers for music/speech discrimination and for emotion detection in music. To study changes in emotions, we used recorded broadcasts from 4 selected European radio stations. The collected data allowed us to determine the dominant emotion in the radio broadcasts and to construct maps visualizing the distribution of emotions in time. The obtained results provide a new and interesting view of the emotional content of radio station broadcasts.

Keywords: Emotion detection, Mood tracking, Audio feature extraction, Music information retrieval, Radio broadcasts.

1 Introduction

The number of media outlets is constantly growing. This also applies to radio stations available on the Internet, over satellite, and over the air. On the one hand, the number of opportunities to listen to various radio shows has grown; on the other, choosing the right station has become more difficult. Music information retrieval helps those who listen to the radio mainly for the music: it can detect the genre, the artist, and even the emotion.

Listening to music is a particularly emotional activity. People need a variety of emotions, and music is perfectly suited to provide them. Listening to a radio station throughout the day, whether we want it or not, we are affected by the transmitted emotional content. In this paper, we focus on emotional analysis of the music presented by radio stations. During the course of a radio broadcast, these emotions can take on a variety of shades and change several times with varying intensity. This paper presents a method of tracking changing emotions during the course of a radio broadcast. The collected data allowed us to determine the dominant emotion in the radio broadcast and to construct maps visualizing the distribution of emotions in time.

Music emotion detection studies are mainly based on two popular approaches: categorical or dimensional. The categorical approach [1][2][3][4] describes emotions with a discrete number of classes - affective adjectives. In the dimensional approach [5][6], emotions are described as numerical values of valence

and arousal. In this way, an emotion of a song is represented as a point in an emotion space. In this work, we use the categorical approach.

There are several other studies on the issue of mood tracking [7][8][9]. Lu et al. [3], apart from detecting emotions, also tracked them: they divided the music into several independent segments, each of which contained a homogeneous emotional expression. Mood tracking for indexing and searching multimedia databases was used in the work of Grekow and Ras [10].

One may wonder how long it takes a person to recognize the emotion in a musical composition to which he or she is listening. Bachorik et al. [11] concluded that the majority of music listeners need 8 seconds to identify the emotion of a piece. This time is closely related to the length of the segment used for emotion detection. Xiao et al. [12] found that the segment length should be no shorter than 4 sec and no longer than 16 sec. Studies on the detection of emotion in music use segments of varying lengths. In [4][1], the segment length is 30 sec. A 25-second segment was used by Yang et al. [5]. Fifteen-second clips are used as ground truth data by Schmidt et al. [13]. A segment length of 1 sec was used by Schmidt and Kim in [6]. In this work, we use 6-second segments.

A comprehensive review of the methods that have been proposed for music emotion recognition was prepared by Yang et al. [14]. Another paper surveying state-of-the-art automatic emotion recognition was presented by Kim et al. in [15]. The issue of mood tracking is not limited to music. The paper by Mohammad [16] is an interesting extension of the topic; the author investigated the development of emotions in literary texts. Yeh et al. [17] tracked the continuous changes of emotional expressions in Mandarin speech.

A method of profiling radio stations was described by Lidy and Rauber [18]. They used Self-Organizing Maps to organize the program coverage of radio stations on a two-dimensional map. This approach allows the complete program of a radio station to be profiled.

2 Music Data

To conduct the study of emotion detection in radio stations, we prepared two sets of data. One of them was used for music/speech discrimination, and the other for the detection of emotion in music.

The training data set for music/speech discrimination consisted of 128 wav files, including 64 designated as speech and 64 marked as music. The training data were taken from the publicly accessible MARSYAS data collection (http://marsyas.info/download/data_sets).

The training data set for emotion detection consisted of 374 six-second fragments of different genres of music: classical, jazz, blues, country, disco, hip-hop, metal, pop, reggae, and rock. The tracks were all 22050 Hz mono 16-bit audio files in wav format.

In this research we use 4 emotion classes: energetic-positive, energetic-negative, calm-negative, and calm-positive, presented with their abbreviations in Table 1. They cover the four quadrants of the two-dimensional Thayer model of emotion [19]. They correspond to four basic emotion classes: happy, angry, sad, and relaxed.

Table 1. Description of mood labels

  Abbreviation   Description
  e1             energetic-positive
  e2             energetic-negative
  e3             calm-negative
  e4             calm-positive

Music samples were labeled by the author of this paper, a music expert with a university musical education. Six-second music samples were listened to and then labeled with one of the emotions (e1, e2, e3, e4). When the music expert was not certain which emotion to assign, the sample was rejected. In this way, each labeled file was associated with exactly one emotion. As a result, we obtained 4 sets of files: 101 files labeled e1, 107 files labeled e2, 78 files labeled e3, and 88 files labeled e4.

To study changes in emotions, we used recorded broadcasts from 4 selected European radio stations: Polish Radio Dwojka (Classical/Culture), recorded on 4.01.2014; Polish Radio Trojka (Pop/Rock), recorded on 2.01.2014; BBC Radio 3 (Classical), recorded on 25.12.2013; and ORF OE1 (Information/Culture), recorded on 12.01.2014. For each station we recorded 10 hours beginning at 10 A.M. The recorded broadcasts were segmented into 6-second fragments using sfplay.exe from the MARSYAS software. For example, we obtained 6000 segments from one 10-hour broadcast.
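The segmentation itself was done with MARSYAS tooling; purely as an illustration of this step, a minimal Java sketch that cuts a long WAV recording into consecutive 6-second segments could look as follows (the file names are hypothetical, not those used in the study):

    import javax.sound.sampled.AudioFileFormat;
    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioInputStream;
    import javax.sound.sampled.AudioSystem;
    import java.io.File;

    /** Illustrative sketch only (the study used MARSYAS tools):
     *  splits a long WAV recording into consecutive 6-second segments. */
    public class BroadcastSegmenter {

        public static void main(String[] args) throws Exception {
            File input = new File("broadcast.wav");              // hypothetical input file
            AudioInputStream in = AudioSystem.getAudioInputStream(input);
            AudioFormat fmt = in.getFormat();                    // e.g. 22050 Hz, mono, 16-bit

            long framesPerSegment = (long) (fmt.getFrameRate() * 6); // 6-second window
            long totalFrames = in.getFrameLength();
            int segment = 0;

            for (long start = 0; start + framesPerSegment <= totalFrames; start += framesPerSegment) {
                // Wrap the shared stream so the next write is limited to one segment's frames.
                AudioInputStream chunk = new AudioInputStream(in, fmt, framesPerSegment);
                File out = new File(String.format("segment_%05d.wav", segment++));
                AudioSystem.write(chunk, AudioFileFormat.Type.WAVE, out);
            }
            in.close();
            System.out.println("Wrote " + segment + " six-second segments");
        }
    }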

3 Feature Extraction

For feature extraction, we used the framework for audio analysis in the MARSYAS software, written by George Tzanetakis [20]. MARSYAS is implemented in C++ and can output feature extraction data in ARFF format [21]. With the tool bextract.exe, the following features can be extracted: Zero Crossings, Spectral Centroid, Spectral Flux, Spectral Rolloff, Mel-Frequency Cepstral Coefficients (MFCC), and chroma features - 31 features in total.

For each of these basic features, four statistics were calculated (a code sketch of this aggregation is given at the end of this section):

1. The mean of the mean (calculate the mean over 20 frames, and then calculate the mean of this statistic over the entire segment);
2. The mean of the standard deviation (calculate the standard deviation of the feature over 20 frames, and then calculate the mean of these standard deviations over the entire segment);
3. The standard deviation of the mean (calculate the mean of the feature over 20 frames, and then calculate the standard deviation of these values over the entire segment);
4. The standard deviation of the standard deviation (calculate the standard deviation of the feature over 20 frames, and then calculate the standard deviation of these values over the entire segment).

In this way, we obtained 124 features (31 basic features x 4 statistics). The input data during feature extraction were 6-second segments in wav format (sample rate 22050 Hz, 1 channel, 16 bits). An example of using bextract.exe from the MARSYAS v0.2 package to extract features:

    bextract.exe -fe -sv collection -w outputfile

where collection is a file with the list of input files and outputfile is the name of the output file in ARFF format. For each 6-second file we obtained a single representative feature vector. The obtained vectors were used for building classifiers and for predicting new instances.
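As an illustration of the aggregation described above (not the MARSYAS implementation itself), a minimal Java sketch computing the four statistics from a frames x features matrix of basic feature values might look like this; the class and method names, and the ordering of the output values, are illustrative assumptions:

    /**
     * Illustrative aggregation of per-frame basic features into the four
     * segment-level statistics: mean of means, mean of standard deviations,
     * standard deviation of means, standard deviation of standard deviations.
     * Input: frames[f][k] = value of basic feature k in analysis frame f
     * (assumes at least one full 20-frame window).
     * Output: 4 * numFeatures values per segment (e.g. 4 * 31 = 124).
     */
    public class SegmentStatistics {

        static final int WINDOW = 20; // frames per texture window, as in the paper

        public static double[] aggregate(double[][] frames) {
            int numFeatures = frames[0].length;
            int numWindows = frames.length / WINDOW;

            double[][] winMean = new double[numWindows][numFeatures];
            double[][] winStd  = new double[numWindows][numFeatures];

            // 1) Mean and standard deviation of each feature inside every 20-frame window.
            for (int w = 0; w < numWindows; w++) {
                for (int k = 0; k < numFeatures; k++) {
                    double sum = 0, sumSq = 0;
                    for (int f = w * WINDOW; f < (w + 1) * WINDOW; f++) {
                        sum += frames[f][k];
                        sumSq += frames[f][k] * frames[f][k];
                    }
                    double m = sum / WINDOW;
                    winMean[w][k] = m;
                    winStd[w][k] = Math.sqrt(Math.max(0, sumSq / WINDOW - m * m));
                }
            }

            // 2) Mean and standard deviation of those window statistics over the whole segment.
            double[] out = new double[4 * numFeatures];
            for (int k = 0; k < numFeatures; k++) {
                out[4 * k]     = mean(column(winMean, k));   // mean of means
                out[4 * k + 1] = mean(column(winStd, k));    // mean of standard deviations
                out[4 * k + 2] = std(column(winMean, k));    // std of means
                out[4 * k + 3] = std(column(winStd, k));     // std of standard deviations
            }
            return out;
        }

        static double[] column(double[][] m, int k) {
            double[] c = new double[m.length];
            for (int i = 0; i < m.length; i++) c[i] = m[i][k];
            return c;
        }

        static double mean(double[] v) {
            double s = 0;
            for (double x : v) s += x;
            return s / v.length;
        }

        static double std(double[] v) {
            double m = mean(v), s = 0;
            for (double x : v) s += (x - m) * (x - m);
            return Math.sqrt(s / v.length);
        }
    }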

4 Classification

4.1 The Construction of Classifiers

We built two classifiers, one for music/speech discrimination and the second for emotion detection, using the WEKA package [21]. During the construction of the classifier for music/speech discrimination, we tested the following algorithms: J48, RandomForest, BayesNet, and SMO [22]. The classification results were calculated using 10-fold cross-validation (CV-10). The best accuracy (98%) was achieved using the SMO algorithm, which is an implementation of the support vector machine (SVM) algorithm. The second best algorithm was RandomForest (94% accuracy).

During the construction of the classifier for emotion detection, we tested the following algorithms: J48, RandomForest, BayesNet, IBk (k-NN), and SMO (SVM). The highest accuracy (55.61%) was obtained for the SMO algorithm, trained using a polynomial kernel. The classification results were calculated using 10-fold cross-validation (CV-10). After applying attribute selection (attribute evaluator: WrapperSubsetEval, search method: BestFirst), classifier accuracy improved to 60.69%.
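For reference, an evaluation of this kind can be sketched with the WEKA Java API roughly as follows; the ARFF file name is hypothetical, and the parameters shown are WEKA defaults except where set explicitly:

    import java.util.Random;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.WrapperSubsetEval;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.meta.AttributeSelectedClassifier;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    /** Sketch: 10-fold cross-validation of SMO, with and without wrapper-based attribute selection. */
    public class EmotionClassifierEvaluation {

        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("emotion_segments.arff"); // hypothetical ARFF from bextract
            data.setClassIndex(data.numAttributes() - 1);              // class attribute = e1..e4

            // Plain SMO (SVM, polynomial kernel by default), CV-10.
            SMO smo = new SMO();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(smo, data, 10, new Random(1));
            System.out.printf("SMO accuracy: %.2f%%%n", eval.pctCorrect());

            // SMO wrapped with attribute selection (WrapperSubsetEval + BestFirst).
            WrapperSubsetEval wrapper = new WrapperSubsetEval();
            wrapper.setClassifier(new SMO());
            AttributeSelectedClassifier selected = new AttributeSelectedClassifier();
            selected.setEvaluator(wrapper);
            selected.setSearch(new BestFirst());
            selected.setClassifier(new SMO());

            Evaluation evalSel = new Evaluation(data);
            evalSel.crossValidateModel(selected, data, 10, new Random(1));
            System.out.printf("SMO + attribute selection accuracy: %.2f%%%n", evalSel.pctCorrect());
            System.out.println(evalSel.toMatrixString("Confusion matrix"));
        }
    }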

Table 2. Confusion matrix (rows: actual emotion, columns: classified as)

                a=e1   b=e2   c=e3   d=e4
  a = e1        65     26     5      5
  b = e2        23     77     2      5
  c = e3        13     4      37     24
  d = e4        19     2      19     48

The confusion matrix (Table 2) obtained during classifier evaluation shows that the best recognized emotion was e2 (Precision 0.706, Recall 0.72, F-measure 0.713), followed by e1 (Precision 0.542, Recall 0.644, F-measure 0.588). We may notice a considerable number of mistakes between the emotions of the left and right quadrants of the Thayer model, that is between e1 and e2, and analogously between e3 and e4. This confirms that detection along the arousal axis of Thayer's model is easier: fewer mistakes are made between the top and bottom quadrants. At the same time, recognition of emotions along the valence axis (positive-negative) is more difficult.

4.2 Analysis of Recordings

During the analysis of the recorded radio broadcasts, we conducted a two-phase classification. The recorded radio program was divided into 6-second segments, and for each segment we extracted a feature vector. This feature vector was first used to detect whether the given segment is speech or music. If the current segment was music, we used the second classifier to predict what type of emotion it contained. For feature extraction, file segmentation, use of the classifiers to predict new instances, and visualization of results, we wrote a Java application that connected different software products: MARSYAS, MATLAB, and the WEKA package. (An illustrative sketch of the two-phase procedure and of the tallying of results is given after Table 3.)

5 Results of Mood Tracking in Radio Stations

The percentages of speech, music, and emotions in music obtained during the segment classification of the 10-hour broadcasts of the four radio stations are presented in Table 3. On the basis of these results, radio stations can be compared in two ways: first, by the amount of music and speech in the broadcasts, and second, by the occurrence of individual emotions.

Table 3. Percentage of speech, music, and emotion in music in 10-hour broadcasts of four radio stations

                PR Dwojka   PR Trojka   BBC Radio 3   ORF OE1
  speech        59.37%      73.35%      32.25%        69.10%
  music         40.63%      26.65%      67.75%        30.90%
  e1            4.78%       4.35%       2.43%         2.48%
  e2            5.35%       14.43%      1.00%         0.92%
  e3            20.27%      6.02%       56.19%        22.53%
  e4            10.23%      1.85%       8.13%         4.97%
  e1 in music   11.76%      16.32%      3.58%         8.02%
  e2 in music   13.16%      54.14%      1.47%         2.98%
  e3 in music   49.89%      22.59%      82.93%        72.91%
  e4 in music   25.17%      6.94%       12.00%        16.08%
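As an illustration of the two-phase procedure from Sect. 4.2 and of the tallying behind Table 3 (not the author's Java application itself), the following sketch assumes the segments of one broadcast are available as two parallel WEKA Instances objects whose headers match the respective training sets; the class names, label strings, and label ordering are assumptions made for the example:

    import weka.classifiers.Classifier;
    import weka.core.Instances;

    /**
     * Sketch of the two-phase segment classification: first music/speech,
     * then emotion (e1..e4) for the musical segments, with a tally of the
     * percentages reported in Table 3. The two Instances objects hold the
     * same 6-second segments in the same order, but with headers (and class
     * attributes) matching the respective classifier's training data.
     */
    public class BroadcastMoodTally {

        public static void tally(Classifier musicSpeech, Classifier emotion,
                                 Instances segmentsMS, Instances segmentsEmo) throws Exception {
            int n = segmentsMS.numInstances();
            int speech = 0, music = 0;
            int[] emotionCounts = new int[4]; // e1..e4

            for (int i = 0; i < n; i++) {
                double msIndex = musicSpeech.classifyInstance(segmentsMS.instance(i));
                String msLabel = segmentsMS.classAttribute().value((int) msIndex);

                if (msLabel.equals("speech")) {          // assumed label name
                    speech++;
                } else {
                    music++;
                    double eIndex = emotion.classifyInstance(segmentsEmo.instance(i));
                    emotionCounts[(int) eIndex]++;       // assumes class values ordered e1..e4
                }
            }

            System.out.printf("speech: %.2f%%  music: %.2f%%%n",
                    100.0 * speech / n, 100.0 * music / n);
            for (int e = 0; e < 4; e++) {
                System.out.printf("e%d: %.2f%% of broadcast, %.2f%% of music%n",
                        e + 1, 100.0 * emotionCounts[e] / n,
                        100.0 * emotionCounts[e] / Math.max(music, 1)); // guard against no music
            }
        }
    }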

5.1 Comparison of Radio Stations

The station dominant in the amount of music presented was BBC Radio 3 (67.75%). We noted a similar ratio of speech to music in the broadcasts of PR Trojka and ORF OE1, in both of which speech dominated (73.35% and 69.10%, respectively). A more balanced amount of speech and music was noted on PR Dwojka (59.37% and 40.63%, respectively).

Comparing the emotional content, we can see that PR Trojka clearly differs from the other radio stations, because its dominant emotion is e2 energetic-negative (54.14%) and e4 calm-positive occurs least often (6.94%). We noted a clear similarity between BBC Radio 3 and ORF OE1, where the dominant emotion was e3 calm-negative (82.93% and 72.91%, respectively). The proportions of the other emotions (e1, e2, e4) were also similar for these stations. We could say that emotionally these stations are similar, except that, considering the speech-to-music ratio, BBC Radio 3 had much more music. The dominant emotion for PR Dwojka was also e3, which makes it somewhat similar to BBC Radio 3 and ORF OE1. Compared to the other stations, PR Dwojka had the most (25.17%) e4 calm-positive music.

5.2 Emotion Maps

The figures (Fig. 1, Fig. 2, Fig. 3, Fig. 4) present speech and emotion maps for each radio broadcast. Each point on a map is the value obtained from the classification of a 6-second segment. The maps show which emotions occurred at given hours of the broadcasts. For PR Dwojka (Fig. 1), there are clear musical segments (1500-2500, 2300-3900) during which e3 dominated. At the end of the day (4500-6000), emotion e2 occurs sporadically. It is interesting that e1 and e4 (from the right half of the Thayer model) did not occur in the morning. For PR Trojka (Fig. 2), emotion e4 did not occur in the morning, and e2 and e3 dominated (segments 1200-2800 and 3700-6000). For BBC Radio 3 (Fig. 3), we observed an almost complete lack of energetic emotions (e1 and e2) in the afternoon (segments after 3200). For ORF OE1 (Fig. 4), e3 dominated up to segment 3600, after which broadcasts without music dominated. The presented analyses of emotion maps could be developed further by examining the number of emotion changes or the distribution of emotions over the day.

Fig. 1. Map of speech and music emotion in PR Dwojka 10h broadcast

Fig. 2. Map of speech and music emotion in PR Trojka 10h broadcast

Fig. 3. Map of speech and music emotion in BBC Radio 3 10h broadcast

Fig. 4. Map of speech and music emotion in ORF OE1 10h broadcast

6 Conclusions

This paper presents an example of a system for the analysis of emotions contained within radio broadcasts. The collected data allowed us to determine the dominant emotion in each radio broadcast and to present the amount of speech and music. The obtained results provide a new and interesting view of the emotional content of radio stations. The precision of the constructed maps visualizing the distribution of emotions in time obviously depends on the precision of the emotion detection classifiers. Their accuracy could be better; this is still associated with the imperfection of features for audio analysis, and in this matter there is still much to be done. We could also test audio features extracted by other feature extraction software, such as jAudio or the MIR Toolbox. Also, the labeling of the music files, which are the input data for training the classifiers, could be done by a larger number of music experts; this would enhance the reliability of the classifiers. Extending the presented system to include emotion detection in speech also seems a logical prospect for the future.

A system for the analysis of emotions contained within radio broadcasts could be a helpful tool for people planning radio programs, enabling them to consciously plan the emotional distribution of the broadcast music. Another application of this system could be as an additional tool for radio station searching. Because the perception of emotions is subjective and different people perceive emotions slightly differently, the emotional analysis of radio stations could depend on the user's preferences. Search profiling of radio stations that takes the user into consideration would be an interesting solution.

Acknowledgments. This paper is supported by S/WI/3/2013.

References

1. Li, T. and Ogihara, M.: Detecting emotion in music. In: Proceedings of the Fifth International Symposium on Music Information Retrieval, pp. 239-240 (2003)
2. Grekow, J. and Ras, Z.: Detecting emotions in classical music from MIDI files. In: Foundations of Intelligent Systems: ISMIS 2009, LNAI, Vol. 5722, pp. 261-270 (2009)
3. Lu, L., Liu, D. and Zhang, H.J.: Automatic mood detection and tracking of music audio signals. IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 1, pp. 5-18 (2006)
4. Song, Y., Dixon, S. and Pearce, M.: Evaluation of Musical Features for Emotion Classification. In: Proceedings of the 13th International Society for Music Information Retrieval Conference (2012)
5. Yang, Y.-H., Lin, Y.C., Su, Y.F. and Chen, H.H.: A regression approach to music emotion recognition. IEEE Transactions on Audio, Speech, and Language Processing, Volume 16, Issue 2, pp. 448-457 (2008)
6. Schmidt, E. and Kim, Y.: Modeling Musical Emotion Dynamics with Conditional Random Fields. In: Proceedings of the 12th International Society for Music Information Retrieval Conference, pp. 777-782 (2011)
7. Schmidt, E.M., Turnbull, D. and Kim, Y.E.: Feature Selection for Content-Based, Time-Varying Musical Emotion Regression. In: Proceedings of the ACM SIGMM International Conference on Multimedia Information Retrieval, Philadelphia, PA (2010)
8. Schmidt, E.M. and Kim, Y.E.: Prediction of time-varying musical mood distributions from audio. In: Proceedings of the 2010 International Society for Music Information Retrieval Conference, Utrecht, Netherlands (2010)
9. Grekow, J.: Mood tracking of musical compositions. In: Foundations of Intelligent Systems: ISMIS 2012, Lecture Notes in Computer Science, pp. 228-233, eds. Li Chen, Alexander Felfernig, Jiming Liu, Zbigniew Ras; 20th International Symposium, Macau, China (2012)
10. Grekow, J. and Ras, Z.: Emotion Based MIDI Files Retrieval System. In: Advances in Music Information Retrieval, Studies in Computational Intelligence, Springer (2010)
11. Bachorik, J.P., Bangert, M., Loui, P., Larke, K., Berger, J., Rowe, R. and Schlaug, G.: Emotion in motion: Investigating the time-course of emotional judgments of musical stimuli. Music Perception, vol. 26, no. 4, pp. 355-364 (2009)
12. Xiao, Z., Dellandrea, E., Dou, W. and Chen, L.: What is the best segment duration for music mood analysis? In: International Workshop on Content-Based Multimedia Indexing (CBMI 2008), pp. 17-24 (2008)
13. Schmidt, E.M., Scott, J.J. and Kim, Y.E.: Feature Learning in Dynamic Environments: Modeling the Acoustic Structure of Musical Emotion. In: Proceedings of the 12th International Society for Music Information Retrieval Conference, pp. 325-330 (2012)
14. Yang, Y.H. and Chen, H.H.: Machine Recognition of Music Emotion: A Review. ACM Transactions on Intelligent Systems and Technology, Volume 3, Issue 3, Article No. 40 (2012)
15. Kim, Y., Schmidt, E., Migneco, R., Morton, B., Richardson, P., Scott, J., Speck, J. and Turnbull, D.: State of the Art Report: Music Emotion Recognition: A State of the Art Review. In: Proceedings of the 11th International Society for Music Information Retrieval Conference, pp. 255-266 (2010)
16. Mohammad, S.: From Once Upon a Time to Happily Ever After: Tracking Emotions in Novels and Fairy Tales. In: Proceedings of the ACL 2011 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 105-114, Portland, OR, USA (2011)
17. Yeh, J., Pao, T., Pai, Ch. and Cheng, Y.: Tracking and Visualizing the Changes of Mandarin Emotional Expression. In: ICIC 2008, LNCS 5226, pp. 978-984 (2008)
18. Lidy, T. and Rauber, A.: Visually Profiling Radio Stations. In: Proceedings of the 7th International Conference on Music Information Retrieval (2006)
19. Thayer, R.E.: The Biopsychology of Mood and Arousal. Oxford University Press (1989)
20. Tzanetakis, G. and Cook, P.: Marsyas: A framework for audio analysis. Organized Sound, 10:293-302 (2000)
21. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I.H.: The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1 (2009)
22. Witten, I.H. and Frank, E.: Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, CA, USA (2005)

10 Jacek Grekow 10. Grekow, J. and Ras, Z.: Emotion Based MIDI Files Retrieval System. Advances in Music Information Retrieval, Studies in Computational Intelligence, Springer (2010) 11. Bachorik, J.P., Bangert, M., Loui, P., Larke, K., Berger, J., Rowe, R. and Schlaug, G.: Emotion in motion: Investigating the time-course of emotional judgments of musical stimuli. Music Perception, vol. 26, no. 4, pp. 355-364 (2009) 12. Xiao, Z., Dellandrea, E., Dou, W. and Chen, L.: What is the best segment duration for music mood analysis? International Workshop on Content-Based Multimedia Indexing (CBMI 2008), pp. 17-24 (2008) 13. Schmidt, E. M., Scott, J.J. and Kim, Y.E.: Feature Learning in Dynamic Environments: Modeling the Acoustic Structure of Musical Emotion. In Proceedings of the 12th International Society for Music Information Retrieval Conference, pp. 325-330 (2012) 14. Yang Y.H. and Homer H. Chen, H.H.: Machine Recognition of Music Emotion: A Re-view. ACM Transactions on Intelligent Systems and Technology, Volume 3, Issue 3, Article No. 40 (2012) 15. Kim, Y., Schmidt, E., Migneco, R., Morton, B., Richardson, P., Scott, J., Speck, J. and Turnbull, D.: State of the Art Report: Music Emotion Recognition: A State of the Art Review. In Proceedings of the 11th International Society for Music Information Retrieval Conference, pp. 255-266 (2010) 16. Mohammad, S.: From Once Upon a Time to Happily Ever After: Tracking Emotions in Novels and Fairy Tales. Proceedings of the ACL 2011 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 105-114, Portland, OR, USA (2011) 17. Yeh, J., Pao, T., Pai, Ch., Cheng, Y.: Tracking and Visualizing the Changes of Mandarin Emotional Expression. ICIC 2008, LNCS 5226, pp. 978-984 (2008) 18. Lidy, T. and Rauber, A.: Visually Profiling Radio Stations. In Proceedings of the 7th Inter-national Conference on Music Information Retrieval (2006) 19. Thayer, R.E.: The biopsychology arousal. Oxford University Press (1989) 20. Tzanetakis, G. and Cook, P.: Marsyas: A framework for audio analysis. Organized Sound, 10:293-302 (2000) 21. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I.H.: The WEKA Data Mining Software: An Update; SIGKDD Explorations, Volume 11, Issue 1 (2009) 22. Witten, I.H. and Frank, E.: Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, CA, USA (2005)