Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis

Fengyan Wu (fengyanyy@163.com), Shutao Sun (stsun@cuc.edu.cn), Weiyao Xue (Wyxue_std@163.com)

Abstract
Automatic extraction of popular music ringtones has become an important and useful area for the communications and telecommunications industry. Quick, batch extraction of ringtones makes practical applications more convenient. In this paper, we propose an automatic technique for extracting ringtones from popular music based on musical structure analysis. This is a meaningful attempt to use the theory of musical structure analysis to solve a practical problem. Experiments show the feasibility of the process, and several sets of comparative experiments show that the thresholds used at the various stages can affect the results differently. The experiments also revealed some problems, which inspired us to optimize the processing. On our test database of 186 popular songs, the best boundary detection accuracy with a tolerance of ±3 seconds reaches 79.9%. We also invited listeners unknown to us to evaluate the results through a voting mechanism.

Keywords
SVM; boundary detection; random forest; automatic extraction

I. INTRODUCTION
Automatic music segmentation is important in many fields. Segments correspond to structurally meaningful regions of the performance, such as verse or chorus [1]. In recent years, many researchers have studied automatic segmentation techniques, and many have tried to extract the chorus of a piece of music. However, ringtone extraction has not received enough attention. This paper proposes a framework for extracting ringtones from music automatically by analyzing the structure of popular music.

With the development of the communication business, mobile phones have become an indispensable part of people's lives. Ringtones and ring-music bring more fun and pleasure when people make calls.
However, in most cases ringtone editing remains labor-intensive work: people need to listen to each song, set the starting and ending points of a clip within the audio file, and then cut the segment. This process requires manually checking each song and cropping specific parts of it with suitable tools, which is highly time-consuming and a waste of human resources. Quick, batch extraction of music is urgently needed. In this paper, we propose a simple method for extracting music ringtones using musical structure analysis and random forest classification.

Although ringtones are selected according to individual preferences, we studied a large number of music ringtones provided on the major sites and found that most of them follow these rules:
1) Ringtones are segments that are frequently repeated within a song.
2) The ringtone is not strictly defined as the intro or the chorus of a song. People tend to choose parts that are easy to remember as a ringtone.
3) The ringtone has strong melodic characteristics. The segments are popular and catchy.
4) Ringtones extracted from songs of the same type or by the same composer seem to have similar melody and structure.
5) Most ringtones last 45~60 s. The starting point of the ringtone is the starting point of a sentence, not an unexpected phrase.

Based on the rules above, we propose a new framework to extract ringtones from music automatically by analyzing the structure of popular music. We try to find the boundaries that mark the starting point and ending point of a ringtone, and verify the feasibility of the method by experiments.

The rest of the paper is organized as follows: Section 2 describes the related work. Section 3 presents the method. Evaluation is done in Section 4. Finally, conclusions and perspectives are discussed in Section 5.

II. RELATED WORK
The structure of popular music is usually composed of five parts: intro, verse, chorus, bridge, and outro.
As we know, an important part of music analysis is the detection of this structure [2]; in most cases, structure analysis is used to detect the boundaries within a piece of music. The chorus, an important part of the music that contains the most memorable points of the song, is the part most likely to be used as a ringtone. Therefore, musical structure analysis and chorus extraction are core topics of ringtone extraction. The problem of chorus extraction has been addressed previously by Logan [3] and Chu [4]. They focused on using Hidden Markov Models and clustering techniques on mel-frequency cepstral coefficients (MFCC), a set of spectral features that have been used with great success in speech processing applications [5]. In 2003, Chai Wei [6~8] analyzed the hierarchical structure of music signals and put forward an algorithm that uses the results of structural analysis for music summarization and chorus extraction. Chen [9] built a music summarization system based on a structure labeling system that spots the main theme segment. L. Regnier [10] showed that partial clustering is a promising approach for singing voice detection and separation.

However, as mentioned in Section 1, ringtone extraction has more flexible requirements than chorus extraction. We need to extract the ringtone according to habit, genre, and singer, not only extract the chorus. In this paper, we propose a framework to extract ringtones from music automatically using the theory of musical structure analysis and machine learning algorithms.

978-1-5090-0806-3/16/$31.00 copyright 2016 IEEE. ICIS 2016, June 26-29, 2016, Okayama, Japan

III. SYSTEM DESIGN OF RINGTONE EXTRACTION
The ringtone extraction system for popular music is shown in Figure 1; its processing can be grouped into three steps.

Fig. 1. The system framework of singing voice detection

A. System Description
The extraction process has three major steps: first, beat tracking and feature extraction; second, segment boundary detection and ringtone extraction; and finally, smooth filtering and selection of the segments suitable as ringtones. Boundary detection finds the points at which to cut a song into segments that may contain a ringtone. Fig. 1 shows the proposed system framework of popular music ringtone extraction; the three parts of the process are marked with numbers in the figure. Step one and segment boundary detection are described in detail in our other article [11]; this paper focuses on the process of extracting the ringtone segments, that is, the second and third steps.
Importantly, the system tries to find the starting point of a sung sentence rather than of individual sung words, because a ringtone always starts at the beginning of a complete lyric line. Therefore, both the training samples and the testing samples for this experiment are divided into fragments based on the result of beat tracking, because we assume that a sentence does not begin in the middle of a beat. After beat tracking, we extract MFCC and Chroma features for each beat.

B. Feature Extraction and Segment Boundary Detection
The process of feature extraction and segment boundary detection is detailed in our other paper [11]. In our system, we use Simon Dixon's beat tracker BeatRoot [8] to extract the beat onsets from the songs. The beats are extracted from 22050 Hz files, and a beat generally lasts 450 to 500 ms. We use the beat as the unit for recording results. In our experiments, we use two kinds of features that are commonly used in speech recognition: MFCC (Mel Frequency Cepstrum Coefficients) and Chroma, together with their first and second derivatives. Both were introduced in [11]. The MFCC features have 36 dimensions and the Chroma features have 33.

In this paper, we use an SVM for segment boundary detection, with the Radial Basis Function (RBF) kernel:

$$K(v_1, v_2) = \exp\left(-\gamma \lVert v_1 - v_2 \rVert^2\right) \qquad (1)$$

where $v_1$ and $v_2$ are feature vectors extracted by the method described in [11] and $\gamma > 0$ is the kernel width parameter. The penalty parameter C has a significant impact on the SVM result, so we run cross-validation on each training set to find its optimal value.

The rule of segment boundary detection is to attach a label, singing voice or music, to each beat. After this annotation, we apply a simple filter to pick out novelty points.
We regard a single beat, or a pair of beats, that differs from both the preceding beat and the following beat as a novelty peak, because a beat lasts only about 0.5 s, which may be too short to be noticed by listening. Such an isolated beat can be regarded as an error of boundary detection. By filtering out these novelty peaks, we obtain the set of segment boundary points of the song.
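The novelty-peak filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names `smooth_beat_labels` and `boundary_beats` are hypothetical, labels are assumed to be binary (0 for no-singing voice, 1 for singing voice), and short runs are judged in a single pass over the original run lengths, so adjacent short runs are not merged iteratively.

```python
from itertools import groupby


def smooth_beat_labels(labels, max_run=2):
    """Relabel interior runs of at most `max_run` beats as their neighbours.

    Assumes binary labels, so the runs on both sides of a short run
    carry the same (opposite) label.
    """
    runs = [(label, len(list(group))) for label, group in groupby(labels)]
    smoothed = []
    for i, (label, length) in enumerate(runs):
        is_interior = 0 < i < len(runs) - 1
        if is_interior and length <= max_run:
            # Treat the short run as a novelty peak and absorb it.
            label = runs[i - 1][0]
        smoothed.extend([label] * length)
    return smoothed


def boundary_beats(labels):
    """Return the beat indices at which the label changes."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


# Hypothetical beat labels: one isolated beat and one isolated pair.
raw = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0]
clean = smooth_beat_labels(raw)
boundaries = boundary_beats(clean)  # [8, 19]
```

Runs at the very start or end of a song are left untouched here because they have only one neighbour; whether the original system treats them the same way is not stated in the text.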

C. Ringtone Extraction and Random Forest Classification
Having obtained the set of boundary points by the method above, the next step is to choose the valuable segments that are most likely to be ringtones. Summarizing the rules in Section 1, two kinds of segments can be considered as a ringtone: one typical kind is the chorus of a song; the other is an intro whose duration is longer than 30 seconds. In our experiment, we use random forest classification to choose the chorus segments, and screening and smooth filtering to choose the qualifying intros. Smooth filtering can also fix a suitable starting point for a ringtone.

A random forest builds a forest of decision trees in a randomized way. To understand and use the various options, it helps to know how they are computed. Most of the options depend on two data objects generated by random forests. When the training set for the current tree is drawn by sampling with replacement, about one-third of the cases are left out of the sample. This out-of-bag data is used to get a running unbiased estimate of the classification error as trees are added to the forest. After each tree is built, all of the data are run down the tree, and proximities are computed for each pair of cases. If two cases occupy the same terminal node, their proximity is increased by one. At the end of the run, the proximities are normalized by dividing by the number of trees. Proximities are used in replacing missing data, locating outliers, and producing illuminating low-dimensional views of the data.

We use the result of segment boundary detection as the input of the random forest algorithm. Boundary detection cuts the music into segments, and we then extract the mean MFCC and Chroma features of each segment. We use clips of popular ringtones downloaded from the Internet to build a training set, and classify all segments of a test song to choose the valuable segments.
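As an illustration of how per-segment mean features might be formed before classification, the sketch below averages beat-level feature vectors between consecutive boundary beats. The function name `segment_means`, the plain-list data layout, and the toy two-dimensional features are assumptions made for this example; the paper does not specify its implementation.

```python
def segment_means(beat_features, boundaries):
    """Average the per-beat feature vectors inside each segment.

    `beat_features` is a list of equal-length feature vectors, one per beat.
    `boundaries` holds the beat indices at which a new segment starts.
    """
    edges = [0] + sorted(boundaries) + [len(beat_features)]
    means = []
    for start, end in zip(edges, edges[1:]):
        if start == end:
            continue  # skip empty segments caused by duplicate boundaries
        rows = beat_features[start:end]
        dim = len(rows[0])
        means.append([sum(row[d] for row in rows) / len(rows) for d in range(dim)])
    return means


# Toy example: five beats with 2-dimensional features, one boundary at beat 2.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
per_segment = segment_means(features, [2])  # [[2.0, 3.0], [7.0, 8.0]]
```

Each resulting vector would then serve as one input sample for the random forest classifier, with one sample per candidate segment.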
The results of the experiment are presented in Section 4.

D. Smooth and Correction
As mentioned in Section 1, the starting point of a ringtone should be the start of a sentence, not a point between two phrases. We scan the label of each beat of the selected segment to find the first no-singing voice label, which we take as the precise starting point of the segment; in this way we can fix the starting point of the extracted ringtone. In some special cases the intro is a long melody and can also be considered as a ringtone. We judge this with the following rule: if the first segment of a song is longer than 30 seconds and consists almost entirely of beats labeled as no-singing voice, we consider it a ringtone. Smoothing and correction are indeed needed, because a suitable starting point of a ringtone is important for the user experience. Extracting more ringtone segments that satisfy consumers from a single song can also enhance the user experience.

IV. EVALUATION AND RESULT
All experiments are based on the TUT standard annotation collection of music data. It contains 186 songs of the Beatles. We choose 140 of them to build the training set and use the rest to test the result. The SVM training set contains 100 pieces of popular music with durations ranging from 20 to 30 s. Half of them are singing voice mixed with musical instruments, and the others are music without singing voice. We regard silent pieces as no-singing voice; any silent section encountered while scanning the beat labels is recognized as no-singing voice. The random forest training set contains 140 corresponding ringtone clips of the Beatles, downloaded from the Internet. The SVM testing set contains 50 pieces of popular music with durations ranging from 20 to 30 s; 25 of them are singing voice and the other 25 are no-singing voice. It also contains 20 songs chosen from the TUT collection with standard annotation. In order to select appropriate parameters, all the experimental data were cross-validated.
Since the suitability of an extracted ringtone depends on user preferences, the accuracy cannot be reflected intuitively. In the experiments of this paper, we ran several groups of comparative experiments and used a more realistic calculation method to reflect the accuracy of the extraction.

A. Experiment of Segment Boundary Detection
We test 50 pieces of popular music (25 singing voice pieces and 25 no-singing voice pieces), each with a duration of about 30 s. We use the SVM to classify each beat, labeling it 0 (no-singing voice) or 1 (singing voice), and then compare the labels with the standard annotation to calculate the precision. This experiment shows the direct result of the classification.

TABLE I. THE AVERAGE PRECISION OF THE SEGMENT BOUNDARY DETECTION

Method            | SVM classification with pieces | SVM classification with songs
Without Filtering | 0.845                          | 0.728
Filtering         | 0.926                          | 0.883

Table 1 shows the result of the segment boundary detection. The results show that SVM classification can provide reasonable singing voice boundary detection. However, the threshold of the filtering method may strongly affect the accuracy of the segment boundary detection.

B. Ringtone Extraction from Segment Boundary Set
As mentioned previously, the chorus is the segment most likely to be a ringtone. In this experiment, we take the chorus extraction accuracy as the ringtone extraction accuracy, and all results are compared with the standard annotation. We choose two data sets as testing sets. One is the result set of SVM classification with the simple filtering described above; the other is the standard annotation of the segment boundaries. The standard annotation set gives a more accurate measure of the effectiveness of random forest classification. Because a ringtone is chosen according to personal preference, we consider a result correct when its coverage rate, measured against the ringtone downloaded from the Internet, is higher than 80%.
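The 80% coverage criterion can be illustrated with a short sketch. The paper does not define the coverage rate precisely, so this example assumes it is the fraction of the downloaded reference ringtone that is overlapped by the extracted segment; the function names `coverage_rate` and `is_correct` and the example times are hypothetical.

```python
def coverage_rate(extracted, reference):
    """Fraction of the reference interval covered by the extracted interval.

    Both arguments are (start, end) pairs in seconds.
    """
    ext_start, ext_end = extracted
    ref_start, ref_end = reference
    if ref_end <= ref_start:
        raise ValueError("reference interval must have positive duration")
    overlap = min(ext_end, ref_end) - max(ext_start, ref_start)
    return max(0.0, overlap) / (ref_end - ref_start)


def is_correct(extracted, reference, threshold=0.8):
    """Apply the criterion: coverage strictly higher than the threshold."""
    return coverage_rate(extracted, reference) > threshold


# Hypothetical times: the reference ringtone spans 60 s to 110 s of the song.
good = is_correct((62.0, 110.0), (60.0, 110.0))  # True, coverage is 48/50
bad = is_correct((80.0, 130.0), (60.0, 110.0))   # False, coverage is 30/50
```

Other definitions, such as overlap divided by the union of the two intervals, would give a stricter measure; the choice here is only one plausible reading of the text.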
As mentioned above, we used MFCC and Chroma features in comparative experiments. The feature sets tested were MFCC (36 dimensions), Chroma (39), and MFCC & Chroma 25.

TABLE II. THE AVERAGE PRECISION OF 30 SONGS DOWNLOADED FROM THE INTERNET WITH SMOOTH FILTERING

Method                  | MFCC  | Chroma | MFCC & Chroma
SVM Result Set          | 0.658 | 0.776  | 0.739
Standard Annotation Set | 0.692 | 0.832  | 0.787

Fig. 2. THE RESULTS OF THREE KINDS OF FEATURES

Table 2 and Fig. 2 show the results of the comparative experiments. The columns of the table show the precision for the three kinds of features, and the rows show the results for the SVM result set and the standard annotation set. These experiments verify the feasibility of using the random forest algorithm to extract ringtones from music. They show that Chroma features are more effective for ringtone extraction than MFCC. However, combining the two kinds of features did not improve the result. The results also show that the precision of the SVM boundary detection strongly affects ringtone extraction. The segment boundaries taken from the standard annotation set show the objective performance of the random forest algorithm in choosing the right segment.

C. Smooth and Screen the Segment
The starting point of a segment boundary generated by the SVM with filtering may not be appropriate as the start of a ringtone, because the filtering threshold may affect the resulting segment. In this experiment, we consider a no-singing voice beat more likely to be the start of a sentence. We screen the label of each beat of the selected segment to find the first no-singing voice label and take it as the precise starting point of the segment. In addition, an intro that lasts more than 30 s can be considered as a ringtone. The results of this experiment are difficult to express with data, so we randomly found 50 people to listen to the extracted ringtones and then interviewed them about their impressions; 84% of them considered the screened ringtones more suitable.

V. CONCLUSION
Ringtone extraction meets a wide range of business needs.
This paper proposed an automatic framework for music ringtone extraction using musical structure analysis and machine learning. The experiments in this paper show the feasibility of the process. We designed a set of rules for automatic extraction of music ringtones and verified their feasibility through experiments. We compared the extraction performance of three kinds of audio features and found that Chroma features have a more stable effect in this process. We used a more objective standard annotation set to verify the feasibility of using the random forest algorithm to extract music ringtones. We proposed a method to find the starting point of a music ringtone and verified its effect through a user study. This is a meaningful attempt to use the theory of musical structure analysis to solve a practical problem. The system is a base for further research.

The experiments also revealed some problems. All training and testing sets come from the same artist, so the accuracy of the random forest for songs from a different artist or genre may not be stable. The result of machine learning may depend on the selection of the training set; cross-validation and training data selection may be key factors in the classification.

Future work will be directed towards improving the classification accuracy of each machine learning stage and using knowledge of speech signals to fix the starting point of a ringtone.

ACKNOWLEDGMENT
The paper is supported by the Chinese Music Audience Automatic Classification of Music Technology Innovation Program of the Ministry of Culture (WHB201520).

REFERENCES
[1] Brian McFee, Daniel P. W. Ellis, Learning to Segment Songs with Ordinal Linear Discriminant Analysis, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] Heng-Tze Cheng, Yi-Hsuan Yang, Yu-Ching Lin, and Homer H. Chen, Music using audio and textual information, IEEE, 2009.
[3] Mark A. Bartsch, Gregory H. Wakefield, To Catch a Chorus: Using Chroma-Based Representations for Audio Thumbnailing, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.
[4] B. Logan and S. Chu, Music summarization using key phrases, in International Conference on Acoustics, Speech and Signal Processing, 2000.
[5] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.
[6] Chai Wei, Vercoe Barry, Music Thumbnailing via Structural Analysis, Proceedings of ACM Multimedia Conference, 2003.
[7] Chai Wei, Vercoe Barry, Structural Analysis of Music Signals for Indexing and Thumbnailing, Proceedings of ACM/IEEE Joint Conference on Digital Libraries, 2003.
[8] Chai Wei, Structural Analysis of Musical Signals via Pattern Matching, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003.
[9] Chen Yanliang, Music Structural Analysis and Application, U.D.C: 681.3.
[10] Namunu C. Maddage, Automatic Structure Detection for Popular Music, Institute for Infocomm Research.
[11] D. Dimitriadis, P. Maragos, and A. Potamianos, Robust AM-FM features for speech recognition, IEEE Signal Process. Lett., vol. 12, pp. 621-624, 2005.
[12] Wu Fengyan, Singing Voice Detection of Popular Music Using Beat Tracking and SVM Classification, International Conference on Computer and Information Science (ICIS 2015).
[13] Reliable onset detection scheme for singing voices based on enhanced difference filtering and combined features, Wireless Communications & Signal Processing (WCSP 2009), 2009.
[14] Shi Zi-qiang, Li Hai-feng, Sun Jia-yin, Vocal discrimination in pop music based on SVM, Computer Engineering and Applications, 2008, 44(25): 126-128.
[15] S. Dixon, Automatic extraction of tempo and beat from expressive performances, Journal of New Music Research, 30(1): 39-58, 2001.