Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis


Fengyan Wu (fengyanyy@163.com), Shutao Sun (stsun@cuc.edu.cn), Weiyao Xue (Wyxue_std@163.com)

Abstract: Automatic extraction of popular music ringtones has become an important and useful topic for the communications and telecommunications industry. Quick, batch extraction of ringtones greatly increases convenience in practical applications. In this paper, we propose an automatic technique for extracting ringtones from popular music based on music structure analysis. This is a meaningful attempt to apply the theory of music structure analysis to a practical problem. Experiments show the feasibility of the process, and several sets of comparative experiments show that the thresholds used at the various stages have different effects. The experiments also revealed some problems, which inspired us to optimize the processing. On our test database of 186 popular songs, the best boundary-detection accuracy with a tolerance of ±3 seconds reaches 79.9%. We also invited independent listeners to evaluate the results using a voting mechanism.

Keywords: SVM; boundary detection; random forest; automatic extraction

I. INTRODUCTION

It is well known that automatic music segmentation is significant in many fields. Segments correspond to structurally meaningful regions of the performance, such as the verse or chorus [1]. In recent years many researchers have studied automatic segmentation techniques, and many have tried to extract the chorus of a song; ringtone extraction, however, has not received enough attention. This paper proposes a framework to extract ringtones from music automatically by analyzing the structure of popular music.

With the development of the communication business, mobile phones have become an indispensable part of people's lives, and ringtones and ring-back music bring more fun and pleasure when people make calls. In most cases, however, ringtone editing remains labor-intensive work: one has to listen to each song, set the starting and ending points of a clip within the audio file, and then cut the segment. Manually checking each song and cropping specific parts with editing tools is highly time-consuming and wastes human resources, so quick, batch extraction of ringtones is urgently needed. In this paper, we propose a simple method for extracting music ringtones using music structure analysis and random forest classification.

Although ringtones are chosen according to individual preferences, we studied a large number of ringtones offered on major websites and found that most of them follow these rules:
1) Ringtones are segments that are frequently repeated within a song.
2) A ringtone is not strictly defined as the intro or the chorus of a song; people tend to choose parts that are easy to remember.
3) A ringtone has strong melodic characteristics; the chosen segments are popular and catchy.
4) Ringtones extracted from songs of the same genre or composer tend to have similar melody and structure.
5) Most ringtones last 45-60 s, and the starting point of a ringtone is the start of a sentence, not an unexpected mid-phrase position.

Based on these observations, we propose a new framework to extract ringtones from music automatically by analyzing the structure of popular music.
We try to find the boundaries that mark the starting point and ending point of a ringtone, and we verify the feasibility of the method through experiments. The rest of the paper is organized as follows: Section 2 describes related work, Section 3 presents the method, Section 4 reports the evaluation, and Section 5 discusses conclusions and perspectives.

II. RELATED WORK

The structure of popular music is usually composed of five parts: intro, verse, chorus, bridge, and outro. An important part of music analysis is the detection of this structure [2], and in most cases structure analysis is used to detect the boundaries within a piece of music. The chorus, an important part that carries the memorable points of a song, is the segment most likely to serve as a ringtone. Music structure analysis and chorus extraction are therefore core topics of ringtone extraction. The problem of chorus extraction has been addressed previously by Logan [3] and Chu [4].

They focused on applying Hidden Markov Models and clustering techniques to mel-frequency cepstral coefficients (MFCC), and built a set of spectral features that have been used with great success in speech processing applications [5]. In 2003, Chai Wei [6-8] analyzed the hierarchical structure of music signals and proposed an algorithm that uses the results of structure analysis to produce music summaries and extract the chorus. Chen [9] built a music summarization system based on a structure-labelling scheme that picks out the main theme segment. Regnier [10] showed that partial clustering is a promising approach for singing voice detection and separation. However, as mentioned in Section 1, ringtone extraction has more flexible requirements than chorus extraction: the ringtone must be extracted according to listening habits, genre, and singer, not merely by extracting the chorus. In this paper, we propose a framework to extract ringtones from music automatically using the theory of music structure analysis together with machine learning.

III. SYSTEM DESIGN OF RINGTONE EXTRACTION

The ringtone extraction system for popular music is shown in Figure 1; its processing can be grouped into three steps.

Fig. 1. The system framework of popular music ringtone extraction.

A. System Description

There are three major steps in the extraction process. First, beat tracking and feature extraction. Second, segment boundary detection and ringtone extraction: boundary detection finds the points at which to cut a song into segments, one of which may contain the ringtone. Finally, smooth filtering and selection of the suitable segments to be ringtones. Fig. 1 shows the proposed system framework, with the three parts of the process numbered in the figure. Step one and segment boundary detection are described in detail in our earlier article [11]; this paper focuses on the second and third steps, the extraction of the ringtone segments. Importantly, the system tries to find the starting point of a sung sentence rather than of an individual sung word, because a ringtone always starts at the beginning of a complete lyric. Both the training and test samples for this experiment are therefore divided into fragments based on the result of beat tracking, under the assumption that a sentence does not begin in the middle of a beat. After obtaining the beat-tracking result, we extract MFCC and Chroma features for each beat.

B. Feature Extraction and Segment Boundary Detection

The feature extraction and segment boundary detection processes are detailed in our earlier paper [11]. In our system, we use Simon Dixon's beat tracker BeatRoot [8] to extract the beat onsets from the songs. The beats are extracted from 22050 Hz audio files, with beat durations generally ranging from 450 to 500 ms, and a beat is used as the unit of annotation. In our experiment we use two kinds of features commonly used in speech recognition: MFCC (Mel Frequency Cepstral Coefficients) and Chroma, together with their first and second derivatives; both were introduced in our article [11]. The MFCC features have 36 dimensions and the Chroma features have 33.

In this paper we use an SVM for segment boundary detection, with the Radial Basis Function (RBF) kernel

    K(v1, v2) = exp(-γ ||v1 - v2||²)    (1)

where v1 and v2 are feature vectors extracted by the method described in [11]. As is well known, the penalty parameter C has a significant impact on the result of the SVM.
For the penalty parameter C, we perform cross-validation on each training set to find the optimal value. The segment boundary detection attaches a label, singing voice or music, to each beat. After this annotation, we use a simple filter to pick out novelty points: a single beat, or a pair of beats, whose label differs both from the preceding beat and from the following beat is treated as a novelty peak, since a beat lasting about 0.5 s may be too short to be noticed by listening, and a single differing beat can be regarded as an erroneous detection. By filtering out these novelty peaks, we obtain the set of segment boundary points for the song.
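The paper's pipeline extracts MFCC and Chroma features (plus first and second derivatives) for every beat produced by the BeatRoot tracker. The sketch below illustrates the same beat-synchronous feature step under different assumptions: it uses librosa's beat tracker instead of BeatRoot and default feature sizes rather than the 36/33-dimensional configuration described above, so it approximates the idea rather than reproducing the authors' exact setup.

```python
# A minimal beat-synchronous feature extraction sketch (assumed tools:
# librosa instead of BeatRoot; illustrative feature sizes).
import librosa
import numpy as np

def beat_features(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)

    # Beat tracking: frame indices of the detected beats.
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)

    # Frame-level MFCC and chroma plus first and second derivatives.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    feats = np.vstack([
        mfcc,
        librosa.feature.delta(mfcc),
        librosa.feature.delta(mfcc, order=2),
        chroma,
        librosa.feature.delta(chroma),
        librosa.feature.delta(chroma, order=2),
    ])

    # Average the frames inside each inter-beat interval: one vector per
    # segment.  With pad=True (the default), librosa.util.sync also keeps
    # the segments before the first and after the last beat, so there is
    # one more feature column than there are beat times.
    beat_feats = librosa.util.sync(feats, beat_frames, aggregate=np.mean)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    return beat_times, beat_feats.T
```

Each row of the returned matrix then corresponds to one beat-level observation fed to the classifier sketched next.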
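A companion sketch, again only illustrative, of the per-beat singing-voice/music labelling with an RBF-kernel SVM and the novelty-peak filtering just described; the grid of C values, the gamma setting, and the helper names are assumptions, and scikit-learn stands in for whatever SVM implementation the authors used.

```python
# Per-beat SVM labelling and novelty-peak filtering (sketch, scikit-learn).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_beat_classifier(X_train, y_train):
    """X_train: (n_beats, n_dims) beat features; y_train: 0 = music, 1 = singing."""
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
    # Cross-validate the penalty parameter C on the training set.
    grid = GridSearchCV(svm, {"svc__C": [0.1, 1, 10, 100]}, cv=5)
    grid.fit(X_train, y_train)
    return grid.best_estimator_

def filter_labels(labels, max_run=2):
    """Relabel isolated runs of one or two beats that differ from both
    neighbours: a beat lasts only ~0.5 s, so such runs are treated as
    spurious novelty peaks rather than real boundaries."""
    labels = np.asarray(labels).copy()
    n = len(labels)
    i = 0
    while i < n:
        j = i
        while j + 1 < n and labels[j + 1] == labels[i]:
            j += 1
        run = j - i + 1
        if run <= max_run and 0 < i and j < n - 1 and labels[i - 1] == labels[j + 1]:
            labels[i:j + 1] = labels[i - 1]
        i = j + 1
    return labels

def boundaries_from_labels(beat_times, labels):
    """Beat times at which the filtered label changes form the boundary set."""
    change = np.flatnonzero(np.diff(labels)) + 1
    return [beat_times[k] for k in change]
```

In use, the classifier is fit on the labelled 20-30 s training pieces, its per-beat predictions are smoothed by filter_labels, and the remaining label changes are taken as candidate segment boundaries.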

C. Ringtone Extraction and Random Forest Classification

Given the set of boundary points obtained by the method described above, the next step is to choose the segments most likely to be ringtones. Summarizing the rules in Section 1, two kinds of segments can be considered as ringtones: the typical case is the chorus of a song; the other is an intro whose duration is longer than 30 seconds. In our experiment, we use random forest classification to choose the chorus segments, and screening plus smooth filtering to choose qualifying intros. Smooth filtering also fixes a suitable starting point for the ringtone.

A random forest builds an ensemble of trees in a randomized way. To understand and use its options, it helps to know how they are computed; most of them depend on two data objects generated by the algorithm. When the training set for the current tree is drawn by sampling with replacement, about one third of the cases are left out of the sample. This out-of-bag data is used to obtain a running, unbiased estimate of the classification error as trees are added to the forest. After each tree is built, all of the data are run down the tree, and proximities are computed for each pair of cases: if two cases occupy the same terminal node, their proximity is increased by one. At the end of the run, the proximities are normalized by dividing by the number of trees. Proximities are used for replacing missing data, locating outliers, and producing illuminating low-dimensional views of the data.

We use the result of segment boundary detection as the input to the random forest algorithm. The boundary detection cuts the music into segments, and we extract the mean MFCC and Chroma features of each segment. We use clips of popular ringtones downloaded from the Internet to build a training set, and classify all segments of a test song to choose the valuable segments. The results of the experiment are presented in Section 4.

D. Smooth and Correction

As mentioned in Section 1, the starting point of a ringtone should be the start of a sentence, not a point between two phrases. We scan the label of each beat of the selected segment and take the first no-singing-voice label as the precise starting point of the segment; this fixes the starting point of the extracted ringtone. On some special occasions the intro is a long melody and can also be considered a ringtone; we use the following rule: if the first segment of a song is longer than 30 seconds and almost all of its beats are labelled no-singing voice, we consider it a ringtone. Smoothing and correction are indeed necessary: a suitable starting point is important for the user experience, and extracting more satisfying ringtone segments from a song further enhances it.
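A minimal sketch of the segment-selection step described in Section III-C, assuming scikit-learn's RandomForestClassifier; segment_features and pick_chorus_segments are hypothetical helpers that average the beat-level features inside each detected segment and rank the segments by the forest's predicted ringtone probability.

```python
# Random-forest selection of ringtone-like segments (sketch, scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def segment_features(beat_feats, beat_times, boundaries):
    """Mean feature vector per segment delimited by the boundary time points."""
    beat_times = np.asarray(beat_times)
    edges = [beat_times[0]] + list(boundaries) + [beat_times[-1] + 1e-3]
    segs, spans = [], []
    for start, end in zip(edges[:-1], edges[1:]):
        mask = (beat_times >= start) & (beat_times < end)
        if mask.any():
            segs.append(beat_feats[mask].mean(axis=0))
            spans.append((start, end))
    return np.vstack(segs), spans

def pick_chorus_segments(train_X, train_y, beat_feats, beat_times, boundaries):
    """train_X/train_y: mean features of reference clips, 1 = ringtone-like, 0 = not."""
    forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    forest.fit(train_X, train_y)            # OOB samples give an unbiased error estimate
    X, spans = segment_features(beat_feats, beat_times, boundaries)
    scores = forest.predict_proba(X)[:, 1]  # probability of being ringtone-like
    order = np.argsort(scores)[::-1]
    return [(spans[k], float(scores[k])) for k in order]
```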
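A second sketch covers the correction rules of Section III-D, assuming per-beat labels (1 = singing voice, 0 = no-singing voice) and beat times in seconds; the 30 s intro threshold follows the text, while the 0.8 "almost all beats" ratio is an assumed value that the paper does not specify.

```python
# Start-point correction and long-intro rule (sketch; thresholds partly assumed).
def snap_to_sentence_start(labels, beat_times, segment_start):
    """Move the ringtone start to the first no-singing beat at or after the
    proposed segment start, so the clip begins at a sentence boundary rather
    than in the middle of a sung phrase."""
    idx = next((k for k, t in enumerate(beat_times) if t >= segment_start),
               len(beat_times))
    for k in range(idx, len(labels)):
        if labels[k] == 0:            # first no-singing beat: sentence boundary
            return beat_times[k]
    return segment_start              # no correction possible

def long_intro_is_ringtone(labels, beat_times, first_boundary,
                           min_len=30.0, ratio=0.8):
    """Treat a long, mostly instrumental intro as a candidate ringtone
    (ratio = assumed reading of "almost all beats labelled no-singing")."""
    intro = [lab for lab, t in zip(labels, beat_times) if t < first_boundary]
    if not intro or first_boundary - beat_times[0] < min_len:
        return False
    return intro.count(0) / len(intro) >= ratio
```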
IV. EVALUATION AND RESULTS

All experiments are based on the TUT collection of music data with standard annotations, which contains 186 songs by the Beatles. We use 140 of them to build the training set and the remaining songs to test the results. The SVM training set contains 100 pieces of popular music with durations ranging from 20 to 30 s; half of them are singing voice mixed with musical instruments, and the other half are music without singing voice. We regard silent passages as no-singing voice, so any silent section encountered while scanning the beat labels is labelled no-singing voice.

The random forest training set contains the 140 corresponding ringtone clips of the Beatles songs, downloaded from the Internet. The SVM test set contains 50 pieces of popular music with durations ranging from 20 to 30 s, of which 25 contain singing voice and 25 do not; it also contains 20 songs chosen from the TUT collection with standard annotations. To select appropriate parameters, all experimental data were cross-validated. Since the suitability of an extracted ringtone depends on user preferences, accuracy alone is not an intuitive measure; we therefore ran several groups of comparative experiments and used a more realistic calculation method to reflect the accuracy of the extraction.

A. Experiment on Segment Boundary Detection

We test 50 pieces of popular music (25 with singing voice and 25 without), each about 30 s long. We use the SVM to classify each beat and label it 0 (no-singing voice) or 1 (singing voice), then compare the labels with the standard annotation to calculate the precision. This experiment shows the direct result of the classification.

TABLE I. AVERAGE PRECISION OF SEGMENT BOUNDARY DETECTION

Method                          | Without filtering | With filtering
SVM classification with pieces  | 0.845             | 0.926
SVM classification with songs   | 0.728             | 0.883

Table 1 shows the results of segment boundary detection. SVM classification provides reasonable singing voice boundary detection; however, the threshold of the filtering method strongly affects the accuracy of the detection.
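As a concrete reading of the measures reported here, the sketch below computes the per-beat label precision used in Table I and a boundary hit rate under the ±3 s tolerance mentioned in the abstract; both are illustrative helpers rather than the authors' evaluation code.

```python
# Evaluation helpers (sketch): per-beat precision and tolerant boundary hits.
import numpy as np

def beat_label_precision(pred, ref):
    """Fraction of beats whose predicted singing/no-singing label matches
    the standard annotation."""
    pred, ref = np.asarray(pred), np.asarray(ref)
    return float((pred == ref).mean())

def boundary_hit_rate(pred_times, ref_times, tol=3.0):
    """Fraction of reference boundaries matched by a predicted boundary
    within +/- tol seconds."""
    if not ref_times:
        return 0.0
    hits = sum(any(abs(p - r) <= tol for p in pred_times) for r in ref_times)
    return hits / len(ref_times)
```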

B. Ringtone Extraction from the Segment Boundary Set

As mentioned earlier, the chorus is the segment most likely to serve as a ringtone. In this experiment we therefore treat the chorus-extraction accuracy as the ringtone-extraction accuracy, comparing all results with the standard annotation. We use two data sets as test sets: one is the result set of the SVM classification with the simple filtering described above, the other is the standard annotation of the segment boundaries; the standard annotation set reflects the efficiency of the random forest classification more accurately. Because a ringtone is chosen according to personal preference, we consider a result correct when its coverage rate, measured against the ringtone downloaded from the Internet, is higher than 80%. As mentioned above, we used MFCC and Chroma features in the comparative experiments. The feature configurations tested were MFCC (36 dimensions), Chroma (39), and combined MFCC & Chroma (25).

TABLE II. AVERAGE PRECISION ON 30 SONGS DOWNLOADED FROM THE INTERNET, WITH SMOOTH FILTERING

Method                   | MFCC  | Chroma | MFCC & Chroma
SVM result set           | 0.658 | 0.776  | 0.739
Standard annotation set  | 0.692 | 0.832  | 0.787

Fig. 2. Results for the three kinds of features.

Table 2 and Fig. 2 show the results of the comparative experiments: the rows of the table correspond to the two test sets (SVM result set and standard annotation set) and the columns to the three kinds of features. These experiments verify the feasibility of using the random forest algorithm to extract ringtones from music. They also show that Chroma features extract ringtones better than MFCC, while combining the two kinds of features did not improve the result. In addition, the precision of the SVM boundary detection strongly affects the ringtone extraction; the segment boundaries obtained from the standard annotation set show the objective performance of the random forest algorithm in choosing the right segment.

C. Smoothing and Screening the Segments

The starting point of a segment boundary generated by the SVM with filtering may not be appropriate for a ringtone, because the filtering threshold can affect the resulting segment. In this experiment we assume that a no-singing-voice beat is more likely to mark the start of a sentence, so we scan the label of each beat of the selected segment and take the first no-singing-voice label as the precise starting point. An intro lasting more than 30 s can also be considered a ringtone. The results of this experiment are difficult to express numerically, so we asked 50 randomly recruited listeners to listen to the extracted ringtones and interviewed them about their impressions; 84% of them considered the screened ringtones more suitable.

V. CONCLUSION

Ringtone extraction meets a wide range of business needs. This paper proposed an automatic framework for music ringtone extraction using music structure analysis and machine learning. The experiments not only show the feasibility of the process: we designed a set of rules for the automatic extraction of music ringtones and verified their feasibility through experiments; we compared the effect of three kinds of audio features and found that Chroma features give a more stable effect in this process; we used a more objective standard annotation set to verify the feasibility of the random forest algorithm for extracting ringtones; and we proposed a method to find the starting point of a ringtone and verified its effect through user research. This is a meaningful attempt to apply the theory of music structure analysis to a practical problem, and the system forms a basis for further research.

The experiments also revealed some problems. All training and test data come from the same artist, so the accuracy of the random forest for songs by other artists or of other genres may not be stable. The result of machine learning may depend on the selection of the training set, so cross-validation and training data selection may be key factors in the classification. Future work will be directed towards improving the classification accuracy of each machine learning stage, and towards using knowledge from speech signal processing to fix the starting point of a ringtone.

ACKNOWLEDGMENT

This paper is supported by the Chinese Music Audience Automatic Classification of Music Technology Innovation Program of the Ministry of Culture (WHB201520).
REFERENCES
[1] B. McFee and D. P. W. Ellis, "Learning to segment songs with ordinal linear discriminant analysis," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[2] H.-T. Cheng, Y.-H. Yang, Y.-C. Lin, and H. H. Chen, "Music using audio and textual information," IEEE, 2009.
[3] M. A. Bartsch and G. H. Wakefield, "To catch a chorus: Using chroma-based representations for audio thumbnailing," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.
[4] B. Logan and S. Chu, "Music summarization using key phrases," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing, 2000.
[5] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice-Hall, 1993.
[6] W. Chai and B. Vercoe, "Music thumbnailing via structural analysis," in Proc. ACM Multimedia Conference, 2003.
[7] W. Chai and B. Vercoe, "Structural analysis of music signals for indexing and thumbnailing," in Proc. ACM/IEEE Joint Conference on Digital Libraries, 2003.
[8] W. Chai, "Structural analysis of musical signals via pattern matching," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2003.
[9] Y. Chen, "Music structural analysis and application," U.D.C. 681.3.
[10] N. C. Maddage, "Automatic structure detection for popular music," Institute for Infocomm Research.
[11] D. Dimitriadis, P. Maragos, and A. Potamianos, "Robust AM-FM features for speech recognition," IEEE Signal Processing Letters, vol. 12, pp. 621-624, 2005.
[12] F. Wu, "Singing voice detection of popular music using beat tracking and SVM classification," in Proc. Int. Conf. on Computer and Information Science (ICIS), 2015.
[13] "Reliable onset detection scheme for singing voices based on enhanced difference filtering and combined features," in Proc. Wireless Communications & Signal Processing (WCSP), 2009.
[14] Z.-Q. Shi, H.-F. Li, and J.-Y. Sun, "Vocal discrimination in pop music based on SVM," Computer Engineering and Applications, vol. 44, no. 25, pp. 126-128, 2008.
[15] S. Dixon, "Automatic extraction of tempo and beat from expressive performances," Journal of New Music Research, vol. 30, no. 1, pp. 39-58, 2001.