Popular Song Summarization Using Chorus Section Detection from Audio Signal

Sheng Gao and Haizhou Li
Institute for Infocomm Research, A*STAR, Singapore
gaosheng@i2r.a-star.edu.sg, hli@i2r.a-star.edu.sg
2015 IEEE 17th International Workshop on Multimedia Signal Processing (MMSP), Oct. 19-21, 2015, Xiamen, China.

Abstract— A music signal is a one-dimensional temporal sequence, which makes it difficult for listeners to quickly locate the most attractive parts of a popular song unless they play it through to the end. To improve the listening experience, music summarization, a tool that summarizes a song by its most attractive sections, is needed. In this paper, a system is presented that summarizes popular songs by detecting the chorus sections in the audio signal. The proposed summarization system combines a new audio feature representation, octave-dependent probabilistic latent semantic analysis (OdPlsa), with a chorus detection algorithm consisting of repeated-segment extraction and chorus identification. Summarization performance is evaluated on a song database with ground-truth chorus sections, i.e. the start and end timestamp of each chorus. To the best of our knowledge, this is the first systematic evaluation of music summarization performance. In terms of multiple metrics, including boundary accuracy, precision, recall and F1, the proposed system clearly outperforms widely used methods.

I. INTRODUCTION

A music signal is a one-dimensional temporal sequence. This makes it difficult for listeners to quickly find the most attractive parts of a popular song unless they play it sequentially to the end. Listening to a whole song of several minutes can be tedious, and although a listener can skip around with the fast-forward and backward controls of the player, it is still hard to land exactly on the beginning of an interesting segment. Therefore, to improve the listening experience, music summarization, which condenses a song into its most attractive sections, is needed.

Music is a highly structured signal. Artists exploit repeated lyrics (sometimes with a few word changes), theme, tone, etc. to express their emotions and ideas. In the signal domain, the repeated time-frequency patterns can be identified visually in the spectrogram. Among the different parts of a popular song, the chorus sections are the most important. The chorus carries the main idea, or big picture, of what is expressed lyrically and musically; it is repeated throughout the song, and its melody and lyrics rarely vary (see http://en.wikipedia.org/wiki/song_structure). The chorus sections are therefore good candidates for music summarization.

In general, a popular song has two modalities: text (the lyrics) and audio. To use lyrics for chorus detection, the boundaries of the lyrics (phrases or sentences) must be known in advance, which is not always available for popular songs. In contrast, the audio signal is always available. From reading the spectrum, it is observed that repeated audio segments have similar time-frequency patterns, even if there is some noise caused by modifications of the lyrics, tone or accompanying instruments. At the frame level (a few milliseconds), the chorus is indistinguishable from the other parts of the song.
Over a longer temporal window (e.g. 1 second), however, it becomes quite distinguishable. In this paper, we exploit this observation from spectrum reading and visual analysis in the design of a chorus detection system. Starting from the widely used chroma feature [9, 12, 13], we develop a novel octave-dependent probabilistic latent semantic analysis (OdPlsa) to analyse the audio signal, on which a chorus detection system is built.

Section II discusses related work. Section III presents the octave-dependent probabilistic latent semantic analysis (OdPlsa) in detail. Section IV introduces the chorus detection algorithm. Section V reports the experiments that evaluate the chorus-detection-based music summarization system. Finally, we summarize our work.

II. RELATED WORK

An audio signal, like speech, is a one-dimensional time series. Feature extraction techniques from speech processing can therefore be applied to music information retrieval (MIR) tasks such as audio similarity measurement, audio classification and music structure analysis (e.g. [4], [12], [15], [16] and [22]). A comprehensive survey can be found in [15]. Among speech features, the Mel-frequency cepstral coefficient (MFCC) has proved useful in MIR tasks. For example, MFCCs are used in [2] to measure audio similarity for music summarization, and in [7] for chorus detection and emotion classification. However, MFCCs are sensitive to key and accompaniment changes. To address these issues, the chroma (pitch class profile) feature was proposed and has become an effective representation for music signals [6][8][9][13][18-21]. It is a 12-dimensional vector corresponding to the 12 distinct semitones of music theory.

Chroma characterizes the magnitude distribution over the 12 semitones of each frame. It is obtained by mapping the FFT frequency bands to semitones and, for each semitone, combining the energies across octaves; several implementations exist. The chroma feature is widely used in chorus detection, music segmentation, cover song detection, etc.

Despite these successes, chroma has two drawbacks for characterizing music signals. First, it only describes the frame-level magnitude distribution and ignores the longer-range magnitude distribution within the song, which limits its discriminative power. Second, the magnitudes are fused linearly across octaves, without considering how well the semitones of each octave represent the signal. These two issues concern the spectrum distribution at the song level and the relation between octaves. They motivate us to investigate octave-dependent probabilistic latent semantic analysis (OdPlsa), an extension of probabilistic latent semantic analysis [1]. OdPlsa learns latent clusters by analysing the magnitude distribution of the semitones along the octave and temporal dimensions, and then represents each audio segment in the space of these latent clusters. It exploits the music structure in semitone, octave and time, which distinguishes it from existing latent component techniques for audio analysis. For example, in [10] PLSA is applied directly to model the occurrence of symbolic audio tokens in a song. In [3], shift-invariant PLSA is used to model the magnitude distribution over the 12 semitones and the time axis, and the shift property of the model can handle key changes.

Following the steps in [9, 13], chorus detection is carried out on the extracted audio features. Many papers have been published on chorus detection or music structure detection [6-9, 11-15, 18-21]. First, the pair-wise similarity between audio segments is calculated. Repeated segments are expected to have high similarity values, while dissimilar segments have low values. When the similarity matrix is viewed as an image, repeated segments form lines parallel to the diagonal. However, because the audio features are imperfect, these lines are sometimes not observable: the repeated segments may not have similarity values significantly higher than those of dissimilar segments, which makes line extraction difficult. In [9], Goto applies a 2D filter to enhance the likely similar candidate points and suppress the others, which enhances the lines, and he also designs methods to measure how likely a repeated segment is to be a chorus. Similarly, the researchers in [9, 18, 20] exploit various heuristic algorithms to find the most significant lines in the matrix. In general, chorus detection is heuristic and unsupervised.

III. OCTAVE-DEPENDENT PROBABILISTIC LATENT SEMANTIC ANALYSIS (ODPLSA)

A. OdPlsa-based Audio Analysis

Because OdPlsa models the spectrogram, FFT-based short-time frequency analysis is first applied to the audio signal. Each FFT frequency band is then mapped to its corresponding MIDI note, whose magnitude is a weighted sum over all related FFT bands; the process is similar to [9]. Each MIDI note is identified by its octave level and semitone, so the FFT sequence in the octave space is denoted as X(o, f, t), where o is the octave level, f is the semitone in [0, 11], and t is the time stamp (in frames) in [0, T-1], with T the sequence length.

The short-time frame feature (16 ms in this paper) has weak discriminative power, so the sequence is segmented with a window of L consecutive frames. The sequence thus becomes a sequence of L-frame chunks. The i-th chunk is the subsequence X(o, f, t), t = iL, ..., (i+1)L - 1.

In OdPlsa, each component pattern models the magnitude distribution over both the semitone and time dimensions (an L-frame window). To simplify the notation, semitone and time are combined into a single variable f. Without confusion, the chunk sequence is still written as X(o, f, t), where t now indexes the t-th chunk and f refers to a specific semitone-time location in the 12 x L rectangle.

The occurrence probability of the spectrum at the o-th octave and f-th location of the t-th chunk, P(o, f, t), is modelled by a mixture of K components:

P(o, f, t) = \sum_{k=1}^{K} P(f|k) P(k|o, t) P(o, t)    (1)

In this model, P(f|k) models the magnitude distribution of the k-th pattern, P(k|o, t) tells us how the patterns are distributed across the octaves in each chunk, and P(o, t) describes the importance of each octave in each chunk. The log-likelihood of the model generating the observation sequence X(o, f, t) is

\mathcal{L} = \sum_{o, f, t} X(o, f, t) \log \sum_{k} P(f|k) P(k|o, t) P(o, t)    (2)

To estimate the pattern models, an auxiliary posterior variable P(k|o, f, t) (abbreviated P(k|.)) is introduced into Eq. (2), giving the lower bound

Q = \sum_{o, f, t} X(o, f, t) \sum_{k} P(k|.) \log \left[ P(f|k) P(k|o, t) P(o, t) \right]    (3)

The traditional EM algorithm can then be applied to estimate the model parameters P(f|k), P(k|o, t) and P(o, t). In the E-step, the posterior is estimated as

P(k|o, f, t) = P(f|k) P(k|o, t) P(o, t) / Z    (4)

In Eq. (4), Z is a normalization constant. In the M-step, the model parameters are re-estimated as

P(f|k) = \sum_{o, t} X(o, f, t) P(k|o, f, t) / Z_k    (5)

P(k|o, t) = \sum_{f} X(o, f, t) P(k|o, f, t) / Z_{o,t}    (6)

P(o, t) = \sum_{f, k} X(o, f, t) P(k|o, f, t) / Z_t    (7)

Similar to Eq. (4), Z_k, Z_{o,t} and Z_t in the above equations are also normalization constants.
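To make the estimation procedure concrete, the following is a minimal NumPy sketch of the EM updates in Eqs. (4)-(7) as reconstructed above. It is an illustrative sketch, not the authors' implementation; the array names, the random initialization and the fixed iteration count are assumptions.

```python
import numpy as np

def odplsa(X, K=8, n_iter=50, seed=0):
    """EM for octave-dependent PLSA (illustrative sketch).

    X: non-negative array of shape (O, F, T) holding the magnitude at
       semitone-time location f of octave o in chunk t (F = 12 * L).
    Returns P(f|k) of shape (K, F), P(k|o,t) of shape (K, O, T) and
    P(o,t) of shape (O, T).
    """
    rng = np.random.default_rng(seed)
    O, F, T = X.shape
    eps = 1e-12

    # Random initialization of the three factors, each properly normalized.
    Pf_k = rng.random((K, F)); Pf_k /= Pf_k.sum(axis=1, keepdims=True)
    Pk_ot = rng.random((K, O, T)); Pk_ot /= Pk_ot.sum(axis=0, keepdims=True)
    P_ot = X.sum(axis=1) + eps; P_ot /= P_ot.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # E-step, Eq. (4): posterior P(k|o,f,t) proportional to P(f|k) P(k|o,t) P(o,t).
        joint = Pf_k[:, None, :, None] * Pk_ot[:, :, None, :] * P_ot[None, :, None, :]
        post = joint / (joint.sum(axis=0, keepdims=True) + eps)        # (K, O, F, T)

        # M-step, Eqs. (5)-(7): accumulate the posterior weighted by the magnitudes.
        W = X[None, :, :, :] * post                                     # (K, O, F, T)
        Pf_k = W.sum(axis=(1, 3)); Pf_k /= Pf_k.sum(axis=1, keepdims=True) + eps   # Eq. (5)
        Pk_ot = W.sum(axis=2);     Pk_ot /= Pk_ot.sum(axis=0, keepdims=True) + eps  # Eq. (6)
        P_ot = W.sum(axis=(0, 2)); P_ot /= P_ot.sum(axis=0, keepdims=True) + eps    # Eq. (7)

    return Pf_k, Pk_ot, P_ot
```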

B. Audio Representation in the OdPlsa Latent Space

The component pattern model in OdPlsa is P(f|k). It describes a magnitude distribution over two axes, in the current implementation the 12 semitones and an L-frame time window, and it is shared by all octaves. The patterns can be thought of as the most prominent structures occurring in the music signal, such as frequently repeated melodies or chords. Figure 1 shows an example of a spectrogram and the patterns learned from one song.

From the learned model we can derive how the patterns are distributed over the octaves of each chunk:

P(k, o, t) = P(k|o, t) P(o, t)    (8)

P(t) = \sum_{k, o} P(k, o, t)    (9)

P(k, o|t) = P(k, o, t) / P(t)    (10)

Each chunk of music can then be represented in the audio pattern space as

V(t) = [ P(k, o|t) ],  k = 1..K, o = 1..O    (11)

Fig. 1. Illustration of patterns learned from the spectrum of a song: (a) spectrum in MIDI notes (a blue row means the MIDI note is missing due to the FFT resolution); (b) learned audio patterns P(f|k) from the song in (a), reshaped to the 12-semitone by L-frame window (K = 8, L = 20).

If the octave size is O, the feature dimension is K x O. This representation depends on the octave. If an octave-invariant feature is desired, the pattern distribution can be summed over all octaves:

\bar{V}(t) = [ \sum_{o} P(k, o|t) ],  k = 1..K    (12)

IV. MUSIC SUMMARIZATION WITH CHORUS SECTIONS

The task of chorus detection is to extract the repeated chorus sections from a single song. The chorus sections have similar melodies over time despite small changes of lyrics or accompaniment, so temporal information is expected to play an important role. As discussed in the previous section, OdPlsa can find temporal patterns in the signal, so we apply the OdPlsa algorithm to learn the latent space of a single song and represent each audio segment in that space (see Section III). Then the pair-wise segment similarity is calculated (here the negative Euclidean distance is used). The similarity matrix is symmetric, so only the diagonal and the lower-triangle points are considered (see Fig. 2a).

When the similarity matrix is visualized, two repeated segments should in theory form an observable line parallel to the diagonal of the 2D image, while non-repeated pairs should have very low values. In practice it is not so simple. If one chorus section of the song almost duplicates the previous one (same lyrics, accompaniment, etc.), the line is noticeable in the image. But if a chorus section has a few changes relative to the previous one, such as small modifications of lyrics, accompaniment or tone, its similarity values will not be significantly higher than those of non-repeated segments, and the line will not be observable (this is the case in Fig. 2a).

We first summarize the steps of the chorus detection algorithm in Table I; each step is then explained below. In order to detect the possible lines in the image formed by the similarity matrix, we studied the distribution of the similarity values and found it to be nearly Gaussian. We therefore normalize the similarity values by their mean and variance and keep only the top-n values for further processing (Table I, step 1). The top-n values are set to one (white points in Fig. 2b) and the others to zero (black points in Fig. 2b), producing a binary image (Fig. 2b) in which the diagonal lines are more obvious than in the original similarity matrix (Fig. 2a).
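To illustrate how the per-chunk representation of Eq. (11) and step 1 of Table I fit together, here is a small sketch assuming the OdPlsa factors returned by the EM sketch above; the quantile-based threshold that keeps roughly the top 2% of points is an illustrative implementation choice.

```python
import numpy as np

def chunk_features(Pk_ot, P_ot):
    """Octave-dependent chunk features, Eq. (11): V(t) = [P(k, o | t)]."""
    P_o_t = P_ot / (P_ot.sum(axis=0, keepdims=True) + 1e-12)   # P(o|t), shape (O, T)
    Pko_t = Pk_ot * P_o_t[None, :, :]                          # P(k, o|t), shape (K, O, T)
    K, O, T = Pko_t.shape
    return Pko_t.reshape(K * O, T).T                           # (T, K*O), one row per chunk

def binary_similarity_image(V, keep_ratio=0.02):
    """Step 1 of Table I: negative-Euclidean similarity, normalization by
    mean and standard deviation, and binarization keeping the top points."""
    diff = V[:, None, :] - V[None, :, :]
    S = -np.sqrt((diff ** 2).sum(axis=-1))               # pair-wise similarity
    Z = (S - S.mean()) / (S.std() + 1e-12)               # z-normalized scores
    tril = np.tril(np.ones_like(Z, dtype=bool), k=-1)    # lower triangle only
    theta1 = np.quantile(Z[tril], 1.0 - keep_ratio)      # keep roughly the top 2%
    B = np.zeros(Z.shape, dtype=np.uint8)
    B[tril & (Z > theta1)] = 1
    return B
```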
The second step (Table I, step 2) is to find all possible lines along the diagonals. Each diagonal of the binary image is a zero-one sequence, in which a one means the corresponding pair of segments is similar and a zero means it is dissimilar. A run of consecutive ones forms a candidate line, and each line corresponds to two similar song sections (with beginning and ending timestamps). In practice, too many short lines are extracted. To reduce their effect, we add constraints on the minimal length of a section and on how much overlap between two similar sections is allowed. The resulting candidate lines are marked in green in Fig. 2b, with their beginnings and endings marked by red crosses.
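The following sketch illustrates step 2: scanning each diagonal of the binary image for runs of ones, bridging small gaps with a smoothing threshold and discarding runs shorter than a minimum length. The gap and length thresholds (gap_chunks, min_len) are illustrative stand-ins for the smoothing and length thresholds of Table I.

```python
def find_diagonal_lines(B, gap_chunks=2, min_len=10):
    """Step 2 of Table I: extract candidate repetition lines from the binary
    similarity image B (lower triangle; B[i, j] = 1 means chunks i and j are
    similar). Returns (lag, start, end) tuples: chunks start..end repeat the
    passage that occurred `lag` chunks earlier."""
    n = B.shape[0]
    lines = []
    for lag in range(1, n):                            # one diagonal per lag
        diag = [B[i, i - lag] for i in range(lag, n)]
        runs, start, last_one = [], None, None
        for idx, v in enumerate(diag):
            if v:
                if start is None:                      # open a new run
                    start = idx
                elif idx - last_one - 1 > gap_chunks:  # gap too large: close the run
                    runs.append((start, last_one))
                    start = idx
                last_one = idx
        if start is not None:
            runs.append((start, last_one))
        for s, e in runs:
            if e - s + 1 >= min_len:                   # drop short lines
                lines.append((lag, s + lag, e + lag))
    return lines
```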

From these candidate lines the chorus sections are finally selected (Table I, steps 3 and 4). When the lines are analysed, some of them, i.e. their corresponding sections, overlap, so a heuristic method is developed to merge the overlapping lines: two lines are merged if the overlap ratio of their corresponding sections exceeds a threshold (e.g. 80%). After this step the number of lines is significantly reduced. From the remaining lines we obtain the corresponding sections; for each section we record its length and the number of its repetitions in the song. These two values indicate whether the section is a chorus: a chorus section should repeat more than twice, and its length should exceed a threshold such as 10 seconds. Accordingly, the sections with the maximal number of repetitions whose lengths satisfy the constraint are chosen as the chorus. In Fig. 2c, the green lines mark the detected chorus sections, which are then used to summarize the song.

TABLE I
CHORUS DETECTION ALGORITHM

Input: similarity matrix S (see Fig. 2a); N: the number of audio chunks; S(i, j): the (i, j)-th entry of S.
Output: chorus sections.
1. Find the points most likely to lie on similar segments and binarize:
   B(i, j) = 1 if (S(i, j) - \mu) / \sigma > \theta_1, and B(i, j) = 0 otherwise,
   where \mu is the mean and \sigma the standard deviation of the similarity scores. \theta_1 is set so that the top 2% of points are kept. Only the 1-valued points are considered in the next steps (see Fig. 2b).
2. Find all lines along the diagonals under two conditions (see Fig. 2b): a) if the gap between two neighbouring points along a diagonal is less than a smoothing threshold \theta_2, merge them with the previous points; b) if the line length is less than a threshold \theta_3, ignore it.
3. Merge overlapping lines among the candidates and generate new lines. Keep only the top-n (= 30) longest lines for the next step.
4. Collect all sections defined by the lines above and count the number of repetitions of each section. Select the longest sections with more than two repetitions as chorus sections (each section is marked with its starting and ending timestamps) (see Fig. 2c).

Fig. 2. Visualizing chorus detection, from the similarity matrix to the chorus sections: (a) similarity matrix (only the lower-triangle part); (b) significant points (white dots) and detected lines (green), with red crosses marking their starts and ends; (c) detected chorus sections (green lines).

V. EXPERIMENTAL RESULTS

The evaluation dataset contains 247 popular songs of widely varying styles, provided by our industry collaborator. For each song, the boundaries of all chorus sections were manually annotated; on average there are about 2.6 chorus sections per song.

A. Evaluation Metrics

Performance is reported with several metrics that reflect different aspects of the detection system.
a) n-second starting-time accuracy (M1): if the absolute difference between the starting time of any detected chorus section of a song and a ground-truth starting time is less than n seconds, the chorus of the song is considered correct.
b) n-percent overlap accuracy (M2): if the overlap between any detected chorus section and a ground-truth chorus section is more than n percent of the length of the ground truth, the song's chorus is considered correct.
c) n-second starting-time precision, recall and F1 (M3): these are based on the n-second starting-time criterion, which tells us whether each detected section is a true positive or a false positive, and thus gives a more complete picture of the detected sections.
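As a concrete reading of metrics M1 and M2, the sketch below scores a single song given detected and ground-truth chorus sections as (start, end) pairs in seconds. The function names and the per-song boolean return values are illustrative assumptions, not the authors' evaluation code.

```python
def m1_start_time_correct(detected, ground_truth, n_seconds=1.0):
    """M1: the song is correct if some detected chorus starts within
    n seconds of some ground-truth chorus start."""
    return any(abs(d_start - g_start) < n_seconds
               for d_start, _ in detected
               for g_start, _ in ground_truth)

def m2_overlap_correct(detected, ground_truth, n_percent=0.8):
    """M2: the song is correct if some detected chorus overlaps a
    ground-truth chorus by more than n_percent of the ground-truth length."""
    for g_start, g_end in ground_truth:
        g_len = g_end - g_start
        for d_start, d_end in detected:
            overlap = min(d_end, g_end) - max(d_start, g_start)
            if g_len > 0 and overlap / g_len > n_percent:
                return True
    return False
```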

B. Experimental Setup

Each song is 16-bit, 8 kHz sample rate, mono. The 1024-point FFT spectrum (128 ms frame length) is extracted with 112 ms overlap between consecutive frames. Only the FFT bands between 32.7 Hz (C1) and 4000 Hz (B7), covering 7 octaves, are considered. The OdPlsa-based feature is extracted as in Section III (K = 8) with 20-frame (0.32 s) audio chunks without overlap. The benchmark system is based on the 12-dimensional chroma feature implemented as in [9]. In addition, the effect of the first-order difference feature, \Delta x(t) = (x(t+1) - x(t-1)) / 2, is also studied, given its success in speech recognition [16]. The sequence is segmented into 0.64 s sections with 0.32 s overlap, and the pair-wise segment similarity scores are calculated (the similarity function is the negative Euclidean distance). The chorus detection algorithm (Table I) is then run to find all chorus sections.
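To make the front end concrete, here is a sketch of mapping an FFT magnitude spectrogram to the octave-semitone representation X(o, f, t) over the C1-B7 range described above, together with the first-order difference feature. The nearest-MIDI-note binning is an illustrative simplification of the weighted mapping of Section III-A, and the parameter names are assumptions.

```python
import numpy as np

def spectrogram_to_octave_semitone(fft_mag, sr=8000, n_fft=1024,
                                   fmin=32.7, fmax=4000.0):
    """Map an FFT magnitude spectrogram of shape (n_frames, n_fft//2 + 1)
    to X of shape (7, 12, n_frames): magnitude per octave (C1..B7),
    semitone and frame. Each bin is assigned to its nearest MIDI note."""
    n_bins = n_fft // 2 + 1
    freqs = np.arange(n_bins) * sr / n_fft
    n_frames = fft_mag.shape[0]
    midi_min, n_octaves = 24, 7                          # MIDI 24 = C1
    X = np.zeros((n_octaves, 12, n_frames))
    for b in range(1, n_bins):                           # skip the DC bin
        f = freqs[b]
        if f < fmin or f > fmax:
            continue
        midi = int(round(69 + 12 * np.log2(f / 440.0)))  # nearest MIDI note
        note = midi - midi_min
        if 0 <= note < n_octaves * 12:
            X[note // 12, note % 12, :] += fft_mag[:, b]
    return X

def first_order_delta(V):
    """First-order difference feature over chunk feature vectors V of shape
    (T, D): delta(t) = (V(t+1) - V(t-1)) / 2, with edge replication."""
    padded = np.vstack([V[:1], V, V[-1:]])
    return (padded[2:] - padded[:-2]) / 2.0
```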
C. Comparison between OdPlsa and Chroma

The results are reported in Fig. 3 in terms of 1-second starting-time accuracy (M1), 80% overlap accuracy (M2) and 1-second starting-time F1 (M3). Compared with the chroma-based benchmark, the relative gains reach 18.4% (M1), 34.4% (M2) and 39.8% (M3): the OdPlsa-based feature significantly outperforms the popular chroma feature. The improvement comes from the fact that the OdPlsa feature characterizes the pattern distribution in both the octave and temporal dimensions, as well as the distribution of each frequency band at the song level, whereas chroma only describes the distribution over the frequency bands. These results demonstrate that OdPlsa is a powerful tool for audio content representation.

Fig. 3. Performance comparison between OdPlsa and the chroma benchmark on M1, M2 and M3.

D. Effect of the n-th-Order Difference Feature

In speech recognition, n-th-order difference features improve accuracy. Here we study the effect of the first-order difference feature on chorus detection with the OdPlsa feature. The comparison is shown in Fig. 4. Adding the first-order feature gives relative improvements of 13.3% (M1), 9.7% (M2) and 5.3% (M3), so the first-order OdPlsa feature is a useful addition to the chorus detection system. We did not observe further benefits from higher-order features.

Fig. 4. Effect of the first-order feature on performance with OdPlsa (OdPlsa with vs. without the first-order feature).

E. Effect of Different Implementations

Section III-B discussed different implementations for representing an audio segment. In the experiments above, the OdPlsa feature is calculated according to Eq. (11): the octave effect is included, and the feature dimension is the product of the number of latent clusters and the octave size. The OdPlsa feature can also be computed according to Eq. (12), i.e. ignoring the octave effect. Table II shows the performance of the two implementations, with and without the first-order feature; OdPlsa_OctaveInd denotes the feature computed according to Eq. (12). The octave-dependent OdPlsa feature is clearly better than the octave-independent one, and the first-order feature also improves the octave-independent system, consistent with the conclusion of the previous subsection.

TABLE II
PERFORMANCE OF DIFFERENT IMPLEMENTATIONS OF THE ODPLSA FEATURE (ODPLSA_OCTAVEIND VS. ODPLSA)

                               M1      M2      M3
Without delta
  OdPlsa_OctaveInd           27.53   47.77   15.09
  OdPlsa                     36.44   71.26   23.93
With delta
  OdPlsa_OctaveInd           31.58   68.02   19.16
  OdPlsa                     41.30   78.14   25.19

VI. CONCLUSION

In this paper, a chorus-detection-based music summarization system has been presented, built on a novel octave-dependent PLSA (OdPlsa) feature extraction algorithm and a dedicated chorus detection method. The performance of music summarization is reported on a popular-song database with ground-truth chorus sections. To the best of our knowledge, this is the first time chorus detection performance has been reported on a database of a few hundred songs. Our experiments show that the proposed technique and system are superior to the widely used chroma-feature-based system.

ACKNOWLEDGEMENTS

We thank Lei Jia and Hui Song from Baidu Inc. for supporting the project and kindly providing the music dataset.

REFERENCES
[1] T. Hofmann, "Probabilistic latent semantic indexing," Proc. of ACM SIGIR '99.
[2] M. Cooper and J. Foote, "Automatic music summarization via similarity analysis," Proc. of ISMIR '02.
[3] R. J. Weiss and J. P. Bello, "Unsupervised discovery of temporal structure in music," IEEE Journal of Selected Topics in Signal Processing, 5(6):1240-1251, Oct. 2011.
[4] D. Turnbull, G. Lanckriet, E. Pampalk and M. Goto, "A supervised approach for detecting boundaries in music using difference features and boosting," Proc. of ISMIR '07.
[5] J. Serra, M. Muller, P. Grosche and J. Ll. Arcos, "Unsupervised detection of music boundaries by time series structure features," Proc. of AAAI '12.
[6] J. V. Balen, J. A. Burgoyne, F. Wiering and R. C. Veltkamp, "An analysis of chorus features in popular song," Proc. of ISMIR '13.
[7] C.-H. Yeh, Y.-D. Lin, M.-S. Lee and W.-Y. Tseng, "Popular music analysis: chorus and emotion detection," Proc. of APSIPA '10.
[8] A. Eronen, "Chorus detection with combined use of MFCC and chroma features and image processing filters," Proc. of DAFx '07.
[9] M. Goto, "A chorus section detection method for musical audio signals and its application to a music listening station," IEEE Trans. on Audio, Speech, and Language Processing, Vol. 14, No. 5, Sept. 2006.
[10] P. Smaragdis, M. Shashanka and B. Raj, "Topic models for audio mixture analysis," Proc. of NIPS Workshop on Applications for Topic Models: Text and Beyond, 2009.
[11] C. Burges, D. Plastina, J. Platt, E. Renshaw and H. S. Malvar, "Using audio fingerprinting for duplicate detection and audio thumbnails," Proc. of ICASSP '05.
[12] B. McFee and D. P. W. Ellis, "Learning to segment songs with ordinal linear discriminant analysis," Proc. of ICASSP '14.
[13] M. Goto, "SmartMusicKIOSK: music listening station with chorus-search function," Proc. of ACM Symposium on User Interface Software and Technology (UIST), 2003.
[14] M. Muller, P. Grosche and N.-Z. Jiang, "A segment-based fitness measure for capturing repetitive structures of music recordings," Proc. of ISMIR '11.
[15] M. Schedl, E. Gómez and J. Urbano, "Music information retrieval: recent developments and applications," Foundations and Trends in Information Retrieval, Vol. 8, Issue 2-3, pp. 127-261, 2014.
[16] L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
[17] A. Bosch, A. Zisserman and X. Munoz, "Scene classification via pLSA," Proc. of ECCV '06.
[18] M. A. Bartsch and G. H. Wakefield, "To catch a chorus: using chroma-based representations for audio thumbnailing," Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.
[19] J. Paulus, M. Muller and A. Klapuri, "Audio-based music structure analysis," Proc. of ISMIR '10.
[20] V. Mildner, P. Klenner and K.-D. Kammeyer, "Chorus detection in songs of pop music," Proc. of ESSV '03.
[21] N. C. Maddage, C.-S. Xu, M. S. Kankanhalli and X. Shao, "Content-based music structure analysis with applications to music semantics understanding," Proc. of ACM MM '04.
[22] P. Golik, B. Harb, A. Misra, M. Riley, A. Rudnick and E. Weinstein, "Mobile music modeling, analysis and recognition," Proc. of ICASSP '12.
[23] J. Arenas-Garcia, A. Meng, K. B. Petersen, T. Lehn-Schioler, L. K. Hansen and J. Larsen, "Unveiling music structure via pLSA similarity fusion," Proc. of IEEE Workshop on Machine Learning for Signal Processing, 2007.
[24] S. Ravuri and D. P. W. Ellis, "Cover song detection: from high scores to general classification," Proc. of ICASSP '10.