A Music Retrieval System Using Melody and Lyric

2012 IEEE International Conference on Multimedia and Expo Workshops

A Music Retrieval System Using Melody and Lyric

Zhiyuan Guo, Qiang Wang, Gang Liu, Jun Guo, Yueming Lu
Pattern Recognition and Intelligent System Laboratory
Key Laboratory of Trustworthy Distributed Computing and Service, Ministry of Education
Beijing University of Posts and Telecommunications, Beijing, China
E-mail: guozhiyuan.cathie@gmail.com

Abstract—Using melody and/or lyric to query a music retrieval system is convenient for users but challenging for developers. This paper proposes efficient schemes for realizing the key algorithms of such a system. Specifically, we characterize our system, which adds lyric information to the query, as follows: a Support Vector Machine (SVM) is employed to distinguish humming queries from singing queries; for a singing query, the lyrics of candidates pre-selected by a commonly used melody matching method are used to dynamically build up the recognition network; and a novel fusion strategy, based on the classification confidence, is proposed to combine the lyric and melody scores. Experimental results show that error reduction rates of 22.9%, 25.0%, 28.7% and 33.5% in mean reciprocal rank (MRR) are achieved by the proposed method for four existing query-by-singing/humming (QBSH) systems, respectively.

Keywords—QBSH; SVM; isolated-word recognition; music retrieval

I. INTRODUCTION

Query-by-singing/humming (QBSH) systems, which help users find the songs they want from singing or humming queries, provide an intuitive and practical way to retrieve music. In the past decades, substantial research has been devoted to QBSH systems [1-10], and various effective matching methods have been proposed, such as dynamic time warping (DTW) [2], linear scaling (LS) [3], recursive alignment (RA) [4], and earth mover's distance (EMD) [5]. However, most QBSH systems use only melody features, and lyric information is ignored. Moreover, many researchers believe that singing queries are more difficult to handle than humming queries, because the speech in the singing audio reduces the accuracy of melody feature extraction, which is closely tied to the quality of a QBSH system. To deal with this problem, Haus et al. [6] applied signal processing techniques to singing queries to extract melody features more accurately.

In fact, lyric information is very useful for song identification. Guo et al. [11] developed a music retrieval system using spoken lyric queries. Since most users are non-professional singers, it is very likely that the input singing/humming queries contain errors and biases. In this case, QBSH systems based only on melody may fail to retrieve the correct song. Clearly, the lyric provides complementary information for song identification when the input query is sung.

Using both melody and lyric in a QBSH system is intuitive but challenging for researchers. Firstly, because only singing queries include lyric information, there is a risk of extracting false lyrics, which do not actually exist in humming queries; this would lead to a serious deterioration of performance. Secondly, it is difficult to extract lyric features from singing queries, because the speech is deformed. A few research studies have been devoted to melody- and lyric-based QBSH systems. Suzuki et al. [7] proposed a QBSH method based on both lyric and melody information, but it cannot handle humming queries, which contain no lyric information.
To solve this issue, Wang et al. [8] used a singing/humming discriminator (SHD) to distinguish humming queries from singing queries. They first converted the query into a phone string using a phone-level continuous speech recognizer and then counted the number of distinct phones in the string. Considering that a singing query usually has more distinct phones than a humming query, they classified the input query as humming or singing. A singing query was then converted into a syllable string, and each candidate obtained a lyric score from a syllable-level recognizer. This method provided a slight improvement. However, the processing time increased greatly because two recognition procedures were added. Moreover, since the classification accuracy depended heavily on the phone recognition results, which unfortunately were not accurate enough, the improvement in retrieval accuracy was insignificant.

This paper proposes a novel QBSH method using both melody and lyric information. Different from Wang's method, we use a well-trained SVM to identify singing queries, and a dynamically constructed isolated-word recognizer to recognize the lyric of a singing query. Moreover, a robust fusion method, based on the classification confidence, is used to combine the lyric and melody scores. Experimental results show that our classifier significantly outperforms the classifier proposed by Wang. Error reduction rates of 22.9%, 25.0%, 28.7% and 33.5% in mean reciprocal rank (MRR) are achieved by the proposed method for four existing QBSH systems, respectively.

The remainder of this paper is organized as follows: Section II gives an overview of the proposed QBSH system. The proposed method is introduced in Section III.

Section, the experimental results are demonstrated. The conclusion follows in Section. II. OVERVIEW OF THE PROPOSED QBSH SYSTEM Fig. gives an overview of the proposed QBSH system, which processes as follows: Step : We use a melody matching method to sort music in the database. Music clips, which have top K highest melody scores, are selected as candidates. Four different methods were used in our experiments, viz., DTW [2], LS [3], RA [4] and EMD [5], which have been proved effective in QBSH systems. Step 2: A well-trained SVM is employed to classify the input query into the categories of humming or singing. Step 3: If the query is classified as a humming query, the ranked candidates according to the melody scores will be returned to the user. Step 4: If the query is classified as a singing query, a dynamically constructed isolated-word recognizer is employed to assign lyric scores for all candidates, and then two kinds of scores including lyric and melody scores are fused. The ranked results will be returned to the user according to the combined scores. III. THE PROPOSED QBSH METHOD A. Melody retrieval Melody retrieval aims at finding the probable candidate clips, which are most similar to the query in respect of melody. Many melody matching methods can achieve this goal. Most of them, such as LS, DTW, and RA, can calculate the distance between two pitch sequences. Others, such as EMD, can calculate the distance between two note sequences. A clip, which has a smaller distance with the query, is considered to be more similar to the query, and it obtains a higher melody score. Let Q and P represent the query and a clip in the database respectively. It should to be noted that both them have been converted into pitch or note sequences according to the adopted melody matching method. Let D Q,P represents the distance between the query Q and the clip P. MS(P), the melody score of P, can be calculated by (). Clips, which have the top K largest melody scores, are selected as candidates. MS( P) = () D Q, P We provide a brief description of the four commonly used melody matching methods. ) DTW: Dynamic time warping (DTW) [2] is a pitch based matching method. The distance between two pitch sequences S = p, p,..., p ), S = q, q,..., q ) can be ( 2 n 2 ( 2 m iteratively calculated by (2). Here, n and m mean the length of S, S 2 respectively. p i is the i-th pitch of S, and q SVM Training Corpus Lyrics Database Query SVM Lyrics Recognition Lyrics Scores Score Fusion Singing Melody Retrieval Humming Candidate Fragments Melody Scores Result MIDI Database Figure. The framework of the proposed QBSH system is the -th pitch of S 2. d(i, ) is the cost associated with p i and q, which can be defined as: d(i, )= p i - q -, where is the absolute value operation and is a constant. D(i, ) represents the minimum distance from the start point to the lattice point (i, ). Obviously, D(n, m) is the distance between S and S 2. D( i 2, ) ( i, ) = d( i, ) + min D( i, ) D( i, 2) D (2) 2) LS: Linear scaling (LS) [3] is a simple but effective pitch-based melody matching method. The main idea of this method is rescaling the input audio, based on the analysis that the length of the input audio is not always equal to the corresponding part in the MIDI data. LS choose different rescaling factors to stretch or compress the pitch contour of the input audio to more accurately match the correct part in the MIDI file. 
2) LS: Linear scaling (LS) [3] is a simple but effective pitch-based melody matching method. The main idea is to rescale the input audio, based on the observation that the length of the input audio is not always equal to that of the corresponding part in the MIDI data. LS chooses different rescaling factors to stretch or compress the pitch contour of the input audio so that it matches the correct part in the MIDI file more accurately. The most appropriate rescaling factor yields the minimum distance between the input audio and the music clip in the database.

3) RA: Recursive alignment (RA) [4] is another pitch-based melody matching method. Since linear scaling cannot solve the problem of nonlinear alignment, RA addresses it in a top-down fashion that is better able to capture long-distance information in human singing. This method differs from DTW in that it starts the optimization from a global view. RA uses LS as a subroutine and tries to tune the local matching recursively in order to optimize the alignment. Further details may be found in [4].

4) EMD: Earth mover's distance (EMD) [5] measures the minimal cost that must be paid to transform one distribution into the other. Melody matching can be naturally cast as a transportation problem by defining one clip as the supplier and the other as the consumer. To obtain the EMD between the input query and a candidate clip, we need to convert the clip into a set of notes with weights. Let P = {(p_1, ω_{p_1}), (p_2, ω_{p_2}), ..., (p_n, ω_{p_n})} be the note set of a candidate clip acting as the supplier, where p_i is a note occurring in the candidate and ω_{p_i} is the duration of p_i. Similarly, let Q = {(q_1, ω_{q_1}), (q_2, ω_{q_2}), ..., (q_m, ω_{q_m})} represent the query acting as the demander. The EMD, which here represents the melody distance between the two clips, can be quickly calculated by many existing algorithms.
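As a rough illustration only (the paper does not name the EMD solver it uses), the sketch below treats the note sets P and Q as weighted one-dimensional distributions and computes the distance with SciPy's Wasserstein distance, which normalizes the duration weights before matching. The note and duration values are illustrative.

from scipy.stats import wasserstein_distance

# Candidate clip P: notes (e.g., MIDI numbers) with durations as weights.
clip_notes      = [60, 62, 64, 65]
clip_durations  = [0.5, 0.5, 1.0, 1.0]
# Query Q: notes transcribed from the sung/hummed input.
query_notes     = [60, 62, 64]
query_durations = [0.6, 0.4, 1.2]

# EMD between the two weighted note sets; the melody score then follows (1).
emd = wasserstein_distance(clip_notes, query_notes, clip_durations, query_durations)
melody_score = 1.0 / emd if emd > 0 else float("inf")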

B. SVM-based Singing/Humming Classification

The SVM [12] has attracted many researchers due to its excellent performance on a wide range of classification problems. It has been reported that SVMs can achieve performance greater than or equal to that of other classifiers while requiring significantly less training data. In our system, we use an SVM to distinguish humming clips from singing clips. The SVM is trained on 30 humming clips and 30 singing clips. All the training data and input audio were segmented into 0.25-second frames with 50% overlap, and 32-dimensional features were then extracted for each frame.

1) Features: The 32-dimensional features, which include a one-dimensional zero-crossing rate, one-dimensional spectral energy, one-dimensional spectral centroid, one-dimensional spectral bandwidth, eight-dimensional spectral band energy, eight-dimensional sub-band spectral flux, and twelve-dimensional mel-frequency cepstral coefficients (MFCCs), are extracted to represent each frame in this work.

2) Singing/Humming classification using SVM: In the classification process, we first segment the input audio into frames and then classify each frame as singing or humming using the trained SVM. To mitigate the impact of inevitable classification errors, a median filter is used to smooth the classification result contour. Fig. 2 shows the results for the first 90 frames of a query before and after smoothing. The width of the filter window is set to 3 in this example. It can be seen that two jitters are removed.

Figure 2. An example of smoothing results. On the vertical axis, 1 represents humming and -1 represents singing. The left panel shows the initial classification results of the first 90 frames of the input query, and the right panel shows the smoothed results.

Let N_s denote the number of singing frames of an input query, counted from the SVM classification results after smoothing, and let N_h denote the number of humming frames. N_s/(N_s + N_h) is the proportion of singing frames in the input query; a larger value indicates a higher probability that the input query is sung. It is important to note, however, that different misclassifications have different costs. The cost of a humming query being misclassified as singing is larger than the cost of a singing query being misclassified as humming: if a humming query is misclassified as singing, lyric information that does not actually exist will be extracted, which leads to deterioration. In the opposite situation, a singing query can still find its corresponding song using only the melody information even if it is misclassified as humming. We should therefore improve the classification accuracy of the singing category. A threshold T_s (0.5 ≤ T_s < 1) is used to handle this: the input query is classified as a singing query when N_s/(N_s + N_h) is greater than or equal to T_s. A larger value of T_s leads to a higher classification accuracy for singing; that is to say, the classification results for singing are more reliable.
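A minimal sketch of the frame-level decision logic described above, assuming frame features have already been extracted and an SVM has been trained with classes encoded as -1 (singing) and +1 (humming). scikit-learn and SciPy are used here only for illustration; the paper does not name a toolkit, and feature extraction is omitted.

import numpy as np
from scipy.signal import medfilt
from sklearn.svm import SVC

def is_singing_query(frame_features, svm: SVC, window=3, t_s=0.55):
    # Classify each 0.25 s frame with the trained SVM (-1 = singing, +1 = humming).
    labels = svm.predict(frame_features)
    # Median-filter the label contour to remove isolated jitters
    # (window width 3, as in the paper's example).
    smoothed = medfilt(labels.astype(float), kernel_size=window)
    n_s = np.sum(smoothed == -1)          # singing frames
    n_h = np.sum(smoothed == 1)           # humming frames
    confidence = n_s / float(n_s + n_h)   # proportion of singing frames
    # The query is treated as singing only if the confidence reaches T_s (0.5 <= T_s < 1).
    return confidence >= t_s, confidence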
C. Lyric Recognition

If the query clip is classified as singing, a lyric recognizer is used to assign a lyric score to each candidate clip selected by the melody matching method. Since the melody matching method has located each candidate clip in its corresponding song, it is easy to obtain its lyric. By treating each lyric as a word, an isolated-word recognition network can easily be constructed. Fig. 3 shows the structure of the recognition network, which has K paths representing the K candidate lyrics; K is usually between 20 and 100. The isolated-word recognizer uses continuous-density hidden Markov models with cross-word, context-dependent, tied-state triphones, and a 39-dimensional MFCC feature vector is extracted from each frame for recognition. When the recognition process finishes, each word obtains a posterior probability, and the lyric score of a candidate is the posterior probability of its corresponding lyric. An isolated-word recognizer performs better than a continuous speech recognizer in this system, since the lyric of the input singing query is one of the K candidate lyrics. Due to the simplicity of the recognition network, the lyric recognition is fast and accurate.

Figure 3. The lyric recognition network: K parallel paths (Lyrics 1, Lyrics 2, ..., Lyrics K-1, Lyrics K) between the Begin and End nodes.
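The paper does not detail how the per-path posteriors are computed; a common choice for an isolated-word network is to normalize the acoustic log-likelihoods of the K competing paths. The sketch below shows only that normalization; the HMM decoding itself (and whatever decoder produces the log-likelihoods) is assumed and not reproduced here.

import numpy as np

def lyric_scores(log_likelihoods):
    # log_likelihoods[k]: acoustic log-likelihood of the query audio under the
    # network path for candidate lyric k (produced by the isolated-word decoder).
    x = np.asarray(log_likelihoods, dtype=float)
    x -= x.max()                      # stabilize the exponentials
    post = np.exp(x) / np.exp(x).sum()
    return post                       # post[k] serves as LS(c_k), the lyric score

# Toy usage for K = 3 candidate lyrics:
print(lyric_scores([-1200.0, -1185.0, -1230.0]))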

D. Combination of melody and lyric scores

A score-level fusion strategy is proposed to combine the lyric and melody scores for the candidate clips of a singing query. Various score-level fusion rules have been proposed [9], such as the MIN, MAX, SUM, PRODUCT, and Weighted SUM rules: MIN selects the minimum of the scores, MAX selects the maximum, PRODUCT multiplies the scores, SUM adds them, and Weighted SUM computes a weighted sum of the scores. In the proposed QBSH system we use the Weighted SUM rule, which we have verified achieves the best performance among these rules. The final score of a candidate clip is calculated as follows:

CS(c_j) = p · MS(c_j) + (1 - p) · LS(c_j)    (3)

where c_j represents the j-th candidate, MS(c_j) is the melody score of c_j, LS(c_j) is its lyric score (as mentioned in Section III-C, the lyric score of a candidate is the posterior probability of its corresponding lyric), and CS(c_j) is the fused score. The weight coefficient p is determined empirically.

Furthermore, the QBSH system deteriorates when a humming query is wrongly classified as a singing query. The classification confidence of singing is therefore used to weight the lyric score. The improved score-level fusion method is as follows:

CS(c_j) = p · MS(c_j) + (1 - p) · (N_s / (N_s + N_h)) · LS(c_j)    (4)

where N_s is the number of frames classified as singing for the query and N_h is the number of frames classified as humming, so that N_s/(N_s + N_h) represents the confidence that the input query is sung. The improved fusion method is more robust against classification errors.

IV. EXPERIMENTS

A. Experimental Data and Setup

The MIREX (Music Information Retrieval Evaluation eXchange) QBSH corpus released by Jang [13] is used to evaluate the proposed method. The corpus includes 48 MIDI files and 4431 singing or humming queries, all recorded from the beginning of the song. We added 1000 MIDI files to the MIREX corpus to compose the MIDI database, and the lyrics database consists of the lyrics of all songs in the MIDI database. Since our lyric recognizer is for Mandarin, we selected the 878 queries belonging to Chinese songs in the corpus. Of these, 60 queries, including 30 humming clips and 30 singing clips, were randomly selected to train the SVM; the remaining 818 clips, including 417 singing queries and 401 humming queries, compose the test set. The acoustic model (AM) of the lyric recognizer was trained on the Chinese speech recognition corpus of the 863 Program [14], a database provided by the Chinese National High Technology Project 863 for Chinese LVCSR system development; all the audio in that corpus is normal speech. All experiments were conducted on a PC platform in C++.

B. Evaluation Metrics

The evaluation measures are the top-M hit rate and the mean reciprocal rank (MRR). Let r_i denote the rank of the correct song for the i-th query; the top-M hit rate is the proportion of queries for which r_i ≤ M. MRR is the average of the reciprocal ranks across all queries and is calculated as (5), where n is the number of queries and rank_i is the rank of the correct song corresponding to the i-th query.

MRR = (1/n) Σ_{i=1}^{n} 1/rank_i    (5)
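A minimal sketch of the confidence-weighted fusion in (3)-(4) and of the metrics in (5). The function and argument names are illustrative, and the value p = 0.7 is only a placeholder; the paper determines p empirically.

def fused_score(melody_score, lyric_score, n_s, n_h, p=0.7):
    # Eq. (4): weight the lyric score by the singing confidence N_s / (N_s + N_h).
    # With the confidence fixed to 1 this reduces to the plain weighted sum in (3).
    confidence = n_s / float(n_s + n_h)
    return p * melody_score + (1.0 - p) * confidence * lyric_score

def mean_reciprocal_rank(ranks):
    # Eq. (5): MRR = (1/n) * sum(1 / rank_i) over all n queries.
    return sum(1.0 / r for r in ranks) / len(ranks)

def top_m_hit_rate(ranks, m):
    # Proportion of queries whose correct song is ranked within the top M.
    return sum(1 for r in ranks if r <= m) / float(len(ranks))

# Toy usage: three queries whose correct songs were ranked 1st, 3rd and 2nd.
print(mean_reciprocal_rank([1, 3, 2]), top_m_hit_rate([1, 3, 2], m=2))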
C. Singing/humming classification results using SVM

Table I shows the singing/humming classification accuracies of Wang's method and of the proposed SVM-based method with different values of T_s (see Section III-B). The second column of the table gives the results of Wang's method [8]. Here, the classification accuracy for singing (or humming) is defined as the proportion of correctly classified clips among the clips classified as singing (or humming). An overall classification accuracy of 89.27% is obtained when the threshold T_s is set to 0.5. It can be seen that the proposed SVM-based classifier significantly outperforms Wang's method, not only in overall classification accuracy but also in the classification accuracy of the singing category, which is more important in a QBSH system according to the analysis in Section III-B. Moreover, our method can easily control the classification accuracy of the singing category by setting different values of T_s: the classification accuracy for singing increases as T_s increases.

TABLE I. THE HUMMING AND SINGING CLASSIFICATION ACCURACIES.

Category   SHD [8]   T_s=0.50   T_s=0.55   T_s=0.60   T_s=0.65
Humming    54.80%    70.36%     68.6%      65.54%     63.05%
Singing    95.43%    96.57%     97.2%      97.30%     97.64%

D. The performances using different lyric recognition methods

All 417 singing clips are used to compare our lyric recognition method (with different values of K) and Wang's method. Fig. 4 shows the experimental results. SCSR is short for syllable-based continuous speech recognition, which is used by Wang et al. [8]; the other four curves correspond to our lyric recognition method, where K is the number of candidates selected by RA [4]. In these experiments, only the lyric scores are used to rank the candidates.

Figure 4. The retrieval performances using only lyric scores derived by different recognition methods. The vertical axis shows the hit rate; the horizontal axis shows the top-T candidates and MRR.

As can be seen, the proposed isolated-word lyric recognizer is much more effective than the continuous speech recognizer used in Wang's method [8]. Besides, recognition is approximately three times faster than with SCSR. It should be noted that the top-20 hit rate drops when the value of K increases from 10 to 20. A smaller value of K means fewer competing paths in the network, which helps recognition; but if K is too small, the right lyric may not be included among the K candidate lyrics, which inevitably leads to recognition errors.

E. Retrieval accuracy of the proposed QBSH systems

The melody retrieval part of our system can adopt any existing melody matching method. We realize four systems, namely Melody&Lyric RA, Melody&Lyric DTW, Melody&Lyric LS and Melody&Lyric EMD, by using RA, DTW, LS and EMD as the melody matching method, respectively. Fig. 5 and Fig. 6 show the performance of these four systems (K = 50 and T_s = 0.55 for all four systems); the axes in Fig. 5 and Fig. 6 have the same meaning as in Fig. 4. As can be seen, Melody&Lyric RA, which uses both lyric and melody information, performs better than Melody RA, which is based only on melody information. The same conclusion holds for the other three systems.

Figure 5. The performance of RA, DTW and the combinations with lyric recognition.

Figure 6. The performance of LS, EMD and the combinations with lyric recognition.

From Fig. 5 and Fig. 6, we can see that among the four melody matching methods RA performs best, DTW is second, LS is third and EMD performs worst. After adding lyric information, however, the corresponding four systems achieve error reduction rates of 22.9%, 25.0%, 28.7% and 33.5%, respectively. This indicates that the worse the performance of the melody-only system is, the greater the improvement will be.

V. CONCLUSION

In this paper, we proposed a novel QBSH method by adding lyric information. An SVM classifier was used to identify singing queries, an isolated-word recognizer was used to recognize the lyrics, and a fusion method was proposed to combine the melody and lyric scores. Our experiments demonstrate that this method achieves promising results on the test data. Our current lyric recognizer is designed for Mandarin and cannot handle English songs; we will try to develop a Mandarin-English bilingual recognizer to solve this issue in the future.

ACKNOWLEDGMENT

This work was partially supported by the project under Grant No. B08004, a key project of the Ministry of Science and Technology of China under Grant No. 2012ZX0300209-002, the Innovation Fund of the Information and Communication Engineering School of BUPT in 2011, the National High Technology Research and Development Program (863) of China under Grant No. 2011AA01A205, and the Next-Generation Broadband Wireless Mobile Communications Network Technology Key Project under Grant No. 2011ZX03002-005-01.

REFERENCES

[1] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, Query by humming: Musical information retrieval in an audio database, Proc. ACM Multimedia, pp. 231-236, 1995.
[2] J. S. R. Jang and M. Y. Gao, A query-by-singing system based on dynamic programming, Proc. International Workshop on Intelligent Systems Resolutions (the 8th Bellman Continuum), pp. 85-89, Hsinchu, Taiwan, Dec. 2000.
[3] J. S. R. Jang, H. Lee, and M. Kao, Content-based music retrieval using linear scaling and branch-and-bound tree search, Proc. ICME, 2001.
[4] X. Wu, M. Li, J. Liu, J. Yang, and Y. Yan, A top-down approach to melody match in pitch contour for query by humming, Proc. International Conference of Chinese Spoken Language Processing, 2006.
[5] S. Huang, L. Wang, S. Hu, H. Jiang, and B. Xu, Query by humming via multiscale transportation distance in random query occurrence context, Proc. ICME, 2008.
[6] G. Haus and E. Pollastri, An audio front end for query-by-humming systems, Proc. ISMIR, 2001.
[7] M. Suzuki, T. Hosoya, A. Ito, and S. Makino, Music information retrieval from a singing voice using lyrics and melody information, EURASIP Journal on Advances in Signal Processing, vol. 2007, 2007.
[8] C. C. Wang, J. S. R. Jang, and W. Wang, An improved query by singing/humming system using melody and lyrics information, Proc. ISMIR, 2010.
[9] G. P. Nam, T. T. T. Luong, and H. H. Nam, Intelligent query by humming system based on score level fusion of multiple classifiers, EURASIP Journal on Advances in Signal Processing, vol. 2011, 2011.
[10] Q. Wang, Z. Guo, G. Liu, J. Guo, and Y. Lu, Query by humming by using locality sensitive hashing based on combination of pitch and note, Proc. International Conference on Multimedia & Expo Workshops (ICMEW), 2012.
[11] Z. Guo, Q. Wang, G. Liu, and J. Guo, A music retrieval system based on spoken lyric queries, International Journal of Advancements in Computing Technology, in press.
[12] B. E. Boser, I. Guyon, and V. Vapnik, A training algorithm for optimal margin classifiers, Proc. COLT, pp. 144-152, 1992.
[13] http://www.music-ir.org/mirex/wiki/2010:Query_by_Singing/Humming
[14] Y. Qian, S. Lin, Y. Zhang, Y. Liu, H. Liu, and Q. Liu, An introduction to corpora resources of 863 program for Chinese language processing and human machine interaction, Proc. ALR2004, affiliated to IJCNLP, 2004.