Music Radar: A Web-based Query by Humming System

Lianjie Cao, Peng Hao, Chunmeng Zhou
Computer Science Department, Purdue University, 305 N. University Street, West Lafayette, IN 47907-2107
{cao62, pengh, zhoucm}@purdue.edu

Abstract. Query by humming (QBH) means searching for a piece of music by singing or humming it. Given a melody hummed by the user, a query-by-humming system returns a list of songs ordered by their similarity to the hummed query. Although many matching techniques for query by humming exist, our project builds a web-based query-by-humming system that can find a piece of music in a digital music repository from a few hummed notes, using a melody representation combined with pitch tracking. We evaluate the performance of our system on a public corpus provided by MIREX. The exact-match accuracy is 43.44%; if the criterion is relaxed to a top-10 ranking, the accuracy increases to 75.63%.

1 Introduction

Our project is to build a query-by-humming system (the QBH system), which can find a piece of music in a digital music repository from a few hummed notes. Even when the user does not know the title or any other textual information about the music, he or she can still search for it by humming the melody. Query by humming is a much friendlier interface than existing systems for music searching on the Internet. The system we built is web-based.

1.1 QBH System

Besides the application value described above, the query-by-humming system is also an interesting topic from a scientific point of view. Identifying a musical work from a melodic fragment is a task that most people can accomplish with relative ease.
However, how people achieve this is still unclear: how do people extract a melody from a complex piece of music and convert it into a representation that can be memorized and retrieved easily and accurately, with tolerance for transposition? Although this question as a whole is beyond the scope of our project, we build a system that performs like a human: it can extract melodies from music; it can convert
the melodies into an efficient representation and store them in its memory; and when a user asks for a piece of music by humming its melody, it can first hear the query and then search its memory for the piece it considers most similar. The main features of this system are:

- A melody representation that combines both pitch and rhythmic information.
- New approximate melody matching algorithms based on this representation.
- A set of automatic transcription techniques customized for the query-by-humming system to obtain both pitch and rhythmic information.
- A handy tool to build a melody database from MIDI files.
- A deliverable query-by-humming system including both the server application and the browser application.

1.2 MIDI File Format

A MIDI file is a file format used to store MIDI data (plus some other kinds of data typically needed by a sequencer). The format stores the standard MIDI messages (i.e., status bytes with appropriate data bytes) plus a time stamp for each message (i.e., a series of bytes that represents how many clock pulses to wait before "playing" the event). The format also allows saving information about tempo, time and key signatures, the names of tracks and patterns, and other information typically needed by a sequencer. One MIDI file can store information for numerous patterns and tracks, so any sequencer can preserve these structures when loading the file.

1.3 Related Work

Other researchers are also investigating the use of pitch tracking and dynamic-programming matching [5] for music retrieval. Brown [7] presented a way of using autocorrelation to determine the meter of music scores. Chai [8] attempted to extract several perceptual features from MIDI files. Most pre-existing query-by-humming systems use pitch contours to match similar melodies [9, 10, 11, 12].

2 System Architecture

Figure 1 gives an overview of the system architecture.
Figure 1 System Architecture

2.1 Browser Side Architecture

Note Segmentation: The purpose of note segmentation is to identify each note's onset and offset boundaries within the acoustic signal. To allow segmentation based on the signal's amplitude, we ask the user to sing using syllables like "da", thus separating notes by the short drop in amplitude caused by the stop consonant.

Pitch Tracking: This component estimates the pitch, or fundamental frequency, of a periodic or nearly periodic signal, usually a digital recording of a sung or hummed note. This can be done in the time domain, the frequency domain, or both.

Query Construction: The note information is converted to a string representation and sent to the server as an HTTP GET request, e.g.: http://localhost/query.php?seq=-121012-2

Report Generator: This parses the JSON object returned by the server and generates a ranked-list report for the user.
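As an illustration of the pitch-tracking component, the time-domain approach can be sketched in Python roughly as follows. The window length and the 50-1000 Hz search range follow the description in Section 3.2; the function name `estimate_f0` and the use of NumPy are our own, not part of the deployed browser code.

```python
import numpy as np

def estimate_f0(x, sr=44100, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental frequency of one analysis window via
    non-normalized autocorrelation r(d) = sum_n x[n] * x[n+d].
    Candidate lags are restricted to the 50-1000 Hz voice range."""
    max_lag = int(sr / fmin)   # longest period considered (lowest pitch)
    min_lag = int(sr / fmax)   # shortest period considered (highest pitch)
    r = np.correlate(x, x, mode='full')[len(x) - 1:]  # r[d] for lags d >= 0
    d = min_lag + np.argmax(r[min_lag:max_lag + 1])   # lag of the peak
    return sr / d

# A pure 440 Hz tone over one 30 ms window should yield f0 near 440 Hz.
sr = 44100
t = np.arange(int(0.03 * sr)) / sr
f0 = estimate_f0(np.sin(2 * np.pi * 440 * t), sr)
```

The autocorrelation of a periodic signal peaks at lags that are multiples of the period; because the non-normalized sum shrinks as the lag grows, the first period's peak dominates within the search range.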
Figure 2 Web-based QBH User Interface

2.2 Server Side Architecture

Music Database: This holds the source data: the original music corpora from which we extract melodies and generate the data representation. The source data are currently MIDI files (but the system can be extended to handle other music formats).

Melody Extractor: This extracts the melody information from a MIDI file and converts it into a string sequence representing the note information.

DB Matcher: This receives the query from the browser side, uses dynamic programming to match it against the melodies in the melody description objects, and returns a ranked list of matching songs to the user as a JSON object.
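To make the DB Matcher concrete, the following Python sketch ranks songs by the contour edit distance with free prefix skipping and free trailing deletions, using the cost scheme described in Section 3.3. The function names and the toy database are ours, for illustration only.

```python
def contour_distance(pattern, target):
    """Approximate substring edit distance between two pitch contours
    (sequences over {-2, -1, 0, +1, +2}). Skipping a prefix of the
    target is free, and once the whole pattern is consumed, the
    remaining target symbols can be deleted for free."""
    m, n = len(pattern), len(target)
    # D[i][j]: min cost of matching first i pattern vs first j target symbols
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                 # initial condition: D_{i,0} = i
    # D[0][j] stays 0: a target prefix may be skipped at no cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ins = D[i - 1][j] + abs(pattern[i - 1]) + 1
            rep = D[i - 1][j - 1] + abs(pattern[i - 1] - target[j - 1])
            if i == m:
                dele = D[i][j - 1]  # trailing target symbols are free
            else:
                dele = D[i][j - 1] + abs(target[j - 1]) + 1
            D[i][j] = min(ins, dele, rep)
    return D[m][n]

def rank_songs(query_contour, db):
    """Return song ids sorted by ascending contour distance (rank list)."""
    return sorted(db, key=lambda sid: contour_distance(query_contour, db[sid]))

# Toy database: a contour containing the query exactly should rank first.
db = {'songA': [0, 2, -1, 1], 'songB': [1, 1, 1]}
ranking = rank_songs([2, -1], db)
```

The server would serialize `ranking` (with the distances) as the JSON object returned to the browser.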
3 Implementation

3.1 Note Segmentation

The purpose of note segmentation is to identify each note in the acoustic signal in order to help pitch tracking. A note is a pitched sound, an atomic element of most Western music. After identifying a note, we can easily compute its pitch information, because each note should have a relatively constant frequency. Note segmentation also filters out most of the unvoiced periods.

Figure 3 Waveform

We can identify the onset and offset boundaries of a note by the amplitude of the sound. Assuming people hum the song with syllables like "da", or sing the song with words, there will usually be a short drop in amplitude between every two notes. First, we convert the waveform into an amplitude sequence by computing the spectrogram, using a window length of 512 samples with an overlap of 256 samples at a 44100 Hz sample rate. By summing the absolute values of the spectrum in each window, we obtain a sequence of amplitudes A.
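This amplitude computation can be sketched as follows. The 512-sample window and 256-sample overlap follow the text; the Hann window and the function name are our own assumptions.

```python
import numpy as np

def amplitude_envelope(x, win=512, hop=256):
    """Amplitude sequence A: for each 512-sample window (256-sample
    overlap), sum the magnitudes of the window's FFT spectrum."""
    amps = []
    for start in range(0, len(x) - win + 1, hop):
        frame = x[start:start + win] * np.hanning(win)  # assumed taper
        amps.append(np.abs(np.fft.rfft(frame)).sum())
    return np.array(amps)

# A quiet tone followed by a loud one: the envelope should rise.
sr = 44100
t = np.arange(sr // 10) / sr
x = np.concatenate([0.01 * np.sin(2 * np.pi * 440 * t),
                    np.sin(2 * np.pi * 440 * t)])
A = amplitude_envelope(x)
```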
Figure 4 Spectrogram

Second, we identify the onset and offset boundaries of each note based on the amplitude. The basic idea is to set a threshold; the intersections of the amplitude curve with the threshold are the boundaries. A fixed threshold for the whole query leads to poor segmentation, so we use dynamic thresholds. We first define a global threshold a_global = 0.3 * mean_w A(w), i.e., a scaled average of the amplitude sequence. Second, we divide the amplitude sequence into frames of length 80 ms and define the local threshold for the i-th frame F_i as a_i = max(a_global, 0.7 * mean_w F_i(w)). We then scan the amplitude sequence A. If no note is currently open and the current amplitude is greater than the local threshold of the current frame, we set the current position to be the onset boundary of a new note. If the onset boundary of the current note is more than 100 ms away from the current position and the current amplitude drops below the local threshold of the current frame, we set the current position to be the offset boundary of the current note.
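The scanning procedure above can be sketched in Python as follows, operating on a precomputed amplitude sequence A. The 80 ms frames and the 100 ms minimum note duration follow the text; the ~5.8 ms per amplitude sample corresponds to the 256-sample hop at 44100 Hz, and the function name is ours.

```python
import numpy as np

def segment_notes(A, ms_per_sample=256 / 44100 * 1000,
                  frame_ms=80, min_note_ms=100):
    """Onset/offset detection with a global threshold (0.3 x mean of A)
    and per-80ms-frame local thresholds max(global, 0.7 x frame mean)."""
    a_global = 0.3 * A.mean()
    frame_len = max(1, int(frame_ms / ms_per_sample))
    min_len = int(min_note_ms / ms_per_sample)
    notes, onset = [], None
    for w, amp in enumerate(A):
        frame = A[(w // frame_len) * frame_len:(w // frame_len + 1) * frame_len]
        a_local = max(a_global, 0.7 * frame.mean())
        if onset is None and amp > a_local:
            onset = w                         # note onset boundary
        elif onset is not None and w - onset >= min_len and amp < a_local:
            notes.append((onset, w))          # note offset boundary
            onset = None
    return notes
```

On a synthetic envelope with a single loud burst, the sketch returns one (onset, offset) pair at the burst's boundaries.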
Figure 5 Amplitude and Notes

The global threshold, the local thresholds, and the minimum duration of a note are chosen heuristically. Our algorithm may occasionally add or drop a note; we discuss the effects in the evaluation section.

3.2 Pitch Tracking

The primary goal of pitch tracking is to find the fundamental frequency of each note. Pitch is a subjective perception based on the frequency of the acoustic signal. The fundamental frequency, f_0, is the lowest frequency of a periodic waveform x; we use it as a proxy for pitch. The algorithm we use for pitch tracking is autocorrelation, the most popular time-domain method, which correlates the signal with a delayed copy of itself. First, we divide the waveform into windows with a length of 30 ms and an overlap of 20 ms. For each window, we compute the non-normalized autocorrelation up to the lag corresponding to 50 Hz (882 samples at a 44100 Hz sample rate):

r_N(d) = sum_{n=1}^{N-d} x(n) x(n+d),

where N is the frame size, d is the positive lag, and x(n) is the value of the n-th sample in the window. The fundamental frequency is selected as the lag where the maximum autocorrelation is reached between 50 Hz and 1000 Hz, the frequency range of the human voice. In the second step, we convert the fundamental frequency into a pitch, combined with the information from note segmentation. We first round the fundamental frequency into
the note number used in MIDI files according to the equation m = round(69 + 12 * log2(f_0 / 440)). Each number represents a semitone. We then choose the mode of the note numbers during a note period as its pitch.

Figure 6 Note Number and Pitch

3.3 Melody Matching

We compute the minimum edit distance between pitch contours for melody matching. The underlying principle is that most people are good at capturing the relative change of a tune rather than its absolute pitch. We first use a pitch contour to represent the pitch information from either the pitch tracking algorithm or the MIDI files. Second, we use dynamic programming to compute the minimum edit distance between two pitch contours: the smaller the distance, the more similar the two pieces of music are.

The pitch contour is a sequence of relative changes in pitch. Our method uses only 3-level contour information, i.e., the changes are represented by 0, ±1, and ±2. The pitch contour is computed for every note except the first, as follows. If a note has the same note number as the previous note, we represent it with 0. If a note is exactly one semitone higher (lower) than the previous note, we use +1 (-1). If a note is more than one semitone higher (lower) than the previous note, we use +2 (-2). Therefore, the sequence 53, 51, 45, 47, 50 is represented as -2, -2, +2, +2.

The second step is to compute the edit distance between two pitch contours. Let the pitch contour from the query be the pattern P and the pitch contour from one MIDI file be the target T. Let D_{i,j} be the minimum edit distance for matching the first i numbers of P against the first j numbers of T. The minimum edit distance can then be computed recursively:
D_{i,j} = min( D_{i-1,j} + cost_insert(P_i),
               D_{i,j-1} + cost_delete(T_j),
               D_{i-1,j-1} + cost_replace(T_j, P_i) )

Heuristically, we define the costs to be:

cost_insert(P_i) = |P_i| + 1
cost_delete(T_j) = |T_j| + 1
cost_replace(T_j, P_i) = |P_i - T_j|

Because the pattern usually matches only a part of the target, we should allow inserting numbers at the beginning and deleting numbers at the end at no cost. We therefore define the initial conditions to be:

D_{i,j} = 0 if i = 0, and D_{i,j} = i if j = 0.

And we modify the update function for the case i = |P|:

D_{i,j} = min( D_{i-1,j} + cost_insert(P_i),
               D_{i,j-1},
               D_{i-1,j-1} + cost_replace(T_j, P_i) )

4 Evaluation

Our system is evaluated on a public corpus, MIR-QBSH, from the Music Information Retrieval Evaluation eXchange (MIREX) campaign. The corpus contains 48 MIDI files as ground truth and 4431 queries created from 2003 to 2009 by 195 different people. Unlike general information retrieval systems, the number of relevant retrieved items is either 0 or 1 for a QBH system. And since our system returns a ranking of the MIDI files in the database based on their edit distance to the query, common measures such as precision and recall are not appropriate for evaluating a QBH system. We therefore choose a rank-based measure, Top-k-Accuracy, to present our experimental results. We define a query to be successful if the relevant MIDI file is within the top k items of the returned ranked list:

Top-k-Accuracy = n_k / N_q,

where n_k is the number of successful queries for k, and N_q is the total number of queries. Top-1-Accuracy is the proportion of exact matches. A Top-k-Accuracy closer to 1 indicates a better result. We also define the hardness of songs and the proficiency of singers.
Song hardness_j = (1 / N_j) * sum_i (1 / r_{ij})

Singer proficiency_i = (1 / M_i) * sum_j (1 / r_{ij})

Here r_{ij} is the rank of the i-th singer on the j-th song, N_j is the number of singers that have a query on the j-th song, and M_i is the number of songs that are queried by the i-th singer. A smaller hardness means the song is more difficult to perform, while a larger proficiency suggests the singer is more proficient. Table 1 shows the Top-k-Accuracy for different values of k.

K    Top-k-Accuracy
1    0.4344
5    0.6276
10   0.7563

Table 1 Top-k-Accuracy for Different k Values

Figures 7 and 8 show the song hardness and singer proficiency for the corpus we used. From our observation of the results, a few factors may affect the matching quality:

- Out-of-tune singing: if the singer cannot follow the changes of the tune, our system can hardly produce a good match.
- Low voice quality, especially when the singer's voice is overwhelmed by background noise.
- Short queries: the system needs a sufficiently long query to match the unique contour of a song. The exact minimum duration largely depends on the song and the quality of the query; the average query duration is around 6-8 seconds.
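The Top-k-Accuracy measure is straightforward to compute from the rank of the relevant song for each query; a minimal sketch follows. The rank values below are made-up illustrative numbers, not our corpus results.

```python
def top_k_accuracy(ranks, k):
    """Fraction of queries whose relevant song appears within the top k
    items of the returned rank list (ranks are 1-based)."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the relevant MIDI file for eight queries.
ranks = [1, 3, 12, 2, 7, 1, 25, 4]
acc1 = top_k_accuracy(ranks, 1)    # exact matches only
acc10 = top_k_accuracy(ranks, 10)  # relaxed to the top 10
```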
Figure 7 Hardness of All Songs in the Corpus

Figure 8 Proficiency of All Singers

5 Conclusion

In this project, we built a web-based query-by-humming system using pitch tracking and a dynamic-programming matching method. For future work, we need to test our system on different classes of music (e.g., pop music, country music). We can also try
other matching methods, such as DTW (Dynamic Time Warping) and HMM (Hidden Markov Model), and compare their performance.

References

1. Asif Ghias, Jonathan Logan, David Chamberlin, and Brian C. Smith. Query by humming: musical information retrieval in an audio database. In ACM Multimedia, 1995.
2. Jyh-Shing Roger Jang. MIR Corpora: http://mirlab.org/dataset/public/mir-qbsh-corpus.rar
3. MATLAB and MIDI: http://www.kenschutte.com/midi
4. Pitch Detection: http://note.sonots.com/scisoftware/pitch.html
5. Wei Chai. Melody Retrieval On The Web. Master's thesis, M.I.T. Media Laboratory, Massachusetts Institute of Technology, Fall 2000.
6. MIDI: http://en.wikipedia.org/wiki/midi
7. Judith C. Brown. Determination of the meter of musical scores by autocorrelation. J. Acoust. Soc. Am. 94(4), Oct. 1993.
8. Wei Chai and Barry Vercoe. Using user models in music information retrieval systems. In Proc. International Symposium on Music Information Retrieval, Oct. 2000.
9. A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith. Query by humming: musical information retrieval in an audio database. In ACM Multimedia, pages 231, 1995.
10. D. Q. Goldin and P. C. Kanellakis. On similarity queries for time-series data: constraint specification and implementation. In Proceedings of the 1st International Conference on Principles and Practice of Constraint Programming (CP'95), 1995.
11. J. M. Hellerstein, J. F. Naughton, and A. Pfeffer. Generalized search trees for database systems. In U. Dayal, P. M. D. Gray, and S. Nishio, editors, Proc. 21st Int. Conf. on Very Large Data Bases (VLDB), pages 562-573. Morgan Kaufmann, 1995.
12. J.-S. R. Jang and H.-R. Lee. Hierarchical filtering method for content-based music retrieval via acoustic input. In Proceedings of the Ninth ACM International Conference on Multimedia, pages 401-410. ACM Press, 2001.