IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING

Note Segmentation and Quantization for Music Information Retrieval

Norman H. Adams, Student Member, IEEE, Mark A. Bartsch, Member, IEEE, and Gregory H. Wakefield, Member, IEEE

Abstract: Much research in music information retrieval has focused on query-by-humming systems, which search melodic databases using sung queries. The database retrieval aspect of such systems has received considerable attention, but query processing and the melodic representation have not been examined as carefully. Common methods for query processing are based on musical intuition and historical momentum rather than specific performance criteria; existing systems often employ rudimentary note segmentation or coarse quantization of note estimates. In this work, we examine several alternative query processing methods as well as quantized melodic representations. One common difficulty with designing query-by-humming systems is the coupling between system components. We address this issue by measuring the performance of the query processing system both in isolation and coupled with a retrieval system. We first measure the segmentation performance of several note estimators. We then compute the retrieval accuracy of an experimental query-by-humming system that uses the various note estimators along with varying degrees of pitch and duration quantization. The results show that more advanced query processing can improve both segmentation performance and retrieval performance, although the best segmentation performance does not necessarily yield the best retrieval performance. Further, coarsely quantizing the melodic representation generally degrades retrieval accuracy.

Index Terms: Music information retrieval, pitch, pitch quantization, query-by-example, segmentation.

I. INTRODUCTION

Rapid searching of databases of music is one of the primary goals of music information retrieval (MIR). Part of the burgeoning field of content-based information retrieval, MIR research looks to organize and mine databases of music around their audio content. Many of us have experienced the frustration of knowing what a piece of music sounds like, but not knowing the artist or title of the piece. Query-by-humming (QBH) systems attempt to solve this problem by enabling database searches that use a fragment of sung melody as input.

Query-by-humming is a particularly active area of research in the MIR community. Early QBH successes [1], [2] have suggested numerous alternative systems [3]-[7], which upon further investigation have yielded results that are often inconclusive or difficult to generalize [3], [8].

Manuscript received May 3, 2004; revised October 12. This work was supported in part by a National Science Foundation Graduate Research Fellowship and a Graduate Assistance in Areas of National Need Fellowship, as well as by grants from the National Science Foundation (IIS) and the MusEn Project at the University of Michigan through the Office of the Vice President for Research. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gerald Schuller. N. H. Adams and G. H. Wakefield are with the Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, MI 48109 USA (e-mail: nhadams@umich.edu; ghw@umich.edu). M. A. Bartsch is with ATK Mission Research, Albuquerque, NM USA (e-mail: bartscma@ieee.org). Digital Object Identifier 10.1109/TSA
To date, most QBH research has adhered to the paradigm of implementing a complete QBH system and comparing its performance to that of other systems, a paradigm motivated in part by emerging applications in the music industry. Complete contemporary systems are described in [3]-[5], [9], and [10]. Several commercial systems are available online as well; Melodyhound [11] and Musicline [12] are two well-known systems that accept acoustic input for searching databases of folk and pop/rock themes. While this research approach has yielded some powerful examples, it has not provided a means for directly understanding how to improve any one example. This follows, in part, from the highly coupled nature of complete QBH systems: comparing only the final retrieval performance of independently developed QBH systems provides little understanding of the relationships between their constituent components. The focus of the present work is to explore one such component, sung query coding, and to measure the performance of various query coders both in isolation and when coupled with other components of a QBH system.

The performance of QBH systems depends upon three elements: the query processing component, the retrieval component, and the query representation. The query processing component encodes the sung query into an efficient representation for search and retrieval. Much of the research in QBH has focused on improving retrieval performance by examining either different retrieval systems or melodic representations [3], [7], [10], [13]-[16]. The query processing component, however, has received relatively little attention. As such, we apply to sung query coding several techniques that have found success in other speech and acoustic applications.

The most common query representation is a sequence of pitch and duration pairs [3]-[5], [9], [13]-[15]. While there is no evidence that this query representation is optimal in any specific technical sense, it is musically intuitive, and many existing databases of themes use this or a similar representation. That relatively little attention is given to the coding of sung queries perhaps reflects the belief that robust automatic sung melody transcription is intractable. Nevertheless, it is often observed that humans can accurately transcribe sung melodies that automatic systems cannot; this has motivated the incorporation of physiological models into sung melody transcription systems [6], [17]. Recently, the de facto reliance on this representation has been subject to more careful scrutiny, and alternative query representations are considered in [7], [10], [16], [18]. In the present work we restrict our attention to the domain of note representations.

Unfortunately, poorly articulated queries with suboptimal intonation are the rule for query-by-humming systems, a fact which makes robust query transcription difficult.

In particular, query segmentation is one of the most challenging aspects of this problem [3], [10], [19]. In many cases the majority of QBH retrieval errors result from note segmentation errors [20], [21]. Many existing QBH systems circumvent this problem by requiring the user to articulate each sung note with a separate "da" or "ta," in which case a simple amplitude threshold is effective [1], [2], [4], [5], [9], [13]. To segment the audio signal into separate musical note events, other solutions employ a metronome that the user must either sing along to or provide as input while singing [4]. In general, however, it is desirable to place no performance restrictions on the user, particularly if the intended user is not a trained singer. Unfortunately, most QBH systems use rudimentary segmenters that do not perform well with naturally sung queries.

To compensate for singer and transcription errors, many researchers have investigated the use of melodic representations that are robust to such errors [1]-[3], [9], [14], [22]. These robust representations apply a coarse quantization early in query processing. Such a scheme might represent a note using only one of two durations ("short" and "long") or one of three pitches (a note's pitch being higher than, lower than, or equal to the previous note's pitch). Singer and transcription errors will have a less detrimental effect on retrieval performance if all pitches or durations within some range are quantized to the same level. Robust representations can also reduce the computational complexity of the retrieval system. From a signal processing perspective, though, the notion that discarding information should improve retrieval performance is counterintuitive, particularly given the absence of a pitch-contour model of query production.

Nonetheless, such a quantization has been incorporated into the recent MPEG-7 standard. MPEG-7 was designed specifically to facilitate content-based information retrieval [23], reflecting the growing interest in mining the content of multimedia databases. QBH applications in particular were taken into account in the MPEG-7 standard [9], [24]. Included in the standard are Descriptors reserved for representing the main melody, or theme, of the audio file. These Descriptors include the fundamental frequency contour as well as a coarse quantization of the note sequence. This coarse representation was first proposed in the MIR community [14]; pitch differences between successive notes are quantized to a five-level codebook [9], [23].

In the present work, we examine two hypotheses. First, we propose that the use of more advanced query processing methods derived from standard signal processing techniques can improve both the segmentation and the retrieval performance of QBH systems. Second, we suggest that robust (i.e., coarsely quantized) melodic representations do not improve retrieval performance for QBH systems. To test these hypotheses, we first compare the performance of several query segmenters. We then couple the query processors with a simple retrieval system and measure the overall classification accuracy of the QBH system. Finally, we compare the classification accuracy for different query processing methods as well as various degrees of query quantization.

Fig. 1. Block diagram of query-by-humming system. Labels indicate locations where performance is measured.
There is no consensus on how best to quantify the performance of a melody transcription system; it is unclear what exactly constitutes a good transcription. Accordingly, we augment the performance statistics of the segmenters with those of the retrieval system as an alternative measure of transcription quality. The final retrieval performance can be interpreted as a measure of how close a transcription system places sung queries to the intended melody.

This paper presents an extension to work originally presented in [25]. The following section describes the methodology for testing our hypotheses. Sections III and IV describe the segmentation and quantization methods we implement. Sections V and VI present and discuss our results, and Section VII concludes the paper.

II. METHODOLOGY

The QBH system used for testing our hypotheses is shown in Fig. 1. The system consists of two primary components. The first is the query processing component, which estimates the sequence of notes sung by the user from a recorded acoustic signal. Each note consists of a (pitch, duration) pair, where pitch denotes the note's real-valued MIDI pitch number.¹ The duration of a note is taken to be the time between the start of the note and the start of the next note, often referred to as the inter-onset interval (IOI) [3]. The estimated note sequence is then classified by the retrieval component as one of the possible targets.

This work is predicated on the assumption that queries should be transcribed using primarily pitch information. Singers can sing the same melody with varying amplitude envelopes, lyrics, and style; we desire our QBH system to be invariant to such variables. As such, the first step in estimating the sequence of notes is to estimate the pitch contour of the sung query. A time-domain method is used to track the fundamental frequency contour [26], [27]. This algorithm computes the autocorrelation for overlapping windows of recorded data. The bias of the window function is mitigated by the normalization $\hat{r}(\tau) = r_x(\tau)/r_w(\tau)$, where $r_x(\tau)$ is the autocorrelation of the windowed data and $r_w(\tau)$ is the autocorrelation of the window function. A set of candidate peaks is selected for every frame and the Viterbi algorithm is used to construct a smooth contour. We use a step-size of 10 ms throughout this work.

¹The real-valued MIDI pitch number $p$ is related to a signal's fundamental frequency in Hz, $f$, as $p = 60 + 12\log_2(f/261.63)$. A MIDI pitch difference of one is referred to as a semitone.
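As an illustration of this front end, the following minimal sketch converts a fundamental frequency estimate to a real-valued MIDI pitch and computes the bias-corrected autocorrelation of one analysis frame. It is not the authors' implementation; the function names and the peak-picking details are our own.

```python
import numpy as np

def hz_to_midi(f):
    """Real-valued MIDI pitch from frequency in Hz (middle C, 261.63 Hz -> 60)."""
    return 60.0 + 12.0 * np.log2(f / 261.63)

def normalized_autocorr(frame, window):
    """Autocorrelation of a windowed frame, divided by the autocorrelation of
    the window itself to mitigate the window bias, as described above."""
    n = len(frame)
    x = frame * window
    r_x = np.correlate(x, x, mode="full")[n - 1:]            # lags 0 .. n-1
    r_w = np.correlate(window, window, mode="full")[n - 1:]
    return r_x / np.maximum(r_w, 1e-12)                      # avoid divide-by-zero

# Candidate fundamental periods are then taken from the peaks of this normalized
# autocorrelation for each 10 ms step, and a smooth contour is chosen by the
# Viterbi algorithm (not shown).
```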

Let $x(n)$ be the pitch contour computed for a sung query. Note boundaries must be detected in this contour. If the singer sustains constant pitch and transitions between notes instantly, the pitch contour $x(n)$ is piecewise constant, and its first-difference could be used to detect note boundaries. However, such ideal pitch contours are unrealistic. Untrained singers in particular will transition slowly between notes, taking as much as 200 ms to slide or scoop between notes. Furthermore, the pitch will typically fluctuate within note boundaries, as with vocal vibrato. The goal of the note segmenter is to detect legitimate note boundaries while neglecting spurious fluctuations in the pitch contour. Section III presents seven different note estimators, emphasizing note segmentation. The note pitches and durations are then quantized for use by the retrieval system; the quantizers are described in Section IV.

To achieve reasonable robustness to common errors such as note insertions and deletions, the retrieval system uses a classifier that computes the edit distance [28], [29] between the quantized query and each target in a database of songs. A query is classified as the target song which has the smallest edit distance. Edit costs are assigned to produce a low-complexity approximation to a dynamic time warping distance metric. Inserting or deleting a note with pitch and duration $(p, d)$ yields a cost equal to the duration of that note,

$c_{id}(p, d) = d.$   (1)

Replacing this note with a note having pitch and duration $(p', d')$ has cost

$c_{rep} = |d - d'| + \frac{|p - p'|}{\gamma}(d + d'),$   (2)

where $\gamma$ is the cross-over pitch difference that relates the replacement cost to the insertion and deletion costs: replacing two equal-duration notes with a pitch difference equal to $\gamma$ has cost equal to that of an insertion-deletion pair. A suitable value of $\gamma$ was found empirically. In order to compensate for global pitch offsets between the query and target, we iteratively subtract the mean pitch difference between the aligned sequences and then realign the sequences using the edit distance algorithm.

For performance evaluation, we employ a query database containing many sample queries of fourteen popular tunes, from the Beatles' "Hey Jude" to Richard Rodgers' "The Sound of Music." A total of 480 queries were collected from fifteen participants in our study. Each participant was asked to sing a familiar portion of a subset of the fourteen tunes four times. The participants had a variety of musical backgrounds; some had considerable musical or vocal training while most had none at all. Participants were instructed to sing each query as naturally as possible using the lyrics of the tune.² The queries are monophonic, 16-bit recordings sampled at 44.1 kHz and resampled to 8 kHz to reduce processing time. All data was collected in a quiet classroom setting and participants were free to progress at their own pace. This data is used to compute the segmentation performance and classification accuracy of the various configurations of our experimental query-by-humming system.

The retrieval database of target songs consists of ideal representations of the fourteen songs. Note that every query represents the exact portion of melody contained in the target; that is, only the sequence of pitches sung by the participants for "Hey Jude," for example, is contained in the database. For a real-world QBH system it is unreasonable to assume the user will always sing the exact portion of a tune contained in the database. Some systems address this problem by including multiple themes for each tune in the database [20]. For comparing the relative performance of various note estimation and quantization methods, however, we do not consider such complications.

²This contrasts substantially with the common practice of having participants sing isolated pitches on a neutral vowel.
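To make the retrieval component concrete, the sketch below implements the duration-weighted edit distance and nearest-target classification described in this section. It is a minimal sketch under the cost assignment of (1) and (2); the value of γ was elided in the source, so the value here is a placeholder, and the iterative mean-pitch-offset realignment is omitted.

```python
import numpy as np

GAMMA = 1.0  # cross-over pitch difference in semitones; placeholder value

def indel_cost(note):
    _pitch, dur = note
    return dur                                   # Eq. (1): cost equals note duration

def repl_cost(a, b):
    (p1, d1), (p2, d2) = a, b
    return abs(d1 - d2) + abs(p1 - p2) / GAMMA * (d1 + d2)   # Eq. (2)

def edit_distance(query, target):
    """Edit distance between two lists of (pitch, duration) notes."""
    n, m = len(query), len(target)
    D = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        D[i, 0] = D[i - 1, 0] + indel_cost(query[i - 1])
    for j in range(1, m + 1):
        D[0, j] = D[0, j - 1] + indel_cost(target[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j] + indel_cost(query[i - 1]),       # deletion
                          D[i, j - 1] + indel_cost(target[j - 1]),      # insertion
                          D[i - 1, j - 1] + repl_cost(query[i - 1],
                                                      target[j - 1]))   # replacement
    return D[n, m]

def classify(query, targets):
    """Index of the target with the smallest edit distance to the query."""
    return min(range(len(targets)), key=lambda k: edit_distance(query, targets[k]))
```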
III. NOTE SEGMENTATION

A sequence of notes must be estimated from the pitch contour for use by the retrieval system. Detecting note boundaries in the pitch contour is one of the most challenging aspects of the note estimation [19]-[21]. Accordingly, we are primarily concerned with the note segmentation component of the note estimators. The following subsections present seven note segmentation methods. The first four segmenters perform the segmentation before estimating note pitch; the pitch of each note is then estimated as the mean pitch contour value between the beginning and end of the note. The last three estimators perform the segmentation and pitch assignment concurrently. Furthermore, the last two estimators employ pitch quantization prior to segmentation. In this way we explore not only different note segmenters, but also which priors are most effective to incorporate into the segmentation.

The note segmenters often yield clusters of spurious notes around a single legitimate note. A note thinning procedure is therefore required to reduce these clusters to a single boundary. One method for thinning spurious notes is to enforce a minimum duration constraint, analogous to removing spurious edges in edge detection. In [2], all notes shorter than a minimum duration are discarded. We found that merging notes shorter than the minimum duration (150 ms in this work) into their nearest neighbor (in pitch), beginning with the shortest note, yields better results.

A. Baseline

In [2] the use of a smoothed derivative is proposed to detect note boundaries. Adjacent 20 ms windows of pitch contour are compared; if the difference between the average pitch of each window is greater than some threshold $T$, a new note is inserted. By adjusting $T$, missed notes can be traded for spurious notes. For the query database used in the present work we found that 80 ms windows yield better results, perhaps reflecting larger contour fluctuations in our sample queries than in [2]. Similar note detectors are used in [3], [6], [9]. However, MELDEX [30], which was developed by the authors of [2], uses an RMS amplitude threshold rather than this baseline segmentation method. Indeed, most QBH systems to date use either a variant of an amplitude threshold or a variant of this simple baseline pitch segmenter.
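The sketch below illustrates the baseline segmenter and the note thinning procedure just described; it is a hedged reading of [2] and of our modification, assuming the 10 ms contour step size of Section II. The 80 ms window and 150 ms minimum duration come from the text, while the function names are our own.

```python
import numpy as np

STEP_S = 0.010                 # pitch-contour step size: 10 ms
WIN = int(0.080 / STEP_S)      # 80 ms comparison windows

def baseline_boundaries(contour, thresh):
    """Mark a boundary wherever the mean pitch of adjacent 80 ms windows
    differs by more than thresh (semitones). Boundaries arrive in clusters,
    which the thinning step below reduces."""
    bounds = []
    for n in range(WIN, len(contour) - WIN):
        left = np.mean(contour[n - WIN:n])
        right = np.mean(contour[n:n + WIN])
        if abs(right - left) > thresh:
            bounds.append(n)
    return bounds

def thin_notes(notes, min_dur=0.150):
    """Merge each note shorter than min_dur (seconds) into whichever neighbor
    is nearer in pitch, starting from the shortest note."""
    notes = list(notes)                    # (pitch, duration) pairs
    while len(notes) > 1:
        k = min(range(len(notes)), key=lambda i: notes[i][1])
        if notes[k][1] >= min_dur:
            break                          # all remaining notes are long enough
        nbrs = [i for i in (k - 1, k + 1) if 0 <= i < len(notes)]
        j = min(nbrs, key=lambda i: abs(notes[i][0] - notes[k][0]))
        p, d = notes[j]
        notes[j] = (p, d + notes[k][1])    # absorb the short note's duration
        del notes[k]
    return notes
```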

B. Kalman Filter

One of the difficulties encountered in segmenting continuously sung melodies is that the pitch contour can be quite volatile. The magnitude of pitch fluctuations within a sustained note can exceed a semitone, as is often seen in full vibrato. While the appearance of vibrato is far from universal, it is common enough that accounting for its contribution to pitch variance is likely to improve segmentation. By modeling pitch fluctuations as the output of a stationary linear system driven by white noise, a Kalman filter can be used to track the pitch contour.

In order for the Kalman filter to accurately track contours within note boundaries, a statistical analysis of the pitch contours is necessary. Thirty queries containing a large number of notes longer than 400 ms were selected for analysis. The pitch contours were extracted for each of these queries and edited to include only the stable segments of long notes. The average power spectral density (PSD) of each segment was estimated and then averaged across segments (in the log domain) [31], [32]. The PSD exhibits a low-pass characteristic with a weak resonance near 3 Hz [33], which can be modeled as a second-order autoregressive system. Accordingly, let $x(n)$ be the value of the pitch contour at time index $n$,

$x(n) = a_1 x(n-1) + a_2 x(n-2) + w(n)$   (3)

where $a_1$ and $a_2$ are the AR model parameters and $w(n)$ is white Gaussian noise. The segments were manually divided into two categories, those with and without vibrato, and the segments with vibrato were used for the parametric estimation. Using the Levinson-Durbin recursion [31], a least-squares best fit was found that places the poles at an angle corresponding to 3.5 Hz. This agrees with the rates commonly associated with vibrato.

Given the system model (3), the Kalman state vector is $\mathbf{x}(n) = [x(n)\ x(n-1)]^T$, with state update and observation equations given by

$\mathbf{x}(n+1) = \mathbf{A}\,\mathbf{x}(n) + \mathbf{w}(n)$   (4)

$y(n) = \mathbf{c}^T \mathbf{x}(n) + v(n)$   (5)

where $\mathbf{A}$ contains the AR parameters in companion form, $\mathbf{c} = [1\ 0]^T$, and $v(n)$ is the observation noise. An exponentially-weighted running mean is subtracted from the query pitch contour and used as the observation sequence for the Kalman filter to predict the pitch state. With the predicted observation

$\hat{y}(n|n-1) = \mathbf{c}^T \hat{\mathbf{x}}(n|n-1)$   (6)

the error function $e(n)$ is given by

$e(n) = \dfrac{\left(y(n) - \hat{y}(n|n-1)\right)^2}{\mathbf{c}^T \mathbf{P}(n|n-1)\,\mathbf{c} + \sigma_v^2}$   (7)

where $\sigma_v^2$ is the variance of the observation noise and $\mathbf{P}(n|n-1)$ is the expected covariance of the system state [32]. This distance is then compared to a threshold to determine whether a new note should be inserted. A sample output of the Kalman filter is shown in Fig. 2. For purposes of illustration we chose a portion of pitch contour that is more piecewise constant than most in the database, so that the behavior of the Kalman filter is evident.

Fig. 2. Top curve shows a segment of pitch contour. The second curve shows the Kalman prediction error and the bottom curve shows the RLS prediction error. The dotted line is a potential detection threshold.

C. RLS Filter

While the Kalman filter predicts contours with stable vibrato, many other fluctuations are not accurately tracked; the stationarity assumption is too restrictive. The recursive least-squares (RLS) filter is a special case of the general, time-varying Kalman filter [33]-[35]. We employ a predictive RLS filter to track the observed pitch contour. Let the $\Delta$-step prediction be given by

$\hat{y}(n + \Delta) = \mathbf{h}(n)^T \mathbf{u}(n)$   (8)

where $\mathbf{h}(n)$ are the adaptive filter coefficients and $\mathbf{u}(n) = [y(n)\ y(n-1)\ \cdots\ y(n-M+1)]^T$ is the information vector for the current time index. These coefficients are updated at every time step to minimize the exponentially weighted cumulative squared error. The optimal linear solution is found by a recursive implementation of the common normal equations [34],

$\mathbf{R}(n) = \lambda\,\mathbf{R}(n-1) + \mathbf{u}(n)\,\mathbf{u}(n)^T$   (9)

$\mathbf{h}(n) = \mathbf{R}(n)^{-1}\,\mathbf{r}(n)$   (10)

where $\mathbf{R}(n)$ is the autocorrelation matrix of $\mathbf{u}(n)$, $\mathbf{r}(n)$ is the corresponding cross-correlation with the prediction target, and $\lambda$ is a forgetting factor. A forgetting factor slightly less than one was found to work well, and a modest filter order $M$ is sufficient for a $\Delta$-step predictor. As was the case for the Kalman filter, the prediction error is used as the statistic to determine whether a new note has occurred. An example output is shown in Fig. 2.
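A compact sketch of the $\Delta$-step RLS predictor of (8)-(10) follows, using the standard matrix-inversion-lemma form of the recursion. The filter order, forgetting factor, and prediction step were elided in the source, so the defaults here are placeholders.

```python
import numpy as np

def rls_prediction_error(y, order=4, lam=0.99, delta=1):
    """Exponentially weighted RLS Delta-step predictor over a pitch contour y.
    Returns the squared prediction error at each step, which serves as the
    detection statistic compared against a threshold."""
    h = np.zeros(order)                     # adaptive filter coefficients
    P = np.eye(order) * 1e3                 # running inverse of R(n) in Eq. (9)
    err = np.zeros(len(y))
    for n in range(order - 1, len(y) - delta):
        u = y[n - order + 1:n + 1][::-1]    # information vector, most recent first
        e = y[n + delta] - h @ u            # Delta-step prediction error, Eq. (8)
        err[n + delta] = e ** 2
        k = P @ u / (lam + u @ P @ u)       # gain vector
        h = h + k * e                       # coefficient update
        P = (P - np.outer(k, u @ P)) / lam  # recursive inverse-autocorrelation update
    return err
```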
D. Nonlinear LMS Filter

The previous two methods employ fairly restrictive assumptions about the pitch contours. At the opposite extreme, machine learning techniques can be used to discover regularity in the data when analytic models are not available.

In the following, we explore the use of a perceptron for segmentation. A perceptron is a least mean-squares (LMS) filter augmented with a (nonlinear) threshold operator [34]. A five-element feature vector is computed for nonoverlapping 80 ms windows of pitch contour [33], [36]. This feature vector forms the input to the NLMS filter, which detects note boundaries contained within the window.

The feature vector, $\mathbf{f} = [f_1\ f_2\ f_3\ f_4\ f_5]^T$, was found heuristically by observing the dynamic properties of the measured pitch contours. For each feature, larger values are indicative of note boundaries. The first two features are computed from the pitch contour: $f_1$ is a smoothed derivative and $f_2$ is the maximum first-difference for the current window of data. The next two features are derived from the autocorrelation contour that accompanies the pitch contour [33]. The autocorrelation contour is interpreted as a measure of how pronounced the sung pitch is; the autocorrelation often decreases during note transitions. The last feature, $f_5$, is a smoothed derivative of the RMS amplitude contour. Precise definitions of these features can be found in [33].

For each window, a linear combination of the feature vector, $\mathbf{h}^T\mathbf{f}$, is taken as the final decision statistic. The following iteration was repeated until $\mathbf{h}$ converged to a stable solution:

$\mathbf{h} \leftarrow \mathbf{h} + v^{+}\mathbf{f}$ if a boundary is missed; $\mathbf{h} \leftarrow \mathbf{h} - v^{-}\mathbf{f}$ if a boundary is mistakenly inserted; $\mathbf{h}$ unchanged otherwise,   (11)

where suitable values of the learning rates $v^{+}$ and $v^{-}$ were found empirically.³ The perceptron converged to a stable set of weights based upon 30 manually segmented queries. The predominant weight associated with the pitch features $f_1$ and $f_2$ supports our earlier claim that note segmentation is best performed using these pitch-bearing features.

³Queries were not edited prior to training. Because the majority of 80 ms windows did not contain note boundaries, we found that allowing $v^{+}$ and $v^{-}$ to take on different values led to better performance.
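A minimal sketch of the training rule of (11) is given below; the labels, learning rates, and convergence test are illustrative, and the five features themselves are assumed to be precomputed per 80 ms window as in [33].

```python
import numpy as np

def train_perceptron(frames, labels, v_plus=0.1, v_minus=0.01, max_epochs=100):
    """Perceptron training rule of Eq. (11). frames: (N, 5) array of feature
    vectors, one per 80 ms window; labels: 1 if the window contains a note
    boundary, else 0. Distinct rates v_plus/v_minus reflect footnote 3; their
    values here are placeholders."""
    h = np.zeros(frames.shape[1])
    for _ in range(max_epochs):
        changed = False
        for f, y in zip(frames, labels):
            detected = (h @ f) > 0.0         # thresholded linear decision statistic
            if y == 1 and not detected:      # boundary missed
                h += v_plus * f
                changed = True
            elif y == 0 and detected:        # boundary mistakenly inserted
                h -= v_minus * f
                changed = True
        if not changed:                      # converged to a stable solution
            break
    return h
```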
E. ML Segmenter

The remaining three methods, in contrast to the previous four, perform note segmentation and pitch estimation concurrently. Simultaneous estimation of the note boundaries and pitches is a higher-dimensional problem than estimation of the note boundaries alone; indeed, the ML and HMM segmenters are of somewhat higher computational order than the other methods presented here. Nonetheless, it is worth exploring whether estimating note boundaries and pitches simultaneously improves performance. If the observed pitch contour is modeled as an arbitrary piecewise constant signal corrupted by AWGN, optimal note boundary estimates are given by the MMSE piecewise constant curve fit to the observed data [37]. If we further assume that the observed contour represents $K$ constant regions, a convenient search algorithm is readily found [19], and is summarized in the following.

Consider an observed pitch contour $x(n)$ and a piecewise constant signal containing $K$ constant regions,

$s(n) = p_k, \quad n_{k-1} \le n < n_k, \quad k = 1, \ldots, K$   (13)

where $p_k$ are the pitches of each region and $n_1, \ldots, n_{K-1}$ are the boundaries between regions, with $n_0 = 0$ and $n_K = N$. This $(2K-1)$-dimensional parameter space must be searched to find the minimum error between the observed contour and the piecewise constant fit. An exhaustive search is computationally impractical, so dynamic programming is employed to keep complexity manageable. An absolute error criterion was found to yield better segmentation performance than the more common squared error criterion; in this case the optimal pitch for given boundaries is the median of the pitch contour values. Let the error for a constant region bounded by $m$ and $n$ with pitch $p$ be

$E(m, n) = \sum_{i=m}^{n-1} |x(i) - p|, \quad p = \operatorname{median}\{x(m), \ldots, x(n-1)\}.$   (14)

Suppose an optimal $k$-region fit has been found for time steps $0$ through $m$, with total error $J_k(m)$. A recursive formula for the optimal fit is found by observing

$J_k(n) = \min_{m < n} \left[\, J_{k-1}(m) + E(m, n) \,\right].$   (15)

This formula is used to compute $J_k(n)$ for all $k$ and $n$. The optimal segmentation is then found by backward recursion from $n = N$. A detailed description of the algorithm can be found in [19]. Note that while dynamic programming improves search speed considerably, this segmenter is still far slower than the other methods presented here.

This algorithm assumes the number of notes is known a priori. Generalizing the procedure to estimate the number of notes as well renders the algorithm computationally impractical [37]. Instead, a pragmatic solution overestimates the number of notes in the contour (four notes per second of contour, for example). In this case the raw output of the ML segmenter has a high false-alarm probability. As with the other methods, however, these spurious notes often occur in clusters around a single legitimate note, and most of them are removed by the note thinning procedure described in Section III-A.
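The following sketch implements the recursion (13)-(15) directly; it is illustrative rather than the authors' code, and it recomputes segment errors naively where a practical implementation would cache cumulative statistics.

```python
import numpy as np

def segment_error(x, m, n):
    """Absolute-error cost of fitting x[m:n] with its median pitch, Eq. (14)."""
    seg = x[m:n]
    return float(np.abs(seg - np.median(seg)).sum())

def ml_segment(x, K):
    """Optimal K-region piecewise-constant fit by dynamic programming, Eq. (15).
    Returns region boundaries [0, n_1, ..., n_{K-1}, N]."""
    N = len(x)
    J = np.full((K + 1, N + 1), np.inf)    # J[k, n]: best error, k regions over x[:n]
    back = np.zeros((K + 1, N + 1), dtype=int)
    J[0, 0] = 0.0
    for k in range(1, K + 1):
        for n in range(k, N + 1):
            for m in range(k - 1, n):      # last region spans x[m:n]
                c = J[k - 1, m] + segment_error(x, m, n)
                if c < J[k, n]:
                    J[k, n], back[k, n] = c, m
    bounds = [N]                           # backward recursion, as in the text
    for k in range(K, 0, -1):
        bounds.append(int(back[k, bounds[-1]]))
    return bounds[::-1]
```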

F. Quantization Segmenter

None of the segmenters described above make use of the fact that most melodies found in Western music are restricted to a 12-tone scale. In this work we incorporate a 12-tone structure into two note segmenters. Both segmenters require that a codebook of quantization pitches be specified before segmentation; the design of this codebook is discussed in Section IV-A. The first of the two segmenters, the quantizer segmenter, uses this codebook directly to perform segmentation, whereas the second, the HMM segmenter, uses the elements of the codebook as states. We defer discussion of the HMM segmenter to the next subsection.

After the codebook has been selected, the pitch contour is quantized and every pitch transition in the output is interpreted as the start of a new note. As expected, this segmentation is riddled with spurious notes [15], but many are removed by the note thinning procedure (Section III-A). In addition to the minimum-duration thinning, another thinning procedure is beneficial for this segmenter. For a given detected note boundary, let $\Delta p$ be the difference between the unquantized average pitch of the notes on either side of the boundary. Beginning with the smallest $\Delta p$, all note boundaries whose $\Delta p$ falls below a threshold are removed. A suitable threshold, expressed in cents,⁴ was found empirically.

Note that for general sung melody transcription this segmentation method may not be appropriate. Most singers, especially untrained ones, demonstrate considerable pitch drift while singing. For many query-by-humming systems this is not a problem, however: such systems only use a short segment of sung melody, over which pitch drift is negligible.

⁴One cent is 1/100th of a semitone.
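A minimal sketch of the quantizer segmenter follows: the contour is snapped to a codebook with q levels per octave, a boundary is declared at every level transition, and boundaries with small unquantized pitch differences are thinned. The threshold value is a placeholder, since the paper's figure (in cents) was lost in extraction.

```python
import numpy as np

def quantize_contour(contour, q=12, offset=0.0):
    """Quantize a MIDI pitch contour to q uniform levels per octave."""
    step = 12.0 / q                        # quantization step in semitones
    return np.round((contour - offset) / step) * step + offset

def quantizer_segment(contour, q=12, offset=0.0, min_gap_cents=50.0):
    """Boundaries at every transition of the quantized contour, followed by
    pitch-difference thinning. min_gap_cents is a placeholder threshold."""
    qc = quantize_contour(contour, q, offset)
    bounds = [0] + [n for n in range(1, len(qc)) if qc[n] != qc[n - 1]] + [len(qc)]
    while len(bounds) > 2:
        # unquantized mean-pitch difference across each interior boundary
        gaps = [abs(np.mean(contour[bounds[i - 1]:bounds[i]]) -
                    np.mean(contour[bounds[i]:bounds[i + 1]]))
                for i in range(1, len(bounds) - 1)]
        i = int(np.argmin(gaps))
        if gaps[i] * 100.0 >= min_gap_cents:   # 100 cents per semitone
            break
        del bounds[i + 1]                      # remove the weakest boundary
    return bounds
```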
G. HMM Segmenter

The final segmenter employs a hidden Markov model (HMM) to estimate both note pitches and boundaries [38]. The pitch contour is modeled as a piecewise constant signal restricted to an equal-tempered scale, as in the previous segmenter. In the present case, each of the quantized tones defines a state in the HMM. Fig. 3 shows a portion of an example HMM with 12 states per octave, labeled according to Western scale pitches.

Fig. 3. Portion of HMM used for note segmentation.

Two state transition probability distributions were implemented. The first assigns a high (98%, for example) probability of self-transition, $a_{jj}$, and uniformly assigns the remaining transition probabilities, $a_{ij} = (1 - a_{jj})/(S - 1)$ for $i \ne j$, where $S$ is the total number of HMM states. The second uses the Yule algorithm with the fourteen target melodies as a training set. Of the two, the first was found to yield better performance, perhaps due to the small set of melodies used for training the second. Following [33], the observation noise is modeled as IID Laplacian. The final note estimates are given by the most probable state sequence for the observed pitch contour.

Similar to the dynamic programming algorithm used for the ML segmenter, optimal prefix sequences are computed for every state at every time step. Suppose the HMM is in state $j$ at time step $n$, and let $\delta_n(j)$ be the probability of the most likely state sequence ending in state $j$ at time step $n$, multiplied by the probability of the observation sequence. Using the same notation as Section III-E, a recursive formula for $\delta_n(j)$ can be found by observing

$\delta_n(j) = \max_i \left[\, \delta_{n-1}(i)\, a_{ij} \,\right] b_j(x(n))$   (16)

where $a_{ij}$ is the probability of being in state $j$ given the previous state $i$, and $b_j(x(n))$ is the probability of the observation $x(n)$ given state $j$. The Viterbi algorithm is used to compute $\delta_n(j)$ for all $n$ and $j$. The optimal segmentation is then found by backward recursion through the state trellis. A detailed description of the algorithm can be found in [38]. Because the probability of self-transition is large, the HMM estimator is relatively robust to note insertion errors. Nonetheless, the same note thinning procedure used for the other methods is employed here. While the HMM algorithm is somewhat slower than most of the other methods, it is considerably faster than the ML segmenter.

IV. NOTE QUANTIZATION

One of the goals of this work is to determine whether coarsely quantized melodic representations improve classification accuracy. Thus, we perform an explicit quantization in our query-processing system [39]. We restrict attention to separate quantization of pitch and duration. Furthermore, we do not consider quantizing several notes together, due to the difficulty of designing an appropriate codebook of melodic phrases [40], [41].

A. Pitch Quantization

Uniform scalar quantization with $Q$ levels per octave is applied to the pitch estimates [39]. Since we are using MIDI pitch number to represent pitch, when $Q = 12$ we have a musically intuitive quantization of pitch to the equal-tempered scale, with one quantization level per semitone. Setting $Q < 12$ yields a coarser pitch quantization, which is more robust to a singer's pitch errors. We examine a variety of values of $Q$ for uniform pitch quantization.

Singers typically do not have perfect pitch; a user might, for example, sing a query 50 cents off from the standard equal-tempered tuning. To minimize errors caused by this offset, we perform a search for an optimal pitch offset before quantizing. For a set of offsets spanning one quantization interval ($1/Q$ of an octave) and separated by 5 cents, we compute the mean squared quantization error for the observed pitch contour. The offset with minimum MSE is chosen as the optimal offset.
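The offset search sketched below reuses quantize_contour from the Section III-F sketch; the search grid follows the description above, while the function name and return convention are our own.

```python
import numpy as np

def best_offset(contour, q=12, step_cents=5.0):
    """Search pitch offsets over one quantization interval (1200/q cents) in
    5-cent steps and return the offset (in semitones) that minimizes the mean
    squared quantization error of the contour."""
    interval_cents = 1200.0 / q
    offsets = np.arange(0.0, interval_cents, step_cents) / 100.0   # semitones
    mse = [np.mean((contour - quantize_contour(contour, q, off)) ** 2)
           for off in offsets]
    return float(offsets[int(np.argmin(mse))])
```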

The pitch quantization levels need not be uniformly distributed. Many popular melodies are locally restricted to a single key, such as C major. To quantize a sung query to a modal codebook, with seven nonuniformly spaced levels per octave, it is necessary to estimate the tonic of the sung query.⁵ We implemented a modal quantizer by testing codebook offsets over a full octave, similar to the uniform quantizer. The tonic estimates were found to be unreliable, however, hindering retrieval performance of the QBH system.

⁵For example, for a query sung in C major, the tonic would be the pitch class of C = (..., 261 Hz, 522 Hz, ...).

B. Duration Quantization

Scalar quantization with $L$ levels is applied to the duration estimates. Prior to quantizing the note durations, the durations are normalized by the total duration of the query, thus ensuring tempo invariance of the QBH system. Note that this normalization is only appropriate if the query can be assumed to contain the same portion of melody as contained in the target theme, as discussed in Section II. Due to this normalization, only a fraction of the levels are ultimately used. However, this is not of concern because we are interested in quantization as a method of reducing singer and transcription error, not as a method of reducing the number of bits needed to store the transcribed query.

Uniform, logarithmic, and adaptive codebooks are explored [39]. For uniform quantization, durations are uniformly spaced between zero and one, yielding a quantization density function $\lambda(d)$ that is constant on that interval, where $d$ is the normalized note duration. In many Western melodies, most note durations are related to some minimum duration by a power of two: eighth notes, quarter notes, half notes, and so forth. This implies that it may be appropriate to concentrate duration levels closer to zero. For logarithmic quantization, durations are spaced such that the quantization density function is concentrated near zero. We found the performance of the uniform and logarithmic codebooks to be equivalent.

Most melodic phrases contain far fewer unique note durations than the total number of notes. Therefore, it may be useful to search for clusters of note durations in the query note estimates. This suggests an LBG (K-means) clustering, performed for every query, that adapts the quantization codebook to the distribution of observed durations [42]. We implemented such an adaptive quantizer but found lackluster performance: LBG clustering requires a large number of training points relative to the number of quantization levels, so the adaptive quantizer's performance is comparable to the uniform quantizer for very small $L$, but poor otherwise. Hence, in the following we report the performance for only the uniform quantizer with a varying number of levels, $L$.
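A sketch of the duration path follows: normalization by total query duration for tempo invariance, then uniform scalar quantization with L levels on (0, 1). The level-center convention is an assumption of ours.

```python
import numpy as np

def quantize_durations(durations, L=8):
    """Normalize note durations by the total query duration (tempo invariance),
    then uniformly quantize with L levels on (0, 1). Because the normalized
    durations sum to one, only the lower levels are typically occupied."""
    d = np.asarray(durations, dtype=float)
    d = d / d.sum()                                # normalized durations
    idx = np.minimum((d * L).astype(int), L - 1)   # uniform cell index: floor(d * L)
    centers = (np.arange(L) + 0.5) / L             # reproduction level per cell
    return centers[idx]
```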
V. RESULTS

As discussed in Section I, we examine two hypotheses in the present work: that alternative note estimators yield better segmentation performance and ultimately better retrieval performance, and that coarsely quantized query representations do not improve retrieval performance. To test these hypotheses, we first examine the segmentation performance of the note estimators alone, and then couple the note estimators with the retrieval system described in Section II to examine classification accuracy.

To evaluate the segmentation performance of the note estimators, 80 of the queries from our test set were manually segmented by the first author. Often the precise number and location of the segments to insert in the observed pitch contour were ambiguous; however, we have attempted to maintain consistency as much as possible. We numerically estimate the receiver operating characteristic (ROC) of each segmenter by comparing its output to the manual segmentations. An alignment radius of 100 ms is used in comparing the manual and automatic segmentations. For varying decision thresholds, we compute the false-alarm rate, which is the probability that a given note boundary detected by the segmentation algorithm is spurious, and the detection rate, which is the probability that a true note boundary is detected by the segmentation algorithm.

The first four segmenters presented in Section III detect note boundaries by computing a decision statistic and comparing this statistic to a threshold $T$; by adjusting $T$, an estimate of each segmenter's ROC curve is computed. The last three segmenters presented in Section III do not employ an overt decision threshold, hence another parameter must be selected for estimating the ROC curves. For the ML segmenter, the a priori note number $K$ is adjusted. For the quantizer segmenter, the minimum note duration is adjusted. Finally, for the HMM segmenter, the probability of self-transition $a_{jj}$ is adjusted. The ROC curves for the quantizer and HMM segmenters are estimated using 12 pitches per octave.
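One way to compute each (false-alarm, detection) operating point is a one-to-one matching of automatic to manual boundaries within the 100 ms alignment radius; the greedy matching below is our reading of that procedure, not necessarily the authors' exact implementation.

```python
def roc_point(detected, truth, radius=0.100):
    """Greedy one-to-one matching of detected boundaries to manual boundaries
    (times in seconds) within the alignment radius; returns the detection rate
    and false-alarm rate as defined in the text."""
    truth = sorted(truth)
    used = [False] * len(truth)
    hits = 0
    for t in sorted(detected):
        cands = [i for i, u in enumerate(truth)
                 if not used[i] and abs(u - t) <= radius]
        if cands:
            i = min(cands, key=lambda i: abs(truth[i] - t))
            used[i] = True                  # each true boundary matches at most once
            hits += 1
    detection_rate = hits / len(truth) if truth else 1.0
    false_alarm_rate = (len(detected) - hits) / len(detected) if detected else 0.0
    return detection_rate, false_alarm_rate
```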

Fig. 4. Detection performance for seven note segmenters. The curve labeled "w/o thinning" refers to the baseline segmenter without the minimum note duration thinning; all other segmenters employ this thinning.

Fig. 4 displays ROC curves for the seven estimators examined in this work. The false-alarm rate is represented along the abscissa and the detection rate along the ordinate (note the scale of the plot, with false-alarm rates from 0 to 0.27 and detection rates from 0.64 to 1). The parameter in the figure is the segmentation method. All of the ROC curves are monotonic nondecreasing. Of all the segmenters, the baseline segmenter without note-thinning stands out as poorer than the rest. Including the note-thinning procedure improves performance considerably for the baseline segmenter, as well as for the other methods; note-thinning is critical for all methods considered here. The ROC curves for the remaining segmenters all include note-thinning.

The remaining methods all yield ROC curves that overlap considerably. The RLS, HMM, and quantizer segmenters give essentially equivalent performance. The NLMS segmenter yields the best segmentation performance by a modest margin, giving a 4% improvement over the next best method at a false-alarm rate of 2.5%. The Kalman and ML segmenters yield the worst performance of the alternative segmenters.

We next couple the various note estimators with a retrieval system to measure the classification accuracy. The retrieval performance of the query-by-humming system is computed by running the system on each of the 480 queries in our query database. The classification accuracy is computed by determining the fraction of queries which are classified as the correct target song. To prevent the figures from becoming too cluttered, the Kalman and ML estimators are omitted from the following results, their performance being generally lackluster. Because our alignment algorithm assigns equal cost to insertion and deletion errors, it is reasonable to suppose that optimal classification accuracy is achieved when the segmenter insertion and deletion rates are equal. This was indeed found to be the case, hence all segmenters are tuned to have equal note insertion and deletion rates when coupled with the retrieval system.

Fig. 5. Classification accuracy for five note estimators versus resolution of the uniform pitch quantizer.

Fig. 5 shows the overall classification accuracy of five note estimators versus the number of pitch quantization levels. The abscissa represents the number of pitches per octave, $Q$, and the ordinate represents classification accuracy; the parameter is the segmentation method. A fixed duration quantization is applied for all segmentation methods. Data are shown along with best-fitting exponential curves. All curves are monotonic nondecreasing and plateau around six to eight levels per octave. Again, the baseline segmenter without note-thinning yields the worst performance, achieving a maximum classification accuracy of 80%. The HMM and quantizer segmenters yield the best classification accuracy, giving a maximum of 92%. The remaining methods give similar performance, with a maximum of about 88%.

Fig. 6. Classification accuracy for five of the note estimators versus the number of levels of the duration quantizer.

Fig. 6 gives the classification accuracy for five note estimators versus the number of duration quantization levels. A uniform pitch quantization with a fixed number of levels per octave is applied for all methods. Again, all curves are monotonic nondecreasing. The various segmenters show trends similar to those in Fig. 5; e.g., the baseline without note-thinning yields the worst performance, whereas the HMM and quantizer segmenters yield the best performance. Note that the independent variable, $L$, is the total number of levels in the quantization codebook.
Because the estimated durations are normalized by the total query duration, only a fraction of the levels are used. Thus, a sufficiently small setting of $L$ is equivalent to mapping all durations to the same level and thus discarding duration information.

VI. DISCUSSION

As can be seen from Fig. 4, the alternative segmenters yield better performance than the baseline segmenter. The NLMS segmenter gives the best segmentation performance; however, this may reflect the fact that the feature vector was heuristically designed to optimize segmentation performance on these queries. The Kalman and ML estimators provide the worst performance of the six alternative estimators. The Kalman filter was trained to track one common contour fluctuation, vibrato; other fluctuations yield large prediction errors. The Kalman filter also exhibits relatively slow convergence, on the order of 300 ms, as can be seen in Fig. 2. Hence, the Kalman filter cannot detect note boundaries near the beginning of pitch contours. Furthermore, many query contours demonstrate considerable scooping, in which the singer momentarily drops to a lower pitch during a transition between two notes. This gradual transition loosely resembles a single oscillation of vibrato. The Kalman filter often accurately tracks this transition, and hence many of these segments are not detected.

As noted in Section III-E, the ML segmenter requires an estimate of the number of notes to segment. When the true number of notes is known, the performance of the ML segmenter is essentially perfect. Generalizing this method to automatically estimate the number of notes renders it computationally unmanageable. Accordingly, the number of notes to search for is assumed to be proportional to the duration of the contour, and note-thinning is used to remove clusters of spurious notes. This heuristic rule for note number estimation is responsible for the method's mediocre performance.

Two asymptotes in Fig. 4 are worth discussing. First, all of the ROC curves converge to a detection rate of 65% with no false-alarms. This is a result of the pitch tracker, which performs a voiced/unvoiced detection. Short pauses taken by the singer between notes result in a break in the pitch track, providing a partial segmentation with essentially no inserted notes. Recall that most existing QBH systems require the user to perform note segmentation implicitly by singing each note as a separate "da" or "ta" [1], [2], [4], [13]; such systems employ an amplitude threshold to detect the start of new notes. Applying an amplitude threshold to our database of naturally sung queries yields a detection rate of 65%. Second, none of the note estimators achieve a detection rate greater than 96%. This is due to the thinning procedure, which discards any note shorter than 150 ms. Removing the note thinning allows the segmenters to reach a 100% detection rate, but only at a very high false-alarm rate.

As for any detection problem, there is a natural tradeoff between note insertion and deletion errors. For measuring the classification accuracy of the complete system, we tuned all segmenters to have equal note insertion and deletion rates. In practice, however, deletion errors are often more troublesome than insertion errors [3], [20], [21], implying that the note segmenters should be tuned to have a high insertion rate. This is explored in [15], where a simple note segmenter with a high false-alarm rate is used in conjunction with an alignment cost scheme specifically designed to anticipate many note insertions.

From the trends shown in Fig. 5, it is evident that coarse pitch quantization does not improve classification performance for any of the estimators. It is interesting to note that the curves plateau at fewer than 12 levels per octave. This suggests that only six or eight pitch levels per octave are required for accurate classification, thereby reinforcing some of the results presented in [14], [43]. When using such a quantizer, however, the pitch quantization levels no longer represent musical notes in the conventional 12-tone scale. This necessarily complicates construction of the target database, as direct inclusion of standard MIDI files is no longer possible; the targets must be recoded into a coarser codebook. While moderately coarse quantization may not degrade retrieval performance, we find no evidence to imply that it will improve performance.

Comparing the estimators in Fig. 5, the alternative estimators yield better classification accuracy than the baseline estimators. The HMM and quantizer estimators yield clearly superior performance for larger values of $Q$.
That the HMM does not perform well for a low number of pitch quantization levels may be explained by the transition probabilities: as the number of states decreases, the probability of transiting to other states increases, making spurious state transitions more likely.

Fig. 6 shows no benefit to quantizing the note durations. By comparing Figs. 5 and 6, though, it is evident that discarding pitch resolution is more detrimental to classification accuracy than discarding duration resolution. Discarding pitch information decreases classification accuracy from above 90% to less than 50%, whereas discarding duration information only decreases classification accuracy to 80%. Similar results have been found in [3], [14].

The HMM and quantizer note estimators give consistently superior classification performance, in spite of being outperformed with respect to segmentation. This illustrates the importance of coupling system components when evaluating performance. In Fig. 4, the HMM and quantizer segmenters demonstrated no improvement over the RLS and NLMS estimators with respect to segmentation accuracy. Still, when the note estimators are considered in connection with the rest of the query-by-humming system, the HMM and quantizer note estimators clearly give the best classification performance. These two estimators are unique in that they perform the pitch quantization before segmentation; that is, they incorporate an a priori assumption about the distribution of note pitches sung. It is striking that arguably the simplest estimator considered here, the quantizer, is able to achieve the best performance; all that is required is the application of appropriate prior constraints. This is perhaps to be expected given the natural pitch quantization in most Western music. The melodic themes considered in this work exist in a space of equally-tempered pitches. That the HMM and quantizer note estimators judiciously incorporate this prior into note segmentation is not reflected by an ROC metric. The benefit of incorporating this assumption only becomes apparent in the complete QBH system, where using a note segmenter that explicitly places the sung query in a space of equally-tempered pitches is advantageous: the HMM and quantizer note estimators are relatively better at moving the query representation toward the correct theme in the target space. This does not necessarily imply that the HMM and quantizer note estimators are the best choice for general sung melody transcription, however. In some sense, these estimators suppress imperfections of the sung performance in the final melodic representation; for many applications, such note estimators may therefore be inappropriate.

From Figs. 4 and 5, it is evident that the NLMS segmenter yields better segmentation performance than the RLS segmenter, but equivalent classification accuracy. The relative drop in the performance of the NLMS segmenter is a natural result of the NLMS training procedure. Only the NLMS segmenter is explicitly trained to optimize segmentation performance relative to a set of manually segmented sample queries. The ROC then gives a measure of how close the NLMS segmentation is to a test set of manually segmented queries. When both the training and test sets were manually segmented by the author, amplitude, timbre, and phonetic information were implicitly considered in the segmentation. This information is not considered by the automatic segmenters, however.
As such, during NLMS training the combination weights, $\mathbf{h}$, are modified by the perceptron training rule to account for notes that do not necessarily have a strong footprint in the pitch contour. Furthermore, the manual segmentations include singer error.

Hence, the NLMS segmenter may give a better indication of what the user sang, whereas the HMM and quantizer estimators yield a more accurate transcription of what the user intended to sing. Indeed, so-called pitch-correction is explicitly incorporated into the HMM and quantizer note estimators, and a similar correction implicitly occurs for note detection.

The best classification accuracy observed is about 92%. We note that classification accuracy close to 100% could be achieved by removing the sample queries of three subjects from our test database. These queries were very inaccurate and some were virtually monotone, with only the lyrics as their recognizable feature. It is unclear that any QBH system should be designed to accommodate such queries. Specific modeling of singer and transcription error is becoming a more active area of QBH research [15], [41].

Our transcription methods are based exclusively on the estimated pitch contour. Employing other features such as broad spectral or phonetic information may significantly improve performance. Phonetic segmentation is a relatively mature area of research in the speech processing community [44], [45], although phonetic segmentation of sung melodies has not been explored in depth [46]. Substantial gains may also be possible by moving toward other query representations; this idea is becoming a more active area of QBH research [7], [16].

VII. CONCLUSION

This work has explored the front-end audio-processing component of a query-by-humming system, with emphasis on note segmentation and quantization. Experimental results for seven note estimators were presented. A smoothed pitch derivative was used as a baseline. A Kalman filter, an RLS filter, an NLMS filter, an ML estimator, an HMM, and a quantizer were also considered for segmentation. Of the seven estimators, the NLMS filter appeared to give the best segmentation performance, if only by a small margin. The HMM and quantizer estimators yielded segmentation performance roughly equal to that of the RLS filter. However, when the estimators were inserted into the complete query-by-humming system, the HMM and quantizer estimators yielded consistently superior classification.

We also examined the use of coarse representations in query-by-humming systems. By equating coarse representations with quantization, we were able to test whether quantization improved performance. For both pitch and duration, we found that coarse quantization yields no improvement in classification accuracy. This implies that, at least for the query-by-humming system implemented here, coarse melodic representations do not improve retrieval performance. However, we found that note segmenters that incorporate pitch quantization into the segmentation yield superior classification.

ACKNOWLEDGMENT

The authors would like to thank members of the MusEn Project, Bryan Pardo and Colin Meek, for their comments and contributions to this work.

REFERENCES

[1] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, "Query by humming: Musical information retrieval in an audio database," in Proc. ACM Multimedia, 1995.
[2] R. J. McNab, L. A. Smith, I. H. Witten, et al., "Toward the digital music library: Tune retrieval from acoustic input," in Proc. ACM Digital Libraries Conf., Bethesda, MD, 1996.
[3] R. B. Dannenberg, W. P. Birmingham, G. Tzanetakis, et al., "The MUSART testbed for query-by-humming evaluation," in Proc. ISMIR, Baltimore, MD, 2003.
[4] N. Kosugi, Y. Nishihara, T. Sakurai, et al., "A practical query-by-humming system for a large music database," in Proc. ACM Multimedia, Los Angeles, CA, 2000.
[5] S. Pauws, "CubyHum: A fully operational query by humming system," in Proc. ISMIR, Paris, France, 2002.
[6] L. P. Clarisse, J. P. Martens, M. Lesaffre, et al., "An auditory model based transcriber of singing sequences," in Proc. ISMIR, Paris, France, Oct. 2002.
[7] D. Mazzoni and R. B. Dannenberg, "Melody matching directly from audio," in Proc. ISMIR, Bloomington, IN, 2001.
[8] J. S. Downie, "Toward the scientific evaluation of music information retrieval systems," in Proc. ISMIR, Baltimore, MD, 2003.
[9] J. M. Batke, G. Eisenberg, P. Weishaupt, and T. Sikora, "A query by humming system using MPEG-7 descriptors," in Proc. AES 116th Convention, Berlin, Germany, May 2004.
[10] Y. Zhu and D. Shasha, "Warping indexes with envelope transforms for query by humming," in Proc. Int. Conf. Management of Data (SIGMOD), San Diego, CA, 2003.
[11] Melodyhound. [Online]
[12] Musicline. [Online]
[13] B. Liu, Y. Wu, and Y. Li, "A linear hidden Markov model for music information retrieval based on humming," in Proc. ICASSP, 2003.
[14] Y. E. Kim, W. Chai, R. Garcia, and B. Vercoe, "Analysis of a contour-based representation for melody," in Proc. ISMIR, Oct. 2000.
[15] A. Pikrakis, S. Theodoridis, and D. Kamarotos, "Recognition of isolated musical patterns using context dependent dynamic time warping," IEEE Trans. Speech Audio Process., vol. 11, no. 3, May 2003.
[16] N. H. Adams, M. A. Bartsch, J. Shiffrin, and G. H. Wakefield, "Time series alignment for music information retrieval," in Proc. ISMIR, Barcelona, Spain, 2004.
[17] T. Heinz and A. Brückmann, "Using a physiological ear model for automatic melody transcription and sound source recognition," in Proc. AES 114th Conv., Amsterdam, The Netherlands, Mar. 2003.
[18] J. Song, S. Bae, and K. Yoon, "Mid-level music melody representation of polyphonic audio for query-by-humming system," in Proc. ISMIR, Paris, France, 2002.
[19] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Upper Saddle River, NJ: Prentice-Hall, 1998.
[20] C. Meek and W. Birmingham, "Johnny can't sing: A comprehensive error model for sung music queries," in Proc. ISMIR, Paris, France, Oct. 2002.
[21] R. J. McNab and L. A. Smith, "Evaluation of a melody transcription system," in Proc. IEEE Int. Conf. Multimedia and Expo, vol. 2, 2000.
[22] D. Parsons, The Directory of Tunes. Cambridge, U.K.: Spencer Brown, 1975.
[23] J. Martínez, Ed., ISO/IEC MPEG-7 Overview, Mar. 2003. [Online]
[24] S. Quackenbush and A. Lindsay, "Overview of MPEG-7 audio," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, Jun. 2001.
[25] N. H. Adams, M. A. Bartsch, and G. H. Wakefield, "Coding of sung queries for music information retrieval," in Proc. IEEE WASPAA, New Paltz, NY, Oct. 2003.
[26] P. Boersma, "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," in Proc. Inst. Phonetic Sciences of the University of Amsterdam, vol. 17, 1993.
[27] M. A. Bartsch, Automatic assessment of the spasmodic voice, 2002. [Online]
[28] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[29] L. R. Bahl and F. Jelinek, "Decoding for channels with insertions, deletions, and substitutions with applications to speech recognition," IEEE Trans. Inf. Theory, vol. IT-21, 1975.
[30] R. J. McNab, L. A. Smith, D. Bainbridge, and I. H. Witten, "The New Zealand digital library MELody inDEX," D-Lib Mag., May 1997.

ACKNOWLEDGMENT

The authors would like to thank members of the MusEn Project, Bryan Pardo and Colin Meek, for their comments and contributions to this work.

REFERENCES

[1] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, "Query by humming: Musical information retrieval in an audio database," in Proc. ACM Multimedia, 1995.
[2] R. J. McNab, L. A. Smith, I. H. Witten, et al., "Toward the digital music library: Tune retrieval from acoustic input," in Proc. ACM Digital Libraries Conf., Bethesda, MD.
[3] R. B. Dannenberg, W. P. Birmingham, G. Tzanetakis, et al., "The MUSART testbed for query-by-humming evaluation," in Proc. ISMIR, Baltimore, MD, 2003.
[4] N. Kosugi, Y. Nishihara, T. Sakurai, et al., "A practical query-by-humming system for a large music database," in Proc. ACM Multimedia, Los Angeles, CA, 2000.
[5] S. Pauws, "CubyHum: A fully operational query by humming system," in Proc. ISMIR, Paris, France, 2002.
[6] L. P. Clarisse, J. P. Martens, M. Lesaffre, et al., "An auditory model based transcriber of singing sequences," in Proc. ISMIR, Paris, France, Oct. 2002.
[7] D. Mazzoni and R. B. Dannenberg, "Melody matching directly from audio," in Proc. ISMIR, Bloomington, IN, 2001.
[8] J. S. Downie, "Toward the scientific evaluation of music information retrieval systems," in Proc. ISMIR, Baltimore, MD, 2003.
[9] J. M. Batke, G. Eisenberg, P. Weishaupt, and T. Sikora, "A query by humming system using MPEG-7 descriptors," in Proc. AES 116th Convention, Berlin, Germany, May 2004.
[10] Y. Zhu and D. Shasha, "Warping indexes with envelope transforms for query by humming," in Proc. Int. Conf. Management of Data (SIGMOD), San Diego, CA, 2003.
[11] Melodyhound. [Online].
[12] Musicline. [Online].
[13] B. Liu, Y. Wu, and Y. Li, "A linear hidden Markov model for music information retrieval based on humming," in Proc. ICASSP, 2003.
[14] Y. E. Kim, W. Chai, R. Garcia, and B. Vercoe, "Analysis of a contour-based representation for melody," in Proc. ISMIR, Oct. 2000.
[15] A. Pikrakis, S. Theodoridis, and D. Kamarotos, "Recognition of isolated musical patterns using context dependent dynamic time warping," IEEE Trans. Speech Audio Process., vol. 11, no. 3, May 2003.
[16] N. H. Adams, M. A. Bartsch, J. Shiffrin, and G. H. Wakefield, "Time series alignment for music information retrieval," in Proc. ISMIR, Barcelona, Spain, 2004.
[17] T. Heinz and A. Brückmann, "Using a physiological ear model for automatic melody transcription and sound source recognition," in Proc. AES 114th Conv., Amsterdam, The Netherlands, Mar. 2003.
[18] J. Song, S. Bae, and K. Yoon, "Mid-level music melody representation of polyphonic audio for query-by-humming system," in Proc. ISMIR, Paris, France, 2002.
[19] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Upper Saddle River, NJ: Prentice-Hall, 1998.
[20] C. Meek and W. Birmingham, "Johnny can't sing: A comprehensive error model for sung music queries," in Proc. ISMIR, Paris, France, Oct. 2002.
[21] R. J. McNab and L. A. Smith, "Evaluation of a melody transcription system," in Proc. IEEE Int. Conf. Multimedia and Expo, vol. 2, 2000.
[22] D. Parsons, The Directory of Tunes. Cambridge, U.K.: Spencer Brown, 1975.
[23] J. Martínez, Ed., ISO/IEC MPEG-7 Overview, Mar. 2003. [Online].
[24] S. Quackenbush and A. Lindsay, "Overview of MPEG-7 audio," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, Jun. 2001.
[25] N. H. Adams, M. A. Bartsch, and G. H. Wakefield, "Coding of sung queries for music information retrieval," in Proc. IEEE WASPAA, New Paltz, NY, Oct. 2003.
[26] P. Boersma, "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," in Proc. Inst. Phonetic Sciences, Univ. Amsterdam, vol. 17, 1993.
[27] M. A. Bartsch, Automatic Assessment of the Spasmodic Voice, 2002. [Online].
[28] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[29] L. R. Bahl and F. Jelinek, "Decoding for channels with insertions, deletions, and substitutions with applications to speech recognition," IEEE Trans. Inf. Theory, vol. IT-21, 1975.
[30] R. J. McNab, L. A. Smith, D. Bainbridge, and I. H. Witten, "The New Zealand digital library MELody index," D-Lib Mag., May 1997.
[31] P. Stoica and R. Moses, Introduction to Spectral Analysis. Upper Saddle River, NJ: Prentice-Hall, 1997.
[32] A. Sterian, "Model-based segmentation of time-frequency images for musical transcription," Ph.D. dissertation, Dept. Elect. Eng. Comput. Sci., Univ. Michigan.
[33] N. H. Adams, Automatic Segmentation of Sung Melodies, 2002. [Online].
[34] J. R. Treichler, C. R. Johnson, and M. G. Larimore, Theory and Design of Adaptive Filters. Upper Saddle River, NJ: Prentice-Hall.
[35] J. Chung, E. J. Powers, W. M. Grady, and S. C. Bhatt, "Adaptive power-line disturbance detection scheme using a prediction error filter and a stop-and-go CA CFAR detector," in Proc. ICASSP, 1999.
[36] T. Zhang and C.-C. J. Kuo, "Audio content analysis for online audiovisual data segmentation and classification," IEEE Trans. Speech Audio Process., vol. 9, no. 4, May 2001.
[37] T. Pavlidis, "Waveform segmentation through functional approximation," IEEE Trans. Comput., vol. C-22, no. 7, Jul. 1973.
[38] L. R. Rabiner and B. H. Juang, "An introduction to hidden Markov models," IEEE ASSP Mag., pp. 4-16, Jan. 1986.
[39] R. M. Gray and D. L. Neuhoff, "Quantization," IEEE Trans. Inf. Theory, vol. 44, no. 6, 1998.
[40] M. Melucci and N. Orio, "Evaluating automatic melody segmentation aimed at music information retrieval," in Proc. JCDL, Portland, OR, Jul. 2002.
[41] C. Meek and W. P. Birmingham, "Automatic thematic extractor," J. Intell. Inf. Syst., vol. 21, no. 1, pp. 9-34, Jul. 2003.
[42] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., Jan. 1980.
[43] A. L. Uitdenbogerd and Y. W. Yap, "Was Parsons right? An experiment in usability of music representations for melody-based music retrieval," in Proc. ISMIR, Baltimore, MD, Oct. 2003.
[44] R. André-Obrecht, "A statistical approach for the automatic segmentation of continuous speech signals," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 1, Jan. 1988.
[45] D. T. Toledano, L. A. H. Gómez, and L. V. Grande, "Automatic phonetic segmentation," IEEE Trans. Speech Audio Process., vol. 11, no. 6, Nov. 2003.
[46] M. Mellody, M. A. Bartsch, and G. H. Wakefield, "Analysis of vowels in sung queries for a music information retrieval system," J. Intell. Inf. Syst., vol. 21, no. 1, Jul. 2003.

Norman H. Adams (S'96) received the B.S. and M.S. degrees in electrical engineering (with highest distinction) from the University of Virginia, Charlottesville, in 2000 and 2001, respectively. His research while at Virginia focused on modulation and coding for nonlinear fiber-optic communications. He is currently a Ph.D. candidate in the Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor. His research interests include music information retrieval, time-frequency representations, binaural sonification, and statistical signal processing for acoustic applications.

Mark A. Bartsch (M'04) received the B.S. degree in electrical engineering (summa cum laude) from the University of Dayton, Dayton, OH, in 2000, and the M.S.E. degree in electrical engineering from the University of Michigan, Ann Arbor, in 2002, where he is currently pursuing the Ph.D. degree. His research interests include analysis and resynthesis of the singing voice, musical information retrieval, and machine learning for signal processing applications.

Gregory H. Wakefield (M'85) received the B.A. degree (summa cum laude) in mathematics and psychology, the M.S. and Ph.D.
degrees in electrical engineering, and the Ph.D. degree in psychology, all from the University of Minnesota, Minneapolis, in 1978, 1982, 1985, and 1988, respectively. In 1986, he joined the faculty of the Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, where he is currently an Associate Professor. His research interests are drawn from time-frequency representations, music signal processing, auditory systems modeling, psychoacoustics, sensory prosthetics, and sound quality engineering. He serves as a consultant to various industries in sound quality engineering and time-frequency representations. Dr. Wakefield received the NSF Presidential Young Investigator Award in 1987 and the IEEE Millennium Award in 2000.
