IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING

Note Segmentation and Quantization for Music Information Retrieval

Norman H. Adams, Student Member, IEEE, Mark A. Bartsch, Member, IEEE, and Gregory H. Wakefield, Member, IEEE

Abstract: Much research in music information retrieval has focused on query-by-humming systems, which search melodic databases using sung queries. The database retrieval aspect of such systems has received considerable attention, but query processing and the melodic representation have not been examined as carefully. Common methods for query processing are based on musical intuition and historical momentum rather than specific performance criteria; existing systems often employ rudimentary note segmentation or coarse quantization of note estimates. In this work, we examine several alternative query processing methods as well as quantized melodic representations. One common difficulty with designing query-by-humming systems is the coupling between system components. We address this issue by measuring the performance of the query processing system both in isolation and coupled with a retrieval system. We first measure the segmentation performance of several note estimators. We then compute the retrieval accuracy of an experimental query-by-humming system that uses the various note estimators along with varying degrees of pitch and duration quantization. The results show that more advanced query processing can improve both segmentation performance and retrieval performance, although the best segmentation performance does not necessarily yield the best retrieval performance. Further, coarsely quantizing the melodic representation generally degrades retrieval accuracy.

Index Terms: Music information retrieval, pitch, pitch quantization, query-by-example, segmentation.

I. INTRODUCTION

Rapid searching of databases of music is one of the primary goals of music information retrieval (MIR). Part of the burgeoning field of content-based information retrieval, MIR research looks to organize and mine databases of music around their audio content. Many of us have experienced the frustration of knowing what a piece of music sounds like, but not knowing the artist or title of the piece. Query-by-humming (QBH) systems attempt to solve this problem by enabling database searches that use a fragment of sung melody as input.

Query-by-humming is a particularly active area of research in the MIR community. Early QBH successes [1], [2] have suggested numerous alternative systems [3]-[7], which upon further investigation have yielded results that are often inconclusive or difficult to generalize [3], [8].

Manuscript received May 3, 2004; revised October 12. This work was supported in part by a National Science Foundation Graduate Research Fellowship and a Graduate Assistance in Areas of National Need Fellowship, as well as by grants from the National Science Foundation (IIS) and the MusEn Project at the University of Michigan through the Office of the Vice President for Research. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gerald Schuller. N. H. Adams and G. H. Wakefield are with the Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, MI 48109 USA (e-mail: nhadams@umich.edu; ghw@umich.edu). M. A. Bartsch is with ATK Mission Research, Albuquerque, NM USA (e-mail: bartscma@ieee.org). Digital Object Identifier 10.1109/TSA
To date, most QBH research has adhered to the paradigm of implementing a complete QBH system and comparing its performance to that of other systems, a paradigm motivated in part by emerging applications in the music industry. Complete contemporary systems are described in [3]-[5], [9], and [10]. Several commercial systems are available online as well; Melodyhound [11] and Musicline [12] are two well-known systems that accept acoustic input for searching databases of folk and pop/rock themes. While this research approach has yielded some powerful examples, it has not provided a means for directly understanding how to improve any one example. This follows, in part, from the highly coupled nature of complete QBH systems: comparing only the final retrieval performance of independently developed QBH systems provides little understanding of the relationships between their constituent components. The focus of the present work is to explore one such component, sung query coding, and to measure the performance of various query coders both in isolation and when coupled with other components of a QBH system.

The performance of QBH systems depends upon three elements: the query processing component, the retrieval component, and the query representation. The query processing component encodes the sung query into an efficient representation for search and retrieval. Much of the research in QBH has focused on improving retrieval performance by examining either different retrieval systems or melodic representations [3], [7], [10], [13]-[16]. The query processing component, however, has received relatively little attention. As such, we apply to sung query coding several techniques that have found success in other speech and acoustic applications.

The most common query representation is a sequence of pitch and duration pairs [3]-[5], [9], [13]-[15]. While there is no evidence that this query representation is optimal in any specific technical sense, it is musically intuitive, and many existing databases of themes use this or a similar representation. That relatively little attention is given to the coding of sung queries perhaps reflects the belief that robust automatic sung melody transcription is intractable. Nevertheless, it is often observed that humans can accurately transcribe sung melodies that automatic systems cannot; this has motivated the incorporation of physiological models into sung melody transcription systems [6], [17]. Recently, the de facto reliance on this representation has been subject to more careful scrutiny, and alternative query representations are considered in [7], [10], [16], [18]. In the present work we restrict our attention to the domain of note representations.

Unfortunately, poorly articulated queries with suboptimal intonation are the rule for query-by-humming systems, a fact which makes robust query transcription difficult.

In particular, query segmentation is one of the most challenging aspects of this problem [3], [10], [19]. In many cases the majority of QBH retrieval errors result from note segmentation errors [20], [21]. Many existing QBH systems circumvent this problem by requiring the user to articulate each sung note with a separate "da" or "ta," in which case a simple amplitude threshold is effective [1], [2], [4], [5], [9], [13]. To segment the audio signal into separate musical note events, other solutions employ a metronome that the user must either sing along to or provide as input while singing [4]. In general, however, it is desirable to place no performance restrictions on the user, particularly if the intended user is not a trained singer. Unfortunately, most QBH systems use rudimentary segmenters that do not perform well with naturally sung queries.

To compensate for singer and transcription errors, many researchers have investigated the use of melodic representations that are robust to such errors [1]-[3], [9], [14], [22]. These robust representations apply a coarse quantization early in query processing. Such a scheme might represent a note using only one of two durations ("short" and "long") or one of three pitches (a note's pitch being higher than, lower than, or equal to the previous note's pitch). Singer and transcription errors will have a less detrimental effect on retrieval performance if all pitches or durations within some range are quantized to the same level. Robust representations can also reduce the computational complexity of the retrieval system. From a signal processing perspective, though, the notion that discarding information should improve retrieval performance is counterintuitive, particularly given the absence of a pitch-contour model of query production.

Nonetheless, such a quantization has been incorporated into the recent MPEG-7 standard. MPEG-7 was designed specifically to facilitate content-based information retrieval [23], reflecting the growing interest in mining the content of multimedia databases. QBH applications in particular were taken into account in the MPEG-7 standard [9], [24]. Included in the standard are Descriptors reserved for representing the main melody, or theme, of the audio file. These Descriptors include the fundamental frequency contour as well as a coarse quantization of the note sequence. This coarse representation was first proposed in the MIR community [14]; pitch differences between successive notes are quantized to a five-level codebook [9], [23].

In the present work, we examine two hypotheses. First, we propose that the use of more advanced query processing methods derived from standard signal processing techniques can improve both the segmentation and the retrieval performance of QBH systems. Second, we suggest that robust (i.e., coarsely quantized) melodic representations do not improve retrieval performance for QBH systems. To test these hypotheses, we first compare the performance of several query segmenters. We then couple the query processors with a simple retrieval system and measure the overall classification accuracy of the QBH system. Finally, we compare the classification accuracy for different query processing methods as well as various degrees of query quantization.

Fig. 1. Block diagram of query-by-humming system. Labels indicate locations where performance is measured.
There is no consensus on how best to quantify the performance of a melody transcription system; it is unclear what exactly constitutes a good transcription. Accordingly, we augment the performance statistics of the segmenters with those of the retrieval system as an alternative measure of transcription quality. The final retrieval performance can be interpreted as a measure of how close a transcription system places sung queries to the intended melody.

This paper presents an extension to work originally presented in [25]. The following section describes the methodology for testing our hypotheses. Sections III and IV describe the segmentation and quantization methods we implement. Sections V and VI present and discuss our results, and Section VII concludes the paper.

II. METHODOLOGY

The QBH system used for testing our hypotheses is shown in Fig. 1. The system consists of two primary components. The first is the query processing component, which estimates the sequence of notes sung by the user from a recorded acoustic signal. Each note consists of a (pitch, duration) pair, where pitch denotes the note's real-valued MIDI pitch number.¹ The duration of a note is taken to be the time between the start of the note and the start of the next note, often referred to as the inter-onset interval (IOI) [3]. The estimated note sequence is then classified by the retrieval component as one of the possible targets.

This work is predicated on the assumption that queries should be transcribed using primarily pitch information. Singers can sing the same melody with varying amplitude envelopes, lyrics, and style; we desire our QBH system to be invariant to such variables. As such, the first step in estimating the sequence of notes is to estimate the pitch contour of the sung query. A time-domain method is used to track the fundamental frequency contour [26], [27]. This algorithm computes the autocorrelation for overlapping windows of recorded data. The bias of the window function is mitigated by the normalization $\hat{r}(\tau) = r_x(\tau)/r_w(\tau)$, where $r_x(\tau)$ is the autocorrelation of the windowed data and $r_w(\tau)$ is the autocorrelation of the window function. A set of candidate peaks is selected for every frame and the Viterbi algorithm is used to construct a smooth contour. We use a step-size of 10 ms throughout this work.

¹The real-valued MIDI pitch number $p$ is related to a signal's fundamental frequency in Hz, $f$, as $p = 60 + 12\log_2(f/261.63)$. A MIDI pitch difference of one is referred to as a semitone.
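As an illustration of this front end, the following minimal sketch converts a fundamental frequency estimate to a real-valued MIDI pitch and computes the bias-corrected autocorrelation of one analysis frame. It is not the authors' implementation; the function names and the peak-picking details are our own.

```python
import numpy as np

def hz_to_midi(f):
    """Real-valued MIDI pitch from frequency in Hz (middle C, 261.63 Hz -> 60)."""
    return 60.0 + 12.0 * np.log2(f / 261.63)

def normalized_autocorr(frame, window):
    """Autocorrelation of a windowed frame, divided by the autocorrelation of
    the window itself to mitigate the window bias, as described above."""
    n = len(frame)
    x = frame * window
    r_x = np.correlate(x, x, mode="full")[n - 1:]            # lags 0 .. n-1
    r_w = np.correlate(window, window, mode="full")[n - 1:]
    return r_x / np.maximum(r_w, 1e-12)                      # avoid divide-by-zero

# Candidate fundamental periods are then taken from the peaks of this normalized
# autocorrelation for each 10 ms step, and a smooth contour is chosen by the
# Viterbi algorithm (not shown).
```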

Let $x(n)$ be the pitch contour computed for a sung query. Note boundaries must be detected in this contour. If the singer sustains constant pitch and transitions between notes instantly, the pitch contour $x(n)$ is piecewise constant, and its first-difference could be used to detect note boundaries. However, such ideal pitch contours are unrealistic. Untrained singers in particular will transition slowly between notes, taking as much as 200 ms to slide or scoop between notes. Furthermore, the pitch will typically fluctuate within note boundaries, as with vocal vibrato. The goal of the note segmenter is to detect legitimate note boundaries while neglecting spurious fluctuations in the pitch contour. Section III presents seven different note estimators, emphasizing note segmentation. The note pitches and durations are then quantized for use by the retrieval system; the quantizers are described in Section IV.

To achieve reasonable robustness to common errors such as note insertions and deletions, the retrieval system uses a classifier that computes the edit distance [28], [29] between the quantized query and each target in a database of songs. A query is classified as the target song which has the smallest edit distance. Edit costs are assigned to produce a low-complexity approximation to a dynamic time warping distance metric. Inserting or deleting a note with pitch and duration $(p, d)$ yields a cost equal to the duration of that note,

$c_{id}(p, d) = d.$   (1)

Replacing this note with a note having pitch and duration $(p', d')$ has cost

$c_{rep} = |d - d'| + \frac{|p - p'|}{\gamma}(d + d'),$   (2)

where $\gamma$ is the cross-over pitch difference that relates the replacement cost to the insertion and deletion costs: replacing two equal-duration notes with a pitch difference equal to $\gamma$ has cost equal to that of an insertion-deletion pair. A suitable value of $\gamma$ was found empirically. In order to compensate for global pitch offsets between the query and target, we iteratively subtract the mean pitch difference between the aligned sequences and then realign the sequences using the edit distance algorithm.

For performance evaluation, we employ a query database containing many sample queries of fourteen popular tunes, from the Beatles' "Hey Jude" to Richard Rodgers' "The Sound of Music." A total of 480 queries were collected from fifteen participants in our study. Each participant was asked to sing a familiar portion of a subset of the fourteen tunes four times. The participants had a variety of musical backgrounds; some had considerable musical or vocal training while most had none at all. Participants were instructed to sing each query as naturally as possible using the lyrics of the tune.² The queries are monophonic, 16-bit recordings sampled at 44.1 kHz and resampled to 8 kHz to reduce processing time. All data was collected in a quiet classroom setting and participants were free to progress at their own pace. This data is used to compute the segmentation performance and classification accuracy of the various configurations of our experimental query-by-humming system.

The retrieval database of target songs consists of ideal representations of the fourteen songs. Note that every query represents the exact portion of melody contained in the target; that is, only the sequence of pitches sung by the participants for "Hey Jude," for example, is contained in the database. For a real-world QBH system it is unreasonable to assume the user will always sing the exact portion of a tune contained in the database. Some systems address this problem by including multiple themes for each tune in the database [20]. For comparing the relative performance of various note estimation and quantization methods, however, we do not consider such complications.

²This contrasts substantially with the common practice of having participants sing isolated pitches on a neutral vowel.
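To make the retrieval component concrete, the sketch below implements the duration-weighted edit distance and nearest-target classification described in this section. It is a minimal sketch under the cost assignment of (1) and (2); the value of γ was elided in the source, so the value here is a placeholder, and the iterative mean-pitch-offset realignment is omitted.

```python
import numpy as np

GAMMA = 1.0  # cross-over pitch difference in semitones; placeholder value

def indel_cost(note):
    _pitch, dur = note
    return dur                                   # Eq. (1): cost equals note duration

def repl_cost(a, b):
    (p1, d1), (p2, d2) = a, b
    return abs(d1 - d2) + abs(p1 - p2) / GAMMA * (d1 + d2)   # Eq. (2)

def edit_distance(query, target):
    """Edit distance between two lists of (pitch, duration) notes."""
    n, m = len(query), len(target)
    D = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        D[i, 0] = D[i - 1, 0] + indel_cost(query[i - 1])
    for j in range(1, m + 1):
        D[0, j] = D[0, j - 1] + indel_cost(target[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j] + indel_cost(query[i - 1]),       # deletion
                          D[i, j - 1] + indel_cost(target[j - 1]),      # insertion
                          D[i - 1, j - 1] + repl_cost(query[i - 1],
                                                      target[j - 1]))   # replacement
    return D[n, m]

def classify(query, targets):
    """Index of the target with the smallest edit distance to the query."""
    return min(range(len(targets)), key=lambda k: edit_distance(query, targets[k]))
```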
III. NOTE SEGMENTATION

A sequence of notes must be estimated from the pitch contour for use by the retrieval system. Detecting note boundaries in the pitch contour is one of the most challenging aspects of the note estimation [19]-[21]. Accordingly, we are primarily concerned with the note segmentation component of the note estimators. The following subsections present seven note segmentation methods. The first four segmenters perform the segmentation before estimating note pitch; the pitch of each note is then estimated as the mean pitch contour value between the beginning and end of the note. The last three estimators perform the segmentation and pitch assignment concurrently. Furthermore, the last two estimators employ pitch quantization prior to segmentation. In this way we explore not only different note segmenters, but also which priors are most effective to incorporate into the segmentation.

The note segmenters often yield clusters of spurious notes around a single legitimate note. A note thinning procedure is therefore required to reduce these clusters to a single boundary. One method for thinning spurious notes is to enforce a minimum duration constraint, analogous to removing spurious edges in edge detection. In [2], all notes shorter than a minimum duration are discarded. We found that merging notes shorter than the minimum duration (150 ms in this work) into their nearest neighbor (in pitch), beginning with the shortest note, yields better results.

A. Baseline

In [2] the use of a smoothed derivative is proposed to detect note boundaries. Adjacent 20 ms windows of pitch contour are compared; if the difference between the average pitch of each window is greater than some threshold $T$, a new note is inserted. By adjusting $T$, missed notes can be traded for spurious notes. For the query database used in the present work we found that 80 ms windows yield better results, perhaps reflecting larger contour fluctuations in our sample queries than in [2]. Similar note detectors are used in [3], [6], [9]. However, MELDEX [30], which was developed by the authors of [2], uses an RMS amplitude threshold rather than this baseline segmentation method. Indeed, most QBH systems to date use either a variant of an amplitude threshold or a variant of this simple baseline pitch segmenter.
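The sketch below illustrates the baseline segmenter and the note thinning procedure just described; it is a hedged reading of [2] and of our modification, assuming the 10 ms contour step size of Section II. The 80 ms window and 150 ms minimum duration come from the text, while the function names are our own.

```python
import numpy as np

STEP_S = 0.010                 # pitch-contour step size: 10 ms
WIN = int(0.080 / STEP_S)      # 80 ms comparison windows

def baseline_boundaries(contour, thresh):
    """Mark a boundary wherever the mean pitch of adjacent 80 ms windows
    differs by more than thresh (semitones). Boundaries arrive in clusters,
    which the thinning step below reduces."""
    bounds = []
    for n in range(WIN, len(contour) - WIN):
        left = np.mean(contour[n - WIN:n])
        right = np.mean(contour[n:n + WIN])
        if abs(right - left) > thresh:
            bounds.append(n)
    return bounds

def thin_notes(notes, min_dur=0.150):
    """Merge each note shorter than min_dur (seconds) into whichever neighbor
    is nearer in pitch, starting from the shortest note."""
    notes = list(notes)                    # (pitch, duration) pairs
    while len(notes) > 1:
        k = min(range(len(notes)), key=lambda i: notes[i][1])
        if notes[k][1] >= min_dur:
            break                          # all remaining notes are long enough
        nbrs = [i for i in (k - 1, k + 1) if 0 <= i < len(notes)]
        j = min(nbrs, key=lambda i: abs(notes[i][0] - notes[k][0]))
        p, d = notes[j]
        notes[j] = (p, d + notes[k][1])    # absorb the short note's duration
        del notes[k]
    return notes
```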

B. Kalman Filter

One of the difficulties encountered in segmenting continuously sung melodies is that the pitch contour can be quite volatile. The magnitude of pitch fluctuations within a sustained note can exceed a semitone, as is often seen in full vibrato. While the appearance of vibrato is far from universal, it is common enough that accounting for its contribution to pitch variance is likely to improve segmentation. By modeling pitch fluctuations as the output of a stationary linear system driven by white noise, a Kalman filter can be used to track the pitch contour.

In order for the Kalman filter to accurately track contours within note boundaries, a statistical analysis of the pitch contours is necessary. Thirty queries containing a large number of notes longer than 400 ms were selected for analysis. The pitch contours were extracted for each of these queries and edited to include only the stable segments of long notes. The average power spectral density (PSD) of each segment was estimated and then averaged across segments (in the log domain) [31], [32]. The PSD exhibits a low-pass characteristic with a weak resonance near 3 Hz [33], which can be modeled as a second-order autoregressive system. Accordingly, let $x(n)$ be the value of the pitch contour at time index $n$,

$x(n) = a_1 x(n-1) + a_2 x(n-2) + w(n)$   (3)

where $a_1$ and $a_2$ are the AR model parameters and $w(n)$ is white Gaussian noise. The segments were manually divided into two categories, those with and without vibrato, and the segments with vibrato were used for the parametric estimation. Using the Levinson-Durbin recursion [31], a least-squares best fit was found that places the poles at an angle corresponding to 3.5 Hz. This agrees with the rates commonly associated with vibrato.

Given the system model (3), the Kalman state vector is $\mathbf{x}(n) = [x(n)\ x(n-1)]^T$, with state update and observation equations given by

$\mathbf{x}(n+1) = \mathbf{A}\,\mathbf{x}(n) + \mathbf{w}(n)$   (4)

$y(n) = \mathbf{c}^T \mathbf{x}(n) + v(n)$   (5)

where $\mathbf{A}$ contains the AR parameters in companion form, $\mathbf{c} = [1\ 0]^T$, and $v(n)$ is the observation noise. An exponentially-weighted running mean is subtracted from the query pitch contour and used as the observation sequence for the Kalman filter to predict the pitch state. With the predicted observation

$\hat{y}(n|n-1) = \mathbf{c}^T \hat{\mathbf{x}}(n|n-1)$   (6)

the error function $e(n)$ is given by

$e(n) = \dfrac{\left(y(n) - \hat{y}(n|n-1)\right)^2}{\mathbf{c}^T \mathbf{P}(n|n-1)\,\mathbf{c} + \sigma_v^2}$   (7)

where $\sigma_v^2$ is the variance of the observation noise and $\mathbf{P}(n|n-1)$ is the expected covariance of the system state [32]. This distance is then compared to a threshold to determine whether a new note should be inserted. A sample output of the Kalman filter is shown in Fig. 2. For purposes of illustration we chose a portion of pitch contour that is more piecewise constant than most in the database, so that the behavior of the Kalman filter is evident.

Fig. 2. Top curve shows a segment of pitch contour. The second curve shows the Kalman prediction error and the bottom curve shows the RLS prediction error. The dotted line is a potential detection threshold.

C. RLS Filter

While the Kalman filter predicts contours with stable vibrato, many other fluctuations are not accurately tracked; the stationarity assumption is too restrictive. The recursive least-squares (RLS) filter is a special case of the general, time-varying Kalman filter [33]-[35]. We employ a predictive RLS filter to track the observed pitch contour. Let the $\Delta$-step prediction be given by

$\hat{y}(n + \Delta) = \mathbf{h}(n)^T \mathbf{u}(n)$   (8)

where $\mathbf{h}(n)$ are the adaptive filter coefficients and $\mathbf{u}(n) = [y(n)\ y(n-1)\ \cdots\ y(n-M+1)]^T$ is the information vector for the current time index. These coefficients are updated at every time step to minimize the exponentially weighted cumulative squared error. The optimal linear solution is found by a recursive implementation of the common normal equations [34],

$\mathbf{R}(n) = \lambda\,\mathbf{R}(n-1) + \mathbf{u}(n)\,\mathbf{u}(n)^T$   (9)

$\mathbf{h}(n) = \mathbf{R}(n)^{-1}\,\mathbf{r}(n)$   (10)

where $\mathbf{R}(n)$ is the autocorrelation matrix of $\mathbf{u}(n)$, $\mathbf{r}(n)$ is the corresponding cross-correlation with the prediction target, and $\lambda$ is a forgetting factor. A forgetting factor slightly less than one was found to work well, and a modest filter order $M$ is sufficient for a $\Delta$-step predictor. As was the case for the Kalman filter, the prediction error is used as the statistic to determine whether a new note has occurred. An example output is shown in Fig. 2.
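A compact sketch of the $\Delta$-step RLS predictor of (8)-(10) follows, using the standard matrix-inversion-lemma form of the recursion. The filter order, forgetting factor, and prediction step were elided in the source, so the defaults here are placeholders.

```python
import numpy as np

def rls_prediction_error(y, order=4, lam=0.99, delta=1):
    """Exponentially weighted RLS Delta-step predictor over a pitch contour y.
    Returns the squared prediction error at each step, which serves as the
    detection statistic compared against a threshold."""
    h = np.zeros(order)                     # adaptive filter coefficients
    P = np.eye(order) * 1e3                 # running inverse of R(n) in Eq. (9)
    err = np.zeros(len(y))
    for n in range(order - 1, len(y) - delta):
        u = y[n - order + 1:n + 1][::-1]    # information vector, most recent first
        e = y[n + delta] - h @ u            # Delta-step prediction error, Eq. (8)
        err[n + delta] = e ** 2
        k = P @ u / (lam + u @ P @ u)       # gain vector
        h = h + k * e                       # coefficient update
        P = (P - np.outer(k, u @ P)) / lam  # recursive inverse-autocorrelation update
    return err
```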
D. Nonlinear LMS Filter

The previous two methods employ fairly restrictive assumptions about the pitch contours. At the opposite extreme, machine learning techniques can be used to discover regularity in the data when analytic models are not available.

In the following, we explore the use of a perceptron for segmentation. A perceptron is a least mean-squares (LMS) filter augmented with a (nonlinear) threshold operator [34]. A five-element feature vector is computed for nonoverlapping 80 ms windows of pitch contour [33], [36]. This feature vector forms the input to the NLMS filter, which detects note boundaries contained within the window.

The feature vector, $\mathbf{f} = [f_1\ f_2\ f_3\ f_4\ f_5]^T$, was found heuristically by observing the dynamic properties of the measured pitch contours. For each feature, larger values are indicative of note boundaries. The first two features are computed from the pitch contour: $f_1$ is a smoothed derivative and $f_2$ is the maximum first-difference for the current window of data. The next two features are derived from the autocorrelation contour that accompanies the pitch contour [33]. The autocorrelation contour is interpreted as a measure of how pronounced the sung pitch is; the autocorrelation often decreases during note transitions. The last feature, $f_5$, is a smoothed derivative of the RMS amplitude contour. Precise definitions of these features can be found in [33].

For each window, a linear combination of the feature vector, $\mathbf{h}^T\mathbf{f}$, is taken as the final decision statistic. The following iteration was repeated until $\mathbf{h}$ converged to a stable solution:

$\mathbf{h} \leftarrow \mathbf{h} + v^{+}\mathbf{f}$ if a boundary is missed; $\mathbf{h} \leftarrow \mathbf{h} - v^{-}\mathbf{f}$ if a boundary is mistakenly inserted; $\mathbf{h}$ unchanged otherwise,   (11)

where suitable values of the learning rates $v^{+}$ and $v^{-}$ were found empirically.³ The perceptron converged to a stable set of weights based upon 30 manually segmented queries. The predominant weight associated with the pitch features $f_1$ and $f_2$ supports our earlier claim that note segmentation is best performed using these pitch-bearing features.

³Queries were not edited prior to training. Because the majority of 80 ms windows did not contain note boundaries, we found that allowing $v^{+}$ and $v^{-}$ to take on different values led to better performance.
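A minimal sketch of the training rule of (11) is given below; the labels, learning rates, and convergence test are illustrative, and the five features themselves are assumed to be precomputed per 80 ms window as in [33].

```python
import numpy as np

def train_perceptron(frames, labels, v_plus=0.1, v_minus=0.01, max_epochs=100):
    """Perceptron training rule of Eq. (11). frames: (N, 5) array of feature
    vectors, one per 80 ms window; labels: 1 if the window contains a note
    boundary, else 0. Distinct rates v_plus/v_minus reflect footnote 3; their
    values here are placeholders."""
    h = np.zeros(frames.shape[1])
    for _ in range(max_epochs):
        changed = False
        for f, y in zip(frames, labels):
            detected = (h @ f) > 0.0         # thresholded linear decision statistic
            if y == 1 and not detected:      # boundary missed
                h += v_plus * f
                changed = True
            elif y == 0 and detected:        # boundary mistakenly inserted
                h -= v_minus * f
                changed = True
        if not changed:                      # converged to a stable solution
            break
    return h
```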
E. ML Segmenter

The remaining three methods, in contrast to the previous four, perform note segmentation and pitch estimation concurrently. Simultaneous estimation of the note boundaries and pitches is a higher-dimensional problem than estimation of the note boundaries alone; indeed, the ML and HMM segmenters are of somewhat higher computational order than the other methods presented here. Nonetheless, it is worth exploring whether estimating note boundaries and pitches simultaneously improves performance. If the observed pitch contour is modeled as an arbitrary piecewise constant signal corrupted by AWGN, optimal note boundary estimates are given by the MMSE piecewise constant curve fit to the observed data [37]. If we further assume that the observed contour represents $K$ constant regions, a convenient search algorithm is readily found [19], and is summarized in the following.

Consider an observed pitch contour $x(n)$ and a piecewise constant signal containing $K$ constant regions,

$s(n) = p_k, \quad n_{k-1} \le n < n_k, \quad k = 1, \ldots, K$   (13)

where $p_k$ are the pitches of each region and $n_1, \ldots, n_{K-1}$ are the boundaries between regions, with $n_0 = 0$ and $n_K = N$. This $(2K-1)$-dimensional parameter space must be searched to find the minimum error between the observed contour and the piecewise constant fit. An exhaustive search is computationally impractical, so dynamic programming is employed to keep complexity manageable. An absolute error criterion was found to yield better segmentation performance than the more common squared error criterion; in this case the optimal pitch for given boundaries is the median of the pitch contour values. Let the error for a constant region bounded by $m$ and $n$ with pitch $p$ be

$E(m, n) = \sum_{i=m}^{n-1} |x(i) - p|, \quad p = \operatorname{median}\{x(m), \ldots, x(n-1)\}.$   (14)

Suppose an optimal $k$-region fit has been found for time steps $0$ through $m$, with total error $J_k(m)$. A recursive formula for the optimal fit is found by observing

$J_k(n) = \min_{m < n} \left[\, J_{k-1}(m) + E(m, n) \,\right].$   (15)

This formula is used to compute $J_k(n)$ for all $k$ and $n$. The optimal segmentation is then found by backward recursion from $n = N$. A detailed description of the algorithm can be found in [19]. Note that while dynamic programming improves search speed considerably, this segmenter is still far slower than the other methods presented here.

This algorithm assumes the number of notes is known a priori. Generalizing the procedure to estimate the number of notes as well renders the algorithm computationally impractical [37]. Instead, a pragmatic solution overestimates the number of notes in the contour (four notes per second of contour, for example). In this case the raw output of the ML segmenter has a high false-alarm probability. As with the other methods, however, these spurious notes often occur in clusters around a single legitimate note, and most of them are removed by the note thinning procedure described in Section III-A.
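The following sketch implements the recursion (13)-(15) directly; it is illustrative rather than the authors' code, and it recomputes segment errors naively where a practical implementation would cache cumulative statistics.

```python
import numpy as np

def segment_error(x, m, n):
    """Absolute-error cost of fitting x[m:n] with its median pitch, Eq. (14)."""
    seg = x[m:n]
    return float(np.abs(seg - np.median(seg)).sum())

def ml_segment(x, K):
    """Optimal K-region piecewise-constant fit by dynamic programming, Eq. (15).
    Returns region boundaries [0, n_1, ..., n_{K-1}, N]."""
    N = len(x)
    J = np.full((K + 1, N + 1), np.inf)    # J[k, n]: best error, k regions over x[:n]
    back = np.zeros((K + 1, N + 1), dtype=int)
    J[0, 0] = 0.0
    for k in range(1, K + 1):
        for n in range(k, N + 1):
            for m in range(k - 1, n):      # last region spans x[m:n]
                c = J[k - 1, m] + segment_error(x, m, n)
                if c < J[k, n]:
                    J[k, n], back[k, n] = c, m
    bounds = [N]                           # backward recursion, as in the text
    for k in range(K, 0, -1):
        bounds.append(int(back[k, bounds[-1]]))
    return bounds[::-1]
```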

F. Quantization Segmenter

None of the segmenters described above make use of the fact that most melodies found in Western music are restricted to a 12-tone scale. In this work we incorporate a 12-tone structure into two note segmenters. Both segmenters require that a codebook of quantization pitches be specified before segmentation; the design of this codebook is discussed in Section IV-A. The first of the two segmenters, the quantizer segmenter, uses this codebook directly to perform segmentation, whereas the second, the HMM segmenter, uses the elements of the codebook as states. We defer discussion of the HMM segmenter to the next subsection.

After the codebook has been selected, the pitch contour is quantized and every pitch transition in the output is interpreted as the start of a new note. As expected, this segmentation is riddled with spurious notes [15], but many are removed by the note thinning procedure (Section III-A). In addition to the minimum-duration thinning, another thinning procedure is beneficial for this segmenter. For a given detected note boundary, let $\Delta p$ be the difference between the unquantized average pitch of the notes on either side of the boundary. Beginning with the smallest $\Delta p$, all note boundaries whose $\Delta p$ falls below a threshold are removed. A suitable threshold, expressed in cents,⁴ was found empirically.

Note that for general sung melody transcription this segmentation method may not be appropriate. Most singers, especially untrained ones, demonstrate considerable pitch drift while singing. For many query-by-humming systems this is not a problem, however: such systems only use a short segment of sung melody, over which pitch drift is negligible.

⁴One cent is 1/100th of a semitone.
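A minimal sketch of the quantizer segmenter follows: the contour is snapped to a codebook with q levels per octave, a boundary is declared at every level transition, and boundaries with small unquantized pitch differences are thinned. The threshold value is a placeholder, since the paper's figure (in cents) was lost in extraction.

```python
import numpy as np

def quantize_contour(contour, q=12, offset=0.0):
    """Quantize a MIDI pitch contour to q uniform levels per octave."""
    step = 12.0 / q                        # quantization step in semitones
    return np.round((contour - offset) / step) * step + offset

def quantizer_segment(contour, q=12, offset=0.0, min_gap_cents=50.0):
    """Boundaries at every transition of the quantized contour, followed by
    pitch-difference thinning. min_gap_cents is a placeholder threshold."""
    qc = quantize_contour(contour, q, offset)
    bounds = [0] + [n for n in range(1, len(qc)) if qc[n] != qc[n - 1]] + [len(qc)]
    while len(bounds) > 2:
        # unquantized mean-pitch difference across each interior boundary
        gaps = [abs(np.mean(contour[bounds[i - 1]:bounds[i]]) -
                    np.mean(contour[bounds[i]:bounds[i + 1]]))
                for i in range(1, len(bounds) - 1)]
        i = int(np.argmin(gaps))
        if gaps[i] * 100.0 >= min_gap_cents:   # 100 cents per semitone
            break
        del bounds[i + 1]                      # remove the weakest boundary
    return bounds
```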
G. HMM Segmenter

The final segmenter employs a hidden Markov model (HMM) to estimate both note pitches and boundaries [38]. The pitch contour is modeled as a piecewise constant signal restricted to an equal-tempered scale, as in the previous segmenter. In the present case, each of the quantized tones defines a state in the HMM. Fig. 3 shows a portion of an example HMM with 12 states per octave, labeled according to Western scale pitches.

Fig. 3. Portion of HMM used for note segmentation.

Two state transition probability distributions were implemented. The first assigns a high (98%, for example) probability of self-transition, $a_{jj}$, and uniformly assigns the remaining transition probabilities, $a_{ij} = (1 - a_{jj})/(S - 1)$ for $i \ne j$, where $S$ is the total number of HMM states. The second uses the Yule algorithm with the fourteen target melodies as a training set. Of the two, the first was found to yield better performance, perhaps due to the small set of melodies used for training the second. Following [33], the observation noise is modeled as IID Laplacian. The final note estimates are given by the most probable state sequence for the observed pitch contour.

Similar to the dynamic programming algorithm used for the ML segmenter, optimal prefix sequences are computed for every state at every time step. Suppose the HMM is in state $j$ at time step $n$, and let $\delta_n(j)$ be the probability of the most likely state sequence ending in state $j$ at time step $n$, multiplied by the probability of the observation sequence. Using the same notation as Section III-E, a recursive formula for $\delta_n(j)$ can be found by observing

$\delta_n(j) = \max_i \left[\, \delta_{n-1}(i)\, a_{ij} \,\right] b_j(x(n))$   (16)

where $a_{ij}$ is the probability of being in state $j$ given the previous state $i$, and $b_j(x(n))$ is the probability of the observation $x(n)$ given state $j$. The Viterbi algorithm is used to compute $\delta_n(j)$ for all $n$ and $j$. The optimal segmentation is then found by backward recursion through the state trellis. A detailed description of the algorithm can be found in [38]. Because the probability of self-transition is large, the HMM estimator is relatively robust to note insertion errors. Nonetheless, the same note thinning procedure used for the other methods is employed here. While the HMM algorithm is somewhat slower than most of the other methods, it is considerably faster than the ML segmenter.

IV. NOTE QUANTIZATION

One of the goals of this work is to determine whether coarsely quantized melodic representations improve classification accuracy. Thus, we perform an explicit quantization in our query-processing system [39]. We restrict attention to separate quantization of pitch and duration. Furthermore, we do not consider quantizing several notes together, due to the difficulty of designing an appropriate codebook of melodic phrases [40], [41].

A. Pitch Quantization

Uniform scalar quantization with $Q$ levels per octave is applied to the pitch estimates [39]. Since we are using MIDI pitch number to represent pitch, when $Q = 12$ we have a musically intuitive quantization of pitch to the equal-tempered scale, with one quantization level per semitone. Setting $Q < 12$ yields a coarser pitch quantization, which is more robust to a singer's pitch errors. We examine a variety of values of $Q$ for uniform pitch quantization.

Singers typically do not have perfect pitch; a user might, for example, sing a query 50 cents off from the standard equal-tempered tuning. To minimize errors caused by this offset, we perform a search for an optimal pitch offset before quantizing. For a set of offsets spanning one quantization interval ($1/Q$ of an octave) and separated by 5 cents, we compute the mean squared quantization error for the observed pitch contour. The offset with minimum MSE is chosen as the optimal offset.
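The offset search sketched below reuses quantize_contour from the Section III-F sketch; the search grid follows the description above, while the function name and return convention are our own.

```python
import numpy as np

def best_offset(contour, q=12, step_cents=5.0):
    """Search pitch offsets over one quantization interval (1200/q cents) in
    5-cent steps and return the offset (in semitones) that minimizes the mean
    squared quantization error of the contour."""
    interval_cents = 1200.0 / q
    offsets = np.arange(0.0, interval_cents, step_cents) / 100.0   # semitones
    mse = [np.mean((contour - quantize_contour(contour, q, off)) ** 2)
           for off in offsets]
    return float(offsets[int(np.argmin(mse))])
```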

The pitch quantization levels need not be uniformly distributed. Many popular melodies are locally restricted to a single key, such as C major. To quantize a sung query to a modal codebook, with seven nonuniformly spaced levels per octave, it is necessary to estimate the tonic of the sung query.⁵ We implemented a modal quantizer by testing codebook offsets over a full octave, similar to the uniform quantizer. The tonic estimates were found to be unreliable, however, hindering retrieval performance of the QBH system.

⁵For example, for a query sung in C major, the tonic would be the pitch class of C = (..., 261 Hz, 522 Hz, ...).

B. Duration Quantization

Scalar quantization with $L$ levels is applied to the duration estimates. Prior to quantizing the note durations, the durations are normalized by the total duration of the query, thus ensuring tempo invariance of the QBH system. Note that this normalization is only appropriate if the query can be assumed to contain the same portion of melody as contained in the target theme, as discussed in Section II. Due to this normalization, only a fraction of the levels are ultimately used. However, this is not of concern because we are interested in quantization as a method of reducing singer and transcription error, not as a method of reducing the number of bits needed to store the transcribed query.

Uniform, logarithmic, and adaptive codebooks are explored [39]. For uniform quantization, durations are uniformly spaced between zero and one, yielding a quantization density function $\lambda(d)$ that is constant on that interval, where $d$ is the normalized note duration. In many Western melodies, most note durations are related to some minimum duration by a power of two: eighth notes, quarter notes, half notes, and so forth. This implies that it may be appropriate to concentrate duration levels closer to zero. For logarithmic quantization, durations are spaced such that the quantization density function is concentrated near zero. We found the performance of the uniform and logarithmic codebooks to be equivalent.

Most melodic phrases contain far fewer unique note durations than the total number of notes. Therefore, it may be useful to search for clusters of note durations in the query note estimates. This suggests an LBG (K-means) clustering, performed for every query, that adapts the quantization codebook to the distribution of observed durations [42]. We implemented such an adaptive quantizer but found lackluster performance: LBG clustering requires a large number of training points relative to the number of quantization levels, so the adaptive quantizer's performance is comparable to the uniform quantizer for very small $L$, but poor otherwise. Hence, in the following we report the performance for only the uniform quantizer with a varying number of levels, $L$.
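A sketch of the duration path follows: normalization by total query duration for tempo invariance, then uniform scalar quantization with L levels on (0, 1). The level-center convention is an assumption of ours.

```python
import numpy as np

def quantize_durations(durations, L=8):
    """Normalize note durations by the total query duration (tempo invariance),
    then uniformly quantize with L levels on (0, 1). Because the normalized
    durations sum to one, only the lower levels are typically occupied."""
    d = np.asarray(durations, dtype=float)
    d = d / d.sum()                                # normalized durations
    idx = np.minimum((d * L).astype(int), L - 1)   # uniform cell index: floor(d * L)
    centers = (np.arange(L) + 0.5) / L             # reproduction level per cell
    return centers[idx]
```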
V. RESULTS

As discussed in Section I, we examine two hypotheses in the present work: that alternative note estimators yield better segmentation performance and ultimately better retrieval performance, and that coarsely quantized query representations do not improve retrieval performance. To test these hypotheses, we first examine the segmentation performance of the note estimators alone, and then couple the note estimators with the retrieval system described in Section II to examine classification accuracy.

To evaluate the segmentation performance of the note estimators, 80 of the queries from our test set were manually segmented by the first author. Often the precise number and location of the segments to insert in the observed pitch contour were ambiguous; however, we have attempted to maintain consistency as much as possible. We numerically estimate the receiver operating characteristic (ROC) of each segmenter by comparing its output to the manual segmentations. An alignment radius of 100 ms is used in comparing the manual and automatic segmentations. For varying decision thresholds, we compute the false-alarm rate, which is the probability that a given note boundary detected by the segmentation algorithm is spurious, and the detection rate, which is the probability that a true note boundary is detected by the segmentation algorithm.

The first four segmenters presented in Section III detect note boundaries by computing a decision statistic and comparing this statistic to a threshold $T$; by adjusting $T$, an estimate of each segmenter's ROC curve is computed. The last three segmenters presented in Section III do not employ an overt decision threshold, hence another parameter must be selected for estimating the ROC curves. For the ML segmenter, the a priori note number $K$ is adjusted. For the quantizer segmenter, the minimum note duration is adjusted. Finally, for the HMM segmenter, the probability of self-transition $a_{jj}$ is adjusted. The ROC curves for the quantizer and HMM segmenters are estimated using 12 pitches per octave.
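One way to compute each (false-alarm, detection) operating point is a one-to-one matching of automatic to manual boundaries within the 100 ms alignment radius; the greedy matching below is our reading of that procedure, not necessarily the authors' exact implementation.

```python
def roc_point(detected, truth, radius=0.100):
    """Greedy one-to-one matching of detected boundaries to manual boundaries
    (times in seconds) within the alignment radius; returns the detection rate
    and false-alarm rate as defined in the text."""
    truth = sorted(truth)
    used = [False] * len(truth)
    hits = 0
    for t in sorted(detected):
        cands = [i for i, u in enumerate(truth)
                 if not used[i] and abs(u - t) <= radius]
        if cands:
            i = min(cands, key=lambda i: abs(truth[i] - t))
            used[i] = True                  # each true boundary matches at most once
            hits += 1
    detection_rate = hits / len(truth) if truth else 1.0
    false_alarm_rate = (len(detected) - hits) / len(detected) if detected else 0.0
    return detection_rate, false_alarm_rate
```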

Fig. 4. Detection performance for seven note segmenters. The curve labeled "w/o thinning" refers to the baseline segmenter without the minimum note duration thinning; all other segmenters employ this thinning.

Fig. 4 displays ROC curves for the seven estimators examined in this work. The false-alarm rate is represented along the abscissa and the detection rate along the ordinate (note the scale of the plot, with false-alarm rates from 0 to 0.27 and detection rates from 0.64 to 1). The parameter in the figure is the segmentation method. All of the ROC curves are monotonic nondecreasing. Of all the segmenters, the baseline segmenter without note-thinning stands out as poorer than the rest. Including the note-thinning procedure improves performance considerably for the baseline segmenter, as well as for the other methods; note-thinning is critical for all methods considered here. The ROC curves for the remaining segmenters all include note-thinning.

The remaining methods all yield ROC curves that overlap considerably. The RLS, HMM, and quantizer segmenters give essentially equivalent performance. The NLMS segmenter yields the best segmentation performance by a modest margin, giving a 4% improvement over the next best method at a false-alarm rate of 2.5%. The Kalman and ML segmenters yield the worst performance of the alternative segmenters.

We next couple the various note estimators with a retrieval system to measure the classification accuracy. The retrieval performance of the query-by-humming system is computed by running the system on each of the 480 queries in our query database. The classification accuracy is computed by determining the fraction of queries which are classified as the correct target song. To prevent the figures from becoming too cluttered, the Kalman and ML estimators are omitted from the following results, their performance being generally lackluster. Because our alignment algorithm assigns equal cost to insertion and deletion errors, it is reasonable to suppose that optimal classification accuracy is achieved when the segmenter insertion and deletion rates are equal. This was indeed found to be the case, hence all segmenters are tuned to have equal note insertion and deletion rates when coupled with the retrieval system.

Fig. 5. Classification accuracy for five note estimators versus resolution of the uniform pitch quantizer.

Fig. 5 shows the overall classification accuracy of five note estimators versus the number of pitch quantization levels. The abscissa represents the number of pitches per octave, $Q$, and the ordinate represents classification accuracy; the parameter is the segmentation method. A fixed duration quantization is applied for all segmentation methods. Data are shown along with best-fitting exponential curves. All curves are monotonic nondecreasing and plateau around six to eight levels per octave. Again, the baseline segmenter without note-thinning yields the worst performance, achieving a maximum classification accuracy of 80%. The HMM and quantizer segmenters yield the best classification accuracy, giving a maximum of 92%. The remaining methods give similar performance, with a maximum of about 88%.

Fig. 6. Classification accuracy for five of the note estimators versus the number of levels of the duration quantizer.

Fig. 6 gives the classification accuracy for five note estimators versus the number of duration quantization levels. A uniform pitch quantization with a fixed number of levels per octave is applied for all methods. Again, all curves are monotonic nondecreasing. The various segmenters show trends similar to those in Fig. 5; e.g., the baseline without note-thinning yields the worst performance, whereas the HMM and quantizer segmenters yield the best performance. Note that the independent variable, $L$, is the total number of levels in the quantization codebook.
Because the estimated durations are normalized by the total query duration, only a fraction of the levels are used. Thus, a sufficiently small setting of $L$ is equivalent to mapping all durations to the same level and thus discarding duration information.

VI. DISCUSSION

As can be seen from Fig. 4, the alternative segmenters yield better performance than the baseline segmenter. The NLMS segmenter gives the best segmentation performance; however, this may reflect the fact that the feature vector was heuristically designed to optimize segmentation performance on these queries. The Kalman and ML estimators provide the worst performance of the six alternative estimators. The Kalman filter was trained to track one common contour fluctuation, vibrato; other fluctuations yield large prediction errors. The Kalman filter also exhibits relatively slow convergence, on the order of 300 ms, as can be seen in Fig. 2. Hence, the Kalman filter cannot detect note boundaries near the beginning of pitch contours. Furthermore, many query contours demonstrate considerable scooping, in which the singer momentarily drops to a lower pitch during a transition between two notes. This gradual transition loosely resembles a single oscillation of vibrato. The Kalman filter often accurately tracks this transition, and hence many of these segments are not detected.

As noted in Section III-E, the ML segmenter requires an estimate of the number of notes to segment. When the true number of notes is known, the performance of the ML segmenter is essentially perfect. Generalizing this method to automatically estimate the number of notes renders it computationally unmanageable. Accordingly, the number of notes to search for is assumed to be proportional to the duration of the contour, and note-thinning is used to remove clusters of spurious notes. This heuristic rule for note number estimation is responsible for the method's mediocre performance.

Two asymptotes in Fig. 4 are worth discussing. First, all of the ROC curves converge to a detection rate of 65% with no false-alarms. This is a result of the pitch tracker, which performs a voiced/unvoiced detection. Short pauses taken by the singer between notes result in a break in the pitch track, providing a partial segmentation with essentially no inserted notes. Recall that most existing QBH systems require the user to perform note segmentation implicitly by singing each note as a separate "da" or "ta" [1], [2], [4], [13]; such systems employ an amplitude threshold to detect the start of new notes. Applying an amplitude threshold to our database of naturally sung queries yields a detection rate of 65%. Second, none of the note estimators achieve a detection rate greater than 96%. This is due to the thinning procedure, which discards any note shorter than 150 ms. Removing the note thinning allows the segmenters to reach a 100% detection rate, but only at a very high false-alarm rate.

As for any detection problem, there is a natural tradeoff between note insertion and deletion errors. For measuring the classification accuracy of the complete system, we tuned all segmenters to have equal note insertion and deletion rates. In practice, however, deletion errors are often more troublesome than insertion errors [3], [20], [21], implying that the note segmenters should be tuned to have a high insertion rate. This is explored in [15], where a simple note segmenter with a high false-alarm rate is used in conjunction with an alignment cost scheme specifically designed to anticipate many note insertions.

From the trends shown in Fig. 5, it is evident that coarse pitch quantization does not improve classification performance for any of the estimators. It is interesting to note that the curves plateau at fewer than 12 levels per octave. This suggests that only six or eight pitch levels per octave are required for accurate classification, thereby reinforcing some of the results presented in [14], [43]. When using such a quantizer, however, the pitch quantization levels no longer represent musical notes in the conventional 12-tone scale. This necessarily complicates construction of the target database, as direct inclusion of standard MIDI files is no longer possible; the targets must be recoded into a coarser codebook. While moderately coarse quantization may not degrade retrieval performance, we find no evidence to imply that it will improve performance.

Comparing the estimators in Fig. 5, the alternative estimators yield better classification accuracy than the baseline estimators. The HMM and quantizer estimators yield clearly superior performance for larger values of $Q$.
That the HMM does not perform well for a low number of pitch quantization levels may be explained by the transition probabilities: as the number of states decreases, the probability of transiting to other states increases, making spurious state transitions more likely.

Fig. 6 shows no benefit to quantizing the note durations. By comparing Figs. 5 and 6, though, it is evident that discarding pitch resolution is more detrimental to classification accuracy than discarding duration resolution. Discarding pitch information decreases classification accuracy from above 90% to less than 50%, whereas discarding duration information only decreases classification accuracy to 80%. Similar results have been found in [3], [14].

The HMM and quantizer note estimators give consistently superior classification performance, in spite of being outperformed with respect to segmentation. This illustrates the importance of coupling system components when evaluating performance. In Fig. 4, the HMM and quantizer segmenters demonstrated no improvement over the RLS and NLMS estimators with respect to segmentation accuracy. Still, when the note estimators are considered in connection with the rest of the query-by-humming system, the HMM and quantizer note estimators clearly give the best classification performance. These two estimators are unique in that they perform the pitch quantization before segmentation; that is, they incorporate an a priori assumption about the distribution of note pitches sung. It is striking that arguably the simplest estimator considered here, the quantizer, is able to achieve the best performance; all that is required is the application of appropriate prior constraints. This is perhaps to be expected given the natural pitch quantization in most Western music. The melodic themes considered in this work exist in a space of equally-tempered pitches. That the HMM and quantizer note estimators judiciously incorporate this prior into note segmentation is not reflected by an ROC metric. The benefit of incorporating this assumption only becomes apparent in the complete QBH system, where using a note segmenter that explicitly places the sung query in a space of equally-tempered pitches is advantageous: the HMM and quantizer note estimators are relatively better at moving the query representation toward the correct theme in the target space. This does not necessarily imply that the HMM and quantizer note estimators are the best choice for general sung melody transcription, however. In some sense, these estimators suppress imperfections of the sung performance in the final melodic representation; for many applications, such note estimators may therefore be inappropriate.

From Figs. 4 and 5, it is evident that the NLMS segmenter yields better segmentation performance than the RLS segmenter, but equivalent classification accuracy. The relative drop in the performance of the NLMS segmenter is a natural result of the NLMS training procedure. Only the NLMS segmenter is explicitly trained to optimize segmentation performance relative to a set of manually segmented sample queries. The ROC then gives a measure of how close the NLMS segmentation is to a test set of manually segmented queries. When both the training and test sets were manually segmented by the author, amplitude, timbre, and phonetic information were implicitly considered in the segmentation. This information is not considered by the automatic segmenters, however.
As such, during NLMS training the combination weights, $\mathbf{h}$, are modified by the perceptron training rule to account for notes that do not necessarily have a strong footprint in the pitch contour. Furthermore, the manual segmentations include singer error.

Hence, the NLMS segmenter may give a better indication of what the user sang, whereas the HMM and quantizer estimators yield a more accurate transcription of what the user intended to sing. Indeed, so-called pitch-correction is explicitly incorporated into the HMM and quantizer note estimators, and a similar correction implicitly occurs for note detection.

The best classification accuracy observed is about 92%. We note that classification accuracy close to 100% could be achieved by removing the sample queries of three subjects from our test database. These queries were very inaccurate and some were virtually monotone, with only the lyrics as their recognizable feature. It is unclear that any QBH system should be designed to accommodate such queries. Specific modeling of singer and transcription error is becoming a more active area of QBH research [15], [41].

Our transcription methods are based exclusively on the estimated pitch contour. Employing other features such as broad spectral or phonetic information may significantly improve performance. Phonetic segmentation is a relatively mature area of research in the speech processing community [44], [45], although phonetic segmentation of sung melodies has not been explored in depth [46]. Substantial gains may also be possible by moving toward other query representations; this idea is becoming a more active area of QBH research [7], [16].

VII. CONCLUSION

This work has explored the front-end audio-processing component of a query-by-humming system, with emphasis on note segmentation and quantization. Experimental results for seven note estimators were presented. A smoothed pitch derivative was used as a baseline. A Kalman filter, an RLS filter, an NLMS filter, an ML estimator, an HMM, and a quantizer were also considered for segmentation. Of the seven estimators, the NLMS filter appeared to give the best segmentation performance, if only by a small margin. The HMM and quantizer estimators yielded segmentation performance roughly equal to that of the RLS filter. However, when the estimators were inserted into the complete query-by-humming system, the HMM and quantizer estimators yielded consistently superior classification.

We also examined the use of coarse representations in query-by-humming systems. By equating coarse representations with quantization, we were able to test whether quantization improved performance. For both pitch and duration, we found that coarse quantization yields no improvement in classification accuracy. This implies that, at least for the query-by-humming system implemented here, coarse melodic representations do not improve retrieval performance. However, we found that note segmenters that incorporate pitch quantization into the segmentation yield superior classification.

ACKNOWLEDGMENT

The authors would like to thank members of the MusEn Project, Bryan Pardo and Colin Meek, for their comments and contributions to this work.

REFERENCES

[1] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, "Query by humming: Musical information retrieval in an audio database," in Proc. ACM Multimedia, 1995.
[2] R. J. McNab, L. A. Smith, I. H. Witten, et al., "Toward the digital music library: Tune retrieval from acoustic input," in Proc. ACM Digital Libraries Conf., Bethesda, MD, 1996.
[3] R. B. Dannenberg, W. P. Birmingham, G. Tzanetakis, et al., "The MUSART testbed for query-by-humming evaluation," in Proc. ISMIR, Baltimore, MD, 2003.
[4] N. Kosugi, Y. Nishihara, T. Sakurai, et al., "A practical query-by-humming system for a large music database," in Proc. ACM Multimedia, Los Angeles, CA, 2000.
[5] S. Pauws, "CubyHum: A fully operational query by humming system," in Proc. ISMIR, Paris, France, 2002.
[6] L. P. Clarisse, J. P. Martens, M. Lesaffre, et al., "An auditory model based transcriber of singing sequences," in Proc. ISMIR, Paris, France, Oct. 2002.
[7] D. Mazzoni and R. B. Dannenberg, "Melody matching directly from audio," in Proc. ISMIR, Bloomington, IN, 2001.
[8] J. S. Downie, "Toward the scientific evaluation of music information retrieval systems," in Proc. ISMIR, Baltimore, MD, 2003.
[9] J. M. Batke, G. Eisenberg, P. Weishaupt, and T. Sikora, "A query by humming system using MPEG-7 descriptors," in Proc. AES 116th Convention, Berlin, Germany, May 2004.
[10] Y. Zhu and D. Shasha, "Warping indexes with envelope transforms for query by humming," in Proc. Int. Conf. Management of Data (SIGMOD), San Diego, CA, 2003.
[11] Melodyhound. [Online]
[12] Musicline. [Online]
[13] B. Liu, Y. Wu, and Y. Li, "A linear hidden Markov model for music information retrieval based on humming," in Proc. ICASSP, 2003.
[14] Y. E. Kim, W. Chai, R. Garcia, and B. Vercoe, "Analysis of a contour-based representation for melody," in Proc. ISMIR, Oct. 2000.
[15] A. Pikrakis, S. Theodoridis, and D. Kamarotos, "Recognition of isolated musical patterns using context dependent dynamic time warping," IEEE Trans. Speech Audio Process., vol. 11, no. 3, May 2003.
[16] N. H. Adams, M. A. Bartsch, J. Shiffrin, and G. H. Wakefield, "Time series alignment for music information retrieval," in Proc. ISMIR, Barcelona, Spain, 2004.
[17] T. Heinz and A. Brückmann, "Using a physiological ear model for automatic melody transcription and sound source recognition," in Proc. AES 114th Conv., Amsterdam, The Netherlands, Mar. 2003.
[18] J. Song, S. Bae, and K. Yoon, "Mid-level music melody representation of polyphonic audio for query-by-humming system," in Proc. ISMIR, Paris, France, 2002.
[19] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Upper Saddle River, NJ: Prentice-Hall, 1998.
[20] C. Meek and W. Birmingham, "Johnny can't sing: A comprehensive error model for sung music queries," in Proc. ISMIR, Paris, France, Oct. 2002.
[21] R. J. McNab and L. A. Smith, "Evaluation of a melody transcription system," in Proc. IEEE Int. Conf. Multimedia and Expo, vol. 2, 2000.
[22] D. Parsons, The Directory of Tunes. Cambridge, U.K.: Spencer Brown, 1975.
[23] J. Martínez, Ed., ISO/IEC MPEG-7 Overview, Mar. 2003. [Online]
[24] S. Quackenbush and A. Lindsay, "Overview of MPEG-7 audio," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, Jun. 2001.
[25] N. H. Adams, M. A. Bartsch, and G. H. Wakefield, "Coding of sung queries for music information retrieval," in Proc. IEEE WASPAA, New Paltz, NY, Oct. 2003.
[26] P. Boersma, "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," in Proc. Inst. Phonetic Sciences of the University of Amsterdam, vol. 17, 1993.
[27] M. A. Bartsch, Automatic assessment of the spasmodic voice, 2002. [Online]
[28] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[29] L. R. Bahl and F. Jelinek, "Decoding for channels with insertions, deletions, and substitutions with applications to speech recognition," IEEE Trans. Inf. Theory, vol. IT-21, 1975.
[30] R. J. McNab, L. A. Smith, D. Bainbridge, and I. H. Witten, "The New Zealand digital library MELody inDEX," D-Lib Mag., May 1997.

ACKNOWLEDGMENT

The authors would like to thank members of the MusEn Project, Bryan Pardo and Colin Meek, for their comments and contributions to this work.

REFERENCES

[1] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, "Query by humming: Musical information retrieval in an audio database," in Proc. ACM Multimedia, 1995.
[2] R. J. McNab, L. A. Smith, I. H. Witten, et al., "Toward the digital music library: Tune retrieval from acoustic input," in Proc. ACM Digital Libraries Conf., Bethesda, MD.
[3] R. B. Dannenberg, W. P. Birmingham, G. Tzanetakis, et al., "The MUSART testbed for query-by-humming evaluation," in Proc. ISMIR, Baltimore, MD, 2003.
[4] N. Kosugi, Y. Nishihara, T. Sakurai, et al., "A practical query-by-humming system for a large music database," in Proc. ACM Multimedia, Los Angeles, CA, 2000.
[5] S. Pauws, "CubyHum: A fully operational query by humming system," in Proc. ISMIR, Paris, France, 2002.
[6] L. P. Clarisse, J. P. Martens, M. Lesaffre, et al., "An auditory model based transcriber of singing sequences," in Proc. ISMIR, Paris, France, Oct. 2002.
[7] D. Mazzoni and R. B. Dannenberg, "Melody matching directly from audio," in Proc. ISMIR, Bloomington, IN, 2001.
[8] J. S. Downie, "Toward the scientific evaluation of music information retrieval systems," in Proc. ISMIR, Baltimore, MD, 2003.
[9] J. M. Batke, G. Eisenberg, P. Weishaupt, and T. Sikora, "A query by humming system using MPEG-7 descriptors," in Proc. AES 116th Convention, Berlin, Germany, May 2004.
[10] Y. Zhu and D. Shasha, "Warping indexes with envelope transforms for query by humming," in Proc. Int. Conf. Management of Data (SIGMOD), San Diego, CA, 2003.
[11] Melodyhound. [Online].
[12] Musicline. [Online].
[13] B. Liu, Y. Wu, and Y. Li, "A linear hidden Markov model for music information retrieval based on humming," in Proc. ICASSP, 2003.
[14] Y. E. Kim, W. Chai, R. Garcia, and B. Vercoe, "Analysis of a contour-based representation for melody," in Proc. ISMIR, Oct. 2000.
[15] A. Pikrakis, S. Theodoridis, and D. Kamarotos, "Recognition of isolated musical patterns using context dependent dynamic time warping," IEEE Trans. Speech Audio Process., vol. 11, no. 3, May 2003.
[16] N. H. Adams, M. A. Bartsch, J. Shiffrin, and G. H. Wakefield, "Time series alignment for music information retrieval," in Proc. ISMIR, Barcelona, Spain, 2004.
[17] T. Heinz and A. Brückmann, "Using a physiological ear model for automatic melody transcription and sound source recognition," in Proc. AES 114th Conv., Amsterdam, The Netherlands, Mar. 2003.
[18] J. Song, S. Bae, and K. Yoon, "Mid-level music melody representation of polyphonic audio for query-by-humming system," in Proc. ISMIR, Paris, France, 2002.
[19] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Upper Saddle River, NJ: Prentice-Hall, 1998.
[20] C. Meek and W. Birmingham, "Johnny can't sing: A comprehensive error model for sung music queries," in Proc. ISMIR, Paris, France, Oct. 2002.
[21] R. J. McNab and L. A. Smith, "Evaluation of a melody transcription system," in Proc. IEEE Int. Conf. Multimedia and Expo, vol. 2, 2000.
[22] D. Parsons, The Directory of Tunes. Cambridge, U.K.: Spencer Brown, 1975.
[23] J. Martínez, Ed., ISO/IEC MPEG-7 Overview, Mar. 2003. [Online].
[24] S. Quackenbush and A. Lindsay, "Overview of MPEG-7 audio," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, Jun. 2001.
[25] N. H. Adams, M. A. Bartsch, and G. H. Wakefield, "Coding of sung queries for music information retrieval," in Proc. IEEE WASPAA, New Paltz, NY, Oct. 2003.
[26] P. Boersma, "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," in Proc. Inst. Phonetic Sciences, Univ. Amsterdam, vol. 17, 1993.
[27] M. A. Bartsch, Automatic Assessment of the Spasmodic Voice, 2002. [Online].
[28] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[29] L. R. Bahl and F. Jelinek, "Decoding for channels with insertions, deletions, and substitutions with applications to speech recognition," IEEE Trans. Inf. Theory, vol. IT-21, 1975.
[30] R. J. McNab, L. A. Smith, D. Bainbridge, and I. H. Witten, "The New Zealand digital library MELody index," D-Lib Mag., May 1997.
[31] P. Stoica and R. Moses, Introduction to Spectral Analysis. Upper Saddle River, NJ: Prentice-Hall, 1997.
[32] A. Sterian, "Model-based segmentation of time-frequency images for musical transcription," Ph.D. dissertation, Dept. Elect. Eng. Comput. Sci., Univ. Michigan.
[33] N. H. Adams, Automatic Segmentation of Sung Melodies, 2002. [Online].
[34] J. R. Treichler, C. R. Johnson, and M. G. Larimore, Theory and Design of Adaptive Filters. Upper Saddle River, NJ: Prentice-Hall.
[35] J. Chung, E. J. Powers, W. M. Grady, and S. C. Bhatt, "Adaptive power-line disturbance detection scheme using a prediction error filter and a stop-and-go CA CFAR detector," in Proc. ICASSP, 1999.
[36] T. Zhang and C.-C. J. Kuo, "Audio content analysis for online audiovisual data segmentation and classification," IEEE Trans. Speech Audio Process., vol. 9, no. 4, May 2001.
[37] T. Pavlidis, "Waveform segmentation through functional approximation," IEEE Trans. Comput., vol. C-22, no. 7, Jul. 1973.
[38] L. R. Rabiner and B. H. Juang, "An introduction to hidden Markov models," IEEE ASSP Mag., pp. 4-16, Jan. 1986.
[39] R. M. Gray and D. L. Neuhoff, "Quantization," IEEE Trans. Inf. Theory, vol. 44, no. 6, 1998.
[40] M. Melucci and N. Orio, "Evaluating automatic melody segmentation aimed at music information retrieval," in Proc. JCDL, Portland, OR, Jul. 2002.
[41] C. Meek and W. P. Birmingham, "Automatic thematic extractor," J. Intell. Inf. Syst., vol. 21, no. 1, pp. 9-34, Jul. 2003.
[42] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., Jan. 1980.
[43] A. L. Uitdenbogerd and Y. W. Yap, "Was Parsons right? An experiment in usability of music representations for melody-based music retrieval," in Proc. ISMIR, Baltimore, MD, Oct. 2003.
[44] R. André-Obrecht, "A statistical approach for the automatic segmentation of continuous speech signals," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 1, Jan. 1988.
[45] D. T. Toledano, L. A. H. Gómez, and L. V. Grande, "Automatic phonetic segmentation," IEEE Trans. Speech Audio Process., vol. 11, no. 6, Nov. 2003.
[46] M. Mellody, M. A. Bartsch, and G. H. Wakefield, "Analysis of vowels in sung queries for a music information retrieval system," J. Intell. Inf. Syst., vol. 21, no. 1, Jul. 2003.

Norman H. Adams (S'96) received the B.S. and M.S. degrees in electrical engineering (with highest distinction) from the University of Virginia, Charlottesville, in 2000 and 2001, respectively. His research while at Virginia focused on modulation and coding for nonlinear fiber-optic communications. He is currently a Ph.D. candidate in the Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor. His research interests include music information retrieval, time-frequency representations, binaural sonification, and statistical signal processing for acoustic applications.

Mark A. Bartsch (M'04) received the B.S. degree in electrical engineering (summa cum laude) from the University of Dayton, Dayton, OH, in 2000, and the M.S.E. degree in electrical engineering from the University of Michigan, Ann Arbor, in 2002, where he is currently pursuing the Ph.D. degree. His research interests include analysis and resynthesis of the singing voice, musical information retrieval, and machine learning for signal processing applications.

Gregory H. Wakefield (M'85) received the B.A. degree (summa cum laude) in mathematics and psychology, the M.S. and Ph.D.
degrees in electrical engineering, and the Ph.D. degree in psychology, all from the University of Minnesota, Minneapolis, in 1978, 1982, 1985, and 1988, respectively. In 1986, he joined the faculty of the Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, where he is currently an Associate Professor. His research interests are drawn from time-frequency representations, music signal processing, auditory systems modeling, psychoacoustics, sensory prosthetics, and sound quality engineering. He serves as a consultant to various industries in sound quality engineering and time-frequency representations. Dr. Wakefield received the NSF Presidential Young Investigator Award in 1987 and the IEEE Millennium Award in 2000.
