An Examination of Foote's Self-Similarity Method


WINTER 2001 MUS 220D, Units: 4
Unjung Nam

This study is based on my dissertation proposal. Its purpose is to improve my understanding of the feature extractors used in the field of content-based music retrieval/classification. I am particularly interested in Jonathan Foote's self-similarity method. I have summarized various articles related to feature extractors used in audio information retrieval/classification, and I have included two experiments with Foote's method on real musical pieces. Even though the analyses of these experiments are still in progress, this study has helped me understand the structure of Foote's method. It appears that each variable in the system, such as frame rate and kernel size, should be optimized for each case by conducting thorough empirical experiments on different kinds of musical signals. Moreover, processing a full piece of music (usually more than one minute long) with this system requires a great deal of computing time. Hence, one of the final aims of my study is to suggest ways of overcoming some of the limitations of this system.

Table of Contents
Motivation
1. Features Used in Content-Based Music Retrieval Methods
  1.1 Spectral Features
  1.2 Temporal Features
  1.3 Other Musical Features
  1.4 Overall Structural Features
2. Foote's Self-Similarity Method
  2.1 Parameterization
  2.2 Distance Matrix Embedding
  2.3 Kernel Correlation
3. Experiments with Real Music
  3.1 Bach's Well-Tempered Clavier I, Prelude No. 1 in C major
  3.2 Bach's Air on a G String
  3.3 Discussion
Appendix
Bibliography

Motivation

Last quarter I worked on an automatic musical style classification system that classifies three genres of music (classical, pop/rock, and jazz) based on three acoustic features: spectral centroid, short-time energy, and short-time zero-crossing rate. Though the system worked well in clustering a small number of digital music files (20 files) with a K-means clustering algorithm and a K-nearest-neighbor classifier, it failed to choose the model space most similar to the input signal. At first, I concluded that the failure was due to a classification model space built on only a small number of music files and an inappropriate selection of features and clustering method. However, as I continued my research, increasing the number of features and music files in the classification model space, or using a more sophisticated clustering method, did not necessarily improve the system. An expert whom I met at the poster demo session for this system pointed out that the more features I use, the more complicated the system becomes, and consequently the more difficult it is for the system to work well.

I now think that this kind of method can serve only as a coarse level of the system, since it succeeds only in clustering different musical genres in a rough way. It is necessary to search for methods for a lower (more detailed) level of classification, for example classifying music by tempo or by lead vocal. In this case the time-varying character of music must be taken into account. Foote's method seems promising for this purpose, because his approach uses the signal to model itself, and thus neither relies on particular acoustic cues nor requires training. Also, since it can find individual note boundaries or even natural segment boundaries such as verse/chorus or speech/music transitions, it is much more effective for comparing similar musical pieces in terms of their musical information. His method is also attractive because, regardless of the parameterization used, the result is a compact vector of parameters for every frame, and the actual parameterization is not crucial as long as similar sounds yield similar parameters. It is no doubt an advantage that different parameterizations may be used for different applications.

1. Features Used in Content-Based Music Retrieval Methods

Sounds have traditionally been described by their pitch, loudness, and timbre, and much of the previous work on general audio content retrieval has tried to extract these features. It is apparent, though, that music needs higher-level information, such as musical characteristics, beyond the features listed above. I think that in some cases features used in general audio content retrieval may not be efficient for music retrieval/classification. Below are the features used in audio information retrieval/classification systems. I will experiment with these further and try to find useful feature extractors for future music classification systems.

1.1 Spectral Features

Timbre-related features are often used in audio information classification/retrieval. In Scheirer and Slaney's (1997) speech/music discriminator, the spectral centroid and its variance are used, on the assumption that music involves percussive sounds which, by including high-frequency noise, push the spectral mean higher; in addition, excitation energies can be higher for music than for speech. They also calculated spectral flux, the 2-norm of the frame-to-frame spectral amplitude difference vector, assuming that music has a higher rate of change and goes through more drastic frame-to-frame changes than speech does. The rolloff of the spectrum is used to measure the "skewness" of the spectral shape, on the assumption that unvoiced speech has a high proportion of its energy in the high-frequency range of the spectrum, whereas most of the energy of voiced speech and music is contained in lower bands. Wold et al. (1996) classified audio content into ten groups (animal, bells, crowds, laughter, machine, instrument, male speech, female speech, telephone, and water) and analyzed Mel-filtered cepstral coefficients and the centroid of the STFT for the purpose of clustering different kinds of audio sources. At the music analysis level, they tried to identify the source instruments by training on all possible instrument cases using histogram modelling. Zhang and Kuo (1999a, 1999b) and Tzanetakis and Cook (1999) also used the features listed above in a similar way. Whether these methods will work well for clustering musical pieces in terms of their instrumentation should be examined carefully with more music from the database.

Some spectral features that I might add in my future research are as follows. The spectrogram might show the textural complexity of a sound mixture: it may be possible to analyze how many sources are playing in it, to determine whether a singing voice is involved and whether it is female or male or the piece is purely instrumental, and the high-frequency distribution may show whether there is a drum beat. The spectral centroid can show the frequency components in the signal, indicating what instrument combination might be used in it.
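To make these spectral features concrete, here is a minimal numpy sketch of the three Scheirer-and-Slaney-style measures described above; the function and parameter names (spectral_features, rolloff_pct) are my own, not from the cited papers.

import numpy as np

def spectral_features(frames, sr, rolloff_pct=0.85):
    """Per-frame spectral centroid, flux, and rolloff.

    frames: 2-D array (n_frames, frame_len) of audio samples.
    """
    win = np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(frames * win, axis=1))      # magnitude spectra
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)

    # Centroid: magnitude-weighted mean frequency of each frame.
    centroid = (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-12)

    # Flux: 2-norm of the frame-to-frame spectral amplitude difference vector.
    flux = np.linalg.norm(np.diff(mag, axis=0), axis=1)

    # Rolloff: frequency below which rolloff_pct of the magnitude lies.
    cum = np.cumsum(mag, axis=1)
    rolloff = freqs[np.argmax(cum >= rolloff_pct * cum[:, -1:], axis=1)]
    return centroid, flux, rolloff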

Mel-frequency cepstral coefficients (MFCCs) are a promising feature extractor that lets us see the spectral shape of the source signal. I will include a detailed explanation of, and experiment with, MFCCs in my next report.

1.2 Temporal Features

Tempo or a particular rhythmic pattern is a crucial component for measuring similarity between musical pieces. Tempo can be classified as fast, medium, or slow and can indicate certain styles of music; a particular rhythmic pattern can be used to identify a type of folk or dance music such as Latin dance, Salsa, or Tango. A substantial amount of work has been done on beat tracking of music. Most early beat-tracking systems dealt with MIDI signals (Allen and Dannenberg, 1990) and were not successful at processing, in real time, audio signals containing the sounds of various instruments. Scheirer (1998) uses correlated energy peaks across sub-bands to track beats in popular music. A more complete approach to beat tracking of acoustic signals was developed by Goto and Muraoka (1994, 1998). They developed a system called BTS, which uses frequency histograms to find significant peaks in the low-frequency regions, corresponding to the frequencies of the bass and snare drums, and then tracks these low-frequency signals by matching patterns of onset times against a set of pre-stored drum beat patterns. This method was successful in tracking the beat of most of the popular songs on which it was tested. Their later system allowed music without drums to be tracked by recognizing chord changes, on the assumption that significant harmonic changes occur at strong rhythmic positions. Dixon (2000) noted that these systems required a powerful parallel computer to run in real time, and he built a more modest one. In his system, Dixon detects salient note onsets in the time domain and determines possible inter-beat intervals with a clustering algorithm; he then employs multiple agents to find the sequence of events that represents the beat of the popular music.

1.3 Other Musical Features

Other music-related features are harmony and melody. It might be impossible to extract these features from all kinds of music; harmony and melody extractors will apply mainly to classical music and to music with a relatively simple texture. Fujishima (1999) developed a real-time system that recognizes chords from acoustic signals. His system derives a Pitch Class Profile (PCP) from the DFT spectrum and applies a pattern-matching algorithm to the PCP to determine the chord type and root (a minimal sketch of the PCP idea appears below). His system was tested on classical music pieces, and he mentioned that future work includes a better classifier, a tuning-table technique, bass-note tracking, and the use of musical context for symbol-level error correction. Another interesting work, by Purwins et al. (2000), was explored at CCRMA. They developed a system that uses "cq-profiles" to track tonal modulations in music. Cq-profiles are 12-dimensional vectors, each component referring to a pitch class; they are calculated with the constant-Q filter bank (Brown and Puckette, 1992), using the cq-profile technique as a simple auditory model. A Self-Organizing Map (SOM) was combined with this so that an arrangement of keys emerges that resembles results from psychological experiments and from music theory.
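Fujishima's actual PCP computation differs in its details, but the core idea of folding DFT energy into twelve pitch classes can be sketched as follows (the helper name and reference frequency are my own illustrative choices):

import numpy as np

def pitch_class_profile(frame, sr, f_ref=261.63):
    """Fold DFT magnitude energy into 12 pitch classes (simplified PCP).

    f_ref: reference frequency mapped to pitch class 0 (here C4).
    """
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    pcp = np.zeros(12)
    for f, m in zip(freqs[1:], mag[1:]):           # skip the DC bin
        pc = int(round(12 * np.log2(f / f_ref))) % 12
        pcp[pc] += m ** 2                           # accumulate energy
    return pcp / (pcp.sum() + 1e-12)                # normalize for matching

A chord template could then be matched against such a profile, for example by taking inner products with 12-dimensional chord-type patterns rotated to every possible root.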

There is also work that explores tracking melody and bass lines in sound mixtures. Goto (2000) built a system for estimating the fundamental frequency (F0) of the melody and bass lines in monaural real-world musical audio signals. He proposes a predominant-F0 estimation method called PreFEst, which finds the most predominant F0 supported by harmonics within an intentionally limited frequency range. It evaluates the relative dominance of every possible F0 using the Expectation-Maximization algorithm and considers the temporal continuity of F0s using a multiple-agent architecture. Several research efforts in automatic music transcription are also under way.

1.4 Overall Structural Features

It is possible to analyze whether changes in tempo or harmony occur in a piece of music; furthermore, whether they occur in a regular pattern or not might characterize a certain style of music. Scheirer (1999) developed a new technique based on the correlogram (Licklider, 1951), whose approach is to understand the musical signal without source separation. The algorithm was demonstrated to locate perceptual events in time and frequency. The model stands as a theoretical alternative to methods that use pitch as their primary grouping cue, and it operates within a strict probabilistic framework, which makes it convenient to incorporate into a larger signal-understanding testbed. Though his method locates many of the perceptual objects, it needs a pre-determined interpretation of its mapping.

Foote (1999a) did very interesting work on music self-similarity analysis. His method automatically locates points of significant change in music by analyzing local self-similarity. It can find individual note boundaries or even natural segment boundaries such as verse/chorus or speech/music transitions, even in the absence of cues such as silence. The approach uses the signal to model itself, and thus neither relies on particular acoustic cues nor requires training. His method has applications in the indexing, segmentation, summarization, and beat tracking of music. The details of Foote's method are summarized in the following section, based on his paper (Foote, 1999a).

2. Foote's Self-Similarity Method

The method produces a time series that is proportional to the acoustic novelty of the source audio at any instant; high values and peaks correspond to large audio changes. The novelty score can be thresholded to find these instants, which can then be used as segment boundaries. The system flow is as follows (Foote, 1999a):

Source Audio -> Parameterization -> Distance Matrix Embedding -> Kernel Correlation -> Novelty Score -> Thresholding -> Segment Boundaries

2.1 Parameterization

The first step is to parameterize the audio. This is typically done by windowing the audio waveform; variable window widths and overlaps can be used. Each frame is then parameterized using a standard analysis such as a Fourier transform or Mel-frequency cepstral coefficient analysis. Other parameterizations include those based on linear prediction, on psychoacoustic considerations (Slaney, 1998), or even on a combination of the two (Perceptual Linear Prediction). Foote notes that, regardless of the parameterization used, the result is a compact vector of parameters for every frame, and the actual parameterization is not crucial as long as similar sounds yield similar parameters. Different parameterizations may suit different applications; for example, he has shown that the MFCC representation, which preserves the coarse spectral shape while discarding the fine harmonic structure due to pitch, may be particularly appropriate for certain applications, since MFCCs will tend to match similar timbres rather than exact pitches. He claims that the system is therefore very flexible and can subsume almost any existing audio analysis method.
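As a minimal sketch of this step (the window length, hop size, and use of log-magnitude spectra are my own illustrative choices; MFCCs or any other compact representation could be substituted, as Foote emphasizes):

import numpy as np

def parameterize(x, sr, win_len=512, hop=256):
    """Window the waveform and return one compact feature vector per frame."""
    n_frames = 1 + (len(x) - win_len) // hop
    window = np.hamming(win_len)
    feats = np.empty((n_frames, win_len // 2 + 1))
    for i in range(n_frames):
        frame = x[i * hop : i * hop + win_len] * window
        feats[i] = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # log-magnitude spectrum
    return feats    # shape: (n_frames, n_features)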

2.2 Distance Matrix Embedding

Once the audio has been parameterized, it is embedded in a two-dimensional representation. [Figure: schematic of the embedding; the (dis)similarity D(i, j) between frames i and j of the waveform becomes the (i, j)th element of the similarity matrix S (after Foote, 1999a).]

A measure D of the (dis)similarity between feature vectors v_i and v_j is calculated between every pair of audio frames i and j. A simple distance measure is the Euclidean distance in the parameter space, that is, the square root of the sum of the squares of the differences of each vector parameter:

    D_E(i, j) = ||v_i - v_j||

Another useful metric of vector similarity is the scalar (dot) product of the vectors. To remove the dependence on magnitude (and hence on energy, given our features), the product can be normalized to give the cosine of the angle between the parameter vectors:

    D_C(i, j) = (v_i . v_j) / (||v_i|| ||v_j||)

This has the property that it yields a large similarity score even if the vectors are small in magnitude. Using the cosine measure means that windows with low energy, such as those containing silence, will be judged spectrally similar, which is generally desirable. Because windows, and hence feature vectors, occur at a rate much faster than typical musical events, a better similarity measure can be obtained by computing the vector correlation over a window of length w. This also captures the time dependence of the vectors: to yield a high similarity score, the vectors in a window must not only be similar, their sequence must be similar as well.

    D(i, j, w) = (1/w) * sum_{k=0}^{w-1} D(i + k, j + k)
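Both measures are straightforward to compute for all frame pairs at once; here is a sketch assuming the frame-by-feature matrix produced by the parameterize function above:

import numpy as np

def cosine_similarity_matrix(feats):
    """S[i, j] = cosine of the angle between feature vectors v_i and v_j."""
    v = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    return v @ v.T                       # all pairwise normalized dot products

def euclidean_distance_matrix(feats):
    """D_E[i, j] = ||v_i - v_j||; zero on the diagonal."""
    sq = (feats ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (feats @ feats.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negatives from rounding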

The distance measure is a function of two frames, hence of two instants in the source signal. It is convenient to consider the similarity between all possible instants in a signal; embedding the distance measure in a two-dimensional representation does this. The matrix S contains the similarity metric calculated for all frame combinations, hence all time indexes i and j, such that the (i, j)th element of S is D(i, j). In general S will have maximum values on the diagonal (because every window is maximally similar to itself); furthermore, if D is symmetric then S will be symmetric as well. To simplify computation, the similarity can also be represented in the slanted domain L(i, l), where l is the lag l = i - j.

S can be visualized as a square image in which each pixel (i, j) is given a gray-scale value proportional to the similarity measure D(i, j), scaled so that the maximum value is given the maximum brightness. Regions of high audio similarity, such as silence or long sustained notes, appear as bright squares on the diagonal. Repeated figures, such as themes, phrases, or choruses, are visible as bright off-diagonal rectangles. If the music has a high degree of repetition, this is visible as diagonal stripes or checkerboards, offset from the main diagonal by the repetition time.

Let us take an example. Consider a sequence of 3-dimensional feature vectors from the parameterization, repeating its pattern globally three times with a period of 12 frames (in the original figure, the repeated patterns were marked with red, blue, and purple brackets):

ftv (one 12-frame cycle; each column is one 3-dimensional feature vector, and the whole cycle repeats three times):

frame:   1     2     3     4     5     6     7     8     9    10    11    12
dim 1:  1.0   1.0   2.0   2.0   2.0   4.0   3.0   0.0  -2.0   1.0   3.0  -0.6
dim 2:  3.0   4.0   1.0   1.0   1.0   4.0   3.0  -4.0   1.0   1.0   2.0   4.0
dim 3:  2.0   1.0   1.0   1.0   1.0   5.0   4.0   2.0   1.5   1.0   4.0   2.0

Here is the matrix S calculated with the Euclidean distance. [Figure: Euclidean distance matrix of the example; both axes are time (s).] Note that the diagonal is black, because the Euclidean distance of a frame to itself is 0. Also, since the Euclidean distance takes values from 0 to infinity, it may not be the most useful tool for plotting a similarity matrix.

Here is the matrix S calculated with the cosine similarity measure. [Figure: cosine similarity matrix of the example; both axes are time (s).]

As the red lines drawn over each pattern make clear, the feature vectors repeat three times with a period of 12 frames. The repeated pattern (marked with a bracket in the feature vectors; the 3rd, 4th, and 5th frames, in the blue box) shows a relatively bright color. Frames 3 through 7 do not change much, so they behave like a sustained tone or repeated pattern; thus the portion under the purple line shows a bright color.

Consider a segment from the cosine matrix S. White squares on the diagonal correspond to the notes, which have high self-similarity; black squares off the diagonal correspond to regions of low cross-similarity. In the cosine similarity matrix S, similar regions have values close to 1 while dissimilar regions have values closer to -1. [Figure: detail of the cosine matrix with two pairs of regions outlined in sky-blue and red.] In the figure, the sky-blue square holds the cross-similarity values between the two sky-blue rectangles, and the same applies to the red square. The sky-blue square is brighter than the red square, which means that the red rectangles are less similar to each other than the sky-blue rectangles are.

Here is the windowed matrix S, calculated with the cosine similarity measure and a window size of 4. [Figure: windowed cosine similarity matrix.]
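The lag-windowed average D(i, j, w) from earlier in this section can be sketched directly from a precomputed matrix S; with w = 4 this reproduces the kind of smoothing shown in the figure above.

import numpy as np

def windowed_similarity(S, w=4):
    """D(i, j, w) = (1/w) * sum_{k=0}^{w-1} S(i+k, j+k).

    A high value now requires similar vectors occurring in a similar
    sequence, not just similar individual frames."""
    n = S.shape[0] - w + 1
    Sw = np.zeros((n, n))
    for k in range(w):
        Sw += S[k : k + n, k : k + n]   # shift both indices by k
    return Sw / w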

You can see that the matrix S is smoothed: the windowed matrix smooths out the rapid changes that appear in the original S. This can be useful for tracking individual notes or other musical events, because the window rate (the rate of the feature vectors) is usually higher than the rate of typical musical events. The window rate cannot be set equal to the rate of musical events, since it must be higher in order to track the detailed feature vectors.

2.3 Kernel Correlation

The structure of S is the key to the novelty measure. Finding the instant when the notes change can be done by correlating S with a kernel that itself looks like a checkerboard. Perhaps the simplest checkerboard kernel is the 2x2 unit kernel:

    C = [ 1  -1 ]
        [-1   1 ]

The unit checkerboard kernel can be decomposed into "coherence" and "anti-coherence" kernels as follows:

    [ 1  -1 ]   [ 1   0 ]   [ 0   1 ]
    [-1   1 ] = [ 0   1 ] - [ 1   0 ]

The first term measures the self-similarity on either side of the center point; this will be high when both regions are self-similar. The second term measures the cross-similarity between the two regions; this will be high when the regions are substantially similar, with little difference across the center point. The difference of the two values estimates the novelty of the signal at the center point; it will be high when the two regions are each self-similar but different from each other.

Correlating a checkerboard kernel C with the similarity matrix S results in a measure of novelty. To see how this works, imagine sliding C along the diagonal of our example and summing the element-by-element product of C and S. When C is over a relatively uniform region, such as a sustained note, the positive and negative regions tend to sum to zero. Conversely, when C is positioned exactly at the crux of the checkerboard, the positive kernel regions multiply regions of high self-similarity, the negative kernel regions multiply regions of low cross-similarity, and the overall sum is large. Calculating this correlation along the diagonal of S thus gives a time-aligned measure of audio novelty N(i), where i is the frame number, and hence the time index, corresponding to the original source audio:

    N(i) = sum_{m=-W/2}^{W/2} sum_{n=-W/2}^{W/2} C(m, n) * S(i + m, i + n)
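A sketch of the kernel and of the diagonal correlation, computed naively over the interior of S where the kernel fits completely (as the next paragraph discusses):

import numpy as np

def checkerboard_kernel(width):
    """Unit checkerboard kernel (width must be even): +1 in the two
    self-similarity quadrants, -1 in the two cross-similarity quadrants."""
    half = width // 2
    block = np.ones((half, half))
    return np.block([[block, -block], [-block, block]])

def novelty_score(S, kernel):
    """N(i) = sum over m, n of C(m, n) * S(i+m, i+n), kernel centered on (i, i)."""
    w = kernel.shape[0]
    n = S.shape[0] - w + 1
    return np.array([(kernel * S[i : i + w, i : i + w]).sum() for i in range(n)])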

By convention, the kernel C has a width (lag) of W and is centered at (0, 0). For computation, S can be zero-padded to avoid undefined values, or, as in the present examples, the correlation can be computed only for the interior of the signal where the kernel overlaps S completely. Thus only regions of S with a lag of W or smaller are used, and the slant representation is particularly helpful here. Also, typically both S and C are symmetric, so only one half of the values under the double summation (those for m >= n) need to be computed.

The width W of the kernel directly affects the properties of the novelty measure. A small kernel detects novelty on a short time scale, such as beats or notes. Increasing the kernel size decreases the time resolution but increases the length of the novel events that can be detected: larger kernels average over short-time novelty and detect longer structure, such as musical transitions between verse and chorus, key modulations, or symphonic movements. The method has no a priori knowledge of musical phrases or pitch, yet it finds perceptually and musically significant points. For example, the next figure shows novelty scores produced from the cosine matrix S with kernel sizes 2 and 4, respectively. [Figure: novelty scores for kernel sizes 2 and 4; the X-axis is time (s) and the Y-axis is the novelty score.]

Kernels can be smoothed to avoid edge effects using windows (such as a Hamming window) that taper toward zero at the edges. For the experiments presented here, a radially symmetric Gaussian taper is used.
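A radially symmetric Gaussian taper can be applied to the checkerboard like this (the width-relative sigma is my own choice; the text summarized here does not prescribe a specific value):

import numpy as np

def gaussian_checkerboard(width, sigma_frac=0.4):
    """Checkerboard kernel tapered by a radially symmetric Gaussian so that
    its values fall toward zero at the edges, reducing edge effects."""
    half = width // 2
    coords = np.arange(width) - half + 0.5          # center lies at (0, 0)
    mm, nn = np.meshgrid(coords, coords)
    taper = np.exp(-(mm ** 2 + nn ** 2) / (2.0 * (sigma_frac * half) ** 2))
    checker = np.sign(mm) * np.sign(nn)             # +1/-1 quadrant pattern
    return checker * taper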

The figure below shows novelty scores of the previous example with a radial Gaussian taper, for kernel sizes 2 and 4 respectively. [Figure: Gaussian-tapered novelty scores, kernel sizes 2 and 4.]

3. Experiments with Real Music

The sound files from which I took the excerpts are listed below. They are sampled at 11025 Hz, 16 bits, mono. I used MFCCs (code from Slaney's Auditory Toolbox) for the parameterization. Please see my webpage at http://wwwccrma.stanford.edu/~unjung/air to listen to these sound samples.

3.1 Bach's Well-Tempered Clavier I, Prelude No. 1 in C major

Since Foote ran one of his experiments on this piece, I tried my code on it as well. I tested two versions of the piece, each about 12 seconds long: 1) an acoustic realization from MIDI data, and 2) a chorus singing arrangement. [Figure: score of the piece.]
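For reference, the whole pipeline for one excerpt might look as follows. This is a sketch only: it uses librosa as a modern stand-in for the Auditory Toolbox MFCC code, a hypothetical filename, and the helper functions sketched in Section 2 above.

import numpy as np
import librosa

# Load an excerpt at the report's sample rate (filename is hypothetical).
y, sr = librosa.load("prelude_midi.wav", sr=11025, mono=True)

# Parameterize: 13 MFCCs per frame, transposed to (n_frames, n_coeffs).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

# Embed, correlate with a Gaussian-tapered kernel (size 32, as in 3.1),
# and threshold the novelty score to propose segment boundaries.
S = cosine_similarity_matrix(mfcc)
N = novelty_score(S, gaussian_checkerboard(32))
boundaries = np.flatnonzero(N > N.mean() + 2.0 * N.std())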

The following figures are the cosine similarity matrices of each version.

1) Acoustic realization from MIDI data. [Figure: cosine similarity matrix; both axes are time (s).]

2) Chorus singing arrangement. [Figure: cosine similarity matrix.]

The figures below show novelty scores computed from the similarity matrices above, with a radial Gaussian taper.

1) Acoustic realization from MIDI data: size(S) = 548.

With kernel size 32, each peak clearly shows an individual note in the music. [Figure: novelty score, kernel size 32; the X-axis is time (s) and the Y-axis is the novelty score.]

With kernel size 400, the figure shows the novelty score for only about 2 seconds. [Figure: novelty score, kernel size 400.] This is because the kernel size of 400 is about 80% of the size of the similarity matrix: a 400x400 checkerboard kernel can slide along the diagonal for only about 2 seconds' worth of frames.

In the figure below, the large checkerboard kernel is shown sliding along the diagonal of the similarity matrix; it passes over many positions, though only three are marked in the graph. Because the kernel is so large, the novelty score can be produced for only a short duration, as we saw in the previous figure. The blue line marks the duration for which the novelty score is produced. [Figure: similarity matrix with the large kernel positions and the valid novelty region marked.]

2) Chorus singing arrangement: size(S) = 1728.

With kernel size 64: [Figure: novelty score, kernel size 64.]

With kernel size 1000, the figure shows the novelty score for only about 2 seconds of S, because the kernel size of 1000 is about 60% of the size of the similarity matrix. [Figure: novelty score, kernel size 1000, with salient peaks circled in red.] It seems that the red-circled peaks correspond to the perceptually salient pitch progression e-f-f-e, which is repeated twice across its eight occurrences.

3.2 Bach's Air on a G String

Here I test the similarity matrix on the same music in different versions. If different versions return similar matrices, this method could be used for clustering identical musical pieces played with different instrument arrangements; in other words, the method would work independently of timbre. I experimented with four versions of Bach's Air on a G String, each about 30 seconds long and sampled at 11025 Hz, 16 bits, mono. The four versions are: 1) an acoustic realization from MIDI data, 2) a quartet, 3) a wind orchestra, and 4) a jazz arrangement. [Figure: score of the piece.]

The following figures are the cosine similarity matrices of the four versions. The tempo of each version is slightly different, so the durations differ slightly as well. Both axes are time (s).

1) Acoustic realization from MIDI data (frame rate = 50). [Figure: cosine similarity matrix.]

2) Quartet (frame rate = 50). [Figure: cosine similarity matrix.]

3) Wind orchestra (frame rate = 25). [Figure: cosine similarity matrix.]

4) Jazz arrangement (frame rate = 50). [Figure: cosine similarity matrix.]

The following figures show the novelty score graphs with different kernel widths.

1) Acoustic realization from MIDI data: kernel size 64 with Gaussian taper. In this figure each peak represents an individual bass eighth note; approximately 32 eighth notes appear, which I have marked with red circles. [Figure: novelty score; the X-axis is time (s) and the Y-axis is the novelty score.]

2) Quartet: kernel size 64 with Gaussian taper. In this figure each peak again represents an individual bass eighth note; approximately 36 eighth notes appear. [Figure: novelty score.]

3) Wind orchestra. This figure shows a very low novelty score compared to the others. I think this is because the signal changes very little and is comparatively stable, which can also be seen in the smoothed brightness of its similarity matrix; indeed, this version sounds very smoothly played. However, I should study this case further in order to clarify what is happening here. Note that the novelty axis shows values on the order of 10^-3. [Figure: novelty score.]

4) Jazz arrangement. In this figure each peak represents an individual bass eighth note; approximately 36 eighth notes appear. [Figure: novelty score.]

3.3 Discussion

Just by looking at the graphs and comparing them, we see some similarities. However, there should be a quantitative measure for analyzing the similarity between two such graphs, and I have not yet developed a method for measuring the (dis)similarities among them. One approach is to build a representative case and compare other versions of the same musical piece against it. My hypothesis is that there must be a way to use the musical information embedded in symbolic representations of music, such as MIDI or Humdrum, and to derive general rules for plotting a representative case of a piece. The beat spectrum of the novelty score will also be explored in the next stage of my research. As I mentioned earlier, each variable in the system, such as frame rate and kernel size, should be examined carefully; thorough empirical experiments on different kinds of musical signals will yield optimal settings for each case. Also, the problem of expensive computing time could be addressed by using C code together with MATLAB.

Bibliography

Allen, P. E. and R. B. Dannenberg (1990). Tracking musical beats in real time. In Proceedings of the 1990 ICMC, pp. 140-143.

Assayag, G., S. Dubnov, and O. Delerue (1999). Guessing the composer's mind: applying universal prediction to musical style. In Proceedings of the 1999 ICMC, Beijing, China.

Brown, J. C. and M. S. Puckette (1992). An efficient algorithm for the calculation of a constant Q transform. Journal of the Acoustical Society of America 92(5), 2698-2701.

Dixon, S. (2000). A lightweight multi-agent musical beat tracking system. In Proceedings of the AAAI Workshop on Artificial Intelligence and Music: Towards Formal Models for Composition, Performance and Analysis.

Duda, R. O., P. E. Hart, and D. G. Stork (2000). Pattern Classification (second ed.). New York: Wiley-Interscience.

Foote, J. and S. Uchihashi (2001). The beat spectrum: a new approach to rhythm analysis. Submitted to ICME 2001.

Foote, J. (1999a). Methods for the Automatic Analysis of Music and Audio. FXPAL Technical Report TR-99-038, December 1999.

Foote, J. (1999b). Visualizing music and audio using self-similarity. In Proceedings of ACM Multimedia 99, Orlando, FL, ACM Press, pp. 77-80.

Foote, J. (1997a). Content-based retrieval of music and audio. In C.-C. J. Kuo et al. (eds.), Multimedia Storage and Archiving Systems II, Proceedings of SPIE, Vol. 3229, pp. 138-147.

Foote, J. (1997b). A similarity measure for automatic audio classification. In Proceedings of the AAAI 1997 Spring Symposium on Intelligent Integration and Use of Text, Image, Video, and Audio Corpora, Stanford, March 1997.

Fujishima, T. (1999). Realtime chord recognition of musical sound: a system using Common Lisp Music. In Proceedings of the 1999 ICMC.

Goto, M. (2000). A robust predominant-F0 estimation method for real-time detection of melody and bass lines in CD recordings. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing.

Goto, M. and Y. Muraoka (1994). A beat tracking system for acoustic signals of music. In Proceedings of the 1994 ACM Multimedia.

Goto, M. and Y. Muraoka (1998). An audio-based real-time beat tracking system and its applications. In Proceedings of the 1998 ICMC.

Hippel, P. V. (2000). Questioning a melodic archetype: do listeners use gap-fill to classify melodies? Music Perception 18(2), 139-153.

Huron, D. (2000). Perceptual and cognitive applications in music information retrieval. In Proceedings of the International Symposium on Music Information Retrieval, MUSIC IR 2000.

Licklider, J. R. (1951). A duplex theory of pitch perception. Experientia 7, 128-134.

Logan, B. and S. Chu (2000). Music summarization using key phrases. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing.

Purwins, H., B. Blankertz, and K. Obermayer (2000). A new method for tracking modulations in tonal music in audio data format. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks.

Scheirer, E. and M. Slaney (1997). Construction and evaluation of a robust multifeature speech/music discriminator. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1331-1334.

Scheirer, E. D. (1998). Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America 103(1), 588-601.

Scheirer, E. D. (1999). Towards music understanding without separation: segmenting music with correlogram comodulation. In Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

Tzanetakis, G. and P. Cook (1999). Multifeature audio segmentation for browsing and annotation. In Proceedings of the 1999 IEEE WASPAA.

Wold, E., T. Blum, D. Keislar, and J. Wheaton (1996). Content-based classification, search, and retrieval of audio data. IEEE Multimedia Magazine 3(3), 27-36.

Zhang, T. and C.-C. J. Kuo (1999a). Heuristic approach for generic audio data segmentation and annotation. In Proceedings of the 7th ACM Conference on Multimedia, pp. 67-76.

Zhang, T. and C.-C. J. Kuo (1999b). Hierarchical system for content-based audio classification and retrieval. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 3001-3004.