A Variable Resolution transform for music analysis

A Variable Resolution transform for music analysis

Aliaksandr Paradzinets and Liming Chen

A Research Report, Lab. LIRIS, Ecole Centrale de Lyon, Ecully, June 2009

A Variable Resolution transform for music analysis

Aliaksandr Paradzinets and Liming Chen

Laboratoire d'Informatique en Images et Systèmes d'information (LIRIS), Département MI, Ecole Centrale de Lyon, University of Lyon, 36 avenue Guy de Collongue, Ecully Cedex, France; {aliaksandr.paradzinets; liming.chen}@ec-lyon.fr

Abstract: This paper presents a novel music representation using a Variable Resolution Transform (VRT) which is particularly well adapted for music audio analysis. The VRT is inspired by the continuous wavelet transform, but applies a different wavelet function at each scale. This gives the transform the flexibility to follow the logarithmic scale of musical note frequencies while maintaining good time and frequency resolution. As an example of application of this novel VRT, a multiple F0 detection algorithm is presented and evaluated, showing convincing results. Furthermore, a direct comparison with the FFT applied within the same algorithm is also provided.

Index Terms: music representation, music analysis, variable resolution transform, multiple fundamental frequency estimation

I. INTRODUCTION

As a major entertainment product, there is a huge amount of digital musical content produced, broadcast, distributed and exchanged. Consequently there is a rising demand for better ways of cataloging, annotating and accessing these musical data. This in turn has motivated intensive research activity in music analysis, content-based music retrieval, etc. The primary stage in any kind of audio signal processing is an effective audio signal representation. While there exist some algorithms performing music data analysis in the time domain, for example some beat detection algorithms, the majority of music processing algorithms perform their computation in the frequency domain, or in a time-frequency representation, to be exact. The performance of all further processing steps is therefore strictly dependent on the initial data representation.
As compared to a vocal signal, a music signal is likely to be more stationary and owns some very specific properties in terms of musical tones, intervals, chords, instruments, melodic lines, rhythms, etc. [1]. While many effective and high-performance music information retrieval (MIR) algorithms have been proposed [2-9], most of

these works unfortunately tend to consider a music signal as a vocal one and make use of MFCC-based features, which are primarily designed for speech signal processing. Mel Frequency Cepstrum Coefficients (MFCC) were introduced in the 1960s and have been used since then for speech signal processing. The MFCC computation averages the spectrum in sub-bands and provides average spectrum characteristics. While MFCCs tend to capture the global timbre of a music signal and are claimed to be of use in music information retrieval [10; 11], they cannot characterize the aforementioned music properties as needed for perceptual understanding by human beings and quickly find their limits [12]. Recent works suggest combining spectral similarity descriptors with high-level analysis in order to overcome the existing ceiling [13]. The Fast Fourier Transform and the Short-Time Fourier Transform have been the traditional techniques in audio signal processing. This classical approach is very powerful and widely used owing to its great advantage of rapidity. However, a special feature of musical signals is the exponential law of note frequencies. The frequency and time resolution of the FFT is linear and constant across the frequency scale, while human perception of sound is logarithmic according to the Weber-Fechner law (including loudness and pitch perception). Indeed, as is well known, the frequencies of notes in the equally-tempered tuning system in music follow an exponential law (with each semitone the frequency is multiplied by a factor of 2^(1/12)). If we consider the frequency range covered by each octave, this range grows as the octave number increases. Thus, to cover a wide range of octaves with a fine frequency grid, large windows are necessary in the case of the FFT; this affects the time resolution of the analysis. On the contrary, the use of small windows makes resolving the frequencies of neighboring notes in low octaves almost impossible.
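This mismatch can be made concrete with a small sketch (in Python; the 44.1 kHz sample rate and 4096-sample window are our own illustrative choices, not values from the paper) comparing the spacing of adjacent equal-tempered notes with the constant bin width of an FFT:

```python
import math

# Illustration: adjacent equal-tempered notes in a low octave are
# closer together than one FFT bin, while high-octave notes are far
# wider apart than one bin.

def note_freq(midi_note: int) -> float:
    """Equal temperament: each semitone multiplies f by 2**(1/12)."""
    return 440.0 * 2 ** ((midi_note - 69) / 12)

SAMPLE_RATE = 44100
WINDOW = 4096                      # a fairly large FFT window
bin_width = SAMPLE_RATE / WINDOW   # constant resolution, ~10.8 Hz

# spacing E1 -> F1 (MIDI 28, 29) vs B5 -> C6 (MIDI 83, 84)
low_gap = note_freq(29) - note_freq(28)    # ~2.45 Hz, below bin width
high_gap = note_freq(84) - note_freq(83)   # ~58.7 Hz, well above it

print(f"bin: {bin_width:.2f} Hz, low gap: {low_gap:.2f} Hz, "
      f"high gap: {high_gap:.2f} Hz")
```

Even with this large window, the low-octave semitone falls below one bin width, so neighboring bass notes cannot be separated.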
The ability to catch all octaves in music with the same frequency resolution is essential for music signal analysis, in particular for the construction of melodic similarity features. In this paper, we propose a new music signal analysis technique, the variable resolution transform (VRT), particularly suited to music signals. Our VRT is inspired by the Continuous Wavelet Transform (CWT) [14], which was specifically designed to overcome the limited time-frequency localization of the Fourier transform for non-stationary signals. Unlike the classical FFT, our VRT exhibits properties similar to the CWT, i.e. a variable time-frequency resolution grid with high frequency resolution and low time resolution in the low-frequency area, and high temporal resolution with low frequency resolution at the other end of the frequency axis, thus behaving like the human ear, which exhibits similar time-frequency resolution characteristics [15]. The remainder of this paper is organized as follows. Section II overviews related music signal representations.

Our variable resolution transform is then introduced in Section III. The experiments and the results are discussed in Section IV. Finally, we conclude our work in Section V.

II. RELATED WORKS

There are plenty of works in the literature dedicated to musical signal analysis. In this section, we first compare the popular FFT with the wavelet transform on the basis of desirable properties for music signal analysis, and then overview some other transforms and filter banks proposed in the literature.

A. Time-frequency transforms: FFT vs WT

The common approach is the use of the FFT (Fast Fourier Transform), which has become a de facto standard in the music information retrieval community. The use of the FFT seems straightforward in this field, and the relevance of its application to music signal analysis is almost never motivated. There are some works in music information retrieval attempting to make use of the wavelet transform as a novel and powerful tool in musical signal analysis; however, this direction is not yet well explored. [8] proposes to rely on the discrete wavelet transform for beat detection. The discrete packet wavelet transform is studied in [15] to build time and frequency features for music genre classification. In [16], wavelets are also used for automatic pitch detection. As is well known, the Fourier transform provides a spectral representation of a periodic signal as a sum of a series of sines and cosines. While the Fourier transform gives an insight into the spectral properties of a signal, its major disadvantage is that the decomposition of a signal by the Fourier transform has infinite frequency resolution and no time resolution. This means that we are able to determine all frequencies present in the signal, but without any knowledge of when they are present. This drawback makes the Fourier transform perfect for analyzing stationary signals but unsuitable for irregular signals whose characteristics change in time.
To overcome this problem, several solutions have been proposed in order to represent the signal in both the time and frequency domains. One of these techniques is the windowed Fourier transform, or short-time Fourier transform. The idea is to bring time localization into the classic Fourier transform by multiplying the signal with an analyzing window. The problem here is that the short-time discrete Fourier transform has a fixed resolution: the width of the windowing function is a trade-off between good frequency resolution and good time resolution. A shorter window gives lower frequency resolution but higher time resolution, while a larger window gives higher frequency resolution but lower time resolution. This phenomenon is related to Heisenberg's uncertainty principle, which says that

Δt ~ 1 / Δf   (1)

where Δt is the time resolution step and Δf is the frequency resolution step. Remember that in our work the main goal is music analysis. In this respect, we consider a music-related example which illustrates the specificities of musical signals. As is known, the frequencies of notes in the equally-tempered tuning system of western music follow a logarithmic law, i.e. adding a certain interval (in semitones) corresponds to multiplying a frequency by a given factor. For an equally-tempered tuning system, a semitone is defined by a frequency ratio of 2^(1/12). So, the interval in semitones between two frequencies f1 and f2 is

n = 12 log2(f2 / f1)   (2)

If we consider the frequency range covered by each octave, it grows as the octave number increases. Thus, applying the Fast Fourier Transform we either lose the resolution of notes in low octaves (Figure 1) or we are unable to distinguish high-frequency events which are close in time and have short durations.

Figure 1. Mismatch of note frequencies and the frequency resolution of the FFT.

A time-frequency representation which can overcome the resolution issues of the Fourier transform is the wavelet transform. Wavelets (literally "small waves") are a relatively recent instrument in modern mathematics. Introduced about 20 years ago, wavelets have made a revolution in the theory and practice of non-stationary signal analysis [14; 17]. Wavelets first appeared in the literature in the works of Grossmann and Morlet [18]. Some of the ideas behind wavelets existed long before: in 1910 Haar published a work on a system of locally-defined basis functions, now called Haar wavelets. Nowadays wavelets are widely used in many fields of signal analysis, ranging from image processing to the analysis and synthesis of speech, medical data and music [16; 19]. The continuous wavelet transform of a function f(t) ∈ L2(R) is defined as follows:

W(a, b) = (1/√a) ∫ f(t) ψ*((t − b) / a) dt   (3)

where a, b ∈ R, a ≠ 0. In equation (3), ψ(t) is called the basic wavelet or mother wavelet function (* stands for complex conjugation). The parameter a is called the wavelet scale; it can be considered analogous to frequency in the Fourier transform. The parameter b is the localization, or shift; it has no correspondence in the Fourier transform. One important point is that the wavelet transform does not have a single set of basis functions like the Fourier transform. Instead, the wavelet transform utilizes an infinite set of possible basis functions. Thus, it has access to a wide range of information, including the information which can be obtained by other time-frequency methods such as the Fourier transform. As explained in the brief introduction on music signals, a music excerpt can be considered as a sequence of note (pitch) events lasting certain times (durations). Apart from beat events, singing voice and vibrating or sweeping instruments, the signal between two note events can be assumed to be quasi-stationary. The duration of a note varies according to the main tempo of the piece, the type of music and the type of melodic component the note represents. Fast or short notes are usually found in melodic lines in the high-frequency area, while slow or long notes are usually found in bass lines, with rare exceptions. Let us consider the following example in order to see the difference between the Fourier transform and the wavelet transform. We construct a test signal containing two notes, E1 and A1, playing simultaneously during the whole period of time (1 second). These two notes can represent a bass line which, as is well known, does not change quickly in time. At the same time, we add 4 successive B5 notes with small intervals between them (around 1/16 s). These notes can theoretically be notes of the main melody line. Let us now look at the Fourier spectrogram of the test signal with a small analyzing window.

Figure 2. Small-windowed Fourier transform (512 samples) of the test signal containing notes E1 and A1 at the bottom and 4 repeating B5 notes at the top.
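The test signal can be reconstructed along these lines (a sketch in Python; the 16 kHz sample rate, the exact onset times and the equal-tempered note frequencies are our assumptions based on the description above):

```python
import math

# Our reconstruction of the test signal: bass notes E1 and A1
# sustained for 1 s, plus four short B5 bursts of ~1/16 s each.

SR = 16000                            # sampling rate (assumed)
E1, A1, B5 = 41.2, 55.0, 987.8        # note frequencies in Hz

def tone(freq, start, dur, length=SR):
    """A sine tone of the given frequency, placed at [start, start+dur) s."""
    out = [0.0] * length
    for t in range(int(start * SR), min(int((start + dur) * SR), length)):
        out[t] = math.sin(2 * math.pi * freq * t / SR)
    return out

signal = [0.0] * SR
for comp in (tone(E1, 0.0, 1.0), tone(A1, 0.0, 1.0)):
    signal = [s + c for s, c in zip(signal, comp)]
for k in range(4):                    # four B5 notes with ~1/16 s gaps
    burst = tone(B5, 0.125 * k + 0.06, 1 / 16)
    signal = [s + c for s, c in zip(signal, burst)]
print(len(signal))
```

Feeding this signal to an STFT with 512- and 1024-sample windows reproduces the trade-off discussed next.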

As we can see from Figure 2, while the high-octave notes can be resolved in time, the two bass notes are irresolvable in the frequency domain. Now we increase the size of the window in the Fourier transform. Figure 3 illustrates the resulting spectrogram.

Figure 3. Large-windowed Fourier transform (1024 samples) of the test signal containing notes E1 and A1 at the bottom and 4 repeating B5 notes at the top.

As we can see, the two lines at the bottom of the spectrogram are now clearly distinguishable, while the time resolution of the high-octave notes has been lost. Finally, we apply the wavelet transform to the test signal. Figure 4 shows the Morlet-based wavelet spectrogram of our test signal.

Figure 4. Wavelet transform (Morlet) of the test signal containing notes E1 and A1 at the bottom and 4 repeating B5 notes at the top.

Of course, the given example is quite artificial; however, it explains well our motivation for a wavelet-like time-frequency representation of a signal. It is also known that the human ear exhibits time-frequency characteristics closer to those of the wavelet transform [20].

B. Other transforms and filter banks

The idea of adapting the time/frequency scale of a Fourier-related transform to musical applications is not completely novel. A technique called the Constant Q Transform [21] is related to the Fourier transform and is used to transform a data series to the frequency domain. Like the Fourier transform, the constant Q transform is a bank of filters, but contrary to the Fourier transform it has geometrically spaced center frequencies f_k = f_0 · 2^(k/b) (k = 0, 1, ...), where b is the number of filters per octave. In addition, it has a constant frequency-to-resolution ratio Q = f_k / Δf_k = 1 / (2^(1/b) − 1). Choosing f_0 and b appropriately makes the central frequencies correspond to the frequencies of notes. In general, the transform is well suited to musical data (see e.g. [22]; in [23] it was successfully used for recognizing instruments), and this can be seen in some of its advantages compared to the Fast Fourier Transform. As the output of the transform is effectively amplitude/phase against log frequency, fewer spectral bins are required to cover a given range effectively, and this proves useful when frequencies span several octaves. The downside is a reduction in frequency resolution for the higher frequency bins. Besides the constant Q transform there is a bounded version of it (BQT) which uses quasi-linear frequency sampling, where the frequency sampling remains linear within separate octaves. This kind of modification allows the construction of medium-complexity computation schemes in comparison to the standard CQT. However, making the frequency sampling quasi-linear (within separate octaves) renders the finding of harmonic structure a much more complex task. Fast Filter Banks are designed to deliver higher frequency selectivity while maintaining low computational complexity; this kind of filter bank inherits all the disadvantages of the FFT in music analysis applications.
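The geometric spacing of the constant Q transform described above can be sketched as follows (Python; f_0 = 55 Hz and the two-octave range are our own illustrative choices):

```python
import math

# Constant Q Transform geometry: center frequencies f_k = f0 * 2**(k/b)
# and a quality factor Q = f_k / bandwidth_k that is the same for
# every filter in the bank.

b = 12                       # filters per octave (one per semitone)
f0 = 55.0                    # lowest center frequency (A1, assumed)
Q = 1 / (2 ** (1 / b) - 1)   # ~16.8 for b = 12

centers = [f0 * 2 ** (k / b) for k in range(25)]   # two octaves
bandwidths = [f / Q for f in centers]

# bin 12 sits exactly one octave above bin 0
print(round(centers[12] / f0, 6), round(Q, 1))
```

With b = 12, each filter lands on one equal-tempered semitone, which is why the CQT output maps so directly onto the note scale.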
More advanced techniques, described for example in [24], are medium-complexity methods which aim to overcome the disadvantages of the FFT and try to follow the note-system frequency sampling. However, octave-linear frequency sampling keeps the same disadvantage as in the case of bounded Q transforms.

III. VARIABLE RESOLUTION TRANSFORM

Our Variable Resolution Transform (VRT) is first derived from the classic definition of the Continuous Wavelet Transform (CWT) in order to enable a variable time-frequency coverage which should fit music signal analysis better. The consideration of specific properties of music signals finally leads us to change the mother function as

well, and thus our VRT is not a true CWT but a filter bank. We start the construction of our VR Transform from the Continuous Wavelet Transform defined by (3). Thus, we define our mother function as follows:

ψ(t) = H(t, l) · e^(j2πt)   (4)

where H(t, l) is the Hann window function of length l, with l ∈ Z, as defined by (5). In our case l will correspond to window lengths of the order of milliseconds. Notice that using different length values l amounts to changing the mother wavelet function ψ.

H(t, l) = 1/2 + (1/2) cos(2πt / l)   (5)

Once the length l is fixed, function (4) becomes very similar to a Morlet wavelet. It is an oscillating function: a flat wave modulated by a Hann window. The parameter l defines the number of periods present in the wave. Figure 5 illustrates such a function with l = 20 waves.

Figure 5. Our mother wavelet function: a flat wave modulated by a Hann window with l = 20.

According to the definition of the function (since l < ∞), we can write

∫ |ψ(t)| dt < ∞  and  ∫ |ψ(t)|² dt < ∞   (6)

The function oscillates symmetrically around zero, hence

∫ ψ(t) dt ≈ 0   (7)

Using (3), we write a discrete version of the transform for a sampled signal between the time instants t − l/2 and t + l/2. Applying the wavelet transform to the signal, we are interested in the spectrum magnitude

W(a, b) = (1/a) · √( [ Σ_{t = −la/2}^{la/2} s[t + b] · H(t/a, l) · cos(2πt/a) ]² + [ Σ_{t = −la/2}^{la/2} s[t + b] · H(t/a, l) · sin(2πt/a) ]² )   (8)

Here W(a, b) is the magnitude of the spectral component of the signal s[t] at time instant b and wavelet scale a. The value of W(a, b) can be obtained for any a and b, provided that b does not exceed the length of the signal. Equation (8) thus defines a Continuous Wavelet Transform for a discrete (time-sampled) signal. The scale a of the wavelet can be expressed in terms of the central frequency corresponding to it, since our mother function is a unit oscillation:

f = f_S / a   (9)

where f_S is the sampling frequency of the signal. A higher value of a stands for a lower central frequency.

A. Logarithmic frequency sampling

First of all, the sampling of the scale axis is chosen to be logarithmic in terms of frequency. This means that each musical octave, or each note, will have an equal number of spectral samples. Such a choice is explained by the properties of a music signal, whose note frequencies are known to follow a logarithmic law (following human perception). Logarithmic frequency sampling also simplifies harmonic structure analysis and economizes the amount of data necessary to cover the musical tuning system effectively. A voiced signal with a single pitch is in the general case represented by its fundamental frequency F0 and its partials (harmonics), with frequencies equal to the fundamental frequency multiplied by the partial number. Hence the distances between the partials (harmonic components) and F0 (the basic frequency) on a logarithmic frequency scale are constant, independently of F0. Such a harmonic structure looks like a fence, as depicted in Figure 6.

Figure 6. Harmonic structure on a logarithmic frequency scale.
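The mother function (4)-(5) and the discrete magnitude (8), together with the scale-frequency relation (9), can be sketched as follows (Python; the 1 kHz test tone and the 16 kHz sample rate are our own illustration, and the summation limits follow our reading of (8)):

```python
import math

# hann implements (5), w_mag implements (8): the magnitude of one
# spectral sample at scale a and time shift b.

def hann(x: float, l: float) -> float:
    """Equation (5), supported on [-l/2, l/2]."""
    return 0.5 + 0.5 * math.cos(2 * math.pi * x / l) if abs(x) <= l / 2 else 0.0

def w_mag(s, a: float, b: int, l: float = 20.0) -> float:
    half = int(l * a / 2)
    re = im = 0.0
    for t in range(-half, half + 1):
        if 0 <= t + b < len(s):
            win = hann(t / a, l)          # window stretched by the scale
            re += s[t + b] * win * math.cos(2 * math.pi * t / a)
            im += s[t + b] * win * math.sin(2 * math.pi * t / a)
    return math.sqrt(re * re + im * im) / a

# By (9), a tone of frequency f responds most at scale a = f_S / f:
SR = 16000
tone = [math.sin(2 * math.pi * 1000 * t / SR) for t in range(SR)]
resp = {a: w_mag(tone, a, b=8000) for a in (8, 16, 32)}
print(max(resp, key=resp.get))            # a = 16 = 16000 / 1000
```

The response peaks at the scale matching the tone, which is the behavior the logarithmic scale sampling below builds on.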

In order to cover the frequency axis from f_min to f_max with N frequency samples following a logarithmic law, we define a discrete function a(n), which denotes the scale of the wavelet, where n stands for the wavelet bin number ranging over the interval 0..N−1:

a(n) = (f_S / f_min) · e^(−nC)   (10)

Now the transform (8), sampled in both directions, gives

W(n, b) = (f_min e^(nC) / f_S) · | Σ_t s[t + b] · H(t · f_min e^(nC) / f_S, l) · e^(−j2πt · f_min e^(nC) / f_S) |   (11)

where the constant C = (1/N) ln(f_max / f_min). Expression (11) is the basic expression for obtaining an N-bin spectrogram of the signal at time instant b: it provides N values for each instant of time, N being the number of frequency samples. Expression (11) is still a sampled version of the Continuous Wavelet Transform where the sampling of the scale axis has been chosen logarithmic with N samples. The dependency of frequency on the bin number has the following form (with f_min = 50, f_max = 8000, N = 1000):

f(n) = f_min · e^(nC) = f_min · (f_max / f_min)^(n/N)   (12)

In order to depict the time/frequency properties of music signals by this discretized wavelet transform with a fixed length value (l = 20), let us consider wavelet spectrograms of several test signals. Figure 7 shows the wavelet spectrogram W(n, b) of a piano recording. One can observe single notes on the left and chords on the right. The fundamental frequency (F0) and its harmonics can be observed in the spectrum of each note. As we can see from Figure 7, up to 5 harmonics are resolvable. Harmonics above the 5th become indistinguishable, especially in the case of chords, where the number of simultaneously present frequency components is higher.
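The logarithmic sampling of equations (10) and (12) can be sketched as follows (Python; f_S = 16000 Hz is our assumption, the other constants are the ones stated above):

```python
import math

# Logarithmic scale axis: frequency f(n) per (12) and wavelet scale
# a(n) per (10), with f_min = 50 Hz, f_max = 8000 Hz, N = 1000.

F_MIN, F_MAX, N = 50.0, 8000.0, 1000
F_S = 16000.0                        # sampling frequency (assumed)
C = math.log(F_MAX / F_MIN) / N      # the constant in (11)

def freq(n: int) -> float:
    """Equation (12): f(n) = f_min * e**(n*C)."""
    return F_MIN * math.exp(n * C)

def scale(n: int) -> float:
    """Equation (10): a(n) = f_S / f(n)."""
    return F_S / freq(n)

# every bin covers the same musical interval: f(n+1)/f(n) is constant
print(round(freq(0), 1), round(freq(N), 1), round(freq(1) / freq(0), 6))
```

The constant ratio between adjacent bins is exactly what makes the harmonic "fence" of Figure 6 position-invariant along the bin axis.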

Figure 7. Wavelet spectrogram of a piano recording (wavelet (4)), with bin number n (frequency) on the vertical axis and time on the horizontal axis; single notes on the left and chords on the right, with a spectral profile (F0, 2·F0, 3·F0) shown at the cursor. Up to 5 harmonics are resolvable; harmonics above the 5th become indistinguishable, especially in the case of chords, where the number of simultaneous frequency components is higher.

Good time resolution is important in tasks such as beat or onset detection in music signal analysis. The next example serves to illustrate the time resolution properties of the Variable Resolution Transform we are developing. In this example we examine a signal with a series of delta-pulses (Dirac impulses), as illustrated in Figure 8, which shows a wavelet spectrogram of 5 delta-pulses (1 on the left, 2 in the middle and 2 on the right). As we can see from this figure, the delta-pulses are still distinguishable even if the distance between them is only 8 ms (right case). In the case of the FFT one would need a 64-sample window in order to obtain such time resolution.

Figure 8. Wavelet transform of a signal containing 5 delta-pulses. The distance between the two pulses on the right is only 8 ms.

A quite straightforward listening experiment that we have carried out reveals that the human auditory system is capable of distinguishing delta-pulses when the distance between them is around 10 ms. On the other hand, the human auditory system is also able to distinguish very close frequencies: 4 Hz on average, and down to 0.1 Hz.

B. Varying the mother function

However, music analysis requires good frequency resolution as well. As we can see from the spectrogram in Figure 7, neither high-order partials nor close notes are resolvable, because the spectral localization of the wavelet used is too wide. Increasing the length parameter l of the Hann window in (4) or (11) would render our wavelet transform unusable in the low-frequency area, since the time resolution there would grow exponentially. Thus, we propose in this work to make l a dynamic parameter, with the possibility of adjusting its behavior across the scale axis. For this purpose we use the following law for the parameter l in (11), instead of applying the scale a(n) to the parameter t in H(t, l):

l(n) = L · (1 − k1 · n/N) · e^(−k2 · n/N)   (13)

where L is the initial window size and k1, k2 are adjustable parameters. The transform (11) becomes:

W(n, b) = (f_min e^(nC) / f_S) · | Σ_t s[t + b] · H(t, l(n)) · e^(−j2πt · f_min e^(nC) / f_S) |   (14)

Expression (13) allows the effective wavelet width to vary in different ways, from linear to completely exponential, the latter following the original transform definition. When L = f_S / f_min, k1 = 0 and k2 = C·N, (14) is equivalent to (11).

Figure 9. Various l(n), depending on the parameters: from linear (left) to exponential (right).
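The window-length law (13), as we read it (the printed formula is ambiguous, so this sketch is an interpretation), can be checked against the stated equivalence with (11):

```python
import math

# Sketch of (13): l(n) = L * (1 - k1*n/N) * exp(-k2*n/N), which
# interpolates between a linear (k2 = 0) and a purely exponential
# (k1 = 0) decrease of the window length across the N bins.

F_MIN, F_MAX, N = 50.0, 8000.0, 1000
F_S = 16000.0                                  # assumed sampling rate
C = math.log(F_MAX / F_MIN) / N                # constant from (11)

def l_of_n(n: int, L: float, k1: float, k2: float) -> float:
    return L * (1 - k1 * n / N) * math.exp(-k2 * n / N)

def a_of_n(n: int) -> float:
    """Scale a(n) of the original transform (10)."""
    return (F_S / F_MIN) * math.exp(-n * C)

# With L = f_S/f_min, k1 = 0 and k2 = C*N, the law reduces to the
# pure exponential of (11): l(n) coincides with the scale a(n).
L0 = F_S / F_MIN
print(all(abs(l_of_n(n, L0, 0.0, C * N) - a_of_n(n)) < 1e-9
          for n in range(0, N, 100)))
```

Nonzero k1 and smaller k2 slow the shrinkage of the window at high n, which is precisely what buys frequency resolution for high-order partials.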

In this way we are now able to control the time resolution behavior of our transform. In fact, such a transform is no longer a wavelet transform, since the mother function changes across the scale axis. For this reason we call the resulting transform a variable resolution transform (VRT). It can also be referred to as a custom filter bank. As the effective mother-function width (number of wave periods) grows at high frequencies relative to the original mother function, the spectral line width becomes narrower, and hence the transform is able to resolve the harmonic components (partials) of the signal. An example spectrogram computed with the new variable resolution transform is depicted in Figure 10.

Figure 10. VRT spectrogram of the piano recording used in the previous experiment, with bin number n (frequency) on the vertical axis and time on the horizontal axis; single notes on the left and chords on the right. Fundamental frequencies and partials (F0, 2·F0, 3·F0) are distinguishable (k1 = 0.8, k2 = 2.1).

C. Properties of the VR transform

A music signal between 50 and 8000 Hz spans approximately 8 octaves. Each octave consists of 12 notes, leading to a total number of notes of around 100. A filter bank with 100 filters would thus be enough to cover such an octave range. In reality, the frequencies of notes may differ from the theoretical note frequencies of equal-tempered tuning because of recording and other conditions. Therefore, for the music signal analysis considered here, we work with a spectrogram size of 1024 bins, roughly 10 times the amount strictly necessary, which covers the note scale with about 10 bins per note. Timbre is one of the major properties of a music signal, along with melody and rhythm. Let us now consider the structure of the partials of a harmonic signal (its harmonic structure). In Figure 6 we have depicted an approximate view of such a structure on a logarithmic frequency scale. According to the definition of the function f(n) (12), the distance

between partial i and partial j, in terms of the number of bins, is independent of the absolute fundamental frequency value. Indeed, according to (12),

n(f_j) − n(f_i) = (1/C) ln(f_j / f_min) − (1/C) ln(f_i / f_min)

and taking into account f_i = i·F0 and f_j = j·F0 we obtain:

n(f_j) − n(f_i) = (1/C)·(ln F0 + ln j − ln f_min) − (1/C)·(ln F0 + ln i − ln f_min) = (1/C) ln(j / i)

Accurate harmonic analysis of a music signal implies that the frequency resolution, in terms of spectrogram bin numbers and expressed by the spectral dispersion, should always remain below the distance between the neighboring components under consideration. Since the total width of a 20-partial harmonic structure is a constant of around 600 points in terms of the number of bins (n(f_20) − n(f_1)), we can establish that the frequency resolution of the obtained transform is sufficient to resolve the high-order partials we are interested in at all positions of the VRT spectrogram, especially for low-octave notes. This means that a 20-partial harmonic structure starting from the beginning of the spectrogram will always lie above the dispersion curve. If we now consider the time resolution of the transform, we must recall Figure 9, where various dependencies of the effective filter width were given. If we define the maximum effective window size to be 180 ms (recall our musical signal properties), we obtain the time resolution grid illustrated in Figure 11.

Figure 11. Time resolution dependency of the VR transform with k1 = 0.8, k2 = 2.1.

D. Discussion

As we can see, our Variable Resolution Transform is derived from the classic definition of the Continuous Wavelet Transform [25; 26]. However, our VRT is not a CWT, even though they have many similarities. The main difference between the VRT and the CWT resides in the frequency axis sampling, as well as in the mother wavelet

function, which in the case of the VRT changes its form across the scale (or frequency) axis in order to have enough resolution for high-order frequency partials. With this last property it is not a wavelet transform anymore, because in a true wavelet transform the mother function is only scaled and shifted, producing a discrete tiling of the time-frequency space in the case of the DWT or an infinite coverage in the case of the CWT. Our VRT can also be referred to as a specially crafted filter bank. The major differences between our VRT and a wavelet transform are:

- no 100% space tiling;
- no 100% signal reconstruction (depending on the parameters);
- the mother function changes.

The major similarities between our VRT and a wavelet transform are the following:

- they are based on a specially sampled version of the CWT;
- with certain parameters they can provide 100% signal reconstruction;
- low time resolution and high frequency resolution in the low-frequency area, and high time resolution with low frequency resolution in the high-frequency area.

IV. APPLICATIONS: MULTIPLE F0 ESTIMATION

A music signal is generally a composite signal blended from several instruments and/or voices, and thus has multiple fundamental frequencies. Accurate estimation of these multiple F0s can greatly contribute to further music signal processing, and it is an important scientific issue in the field. As the estimation of multiple F0s mostly requires signal processing in the frequency domain, this problem is a very good illustration highlighting the properties of our VRT. Early works on automatic pitch detection were developed for speech signals (see e.g. [27; 28]). Much of the literature nowadays treats the monophonic case (only one F0 present and detected) of fundamental frequency estimation. There are also works studying the polyphonic case of music signals. However, in most of these works the polyphonic music signal is usually considered with a number of restrictions, such as on the number of notes played simultaneously, or with some hypotheses about the instruments involved.
The work [29] presents a pitch detection technique using separate time-frequency windows. Both the monophonic and the two-voice polyphonic cases are studied. Multiple-pitch estimation in the polyphonic single-instrument case is described in [30], where the authors propose to apply a comb filter mapping the linear frequency scale of the FFT onto the logarithmic scale of note frequencies. As the method is FFT-based, the technique inherits the drawbacks of the FFT for

music signal analysis highlighted in Section III, namely the requirement for large FFT analysis windows, leading to low time resolution. An advanced F0 detection algorithm is presented in [31], based on finding the frequencies which maximize an F0 probability density function. The algorithm is claimed to work in the general case and has been tested on CD recordings. We can also mention many other recent works on multiple fundamental frequency estimation, for instance those in [32; 33]. Both of these works are probabilistic methods. The first uses a probabilistic HMM-based approach taking into account some a priori musical knowledge such as tonality. Variable results, from 50% to 92% recognition rates for different instruments on MIDI-synthesized sequences, are reported. The second algorithm is evaluated on synthetic samples where each file contains only one combination of notes (1 note or 1 chord). It is not evident how to compare these different multiple F0 estimation algorithms, as the assumptions or models on the polyphonic music signal are often not explicitly stated. On the other hand, there is no single evident way of performing multiple F0 detection. Some algorithms are strong in noisy environments; some require a priori training; others are able to detect inharmonic tones, etc. The most popular approach to F0 estimation is harmonic pattern matching in the frequency domain. Our multiple-F0 estimation algorithm makes use of this basic idea. It is illustrated in this paper as an example which relies on our VRT specifically designed for music signal analysis.

A. VRT-based multiple F0 estimation

The basic principle of the F0 estimation algorithm consists of modeling our VRT spectrum with harmonic models. Real musical instruments are known to have inharmonic components in their spectra [34]. This means that the frequency of the n-th partial may not be strictly equal to F0 · n.
The algorithm we describe does not take such inharmonic components into account, but it tolerates some displacement of the partials in a natural way. A typical flat harmonic structure used to model the spectrum is depicted in Figure 12.

Figure 12. Harmonic structure.

This "fence" is a vertical cut of a VRT spectrogram calculated from a synthetic signal representing an ideal harmonic instrument. The width of the peaks and the spacing between them are variable because the VR transform has a

logarithmic frequency scale. In the next step, these models are used to approximate the spectrum of the signal being analyzed in order to obtain a list of F0 candidates.

Figure 13. Matching of harmonic models to the spectrum.

During every iteration of the algorithm, such a harmonic fence is shifted along the frequency axis of the spectrogram and matched with it at each starting point. The matching of the harmonic model is done as follows. At every harmonic, the amplitude a_i is taken from the value of the spectrogram at the frequency of the i-th harmonic. As the frequencies of the harmonics do not necessarily have integer ratios to the fundamental frequency, we take the maximum amplitude in a close neighborhood, as explained in Figure 14.

Figure 14. Procedure for extracting the vector of harmonic amplitudes a_1, ..., a_n: maximal values of the spectrum inside tolerance windows.

This procedure forms a function A(f), which is the norm of the vector a for frequency f. The value of the frequency for which the function A takes its maximum value is considered an F0 candidate.
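A minimal sketch (our own Python interpretation, with hypothetical names and a toy spectrum) of the extraction and scoring step of Figures 13-14: for each harmonic of a candidate F0, the maximum spectrogram value inside a small tolerance window around the expected bin is taken, and the candidate is scored by the norm A(f):

```python
import math

# Toy version of the harmonic-model matching: bin_of maps a frequency
# to its logarithmic bin (inverse of (12)), harmonic_amplitudes picks
# the per-harmonic maxima inside tolerance windows, score computes A(f).

F_MIN, F_MAX, N = 50.0, 8000.0, 1000
C = math.log(F_MAX / F_MIN) / N

def bin_of(f: float) -> int:
    return int(round(math.log(f / F_MIN) / C))

def harmonic_amplitudes(spectrum, f0, n_harm=5, tol=3):
    amps = []
    for i in range(1, n_harm + 1):
        n = bin_of(i * f0)
        if n >= len(spectrum):
            break
        lo, hi = max(0, n - tol), min(len(spectrum), n + tol + 1)
        amps.append(max(spectrum[lo:hi]))   # tolerate slight inharmonicity
    return amps

def score(spectrum, f0):
    a = harmonic_amplitudes(spectrum, f0)
    return math.sqrt(sum(x * x for x in a))  # A(f): norm of the vector a

# toy spectrum: unit peaks at the first five harmonics of 110 Hz
spec = [0.0] * N
for i in range(1, 6):
    spec[bin_of(110.0 * i)] = 1.0
print(score(spec, 110.0) > score(spec, 130.0))   # True: 110 Hz wins
```

In the full algorithm this scoring runs at every starting point of the fence, and the maximizing frequency becomes the F0 candidate for the iteration.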

Further, the obtained f0 candidate and the corresponding vector a of harmonic amplitudes are transformed into a spectrum slice like the one in Figure 12. The shape of the peaks is taken from the shape of the VRT spectrum of a sine wave of the corresponding frequency. This slice is then subtracted from the spectrum under study. The iterative process is repeated either until the current value of the harmonic structure A(f) becomes inferior to a certain threshold or until the maximum number of iterations has been reached. We limit the maximum number of iterations to 4, and therefore the maximum number of notes that can be simultaneously detected is 4. As observed in preliminary experiments, increasing the number of simultaneously detected notes does not improve the f0 detection performance significantly for high-polyphonic music, because after the 3rd or 4th iteration the residue of the spectrum is already quite noisy, as almost all harmonic components have already been subtracted from it due to harmonic overlaps. The procedure of note extraction is applied every 25 ms to the input signal sampled at 16 kHz, 16 bits. Hence, for the shortest notes with duration around ms we obtain note candidates at least twice, in order to be able to apply filtering techniques. Every slice produces a certain number of f0 candidates; these candidates are then filtered in time in order to remove noise and unreliable notes. The time filtering method used is nearest-neighbor interframe filtering: 3 successive frames are taken and the f0 candidates in the middle frame are changed according to the f0 candidates in the two neighboring frames. This filter removes noisy (falsely detected) f0 candidates as well as holes in notes caused by misdetection.

B. Experimental evaluation

The easiest way to make basic evaluation experiments in automated music transcription is to use MIDI files (plenty of them can be freely found on the Internet) rendered into waves as input data. The MIDI events themselves serve as the ground truth.
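Returning to the nearest-neighbor interframe filtering described above, it can be sketched as follows. The exact rule is not spelled out in the text, so this is one plausible interpretation: a candidate in the middle frame survives only if a neighboring frame supports it, and a one-frame hole is refilled when both neighbors agree.

```python
def interframe_filter(frames):
    """Nearest-neighbour interframe filtering of f0 candidates.

    `frames` is a list of sets of note numbers, one set per 25 ms
    slice.  A note in the middle frame is kept only if it also appears
    in at least one neighbouring frame (removing isolated false
    detections); a note present in both neighbours is re-inserted
    (filling one-frame holes caused by misdetection)."""
    out = [set(f) for f in frames]
    for t in range(1, len(frames) - 1):
        prev, cur, nxt = frames[t - 1], frames[t], frames[t + 1]
        out[t] = {n for n in cur if n in prev or n in nxt} | (prev & nxt)
    return out
```

For example, `interframe_filter([{60}, {60, 72}, {60}])` drops the isolated candidate 72, while `interframe_filter([{60}, set(), {60}])` restores the one-frame hole in note 60.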
However, real-life results must be obtained from recorded music with real instruments, transcribed by trained music specialists. In our work we used wave files synthesized from MIDI using the hardware wavetable synthesis of a Creative SB Audigy2 soundcard with a high-quality 140 MB SoundFont bank, Fluid_R3, freely available on the Internet. In such wavetable synthesis banks, all instruments are sampled at good sampling rates from real ones: the majority of the pitches producible by an instrument are recorded as sampled (wave) blocks and stored in the SoundFont. In the SoundFont we used, the acoustic grand piano, for example, is sampled every four notes from a real acoustic grand piano. Waves for notes which lie between these reference notes are obtained by resampling the waves of the closest reference notes. Therefore, a signal generated using such wavetable synthesis can be considered as a real instrument signal

recorded under ideal conditions. A polyphonic piece is then an ideal linear mixture of real instruments. To make the recording conditions closer to reality, in some tests we played the signal over speakers and recorded it with a microphone.

Recall and Precision measures are used to measure the performance of the note detection. Recall is defined as:

Recall = (number of correct notes detected) / (actual number of notes) (15)

Precision is defined as follows:

Precision = (number of correct notes detected) / (number of all notes detected) (16)

For the overall measure of the transcription performance, the following F1 measure is used:

F1 = 2 * (Recall * Precision) / (Recall + Precision) (17)

All falsely detected notes also include those with octave errors. For some tasks of music indexing, for instance tonality determination, what is important is the note basis and not the octave number. For this reason, the performance of note detection without taking octave errors into account is estimated as well. Our test dataset consists of 10 MIDI files of classical and pop compositions containing 200 to 3000 notes. Some other test sequences were directly played using the keyboard. The following tables (Table 1 - Table 4) display precision results of our multiple pitch detection. The "Per. Oct" column stands for the performance of note detection not taking into account the notes' octaves (just the basic note is important). The polyphony column indicates the maximum and the average number of simultaneously sounding notes found in the piece.

Table 1. Note detection performance in the monophonic case. Sequences are played manually using the keyboard. Columns: Name, number of notes, Polyphony (max / avg), Recall, Prec, F1, Per. Oct F1. Rows: Piano Manual, Violin Manual.

Table 2. Note detection performance in the polyphonic case. Sequences of chords are played manually using the keyboard. Same columns. Rows: Piano Manual, Piano Manual, Flute Manual.

Table 3. Note detection performance in the polyphonic case.
Classical music titles (single- and multi-instrument, no percussion).

Same columns as Table 1. Rows: Fur_Elize, Fur_Elize w/ microphone, Tchaikovski, Tchaikovski, Bach, Bach, Bach Fugue, Vivaldi Mandolin Concerto.

Table 4. Note detection performance in the polyphonic case. Popular and other music (multi-instrument with percussion). Same columns. Rows: K. Minogue, Madonna, Soundtrack, Godfather.

As we can see from these tables, our algorithm performs quite well in the monophonic case. Good results are also obtained in the polyphonic case with classical music having a low average level of polyphony (number of notes played simultaneously). More complex musical compositions, which include percussion instruments and have a high polyphony rate, produced lower recognition rates. In our note detection algorithm, we have limited the maximal detectable polyphony to 4, while the maximal and average polyphony in the case of popular and other music are 10 and 4.7 respectively. The octave precision, however, stays high (Per. Oct F1 field).

For comparison purposes, we also implemented our note detection algorithm based on the FFT with different window sizes instead of our VRT. We carried out an experiment with a set of polyphonic classical compositions (~1000 notes) using this FFT-based note detection algorithm. Table 5 and Figure 15 summarize the experimental results.

Table 5. Comparison of transcription performance based on different time-frequency transforms (the FFT with various window sizes versus the VRT). Columns: Transform (FFT, FFT, FFT, VRT), FFT size or number of VRT frequency samples, Result (F1).
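As a concrete reading of equations (15)-(17), the scores reported in the tables above can be computed as follows. The note representation (pairs of onset frame and MIDI pitch) is our assumption for the sketch; the paper does not specify how a detection is matched to the ground truth.

```python
def transcription_scores(detected, ground_truth):
    """Recall, Precision and F1 as in equations (15)-(17).

    `detected` and `ground_truth` are lists of (onset_frame, midi_note)
    pairs; a detected note counts as correct when the same pair occurs
    in the ground truth."""
    correct = len(set(detected) & set(ground_truth))
    recall = correct / len(ground_truth) if ground_truth else 0.0
    precision = correct / len(detected) if detected else 0.0
    f1 = (2 * recall * precision / (recall + precision)
          if recall + precision else 0.0)
    return recall, precision, f1
```

The "Per. Oct" variant would compare pitch classes (`midi_note % 12`) instead of absolute MIDI notes, so that octave errors are not penalized.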

Figure 15. Note detection performance (F1, %) of the FFT-based and VRT-based variants as a function of the number of frequency samples.

A further increase of the FFT window size lowers the time resolution down to seconds, so that note changes quicker than 0.5 seconds cannot be resolved anymore. These experimental results show the advantage of our VRT: even in this simple use it performs multiple note detection quite well in the case of a low average polyphony rate.

V. CONCLUSION

In this paper we have introduced the Variable Resolution Transform, a novel signal processing technique specifically designed for music signal analysis. A music signal is characterized by four major properties: melody, harmony, rhythm and timbre. The classic Fast Fourier Transform, a de-facto standard in music signal analysis in the current literature, has as its main drawback a uniform time-frequency scale, which makes it impossible to perform efficient spectrum analysis together with good time resolution. The wavelet transform overcomes this limit by varying the scale of the mother-wavelet function and, hence, the effective window size. This kind of transform keeps frequency details in the low-frequency area of the spectrum as well as time localization information about quickly changing high-frequency components. However, the dramatic decrease of frequency resolution of the basic wavelet transform in the high-frequency area leads to confusion among high-order harmonic components, where a sufficient resolution is necessary for the analysis of the harmonic properties of a music signal. We have thus introduced our Variable Resolution Transform with a varying mother function. The law of variation is controlled by two parameters, linearity and exponentiality, which can be carefully chosen in order to adjust the frequency-time resolution grid of the VRT. Hence, our VRT combines the advantages of the classic continuous wavelet transform and of the windowed or short-time Fourier transform.
As an example of a direct VRT application, we have presented a VRT-based multiple-f0 estimation algorithm characterized by its simplicity, rapidity and high temporal resolution as opposed to FFT-based methods. It performs quite well in the detection of multiple pitches with non-integer ratios. However, like other similar

algorithms, our VRT-based multiple-f0 estimation algorithm does not solve the following problem: two notes at a distance of an octave can hardly be separated, because the second note does not bring any new harmonics into the spectrum but rather changes the amplitudes of the existing harmonics of the lower note; some knowledge of the instruments involved in the piece, or instrument recognition techniques and multi-channel source separation, is necessary to resolve this problem.

Our note detection mechanism was evaluated in its direct application, musical transcription from the signal. In this evaluation, the ground truth data was taken from note score (MIDI) files. These files, from various genres (mostly classical), were rendered into waves using high-quality wavetable synthesis. The resulting wave files were passed as input to the transcription algorithm. The results of the transcription and the ground-truth data were compared and a performance measure was calculated. Compared to the FFT, the VRT used in the described f0 estimation algorithm gives much higher results together with excellent time resolution. As a major drawback of the VRT, its considerable computational complexity could be mentioned. Nevertheless, it does not hamper real-time audio processing every 25 ms. We have also applied the VRT to the extraction of other music features, including timbre and tempo estimation for music similarity-based retrieval [25; 26]. In all these problems, the VRT has shown interesting properties for music signal analysis [thesis].

VI. REFERENCES

[1] Tanguiane A.S. Artificial perception and music recognition (Lecture Notes in Computer Science). Springer, October.
[2] Casagrande N., Eck D., Kegl B. Frame-level audio feature extraction using AdaBoost. Proceedings of the ISMIR International Conference on Music Information Retrieval (London) (2005).
[3] Logan B., Salomon A. A music similarity function based on signal analysis.
In Proceedings of the IEEE International Conference on Multimedia and Expo, ICME'01 (2001).
[4] Mandel M., Ellis D. Song-level features and support vector machines for music classification. Proceedings of the ISMIR International Conference on Music Information Retrieval (London) (2005).
[5] McKinney M.F., Breebaart J. Features for audio and music classification. Proceedings of the ISMIR International Conference on Music (2003).
[6] Meng A., Shawe-Taylor J. An investigation of feature models for music genre classification using the

support vector classifier. Proceedings of the ISMIR International Conference on Music Information Retrieval (London) (2005).
[7] Scaringella N., Zoia G. On the modeling of time information for automatic genre recognition systems in audio signals. Proceedings of the ISMIR International Conference on Music Information Retrieval (London) (2005).
[8] Tzanetakis G., Cook P. Automatic musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10 (2002), no. 5.
[9] West K., Cox S. Features and classifiers for the automatic classification of musical audio signals. Proceedings of the ISMIR International Conference on Music Information Retrieval (Barcelona, Spain) (2004).
[10] Foote J.T. Content-based retrieval of music and audio. Proceedings of SPIE Multimedia Storage and Archiving Systems II (Bellingham, WA), vol. 3229, SPIE (1997).
[11] Logan B. Mel frequency cepstral coefficients for music modeling. Proceedings of the ISMIR International Symposium on Music Information Retrieval (Plymouth, MA) (2000).
[12] Aucouturier J.J., Pachet F. Timbre similarity: how high is the sky? In JNRSAS (2004).
[13] Pampalk E. Computational models of music similarity and their application in music information retrieval. PhD thesis, Technischen Universitaet Wien, Fakultaet fuer Informatik.
[14] Kronland-Martinet R., Morlet J., Grossmann A. Analysis of sound patterns through wavelet transform. International Journal of Pattern Recognition and Artificial Intelligence, Vol. 1(2) (1987).
[15] Grimaldi M., Kokaram A., Cunningham P. Classifying music by genre using the wavelet packet transform and a round-robin ensemble. (2002).
[16] Kadambe S., Boudreaux-Bartels G.F. Application of the wavelet transform for pitch detection of speech signals. IEEE Transactions on Information Theory (1992) 38, no. 2.
[17] Mallat S.G. A wavelet tour of signal processing. Academic Press.
[18] Grossmann A., Morlet J. Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM J. Math. Anal. (1984) 15.
[19] Lang W.C., Forinash K.
Time-frequency analysis with the continuous wavelet transform. Am. J. Phys. (1998) 66(9).
[20] Tzanetakis G., Essl G., Cook P. Audio analysis using the discrete wavelet transform. WSES Int. Conf. Acoustics and Music: Theory 2001 and Applications (AMTA), Skiathos, Greece (2001).

[21] Brown J.C. Calculation of a constant Q spectral transform. J. Acoust. Soc. Am. (1991) 89(1).
[22] Nawab S.H., Ayyash S.H., Wotiz R. Identification of musical chords using constant-Q spectra. In Proc. ICASSP (2001).
[23] Essid S. Classification automatique des signaux audio-fréquences : reconnaissance des instruments de musique. PhD thesis, Informatique, Télécommunications et Électronique, ENST.
[24] Diniz F.C.C.B., Kothe I., Netto S.L., Biscainho L.P. High-selectivity filter banks for spectral analysis of music signals. EURASIP Journal on Advances in Signal Processing (2007).
[25] Paradzinets A., Harb H., Chen L. Use of continuous wavelet-like transform in automated music transcription. Proceedings of EUSIPCO (2006).
[26] Paradzinets A., Kotov O., Harb H., Chen L. Continuous wavelet-like transform based music similarity features for intelligent music navigation. In Proceedings of CBMI (2007).
[27] Abe T. et al. Robust pitch estimation with harmonics enhancement in noisy environments based on instantaneous frequency. In Proceedings of ICSLP'96 (1996).
[28] Hu J., Sheng Xu, Chen J. A modified pitch detection algorithm. IEEE Communications Letters (2001), Vol. 5, No. 2.
[29] Klapuri A. Pitch estimation using multiple independent time-frequency windows. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (1999).
[30] Lao W., Tan E.T., Kam A.H. Computationally inexpensive and effective scheme for automatic transcription of polyphonic music. Proceedings of ICME (2004).
[31] Goto M. A predominant-F0 estimation method for CD recordings: MAP estimation using EM algorithm for adaptive tone models. In Proceedings of ICASSP (2001).
[32] Li Y., Wang D. Pitch detection in polyphonic music using instrument tone models. In Proceedings of ICASSP (2007).
[33] Yeh C., Roebel A., Rodet X. Multiple fundamental frequency estimation of polyphonic music signals. In Proc. IEEE ICASSP (2005).
[34] Klapuri A. Signal processing methods for the automatic transcription of music.
PhD thesis, Tampere University of Technology, 2004.


More information

ON FINDING MELODIC LINES IN AUDIO RECORDINGS. Matija Marolt

ON FINDING MELODIC LINES IN AUDIO RECORDINGS. Matija Marolt ON FINDING MELODIC LINES IN AUDIO RECORDINGS Matija Marolt Faculty of Computer and Information Science University of Ljubljana, Slovenia matija.marolt@fri.uni-lj.si ABSTRACT The paper presents our approach

More information

Spectrum Analyser Basics

Spectrum Analyser Basics Hands-On Learning Spectrum Analyser Basics Peter D. Hiscocks Syscomp Electronic Design Limited Email: phiscock@ee.ryerson.ca June 28, 2014 Introduction Figure 1: GUI Startup Screen In a previous exercise,

More information

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University Week 14 Query-by-Humming and Music Fingerprinting Roger B. Dannenberg Professor of Computer Science, Art and Music Overview n Melody-Based Retrieval n Audio-Score Alignment n Music Fingerprinting 2 Metadata-based

More information

NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING

NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING Zhiyao Duan University of Rochester Dept. Electrical and Computer Engineering zhiyao.duan@rochester.edu David Temperley University of Rochester

More information

Supervised Learning in Genre Classification

Supervised Learning in Genre Classification Supervised Learning in Genre Classification Introduction & Motivation Mohit Rajani and Luke Ekkizogloy {i.mohit,luke.ekkizogloy}@gmail.com Stanford University, CS229: Machine Learning, 2009 Now that music

More information

Multiple instrument tracking based on reconstruction error, pitch continuity and instrument activity

Multiple instrument tracking based on reconstruction error, pitch continuity and instrument activity Multiple instrument tracking based on reconstruction error, pitch continuity and instrument activity Holger Kirchhoff 1, Simon Dixon 1, and Anssi Klapuri 2 1 Centre for Digital Music, Queen Mary University

More information

Chord Classification of an Audio Signal using Artificial Neural Network

Chord Classification of an Audio Signal using Artificial Neural Network Chord Classification of an Audio Signal using Artificial Neural Network Ronesh Shrestha Student, Department of Electrical and Electronic Engineering, Kathmandu University, Dhulikhel, Nepal ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Methods for the automatic structural analysis of music. Jordan B. L. Smith CIRMMT Workshop on Structural Analysis of Music 26 March 2010

Methods for the automatic structural analysis of music. Jordan B. L. Smith CIRMMT Workshop on Structural Analysis of Music 26 March 2010 1 Methods for the automatic structural analysis of music Jordan B. L. Smith CIRMMT Workshop on Structural Analysis of Music 26 March 2010 2 The problem Going from sound to structure 2 The problem Going

More information

Pitch. The perceptual correlate of frequency: the perceptual dimension along which sounds can be ordered from low to high.

Pitch. The perceptual correlate of frequency: the perceptual dimension along which sounds can be ordered from low to high. Pitch The perceptual correlate of frequency: the perceptual dimension along which sounds can be ordered from low to high. 1 The bottom line Pitch perception involves the integration of spectral (place)

More information

Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE, and Bryan Pardo, Member, IEEE

Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE, and Bryan Pardo, Member, IEEE IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 6, OCTOBER 2011 1205 Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE,

More information

Analysis, Synthesis, and Perception of Musical Sounds

Analysis, Synthesis, and Perception of Musical Sounds Analysis, Synthesis, and Perception of Musical Sounds The Sound of Music James W. Beauchamp Editor University of Illinois at Urbana, USA 4y Springer Contents Preface Acknowledgments vii xv 1. Analysis

More information

Voice & Music Pattern Extraction: A Review

Voice & Music Pattern Extraction: A Review Voice & Music Pattern Extraction: A Review 1 Pooja Gautam 1 and B S Kaushik 2 Electronics & Telecommunication Department RCET, Bhilai, Bhilai (C.G.) India pooja0309pari@gmail.com 2 Electrical & Instrumentation

More information

Book: Fundamentals of Music Processing. Audio Features. Book: Fundamentals of Music Processing. Book: Fundamentals of Music Processing

Book: Fundamentals of Music Processing. Audio Features. Book: Fundamentals of Music Processing. Book: Fundamentals of Music Processing Book: Fundamentals of Music Processing Lecture Music Processing Audio Features Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Meinard Müller Fundamentals

More information

HUMAN PERCEPTION AND COMPUTER EXTRACTION OF MUSICAL BEAT STRENGTH

HUMAN PERCEPTION AND COMPUTER EXTRACTION OF MUSICAL BEAT STRENGTH Proc. of the th Int. Conference on Digital Audio Effects (DAFx-), Hamburg, Germany, September -8, HUMAN PERCEPTION AND COMPUTER EXTRACTION OF MUSICAL BEAT STRENGTH George Tzanetakis, Georg Essl Computer

More information

Week 14 Music Understanding and Classification

Week 14 Music Understanding and Classification Week 14 Music Understanding and Classification Roger B. Dannenberg Professor of Computer Science, Music & Art Overview n Music Style Classification n What s a classifier? n Naïve Bayesian Classifiers n

More information

Krzysztof Rychlicki-Kicior, Bartlomiej Stasiak and Mykhaylo Yatsymirskyy Lodz University of Technology

Krzysztof Rychlicki-Kicior, Bartlomiej Stasiak and Mykhaylo Yatsymirskyy Lodz University of Technology Krzysztof Rychlicki-Kicior, Bartlomiej Stasiak and Mykhaylo Yatsymirskyy Lodz University of Technology 26.01.2015 Multipitch estimation obtains frequencies of sounds from a polyphonic audio signal Number

More information

MUSICAL INSTRUMENT RECOGNITION WITH WAVELET ENVELOPES

MUSICAL INSTRUMENT RECOGNITION WITH WAVELET ENVELOPES MUSICAL INSTRUMENT RECOGNITION WITH WAVELET ENVELOPES PACS: 43.60.Lq Hacihabiboglu, Huseyin 1,2 ; Canagarajah C. Nishan 2 1 Sonic Arts Research Centre (SARC) School of Computer Science Queen s University

More information

APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC

APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC Vishweshwara Rao, Sachin Pant, Madhumita Bhaskar and Preeti Rao Department of Electrical Engineering, IIT Bombay {vishu, sachinp,

More information

Lecture 9 Source Separation

Lecture 9 Source Separation 10420CS 573100 音樂資訊檢索 Music Information Retrieval Lecture 9 Source Separation Yi-Hsuan Yang Ph.D. http://www.citi.sinica.edu.tw/pages/yang/ yang@citi.sinica.edu.tw Music & Audio Computing Lab, Research

More information

HUMANS have a remarkable ability to recognize objects

HUMANS have a remarkable ability to recognize objects IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 9, SEPTEMBER 2013 1805 Musical Instrument Recognition in Polyphonic Audio Using Missing Feature Approach Dimitrios Giannoulis,

More information

SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION

SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION th International Society for Music Information Retrieval Conference (ISMIR ) SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION Chao-Ling Hsu Jyh-Shing Roger Jang

More information

IMPROVING GENRE CLASSIFICATION BY COMBINATION OF AUDIO AND SYMBOLIC DESCRIPTORS USING A TRANSCRIPTION SYSTEM

IMPROVING GENRE CLASSIFICATION BY COMBINATION OF AUDIO AND SYMBOLIC DESCRIPTORS USING A TRANSCRIPTION SYSTEM IMPROVING GENRE CLASSIFICATION BY COMBINATION OF AUDIO AND SYMBOLIC DESCRIPTORS USING A TRANSCRIPTION SYSTEM Thomas Lidy, Andreas Rauber Vienna University of Technology, Austria Department of Software

More information

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval DAY 1 Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval Jay LeBoeuf Imagine Research jay{at}imagine-research.com Rebecca

More information

Topic 4. Single Pitch Detection

Topic 4. Single Pitch Detection Topic 4 Single Pitch Detection What is pitch? A perceptual attribute, so subjective Only defined for (quasi) harmonic sounds Harmonic sounds are periodic, and the period is 1/F0. Can be reliably matched

More information

Lecture 10 Harmonic/Percussive Separation

Lecture 10 Harmonic/Percussive Separation 10420CS 573100 音樂資訊檢索 Music Information Retrieval Lecture 10 Harmonic/Percussive Separation Yi-Hsuan Yang Ph.D. http://www.citi.sinica.edu.tw/pages/yang/ yang@citi.sinica.edu.tw Music & Audio Computing

More information

OCTAVE C 3 D 3 E 3 F 3 G 3 A 3 B 3 C 4 D 4 E 4 F 4 G 4 A 4 B 4 C 5 D 5 E 5 F 5 G 5 A 5 B 5. Middle-C A-440

OCTAVE C 3 D 3 E 3 F 3 G 3 A 3 B 3 C 4 D 4 E 4 F 4 G 4 A 4 B 4 C 5 D 5 E 5 F 5 G 5 A 5 B 5. Middle-C A-440 DSP First Laboratory Exercise # Synthesis of Sinusoidal Signals This lab includes a project on music synthesis with sinusoids. One of several candidate songs can be selected when doing the synthesis program.

More information

Automatic Piano Music Transcription

Automatic Piano Music Transcription Automatic Piano Music Transcription Jianyu Fan Qiuhan Wang Xin Li Jianyu.Fan.Gr@dartmouth.edu Qiuhan.Wang.Gr@dartmouth.edu Xi.Li.Gr@dartmouth.edu 1. Introduction Writing down the score while listening

More information

Topic 11. Score-Informed Source Separation. (chroma slides adapted from Meinard Mueller)

Topic 11. Score-Informed Source Separation. (chroma slides adapted from Meinard Mueller) Topic 11 Score-Informed Source Separation (chroma slides adapted from Meinard Mueller) Why Score-informed Source Separation? Audio source separation is useful Music transcription, remixing, search Non-satisfying

More information

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Jana Eggink and Guy J. Brown Department of Computer Science, University of Sheffield Regent Court, 11

More information

Automatic Construction of Synthetic Musical Instruments and Performers

Automatic Construction of Synthetic Musical Instruments and Performers Ph.D. Thesis Proposal Automatic Construction of Synthetic Musical Instruments and Performers Ning Hu Carnegie Mellon University Thesis Committee Roger B. Dannenberg, Chair Michael S. Lewicki Richard M.

More information

Musical Sound: A Mathematical Approach to Timbre

Musical Sound: A Mathematical Approach to Timbre Sacred Heart University DigitalCommons@SHU Writing Across the Curriculum Writing Across the Curriculum (WAC) Fall 2016 Musical Sound: A Mathematical Approach to Timbre Timothy Weiss (Class of 2016) Sacred

More information

Violin Timbre Space Features

Violin Timbre Space Features Violin Timbre Space Features J. A. Charles φ, D. Fitzgerald*, E. Coyle φ φ School of Control Systems and Electrical Engineering, Dublin Institute of Technology, IRELAND E-mail: φ jane.charles@dit.ie Eugene.Coyle@dit.ie

More information

Music Alignment and Applications. Introduction

Music Alignment and Applications. Introduction Music Alignment and Applications Roger B. Dannenberg Schools of Computer Science, Art, and Music Introduction Music information comes in many forms Digital Audio Multi-track Audio Music Notation MIDI Structured

More information

Using the new psychoacoustic tonality analyses Tonality (Hearing Model) 1

Using the new psychoacoustic tonality analyses Tonality (Hearing Model) 1 02/18 Using the new psychoacoustic tonality analyses 1 As of ArtemiS SUITE 9.2, a very important new fully psychoacoustic approach to the measurement of tonalities is now available., based on the Hearing

More information

Simple Harmonic Motion: What is a Sound Spectrum?

Simple Harmonic Motion: What is a Sound Spectrum? Simple Harmonic Motion: What is a Sound Spectrum? A sound spectrum displays the different frequencies present in a sound. Most sounds are made up of a complicated mixture of vibrations. (There is an introduction

More information

Tempo and Beat Tracking

Tempo and Beat Tracking Tutorial Automatisierte Methoden der Musikverarbeitung 47. Jahrestagung der Gesellschaft für Informatik Tempo and Beat Tracking Meinard Müller, Christof Weiss, Stefan Balke International Audio Laboratories

More information

Music Database Retrieval Based on Spectral Similarity

Music Database Retrieval Based on Spectral Similarity Music Database Retrieval Based on Spectral Similarity Cheng Yang Department of Computer Science Stanford University yangc@cs.stanford.edu Abstract We present an efficient algorithm to retrieve similar

More information

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Introduction In this project we were interested in extracting the melody from generic audio files. Due to the

More information

Music Complexity Descriptors. Matt Stabile June 6 th, 2008

Music Complexity Descriptors. Matt Stabile June 6 th, 2008 Music Complexity Descriptors Matt Stabile June 6 th, 2008 Musical Complexity as a Semantic Descriptor Modern digital audio collections need new criteria for categorization and searching. Applicable to:

More information

MUSIC TRANSCRIPTION USING INSTRUMENT MODEL

MUSIC TRANSCRIPTION USING INSTRUMENT MODEL MUSIC TRANSCRIPTION USING INSTRUMENT MODEL YIN JUN (MSc. NUS) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF COMPUTER SCIENCE DEPARTMENT OF SCHOOL OF COMPUTING NATIONAL UNIVERSITY OF SINGAPORE 4 Acknowledgements

More information

POLYPHONIC TRANSCRIPTION BASED ON TEMPORAL EVOLUTION OF SPECTRAL SIMILARITY OF GAUSSIAN MIXTURE MODELS

POLYPHONIC TRANSCRIPTION BASED ON TEMPORAL EVOLUTION OF SPECTRAL SIMILARITY OF GAUSSIAN MIXTURE MODELS 17th European Signal Processing Conference (EUSIPCO 29) Glasgow, Scotland, August 24-28, 29 POLYPHOIC TRASCRIPTIO BASED O TEMPORAL EVOLUTIO OF SPECTRAL SIMILARITY OF GAUSSIA MIXTURE MODELS F.J. Cañadas-Quesada,

More information

Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting

Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting Dalwon Jang 1, Seungjae Lee 2, Jun Seok Lee 2, Minho Jin 1, Jin S. Seo 2, Sunil Lee 1 and Chang D. Yoo 1 1 Korea Advanced

More information

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 04, April -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 MUSICAL

More information

Automatic characterization of ornamentation from bassoon recordings for expressive synthesis

Automatic characterization of ornamentation from bassoon recordings for expressive synthesis Automatic characterization of ornamentation from bassoon recordings for expressive synthesis Montserrat Puiggròs, Emilia Gómez, Rafael Ramírez, Xavier Serra Music technology Group Universitat Pompeu Fabra

More information

Singer Traits Identification using Deep Neural Network

Singer Traits Identification using Deep Neural Network Singer Traits Identification using Deep Neural Network Zhengshan Shi Center for Computer Research in Music and Acoustics Stanford University kittyshi@stanford.edu Abstract The author investigates automatic

More information

Statistical Modeling and Retrieval of Polyphonic Music

Statistical Modeling and Retrieval of Polyphonic Music Statistical Modeling and Retrieval of Polyphonic Music Erdem Unal Panayiotis G. Georgiou and Shrikanth S. Narayanan Speech Analysis and Interpretation Laboratory University of Southern California Los Angeles,

More information

We realize that this is really small, if we consider that the atmospheric pressure 2 is

We realize that this is really small, if we consider that the atmospheric pressure 2 is PART 2 Sound Pressure Sound Pressure Levels (SPLs) Sound consists of pressure waves. Thus, a way to quantify sound is to state the amount of pressure 1 it exertsrelatively to a pressure level of reference.

More information

Investigation of Digital Signal Processing of High-speed DACs Signals for Settling Time Testing

Investigation of Digital Signal Processing of High-speed DACs Signals for Settling Time Testing Universal Journal of Electrical and Electronic Engineering 4(2): 67-72, 2016 DOI: 10.13189/ujeee.2016.040204 http://www.hrpub.org Investigation of Digital Signal Processing of High-speed DACs Signals for

More information