Automatic music transcription
1 Introduction to music transcription

Sources:
- Klapuri: Introduction to Music Transcription
- Klapuri, Eronen, Astola: Analysis of the Meter of Acoustic Musical Signals, IEEE TASLP
- Klapuri: Multiple Fundamental Frequency Estimation by Summing Harmonic Amplitudes, ISMIR
- Ryynänen, Klapuri: Automatic Transcription of Melody, Bass Line, and Chords in Polyphonic Music, Computer Music Journal

Contents:
- Introduction to music transcription
- Rhythm analysis
- Multiple-F0 analysis
- Acoustic and musicological models
- Vocals separation and lyrics
- Application to music retrieval

Example: an excerpt from song #034 in the RWC Popular Music database, shown top-down as (1) the time-domain signal, (2) a spectrogram, (3) musical notation, and (4) a piano roll.

Complete vs. partial transcription: a complete transcription is sometimes impossible or irrelevant; a partial transcription targets, for example, only the melody, bass line, percussion, or chords.

Applications and related areas: music retrieval, structured audio coding, intelligent processing and effects, stage lighting, automatic accompaniment, musical equipment, computer games, and music perception research.
Perspectives on music transcription

Music transcription is a wide topic. It is useful to structure the problem by decomposing it into smaller and more tractable subproblems.

Speech recognition systems depend on language models, e.g. probabilities of different word sequences (N-gram models). Musicological information is equally important for music transcription, e.g. probabilities of tone sequences or tone combinations, together with instrument models. Conceptually, the analysis maps the acoustic signal to a result with the help of such internal models.

2 Onset detection and meter analysis

Onset detection is the detection of the beginnings of sounds in an acoustic signal. Meter analysis estimates the metrical pulse, possibly at several time scales; beat tracking, for example, corresponds to tapping a foot to music. The overall task is to detect moments of musical stress in an audio signal and to discover the underlying periodicities in them.

Applications:
- beat-synchronous feature extraction
- a temporal framework for audio editing
- synchronization of audio with audio or with video
Musical meter is a hierarchical structure: pulse sensations at different time scales.
- tactus: the most prominent level (the "foot tapping rate")
- tatum: the time quantum (the fastest pulse)
- measure pulse: related to the rate of harmonic change

Measuring the degree of change in music

Moments of change are important for both onset detection and meter analysis: changes in the intensity, pitch, or timbre of a sound. Moments of musical stress (accents) are caused by the beginnings of sound events, sudden changes in loudness or timbre, and harmonic changes. The perceptual change should be estimated, so as to detect what humans detect and to ignore what humans ignore; this enables musically meaningful rhythmic parsing. The basic idea is then to analyse the periodicity of this change signal, which characterizes the temporal regularity of the moments of stress.

Starting from the time-domain signal, some data reduction is needed, but the power envelope of the full-band signal is not sufficient. Hearing is frequency selective: the audibility of a change at a critical band is affected only by the spectral components within that band. Components within a single critical band may mask each other, but this does not happen if their frequency separation is sufficiently large. The degree of change is therefore measured independently at each critical band, and the results are then combined.

Scheirer showed that the perceived rhythmic content of many music types remains the same if only the power envelopes of a few subbands are preserved and used to modulate a white noise signal; one band is not enough, and the observation applies to music with a strong beat.
Measuring the degree of change: in practice

Filterbank: Fourier transforms are taken in successive ~20 ms time frames (50% overlap). In each frame n, the power x_b(n) is measured within b = 1, 2, ..., 36 triangular-response bandpass filters that are uniformly distributed on the Mel frequency scale (50 Hz - 20 kHz):

    f_Mel = 2595 log10(1 + f_Hz / 700)

How should the degree of change at a subband be measured? A plain differential is not ideal. For humans, the smallest detectable change in intensity, dI, is approximately proportional to the intensity I of the signal; the same amount of increase is more prominent in a quiet signal, i.e. the audible ratio dI/I is approximately constant. It is therefore reasonable to normalize the differential of the power with the power, which equals differentiating the logarithm:

    (d/dt x_b(n)) / x_b(n) = d/dt ln x_b(n)

(Figure, piano onset: the dashed line shows (d/dt) x_b(n), the solid line (d/dt) ln x_b(n).)

A numerically robust way of calculating the logarithm is mu-law compression,

    y_b(n) = ln(1 + mu * x_b(n)) / ln(1 + mu)

where the constant mu determines the degree of compression.

Next, differentiate and retain only the positive changes, using half-wave rectification HWR(x) = max(x, 0):

    y'_b(n) = HWR{ y_b(n) - y_b(n-1) }

Finally, sum across the channels to estimate the overall degree of change:

    v(n) = sum_{b=1..36} y'_b(n)
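The chain above (mu-law compression, differentiation, half-wave rectification, summation across bands) can be sketched in NumPy. The band powers, the number of bands, and the value of mu below are toy assumptions for illustration, not the lecture's configuration:

```python
import numpy as np

def accent_signal(band_powers, mu=100.0):
    """Degree-of-change ("accent") signal from bandwise power envelopes.

    band_powers: array of shape (n_bands, n_frames) holding the power
    x_b(n) of band b in frame n (36 Mel bands in the lecture's setup).
    """
    # mu-law compression: y_b(n) = ln(1 + mu * x_b(n)) / ln(1 + mu)
    y = np.log1p(mu * band_powers) / np.log1p(mu)
    # differentiate along time and half-wave rectify (keep positive changes)
    dy = np.diff(y, axis=1, prepend=y[:, :1])
    dy = np.maximum(dy, 0.0)
    # sum across bands -> overall degree of change v(n)
    return dy.sum(axis=0)

# toy example: 4 bands with an "onset" (power jump) at frame 10
x = np.full((4, 30), 1e-3)
x[:, 10:] = 1.0
v = accent_signal(x)
print(int(np.argmax(v)))  # -> 10, the accent peaks at the onset frame
```

Because only positive changes survive the rectification, v(n) is zero wherever the band powers are flat or decaying, which is exactly the behaviour wanted from an accent signal.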
Measured change signals

The accent signal v(n) describes the degree of accent as a function of time, computed as described above. In practice, adaptation to the overall signal level would additionally be needed.

Pulse strengths ("saliences"): bank of comb filters

A bank of comb filters is used for periodicity analysis of the accent signal. A feedback comb filter with delay k and gain a,

    y(n) = (1 - a) x(n) + a y(n - k),

resonates at period k and its integer multiples (the magnitude and impulse responses are illustrated for a = 0.9, k = 7). We used a = 0.5^(k/T), where T is the half-time in samples (corresponding to 3 s), so that all resonators in the bank decay at the same rate. The time-varying energies of the resonator outputs give the strengths ("saliences") of the different metrical pulses at time n.
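A minimal sketch of such a resonator bank follows; the 100 Hz accent-signal frame rate (so that the 3 s half-time is 300 samples) and the lag range are illustrative assumptions:

```python
import numpy as np

def comb_filter_bank(x, lags, half_time=300):
    """Output energies r(tau) of a bank of feedback comb filters.

    Each filter implements y(n) = (1 - a) x(n) + a y(n - k), with the
    gain a = 0.5 ** (k / half_time) so that every resonator has the same
    half-time (300 samples = 3 s at an assumed 100 Hz frame rate).
    """
    energies = []
    for k in lags:
        a = 0.5 ** (k / half_time)
        y = np.zeros(len(x))
        for n in range(len(x)):
            fb = y[n - k] if n >= k else 0.0
            y[n] = (1.0 - a) * x[n] + a * fb
        energies.append(np.mean(y ** 2))  # energy of the filter output
    return np.array(energies)

# an impulse train with period 24 resonates most strongly at lag 24
x = np.zeros(600)
x[::24] = 1.0
r = comb_filter_bank(x, lags=range(10, 40))
print(10 + int(np.argmax(r)))  # -> 24
```

Note that the raw energies still carry a lag-dependent trend (visible with white-noise input), which is why the lecture normalizes r(tau, n) before higher-level modeling.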
Bank of comb filters (continued)

Denote by r(tau, n) the time-varying energy of the comb filter with delay tau at time n. Figure: r(tau, n) for tau = 1, 2, ..., 100, for an impulse train with period 24 samples and for white noise. r(tau, n) can be further normalized to remove the trend as a function of tau (the details are beyond the scope of this course).

Higher-level modeling

The (normalized) comb filter energies r(tau, n) are the observations in a probabilistic model of the meter (the tatum, tactus, and measure pulses). Prior probabilities encode typical tempo values (a log-normal distribution), and temporal continuity constraints p(next tempo | previous tempo) favor smooth tempo trajectories. Demonstrations were given in the lecture.
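The combination of observations, a period prior, and continuity constraints can be decoded with the Viterbi algorithm. The sketch below is a simplified single-level tracker (tactus only); the prior and transition widths are illustrative values, not the lecture's:

```python
import numpy as np

def viterbi_tempo(r, periods, prior_mu, prior_sigma=0.6, trans_sigma=0.05):
    """Viterbi decoding of a beat-period track.

    r[t, i]: comb-filter energy of period candidate periods[i] (in
    seconds) at frame t. A log-normal prior on the period and a Gaussian
    penalty on log-period jumps between frames play the roles of the
    tempo prior and the continuity constraint.
    """
    logp = np.log(periods)
    prior = -0.5 * ((logp - prior_mu) / prior_sigma) ** 2
    trans = -0.5 * ((logp[None, :] - logp[:, None]) / trans_sigma) ** 2
    obs = np.log(np.asarray(r) + 1e-12)
    delta = obs[0] + prior
    psi = np.zeros(r.shape, dtype=int)
    for t in range(1, len(r)):
        score = delta[:, None] + trans      # score[i, j]: from state i to j
        psi[t] = np.argmax(score, axis=0)
        delta = np.max(score, axis=0) + obs[t] + prior
    path = [int(np.argmax(delta))]
    for t in range(len(r) - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return [periods[i] for i in reversed(path)]

# the 0.5 s period dominates except for one noisy frame,
# which the continuity constraint smooths over
periods = [0.25, 0.5, 1.0]
r = np.tile([0.2, 1.0, 0.2], (8, 1)).astype(float)
r[4] = [1.05, 1.0, 0.2]
track = viterbi_tempo(r, periods, prior_mu=np.log(0.5))
print(track)
```

The single stronger observation at frame 4 is not enough to overcome the jump penalty, so the decoded track stays on the 0.5 s period throughout.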
3 Polyphonic pitch analysis

Pitch information is an essential part of almost all Western music, but extracting it from recorded audio is hard: a spectrogram can be calculated straightforwardly, whereas a piano roll is much trickier to obtain. Multiple-F0 estimation means F0 estimation in polyphonic signals; music involves a variety of sources, a wide pitch range, and the presence of drums. A number of completely different approaches have been proposed in the literature.

Musical sounds

Most Western instruments produce harmonic sounds. Figure: a trumpet sound (260 Hz) in the time and frequency domains; the period in the time domain is 1/F0, and the spacing of the harmonic partials in the frequency domain is F0.

How about just the autocorrelation function (ACF)?

ACF-based algorithms are among the most frequently used single-pitch estimators. Usually the lag of the maximum value of the ACF is taken as the period 1/F0. The short-time ACF r(tau) of a discrete time-domain signal x(n) is

    r(tau) = (1/N) sum_{n=0..N-1} x(n) x(n + tau)

(Figure: a signal x(n), the vowel [ae], with its spectrum and its ACF.)
Autocorrelation function

The short-time ACF within a time frame of length N,

    r(tau) = sum_{n=0..N-1} x(n) x(n + tau),

can be computed for real-valued signals via the FFT as

    r(tau) = IDFT{ |X(k)|^2 } = (1/K) sum_{k=0..K-1} |X(k)|^2 cos(2 pi k tau / K)

where IDFT is the inverse discrete Fourier transform and X(k) is the DFT of x(n), zero-padded so that the FFT length K is twice the length of x. The latter identity holds only for real-valued (audio) signals.

From this frequency-domain interpretation we can see at least three properties of the ACF that make it non-robust for the period analysis of polyphonic audio:
1. the entire spectrum is used (weighted with cosine values between -1 and 1)
2. all integer multiples of f_s/tau are given the same (unity) weight
3. squaring the spectrum emphasizes timbral properties (formants etc.)

In the following, a more reliable method* is described that makes three basic modifications to the ACF to enhance its robustness: sharper peaks (cf. a comb filter), harmonic weights g(tau, m) instead of unity weights, and flattening of the spectrum instead of squaring it.

Proposed method: summing harmonic amplitudes

The starting point is conceptually very simple:
1. The input signal is first spectrally flattened ("whitened") to suppress timbral information.
2. The salience of each F0 candidate is then calculated by harmonic summation, as described next.
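The FFT-based ACF above can be sketched as follows; the test signal and frame length are arbitrary choices:

```python
import numpy as np

def acf_via_fft(x):
    """Short-time ACF r(tau) = sum_n x(n) x(n + tau) via the FFT.

    Zero-pads to twice the frame length so that the circular correlation
    equals the linear one; valid for real-valued (audio) signals.
    """
    K = 2 * len(x)
    X = np.fft.rfft(x, n=K)
    r = np.fft.irfft(np.abs(X) ** 2, n=K)
    return r[:len(x)]

# a 100 Hz sinusoid at fs = 8000 Hz has a period of 80 samples
fs, f0 = 8000, 100
x = np.sin(2 * np.pi * f0 * np.arange(1024) / fs)
r = acf_via_fft(x)
# the strongest peak away from lag 0 sits at the fundamental period
lag = 20 + int(np.argmax(r[20:200]))
print(lag)  # -> 80
```

For this clean monophonic signal the maximum-lag rule works; the three weaknesses listed above only bite once several harmonic sources are mixed.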
The salience (strength) of an F0 candidate with period tau is calculated as a weighted sum of the amplitudes of its harmonic partials:

    s(tau) = sum_{m=1..M} g(tau, m) |Y(f_{tau,m})|

where f_{tau,m} = m f_s / tau is the frequency of the m:th harmonic partial of the F0 candidate f_s/tau, f_s is the sampling rate, the function g(tau, m) defines the weight of partial m of period tau in the sum, and Y(f) is the short-time Fourier transform of the whitened time-domain signal.

The basic idea of harmonic summation is intuitively appealing: pitch perception is closely related to the time-domain periodicity of sounds, and the Fourier theorem states that a periodic signal can be represented with spectral components at integer multiples of the inverse of the period. The question of an optimal mapping of the Fourier spectrum to a pitch spectrum (or a piano roll) is closely related to these methods; here the function g(tau, m) is learned by brute-force optimization (figure for 300 Hz), yielding the form

    g(tau, m) = (f_s/tau + g_1) / (m f_s/tau + g_2)

with constants g_1 and g_2.

* Klapuri, A., "Multiple fundamental frequency estimation by summing harmonic amplitudes," 7th International Conference on Music Information Retrieval (ISMIR), Victoria, Canada, Oct. 2006.
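The whitening step (detailed below) and the harmonic summation can be sketched together. The sketch makes simplifying assumptions: contiguous equal-width subbands for the whitening, a nearest-bin lookup for each partial, and the weight constants g1 = 52 Hz and g2 = 320 Hz from the published ISMIR 2006 paper, which may differ from the course's optimized values:

```python
import numpy as np

def whiten(Y_mag, n_bands=30, nu=0.3):
    """Crude spectral whitening: scale each subband of the magnitude
    spectrum by sigma_b**(nu - 1). Equal-width bands are an assumption;
    the lecture's subbands are not uniformly spaced."""
    Y = np.asarray(Y_mag, dtype=float)
    out = np.zeros_like(Y)
    edges = np.linspace(0, len(Y), n_bands + 1).astype(int)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sigma = np.sqrt(np.mean(Y[lo:hi] ** 2))
        out[lo:hi] = (sigma ** (nu - 1.0) if sigma > 0 else 0.0) * Y[lo:hi]
    return out

def salience(Y_mag, fs, periods, M=8, g1=52.0, g2=320.0):
    """s(tau) = sum_m g(tau, m) |Y(m fs/tau)|, with the weight
    g(tau, m) = (fs/tau + g1) / (m fs/tau + g2)."""
    K = 2 * (len(Y_mag) - 1)
    s = np.zeros(len(periods))
    for i, tau in enumerate(periods):
        f0 = fs / tau
        for m in range(1, M + 1):
            k = int(round(m * K / tau))   # bin nearest the m:th partial
            if k >= len(Y_mag):
                break
            s[i] += (f0 + g1) / (m * f0 + g2) * Y_mag[k]
    return s

# a harmonic tone at 200 Hz (period 40 samples at fs = 8000 Hz)
fs, N = 8000, 4096
t = np.arange(N) / fs
x = sum(np.sin(2 * np.pi * 200 * h * t) / h for h in range(1, 6))
Y = whiten(np.abs(np.fft.rfft(x * np.hanning(N))))
periods = list(range(20, 80))
tau0 = periods[int(np.argmax(salience(Y, fs, periods)))]
print(fs / tau0)  # -> 200.0
```

The decaying weight g(tau, m) is what keeps subharmonic candidates (which also hit every true partial) from outscoring the true period.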
Proposed method: spectral whitening

One of the big challenges in F0 estimation is making the system robust to different sound sources. A way to achieve this is to suppress timbral information prior to the actual F0 estimation:
1. Calculate the DFT X(k) of the input signal x(n).
2. Calculate the standard deviations sigma_b (= sqrt of the power) within subbands in the frequency domain: square and sum the frequency bins within each band, then take the square root.
3. Calculate bandwise compression coefficients gamma_b = sigma_b^(nu - 1), where nu = 0.3 is a parameter determining the amount of spectral whitening.
4. The whitened spectrum Y(k) is obtained by weighting each subband with its compression coefficient and then recombining the subbands.

Proposed method: calculation of the F0 salience function

In practice the salience is calculated from the discrete spectrum as

    s(tau) = sum_{m=1..M} g(tau, m) max_{k in kappa_{tau,m}} |Y(k)|

where the set kappa_{tau,m} defines a range of frequency bins in the vicinity of the m:th overtone of the F0 candidate f_s/tau:

    kappa_{tau,m} = [ round(m K / (tau + dtau/2)), ..., round(m K / (tau - dtau/2)) ]

where round(.) denotes rounding and dtau denotes the spacing between fundamental period candidates (dtau = 1 or 0.5 samples). The weight function, found by optimization (figure for 300 Hz), takes the form g(tau, m) = (f_s/tau + g_1) / (m f_s/tau + g_2).

Proposed method: predominant F0 estimation

The maximum of the salience function s(tau) is a quite robust indicator of one of the correct F0s in a polyphonic audio signal; predominant F0 estimation means finding one (any) of the correct F0s. However, the second- or third-highest peak is often due to the same sound, located at a tau that is half or twice the position of the highest peak. Multiple-F0 estimation accuracy can therefore be improved with an iterative estimation and cancellation scheme, where each detected sound is cancelled from the mixture and s(tau) is updated accordingly before deciding the next F0.

Iterative estimation and cancellation

Step 1: The residual spectrum Y_R(k) is initialized to Y(k). A spectrum of detected sounds, Y_D(k), is initialized to zero.
Step 2: The fundamental period tau_0 is estimated by computing s(tau) from the residual Y_R(k); the maximum of s(tau) determines tau_0.
Step 3: The harmonic partials of tau_0 are located at the bins m K / tau_0, m = 1, 2, ..., M. The spectrum of the time-domain window function is translated to those frequencies, weighted by g(tau_0, m), and added to Y_D(k).
Step 4: The residual spectrum is updated as Y_R(k) <- max(0, Y(k) - d * Y_D(k)), where d = 0.2 is a free parameter.
Step 5: Return to Step 2 to estimate the next F0.
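The loop can be sketched as below. For simplicity this sketch works directly on the magnitude spectrum (no whitening), replaces the window-spectrum translation of Step 3 by copying the spectral neighbourhood of each detected partial into Y_D, and subtracts fully (d = 1); the lecture's d = 0.2 belongs with the window-spectrum model. The weight constants follow the published paper:

```python
import numpy as np

def iterative_f0s(Y, fs, periods, n_sounds=2, M=8, g1=52.0, g2=320.0, width=2):
    """Simplified estimation-and-cancellation loop over a magnitude
    spectrum Y: detect the predominant period from the residual, then
    cancel its partials before detecting the next one."""
    K = 2 * (len(Y) - 1)
    Y = np.asarray(Y, dtype=float)
    Y_R, Y_D = Y.copy(), np.zeros_like(Y)
    f0s = []
    for _ in range(n_sounds):
        # Step 2: predominant period from the residual's salience
        s = []
        for tau in periods:
            f0, total = fs / tau, 0.0
            for m in range(1, M + 1):
                k = int(round(m * K / tau))
                if k < len(Y):
                    total += (f0 + g1) / (m * f0 + g2) * Y_R[k]
            s.append(total)
        tau0 = periods[int(np.argmax(s))]
        f0s.append(fs / tau0)
        # Step 3: accumulate the detected partials into Y_D
        for m in range(1, M + 1):
            k = int(round(m * K / tau0))
            if k < len(Y):
                Y_D[max(0, k - width):k + width + 1] = \
                    Y[max(0, k - width):k + width + 1]
        # Step 4: update the residual
        Y_R = np.maximum(0.0, Y - Y_D)
    return f0s

# a mixture of two harmonic tones, 200 Hz and 320 Hz
fs, N = 8000, 4096
t = np.arange(N) / fs
x = sum(np.sin(2 * np.pi * f * h * t) / h
        for f in (200.0, 320.0) for h in range(1, 5))
Y = np.abs(np.fft.rfft(x * np.hanning(N)))
f0s = iterative_f0s(Y, fs, list(range(20, 60)))
print(sorted(f0s))  # -> [200.0, 320.0]
```

Once the first sound's partials are cancelled, the salience of its period (and of its sub- and superharmonics) collapses, letting the second sound's period win the next iteration.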
Iterative estimation and cancellation (continued)

Figure: the salience function and the residual spectrum Y_R(k) after the first, second, third, and fourth iterations.

"F0-gram": a piano roll with confidence levels

Figures: F0-grams (piano rolls colored by frame-wise salience) for RWC-P #25 and RWC-P #95.
Remarks

The principle of summing harmonic amplitudes is very simple, yet it suffices for predominant-F0 estimation in polyphonic signals, provided that the weights g(tau, m) are appropriate. Iterative detection and cancellation helps to remove the harmonics and subharmonics of already-detected sounds and to reveal the remaining sounds behind the most prominent ones. The method is reasonably accurate for a wide range of instruments and F0s.

4 Acoustic and musicological modeling

Why acoustic modeling of notes? The frame-wise F0 strengths must be processed into discrete notes (MIDI, a score): pitch quantization, onset and offset detection, and cleaning up frame-wise errors.

Acoustic modeling of notes:
1. Extract the frame-wise F0 salience (strength) and its differential (here without peak-picking or iterative cancellation).
2. Use training data, the RWC Popular Music database (100 pieces with audio and time-aligned MIDI), to learn acoustic models for note events.

The examples in the following are from Ryynänen, M. and Klapuri, A., "Automatic transcription of melody, bass line, and chords in polyphonic music," Computer Music Journal, 32(3), Fall 2008, and from Ryynänen and Klapuri's WASPAA paper.
Music transcription system

The system combines an acoustic model with a musicological model (both HMM-based). The musicological model performs musical key estimation and applies N-gram models to note sequences.

Transcription examples: complete polyphonic transcription, and transcription of the melody, bass, and chords.

Case study: singing transcription

(Ryynänen, Klapuri, "Modeling of note events for singing transcription," SAPA Workshop.)

The pipeline: acoustic signal -> feature extraction (pitch, voicing, accent, meter) -> probabilistic models -> discrete note sequence. The estimated pitch track has to be post-processed to obtain notes.
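As a much simpler stand-in for the HMM-based note models, the post-processing of a pitch track into notes can be illustrated with plain quantization and segmentation. The frame length and minimum-duration threshold below are arbitrary illustrative choices:

```python
import numpy as np

def track_to_notes(f0_track, frame_dur=0.01, min_dur=0.05):
    """Naive pitch-track post-processing: quantize each voiced frame to
    the nearest MIDI note number, merge runs of equal pitch, and drop
    very short notes (a crude substitute for the note-event HMMs)."""
    midi = [int(round(69 + 12 * np.log2(f / 440.0))) if f > 0 else None
            for f in f0_track]
    notes, start = [], 0
    for i in range(1, len(midi) + 1):
        if i == len(midi) or midi[i] != midi[start]:
            dur = (i - start) * frame_dur
            if midi[start] is not None and dur >= min_dur:
                # (midi number, onset time, duration)
                notes.append((midi[start], start * frame_dur, dur))
            start = i
    return notes

# 0.2 s of A4 (440 Hz), a 10 ms octave-error glitch, then 0.2 s of C5
track = [440.0] * 20 + [880.0] + [523.25] * 20
print(track_to_notes(track))
```

The minimum-duration rule swallows the single-frame octave error, which is the kind of frame-wise cleanup that the probabilistic note models perform in a principled way.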
Audio examples: "Brother, Can You Spare a Dime" and "Pieni tytön tylleröinen".
More informationCS229 Project Report Polyphonic Piano Transcription
CS229 Project Report Polyphonic Piano Transcription Mohammad Sadegh Ebrahimi Stanford University Jean-Baptiste Boin Stanford University sadegh@stanford.edu jbboin@stanford.edu 1. Introduction In this project
More informationUsing the new psychoacoustic tonality analyses Tonality (Hearing Model) 1
02/18 Using the new psychoacoustic tonality analyses 1 As of ArtemiS SUITE 9.2, a very important new fully psychoacoustic approach to the measurement of tonalities is now available., based on the Hearing
More informationInvestigation of Digital Signal Processing of High-speed DACs Signals for Settling Time Testing
Universal Journal of Electrical and Electronic Engineering 4(2): 67-72, 2016 DOI: 10.13189/ujeee.2016.040204 http://www.hrpub.org Investigation of Digital Signal Processing of High-speed DACs Signals for
More informationA Beat Tracking System for Audio Signals
A Beat Tracking System for Audio Signals Simon Dixon Austrian Research Institute for Artificial Intelligence, Schottengasse 3, A-1010 Vienna, Austria. simon@ai.univie.ac.at April 7, 2000 Abstract We present
More informationKrzysztof Rychlicki-Kicior, Bartlomiej Stasiak and Mykhaylo Yatsymirskyy Lodz University of Technology
Krzysztof Rychlicki-Kicior, Bartlomiej Stasiak and Mykhaylo Yatsymirskyy Lodz University of Technology 26.01.2015 Multipitch estimation obtains frequencies of sounds from a polyphonic audio signal Number
More informationFurther Topics in MIR
Tutorial Automatisierte Methoden der Musikverarbeitung 47. Jahrestagung der Gesellschaft für Informatik Further Topics in MIR Meinard Müller, Christof Weiss, Stefan Balke International Audio Laboratories
More informationStudy of White Gaussian Noise with Varying Signal to Noise Ratio in Speech Signal using Wavelet
American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629
More informationLecture 9 Source Separation
10420CS 573100 音樂資訊檢索 Music Information Retrieval Lecture 9 Source Separation Yi-Hsuan Yang Ph.D. http://www.citi.sinica.edu.tw/pages/yang/ yang@citi.sinica.edu.tw Music & Audio Computing Lab, Research
More informationSubjective Similarity of Music: Data Collection for Individuality Analysis
Subjective Similarity of Music: Data Collection for Individuality Analysis Shota Kawabuchi and Chiyomi Miyajima and Norihide Kitaoka and Kazuya Takeda Nagoya University, Nagoya, Japan E-mail: shota.kawabuchi@g.sp.m.is.nagoya-u.ac.jp
More informationMusic Representations
Advanced Course Computer Science Music Processing Summer Term 00 Music Representations Meinard Müller Saarland University and MPI Informatik meinard@mpi-inf.mpg.de Music Representations Music Representations
More informationSemi-supervised Musical Instrument Recognition
Semi-supervised Musical Instrument Recognition Master s Thesis Presentation Aleksandr Diment 1 1 Tampere niversity of Technology, Finland Supervisors: Adj.Prof. Tuomas Virtanen, MSc Toni Heittola 17 May
More informationMUSI-6201 Computational Music Analysis
MUSI-6201 Computational Music Analysis Part 9.1: Genre Classification alexander lerch November 4, 2015 temporal analysis overview text book Chapter 8: Musical Genre, Similarity, and Mood (pp. 151 155)
More informationDrum Source Separation using Percussive Feature Detection and Spectral Modulation
ISSC 25, Dublin, September 1-2 Drum Source Separation using Percussive Feature Detection and Spectral Modulation Dan Barry φ, Derry Fitzgerald^, Eugene Coyle φ and Bob Lawlor* φ Digital Audio Research
More informationPOLYPHONIC TRANSCRIPTION BASED ON TEMPORAL EVOLUTION OF SPECTRAL SIMILARITY OF GAUSSIAN MIXTURE MODELS
17th European Signal Processing Conference (EUSIPCO 29) Glasgow, Scotland, August 24-28, 29 POLYPHOIC TRASCRIPTIO BASED O TEMPORAL EVOLUTIO OF SPECTRAL SIMILARITY OF GAUSSIA MIXTURE MODELS F.J. Cañadas-Quesada,
More informationAPPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC
APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC Vishweshwara Rao, Sachin Pant, Madhumita Bhaskar and Preeti Rao Department of Electrical Engineering, IIT Bombay {vishu, sachinp,
More informationOnset Detection and Music Transcription for the Irish Tin Whistle
ISSC 24, Belfast, June 3 - July 2 Onset Detection and Music Transcription for the Irish Tin Whistle Mikel Gainza φ, Bob Lawlor*, Eugene Coyle φ and Aileen Kelleher φ φ Digital Media Centre Dublin Institute
More informationTempo Estimation and Manipulation
Hanchel Cheng Sevy Harris I. Introduction Tempo Estimation and Manipulation This project was inspired by the idea of a smart conducting baton which could change the sound of audio in real time using gestures,
More informationTiming In Expressive Performance
Timing In Expressive Performance 1 Timing In Expressive Performance Craig A. Hanson Stanford University / CCRMA MUS 151 Final Project Timing In Expressive Performance Timing In Expressive Performance 2
More informationEfficient Vocal Melody Extraction from Polyphonic Music Signals
http://dx.doi.org/1.5755/j1.eee.19.6.4575 ELEKTRONIKA IR ELEKTROTECHNIKA, ISSN 1392-1215, VOL. 19, NO. 6, 213 Efficient Vocal Melody Extraction from Polyphonic Music Signals G. Yao 1,2, Y. Zheng 1,2, L.
More informationRhythm related MIR tasks
Rhythm related MIR tasks Ajay Srinivasamurthy 1, André Holzapfel 1 1 MTG, Universitat Pompeu Fabra, Barcelona, Spain 10 July, 2012 Srinivasamurthy et al. (UPF) MIR tasks 10 July, 2012 1 / 23 1 Rhythm 2
More informationMusic Information Retrieval with Temporal Features and Timbre
Music Information Retrieval with Temporal Features and Timbre Angelina A. Tzacheva and Keith J. Bell University of South Carolina Upstate, Department of Informatics 800 University Way, Spartanburg, SC
More informationCONTENT-BASED MELODIC TRANSFORMATIONS OF AUDIO MATERIAL FOR A MUSIC PROCESSING APPLICATION
CONTENT-BASED MELODIC TRANSFORMATIONS OF AUDIO MATERIAL FOR A MUSIC PROCESSING APPLICATION Emilia Gómez, Gilles Peterschmitt, Xavier Amatriain, Perfecto Herrera Music Technology Group Universitat Pompeu
More informationViolin Timbre Space Features
Violin Timbre Space Features J. A. Charles φ, D. Fitzgerald*, E. Coyle φ φ School of Control Systems and Electrical Engineering, Dublin Institute of Technology, IRELAND E-mail: φ jane.charles@dit.ie Eugene.Coyle@dit.ie
More informationAn Empirical Comparison of Tempo Trackers
An Empirical Comparison of Tempo Trackers Simon Dixon Austrian Research Institute for Artificial Intelligence Schottengasse 3, A-1010 Vienna, Austria simon@oefai.at An Empirical Comparison of Tempo Trackers
More informationMusic Synchronization. Music Synchronization. Music Data. Music Data. General Goals. Music Information Retrieval (MIR)
Advanced Course Computer Science Music Processing Summer Term 2010 Music ata Meinard Müller Saarland University and MPI Informatik meinard@mpi-inf.mpg.de Music Synchronization Music ata Various interpretations
More informationSimple Harmonic Motion: What is a Sound Spectrum?
Simple Harmonic Motion: What is a Sound Spectrum? A sound spectrum displays the different frequencies present in a sound. Most sounds are made up of a complicated mixture of vibrations. (There is an introduction
More informationNOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING
NOTE-LEVEL MUSIC TRANSCRIPTION BY MAXIMUM LIKELIHOOD SAMPLING Zhiyao Duan University of Rochester Dept. Electrical and Computer Engineering zhiyao.duan@rochester.edu David Temperley University of Rochester
More informationMusic Tempo Estimation with k-nn Regression
SUBMITTED TO IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2008 1 Music Tempo Estimation with k-nn Regression *Antti Eronen and Anssi Klapuri Abstract An approach for tempo estimation from
More informationCHAPTER 4 SEGMENTATION AND FEATURE EXTRACTION
69 CHAPTER 4 SEGMENTATION AND FEATURE EXTRACTION According to the overall architecture of the system discussed in Chapter 3, we need to carry out pre-processing, segmentation and feature extraction. This
More informationImproving Beat Tracking in the presence of highly predominant vocals using source separation techniques: Preliminary study
Improving Beat Tracking in the presence of highly predominant vocals using source separation techniques: Preliminary study José R. Zapata and Emilia Gómez Music Technology Group Universitat Pompeu Fabra
More informationPOLYPHONIC INSTRUMENT RECOGNITION USING SPECTRAL CLUSTERING
POLYPHONIC INSTRUMENT RECOGNITION USING SPECTRAL CLUSTERING Luis Gustavo Martins Telecommunications and Multimedia Unit INESC Porto Porto, Portugal lmartins@inescporto.pt Juan José Burred Communication
More informationWE ADDRESS the development of a novel computational
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010 663 Dynamic Spectral Envelope Modeling for Timbre Analysis of Musical Instrument Sounds Juan José Burred, Member,
More informationSoundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE, and Bryan Pardo, Member, IEEE
IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 6, OCTOBER 2011 1205 Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE,
More informationAUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION
AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION Halfdan Rump, Shigeki Miyabe, Emiru Tsunoo, Nobukata Ono, Shigeki Sagama The University of Tokyo, Graduate
More informationTopic 11. Score-Informed Source Separation. (chroma slides adapted from Meinard Mueller)
Topic 11 Score-Informed Source Separation (chroma slides adapted from Meinard Mueller) Why Score-informed Source Separation? Audio source separation is useful Music transcription, remixing, search Non-satisfying
More informationTranscription An Historical Overview
Transcription An Historical Overview By Daniel McEnnis 1/20 Overview of the Overview In the Beginning: early transcription systems Piszczalski, Moorer Note Detection Piszczalski, Foster, Chafe, Katayose,
More information