Complex audio feature extraction: Transcription


1 Complex audio feature extraction: Transcription
Anja Volk, Sound and Music Technology, 14 Dec

2 Outline
- What is transcription? Basic representations of musical content
- Application: query by humming
- Audio and symbolic representations in folk song
- (Multiple) F0 estimation
- Melody transcription
- Chord transcription

3 Recap: Audio features for corpus analysis
- Why do we undertake corpus analysis?
  - Examples: What makes a chorus? What makes a hook?
- What type of audio features?
  - Psycho-acoustic audio features
  - Corpus-relative features
- Evolution of Contemporary Western Popular Music
  - Pitch, timbre, loudness
  - Million Song Dataset
  - Graphs constructed
- Results
  - Pitch variety remains constant
  - Timbre homogenizes
  - Loudness increases

4 Transcription: Basic representations of musical content
musical content | example | compare image | compare text | structure | convert to above | convert to below
Digital audio | MP3, Wav | level 1: primitive features | speech | none | - | hard
Timestamped events | MIDI | level 2: objects | text | little | easy | fairly hard (OK job)
Music notation | Finale, Sibelius, MusicXML | level 2: compound objects | text + markup | much | easy (OK job) | -
Transcription is the "convert to below" direction: from digital audio to timestamped events to music notation.

6 Audio transcription
- reconstruction of sound events, or even music notation, from audio signals
- sound events are what we perceive
- music notation can be seen as an approximation of our perception
- but other symbolic representations of sound events may be equally useful
- the audio transcription problem is a major obstacle
  - lots of research
  - usable solutions exist for controlled situations

7 What does audio transcription involve?
- many different tasks, often covered by MIREX
  - (multiple) F0 estimation
  - onset detection
  - melody extraction
  - audio chord estimation
  - instrument recognition

8 Applications
- Cover song detection
- Query by humming
- Audio-notation alignment (score following)
  - Example: PHENICX project on enhancing the experience of the concert audience
- Performance analysis
- not all of these require a full transcription

9 Questionnaire: musicology challenges for MIR

10 Where it started
- Ghias et al., Query By Humming: Musical Information Retrieval in an Audio Database (1995)
- actually, searching MIDI files
- but the query is user-generated audio
- matching after audio transcription

11 Query by Humming: How (Ghias et al. 1995)?
- F0 estimation
  - the human voice has peculiar resonance properties: formants (high frequency ranges that we perceive as vowels)
  - an autocorrelation-based method worked best
- output turned into a 3-letter alphabet, a.k.a. Parsons code: U, D, S (up, down, same)
- approximate string matching
- results: sequences of pitch transitions were sufficient to discriminate 90% of the songs
- high hopes for MIR on the Internet
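The two steps on this slide (reduce the pitch sequence to U/D/S, then match strings approximately) can be illustrated with a minimal Python sketch. It is not the implementation of Ghias et al.: plain Levenshtein edit distance stands in for their approximate string matcher, and the function names, the `tolerance` parameter and the two-tune `database` are hypothetical.

```python
def parsons_code(pitches, tolerance=0.5):
    """Reduce a pitch sequence (here: MIDI note numbers) to U/D/S symbols."""
    code = []
    for prev, cur in zip(pitches, pitches[1:]):
        if cur - prev > tolerance:
            code.append("U")   # up
        elif prev - cur > tolerance:
            code.append("D")   # down
        else:
            code.append("S")   # same
    return "".join(code)


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Hypothetical toy database: tune name -> MIDI pitches
database = {
    "tune_a": [60, 60, 67, 67, 69, 69, 67],
    "tune_b": [64, 62, 60, 62, 64, 64, 64],
}

# Query hummed in a different key, with one wrong interval
query = [55, 55, 62, 62, 65, 64, 62]
query_code = parsons_code(query)

ranked = sorted(
    database,
    key=lambda name: edit_distance(query_code, parsons_code(database[name])),
)
```

Because only the direction of each interval is kept, the query matches `tune_a` although it is sung in another key; the price is that many different melodies share the same code.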

12 QBH as a research paradigm
- attractive scenario with very strong impact in the next 10 years (and even up to now)
  - audio-to-symbolic matching
  - audio-to-audio matching
- how realistic is it?
- MIREX Query By Singing/Humming (QBSH)
  - run since 2006, led by Roger Jang
  - several datasets, including Jang's MIR-QBSH corpus
    - 4431 audio queries
    - 48 MIDI ground truths
    - noise from Essen Folk Songs
  - evaluation: Mean Reciprocal Rank
  - scores up to [...] (overfitting?)
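Mean Reciprocal Rank, the evaluation measure named on this slide, averages 1/rank of the correct item over all queries. A minimal sketch follows; the function name and the toy rankings are illustrative, and a query whose ground truth is missing from the returned list is assumed to contribute 0.

```python
def mean_reciprocal_rank(ranked_lists, ground_truths):
    """Average of 1/rank of the correct item; 0 if it is not in the list."""
    total = 0.0
    for ranking, truth in zip(ranked_lists, ground_truths):
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(ground_truths)


# Three hypothetical queries whose correct answer is always "a"
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "a"]
mrr = mean_reciprocal_rank(rankings, truths)  # (1 + 1/2 + 1/3) / 3
```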

13 SoundHound
- currently, the most visible commercial application of QBH
- demo: http://www.youtube.com/watch?v=8ue5svesbm (from 1:27)
- SoundHound documentation doesn't explain much (https://www.soundhound. ... nd2sound); it is vague:
  - compact and flexible Crystal representation
  - matches multiple aspects of the user's rendition (including melody, rhythm, and lyrics) with millions of user recordings from midomi.com
- Midomi database consists of monophonic melodies linked to metadata
- unclear whether and to what extent melodies are transcribed
- matching is monophonic

14 Human performance in QBH
- constraints on QBH input
  - sound production (i.e. singing)
  - melody accuracy
  - usually, the beginning
- PhD research Micheline Lesaffre (Ghent U.): Music Information Retrieval: Conceptual Framework, Annotation and User Behavior (2005)
  - experiment: what musical queries do people create when given complete freedom?
  - the range is large

15 F0 estimation
- determining the perceptual pitch from the acoustic signal
- the monophonic case is often considered solved
  - see e.g. high scores for QBSH (query by singing/humming)
  - no separate MIREX task
- in practice, pitch transcription is far from easy
  - signal vs. perception, octave errors
  - recording quality
  - musical competence of performers

16 The WITCHCRAFT case
- task: build a content-based search engine for 7000 Dutch folksong recordings
- initial plan: use the MAMI transcriber (Ghent University)
  - state-of-the-art in 2005
  - based on a computational model of the cochlea
  - produces neural firing patterns
  - selection of pitch candidates
  - segmentation (division into notes) based on sung lyrics
  - outperformed many other methods with realistic input

17 The challenge
- lots of songs were just too hard to be transcribed reliably
- song OGL is just an average case (audio examples: beginning, end)
- that's why we chose to use the existing paper transcriptions and encode these instead

18 Symbolic-audio alignment
- Meertens Tune Collections: audio AND encoding AND notation AND metadata AND annotations
- useful e.g. as ground truth for F0 estimation
- example: research by Müller, Grosche & Wiering 2009
  - MIDI is 1 strophe, audio contains many strophes
  - matching might produce a segmentation
  - figure: chroma from MIDI, chroma from audio

19 Dealing with pitch drift
- chroma shifted over 1/24 octave
- distances
- segmentation performance 0.9-0.95 (F-measure)
- may support further variation / music performance research

20 Audio folk song retrieval
- in the end, we did experiment with audio retrieval (PhD thesis Peter van Kranenburg, 2010)
- F0 estimation using YIN (de Cheveigné & Kawahara)
  - autocorrelation based; error prevention
  - adaptations: smoothing over 11 frames, using the median
- initial retrieval experiment with segments: OK results
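The smoothing adaptation mentioned on this slide (median over 11 frames) can be sketched as a sliding median filter over a frame-wise F0 track. This is only an illustration under assumptions: the edge padding, the function name and the toy track are not taken from the thesis.

```python
import numpy as np


def median_smooth(f0, width=11):
    """Sliding median over `width` frames (odd), edges padded by repetition."""
    f0 = np.asarray(f0, dtype=float)
    half = width // 2
    padded = np.pad(f0, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return np.median(windows, axis=1)


# Toy F0 track (Hz) with a single-frame octave error at frame 10
track = np.full(30, 220.0)
track[10] = 440.0
smoothed = median_smooth(track)
```

A median removes isolated outliers such as octave jumps without blurring the surrounding values the way a moving average would.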

21 Melody transcription
- 2 steps:
  - (1) estimate when the melody is present and when it is not
  - (2) estimate the correct pitch of the melody when it is present
- still considered an unsolved problem
  - Issue 1: separating the components of a polyphonic mixture of complex sounds is very difficult
  - Issue 2: estimating pitch from the separated stream is still a challenge in itself
  - Issue 3: determining note boundaries is challenging for several instruments, including voice (onset and transition blurred)

22 Melody transcription: Example
- a male singer (melody)
- accompaniment (drums, piano, guitar)
M. Mueller: Fundamentals of Music Processing

23 Melody transcription: Example
- spectrogram, isolated singing voice: melody

24 Melody transcription: Example
- spectrogram, full recording: singer is active

25 Melody transcription: YIN
- Cheveigné & Kawahara (2001)
- pitch-estimation algorithm operating in the time domain (no Fourier transformation)
- works with the autocorrelation function (ACF) of the signal
- the signals on which YIN is typically used are largely monophonic, with only one, or one very dominant, pitch present
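To make the time-domain idea concrete, here is a simplified, YIN-style sketch for a single frame. It covers only a squared-difference function (closely related to the ACF), a cumulative mean normalisation and an absolute threshold, and leaves out refinements such as parabolic interpolation. The function name, the search range and the threshold value are assumptions, so it should not be read as the reference algorithm.

```python
import numpy as np


def yin_like_f0(frame, sr, fmin=80.0, fmax=1000.0, threshold=0.1):
    """Estimate F0 (Hz) of one monophonic frame; needs len(frame) > sr / fmin."""
    frame = np.asarray(frame, dtype=float)
    tau_min = int(sr / fmax)
    tau_max = int(sr / fmin)
    window = len(frame) - tau_max  # number of samples compared at every lag

    # Difference function d(tau) = sum_j (x[j] - x[j + tau])^2
    d = np.array([
        np.sum((frame[:window] - frame[tau:tau + window]) ** 2)
        for tau in range(tau_max + 1)
    ])

    # Cumulative mean normalised difference: d(tau) divided by the mean of d(1..tau)
    cmnd = np.ones_like(d)
    running = np.cumsum(d[1:])
    cmnd[1:] = d[1:] * np.arange(1, tau_max + 1) / np.maximum(running, 1e-12)

    # First lag that drops below the threshold ...
    tau = tau_min
    while tau < tau_max and cmnd[tau] >= threshold:
        tau += 1
    if tau == tau_max:
        # ... or, if there is none, the global minimum in the search range
        tau = tau_min + int(np.argmin(cmnd[tau_min:]))
    else:
        # follow the dip down to its local minimum
        while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
            tau += 1
    return sr / tau


# Synthetic test tone: 220 Hz sine, 1024 samples at 16 kHz
sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 220.0 * t)
f0 = yin_like_f0(frame, sr)
```

Since the lag is an integer number of samples, the estimate is quantised; on the test tone it lands within a few Hz of 220 Hz.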

28 Melody transcription: Melodia
Justin Salamon and Emilia Gomez. Melody Extraction From Polyphonic Music Signals Using Pitch Contour Characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6).

29 Melody transcription: Melodia
- Which frequencies are present in the audio signal at every point in time?

30 Melody transcription: Melodia
- for each candidate pitch, search for a harmonic series of frequencies that would contribute to our perception of this pitch
- salience = (weighted) sum of the energies of these harmonic frequencies
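The salience definition on this slide can be sketched as a harmonic summation over spectral peaks. The weighting scheme below (geometric decay per harmonic), the matching tolerance in cents and all names are assumptions chosen for illustration; the published Melodia salience function has its own parameters.

```python
import numpy as np


def harmonic_salience(peak_freqs, peak_mags, candidates,
                      n_harmonics=8, decay=0.8, tol_cents=50.0):
    """Weighted sum of peak energies found near the harmonics of each candidate."""
    peak_freqs = np.asarray(peak_freqs, dtype=float)
    peak_mags = np.asarray(peak_mags, dtype=float)
    salience = np.zeros(len(candidates))
    for i, f0 in enumerate(candidates):
        for h in range(1, n_harmonics + 1):
            # distance in cents between every peak and the h-th harmonic of f0
            cents = 1200.0 * np.abs(np.log2(peak_freqs / (h * f0)))
            match = cents < tol_cents
            salience[i] += decay ** (h - 1) * np.sum(peak_mags[match] ** 2)
    return salience


# Spectral peaks of a hypothetical tone at 220 Hz with three overtones
peak_freqs = [220.0, 440.0, 660.0, 880.0]
peak_mags = [1.0, 0.5, 0.3, 0.2]
candidates = [110.0, 220.0, 440.0]
salience = harmonic_salience(peak_freqs, peak_mags, candidates)
best = candidates[int(np.argmax(salience))]
```

The decaying weights matter: without them the sub-octave candidate (110 Hz) would collect exactly the same peaks as 220 Hz and tie with it, which is one source of octave errors.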

31 Melody transcription: Melodia
- pitch contour = a series of consecutive pitch values which are continuous in both time and frequency

32 Melody transcription: Melodia
- "By studying the distribution of these characteristics for contours that belong to melodies and contours that belong to accompaniments, we were able to devise a set of rules for filtering out non-melodic contours"
- calculation of contour characteristics

33 Melody transcription: Melodia
- main melody extraction from polyphonic signals
- start from a frequency representation of the signal, and assess the salience of all possible candidate pitches
- based on a high-resolution STFT, from which peaks are found
- estimate of exact location: the instantaneous frequency is found by considering the magnitudes of the spectrum in each frame, and interpolating between the phases of peaks in consecutive frames of the STFT

34 Melody transcription: Melodia
- then: harmonic summation on the set of peaks
- peak candidates are grouped into time-varying melodic contours using a set of heuristics based on perceptual streaming cues
- candidate contours are then given a score based on their total salience and shape
- post-processing selects the set of contours that most likely constitutes the melody

35 Melody transcription: Melodia demo

36 Melody transcription: Data-driven and source separation-based systems
- extract the melody by separating it from the rest of the mix
- e.g.: use a trained timbre model to describe each of two sources (melody and accompaniment)
- timbre models can be
  - Gaussian mixture models (GMMs), in which each source is seen as a weighted sum of a finite set of multidimensional Gaussians, each describing a particular spectral shape, or
  - hidden Markov models (HMMs), a generalisation of GMMs
- models are trained on source-separated ground truth data using expectation maximization
Graham E. Poliner and Dan Ellis. A Classification Approach to Music Transcription. Proc. 6th International Society for Music Information Retrieval Conference.

37 The matter of onset detection
- most transcriptions so far present a pitch contour
- onset-offset detection is missing
- audio onset detection is (again) a MIREX task
- the difficulty of the subtasks is very different

38 Multiple F0 estimation
- many approaches tried
  - Deep Learning (neural networks)
  - Non-negative Matrix Factorization (NMF)
- for example, in NMF the audio signal is rendered as a matrix, which is the product of
  - a matrix of activations (sound events)
  - a matrix of (learned) note templates
- the aim is to calculate the matrix of activations that has the best match with the audio signal
- activations correspond to notes
- figure: automatic transcription of a piano piece (black) and ground truth (white)
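As a rough illustration of the NMF idea, the sketch below keeps a matrix of note templates `W` fixed and estimates the non-negative activations `H` so that `W @ H` approximates a magnitude spectrogram `V`. It uses the standard multiplicative update for the Euclidean cost; the tiny templates, the iteration count and the function name are assumptions, and real systems typically learn or adapt the templates as well.

```python
import numpy as np


def nmf_activations(V, W, n_iter=200, eps=1e-9, seed=0):
    """Estimate non-negative H with V ~= W @ H, keeping the templates W fixed."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # multiplicative update for the Euclidean cost ||V - W H||^2
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H


# Two hypothetical note templates over 5 frequency bins (columns of W)
W = np.array([
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
    [0.3, 0.0],
    [0.0, 0.3],
])

# Ground-truth activations over 4 frames: note 1, note 2, both, silence
H_true = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
])
V = W @ H_true          # synthetic "spectrogram"
H = nmf_activations(V, W)
```

Thresholding the rows of `H` would then give a piano-roll-like transcription: each row is a note, each column a frame.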

39 Chord recognition: Melody vs Chords
- Note: a single sounding tone with a pitch (height)
- Melody: a sequence of monophonic notes
- Chord: sounding of simultaneous (at least two) notes
- Chord sequence: a sequence of chords

41 From Score to Audio
Score -> Audio recording: easy

42 From Audio to Score
Audio recording -> Score: hard

43 Pipeline: chord recognition
Sound -> Chromagram -> Chord candidates -> Key changes -> Model-based selection -> Chords
- Match the beat-synchronised chroma features with chord templates
- Different approaches:
  - Use knowledge-based templates
  - Match by Euclidean distance
  - Learn an average profile from the data
  - Learn a 12-dimensional Gaussian distribution for all chords
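The first two approaches on this slide (knowledge-based templates, matched by Euclidean distance) can be sketched as follows. The restriction to 24 major and minor triads, the binary templates, the unit-length normalisation and all names are simplifying assumptions.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def build_templates():
    """Binary triad templates for 12 major and 12 minor chords, unit length."""
    templates = {}
    for root in range(12):
        for suffix, intervals in (("", (0, 4, 7)), ("m", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            templates[NOTE_NAMES[root] + suffix] = t / np.linalg.norm(t)
    return templates


def match_chord(chroma, templates):
    """Return the chord label whose template is closest in Euclidean distance."""
    chroma = np.asarray(chroma, dtype=float)
    norm = np.linalg.norm(chroma)
    if norm > 0:
        chroma = chroma / norm
    return min(templates, key=lambda name: np.linalg.norm(chroma - templates[name]))


templates = build_templates()

# Hypothetical beat-synchronised chroma frame: strong A, C, E and a little D
frame = np.zeros(12)
frame[[9, 0, 4, 2]] = [1.0, 0.8, 0.9, 0.1]
label = match_chord(frame, templates)
```

Matching every frame independently like this yields the chord candidates; the later pipeline stages decide between them using context.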

44 Case study: chord recognition
Sound -> Chromagram -> Chord candidates -> Key changes -> Model-based selection -> Chords
- Some models require information about the key
- General approach
  - Krumhansl-Kessler profiles
  - Match by Pearson correlation with the chroma feature
  - Variations exist
- The approach is similar to chord candidate selection
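A minimal sketch of the general approach on this slide: rotate the Krumhansl-Kessler major and minor profiles to all 12 tonics and pick the key whose profile has the highest Pearson correlation with the chroma vector. The profile values are the commonly cited Krumhansl-Kessler probe-tone ratings; the function name and the test vectors are assumptions.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Krumhansl-Kessler probe-tone profiles, index 0 = tonic
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])


def estimate_key(chroma):
    """Return (tonic name, mode) of the best-correlating rotated profile."""
    chroma = np.asarray(chroma, dtype=float)
    best = None
    for tonic in range(12):
        for mode, profile in (("major", KK_MAJOR), ("minor", KK_MINOR)):
            # np.roll moves the tonic weight to pitch class `tonic`
            r = np.corrcoef(chroma, np.roll(profile, tonic))[0, 1]
            if best is None or r > best[0]:
                best = (r, NOTE_NAMES[tonic], mode)
    return best[1], best[2]


# Sanity check on idealised input: a profile rotated to G and one rotated to A
key_a = estimate_key(np.roll(KK_MAJOR, 7))
key_b = estimate_key(np.roll(KK_MINOR, 9))
```

On real chroma (summed over a whole piece or a sliding window) the correlations are much closer together, and confusions between relative major and minor keys are common.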

45 Short excerpt: What is a key?
- pitch is generally not equally distributed within a piece of music
- if it is, you get atonal music, e.g. Webern's piano variations
- when we use only a subset, music generally sounds much more structured

46 Short excerpt: What is a key?
- subsets are often visualised as musical scales
- perceptually, they help us identify tonality
- music hovers around a certain pitch, the final or tonic
- most common scales: major and minor
  - 7 different pitches within the octave
  - most audible difference: third note of the scale
- the change from major to minor can be quite dramatic
  - ex. Gustav Mahler, 1st symphony, 3rd mvt

48 Case study: chord recognition
Sound -> Chromagram -> Chord candidates -> Key changes -> Model-based selection -> Chords
- Two kinds of approaches:
  - Knowledge driven: musical knowledge is modelled and used to select a plausible chord sequence
  - Data driven: the transition probabilities between chords are learned from a large corpus of chords

49 Case study: chord recognition
Sound -> Chromagram -> Chord candidates -> Key changes -> Model-based selection -> Chords
- Knowledge driven: HarmTrace model (de Haas)
  - Analyses the function of chords
  - Needs key information
  - Robust against noisy data
  - Flexible
  - Based on functional programming

50 Case study: chord recognition
Sound -> Chromagram -> Chord candidates -> Key changes -> Model-based selection -> Chords
- Data driven: hidden Markov models
  - Estimate the probability of a chord transition by counting the chord transitions in a corpus
  - Use chroma features to estimate the probability of a chord candidate
  - Goal: find the most likely sequence of chords that results in the current chromagram: Viterbi algorithm
  - Many variants exist
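The data-driven recipe on this slide can be sketched in two small functions: transition probabilities obtained by counting chord bigrams in a corpus, and Viterbi decoding of the most likely chord sequence given per-frame chord likelihoods. The two-chord vocabulary, the toy corpus, the add-one smoothing and the likelihood values are all made up for illustration; in a real system the likelihoods would come from the chroma features.

```python
import numpy as np


def transition_matrix(sequences, n_states, smoothing=1.0):
    """Row-normalised bigram counts with additive smoothing."""
    counts = np.full((n_states, n_states), smoothing)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)


def viterbi(log_init, log_trans, log_obs):
    """Most likely state path; log_obs has shape (n_frames, n_states)."""
    n_frames, n_states = log_obs.shape
    delta = np.empty((n_frames, n_states))
    backptr = np.zeros((n_frames, n_states), dtype=int)
    delta[0] = log_init + log_obs[0]
    for t in range(1, n_frames):
        scores = delta[t - 1][:, None] + log_trans  # scores[i, j]: from i to j
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores[backptr[t], np.arange(n_states)] + log_obs[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]


# Hypothetical corpus over two chords: 0 = C, 1 = G
corpus = [[0, 0, 0, 1, 1, 1, 0, 0], [1, 1, 1, 1, 0, 0, 0]]
trans = transition_matrix(corpus, n_states=2)

# Made-up per-frame chord likelihoods; frame 1 alone slightly prefers G
obs = np.array([[0.9, 0.1], [0.45, 0.55], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
path = viterbi(np.log([0.5, 0.5]), np.log(trans), np.log(obs))
```

The transition model smooths the frame-wise decisions: the weak preference for G in the second frame is overruled because a C-G-C alternation is unlikely under the learned transitions.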

51 Case study: chord recognition
Sound -> Chromagram -> Chord candidates -> Key changes -> Model-based selection -> Chords

52 Practical approaches
- based on consideration of predominant (western) musical structures
  - melody + accompaniment is common
  - complex polyphony is relatively scarce, especially in popular music
  - counterpoint between the inner voices does matter in classical works such as fugues (or the ex. of the Shostakovich string quartet)
- pragmatic approaches are thus focussing on
  - melody extraction
  - chord recognition

53 Summary
- transcription = deriving a symbolic representation from musical audio
- F0 estimation: deriving perceived pitch from audio
- Query by Humming as an early MIR paradigm
- F0 estimation for monophony is considered solved, but in practice it is far from easy
- multiple F0 estimation: very hard task
- practical restrictions: melody extraction, chord labelling
- onset detection: yet another important task
- no functional systems yet that combine F0 extraction and onset detection
