
LEARNING AUDIO SHEET MUSIC CORRESPONDENCES
Matthias Dorfer, Department of Computational Perception

Short Introduction
I am a PhD candidate in the Department of Computational Perception at Johannes Kepler University Linz (JKU). My supervisor is Prof. Gerhard Widmer. The department's mission: "Basic and applied research in machine learning, pattern recognition, knowledge extraction, and generally Artificial and Computational Intelligence... focus is on intelligent audio (specifically: music) processing."

This Talk Is About...
Multi-modal neural networks for audio-visual representation learning: learning correspondences between audio and sheet music.

OUR TASKS

Our Tasks
1. Score following (localization)
2. Cross-modality retrieval (an embedding layer over two views, trained with a ranking loss)

Task: Score Following
Score following is the process of following a musical performance (audio) with respect to a known symbolic representation (e.g. a score).

The Task: Audio-to-Sheet Matching
Simultaneously learn, in end-to-end neural network fashion, to:
- read notes from sheet images (pixels),
- listen to music, and
- match the played music to its corresponding notes.

METHODS

Spectrogram-to-Sheet Correspondences
The rightmost onset in the spectrogram excerpt is the target note onset, with a temporal context of 1.2 seconds into the past.

Multi-modal Convolutional Network
The output layer is a B-way soft-max!

Soft Target Vectors
- The staff image is quantized into buckets.
- Each bucket is represented by one output neuron.
- Each bucket holds the probability of containing the target note.
- Neighbouring buckets share probability mass ("soft" targets).
- The soft targets are used as target values for training our networks (see the sketch below).
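
As a concrete illustration (not necessarily the exact mass split used in the talk), a soft target vector can be built by putting most of the probability on the bucket containing the note head and sharing the rest with its neighbours:

```python
import numpy as np

def soft_target(num_buckets: int, note_x: float, image_width: float,
                centre_mass: float = 0.5, neighbour_mass: float = 0.25) -> np.ndarray:
    """Quantize a note-head x-coordinate into buckets and spread probability
    mass over the centre bucket and its two neighbours. The exact mass split
    is an assumption for illustration."""
    t = np.zeros(num_buckets)
    b = int(note_x / image_width * num_buckets)   # centre bucket index
    b = min(max(b, 0), num_buckets - 1)
    t[b] = centre_mass
    if b > 0:
        t[b - 1] = neighbour_mass
    if b < num_buckets - 1:
        t[b + 1] = neighbour_mass
    return t / t.sum()  # normalize so the target is a valid distribution

# Example: 40 buckets, note head at pixel 195 of a 390-pixel-wide staff image
print(soft_target(40, note_x=195.0, image_width=390.0))
```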

Optimization Objective
Output activation: B-way soft-max
$$\varphi(y_{j,b}) = \frac{e^{y_{j,b}}}{\sum_{k=1}^{B} e^{y_{j,k}}}$$
Loss: categorical cross-entropy against the soft targets $t_j$
$$l_j(\Theta) = -\sum_{k=1}^{B} t_{j,k}\,\log(p_{j,k})$$
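
In code, the objective is just a soft-max over the B bucket activations combined with cross-entropy against the soft target; a minimal NumPy sketch:

```python
import numpy as np

def b_way_softmax(y: np.ndarray) -> np.ndarray:
    """phi(y_b) = exp(y_b) / sum_k exp(y_k), stabilized by subtracting the max."""
    e = np.exp(y - y.max())
    return e / e.sum()

def categorical_cross_entropy(t: np.ndarray, p: np.ndarray) -> float:
    """l(Theta) = -sum_k t_k * log(p_k), for soft targets t."""
    return float(-np.sum(t * np.log(p + 1e-12)))

y = np.random.randn(40)                         # activations for B = 40 buckets
t = np.zeros(40); t[[18, 19, 20]] = [0.25, 0.5, 0.25]   # an example soft target
p = b_way_softmax(y)
print(categorical_cross_entropy(t, p))
```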

Discussion: Choice of Objective
The soft-max over buckets allows the model to express uncertainty (e.g. for repetitive structures in music). In our experience, it is much easier to optimize than MSE regression or Mixture Density Networks.

Sheet Location Prediction
At test time, predict the expected location $\hat{x}_j$ in the sheet image of the audio snippet containing target note $j$, via probability-weighted localization:
$$\hat{x}_j = \sum_{k \in \{b^*-1,\ b^*,\ b^*+1\}} w_k\, c_k$$
where $b^*$ is the bucket with the highest probability $p_{j,b^*}$, the weights are $w = \{p_{j,b^*-1},\ p_{j,b^*},\ p_{j,b^*+1}\}$, and $c_k$ are the bucket coordinates.
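
A minimal sketch of this prediction step (normalizing the three weights so they form a proper weighted average is an assumption on my part):

```python
import numpy as np

def predict_location(p: np.ndarray, bucket_coords: np.ndarray) -> float:
    """Probability-weighted localization over the best bucket b* and its
    two neighbours: x_hat = sum_k w_k * c_k."""
    b = int(np.argmax(p))                          # bucket with highest probability
    ks = [k for k in (b - 1, b, b + 1) if 0 <= k < len(p)]
    w = p[ks] / p[ks].sum()                        # normalized weights (assumption)
    return float(np.dot(w, bucket_coords[ks]))

p = np.zeros(40); p[[18, 19, 20]] = [0.2, 0.6, 0.2]
coords = (np.arange(40) + 0.5) * (390 / 40)        # bucket centres in pixels
print(predict_location(p, coords))
```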

EXPERIMENTS / DEMO

Train / Evaluation Data
Matthias Dorfer, Andreas Arzt, and Gerhard Widmer. "Towards Score Following in Sheet Music Images." In Proc. of the 17th International Society for Music Information Retrieval Conference, 2016.
- Trained on monophonic piano music
- Localization of staff lines
- MIDI tracks synthesized to audio
- Signal processing: spectrogram (22.05 kHz, 2048-sample window, 31.25 fps); filterbank: 24-band logarithmic (80 Hz to 8 kHz)
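
A rough librosa-based approximation of this signal-processing chain. This is a sketch under assumptions: the slides' 24-band logarithmic filterbank is replaced by a mel filterbank over the same frequency range, the hop length is derived from the frame rate, and the input file name is hypothetical:

```python
import librosa
import numpy as np

sr, fps = 22050, 31.25
hop = int(round(sr / fps))                  # ~706 samples for 31.25 fps

y, _ = librosa.load("performance.wav", sr=sr)          # hypothetical input
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=hop))
# Stand-in for the 24-band logarithmic filterbank (80 Hz to 8 kHz):
M = librosa.feature.melspectrogram(S=S**2, sr=sr, n_mels=136,
                                   fmin=80.0, fmax=8000.0)
log_spec = np.log(1.0 + M)                  # logarithmic magnitude compression
print(log_spec.shape)                       # (frequency bins, frames)
```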

Model Architecture and Optimization
Inputs: sheet-image snippet (40 x 390 pixels) and spectrogram excerpt (136 bins x 40 frames).
- VGG-style image model: 3x3 Conv + BN + ReLU blocks with max pooling, then Dense + BN + ReLU + drop-out
- VGG-style audio model: 3x3 Conv + BN + ReLU blocks with max pooling, then Dense + BN + ReLU + drop-out
- Multi-modality merging: concatenation layer, Dense + BN + ReLU + drop-out layers, and a B-way soft-max output layer

Optimization: mini-batch stochastic gradient descent with momentum; mini-batch size 100; learning rate 0.1 (divided by 10 every 10 epochs); momentum 0.9; weight decay 0.0001.
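
A compact PyTorch sketch of this two-branch design. The slide does not give layer counts or filter numbers, so those are assumptions; only the overall structure (conv + BN + ReLU blocks, pooling, concatenation, B-way output, and the SGD settings) follows the slide:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """3x3 Conv + BN + ReLU, followed by 2x2 max pooling."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(2))

class AudioSheetNet(nn.Module):
    def __init__(self, num_buckets: int = 40):
        super().__init__()
        self.sheet_net = nn.Sequential(   # VGG-style image branch, 1x40x390 input
            conv_block(1, 32), conv_block(32, 64),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.3))
        self.audio_net = nn.Sequential(   # VGG-style audio branch, 1x136x40 input
            conv_block(1, 32), conv_block(32, 64),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.3))
        self.head = nn.Sequential(        # multi-modality merging
            nn.Linear(256, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_buckets))  # B-way logits; soft-max lives in the loss

    def forward(self, sheet, spec):
        z = torch.cat([self.sheet_net(sheet), self.audio_net(spec)], dim=1)
        return self.head(z)

model = AudioSheetNet()
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
logits = model(torch.randn(4, 1, 40, 390), torch.randn(4, 1, 136, 40))
print(logits.shape)  # (4, 40)
```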

Demo with Real Music
Minuet in G Major (BWV Anh. 114, Johann Sebastian Bach), played on a Yamaha AvantGrand N2 hybrid piano and recorded with a single microphone.

So far so good...
The model works well on monophonic music and seems to learn reasonable representations. Important observation: no temporal model is required! What to do next?

Switch to "Real Music" 18/39

Switch to "Real Music" 18/39

Switch to "Real Music" 18/39

Composers, Sheet Music and Audio
- Pieces from MuseScore (annotation becomes feasible)
- Classical piano music by Mozart (14 pieces), Bach (16), Beethoven (5), Haydn (4) and Chopin (1)
- Experimental setup: train / validate on Mozart; test on all composers
- Audio is synthesized

ANNOTATION PIPELINE

Fully Convolutional Segmentation Networks
Optical Music Recognition (OMR) pipeline:
1. Input image
2. System probability maps
3. System recognition
4. Regions of interest
5. Note probability maps
6. Note-head recognition

Annotation Pipeline
Given an image of sheet music:
1. Detect staff systems by bounding box
2. Annotate individual note heads
3. Relate note heads to audio onsets
Now we know the locations of staff systems and note heads, and for each note head its onset time in the audio: overall 63,836 annotated correspondences from 51 pieces.

Train Data Preparation
We unroll the score and keep its relations to the audio. This is all we need to train our models!

Demo
W.A. Mozart, Piano Sonata K545, 1st movement; plain, frame-wise multi-modal convolutional network.

Observations
Tracking is sometimes a bit shaky, and score following fails at the beginning of the second page. But why?

Failure

NET DEBUGGING

Guided Back-Propagation
Springenberg et al., "Striving for Simplicity: The All Convolutional Net", ICLR 2015.
Saliency maps help to understand trained models. Given a trained network $f$ and a fixed input $X$, we compute the gradient of the network prediction $f(X) \in \mathbb{R}^k$ with respect to the input:
$$\frac{\partial\,\max(f(X))}{\partial X} \qquad (1)$$
This determines the parts of the input that have the highest effect on the prediction when changed. Guided back-propagation with rectified linear units only back-propagates positive error signals:
$$\delta^{l-1} = \delta^{l}\,\mathbb{1}_{x>0}\,\mathbb{1}_{\delta^{l}>0}$$
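
A minimal PyTorch sketch of guided back-propagation via backward hooks. This is a generic reimplementation of the technique, not the authors' code, and it assumes the model's ReLUs are not in-place:

```python
import torch
import torch.nn as nn

def guided_backprop_saliency(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Saliency via guided back-propagation: at every ReLU, only positive
    error signals flowing to positive activations are passed back."""
    handles = []

    def relu_hook(module, grad_input, grad_output):
        # Standard ReLU backward already applies 1_{x>0}; clamping the
        # incoming gradient additionally enforces 1_{delta>0}.
        return (torch.clamp(grad_input[0], min=0.0),)

    for m in model.modules():
        if isinstance(m, nn.ReLU):
            handles.append(m.register_full_backward_hook(relu_hook))

    x = x.clone().requires_grad_(True)
    out = model(x)
    out.max().backward()            # gradient of the top prediction w.r.t. input
    saliency = x.grad.detach().clone()

    for h in handles:               # restore the model's normal backward pass
        h.remove()
    return saliency

net = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 3))
print(guided_backprop_saliency(net, torch.randn(1, 10)).shape)  # (1, 10)
```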

Net Debugging

Failure Analysis Continued
The network pays attention to note heads but does not seem to be pitch-sensitive. However, exploiting the temporal relations inherent in music could fix the problem!

RECURRENT NEURAL NETWORKS!

RNN Training Examples

RNN Learning Curves
[Plot: training and validation loss over 100 epochs for the convolutional model (more_conv_musescore) and its recurrent variant (rnn_more_conv_musescore); the best validation loss is about 1.36.]

HIDDEN MARKOV MODELS (HMMS)

Hidden Markov Models
Enforce spatial and temporal structure on top of the single-time-step predictions of the score-following model.

HMM Design
- States: positions in the sheet image, with transition probabilities (0.75 / 0.25 in the slide's sketch)
- Observations: the local network predictions, mapped onto the global sheet image
- Apply an HMM filtering / tracking algorithm (see the sketch below)
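
A minimal NumPy sketch of the filtering step under these assumptions: the state space is a simplified line of sheet positions, the network outputs are used directly as observation likelihoods, and the transition model mirrors the 0.75 / 0.25 stay/advance sketch on the slide:

```python
import numpy as np

def hmm_filter(obs_probs: np.ndarray, stay: float = 0.75) -> np.ndarray:
    """Forward (filtering) pass over N sheet positions for T frames.
    obs_probs: (T, N) per-frame network outputs used as observation
    likelihoods (a simplification for illustration)."""
    T, N = obs_probs.shape
    belief = np.full(N, 1.0 / N)            # uniform prior over positions
    path = np.empty(T, dtype=int)
    for t in range(T):
        # Predict: stay with prob. `stay`, advance one position otherwise.
        pred = stay * belief
        pred[1:] += (1.0 - stay) * belief[:-1]
        pred[-1] += (1.0 - stay) * belief[-1]   # absorb at the last position
        # Update with the observation likelihoods, then normalize.
        belief = pred * obs_probs[t]
        belief /= belief.sum()
        path[t] = int(np.argmax(belief))        # filtered position estimate
    return path

# Example: 5 frames, 10 positions, noisy observations around a moving target
rng = np.random.default_rng(0)
obs = rng.random((5, 10)) * 0.1
for t, pos in enumerate([0, 0, 1, 2, 2]):
    obs[t, pos] += 1.0
print(hmm_filter(obs))
```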

HMM Demo
W.A. Mozart, Piano Sonata K545, 1st movement; HMM tracker on top of the multi-modal convolutional network.

CONCLUSIONS

Conclusions
- Learning multi-modal representations in the context of music audio and sheet music is a challenging application.
- Multi-modal convolutional networks are the right direction.
- However, many open problems remain: learning temporal relations from the training data; real audio and real performances (asynchronous onsets, pedal, and varying dynamics); more training data!

Data Augmentation
- Image augmentation: sliding crops of the sheet-image snippet (180 px out of 200 px in the slide's sketch), paired with the same spectrogram; see the sketch below
- Audio augmentation: different tempi and sound fonts
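
A minimal sketch of the image-side augmentation, assuming it amounts to random horizontal crops of a wider snippet (in practice the bucket targets would have to be shifted consistently with the crop):

```python
import numpy as np

def random_sheet_crop(sheet: np.ndarray, crop_width: int = 180) -> np.ndarray:
    """Randomly crop a narrower window (e.g. 180 of 200 px) out of a wider
    sheet snippet, so the network sees the same notes at shifted positions."""
    max_offset = sheet.shape[1] - crop_width
    x0 = np.random.randint(0, max_offset + 1)
    return sheet[:, x0:x0 + crop_width]

snippet = np.random.rand(40, 200)           # hypothetical 40 x 200 px snippet
print(random_sheet_crop(snippet).shape)     # (40, 180)
```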

AUDIO - SHEET MUSIC CROSS-MODALITY RETRIEVAL

The Task
Our goal: find a common vector representation of both audio and sheet music (a low-dimensional embedding). Why we would like this: to make the two modalities directly comparable.

Cross-Modality Retrieval Network
Two views (sheet music and audio) feed into a shared embedding layer trained with a ranking loss, which optimizes the similarity (in embedding space) between corresponding audio and sheet-image snippets.

Model Details and Optimization
- CCA embedding layer, trained with a pairwise ranking loss
- 32-dimensional embedding
- The loss encourages an embedding space where the distance between matching samples is lower than the distance between mismatching samples (see the sketch below).
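
A minimal PyTorch sketch of a pairwise ranking (hinge) loss on cosine similarity. The margin value and the use of all in-batch mismatches as negatives are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(sheet_emb, audio_emb, margin: float = 0.7):
    """Matching (sheet, audio) pairs sit on the diagonal of the similarity
    matrix; all off-diagonal pairs are treated as mismatches."""
    s = F.normalize(sheet_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    sim = s @ a.t()                          # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)            # similarity of matching pairs
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    # Push each mismatch at least `margin` below its matching pair.
    loss = (margin - pos + sim).clamp(min=0) * mask
    return loss.sum() / mask.sum()

sheet = torch.randn(8, 32)                   # 32-dimensional embeddings
audio = torch.randn(8, 32)
print(pairwise_ranking_loss(sheet, audio))
```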

Cross-Modality Retrieval
Retrieval is done by nearest-neighbour search on cosine distance in the embedding space. From the audio-query point of view: blue dots are the embedded candidate sheet-music snippets; the red dot is the embedding of the audio query.
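
A minimal NumPy sketch of this retrieval step (embedding dimensions and data are placeholders):

```python
import numpy as np

def retrieve(query_emb: np.ndarray, candidates: np.ndarray, k: int = 5):
    """Nearest-neighbour retrieval by cosine distance: embed the audio query,
    then rank all candidate sheet-music snippets in the shared space."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    dist = 1.0 - c @ q                        # cosine distance to the query
    return np.argsort(dist)[:k]               # indices of the k best matches

sheet_embs = np.random.randn(1000, 32)        # embedded candidate snippets
audio_query = np.random.randn(32)             # embedded audio query
print(retrieve(audio_query, sheet_embs))
```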