Data-Driven Solo Voice Enhancement for Jazz Music Retrieval


Data-Driven Solo Voice Enhancement for Jazz Music Retrieval
Stefan Balke¹, Christian Dittmar¹, Jakob Abeßer², Meinard Müller¹
¹International Audio Laboratories Erlangen, ²Fraunhofer Institute for Digital Media Technology IDMT

Vision
[Figure-only slide.]

Problem Setting
Retrieval scenario: given a monophonic transcription of a jazz solo as query, find the corresponding document in a collection of polyphonic music recordings. Query and collection are compared by a matching procedure that operates on solo-voice-enhanced representations.

Solo Voice Enhancement
1. Model-based approach [Salamon13]
2. Data-driven approach [Rigaud16, Bittner15]

Our data-driven approach: use a DNN to learn the mapping from a polyphonic TF representation to a monophonic TF representation. A simple matching sketch is given below.
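To make the matching step concrete, here is a minimal diagonal-matching sketch in Python. This is a generic stand-in, not necessarily the exact procedure used on the slides; the function name and the binary-matrix input format are assumptions.

```python
import numpy as np

def diagonal_matching(query, document):
    """Score every time shift of a binary query TF matrix (pitch x time)
    against a document's (enhanced) TF representation by summed overlap."""
    n_pitches, n_query = query.shape
    _, n_doc = document.shape
    assert n_doc >= n_query, "document must be at least as long as the query"
    scores = np.empty(n_doc - n_query + 1)
    for m in range(n_doc - n_query + 1):
        scores[m] = np.sum(query * document[:, m:m + n_query])
    return scores  # a document's retrieval score is, e.g., scores.max()
```

Ranking the collection by each document's best shift score yields the ranked list that is later evaluated via the mean reciprocal rank.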

Overview
(Photo: Philippe Halsman, Louis Armstrong)
1. Background on the Data
2. DNN Architecture & Training
3. Evaluation within Retrieval Scenario

Weimar Jazz Database (WJD) [Pfleiderer17]
- 299 transcribed jazz solos of monophonic instruments
- 570 min of audio recordings
- Transcriptions specify a musical pitch for physical time instances; beat and chord annotations are included as well.
Thanks to the Jazzomat research team: M. Pfleiderer, K. Frieler, J. Abeßer, W.-G. Zaddach

DNN Training
Input: log-frequency STFT frame (120 semitones, 10 Hz feature rate), taken from the TF representation of a jazz solo recording.
Target: TF representation containing only the solo instrument's pitch activations (120 semitones, 10 Hz feature rate).
[Figure: input and target TF representations of an excerpt; frequency axis in Hz (log-spaced, ca. 9 Hz to 8372 Hz), time axis in seconds.]
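The slides specify the front end only as "120 semitones, 10 Hz feature rate", so the following is one plausible reconstruction, assuming semitone bands aligned to MIDI pitches 0-119 (consistent with the figure's 9 Hz to 8372 Hz axis) and simple bin pooling of an STFT; function name and parameters are illustrative.

```python
import numpy as np
import librosa

def log_freq_spectrogram(path, sr=22050, n_bins=120, feature_rate=10):
    """Pool STFT magnitude bins into 120 semitone bands (~ MIDI pitches
    0-119) at a 10 Hz feature rate."""
    y, sr = librosa.load(path, sr=sr)
    hop = sr // feature_rate                     # 2205 samples -> 10 frames/s
    S = np.abs(librosa.stft(y, n_fft=4096, hop_length=hop))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=4096)
    fmin = librosa.midi_to_hz(0)                 # ~ 8.18 Hz
    semitone = np.round(12 * np.log2(np.maximum(freqs, 1e-9) / fmin)).astype(int)
    out = np.zeros((n_bins, S.shape[1]))
    for b in range(n_bins):
        out[b] = S[semitone == b].sum(axis=0)    # bands without FFT support stay zero
    return out                                   # shape: (120, n_frames)
```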

DNN Architecture
Notation: x = input, ŷ = output, y = target, L = loss.
Basic feed-forward DNN with five hidden layers (ReLU activations) and weight matrices W1, ..., W5; input, hidden, and output dimensions are all 120.
Loss: L = MSE(ŷ, y).
Training is applied layer-wise [Bengio06], extended in [Uhlich15].
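A compact PyTorch sketch of such a network follows; the final linear output layer is an assumption (the slide lists ReLU units and dimensions but does not spell out the output stage), and the class name is invented for illustration.

```python
import torch.nn as nn

class SoloEnhancer(nn.Module):
    """Feed-forward DNN mapping a polyphonic TF frame (120 bins) to
    monophonic pitch activations (120 bins)."""
    def __init__(self, n_bins=120, n_hidden=5):
        super().__init__()
        layers = []
        for _ in range(n_hidden):
            layers += [nn.Linear(n_bins, n_bins), nn.ReLU()]
        layers.append(nn.Linear(n_bins, n_bins))  # assumed linear output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```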

Layer-Wise Training [Uhlich15]
1. Initialize the first layer's weights (W1) and bias (b1) with linear least squares (LLS); train for 600 epochs.
2. Interpret the output of the trained network as input to the next stage; keep the trained weights.
3. Append the next layer; initialize W2 and b2 with LLS; train for 600 epochs.
4. Repeat until all layers have been added.
A sketch of this procedure is given below.
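A minimal sketch of the procedure, assuming the whole growing stack is retrained at each stage (the slide's "keep weights" step is ambiguous on this point); helper names are hypothetical, and training is full-batch here for brevity, with a mini-batch variant shown after the next slide.

```python
import torch
import torch.nn as nn

def lls_init(X, Y):
    """Linear least-squares fit Y ~ X @ W^T + b to initialize a new layer."""
    Xa = torch.cat([X, torch.ones(X.shape[0], 1)], dim=1)
    sol = torch.linalg.lstsq(Xa, Y).solution      # (in+1, out)
    return sol[:-1].T.contiguous(), sol[-1]       # W: (out, in), b: (out,)

def train_layerwise(X, Y, n_layers=5, epochs=600, lr=0.1, momentum=0.9):
    layers, H = [], X                             # H: current representation
    for _ in range(n_layers):
        W, b = lls_init(H, Y)
        lin = nn.Linear(H.shape[1], Y.shape[1])
        with torch.no_grad():
            lin.weight.copy_(W)
            lin.bias.copy_(b)
        layers += [lin, nn.ReLU()]
        model = nn.Sequential(*layers)
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):                   # full-batch for brevity
            opt.zero_grad()
            loss_fn(model(X), Y).backward()
            opt.step()
        with torch.no_grad():
            H = model(X)                          # output feeds the next stage
    return nn.Sequential(*layers)
```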

Training Details
- Total duration: 570 min; active solo frames: 62%
- Split: 10-fold cross-validation (training set: 63%, validation set: 27%, test set: 10%)
- Loss: mean squared error
- Optimizer: stochastic gradient descent, mini-batch size = 100 frames (10 s), learning rate = 10⁻¹, momentum = 0.9
- 600 epochs per layer (3000 epochs in total)
(A mini-batch training sketch with these hyperparameters follows.)
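A sketch of one training stage with the hyperparameters listed above; `train_stage` is a hypothetical helper that could replace the inner loop of `train_layerwise` shown earlier, and the learning rate of 0.1 assumes the garbled "10 01" on the slide means 10⁻¹.

```python
import torch

def train_stage(model, X, Y, epochs=600, batch=100, lr=0.1, momentum=0.9):
    """Train one layer-wise stage with SGD on mini-batches of 100 frames
    (10 s at the 10 Hz feature rate), minimizing the MSE loss."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        perm = torch.randperm(X.shape[0])         # reshuffle frames each epoch
        for i in range(0, X.shape[0], batch):
            idx = perm[i:i + batch]
            opt.zero_grad()
            loss_fn(model(X[idx]), Y[idx]).backward()
            opt.step()
    return model
```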

Training Loss
[Figure series: training-loss curves as hidden layers are added one at a time; each added layer contributes a further 600 epochs, i.e., 600, 1200, 1800, 2400, and 3000 epochs in total for one to five hidden layers.]

Qualitative Evaluation
[Figure: input, target, and DNN output TF representations for a solo excerpt; frequency axis in Hz (log-spaced), time axis in seconds.]

Experiment: Jazz Music Retrieval
Setup: Weimar Jazz Database queries are matched against the collection using the matching procedure with solo voice enhancement.
- 30 queries with a duration of 25 s for each fold
- 1 relevant document in the database per query
- Additional queries obtained by shortening to [20, 15, 10, 8, 6, 5, 4, 3] s
- Evaluation measure: mean reciprocal rank (MRR); see the sketch below
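With a single relevant document per query, the MRR is the average over queries of 1/rank of that document. For example, ranks 1, 2, and 4 give an MRR of (1 + 1/2 + 1/4) / 3 ≈ 0.58. A minimal implementation (hypothetical function and argument names):

```python
import numpy as np

def mean_reciprocal_rank(ranked_lists, relevant_ids):
    """MRR for queries with exactly one relevant document each.
    ranked_lists[i]: document ids for query i, best match first.
    relevant_ids[i]: the id of query i's single relevant document."""
    reciprocal_ranks = [
        1.0 / (docs.index(rel) + 1)               # 1-based rank of the hit
        for docs, rel in zip(ranked_lists, relevant_ids)
    ]
    return float(np.mean(reciprocal_ranks))

# Example: three queries whose relevant documents rank 1st, 2nd, and 4th.
print(mean_reciprocal_rank([[7, 3], [3, 7], [1, 2, 9, 7]], [7, 7, 7]))  # ~0.583
```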

Experiment: Jazz Music Retrieval (Results)
[Figure: retrieval results (MRR) comparing three systems: baseline chroma-based matching [Mueller15], Melodia quantized F0 trajectory [Salamon13], and the proposed DNN.]

Conclusions
- Data-driven approaches seem to be beneficial for solo voice enhancement.
- Data-driven and model-based approaches show similar performance in a retrieval scenario.

Future Work
- Investigate scenarios where the predominance assumption is violated, e.g., walking bass transcription.
- Train instrument-specific models, e.g., for implicit instrument recognition.
- Utilize the DNN's output for other tasks (e.g., F0 tracking).

Audio examples, trained models, and data: https://www.audiolabs-erlangen.de/resources/mir/2017-icassp-solovoiceenhancement
Contact: stefan.balke@audiolabs-erlangen.de

AES International Conference on Semantic Audio, Erlangen, Germany, June 2017, featuring Masataka Goto, Mark Plumbley, and Udo Zölzer as keynote speakers. More details: http://www.aes.org/conferences/2017/semantic/

References
[Salamon13] Justin Salamon, Joan Serrà, and Emilia Gómez, "Tonal representations for music retrieval: From version identification to query-by-humming," International Journal of Multimedia Information Retrieval, vol. 2, no. 1, pp. 45-58, 2013.
[Rigaud16] F. Rigaud and M. Radenen, "Singing voice melody transcription using deep neural networks," in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), New York City, USA, 2016, pp. 737-743.
[Bittner15] Rachel M. Bittner, Justin Salamon, Slim Essid, and Juan Pablo Bello, "Melody extraction by contour classification," in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), Málaga, Spain, 2015, pp. 500-506.
[Bengio06] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle, "Greedy layer-wise training of deep networks," in Proc. of the Annual Conference on Neural Information Processing Systems (NIPS), 2006, pp. 153-160.
[Uhlich15] Stefan Uhlich, Franck Giron, and Yuki Mitsufuji, "Deep neural network based instrument extraction from music," in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 2135-2139.
[Pfleiderer17] The Jazzomat Research Project, database download, last accessed 2016/02/17, http://jazzomat.hfm-weimar.de
[Mueller15] Meinard Müller, Fundamentals of Music Processing, Springer Verlag, 2015.