Structured training for large-vocabulary chord recognition. Brian McFee* & Juan Pablo Bello


Small chord vocabularies. Typically a supervised learning problem: frames → chord labels (N, C:maj, C:min, C#:maj, C#:min, D:maj, D:min, ..., B:maj, B:min). 1-of-K classification models are common, with 25 classes: N + (12 min) + (12 maj). Hidden Markov models, deep convolutional networks, etc.; optimize accuracy, log-likelihood, etc. Implicit training assumption: all mistakes are equally bad.

Large chord vocabularies. Classes are not well-separated: C:7 = C:maj + m7; C:sus4 vs. F:sus2. The class distribution is non-uniform (distribution of the 1217 dataset, by chord quality): maj 52.53%, min 13.63%, 7 10.05%, ..., hdim7 0.17%, dim7 0.07%, minmaj7 0.04%. Rare classes are hard to model.

Some mistakes are better than others: some confusions are very bad, others not so bad. This implies that chord space is structured!

Our contributions Deep learning architecture to exploit structure of chord symbols Improve accuracy in rare classes Preserve accuracy in common classes Bonus: package is online for you to use!

Chord simplification. All classification models need a finite, canonical label set. Vocabulary simplification process:
a. Ignore inversions: G♭:9(*5)/3 → G♭:9(*5)
b. Ignore added and suppressed notes: G♭:9(*5) → G♭:9
c. Template-match to nearest quality: G♭:9 → G♭:7
d. Resolve enharmonic equivalences: G♭:7 → F♯:7
Simplification is lossy! (but all chord models do it)
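The simplification steps above can be sketched in a few lines. This is a hypothetical illustration: the label parsing, the (reduced) template set, and the tie-breaking rule are my assumptions, not the authors' code.

```python
# Hypothetical sketch of the vocabulary simplification pipeline (not the
# authors' implementation): strip inversions, strip added/suppressed notes,
# then template-match the remaining quality against the canonical set.
import re

# Pitch-class templates (intervals relative to the root) for a few canonical
# qualities; the full model uses 14 of these.
TEMPLATES = {
    "maj":  {0, 4, 7},
    "min":  {0, 3, 7},
    "7":    {0, 4, 7, 10},
    "maj7": {0, 4, 7, 11},
    "min7": {0, 3, 7, 10},
}

# An extended quality and its pitch-class set, used only as example input.
EXTENDED = {"9": {0, 2, 4, 7, 10}}

ENHARMONIC = {"Gb": "F#", "Db": "C#", "Ab": "G#", "Eb": "D#", "Bb": "A#"}

def simplify(label):
    root, quality = label.split(":")
    quality = quality.split("/")[0]            # (a) ignore inversions
    quality = re.sub(r"\(.*?\)", "", quality)  # (b) ignore added/suppressed notes
    notes = EXTENDED.get(quality) or TEMPLATES.get(quality)
    # (c) template-match to the nearest canonical quality
    # (max overlap; break ties by fewest mismatched pitch classes)
    quality = max(TEMPLATES,
                  key=lambda q: (len(TEMPLATES[q] & notes),
                                 -len(TEMPLATES[q] ^ notes)))
    root = ENHARMONIC.get(root, root)          # (d) resolve enharmonic equivalences
    return f"{root}:{quality}"

print(simplify("Gb:9(*5)/3"))  # -> F#:7, matching the slide's example
```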

14 × 12 + 2 = 170 classes. 14 qualities: min, maj, dim, aug, min6, maj6, min7, minmaj7, maj7, 7, dim7, hdim7, sus2, sus4. 12 roots: C, C#, ..., B. Plus two special classes: N = no chord (e.g., silence), X = out of gamut (e.g., power chords).

Structural encoding. Represent chord labels as binary encodings. The encoding is lossless* and structured: similar chords with different labels will have similar encodings; dissimilar chords will have dissimilar encodings. Learning problem: predict the encoding from audio, and learn to decode it into chord labels. (*up to octave-folding)
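A minimal sketch of what such an encoding might look like, assuming the root/pitch/bass decomposition shown on the later architecture slide. The interval table and the "no chord" convention (index 12) are illustrative, not the authors' exact format.

```python
# Assumed structured encoding of a chord label: a 13-way root indicator,
# a 12-d pitch-class (chroma) multi-label vector, and a 13-way bass
# indicator. Index 12 in root/bass means "no chord".
import numpy as np

ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
QUALITY_INTERVALS = {"maj": [0, 4, 7], "min": [0, 3, 7], "7": [0, 4, 7, 10]}

def encode(label):
    root, pitches, bass = np.zeros(13), np.zeros(12), np.zeros(13)
    if label == "N":                      # no chord
        root[12] = bass[12] = 1
        return root, pitches, bass
    name, quality = label.split(":")
    r = ROOTS.index(name)
    root[r] = 1
    for iv in QUALITY_INTERVALS[quality]:
        pitches[(r + iv) % 12] = 1        # octave-folded chroma
    bass[r] = 1                           # root position (inversions ignored)
    return root, pitches, bass

# Similar chords share most of their encoding:
# C:maj and C:7 differ in a single pitch class.
r1, p1, b1 = encode("C:maj")
r2, p2, b2 = encode("C:7")
print(int(np.abs(p1 - p2).sum()))  # -> 1
```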

The big idea Jointly estimate structured encoding AND chord labels Full objective = root loss + pitch loss + bass loss + decoder loss
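The full objective reads as a plain sum of per-head losses: cross-entropy for the multiclass root/bass/chord heads and binary cross-entropy for the multilabel pitch head. A sketch with assumed, unweighted terms, not the authors' exact implementation:

```python
# Illustrative joint objective: root loss + pitch loss + bass loss +
# decoder (chord) loss, each on its own output head.
import numpy as np

def xent(y_true, y_pred, eps=1e-9):
    """Categorical cross-entropy for a one-hot target."""
    return -np.sum(y_true * np.log(y_pred + eps))

def bxent(y_true, y_pred, eps=1e-9):
    """Binary cross-entropy, summed over the 12 pitch classes."""
    return -np.sum(y_true * np.log(y_pred + eps)
                   + (1 - y_true) * np.log(1 - y_pred + eps))

def full_objective(targets, predictions):
    return (xent(targets["root"], predictions["root"])        # root loss
            + bxent(targets["pitches"], predictions["pitches"])  # pitch loss
            + xent(targets["bass"], predictions["bass"])      # bass loss
            + xent(targets["chord"], predictions["chord"]))   # decoder loss

# Perfect predictions drive every term to (near) zero.
t = {"root": np.eye(13)[0], "pitches": np.zeros(12),
     "bass": np.eye(13)[0], "chord": np.eye(170)[0]}
t["pitches"][[0, 4, 7]] = 1
print(abs(full_objective(t, t)) < 1e-6)  # -> True
```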

Model architectures. Input: constant-Q spectral patches. Per-frame outputs: Root [multiclass, 13], Pitches [multilabel, 12], Bass [multiclass, 13], Chords [multiclass, 170]. Convolutional-recurrent architecture (encoder-decoder), trained end-to-end.

Encoder architecture. Hidden state at frame t: h(t) ∈ [-1, +1]^D. Suppress transients → encode frequencies → contextual smoothing.

Decoder architectures.
1. Chords = logistic regression (LR) from encoder state. Frames are independently decoded: y(t) = softmax(W h(t) + β)
2. Decoding = GRU + LR. Frames are recurrently decoded: h2(t) = Bi-GRU[h](t), y(t) = softmax(W h2(t) + β)
3. Chords = LR from encoder state + root/pitch/bass. Frames are independently decoded with structure: y(t) = softmax(Wr r(t) + Wp p(t) + Wb b(t) + Wh h(t) + β)
4. All of the above
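The simplest decoder variant is just a per-frame softmax regression over the 170 chord classes from the encoder state. A sketch with random placeholder weights (the state size D = 256 is an assumption, not taken from the slides):

```python
# Per-frame logistic-regression decoder: y(t) = softmax(W h(t) + beta).
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
D, K = 256, 170                # assumed encoder state size; chord vocabulary
W = rng.normal(size=(K, D))    # decoder weights (placeholder values)
beta = np.zeros(K)             # decoder bias

h_t = rng.normal(size=D)       # encoder hidden state at frame t
y_t = softmax(W @ h_t + beta)  # per-frame chord posterior

print(y_t.shape, round(float(y_t.sum()), 6))  # -> (170,) 1.0
```

The structured variant (3) simply concatenates the root/pitch/bass head outputs with h(t) before the same softmax.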

What about root bias? Quality and root should be independent, but the data is inherently biased. Solution: data augmentation with muda [McFee, Humphrey, Bello 2015]. Pitch-shift the audio and annotations simultaneously; each training track gets ±6 semitone shifts. As a result, all qualities are observed in all root positions, and all roots, pitches, and bass values are observed.
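muda transforms the audio and the annotations together; the annotation half of a pitch shift amounts to rotating the chord root by the same number of semitones. Sketched with a hypothetical helper (this is not muda's API):

```python
# Label side of pitch-shift augmentation: transposing the audio by n
# semitones must transpose every chord root (and bass) by n as well.
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def transpose(label, n):
    if label in ("N", "X"):             # no-chord / out-of-gamut are unaffected
        return label
    root, quality = label.split(":")
    return ROOTS[(ROOTS.index(root) + n) % 12] + ":" + quality

# +/- 6 semitone shifts put every quality under every root.
print([transpose("G:7", n) for n in (-2, 0, 3)])  # -> ['F:7', 'G:7', 'A#:7']
```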

Evaluation. 8 configurations: ± data augmentation, ± structured training, 1 vs. 2 recurrent layers. 1217 recordings (Billboard + Isophonics + MARL corpus), 5-fold cross-validation. Baseline models: DNN [Humphrey & Bello, 2015], KHMM [Cho, 2014].

Results (CR1: 1 recurrent layer; CR2: 2 recurrent layers; +A: data augmentation; +S: structured encoding). Data augmentation (+A) is necessary to match baselines.

Results: structured training (+S) and deeper models improve over baselines.

Results: improvements are bigger on the harder metrics (7ths and tetrads).

Results: substantial gains in maj/min and MIREX metrics; CR2+S+A wins on all metrics.

Error analysis: quality confusions. Errors tend toward simplification, reflecting the maj/min bias in the training data. Simplified-vocabulary accuracy: 63.6%.

Summary. Structured training helps; deeper is better; data augmentation is critical (pip install muda). Rare classes are still hard; we probably need new data.

Thanks! Questions? Implementation is online https://github.com/bmcfee/ismir2017_chords pip install crema brian.mcfee@nyu.edu https://bmcfee.github.io/

Extra goodies

Error analysis: CR2+S+A vs CR2+A Reduction of confusions to major Improvements in rare classes: aug, maj6, dim7, hdim7, sus4

Learned model weights Layer 1: Harmonic saliency Layer 2: Pitch filters (sorted by dominant frequency)

Training details. Keras / TensorFlow + pescador; ADAM optimizer. Early stopping @20 and learning-rate reduction @10, determined by decoder loss. 8 seconds per patch, 32 patches per batch, 1024 batches per epoch.

Inter-root confusions Confusions primarily toward P4/P5

Inversion estimation. For each detected chord segment: find the most likely bass note; if that note is within the detected quality, predict it as the inversion. Implemented in the crema package. Inversion-sensitive metrics are ~1% lower than inversion-agnostic.
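The rule on this slide can be sketched as follows. This is an assumed formulation: real chord labels write inversions as scale degrees (e.g. /3, /5), while this sketch reports the semitone interval above the root for simplicity.

```python
# Post-hoc inversion rule (illustrative): take the most likely bass pitch
# class from the bass head; if it belongs to the chord's quality template,
# report it as the inversion.
import numpy as np

ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
QUALITY_INTERVALS = {"maj": [0, 4, 7], "min": [0, 3, 7], "7": [0, 4, 7, 10]}

def estimate_inversion(label, bass_posterior):
    """bass_posterior: length-12 vector of bass pitch-class probabilities."""
    root, quality = label.split(":")
    r = ROOTS.index(root)
    b = int(np.argmax(bass_posterior))      # most likely bass pitch class
    interval = (b - r) % 12                 # semitones above the root
    if interval in QUALITY_INTERVALS[quality] and interval != 0:
        return f"{label}/{interval}"        # bass inside quality: inversion
    return label                            # root position (or bass outside quality)

post = np.zeros(12); post[4] = 1.0          # bass head is confident on E
print(estimate_inversion("C:maj", post))    # -> C:maj/4 (E in the bass)
```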

Pitches as chroma