Deep learning for music data processing

A personal (re)view of the state-of-the-art
Jordi Pons, www.jordipons.me
Music Technology Group, DTIC, Universitat Pompeu Fabra, Barcelona.
31st January 2017

What problems do we care about in music technology research?
- (Automatically) cataloging large-scale music collections.
- Music recommendation.
- Similarity, e.g. Shazam.
- Synthesis: instruments, singing voice...
Some of them can be approached with deep learning.

Why might deep learning be useful for music data processing?
- Music is hierarchical in frequency (note, chord) and time (onset, rhythm), and deep learning naturally allows this representation.
- Contextual analysis:
  - Short time-scale features: CNNs, e.g. notes, chords.
  - Long time-scale features: RNNs, e.g. structure.
- Unsupervised learning: potential of learning from any audio!
- Time/frequency invariant operations: max-pool.
- Any input: spectrogram, MFCCs, self-similarity matrices, video, text.
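The max-pool point can be illustrated with a toy example. This is only a sketch: the activation values, the pooling size of 4 frames and the helper name `max_pool_1d` are assumptions made for illustration, not taken from the slides.

```python
def max_pool_1d(values, pool_size):
    """Non-overlapping max-pooling over a 1-D sequence (stride = pool_size)."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]

# Hypothetical "onset activation" over 8 time frames.
activation = [0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7, 0.0]
# The same events, delayed by one frame.
shifted = [0.0] + activation[:-1]

pooled = max_pool_1d(activation, 4)
pooled_shifted = max_pool_1d(shifted, 4)
print(pooled, pooled_shifted)  # both are [0.9, 0.7]
```

The invariance is only local: a shift that moves a peak across a pooling-window boundary does change the pooled output.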

Acronyms:
- MLP: multi-layer perceptron, a feed-forward neural network.
- RNN: recurrent neural network.
- LSTM: long short-term memory.
- CNN: convolutional neural network.
Assumed notion of deep learning:
- It is deep when several non-linearities are applied to the input.
- The parameters of the network are learnt, typically by using back-propagation.
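As a rough sketch of this notion of depth, the forward pass of a small MLP chains several affine maps, each followed by a non-linearity. The layer sizes, the ReLU choice and the random (untrained) weights are assumptions for illustration; in practice the parameters would be fitted with back-propagation.

```python
import random

def dense(x, weights, biases):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(v):
    """Element-wise non-linearity."""
    return [max(0.0, a) for a in v]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers):
    """Affine map + ReLU for every hidden layer, then a linear output layer."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return dense(x, weights, biases)

rng = random.Random(0)
sizes = [4, 8, 8, 2]  # hypothetical: 4 input features, two hidden layers, 2 outputs
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
output = forward([0.1, -0.2, 0.3, 0.4], layers)
```

Here two non-linearities are applied between input and output, which already counts as deep under the definition above.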

Chronology: the big picture

Used for:
- Classification: genre, artist, singing-voice detection, music-speech. Pons et al., Lidy et al.
- Auto-tagging. Dieleman et al., Choi et al.
- Key estimation. Humphrey et al., Korzeniowski et al.
- Feature extraction (unsupervised). Hamel et al., Lee et al.
- Music similarity estimation. Schlüter et al.
- Music recommendation. Aäron van den Oord et al.
- Onset/boundary detection. Böck et al., Durand et al.
- Source separation. Huang et al., Miron et al.
- Singing voice synthesis. Blaauw et al.

Chronology: the big picture

LSTMs for automatic music composition with symbolic data
Eck and Schmidhuber. Learning The Long-Term Structure of the Blues. ICANN 02: "...compositions are quite pleasant"
Some examples of music composed by LSTMs:
1. Bob Sturm plays: The Mal's Copporim.
2. LSTMetallica: Drums from Metallica. Choi et al.
3. LSTM Realbook: Generation of Jazz chord progressions.

CNNs interpretation and filter shapes discussion
S. Dieleman. Content-based music recommendation @ Spotify. http://benanne.github.io/2014/08/05/spotify-cnns.html
The CNN is learning hierarchical (music) features:
- L1: vibrato, vocal thirds, bass drums, A/Bb pitch, A/Am chord.
- L3: Christian rock, Chinese pop, 8-bit, multimodal.

Lee et al. Unsupervised feature learning for audio classification using convolutional deep belief networks. NIPS 09
Figure: visualization of some randomly selected first-layer convolutional filters trained with music.

Lee et al. Unsupervised feature learning for audio classification using convolutional deep belief networks. NIPS 09
Figure: visualization of the four different phonemes and their corresponding first-layer convolutional filters trained with speech.

Choi et al. Explaining Deep CNNs on Music Classification. arXiv:1607.02444
Figure: filters of the first CNN layer trained for genre classification.
- Layer 1: onsets.
- Layer 2: onsets, bass, harmonics, melody.
- Layer 3: onsets, melody, kick, percussion.
- Layer 4: harmonic structures, notes, vertical-horizontal lines.
- Layer 5: textures, harmo-rhythmic patterns.
3x3 filters are limiting the representational power of the 1st layer!
Does it make sense then to use computer vision architectures? As in: Hershey et al. CNN architectures for large-scale audio classification. ICASSP 17

Pons et al. Experimenting with musically motivated CNNs. CBMI 16
- Square/rectangular filters (m-by-n): kick, notes; m < M and n < N (filter smaller than the M-by-N input).
- Temporal filters (1-by-n): onsets, patterns... very efficient!
- Frequency filters (m-by-1): timbre, chords... interpretable!
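To make the three filter shapes concrete, the sketch below slides each of them over a toy spectrogram and records the output size and the number of weights per filter. The filter sizes 12x8, 1x80 and 40x1 are the ones quoted on a later slide; the 40-by-80 input (frequency bins by time frames), the 'valid' convolution and the all-ones values are assumptions for illustration only.

```python
def conv2d_valid(spec, kernel):
    """'Valid' 2-D cross-correlation of an M-by-N input with an m-by-n filter."""
    M, N = len(spec), len(spec[0])
    m, n = len(kernel), len(kernel[0])
    return [[sum(spec[i + a][j + b] * kernel[a][b]
                 for a in range(m) for b in range(n))
             for j in range(N - n + 1)]
            for i in range(M - m + 1)]

def ones(rows, cols):
    """Placeholder matrix; real inputs would be spectrogram magnitudes."""
    return [[1.0] * cols for _ in range(rows)]

M, N = 40, 80  # hypothetical input: 40 frequency bins by 80 time frames
spec = ones(M, N)

filter_shapes = {"rectangular": (12, 8), "temporal": (1, 80), "frequency": (40, 1)}
shapes = {}
for name, (m, n) in filter_shapes.items():
    out = conv2d_valid(spec, ones(m, n))
    # (output rows, output columns, weights per filter)
    shapes[name] = (len(out), len(out[0]), m * n)

print(shapes)
```

Under these assumptions the temporal filter keeps the full frequency resolution and collapses time, while the frequency filter does the opposite; both use fewer weights than the 12x8 filter.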

Pons et al. Experimenting with musically motivated CNNs. CBMI 16
Pons & Serra. Designing efficient architectures for modeling temporal features with CNNs. ICASSP 17

In collaboration with Thomas Lidy: CNNs (12x8, 1x80, 40x1). Figure legend: white > black.

Source Separation

Po-Sen Huang et al. Singing-Voice Separation from Monaural Recordings using Deep Recurrent Neural Networks. ISMIR 14
- 3 deep layers (2nd recurrent) estimating 2 sources simultaneously.
- Joint modelling of DRNN + mask with a discriminative cost.
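The masking step can be sketched as follows. This assumes a soft time-frequency mask of the form |est_1| / (|est_1| + |est_2|) applied to the mixture magnitude spectrogram; the function name, the `eps` constant and the toy numbers are hypothetical, and the network that produces the two estimates is not shown.

```python
def soft_mask_separate(mixture, est_1, est_2, eps=1e-8):
    """Split a magnitude spectrogram between two sources with a soft mask.

    mixture, est_1, est_2: lists of rows (frequency bins by time frames).
    The mask and its complement sum to one, so the two outputs add up
    to the mixture in every time-frequency bin.
    """
    out_1, out_2 = [], []
    for mix_row, row_1, row_2 in zip(mixture, est_1, est_2):
        r1, r2 = [], []
        for z, a, b in zip(mix_row, row_1, row_2):
            mask = abs(a) / (abs(a) + abs(b) + eps)  # eps avoids division by zero
            r1.append(mask * z)
            r2.append((1.0 - mask) * z)
        out_1.append(r1)
        out_2.append(r2)
    return out_1, out_2

# Toy 2x2 magnitudes: mixture plus rough network estimates of each source.
mixture = [[1.0, 2.0], [4.0, 0.5]]
voice_est = [[0.5, 0.5], [3.0, 0.0]]
accomp_est = [[0.5, 1.5], [1.0, 0.5]]
voice, accomp = soft_mask_separate(mixture, voice_est, accomp_est)
```

Because the mask is computed from both estimates, an error in one source directly affects the other, which is one motivation for optimising the network and the mask jointly.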

Chandna et al. Monoaural audio source separation using deep convolutional neural networks. LVA-ICA 17
Presented to the Signal Separation Evaluation Campaign 2017.

End-to-end learning
S. Dieleman and B. Schrauwen. End-to-end learning for music audio. ICASSP 14
Learning frequency-selective filters similar to a mel filter bank.
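For reference, the spacing of a mel filter bank can be sketched with the commonly used formula mel = 2595 log10(1 + f / 700). The number of filters and the 0 to 8000 Hz range are assumptions for illustration; the sketch only computes centre frequencies, not the triangular filters themselves.

```python
import math

def hz_to_mel(f):
    """Common mel-scale formula (one of several variants in use)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min, f_max):
    """Centre frequencies (Hz) equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]

centers = mel_center_frequencies(10, 0.0, 8000.0)  # hypothetical settings
spacings = [b - a for a, b in zip(centers, centers[1:])]
```

The centres are dense at low frequencies and increasingly sparse at high frequencies, which is the kind of frequency selectivity the learned filters are compared against.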

Aäron van den Oord et al. Wavenet: A generative model for raw audio. arXiv:1609.03499 (2016)
Generative model for speech and music audio.

Chronology: the big picture

Limitations the academic music technology community is facing when approaching their problems with deep learning:
- Lack of annotated data.
- Lack of hardware (GPUs).
- Expertise goes to the industry.
Trends for solving the issue of annotated data:
- Collaborative effort for jointly annotating music data.
- Artificial augmentation of the annotated data.
Trends for solving hardware limitations:
- Researchers avoid end-to-end learning approaches:
  - Inputting hand-crafted features to deep networks.
  - Using non deep learning classifiers/models stacked on top of deep learning feature extractors.
- Constraining the solution space considering prior information: music nature or human audio perception.
References @ jordipons.me/lack-of-annotated-music-data-restrict-the-solution-space/
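Artificial augmentation of annotated data can be sketched as below: each annotated clip yields several perturbed copies that keep the original label. The specific perturbations (circular time shift, random gain, additive Gaussian noise), their ranges, the function name `augment` and the "guitar" label are all assumptions chosen for illustration; which transformations preserve the label depends on the task.

```python
import math
import random

def augment(waveform, rng, max_shift=100, gain_range=(0.8, 1.2), noise_std=0.005):
    """Return a perturbed copy of `waveform` (a list of float samples)."""
    # Circular time shift by a random number of samples.
    shift = rng.randint(-max_shift, max_shift) % len(waveform)
    shifted = waveform[-shift:] + waveform[:-shift]
    # Random gain plus low-level additive Gaussian noise.
    gain = rng.uniform(*gain_range)
    return [gain * s + rng.gauss(0.0, noise_std) for s in shifted]

rng = random.Random(0)
# Hypothetical annotated clip: 0.1 s of a 440 Hz tone at 16 kHz, labelled "guitar".
clip = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(1600)]
label = "guitar"

# Three augmented training examples that reuse the same annotation.
augmented = [(augment(clip, rng), label) for _ in range(3)]
```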

Imaginable research directions?
- End-to-end learning from raw audio. Aytar et al. SoundNet: Learning Sound Representations from Unlabeled Video. @ NIPS 16
- Multimodal deep processing. Slizovskaia et al. Automatic musical instrument recognition in audiovisual recordings by combining image and audio classification strategies. @ SMC 16
- Unsupervised learning such as generative models. Aäron van den Oord et al. Wavenet: A generative model for raw audio. @ arXiv:1609.03499 (2016)
- Efficient learning of long-term dependencies. Eck and Schmidhuber. Learning The Long-Term Structure of the Blues. @ ICANN 02
- Understanding which features are learnt. Pons et al. Experimenting with musically motivated convolutional NNs. @ CBMI 16

Thanks! :)