Experimenting with Musically Motivated Convolutional Neural Networks

Jordi Pons 1, Thomas Lidy 2 and Xavier Serra 1
1 Music Technology Group, Universitat Pompeu Fabra, Barcelona
2 Institute of Software Technology and Interactive Systems, TU Wien

J. Pons, T. Lidy and X. Serra, January 23, 2017, Experimenting with Musically Motivated CNN, 1 / 18
Outline

Motivation
Input representation: why log-mel spectrograms for CNNs?

Input spectrograms for CNNs: interpretable filters in time and frequency!

[Spectrogram figure: N = 80 mel bands, 1.88 seconds]
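To make the input representation concrete, here is a minimal numpy sketch of a log-mel spectrogram front-end: STFT magnitudes pooled through a simplified triangular mel filterbank, then log-compressed. The parameters (sample rate, FFT size, hop) are illustrative assumptions, not the exact settings used in the paper.

```python
import numpy as np

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=80):
    """Toy log-mel spectrogram: windowed STFT magnitude -> mel filterbank -> log."""
    # Frame the signal and apply a Hann window.
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frames.append(signal[start:start + n_fft] * np.hanning(n_fft))
    mag = np.abs(np.fft.rfft(np.array(frames), axis=1))  # (time, n_fft//2 + 1)

    # Simplified triangular mel filterbank (linear spacing on the mel scale).
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l: fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c: fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)

    # Return an (n_mels, time) matrix, i.e. frequency on the vertical axis.
    return np.log(mag @ fb.T + 1e-6).T

# 1.88 seconds of noise at 16 kHz -> an 80-band log-mel spectrogram.
x = np.random.randn(int(1.88 * 16000))
S = log_mel_spectrogram(x)
```

The resulting M-by-N matrix (M = 80 mel bands, N time frames) is the 2-D input that the filter shapes below convolve over.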
Squared/rectangular filters: inertia from computer vision. Are these efficiently representing the relevant local stationarities in music data?
CNNs modeling music data?

Three filter shapes are discussed in the following:
1. Squared/rectangular filters
2. Temporal filters
3. Frequency filters

Which musical concepts can these filters model?
Temporal filters (1-by-n): setting m = 1 yields NO frequency features, only temporal cues. Filters can learn musical concepts at different time-scales depending on how n is set, e.g.:
- Onsets, attack-sustain-release: n ≪ N.
- BPM and rhythm patterns: n ≈ N.
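A toy numpy illustration of why a 1-by-n filter captures temporal cues only: a short first-difference kernel (n = 2) convolved along the time axis of a synthetic spectrogram responds exactly at an energy onset, regardless of which frequency bands carry it. The spectrogram and kernel are illustrative, not from the paper.

```python
import numpy as np

# Toy spectrogram: 80 mel bands x 100 frames, with an "onset" at frame 50
# (energy jumps in every band at that frame).
M, N = 80, 100
S = np.zeros((M, N))
S[:, 50:] = 1.0

# A 1-by-n temporal filter (a first-difference kernel, n = 2): it slides
# only along time, so it detects energy changes, not pitch content.
kernel = np.array([1.0, -1.0])

# Convolve each frequency band independently ('valid' mode).
response = np.array([np.convolve(S[m], kernel, mode="valid") for m in range(M)])

# 'valid' convolution shifts indices by one frame, so the peak lands at 49.
onset_frame = int(np.argmax(response.mean(axis=0)))
```

Longer kernels (n approaching N) would instead span whole bars, which is how BPM and rhythm patterns become learnable.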
Frequency filters (m-by-1): setting n = 1 yields frequency features but NO temporal cues. Filters can learn different aspects depending on how m is set, e.g.:
- Timbre + note: m = M. Similar to NMF!
- Timbre: m < M.
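The NMF analogy can be sketched in a few lines of numpy: an m = M filter cannot slide in frequency, so its "convolution" reduces to a dot product per frame, and the filter behaves like a fixed spectral template whose per-frame activations are recovered, exactly as NMF factors a spectrogram into bases and activations. The template and gains below are synthetic.

```python
import numpy as np

M, N = 80, 100
rng = np.random.default_rng(0)

# An m = M "spectral template" (e.g. the timbre + pitch of one note),
# analogous to an NMF basis vector.
template = rng.random(M)

# Toy spectrogram: the template is active in frames 20-39 with rising gain.
S = np.zeros((M, N))
gains = np.linspace(0.5, 1.5, 20)
S[:, 20:40] = template[:, None] * gains[None, :]

# An M-by-1 filter has a single valid position in frequency, so convolving
# it over the spectrogram is a dot product per frame: a 1-by-N activation
# curve, playing the role of an NMF activation row.
activations = template @ S  # shape (N,)
```

An m < M filter, by contrast, can slide across frequency and so matches the same spectral shape at any pitch (the pitch-invariance question revisited later).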
Squared/rectangular filters (m-by-n): learning time and frequency features at the same time. Filters can learn different aspects depending on how m and n are set, e.g.:
- Bass or kick drum modeling: m ≪ M and n ≪ N. Represented by a sub-band for a short time.
- Cymbal or snare drum modeling: m = M and n ≪ N. Broad in frequency with a fixed decay time.
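The geometric consequence of each filter shape can be checked with the standard 'valid' convolution output-size formula (no padding, stride 1). The spectrogram size below (80 bands, 250 frames) is an assumption for illustration; note in particular that an (M,1) filter collapses the frequency axis to a single position, i.e. it cannot convolve in frequency.

```python
import numpy as np

def valid_conv2d_output_shape(input_shape, filter_shape):
    """Output size of a 'valid' 2-D convolution: (M - m + 1, N - n + 1)."""
    M, N = input_shape
    m, n = filter_shape
    return (M - m + 1, N - n + 1)

M, N = 80, 250  # 80 mel bands, 250 time frames (illustrative)
shapes = {
    "temporal (1,60)":    valid_conv2d_output_shape((M, N), (1, 60)),
    "frequency (32,1)":   valid_conv2d_output_shape((M, N), (32, 1)),
    "frequency (M,1)":    valid_conv2d_output_shape((M, N), (M, 1)),
    "rectangular (12,8)": valid_conv2d_output_shape((M, N), (12, 8)),
}
```

The (32,1) filter keeps 49 frequency positions to slide over, while the (M,1) filter has exactly one; this difference is what the pitch-invariance experiment below probes.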
Architectures
Architectures

Joint architecture: Time-Frequency
Black-box and time architectures

Architecture        Filter shape (m,n)   # param.    Accuracy: mean ± std (10-fold cross-validation)
Black-box           (12,8)               3,275,312   87.25 ± 3.39 %
Time                (1,60)               7,336       81.79 ± 4.72 %
Frequency           (32,1)               3,368       59.59 ± 5.82 %
Frequency           (36,1)               2,472       57.88 ± 5.38 %
Frequency           (40,1)               1,576       52.43 ± 5.63 %
Time-Frequency      (1,60)-(32,1)        196,816     86.54 ± 4.29 %
Time-FrequencyInit  (1,60)-(32,1)        196,816     87.68 ± 4.44 %

1. Genre classification task with the Ballroom dataset:
   - 93.12% (Marchand et al.) using time and frequency cues
   - 82.3% (Gouyon et al.) using only time cues
   - 15.9% predicting the most probable class
2. The Black-box and Time-Frequency architectures achieve results inferior to the state of the art.
3. The Time architecture achieves results equivalent to its baseline.
4. The Frequency architectures outperform the random baseline: frequency features are more relevant than expected.
Pitch invariance experiment

(Results table as in the previous slide.)

Designing the filters such that they can convolve in frequency (m < M) helps predict the Ballroom classes.
- Is this because the filters are pitch invariant?
- Or because the network is more expressive?
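The pitch-invariance hypothesis can be illustrated with a toy numpy experiment: an m < M frequency filter matched to a small spectral shape produces the same maximum response wherever that shape sits on the frequency axis, because the filter can slide in frequency. The three-component "partials" pattern is purely illustrative.

```python
import numpy as np

M = 80
pattern = np.array([1.0, 0.5, 0.25])  # a small spectral shape (toy partials)

def spectrum_with_pattern_at(band):
    """Place the pattern at a given frequency band, i.e. a given 'pitch'."""
    s = np.zeros(M)
    s[band:band + len(pattern)] = pattern
    return s

# An m < M frequency filter matched to the pattern.
filt = pattern.copy()

def max_response(spectrum):
    # Slide the filter across frequency ('valid' correlation) and keep the peak.
    return max(spectrum[i:i + len(filt)] @ filt
               for i in range(M - len(filt) + 1))

# The same pattern at two different pitches yields the same max response.
low = max_response(spectrum_with_pattern_at(10))
high = max_response(spectrum_with_pattern_at(40))
```

An m = M filter has no positions to slide over, so a pitch-shifted version of its template would score strictly lower.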
Time-Frequency experiment

(Results table as in the previous slide.)

1. Pre-initializing the weights is beneficial.
2. With a much less expressive network, Time-FrequencyInit, we obtain accuracy results similar to Black-box: we propose an efficient way of representing music data.
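The pre-initialization step can be sketched as a weight copy: the joint model's time and frequency branches start from the first-layer filters of the already-trained single-branch models instead of random values. The filter counts (16 per branch) and the dictionary-of-arrays representation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical trained first-layer weights of the single-branch models:
# 16 temporal filters of shape (1,60) and 16 frequency filters of shape (32,1).
time_filters = rng.standard_normal((16, 1, 60))
freq_filters = rng.standard_normal((16, 32, 1))

# The Time-Frequency model keeps both branches. Random initialization...
joint = {
    "time_branch":      rng.standard_normal((16, 1, 60)),
    "frequency_branch": rng.standard_normal((16, 32, 1)),
}

# ...is replaced by the trained filters (Time-FrequencyInit).
joint["time_branch"] = time_filters.copy()
joint["frequency_branch"] = freq_filters.copy()
```

Training then proceeds as usual; only the starting point changes.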
Conclusions

We have discussed how several CNN filter shapes can model musical aspects.
We have proposed some musically motivated deep learning architectures.
We have shown that these can achieve competitive results on predicting the Ballroom dataset classes:
- understanding what the architectures are learning
- an efficient way of representing musical concepts
Thanks!

Reproduce our research: the code is available at github.com/jordipons/cbmi2016/
Ballroom dataset

- 698 songs, 30 seconds long each
- 8 music genres: cha-cha-cha, jive, quickstep, rumba, samba, tango, Viennese waltz and slow waltz