MUSI-6201 Computational Music Analysis
Part 9.1: Genre Classification
Alexander Lerch
November 4, 2015
overview
text book: Chapter 8: Musical Genre, Similarity, and Mood (pp. 151-155)
sources: slides (latex) & Matlab github repository
lecture content:
- definition of musical genre
- typical features and feature categories
- simple classifiers and basic classifier properties
introduction
- one of the oldest research topics in MIR
- classic machine learning task
- related fields: speech/music classification, instrument recognition, artist identification, music emotion recognition
applications
- large music databases: annotation, sorting, browsing, retrieving
- recommendation systems
- automatic playlist generation
- mashup generation
genre: definition
what is musical genre: clusters of musical similarity?
hard to answer in general; there are many systematic problems:
1 non-agreement on taxonomies
2 genre label scope: song, album, artist, part of a song
3 ill-defined genre labels: geographic (Indian music), historic (baroque), technical (barbershop), instrumentation (symphonic music), usage (Christmas songs)
4 taxonomy scalability: genres and subgenres evolve over time
5 non-orthogonality: several genres for one piece of music
genre: taxonomy examples
[two example taxonomy trees: the first splits audio into Speech (Male, Female, Sports, Background) and Music (Disco, Country, Hip Hop, Rock, Blues, Reggae, Pop, Metal, Classical with subgenres Choir, Orchestra, Piano, String Quartet, and Jazz with subgenres Big Band, Cool, Fusion, Piano Quartet, Swing); the second splits Music into Classical (Chamber: Piano Solo, String Quartet, Other; Orchestra: Symphonic, +Choir, +Soloist) and Non-Classical (Rock: Soft Rock, Hard Rock; Electro/Pop: Hip Hop, Techno/Dance, Pop; Jazz/Blues)]
observations with humans
1 human classification is far from perfect: 75-90 % for a limited set of classes
2 for many genres, humans need only a fraction of a second to classify: are short-time timbre features sufficient?
plots from [1], [2]
[1] S. Lippens, J.-P. Martens, T. D. Mulder, et al., "A Comparison of Human and Automatic Musical Genre Classification," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Montreal, 2004.
[2] R. O. Gjerdingen and D. Perrott, "Scanning the Dial: The Rapid Recognition of Music Genres," Journal of New Music Research, vol. 37, no. 2, pp. 93-100, Jun. 2008, ISSN: 0929-8215.
overview
Audio Signal → Feature Extraction → Classification → Genre Label
1 feature extraction: dimensionality reduction, meaningful representation
2 classification: map or convert the feature to a comprehensible domain
feature categories
high-level similarities? melody, hook lines, bass lines, harmony progression, rhythm & tempo, structure, instrumentation & timbre, ...
technical feature categories: tonal, technical, timbral, temporal, intensity
extracted features should be
- extractable (not: time envelope in polyphonic signals)
- relevant (not: pitch chroma for instrument ID)
- non-redundant
- discriminative (robust to noise)
instantaneous features
- spectral features (timbral): Spectral Centroid, MFCCs, Spectral Flux, ...
- pitch features (tonal): pitch chroma distribution/change, ...
- rhythm features (temporal): onset density, beat histogram features, ...
- statistical features (technical): standard deviation, skewness, zero crossings, ...
- intensity features: level variation, number of pauses, ...
feature extraction
1 extract instantaneous features
2 compute derived features (derivative, filtered)
3 compute long-term features & subfeatures per texture window
4 compute subfeatures per file
5 normalize subfeatures
6 (select or) transform subfeatures
7 feature vector → classifier input
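The processing chain above (steps 1-5) can be sketched as follows. This is a minimal illustration, not the lecture's Matlab code: it assumes the instantaneous features are already given as a matrix, uses a first-order difference as the only derived feature, and mean/std as the only subfeatures; the function names are hypothetical.

```python
import numpy as np

def extract_file_features(inst_feats, texture_len=100):
    """Sketch of feature-processing steps 2-4.

    inst_feats: array (num_features, num_blocks) of instantaneous features.
    Returns one subfeature vector per file.
    """
    # step 2: derived features, here simply the first-order difference
    deriv = np.diff(inst_feats, axis=1, prepend=inst_feats[:, :1])
    feats = np.vstack([inst_feats, deriv])

    # step 3: subfeatures (mean, std) per texture window
    num_win = feats.shape[1] // texture_len
    win = feats[:, :num_win * texture_len].reshape(feats.shape[0], num_win, texture_len)
    win_mean = win.mean(axis=2)
    win_std = win.std(axis=2)

    # step 4: aggregate the texture-window subfeatures over the whole file
    return np.concatenate([win_mean.mean(axis=1), win_std.mean(axis=1)])

def normalize(features):
    """Step 5: z-score normalization over the training set (columns = files)."""
    mu = features.mean(axis=1, keepdims=True)
    sd = features.std(axis=1, keepdims=True)
    return (features - mu) / np.where(sd > 0, sd, 1)
```

Steps 6-7 (feature selection or transform, assembling the classifier input) would follow on the normalized matrix.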
long term features 1/2
derived from the beat histogram [3]
[3] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293-302, Jul. 2002, ISSN: 1063-6676. DOI: 10.1109/TSA.2002.800560.
long term features 2/2
derived from the pitch histogram or pitch chroma [4]
[4] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch Histograms in Audio and Symbolic Music Information Retrieval," in Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR), Paris, 2002.
additional feature examples
- stereo features: mid-channel energy vs. side-channel energy, spectral channel differences
- features at higher semantic levels: tempo, structure, harmonic complexity, instrumentation
classification: general steps
1 define training set: annotated results
2 normalize training set
3 train classifier
4 evaluate classifier with test set
5 (adjust classifier settings, return to 4.)
classifier: rules of thumb
training set
- training set size vs. number of features:
  - training set too small → overfitting
  - feature number too large → overfitting
  - training set too noisy → underfitting
- training set not representative → bad classification performance
classifier
- poor classifier → bad classification performance → try a different classifier
features
- poor features → bad classification performance → feature selection, new and better features
- features not normalized → possibly bad classification performance → normalize feature range, mean, distribution
classifier: evaluation
define a test set for evaluation: different from the training set, otherwise same requirements
example: N-fold cross validation
1 split the training set into N parts (randomly, but preferably with an identical number of observations per class)
2 select one part as test set
3 train the classifier with all observations from the remaining N-1 parts
4 compute the classification rate for the test set
5 repeat until all N parts have been tested
6 overall result: average classification rate
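The six steps above can be sketched generically. This is an illustrative Python sketch, not part of the lecture's Matlab material; `train_fn` and `classify_fn` are hypothetical placeholders for any classifier, and the random split shown here is not stratified by class (the slide recommends an identical number of observations per class).

```python
import random

def n_fold_cross_validation(data, labels, train_fn, classify_fn, n=10):
    """Average classification rate over n folds (steps 1-6 above)."""
    idx = list(range(len(data)))
    random.shuffle(idx)                      # step 1: random split ...
    folds = [idx[i::n] for i in range(n)]    # ... into n roughly equal parts

    rates = []
    for test_idx in folds:                   # step 2: one part as test set
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        # step 3: train on the remaining n-1 parts
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        # step 4: classification rate on the held-out part
        correct = sum(classify_fn(model, data[i]) == labels[i]
                      for i in test_idx)
        rates.append(correct / len(test_idx))
    # step 6: overall result is the average rate
    return sum(rates) / n
```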
classifier: knn
training: extract reference vectors from the training set (keep the class labels)
classification: extract the test vector and set its class to the majority of the k nearest reference vectors (e.g., k = 3, 5, 7)
classifier data: all training vectors
matlab source: matlab/displayknn.m
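A minimal sketch of the kNN rule described above (illustrative Python, not the referenced Matlab code), using Euclidean distance; the classifier "data" is simply the list of labeled training vectors.

```python
from collections import Counter
import math

def knn_classify(x, refs, k=3):
    """Assign x the majority class of its k nearest reference vectors.

    refs: list of (feature_vector, class_label) pairs kept from training.
    """
    # sort all reference vectors by Euclidean distance to the test vector
    nearest = sorted(refs, key=lambda r: math.dist(x, r[0]))[:k]
    # majority vote among the k nearest labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```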
classifier: GMM
training: build a model of each class distribution as a superposition of Gaussian distributions
classification: compute the output of each class model and select the class with the highest probability
classifier data: per class and per Gaussian: mean µ, covariance, and mixture weight
matlab source: matlab/displaygmm.m
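The classification step can be sketched as below. This is a hedged illustration only: it assumes the mixture parameters (weight, mean, variance per Gaussian) have already been trained, which in practice is done with the EM algorithm, and it uses one-dimensional features for brevity.

```python
import math

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_likelihood(x, mixture):
    """Superposition of Gaussians: mixture is a list of (weight, mu, var)."""
    return sum(w * gaussian_pdf(x, mu, var) for w, mu, var in mixture)

def gmm_classify(x, class_models):
    """Select the class whose mixture assigns the highest probability."""
    return max(class_models, key=lambda c: gmm_likelihood(x, class_models[c]))
```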
classifier: SVM
training: map the features to a high-dimensional space; find the separating hyperplane (linear classification) through the maximum distance of the support vectors (data points)
classification: apply the feature transform and proceed with linear classification
classifier data: support vectors, kernel, kernel parameters
https://en.wikipedia.org/wiki/Support_vector_machine
results
classification results depend on training set, test set, and number of classes
typical range: 10 classes: 50-80 %
note: results vary largely between datasets
- ill-defined genre boundaries
- non-uniformly distributed classes
- overfitting through songs from the same album or artist
- ...
speech/music classification: baseline example
1 extract features
2 represent each file with its 2-dimensional feature vector
3 use knn to classify unknown audio files
4 evaluate classification performance
speech/music classification example: features 1/2
for each audio file:
1 split the input signal into (overlapping) blocks
2 compute 2 feature series (spectral centroid, RMS)
3 aggregate each feature series to one value:
mean of Spectral Centroid: $\mu_{\mathrm{SC}} = \frac{1}{N}\sum_n v_{\mathrm{SC}}(n)$
standard deviation of RMS: $\sigma_{\mathrm{RMS}} = \sqrt{\frac{1}{N}\sum_n \left(v_{\mathrm{RMS}}(n) - \mu_{\mathrm{RMS}}\right)^2}$
4 represent each file as the 2-dimensional vector $(\mu_{\mathrm{SC}}, \sigma_{\mathrm{RMS}})^{\mathrm{T}}$
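The two file-level features can be sketched as follows. This is an illustrative Python version, not the lecture's Matlab code, and it uses simplified textbook definitions (magnitude-weighted spectral centroid in Hz, rectangular window, no block weighting); block size and hop are assumed values.

```python
import numpy as np

def file_feature_vector(x, block_size=1024, hop=512, fs=44100):
    """Return the 2-D vector (mu_SC, sigma_RMS) for one audio signal x."""
    # step 1: split into overlapping blocks
    blocks = [x[i:i + block_size]
              for i in range(0, len(x) - block_size + 1, hop)]
    sc, rms = [], []
    for b in blocks:
        # step 2: spectral centroid (in Hz) and RMS per block
        mag = np.abs(np.fft.rfft(b))
        freqs = np.fft.rfftfreq(len(b), 1 / fs)
        sc.append((freqs * mag).sum() / max(mag.sum(), 1e-12))
        rms.append(np.sqrt((b ** 2).mean()))
    # step 3: aggregate each series to one value; step 4: 2-D vector
    return np.array([np.mean(sc), np.std(rms)])
```

For a stationary signal such as a steady sine tone, the RMS is nearly constant over blocks, so $\sigma_{\mathrm{RMS}}$ is close to zero; speech, with its pauses, yields much larger values.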
speech/music classification example: features 2/2
[scatter plot: mean spectral centroid (x-axis) vs. std of RMS (y-axis) for the music and speech files]
matlab source: matlab/displayscatter.m
speech/music classification example: training set
use a dataset annotated as speech and music; requirements:
- large compared to the number of features
- representative for the use case (diverse)
here: 110 speech files, 119 music files
extract the features for the dataset
speech/music classification example: results (knn)
confusion matrix:
           speech  music  # files
  speech     93      17     110
  music      19     100     119
classification rate: (93 + 100) / (110 + 119) = 84.2 %
single-feature classification results:
- Spectral Centroid: 56.7 %
- RMS: 85.1 %
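The classification rate above is the sum of the confusion matrix diagonal divided by the total number of files, which can be computed as a one-line sketch (hypothetical helper, rows = true class):

```python
def classification_rate(confusion):
    """Overall accuracy from a confusion matrix (rows = true class)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# confusion matrix from the slide: rows/columns = (speech, music)
cm = [[93, 17], [19, 100]]
```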
summary: lecture content
1 name three possible problems in the definition of the ground truth for genre classification
2 is it possible for genre classifiers to yield better accuracy than human experts?
3 list the feature processing steps from audio to the input of the classifier