Perceptual dimensions of short audio clips and corresponding timbre features Jason Musil, Budr El-Nusairi, Daniel Müllensiefen Department of Psychology, Goldsmiths, University of London
Question How do listeners make similarity judgements when comparing very short music clips? Assumption: for really short clips, sound is most important.
Background Related real-world behaviours: scanning the radio dial; browsing a large music collection; instant recognition of favourite songs. Psychological studies on short audio clips: genre identification (Gjerdingen & Perrot, 2008; Mace et al., 2012); identification of artist and title (Krumhansl, 2010).
A New Test The Sound Similarity Test: part of the Goldsmiths Musical Sophistication test battery*. Tests the ability to extract and compare information from short and unfamiliar audio clips => familiarity with breadth of musical styles. No correlation with formal musical training. No use of genre labels, no use of rating scales => nonverbal similarity classification task. Clips chosen as representative (All Music Guide) pieces from 4 meta-styles (Rentfrow & Gosling, 2003). * Documentation and online implementation at: http://www.gold.ac.uk/music-mind-brain/gold-msi/
Test Interface
Data Test variants: BBC implementation: 16 clips (400 ms) from 4 genres (n = 138,469). Lab implementations (differing by clip length and excerpt, n ~ 130): A400, A800, B400, B800.
Data for acoustic analysis B800 data set: 800 ms clips from 4 genres, n = 131. Raw data: 131 similarity matrices (16 x 16). The aggregate similarity matrix is congruent with genre provenance.
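The aggregation of per-participant matrices can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each participant's sorting is encoded as a binary co-classification matrix (1 if two clips were placed in the same group), which the slides do not specify. The function names and the toy data (4 clips, 2 participants) are hypothetical.

```python
from itertools import combinations


def cooccurrence_matrix(groups, n_clips):
    """Binary similarity matrix for one participant's sorting.

    groups: list of lists of clip indices, one inner list per group.
    Assumed encoding: 1 if two clips share a group, else 0.
    """
    sim = [[0] * n_clips for _ in range(n_clips)]
    for group in groups:
        for i in group:
            sim[i][i] = 1  # a clip is trivially grouped with itself
        for i, j in combinations(group, 2):
            sim[i][j] = sim[j][i] = 1
    return sim


def aggregate(matrices):
    """Element-wise mean: proportion of participants grouping i with j."""
    n = len(matrices[0])
    m = len(matrices)
    return [[sum(mat[i][j] for mat in matrices) / m for j in range(n)]
            for i in range(n)]


def to_dissimilarity(agg):
    """Turn proportions into dissimilarities (1 - similarity) for MDS."""
    return [[1.0 - value for value in row] for row in agg]


# Toy data: 4 clips sorted by 2 hypothetical participants.
p1 = cooccurrence_matrix([[0, 1], [2, 3]], 4)
p2 = cooccurrence_matrix([[0, 1, 2], [3]], 4)
agg = aggregate([p1, p2])
dis = to_dissimilarity(agg)
```

With the real data the same functions would be applied to 131 matrices of size 16 x 16; the resulting dissimilarity matrix is the kind of input an MDS routine expects.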
Data for acoustic analysis
Question How do listeners make similarity judgements when comparing very short music clips? Are there any acoustic features that explain listeners' judgements?
Analysis Plan 1. Extract main perceptual dimensions from similarity data: multi-dimensional scaling. 2. Describe music clips by acoustic features: the Echonest timbre descriptors. 3. Predict perceptual coordinates from acoustic features: statistical regression.
1. Multi-dimensional Scaling Non-metric MDS; 3-dimensional solution; stress: 6.52
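For reference, a common definition of stress in non-metric MDS is Kruskal's stress-1, given below. The slides do not say which stress measure was used, so treating the reported value as stress-1 expressed in percent is an assumption.

```latex
\[
\text{Stress-1} \;=\;
\sqrt{\frac{\sum_{i<j} \bigl(d_{ij} - \hat{d}_{ij}\bigr)^{2}}
           {\sum_{i<j} d_{ij}^{2}}}
\]
```

Here $d_{ij}$ is the distance between clips $i$ and $j$ in the fitted 3-dimensional configuration, and $\hat{d}_{ij}$ are the disparities, i.e. the values obtained by monotone regression of the distances on the observed dissimilarities.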
2. Echonest Timbre Descriptors Based on short audio segments (2-5). 12 coefficients per segment, partially interpretable (1 = loudness, 2 = brightness, 3 = flatness, 4 = attack, etc.). Acoustic features per clip: 12 means and 12 variances, plus the number of segments.
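The per-clip feature vector described above (12 means, 12 variances and the segment count, 25 values in total) could be assembled roughly as below. This is a sketch under assumptions: the input is taken to be a plain list of 12-element coefficient lists, the population variance is used (the slides do not say which variance estimator was applied), and `clip_features` is a hypothetical helper, not part of any Echonest API.

```python
from statistics import mean, pvariance

N_COEFFS = 12  # timbre coefficients per segment, as stated on the slide


def clip_features(segments):
    """Summarise one clip's segments as a 25-element feature vector.

    segments: list of 12-element lists, one per audio segment.
    Returns 12 means + 12 population variances + [number of segments].
    """
    if not segments or any(len(s) != N_COEFFS for s in segments):
        raise ValueError("each segment needs exactly 12 coefficients")
    columns = list(zip(*segments))  # one tuple per coefficient
    means = [mean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means + variances + [len(segments)]


# Toy clip with 3 made-up segments (values are not real timbre data).
toy_segments = [
    [float(k) for k in range(12)],
    [float(k + 2) for k in range(12)],
    [float(k + 4) for k in range(12)],
]
features = clip_features(toy_segments)
```

Stacking the vectors of all 16 clips gives a 16 x 25 feature matrix, which is the k > n situation addressed in the next step.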
3. Predicting Perceptual Dimensions from Acoustic Features Problems: k > n (16 objects, 25 features); (potentially) non-linear relationships. Solution 1: Random Forest regression (non-linear, handles k > n, sensitive to small influences and complex interactions).
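Why k > n is a problem for ordinary regression can be shown with a small numerical sketch. It uses random stand-in data of the same shape as in the study (16 objects, 25 features) and NumPy's least-squares solver; nothing here reproduces the actual analysis. Because the system is underdetermined, a linear model fits even pure noise perfectly, so the in-sample fit says nothing about predictive value.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clips, n_features = 16, 25

# Stand-in data: random "features" and a target that is pure noise.
X = rng.normal(size=(n_clips, n_features))
y = rng.normal(size=n_clips)

# Minimum-norm least-squares solution of the underdetermined system.
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# With more features than objects, the residuals vanish (up to rounding)
# even though y is unrelated to X.
max_abs_residual = float(np.max(np.abs(fitted - y)))
```

This is the motivation for methods that regularise or subsample features, and for judging models by cross-validated rather than in-sample fit.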
Random Forest Variable importance according to random forest. Predicting dim. 1 (R² = .058); predicting dim. 2 (R² = .215); predicting dim. 3 (R² = .263)
Problems Interpretation / documentation of Echonest timbre coefficients 5 and 9 unclear. No simple model for perceptual dimension 3.
Solution 2 Partial Least Squares regression (handles k > n very well, linear, no interactions). Use well-documented features: two variants of MFCCs plus stand-alone features (spectral centroid, spectral spread, flatness, etc.) from Queen Mary's Vamp plug-in set.
Partial Least Squares Regression Results from cross-validation: 27% of variance explained in perceptual dimension 1; dimensions 2 and 3 not explained at all. Both sets of MFCCs are the most important features.
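A cross-validated PLS fit of this kind can be sketched as below. This is an illustrative NumPy implementation of single-response PLS (PLS1, NIPALS-style deflation) with leave-one-out cross-validation, not the authors' code. The number of components, the cross-validation scheme, the feature count (40) and the random toy data are all assumptions, so the resulting numbers bear no relation to the 27% reported above.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit PLS1 by successive deflation; returns (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)      # weight vector (unit length)
        t = Xk @ w                     # scores
        tt = t @ t
        p = Xk.T @ t / tt              # X loadings
        c = (yk @ t) / tt              # regression of y on the scores
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - c * t                # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean


def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean


def r_squared(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


def loo_cv_r2(X, y, n_components):
    """Leave-one-out cross-validated R^2 (can be negative)."""
    preds = np.empty_like(y)
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        model = pls1_fit(X[train], y[train], n_components)
        preds[i] = pls1_predict(model, X[i:i + 1])[0]
    return r_squared(y, preds)


# Toy data: 16 clips, 40 made-up features, one MDS-like coordinate.
rng = np.random.default_rng(1)
X = rng.normal(size=(16, 40))
y = X[:, 0] + 0.5 * rng.normal(size=16)

train_r2 = r_squared(y, pls1_predict(pls1_fit(X, y, 1), X))
cv_r2 = loo_cv_r2(X, y, 1)
```

Comparing `train_r2` with `cv_r2` shows why a cross-validated figure is the more honest one when k > n: the in-sample fit is always positive, while the cross-validated value can drop to zero or below when a dimension is not predictable from the features.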
Summary Perceptual dimensions 1 and 2 are closely related to Echonest timbre coefficients 5 and 9. Perceptual dimension 1 is predicted by an ensemble of MFCC features. Model fits are moderate at best (R² ~ .25).
Conclusions Human similarity judgements of short audio clips show some commonality with a statistical model using acoustic features. At least one dimension isn't explained at all by low-level features => higher-order information (e.g. rhythm, harmony, instrumentation, style) or even valence and arousal? => There is a lot more in short music clips that low-level features can't capture.
Next Steps Try alternatives for acoustic modelling. Construct a new test based on the acoustic model: select a new pool of sound clips; design easy and difficult versions of the sorting task according to acoustic model distance (on dimension 1); test participants with easy/difficult versions and in genres they are un/familiar with.
Item-wise analysis