Music Complexity Descriptors
Matt Stabile
June 6th, 2008
Musical Complexity as a Semantic Descriptor
Modern digital audio collections need new criteria for categorization and searching.
Applicable to:
- Music collections
- Media asset management systems (sound effects libraries)
Problem: collections are too large for manual assignment of descriptors by humans.
Solution: automatic computation of descriptors from the audio file itself.
Existing Applications
SIMAC: Semantic Interaction with Music Audio Contents
Existing Applications
FindSounds Palette
- Uses Sound Matching Technology to search by a sound prototype
- Can also search by name, description, category, genre, source, copyright, format, size, number of channels, resolution, sample rate, duration, key, and tempo
- Could use further descriptors to refine searches
Defining Music Complexity
Sebastian Streich, Music Complexity: A Multi-Faceted Description of Audio Content (2007)
- Complexity of music is a high-level, intuitive attribute that can be experienced directly or indirectly by the active listener.
- Streich defines musical complexity as the property of a musical unit that determines how much effort the listener has to put into following and understanding it.
Complexity Facets
Finnäs (1989) states that unusual harmonies and timbres, irregular tempi and rhythms, unexpected tone sequences, and variations in volume raise the level of perceived complexity in music.
Facets of music are at least partly independent:
- Complex rhythms with no melodic voice
- Unexpected volume and timbre changes over a simple melody and chord sequence
Useful to analyze these facets separately to obtain better complexity descriptors.
Complexity Facets: Song Level (Streich)
Acoustic complexities
- Dynamic: loudness evolution within a track
- Spatial: disparity between the stereo channels
Tonal complexity
- Melodic and harmonic complexity
- Most difficult due to imperfect transcription
Timbral complexity
- Timbral texture of a track, number of different instruments
Rhythmic complexity
- Danceability
Methods and Algorithms
Timbre complexity methods (Streich):
Unsupervised HMMs using MFCCs
- Produce a finite set of timbre models for a given input signal
- Complexity measure: number of models created = number of different instrumental textures
- Too computationally expensive (repeated training on top of feature extraction)
- Too unsupervised: are the HMMs that get created perceptually meaningful?
LZ77 compression gain
- Uses timbre symbols to apply entropy estimation from information theory
- Models human memory: 3-5 s chunks of audio are used
- Timbre symbols: Bass, Presence, Spectral Roll-off, Spectral Flatness Measure
- Complexity measure: low compression factor = low source entropy = low complexity (a sketch of the idea follows below)
- Problem: only computationally practical with coarse quantization, which limits accuracy. Coarse quantization can guarantee that different symbols correspond to different perceptual impressions, but not the reverse.
Spectral envelope matching
- Chosen method, based on the observation that it takes a change of at least 4 dB in the higher harmonics and 10 dB in the low harmonics to distinguish the timbre of two tones (Winckel).
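The compression-gain idea can be illustrated with a short sketch. This is not Streich's implementation: it uses Python's zlib (DEFLATE, which combines LZ77 with Huffman coding) as a stand-in for a plain LZ77 coder, and the four-level quantization and combined symbol encoding are illustrative assumptions rather than the actual parameters from the thesis.

```python
import zlib

import numpy as np


def compression_factor(features, n_levels=4):
    """Illustrative proxy for the LZ77 compression-gain idea.

    features : 2-D array, frames x timbre descriptors (e.g. bass,
               presence, spectral roll-off, spectral flatness).

    Each frame is coarsely quantized into one symbol; the symbol string
    is compressed and the compression factor (compressed size / original
    size) is returned. Repetitive, low-entropy material compresses well,
    giving a low factor, which is read as low timbral complexity.
    """
    feats = np.asarray(features, dtype=float)
    lo = feats.min(axis=0)
    span = np.ptp(feats, axis=0) + 1e-12
    # Coarse per-descriptor quantization into n_levels bins.
    q = np.minimum(((feats - lo) / span * n_levels).astype(int), n_levels - 1)
    # Combine the per-descriptor bins into one symbol per frame
    # (assumes only a handful of descriptors, so symbols fit in 16 bits).
    symbols = np.ravel_multi_index(q.T, (n_levels,) * feats.shape[1])
    raw = symbols.astype(np.uint16).tobytes()
    # zlib's DEFLATE (LZ77 + Huffman) stands in for a plain LZ77 coder.
    return len(zlib.compress(raw, 9)) / max(len(raw), 1)
```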
Timbre/Dynamics Algorithm
Loudness estimation & spectral envelope matching
Utilizes functions from Pampalk's Matlab MA Toolbox
Pre-Processing
- FFT with Hann window
- Normalized power spectrum
- Terhardt's outer-ear frequency weighting
- Bark scale (critical bands of hearing)
- Heuristic spreading function (spectral masking)
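A minimal sketch of this pre-processing chain follows, assuming standard textbook values: the Bark band edges and the small symmetric spreading kernel are generic stand-ins, not the exact tables used in the MA Toolbox.

```python
import numpy as np

# Upper edges of the 24 Bark critical bands (Hz), per Zwicker.
BARK_EDGES = np.array([100, 200, 300, 400, 510, 630, 770, 920, 1080,
                       1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700,
                       4400, 5300, 6400, 7700, 9500, 12000, 15500])


def terhardt_weight(f_hz):
    """Terhardt's outer-ear weighting in dB (f in Hz, f > 0)."""
    fk = f_hz / 1000.0
    return (-3.64 * fk ** -0.8
            + 6.5 * np.exp(-0.6 * (fk - 3.3) ** 2)
            - 1e-3 * fk ** 4)


def bark_loudness(frame, sr, n_fft=1024):
    """One frame of the chain: Hann-windowed FFT, normalized power
    spectrum, outer-ear weighting, Bark-band grouping, and a simple
    spreading (masking) step. Returns per-band levels in dB."""
    win = np.hanning(len(frame))
    spec = np.fft.rfft(frame * win, n_fft)
    power = np.abs(spec) ** 2 / np.sum(win) ** 2          # normalized power
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    power[1:] *= 10 ** (terhardt_weight(freqs[1:]) / 10)  # outer-ear weighting
    # Sum power into Bark bands.
    band_idx = np.searchsorted(BARK_EDGES, freqs[1:])
    bands = np.zeros(len(BARK_EDGES) + 1)
    np.add.at(bands, band_idx, power[1:])
    # Heuristic spreading: smear energy across neighbouring bands.
    spread = np.convolve(bands, [0.05, 0.2, 1.0, 0.2, 0.05], mode="same")
    return 10 * np.log10(spread + 1e-12)                   # dB per band
```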
Sonogram Plots Sone scale: linear correspondence to human loudness perception
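For the sone scale, the common convention (also followed in MA Toolbox-style processing) treats the weighted band levels as phon and maps them to sone, where every 10 phon above 40 doubles the loudness. The helper below is a sketch of that standard mapping, not the toolbox code itself.

```python
import numpy as np


def phon_to_sone(phon):
    """Map loudness level (phon) to loudness (sone). Above 40 phon,
    +10 phon doubles the sone value; below 40 phon a power law is
    used. The resulting scale corresponds roughly linearly to
    perceived loudness, which is why sonograms use it."""
    phon = np.asarray(phon, dtype=float)
    return np.where(phon >= 40,
                    2.0 ** ((phon - 40.0) / 10.0),
                    (np.maximum(phon, 0.0) / 40.0) ** 2.642)
```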
Dynamic Complexity
- Relates to properties of the loudness evolution: the abruptness and rate of changes in dynamic level
- Dynamic range and time scope are important
- After pre-processing: a total loudness estimate is computed for each frame, with M_max = the band with maximum loudness
- Complexity: average fluctuation of successive loudness values (sketched below)
Example complexity values: a) 0.134  b) 0.247  c) 0.304  d) 0.488
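A simplified reading of "average fluctuation of successive loudness values" is sketched below. The mean-loudness normalization is an assumption added for illustration and may differ from the exact formulation in Streich's thesis.

```python
import numpy as np


def dynamic_complexity(total_loudness):
    """Simplified dynamic-complexity sketch: the average absolute
    fluctuation between successive per-frame loudness values,
    normalized by the mean loudness so quiet and loud tracks are
    comparable. `total_loudness` is a 1-D array of per-frame
    loudness estimates (e.g. summed sone values)."""
    L = np.asarray(total_loudness, dtype=float)
    if len(L) < 2 or L.mean() <= 0:
        return 0.0
    return float(np.mean(np.abs(np.diff(L))) / L.mean())
```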
Timbral Complexity: Spectral Envelope Matching
- Avoids the hard quantization required by LZ77 by relying on human perception of timbre (at least 4 dB in the higher harmonics and 10 dB in the low harmonics to distinguish the timbre of two tones)
- Compares changes in the spectral envelope rather than in single harmonics
- Complexity measure: counts the amount of timbral change in a given temporal window and then extracts a complexity number
- Band-wise comparison with preceding frames, 6 dB threshold, reaching back 80 ms to 4 s (approximately the span of human memory)
Band loudness similarities
- Complexity measure: % of frames with similarity = 0 (sketched below)
- a) Bagpipe: 8.5%  b) Symphony: 11.3%  c) Rap: 4.8%
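The band-similarity measure can be sketched as follows, assuming a single 6 dB threshold applied to every Bark band and the 80 ms to 4 s look-back window stated above. The function names and the "all bands within threshold" similarity rule are illustrative choices, not the exact implementation.

```python
import numpy as np


def band_similarity(band_db, frame_rate, db_thresh=6.0,
                    t_min=0.08, t_max=4.0):
    """Per-frame similarity counts via spectral-envelope matching.

    band_db    : frames x Bark-bands loudness in dB (from pre-processing)
    frame_rate : analysis frame rate in Hz

    A past frame counts as 'similar' when every band differs by less
    than `db_thresh` dB. Only frames between 80 ms and 4 s back
    (roughly the span of short-term memory) are considered.
    """
    band_db = np.asarray(band_db, dtype=float)
    n = len(band_db)
    back_min = max(1, int(round(t_min * frame_rate)))
    back_max = int(round(t_max * frame_rate))
    sim = np.zeros(n, dtype=int)
    for i in range(n):
        lo = max(0, i - back_max)
        hi = i - back_min + 1
        if hi <= lo:
            continue
        diff = np.abs(band_db[lo:hi] - band_db[i])
        sim[i] = int(np.sum(np.all(diff < db_thresh, axis=1)))
    return sim


def timbral_complexity(sim):
    """Streich-style song-level measure: percentage of frames that
    have no similar frame within the look-back window."""
    return 100.0 * float(np.mean(np.asarray(sim) == 0))
```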
Complexity on Samples
- Started off aiming to calculate the timbral complexity and dynamic complexity of a sample individually
- The influence of dynamics on timbre became apparent: dynamic level and performance technique (e.g. vibrato, pizzicato, spiccato) strongly affect the timbre of the tone
- Proposed method: use spectral envelope matching, combining the average fluctuation of the similar-frame counts with the number of sign changes in their derivative
Timbre/Dynamics Algorithm: Sample Level
- Median filtering
- Average fluctuation of similar frames / derivative sign-change counting
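A hedged sketch of this sample-level stage, assuming the per-frame similar-frame counts from the band-similarity step are the input; the median kernel size and the exact definitions of "average fluctuation" and "derivative sign change" below are illustrative choices.

```python
import numpy as np
from scipy.signal import medfilt


def sample_level_timbre_dynamics(sim_counts, kernel=5):
    """Sketch of the proposed sample-level measure: median-filter the
    per-frame similar-frame counts, then report (a) the average
    fluctuation between successive values and (b) how often the sign
    of the first derivative changes (direction reversals)."""
    s = medfilt(np.asarray(sim_counts, dtype=float), kernel_size=kernel)
    d = np.diff(s)
    avg_fluct = float(np.mean(np.abs(d))) if len(d) else 0.0
    signs = np.sign(d)
    signs = signs[signs != 0]                # ignore flat segments
    sign_changes = int(np.sum(signs[1:] != signs[:-1]))
    return avg_fluct, sign_changes
```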
Results
Reference samples: triangle wave, sine wave (no perceived change in timbre)
- Avg. frame fluctuation = 1
- Derivative sign changes = 0
[Figures: loudness evolution, sonogram of sine oscillator, band similarity plot]
Results
Samples: cello and harp playing at varying dynamic levels
- Avg. frame fluctuation = 0.8
- Derivative sign changes = 97
[Figures: loudness evolution, sonogram of cello playing mf, band similarity plot]
Results
Timbral complexity results:

  Sample      Dynamic   Avg. Frame Fluct.   Deriv. Sign Changes
  Sine           -            1                    0
  Triangle       -            1                    0
  Saw (Arp)      -            3.9                  53
  Cello          p            1.05                 130
                 mf           0.8                  97
                 f            0.76                 159
  Harp           p            2.4                  403
                 mf           4.06                 635
                 f            6.7                  461

[Figure: band loudness similarities for cello]
- Avg. frame fluctuation reveals how extreme the timbral jumps are
- Derivative sign-change count indicates the overall number of changes in timbre
Categories:
- No timbral change = sine/triangle
- Periodic timbral change = saw
- Measurable non-periodic timbral change = cello/harp
- No similar frames (timbre always changing) = some crazy signal
Conclusions
- Need to find a robust calculation of the timbral complexity number (e.g. Streich's timbral complexity: % of frames with similarity = 0)
- These measures can definitely be useful as attributes for refined sorting of samples
- Listener surveys would be needed to find the correlation with human perception, with the measures then tuned accordingly