GCT535: Sound Technology for Multimedia
Timbre Analysis
Juhan Nam, Graduate School of Culture Technology, KAIST

Outline
- Timbre Analysis
  - Definition of timbre
- Timbre Features
  - Zero-crossing rate
  - Spectral summary features
  - Mel-Frequency Cepstral Coefficient (MFCC)

What is timbre?
- Definition
  - The attribute of sensation in terms of which a listener can judge that two sounds having the same loudness and pitch are dissimilar (ANSI)
  - Tone color or quality that defines a particular sound
- Associated with classifying or identifying sound sources
  - Class: piano, guitar, singing voice, engine sound
  - Identity: Steinway Model D, Fender Stratocaster, Michael Jackson, Harley-Davidson
- Also used to describe polyphonic sounds holistically
  - For example, music or environmental sounds
  - Associated with genre, mood, or other high-level descriptions

What is timbre?
- Timbre is a vague concept
  - There is no single quantitative scale for it, as there is for loudness or pitch
  - It actually comprises multiple attributes
- Different aspects of this multiplicity
  - Acoustic attributes: temporal or spectral factors
  - Timbre space: perceptual similarity/dissimilarity
  - Semantic attributes: textual descriptions

Acoustic Attributes in Timbre Perception
- Acoustic attributes (Schouten, 1968)
  - Harmonicity: the range between tonal and noise-like character
  - Time envelope (ADSR)
  - Spectral envelope
  - Changes of spectral envelope and fundamental frequency
  - The onset of a sound differing notably from the sustained vibration
[Figures: ADSR envelope; changes of spectral envelope]

Acoustic Attributes in Timbre Perception
- Sound design problem?

Timbre Space
- Perceptual multi-dimensional attributes based on measured similarity
  - Ask human listeners to listen to a pair of sounds and rate their degree of similarity as a score
  - The similarity matrix is processed with multidimensional scaling (MDS), a dimensionality reduction algorithm that determines the timbre space
- Acoustic correlates of the three (reduced) dimensions (Grey, 1977)
  - Spectral energy distribution
  - Attack and decay time
  - Amount of inharmonic sound in the attack
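
The MDS step can be sketched in a few lines. The following is a minimal illustration using classical (Torgerson) MDS, which may differ from the MDS variant used in the cited study. The function name `classical_mds` and the random test points are hypothetical, and the sketch assumes the similarity scores have already been converted to dissimilarities (distances).

```python
import numpy as np

def classical_mds(dissimilarity, n_dims=3):
    """Embed items in n_dims dimensions from a symmetric dissimilarity matrix."""
    n = dissimilarity.shape[0]
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dissimilarity ** 2) @ centering
    # Keep the eigenvectors with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Hypothetical data: 6 "sounds" placed in a 3-D space, with Euclidean
# distances standing in for listener dissimilarity ratings.
rng = np.random.default_rng(0)
points = rng.standard_normal((6, 3))
dissimilarity = np.linalg.norm(points[:, None] - points[None, :], axis=-1)

coords = classical_mds(dissimilarity, n_dims=3)
recovered = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
```

Because the stand-in distances are exactly Euclidean in three dimensions, the pairwise distances of `coords` match the input matrix. Real listener ratings are noisy, so the embedding would only approximate them.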

Semantic Attributes
- Verbally describe different characteristics of timbre using words
  - Pratt and Doak (1976): dull vs. brilliant, cold vs. warm, pure vs. rich
  - von Bismarck (1974): dull vs. sharp, compact vs. scattered, full vs. empty, colorful vs. colorless
(Source: T. Rossing's music150 slides)

Timbre Feature Extraction
- Extracting acoustic features from signals
- Low-level acoustic features
  - Zero-crossing rate
  - Spectral summaries
  - Spectral envelope: MFCC

Zero-Crossing Rate (ZCR)
- ZCR is low for harmonic (voiced) sounds and high for noisy (unvoiced) sounds
- For simple periodic signals, it is related to the F0
[Figure panels: Voiced; Unvoiced]
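
As a rough illustration of both points, the sketch below counts sign changes between adjacent samples. The helper name `zero_crossing_rate`, the 16 kHz sampling rate, and the synthetic test signals are assumptions made for this example.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return np.mean(signs[1:] != signs[:-1])

sr = 16000                                   # assumed sampling rate in Hz
t = np.arange(sr) / sr                       # one second of sample times
tone = np.sin(2 * np.pi * 220 * t + 0.1)     # simple periodic signal, F0 = 220 Hz
noise = np.random.default_rng(0).standard_normal(sr)   # noise-like signal

zcr_tone = zero_crossing_rate(tone)
zcr_noise = zero_crossing_rate(noise)

# A sinusoid crosses zero twice per period, so F0 is about ZCR * sr / 2.
f0_estimate = zcr_tone * sr / 2
```

Here `zcr_noise` comes out near 0.5, while `zcr_tone` is far lower and `f0_estimate` lands close to 220 Hz. The F0 relation only holds for simple waveforms; signals with strong harmonics can cross zero more than twice per period.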

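Spectral Summary Features
- Spectral centroid: center of gravity of the spectrum
  - Associated with the brightness of sounds
  $$SC(t) = \frac{\sum_k f_k \, |X_t(k)|}{\sum_k |X_t(k)|}$$
- Spectral roll-off: the frequency $R_t$ below which 85% or 95% of the spectral energy is concentrated
  $$\sum_{k=1}^{R_t} |X_t(k)| = 0.85 \sum_{k=1}^{N} |X_t(k)|$$
Here $X_t(k)$ is the spectrum of frame $t$ at bin $k$, and $f_k$ is the frequency of bin $k$.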
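
A minimal sketch of both features on a single frame follows. It assumes a magnitude-weighted centroid and a roll-off computed on the cumulative magnitude spectrum; the function names, the Hann window, and the two-sinusoid test frame are choices made for this example.

```python
import numpy as np

def spectral_centroid(mag, freqs):
    """Magnitude-weighted mean frequency of one frame."""
    return np.sum(freqs * mag) / np.sum(mag)

def spectral_rolloff(mag, freqs, fraction=0.85):
    """Lowest frequency at which the cumulative magnitude reaches `fraction` of the total."""
    cumulative = np.cumsum(mag)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return freqs[idx]

sr, n_fft = 16000, 1024                      # assumed sampling rate and frame size
t = np.arange(n_fft) / sr
# Test frame: a 1 kHz partial plus a weaker 3 kHz partial.
frame = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)

mag = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)

sc = spectral_centroid(mag, freqs)
ro = spectral_rolloff(mag, freqs)
```

With amplitudes 1 and 0.5, the centroid falls near (2 x 1000 + 1 x 3000) / 3, roughly 1667 Hz, and the 85% roll-off lands at the 3 kHz partial. Raising the amplitude of the upper partial moves the centroid upward, which matches the association with brightness.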

Spectral Summary Features
- Spectral spread (SS): a measure of the bandwidth of the spectrum
  $$SS(t) = \sqrt{\frac{\sum_k (f_k - SC(t))^2 \, |X_t(k)|}{\sum_k |X_t(k)|}}$$
- Spectral flatness (SF): a measure of the noisiness of the spectrum
  - The ratio between the geometric and arithmetic means
  $$SF(t) = \frac{\sqrt[K]{\prod_k |X_t(k)|}}{\frac{1}{K} \sum_k |X_t(k)|}$$
  - Examples: white noise → 1, pure tone → 0
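
The two measures can be sketched as below, assuming the spread is the square root of the magnitude-weighted variance around the centroid. The function names, the small `eps` guard against log(0), and the test spectra are assumptions made for this example.

```python
import numpy as np

def spectral_spread(mag, freqs):
    """Magnitude-weighted standard deviation of frequency around the centroid."""
    centroid = np.sum(freqs * mag) / np.sum(mag)
    return np.sqrt(np.sum((freqs - centroid) ** 2 * mag) / np.sum(mag))

def spectral_flatness(mag, eps=1e-12):
    """Geometric mean over arithmetic mean of the magnitudes."""
    mag = mag + eps                          # guard against log(0)
    geometric_mean = np.exp(np.mean(np.log(mag)))
    return geometric_mean / np.mean(mag)

sr, n_fft = 16000, 1024                      # assumed sampling rate and frame size
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
t = np.arange(n_fft) / sr
rng = np.random.default_rng(0)

tone_mag = np.abs(np.fft.rfft(np.sin(2 * np.pi * 1000 * t)))
noise_mag = np.abs(np.fft.rfft(rng.standard_normal(n_fft)))
flat_mag = np.ones_like(freqs)               # idealized, perfectly flat spectrum

sf_tone = spectral_flatness(tone_mag)
sf_noise = spectral_flatness(noise_mag)
sf_flat = spectral_flatness(flat_mag)
ss_tone = spectral_spread(tone_mag, freqs)
ss_noise = spectral_spread(noise_mag, freqs)
```

The perfectly flat spectrum gives a flatness of exactly 1 and the pure tone gives a value near 0. A single frame of white noise comes out somewhat below 1 because its magnitudes fluctuate from bin to bin; the limit of 1 applies to the average noise spectrum.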

Examples of Spectral Centroids
[Figure: two panels showing frequency [Hz] versus time [sec]. Panel titles: "Classical: Beethoven String Quartet" and "Pop: Video Killed the Radio Star".]

Mel-Frequency Cepstral Coefficient (MFCC)
- The most widely used audio feature for extracting the spectral envelope from an audio frame
  - Standard audio feature in speech recognition
  - Introduced to the music domain by Logan in 2000
- Computation steps
  1. DFT of the audio frame
  2. Mapping the frequency scale to mel
  3. Log magnitude
  4. DCT

Mel-Frequency Spectrogram
- Convert linear frequency to the mel scale
- This usually reduces the dimensionality of the spectrum
[Figure panels: Spectrum; Spectrum (mel-scaled)]
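
One common way to do this mapping is a bank of triangular filters spaced evenly on the mel scale, sketched below. The slides do not state a mel formula, so the 2595 * log10(1 + f/700) conversion, the triangular filter shape, and all function names here are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    # One widely used mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters with edges equally spaced on the mel scale."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

sr, n_fft, n_mels = 22050, 1024, 60          # assumed analysis parameters
fb = mel_filterbank(n_mels, n_fft, sr)

frame = np.random.default_rng(0).standard_normal(n_fft)
spectrum = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))   # 513 linear bins
mel_spectrum = fb @ spectrum                                 # 60 mel bands
```

The matrix product collapses 513 linear-frequency bins into 60 mel bands, which is the dimensionality reduction mentioned above. Low bands cover only a few FFT bins while high bands cover many, reflecting the coarser frequency resolution of the mel scale at high frequencies.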

Discrete Cosine Transform
- Real-valued transform, similar to the DFT
- De-correlates the mel-scaled log spectrum and reduces the dimensionality again
  $$X_{DCT}(k) = \sqrt{\frac{2}{N}} \sum_{n=1}^{N} x(n) \cos\left(\frac{\pi k}{N} (n - 0.5)\right)$$
[Figure panels: Spectrum (mel-scaled); MFCC]
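
The transform can be written as a matrix of cosine basis vectors applied to the log mel spectrum. The sketch below assumes the DCT-II form with a sqrt(2/N) scale factor, a sum over n = 1..N, and an (n - 0.5) phase term; the function names and the synthetic mel spectrum are hypothetical.

```python
import numpy as np

def dct_basis(n_coeffs, n_inputs):
    """Rows are sqrt(2/N) * cos(pi * k / N * (n - 0.5)) for n = 1..N."""
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(1, n_inputs + 1)[None, :]
    return np.sqrt(2.0 / n_inputs) * np.cos(np.pi * k / n_inputs * (n - 0.5))

def mfcc_from_mel(mel_spectrum, n_coeffs=13, eps=1e-10):
    """Log-compress a mel spectrum and keep the first n_coeffs DCT coefficients."""
    log_mel = np.log(mel_spectrum + eps)     # eps guards against log(0)
    return dct_basis(n_coeffs, log_mel.size) @ log_mel

# Hypothetical 60-band mel spectrum with a smooth, decaying envelope.
bands = np.arange(60)
mel_spectrum = np.exp(-bands / 20.0) + 0.01

mfcc = mfcc_from_mel(mel_spectrum)           # 13 coefficients

# The basis rows are mutually orthogonal, which is why the DCT de-correlates.
basis = dct_basis(13, 60)
gram = basis @ basis.T
```

With this scaling the rows for k >= 1 have unit norm and the k = 0 row has squared norm 2, so `gram` is diagonal. Keeping only the first 13 of 60 coefficients is the second dimensionality reduction.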

Reconstructed Frequency Spectrum from MFCC
[Figure panels: Frequency spectrum (512 bins); Frequency spectrum (mel-scaled, 60 bins); MFCC (13 dim); Reconstructed frequency spectrum (mel-scaled); Reconstructed frequency spectrum]
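
The step from MFCC back to the mel-scaled log spectrum amounts to inverting a truncated DCT. The sketch below assumes a DCT-II basis scaled by sqrt(2/N), in which case the inverse needs the k = 0 coefficient halved. The function names and the synthetic log-mel spectrum are hypothetical, and mapping mel bands back to linear-frequency bins is not covered.

```python
import numpy as np

def dct_basis(n_coeffs, n_inputs):
    """Rows are sqrt(2/N) * cos(pi * k / N * (n - 0.5)) for n = 1..N."""
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(1, n_inputs + 1)[None, :]
    return np.sqrt(2.0 / n_inputs) * np.cos(np.pi * k / n_inputs * (n - 0.5))

def reconstruct_log_mel(coeffs, n_bands):
    """Invert the (possibly truncated) DCT; the k = 0 term carries weight 1/2."""
    basis = dct_basis(coeffs.size, n_bands)
    weights = np.ones(coeffs.size)
    weights[0] = 0.5
    return basis.T @ (weights * coeffs)

n_bands = 60
bands = np.arange(n_bands)
# Hypothetical log-mel spectrum: a slowly varying envelope plus a fast ripple.
envelope = 1.0 + 2.0 * np.cos(np.pi * 2 * (bands + 0.5) / n_bands)
ripple = 0.3 * np.cos(np.pi * 40 * (bands + 0.5) / n_bands)
log_mel = envelope + ripple

coeffs = dct_basis(n_bands, n_bands) @ log_mel     # all 60 coefficients
exact = reconstruct_log_mel(coeffs, n_bands)       # uses every coefficient
smooth = reconstruct_log_mel(coeffs[:13], n_bands) # uses only 13 "MFCCs"
```

Using all 60 coefficients recovers the input, while the first 13 recover only the slowly varying envelope and drop the ripple. This is the sense in which a low-order MFCC vector represents the spectral envelope rather than fine spectral detail.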

Comparison of Spectrogram and MFCC
[Figure panels: Spectrogram; Mel-frequency spectrogram; MFCC; Reconstructed spectrogram from MFCC]

Sound Examples of MFCC
- Original
- MFCC reconstruction (using white noise as a source)

Post-processing
- Adding temporal dynamics
  - Short-term dynamics of features are characterized with delta or double-delta features
    $$\Delta x(n) = \frac{x(n) - x(n-h)}{h}, \qquad \Delta\Delta x(n) = \frac{\Delta x(n) - \Delta x(n-h)}{h}$$
  - 39 MFCCs in speech recognition: 13 MFCCs + 13 delta + 13 double-delta
- Normalization
  - Cepstral Mean Subtraction (CMS): subtract the mean over surrounding frames
  - Standardization: subtract the mean and divide by the standard deviation
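
These post-processing steps can be sketched on a (frames x coefficients) feature matrix as follows. The function names, the zero-filled first `h` delta frames, the CMS window length, and the random stand-in for MFCCs are all assumptions made for this example.

```python
import numpy as np

def delta(features, h=1):
    """Backward difference (x(n) - x(n-h)) / h along the frame axis.
    The first h frames have no predecessor and are left at zero."""
    d = np.zeros_like(features, dtype=float)
    d[h:] = (features[h:] - features[:-h]) / h
    return d

def cepstral_mean_subtraction(features, half_window=50):
    """Subtract the mean over a window of surrounding frames."""
    out = np.empty_like(features, dtype=float)
    for n in range(len(features)):
        start = max(0, n - half_window)
        stop = min(len(features), n + half_window + 1)
        out[n] = features[n] - features[start:stop].mean(axis=0)
    return out

def standardize(features, eps=1e-8):
    """Zero mean and unit standard deviation per feature dimension."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)

# Hypothetical stand-in for 100 frames of 13 MFCCs.
mfcc = np.random.default_rng(0).standard_normal((100, 13))

d1 = delta(mfcc)                             # delta
d2 = delta(d1)                               # double-delta
stacked = np.hstack([mfcc, d1, d2])          # 13 + 13 + 13 = 39 dimensions
normalized = standardize(stacked)
cms = cepstral_mean_subtraction(mfcc)
```

CMS removes any constant offset in the cepstral domain, so adding a fixed value to every frame leaves its output unchanged. Standardization divides by the standard deviation rather than the variance so that each dimension ends up with unit scale.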

Applications
- Music
  - Musical instrument classification
  - Music genre/mood classification
  - Similarity-based audio retrieval
- Speech
  - Speech recognition
  - Speaker recognition

References
- J. Grey, "Multidimensional Perceptual Scaling of Musical Timbres," 1977
- D. Wessel, "Timbre Space as a Musical Control Structure," 1979
- S. Donnadieu, "Mental Representation of the Timbre of Complex Sounds," book chapter (ch. 8) in Analysis, Synthesis, and Perception of Musical Sounds, ed. J. Beauchamp, 2007
- B. Logan, "Mel Frequency Cepstral Coefficients for Music Modeling," 2000