Learning the meaning of music


Learning the meaning of music. Brian Whitman, Music Mind and Machine group, MIT Media Laboratory, 2004.

Outline
- Why meaning / why music retrieval
- Community metadata / language analysis
- Long-distance song effects / popularity
- Audio analysis / feature extraction
- Learning / grounding
- Application layer

Take-home messages:
1) Grounding for better results in both multimedia and textual information retrieval; query by description as a multimedia interface.
2) Music acquisition, bias-free models, organic music intelligence.

Music intelligence: structure, genre/style ID, song similarity, recommendation, artist ID, synthesis. Extracting salience from a signal; learning via features and regression (e.g. ROCK/POP vs. Classical).

Better understanding through semantics: structure, genre/style ID, song similarity, recommendation, artist ID, synthesis. "Loud college rock with electronics." How can we get meaning to computationally influence understanding?

Using context to learn descriptions of perception Grounding meanings (Harnad 1990): defining terms by linking them to the outside world

Symbol grounding in action: linking perception and meaning (Regier, Siskind, Roy). Duygulu: image descriptions such as "sea sky sun waves," "cat grass tiger," "jet plane sky."

Meaning ain't in the head

Where meaning is in music:
- Relational meaning (the relationship between the representation and the system): "The Shins are like the Sugarplastic." "XTC were the most important British pop group of the 1980s." "Jason Falkner was in The Grays."
- Actionable meaning: "This song makes me dance." "This song makes me cry."
- Significance meaning: "This song reminds me of my ex-girlfriend."
- Correspondence: "There's a trumpet there." "These pitches have been played." "Key of F."

Parallel Review

Review 1: "Beginning with 'Caring Is Creepy,' which opens this album with a psychedelic flourish that would not be out of place on a late-1960s Moody Blues, Beach Boys, or Love release, the Shins present a collection of retro pop nuggets that distill the finer aspects of classic acid rock with surrealistic lyrics, independently melodic bass lines, jangly guitars, echo-laden vocals, minimalist keyboard motifs, and a myriad of cosmic sound effects. With only two of the cuts clocking in at over four minutes, Oh, Inverted World avoids the penchant for self-indulgence that befalls most outfits who worship at the altar of Syd Barrett, Skip Spence, and Arthur Lee. Lead singer James Mercer's lazy, hazy phrasing and vocal timbre, which often echoes a young Brian Wilson, drifts in and out of the subtle tempo changes of 'Know Your Onion,' the jagged rhythm 'Girl Inform Me,' the Donovan-esque folksy veneer of 'New Slang,' and the Warhol's Factory aura of 'Your Algebra,' all of which illustrate the New Mexico-based quartet's adept knowledge of the progressive/art rock genre which they so lovingly pay homage to. Though the production and mix are somewhat polished when compared to the memorable recordings of Moby Grape and early Pink Floyd, the Shins capture the spirit of '67 with stunning accuracy."

Review 2: "For the majority of Americans, it's a given: summer is the best season of the year. Or so you'd think, judging from the anonymous TV ad men and women who proclaim, 'Summer is here! Get your [insert iced drink here] now!'-- whereas in the winter, they regret to inform us that it's time to brace ourselves with a new Burlington coat. And TV is just an exaggerated reflection of ourselves; the hordes of convertibles making the weekend pilgrimage to the nearest beach are proof enough. Vitamin D overdoses abound. If my tone isn't suggestive enough, then I'll say it flat out: I hate the summer. It is, in my opinion, the worst season of the year. Sure, it's great for the holidays, work vacations, and ogling the underdressed opposite sex, but you pay for this in sweat, which comes by the quart, even if you obey summer's central directive: be lazy. Then there's the traffic, both pedestrian and automobile, and those unavoidable, unbearable Hollywood blockbusters and TV reruns (or second-rate series). Not to mention those package music tours. But perhaps worst of all is the heightened aggression. Just last week, in the middle of the day, a reasonable-looking man in his mid-twenties decided to slam his palm across my forehead as he walked past me. Mere days later-- this time at night-- a similar-looking man (but different; there are a lot of these guys in Boston) stumbled out of a bar and immediately grabbed my shirt and tore the pocket off, spattering his blood across my arms and chest in the process. There's a reason no one riots in the winter. Maybe I need to move to the home of Sub Pop, where the sun is shy even in summer, and where angst and aggression are more likely to be internalized. Then again, if Sub Pop is releasing the Shins' kind-of debut (they've been around for nine years, previously as Flake, and then Flake Music), maybe even

What is post-rock? Is genre ID learning meaning?

How to get at meaning:
- Self-label (LKBs / SDBs, ontologies): better initial results, more accurate
- OpenMind / community-directed: more generalization power (more work, too)
- Observation: scale-free / organic

Music ontologies

Language acquisition: animal experiments, birdsong; instinct / innate; attempting to find linguistic primitives; computational models.

Music acquisition:
- Short-term music model: auditory scene to events
- Structural music model: recurring patterns in music streams
- Language of music: relating artists to descriptions (cultural representation)
- Music acceptance models: the path of music through a social network
- Grounding sound: what does "loud" mean?
- Semantics of music: what does "rock" mean? What makes a song popular?
- Semantic synthesis

Acoustic vs. cultural representations.
- Acoustic: instrumentation; short-time (timbral); mid-time (structural); usually all we have. Answers: which genre? which artist? what instruments?
- Cultural: long-scale time; inherent user model; listener's perspective; two-way IR. Answers: describe this; do I like this? 10 years ago? which style?

Community metadata (Whitman / Lawrence, ICMC 2002): Internet-mined description of music; embeds description as a kernel space; community-derived meaning; time-aware; freely available.

Language processing for IR: web page to feature vector. Raw HTML is stripped to text and split into sentence chunks, e.g. "XTC was one of the smartest and catchiest British pop bands to emerge from the punk and new wave explosion of the late '70s." Each chunk is decomposed into term types: n1 (unigrams such as "XTC," "smartest," "catchiest"), n2 (bigrams such as "British pop," "new wave"), n3 (trigrams such as "XTC was one"), np (noun phrases such as "British pop bands," "punk and new wave explosion"), adj (adjectives such as "smartest," "catchiest," "British," "new," "late"), and art (artist terms such as "XTC").
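The chunking-to-terms step can be sketched in a few lines of Python; the tokenizer regex and the `ngrams` helper are illustrative stand-ins, not the original implementation:

```python
import re

def ngrams(text, n):
    """Lowercase, tokenize on letter/digit/apostrophe runs, emit all n-grams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("XTC was one of the smartest and catchiest British pop bands "
            "to emerge from the punk and new wave explosion of the late '70s.")

n1 = ngrams(sentence, 1)   # unigrams: "xtc", "was", "one", ...
n2 = ngrams(sentence, 2)   # bigrams:  "xtc was", "was one", ...
n3 = ngrams(sentence, 3)   # trigrams: "xtc was one", ...
```

The np (noun phrase) and adj (adjective) term types would additionally require a part-of-speech tagger on top of this tokenization.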

What's a good scoring metric? TF-IDF provides natural weighting: rarer co-occurrences mean more, i.e. two artists sharing the term "heavy metal banjo" says more than sharing "rock music." But:

s(f_t, f_d) = f_t / f_d

where f_t is the term frequency and f_d the document frequency.

Smooth the TF-IDF: reward mid-ground terms.

s(f_t, f_d) = f_t * e^(-(log(f_d) - mu)^2 / (2*sigma^2))

Experiments: will two known-similar artists have a higher overlap than two random artists? Two metrics (the straight TF-IDF sum and the smoothed Gaussian sum) are tried on each term type. Similarity is the sum over all shared terms:

S(a, b) = Σ s(f_t, f_d)
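The two weightings and the shared-term sum might look like this in Python; the mu and sigma defaults, the profile layout (term → (tf, df)), and the toy numbers are assumptions for illustration, not values from the talk:

```python
import math

def tfidf_score(tf, df):
    """Straight TF-IDF-style weight: term frequency over document frequency."""
    return tf / df

def smoothed_score(tf, df, mu=2.0, sigma=1.0):
    """Gaussian-smoothed weight: rewards mid-ground document frequencies."""
    return tf * math.exp(-((math.log(df) - mu) ** 2) / (2 * sigma ** 2))

def artist_similarity(profile_a, profile_b, score=smoothed_score):
    """S(a, b): sum the scores of the terms two artist profiles share.
    Each profile maps term -> (tf, df)."""
    shared = profile_a.keys() & profile_b.keys()
    return sum(score(*profile_a[t]) for t in shared)

a = {"heavy metal banjo": (3, 5), "rock music": (40, 900)}
b = {"heavy metal banjo": (2, 5), "loud": (10, 300)}
sim = artist_similarity(a, b)  # only the rare shared term contributes
```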

TF-IDF sum results. Accuracy: % of artist pairs that were predicted similar correctly (S(a,b) > S(a,random)). Improvement = S(a,b) / S(a,random).

             n1     n2     np     adj    art
Accuracy     78%    80%    82%    69%    79%
Improvement  7.0x   7.7x   5.2x   6.8x   6.9x

Gaussian-smoothed results. The Gaussian does far better on the larger term types (n1, n2, np).

             n1     n2     np     adj    art
Accuracy     83%    88%    85%    63%    79%
Improvement  3.4x   2.7x   3.0x   4.8x   8.2x

P2P similarity. Crawling p2p networks: download user→song relations. Can similarity be inferred from collections? Similarity metric:

S(a, b) = (C(a, b) / C(b)) * (1 - |C(a) - C(b)| / C(c))

where C(x) is the number of collections containing artist x, C(a, b) the number containing both, and C(c) the count for the most popular artist.
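One plausible reading of a collection-based co-occurrence metric in Python; the set-based interface, the `max_count` popularity normalizer, and the toy user ids are illustrative assumptions, not the talk's implementation:

```python
def p2p_similarity(users_a, users_b, max_count):
    """S(a, b) from p2p collections.
    users_a / users_b: sets of user ids whose shared folders hold the artist.
    max_count: collection count of the most popular artist, damping popularity."""
    c_ab = len(users_a & users_b)          # users holding both artists
    c_a, c_b = len(users_a), len(users_b)
    if c_b == 0:
        return 0.0
    return (c_ab / c_b) * (1 - abs(c_a - c_b) / max_count)

fans_a = {"u1", "u2", "u3", "u4"}
fans_b = {"u2", "u3", "u4"}
sim = p2p_similarity(fans_a, fans_b, max_count=100)  # (3/3) * (1 - 1/100) = 0.99
```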

P2P crawling logistics. Many freely available scripting agents exist for P2P networks. Easier: OpenNap, Gnutella, Soulseek (no real authentication/social protocol). Harder: Kazaa, DirectConnect, Hotline/KDX/etc. Usual algorithm: search for a random band name, then browse the collections of matching clients.

P2P trend maps Far more #1s/year than real life 7-14 day lead on big hits No genre stratification

Query by description (audio) What does loud mean? Play me something fast with an electronic beat Single-term to frame attachment

Query-by-description as an evaluation case. QBD: "Play me something loud with an electronic beat." With what probability can we accurately describe music? Training: we play the computer songs by a set of artists and have it read about those artists on the Internet. Testing: we play the computer songs by different artists and see how well it can describe them. Next steps: human use.

The audio data: a large set of music audio from the Minnowmatch testbed (1,000 albums; the most popular on OpenNap, August 2001). 51 artists were randomly chosen, with 5 songs each. Each 2-second frame is one observation: time-domain audio → 512-point power spectral density → PCA to 20 dimensions.
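The frame-to-observation chain (2-second frame → 512-point PSD → 20-dim PCA) can be sketched with NumPy; random frames stand in for real audio, and the FFT size is an assumption:

```python
import numpy as np

def psd_features(frames):
    """Power spectral density of each frame (one frame per row).
    A 1024-point FFT yields a 512-bin power spectrum per frame."""
    spectrum = np.fft.rfft(frames, n=1024, axis=1)[:, :512]
    return np.abs(spectrum) ** 2

def pca_reduce(features, k=20):
    """Project mean-centered features onto their top-k principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 1024))    # 50 stand-in "2-second" frames
obs = pca_reduce(psd_features(frames))      # 50 observations x 20 dimensions
```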

Learning formalization: learn the relation between audio and naturally encountered description. We can't trust the target class: opinion, counterfactuals, wrong artist, not musical. 200,000 possible terms (output classes!). For this experiment we limit it to adjectives.

Severe multi-class problem (observed labels: a, B, C, D, E, F, G, ...):
1. Incorrect ground truth
2. Bias
3. Large number of output classes

Kernel space. A distance function represents the data; a Gaussian works well for audio:

K(x_i, x_j) = e^(-|x_i - x_j|^2 / (2*delta^2))

Regularized least-squares classification (RLSC) (Rifkin 2002):

(K + I/C) c_t = y_t, so c_t = (K + I/C)^(-1) y_t

where c_t is the machine for class t, y_t the truth vector for class t, and C the regularization constant (10).
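Since every class shares the same kernel matrix, one regularized linear solve trains all the per-term machines at once. A NumPy sketch under toy data (the sizes, delta, and random labels are illustrative assumptions):

```python
import numpy as np

def gaussian_gram(X, delta=1.0):
    """K[i, j] = exp(-|x_i - x_j|^2 / (2 * delta^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * delta ** 2))

def rlsc_train(K, Y, C=10.0):
    """Solve (K + I/C) c_t = y_t for every class t in one linear solve.
    Y holds one +/-1 truth column per class."""
    n = K.shape[0]
    return np.linalg.solve(K + np.eye(n) / C, Y)

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 20))                    # 30 obs, 20-dim features
Y = np.where(rng.random((30, 3)) > 0.5, 1.0, -1.0)   # 3 classes (terms)
K = gaussian_gram(X)
c = rlsc_train(K, Y)        # one column of weights per term machine
scores = K @ c              # per-term scores on the training observations
```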

New SVM kernel for memory: Casper. Gaussian distance with a stored memory half-life, computed in the Fourier domain. (Figure: Gaussian kernel vs. Casper kernel.)

Gram Matrices Gaussian vs. Casper

Results (artist ID, 1-in-107):

              Pos%   Neg%   Weight%
PSD gaussian   8.9   99.4     8.8
PSD casper    50.5   74.0    37.4

Per-term accuracy.

Good terms         Bad terms
Electronic  33%    Annoying     0%
Digital     29%    Dangerous    0%
Gloomy      29%    Fictional    0%
Unplugged   30%    Magnetic     0%
Acoustic    23%    Pretentious  1%
Dark        17%    Gator        0%
Female      32%    Breaky       0%
Romantic    23%    Sexy         1%
Vocal       18%    Wicked       0%
Happy       13%    Lyrical      0%
Classical   27%    Worldwide    2%

Baseline = 0.14%. Good term set as a restricted grammar?

Time-aware audio features: MPEG-7 derived state paths (Casey 2001). Music as a discrete path through time, registered to 20 states at 0.1 s resolution.

Per-term accuracy (state paths); weighted accuracy (to allow for bias).

Good terms       Bad terms
Busy      42%    Artistic    0%
Steady    41%    Homeless    0%
Funky     39%    Hungry      0%
Intense   38%    Great       0%
Acoustic  36%    Awful       0%
African   35%    Warped      0%
Melodic   27%    Illegal     0%
Romantic  23%    Cruel       0%
Slow      21%    Notorious   0%
Wild      25%    Good        0%
Young     17%    Okay        0%

Real-time Description synthesis

Semantic decomposition Music models from unsupervised methods find statistically significant parameters Can we identify the optimal semantic attributes for understanding music? Female/Male Angry/Calm

The linguistic expert. Some semantic attachment requires lookups to an expert: dark vs. light? big vs. small?

Linguistic expert. Perception plus observed language gives us "big"; lookups to the linguistic expert relate light to dark and small to big. This allows you to infer new gradations between the poles.

Top descriptive parameters. All P(a) of the terms in anchor synant sets are averaged, e.g. P(quiet) = 0.2, P(loud) = 0.4 → P(quiet-loud) = 0.3. The sorted list gives the best grounded parameter map.

Good parameters            Bad parameters
Big-little          30%    Evil-good               5%
Present-past        29%    Bad-good                0%
Unusual-familiar    28%    Violent-nonviolent      1%
Low-high            27%    Extraordinary-ordinary  0%
Male-female         22%    Cool-warm               7%
Hard-soft           21%    Red-white               6%
Loud-soft           19%    Second-first            4%
Smooth-rough        14%    Full-empty              0%
Vocal-instrumental  10%    Internal-external       0%
Minor-major         10%    Foul-fair               5%

Learning the knobs. Nonlinear dimension reduction with Isomap: like PCA/NMF/MDS, but meaning-oriented, with a better perceptual distance. Only polar observations (e.g. quiet-loud, male-female) are fed as input; future data can then be quickly semantically classified with guaranteed expressivity.

Parameter understanding. Some knobs aren't intrinsically 1-D: color spaces and user models!

Mixture classification (bird example). A "bird head" machine uses features such as eye ring, beak, and uppertail coverts; a "bird tail" machine uses call pitch histogram, GIS type, and wingspan. Each emits a score per class, e.g. sparrow 0.2 / 0.4 and 0.7 / 0.9, bluejay 0.8 / 0.6 and 0.3 / 0.1.

Mixture classification for music: Rock vs. Classical from features such as beat < 120 bpm, harmonicity, MFCC deltas, wears eye makeup, has made a concept album, song's bridge is actually the chorus shifted up a key.

Clustering / de-correlation

Big idea: extract meaning from music for better audio classification and understanding. (Chart: understanding-task accuracy for baseline, straight signal, statistical reduction, and semantic reduction.)

Creating a semantic reducer from the good terms and their accuracies: Busy 42%, Steady 41%, Funky 39%, Intense 38%, Acoustic 36%, African 35%, Melodic 27%, Romantic 23%, Slow 21%, Wild 25%, Young 17%. (Example artists: The Shins, Madonna, Jason Falkner.)

Applying the semantic reduction. New audio → f(x): funky 0.5, cool -0.3, highest 0.8, junior 0.3, low -0.8.

Experiment: artist ID. The rare ground truth in music IR, and still a hard problem (~30%). Perils: the album effect, the Madonna problem. The best test case for music intelligence.

Proving it's better: the setup. A corpus of music goes through basis extraction (PCA, NMF, semantic, random), then artist ID over 257 artists with held-out train/test splits (10).

Artist identification results (per-observation accuracy, vs. baseline):

non    pca    nmf    sem    rand
22.2   24.6   19.5   67.1   3.9

Next steps Community detection / sharpening Human evaluation (agreement with learned models) (inter-rater reliability) Intra-song meaning

Thanks Dan Ellis, Adam Berenzweig, Beth Logan, Steve Lawrence, Gary Flake, Ryan Rifkin, Deb Roy, Barry Vercoe, Tristan Jehan, Victor Adan, Ryan McKinley, Youngmoo Kim, Paris Smaragdis, Mike Casey, Keith Martin, Kelly Dobson