MODELLING PERCEPTION OF SPEED IN MUSIC AUDIO

Anders Elowsson
KTH Royal Institute of Technology, CSC, Dept. of Speech, Music and Hearing
elov@kth.se

Anders Friberg
KTH Royal Institute of Technology, CSC, Dept. of Speech, Music and Hearing
afriberg@kth.se

ABSTRACT

One of the major parameters in music is the overall speed of a musical performance. Speed is often associated with tempo, but other factors such as note density (onsets per second) seem to be important as well. In this study, a computational model of speed in music audio has been developed using a custom set of rhythmic features. The original audio is first separated into a harmonic part and a percussive part, and onsets are extracted separately from the different layers. The characteristics of each onset are determined based on frequency content as well as perceptual salience using a clustering approach. Using these separated onsets, a set of eight features, including a tempo estimation, is defined; the features are specifically designed for modelling perceived speed. In a previous study, 20 listeners rated the speed of 100 ringtones consisting mainly of popular songs, which had been converted from MIDI to audio. The ratings were used in linear regression and PLS regression in order to evaluate the validity of the model as well as to find appropriate features. The computed audio features were able to explain about 90 % of the variability in listener ratings.

Copyright: 2013 A. Elowsson et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. INTRODUCTION

This study is focused on one of the major parameters in music, the overall speed of a musical performance. From a music-theoretic background we are used to associating speed with the tempo of the music. However, as suggested earlier, the perceived speed is related to the tempo but may also depend on other aspects such as the note density (number of onsets per second) [1]. An indirect indication of this was provided in [2], where it was found that the note density (and not the tempo) was constant for a certain emotional expression across different music examples. Madison & Paulin [3] asked listeners to rate the speed of 50 music examples spanning a variety of musical styles and rhythms. They found that speed correlated with tempo, but their results also indicated that there must be other aspects involved in the perceptual judgment of speed.

In Figure 1, three examples with different tempos and onset densities are shown. As outlined in Table 1, example A has a slow tempo but the hi-hat plays on 16th notes. As a result, the number of onsets coming from percussive instruments is high. Example B has a high number of onsets from harmonic instruments (e.g. vocals, piano, etc.) but a moderate tempo. Finally, example C has the highest tempo but the lowest overall note density. How do these different aspects affect the perceived speed? In this study we will model the perception of speed by extracting specifically developed features (such as tempo and onset densities) from music audio. An important idea is that the model should exploit the characteristics of the onsets to better understand the music.

[Figure 1: three examples (A, B, C), each with Kick, Snare, HiHat and Harm-Ons onset tracks.]
Figure 1. Several factors that can influence the perceived speed of a piece of music. The tempo is one important factor but onset density is relevant as well.

Example     A      B      C
Tempo       Slow   Mid    Fast
Drum-Ons    Fast   Mid    Slow
Harm-Ons    Mid    Fast   Slow

Table 1. Different characteristics of the music which are related to speed. A song may have a slow tempo but many onsets that increase the perceived speed.

The current work is part of an ongoing study about perceptually determined features in music information retrieval. In a previous study it was shown that speed could be modeled by a combination of tempo and different note densities of the instruments using symbolic data [4]. The explained variation was about 90 % using linear regression. This indicates that a similar result could in theory be obtained using audio data, provided that the appropriate low-level audio features could be extracted.

A flowchart of the processes used in the model is shown in Figure 2. As a first step, source separation (Section 3) was used to separate harmonic content and percussive content in the audio as well as to cluster onsets into different groups. Features were computed from both the percussive and the harmonic part as well as from the original audio, as described in Section 4. To find appropriate features as well as to evaluate the validity of the model, regression was used, in which the audio features were mapped against ground truth data consisting of listener ratings of speed. This is described in Section 5.

[Figure 2: flowchart from MIDI to audio (original), HP-separation (STFT stage, then CQT stage for the percussive part), spectral flux (SF-CQT, SF-STFT), onset detection and clustering, ending in the features On Dens. Harmonic, On Dens. Bass, On Dens. Perceptual, On Dens. Strong, Strong Cluster IOI, S-Curve, Percussiveness and SF-CQT.]
Figure 2. Flowchart of the processes used to compute audio features for the speed in music. Audio is generated from MIDI, the audio is filtered to separate harmonic and percussive content, onsets are detected from a spectral flux, and audio features are computed.

2. SPEED DATA AND AUDIO EXAMPLES

The speed estimations were perceptually determined in a previous experiment in which 20 listeners rated speed for each music example on a quasi-continuous scale marked slow-fast with the range 1-9. The music examples were a set of 100 ringtones consisting mainly of popular songs, originally in MIDI format and converted to audio [5, 6].

3. SOURCE SEPARATION AND ONSET DETECTION

The intermediate processing steps between audio and feature extraction (green boxes in Figure 2) are described in this section.

3.1 HP-Separation

Source separation was used to separate harmonic and percussive content. Source separation has been used in the past in computational models related to rhythm [7]. The method proposed by FitzGerald [8] was used as the first step of the separation. The basic idea of the method is that percussive sounds are broadband noise signals with short duration and that harmonic sounds are narrow-band signals with longer duration.

To be able to separate these different sounds, the audio is transformed to the spectral domain by using a short-time Fourier transform (STFT). By applying a median filter across each frame in the frequency direction, harmonic sounds are suppressed. By applying a median filter across each frequency bin in the time direction, percussive sounds are suppressed. After median filtering, the signal is transformed back to the time domain using the inverse STFT.

To further suppress harmonic content in the percussive waveform, a second separation stage incorporates a constant-Q transform (CQT) [9]. The CQT can be understood as an STFT with logarithmically spaced frequency bins, accomplished by varying the length of the analysis window. The implication relevant to this study is that a high frequency resolution can be achieved also in the low frequencies, at the expense of a poor time resolution. The frequency resolution of the CQT was set to 60 bins per octave and each frame was median filtered across the frequency direction with a window size of 40 bins. After filtering, the percussive signal was transformed back to the time domain using an inverse CQT.

By transforming back to the time domain, the underlying phase information is retained. The phase can be regarded as a mapping that connects a frequency bin to a certain point in time. This is especially useful in the CQT stage, as the filtering can be performed at a low time resolution (with window lengths up to a second), while subsequent onset detection algorithms can be computed at a higher time resolution. The resulting percussive and harmonic waveforms are shown in Figure 3.

[Figure 3: three waveforms labelled Original, Percussive and Harmonic.]
Figure 3. The result of the HP-separation. The original waveform is separated into a percussive and a harmonic waveform. The example is a 3-second section of the song Candy Shop, by 50 Cent, which will be used to visualize the feature extraction throughout this paper.

3.2 Onset Detection

Audio features were computed from all three waveforms (original, harmonic and percussive) by the scheme shown in Figure 2. The first step, independent of feature and waveform, was to compute a spectral flux (SF) [10], where spectral fluctuations along the time domain are detected. The SF was computed several times in different ways. Some shared steps will be described here, with unique steps described in Sections 4.1-4.8. The power spectrum was computed with a CQT or an STFT and converted to sound level. A range of 30 dB was used. Thus, the maximum sound level of each band is set to 0 dB and sound levels below -30 dB are set to -30 dB.

Let $L(n, i)$ represent the sound level at the $i$th frequency bin/band of the $n$th frame. The SF is given by

$$SF(n) = \sum_{i=1}^{b} H\big(L(n, i) - L(n - s, i)\big) \tag{1}$$

where $b$ is the number of bins/bands. The variable $s$ is the step size and $H$ is a half-wave rectifier function, or for the percussive SF:

$$H(x) = \begin{cases} x & \text{if } x \ge 0 \\ 0.2\,x & \text{if } x < 0 \end{cases} \tag{2}$$

The implication of Eq. 2 is that negative spectral fluctuations have a slight influence on the onset detection function. Onsets were detected by peak picking on a low-pass filtered curve of the spectral flux (see Figure 4).

Figure 4. The onset detection functions that discover onsets by finding peaks in the SF. In this example harmonic onsets are tracked.

3.3 Clustering

To better exploit the characteristics of the percussive onsets they were clustered into groups. The clustering was based on sound level in 8 frequency bands, spaced approximately an octave apart, as well as the RMS sound level. As the appropriate number of clusters is unknown beforehand, three k-means clusterings [11, 12] were carried out, with the number of clusters k set to 2, 3 and 4. The fit of each clustering attempt was defined by the smallest Euclidean distance between any two clusters, where a larger smallest distance gave a higher fit. When choosing which clustering attempt to use, a higher number of groups (k) was preferred over a lower one if their fit was similar. The result is shown in Figure 5.

[Figure 5: panel A shows Clusters 1, 2 and 3; panel B shows Cluster 1 reassigned to Clusters 2 & 3.]
Figure 5. The clustering of percussive onsets. In example A the drums are clustered into three different clusters. In example B three clusters are initially discovered, but the onsets in Cluster 1 are assigned to Cluster 2 & 3.
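
The choice between the three clustering attempts in Section 3.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the plain Lloyd-style `kmeans`, its farthest-point initialisation, the function names, and the `tolerance` that decides when two fits count as "similar" are all hypothetical, since the paper does not specify them. Each onset is assumed to be one row of sound levels (8 band levels plus RMS in the paper; any dimension works here).

```python
import numpy as np


def nearest(features, centroids):
    """Index of the closest centroid (Euclidean) for every row of features."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def farthest_point_init(features, k):
    """Deterministic start (an assumption): first row, then the row farthest from all picks."""
    picks = [features[0]]
    for _ in range(1, k):
        dists = np.linalg.norm(
            features[:, None, :] - np.array(picks)[None, :, :], axis=2
        ).min(axis=1)
        picks.append(features[int(dists.argmax())])
    return np.array(picks, dtype=float)


def kmeans(features, k, n_iter=100):
    """Plain Lloyd iterations; returns (centroids, labels)."""
    features = np.asarray(features, dtype=float)
    centroids = farthest_point_init(features, k)
    for _ in range(n_iter):
        labels = nearest(features, centroids)
        updated = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, nearest(features, centroids)


def cluster_fit(centroids):
    """Fit as described in the text: smallest distance between any two clusters."""
    dists = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    return float(dists[np.triu_indices(len(centroids), k=1)].min())


def choose_clustering(features, ks=(2, 3, 4), tolerance=0.1):
    """Prefer the largest k whose fit is within `tolerance` of the best fit.

    `tolerance` is a hypothetical stand-in for "similar fit".
    """
    results = {k: kmeans(features, k) for k in ks}
    fits = {k: cluster_fit(results[k][0]) for k in ks}
    best = max(fits.values())
    chosen = max(k for k in ks if fits[k] >= (1.0 - tolerance) * best)
    return chosen, results[chosen][1], fits
```

With three tight, well-separated groups of onsets, splitting into four clusters produces two nearly coincident centroids and therefore a very low fit, while merging into two leaves the fit below that of the three-cluster solution, so `choose_clustering` returns k = 3.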

When the clustering is completed the onsets have been divided into 2, 3 or 4 clusters. At this point the clusters are further analyzed to find out if the sound of the onsets in two of the clusters can be combined to form the sound of the onsets in a third cluster. This happens in example B of Figure 5. The k-means clustering has divided the onsets into three different clusters, corresponding to the sound of the kick and the hi-hat combined, as well as both played separately. The algorithm then compares the different clusters and discovers that Cluster 2 (the kick) and Cluster 3 (the hi-hat) can be combined to form the sound of Cluster 1 (the kick and the hi-hat). To account for this, each onset belonging to Cluster 1 will instead be set as belonging to both Cluster 2 and Cluster 3, and Cluster 1 will cease to exist. This does not happen in example A of Figure 5, where 3 unique clusters have been identified.

4. FEATURE EXTRACTION

A total of 8 audio features were computed: 2 from the original waveform, 5 from the percussive waveform and 1 from the harmonic waveform. The audio features are shown as the end result in the flowchart in Figure 2. The 8 features are explained in Sections 4.1-4.8, with one subsection for each feature.

4.1 Onset Density Harmonic

Onsets in harmonic instruments were tracked from the original waveform, with the SF of a CQT. The bins of the CQT were not combined into broader bands before the SF. This facilitates the detection of harmonic onsets, as a pitch shift of a semitone in an instrument will result in an increase in energy in the half-wave rectified SF. To avoid false onset detections at pitch glides from vibratos, shifts of a peak by 20 cents (one bin), without an increase in sound level, were restricted from affecting the SF. This was accomplished by subtracting, from the sound level of each bin of the new frame, the maximum sound level of the adjacent bins in the old frame. The onset detection function for harmonic onsets is shown in Figure 4.

4.2 Onset Density Bass

Onsets in the low register (frequencies between 40 Hz and 210 Hz) were tracked with an SF of the lower bins of an STFT. The frequency bins were summed to a single band before the SF.

4.3 Onset Density Perceptual

Percussive onsets were tracked with an SF of an STFT on the percussive waveform. The bins of the frequency-domain representation were divided into 13 non-overlapping frequency bands (half-octave spacing). Sub-band processing for onset detection has been described in [13], and can be motivated by its similarity to human hearing [14]. The strength of each detected onset was calculated based on the average sound level of the first 50 ms from the onset position, where lower frequencies were given a higher relevance.

To further determine the perceived strength of the onsets, each onset was compared to the surrounding onsets within 1.5 seconds. This time span (3 seconds in total) was defined as the perceptual present of the particular onset. By comparing it with the strongest onset within the perceptual present, its strength could be altered to represent its perceptual impact. The onset was given a higher strength if there were no significantly stronger onsets within the perceptual present. If there were onsets that were significantly stronger, its strength was lowered. The height of the cluster bars in Figure 6 represents the perceptual strength of each onset. To arrive at a measure of onset density, the sum of the perceptual strength of the onsets was used.

[Figure 6: percussive waveform, detected onsets, perceptual weighting and clusters (Kick, Claps, Toms, Shakers), with Period Length and Strong Cluster IOI marked.]
Figure 6. An overview of the processes involved in extracting 5 features (described in Sections 4.3-4.7) from the percussive waveform. Onsets are detected and clustered into different components to gain an understanding of how the music will be perceived. The perceptual weighting of the onsets is represented by the height of the bars. In this particular song, [the period length] is derived from the IOI between kick and handclaps and only the kick belonged to a strong cluster. The percussiveness feature is related to the height of the peaks in the onset detection function, as visualized by the dotted line.

4.4 Onset Density Strong

The strongest clusters of the clustering process were used to compute two features. The first feature was simply the number of onsets belonging to a strong cluster, per second. This feature was only computed for periods of strong onsets within 1.5 seconds of each other.

4.5 Strong Cluster IOI

The second feature derived from the strong clusters was developed to catch the assumed perception of a slow speed when the inter-onset intervals (IOIs) of onsets belonging to the same strong cluster are long. As an example, a song with equally spaced drum onsets consisting of Kick, Snare, Kick, Snare, Kick, Snare, ... was assumed to have a higher perceived speed than a song where the drums instead play Kick, Kick, Snare, Kick, Kick, Kick, Snare, Kick, etc. This is accounted for in the tempo feature as well, because the tempo in the second example would be half the tempo of the first example. In Figure 6, this feature is derived from the IOI between onsets belonging to Cluster 1. Common IOIs are detected by peak picking in a low-pass filtered histogram of cluster IOIs. Each peak found contributes to the feature based on its relative height as well as the cluster strength.

4.6 S-Curve

The tempo detection algorithm is part of an ongoing project, and a detailed description is in preparation. All distances between onsets within 5 seconds from each other are used to detect the tempo.

4.6.1 Period Length

First, the period length of the percussive waveform is detected. The period length corresponds to the length of the most prominent pattern of repeated rhythmic sounds in the music. A histogram over onset distances is generated, where the contribution of each onset pair increases with increasing similarity in spectrum as well as increasing onset strength. The leftmost peak in the low-pass filtered histogram, within 92 % of the highest peak, is chosen as the period length.

4.6.2 [Tempo]

Secondly, the tempo (beat length) is detected. A histogram over onset distances is once again generated, where the contribution of each onset pair increases with increasing dissimilarity in spectrum as well as increasing onset strength. The final probability distribution for tempo is the Hadamard product of the histogram and several filters. One filter is based on the determined period length. The idea is that the beat will be a simple ratio of the period length, so Hanning windows are produced at positions

$$P_{len}\left(\tfrac{1}{2}\right)^{n}, \quad P_{len}\left(\tfrac{1}{2}\right)^{n}\tfrac{1}{3}, \qquad n = 0, 1, 2, \ldots \tag{3}$$

Another filter is based on IOIs within strong clusters as described in Section 4.5. The general distribution of tempos in popular music is taken into account in one filter, and several filters are connected to the onset density of the particular song. The highest peak in the final probability distribution was chosen as the tempo.

4.6.3 S-Curve

In compliance with the findings in [3], an S-Curve (Figure 7) was applied to the tempo value, giving differences in tempo a higher impact between 60 and 160 BPM.

Figure 7. The S-Curve that gives differences in tempos between 60 and 160 BPM a higher impact.

4.7 Percussiveness

One feature was based on the percussiveness of the onsets. This estimate is derived from the height $h$ of the peaks in the SF of the percussive waveform, as shown in Figure 6.

$$\text{Percussiveness} = \frac{\sum_{i=1}^{n} h(i)^{1+p}}{\sum_{i=1}^{n} h(i)^{p}} \tag{4}$$

Equation 4, with the sums taken over the detected peaks, gives the mean peak height when $p$ is 0, an estimate closer to the lowest peaks when $p$ is negative, and an estimate closer to the highest peaks when $p$ is positive. In this study $p$ was set to 0.4.

4.8 SF CQT

When extracting information from the harmonic waveform, the integral of the SF was used, indicated as the colored area in Figure 8. The use of an onset detection function was avoided, as the HP-separation had removed all transients from the harmonic waveform. The use of a CQT was motivated by the harmonic nature of the processed audio. Spectral changes in high frequencies were used for this feature.

Figure 8. The integral of the spectral flux (SF-CQT) of the harmonic waveform.

5. PREDICTING SPEED FROM THE FEATURES

Two regression techniques were used to analyze the mapping between the computed audio features and the listener ratings of speed. First, a multiple linear regression was used, justified by a case-to-predictor ratio above 10:1 (100 examples for 8 features). Secondly, PLS regression was used [15]. PLS regression carries out data reduction whilst maximizing covariance between features and predicted data [16].

The multiple linear regression between listener ratings and computed audio features is presented in Table 2. As shown, a linear combination of the computed audio features was able to explain about 90 % of the variability. In comparison, the agreement among the listeners estimated by the mean intersubject correlation was 0.71 and Cronbach's alpha 0.98 [4].

Multiple Regression - Speed
R² = 0.909    Adjusted R² = 0.900

Variable                 beta     sr²      p-value
On Dens. - Harmonic      0.205    0.033    0.000***
On Dens. - Bass          0.130    0.007    0.016*
On Dens. - Perceptual    0.302    0.018    0.000***
On Dens. - Strong       -0.155    0.010    0.004**
Strong Cluster IOI       0.127    0.006    0.021*
S-Curve                  0.430    0.056    0.000***
Percussiveness          -0.095    0.005    0.041*
SF CQT                   0.107    0.004    0.053

Table 2. The prediction of the perceptual feature speed from computed audio features. The variable sr² is the squared semi-partial correlation coefficient.

The most important feature was S-Curve, followed by Onset Density - Harmonic, Onset Density - Perceptual and Onset Density - Strong (negative contribution). The independent contribution in terms of the squared semi-partial correlation coefficient sr² indicates that Onset Density - Bass, Strong Cluster IOI, Percussiveness and SF CQT each increased the explained variance by less than 1 %. The negative contribution of Percussiveness could be explained as a higher perceived speed when the percussive onsets are less clear.

A partial least-squares (PLS) regression of the same features is shown in Table 3. With 3 components, the cross-validated adjusted R² indicates that just below 90 % of the variability could be explained. Note also that the cross-validation procedure only lowers the result marginally, supporting the validity of the features.

PLS Regression - Speed
Number of Components Used = 3
R² = 0.907       Adjusted R² = 0.903
R² cv = 0.883    Adjusted R² cv = 0.878

Component    Explained variance    Cum. variance
1            0.853                 0.853
2            0.042                 0.895
3            0.011                 0.907

Table 3. The prediction of the perceptual feature speed from computed audio features. The squared correlation coefficient R² was derived using partial least-squares (PLS) regression with 10-fold cross-validation. In the lower part, R² as a function of the number of components is shown. Components 4-8 did not contribute and are not shown.

The fitted values of the linear regression from Table 2 are shown in Figure 9. As seen in the figure, the deviations from the target are rather evenly distributed across the range, with a maximal deviation of about one unit.

Figure 9. The fitted values in the prediction of the perceptual feature speed, where higher means faster. For each song (numbered for easier identification), the x-axis represents the estimated speed (derived from computed audio features), and the y-axis represents the ground truth (derived from listeners).

6. CONCLUSIONS AND DISCUSSION

The computed audio features were able to explain about 90 % of the variability in listener ratings. The most important features were tempo together with onset densities for different layers of the music. The validity of the features was supported by cross-validation, and fitted values were relatively close to target values. The results show that it was possible to reach the same high explained variance on audio data as on MIDI data using similar features [4]. This indicates that the appropriate low-level audio features have been extracted, which is reassuring for the ongoing study.

Since good results were achieved only after we applied source separation, both in terms of clustering and HP-separation, the segmentation of data seems to be a promising path forward. From an ecological point of view it seems reasonable to assume that the interaction between onsets of the same source is relevant, especially if the sound of this source is one of the most prominent ones. By clustering onsets we can detect onsets belonging to the same source and thus use the rhythmic pattern of this source in the model.
By using several onset detection functions on separate parts of the audio, different aspects of the music can be captured. The CQT seems to be suitable for detecting onsets in harmonic instruments, while the better time resolution of the STFT in lower frequencies facilitates the detection of percussive instruments. A drawback of the proposed system is that the computation of several STFTs and CQTs is relatively time-consuming.
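
As a reminder of what each of these onset detection functions computes, the spectral flux of Section 3.2 (Eqs. 1 and 2) can be sketched in Python as below. This is a minimal sketch, not the authors' code: the function names, the array layout (frames along the first axis, bins/bands along the second) and the small floor used to avoid taking the logarithm of zero are assumptions, and the transform that produces the power spectrum (STFT or CQT) is left out.

```python
import numpy as np


def to_sound_level(power, range_db=30.0):
    """Convert a power spectrogram (frames x bands) to sound level in dB.

    Each band is normalised so that its maximum is 0 dB, and levels
    below -range_db are set to -range_db, as described in Section 3.2.
    """
    level = 10.0 * np.log10(np.maximum(power, 1e-12))  # floor is an assumption
    level = level - level.max(axis=0, keepdims=True)
    return np.maximum(level, -range_db)


def spectral_flux(level, step=1, negative_weight=0.0):
    """Eq. 1: sum over bands of H(L(n, i) - L(n - s, i)).

    negative_weight=0.0 gives a plain half-wave rectifier;
    negative_weight=0.2 gives the variant of Eq. 2 used for the percussive SF.
    The result has one value per frame n >= step.
    """
    diff = level[step:] - level[:-step]
    rectified = np.where(diff >= 0, diff, negative_weight * diff)
    return rectified.sum(axis=1)
```

Onsets would then be found by peak picking on a low-pass filtered version of the returned curve; that stage is omitted here because the paper does not give the filter or threshold settings.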

7. ACKNOWLEDGEMENT

This work was supported by the Swedish Research Council, Grant Nr. 2009-4285 and 2012-4685.

8. REFERENCES

[1] A. Gabrielsson, Studies in Rhythm, doctoral dissertation, Uppsala University, 1973.
[2] R. Bresin and A. Friberg, "Emotion rendering in music: Range and characteristic values of seven musical variables," Cortex, Vol. 47, no. 9, pp. 1068-1081, 2011.
[3] G. Madison and J. Paulin, "Ratings of speed in real music as a function of both original and manipulated beat tempo," Journal of the Acoustical Society of America, Vol. 128, no. 5, pp. 3032-3040, 2010.
[4] A. Friberg, E. Schoonderwaldt, A. Hedblad, M. Fabiani, and A. Elowsson, "Perceptually derived features can be used in music information retrieval," submitted for publication.
[5] A. Friberg, E. Schoonderwaldt, and A. Hedblad, "Perceptual ratings of musical parameters," in von Loesch, H., & Weinzierl, S. (Eds.), Gemessene Interpretation - Computergestützte Aufführungsanalyse im Kreuzverhör der Disziplinen, Mainz: Schott, 2011, pp. 237-253.
[6] A. Hedblad, Evaluation of Musical Feature Extraction Tools Using Perceptual Ratings. Master thesis, KTH Royal Institute of Technology, 2011.
[7] M. Alonso, G. Richard and B. David, "Accurate tempo estimation based on harmonic + noise decomposition," EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 82795, 14 pages, 2007.
[8] D. FitzGerald, "Harmonic/percussive separation using median filtering," in Proc. of the 13th International Conference on Digital Audio Effects (DAFx-10), Graz, Austria, 2010.
[9] C. Schörkhuber and A. Klapuri, "Constant-Q transform toolbox for music processing," in 7th Sound and Music Computing Conference (SMC 2010), Barcelona, 2010.
[10] S. Dixon, "Onset detection revisited," in Proceedings of the International Conference on Digital Audio Effects, pp. 133-137, 2006.
[11] G. A. F. Seber, Multivariate Observations. Hoboken, NJ: John Wiley & Sons, Inc., 1984.
[12] H. Spath, Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples. Translated by J. Goldschmidt. New York: Halsted Press, 1985.
[13] A. Klapuri, "Sound onset detection by applying psychoacoustic knowledge," in Proc. IEEE Conf. Acoustics, Speech and Signal Processing (ICASSP '99), 1999.
[14] C. Duxbury, J. P. Bello, M. Sandler, and M. Davies, "A comparison between fixed and multiresolution analysis for onset detection in musical signals," in Proc. 7th Int. Conf. Digital Audio Effects (DAFx), Naples, Italy, 2004.
[15] P. Geladi and B. R. Kowalski, "Partial least-squares regression: a tutorial," Analytica Chimica Acta, Vol. 185, pp. 1-17, 1986.
[16] T. Eerola, O. Lartillot, and P. Toiviainen, "Prediction of multidimensional emotional ratings in music from audio using multivariate regression models," in 10th International Society for Music Information Retrieval Conference (ISMIR 2009), 2009.