GEOGRAPHICAL ORIGIN PREDICTION OF FOLK MUSIC RECORDINGS FROM THE UNITED KINGDOM

Vytaute Kedyte 1, Maria Panteli 2, Tillman Weyde 1, Simon Dixon 2
1 Department of Computer Science, City University of London, United Kingdom
2 Centre for Digital Music, Queen Mary University of London, United Kingdom
{Vytaute.Kedyte, T.E.Weyde}@city.ac.uk, {m.panteli, s.e.dixon}@qmul.ac.uk

ABSTRACT

Field recordings from ethnomusicological research since the beginning of the 20th century are available today in large digitised music archives. The application of music information retrieval and data mining technologies can aid large-scale data processing, leading to a better understanding of the history of cultural exchange. In this paper we focus on folk and traditional music from the United Kingdom and study the correlation between spatial origins and musical characteristics. In particular, we investigate whether the geographical location of music recordings can be predicted solely from the content of the audio signal. We build a neural network that takes as input a feature vector capturing musical aspects of the audio signal and predicts the latitude and longitude of the origins of the music recording. We explore the performance of the model for different sets of features and compare the prediction accuracy between geographical regions of the UK. Our model predicts the geographical coordinates of music recordings with an average error of less than 120 km. The model can be used in a similar manner to identify the origins of recordings in large unlabelled music collections and reveal patterns of similarity in music from around the world.

1. INTRODUCTION

Since the beginning of the 20th century ethnomusicological research has contributed significantly to the collection of recorded music from around the world. Collections of field recordings are preserved today in digital archives such as the British Library Sound Archive. The advances of Music Information Retrieval (MIR) technologies make it possible to process large numbers of music recordings.
We are interested in applying these computational tools to study a large collection of folk and traditional music from the United Kingdom (UK). We focus on exploring music attributes with respect to geographical regions of the UK and investigate patterns of music similarity.

© Vytaute Kedyte, Maria Panteli, Tillman Weyde, Simon Dixon. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Vytaute Kedyte, Maria Panteli, Tillman Weyde, Simon Dixon. Geographical origin prediction of folk music recordings from the United Kingdom, 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.

The comparison of music from different geographical regions has been the topic of several studies in the field of ethnomusicology, and in particular in the branch of comparative musicology [13]. Savage et al. [17] studied stylistic similarity within the music cultures of Taiwan. In particular, they formed music clusters for a collection of 259 traditional songs from twelve indigenous populations of Taiwan and studied the distribution of these clusters across geographical regions of Taiwan. They showed that the songs of Taiwan can be grouped into 5 clusters correlated with geographical factors and repertoire diversity. Savage et al. [18] analysed 304 recordings contained in the Garland Encyclopedia of World Music [14] and investigated the distribution of music attributes across music recordings from around the world. They proposed 18 music features that are shared amongst many music cultures of the world and a network of 10 features that often occur together. The aforementioned studies incorporated knowledge from human experts in order to annotate music characteristics for each recording. While expert knowledge provides reliable and in-depth insights into the music, the amount of human labour involved in the process makes it impractical for large-scale music corpora.
Computational tools, on the other hand, provide an efficient solution to processing large numbers of music recordings. In the field of MIR several studies have used computational tools to study large music corpora. For example, Mauch et al. [10] studied the evolution of popular music in the USA in a collection of approximately 17,000 recordings. They concluded that popular music in the US evolved with particular rapidity during three stylistic revolutions, around 1964, 1983 and 1991. With respect to non-Western music repertoires, Moelants et al. [12] studied pitch distributions in 91 recordings from Central Africa spanning from the beginning until the end of the 20th century. They observed that recent recordings tend to use more equally-tempered scales than older recordings.

Computational studies have also focused on predicting the geographic location of recordings from their music content. Gomez et al. [3] approached the prediction of musical cultures as a classification problem, and classified music tracks into Western and non-Western. They identified correlations between the latitude and tonal features, and between the longitude and rhythmic descriptors. Their work illustrates the complexity of using regression to predict the geographical coordinates of music origin. Zhou et al. [23] also approached this as a regression problem, predicting latitudes

and longitudes of the capital city of the music's country of origin, for pieces of music from 73 countries. They used K-nearest neighbours and Random Forest regression techniques, and achieved a mean distance error between predicted and target coordinates of 3113 kilometres (km). The advantage of treating geographic origin prediction as a regression problem is that it allows the latitude and longitude correlations found by Gomez et al. [3] to be considered, as well as the topology of the Earth. The disadvantage is not accounting for latitudes getting distorted towards the poles, and longitudes diverging at ±180 degrees. Location is usually used as an input feature in regression models; however, some studies have explored prediction of geographical origin in a continuous space in the domains of linguistics [2], criminology [22], and genetics [15, 21].

In this paper we study the correlation between spatial origins and musical characteristics of field recordings from the UK. We investigate whether the geographical location of a music recording can be predicted solely based on its audio content. We extract features capturing musical aspects of the audio signal and train a neural network to predict the latitude and longitude of the origins of the recording. We investigate the model's performance for different network architectures and learning parameters. We also compare the performance accuracy for several feature sets as well as the accuracy across different geographical regions of the UK. Our developments contribute to the evaluation of existing audio features and their applicability to folk music analysis. Our results provide insights into music patterns across the UK, but the model can be expanded to process music recordings from all around the world. This could contribute to identifying the location of recordings in large unlabelled music collections as well as studying patterns of music similarity in world music.
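The longitude wrap-around problem noted above (coordinates diverging at ±180 degrees) can be illustrated with a small sketch; the example values are ours, not the paper's:

```python
# Why naive regression on raw longitude values is problematic near the
# antimeridian: two nearby points can look maximally far apart.
lon_a, lon_b = 179.0, -179.0  # two points 2 degrees apart across +/-180

naive_error = abs(lon_a - lon_b)                     # 358 degrees: looks far
wrapped_error = min(naive_error, 360 - naive_error)  # 2 degrees: actually close

print(naive_error, wrapped_error)
```

A distance-based cost such as the Haversine error used later in the paper avoids this discontinuity, since it measures along the sphere rather than in raw coordinate space.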
This paper is organised as follows: Section 2 provides an overview of the music collection and Section 3 describes the different sets of audio features considered in this study. Section 4 provides a detailed description of the neural network architecture as well as the training and testing procedures. Section 5 presents the results of the model for different learning parameters, audio features, and geographical areas. We conclude with a discussion and directions for future work.

2. DATASET

Our music dataset is drawn from the World & Traditional music collection of the British Library Sound Archive 1, which includes thousands of music recordings collected over decades of ethnomusicological research. In particular, we use a subset of the World & Traditional music collection curated for the Digital Music Lab project [1]. This subset consists of more than 29,000 audio recordings with a large representation (17,000) from the UK. We focus solely on recordings from the UK and process information on the recording's location (if available) to extract the latitude and longitude coordinates. We keep only those tracks whose extracted coordinates lie within the spatial boundaries of the UK. The final dataset consists of a total of 10,055 recordings. The recordings span the years between 1904 and 2000 with median year 1983 and standard deviation 12.3 years. See Figure 1 for an overview of the geographical and temporal distribution of the dataset.

1 http://sounds.bl.uk/world-and-traditional-music

Figure 1: Geographical spread and year distribution in our dataset of 10,055 traditional music recordings from the UK. (a) Geographical spread; (b) year distribution.

The origins of the recordings span a range of maximum 1222 km. From the origins of all 10,055 recordings we compute the average latitude and average longitude coordinates and estimate the distance between each recording's location and the average latitude, longitude. This results in a mean distance of 167 km with standard deviation of 85 km.
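This mean-distance estimate can be sketched as follows; the coordinates below are hypothetical, not the paper's data, and the distance function is the standard great-circle (Haversine) formula:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, r=6367.0):
    # Standard great-circle (Haversine) distance: degrees in, km out.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical recording origins (lat, lon) -- not the paper's data.
origins = [(51.5, -0.1), (55.9, -3.2), (53.5, -2.2), (52.2, 0.1)]

# Centroid of the origins, then mean distance of each origin to it.
mean_lat = sum(lat for lat, _ in origins) / len(origins)
mean_lon = sum(lon for _, lon in origins) / len(origins)
mean_dist = sum(haversine_km(lat, lon, mean_lat, mean_lon)
                for lat, lon in origins) / len(origins)
```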
A similar estimate is computed from recordings in the training set and used as the random baseline for our regression predictions (Section 5).

3. AUDIO FEATURES

We aim to process music recordings to extract audio features that capture relevant music characteristics. We use a speech/music segmentation algorithm as a preprocessing step and extract features from the music segments using available VAMP plugins 2. We post-process the output of the VAMP plugins to compute musical descriptors based on state-of-the-art MIR research. Additional dimensionality reduction and scaling is considered as a final step. The methodology is summarised in Figure 2 and details are explained below.

Several recordings in our dataset consist of compilations of multiple songs or a mixture of speech and music segments. The first step in our methodology is to use a speech/music segmentation algorithm to extract relevant music segments from which the rest of the analysis is derived. We choose the best performing segmentation algorithm [9] based on the results of the Music/Speech Detection task of the MIREX 2015 evaluation 3. We apply the segmentation algorithm to extract music segments from

2 http://www.vamp-plugins.org
3 http://www.music-ir.org/mirex/wiki/2015:Music/Speech_Classification_and_Detection

Figure 2: Summary of the methodology: UK folk music recordings are processed with a speech/music segmentation algorithm and VAMP plugins are applied to the music segments. Audio features are derived from the output of the VAMP plugins, PCA is applied, and the output is fed to a neural network that predicts the latitude and longitude of the recording.

each recording in our dataset. We require a minimum of 10 seconds of music for each recording and discard any recordings with a total duration of music segments less than this threshold.

Our analysis aims to capture relevant musical characteristics which are informative for the spatial origins of the music. We focus on aspects of rhythm, melody, timbre, and harmony. We derive audio features from the following VAMP plugins: MELODIA - Melody Extraction 4, Queen Mary - Chromagram 5, Queen Mary - Mel-Frequency Cepstral Coefficients 6, and Queen Mary - Note Onset Detector 7. We apply these plugins to each recording in our dataset and omit frames that correspond to non-music segments as annotated by the previous step of speech/music segmentation. The raw output of the VAMP plugins cannot be directly incorporated in our regression model. We post-process the output into low-dimensional and musically meaningful descriptors as explained below.

Rhythm. We post-process the output of the Queen Mary - Note Onset Detector plugin to derive histograms of inter-onset interval (IOI) ratios [4]. Let O = {o_1, ..., o_n} denote a sequence of n onset locations (in seconds) as output by the VAMP plugin. The IOIs are defined as IOI = {o_{i+1} - o_i} for index i = 1, ..., n-1. The IOI ratios are defined as IOIR = {IOI_{j+1} / IOI_j} for index j = 1, ..., n-2.
The IOI ratios are tempo-independent descriptors because the tempo information carried by the magnitude of the IOIs vanishes in the ratio estimation. We compute a histogram of the IOIR values with 100 bins uniformly distributed between [0, 1).

Timbre. We extract summary statistics from the output of the Queen Mary - Mel-Frequency Cepstral Coefficients (MFCC) plugin [8] with the default values of frame and hop size. In particular, we remove the first coefficient (the DC component) and extract the min, max, mean, and standard deviation of the remaining 19 MFCCs over time.

Melody. The output of the MELODIA - Melody Extraction plugin denotes the frequency estimates over time of the lead melody. We extract a set of features capturing characteristics of the pitch contour shape and melodic embellishments [16]. In particular, we extract statistics of the pitch range and duration, fit a polynomial curve to model the overall shape and turning points of the contour, and estimate the vibrato range and extent of melodic embellishments. Each recording may consist of multiple shorter pitch contours. We keep the mean and standard deviation of features across all pitch contours extracted from the audio recording. We also post-process the output from MELODIA to compute an octave-wrapped pitch histogram [2] with 1-cent resolution.

Harmony. The output of the Queen Mary - Chromagram plugin is an octave-wrapped chromagram with 100-cent resolution [5]. We use the default frame and hop size and extract summary statistics denoting the min, max, mean, and standard deviation of the chroma vectors over time.

The above process results in a total of 1484 features per recording. Before further processing, the features were standardised with z-scores.

4 http://mtg.upf.edu/technologies/melodia
5 http://vamp-plugins.org/plugin-doc/qm-vamp-plugins.html#qm-chromagram
6 http://vamp-plugins.org/plugin-doc/qm-vamp-plugins.html#qm-mfcc
7 http://vamp-plugins.org/plugin-doc/qm-vamp-plugins.html#qm-onsetdetector
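The rhythm and timbre post-processing described above can be sketched as follows; this is our illustration, and the exact clipping of ratios and frame parameters may differ from the paper's implementation:

```python
import numpy as np

def ioi_ratio_histogram(onsets, n_bins=100):
    """Histogram of inter-onset-interval ratios (the Rhythm descriptor).
    Ratios are clipped into [0, 1) here; this clipping is our assumption."""
    iois = np.diff(np.asarray(onsets))   # IOI_i = o_{i+1} - o_i
    ratios = iois[1:] / iois[:-1]        # IOIR_j = IOI_{j+1} / IOI_j
    hist, _ = np.histogram(np.clip(ratios, 0.0, 1.0 - 1e-9),
                           bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)     # normalise to a distribution

def summary_stats(frames):
    """Min, max, mean, std over time for each coefficient (e.g. 19 MFCCs)."""
    frames = np.asarray(frames)
    return np.concatenate([frames.min(axis=0), frames.max(axis=0),
                           frames.mean(axis=0), frames.std(axis=0)])

# Toy example: a short onset sequence and random stand-in 'MFCC' frames.
onsets = [0.0, 0.5, 1.0, 1.4, 1.8, 2.1]
rhythm_vec = ioi_ratio_histogram(onsets)                               # 100 dims
timbre_vec = summary_stats(np.random.default_rng(0).normal(size=(200, 19)))  # 76 dims
```

Note that the dimensionalities match the feature counts implied in the text: 100 rhythm bins and 4 statistics over 19 MFCCs (76 values).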
Dimensionality reduction was also applied with Principal Component Analysis (PCA), including whitening and keeping enough components to represent 99% of the variance.

4. REGRESSION MODEL

The prediction of spatial coordinates from music data has been treated as a regression problem in previous research using K-nearest neighbours and Random Forest regression methods [23]. We explore the application of a neural network method. Neural networks have been shown to outperform existing methods in supervised tasks of music similarity [7, 11, 19]. We evaluate the performance of a neural network under different parameters for the regression problem of predicting latitude and longitude from music features.

A neural network with two continuous-valued outputs, the latitude and longitude predictions, was built in Tensorflow. We used the Adaptive Moment Estimation (Adam) algorithm for optimisation, the Rectified Linear Unit (ReLU) as activation function, and a drop-out rate of 0.5 for regularisation. The evaluation of the model performance was based on the mean distance error in km, calculated using the Haversine formula [6]. The Haversine distance d between two points in km is given by

Parameters                Values
Target scaling            True or False
Number of hidden layers   {3, 4}
Cost function             Haversine or MSE
Learning rate             {0.05, 0.1, 0.5}
L1 regularisation         {0, 0.05, 0.5}
L2 regularisation         {0, 0.05, 0.5}

Table 1: The hyper-parameters and their range of values for optimisation.

d = 2r arcsin([sin^2((phi_2 - phi_1)/2) + cos(phi_1) cos(phi_2) sin^2((lambda_2 - lambda_1)/2)]^(1/2))   (1)

where phi represents the latitude, lambda the longitude, and r the radius of the sphere (with r fixed to 6367 km in this study). We further explored the performance of the model under architectures with different numbers of hidden layers, two different cost functions, and a range of regularisation parameters, as explained below.

4.1 Parameter Optimisation

A grid-search over model hyper-parameters was performed to identify the combination that achieves the best performance in cross-validation. The following hyper-parameters were considered for optimisation: whether or not to scale the targets (i.e., z-score standardisation of the ground truth latitude/longitude coordinates of each recording), the number of hidden layers, two possible cost functions, namely the Haversine distance in km and the Mean Squared Error (MSE), and a range of values for the learning rate and the L1 and L2 regularisation parameters. The parameter optimisation is summarised in Table 1. We tested in total 216 combinations of hyper-parameters and selected the best performing combination to tune parameters and retrain the model for the final results.

4.2 Train-test splits

The training of the model was done in two phases. First the model was trained using the full set of features (Section 3) and the different hyper-parameters as defined in Table 1. The hyper-parameters were tuned based on the optimal performance obtained through cross-validation. In the second phase, the hyper-parameters were fixed to their optimal values and the model was retrained for different sets of features. Each new model's performance was assessed on a test set unique to that model.
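Equation (1) can be implemented directly; a sketch of ours, using numpy and the paper's r = 6367 km:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r=6367.0):
    """Haversine distance in km between points given in degrees (Equation 1)."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    lam1, lam2 = np.radians(lon1), np.radians(lon2)
    a = (np.sin((phi2 - phi1) / 2) ** 2
         + np.cos(phi1) * np.cos(phi2) * np.sin((lam2 - lam1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

# Sanity checks: zero distance, quarter great circle, antipodal points.
d0 = haversine_km(51.5, -0.1, 51.5, -0.1)   # 0 km
d90 = haversine_km(0.0, 0.0, 90.0, 0.0)     # pi * r / 2
d180 = haversine_km(0.0, 0.0, 0.0, 180.0)   # pi * r
```

Because it is written with numpy operations, the same function can score whole arrays of predicted and target coordinates at once when computing the mean distance error.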
In the first training phase, we sampled at random 70% of the total of 10,055 recordings for training. This resulted in a total of 7,038 samples in the training set, of which 30% (2,111) was set aside for validation. Following PCA, the feature dimensionality of the dataset was 368.

Target    Hidden  Cost       Training    Validation
Scaling   Layers  Function   Error (km)  Error (km)
True      3       Haversine   72.68      119.36
True      3       MSE        166.21      166.27
True      4       Haversine   98.30      128.44
True      4       MSE        166.19      166.24
False     3       Haversine  165.34      166.79
False     3       MSE        169.91      169.30
False     4       Haversine  170.91      171.26
False     4       MSE        181.44      180.10

Table 2: Results for parameter optimisation. The learning rate and the L1 and L2 regularisation parameters are fixed to 0.05, 0, and 0.05 respectively. Best performance is obtained when target scaling is combined with 3 hidden layers and the Haversine distance as cost function.

We used cross-validation with K = 5 folds and tuned parameters based on the mean of the distance error on the validation set (Equation 1). In the second phase we retrained the model for different feature sets. For each feature set, the dataset was split into training (a random 70%) and test (the remaining 30%) and the performance of the model was assessed on the test set.

5. RESULTS

5.1 Parameter Optimisation

The model that produced the lowest mean error on the validation set (119 km) used the following hyper-parameters: target scaling, 3 hidden layers, the Haversine distance as cost function, a learning rate of 0.05, and L1 and L2 regularisation parameters of 0 and 0.05, respectively. The main hyper-parameters that determined the accuracy of the model were the use of the Haversine distance as the cost function, and the application of target scaling. The performance of the model for different parameter values is shown in Table 2.

5.2 Results for different feature sets

The second set of experiments explored the performance of the model when trained on different sets of features. We estimated the random baseline from the origins of recordings in the training set.
In particular, we computed the average latitude and average longitude coordinates of the recordings and estimated the distance between each recording's location and the average latitude, longitude. Based on this estimate, the mean distance error of the baseline approach was 167.4 km. Each model was compared to the baseline approach (i.e., the mean distance error of its test targets) with a Wilcoxon signed-rank test. The performances of the models trained on different sets of features and evaluated on separate test sets were compared with a pairwise Wilcoxon rank-sum test (also known as Mann-Whitney) with Bonferroni correction for multiple comparisons. We consider a significance level of α = 0.05 and denote the Bonferroni-corrected level by α̂.
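As an illustration of the correction (ours; the paper does not state the exact number of tests), a Bonferroni-corrected level for all pairwise comparisons among the 18 models would be computed as:

```python
from math import comb

alpha = 0.05
n_models = 18
n_comparisons = comb(n_models, 2)   # 153 pairwise comparisons
alpha_hat = alpha / n_comparisons   # Bonferroni-corrected significance level
print(n_comparisons, alpha_hat)
```

Dividing α by the number of comparisons keeps the family-wise error rate at most α across all tests.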

Model  Feature Set                              Error (km)
1      All features                             149.8
2      Rhythm: IOIR histogram                   160.0
3      Harmony: Chromagram statistics           152.5
4      Timbre: MFCC statistics                  129.0
5      Pitch histogram                          160.1
6      Contour features mean                    159.8
7      Contour features standard deviation      162.3
8      Melody: Pitch hist., contour features    152.6
9      Rhythm and Harmony                       149.1
10     Rhythm and Timbre                        120.1
11     Rhythm and Melody                        150.5
12     Melody and Harmony                       139.4
13     Melody and Timbre                        117.1
14     Timbre and Harmony                       114.0
15     Rhythm, Harmony, and Timbre              118.3
16     Rhythm, Harmony, and Melody              142.8
17     Rhythm, Timbre, and Melody               119.8
18     Harmony, Timbre, and Melody              140.3
       Baseline                                 167.4

Table 3: The mean distance error (in km) on the test set for 18 models trained on different sets of features.

Figure 3: Distance error of predictions for different sets of features (see Table 3 for the feature set used to train each model). Labels a-l indicate feature sets that have non-significantly different results (p > α̂) where they share the same letter. For example, feature set 3 shares the label a with feature set 8 but shares no label with any other feature set, indicating that the results from model 3 are significantly different from those of all other models except for model 8.

Figure 4: (a) Ground truth and (b) predicted music recording origins, coloured by the distance error (in km) for the best performing model (no. 14).

All models achieved results significantly different from the baseline approach (p < 0.01). The best performance (lowest error of 114.0 km) was achieved when combining the timbral and harmonic descriptors (model 14). This combines the summary statistics of the chromagram and the summary statistics of the MFCCs.
The performance of this model was significantly different (p < α̂) from all other models except models 13 and 15, trained on melodic and timbral, and rhythmic, harmonic and timbral descriptors, respectively. The model achieved a mean error of 149.8 km on the test set when all features (Section 3) were used. The results from model 3, trained on harmonic descriptors, were significantly different from all other models except model 8, trained on melodic features. The model trained on rhythmic descriptors (model 2) is amongst the weakest predictors. However, adding rhythmic features to any of the melodic, harmonic, or timbral features, as for example in models 9, 10, 11, significantly improves the performance of the model (p < α̂ for pairwise comparisons between models 3 and 9, 4 and 10, 8 and 11). Models 5, 6, 7, trained on pitch histograms, contour feature means, and contour feature standard deviations, respectively, are also amongst the weakest predictors, but when all these features are combined together as in model 8, the performance is improved. See Table 3 for an overview of the prediction accuracy of models trained on different feature sets. Figure 3 provides a box-plot visualisation of the results from different feature sets and marks statistical significance between results.

5.3 Results for different regions

The last analyses aim to study the prediction accuracy with respect to the geographical origins of recordings. Figure 4 shows the ground truth and predicted coordinates for the best performing model (model no. 14 as denoted in Table 3), coloured by the distance error in km. We observe that data points with the lowest predictive accuracy originate from the north-eastern and the south-western areas of the UK (Figure 4a). Predictions are mostly concentrated in the southern part of the UK. Data points predicted towards the

Figure 5: Music recording origins coloured by the distance error (in km) for models trained on (a) rhythmic, (b) harmonic, (c) timbral, and (d) melodic features (models no. 2, 3, 4, 8 respectively, as defined in Table 3).

eastern areas indicate a larger distance error (Figure 4b). In Figure 5 we visualise the prediction accuracy of models trained on different feature sets with respect to geography. We observe that for all models the northern areas of the UK (i.e., the region of Scotland) are predicted with a relatively large distance error (lowest accuracy). For the model trained on timbral features (Figure 5c) we also observe the south west of England predicted with lower accuracy than for the models trained on harmonic and melodic features (Figures 5b and 5d).

6. DISCUSSION

Our results provide insights on the contribution of different feature sets and suggest patterns of music similarity across geographical regions. The methodology can be improved in various ways. The initial corpus of folk and traditional music from the UK consisted of a total of 17,000 recordings, of which only 10,055 were processed in this study. The final dataset had a skewed geographical distribution with over-representation of the south-eastern and south-western UK regions, e.g., Devon and Suffolk, and under-representation of the north-eastern and north-western areas, e.g., Scotland and Northern Ireland. Effects of the skewness of the dataset could be observed in the distribution of predicted latitude and longitude coordinates (Figure 4b). A larger and more representative corpus can be used in future work.

We used features derived from the output of VAMP plugins to describe the musical content of audio recordings. Some of these plugins were designed for different music styles and their application to folk music might not give robust results.
A thorough evaluation of the suitability of the features could give valuable insights for improving their robustness to different corpora, such as the one used in this study. We used feature representations averaged over time, but preserving temporal information in the features could provide a better description of the musical content in future work. We observed that models trained on individual feature sets showed, on average, larger distance errors, whereas models trained on combinations of features achieved, on average, higher accuracies. An exception is the model trained on all features, whose performance showed a relatively large distance error. This could be due to limitations of the model, especially with regard to over-fitting, or to a lack of adequate music information captured by the features. Integrating additional audio features could help capture more of the variance of the data and improve the model. The model was validated for a range of parameters, and several approaches were considered to avoid over-fitting. However, evidence of over-fitting could still be observed in the final results. Training with more data could help make the model more generalisable in future work. What is more, oversampling techniques could be explored to overcome the problem of under-represented geographical regions in our dataset. Neural networks, in combination with audio features as proposed in this study, can provide good predictions of the origins of the music. This can aid musicological research as well as improve spatial metadata associated with large music collections.

7. CONCLUSION

We studied a collection of field recordings from the UK and investigated whether the geographical origins of recordings can be predicted from the music attributes of the audio signal. We treated this as a regression problem and trained a neural network to take audio features as input and predict the latitude and longitude of the music's origin.
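As a toy illustration of this regression setup — audio feature vectors in, (latitude, longitude) out — the sketch below trains a small one-hidden-layer network on synthetic data. The feature dimensionality, hidden size, and training scheme are illustrative assumptions, not the architecture or hyperparameters used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 100 recordings, 20 time-averaged audio descriptors each;
# targets are (latitude, longitude) pairs scattered around the UK.
X = rng.normal(size=(100, 20))
Y = rng.normal(loc=(54.0, -2.0), scale=(2.0, 1.5), size=(100, 2))

# One hidden layer (tanh) with a linear output, trained by full-batch
# gradient descent on the mean squared error of the coordinates.
W1 = rng.normal(scale=0.1, size=(20, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 2)); b2 = np.zeros(2)
lr, losses = 0.01, []

for _ in range(500):
    H = np.tanh(X @ W1 + b1)      # hidden activations
    pred = H @ W2 + b2            # predicted (lat, lon)
    err = pred - Y
    losses.append((err ** 2).mean())
    g = 2 * err / err.size        # gradient of the MSE w.r.t. pred
    gH = (g @ W2.T) * (1 - H ** 2)
    W2 -= lr * H.T @ g; b2 -= lr * g.sum(0)
    W1 -= lr * X.T @ gH; b1 -= lr * gH.sum(0)
```

In practice the loss would be monitored on a held-out validation set, since — as discussed above — over-fitting is a real concern with a dataset of this size.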
We trained the model under different hyperparameters and tested its performance for different feature sets. The highest accuracy was achieved by the model trained on timbral and harmonic features, but no significant differences were found compared to the same model with rhythmic features added, or with melodic features replacing harmonic ones. The southern regions of the UK were predicted with relatively high accuracy, whereas northern regions were predicted with low accuracy. Effects of the skewness of the dataset and the reliability of the audio features were discussed. The corpus and methodology can be improved in future work, and the applicability of the model could be extended to music from around the world.

8. ACKNOWLEDGEMENTS

MP is supported by a Queen Mary research studentship.
