Analyzing the Relationship Among Audio Labels Using Hubert-Arabie adjusted Rand Index

Similar documents
MUSI-6201 Computational Music Analysis

Automatic Music Clustering using Audio Attributes

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?

Using Genre Classification to Make Content-based Music Recommendations

Lecture 15: Research at LabROSA

Music Genre Classification and Variance Comparison on Number of Genres

Subjective Similarity of Music: Data Collection for Individuality Analysis

Music Emotion Recognition. Jaesung Lee. Chung-Ang University

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval

Classification of Timbre Similarity

Automatic Music Genre Classification

Supervised Learning in Genre Classification

Music Mood Classification - an SVM based approach. Sebastian Napiorkowski

Outline. Why do we classify? Audio Classification

Music Information Retrieval Community

Semi-supervised Musical Instrument Recognition

MusCat: A Music Browser Featuring Abstract Pictures and Zooming User Interface

CS229 Project Report Polyphonic Piano Transcription

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC

Week 14 Music Understanding and Classification

Topics in Computer Music Instrument Identification. Ioanna Karydi

Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes

Detecting Musical Key with Supervised Learning

Music Recommendation from Song Sets

Supporting Information

Music Genre Classification

Predicting Time-Varying Musical Emotion Distributions from Multi-Track Audio

Music Information Retrieval

THE importance of music content analysis for musical

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST)

TOWARD AN INTELLIGENT EDITOR FOR JAZZ MUSIC

Breakscience. Technological and Musicological Research in Hardcore, Jungle, and Drum & Bass

A Categorical Approach for Recognizing Emotional Effects of Music

ISMIR 2008 Session 2a Music Recommendation and Organization

A repetition-based framework for lyric alignment in popular songs

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models

Music Mood. Sheng Xu, Albert Peyton, Ryan Bhular

Effects of acoustic degradations on cover song recognition

Singer Recognition and Modeling Singer Error

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION


Analysis and Clustering of Musical Compositions using Melody-based Features

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University

Music Similarity and Cover Song Identification: The Case of Jazz

hit), and assume that longer incidental sounds (forest noise, water, wind noise) resemble a Gaussian noise distribution.

Neural Network for Music Instrument Identi cation

HUMAN PERCEPTION AND COMPUTER EXTRACTION OF MUSICAL BEAT STRENGTH

Automatic Music Similarity Assessment and Recommendation. A Thesis. Submitted to the Faculty. Drexel University. Donald Shaul Williamson

Release Year Prediction for Songs

MODELS of music begin with a representation of the

Dimensional Music Emotion Recognition: Combining Standard and Melodic Audio Features

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS

Music Radar: A Web-based Query by Humming System

About Giovanni De Poli. What is Model. Introduction. di Poli: Methodologies for Expressive Modeling of/for Music Performance

Singer Traits Identification using Deep Neural Network

The Million Song Dataset

TOWARDS IMPROVING ONSET DETECTION ACCURACY IN NON- PERCUSSIVE SOUNDS USING MULTIMODAL FUSION

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM

Mood Tracking of Radio Station Broadcasts

Automatic Rhythmic Notation from Single Voice Audio Sources

Creating a Feature Vector to Identify Similarity between MIDI Files

ON FINDING MELODIC LINES IN AUDIO RECORDINGS. Matija Marolt

Reconstruction of Ca 2+ dynamics from low frame rate Ca 2+ imaging data CS229 final project. Submitted by: Limor Bursztyn

Analysis of local and global timing and pitch change in ordinary

Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting

Story Tracking in Video News Broadcasts. Ph.D. Dissertation Jedrzej Miadowicz June 4, 2004

Perceptual Evaluation of Automatically Extracted Musical Motives

A PERPLEXITY BASED COVER SONG MATCHING SYSTEM FOR SHORT LENGTH QUERIES

Automatic Laughter Detection

Automatic Piano Music Transcription

Quality of Music Classification Systems: How to build the Reference?

MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES

Hidden Markov Model based dance recognition

Music Source Separation

Can Song Lyrics Predict Genre? Danny Diekroeger Stanford University

Robert Alexandru Dobre, Cristian Negrescu

Evaluating Melodic Encodings for Use in Cover Song Identification

Transcription of the Singing Melody in Polyphonic Music

MUSICAL MOODS: A MASS PARTICIPATION EXPERIMENT FOR AFFECTIVE CLASSIFICATION OF MUSIC

Exploring Relationships between Audio Features and Emotion in Music

Music Information Retrieval

A Survey of Audio-Based Music Classification and Annotation

Unifying Low-level and High-level Music. Similarity Measures

13 Matching questions

HIT SONG SCIENCE IS NOT YET A SCIENCE

An action based metaphor for description of expression in music performance

Research Article. ISSN (Print) *Corresponding author Shireen Fathima

Music Information Retrieval for Jazz

Deep Neural Networks Scanning for patterns (aka convolutional networks) Bhiksha Raj

Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video

6.UAP Project. FunPlayer: A Real-Time Speed-Adjusting Music Accompaniment System. Daryl Neubieser. May 12, 2016

Bi-Modal Music Emotion Recognition: Novel Lyrical Features and Dataset

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes

MidiFind: Fast and Effec/ve Similarity Searching in Large MIDI Databases

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng

Topic 10. Multi-pitch Analysis

Automatic Labelling of tabla signals

SIGNAL + CONTEXT = BETTER CLASSIFICATION

Transcription:

Kwan Kim

Submitted in partial fulfillment of the requirements for the Master of Music in Music Technology in the Department of Music and Performing Arts Professions in The Steinhardt School, New York University

Advisor: Dr. Juan P. Bello
Reader: Dr. Kenneth Peacock
Date: 2012/12/11

Copyright (c) 2012 Kwan Kim

Abstract

With the advent of advanced technology and instant access to the Internet, music databases have grown rapidly, requiring more efficient ways of organizing and providing access to music. A number of automatic classification algorithms have been proposed in the field of music information retrieval (MIR) by means of supervised learning, in which ground truth labels are imperative. The goal of this study is to analyze the statistical relationship among audio labels such as era, emotion, genre, instrument, and origin, using the Million Song Dataset and the Hubert-Arabie adjusted Rand index, in order to observe whether there is a significant correlation between these labels. It is found that the cluster validation among audio labels is low, which implies no strong correlation and insufficient co-occurrence between these labels when describing songs.

Acknowledgements

I would like to thank everyone involved in completing this thesis. I especially send my deepest gratitude to my advisor, Juan P. Bello, for keeping me motivated. His critiques and insights consistently pushed me to become a better student. I also thank Mary Farbood for being such a friendly mentor. It was a pleasure to work as her assistant for the past year and a half. I thank the rest of the NYU faculty for providing the opportunity and an excellent program of study. Lastly, I thank my family and wife for their support and love.

Contents

1 Introduction
2 Literature Review
  2.1 Music Information Retrieval
  2.2 Automatic Classification
    2.2.1 Genre
    2.2.2 Emotion
3 Methodology
  3.1 Data Statistics
  3.2 Filtering
    3.2.1 1st Filtering
    3.2.2 2nd Filtering
    3.2.3 3rd Filtering
      3.2.3.1 Co-occurrence
      3.2.3.2 Hierarchical Structure
    3.2.4 4th Filtering
      3.2.4.1 Term Frequency
    3.2.5 5th Filtering
  3.3 Audio Labels
    3.3.1 Era
    3.3.2 Emotion
    3.3.3 Genre
    3.3.4 Instrument
    3.3.5 Origins
  3.4 Audio Features
    3.4.1 k-means Clustering Algorithm
    3.4.2 Feature Matrix
    3.4.3 Feature Scale
    3.4.4 Feature Clusters
  3.5 Hubert-Arabie adjusted Rand Index
4 Evaluation and Discussion
  4.1 K vs. ARI_HA
  4.2 Hubert-Arabie adjusted Rand Index (revisited)
  4.3 Cluster Structure Analysis
    4.3.1 Neighboring Clusters vs. Distant Clusters
    4.3.2 Correlated Terms vs. Uncorrelated Terms
5 Conclusion and Future Work
References

List of Figures

1.1 System Diagram of a Generic Automatic Classification Model
2.1 System Diagram of a Genre Classification Model
2.2 System Diagram of a Music Emotion Recognition Model
2.3 Thayer's 2-Dimensional Emotion Plane (19)
3.1 Clusters
3.2 Co-occurrence - Same Level
3.3 Co-occurrence - Different Level
3.4 Hierarchical Structure (Terms)
3.5 Intersection of Labels
3.6 Era Histogram
3.7 Emotion Histogram
3.8 Genre Histogram
3.9 Instrument Histogram
3.10 Origin Histogram
3.11 Elbow Method
3.12 Content-based Cluster Histogram
4.1 K vs. ARI_HA
4.2 Co-occurrence between feature clusters and era clusters
4.3 Co-occurrence between feature clusters and emotion clusters
4.4 Co-occurrence between feature clusters and genre clusters
4.5 Co-occurrence between feature clusters and instrument clusters
4.6 Co-occurrence between feature clusters and origin clusters
4.7 Co-occurrence between era clusters and feature clusters
4.8 Co-occurrence between emotion clusters and feature clusters
4.9 Co-occurrence between genre clusters and feature clusters
4.10 Co-occurrence between instrument clusters and feature clusters
4.11 Co-occurrence between origin clusters and feature clusters

List of Tables

3.1 Overall Data Statistics
3.2 Field List
3.3 Terms
3.4 Labels
3.5 Clusters
3.6 Hierarchical Structure (Clusters)
3.7 Hierarchical Structure (mu and sigma)
3.8 Mutually Exclusive Clusters
3.9 Filtered Dataset
3.10 Era Statistics
3.11 Emotion Terms
3.12 Genre Terms
3.13 Instrument Statistics
3.14 Origin Terms
3.15 Audio Features
3.16 Cluster Statistics
3.17 2 x 2 Contingency Table
4.1 ARI_HA
4.2 Term Co-occurrence
4.3 Term Co-occurrence
4.4 Optimal Cluster Validation
4.5 Self-similarity Matrix
4.6 Neighboring Clusters
4.7 Distant Clusters
4.8 Term Correlation
4.9 Term Correlation
4.10 Term Vectors
4.11 Label Cluster Distance
4.12 Label Cluster Distance

Chapter 1
Introduction

In the 21st century, we live in a world where instant access to a countless number of music recordings is granted. Online music stores such as the iTunes Store and streaming services such as Pandora provide millions of songs from artists all over the world. As music databases have grown rapidly with the advent of advanced technology and the Internet, much more efficient ways of organizing and finding music are required. One of the main tasks in the field of music information retrieval (MIR) is to build a computational model for the classification of audio data so that it becomes faster and easier to search for and listen to music. A number of researchers have proposed methods to categorize music into classes such as genres, emotions, activities, or artists (1, 2, 3, 4). Such automated classification lets us search for audio data based on labels; e.g., when we search for sad music, an audio emotion classification model returns songs with the label "sad".

Regardless of the type of classification model, there is a generic approach to this problem, outlined in figure 1.1: extracting audio features, obtaining labels, and computing the parameters of a model by means of a supervised machine learning technique. When utilizing a supervised learning technique to construct a classification model, however, it is imperative that ground truth labels are provided. Obtaining labels involves human subjects, which makes the process expensive and inefficient. In certain cases, the number of labels is bound to be insufficient, making it even harder to collect data. As a result, researchers have used semi-supervised learning methods, in which unlabeled data is combined with labeled data during the training process in order to improve performance (5). However, this approach is also limited to situations where the data has only one type of label; e.g., if a dataset is labeled by genre, it is possible to construct a genre classification model, but it is not possible to create a mood classification model without knowing an a priori correspondence between genre and mood labels. This causes a problem when a dataset has only one type of label and needs to be classified into a different label class. It would be much more efficient and less expensive if there existed a statistical correspondence among different audio labels, since that would make it possible to predict a different label class for the same dataset. Therefore, the goal of this study is to define a statistical relationship among different audio labels such as genre, emotion, era, origin, and instrument, using the Million Song Dataset (6), applying an unsupervised learning technique (the k-means algorithm), and calculating the Hubert-Arabie adjusted Rand (ARI_HA) index (7).

The outline of this thesis is organized as follows: a literature review of previous MIR studies on automatic classification models is provided in Chapter 2. The detailed methodology and data analysis are given in Chapter 3. Based on the results obtained from Chapter 3, possible interpretations of the data are discussed in Chapter 4. Finally, concluding remarks and future work are laid out in Chapter 5.

Figure 1.1: System Diagram of a Generic Automatic Classification Model - Labels are used only in the supervised learning case

Chapter 2
Literature Review

2.1 Music Information Retrieval

There are many ways to categorize music. One of the traditional ways is by its metadata, such as the name of the song, artist, or album, which is known as tag-based or text-based categorization (8). As music databases have grown virtually countless, more efficient ways to query and retrieve music are required. As opposed to tag-based query and retrieval, which only allows us to retrieve songs that we have a priori information about, content-based query and retrieval allows us to find songs in different ways - e.g., it allows us to find songs that are similar in musical context or structure, and it can also recommend songs based on musical labels such as emotion. Music information retrieval (MIR) is a widely and rapidly growing research topic in the multimedia processing industry, which aims at extending the understanding and usefulness of music data through the research, development, and application of computational approaches and tools. As a novel way of retrieving songs or creating a playlist, researchers have come up with a number of classification methods using different labels such as genre, emotion, or cover song (1, 2, 3, 4) so that each classification model can retrieve a song based on its label or musical similarity. These methods differ from the tag-based method since audio features are extracted and analyzed prior to constructing a computational model. Therefore, a retrieved song is based on the content of the audio, not on its metadata.

2.2 Automatic Classification

In previous studies, most audio classification models are based on supervised learning methods, in which musical labels such as genre or emotion are required (1, 2, 3, 4). Using labels along with well-defined high-dimensional musical features, learning algorithms train on the data to find possible relationships between the features and a label so that, for any given unknown (test) data, the model can correctly recognize the label.

2.2.1 Genre

Tzanetakis et al. (1, 9) are among the earliest researchers who worked on automatic genre classification. Instead of a musical genre being assigned manually, an automatic genre classification model generates a genre label for a given song after comparing its musical features with the model. In (1, 9) the authors used three feature sets, each describing timbral texture, rhythmic content, and pitch content, respectively. Features such as spectral centroid, rolloff, flux, zero crossing rate, and MFCC (10) were extracted to construct a feature vector that describes the timbral texture of music. An automatic beat detection algorithm (4, 11) was used to calculate the rhythmic structure of music and to build a feature vector that describes rhythmic content. Lastly, pitch detection techniques (12, 13, 14) were used to construct a pitch content feature vector. Figure 2.1 represents the system overview of the automatic genre classification model described in (1, 9).

Figure 2.1: System Diagram of a Genre Classification Model - Gaussian Mixture Model (GMM) is used as a classifier

2.2.2 Emotion

In 2006, the work of L. Lu et al. (15) was one of the few studies that provided an in-depth analysis of mood detection and tracking of music signals using acoustic features extracted directly from the audio waveform, instead of using MIDI or symbolic representations. Although it has been an active research topic, researchers have consistently faced the same problem with the quantification of music emotion due to its inherently subjective nature. Recent studies have sought ways to minimize the inconsistency among labels. Skowronek et al. (16) paid close attention to the material collection process: they obtained a large number of labelled data from 12 subjects and kept only the items on which subjects agreed with one another, in order to exclude ambiguous ones. In (17), the authors created a collaborative game that collects dynamic (time-varying) labels of music mood from two players and ensures that the players cross-check each other's labels in order to build a consensus.

Defining mood classes is not an easy task. There have mainly been two approaches to defining mood: categorical and continuous. In (15) mood labels are classified into adjectives such as happy, angry, sad, or sleepy. However, the authors in (18) defined mood as a continuous regression problem as described in figure 2.2, and mapped emotion onto the two-dimensional Thayer's plane (19) shown in figure 2.3. Recent studies focus on multi-modal classification using both lyrics and audio content to quantify music emotion (20, 21, 22), on dynamic music emotion modeling (23, 24), or on unsupervised learning approaches for mood recognition (25).

Figure 2.2: System Diagram of a Music Emotion Recognition Model - Arousal and Valence are two independent regressors

Figure 2.3: Thayer's 2-Dimensional Emotion Plane (19) - Each axis is used as an independent regressor

Chapter 3
Methodology

Previous studies have constructed automatic classification models using a relationship between audio features and one type of label (e.g. genre or mood). As stated in chapter 1, however, if a statistical relationship among several audio labels were defined, it could reduce the cost of constructing automatic classification models. In order to address the problem, two things are needed:

1. Big music data with multiple labels: The Million Song Dataset (6)
2. A cluster validation method: the Hubert-Arabie adjusted Rand index

A large dataset is required to minimize the bias and noisiness of labels. Since labels are acquired from users, a small amount of music data would lead to large variance among labels and thus large error. A cluster validation method is required to compare the sets of clusters created by different labels, hence the Hubert-Arabie adjusted Rand index.

3.1 Data Statistics

The Million Song Dataset (6) consists of one million files in HDF5 format, from which various information can be retrieved, including metadata such as the name of the artist, the title of the song, or tags (terms), and musical features such as chroma, tempo, loudness, mode, or key. Table 3.1 shows the overall statistics of the dataset and table 3.2 shows a list of fields available in the files of the dataset.

No.  Type                              Total
1    Songs                             1,000,000
2    Data                              273 GB
3    Unique Artists                    44,745
4    Unique Terms                      7,643
5    Artists with at least one term    43,943
6    Identified Cover Songs            18,196

Table 3.1: Overall Data Statistics - Statistics of The Million Song Dataset

3.2 Filtering

LabROSA, the distributor of The Million Song Dataset, also provides all the necessary functions to access and manipulate the data from Matlab. The HDF5 Song File Reader function converts .h5 files into a Matlab object, which can then be used to extract labels with the get_artist_terms function and features with the get_segments_pitches, get_tempo, and get_loudness functions. The labels are therefore used to create several sets of clusters, while the audio features are used to form another set of clusters. Figure 3.1 illustrates the different sets of clusters. Although it would be ideal to take all million songs into account, due to the noisiness of the data the dataset must undergo the following filtering process to remove unnecessary songs:

1. All terms are categorized into one of 5 label classes.
2. Create a set of clusters based on each label class.
3. Find the hierarchical structure of each label class.
4. Make each set of clusters mutually exclusive.
5. Retrieve the songs that contain at least one term from all of the five label classes.

Figure 3.1: Clusters - Several sets of clusters can be made using labels and audio features

3.2.1 1st Filtering

As shown in table 3.1, there are 7,643 unique terms that describe songs in the dataset. Some examples of these terms are shown in table 3.3. These unique terms have to be filtered so that meaningless terms are ignored. In other words, five label classes are chosen so that each term can be categorized into one of the following labels: era, emotion, genre, instrument, and origin. In doing so, any term that cannot be described by one of those labels is dropped. Table 3.4 shows the total number of terms that belong to each label. Note the small number of terms in each label category compared to the original 7,643 unique terms. This is because many terms cross-reference each other. For example, "00s", "00s alternative", and "00s country" all count as unique terms, but they are all represented as "00s" under the era label class. Similarly, "alternative jazz", "alternative rock", "alternative r & b", and "alternative metal" are simply "alternative", "jazz", "rock", "r & b", and "metal" under the genre label class.

Field Name                    Type             Description
analysis sample rate          float            sample rate of the audio used
artist familiarity            float            algorithmic estimation
artist hotnesss               float            algorithmic estimation
artist id                     string           Echo Nest ID
artist name                   string           artist name
artist terms                  array string     Echo Nest tags
artist terms freq             array float      Echo Nest tags freqs
audio md5                     string           audio hash code
bars confidence               array float      confidence measure
bars start                    array float      beginning of bars
beats confidence              array float      confidence measure
beats start                   array float      result of beat tracking
danceability                  float            algorithmic estimation
duration                      float            in seconds
energy                        float            energy from listener perspective
key                           int              key the song is in
key confidence                float            confidence measure
loudness                      float            overall loudness in dB
mode                          int              major or minor
mode confidence               float            confidence measure
release                       string           album name
sections confidence           array float      confidence measure
sections start                array float      largest grouping in a song
segments confidence           array float      confidence measure
segments loudness max         array float      max dB value
segments loudness max time    array float      time of max dB value
segments loudness max start   array float      dB value at onset
segments pitches              2D array float   chroma feature
segments start                array float      musical events
segments timbre               2D array float   texture features
similar artist                array string     Echo Nest artist IDs
song hotttnesss               float            algorithmic estimation
song id                       string           Echo Nest song ID
tempo                         float            estimated tempo in BPM
time signature                int              estimate of number of beats per bar
time signature confidence     float            confidence measure
title                         string           song title
track id                      string           Echo Nest track ID

Table 3.2: Field List - A list of fields available in the files of the dataset

No.    Terms
1      00s
2      00s alternative
3      00s country
...
3112   gp worldwide
3113   grammy winner
3114   gramophone
...
5787   punky reggae
5788   pure black metal
5789   pure grunge
...

Table 3.3: Terms - Examples of terms (tags)

Label        Total
era          17
emotion      96
genre        436
instrument   78
origin       635

Table 3.4: Labels - The number of terms belonging to each label class

In this way, the total number of terms in each category is reduced, and it is still possible to search for songs without using repetitive terms. For example, a song carrying the term "alternative jazz" can be found with both the "alternative" and "jazz" keywords, instead of "alternative jazz". In addition, composite terms such as "alternative jazz" or "ambient electronics" are not included, since they sit at the lowest level of the hierarchy and the number of elements that belong to such clusters is small.

3.2.2 2nd Filtering

After all unique terms are filtered into one of the five label classes, each term belonging to each label class is regarded as a cluster, as shown in table 3.5. Note that it is still not certain that all terms are truly representative as independent clusters, as it must be taken into account that there are a few hierarchical layers among terms, i.e. the terms "piano" and "electric piano" might not be at the same level of the hierarchy in the instrument label class. In order to account for differences in layers, the co-occurrence between each pair of clusters is calculated, as explained in the next section.

Label        Clusters
era          00s, 1700s, 1910s, 19th century
emotion      angry, chill, energetic, horror, mellow
genre        ambient, blues, crossover, dark, electronic, opera
instrument   accordion, banjo, clarinet, horn, ukelele, laptop
origin       african, belgian, dallas, hongkong, moroccan

Table 3.5: Clusters - Each term forms a cluster within each label class

3.2.3 3rd Filtering

3.2.3.1 Co-occurrence

Within a single label class there are a number of different terms, each of which could possibly represent an individual cluster. However, while certain terms inherently possess a clear meaning, some do not; e.g. in the genre label class, the distinctions between "dark metal" and "death metal" or "acid metal" and "acid funk" might not be obvious. In order to avoid ambiguity among clusters, the co-occurrences of pairs of clusters are measured. The co-occurrences of a pair of clusters can be easily calculated as follows:

cooc_{a,b} = \frac{intersect(a, b)}{intersect(a, a)}, \quad cooc_{b,a} = \frac{intersect(a, b)}{intersect(b, b)}    (3.1)

where intersect(i, j) counts the number of elements that are in both cluster i and cluster j. Therefore, if both values are high or both are low, there is a large or a small overlap between the two clusters, respectively, while if one of the two values is high and the other is low, one cluster is a subset of the other, as illustrated in figures 3.2 and 3.3. Also note that if one cluster is a subset of the other, the two clusters are not at the same hierarchical level.

Figure 3.2: Co-occurrence - same level - (a) small overlap between two clusters; (b) large overlap between two clusters

Figure 3.3: Co-occurrence - different level - (a) Cluster B is a subset of Cluster A; (b) Cluster A has a relatively larger number of elements than Cluster B, most of which belong to the intersection

Therefore, a threshold is set such that if (cooc_{a,b} > 0.9 and cooc_{b,a} < 0.1) or (cooc_{a,b} < 0.1 and cooc_{b,a} > 0.9), then cluster A is a subset of cluster B or vice versa. If neither condition is met, the two clusters are considered to be at the same hierarchical level. In this way, the layers of the hierarchy can be retrieved.
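To make the computation concrete, the following Matlab sketch (not the thesis code; the cluster contents and names are hypothetical) evaluates Eq. (3.1) for two clusters stored as vectors of song indices and applies the 0.9/0.1 subset test described above.

```matlab
% Illustrative sketch: co-occurrence of two clusters per Eq. (3.1),
% with clusters stored as vectors of song indices.
clusterA = 1:40;          % hypothetical larger cluster, e.g. "piano"
clusterB = [3 4 5];       % hypothetical smaller cluster, e.g. "electric piano"

inter   = numel(intersect(clusterA, clusterB));   % |A intersect B|
cooc_ab = inter / numel(clusterA);                % intersect(a,b) / intersect(a,a)
cooc_ba = inter / numel(clusterB);                % intersect(a,b) / intersect(b,b)

% Hierarchy test with the thresholds used in the text (0.9 and 0.1)
if (cooc_ab > 0.9 && cooc_ba < 0.1) || (cooc_ab < 0.1 && cooc_ba > 0.9)
    disp('one cluster is a subset of the other (different hierarchical levels)');
else
    disp('the two clusters are treated as being at the same level');
end
```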

3.2.3.2 Hierarchical Structure

After obtaining co-occurrence values for all pairs of clusters, the structure of the clusters in each label class can be determined. Table 3.6 shows the hierarchical structure of each label class and figure 3.4 shows some of the terms at different hierarchical levels.

Label        1st Layer   2nd Layer   3rd Layer   Total
era          3           14          empty       17
emotion      3           93          empty       96
genre        27          127         274         428
instrument   18          72          empty       90
origin       13          135         487         635

Table 3.6: Hierarchical Structure (Clusters) - Total number of clusters at different layers in each label class

Figure 3.4: Hierarchical Structure (Terms) - Examples of terms at different layers in each label class

The structure correlates well with intuition, with more general terms such as "bass" or "guitar" at the higher level, while terms such as "acoustic bass" or "classical guitar" sit at the lower level. The number of songs in each cluster also matches intuition well: terms at the high level of the hierarchy have a large number of songs, while relatively few songs belong to terms at the low level.

Now that the structure of the clusters for each label class is known, it must be carefully decided which layer should be used, as there is a tradeoff between the number of clusters and the number of songs belonging to each cluster: a higher layer has a small total number of clusters, but each cluster contains a sufficient number of songs, and vice versa. In order to make a logical decision, three thresholds are considered: the number of clusters, N, the mean, mu, and the standard deviation, sigma, are calculated for every layer and shown in table 3.7. The rationale is that each layer within a label class must have enough clusters and that each cluster must contain a sufficient number of songs, while the variance of the distribution should be as small as possible. The author defined the thresholds as follows: N > 5, mu > 5,000, and sigma as small as possible. The 1st layer of the instrument class and the 2nd layer of the era, emotion, genre, and origin label classes are selected, as shown in table 3.7.

Label        1st Layer              2nd Layer              3rd Layer
             mu           sigma     mu           sigma     mu           sigma
era          59,524 (3)   47,835    21,686 (14)  43,263    empty        empty
emotion      23,95 (3)    15,677    5,736 (93)   19,871    empty        empty
genre        35,839 (27)  79,37     5,744 (127)  14,37     2,421 (274)  6,816
instrument   39,744 (18)  76,272    998 (72)     2,464     empty        empty
origin       69,452 (13)  141,44    8,84 (135)   23,383    644 (487)    2,8

Table 3.7: Hierarchical Structure (mu and sigma) - The mean and standard deviation for each layer. The number in parentheses denotes the number of clusters; bold numbers in the original denote the selected layer.

3.2.4 4th Filtering

3.2.4.1 Term Frequency

After finding the structure of the clusters and selecting a layer in the previous section, all the clusters within the same layer must be made mutually exclusive, leaving no overlapping elements among clusters. Therefore, after finding the intersections among clusters, it needs to be decided to which cluster each shared element should belong. In order to resolve conflicts among multiple clusters, the frequency of terms is retrieved for every element via the provided function get_artist_terms_freq. For every element within an intersection, the term frequency values are compared, and whichever term has the higher value takes the element, while the other loses it. In this way, the total number of clusters is reduced via merging and all the clusters become mutually exclusive. Table 3.8 shows the total number of songs in each label class.

Label        # of Songs
era          387,977
emotion      394,86
genre        7,778
instrument   384,59
origin       871,631

Table 3.8: Mutually Exclusive Clusters - Total number of songs in the mutually exclusive clusters
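A minimal sketch of this tie-break, assuming the terms and their frequencies have already been read with get_artist_terms and get_artist_terms_freq (the variable names and values below are hypothetical):

```matlab
% Sketch: resolve a song that falls into two clusters of the same label class
% by keeping the term with the higher frequency (section 3.2.4.1).
% terms/freqs stand in for the output of get_artist_terms / get_artist_terms_freq.
terms          = {'rock', 'blues', '80s'};   % hypothetical artist terms for one song
freqs          = [0.82, 0.35, 0.90];         % hypothetical matching term frequencies
candidateTerms = {'rock', 'blues'};          % the two conflicting genre-class terms

[tf, loc] = ismember(candidateTerms, terms); % locate the candidates among the terms
found     = candidateTerms(tf);
[~, best] = max(freqs(loc(tf)));             % the higher frequency wins the song
fprintf('song assigned to the "%s" cluster\n', found{best});
```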

3.2.5 5th Filtering

Since most songs are given multiple terms, they may belong to several label classes; e.g. a song with the terms "00s" and "alternative jazz" belongs to both the era and the genre label class. Therefore, after obtaining the indices of the songs that belong to each category, the intersection of these indices is retrieved so that only songs carrying a term from each of the five label classes are considered. This process is illustrated in figure 3.5. Finally, the total number of clusters in each label class and the total number of songs used in the study after all filtering processes are shown in table 3.9.

Figure 3.5: Intersection of Labels - Songs that belong to all five label classes are chosen

             Songs       Era   Emotion   Genre   Instrument   Origin
Original     1,000,000   14    91        122     17           99
Filtered     41,269      7     34        44      7            33

Table 3.9: Filtered Dataset - Total number of songs and clusters after filtering

3.3 Audio Labels 3.3 Audio Labels 3.3.1 Era After all the filtering processes, 7 clusters are selected for era label class. Terms such as 16 th century or 21 th century as well as 3s and 4s are successfully ignored via merging and hierarchy. Table 3.1 and figure 3.6 show the statistics of remaining terms. Note that the distribution is negatively skewed, which is intuitive, because there are more songs that exist in recorded format in later decades than early 2th century due to the advanced recording technology. It also makes sense that the cluster 8s consists of most songs because people use the term 8s to describe 8s rock or 8s music more often than s music or s pop. Era 5s 6s 7s 8s 9s s 2th Total 661 3,525 2,826 17,359 9,555 6,111 1,232 41,269 Table 3.1: Era Statistics - # of Songs belonging to each era cluster Histogram of Era Cluster 16 14 12 1 8 6 4 2 5s 6s 7s 8s 9s s 2th century Cluster Figure 3.6: Era Histogram - Distribution of songs based on era label 2

3.3.2 Emotion

There are a total of 34 clusters in the emotion label class, which are listed in table 3.11. Note the uneven distribution of songs in the emotion label class shown in figure 3.7. Clusters such as "beautiful", "chill", and "romantic" together account for about one third of the total songs, while relatively few songs belong to clusters such as "evil", "haunting", and "uplifting".

Emotion: beautiful, brutal, calming, chill, energetic, evil, gore, grim, happy, harsh, haunting, horror, humorous, hypnotic, intense, inspirational, light, loud, melancholia, mellow, moody, obscure, patriotic, peace, relax, romantic, sad, sexy, strange, trippy, uplifting, wicked, wistful, witty

Table 3.11: Emotion Terms - All the emotion terms

Figure 3.7: Emotion Histogram - Distribution of songs based on the emotion label

3.3.3 Genre

A total of 44 genre clusters are created; they are listed in table 3.12 and their distribution is shown in figure 3.8. Also note that certain genre terms such as "hip hop", "indie", and "wave" have more songs than others such as "emo" or "melodic".

Genre: alternative, ambient, ballade, blues, british, christian, classic, country, dance, dub, electronic, eurodance, hard style, hip hop, instrumental, industrial, indie, lounge, modern, neo, new, noise, nu, old, orchestra, opera, post, power, progressive, r&b, rag, soundtrack, salsa, smooth, soft, swing, synth pop, techno, thrash, tribal, urban, waltz, wave, zouk

Table 3.12: Genre Terms - All the genre terms

Figure 3.8: Genre Histogram - Distribution of songs based on the genre label

3.3.4 Instrument

There are only 7 instrument clusters after the filtering processes. The name of each cluster and the number of songs belonging to it are given in table 3.13. The values make sense, as "guitar", "piano", and "synth" have many songs in their clusters, while relatively few songs belong to "saxophone" and "violin". Figure 3.9 shows the histogram of the instrument clusters.

Instrument   # of Songs
bass         2444
drum         513
guitar       9731
piano        5667
saxophone    134
violin       322
synth        16662

Table 3.13: Instrument Statistics - Number of songs belonging to each instrument cluster

Figure 3.9: Instrument Histogram - Distribution of songs based on the instrument label

3.3.5 Origins

There are 33 different origin clusters, as laid out in table 3.14. Note that clusters such as "american", "british", "dc", and "german" have a large number of songs, while clusters such as "new orleans", "suomi", or "texas" consist of relatively few songs. Also note that the terms "american" and "texas" both appear as independent clusters, although it seems intuitive that "texas" should be a subset of "american". This is because, when describing a song with an origin label, certain songs are more specifically described by "texas" than by "american" or "united states", e.g. country music. Finally, the statistics of the origin label class are shown in figure 3.10.

Origin: african, american, belgium, british, canada, cuba, dc, east coast, england, german, ireland, israel, italian, japanese, los angeles, massachusetts, mexico, nederland, new york, norway, new orleans, poland, roma, russia, scotland, southern, spain, suomi, sweden, tennessee, texas, united states, west coast

Table 3.14: Origin Terms - All the origin terms

Figure 3.10: Origin Histogram - Distribution of songs based on the origin label

3.4 Audio Features

Audio features are extracted in order to construct feature clusters with a clustering algorithm, using the provided functions such as get_segments_timbre or get_segments_pitches. Table 3.15 shows the list of extracted features. It takes about 30 ms to extract a feature from one song, which amounts to a total of roughly 8 hours for a million songs. However, since only 41,269 songs are used, the computation time is reduced to less than an hour.

No.   Feature           Function
1     Chroma            get_segments_pitches
2     Texture           get_segments_timbre
3     Tempo             get_tempo
4     Key               get_key
5     Key Confidence    get_key_confidence
6     Loudness          get_loudness
7     Mode              get_mode
8     Mode Confidence   get_mode_confidence

Table 3.15: Audio Features - Several audio features are extracted via the respective functions

3.4.1 k-means Clustering Algorithm

Content-based clusters can be constructed with a clustering algorithm, an unsupervised learning method which does not require pre-labeling of the data and uses only features to construct clusters of similar data points. There are several variants of clustering algorithms, such as k-means, k-median, centroid-based, or single-linkage clustering (26, 27, 28). In this study, the k-means clustering algorithm is used to build the automatic clusters. The basic structure of the algorithm is defined in the following steps (29, 30):

1. Define a similarity measurement metric, d (e.g. Euclidean, Manhattan, etc.).
2. Randomly initialize k centroids, mu_k.
3. For each data point x, find the mu_k that returns the minimum d.
4. Find C_k, the cluster that includes the set of points assigned to mu_k.
5. Recalculate mu_k for every C_k.
6. Repeat steps 3 through 5 until convergence.
7. Repeat steps 2 through 6 multiple times to avoid local minima.

The author used the (squared) Euclidean distance as the similarity measurement metric, d, and computed the centroid mean of each cluster as follows:

d^{(i)} := \| x^{(i)} - \mu_k \|^2    (3.2)

\mu_k := \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)}    (3.3)

where x^{(i)} is the position of the i-th point. C_k is constructed by finding the c^{(i)} that minimizes (3.2), where c^{(i)} is the index of the centroid closest to x^{(i)}. In other words, a point belongs to the cluster for which the Euclidean distance between the point and the cluster's centroid is minimal.

3.4.2 Feature Matrix

Using the extracted audio features, i.e. chroma, timbre, key, key confidence, mode, mode confidence, tempo, and loudness, the feature matrix F of size I x J is constructed, where I is the total number of points (= 41,269) and J is the total number of features (= 30, i.e. both the chroma and timbre features are averaged across time, resulting in 12 x 1 dimensions each per point). Therefore, the cost function of the algorithm is

\frac{1}{I} \sum_{i=1}^{I} d^{(i)}    (3.4)

and the optimization objective is to minimize (3.4).
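A minimal Matlab sketch of this clustering step is given below. It is not the thesis code: a random stand-in replaces the real 41,269 x 30 feature matrix so the snippet runs on its own, and Matlab's built-in kmeans (Statistics Toolbox) is assumed to implement steps 1-7 above with the squared Euclidean distance of Eq. (3.2).

```matlab
% Sketch of the k-means step (sections 3.4.1-3.4.2), not the thesis code.
rng(0);                                      % reproducible stand-in data
F = randn(2000, 30);                         % stand-in for the 41,269 x 30 feature matrix

K = 10;                                      % value chosen by the elbow method (section 3.4.4)
[assign, centroids, sumd] = kmeans(F, K, ...
    'Distance',   'sqeuclidean', ...         % squared Euclidean distance, Eq. (3.2)
    'Replicates', 10);                       % several restarts to avoid local minima (step 7)

cost = sum(sumd) / size(F, 1);               % cost function of Eq. (3.4)
fprintf('average within-cluster distance: %.3f\n', cost);
```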

3.4.3 Feature Scale

Feature scaling is necessary because each feature vector has a different range and therefore needs to be normalized for equal weighting. The author used mean/standard-deviation scaling for each feature f_j:

\hat{f}_j = \frac{f_j - \mu_{f_j}}{\sigma_{f_j}}    (3.5)

3.4.4 Feature Clusters

It is often arbitrary what the correct number of clusters K should be, and there is no algorithm that leads to a definitive answer. However, the elbow method is often used to determine the number of clusters. Figure 3.11 shows a plot of the cost function for different values of K. Either K = 8 or K = 10 marks the elbow of the plot and is a possible candidate for the number of clusters. In this study, K = 10 is chosen.

Figure 3.11: Elbow Method - K = 8 or K = 10 is the possible number of clusters
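The scaling of Eq. (3.5) and the elbow sweep can be sketched as follows (again with a stand-in matrix rather than the real features); the cost of Eq. (3.4) is recorded over the same K = 5 to 50 range shown in figure 3.11 so the elbow can be read off the plot.

```matlab
% Sketch: per-feature mean/std scaling (Eq. 3.5) and an elbow sweep over K.
rng(0);
F  = randn(2000, 30);                        % stand-in for the raw feature matrix
Fs = (F - mean(F, 1)) ./ std(F, 0, 1);       % z-score each feature column

Ks    = 5:5:50;                              % candidate numbers of clusters
costs = zeros(size(Ks));
for i = 1:numel(Ks)
    [~, ~, sumd] = kmeans(Fs, Ks(i), 'Replicates', 5);
    costs(i) = sum(sumd) / size(Fs, 1);      % cost of Eq. (3.4)
end

plot(Ks, costs, '-o');
xlabel('K'); ylabel('Cost');                 % the bend in this curve is the elbow
```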

After choosing the value of K, the structure of the clusters is found and shown in figure 3.12.

Cluster      1       2       3       4       5
# of Songs   4,349   4,172   4,128   4,475   5,866

Cluster      6       7       8       9       10
# of Songs   3,933   2,544   5,149   2,436   4,217

Table 3.16: Cluster Statistics - The number of songs within each cluster

Figure 3.12: Content-based Cluster Histogram - Distribution of songs based on audio features

3.5 Hubert-Arabie adjusted Rand Index

After obtaining six sets of clusters, i.e. five based on labels and one based on audio features, the relationship between any pair of cluster sets can be quantified by calculating the Hubert-Arabie adjusted Rand (ARI_HA) index (7, 31). The ARI_HA index quantifies cluster validation by comparing generated clusters with the original structure of the data. Therefore, by comparing two different sets of clusters, the correlation between the two clusterings can be drawn. The ARI_HA index is measured as:

ARI_{HA} = \frac{\binom{N}{2}(a + d) - [(a + b)(a + c) + (c + d)(b + d)]}{\binom{N}{2}^2 - [(a + b)(a + c) + (c + d)(b + d)]}    (3.6)

where N is the total number of data points and a, b, c, d represent four different types of pairs. Let A and B be two sets of clusters and P and Q be the number of clusters in each set; then a, b, c, and d are defined as follows:

a : pairs that are in the same group in both A and B
b : pairs that are in the same group in B but in different groups in A
c : pairs that are in the same group in A but in different groups in B
d : pairs that are in different groups in both A and B

which can easily be described by the contingency table shown in table 3.17. This leads to the computation of a, b, c, and d as follows:

a = \frac{\sum_{p=1}^{P} \sum_{q=1}^{Q} t_{pq}^2 - N}{2}    (3.7)

b = \frac{\sum_{p=1}^{P} t_{p+}^2 - \sum_{p=1}^{P} \sum_{q=1}^{Q} t_{pq}^2}{2}    (3.8)

c = \frac{\sum_{q=1}^{Q} t_{+q}^2 - \sum_{p=1}^{P} \sum_{q=1}^{Q} t_{pq}^2}{2}    (3.9)

d = \frac{\sum_{p=1}^{P} \sum_{q=1}^{Q} t_{pq}^2 + N^2 - \sum_{p=1}^{P} t_{p+}^2 - \sum_{q=1}^{Q} t_{+q}^2}{2}    (3.10)

where t_{pq}, t_{p+}, and t_{+q} denote the number of elements belonging to both the p-th and the q-th cluster, the total number of elements belonging to the p-th cluster, and the total number of elements belonging to the q-th cluster, respectively. The index can be interpreted such that ARI_HA = 1 means perfect cluster recovery, while values greater than 0.9, 0.8, and 0.65 mean excellent, good, and moderate recovery, respectively (7).

                               B
A                              pair in same group   pair in different group
pair in same group             a                    b
pair in different group        c                    d

Table 3.17: 2 x 2 Contingency Table - A 2 x 2 contingency table that describes the four different types of pairs: a, b, c, d
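Eqs. (3.6)-(3.10) translate almost directly into a small Matlab function. The sketch below is illustrative rather than the author's implementation; it assumes the two partitions are given as equal-length vectors of positive integer cluster assignments and could be saved, for example, as ari_ha.m.

```matlab
function ari = ari_ha(labelsA, labelsB)
% Hubert-Arabie adjusted Rand index between two partitions of the same songs.
% labelsA, labelsB: equal-length vectors of positive integer cluster assignments.
    t  = accumarray([labelsA(:), labelsB(:)], 1);     % contingency table t_pq
    N  = numel(labelsA);
    st = sum(t(:).^2);                                % sum of t_pq^2
    sp = sum(sum(t, 2).^2);                           % sum of t_p+^2 (row totals)
    sq = sum(sum(t, 1).^2);                           % sum of t_+q^2 (column totals)

    a = (st - N) / 2;                                 % Eq. (3.7)
    b = (sp - st) / 2;                                % Eq. (3.8)
    c = (sq - st) / 2;                                % Eq. (3.9)
    d = (st + N^2 - sp - sq) / 2;                     % Eq. (3.10)

    nPairs   = nchoosek(N, 2);
    expected = (a + b) * (a + c) + (c + d) * (b + d);
    ari      = (nPairs * (a + d) - expected) / (nPairs^2 - expected);   % Eq. (3.6)
end
```

Calling ari_ha on the feature-cluster assignments and, say, the genre-cluster assignments would yield one entry of the kind reported in table 4.1.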

Chapter 4
Evaluation and Discussion

ARI_HA is calculated for all pairs of cluster sets and shown in table 4.1.

             Features   Era     Emotion   Genre    Instrument   Origin
Features     1          .145    .66       .44      .289         .139
Era          .145       1       .399      .823     .11          .315
Emotion      .66        .399    1         .1267    .39          .961
Genre        .44        .823    .1267     1        .833         .843
Instrument   .289       .11     .39       .833     1            .418
Origin       .139       .315    .961      .843     .418         1

Table 4.1: ARI_HA - Cluster validation calculated with the Hubert-Arabie adjusted Rand index

It is observed from table 4.1 that the cluster validation between any pair of cluster sets is overall very low, with the highest correlation between emotion and genre at 12.67 % and the lowest between origin and era at 3.15 %. Although all the validation values are too low to establish a relationship between any pair of audio labels, it is still interesting to observe that emotion and genre are the most correlated among them, indicating that there are common emotion annotations for certain genres. In order to observe the relationship between emotion and genre more closely, the number of intersections between each term from both label classes is calculated, and the maximum intersection for each term is shown in tables 4.2 and 4.3.
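The computation behind tables 4.2 and 4.3 amounts to a row-wise maximum over a genre-by-emotion count matrix. The sketch below uses random stand-in assignment vectors (genreAssign and emotionAssign are hypothetical names) purely to illustrate the procedure.

```matlab
% Sketch: the most co-occurring emotion cluster for each genre cluster
% (the computation behind table 4.2). Assignment vectors are random stand-ins.
rng(0);
nSongs        = 41269;
genreAssign   = randi(44, nSongs, 1);      % genre cluster index per song (1..44)
emotionAssign = randi(34, nSongs, 1);      % emotion cluster index per song (1..34)

counts = accumarray([genreAssign, emotionAssign], 1, [44 34]);   % intersection counts
[maxInter, bestEmotion] = max(counts, [], 2);                    % row-wise maximum

% maxInter(g) is the largest intersection for genre cluster g and bestEmotion(g)
% the index of the matching emotion cluster; mapping the indices back to term
% names yields the layout of table 4.2, and swapping the roles of the two label
% classes yields table 4.3.
```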

Genre          Intersection   Emotion
alternative    159            beautiful
ambient        92             chill
ballade        161            beautiful
blues          24             energetic
british        119            beautiful
christian      214            inspirational
classic        392            romantic
country        9              romantic
dance          74             chill
dub            99             chill
electronic     59             chill
eurodance      52             uplifting
hard style     94             gore
hip hop        32             chill
instrumental   156            beautiful
industrial     44             romantic
indie          2167           chill
lounge         58             beautiful
modern         7              chill
neo            169            chill
new            2              chill
noise          121            beautiful
nu             44             chill
old            121            beautiful
orchestra      117            beautiful
opera          116            romantic
post           164            chill
power          16             melancholia
progressive    827            chill
r&b            12             chill
rag            182            chill
soundtrack     142            chill
salsa          173            chill
smooth         1584           chill
soft           536            mellow
swing          6              mellow
synth pop      132            melancholia
techno         96             happy
thrash         134            peace
tribal         99             brutal
urban          154            beautiful
waltz          49             romantic
wave           3448           romantic
zouk           1              beautiful

Table 4.2: Term Co-occurrence - The most common emotion term for each genre term

It is observed that, because of the disproportionate distribution among emotion terms, most genre labels share the same emotion terms, such as "beautiful", "chill", and "romantic". On the other hand, as the distribution of genre terms is flatter, the emotion terms are matched with many different genre terms. However, note that the co-occurrence between an emotion label and a genre label does not correlate well with intuition, as can be observed from table 4.3 (e.g. beautiful & indie, happy & hip hop, uplifting & progressive), which is indicative of the low cluster validation rate.

Emotion         Intersection   Genre
beautiful       883            indie
brutal          99             tribal
calming         118            synth pop
chill           32             hip hop
energetic       276            wave
evil            72             indie
gore            94             hard style
grim            659            hip hop
happy           1472           hip hop
harsh           28             noise
haunting        11             electronic
horror          37             wave
humorous        96             salsa
hypnotic        93             smooth
intense         7              rag
inspirational   214            christian
light           118            soft
loud            119            christian
melancholia     325            indie
mellow          536            soft
moody           88             alternative
obscure         5              new
patriotic       3              hip hop
peace           134            thrash
relax           23             smooth
romantic        3448           wave
sad             161            indie
sexy            752            hip hop
strange         76             progressive
trippy          67             progressive
uplifting       79             progressive
wicked          99             hip hop
wistful         121            classic
witty           14             progressive

Table 4.3: Term Co-occurrence - The most common genre term for each emotion term

It also indicates that people use only a limited vocabulary to describe the emotional aspect of a song, regardless of the genre of the given song. Although it seems intuitive and expected that the correlations between audio labels turn out to be low, it is quite surprising that the cluster validations between the audio features and each label are also low. In order to understand why this is the case, a number of post-processing steps are proposed.

4.1 K vs. ARI_HA

In section 3.4.4, the number of clusters, K, was chosen based on the elbow method. This K does not necessarily generate optimal validation rates, and therefore a K vs. ARI_HA plot is drawn to find the K that maximizes the validation rates for each set of clusters. Figure 4.1 shows the pattern of ARI_HA for each label class as K changes.

It turns out that the sum of ARI_HA over the label classes is maximum when K = 5, which is therefore used as the number of feature clusters in the following analysis.

Figure 4.1: K vs. ARI_HA - ARI_HA is maximum when K = 5

4.2 Hubert-Arabie adjusted Rand Index (revisited)

Using the result from the previous section (K = 5), ARI_HA is re-calculated for each label class and shown in table 4.4.

                      Era     Emotion   Genre    Instrument   Origin
Features (original)   .145    .66       .44      .289         .139
Features (K = 5)      .228    .69       .439     .42          .16

Table 4.4: Optimal Cluster Validation - Optimal ARI_HA calculated for each label class
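Section 4.1's sweep can be sketched by re-running k-means for each candidate K and scoring the resulting partition against every label clustering with the ari_ha function sketched in section 3.5. All inputs below are random stand-ins; with the real feature matrix and label assignment vectors this procedure would correspond to the curves of figure 4.1.

```matlab
% Sketch: ARI_HA between the feature clustering and each label class for a
% range of K (the sweep behind figure 4.1). ari_ha is the function sketched
% in section 3.5; every input here is a random stand-in.
rng(0);
Fs        = randn(2000, 30);                               % stand-in scaled feature matrix
labelSets = {randi(7, 2000, 1),  randi(34, 2000, 1), ...   % era, emotion
             randi(44, 2000, 1), randi(7, 2000, 1),  ...   % genre, instrument
             randi(33, 2000, 1)};                          % origin

Ks  = 5:5:50;
ari = zeros(numel(Ks), numel(labelSets));
for i = 1:numel(Ks)
    featAssign = kmeans(Fs, Ks(i), 'Replicates', 3);
    for j = 1:numel(labelSets)
        ari(i, j) = ari_ha(featAssign, labelSets{j});
    end
end

[~, best] = max(sum(ari, 2));                              % K maximizing the summed ARI_HA
fprintf('K maximizing the summed ARI_HA: %d\n', Ks(best));
```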

4.3 Cluster Structure Analysis

Now that the optimal K and ARI_HA values are found, the reason for such low cluster validation rates needs to be discussed. In order to do so, the structure of the clusters is examined by calculating the Euclidean distance between the centroids of the clusters. Table 4.5 shows these distances. Note that the centroids of clusters 1 and 2 have the minimum distance while those of clusters 3 and 4 have the maximum distance, indicating the most similar and the most dissimilar clusters, respectively.

Cluster   1        2        3        4        5
1         -        .86      .1638    .142     .161
2         .86      -        .1975    .963     .186
3         .1638    .1975    -        .231     .1368
4         .142     .963     .231     -        .181
5         .161     .186     .1368    .181     -

Table 4.5: Self-similarity matrix - The distances between each pair of clusters
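A table like 4.5 can be derived directly from the centroid matrix returned by kmeans. The sketch below uses a stand-in feature matrix and the K = 5 clustering assumed above; it is illustrative only.

```matlab
% Sketch: pairwise Euclidean distances between the K = 5 feature-cluster
% centroids (the self-similarity matrix of table 4.5). Fs is a stand-in here.
rng(0);
Fs = randn(2000, 30);                       % stand-in for the scaled feature matrix
[~, centroids] = kmeans(Fs, 5, 'Replicates', 10);

D = squareform(pdist(centroids));           % 5 x 5 matrix of centroid distances
[~, nearest] = min(D + diag(inf(1, 5)));    % nearest neighbour of each cluster
disp(D);
disp(nearest);
```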

4.3.1 Neighboring Clusters vs. Distant Clusters

In order to observe the detailed structure of the clusters, the co-occurrence between feature clusters and label clusters is calculated and the four most co-occurring clusters are returned. In other words, for each feature cluster 1 through 5, the four most intersecting clusters from each label class are calculated and shown in figures 4.2-4.6. Note that, due to the uneven distribution of songs within each label class, the cluster that contains the largest number of songs, such as "80s" in the era label, "chill" in emotion, "hip hop" in genre, "synth" in instrument, and "dc" in origin, appears frequently across all five feature clusters. In fact, the "80s" and "chill" clusters appear as the most co-occurring clusters for all five feature clusters.

Figure 4.2: Co-occurrence between feature clusters and era clusters - The four most co-occurring era clusters are returned for each feature cluster

Figure 4.3: Co-occurrence between feature clusters and emotion clusters - The four most co-occurring emotion clusters are returned for each feature cluster

Figure 4.4: Co-occurrence between feature clusters and genre clusters - The four most co-occurring genre clusters are returned for each feature cluster

Figure 4.5: Co-occurrence between feature clusters and instrument clusters - The four most co-occurring instrument clusters are returned for each feature cluster

Figure 4.6: Co-occurrence between feature clusters and origin clusters - The four most co-occurring origin clusters are returned for each feature cluster

Knowing that the distance between clusters 1 and 2 is minimum and the distance between clusters 3 and 4 is maximum, it can also be observed from figures 4.2-4.6 that the co-occurring terms within clusters 1 and 2 are similar, while those within clusters 3 and 4 are quite dissimilar, as shown in tables 4.6 and 4.7, indicating that neighboring feature clusters share similar label clusters while distant feature clusters do not.

Cluster 1                               Cluster 2
(80s, 90s, 00s, 60s)                    (80s, 90s, 00s, 70s)
(chill, beautiful, romantic, happy)     (chill, romantic, beautiful, happy)
(hip hop, wave, smooth, indie)          (wave, indie, hip hop, soft)
(synth, guitar, drum, piano)            (synth, guitar, piano, drum)
(dc, american, german, british)         (dc, british, roma, german)

Table 4.6: Neighboring Clusters - Clusters with minimum Euclidean distance share similar label clusters

Cluster 3                               Cluster 4
(80s, 20th century, 60s, 90s)           (80s, 90s, 00s, 70s)
(chill, romantic, beautiful, mellow)    (chill, happy, romantic, sexy)
(soundtrack, smooth, classic, indie)    (hip hop, wave, techno, progressive)
(piano, guitar, synth, drum)            (synth, drum, guitar, bass)
(american, roma, german, los angeles)   (dc, british, german, roma)

Table 4.7: Distant Clusters - Clusters with maximum Euclidean distance have dissimilar label clusters

4.3.2 Correlated Terms vs. Uncorrelated Terms

Considering the opposite case, the author selected the four largest clusters from each label class and calculated their co-occurrence with every feature cluster, as shown in figures 4.7-4.11. In order to observe whether highly correlated label clusters can also be characterized by feature clusters, table 4.8 summarizes the most correlated terms for the four largest clusters of each label class, whereas table 4.9 shows the least correlated terms for the same clusters. Using the histograms from figures 4.7-4.11, a 5-dimensional vector can be created for each term by finding the ratio of each feature cluster (e.g. the vector for the "80s" term is (Cluster 1, Cluster 2, Cluster 3, Cluster 4, Cluster 5) = (.71, 1, .196, .727, .61)). Using the same method, a total of 41 vectors are retrieved, one for every term in tables 4.8 and 4.9, and shown in table 4.10. Using the relationships from tables 4.8 and 4.9 and the vectors in table 4.10, the Euclidean distance between each pair of vectors is calculated and shown in tables 4.11 and 4.12. As the average distances indicate, highly correlated terms share a similar combination of feature clusters, whereas weakly correlated terms do not.
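A sketch of how such term vectors and their distances can be computed is given below. The counts are hypothetical, and the normalization by each row's maximum is an assumption inferred from the example vector in the text (whose largest entry is 1).

```matlab
% Sketch: 5-dimensional term vectors (co-occurrence counts with the five
% feature clusters, scaled by each row's maximum) and their pairwise
% Euclidean distances, as used for tables 4.10-4.12. Counts are hypothetical.
counts = [ 710 1000 196 727 610 ;     % e.g. a term like "80s"
           640  980 150 700 580 ;     % a term with a similar profile
            90  120 800 110 130 ];    % a term with a dissimilar profile

vectors = counts ./ max(counts, [], 2);   % ratio of each feature cluster, per term
dists   = squareform(pdist(vectors));     % Euclidean distance between term vectors
disp(dists);                              % small distance => similar feature-cluster profile
```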