Release Year Prediction for Songs

Similar documents
Music Genre Classification and Variance Comparison on Number of Genres

Music Genre Classification

The Million Song Dataset

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?

Singer Traits Identification using Deep Neural Network

Automatic Music Genre Classification

WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs

Singer Recognition and Modeling Singer Error

PICK THE RIGHT TEAM AND MAKE A BLOCKBUSTER A SOCIAL ANALYSIS THROUGH MOVIE HISTORY

A combination of approaches to solve Task How Many Ratings? of the KDD CUP 2007

NETFLIX MOVIE RATING ANALYSIS

More About Regression

Chapter 27. Inferences for Regression. Remembering Regression. An Example: Body Fat and Waist Size. Remembering Regression (cont.)

Detecting Musical Key with Supervised Learning

Perceptual dimensions of short audio clips and corresponding timbre features

Lyrics Classification using Naive Bayes

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models

Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes

Outline. Why do we classify? Audio Classification

Automatic Music Clustering using Audio Attributes

MUSI-6201 Computational Music Analysis

What is Statistics? 13.1 What is Statistics? Statistics

Music Recommendation from Song Sets

Improving Frame Based Automatic Laughter Detection

STAT 113: Statistics and Society Ellen Gundlach, Purdue University. (Chapters refer to Moore and Notz, Statistics: Concepts and Controversies, 8e)

Supervised Learning in Genre Classification

DOES MOVIE SOUNDTRACK MATTER? THE ROLE OF SOUNDTRACK IN PREDICTING MOVIE REVENUE

Recognising Cello Performers Using Timbre Models

ABSOLUTE OR RELATIVE? A NEW APPROACH TO BUILDING FEATURE VECTORS FOR EMOTION TRACKING IN MUSIC

Feature-Based Analysis of Haydn String Quartets

Using Genre Classification to Make Content-based Music Recommendations

Libraries as Repositories of Popular Culture: Is Popular Culture Still Forgotten?

Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video

CS229 Project Report Polyphonic Piano Transcription

Creating a Feature Vector to Identify Similarity between MIDI Files

The Great Beauty: Public Subsidies in the Italian Movie Industry

APPLICATION OF MULTI-GENERATIONAL MODELS IN LCD TV DIFFUSIONS

NAA ENHANCING THE QUALITY OF MARKING PROJECT: THE EFFECT OF SAMPLE SIZE ON INCREASED PRECISION IN DETECTING ERRANT MARKING

Chapter 5. Describing Distributions Numerically. Finding the Center: The Median. Spread: Home on the Range. Finding the Center: The Median (cont.

Can Song Lyrics Predict Genre? Danny Diekroeger Stanford University

Music Similarity and Cover Song Identification: The Case of Jazz

EVALUATING THE IMPACT FACTOR: A CITATION STUDY FOR INFORMATION TECHNOLOGY JOURNALS

Effects of acoustic degradations on cover song recognition

MATH 214 (NOTES) Math 214 Al Nosedal. Department of Mathematics Indiana University of Pennsylvania. MATH 214 (NOTES) p. 1/3

Resampling Statistics. Conventional Statistics. Resampling Statistics

Neural Network for Music Instrument Identification

Recognising Cello Performers using Timbre Models

UC San Diego UC San Diego Previously Published Works

Automatic Laughter Detection

Music Mood Classification Using The Million Song Dataset

Relationships. Between Quantitative Variables. Chapter 5. Copyright 2006 Brooks/Cole, a division of Thomson Learning, Inc.

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION

Normalization Methods for Two-Color Microarray Data

hprints, version 1 - 1 Oct 2008

Modeling memory for melodies

Open Access Determinants and the Effect on Article Performance

Automatic Laughter Detection

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM

Chord Classification of an Audio Signal using Artificial Neural Network

Mixed models in R using the lme4 package Part 2: Longitudinal data, modeling interactions

Automatic Music Similarity Assessment and Recommendation. A Thesis. Submitted to the Faculty. Drexel University. Donald Shaul Williamson

A Pattern Recognition Approach for Melody Track Selection in MIDI Files

Discriminant Analysis. DFs

Music Composition with RNN

Analysis of Film Revenues: Saturated and Limited Films Megan Gold

Setting Energy Efficiency Requirements Using Multivariate Regression

Visual Encoding Design

ECONOMICS 351* -- INTRODUCTORY ECONOMETRICS. Queen's University Department of Economics. ECONOMICS 351* -- Winter Term 2005 INTRODUCTORY ECONOMETRICS

Statistical Consulting Topics. RCBD with a covariate

Supplemental Material: Color Compatibility From Large Datasets

Composer Style Attribution

Topics in Computer Music Instrument Identification. Ioanna Karydi

MusCat: A Music Browser Featuring Abstract Pictures and Zooming User Interface

Algebra I Module 2 Lessons 1 19

For these items, -1=opposed to my values, 0= neutral and 7=of supreme importance.

Detecting Medicaid Data Anomalies Using Data Mining Techniques Shenjun Zhu, Qiling Shi, Aran Canes, AdvanceMed Corporation, Nashville, TN

ECE438 - Laboratory 1: Discrete and Continuous-Time Signals

GENDER IDENTIFICATION AND AGE ESTIMATION OF USERS BASED ON MUSIC METADATA

STI 2018 Conference Proceedings

Hidden Markov Model based dance recognition

A Categorical Approach for Recognizing Emotional Effects of Music

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION

EXPLORING THE USE OF ENF FOR MULTIMEDIA SYNCHRONIZATION

Music Information Retrieval

Lecture 15: Research at LabROSA

arxiv: v1 [cs.dl] 9 May 2017

Relationships Between Quantitative Variables

Draft December 15, Rock and Roll Bands, (In)complete Contracts and Creativity. Cédric Ceulemans, Victor Ginsburgh and Patrick Legros 1

STAT 503 Case Study: Supervised classification of music clips

NEXTONE PLAYER: A MUSIC RECOMMENDATION SYSTEM BASED ON USER BEHAVIOR

Evaluation of video quality metrics on transmission distortions in H.264 coded video

Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility

A Study of Predict Sales Based on Random Forest Classification

Automatic Rhythmic Notation from Single Voice Audio Sources

Lessons from the Netflix Prize: Going beyond the algorithms

Cluster Analysis of Internet Users Based on Hourly Traffic Utilization

Multiple-point simulation of multiple categories Part 1. Testing against multiple truncation of a Gaussian field

Frequencies. Chapter 2. Descriptive statistics and charts

Restoration of Hyperspectral Push-Broom Scanner Data

The Effect of DJs Social Network on Music Popularity


Release Year Prediction for Songs [CSE 258 Assignment 2]

Ruyu Tan, University of California San Diego, PID: A53099216, rut003@ucsd.edu
Jiaying Liu, University of California San Diego, PID: A53107720, jil672@ucsd.edu

ABSTRACT

In this assignment, we study a subset of the Million Song Dataset from the UCI Machine Learning Repository in order to obtain a model that predicts a song's release year from two groups of features: Timbre Average and Timbre Covariance. We first perform exploratory analysis on the data set, its label and its features, and then apply Linear Regression, Ridge Regression, LASSO Regression and Random Forest models to the prediction task. Mean Absolute Error (MAE) is chosen to measure model accuracy. Our results indicate that the Linear Regression model performs best, with an MAE of 6.80.

Keywords

Release year; Songs; Linear regression; Ridge regression; Lasso regression; Random forest; Mean absolute error

1. INTRODUCTION

The Million Song Dataset is a well-known, freely available collection of audio features and metadata for a million contemporary popular music tracks; its documentation also describes its creation process and content. Among its many attractive features, we focus on the Timbre Average and Timbre Covariance features and predict the release year from them, since such a model may have practical applications in music recommendation. We define year prediction as estimating the year in which a song was released based on its audio features (although metadata features such as artist name or similar-artist tags would certainly also be informative). Listeners often have particular affection for music from certain periods of their lives (such as high school or college), so predicting the release year could be useful. Moreover, a successful model of how music audio characteristics vary through the years could shed light on the long-term evolution of popular music.
Addressing release year prediction is admittedly hard, since it requires a large music collection spanning both a wide range of genres (at least within western pop) and a long period of time.

2. DATA SET DESCRIPTION AND ANALYSIS

The data set used for this project is the YearPredictionMSD data set from the UCI Machine Learning Repository [1] at https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD, a subset of the Million Song Dataset [2]. The songs are mostly western, commercial tracks ranging from 1922 to 2011, with a peak in the 2000s.

Before building our release year prediction model, we take to heart the wise maxim: "Essentially, all models are wrong, but some are useful." This reminds us to first learn more about the data set we are studying, so that we can develop a more useful, though not necessarily correct, model.

2.1 Data Set Description

The data set consists of 515,345 entries in total and is split into a train set and a test set.¹ The train set contains 463,715 entries while the test set contains 51,630 entries. The details are shown in Table 1.

Table 1: data set basic information
  Number of entries in total data set: 515,345
  Number of entries in train data set: 463,715
  Number of entries in test data set:   51,630

Each data entry consists of 91 attributes: the release year and MFCC²-like features represented as a numerical vector, as shown in Table 2.

Table 2: data entry description
  Index  | Description
  0      | Release Year
  1-12   | Timbre Average
  13-90  | Timbre Covariance

2.2 Exploratory Analysis

Having seen the content of this data set, we next gather basic statistics to understand it further, and then carry out an exploratory analysis of the label and the features.
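As a concrete sketch of the split in Section 2.1, the code below applies the documented fixed row split (first 463,715 rows train, remainder test). The file name and the `np.loadtxt` call are assumptions about the UCI distribution; a small synthetic matrix stands in for the real file so the snippet runs on its own.

```python
import numpy as np

# Per the UCI page, the first 463,715 rows form the train set and the
# remaining 51,630 rows form the test set (the split is chosen so that
# no artist appears in both sets).
N_TRAIN = 463715

def split_train_test(data, n_train=N_TRAIN):
    """Split the full 515,345 x 91 matrix into train/test by row index."""
    return data[:n_train], data[n_train:]

# With the real file one would load (hypothetical local path):
#   data = np.loadtxt("YearPredictionMSD.txt", delimiter=",")
# A small random matrix stands in here so the snippet is self-contained.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 91))

train, test = split_train_test(data, n_train=900)
X_train, y_train = train[:, 1:], train[:, 0]  # column 0 is the release year
print(X_train.shape, test.shape)
```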
2.2.1 Label: Release Year

Release Year is the target variable. In the train data set it ranges from 1922 to 2011, with mean 1998.4, median 2002.0, mode 2007.0 (the most frequent release year) and standard deviation 10.94. These basic statistics are summarized in Table 3.

Table 3: Statistics of Release Year
  min                 1922.0
  max                 2011.0
  mean                1998.4
  median              2002.0
  mode                2007.0
  standard deviation  10.94

The histogram of release year (Figure 1) shows a peak around the 2000s: the count increases gradually before the 2000s and then falls rapidly.

[Figure 1: Histogram of Release Year]

¹ The split strategy avoids the "producer effect" by making sure no song from a given artist ends up in both the train and test sets.
² MFCC is the abbreviation for Mel Frequency Cepstral Coefficient.

2.2.2 Features: Timbre Average and Timbre Covariance

Each data entry contains 90 features: the first 12 are the timbre averages and the remaining 78 are the timbre covariances ($\binom{12}{2} + 12 = 78$). Sample features are shown in Table 4.

The Timbre Average features can be studied with violin plots (Figure 2).

[Figure 2: violin plot of timbre average]

Furthermore, we can calculate the covariance matrix of the standardized first 12 attributes, shown in Table 5, which indicates that the 12 Timbre Average features are not strongly correlated.

To visualize the Timbre Average features, we applied principal component analysis to the first 12 attributes of the train data set. The percentages of variance explained by the first and second principal components are 50.22% and 23.38%, respectively. Figure 3 is the scatter plot of the first versus the second principal component of the Timbre Average features.

[Figure 3: 1st principal component versus 2nd principal component of timbre average]

2.2.3 Label versus Features

To visualize the relationship between the release year and the first two principal components, we draw the scatter plot of Release Year versus the 1st and 2nd principal components of the Timbre Average features, shown as Figure 4.
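The PCA step described above can be sketched with sklearn as follows. This is a minimal sketch on synthetic stand-in data (the real timbre columns are not reproduced here), standardizing the 12 timbre-average columns and projecting onto the first two principal components, as for Figures 3 and 4.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 12 timbre-average columns (columns 1-12 of the
# real data): a low-rank latent structure plus noise, so PCA has something
# meaningful to find.
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 3))
timbre_avg = latent @ rng.normal(size=(3, 12)) + 0.1 * rng.normal(size=(500, 12))

# Standardize, then project onto the first two principal components.
scaled = StandardScaler().fit_transform(timbre_avg)
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

print(scores.shape)  # (500, 2) -> x/y coordinates for the scatter plot
print(pca.explained_variance_ratio_)  # fraction of variance per component
```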

Table 4: sample features in data set
  Timbre Average                        | Timbre Covariance
  X1        X2        ...  X12         | X13       X14         ...  X90
  49.94357  21.47114  ...  -2.31521    | 10.20556  611.10913   ...  2.26327
  48.73215  18.4293   ...  0.34006     | 44.38997  2056.93836  ...  26.92061
  50.95714  31.85602  ...  0.82804     | 7.46586   699.54544   ...  -0.66345
  ...
  50.32201  6.71191   ...  10.66774    | 14.39176  357.67468   ...  0.05278

Table 5: covariance of standardized timbre average
       X1     X2     X3     X4     X5     X6     X7     X8     X9     X10    X11    X12
  X1   1.00   0.56   0.25   0.02  -0.29  -0.27   0.17  -0.06   0.22   0.10   0.06  -0.13
  X2   0.56   1.00   0.02   0.13  -0.19  -0.11   0.12   0.11   0.13   0.37  -0.09  -0.06
  X3   0.25   0.02   1.00   0.15  -0.13   0.04  -0.06   0.08   0.04  -0.09   0.04  -0.02
  X4   0.02   0.13   0.15   1.00   0.03   0.32   0.28   0.03  -0.04   0.17   0.31  -0.10
  X5  -0.29  -0.19  -0.13   0.03   1.00   0.02  -0.11  -0.01  -0.22  -0.10   0.02   0.03
  X6  -0.27  -0.11   0.04   0.32   0.02   1.00  -0.25   0.01  -0.04  -0.04  -0.33   0.15
  X7   0.17   0.12  -0.06   0.28  -0.11  -0.25   1.00   0.17  -0.10   0.01  -0.11  -0.18
  X8  -0.06   0.11   0.08   0.03  -0.01   0.01   0.17   1.00  -0.18   0.42  -0.24  -0.05
  X9   0.22   0.13   0.04  -0.04  -0.22  -0.04  -0.10  -0.18   1.00   0.34  -0.06   0.09
  X10  0.10   0.37  -0.09   0.17  -0.10  -0.04   0.01   0.42   0.34   1.00  -0.16  -0.08
  X11  0.06  -0.09   0.04   0.31   0.02  -0.33  -0.11  -0.24  -0.06  -0.16   1.00   0.21
  X12 -0.13  -0.06  -0.02  -0.10   0.03   0.15  -0.18  -0.05   0.09  -0.08   0.21   1.00

[Figure 4: release year versus 1st and 2nd principal components of timbre average]

3. PREDICTIVE TASK IDENTIFICATION

The predictive task is to predict the release year from the timbre average and timbre covariance features, applying several candidate models to the data set. The criterion we use to measure model accuracy is the Mean Absolute Error (MAE); the model with the smallest MAE is considered the best.

4. MODEL SELECTION

Based on the description above, we aim to predict the release year of the songs in the test set. We use five models for this predictive task.

4.1 Baseline Model

In the baseline model, we simply use the average release year of the train data set as the prediction for every song in the test data set:
$$\hat{y} = \bar{y}$$

4.2 Linear Regression Model

Linear regression is an approach for modeling the relationship between a scalar dependent variable $y$ and one or more explanatory (independent) variables $X$. A simple linear model is $y = X\beta$, with coefficients obtained by solving

$$\min_\beta \|X\beta - y\|_2^2$$

4.3 Ridge Regression Model

Ridge regression penalizes the size of the regression coefficients of the linear model. The ridge estimator solves

$$\arg\min_\beta \|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2$$

with the closed-form solution

$$\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$

4.4 LASSO

LASSO (least absolute shrinkage and selection operator) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. It solves

$$\min_\beta \frac{1}{2N} \|X\beta - y\|_2^2 + \lambda \|\beta\|_1$$

4.5 Random Forest Model

Random forest is a meta-estimator that fits a number of decision trees on various sub-samples of the data set, which improves predictive accuracy and controls over-fitting; random forests correct for decision trees' habit of overfitting to their training set. Here we use the RandomForestRegressor provided by the sklearn package in Python.

5. LITERATURE AND RESEARCH

The UCSD Data Science Student Society (http://ds3-at-ucsd.github.io/msd-fp-p1/) undertook an exploratory analysis and built a year prediction model on the Million Song Dataset [2]. They explained the peak song count at year 2007 as follows:

"We see from the distribution of the number of songs that over 50% of the songs in the dataset are from the 2000-2010 year increment. From looking further into the advancements in technology over time, we observed that the increase in the development of technology used to play and share music, such as mp3 players, iPods and iPhones, during this time period can explain this trend."

In their feature analysis, the correlations between features from both the extended year data and the UCI subset were calculated and visualized with a heatmap, from which they concluded that the most highly correlated feature pairs are: hotness vs. familiarity, loudness vs. familiarity, artist tag length vs. artist familiarity, artist tag length vs. artist hotness, pitch averages for the 12 segments, and timbre averages for the 12 segments. Finally, Ridge and LASSO regression were used to predict release year with the Mean Square Error criterion.

Matthew Moocarme³ also studied the subset of the Million Song Dataset from the UCI Machine Learning Repository. The innovation of his work was to use Spark to predict the year of a song's release. Using the Root Mean Square Error (RMSE) to measure prediction accuracy, he concluded that there is not much correlation between the features, and that the linear regression model is quite good, since its RMSE beats the baseline by almost 7 years. Besides, song year prediction using Apache Spark [4] includes similar studies, and the UCI subset has been mentioned in several books [3].

³ http://www.mattmoocar.me/blog/spark-song-year/

6. RESULT ANALYSIS

Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are two common measures of model accuracy. Mean Absolute Error is defined as

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

and Root Mean Square Error is defined as

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$

The criterion metric for this predictive task is MAE. We build linear regression, ridge regression, LASSO regression and random forest models on the train data set and apply them to predict the release year on the test data set. For ridge and LASSO regression, we vary λ from 0.01 to 100.0; the MAE versus λ curves are shown in Figure 5, from which we find that MAE increases as λ increases. This result indicates that there is not much collinearity among the 90 timbre average and timbre covariance attributes.

[Figure 5: mean absolute error versus lambda for ridge and lasso regression models]

Next, we gather all the models and their MAEs together in Table 6. In terms of MAE, the baseline model is the worst, with an MAE of 8.113, and the simple linear regression model is the best, with an MAE of 6.800. The baseline model is the worst because it simply assigns the average release year of the train data set to every song in the test data set. The reason the linear regression model beats the others might be that it uses all of the timbre average and timbre covariance information for release year prediction.
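The workflow behind Table 6 can be sketched on synthetic data as follows. The model classes and MAE metric come from sklearn (which the report says it uses); the data here are random stand-ins, so the printed MAE values are illustrative only, and sklearn's `alpha` parameter plays the role of λ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Random stand-ins for the 90 timbre features and a linear release-year label.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(400, 90)), rng.normal(size=(100, 90))
coef = rng.normal(size=90)
y_train = 1998 + X_train @ coef + rng.normal(size=400)
y_test = 1998 + X_test @ coef + rng.normal(size=100)

# Baseline: predict the mean release year of the train set for every song.
baseline_pred = np.full_like(y_test, y_train.mean())
results = {"Baseline": mean_absolute_error(y_test, baseline_pred)}

# sklearn's alpha corresponds to the report's lambda.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge (lambda=1.0)": Ridge(alpha=1.0),
    "Lasso (lambda=1.0)": Lasso(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, mae in results.items():
    print(f"{name}: MAE = {mae:.3f}")
```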
Table 6: mean absolute errors of models
  Model                           MAE
  Baseline Model                  8.11306989198
  Linear Regression               6.80049646319
  Ridge Regression  (λ = 0.01)    6.8004964635
                    (λ = 0.1)     6.80049646632
                    (λ = 1.0)     6.80049649453
                    (λ = 10.0)    6.80049677659
                    (λ = 100.0)   6.80049959719
  Lasso Regression  (λ = 0.01)    6.80056083884
                    (λ = 0.1)     6.80193746089
                    (λ = 1.0)     6.83034695799
                    (λ = 10.0)    7.37446311611
                    (λ = 100.0)   7.85082878353
  Random Forest                   6.97150655304

Looking more deeply into the linear regression model, we examine the absolute difference between the estimated release

year and the actual release year, |ŷ − y|, in order to assess how well our predictions work. The results are grouped into five buckets: < 1 year, 1-3 years, 3-5 years, 5-10 years and > 10 years. The best predictions (< 1 year) account for 10.6% of the results, while the worst bucket (> 10 years) accounts for 20.0%. The largest bucket is 5-10 years, at 29.6%. The details are shown as a pie chart (Figure 6). Compared to the baseline model, the simple linear regression model improves the MAE from 8.113 to 6.800.

[Figure 6: pie chart of prediction error for linear regression model]

7. REFERENCES

[1] T. Bertin-Mahieux. Million Song Dataset. http://labrosa.ee.columbia.edu/millionsong/.
[2] M. Lichman. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2013.
[3] W. W. Piegorsch. Statistical Data Analytics: Foundations for Data Mining, Informatics, and Knowledge Discovery, page 364. John Wiley & Sons, illustrated edition, 2015.
[4] P. Mishra, R. Garg, and A. K. Song year prediction using Apache Spark. IEEE, 21-24 Sept. 2016.