Release Year Prediction for Songs

Similar documents
Music Genre Classification and Variance Comparison on Number of Genres

Music Genre Classification

The Million Song Dataset

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?

Singer Traits Identification using Deep Neural Network

Automatic Music Genre Classification

WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs

Singer Recognition and Modeling Singer Error

PICK THE RIGHT TEAM AND MAKE A BLOCKBUSTER A SOCIAL ANALYSIS THROUGH MOVIE HISTORY

A combination of approaches to solve Task How Many Ratings? of the KDD CUP 2007

NETFLIX MOVIE RATING ANALYSIS

More About Regression

Chapter 27. Inferences for Regression. Remembering Regression. An Example: Body Fat and Waist Size. Remembering Regression (cont.)

Detecting Musical Key with Supervised Learning

Perceptual dimensions of short audio clips and corresponding timbre features

Lyrics Classification using Naive Bayes

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models

Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes

Outline. Why do we classify? Audio Classification

Automatic Music Clustering using Audio Attributes

MUSI-6201 Computational Music Analysis

What is Statistics? 13.1 What is Statistics? Statistics

Music Recommendation from Song Sets

Improving Frame Based Automatic Laughter Detection

STAT 113: Statistics and Society Ellen Gundlach, Purdue University. (Chapters refer to Moore and Notz, Statistics: Concepts and Controversies, 8e)

Supervised Learning in Genre Classification

DOES MOVIE SOUNDTRACK MATTER? THE ROLE OF SOUNDTRACK IN PREDICTING MOVIE REVENUE

Recognising Cello Performers Using Timbre Models

ABSOLUTE OR RELATIVE? A NEW APPROACH TO BUILDING FEATURE VECTORS FOR EMOTION TRACKING IN MUSIC

Feature-Based Analysis of Haydn String Quartets

Using Genre Classification to Make Content-based Music Recommendations

Libraries as Repositories of Popular Culture: Is Popular Culture Still Forgotten?

Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video

CS229 Project Report Polyphonic Piano Transcription

Creating a Feature Vector to Identify Similarity between MIDI Files

The Great Beauty: Public Subsidies in the Italian Movie Industry

APPLICATION OF MULTI-GENERATIONAL MODELS IN LCD TV DIFFUSIONS

NAA ENHANCING THE QUALITY OF MARKING PROJECT: THE EFFECT OF SAMPLE SIZE ON INCREASED PRECISION IN DETECTING ERRANT MARKING

Chapter 5. Describing Distributions Numerically. Finding the Center: The Median. Spread: Home on the Range. Finding the Center: The Median (cont.

Can Song Lyrics Predict Genre? Danny Diekroeger Stanford University

Music Similarity and Cover Song Identification: The Case of Jazz

EVALUATING THE IMPACT FACTOR: A CITATION STUDY FOR INFORMATION TECHNOLOGY JOURNALS

Effects of acoustic degradations on cover song recognition

MATH 214 (NOTES) Math 214 Al Nosedal. Department of Mathematics Indiana University of Pennsylvania. MATH 214 (NOTES) p. 1/3

Resampling Statistics. Conventional Statistics. Resampling Statistics

Neural Network for Music Instrument Identification

Recognising Cello Performers using Timbre Models

UC San Diego UC San Diego Previously Published Works

Automatic Laughter Detection

Music Mood Classification Using The Million Song Dataset

Relationships. Between Quantitative Variables. Chapter 5. Copyright 2006 Brooks/Cole, a division of Thomson Learning, Inc.

INTER GENRE SIMILARITY MODELLING FOR AUTOMATIC MUSIC GENRE CLASSIFICATION

Normalization Methods for Two-Color Microarray Data

hprints, version 1 - 1 Oct 2008

Modeling memory for melodies

Open Access Determinants and the Effect on Article Performance

Automatic Laughter Detection

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM

Chord Classification of an Audio Signal using Artificial Neural Network

Mixed models in R using the lme4 package Part 2: Longitudinal data, modeling interactions

Automatic Music Similarity Assessment and Recommendation. A Thesis. Submitted to the Faculty. Drexel University. Donald Shaul Williamson

A Pattern Recognition Approach for Melody Track Selection in MIDI Files

Discriminant Analysis. DFs

Music Composition with RNN

Analysis of Film Revenues: Saturated and Limited Films Megan Gold

Setting Energy Efficiency Requirements Using Multivariate Regression

Visual Encoding Design

ECONOMICS 351* -- INTRODUCTORY ECONOMETRICS. Queen's University Department of Economics. ECONOMICS 351* -- Winter Term 2005 INTRODUCTORY ECONOMETRICS

Statistical Consulting Topics. RCBD with a covariate

Supplemental Material: Color Compatibility From Large Datasets

Composer Style Attribution

Topics in Computer Music Instrument Identification. Ioanna Karydi

MusCat: A Music Browser Featuring Abstract Pictures and Zooming User Interface

Algebra I Module 2 Lessons 1 19

For these items, -1=opposed to my values, 0= neutral and 7=of supreme importance.

Detecting Medicaid Data Anomalies Using Data Mining Techniques Shenjun Zhu, Qiling Shi, Aran Canes, AdvanceMed Corporation, Nashville, TN

ECE438 - Laboratory 1: Discrete and Continuous-Time Signals

GENDER IDENTIFICATION AND AGE ESTIMATION OF USERS BASED ON MUSIC METADATA

STI 2018 Conference Proceedings

Hidden Markov Model based dance recognition

A Categorical Approach for Recognizing Emotional Effects of Music

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION

EXPLORING THE USE OF ENF FOR MULTIMEDIA SYNCHRONIZATION

Music Information Retrieval

Lecture 15: Research at LabROSA

arxiv: v1 [cs.dl] 9 May 2017

Relationships Between Quantitative Variables

Draft December 15, Rock and Roll Bands, (In)complete Contracts and Creativity. Cédric Ceulemans, Victor Ginsburgh and Patrick Legros 1

STAT 503 Case Study: Supervised classification of music clips

NEXTONE PLAYER: A MUSIC RECOMMENDATION SYSTEM BASED ON USER BEHAVIOR

Evaluation of video quality metrics on transmission distortions in H.264 coded video

Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility

A Study of Predict Sales Based on Random Forest Classification

Automatic Rhythmic Notation from Single Voice Audio Sources

Lessons from the Netflix Prize: Going beyond the algorithms

Cluster Analysis of Internet Users Based on Hourly Traffic Utilization

Multiple-point simulation of multiple categories Part 1. Testing against multiple truncation of a Gaussian field

Frequencies. Chapter 2. Descriptive statistics and charts

Restoration of Hyperspectral Push-Broom Scanner Data

The Effect of DJs Social Network on Music Popularity


Release Year Prediction for Songs [CSE 258 Assignment 2]

Ruyu Tan, University of California San Diego, PID: A53099216, rut003@ucsd.edu
Jiaying Liu, University of California San Diego, PID: A53107720, jil672@ucsd.edu

ABSTRACT

In this assignment, we study a subset of the Million Song Dataset from the UCI Machine Learning Repository in order to obtain a model that predicts a song's release year from two groups of features: Timbre Average and Timbre Covariance. We first perform exploratory analysis on the data set, its label and its features, and then apply Linear Regression, Ridge Regression, LASSO Regression and Random Forest models to the prediction task. Mean Absolute Error (MAE) is chosen to measure model accuracy. Our results indicate that the Linear Regression model performs best, with an MAE of 6.80.

Keywords

Release year; Songs; Linear regression; Ridge regression; Lasso regression; Random forest; Mean absolute error

1. INTRODUCTION

The Million Song Dataset is a well-known, freely available collection of audio features and metadata for a million contemporary popular music tracks; its documentation also describes its creation process and content. Among its many attractive features, we focus on the Timbre Average and Timbre Covariance features and predict the release year from them, since such a model may have practical applications in music recommendation. We define year prediction as estimating the year in which a song was released based on its audio features (although metadata features such as artist name or similar-artist tags would certainly also be informative). Listeners often have particular affection for music from certain periods of their lives (such as high school or college), so predicting the release year could be useful. Moreover, a successful model of how music audio characteristics vary through the years could shed light on the long-term evolution of popular music.
Addressing release year prediction is admittedly hard, since it requires a large music collection spanning both a wide range of genres (at least within western pop) and a long period of time.

2. DATA SET DESCRIPTION AND ANALYSIS

The data set used for this project is the YearPredictionMSD data set from the UCI Machine Learning Repository [1] at https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD, a subset of the Million Song Dataset [2]. The songs are mostly western, commercial tracks ranging from 1922 to 2011, with a peak in the 2000s.

Before building our release year prediction model, we take to heart the wise maxim: "Essentially, all models are wrong, but some are useful." This reminds us to first learn more about the data set we are studying, so that we can develop a more useful, though not necessarily correct, model.

2.1 Data Set Description

The data set consists of 515,345 entries in total and is split into a train set and a test set.¹ The train set contains 463,715 entries while the test set contains 51,630 entries. The details are shown in Table 1.

Table 1: data set basic information
  Number of entries in total data set: 515,345
  Number of entries in train data set: 463,715
  Number of entries in test data set:   51,630

Each data entry consists of 91 attributes: the release year and MFCC²-like features represented as a numerical vector, as shown in Table 2.

Table 2: data entry description
  Index  | Description
  0      | Release Year
  1-12   | Timbre Average
  13-90  | Timbre Covariance

2.2 Exploratory Analysis

Having seen the content of this data set, we next gather basic statistics to understand it further, and then carry out an exploratory analysis of the label and the features.
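As a concrete sketch of the split in Section 2.1, the code below applies the documented fixed row split (first 463,715 rows train, remainder test). The file name and the `np.loadtxt` call are assumptions about the UCI distribution; a small synthetic matrix stands in for the real file so the snippet runs on its own.

```python
import numpy as np

# Per the UCI page, the first 463,715 rows form the train set and the
# remaining 51,630 rows form the test set (the split is chosen so that
# no artist appears in both sets).
N_TRAIN = 463715

def split_train_test(data, n_train=N_TRAIN):
    """Split the full 515,345 x 91 matrix into train/test by row index."""
    return data[:n_train], data[n_train:]

# With the real file one would load (hypothetical local path):
#   data = np.loadtxt("YearPredictionMSD.txt", delimiter=",")
# A small random matrix stands in here so the snippet is self-contained.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 91))

train, test = split_train_test(data, n_train=900)
X_train, y_train = train[:, 1:], train[:, 0]  # column 0 is the release year
print(X_train.shape, test.shape)
```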
2.2.1 Label: Release Year

Release Year is the target variable. In the train data set it ranges from 1922 to 2011, with mean 1998.4, median 2002.0, mode 2007.0 (the most frequent release year) and standard deviation 10.94. These basic statistics are summarized in Table 3.

Table 3: Statistics of Release Year
  min                 1922.0
  max                 2011.0
  mean                1998.4
  median              2002.0
  mode                2007.0
  standard deviation  10.94

The histogram of release year (Figure 1) shows a peak around the 2000s: the count increases gradually before the 2000s and then falls rapidly.

[Figure 1: Histogram of Release Year]

¹ The split strategy avoids the "producer effect" by making sure no song from a given artist ends up in both the train and test sets.
² MFCC is the abbreviation for Mel Frequency Cepstral Coefficient.

2.2.2 Features: Timbre Average and Timbre Covariance

Each data entry contains 90 features: the first 12 are the timbre averages and the remaining 78 are the timbre covariances ($\binom{12}{2} + 12 = 78$). Sample features are shown in Table 4.

The Timbre Average features can be studied with violin plots (Figure 2).

[Figure 2: violin plot of timbre average]

Furthermore, we can calculate the covariance matrix of the standardized first 12 attributes, shown in Table 5, which indicates that the 12 Timbre Average features are not strongly correlated.

To visualize the Timbre Average features, we applied principal component analysis to the first 12 attributes of the train data set. The percentages of variance explained by the first and second principal components are 50.22% and 23.38%, respectively. Figure 3 is the scatter plot of the first versus the second principal component of the Timbre Average features.

[Figure 3: 1st principal component versus 2nd principal component of timbre average]

2.2.3 Label versus Features

To visualize the relationship between the release year and the first two principal components, we draw the scatter plot of Release Year versus the 1st and 2nd principal components of the Timbre Average features, shown as Figure 4.
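The PCA step described above can be sketched with sklearn as follows. This is a minimal sketch on synthetic stand-in data (the real timbre columns are not reproduced here), standardizing the 12 timbre-average columns and projecting onto the first two principal components, as for Figures 3 and 4.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 12 timbre-average columns (columns 1-12 of the
# real data): a low-rank latent structure plus noise, so PCA has something
# meaningful to find.
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 3))
timbre_avg = latent @ rng.normal(size=(3, 12)) + 0.1 * rng.normal(size=(500, 12))

# Standardize, then project onto the first two principal components.
scaled = StandardScaler().fit_transform(timbre_avg)
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

print(scores.shape)  # (500, 2) -> x/y coordinates for the scatter plot
print(pca.explained_variance_ratio_)  # fraction of variance per component
```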

Table 4: sample features in data set
  Timbre Average                        | Timbre Covariance
  X1        X2        ...  X12         | X13       X14         ...  X90
  49.94357  21.47114  ...  -2.31521    | 10.20556  611.10913   ...  2.26327
  48.73215  18.4293   ...  0.34006     | 44.38997  2056.93836  ...  26.92061
  50.95714  31.85602  ...  0.82804     | 7.46586   699.54544   ...  -0.66345
  ...
  50.32201  6.71191   ...  10.66774    | 14.39176  357.67468   ...  0.05278

Table 5: covariance of standardized timbre average
       X1     X2     X3     X4     X5     X6     X7     X8     X9     X10    X11    X12
  X1   1.00   0.56   0.25   0.02  -0.29  -0.27   0.17  -0.06   0.22   0.10   0.06  -0.13
  X2   0.56   1.00   0.02   0.13  -0.19  -0.11   0.12   0.11   0.13   0.37  -0.09  -0.06
  X3   0.25   0.02   1.00   0.15  -0.13   0.04  -0.06   0.08   0.04  -0.09   0.04  -0.02
  X4   0.02   0.13   0.15   1.00   0.03   0.32   0.28   0.03  -0.04   0.17   0.31  -0.10
  X5  -0.29  -0.19  -0.13   0.03   1.00   0.02  -0.11  -0.01  -0.22  -0.10   0.02   0.03
  X6  -0.27  -0.11   0.04   0.32   0.02   1.00  -0.25   0.01  -0.04  -0.04  -0.33   0.15
  X7   0.17   0.12  -0.06   0.28  -0.11  -0.25   1.00   0.17  -0.10   0.01  -0.11  -0.18
  X8  -0.06   0.11   0.08   0.03  -0.01   0.01   0.17   1.00  -0.18   0.42  -0.24  -0.05
  X9   0.22   0.13   0.04  -0.04  -0.22  -0.04  -0.10  -0.18   1.00   0.34  -0.06   0.09
  X10  0.10   0.37  -0.09   0.17  -0.10  -0.04   0.01   0.42   0.34   1.00  -0.16  -0.08
  X11  0.06  -0.09   0.04   0.31   0.02  -0.33  -0.11  -0.24  -0.06  -0.16   1.00   0.21
  X12 -0.13  -0.06  -0.02  -0.10   0.03   0.15  -0.18  -0.05   0.09  -0.08   0.21   1.00

[Figure 4: release year versus 1st and 2nd principal components of timbre average]

3. PREDICTIVE TASK IDENTIFICATION

The predictive task is to predict the release year from the timbre average and timbre covariance features, applying several candidate models to the data set. The criterion we use to measure model accuracy is the Mean Absolute Error (MAE); the model with the smallest MAE is considered the best.

4. MODEL SELECTION

Based on the description above, we aim to predict the release year of the songs in the test set. We use five models for this predictive task.

4.1 Baseline Model

In the baseline model, we simply use the average release year of the train data set as the prediction for every song in the test data set:
$$\hat{y} = \bar{y}$$

4.2 Linear Regression Model

Linear regression is an approach for modeling the relationship between a scalar dependent variable $y$ and one or more explanatory (independent) variables $X$. A simple linear model is $y = X\beta$, with coefficients obtained by solving

$$\min_\beta \|X\beta - y\|_2^2$$

4.3 Ridge Regression Model

Ridge regression penalizes the size of the regression coefficients of the linear model. The ridge estimator solves

$$\arg\min_\beta \|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2$$

with the closed-form solution

$$\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$

4.4 LASSO

LASSO (least absolute shrinkage and selection operator) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. It solves

$$\min_\beta \frac{1}{2N} \|X\beta - y\|_2^2 + \lambda \|\beta\|_1$$

4.5 Random Forest Model

Random forest is a meta-estimator that fits a number of decision trees on various sub-samples of the data set, which improves predictive accuracy and controls over-fitting; random forests correct for decision trees' habit of overfitting to their training set. Here we use the RandomForestRegressor provided by the sklearn package in Python.

5. LITERATURE AND RESEARCH

The UCSD Data Science Student Society (http://ds3-at-ucsd.github.io/msd-fp-p1/) undertook an exploratory analysis and built a year prediction model on the Million Song Dataset [2]. They explained the peak song count at year 2007 as follows:

"We see from the distribution of the number of songs that over 50% of the songs in the dataset are from the 2000-2010 year increment. From looking further into the advancements in technology over time, we observed that the increase in the development of technology used to play and share music, such as mp3 players, iPods and iPhones, during this time period can explain this trend."

In their feature analysis, the correlations between features from both the extended year data and the UCI subset were calculated and visualized with a heatmap, from which they concluded that the most highly correlated feature pairs are: hotness vs. familiarity, loudness vs. familiarity, artist tag length vs. artist familiarity, artist tag length vs. artist hotness, pitch averages for the 12 segments, and timbre averages for the 12 segments. Finally, Ridge and LASSO regression were used to predict release year with the Mean Square Error criterion.

Matthew Moocarme³ also studied the subset of the Million Song Dataset from the UCI Machine Learning Repository. The innovation of his work was to use Spark to predict the year of a song's release. Using the Root Mean Square Error (RMSE) to measure prediction accuracy, he concluded that there is not much correlation between the features, and that the linear regression model is quite good, since its RMSE beats the baseline by almost 7 years. Besides, song year prediction using Apache Spark [4] includes similar studies, and the UCI subset has been mentioned in several books [3].

³ http://www.mattmoocar.me/blog/spark-song-year/

6. RESULT ANALYSIS

Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are two common measures of model accuracy. Mean Absolute Error is defined as

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

and Root Mean Square Error is defined as

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$

The criterion metric for this predictive task is MAE. We build linear regression, ridge regression, LASSO regression and random forest models on the train data set and apply them to predict the release year on the test data set. For ridge and LASSO regression, we vary λ from 0.01 to 100.0; the MAE versus λ curves are shown in Figure 5, from which we find that MAE increases as λ increases. This result indicates that there is not much collinearity among the 90 timbre average and timbre covariance attributes.

[Figure 5: mean absolute error versus lambda for ridge and lasso regression models]

Next, we gather all the models and their MAEs together in Table 6. In terms of MAE, the baseline model is the worst, with an MAE of 8.113, and the simple linear regression model is the best, with an MAE of 6.800. The baseline model is the worst because it simply assigns the average release year of the train data set to every song in the test data set. The reason the linear regression model beats the others might be that it uses all of the timbre average and timbre covariance information for release year prediction.
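The workflow behind Table 6 can be sketched on synthetic data as follows. The model classes and MAE metric come from sklearn (which the report says it uses); the data here are random stand-ins, so the printed MAE values are illustrative only, and sklearn's `alpha` parameter plays the role of λ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Random stand-ins for the 90 timbre features and a linear release-year label.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(400, 90)), rng.normal(size=(100, 90))
coef = rng.normal(size=90)
y_train = 1998 + X_train @ coef + rng.normal(size=400)
y_test = 1998 + X_test @ coef + rng.normal(size=100)

# Baseline: predict the mean release year of the train set for every song.
baseline_pred = np.full_like(y_test, y_train.mean())
results = {"Baseline": mean_absolute_error(y_test, baseline_pred)}

# sklearn's alpha corresponds to the report's lambda.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge (lambda=1.0)": Ridge(alpha=1.0),
    "Lasso (lambda=1.0)": Lasso(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, mae in results.items():
    print(f"{name}: MAE = {mae:.3f}")
```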
Table 6: mean absolute errors of models
  Model                           MAE
  Baseline Model                  8.11306989198
  Linear Regression               6.80049646319
  Ridge Regression  (λ = 0.01)    6.8004964635
                    (λ = 0.1)     6.80049646632
                    (λ = 1.0)     6.80049649453
                    (λ = 10.0)    6.80049677659
                    (λ = 100.0)   6.80049959719
  Lasso Regression  (λ = 0.01)    6.80056083884
                    (λ = 0.1)     6.80193746089
                    (λ = 1.0)     6.83034695799
                    (λ = 10.0)    7.37446311611
                    (λ = 100.0)   7.85082878353
  Random Forest                   6.97150655304

Looking more deeply into the linear regression model, we examine the absolute difference between the estimated release

year and the actual release year, |ŷ − y|, in order to assess how well our predictions work. The results are grouped into five buckets: < 1 year, 1-3 years, 3-5 years, 5-10 years and > 10 years. The best predictions (< 1 year) account for 10.6% of the results, while the worst bucket (> 10 years) accounts for 20.0%. The largest bucket is 5-10 years, at 29.6%. The details are shown as a pie chart (Figure 6). Compared to the baseline model, the simple linear regression model improves the MAE from 8.113 to 6.800.

[Figure 6: pie chart of prediction error for linear regression model]

7. REFERENCES

[1] T. Bertin-Mahieux. Million Song Dataset. http://labrosa.ee.columbia.edu/millionsong/.
[2] M. Lichman. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2013.
[3] W. W. Piegorsch. Statistical Data Analytics: Foundations for Data Mining, Informatics, and Knowledge Discovery, page 364. John Wiley & Sons, illustrated edition, 2015.
[4] P. Mishra, R. Garg, and A. K. Song year prediction using Apache Spark. IEEE, 21-24 Sept. 2016.