Using Genre Classification to Make Content-based Music Recommendations


Robbie Jones (rmjones@stanford.edu) and Karen Lu (karenlu@stanford.edu)
CS 221, Autumn 2016, Stanford University

I. Introduction

Our project goal is to create personalized music recommendations based on song similarity. Popular music recommender systems like Pandora, iTunes Genius, and Spotify Discover use a variety of information filtering techniques to engage users through the discovery of new music that they would enjoy. These recommender systems go beyond surface-level classifications such as pop songs, songs from the year 2010, or songs by Bruce Springsteen. Rather, they consider similarities in user profiles, including demographic data, geographic information, and shared likes and dislikes, as well as similarities in the audio files themselves. In this project, we focus specifically on developing effective content-based recommendations by examining song similarity through the lenses of lyrics and audio metadata.

II. Task Definition

We broke our project down into two main tasks. First, we chose multi-class genre classification as our method for training song features relevant to identifying song similarity. We concentrated on predicting pre-selected genres because humans generally accept these genres as a useful way of clustering music by similarity. Most of us would consider two country songs, for example, to be more similar than a country song and an electronic song. In the process of classifying songs by genre, we create a high-dimensional feature space model that we can then use to approach our second task: given a set of songs that a user likes, we create a content-based recommendation system to deliver additional songs suited to their taste.

III. Literature Review

We initially discovered our dataset (the Million Song Dataset, described in detail in Section IV) by researching similar projects that predict song popularity based on audio metadata [4, 6, 10]. These projects define a song's features based on metadata such as tempo, key, and time signature, and output a popularity index that can be ranked against other songs. Popularity can be evaluated against predefined rankings such as the Billboard Top 10. Although the models and algorithms vary widely across projects, the task is more or less the same.

Although our initial project idea was to predict song popularity, we soon became more interested in song recommender systems. Some previous recommendation work has been done using our dataset; the creators of the dataset even released the Million Song Dataset Challenge in 2012, encouraging participants to create an offline music recommendation system to be evaluated against real users' listening histories [8].

Some popular music recommendation services in industry include Spotify [1, 5] and Pandora [7]. Neither company makes its specific models or algorithms publicly available, but we can still get an overview of the general frameworks. Spotify has historically relied on collaborative filtering as its main method of generating recommendations. Collaborative filtering infers a user's preferences from the behavior of other users: users who have listened to similar sets of songs are likely to have similar taste. Issues with collaborative filtering include the cold-start problem (the system cannot recommend songs that are new) and the fact that popular songs are much more likely to be recommended than unpopular ones. After acquiring The Echo Nest, Spotify started to incorporate content-based recommendation into its product. Pandora, on the other hand, focuses on content-based recommendation: its Music Genome Project documents 400 musical attributes per song, hand-labeled by music experts.

IV. Data

Song Metadata

Our main dataset was downloaded from the Million Song Dataset [2], compiled by Daniel P.W. Ellis and Thierry Bertin-Mahieux at the Laboratory for the Recognition and Organization of Speech and Audio (LabROSA) at Columbia University. This publicly available dataset contains audio features and metadata for one million individual songs. Most of the fields were extracted using the Echo Nest API, although the dataset also contains song IDs from other websites, like musicbrainz.org and playme.com (The Echo Nest was acquired by Spotify for $100 million in March 2014). Fields include basic metadata like loudness, artist location, and duration, as well as more subjective or algorithmically calculated values like danceability and time signature. The data is formatted as one HDF5 file per song; a single reduced summary file of the non-array fields for all one million songs is also available, as is a SQLite database of most per-track metadata.
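As a concrete illustration of the per-song format, the following is a minimal sketch of pulling a few fields from one track file with h5py. The group and field names follow the dataset's documented layout, but the exact paths here should be treated as assumptions rather than our actual extraction code.

    import h5py

    def read_track_features(path):
        """Read a handful of scalar fields from a single MSD HDF5 file.

        /analysis/songs and /metadata/songs are compound datasets with
        one row per file; field names are assumptions based on MSD docs.
        """
        with h5py.File(path, "r") as f:
            analysis = f["/analysis/songs"][0]
            metadata = f["/metadata/songs"][0]
            return {
                "duration": float(analysis["duration"]),  # seconds
                "tempo": float(analysis["tempo"]),        # beats per minute
                "loudness": float(analysis["loudness"]),  # decibels
                "artist": metadata["artist_name"].decode(),
                "title": metadata["title"].decode(),
            }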

Lyrics Data

Since the Million Song Dataset was created and published in February 2011, several additional compatible datasets have been compiled and shared with the research community. One dataset that we found particularly interesting was a lyrics dataset developed in a partnership between the Million Song Dataset and musiXmatch [9]. The team was able to resolve lyrics for over 77% of the dataset, of which the lyrics for 237,662 tracks are released. The 5,000 most common words from the dataset are included in bag-of-words format (unordered word counts).

Genre Data

For our genre classification task, we required labeled ground-truth genre data for each track in our training set. We chose a dataset of genre annotations developed by tagtraum industries incorporated because it uses majority voting to combine crowd-sourced genre labels from multiple sources, including the Last.fm dataset and the beaTunes Genre Dataset (BGD) [3]. In total, the dataset contains genre labels for 280,831 tracks. It defines fifteen ground-truth genres: Rock, Pop, Metal, Country, Rap, RnB, Electronic, Punk, Latin, Folk, Reggae, Jazz, Blues, World, and New.

Creating Our Dataset

To select tracks at the intersection of all three datasets, we created a Python parser for the lyrics and genre text files and loaded the relevant information from each dataset into tables in a SQLite database. By joining the tables on track_id (sketched below, after Figure 1), we found that a subset of 122,447 tracks had both tagtraum genre annotations and musiXmatch lyrics. We ran some queries to count the tracks in each genre (Figure 1).

Figure 1. Count of Tracks with Genre and Lyric Data, by Genre.

Genre        Track Count
Rock         61,531
Pop          15,263
Metal         7,777
Country       5,760
Rap           5,511
RnB*          5,209
Electronic    4,772
Punk          3,737
Latin         2,918
Folk          2,762
Reggae        2,175
Jazz          2,152
Blues         1,605
World         1,008
New             267

* In this dataset, RnB refers to the genre Rhythm and Blues.
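The intersection query behind Figure 1 can be sketched as follows. The table and column names (genres, lyrics, track_id, genre) are illustrative assumptions about our schema, not the exact code we ran.

    import sqlite3

    conn = sqlite3.connect("msd_subset.db")  # hypothetical database file

    # Tracks present in both the tagtraum genre table and the musiXmatch
    # lyrics table; this is the 122,447-track subset described above.
    intersection = conn.execute("""
        SELECT g.track_id, g.genre
        FROM genres AS g
        WHERE EXISTS (SELECT 1 FROM lyrics AS l
                      WHERE l.track_id = g.track_id)
    """).fetchall()

    # Per-genre counts for Figure 1.
    counts = conn.execute("""
        SELECT g.genre, COUNT(*) AS n
        FROM genres AS g
        WHERE EXISTS (SELECT 1 FROM lyrics AS l
                      WHERE l.track_id = g.track_id)
        GROUP BY g.genre ORDER BY n DESC
    """).fetchall()
    conn.close()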

Next, we created balanced datasets with an equal number of examples from each of the ten most common genres: Rock, Pop, Metal, Country, Rap, RnB, Electronic, Punk, Latin, and Folk. Our training set included 2,500 songs of each genre (25,000 songs in total), while our test set included 100 songs of each genre (1,000 songs in total).

V. Approach

Model

We aim to classify the genre of a song, since genre seems to be the most natural way to cluster groups of songs. We model individual songs in classic machine learning fashion as points in a high-dimensional space, where each dimension holds one value of the corresponding feature vector. We frame our problem as multi-class classification: learn a classifier f that accurately classifies a song by genre, such that f(Φ(x)) = y_i, where Φ(x) is the feature vector of x and y_i is the ground-truth genre.

Input: x = a track_id from the Million Song Dataset.
Output: y ∈ {y_1, y_2, ..., y_K}, where the y_i are the possible predefined genres.

Algorithms

Our goal was to choose a feature extractor and algorithm that would incorporate information from both audio metadata and word counts. To accomplish this, we used a combination of two machine learning techniques: one-vs-rest classification and k-nearest neighbors.

First, we focused on word features only and derived a list of ten confidence scores per song, one for each of our ten predefined genres. One-vs-rest is a strategy that trains a binary classifier for each class (genre), using tracks from that genre as positive training examples and tracks from all other genres as negative training examples. To evaluate a new track, we run it through all ten genre classifiers and choose the genre with the highest confidence score. Instead of simply outputting the final genre classification, we keep the ten confidence scores and use them as features in the next step of our project. We chose to do this, rather than use the 5,000 word-count features directly, both to reduce our feature space and to avoid recommending songs based primarily on word overlap.

The next step in our implementation is to use these word-feature confidence scores along with metadata features to train a multi-class genre classifier. Because of its simplicity, effectiveness, and easy training stage, we started with the k-nearest neighbors algorithm. With k-nearest neighbors, the training set's confidence scores and metadata can be used directly as feature vectors without tuning parameters. At test time, we simply find the k points with the smallest Euclidean distance and choose the genre by simple majority vote.
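A minimal sketch of this two-stage pipeline, assuming scikit-learn and illustrative variable names (X_lyrics as the 5,000-dimensional word counts, X_meta as the metadata features, y as genre labels); this is a sketch of the idea, not our exact code.

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier

    def train_two_stage(X_lyrics, X_meta, y, k=21):
        # Stage 1: one binary Naive Bayes classifier per genre; keep the
        # ten per-genre confidence scores instead of just the argmax label.
        ovr = OneVsRestClassifier(MultinomialNB()).fit(X_lyrics, y)
        conf = ovr.predict_proba(X_lyrics)  # shape (n_songs, 10)

        # Stage 2: k-NN over [confidence scores | metadata features],
        # with the default Euclidean distance and simple majority vote.
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(np.hstack([conf, X_meta]), y)
        return ovr, knn

    def predict_genre(ovr, knn, X_lyrics_new, X_meta_new):
        conf = ovr.predict_proba(X_lyrics_new)
        return knn.predict(np.hstack([conf, X_meta_new]))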

Once we've selected features that optimize our genre classifier's accuracy, we can use the same vector-space model to recommend similar songs by finding the closest neighbors in that space. Done effectively, this lets us define similarity at a finer grain than a genre label could capture.

VI. Data and Experimentation

Baseline and Oracle

For our baseline, we ran k-nearest neighbors with k = 1 on five floating-point features taken directly from the Million Song Dataset:

- duration: length of the song in seconds
- tempo: speed of the song in beats per minute
- loudness: loudness in decibels, derived from a model of human listening
- song hotttnesss: an aggregate measure of how much buzz the song is getting (derived from online mentions, play counts, etc.)
- danceability: a derived measure of the ease with which a person could dance to a song

We trained this model on a very small training set of 40 tracks for each of three genres (Rock, Pop, and Rap) and tested it on a set of 30 tracks, 10 from each of the same three genres. Our baseline method's accuracy was 37%. Our oracle consisted simply of using our ground-truth genre labels, resulting in an accuracy of 100%.

One-vs-rest Classification on Word Features

We started by training a single classifier for each class, such that training examples labeled with that genre are considered positive while all other training examples are considered negative. Using this one-vs-rest strategy, we achieved an overall accuracy of 47.2% with a Naive Bayes classifier and 44.0% with a logistic regression classifier (Figure 2). Figure 3 presents more fine-grained accuracies broken down by the ground-truth genre of each test-set example, and Figure 4 charts these accuracies for the Naive Bayes classifier.

Figure 2. Overall Accuracies of One-vs-rest Classification using Word Count Features.

Binary Classification Algorithm    Overall Test Set Accuracy
Naive Bayes                        47.2%
Logistic Regression                44.0%

Figure 3. Accuracies of One-vs-rest Classification using Word Count Features, by Genre.

Genre        Test Set Accuracy    Test Set Accuracy
             (Naive Bayes)        (Logistic Regression)
Latin        91.0%                85.0%
Rap          82.0%                74.0%
Country      69.0%                40.0%
RnB          68.0%                53.0%
Punk         54.0%                42.0%
Metal        53.0%                43.0%
Folk         34.0%                34.0%
Electronic    9.0%                28.0%
Pop           6.0%                24.0%
Rock          5.0%                17.0%

Figure 4. Accuracies of One-vs-rest Classification with Naive Bayes Classifier using Word Count Features. (Chart of the Naive Bayes accuracies from Figure 3; not reproduced in this text version.)

k-Nearest Neighbors for Classification and Recommendations

We ran our k-nearest neighbors algorithm on the word-count confidence scores only, on the song metadata features only, and on both sets of features combined. These results are documented in Figure 5.

Figure 5. Comparison of k-Nearest Neighbors Classification Accuracies Using Word Count Confidence Scores vs. Acoustic Metadata (k = 21).

Feature Set                          Training Set Accuracy    Test Set Accuracy
Word Count Confidence Scores Only    67.1%                    46.7%
Acoustic Metadata Only               22.5%                    12.5%
Both Combined                        25.5%                    14.0%

Since our acoustic metadata accuracies were so low, we also ran the algorithm on each feature individually to see whether any single feature worked well (Figure 6).

Figure 6. Comparison of k-Nearest Neighbors Classification Accuracies for Individual Acoustic Metadata Features (k = 21).

Feature            Training Set Accuracy    Test Set Accuracy
duration           10.0%                    10.0%
end_of_fade_in     10.1%                     9.5%
key                 9.8%                     8.5%
loudness           10.0%                    10.0%
mode                9.7%                    12.6%
start_of_fade_in   21.5%                    10.1%
tempo              21.8%                    10.3%
time_signature     10.2%                    10.1%

We also tried various values of k (k = 1, 11, 21, 31), but the results for each were similar to those for k = 21 (Figure 5 above).
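To close the loop on the recommendation task described at the end of Section V, here is a hedged sketch of the neighbor search we use to surface similar songs from the same feature space. The variable names (X_train, track_ids, X_liked) are illustrative assumptions.

    from sklearn.neighbors import NearestNeighbors

    def recommend(X_train, track_ids, X_liked, n_recs=10):
        """Return track_ids nearest (in feature space) to a user's liked songs."""
        nn = NearestNeighbors(n_neighbors=n_recs).fit(X_train)
        _, idx = nn.kneighbors(X_liked)  # one row of neighbor indices per liked song
        seen, recs = set(), []
        for j in idx.ravel():  # de-duplicate while keeping proximity order
            if track_ids[j] not in seen:
                seen.add(track_ids[j])
                recs.append(track_ids[j])
        # Note: liked songs themselves may appear if present in X_train;
        # filter them out if that is undesirable.
        return recs[:n_recs]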

VII. Analysis

Interpretation and Discussion

For our initial task of training a one-vs-rest classifier on word-count features, we achieved an accuracy of 47.2% across all ten genres. While this accuracy is not fantastic, it is much higher than random guessing (10% for classification across ten balanced genres). Based on our results in Section VI, it is evident that a bag-of-words model over song lyrics is much more informative for some genres than for others. For example, with Naive Bayes we achieved 91% accuracy on tracks labeled Latin and 82% on tracks labeled Rap, but only 6% and 5% on Pop and Rock, respectively. We surmise that this large variation between genres is due both to the more distinctive slang and vernacular of some genres and to their more focused themes and subjects.

Our k-nearest neighbors algorithm also did not perform as well as we had hoped. While we initially assumed that songs from different genres would have noticeably different metadata, it's possible that the variance of metadata among songs within individual genres was enough to throw off the algorithm. In addition, there's a good chance that whatever intrinsic qualities allow our ears to distinguish one genre from another are simply not captured by the metadata features we used for classification.

Challenges

Many of the challenges we faced stemmed from our inexperience working with datasets of this size. Since the main Million Song Dataset contains metadata for 1,000,000 songs, our feature extraction process was slow: even though we selected a relatively small training set of 25,000 songs, we needed to scan all 1,000,000 songs to find the subset of track_ids we wanted. We also started from an unbalanced dataset; in the intersection of the Million Song Dataset, our genre dataset, and our lyrics dataset, there are 61,531 rock songs but only 2,762 folk songs. In addition, our data came from multiple sources in multiple formats, including plain text files, HDF5 files, and SQLite databases.

We also faced challenges in feature engineering and evaluation. We lacked some of the domain-specific knowledge and intuition needed to engineer features that work well with our algorithms. And while evaluation of genre classification is simple given our dataset of ground-truth genre labels, evaluation of our unsupervised song recommendation task is much more open-ended: even with reasonable genre classification in hand, it is difficult to determine whether user-specific recommendations are effective.

VIII. Conclusion

To classify songs into ten predefined genres, we took a two-pronged approach. First, we used a one-vs-rest classifier powered by Naive Bayes to compute per-genre confidence scores from bag-of-words word-count features. Next, we combined those confidence scores with audio metadata features to train a k-nearest neighbors genre classifier, with limited success. To recommend songs based on user preferences, we searched our training set for the nearest songs in this high-dimensional vector space. Our research indicated that word-count features performed much better than the acoustic metadata features available to us. With more domain-specific knowledge and better feature engineering, our accuracies would perhaps improve.

IX. References

[1] Dieleman, Sander. "Recommending Music on Spotify with Deep Learning." Sander Dieleman, 5 Aug. 2014. Web. 15 Dec. 2016.
[2] "Getting the Dataset." Million Song Dataset. LabROSA, Echo Nest, n.d. Web. 26 Oct. 2016.
[3] Schreiber, Hendrik. "Improving Genre Annotations for the Million Song Dataset." In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 241-247, Málaga, Spain, Oct. 2015.
[4] Herremans, Dorien, David Martens, and Kenneth Sörensen. "Dance Hit Song Prediction." Journal of New Music Research 43.3 (2014): 291-302. Web. 26 Oct. 2016.
[5] Johnson, Chris. "Algorithmic Music Recommendations at Spotify." LinkedIn SlideShare, 13 Jan. 2014. Web. 15 Dec. 2016.
[6] Koenigstein, Noam, Yuval Shavitt, and Noa Zilberman. "Predicting Billboard Success Using Data-Mining in P2P Networks." Web. 26 Oct. 2016.
[7] Layton, Julia. "How Pandora Radio Works." HowStuffWorks, 23 May 2006. Web. 16 Dec. 2016.
[8] McFee, Brian, Thierry Bertin-Mahieux, Daniel P.W. Ellis, and Gert R.G. Lanckriet. "The Million Song Dataset Challenge." Web. 26 Oct. 2016.
[9] "The musiXmatch Dataset: Connecting Lyrics." Million Song Dataset. LabROSA, Echo Nest, 11 Apr. 2011. Web. 15 Dec. 2016.
[10] Pham, James, Edric Kyauk, and Edwin Park. "Predicting Song Popularity." Web. 26 Oct. 2016.