Predicting Similar Songs Using Musical Structure

Armin Namavari, Blake Howell, Gene Lewis

1 Introduction

In this work we propose a music genre classification method that directly analyzes the structure of a song. Information about the melody, rhythm, and dynamics of a song can be used to discern structural features that are apparent within the music, e.g. intervals, progressions, and other recurring patterns. We are particularly interested in modeling a song as a structured sequence of symbols (e.g. pitches); we show that this paradigm allows for the use of standard NLP techniques by viewing a song as a language, which is likewise a structured sequence of symbols. Consequently, we can apply natural language processing techniques to bring out structural features within our data. We show how to map a given song to a finite character space and build NLP-inspired features from this low-level representation; from these high-level features, we train a classifier to predict the genre of the input song out of four options. Concretely, our algorithm takes as input post-processed sonic information about a song and outputs a single classification prediction out of rock, folk, hip hop, and jazz, as depicted in Figure 1.

Figure 1: Input-Output Specification

We evaluate our system on a held-out test set, which is a subset of our data. Our evaluation metric is simple classification accuracy: out of all of our predictions on the test set, the proportion that match the ground-truth labels. We hope to do better than random guessing over our four genres, which yields around 25% accuracy; a simple baseline that predicts the label of the closest song in Euclidean space, where each song is represented by its first 10 notes, does only slightly better at 28% on a small subset of the data. Human-level performance on this task, on the other hand, is very nearly 100%, so there is substantial room for improvement.
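As a point of reference, the nearest-neighbor baseline mentioned above amounts to a few lines of NumPy. The sketch below is illustrative only: the array names first_notes, labels, and query_notes are our own, and this is not the pipeline used for the results reported later.

import numpy as np

def nearest_neighbor_baseline(first_notes, labels, query_notes):
    # first_notes: (N, 10) array holding the first 10 thresholded pitch values
    # of each training song; labels: length-N array of genre labels;
    # query_notes: (10,) array for the song to classify.
    distances = np.linalg.norm(first_notes - query_notes, axis=1)
    return labels[np.argmin(distances)]  # copy the label of the closest song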

2 Related Work

Much research has been done in the field of music classification. Dieleman et al. [1] have investigated the use of convolutional neural networks for genre classification. They, like us, made use of the Million Song Dataset compiled by Columbia's LabROSA. Although this approach exhibits some success, we are more interested in seeing how well a simple NLP-inspired model can perform, though we also make use of a neural network. Instead of focusing on which neural network architecture yields the best results (although this would be an intriguing future direction), we wish to emphasize which combinations of features allow us to make reasonable predictions about song genres.

Coviello et al. [2] have explored time series models, taking into account deeper structural features related to how a song develops over time. They use a dynamic texture mixture (DTM) as a linear dynamical model for their audio chunks. In our own model, we capture some time-series information by analyzing the volumes of chunks within the song; however, we only look at the variability of volume over these chunks instead of feeding them into a linear dynamical model. Coviello et al. use EM to learn the parameters of their DTM, whereas we train a neural network on our selected features.

Hoffman et al. [3] used a Codeword Bernoulli Average (CBA) probabilistic model to assign tags to songs, also trained with EM. Their method uses unsupervised learning to represent a song as counts of codewords (clusters found among their feature vectors), which are then used in a bag-of-codewords representation. Our approach is similar in that we also use a bag-of-x representation; however, we use a hand-designed feature extractor instead of unsupervised learning to generate this representation.

3 Models

3.1 Music as a Language

To apply NLP techniques to song audio, we first need an atomic-level representation of a song in the spirit of the characters or words of text. Our dataset of post-processed audio provides several suitable candidate representations: sections, which represent high-level semantic shifts over the course of a song; bars, which represent regular collections of notes known as measures; and segments, which represent specific discrete events such as the onset of a note. In the interest of choosing a representation that allows for as much flexibility in engineering high-level features as possible, we chose the fine-grained representation of segments. This segment representation gives us the ability to craft features at the per-note level, along with a rich collection of sonic information such as pitch, timbre, loudness, and duration; much of this information is lost at the coarser representation levels, even if those levels give more structure to both the technical and semantic skeleton of the composition.

With this raw data-level representation in hand, we then focused on creating a computationally tractable representation of a song's pitch data. If we let the number of timesteps in a given song be denoted by $T$, the pitch data for each song is provided as a matrix of dimensions $T \times 12$, where each row is a distribution over the 12 tonal pitches at that timestep. When considering how to use this data with standard NLP techniques, we noted that NLP benefits from the assumption of a finite character space, e.g. the range of characters a-z. Since representing each timestep with a real-valued 12-vector admits an uncountably infinite range of possible characters, we reduce this character space to 12 by selecting the tone with maximal weight at every timestep, reducing each real-valued 12-vector to an integer in the range 1-12. Thus, our final low-level representation of a given song is a vector of length $T$.

3.2 High-Level Language Features

3.2.1 Tonal Bigram Bag-of-Words

Figure 2: Tonal Bigram Feature Extractor

Leveraging the low-level song representation described above, we construct a bag-of-words representation of the song by forming bigrams of these thresholded pitch values and storing the frequency with which each bigram occurs in the song, resulting in a histogram feature vector of size 144 for each song. A convenient advantage of this bag-of-words representation is that it accounts for the variable length of the songs in our dataset.
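A minimal sketch of this two-step mapping is given below, written against NumPy. The function names are our own, the histogram uses raw bigram counts, and whether the authors additionally normalized by song length is not stated in the report.

import numpy as np

def pitch_characters(pitch_matrix):
    # Map a (T, 12) matrix of per-timestep pitch distributions to a length-T
    # sequence of integer "characters" in {0, ..., 11} by taking the tone with
    # maximal weight at each timestep.
    return np.argmax(pitch_matrix, axis=1)

def tonal_bigram_features(characters):
    # Count how often each of the 144 possible tone bigrams occurs in the
    # character sequence, yielding the histogram feature vector.
    histogram = np.zeros(144)
    for first, second in zip(characters[:-1], characters[1:]):
        histogram[12 * first + second] += 1
    return histogram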

After implementing a feature extractor to create this bag-of-words representation, we inspected the features extracted from several songs in our database and noted that each song exhibits distinct spikes in the frequency data, with a small subset of bigrams occurring at frequencies an order of magnitude higher than the rest; we hypothesize that these frequency spikes correspond to common note intervals that make up the melody or chorus of each song. This provided evidence that a significant amount of structure was being captured by this representation.

3.2.2 Time-Series Volume Inflections

Figure 3: Volume Inflection Feature Extractor

In addition to frequency patterns, loudness conveys useful information about the structure of a song. Certain genres have characteristic amounts of loudness variation: jazz and folk music, for example, vary in loudness by a large amount, while rock music makes heavy use of compression (a technique by which quiet sounds are boosted and loud sounds are attenuated), giving it a much more consistent loudness profile. It is therefore useful to design a feature that captures how much a song's loudness varies. One way to capture this information is to take the standard deviation of loudness over several equal-length time chunks of the song. In our model, we found that the ideal number of chunks was four (as illustrated in Figure 3). After this process, we have a new feature vector that describes the volume profile of a song.

3.2.3 Combined Representation

The loudness feature vector can simply be appended to the pitch features to obtain a new feature vector that captures both melodic and dynamic information.
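A sketch of the volume-inflection extractor under the description above; it assumes a hypothetical 1-D array of per-segment loudness values (in decibels) and is not drawn from the authors' code.

import numpy as np

def volume_inflection_features(loudness, num_chunks=4):
    # Split the per-segment loudness values into num_chunks roughly
    # equal-length chunks and return the standard deviation of each,
    # capturing how much the song's volume varies over time.
    chunks = np.array_split(np.asarray(loudness, dtype=float), num_chunks)
    return np.array([chunk.std() for chunk in chunks])

The combined representation of Section 3.2.3 is then simply the concatenation of the two sketched feature vectors, e.g. np.concatenate([tonal_bigram_features(chars), volume_inflection_features(loudness)]).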

We expect this combined representation to help disambiguate classes that are hard to differentiate using pitch features or loudness features alone. One area where we might benefit from the combined representation is distinguishing rock music from folk music, which are quite similar in terms of melodic patterns; as mentioned above, the use of compression in rock gives it a more consistent loudness profile, whereas folk tends to vary more in volume over the course of a song. Furthermore, volume information gives more insight into how a song evolves over time (as the chunks are always sequential in time), whereas pitch information helps us better understand the fundamental structure of the song. Ideally, the combined information lets us benefit from the best aspects of both loudness-based and pitch-based differentiation.

3.3 Neural Network Model

Figure 4: Neural Network Architecture

To evaluate the power of each of our feature representations, we train a learner that uses these features to predict the genre of the input song. Noting that we have a fairly sizeable dataset for each of our four genres and that there are potentially highly non-linear and even non-convex relationships between our input features, we model our learner as a multi-layer neural network [4], depicted in Figure 4. Our neural network uses 50 hidden units in each layer and outputs scores for the four genres. The number of hidden layers was tuned as a hyperparameter and ranged from one to three. Each layer is fully connected to the next; concretely, each hidden layer consists of a matrix $W_h \in \mathbb{R}^{50 \times 50}$, followed by a ReLU non-linearity [5], followed by dropout. Let our input feature vector be denoted $\phi(x) \in \mathbb{R}^{T}$ (where $T$ here denotes the dimensionality of the extracted feature vector) and our output vector $\hat{y} \in \mathbb{R}^4$; our network then also has a fully connected input layer $W_{\mathrm{in}} \in \mathbb{R}^{T \times 50}$ and a fully connected output layer $W_{\mathrm{out}} \in \mathbb{R}^{50 \times 4}$. At training time, we encode our target vectors as one-hot vectors of length four, with a one in the index that corresponds to the correct genre.
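As a sketch, an architecture of this shape can be expressed in Keras roughly as follows. The layer count, dropout rate, L2 strength, and learning rate used for the reported results were tuned by grid search (Section 4.3), so the default values below are placeholders rather than the authors' final configuration.

from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_model(input_dim, num_hidden_layers=2, l2_strength=1e-4, learning_rate=0.01):
    # Fully connected network: input layer, 50-unit hidden layers with ReLU and
    # dropout, and a 4-way softmax output over the genres.
    model = keras.Sequential()
    model.add(layers.Input(shape=(input_dim,)))
    for _ in range(num_hidden_layers):
        model.add(layers.Dense(50, activation="relu",
                               kernel_regularizer=regularizers.l2(l2_strength)))
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(4, activation="softmax"))
    model.compile(optimizer=keras.optimizers.Adagrad(learning_rate=learning_rate),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model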

If the correct genre is given by $i$ and our softmax output vector is given by $\hat{y}$, our loss function is the categorical cross-entropy:

$$\mathrm{Loss} = -\log\left(\frac{\exp \hat{y}_i}{\sum_j \exp \hat{y}_j}\right)$$

Thus, our model is incentivized not only to predict the correct label, but to predict it with the maximum possible confidence (i.e. to place as much probability mass as possible in the slot that corresponds to the correct genre).

4 Experiments

4.1 Computational Infrastructure

Our data was hosted on an Amazon AWS machine, where we also cleaned the data and ran our training algorithms. We used the Amazon C4 Compute Optimized instances, with Intel Xeon processors, 16 virtual cores, and 30 GB of RAM. Our data was cleaned using the Python NumPy library and stored as a serialized Python dictionary for ease of access. To build and train our neural network, we leveraged the Keras library [6] with a TensorFlow [7] backend to enable rapid experimentation.

4.2 Dataset

We use the Million Song Dataset [8], which consists of approximately one million sound clips annotated with a large collection of structured metadata such as artist, release year, hotttnesss, latitude, and longitude. Due to copyright issues, it is very difficult to find a large collection of raw, full-length audio files for feature extraction and training. The Million Song Dataset mitigates this issue by providing a wealth of sonic information extracted from each song, such as a discretization of the continuous audio into note-sized chunks, a distribution over pitches for each chunk, volume information in decibels for each note, and duration information for each note.

For the purposes of our experiment, the largest drawback of the Million Song Dataset is the lack of a direct genre label for each song. However, each song comes with a collection of tags: strings that describe assorted characteristics of and information related to the song. When counting the frequency of unique tags over all the data, we found that many of the most frequent tags commented on the genre of the song. We therefore chose four sufficiently frequent genre tags as our supervision labels for their related songs. The four genres we chose were rock, folk, hip hop, and jazz; these labels were selected because they seemed sufficiently different to allow for discernible patterns in their constituent songs. In our data cleaning, we retained all songs that belonged to exactly one of these four labels and removed all others; we then applied our feature extractor to each of the retained songs to create our $(\phi(x), y)$ pairs. This gave us 10,000 examples per label, for a total of 40,000 examples. We then separated these examples into a 1,000-example testing set and a 9,000-example training set.
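The cleaning step described above amounts to filtering songs by tag and pairing extracted features with integer labels. A rough sketch follows; it reuses the feature-extractor sketches from Section 3.2 and assumes a hypothetical iterable of (tags, pitch_matrix, loudness) tuples produced by some dataset loader, since the authors' actual field names and serialization code are not specified.

import random
import numpy as np

GENRES = ["rock", "folk", "hip hop", "jazz"]

def build_examples(songs):
    # songs: iterable of (tags, pitch_matrix, loudness) tuples (hypothetical loader).
    examples = []
    for tags, pitch_matrix, loudness in songs:
        matched = [g for g in GENRES if g in tags]
        if len(matched) != 1:
            continue  # keep only songs tagged with exactly one of the four genres
        chars = pitch_characters(pitch_matrix)
        features = np.concatenate([tonal_bigram_features(chars),
                                   volume_inflection_features(loudness)])
        examples.append((features, GENRES.index(matched[0])))
    random.shuffle(examples)  # randomize before the train/test split (Section 4.3)
    return examples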

4.3 Experimental Setup

In our initial experimentation, we found that if we didn't randomize the training data, our model would very quickly learn to predict the label of the first few training examples for all subsequent examples, leading to a test accuracy of 25%, which is no better than random. This is a consequence of incremental update algorithms like Stochastic Gradient Descent and AdaGrad, which take their largest steps in the earliest iterations. To mitigate this issue, we randomized our training and testing sets for all subsequent experiments, resulting in non-trivial accuracy.

We also found that using an improved update rule over vanilla Stochastic Gradient Descent [4] was instrumental to achieving our best results. In particular, we had the best success with AdaGrad [9], a learning algorithm that adapts the learning rate for each parameter independently over the lifetime of training; for our model trained on tonal bigrams, switching from SGD to AdaGrad boosted our accuracy from 37% to 60%. We subsequently used AdaGrad for all experiments presented in this work.

We found that using models of increasing size and training for an increasing number of epochs often led to severe overfitting, where our training accuracies would far outpace our testing accuracies. To mitigate this problem, we introduced two forms of regularization into all of our experiments. First, we imposed an L2 penalty on all of our weights, which prevents the weights from growing unreasonably large and thus over-estimating the importance of features that are predictive only on the training data. Second, we introduced dropout [10] layers into our neural network, which randomly eliminate a given neural activation with probability 0.5; though there is as yet no rigorous explanation for why dropout has a regularizing effect on neural network models, empirical evidence has shown it to be a very effective tool for reducing overfitting.

Finally, we note that a neural network model leads to a highly non-convex objective function, which means there is little intuition to guide hand-tuning of training hyperparameters; after our initial experimentation, we therefore sought a principled way to explore the parameter space. We used grid search, which trains a model for each combination of hyperparameters in the Cartesian product of the candidate values and keeps the one that performs best on the test set. We used this setup to identify the optimal model for each of the feature extractors tested below; the hyperparameters varied include the initial learning rate, the L2 regularization strength, and the number of layers in the model.
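A sketch of this grid search, reusing the build_model sketch from Section 3.3; it assumes X_train, y_train, X_test, y_test hold the shuffled feature matrices and one-hot label arrays, and the candidate values and epoch count below are illustrative rather than the grid actually searched.

import itertools

learning_rates = [0.1, 0.01, 0.001]   # illustrative candidate values
l2_strengths = [1e-3, 1e-4, 1e-5]
layer_counts = [1, 2, 3]

best_accuracy, best_config = 0.0, None
for lr, l2, n_layers in itertools.product(learning_rates, l2_strengths, layer_counts):
    model = build_model(input_dim=X_train.shape[1], num_hidden_layers=n_layers,
                        l2_strength=l2, learning_rate=lr)
    model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)
    _, accuracy = model.evaluate(X_test, y_test, verbose=0)
    if accuracy > best_accuracy:
        best_accuracy, best_config = accuracy, (lr, l2, n_layers)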

5 Results

5.1 Tonal Bigram Bag-of-Words

Figure 5: Confusion Matrix for NN trained on Bigram Features

Our first experiments involved training and testing our neural network model on the song features generated by the Tonal Bigram Bag-of-Words feature extractor. Our most performant model achieved an accuracy of 60% on our test set; the resulting confusion matrix is given in Figure 5.

Straight away, we notice some interesting patterns. First, the classification accuracy is relatively high for jazz, folk, and hip hop. Furthermore, the primary sources of error in our confusion matrix are localized to the top-left and bottom-right corners of the matrix, which indicates that it is relatively difficult to separate songs within the pairs jazz / hip hop and folk / rock. This result seems fairly intuitive to the authors, who would generally agree that jazz and hip hop are relatively similar, as are folk and rock. It suggests that tonal bigram frequencies are sufficient to differentiate broad musical categories but lack the representative power to make finer-grained decisions; we postulate that this is because the tonal bigrams represent frequent note intervals, and folk and rock music likely both rely heavily on common musical intervals that are known to be pleasing to a broad audience, while jazz and hip hop likely use a wider range of intervals in a manner more indicative of their respective genres. We therefore seek to augment our Tonal Bigram Bag-of-Words features with additional features that increase the discriminative power of our model within these genre pairs.
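For reference, the accuracy and confusion matrices reported in this section can be computed from a trained model with a few lines of scikit-learn; this assumes the model and X_test / y_test arrays from the earlier sketches.

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = np.argmax(model.predict(X_test), axis=1)  # predicted genre indices
y_true = np.argmax(y_test, axis=1)                 # recover indices from one-hot labels
print("test accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))            # rows: true genre, columns: predicted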

5.2 Time-Series Volume Inflections

Figure 6: Confusion Matrix for NN trained on Volume Features

Motivated by the previous discussion, we then turned our attention to identifying additional features that could help discriminate more clearly between rock and folk, as well as between jazz and hip hop. We reasoned that an important tool in musical composition is volume dynamics, and that different genres of music likely have different volume patterns corresponding to standard compositional structure. To capture this intuition, we created the Time-Series Volume Inflections feature extractor.

Our best model trained on volume inflection features achieved an accuracy of 43%. Though we were initially disheartened by this drop in performance relative to the model trained on tonal bigrams, we noticed an important pattern in the confusion matrix given in Figure 6: the confusion between folk and rock had been largely disentangled. Indeed, the rock label was highly distinguishable with the volume inflection features, while it had been difficult to distinguish with tonal bigrams. We postulate that this is due to modern compression techniques for loud music such as rock, which greatly reduce the variation in loudness of much rock music and so make it much more easily distinguishable from folk, hip hop, or jazz. We note that although folk and jazz are highly entangled under the volume inflection features, they were highly disentangled under the tonal bigram features. We therefore hypothesized that the combination of tonal bigram and volume inflection features would give a representation strong enough to distinguish between all pairs of genres, and in particular would help reduce the number of misclassifications in the top-left and bottom-right corners of our confusion matrix.

5.3 Combined Representation

Figure 7: Confusion Matrix for NN trained on Combined Features

For our final set of experiments, we combined our tonal bigram feature extractor and our volume inflection feature extractor into a single feature extractor that collects both sets of information and concatenates the resulting feature vectors. We reasoned that this Combined Feature Extractor would leverage the complementary strengths of our previous two song representations to yield a more powerful classifier. Our best model trained on these combined features achieved an accuracy of 66%, a gain of six percentage points (10% relative) over the accuracy achieved with tonal bigrams alone. When we examine the corresponding confusion matrix shown in Figure 7, we notice a striking change from Figure 5: we now have a very large number of predictions clustered on the main diagonal (corresponding to correct predictions) and a very small number of predictions on the antidiagonal. The matrix overall exhibits a much smoother distribution of predictions in which the top-left and bottom-right corners are far less prominent, supporting our hypothesis that the combined features allow for stronger discrimination and a kind of interpolation between the functions derived from either feature extractor in isolation.

6 Conclusion

In this work, we have shown that with an appropriate choice of character space, sonic attributes can be mapped to a notion of language, and therefore common natural language analysis techniques such as bigram frequencies and time-series sequence structure can be used to disambiguate between different musical patterns. This is an exciting development, since it suggests that natural language and musical sequences share a great deal of structure, and so further NLP techniques beyond those explored in this work may be useful for song analysis. It would be interesting to consider different music-to-language mappings, perhaps even exploring musical embeddings (similar to systems such as word2vec). It would also be promising to consider neural network architectures that handle temporal data well, e.g. RNNs and their variants such as LSTMs. We hope to explore these directions further in future work.

References

[1] Sander Dieleman, Philémon Brakel, and Benjamin Schrauwen. Audio-based music classification with a pretrained convolutional network. In 12th International Society for Music Information Retrieval Conference (ISMIR 2011), pages 669-674. University of Miami, 2011.

[2] Emanuele Coviello. Automatic music tagging with time series models. 2014.

[3] Matthew D. Hoffman, David M. Blei, and Perry R. Cook. Easy as CBA: A simple probabilistic model for tagging music.

[4] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 2nd edition, 2003.

[5] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807-814, 2010.

[6] François Chollet. Keras, 2015.

[7] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[8] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.

[9] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

[10] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.