Lyrical Features of Popular Music of the 20th and 21st Centuries: Distinguishing by Decade


Cody Stocker, Charlotte Munger, Ben Hannel
December 16, 2016

1 Introduction

Music has been called the voice of a generation. Here, we try to use that voice to predict which generation, that is, which decade, a song came from. It seems intuitive that a song that includes the word "fuckin'" is a recent song, say from the 2000s or 2010s, while a song that mentions the name Ethel is likely much older (1940s). The impact of other features is less clear: is rhyming more common in recent songs or older ones, and do rhyme schemes vary with decade? Relying on lyric-based features, we attempt to classify songs according to the era they come from.

2 Literature Review

Various experiments have been conducted on humans' ability to recognize the release decade of music and on the ability of lyrical features to predict song traits. Carol Krumhansl conducted an experiment in which she played short clips of music to test subjects, who could recall the decade and other information fairly readily after only a few seconds. Participants identified the decade of popular songs about 80% of the time, a highly significant result [Kum10].

Xiao Hu and J. Stephen Downie compared the performance of lyric-based mood classification to audio-based classification. Of the 18 mood categories in the experiment, lyric-based classification significantly outperformed audio-based classification on 7, while audio features significantly outperformed lyric features on only one category. While this could imply that lyric-based classification is superior, lyric features significantly underperformed audio features on every negative-valence, negative-arousal category (such as calmness) [DH10].

Mayer, Neumayer, and Rauber used rhyme and style features to classify song genre. On a test set of 397 songs, 30-45 per each of the 10 genres they were classifying, they achieved 28.55% accuracy with K-nearest neighbors and 27.58% with Naive Bayes [MNR08].

3 Task Definition

3.1 Input

The input to our classifier is the full text of a song's lyrics. The artist and title are not provided. Here are some example inputs:

    I'm dreaming of a white Christmas
    Just like the ones I used to know... <full song lyrics>

    Tried to keep you close to me,
    But life got in between... <full song lyrics>

3.2 Output

The general goal of our algorithm is to predict when the lyrics of a song were written. However, we did not implement a continuous classifier, because the features of songs clearly have not followed linear trends between 1940 and 2010, and we thought it would be easier to use a discrete multiclass approach.

Originally, we wanted to predict the exact decade in which a song was written, but it proved difficult to train a classifier with eight output classes to high accuracy. We used three label functions to run different tests on our classifiers:

1. 50-50 Labeler: anything before 1980 is category 0, everything after is category 1
2. Bi-Decade Labeler: groups decades in pairs, i.e. 1940s/1950s, 1960s/1970s, and so on
3. Decade Labeler: groups songs by decade (i.e. 1940s, 1950s, 1960s, 1970s, 1980s, 1990s, 2000s, 2010s)

Naturally, the smaller the number of output labels, the higher our precision, accuracy, and F1 scores were. In all cases our classifier performed significantly better than random chance, and interestingly, it improved relative to random chance as more classes were added.

4 Infrastructure

Our data collection process began with a group of songs that we believed to be a fairly even sampling of genre and time period. We used two different song lists. One is the Alltime Pop Classics top-charts-by-year list from 1940 onward; it is comparatively small, but provides a very even sampling over time. The other is the Million Song Dataset, restricted to songs for which lyrics were available. The Million Song Dataset is larger, but songs are sparse prior to 1960. To partially compensate for this, the number of songs per year was capped at 1000 so that later years did not dominate.

Once we had our list of song titles and artists, we scraped the website Lyrically, which provides song lyrics for free. About 30% of songs were not found, especially among older years, exacerbating the skew in the data set. In the end, the smaller data set contained 5006 song/year pairs, and the larger data set contained 44913. Each data set was split into train (90%) and test (10%) subsets.

5 Approach

5.1 Feature Functions

5.1.1 Unigrams, Bigrams, and Trigrams

These are relatively simple linguistic features composed of single words, pairs of words, and triples of words. We also tried variants of these features: bigrams and trigrams both with and without start and end tokens. In general, we did not expand beyond these three because of data sparsity. Quad-grams would have been possible, but they are unlikely to be particularly good features: they would be very sparse and would tend to let the model overfit.

5.1.2 Stemming Functions

We also used a Porter Stemmer to stem all words in our dataset, and then ran our unigram, bigram, trigram, and start/end-token feature functions on the stemmed text.

5.1.3 Stop Word Removal

In order to speed up the classifiers, we also tried incorporating stop word removal. We found a list of the most common English stop words at ranks.nl and used two versions: the complete list, and the list with pronouns removed.

5.1.4 Length

Multiple features involved song length: number of words per song, average number of words per line, average number of syllables per line, and average number of sounds per line. Syllable and sound counts were determined with the Carnegie Mellon University (CMU) Pronouncing Dictionary, which provides word pronunciations as phoneme sequences with syllable stress.
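As a concrete illustration, here is a minimal sketch of how such length features can be computed with NLTK's packaging of the CMU Pronouncing Dictionary. The function and feature names are ours, not the authors' code, and words missing from the dictionary are simply skipped.

```python
# Illustrative sketch of the length features in Section 5.1.4, assuming
# NLTK's copy of the CMU Pronouncing Dictionary (not the authors' code).
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict", quiet=True)
PRON = cmudict.dict()  # word -> list of pronunciations (lists of phonemes like 'AH0')

def syllable_count(word):
    """Syllables = vowel phonemes, which carry a trailing stress digit (0/1/2)."""
    prons = PRON.get(word.lower())
    if prons is None:
        return None  # word not in the dictionary; skip it
    return sum(ph[-1].isdigit() for ph in prons[0])

def length_features(lyrics):
    lines = [line.split() for line in lyrics.splitlines() if line.strip()]
    words = [w for line in lines for w in line]
    syllables = [s for w in words if (s := syllable_count(w)) is not None]
    sounds = [len(PRON[w.lower()][0]) for w in words if w.lower() in PRON]
    n_lines = max(len(lines), 1)
    return {
        "total_words": len(words),                     # words per song
        "avg_words_per_line": len(words) / n_lines,
        "avg_syllables_per_line": sum(syllables) / n_lines,
        "avg_sounds_per_line": sum(sounds) / n_lines,  # phonemes per line
    }
```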

5.1.5 Rhyme Schemes

Rhyme scheme is a high-level musical feature that specifies the pattern of rhyming lines in a song: adjacent lines may rhyme, alternating lines may rhyme, or some other pattern may hold. To capture this as a feature, we used the CMU dictionary to determine whether the final words of any two lines rhymed, based on the phonemes in each word and their stresses. We then counted the rhyming pairs where two consecutive lines rhymed (AA), where two lines separated by one line rhymed (ABA), and where a line rhymed with any of the next three lines (e.g. ABCBA contains two such rhyming pairs). All of these schemes relied on the CMU dictionary, which is not entirely complete, and counted two words as rhyming if any recognized pronunciation of one rhymed with any pronunciation of the other (we have no guarantee this was the pronunciation intended in the song, but it seems likely).

5.2 Classifiers

We tried multiple types of linear classifiers provided by scikit-learn:

1. Multinomial Naive Bayes
2. Logistic Regression
3. Logistic Regression with L1 Regularization
4. Logistic Regression with L2 Regularization
5. Support Vector Machine (SVM)

5.3 Oracle

Our oracle was one of our group members, who hand-classified fifteen songs per decade based solely on their lyrics. The oracle's results are below:

label      1940   1950   1960   1970   1980   1990   2000   2010   Overall
precision  0.357  0.333  0.286  0.500  0.214  0.188  0.360  0.700  0.350
recall     0.333  0.333  0.267  0.400  0.200  0.200  0.600  0.467  0.350
f1         0.355  0.333  0.276  0.444  0.206  0.194  0.450  0.560  0.350

Our oracle did better than random chance on every task, and best on the decades closest to the present. The oracle did not have access to the titles and artists of the songs (neither does the classifier); it is likely that both the oracle and the classifier would perform better given this data.

5.4 Baseline

Our baseline was a Multinomial Naive Bayes classifier using a unigram feature function. The results are below:

label      1940  1950  1960  1970  1980  1990  2000  2010  Overall
precision  0.00  0.00  0.00  0.22  0.20  0.50  0.19  0.21  0.18
recall     0.00  0.00  0.00  0.24  0.24  0.20  0.31  0.67  0.22
f1         0.00  0.00  0.00  0.23  0.22  0.29  0.24  0.32  0.17

6 Results and Analysis

6.1 Classifiers

Overall, we used five classifiers, four of which we were able to run on our subset of the Million Song Dataset. We were not able to run the SVM on the MSD sample because running our feature functions took over a week even for a two-class classifier.

50-50 Labeler
Classifier  Multinomial Naive Bayes  Logistic  Logistic L1  Logistic L2
Average     0.646                    0.653     0.646        0.621
Maximum     0.73                     0.73      0.71         0.73

Bi-Decade Labeler
Classifier  Multinomial Naive Bayes  Logistic  Logistic L1  Logistic L2
Average     0.466                    0.497     0.485        0.497
Maximum     0.58                     0.58      0.56         0.58

Decade Labeler
Classifier  Multinomial Naive Bayes  Logistic  Logistic L1  Logistic L2
Average     0.248                    0.265     0.256        0.265
Maximum     0.35                     0.35      0.33         0.35

The SVM marginally outperformed the other classifiers on the small dataset with the two-class labeler, but did substantially worse on the multiclass labelers. Combined with its longer runtimes, this meant it was not worth running the SVM on the larger dataset. Overall, the Multinomial Naive Bayes classifier and the Logistic Regression with L2 regularization did the best, achieving exactly equal maximum scores over all three labeler functions, which is fairly impressive considering the basic probabilistic approach of Multinomial Naive Bayes compared to the logistic regressions. Additionally, the Multinomial Naive Bayes ran far faster than the logistic regressions, so overall it was the best classifier.

6.2 Feature Functions

We had several types of feature functions, each with different focuses, benefits, and drawbacks.

6.2.1 Unigrams, Bigrams, and Trigrams

Unigrams, bigrams, and trigrams were the first features we tried and ultimately the most successful: across all classifiers, they achieved the best results. For each classifier, the best feature functions and F1 scores were as follows:

1. Multinomial Naive Bayes
   (a) 50-50: Unigrams, 0.73
   (b) Bi-decade: Unigrams, 0.58
   (c) Decade: Unigrams, 0.35
2. Logistic Regression
   (a) 50-50: Combined Function, 0.73
   (b) Bi-decade: Combined Function, 0.58
   (c) Decade: Combined Function, 0.35
3. Logistic L1
   (a) 50-50: Combined Function, 0.71
   (b) Bi-decade: Unigrams and Combined Function, 0.55
   (c) Decade: Unigrams, 0.33
4. Logistic L2
   (a) 50-50: Combined Function, 0.73
   (b) Bi-decade: Combined Function, 0.58
   (c) Decade: Combined Function, 0.35
5. SVM (small dataset only)
   (a) 50-50: Combined Function, 0.80
   (b) Bi-decade: Combined Function, 0.53
   (c) Decade: Unigrams and Combined Function, 0.39

These functions did at least as well as, and often better than, our oracle, and unigrams provided a solid baseline that was seldom outperformed with the data we had.
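For reference, here is a minimal sketch of this kind of setup in scikit-learn, covering the n-gram features of Section 5.1.1, the 50-50 labeler of Section 3.2, and two of the classifiers from Section 5.2. The exact vectorization and hyperparameters used in the paper are not specified, so everything below is illustrative.

```python
# Illustrative scikit-learn pipeline: unigram counts, 50-50 labels, and two
# of the classifiers from Section 5.2. Hyperparameters are guesses, not the
# values used in the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def fifty_fifty_label(year):
    """50-50 Labeler: category 0 before 1980, category 1 from 1980 on."""
    return 0 if year < 1980 else 1

def run_experiment(lyrics, years):
    labels = [fifty_fifty_label(y) for y in years]
    # 90/10 train/test split, as described in Section 4
    X_tr, X_te, y_tr, y_te = train_test_split(lyrics, labels, test_size=0.1, random_state=0)
    for clf in (MultinomialNB(), LogisticRegression(penalty="l2", max_iter=1000)):
        model = make_pipeline(CountVectorizer(ngram_range=(1, 1)), clf)  # unigram counts
        model.fit(X_tr, y_tr)
        print(type(clf).__name__, f1_score(y_te, model.predict(X_te), average="weighted"))
```

The stemmed and stop-word variants of Sections 5.1.2-5.1.3 would amount to passing a stemming preprocessor or a stop-word list to CountVectorizer.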

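For completeness, the rhyme-pair counting of Section 5.1.5 can be sketched by reusing the PRON dictionary from the length-feature sketch above. The rule mirrors the paper's description (two words rhyme if any pair of recognized pronunciations matches from the last stressed vowel onward), but the code itself is our illustration, not the authors' implementation.

```python
# Illustrative sketch of the rhyme counting in Section 5.1.5, reusing PRON
# from the length-feature sketch. Not the authors' implementation.
def rhyme_part(pron):
    """Phonemes from the last stressed vowel (stress 1 or 2) to the word's end."""
    for i in range(len(pron) - 1, -1, -1):
        if pron[i][-1] in "12":
            return tuple(pron[i:])
    return tuple(pron)  # no stressed vowel; compare the whole pronunciation

def words_rhyme(w1, w2):
    """True if any recognized pronunciations of the two words share a rhyme part."""
    p1 = PRON.get(w1.lower(), [])
    p2 = PRON.get(w2.lower(), [])
    return any(rhyme_part(a) == rhyme_part(b) for a in p1 for b in p2)

def rhyme_features(lyrics):
    last_words = [line.split()[-1] for line in lyrics.splitlines() if line.strip()]
    return {
        "consecutive_rhymes": sum(words_rhyme(a, b) for a, b in zip(last_words, last_words[1:])),  # AA
        "alternating_rhymes": sum(words_rhyme(a, b) for a, b in zip(last_words, last_words[2:])),  # ABA
    }
```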
6.2.2 Stemming Functions

Stemming did not significantly alter performance, and when it did, it usually made classifiers perform slightly worse. For example, on the SVM it never yielded more than a 2-percentage-point difference in precision, recall, or F1 score on any feature for any labeler.

Multinomial Naive Bayes Classifier on 50-50 Labeler (F1)
Feature Function      Stemmed  Unstemmed
Unigrams              0.73     0.73
Bigrams               0.69     0.70
Trigrams              0.68     0.68
Bigrams with Tokens   0.72     0.72
Trigrams with Tokens  0.69     0.68
Combined              0.68     0.69

In the Multinomial Naive Bayes case, the classifier did about the same on average with stemmed and unstemmed functions. Averaged over all classifiers and all labelers, the stemmed feature functions received F1 scores about 0.005 lower than the unstemmed ones. This result may be an artifact of how the Porter Stemmer works: it is not designed to handle slang like "workin'" or "drivin'", which could explain the decrease in performance. A stemmer that could handle the variety of slang that music throws at it would be helpful.

6.2.3 Stop Word Removal

Despite our hopes, neither stop list increased performance, nor did they improve run time significantly, likely because we did not pre-process the files; preprocessing the lyrics might improve these feature functions. Both stop lists were derived from the same source, with the partial list having all pronouns removed.

Multinomial Naive Bayes Classifier on 50-50 Labeler (F1)
Feature Function      Full Text  Partial Stop  Full Stop
Unigrams              0.73       0.73          0.73
Bigrams               0.69       0.69          0.68
Trigrams              0.68       0.67          0.66
Bigrams with Tokens   0.72       0.70          0.69
Trigrams with Tokens  0.69       0.68          0.68
Combined              0.68       0.67          0.67

As the table shows, removing more stop words actually decreased performance, even though word counts confirmed that these words were extremely common in the lyrics. The effect was fairly marginal, so it is unclear whether the cause was our dataset or our choice of stop words. Future work could focus on finding a stop word list more representative of music, or on combining stop words with a music-specific stemmer.

6.2.4 Length Features

Length features underperform relative to traditional NLP features, but usually outperform random chance. They performed best on the linear-kernel SVM for all labelers. On the 50/50 and bi-decade labelers, overall song length (total word count) significantly outperformed average number of words per line. Average syllables per line and average sounds per line were decidedly bad features on their own, able to beat random chance only on the individual decade labeler, and not by any significant margin. The best performance, best on all labelers but bi-decade, where it was outdone by pure word count, came from a combination of average line length, the ratio of total lines to total words, and total number of words, presumably because this combination captured the dependencies between the features (more lines tends to mean lower average line length). That said, word count and the combined feature performed quite similarly, indicating that word count was probably a heavily weighted feature within the combination. The chart below details results from the small dataset, because running the SVM on the larger one was not feasible. (In the original report, red marked features that performed worse than random chance.)

Feature                 Labeler            Logistic F1  Logistic L1 F1  SVM F1
Wordcount               50/50              0.56         0.56            0.69
Wordcount               bi-decade          0.37         0.37            0.51
Wordcount               individual decade  0.15         0.15            0.21
Avg wordcount per line  50/50              0.56         0.56            0.48
Avg wordcount per line  bi-decade          0.34         0.34            0.20
Avg wordcount per line  individual decade  0.14         0.14            0.06
Avg sounds per line     50/50              0.56         0.56            0.49
Avg syllables per line  50/50              0.56         0.56            0.49
Combined length         50/50              0.56         0.56            0.70
Combined length         bi-decade          0.44         0.44            0.46
Combined length         individual decade  0.17         0.17            0.23

6.2.5 Rhyme Schemes

Unsurprisingly, the combined rhyme feature, which included rhymes between consecutive lines and alternating lines, rhymes within the next three lines, average syllables and sounds per line, and the ratio of total lines to total words, performed the best. That said, as with the length features, the individual rhyme features were not standouts. They tended to do passably on the 50/50 labeler, where all beat random chance, but not so well on the finer-grained labelers: on the bi-decade labeler, for example, only the 1-3-2-4 scheme and the combined feature beat random chance.

Feature           Labeler            Logistic F1  Logistic L1 F1  SVM F1
Combined rhyming  50/50              0.56         0.56            0.70
Combined rhyming  bi-decade          0.44         0.44            0.46
Combined rhyming  individual decade  0.11         0.22            0.25

6.3 Feature Analysis

The most predictive features for each decade reveal interesting trends in the popular music of the times. In many ways they are predictable: "Tutti Frutti" was more common in the 1940s-70s, while "hoes" and "turn it up" are more common from the 1980s onward. However, the lack of examples from early decades becomes readily apparent in the features of the bi-decade and individual decade labelers. For instance, "Deacon Jones", while likely unique to the 1940s, is probably not emblematic of the era. This scarcity stems from the website we scraped and from the Million Song Dataset: the MSD is skewed towards modern songs, and the songs on our scraping website are added by its community. As a result, we see songs brought back by nostalgia or video game references. "Santa Claus" is a strong 1940s feature, which is not surprising, since Christmas music tends to be timeless. However, "that pistol down", "that pistol", "pistol down", and "Lay that", the other four top 1940s features, all come from a Bing Crosby song called "Pistol Packin' Mama", which was recently featured in Bethesda's video game Fallout 4.

We do see a shift in the general lyrics over time, and as people who grew up in the 2000s and 2010s, the top lyrics "Party like a" and "We're up all" for 2000 and 2010 respectively seem to make sense to us, as do "stanky" and "legg". Some of the 1980s features seem almost stereotypical, like "Funkadela", while the 1960s had "boogety" and "hitch hike". Further feature analysis with a larger dataset would probably do better at capturing the zeitgeist of each decade, and with further data cleaning we could probably obtain more standardized features.

We also observe that in eras after the 1940s and 1950s it is more common for a song to have a stricter, more frequent chorus line. The prediction values of the top-5 most predictive unigrams and bigrams for the 1940/1950 pairing (the best bigram, for example, has a .4255 prediction value under Bayes' rule) are significantly lower than the top-5 predictive values for the other decade pairings, all of which are above .9 for bigrams (see the tables in the appendix). While this could indicate that 1940s and 1950s choruses use the same words other eras do (thus making them less predictive under Bayes), that is unlikely here: intuitively, "Rootie" and "Tootie" do not seem like words that would be frequent in more recent songs.

6.4 Error Analysis

Improper data cleaning was the root cause of many of our issues. For instance, some of the most predictive features for the 50-50 classifier were "Verse" and "<S> Verse", which are clearly human-entered labels and not actual elements of the lyrics. It would also be helpful to lowercase everything and remove punctuation characters like apostrophes and quotation marks. Finally, as in most cases, more data would be ideal: because lyrics are copyrighted, it is difficult to obtain large quantities of verified data, and many websites are difficult to scrape.
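A sketch of the kind of extra cleaning this suggests is below; the regular expressions are illustrative guesses at the scraper's artifacts, not patterns taken from the actual data.

```python
# Illustrative cleaning pass for the issues noted in Section 6.4: strip
# human-entered section labels such as "Verse", lowercase the text, and
# remove punctuation. The patterns are guesses, not the authors' code.
import re

SECTION_LABEL = re.compile(r"^\s*\[?\s*(verse|chorus|bridge|intro|outro)\b.*$",
                           re.IGNORECASE | re.MULTILINE)

def clean_lyrics(raw):
    text = SECTION_LABEL.sub("", raw)      # drop annotation lines
    text = text.lower()                    # case-fold everything
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation, incl. apostrophes/quotes
    return re.sub(r"[ \t]+", " ", text).strip()
```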

References

[DH10] Xiao Hu and J. Stephen Downie. When lyrics outperform audio for music mood classification: A feature analysis. In Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), 2010.

[Kum10] Carol L. Krumhansl. Plink: "Thin slices" of music. Music Perception: An Interdisciplinary Journal, 27(3):337-354, 2010.

[MNR08] Rudolf Mayer, Robert Neumayer, and Andreas Rauber. Rhyme and style features for musical genre classification by song lyrics. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR), 2008.

7 Appendix

7.1 Most predictive words in Naive Bayes Classifier

The following tables are arranged so that "unigrams 1" is the most predictive unigram (by direct Bayesian probability) and "unigrams 5" is the fifth-best unigram feature for the class at the top of the column. Prediction values are rounded, and ties were broken alphabetically. The combined feature function is a combination of unigrams, bigrams with tokens, and trigrams with tokens.

For the 50/50 classifier:

Feature      1940-1979                 1980-2010
UNIGRAMS
unigrams 1   Tutti, .9864              niggaz, .9978
unigrams 2   boogety, .9862            tha, .9964
unigrams 3   Wages, .9860              hoes, .9963
unigrams 4   Sloopy, .9853             niggas, .9939
unigrams 5   Elis, .9838               nigga, .9934
BIGRAMS
bigrams 1    Mr Lee, .9834             a nigga, .9965
bigrams 2    hitch hike, .9829         this shit, .9942
bigrams 3    the hump, .9826           yo yo, .9923
bigrams 4    Tutti Frutti, .9826       the fuck, .9913
bigrams 5    Los Wages, .9826          fuck with, .9912
TRIGRAMS
trigrams 1   I want Thats, .9779       it Want it, .9930
trigrams 2   want Thats what, .9775    Want it Want, .9926
trigrams 3   over the hump, .9771      Jah la man, .9926
trigrams 4   star fucker star, .9760   turn it up, .9918
trigrams 5   Too much pressure, .9756  man Jah la, .9913
COMBINED
combined 1   um um, .9926              nigga, .9926
combined 2   um um um, .9903           u, .9917
combined 3   Mr Lee, .9897             Imma, .9896
combined 4   boogety, .9894            Verse, .9855
combined 5   <S> night, .9889          <S> Verse, .9844

For the paired decade labeler:

Feature      1940/1950                         1960/1970                 1980/1990                  2000/2010
UNIGRAMS
unigrams 1   Rootie, .6030                     boogety, .9698            Babba, .9725               niggaz, .9868
unigrams 2   Tootie, .5350                     Elis, .9647               Wages, .9673               nigga, .9791
unigrams 3   Attorney, .4647                   rutti, .9641              pegs, .9668                lai, .9740
unigrams 4   hobble, .4510                     awimoweh, .9592           IceT, .9652                niggas, .9732
unigrams 5   District, .3946                   ShooBop, .9538            Funkadelala, .9564         yuh, .9721
BIGRAMS
bigrams 1    Deacon Jones, .2254               Mr Lee, .9550             Harlem Harlem, .9863       la man, .9826
bigrams 2    Rootie Tootie, .1704              hitch hike, .9537         Babba Do, .9798            Jah la, .9826
bigrams 3    happening everyday, .1128         oh rutti, .9451           ghetto The, .9790          a nigga, .9810
bigrams 4    Jones Deacon, .1128               frutti oh, .9451          good ooh, .9773            Lies Lies, .9810
bigrams 5    District Attorney, .1050          Simple Simon, .9366       mellow when, .9763         the ounce, .9796
TRIGRAMS
trigrams 1   sho is hard, .0747                frutti oh rutti, .9261    ghetto The ghetto, .9812   Want it Want, .9847
trigrams 2   things happening everyday, .0698  ahh ahh ahh, .9222        The ghetto The, .9798      Jah la man, .9847
trigrams 3   strange things happening, .0698   Tutti frutti oh, .9148    mellow when Im, .9786      man Jah la, .9820
trigrams 4   find me cryin, .0698              Simple Simon says, .9114  be mellow when, .9876      la man Jah, .9820
trigrams 5   are strange things, .0698         Mr Lee Mr, .9038          Ill be mellow, .9875       To the ounce, .9806

For the individual decade labeler (split into two tables):

Feature      1940                     1950                    1960               1970
UNIGRAMS
unigrams 1   Rootie, .3938            Banua, .8100            boogety, .9254     Wages, .9249
unigrams 2   Tootie, .3610            Diddy, .7076            Elis, .9134        steward, .8935
unigrams 3   Deacon, .2944            biga, .7031             awimoweh, .9007    HeyO, .8844
unigrams 4   hobble, .2875            Matelot, .6681          ShooBop, .8884     CM, .8792
unigrams 5   Attorney, .2707          Atell, .6030            hike, .8756        Neat, .8791
COMBINED
combined 1   that pistol down, .4255  Mr Lee, .9259           um um, .9561       on up </S>, .9064
combined 2   that pistol, .4255       <S> night and, .9171    um um um, .9426    beat goes on, .9043
combined 3   pistol down, .4255       lama, .9118             boogety, .9378     Bennie, .9024
combined 4   Santa Claus, .3941       jungle jungle, .9041    hike, .9302        Get on up, .8992
combined 5   Lay that, .3901          an around, .9038        Hitch hike, .9288  who who, .8955

Feature      1980                     1990                    2000                   2010
UNIGRAMS
unigrams 1   IceT, .9172              lai, .9402              BANG, .9193            Amelle, .8038
unigrams 2   Funkadela, .8975         Babba, .9338            stanky, .9020          BOB, .7674
unigrams 3   Ludd, .8775              Mistah, .9196           legg, .9003            Vanderpool, .7393
unigrams 4   Undercover, .8696        Bart, .9078             shik, .8890            TUNING, .6830
unigrams 5   Antmusic, .8642          oie, .9012              Ziggy, .8807           seo, .6447
COMBINED
combined 1   down on it, .9356        La La, .9187            da na, .8981           whooooo, .7814
combined 2   Get down on, .9339       The promise, .8949      Party like a, .8893    Were up all, .7696
combined 3   em when, .9116           <S> The promise, .8949  Party like, .8893      Were up, .7696
combined 4   em when theyre, .9089    wants to give, .8836    <S> Party like, .8893  <S> Were up, .7697
combined 5   you okay, .8868          La La La, .8836         This is why, .8756     imma be, .7356