Combining Learned Lyrical Structures and Vocabulary for Improved Lyric Generation

arXiv:1811.04651v1 [cs.AI] 12 Nov 2018

Pablo Samuel Castro
Google Brain
psc@google.com

Maria Attarian
Google
jmattarian@google.com

Abstract

The use of language models for generating lyrics and poetry has received increased interest in the last few years. They pose a unique challenge relative to standard natural language problems, as their ultimate purpose is creative; notions of accuracy and reproducibility are secondary to notions of lyricism, structure, and diversity. In this creative setting, traditional quantitative measures for natural language problems, such as BLEU scores, prove inadequate: a high-scoring model may fail to produce output respecting the desired structure (e.g. song verses), be a terribly boring creative companion, or both. In this work we propose a mechanism for combining two separately trained language models into a framework that is able to produce output respecting the desired song structure, while providing a richness and diversity of vocabulary that renders it more creatively appealing.

1 Introduction

With the increased realism and sophistication of generative models, artists have been increasingly drawn to incorporate these methods into their creative process. The approaches vary, from transferring style from one artist to another (Dumoulin et al., 2016) to adapting a pre-existing process to produce abstract art that maximizes the likelihood of a category under a classification model (White, 2018). Lyrics are a particularly challenging artistic endeavour; high-quality lyrics typically require following a specific lyric structure, the use of a rich vocabulary, a mastery of the language, and the use of poetic techniques such as metaphors and alliteration. Because of this, the adoption of machine learning models for lyric generation has been slower.
The few cases where machine learning models have been used for lyric generation have required a substantial amount of human intervention. In our submission to the Machine Learning and Creativity Workshop at NIPS 2017 (Castro et al., 2017) we trained a Recurrent Neural Network (RNN) over a dataset of lyrics. Together with renowned Canadian songwriter David Usher, we then manually curated the generated lyrics to rewrite one of his songs (Sparkle and Shine). Although a successful experiment in human-machine collaboration, the lyrics required more manual intervention than we would have liked. We recently switched to the more sophisticated Transformer Language Models (TLMs) (Vaswani et al., 2017) to train over the same dataset. The results are substantially improved; however, although they seem to maintain the general structure of lyrics, they still suffer from a lack of variety.

32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

2 Proposed Framework

Our approach combines two different TLMs. The first model ($L_S$) is trained to capture the structure of lyrics, while the second ($L_V$) is trained to provide a richer vocabulary than what is currently available in the lyrics dataset, while still leveraging the context of the existing lyrics. Given an
initial input lyric $l_1$, we combine these models to produce the next line as follows (PoS will be described below):

$$l_2 = \text{RichLyrics}(l_1) := L_V(l_1 \oplus L_S(\text{PoS}(l_1))).$$

$L_S$: Our dataset consists of a large set of lyrics spanning multiple genres and decades (see Appendix A). Our inputs consist of the separate lines of all the song lyrics, while the targets are the same lines shifted by one (e.g. $\text{line}_n \rightarrow \text{line}_{n+1}$). We pre-processed the lyrics by converting them into their respective Parts-of-Speech (PoS).¹ This was done to ensure that the lyric structure model captures only lyric structure, not vocabulary. We will refer to this conversion process as $\text{PoS}(l)$; in other words, our input-to-target mapping becomes $\text{PoS}(\text{line}_n) \rightarrow \text{PoS}(\text{line}_{n+1})$.

$L_V$: We picked a subset of the Project Gutenberg's Top 20 Books Kaggle dataset² (the list of books used is provided in Appendix B). We split each sentence $s$ into two parts, $s_1$ and $s_2$; the splitting point was chosen at about halfway through the sentence, without splitting any words (see Appendix C for more details). Denoting $\oplus$ as string concatenation, a sentence $s$ is converted to an input-to-target mapping as $s_1 \oplus \text{PoS}(s_2) \rightarrow s_2$. The intuition behind this approach is that $s_1$ provides the context, while $\text{PoS}(s_2)$ provides the structure to be materialized.

3 Empirical Evaluation

In order to evaluate our approach we generated 100 lyric verses using the following procedure. We randomly picked 100 lines from our lyrics dataset as starter lines $l_1$. Then, for each model we incrementally built a verse of 5 lines by setting $l_{i+1} = \text{RichLyrics}(l_i)$. We use beam search with a maximum beam size of 3, so each $l_i$ results in 3 different candidates for $l_{i+1}$. We consider the verse produced by each of these possibilities. This means that for each $l_1$ we produce up to $3^5 = 243$ different verses (depending on the input, the beam search may sometimes produce fewer than 3 variants).
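The composition from Section 2 and the verse-building procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `rich_lyrics` and `build_verses` are hypothetical, the two trained models and the PoS converter are replaced by toy stand-in functions, the structure model is assumed to return a single tag sequence while the vocabulary model returns the beam candidates, and a single space is assumed as the separator for $\oplus$.

```python
from typing import Callable, List


def rich_lyrics(
    line: str,
    pos: Callable[[str], str],
    structure_model: Callable[[str], str],
    vocab_model: Callable[[str], List[str]],
    beam_size: int = 3,
) -> List[str]:
    """Return up to `beam_size` candidates for L_V(line ⊕ L_S(PoS(line)))."""
    next_structure = structure_model(pos(line))  # PoS tags predicted for the next line
    prompt = f"{line} {next_structure}"          # assumption: ⊕ joins with one space
    return vocab_model(prompt)[:beam_size]


def build_verses(
    starter: str,
    next_lines: Callable[[str], List[str]],
    steps: int,
) -> List[List[str]]:
    """Enumerate every verse reachable from `starter` in `steps` generation steps."""
    verses = [[starter]]
    for _ in range(steps):
        # Each partial verse branches once per candidate for its last line.
        verses = [v + [cand] for v in verses for cand in next_lines(v[-1])]
    return verses


# Toy stand-ins (hypothetical) so that the sketch runs without trained models.
def toy_pos(line: str) -> str:
    return " ".join("W" for _ in line.split())


def toy_structure_model(tags: str) -> str:
    return tags  # pretend the next line has the same structure


def toy_vocab_model(prompt: str) -> List[str]:
    first_word = prompt.split()[0]
    return [f"{first_word} variant {k}" for k in range(3)]


def toy_next_lines(line: str) -> List[str]:
    return rich_lyrics(line, toy_pos, toy_structure_model, toy_vocab_model)


verses = build_verses("you told me", toy_next_lines, steps=5)
print(len(verses))  # 243
```

With three candidates per step, five generation steps give the $3^5 = 243$ upper bound quoted above; whenever the vocabulary model returns fewer candidates for some line, the enumeration shrinks accordingly.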
We compared our RichLyrics approach against two baselines: PureLyrics is a TLM trained only on the lyrics dataset; PureBooks is a TLM trained only on the books dataset.

3.1 Quantitative Evaluation

From the verses generated by each model we computed the number of words and average word length per line, the number of line repeats in the verse ($l_i = l_{i+1}$), and the fraction of words repeated from one line to the next. The results are presented in Table 1 and demonstrate that RichLyrics makes use of a much larger vocabulary than PureLyrics, with fewer repeats. Given that the verses are 5 lines long, PureLyrics is repeating lines about half the time! Qualitative examples of the generations are presented in Appendix D and further confirm these quantitative results.

Model        Words per line    Word length per line   Number of line repeats   Fraction of repeated words
PureLyrics   12.853 ± 1.106    3.351 ± 0.060          2.233 ± 0.125            0.671 ± 0.026
PureBooks     6.633 ± 0.024    3.758 ± 0.010          0.002 ± 0.001            0.244 ± 0.002
RichLyrics    6.928 ± 0.046    3.542 ± 0.016          0.441 ± 0.039            0.480 ± 0.009

Table 1: Statistics for the different models.

4 Discussion and Future Work

Although we are able to substantially improve the quality of the generated lyrics, there is still much work ahead of us. We would like to train over a larger set of books, including more recent ones, to obtain a more modern vocabulary. An important aspect of lyric structure that we are investigating is having the generation adapt to rhyming structure and phonetic cadence, as this is something songwriters often use to fit a musical melody. As with most language models, semantic consistency still proves challenging, and is something we are actively investigating.

¹ We used pos_tag from Python's nltk library to extract PoS.
² https://www.kaggle.com/currie32/project-gutenbergs-top-20-books
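The per-verse statistics of Section 3.1 could be computed along the following lines. This is only a sketch: the paper does not spell out the exact definitions, so whitespace tokenization, the reading of "fraction of repeated words" as the share of a line's words that already occur in the previous line, and the function name `verse_statistics` are all assumptions made for illustration. The sketch also assumes the verse contains at least one non-empty line.

```python
from statistics import mean
from typing import Dict, List


def verse_statistics(verse: List[str]) -> Dict[str, float]:
    """Compute per-verse statistics in the spirit of Table 1 (assumed definitions)."""
    tokenized = [line.split() for line in verse]  # assumption: whitespace tokens
    non_empty = [words for words in tokenized if words]

    # Average number of words per line, and average word length per line.
    words_per_line = mean(len(words) for words in non_empty)
    word_length = mean(mean(len(w) for w in words) for words in non_empty)

    # A line repeat is a line identical to the one immediately before it.
    line_repeats = sum(prev == nxt for prev, nxt in zip(verse, verse[1:]))

    # Share of each line's words that already appeared in the previous line.
    overlaps = []
    for prev, nxt in zip(tokenized, tokenized[1:]):
        if nxt:
            seen = set(prev)
            overlaps.append(sum(w in seen for w in nxt) / len(nxt))

    return {
        "words_per_line": words_per_line,
        "word_length": word_length,
        "line_repeats": line_repeats,
        "repeated_word_fraction": mean(overlaps) if overlaps else 0.0,
    }


example = [
    "you told me you loved me",
    "you told me you loved me",
    "i love the girl",
]
print(verse_statistics(example))
```

Averaging these per-verse values over all generated verses would then give one row of the table for each model.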
References

Castro, P. S., Usher, D., and Larochelle, H. (2017). Sparkle and shine 2.0. https://vimeo.com/274650906.

Dumoulin, V., Shlens, J., and Kudlur, M. (2016). A learned representation for artistic style. CoRR, abs/1610.07629.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. CoRR, abs/1706.03762.

White, T. (2018). Perception Engines. https://medium.com/artists-and-machine-intelligence/perception-engi

A Lyrics dataset

These are the genres used for the lyric structure model detailed in Section 2. We exclude Children's Music and Hip-Hop, the latter to reduce the amount of profanity in the generations.

- Alternative / indie
- Country
- Folk
- Jazz
- Metal
- Pop
- R-and-B / Soul
- Rock
- Soundtracks

In total this resulted in over 1 million input/target pairs, and about 91,000 for the test/validation sets.

B Kaggle books dataset

The books used for training the Lyric Vocabulary model detailed in Section 2 were:

- A Tale of Two Cities by Charles Dickens
- Adventures of Huckleberry Finn by Mark Twain
- Alice's Adventures in Wonderland by Lewis Carroll
- Dracula by Bram Stoker
- Emma by Jane Austen
- Frankenstein by Mary Shelley
- Great Expectations by Charles Dickens
- Grimms' Fairy Tales by The Brothers Grimm
- Metamorphosis by Franz Kafka
- Pride and Prejudice by Jane Austen
- The Adventures of Sherlock Holmes by Arthur Conan Doyle
- The Adventures of Tom Sawyer by Mark Twain
- The Count of Monte Cristo by Alexandre Dumas
- The Picture of Dorian Gray by Oscar Wilde
- The Prince by Nicolo Machiavelli
- The Yellow Wallpaper by Charlotte Perkins Gilman

In total this resulted in around 155,000 input/target pairs used for training, and around 13,000 for the test/validation sets.
C Word splitting mechanism

The split assumes that periods denote the end of a sentence (aside from abbreviations that typically contain a period, e.g. Mr. or St.), while also taking into account that quotes in books sometimes mark the end of a sentence, e.g. when a quote is followed by an uppercase word. To prevent very long sentences from skewing the structure learned by the proposed framework, we split a sentence into subsentences of 15 words each if its total number of words exceeds a threshold. Subsequently, we split each sentence approximately in half, taking care not to cut any word in two. The split ratio is parameterized in case we choose to experiment with different ratios. As an example, given the sentence "The quick brown fox jumped over the fence", our procedure would produce the following input and target phrases:

Input: The quick brown fox VBD IN DT NN
Target: jumped over the fence
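The halving step and the resulting input/target pair can be sketched as below. This is an illustrative sketch rather than the authors' code: `split_sentence`, `make_pair`, `toy_tagger` and the `TOY_TAGS` lookup table are hypothetical names, and the toy tagger only mimics the call shape of nltk's `pos_tag` (a list of tokens in, a list of `(token, tag)` pairs out) so that the example runs without nltk or its data files. Sentence detection and the 15-word subsentence chunking are omitted.

```python
from typing import Callable, List, Tuple

Tagger = Callable[[List[str]], List[Tuple[str, str]]]

# Hypothetical lookup table standing in for a real PoS tagger.
TOY_TAGS = {"jumped": "VBD", "over": "IN", "the": "DT", "fence": "NN"}


def toy_tagger(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag tokens by table lookup, defaulting to NN for unknown words."""
    return [(tok, TOY_TAGS.get(tok.lower(), "NN")) for tok in tokens]


def split_sentence(sentence: str, ratio: float = 0.5) -> Tuple[List[str], List[str]]:
    """Split a sentence at a word boundary, roughly at `ratio` of its length."""
    words = sentence.split()
    # Clamp the cut so the first part is never empty and, for sentences of
    # two or more words, the second part is never empty either.
    cut = max(1, min(len(words) - 1, round(len(words) * ratio)))
    return words[:cut], words[cut:]


def make_pair(sentence: str, tagger: Tagger, ratio: float = 0.5) -> Tuple[str, str]:
    """Build (s1 ⊕ PoS(s2), s2) for one sentence, joining with spaces."""
    s1, s2 = split_sentence(sentence, ratio)
    tags = [tag for _, tag in tagger(s2)]
    return " ".join(s1 + tags), " ".join(s2)


model_input, target = make_pair("The quick brown fox jumped over the fence", toy_tagger)
print("Input: ", model_input)  # Input:  The quick brown fox VBD IN DT NN
print("Target:", target)       # Target: jumped over the fence
```

Passing the tagger as an argument keeps the splitting logic independent of the tagging library; a real tagger with the same call shape could be substituted for `toy_tagger`.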
D Qualitative Evaluation

We present some sample lyrics produced by the different models, using the same starter lines. As discussed in Section 2, we generate lines incrementally: $l_1 \rightarrow l_2 \rightarrow \cdots \rightarrow l_n$. In Table 2 we compare PureLyrics with RichLyrics, where the increased variety in outputs produced by RichLyrics is evident.

Table 2: Comparison between PureLyrics and RichLyrics models.

PureLyrics:
i'm not gonna write you a love song cause you tell me it's i'm the man of the woods, i'm the man of the woods you told me you loved nobody else but you you told me you loved me but you loved me

RichLyrics:
you remember the voice of the widow i love the girl of the age i have a regard for the whole i have no doubt of the kind i am sitting in the corner of the mantelpiece you know my secret secret you have my second estate you suit your high origin you have my cursed youth you have my life you told me you wanted everything else, you never would he put his hand on the pillow of the marquis he put his cap on the ground like a stone he put his hand on the latch of a door he put his key in the lock as a key

In Table 3 we compare PureBooks with RichLyrics, which highlights how our proposal produces output that is more reminiscent of real lyrics, both in terms of phrase structure and length.

Table 3: Comparison between PureBooks and RichLyrics models.

PureBooks:
she was a new man but it was not a thing to be done it was a confession you, and i'll tell you all about it i don't understand you, said the young man, and we shall be happy to-day have felt that you were coming to know of yourself i have no doubt of that, said the young man, that you have been a great fancy for a few minutes, and then another?
RichLyrics:
you remember the voice of the widow i love the girl of the age i have a regard for the whole i have no doubt of the kind i am sitting in the corner of the mantelpiece you know my secret secret you have my second estate you suit your high origin you have my cursed youth you have my life you told me you wanted everything else, you never would he put his hand on the pillow of the marquis he put his cap on the ground like a stone he put his hand on the latch of a door he put his key in the lock as a key