Generating lyrics with the variational autoencoder and multi-modal artist embeddings

Olga Vechtomova, Hareesh Bahuleyan, Amirpasha Ghabussi, Vineet John
University of Waterloo, ON, Canada
{ovechtom,hpllik,ghbuss,vineet.john}@uwaterloo.ca

Abstract

We present a system for generating song lyric lines conditioned on the style of a specified artist. The system uses a variational autoencoder with artist embeddings. We propose pre-training the artist embeddings with the representations learned by a CNN classifier, which is trained to predict artists based on MEL spectrograms of their song clips. This work is the first step towards combining the audio and text modalities of songs for generating lyrics conditioned on an artist's style. Our preliminary results suggest that there is benefit in initializing artist embeddings with the representations learned by a spectrogram classifier.

1 Introduction

Outputs of neural generative models can serve as an inspiration for artists, writers and musicians when they create original artwork or compositions. In this work we explore how generative models can assist songwriters and musicians in writing song lyrics. In contrast to systems that generate lyrics for an entire song, we propose to generate suggestions for lyric lines in the style of a specified artist. The hope is that unusual and creative arrangements of words in the generated lines will inspire the songwriter to create original lyrics. Conditioning the generation on the style of a specific artist is done in order to maintain stylistic consistency of the suggestions. Such use of generative models is intended to augment the natural creative process, when an artist may be inspired to write a song based on something they have read or heard. We use the variational autoencoder (VAE) [1] with Long Short-Term Memory networks (LSTMs) as the encoder and decoder, and a trainable artist embedding, which is concatenated with the input to every time step of the decoder LSTM.
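This conditioning step can be illustrated with a minimal NumPy sketch: the artist embedding is tiled across time steps and concatenated with each word embedding before being fed to the decoder. The dimensions below are made up for readability (the paper's actual word and artist embedding sizes are 300 and 50), and all names are illustrative:

```python
import numpy as np

def condition_decoder_inputs(word_embs, artist_emb):
    """Concatenate the artist embedding onto the word embedding
    at every decoder time step (illustrative sketch)."""
    # word_embs: (seq_len, word_dim); artist_emb: (artist_dim,)
    seq_len = word_embs.shape[0]
    tiled = np.tile(artist_emb, (seq_len, 1))  # (seq_len, artist_dim)
    return np.concatenate([word_embs, tiled], axis=1)

rng = np.random.default_rng(0)
word_embs = rng.normal(size=(5, 4))   # 5 time steps, word dim 4
artist_emb = rng.normal(size=2)       # artist dim 2
inputs = condition_decoder_inputs(word_embs, artist_emb)
print(inputs.shape)  # (5, 6)
```

Because the same artist vector is present at every time step, the decoder can use it throughout generation rather than only at the first step.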
The advantage of using the VAE for creative tasks, such as lyrics generation, is that once the VAE is trained, any number of lines can be generated by sampling from the latent space. The unique style of each musician is a combination of their musical style and the style expressed through their lyrics. We therefore compare randomly initialized trainable artist embeddings with embeddings pre-trained by a convolutional neural network (CNN) classifier optimized to predict the artist based on MEL spectrograms of 10-second song clips.

There are a large number of approaches towards poetry generation. Some approaches focus on such characteristics as rhyme and poetic meter [2], while others on generating poetry in the style of a specific poet [3]. In [4] the authors propose image-inspired poetry generation. The approach of using style embeddings in controlled text generation is not new, and has been explored in generating text conditioned on sentiment [5, 6] and persona-conditioned responses in dialogue systems [7]. To our knowledge there has been no prior work on using the music audio and text modalities to generate song lyrics. We explore whether artist embeddings learned based on music audio are useful in generating lyrics in the style of a given artist. Our preliminary results suggest that there is some benefit in using multi-modal embeddings for conditioned lyrics generation.
[Figure 1 diagram: (a) Training, (b) Inference]

Figure 1: Overview of our approach. First, a CNN is implemented to classify artists based on spectrogram images, thereby learning artist embeddings. Then, a VAE is trained to reconstruct lines from song lyrics, conditioned on the pre-trained artist embeddings. At inference time, in order to generate lyrics in the style of a desired artist, we sample z from the latent space and decode it conditioned on the embedding of that artist.

2 Model and Experiments

We collected a dataset of lyrics by seven artists, one from each of the following genres: Art Rock, Electronic, Industrial, Classic Rock, Alternative, Hard Rock, and Psychedelic Rock. Each of the selected artists has a distinct musical and lyrical style, and a large catalogue of songs spanning many years. In total our dataset contains 34,000 lines of song lyrics. To obtain pre-trained artist embeddings, we split the waveform audio of the artists' songs into 10-second clips, and transformed them into MEL spectrograms.¹ The dataset consists of 21,235 spectrograms. Next, a VGG16 [8] pre-trained CNN classifier was trained to predict artists based on spectrograms.² The classifier achieved an accuracy of 83% on the test set. The last hidden layer of the classifier was used to initialize the artist embeddings of the lyrics VAE. The VAE is trained to perform the task of sentence reconstruction on the lyrics dataset. At inference time, we sample data points from the learned latent space, and pass them to the decoder together with the embedding of the artist in whose style we want to generate lyrics. Two variants of this model were evaluated: VAE+audioT and VAE+audioNT, with trainable and non-trainable artist embeddings, respectively. As baselines, we implemented the VAE model with randomly initialized artist embeddings, VAE+rand (trainable) and VAE+randNT (non-trainable), and a VAE with artist embeddings as one-hot encodings (VAE+onehot).
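One plausible way to realize this initialization is sketched below. Note the assumptions: the paper does not state how the per-clip last-hidden-layer activations (50 units) are aggregated into one vector per artist; here we average them over each artist's clips, and the feature matrix is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-clip features from the classifier's last hidden
# layer (50 units), with an artist label for each 10-second clip.
clip_features = rng.normal(size=(120, 50))
clip_artist = np.arange(120) % 7  # 7 artists, round-robin labels

def init_artist_embeddings(features, labels, n_artists):
    """Average the classifier's last-hidden-layer activations over
    each artist's clips to form that artist's initial embedding."""
    return np.stack([features[labels == a].mean(axis=0)
                     for a in range(n_artists)])

embeddings = init_artist_embeddings(clip_features, clip_artist, 7)
print(embeddings.shape)  # (7, 50)
```

The resulting (7, 50) matrix would serve as the initial value of the VAE's artist embedding table, either frozen (NT variants) or further trained (T variants).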
All VAE models were trained by annealing the coefficient of the KL cost up to 3,000 iterations, with a decoder input word dropout of 0.5 [9, 10]. The encoder is a bi-directional LSTM with 100 hidden units. The dimension of the artist embedding vector was set to 50 in all VAEs except for VAE+onehot. We used 300-dimensional word2vec embeddings pre-trained on a large corpus of song lyrics (2.5M lines). The spectrogram CNN model consists of a CNN base, which is the VGG-16 with pre-trained ImageNet weights [8]. The CNN base is followed by three fully-connected layers (512, 128, 50 units) with 30% dropout. The model was trained for 20 epochs. Classification accuracy on the test set (80/10/10 training/validation/test split) was 83%. While creating the data splits, we ensured that the audio clips for the same song remained in the same set.

¹ https://www.kaggle.com/vinvinvin/high-resolution-mel-spectrograms/notebook
² https://github.com/pshpnther/deep-music-genre-classification
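The KL annealing and word dropout used in training can be sketched as follows. This is a minimal illustration: the linear annealing shape and the `<unk>` replacement token are assumptions, since the paper specifies only the 3,000-iteration horizon and the 0.5 dropout rate:

```python
import random

def kl_weight(step, anneal_steps=3000):
    """Linearly anneal the KL coefficient from 0 to 1 over the
    first `anneal_steps` training iterations (assumed schedule)."""
    return min(1.0, step / anneal_steps)

def word_dropout(tokens, rate=0.5, unk="<unk>", seed=None):
    """Randomly replace decoder input words with an <unk> token,
    forcing the decoder to rely on the latent code z [9]."""
    rng = random.Random(seed)
    return [unk if rng.random() < rate else t for t in tokens]

print(kl_weight(1500))  # 0.5
print(word_dropout("love can drown your heart".split(), rate=0.5, seed=0))
```

Annealing the KL term and corrupting the decoder inputs are the standard remedies for posterior collapse in text VAEs, where the decoder would otherwise ignore z.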
Electronic     like shackles of the eternal night
               oh i want to shake the sun
               black obsession is wearing in your soul
Industrial     every inch of a reptile in your head
               just when the jagged sound
               i watch me get into my skin
Art Rock       love can drown your heart
               no way to heaven where she stands
               when the shadows were young
Alternative    i am drifting away from the sea
               for a superior betrayal
               forevermore he held the earth

Table 1: Lyric lines generated by VAE+audioNT

3 Evaluation and Results

To quantitatively evaluate whether the generated lyrics adhere to the style of the artist they were conditioned on, we trained a CNN classifier [11]³ on the original lyrics of the selected seven artists. This is a commonly used approach to evaluate the style attribute of generated texts, e.g. in style transfer [12]. The results are presented in Table 2. The accuracy on the original lyrics is 60%, and the majority-class baseline is 17.7%. The VAE+audioNT model received the highest style classification accuracy of 42%, which suggests that there is some benefit in pre-training artist embeddings on spectrogram images. The performances of VAE+rand and VAE+randNT are somewhat variable between training instances due to the random initialization of the artist embeddings. This is evident from the fact that after each training instance the embedding vectors of different pairs of artists have the highest cosine similarity. To account for this variability, we trained five instances of each model and averaged their evaluation results. The VAE+rand and VAE+randNT results presented in Tables 2 and 3 are averaged over five training instances.

Model          Accuracy
VAE+onehot     0.266
VAE+rand       0.368
VAE+randNT     0.396
VAE+audioT     0.361
VAE+audioNT    0.420

Table 2: Style classification accuracy on the generated lyric lines

We also trained a Kneser-Ney smoothed trigram language model [13] on the corpus of each artist's lyrics, and then used each of the seven artists' language models to score the lyrics generated for any given artist.
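A self-contained sketch of this scoring procedure is shown below. Note one deliberate substitution: to keep the example dependency-free we use add-one smoothing instead of the Kneser-Ney smoothing used in the paper, and the two-line "corpus" is for illustration only:

```python
import math
from collections import Counter

def train_trigram_lm(lines):
    """Count trigrams (with sentence-boundary padding) over an
    artist's lyric lines."""
    tri, bi, vocab = Counter(), Counter(), set()
    for line in lines:
        toks = ["<s>", "<s>"] + line.split() + ["</s>"]
        vocab.update(toks)
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    return tri, bi, len(vocab)

def neg_log_likelihood(lm, line):
    """Average per-token negative log-likelihood of a line under
    the trigram model (add-one smoothing stands in for Kneser-Ney)."""
    tri, bi, v = lm
    toks = ["<s>", "<s>"] + line.split() + ["</s>"]
    nll = 0.0
    for i in range(2, len(toks)):
        p = (tri[tuple(toks[i - 2:i + 1])] + 1) / (bi[tuple(toks[i - 2:i])] + v)
        nll -= math.log(p)
    return nll / (len(toks) - 2)

corpus = ["love can drown your heart", "no way to heaven where she stands"]
lm = train_trigram_lm(corpus)
print(neg_log_likelihood(lm, "love can drown your heart"))
```

Lines that match an artist's phrasing receive a lower negative log-likelihood under that artist's model, which is exactly the signal the evaluation relies on.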
The intuition is that if the model successfully generates lyrics in the style of a given artist, then that artist's language model should result in the lowest negative log-likelihood value. In VAE+audioNT, for six out of seven artists, the lowest negative log-likelihood values were given by the model trained on the same artist's lyrics, which suggests that our model generates lyrics in the style of a specified artist. Table 3 contains the results of the VAE model with randomly initialized artist embeddings, while Table 4 contains the results of VAE+audioNT.

While the above metrics evaluate how well the models generate lines in the style of an artist, perfect scores would be obtained by systems that simply learn to reproduce the original lines, whereas what we want are new lines that are inspired by the artist's lyrics, but are not verbatim copies. The number of verbatim copies among all evaluated models was very low (2-3%). Also, all models generated diverse lines: 98%-99% of lines are unique.

A small-scale human evaluation was conducted to assess how close the generated lines are to the style of a specific artist. We recruited three annotators, one of whom was familiar with the selected artists in the Electronic and Classic Rock genres, and two annotators were familiar with one artist each. We obtained 100 samples of lyric lines generated by each of the four VAE models conditioned on each artist, shuffled them, and presented them to each evaluator. The evaluators were asked to select the lines that resemble the style of the given artist. The results (Table 5) indicate that except for one case, VAE+audioT and VAE+audioNT generated the most lines in the style of the given artist, although the

³ https://github.com/dennybritz/cnn-text-classification-tf
                       Language model
Artist genre           AR     E      I      CR     A      HR     PR
Art Rock (AR)          16.9   17.44  17.32  17.55  17.79  17.89  17.5
Electronic (E)         17.49  16.23  16.63  17.34  17.48  17.47  17.34
Industrial (I)         17.37  16.85  15.68  17.42  17.3   17.51  17.32
Classic Rock (CR)      17.66  17.39  17.24  16.99  17.8   17.89  17.48
Alternative (A)        17.47  17.18  16.82  17.43  16.82  17.54  17.23
Hard Rock (HR)         16.83  16.54  16.6   16.82  16.91  16.22  16.86
Psychedelic Rock (PR)  17.1   17.14  17.12  17.19  17.43  17.53  16.29

Table 3: Negative log-likelihood values for the lyrics generated by VAE+randNT. The language models were trained on the original lyrics of the artists.

                       Language model
Artist genre           AR     E      I      CR     A      HR     PR
Art Rock (AR)          15.5   15.95  16.19  16.04  16.29  16.43  15.81
Electronic (E)         16.38  15.08  15.89  16.36  16.38  16.31  16.36
Industrial (I)         16.47  16.01  15.16  16.66  16.47  16.61  16.37
Classic Rock (CR)      17.09  16.86  16.78  16.32  17.07  17.07  16.88
Alternative (A)        17.74  17.3   16.92  17.77  16.95  17.67  17.35
Hard Rock (HR)         17.49  17.04  17.07  17.13  17.63  16.7   17.28
Psychedelic Rock (PR)  17.07  17.23  17.15  17.27  17.22  17.24  16.37

Table 4: Negative log-likelihood values for the lyrics generated by VAE+audioNT. The language models were trained on the original lyrics of the artists (smaller values are better).

differences are rather small. Cohen's kappa between the pairs of annotators was low, which can be explained by the subjective nature of judging an artist's style.

               Electronic                Classic Rock
Model          Annot. A    Annot. B     Annot. A    Annot. C
VAE+onehot     0.79        0.29         0.67        0.34
VAE+rand       0.8         0.35         0.67        0.3
VAE+audioT     0.79        0.33         0.7         0.32
VAE+audioNT    0.73        0.37         0.6         0.4

Table 5: Manual evaluation results (ratio of selected lines out of 100 generated lines per artist).

4 Conclusions and Future Work

Our initial results are promising and suggest that pre-training artist embeddings on music spectrograms helps to condition lyric generation on the artist's style. Since artist embeddings are pre-trained using a separate model, their meaning is not known to the VAE.
However, the difference between artist embeddings is meaningful, as it reflects the difference between their musical styles. Our approach is based on the assumption that artists with similar musical styles, and hence similar audio-derived embeddings, have more similar lyrical styles than artists that are very different musically. In future work, we plan to evaluate other models for the pre-training of artist embeddings, for example spectrogram autoencoders. We will also explore other approaches to learn multi-modal representations, e.g. [14] and adversarial approaches.

References

[1] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.

[2] Xingxing Zhang and Mirella Lapata. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 670-680, 2014.
[3] Aleksey Tikhonov and Ivan P. Yamshchikov. Guess who? Multilingual approach for the automated generation of author-stylized poetry. arXiv preprint arXiv:1807.07147, 2018.

[4] Wen-Feng Cheng, Chao-Chung Wu, Ruihua Song, Jianlong Fu, Xing Xie, and Jian-Yun Nie. Image inspired poetry generation in XiaoIce. arXiv preprint arXiv:1808.03090, 2018.

[5] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, pages 1587-1596, 2017.

[6] Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. Style transfer in text: Exploration and evaluation. In AAAI, pages 663-670, 2018.

[7] Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994-1003. Association for Computational Linguistics, 2016.

[8] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[9] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 10-21, 2016.

[10] Hareesh Bahuleyan, Lili Mou, Olga Vechtomova, and Pascal Poupart. Variational attention for sequence-to-sequence models. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), 2018.

[11] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[12] Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. In NIPS, pages 6833-6844, 2017.

[13] Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In ICASSP, 1995.

[14] Hongru Liang, Haozheng Wang, Jun Wang, Shaodi You, Zhe Sun, Jin-Mao Wei, and Zhenglu Yang.
JTAV: Jointly learning social media content representation by fusing textual, acoustic, and visual features. arXiv preprint arXiv:1806.01483, 2018.