Generating lyrics with the variational autoencoder and multi-modal artist embeddings


Olga Vechtomova, Hareesh Bhuleyan, Amirpasha Ghabussi, Vineet John
University of Waterloo, ON, Canada
{ovechtom,hpallika,aghabuss,vineet.john}@uwaterloo.ca

Abstract

We present a system for generating song lyric lines conditioned on the style of a specified artist. The system uses a variational autoencoder with artist embeddings. We propose the pre-training of artist embeddings with the representations learned by a CNN classifier, which is trained to predict artists based on MEL spectrograms of their song clips. This work is the first step towards combining audio and text modalities of songs for generating lyrics conditioned on the artist's style. Our preliminary results suggest that there is a benefit in initializing artists' embeddings with the representations learned by a spectrogram classifier.

1 Introduction

Outputs of neural generative models can serve as an inspiration for artists, writers and musicians when they create original artwork or compositions. In this work we explore how generative models can assist songwriters and musicians in writing song lyrics. In contrast to systems that generate lyrics for an entire song, we propose to generate suggestions for lyric lines in the style of a specified artist. The hope is that unusual and creative arrangements of words in the generated lines will inspire the songwriter to create original lyrics. Conditioning the generation on the style of a specific artist is done in order to maintain stylistic consistency of the suggestions. Such use of generative models is intended to augment the natural creative process when an artist may be inspired to write a song based on something they have read or heard. We use the variational autoencoder (VAE) [1] with Long Short-Term Memory networks (LSTMs) as encoder and decoder, and a trainable artist embedding, which is concatenated with the input to every time step of the decoder LSTM.
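As a concrete illustration (not the authors' code), the conditioning just described amounts to tiling the artist embedding across time and concatenating it with each decoder input; the shapes follow the paper (300-dimension word2vec inputs, 50-dimension artist embeddings), while the function name is ours:

```python
import numpy as np

def condition_decoder_inputs(word_inputs, artist_embedding):
    """Concatenate the artist embedding with the input to every
    time step of the decoder LSTM, as described in the paper."""
    seq_len = word_inputs.shape[0]
    tiled = np.tile(artist_embedding, (seq_len, 1))      # (seq_len, 50)
    return np.concatenate([word_inputs, tiled], axis=1)  # (seq_len, 350)

words = np.zeros((8, 300))   # one lyric line of 8 tokens (word2vec vectors)
artist = np.ones(50)         # embedding of the chosen artist
assert condition_decoder_inputs(words, artist).shape == (8, 350)
```

At inference time the same concatenation is applied to each decoding step, so swapping the artist vector is the only change needed to switch styles.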
The advantage of using the VAE for creative tasks, such as lyrics generation, is that once the VAE is trained, any number of lines can be generated by sampling from the latent space. The unique style of each musician is a combination of their musical style and the style expressed through their lyrics. We therefore compare randomly initialized trainable artist embeddings with embeddings pre-trained by a convolutional neural network (CNN) classifier optimized to predict the artist based on MEL spectrograms of 10-second song clips.

There are a large number of approaches towards poetry generation. Some approaches focus on such characteristics as rhyme and poetic meter [2], while others on generating poetry in the style of a specific poet [3]. In [4] the authors propose image-inspired poetry generation. The approach of using style embeddings in controlled text generation is not new, and has been explored in generating text conditioned on sentiment [5, 6] and persona-conditioned responses in dialogue systems [7]. To our knowledge there has been no prior work on using music audio and text modalities to generate song lyrics. We explore whether artist embeddings learned based on music audio are useful in generating lyrics in the style of a given artist. Our preliminary results suggest that there is some benefit in using multi-modal embeddings for conditioned lyrics generation.

Figure 1: Overview of our approach. (a) Training, (b) Inference. First, a CNN is implemented to classify artists based on spectrogram images, thereby learning artist embeddings. Then, a VAE is trained to reconstruct lines from song lyrics, conditioned on the pre-trained artist embeddings. At inference time, in order to generate lyrics in the style of a desired artist, we sample z from the latent space and decode it conditioned on the embedding of that artist.

2 Model and Experiments

We collected a dataset of lyrics by seven artists, one from each of the following genres: Art Rock, Electronic, Industrial, Classic Rock, Alternative, Hard Rock, and Psychedelic Rock. Each of the selected artists has a distinct musical and lyrical style, and a large catalogue of songs, spanning many years. In total our dataset contains 34,000 lines of song lyrics. To obtain pre-trained artist embeddings, we split the waveform audio of the artists' songs into 10-second clips, and transformed them into MEL spectrograms¹. The dataset consists of 21,235 spectrograms. Next, a VGG16 [8] pre-trained CNN classifier was trained to predict artists based on the spectrograms². The classifier achieved an accuracy of 83% on the test set. The last hidden layer of the classifier was used to initialize the artist embeddings of the lyrics VAE. The VAE is trained to perform the task of sentence reconstruction on the lyrics dataset. At inference time, we sample data points from the learned latent space, and pass them to the decoder together with the embedding of the artist in whose style we want to generate lyrics. Two variants of this model were evaluated: VAE+audioT and VAE+audioNT, with trainable and non-trainable artist embeddings, respectively. For the baseline we implemented the VAE model with randomly initialized artist embeddings: VAE+rand (trainable) and VAE+randNT (non-trainable), and a VAE with artist embeddings as one-hot encodings (VAE+onehot).
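One way to picture the initialization step: the paper states only that the classifier's last hidden (50-unit) layer is used to initialize the artist embeddings, so the per-artist mean-pooling below is our assumption about how a single 50-d vector per artist is obtained from many clip activations:

```python
import numpy as np

def init_artist_embeddings(hidden_acts, clip_artist, n_artists=7):
    """hidden_acts: (n_clips, 50) last-hidden-layer activations, one row
    per spectrogram clip; clip_artist: (n_clips,) artist index per clip.
    Averaging each artist's clip activations is an assumed pooling step."""
    emb = np.zeros((n_artists, hidden_acts.shape[1]))
    for a in range(n_artists):
        emb[a] = hidden_acts[clip_artist == a].mean(axis=0)
    return emb

acts = np.random.default_rng(0).normal(size=(14, 50))
labels = np.repeat(np.arange(7), 2)   # two clips per artist, for the demo
assert init_artist_embeddings(acts, labels).shape == (7, 50)
```

The resulting (7, 50) matrix is what the lyrics VAE's artist-embedding table is initialized with; in the VAE+audioT variant it continues to be updated during training, in VAE+audioNT it is frozen.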
All VAE models were trained by annealing the coefficient of the KL cost up to 3000 iterations and with a decoder input word dropout of 0.5 [9, 10]. The encoder is a bi-directional LSTM with 100 hidden units. The dimension of the artist embedding vector was set to 50 in all VAEs except for VAE+onehot. We used 300-dimension word2vec embeddings pre-trained on a large corpus of song lyrics (2.5M lines).

The spectrogram CNN model consists of a CNN base model, which is the VGG-16 and uses pre-trained ImageNet weights [8]. The CNN base is followed by three fully-connected layers (512, 128, 50 units) with 30% dropout. The model was trained for 20 epochs. Classification accuracy on the test set (80/10/10 training/validation/test split) was 83%. While creating the data splits, we ensured that the audio clips for the same song remained in the same set.

¹ https://www.kaggle.com/vinvinvin/high-resolution-mel-spectrograms/notebook
² https://github.com/pshpnther/deep-music-genre-classification
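The two training tricks above can be sketched as follows; note the linear shape of the annealing schedule is an assumption (the paper states only that the KL coefficient is annealed up to 3000 iterations), and the function names are illustrative:

```python
import numpy as np

def kl_coefficient(step, anneal_steps=3000):
    """Ramp the weight on the KL term of the VAE loss from 0 to 1
    over the first `anneal_steps` iterations."""
    return min(1.0, step / anneal_steps)

def word_dropout(token_ids, unk_id, p=0.5, rng=np.random.default_rng(0)):
    """Randomly replace decoder input tokens with <unk> so the decoder
    cannot rely purely on teacher forcing and must use the latent code."""
    drop = rng.random(len(token_ids)) < p
    return [unk_id if d else t for d, t in zip(drop, token_ids)]

assert kl_coefficient(0) == 0.0 and kl_coefficient(4500) == 1.0
assert word_dropout([1, 2, 3], unk_id=0, p=1.0) == [0, 0, 0]
```

Both tricks address the same failure mode: without them, the decoder LSTM tends to ignore the latent variable z and the KL term collapses to zero.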

Electronic:   like shackles of the eternal night / oh i want to shake the sun / black obsession is wearing in your soul
Industrial:   every inch of a reptile in your head / just when the jagged sound / i watch me get into my skin
Art Rock:     love can drown your heart / no way to heaven where she stands / when the shadows were young
Alternative:  i am drifting away from the sea / for superior betrayal / forevermore he held the earth

Table 1: Lyric lines generated by VAE+audioNT

3 Evaluation and Results

To quantitatively evaluate whether the generated lyrics adhere to the style of the artist they were conditioned on, we trained a CNN classifier [11]³ on the original lyrics of the selected seven artists. This is a commonly used approach to evaluate the style attribute of generated texts, e.g. in style transfer [12]. The results are presented in Table 2. The accuracy on the original lyrics is 60%, and the majority baseline is 17.7%. The VAE+audioNT model received the highest style classification accuracy of 42%, which suggests that there is some benefit in pre-training artist embeddings on spectrogram images. The performances of VAE+rand and VAE+randNT are somewhat variable between training instances due to the random initialization of the artist embeddings. This is evident from the fact that after each training instance the embedding vectors of different pairs of artists have the highest cosine similarity. To account for this variability, we trained five instances of each model and averaged their evaluation results. The VAE+rand and VAE+randNT results presented in Tables 2 and 3 are averaged over five training instances.

Model         Accuracy
VAE+onehot    0.266
VAE+rand      0.368
VAE+randNT    0.396
VAE+audioT    0.361
VAE+audioNT   0.420

Table 2: Style classification accuracy on the generated lyric lines

We also trained a Kneser-Ney smoothed trigram language model [13] on the corpus of each artist's lyrics, and then used each of the seven artists' language models to score the lyrics generated for any given artist.
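A toy version of this language-model evaluation is sketched below. We use add-k smoothing as a simple stand-in for the Kneser-Ney smoothing of [13]; only the scoring interface (mean negative log-likelihood per trigram) matters for the comparison:

```python
import math
from collections import Counter

def train_trigram_nll(lines, k=0.1):
    """Train an add-k smoothed trigram LM on one artist's lyric lines
    and return a function scoring a line by mean negative log-likelihood."""
    tri, bi, vocab = Counter(), Counter(), set()
    for line in lines:
        toks = ["<s>", "<s>"] + line.split() + ["</s>"]
        vocab.update(toks)
        for g in zip(toks, toks[1:], toks[2:]):
            tri[g] += 1
            bi[g[:2]] += 1
    v = len(vocab)

    def nll(line):
        toks = ["<s>", "<s>"] + line.split() + ["</s>"]
        grams = list(zip(toks, toks[1:], toks[2:]))
        return -sum(math.log((tri[g] + k) / (bi[g[:2]] + k * v))
                    for g in grams) / len(grams)

    return nll

score = train_trigram_nll(["love can drown your heart",
                           "no way to heaven where she stands"])
# A line close to this artist's corpus scores a lower NLL than an unrelated one:
assert score("love can drown your heart") < score("completely unrelated words")
```

In the paper's setup, each of the seven artists' LMs scores every model's generated lines, and a well-conditioned model should obtain its lowest NLL under the matching artist's LM.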
The intuition is that if the model successfully generates lyrics in the style of a given artist, then that artist's language model should result in the lowest negative log-likelihood value. In VAE+audioNT, for six out of seven artists, the lowest negative log-likelihood values were given by the model trained on the same artist's lyrics, which suggests that our model generates lyrics in the style of a specified artist. Table 3 contains the results of the VAE model with randomly initialized artist embeddings, while Table 4 contains the results of VAE+audioNT.

While the above metrics evaluate how well the models generate lines in the style of an artist, the perfect scores would be obtained by systems that simply learn to reproduce the original lines, whereas what we want are new lines that are inspired by the artist's lyrics, but are not verbatim copies. The number of verbatim copies among all evaluated models was very low (2-3%). Also, all models generated diverse lines: 98%-99% of lines are unique.

A small-scale human evaluation was conducted to assess how close the generated lines are to the style of a specific artist. We recruited three annotators, one of whom was familiar with the selected artists in the Electronic and Classic Rock genres, and two annotators were familiar with one artist each. We obtained 100 samples of lyric lines generated by each of the four VAE models conditioned on each artist, shuffled and presented them to each evaluator. The evaluators were asked to select the lines that resemble the style of the given artist. The results (Table 5) indicate that except for one case, VAE+audioT and VAE+audioNT generated the most lines in the style of the given artist, although the

³ https://github.com/dennybritz/cnn-text-classification-tf

                        Language model
Artist genre            AR     E      I      CR     A      HR     PR
Art Rock (AR)           16.9   17.44  17.32  17.55  17.79  17.89  17.5
Electronic (E)          17.49  16.23  16.63  17.34  17.48  17.47  17.34
Industrial (I)          17.37  16.85  15.68  17.42  17.3   17.51  17.32
Classic Rock (CR)       17.66  17.39  17.24  16.99  17.8   17.89  17.48
Alternative (A)         17.47  17.18  16.82  17.43  16.82  17.54  17.23
Hard Rock (HR)          16.83  16.54  16.6   16.82  16.91  16.22  16.86
Psychedelic Rock (PR)   17.1   17.14  17.12  17.19  17.43  17.53  16.29

Table 3: Negative log-likelihood values for the lyrics generated by VAE+randNT. The language models were trained on the original lyrics of the artists.

                        Language model
Artist genre            AR     E      I      CR     A      HR     PR
Art Rock (AR)           15.5   15.95  16.19  16.04  16.29  16.43  15.81
Electronic (E)          16.38  15.08  15.89  16.36  16.38  16.31  16.36
Industrial (I)          16.47  16.01  15.16  16.66  16.47  16.61  16.37
Classic Rock (CR)       17.09  16.86  16.78  16.32  17.07  17.07  16.88
Alternative (A)         17.74  17.3   16.92  17.77  16.95  17.67  17.35
Hard Rock (HR)          17.49  17.04  17.07  17.13  17.63  16.7   17.28
Psychedelic Rock (PR)   17.07  17.23  17.15  17.27  17.22  17.24  16.37

Table 4: Negative log-likelihood values for the lyrics generated by VAE+audioNT. The language models were trained on the original lyrics of the artists (smaller values are better).

differences are rather small. Cohen's kappa between the pairs of annotators was low, which can be explained by the subjective nature of judging an artist's style.

              Electronic                 Classic Rock
Model         Annotator A  Annotator B   Annotator A  Annotator C
VAE+onehot    0.79         0.29          0.67         0.34
VAE+rand      0.8          0.35          0.67         0.3
VAE+audioT    0.79         0.33          0.7          0.32
VAE+audioNT   0.73         0.37          0.6          0.4

Table 5: Manual evaluation results (ratio of selected lines out of 100 generated lines per artist).

4 Conclusions and Future Work

Our initial results are promising and suggest that pre-training artist embeddings on music spectrograms helps to condition lyric generation on the artist's style. Since the artist embeddings are pre-trained using a separate model, their meaning is not known to the VAE.
However, the difference between artist embeddings is meaningful, as it reflects the difference between their musical styles. Our approach is based on the assumption that artists with similar musical styles, and hence similar audio-derived embeddings, have more similar lyrical styles than artists that are very different musically. In future work, we plan to evaluate other models for the pre-training of artist embeddings, for example spectrogram autoencoders. We will also explore other approaches to learn multi-modal representations, e.g. [14] and adversarial approaches.

References

[1] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.

[2] Xingxing Zhang and Mirella Lapata. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 670-680, 2014.

[3] Aleksey Tikhonov and Ivan P. Yamshchikov. Guess who? Multilingual approach for the automated generation of author-stylized poetry. arXiv preprint arXiv:1807.07147, 2018.

[4] Wen-Feng Cheng, Chao-Chung Wu, Ruihua Song, Jianlong Fu, Xing Xie, and Jian-Yun Nie. Image inspired poetry generation in XiaoIce. arXiv preprint arXiv:1808.03090, 2018.

[5] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, pages 1587-1596, 2017.

[6] Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. Style transfer in text: Exploration and evaluation. In AAAI, pages 663-670, 2018.

[7] Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994-1003. Association for Computational Linguistics, 2016.

[8] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[9] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 10-21, 2016.

[10] Hareesh Bhuleyan, Lili Mou, Olga Vechtomova, and Pascal Poupart. Variational attention for sequence-to-sequence models. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), 2018.

[11] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[12] Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. In NIPS, pages 6833-6844, 2017.

[13] Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In ICASSP, 1995.

[14] Hongru Liang, Haozheng Wang, Jun Wang, Shaodi You, Zhe Sun, Jin-Mao Wei, and Zhenglu Yang. JTAV: Jointly learning social media content representation by fusing textual, acoustic, and visual features. arXiv preprint arXiv:1806.01483, 2018.