arXiv:1809.04318v1 [cs.CL] 12 Sep 2018

Neural Melody Composition from Lyrics

Hangbo Bao 1, Shaohan Huang 2, Furu Wei 2, Lei Cui 2, Yu Wu 3, Chuanqi Tan 3, Songhao Piao 1, Ming Zhou 2
1 School of Computer Science, Harbin Institute of Technology, Harbin, China
2 Microsoft Research, Beijing, China
3 State Key Laboratory of Software Development Environment, Beihang University, Beijing, China
addf400@foxmail.com, {shaohanh,fuwei,lecu,mingzhou}@microsoft.com, wuyu@buaa.edu.cn, tanchuanqi@nlsde.buaa.edu.cn, piaosh@hit.edu.cn

Abstract

In this paper, we study a novel task that learns to compose music from natural language. Given the lyrics as input, we propose a melody composition model that simultaneously generates a lyrics-conditional melody and the exact alignment between the generated melody and the given lyrics. More specifically, we develop the melody composition model based on the sequence-to-sequence framework. It consists of two neural encoders that encode the current lyrics and the context melody respectively, and a hierarchical decoder that jointly produces musical notes and the corresponding alignment. Experimental results on lyrics-melody pairs of 18,451 pop songs demonstrate the effectiveness of our proposed methods. In addition, we apply a singing voice synthesizer software to synthesize the singing of the lyrics and melodies for human evaluation. Results indicate that our generated melodies are more melodious and tuneful compared with the baseline method.

Introduction

We study the task of melody composition from lyrics, which takes a piece of text as input and aims to compose the corresponding melody as well as the exact alignment between the generated melody and the given lyrics. Specifically, the output consists of two sequences, musical notes and lyric syllables [1], with two constraints. First, each syllable in the lyrics corresponds to at least one musical note in the melody. Second, a syllable in the lyrics may correspond to a sequence of notes, which increases the difficulty of this task. Figure 1 shows a fragment of a Chinese song.
For instance, the last Chinese character 恋 (love) aligns with two notes in the melody.

Figure 1: A fragment of the Chinese song "Drunken Concubine (new version)". Lyrics: 爱恨两茫茫 (Pinyin: ài hèn liǎng máng máng; "It is so far away between love and hate") and 问君何时恋 (Pinyin: wèn jūn hé shí liàn; "When do you want to fall in love again"). The blue rectangles indicate rests, i.e., intervals of silence in a piece of melody. The red rectangles indicate the alignment between the lyrics and the melody, meaning a mapping from each syllable of the lyrics to musical notes. Pinyin indicates the syllable for each Chinese character. We can observe that the second Chinese character 恨 (hate) aligns with one note, E5, and the last Chinese character 恋 (love) aligns with two notes in the melody, which illustrates the one-to-many relationship in the alignment between the lyrics and melody.

* Contribution during internship at Microsoft Research.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
[1] A syllable is a word or part of a word which contains a single vowel sound and that is pronounced as a unit. Chinese is a monosyllabic language, which means words (Chinese characters) predominantly consist of a single syllable (https://en.wikipedia.org/wiki/monosyllabic_language).

There are several existing research works on generating lyrics-conditional melody (Ackerman and Loker 2017; Scirea et al. 2015; Monteith, Martinez, and Ventura 2012; Fukayama et al. 2010). These works usually treat the melody composition task as a classification or sequence labeling problem. They first determine the number of musical notes by counting the syllables in the lyrics, and then predict the musical notes one after another by considering previously generated notes and the corresponding lyrics. However, these works only consider the one-to-one alignment between the melody and lyrics. According to our statistics on 18,451 Chinese songs, 97.9% of songs contain at least one syllable that corresponds to multiple musical notes (i.e.
one-to-many alignment), thus the simplification may introduce bias into the task of melody composition.

In this paper, we propose a novel melody composition model which can generate melody from lyrics and handle the one-to-many alignment between the generated melody and the given lyrics well. For the given lyrics as input, we first divide the input lyrics into sentences and then use our model to compose a piece of melody for each sentence, one by one. Finally, we merge these pieces into a complete melody for the given lyrics. More specifically, the model consists of two encoders and one hierarchical decoder. The first encoder encodes the syllables in the current lyrics into an array of hidden vectors with a bi-directional recurrent neural network (RNN), and the second encoder leverages an attention mechanism to convert the context melody into a dynamic context

vector with a two-layer bi-directional RNN. In the decoder, we employ a three-layer RNN decoder to produce the musical notes and the alignment jointly, where the first two layers generate the pitch and duration of each musical note and the last layer predicts a label for each generated musical note to indicate the alignment.

We collect 18,451 Chinese pop songs and generate lyrics-melody pairs with precise syllable-note alignment to conduct experiments on our methods and baselines. Automatic evaluation results show that our model outperforms the baseline methods on all the metrics. In addition, we leverage a singing voice synthesizer software to synthesize the singing of the lyrics and melodies and ask human annotators to manually judge the quality of the generated pop songs. The human evaluation results further indicate that the lyrics-conditional melodies generated by our method are more melodious and tuneful compared with the baseline methods.

The contributions of our work in this paper are summarized as follows.
- To the best of our knowledge, this paper is the first work to use an end-to-end neural network model to compose melody from lyrics.
- We construct a large-scale lyrics-melody dataset with 18,451 Chinese pop songs and 644,472 lyrics-context-melody triples, so that neural network based approaches are possible for this task.
- Compared with traditional sequence-to-sequence models, our proposed method can generate the exact alignment as well as the one-to-many alignment between the melody and lyrics.
- The human evaluation verifies that the synthesized pop songs of the generated melody and input lyrics are melodious and meaningful.

Preliminary

We first introduce some basic definitions from music theory and then give a brief introduction to our lyrics-melody parallel corpus. Table 1 lists some mathematical notations used in this paper.

Concepts from Music Theory

Melody can be regarded as an ordered sequence of many musical notes. The basic unit of melody is the musical note, which mainly consists of two attributes: pitch and duration.
The pitch is a perceptual property of sounds that allows their ordering on a frequency-related scale, or more commonly, the pitch is the quality that makes it possible to judge sounds as higher and lower in the sense associated with musical melodies [2]. Therefore, we use a sequence of numbers to represent the pitch. For example, we represent C5 and Eb6 as 72 and 87 respectively, based on the MIDI convention [3]. A rest is an interval of silence in a piece of music; we use R to represent it and treat it as a special pitch. Duration [4] is a particular time interval that describes the length of time that the pitch or tone sounds, i.e., how long or short a musical note lasts.

[2] https://en.wikipedia.org/wiki/pitch_(music)
[3] https://newt.phys.unsw.edu.au/jw/notes.html
[4] https://en.wikipedia.org/wiki/duration_(music)

Table 1: Notations used in this paper

Notation                                   Description
$X$                                        the sequence of syllables in given lyrics
$x_j$                                      the j-th syllable in $X$
$M$                                        the sequence of musical notes in context melody
$m_i$                                      the i-th musical note in $M$
$m^i_{pt}$, $m^i_{dur}$                    the pitch and duration of $m_i$, respectively
$Y$                                        the sequence of musical notes in predicted melody
$y_i$                                      the i-th musical note in $Y$
$y_{<i}$                                   the previously predicted musical notes $\{y_1, \ldots, y_{i-1}\}$ in $Y$
$y^i_{pt}$, $y^i_{dur}$, $y^i_{lab}$       the pitch, duration and label of $y_i$, respectively
Pitch                                      the pitch sequence comprised of each $y^i_{pt}$ in $Y$
Duration                                   the duration sequence comprised of each $y^i_{dur}$ in $Y$
Label                                      the label sequence comprised of each $y^i_{lab}$ in $Y$
$h^j_{lrc}$                                the j-th hidden state in the output of the lyrics encoder
$h^i_{con}$                                the i-th hidden state in the output of the context melody encoder
$c_i$                                      the dynamic context vector at time step i
$c^i_{con}$                                the i-th melody context vector from the context melody encoder
R                                          indicates the rest, specifically

Lyrics-Melody Parallel Corpus

Figure 2 shows an example of a lyrics-melody aligned pair with precise syllable-note alignment, where each Chinese character of the lyrics aligns with one or more notes in the melody. The generated melody consists of three sequences, Pitch, Duration and Label, where the Label sequence represents the alignment between melody and lyrics.

Figure 2: An illustration of lyrics-melody aligned data. The figure shows a fragment of sheet music for the lyrics 爱恨两茫茫 问君何时恋 (Pinyin: ài hèn liǎng máng máng, wèn jūn hé shí liàn) together with the aligned rows Pinyin, Lyrics, Pitch, Duration and Label. The Pitch and the Duration respectively represent the pitch and duration of each musical note. In addition, the Label provides the information on the alignment between the lyrics and melody. To be specific, a musical note is assigned label 1 to denote that it is a boundary of the musical note sub-sequence aligned to the corresponding syllable; otherwise it is assigned label 0. Additionally, we always align the rests with their latter syllables.

Figure 3: An illustration of Songwriter. The lyrics encoder and context melody encoder encode the syllables of the given lyrics and the context melody into two arrays of hidden vectors, respectively. For decoding the i-th musical note $y_i$, Songwriter uses the attention mechanism to obtain a context vector $c^i_{con}$ from the context melody encoder (green arrows) and counts how many labels 1 have been produced in the previous musical notes to obtain $h^j_{lrc}$, which represents the current syllable corresponding to $y_i$, from the lyrics encoder (red arrows) for the melody decoder. In the melody decoder, the pitch layer and duration layer first predict the pitch $y^i_{pt}$ and duration $y^i_{dur}$ of $y_i$, then the label layer predicts a label $y^i_{lab}$ for $y_i$ to indicate the alignment.

We are able to rebuild the sheet music with these three sequences. The Pitch sequence represents the pitch of each musical note in the melody, and R specifically represents a rest in the Pitch sequence. Similarly, the Duration sequence represents the duration of each musical note in the melody. Pitch and Duration together make up a complete melody but do not include information on the alignment between the given lyrics and the corresponding melody. Label contains the alignment information. Each item of Label is one of {0, 1} to indicate the alignment between the musical note and the corresponding syllable in the lyrics. To be specific, a musical note is assigned label 1 to denote that it is a boundary of the musical note sub-sequence aligned to the corresponding syllable; otherwise it is assigned label 0. We can split the musical notes into n parts by label, where n is the number of syllables of the lyrics, and each part is a musical note sub-sequence. Then we can align the musical notes to their corresponding syllables sequentially. Additionally, we always align the rests to their latter syllables. For instance, we can observe that the second rest aligns to the Chinese character 问 (ask).
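The label-based alignment described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: it assumes that label 1 marks the last note of each syllable's sub-sequence (an interpretation consistent with rests being attached to the following syllable), and the function name `align_notes` and the example notes are hypothetical.

```python
from typing import List, Tuple

# A note is (pitch, duration, label); "R" denotes a rest.
Note = Tuple[str, float, int]


def align_notes(syllables: List[str], notes: List[Note]) -> List[Tuple[str, List[Note]]]:
    """Split notes into one sub-sequence per syllable using the labels.

    Assumption: label 1 closes the sub-sequence of the current syllable,
    so a leading rest (label 0) is grouped with the syllable that follows.
    """
    groups: List[List[Note]] = []
    current: List[Note] = []
    for note in notes:
        current.append(note)
        if note[2] == 1:  # boundary reached: close this syllable's group
            groups.append(current)
            current = []
    if current:
        raise ValueError("trailing notes are not closed by a label 1")
    if len(groups) != len(syllables):
        raise ValueError("number of label-1 notes must equal number of syllables")
    return list(zip(syllables, groups))


# Hypothetical example: a rest followed by one note, then a two-note syllable.
example = align_notes(
    ["wen", "jun"],
    [("R", 0.5, 0), ("E4", 0.5, 1), ("D5", 0.25, 0), ("C5", 0.25, 1)],
)
for syllable, group in example:
    print(syllable, [pitch for pitch, _, _ in group])
```

The check on the number of groups mirrors the requirement that the notes split into exactly n parts, one per syllable.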
Task Definition

Given lyrics as the input, our task is to generate the melody and alignment that make up a song with the lyrics. We can formally define this task as follows. The input is a sequence $X = (x_1, \ldots, x_{|X|})$ representing the syllables of the lyrics. The output is a sequence $Y = (y_1, \ldots, y_{|Y|})$ representing the predicted musical notes for the corresponding lyrics, where $y_i = \{y^i_{pt}, y^i_{dur}, y^i_{lab}\}$. In addition, the output sequence $Y$ should satisfy the following restriction:

$$ |X| = \sum_{i=1}^{|Y|} y^i_{lab} \tag{1} $$

which ensures that the generated melody can be exactly aligned with the given lyrics.

Approach

In this section, we present the end-to-end neural network model, termed Songwriter, that composes a melody which aligns exactly to the given input lyrics. Figure 3 provides an illustration of Songwriter.

Overview

Given lyrics as the input, we first divide the lyrics into sentences and then use Songwriter to compose each piece of the melody sentence by sentence. For each sentence in the lyrics, Songwriter takes as input the syllables in the sentence and the context melody, which consists of some previously predicted musical notes, and then predicts a piece of melody. When the last piece of melody has been predicted, we merge these pieces of melody to make a complete song with the given lyrics. This procedure can be considered a sequence generation problem with two sequences as input: the syllables of the current lyrics $X$ and the context melody $M$. We develop

our melody composition model based on a modified RNN encoder-decoder (Cho et al. 2014a) to support multiple sequences as input. Songwriter employs two neural encoders, the lyrics encoder and the context melody encoder, to respectively encode the syllables of the current lyrics $X$ and the context melody $M$, and leverages a hierarchical melody decoder to produce the musical notes and the alignment $Y$. To be specific, the lyrics encoder and context melody encoder encode $X$ and $M$ into two arrays of hidden vectors, respectively. At time step i, the melody decoder obtains a context vector $c^i_{con}$ from the context melody encoder and a hidden vector $h^j_{lrc}$ from the lyrics encoder to produce the i-th musical note $y_i$. $c^i_{con}$ is computed dynamically by the attention mechanism from the output of the context melody encoder. $h^j_{lrc}$ is one of the output hidden vectors of the lyrics encoder, which represents the j-th syllable $x_j$ in the current lyrics. In the melody decoder, which is a three-layer RNN, the pitch layer and duration layer first predict the pitch $y^i_{pt}$ and duration $y^i_{dur}$, then the label layer predicts a label $y^i_{lab}$ of $y_i$ to indicate the alignment.

Gated Recurrent Units

We use the Gated Recurrent Unit (GRU) (Cho et al. 2014b) instead of a basic RNN. We describe the mathematical model of the GRU as follows:

$$ z_i = \sigma(W_{hz} h_{i-1} + W_{xz} x_i + b_z) \tag{2} $$
$$ r_i = \sigma(W_{hr} h_{i-1} + W_{xr} x_i + b_r) \tag{3} $$
$$ \hat{h}_i = \tanh\big(W_h (r_i \odot h_{i-1}) + W_x x_i + b\big) \tag{4} $$
$$ h_i = (1 - z_i) \odot h_{i-1} + z_i \odot \hat{h}_i \tag{5} $$

where $W_{hz}$, $W_{xz}$, $b_z$, $W_{hr}$, $W_{xr}$, $b_r$, $W_h$, $W_x$ and $b$ are parameters to be learned in the GRU, $\odot$ is element-wise multiplication, $\sigma(\cdot)$ is the logistic sigmoid function, $r_i$ and $z_i$ are the gates and $h_i$ is the hidden state at time step i.
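The GRU update described above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the paper: the dictionary keys for the parameters (`W_hz`, `W_xz`, and so on) and the helper names are hypothetical, and a real system would use a tensor library instead of nested lists.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def zeros(rows: int, cols: int) -> Matrix:
    return [[0.0] * cols for _ in range(rows)]


def gru_step(h_prev: Vector, x: Vector, p: Dict[str, list]) -> Vector:
    """One GRU update: gates z and r, candidate state, then interpolation."""
    # Update gate z and reset gate r.
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_hz"], h_prev), matvec(p["W_xz"], x), p["b_z"])]
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_hr"], h_prev), matvec(p["W_xr"], x), p["b_r"])]
    # The candidate state sees the previous state through the reset gate.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_hat = [math.tanh(a + b + c) for a, b, c in
             zip(matvec(p["W_h"], gated), matvec(p["W_x"], x), p["b"])]
    # The new state interpolates between the previous state and the candidate.
    return [(1.0 - zi) * hi + zi * hh for zi, hi, hh in zip(z, h_prev, h_hat)]


# Toy check with all-zero parameters (hidden size 2, input size 3):
# both gates equal 0.5 and the candidate is 0, so the state is halved.
params = {
    "W_hz": zeros(2, 2), "W_xz": zeros(2, 3), "b_z": [0.0, 0.0],
    "W_hr": zeros(2, 2), "W_xr": zeros(2, 3), "b_r": [0.0, 0.0],
    "W_h": zeros(2, 2), "W_x": zeros(2, 3), "b": [0.0, 0.0],
}
h_next = gru_step([1.0, -2.0], [0.3, 0.1, -0.7], params)
print(h_next)  # [0.5, -1.0]
```

With zero parameters the update gate is exactly 0.5, which makes the interpolation in the last step easy to verify by hand.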
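Lyrics Encoder

We use a bi-directional RNN (Schuster and Paliwal 1997) built from two GRUs to encode the syllables of the lyrics. The input $X = \{x_1, \ldots, x_{|X|}\}$ to the GRU encoders concatenates the syllable feature embedding and the word embedding:

$$ \overrightarrow{h}^{i}_{lrc} = f_{GRU}(\overrightarrow{h}^{i-1}_{lrc}, x_i) \tag{6} $$
$$ \overleftarrow{h}^{i}_{lrc} = f_{GRU}(\overleftarrow{h}^{i+1}_{lrc}, x_i) \tag{7} $$
$$ h^{i}_{lrc} = \begin{bmatrix} \overrightarrow{h}^{i}_{lrc} \\ \overleftarrow{h}^{i}_{lrc} \end{bmatrix} \tag{8} $$

Then, the lyrics encoder outputs an array of hidden vectors $\{h^{1}_{lrc}, \ldots, h^{|X|}_{lrc}\}$ to represent the information of each syllable in the lyrics.

Context Melody Encoder

We use the context melody encoder to encode the context melody $M = \{m_1, \ldots, m_{|M|}\}$. The encoder is a two-layer RNN that encodes the pitch, duration and label of a musical note respectively at each time step. Each layer is a bi-directional RNN built from two GRUs. For the first layer, we describe the forward directional GRU and the backward directional GRU at time step i as follows:

$$ \overrightarrow{h}^{i}_{pt} = f_{GRU}(\overrightarrow{h}^{i-1}_{pt}, m^{i}_{pt}) \tag{9} $$
$$ \overleftarrow{h}^{i}_{pt} = f_{GRU}(\overleftarrow{h}^{i+1}_{pt}, m^{i}_{pt}) \tag{10} $$

where $m^{i}_{pt}$ is the pitch attribute of the i-th note $m_i$. Then, we concatenate them into one vector:

$$ h^{i}_{pt} = \begin{bmatrix} \overrightarrow{h}^{i}_{pt} \\ \overleftarrow{h}^{i}_{pt} \end{bmatrix} \tag{11} $$

The bottom layer encodes the output of the first layer and the duration attribute of the melody. It can be described as follows:

$$ \overrightarrow{h}^{i}_{dur} = f_{GRU}(\overrightarrow{h}^{i-1}_{dur}, m^{i}_{dur}, h^{i}_{pt}) \tag{12} $$
$$ \overleftarrow{h}^{i}_{dur} = f_{GRU}(\overleftarrow{h}^{i+1}_{dur}, m^{i}_{dur}, h^{i}_{pt}) \tag{13} $$
$$ h^{i}_{dur} = \begin{bmatrix} \overrightarrow{h}^{i}_{dur} \\ \overleftarrow{h}^{i}_{dur} \end{bmatrix} \tag{14} $$

We concatenate the two output arrays of vectors into one array of vectors to represent the context melody sequence:

$$ h^{i}_{con} = \begin{bmatrix} h^{i}_{pt} \\ h^{i}_{dur} \end{bmatrix} \tag{15} $$

Melody Decoder

The decoder predicts the next note $y_i$ from all previously predicted notes $\{y_1, \ldots, y_{i-1}\}$ ($y_{<i}$ for short), the context musical notes $M = \{m_1, \ldots, m_{|M|}\}$ and the syllables $X = \{x_1, \ldots, x_{|X|}\}$ of the given lyrics. We define the conditional probability when decoding the i-th note as follows:

$$ \arg\max P(y_i \mid y_{<i}, X, M) \tag{16} $$

To model the three attributes of $y_i$, where we use $\{y^i_{pt}, y^i_{dur}, y^i_{lab}\}$ to respectively represent the pitch, duration and label, we decompose Eq.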
(16) into Eq. (17):

$$ P(y_i \mid y_{<i}, X, M) = P(y^i_{pt} \mid y_{<i}, X, M)\, P(y^i_{dur} \mid y_{<i}, X, M, y^i_{pt})\, P(y^i_{lab} \mid y_{<i}, X, M, y^i_{pt}, y^i_{dur}) \tag{17} $$

We use a three-layer RNN as the decoder to respectively decode the pitch, duration and label of a musical note at each time step. We define the conditional probabilities of each layer in the decoder:

$$ P(y^i_{pt} \mid y_{<i}, X, M) = g_p(s^i_{pt}, c_i, y_{i-1}) \tag{18} $$
$$ P(y^i_{dur} \mid y_{<i}, X, M, y^i_{pt}) = g_d(s^i_{dur}, c_i, y_{i-1}, y^i_{pt}) \tag{19} $$
$$ P(y^i_{lab} \mid y_{<i}, X, M, y^i_{pt}, y^i_{dur}) = g_l(s^i_{lab}, c_i, y_{i-1}, y^i_{pt}, y^i_{dur}) \tag{20} $$

where $g_p(\cdot)$, $g_d(\cdot)$ and $g_l(\cdot)$ are nonlinear functions that output the probabilities of $y^i_{pt}$, $y^i_{dur}$ and $y^i_{lab}$ respectively.
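The factorisation of a note's probability into pitch, duration and label terms described above can be illustrated with a toy Python sketch. It is not the paper's decoder: the three distributions are hand-written stand-ins for the outputs of the pitch, duration and label layers, the names `decode_note`, `dur_given_pitch` and `lab_given_pitch_dur` are hypothetical, and picking each attribute greedily is only one possible decoding strategy (it does not in general maximise the joint probability).

```python
from typing import Callable, Dict, Tuple

Dist = Dict[str, float]


def argmax(dist: Dist) -> str:
    """Return the outcome with the highest probability."""
    return max(dist, key=lambda k: dist[k])


def decode_note(
    pitch_dist: Dist,
    dur_given_pitch: Callable[[str], Dist],
    lab_given_pitch_dur: Callable[[str, str], Dist],
) -> Tuple[Tuple[str, str, str], float]:
    """Pick pitch, then duration given pitch, then label given both.

    The returned probability is the product of the three conditional
    terms, i.e. the chain-rule factorisation of the note probability.
    """
    pitch = argmax(pitch_dist)
    dur_dist = dur_given_pitch(pitch)
    dur = argmax(dur_dist)
    lab_dist = lab_given_pitch_dur(pitch, dur)
    lab = argmax(lab_dist)
    joint = pitch_dist[pitch] * dur_dist[dur] * lab_dist[lab]
    return (pitch, dur, lab), joint


# Hand-written toy distributions standing in for the three decoder layers.
toy_pitch = {"C5": 0.6, "D5": 0.4}


def toy_dur(pitch: str) -> Dist:
    return {"0.5": 0.7, "1.0": 0.3} if pitch == "C5" else {"0.5": 0.2, "1.0": 0.8}


def toy_lab(pitch: str, dur: str) -> Dist:
    return {"1": 0.9, "0": 0.1}


note, prob = decode_note(toy_pitch, toy_dur, toy_lab)
print(note, round(prob, 3))
```

Because each later distribution is a function of the earlier choices, the ordering pitch, duration, label is what makes the decoder hierarchical.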

$s^i_{pt}$, $s^i_{dur}$ and $s^i_{lab}$ are respectively the corresponding hidden states of each layer. $c_i$ is a dynamic context vector representing $M$ and $X$. We introduce the computation of $c_i$ before $s^i_{pt}$, $s^i_{dur}$ and $s^i_{lab}$:

$$ c_i = c^i_{con} + h^j_{lrc} \tag{21} $$

where $c^i_{con}$ is a context vector from the context melody encoder and $h^j_{lrc}$ is one of the output hidden vectors of the lyrics encoder, which represents the syllable $x_j$ that should be aligned to the note $y_i$ currently being predicted. In particular, we set $c^i_{con}$ to a zero vector if there is no context melody as input. From our representation method for lyrics-melody aligned pairs, it is not difficult to see how to get the $x_j$ that $y_i$ should be aligned to:

$$ j = \sum_{t=1}^{i-1} y^t_{lab} \tag{22} $$

$c^i_{con}$ is recomputed at each step by the alignment model (Bahdanau, Cho, and Bengio 2014) as follows:

$$ c^i_{con} = \sum_{t=1}^{|M|} \alpha_{i,t}\, h^t_{con} \tag{23} $$

where $h^t_{con}$ is one hidden vector from the output of the context melody encoder and the weight $\alpha_{i,t}$ is computed by:

$$ \alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{k=1}^{|M|} \exp(e_{i,k})} \tag{24} $$
$$ e_{i,k} = v_a^{\top} \tanh(W_a s_{i-1} + U_a h^k_{con}) \tag{25} $$

where $v_a$, $W_a$ and $U_a$ are learnable parameters. Finally, we obtain $c_i$ and then compute $s^i_{pt}$, $s^i_{dur}$, $s^i_{lab}$ and $s_i$ as follows:

$$ s^i_{pt} = f_{GRU}(s^{i-1}_{pt}, c_i, y^{i-1}_{pt}, h^j_{lrc}) \tag{26} $$
$$ s^i_{dur} = f_{GRU}(s^{i-1}_{dur}, c_i, y^{i-1}_{dur}, y^i_{pt}, s^i_{pt}) \tag{27} $$
$$ s^i_{lab} = f_{GRU}(s^{i-1}_{lab}, c_i, y^{i-1}_{lab}, y^i_{pt}, y^i_{dur}, s^i_{dur}) \tag{28} $$
$$ s_i = [s^i_{pt}; s^i_{dur}; s^i_{lab}] \tag{29} $$

Objective Function

We are given a training dataset with n lyrics-context-melody triples $D = \{(X^{(i)}, M^{(i)}, Y^{(i)})\}_{i=1}^{n}$, where $X^{(i)} = \{x^{(i)}_1, \ldots, x^{(i)}_{|X^{(i)}|}\}$, $M^{(i)} = \{m^{(i)}_1, \ldots, m^{(i)}_{|M^{(i)}|}\}$ and $Y^{(i)} = \{y^{(i)}_1, \ldots, y^{(i)}_{|Y^{(i)}|}\}$. In addition, $\forall (i, j)$, $y^{(i)}_j = (y^{(i)j}_{pt}, y^{(i)j}_{dur}, y^{(i)j}_{lab})$. Our training objective is to minimize the negative log likelihood loss $L$ with respect to the learnable model parameters $\theta$:

$$ L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{|Y^{(i)}|} \log P\big(y^{(i)j}_{pt}, y^{(i)j}_{dur}, y^{(i)j}_{lab} \mid \theta, X^{(i)}, M^{(i)}, y^{(i)}_{<j}\big) \tag{30} $$

where $y^{(i)}_{<j}$ is short for $\{y^{(i)}_1, \ldots, y^{(i)}_{j-1}\}$.

Experiments

Dataset

We crawled 18,451 Chinese pop songs, which include melodies with a total duration of over 00 hours, from an online Karaoke app. We then preprocess the dataset with rules as described in Zhu et al. (2018) to guarantee the reliability of the melodies.
For each song, we convert the melody to C major or A minor, which keeps all melodies in the same key, and we set BPM (Beats Per Minute) to 60 to calculate the duration of each musical note in the melody. We further divide the lyrics into sentences with their corresponding musical notes as lyrics-melody pairs. Besides, we set a window size of 0 for the context melody and use the preceding musical notes as the context melody for each lyrics-melody pair to make up lyrics-context-melody triples. Finally, we obtain 644,472 triples to conduct our experiments. We randomly choose 5% of the songs for validation, 5% of the songs for testing and the rest for training.

Baselines

As the melody composition task can generally be regarded as a sequence labeling problem or a machine translation problem, we select two state-of-the-art models as baselines.

CRF: A modified sequence labeling model based on CRF (Lafferty, McCallum, and Pereira 2001) which contains two layers for predicting Pitch and Duration, respectively. For one-to-many relationships, this model uses special tags to represent a series of original tags. For instance, if a syllable aligns with two notes, we use a single tag to represent them.

Seq2seq: A modified attention-based sequence-to-sequence model which contains two encoders and one decoder. In contrast to Songwriter, Seq2seq uses the attention mechanism (Bahdanau, Cho, and Bengio 2014) to capture information on the given lyrics. Seq2seq may not guarantee the alignment between the generated melody and the syllables in the given lyrics. To avoid this problem, the Seq2seq model stops predicting when the number of labels 1 in the predicted musical notes is equal to the number of syllables in the given lyrics.

Implementation

Model Size: For all the models used in this paper, the number of recurrent hidden units is set to 256. In the context melody encoder and melody decoder, we treat the pitch, duration, and label as tokens and use word embeddings to represent them with 2, 2, and 6 dimensions, respectively.
In the lyrics encoder, we use GloVe (Pennington, Socher, and Manning 2014) to pre-train a char-level word embedding with 256 dimensions on a large Chinese lyrics corpus and use Pinyin [5] as the syllable features with 2 dimensions.

Parameter Initialization: We use two linear layers with the last backward hidden states of the context melody encoder to respectively initialize the hidden states of the pitch layer and duration layer in the melody decoder in Songwriter and Seq2seq. We use zero vectors to initialize the hidden states in the lyrics encoder and context melody encoder.

[5] https://en.wikipedia.org/wiki/pinyin

Training: We use Adam (Kingma and Ba 2015) with an initial learning rate of 0.00 and an exponential decay rate of 0.9999 as the optimizer to train our models with a batch size of 6, and we use the cross entropy as the loss function.

Automatic evaluation

We use two modes to evaluate our model and baselines.

Teacher-forcing: As in Roberts et al. (2018), models use the ground truth as input for predicting the next step at each time step.

Sampling: Models predict the melody from the given lyrics without any ground truth.

Metrics: We use the F1 score for the automatic evaluation, following Roberts et al. (2018). Additionally, we select three automatic metrics for our evaluation as follows.

Perplexity (PPL): This metric is a standard evaluation measure for language models and can measure how well a probability model predicts samples. A lower PPL score is better.

(Weighted) Precision, Recall and F1 [6]: These metrics measure the performance of predicting the different attributes of the musical notes.

BLEU: This metric (Papineni et al. 2002) is widely used in machine translation. We use it to evaluate our predicted pitch. A higher BLEU score is better.

Duration of Word (DW): This metric checks whether the sum of the durations of all notes aligned to one word is equal to the ground truth. A higher DW score is better.

Table 2: Automatic evaluation results

             Teacher-forcing                                                         Sampling
                     Pitch                 Duration              Label
Model        PPL     P      R      F1      P      R      F1      P      R      F1      BLEU   DW
CRF          /       .23    2.02   0.9     9.2    53.2   50.     /      /      /       2.02   25.53
Seq2seq      2.2     5.76   55.0   5.56    6.66   67.    65.33   93.    93.06  92.60   3.96   37.0
Songwriter   2.0     63.23  63.2   62.90   69.    7.2    69.69   93.5   93.6   93.3    6.63   3.3

Results: The results of the automatic evaluation are shown in Table 2. We can see that our proposed method outperforms all models in all metrics. As Songwriter performs better than Seq2seq, it shows that the exact information of the syllables (Eq.
(22)) can enhance the quality of predicting the corresponding musical notes relative to the attention mechanism in traditional Seq2seq models. In addition, the CRF model demonstrates lower performance in all metrics. In the CRF model, we use a special tag to represent multiple musical notes if a syllable aligns with more than one musical note, which produces a large number of distinct tags and makes it difficult for the CRF model to learn from the sparse data.

[6] We calculate these metrics with scikit-learn with the parameter average set to weighted: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Human evaluation

Similar to text generation and dialog response generation (Zhang and Lapata 2014; Schatzmann, Georgila, and Young 2005), it is challenging to accurately evaluate the quality of music composition results with automatic metrics. To this end, we invite 3 participants as human annotators to evaluate the melodies generated by our models and the ground truth melodies created by humans. We randomly select 20 lyrics-melody pairs, with an average melody duration of approximately 30 seconds, from our testing set. For each selected pair, we prepare three melodies: the ground truth created by humans and the generated results from Songwriter and Seq2seq. Then, we synthesize all melodies with the lyrics by NiaoNiao [7], using default settings for both the generated songs and the ground truth, in order to eliminate the influence of other factors of singing. As a result, we obtain 3 (annotators) × 3 (melodies) × 20 (lyrics) samples in total. The human annotations are conducted in a blind-review mode, which means that the human annotators do not know the source of the melodies during the experiments.

Metrics: We use the metrics from previous work on human evaluation for music composition as shown below. We also include an emotion score to measure the relationship between the generated melody and the given lyrics. The human annotators are asked to give a score from 1 to 5 after listening to the songs. Larger scores indicate better quality in all three metrics.
Emotion: Does the melody represent the emotion of the lyrics?
Rhythm (Zhu et al. 2018; Watanabe et al. 2018): When listening to the melody, are the duration and pause of words natural?
Overall (Watanabe et al. 2018): What is the overall score of the melody?

[7] A singing voice synthesizer software which can synthesize Chinese songs, http://www.dsoundsoft.com/product/niaoeditor/

Table 3: Human evaluation results in blind-review mode

Model        Overall   Emotion   Rhythm
Seq2seq      3.2       3.52      2.66
Songwriter   3.3       3.9       3.52
Human        .57       .50       .7

Results: Table 3 shows the human evaluation results. According to the results, Songwriter outperforms Seq2seq in all metrics, which indicates its effectiveness over the Seq2seq baseline. On the Rhythm metric, human annotators give significantly lower scores to Seq2seq than to Songwriter, which shows that the melodies generated by Songwriter are more natural in the pause and duration of words than the ones generated by Seq2seq. The results further suggest that using the exact information of syllables (Eq. (22)) is more effective than the soft attention mechanism in traditional Seq2seq models in the melody composition task. We can also observe from Table 3 that the gaps between the system-generated melodies and the ones created by humans are still large on all three metrics. It remains an open challenge for future research to develop better algorithms and models to generate melodies with higher quality.

Related Work

A variety of music composition works have been done over the last decades. Most of the traditional methods compose music based on music theory and expert domain knowledge. Chan, Potter, and Schubert (2006) design rules from music theory to stitch music clips together in a reasonable way. With the development of machine learning and the increase of public music data, data-driven methods such as Markov chain models (Pachet and Roy 2011) and graphical models (Pachet, Papadopoulos, and Roy 2017) have been introduced to compose music.

Recently, deep learning has shown its potential for musical creation. Most of these deep learning approaches use a recurrent neural network (RNN) to compose music by regarding it as a sequence. The MelodyRNN (Waite 2016) model, proposed by the Google Brain Team, uses a lookback RNN and an attention RNN to capture the long-term dependency of melody.
Chu, Urtasun, and Fidler (2016) propose a hierarchical RNN based model which additionally incorporates knowledge from music theory into the representation to compose not only the melody but also the drums and chords. Some recent works have also started exploring various generative adversarial network (GAN) models to compose music (Mogren 2016; Yang, Chou, and Yang 2017; Dong et al. 2017). Brunner et al. (2018) design recurrent variational autoencoders (VAEs) with a hierarchical decoder to reproduce short musical sequences.

Generating a lyrics-conditional melody is a subset of music composition but under more restrictions. Early works first determine the number of musical notes by counting the syllables in the lyrics and then predict the musical notes one after another by considering previously generated notes and the corresponding lyrics. Fukayama et al. (2010) use dynamic programming to compute a melody from Japanese lyrics; the calculation needs three human-designed constraints. Monteith, Martinez, and Ventura (2012) propose a melody composition pipeline for given lyrics. For each given lyrics, it first generates hundreds of different possibilities for rhythms and pitches. Then it ranks these possibilities with a number of different metrics in order to select a final output. Scirea et al. (2015) employ Hidden Markov Models (HMM) to generate rhythm based on the phonetics of the lyrics already written. Then a harmonic structure is generated, followed by the generation of a melody matching the underlying harmony. Ackerman and Loker (2017) design a co-creative automatic songwriting system, ALYSIA, based on a machine learning model using random forests, which analyzes the lyrics features to generate one note at a time for each syllable.

Conclusion and Future Work

In this paper, we propose a lyrics-conditional melody composition model which can generate melody and the exact alignment between the generated melody and the given lyrics.
We develop the melody composition model under the encoder-decoder framework, which consists of two RNN encoders, the lyrics encoder and the context melody encoder, and a hierarchical RNN decoder. The lyrics encoder encodes the syllables of the current lyrics into a sequence of hidden vectors. The context melody encoder leverages an attention mechanism to encode the context melody into a dynamic context vector. The decoder uses two layers to produce musical notes and another layer to produce the alignment jointly. Experimental results on our dataset, which contains 18,451 Chinese pop songs, demonstrate that our model outperforms baseline models. Furthermore, we leverage a singing voice synthesizer software to synthesize the singing of the lyrics and generated melodies for human evaluation. Results indicate that our generated melodies are more melodious and tuneful. For future work, we plan to incorporate the emotion and the style of lyrics to compose the melody.

References

Ackerman, M., and Loker, D. 2017. Algorithmic songwriting with ALYSIA. In International Conference on Evolutionary and Biologically Inspired Music and Art, 1-16. Springer.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
Brunner, G.; Konrad, A.; Wang, Y.; and Wattenhofer, R. 2018. MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. In Proc. Int. Society for Music Information Retrieval Conf.
Chan, M.; Potter, J.; and Schubert, E. 2006. Improving algorithmic music composition with machine learning. In Proceedings of the 9th International Conference on Music Perception and Cognition, ICMPC.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014a. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724-1734. Doha, Qatar: Association for Computational Linguistics.
Cho, K.; van Merrienboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014b. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, 1724-1734.
Chu, H.; Urtasun, R.; and Fidler, S. 2016. Song from PI: A musically plausible network for pop music generation. arXiv preprint arXiv:1611.03477.
Dong, H.-W.; Hsiao, W.-Y.; Yang, L.-C.; and Yang, Y.-H. 2017. MuseGAN: Symbolic-domain music generation and accompaniment with multi-track sequential generative adversarial networks. arXiv preprint arXiv:1709.06298.
Fukayama, S.; Nakatsuma, K.; Sako, S.; Nishimoto, T.; and Sagayama, S. 2010. Automatic song composition from the lyrics exploiting prosody of the Japanese language. In Proc. 7th Sound and Music Computing Conference (SMC), 299-302.
Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
Lafferty, J.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
Mogren, O. 2016. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904.
Monteith, K.; Martinez, T. R.; and Ventura, D. 2012. Automatic generation of melodic accompaniments for lyrics. In ICCC, 87-94.
Pachet, F., and Roy, P. 2011. Markov constraints: steerable generation of Markov sequences. Constraints 16(2):148-172.
Pachet, F.; Papadopoulos, A.; and Roy, P. 2017. Sampling variations of sequences for structured music generation. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), Suzhou, China, 167-173.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, 311-318. Association for Computational Linguistics.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532-1543.
Roberts, A.; Engel, J.; Raffel, C.; Hawthorne, C.; and Eck, D. 2018. A hierarchical latent vector model for learning long-term structure in music. arXiv preprint arXiv:1803.05428.
Schatzmann, J.; Georgila, K.; and Young, S. 2005. Quantitative evaluation of user simulation techniques for spoken dialogue systems. In 6th SIGdial Workshop on DISCOURSE and DIALOGUE.
Schuster, M., and Paliwal, K. 1997. Bidirectional recurrent neural networks. Trans. Sig. Proc. 45(11):2673-2681.
Scirea, M.; Barros, G. A.; Shaker, N.; and Togelius, J. 2015. SMUG: Scientific music generator. In ICCC, 204-211.
Waite, E. 2016. Generating long-term structure in songs and stories. Magenta Blog.
Watanabe, K.; Matsubayashi, Y.; Fukayama, S.; Goto, M.; Inui, K.; and Nakano, T. 2018. A melody-conditioned lyrics language model. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, 163-172.
Yang, L.-C.; Chou, S.-Y.; and Yang, Y.-H. 2017. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847.
Zhang, X., and Lapata, M. 2014. Chinese poetry generation with recurrent neural networks. In EMNLP, 670-680.
Zhu, H.; Liu, Q.; Yuan, N. J.; Qin, C.; Li, J.; Zhang, K.; Zhou, G.; Wei, F.; Xu, Y.; and Chen, E. 2018. XiaoIce Band: A melody and arrangement generation framework for pop music. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2837-2846. ACM.
