Generating Music from Text: Mapping Embeddings to a VAE's Latent Space


MSc Artificial Intelligence Master Thesis
Generating Music from Text: Mapping Embeddings to a VAE's Latent Space
by Roderick van der Weerdt (10680195)
August 15, 2018
36 EC, January 2018 - August 2018
Supervisor: Dr. A. Meroño Peñuela
Examiner: Prof. Dr. F.A.H. van Harmelen
Assessor: Dr. K.S. Schlobach
VU Amsterdam

Abstract

Music has always been used to elevate the mood in movies and poetry, adding emotions that might not have been there without the music. Unfortunately, only the most musical people are capable of creating music, let alone the appropriate music. This paper proposes a system that takes a piece of text as input; the representation of that text is then transformed into the latent space of a VAE capable of generating music. The latent space of the VAE contains representations of songs, and the transformed vector can be decoded from it as a song. An experiment was performed to test this system by presenting a text to seven experts, along with two pieces of music, one of which was generated from the text. On average the music generated from the text was recognized in only half of the examples, but the responses to the same poem were consistent across rounds, showing a relation between the poems and the generated music.

Contents
1 Introduction
2 Related Work
  2.1 Background
    2.1.1 Music Generation with RNN
    2.1.2 Music Generation with VAE
    2.1.3 The MIDI Representation
    2.1.4 Text-Based Variational Autoencoders
    2.1.5 Word2Vec
3 Method
  3.1 Encoding the Lyrics (with Word2Vec)
  3.2 Bringing the Latent Spaces Together
  3.3 Decoding the Music (with Magenta's MusicVAE)
4 Experiment
  4.1 Implementation
5 Results
  5.1 Discussion
6 Conclusion
References
A Poems used during the Experiment
B Table of the Results

1 Introduction

Music has always been used to elevate the mood in films and poetry, adding emotions that might not have been there without the music. When a different piece of music is played during a movie it can change a scene from horror to comedy, completely changing the impact of what is seen by something that is heard. The famous shower scene from the movie Psycho would not be famous if not for the music [2], which brings a sense of horror before any visual cues are given.

Another example of this synergy, now between text and music, happens on the radio. When a presenter has prepared a piece of text they often select a piece of music to play along with the message. Not only does this give more depth to the text, it can also be used to embed an extra meaning in the text, just as is done in cinema. The problem here lies in the time it takes to determine which piece of stock music fits the text, if one fits at all. It would be preferable to create a new piece of music every time one is required; unfortunately, creating music to specifically augment one scene or one piece of text takes a lot of time. A sufficiently skilled musician would have to be available to create meaningful music, and even then it takes time for the musician to interpret the meaning of the text or scene in order to channel it into a piece of music. Generating a new piece of music specifically for the text would be ideal, but at the moment a way to do this does not exist. Automated creation of music could help in this situation, and could help anyone else who wants to augment their creations.

Research has explored different ways to generate new music based on existing music for years [4], using statistical models to learn the transitions between notes from existing music in order to generate new compositions. More recently, advanced neural networks have been used to generate entirely new music [16, 14] (whether this can be considered new music is discussed in Section 5.1). These models are trained on a large set of music, resulting in models from which new music can be generated, influenced by the music used during training. But none has sought to create music based on another medium, such as text.

This paper details research on creating music based on a specific text. This requires a model that can generate specific music and also interpret text to base this music on. The newly generated music should fit the text similarly to how the music of a song fits the lyrics of that same song, because when an artist creates a new song the music and the lyrics work together, and are written together, in order to enhance the emotion the artist wants to express. In order to create a connection between the feelings addressed by the text and the feelings addressed by the music, a mapping will have to be trained that connects a text with a piece of music. This research works under the hypothesis that the text and music of every song express the same feelings, which allows songs to be used to train a model based on this similarity. The model is trained to transform the representation of the text into a representation of a piece of music. Training this model requires two models that are able to create a latent space in which the text (in the first) and the music (in the second) can be represented. This can be summed up in the research question: "To what extent can the relationship between lyrics and melodies be learned from representations of text and music trained with large collections, using embeddings and a VAE?"
The contributions of this research are:
- A comparison between state-of-the-art music generation algorithms (Section 2).
- A novel approach to combining two latent spaces (Section 3).
- An evaluation of the proposed system (Sections 4 and 5).

Section 2 starts with a survey of research related to the goal of generating music, examining different ways of automatically generating music and how to represent the text input for the generation of the music. Section 3 details the system used for the experiment in Section 4. The results of the experiment are examined and discussed, along with the complications, in Section 5, and lastly an answer to the research question is given in Section 6.

2 Related Work

The field of machine learning has experimented with different approaches to generating music; this section surveys some of these past approaches. The automated generation of music can be considered to progress concurrently with the generation of text [4].

Initially, statistical models were used to generate music [4]. Statistical models like Markov models [13] were used to generate music based on existing music, looking at the transitions in the existing music in order to re-create similar transitions in the new music. This allowed for pieces that sounded different, but the transitions were still limited to those that already existed.

When RNN (Recurrent Neural Network) [21] models became more practically usable, Sutskever et al. created an RNN that was capable of generating the next word in a sentence [17]. This system was character based, meaning there was no representation of whole words in the system, but it still generated existing words, learning enough from the characters. It even allowed the system to create words that were not in the training set. For words this might not be convenient, because non-existing words are not useful, but in music this would be an advantage. Further experiments with RNNs yielded entire pages of generated text [5], which at a glance were not recognizably different from real text.

The effectiveness of RNNs in the text domain led Simon and Oore to investigate the possibilities of music generation with RNNs [16]. Their research mainly focused on adding expressive timing and dynamics to automatically generated music. This allowed the system to generate music from which unlimited samples could be taken, but to make the music sound different the model would have to be retrained entirely.

In order to create a system that did not have to be retrained and to allow the user to work with different possible outputs, Bowman et al. used a VAE (Variational AutoEncoder) [11] to generate texts [1]. The VAE allowed them to train a latent space and from there manipulate their texts, generating interpolations of different texts and showing the transition from one line into the other. Another study used the same approach to generate new tweets based on a large collection of tweets [15]. As music generation tends to follow in the footsteps of text generation, a VAE was also used to generate music [14]. The VAE of Roberts et al. allows the user to manipulate and, in their own words, "doodle" with the music, generating different pieces of music. But it requires music as input to start the manipulations.

All of the previous RNN and VAE implementations that generate music use MIDI [19] representations of music. Other research has experimented with generating music from the signal, or raw audio, of the music [22]. Zukowski and Carr use a SampleRNN which is able to train on a small training set, in their experiments only one album, and produce similar music. The system would have to be retrained for every new training set, and the model would generate hours of music from which only minutes were usable as new music.

2.1 Background

This section examines the methods from related research that will be used in the system of this research.

2.1.1 Music Generation with RNN

Simon and Oore use an RNN to generate new music [16]. RNNs are models consisting of multiple layers of nodes where all nodes are connected to every other node within the same layer. The benefit of this is that each node can influence every other node, resulting in a more expressive network. Often the nodes used in RNNs are LSTM (Long Short-Term Memory) nodes (or units) [7], which differ from regular nodes in that they remember previous inputs (with a decay over time), allowing previous input to influence the next input and to be influenced by the input before that. This feature makes RNNs very suitable for longer sentences, where each word can affect other words (as famously explained in a blog post by Andrej Karpathy [10]). A second effective usage of RNNs is with images as input, where multiple pixels together mean something different from those same pixels separately [20].

In their research Simon and Oore intended to achieve more dynamic music [16] by using an RNN trained on approximately 1400 piano recordings performed by skilled pianists, and by using a different music representation from the one applied in the past. Instead of representing a longer note as a sequence of repeated short notes, Simon and Oore chose a notation more similar to the MIDI encoding, in which the beginning (note-on event) and end (note-off event) of a note encode its duration (for more on MIDI see Section 2.1.3). The resulting implementation is able to generate music based on its input, which is (initially) 1400 music pieces. When those 1400 pieces of music are replaced with different music, different music will be generated. To make this viable for the system of this research, the representation of the text would have to take the form of approximately 1400 pieces of music that are representative of the text. It would also require retraining the RNN for every new input text, which is a costly process. The research only used high-quality MIDI files with only one instrument playing, which does not scale to modern music, where many different kinds of instruments are played.

2.1.2 Music Generation with VAE

Another way to generate music is to use the MusicVAE proposed by Roberts et al. [14]. VAEs were first proposed by Kingma and Welling [11]. MusicVAE combines two RNNs with a latent space in between, as depicted in Figure 1. The first RNN serves as an encoder of the input data (for example music), encoding the data as a vector in the latent space. By training the VAE sufficiently, the latent space comes to represent all the input data. The second RNN is a decoder that allows a vector in the latent space to be decoded into a piece of music. When a piece of music is encoded to a vector in the latent space and that same vector is then decoded, the new piece of music will resemble the original, but it will not be identical. Because the latent space has relatively few dimensions in which to represent the pieces of music, information about the music will be lost. This is not a problem, since the purpose of the VAE is not to replicate music but to generate new music. MusicVAE uses the same music representation as the RNN uses internally, taking MIDI files as input. Training the VAE requires a lot more music than training the RNN, because the latent space must be trained enough to be representative; Roberts et al. [14] used approximately 1.5 million unique MIDI files to train MusicVAE.
One large benefit of using MusicVAE over the RNN from Simon and Oore [16] is its reusability: MusicVAE only needs to be trained once to be able to generate new pieces of music. However, it does need some kind of input to base its new music on, so as not to simply generate random new music. Two different algorithms to replace the encoder are examined in Sections 2.1.4 and 2.1.5, both able to encode text into an embedding.
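To make this encode/decode round trip concrete, the snippet below gives a minimal sketch of a VAE in PyTorch. It is an illustration only, not MusicVAE itself: MusicVAE uses recurrent (hierarchical) encoder and decoder networks and a 512-dimensional latent space, whereas the dimensions, layers and activations here are assumptions chosen purely to keep the example small.

```python
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    """Toy VAE: encode an input to a small latent vector z, decode z back."""
    def __init__(self, input_dim=128, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, input_dim))

    def encode(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z around mu with spread exp(logvar / 2).
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def decode(self, z):
        return self.decoder(z)

vae = ToyVAE()
x = torch.rand(1, 128)                # stand-in for one encoded piece of music
x_recon = vae.decode(vae.encode(x))   # after training: similar to x, not identical
```

Training such a model combines a reconstruction loss with a KL term that pushes the latent distribution towards a standard normal, which is what makes arbitrary points in the latent space decodable into plausible output.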

Figure 1: Visualization of the VAE used by [14]. First a piece of music is encoded into latent space z, and subsequently this encoding is decoded from the latent space back into a (similar) piece of music. (Original image taken from https://magenta.tensorflow.org/music-vae.)

2.1.3 The MIDI Representation

Both MusicVAE [14] and Performance RNN [16] use music files encoded in the MIDI format. MIDI (Musical Instrument Digital Interface) is an encoding for musical information originally intended for use between electronic instruments and computers [19]. What makes the MIDI format special compared to more common music formats such as MP3 or WAV is that it does not contain a compressed recording of a song. Instead it contains separate tracks, each of which in turn contains the notes and musical events of a song in symbolic form, resulting in a machine-interpretable music score. MIDI events tell the instrument everything it needs to know in order to produce the correct sound. This includes events about what kind of instrument should be synthesized, but also when a note should start and when a note should stop. This nature of the MIDI file means that it cannot be played back like other music files; every track needs to be synthesized in order to hear the song. The advantage this brings is that each individual track can be extracted, which is not possible with the more conventional music formats. This allows models to be trained on a single musical part, instead of on all the instruments playing simultaneously.

2.1.4 Text-Based Variational Autoencoders

VAEs trained on large text corpora have had success in creating a latent space capable of generating new sentences. Bowman et al. [1] used approximately 12,000 ebooks to train their VAE, which was capable of generating natural sentences. Semeniuta et al. [15] trained a VAE on Twitter data, creating a VAE that was able to generate new tweets. The benefit of using a VAE to encode the lyrics of the songs would be that the latent spaces of the lyrics VAE and the music VAE are more similar in nature. Following the hypothesis that the lyrics and music have the same meaning, the latent spaces might be assumed to also be shaped similarly, which would make a transformation between the latent spaces easier to create. But using a VAE would also create a problem. The latent space of the VAE does not create an exact representation of the lyrics, meaning that information about the lyrics will be lost, and a transformation from something only similar to the text into the music latent space might introduce too much noise. On top of that, the training data used for the VAEs by Bowman et al. [1] and Semeniuta et al. [15] do not fit the purposes of this research, since they differ too much from lyrics.
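As an illustration of the event-based, multi-track nature of MIDI described above (and of the note-on/note-off encoding mentioned in Section 2.1.1), the sketch below lists the tracks of a MIDI file together with their instrument programs and note events. The mido library and the file name are assumptions made for illustration; the thesis does not state which MIDI toolkit was used.

```python
import mido

# Hypothetical input file; any standard (multi-track) MIDI file would do.
mid = mido.MidiFile('song.mid')

for i, track in enumerate(mid.tracks):
    # Program-change events say which instrument the track should be synthesized with.
    programs = sorted({msg.program for msg in track if msg.type == 'program_change'})
    # Note-on / note-off events mark where each note starts and stops.
    note_events = [msg for msg in track if msg.type in ('note_on', 'note_off')]
    print(f'track {i} ({track.name!r}): programs={programs}, '
          f'{len(note_events)} note events')
```

Because every track can be pulled out individually like this, a single melodic part can be isolated for training, which is what the data preparation in Section 3.2 relies on.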

2.1.5 Word2Vec

A simpler solution would be to use Word2Vec [12]. Word2Vec is a small feed-forward neural network often used in NLP tasks. It allows the user to train a model on a large set of words. The model creates a vector for every word and is trained so that the vectors of words similar to each other end up closer together than the vectors of dissimilar words. This vector space can be compared to the latent space of the VAEs, creating a large representation of the words used during training. The vector space of a Word2Vec model differs from the latent space of a VAE in that it keeps all the information about the words: when a word is encoded to a vector, that vector can be decoded back into the original word. This is more fitting for the system proposed in this paper, since it keeps as much information about the lyrics as possible.

3 Method

The proposed system uses a simple learning algorithm, an MLP [8], to infer a function that maps a latent space of lyrics learned with Word2Vec [12] to a latent space of melodic musical sequences in MIDI learned with MusicVAE [14]. Figure 2 displays the data flow of the system. First the input text is encoded and embedded into the first latent space (z_1), as explained in Section 3.1. Secondly, the text representation is used as input for the MLP, which transforms it into the latent space (z_2) of the VAE. The transformation (T) is described in Section 3.2. The resulting embedding subsequently gets decoded into a MIDI file (Section 3.3). All the code used during this research and instructions on how to use it can be found at https://github.com/roderickvanderweerdt/text2midi. This section expands upon the implementation, focusing on these three components, starting with the lyrics representation.

Figure 2: Visualization of the entire system; the arrows show the data flow (input text -> w2v -> z_1 -> T -> z_2 -> VAE decoder -> music).

3.1 Encoding the Lyrics (with Word2Vec)

The representation of the lyrics was created using Word2Vec. Word2Vec, as described in Section 2.1.5, allows for the training of vectors, or embeddings, representing words. Embeddings of words that are more similar to each other will lie closer to each other than those of dissimilar words. For this project the embeddings were trained only on lyrics from songs, instead of the more usual collections of texts. This was done in order to train only on the musical usage of the words,

and not the more regular use of words as practised in, for example, news articles [12] or product reviews [6], both of which are more commonly used as training data for Word2Vec embeddings. The decision to use only lyrics as training data was based on research [3] stating that when there is enough data available, it is best to train on the data you are trying to represent, instead of on a more general dataset.

Because the training of the transformation (T) from the text latent space into the music latent space (described in Section 3.2) uses sentences of text instead of single words, the representation of the words alone is not enough. To create the representation of the longer pieces of text, the mean of the vector representations of all the words in the text was taken, resulting in one vector representing the piece of text.

3.2 Bringing the Latent Spaces Together

In order to connect the latent space of the lyrics to the latent space of the music, an MLP (Multilayer Perceptron) [8] was trained to learn the transformation from pieces of lyrics into pieces of music. A visualization of this transformation (T) can be seen in Figure 3. The vector representations of the text are taken as input for the transformation (T) and the vector representations of the pieces of music are used as the output during training.

Figure 3: Visualization of the transformation T performed by the MLP, with the dimensions of z_1 and z_2 reduced to two so it can be displayed.

The training set consisted of 12,907 lyric-and-music pairs, taken from 409 songs. All those songs were MIDI files that contained a track in which the lyrics of the song are timed with the rest of the music (these kinds of MIDI files are often used for karaoke, where the lyrics must play along with the song). These timed lyrics allowed the MIDI to be broken up into smaller parts, creating a training set of lyric parts with a maximum of 140 characters, each paired with a piece of music from the original song starting at the moment the lyrics start in the song and continuing for 16 bars (because MusicVAE works best with pieces of music 16 bars long). Keeping the parts at a maximum of 140 characters allowed the parts to be similar in length, even though sentences in songs often are not. The cut-off point of a part was never in the middle of a word, but always at the end of a sentence, resulting in some lyric parts being a little shorter than 140 characters and thereby maintaining the structure of the sentences.

Instead of selecting which melody line to use by examining the music or the track names [18], all the MIDI tracks using melody instruments were initially cut up into the 16-bar pieces paired with the lyrics. Melody tracks, as defined by [14], are the tracks using instruments with a MIDI identifier between 1 and 32. Using the MusicVAE model, all those music pieces were tested on whether they could be encoded into the latent space. For each song the track that contained the most pieces of music that could be encoded was selected, and only the pieces of music from that track were used to train the MLP model.
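The two steps above can be summarised in a short sketch, shown below. It is a hedged illustration rather than the thesis code: the toy corpus and training pairs stand in for the 12,907 real pairs, the Word2Vec settings (vector size 100, window 3) and the MLP layer sizes (256 and 1024), optimizer (SGD) and loss (mean squared error) are taken from Section 4.1 (using Gensim 4.x parameter names), and the ReLU activations, learning rate and number of iterations are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# --- Section 3.1: lyric chunks (max 140 characters) become mean word vectors ---
chunks = [["hope", "is", "the", "thing", "with", "feathers"],   # toy lyric chunks
          ["stop", "all", "the", "clocks"]]
w2v = Word2Vec(chunks, vector_size=100, window=3, min_count=1)  # thesis uses min_count=5

def embed_chunk(tokens, model):
    """Mean of the word vectors of all in-vocabulary tokens."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size, dtype=np.float32)

text_vecs = torch.tensor(np.stack([embed_chunk(c, w2v) for c in chunks]))

# --- Section 3.2: the MLP T maps 100-d text embeddings to 512-d MusicVAE vectors ---
music_vecs = torch.rand(len(chunks), 512)  # placeholder for MusicVAE encodings of 16-bar segments
mlp = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                    nn.Linear(256, 1024), nn.ReLU(),
                    nn.Linear(1024, 512))
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for _ in range(200):                       # "trained until convergence" in the thesis
    optimizer.zero_grad()
    loss = loss_fn(mlp(text_vecs), music_vecs)
    loss.backward()
    optimizer.step()
```

In the actual system the target vectors would be the MusicVAE encodings of the 16-bar segments paired with each lyric chunk, obtained with the pre-trained model described in Section 3.3, and a new text is turned into music by passing its mean embedding through the trained MLP and decoding the result.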

3.3 Decoding the Music (with Magenta's MusicVAE)

For the implementation a pre-trained VAE model originally trained by Magenta was used [14]. The model was trained on music sequences taken from a large MIDI dataset, where for every MIDI file melody parts were retrieved with a length of 16 bars (at a tempo of 120 bpm and in 4/4 time, 16 bars is 64 beats, which comes down to chunks of roughly 30 seconds). The latent space of the VAE has 512 dimensions.

4 Experiment

In order to test the effectiveness of the created system, an experiment was performed. The questions consisted of short pieces of text, mostly excerpts from poems. Along with each text two pieces of music were played: one generated from the text used in the question, and the other generated from a separate piece of text. After both pieces of music had finished, the participants were asked to mark down which of the two they found most similar to the text, based on the feelings expressed by the music. The goal of the experiment is to determine whether the music created by the system is similar enough to the text used to create it that a different piece of music (created from a different text) is recognizably different.

Juslin and Laukka [9] state the importance of the explicit difference between emotion expressed by music and emotion induced by music. Emotion expressed by music is the emotion corresponding to the song itself. The induced emotion is the emotion that a listener feels when listening to the song. These two emotions do not have to correspond; it is, for example, possible to feel happy while listening to a sad song. Even though this research, and therefore this experiment, does not focus specifically on the emotion expressed by the music, but on the broader, less specific feeling expressed by the music, professionals were used for the experiment. These professionals were instructed to (and are expected to be able to) focus on the expressed feelings rather than the induced feelings.

Seven professionals participated in the experiment, answering questions about eight different poems, each time having to select the best-fitting piece of music out of two different pieces. The participants were all part of a training program learning how to work on the radio. Part of their education is learning what kind of music should be played while different texts are read on the radio. This makes them capable participants, better suited to perform the experiment than untrained listeners. Of the eight poems used during the experiment, six were excerpts taken from pre-existing poems by famous poets and two were only a single word. The poems can be found in Appendix A.

4.1 Implementation

Approximately 57,650 song lyrics were used to train the Word2Vec embeddings, scraped from the LyricsFreak website and retrieved from https://www.kaggle.com/mousehead/songlyrics. All of the songs are modern music (as opposed to classical music) from many different genres, ranging from pop to metal to hip hop, but all are written in English. All the lyrics were processed by removing punctuation marks and empty lines, but the sentences were kept intact, as this could influence the sliding window.
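A minimal sketch of this preprocessing step is given below. The thesis only states that punctuation marks and empty lines were removed while sentences were kept intact, so the lowercasing, whitespace tokenisation, and the helper name are assumptions for illustration.

```python
import string

def preprocess_lyrics(raw_lyrics: str) -> list[list[str]]:
    """Strip punctuation, drop empty lines, keep one token list per line."""
    table = str.maketrans('', '', string.punctuation)
    sentences = []
    for line in raw_lyrics.splitlines():
        tokens = line.translate(table).lower().split()
        if tokens:                      # skip lines that became empty
            sentences.append(tokens)
    return sentences

print(preprocess_lyrics("Stop all the clocks, cut off the telephone,\n\nSilence the pianos..."))
# -> [['stop', 'all', 'the', 'clocks', 'cut', 'off', 'the', 'telephone'],
#     ['silence', 'the', 'pianos']]
```

Keeping one token list per line preserves the sentence boundaries that matter for the Word2Vec sliding window mentioned above.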

Figure 4: Correctly answered questions for each participant of the experiment, with confidence intervals.

The Word2Vec implementation of Gensim (https://radimrehurek.com/gensim/models/word2vec.html) was used, which is a Python port of the original implementation by Google [12]. Standard parameters were used for the model, with a vector size of 100, a sliding window size of three, and a minimum word count of five. The MLP used for the transformation (T) had two hidden layers, the first with 256 nodes and the second with 1024. Stochastic gradient descent was used to train the model, with the mean squared error as the loss function. It was trained until convergence.

5 Results

The raw results of the experiment can be found in Appendix B. Seven experts participated and answered fifteen questions. After fifteen questions the experiment had to be ended prematurely because a radio broadcast had to start, so only fifteen questions were answered.

Figure 5: Normalized number of correct answers for each question.

Figure 4 visualizes the accuracy of the participants based on their answers. Out of the fifteen questions, each expert on average answered 6.71 questions correctly, with at most eight and at least five correct answers for the individual participants. This means the experts chose the wrong piece of music for the text more frequently than the correct piece of music. On average each question was answered correctly by 45% of the experts, with the best-answered question having six of the seven experts give the correct answer. Figure 5 displays how often each question was answered correctly. The worst-answered questions were answered correctly by only one of the experts. An ANOVA test showed that there is no significant difference between the questions, with a p-value of 0.053.

Figure 6 shows a comparison of the results of the first and second time a text was used. A paired t-test between the first and second time gives a p-value of 0.741, indicating no significant difference between the two rounds; the text of the poem, rather than the particular clip, was relevant to whether it would be recognized.

Figure 6: Normalized number of correct answers for each poem, for the first (Round 1) and second (Round 2) time the poem was used.

5.1 Discussion

Given the similarity between the results of the questions with the same poems, a relation between the poem and the piece of music with which it is associated is shown. This is evidence that the text did affect the music generated by the system. Even though the system might not have given the best-fitting music for each text, it did give similar results for each poem in both rounds, meaning that the transformation, although not mapping to recognizably fitting music, did transform to specific points in the latent space. The ANOVA test over the different questions had a p-value of 0.053, so close to significance that a slightly longer experiment, especially since this experiment was cut short, might already be enough to make the outcome significant.

During the experiment participants mentioned that all the music started to sound similar after the first few clips, and considering they had to listen to 30 different MIDI clips it could have been too many to hear the differences. MIDI music always sounds more static than real music, since it is synthesized rather than played. This suggests that MIDI might not be the format best suited for these applications, or that the MIDI should be synthesized with actual (digital) instruments instead of on a laptop.

Something that should also be mentioned is whether this kind of generated music should be considered actual new music. All the information the model (be it RNN or VAE) uses comes from existing music, so any generated music can only be a product of the original training data. One might argue that this means any generated music is unoriginal. But a more fitting comparison might be to inspiration: just as real musicians are inspired, the model also requires inspiration to create something new.

After the experiment one of the participants mentioned that the music commonly played under text on the radio has a lower temperature than the music created by the system. Temperature is also discussed by [14], but it is not something that could be controlled in the decoder as it was used here. Adding this functionality is something future research might explore as a way to create more usable music.

For future research it is also interesting to examine whether it would be possible to replace the Word2Vec model with a VAE trained on lyrics. In the first place it could be used for the same purposes as this research, but the added benefit would be that the transformation might be turned around, resulting in a system that should be able to generate text from music.

6 Conclusion

In order to answer the research question of this paper, "To what extent can the relationship between lyrics and melodies be learned from representations of text and music trained with large collections, using embeddings and a VAE?", a system was created. The system is able to take a piece of text and encode it into an embedding using Word2Vec. Transforming the embedding with the MLP into an embedding in the MusicVAE latent space allows the decoder of MusicVAE to decode this second embedding into a piece of music.

Experiments were performed with music generated by the system for specific poems, where every poem was presented to the participants together with two pieces of music: one generated from this poem and one from a different poem. The participants answered only 45% of the questions correctly, showing that the system did not produce recognizably fitting music for the poems. All of the poems were used twice, both times with music created specifically for that poem, but with different music each time. The answers show that the participants had very similar preferences when a poem was used the second time. This indicates that the system, although not producing recognizably fitting music, did generate music specific to each poem.

Based on this, the answer to the research question is that the relationship between lyrics and melodies can be learned from representations of text and music. Even though the poems and music might not be recognizably matched, they were consistently (dis)similar, showing the extent of the relationship between the text and the music to be consistent.

References

[1] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

[2] Royal S. Brown. Herrmann, Hitchcock, and the music of the irrational. Cinema Journal, pages 14-49, 1982.

[3] Erion Çano and Maurizio Morisio. Quality of word embeddings on sentiment analysis tasks. In International Conference on Applications of Natural Language to Information Systems, pages 332-338. Springer, 2017.

[4] Darrell Conklin. Music generation from statistical models. In Proceedings of the AISB 2003 Symposium on Artificial Intelligence and Creativity in the Arts and Sciences, pages 30-35, 2003.

[5] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[6] Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pages 507-517. International World Wide Web Conferences Steering Committee, 2016.

[7] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

[8] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

[9] Patrik N. Juslin and Petri Laukka. Expression, perception, and induction of musical emotions: A review and a questionnaire study of everyday listening. Journal of New Music Research, 33(3):217-238, 2004.

[10] Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks. http://karpathy.github.io/2015/05/21/rnn-effectiveness/, May 2015.

[11] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[12] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[13] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.

[14] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music. arXiv preprint arXiv:1803.05428, 2018.

[15] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390, 2017.

[16] Ian Simon and Sageev Oore. Performance RNN: Generating music with expressive timing and dynamics. https://magenta.tensorflow.org/performance-rnn, 2017.

[17] Ilya Sutskever, James Martens, and Geoffrey E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017-1024, 2011.

[18] Michael Tang, Yip Chi Lap, and Ben Kao. Selection of melody lines for music databases. In Computer Software and Applications Conference, 2000 (COMPSAC 2000), The 24th Annual International, pages 243-248. IEEE, 2000.

[19] The MIDI Manufacturers Association. The complete MIDI 1.0 detailed specification. Tech. rep., The MIDI Manufacturers Association, Los Angeles, CA. https://www.midi.org/specifications/item/the-midi-1-0-specification, 1996-2014.

[20] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156-3164, 2015.

[21] Ronald J. Williams and David Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation: Theory, Architectures, and Applications, 1:433-486, 1995.

[22] Zack Zukowski and CJ Carr. Generating black metal and math rock: Beyond Bach, Beethoven, and Beatles. In 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.

A Poems used during the Experiment

Hope is the thing with feathers -
That perches in the soul -
And sings the tune without the words -
And never stops - at all -

And sweetest - in the Gale - is heard -
And sore must be the storm -
That could abash the little Bird
That kept so many warm -

I've heard it in the chillest land -
And on the strangest Sea -
Yet - never - in Extremity,
It asked a crumb - of me.

Poem 1: Emily Dickinson - Hope is the Thing with Feathers

You came to me this morning
And you handled me like meat
You'd have to be a man to know
How good that feels, how sweet
My mirrored twin, my next of kin
I'd know you in my sleep
And who but you would take me in
A thousand kisses deep

Poem 2: Leonard Cohen - A Thousand Kisses Deep

Stop all the clocks, cut off the telephone,
Prevent the dog from barking with a juicy bone,
Silence the pianos and with muffled drum
Bring out the coffin, let the mourners come.

Poem 3: WH Auden - Funeral Blues

I was a child and she was a child,
In this kingdom by the sea:
But we loved with a love that was more than love
I and my Annabel Lee;
With a love that the winged seraphs of heaven
Laughed loud at her and me.

Poem 4: EA Poe - Annabel Lee

Happy the man, and happy he alone,
He who can call today his own:
He who, secure within, can say,
Tomorrow do thy worst, for I have lived today.
Be fair or foul or rain or shine
The joys I have possessed, in spite of fate, are mine.
Not Heaven itself upon the past has power,
But what has been, has been, and I have had my hour.

Poem 5: John Dryden - Happy the Man

So I would have had him leave,
So I would have had her stand and grieve,
So he would have left
As the soul leaves the body torn and bruised,
As the mind deserts the body it has used.
I should find
Some way incomparably light and deft,
Some way we both should understand,
Simple and faithless as a smile and a shake of the hand.

Poem 6: TS Eliot - The Weeping Girl

LOVE

Poem 7

HATE

Poem 8

B Table of the Results

              P1  P2  P3  P4  P5  P6  P7   Average correct per question
Q1             1   0   0   0   0   1   0   0.29
Q2             1   0   0   0   0   0   0   0.14
Q3             1   0   1   0   1   1   1   0.71
Q4             1   0   0   0   0   1   0   0.29
Q5             0   1   0   1   0   1   1   0.57
Q6             0   1   0   0   1   0   1   0.43
Q7             1   0   0   1   1   0   1   0.57
Q8             0   1   0   0   1   1   1   0.57
Q9             1   0   0   0   0   0   0   0.14
Q10            0   0   1   1   1   0   0   0.43
Q11            0   0   1   1   0   0   0   0.29
Q12            0   1   1   0   1   1   1   0.71
Q13            1   1   1   1   0   1   1   0.86
Q14            0   0   0   0   1   0   0   0.14
Q15            1   0   1   0   1   0   1   0.57
Total correct
per expert     8   5   6   5   8   7   8

Table 1: Results of the experiment performed with seven participants.
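For reference, the per-question accuracies and per-expert totals in Table 1, as well as the ANOVA reported in Section 5, can be recomputed with a short script. It assumes the ANOVA was a one-way test across the fifteen questions with the seven binary responses per question as groups; the thesis does not state the exact procedure, so the resulting p-value may not reproduce 0.053 exactly.

```python
from scipy import stats

answers = {  # 1 = correct, 0 = incorrect, per participant P1..P7 (Table 1)
    "Q1":  [1, 0, 0, 0, 0, 1, 0], "Q2":  [1, 0, 0, 0, 0, 0, 0],
    "Q3":  [1, 0, 1, 0, 1, 1, 1], "Q4":  [1, 0, 0, 0, 0, 1, 0],
    "Q5":  [0, 1, 0, 1, 0, 1, 1], "Q6":  [0, 1, 0, 0, 1, 0, 1],
    "Q7":  [1, 0, 0, 1, 1, 0, 1], "Q8":  [0, 1, 0, 0, 1, 1, 1],
    "Q9":  [1, 0, 0, 0, 0, 0, 0], "Q10": [0, 0, 1, 1, 1, 0, 0],
    "Q11": [0, 0, 1, 1, 0, 0, 0], "Q12": [0, 1, 1, 0, 1, 1, 1],
    "Q13": [1, 1, 1, 1, 0, 1, 1], "Q14": [0, 0, 0, 0, 1, 0, 0],
    "Q15": [1, 0, 1, 0, 1, 0, 1],
}

per_question = {q: sum(a) / len(a) for q, a in answers.items()}  # e.g. Q13 -> 6/7 ~ 0.86
per_expert = [sum(col) for col in zip(*answers.values())]        # 8, 5, 6, 5, 8, 7, 8
f_stat, p_value = stats.f_oneway(*answers.values())              # one-way ANOVA across questions
print(per_question, per_expert, p_value)
```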