LSTM Neural Style Transfer in Music Using Computational Musicology

Jett Oristaglio
Dartmouth College, June 4 2017

1. Introduction

In the 2016 paper A Neural Algorithm of Artistic Style, Gatys et al. showed that CNNs trained on object recognition can be modified to separate style from content in images. [i] This concept of content/style separation allowed them to merge the style of well-known artwork with the content of a regular photo, creating artistic variations on classic art that kept the original content recognizable. This process is known as neural style transfer.

Thus far, neural style transfer has mostly been limited to the blending of images using CNNs. My project attempts to tackle the problem of neural style transfer in the domain of music. In music, the content of a song can be approximated by the melody line, and the style can be approximated by the harmonies around the melody. I first used a computational musicology library to separate melody from harmony in songs, and then used an LSTM to predict the classical harmonies that accompany classical melody lines. [ii] The network is then fed jazz melody lines in order to generate classical harmonies over jazz melodies, effectively resulting in neural style transfer in music using LSTMs.

2. Related Work

This project is based on concepts introduced by Gatys et al. in the above-mentioned paper on neural style transfer. The idea of content/style separation is the basis of neural style transfer, although the domain that they applied it to is unrelated to music.

In the field of music, there is work being done in pure music generation. DeepJazz is a neural network that uses a 2-layer LSTM to generate music based on a jazz song by training only on that song. [iii] It learns the general structure, chords, and melody of the song and creates similar (but not quite identical) samples.
Project Magenta is an open-source project by Google to advance the state of the art in music generation, and includes multiple models that generate music using diverse methods. [iv] Both of these projects focus on pure music generation and do not attempt neural style transfer.

There is one attempt at neural style transfer in music, which converts raw audio to a 2D spectrogram image using the short-time Fourier transform. [v] The authors then apply the regular CNN-based neural style transfer of Gatys et al. to the 2D spectrogram. This leads to perceptually poor results when neural style transfer is attempted: garbled songs that sound like they are played on top of each other. There does not appear to be a previous attempt to use LSTMs for neural style transfer in music.
3. Data and Methods

Data Representation

The first design choice involved choosing how to represent musical events (i.e. each note being played at a time step) as a vector for the network. To represent a sequence of musical events, I use a tensor $x \in \{0, 1\}^{N \times P}$, where $N$ is the number of time steps and $P$ is the number of possible pitches. On-notes are recorded as hot encodings, with each row representing a time step and each column representing a specific note on the scale.

A single note in a single time step is represented as a one-hot vector. With 12 columns, such a vector represents a single note among 12 possible pitches, for example middle C to middle B. A full harmony in this time step is represented by a multi-hot vector, with a 1 for every note that sounds, and a full sequence of length N with 12 pitches is represented by a matrix of dimensionality N x 12.

Content/Style Separation

Now that we have decided how to represent the song as a vector, the central problem of the project is to find a way to represent content separately from style. To do this, I approximate musical content as the melody line of the song; when one thinks of the song Ode to Joy, one doesn't imagine the low cello harmonies, but rather the soprano violin melodies. The style of the song can be approximated by the harmony that surrounds the melody; jazz chords are very distinct from classical harmonies, and the exact same melody can yield the same song in two different styles depending on the surrounding harmony.

In order to separate melody from harmony, I used the computational musicology Python library Music21. [vi] It can parse MIDI files, automatically translating them into Stream objects that contain each musical voice as a continuous stream of notes. I then use this library to iterate through each song object, finding the high notes at each time step and adding them to a separate melody stream, and adding the accompanying lower notes to a separate harmony stream.
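As a rough illustration of the representation and the melody/harmony split described above, here is a minimal pure-Python sketch. It is not the project's Music21 code: the song is simplified to a hypothetical list of time steps, each holding the pitch indices that sound, and the function names (`encode_step`, `split_melody_harmony`, `encode_sequence`) are invented for this example. It also assumes that the melody is the single highest pitch at each step.

```python
# Hypothetical simplification: a song is a list of time steps, each a list of
# sounding pitch indices (0 = C, ..., 11 = B). The project itself worked on
# Music21 Stream objects rather than plain lists.
NUM_PITCHES = 12

def encode_step(pitches, num_pitches=NUM_PITCHES):
    """Return one row of the tensor: 1 for each sounding pitch, 0 elsewhere."""
    row = [0] * num_pitches
    for p in pitches:
        row[p] = 1
    return row

def split_melody_harmony(steps):
    """Assumption: highest pitch per step is melody, the rest is harmony."""
    melody, harmony = [], []
    for pitches in steps:
        if pitches:
            top = max(pitches)
            melody.append([top])
            harmony.append([p for p in pitches if p != top])
        else:  # a rest: nothing sounds in either stream
            melody.append([])
            harmony.append([])
    return melody, harmony

def encode_sequence(steps, num_pitches=NUM_PITCHES):
    """Stack the per-step rows into an N x num_pitches matrix (list of lists)."""
    return [encode_step(p, num_pitches) for p in steps]

# C, E and G sounding together, then a lone E, then a rest.
song = [[0, 4, 7], [4], []]
melody_steps, harmony_steps = split_melody_harmony(song)
melody_matrix = encode_sequence(melody_steps)    # at most one 1 per row
harmony_matrix = encode_sequence(harmony_steps)  # possibly several 1s per row
```

In this toy song the first melody row is one-hot at G, while the first harmony row is multi-hot at C and E.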
I am then left with a stream that represents the content of the song in the form of melody, and another stream that represents the style of the song as the harmony.

Data Preprocessing

In the data preprocessing step, I first parse each of the MIDI files into Stream objects using the Music21 library. I transpose each song to the key of C (or A minor if it is in a minor scale, which uses the same notes as C). I restrict the number of octaves in order to limit the number of possible pitches. I then break down the stream into its component melody and harmony streams, saving the streams separately as the content and style representations of the song. Finally, I translate these separated stream objects into vectorized inputs, where each song is represented by a tensor x as described above.

I then splice n random samples from each song, each with a dimensionality of L x P, where L is a hyperparameter denoting a uniform sequence length and P is a hyperparameter denoting the total number of possible pitches. These samples are built into a batch tensor of size [n, L, P] for each song. Finally, the samples from all songs are joined together in a full batch of size [T, L, P], where T = n * (total number of songs).

Here is an example of this content and style separation in action:

[Figure: score excerpts showing the combined song, the melody stream only, and the chord stream only]

Neural Style Transfer

Once the two tensors containing the full batch of sampled melodies and harmonies are built, the groundwork is set for neural style transfer. The full dataset consists of the batch of sampled melody tensors as the network input and the batch of harmony tensors as the network ground truth. An RNN (or more specifically, an LSTM) is given sequences of melody vectors as input, and is trained to output predictions for the sequence of surrounding harmonies. The model learns to act as a function that predicts full harmonies from a simple melody. If it is trained on a corpus from one genre and is then fed the melody lines from songs of a different genre, I hypothesize that it will generate harmonies from the original genre around the new melodies, effectively implementing neural style transfer for music: taking the content of one song and changing its style.

Dataset

I chose to train the network on classical music, and attempt neural style transfer on jazz melodies. I used the Classical Midi Dataset at www.piano-midi.de/, which is a small corpus of 89 classical songs translated into full MIDI files.
[vii] There are other classical datasets, but they are often poorly formatted and bad translations of the original songs; I found the best results by taking many subsamples of this well-formed dataset. For the actual neural style transfer using jazz melodies, I use the Weimar Jazz Solo Database at jazzomat.hfmweimar.de/dbformat/dbcontent.html. [viii] It is a corpus of 456 jazz songs with just their melody lines translated into MIDI files.

These are small datasets, which was a challenge for the project. Unfortunately, there are very few high-quality MIDI datasets for jazz or classical music: many of them are poorly translated, or have other problems that cause complications for a network (for instance, many songs encode the percussion as being played by a piano, which introduces noise and poor formatting into the note stream).

Loss Function and Training

The loss function used in the network was simple binary cross-entropy, which penalizes false positives as well as false negatives. In its standard form, averaged over the L time steps and P pitches of a sample with ground truth $y$ and prediction $\hat{y}$:

$$\mathcal{L}(y, \hat{y}) = -\frac{1}{L P} \sum_{t=1}^{L} \sum_{p=1}^{P} \left[ y_{t,p} \log \hat{y}_{t,p} + (1 - y_{t,p}) \log (1 - \hat{y}_{t,p}) \right]$$

4. Model Design and Implementation

Model Architecture

I tried many different design choices in order to produce perceptually compelling generated music with this method of neural style transfer. For the broader network architecture, I used a 2-layer LSTM with 256 nodes in each hidden layer and 0.5 dropout. The number of possible pitches was restricted to 60, so the output layer was a densely connected layer with 60 nodes and a sigmoid activation. The input is a batch of melody vectors, and the output is a predicted batch of harmony vectors.

Unfortunately, the binary cross-entropy loss caused a problem: because there are so many more off-notes than on-notes in any given time step, the network was pushed to extremely low prediction values. I hypothesize that with a reweighted loss function, the network would achieve significantly better results. However, despite my best efforts, I was unable to find a way to implement reweighting without porting the network to PyTorch and using dynamic graph computations; I instead used simple unweighted cross-entropy.
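To make the imbalance problem concrete, the following sketch compares unweighted and reweighted binary cross-entropy on a single hypothetical time step. This is not the project's training code: the `bce` helper, the `pos_weight` parameter, the example predictions, and the choice of weight (the ratio of off-notes to on-notes) are all assumptions made for illustration.

```python
import math

def bce(y_true, y_pred, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy over a flat list of note predictions.

    pos_weight > 1 scales the penalty for missing an on-note; 1.0 gives the
    plain unweighted loss.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical time step: 60 pitches, only 3 of them on.
y_true = [1, 1, 1] + [0] * 57
near_zero = [0.05] * 60                 # "everything off" prediction
confident = [0.9] * 3 + [0.05] * 57     # actually finds the on-notes

# Illustrative weight (an assumption): ratio of off-notes to on-notes.
w = y_true.count(0) / y_true.count(1)

# How much the loss improves when the network finds the on-notes.
unweighted_gap = bce(y_true, near_zero) - bce(y_true, confident)
weighted_gap = bce(y_true, near_zero, w) - bce(y_true, confident, w)
```

In this toy setting the "everything off" prediction already scores a low unweighted loss, so there is little pressure to find the on-notes. The weight scales the gap between the two predictions by a factor of `w`.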
Training and Hyperparameters

For training, I used Adam gradient descent optimization [ix] with weight decay of 0.001 and a variable learning rate over 50 epochs: 0.01 for the first 10 epochs, then 0.001 for the next 25, and 0.0005 until the end of training. Every 5 epochs during training, the network made predictions on a jazz vector. It trained with a batch size of 32 and used 20 samples per song, each with a uniform sequence length of 256. I explain in the next section what each step in the sequence represented.

5. Experiments and Results

Format of Results

Because the goal of the network is to generate perceptually compelling music through neural style transfer, the statistics of training on classical music, such as accuracy and loss function graphs, are not highly informative as to the performance of the network at test time on jazz melodies. Most of the networks achieved extremely high accuracy and low loss on both the training and validation sets; unfortunately, this did not help the network generate good predictions, because the predictions mostly hovered close to 0 due to the weighting problem. Almost all of the training data looked like the following when graphed:

[Figure: graph of a typical training run]

By manually adjusting the way that predictions were translated to MIDI files, I was able to generate some level of predictions. I implemented both a manually adjustable cutoff for predicting notes and a method for sampling the notes from the predictions. However, for the most part it was an unsuccessful endeavor: the cutoff points required heavy hand-tweaking, and generally predicted common classical notes across the whole song with little variation in time or note choice. The sampling, on the other hand, tended to produce scattered and entirely chaotic results.

In order to try to improve the results, I also implemented two different ways to build the harmony vectors. In the first, each step in the sequence represented a single uniform time step through the song, whose length was adjustable by a hyperparameter.
In the second, each sequential step represented the jump to the next melody note, and the harmonies were forced to snap to the melody notes. This generated perceptually improved results because it kept the timing of the original jazz melodies, but the network did not learn the timing on its own, so this represents another form of hand-tweaking for perceptual quality.
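The two ways of turning network outputs into notes mentioned above, a fixed cutoff and sampling, can be sketched as follows. The function names and the example probabilities are hypothetical, and the sampling variant assumes each pitch is switched on independently with its predicted probability, which may differ from what the project actually did.

```python
import random

def decode_cutoff(probs, cutoff):
    """Turn on every pitch whose predicted probability reaches the cutoff."""
    return [[1 if p >= cutoff else 0 for p in step] for step in probs]

def decode_sample(probs, rng):
    """Assumption: turn on each pitch independently with its probability."""
    return [[1 if rng.random() < p else 0 for p in step] for step in probs]

# Hypothetical predictions for 2 time steps over 4 pitches, squeezed toward 0.
probs = [[0.02, 0.18, 0.07, 0.01],
         [0.03, 0.18, 0.08, 0.01]]

by_cutoff = decode_cutoff(probs, cutoff=0.065)   # a cutoff level from the results
by_sample = decode_sample(probs, random.Random(0))
```

With near-zero predictions like these, a low cutoff selects the same few pitches at every step, which is consistent with the repetitive output reported, while sampling scatters notes according to small probabilities.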
Examples of Results

As stated before, the results were extremely poor. Even with hand-selected cutoffs, the network tended to find mostly degenerate solutions with the same notes playing repetitively, and most of my attempts at tweaking the network actually led to worse results. The network occasionally made predictions that sounded like there was some level of classical harmony over jazz melody, but they required too much hand-selection to be particularly compelling. The vast majority of predictions were perceptually uncompelling and repetitive.

Attached is the embedded audio of the results of several experiments:

[Audio: Vectorized by: Time; Time subdivision: 4; Cutoff level: 0.065]
[Audio: Vectorized by: Note; Cutoff level: 0.15]

Unfortunately, almost all of the results sound very similar to this, which makes sharing a broad range of results somewhat redundant.

6. Conclusion and Discussion

Despite the poor results, I believe that neural style transfer is possible via this methodology and architecture. However, it would require successfully implementing a weighted loss function to counteract the heavy imbalance between off-notes and on-notes. The squeezing of predictions to near zero made it difficult to get any perceptually compelling results. However, the data preprocessing that initially separated content from style was highly effective. With more time, future work could involve modifying the loss function to weight on-notes equally with off-notes. This may lead to perceptually compelling neural style transfer for music using computational musicology preprocessing and LSTM inference.
References

[i] Gatys, L., Ecker, A., & Bethge, M. (2016). A Neural Algorithm of Artistic Style. Journal of Vision, 16(12), 326. http://dx.doi.org/10.1167/16.12.326
[ii] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. http://dx.doi.org/10.1162/neco.1997.9.8.1735
[iii] Kim, J.-S. (April 2016). DeepJazz. GitHub repository. https://github.com/jisungk/deepjazz
[iv] Google. (2017). Project Magenta. https://magenta.tensorflow.org/welcome-tomagenta
[v] Ulyanov, D., & Lebedev, V. (2017). Audio texture synthesis and style transfer. Dmitry Ulyanov. https://dmitryulyanov.github.io/audiotexture-synthesis-and-style-transfer
[vi] Cuthbert, M., et al. (2017). Music21 Python Library. MIT. http://web.mit.edu/music21/
[vii] Krueger, B. (2017). Classical Piano Midi Page. Piano-midi.de. Retrieved 17 April 2017, from http://www.piano-midi.de/
[viii] Weimar Jazz Database. (2017). Weimar Jazz Database. Retrieved 17 April 2017, from jazzomat.hfmweimar.de/dbformat/dbcontent.html
[ix] Kingma, D., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. Published as a conference paper at ICLR 2015. arXiv:1412.6980