LSTM Neural Style Transfer in Music Using Computational Musicology


Jett Oristaglio, Dartmouth College, June 4, 2017

1. Introduction

In the 2016 paper "A Neural Algorithm of Artistic Style," Gatys et al. discovered that by modifying CNNs trained on object recognition, they could separate style from content in images. [i] This concept of content/style separation allowed them to merge the style of a well-known artwork with the content of an ordinary photo, creating artistic variations on classic art while keeping the original content recognizable. This process is known as neural style transfer.

Thus far, neural style transfer has mostly been applied to blending images with CNNs. My project attempts to tackle the problem of neural style transfer in the domain of music. In music, the content of a song can be approximated by the melody line, and the style can be approximated by the harmonies around the melody. I first use a computational musicology library to separate melody from harmony, then train an LSTM [ii] to predict the classical harmonies that accompany classical melody lines. The trained network is then fed jazz melody lines in order to generate classical harmonies over jazz melodies, effectively performing neural style transfer in music with LSTMs.

2. Related Work

This project builds on the concepts introduced by Gatys et al. in the paper mentioned above. The idea of content/style separation is the basis of neural style transfer, although the domain they applied it to is unrelated to music.

In the field of music, there is ongoing work on pure music generation. DeepJazz is a neural network that uses a 2-layer LSTM to generate music based on a single jazz song by training only on that song. [iii] It learns the general structure, chords, and melody of the song and creates similar (but not quite identical) samples. Project Magenta is an open-source project by Google to advance the state of the art in music generation, and includes multiple models that generate music using diverse methods. [iv] Both of these projects focus on pure music generation and do not attempt neural style transfer.

There is one prior attempt at neural style transfer in music, which uses raw audio converted into a 2D spectrogram image via the Short-Time Fourier Transform. [v] It applies Gatys et al.'s standard CNN-based neural style transfer to the 2D spectrogram. This leads to perceptually poor results: garbled songs that sound as if they are being played on top of each other. There does not appear to be any previous attempt to use LSTMs for neural style transfer in music.

3. Data and Methods

Data Representation

The first design choice was how to represent musical events (i.e., the notes being played at each time step) as vectors for the network. A sequence of musical events is represented as a tensor x in {0, 1}^(N x P), where N is the number of time steps and P is the number of possible pitches. On-notes are recorded as binary (hot) encodings: each row represents a time step, and each column represents a specific pitch. A single note at a single time step is a one-hot vector over the P pitches; for instance, with 12 possible pitches (say, middle C to middle B), a single note could be written as [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]. A full harmony at one time step is a multi-hot vector such as [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0], and a full sequence of length N over 12 pitches is an N x 12 matrix of such rows.

Content/Style Separation

Having decided how to represent a song as a tensor, the central problem of the project is to represent content separately from style. I approximate musical content as the melody line of the song: when one thinks of "Ode to Joy," one does not imagine the low cello harmonies, but rather the soprano violin melody. The style of the song can be approximated by the harmony that surrounds the melody: jazz chords are very distinctive from classical harmonies, and the exact same melody can yield the same song in two different styles depending on the surrounding harmony.

To separate melody from harmony, I used the computational musicology Python library Music21. [vi] It can parse MIDI files, automatically translating them into Stream objects that contain each musical voice as a continuous stream of notes. I iterate through each song object, find the highest notes at each time step and add them to a separate melody stream, and add the accompanying lower notes to a separate harmony stream. I am then left with one stream that represents the content of the song (the melody) and another that represents its style (the harmony).
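A minimal sketch of how such a melody/harmony split might look with Music21 is shown below. It is a reconstruction under assumptions rather than the author's code: chordify() is used to collapse the voices into chords, the highest pitch of each chord is taken as melody, and the remaining pitches form the harmony; all function and variable names are illustrative.

```python
from music21 import converter, stream, note, chord

def split_melody_harmony(midi_path):
    """Split a MIDI file into a melody stream (top notes) and a harmony
    stream (everything below), as approximated in this project."""
    score = converter.parse(midi_path)
    chords = score.chordify()                    # collapse all voices into chords
    melody, harmony = stream.Part(), stream.Part()
    for c in chords.flatten().getElementsByClass(chord.Chord):
        pitches = sorted(c.pitches, key=lambda p: p.ps)   # low -> high
        top = note.Note(pitches[-1])             # highest note goes to the melody
        top.quarterLength = c.quarterLength
        melody.insert(c.offset, top)
        if len(pitches) > 1:                     # everything below goes to the harmony
            lower = chord.Chord(pitches[:-1])
            lower.quarterLength = c.quarterLength
            harmony.insert(c.offset, lower)
    return melody, harmony
```

Each returned stream could then be written back out (e.g. melody.write('midi', 'melody.mid')) to inspect the separation by ear.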

Data Preprocessing

In the data preprocessing step, I first parse each of the MIDI files into Stream objects using the Music21 library. I transpose each song to the key of C major (or A minor if it is in a minor key, which uses the same notes as C major), and I restrict the number of octaves in order to limit the number of possible pitches. I then break the stream down into its component melody and harmony streams, saving them separately as the content and style representations of the song.

Finally, I translate these separated stream objects into vectorized inputs, where each song is represented by a tensor x as described above. I then splice n random samples from each song, each of dimensionality L x P, where L is a hyperparameter denoting a uniform sequence length and P is a hyperparameter denoting the total number of possible pitches. This produces a batch tensor of size [n, L, P] for each song. Finally, the samples from all songs are joined into a full batch of size [T, L, P], where T = n * (total number of songs).
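A sketch of this vectorization and sampling step is shown below, using NumPy piano rolls. The time resolution (step), the lowest MIDI pitch, and all identifiers are illustrative assumptions; the defaults n = 20, L = 256, and P = 60 are the values reported later in the paper.

```python
import numpy as np

def stream_to_roll(part, n_pitches=60, lowest_midi=36, step=0.25):
    """Turn a Music21 stream into a multi-hot piano roll of shape [N, P].
    step is the time resolution in quarter notes; lowest_midi and step are
    assumed values, not taken from the paper."""
    n_steps = int(part.highestTime / step) + 1
    roll = np.zeros((n_steps, n_pitches), dtype=np.float32)
    for el in part.flatten().notes:               # Notes and Chords
        start = int(el.offset / step)
        end = start + max(1, int(el.quarterLength / step))
        for p in el.pitches:
            idx = p.midi - lowest_midi
            if 0 <= idx < n_pitches:              # pitches outside the range are dropped
                roll[start:end, idx] = 1.0
    return roll

def sample_segments(roll, n_samples=20, seq_len=256, rng=None):
    """Splice n random fixed-length segments of shape [L, P] from one song,
    giving a per-song batch of shape [n, L, P]."""
    rng = rng or np.random.default_rng()
    if roll.shape[0] < seq_len:                   # pad short songs with silence
        pad = np.zeros((seq_len - roll.shape[0], roll.shape[1]), roll.dtype)
        roll = np.concatenate([roll, pad], axis=0)
    starts = rng.integers(0, roll.shape[0] - seq_len + 1, size=n_samples)
    return np.stack([roll[s:s + seq_len] for s in starts])
```

The per-song batches from every song would then be concatenated along the first axis into the full [T, L, P] batch.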

Here is an example of the content and style separation in action (score figures omitted): the combined song, the melody stream only, and the chord stream only.

Neural Style Transfer

Once the two tensors containing the full batches of sampled melodies and harmonies are built, the groundwork is set for neural style transfer. The full dataset consists of the batch of sampled melody tensors as the network input and the batch of harmony tensors as the ground truth. An RNN (more specifically, an LSTM) is given sequences of melody vectors as input and is trained to output predictions for the sequence of surrounding harmonies. The model thereby learns to act as a function that predicts a full harmony from a simple melody. If it is trained on a corpus from one genre and then fed the melody lines of songs from a different genre, I hypothesize that it will generate harmonies from the original genre around the new melodies, effectively implementing neural style transfer for music: taking the content of one song and changing its style.

Dataset

I chose to train the network on classical music and to attempt neural style transfer on jazz melodies. For training, I used the Classical Piano MIDI dataset at www.piano-midi.de, a small corpus of 89 classical songs translated into full MIDI files. [vii] There are other classical datasets, but they are often poorly formatted or badly translated from the original songs; I found the best results by taking many subsamples of this well-formed dataset. For the actual neural style transfer using jazz melodies, I used the Weimar Jazz Database at jazzomat.hfm-weimar.de/dbformat/dbcontent.html, a corpus of 456 jazz songs with only their melody lines translated into MIDI files. [viii]

These are small datasets, which was a challenge for the project. Unfortunately, there are very few high-quality MIDI datasets for jazz or classical music; many are poorly translated or have other problems that cause complications for a network (for instance, many songs encode the percussion as being played by a piano, which introduces noise and poor formatting into the note stream).

Loss Function and Training

The loss function used in the network was simple binary cross-entropy, which penalizes false positives as well as false negatives:

L = -(1 / (N * P)) * sum over t, p of [ y_tp * log(y^_tp) + (1 - y_tp) * log(1 - y^_tp) ]

Unfortunately, this loss function caused a problem: because there are so many more off-notes than on-notes in any given time step, the network was pushed toward extremely low prediction values. I hypothesize that with a reweighted loss function the network would achieve significantly better results. However, despite best efforts I was unable to find a way to implement reweighting successfully without porting the network to PyTorch and using dynamic graph computation, so I instead used simple unweighted cross-entropy.

4. Model Design and Implementation

Model Architecture

I tried many different design choices in order to produce perceptually compelling generated music with this method of neural style transfer. For the broader network architecture, I used a 2-layer LSTM with 256 nodes in each hidden layer and 0.5 dropout. The number of possible pitches was restricted to 60, so the output layer was a densely connected layer with 60 nodes and a sigmoid activation. The input is a batch of melody vectors, and the output is a predicted batch of harmony vectors.
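The paper does not name the framework used to implement this architecture (PyTorch is mentioned only as a possible porting target). Below is a minimal sketch of the described model in Keras, assuming a sequence-to-sequence setup in which one multi-hot harmony vector is predicted per melody time step; the return_sequences choice and all identifiers are my assumptions, while the sequence length 256 and the 60 pitches come from the paper.

```python
from tensorflow.keras import layers, models, optimizers

SEQ_LEN = 256    # uniform sequence length L reported in the paper
N_PITCHES = 60   # restricted pitch range P reported in the paper

def build_model():
    """2-layer LSTM (256 units each, 0.5 dropout) with a per-step
    sigmoid output layer over the 60 possible pitches."""
    model = models.Sequential([
        layers.LSTM(256, return_sequences=True, dropout=0.5,
                    input_shape=(SEQ_LEN, N_PITCHES)),
        layers.LSTM(256, return_sequences=True, dropout=0.5),
        layers.TimeDistributed(layers.Dense(N_PITCHES, activation='sigmoid')),
    ])
    # initial learning rate from the reported schedule; the weight decay and the
    # epoch-wise schedule would need an AdamW optimizer / LR scheduler callback
    model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
                  loss='binary_crossentropy', metrics=['binary_accuracy'])
    return model
```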

Training and Hyperparameters

For training, I used Adam gradient descent optimization [ix] with a weight decay of 0.001 and a variable learning rate over 50 epochs: 0.01 for the first 10 epochs, then 0.001 for the next 25, and 0.0005 until the end of training. The network trained for 50 epochs, making predictions on a jazz vector every 5 epochs throughout training. It trained with a batch size of 32 and used 20 samples per song, each with a uniform sequence length of 256. I will explain in the next section what each step in the sequence represented.

5. Experiments and Results

Format of Results

Because the goal of the network is to generate perceptually compelling music through neural style transfer, the usual training statistics on classical music, such as accuracy and loss curves, are not very informative about the network's performance at test time on jazz melodies. Most of the networks achieved extremely high accuracy and low loss on both the training and validation sets; unfortunately, this did not help the network generate good predictions, because the predictions mostly hovered close to 0 due to the weighting problem. Almost all of the training runs looked similar when graphed (training curves omitted).

By manually adjusting the way predictions were translated to MIDI, I was able to generate some predictions. I implemented both a manually adjustable cutoff for turning predictions into notes and a method for sampling notes from the predictions. For the most part, however, this was an unsuccessful endeavor: the cutoff points required heavy hand-tweaking and generally predicted common classical notes across the whole song, with little variation in time or note choice. The sampling, on the other hand, tended to produce scattered and entirely chaotic results.

To try to improve the results, I also implemented two different ways to build the harmony vectors. In the first, each step in the sequence represents a single uniform time step through the song, adjustable by a hyperparameter. In the second, each sequential step represents the jump to the next melody note, and the harmonies are forced to snap to the melody notes. The second approach generated perceptually improved results, because it keeps the timing of the original jazz melodies; but the network did not learn the timing on its own, so this represents another form of hand-tweaking for perceptual quality.
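A minimal sketch of the two decoding schemes just described (a fixed cutoff and independent sampling) is shown below, assuming the network outputs an [L, P] array of sigmoid values; the NumPy implementation and function names are my assumptions, not the author's code.

```python
import numpy as np

def cutoff_decode(pred, cutoff=0.065):
    """Manually adjustable cutoff: any sigmoid output above the threshold
    becomes an on-note (1), everything else an off-note (0).
    0.065 is one of the cutoff levels reported in the results."""
    return (pred > cutoff).astype(np.float32)

def sample_decode(pred, rng=None):
    """Sampling variant: treat each output as an independent Bernoulli
    probability that the corresponding pitch is on at that time step."""
    rng = rng or np.random.default_rng()
    return (rng.random(pred.shape) < pred).astype(np.float32)
```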

Examples of Results

As stated before, the results were extremely poor. Even with hand-selected cutoffs, the network tended to find mostly degenerate solutions with the same notes playing repetitively, and most of my attempts at tweaking the network actually led to worse results. The network occasionally made predictions that sounded like some level of classical harmony over a jazz melody, but they required too much hand-selection to be particularly compelling. The vast majority of predictions were highly repetitive and perceptually uncompelling. Attached are embedded audio results from several experiments (audio not reproduced here):

Example 1. Vectorized by: time; time subdivision: 4; cutoff level: 0.065.
Example 2. Vectorized by: note; cutoff level: 0.15.

Unfortunately, almost all of the results sound very similar to these, which makes sharing a broad range of results somewhat redundant.

6. Conclusion and Discussion

Despite the poor results, I believe that neural style transfer is possible with this methodology and architecture. However, it would require successfully implementing a weighted loss function to counteract the heavy imbalance between off-notes and on-notes; the squeezing of predictions to near zero made it difficult to get any perceptually compelling results. The data preprocessing that initially separated content from style, on the other hand, was highly effective. With more time, future work could involve modifying the loss function to weight on-notes equally with off-notes. This may lead to perceptually compelling neural style transfer for music using computational musicology preprocessing and LSTM inference.
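As one concrete reading of that proposed fix, the sketch below shows a weighted binary cross-entropy in which on-notes are up-weighted. It assumes a Keras/TensorFlow setup like the one sketched earlier, and the pos_weight value is purely illustrative.

```python
import tensorflow as tf

def weighted_bce(pos_weight=20.0):
    """Binary cross-entropy that up-weights on-notes (labels equal to 1).
    pos_weight is an assumed hyperparameter, e.g. roughly the ratio of
    off-notes to on-notes in the training piano rolls."""
    def loss(y_true, y_pred):
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        per_note = -(pos_weight * y_true * tf.math.log(y_pred)
                     + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        return tf.reduce_mean(per_note)
    return loss

# usage (hypothetical): model.compile(optimizer='adam', loss=weighted_bce(20.0))
```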

References

[i] Gatys, L., Ecker, A., & Bethge, M. (2016). A Neural Algorithm of Artistic Style. Journal of Vision, 16(12), 326. http://dx.doi.org/10.1167/16.12.326
[ii] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780. http://dx.doi.org/10.1162/neco.1997.9.8.1735
[iii] Kim, Ji-Sung. (April 2016). DeepJazz. GitHub repository. https://github.com/jisungk/deepjazz
[iv] Google (2017). Project Magenta. https://magenta.tensorflow.org/welcome-to-magenta
[v] Ulyanov, D., & Lebedev, V. (2017). Audio texture synthesis and style transfer. https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer
[vi] Cuthbert, M., et al. (2017). Music21 Python library. MIT. http://web.mit.edu/music21/
[vii] Krueger, B. (2017). Classical Piano Midi Page. Retrieved 17 April 2017, from http://www.piano-midi.de/
[viii] Weimar Jazz Database (2017). Retrieved 17 April 2017, from http://jazzomat.hfm-weimar.de/dbformat/dbcontent.html
[ix] Kingma, D., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. Published as a conference paper at ICLR 2015. arXiv:1412.6980