
Generating Music with Recurrent Neural Networks

27 October 2017

Ushini Attanayake

Supervised by Christian Walder
Co-supervised by Henry Gardner

COMP3740 Project Work in Computing
The Australian National University

Declaration

I declare that, to the best of my knowledge, the following is entirely my own work and does not contain material produced by another person, except where cited otherwise.

Ushini Attanayake
October 2017

ACKNOWLEDGEMENTS

I would like to thank the following individuals for their contributions towards this project. Thank you to Dr. Christian Walder for his teachings, advice and support throughout the project. Thank you to Dr. Henry Gardner for his advice regarding fulfilling the course criteria and the approach to the project. Thank you to Dr. Peter Strazdins for running and curating the practice meetings.

ABSTRACT

Recurrent Neural Networks have been used to model various styles and representations of music. Currently, no sufficient model exists that has been trained on the irealb jazz corpus of textual **jazz files. When trained well, such a model could be used to generate novel jazz chord progressions, which the irealb player software can interpret to produce a variety of interesting progressions for jazz soloists to improvise over. This report contributes towards achieving such a model by training a Recurrent Neural Network with LSTM cells on the irealb jazz corpus. Two main training techniques were used: separating the root and extension of each chord, and transposing the files in the corpus. The test perplexity of the model trained on the dataset where each song was transposed through all keys and the root and extension of every chord were separated was 1.605, a significant improvement over the test perplexity of 5.177 obtained on the original data. No evidence of overfitting was found in any of the models.

Table of Contents

ACKNOWLEDGEMENTS
ABSTRACT
1. Introduction
   1.1 Literature Review
2. Background
   2.1 The Model
       2.1.1 Cross Validation
   2.2 The Data
   2.3 Music Theory
3. Experiments
   3.1 Overview
   3.2 Splitting the Root and Extension of Chords
       3.2.1 Method
   3.3 Transposing
       3.3.1 Transposing to a single key
       3.3.2 Transposing through all keys
       3.3.3 Hyperparameter Combinations
4. Results
   4.1 Splitting the Root and Extension of Chords
   4.2 Transposing to a Single Key
   4.3 Transposing Through All Keys
   4.4 Hyperparameter Combinations
   4.5 Test Scores
5. Conclusions
   5.1 Future Work
References
Appendix
   Appendix 1: Study Contract

List of Figures

Figure 1: **jazz file example

List of Graphs

Graph 1: No Split
Graph 2: Split
Graph 3: Single key transpose, no split
Graph 4: Single key transpose, split
Graph 5: All key transpose, no split
Graph 6: Hyperparameter Combinations
Graph 7: Test Scores

List of Tables

Table 1: Test Scores

1. Introduction

The use of Recurrent Neural Networks to generate music has been explored extensively across various genres, including models successfully trained on blues and jazz corpora that model chord and melody sequences in various representations. However, a sufficient model which can generate chord progressions in the text-based **jazz representation does not exist. Such a model has useful applications in helping jazz soloists improve their improvisational skills when used in conjunction with the irealb player, software which interprets chord charts and generates band accompaniments. Chord progressions generated by such a model can be interpreted directly by the irealb player because they are already in the **jazz representation. Since the irealb player produces accompaniments only for existing songs, a model capable of producing novel chord progressions gives a musician access to a variety of unique progressions to practice soloing over. This is particularly valuable in an experimental style like jazz, where improvisation plays a large role.

This project aims to contribute towards creating such a model by training an existing Recurrent Neural Network with LSTM cells on the irealb jazz corpus. Given an input sequence of musical elements, the network predicts the next element in the sequence and can therefore be used to predict an entire chord progression for a song. The aim was to obtain a model which can generate novel and meaningful chord progressions in the **jazz representation. Training techniques such as transposing and separating chords into roots and extensions were explored. These techniques reveal musical structure to the network on two levels of granularity: within chords and within progressions. Since revealing these structures generalises the information the Recurrent Neural Network interprets, it was hypothesised that using them would bring us closer to achieving the aim. For pedagogical reasons, this project focusses on the training process; future work will extend it by sampling from the model and producing chord progressions which the irealb player can turn into accompaniments.

The model's performance was evaluated using the perplexity on the test dataset, which indicates the level of randomness in the predictions the model makes. Cross validation was used to observe whether the model was overfitting the data. The main results show that transposing the dataset through all keys and separating the root and the extension of each chord in pre-processing significantly improved the test perplexity of the model. The test perplexity for this model was 1.605, and the results from cross validation showed no signs of overfitting.

1.1 Literature Review

The idea for this project was inspired by the work of Sturm, Santos and Korshunova [1], who used textual data in the ABC representation to train a deep Recurrent Neural Network to generate folk music. Their network was trained on a large amount of data (23,958 files) and was configured with 3 layers and 512 hidden units. Though the ABC representation does account for chords, Sturm, Santos and Korshunova do not explicitly model chords; in fact, one of their models was trained on a dataset from which multiple voices were removed. The ABC representation presents chords in a similar way to the **jazz representation, and the folk music corpus they used contains files which represent chord progressions. What separates the work in this project from theirs is that chord progressions are explicitly modelled here, and the training techniques used in this project explicitly harness underlying structures in chords and chord progressions. The work of Sturm, Santos and Korshunova was used as a general guideline for this project. Since the architecture of the neural network and the data representation used here are similar to theirs, the training process could be kept as the focal point of the project, which was after all intended to be a tool for learning. This project also differs from theirs in genre, as the network here was trained on jazz chord progressions.

There are a couple of significant works which have trained Recurrent Neural Networks on jazz and blues music. Eck and Schmidhuber [2] trained a network with LSTM cells on blues music and produced models capable of generating chords and melodies. However, like the work in [1], their training methods did not explicitly use the properties of chord structure. The data representation used by Eck and Schmidhuber is multi-voiced and has one unit per musical note, each with a binary state: ON if the note is present and OFF if it is absent. Chords and melodies therefore share the same representation. Franklin [3] uses a representation similar to PHCCCF, in which pitches are represented by 7 bits: the first 4 bits identify which major Circle of Thirds the note belongs to and the last 3 bits identify which minor Circle of Thirds it belongs to. Chords are represented by summing the 7-bit representations of the notes in the chord. Due to the complexity of this representation, Franklin omits complex chord tones from her experiments and focusses on two chord forms: the triad, and the triad plus the 7th. The simpler representation of chords in the **jazz files allows more complex chords to be modelled in this project.

2. Background

In order to explain the reasoning behind the techniques used to train the model and the methods used to evaluate the results, we will first look at how the network processes the data and how to interpret its output. Since the techniques also harness properties intrinsic to musical data, we will then briefly explain the **jazz representation and some basic music theory.

2.1 The Model

The model used was an existing Recurrent Neural Network with LSTM cells written in the TensorFlow framework. The network was not written from scratch, in order to keep the focus of the project on the training process. The network had previously been used to model natural language. It takes three plain text files as input, representing the train, validation and test datasets (more on this in the cross-validation section). Each LSTM cell takes as input a sequence of words and predicts the next word in the sequence. Each input sequence is n timesteps long, and each LSTM cell in the network processes m such input sequences in parallel. The timestep associated with a word indicates its position in the sequence. A cell processes its input sequence one word at a time and, after processing each word, feeds its cell state back into itself. The cell state captures what the cell has learnt about the sequence up to a given timestep. For example, when predicting the word at timestep t+1, the cell takes as input the word at timestep t and the cell state from timestep t-1. Through this feedback mechanism, the cell estimates the conditional probability below:

P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, X_2, \dots, X_{i-1})

That is, the probability of a sequence (X_1, X_2, ..., X_n) occurring equals the product of the probabilities that each element in the sequence occurs next, given the elements that precede it. This feedback mechanism is what makes LSTM cells particularly useful for modelling sequential data such as language and music: the nature, and therefore the sound, of a musical chord in a sequence depends on all the chords that came before it.

Before the network starts processing the data, the raw files must be separated into sequences of words. The algorithm used identifies any group of adjacent characters surrounded by whitespace as a word, and treats the new-line character \n as whitespace as well.
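As a rough illustration of this word-splitting and of how a vocabulary is built from it, the sketch below (in Python, with hypothetical file names; it is not the project's actual input pipeline) splits a plain-text dataset on whitespace, including new-lines, and maps each unique word to an integer id:

import collections

def read_words(path):
    # Any run of characters surrounded by whitespace counts as a word;
    # str.split() with no argument also treats "\n" as whitespace.
    with open(path) as f:
        return f.read().split()

def build_vocab(words):
    # Map each unique word to an integer id, most frequent words first.
    counts = collections.Counter(words)
    return {word: i for i, (word, _) in enumerate(counts.most_common())}

train_words = read_words("jazz_train.txt")   # hypothetical file name
vocab = build_vocab(train_words)
train_ids = [vocab[w] for w in train_words]  # the integer sequence the network consumes
print(len(vocab), "unique words in the vocabulary")

The size of this vocabulary is also the dimension of the output probability vector discussed next.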

Though the original network predicts the next word in the sequence, its output is represented as a probability vector whose dimension is the size of the vocabulary of the dataset, i.e. the collection of all unique words in the dataset. The cell generates an output sequence of predictions based on the input sequence it received, and this output sequence is compared to a target sequence. The targets are the elements of the input sequence shifted forward by one timestep. An error is calculated by comparing the sequence of predictions with the target sequence, and from this error the perplexity of the output is calculated. Perplexity is given by the following formula, where H(p) is the entropy of the distribution p:

\text{perplexity} = 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}

Entropy describes the randomness of a probability distribution. Since perplexity is a function of entropy, a low perplexity for a given iteration is equivalent to a low level of randomness in the predictions for that iteration. Perplexity was used as the performance measure because it captures this level of randomness: there are clear structures in music which we want the network to recognise, so we want predictions to be made in a systematic way rather than a random way.

2.1.1 Cross Validation

If we are to sample from the model, we want the network to generate sequences which are novel, so it is important to make sure the model is not overfitting. When the model overfits the data, it is more likely to generate sequences which are extremely similar to the songs it encountered in the training data. Cross validation gives us a way of gauging whether the model is overfitting. The dataset is split into 3 disjoint subsets: the train set, the validation set and the test set. The training set was chosen to be 60% of the entire dataset while the validation and test sets were 20% each, which is how the data is generally split for cross validation. The network processes the entire training set a certain number of times, known as the number of epochs; the configuration used for the jazz dataset had 13 epochs. After each epoch, the network is shown a fraction of the validation set, and the perplexity is calculated for both the training epoch and the validation data. This allows us to gauge the network's performance while it is training, because we are introducing data it has not previously been trained on throughout the training process. If the perplexity for the training data continues to improve (grow smaller) while the validation perplexity deteriorates (grows larger), we can assume the model is overfitting. Finally, the test dataset is introduced to the network and the perplexity on the test data is calculated. The test dataset is significantly larger than the fractions of validation data introduced during training. The test perplexity is my main performance measure, as it indicates how well the network can generate meaningful sequences when compared against a relatively large amount of data it has not trained on.
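As a minimal sketch of how the train, validation and test perplexities reported here relate to prediction quality (this is illustrative only, not the network's actual loss code), perplexity can be computed as 2 raised to the average number of bits of surprise per prediction, matching the 2^H(p) formula above:

import math

def perplexity(target_probs):
    # target_probs: the probability the model assigned to the correct next
    # word at each timestep.  Perplexity is 2 raised to the average
    # per-word cross-entropy in bits.
    bits = [-math.log2(p) for p in target_probs]
    return 2 ** (sum(bits) / len(bits))

print(perplexity([0.9, 0.8, 0.95]))   # confident predictions: close to 1
print(perplexity([0.10, 0.05, 0.08])) # near-random predictions: much larger

A model that guessed uniformly over its vocabulary would have a perplexity equal to the vocabulary size, which is why a lower perplexity indicates more systematic predictions.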

2.2 The Data

The irealb jazz corpus consists of 1,186 files in the text-based **jazz representation. The first few lines of each file describe the song; for example, they list the composer and the time signature. The body of each tune is a sequence of tokens and symbols representing bar lines. Each token encapsulates the duration of a chord and the chord itself, and the chord is made up of a root and an extension. For example, in the token 1C:maj7 the duration is 1, the root is C and the extension is maj7. The duration is in Humdrum reciprocal form, where the reciprocal of the duration value is the fractional duration of a bar. Each token and bar line appears on its own line in the file (see the figure below), so the network identifies each line as a word in the dataset. If the files were fed into the network without any pre-processing, the network would be predicting the likelihood of entire tokens appearing next in the sequence. This is undesirable, as the tokens can be broken down into more fundamental components; if these components are not treated as separate words in the dataset, the network is unable to explicitly learn the relationship between the duration, the root and the extension. Identifying this led to one of the main experiments conducted in the project: investigating whether separating the duration, root and extension would improve the test perplexity.

!!!OTL: Afro Blue
!!!COM: Santamaria, Mongo
!!!ODT: 1959
**jazz
*M3/4
*f:
2.F:min7
2.G:min7
4.A-:maj7
4.G:min7
2.F:min7
2.F:min7
2.G:min7
4.A-:maj7
4.G:min7
2.F:min7
2.E-
2.E-
4.D-
4.E-
2.F:min7
2.E-
2.E-
4.D-
4.E-
2.F:min7
*-
!!!EEV: irb Corpus 1.0
!!!YER: 26 Dec 2012
!!!EED: Daniel Shanahan and Yuri Broze
!!!ENC: Yuri Broze
!!!AGN: Jazz Lead Sheets
!!!PPP: http://irealb.com/forums/showthread.php?1580-jazz-1200-standards

Figure 1: **jazz file example

2.3 Music Theory

The root of a chord in the **jazz representation takes on the value of a note name. There are 7 unique note names in music, namely A, B, C, D, E, F, G. Adjacent note names are, in general, separated in pitch by a whole step. Each note can also be sharpened by moving up in pitch by a half step, or flattened by moving down in pitch by a half step. In the **jazz representation a sharpened note is suffixed with a # symbol and a flattened note is suffixed with a - symbol. In some cases a note at a certain pitch has two different note names. For example, A and B are separated by a whole step: A# is a half step up from A and B- is a half step down from B, so A# and B- represent the same pitch. The following is an exhaustive set of note names corresponding to the 12 distinct pitches:

{A, A# or B-, B, C, C# or D-, D, D# or E-, E, F, F# or G-, G, G# or A-}

Each of these note names has a corresponding scale. A scale is a sequence of notes ordered by pitch, and the degree of a note in a scale identifies its position in this sequence. Since a root in the **jazz representation identifies a note name (and so a corresponding scale), any chord with that root can be expressed as a set of notes played simultaneously, and the notes played alongside the root can be identified purely by their degree in the root's scale. This means we can represent a chord solely by its root and a set of degrees in the root's scale. For example, take the chord C major, which is rooted at C. The basic C major triad consists of the notes C, E and G. In the scale of C, the degree of C is 1, the degree of E is 3 and the degree of G is 5, so an equivalent representation of the chord is C:135.

The extension of a chord in the **jazz representation describes a chord in a similarly relative way. The absence of an extension implies the chord is a triad, consisting of the 1st, 3rd and 5th degrees. Other examples of extensions are min7, maj7 and 7#9. Though these extensions only implicitly identify the degrees in a chord, what matters is that they do not describe absolute note names. Having a relative description of the notes in a chord (rather than an absolute one) gives us a way of expressing the general, and therefore fundamental, way chords are structured. This observation is essential for understanding the reasoning behind the two training techniques that were explored.
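To make the enharmonic equivalences above concrete, here is a small illustrative mapping (the names are hypothetical, not taken from the project's code) from note names to the 12 distinct pitches; names such as A# and B- share a pitch:

# Every note name used in the **jazz corpus, mapped to one of 12 pitch classes.
NOTE_TO_PITCH_CLASS = {
    "A": 0, "A#": 1, "B-": 1, "B": 2, "C": 3, "C#": 4, "D-": 4,
    "D": 5, "D#": 6, "E-": 6, "E": 7, "F": 8, "F#": 9, "G-": 9,
    "G": 10, "G#": 11, "A-": 11,
}

print(NOTE_TO_PITCH_CLASS["A#"] == NOTE_TO_PITCH_CLASS["B-"])  # True: one pitch, two names

A mapping of this kind is also why the transposition functions described in section 3.3 use dictionaries rather than plain arrays.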

3. Experiments

3.1 Overview

The experiments investigate the effect certain training techniques have on the network's performance. The main techniques involve changing the representation of the data and artificially increasing the size of the corpus. The first technique changes the representation so that the root and extension of a chord are treated as separate elements. The second technique transposes each song in the dataset to one of two keys; we will refer to the key a song is transposed to as the destination key. If the original key of a song was major, the destination key was C major; if it was minor, the destination key was A minor. C major was chosen arbitrarily as the major destination key, while A minor was chosen as the minor destination key because it is the relative minor of C. The third technique artificially increases the dataset by representing each tune in all 12 keys by way of transposition. The first technique aimed to teach the network the relationship between the root and the extension, since any extension can appear next to any root. Both transposition techniques aimed to teach the network that a chord progression is independent of the key of a song, i.e. of the pitch context the song is in. The 6 combinations of these techniques were explored by generating the 6 datasets listed below, and the network was trained on each of them, resulting in 6 models:

a) Original data with the duration and chord separated
b) Original data with the duration, root and extension separated
c) Each tune transposed to C major or A minor, with the duration and chord separated
d) Each tune transposed to C major or A minor, with the duration, root and extension separated
e) Each tune transposed through all keys, with the duration and chord separated
f) Each tune transposed through all keys, with the duration, root and extension separated

Cross validation was conducted on the predictions made by each model, and the test perplexities were compared across all 6 models. It was expected that the model with the lowest test perplexity would be the most capable of generating realistic chord progressions. One of the original objectives was to experiment, on the best model, with hyperparameter values that dictate the number of hidden layers and hidden units, since a particular combination of values could have reduced the test perplexity even further. Unfortunately, I was unable to investigate the effect of these hyperparameter values on the best model, because it was trained on dataset f), which is 12 times larger than the original dataset; training became very slow and data collection was not completed in time. However, I was able to try these hyperparameter combinations on the original data.

I restricted the experiments to the hyperparameters which dictate the number of layers in the network and the number of hidden units, and the following combinations were used:

1. 1 hidden layer, 300 hidden units
2. 2 hidden layers, 300 hidden units
3. 3 hidden layers, 300 hidden units
4. 1 hidden layer, 600 hidden units
5. 2 hidden layers, 600 hidden units
6. 3 hidden layers, 600 hidden units

3.2 Splitting the Root and Extension of Chords

By splitting the root and the extension of a chord in pre-processing, we indicate that the network should treat the root and the extension as distinct elements in a sequence. To illustrate the intuition behind the assumption that splitting the root and extension will improve the perplexity, I will compare the probability distribution over the output vector for a model trained on O) the original dataset and S) the dataset with each chord split at the root. For simplicity, assume the data consists only of tokens from the **jazz representation. There are 12 possible values for the root of a chord, namely

{C, C# or D-, D, D# or E-, E, F, F# or G-, G, G# or A-, A, A# or B-, B}

For example's sake, let {"", min7, maj7, dim7, sus, sus4, 7b9} be the set of possible extensions for a dataset; the exact number of distinct extensions is not important for the illustration, as long as it is finite. The alphabet generated from dataset S) will have dimensionality 12 + 7 = 19. Compare this to the dimension of the alphabet for the model trained on dataset O): in that case the alphabet consists of all possible combinations of root and extension, so its dimensionality is 12 * 7 = 84. Since the probability vector the model outputs has the same dimensionality as the alphabet, the distribution is spread over fewer outcomes for dataset S) and is therefore less dispersed than the distribution over the output vectors for dataset O). The more dispersed the distribution over the output vector, the closer it is to a uniform distribution, which is what it would look like if the model were making random predictions. Since perplexity can be thought of as a measure of randomness, it was hypothesised that splitting the root and the extension would lower the test perplexity.
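The counting argument above in miniature, using the illustrative extension set from the text:

roots = 12       # C, C#/D-, D, ..., B
extensions = 7   # "", min7, maj7, dim7, sus, sus4, 7b9 (illustrative set only)

print(roots + extensions)  # 19 words when root and extension are separate elements
print(roots * extensions)  # 84 words when each whole chord is a single token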

3.2.1 Method

Since the raw data amalgamates the duration with the root and extension, we consider the dataset with the duration already separated from the root and extension, i.e. dataset a), as the baseline here; this allows us to isolate the effect of separating the root from the extension. The root was separated from the extension by parsing the lines of each file in the dataset and replacing each token of the form 1 C:maj7 with 1 C : maj7; in effect, whitespace was inserted before and after the colon symbol. Leaving the colon between the C and the maj7 ensures that the model will learn to predict an extension only after it has predicted the colon symbol. The colon did not appear between every root and extension in the dataset (an example is 2C7). Because of this inconsistency in representation, it was important to identify the root of the token and check whether it was followed by a colon symbol; if not, one was added with surrounding whitespace. The root was taken to be the character at index 1 of the token string, and if the character at index 2 was a sharp symbol # or a flat symbol -, it was concatenated to the root. The splitting process was only applied to the lines of a file that represent the chord progression of the tune, meaning the header lines were not parsed. In order to start parsing from the beginning of the chord progression, we required the index of the line at which the progression starts. Generally, the chord progression starts two lines below the line identifying the time signature. The time signature string always begins with an M, which made it easy to identify, and no other line within a given file starts with an M. So, if i denotes the index of the line starting with M, i + 2 was marked as the starting index of the progression. Since the order in which the fields appear in the header lines is not consistent throughout the corpus, not all files had progressions starting at i + 2, and some manual changes had to be made to the header lines of a small number of raw **jazz files.
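A rough sketch of this splitting step is shown below. It follows the description above but is not the project's actual pre-processing code; the function name and regular expression are illustrative:

import re

def split_root_and_extension(token):
    # Rewrite "1C:maj7" (or "1 C:maj7") as "1 C : maj7", inserting the colon
    # when the raw token omits it, e.g. "2C7" becomes "2 C : 7".
    m = re.match(r"^([\d.]+)\s?([A-G][#-]?):?(.*)$", token)
    if m is None:
        return token  # bar lines and header lines are left untouched
    duration, root, extension = m.groups()
    return "{} {} : {}".format(duration, root, extension).strip()

print(split_root_and_extension("1C:maj7"))    # 1 C : maj7
print(split_root_and_extension("2C7"))        # 2 C : 7
print(split_root_and_extension("4.A-:maj7"))  # 4. A- : maj7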

3.3 Transposing

Since the chords associated with a key are defined by degrees in the scale associated with that key, any chord progression can be represented in any key; all that is required are the relative separations between the chords in the progression. A chord progression can therefore essentially be defined by the separations between its chords' pitches, which means each chord progression has a meaningful representation in any other key. It was hypothesised that transposing the jazz files would allow the network to recognise that the structure of a progression is independent of the key the song is written in. This should improve the cross-validation results for the following reason: if the model encounters a sequence of chords during training and the same sequence appears in the validation or test set but in a different key, the model will be able to predict that sequence with high accuracy.

3.3.1 Transposing to a single key

The transpositions were done through a function in the pre-processing file which extracts the key from a given jazz file and determines whether the key is major or minor. In the **jazz representation, major keys are written with uppercase letters and minor keys with lowercase letters. Based on this, the function transposes all major tunes to C major and all minor tunes to A minor. The transpose function determines the offset of the original key of a song from the key it is being transposed to, measured in half steps between notes. Two dictionaries hold the notes and their separation from the destination key: one holds the offset of each note name from the note C, and the other from the note A. The offset is determined by retrieving the value corresponding to the note from one of the two dictionaries. An array could have been used instead of a dictionary, initialised so that the indices of the array represent the separation of a given note from the destination key, but dictionaries were used because several notes can have the same pitch and therefore the same separation from the note we are transposing to. Using a dictionary allows the same separation value to be assigned to all notes that share the same pitch, which simple array indexing does not allow.

3.3.2 Transposing through all keys

There are 12 musical keys in total, and each key is separated from its adjacent key by one half step. If a tune is in the key of G minor, shifting each root in the tune down in pitch by an offset of 1 transposes the tune to F# minor; shifting each root down by an offset of 2 transposes it to F minor. So, for a given tune, we have a variable representing the offset used for transposition, initialised to 1. The offset was incremented until it reached 12, and each time it was incremented, a copy of the original file was created and transposed by the current offset. This results in 12 copies of the song, one in each of the 12 keys. The only data structures required for this method are an array holding all the notes and a dictionary whose key-value pairs map each note to its index in the array. Since several notes map to the same element of the array, it is necessary to use both the array and the dictionary, for the same reasons given in the previous section. The index of the original key or root is retrieved from the dictionary, the offset is added to it to obtain the index of the new key or root, and the new key or root is retrieved from the array and replaces the old one in the copy of the song.
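A minimal sketch of this transposition step, assuming a note array and note-to-index dictionary like those just described (the names, and the flat-based spellings chosen for the array, are illustrative rather than the project's actual code):

NOTES = ["C", "D-", "D", "E-", "E", "F", "G-", "G", "A-", "A", "B-", "B"]
NOTE_INDEX = {"C": 0, "C#": 1, "D-": 1, "D": 2, "D#": 3, "E-": 3, "E": 4,
              "F": 5, "F#": 6, "G-": 6, "G": 7, "G#": 8, "A-": 8,
              "A": 9, "A#": 10, "B-": 10, "B": 11}

def transpose_root(root, offset):
    # Shift a root by `offset` half steps, wrapping around the 12 pitches.
    return NOTES[(NOTE_INDEX[root] + offset) % 12]

# Transposing to a single key: the offset from the original key to the destination.
offset = (NOTE_INDEX["A"] - NOTE_INDEX["G"]) % 12   # e.g. G minor -> A minor
print(transpose_root("F", offset))                  # F becomes G

# Transposing through all keys: one copy of the tune's roots per offset 1..12.
roots_in_tune = ["F", "G", "A-"]
copies = {k: [transpose_root(r, k) for r in roots_in_tune] for k in range(1, 13)}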

3.3.3 Hyperparameter Combinations

I aimed to re-train the best model from those mentioned above using several combinations of hyperparameters. The hyperparameters I wanted to experiment with were the number of layers in the network and the number of hidden units. The number of layers dictates how many LSTM cells are stacked on top of each other; LSTMs are stacked by feeding the output of one cell in as the input of the next, in order to model hierarchical structure in the data. The number of hidden units represents the network's learning capacity: too few hidden units may result in a poor test perplexity, while too many may cause the model to overfit the data. The following 6 combinations of hyperparameters were tested to see which resulted in the lowest test score:

1. 1 hidden layer, 300 hidden units
2. 2 hidden layers, 300 hidden units
3. 3 hidden layers, 300 hidden units
4. 1 hidden layer, 600 hidden units
5. 2 hidden layers, 600 hidden units
6. 3 hidden layers, 600 hidden units

Adjusting the hyperparameters was as simple as changing the values of the corresponding variables in the network configuration. Unfortunately, due to time constraints, I was unable to try these combinations on the model with the best test score, but I did obtain results with these combinations on the original dataset.
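For illustration only, stacking LSTM cells in the TensorFlow 1.x style API (which a network of this era would typically use; the variable names are not taken from the project's configuration) looks roughly like this:

import tensorflow as tf  # TensorFlow 1.x style API

num_layers = 2     # how many LSTM cells are stacked on top of each other
hidden_size = 600  # learning capacity of each cell

# The output of each cell is fed in as the input of the cell above it.
cells = [tf.nn.rnn_cell.BasicLSTMCell(hidden_size) for _ in range(num_layers)]
stacked_cell = tf.nn.rnn_cell.MultiRNNCell(cells)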

4. Results

It is important to note that in the graphs below, although the test perplexity appears to be plotted for each epoch, it was in fact calculated only after all 13 epochs had been processed; it is included in the graphs to make comparison easier.

4.1 Splitting the Root and Extension of Chords

Graph 1: No Split

Graph 2: Split

For the dataset with no split between the root and the extension, the validation perplexity starts to rise after 3 epochs while the training perplexity continues to decrease. This could be an indication of overfitting, but that is unlikely as the validation perplexity soon plateaus. The validation perplexity for the dataset with the root and extension split is non-increasing, which could imply that this model generalises better than the one trained on the unsplit dataset. For both datasets, the test perplexity is very close to the final validation perplexity, which is likely a good indication that the model captures a large amount of information about the fundamental structures of the progressions in the dataset within 13 epochs.

4.2 Transposing to a Single Key

Graph 3: Single key transpose, no split

Graph 4: Single key transpose, split

For both datasets, the validation perplexity more or less plateaus while the training perplexity decreases. It would therefore seem that transposing to a single key neither improves nor worsens any signs of overfitting already present in the results of section 4.1. The final validation and test perplexities are very similar, just as in section 4.1.

4.3 Transposing Through All Keys

Graph 5: All key transpose, no split

Given that the scale of the perplexity axis for these results is much smaller than in the previous result sections, we can assume the increases in validation perplexity are negligible. This clearly shows that there is no sign of overfitting and that this model generalises very well to the new data it encounters. The test perplexities of both models in this section are almost identical to their corresponding final validation perplexities, a strong indication that the model has successfully learned the underlying structure of the chord progressions in the dataset. This is especially true for the model trained on the dataset with all tunes transposed through all 12 keys and the root and extension split, because its final test and validation perplexities are the closest to its final train perplexity of all the models.

4.4 Hyperparameter Combinations

Graph 6: Hyperparameter Combinations

These results show that changing the hyperparameter values for the number of hidden layers and hidden units does not significantly improve the test perplexity, especially when compared to the results achieved by the other methods. However, no conclusions can be drawn about the exact effect of these hyperparameter values on the test perplexity.

4.5 Test Scores

Graph 7: Test Scores

Comparing the test scores, it is clear that separating the root from the extension of each chord and transposing each song through all 12 keys were the two most effective techniques for training a model that recognises the relevant musical structures in chord progressions; combining them resulted in the model with the lowest test score. Transposing through all keys achieved better results than transposing to one key, most likely because, when transposing through all keys, the model is more likely to encounter chord structures it has already seen during training. A small part of the improvement may also be due to the fact that the header information in each transposed copy is identical across all copies when the songs are transposed through all keys.

                               No Split    Split
Original                       5.177       2.802
Transposed to the same key     4.655       2.828
Transposed through all keys    2.574       1.605

Table 1: Test Scores

This table quantifies the improvements made by the training techniques used. The model trained on the original data with no split had a test perplexity of 5.177, while the model trained on the dataset where the roots were split from the extensions and the songs were transposed through all keys had a test perplexity of 1.605. Combining the best training techniques therefore reduced the test perplexity by 3.572.

5. Conclusions

From the results obtained, the best model was the one trained on dataset f), where each tune was transposed through all keys and the root of each chord was separated from the extension. This model had the lowest test score, which implies that, of the 6 models that resulted from the experiments, it can predict the next element in a sequence with the lowest level of randomness. This model, like all the others, showed no sign of overfitting, which implies it can generalise well to new data and is therefore capable of generating novel chord progressions; at the very least, it is unlikely to replicate progressions identical to songs the network encountered during training. However, the best way to test whether the model has learned the structure of chord progressions, and of the chords themselves, is to sample from the model and listen to the progressions it generates. The hyperparameter combinations did not drastically improve the test score when applied to a model trained on the original data, but they may be more effective on the data where the root and extension are separated and the tunes are transposed.

5.1 Future Work

To further validate these conclusions, I aim to extend this work by sampling from the best model and listening to the chord progressions it is capable of producing. I also aim to try the combinations of hyperparameters on the best model from this project to see whether any of them improve the test score further. An interesting extension would be to introduce an element of human interaction by allowing people to define a seed sequence; when the network is fed the seed sequence, it can generate a sequence of chords which completes the progression.
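One possible shape for this seed-completion idea, sketched only as an illustration (next_chord_distribution stands in for a trained model's prediction step and is hypothetical):

import random

def complete_progression(seed, next_chord_distribution, length=16):
    # Autoregressive sampling: repeatedly ask the model for a distribution
    # over the next element given everything generated so far, sample from
    # it, and append the sample to the progression.
    progression = list(seed)
    while len(progression) < length:
        words, probs = next_chord_distribution(progression)
        progression.append(random.choices(words, weights=probs, k=1)[0])
    return progression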

References

[1] B. L. Sturm, J. F. Santos and I. Korshunova. Folk music style modelling by recurrent neural networks with long short term memory units. In Proc. 16th International Society for Music Information Retrieval Conference, 2015.

[2] D. Eck and J. Schmidhuber. Learning the long-term structure of the blues. In Proc. Int. Conf. on Artificial Neural Networks, 2002.

[3] J. A. Franklin. Recurrent neural networks for music computation. ORSA Journal on Computing, 18(3):321-338, 2006.

Appendix

Appendix 1: Study Contract
