Grammar Argumented LSTM Neural Networks with Note-Level Encoding for Music Composition

Zheng Sun, Jiaqi Liu, Zewang Zhang, Jingwen Chen, Zhao Huo, Ching Hua Lee, and Xiao Zhang

arXiv: v1 [cs.LG] 16 Nov 2016

Abstract. Creating an aesthetically pleasing piece of art, like music, has been a long-time dream of artificial intelligence research. Building on the recent success of long short-term memory (LSTM) networks in sequence learning, we put forward a novel system that reflects the thinking pattern of a musician. For data representation, we propose a note-level encoding method, which enables our model to simulate how humans compose and polish music phrases. To keep the output from violating music theory, we introduce a novel technique, the grammar argumented (GA) method, which teaches the machine basic composing principles. Within this method, we propose three rules as argumented grammars and three metrics for evaluating machine-made music. Results show that, compared to a basic LSTM, the grammar argumented model's compositions contain more diatonic-scale notes, short pitch intervals, and chords.

Index Terms: Music composition, note-level encoding, LSTM neural networks, grammar argumented method

1 INTRODUCTION

Creating all forms of art [1], [2], [3], [4], including music, has been a long-time pursuit of artificial intelligence (AI) research. Even AI follows the principle stated by the musicologist Leonard B. Meyer, that styles of music are in effect complex systems of probability relationships. In the early years, symbolic AI methods were popular, and compositions were driven by specific grammars describing sets of rules [5], [6]. This approach was later improved by evolutionary algorithms in various ways [7], including the famous EMI project [8]. Later still, statistical models such as Markov chains and hidden Markov models (HMMs) became popular in algorithmic composition [9].
Meanwhile, neural networks (NNs) have made remarkable progress in recognition and other fields [10], including music composition using recurrent neural networks (RNNs) [11], [12], [13], [14] and long short-term memory (LSTM) [15], [16]. RNNs and LSTMs perform well in modeling sequential data; however, when they are applied to music composition, the outcomes tend to be drab and dull, and can even collapse into a cluster of harsh notes. Moreover, the generated music sometimes goes against general music theory. We hope to teach the machine basic composing principles, but neural networks alone, or the older grammar methods, are incapable of this. In this work, we improve LSTM with an original method named the grammar argumented (GA) method, which combines neural networks and grammars. We begin by training an LSTM neural network on a dataset of musician-made music. In the training process, the machine learns as much sequential information as possible. Then we feed in a short music phrase to trigger the first phase of generation. Instead of adding the generated notes directly to the music score, we evaluate each predicted result against a set of composing rules. Notes that go against general music theory are abandoned, and new notes that conform to the rules are obtained by re-predicting. All amended results and their corresponding inputs are added to the training set; we then retrain our model with the updated training set and use the original generating method for the second phase of (real) generation.

Z. Sun, J. Liu, Z. Zhang, J. Chen, and X. Zhang are with the Department of Physics, Sun Yat-sen University, Guangzhou, P. R. China. {sunzh6, liujq33, zhangzw3, chenjw93, zhangxiao}@{mail2, mail2, mail2, mail2, mail}.sysu.edu.cn. Z. Huo is with China University of Political Science and Law. huozhao@cupl.edu.cn. C. H. Lee is with the Institute of High Performance Computing. calvin-lee@ihpc.a-star.edu.sg. Manuscript received xx/xx/xxxx; revised xx/xx/xxxx.
We also adopt a new representation of notes, concatenating each note's duration and pitch into a single input vector. With this note-level encoding method, we enable the machine to process real notes and think like a human. Results show that our system is capable of composing pleasing melodies, and that it outperforms the non-GA system in the percentages of notes in the diatonic scale, pitch intervals within one octave, and chords. In other words, our system can learn basic composing principles and bring out familiar, melodious music.

2 METHODS

2.1 Note-Level Encoding

Although machine learning methods have made significant progress in music composition, none of the related works really simulates how humans create music. During music composition, composers often focus on several typical music phrases, and in the process of polishing a phrase they deliberate over each music note. A music note is a particular combination of pitch and duration, like a quarter note of E6. However, the related works either represent music as quantized time series [11], [13], [15], [17], [18], [19], or treat pitches and durations in separate neural networks [20], [21]. In this work, we encode music at the note level.

Because it is troublesome to simulate the process of polishing and generating notes with raw MIDI files, we convert each file to a sequence of music notes. First, we choose MIDI files of 106 soft piano pieces whose composers include Joe Hisaishi, Yiruma, Yoko Kanno, and Shi Jin, et al. These compositions all have a 4/4 time signature; we only pick music in 4/4 because music with other time structures is intractable for modeling. Then, we delete all accompaniments and score annotations, such as Control Change events and grace notes. Besides, we ignore changes of intensity and keep only the highest-pitch note when two or more notes sound at the same time. This treatment discretizes melodies into a sequence of Note On and Note Off events. We then encode each pair of Note On and Note Off events as one two-part vector: a duration ranging from semiquaver to breve, and a pitch ranging from A0 to C8. To reduce the dimension of the model's input layer, we transpose all melodies into C major/A minor, which leaves 30 kinds of durations and 59 kinds of pitches. We then convert each note's duration and pitch into corresponding one-hot vectors of 30 and 59 bits, respectively, as in Fig. 1. Concatenating these two binary vectors gives a specific representation for each note. Unlike the commonly used piano-roll representation, this method represents both pitch and duration in one single vector. More significantly, a sequence of real music notes enables our model to simulate how humans generate and polish music phrases.

Fig. 1. We extract the music score from the MIDI files. Each note symbol contains duration and pitch information, both of which are converted into one-hot vectors and concatenated into one specific binary vector.

2.2 Long-Short Term Memory Neural Networks

Recurrent neural networks (RNNs) perform well in sequence learning. Each hidden layer receives both input from the previous layer and input from itself one time step in the past, which enables the network to remember information from previous time steps. However, simple RNNs do not handle long-term dependencies well because of vanishing gradients [22].
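As a concrete illustration of the note-level encoding of Section 2.1, the sketch below builds the 89-bit note vector from a duration index and a pitch index. This is a minimal sketch under our own assumptions: the paper does not give code, and the index mappings (and the helper name `encode_note`) are hypothetical.

```python
import numpy as np

N_DURATIONS = 30  # duration classes, semiquaver to breve (from the paper)
N_PITCHES = 59    # pitch classes remaining after transposing to C major / A minor

def encode_note(duration_idx, pitch_idx):
    """Concatenate two one-hot vectors into one 89-bit binary note vector.

    duration_idx and pitch_idx are positions in hypothetical lookup
    tables mapping durations and pitches to class indices.
    """
    v = np.zeros(N_DURATIONS + N_PITCHES, dtype=np.int8)
    v[duration_idx] = 1             # duration part: bits 0..29
    v[N_DURATIONS + pitch_idx] = 1  # pitch part: bits 30..88
    return v

# e.g. a note with duration class 4 and pitch class 20
note = encode_note(duration_idx=4, pitch_idx=20)
```

Exactly two bits of the 89-dimensional vector are set, one in each part, which is what lets the model later sample duration and pitch from the two halves of its output layer separately.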
Long short-term memory (LSTM) neural networks are an advanced kind of RNN that solves this problem. An LSTM uses a memory cell to store necessary information and various gates to control the flow of other information. The structure in Fig. 2 shows how data flows through the LSTM module. In the following we describe how the hidden state u(t) of the LSTM is computed. To obtain u(t) for each time step, we feed the LSTM the current input v(t) and the hidden state u(t-1) from the last time step. b_i, b_o, b_f, b_c denote the corresponding bias vectors, where the subscripts i, o, f, c stand for the input, output, forget, and cell gates, respectively. W_i, W_o, W_f, W_c denote the weight matrices between the input vector and the corresponding gate units, and U_i, U_o, U_f, U_c denote the weight matrices connecting the hidden state from the last time step to the corresponding gate units.

Fig. 2. The graphic structure of LSTM. With v(t) and u(t-1) as inputs, the LSTM layer outputs a hidden state u(t) that conveys temporal information; ⊙ denotes element-wise multiplication and + denotes element-wise addition. Panel (a) shows the inputs and outputs, and panel (b) details how the LSTM layer computes the current hidden state u(t).

First, using the previous temporal information u(t-1), we construct the input gate i(t), output gate o(t), and forget gate f(t) to decide which parts of the memory cell's information pass through. σ(x) = (1 + e^(-x))^(-1) is an element-wise sigmoid function whose output lies in the interval (0, 1).

i(t) = σ(W_i v(t) + U_i u(t-1) + b_i)    (1)
o(t) = σ(W_o v(t) + U_o u(t-1) + b_o)    (2)
f(t) = σ(W_f v(t) + U_f u(t-1) + b_f)    (3)

In the same way, we compute the candidate state C̃(t) of the memory cell from the input v(t) and the hidden state u(t-1) of the last time step; the difference is that the activation here is tanh(x) = (1 - e^(-2x))/(1 + e^(-2x)).

C̃(t) = tanh(W_c v(t) + U_c u(t-1) + b_c)    (4)

We can then define the update rule for the current state C(t) of the memory cell, the unit that stores and accumulates information: it is the candidate state C̃(t) multiplied by the input gate i(t), plus the old state C(t-1) from the last time step multiplied by the forget gate f(t):

C(t) = i(t) ⊙ C̃(t) + f(t) ⊙ C(t-1)    (5)
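The gate and cell updates above, together with the output step of Eq. (6), can be sketched as one NumPy time step. The dict-of-matrices layout (`W`, `U`, `b` keyed by gate name) is illustrative, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    # element-wise sigmoid, output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(v, u_prev, C_prev, W, U, b):
    """One LSTM time step following Eqs. (1)-(5), plus the output step.

    v: input vector at time t; u_prev, C_prev: hidden and cell state at t-1.
    W, U, b are dicts keyed by gate name ('i', 'o', 'f', 'c').
    """
    i = sigmoid(W['i'] @ v + U['i'] @ u_prev + b['i'])      # input gate,  Eq. (1)
    o = sigmoid(W['o'] @ v + U['o'] @ u_prev + b['o'])      # output gate, Eq. (2)
    f = sigmoid(W['f'] @ v + U['f'] @ u_prev + b['f'])      # forget gate, Eq. (3)
    C_cand = np.tanh(W['c'] @ v + U['c'] @ u_prev + b['c'])  # candidate,  Eq. (4)
    C = i * C_cand + f * C_prev                              # cell update, Eq. (5)
    u = o * np.tanh(C)                                       # output step, Eq. (6)
    return u, C
```

With all-zero weights, every gate evaluates to σ(0) = 0.5 and the candidate to tanh(0) = 0, so the cell simply halves its previous state; this is a quick way to sanity-check the wiring.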

Finally, the current hidden state u(t) of the LSTM is computed by activating the current memory-cell state under the control of the output gate:

u(t) = o(t) ⊙ tanh(C(t))    (6)

2.3 Grammar Argumented Method

In the original method, after hundreds of epochs, when the loss of the model stops decreasing, we consider that the machine has learned as much information as possible from the dataset. Once a seed input is given, the model can predict notes continuously. However, it does not perform as well as expected, because the model often generates results that do not conform to basic composing principles, for example too many overtones, abrupt changes in pitch, and unpleasing melodies. Such results can ruin a composition. Our GA method teaches the machine general music knowledge so that it composes more harmonious notes without any manual intervention during the second phase of (real) generation.

Before detailing GA, we need to determine what kinds of rules to use. Based on music theory and the unmelodious pieces in our results, we put forward three kinds of rules.

The first rule is the diatonic scale (Dia). In music theory, notes in a diatonic scale are the most pleasing. The C major scale is one of the diatonic scales, made up of seven distinct notes (C, D, E, F, G, A, and B). Because all data is transposed to C major in this work, we use the C major scale rule in our experiments. A machine without the GA method often produces overtones (C#, D#, F#, G#, and A#) and ruins a nice piece. Although overtones sometimes have positive effects on a piece, it is too difficult for the machine to compose melodious music with all twelve tones of the octave. So we want fewer overtones and more C major scale notes in the machine's compositions.

The second rule is the short pitch interval (SPI). The pitch interval of two successive notes usually does not span more than one octave; abrupt changes in pitch often make the listener uncomfortable.
We think only experienced composers can use pitch intervals spanning more than one octave well, so short pitch intervals are preferred in our results.

The last rule is triads (Tri). In addition to scale notes, chords are key to composing tuneful music. Triads are the simplest chords, and each kind of triad corresponds to a pair of pitch intervals. There are four kinds of triads in total, each with its own musical emotion. Furthermore, triads are the basis of all seventh chords, which make for variegated music. We believe that a composition's quality is closely related to its amount of chords.

With the GA method, in the first phase of generation we aim to obtain data amended by the composing rules stated above. Before a fresh note is added to the music score, we check whether the new note conforms to the rules. When a discordant note is predicted, we go back to the output layer of the model and resample from the output distribution, as in Fig. 3, looping until a conforming note is generated. For example, suppose the fresh note is (eighth, B6) and the last note in the music score is (eighth, A5); their pitch interval is 14 semitones. Because one octave includes only 12 semitones, this new note does not conform to SPI. Although pitch B6 may have the highest probability in the output layer, we abandon it and resample until a note whose pitch conforms to SPI appears, then add it to the music score. At the same time, we record this amended result and the current input phrase. After the generating process, we mix all recorded data with the original training set, as shown at the bottom of Fig. 3. In other words, we add a handful of fabricated data that carries the information of basic composing rules. Lastly, we retrain the model on the updated training set.
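The check-and-resample loop for the SPI rule can be sketched as follows. This is a minimal sketch, not the paper's code: the function names are hypothetical, pitches are indexed in semitones, and treating an interval of exactly 12 semitones as conforming is our reading of "within one octave".

```python
import numpy as np

rng = np.random.default_rng(0)

def conforms_to_spi(prev_pitch, new_pitch):
    """Short-pitch-interval rule: successive notes stay within one octave."""
    return abs(new_pitch - prev_pitch) <= 12  # 12 semitones = one octave

def sample_conforming_pitch(probs, prev_pitch, max_tries=100):
    """Resample from the model's output distribution until the SPI rule holds.

    probs is the softmax over pitch classes (indexed in semitones here).
    Returns a conforming pitch, or None if none appears within max_tries.
    """
    for _ in range(max_tries):
        pitch = rng.choice(len(probs), p=probs)
        if conforms_to_spi(prev_pitch, pitch):
            return pitch  # conforming note: add it to the score
    return None
```

In the GA method, each conforming note found this way is recorded together with its input phrase, and those pairs are later mixed back into the training set.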
In the second phase of (real) generation, we adopt the original method and let the machine generate notes continuously without any extra mechanism. The GA method thus enables the machine to learn basic composing principles from data amended with music theory and to bring out melodious music.

3 EXPERIMENTS

Our model consists of one LSTM layer and one fully connected layer. The LSTM layer includes 128 cells, and its input dimension is 89, the length of a note's binary representation. The fully connected layer, which is also the output layer, has 89 nodes. The size of our dataset is 30000; to speed up training, we divide it into mini-batches of size 64. We use Adam [23] to perform gradient descent optimization; the learning rate is set to . We build our model on the high-level neural network library Keras [24] and use TensorFlow [25] as its tensor manipulation library.

We train this model on the original dataset and label this set of weights Orig. With the GA method, we use Orig to generate 100k notes for each rule, obtaining three sets of amended data. We then mix each set with the dataset, and also mix all three sets with the dataset, yielding four new training sets. We retrain our model on them and label the four resulting sets of weights Dia, SPI, Tri, and MIX, after their rules. For statistical analysis, we use a common random seed to generate 100k notes with each of the five sets of weights, including Orig.

4 RESULTS

We take a representative piece from the machine's long composition generated in MIX mode; its music score is shown in Fig. 4. First, it is strong evidence that the machine excludes overtones and prefers notes in the C major scale: there is only one overtone in this piece.
Secondly, it is worth noticing that the machine has already learned to use repeated rhythmic structure within one piece, as human composers do, for example in bars 4-5. Listening from the beginning, we notice that the whole piece sounds soft and lyrical, consistent with the music in the dataset. Bars 3, 4, 6, and 12 have faster paces sandwiched between slow-moving melodies; this variety of pace is also seen in musician-made music. Besides, the bottom half is so pleasing that it could be used directly as the theme of a new song.

Fig. 3. The grammar argumented method. Given a music phrase (note sequence), the well-trained LSTM neural network predicts the next note. The output layer consists of 89 nodes; following the note-level encoding method, the left 30 nodes represent duration and the right 59 nodes represent pitch. We sample from these two parts separately because the duration and pitch representations are both one-hot codes. After the sampled result is converted to a note, we check it against the composing rule; only conforming notes are added to the phrase. When a non-conforming note is predicted, we resample from the output distribution until a conforming one appears. Then we add this amended note and its corresponding input phrase to the original training set.

Fig. 4. An example of the machine's composition. It is generated in MIX mode and includes about 100 notes.

5 EVALUATIONS

According to the music theory described in Section 2.3, we put forward three metrics: the percentage of notes in the diatonic scale (pdia), the percentage of pitch intervals within one octave (pspi), and the percentage of triads (ptri). All of them are based on note-level encoding, but they are generally valid as evaluation criteria for other composing algorithms.

5.1 pdia

Table 1 shows the percentages of the seven notes of the C major scale; we find that the GA method works with the C major scale rule. For E6, the ratio increases by one percentage point. Besides Dia and MIX, the Tri mode also produces a high pdia. This can be explained by pitch intervals: because the pitch intervals in triads lie within the major scale, adding data amended with the triads rule also improves pdia.

[Table 1. Percentages of notes in the diatonic scale: rows C, D, E, F, G, A, B; columns DS, Orig, Dia, SPI, Tri, MIX.]

5.2 pspi

We compute each pitch interval in the composition and calculate the percentage of pitch intervals within one octave. A high pspi contributes to a harmonious composition. As shown in Table 2, the SPI and MIX modes compose music with high pspi. We also notice that without the GA method (Orig mode), the model generates more unharmonious notes, resulting in poorer compositions.
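The pdia and pspi metrics can be computed directly from a pitch sequence. The sketch below is ours, not the paper's code; it assumes pitches are numbered in semitones with pitch class 0 corresponding to C.

```python
# Pitch classes of the C major scale (C, D, E, F, G, A, B) in semitones from C.
C_MAJOR_PITCH_CLASSES = {0, 2, 4, 5, 7, 9, 11}

def pdia(pitches):
    """Fraction of notes whose pitch class lies in the C major scale."""
    in_scale = sum(1 for p in pitches if p % 12 in C_MAJOR_PITCH_CLASSES)
    return in_scale / len(pitches)

def pspi(pitches):
    """Fraction of successive pitch intervals within one octave (<= 12 semitones)."""
    intervals = [abs(b - a) for a, b in zip(pitches, pitches[1:])]
    return sum(1 for i in intervals if i <= 12) / len(intervals)
```

For example, for the pitch sequence [60, 62, 64, 61, 80] (MIDI note numbers), three of the five notes are in the C major scale and three of the four successive intervals lie within an octave.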

[Table 2. Percentage of pitch intervals within one octave (pspi, %) for modes DS, Orig, Dia, SPI, Tri, MIX.]

5.3 ptri

Table 3 shows the percentages of triads. In the Tri and MIX modes, the machine's compositions include more triads than the other results. We notice that music composed in MIX mode performs well on all three metrics, which suggests that the three rules are not conflicting; they are coherent rules of music theory.

[Table 3. Percentages of major, minor, augmented, and diminished triads for modes DS, Orig, Dia, SPI, Tri, MIX.]

6 CONCLUSION

Although it is difficult for a machine to learn music theory rules with simple LSTM neural networks, we propose the grammar argumented method, which enables our model to learn those rules by adding rule-amended data to the dataset. The GA method reduces the number of unharmonious notes significantly. Our original note-level encoding method also contributes to this successful system: at the note level, the machine thinks like a musician and is able to learn higher-level logic from the dataset. Finally, our three metrics provide one solution for evaluating machine-made music. The GA method has the potential to incorporate higher-level music theory rules, such as repeated paragraphs, and it offers an approach towards music with global structure.

ACKNOWLEDGMENTS

The authors would like to thank Chuangjie Ren and Qingpei Liu for helpful discussions on neural networks.

REFERENCES

[1] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, "DRAW: A recurrent neural network for image generation," Computer Science.
[2] A. Graves, "Generating sequences with recurrent neural networks," Computer Science.
[3] L. A. Gatys, A. S. Ecker, and M. Bethge, "A neural algorithm of artistic style," Computer Science.
[4] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," arXiv preprint.
[5] G. M. Rader, "A method for composing simple traditional music by computer," Communications of the ACM, vol. 17, no. 11.
[6] J. D. Fernández and F.
Vico, "AI methods in algorithmic composition: A comprehensive survey," Journal of Artificial Intelligence Research, vol. 48.
[7] K. Thywissen, "GeNotator: An environment for exploring the application of evolutionary techniques in computer-assisted composition," Organised Sound, vol. 4, no. 2.
[8] D. Cope, "Computer modeling of musical intelligence in EMI," Computer Music Journal, vol. 16.
[9] M. Allan, "Harmonising chorales in the style of Johann Sebastian Bach," Master's Thesis, School of Informatics, University of Edinburgh.
[10] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587.
[11] P. M. Todd, "A connectionist approach to algorithmic composition," Computer Music Journal, vol. 13, no. 4.
[12] M. C. Mozer, "Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multiscale processing," Connection Science, vol. 6, no. 2-3.
[13] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription," arXiv preprint.
[14] S. Wermter, C. Weber, W. Duch, T. Honkela, and P. Koprinkova-Hristova, "Artificial Neural Networks and Machine Learning, ICANN 2014," Lecture Notes in Computer Science, vol. 8681.
[15] D. Eck and J. Lapalme, "Learning musical structure directly from sequences of music," University of Montreal, Department of Computer Science, C.P. 6128.
[16] J. A. Franklin, "Recurrent neural networks for music computation," INFORMS Journal on Computing, vol. 18, no. 3.
[17] K. Goel, R. Vohra, and J. Sahoo, "Polyphonic music generation by modeling temporal dependencies using a RNN-DBN," in Artificial Neural Networks and Machine Learning, ICANN 2014. Springer, 2014.
[18] D. Eck and J.
Schmidhuber, "Finding temporal structure in music: Blues improvisation with LSTM recurrent networks," in Neural Networks for Signal Processing: Proceedings of the IEEE Workshop. IEEE, 2002.
[19] Q. Lyu, Z. Wu, and J. Zhu, "Polyphonic music modelling with LSTM-RTRBM," in Proceedings of the 23rd Annual ACM Conference on Multimedia. ACM, 2015.
[20] M. C. Mozer, "Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multiscale processing," Connection Science, vol. 6, no. 2-3.
[21] J. A. Franklin, "Recurrent neural networks for music computation," INFORMS Journal on Computing, vol. 18, no. 3.
[22] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8.
[23] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," Computer Science.
[24] F. Chollet, "Keras."
[25] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.


More information

Bach-Prop: Modeling Bach s Harmonization Style with a Back- Propagation Network

Bach-Prop: Modeling Bach s Harmonization Style with a Back- Propagation Network Indiana Undergraduate Journal of Cognitive Science 1 (2006) 3-14 Copyright 2006 IUJCS. All rights reserved Bach-Prop: Modeling Bach s Harmonization Style with a Back- Propagation Network Rob Meyerson Cognitive

More information

LEARNING AUDIO SHEET MUSIC CORRESPONDENCES. Matthias Dorfer Department of Computational Perception

LEARNING AUDIO SHEET MUSIC CORRESPONDENCES. Matthias Dorfer Department of Computational Perception LEARNING AUDIO SHEET MUSIC CORRESPONDENCES Matthias Dorfer Department of Computational Perception Short Introduction... I am a PhD Candidate in the Department of Computational Perception at Johannes Kepler

More information

Finding Temporal Structure in Music: Blues Improvisation with LSTM Recurrent Networks

Finding Temporal Structure in Music: Blues Improvisation with LSTM Recurrent Networks Finding Temporal Structure in Music: Blues Improvisation with LSTM Recurrent Networks Douglas Eck and Jürgen Schmidhuber IDSIA Istituto Dalle Molle di Studi sull Intelligenza Artificiale Galleria 2, 6928

More information

Robert Alexandru Dobre, Cristian Negrescu

Robert Alexandru Dobre, Cristian Negrescu ECAI 2016 - International Conference 8th Edition Electronics, Computers and Artificial Intelligence 30 June -02 July, 2016, Ploiesti, ROMÂNIA Automatic Music Transcription Software Based on Constant Q

More information

Musical Creativity. Jukka Toivanen Introduction to Computational Creativity Dept. of Computer Science University of Helsinki

Musical Creativity. Jukka Toivanen Introduction to Computational Creativity Dept. of Computer Science University of Helsinki Musical Creativity Jukka Toivanen Introduction to Computational Creativity Dept. of Computer Science University of Helsinki Basic Terminology Melody = linear succession of musical tones that the listener

More information

Generating Music with Recurrent Neural Networks

Generating Music with Recurrent Neural Networks Generating Music with Recurrent Neural Networks 27 October 2017 Ushini Attanayake Supervised by Christian Walder Co-supervised by Henry Gardner COMP3740 Project Work in Computing The Australian National

More information

Music genre classification using a hierarchical long short term memory (LSTM) model

Music genre classification using a hierarchical long short term memory (LSTM) model Chun Pui Tang, Ka Long Chui, Ying Kin Yu, Zhiliang Zeng, Kin Hong Wong, "Music Genre classification using a hierarchical Long Short Term Memory (LSTM) model", International Workshop on Pattern Recognition

More information

MELONET I: Neural Nets for Inventing Baroque-Style Chorale Variations

MELONET I: Neural Nets for Inventing Baroque-Style Chorale Variations MELONET I: Neural Nets for Inventing Baroque-Style Chorale Variations Dominik Hornel dominik@ira.uka.de Institut fur Logik, Komplexitat und Deduktionssysteme Universitat Fridericiana Karlsruhe (TH) Am

More information

Noise (Music) Composition Using Classification Algorithms Peter Wang (pwang01) December 15, 2017

Noise (Music) Composition Using Classification Algorithms Peter Wang (pwang01) December 15, 2017 Noise (Music) Composition Using Classification Algorithms Peter Wang (pwang01) December 15, 2017 Background Abstract I attempted a solution at using machine learning to compose music given a large corpus

More information

Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations

Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations Hendrik Vincent Koops 1, W. Bas de Haas 2, Jeroen Bransen 2, and Anja Volk 1 arxiv:1706.09552v1 [cs.sd]

More information

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Introduction In this project we were interested in extracting the melody from generic audio files. Due to the

More information

Jazz Melody Generation and Recognition

Jazz Melody Generation and Recognition Jazz Melody Generation and Recognition Joseph Victor December 14, 2012 Introduction In this project, we attempt to use machine learning methods to study jazz solos. The reason we study jazz in particular

More information

Take a Break, Bach! Let Machine Learning Harmonize That Chorale For You. Chris Lewis Stanford University

Take a Break, Bach! Let Machine Learning Harmonize That Chorale For You. Chris Lewis Stanford University Take a Break, Bach! Let Machine Learning Harmonize That Chorale For You Chris Lewis Stanford University cmslewis@stanford.edu Abstract In this project, I explore the effectiveness of the Naive Bayes Classifier

More information

CS229 Project Report Polyphonic Piano Transcription

CS229 Project Report Polyphonic Piano Transcription CS229 Project Report Polyphonic Piano Transcription Mohammad Sadegh Ebrahimi Stanford University Jean-Baptiste Boin Stanford University sadegh@stanford.edu jbboin@stanford.edu 1. Introduction In this project

More information

JazzGAN: Improvising with Generative Adversarial Networks

JazzGAN: Improvising with Generative Adversarial Networks JazzGAN: Improvising with Generative Adversarial Networks Nicholas Trieu and Robert M. Keller Harvey Mudd College Claremont, California, USA ntrieu@hmc.edu, keller@cs.hmc.edu Abstract For the purpose of

More information

Composing a melody with long-short term memory (LSTM) Recurrent Neural Networks. Konstantin Lackner

Composing a melody with long-short term memory (LSTM) Recurrent Neural Networks. Konstantin Lackner Composing a melody with long-short term memory (LSTM) Recurrent Neural Networks Konstantin Lackner Bachelor s thesis Composing a melody with long-short term memory (LSTM) Recurrent Neural Networks Konstantin

More information

arxiv: v2 [cs.sd] 15 Jun 2017

arxiv: v2 [cs.sd] 15 Jun 2017 Learning and Evaluating Musical Features with Deep Autoencoders Mason Bretan Georgia Tech Atlanta, GA Sageev Oore, Douglas Eck, Larry Heck Google Research Mountain View, CA arxiv:1706.04486v2 [cs.sd] 15

More information

Computational Modelling of Harmony

Computational Modelling of Harmony Computational Modelling of Harmony Simon Dixon Centre for Digital Music, Queen Mary University of London, Mile End Rd, London E1 4NS, UK simon.dixon@elec.qmul.ac.uk http://www.elec.qmul.ac.uk/people/simond

More information

Some researchers in the computational sciences have considered music computation, including music reproduction

Some researchers in the computational sciences have considered music computation, including music reproduction INFORMS Journal on Computing Vol. 18, No. 3, Summer 2006, pp. 321 338 issn 1091-9856 eissn 1526-5528 06 1803 0321 informs doi 10.1287/ioc.1050.0131 2006 INFORMS Recurrent Neural Networks for Music Computation

More information

Deep learning for music data processing

Deep learning for music data processing Deep learning for music data processing A personal (re)view of the state-of-the-art Jordi Pons www.jordipons.me Music Technology Group, DTIC, Universitat Pompeu Fabra, Barcelona. 31st January 2017 Jordi

More information

Algorithmic Music Composition using Recurrent Neural Networking

Algorithmic Music Composition using Recurrent Neural Networking Algorithmic Music Composition using Recurrent Neural Networking Kai-Chieh Huang kaichieh@stanford.edu Dept. of Electrical Engineering Quinlan Jung quinlanj@stanford.edu Dept. of Computer Science Jennifer

More information

Advances in Algorithmic Composition

Advances in Algorithmic Composition ISSN 1000-9825 CODEN RUXUEW E-mail: jos@iscasaccn Journal of Software Vol17 No2 February 2006 pp209 215 http://wwwjosorgcn DOI: 101360/jos170209 Tel/Fax: +86-10-62562563 2006 by Journal of Software All

More information

Various Artificial Intelligence Techniques For Automated Melody Generation

Various Artificial Intelligence Techniques For Automated Melody Generation Various Artificial Intelligence Techniques For Automated Melody Generation Nikahat Kazi Computer Engineering Department, Thadomal Shahani Engineering College, Mumbai, India Shalini Bhatia Assistant Professor,

More information

Building a Better Bach with Markov Chains

Building a Better Bach with Markov Chains Building a Better Bach with Markov Chains CS701 Implementation Project, Timothy Crocker December 18, 2015 1 Abstract For my implementation project, I explored the field of algorithmic music composition

More information

Chord Classification of an Audio Signal using Artificial Neural Network

Chord Classification of an Audio Signal using Artificial Neural Network Chord Classification of an Audio Signal using Artificial Neural Network Ronesh Shrestha Student, Department of Electrical and Electronic Engineering, Kathmandu University, Dhulikhel, Nepal ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

CSC475 Music Information Retrieval

CSC475 Music Information Retrieval CSC475 Music Information Retrieval Symbolic Music Representations George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 30 Table of Contents I 1 Western Common Music Notation 2 Digital Formats

More information

Singing voice synthesis based on deep neural networks

Singing voice synthesis based on deep neural networks INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Singing voice synthesis based on deep neural networks Masanari Nishimura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda

More information

The Sparsity of Simple Recurrent Networks in Musical Structure Learning

The Sparsity of Simple Recurrent Networks in Musical Structure Learning The Sparsity of Simple Recurrent Networks in Musical Structure Learning Kat R. Agres (kra9@cornell.edu) Department of Psychology, Cornell University, 211 Uris Hall Ithaca, NY 14853 USA Jordan E. DeLong

More information

Deep Jammer: A Music Generation Model

Deep Jammer: A Music Generation Model Deep Jammer: A Music Generation Model Justin Svegliato and Sam Witty College of Information and Computer Sciences University of Massachusetts Amherst, MA 01003, USA {jsvegliato,switty}@cs.umass.edu Abstract

More information

Bach2Bach: Generating Music Using A Deep Reinforcement Learning Approach Nikhil Kotecha Columbia University

Bach2Bach: Generating Music Using A Deep Reinforcement Learning Approach Nikhil Kotecha Columbia University Bach2Bach: Generating Music Using A Deep Reinforcement Learning Approach Nikhil Kotecha Columbia University Abstract A model of music needs to have the ability to recall past details and have a clear,

More information

arxiv: v1 [cs.ai] 2 Mar 2017

arxiv: v1 [cs.ai] 2 Mar 2017 Sampling Variations of Lead Sheets arxiv:1703.00760v1 [cs.ai] 2 Mar 2017 Pierre Roy, Alexandre Papadopoulos, François Pachet Sony CSL, Paris roypie@gmail.com, pachetcsl@gmail.com, alexandre.papadopoulos@lip6.fr

More information

Automated sound generation based on image colour spectrum with using the recurrent neural network

Automated sound generation based on image colour spectrum with using the recurrent neural network Automated sound generation based on image colour spectrum with using the recurrent neural network N A Nikitin 1, V L Rozaliev 1, Yu A Orlova 1 and A V Alekseev 1 1 Volgograd State Technical University,

More information

PLANE TESSELATION WITH MUSICAL-SCALE TILES AND BIDIMENSIONAL AUTOMATIC COMPOSITION

PLANE TESSELATION WITH MUSICAL-SCALE TILES AND BIDIMENSIONAL AUTOMATIC COMPOSITION PLANE TESSELATION WITH MUSICAL-SCALE TILES AND BIDIMENSIONAL AUTOMATIC COMPOSITION ABSTRACT We present a method for arranging the notes of certain musical scales (pentatonic, heptatonic, Blues Minor and

More information

Singer Traits Identification using Deep Neural Network

Singer Traits Identification using Deep Neural Network Singer Traits Identification using Deep Neural Network Zhengshan Shi Center for Computer Research in Music and Acoustics Stanford University kittyshi@stanford.edu Abstract The author investigates automatic

More information

BachBot: Automatic composition in the style of Bach chorales

BachBot: Automatic composition in the style of Bach chorales BachBot: Automatic composition in the style of Bach chorales Developing, analyzing, and evaluating a deep LSTM model for musical style Feynman Liang Department of Engineering University of Cambridge M.Phil

More information

OPTICAL MUSIC RECOGNITION WITH CONVOLUTIONAL SEQUENCE-TO-SEQUENCE MODELS

OPTICAL MUSIC RECOGNITION WITH CONVOLUTIONAL SEQUENCE-TO-SEQUENCE MODELS OPTICAL MUSIC RECOGNITION WITH CONVOLUTIONAL SEQUENCE-TO-SEQUENCE MODELS First Author Affiliation1 author1@ismir.edu Second Author Retain these fake authors in submission to preserve the formatting Third

More information

Real-valued parametric conditioning of an RNN for interactive sound synthesis

Real-valued parametric conditioning of an RNN for interactive sound synthesis Real-valued parametric conditioning of an RNN for interactive sound synthesis Lonce Wyse Communications and New Media Department National University of Singapore Singapore lonce.acad@zwhome.org Abstract

More information

SentiMozart: Music Generation based on Emotions

SentiMozart: Music Generation based on Emotions SentiMozart: Music Generation based on Emotions Rishi Madhok 1,, Shivali Goel 2, and Shweta Garg 1, 1 Department of Computer Science and Engineering, Delhi Technological University, New Delhi, India 2

More information

A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification

A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification INTERSPEECH 17 August, 17, Stockholm, Sweden A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification Yun Wang and Florian Metze Language

More information

Automatic Piano Music Transcription

Automatic Piano Music Transcription Automatic Piano Music Transcription Jianyu Fan Qiuhan Wang Xin Li Jianyu.Fan.Gr@dartmouth.edu Qiuhan.Wang.Gr@dartmouth.edu Xi.Li.Gr@dartmouth.edu 1. Introduction Writing down the score while listening

More information

arxiv: v1 [cs.sd] 20 Nov 2018

arxiv: v1 [cs.sd] 20 Nov 2018 COUPLED RECURRENT MODELS FOR POLYPHONIC MUSIC COMPOSITION John Thickstun 1, Zaid Harchaoui 2 & Dean P. Foster 3 & Sham M. Kakade 1,2 1 Allen School of Computer Science and Engineering, University of Washington,

More information

Towards End-to-End Raw Audio Music Synthesis

Towards End-to-End Raw Audio Music Synthesis To be published in: Proceedings of the 27th Conference on Artificial Neural Networks (ICANN), Rhodes, Greece, 2018. (Author s Preprint) Towards End-to-End Raw Audio Music Synthesis Manfred Eppe, Tayfun

More information

Music Segmentation Using Markov Chain Methods

Music Segmentation Using Markov Chain Methods Music Segmentation Using Markov Chain Methods Paul Finkelstein March 8, 2011 Abstract This paper will present just how far the use of Markov Chains has spread in the 21 st century. We will explain some

More information

Shimon the Robot Film Composer and DeepScore

Shimon the Robot Film Composer and DeepScore Shimon the Robot Film Composer and DeepScore Richard Savery and Gil Weinberg Georgia Institute of Technology {rsavery3, gilw} @gatech.edu Abstract. Composing for a film requires developing an understanding

More information

Learning to Create Jazz Melodies Using Deep Belief Nets

Learning to Create Jazz Melodies Using Deep Belief Nets Claremont Colleges Scholarship @ Claremont All HMC Faculty Publications and Research HMC Faculty Scholarship 1-1-2010 Learning to Create Jazz Melodies Using Deep Belief Nets Greg Bickerman '10 Harvey Mudd

More information

arxiv: v1 [cs.cv] 16 Jul 2017

arxiv: v1 [cs.cv] 16 Jul 2017 OPTICAL MUSIC RECOGNITION WITH CONVOLUTIONAL SEQUENCE-TO-SEQUENCE MODELS Eelco van der Wel University of Amsterdam eelcovdw@gmail.com Karen Ullrich University of Amsterdam karen.ullrich@uva.nl arxiv:1707.04877v1

More information

Hidden Markov Model based dance recognition

Hidden Markov Model based dance recognition Hidden Markov Model based dance recognition Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić University of Zagreb, Faculty of Electrical Engineering and Computing Unska 3,

More information

2 2. Melody description The MPEG-7 standard distinguishes three types of attributes related to melody: the fundamental frequency LLD associated to a t

2 2. Melody description The MPEG-7 standard distinguishes three types of attributes related to melody: the fundamental frequency LLD associated to a t MPEG-7 FOR CONTENT-BASED MUSIC PROCESSING Λ Emilia GÓMEZ, Fabien GOUYON, Perfecto HERRERA and Xavier AMATRIAIN Music Technology Group, Universitat Pompeu Fabra, Barcelona, SPAIN http://www.iua.upf.es/mtg

More information

Sequence generation and classification with VAEs and RNNs

Sequence generation and classification with VAEs and RNNs Jay Hennig 1 * Akash Umakantha 1 * Ryan Williamson 1 * 1. Introduction Variational autoencoders (VAEs) (Kingma & Welling, 2013) are a popular approach for performing unsupervised learning that can also

More information

Predicting Aesthetic Radar Map Using a Hierarchical Multi-task Network

Predicting Aesthetic Radar Map Using a Hierarchical Multi-task Network Predicting Aesthetic Radar Map Using a Hierarchical Multi-task Network Xin Jin 1,2,LeWu 1, Xinghui Zhou 1, Geng Zhao 1, Xiaokun Zhang 1, Xiaodong Li 1, and Shiming Ge 3(B) 1 Department of Cyber Security,

More information

MUSICAL INSTRUMENT RECOGNITION WITH WAVELET ENVELOPES

MUSICAL INSTRUMENT RECOGNITION WITH WAVELET ENVELOPES MUSICAL INSTRUMENT RECOGNITION WITH WAVELET ENVELOPES PACS: 43.60.Lq Hacihabiboglu, Huseyin 1,2 ; Canagarajah C. Nishan 2 1 Sonic Arts Research Centre (SARC) School of Computer Science Queen s University

More information

Modelling Symbolic Music: Beyond the Piano Roll

Modelling Symbolic Music: Beyond the Piano Roll JMLR: Workshop and Conference Proceedings 63:174 189, 2016 ACML 2016 Modelling Symbolic Music: Beyond the Piano Roll Christian Walder Data61 at CSIRO, Australia. christian.walder@data61.csiro.au Editors:

More information

MUSIC scores are the main medium for transmitting music. In the past, the scores started being handwritten, later they

MUSIC scores are the main medium for transmitting music. In the past, the scores started being handwritten, later they MASTER THESIS DISSERTATION, MASTER IN COMPUTER VISION, SEPTEMBER 2017 1 Optical Music Recognition by Long Short-Term Memory Recurrent Neural Networks Arnau Baró-Mas Abstract Optical Music Recognition is

More information

Los Angeles Valley College MUS 200: INTRO TO MUSIC THEORY

Los Angeles Valley College MUS 200: INTRO TO MUSIC THEORY Los Angeles Valley College MUS 200: INTRO TO MUSIC THEORY FALL 2016 Tuesday/Thursday, 8:15am - 10:40am, M112 Timothy Herscovitch, professor E-mail and Phone: herscota@gmail.com / (818) 947-2346 (office)

More information

Music Generation from MIDI datasets

Music Generation from MIDI datasets Music Generation from MIDI datasets Moritz Hilscher, Novin Shahroudi 2 Institute of Computer Science, University of Tartu moritz.hilscher@student.hpi.de, 2 novin@ut.ee Abstract. Many approaches are being

More information

Music Representations

Music Representations Lecture Music Processing Music Representations Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Book: Fundamentals of Music Processing Meinard Müller Fundamentals

More information

Improvised Duet Interaction: Learning Improvisation Techniques for Automatic Accompaniment

Improvised Duet Interaction: Learning Improvisation Techniques for Automatic Accompaniment Improvised Duet Interaction: Learning Improvisation Techniques for Automatic Accompaniment Gus G. Xia Dartmouth College Neukom Institute Hanover, NH, USA gxia@dartmouth.edu Roger B. Dannenberg Carnegie

More information

The Design of Efficient Viterbi Decoder and Realization by FPGA

The Design of Efficient Viterbi Decoder and Realization by FPGA Modern Applied Science; Vol. 6, No. 11; 212 ISSN 1913-1844 E-ISSN 1913-1852 Published by Canadian Center of Science and Education The Design of Efficient Viterbi Decoder and Realization by FPGA Liu Yanyan

More information

Generating Music from Text: Mapping Embeddings to a VAE s Latent Space

Generating Music from Text: Mapping Embeddings to a VAE s Latent Space MSc Artificial Intelligence Master Thesis Generating Music from Text: Mapping Embeddings to a VAE s Latent Space by Roderick van der Weerdt 10680195 August 15, 2018 36 EC January 2018 - August 2018 Supervisor:

More information

Neural Network for Music Instrument Identi cation

Neural Network for Music Instrument Identi cation Neural Network for Music Instrument Identi cation Zhiwen Zhang(MSE), Hanze Tu(CCRMA), Yuan Li(CCRMA) SUN ID: zhiwen, hanze, yuanli92 Abstract - In the context of music, instrument identi cation would contribute

More information

Non-chord Tone Identification

Non-chord Tone Identification Non-chord Tone Identification Yaolong Ju Centre for Interdisciplinary Research in Music Media and Technology (CIRMMT) Schulich School of Music McGill University SIMSSA XII Workshop 2017 Aug. 7 th, 2017

More information

A Discriminative Approach to Topic-based Citation Recommendation

A Discriminative Approach to Topic-based Citation Recommendation A Discriminative Approach to Topic-based Citation Recommendation Jie Tang and Jing Zhang Department of Computer Science and Technology, Tsinghua University, Beijing, 100084. China jietang@tsinghua.edu.cn,zhangjing@keg.cs.tsinghua.edu.cn

More information

arxiv: v1 [cs.sd] 29 Apr 2016

arxiv: v1 [cs.sd] 29 Apr 2016 Music transcription modelling and composition using deep learning Bob L. Sturm 1, João Felipe Santos 2, Oded Ben-Tal 3 and Iryna Korshunova 4 1 Centre for Digital Music, Queen Mary University of London

More information

Composer Style Attribution

Composer Style Attribution Composer Style Attribution Jacqueline Speiser, Vishesh Gupta Introduction Josquin des Prez (1450 1521) is one of the most famous composers of the Renaissance. Despite his fame, there exists a significant

More information

arxiv: v1 [cs.sd] 9 Dec 2017

arxiv: v1 [cs.sd] 9 Dec 2017 Music Generation by Deep Learning Challenges and Directions Jean-Pierre Briot François Pachet Sorbonne Universités, UPMC Univ Paris 06, CNRS, LIP6, Paris, France Jean-Pierre.Briot@lip6.fr Spotify Creator

More information

Artificial Intelligence Approaches to Music Composition

Artificial Intelligence Approaches to Music Composition Artificial Intelligence Approaches to Music Composition Richard Fox and Adil Khan Department of Computer Science Northern Kentucky University, Highland Heights, KY 41099 Abstract Artificial Intelligence

More information

Grade Six. MyMusicTheory.com. Composition Complete Course, Exercises & Answers PREVIEW. (ABRSM Syllabus) BY VICTORIA WILLIAMS BA MUSIC

Grade Six. MyMusicTheory.com. Composition Complete Course, Exercises & Answers PREVIEW. (ABRSM Syllabus) BY VICTORIA WILLIAMS BA MUSIC MyMusicTheory.com Grade Six Composition Complete Course, Exercises & Answers PREVIEW (ABRSM Syllabus) BY VICTORIA WILLIAMS BA MUSIC www.mymusictheory.com Published: 8th March 2015 1 This is a preview document

More information

Evaluating Melodic Encodings for Use in Cover Song Identification

Evaluating Melodic Encodings for Use in Cover Song Identification Evaluating Melodic Encodings for Use in Cover Song Identification David D. Wickland wickland@uoguelph.ca David A. Calvert dcalvert@uoguelph.ca James Harley jharley@uoguelph.ca ABSTRACT Cover song identification

More information

Appendix A Types of Recorded Chords

Appendix A Types of Recorded Chords Appendix A Types of Recorded Chords In this appendix, detailed lists of the types of recorded chords are presented. These lists include: The conventional name of the chord [13, 15]. The intervals between

More information