WATSON BEAT: COMPOSING MUSIC USING FORESIGHT AND PLANNING

Janani Mukundan, IBM Research, Austin
Richard Daskas, IBM Research, Austin

1 Abstract

We introduce Watson Beat, a cognitive system that composes original music based on its knowledge of music theory. We describe the main machine learning technique used in Watson Beat, reinforcement learning. We then discuss two case studies on how Watson Beat composes music based on user intent, and provide analysis of its learning methodology.

Attribution: Janani Mukundan, Richard Daskas, {jmukund, daskas}@us.ibm.com. Watson Beat: Composing music using foresight and planning. Appears in the proceedings of the KDD 2017 Workshop on Machine Learning for Creativity, Halifax, Nova Scotia, Canada, Aug. 2017.

2 Introduction

The term computational creativity can be defined as the study of art in a partial or fully automated fashion. We are entering a new realm of artificial intelligence and cognitive thinking in which computers are being taught to be creative. This includes not only the analysis of works of art but also their creation. In this paper we focus our attention on the latter, the synthesis of art, specifically the synthesis of music.

The goal of this project is to teach computers to compose music using a reinforcement learning (RL) model. Reinforcement learning [7, 8] is a machine learning technique that works similarly to the way humans learn to solve problems. An RL model learns by constantly interacting with its environment and using the feedback it receives to train itself, without relying on prior supervision of this interaction. Reinforcement learning falls under the umbrella of unsupervised learning techniques, where the model learns what to do, as opposed to being told what to do (supervised learning techniques). On the one hand, this is especially important in the arts, where results are subjective and qualitative. On the other hand, the field of AI can be defined in more formal and quantitative terms, where results can be interpreted in a more factual, numerical, and algorithmic fashion. Marrying the fields of AI and the arts can thus pose significant challenges. Reinforcement learning is uniquely suitable for such a scenario, where both these goals, subjectivity and objectivity, can be combined.

In the subsequent sections, we briefly review the principles of reinforcement learning, and then provide a background on how this machine learning technique is applicable to our specific case of composing music. We introduce Watson Beat, a cognitive system that works on the principles of reinforcement learning, and describe two case studies on using an RL framework for music composition. Finally, we discuss how to train these RL models to learn based on emotional and thematic intent.

Figure 1. The basic structure of an RL system consists of a stochastic agent and its environment. The RL agent interacts with its environment in discrete time steps, senses its current state, performs an action, is rewarded for the action, and moves to another state.

3 Reinforcement Learning and Its Applicability to Music Composition

Reinforcement learning studies how autonomous agents situated in probabilistic environments learn to maximize a long-term goal. The objective of the RL agent is to maximize its long-term cumulative reward by interacting with its environment and learning an optimal policy that maps states to actions. Figure 1 shows the basic operation of an RL system.
It consists of an RL agent that constantly interacts with its environment in discrete time steps. The RL agent senses its current state, performs an action, is assigned a numerical reward for performing the action, and moves to another state.

Figure 2 illustrates Watson Beat, an RL system in the context of music composition. The music composer is the RL agent (for the remainder of the paper, the terms RL agent and (music) composer will be used interchangeably). The composer interacts with its environment, senses its current state, and determines what note to play next. Relevant attributes for the environment will depend on the objective function the RL agent is trying to maximize, and can include features like (a) the current scale being used, (b) the current chord being played, etc.

The composer performs the action, gets a reward, and moves to another state. Our goal is to train the composer to maximize the reward it receives in the long run by learning an optimal policy that maps states to actions.

3.1 Episodic and Continuing Tasks

In order to precisely define the goal of the agent, we first need to determine the type of RL task the composer is performing. Reinforcement learning tasks can be broadly broken down into (a) episodic and (b) continuing tasks. Let us assume that the composer agent receives a sequence of rewards r_{t+1}, r_{t+2}, r_{t+3}, ... after time step t. In general, we want to maximize the expected return, where the return R_t is defined as a function of such a reward sequence. For example, R_t could be defined as the sum of all rewards obtained:

    R_t = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T,    (1)

where T is the final time step. Tasks that have a natural notion of an end (terminal) state or final time step are called episodic tasks. The interaction between the agent and the environment naturally breaks down into identifiable episodes or trials. These types of tasks are often called finite-horizon tasks. Continuing tasks, on the other hand, do not break down naturally into episodes and are referred to as infinite-horizon tasks. Such models are appropriate when the agent's lifetime is unknown. For example, the return R_t could be defined using equation (2), where the final time step is unknown:

    R_t = r_{t+1} + r_{t+2} + r_{t+3} + ...    (2)

One issue that arises with continuing tasks is the convergence of cumulative rewards: since T is infinite, R_t can grow without bound. To avoid this problem for continuing tasks, we maximize the discounted cumulative reward at each time step t:

    R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{i=0}^{∞} γ^i r_{i+t+1},    (3)

where γ is the discount rate parameter between 0 and 1. Equation (3) applies both to episodic tasks with no discounting (γ = 1) and to continuing tasks (γ < 1). Especially in the case of infinite-horizon tasks, the value of γ determines how important future rewards are. As γ approaches 0, the agent is myopic and cares only about maximizing immediate rewards; as γ approaches 1, the agent weighs future rewards nearly as heavily as immediate rewards.

For the particular problem of using RL for music composition, we can envision the RL task falling into either scenario. In the case of episodic tasks, the composer agent can be programmed to reach a terminal state after n minutes of music have been composed, or after x measures (bars) have been generated. Alternatively, the agent can continue to compose music in a streaming fashion without any terminal state, making it an infinite-horizon problem. For the purposes of this paper we chose to make the RL process a finite-horizon task, leaving continuous music generation for future work.

3.2 Temporal Credit Assignment

The composer agent needs to learn how to assign credit and blame to past actions for each observed immediate reward. This is called learning the value function of each state-action pair. Most RL algorithms learn to estimate the value function, an estimate of how good it is for the agent to be in a particular state and perform a given action. This notion of the goodness of a state is defined in terms of the total amount of reward the agent can expect to accumulate over the future, starting from that state, i.e., the expected return.
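
As a concrete illustration of the return definitions in equations (1) and (3), the following minimal Python sketch (not part of the paper) computes an undiscounted and a discounted return for an arbitrary reward sequence.

def episodic_return(rewards):
    """Undiscounted return R_t = r_{t+1} + ... + r_T (equation 1)."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return R_t = sum_i gamma^i * r_{i+t+1} (equation 3)."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

rewards = [2.0, 1.0, -3.0, 2.0]            # hypothetical per-step rewards
print(episodic_return(rewards))             # 2.0 (the gamma = 1 episodic case)
print(discounted_return(rewards, 0.9))      # ~1.93; later rewards are down-weighted
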
Whereas reward functions indicate the immediate desirability of a state, value functions indicate the long-term or future desirability of a state (by taking into account the subsequent states that follow and the future rewards that will be achieved). Naturally, the value function depends on the policy the agent follows. A policy π in the RL framework can be defined as a mapping from states (s ∈ S) and actions (a ∈ A) to the probability π(s, a) of taking action a when in state s. The value of being in state s and taking action a when following policy π, denoted Q^π(s, a), is the expected return when starting from state s, executing action a, and following policy π from that point on:

    Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{i=0}^{∞} γ^i r_{i+t+1} | s_t = s, a_t = a },    (4)

where Q^π is the action-value function of policy π and is called the Q-value of policy π. In subsequent sections we describe two case studies using the RL framework for music composition, and discuss how the composer agent learns to estimate Q-values.
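
To make the definition in equation (4) concrete, the sketch below estimates Q^π(s, a) by averaging sampled returns from rollouts that start in s, take a, and then follow a fixed policy π. This Monte Carlo estimate is only an illustration of the definition; the environment interface (reset_to, step) and the policy function are hypothetical assumptions, not Watson Beat's implementation.

def estimate_q(env, policy, state, action, gamma=1.0, episodes=1000, max_steps=8):
    """Monte Carlo estimate of Q^pi(state, action): the average (discounted)
    return observed when starting in `state`, taking `action`, and then
    following `policy`. `env.reset_to(state)` and
    `env.step(action) -> (next_state, reward, done)` are assumed interfaces."""
    total = 0.0
    for _ in range(episodes):
        env.reset_to(state)
        a = action
        ret, discount = 0.0, 1.0
        for _ in range(max_steps):
            s, r, done = env.step(a)
            ret += discount * r
            discount *= gamma
            if done:
                break
            a = policy(s)            # follow pi after the first action
        total += ret
    return total / episodes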

4 Case Study I - RL-Based Composer to Generate Chords

For our first case study, we provide a proof-of-concept framework for using a reinforcement learning system to compose music, specifically to generate chord sequences. The first step towards training our RL model is to determine our objective function: generate a sequence of chord progressions based on a complexity knob. The complexity knob takes three values: (a) simple, (b) semi-complex, and (c) complex. Before we begin training our RL model, we assume the following about our composition engine and the environment. These are parameters that can be changed before training, and are only used here to illustrate our example.

a. Our chord progression sequence has a fixed length of n > 1 measures. Within these n measures, we can have a maximum of n*2 chords being played and a minimum of n/2. This means that the duration of a chord can be a maximum of two measures and a minimum of half a measure.
b. At discrete time steps the RL agent will assess its environment and execute an action. For this example, the time step is every chord change.
c. The RL composer is given a primary key and primary chord to start with.
d. We fix the value of the complexity knob for the chord progression sequence.

Based on the above points, let us assume that we want to train our RL model to generate a simple chord progression sequence spanning eight measures, with eight chords, each chord having a duration of one measure. Next, we describe the actions, state attributes, reward structure, and value-function estimation for the RL system.

Figure 2. Watson Beat: an unsupervised music composition engine that fits into the framework of an RL system. The composer is the RL agent. The environment can include relevant state attributes like (a) the current chord being played, (b) the current scale being used, and (c) the tempo. The composer senses the state of the system, chooses the next note to be played, and is assigned a numerical immediate reward. The goal of the composer is to maximize an objective function (value function), e.g., (a) generate a simple four-bar chord progression, or (b) generate a syncopated eight-bar melody in C major.

Actions: The set of actions a composer agent can perform will involve determining what note to play next. For our case study, wherein the RL agent aims to generate chord progression sequences, the agent can execute the following actions at each time step:

1. Play primary chord: the next chord to be played will be the primary chord.
2. Play chord with a small jump: a small jump between the previous chord and the current chord.
3. Play chord with a medium jump: a medium jump between the previous chord and the current chord.
4. Play chord with a big jump: a big jump between the previous chord and the current chord.

To determine the jump distance from one chord to another we use a custom chord progression penalty algorithm, which works as follows:

1. Calculate the number of common notes between the previous chord and the current chord. The more common notes there are, the smaller the penalty and the smaller the jump.
2. If the current chord has notes that are a half step on either side of the home note of the previous chord, assign a negative penalty of 5. This means there is a strong pull from the current chord to the home note of the previous chord, indicating a big jump.
3. Repeat step 2 for the third, fifth, and seventh notes of the chord, reducing the penalty assigned each time by 1.
4. Add up the penalties from the above steps. The higher the total, the smaller the penalty and the smaller the jump.

State Attributes: Relevant attributes that adequately describe the environment for our RL agent include:

a. Distance from the current chord to the previous chord: the local jump that was made in the previous time step.
b. Distance from the current chord to the home chord: the global distance traveled by the sequence from the first chord to the current chord.
c. Number of times the scale has changed: the number of times we moved outside the primary scale to accommodate chords.

Reward Structure: The immediate reward structure is directly dependent on the objective function we want to maximize. For example, when generating a simple chord progression sequence, we assign higher numerical rewards when (a) playing the primary scale and (b) playing a chord with a smaller chord progression jump, while medium and big jumps are penalized. The opposite is true when generating a complex chord progression sequence. Table 1 indicates the numerical immediate rewards assigned for the different actions and objective functions.

Table 1. Immediate reward structure for the RL agent generating a chord progression sequence.

Action         | simple | semi-complex | complex
Primary Chord  |   2.0  |      1.0     |    0.0
Small Jump     |   2.0  |      1.0     |   -1.0
Medium Jump    |   1.0  |      2.0     |    2.0
Big Jump       |  -3.0  |      1.0     |    3.0
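
The chord-progression penalty (steps 1-4 above) and the immediate rewards of Table 1 could be sketched as follows. This is a rough illustration: the pitch-class chord representation, the +1 weight per common note, and the jump-classification thresholds are assumptions made for the sketch, not values specified in the paper.

def progression_score(prev_chord, curr_chord):
    """Higher score = smaller penalty = smaller jump (steps 1-4 of the text).
    Chords are tuples of pitch classes (0-11) ordered root, third, fifth[, seventh]."""
    curr = set(curr_chord)
    score = len(set(prev_chord) & curr)          # step 1: +1 per common note (assumed weight)
    # Steps 2-3: a half-step pull toward the home note, third, fifth, and seventh
    # of the previous chord is penalized 5, 4, 3, and 2 respectively.
    for degree, penalty in zip(prev_chord, (5, 4, 3, 2)):
        if {(degree - 1) % 12, (degree + 1) % 12} & curr:
            score -= penalty
    return score                                  # step 4: sum of all contributions

def classify_jump(score):
    """Map a penalty score to a jump size (thresholds are assumed)."""
    if score >= 2:
        return "small"
    return "medium" if score >= 0 else "big"

# Table 1: immediate reward per action under each complexity setting.
REWARDS = {
    "primary": {"simple": 2.0, "semi-complex": 1.0, "complex": 0.0},
    "small":   {"simple": 2.0, "semi-complex": 1.0, "complex": -1.0},
    "medium":  {"simple": 1.0, "semi-complex": 2.0, "complex": 2.0},
    "big":     {"simple": -3.0, "semi-complex": 1.0, "complex": 3.0},
}

# Example: C major (C E G) moving to G7 (G B D F) under the "simple" knob.
jump = classify_jump(progression_score((0, 4, 7), (7, 11, 2, 5)))
print(jump, REWARDS[jump]["simple"])              # classified as a big jump here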

Estimating Q-values

The basis of many reinforcement learning algorithms is to estimate the value function (Q-values in our case). For our case study we use the SARSA update rule [7] as described below. Let us assume that the composer is in state s_prev, performs action a_prev, transitions to new state s_current, and collects an immediate reward r. While in state s_current, the composer executes action a_current. The Q-value associated with executing action a_prev in s_prev is updated using the SARSA update rule shown in equation (5):

    Q(s_prev, a_prev) = (1 - α) Q(s_prev, a_prev) + α [r + γ Q(s_current, a_current)]    (5)

Here, α is the learning rate parameter that helps facilitate convergence in the presence of noisy and stochastic rewards and state transitions, and γ is the discount rate parameter. Recall that for non-discounted finite-horizon tasks such as ours, γ is set to 1. For episodic tasks, the quantity r + γ Q(s_current, a_current) intuitively represents the immediate reward obtained by executing action a_prev in state s_prev, plus the undiscounted sum of all future rewards when the current policy is followed from that point on.

RL-Based Chord Generation Algorithm

Algorithm 1: RL-Based Chord Generation Algorithm
procedure CHORDSEQUENCEGENERATOR
  Initialize all Q-values to random values
  A ← getActionSet()
  maxChords ← 8
  repeat (for every episode)
    chordId ← 0
    Sense system state s
    Initialize first action cmd ← primary key
    Q_prev ← getQValueFromTable(s, cmd)
    while chordId < maxChords do
      generateChord(cmd)
      r ← collectReward()
      Sense system state s
      if rand() ≤ ɛ then
        cmd ← selectRandomAction(A)
      else
        cmd ← getActionWithMaxQValue(A)
      Q_selected ← getQValueFromTable(s, cmd)
      updateQValueSARSA(Q_prev, r, Q_selected)
      chordId ← chordId + 1
    Decrease ɛ as time progresses.

Algorithm 1 illustrates the RL-based chord generation algorithm. The procedure keeps track of the Q-values of all possible state-action pairs in a table, and iteratively learns these Q-values based on experience. Initially we set these Q-values to random values. Recall that our case study is a finite-horizon task generating eight chords. Therefore, an episode will end after eight chords have been generated, and another one starts immediately, resetting the number of chords generated to 0. At the beginning of every episode, the composer senses the state of the system (s) and initializes the action to choose the primary key. The Q-value for the current state-action pair is retrieved from the table, the new chord is generated, and the composer moves to another state. In subsequent time steps, in most cases, the composer picks the action with the highest Q-value, i.e., the composer exploits the knowledge it has gained from the system. Occasionally, in order to encourage exploration, the composer chooses a random action with a small probability. The Q-value of the new state-action pair is retrieved, and the SARSA update rule is applied.

Figure 3. The rewards obtained per episode for the RL-based chord sequence generation engine described in case study I. The x-axis indicates the number of episodes, and the y-axis indicates the reward per episode.

Exploration vs. Exploitation: The SARSA update rule works on the basic premise that the composer has a non-zero probability of visiting every table entry. Therefore, the composer must have the ability to constantly explore its environment (i.e., perform random actions from the action set), while also continuously utilizing the best policy it has learned so far (i.e., choosing the action with the highest Q-value). To balance the tradeoff between exploration and exploitation we make use of a simple exploration mechanism known as ɛ-greedy action selection. The composer manages exploration by picking a random action with a small probability ɛ in the beginning phases of training. Since our environment is stationary, we can gradually reduce the probability of exploration by reducing ɛ. This allows the RL system to exploit the knowledge it has gained as training progresses.
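
Putting together the SARSA update of equation (5), ɛ-greedy selection, and the Table 1 rewards, a compact tabular training loop in the spirit of Algorithm 1 might look like the sketch below. The crude state abstraction (the previous jump), the hyperparameter values, and the helper names are illustrative assumptions rather than Watson Beat's actual implementation, and the sketch learns jump choices without rendering any audio.

import random
from collections import defaultdict

ACTIONS = ["primary", "small", "medium", "big"]
REWARD_SIMPLE = {"primary": 2.0, "small": 2.0, "medium": 1.0, "big": -3.0}  # Table 1, "simple" knob

def train_chord_agent(episodes=2000, alpha=0.1, gamma=1.0,
                      eps=0.5, eps_decay=0.999, max_chords=8):
    """Tabular SARSA with epsilon-greedy exploration, in the spirit of Algorithm 1.
    The 'state' here is simply the previous jump taken -- a deliberately crude
    abstraction chosen so that this sketch stays self-contained."""
    Q = defaultdict(random.random)               # Q-values initialized to random values

    def choose(state):
        if random.random() <= eps:               # explore with probability eps
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])   # otherwise exploit

    for _ in range(episodes):
        state, action = "start", "primary"       # the first chord is the primary chord
        for t in range(max_chords):
            reward = REWARD_SIMPLE[action]       # immediate reward for the action taken
            next_state = action                  # next state records the jump just made
            next_action = choose(next_state)
            terminal = (t == max_chords - 1)     # episode ends after max_chords chords
            target = reward if terminal else reward + gamma * Q[(next_state, next_action)]
            # SARSA update (equation 5)
            Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target
            state, action = next_state, next_action
        eps *= eps_decay                         # decrease exploration as time progresses
    return Q

Q = train_chord_agent()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in ["start"] + ACTIONS}
print(greedy)   # under the "simple" knob the greedy choices gravitate toward
                # the high-reward actions (primary chord / small jumps)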

Results

Next, we evaluate the performance of our RL-based chord generator. Figure 3 shows the cumulative rewards obtained when training the RL chord generator with the complexity knob set to simple. The x-axis indicates the episodes that were trained, and the y-axis shows the cumulative reward obtained for every episode. Our objective function is to generate a simple chord progression sequence; therefore, the optimal policy in this case is to pick jumps that have a lower penalty, i.e., smaller jumps. From the figure, we see that the cumulative rewards curve fluctuates a lot during the initial training phases. This is consistent with our goal of exploring more in the beginning. As time progresses, the RL system learns the optimal policy of taking small jumps when chords need to be generated. This is seen in the tail end of the plot, where the cumulative rewards curve settles at the maximum value. Still, there are occasional dips in the rewards, because we never stop exploring the environment completely.

5 Case Study II - RL-Based Composer to Generate Melody

For our next case study, we discuss the more complicated task of generating melodies using the RL framework described in the previous sections. Our objective function will be to generate a melody for a given chord progression sequence based on a complexity knob, which can take three values: (a) simple, (b) semi-complex, and (c) complex. As before, we assume the following about our composition engine and the environment.

a. The melody generated by our RL model has a fixed length of n > 1 measures.
b. At discrete time steps the RL agent will assess its environment and execute an action. For this example, the time step is based on the duration of the action, and will vary depending on what the composer chooses.
c. The RL composer is given a chord progression sequence to start with.
d. We fix the value of the complexity knob for the melody being generated.

Based on the above assumptions, let us assume that we want to train our RL model to generate a semi-complex melody spanning four measures. Next, we describe the actions, state attributes, and reward structure for the RL system. Estimating the Q-values is similar to the previous case study and will not be discussed.

Actions: The melody composer can choose from the following actions.

1. Play chord tone: one of the notes in the chord.
2. Play non-chord tone: a note not in the chord, but potentially in the scale.
3. Play passing tone: an intermediate non-chord tone between the current chord tone and another higher or lower chord tone.
4. Play neighbor tone: a non-chord tone one step above or below the current chord tone that returns to the original chord tone.
5. Play chord-to-chord tone: moving from one chord tone to another.

State Attributes: Relevant attributes that adequately describe the environment for our melody composer include:

a. Percentage of chord tones generated in the sequence: the ratio of the duration of chord tones with respect to other notes for the four measures being generated. Generally speaking, the higher the ratio of chord tones, the lower the complexity.
b. Percentage of non-chord tones generated in the sequence: the ratio of the duration of non-chord tones with respect to other notes for the four measures being generated. The higher the ratio of non-chord tones, the higher the complexity.
c. Gesture movement for the melody: how the melody moves from one phrase to another. The more skips it takes, the more movement it generates, thereby increasing complexity. The more steps it takes, the less movement it generates, leading to a simpler melody.

Reward Structure: The reward function for the melody composer is determined differently than for the chord sequence composer. Our objective function is to generate a semi-complex melody.
We define a semi-complex melody as one having between 30% and 40% non-chord tones. Any time the melody composer takes an action, it is rewarded 1.0 if it satisfies the above condition and -1.0 if it does not.

RL-Based Melody Generation Algorithm

Algorithm 2 illustrates the RL-based melody generation algorithm. The procedure keeps track of the Q-values of all possible state-action pairs in a table, and iteratively learns these Q-values based on experience. Initially we set the Q-values to random values. Recall that our case study is a finite-horizon task generating four measures of melody. Therefore, an episode will end after a melody has been generated for the duration of the four measures, and another one starts immediately, resetting the clock to 0. At the beginning of every episode, the composer senses the state of the system (s) and selects an action from the available set of actions. The Q-value for the current state-action pair is retrieved from the table, the new set of notes is generated, and the composer moves to another state. In subsequent time steps, in most cases, the composer picks the action with the highest Q-value, i.e., the composer exploits the knowledge it has gained from the system. Occasionally, in order to encourage exploration, the composer chooses a random action with a small probability.
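
The ±1.0 reward rule described above could be expressed as a small function like the one below; the 30%-40% band comes from the definition of a semi-complex melody stated above, while the function and parameter names are illustrative assumptions.

def melody_reward(nonchord_beats, total_beats, lo=0.30, hi=0.40):
    """Return +1.0 if the running fraction of non-chord-tone duration lies in the
    target band for a semi-complex melody (30%-40% by default), else -1.0."""
    ratio = nonchord_beats / total_beats if total_beats else 0.0
    return 1.0 if lo <= ratio <= hi else -1.0

# Example: after 4 beats of melody, 1.5 beats were non-chord tones (ratio 0.375).
print(melody_reward(1.5, 4.0))    # 1.0, inside the 30%-40% band
print(melody_reward(0.5, 4.0))    # -1.0, too few non-chord tones (ratio 0.125)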

Algorithm 2: RL-Based Melody Generation Algorithm
procedure MELODYGENERATOR
  Initialize all Q-values to random values
  A ← getActionSet()
  maxDuration ← 8 measures
  repeat (for every episode)
    currDuration ← 0
    Sense system state s
    cmd ← selectRandomAction(A)
    Q_prev ← getQValueFromTable(s, cmd)
    while currDuration < maxDuration do
      generateAction(cmd)
      r ← collectReward()
      Sense system state s
      if rand() ≤ ɛ then
        cmd ← selectRandomAction(A)
      else
        cmd ← getActionWithMaxQValue(A)
      Q_selected ← getQValueFromTable(s, cmd)
      updateQValueSARSA(Q_prev, r, Q_selected)
      currDuration ← currDuration + durationOfAction
    Decrease ɛ as time progresses.

The Q-value of the new state-action pair is retrieved, and the SARSA update rule is applied. Since we are dealing with a stationary environment, as time progresses, we can reduce the exploration in the system by reducing the value of ɛ.

Figure 4. The ratio of chord tones to non-chord tones seen for each episode when training the RL-based melody generator described in case study II, when the complexity knob is set to semi-complex. The x-axis indicates the episodes, and the y-axis indicates the percentage of chord tones and non-chord tones seen.

Results

We first evaluate the performance of our RL-based melody generator when the complexity knob is set to semi-complex. Figure 4 shows the ratio of chord tones to non-chord tones for the different episodes generated by the RL melody generator when the complexity knob is set to semi-complex. Recall that we define a semi-complex melody to have between 30% and 40% non-chord tones, and the reward function has been set up to accommodate this. The rationale behind this decision is that playing non-chord tones leads the melody to step out of the primary chord (and potentially the primary scale), hence increasing the complexity of the piece. The x-axis in Figure 4 shows the episodes that were trained, and the y-axis shows the percentage of chord tones and non-chord tones generated in each episode. From the figure we see that during the initial training phases, the episodes have on average 75%-80% chord tones and 20%-25% non-chord tones. But as training progresses, the system learns to choose a better ratio between chord tones and non-chord tones. The tail end of the plot shows a more balanced percentage of non-chord tones (35%-40%) and chord tones (60%-65%). This is indicative of the RL system learning a better policy over time. We also see that there is heavy fluctuation in the ratio during the beginning trials. This is because we encourage exploration (ɛ is higher initially) in the early training phases. Since the environment is stationary, as training continues, ɛ is gradually reduced, leading to more exploitation of the knowledge that the RL system has learned. Still, there are some peaks and valleys in the tail end of the plot, because we never stop exploring the environment.

Figure 5. The ratio of chord tones to non-chord tones seen for each episode when training the RL-based melody generator described in case study II, when the complexity knob is set to simple. The x-axis indicates the episodes, and the y-axis indicates the percentage of chord tones and non-chord tones seen.

Next, we evaluate the performance of our RL-based melody generator when the complexity knob is set to simple. Figure 5 shows the ratio of chord tones to non-chord tones for the different episodes generated by the RL melody generator when the complexity knob is set to simple.
We define a simple melody as one having between 80% and 90% chord tones. The rationale behind this is that simple melodies tend to stay within a scale and some predominant chords. The x-axis in Figure 5 shows the episodes that were trained, and the y-axis shows the percentage of chord tones and non-chord tones generated in each episode. Similar to the previous figure, we see that during the initial training phases, the episodes have on average 70%-80% chord tones and 20%-30% non-chord tones.

As we train the RL system, the ratio of chord tones increases, while that of the non-chord tones decreases, in keeping with the objective function of the RL melody generator.

6 Discussion

For the two case studies described in the previous sections, the Watson Beat RL engine learned to generate chords and melodies based on a complexity knob. In this section we discuss other objective functions that can be used to train the RL engine, namely emotional and thematic intent. Recall that an RL model learns what to do next, rather than being told what to do next or how to do it. This allows us to bring subjectivity into the model, which is especially important when describing emotional or thematic intent. For example, one line of thought may describe something happy as including only major chords. Another line of thought may describe the same mood as starting with minor chords but ending with major chords. Such variation in thought process can be easily accommodated in an RL model by describing the appropriate objective function and by tracking the right state features and reward functions.

As an example, let us set the objective function to generate a four-bar melody that sounds ominous. From music theory and literature, we can reason that an ominous mood has a slower, anticipatory tempo, uses repetition to create suspense, and is largely atonal in nature. The state attributes in such a scenario would include (a) the presence of mini motives in the melody (leading to more repetition, causing suspense), (b) the movement of the melody (a largely atonal melody usually does not pivot towards chord tones or home notes), and (c) the presence of rest notes, leading to increased tension in the melody. The immediate reward function would assign positive credit to actions that increase repetition, do not pivot towards a scale (ominous and spooky melodies usually follow octatonic, atonal, or microtonal scales), and use a lot of rest notes. Conversely, any action that pivots towards a particular scale other than the ones described above, or does not encourage the presence of repetitions or rests, will be assigned negative credit.

Now, let us set the objective function to generate a four-bar hip-hop bass line. From the literature we can ascertain that hip-hop bass lines are mostly in minor keys and are syncopated in nature. Our state attributes can then track (a) chord progression tonality, (b) time signature, and (c) syncopation factor. The reward function can assign positive credit for actions that move the bass line towards syncopation and minor-key tonalities, while assigning negative credit otherwise.
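
One way to organize this kind of intent-driven reward shaping is a small configuration table that maps each intent to weights over observed musical features, as in the sketch below. The feature names and weight values mirror the ominous and hip-hop examples above but are purely illustrative assumptions, not the configuration used in Watson Beat.

# Hypothetical intent-to-reward-weight configuration mirroring the discussion above.
INTENT_REWARDS = {
    "ominous": {                 # reward repetition and rests, penalize tonal pivots
        "motif_repetition": 1.0,
        "rest_notes": 1.0,
        "tonal_pivot": -1.0,
    },
    "hiphop_bass": {             # reward syncopation and minor tonality
        "syncopation": 1.0,
        "minor_tonality": 1.0,
        "major_tonality": -1.0,
    },
}

def immediate_reward(intent, observed):
    """Weighted sum of the features observed at this time step.
    `observed` maps a feature name to how often the last action produced it."""
    weights = INTENT_REWARDS[intent]
    return sum(weights.get(name, 0.0) * count for name, count in observed.items())

# Example: an action that repeats a motif and inserts a rest under the "ominous" intent.
print(immediate_reward("ominous", {"motif_repetition": 1, "rest_notes": 1}))   # 2.0
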
Based on the above discussion, we have uploaded a playlist of some of Watson Beat's original compositions here: https://soundcloud.com/jmukund/sets/watsonbeat-ml4creativity-2017 (the compositions in the playlist are multi-layered; the accompanying layers are automatically generated by Watson Beat and play the chords chosen by the RL agent). In the playlist you can hear compositions based on different moods and themes. As per our first case study, you can also hear two chord progression sequences. The first is an example of a simple chord progression (ChordProgression1(simple).mp3). From the general listener's perspective, we can consider this to be simple because the chord progressions presented occur frequently in most music. All of the chords belong to the same key of D major, making them closely related to one another. This simple chord progression would be something typical and expected.

Key of D Major: (D D G Gmaj7 Em7 G Gmaj7 Bm) = (I I IV IVmaj7 ii7 IV IVmaj7 vi)

The next is an example of a more complex chord progression sequence (ChordProgression2(semi-complex).mp3). This semi-complex example uses more medium and big jumps and eventually modulates outside of the home key. By doing so, it can be perceived as more unexpected than the previous one.

Key of B Major: (B E G#m7 G#m D#m F# Db F) = I IV vi7 vi iii V (IV) [modulates to Db major] I iii

It is easy to make computer music sound like computer music, i.e., mathematical, not always aesthetically pleasing, and unemotional. Having control over multiple degrees of complexity of various musical elements allows Watson Beat to compose a wide range of music that can be both familiar and new. This ability to steer learning based on emotional and thematic intent is what makes Watson Beat unique and interesting.

7 Related Work

Reinforcement learning as a machine learning technique has been successfully used in a variety of problems. In this section, however, we only discuss the class of problems that deal with computational creativity, and specifically those related to music generation. Cont et al. [2], Phon-Amnuaisuk [4], Collins [1], and Smith et al. [6] use reinforcement learning agents for music improvisation. They require some notion of a musical seed as input, and the reward structures are calculated based on how different the new piece is when compared to the original. Reese [5] uses RL models for both improvisation and generation of chord progressions; the reward structure is based on the cadence obtained by the chord progression sequence. Le Groux and Verschure [3] use reinforcement learning for music generation, but the feedback loop that feeds into their RL model requires human intervention at every iteration to indicate whether the music generated was pleasing or not. All the works described above also lack the capability of steering learning based on emotional or thematic intent.

8 Conclusion

In this paper we have presented a new approach to combining the fields of AI and the arts. We introduced Watson Beat, a cognitive engine that composes music using the principles of reinforcement learning and music theory. Furthermore, we have discussed ways of training our RL engine to accept thematic and emotional intent as input. We believe that the ability to steer learning based on emotional and thematic intent is what makes Watson Beat unique and interesting.

9 References

[1] Nick Collins. Reinforcement learning for live musical agents. In International Computer Music Conference, 2008.
[2] Arshia Cont, Shlomo Dubnov, and Gérard Assayag. Anticipatory model of musical style imitation using collaborative and competitive reinforcement learning. Volume 4520 of Lecture Notes in Computer Science, pages 285-306. Springer, 2006.
[3] Sylvain Le Groux and Paul F. M. J. Verschure. Adaptive music generation by reinforcement learning of musical tension. In Journal of Sound and Music Computing, 2010.
[4] Somnuk Phon-Amnuaisuk. Generating tonal counterpoint using reinforcement learning. In Neural Information Processing, 16th International Conference, ICONIP 2009, Bangkok, Thailand, December 1-5, 2009, Proceedings, Part I, pages 580-589, 2009.
[5] Kristopher W. Reese. Computationally generated music using reinforcement learning. PhD thesis, University of Louisville.
[6] Benjamin D. Smith and Guy E. Garnett. The Education of the AI Composer: Automating Musical Creativity.
[7] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[8] Csaba Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool Publishers, 2010.