Chorale Harmonisation in the Style of J.S. Bach: A Machine Learning Approach

Alex Chilvers

2006

Contents

1 Introduction
2 Project Background
3 Previous Work
   3.1 Music Representation
   3.2 Harmonisation
      3.2.1 Manual Harmonisation
      3.2.2 Automatic Harmonisation
   3.3 Machine Learning
      3.3.1 Features and Feature Space
      3.3.2 Concept Learning
      3.3.3 Bayesian Learning
      3.3.4 Decision Tree Learning
      3.3.5 Instance-based Learning
4 Representations
   4.1 Classifications
   4.2 Features
      4.2.1 Local Features
      4.2.2 Global Features
5 System Architecture
   5.1 High Level
      5.1.1 Training
      5.1.2 Testing
   5.2 Low Level
   5.3 Implementation
6 Evaluation
   6.1 Empirical
   6.2 Qualitative
7 Results
   7.1 Experiment Settings
      7.1.1 ML Classifier
      7.1.2 Data Collection
   7.2 Baseline
   7.3 ML Settings
   7.4 Feature Encodings
      7.4.1 Context Size
      7.4.2 Relative Context Pitch
      7.4.3 Contours
      7.4.4 Previous Length
      7.4.5 Previous Classifications
      7.4.6 Future Context
      7.4.7 Future Context / Previous Classifications
      7.4.8 Location and Metre
      7.4.9 Pitch Features
      7.4.10 Tonality
   7.5 Classification Approaches
   7.6 Summary
8 Discussion
   8.1 Conclusion
   8.2 Future Work
A Chorale Datasets

Chapter 1

Introduction

Music is a part of every human culture. It predates the written word, and may well predate any spoken language. The history of music is a complex one, and it can be studied from a number of different perspectives: theological, sociological, political, and so on. Understanding the music produced by a certain group of people during a certain period of time can help one to further understand their culture and way of life. Music is used to enhance celebrations, and is often an important part of religious tradition.

In western music, melody is often the most important aspect of a piece. The melody is the main sequence of notes heard throughout the piece and is the most immediately distinguishing feature of a composition. It may be performed by a voice, an instrument, or a combination of both. Often, the melody will be louder than the other parts. However, melody alone does not necessarily contain much detail by which we can determine a genre, era or composer. It is the features found in the way a piece of music is accompanied, or harmonised, that often say the most about its style, particularly in classical western music. It is worth noting that we are focusing on the way music is composed, not performed. If we were focusing on performances, we would need to consider how the performing artist's interpretation can also affect a piece's style (even if the music is monophonic, that is, without harmony).

Different genres of music, often identified with a period in history and/or the part of the world in which the music was predominantly created, can be recognised by a trained listener. There may not necessarily be anything unique about the melody, but the combination and interaction of the melody with the harmonising notes can often make a piece of music immediately classifiable in terms of genre, or even composer. For example, there are certain chords (that is, combinations of notes played simultaneously) that are considered typical of jazz music. These stylistic features may not be hard rules followed by composers. Often, the best way to detect them is to have spent a great deal of time listening to examples, training your ear, and developing a feel for a style.

The chorales of J.S. Bach provide a large number of pieces that can be studied in order to acquire this feel. Though they are relatively short compositions, there are many of them, and they are all harmonised by Bach in his quite recognisable Baroque style.

Being able to create new music that seems indistinguishable from music of a particular style, or composer, is a step further than simply recognising it. One can imagine composing some music in a classical style and having a trained listener believe that it was written by someone who lived a few centuries ago. Attempting to assign, to a computer, the task of learning and mimicking a musical style is the focus of our research.

Chapter 2

Project Background

"Computer music" generally refers to music that has been generated or composed with the aid of computers beyond their use as a recording tool. People in the academic field of computer music seek ways of applying modern computing techniques and technologies to automate some aspect of music composition or analysis.

Automatic harmonisation is a topic that has, in the past, been explored by numerous computer music researchers. Harmonisation refers to the implementation of harmony, usually by using chords. Harmony is sometimes referred to as the vertical aspect of music, with melody being the horizontal (with respect to time). It essentially involves multiple notes being played simultaneously, built around, and in support of, the note being played in the main melody line. Generating harmony automatically requires a computer system to decide which chords (or single notes) to use at a given time.

Bach chorales have usually been the focus of automatic harmonisation research. The reason for this is their abundance, as explained below, along with the fact that they all began as monophonic compositions whose harmonisation was later generated by a second party (namely Bach). A chorale was originally a hymn of the Lutheran church, sung by the entire congregation. Although chorales were traditionally sung in unison (that is, everybody sang the same melody simultaneously), several composers harmonised the tunes. Johann Sebastian Bach (a Baroque composer, 1685-1750) is the most notable, having harmonised hundreds of chorales to be performed by four-part choirs. Even though Bach did not actually write any chorale melodies, his name is virtually synonymous with the term "chorale".

The fact that Bach applied the same basic process (that of building three parts to be performed in harmony with the main melody) repeatedly to so many chorales suggests that it may be possible to simulate the process computationally. There are arguments both for and against this idea. Although exact simulation may not be possible, subtle (and learnable) patterns in Bach's work may be apparent, making simulation feasible. On the other hand, any music composition is a difficult task, and many would argue that Bach was a genius. Thus, it could be argued that his work cannot be generalised so that a computer can learn to simulate it; that there are irreproducible indicators of his mastery. Additionally, there are many different harmonisations that can accompany a melody. While it may be possible to simulate good harmonisation, hoping to choose the harmonisation chosen by Bach may be overambitious. Despite this, people have built systems that attempt to generate chorale harmonisations in Bach's general style. We later summarise a number of such approaches taken by others in the field.

The purpose of this project is to build a system that, having been trained on a set of four-part Bach chorales, can generate chords (combinations of notes in the Bass, Tenor and Alto parts) to harmonise a given Soprano melody. The system will analyse hundreds of entire four-part chorales and develop a model for harmonisation. When given the Soprano part as input, the system will use this model to generate the remaining parts.

The ultimate aim is to finish with a system that produces harmony, for previously unseen melodies, that is identical to the harmony that we know Bach to have created for those same melodies. We would like to accomplish this ambitious goal, but there are a number of different ways to measure the success of the system. At the very least, the system should produce harmony that is deemed both musically acceptable (that is, it does not break any fundamental rules or produce inappropriate/dissonant note combinations) and a close resemblance of Bach's own work. Of course, the system should be able to do this for any given melody. On top of this, it would be desirable to finish with a framework that, given another data set on which to train (for example, 300 jazz pieces), and perhaps after some alterations, could perform an adequate harmonisation of melodies in the new genre.

Chapter 3

Previous Work

Due to the interdisciplinary nature of this project, some research into, and understanding of, a number of different academic fields is required before attempting any implementation. It is also best to have an understanding of some key concepts before trying to understand the work we are presenting. Firstly, it is best to have a reasonable familiarity with music and music theory, including notation. Next, it is important to consider the different ways of approaching harmonisation, including previous attempts at automatic chorale harmonisation. Since we are taking a Machine Learning approach to this task, an overview of Machine Learning is also a beneficial premise.

3.1 Music Representation

Music can be defined as combinations and sequences of notes. Musical notes are named after the first seven letters of the alphabet (A, B, ..., G). These notes correspond to the white keys on a piano's keyboard. The black keys are called flats (♭) or sharps (♯); for example, the black key between A and B can be referred to as either A♯ or B♭. One set of all twelve notes (7 white keys and 5 black keys) is called an octave, and this covers all of the distinct notes used in western music. The same note can, however, be voiced in a number of different octaves (e.g. low C, middle C, high C, etc.).

Music is conventionally transcribed as a score on manuscript paper. A symbolic language is used, and a musician is expected to read the notes that are to be played. When music is transcribed as a score, each note is represented by a symbol (different symbols indicate different durations) written on, or between, lines (this set of lines is called a "stave"). The vertical position in relation to these lines determines the note's pitch (that is, its name, e.g. C, and its octave). There are also two different types of staves: treble and bass. Their function is essentially the same. Figure 3.1a provides an example of some notes on a treble stave.

The data used by researchers to approach computational music tasks (including that of automatic harmonisation) has usually come in MIDI format. This is a digital representation of music as it is performed. Each note event is represented as a combination of start time, pitch, duration and dynamic. MIDI players are readily available, making listening to such files their primary use. The nature of MIDI can often make it difficult to handle. Most significantly, the timing of notes is, in a way, too precise. This is due, in large part, to slight human errors at performance time. It can be difficult to map the music to a score transcription. However, there are other music representation languages that are a more direct transcription from an actual score.

A way of digitally representing a score is to use the Humdrum **kern format. A full description of **kern can be found at http://dactyl.som.ohio-state.edu/humdrum/representations/kern.html. Figure 3.1 shows the conversion of a short line of music from its score transcription (a., a graphical panel not reproducible here) to **kern (b.):

    **kern
    *clefG2
    *k[b-]
    *M2/2
    =-
    2d/
    2a/
    =
    2f/
    2d/
    =
    2c#/
    4d/
    4e/
    =
    2f/
    *-

    Figure 3.1: Score (a.) vs **kern (b.) transcription

In the case of polyphony (multiple lines played simultaneously), simultaneous events appear on the same horizontal line, as each part is represented by a separate column. Integers are used to represent a note's duration, alongside the note name. Different octaves are expressed by the case and number of letters used. For example, c, cc, C and CC all represent the note C, but each is in a different octave.
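As an illustration of these conventions, the following Python sketch decodes the pitch, octave and duration of a simple **kern note token. It is not the encoding code used in this project, and real **kern tokens can also carry ties, beams and other markings, which are ignored here; the octave numbering (middle C as octave 4) follows the usual Humdrum convention.

    # Illustrative sketch: decode a simple **kern note token such as "2f"
    # or "4cc#". Trailing markings (e.g. the stem-direction "/") are ignored.
    import re

    def parse_kern_note(token):
        # duration digits, then note letters, then accidentals
        # (kern writes sharps as '#' and flats as '-')
        m = re.match(r"(\d+)([a-gA-G]+)([#-]*)", token)
        if m is None:
            raise ValueError("not a simple note token: %r" % token)
        duration, letters, accidental = m.groups()
        name = letters[0].upper() + accidental
        # Case and repetition encode the octave: 'c' is middle C (octave 4),
        # 'cc' the octave above (5); 'C' is the octave below middle C (3),
        # 'CC' lower still (2).
        if letters[0].islower():
            octave = 3 + len(letters)
        else:
            octave = 4 - len(letters)
        return name, octave, int(duration)  # duration: 2 = minim, 4 = crotchet

    print(parse_kern_note("2f"))    # ('F', 4, 2)
    print(parse_kern_note("4cc#"))  # ('C#', 5, 4)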

3.2 Harmonisation

Here, we consider the ways in which manual harmonisation can be approached. We also look at some previous attempts at automatic chorale harmonisation.

3.2.1 Manual Harmonisation

The act of harmonising a melody is an important part of composition that often dictates the final style of a piece of music. It is not simply a matter of filling out musical space (that is, creating a fuller sound). Scruton [1997] distinguishes two different ways of considering harmony: chords, in which separate tones are combined to form new musical entities; and polyphony, in which the component parts are melodies themselves. The second approach tends to be more detailed and note-specific, since there are a number of different ways of producing the same chord, while the chord alphabet (major, minor, diminished, seventh, ninth, etc.) is finite. Thus, there are two separate ways of going about manually harmonising a melody: choosing chords to accompany the melody notes, and/or writing separate counter-melodies (which consequently produce a musical movement through chords). Of course, if there are two ways of manually approaching the harmonisation task, there may be two parallel ways of automating the process. That is, either harmonise each melody note with a complete chord, or sequentially build three separate and complete lines (in the case of chorale harmonisation) over the entire melody.

It is also important to note that harmonisation cannot be satisfactorily accomplished by choosing random chords or notes. There are certain restrictions that pertain to what will sound good and what will sound bad. Usually, a person can hear for themselves when a note combination does not work. This is explained by the concept of beating. Beating is the result of interference patterns created between sound waves [Scruton, 1997] (namely, the sound waves produced by two distinct notes). In music theory, the notes that can be performed within a piece tend to be restricted to those that appear in the scale around which the piece is centred, as indicated by the key signature (although there are usually exceptions to this restriction found throughout the piece, called accidentals, marked by ♯, ♭ or ♮).

3.2.2 Automatic Harmonisation

As mentioned previously, other researchers in the areas of computer music and artificial intelligence have tackled this same task of automatically harmonising chorale melodies in the style of Johann Sebastian Bach. These attempts have been made using a variety of computational techniques. Outlined below are some of the approaches that are relevant to this work.

Markov Models

Kaan M. Biyikoglu's [2003] submission to ESCOM5 (the European Society for the Cognitive Sciences of Music's 5th conference) explores the implementation of a Markov model for the harmonisation of chorales. A Markov model uses the assumption that the probability of an event is conditional on a finite number of preceding events. The Maximum Likelihood Estimate (MLE) is used to estimate transition probabilities from the corpus. If w_1 ... w_t is a sequence of random variables taking values from a finite alphabet {c_1, ..., c_n}, then a second-order Markov model

defines the MLE as follows:

    P(w_t = c_k \mid w_{t-2} = c_i, w_{t-1} = c_j) = \frac{C(c_i c_j c_k)}{C(c_i c_j)}

where C(c_i c_j c_k) and C(c_i c_j) are, respectively, the counts of the occurrences of the sequences (c_i c_j c_k) and (c_i c_j) in the corpus.

In the specific case of Biyikoglu's system, the alphabet consists of chord symbols (such as major, minor, diminished, sevenths, etc.) built on all twelve pitches (such as C, C♯, D, etc.). The entire corpus is transposed to the same key prior to training (thus resulting in fewer zero-counts, that is, sequences of chords that never occur in the corpus, though they may still be valid), and the transition probabilities are determined using the MLE. In the testing phase, candidate chords are chosen based on the requirement that the melody note occurs within the chord. For example, if the melody note is an E, then the chord consisting of the notes C, E and G is a candidate chord, while the chord consisting of the notes C, E♭ and G is not. Finally, the chord progressions are determined using the Viterbi algorithm (see http://viterbi.usc.edu/about/viterbi/viterbi_algorithm.htm). An additional stage uses voice-leading rules to assign each note in the chord to one particular part (i.e. Alto, Tenor, Bass), thus producing the required four-part texture. Unfortunately, this system has not offered any results that can be used to measure the success of the approach as a means of effectively recreating Bach's work.
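To make the estimation step concrete, the following Python sketch shows how such second-order transition probabilities can be obtained by simple counting. It illustrates the MLE above and is not Biyikoglu's implementation; the Roman-numeral chord symbols in the toy corpus are placeholders, not real chorale data.

    from collections import Counter

    def train_second_order(sequences):
        """Estimate P(w_t = c_k | w_{t-2} = c_i, w_{t-1} = c_j) by counting."""
        trigrams, bigrams = Counter(), Counter()
        for seq in sequences:
            for i in range(len(seq) - 2):
                trigrams[tuple(seq[i:i + 3])] += 1
                bigrams[tuple(seq[i:i + 2])] += 1  # contexts only
        # Note: an unseen context (c_i, c_j) would divide by zero; the
        # zero-count problem mentioned above calls for smoothing in practice.
        return lambda ci, cj, ck: trigrams[(ci, cj, ck)] / bigrams[(ci, cj)]

    # Toy corpus of chord-symbol sequences.
    corpus = [["I", "IV", "V", "I"],
              ["I", "IV", "V", "vi"],
              ["I", "IV", "V", "I"]]
    p = train_second_order(corpus)
    print(p("IV", "V", "I"))  # 2/3: (IV,V,I) seen twice, context (IV,V) thrice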

Probabilistic Inference

Another approach, similar to that of Biyikoglu, is that taken by Moray Allan and Christopher K. I. Williams [2005]. Their NIPS (Neural Information Processing Systems) conference paper describes a system which uses Hidden Markov Models (HMMs) as a means of composing new harmonisations. Instead of using only the observed states, as in the Markov case, HMMs work around the assumption that the observations occur due to some hidden states. In Allan and Williams' system, the observed states are the melody notes, and the hidden states are the chords. So, rather than only using the melody notes to restrict the possible chords in a sequence (as in the Markov model approach), the melody note is incorporated into a first-order model. In fact, the system described makes two first-order assumptions: firstly, that the probability of a chord occurring depends only on the immediately preceding chord; and secondly, that the probability of a particular observation being made (that is, a particular note) depends only on the current state (i.e. chord). Again, the Viterbi algorithm was used to determine harmonic sequences (or chord progressions). However, an additional HMM is introduced in order to add complexity and realism to the harmonisation by way of ornamentation. Ornamentation involves the insertion of short notes as a means of making the music more interesting. This additional stage smoothes out the transitions between the notes in a line of music.

This system does not produce particularly good results. That is, the generated harmonisations rarely match Bach's own compositions (although the authors did describe some attempts as "reasonable harmonisations"). A possible reason for this, cited by the authors, is the sparseness of the data. In the HMM used, there are 5,046 hidden chord states and 58 visible melody states. Additionally, the ignorance of context could be another shortcoming of the system. Perhaps using a bigram model, or even a trigram model, would improve on the unigram model used, albeit at the expense of increased data sparseness.
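Both this system and Biyikoglu's rely on the Viterbi algorithm to find the most probable chord sequence. The sketch below shows the core recursion for a small HMM with hidden chord states and observed melody notes; the states and probability tables are invented toy values for illustration, not parameters learned from chorales (a real implementation would also work in log space to avoid underflow).

    def viterbi(observations, states, start_p, trans_p, emit_p):
        """Most likely hidden state sequence for an HMM (toy version)."""
        # best[t][s]: probability of the best path ending in state s at time t
        best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
        back = [{}]
        for t in range(1, len(observations)):
            best.append({})
            back.append({})
            for s in states:
                prev, p = max(
                    ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                    key=lambda x: x[1])
                best[t][s] = p * emit_p[s][observations[t]]
                back[t][s] = prev
        # Trace back from the best final state.
        state = max(states, key=lambda s: best[-1][s])
        path = [state]
        for t in range(len(observations) - 1, 0, -1):
            state = back[t][state]
            path.append(state)
        return path[::-1]

    # Toy model: two chord states harmonising three melody notes.
    states = ["I", "V"]
    start = {"I": 0.8, "V": 0.2}
    trans = {"I": {"I": 0.5, "V": 0.5}, "V": {"I": 0.7, "V": 0.3}}
    emit = {"I": {"C": 0.6, "D": 0.1, "G": 0.3},
            "V": {"C": 0.1, "D": 0.5, "G": 0.4}}
    print(viterbi(["C", "D", "C"], states, start, trans, emit))  # ['I', 'V', 'I']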

Others

Although less closely related to the approach we are taking, a number of other approaches have been taken by computer music researchers to automatic harmonisation and composition. Early work using constraint-based systems shows how rules may be used to assign a score to a particular choice of harmonisation [Pachet and Roy, 2001]. Search algorithms may then be used to produce the best possible harmonisation. Such a system depends on the assumption that music is governed entirely by rules, and that such rules can be used to quantify the validity of a chosen harmonisation.

Probabilistic finite state grammars have also been used to harmonise new chorale melodies. Conklin and Witten [1995] produced a system that combined different models of properties of a musical sequence. These properties, such as pitch, duration and position in bar, are not dissimilar to the features we are using to represent music within our system. In their system, a large number of models are thus used in parallel.

Chorale harmonisation has also been attempted using neural networks. HARMONET [Hild et al., 1992] is a system that uses such an approach. Here, the network is trained by being shown, at time t, the Soprano voice at t-1, t and t+1, the harmonies from t-3 to t, and the location of t in the musical phrase. The neural nets are used first to determine the Bass note to harmonise the current Soprano note. Rules are then used to determine how the remainder of the chord should be constructed.

Discussion

With particular regard to the work using probabilistic inference, we have found some of the issues faced to be quite indicative of potential issues that we too will face. The main problem will be finding a way of encoding the music so that the data does not become too sparse, as well as finding an optimal amount of context to use, again without resulting in data sparseness. Where other researchers have treated the melody as though it were a result of the chord progressions (or, as they call it, the harmonic motion), our approach will treat the chords as a consequence of the melody (since we are using features of the melody to classify it with a chord at any given time). This is a more logical approach, since Bach himself completed the composition of each chorale by building harmony to match the already complete melody.

3.3 Machine Learning

For many years now, Machine Learning (ML) has been applied to a large number of computational tasks. The idea that a computer system may automatically improve with experience has led to a number of successful applications. Tom Mitchell's whitepaper on ML [Mitchell, 2006] provides examples such as speech recognition, computer vision, bio-surveillance, and robot control. There are a great number of existing ML algorithms that can be used to map observations of an event to a resulting classification of that event. We will consider some of these algorithms, particularly those most relevant to our project.

3.3.1 Features and Feature Space

Firstly, most ML classification algorithms and approaches hinge on the idea that we are trying to use a feature vector, representing some scenario or a description of an entity (known as an "instance"), to reach some conclusion about, or classification of, the instance being represented. A feature is a specific property of an instance that can be observed and assigned a value (often within a specific domain). A feature vector is then an n-dimensional vector containing the values of these features. As an example, if we are trying to use ML to predict whether it is going to rain, we may choose two features: Maximum Temperature and Humidity. Then, for every day we have on record, we can build a 2-dimensional feature vector with that day's maximum temperature and humidity. We will then be left with a collection of vectors and the classifications they produced (Rain or No-Rain); e.g. (21, 35%) -> No-Rain, (17, 70%) -> Rain, (19, 75%) -> Rain, etc. With this, one of the algorithms below can be used to build a model that, given a new 2-dimensional feature vector, can decide whether that vector corresponds to the Rain or No-Rain classification.

A feature space is an abstract space in which a feature vector can be represented by a point in n-dimensional space (n being determined by the number of features used to represent an instance). In the above example, it is not difficult to imagine how each of these vectors would be represented as a point on a 2-dimensional graph: the points (21,35), (17,70) and (19,75) respectively. ML aims to determine (n-1)-dimensional classification boundaries in an n-dimensional space. This way, a new point in space (representing a previously unseen instance) can be classified based on where it appears in relation to these boundaries. In the 2-dimensional case, a line may be drawn to separate all points that receive the Rain classification from those classified with No-Rain.

3.3.2 Concept Learning

Tom Mitchell [1997] defines Concept Learning as "inferring a boolean-valued function from training examples of its input and output". A requirement of this approach is that each of the attributes has a small set of possible values. If the scenario explained above were simplified so that the only possible values for Maximum Temperature were Hot, Warm, Cold or Cool, and the only possible values for Humidity were High or Low, then it would be an example of such a boolean-valued function, where each instance is labeled as a member or non-member of the concept "It will rain". A Concept Learning system can determine the possible subset of values for each feature that are required in order for an instance to belong to the target concept.

3.3.3 Bayesian Learning

Bayesian learning algorithms are based on Bayesian reasoning: a probabilistic approach to inference. The assumption is that optimal decisions can be made by reasoning about the probabilities observed in a corpus, together with the observations made about a new instance. Bayesian algorithms, such as naive Bayes, that calculate explicit probabilities for the hypotheses are among the most practical approaches for certain tasks, such as automatically classifying text documents.

Bayes' theorem, the principle behind Bayesian methods, provides a way of calculating the posterior probability P(h|D): the probability of a hypothesis holding, given the training examples. The theorem is formalised as follows:

    P(h \mid D) = \frac{P(D \mid h) P(h)}{P(D)}

where D is the training data, h is the hypothesis in question, P(D|h) is the probability of observing the training data given that the hypothesis holds, and P(h) is the prior probability that h holds. Since P(D) is a constant independent of h, the hypothesis maximising P(h|D) can be found by maximising the numerator alone; that is, P(h|D) ∝ P(D|h)P(h).

As a simple example, if we are trying to determine whether it will rain on a day that is Hot with High humidity, we need to determine (from the training examples) the probability of it being Hot when it has rained, the probability of the humidity being High when it has rained, and the overall probability of it raining on any day. In an ML application of Bayesian learning, the goal is usually to find the hypothesis h (in the example of text classification, the options may be the various genres of news articles: sport, world, etc.) that maximises this posterior probability.
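Applied to the rain example, the naive Bayes variant (which additionally assumes the features are conditionally independent given the hypothesis) can be sketched as follows. The training data here is an invented toy set using the simplified feature values from Section 3.3.2, not drawn from any real corpus.

    from collections import Counter, defaultdict

    # Toy training data: (temperature, humidity) -> label.
    days = [("Hot", "Low", "No-Rain"), ("Warm", "High", "Rain"),
            ("Cool", "High", "Rain"), ("Hot", "High", "Rain"),
            ("Cold", "Low", "No-Rain")]

    labels = Counter(label for *_, label in days)     # P(h) counts
    cond = defaultdict(Counter)                       # P(feature | h) counts
    for *features, label in days:
        for i, value in enumerate(features):
            cond[(i, value)][label] += 1

    def posterior(features, label):
        """P(h) * prod_i P(feature_i | h), proportional to P(h | D)."""
        p = labels[label] / sum(labels.values())
        for i, value in enumerate(features):
            p *= cond[(i, value)][label] / labels[label]
        return p

    query = ("Hot", "High")
    print(max(labels, key=lambda h: posterior(query, h)))  # 'Rain'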

3.3.4 Decision Tree Learning

Decision Tree Learning is a method for approximating discrete-valued target functions represented by decision trees. Previous applications of this method have included medical diagnosis and risk assessment on loan applications. Classifications are made by moving down a tree, branching in different directions based on the value of a specific feature. According to Mitchell [1997], the most appropriate problems for this approach meet the following criteria:

- Instances are represented by attribute-value pairs (i.e. feature values).
- The target function has discrete output values.
- Disjunctive descriptions may be required.
- The training data may contain errors.
- The training data may contain missing attribute values.

Essentially, a Decision Tree is a (usually quite large) network of if-else statements. Branching occurs based on certain conditions, defined by feature values. When a Decision Tree model is constructed, a hierarchy of the features is learned. In some cases, it may be possible to classify a new feature vector based on the value of only a single feature. If not, the classifier will search down the tree until a final decision (classification) can be reached.

Decision trees are constructed top-down. The attribute tested at the top node of the tree is the one deemed to best classify the training examples when used on its own. A statistical test is used to weight the relative importance of each instance attribute. From here, a child node is constructed for each possible branching from that node (that is, each of the different possible values). This process is then repeated at each child, and continues until each path reaches a classification. As a simple example (in the context of our project), if it happens that every note that occurs at the beginning of a piece in our training data is harmonised using the tonic chord, then knowing that the note we are trying to classify is at the beginning of a piece will be enough to determine the classification. However, if the note is not at the piece's beginning, more features will need to be checked (and our classifier will move down the tree). Ross Quinlan's C4.5 [1993] describes one implementation of a Decision Tree learning algorithm.

3.3.5 Instance-based Learning

In contrast to attempting to construct a general target function from training examples, Instance-based (or Memory-based) learning simply stores the training examples. Any generalising occurs only when an unseen example needs to be classified. This is when a more complex comparison to the training data, beyond simply finding an exact match, is used. This postponement of complex calculations is the reason such methods are often referred to as "lazy" learning methods.

The main issue in Instance-based learning is the way in which prior instances are used to classify new, unseen instances. The most basic method is the k-nearest Neighbour algorithm. The standard Euclidean distance is used to determine the nearest neighbours of an instance. The Euclidean distance formula is a simple way of measuring distance in an n-dimensional space. For two points P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n), it is formalised as:

    \mathrm{Distance} = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_n - q_n)^2}

From here, the most common classification amongst these neighbours is chosen as the classification for the new instance. A similar approach is the Distance-Weighted Nearest Neighbour algorithm. The main difference here is that neighbours closer to the new instance are given more importance than those further away. Locally weighted regression is a generalisation of these approaches. The same idea applies, except some other function, be it linear, quadratic, or other, is used to determine the classification.
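As a concrete illustration of the basic algorithm (TiMBL itself offers more elaborate metrics and weighting schemes), here is a minimal k-nearest Neighbour sketch using the weather points from Section 3.3.1:

    import math
    from collections import Counter

    def knn_classify(train, x, k=3):
        """Majority label among the k training points closest to x."""
        dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        nearest = sorted(train, key=lambda item: dist(item[0], x))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    # The (max temperature, humidity %) points from Section 3.3.1.
    train = [((21, 35), "No-Rain"), ((17, 70), "Rain"), ((19, 75), "Rain")]
    print(knn_classify(train, (18, 72), k=3))  # 'Rain'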

Conversely to the Decision Tree approach, when using Instance-based learning the entire vector is immediately taken into consideration. There is no chance of a classification being made after the observation of only a single feature value. There are multiple algorithms that use this approach, and TiMBL (the Tilburg Memory-Based Learner) is an implementation of a number of them. This is the ML software that has been incorporated into our final system.

Chapter 4

Representations

An important part of any computational music task is to clearly define the ways in which music can be encoded. Music exists purely as sound. It does, however, have a well-defined representation language that allows visual analysis. Similarly, we need to determine how to represent our music in a way that allows machine analysis. Possible representations are now discussed.

4.1 Classifications

Before considering the possible ways of encoding the input for the classification phase of the system (that is, the feature vectors), it is equally important to consider the different ways of encoding the output (that is, the classifications themselves). There are a number of factors to consider. One is the overall architecture of the system. The main question is whether we intend to classify entire chords in one go, or rather attempt to handle each part (the Alto, Tenor and Bass lines) separately, using a separately trained classifier for each. Even once this decision is made, there are different ways to approach each option. These various approaches are further explained below.

Full Chord

Architecturally speaking, the simplest approach to classifying harmony is to use entire chords. With such a system, only one classifier needs to be trained, and that classifier needs to be consulted only once for each event being classified. However, encoding these chords is not a trivial matter. One option is simply to use note-name combinations (for example, CEG or AC♯E). However, even if octave is ignored (i.e. a low G is treated the same as a G two octaves above), this would clearly result in very sparse data (that is, sparser than is necessary to properly distinguish between different harmonisations). The problems caused by encoding features in this way are also discussed later. We have thus focused on alternative methods that avoid such problems in our implementation.

A sensible alternative to using note names, and the one that best suits an approach that normalises each piece so that notes are represented as semitonal distances from the tonic, is to normalise the classifications to be relative to the tonic. Additionally, since encodings in this task are essentially treated symbolically (rather than mathematically), the chosen symbols for these encodings are relatively unimportant, as long as they remain consistent. Concerns about this approach are discussed later.

A simple way of converting these classes to symbols is to build a string of 12 bits where the first bit is the tonic, the second bit is the next semitone up, and so on. Then, for each note in the chromatic scale (moving up in semitones), the corresponding index in the string can be assigned a value of 0 or 1 depending on whether or not that note is found in the harmonisation (1 if that note is on, 0 otherwise). Each class will therefore consist of between zero and three 1s, with the remainder 0s. For example, the chord consisting only of the tonic would be represented by 100000000000. Using this approach, there are 1,464 possible such strings (thus a maximum of 1,464 different classes), although clearly not all of these will be musically acceptable.

Clearly, this approach fails to distinguish between the voicings of notes by different parts. That is, a tonic chord in which the Bass part voices the tonic itself, the Tenor part voices the third, and the Alto voices the fifth would be considered equal to any inversion of that chord (provided that it is those same three notes being used) after both have been normalised.
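The encoding itself is mechanical. A minimal sketch, where a chord is given as the set of its notes' semitonal offsets from the tonic (the example chords are arbitrary illustrations):

    def chord_class(offsets):
        """Encode a set of semitone offsets from the tonic as a 12-bit string."""
        bits = ["0"] * 12
        for o in offsets:
            bits[o % 12] = "1"  # octave ignored, as discussed above
        return "".join(bits)

    print(chord_class({0}))        # '100000000000' (the tonic alone)
    print(chord_class({0, 4, 7}))  # '100010010000' (a major triad on the tonic)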

Individual Parts

Since attempting to assign an entire chord to a melodic event results in a large number of different choices (even after key normalisation), although probably not the figure of 1,464 mentioned above, the reduction in possible classes achieved by treating each of the three harmonising parts separately should be considered as a potentially more intelligent approach. Using this approach, there will be only 13 possible classes for each part (one for each note in the chromatic scale beginning on the tonic, plus a class representing "no new note is performed by that part", whether it be a rest or a note sustained from earlier). Of course, when the three parts are combined there are significantly more combinations of classifications (2,197 in fact). However, each individual model will only need to consider 13 possibilities at a time. This is quite a different task to that of choosing full chord classifications for each event.

Perhaps this approach is more closely related to the approach that Bach himself would have taken in harmonising a chorale. Rather than determining the harmonic motion of the piece, three separate and disjoint lines of music will be found, which will hopefully combine to create the most appropriate harmonic sequence. One issue that needs to be addressed in implementing a system that follows this approach is determining the order in which the parts should be found. Clearly, each of the three harmonising parts is not independent of the other two. So, using the classifications that have been produced for the other parts would be beneficial. This would, consequently, result in the introduction of new features aimed at capturing the relationship between the part in question and the other harmonising parts.

Again, there are different ways of encoding the aforementioned 13 classes that can be used by each classifier to classify an event in the melody. And, much like the options for Full Chord classifications, the chosen representation itself will have no impact on the results; consistency is the only requirement.

4.2 Features

The nature of Machine Learning requires observations about a particular event to be made, so that a decision can then be made about a resulting classification for the event. Much as a human would first consider some features of an event, these must be encoded for the ML software. Clearly, the efficiency of the system will hinge on the ability to capture as much information as possible with the chosen features. In order to categorise the types of features that can be used to describe an event in a musical melody, we have defined two types: Local (or Melody) features and Global (or Piece) features. Within these two categories, further subdivisions can be made to distinguish between pitch-related features and timing-related features. However, we have not made this division explicit. Below, we consider the different possible encodings of these features.

4.2.1 Local Features

The features deemed Local features are those that are specific to the point in the melody for which we are attempting to determine a chord. These features need to be recalculated for every single event in the piece and may hold different values for each event. When determining harmonisation, the most important aspect to consider is pitch (that is, the name of the note, C or E♭ for example). In the context of this project, note duration is perhaps less significant, as we are focused on determining a set of notes with which to harmonise the melody, the duration of those harmonising notes being ignored. However, as mentioned, time-domain features of the melody will be captured since, for example, it may well be that a short note in the melody is unlikely to be harmonised by a new chord. The melody features that need to be considered are as follows.

Current Pitch

This is the pitch of the note being performed in the melody (that is, the event we are trying to classify). There are a number of different ways of encoding this feature. In order to avoid the problem of sparse training data, it seems essential that this feature is measured relatively, rather than using the actual pitch (such as C or E♭). The obvious solution is to somehow normalise all of the pieces, so that relationships between the notes within a piece are captured, as opposed to the relationships between notes in general. Additionally, it seems reasonable (as it greatly simplifies the task) to ignore the octave in which the note is voiced, since including it results in sparse data.

So, if the piece's key can be determined, the pitch of each note can be measured as a semitonal distance from the tonic. For example, if the piece is found to be in C (or, for that matter, C minor), then a C note within that piece can be assigned the value 0, and an E♭ the value 3, since they are 0 semitones and 3 semitones (respectively) from the tonic. One immediate problem that arises from this solution, and a problem that leads into an entirely different computational music task, is that of determining the key when it is not explicitly given in the score. Provided that the key is explicitly stated, there is no need to address this issue.

Pitch of Previous/Next n Notes

These features allow the note being classified to be considered within the context of the melody. There are a number of decisions to be made with regard to this set of features, and many different ways of encoding them. One immediately obvious way is to simply reuse the values of the Current Pitch feature for those notes to which we are referring. This is reasonably straightforward to implement; however, it may not be the most effective approach. An alternative is, rather than measuring the pitch of surrounding notes relative to the piece's tonic, to measure their pitches relative to the current Soprano note's pitch. So, for example, we would explicitly capture that the pitch of the next note in the melody is an absolute distance of 1 semitone away, the pitch in two notes' time is 3 semitones away, and so on. Another way of encoding these features is to indicate the direction of the melodic contour. That is, to assign a value of "up" or "down", followed by a number of semitones. This more accurately captures the movement of the melody, and would also likely result in fewer value possibilities than the first approach mentioned above.

One general underlying concern is that all of these encodings use a symbolic approach to representing features. That is, even where numbers are used to represent the features, these numbers will be treated symbolically by the ML system, rather than numerically. Mathematics is an important part of music. However, while it would make sense to approach this task in a more mathematical way, and it is possible to have ML software treat features mathematically, the mathematics of music is not so simple. For this reason, a more mathematical approach is left for future work.

Encoding aside, using different values for n may also have a significant impact on the success of this system. Taking larger context windows (higher n) gives more information to help choose a chord classification. However, having too many context features may lead to sparsity of data within the model created by the ML software, making it difficult to find closely matching feature vectors. The impact of this decision will be determined by running experiments using different n-values.
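The two pitch encodings just discussed can be sketched as follows: the tonic-relative Current Pitch value, and a direction-plus-interval contour. This is an illustration rather than the project's code; it assumes conventional note-name spellings, and for the contour it assumes absolute (MIDI-style) pitch numbers so that direction is well defined.

    # Semitone positions of the natural notes within an octave.
    NATURALS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

    def semitones(name):
        """Semitone index of a note name such as 'Eb' or 'F#' (octave ignored)."""
        return (NATURALS[name[0]] + name.count("#") - name.count("b")) % 12

    def relative_pitch(note, tonic):
        """Semitonal distance of a note above the tonic (the Current Pitch value)."""
        return (semitones(note) - semitones(tonic)) % 12

    def contour(prev_pitch, cur_pitch):
        """Contour encoding: 'up'/'down'/'same' plus the interval in semitones."""
        step = cur_pitch - prev_pitch
        direction = "same" if step == 0 else ("up" if step > 0 else "down")
        return "%s%d" % (direction, abs(step))

    print(relative_pitch("C", "C"))   # 0
    print(relative_pitch("Eb", "C"))  # 3
    print(contour(60, 63))            # 'up3'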

Current Length

This feature should represent the duration of the melody note currently being analysed. Simply having a well-defined (and sensible) mapping between each possible duration and a symbol should be adequate, since the mathematical relationships between feature values are being ignored. Table 4.1 contains the symbols used.

    Name            Symbol
    Semibreve       1
    Minim           2
    Crotchet        4
    Quaver          8
    Semiquaver      16
    Demisemiquaver  32

    Table 4.1: Note Length key

Of course, there are alternatives to this encoding of the length feature. For example, in the same way that a tonal centre is chosen for classifying pitch as a relative value, length could be measured as a relative value. However, as mentioned previously, this will intuitively have a less significant impact on how the melody is to be harmonised.

Length of Previous/Next n Notes

In the same way that the pitch of the surrounding notes in the melody can be represented by features, so too can the length of these notes. Whatever decisions are made regarding both the chosen window size and the chosen encoding of length (both discussed previously) will have to be followed by this set of features.

Distance to Previous/Next Bar

Although it may not seem particularly relevant to harmonisation, the location of a note within a bar (also referred to as a "measure") of music can have an impact on its harmonisation. Bar lines often provide a basic (though by no means comprehensive) partitioning of musical phrases, and a note at the beginning of a phrase may well be treated differently to a note later in the phrase. The encoding of these features should be quite simple. We can hope that the ML software used is able to determine the ways in which various features work together. So, it should be adequate to count the number of notes (that is, melody events) that occur between the current note and the bar line in question. From this, if the length of those notes is needed, it can be taken from the context length features described above, and a more precise placing of the note within the bar can be made.

Location within Piece

While it has already been mentioned that the location of a note within a bar may affect its harmonisation, we may also consider the location of that note within the piece as a whole. For example, if the note occurs at either extreme of the piece (beginning or end), it may well be that the chord is more likely to be the tonic chord. Encoding this feature can easily be done by counting the number of bars in the piece (say, totalbars), then noting the number of the bar in which the note in question occurs (say, currentbar), and working out currentbar/totalbars to find a value between 0 and 1. This number does not give a particularly precise location of the note in question (i.e. the value of this feature will be the same for every note within the same bar). Also, this feature should not be treated as symbolic.

4.2.2 Global Features

While the features specific to each individual event may be more useful in helping to classify that event with a chord, it may be necessary to append some more global information about the piece to that event's feature vector. Of course, the need for such features depends greatly on how the above melody features are represented. The global features that we have considered are as follows.

Metre

The metre of a piece (indicated explicitly by the piece's time signature) states the piece's underlying rhythm and defines what constitutes a bar. The majority of pieces in the chorale corpus used (close to 90%, in fact) are in simple quadruple time (4/4), which indicates that each bar comprises 4 crotchet beats. The remaining pieces are in simple triple time (3/4); in such cases, there are 3 crotchet beats per bar. Fortunately, a piece's metre is explicitly stated at the beginning of the score, so one of the two values mentioned can be assigned to this feature for each melodic event within the piece.

Key Signature

Obviously, unless the pitch of a note is represented as a raw note name (such as C or E♭), the entire key signature of a piece need not be represented by a feature (since all pieces will be normalised to the same tonal centre, which is represented by 0). However, it is worth noting that the key signature does need to be extracted from each piece in order to deduce the values for all pitch features. On top of this, it seems wise to use the piece's tonality (its major/minor classification) as a feature, since this tends to define a scale of notes which can be used. Although other notes may be used, they tend to occur less frequently. In the past, this issue of chorales being major or minor has led researchers simply to build two separate models for harmonisation: a major model and a minor model. The result of this is a reduction in the amount of training data that can be used to build each model. Our approach is to build one model, and simply rely on a feature to capture this partition in the corpus. The use of a tonality (major/minor) feature will be tested. However, the piece's tonic will be used implicitly in our system.
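Since the metre and key are stated at the head of each score, extracting them from a **kern file amounts to scanning the tandem interpretation records (such as the *M2/2 and *k[b-] records of Figure 3.1). The sketch below illustrates this; the key-designation form (e.g. *D: for D major, *d: for D minor) is an assumption about the corpus following standard Humdrum conventions, not a description of our actual scripts.

    def global_features(kern_lines):
        """Pull metre and tonality from Humdrum tandem interpretations."""
        features = {}
        for line in kern_lines:
            token = line.split("\t")[0]  # all spines share these records
            if token.startswith("*M") and "/" in token:
                features["metre"] = token[2:]  # e.g. '4/4' or '3/4'
            elif token.startswith("*") and token.endswith(":"):
                tonic = token[1:-1]
                features["tonic"] = tonic.upper()
                # Upper-case letter = major key, lower-case = minor.
                features["tonality"] = "major" if tonic[0].isupper() else "minor"
        return features

    print(global_features(["**kern", "*k[b-]", "*d:", "*M3/4", "=-", "2d"]))
    # {'tonic': 'D', 'tonality': 'minor', 'metre': '3/4'}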

Piece Length

Once again, this is a feature that should be made redundant by other features. In this case, the length of the piece will be used in determining a value for the event's Location within Piece feature; however, the length itself will not add any useful information to assist in making classifications. The use of the piece's length as an actual feature will not be tested in our system.

Previous n Classifications

Since we are aiming to find the optimal progression of chords to harmonise a melody, it seems ideal to incorporate the context of the harmony as well as that of the melody. So, our goal is to implement a system that can produce classifications on-the-fly, allowing these classifications to be used as features for the following events. The representation of these features, of course, depends on the encoding of classifications (discussed earlier). Also, the choice of an appropriate value for n is as much a concern as it is for the local features mentioned above.

Chapter 5

System Architecture

In this chapter, we consider the architecture of our system on three levels. We begin with the most abstract description, and conclude with a specific description of the scripts that comprise our final implementation.

5.1 High Level

The system developed completes three main tasks. The first involves training a Machine Learning classifier that is capable of harmonising a chorale melody. The second is the classification stage, in which this classifier is tested with new melodies. The third task we need to consider is that of evaluating our system's accuracy. The various ways of completing this latter task are further explored in Chapter 6. Below, the first two tasks are broken down into more specific subtasks.

5.1.1 Training

The training phase comprises every step required to take a corpus of chorales (in whatever format is available to us), prepare it for processing, convert each event that we wish to classify into a feature vector (and determine its classification), arrange these vectors/classifications into the format required by our chosen ML classification package, and feed this data to the classifier. These phases are explained conceptually below.

Normalise the corpus

Beginning with a collection of complete chorale scores, some pre-processing of the data takes place. This ensures that all useful information (information that will contribute to the determination of feature values) is kept, while unnecessary information (such as code used purely for formatting and aesthetics) is discarded. The result of this phase is a collection of simplified chorales, stored in a more easily read file structure, ready for further processing.
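The sketch below illustrates the kind of filtering involved, under assumed Humdrum conventions ('!'-prefixed records are comments, and purely visual tandem interpretations such as clef markings carry no harmonic information); it is illustrative only, not the project's actual normalisation script.

    def normalise_kern(path):
        """Strip a **kern file down to the records useful for feature extraction."""
        keep = []
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                if not line or line.startswith("!"):
                    continue  # reference records and comments: formatting only
                first = line.split("\t")[0]
                if first.startswith("*clef") or first.startswith("*staff"):
                    continue  # visual layout, irrelevant to harmonisation
                keep.append(line)
        return keep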