Modeling Form for On-line Following of Musical Performances

Bryan Pardo (1) and William Birmingham (2)
(1) Computer Science Department, Northwestern University, 1890 Maple Ave, Evanston, IL 60201
(2) Department of Math and Computer Science, Grove City College, 100 Campus Drive, Box 3123, Grove City, PA 16127
pardo@northwestern.edu, wpbirmingham@gcc.edu

Abstract

Automated musical accompaniment of human performers often requires an agent be able to follow a musical score with similar facility to that of a human performer. Systems described in the literature represent musical scores in a way that assumes no large-scale structural variation of the piece during performance. If the performer deviates from the expected path by skipping or repeating a section, the system may become lost. We describe a way to automatically generate a Markov model from a written score that models the score form, and an on-line algorithm to align a performance to a score. The resulting system can follow performances that take alternate paths through the score without losing its place. We compare the performance of our system to that of sequence-based score followers on a melodic corpus of 98 Jazz melodies. Results show that explicitly representing the branching structure of a score significantly improves score following when the branch a performer may take is unknown beforehand.

Introduction

Automated musical accompaniment that reacts naturally to the human performer is a long-standing goal of a number of computer-music researchers (Grubb and Dannenberg 1994; Dannenberg 1984; Bloch and Dannenberg 1985; Toiviainen 1998; Raphael 1999). The ideal is a peer musician that can be integrated into an ensemble of human players with minimal need for the humans to adjust their interaction styles to accommodate the computer performer. For many styles of music, this requires an agent that is able to follow a representation of a written score with similar facility to that of a human performer. Systems that perform this function are called score matchers or score followers.
Figure 1 shows a simple case of score following. The top portion of the figure contains a simple written lead sheet, or score. A musician performs a score by translating the note, chord, key, and other symbols into a sequence of performance actions (depress piano key k at time t with velocity v, for example). These actions result in a sequence of events that make the performance. In computer score following, the performance is often encoded as MIDI (MIDI-Manufacturers-Association 1996). Figure 1 shows an example MIDI performance of the written score. This is shown in piano roll notation. Here, each bar represents a note. The horizontal placement of the note represents the onset time. The vertical placement of the note represents the pitch. Note duration is indicated by the length of a bar.

[Figure 1. Score Following. Panels: written score; MIDI performance (piano roll, vertical axis: piano key); transcription (C E C F D C G D); alignment (C E C F D C G D); current score location.]

Copyright 2005, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Score following can be broken down into transcription and alignment (also known as matching). Transcription involves parsing the performance into a sequence of salient events. In Figure 1, transcription consists of encoding each MIDI note-on event as a simple pitch class, drawn from the set of twelve pitch classes used in Western music. Matching consists of finding the best alignment between the sequence of events in the performance transcription and the events in the score. Typically, these events are single notes, although they may also be lyrics, percussion sounds, or groups of notes. A score follower, unless otherwise stated, is presumed to align the performance to the score in real-time as the performance takes place.

Because of the difficulty of dealing with polyphonic MIDI and audio, researchers (Dannenberg 1984; Dannenberg and Mont-Reynaud 1987; Puckette and Lippe 1992; Large 1993; Puckette 1995; Vantomme 1995; Desain, Honing et al. 1997; Grubb and Dannenberg 1997) generally restrict matching to a monophonic score that (nearly) completely specifies the pitch and ordering of every note. For the sake of simplicity, information in the score about key, meter, dynamics, and song structure is ignored, leaving a simple sequence of note on and off events.

The standard practice for score following (Vantomme 1995; Desain, Honing et al. 1997; Grubb and Dannenberg 1997) is to linearize a score by removing structural branch points (e.g., repeats, codas, etc.) before the performance begins. This effectively limits the performance to a single path through the form that is not changeable during performance. Thus, existing score followers require fixing, a priori, how the repeats in a score, like that in Figure 2, would be handled by the musicians. The performers would have to agree on repeating (at the end of measure two) once and then going on to the end. The musicians would then be prohibited from altering the path during performance.

In many live performance situations, musicians repeat or skip a section in response to the needs of the moment. Musicians often extend a piece to let dancers who are enjoying the music continue dancing, or shorten a piece (perhaps by skipping an introductory section) when they are running behind schedule. In these cases, existing score followers cannot adapt to the changing performance situation.
When a score contains repeats and particularly when the form is variable (e.g., the form may be ABA or AABA or any permutation depending on the whim of the performers) a score representation that does not allow branch points is undesirable. To account for variability in form, we need to extend the score model to represent structural score elements that affect the form, such as repeats and codas.

In this paper, we introduce a new method for representing large-scale form in score following. Our representation is based on Markov models, which allow us to both capture the form of a piece implied by the score, as well as reason probabilistically about how a performer is moving through the piece. The system can then model performances that may start anywhere in the form and repeat or skip sections (as specified by the score) a nonpredetermined number of times. This greatly expands the types of music amenable to automatic score following.

The remainder of this paper describes Markov models, shows how one may follow a variable-form performance using Markov models, and compares Markov model score followers to string matching based followers, using a corpus of 98 Jazz melodies.

Music Scores as Markov Models

A Markov model describes a process that goes through a sequence of discrete states, such as notes or chords in a lead sheet. The model is a weighted automaton that consists of:

- A set of states, $S = \{s_1, s_2, s_3, \ldots, s_n\}$
- $S_E$, a subset of $S$ containing the legal ending states. As a default $S_E = S$
- A set of possible emissions, $E = \{e_1, e_2, \ldots, e_n\}$
- A transition function, $\tau(s_i, s_j)$, that specifies the probability of a transition to $s_j$ from $s_i$
- A function, $\sigma(s_i)$, that defines the probability of beginning in $s_i$
- An emission function, $\varepsilon(s_j, e_i)$, that defines the probability state $s_j$ will emit $e_i$

[Figure 2. A score as a directed graph (Markov model). Nodes are labeled with chord symbols (C, F, G7, C, A, D); edges are labeled with transition probabilities.]

Markov models are generative. A generative model describes an underlying structure able to emit a sequence of observed events. A musical score may be represented as a (hidden) Markov model.
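As an illustration, the components listed above can be held in a small container. The sketch below is written in Python, which the paper does not use. The class name `ScoreModel`, the helper `chord_emission`, the field names, and all probability values are assumptions chosen for the example; they are not read from Figure 2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreModel:
    """Container for the model components (hypothetical field names)."""
    states: list                      # S, state labels in score order
    alphabet: list                    # E, the emission alphabet
    trans: list                       # trans[i] = {j: tau(s_i, s_j)}, sparse rows
    start: list                       # start[j] = sigma(s_j)
    emit: list                        # emit[j] = {e: epsilon(s_j, e)}
    end_states: Optional[set] = None  # S_E; None stands for the default S_E = S

    def is_normalized(self, tol=1e-9):
        """Return True if sigma and every row of tau and epsilon sum to 1."""
        rows = [self.start]
        rows += [list(row.values()) for row in self.trans]
        rows += [list(row.values()) for row in self.emit]
        return all(abs(sum(row) - 1.0) <= tol for row in rows)


def chord_emission(label, alphabet, p_correct=0.8):
    """Illustrative emission row: the written chord is the most likely
    observation and the remaining mass is spread evenly (assumed values)."""
    p_other = (1.0 - p_correct) / (len(alphabet) - 1)
    return {e: (p_correct if e == label else p_other) for e in alphabet}


# A four-chord passage C, F, G7, C with a repeat from G7 back to the first C.
# The 0.3 / 0.7 split at the branch point is an assumption for illustration.
alphabet = ["C", "F", "G7"]
states = ["C", "F", "G7", "C"]
example = ScoreModel(
    states=states,
    alphabet=alphabet,
    trans=[{1: 1.0}, {2: 1.0}, {0: 0.3, 3: 0.7}, {3: 1.0}],
    start=[0.25, 0.25, 0.25, 0.25],
    emit=[chord_emission(s, alphabet) for s in states],
)
```

Storing the transition function as sparse rows keeps the branch points explicit: a state with more than one outgoing edge to a distant state is a repeat or a jump in the written score.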
The directed graph in Figure 2 shows a Markov model created from the chord labels in the score passage in the figure. Nodes represent chords in the score. Directed edges (arrows) represent transitions. Repeats and skips (codas) in the score are represented by directed edges connecting distant portions of the score model. Transition probabilities are indicated by a value associated with each edge.

An observation sequence, $O = o_1, o_2, \ldots, o_n$, is a sequence of events drawn from the emission alphabet, $E$. Relating this to music, the sequence of musical events (notes, chords, etc.) generated by the performer is the observation sequence generated in response to the score. The emission function $\varepsilon(s_j, e_i)$ defines the probability that state $s_j$ will emit $e_i$. For a music performance, this is equivalent to the probability that the jth item in the score (a chord symbol) would elicit the ith performance event (a chord voicing on the piano).

A Markov model is called a hidden Markov model, or HMM, when it has at least one state whose emission function is non-zero for multiple elements of the emission alphabet. An example would be a chord symbol that maps onto multiple chord voicings.

In our approach, the emission function is determined beforehand through an empirical study of the likelihood of performance events, given each score state. Good estimation of the emission function lets a system model a variety of possible causes for variable performance output in response to score states, including production errors (cracked notes, poor pitch control), transcription errors, and intentional variation (alternate voicings).

For example, we calculate a note-based emission probability function $\varepsilon(s_j, e_i)$ for an alto saxophonist by recording the musician and automatically transcribing his performance of an assigned set of chromatic scales and chord arpeggios. The resulting count of associations between performed pitch and transcribed pitch is used to estimate $\varepsilon(s_j, e_i)$ (Pardo and Birmingham 2002). In another study (Pardo 2005), we estimate the likelihood of a set of performance notes mapping onto a chord symbol in the score through empirical study of the performances of a Jazz pianist on a known sequence of chord symbols. The associations learned in this training phase can then be used to estimate $\varepsilon(s_j, e_i)$ where the score element is a chord name and the performance emission is a set of notes.

Using state transition values derived from the score and emission probabilities based on prior training allows the construction of score models without the need to train the model on a set of performances of that score prior to use. This lets the system function on performances of scores it has never heard before.

Finding the current state in the model

For score following, we want to know the most likely current state in the score model, given the observation sequence. To find the most likely current state, we modify the Forward-Backward algorithm (Rabiner and Juang 1993) for real-time score following. Our approach is distinct from the standard algorithm, in that it is designed to work on an in-progress sequence (a live performance), rather than a completed sequence. Thus, we use only causal information.

The emission function $\varepsilon(s_j, e_i)$ gives the probability that state $s_j$ will emit $e_i$. We define the observation function, $\phi(s_j, e_i)$, as the probability of being in $s_j$ when observing $e_i$. Equation 1 follows directly from Bayes' theorem. Here, $P(e_i)$ is the prior expected probability of the ith performance event and $P(s_j)$ is the prior probability of the jth score state.
$$\phi(s_j, e_i) = \frac{P(s_j)\,\varepsilon(s_j, e_i)}{P(e_i)} \tag{1}$$

For some music styles or performers, it may make sense to develop estimates of performance events and score state likelihood, especially if the style tends to use a subset of the possible pitches. It may, however, be impractical to collect meaningful statistics for the full alphabets of score states and performance events. In this case, one can save significant training time by assuming all emissions in $E$ have equal prior probability and all states in $S$ have equal prior probability. Given this assumption, the ratio of their probabilities is a constant, $k$, as follows:

$$\phi(s_j, e_i) = k\,\varepsilon(s_j, e_i) \tag{2}$$

Given an observation sequence, $O = o_1, o_2, \ldots, o_n$, and a starting probability distribution $\sigma(s_j)$, we define the alpha function for the first observation as follows:

$$\alpha(s_j, o_1) = \phi(s_j, o_1)\,\sigma(s_j) \tag{3}$$

This captures the probability of starting in any given state in the model. The alpha function for each subsequent observation may be calculated recursively using Equation 4. Here, the summation captures the likelihood of arriving in state $s_j$ over all routes through the Markov model of length $i$.

$$\alpha(s_j, o_i) = \phi(s_j, o_i) \sum_{k} \tau(s_k, s_j)\,\alpha(s_k, o_{i-1}) \tag{4}$$

Equation 4, when implemented, will often generate underflow errors as the length of the observation sequence increases. We are not interested in the likelihood of the overall sequence up to the current observation. We are only interested in finding the most likely state when we have reached the ith observation in the sequence. Given this, we create a state-value function that normalizes state probabilities at each observation, avoiding underflow, as given in Equation 5.

$$v(s_j, o_i) = \frac{\alpha(s_j, o_i)}{\sum_{k} \alpha(s_k, o_i)} \tag{5}$$

Equation 5 requires that we make an adjustment to Equation 4, resulting in Equation 6.

$$\alpha(s_j, o_i) = \phi(s_j, o_i) \sum_{k} \tau(s_k, s_j)\,v(s_k, o_{i-1}) \tag{6}$$

The state with maximal value is then taken to be the current location, $l$, in the model. Thus, the current score location is given by Equation 7.
$$l = \arg\max_{s} \, v(s, o_i) \tag{7}$$

The normalization in Equation 5 is not possible if the observation sequence cannot be generated by the Markov model and the probability of being in any state in the model is zero. If, for some observation $o_i$, all states have an observation probability of zero, the presumption is that the performer has played something not in the score. The score follower then resets and $o_{i+1}$ is treated as the initial observation, Equation 3 is applied, and score following proceeds again from that point.

The Viterbi algorithm (Rabiner and Juang 1993) is a commonly-used alternative to the Forward algorithm. Instead of calculating the likelihood of $s_j$ given all the paths through the model, the Viterbi algorithm estimates the probability of only the most likely path through the model. Our method can use a Viterbi-style estimation by replacing the summation in Equation 6 with the maximization in Equation 8.

$$\alpha(s_j, o_i) = \phi(s_j, o_i) \max_{k} \big( \tau(s_k, s_j)\,v(s_k, o_{i-1}) \big) \tag{8}$$

While it is often better to favor the most likely path, since it gives the alignment output continuity, there are subtle differences. For score following, Viterbi must be adapted for on-line use, where the best current state may be asked for at any time. In the on-line case, Viterbi may be more susceptible to "garden-path" errors, where what initially appears to be the correct path proves to be incorrect only after several additional observations have been made. Later in this paper, we compare the on-line performance of the Viterbi and Forward methods on the corpus.

The current model architecture

We wish to create a single model that handles small-scale formal variation (e.g., the performer skips or repeats a note) and large-scale formal variation (e.g., the performer skips or repeats a section of the music). By introducing specific topological features into our Markov model, we cover both situations. Consider the following: if the performer skips a single score event, this may be modeled with skips in the Markov model.

Figure 3 shows a hidden Markov model for the first eight beats of a written score. This model admits skipped or repeated states, as well as allowing for repeats as shown in the written score. Here, each state represents a beat. The arrows represent allowable transitions between states. This model also allows for self loops on every state. The loop from the eighth state back to the first state corresponds to the repeat sign shown after the eighth beat of the written score.

[Figure 3. An HMM model allowing skips and repetitions. States 1 to 8, one per beat; a histogram over the pitch classes C, C#, D, D#, E, F, F#, G, G#, A, A#, B shows the emission probabilities for the state corresponding to the note E.]

The histogram in Figure 3 shows the emission probability function for the third state in the model, which corresponds to the E in the written score. Here, the height of the bar corresponds to the relative likelihood of observing the given pitch class when in the third state of the model.
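The on-line update of Equations 3 to 8 can be sketched in a few lines. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the class name `OnlineFollower` and its fields are hypothetical, the constant $k$ of Equation 2 is taken as 1 (it cancels in the normalization of Equation 5), and a reset is triggered whenever every state value is zero after an observation.

```python
class OnlineFollower:
    """Normalized on-line Forward (or Viterbi-style) score follower sketch."""

    def __init__(self, trans, start, emit, use_viterbi=False):
        self.trans = trans   # trans[k] = {j: tau(s_k, s_j)}
        self.start = start   # start[j] = sigma(s_j)
        self.emit = emit     # emit[j] = {event: epsilon(s_j, event)}
        # Equation 6 sums over predecessors; Equation 8 takes the maximum.
        self.combine = max if use_viterbi else sum
        self.v = None        # normalized state values; None means "start over"

    def observe(self, event):
        """Consume one performance event and return the index of the most
        likely current state, or None if no state can explain the event."""
        n = len(self.start)
        # Equation 2 with k = 1: phi is proportional to the emission probability.
        phi = [self.emit[j].get(event, 0.0) for j in range(n)]
        if self.v is None:
            # Equation 3: first observation, or first one after a reset.
            alpha = [phi[j] * self.start[j] for j in range(n)]
        else:
            # Collect tau(s_k, s_j) * v(s_k, o_{i-1}) for every incoming edge.
            incoming = [[] for _ in range(n)]
            for k, row in enumerate(self.trans):
                for j, p in row.items():
                    incoming[j].append(p * self.v[k])
            alpha = [
                phi[j] * (self.combine(incoming[j]) if incoming[j] else 0.0)
                for j in range(n)
            ]
        total = sum(alpha)
        if total == 0.0:
            # Nothing in the model explains this event: reset, so that the
            # next observation is treated as the initial one.
            self.v = None
            return None
        self.v = [a / total for a in alpha]           # Equation 5
        return max(range(n), key=self.v.__getitem__)  # Equation 7


# Tiny four-state example with assumed numbers: each state emits its own
# pitch class with probability 0.9 and any other with probability 0.1 / 3.
pitches = ["C", "D", "E", "F"]
emit = [{p: (0.9 if p == own else 0.1 / 3) for p in pitches} for own in pitches]
trans = [
    {0: 0.25, 1: 0.5, 2: 0.25},
    {1: 0.25, 2: 0.5, 3: 0.25},
    {2: 1 / 3, 3: 2 / 3},
    {3: 1.0},
]
start = [0.25] * 4
follower = OnlineFollower(trans, start, emit)
positions = [follower.observe(p) for p in pitches]
```

Because only the previous vector of state values is kept, the cost per observation is proportional to the number of edges in the model, and passing `use_viterbi=True` switches the same loop from Equation 6 to Equation 8.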
The emission probability function for each element in the alphabet of possible score states is developed before constructing the score model, by analysis of a corpus of music performances in the style of the piece to be performed (Pardo and Birmingham 2002).

The following section describes a simple score following experiment that compares the use of an HMM like that in Figure 3 with the typical string alignment algorithms commonly used for score following.

A Simple Experiment

We have asserted that using a Markov model to explicitly model branch points in the written score improves score following where a performer repeats a section an unpredictable number of times. The following experiment illustrates this point.

We created a synthetic corpus, based on well-known Jazz pieces, designed to emphasize the effect of adequate score-structure representation on the ability of a score follower to handle alternate paths through the score that are chosen at run-time. The corpus consists of melodic lines from 98 Jazz pieces. These range from Bossa Nova (Corcovado), to Ballad (What's New), to Blues (Blue Monk), to Swing (Back Home in Indiana), to Jazz Waltz (Alice in Wonderland), to modal pieces (Footprints).

For each piece, the score consisted of the full written melody of the piece as shown in the Real Book (footnote 1), truncated to the first 64 beats and encoded as scores in the format of the popular music notation program Sibelius. A repeat mark was inserted at the end of the 32nd beat of each score. This effectively made each score have form AB, where the A section could be repeated an unspecified number of times.

A Markov model with the graph connectivity shown in Figure 3 was automatically generated from each score. All states in the model were set to an equal initial starting probability. Each state represented a single beat. At each beat, the Markov model could either repeat the state, move on to the next state, or skip forward two states.
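The topology used in the experiment can be generated mechanically from the number of beats and the positions of the repeat marks. The following Python sketch is one possible reading, with several assumptions: the function name `build_beat_model` is hypothetical, skipping is taken to mean a transition from state k to state k+2, the weight of a repeat edge (`p_repeat`) is not given in the text and is a placeholder, and each row is renormalized so that it sums to 1 where edges are missing or added. The default weights of 0.5 for advancing and 0.25 each for repeating and skipping are the values reported for the experiment.

```python
def build_beat_model(n_beats, repeat_edges=(), p_stay=0.25, p_next=0.5,
                     p_skip=0.25, p_repeat=0.25):
    """Return (trans, start) for a one-state-per-beat score model.

    trans[k] is a dict {j: probability of moving from state k to state j}.
    repeat_edges is an iterable of (source, target) zero-based state indices
    standing for repeat signs or jumps in the written score.
    """
    trans = []
    for k in range(n_beats):
        row = {k: p_stay}                 # self loop: stay on the same beat
        if k + 1 < n_beats:
            row[k + 1] = p_next           # move on to the next beat
        if k + 2 < n_beats:
            row[k + 2] = p_skip           # skip forward (assumed to mean k+2)
        for source, target in repeat_edges:
            if source == k:
                # Extra edge for a repeat or jump (weight is an assumption).
                row[target] = row.get(target, 0.0) + p_repeat
        total = sum(row.values())
        # Renormalize so the outgoing probabilities of every state sum to 1.
        trans.append({j: p / total for j, p in row.items()})
    start = [1.0 / n_beats] * n_beats     # equal starting probability everywhere
    return trans, start


# 64 beats with a repeat from the end of beat 32 back to beat 1.
trans64, start64 = build_beat_model(64, repeat_edges=[(31, 0)])
```

With zero-based indices, the edge `(31, 0)` connects the 32nd beat back to the first. Interior rows keep the stated 0.25 / 0.5 / 0.25 weights unchanged, while the row at the branch point and the rows at the end of the score are rescaled.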
For this experiment, the transition probability for moving forward a single state was 0.5, while those of repeating and skipping forward two states were each 0.25. At the end of state 32 (the 32nd beat), there was an additional connection back to the first state (the repeat of the A section).

For each score, we created four MIDI performances: one that performed the A section once, another that performed it twice, a third that performed it three times, and a final performance that skips the section entirely. We then followed the performance using the on-line modification of the Forward algorithm, the on-line modification of the Viterbi algorithm, an on-line local string matcher, and an on-line global string matcher used by a variety of researchers (Dannenberg 1984; Dannenberg and Mont-Reynaud 1987; Puckette 1995; Desain, Honing et al. 1997).

Footnote 1. The Real Book is a standard, albeit illegal (with no publisher or author information), compendium of Jazz lead sheets, used by professional Jazz musicians.

String matchers find the best alignment between two strings (sequences) by finding the lowest cost transformation of one into the other, in terms of operations (insertion or deletion of characters). Such matchers are all based on similar techniques and are the classic score following approach. Dynamic-programming based implementations that search for a good alignment of two strings have been used for over 30 years to align gene sequences based on a common ancestor (Needleman and Wunsch 1970). Global string alignment requires every element of the performance to be accounted for in an alignment to the full score. Local string alignment allows matching of a substring of the performance to any portion of the score (Pardo and Birmingham 2002).

The string matchers were unable to use the repeat information in the score file and thus always expected a performance with no repeat. Given this, one would expect the performance of both string matchers to degrade whenever the A section was presented an unexpected number of times, while the Markov models should maintain roughly similar performance, regardless of the number of repeats. The case where the initial section is skipped entirely should favor both Markov model approaches and the local string matcher.

[Figure 4. Mean score-follower error, by algorithm. Box plots grouped by condition; vertical axis: mean error in beats (0 to 45); horizontal axis: alignment algorithm. The mean errors printed below the box plots are tabulated here.]

Condition            F      V      L      G
Skip 1st section     3      3.2    4.3    33.7
Play once            0.7    0.2    0      0
Play twice           1.7    1      13.2   16
Play thrice          2.1    1.5    20.2   24.1

Figure 4 shows the score tracking errors generated in this experiment. Within each group, F stands for the Forward algorithm, V stands for the Viterbi algorithm, L stands for local string alignment, and G stands for global string alignment. Each column shows a box plot with lines at the lower quartile, median, and upper quartile values. The whiskers are lines extending from each end of the box to show the extent of the rest of the data. Outliers are indicated by plus symbols beyond the ends of the whiskers. All values indicate mean distance (in beats) between the correct location in the score and the location reported by the score follower over the course of a performance, or group of performances.
Below each box plot is the mean error for all cases.

The play once group in Figure 4 corresponds to the case where the number of repetitions is known beforehand. In this case, both string alignment approaches work perfectly (assuming one begins at the start of the piece), while the Markov models occasionally make a wrong choice at a repeat and take some time to recover, with the Forward algorithm performing worse than Viterbi.

Once the number of repeats increases, the benefits of explicit representation of score structure become clear. Both the string matching methods get lost for extended periods in the play twice and play thrice conditions, resulting in average position estimates that are many beats away from the correct location. Both the Forward and Viterbi-based followers stay within an average of two beats under all conditions, with the Viterbi performing slightly better, on average.

When the first section is skipped entirely, the global string alignment method fails. Both methods based on the HMM and the local string alignment method do significantly better, with methods based on the Markov model doing better than local string alignment.

Summary and Conclusions

We have described a score representation designed to handle the large-scale form variation often found in live performances of many styles of music. Our representation explicitly models those elements of a musical score that indicate repeats and jumps to different sections (coda symbols), and thus possible changes in form. From these elements, we induce a Markov model that allows us to accurately follow a live performance, using the on-line modification of either the Forward or the Viterbi algorithm. The model is generated automatically from the score, and can be used without training on a corpus of performances of the score in question. This is a significant advance in score following technology.
The experimental results in this paper show that explicitly representing branch points in a score significantly improves score following when the form a performer may take is not known beforehand and that the on-line Viterbi algorithm performs best on the performance corpus assembled for this paper.

Acknowledgments

The majority of this research was conducted at The University of Michigan, Ann Arbor, with partial support from the National Science Foundation under grant IIS-0085945. The opinions in this paper are solely those of the authors and do not necessarily reflect the opinions of the funding agency. We also thank Roger Dannenberg for comments on various sections of this work.

References

Bloch, J. and R. D. Dannenberg (1985). Real-Time Computer Accompaniment of Keyboard Performances. International Computer Music Conference.
Dannenberg, R. (1984). An On-Line Algorithm for Real-Time Accompaniment. International Computer Music Conference.
Dannenberg, R. and B. Mont-Reynaud (1987). Following an Improvisation in Real Time. International Computer Music Conference.
Desain, P., H. Honing, et al. (1997). Robust Score-Performance Matching: Taking Advantage of Structural Information. International Computer Music Conference.
Grubb, L. and R. Dannenberg (1994). Automated Accompaniment of Musical Ensembles. Proceedings of the Twelfth National Conference on Artificial Intelligence, AAAI, pp. 94-99.
Grubb, L. and R. Dannenberg (1997). A Stochastic Method of Tracking a Vocal Performer. International Computer Music Conference.
Large, E. W. (1993). Dynamic Programming for the Analysis of Serial Behaviors. Behavior Research Methods, Instruments and Computers 25(2): 238-241.
MIDI-Manufacturers-Association (1996). The Complete MIDI 1.0 Detailed Specification. Los Angeles, CA, The MIDI Manufacturers Association.
Needleman, S. B. and C. D. Wunsch (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443-453.
Pardo, B. (2005). Probabilistic Sequence Alignment Methods for On-line Score Following of Music Performances, Doctoral Dissertation, Electrical Engineering and Computer Science. University of Michigan: Ann Arbor, MI.
Pardo, B. and W. Birmingham (2002). Improved Score Following for Acoustic Performances. International Computer Music Conference (ICMC), Goteborg, Sweden.
Puckette, M. (1995). Score following using the sung voice. International Computer Music Conference.
Puckette, M. and C. Lippe (1992). Score Following In Practice. International Computer Music Conference.
Rabiner, L. and B.-H. Juang (1993). Fundamentals of Speech Recognition. Englewood Cliffs, New Jersey, Prentice-Hall.
Raphael, C. (1999). Automatic Segmentation of Acoustic Musical Signals Using Hidden Markov Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(4): 360-370.
Toiviainen, P. (1998). An Interactive MIDI Accompanist. Computer Music Journal 22(4): 63-75.
Vantomme, J. (1995). Score Following by Temporal Pattern. Computer Music Journal 19(3): 50-59.