Real-Time Audio-to-Score Alignment of Music Performances Containing Errors and Arbitrary Repeats and Skips

IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. XX, NO. YY, 2015

Tomohiko Nakamura, Student Member, IEEE, Eita Nakamura, Member, IEEE, Shigeki Sagayama, Member, IEEE

arXiv:1512.07748v1 [cs.SD] 24 Dec 2015

Abstract: This paper discusses real-time alignment of audio signals of music performance to the corresponding score (a.k.a. score following) which can handle tempo changes, errors and arbitrary repeats and/or skips (repeats/skips) in performances. This type of score following is particularly useful in automatic accompaniment for practices and rehearsals, where errors and repeats/skips are often made. Simple extensions of the algorithms previously proposed in the literature are not applicable in these situations for scores of practical length due to the problem of large computational complexity. To cope with this problem, we present two hidden Markov models of monophonic performance with errors and arbitrary repeats/skips, and derive efficient score-following algorithms with an assumption that the prior probability distributions of score positions before and after repeats/skips are independent of each other. We confirmed real-time operation of the algorithms with music scores of practical length (around 10000 notes) on a modern laptop and their ability to resume tracking the input performance within 0.7 s on average after repeats/skips in clarinet performance data. Further improvements and extension for polyphonic signals are also discussed.

Keywords: Score following, audio-to-score alignment, arbitrary repeats and skips, fast Viterbi algorithm, hidden Markov model, music signal processing

I. INTRODUCTION

Real-time alignment of an audio signal of a music performance to a given score, also known as score following, has been gathering attention since its first appearance in 1984 [1], [2].
Score following is a basic technique for real-time musical applications such as automatic accompaniment, automatic score page-turning [3] and automatic captioning of music videos. The technique is particularly essential for automatic accompaniment, which synchronizes an accompaniment to a performer on the fly, referring to performance and accompaniment scores. Automatic accompaniment enables live performance of ensemble music by one or a few performers. Many studies of score following have been carried out (see [4] for a review and [5]-[13] for recent progress).

Automatic accompaniment is particularly useful for practices, rehearsals and personal enjoyment of ensemble music. In these situations, performers often make errors. Moreover, performers may want to start playing from the middle of a score and generally make repeats and/or skips (repeats/skips). Since errors and repeats/skips are hard to predict, a score-following algorithm capable of handling arbitrary errors and repeats/skips is necessary to realize an automatic accompaniment system effective in those situations. Our aim is to develop such an algorithm.

Citation information: DOI 10.1109/TASLP.2015.2507862, IEEE/ACM Transactions on Audio, Speech, and Language Processing. (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

T. Nakamura is with the Department of Information Physics and Computing, Graduate School of Information Science and Technology, the University of Tokyo, Tokyo 113-8656, Japan (Tomohiko_Nakamura@ipc.i.u-tokyo.ac.jp). E. Nakamura is with the Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan (enakamura@am.kuis.kyoto-u.ac.jp). S. Sagayama is a Professor Emeritus of the University of Tokyo, Tokyo 113-8656, Japan, and is currently with the School of Interdisciplinary Mathematical Sciences, Meiji University, Tokyo 164-8525, Japan (sagayama@meiji.ac.jp).
Treatment of errors in score following is discussed in some studies [4], [5], [13], [14]. However, a detailed discussion and a systematic evaluation of the effectiveness of the methods for audio score following have not been given in the literature. Score-following algorithms that can follow repeats/skips have been proposed in [5], [11], [15]. The targets of these algorithms are predetermined repeats/skips from and to specific score positions, and treatment of arbitrary repeats/skips is neither discussed nor guaranteed. In fact, as we will show in this paper, simple extensions of these algorithms have the problem of large computational cost and cannot work in real time for long scores of practical length. Unless the problem is solved, score-following systems can only work with limited scores of very short length, or we must give up following arbitrary repeats/skips as most of the current systems do; both options sacrifice the vast potential application of score following. Therefore, it is essential to reduce the computational complexity to follow arbitrary repeats/skips.

The authors have presented a new type of hidden Markov model (HMM) that describes musical instrument digital interface (MIDI) performances with errors and arbitrary repeats/skips, and derived a computationally efficient algorithm for the HMM [13]. It reduces the computational complexity with an assumption that simplifies the probability distribution of score positions before and after repeats/skips. While a similar model would be applicable to the audio case, further discussion is required since audio inputs (frame-wise discrete in time and continuous in features) differ significantly in nature from MIDI inputs (continuous in time and discrete in pitches).

The main contribution of this paper is to present real-time algorithms that can follow monophonic audio performances containing arbitrary repeats/skips and errors. Although monophonic score following has been addressed since [1], [2],
arbitrary repeats/skips have never been discussed despite the practical importance of their treatment, as mentioned above. Because polyphonic score following is still an active field of research and the extension of the present method to polyphonic performances raises many additional issues, discussed in Sec. V, we confine ourselves to monophonic performances.

We develop a model of music performances containing errors and arbitrary repeats/skips with an HMM. We first discuss how various types of errors can be incorporated into the model (Sec. II). Next, we extend the model to incorporate arbitrary repeats/skips. In order to solve the problem of large computational cost for following arbitrary repeats/skips, two HMMs with refined topologies are presented. We derive efficient score-following algorithms with reduced computational complexity based on both HMMs (Sec. III). We demonstrate that both algorithms can work in real time with scores of practical length on a modern laptop computer and are effective in following performances with errors and arbitrary repeats/skips through evaluations using clarinet performances during practice (Sec. IV). We discuss possible improvements and extensions of the proposed algorithms for polyphonic inputs (Sec. V). Part of this study (Sec. III and a part of Sec. IV) was reported in our previous conference paper [12].

II. SCORE FOLLOWING FOR PERFORMANCES WITH ERRORS

A. Variety in Audio Performance and Statistical Approach

Score following is generally challenging since audio signals of music performances vary widely even if the same score is used. Four typical sources of variety in monophonic audio performance are listed below.

(a) Acoustic variations: Spectral features of audio performances depend on musical instruments and are not stationary. In addition, audio performances usually include noise caused by the surrounding environment and musical instruments (e.g.
resonance, background noise, breath noise and other acoustics).

(b) Temporal fluctuations: The tempo of the performance and the onset times and durations of performed notes deviate from those indicated in scores due to performers' skills, physical limitations of musical instruments and musical expressions. For example, performances during practice are often rendered in slow tempo to avoid errors.

(c) Performance errors: Performers may make errors due to lack of performance skills or misreadings of the score. Errors are categorized into pitch errors (substitution errors), dropped notes (deletion errors) and added extra notes (insertion errors) [1]. Besides, performers may make pauses between notes, for example, to turn a page of the score or to check the next note.

(d) Repeats/skips: Performers may repeat and/or skip phrases, in particular during practice. Furthermore, performers may add or delete a repeated section.

These four sources of variety in monophonic audio performance make score following difficult and motivate us to study it. In particular, it is essential to adapt automatic accompaniment systems to the variety in order to keep synchronization to live performances. Although it is out of the scope of this paper, there are other sources of variety in music performance such as ornaments [6], [13], [16], [17] and improvisation [18], [19].

Fig. 1. A hierarchical hidden Markov model with two levels that describes a music performance with deletion, insertion and substitution errors. See text.

Recent score-following systems commonly use probabilistic models such as HMMs to capture the variety of audio performances, and their effectiveness has been well confirmed [4] (and references in the Introduction). They are particularly advantageous for capturing continuous variations of audio features and for handling errors which are hard to predict. Therefore, we take the statistical approach in this study.

B.
Performance HMM

We represent the performance score with $N$ musical events, each of which is a note or a rest. A performer reads the score from event to event and keeps making a sound corresponding to an event. This process of performance can be modeled with a hierarchical HMM with two levels [20], [21], which we call the performance HMM. The top level describes the progression of performed events, and the bottom level expresses the temporal structure of the audio signal in a performed event.

Events correspond to states (top states) of the top-level HMM (top HMM), and the performance is described as transitions between the top states. Let $z^{(\mathrm{top})}_t = 0, \dots, N-1$ denote the random variable describing the top state at the $t$th frame ($t = 0, \dots, T-1$), and let $i$ and $j$ label a top state. The top HMM is parameterized by state transition probabilities $a_{j,i}$ and initial probabilities $\pi_i$:

$$ a_{j,i} := P(z^{(\mathrm{top})}_t = i \mid z^{(\mathrm{top})}_{t-1} = j), \tag{1} $$

$$ \pi_i := P(z^{(\mathrm{top})}_0 = i), \tag{2} $$

which satisfy $\sum_{i=0}^{N-1} \pi_i = 1$ and $\sum_{i=0}^{N-1} a_{j,i} = 1$ for all $j$.

Each top state is itself an HMM (bottom HMM), whose states (bottom states) correspond to subevents in an event, for
example, sustain of an instrumental sound, pauses between notes, etc. Let $L$ denote the number of bottom states in the top state, $z^{(\mathrm{bot})}_t = 0, \dots, L-1$ denote the random variable describing the bottom state at the $t$th frame, and let $l$ and $l'$ label a bottom state. The state transitions of the bottom HMM are characterized by three kinds of probabilities. The initial probability $\pi^{(i)}_l$ describes the probability of a transition to bottom state $l$ when top state $i$ is entered, the exiting probability $e^{(i)}_l$ describes the probability of exiting top state $i$ from bottom state $l$, and the transition probability $a^{(i)}_{l',l} := P(z^{(\mathrm{bot})}_t = l \mid z^{(\mathrm{bot})}_{t-1} = l')$ represents the transition from bottom state $l'$ to bottom state $l$ in top state $i$. These probabilities satisfy $\sum_{l=0}^{L-1} \pi^{(i)}_l = 1$ and $\sum_{l=0}^{L-1} a^{(i)}_{l',l} + e^{(i)}_{l'} = 1$ for all $l'$ and $i$.

Thus, the performance is modeled as a sequence of $T$ pairs of random variables $\{(z^{(\mathrm{top})}_t, z^{(\mathrm{bot})}_t)\}_{t=0}^{T-1}$ (Fig. 1). For example, if the pair $z_t := (z^{(\mathrm{top})}_t, z^{(\mathrm{bot})}_t)$ equals $(i, l)$, the score position at frame $t$ is at bottom state $l$ of top state $i$.

Observed audio features are described as being stochastically generated from a bottom state. Given an audio feature $y_t := [y_{t,0}, y_{t,1}, \dots, y_{t,D-1}]$ at frame $t$ as a $D$-dimensional real vector, the emission probability of state $(i, l)$ is defined as

$$ b^{(i)}_l(y_t) := P(y_t \mid z_t = (i, l)). \tag{3} $$

C. Emission Probability and Substitution Error

From here to Sec. II-E, we consider the performance HMM with $L = 1$ for simplicity, but the case of $L > 1$ can be treated similarly.

To extract pitch information from the input signal, we need a suitable feature representation. In the comparison of some audio features in [7], [22], the magnitude of a constant-Q transform (CQT) [23] with a quality factor set to one semitone yielded the best result of score following for monophonic audio input.
Furthermore, normalizing magnitudes of CQTs such that $\sum_{d=0}^{D-1} y_{t,d} = 1$ makes them insusceptible to dynamic variations. Although one may think that the normalization makes it difficult to discriminate pauses from notes, the difference in spectral shape between pauses and notes can help the discrimination: the CQT of a pitched sound has clear peaks at its fundamental frequency and harmonics, whereas the CQTs at pauses are relatively flat. We use normalized magnitudes of CQTs (normalized CQTs) as audio features.

Let $k$ be the pitch index and $K$ be the set of possible pitches. For convenience, we indicate the pitches A0 to C8 in the range of a standard piano as $k = 21$ to $k = 108$ and silence as $k = -1$, so that $K = \{21, 22, \dots, 108\} \cup \{-1\}$. We assume that normalized CQTs corresponding to pitch $k$ follow a $D$-dimensional normal distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$, denoted by $\mathcal{N}(y_t \mid \mu_k, \Sigma_k)$. The emission probability $b^{(i)}_0(y_t)$ of bottom state 0 of top state $i$ is given as

$$ b^{(i)}_0(y_t) = \sum_{k \in K} w^{(i)}_{k,0}\, \mathcal{N}(y_t \mid \mu_k, \Sigma_k). \tag{4} $$

Here $w^{(i)}_{k,0} \in [0, 1]$ is a mixture weight of pitch $k$ of bottom state 0 of top state $i$, which satisfies $\sum_{k \in K} w^{(i)}_{k,0} = 1$ for all $i$. When substitution errors are not made, $w^{(i)}_{k,0} = 0$ unless $k = p_i$, where $p_i \in K$ denotes the pitch of event $i$ ($p_i = -1$ for a rest). On the other hand, to describe a performance with substitution errors, we set small positive values of $w^{(i)}_{k,0}$ for $k \neq p_i$, since a substitution error is represented by an emission of an audio feature with an incorrect pitch.

Fig. 2. A pause between notes is described with the pause state (gray disk) which emits audio features corresponding to silence.

D. Transition Probability and Deletion and Insertion Errors

Transition probabilities in the top level $a_{j,i}$ represent the frequency of the transitions between the events. If performances do not contain insertion and deletion errors, $a_{j,i} = 0$ unless $i = j + 1$.
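As an illustration of the mixture emission in Eq. (4) of Sec. II-C, the following minimal sketch evaluates $b^{(i)}_0(y_t)$ for one event. It assumes diagonal covariance matrices, and the three-bin templates, the variances and the weight values are toy numbers chosen for the illustration; the function names are hypothetical and are not taken from the paper.

```python
import math


def log_gauss_diag(y, mean, var):
    """Log density of a normal distribution with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(y, mean, var)
    )


def emission_prob(y, weights, means, variances):
    """Eq. (4): sum over pitches k of w_k * N(y | mu_k, Sigma_k)."""
    return sum(
        w * math.exp(log_gauss_diag(y, means[k], variances[k]))
        for k, w in weights.items()
        if w > 0.0
    )


# Toy 3-bin "normalized CQT" templates (each sums to 1); k = -1 is silence.
means = {60: [0.8, 0.1, 0.1], 61: [0.1, 0.8, 0.1], -1: [1 / 3, 1 / 3, 1 / 3]}
variances = {k: [0.01, 0.01, 0.01] for k in means}

# Event whose score pitch is 60. The small weight on pitch 61 models a
# semitone substitution error (illustrative value, not the paper's).
weights_with_errors = {60: 0.99, 61: 0.01}
weights_no_errors = {60: 1.0}

y_correct = [0.8, 0.1, 0.1]  # frame that matches the score pitch
y_wrong = [0.1, 0.8, 0.1]    # frame that matches the neighbouring pitch

p_correct = emission_prob(y_correct, weights_with_errors, means, variances)
p_wrong = emission_prob(y_wrong, weights_with_errors, means, variances)
p_wrong_strict = emission_prob(y_wrong, weights_no_errors, means, variances)
```

With these toy values the wrong-pitch frame remains far less likely than the correct one, but it is many orders of magnitude more likely than under the model without substitution errors, which is the effect the nonzero weights are meant to produce.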
We can express an insertion error and a deletion error with a self-transition and a transition to the second next top state, which correspond to $a_{j,j}$ and $a_{j,j+2}$, respectively.

The self-transition probability $a^{(i)}_{0,0}$ of bottom state 0 of top state $i$ describes the expected duration $d_i$ of the corresponding event, which is computed as a product of the note value of the event and the score-notated tempo:

$$ d_i = \sum_{k=1}^{\infty} k \left(a^{(i)}_{0,0}\right)^{k-1} \left(1 - a^{(i)}_{0,0}\right) = \frac{1}{1 - a^{(i)}_{0,0}}. \tag{5} $$

If $d_i$ is shorter than a processing time interval, we put $a^{(i)}_{0,0} = 0$. This probabilistic representation of the event duration describes the temporal fluctuations of music performance.

E. Pauses between Notes

Pauses between notes can be introduced into the performance HMM by adding an extra bottom state with index 1, which we call a pause state (Fig. 2). The occurrence of the pause is expressed as a transition to the pause state, which corresponds to $a^{(i)}_{0,1}$. The duration of the extra pause is represented by the self-transition probability of the pause state $a^{(i)}_{1,1}$, which can be set similarly to Eq. (5). We put $a^{(i)}_{1,0} = 0$ and $\pi^{(i)}_1 = 0$ for all $i$. We assume that $b^{(i)}_1(y_t) = \mathcal{N}(y_t \mid \mu_{-1}, \Sigma_{-1})$.

F. Estimation of Score Positions

For the convenience of estimating score positions, we convert the performance HMM into an equivalent standard HMM. Its state corresponds to a bottom state of the performance HMM and is labeled with $(i, l)$. The standard HMM is
parameterized by emission probabilities $\tilde{b}_{(i,l)}(y_t)$, initial probabilities $\tilde{\pi}_{(i,l)}$, and transition probabilities $\tilde{a}_{(j,l'),(i,l)}$, defined by $\tilde{b}_{(i,l)}(y_t) := b^{(i)}_l(y_t)$, $\tilde{\pi}_{(i,l)} := \pi_i \pi^{(i)}_l$, and

$$ \tilde{a}_{(j,l'),(i,l)} := \begin{cases} a^{(i)}_{l',l} + e^{(i)}_{l'}\, a_{i,i}\, \pi^{(i)}_l & (i = j) \\ e^{(j)}_{l'}\, a_{j,i}\, \pi^{(i)}_l & (i \neq j). \end{cases} \tag{6} $$

Given observed normalized CQTs up to the $t$th frame $y_{0:t} = \{y_\tau\}_{\tau=0}^{t}$, the score position at frame $t$ is estimated with the standard HMM by solving

$$ \operatorname*{argmax}_{z_t} P(z_t \mid y_{0:t}) = \operatorname*{argmax}_{z_t} P(y_{0:t}, z_t), \tag{7} $$

where

$$ P(y_{0:t}, z_t) = \sum_{z_{0:t-1}} \left( \prod_{\tau=1}^{t} \tilde{b}_{z_\tau}(y_\tau)\, \tilde{a}_{z_{\tau-1}, z_\tau} \right) \tilde{b}_{z_0}(y_0)\, \tilde{\pi}_{z_0}. \tag{8} $$

Here $z_{0:t-1}$ denotes $\{z_\tau\}_{\tau=0}^{t-1}$. Eq. (7) follows from Bayes' theorem. This maximization problem can be solved efficiently with the forward algorithm. It computes the forward variable $\alpha_{t,z_t} := P(y_{0:t}, z_t)$ in a recursive manner:

$$ \alpha_{t,(i,l)} = \begin{cases} \tilde{b}_{(i,l)}(y_t) \displaystyle\sum_{j=0}^{N-1} \sum_{l'=0}^{L-1} \alpha_{t-1,(j,l')}\, \tilde{a}_{(j,l'),(i,l)} & (t \geq 1), \\ \tilde{b}_{(i,l)}(y_0)\, \tilde{\pi}_{(i,l)} & (t = 0). \end{cases} \tag{9} $$

Since $\tilde{a}_{(j,l'),(i,l)} = 0$ unless $0 \leq i - j \leq 2$, the complexity of computing $\alpha_{t,(i,l)}$ is of $O(LN)$ at each time step.

III. INCORPORATING ARBITRARY REPEATS/SKIPS AND FAST SCORE-FOLLOWING ALGORITHMS

A. Incorporating Arbitrary Repeats/Skips and Computational Complexity for Inference

So far, the top HMM is left-to-right and its states are connected only to their neighboring states. However, all top states must be connected to describe arbitrary repeats/skips, i.e. $a_{j,i} > 0$ for all $j$ and $i$. The model is a generalization of the performance models in previous studies [5], [11], [15].

Assuming $L = 1$ for simplicity and dropping the subscripts $l, l'$ from the parameters of the standard HMM and the forward variables as $\tilde{a}_{j,i} := \tilde{a}_{(j,0),(i,0)}$, $\tilde{b}_i(y_t) := \tilde{b}_{(i,0)}(y_t)$, $\tilde{\pi}_i := \tilde{\pi}_{(i,0)}$ and $\alpha_{t,i} := \alpha_{t,(i,0)}$, Eq. (9) can be rewritten as

$$ \alpha_{t,i} = \begin{cases} \tilde{b}_i(y_t) \displaystyle\sum_{j=0}^{N-1} \alpha_{t-1,j}\, \tilde{a}_{j,i} & (t \geq 1), \\ \tilde{b}_i(y_0)\, \tilde{\pi}_i & (t = 0). \end{cases} \tag{10} $$

Eq. (10) for $t \geq 1$ contains a summation over $N$ states for each $i$, and the complexity is of $O(N^2)$.
As we will show experimentally in Sec. IV-A, this complexity is too large to run in real time with scores of practical length on a modern laptop. Therefore, it is crucial to reduce the complexity. It is noteworthy that a similarly large complexity can emerge even if only specific repeats/skips are allowed (e.g. transitions between the first notes of bars in a score), since the number of such specific transitions often increases in proportion to $N$.

One may think that pruning techniques can be used to reduce the computational complexity. However, pruning is ineffective here since repeats/skips seldom occur, and it is necessary to take all transitions into account.

Computing all transitions also has a benefit in following performances without repeats/skips. When an estimation error of score position occurs, a score follower may fail to track the performance and become lost. It often happens that a score follower with a pruning technique (e.g. with a limited search window) cannot recover from being lost. By contrast, if a score follower searches all transitions, it can find the correct score position again after a while if the performer continues the performance.

B. Reduction of Computational Complexity by Factorizing Probabilities of Repeats/Skips

One method to reduce the computational complexity while computing all transitions is to introduce some constraints on the transition probabilities. In [13], reduction of the computational complexity is achieved with an assumption that the probability of score positions where performers stop before repeats/skips (stop positions) is the same regardless of where they resume performing after repeats/skips (resumption positions). We shall introduce this assumption to the performance HMM. The transition probability of a repeat/skip from event $j$ to event $i$ is then written as a product of two probabilities $s_j$ and $r_i$: $s_j$ is the probability of stopping at event $j$ before a repeat/skip, and $r_i$ is the probability of resuming a performance at event $i$ after a repeat/skip.
The transition probability of the top HMM is then written as

$$ a_{j,i} = a^{(\mathrm{nbh})}_{j,i} + s_j r_i, \tag{11} $$

where $a^{(\mathrm{nbh})}_{j,i}$ is a band matrix satisfying $a^{(\mathrm{nbh})}_{j,i} = 0$ unless $0 \leq i - j \leq 2$. The parameter $a^{(\mathrm{nbh})}_{j,i}$ characterizes transitions within neighboring states and is determined according to the normalization constraint of $a_{j,i}$, which is written as $1 = \sum_i a_{j,i} = \sum_i a^{(\mathrm{nbh})}_{j,i} + s_j \sum_i r_i$ for all $j$. Without loss of generality, we can assume $\sum_i r_i = 1$, and then we have $\sum_i a^{(\mathrm{nbh})}_{j,i} = 1 - s_j$.

Let us denote the set of neighboring states of top state $i$ by $\mathrm{nbh}(i) := \{j;\ j = 0, \dots, N-1,\ 0 \leq i - j \leq 2\}$. The transition probability of the standard HMM $\tilde{a}_{j,i}$ for $j \notin \mathrm{nbh}(i)$ is written as

$$ \tilde{a}_{j,i} = e^{(j)}_0\, s_j\, r_i\, \pi^{(i)}_0. \tag{12} $$

With Eqs. (12) and (10), we have

$$ \alpha_{t,i} = \tilde{b}_i(y_t) \left\{ \sum_{j \in \mathrm{nbh}(i)} \alpha_{t-1,j}\, \tilde{a}_{j,i} + r_i\, \pi^{(i)}_0 \left( \sum_{j=0}^{N-1} \alpha_{t-1,j}\, e^{(j)}_0 s_j - \sum_{j \in \mathrm{nbh}(i)} \alpha_{t-1,j}\, e^{(j)}_0 s_j \right) \right\}. \tag{13} $$
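A minimal sketch of the update in Eq. (13) for $L = 1$ is given below, together with a reference implementation of Eq. (10) on the same model so that the two can be compared. The function and argument names are hypothetical: `out[j]` stands for $e^{(j)}_0 s_j$, `rin[i]` for $r_i \pi^{(i)}_0$, and `nbh_trans[j][d]` for the full standard-HMM transition probability $\tilde{a}_{j,j+d}$ with $d \in \{0, 1, 2\}$. The numbers are toy values and are not normalized as a real model would be.

```python
def forward_step_fast(alpha_prev, nbh_trans, out, rin, emis):
    """One step of Eq. (13); the work per frame is linear in N."""
    n = len(alpha_prev)
    # Sum over all previous states; independent of i, so computed once.
    total = sum(alpha_prev[j] * out[j] for j in range(n))
    alpha = []
    for i in range(n):
        near = 0.0      # contribution of neighbouring states (at most 3 terms)
        near_out = 0.0  # part of `total` that belongs to those neighbours
        for d in (0, 1, 2):
            j = i - d
            if j >= 0:
                near += alpha_prev[j] * nbh_trans[j][d]
                near_out += alpha_prev[j] * out[j]
        alpha.append(emis[i] * (near + rin[i] * (total - near_out)))
    return alpha


def forward_step_reference(alpha_prev, nbh_trans, out, rin, emis):
    """Eq. (10) on the same model: explicit sum over all pairs (j, i)."""
    n = len(alpha_prev)
    alpha = []
    for i in range(n):
        acc = 0.0
        for j in range(n):
            d = i - j
            # Neighbours use the stored value; all other pairs use Eq. (12).
            a = nbh_trans[j][d] if 0 <= d <= 2 else out[j] * rin[i]
            acc += alpha_prev[j] * a
        alpha.append(emis[i] * acc)
    return alpha


# Toy model with five events.
alpha_prev = [0.10, 0.50, 0.20, 0.15, 0.05]
nbh_trans = [[0.30, 0.60, 0.09] for _ in range(5)]
out = [0.01] * 5
rin = [0.2] * 5
emis = [0.3, 0.1, 0.7, 0.2, 0.9]

fast = forward_step_fast(alpha_prev, nbh_trans, out, rin, emis)
reference = forward_step_reference(alpha_prev, nbh_trans, out, rin, emis)
```

The two functions return the same forward variables up to rounding; only the number of operations differs, because the sum over all $j$ is shared by every $i$ in the fast version.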

Since the first summation in the parentheses of the second term of Eq. (13) is independent of $i$, it is sufficient to calculate it once at each time step. This term and the rest of Eq. (13) are of $O(N)$, and hence the total computational complexity is $O(N)$. The space complexity is also reduced: the transition probability matrix in the top level is now parameterized by $4(N-1)$ parameters ($s_j$, $r_i$ and $a^{(\mathrm{nbh})}_{j,i}$), whereas it originally has $N(N-1)$ parameters.

This result can be generalized for the performance HMM with $L > 1$. The standard HMM has $LN$ states and updating $\alpha_{t,(i,l)}$ at each time step is of $O((LN)^2)$ according to Eq. (9). If we introduce the above assumption, the transition probability of the standard HMM $\tilde{a}_{(j,l'),(i,l)}$ can also be divided into a component dependent only on $i, l$ and a component dependent only on $j, l'$. Therefore, the total computational complexity is reduced to $O(LN)$ (see Appendix B for details). Importantly, this reduction method can be used regardless of the topology of the bottom HMMs, and it is compatible with the pause states and applicable to performance HMMs with more complex structures of bottom HMMs (e.g. [6], [20], [24], [25]).

A similar reduction method is valid for the Viterbi algorithm and the backward algorithm. The method can be applied to any HMM whose transition probabilities admit such a factorization and to similar dynamic programming techniques as well, and it can be useful for applications other than score following (e.g. timbre editing of music signals [26]).

C. Explicit Description of Silent Breaks at Repeats/Skips

We can achieve a similar reduction of the computational complexity by using another assumption on arbitrary repeats/skips. Performers frequently make silent breaks at repeats/skips to get ready for resuming the performance. In fact, 59 of 63 repeats/skips were accompanied by breaks longer than 500 ms in the actual performances used in Sec. IV-C1.

Fig. 3. A repeat/skip can be described with two-step transitions via the break state representing silent breaks.
Let us represent the silent breaks by introducing an additional state (the break state) as the $N$th top state. The duration of the breaks is described with the self-transition probability of the bottom state of the break state $a^{(N)}_{0,0}$, and its value is determined similarly to Eq. (5). Repeats/skips are represented as two-step transitions via the break state (Fig. 3). Stopping (resuming) a performance is expressed as transitions to (from) the break state whose probability is denoted by $s_j$ ($r_i$, respectively). We note that the top states excluding the break state are connected only to neighboring top states, and thus $\tilde{a}_{j,i} = 0$ if $j \notin \mathrm{nbh}(i)$ for all $i, j \neq N$. On the other hand, the break state is connected to all top states except itself, so we put $a_{N,N} = 0$ in the top HMM. The transition probability of the standard HMM from or to the break state is written as

$$ \tilde{a}_{j,N} = e^{(j)}_0\, s_j\, \pi^{(N)}_0 \quad (j \neq N), \tag{14} $$

$$ \tilde{a}_{N,i} = \begin{cases} e^{(N)}_0\, r_i\, \pi^{(i)}_0 & (i \neq N), \\ a^{(N)}_{0,0} & (i = N), \end{cases} \tag{15} $$

where $e^{(N)}_0\ (= 1 - a^{(N)}_{0,0})$ and $\pi^{(N)}_0\ (= 1)$ denote the exiting probability and the initial probability of state $(N, 0)$. For this model, Eq. (10) for $t \geq 1$ can be written as

$$ \alpha_{t,i} = \begin{cases} \tilde{b}_i(y_t) \left( \displaystyle\sum_{j \in \mathrm{nbh}(i)} \alpha_{t-1,j}\, \tilde{a}_{j,i} + \alpha_{t-1,N}\, \tilde{a}_{N,i} \right) & (i \neq N), \\ \tilde{b}_N(y_t) \displaystyle\sum_{j=0}^{N-1} \alpha_{t-1,j}\, \tilde{a}_{j,N} & (i = N). \end{cases} \tag{16} $$

We see that updating $\alpha_{t,i}$ involves summation of at most four terms for each $i \neq N$ and $N$ terms for $i = N$. The total complexity is thus $O(N)$ for each time step. This reduction method can also be extended to the case of $L > 1$ (see Appendix C).

It is noteworthy that the performance HMM with the break state is related to the performance HMM presented in Sec. III-B. If we assume that transitions go through the break state in no time, the two-step transition from top state $j$ to top state $i$ via the break state is reduced to the direct transition from top state $j$ to top state $i$, and its probability is written as a product of $s_j$ and $r_i$. In other words, the difference between these models is whether breaks are explicitly described.
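The break-state update of Eq. (16) can be sketched in the same style for $L = 1$. The names below are hypothetical: `to_break[j]` stands for $\tilde{a}_{j,N}$ of Eq. (14), `from_break[i]` for $\tilde{a}_{N,i}$ with $i \neq N$, and `stay` for the self-transition $\tilde{a}_{N,N} = a^{(N)}_{0,0}$ of Eq. (15). Including the `stay` term in the update of the break state is an assumption of this sketch about how the break duration enters the recursion; passing `stay = 0.0` gives the sum over $j = 0, \dots, N-1$ exactly as written in Eq. (16). All numbers are toy values.

```python
def forward_step_break(alpha_prev, nbh_trans, to_break, from_break, stay, emis):
    """One forward step with the break state stored at the last index.

    alpha_prev and emis have N + 1 entries (N events, then the break state).
    nbh_trans[j][d] is the transition probability from event j to event j + d.
    """
    n = len(alpha_prev) - 1  # number of score events
    alpha = []
    for i in range(n):
        acc = alpha_prev[n] * from_break[i]  # resuming at event i after a break
        for d in (0, 1, 2):
            j = i - d
            if j >= 0:
                acc += alpha_prev[j] * nbh_trans[j][d]  # at most three neighbours
        alpha.append(emis[i] * acc)
    # Break state: any event may stop into it; `stay` models the break duration
    # (assumption of this sketch, see the lead-in).
    acc = sum(alpha_prev[j] * to_break[j] for j in range(n))
    acc += alpha_prev[n] * stay
    alpha.append(emis[n] * acc)
    return alpha


# Toy model with four events plus the break state (index 4).
alpha_prev = [0.05, 0.60, 0.20, 0.05, 0.10]
nbh_trans = [[0.30, 0.60, 0.09] for _ in range(4)]
to_break = [0.01] * 4
from_break = [0.001] * 4   # e.g. (1 - stay) * r_i with uniform r_i = 1 / 4
stay = 0.996
emis = [0.2, 0.1, 0.6, 0.3, 0.05]

alpha = forward_step_break(alpha_prev, nbh_trans, to_break, from_break, stay, emis)
```

Each event state sums at most four terms and the break state sums over the $N$ events once, so the work per frame grows linearly with the number of events.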
Since it is difficult to quantify its effect on the performance of score following analytically, we will evaluate the effect through an experiment in Sec. IV-C2.

IV. EXPERIMENTAL EVALUATION OF THE PROPOSED SCORE-FOLLOWING ALGORITHMS

A. Processing Time

We measured processing times in order to evaluate the reduction of the computational complexity with the proposed algorithms. The processing time depends on the number of events $N$ and is virtually independent of other score content and signal content. We used synthetic scores with 10 to $10^6$ events (see footnote 1) and a random signal of two seconds in length with a sampling rate of 16 kHz as an audio input. Normalized CQTs were computed with a frame length of 128 ms and a hopsize of 20 ms. Their center frequencies ranged from 55 to 7040 Hz at a semitone interval, and the quality factor was set to 16, which approximately corresponds to one semitone. Algorithms were implemented in C++ on a computer with a 3.30 GHz CPU (Intel(R) Core(TM) i3-2120 CPU) and 8 GB memory running Debian.

Footnote 1: Practical scores contain $O(10^3)$ to $O(10^4)$ notes. For instance, there are around 2200 events in the clarinet part of the first movement of Mozart's Clarinet Quintet.

Processing times averaged over 100 frames with standard errors are shown in Fig. 4 for the algorithms proposed in Sec. III-C (break algorithm) with and without the pause states ($L = 2$ and $L = 1$) and the algorithm that calculates $\alpha_{t,i}$
according to Eq. (10) (baseline algorithm). (The results for the algorithm proposed in Sec. III-B (no-break algorithm) did not differ significantly from the results for the break algorithms.) It can be confirmed that the average processing times increased asymptotically in proportion to $N^2$ ($N$) with the baseline algorithm (the break algorithms, respectively). The result shows that the proposed algorithms significantly suppress the increase of processing times. The processing times for $N \geq 1000$ were larger than the hopsize with the baseline algorithm, so that algorithm can work in real time only with scores of up to $O(10^2)$ events, which is the size of short music pieces. By contrast, the average processing times were below the hopsize for $N \leq 10000$ ($N \leq 50000$) with the break algorithm with (without, respectively) the pause states. Therefore, the proposed algorithms with and without the pause states can work in real time with scores with up to $O(10^3)$ events and $O(10^4)$ events, respectively. Note that processing times depend on the computing power, but their relative values remain almost the same and the proposed algorithms are always effective in reducing the computational complexity.

Fig. 4. Average processing times with standard errors with respect to the number of events (horizontal axis: number of events, $10^1$ to $10^6$; vertical axis: processing time [s]). "W/ pause" and "W/o pause" represent the break algorithm with and without the pause states, respectively. "Baseline" represents a simple extension of the algorithms proposed in previous studies [5], [11], [15].

B. Score-Following Accuracy for Performances with Errors

1) Data Preparation: To evaluate the score-following accuracy for performances with errors, we conducted an experiment using the Bach10 dataset [27]. It consists of audio recordings of ten four-part chorales by J. S. Bach.
The soprano, alto, tenor and bass parts of each piece were recorded separately, performed on the violin, clarinet, saxophone and bassoon, respectively. Their durations ranged from 25 to 41 seconds. Since the performances did not contain errors, we simulated errors by randomly inserting, dropping and substituting notes in each score, which correspond to deletion, insertion and substitution errors in the performance, respectively. Their probability values were obtained from the MIDI piano performances during practice in [13]: 0.0034 for deletion errors and 0.0245 for insertion errors. For simplicity, substitution errors were restricted to three types typical in clarinet performances, namely errors of a semitone, a whole tone and a perfect 12th. The first two errors are often caused by fingering errors and misreadings of the score, and the last error is caused by overblowing on a clarinet. The probability values of the three pitch errors were 0.0145, 0.0224 and 0.0047 in the simulation, where the probability of the perfect 12th pitch error was substituted by that of the octave pitch errors obtained in [13].

2) Experimental Conditions: We conducted a preliminary experiment and set the parameters for performance errors as follows: $a_{i,i+2} = 1.0 \times 10^{-50}$ for deletion errors, $a_{i,i} = 0$ for insertion errors, and $a^{(i)}_{1,1} = 0.999$ and $a^{(i)}_{0,1} = 1.0 \times 10^{-100}$ for pauses between notes.

Although the mixture weight $w^{(i)}_{k,0}$ can in principle be learned from audio signals for each $k$ and $i$, it is difficult to obtain the weights independently because this would require an enormous amount of performance data. To reduce the number of parameters, we considered only the three most important substitution errors described in the previous section. The mixture weights $w^{(i)}_{k,0}$ for the errors were designed in proportion to their frequencies used in the simulation:

$$ w^{(i)}_{k,0} = \begin{cases} 1 - C & (k = p_i) \\ C \times 0.175 & (k = p_i \pm 1) \\ C \times 0.270 & (k = p_i \pm 2) \\ C \times 0.055 & (k = p_i \pm 19) \\ 0 & (\text{otherwise}) \end{cases} \tag{17} $$

for all $p_i \neq -1$, where $C$ is the probability of pitch errors.
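The weight design in Eq. (17) can be sketched as follows. The constants 0.175, 0.270 and 0.055 are consistent with each simulated error probability (0.0145, 0.0224, 0.0047) divided by their sum and split equally between the upward and downward direction, so the six error weights add up to $C$. The function name is hypothetical, and the handling of error pitches that fall outside the range $k = 21$ to $108$ (they are skipped and their mass stays on the notated pitch) is an assumption of this sketch, since the text does not specify it. The example uses $C = 0.01$ for readability; with a value as small as $10^{-50}$, $1 - C$ rounds to exactly 1.0 in double precision.

```python
PITCH_MIN, PITCH_MAX = 21, 108  # A0 to C8, as in Sec. II-C
SILENCE = -1

# Share of the pitch-error probability C per signed offset in semitones
# (semitone, whole tone, perfect 12th), following Eq. (17).
ERROR_SHARES = {1: 0.175, 2: 0.270, 19: 0.055}


def mixture_weights(pitch, c):
    """Return {k: w_k} for an event with notated pitch `pitch` (Eq. (17))."""
    if pitch == SILENCE:
        return {SILENCE: 1.0}  # rests emit only silence
    weights = {}
    for offset, share in ERROR_SHARES.items():
        for k in (pitch - offset, pitch + offset):
            # Assumption of this sketch: out-of-range error pitches are skipped.
            if PITCH_MIN <= k <= PITCH_MAX:
                weights[k] = c * share
    # The remaining probability mass goes to the notated pitch.
    weights[pitch] = 1.0 - sum(weights.values())
    return weights


w = mixture_weights(60, 0.01)
```

For pitch 60 all six error pitches (41, 58, 59, 61, 62, 79) are inside the range, so the notated pitch receives $1 - C$ as in Eq. (17).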
The value of $C$ was optimized in a preliminary experiment and we set $C = 1.0 \times 10^{-50}$. For $p_i = -1$, we put $w^{(i)}_{k,0} = 0$ unless $k = -1$. The probabilities of stopping and resuming a performance $s_j$, $r_i$ were set uniformly in $i$, $j$: $s_0 = s_1 = \dots = s_{N-1} = 1.0 \times 10^{-x}$ for some positive $x$ and $r_0 = r_1 = \dots = r_{N-1} = 1/N$. Since the value of $a^{(N)}_{0,0}$ did not significantly change the result in a preliminary experiment, we fixed $a^{(N)}_{0,0} = 0.996$.

The accuracy of score following generally depends on the parameters of the emission probabilities. It has been reported that learning them from audio performances improves the accuracy [10], [22], and thus we learned the parameters $\mu_k$ and $\Sigma_k$ from audio signals. The parameters can be learned for each musical instrument if the necessary data are available, which yields a detailed model for a specific instrument. Alternatively, we can use a set of data consisting of several musical instruments to form a general model that can be applied to a wider class of instruments. Such a learning method is applicable to any instrument in principle, and it can be even more effective for musical instruments with complex signals, for which physical modeling or manual spectrum-template construction is more difficult. In general, there is a tradeoff between the generalization capability and the adaptation ability. Here, we learned the parameters with performance data of several musical instruments and used them to measure the accuracy of score following. The learning data consisted of performances played by the violin and clarinet in the RWC musical instrument database [28]. To reduce overfitting, we assumed that $\Sigma_k$ is diagonal and
introduced a lower bound, or a flooring value $F$, on the diagonal elements of $\Sigma_k$. The introduction of $F$ is called the flooring method and is generally used for speech recognition (e.g. see [29]). We conducted a preliminary experiment and found the optimal value $F = 1.0 \times 10^{-4}$. The initial probabilities were set as $\pi_i = 0$ for $i \neq 0$ and $\pi_0 = 1$.

Fig. 5. Average piecewise precision rates and standard errors with respect to $s_j$ for audio performances obtained by simulating errors (horizontal axis: $s_j$, from 0.0 and $10^{-10^4}$ up to $10^{-10^0}$; vertical axis: piecewise precision rate). The break algorithm ("Break") and the no-break algorithm ("No-Break") without the pause states are compared to Antescofo [6].

We compared the proposed algorithms with Antescofo [6], which is one of the best-known score-following systems, applied to various musical pieces and used in the most demanding artistic situations. Antescofo was not developed to cope with repeats/skips in monophonic performances, and has no special treatment for repeats/skips. It had the best accuracy in the Music Information Retrieval Evaluation eXchange (MIREX 2006) [30], which is the best-known evaluation contest in this field. Since Antescofo ended score following when the last note in the score was estimated, estimated score positions were assumed to be the last note from the time when Antescofo ended score following.

The overall accuracy of score following was measured by the piecewise precision rate (PPR), defined as the piecewise average rate of onsets correctly detected within $\Delta$ ms error. The PPR has been used with $\Delta = 2000$ ms in MIREX [30], [31].

3) Results: Tab. I summarizes average PPRs and standard errors with $\Delta = 300$ ms for every musical instrument. The results for the no-break algorithm did not differ significantly from the results for the break algorithm when $s_j = 0.0$.
We found that the proposed algorithms provided accuracies for the saxophone and bassoon data, which were not contained in the learning data, similar to those for the clarinet and violin data. The PPRs obtained with the proposed algorithms were similar to those obtained with Antescofo for all data.

TABLE I. AVERAGE PIECEWISE PRECISION RATES AND STANDARD ERRORS FOR VIOLIN, CLARINET, SAXOPHONE AND BASSOON PERFORMANCES WITH ERRORS. "PROPOSED ($s_j = 0$)" DENOTES THE BREAK ALGORITHM WITH $s_j = 0$, AND "ANTESCOFO" DENOTES ANTESCOFO [6].

Musical instrument | Proposed ($s_j = 0$) | Antescofo
Violin | 0.72 ± 0.03 | 0.66 ± 0.06
Clarinet | 0.61 ± 0.05 | 0.57 ± 0.08
Saxophone | 0.63 ± 0.06 | 0.64 ± 0.06
Bassoon | 0.76 ± 0.04 | 0.79 ± 0.04

Fig. 5 illustrates average PPRs and standard errors with $\Delta = 300$ ms. As described in Sec. III-A, computing all transitions helps the score follower recover from being lost. This benefit is confirmed by the fact that the proposed algorithms with $s_j = 1.0 \times 10^{-1000}$ provided around 0.05 higher accuracy than Antescofo, which searches only local transitions. On the other hand, values of $s_j$ larger than $1.0 \times 10^{-500}$ caused frequent overdetection of repeats/skips, and the accuracy became lower than that with $s_j = 0$. A similar tendency was observed in PPR with $\Delta = 500$ and 2000 ms.

Large values of $s_j$ deteriorated the score-following accuracy of the present algorithms, as shown in Fig. 5. This is because the larger $s_j$ is, the more frequently the algorithms may misestimate insertion/deletion/substitution errors as repeats/skips. We indeed confirmed that the number of misdetected repeats/skips increased with larger $s_j$. There was a difference of around 0.1 in PPR between the algorithms when $s_j$ is large.

TABLE II. THE NUMBER OF ERRORS AND REPEATS/SKIPS IN THE USED CLARINET PERFORMANCES.

 | Pauses between notes | Deletion error | Insertion error | Substitution error | Repeats/skips
Count | 21 | 1 | 21 | 33 | 63
We found that the total number of misdetected repeats/skips by the no-break algorithm was around 1.2 times larger than that of the break algorithm for $s_j \geq 10^{-10}$. Since the break algorithm assumes that repeats/skips always accompany breaks and the simulated errors did not accompany pauses, the results suggest that the explicit description of the breaks reduced misestimations of the errors as repeats/skips.

C. Score-Following Accuracy for Performances with Errors and Repeats/Skips

1) Performance Data During Practice: We collected 16 audio recordings of clarinet performances with durations ranging from 31 to 213 s (28 min 48 s in total). We asked an amateur clarinetist to freely practice seven music pieces comprising classical and popular music pieces and nursery rhymes, partially taken from the RWC music database [28]. His performances were recorded with a vibration microphone attached to the clarinet. The performances were aligned to the notes in the scores by one of the authors. The total number of performed notes was 2672, and Tab. II lists the counts of errors and repeats/skips. Tab. III summarizes differences in score times before and after repeats/skips in the performance data, and we see that they contain repeats/skips between remote score positions. Here, only breaks and pauses between notes longer than 500 ms were counted, since it is difficult to accurately annotate offsets of performed notes and short silent breaks and pauses between notes. All transitions with $j \notin \mathrm{nbh}(i)$ were counted as repeats/skips, where $i$ and $j$ denote stop and resumption positions.

TABLE III. STATISTICS OF DIFFERENCES IN SCORE TIMES BEFORE AND AFTER REPEATS/SKIPS IN THE PERFORMANCE DATA. QU. IS AN ABBREVIATION FOR QUARTILE.

Score time | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max.
In second | -84.83 | -15.5 | -7.75 | -8.775 | -1.875 | 45.750
In event | -331 | -44 | -25 | -23.35 | -4 | 178

[Figure 6: piecewise precision rate (vertical axis) versus $s_j$ (horizontal axis: 0.0 and $10^{-10^4}$ to $10^{-10^0}$) for Break, No-Break and Antescofo.]

Fig. 6. Average piecewise precision rates with standard errors with respect to $s_j$. The algorithms are the same as in Fig. 5.

2) Results: The parameters were the same as in Sec. IV-B2. To measure how well the algorithms followed repeats/skips, we calculated a detection rate of repeats/skips and the time interval between a repeat/skip and its detection, which we call the following time. A repeat/skip was defined to be detected if there was a correctly estimated frame before the next repeat/skip or the end of the audio recording. Fig. 6 illustrates average PPRs with standard errors for $\Delta = 300$ ms. Both proposed algorithms outperformed Antescofo at all values of $s_j$, clearly showing that the proposed algorithms are effective in following performances with errors and repeats/skips. A similar tendency was observed in PPR with $\Delta = 100$, 500 and 2000 ms. We also measured the effect of adding the pause states in the proposed algorithms with $s_j = 1.0 \times 10^{-100}$, and found that it increased PPRs by 0.05 on average.

Tab. IV summarizes the detection rates of repeats/skips, and Fig. 7 illustrates averages of following times over all detected repeats/skips (average following times) and standard errors in seconds. Since the standard error for Antescofo was too large to display in the figure, only the average value is shown. Both proposed algorithms clearly outperformed Antescofo in the detection rate and the following time.
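The detection criterion and the following time defined above can be sketched as below. This is an illustrative sketch only: it assumes the follower's output has already been reduced to one boolean per frame saying whether the estimated score position was correct (how "correct" is decided is not specified here), and the names `following_times` and `detection_rate` are hypothetical.

```python
def following_times(frame_times, frame_correct, jump_times, end_time):
    """Following time for each repeat/skip, or None if it was not detected.

    A repeat/skip at time t is treated as detected if some frame in
    [t, next repeat/skip or end of recording) is correctly estimated; the
    following time is the delay from t to the first such frame.
    Assumption: jump_times is sorted and all times share one unit.
    """
    bounds = list(jump_times) + [end_time]
    results = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        first_hit = next(
            (t for t, ok in zip(frame_times, frame_correct)
             if ok and start <= t < stop),
            None,
        )
        results.append(None if first_hit is None else first_hit - start)
    return results


def detection_rate(results):
    """Share of repeats/skips that were detected at all."""
    return sum(x is not None for x in results) / len(results)


# Toy run: ten 1-second frames, repeats/skips at t = 2 and t = 6.
times = list(range(10))
correct = [True, True, False, False, True, True, False, False, False, False]
toy_results = following_times(times, correct, jump_times=[2, 6], end_time=10)
toy_rate = detection_rate(toy_results)
```

In this toy run the first repeat/skip is caught two seconds after it occurs, while the second is never caught before the recording ends, so it counts against the detection rate and contributes no following time.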
Compared to Antescofo, for example, both proposed algorithms with $s_j = 1.0 \times 10^{-100}$ detected 14 times more repeats/skips and caught up with them 20 times faster. These results show that the proposed models are effective for repeats/skips.

TABLE IV. DETECTION RATES OF REPEATS/SKIPS FOR VARYING $s_j$. THE ALGORITHMS ARE THE SAME AS IN FIG. 5.

$s_j$ | Break | No-Break | Antescofo
$1.0 \times 10^{-1}$ | 58/63 | 60/63 | -
$1.0 \times 10^{-5}$ | 59/63 | 59/63 | -
$1.0 \times 10^{-10}$ | 59/63 | 60/63 | -
$1.0 \times 10^{-50}$ | 58/63 | 59/63 | -
$1.0 \times 10^{-100}$ | 56/63 | 57/63 | -
$1.0 \times 10^{-500}$ | 55/63 | 55/63 | -
$1.0 \times 10^{-1000}$ | 56/63 | 55/63 | -
$1.0 \times 10^{-5000}$ | 43/63 | 43/63 | -
0.0 | 13/63 | 13/63 | 4/63

[Figure 7: following time in seconds (vertical axis) versus $s_j$ (horizontal axis: $10^{-10^4}$ to $10^{-10^0}$) for Break, No-Break and Antescofo.]

Fig. 7. Average following time and standard error for varying $s_j$. The algorithms are the same as in Fig. 5. For Antescofo, only the average following time is shown.

The break algorithm (the no-break algorithm) with $s_j = 1.0 \times 10^{-100}$ detected 56 (57) repeats/skips but failed to detect seven (six, respectively) repeats/skips. These failures were caused by the existence of sections and phrases similar to each other in the scores (e.g., choruses in popular music) and by considerably short performances between repeats/skips. For example, nine performances between repeats/skips were shorter than five seconds. Most of the repeats/skips accompanied silent breaks, but the break algorithm provided similar results to the no-break algorithm. This is because the top states associated with rests can play the same role as the break state, since these top states were connected to all top states.

Furthermore, we measured following times and detection rates for performances played by other musical instruments. The audio recordings in the Bach10 dataset did not contain repeats/skips, so we synthesized performances containing repeats/skips by randomly jumping between breaks in each recording with a probability of 0.1 and inserting silent breaks at repeats/skips.
The durations of the breaks were sampled uniformly from 0.5 to 30 seconds, and each synthesized performance was forced to contain at least one repeat/skip. After the synthesis, errors were simulated in the same way as in Sec. IV-B1. Tab. V summarizes detection rates of repeats/skips for every musical instrument. The proposed algorithms with $s_j = 1.0 \times 10^{-1000}$ outperformed Antescofo in the detection rate, and we found a similar tendency in the PPR and the following time, as shown in Fig. 8 (a) and (b), respectively. These results show that the proposed algorithms are also effective in following performances with errors and repeats/skips for various musical instruments.

[Figure 8: two panels plotted against $s_j$ (horizontal axis: 0.0 and $10^{-10^4}$ to $10^{-10^0}$) for Break, No-Break and Antescofo. (a) Average piecewise precision rates with standard errors. (b) Average following times with standard errors.]

Fig. 8. (a) Average piecewise precision rates and (b) average following times with respect to $s_j$ for audio performances with simulated errors and repeats/skips. The algorithms are the same as in Fig. 5, and only the average following time is shown for Antescofo in the right panel.

TABLE V. DETECTION RATES OF REPEATS/SKIPS FOR VIOLIN, CLARINET, SAXOPHONE AND BASSOON DATA WITH SIMULATED ERRORS AND REPEATS/SKIPS. THE ALGORITHMS ARE THE SAME AS IN FIG. 5, AND $s_j = 1.0 \times 10^{-1000}$ WAS USED IN BOTH PROPOSED ALGORITHMS.

Musical instrument | Break | No-Break | Antescofo
Violin | 13/13 | 13/13 | 2/13
Clarinet | 11/11 | 10/11 | 4/11
Saxophone | 11/12 | 11/12 | 2/12
Bassoon | 10/10 | 10/10 | 0/10

A demonstration video of an automatic accompaniment system using the break algorithm without the pause states is available at https://www.youtube.com/watch?v=fw6vkic4k34 on YouTube [32]. In the video, the break algorithm successfully follows the performances during practice and catches up with the performances after repeats/skips within a few seconds.

V. DISCUSSIONS

A. Improvement of the Proposed Algorithms

We now discuss possible extensions of the proposed algorithms. The stop and resumption positions are not completely random, and their distributions have certain tendencies in actual performances [13].
For example, performers frequently resume from the first beats of bars and the beginnings of phrases, which reflects performers' understanding of musical structures. These tendencies can be incorporated in $s_j$ and $r_i$ in our performance HMMs, and the accuracy and following times of the proposed algorithms would improve [13].

Another method to improve the proposed algorithms is to refine the model of the durations of performed events. For this purpose, we can assign multiple bottom states to model the duration [20], [24], [25] or explicitly introduce its probability distribution [6]. This refinement is compatible with the proposed methods to reduce the computational cost, since they can be used regardless of the topology of the bottom HMMs.

The proposed algorithms successfully followed clarinet performances despite tempo changes in the experiment and the demonstration video in Sec. IV-C. However, the accuracy may deteriorate for performances with large tempo changes. To suppress the deterioration, it would be effective to adequately change $d_i$ on the fly, referring to estimated tempos.

B. Extension to Polyphonic Music

Although we have confined ourselves to monophonic performances, let us briefly discuss the polyphonic case. We can construct a performance HMM for polyphonic scores similarly to the monophonic case. By associating top states with musical events (chords, notes and rests) in a polyphonic score, the top HMM can be used without any change, and insertions and deletions of chords, pauses between chords and repeats/skips can be incorporated in the same way. Importantly, the present methods to reduce the computational complexity can be applied to the polyphonic case since they are independent of the details of the bottom HMMs. On the other hand, we need to extend the bottom HMMs to include chords. In particular, errors may occur at every note in a chord, and there is a combinatorially large number of possible forms of errors for a large chord.
Although we could in principle prepare spectral templates for all possible forms of played chords and use a mixture distribution similarly to Eq. (4), this requires a large computational cost in estimating score positions. However, the influence of note-wise errors on spectral differences is generally less significant for a large chord, and a bold approximation of neglecting note-wise errors would work relatively well for such a case, which can serve as a practical method to avoid the large computational cost.

There are other issues for polyphonic performances. For example, notes in a chord are indicated to be performed simultaneously in the score, but they can actually be performed at different times. Also, the relative energy of notes in a chord depends on the performer. Their treatment requires additional discussions and experiments, and the extension to polyphonic performances is now under investigation.

VI. CONCLUSION

We discussed score following of monophonic music performances with errors and arbitrary repeats/skips by constructing a stochastic model of music performance. We incorporated possible errors in audio performances into the model. In order to solve the problem of large computational cost for following arbitrary repeats/skips, we presented two HMMs that describe the probability of repeats/skips with a probability of stop positions and a probability of resumption positions, and derived computationally efficient algorithms. We demonstrated real-time operation of the algorithms with scores of practical length ($O(10^3)$ to $O(10^4)$ events). Experimental evaluations using clarinet performance data showed that the algorithms outperformed Antescofo in the accuracy of score following and the ability to track repeats/skips. In addition, we briefly discussed methods to improve the proposed algorithms and extend them to polyphonic inputs.

ACKNOWLEDGEMENTS

We thank Yuu Mizuno and Kosuke Suzuki for participating in the early stage of this work, Naoya Ito for playing the clarinet, and Hirokazu Kameoka for useful discussions. This research was supported in part by JSPS Research Fellowships for Young Scientists No. 15J0992 (T. N.), and JSPS Grant-in-Aid No. 15K16054 (E. N.) and No. 26240025 (S. S.).

REFERENCES

[1] R. B. Dannenberg, "An on-line algorithm for real-time accompaniment," in Proc. Int. Computer Music Conf., pp. 193–198, 1984.
[2] B. Vercoe, "The synthetic performer in the context of live performance," in Proc. Int. Computer Music Conf., pp. 199–200, 1984.
[3] A. Arzt, G. Widmer, and S. Dixon, "Automatic page turning for musicians via real-time machine listening," in Proc. European Conf. Artificial Intelligence, pp. 241–245, 2008.
[4] N. Orio, S. Lemouton, D. Schwarz, and N. Schnell, "Score following: State of the art and new developments," in Proc. New Interfaces for Musical Expression, pp. 36–41, 2003.
[5] B. Pardo and W. Birmingham, "Modeling form for on-line following of musical performances," in Proc. AAAI, vol. 2, pp. 1018–1023, 2005.
[6] A. Cont, "A coupled duration-focused architecture for real-time music-to-score alignment," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, pp. 974–987, June 2010.
[7] C. Joder, S. Essid, and G. Richard, "A comparative study of tonal acoustic features for a symbolic level music-to-score alignment," in Proc. IEEE Workshop Applications Signal Process. Audio Acoust., pp. 409–412, 2010.
[8] Z. Duan and B. Pardo, "A state space model for online polyphonic audio-score alignment," in Proc. Int. Conf. Acoust. Speech Signal Process., pp. 197–200, 2011.
[9] T. Otsuka, K. Nakadai, T. Takahashi, T. Ogata, and H. G. Okuno, "Real-time audio-to-score alignment using particle filter for coplayer music robots," EURASIP J. Applied Signal Process., vol. 2011, no. 384651, pp. 1–13, 2011.
[10] C. Joder, S. Essid, and G. Richard, "A conditional random field framework for robust and scalable audio-to-score matching," IEEE Trans. Acoust., Speech, and Language Process., vol. 19, no. 8, pp. 2385–2397, 2011.
[11] N. Montecchio and A. Cont, "A unified approach to real time audio-to-score and audio-to-audio alignment using sequential Montecarlo inference techniques," in Proc. Int. Conf. Acoust. Speech Signal Process., pp. 193–196, 2011.
[12] T. Nakamura, E. Nakamura, and S. Sagayama, "Acoustic score following to musical performance with errors and arbitrary repeats and skips for automatic accompaniment," in Proc. Sound and Music Computing Conf., pp. 299–304, Aug. 2013.
[13] E. Nakamura, T. Nakamura, Y. Saito, N. Ono, and S. Sagayama, "Outer-product hidden Markov model and polyphonic MIDI score following," J. New Music Res., vol. 43, no. 2, pp. 183–201, 2014.
[14] D. Schwarz, N. Orio, and N. Schnell, "Robust polyphonic MIDI score following with hidden Markov models," in Proc. Int. Computer Music Conf., 2004.
[15] C. Oshima, K. Nishimoto, and M. Suzuki, "A piano duo performance support system to motivate children's practice at home," Trans. Info. Process. Soc. Japan, vol. 46, no. 1, pp. 157–170, 2005. In Japanese.
[16] E. Nakamura, Y. Saito, N. Ono, and S. Sagayama, "Merged-output hidden Markov model for score following of MIDI performance with ornaments, desynchronized voices, repeats and skips," in Proc. Joint Conf. of 40th Int. Computer Music Conf. and 11th Sound and Music Computing Conf., pp. 1185–1192, 2014.
[17] E. Nakamura, N. Ono, S. Sagayama, and K. Watanabe, "A stochastic temporal model of polyphonic MIDI performance with ornaments," in preparation. [arXiv:1404.2314].
[18] C. Fremerey, M. Müller, and M. Clausen, "Handling repeats and jumps in score-performance synchronization," in Proc. Int. Symposium Music Info. Retrieval, pp. 243–248, 2010.
[19] Z. Duan and B. Pardo, "Aligning semi-improvised music audio with its lead sheet," in Proc. Int. Symposium Music Info. Retrieval, pp. 513–518, 2011.
[20] N. Orio and F. Déchelle, "Score following using spectral analysis and hidden Markov models," in Proc. Int. Computer Music Conf., vol. 1001, pp. 1708–1710, 2001.
[21] A. Cont, "Realtime audio to score alignment for polyphonic music instruments, using sparse non-negative constraints and hierarchical HMMs," in Proc. Int. Conf. Acoust. Speech Signal Process., vol. 5, pp. 245–248, 2006.
[22] C. Joder, S. Essid, and G. Richard, "Learning optimal features for polyphonic audio-to-score alignment," IEEE Trans. Acoust., Speech, and Language Process., vol. 21, pp. 2118–2128, Oct. 2013.
[23] J. Brown and M. Puckette, "An efficient algorithm for the calculation of a constant Q transform," J. Acoust. Soc. Am., vol. 92, pp. 2698–2701, 1992.
[24] P. Cano, A. Loscos, and J. Bonada, "Score-performance matching using HMMs," in Proc. Int. Computer Music Conf., pp. 441–444, 1999.
[25] C. Raphael, "Automatic segmentation of acoustic musical signals using hidden Markov models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 4, pp. 360–370, 1999.
[26] T. Nakamura, H. Kameoka, K. Yoshii, and M. Goto, "Timbre replacement of harmonic and drum components for music audio signals," in Proc. Int. Conf. Acoust. Speech Signal Process., pp. 7520–7524, 2014.
[27] Z. Duan and B. Pardo, "Soundprism: An online system for score-informed source separation of music audio," IEEE J. Sel. Topics Signal Process., vol. 5, no. 6, pp. 1205–1215, 2011.
[28] M. Goto, "Development of the RWC Music Database," in Proc. Int. Congress Acoust., vol. 1, pp. 553–556, 2004.

[29] "HTK Speech Recognition Toolkit." http://htk.eng.cam.ac.uk/. [Online; accessed 11-February-2015].
[30] "MIREX HOME - MIREX Wiki." http://www.music-ir.org/mirex/wiki/MIREX_HOME. [Online; accessed 11-February-2015].
[31] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proc. Int. Symposium Music Info. Retrieval, 2007.
[32] S. Sagayama, T. Nakamura, E. Nakamura, Y. Saito, H. Kameoka, and N. Ono, "Automatic music accompaniment allowing errors and arbitrary repeats and jumps," in Proc. Meetings on Acoustics, vol. 21, 035003, pp. 1–11, Acoustical Society of America, 2014.

APPENDIX

A. List of Important Parameters

Important parameters of the proposed models are listed in Tab. VI.

TABLE VI. IMPORTANT PARAMETERS AND THEIR MEANINGS OF THE PROPOSED MODELS.

Mathematical notation | Meaning
$a_{i,j}$ | Transition probability of the top HMM
$\pi_i$ | Initial probability of the top HMM
$a^{(i)}_{l,l'}$ | Transition probability of the $i$-th bottom HMM
$\pi^{(i)}_{l}$ | Initial probability of the $i$-th bottom HMM
$e^{(i)}_{l}$ | Exiting probability of the $i$-th bottom HMM
$b^{(i)}_{l}(y_t)$ | Emission probability of state $(i, l)$ for observation $y_t$
$\tilde{a}_{(i,l),(j,l')}$ | Transition probability of the standard HMM obtained by flattening the two-level HMM
$\pi_{(i,l)}$ | Initial probability of the standard HMM obtained by flattening the two-level HMM
$b_{(i,l)}(y_t)$ | Emission probability of the states of the standard HMM obtained by flattening the two-level HMM
$s_j$ | The probability of stopping at event $j$ before a repeat/skip
$r_i$ | The probability of resuming a performance at event $i$ after a repeat/skip

B. Derivation of the No-Break Algorithm for $L > 1$

We now derive an efficient algorithm for computing $\alpha_{t,(i,l)}$ for the performance HMM without the break state in the case of $L > 1$.
Assuming that the transition probability of repeats/skips is described as a product of $s_j$ and $r_i$, the transition probability of the standard HMM $\tilde{a}_{(j,l'),(i,l)}$ for $j \notin \mathrm{nbh}(i)$ can be written as

$$\tilde{a}_{(j,l'),(i,l)} = e^{(j)}_{l'}\, s_j\, r_i\, \pi^{(i)}_{l}, \tag{18}$$

and Eq. (9) for $t \geq 1$ is rewritten as

$$\alpha_{t,(i,l)} = b_{(i,l)}(y_t) \Biggl( \sum_{j \in \mathrm{nbh}(i)} \sum_{l'=0}^{L-1} \alpha_{t-1,(j,l')}\, \tilde{a}_{(j,l'),(i,l)} + r_i\, \pi^{(i)}_{l} \sum_{j \notin \mathrm{nbh}(i)} \sum_{l'=0}^{L-1} \alpha_{t-1,(j,l')}\, e^{(j)}_{l'}\, s_j \Biggr). \tag{19}$$

The first summation in the parentheses of Eq. (19) is of $O(L)$. The second summation can be converted into

$$\sum_{j \notin \mathrm{nbh}(i)} \sum_{l'=0}^{L-1} \alpha_{t-1,(j,l')}\, e^{(j)}_{l'}\, s_j = \sum_{j=0}^{N-1} \sum_{l'=0}^{L-1} \alpha_{t-1,(j,l')}\, e^{(j)}_{l'}\, s_j - \sum_{j \in \mathrm{nbh}(i)} \sum_{l'=0}^{L-1} \alpha_{t-1,(j,l')}\, e^{(j)}_{l'}\, s_j. \tag{20}$$

The first summation on the right-hand side of Eq. (20) is independent of $i$, and thus it is sufficient to compute it once at each time step. Hence, the total computational complexity at each time step is of $O(LN)$.

C. Derivation of the Break Algorithm for $L > 1$

Let us consider the performance HMM with the break state and with $L$ bottom states in each top state. In the same way as in Sec. III-C, silent breaks at repeats/skips can be introduced as top state $N$ (the break state), and arbitrary repeats/skips are described with two-step transitions via the break state. Since the transition probability of the standard HMM $\tilde{a}_{(j,l'),(i,l)}$ is zero unless $j \in \mathrm{nbh}(i) \cup \{N\}$, Eq. (9) for $t \geq 1$ and $i \neq N$ can be rewritten as

$$\alpha_{t,(i,l)} = b_{(i,l)}(y_t) \Biggl( \sum_{j \in \mathrm{nbh}(i)} \sum_{l'=0}^{L-1} \alpha_{t-1,(j,l')}\, \tilde{a}_{(j,l'),(i,l)} + \sum_{l'=0}^{L-1} \alpha_{t-1,(N,l')}\, \tilde{a}_{(N,l'),(i,l)} \Biggr). \tag{21}$$

The second term in the parentheses of Eq. (21) for each $i \neq N$ is of constant computational complexity. On the other hand, Eq. (9) for $t \geq 1$ and $i = N$ is converted into

$$\alpha_{t,(N,l)} = b_{(N,l)}(y_t) \sum_{j=0}^{N-1} \sum_{l'=0}^{L-1} \alpha_{t-1,(j,l')}\, \tilde{a}_{(j,l'),(N,l)}. \tag{22}$$

This computation is of $O(LN)$, and hence the total computational complexity is of $O(LN)$ at each time step.
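To make the saving in Appendix B concrete, the following sketch compares one forward update that visits every pair of states with one that uses the factorization of Eq. (18) and the subtraction in Eq. (20). It is a sketch under stated assumptions rather than the authors' implementation: nbh(i) is taken to be a symmetric window of half-width `w`, transitions inside the neighborhood are given as a dictionary `a_near` keyed by `(j, l', i, l)`, and all function and variable names are hypothetical. The toy numbers are random and not normalized, which is enough to check the algebra.

```python
import random


def window(i, n_events, w):
    """Assumed nbh(i): events within w positions of event i."""
    return range(max(0, i - w), min(n_events, i + w + 1))


def forward_step_naive(alpha_prev, b, a_near, e, s, r, pi, w):
    """Reference update that visits every (j, l') -> (i, l) pair."""
    N, L = len(alpha_prev), len(alpha_prev[0])
    alpha = [[0.0] * L for _ in range(N)]
    for i in range(N):
        nbh = window(i, N, w)
        for l in range(L):
            total = 0.0
            for j in range(N):
                for lp in range(L):
                    if j in nbh:
                        trans = a_near[(j, lp, i, l)]
                    else:
                        # Eq. (18): factorized repeat/skip transition
                        trans = e[j][lp] * s[j] * r[i] * pi[i][l]
                    total += alpha_prev[j][lp] * trans
            alpha[i][l] = b[i][l] * total
    return alpha


def forward_step_fast(alpha_prev, b, a_near, e, s, r, pi, w):
    """Update following Eqs. (19)-(20): one global sum shared by all i."""
    N, L = len(alpha_prev), len(alpha_prev[0])
    # stop_mass[j] = sum over l' of alpha_{t-1,(j,l')} e^{(j)}_{l'} s_j
    stop_mass = [s[j] * sum(alpha_prev[j][lp] * e[j][lp] for lp in range(L))
                 for j in range(N)]
    total_stop = sum(stop_mass)  # independent of i, computed once per step
    alpha = [[0.0] * L for _ in range(N)]
    for i in range(N):
        nbh = window(i, N, w)
        far = total_stop - sum(stop_mass[j] for j in nbh)  # Eq. (20)
        for l in range(L):
            near = sum(alpha_prev[j][lp] * a_near[(j, lp, i, l)]
                       for j in nbh for lp in range(L))
            alpha[i][l] = b[i][l] * (near + r[i] * pi[i][l] * far)
    return alpha


def max_abs_difference(n_events=12, n_bottom=3, w=2, seed=0):
    """Largest gap between the two updates on random toy inputs."""
    rng = random.Random(seed)
    N, L = n_events, n_bottom

    def table():
        return [[rng.random() for _ in range(L)] for _ in range(N)]

    alpha_prev, b, e, pi = table(), table(), table(), table()
    s = [rng.random() for _ in range(N)]
    r = [1.0 / N] * N
    a_near = {(j, lp, i, l): rng.random()
              for i in range(N) for j in window(i, N, w)
              for lp in range(L) for l in range(L)}
    args = (alpha_prev, b, a_near, e, s, r, pi, w)
    naive, fast = forward_step_naive(*args), forward_step_fast(*args)
    return max(abs(naive[i][l] - fast[i][l])
               for i in range(N) for l in range(L))
```

Calling `max_abs_difference()` should return a value at the level of floating-point rounding. For a fixed window size, the work in `forward_step_fast` grows linearly with the number of events, whereas the reference update grows quadratically. With probabilities as small as those used for $s_j$ in the experiments, a practical implementation would presumably need scaling or log-domain arithmetic, which this sketch does not attempt.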

Tomohiko Nakamura received his B.E. and M.S. degrees from the University of Tokyo, Japan, in 2011 and 2013, respectively. He is currently a Ph.D. student at the University of Tokyo and a research fellow of the Japan Society for the Promotion of Science (JSPS). His research interests involve audio signal processing and statistical machine learning. He received the International Award from the Society of Instrument and Control Engineers (SICE) Annual Conference 2011, the SICE Best Paper Award (Takeda Award) in 2015, and the Yamashita SIG Research Award from the Information Processing Society of Japan (IPSJ) in 2015.

Eita Nakamura received a Ph.D. in physics from the University of Tokyo in 2012. After having been a post-doctoral researcher at the National Institute of Informatics and Meiji University, he is currently a post-doctoral researcher at the Speech and Audio Processing Group at Kyoto University. His research interests include music information processing and statistical machine learning.

Shigeki Sagayama received the B.E., M.S., and Ph.D. degrees from the University of Tokyo, Tokyo, Japan, in 1972, 1974, and 1998, respectively, all in mathematical engineering and information physics. He joined Nippon Telegraph and Telephone Public Corporation (currently, NTT) in 1974 and started his career in speech analysis, synthesis, and recognition at NTT Labs in Musashino, Japan. From 1990, he was Head of the Speech Processing Department, ATR Interpreting Telephony Laboratories, Kyoto, Japan, where he was in charge of an automatic speech translation project. In 1993, he was responsible for speech recognition, synthesis, and dialog systems at NTT Human Interface Laboratories, Yokosuka, Japan. In 1998, he became a Professor of the Graduate School of Information Science, Japan Advanced Institute of Science and Technology (JAIST), Ishikawa. In 2000, he was appointed Professor at the Graduate School of Information Science and Technology (formerly, Graduate School of Engineering), the University of Tokyo. After his retirement from the University of Tokyo, he has been a Professor at Meiji University since 2014. His major research interests include the processing and recognition of speech, music, acoustic signals, handwriting, and images. He was the leader of the anthropomorphic spoken dialog agent project (Galatea Project) from 2000 to 2003. Prof. Sagayama received the National Invention Award from the Institute of Invention of Japan in 1991, the Director General's Award for Research Achievement from the Science and Technology Agency of Japan in 1996, and other academic awards including Paper Awards from the Institute of Electronics, Information and Communications Engineers, Japan (IEICEJ) in 1996 and from the Information Processing Society of Japan (IPSJ) in 1995. He is a member of the Acoustical Society of Japan, IEICEJ, and IPSJ.