Using Natural Language Processing Techniques for Musical Parsing


RENS BOD
School of Computing, University of Leeds, Leeds LS2 9JT, UK, and Department of Computational Linguistics, University of Amsterdam, Spuistraat 134, 1012 VB Amsterdam, Holland

Abstract. We investigate whether probabilistic parsing techniques from Natural Language Processing (NLP) can be used for musical parsing. As in natural language, a piece of music can be segmented into groups or phrases which can be conveniently represented by a phrase-structure tree (Longuet-Higgins 1976; Tenney & Polansky 1980; Lerdahl & Jackendoff 1983). One of the main challenges for musical parsers is the problem of ambiguity: several different phrase structures may be compatible with a given musical sequence while a listener typically hears only one structure. In this paper we will consider three parsing techniques from the NLP literature that use a probabilistic heuristic to solve ambiguity. We present a new parser which combines two of these techniques, and which can correctly predict up to 85.9% of the phrases for a test set of 1,000 folksongs from the Essen Folksong Collection (Schaffrath 1995). To the best of our knowledge, this work presents the first parsing experiments with the Essen Folksong Collection, which we hope may be used as a baseline for other approaches. Our parser may also be used to speed up the time-consuming annotation of newly collected folksongs, thereby contributing to the creation of larger musical databases in computer-assisted musicology.

Keywords. computer-assisted musicology, natural language processing, music perception, probabilistic grammars, musical databases

1. Introduction

We investigate whether probabilistic parsing techniques from Natural Language Processing (NLP) can be used for musical parsing.
As in natural language, a listener segments a sequence of notes into groups or phrases that form a grouping structure for the whole piece (Longuet-Higgins 1976; Tenney & Polansky 1980; Lerdahl & Jackendoff 1983). For example, according to Lerdahl & Jackendoff (1983: 37) a listener hears the following grouping structure for the first few bars of melody in the Mozart G Minor Symphony, K. 550.

Figure 1. Grouping structure for the opening theme of Mozart's G Minor Symphony

Each group is represented by a slur beneath the musical notation. A slur enclosed within a slur indicates that a group is heard as part of a larger group. This hierarchical structure of melody can, without loss of generality, also be represented by a phrase structure tree, as in figure 2.

Figure 2. Tree structure for the grouping structure in figure 1

Although visually quite different, it is easy to see that the two representations in figures 1 and 2 are mathematically equivalent. Note the analogy with phrase structure trees in linguistics: a tree describes how parts of the input combine into groups or constituents and how these constituents combine into a representation for the whole input. Apart from this analogy, there is also an important difference: while the nodes in a linguistic tree structure are typically labeled with syntactic categories such as S, NP, VP etc., musical tree structures are unlabeled. This is because in language there are syntactic constraints on how words can be combined into larger constituents (e.g. in English a determiner can be combined with a noun only if it precedes that noun, which is expressed by the rule NP -> Det N), while in music there are no such restrictions: in principle any note may be combined with any other note. This makes the problem of ambiguity in music much harder than in language. Longuet-Higgins and Lee (1987) note that "Any given sequence of note values is in principle infinitely ambiguous, but this ambiguity is seldom apparent to the listener."

To give an example of this ambiguity, the first few bars of Mozart's G Minor Symphony could also be assigned the following, alternative grouping structure (among the many other possible structures):

Figure 3. Alternative grouping structure for the opening theme of Mozart's G Minor Symphony

While this alternative structure is possible in that it can be perceived, it does not correspond to the structure that is actually perceived by a human listener. There is thus an important research question as to how to select the perceived tree structure from the total, possibly infinite set of tree structures of a musical input.

In the field of natural language processing (NLP), the use of probabilistic corpus-based parsing techniques has become increasingly influential for solving ambiguity (see Charniak 1997 or Manning & Schütze 1999 for an overview). Instead of using a predefined set of rules, a probabilistic corpus-based parser learns how to parse new input by generalizing from examples of previously annotated data; in case of ambiguity, such a parser computes the most probable phrase structure for a given input. State-of-the-art probabilistic parsers, which use the Wall Street Journal corpus in the Penn Treebank (Marcus et al. 1993) as a test domain, obtain around 90% correctly predicted phrases (e.g. Collins 2000; Charniak 2000; Bod 2001a).

With the current availability of large annotated musical corpora, such as the Essen Folksong Collection (Schaffrath 1995), we may wonder whether such probabilistic corpus-based parsing techniques carry over to musical parsing. In this paper we will test the usefulness of three probabilistic parsing techniques for music: the Treebank grammar technique of Charniak (1996), the Markov grammar technique of Collins (1999), and the Data-Oriented Parsing (DOP) technique of Bod (1998).
We develop a new parser which combines two of these techniques, and which correctly predicts up to 85.9% of the phrases for a held-out test set of 1,000 folksongs from the Essen Folksong Collection (Schaffrath 1995). To the best of our knowledge, this paper contains the first parsing experiments on the Essen Folksong Collection; moreover, it also contains the first parsing experiments on a musical test set of non-trivial size.

In the following we first describe the Essen Folksong Collection, after which we test a number of probabilistic parsing models on this collection. Since no other parsing results on the Essen Folksong Collection are available, we will only informally compare our technique with other approaches that aim at solving ambiguity in music.

2. The Essen Folksong Collection

The Essen Folksong Collection provides a large sample of (mostly) European folksongs that have been collected and encoded under the supervision of Helmut Schaffrath at the University of Essen (see Schaffrath 1993, 1995; Selfridge-Field 1995). Each of the 6,251 folksongs in the Essen Folksong Collection is annotated with the Essen Associative Code (ESAC) which includes pitch and duration information, meter signatures and explicit phrase markers. The presence of phrase markers makes the Essen Folksong Collection a unique test case for musical parsers.

The pitch encodings in the Essen Folksong Collection resemble "solfege": scale degree numbers are used to replace the movable syllables "do", "re", "mi", etc. Thus 1 corresponds to "do", 2 corresponds to "re", etc. Chromatic alterations are represented by adding either a "#" or a "b" after the number. The plus ("+") and minus ("-") signs are added before the number if a note falls respectively above or below the principal octave (thus -1, 1 and +1 all refer to "do", but in different octaves). Duration is represented by adding a period or an underscore after the number. A period (".") increases duration by 50% and an underscore ("_") increases duration by 100%; more than one underscore may be added after each number. If a number has no duration indicator, its duration corresponds to the smallest value. A pause is represented by 0, possibly followed by duration indicators.
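The encoding rules above can be illustrated with a small tokenizer. This is only a sketch under stated assumptions, not the ESAC reference implementation: the `Note` class and `parse_esac` function are hypothetical names, each underscore is assumed to double the duration, and the marks are assumed to appear in the order octave sign, degree, accidental, underscores, period.

```python
import re
from dataclasses import dataclass
from fractions import Fraction

# One ESAC note: optional octave signs, a scale degree (0 = pause),
# an optional accidental, then duration marks (assumed order).
TOKEN = re.compile(r"([+-]*)([0-7])([#b]?)(_*)(\.?)")


@dataclass(frozen=True)
class Note:
    octave: int          # 0 = principal octave, +1 above, -1 below
    degree: int          # scale degree 1..7, or 0 for a pause
    accidental: str      # "", "#" or "b"
    duration: Fraction   # in multiples of the smallest note value


def parse_esac(melody: str) -> list[Note]:
    """Split an ESAC melody string (without phrase markers) into notes."""
    notes = []
    for signs, degree, acc, underscores, dot in TOKEN.findall(melody.replace(" ", "")):
        octave = signs.count("+") - signs.count("-")
        # Assumption: every underscore adds 100%, i.e. doubles the duration.
        duration = Fraction(2) ** len(underscores)
        if dot:
            duration *= Fraction(3, 2)  # a period adds 50%
        notes.append(Note(octave, int(degree), acc, duration))
    return notes
```

For instance, `parse_esac("+3b_")` yields a single flattened third degree, one octave above the principal octave, with twice the smallest duration.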
No loudness or timbre indicators are used in ESAC. Thus, the opening theme of Mozart's G Minor Symphony in figure 1 can be encoded in ESAC as follows (since the piece is in G Minor, all notes are related to G which corresponds to the number 1).

6b55_6b55_6b55_+3b_0_

Figure 4. ESAC encoding for the opening theme of Mozart's G Minor Symphony

ESAC uses hard returns to indicate a phrase boundary. To make the Essen annotations readable for our probabilistic parsers, we automatically convert ESAC's phrase boundary indications into bracket representations, where "(" indicates the start of a phrase and ")" the end of a phrase. The phrase structures in figures 1 and 2 would thus correspond to the following bracket representation.

( ( (6b55_) (6b55_) ) (6b55_+3b_0_) )

Figure 5. Bracket representation for the phrase structures in figures 1 and 2 of the opening theme of Mozart's G Minor Symphony

The following figure gives an example of an encoding of an actual folksong from the Essen Folksong Collection ("Schlaf Kindlein feste") converted to our bracket representation:

(3_221_-5)( _-5)( )( _)(3_221_-5_)

Figure 6. Bracket representation for folksong K0029, "Schlaf Kindlein feste"

It is important to note that the annotations in the Essen Folksong Collection do not contain hierarchical or nested structures. Differently from the examples in figures 1, 2 and 3, the Essen Folksong annotations represent the basic phrases (or "segmentations") only and neglect any phrase-internal or phrase-external structure (such as motives, periods and sections). Although this results in rather simple annotations, we will see that the Essen Collection is still a very tough test case for our parsers. The following example shows that many phrases in the Essen Folksong Collection could have been further analyzed in terms of subphrases (e.g. the fifth phrase into three very similar subphrases).

(3 2 1_1_-5_)(-5_3_3_2_2_1_1_-5_)(-5_1_2_3_1_4 2_)(1_-7_1_2_-5_3 1_)
(3_1-5_3_1_1_-5_3_1-5_)(-5_1_2_3_1_4_3_223_1 1_0_)

Figure 7. Bracket representation for folksong K0690, "Ruru Rinneken"

And a more extreme case of the shallowness of the Essen Folksong Collection is provided by folksong Z0147 ("Besenbinders Tochter und kachelmachers Sohn"):

(5_4#_5_3_1 1_3_2_1#_2_-7_-5.)(3_5_4#_5_3_1 1_3_ 221#_2_-7_-5.)
(-5_-5_-5_-5-5-5_4 4_3_2_2_3_4_5 +1_)(3_5_4#_5_3_1_-7_1_332_1#_2_3_1 0 )
(-5_-5_-5_-5_444_4_3_2_2_3_4_5 +1_)(3_5_4#_5_3_1_1_1_3_2_1#_2_3_1.)
(3_5_4#_5_3_1_1_1_3_2_1#_2_3_1 1_)(3_5_4#_5_3_1_-7_1_3_2_1#_2_3_1 1_0_)
(-5_-5_-5_-5_444_4_3_2_2_3_4_5 +1_)(3_5_4#_5_3_1_1_1_3_2_1#_2_3_1 )

Figure 8. Bracket representation for folksong Z0147, "Besenbinders Tochter und kachelmachers Sohn"

We believe that every phrase in this folksong could have been further analyzed into subphrases. Yet, the annotation in figure 8 is not wrong; it just represents the most basic phrase structure of the piece only. We want to emphasize that for our experiments in section 3 we did not add (or modify) any structure in the Essen annotations. As we will see, despite (or perhaps due to) its shallow annotations, the Essen Folksong Collection is quite an interesting test case.

This brings us to the problem of evaluation. To evaluate our probabilistic parsers for music, we employed the blind testing method which has been widely used in evaluating natural language parsers (see Manning & Schütze 1999). This method dictates that a collection of annotated data is randomly divided into a training set and a test set, where the annotations in the training set are used to "train" the parser, while the unannotated strings in the test set are used as input to test the parser. The degree to which the predicted structures for the test set strings match the correct structures in the test set is a measure of the accuracy of the parser. For our experiments in section 3, we randomly divided the Essen Folksong Collection into a training set of 5,251 folksongs and a test set of 1,000 folksongs.

There is an important question as to what kind of evaluation measure is most appropriate to compare the phrase structures proposed by the parser with the correct phrase structures in the test set. A widely used evaluation scheme in natural language parsing is the PARSEVAL scheme, which is based on the notions of precision and recall (see Black et al. 1991).
PARSEVAL compares a proposed parse P with the corresponding test set parse T as follows:

$$\text{Precision} = \frac{\#\ \text{correct phrases in } P}{\#\ \text{phrases in } P} \qquad \text{Recall} = \frac{\#\ \text{correct phrases in } P}{\#\ \text{phrases in } T}$$

A phrase is correct if both the start and the end of the phrase are correctly predicted. Note that these measures "punish" a parser which assigns too many phrases to a folksong: for example, an extremely overgenerating parser which assigns phrases to any combination of notes would trivially include all correct phrases, resulting in an excellent recall, but its precision would be very low. On the other hand, a very conservative parser which predicts very few, though correct, phrases will receive a high precision, but its recall will be low. A good parser will thus need to obtain both a high precision and a high recall. (It probably goes without saying that for computing the precision and recall for all test set strings, one needs to divide the total number of correctly predicted phrases in all proposed parses P by the total number of phrases in all parses P and all parses T, respectively.) The precision and recall scores are often combined into a single measure of performance, known as the F-score (see Manning & Schütze 1999):

$$\text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

We will use these three measures of Precision, Recall and F-score to quantitatively evaluate our probabilistic parsing models for music.

As a final pre-processing step, we (automatically) added to each phrase in the folksong the label "P" and to each whole song the label "S", so as to obtain conventional parse trees. Thus the annotation in figure 6 becomes:

S( P(3_221_-5) P( _-5) P( ) P( _) P(3_221_-5_) )

Figure 9. Labeled-bracketing annotation for the structure in figure 6

Note that this labeled-bracketing annotation is equivalent to the following visual tree representation.

[Tree diagram: a root node S with five daughter nodes P, one for each phrase in figure 9]

Figure 10. Visual tree structure for the labeled-bracketing annotation in figure 9
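The evaluation measures and the labeling step described above can be sketched in a few lines. This is an illustrative sketch rather than the code used for the experiments: the function names `phrase_spans`, `parseval` and `label_brackets` are hypothetical, phrase boundaries are measured as symbol offsets in the unbracketed string, and the toy strings use letters in place of real ESAC notes.

```python
import re


def phrase_spans(bracketed):
    """Return the set of (start, end) symbol offsets of every bracketed phrase."""
    spans, open_starts, pos = set(), [], 0
    for ch in bracketed:
        if ch == "(":
            open_starts.append(pos)
        elif ch == ")":
            spans.add((open_starts.pop(), pos))
        elif not ch.isspace():
            pos += 1  # offsets count symbols of the unbracketed string
    return spans


def parseval(proposed, gold):
    """Corpus-level precision, recall and F-score over parallel lists of parses."""
    correct = n_proposed = n_gold = 0
    for p, t in zip(proposed, gold):
        spans_p, spans_t = phrase_spans(p), phrase_spans(t)
        correct += len(spans_p & spans_t)  # both start and end must match
        n_proposed += len(spans_p)
        n_gold += len(spans_t)
    precision = correct / n_proposed if n_proposed else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def label_brackets(bracketed):
    """Turn a flat bracketing such as '(abc)(de)' into 'S( P(abc) P(de) )'."""
    phrases = re.findall(r"\(([^()]*)\)", bracketed)
    return "S( " + " ".join(f"P({phrase})" for phrase in phrases) + " )"
```

With a proposed parse `(abc)(defg)` and a test set parse `(abc)(de)(fg)`, only the first phrase matches on both boundaries, so the sketch reports a precision of 1/2 and a recall of 1/3.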

The advantage of labeled-bracketing annotations is that we can now directly apply existing probabilistic parsing models to the Essen Folksong Collection.

3. Parsing the Essen Folksong Collection

In this section, we test three probabilistic parsing models from the literature on the Essen Folksong Collection: the Treebank grammar technique of Charniak (1996), the Markov grammar technique of Seneff (1992) and Collins (1999), and the Data-Oriented Parsing (DOP) technique of Bod (1993, 1998). Unless stated differently, we used the same random split of the Essen Folksong Collection into a training set of 5,251 folksongs and a test set of 1,000 folksongs.

3.1 The Treebank Grammar Technique

The Treebank grammar technique is an extremely simple learning technique: it reads all context-free rewrite rules from the training set structures, and assigns each rule a probability proportional to its frequency in the training set. For example, the following context-free rules can be extracted from the structure in figure 9:

S -> PPPPP
P -> 3_221_-5
P -> _-5
P ->
P -> _
P -> 3_221_-5_

Next, each rewrite rule is assigned a probability by dividing the number of occurrences of a particular rule in the training set by the total number of occurrences of rules that expand the same nonterminal as the particular rule. For instance, if we take the folksong in figure 9 as our only training data, then the probability of the rule P -> 3_221_-5 is equal to 1/5 since this rule occurs once among a total of 5 rules that expand the nonterminal P.

A Treebank grammar extracted in this way from the training set corresponds to a so-called Probabilistic Context-Free Grammar or PCFG (Booth 1969). A crucial assumption underlying PCFGs is that the context-free rules are statistically independent. Thus, given the probabilities of the individual rules, we can calculate the probability of a parse tree by taking the product of the probabilities of each rule used therein.
PCFGs have been extensively studied in the literature (cf. Wetherell 1980; Charniak 1993), and the efficient parsing algorithms that exist for Context-Free Grammars carry over to PCFGs (see Charniak 1993 or Manning & Schütze 1999 for the relevant algorithms).
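The rule extraction and relative-frequency estimation of section 3.1 can be sketched as follows. This is a minimal illustration, not the parser used in the experiments: the function names are hypothetical, a song is assumed to be given as a list of its phrase strings, smoothing is left out, and the five phrases in the usage note are made up for the example.

```python
from collections import Counter
from math import prod


def extract_rules(song_phrases):
    """Read off one S-rule and one P-rule per phrase from a two-level tree."""
    rules = [("S", "P" * len(song_phrases))]
    rules += [("P", phrase) for phrase in song_phrases]
    return rules


def train_treebank_grammar(corpus):
    """Relative-frequency estimate: count(rule) / count(rules with the same left-hand side)."""
    counts = Counter(rule for song in corpus for rule in extract_rules(song))
    lhs_totals = Counter()
    for (lhs, _), count in counts.items():
        lhs_totals[lhs] += count
    return {rule: count / lhs_totals[rule[0]] for rule, count in counts.items()}


def parse_probability(grammar, song_phrases):
    """Product of rule probabilities; an unseen rule makes the whole parse impossible."""
    return prod(grammar.get(rule, 0.0) for rule in extract_rules(song_phrases))
```

Training on a single hypothetical song `["3_221_-5", "5_5_3", "1232", "1232_", "3_221_-5_"]` gives the rule `P -> 3_221_-5` a probability of 1/5, mirroring the example in the text, while any segmentation that needs an unseen P-rule receives probability zero.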

Any probabilistic grammar extracted from a training set faces the problem of data-sparseness: many of the rules in the training set are so infrequent that their observed probabilities are very bad estimates of their true probabilities. A widely used method to cope with this problem is the Good-Turing method (Good 1953). In general, Good-Turing estimates the expected population frequency $f^*$ of a type by adjusting its observed sample frequency $f$. In order to estimate $f^*$, Good-Turing uses an additional notion, $n_f$, which is defined as the number of types which occur $f$ times in an observed sample. Thus, $n_f$ can be understood as the frequency of frequency $f$. The Good-Turing estimator uses this extra information for computing the adjusted frequency $f^*$ as

$$f^* = (f+1)\,\frac{n_{f+1}}{n_f}$$

We thus compute the probabilities of our context-free rules in the Treebank grammar from their adjusted frequencies rather than from their raw observed frequencies. For an instructive paper on Good-Turing, together with a proof of the formula, see Church & Gale (1991).

The Treebank grammar that was obtained in this way from the 5,251 training folksongs was used to parse the 1,000 folksongs in the test set. We computed for each test folksong the most probable parse using a standard best-first parsing algorithm based on Viterbi optimization (see Charniak 1993; Manning & Schütze 1999). Although we may already foresee that a Treebank grammar is doomed to misparse folksongs if it does not find the correct rule in the training set, it will serve as the basis for our more sophisticated parsing techniques in the following sections. Using the evaluation measures given in section 2, our Treebank grammar obtained a precision of 68.7%, a recall of 3.4%, and an F-score of 6.5%.
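As an aside on the Good-Turing adjustment used for the rule probabilities above, the formula can be sketched directly from a table of counts. The function name `good_turing_adjusted` is hypothetical, and the fallback to the raw count when no type occurs f+1 times is an assumption made here for illustration; the text does not say how that case was handled.

```python
from collections import Counter


def good_turing_adjusted(counts):
    """Map each type to its adjusted frequency f* = (f + 1) * n_{f+1} / n_f."""
    freq_of_freq = Counter(counts.values())  # n_f: number of types seen f times
    adjusted = {}
    for item, f in counts.items():
        n_f, n_f_plus_1 = freq_of_freq[f], freq_of_freq[f + 1]
        if n_f_plus_1:
            adjusted[item] = (f + 1) * n_f_plus_1 / n_f
        else:
            # Assumption: keep the raw count when n_{f+1} is zero.
            adjusted[item] = float(f)
    return adjusted
```

With three rules seen once, two seen twice and one seen three times, a rule seen once is adjusted to 2 * 2/3 and a rule seen twice to 3 * 1/2.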
Although the Treebank grammar's precision score may seem reasonable, the recall score is extremely low, which indicates that the Treebank grammar technique is a very conservative learner: it predicts very few phrases from the total number of phrases in the Essen Folksong Collection, resulting in a very low F-score. As noted, one of the problems with the Treebank grammar technique is that it only learns those context-free rules that literally occur in the training set, which is evidently not a very robust technique for musical parsing (while it has been shown to perform quite well in natural language parsing -- see Charniak 1996). We will see, however, that the results improve significantly if we slightly loosen the way of extracting rules from the training set.

3.2 The Markov Grammar Technique

A technique which overcomes the conservatism of Treebank grammars is the Markov grammar technique (Seneff 1992; Collins 1999). While a Treebank grammar can only assign probabilities to context-free rules that have been seen in the training set, a Markov grammar can in principle assign a probability to any possible context-free rule, thus resulting in a more robust model. This is accomplished by decomposing a rule and its probability by a Markov process (see Collins 1999: 44-48). For example, a third-order Markov process estimates the probability p of a rule P -> 12345 by:

$$p(P \to 12345) = p(1)\; p(2 \mid 1)\; p(3 \mid 1, 2)\; p(4 \mid 1, 2, 3)\; p(5 \mid 2, 3, 4)\; p(\text{end} \mid 3, 4, 5)$$

The conditional probability $p(\text{end} \mid 3, 4, 5)$ encodes the probability that a rule ends after the notes 3, 4, 5. Thus even if the rule P -> 12345 does not literally occur in the training set, we can still estimate its probability by using a Markov history of three notes. The extension to larger Markov histories follows from obvious generalization of the above example.

However, a Markov grammar also suffers from data-sparseness: we may get low counts, including zero counts, for many Markov histories. Zero counts are especially problematic: if one of the conditional probabilities in the formula above has a zero occurrence in the training set, then the whole rule is assigned a zero probability. A widely used technique to solve the data-sparseness problem in Markov models is the linear interpolation technique (see Manning & Schütze 1999). This technique smooths a Markov history by taking into account its shorter histories. Let $n_1$, $n_2$ and $n_3$ denote three notes, then the conditional probability $p(n_1 \mid n_2, n_3)$ is smoothed ("interpolated") as

$$p(n_1 \mid n_2, n_3) = \lambda_1\, p(n_1) + \lambda_2\, p(n_1 \mid n_2) + \lambda_3\, p(n_1 \mid n_2, n_3)$$

where $0 \le \lambda_i \le 1$ and $\lambda_1 + \lambda_2 + \lambda_3 = 1$. These λ-weights may be set by hand, but in general one wants to find the combination of weights $\lambda_i$ which works best. A simple algorithm that finds the optimal weights is Powell's algorithm (see Press et al. 1988), which is also discussed in Manning & Schütze (1999: 218).
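The Markov decomposition with linear interpolation can be sketched as below. This is a simplified illustration under several assumptions, not the model used in the experiments: the class name `MarkovRuleModel` is hypothetical, the λ-weights are simply uniform instead of being optimized with Powell's algorithm, no Good-Turing adjustment is applied, and at the start of a phrase, where fewer notes are available than the order requires, the longest available history is reused so that the weights still sum to one.

```python
from collections import Counter

END = "#end"  # sentinel marking the end of a rule


class MarkovRuleModel:
    def __init__(self, phrases, order=3, lambdas=None):
        self.order = order
        # lambdas[k] weights the estimate that conditions on a history of k notes.
        self.lambdas = lambdas or [1 / (order + 1)] * (order + 1)
        self.ngram = Counter()    # (history, next symbol) -> count
        self.context = Counter()  # history -> count
        for phrase in phrases:
            symbols = list(phrase) + [END]
            for i, sym in enumerate(symbols):
                for k in range(min(order, i) + 1):
                    hist = tuple(symbols[i - k:i])
                    self.ngram[hist, sym] += 1
                    self.context[hist] += 1

    def cond_prob(self, sym, history):
        """Interpolated p(sym | history) over histories of length 0..order."""
        history = tuple(history)[-self.order:] if self.order else ()
        p = 0.0
        for k, lam in enumerate(self.lambdas):
            k = min(k, len(history))
            hist = history[len(history) - k:]
            total = self.context[hist]
            if total:
                p += lam * self.ngram[hist, sym] / total
        return p

    def rule_prob(self, phrase):
        """Probability of the rule P -> phrase, including the end event."""
        symbols = list(phrase) + [END]
        p = 1.0
        for i, sym in enumerate(symbols):
            p *= self.cond_prob(sym, symbols[:i])
        return p
```

Trained on the toy phrases `12345`, `1234` and `12355`, such a model assigns a non-zero probability to the unseen rule `P -> 1235`, which a Treebank grammar would rule out.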
We used Powell's algorithm to assign weights to the lambdas in the linear interpolation technique, which in turn was used to estimate the conditional probabilities in the Markov grammar technique. Furthermore, each of the probabilities $p(n_1)$, $p(n_1 \mid n_2)$ and $p(n_1 \mid n_2, n_3)$ was not directly estimated from its observed relative frequency in the training set, but was adjusted by the Good-Turing method, just as with Treebank grammars (section 3.1). Note that the extension to larger Markov histories follows from obvious generalization of the formulas above. The probability of a parse tree of a musical piece is computed by the product of the probabilities of the rules that partake in the parse tree, just as with Treebank grammars.

For our experiments, we used a Markov grammar with a history of four notes. This grammar obtained a precision of 63.1%, a recall of 80.2%, and an F-score of 70.6%. These results are to some extent complementary to the Treebank grammar: although the precision is somewhat lower, the recall is (much) higher than for the Treebank grammar. Thus, while the Treebank grammar predicts too few phrases, the Markov grammar predicts (a bit) too many phrases. The combined F-score of 70.6% shows an immense improvement over the Treebank grammar technique. Experiments with higher or lower order Markov models diminished our results.

3.3 Extending the Markov Grammar Technique with the DOP Technique

Although the Markov grammar technique obtained considerably better scores than the Treebank grammar technique, it does not take into account any global context in computing the probability of a parse tree. Knowledge of global context, such as the number of phrases that occur in a folksong, is likely to be important for predicting the correct segmentations for new folksongs. In order to include global context, we conditioned over the S-rule higher in the structure in computing the probability of a P-rule. This approach corresponds to the Data-Oriented Parsing (DOP) technique (Bod 1993, 1998) which can condition over any higher or lower rule in a tree, and which has recently been integrated with the Markov grammar technique (Sima'an 2000). In the original DOP technique, any fragment seen in the training set, regardless of size, is used as a productive unit. But in the Essen Folksong Collection we have only two levels of constituent structure in each tree, which results in a much simpler probabilistic model. As an example take again the rule P -> 12345 and a higher S-rule such as S -> PPPP; a DOP-Markov model based on a history of three notes computes the (conditional) probability of this rule as:

$$p(P \to 12345 \mid S \to PPPP) = p(1 \mid S \to PPPP)\; p(2 \mid S \to PPPP, 1)\; p(3 \mid S \to PPPP, 1, 2)\; p(4 \mid S \to PPPP, 1, 2, 3)\; p(5 \mid S \to PPPP, 2, 3, 4)\; p(\text{end} \mid S \to PPPP, 3, 4, 5)$$

The extension to larger histories follows from obvious generalization of the above example.
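One way to write that generalization explicitly is given below. The notation is an assumption introduced here for illustration and does not appear in the original: $R$ stands for the higher S-rule, $m$ for the number of notes in the phrase, $h$ for the length of the Markov history, and $n_{m+1}$ for the end event. For $i = 1$ the note history is empty, so the first factor conditions on $R$ alone.

```latex
p(P \to n_1 \dots n_m \mid R)
  \;=\; \prod_{i=1}^{m+1}
        p\bigl(n_i \mid R,\; n_{\max(1,\,i-h)}, \dots, n_{i-1}\bigr),
  \qquad n_{m+1} = \text{end}
```

Setting $h = 3$, $m = 5$ and $R = (S \to PPPP)$ recovers the six factors of the example above.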
For our experiments, we used a history of four notes, extended with the same smoothing techniques as in section 3.2 (i.e. linear interpolation combined with Good-Turing). The most probable parse of a folksong is again computed by maximizing the product of the rule probabilities that generate the folksong. Using the same training/test set division as before, this DOP-Markov parser obtained a precision of 76.6%, a recall of 85.9%, and an F-score of 81.0%. The F-score is an improvement of 10.4 percentage points over the Markov parser. Note that the DOP-Markov parser is relatively well-balanced: it is neither terribly conservative nor does it predict too many redundant phrases -- keeping in mind the idiosyncrasy of the Essen Folksong annotations. While there is no reason to expect a near to 100% accuracy for the shallowly annotated Essen Folksong Collection, our results show the importance of including global context in computing the probability of a parse.

We also checked the statistical significance of our results, by testing on 9 additional random splits of the Essen Folksong Collection (into training sets of 5,251 folksongs and test sets of 1,000 folksongs). On these splits, the DOP-Markov parser obtained an average F-score of 80.7% with a standard deviation of 1.9%, while the Markov parser obtained an average F-score of 70.8% with a standard deviation of 2.2%. These differences were statistically significant according to paired t-testing.

Finally, we were interested in testing the impact of the training size on the F-score. In the following experiments we started with an initial training set of only 500 folksongs (randomly chosen from the full training set of 5,251 folksongs). We then increased the size of this initial training set by 500 folksongs each time (randomly chosen from the full training set). The test set was kept constant at 1,000 folksongs. The results are shown in table 1.

[Table: Training (number of folksongs) | F-score (%)]

Table 1. F-score as a function of training set size

The table shows that the F-score rapidly increases when the size of the training set is enlarged from 500 to 2,000 folksongs. The accuracy continues to increase at a lower rate if the training set is further enlarged. We may thus expect that the accuracy of our parser further increases if we have access to larger musical corpora. This is important if we want to use our parser for the semi-automatic annotation of musical databases. Starting with an initial, relatively small set of hand-annotated pieces, our parser can use these annotations as its training set on the basis of which the annotations for a new set of musical pieces can be predicted. The predicted annotations will need to be corrected by hand, but once we have added these corrected annotations to the training set, our parser will more accurately predict the annotations for fresh folksongs. Table 1 suggests that the amount of human correction decreases if more training data becomes available. We thus expect that our parser can be used to speed up the time-consuming annotation of musical pieces, thereby contributing to the creation of larger databases in computer-assisted musicology.

4. Other approaches to musical parsing

There exists an extensive literature in the field of computational models of music analysis (see Cambouropoulos 1998, or Cambouropoulos et al. for an overview). Most if not all approaches to musical parsing are non-probabilistic and are based on the assumption that the perceived phrase structure of a musical piece can be predicted on the basis of a combination of low-level phenomena, such as the Gestalt phenomena of proximity and similarity, and higher-level phenomena, such as melodic parallelism and internal harmony. For example, Tenney & Polansky (1980), Lerdahl & Jackendoff (1983), Handel (1989) and Cambouropoulos (1996, 1997) use the Gestalt rules of Wertheimer (1923) to predict the low-level grouping structure of a piece: phrase boundaries preferably fall on larger time intervals, larger pitch intervals, etc. While most models also incorporate higher-level phenomena, such as melodic parallelism and harmony, these phenomena often remain unformalized. For example, Lerdahl & Jackendoff (1983) do not provide any systematic description of higher-level musical parallelism, and Narmour's Implication-Realization model (Narmour 1990, 1992) relies on factors such as meter, harmony and similarity which are not fully described by the model. As a result, these models have not been evaluated against test sets of non-trivial size, such as the Essen Folksong Collection. Only very few, hand-selected passages are typically used to evaluate these models, which calls the objectivity of the results into question.
More important, perhaps, is the fact that the Gestalt principles, which were originally proposed for visual perception (Wertheimer 1923), do not straightforwardly carry over to music perception. Elsewhere (Bod 2001b), we have shown that more than 15% of the phrase boundaries in the Essen Folksong Collection fall before or after large pitch/time intervals (as in the folksong of figure 7), rather than on such intervals, and that phrase boundaries even appear between identical notes. This goes against the predictions of any Gestalt-based parser, which assigns phrase boundaries exactly on large intervals rather than before or after them. Moreover, we have shown in Bod (2001b) that higher-level phenomena, such as melodic parallelism and internal harmony, are not of any help for predicting the correct phrase boundaries for these 15% "exceptional" phrases. On the contrary, for almost all these phrases (98.7%), melodic parallelism and internal harmony reinforced the incorrect predictions made by the Gestalt principles. It is noteworthy that our DOP-Markov parser, on the other hand, performed equally well on both "exceptional" phrases and "normal" phrases (where boundaries do fall on large pitch/time intervals). While our parser is still far from perfect, we believe that a probabilistic, corpus-based approach is better suited to musical parsing as it considers counts of any note sequence that has been observed with a certain structure, thereby taking into account the entire continuum between "exceptional" and "normal" phrases, rather than trying to capture this gradiency by a few formal rules. We fully admit that a fair comparison between our parser and a Gestalt-based/parallelism-based parser should await further experimental evaluation, but we hope to have made clear that musical parsing models should be tested on large corpora of musical annotations such as the Essen Folksong Collection (otherwise "exceptional" phrases may easily go unnoticed).

If we wish to propose a corpus-based approach to musical parsing as a serious alternative to a Gestalt-based approach, we should address the question of how any structure can be acquired if we do not have any structured pieces in our corpus to start with. With an already analyzed corpus, we can at best simulate adult music perception -- as with an analyzed corpus of natural language (see Bod 1998). We conjecture that the acquisition of a structured corpus may be the result of a bootstrapping process where the discovery of recurrent patterns and distributional regularities plays an important role. As soon as a sequence of notes appears more than once, it may be hypothesized as a group, and may be used as a productive unit to analyze new pieces. The frequency with which a pattern occurs is used to decide between conflicting groups. Much research in unsupervised language learning is concerned with bootstrapping syntactic structure on the basis of pattern similarity and statistics from large (unannotated) language corpora (e.g. Finch & Chater 1994; Brent and Cartwright 1996; van Zaanen 2000).
One of our future goals is to investigate whether such unsupervised learning techniques carry over to bootstrapping musical structure, and whether the learned structure corresponds to the structure as perceived by human listeners. On the other hand, there is already a considerable amount of work on unsupervised musical pattern induction (e.g. Cope 1990; Crawford et al. 1998; Rolland & Ganascia 2000). We hope to assess these models, along with unsupervised models of natural language learning, for the task of bootstrapping structure in a large musical corpus. Once an initial corpus of musical patterns has been bootstrapped, these patterns can be used by our probabilistic models to more efficiently parse new pieces. Only for completely new sequences of notes that have never appeared before do unsupervised methods still need to be invoked. The exact interplay between unsupervised and supervised (or memory-based) aspects of musical parsing awaits further investigation.

5. Conclusion

We have shown that probabilistic parsing models from Natural Language Processing can be successfully applied to musical parsing. We have tested three models that parse musical pieces by combining fragments from structures of previously encountered pieces. In cases of ambiguity, these models compute the analysis that can be considered the most probable one on the basis of the occurrence frequencies of the fragments. We developed a new parser which combines two of these techniques (i.e. the Markov grammar technique and the DOP technique), and which can correctly predict up to 85.9% of the phrases for a test set of 1,000 folksongs from the Essen Folksong Collection. We hope that our results may serve as a baseline for other computational models of music analysis. Our parser may also be used to speed up the time-consuming annotation of newly collected folksongs, thereby contributing to the creation of larger musical databases in computer-assisted musicology.

References

E. Black, S. Abney, D. Flickinger, C. Gdaniec, R. Grishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini and T. Strzalkowski, A Procedure for Quantitatively Comparing the Syntactic Coverage of English, Proceedings DARPA Speech and Natural Language Workshop, Pacific Grove, Morgan Kaufmann.
R. Bod, Using an Annotated Language Corpus as a Virtual Stochastic Grammar, Proceedings AAAI'93, Morgan Kaufmann, Menlo Park.
R. Bod, Beyond Grammar: An Experience-Based Theory of Language, Stanford, CSLI Publications (distributed by Cambridge University Press).
R. Bod, 2001a. What is the Minimal Set of Fragments that Achieves Maximal Parse Accuracy? Proceedings ACL'2001, Toulouse, France.
R. Bod, 2001b. Evidence against the Gestalt Principles in Music, Proceedings International Computer Music Conference 2001 (ICMC'2001), Havana, Cuba (to appear in September 2001).
T. Booth, Probabilistic Representation of Formal Languages, Tenth Annual IEEE Symposium on Switching and Automata Theory.
M. Brent and T. Cartwright, Distributional Regularity and Phonotactic Constraints are Useful for Segmentation, Cognition 61.
E. Cambouropoulos, A Formal Theory for the Discovery of Local Boundaries in a Melodic Surface, Proceedings of the Troisièmes Journées d'informatique Musicale (JIM-96), Caen, France.
E. Cambouropoulos, Musical Rhythm: A Formal Model for Determining Local Boundaries, Accents and Meter in a Melodic Surface, in M. Leman (ed.), Music, Gestalt and Computing - Studies in Systematic and Cognitive Musicology, Berlin, Springer-Verlag.
E. Cambouropoulos, Towards a General Computational Theory of Musical Structure, Ph.D. thesis, University of Edinburgh, UK.

E. Cambouropoulos, T. Crawford and C. Iliopoulos, Pattern Processing in Melodic Sequences: Challenges, Caveats and Prospects, Computers and the Humanities 35.
E. Charniak, Statistical Language Learning, Cambridge, The MIT Press.
E. Charniak, Tree-bank Grammars, Proceedings AAAI-96, Menlo Park, Ca.
E. Charniak, Statistical Techniques for Natural Language Parsing, AI Magazine, Winter 1997.
E. Charniak, A Maximum-Entropy-Inspired Parser, Proceedings ANLP-NAACL'2000, Seattle, Washington.
K. Church and W. Gale, A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams, Computer Speech and Language 5.
M. Collins, Head-Driven Statistical Models for Natural Language Parsing, PhD-thesis, University of Pennsylvania, PA.
M. Collins, Discriminative Reranking for Natural Language Parsing, Proceedings ICML-2000, Stanford, Ca.
D. Cope, Pattern-Matching as an Engine for the Computer Simulation of Musical Style, Proceedings ICMC'1990, Glasgow, UK.
R. Crawford, C. Iliopoulos, and R. Raman, String Matching Techniques for Musical Similarity and Melodic Recognition, Computing in Musicology 11.
S. Finch and N. Chater, Distributional Bootstrapping: From Word Class to Proto-Sentence, Proceedings 16th Annual Cognitive Science Society, Hillsdale, Lawrence Erlbaum.
I. Good, The Population Frequencies of Species and the Estimation of Population Parameters, Biometrika 40.
S. Handel, Listening. An Introduction to the Perception of Auditory Events, Cambridge, The MIT Press.
F. Lerdahl and R. Jackendoff, A Generative Theory of Tonal Music, Cambridge, The MIT Press.
H. Longuet-Higgins, Perception of Melodies, Nature 263, October 21.
H. Longuet-Higgins and C. Lee, The Rhythmic Interpretation of Monophonic Music, in: Mental Processes: Studies in Cognitive Science, Cambridge, The MIT Press.
C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, Cambridge, The MIT Press.
M. Marcus, B. Santorini and M. Marcinkiewicz, Building a Large Annotated Corpus of English: the Penn Treebank, Computational Linguistics 19(2).
E. Narmour, The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model, The University of Chicago Press, Chicago.
E. Narmour, The Analysis and Cognition of Melodic Complexity, The University of Chicago Press, Chicago.
W. Press, B. Flannery, S. Teukolsky, and W. Vetterling, Numerical Recipes in C, Cambridge University Press.

P. Rolland and J. Ganascia, Musical Pattern Extraction and Similarity Assessment, in E. Miranda (ed.), Readings in Music and Artificial Intelligence, Harwood Academic Publishers.
H. Schaffrath, Repräsentation einstimmiger Melodien: computerunterstützte Analyse und Musikdatenbanken, in B. Enders and S. Hanheide (eds.), Neue Musiktechnologie, Mainz, B. Schott's Söhne.
H. Schaffrath, The Essen Folksong Collection in the Humdrum Kern Format, D. Huron (ed.), Menlo Park, CA: Center for Computer Assisted Research in the Humanities.
E. Selfridge-Field, The Essen Musical Data Package, Menlo Park, California: Center for Computer Assisted Research in the Humanities (CCARH).
S. Seneff, TINA: A Natural Language System for Spoken Language Applications, Computational Linguistics 18(1).
K. Sima'an, Tree-gram Parsing: Lexical Dependencies and Structural Relations, Proceedings ACL'2000, Hong Kong, China.
J. Tenney and L. Polansky, Temporal Gestalt Perception in Music, Journal of Music Theory 24.
M. Wertheimer, Untersuchungen zur Lehre von der Gestalt, Psychologische Forschung 4.
C. Wetherell, Probabilistic Languages: A Review and Some Open Questions, Computing Surveys 12(4).
M. van Zaanen, Bootstrapping Structure and Recursion Using Alignment-Based Learning, Proceedings International Conference on Machine Learning (ICML'2000), Stanford, California.
