Evaluation of Melody Similarity Measures


Evaluation of Melody Similarity Measures

by

Matthew Brian Kelly

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science

Queen's University
Kingston, Ontario, Canada
August 2012

Copyright © Matthew Brian Kelly, 2012

Abstract

Similarity in music is a concept with significant impact on ethnomusicology studies, music recommendation systems, and music information retrieval systems such as Shazam and SoundHound. Various computer-based melody similarity measures have been proposed, but comparison and evaluation of similarity measures is inherently difficult due to the subjective and application-dependent nature of similarity in music. In this thesis, we address the diversity of the problem by defining a set of music transformations that provide the criteria for comparing and evaluating melody similarity measures. This approach provides a flexible and extensible method for characterizing selected facets of melody similarity, because the set of music transformations can be tailored to the user and to the application. We demonstrate this approach using three music transformations (transposition, tempo rescaling, and selected forms of ornamentation) to compare and evaluate several existing similarity measures, including String Edit Distance measures, Geometric measures, and N-Gram based measures. We also evaluate a newly implemented distance measure, the Beat and Direction Distance Measure, which is designed to have greater awareness of the beat hierarchy and better responsiveness to ornamentation. Training and test data are drawn from music incipits from the RISM A/II collection, and ground truth is taken from the MIREX 2005 Symbolic Melodic Similarity task.

Our test results show that similarity measures that are responsive to music transformations generally have better agreement with human-generated ground truth.

Acknowledgments

I would like to express my sincere thanks to my supervisor, Dr. Dorothea Blostein, for her patient and knowledgeable guidance. Our discussions were always helpful, informative, and full of positive energy. This would not have been possible without your positive spirit and enthusiasm. I would also like to thank everyone at the School of Computing who helped me along the way. I am proud to be a part of such a great group of people. Finally, I'd like to thank my friends and family who supported me along the way. David, Matthew, Sean, Jun-tian, and Brenna: your support meant a lot and helped a great deal. Special thanks and love to my parents, Shane and Sally, and my brother, Adam, for their unconditional support, love, and guidance.

Table of Contents

Abstract i
Acknowledgments iii
Table of Contents iv
List of Tables vi
List of Figures ix

Chapter 1: Introduction
    1.1 Similarity Measures for Melodies
    1.2 Thesis Contributions

Chapter 2: Music Transformations
    2.1 Transposition
    2.2 Tempo Rescaling
    2.3 Ornamentation
    2.4 Composition of Transformations

Chapter 3: Background and Literature Review
    3.1 Symbolic Music Representations
    3.2 Categories of Melody Similarity Measures
    3.3 Evaluations

Chapter 4: Methodology
    4.1 Selected Similarity Measures for Study
    4.2 Test Data for Evaluations
    4.3 Evaluation Criteria

Chapter 5: Experiments and Results
    5.1 Evaluation Process
    5.2 Results
    5.3 Analysis of Results

Chapter 6: Conclusion and Future Work

Bibliography

List of Tables

1.1 Table 1 from [1]: Music Representations in Music Information Retrieval
3.1 Summary of results from previous MIREX Symbolic Melodic Similarity tasks
3.2 Aggregate of categories of techniques used in previous MIREX Symbolic Melodic Similarity tasks
4.1 Possible categories for direction in the Beat and Direction Distance Measure
4.2 Table 4.3 from [2] summarizing the possible structures of passing notes
5.1 Results table for query. Cells for each group and distance measure pair show the number of correct melodies and the recall value
5.2 Results table for query. Cells for each group and distance measure pair show the number of correct melodies and the recall value
5.3 Results table for query. Cells for each group and distance measure pair show the number of correct melodies and the recall value

5.4 Results table for query. Cells for each group and distance measure pair show the number of correct melodies and the recall value
5.5 Results table for query. Cells for each group and distance measure pair show the number of correct melodies and the recall value
5.6 Results table for query. Cells for each group and distance measure pair show the number of correct melodies and the recall value
5.7 Results table for query. Cells for each group and distance measure pair show the number of correct melodies and the recall value
5.8 Results table for query. Cells for each group and distance measure pair show the number of correct melodies and the recall value
5.9 Results table for query. Cells for each group and distance measure pair show the number of correct melodies and the recall value
5.10 Results table for query. Cells for each group and distance measure pair show the number of correct melodies and the recall value
5.11 Results table for query. Cells for each group and distance measure pair show the number of correct melodies and the recall value

5.12 Compilation of all results. The values in each cell represent the average dynamic recall value for the distance measure on the specified query melody
5.13 Sorted compilation of average ADR values
5.14 Ranks of compilation of all results from Table 5.12

List of Figures

2.1 Illustration of transposition. The differences between melody (a) and (b) can be entirely modeled as a transposition transformation. The parameters for the transposition state that the entire melody is transposed up a fifth
2.2 An example of tempo rescaling. The difference between melody (a) and (b) can be modeled completely as a tempo rescale transformation. The parameters for the tempo rescaling state that the entire melody is rescaled by a factor of 2, such that the duration of each note is doubled
2.3 An example of pitch-echo ornamentation. The differences between melody (a) and (b) can be modeled as a pitch-echo ornamentation. The parameters of the ornamentation state that the first three notes of (a) are affected, with the ornamentation splitting the original note into two notes that are half as long, raising the pitch of the second note one fifth higher
2.4 An example of beat-hierarchy ornamentation. The difference between melody (a) and (b) can be completely modeled as a beat-hierarchy ornamentation. Each of the four notes that lie on the beat remain the same, but the passing notes change. The parameters for this transformation define each passing note explicitly

2.5 An example of a composition of transformations. The difference between melody (a) and (c) cannot be completely modeled by any one transformation. However, it can be explained by first applying a tempo rescale transformation with a parameter of 2 (shown from (a) to (b)) and then applying a transposition transformation with a parameter of a fifth (as seen from (b) to (c))
3.1 Example of the symbolic music representation ABC, with corresponding western music notation. The ABC notation begins with 5 header lines. The line beginning with X indicates a reference number (for storing multiple melodies in a single file), T gives a unique name for the music fragment, M states the time signature, L indicates the base note length, and K provides the key signature. The header lines are followed by a symbolic encoding of the sequence of notes and rests that constitute this music fragment. In this encoding, E represents a note located on the lowest staff line (E flat above middle C), and e represents a note one octave higher. The 2 represents an eighth note (double the base note length described by L in the header), and 4 represents a quarter note (four times the base note length)
3.2 Example of Edit Distance Insert operation
3.3 Example of Edit Distance Delete operation
3.4 Example of Edit Distance Substitute operation
3.5 Example of Geometric Representation of Melody
4.1 The framework for evaluating melody similarity measures

4.2 An example from the RISM A/II dataset. The melody shown in (a) is used as the query melody. A group of 35 music experts determined that the most similar melody in the collection was (b). The differences between (a) and (b) can be almost completely modeled as a tempo rescaling transformation (doubling the note duration) and a transposition transformation (down a full step)
4.3 A second example from the RISM A/II dataset. The melody shown in (a) is used as the query melody and the melody shown in (b) is the agreed-upon ground truth. In this example, the differences between (a) and (b) are more difficult to model as transformations
4.4 An example of a box whisker plot from a ground truth from the RISM A/II dataset as generated by [3]. This plot shows that the rankings of experts have a median of 4. The mean of the responses is 5.5. The standard deviation of the responses is indicated by the horizontal dotted lines, which mark 1 standard deviation above and below the mean
4.5 An example of a visualization of a Wilcoxon rank sum test from a ground truth from the RISM A/II dataset as generated by [3]
4.6 Figure from data generated in [3], reproduced with permission pending. An example of ground truth from the RISM A/II dataset as generated by [3]. The query melody is shown at the top. This is followed by 10 candidate melodies that experts rank as most similar to the query melody. These 10 candidate melodies are split into 5 groups (a partially ordered list) using the Wilcoxon rank sum test

5.1 Flow chart illustrating the data and algorithms used in the experiment to evaluate selected melody similarity measures
5.2 Class diagram depicting the implementation discussed in this thesis
5.3 Screenshot of the Melody Viewer tab
5.4 Screenshot of the Single Measure Evaluation tab
5.5 Screenshot of the results from running the Single Measure Evaluation tab
5.6 Screenshot of the Single Measure Evaluation versus Ground Truth tab
5.7 Screenshot of the results from running the Single Measure Evaluation versus Ground Truth tab
5.8 Screenshot of the Complete Evaluation versus Ground Truth tab
5.9 Screenshot of the results from running the Complete Measure Evaluation versus Ground Truth tab
5.10 An example query melody from the RISM A/II dataset used to illustrate the effectiveness of the Beat and Direction Distance Measure in recognizing Beat Hierarchy Ornamentation
5.11 An example candidate melody from the RISM A/II dataset used to illustrate the effectiveness of the Beat and Direction Distance Measure in recognizing Beat Hierarchy Ornamentation

Chapter 1

Introduction

The purpose of a music similarity measure is to quantitatively or qualitatively characterize the similarity of two pieces of music. Music similarity is a subjective characteristic that varies depending on the application, and on the judgment and tastes of the user. Studies have shown that similarity judgments differ from person to person [4]. The definition of similarity depends on the context in which the music fragments are being analyzed. For example, the similarity of two music fragments can be based on style or genre (e.g. blues versus jazz) or on the artistic period (e.g. baroque versus medieval). Since the definition of similarity is situation-dependent, it is impossible to define a unique ground truth for music similarity databases. Human judgment can be used as a basis for defining a ground truth [5], as discussed further in Section 3.3.

Music similarity measures are used in a variety of applications. In interactive music-generation systems such as [6], user interactions guide the process of generating music. In such systems, similarity measures can support search operations that help the user find a desired location for editing, or they can be used to analyze the

similarity between generated music fragments, as an aid in determining good candidate fragments. In ethnomusicology, which is "the comparative study of musical systems and cultures" [7], similarity measures can aid in automatically identifying trends and similarities in music composition across periods. In music information retrieval systems such as [8], [9], and in music recommender systems such as [10], [11], similarity measures are used to find music that matches a user's query.

Various representations for music fragments are in common use, including audio recordings, music notation, and symbolic music representations. Table 1.1 categorizes these representations and describes their common uses [1]. Estimation of music similarity can be based on audio recordings, music notation, or symbolic music representations. In this thesis, we consider similarity measures that operate on a symbolic music representation: music fragments are represented as sequences of notes and rests with given pitches and durations. Symbolic music representations are discussed further in Section 3.1.

1.1 Similarity Measures for Melodies

This thesis investigates the definition and evaluation of melody similarity measures, where we define a melody as a symbolic, monophonic fragment of music. Symbolic means that the music fragment is represented as a sequence of notes and rests with given pitches and durations. Monophonic means that a new note does not begin until any current note has finished sounding [12].

In assessing melody similarity, two melodies are compared in their entirety. This is illustrated in Figures 2.1 to 2.5. We do not consider more general formulations of the music similarity problem, in which the goal is to find the best match of a short query

Representation  Description                                  Research
Symbolic        Notation (scores, charts), Event-based       Matching, Theme/Melody Extraction,
                recordings (MIDI), Hybrid Representations    Voice Separation, Musical Analysis
Audio           Recordings, Streaming Audio,                 Sound/Song Spotting, Transcription,
                Instrument Libraries                         Timbre/Genre Classification, Musical
                                                             Analysis, Recommendation Systems
Visual          Scores                                       Score Reading (Optical Music Recognition)
Metadata        Cataloging, Bibliography, Descriptions       Library Testbeds, Traditional IR,
                                                             Interoperability, Recommendation Systems

Table 1.1: Table 1 from [1]: Music Representations in Music Information Retrieval

melody to subsections of a long music fragment, such as in [13]. Even more generally, music similarity can be applied to two music fragments that are both long, aiming to find subsections in one that are a close match to subsections in the other, as in [14].

This task of melody similarity applies directly to applications where the symbolic representation of music is immediately available, such as in music generation and ethnomusicology. However, for applications such as audio-based music information retrieval, the application is indirect: before applying melody similarity measures in these audio tasks, the audio must be transformed into a symbolic representation.

1.2 Thesis Contributions

The objective of this work is to evaluate the performance of melody similarity measures. We define a framework for assessing melody similarity by addressing the following issues.

Define melody similarity. A flexible definition is required because users must be able to adjust the definition of similarity to meet the needs of their application. We define melody similarity using a set of music transformations. A music transformation is a parameterized function that maps one melody to another. In Chapter 2, we define transformations for transposition, tempo rescaling, and two types of ornamentation (pitch-echo ornamentation and beat-hierarchy ornamentation). Music transformations of this type are often used in music composition. We believe that they form a useful basis for defining melody similarity. We propose in Section 4.3 that two melodies are highly similar if the differences between the melodies can be accurately accounted for by the successive application of music transformations. Flexibility is incorporated in this definition: a user can adjust the definition of similarity by adding or omitting music transformations, specifying the number of successive transformations allowed, and adjusting the allowable bounds for the parameters of transformations.

Define the set of similarity measures that will be subject to performance evaluation. Published similarity measures are summarized in Section 3.2. We also implement a recently-proposed similarity measure that is designed to respond better to selected music transformations: a later section describes the Beat and

Direction Measure, which responds well to beat-hierarchy ornamentation. Section 4.1 defines the subset of measures that are evaluated in this thesis. The implementation of these measures is described in Section 5.1.

Define a ground-truth data set for testing the performance of similarity measures. The data set may require manipulation into a consistent format, as discussed in Section 3.1. During testing, performance can be assessed against the ground-truth data set. Alternatively, performance can be assessed by comparing answers to those given by an oracle similarity measure that is taken to be correct. Section 4.2 describes the data that were used in a previous evaluation campaign, for which ground truth was provided in [3].

We use this similarity-testing framework to investigate the following questions:

1. Do any categories of techniques for measuring melody similarity significantly outperform others? Many previously proposed melody similarity measures use or extend techniques developed for other areas of computer science (such as string edit distance). Section 3.2 describes categories of similarity measures such as edit-distance based measures, geometric based measures, and N-gram based measures.

2. Does context sensitivity improve the performance of a melody similarity measure? Many existing melody similarity measures assess the difference between notes without regard to the context in which the notes occur. A context-sensitive measure assesses the difference between two notes by taking surrounding notes into consideration.

3. Does responsiveness to music transformations improve the performance of a

melody similarity measure? If so, which transformations are (most) important? In this thesis we begin investigation of this area by defining similarity measures that are responsive to transposition, tempo rescaling, and two types of ornamentation (pitch-echo ornamentation and beat-hierarchy ornamentation).

The design and implementation of the evaluation framework is a major contribution of this thesis. The following three steps were used.

A. Design and implement software to load symbolic music files into a consistent internal data structure and use a similarity measure to perform an evaluation. Section 4.2 describes the collection and transformation of the necessary data. Implementation of the evaluation engine is described in Section 4.3.

B. Implement a set of representative melody similarity measures from the set of published similarity measures. Among the implemented measures are forms of Geometric Measures, N-Gram based Measures, Edit Distance based Measures, and a newly developed similarity measure that is designed to be responsive to certain music transformations.

C. Evaluate the implemented similarity measures by assessing their performance with respect to the ground truth described in Section 4.2. Interpret the evaluation results to assess the importance of context sensitivity and responsiveness to music transformations.

The next chapters present the details of the evaluation framework. Chapter 2 outlines the selected music transformations for our study. Chapter 3 provides background and a literature review, including existing categories of techniques for assessing melody similarity, and a description of the MIREX symbolic music similarity task. Chapter

4 describes the similarity measures and data processing implemented in this work. Chapter 5 presents the performance evaluation results and analysis, and Chapter 6 concludes with a summary and discussion of future work.

Chapter 2

Music Transformations

This chapter presents our definition of music transformations and a description of the music transformations selected for study. We define a music transformation as a function which creates or models a mapping from one melody to another. Each transformation has a distinct set of P parameters. For example, a transposition transformation requires parameters that define which subsection of the music fragment is transposed, and how many half steps to transpose. In general (Figure 4.1), a set of T transformations is used to capture the type of melody similarity that is of interest in a particular situation. In our experiments (Figure 5.1) we use T = 4. The remainder of this chapter describes this set of transformations in more detail.

2.1 Transposition

Transposition is a transformation that alters the pitch of a series of notes by a fixed number of half steps (a discrete amount of pitch). Figure 2.1 illustrates transposition of an entire music segment. In many applications, it is desired that similarity measures

Figure 2.1: Illustration of transposition. The differences between melody (a) and (b) can be entirely modeled as a transposition transformation. The parameters for the transposition state that the entire melody is transposed up a fifth.

Figure 2.2: An example of tempo rescaling. The difference between melody (a) and (b) can be modeled completely as a tempo rescale transformation. The parameters for the tempo rescaling state that the entire melody is rescaled by a factor of 2, such that the duration of each note is doubled.

produce a very high similarity rating for melodies whose differences can be modeled as a transposition.

2.2 Tempo Rescaling

Tempo rescaling is a transformation that alters the durations of a series of notes by a fixed scale factor. As illustrated in Figure 2.2, the relative note durations are unchanged, but all notes are lengthened or compressed by the scale factor. In some applications, it is desired that similarity measures produce a very high similarity rating for melodies whose differences can be modeled as a tempo rescaling.
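Transposition and tempo rescaling can both be sketched as parameterized functions over a simple note representation. The sketch below is illustrative only: the (midi_pitch, duration) pair representation, the function names, and the start/end parameters are our assumptions, not the thesis implementation.

```python
# Illustrative sketch: transposition and tempo rescaling as parameterized
# transformations. A melody is a list of (midi_pitch, duration_in_beats) pairs.

def transpose(melody, half_steps, start=0, end=None):
    """Shift the pitch of notes in [start, end) by a fixed number of half steps."""
    end = len(melody) if end is None else end
    return [(p + half_steps, d) if start <= i < end else (p, d)
            for i, (p, d) in enumerate(melody)]

def tempo_rescale(melody, factor, start=0, end=None):
    """Multiply the duration of notes in [start, end) by a fixed scale factor."""
    end = len(melody) if end is None else end
    return [(p, d * factor) if start <= i < end else (p, d)
            for i, (p, d) in enumerate(melody)]

melody = [(60, 1.0), (62, 1.0), (64, 2.0)]   # C4, D4, E4
up_a_fifth = transpose(melody, 7)            # as in Figure 2.1: up a fifth (7 half steps)
doubled = tempo_rescale(melody, 2)           # as in Figure 2.2: every duration doubled
print(up_a_fifth)  # [(67, 1.0), (69, 1.0), (71, 2.0)]
print(doubled)     # [(60, 2.0), (62, 2.0), (64, 4.0)]
```

The start/end parameters correspond to the requirement that a transformation's parameters specify which subsection of the music fragment is affected.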

Figure 2.3: An example of pitch-echo ornamentation. The differences between melody (a) and (b) can be modeled as a pitch-echo ornamentation. The parameters of the ornamentation state that the first three notes of (a) are affected, with the ornamentation splitting the original note into two notes that are half as long, raising the pitch of the second note one fifth higher.

2.3 Ornamentation

Composers use many types of ornamentation to add embellishments and variety to the music. Here, we investigate two transformations that model particular types of ornamentation. We call these pitch-echo ornamentation and beat-hierarchy ornamentation, as illustrated in Figures 2.3 and 2.4 respectively.

2.3.1 Pitch-echo Ornamentation

In pitch-echo ornamentation, the target music segment is produced by introducing additional notes between notes in the source music segment. A uniform method is used to base the pitch and duration of added notes on the pitch and duration of the source notes. For example, the pitch-echo ornamentation in Figure 2.3 introduces intermediate notes that are a fifth higher than the original notes, and are half as long as the original notes.

2.3.2 Beat-hierarchy Ornamentation

In beat-hierarchy ornamentation, pitches of notes that occur on strong beats are preserved more than pitches of notes that occur on weak beats. Notes on weak beats

Figure 2.4: An example of beat-hierarchy ornamentation. The difference between melody (a) and (b) can be completely modeled as a beat-hierarchy ornamentation. Each of the four notes that lie on the beat remain the same, but the passing notes change. The parameters for this transformation define each passing note explicitly.

may be altered to add embellishment or variety, as illustrated by the beat-hierarchy ornamentation in Figure 2.4.

2.4 Composition of Transformations

We propose the possibility that the previously mentioned transformations may be used in succession. Figure 2.5 demonstrates an example of one such combination: applying a tempo rescale transformation and then a transposition transformation models the differences between the melodies in Figure 2.5. Also, an example of two similar melodies (as judged by human music experts) is shown in Figure 4.2.

Figure 2.5: An example of a composition of transformations. The difference between melody (a) and (c) cannot be completely modeled by any one transformation. However, it can be explained by first applying a tempo rescale transformation with a parameter of 2 (shown from (a) to (b)) and then applying a transposition transformation with a parameter of a fifth (as seen from (b) to (c)).
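The succession of transformations in Figure 2.5 can be sketched as plain function composition. As a hedged illustration, the (midi_pitch, duration) note representation and all function names below are our own assumptions:

```python
# Illustrative sketch of composing transformations as in Figure 2.5:
# tempo rescale by 2, then transpose up a fifth.

def transpose(melody, half_steps):
    return [(p + half_steps, d) for p, d in melody]

def tempo_rescale(melody, factor):
    return [(p, d * factor) for p, d in melody]

def compose(*transforms):
    """Return a transformation that applies the given transformations in order."""
    def apply(melody):
        for t in transforms:
            melody = t(melody)
        return melody
    return apply

a = [(60, 0.5), (64, 0.5), (67, 1.0)]
a_to_c = compose(lambda m: tempo_rescale(m, 2),  # (a) -> (b)
                 lambda m: transpose(m, 7))      # (b) -> (c)
print(a_to_c(a))  # [(67, 1.0), (71, 1.0), (74, 2.0)]
```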

Chapter 3

Background and Literature Review

This chapter overviews previous work related to melody similarity. A number of symbolic representations exist to encode music in a readable and serializable format. Some examples of these, such as MusicXML [15] and ABC notation [16], are discussed in Section 3.1. Next, classes of published melody similarity measures are defined in Section 3.2. Finally, Section 3.3 describes the methodology used in previously performed evaluations of melody similarity, the results of these evaluations, and difficulties in evaluation.

X: 1
T:
M: 4/4
L: 1/16
K: Eb Major
E2FGABcde4B4 G4z4zEGEAEBE

Figure 3.1: Example of the symbolic music representation ABC, with corresponding western music notation. The ABC notation begins with 5 header lines. The line beginning with X indicates a reference number (for storing multiple melodies in a single file), T gives a unique name for the music fragment, M states the time signature, L indicates the base note length, and K provides the key signature. The header lines are followed by a symbolic encoding of the sequence of notes and rests that constitute this music fragment. In this encoding, E represents a note located on the lowest staff line (E flat above middle C), and e represents a note one octave higher. The 2 represents an eighth note (double the base note length described by L in the header), and 4 represents a quarter note (four times the base note length).

3.1 Symbolic Music Representations

Many formulations of symbolic music representations exist. However, symbolic representations all encode the same information (music voices as sequences of pitch and duration). Examples of symbolic representations include MusicXML [15], ABC notation [16], and Humdrum kern [17]. ABC notation was chosen as the preferred notation for the experimental work done in this thesis, for its low data overhead and for the simplicity of parsing its plain-text encoding. Figure 3.1 shows an example of ABC notation and the corresponding western music notation. Symbolic representations of this kind have been used in systems such as C-Brahms [18] and GUIDO/MIR [19].
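To illustrate why ABC's plain-text encoding is simple to parse, the following sketch converts a note string like the one in Figure 3.1 into (pitch, duration) pairs. It is an assumption-laden toy, not an ABC library: it handles only bare note letters (A-G, a-g), digit duration multipliers, and z rests, and it ignores the key signature (so E is read as E natural, unlike the E flat implied by the K: header in Figure 3.1). The function name parse_abc_notes is hypothetical.

```python
# Illustrative, heavily simplified parser for an ABC note line.
# Returns (semitones_above_C, duration_in_base_lengths) pairs; rests get pitch None.
import re

STEP = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}

def parse_abc_notes(body, base_len=1):
    notes = []
    for letter, digits in re.findall(r'([A-Ga-gz])(\d*)', body):
        dur = int(digits) * base_len if digits else base_len
        if letter == 'z':
            notes.append((None, dur))                 # rest
        else:
            octave = 12 if letter.islower() else 0    # lowercase = one octave up
            notes.append((STEP[letter.upper()] + octave, dur))
    return notes

print(parse_abc_notes("E2FGAB"))  # [(4, 2), (5, 1), (7, 1), (9, 1), (11, 1)]
```

A full parser would also have to resolve the K: and L: headers, accidentals, bar lines, and fractional durations, which is the overhead the thesis avoids by using a dedicated ABC reader.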

Figure 3.2: Example of Edit Distance Insert operation.

Figure 3.3: Example of Edit Distance Delete operation.

3.2 Categories of Melody Similarity Measures

Existing symbolic melody similarity measures can be categorized into various types. The following sections review edit-distance based measures, geometric based measures, and N-gram based measures.

3.2.1 Edit Distance Based Measures for Melody Similarity

Edit distance is a method of quantifying the distance between two strings, or patterns, by counting the minimum number of operations required to transform one string into the other. Commonly, the three types of operations that are used to correct differences in the strings are [20]:

1. Insert: A character is inserted into the target string when there is no corresponding character for the character in the source string.

Figure 3.4: Example of Edit Distance Substitute operation.

2. Delete: A character is deleted from the target string when it does not correspond to any character from the source string.

3. Substitute: A character in the target string is replaced with one from the source string when the corresponding characters do not match.

Melody similarity measures based on edit distance use string matching, or Levenshtein distance, to compute the similarity of two melodies [21], [22], [23], [24], [25], [26]. The distance between two melodies is computed by counting the minimum number of note transformation operations needed to transform the first melody into the second. Common operations include inserting a note (Figure 3.2), removing a note (Figure 3.3), and replacing a note (Figure 3.4). Each operation has a cost parameter. The similarity value is computed as the sum of the costs of all the operations used to transform one melody into the other.

Edit distance has been used widely in previous work with various costs, operations, and representations. Gómez et al. implemented and performed an analysis of the Mongeau-Sankoff algorithm [23], which introduces two additional operations, fragmentation and consolidation [27]. Uitdenbogerd used edit distance with a representation based solely on pitch, ignoring the duration of notes [26]. Frieler and

Müllensiefen applied an edit distance algorithm to a number of simplified representations, including discretizations of pitch, rhythm, and contour [22], [24]. Grachten et al. evaluated an edit distance algorithm which operates on an implication/realization [28] representation of melody [25]. Finally, Ferraro et al. evaluated a string edit distance algorithm which is capable of polyphonic analysis; this algorithm is transposition invariant because it operates on a music representation that stores relative pitch rather than absolute pitch [21].

3.2.2 Geometric Based Measures for Melody Similarity

Geometric similarity measures are based on a two-dimensional geometric representation of melody, where one axis represents time and the other axis represents pitch [22], [24], [29], [30], [31], [32], [33], [34]. This is illustrated in Figure 3.5. This geometric representation allows measures of polygonal similarity to be adapted to the assessment of melody similarity. Depending on how the geometric measure is defined, the measure may be able to model tempo rescaling (scaling along the horizontal axis) and transposition (shifting along the vertical axis).

Frieler and Müllensiefen devised a geometric distance algorithm that was applied to a variety of representations [22], [24]. Typke et al. evaluated the Earth Mover's Distance on melodies [30], [31]. Lemström et al. proposed a geometric distance algorithm applied to a piano-roll representation of melody [29], [33]. Laitinen and Lemström proposed geometric approaches whereby tempo rescaling is handled through time-scaling and time-warping [35]. Finally, Urbano et al. proposed a geometric representation of n-grams and evaluated similarity based upon those [32].
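A minimal note-level Levenshtein distance illustrates the edit-distance family surveyed above. This sketch uses a unit cost for all three operations; published measures differ in their cost functions, representations, and extra operations (e.g. Mongeau-Sankoff fragmentation and consolidation), so treat this as a generic baseline, not any particular published measure.

```python
# Illustrative sketch: Levenshtein distance over melodies represented as
# (pitch, duration) pairs, with unit costs for insert, delete, and substitute.

def note_edit_distance(a, b):
    """Minimum number of insert/delete/substitute operations turning melody a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                                # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                                # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a note
                          d[i][j - 1] + 1,         # insert a note
                          d[i - 1][j - 1] + sub)   # substitute (or match)
    return d[m][n]

x = [(60, 1), (62, 1), (64, 1)]
y = [(60, 1), (62, 1), (65, 1), (64, 1)]   # y is x with one note inserted
print(note_edit_distance(x, y))  # 1
```

Replacing the unit costs with note-dependent costs (e.g. weighting pitch differences by interval size) recovers the "cost parameter" formulation described in the text.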

Figure 3.5: Example of Geometric Representation of Melody (pitch plotted against time).

3.2.3 N-gram Based Measures for Melody Similarity

An n-gram is a contiguous subsequence of n items from a given sequence. The distance between two sequences can be measured by counting the number of matching subsequences of length n that are shared by the sequences. For example, suppose we have a database of sequences and we wish to find the most similar sequence to a query sequence using n = 5. Suppose that the query sequence is (1 2 3 4 5 1), which is represented by the two 5-grams (1 2 3 4 5) and (2 3 4 5 1). Given the following database of sequences, the distances are calculated as follows:

Sequence A: (0 1 2 3 4 5 9 9) is broken into the four 5-grams (0 1 2 3 4), (1 2 3 4 5), (2 3 4 5 9), and (3 4 5 9 9). The second of these 5-grams matches a query 5-gram.

Sequence B: (9 9 9 9 9) is broken into one 5-gram, (9 9 9 9 9).

Sequence C: (2 3 4 5 1 2 3 4 5) is broken into five 5-grams (2 3 4 5 1), (3 4 5 1 2), (4 5 1 2 3), (5 1 2 3 4), and (1 2 3 4 5). The first and last of these 5-grams match a query 5-gram.

Because Sequence C has the most 5-grams matching query 5-grams, it is ranked as the closest. Sequence A is second closest and Sequence B is ranked last.
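This counting scheme is easy to sketch directly. The sequences below are hypothetical stand-ins chosen only to exercise the ranking logic (a query with two 5-grams and three candidates with one, zero, and two matches respectively):

```python
# Illustrative sketch of n-gram match counting for ranking candidate sequences.

def ngrams(seq, n):
    """All contiguous subsequences of length n."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def ngram_matches(query, candidate, n=5):
    """Number of candidate n-grams that also occur among the query's n-grams."""
    q = set(ngrams(query, n))
    return sum(1 for g in ngrams(candidate, n) if g in q)

query = [1, 2, 3, 4, 5, 1]                  # two 5-grams
db = {'A': [0, 1, 2, 3, 4, 5, 9, 9],        # one matching 5-gram
      'B': [9, 9, 9, 9, 9],                 # no matching 5-grams
      'C': [2, 3, 4, 5, 1, 2, 3, 4, 5]}     # two matching 5-grams
ranking = sorted(db, key=lambda k: ngram_matches(query, db[k]), reverse=True)
print(ranking)  # ['C', 'A', 'B']
```

For melodies, the sequence items would be notes (or a derived feature such as the interval between successive notes, which makes the counts transposition invariant).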

In the case of melody similarity, the items in each sequence are notes. Statistics about n-grams can be used to measure similarity, and this is efficient if n is small relative to the length of the music segment. (A melody consisting of K notes contains K − n + 1 n-grams.) To evaluate the similarity of a query to a candidate sequence, the number of matching n-grams is counted, and these counts are compared across all candidate sequences. The first published methods count the number of distinct matching n-grams to determine similarity [36]. Since then, additional representations and methods of comparing sets of n-grams have been developed and applied to the computation of melody similarity [22], [24], [32], [37], [38], [39].

3.3 Evaluations

To improve previous measures and to assess new melody similarity measures, evaluations must be performed. Some such evaluations have been performed in the past in a campaign entitled the Music Information Retrieval Evaluation eXchange (MIREX). MIREX is an annual evaluation campaign for Music Information Retrieval (MIR) algorithms that aids in the evaluation and development of MIR techniques [40]. One of the MIR tasks evaluated at MIREX is the Symbolic Melodic Similarity (SMS) task. The goal of SMS is, given a query, to return a ranked set of the most similar items from a collection of symbolic pieces [41]. Table 3.1 details the results from the SMS task for 2005 and subsequent years. In addition to the category of distance measure used, this table shows which transformations (if any) each similarity measure is tolerant of. This information about music transformations is collected by manually inspecting the implementation details of each entry. Additionally, Table 3.2 aggregates the results (from 2005 onwards) by counting the number of times each

category of similarity measure is used. In the 2005 SMS task, the data used were taken from the RISM A/II collection [42]. To evaluate the results given by each similarity measure, ground truth was established by Typke et al. through the cooperation of 35 music experts in manually judging the similarity of 11 melodies to a set of 50 candidate melodies [3]. Because this process proved to be time-consuming, the data used for the evaluations from 2006 onwards were taken from the Essen Collection [43], and ground truth was established ad hoc by creating error mutations in the following ways [44]:

No errors (i.e. "base")
One note deleted
One note inserted
One interval enlarged
One interval compressed

The MIREX SMS task results form lists of query melodies which map to sets of resulting candidate melodies. The average dynamic recall (ADR) determines the best performing similarity measure(s) [41].
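The exact mutation procedure of [44] is not reproduced here, but one plausible reading of the list above, operating on a pitch sequence, can be sketched as follows (the method names and the shift-based interval change are assumptions for illustration):

```java
public class ErrorMutations {
    // Remove the note at index i.
    static int[] deleteNote(int[] pitches, int i) {
        int[] out = new int[pitches.length - 1];
        for (int j = 0, k = 0; j < pitches.length; j++)
            if (j != i) out[k++] = pitches[j];
        return out;
    }

    // Insert a note with the given pitch at index i.
    static int[] insertNote(int[] pitches, int i, int pitch) {
        int[] out = new int[pitches.length + 1];
        for (int j = 0; j < i; j++) out[j] = pitches[j];
        out[i] = pitch;
        for (int j = i; j < pitches.length; j++) out[j + 1] = pitches[j];
        return out;
    }

    // Enlarge (delta > 0) or compress (delta < 0) the interval entering note i
    // by shifting note i and every later note, so no other interval changes.
    static int[] changeInterval(int[] pitches, int i, int delta) {
        int[] out = pitches.clone();
        for (int j = i; j < pitches.length; j++) out[j] += delta;
        return out;
    }
}
```

Applying one such mutation to a query melody yields a candidate whose intended rank relative to the unmutated "base" version is known in advance, which is what makes the ad-hoc ground truth cheap to generate.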

Year | Source | Polyphonic | Monophonic | Similarity method | Tempo rescaling | Transposition | Ornamentation | Rank (ADR)
2005 | [45] | No | Yes | Combination of Edit Distance, N-Grams, and Geometric | Partial | Yes | No | 7 of
? | [29] | No | Yes | Edit Distance, Geometric Distance | No | Yes | No | 5 and 6 of
? | [30] | No | Yes | Earth Mover's Distance | Yes | Yes | No | N/A
2005 | [30] | No | Yes | Segmented Earth Mover's Distance | Partial | Yes | No | 4 of
? | [36] | No | Yes | N-Gram matching | Yes | Partial | No | 3 of
? | [37] | No | Yes | Multilevel N-Grams of multiple features | Yes | Partial | No | 2 of
? | [25] | No | Yes | Edit Distance of I/R structures optimized by Genetic Algorithm | Yes | Yes | No | 1 of
? | [21] | Yes | Yes | Edit Distance | No | Yes | No | 2 of
? | [46] | Yes | Yes | Geometric Sweep Line | Yes | Partial | No | 7 and 6 of
? | [31] | Yes | Yes | Earth Mover's Distance | No | No | No | 1 of
? | [22] | No | Yes | Edit Distance, N-Grams, Geometric | Partial | Yes | No | 8 of
? | [26] | No | Yes | Edit Distance | No | Yes | No | 4 of
? | [21] | No | Yes | Edit Distance | No | Yes | No | 2 of
? | [23] | No | Yes | Edit Distance | No | Yes | No | 1 and 3 of
? | [47] | No | Yes | Time-Independent Interval Graph | Yes | Yes | No | 7 and 8 of
? | [38] | No | Yes | N-Gram matching | Yes | Yes | No | 4, 5, 6 of
? | [39] | No | Yes | N-Gram pitch and IOI matching | Partial | No | No | 6 and 8 of
? | [48] | No | Yes | Tree Representation with tree similarity | Yes | No | No | 7, 10, 12, and 13 of
? | [35] | No | Yes | Geometric Sweep Line | Yes | Yes | No | 4 and 11 of
? | [32] | No | Yes | N-Grams compared geometrically | Yes | No | No | 1, 2, 3, and 5 of 13

Table 3.1: Summary of results from previous MIREX Symbolic Melodic Similarity tasks. (Some year entries and rank denominators were lost in transcription and are marked "?" or left truncated.)

Method | Count
Edit Distance | 7
N-Grams | 6
Geometric Distance | 3
Earth Mover's Distance | 3
Trees | 2
Fusion/Combination | 2

Table 3.2: Aggregate of categories of techniques used in previous MIREX Symbolic Melodic Similarity tasks

Chapter 4

Methodology

To evaluate the performance of a similarity measure, we use transformations as building blocks to create a formal problem statement. Figure 4.1 illustrates the framework for evaluating melody similarity measures. The three components of the framework are the similarity measures selected for study (Section 4.1), the data sets used during testing (Section 4.2), and the evaluation criteria (Section 4.3). Figure 5.1 shows the instantiation of the framework that is used for the evaluation described in this thesis.

Figure 4.1: The framework for evaluating melody similarity measures.

4.1 Selected Similarity Measures for Study

This section describes the similarity measures that we select to study. The set of selected measures is representative of past work in symbolic melody similarity measures but not completely comprehensive.

4.1.1 Geometric Distance Measure

We implement a geometric similarity measure which operates on a point-set representation of melody, as described in [49]. In order to create a measure that accounts for changes in melody length, we select two data representations to implement:

A. A time-sensitive representation in which the lengths of both the query and candidate melodies are not modified.

B. A time-insensitive representation in which the length of each note is ignored, effectively giving each note the same duration. When the melodies are of different lengths, the shorter of the two melodies is lengthened by enlarging the duration of each note appropriately. For example, when comparing a melody with 10 notes against a melody with 5 notes, each note of the former melody would be half the length of each note of the latter.

The geometric distance measure is implemented using the General Polygon Clipper (GPC) library developed at the University of Manchester [50]. Melodies are first transformed into two-dimensional point sets (of notes) where the first component represents the time passed before the onset of each note and the second component is the pitch.
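This conversion to a point set can be sketched in a few lines; the Note record and its field names below are illustrative assumptions, not the thesis's actual types:

```java
public class PointSet {
    // A note with a pitch in semitones relative to middle C (C4) and a duration.
    public record Note(int semitonesFromC4, double duration) {}

    // Build the 2D point set: the first component of each point is the time
    // elapsed before the note's onset; the second is the pitch, offset by +24
    // so that all values stay positive (the incipits used lie within two
    // octaves of middle C).
    public static double[][] toPointSet(Note[] melody) {
        double[][] points = new double[melody.length][2];
        double onset = 0.0;
        for (int i = 0; i < melody.length; i++) {
            points[i][0] = onset;
            points[i][1] = melody[i].semitonesFromC4() + 24;
            onset += melody[i].duration();
        }
        return points;
    }
}
```

A polygon built from such points can then be compared against a candidate's polygon with the clipping operations described below.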

When using representation A above, we calculate the first component (the time) of each note as the sum of the lengths of all preceding notes. Specifically, if X = (X_1, ..., X_n) represents the lengths of all n notes in a melody, then the first component of note i is calculated as

Σ_{j=1}^{i−1} X_j

When using representation B, we simply use i − 1 as the first component for note i. The second component of each note in a melody is an integer representing the pitch of the note. We calculate this as the number of half steps (semitones) by which the pitch differs from middle C (C4). However, to ensure that all pitches are positive, we add 24 to this value; all incipits used in our experiments are within 2 octaves (24 semitones) of middle C. Finally, a polygon is constructed from the query and candidate point sets using the GPC library. The GPC library's difference(Polygon_1, Polygon_2) function returns a polygon which is the result of subtracting the second polygon from the first. The area of this resulting polygon is used to quantify the regions that the polygons do not share (their differences). The distance between query polygon Q and candidate polygon C is computed as

difference(Q, C) + difference(C, Q)

4.1.2 Edit Distance Measure

The Edit, or Levenshtein, Distance Measure is implemented by representing each melody as a sequence of notes, where each note has both a pitch and a duration. As described in Section 3.2.1, there are a number of operations that can be performed to transform one sequence into another. In our implementation, we use the insert, delete, and substitute operations, and we assign all three operations equal cost.
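The note-level Levenshtein computation can be sketched as follows. The Note record and the duration-sensitivity flag are illustrative (the flag anticipates the two representations described next); this is not the thesis's exact code:

```java
public class EditDistance {
    public record Note(int pitch, double duration) {}

    // Standard dynamic-programming Levenshtein distance over note sequences,
    // with unit-cost insert, delete, and substitute. When durationSensitive
    // is true, two notes match only if pitch AND duration are exactly equal;
    // otherwise pitch equality alone suffices.
    public static int distance(Note[] a, Note[] b, boolean durationSensitive) {
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d[i][0] = i;
        for (int j = 0; j <= b.length; j++) d[0][j] = j;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                boolean match = a[i - 1].pitch() == b[j - 1].pitch()
                        && (!durationSensitive
                            || a[i - 1].duration() == b[j - 1].duration());
                int sub = d[i - 1][j - 1] + (match ? 0 : 1); // substitute (or keep)
                d[i][j] = Math.min(sub,
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1)); // delete / insert
            }
        }
        return d[a.length][b.length];
    }
}
```

For example, halving one note's duration costs one substitution under the duration-sensitive flag and nothing under the duration-insensitive one.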

Two representations are used in our Edit Distance Measure to aid in measuring its tolerance with respect to the transformations described in Chapter 2:

A. A duration-sensitive approach which considers two notes to match only if both the pitch and the duration are exactly equal.

B. A duration-insensitive approach which requires only the pitches of two notes to be equal for them to be considered a match.

By contrasting the similarity ratings resulting from these two representations, we can observe whether the edit distance measure's sensitivity to transformations relates to human judgments of similarity.

4.1.3 N-Gram Distance Measure

We implement an N-Gram Distance Measure, the operation of which is detailed in Section 3.2.3. For our experiments, we chose N = 5. As for the Edit Distance Measure, we implement two types of matching: one duration-sensitive and one duration-insensitive. In addition, we implement the following two methods of calculating distance from matches:

A. Total match counting, where the distance is equal to the inverse of the number of matches. For example, consider two melodies of size M = 7; each melody contains 3 separate 5-grams. We assign a distance of 1 in the case that there are no matches; this is the farthest possible distance. In the case where all 3 5-grams from the query melody match all 3 5-grams from the candidate melody, we have

a total of 9 matches, which yields the closest possible distance. In this situation, we assign a distance of 1/9.

B. Distinct match counting, where the distance is again defined as the inverse of the number of matches, but each n-gram from the query melody may be matched to only one n-gram from the candidate melody. Considering the same melodies with M = 7 as in the previous method, there would be only 3 distinct matching 5-grams, and the distance would be calculated as 1/3.

4.1.4 Beat and Direction Distance Measure

We also implement a measure called the Beat and Direction Distance Measure [2]. This measure was developed with the goal of incorporating musical knowledge into a melody similarity measure while keeping the measure applicable to many types of melodies [2]. Recall from Chapter 1 that the correct answer in a melody similarity task is relative to many factors, among them the type of music being analyzed. The Beat and Direction Distance Measure operates on an abstraction of the query and candidate melodies by using the time signature to analyze the notes that fall on the beats of each melody. Each note that falls on a beat of a melody is called a beat note, whereas every other note is called a passing note [2]. As the name implies, the Beat and Direction Distance Measure considers the pitch direction of the melody as well as the beat notes. The measure performs a type of contourization by considering only the local extremes within the passing notes instead of analyzing each passing note. To calculate distance, the Beat and Direction Distance Measure uses a weighted

sum of three components: the similarity between beat notes, the similarity of directions between beat notes, and the similarity of structures between beats [2]. In our implementation these three components are calculated as follows:

1. The similarity between each pair of corresponding beat notes is calculated as the absolute difference between the pitches of the query and candidate beat notes.

2. To calculate the similarity of directions between beat notes, the pitch direction is first determined for both the query and the candidate melodies. The three categories (Increasing, Decreasing, and Unchanged) are defined in Table 4.1. If the corresponding beats of the query and candidate melodies fall in the same category, the value for this component is 0; otherwise, it is 1.

Condition | Category
pitch(beatNote_2) − pitch(beatNote_1) > 0 | Increasing
pitch(beatNote_2) − pitch(beatNote_1) < 0 | Decreasing
pitch(beatNote_2) − pitch(beatNote_1) = 0 | Unchanged

Table 4.1: Possible categories for direction in the Beat and Direction Distance Measure

Structure | Definition
Steep | abs(pitch(beatNote_1) − pitch(beatNote_2)) ≥ 4 and, for all passing notes, pitch(passingNote) lies between pitch(beatNote_1) and pitch(beatNote_2)
Gradual | abs(pitch(beatNote_1) − pitch(beatNote_2)) < 4 and, for all passing notes, pitch(passingNote) lies between pitch(beatNote_1) and pitch(beatNote_2)
Upward bend | pitch(passingNote) > pitch(beatNote_1), pitch(beatNote_2)
Downward bend | pitch(passingNote) < pitch(beatNote_1), pitch(beatNote_2)
Oscillation | pitch(passingNote_1) > pitch(beatNote_1), pitch(beatNote_2) and pitch(passingNote_2) < pitch(beatNote_1), pitch(beatNote_2)

Table 4.2: Summary of the possible structures of passing notes (reproducing Table 4.3 from [2])

3. Five structures of passing notes are used to determine the similarity between beats; these structures are summarized in Table 4.2. When corresponding beats have matching structures they are given a value of 0; otherwise they are given a value of 1.

Finally, the overall distance between the query and candidate melodies is calculated as

distance = ω_1 + α ω_2 + β ω_3

where:

ω_1 = the total score for similarity of beat notes
ω_2 = the total score for similarity of directions of beats
ω_3 = the total score for similarity of passing-note structures
α, β = scalar weights

In our experiments, we use α = 0.75 and β = 0.

4.2 Test Data for Evaluations

Figure 4.2: An example from the RISM A/II dataset. The melody shown in (a) is used as the query melody. A group of 35 music experts determined that the most similar melody in the collection was (b). The differences between (a) and (b) can be almost completely modeled as a tempo rescaling transformation (doubling the note durations) and a transposition transformation (down a whole step).

As discussed in Section 3.3, the 2005 MIREX SMS task used data taken from the RISM A/II collection. The collection includes more than 657,000 pieces originating

from over 22,000 composers from 32 countries [42]. This collection is useful due to its size and because it is made up of music written by real human composers. The collection is also in a digital format, which makes it available for computational MIR tasks. Furthermore, expert-defined ground truth is provided for the similarity of 11 query melodies. This is sufficient reason to select this data for our evaluations.

Figure 4.3: A second example from the RISM A/II dataset. The melody shown in (a) is used as the query melody and the melody shown in (b) is the agreed-upon ground truth. In this example, the differences between (a) and (b) are more difficult to model as transformations.

The generation of ground truth for the RISM A/II collection was done in a mostly manual process by Typke et al. [3]. The set of 11 query melodies is first selected from the entire data set. To allow for manual judgment by human experts, the remainder of the data set must then be filtered to contain only relevant incipits. The filtering is done by first calculating certain features for the query melodies and then issuing SQL statements against the remainder of the database based upon these features. Some of these features (from [3]) include:

Pitch range: the difference between the pitch of the highest note and the lowest note in the incipit.

Duration ratio: the duration of the shortest note divided by the duration of the

longest note.

Maximum interval: the largest interval between the onsets of subsequent notes.

Figure 4.4: An example of a box-and-whisker plot from a ground truth from the RISM A/II dataset as generated by [3]. This plot shows that the rankings of the experts have a median of 4 and a mean of 5.5. The standard deviation of the responses is shown by the horizontal dotted line, which spans 1 standard deviation above and below the mean.

Each query melody requires a different set of filtering steps to arrive at a manageable number of candidate melodies. Therefore, Typke et al. manually apply filtering steps until the number of candidate melodies is under 300 [3]. This number of candidate melodies is sufficiently low that the candidates may be inspected manually, and sufficiently high that candidates similar to the query melody are not accidentally excluded. To further ensure that possible matches are not excluded, the metadata available through the RISM A/II collection is used to find candidate incipits with similar titles. Typke et al. state, "For example, for Roslin Castle, ... we made sure that every incipit whose title contains the word Roslin was included." [3]

Next, the ground truth is established by asking a group of 35 musical experts to produce an ordered similarity ranking of the candidate melodies. Figures 4.2 and 4.3 illustrate two queries and the best-matching candidate melodies as decided upon by the musical experts. Candidate melodies which do not seem similar to the query melody are given no rank. After collection of the data, the ranks are analyzed for statistical significance using the Wilcoxon rank sum test to generate a partially ordered list of rankings.
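The filtering features listed above can be computed directly from a note sequence. In this sketch the Note record is an assumption, and "maximum interval" is read as the largest pitch step between consecutive notes, which is one plausible interpretation of the description in [3]:

```java
public class IncipitFeatures {
    public record Note(int pitch, double duration) {}

    // Pitch range: highest pitch minus lowest pitch in the incipit.
    public static int pitchRange(Note[] m) {
        int lo = Integer.MAX_VALUE, hi = Integer.MIN_VALUE;
        for (Note n : m) {
            lo = Math.min(lo, n.pitch());
            hi = Math.max(hi, n.pitch());
        }
        return hi - lo;
    }

    // Duration ratio: shortest note duration divided by longest note duration.
    public static double durationRatio(Note[] m) {
        double lo = Double.MAX_VALUE, hi = 0.0;
        for (Note n : m) {
            lo = Math.min(lo, n.duration());
            hi = Math.max(hi, n.duration());
        }
        return lo / hi;
    }

    // Maximum interval, read here as the largest absolute pitch difference
    // between consecutive notes (an assumed interpretation).
    public static int maxInterval(Note[] m) {
        int max = 0;
        for (int i = 1; i < m.length; i++)
            max = Math.max(max, Math.abs(m[i].pitch() - m[i - 1].pitch()));
        return max;
    }
}
```

Features of this kind make coarse SQL pre-filtering cheap: candidates whose feature values differ wildly from the query's can be excluded before any manual inspection.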

Figure 4.5: An example of a visualization of a Wilcoxon rank sum test from a ground truth from the RISM A/II dataset as generated by [3].

An example of the complete ground truth for the melody in Figure 4.3 is shown in Figure 4.6, which consists of the music notation representation of the query melody followed by the candidate melodies. To the right of each music notation representation is a box-and-whisker plot representing the rankings given by the 35 experts. Figure 4.4 depicts an example of one of these box-and-whisker plots. The box in this plot represents the responses from the first to the third quartile, and the whiskers represent the top and bottom 10 percent. Each red dot in the plot represents a response, the large dot represents the median of the responses, and the blue vertical line shows the mean. The horizontal dashed blue line shows the span of one standard deviation above and below the mean. Additionally, Figure 4.5 displays an example of the visualization of a Wilcoxon rank sum test. The size of the plot represents the number of incipits that are ranked higher than the current one. The colour of the plot indicates the p-value of the Wilcoxon rank sum test: specifically, the proportion of the plot shown in the darker red colour is the p-value. Whenever the p-value is below a certain threshold (set to 0.25 in [3]), there is sufficient evidence to say that the rankings of the preceding melodies are not coincidental. Therefore, whenever the p-value falls below 0.25, a separation is introduced to create a partially ordered list.
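The grouping step, given precomputed p-values, can be sketched as follows; this is one plausible reading of the procedure (computing the Wilcoxon rank sum test itself is out of scope here, and the method names are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

public class PartialOrderGrouping {
    // Split a ranked list of candidate indices into groups: whenever the
    // p-value associated with a candidate falls below the threshold (0.25
    // in [3]), the ranking accumulated so far is considered significant and
    // a new group is started. pValues[i] is the precomputed Wilcoxon p-value
    // for candidate i.
    public static List<List<Integer>> group(double[] pValues, double threshold) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int i = 0; i < pValues.length; i++) {
            current.add(i);
            if (pValues[i] < threshold) { // significant separation after candidate i
                groups.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) groups.add(current); // trailing, unseparated group
        return groups;
    }
}
```

For example, p-values (0.1, 0.5, 0.05, 0.3) with threshold 0.25 yield the partial order [[0], [1, 2], [3]].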

4.3 Evaluation Criteria

We evaluate our selected similarity measures using two criteria. First, for each query melody selected in Section 4.2, we compute a list of candidate melodies ranked by similarity, and we use this list to compute the Average Dynamic Recall (ADR) for each measure. Second, the relationship between each ground-truth query and candidate pair is manually characterized with respect to the music transformations selected in Chapter 2. We then combine the results from these steps in order to determine two things:

The effectiveness of each selected measure in recognizing the music transformations

The relationship between the human judgment of melody similarity (as defined by the ground truth) and the music transformations

Each similarity measure returns a ranked list of N candidates, where N is the number of candidates with agreed-upon rankings in the ground truth. The Average Dynamic Recall (ADR) is calculated for each similarity measure's ranked list of candidates as follows:

ADR = (1/j) Σ_{i=1}^{j} rank(i)

where j is the number of groups in the ground truth and rank(i) is the number of correct candidates in the current group divided by the total number in the current group.

Figure 4.2 illustrates the second step. Manual analysis of Figure 4.2 shows that the differences between the query and candidate melodies can be explained well

as a set of transformations applied sequentially: applying a tempo rescaling transformation and then a transposition transformation to the query melody allows us to almost fully transform the query melody into the candidate.
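Under the formulation above, the ADR computation reduces to a few lines; the array names here are illustrative:

```java
public class AverageDynamicRecall {
    // ADR = (1/j) * sum over groups of rank(i), where rank(i) is the number
    // of correct candidates retrieved for group i divided by the group's
    // total size. groupSizes[i] and correctInGroup[i] describe group i.
    public static double adr(int[] groupSizes, int[] correctInGroup) {
        double sum = 0.0;
        int j = groupSizes.length;
        for (int i = 0; i < j; i++) {
            sum += (double) correctInGroup[i] / groupSizes[i];
        }
        return sum / j;
    }
}
```

For instance, with cumulative group sizes 1, 4, and 9, retrieving 1, 3, and 9 correct candidates gives ADR = (1/1 + 3/4 + 9/9) / 3 ≈ 0.917.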

Figure 4.6: An example of ground truth from the RISM A/II dataset as generated by [3] (figure reproduced with permission pending). The query melody is shown at the top, followed by the 10 candidate melodies that experts rank as most similar to the query melody. These 10 candidate melodies are split into 5 groups (a partially ordered list) using the Wilcoxon rank sum test.

Chapter 5

Experiments and Results

We now describe the implementation details of our experiments and report on their results. Figure 5.1 illustrates the specific details of the experimental implementation of our approach in this thesis. This figure depicts the selections of transformations made in Chapter 2 as well as the selections of data and similarity measures made in Sections 4.2 and 4.1. We describe the evaluation implementation process in Section 5.1, present our results in Section 5.2, and provide an informal analysis of these results in Section 5.3.

Figure 5.1: Flow chart illustrating the data and algorithms used in the experiment to evaluate selected melody similarity measures.

5.1 Evaluation Process

The evaluation framework discussed in this thesis is implemented in the Java programming language. Java is chosen for its portability between platforms and its object-oriented nature, which allows the implementation of the framework to be extensible. Figure 5.2 details the implementation of the underlying search and similarity measure classes. Implementing the interface ISimilarityMeasure allows a user to create their own melody similarity measure to be evaluated alongside the currently implemented measures. The evaluation implementation uses the similarity measures to perform searches on the selected dataset described in Section 4.2.
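The extension point can be sketched as follows; the thesis's actual interface signature is not reproduced in the text, so the method, the Melody record, and the example measure below are assumptions for illustration:

```java
// Hypothetical sketch of the ISimilarityMeasure extension point.
public interface ISimilarityMeasure {
    /** Smaller return values indicate that the candidate is closer to the query. */
    double distance(Melody query, Melody candidate);

    // A minimal melody representation (member types of an interface are
    // implicitly public and static).
    record Melody(int[] pitches, double[] durations) {}

    // Example: a trivial user-defined measure plugged in via the interface.
    class NoteCountDifference implements ISimilarityMeasure {
        @Override
        public double distance(Melody query, Melody candidate) {
            return Math.abs(query.pitches().length - candidate.pitches().length);
        }
    }
}
```

Any class implementing the interface can then be handed to the search code unchanged, which is the extensibility property the object-oriented design is chosen for.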

Figure 5.2: Class diagram depicting the implementation discussed in this thesis.

In addition to these framework classes, a Graphical User Interface (GUI) is developed to make the software easier to use. The GUI uses Java's Swing/AWT framework to display pertinent information and controls to the user, and provides four separate tabs of functionality for different tasks. These tasks are as follows:

1. Melody Viewer: This tab (Figure 5.3) allows the user to inspect the incipits from the RISM A/II collection, shown in a rectilinear polygon representation. This polygonal representation is similar to a piano-roll representation, which is named for the rolls on which music was recorded for player pianos to reproduce.

Figure 5.3: Screenshot of the Melody Viewer tab.

2. Single Measure Evaluation: Figure 5.4 shows this tab. The topmost list allows the user to select a query melody from the RISM A/II collection; a polygonal representation of the selected melody is displayed. The bottom list allows the user to select a similarity measure. Finally, clicking the button performs an evaluation with the selected melody as the query and the remainder of the dataset as candidates. This evaluation returns the most similar melody found in the dataset according to the selected similarity measure, as shown in Figure 5.5.

Figure 5.4: Screenshot of the Single Measure Evaluation tab.

Figure 5.5: Screenshot of the results from running the Single Measure Evaluation tab.

3. Single Measure Evaluation versus Ground Truth: This tab allows the user to perform an evaluation as described in Section 4.3. The functionality and look of this tab mirror those of the Single Measure Evaluation tab, except that the list of available query melodies contains only the query melodies which have ground truth associated with them from [3]. Additionally, upon execution, the evaluation presents a different type of result: instead of displaying the most similar melody to the selected query, the results window displays the recall ability of the selected similarity measure with respect to the partially ordered list provided by the ground truth. Figure 5.7 illustrates an example of the results produced by this tab.

Figure 5.6: Screenshot of the Single Measure Evaluation versus Ground Truth tab.

4. Complete Evaluation versus Ground Truth: This tab asks the user to select a query melody to use in an evaluation; Figure 5.8 displays an example of this tab. Upon execution, the software evaluates all available similarity measures using the selected melody as the query. Figure 5.9 illustrates an example of the results obtained by comparing each similarity measure's results against the ground truth partially ordered list.

Figure 5.7: Screenshot of the results from running the Single Measure Evaluation versus Ground Truth tab.

Figure 5.8: Screenshot of the Complete Evaluation versus Ground Truth tab.

5.2 Results

The following tables present the results of the evaluation carried out during this thesis research. The top row of each table describes the size of each group (partial ordering) in the ground truth. Each group comprises the set of all melodies that are judged to be more similar to the query melody than those in later groups; that is, if the first group has 1 melody, the second 3, and the third 5, then the group sizes will be 1, 4, and 9 respectively. Each remaining row shows the results for the query melody in question for a single similarity measure: the first column names the similarity measure, and the subsequent columns list the number of correct candidate melodies selected by that measure along with the recall value for each group.

Figure 5.9: Screenshot of the results from running the Complete Evaluation versus Ground Truth tab.


However, in studies of expressive timing, the aim is to investigate production rather than perception of timing, that is, independently of the listene

However, in studies of expressive timing, the aim is to investigate production rather than perception of timing, that is, independently of the listene Beat Extraction from Expressive Musical Performances Simon Dixon, Werner Goebl and Emilios Cambouropoulos Austrian Research Institute for Artificial Intelligence, Schottengasse 3, A-1010 Vienna, Austria.

More information

Perceptual Evaluation of Automatically Extracted Musical Motives

Perceptual Evaluation of Automatically Extracted Musical Motives Perceptual Evaluation of Automatically Extracted Musical Motives Oriol Nieto 1, Morwaread M. Farbood 2 Dept. of Music and Performing Arts Professions, New York University, USA 1 oriol@nyu.edu, 2 mfarbood@nyu.edu

More information

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? NICHOLAS BORG AND GEORGE HOKKANEN Abstract. The possibility of a hit song prediction algorithm is both academically interesting and industry motivated.

More information

Evaluating Melodic Encodings for Use in Cover Song Identification

Evaluating Melodic Encodings for Use in Cover Song Identification Evaluating Melodic Encodings for Use in Cover Song Identification David D. Wickland wickland@uoguelph.ca David A. Calvert dcalvert@uoguelph.ca James Harley jharley@uoguelph.ca ABSTRACT Cover song identification

More information

A Case Based Approach to Expressivity-aware Tempo Transformation

A Case Based Approach to Expressivity-aware Tempo Transformation A Case Based Approach to Expressivity-aware Tempo Transformation Maarten Grachten, Josep-Lluís Arcos and Ramon López de Mántaras IIIA-CSIC - Artificial Intelligence Research Institute CSIC - Spanish Council

More information

Music Similarity and Cover Song Identification: The Case of Jazz

Music Similarity and Cover Song Identification: The Case of Jazz Music Similarity and Cover Song Identification: The Case of Jazz Simon Dixon and Peter Foster s.e.dixon@qmul.ac.uk Centre for Digital Music School of Electronic Engineering and Computer Science Queen Mary

More information

2 2. Melody description The MPEG-7 standard distinguishes three types of attributes related to melody: the fundamental frequency LLD associated to a t

2 2. Melody description The MPEG-7 standard distinguishes three types of attributes related to melody: the fundamental frequency LLD associated to a t MPEG-7 FOR CONTENT-BASED MUSIC PROCESSING Λ Emilia GÓMEZ, Fabien GOUYON, Perfecto HERRERA and Xavier AMATRIAIN Music Technology Group, Universitat Pompeu Fabra, Barcelona, SPAIN http://www.iua.upf.es/mtg

More information

Analysis and Clustering of Musical Compositions using Melody-based Features

Analysis and Clustering of Musical Compositions using Melody-based Features Analysis and Clustering of Musical Compositions using Melody-based Features Isaac Caswell Erika Ji December 13, 2013 Abstract This paper demonstrates that melodic structure fundamentally differentiates

More information

A Comparison of Different Approaches to Melodic Similarity

A Comparison of Different Approaches to Melodic Similarity A Comparison of Different Approaches to Melodic Similarity Maarten Grachten, Josep-Lluís Arcos, and Ramon López de Mántaras IIIA-CSIC - Artificial Intelligence Research Institute CSIC - Spanish Council

More information

Automatic characterization of ornamentation from bassoon recordings for expressive synthesis

Automatic characterization of ornamentation from bassoon recordings for expressive synthesis Automatic characterization of ornamentation from bassoon recordings for expressive synthesis Montserrat Puiggròs, Emilia Gómez, Rafael Ramírez, Xavier Serra Music technology Group Universitat Pompeu Fabra

More information

MELODIC SIMILARITY: LOOKING FOR A GOOD ABSTRACTION LEVEL

MELODIC SIMILARITY: LOOKING FOR A GOOD ABSTRACTION LEVEL MELODIC SIMILARITY: LOOKING FOR A GOOD ABSTRACTION LEVEL Maarten Grachten and Josep-Lluís Arcos and Ramon López de Mántaras IIIA-CSIC - Artificial Intelligence Research Institute CSIC - Spanish Council

More information

Computational Modelling of Harmony

Computational Modelling of Harmony Computational Modelling of Harmony Simon Dixon Centre for Digital Music, Queen Mary University of London, Mile End Rd, London E1 4NS, UK simon.dixon@elec.qmul.ac.uk http://www.elec.qmul.ac.uk/people/simond

More information

SHORT TERM PITCH MEMORY IN WESTERN vs. OTHER EQUAL TEMPERAMENT TUNING SYSTEMS

SHORT TERM PITCH MEMORY IN WESTERN vs. OTHER EQUAL TEMPERAMENT TUNING SYSTEMS SHORT TERM PITCH MEMORY IN WESTERN vs. OTHER EQUAL TEMPERAMENT TUNING SYSTEMS Areti Andreopoulou Music and Audio Research Laboratory New York University, New York, USA aa1510@nyu.edu Morwaread Farbood

More information

Music Information Retrieval Using Audio Input

Music Information Retrieval Using Audio Input Music Information Retrieval Using Audio Input Lloyd A. Smith, Rodger J. McNab and Ian H. Witten Department of Computer Science University of Waikato Private Bag 35 Hamilton, New Zealand {las, rjmcnab,

More information

Analysis of local and global timing and pitch change in ordinary

Analysis of local and global timing and pitch change in ordinary Alma Mater Studiorum University of Bologna, August -6 6 Analysis of local and global timing and pitch change in ordinary melodies Roger Watt Dept. of Psychology, University of Stirling, Scotland r.j.watt@stirling.ac.uk

More information

CSC475 Music Information Retrieval

CSC475 Music Information Retrieval CSC475 Music Information Retrieval Monophonic pitch extraction George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 32 Table of Contents I 1 Motivation and Terminology 2 Psychacoustics 3 F0

More information

ANNOTATING MUSICAL SCORES IN ENP

ANNOTATING MUSICAL SCORES IN ENP ANNOTATING MUSICAL SCORES IN ENP Mika Kuuskankare Department of Doctoral Studies in Musical Performance and Research Sibelius Academy Finland mkuuskan@siba.fi Mikael Laurson Centre for Music and Technology

More information

Improvised Duet Interaction: Learning Improvisation Techniques for Automatic Accompaniment

Improvised Duet Interaction: Learning Improvisation Techniques for Automatic Accompaniment Improvised Duet Interaction: Learning Improvisation Techniques for Automatic Accompaniment Gus G. Xia Dartmouth College Neukom Institute Hanover, NH, USA gxia@dartmouth.edu Roger B. Dannenberg Carnegie

More information

A Transformational Grammar Framework for Improvisation

A Transformational Grammar Framework for Improvisation A Transformational Grammar Framework for Improvisation Alexander M. Putman and Robert M. Keller Abstract Jazz improvisations can be constructed from common idioms woven over a chord progression fabric.

More information

arxiv: v1 [cs.ai] 2 Mar 2017

arxiv: v1 [cs.ai] 2 Mar 2017 Sampling Variations of Lead Sheets arxiv:1703.00760v1 [cs.ai] 2 Mar 2017 Pierre Roy, Alexandre Papadopoulos, François Pachet Sony CSL, Paris roypie@gmail.com, pachetcsl@gmail.com, alexandre.papadopoulos@lip6.fr

More information

Doctor of Philosophy

Doctor of Philosophy University of Adelaide Elder Conservatorium of Music Faculty of Humanities and Social Sciences Declarative Computer Music Programming: using Prolog to generate rule-based musical counterpoints by Robert

More information

Study Guide. Solutions to Selected Exercises. Foundations of Music and Musicianship with CD-ROM. 2nd Edition. David Damschroder

Study Guide. Solutions to Selected Exercises. Foundations of Music and Musicianship with CD-ROM. 2nd Edition. David Damschroder Study Guide Solutions to Selected Exercises Foundations of Music and Musicianship with CD-ROM 2nd Edition by David Damschroder Solutions to Selected Exercises 1 CHAPTER 1 P1-4 Do exercises a-c. Remember

More information

Introductions to Music Information Retrieval

Introductions to Music Information Retrieval Introductions to Music Information Retrieval ECE 272/472 Audio Signal Processing Bochen Li University of Rochester Wish List For music learners/performers While I play the piano, turn the page for me Tell

More information

Chapter 40: MIDI Tool

Chapter 40: MIDI Tool MIDI Tool 40-1 40: MIDI Tool MIDI Tool What it does This tool lets you edit the actual MIDI data that Finale stores with your music key velocities (how hard each note was struck), Start and Stop Times

More information

SIMSSA DB: A Database for Computational Musicological Research

SIMSSA DB: A Database for Computational Musicological Research SIMSSA DB: A Database for Computational Musicological Research Cory McKay Marianopolis College 2018 International Association of Music Libraries, Archives and Documentation Centres International Congress,

More information

Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes

Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes hello Jay Biernat Third author University of Rochester University of Rochester Affiliation3 words jbiernat@ur.rochester.edu author3@ismir.edu

More information

arxiv: v1 [cs.sd] 8 Jun 2016

arxiv: v1 [cs.sd] 8 Jun 2016 Symbolic Music Data Version 1. arxiv:1.5v1 [cs.sd] 8 Jun 1 Christian Walder CSIRO Data1 7 London Circuit, Canberra,, Australia. christian.walder@data1.csiro.au June 9, 1 Abstract In this document, we introduce

More information

Transcription of the Singing Melody in Polyphonic Music

Transcription of the Singing Melody in Polyphonic Music Transcription of the Singing Melody in Polyphonic Music Matti Ryynänen and Anssi Klapuri Institute of Signal Processing, Tampere University Of Technology P.O.Box 553, FI-33101 Tampere, Finland {matti.ryynanen,

More information

Representing, comparing and evaluating of music files

Representing, comparing and evaluating of music files Representing, comparing and evaluating of music files Nikoleta Hrušková, Juraj Hvolka Abstract: Comparing strings is mostly used in text search and text retrieval. We used comparing of strings for music

More information

Evolutionary Computation Applied to Melody Generation

Evolutionary Computation Applied to Melody Generation Evolutionary Computation Applied to Melody Generation Matt D. Johnson December 5, 2003 Abstract In recent years, the personal computer has become an integral component in the typesetting and management

More information

Take a Break, Bach! Let Machine Learning Harmonize That Chorale For You. Chris Lewis Stanford University

Take a Break, Bach! Let Machine Learning Harmonize That Chorale For You. Chris Lewis Stanford University Take a Break, Bach! Let Machine Learning Harmonize That Chorale For You Chris Lewis Stanford University cmslewis@stanford.edu Abstract In this project, I explore the effectiveness of the Naive Bayes Classifier

More information

Algorithmic Music Composition

Algorithmic Music Composition Algorithmic Music Composition MUS-15 Jan Dreier July 6, 2015 1 Introduction The goal of algorithmic music composition is to automate the process of creating music. One wants to create pleasant music without

More information

6.UAP Project. FunPlayer: A Real-Time Speed-Adjusting Music Accompaniment System. Daryl Neubieser. May 12, 2016

6.UAP Project. FunPlayer: A Real-Time Speed-Adjusting Music Accompaniment System. Daryl Neubieser. May 12, 2016 6.UAP Project FunPlayer: A Real-Time Speed-Adjusting Music Accompaniment System Daryl Neubieser May 12, 2016 Abstract: This paper describes my implementation of a variable-speed accompaniment system that

More information

INTERACTIVE GTTM ANALYZER

INTERACTIVE GTTM ANALYZER 10th International Society for Music Information Retrieval Conference (ISMIR 2009) INTERACTIVE GTTM ANALYZER Masatoshi Hamanaka University of Tsukuba hamanaka@iit.tsukuba.ac.jp Satoshi Tojo Japan Advanced

More information

Analysis of Caprice No. 42. Throughout George Rochberg s Caprice No. 42, I hear a kind of palindrome and inverse

Analysis of Caprice No. 42. Throughout George Rochberg s Caprice No. 42, I hear a kind of palindrome and inverse Mertens 1 Ruth Elisabeth Mertens Dr. Schwarz MUTH 2500.004 6 March 2017 Analysis of Caprice No. 42 Throughout George Rochberg s Caprice No. 42, I hear a kind of palindrome and inverse effect, both in the

More information

A case based approach to expressivity-aware tempo transformation

A case based approach to expressivity-aware tempo transformation Mach Learn (2006) 65:11 37 DOI 10.1007/s1099-006-9025-9 A case based approach to expressivity-aware tempo transformation Maarten Grachten Josep-Lluís Arcos Ramon López de Mántaras Received: 23 September

More information

Bach-Prop: Modeling Bach s Harmonization Style with a Back- Propagation Network

Bach-Prop: Modeling Bach s Harmonization Style with a Back- Propagation Network Indiana Undergraduate Journal of Cognitive Science 1 (2006) 3-14 Copyright 2006 IUJCS. All rights reserved Bach-Prop: Modeling Bach s Harmonization Style with a Back- Propagation Network Rob Meyerson Cognitive

More information

2. AN INTROSPECTION OF THE MORPHING PROCESS

2. AN INTROSPECTION OF THE MORPHING PROCESS 1. INTRODUCTION Voice morphing means the transition of one speech signal into another. Like image morphing, speech morphing aims to preserve the shared characteristics of the starting and final signals,

More information

Melodic String Matching Via Interval Consolidation And Fragmentation

Melodic String Matching Via Interval Consolidation And Fragmentation Melodic String Matching Via Interval Consolidation And Fragmentation Carl Barton 1, Emilios Cambouropoulos 2, Costas S. Iliopoulos 1,3, Zsuzsanna Lipták 4 1 King's College London, Dept. of Computer Science,

More information

PERCEPTUALLY-BASED EVALUATION OF THE ERRORS USUALLY MADE WHEN AUTOMATICALLY TRANSCRIBING MUSIC

PERCEPTUALLY-BASED EVALUATION OF THE ERRORS USUALLY MADE WHEN AUTOMATICALLY TRANSCRIBING MUSIC PERCEPTUALLY-BASED EVALUATION OF THE ERRORS USUALLY MADE WHEN AUTOMATICALLY TRANSCRIBING MUSIC Adrien DANIEL, Valentin EMIYA, Bertrand DAVID TELECOM ParisTech (ENST), CNRS LTCI 46, rue Barrault, 7564 Paris

More information

AP Music Theory. Sample Student Responses and Scoring Commentary. Inside: Free Response Question 1. Scoring Guideline.

AP Music Theory. Sample Student Responses and Scoring Commentary. Inside: Free Response Question 1. Scoring Guideline. 2017 AP Music Theory Sample Student Responses and Scoring Commentary Inside: Free Response Question 1 Scoring Guideline Student Samples Scoring Commentary 2017 The College Board. College Board, Advanced

More information

Jazz Melody Generation from Recurrent Network Learning of Several Human Melodies

Jazz Melody Generation from Recurrent Network Learning of Several Human Melodies Jazz Melody Generation from Recurrent Network Learning of Several Human Melodies Judy Franklin Computer Science Department Smith College Northampton, MA 01063 Abstract Recurrent (neural) networks have

More information

Student Performance Q&A:

Student Performance Q&A: Student Performance Q&A: 2012 AP Music Theory Free-Response Questions The following comments on the 2012 free-response questions for AP Music Theory were written by the Chief Reader, Teresa Reed of the

More information

A wavelet-based approach to the discovery of themes and sections in monophonic melodies Velarde, Gissel; Meredith, David

A wavelet-based approach to the discovery of themes and sections in monophonic melodies Velarde, Gissel; Meredith, David Aalborg Universitet A wavelet-based approach to the discovery of themes and sections in monophonic melodies Velarde, Gissel; Meredith, David Publication date: 2014 Document Version Accepted author manuscript,

More information

AP MUSIC THEORY 2016 SCORING GUIDELINES

AP MUSIC THEORY 2016 SCORING GUIDELINES AP MUSIC THEORY 2016 SCORING GUIDELINES Question 1 0---9 points Always begin with the regular scoring guide. Try an alternate scoring guide only if necessary. (See I.D.) I. Regular Scoring Guide A. Award

More information

What is Statistics? 13.1 What is Statistics? Statistics

What is Statistics? 13.1 What is Statistics? Statistics 13.1 What is Statistics? What is Statistics? The collection of all outcomes, responses, measurements, or counts that are of interest. A portion or subset of the population. Statistics Is the science of

More information

SAMPLE ASSESSMENT TASKS MUSIC JAZZ ATAR YEAR 11

SAMPLE ASSESSMENT TASKS MUSIC JAZZ ATAR YEAR 11 SAMPLE ASSESSMENT TASKS MUSIC JAZZ ATAR YEAR 11 Copyright School Curriculum and Standards Authority, 2014 This document apart from any third party copyright material contained in it may be freely copied,

More information

Automatic Music Clustering using Audio Attributes

Automatic Music Clustering using Audio Attributes Automatic Music Clustering using Audio Attributes Abhishek Sen BTech (Electronics) Veermata Jijabai Technological Institute (VJTI), Mumbai, India abhishekpsen@gmail.com Abstract Music brings people together,

More information

Pitch Spelling Algorithms

Pitch Spelling Algorithms Pitch Spelling Algorithms David Meredith Centre for Computational Creativity Department of Computing City University, London dave@titanmusic.com www.titanmusic.com MaMuX Seminar IRCAM, Centre G. Pompidou,

More information

CS229 Project Report Polyphonic Piano Transcription

CS229 Project Report Polyphonic Piano Transcription CS229 Project Report Polyphonic Piano Transcription Mohammad Sadegh Ebrahimi Stanford University Jean-Baptiste Boin Stanford University sadegh@stanford.edu jbboin@stanford.edu 1. Introduction In this project

More information

The MAMI Query-By-Voice Experiment Collecting and annotating vocal queries for music information retrieval

The MAMI Query-By-Voice Experiment Collecting and annotating vocal queries for music information retrieval The MAMI Query-By-Voice Experiment Collecting and annotating vocal queries for music information retrieval IPEM, Dept. of musicology, Ghent University, Belgium Outline About the MAMI project Aim of the

More information

Processes for the Intersection

Processes for the Intersection 7 Timing Processes for the Intersection In Chapter 6, you studied the operation of one intersection approach and determined the value of the vehicle extension time that would extend the green for as long

More information

Automated extraction of motivic patterns and application to the analysis of Debussy s Syrinx

Automated extraction of motivic patterns and application to the analysis of Debussy s Syrinx Automated extraction of motivic patterns and application to the analysis of Debussy s Syrinx Olivier Lartillot University of Jyväskylä, Finland lartillo@campus.jyu.fi 1. General Framework 1.1. Motivic

More information

Notes on David Temperley s What s Key for Key? The Krumhansl-Schmuckler Key-Finding Algorithm Reconsidered By Carley Tanoue

Notes on David Temperley s What s Key for Key? The Krumhansl-Schmuckler Key-Finding Algorithm Reconsidered By Carley Tanoue Notes on David Temperley s What s Key for Key? The Krumhansl-Schmuckler Key-Finding Algorithm Reconsidered By Carley Tanoue I. Intro A. Key is an essential aspect of Western music. 1. Key provides the

More information

Popular Music Theory Syllabus Guide

Popular Music Theory Syllabus Guide Popular Music Theory Syllabus Guide 2015-2018 www.rockschool.co.uk v1.0 Table of Contents 3 Introduction 6 Debut 9 Grade 1 12 Grade 2 15 Grade 3 18 Grade 4 21 Grade 5 24 Grade 6 27 Grade 7 30 Grade 8 33

More information

Measuring Musical Rhythm Similarity: Further Experiments with the Many-to-Many Minimum-Weight Matching Distance

Measuring Musical Rhythm Similarity: Further Experiments with the Many-to-Many Minimum-Weight Matching Distance Journal of Computer and Communications, 2016, 4, 117-125 http://www.scirp.org/journal/jcc ISSN Online: 2327-5227 ISSN Print: 2327-5219 Measuring Musical Rhythm Similarity: Further Experiments with the

More information

Chords not required: Incorporating horizontal and vertical aspects independently in a computer improvisation algorithm

Chords not required: Incorporating horizontal and vertical aspects independently in a computer improvisation algorithm Georgia State University ScholarWorks @ Georgia State University Music Faculty Publications School of Music 2013 Chords not required: Incorporating horizontal and vertical aspects independently in a computer

More information

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM A QUER B EAMPLE MUSIC RETRIEVAL ALGORITHM H. HARB AND L. CHEN Maths-Info department, Ecole Centrale de Lyon. 36, av. Guy de Collongue, 69134, Ecully, France, EUROPE E-mail: {hadi.harb, liming.chen}@ec-lyon.fr

More information

APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC

APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC Vishweshwara Rao, Sachin Pant, Madhumita Bhaskar and Preeti Rao Department of Electrical Engineering, IIT Bombay {vishu, sachinp,

More information

Building a Better Bach with Markov Chains

Building a Better Bach with Markov Chains Building a Better Bach with Markov Chains CS701 Implementation Project, Timothy Crocker December 18, 2015 1 Abstract For my implementation project, I explored the field of algorithmic music composition

More information

Algorithmic Composition: The Music of Mathematics

Algorithmic Composition: The Music of Mathematics Algorithmic Composition: The Music of Mathematics Carlo J. Anselmo 18 and Marcus Pendergrass Department of Mathematics, Hampden-Sydney College, Hampden-Sydney, VA 23943 ABSTRACT We report on several techniques

More information

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval DAY 1 Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval Jay LeBoeuf Imagine Research jay{at}imagine-research.com Rebecca

More information

Melody Retrieval On The Web

Melody Retrieval On The Web Melody Retrieval On The Web Thesis proposal for the degree of Master of Science at the Massachusetts Institute of Technology M.I.T Media Laboratory Fall 2000 Thesis supervisor: Barry Vercoe Professor,

More information

Jazz Melody Generation and Recognition

Jazz Melody Generation and Recognition Jazz Melody Generation and Recognition Joseph Victor December 14, 2012 Introduction In this project, we attempt to use machine learning methods to study jazz solos. The reason we study jazz in particular

More information

Pitch correction on the human voice

Pitch correction on the human voice University of Arkansas, Fayetteville ScholarWorks@UARK Computer Science and Computer Engineering Undergraduate Honors Theses Computer Science and Computer Engineering 5-2008 Pitch correction on the human

More information

Music and Mathematics: On Symmetry

Music and Mathematics: On Symmetry Music and Mathematics: On Symmetry Monday, February 11th, 2019 Introduction What role does symmetry play in aesthetics? Is symmetrical art more beautiful than asymmetrical art? Is music that contains symmetries

More information

Searching digital music libraries

Searching digital music libraries Searching digital music libraries David Bainbridge, Michael Dewsnip, and Ian Witten Department of Computer Science University of Waikato Hamilton New Zealand Abstract. There has been a recent explosion

More information

METHOD TO DETECT GTTM LOCAL GROUPING BOUNDARIES BASED ON CLUSTERING AND STATISTICAL LEARNING

METHOD TO DETECT GTTM LOCAL GROUPING BOUNDARIES BASED ON CLUSTERING AND STATISTICAL LEARNING Proceedings ICMC SMC 24 4-2 September 24, Athens, Greece METHOD TO DETECT GTTM LOCAL GROUPING BOUNDARIES BASED ON CLUSTERING AND STATISTICAL LEARNING Kouhei Kanamori Masatoshi Hamanaka Junichi Hoshino

More information

Computer Coordination With Popular Music: A New Research Agenda 1

Computer Coordination With Popular Music: A New Research Agenda 1 Computer Coordination With Popular Music: A New Research Agenda 1 Roger B. Dannenberg roger.dannenberg@cs.cmu.edu http://www.cs.cmu.edu/~rbd School of Computer Science Carnegie Mellon University Pittsburgh,

More information

Course Report Level National 5

Course Report Level National 5 Course Report 2018 Subject Music Level National 5 This report provides information on the performance of candidates. Teachers, lecturers and assessors may find it useful when preparing candidates for future

More information

Singer Traits Identification using Deep Neural Network

Singer Traits Identification using Deep Neural Network Singer Traits Identification using Deep Neural Network Zhengshan Shi Center for Computer Research in Music and Acoustics Stanford University kittyshi@stanford.edu Abstract The author investigates automatic

More information

The dangers of parsimony in query-by-humming applications

The dangers of parsimony in query-by-humming applications The dangers of parsimony in query-by-humming applications Colin Meek University of Michigan Beal Avenue Ann Arbor MI 489 USA meek@umich.edu William P. Birmingham University of Michigan Beal Avenue Ann

More information

Music Segmentation Using Markov Chain Methods

Music Segmentation Using Markov Chain Methods Music Segmentation Using Markov Chain Methods Paul Finkelstein March 8, 2011 Abstract This paper will present just how far the use of Markov Chains has spread in the 21 st century. We will explain some

More information

Predicting Variation of Folk Songs: A Corpus Analysis Study on the Memorability of Melodies Janssen, B.D.; Burgoyne, J.A.; Honing, H.J.

Predicting Variation of Folk Songs: A Corpus Analysis Study on the Memorability of Melodies Janssen, B.D.; Burgoyne, J.A.; Honing, H.J. UvA-DARE (Digital Academic Repository) Predicting Variation of Folk Songs: A Corpus Analysis Study on the Memorability of Melodies Janssen, B.D.; Burgoyne, J.A.; Honing, H.J. Published in: Frontiers in

More information

GENERAL WRITING FORMAT

GENERAL WRITING FORMAT GENERAL WRITING FORMAT The doctoral dissertation should be written in a uniform and coherent manner. Below is the guideline for the standard format of a doctoral research paper: I. General Presentation

More information

Semi-supervised Musical Instrument Recognition

Semi-supervised Musical Instrument Recognition Semi-supervised Musical Instrument Recognition Master s Thesis Presentation Aleksandr Diment 1 1 Tampere niversity of Technology, Finland Supervisors: Adj.Prof. Tuomas Virtanen, MSc Toni Heittola 17 May

More information

Hidden Markov Model based dance recognition

Hidden Markov Model based dance recognition Hidden Markov Model based dance recognition Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić University of Zagreb, Faculty of Electrical Engineering and Computing Unska 3,

More information

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring 2009 Week 6 Class Notes Pitch Perception Introduction Pitch may be described as that attribute of auditory sensation in terms

More information

Baylor College of Medicine The Graduate School of Biomedical Sciences

Baylor College of Medicine The Graduate School of Biomedical Sciences Baylor College of Medicine The Graduate School of Biomedical Sciences Instructions for Formatting and Submitting the M.S. Thesis 1. The best guide for formatting your thesis is a journal to which the work

More information

An Approach to Classifying Four-Part Music

An Approach to Classifying Four-Part Music An Approach to Classifying Four-Part Music Gregory Doerfler, Robert Beck Department of Computing Sciences Villanova University, Villanova PA 19085 gdoerf01@villanova.edu Abstract - Four-Part Classifier

More information

Music Representations

Music Representations Lecture Music Processing Music Representations Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Book: Fundamentals of Music Processing Meinard Müller Fundamentals

More information

SAMPLE ASSESSMENT TASKS MUSIC GENERAL YEAR 12

SAMPLE ASSESSMENT TASKS MUSIC GENERAL YEAR 12 SAMPLE ASSESSMENT TASKS MUSIC GENERAL YEAR 12 Copyright School Curriculum and Standards Authority, 2015 This document apart from any third party copyright material contained in it may be freely copied,

More information

Speaking in Minor and Major Keys

Speaking in Minor and Major Keys Chapter 5 Speaking in Minor and Major Keys 5.1. Introduction 28 The prosodic phenomena discussed in the foregoing chapters were all instances of linguistic prosody. Prosody, however, also involves extra-linguistic

More information

AP Music Theory 1999 Scoring Guidelines

AP Music Theory 1999 Scoring Guidelines AP Music Theory 1999 Scoring Guidelines The materials included in these files are intended for non-commercial use by AP teachers for course and exam preparation; permission for any other use must be sought

More information

A Comparison of Methods to Construct an Optimal Membership Function in a Fuzzy Database System

A Comparison of Methods to Construct an Optimal Membership Function in a Fuzzy Database System Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 2006 A Comparison of Methods to Construct an Optimal Membership Function in a Fuzzy Database System Joanne

More information

Introduction to capella 8

Introduction to capella 8 Introduction to capella 8 p Dear user, in eleven steps the following course makes you familiar with the basic functions of capella 8. This introduction addresses users who now start to work with capella

More information

Characterization and improvement of unpatterned wafer defect review on SEMs

Characterization and improvement of unpatterned wafer defect review on SEMs Characterization and improvement of unpatterned wafer defect review on SEMs Alan S. Parkes *, Zane Marek ** JEOL USA, Inc. 11 Dearborn Road, Peabody, MA 01960 ABSTRACT Defect Scatter Analysis (DSA) provides

More information