Ph.D Research Proposal: Coordinating Knowledge Within an Optical Music Recognition System

J. R. McPherson
March, 2001

1 Introduction to Optical Music Recognition

Optical Music Recognition (OMR), sometimes also called musical score recognition or simply score recognition, is the process of automatically extracting musical meaning from a printed musical score. Music notation provides a rich description of the composer's ideas, but ultimately sheet music is open to some degree of interpretation by performers. Performance considerations aside, the advantages of a computerised representation of a musical score are numerous. These include:

- the ability to automatically transpose the part for a particular instrument;
- converting the representation to other musical formats or notations, for example for Braille-reading machines or various software packages, or re-typesetting a score published in an outdated fashion;
- allowing musicians to read the music from a computer display, for example to eliminate the need for page turns [GWMD96, McP99];
- a form of compression, resulting in smaller data sizes [BI98];
- ease of sharing and archiving;
- increased ease of editing (using appropriate software), aiding in composition; and
- automatic indexing and retrieval of information [MSBW97].

1.1 General framework for OMR

The automated process of extracting musical meaning from sheet music normally follows a number of specialised steps, performed in a fixed order. The first step is to acquire a digital form of the sheet music that a computer can access. Today, this step is fairly easy, with the widespread availability of cheap scanner hardware that can create both colour and monochrome digital images at a resolution of three hundred dots per inch or higher, which is more than adequate for our processing purposes.
The second step is to apply various image processing techniques to the acquired image. This is necessary to recognise the symbols that make up the page, for example lines and note heads. This step is the hardest, and is often broken up into two or more separate steps. The final step is to determine the musical meaning (also called the musical semantics) of the image, based on the objects found in the previous step. In Common Music Notation, for example, objects like notes and rests have musical qualities such as pitch, volume and duration; objects such as slurs, accents and trills affect individual notes; and objects such as tempo markings, key signatures and time signatures affect the notes that follow.

1.2 Background and Starting Base

Common Music Notation (CMN), also called Western staff notation or Western music notation, is the notation most widely used today; an example of CMN is shown in Figure 1. Other music notations include guitar tablature, plainsong notation, sacred harp notation, and various Asian, African and Indian musical notations.

Figure 1: A sample of Common Music Notation: Handel's Sonata V for flute and piano.

Ideally, an OMR system should not be limited to any particular set of symbols. It should be possible to add rules that allow the system to understand a new notation without making significant internal changes to the system. This is referred to as extensibility. Bainbridge's CANTOR system [Bai97] was one of the first fully extensible optical music recognition systems developed. Most prior work was limited to small subsets of CMN, and often made assumptions about staff lines, such as there always being five lines per staff. While CANTOR still has the restriction that the music must be stave-based, there can be an arbitrary number of lines per staff. Here, extensible refers to the fact that one of the design goals was to research and design a system that did not have hard-coded shapes built into it.
This research led to the formation of Primela, a Primitive Expression Language for describing specific musical shapes. A set of Primela descriptions can be written to describe a particular music notation, then loaded and used at run-time to process an image.
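The flavour of this data-driven approach can be suggested with a toy sketch. The description format, feature choices and names below are invented for illustration and are not Primela's actual syntax; the point is only that shape knowledge lives in data loaded at run-time, not in the recogniser's code.

```python
import json

# A toy, invented stand-in for a notation description file: each entry names
# a shape and gives crude feature bounds. Real Primela descriptions are far
# richer; this only illustrates shape knowledge living in run-time data.
CMN_DESCRIPTIONS = json.loads("""
[
  {"name": "notehead",      "aspect_min": 0.8, "aspect_max": 1.6},
  {"name": "vertical_line", "aspect_min": 0.0, "aspect_max": 0.2}
]
""")

def classify(width, height, descriptions):
    """Return the names of all loaded descriptions an object could match."""
    aspect = width / height
    return [d["name"] for d in descriptions
            if d["aspect_min"] <= aspect <= d["aspect_max"]]
```

Swapping in a different description file would re-target the same classifier to another notation, with no change to the recognition code.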
CANTOR consists of four main steps:

- Staff line identification, which locates staves, removes staff lines and locates objects in the bitmap;
- Primitive recognition, which identifies basic shapes, such as (for the CMN Primela descriptions) slurs, noteheads, tails, accidentals, and lines;
- Primitive assembly, which joins the basic primitives found into musical objects, for example assembling noteheads, stems and tails into a note; and
- Musical semantics, which determines musical qualities such as the pitch and duration of the musical objects found, and can output various musical file formats.

2 Areas of Research

Most current projects in the field of OMR are concerned with improving the accuracy of the various components, particularly the pattern recognition stages. Instead of focusing solely on the individual components, I wish to research and create methods that improve the overall system not merely by improving components in isolation, but by improving how they interact with each other, so as to maximise the amount of musical information gained from the image. Part of my research will involve determining and evaluating appropriate methods for the process controlling the interaction, known as the coordinator.

2.1 Coordinating interaction between components

Determining how best to coordinate the information received from the OMR components will be the main area of focus for the thesis. Figure 2 shows how most current systems operate. The different phases of the OMR system are performed in a linear sequence, and each phase's output becomes the next phase's input. This also means that each phase is tightly coupled to both the previous and following one, as they must share common data structures and formats.

Figure 2: The current pipeline approach (scanned music passes through staff line identification, musical object location, musical feature classification and musical semantics, supported by image enhancement and musical knowledge, to produce an encoded music data file)

However, this model has some limitations.
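The strictly linear flow of Figure 2 can be sketched as a chain of function calls, where each stage consumes exactly what the previous stage produced. The stage names follow the figure, but the bodies and data structures are placeholders invented for illustration:

```python
# Placeholder implementations of the Figure 2 stages; the stage names follow
# the figure, but the data structures are invented for illustration.
def staff_line_identification(image):
    # Locate staves and strip staff lines, leaving candidate object blobs.
    return {"objects": [b for b in image["blobs"] if b != "staff-line"]}

def musical_object_location(data):
    return {"located": data["objects"]}

def musical_feature_classification(data):
    # Classify every located object (here, trivially, as a notehead).
    return {"classified": [(obj, "notehead") for obj in data["located"]]}

def musical_semantics(data):
    return {"notes": [{"source": obj, "kind": kind}
                      for obj, kind in data["classified"]]}

def run_pipeline(image):
    # A fixed, one-way ordering: each stage runs exactly once, and an
    # error made early on simply flows downstream with no way back.
    data = staff_line_identification(image)
    data = musical_object_location(data)
    data = musical_feature_classification(data)
    return musical_semantics(data)
```

Because each stage trusts its input completely, a misclassification made early on flows unchallenged into the final output.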
Most seriously, errors made in an early step will propagate through the following steps. For example, when performing musical semantics analysis on the recognised components, an error may be detected, such as a bar of music not having enough (or too many)
notes in it. Because this type of error cannot be corrected within the current context, the system is forced to output something that it knows is not quite right. (Some errors, however, such as a missing or mis-detected accidental in a key signature, could conceivably be corrected in this context.) What would improve the system's overall accuracy would be to use this newly-gained context to re-perform a previous stage, and hopefully correct the error given this new information.

Figure 3: The proposed coordinated approach (a central coordinator mediates between the image, the music representation, and the page layout, staff processing, primitive location, primitive identification, primitive assembly and musical semantics/analysis modules)

Figure 3 shows a possible revised framework to allow feedback to earlier stages. All execution is controlled by a coordinating process; the modules cannot communicate directly. The idea here is that the top-level process controls the flow of execution, based on a number of variables. Part of the research is to determine the choice of variables used to control program flow, and what effect these variables have on both the performance and the run-time behaviour of the system. This type of framework would also encourage looser integration between the various components. Loosely integrated components would allow, for example, the addition of several competing components capable of doing the same or similar steps, which could have their results compared for discrepancies by the coordinator. This would provide either more confidence that the results are right, if the different components agree, or particular areas that should be further examined, if the results conflict. Another advantage is that this framework allows for modules that do not directly perform any music processing but still provide additional context.
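These coordination ideas can be sketched in miniature. The class and function names below are invented for illustration and do not correspond to any existing implementation: modules register the requests they can fulfil, and a later stage can ask the coordinator to re-run an earlier one with new context.

```python
class Coordinator:
    """Mediates all communication; modules never call each other directly."""

    def __init__(self):
        self._handlers = {}

    def register(self, capability, handler):
        # Several competing modules may register for the same capability.
        self._handlers.setdefault(capability, []).append(handler)

    def request(self, capability, **context):
        # Forward the request to every module claiming the capability and
        # collect the answers; disagreement marks an area to re-examine.
        return [h(**context) for h in self._handlers.get(capability, [])]

# Example feedback request: a later stage asks for a shape to be re-tested
# against its earlier classification, with a higher threshold for passing.
def reclassify(shape, threshold):
    return shape["label"] if shape["match_score"] >= threshold else None

coordinator = Coordinator()
coordinator.register("reclassify", reclassify)
```

A request such as `coordinator.request("reclassify", shape={"label": "flat", "match_score": 0.6}, threshold=0.8)` would then come back rejected, mirroring the re-testing behaviour described in Section 2.4; registering several competing handlers for the same capability would let the coordinator compare their answers for discrepancies.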
One example of such a module is a component that detects the scan quality (perhaps from the level of noise in the bitmap); if the quality is low, then matching thresholds could be lowered, or a set of descriptions specifically designed for noisy data could be used.

2.2 Page Layout

I would like to spend some effort investigating and/or designing algorithms for using a priori knowledge to determine possible object types before using the lower-level recognition subsystems such as staff location or character/text recognition. This more general area of research is known as document image analysis, and there are techniques that might be researched and improved with respect to the OMR domain. This could involve the system keeping a history of processed documents, to aid in predicting the layout of future documents, and using prior knowledge to decide that there may be a title and author somewhere
near the top of the page. The proposed coordinated approach for the OMR system could then decide whether or not to test this hypothesis, given knowledge gained about this area of the page from other sources.

2.3 Classification Algorithms for feature extraction

One of the more recent developments in the field of OMR is the use of machine-learning techniques to develop shape descriptions, given a set of training data [Ala95, BAD99, SD98]. These techniques could be investigated to design feature sets for classification of musical primitives, for either the current Primela framework or some new, replacement method for differentiating objects.

2.4 Illustration of the Concept

There is currently an existing prototype, based on the CANTOR code and still a work in progress, that is capable of using message passing to provide feedback from a particular phase to earlier phases. While not yet very advanced, the following example demonstrates the potential improvement that the methods under investigation may offer. Figure 4(a) shows a small extract from the Clarinet Concerto by Mozart. This extract is from the pianist's part, and also has the clarinetist's part displayed above the piano stave. This incidentally also demonstrates how OMR must be able to deal with symbols at different scales within the same piece. Figures 4(b) and 4(c) show the vertical lines and the flats respectively that were found by CANTOR in the pattern recognition stage. There are some errors in both of these classifications. There are a few mis-identified vertical lines: the time signature (6/8) was just broken enough to pass as two vertical lines. The musical semantics module could pick up that there was no time signature, yet there were extra vertical lines where a time signature might be expected, and allow the system to re-examine this area. Also, the two letter 'l's of the word Allegro were not unreasonably determined to be vertical lines, as they were close enough to the staff to be checked.
However, they are unlikely to have any musical meaning for CMN, and are also close to other textual characters. There are four naturals in the extract that were determined to be flats, due to the default descriptions used. This could be solved by writing Primela descriptions that correctly differentiate between flats and naturals for the particular fonts used in this piece of music, but it would be more elegant to correct these automatically with semantic analysis, by noticing that accidentals rarely appear that have no effect on the note, given either the last occurring key signature or an accidental on the same note earlier in the same bar. Unfortunately, in this particular case there are also missing flats in the key signatures of two of the staves. These could also be picked up using semantic analysis, by noticing that one staff did have a key signature, so the others probably should as well. This, coupled with the fact that there will be unrecognised objects in the position where a key signature could be expected, should provide enough context that the recognition stage should look there again for a key signature. Lastly, for whatever reason, the first chord in the second bar did not have its note stem recognised as a vertical line; see the circled area within each part of Figure 4
to locate this object. (CANTOR currently checks for vertical lines before checking for accidentals, although this order is user-defined in the Primela descriptions.) Because of this, the shape passed the tests as possibly being a flat. This is as far as CANTOR goes. However, when the prototype system assembles the primitives together, it notices that this particular flat does not have a notehead in the appropriate position to its immediate right. The primitive assembly module now issues a request to the coordinator to check this primitive's classification again. Note that if the request is rejected, the primitive assembly stage has already been completed, and processing can continue regardless. The coordinator determines that the pattern recognition module is capable of fulfilling this request, so passes the request to it. This stage now takes account of this new context, and subsequently rejects the shape as possibly being a flat (Figure 4(d)). Currently this context (that is, the fact that the primitive could not be assembled) is accounted for by re-testing the object for the same classification, but with a higher threshold for passing. While this may seem like a small step, it can have an impact on the final output: this is the difference between the music as written and an incorrect note resulting in a discord. Unfortunately, the prototype does not yet use this new context to correctly identify this shape, in this case as a vertical line. The prototype system does not currently perform semantic analysis. As the above discussion shows, there are plenty of opportunities to use musical context for improvement in the recognition stages. The key will be finding a generalised approach for this task.

3 Intended Schedule and Requirements

This research will be carried out using existing equipment within the department. No extra computing (or other) resources are expected to be required. The following is an estimate of the work likely to be completed.
Depending on the progress made during these tasks, other work, such as that mentioned in Sections 2.2 and 2.3, might be undertaken. Also, new developments by other researchers may cause a change in direction or scope for this research.

Task                                                      Months
Continue research, complete first prototype                    6
Experimentation with prototype                                 2
Write-up of methods, ideas and findings                        1
Investigate and create other coordinators                     12
Comparisons between coordinators and other OMR systems         3
Completion of write-up                                         5
Total                                                         29

Note that some work has previously been done during enrolment for a Masters degree, beginning July 2000. There are currently no foreseen ethical issues arising from this research. If at a later date it is necessary to perform evaluation studies on various methods and/or software, then ethical approval from the school's Ethics Committee will be sought.
Figure 4: Part of the first line of the Rondo from Mozart's Clarinet Concerto, with the area of interest circled. (a) The starting image; (b) the vertical lines found by CANTOR; (c) the flats found by CANTOR; (d) the flats found by CANTOR with coordination.
References

[Ala95] Jarmo T. Alander. Indexed bibliography of genetic algorithms in optics and image processing. Report 94-1-OPTICS, University of Vaasa, Department of Information Technology and Production Economics, 1995. ftp.uwasa.fi/cs/report94-1/gaopticsbib.ps.z.

[BAD99] Bruce A. Draper, Jose Bins, and Kyungim Baek. ADORE: Adaptive object recognition. In Proceedings of the International Conference on Vision Systems, pages 522-537, Las Palmas de Gran Canaria, Spain, January 1999.

[Bai97] David Bainbridge. Extensible Optical Music Recognition. PhD thesis, University of Canterbury, Christchurch, New Zealand, 1997.

[BI98] David Bainbridge and Stuart Inglis. Musical image compression. In Proceedings of the IEEE Data Compression Conference, pages 209-218, Snowbird, Utah, 1998. IEEE.

[GWMD96] Christopher Graefe, Derek Wahila, Justin Maguire, and Orya Dasna. Designing the muse: A digital music stand for the symphony musician. In Proceedings of the CHI '96 Conference on Human Factors in Computing Systems, page 436, Vancouver, Canada, 1996. ACM.

[McP99] J. R. McPherson. Page turning score automation for musicians. B.Sc. Honours thesis, University of Canterbury, New Zealand, 1999.

[MSBW97] Rodger J. McNab, Lloyd A. Smith, David Bainbridge, and Ian H. Witten. The New Zealand Digital Library MELody index. D-Lib Magazine, May 1997.

[SD98] Marc Vuilleumier Stückelberg and David Doermann. On musical score recognition using probabilistic reasoning. In Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR '99). IEEE, 1999.