Development of an Optical Music Recognizer (O.M.R.).

Xulio Fernández Hermida, Carlos Sánchez-Barbudo y Vargas.
Departamento de Tecnologías de las Comunicaciones. E.T.S.I.T. de Vigo. Universidad de Vigo.
E.T.S.I.T., Ciudad Universitaria S/N. 36200 Vigo. Phone: (986) 812131. Fax: (986) 812121. e-mail: xfernand@tsc.uvigo.es.

Abstract: This communication describes a system able to recognize printed music and convert it into a Standard MIDI File [1] that can be played back with any common sound card. The recognizer was developed to work with machine-printed music notation because it relies on some rules of music writing. The final application has a user-friendly interface running in the Windows environment.

1. Introduction:

Our recognizer is designed to work with machine-printed music notation (not with handwritten music), because there are rules [2], [3] in music writing that are followed by music typesetters but not always applied in handwritten music. It is important to note that music symbols vary in orientation, positioning and appearance. Typical music symbols are much less regular in appearance and positioning than the characters of printed text, and adjacent and overlapping symbol placements are common, which makes the recognition process harder.

Our system works with bilevel (black and white) images and is completely resolution independent. The minimum resolution is 200 dpi; higher resolutions involve greater computational cost but do not yield better performance. The application runs on Windows 3.x or Windows 95 and is very easy to use. In our first version the images are obtained from graphics files, but soon it will be possible to read images directly from scanners using the TWAIN protocol.

Our application can locate and recognize the following symbols: all notes and rests (whole, half, quarter, eighth, sixteenth...) and their components (flags or hooks, noteheads...), accidentals (flat, natural and sharp), clefs (treble, alto and bass), key signatures, and some more. It also locates staff lines and possible systems. This printed music recognizer uses a wide variety of image processing methods, such as thinning, erosion, segmentation, mask matching, thresholding and projections onto the X and Y axes. The applications of this system range from music teaching to the massive storage of printed music.

2. General Overview:

All processing is structured in a layer-based architecture. This means that we work with different processed images that we can consult at any moment; in that way, we can try to find symbols in one layer and check them in another one. We always keep the original image as a layer, and we sometimes have to copy windows between different layers. Our recognizer does not find the symbols in the same order in which a musician would read them, that is, left to right. Instead, we work bar by bar and try to extract one particular kind of symbol at a time (e.g. black-headed notes, rests...).
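
As a rough illustration of the layer-based organization described in Section 2 (this is not the authors' code, only a minimal sketch assuming 8-bit NumPy arrays as bilevel layers and invented layer names), several processed versions of the page can be kept side by side and rectangular windows copied between them:

```python
import numpy as np

class LayerStack:
    """Keeps several processed versions (layers) of the same page image."""

    def __init__(self, original):
        # The original scan is always preserved as its own layer.
        self.layers = {"original": original.copy()}

    def add(self, name, image):
        self.layers[name] = image

    def copy_window(self, src, dst, top, left, height, width):
        """Copy a rectangular window from one layer into another."""
        window = self.layers[src][top:top + height, left:left + width]
        self.layers[dst][top:top + height, left:left + width] = window.copy()


# Hypothetical usage: the page is a bilevel image (0 = white, 1 = black).
page = (np.random.rand(400, 300) > 0.95).astype(np.uint8)
stack = LayerStack(page)
stack.add("no_staff_lines", page.copy())   # would hold the staff-removed image
stack.copy_window("original", "no_staff_lines", top=50, left=40, height=30, width=60)
```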

3. First Processing Stage:

There are three important tasks that our recognizer performs in the first processing stage: page geometry computation, staff line removal, and segmentation of the image zones (bars) where we will try to locate musical symbols.

By geometry we mean how staffs are grouped into systems (or whether there is no system at all); in a system, some of the voices involved are played simultaneously. To find out whether there are systems we do as a musician would, that is, we look for long bar lines joining different staffs. To implement this we first apply vertical erosions in a certain region of the image and then follow the vertical lines. In this step we also estimate the staff line spacing and the staff line thickness: sizes and distances obtained throughout the recognition process are measured in units relative to the spacing and thickness estimated at this point. The staff line spacing is estimated by scanning nine evenly spaced columns of the image and building a histogram [6] of the distances between opposite transitions; we choose a balanced average among the most repeated value (D), (D+1) and (D-1).

Staff lines define the vertical coordinate system for pitches and provide a horizontal direction for the temporal coordinate system. The five staff lines found on a piece of printed music are not exactly parallel, not exactly horizontal, not exactly equidistant, their thickness is not exactly constant, and they are not even exactly straight; scanning and quantization noise are the reasons for these problems. Using our estimate of the staff line spacing and a non-fixed staff template, we can find the positions of the staffs in the image. This task is very important because later we will search for symbols over or near these located staffs, not all over the image. Next, we find the first staffs in both the left and right regions of the image and compute the image skew. Once we know where the staffs are, we can estimate the staff line thickness more precisely: we build a new histogram of the thickness of black lines in some columns of the image. Before finding the staffs we did not know which of those black lines were true staff lines, but at this point we can discard the false ones and obtain a better histogram.

We then remove the staff lines. Since we know some points belonging to the staffs and the slope (skew) of the staff lines, we can remove them easily by following those lines and deleting every vertical black run with a thickness lower than 1.5 times the computed thickness. With the staff lines removed, we hold in a layer (an image in memory) the same printed music without horizontal staff lines. The image is now much cleaner, with many isolated symbols, and it is easier to locate them.

The next step, as mentioned before, is to find out in which regions of the image there are music symbols. This involves finding the bars: we need the positions of the vertical bar lines, which divide the sheet music into intervals of the same temporal duration. That is very important for our purposes because it helps us to correct possible errors. Our application can locate simple bar lines, double bar lines and repetition bar lines. We do this by computing X projections of different parts of the located staffs: once the staff lines have been removed, bar lines produce high peaks in the X projection, which we examine to confirm that they were really caused by bar lines. Then we remove those vertical lines too. From now on, all processing is done at the bar level: we take every bar and study it in depth. The position of the bar determines how we process it; for example, if it is the first bar of a staff, we also look for the clef and key signature.

Figure 1: Original image.
Figure 2: Original image with the staff and vertical lines removed.
Figure 3: Division into bars.
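
As an illustration of the spacing estimate described above, the following sketch (a reconstruction, not the original implementation) scans nine evenly spaced columns of a bilevel image, builds a histogram of the white run lengths between opposite transitions, and balances the most frequent value D against D-1 and D+1; the frequency-weighted average used here is an assumption, since the exact balancing rule is not given.

```python
import numpy as np
from collections import Counter

def estimate_staff_spacing(img, n_columns=9):
    """Estimate the staff line spacing of a bilevel image (0 = white, 1 = black).

    Scans a few evenly spaced columns, histograms the white run lengths
    between a black-to-white transition and the next white-to-black one,
    and balances the most frequent value D with D - 1 and D + 1.
    """
    height, width = img.shape
    columns = np.linspace(width * 0.1, width * 0.9, n_columns).astype(int)
    distances = Counter()
    for x in columns:
        col = img[:, x]
        offs = np.flatnonzero((col[:-1] == 1) & (col[1:] == 0)) + 1  # black -> white
        ons = np.flatnonzero((col[:-1] == 0) & (col[1:] == 1)) + 1   # white -> black
        for off in offs:
            following = ons[ons > off]
            if following.size:
                distances[int(following[0] - off)] += 1
    if not distances:
        return None
    d = distances.most_common(1)[0][0]
    # Frequency-weighted balance among D - 1, D and D + 1 (an assumption).
    bins = [d - 1, d, d + 1]
    weights = np.array([distances.get(b, 0) for b in bins], dtype=float)
    return float(np.dot(bins, weights) / weights.sum())
```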

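The staff-removal rule above (delete vertical black runs thinner than 1.5 times the estimated line thickness while following each staff line) could look roughly like the sketch below; it assumes the line's row position is already known for every column, for instance from the located staff positions and the computed skew.

```python
import numpy as np

def remove_staff_line(img, line_rows, thickness):
    """Erase one staff line from a bilevel image (0 = white, 1 = black).

    line_rows[x] gives the estimated row of the staff line at column x
    (it may vary slowly with x because of skew).  At every column, the
    vertical black run crossing that row is removed only if it is thinner
    than 1.5 times the estimated staff line thickness, so stems, noteheads
    and other symbols that overlap the line are left untouched.
    """
    height, width = img.shape
    limit = 1.5 * thickness
    for x in range(width):
        y = int(round(line_rows[x]))
        if not (0 <= y < height) or img[y, x] == 0:
            continue
        # Extend the black run up and down from the staff line row.
        top = y
        while top > 0 and img[top - 1, x] == 1:
            top -= 1
        bottom = y
        while bottom < height - 1 and img[bottom + 1, x] == 1:
            bottom += 1
        if (bottom - top + 1) < limit:
            img[top:bottom + 1, x] = 0
    return img
```
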
4. Locating Symbols:

Our system seldom uses template-matching methods; we locate different parts of symbols in different ways, depending on the symbol. Since there are usually more black-headed notes than white-headed ones, we locate them first, using erosion methods (the erosion depth always depends on the previously computed thickness). Here we are not really searching for the notes but for their black heads: we take a bar and erode it by twice the computed thickness.

Figure 4: Heads of black-headed notes.

Then we compute the bounding box of each group of black pixels and filter the boxes using their horizontal and vertical dimensions, which leaves a set of bounding boxes that could be real black heads. Next, we examine the appropriate layer (the image without staff and vertical lines) to locate the stem of each possible black head and, following that line, we count the hooks; this gives us the duration of the note. For each note we process a window containing both the stem and the hooks (if any): we apply a vertical erosion, again depending on the thickness, and obtain a second window containing only horizontal white/black/white transitions. Those transitions are the hooks, and we only have to count them.

We could now continue searching for other kinds of symbols all over the image, but instead we use the information gathered so far to decide where to search. We know the time signature (e.g. 3/4), so we have a measure of the total duration that must be contained in a bar (between two vertical bar lines). Since we have already found all the black-headed notes, we can add up their durations and decide in which bars it is worth continuing with the next recognition steps: those bars will have an accumulated duration (computed by adding note durations) lower than the duration fixed by the time signature. In Figure 5 the time signature is 3/4, and we have shaded the bars where we will continue searching for symbols. Note that when we start the next recognition steps in those bars, we work on an image layer very different from the original one; our layer now looks like Figure 6. From this layer we have deleted almost every symbol found so far: staff lines, bar lines, black-headed notes (with their stems and hooks), the accidentals associated with those notes, and some more. We found the clef and key signature in the first steps of the process, but we have not deleted them because it is not necessary. We use mask matching to decide the clef.

Figure 5: Original image; we have marked the bars where the search must continue after finding the black-headed notes.
Figure 6: Image layer where we will search for white-headed notes.
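
The erosion-and-filter step for black noteheads might be sketched as follows. This is an illustrative reconstruction rather than the original code: it uses SciPy's binary erosion with a depth tied to the estimated staff line thickness and filters connected-component bounding boxes against the expected notehead size; the exact erosion element and size limits are assumptions.

```python
import numpy as np
from scipy import ndimage

def find_black_noteheads(bar_img, thickness, spacing):
    """Return candidate bounding boxes (top, left, bottom, right) of black noteheads.

    bar_img is a bilevel image of one bar (0 = white, 1 = black) from which
    staff and bar lines have already been removed.  The erosion depth and the
    accepted box dimensions are tied to the estimated staff line thickness
    and spacing; the exact limits below are illustrative assumptions.
    """
    depth = max(1, int(round(2 * thickness)))
    # Erode with a square structuring element; thin structures (stems, hooks)
    # disappear, while the solid noteheads survive as small blobs.
    eroded = ndimage.binary_erosion(bar_img.astype(bool),
                                    structure=np.ones((depth, depth), dtype=bool))
    labels, n = ndimage.label(eroded)
    boxes = []
    for sl in ndimage.find_objects(labels):
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        # Keep blobs whose size is compatible with a notehead (assumed limits).
        if 0.2 * spacing <= h <= 1.5 * spacing and 0.2 * spacing <= w <= 2.0 * spacing:
            boxes.append((sl[0].start, sl[1].start, sl[0].stop, sl[1].stop))
    return boxes
```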

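The duration bookkeeping that decides where to keep searching amounts to a few lines; the sketch below assumes a simplified representation (one list of already recognized note durations per bar, expressed as fractions of a whole note) and flags the bars whose accumulated duration falls short of the total fixed by the time signature.

```python
from fractions import Fraction

def bars_needing_more_search(bars, beats, beat_unit):
    """Return indices of bars whose accumulated duration is still incomplete.

    bars      -- list of bars; each bar is a list of note durations expressed
                 as fractions of a whole note, e.g. Fraction(1, 4) for a
                 quarter note (assumed representation).
    beats     -- upper number of the time signature (e.g. 3 for 3/4).
    beat_unit -- lower number of the time signature (e.g. 4 for 3/4).
    """
    target = Fraction(beats, beat_unit)          # total duration of a full bar
    incomplete = []
    for i, durations in enumerate(bars):
        accumulated = sum(durations, Fraction(0))
        if accumulated < target:                 # something is still missing here
            incomplete.append(i)
    return incomplete


# Example with a 3/4 time signature: the second bar only accounts for a half
# note so far, so it is flagged for further searching.
bars = [[Fraction(1, 4)] * 3, [Fraction(1, 2)]]
print(bars_needing_more_search(bars, beats=3, beat_unit=4))   # -> [1]
```
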
The key signature is obtained by looking for accidentals in the sequence of positions where we know they should appear. There can be from one to seven accidentals, or none at all, but if there are any, their sequence of positions is fixed. To recognize those accidentals, and everything else in the image, we use knowledge about their position, their size along the X and Y axes, and their projections onto the X axis; with this we decide between flats, naturals and sharps. To locate whole and half noteheads, we first search for the stem (if it exists) and then follow the contour of the notehead to find its dimensions.

The application proceeds by extracting or deleting the symbols it recognizes from the working layer, so this layer becomes cleaner for later searches. We use the time signature to know where there may still be unrecognized symbols, comparing the fixed total duration with the accumulated duration of the localized symbols. In this way we obtain a lower processing time, because we only look for symbols where they may be and we almost never search the whole image.

When we have all the results, we generate an ASCII file in MEL format (Musical Events Listing) that contains all the information about the recognized music symbols. This file does not contain position information: it is just an easy-to-read format containing the recognized music. We could define an Extended MEL format with position information, but this was not our purpose.

5. MEL to MIDI Conversion:

We have developed a MEL to MIDI file converter. This application takes the MEL (ASCII) file produced by the recognizer and converts it into a Standard MIDI File, so the recognized music can be heard on any PC with a common sound card. This system is not a general text-to-MIDI converter: it only converts MEL 1.0 files into MIDI. We are working on better specifications of the MEL format that will cover newly recognized symbols.
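
Since the MEL 1.0 syntax is not reproduced here, the following sketch only illustrates the final conversion step: it assumes a hypothetical, already-parsed event list (MIDI note number plus duration in quarter notes) standing in for the MEL contents, and writes a Standard MIDI File with the third-party mido library.

```python
import mido

def events_to_midi(events, path, bpm=120, ticks_per_beat=480):
    """Write a Standard MIDI File from a list of (midi_note, quarter_lengths) pairs.

    'events' is a hypothetical stand-in for the parsed contents of a MEL file;
    the real MEL 1.0 format is not reproduced here.
    """
    mid = mido.MidiFile(ticks_per_beat=ticks_per_beat)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    track.append(mido.MetaMessage('set_tempo', tempo=mido.bpm2tempo(bpm)))
    for note, quarters in events:
        ticks = int(round(quarters * ticks_per_beat))
        track.append(mido.Message('note_on', note=note, velocity=64, time=0))
        track.append(mido.Message('note_off', note=note, velocity=64, time=ticks))
    mid.save(path)


# Example: C4, D4, E4 as quarter notes followed by a half-note G4.
events_to_midi([(60, 1), (62, 1), (64, 1), (67, 2)], 'recognized.mid')
```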

6. Future Lines and Conclusions:

We can mention the following future developments (we are already working on some of them): scanner control using the TWAIN standard; extension of the MEL format to include newly recognized symbols; and the study of the much more difficult problem of recognizing handwritten music.

As the main conclusion, we have designed a system able to read printed music. The algorithms we use are fast and simple, and our preferred methods are the morphological ones.

7. References:

[1] The International MIDI Association. Standard MIDI-File Format Spec. 1.1.
[2] J. Chailley, H. Challan. Teoría completa de la Música. Vol. I. Ed. Alphonse Leduc.
[3] J. Zamacois. Teoría de la Música (Libro II). Labor.
[4] H. S. Baird, D. Blostein. A Critical Survey of Music Image Analysis. Structured Document Image Analysis, pp. 405-434. Springer-Verlag.
[5] T. Kientzle. Scaling Bitmaps with Bresenham. C/C++ Users Journal, pp. 51-53, October 1995.
[6] H. Kato, S. Inokuchi. A Recognition System for Printed Piano Music Using Musical Knowledge and Constraints. Structured Document Image Analysis, pp. 435-457. Springer-Verlag.
[7] N. P. Carter, R. A. Bacon. Automatic Recognition of Printed Music. Structured Document Image Analysis, pp. 458-465. Springer-Verlag.
[8] D. Phillips. Image Processing in C. Chapter: Analyzing and Enhancing Digital Images. R&D Technical Books.