A METRIC FOR MUSIC NOTATION TRANSCRIPTION ACCURACY

Andrea Cogliati
University of Rochester, Electrical and Computer Engineering
andrea.cogliati@rochester.edu

Zhiyao Duan
University of Rochester, Electrical and Computer Engineering
zhiyao.duan@rochester.edu

ABSTRACT

Automatic music transcription aims at transcribing musical performances into music notation. However, most existing transcription systems only focus on parametric transcription, i.e., they output a symbolic representation in absolute terms, showing frequency and absolute time (e.g., a piano-roll representation), but not in musical terms, with spelling distinctions (e.g., A♭ versus G♯) and quantized meter. Recent attempts at producing full music notation output have been hindered by the lack of an objective metric to measure the adherence of the results to the ground truth music score, and had to rely on time-consuming human evaluation by music theorists. In this paper, we propose an edit distance, similar to the Levenshtein distance used for measuring the difference between two sequences, typically strings of characters. The metric treats a music score as a sequence of sets of musical objects, ordered by their onsets. The metric reports the differences between two music scores based on twelve aspects: barlines, clefs, key signatures, time signatures, notes, note spelling, note durations, stem directions, groupings, rests, rest duration, and staff assignment. We also apply a linear regression model to the metric in order to predict human evaluations on a dataset of short music excerpts automatically transcribed into music notation.

© Andrea Cogliati, Zhiyao Duan. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Andrea Cogliati, Zhiyao Duan. "A metric for music notation transcription accuracy", 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.

1. INTRODUCTION

Automatic Music Transcription (AMT) is the process of inferring a symbolic representation of a musical performance. Despite four decades of active research, AMT is still an open problem, with humans being able to achieve better results than machines []. AMT systems can be broadly classified into two categories according to the chosen symbolic representation: parametric transcription and music notation transcription.

Parametric transcription systems output a parametric representation of the musical performance, such as an unquantized MIDI piano roll []. This representation is expressed in physical terms, such as seconds for note onset and duration, and hertz or MIDI numbers for pitch []. It can faithfully represent the musical performance, but normally it does not explicitly encode high-level musical structures, such as key, meter and voicing []. Music notation transcription systems, on the other hand, output the common music notation that human musicians read. This representation is expressed in musically meaningful terms, such as quantized meter for note onset and duration, and spelling distinctions (e.g., A♭ versus G♯) for pitch. Compared to parametric transcription, music notation transcription is generally more desirable for many applications connecting humans and machines, such as computational musicological analysis and music tutoring systems. The vast majority of existing AMT methods, however, are parametric transcription systems.

Researchers have put considerable effort toward building music notation transcription systems by identifying musical structures from unquantized parametric representations, especially MIDI files, from both MIR and cognitive perspectives [0]. Cambouropoulos [] described the key components necessary to convert a MIDI performance into music notation: identification of elementary musical objects (i.e., chords, arpeggiated chords, and trills), beat identification and tracking, time quantization and pitch spelling. Takeda et al. [] describe a Hidden Markov Model (HMM) for the automatic transcription of monophonic MIDI performances.
Cemgil [] presents a Bayesian framework for music transcription, identifying some issues related to automatic music typesetting (i.e., the automatic rendering of a musical score from a symbolic representation), in particular tempo quantization, and chord and melody identification. Karydis et al. [] proposed a perceptually motivated model for voice separation capable of grouping polyphonic groups of notes, such as chords or other forms of accompaniment figures, into a perceptual stream. A more recent paper by Grohganz et al. [] introduced the concepts of score-informed MIDI file (S-MIDI), in which musical tempo and beats are properly represented, and performed MIDI file (P-MIDI), which records a performance in absolute time. The paper also presented a procedure to approximate an S-MIDI file from a P-MIDI file, that is, to detect the beats and the meter implied in the P-MIDI file, starting from a tempogram and then analyzing the beat inconsistency with a salience function based on autocorrelation.

Researchers have also attempted to infer musical structures directly from audio. Ochiai et al. [] proposed a model for the joint estimation of note pitches, onsets, offsets and beats based on Non-negative Matrix Factorization (NMF) constrained with a rhythmic structure modeled with a Gaussian mixture model. Collins et al. [] proposed a model for multiple fundamental frequency estimation, beat tracking, quantization, and pattern discovery. The pitches are estimated with a neural network. An HMM is separately used for beat tracking. The results are then combined to quantize the notes. Note spelling is performed by estimating the key of the piece and assigning to MIDI notes the most probable pitch class given the key.

An immediate problem arising when building a music notation transcription system by incorporating the above-mentioned musical structure inference methods is to find an appropriate way to evaluate the transcription accuracy of the system. In our prior work [], we asked music theorists to evaluate music notation transcriptions along three different musical aspects, i.e., the pitch notation, the rhythm notation, and the note positioning. However, subjective evaluation is time consuming and difficult to scale to provide enough feedback to further improve the transcription system. It would be very helpful to have an objective metric for music notation transcription, just like the standard F-measure metric for parametric transcription []. Considering the inherent complexity of music notation, such a metric would need to take into account all of the aspects of the high-level musical structures in the notation. To the best of our knowledge, there is no such metric, and the goal of this paper is to propose one.

Specifically, in this paper we propose an edit distance, based on similar metrics used in bioinformatics and linguistics, to compare a music transcription with the ground-truth score. The design of the metric was guided by a data-driven approach, and by simplicity. The metric is calculated in two stages.
In the first stage, the two scores are aligned based on the pitch content; in the second stage, the differences between the two scores are accumulated, taking into account twelve different aspects of music notation: barlines, clefs, key signatures, time signatures, notes, note spelling, note durations, stem directions, groupings, rests, rest duration, and staff assignment. This will serve the same purpose as F-measure in evaluating parametric transcription. To validate the saliency and the usefulness of this metric we also apply a linear regression model to the errors measured by the metric to predict human evaluations of transcriptions.

2. BACKGROUND

Approximate sequence comparison is a typical problem in bioinformatics [], linguistics, information retrieval, and computational biology []. Its purpose is to find similarities and differences between two or more sequences of elements or characters. The sequences are assumed sufficiently similar but potentially corrupted by errors. Possible differences include the presence of different elements, missing elements or extra elements. Several metrics have been proposed to measure the distance between two sequences, including the family of edit metrics [], and gap-penalizing alignment techniques [].

A music score in traditional Western notation can be viewed as a sequence of musical characters, such as clefs, time and key signatures, notes and rests, possibly occurring concurrently, such as in simultaneous notes or chords. Transcription errors include alignment errors due to wrong meter estimation or quantization, extra or missing notes and rests, note and rest duration errors, wrong note spelling, wrong staff assignment, wrong note grouping and beaming, and wrong stem direction. All of these errors contribute, to varying degrees, to the quality of the resulting transcription. However, the impact of each error and error category has not, to the best of our knowledge, been researched.

As an example, Fig. 1 shows two transcriptions of the same piece. Both transcriptions contain similar errors, i.e., wrong meter detection, but the transcription in Fig. 1c is arguably worse than that in Fig. 1b. A similar problem can be observed with the standard F-measure typically used to evaluate parametric transcriptions []; while the metric is objective and widely used, the impact of different errors on the perceptual quality of a transcription has not been researched. Intuitively, certain errors, such as extra notes outside of the harmony, should be perceptually more objectionable than others, such as octave errors. This is the reason for both proposing an objective metric and correlating the metric with human evaluations of transcriptions.

Figure 1: Comparison of two transcriptions of the same piece containing similar errors but with different readability. (a) Ground truth. (b) Transcription with a wrong pickup measure. (c) Transcription off by a th note.

3. PROPOSED METHOD

The proposed metric is calculated in two stages: in the first stage, the transcription is aligned with the ground-truth music notation based on its pitch content only, i.e., all of the other objects, such as rests, barlines, and time and key signatures are ignored; in the second stage, all of the objects occurring at the aligned portions of the scores are grouped together and compared. The metric reports the differences in aligned portions in terms of twelve aspects: barlines, clefs, key signatures, time signatures, notes, note spelling, note durations, stem directions, groupings, rests, rest duration, and staff assignment.

Some algorithms to efficiently calculate certain edit distances, e.g., the Wagner-Fischer algorithm to calculate the Levenshtein distance between two strings, are able to align two sequences and calculate the edit costs in a single stage. We initially tried to apply the same strategy to our problem, but we discovered that the algorithm was not sufficiently robust, especially with transcriptions highly corrupted by wrong meter estimation. Intuitively, notes are the most salient aspects of music, so it is arguable that the alignment of two transcriptions should be based primarily on that aspect, while the overall quality of the transcription should be judged on a variety of other aspects.

The ground truth and the transcription are both encoded in MusicXML, a standard format to share sheet music files between applications []. The two scores are aligned using Dynamic Time Warping (DTW) []. The local distance is simply the number of mismatching pitches, regardless of duration, spelling and staff positioning.

To illustrate the purpose of the initial alignment, we show two examples in Fig. 2 and Fig. 3. The alignment stage outputs a list of pairs of aligned beats.

Figure 2: Alignment between the ground truth (top) and a transcription (bottom) of Bach's Minuet in G. Arrows indicate aligned beats.

Figure 3: Alignment between the ground truth (top) and another transcription (bottom) of Bach's Minuet in G. Arrows indicate aligned beats.

Fig. 2 shows the alignment of a fairly good transcription of Bach's Minuet in G, from the Notebook for Anna Magdalena Bach, with the ground truth. It corresponds to the following sequence, expressed in beats, numbered as quarter notes starting from 0 (GT is ground truth, T is transcription):

GT 0.0.0..0..0.0 T 0.0.0..0..0.0.0.0.0.0..0..0.0.0.0.0..0..0.0.0.0.0.0..0..0.0.0.0.0..0..0.0..0..0.0..0.

In this case, since the transcription is properly aligned with the ground truth, the sequence is just a list of equal numbers, one for each onset of the notes in the score. However, beat .0 in the ground truth is matched with beats .0 and .0 in the transcription; the same happens for beats .0 and .0, since DTW cannot properly distinguish repeated pitches. Only one alignment is shown in the figure for clarity.

Fig. 3 shows an example of an alignment for a badly aligned transcription of the same piece. The corresponding sequence is the following:

GT 0.0 0.0 0.0.0.0. T 0.0 0..0..0..0..0.0.0.0.0.0....0..0.0.0.0.0.0..0.0...0....0..0.0.0.0.0.0..0...0.0

In this case, multiple beats in the transcription correspond to the same beat in the ground truth, e.g., beat .0 in the ground truth corresponds to beats . and .0 in the transcription, because a single note in the ground truth has been transcribed as two tied notes. Only one alignment is shown in the figure for clarity.

To calculate the distance between the two aligned scores, we proceed by first grouping all of the musical objects occurring inside aligned portions of the two scores into sets, thus losing the relative location of the objects within each set but preserving all of the other aspects, including staff assignment. Then the aligned sets are compared, and the differences between the two sets are reported separately. The following aspects only allow binary matching: barlines, clefs, key signatures, and time signatures.
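The alignment stage can be illustrated with a minimal sketch. It assumes that each score has already been reduced to a sequence of onsets, each represented as a set of MIDI pitch numbers; the names `local_distance` and `dtw_align` are hypothetical and are not taken from the authors' implementation. The local distance is taken here to be the size of the symmetric difference between two pitch sets, which is one plausible reading of "number of mismatching pitches", and the step pattern is the textbook DTW one.

```python
from typing import FrozenSet, List, Sequence, Tuple

PitchSet = FrozenSet[int]  # assumed encoding: MIDI pitches starting at one onset


def local_distance(a: PitchSet, b: PitchSet) -> int:
    """Count mismatching pitches, ignoring duration, spelling and staff."""
    return len(a ^ b)


def dtw_align(gt: Sequence[PitchSet],
              tr: Sequence[PitchSet]) -> Tuple[int, List[Tuple[int, int]]]:
    """Return the total DTW cost and the list of aligned (gt, tr) index pairs.

    Both sequences are assumed to be non-empty.
    """
    n, m = len(gt), len(tr)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(gt[i - 1], tr[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both scores
                                 cost[i - 1][j],      # advance ground truth only
                                 cost[i][j - 1])      # advance transcription only
    # Backtrack from the end to recover the warping path.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return cost[n][m], path


# A ground-truth note (pitch 62) transcribed as two tied notes at separate onsets.
gt = [frozenset({60}), frozenset({62}), frozenset({64})]
tr = [frozenset({60}), frozenset({62}), frozenset({62}), frozenset({64})]
print(dtw_align(gt, tr))  # (0, [(0, 0), (1, 1), (1, 2), (2, 3)])
```

In this toy case the second ground-truth onset is paired with two transcription onsets, which mirrors the many-to-one pairs described for tied notes above.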
Rests are matched for duration and staff assignment, i.e., a rest with the correct duration but on the wrong staff will be considered a staff assignment error, and a rest with the correct staff assignment but wrong duration will be considered a rest duration error. A missing or an extra rest will be considered a rest error. Notes are matched for spelling, duration, stem direction, staff assignment, and grouping into chords. For groupings, we only report the absolute value of the difference between the number of chords present in the two sets. The metric does not distinguish missing or extra elements. These choices were dictated by simplicity of design and implementation.

Figure 4: Correlation between the predicted ratings and the average human evaluator ratings of all of the transcriptions in the dataset. Panels: (a) Pitch Notation, (b) Rhythm Notation, (c) Note Positioning, each plotting predicted score against evaluator score.

All of the errors are accumulated for all of the matching sets. The errors for barlines, notes, note spelling, note durations, stem directions, groupings, rests, rest duration, and staff assignment are then normalized by dividing the total number of errors for each aspect by the total number of musical objects taken into account in the score. This step is necessary to normalize the number of errors for pieces of different lengths. The errors for clefs, key signatures, and time signatures are not normalized, as they are typically global aspects of the scores, and not influenced by the length of the piece. This might be a limitation for pieces with frequent changes in key signature or time signature.

As an example, the set of objects at the first beat of the first measure of Fig. 2 includes the initial barlines, clefs, time signature, key signature, and notes starting on the downbeat of the measure. Barlines, clefs, time signature, and key signature are all correctly matched. All of the notes are correct in pitch, spelling and duration; however, there are two errors in stem direction, one error in grouping, and one error in staff assignment. All of the rests are considered rest errors at their respective onsets. For the first beat of the first measure of Fig. 3, all of the elements of the transcription up to the first transcribed notes (the three notes pointed to by the first arrow) and the notes tied to them will be considered as part of the same set. The wrong key signature and time signature will be reported as errors. The two eighth rests will be reported as rest errors. The three notes in the transcription are properly spelled, but their duration is wrong, so that will be counted as three note duration errors.
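The note-level comparison and the normalization step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the `Note` record, the aspect names used as dictionary keys, and the functions `compare_note_sets` and `normalize` are all hypothetical, notes are paired by MIDI pitch only, and rests, barlines and groupings are left out for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

# Aspects assumed to be global and therefore left unnormalized.
UNNORMALIZED = {"clefs", "key_signatures", "time_signatures"}


@dataclass(frozen=True)
class Note:
    midi: int        # pitch as a MIDI number
    spelling: str    # e.g. "F#" or "Gb"
    duration: float  # in quarter notes
    stem: str        # "up" or "down"
    staff: int       # staff number


def compare_note_sets(gt: List[Note], tr: List[Note]) -> Counter:
    """Count per-aspect differences between two aligned sets of notes."""
    errors: Counter = Counter()
    unmatched = list(tr)
    for g in gt:
        match = next((t for t in unmatched if t.midi == g.midi), None)
        if match is None:
            errors["notes"] += 1  # note missing from the transcription
            continue
        unmatched.remove(match)
        errors["note_spelling"] += g.spelling != match.spelling
        errors["note_durations"] += g.duration != match.duration
        errors["stem_directions"] += g.stem != match.stem
        errors["staff_assignment"] += g.staff != match.staff
    errors["notes"] += len(unmatched)  # extra notes count as note errors too
    return errors


def normalize(errors: Dict[str, int], n_objects: int) -> Dict[str, float]:
    """Divide length-dependent error counts by the number of objects scored."""
    return {aspect: (count if aspect in UNNORMALIZED else count / n_objects)
            for aspect, count in errors.items()}


gt = [Note(62, "D", 1.0, "up", 1), Note(66, "F#", 1.0, "up", 1),
      Note(69, "A", 1.0, "up", 1)]
tr = [Note(66, "Gb", 0.5, "up", 1), Note(69, "A", 1.0, "down", 2)]
counts = compare_note_sets(gt, tr)
print(dict(counts))
print(normalize(counts, n_objects=4))
```

Pairing notes by pitch keeps a misspelled or misplaced note from being counted as both a missing and an extra note, which matches the separate reporting of spelling, duration, stem and staff errors described above.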
In the same example, the missing D from the chord will be reported as a note error. The extra tied notes will be reported as note errors as well.

In summary, the following twelve normalized error counts are calculated by the metric: barlines, clefs, key signatures, time signatures, notes, note spelling, note durations, stem directions, groupings, rests, rest duration, and staff assignment. In order to translate these error counts into a musically relevant evaluation, we propose to use linear regression of the twelve error counts to fit human ratings of three musical aspects of automatic transcriptions, i.e., the pitch notation, the rhythm notation, and the note positioning. For each aspect, the linear regression learns twelve weights, one for each of the normalized error counts, to fit the human ratings. These weights can then be used to predict the human ratings of other music notation transcriptions.

4. EXPERIMENTAL RESULTS

To evaluate the proposed approach, we calculate the normalized error counts and run linear regression to fit human ratings of short music excerpts collected in our prior work []. These music excerpts were from the Kostka-Payne music theory book, all of them piano pieces by well-known composers, and were performed on a MIDI keyboard by a semi-professional piano player. These excerpts were then transcribed into music notation using four different methods: a novel method proposed in the paper (which will be referred to as CDT), MuseScore, GarageBand and Finale. For each transcription, the human evaluators were asked to assign a numerical rating between and for three musical aspects, i.e., the pitch notation, the rhythm notation, and the note positioning.

The proposed method of calculating the error counts uses MusicXML [], the de facto standard for sharing sheet music files between applications, as the format of music notation. Two of the methods evaluated in the paper (Finale and MuseScore) can output the scores into MusicXML. For GarageBand, CDT and the ground truth, however, MusicXML was not available or was difficult to output automatically, so we had to manually convert the scores into MusicXML. The transcribed scores are named with the initial of the transcription method and a number indicating the excerpt. So, M-8.mxl represents the eighth excerpt transcribed with MuseScore. The letter K, for Kostka-Payne, indicates the ground truth scores. This dataset and a Python implementation of the proposed approach are available at http://www.ece.rochester.edu/~acogliat/repository.html. The implementation uses the music21 toolkit [] for parsing the MusicXML files and processing the imported scores. The implementation has been tested with music21 V..0.

In order to validate the quality of the prediction we calculated the coefficient of determination R², which is the square of the Pearson correlation coefficient. The R² was 0. for the pitch notation correlation, 0. for the rhythm notation, and 0.0 for note positioning. These results are reflected in Fig. 4; the proposed metric fits the data adequately, in general, even though the correlation is not perfect. It can also be noted that the prediction of the score for note positioning is the best, while the prediction of the score for rhythm notation is the worst.

To understand the underlying causes of the covariance we first analyzed the ratings given by the human evaluators. As we can see from Fig. 5, the human evaluators were oftentimes in disagreement among themselves.
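The regression and the coefficient of determination used in this section can be sketched in plain Python. This is an illustrative sketch only: the function names `fit_least_squares`, `predict` and `r_squared` are hypothetical, the solver is ordinary least squares with an intercept via the normal equations (the paper does not say how its regression was solved), and the toy data uses two made-up error counts per transcription instead of twelve.

```python
from typing import List, Sequence


def fit_least_squares(X: Sequence[Sequence[float]],
                      y: Sequence[float]) -> List[float]:
    """Return [intercept, w1, ..., wk] minimizing the squared rating error."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    # Augmented normal equations: (X^T X | X^T y).
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * t for r, t in zip(rows, y))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("error counts are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    """Predicted rating: intercept plus weighted sum of error counts."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


def r_squared(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Square of the Pearson correlation (undefined for constant inputs)."""
    n = len(observed)
    mp, mo = sum(predicted) / n, sum(observed) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_o = sum((o - mo) ** 2 for o in observed)
    return cov * cov / (var_p * var_o)


# Toy data: two hypothetical normalized error counts and a mean human rating.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [10.0, 7.0, 8.0, 5.0, 2.0]
w = fit_least_squares(X, y)
print(w, r_squared([predict(w, x) for x in X], y))
```

One set of weights would be fitted per rated aspect (pitch notation, rhythm notation, note positioning). For a least-squares fit with an intercept, the squared Pearson correlation between fitted and observed ratings coincides with the usual coefficient of determination.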
It must also be noted that in our prior work [], the human annotators were not given exact instructions on what features to consider for the evaluation, so a considerable amount of subjectivity and judgment calls were likely to be present in the ratings.

Figure 5: Distributions of the human ratings of the transcriptions contained in the dataset. Each boxplot represents the ratings from human evaluators. Panels: (a) Pitch Notation, (b) Rhythm Notation, (c) Note Positioning, each plotting score against piece.

We also analyzed two transcriptions with the largest deviation from the predicted ratings, i.e., one transcription with a high predicted rating and a low human rating, and one transcription with a low predicted rating and a high human rating. The largest positive deviation occurred for the rhythm notation of transcription M-1, for which the proposed metric predicted a rating of ., while the average human rating was .. If we compare the transcription with the ground truth in Fig. 6 we can see that MuseScore misinterpreted the meter, causing the proposed metric to report a large number of note duration errors and barline errors, which resulted in a low rating. Human annotators, on the other hand, likely penalized the meter error only once globally, but still considered the transcription acceptable overall.

Figure 6: Transcription of the first excerpt in the dataset by MuseScore, which shows the largest positive difference between the average human rating and the predicted rating, that is a high human rating and a low predicted rating. This evaluation difference occurs on the rhythm notation. (a) Ground truth (K-1). (b) M-1.

The largest negative deviation occurred for the pitch notation of transcription C-13, for which the proposed metric predicted a rating of ., while the annotators assigned an average score of .. If we compare the transcription with the ground truth in Fig. 7, we can notice that CDT makes a single mistake in notating the pitches, i.e., G instead of E. It also makes a systematic error notating all Bs one octave lower. Finally, not grouping the eighth notes in the treble staff makes the transcription hard to read. Possibly, the human annotators penalized the transcription because of its poor readability.

5. CONCLUSION AND FUTURE WORK

In this paper we proposed an objective metric to measure the differences between music notation transcriptions and the ground truth score. The metric is calculated by first aligning the pitch content of the transcription and the ground-truth music notation, and then counting the differences in twelve key musical aspects: barlines, clefs, key signatures, time signatures, notes, note spelling, note durations, stem directions, groupings, rests, rest duration, and staff assignment. We then used linear regression to predict human evaluator ratings along three aspects of music notation, namely, pitch notation, rhythm notation, and note positioning, from the error counts. Experiments show a clear correlation between the predicted ratings and the average human ratings, even though the correlation is not perfect.
One issue with the prediction is the high variance of the evaluator ratings, which likely originates from the inherent subjectivity of the tasks. Another issue of the proposed metric is that it does not incorporate music theory knowledge, such as the method proposed by Temperley to evaluate metrical models [].

Figure 7: Transcription of the thirteenth excerpt in the dataset by CDT, which shows the largest negative deviation between the average human rating and the predicted rating, that is a low human rating and a high predicted rating. This evaluation difference occurs on the pitch notation. (a) Ground truth (K-13). (b) C-13.

The current experiments were conducted on music notation transcriptions of human performances recorded on a MIDI keyboard; as a consequence, the transcriptions do not contain the errors commonly observed in audio-to-MIDI conversion processes, such as octave errors and extra or missing notes [,]. More research is necessary to evaluate the performance of the proposed method in the presence of such errors. In addition, the excerpts in the dataset were very short compared to real piano pieces, so additional research is necessary to assess the robustness of the metric and its computational complexity on longer pieces.

A Python implementation of the proposed approach, along with the dataset, is available at http://www.ece.rochester.edu/~acogliat/repository.html. This implementation can be used to calculate the twelve error counts as well as to predict human ratings on the three musical aspects of a music notation transcription.

6. REFERENCES

[] Mert Bay, Andreas F. Ehmann, and J. Stephen Downie. Evaluation of Multiple-F0 Estimation and Tracking Systems. In Proc. of International Society for Music Information Retrieval (ISMIR), pages 0, 00.
[] Emmanouil Benetos, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, and Anssi Klapuri. Automatic music transcription: challenges and future directions. Journal of Intelligent Information Systems, ():0, 0.
[] Emilios Cambouropoulos. From MIDI to traditional musical notation. In Proc. of the AAAI Workshop on Artificial Intelligence and Music: Towards Formal Models for Composition, Performance and Analysis, volume 0, 000.
[] Ali Taylan Cemgil. Bayesian music transcription. PhD thesis, Radboud University Nijmegen, 00.
[] Andrea Cogliati and Zhiyao Duan. Piano Music Transcription Modeling Note Temporal Evolution. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages, Brisbane, Australia, 0. IEEE.
[] Andrea Cogliati, Zhiyao Duan, and Brendt Wohlberg. Piano Transcription with Convolutional Sparse Lateral Inhibition. IEEE Signal Processing Letters, ():, 0.
[] Andrea Cogliati, David Temperley, and Zhiyao Duan. Transcribing human piano performances into music notation. In Proc. of International Society for Music Information Retrieval (ISMIR), 0.
[] Tom Collins, Sebastian Böck, Florian Krebs, and Gerhard Widmer. Bridging the audio-symbolic gap: The discovery of repeated note content directly from polyphonic music audio. In Audio Engineering Society Conference: d International Conference: Semantic Audio, 0.
[] Kazuki Ochiai, Hirokazu Kameoka, and Shigeki Sagayama. Explicit beat structure modeling for non-negative matrix factorization-based multipitch analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages, 0.
[] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, ():,.
[] Haruto Takeda, Naoki Saito, Tomoshi Otsuki, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Hidden Markov model for automatic transcription of MIDI signals. In Multimedia Signal Processing, 00 IEEE Workshop on, pages, 00.
[] David Temperley. An Evaluation System for Metrical Models. Computer Music Journal, 00.
[0] David Temperley. Music and probability. The MIT Press, 00.
[] David Temperley. A unified probabilistic model for polyphonic music analysis. Journal of New Music Research, ():, 00.
[] Michael Scott Cuthbert and Christopher Ariza. music21: A Toolkit for Computer-Aided Musicology and Symbolic Music Data. In Proc. of International Society for Music Information Retrieval (ISMIR), 0.
[] Michael Good. MusicXML for notation and analysis. The virtual score: representation, retrieval, restoration, :, 00.
[] Harald Grohganz, Michael Clausen, and Meinard Müller. Estimating Musical Time Information from Performed MIDI Files. In Proc. of International Society for Music Information Retrieval (ISMIR), pages 0, 0.
[] Ioannis Karydis, Alexandros Nanopoulos, Apostolos Papadopoulos, Emilios Cambouropoulos, and Yannis Manolopoulos. Horizontal and vertical integration/segregation in auditory streaming: a voice separation algorithm for symbolic musical data. In Proc. th Sound and Music Computing Conference (SMC00), 00.
[] Jonathan M. Keith, editor. Bioinformatics, volume of Methods in Molecular Biology. Springer New York, New York, NY, 0.
[] Meinard Müller. Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer, 0.
[] Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, ():, 00.