arXiv:1612.02198v2 [cs.SD] 13 Dec 2016


Towards computer-assisted understanding of dynamics in symphonic music

Maarten Grachten 1, Carlos Eduardo Cancino-Chacón 2, Thassilo Gadermaier 2 and Gerhard Widmer 1,2

1 Department of Computational Perception, Johannes Kepler University, Linz, Austria
2 Austrian Research Institute for Artificial Intelligence, Vienna, Austria

To appear in IEEE Multimedia (https://www.computer.org/multimedia-magazine/). © IEEE

Abstract

Many people enjoy classical symphonic music. Its diverse instrumentation makes for a rich listening experience. This diversity adds to the conductor's expressive freedom to shape the sound according to their imagination. As a result, the same piece may sound quite different from one conductor to another. Differences in interpretation may be noticeable subjectively to listeners, but they are sometimes hard to pinpoint, presumably because of the acoustic complexity of the sound. We describe a computational model that interprets dynamics (expressive loudness variations in performances) in terms of the musical score, highlighting differences between performances of the same piece. We demonstrate experimentally that the model has predictive power, and give examples of conductor idiosyncrasies found by using the model as an explanatory tool. Although the present model is still in active development, it may pave the road for a consumer-oriented companion to interactive classical music understanding.

Keywords: Machine Learning, Musicology, Musical Expression, Computational Modeling, Neural Networks

1 Introduction

When you ask visitors of a classical concert what it is that makes attending a concert worthwhile you will get a variety of answers, but some likely reasons are that a good performance affects you emotionally, and can be so immersive that you forget the world around you [Roose, 2008]. The related question, what it is that makes a good performance, is equally ambiguous, and ultimately depends on personal taste. Nevertheless, it has long been known that music is more engaging when it is played expressively. A straightforward mechanical reproduction of a written musical score, as a computer would produce it, typically sounds dull, and to the trained listener it may sound odd, or even plain wrong. The performance of a piece of music can be called expressive when it conveys information to the listener that a literal, mechanical rendition of the score would not. The information may be an affective quality (for instance, the listener may perceive a piece as being performed solemnly, or joyfully), but the performance may also express structural information about the music (for instance, the listener may notice from the way the music is performed that a musical phrase is coming to an end).

Musicians convey such information by varying the way they perform the written score. Along with tempo and articulation, one of the more salient expressive aspects of the performance is dynamics: variations in loudness of the performance for the purpose of musical expression. By varying these parameters during the interpretation of the piece, musicians make a performance sound more natural and alive. Careful control of these parameters also allows musicians to create phrasing in the music, producing a perception of coherence in the music over longer time spans.

Appreciation of the music by listeners is facilitated by their familiarity with the piece, and music understanding in general. This is reflected in a desire for information, expressed by concert-goers, about the music they are to hear in the concert [Melenhorst and Liem, 2015]. Although it is relatively straightforward to obtain biographical or historical information about a composer or a piece, through a web or library search, there are few facilities that help listeners become familiar with the musical details of a piece, or a specific performance.
Apart from synchronized musical scores available nowadays in music videos from online services such as YouTube, a notable step in that direction is the iPad Magazine by the Dutch Royal Concertgebouw Orchestra, RCO Editions, in which recorded performances can be played back in sync with the musical score. The data used in the experimental evaluation of the model presented here originates (mostly) from a collaboration between RCO and the Austrian Research Institute for Artificial Intelligence (OFAI) to produce the score-synchronization in the RCO Editions.

Although musical expression plays an important role in the musical experience, to date there are virtually no facilities for the interested listener to learn more about the expressive aspects of music. A tool that can attribute variations in the expressive quality of a performance to factors like performance directives (like crescendo, diminuendo, and fermata), and other aspects of the written score, may elucidate expressive intentions of the conductor to listeners, thereby stimulating their engagement with, and understanding of, the music. In this way, it addresses the needs of (actual or potential) classical music listeners, who in a user study "...expressed interest in the structure of the music, the composer's intention, the conductor's interpretation, and the discovery of style differences in comparison to recordings." [Melenhorst and Liem, 2015, Sec 7.2]. Such a tool may be part of an active music listening interface [Goto, 2007] for classical music such as the integrated prototype¹ of the PHENICX project [Liem et al., 2015].

¹ http://beta.phenicx.com/

An important question to be addressed in the development of an end user tool for understanding dynamics is what level of information is appropriate for the end user. Inexperienced listeners may benefit most from a simple approach, such as merely highlighting the parts of a piece where two performances differ substantially. For such use cases, where the need for an explanatory model is reduced, a more descriptive approach like that of Liem and Hanjalic [2015] may be useful. Musically trained listeners, however, may be interested in further details, and may be helped by a tool that, during the playback of a performance, highlights aspects of the score that explain expressive peculiarities of the current performance with respect to a typical performance of the same piece. Finally, a tool for musicologists would not only provide a qualitative explanation of differences in expressive interpretation between performances, but also allow for the use of these explanations in a comparative analysis of sets of performances in terms of expression, uncovering consistent expressive strategies of conductors, or grouping performances in terms of their expressive characteristics.

The purpose of the current article is to pave the way for such a facility. More specifically, we present a computational model of dynamics in music, and show how such a model may help to understand the factors that contribute to dynamics. Although the user requirements may vary considerably for the different use cases described above, we believe it is desirable to have a unified technological basis for computer-assisted understanding of dynamics at different levels of user expertise. In this paper we focus on the capability of the proposed model to extract relatively detailed information from a performance, which we believe is most useful for expert users, like musicologists. Although beyond the scope of this paper, our belief is that when the model can extract useful information for this class of users, appropriate summarization and selection of the extracted information can help the model cater to use cases involving less experienced users.
In the following, we give a brief overview of related work in computational modeling of musical expression, and continue to give a description of the model. We report an experimental evaluation of the model, showing that it explains a considerable portion of the dynamics in recorded performances, on the basis of only the written score. After this validation, we illustrate how the model may be used as a tool to visualize and explain differences in dynamics between performances.

2 Computational modeling of musical expression

Research into musical expression is ongoing, and our knowledge of underlying principles and mechanisms is far from complete. In this light, computational modeling, machine learning, and data analysis methods have proven helpful in the study of musical expression, which largely relies on tacit knowledge by the musician. For example, Canazza et al. [2003] derive a mapping between sensorial expressive adjectives and acoustic attributes of the performance. Widmer [2003] uses machine learning to infer simple rules from a large set of performances of Mozart piano sonatas, linking (for instance) expressive timing patterns to rhythmic characteristics of the music. Another approach is taken by Friberg et al. [2006], where a set of rules predicts timing, dynamics and articulation, based on local musical context. Using the rules from the system as macro-rules to model larger time scale performance trends, Bresin [1998] combined this symbolic model with a neural network to complement the macro-rules with micro-rules learned from a set of recorded performances.

Another computational model is proposed by Grachten and Widmer [2012]. This model represents score information in terms of basis-functions (BFs), and models dynamics as a linear combination of those basis-functions. A distinctive feature of the model is that in addition to pitch and time information, it incorporates dynamic markings such as crescendo and diminuendo signs, by which the composer suggests a particular way of performing the piece. Subsequent versions of the model have proven more effective by dropping the linearity constraint, allowing for non-linear combinations of basis-functions [Cancino Chacón and Grachten, 2015], and temporal dependencies [Grachten and Cancino Chacón, 2016]. Liem and Hanjalic [2015] propose a method for comparing performances by principal component analysis of audio spectrograms. This method has the advantage that especially aspects like timbre can be studied, but it is purely descriptive of differences in the audio, rather than linking differences to particular information in the score, as in the method proposed here.

3 A computational expression model for ensemble performance

The tool we present here is based on the basis-function modeling (BM) approach mentioned above. In this approach, the dynamics of a recorded ensemble performance over time is modeled as a combination of a set of basis-functions that describe the musical score. Figure 1 illustrates the idea of modeling dynamics using basis-functions schematically. Although basis-functions can be used to represent arbitrary properties of the musical score, the BM framework was proposed with the specific aim of modeling the effect of dynamic markings: hints in the musical score to play a passage in a particular way. For example, p (for piano) tells the performer to play a particular passage softly, whereas a passage marked f (for forte) should be performed loudly.
Thus, p and f specify constant dynamic levels, and are modeled using a step-like function. Gradual increase/decrease of dynamics (crescendo/diminuendo), indicated by right/left-oriented wedges, respectively, are encoded by ramp-like functions. A third class of dynamic markings, such as marcato (the hat over a note), or markings like sforzato (sfz), or forte piano (fp), indicate the accentuation of that note/chord. This class of markings is represented through (translated) unit impulse functions.

A set of basis-functions $\varphi$ is then combined in a function $f$, parametrized by a vector of weights $\mathbf{w}$, to approximate the dynamics measured from a recorded performance of the score. The function $f$ may be a simple linear function, or a complex non-linear function, such as a neural network. In either case, the vector of weights $\mathbf{w}$ determines how the basis-functions $\varphi$ influence the estimated dynamics $f$, and it is through adjusting $\mathbf{w}$ that the model can be trained to predict dynamics, given a set of recordings.

Figure 1: Schematic view of dynamics as a function $f(\varphi; \mathbf{w})$ of basis-functions $\varphi = (\varphi_1, \ldots, \varphi_K)$, representing dynamic annotations and metrical basis-functions.

The example basis-functions given above are schematic functions, representing a region (step functions), a transition (ramp functions), or the occurrence of some instantaneous event (impulse functions). Such functions are not limited to representing dynamic markings. For example, Figure 1 shows two impulse functions, representing the first and second beats in each bar, respectively. Similarly (but not shown), transition functions effectively encode slurs and phrase marks. In addition to such schematic functions, basis-functions may encode numeric attributes of notes, such as their pitch, duration, and the number of notes sounding simultaneously. By representing local, note level attributes, as well as mid (crescendi/diminuendi, slurs) and long (piano/forte marks, phrase-marks, repeats) range structures, the basis-function modeling approach allows for a uniform encoding of score information at different time scales.

The basis-function approach is similar to the multiple viewpoint (MV) representation of musical information proposed by Conklin and Witten [1995], in the sense that both aim to provide a uniform way of representing diverse aspects of musical information for the purpose of modeling. The main difference is that the latter has been conceived for use in Markov models that deal with discrete event spaces, and thus yields representations in terms of discrete values, whereas the former was designed for capturing quantitative relationships between score information and expressive parameters.

Before we address the question how the basis-functions can be used to model musical expression, we discuss how we represent dynamics as an expressive parameter, and how the ensemble scenario is different from the solo instrument scenario, since both issues have implications for the basis-function modeling approach.
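The three schematic basis-function shapes described in this section, and their combination by a weight vector, can be sketched in a few lines. This is only an illustration under stated assumptions: the time grid, the marking positions, the weights, and the choice to hold a ramp at its final level after the wedge ends are all hypothetical, and the combination shown is the simple linear variant of $f$.

```python
def step(times, start, end):
    """1 inside [start, end), 0 elsewhere: a constant-level marking such as p or f."""
    return [1.0 if start <= t < end else 0.0 for t in times]


def ramp(times, start, end):
    """Rises linearly from 0 to 1 over [start, end]: a crescendo wedge.

    Holding the value at 1 after `end` is an illustrative assumption.
    """
    return [min(max((t - start) / (end - start), 0.0), 1.0) for t in times]


def impulse(times, at):
    """1 at a single score position, 0 elsewhere: an accent such as sfz."""
    return [1.0 if t == at else 0.0 for t in times]


def linear_model(basis, weights, bias=0.0):
    """Weighted sum of the basis-functions at every time step."""
    n_steps = len(basis[0])
    return [
        bias + sum(w * phi[i] for w, phi in zip(weights, basis))
        for i in range(n_steps)
    ]


# Hypothetical score excerpt: time in beats, sampled every half beat.
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
phi = [
    step(times, 0.0, 2.0),   # p marking over the first two beats
    ramp(times, 2.0, 3.0),   # crescendo from beat 2 to beat 3
    impulse(times, 3.0),     # accent on beat 3
]
weights = [-1.0, 2.0, 0.5]   # hypothetical weights; in the model these are learned
dynamics = linear_model(phi, weights)
```

With these weights the soft passage sits below the baseline, the crescendo raises the level gradually, and the accent adds a brief extra peak on top of it.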

3.1 Measuring the dynamics of a music performance

In earlier work [Cancino Chacón and Grachten, 2015], we have restricted the model to solo piano recordings, available in the form of precise measurements of the piano key movements, using a Bösendorfer computer-monitored grand piano. In such recordings, the recorded hammer velocity of the piano keys is a direct representation of the dynamics, revealing how loud each individual note is played. For acoustic instruments other than the piano, such precise recording techniques are not available, and therefore the dynamics of a complete symphonic orchestra cannot be measured in a similar way. Another approach would be to record each instrument of the orchestra separately, and measure the loudness variations in each of the instruments. This approach is not feasible either, because apart from the financial and practical barriers, the live setting in which orchestras normally play prevents a clean separation of the recording channels by instrument. This means we are left with only a rudimentary measure of dynamics, namely the overall variation of loudness over time, measured from the orchestral recording.

Note that the loudness of a recording is affected by more than just dynamics. Room acoustics, microphone positioning, and various processing steps during production, possibly including audio compression, and level-adjustments between instrument groups, may all affect the final loudness of the recording. To some extent, such effects may be countered by normalizing the measured loudness per piece in terms of mean and variance.

We compute loudness of the recordings using the EBU-R-128 [2011] loudness measure, which takes into account human perception (i.e., the fact that signals of equal power but different frequency content are not perceived as being equally loud) and is now the recommended way of comparing loudness levels of audio content in the broadcasting industry. To obtain instantaneous loudness values, we compute the momentary variant of the measure on consecutive blocks of audio, using a block size and hop size of 1024 samples, at a sample rate of 44100 Hz.
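As a small worked example of the numbers above: a hop of 1024 samples at 44100 Hz gives one loudness value roughly every 23.2 ms. The sketch below computes the block times and applies a per-recording normalization by mean and standard deviation. It does not implement the EBU R 128 measure itself; the loudness values are hypothetical placeholders, and the use of the population standard deviation is an assumption.

```python
import statistics

SAMPLE_RATE = 44100  # Hz, as stated in the text
HOP_SIZE = 1024      # samples; equal to the block size, so blocks do not overlap


def block_times(n_blocks):
    """Start time in seconds of each consecutive block."""
    return [i * HOP_SIZE / SAMPLE_RATE for i in range(n_blocks)]


def normalize(loudness):
    """Subtract the mean and divide by the standard deviation of one recording."""
    mean = statistics.mean(loudness)
    std = statistics.pstdev(loudness)  # population std; an assumption
    return [(value - mean) / std for value in loudness]


# Hypothetical momentary loudness values for four consecutive blocks.
loudness = [-30.0, -24.0, -18.0, -24.0]
times = block_times(len(loudness))
normalized = normalize(loudness)
```

After normalization, two recordings of the same piece made at different overall levels become comparable in terms of their loudness variations only.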
Because we want to focus on variations in loudness, rather than the overall loudness level and range, we subtract the mean and divide by the standard deviation of the loudness values per recording.

Finally, in order to model loudness variations as a function of the score information, the performance of the piece must be aligned to the score. To that end we produce a synthetic audio rendering of the score, and align it to the recorded performance using the method described by Grachten et al. [2013]. Through the resulting score-performance alignment, the loudness curve computed from the recording can be indexed by musical time (such that we know the instantaneous loudness of the recording at, say, the second beat of measure 64 in the piece). Note that the correctness of the alignment is a prerequisite for the explanatory use of the expression model: an incorrect alignment would (at the misaligned passages) lead to an explanation of the loudness in terms of passages of the score that are not actually being played.

3.2 From solo to ensemble performance

In terms of modeling, there are several significant differences between a solo instrument setting and an ensemble setting. Firstly, in an ensemble setting, multiple sets of basis-functions are produced, each set describing the score part of a particular instrument. Furthermore, in a symphonic piece, multiple instantiations of the same instrument may be present. Lastly, different pieces may have different instrumentations. This poses a challenge to an expression model, which should account for the influence of instruments consistently from one piece to the other. We address these issues by defining a merging operation that combines the information of different sets of basis-functions for each instance of an instrument into a single set of basis-functions per instrument class.

The way dynamics is measured and represented (see Section 3.1) also has repercussions for the basis-function modeling approach. In contrast to the digital grand piano setting, the overall loudness measured from an orchestral recording does not provide a loudness value for each performed note, but one per time instant. Thus, basis-function information describing multiple simultaneous notes must be combined to account for a single loudness value. We do so by defining fusion operators for subsets of basis-functions. In most cases, we use the average operator as a default. For some basis-functions however, we use the sum operator, in order to preserve information about the number of instances that were fused into a single instrument. Future experimentation should provide more informed choices as to the optimal fusion operators to use. Both the merging and the fusion operations are illustrated for a musical excerpt in Figure 2.

3.3 Linear, non-linear, and temporal models

Initial versions of the basis-function expression model used a linear model (Lin) [Grachten and Widmer, 2012]. In a linear model, the expressive parameters are simply a weighted sum of the basis-functions, where the parameters of the model are the weights for each basis-function, to be estimated based on training data. A major advantage of a linear model is that the link between the basis-functions and the predictions is very clear: the weight for a basis-function expresses how strongly the basis-function influences the output.
This makes it easy to perform a qualitative analysis of what the model has learned, and by fitting the model on a particular piece, or on several pieces by the same performer, the weights may also capture characteristics of the expressive quality of a piece, or performer [Grachten and Widmer, 2012].

The simplicity of linear modeling is at the same time a drawback. There are two main limitations to the linear approach. Firstly, the shape of a basis-function can only be used literally (apart from scaling and vertical translation) to approximate an expressive parameter. For example, a crescendo annotation is schematically represented as a ramp function, and this means that any increase of loudness in that region can only be approximated as a linear slope. In reality, it is likely that the shape of the loudness increase is not strictly linear. Secondly, the linear approach does not model any interactions between basis-functions. To overcome these limitations, Cancino Chacón and Grachten [2015] proposed a non-linear basis-function model for expression, based on a feed-forward neural network (FFNN), showing the advantages over linearity of the model, both in the non-linear transformation of the basis-functions, and in the interaction between basis-functions.
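The merging and fusion operations of Section 3.2 can be made concrete with the Oboe values of Figure 2. The sketch below is a minimal illustration, not the authors' implementation: the data layout (a list of onset/value pairs per part), the function names, and the choice of which basis-function gets the sum operator are assumptions made for the example.

```python
from collections import defaultdict


def note(onset, pitch, dur, f, p):
    """One note: onset time plus the values of four example basis-functions."""
    return onset, {"pitch": pitch, "dur": dur, "f": f, "p": p}


# Values taken from the two score parts shown in Figure 2.
oboe_1 = [note(0, 72, 1, 1, 0), note(1, 64, 0.5, 1, 0), note(1.5, 62, 0.5, 1, 0)]
oboe_2_3 = [note(0, 67, 1, 0, 1), note(0, 64, 1, 0, 1),
            note(1, 71, 1, 0, 1), note(1, 59, 1, 0, 1)]


def mean(values):
    return sum(values) / len(values)


def merge(*parts):
    """Pool the notes of all instances of one instrument class, ordered by onset."""
    merged = [n for part in parts for n in part]
    return sorted(merged, key=lambda n: n[0])


def fuse(notes, operators=None):
    """Reduce all notes sharing an onset to one value per basis-function.

    The average is the default operator; `operators` can override it per
    basis-function name, for instance with `sum`.
    """
    operators = operators or {}
    groups = defaultdict(list)
    for onset, values in notes:
        groups[onset].append(values)
    fused = {}
    for onset in sorted(groups):
        names = groups[onset][0].keys()
        fused[onset] = {
            name: operators.get(name, mean)([v[name] for v in groups[onset]])
            for name in names
        }
    return fused


merged = merge(oboe_1, oboe_2_3)
fused = fuse(merged)
# Hypothetical choice: summing "p" counts how many simultaneous notes are marked piano.
counted = fuse(merged, operators={"p": sum})
```

Rounded to two decimals, the averaged values agree with the fused matrix of Figure 2, for example a mean pitch of about 68 and a mean duration of 1 at time 0.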

Score parts (Oboe I and Oboe II, III):

| t | note | Oboe I $\varphi_{pitch}$ | Oboe I $\varphi_{dur}$ | Oboe I $\varphi_{f}$ | Oboe I $\varphi_{p}$ | Oboe II, III $\varphi_{pitch}$ | Oboe II, III $\varphi_{dur}$ | Oboe II, III $\varphi_{f}$ | Oboe II, III $\varphi_{p}$ |
|---|---|---|---|---|---|---|---|---|---|
| 0 | $x_1$ | 72 | 1 | 1 | 0 | - | - | - | - |
| 0 | $x_2$ | - | - | - | - | 67 | 1 | 0 | 1 |
| 0 | $x_3$ | - | - | - | - | 64 | 1 | 0 | 1 |
| 1 | $x_4$ | 64 | 0.5 | 1 | 0 | - | - | - | - |
| 1 | $x_5$ | - | - | - | - | 71 | 1 | 0 | 1 |
| 1 | $x_6$ | - | - | - | - | 59 | 1 | 0 | 1 |
| 1.5 | $x_7$ | 62 | 0.5 | 1 | 0 | - | - | - | - |

After merging (Oboe):

| t | note | $\varphi_{pitch}$ | $\varphi_{dur}$ | $\varphi_{f}$ | $\varphi_{p}$ |
|---|---|---|---|---|---|
| 0 | $x_1$ | 72 | 1 | 1 | 0 |
| 0 | $x_2$ | 67 | 1 | 0 | 1 |
| 0 | $x_3$ | 64 | 1 | 0 | 1 |
| 1 | $x_4$ | 64 | 0.5 | 1 | 0 |
| 1 | $x_5$ | 71 | 1 | 0 | 1 |
| 1 | $x_6$ | 59 | 1 | 0 | 1 |
| 1.5 | $x_7$ | 62 | 0.5 | 1 | 0 |

After fusion (Oboe):

| t | note | $\varphi_{pitch}$ | $\varphi_{dur}$ | $\varphi_{f}$ | $\varphi_{p}$ |
|---|---|---|---|---|---|
| 0 | $x_1$ | 68 | 1 | 0.33 | 0.67 |
| 1 | $x_2$ | 65 | 0.83 | 0.33 | 0.67 |
| 1.5 | $x_3$ | 62 | 0.5 | 1 | 0 |

Figure 2: Illustration of merging and fusion of score information of two different parts belonging to the same instrument class Oboe. The first matrix shows four example basis-functions, $\varphi_{pitch}$, $\varphi_{dur}$, $\varphi_{f}$ and $\varphi_{p}$, for the notes of each of the two score parts. The second matrix is the result of merging basis-functions of different Oboe instantiations into a single set. The last matrix is the result of fusion, applied per basis-function to each set of values occurring at the same time point.

A more powerful type of non-linear modeling can be obtained by introducing recurrence relations to the neural network architecture: recurrent neural networks (RNNs) are a particular kind of discrete time dynamical artificial neural networks suited for analyzing sequential data, such as time-series. These dynamic models have been successfully used for generating text sequences, handwriting synthesis and modeling motion capture data [Graves, 2013]. The structure of an RNN is similar to that of a feed-forward neural network, with the addition of connections among its subsequent hidden states, allowing information from the past to influence the hidden state that corresponds to the present. RNNs are not limited to forward temporal dependencies: correlations between present and future events may be modeled by backward temporal connections. An RNN with both forward and backward connections is referred to as a bi-directional RNN (biRNN).
The benefits of such models have been demonstrated in the context of expressive timing variations in Chopin piano music [Grachten and Cancino Chacón, 2016]. Figure 3 illustrates how the biRNN is used to predict the dynamics of a performed piece, given the basis-function description of the piece $\varphi$. The graph structure of an FFNN is similar, but lacks the lateral connections between hidden states $h$, and only has a single hidden layer $h$, replacing $h^{(\mathrm{fw})}_t$ and $h^{(\mathrm{bw})}_t$.
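To illustrate how forward and backward hidden states jointly determine each prediction, the following sketch runs a tiny bi-directional recurrence in plain Python. It is a simplified illustration, not the trained model: the tanh units, the two-unit hidden layers, the absence of hidden biases, and all weight values are assumptions chosen for readability.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product on plain lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_states(inputs, w_in, w_rec, reverse=False):
    """Hidden state per time step of a tanh recurrence, optionally run right-to-left."""
    n_steps = len(inputs)
    order = range(n_steps - 1, -1, -1) if reverse else range(n_steps)
    hidden = [0.0] * len(w_in)
    states = [None] * n_steps
    for t in order:
        pre = [a + b for a, b in zip(matvec(w_in, inputs[t]), matvec(w_rec, hidden))]
        hidden = [math.tanh(x) for x in pre]
        states[t] = hidden
    return states


def birnn_predict(inputs, params):
    """One output per time step from forward and backward hidden states."""
    h_fw = rnn_states(inputs, params["w_in_fw"], params["w_rec_fw"])
    h_bw = rnn_states(inputs, params["w_in_bw"], params["w_rec_bw"], reverse=True)
    return [
        params["bias"]
        + sum(v * h for v, h in zip(params["v_fw"], fw))
        + sum(v * h for v, h in zip(params["v_bw"], bw))
        for fw, bw in zip(h_fw, h_bw)
    ]


# Hypothetical parameters: two basis-functions, two hidden units per direction.
params = {
    "w_in_fw": [[0.3, 0.1], [-0.2, 0.4]], "w_rec_fw": [[0.1, 0.0], [0.0, 0.1]],
    "w_in_bw": [[0.5, -0.3], [0.2, 0.4]], "w_rec_bw": [[0.1, 0.0], [0.0, 0.1]],
    "v_fw": [1.0, 0.5], "v_bw": [1.0, -1.0], "bias": 0.0,
}
phi = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]      # basis-function values per time step
phi_alt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # same score, different final step
y = birnn_predict(phi, params)
```

Changing only the last time step (`phi_alt`) leaves the first forward state untouched but alters the first backward state, which is how later score events can influence earlier predictions.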

Figure 3: Graph structure of a biRNN, expanded for a piece of $N$ time steps. $\varphi_t$ is a vector holding the values of a set of basis-functions $\varphi$, evaluated at time $t$, describing the musical score at that time. $h^{(\mathrm{fw})}_t$ and $h^{(\mathrm{bw})}_t$ model the forward and backward score context at $t$, respectively, and jointly predict $y_t$, the performance dynamics at $t$.

4 An experimental assessment of the model

In this Section, we provide an evaluation of the model in terms of predictive accuracy on a set of orchestra performances that are available commercially. Such an evaluation may not at first seem to be of major interest, since we intend to use the expression model as a tool for understanding expression rather than predicting it. However, it should be kept in mind that in the area of musical expression, there is still little consolidated theory to use as a basis for building models. Therefore, computational models of musical expression by necessity have an exploratory role. In this context, measuring the predictive accuracy of the model on a set of music recordings helps us to get a general impression of how well the model captures relevant factors of the music in relation to expression. By testing both simple linear and more complex associations between the basis-function representation and the recorded loudness of performances, we aim to give a more complete picture of the merits and limitations of the basis-function modeling approach. More specifically, we test the linear basis-function model (Lin), the non-linear model (FFNN), and the bi-directional recurrent model (biRNN) described in the previous Section.

4.1 Data

The corpus used for the experiments is summarized in Table 1 below. It consists of symphonies from the classical and romantic periods.
The corpus contains recorded performances (audio), machine-readable representations of the musical score (MusicXML) and automatically produced, manually corrected alignments between score and performance, for each of the symphonies. We use recordings of performances by the Royal Concertgebouw Orchestra conducted by Ivan Fischer or Mariss Jansons, all performed at the Royal Concertgebouw in Amsterdam, the Netherlands. The corpus amounts to a total of 20 movements from 5 pieces. The corresponding performances sum up to a total length of almost 4 hours of music. From the 20 scores a total of 53816 note onsets and 1420 basis-functions were extracted. The loudness and score-performance alignment is computed as described in Section 3.1.

Table 1: Pieces/movements used for leave-one-out cross-validation of the model

| Composer | Piece | Movements | Conductor | Orchestra |
|---|---|---|---|---|
| Beethoven | Symphony No. 5 in C-Min. (op. 67) | 1, 2, 3, 4 | Fischer | RCO |
| Beethoven | Symphony No. 6 in F-Maj. (op. 68) | 1, 2, 3, 4, 5 | Fischer | RCO |
| Beethoven | Symphony No. 9 in D-Min. (op. 125) | 1, 2, 3, 4 | Fischer | RCO |
| Mahler | Symphony No. 4 in G-Maj. | 1, 2, 3, 4 | Jansons | RCO |
| Bruckner | Symphony No. 9 in D-Min. (WAB 109) | 1, 2, 3 | Jansons | RCO |

4.2 Method

We use leave-one-out cross-validation, where the model is trained on 19 of the 20 movements and then is used to predict the target values for the unseen remaining movement. The non-linear models (FFNN and biRNN) are trained by gradient descent optimization. Both the feed-forward and the recurrent neural network are set up with a single hidden layer of 20 units. From the 19 training movements, four movements are kept for validation, to avoid overfitting the models to the training data, a practice known as early stopping. The predictions are evaluated with respect to the target in terms of the Coefficient of Determination ($R^2$), measuring the proportion of variance explained by the model, and Pearson's Correlation Coefficient $r$. Note that since we report on predicted rather than fitted data, $R^2$ values can be negative, in case the prediction residual has larger variance than the signal itself.

The set of basis-functions used in the experiments encode note pitch, duration, and metrical position, the number of simultaneous notes within instrument groups, inter-onset intervals between subsequent notes, repeat signs, note accent, staccato, fermata signs, and dynamic markings. A full description of the basis-functions is omitted for brevity².
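The two evaluation measures and the cross-validation split can be written out as a short sketch. The formulas are the standard definitions of $R^2$ and Pearson's $r$; the example values are hypothetical, and taking the first four remaining movements as the validation set is an assumption, since the text does not say how they are chosen.

```python
import math


def r_squared(target, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_t = sum(target) / len(target)
    ss_res = sum((t - p) ** 2 for t, p in zip(target, predicted))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def leave_one_out(movements, n_validation=4):
    """Yield (training, validation, test) splits, one per held-out movement."""
    for i, test in enumerate(movements):
        rest = movements[:i] + movements[i + 1:]
        # Assumption: the first n_validation remaining movements drive early stopping.
        yield rest[n_validation:], rest[:n_validation], test


# Hypothetical prediction with the right shape but a constant offset.
target = [0.0, 1.0, 2.0, 3.0]
predicted = [2.0, 3.0, 4.0, 5.0]
r2 = r_squared(target, predicted)
r = pearson_r(target, predicted)

folds = list(leave_one_out(list(range(1, 21))))
```

In this example the prediction correlates perfectly with the target while $R^2$ is negative, because the residual is larger than the variation of the target itself. This is the situation the note on negative $R^2$ values refers to.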
4.3 Results and discussion

The results of the experiments are shown in Table 2. We observe that both the R^2 and the r values for Lin are generally lower than those for FFNN and biRNN, demonstrating that non-linear modeling provides a clear advantage over the linear modeling approach. Given the relatively small data set, this result is not trivial, since the FFNN and biRNN have many more parameters than the Lin model, and are therefore more prone to overfitting. Furthermore, the biRNN model provides more accurate predictions than the FFNN model, although this advantage is less prominent than the advantage over Lin.

Footnote 2: A complete description of the basis functions can be found in the following technical report: http://lrn2cre8.ofai.at/expression-models/tr2016-ensemble-expression.pdf

Table 2: Predictive accuracy in a leave-one-out scenario for the different models; R^2 = coefficient of determination (larger is better); r = Pearson correlation coefficient (larger is better); best value per piece and measure marked with an asterisk

                    R^2                        r
Piece         Mv       Lin     FFNN    biRNN      Lin     FFNN    biRNN
Beethoven S5  Mv 1   -0.26    -0.22    -0.18*    0.18     0.21     0.26*
              Mv 2    0.34     0.46     0.56*    0.58     0.70     0.76*
              Mv 3    0.23     0.40     0.44*    0.53     0.64     0.66*
              Mv 4    0.06     0.26*    0.25     0.41     0.53*    0.52
Beethoven S6  Mv 1    0.36     0.36     0.39*    0.61     0.63     0.65*
              Mv 2    0.07     0.15     0.17*    0.36     0.40     0.41*
              Mv 3    0.51     0.60     0.62*    0.72     0.81     0.82*
              Mv 4    0.11     0.27     0.29*    0.38     0.54     0.56*
              Mv 5    0.36     0.44     0.49*    0.60     0.70     0.75*
Beethoven S9  Mv 1    0.34     0.36     0.42*    0.59     0.61     0.65*
              Mv 2    0.36     0.40     0.53*    0.60     0.64     0.74*
              Mv 3   -0.30    -0.06    -0.02*    0.20     0.17     0.22*
              Mv 4    0.11     0.37     0.49*    0.52     0.64     0.70*
Mahler S4     Mv 1   -0.17     0.29     0.37*    0.37     0.54     0.61*
              Mv 2   -0.48    -0.02*   -0.02*    0.06     0.20     0.23*
              Mv 3   -1.22     0.25     0.26*    0.20     0.51     0.53*
              Mv 4   -1.99     0.09     0.18*    0.15     0.33     0.44*
Bruckner S9   Mv 1  -39.06     0.45     0.59*    0.26     0.68     0.77*
              Mv 2    0.24     0.48     0.55*    0.58     0.72     0.74*
              Mv 3   -3.54     0.32     0.40*    0.25     0.57     0.65*

The fact that the results of the linear model are inferior suggests that although the basis-functions used to represent the score capture relevant information, their shapes (such as the ramp function to represent a crescendo) are too schematic to work well as approximations of measured loudness curves. The improvement of the results in the FFNN and biRNN models suggests that the non-linear transformation of these shapes alleviates this problem to some extent. The capability of the non-linear models to model interactions between basis-functions, as demonstrated by Cancino Chacón and Grachten [2015], may further explain the improved results of these models.

5 The expression model as an analytical tool for dynamics in symphonic music

In this Section, we demonstrate that the BM framework can be used for explanatory purposes, and can thus form the basis for a tool that elucidates differences in expressive interpretations between performances, as discussed in Section 1.

The explanatory power of BM models lies in the fact that they represent dynamics as a function of the basis-functions. As a model learns from training data how the basis-functions relate to dynamics, some basis-functions may prove to be very important for an accurate prediction of dynamics, while others may have little or no influence at all. In other words, the model learns to be more sensitive to some basis-functions than to others. We can impose sensitivities specific to a particular performance on a model by fitting the model to that performance, that is, by adjusting its parameters such that its prediction error for the dynamics of that performance is minimized. When fitting models to two different performances of a piece, the differences in dynamics between the performances tend to lead to different sensitivities in the models. For example, a model fitted to a dramatic performance may learn that dynamics annotations such as piano and forte have a large effect on the dynamics of the performance, whereas a model fitted to a more restrained performance may be less sensitive to these annotations.
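The idea that fitting a model to a performance imposes performance-specific sensitivities can be illustrated with a deliberately small sketch. It assumes a plain linear model as a stand-in for the models discussed in this Section; the two basis-functions, the loudness values, the shared starting weights, and all names are invented for the example.

```python
def predict(weights, features):
    """Linear basis model: loudness is a weighted sum of basis-functions."""
    return sum(w * x for w, x in zip(weights, features))


def fit(weights, score, loudness, lr=0.1, epochs=500):
    """Gradient descent on the mean squared error, starting from `weights`."""
    weights = list(weights)
    n = len(score)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for features, y in zip(score, loudness):
            err = predict(weights, features) - y
            for j, x in enumerate(features):
                grad[j] += 2.0 * err * x / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Hypothetical score with two basis-functions per note: [forte, downbeat].
score = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 1]]

# Made-up loudness curves of two performances of that same score.
dramatic = [2.0 * f + 0.5 * d for f, d in score]
restrained = [0.5 * f + 0.5 * d for f, d in score]

# Both fits start from the same (hypothetical) initial weights.
common_start = [1.0, 0.0]
w_dramatic = fit(common_start, score, dramatic)
w_restrained = fit(common_start, score, restrained)
```

For a linear model the sensitivity to a basis-function is simply its weight, so the fitted weight of the forte basis-function ends up larger for the dramatic performance than for the restrained one, while the downbeat weight is the same in both.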
Thus, comparing differences in sensitivities between models fitted to different performances can give us qualitative explanations of the differences in dynamics, in the style of "Performance A is louder than Performance B at this point in the piece, because the string instruments are more prominent", or "Performance B emphasizes the downbeat more strongly than Performance A".

When fitting two models to two different performances for comparison purposes, it makes sense to start the fitting process from a common model that was pretrained on a number of other recordings. Firstly, this speeds up the fitting process, since the pretrained model will already provide a rough approximation of the dynamics curves. Secondly, starting from a common basis encourages similar explanations for similar trends in the performances, and thereby parsimonious explanations of the differences between the performances.

We compute the sensitivities of a model to each of the basis-functions using a local differential based sensitivity analysis technique [Hamby, 1995], which consists in computing the gradient of the output of the model with respect to each of its inputs. By multiplying the gradients (sensitivities) with the inputs (basis-functions) over the course of a piece, we obtain a sensitivity graph for a performance. The multiplication is motivated by the fact that even when a model is sensitive to a particular basis-function, this basis-function does not affect the output of the model whenever it is inactive (i.e. zero valued). Moreover, by subtracting the sensitivity graphs of two performances, we obtain what we call a sensitivity-difference (SD) graph, a visual representation expressing the relative influence of each basis-function in each of the two performances, to be illustrated shortly.

As a case study, we compare performances of the 3rd Movement (Lustiges Zusammensein der Landleute) of Beethoven's Symphony No. 6 Op. 68 by different conductors and orchestras. This piece is a scherzo which suggests dances of the country folk. The scherzo is in 3/4 meter, with its trio in 2/4 meter. The two performances we compare here are by the conductors Georg Solti (with the Chicago Symphony Orchestra, recorded in 1974) and Nikolaus Harnoncourt (with the Chamber Orchestra of Europe, recorded in 1991), hereafter referred to as Solti and Harnoncourt, respectively. We compare the performances by using a biRNN model that was pretrained on the dataset described in Table 1, and then fitting the pretrained model to each of the two performances.

As an example, consider the small but marked difference between Solti and Harnoncourt in bars 87-90 of the SD graph in Figure 4 (showing a selection of the most influential basis-functions in the fragment). In bar 87, Beethoven's score prescribes a four-bar diminuendo of the violins to transition from an ongoing fortissimo (ff) passage (starting before and continuing in the depicted fragment) to a quiet and lyrical pianissimo (pp) passage featuring a singing oboe, starting with bar 91. The SD graph shows that the increased loudness in Harnoncourt (compared to Solti) is attributed to a sustained influence of the ff in the violins over the course of the diminuendo. Note that this attribution is a parsimonious explanation of the loudness difference, because it involves only a single basis-function.
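The construction of sensitivity graphs and SD graphs defined earlier in this Section can be sketched numerically as follows. The sketch makes several assumptions that are not taken from the study: a tiny one-unit tanh model stands in for the fitted biRNN, gradients are approximated by central finite differences rather than computed analytically, and the basis-functions, parameters, and names are invented.

```python
import math


def make_model(w_hidden, w_out):
    """Tiny feed-forward model: loudness = w_out * tanh(w_hidden . x)."""
    def model(x):
        return w_out * math.tanh(sum(w * xi for w, xi in zip(w_hidden, x)))
    return model


def gradient(model, x, eps=1e-6):
    """Central finite-difference gradient of the model output w.r.t. x."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((model(hi) - model(lo)) / (2 * eps))
    return grad


def sensitivity_graph(model, score):
    """One row per time step: gradient times input for each basis-function."""
    return [[g * xi for g, xi in zip(gradient(model, x), x)] for x in score]


def sd_graph(graph_a, graph_b):
    """Element-wise difference of two sensitivity graphs."""
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(graph_a, graph_b)]


# Hypothetical basis-functions per time step: [ff annotation, downbeat].
score = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

# Made-up parameters standing in for models fitted to two performances;
# model_a reacts more strongly to the ff annotation than model_b.
model_a = make_model([0.9, 0.2], 1.5)
model_b = make_model([0.4, 0.2], 1.5)

sd = sd_graph(sensitivity_graph(model_a, score),
              sensitivity_graph(model_b, score))
```

Entries of `sd` are zero wherever a basis-function is inactive, which is the effect of multiplying gradients by inputs. A positive entry means that the basis-function contributes more to the output of `model_a` than to that of `model_b` at that time step.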
A hypothetical, less concise explanation could, for instance, involve an increased influence of each of the metrical positions for Harnoncourt. Together, the increased influence of these basis-functions would also lead to a louder performance of the fragment overall, but may not be compatible with Harnoncourt's interpretation of the rest of the piece.

Listening to the respective passages, we indeed note a clearly audible difference: Solti takes the diminuendo very strictly, immediately softening the orchestra and quickly arriving at a very soft playing level already before the actual arrival of the pp. Harnoncourt's ritardando is more of a continuation of the preceding fortissimo passage: he only grows slightly softer during the ritardando, and obeys the pp more abruptly when it arrives (the purple color of the pp starting with bar 91 indicates that Solti's pp is actually slightly louder than Harnoncourt's). It turns out that these are consistent and obviously deliberate choices, as we find the exact same pattern later on in the piece, in bars 292-295, where we have an analogous musical passage.

Furthermore, the SD graph shows a slight but systematic pattern in the metrical basis-functions. This pattern suggests the model found slight differences in the metrical accentuation, with Solti placing more emphasis than Harnoncourt on the last beat of the bar, and vice versa for the first two beats. Listening reveals that these differences are too subtle to be heard, however.

Finally, it is important to note that the SD graph pertains to the model fit dynamics curves in Figure 4 (top), not the measured curves. There are some fluctuations in the measured curve (such as that on beat 3 of measure 89) that are not captured, and therefore cannot be explained by the SD graph.

[Figure 4 omitted. Top panel: loudness (standard scores) over time (bars 87 to 92), with measured and model-fit curves for Harnoncourt and Solti. Bottom panel: one row per basis-function (Oboe: Duration, LoudnessAnnotation ff, PolynomialPitch; Violin: Duration, LoudnessAnnotation ff, LoudnessAnnotation pp, Metrical 3/4 beat 1, beat 2, beat 3, PolynomialPitch), with a color scale ranging from "har. > sol." through "har. = sol." to "sol. > har."]

Figure 4: Top: Measured and fitted loudness curves for an excerpt of the performance of Beethoven's 6th Symphony, 3rd Movement (bars 87 to 92) by Harnoncourt (orange) and Solti (purple); Bottom: sensitivity-difference graph for Harnoncourt and Solti. Orange tones indicate that a basis-function has a stronger (positive) contribution to loudness in Harnoncourt than in Solti, purple tones indicate the opposite. (Figure best viewed in color)

6 Conclusions

This article demonstrates a computational model for modeling loudness variations in audio recordings of symphonic pieces. The model relates these variations to information from the written musical score, described in the form of basis-functions. An evaluation of different variants of the model shows that a non-linear version including temporal dependencies is most effective in a predictive setting, where the model predicts loudness variations based on the written score, after being trained on a set of recordings. Examples given in Section 5 illustrate how the model can be used as a way to explain differences between performances in terms of the written score.

It must be kept in mind, however, that the data set used for validating the model, although comprising works from several composers, is performed by a single orchestra and two different conductors. Anecdotal cross-validation suggests the model trained on RCO recordings generalizes well to recordings by different orchestras, but more elaborate experimentation is necessary to make stronger claims about the robustness of the model against variance in recording/mixing/mastering conditions across recordings.

Furthermore, it should be noted that the measured overall loudness variation is only a coarse measure of (a single aspect of) musical expression, and that currently, the model approximations (and thus its explanations) may not be adequate at all positions in the performance. Better model approximations and predictions will allow for novel explanatory uses, such as using the predictions of a model that was trained on multiple performances of a piece as a baseline performance, based on which the idiosyncrasies of conductors can be established.

The examples were hand-picked here, since the expression model is currently in a stage where we are testing its validity, and experimenting with different sets of basis-functions. In the future, the model should be capable of automatically identifying excerpts from a piece where two or more performances differ substantially from each other, in order to highlight them to the listener, and show which aspects of the performance are different. Future versions of the model should not be restricted to loudness variations, but cover tempo variations as well.

In combination with a web-service for aligned music playback and visualization, such as presented by Gasser et al. [2015], the model presented here allows listeners with a desire to get a better grasp on a piece of music to compare different performances of the piece in terms of their expressive character, and to get a better understanding of what it is that makes the performances different.
Acknowledgements

This work is supported by the European Union's Seventh Framework Programme FP7 / 2007-2013 (projects PHENICX / grant number 601166 and Lrn2Cre8 / grant number 610859), and by the European Research Council (ERC) under the EU's Horizon 2020 Framework Programme (ERC Grant Agreement number 670035, project CON ESPRESSIONE). We wish to thank the anonymous reviewers for their useful comments. Furthermore, we are grateful to the Royal Concertgebouw Orchestra, in particular Marcel van Tilburg and David Bazen, for providing the audio recordings used in this study.

References

Bresin, R. (1998). Artificial neural networks based models for automatic performance of musical scores. Journal of New Music Research, 27(3):239-270.

Canazza, S., De Poli, G., Rodá, A., and Vidolin, A. (2003). An abstract control space for communication of sensory expressive intentions in music performance. Journal of New Music Research, 32(3):281-294.

Cancino Chacón, C. E. and Grachten, M. (2015). An evaluation of score descriptors combined with non-linear models of expressive dynamics in music. In Japkowicz, N. and Matwin, S., editors, Proceedings of the 18th International Conference on Discovery Science (DS 2015), Lecture Notes in Artificial Intelligence, Banff, Canada. Springer.

Conklin, D. and Witten, I. (1995). Multiple viewpoint systems for music prediction. Journal of New Music Research, 24(1):51-73.

EBU-R-128 (2011). EBU Tech 3341-2011, Practical Guidelines for Production and Implementation in Accordance with EBU R 128.

Friberg, A., Bresin, R., and Sundberg, J. (2006). Overview of the KTH rule system for musical performance. Advances in Cognitive Psychology, 2(2-3):145-161.

Gasser, M., Arzt, A., Gadermaier, T., Grachten, M., and Widmer, G. (2015). Classical music on the web - user interfaces and data representations. In Proceedings of the 16th International Society for Music Information Retrieval Conference, ISMIR 2015, Málaga, Spain, October 26-30, 2015, pages 571-577.

Goto, M. (2007). Active music listening interfaces based on signal processing. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, pages 1441-1444.

Grachten, M. and Cancino Chacón, C. E. (2016). Temporal dependencies in the expressive timing of classical piano performances. In The Routledge Companion to Embodied Music Interaction. Routledge. In press.

Grachten, M., Gasser, M., Arzt, A., and Widmer, G. (2013). Automatic alignment of music performances with structural differences. In Proceedings of the 14th International Society for Music Information Retrieval Conference, Curitiba, Brazil.

Grachten, M. and Widmer, G. (2012). Linear basis models for prediction and analysis of musical expression. Journal of New Music Research, 41(4):311-322.

Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. arXiv:1308.0850.

Hamby, D. M. (1995). A comparison of sensitivity analysis techniques. Health Physics, 68(2):195-204.

Liem, C. C., Gómez, E., and Schedl, M. (2015). PHENICX: Innovating the classical music experience. In Multimedia & Expo Workshops (ICMEW), 2015 IEEE International Conference on, pages 1-4.

Liem, C. C. S. and Hanjalic, A. (2015). Comparative analysis of orchestral performance recordings: An image-based approach. In 16th International Society for Music Information Retrieval Conference, Málaga, Spain.

Melenhorst, M. S. and Liem, C. (2015). Put the concert attendee in the spotlight. A user-centered design and development approach for classical concert applications. In Proceedings of the 16th International Society for Music Information Retrieval Conference, Málaga, Spain.

Roose, H. (2008). Many-voiced or unisono?: An inquiry into motives for attendance and aesthetic dispositions of the audience attending classical concerts. Acta Sociologica, 51(3):237-253.

Widmer, G. (2003). Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical discoveries. Artificial Intelligence, 146(2):129-148.

Author Biographies

Maarten Grachten holds a Ph.D. degree in Computer Science and Digital Communication (2007, Pompeu Fabra University, Spain), and is currently a senior researcher at the Department of Computational Perception at Johannes Kepler University, Linz, Austria. His research interests include computational modeling of music perception and cognition using machine learning techniques.

Carlos Eduardo Cancino-Chacón is a researcher at the Austrian Research Institute for Artificial Intelligence (OFAI). He received the Bachelor's degree in Physics at UNAM, Mexico, the Bachelor's degree in Piano at the National Conservatory of Music in Mexico and the MSc. in Electrical Engineering and Audio Engineering at TU Graz, Austria. Currently he is pursuing the PhD degree in Computer Science at JKU, Linz, Austria.

Thassilo Gadermaier received a Mag. phil. degree in Systematic Musicology from the University of Vienna and is currently a researcher at OFAI. He is pursuing a degree in Electrical Engineering/Telecommunications at Vienna University of Technology.

Gerhard Widmer is Professor and Head of the Department of Computational Perception at Johannes Kepler University, Linz, Austria. His research interests include AI, machine learning, and intelligent music processing. He is a Fellow of the European Association for Artificial Intelligence, and recipient of an ERC Advanced Grant (2015) on computational models of expressivity in music.