Evaluating Interactive Music Systems: An HCI Approach

William Hsu
San Francisco State University
Department of Computer Science
San Francisco, CA, USA
whsu@sfsu.edu

Marc Sosnick
San Francisco State University
Department of Computer Science
San Francisco, CA, USA
msosnick@sfsu.edu

Abstract

In this paper, we discuss a number of issues related to the design of evaluation tests for comparing interactive music systems for improvisation. Our testing procedure covers rehearsal and performance environments, and captures the experiences of a musician/participant as well as an audience member/observer. We attempt to isolate salient components of system behavior, and test whether the musician or audience are able to discern between systems with significantly different behavioral components. We have applied our testing methodology in comparative studies of our London and ARHS improvisation systems [1], with the help of saxophonists John Butcher and James Fei; we report on preliminary experiences and ongoing design refinements.

Keywords: Interactive music systems, human-computer interaction, evaluation tests.

1. Introduction

In our previous work designing interactive music systems (see for example [1], [2]), we have at various times sought a methodology for evaluating the musical results of such systems. We have found relatively little in the literature that is applicable to our particular environment (see Section 2 for references). In this paper, we attempt to identify some of the major issues and problems associated with evaluation methodology for interactive music systems, propose a framework for comparative evaluations, and report on some preliminary experiences designing evaluation tests.

Since 2002, we have built several interactive music systems that improvise with a saxophonist or other human instrumentalist. From the human instrumentalist's real-time performance audio stream, our system extracts timbral and gestural features that are perceptually significant; this information is used to coordinate the performance of an ensemble of virtual improvising agents. As with similar systems (see Section 2 for examples), our high-level goals are primarily focused on musical results from two points of view: an experienced human improviser with a rich timbral and gestural vocabulary should find it possible to work with the system in a free improvisation context; also, an audience sympathetic to free improvisation should find the performance relatively listenable.
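
As a rough illustration of the kind of low-level analysis referred to above (the actual London/ARHS feature extractors are described in [1], [2] and not reproduced here), a per-frame computation of two generic timbral descriptors might look as follows; the code and names are ours and purely illustrative.

```python
# Illustrative only: the actual London/ARHS feature extractors are
# described in [1], [2]; this computes two generic per-frame timbral
# descriptors (RMS amplitude and spectral centroid) over an audio signal.
import numpy as np

def frame_features(signal, sr=44100, frame_size=2048, hop=512):
    """Yield (time_sec, rms, spectral_centroid_hz) per analysis frame."""
    window = np.hanning(frame_size)
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sr)
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size] * window
        rms = np.sqrt(np.mean(frame ** 2))
        mag = np.abs(np.fft.rfft(frame))
        # Spectral centroid: magnitude-weighted mean frequency of the frame.
        centroid = float(np.sum(freqs * mag) / np.sum(mag)) if mag.any() else 0.0
        yield start / sr, rms, centroid

# Demo on one second of noise:
for t, rms, centroid in list(frame_features(np.random.randn(44100)))[:3]:
    print(f"t={t:.3f}s  rms={rms:.3f}  centroid={centroid:.0f} Hz")
```

A live system would compute such frames incrementally from the input stream and map them onto higher-level gestural descriptors.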
In [1], we focused on our two latest systems: the London system, and the Adaptive Real-time Hierarchical Self-monitoring (ARHS) system. Based on observations of the systems in performances at the 2006 Live Algorithms for Music conference, NIME 2007, and at CNMAT in 2008, we tried to identify and address shortcomings. Each design iteration involved refining and redesigning system components, and fine-tuning parameter and configuration choices. Our earlier design decisions were primarily driven by considerations of functionality; when the system was relatively simple, enhancing its functionality usually led to more musical results. However, as the number of system components increased and their interactions became significantly more complex, it became more difficult to correlate design decisions with improvements in musicality; we felt the need for a more rigorous approach to the comparative evaluation of design choices.

It is relatively easy to test and verify the correct operation of system components or entire subsystems by observing their output, using system logs or audio recordings resulting from well-defined test inputs. It is also (usually) possible to identify whether the effects of a system component or a specific configuration are discernible by the human improviser in performance, or by a listener listening to a mix of the improviser and the system's audio output. However, our high-level goals are to achieve musically satisfying experiences for both the human improviser and the audience; a well-intentioned component that results in discernible changes in system behavior may very well be considered less desirable by the participating musician or audience. This desirability falls under what Ariza terms a musical judgment, i.e., a subjective evaluation of the module [3]. While Ariza is at best ambivalent about the value of such subjective evaluations, it is precisely this type of judgment that will determine whether the performer will continue to use the system, and/or audiences will continue to want to listen to the results.

The evaluation of software applications is fairly well-defined in the field of HCI; however, apart from input devices, there has been relatively little discussion of HCI-driven evaluation testing in the interactive music systems community. Furthermore, few HCI testing methodologies have been applied to the dynamic user/audience environment of an automatic improvisation system. An evaluation framework for such performance systems must address both aspects of our high-level goals: 1) the system should constitute a usable environment for an experienced human improviser to perform within, preferably for an extended period of time, and 2) the results of the performance should be musically interesting for an audience that is sympathetic to free improvisation.

Our paper is organized as follows. In Section 2, we survey related work, both from the HCI area and the interactive music community. In Section 3, we describe our approach for developing testing methodologies for evaluating improvisation systems, and apply it to our London and ARHS systems. The design of our evaluative questionnaires is covered in Section 4. We report on our recent experiences with our evaluation methodology in Section 5, and discuss future work.

2. Related Work

Chapter 10 of [4] contains a survey of recent work in interactive improvisation systems. Such systems work mostly with pitch information, with timbre playing a relatively minor role; see for example George Lewis's Voyager [5]. In Hsu's collaborations with saxophonist John Butcher, several systems were built in which timbre is an integral and dynamic factor in sensing, analysis, and interaction management; see [1], [2] for details.

We are currently interested in evaluation frameworks for comparing interactive music systems (IMSs), using approaches and techniques from human-computer interaction (HCI). IMSs can be thought of broadly as human-computer interfaces, with the musician providing input to the system through a microphone, and the musician and audience reacting to the output produced by the computer through a sound system. Techniques developed in the field of HCI lend themselves well to the development of such systems; in fact, the iterative nature of HCI development cycles, data gathering followed by design changes, is a systematization of what already occurs at a less formal level.

Collins [4] observed that the evaluation of IMSs has often been inadequately covered in existing reports. He proposed three criteria for evaluating IMSs: 1) technical criteria, related to tracking success or cognitive modeling; 2) the reaction of an audience, i.e., (subjective) aesthetic criteria; and 3) the sense of interaction for the musicians who participate. Our framework and procedures address audience reaction and the experiences of the musicians in detail.

In [3], Ariza describes a variety of listening tests for evaluating generative music systems. His focus was on the application to generative music systems of evaluation procedures similar to Turing tests. Most of the tests critiqued by Ariza ask listeners a rather high-level question, such as whether a specific piece of music was composed by a human or by a generative system. Ariza observed: "The lack of systematic evaluation of aesthetic artifacts in general is traditionally accepted: evaluation is more commonly found as aesthetic criticism, not experimental methodology."
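
The quantitative core of such a discrimination test is straightforward; as a minimal sketch (ours, not a procedure proposed in [3]), a binomial test checks whether listeners in a blinded forced-choice task can identify the source of a piece at better than chance:

```python
# Sketch: did listeners distinguish two sources above chance in a
# blinded forced-choice test? (Illustrative; not a procedure from [3].)
from scipy.stats import binomtest

correct = 23   # hypothetical: judgments that identified the source correctly
trials = 32    # hypothetical: total forced-choice judgments
result = binomtest(correct, trials, p=0.5, alternative="greater")
print(f"p = {result.pvalue:.4f}")  # small p => discrimination above chance
```
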
For our purposes, we are interested in testing procedures that distinguish between the musical behavior of two (or more) systems. We would like to capture data from the point of view of both a performing musician working with the system in real time, and a listener observing the performance and then evaluating some of those "aesthetic artifacts" mentioned by Ariza. Ariza proposed that using a Turing test provides no more than a listener survey. While we agree with Ariza's findings and conclusions, we believe that, despite the musical backgrounds and biases of the users and listeners, there is value in this data. The important consideration, then, is whether we can produce more than musical judgments through a testing methodology that examines these very judgments. We address this question in more detail in our questionnaire design (Section 4).

Freeman [6] has used short surveys to collect feedback from audiences in his interactive pieces such as Flock, for saxophone quartet, dancers, audience participation, electronic sound, and video. Because of the relatively open structure of the piece and the nature of audience participation, survey questions tend to be relatively high-level, such as whether participants had fun or enjoyed participating. A numeric scale was used to rate audience responses. In [6], Freeman mentions the use of test runs before the performance, but the organization of the runs and the collection of quantitative data were not clearly documented.

An example of work in generative music that tries to address listener subjectivity is Unehara and Onisawa [7]. Listeners were asked to subjectively evaluate bars of existing works; genetic algorithms were then used to generate compositions based on listener preferences. Here too, the final analysis of the success of the resulting compositions comes down to a subjective satisfaction level.

Wanderley and Orio's work [8] on the application of HCI testing to musical input devices exemplifies the one place where HCI is comfortably applied to computer music: input devices. By reducing the scope of testing (i.e., to input devices) and the environment in which the devices are tested, many variables are eliminated, producing more reliable, objective test results.

Research in areas other than interactive music systems must also account for listener subjectivity in the analysis of music performance. One interesting result is presented in [9], which demonstrates that a rank ordering method, controlled for certain parameters, can produce statistically stable results based on listeners' opinions across populations.
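
For illustration, a standard way to quantify such cross-listener stability is Kendall's coefficient of concordance W, which measures agreement among m listeners each ranking n excerpts (1 = perfect agreement, 0 = none). The sketch below, with hypothetical data, is our illustration and not the analysis method of [9]:

```python
# Kendall's coefficient of concordance W for m raters ranking n items
# (no ties). Illustrative code and data, not the procedure of [9].
import numpy as np

def kendalls_w(ranks):
    """ranks: (m, n) array; each row is one rater's ranking of n items."""
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = np.sum((rank_sums - rank_sums.mean()) ** 2)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical: 4 listeners each rank 5 performance excerpts.
listeners = [[1, 2, 3, 4, 5],
             [2, 1, 3, 4, 5],
             [1, 3, 2, 4, 5],
             [1, 2, 4, 3, 5]]
print(f"W = {kendalls_w(listeners):.2f}")  # high W => stable ordering

```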

3. Evaluation Framework

We approached our design using Sharp et al.'s DECIDE framework [10]. An overview of this approach follows; we expand on some of these issues in the next few sections.

D: Determine the goals. Our goals are to develop testing methodologies for evaluating interactive music systems for improvisation. The tests will capture the experiences of musicians improvising with the IMSs, and of audiences observing performances with the IMSs. The results will provide both qualitative and quantitative data for evaluating different IMSs, and guide us in the design of future systems.

E: Explore the questions. What are the common environments in which a musician or listener might experience the IMSs being evaluated? What are the important behavioral components of the IMSs that distinguish them? Are these differences in behavior observable by the musician or audience member? Do these differences in behavior result in more or less musically useful results, for either the musician or the audience member? Is it possible to account for the dynamic quality of the musician-IMS feedback loop?

C: Choose the approach and methods. For the early stages of this work, we prefer relatively inexpensive and unobtrusive methods of gathering data that will not interfere with the ongoing musical activity. The data gathering should also be easy to administer, with results that are easy to collect, analyze, and summarize; a similar approach should be used for both musicians and audience members. We decided on simple paper (or equivalent electronic) questionnaires; we discuss their design in Section 4. Audio recordings of conversations with participating musicians were made after test sessions. We are currently exploring the efficacy of audio or video recordings of individual audience reactions.

I: Identify the practical issues. Classical HCI work stresses the importance of providing identical, reproducible test environments, with test subjects coming into a test with relatively similar experience with the systems being tested. This is (of course) highly impractical in typical computer music work environments. We would like to evaluate the IMSs with different musicians. It is likely that each musician is in a different location, and each test takes place in a different studio, each with a somewhat different setup. One musician may have no experience improvising with IMSs, while another might already have worked with systems similar to the one being tested. Similar issues apply to audiences. In particular, performances are often held in different locations, with different audience demographics and energy, and the previous listening experience of individual audience members may vary significantly. It is also problematic to compare audience reactions to performances involving the same IMSs but with different musician participants.

D: Deal with ethical issues. In our tests, we are merely polling and not physically testing our subjects; hence, our major consideration is privacy. By making audience participation optional, and the forms anonymous, we avoid ethical issues. Likewise, the musician's participation is voluntary. If audio recordings are made, care must be taken to obtain informed voluntary consent for these recordings.
E: Evaluate, interpret, and present data. This is the ongoing part of our research. In general, we felt that it was unrealistic to try to provide controlled laboratory test environments. Field studies are a standard HCI practice; for our work, we attempt to collect data from in-vivo situations such as studios, performance venues, and other common computer music environments, where control of tests is challenging. We attempt to work mostly with sympathetic and experienced musicians and audiences, and document their different experiences. We expand on related issues in the next few sections.

3.1 Working with the musician in rehearsal

We initially focused on capturing the musician's experience with the IMSs under consideration. To facilitate working with our IMSs, we decided to limit our choice of musicians to those already familiar with free improvisation and having rich timbral and gestural palettes.

We postulated that a musician would probably need to work with an IMS before any performance took place, in order to become familiar with its behavior; there should be one or more rehearsals with each IMS. One important consideration is the amount of information about each IMS that should be made available to the musician before the rehearsal. In classical HCI-based comparative studies of two (or more) software applications or variants, detailed information about each variant is usually not given to the users beforehand; the concern is that such information will bias a user toward one or the other variant. Hence, we felt that a naïve rehearsal, in which the musician has no information about system behavior, material choices, etc., might capture interesting information about the ease of use of a system. In subsequent practice sessions (or performances), more information about a system would be provided to the musician; it might also be interesting to compare the experiences of the musician before and after receiving specifications of a system's design and behavior.

We also felt that the duration of each rehearsal should be chosen carefully. A rehearsal is exploratory in nature; it should be long enough for the musician to discover and exercise interesting modes of interaction, but not so long as to be exhausting.

Our testing procedure attempts to be as unobtrusive as possible, working itself into the natural flow of the rehearsal and performance being observed. During a rehearsal, the musician works with the researcher to discover the dynamics of the IMS. We break the rehearsal itself into two sections: 1) a short naïve introductory section, in which the musician receives no briefing on the internal details of the IMS being tested; and 2) a longer informed rehearsal section, after the musician has been briefed on relevant details of the IMS. Both sections are recorded, and after each section the musician fills out a questionnaire. This rehearsal setup is repeated for each IMS being tested. Hence, for our project comparing the London and ARHS systems, there are four rehearsal sections (two per system), with four questionnaires being administered. At the end of the four rehearsal sections, a fifth differential questionnaire comparing the two systems is administered.

3.2 Working with the musician and audience in performance

In a performance setting, the musician performs two, preferably sequential, sets, one with each IMS being tested. As discussed in Section 3.1, before each set the audience is not told which IMS is involved, to avoid possible bias. Following the performance, or during an intermission, the musician fills in questionnaires. Audience members who wish to participate are also given questionnaires, which identify the systems only by their order in the set (i.e., first system, second system).

As with rehearsals, we gave some thought to the duration of a performance. An IMS may demonstrate interesting behavior in a relatively short time window, but for various reasons fail to sustain interest over a long performance; this might be an important consideration when comparing two IMSs.

To capture qualitative data, we plan to collect feedback from audience members at a performance; recordings of the performance will also be made available afterwards, and interested listeners will be encouraged to provide feedback. The issue of listener preferences for different musical genres is one we would like to avoid, at least for the time being; hence, we focus our data collection on audience members who are already experienced listeners of free improvisation or abstract electroacoustic music. In the listener's questionnaire, we ask audience members to rate their previous listening experience.

During the entire test, we document overall testing parameters such as the testing environment, the duration of each section, etc. We also make an audio recording for future reference; this might be made available after the performance for further listening tests.
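
To keep the bookkeeping straight across systems and sections, the full protocol of Sections 3.1 and 3.2 can be written out explicitly; the following sketch enumerates the sessions and questionnaires for a two-system comparison (the structure and names are ours, not a published specification):

```python
# Illustrative enumeration of the evaluation protocol described in
# Sections 3.1-3.2; structure and names are ours, purely a sketch.
from dataclasses import dataclass

@dataclass
class Session:
    system: str           # which IMS is playing, identified only by order
    kind: str             # type of session
    questionnaires: list  # who fills out forms after this session

def protocol(systems=("first system", "second system")):
    plan = []
    for s in systems:  # rehearsal phase, repeated per IMS
        plan.append(Session(s, "naive rehearsal", ["musician"]))
        plan.append(Session(s, "informed rehearsal", ["musician"]))
    plan.append(Session("both", "differential comparison", ["musician"]))
    for s in systems:  # performance phase, one set per IMS
        plan.append(Session(s, "performance set", ["musician", "audience"]))
    return plan

for step in protocol():
    print(step)
```
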
4. Questionnaire Design

In developing the questionnaires for the participating musicians and audience members, we needed to ensure that the results were more than, as Ariza so rightly calls them, musical judgments [3], or subjective statements. We decided that one way to circumvent this was to make the tests differential, testing at most two systems against each other, thus at least narrowing the subjective domain to the performance of the two systems. Since the test is then a comparison of two systems, it is similar in some ways to the musical Turing tests discussed in [3]. However, we wanted more information from the subjects than the binary answer that such a test provides: information that could help the developer understand how the modules developed are being perceived by the musician and audience.

To move beyond Ariza's high-level musical judgments, we need to identify and isolate relatively concrete behavioral components for each system. These components will of course vary from system to system. Our two primary systems under test are the London system and the ARHS system (see [1] for details); both try to emulate the mechanics of free improvisation. The ARHS system contains much of the sensing ability of the London system, plus enhanced modules that enable it to respond to sudden short-term changes in the musician's performance, and to adaptively discover new combinations of musical materials during a more extended period of operation. Hence, we designed questions that might help identify these enhancements and be used to distinguish between the two systems; questionnaire respondents were asked to rate statements such as:

The system was responsive to short-term changes in [the musician's] performance.

The system facilitated discovery of new musical combinations.

Questionnaires for both musicians and audience members contain the above statements. Other statements addressed more general, high-level impressions. For example, the musician's questionnaire contained these statements:

The reactions of the system were predictable.

I would perform with this system again.

The latter statement is modified for the audience questionnaire to:

I would attend a performance with this system again.

To limit the range of answers, we decided to use an approach similar to that taken by the European Broadcasting Union in their subjective evaluation of the quality of sound programme material [11]. We used a modified Likert scale, with 1 being "strongly agree" and 5 being "strongly disagree"; there is also the possibility of an N/A (no answer) for each statement. As mentioned, we encourage, and provide space for, comments on each statement, to capture qualitative data.
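
As a minimal sketch of how responses on this scale might be tallied per statement, treating N/A as missing data (illustrative code and data, not the analysis used in our study):

```python
# Summarize modified-Likert responses (1 = strongly agree ...
# 5 = strongly disagree; None = N/A). Illustrative data and code.
import statistics

responses = {
    "The system was responsive to short-term changes.": [1, 2, 2, None, 3],
    "I would attend a performance with this system again.": [2, 2, 1, 4, None],
}

for statement, scores in responses.items():
    valid = [s for s in scores if s is not None]  # drop N/A answers
    print(statement)
    print(f"  n={len(valid)}, median={statistics.median(valid)}, "
          f"mean={statistics.fmean(valid):.2f}")
```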

The musician fills out questionnaires at various points in rehearsal and performance, as described in Section 3. After a performance, audience members are encouraged to fill out questionnaires, made available to them at the performance venue. To avoid ethical issues such as privacy or coercion, we emphasize that responding to a questionnaire is entirely voluntary, and that each response is anonymous and may be used for purposes of research.

During development, the methodology and questionnaires have undergone many revisions. The current version of the questionnaire may be found at http://userwww.sfsu.edu/~whsu/imshci.

5. Recent Experiences and Future Work

So far we have focused primarily on evaluation tests and questionnaires from the musician's point of view. We have worked closely with musicians, primarily the free-improvising saxophonists John Butcher and James Fei, in the development and testing of the London and ARHS systems. Rehearsals with Butcher took place in June 2008 at Myles Boisen's Guerilla Recording studio (Oakland, CA), followed by a performance at CNMAT (Berkeley, CA) two days later. Rehearsals with Fei took place in December 2008 at Harvestworks (New York).

We initially expected the feedback from Butcher and Fei to be fairly clear-cut, i.e., clearly preferring the more developed ARHS system. After all, the ARHS system is functionally more complex than the London system, and has enhancements specifically targeting the London system's shortcomings. At ICMC 2008, during the presentation of [1], we had played short audio clips (about two minutes each) of Butcher working with each system; informal audience feedback afterwards indicated that the ARHS system was preferred. (However, we clearly identified which system was involved in each clip; also, each clip was chosen to highlight the capabilities of each system.) Hence, we were very surprised by the feedback from Butcher and Fei after our more formalized tests.

As discussed in Section 3.1, with each IMS we had Butcher and Fei start with a short naïve rehearsal, in which they were given almost no information about the system being tested. We felt that this captured a common situation in free improvisation, where two (or more) improvisers meet and perform for the first time, without prior discussion of the performance. We had hoped that after levels were set, the musician would simply start improvising with the IMS; through performance, s/he would discover how each system worked, and possibly identify the differences between systems. The musician would fill out a questionnaire after the naïve rehearsal, to document her/his experience in the discovery process.

Butcher and Fei both found it difficult to identify differences between the London and ARHS systems in the naïve rehearsal. In fact, both felt that, in the short initial rehearsal, it was easier in some respects to work with the simpler London system, with its phrase-oriented playing. The more complex ARHS system is sensitive to short-term performance changes; it seemed to encourage both musicians to play with rapid transitions and more choppy material.
This change in the musicians' performance in turn causes the ARHS system to make frequent gestural and timbral adjustments, resulting in a dynamic feedback loop. It is not clear why the slowly developing playing of the London system was preferred in the short naïve rehearsal. Butcher did agree that the simpler London system felt predictable in an extended session, which was not surprising.

In interviews following the rehearsals, both musicians indicated that direction from the programmer would be useful in setting a context for performance. Fei pointed out that a performer would normally have at least a vague idea of the musical context in which improvisation would be occurring; for example, an improvising saxophone player would work differently with a loud free jazz rhythm section than with quieter acoustic instruments. Butcher also suggested that the musician be asked to play with each system using several different approaches, for example as a soloist, then as a duo partner, etc.

In this light, the information obtained from naïve discovery seems of limited value, and we plan to drop the initial naïve rehearsals in the future. Instead, the researcher will start by giving the musician an overview of the system being tested. The researcher then suggests a musical context or progression of gestural and material choices, such as "play long tones for about a minute, followed by short gestures with rapid timbral variations", to elicit specific behavioral responses from the IMS. The musician then starts the initial rehearsal section according to these suggestions. A second, free rehearsal section, with no restrictions or pre-arranged material choices, follows.

The development of this evaluation methodology is an ongoing process. We look forward to future testing with larger audiences and a wider variety of musicians. As mentioned, the most recent version of the questionnaire is available online. We are also working on making recordings available online, and on implementing an automated system for collecting feedback from listeners. We look forward to and encourage input from the community.

6. Acknowledgements

We especially wish to thank John Butcher and James Fei for their ongoing patience and help in developing this methodology, and CNMAT and Harvestworks for their support of our work.

References

[1] W. Hsu, "Two Approaches for Interaction Management in Timbre-aware Improvisation Systems," in Proceedings of the International Computer Music Conference (ICMC), Belfast, UK, 2008.

[2] W. Hsu, "Managing Gesture and Timbre for Analysis and Instrument Control in an Interactive Environment," in Proceedings of the International Conference on New Interfaces for Musical Expression (NIME), Paris, France, 2006.

[3] C. Ariza, "The Interrogator as Critic: The Questionable Relevance of Turing Tests and Aesthetic Tests in the Evaluation of Generative Music Systems," Computer Music Journal, vol. 33, no. 1, 2009, pp. 1-23.

[4] N. Collins and J. d'Escrivan, Eds., The Cambridge Companion to Electronic Music. Cambridge, UK: Cambridge University Press, 2007, pp. 171-184.

[5] G. Lewis, "Too Many Notes: Computers, Complexity and Culture in Voyager," Leonardo Music Journal, vol. 10, 2000.

[6] J. Freeman and M. Godfrey, "Technology, Real-time Notation, and Audience Participation in Flock," in Proceedings of the ICMC, Belfast, UK, 2008.

[7] M. Unehara and T. Onisawa, "Music Composition System Based on Subjective Evaluation," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 2003, pp. 980-986.

[8] M. Wanderley and N. Orio, "Evaluation of Input Devices for Musical Expression: Borrowing Tools from HCI," Computer Music Journal, vol. 26, 2002, pp. 62-76.

[9] T. Nakano, M. Goto, and Y. Hiraga, "Subjective Evaluation of Common Singing Skills Using the Rank Ordering Method," in Proceedings of the 9th International Conference on Music Perception and Cognition, Bologna, 2006.

[10] H. Sharp, Y. Rogers, and J. Preece, Interaction Design. New York: Wiley, 2007.

[11] Tech 3286: "Assessment Methods for the Subjective Evaluation of the Quality of Sound Programme Material - Music," P. Laven, Ed., European Broadcasting Union, 1997. [Online; accessed 10 Jan 2009]. Available: http://www.ebu.ch/cmsimages/fr/tec_doc_t3286_tcm7-10487.pdf