
The Fourth IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics, Roma, Italy, June 24-27, 2012

Application of a Musical-based Interaction System to the Waseda Flutist Robot WF-4RIV: Development Results and Performance Experiments

Klaus Petersen (IEEE Member), Jorge Solis (IEEE Member) and Atsuo Takanishi (IEEE Member)

Abstract - During several years of development, the hardware of the anthropomorphic flutist robot Waseda Flutist WF-4RIV has been continuously improved. The robot is currently able to play the flute at the level of an intermediate human player. Lately we have focused our research on the interactivity of the robot's performance. Initially the robot was only able to play a static performance that could not be actively controlled by a partner musician. In a realistic performance set-up, in a band or an orchestra, musicians need to interact in order to create a performance that gives a natural and dynamic impression to the audience. In this publication we present the latest developments on the integration of a Musical-based Interaction System (MbIS) with the WF-4RIV. This human-robot interaction system is intended to allow human musicians to communicate musically with the flutist robot in a natural way through audio-visual cues. We summarize our previous results, present the latest extensions to the system and concentrate in particular on experimental applications of the system. We evaluate our interactive performance system using three different methods: a comparison of a passive (non-interactive) and an interactive performance, an evaluation of the technical functionality of the interaction system as a whole, and an examination of the MbIS from a user perspective through a survey including amateur and professional musicians. We present experimental results which show that our Musical-based Interaction System extends the anthropomorphic design of the flutist robot to allow increasingly interactive, natural musical performances with human musicians.

I. INTRODUCTION

A. Research Objective

One type of robot that engages in rich communication with humans is the musical performance robot. Anthropomorphic musical performance robots have the ability to mechanically emulate the human way of playing a musical instrument. These technically very complex robots reach high performance levels that are comparable to the skill level of professional human musicians. A feature that they still lack in most cases, however, is the ability to interact with other musicians. Playing a fixed, invariable sequence, they may be able to perform together with other players, but as soon as there is a spontaneous variation in the musical performance, the human and robot performances become desynchronized.

This work has been kindly supported by the GCOE Global Robot Academia program of Waseda University. Klaus Petersen is with the Graduate School of Advanced Science and Engineering, Waseda University, 3-4-1 Ookubo, Shinjuku-ku, Tokyo 169-8555, Japan (phone: +81-3-5286-3257; fax: +81-3-5273-2209; e-mail: klaus@aoni.waseda.jp). Jorge Solis and Atsuo Takanishi are with the Department of Mechanical Engineering, Waseda University. Atsuo Takanishi is one of the core members of the Humanoid Robotics Institute, Waseda University (e-mail: takanisi@waseda.jp).

The Waseda Flutist Robot WF-4RIV has been developed over several years. With its bio-inspired design it emulates the functionality of the human organs involved in playing the flute. In its most recent version the flutist robot is able to play the flute at the level of an intermediate human flute player [1]. Regarding the mechanical development of the Waseda flutist robot, our research purpose is to understand more about the dexterity and motor control abilities that are necessary for humans to play the flute [2].

In recent research efforts we have tried to integrate the flutist robot with a human band. We intend to give the robot the interactive capabilities to actively play together with human musicians or other musical performance robots. By doing so we would like to develop new means of musical expression and also gain a deeper understanding of the process of communication that takes place between human musicians. In previous work we have introduced a so-called Musical-based Interaction System (MbIS). This system contains several modules for audio-visual sensor processing and for mapping the results of the sensor processing to the musical performance parameters of the robot [3], [4]. Using this system we have enabled a human musician to control the performance of the flutist robot in a natural way. Generally, the MbIS serves two main purposes: first, to reduce the complexity of using the flutist robot by providing easily usable controllers and by involving feedback from the mechanical state of the robot in the interaction process; second, to give the human musician a high degree of flexibility through teach-in capabilities, allowing him to freely determine the connection between a melody pattern and an instrument gesture [5].

A variety of work on musical performance robots has been published. This includes the MuBot string instrument robot series developed at the University of Electro-Communications [6]. Regarding interaction with musical performance robots, in [7] the drumming robot Haile was introduced, which is able to play an improvised performance together with other musicians using acoustic rhythmical cues. This research has been extended with the development of the robotic marimba player Shimon [8], which additionally has the ability to analyze and react to the harmonic characteristics of the music performance of its human partner players. In [9] an approach is presented to control the tempo of the accompaniment of an anthropomorphic robot playing a theremin using audio-visual cues. Although the techniques for detecting audio-visual cues and the mapping methods used are similar, the referenced musical robot systems follow a slightly different research approach from ours: they focus on developing efficient ways to assist human music production, whereas our research concentrates on the anthropomorphic reproduction of sound production and on the development of human-like interactive behavior in the robot.

Fig. 1. Diagram of the proposed Musical-based Interaction System (MbIS) implemented in the Waseda Flutist Robot WF-4RIV. The system captures the performance actions of the human musician and, after the experience level selection stage, maps the processed sensor information into musical performance parameters for the robot. The robot's performance provides musical feedback to the human musician.

An important point that seems to be missing in the majority of the previously published work is the evaluation of the proposed interaction system. In some cases the system is used in an on-stage environment to demonstrate its suitability for a real performance; in other cases user-survey results or technical measurements are provided. From an engineering point of view, a more detailed evaluation would be desirable: of the performance improvements that the introduced interaction systems achieve from a listener perspective, of how users judge the usability of such systems, and of the technical functionality itself. Therefore, in this paper we concentrate especially on the evaluation of our interaction system. In the first part of this evaluation, we perform a comparative analysis to determine, by means of a listener survey, the different characteristics of the passive and the active performance system. This is followed by a technical system evaluation based on interaction experiments with an intermediate-level musician. In the third part of the evaluation, we look more closely at the experiments from the non-technical user perspective and perform a user survey to characterize the practical usability of the system.

B. Implementation Concept

We propose the Musical-based Interaction System (MbIS) to allow interaction between the flutist robot and musicians based on two levels of interaction (Figure 1): the basic interaction level and the extended interaction level. The purpose of the two-level design is to make the system usable for people with different levels of experience in human-robot interaction.

In the basic interaction level we focus on enabling a user who does not have much experience in communicating with the robot to understand the device's physical limitations. We use a simple visual controller with a fixed correlation to the performance parameter of the robot that it modulates, in order to make this level suitable for beginner players. The WF-4RIV is built with the intention of emulating the parts of the human body that are necessary to play the flute. It therefore has artificial lungs with a limited volume. Other sound modulation parameters, such as the vibrato frequency (generated by an artificial vocal cord), also have a certain dynamic range in which they operate. To account for these characteristics, the user's input to the robot via the sensor system has to be modified so that it does not violate the physical limits of the robot.

With the extended level interaction interface, our goal is to give the user the possibility to interact with the robot more freely (compared to the basic level of interaction). To achieve this, we propose a simplified learning (teach-in) system that allows the user to link instrument gestures with musical patterns. Here, the correlation of sensor input to sensor output is not fixed. Furthermore, we allow for more degrees of freedom in the instrument movements of the user. As a result this level is more suitable for advanced players. We use a particle filter-based instrument gesture detection system and histogram-based melody detection algorithms. In a teaching phase the musician can therefore assign instrument gestures to certain melody patterns; in a performance phase the robot replays these melodies according to the taught-in information, as sketched in the example below.
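
To make the extended-level mapping concrete, the following minimal Python sketch (not taken from the robot's actual software) illustrates how a teach-in table could associate quantized instrument inclination states with melody patterns and recall them in the performance phase. The angle bins, class and function names, and pattern labels are illustrative assumptions; the real system uses particle filter-based gesture tracking and histogram-based melody detection rather than the simplified inputs shown here.

```python
# Hedged sketch of the extended-level teach-in idea: instrument inclination
# angles are quantized into discrete states, each state is associated with a
# melody pattern during teach-in, and the stored pattern is recalled in the
# performance phase. All names and bin boundaries are illustrative assumptions.

ANGLE_BINS = {          # hypothetical state boundaries in degrees
    "I":   (110.0, 180.0),
    "II":  (90.0, 110.0),
    "III": (0.0, 90.0),
}

def angle_to_state(angle_deg: float) -> str | None:
    """Quantize a detected instrument inclination angle into a state label."""
    for state, (lo, hi) in ANGLE_BINS.items():
        if lo <= angle_deg < hi:
            return state
    return None

class TeachInTable:
    """Associates instrument states with melody patterns (the 'robot state table')."""
    def __init__(self, patterns_needed: int = 3):
        self.table: dict[str, str] = {}
        self.patterns_needed = patterns_needed

    def teach(self, angle_deg: float, pattern: str) -> None:
        state = angle_to_state(angle_deg)
        if state is not None:
            self.table[state] = pattern   # overwrite if re-taught

    def ready_for_performance(self) -> bool:
        # The paper switches phases once the required number of patterns is stored.
        return len(self.table) >= self.patterns_needed

    def recall(self, angle_deg: float) -> str | None:
        """Performance phase: return the pattern associated with the current angle."""
        return self.table.get(angle_to_state(angle_deg))

# Example: teach three patterns, then recall one by instrument orientation.
mbis = TeachInTable()
mbis.teach(125.0, "pattern_A")
mbis.teach(100.0, "pattern_B")
mbis.teach(75.0, "pattern_C")
assert mbis.ready_for_performance()
print(mbis.recall(150.0))   # -> "pattern_A"
```
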
II. COMPARATIVE EVALUATION OF PASSIVE AND INTERACTIVE MUSICAL PERFORMANCE

We performed a qualitative validation of the distinction between the passive and the active performance by means of a user survey. The survey was done using the same musical material as in the performance experiments described below. As experimental subjects, we chose 15 amateur musicians and 2 professional musicians. While we had access to a relatively high number of amateur musician subjects within our university, the number of available professional musicians was low. We developed a questionnaire consisting of 7 adjective pairs to characterize a musical performance by the flutist robot. The questionnaire was developed according to a concept proposed in [10]. The purpose of the survey was to determine the impression that the two performance modes make on the listener. For each adjective pair in the questionnaire, the survey subject was asked to express his impression of the performance on a 5-point Likert-type scale. Applied to the adjective pair interesting / boring, a score of 1 would correspond to a very boring performance and a 5 to a very interesting one. If a listener was undecided between the two adjectives, he could choose a 3 to emphasize neither of them.
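
As a hedged illustration of how such Likert-scale comparisons can be tested for significance (this is not the authors' analysis code, and the ratings below are made-up placeholders rather than the survey data), the following SciPy snippet shows one possible computation. A paired t-test is assumed here because the same subjects rated both performance modes; the paper does not specify whether a paired or independent test was used.

```python
# Illustrative only: comparing Likert ratings of the passive and active
# performance with a t-test. The scores are hypothetical placeholders.
from scipy import stats

# 5-point Likert ratings for the adjective pair "interesting / boring"
passive_scores = [2, 3, 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 2, 3, 2]   # hypothetical
active_scores  = [4, 4, 5, 4, 3, 4, 5, 4, 4, 3, 4, 5, 4, 4, 4]   # hypothetical

t_stat, p_value = stats.ttest_rel(active_scores, passive_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference between active and passive performance is significant.")
```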

The results of the survey for all adjective pairs are shown in Fig. 2. Especially for the 15 amateur musician subjects, the survey shows promising results. The active performance scored significantly higher (t-test, p < 0.05) for the classifications interesting, varied, natural and emotional. We have already shown with the performance index that the active performance bears a stronger correlation between visible actions by the musician and the musical performance output. This creates the impression of a more interesting and varied performance for the listener. Considering that the active performance gives a more natural impression to both the listener and the performing musician, the results for the adjective pairs natural / artificial and emotional / rational can be explained. The active performance was attributed a higher score for naturalness and emotionality than the passive performance. The additional physical movement, resulting in stronger synchronicity between the two performers, shows more human-like features that a static performance without further exchange of information might not display. This leads the listener / viewer to the conclusion that the active performance is more natural (human-like) and conveys more emotional content than the passive performance.

Fig. 2. The two graphs show the results of the listener survey comparing the active and the passive performance. In a) the averaged questionnaire scores given by the amateur musicians are shown; b) shows the survey results for the professional musicians. Filled rectangles point to an adjective category for which there is a significant difference between the results for the passive and the active performance. Red boxes show the scores for the passive performance and blue boxes the results for the active performance.

III. TECHNICAL EVALUATION OF THE MBIS

A. Basic Level of Interaction

Fig. 3. The figure displays a basic interaction level setup. On the left-hand side the flutist robot WF-4RIV is displayed. The right-hand side shows the robot's view of the interacting musician and the virtual fader that the player manipulates. The resulting data and the fill status of the lung are shown in the graph below.

To demonstrate the technical functionality of the basic interaction system, we asked an intermediate-level saxophone player to improvise over a repetitive musical pattern from the theme of the jazz standard piece The Autumn Leaves. By moving his instrument, the musician was able to adjust the tempo of the sequence performed by the flutist robot. While the musician controlled the performance of the robot, his input was modulated by the physical state of the robot. The relevant state parameter could be deliberately selected by the user; in this case we chose the state of the robot's lung to modulate the values transmitted from a visual controller (Figure 3). This fader was used to continuously control the speed of a pre-defined melody sequence. The speed of the performed pattern was continuously reduced when the lung reached a fill level of 80%. To perform the experiment, the saxophone player stood in front of the robot (within the viewing angle of the robot's cameras). After introducing the functionality of the basic level interaction system to the player, we recorded the sound output of the robot, its lung fill level, the virtual fader level and the modulated virtual fader level during the resulting interaction between robot and musician. A graph of the result of the experiment is shown in Fig. 4.

Fig. 4. In the beginner level interaction system, the user controls the tempo of a pattern performed by the robot. The lung fill level, plotted in the top graph, modulates the input data from the virtual fader, resulting in the robot performance displayed by the pitch and amplitude curves.
Before the lung reached a fill level of 80%, the performance tempo of the robot was controlled by the unmodulated fader level. With a fill level above 80%, the fader value actually transmitted to the robot (the modulated fader value) was faded out before the lung was completely empty. This adjustment can be observed at 17.5 s to 22.5 s in the fader value plot, the modulated fader value plot and the robot output volume plot. As the fader value was faded out rapidly, the resulting performance tempo of the robot decreased from fast (160 bpm) to slow (70 bpm). This variation can be seen in the robot performance pitch plot. At 23.5 s and 24 s the robot refilled its lungs for a duration of approximately 0.5 s. These breathing points have a time distance of approximately 40 s (one breathing cycle is visible in the displayed graph). During the breathing points no sound was produced by the robot.
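
The following minimal sketch (not the robot's actual control code) summarizes the basic-level mapping described above: the virtual fader value is attenuated once the lung fill level crosses the 80% threshold and is then mapped to a performance tempo. It assumes that the fill level represents the fraction of lung capacity already used (0.0 = full lungs, 1.0 = empty), that the attenuation ramps linearly to zero, and that the tempo range 70-160 bpm matches the values reported in the experiment; all function names are hypothetical.

```python
# Hedged sketch of the basic-level fader modulation and tempo mapping.
LUNG_THRESHOLD = 0.80   # fade-out starts here, as in the experiment
TEMPO_MIN_BPM = 70.0
TEMPO_MAX_BPM = 160.0

def modulate_fader(fader_value: float, lung_fill: float) -> float:
    """Attenuate the raw fader value (0..1) according to the lung state."""
    if lung_fill <= LUNG_THRESHOLD:
        return fader_value                      # below threshold: pass through
    # Above the threshold, ramp the gain linearly down to zero before the lung
    # is completely exhausted, so the robot never runs out of air mid-note.
    gain = max(0.0, 1.0 - (lung_fill - LUNG_THRESHOLD) / (1.0 - LUNG_THRESHOLD))
    return fader_value * gain

def fader_to_tempo(modulated_fader: float) -> float:
    """Map the modulated fader value (0..1) linearly onto the tempo range."""
    return TEMPO_MIN_BPM + modulated_fader * (TEMPO_MAX_BPM - TEMPO_MIN_BPM)

# Example: the same fader position yields a slower tempo as the lung empties.
for lung_fill in (0.5, 0.85, 0.95):
    tempo = fader_to_tempo(modulate_fader(1.0, lung_fill))
    print(f"lung fill {lung_fill:.0%} -> {tempo:.0f} bpm")
```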

Fig. 5. In this screenshot an interactive performance using the extended level of interaction is displayed. In the left part, the flutist robot WF-4RIV is shown. The image on the right-hand side displays the interaction partner as seen by the robot. The interaction partner can select melody patterns or performance states (state 1, 2, 3) by changing the orientation of his instrument.

Fig. 6. In the extended level interaction system's teach-in phase the user associates instrument motion with melody patterns. A melody pattern m performed by the musician is repeated by the robot for confirmation r. The robot state table shows the association that is being set up by the teach-in system.

B. Extended Interaction Level

Similar to the technical evaluation of the basic level interaction system, we concentrate on a proof-of-concept demonstration of the functionality of the extended level interaction system, rather than calculating numerical error values for the separate system components. In the experimental setup, an intermediate-level (in terms of instrument skill) saxophone player controlled the robot for an improvisation on the theme of the same song chosen for the previous survey experiment, the jazz standard The Autumn Leaves. The musician controlled the performance parameters of the robot using the mapping module of the extended level interaction system.

The experiment had two phases, the teaching phase and the performance phase. In the first phase the interacting musician taught a movement-to-performance-parameter relationship to the robot. In this particular case we related one of three melody patterns to the inclination angle of the instrument of the robot's partner musician. From this information the robot built a state-space table that relates instrument angles to musical patterns. In the second phase (Figure 5) the interaction partner controlled the robot with these movements: when a certain instrument state is detected, the robot plays the musical pattern that relates to the current instrument angle. The transition from the teaching phase to the performance phase is defined by the number of melody patterns associated by the robot; in this experiment, the switch occurred after 3 melody patterns had been recorded. The experiment was performed by the intermediate-level player. One teaching phase and one performance phase were carried out for each experiment run.

The recorded data for the teaching phase is displayed in Fig. 6. In the first part (from T = 0 s), the instrument player moved his instrument to an angle of approximately 125° (state I) and played melody pattern A. The flutist robot confirmed the detection of the pattern by repeating the melody. This is displayed in the robot performance pitch graph and marked with m (musician) and r (robot). The association of pattern A and instrument state I was written to the robot state table.
At T = 18 s the player changed his instrument position to approximately 100° (state II) and played the next melody pattern, which was recognized and confirmed as melody pattern B. The association of state II and pattern B was memorized in the robot state table. Finally, at T = 22 s, the instrumentalist moved his instrument to state III (approximately 75°) and played melody pattern C. The association of instrument state III and melody pattern C was saved in the association table.
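
The paper states only that the melody detection used during this teach-in confirmation is histogram based. As a hedged illustration of what such a matcher could look like (a 12-bin pitch-class histogram compared by L1 distance is an assumption for this sketch, not the authors' algorithm, and the melodies are placeholders), consider the following:

```python
# Hedged sketch of a histogram-based melody matcher.
from collections import Counter

def pitch_class_histogram(midi_notes: list[int]) -> list[float]:
    """Build a normalized 12-bin pitch-class histogram from detected notes."""
    counts = Counter(note % 12 for note in midi_notes)
    total = sum(counts.values()) or 1
    return [counts.get(pc, 0) / total for pc in range(12)]

def match_pattern(detected_notes: list[int],
                  reference_patterns: dict[str, list[int]]) -> str:
    """Return the stored pattern whose histogram is closest to the detected notes."""
    detected_hist = pitch_class_histogram(detected_notes)
    def l1_distance(name: str) -> float:
        ref_hist = pitch_class_histogram(reference_patterns[name])
        return sum(abs(a - b) for a, b in zip(detected_hist, ref_hist))
    return min(reference_patterns, key=l1_distance)

# Example with hypothetical patterns (MIDI note numbers), not from the paper:
patterns = {
    "pattern_A": [67, 69, 71, 72, 74],
    "pattern_B": [64, 65, 67, 69, 67],
    "pattern_C": [60, 62, 64, 65, 64],
}
played = [67, 69, 71, 72, 74, 72]         # a noisy rendition of pattern A
print(match_pattern(played, patterns))    # -> "pattern_A"
```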

Fig. 7. In the extended level interaction system's performance phase the user controls the robot's output tone by changing the orientation of his instrument. The graph shows the detected instrument orientation, the associated musical pattern and the output of the robot.

The results for the performance phase of the extended level interaction experiment are shown in Fig. 7. In the teaching phase the musician associated three melody patterns A, B and C with instrument states I, II and III. In the performance phase he recalled the melody patterns in order to build an accompaniment for an improvisation. In the graph, each time the musician shifted his instrument to a new angle (instrument orientation graph), the detected instrument state changed. As a result of this change the robot played the answer melody that was associated in the teaching phase. This happens several times in the displayed graph. At 15 s the musician moved his instrument to an angle of 150° (state I) and the robot immediately played the associated answer melody (pattern A). At 20 s he shifted the instrument to 100° and triggered melody pattern B. When he moved the instrument to 50° at 23 s, the robot answered with melody pattern C. It remains to note that, after one pattern had been performed, the robot automatically reset its lung until the next pattern was commanded. These short breathing spots can be seen throughout the robot performance volume plot, notably at t = 5 s, t = 9 s and t = 13 s.
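
The control flow just described (detected orientation mapped to a state, an answer melody triggered only when the state changes, followed by a short breath) can be summarized in a small, hypothetical runtime loop built around a table like the teach-in sketch shown earlier. The function names and the polling rate are assumptions for illustration, not taken from the robot's software.

```python
# Hedged sketch of the extended-level performance-phase control flow.
import time

def performance_loop(mbis_table, get_instrument_angle, play_pattern, refill_lung):
    """mbis_table: object with a recall(angle) -> pattern-or-None method."""
    last_pattern = None
    while True:
        angle = get_instrument_angle()          # from the vision-based tracker
        pattern = mbis_table.recall(angle)
        if pattern is not None and pattern != last_pattern:
            play_pattern(pattern)               # robot performs the answer melody
            refill_lung()                       # short breathing pause (~0.5 s)
            last_pattern = pattern
        time.sleep(0.05)                        # ~20 Hz polling, arbitrary choice
```
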
The proposed technical evaluation experiments cover only two cases of a musician-robot interaction configuration. These configurations were suggested as a conceptual test of the basic and extended level interaction systems by the professional musician with whom we worked when planning the presented experiments. Experiments have also only been performed for relatively simple types of user input; in a realistic performance, more extreme movements and musical expression than proposed here might occur. As a result of the presented experiments, the basic interaction level can be characterized as functional from the technical point of view for a certain performance scenario. Absolute certainty that the system works with every possible input was not achieved; the system has been evaluated on a case-by-case basis, and we tried to choose the situation proposed here as an example of a typical scenario. To what extent this also makes sense from the point of view of a real user is evaluated in the next section.

IV. SURVEY EVALUATION FROM THE INTERACTING USER PERSPECTIVE

The goal of the development of the Musical-based Interaction System is to provide the user with an intuitive, natural interface for interacting with a musical robot. To find out about the acceptance of the system and its general usability, the system needs to be tested with a variety of users, the experience and comments of the users need to be recorded, and these results need to be analyzed. Users were asked to perform a musical performance with the flutist robot, first using the basic level interaction system and then using the extended level interaction system. We asked the users to fill out a questionnaire for each of these performances to characterize their experience with the system. This experimental method was applied to professional as well as amateur musicians and the results were statistically analyzed. With the results we try to show that the system provides a natural user experience to users of different experience levels.

In the survey experiment we again asked 15 male amateur musicians and 2 male professional musicians to use the basic and the extended interaction system with the WF-4RIV. Regarding the low number of professional musician subjects, the same limitation as described for the previous survey applies. For each interaction level one questionnaire had to be filled out. Re-trials of the interaction system experiment runs, if requested by the user, were allowed. Each questionnaire consisted of 8 pairs of adjectives, similar to the approach proposed in [10]. As a scoring system, similarly to the previously described survey, we used a 5-point Likert-type scale. Applied to the adjective pair natural / artificial, a score of 1 would correspond to a very natural, human-like interaction and a score of 5 to a very machine-like, static one.

Fig. 8. This figure shows the results of the user survey for the basic interaction system and the extended interaction system. In a) the averaged questionnaire scores given by the amateur musicians are shown; b) shows the survey results for the professional musicians. Filled rectangles in the graph point to an adjective category for which there is a significant difference between the results for the basic and extended level system. Red boxes show the scores for the basic interaction level and blue boxes the results for the extended interaction level.

The results of the user survey for the basic and extended level interaction systems are shown in Fig. 8. The 15 amateur musicians were asked to characterize the differences between using the basic level of interaction and the extended level of interaction. The subjects were the same as in the previous survey experiment, and this survey was conducted after the previously described experiment. The users were asked to attribute 8 pairs of adjectives to the two levels after interacting with the robot using each of these levels.

The result of the survey shows that for the adjective pairs natural / artificial, free movement / constrained, emotional / rational, expressive / unexpressive and easy / difficult, there is a significant difference (t-test, p < 0.05) between the basic and extended interaction levels. A Student's t-distribution is assumed for the survey results, and therefore the t-test was chosen to determine statistical significance. For the pair natural / artificial, the amateur users on average gave a score of 1.7 to the basic and a score of 4 to the extended interaction level. A similar result was obtained for the adjective pair free movement / constrained, with a score of 2 for the basic interaction level and 4.5 for the extended interaction level. For the adjective pair emotional / rational, the basic level scored 2 and the extended level 4. For the adjective pair easy / difficult, the basic interaction level was attributed a higher score of 3.8 than the extended interaction level with 2.

Furthermore, we asked the 2 professional musicians to use the basic and extended interaction levels and to characterize their impression with the previously described adjective pairs. The outcome of the experiment is similar to the results for the amateur musicians. For the adjective pair natural / artificial, the basic interaction level scored 2, whereas the extended interaction level achieved 4. For the adjective pair free movement / constrained, a score of 2.2 was attributed to the basic level of interaction and a score of 4.5 to the extended interaction level. For the adjective pair emotional / rational, the professional musicians on average evaluated the basic level with a score of 2.5 and the extended level with a score of 4. Scores of 4.5 for the basic and 2.5 for the extended interaction system were attributed for the adjective pair easy / difficult. As the average impression of the amateur musician survey subjects as well as the professional musician subjects, the extended interaction level was evaluated to be more natural and more emotional in its usage. This might be related to the additional freedom of expression given by the teach-in system and the use of the particle filter-based tracking.

V. CONCLUSIONS AND FUTURE WORK

In previous publications we have introduced a Musical-based Interaction System (MbIS) for the Waseda flutist robot WF-4RIV. So far we have had only a preliminary evaluation of the technical and usability characteristics of the system. We also considered that the evaluation strategies introduced in other published work on musical performance robot interaction systems left room for improvement. Therefore, in this paper we evaluated the interaction system in three different categories. First, the difference between a passive performance and an active performance was analyzed; through a user study we concluded that, to a certain degree, the active performance is more similar to a performance between humans than a passive performance between robot and human. Second, we presented experimental results to demonstrate the technical functionality of the basic and extended levels of interaction. Although these results did not cover all possible cases of usage of the system, we further evaluated the system in the user survey.
The results of the user survey show that the interaction system levels are characterized differently by the amateur and professional musician users. The basic level interaction system, on the one hand, is evaluated to be more constrained and to give a generally more artificial feel, but it is easy to use. The extended level interaction system, on the other hand, is more complicated in its usage, but due to its greater flexibility it leads to more natural and expressive performances. In future work, the evaluation of the system is to be continued, with a focus on more in-depth analysis of the results, for example by using a different method for determining the statistical significance of the survey results instead of the t-test used here. In the surveys presented in this paper, the number of professional musician subjects in particular was very low; we intend to perform further surveys with a larger number of professional subjects.

REFERENCES

[1] J. Solis, K. Chida, K. Suefuji, K. Taniguchi, S. Hashimoto, and A. Takanishi, "The Waseda flutist robot WF-4RII in comparison with a professional flutist," Computer Music Journal, vol. 30, pp. 127-151, 2006.
[2] J. Solis and A. Takanishi, "The Waseda flutist robot No. 4 Refined IV: Enhancing the sound clarity and the articulation between notes by improving the design of the lips and tonguing mechanisms," in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2007, pp. 2041-2046.
[3] K. Petersen, J. Solis, and A. Takanishi, "Toward enabling a natural interaction between human musicians and musical performance robots: Implementation of a real-time gestural interface," in Proc. 17th IEEE Int. Symp. on Robot and Human Interactive Communication (RO-MAN 2008), 2008, pp. 340-345.
[4] K. Petersen, J. Solis, and A. Takanishi, "Development of an aural real-time rhythmical and harmonic tracking to enable the musical interaction with the Waseda flutist robot," in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2009), 2009, pp. 2303-2308.
[5] K. Petersen, J. Solis, and A. Takanishi, "Implementation of a musical performance interaction system for the Waseda Flutist Robot: Combining visual and acoustic sensor input based on sequential Bayesian filtering," in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2010), 2010, pp. 2283-2288.
[6] M. Kajitani, "Development of musician robots in Japan," in Proc. Australian Conference on Robotics and Automation, 1999.
[7] G. Weinberg and S. Driscoll, "Towards robotic musicianship," Computer Music Journal, vol. 30, pp. 28-45, 2006.
[8] G. Hoffman and G. Weinberg, "Gesture-based human-robot jazz improvisation," in Proc. IEEE Int. Conf. on Robotics and Automation (ICRA 2010), 2010, pp. 582-587.
[9] A. Lim, T. Mizumoto, L. Cahier, T. Otsuka, T. Takahashi, K. Komatani, T. Ogata, and H. Okuno, "Robot musical accompaniment: Integrating audio and visual cues for real-time synchronization with a human flutist," in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2010), 2010, pp. 1964-1969.
[10] C. Bartneck, D. Kulic, and E. Croft, "Measuring the anthropomorphism, animacy, likability, perceived intelligence, and perceived safety of robots," in Proc. Workshop on Metrics for Human-Robot Interaction, 2008, pp. 37-43.